
WestminsterResearch
http://www.westminster.ac.uk/westminsterresearch

Parallelizing the Chambolle Algorithm for Performance-Optimized Mapping on FPGA Devices

Beretta, I., Rana, V., Akin, A., Nacci, A. A., Sciuto, D. and Atienza, D.

ACM, 2016. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in the ACM Transactions on Embedded Computing Systems (TECS) 15 (3) Article No. 44, 2016. http://doi.acm.org/10.1145/.

The WestminsterResearch online digital archive at the University of Westminster aims to make the research output of the University available to a wider audience. Copyright and Moral Rights remain with the authors and/or copyright owners. Whilst further distribution of specific materials from within this archive is forbidden, you may freely distribute the URL of WestminsterResearch: (http://westminsterresearch.wmin.ac.uk/). In case of abuse or copyright appearing without permission e-mail repository@westminster.ac.uk

Parallelizing the Chambolle Algorithm for Performance Optimized Mapping on FPGA Devices

IVAN BERETTA, École Polytechnique Fédérale de Lausanne
VINCENZO RANA, Politecnico di Milano
ABDULKADIR AKIN, École Polytechnique Fédérale de Lausanne
ALESSANDRO ANTONIO NACCI, Politecnico di Milano
DONATELLA SCIUTO, Politecnico di Milano
DAVID ATIENZA, École Polytechnique Fédérale de Lausanne

The performance and the efficiency of recent computing platforms have been deeply influenced by the widespread adoption of hardware accelerators, such as Graphics Processing Units (GPUs) or Field Programmable Gate Arrays (FPGAs), which are often employed to support the tasks of General Purpose Processors (GPPs). One of the main advantages of these accelerators over their sequential counterparts (GPPs) is their ability to perform massively parallel computation. However, in order to exploit this competitive edge, it is necessary to extract the parallelism from the target algorithm to be executed, which is in general a very challenging task. This concept is demonstrated, for instance, by the poor performance achieved on relevant multimedia algorithms, such as Chambolle, which is a well-known algorithm employed for optical flow estimation. The implementations of this algorithm that can be found in the state of the art are generally based on GPUs, but barely improve the performance that can be obtained with a powerful GPP. In this paper, we propose a novel approach to extract the parallelism from computation-intensive multimedia algorithms, which includes an analysis of their dependency schema and an assessment of their data reuse. We then perform a thorough analysis of the Chambolle algorithm, providing a formal proof of its inner data dependencies and locality properties. Then, we exploit the considerations drawn from this analysis by proposing an architectural template that takes advantage of the fine-grained parallelism of FPGA devices.
Moreover, since the proposed template can be instantiated with different parameters, we also propose a design metric, the expansion rate, to help the designer estimate the efficiency and performance of the different instances, making it possible to select the right one before the implementation phase. We finally show, by means of experimental results, how the proposed analysis and parallelization approach leads to the design of efficient and high-performance FPGA-based implementations that are orders of magnitude faster than the state-of-the-art ones.

Categories and Subject Descriptors: B.6.1 [Design Styles]: Parallel circuits; I.4.8 [Scene Analysis]: Motion; C.1.3 [Other Architecture Styles]: Data-flow architectures; B.8.2 [Performance Analysis and Design Aids]
General Terms: Design, Algorithms, Performance
Additional Key Words and Phrases: Chambolle, Optical flow, TV-L1, Field Programmable Gate Arrays, Parallel Architectures, Custom Hardware

ACM Transactions on Embedded Computing Systems, Vol. 15, No. 3, Article 39, Publication date: March 2016.

1. INTRODUCTION

Heterogeneous and specialized computation is forecast to grow steadily over the next years, and to establish itself as one of the main paradigms for embedded systems design [Cordes et al. 2013]. The employment of special-purpose cores to perform a complex functionality within a System-on-Chip (SoC) is motivated by higher performance and lower power consumption with respect to an equivalent execution on a general-purpose processing unit. Furthermore, in certain domains such as multimedia processing, these specialized cores perform tasks that are sufficiently general to guarantee good reusability in a wide range of systems. For example, specialized cores can be used to accelerate common operations such as convolution filters [Jamro and Wiatr 2001] or the Jacobi operator [Sleijpen and Vorst 2000].

The design of special-purpose hardware modules traditionally aims at optimizing their computational efficiency, while meeting predefined area requirements that may be imposed when the core is part of a more complex multi-core SoC. To achieve the target performance, application-specific accelerators can be implemented on different cutting-edge platforms, such as Graphics Processing Units (GPUs) or Field Programmable Gate Arrays (FPGAs). However, even though GPUs are faster than FPGAs, they have a rigid structure designed for single instruction multiple data processing, hence they are not a good choice when dealing with algorithms with very complex data dependencies among iterations [Bodily et al. 2010]. FPGAs, on the other hand, provide a fully customizable platform where any kind of custom operation, either complex or very simple, can be implemented in hardware and applied to multiple blocks of data in parallel. Unfortunately, the design of complex and custom FPGA systems is a very challenging task, and tools to guide the designer in the definition of such architectures are still not mature.

Representative examples of important computation-intensive algorithms that greatly benefit from parallelization and performance optimization can be found in the field of multimedia processing ([Jia et al. 2013], [Ali et al. 2014]). Several researchers have directed their efforts towards some of these algorithms in recent years ([Che et al. 2012], [Ghodhbani et al. 2014]). In this paper we focus our attention on Chambolle [Chambolle 2004], which is a relevant algorithm belonging to this class and for which a high-performance parallel implementation has not yet been proposed, as we show in the analysis of the state-of-the-art approaches presented in Section 2. The Chambolle algorithm is well known and widely employed in fields such as motion estimation and compensation, or rolling shutter correction (see Section 2 for more details).
However, even though this algorithm is used in many applications (e.g., the TV-L1 optical flow estimation described in Section 2), no parallel and efficient implementation has been proposed so far; in fact, even the best performing implementations on GPUs are essentially sequential, and they do not achieve real-time frame rates with high resolution images [Zach et al. 2007]. This lack of performance is mainly due to the complex data dependency schemas that usually characterize this kind of algorithm. In addition to the lack of efficient GPU and multi-core implementations, no hardware implementation methodology exists to exploit the large amount of resources available on the latest programmable devices, such as FPGAs. For these reasons we believe that the Chambolle algorithm can be considered a cornerstone for many multimedia systems that deal with challenging problems (such as optical flow estimation [Behbahani et al. 2007]) and for which efficient implementations have not yet been found, mainly because of their complex data dependencies.

This work builds upon the Chambolle implementation we first outlined in [Akin et al. 2011], complementing it with a more detailed algorithm analysis, as well as a deep design space exploration. Specifically, we propose a breakdown of the Chambolle kernel, formally proving its dependency pattern and its locality. We then define a novel algorithmic-level metric to drive the design space exploration of iterative algorithms, which we named expansion rate. The metric makes it possible to estimate implementation aspects, such as the impact of memory transfers, as a function of the geometry of the algorithm. Finally, we extend the design space exploration to other platforms, specifically to GPUs.

The remainder of this paper is structured as follows. In Section 3, we provide a detailed analysis of the Chambolle algorithm, focusing on its main characteristics and properties. Then, we describe the proposed design strategy to efficiently tackle its complexity, parallelizing its computation in order to drastically improve its performance (Section 4).
After showing the proposed architectural template, we introduce the concept of expansion rate, another relevant contribution of this work. Section 5 reports the design space exploration for the Chambolle algorithm, and presents the implementation aspects of the proposed hardware implementation. Finally, Section 6 describes the experimental results proving that the proposed parallelization of the Chambolle algorithm is considerably faster than the solutions found in the literature. These solutions are mainly based on GPU acceleration and do not fully exploit the implicit fine-grained parallelism of this kind of multimedia algorithm. Section 7 shows how the proposed approach, based on a finer parallelization of the input algorithm and targeting FPGA devices, is able to drastically increase the degree of parallelism that can be extracted from the algorithm, and to exploit it to increase the efficiency and the performance of the computing architecture. Finally, Section 8 concludes the paper by drawing some final considerations.

2. STATE OF THE ART

The optical flow is a vector field representing the movement of an object in a sequence of frames, and it can be determined by analyzing the variation of the brightness inside a sequence of successive images [Verri and Poggio 1989]. The estimation of this vector field is one of the most important problems in image and video processing, as it can be employed for motion estimation [Su et al. 2000] and compensation [Li et al. 1997], as well as in other fields such as robotics [Kim et al. 2007] and even medical analysis [Behbahani et al. 2007]. Another important application of the optical flow is the correction of images acquired by CMOS optical sensors using the rolling shutter technique [Baker et al. 2010], which is nowadays used in most low-end photo cameras. In particular, rolling shutter is a method of image acquisition in which each frame is recorded by scanning across the frame either vertically or horizontally, which may generate errors and distortions in the final image.

The optical flow estimation is a computationally challenging problem [Behbahani et al. 2007] because of the large amount of movements that can be detected in a frame, and because of the noise that can alter the image brightness.
A wide range of different techniques, such as [Horn and Schunck 1981] [Black and Anandan 1993] [Papenberg et al. 2006], has been proposed in the past, but variational methods [Aubert et al. 1999], i.e., algorithms based on the minimization of a quantity known as total variation [Rudin et al. 1992], have emerged as one of the most successful approaches in recent years.

The variational technique we consider in this work is called TV-L1 [Pock et al. 2007], which distinguishes itself from other approaches because it can handle highly varying intensities in the frames. The TV-L1 method includes both a mathematical definition of the variational problem and a numerical scheme to compute the solution. The numerical scheme is based on a fixed-point algorithm originally proposed by Antonin Chambolle [Chambolle 2004], which iteratively refines the solution (which in this case represents the optical flow estimation) at different levels of precision.

Though TV-L1 seems to be very promising from a theoretical point of view, its implementations fail to reach real-time performance (i.e., to process at least 30 frames per second), except for very small images. A multithreaded software implementation of TV-L1 that has been developed and analyzed at EPFL, for example, can take more than 15 seconds to process just one frame on a standard x86 workstation, and up to 50 seconds are required on the ARM processor of an Apple iPhone 3GS. Profiling the TV-L1 optical flow estimation on both platforms shows that the Chambolle algorithm itself is the bottleneck responsible for the poor timing performance. In fact, besides the execution of an outermost loop which does not require any complex matrix operation, approximately 90% of the execution time is spent on the Chambolle iterative technique, which proves to be the most critical and computationally intensive part. However, all the implementations of the Chambolle algorithm that can be found in the literature fail to achieve real-time frame rates with high resolution images [Zach et al. 2007].
Furthermore, to the best of our knowledge, a parallel implementation of this approach has never been proposed because of the complex dependencies among the intermediate results [Akin et al. 2011].

In [Pock et al. 2007] and [Zach et al. 2007], the robust TV-L1 technique to calculate the optical flow between two frames is proposed and implemented using modern GPUs. The authors proved that a real-time frame rate can be achieved by the most powerful devices for low-resolution sequences, but only very few frames that are larger than 512 × 512 can be processed in one second. A Matlab implementation of the technique in [Zach et al. 2007] requires from 5 to 6 seconds to complete the estimation of the optical flow on a high-end workstation, and it also shows some limitations in terms of memory usage. Additional hardware results of the estimation of the TV-L1 optical flow on GPUs can also be found in [Weishaupt et al. 2010], but even the fastest implementation cannot exceed a rate of 6 frames per second, even on 512 × 512 images. A full summary of the performance of the aforementioned state-of-the-art implementations of Chambolle is reported in Section 6, as a reference to evaluate the solutions proposed in this paper.

Fast estimations of the optical flow can be achieved by using different techniques and by simplifying the working domain. For example, the implementation proposed in [Abutaleb et al. 2009] can process up to 156 fps on 768 × 576 images, working on a low-cost FPGA device. However, the resulting optical flow is specifically suited for motion detection, and it cannot be used in other applications such as rolling shutter correction. The specific target allows the authors to filter the input frames, and in particular to apply background subtraction, which greatly reduces the amount of data to be processed for the optical flow estimation.

3. CHAMBOLLE ALGORITHM ANALYSIS

This section presents the analysis we have performed on the Chambolle algorithm, describing the structure of its dependency schema (Section 3.2) and providing a formal proof of its locality (Section 3.3).
The notation used in this section is a minor modification of the one used in [Chambolle 2004], and requires a few basic concepts that are described in Section 3.1. Finally, Section 3.4 presents a simplified pseudo-code formulation of the Chambolle algorithm.

3.1. Preliminary Definitions

In the context of multimedia processing, the input of the Chambolle algorithm is represented as a rectangular matrix of length L and width W, which represents a picture of L × W pixels. Let X be defined as the Euclidean space $X = \mathbb{R}^{L \times W}$, and let Y be the Cartesian product $Y = X \times X$. Finally, let us recall the definition of the Euclidean norm $\|\cdot\|$ over $\mathbb{R}^2$, which is defined as $\|y\| = \sqrt{y_1^2 + y_2^2}$ for any point $y = (y_1, y_2) \in \mathbb{R}^2$.

It is now possible to introduce the two main operators that are used in the formulation of the Chambolle algorithm: the discrete gradient and the discrete divergence operators. Given an element $x \in X$, the discrete gradient $\nabla x \in Y$ is defined as:

\[ (\nabla x)_{i,j} = \left( (\nabla x)^{(1)}_{i,j},\ (\nabla x)^{(2)}_{i,j} \right) \tag{1} \]

where:

\[ (\nabla x)^{(1)}_{i,j} = \begin{cases} x_{i+1,j} - x_{i,j}, & \text{if } i < L \\ 0, & \text{if } i = L \end{cases} \qquad (\nabla x)^{(2)}_{i,j} = \begin{cases} x_{i,j+1} - x_{i,j}, & \text{if } j < W \\ 0, & \text{if } j = W \end{cases} \tag{2} \]

for i = 1, ..., L and j = 1, ..., W. The cases i = L and j = W are considered separately, as they refer to pixels that lie on the boundaries of the matrix.
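As an illustration of equations (1) and (2), the discrete gradient can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the matrix is stored as a list of rows, indices are 0-based (so the boundary cases i = L and j = W become the last row and the last column), and the function name `discrete_gradient` is hypothetical and not taken from the paper.

```python
def discrete_gradient(x):
    """Forward-difference gradient of equations (1)-(2).

    x is a list of L rows with W values each (0-based indices, an
    assumption of this sketch). Returns the two component matrices.
    """
    L, W = len(x), len(x[0])
    g1 = [[0.0] * W for _ in range(L)]  # first component, differences along i
    g2 = [[0.0] * W for _ in range(L)]  # second component, differences along j
    for i in range(L):
        for j in range(W):
            if i < L - 1:               # last row keeps 0 (case i = L)
                g1[i][j] = x[i + 1][j] - x[i][j]
            if j < W - 1:               # last column keeps 0 (case j = W)
                g2[i][j] = x[i][j + 1] - x[i][j]
    return g1, g2


x = [[1.0, 2.0, 4.0],
     [3.0, 5.0, 8.0]]
g1, g2 = discrete_gradient(x)
print(g1)  # [[2.0, 3.0, 4.0], [0.0, 0.0, 0.0]]
print(g2)  # [[1.0, 2.0, 0.0], [2.0, 3.0, 0.0]]
```

Each output element reads only the element itself and one neighbor, which is the property the dependency analysis of Section 3.2 builds on.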

The discrete divergence operator takes an element $p = (p^{(1)}, p^{(2)}) \in Y$ as an operand, and returns the value $\mathrm{div}\, p \in X$ defined as:

\[ (\mathrm{div}\, p)_{i,j} = \begin{cases} p^{(1)}_{i,j} - p^{(1)}_{i-1,j}, & \text{if } 1 < i < L \\ p^{(1)}_{i,j}, & \text{if } i = 1 \\ -p^{(1)}_{i-1,j}, & \text{if } i = L \end{cases} \; + \; \begin{cases} p^{(2)}_{i,j} - p^{(2)}_{i,j-1}, & \text{if } 1 < j < W \\ p^{(2)}_{i,j}, & \text{if } j = 1 \\ -p^{(2)}_{i,j-1}, & \text{if } j = W \end{cases} \tag{3} \]

As discussed in the previous sections, the Chambolle algorithm aims at minimizing a quantity known as total variation [Rudin et al. 1992]. With the concepts defined in this subsection, it is now possible to formalize this metric. Given $g \in X$ and $\theta > 0$, the minimization of the total variation can be formulated as follows:

\[ \min_{x \in X} \ \frac{\|x - g\|^2}{2\theta} + \sum_{1 \le i \le L,\ 1 \le j \le W} \|(\nabla x)_{i,j}\| \tag{4} \]

As shown in [Chambolle 2004], the minimization problem has a closed-form solution whose analytical equation is known, but its numerical estimation is not straightforward. In order to find a solution numerically, the problem must be expressed in the following form:

\[ \min_{p \in Y} \left\{ \|\theta\, \mathrm{div}\, p - g\|^2 \ : \ \|p_{i,j}\|^2 \le 1,\ i = 1, \ldots, L,\ j = 1, \ldots, W \right\} \tag{5} \]

This formulation can be numerically approached using a recursive technique known as semi-implicit gradient descent [Chambolle 2004], which is the core part of the Chambolle algorithm. In particular, for any $n \ge 0$, where n counts the iterations (or levels), an element $p \in Y$ is recursively adjusted as follows:

\[ p^{(n+1)}_{i,j} = \frac{p^{(n)}_{i,j} + \tau (\nabla \Phi^{(n)})_{i,j}}{1 + \tau \|(\nabla \Phi^{(n)})_{i,j}\|}, \qquad \Phi^{(n)} = \mathrm{div}\, p^{(n)} - \frac{g}{\theta} \tag{6} \]

where $\tau > 0$ is a fixed value (in general it is equal to 1/4 to guarantee the convergence of the algorithm [Chambolle 2004]), and $p^{(0)} = 0$ by definition. The matrix $\Phi^{(n)} \in X$ is defined in order to keep the notation compact.

3.2. Dependency Schema

According to equation (6), the solution of the Chambolle algorithm recursively depends on previous values (for example, there is an explicit dependency between $p^{(n+1)}_{i,j}$ and $p^{(n)}_{i,j}$), which may prevent a parallelized implementation because a large amount of data might be required to compute the value of $p^{(n+1)}_{i,j}$.
The goal of this section is to unroll the dependencies included in equation (6), and derive the full shape of the stencil. For the sake of illustration, the points on the boundaries of the matrices are omitted, therefore indices i and j are always strictly greater than 1, and strictly lower than L and W, respectively. In fact, boundary values are only a special case of the proposed analysis, and they can be easily handled by substituting the corresponding values from equations (2) and (3).

In equation (6), the denominator is a scalar quantity, whereas both terms in the numerator belong to $Y = X \times X$. As a consequence, $p^{(n+1)}_{i,j} \in Y$, thus it can be written as:

\[ p^{(n+1)}_{i,j} = \left( px^{(n+1)}_{i,j},\ py^{(n+1)}_{i,j} \right) \tag{7} \]

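where both $px^{(n+1)}$ and $py^{(n+1)}$ are L × W matrices computed at level n + 1. The term $(\nabla \Phi^{(n)})_{i,j}$ can then be unrolled according to equations (1) and (2), remembering that the point (i, j) is not on the boundaries of the matrix, and obtaining:

\[ (\nabla \Phi^{(n)})_{i,j} = \left( (\nabla \Phi^{(n)})^{(1)}_{i,j},\ (\nabla \Phi^{(n)})^{(2)}_{i,j} \right) = \left( \Phi^{(n)}_{i+1,j} - \Phi^{(n)}_{i,j},\ \Phi^{(n)}_{i,j+1} - \Phi^{(n)}_{i,j} \right) \tag{8} \]

By substituting this result in equation (6), and by considering the decomposition of $p^{(n+1)}_{i,j}$ shown in (7), two separate equations for $px^{(n+1)}_{i,j}$ and $py^{(n+1)}_{i,j}$ can be written:

\[ px^{(n+1)}_{i,j} = \frac{px^{(n)}_{i,j} + \tau \left( \Phi^{(n)}_{i+1,j} - \Phi^{(n)}_{i,j} \right)}{1 + \tau \|(\nabla \Phi^{(n)})_{i,j}\|} \tag{9} \]

\[ py^{(n+1)}_{i,j} = \frac{py^{(n)}_{i,j} + \tau \left( \Phi^{(n)}_{i,j+1} - \Phi^{(n)}_{i,j} \right)}{1 + \tau \|(\nabla \Phi^{(n)})_{i,j}\|} \tag{10} \]

Finally, $\Phi^{(n)}$ should be expressed as a function of $px^{(n)}$ and $py^{(n)}$. This can be achieved by computing the $\mathrm{div}\, p^{(n)}$ term according to equation (3):

\[ (\mathrm{div}\, p^{(n)})_{i,j} = px^{(n)}_{i,j} - px^{(n)}_{i-1,j} + py^{(n)}_{i,j} - py^{(n)}_{i,j-1} \tag{11} \]

and thus getting that an element $\Phi^{(n)}_{i,j}$ can be expressed as:

\[ \Phi^{(n)}_{i,j} = \left( \mathrm{div}\, p^{(n)} - \frac{g}{\theta} \right)_{i,j} = px^{(n)}_{i,j} - px^{(n)}_{i-1,j} + py^{(n)}_{i,j} - py^{(n)}_{i,j-1} - \frac{g_{i,j}}{\theta} \tag{12} \]

The resulting value is substituted into equations (9) and (10) in order to show the dependency between $px^{(n+1)}$ and $py^{(n+1)}$ and some points in $px^{(n)}$ and $py^{(n)}$, i.e., points referring to the previous iteration. In particular, the resulting equations are:

\[ px^{(n+1)}_{i,j} = \frac{px^{(n)}_{i,j} + \tau \left[ px^{(n)}_{i+1,j} - 2px^{(n)}_{i,j} + px^{(n)}_{i-1,j} \right]}{1 + \tau \|(\nabla \Phi^{(n)})_{i,j}\|} + \frac{\tau \left[ py^{(n)}_{i+1,j} - py^{(n)}_{i,j} + py^{(n)}_{i,j-1} - py^{(n)}_{i+1,j-1} + \frac{g_{i,j} - g_{i+1,j}}{\theta} \right]}{1 + \tau \|(\nabla \Phi^{(n)})_{i,j}\|} \tag{13} \]

\[ py^{(n+1)}_{i,j} = \frac{py^{(n)}_{i,j} + \tau \left[ px^{(n)}_{i,j+1} - px^{(n)}_{i,j} + px^{(n)}_{i-1,j} - px^{(n)}_{i-1,j+1} \right]}{1 + \tau \|(\nabla \Phi^{(n)})_{i,j}\|} + \frac{\tau \left[ py^{(n)}_{i,j+1} - 2py^{(n)}_{i,j} + py^{(n)}_{i,j-1} + \frac{g_{i,j} - g_{i,j+1}}{\theta} \right]}{1 + \tau \|(\nabla \Phi^{(n)})_{i,j}\|} \tag{14} \]

A visual representation of the dependencies extracted from equations (13) and (14) is shown in Figure 1(a), where all the intermediate matrices $px^{(n+1)}$, $py^{(n+1)}$, $px^{(n)}$, $py^{(n)}$ and $\Phi^{(n)}$ are illustrated. However, since $px^{(n)}$ and $py^{(n)}$ are only known if the element p = (px, py) is known, it is possible to use a more compact representation that only considers $p^{(n+1)}$ and $p^{(n)}$, thus obtaining the schema in Figure 1(b).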
Since Figure 1(b) depicts the dependencies between two consecutive iterations, it also graphically illustrates the shape of the stencil applied by the Chambolle algorithm.

[Fig. 1. Graphical representation of the stencil shape of the Chambolle algorithm. (a) Dependencies among matrices px^(n+1), py^(n+1), Φ^(n), px^(n) and py^(n). (b) Simplified representation of the dependencies among p^(n+1) and p^(n).]

3.3. Locality of the Algorithm

The stencil shown in Figure 1(b) can be generalized in two ways. First, it is possible to identify the dependencies when more than one element of the matrix has to be computed, as for example a sub-matrix of $p^{(n+1)}$ of size l × w. Figure 2(a) shows the dependency schema when a 2 × 1 and a 2 × 2 sub-matrix are computed at level n + 1. Second, it is possible to increase the number of levels beyond n + 1, as shown in Figure 2(b) for level n + 2.

In general, a sub-matrix of size l × w at level n + 1 depends on the same l × w pixels at level n, but it also requires a ring of additional elements at level n that surrounds the sub-matrix. In the example with a 2 × 2 sub-matrix shown in Figure 2(a), the goal is to compute 4 points at level n + 1, which can be achieved starting from the same points at level n, and including a ring of 10 elements at level n that surrounds the sub-matrix (notice that the pixels in the upper-left and in the lower-right corners are not required). Similarly, if more levels are considered at once, the elements of the ring require additional surrounding points, thus leading to a dependency schema composed of concentric rings of growing size, as shown in Figure 2(b).

[Fig. 2. Generalization of the dependencies among the points in matrix p. (a) Dependencies for the computation of 2 and 4 points of p^(n). (b) Dependencies for the computation of multiple levels (n to n + 2).]

Given the regularity of the dependency schema, it is possible to estimate the number of points that are required to compute a generic sub-matrix at an arbitrary level. Let $\Omega(l, w, N)$ be the number of elements needed to calculate a sub-matrix of size l × w (with $1 \le l \le L$ and $1 \le w \le W$) at a level $N \ge 2$. It can be observed that the case N = 1 is trivial, as no recursion is necessary to get the result. In addition, if a point at level N has to be computed, all the values from level N − 1 to level 1 must be known, so that the recursion of equation (6) will terminate. In the case of Chambolle, the value of $\Omega(l, w, N)$ can be computed as follows:

\[ \Omega(l, w, N) = \sum_{k=1}^{N-1} \left[ (l + 2k)(w + 2k) - 2 \sum_{h=1}^{k} h \right] \tag{15} \]

The outermost summation considers all the levels N − k, and computes the number of points that are required at that level. At each level, both the length and the width of the surrounding ring enlarge by two points, an effect that is captured by the (l + 2k)(w + 2k) term. The innermost summation corrects the estimation by removing a level-dependent number of points from the upper-left and the lower-right corners of the ring, which are not required at that level.

For example, let us consider the computation of a 2 × 2 sub-matrix at level N = 3, which is the same schema shown in Figure 2(b) when n = 1. For k = 1, level N − k = 2 is considered, and the number of points that are required is equal to (2 + 2·1)(2 + 2·1) − 2·1 = 14. At k = 2, level N − k = 1 is considered, and a total of (2 + 2·2)(2 + 2·2) − 2·3 = 30 points are needed. Overall, 14 + 30 = 44 points are required to compute a 2 × 2 sub-matrix at level 3.

The value of $\Omega(l, w, N)$ can be used to compute the static expansion rate metric of Chambolle that will be introduced in Section 4. It is also important to remark that, in general, $\Omega(l, w, N)$ can be considered as an upper bound of the total number of pixels, because some of the points may be located on the boundaries of the matrix, so they depend on a smaller number of neighbors. Conversely, $\Omega(l, w, N)$ is an exact estimation when the points are not located on the matrix borders. In both cases, the fact that the number of required neighbors is bounded by $\Omega(l, w, N)$ ensures that this computation can be performed locally.

3.4. A Simplified Pseudo-Code Formulation of Chambolle

In the previous subsections, the locality of the Chambolle algorithm and its dependency schema have been analyzed starting from its mathematical formulation. For the sake of clarity, a simpler pseudo-code formulation of the algorithm is now introduced. The pseudo-code form was first proposed in [Zach et al. 2007], and it introduces a set of high-level macro-operations that are better suited for hardware design, while preserving the same dependencies outlined in Figure 2.

Algorithm 1 Chambolle Algorithm
 1: for i = 1, ..., N_iterations do
 2:    div_p = Backward_X(px_u1) + Backward_Y(py_u1)
 3:    Term = div_p − v_1/θ
 4:    Term_1 = Forward_X(Term)
 5:    Term_2 = Forward_Y(Term)
 6:    |∇u_1| = sqrt(Term_1² + Term_2²)
 7:    px_u1 = [px_u1 + τ/θ · Term_1] / [1 + τ/θ · |∇u_1|]
 8:    py_u1 = [py_u1 + τ/θ · Term_2] / [1 + τ/θ · |∇u_1|]
 9:    u_1 = v_1 − θ · div_p
10: end for

In the pseudo-code formulation, the optical flow between the two input frames I_0 and I_1, both expressed in matrix form, is represented by a two-dimensional vector u = (u_1, u_2), which is the output of the Chambolle algorithm. The vector u is initialized to 0, and its final value is computed by means of an iterative sequence of levels, as discussed in the previous subsections. At each level, a support variable v = (v_1, v_2) is defined using a thresholding function of I_1 and of the value of u computed at the previous level [Zach et al. 2007].
Then, the value of u at the current level is determined using the iterative steps of the Chambolle algorithm, which are reported in Algorithm 1. For the sake of simplicity, the pseudo-code only shows the computation of u_1, but u_2 is computed in the same way, by simply substituting u_1 and v_1 with u_2 and v_2. The vector u is updated by means of two intermediate values, namely px = (px_u1, px_u2) and py = (py_u1, py_u2), which are initialized to 0 [Zach et al. 2007]. In order to simplify the description, the auxiliary variables Term, Term_1, and Term_2 are also introduced to store the intermediate results of the computation (lines 3-5). The Backward_X(z) function returns a matrix in which the left neighbor is subtracted from each element of z, whereas Backward_Y subtracts the upper neighbor. Similarly, Forward_X computes the difference between each element and its right neighbor, and Forward_Y the difference between each element and its lower neighbor, following the forward differences of equation (2). It is worth noting that, according to the way they are invoked in Algorithm 1, these four functions generate the same stencil shape illustrated in Figure 2. Finally, the constants θ and τ are the same values that are used in the mathematical formulation of Chambolle, and determine the precision of the algorithm.
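The steps of Algorithm 1 can be sketched in plain Python as follows. This is an illustrative software sketch, not the hardware implementation discussed in the paper, and it rests on several assumptions: matrices are lists of rows, the X direction is mapped to columns (left and right neighbors) and the Y direction to rows (upper and lower neighbors), the boundary cases are borrowed from equations (2) and (3), and the names `forward`, `backward` and `chambolle_u1` are hypothetical.

```python
import math


def forward(z, di, dj):
    """Neighbor at offset (di, dj) minus the element; 0 past the border (eq. 2)."""
    rows, cols = len(z), len(z[0])
    return [[(z[i + di][j + dj] - z[i][j])
             if (i + di < rows and j + dj < cols) else 0.0
             for j in range(cols)] for i in range(rows)]


def backward(z, di, dj):
    """Element minus the neighbor at offset (-di, -dj), boundary cases of eq. (3)."""
    rows, cols = len(z), len(z[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            first = (i == 0) if di else (j == 0)
            last = (i == rows - 1) if di else (j == cols - 1)
            cur = 0.0 if last else z[i][j]
            prev = 0.0 if first else z[i - di][j - dj]
            out[i][j] = cur - prev
    return out


def chambolle_u1(v1, theta, tau, n_iterations):
    """Run the loop of Algorithm 1 on one flow component and return u1."""
    rows, cols = len(v1), len(v1[0])
    px = [[0.0] * cols for _ in range(rows)]   # px_u1, initialized to 0
    py = [[0.0] * cols for _ in range(rows)]   # py_u1, initialized to 0
    u1 = [row[:] for row in v1]
    for _ in range(n_iterations):
        bx, by = backward(px, 0, 1), backward(py, 1, 0)          # line 2
        div_p = [[bx[i][j] + by[i][j] for j in range(cols)]
                 for i in range(rows)]
        term = [[div_p[i][j] - v1[i][j] / theta for j in range(cols)]
                for i in range(rows)]                            # line 3
        term1 = forward(term, 0, 1)                              # line 4
        term2 = forward(term, 1, 0)                              # line 5
        for i in range(rows):
            for j in range(cols):
                norm = math.hypot(term1[i][j], term2[i][j])      # line 6
                denom = 1.0 + (tau / theta) * norm
                px[i][j] = (px[i][j] + (tau / theta) * term1[i][j]) / denom
                py[i][j] = (py[i][j] + (tau / theta) * term2[i][j]) / denom
        u1 = [[v1[i][j] - theta * div_p[i][j] for j in range(cols)]
              for i in range(rows)]                              # line 9
    return u1


flat = [[2.0] * 3 for _ in range(3)]
print(chambolle_u1(flat, theta=0.25, tau=0.25, n_iterations=5))
```

A constant input is a convenient sanity check: its forward differences are all zero, so px and py stay at 0 and the output equals the input. Note also that every value written in one pass of the loop is computed only from values produced in the previous pass, which is what makes the element-wise updates independent of each other.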

4. THE PROPOSED DESIGN STRATEGY

The analysis described in the previous section shows that the Chambolle algorithm is characterized by the following properties:

(1) no read-after-write (RAW) conflicts exist within a single iteration, as shown by the pseudo-code presented in Algorithm 1. This means that the computation of an element at iteration i + 1 cannot depend on the value of another element at iteration i + 1, but only on previously generated elements, i.e., those computed at iteration i;
(2) as shown in Section 3.3, which describes the locality of the Chambolle algorithm, the set of elements required to compute an element at iteration i + 1 is a small subset of the frame f_i produced at the i-th iteration, and these elements are spatially close to the element p that has to be computed;
(3) finally, the analysis of the dependency schema of the Chambolle algorithm performed in Section 3.2 shows that, given two target elements that are separated by a translation, the corresponding dependency schemas have the same shape, but they are translated by the same distance as the target element.

By exploiting these features, we have been able to propose an efficient architecture that serves as a template for the high-performance and parallel implementation of the Chambolle algorithm, as described in Section 4.1. Since the template has to be tailored to the specific needs of the designer, for instance to explore the resource-performance trade-offs, we introduce in Section 4.2 a set of metrics that can be used by the designer to tune the different architectural parameters of the proposed template.

4.1. Proposed Architectural Template

The proposed architectural template is based on a computational structure that is different from the straightforward one-entire-frame-at-a-time approach. In fact, it aims at directly computing a portion of the results of an arbitrary iteration, by loading and processing only the elements that are required to produce the output, according to the dependency schema of the algorithm.
The set of elements produced as an output is typically a subset of the elements that are processed as an input because of data dependencies, therefore the core that performs such multi-iteration computation can be seen as a cone (see Figure 3).

[Fig. 3. 3D representation of a generic computational cone spanning 2 iterations (Iteration N-2, Iteration N-1, Iteration N).]

The knowledge of the data dependencies makes it possible to express the result of the (i + m)-th iteration as a function of (part of) the elements computed at the i-th iteration. As a consequence, given the data available from the i-th iteration, instead of trying to compute the whole f_{i+1}, the proposed approach focuses on a subset of the matrix elements and directly computes the results of a generic m-th iteration (with m ≥ 1), thus obtaining a subset of f_{i+m}. The resulting computational cone has a depth equal to m.
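The input footprint of a cone can be estimated with the point count of equation (15). The following Python sketch is a minimal illustration under one assumption: the k = m summand of equation (15) is read as the number of elements that a cone of depth m must load to produce an l × w output block lying away from the frame borders. The function names `points_at_depth` and `omega` are hypothetical.

```python
def points_at_depth(l, w, k):
    """Summand of equation (15): points needed k levels below an l x w block."""
    # Bounding box grows by two points per level in each dimension.
    bounding_box = (l + 2 * k) * (w + 2 * k)
    # Points removed from the upper-left and lower-right corners.
    trimmed_corners = 2 * sum(range(1, k + 1))
    return bounding_box - trimmed_corners


def omega(l, w, n_levels):
    """Equation (15): total points over levels N-1 down to 1."""
    return sum(points_at_depth(l, w, k) for k in range(1, n_levels))


# Worked example from Section 3.3: a 2 x 2 sub-matrix at level N = 3.
print(points_at_depth(2, 2, 1))  # 14
print(points_at_depth(2, 2, 2))  # 30
print(omega(2, 2, 3))            # 44
```

The gap between the input footprint and the l × w output grows with the depth of the cone, which is the trade-off the template parameters have to balance.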

In order to obtain the entire output frame f_{i+m}, multiple executions of the computational cones may be required. The proposed architectural template is defined as a combination of multiple levels of cones of different depths, which are able to compute the result of multiple iterations of the elementary transformation t. An instance of the proposed template is shown in Figure 4, and it works as follows: a small subset (window) of the input data, which is stored in the off-chip memory, is transferred to the on-chip memory to feed the cones of the first level of the architecture. In the example shown in Figure 4, the first level is composed of four cones: A, B, C and D. The output of each level is then used as input for the subsequent level, until all the necessary iterations are performed. The output of the last level (Level 3 in the example in Figure 4) is finally stored back into the off-chip memory, and the whole process starts over on a different window of the input data, until all the matrices have been computed. This technique, which makes it possible to span the input matrix in order to progressively produce the output, is called sliding window.

[Fig. 4. An instance of the proposed cone-based architectural template: Level 1 (cones A, B, C, D; iterations 1-2), Level 2 (cones E, F, G; iterations 3-6), Level 3 (cone H; iterations 7-10), from the input to a 4 × 4 output.]
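The sliding window traversal can be sketched as a generator of index ranges. This is a simplified illustration, not the scheduling used in the proposed architecture, and it relies on assumptions that are not stated in the text: each output block of l × w elements is paired with the bounding box of its dependencies, taken as a halo of m elements on every side (one element per iteration, ignoring the trimmed corners of Section 3.3), and windows are clamped at the frame borders. The function name `sliding_windows` is hypothetical.

```python
def sliding_windows(L, W, l, w, m):
    """Yield (output_block, input_window) pairs covering an L x W frame.

    Each item is (row_start, row_stop, col_start, col_stop), stop exclusive.
    l x w is the output block size and m the total number of iterations
    computed before the result is written back (assumed halo width).
    """
    for r in range(0, L, l):
        for c in range(0, W, w):
            out = (r, min(r + l, L), c, min(c + w, W))
            win = (max(out[0] - m, 0), min(out[1] + m, L),
                   max(out[2] - m, 0), min(out[3] + m, W))
            yield out, win


# An 8 x 8 frame, 4 x 4 output blocks, 2 iterations per window.
for out, win in sliding_windows(8, 8, 4, 4, 2):
    print("output", out, "<- input", win)
```

In this sketch the output blocks tile the frame without overlap, while the input windows of neighboring blocks share elements.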

The sliding window technique is illustrated in more detail in Figure 5. The windows are aligned in such a way that the correctly computed elements cover the entire frame, implying a certain degree of overlapping among them. The sliding window approach introduces both a memory and a computation overhead. The former is due to the fact that certain elements are replicated in multiple sub-matrices, and are processed by more than one cone. The latter is due to the structure of the cones, which are typically unaware of which part of the processed data is valid and will eventually contribute to the final output. The idea of dividing the input into a set of overlapping regions has already been proposed for a few specific algorithms in the scope of custom hardware design [Roca et al. 1999], even though it has never been methodically combined with other optimizations, such as the computation of multiple iterations within a cone.

Fig. 5. The sliding window technique to produce the whole output frame (legend: sliding window, elements not computed correctly, elements computed correctly, overlapping region, input matrix)

Since the number and the depth of the cones in the actual architecture can vary depending on the desired trade-off between resource usage and target performance, multiple instances of the proposed template may exist. In particular, each one of these instances is uniquely defined by the two following parameters:

(1) the size of the output window of each cone, defined as the number of output elements contained in the rectangle of size l × w;
(2) the depth of each cone, i.e. the number of levels into which the computation is divided or, equivalently, the number of iterations that are performed at once by each cone.

Figure 4 shows an instance of the template with an output window of 4 × 4 elements and 3 levels of computation: the first one involves 2 iterations, while the other two levels involve 4 iterations each.
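As a rough illustration of how these two parameters determine an instance, the sketch below derives the data window handled by each level of the Figure 4 example, working backwards from the 4 × 4 output. It assumes, hypothetically, that every iteration widens the required window by one element per side (the `halo` parameter), and it ignores how a level is split among individual cones; the function name `level_windows` is illustrative.

```python
def level_windows(l, w, depths, halo=1):
    """Return one (level, depth, input_size, output_size) tuple per level.

    `depths` lists the number of iterations of each level, first level
    first. The window sizes are derived backwards from the final l x w
    output, assuming each iteration adds `halo` elements per side.
    """
    rows, cols = l, w
    levels = []
    for level in range(len(depths), 0, -1):
        depth = depths[level - 1]
        in_rows = rows + 2 * halo * depth
        in_cols = cols + 2 * halo * depth
        levels.append((level, depth, (in_rows, in_cols), (rows, cols)))
        rows, cols = in_rows, in_cols
    return list(reversed(levels))


# Figure 4: 4 x 4 output window, three levels of 2, 4 and 4 iterations.
fig4 = level_windows(4, 4, depths=[2, 4, 4])
for level, depth, size_in, size_out in fig4:
    print(f"level {level}: {depth} iterations, {size_in} -> {size_out}")
```

Under these assumptions the 10 iterations of the example require a 24 × 24 input window for a 4 × 4 output, which shows why the window size and the cone depths have to be chosen together.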
It is worth noting that the amount of data exchanged between two levels x and x + 1 (the output of level x is the input of level x + 1) only depends on the size of the output of level x + 1 and on the number of iterations considered by the two levels of computation; the parameters previously introduced therefore suffice to completely specify any architecture. The only requirement for an instance to be feasible is that, if cones of different depths are required to complete the computation, at least one cone of each depth must be implemented on the device. For instance, the example in Figure 4 is feasible if the available resources are sufficient to fit cones A and E because, in this case, the first level can be implemented by sequentially executing cone A four times (in order to cover B, C and D as well), and the remaining levels by executing cone E four times (3 executions are required for level 2, and one for level 3). Many instances are generally feasible, and the same instance may be implemented in different ways by instantiating different numbers of cores of different depths, according to the available resources. As a consequence, multiple different tradeoffs between area usage and achievable throughput (the more cones, the better) need to be evaluated. The tradeoff analysis can be performed by defining proper quality metrics, which are discussed in the following section.

4.2. Design Evaluation Using the Expansion Rate

As the definition of a computational cone spanning across the frame introduces a computation and memory overhead in the final architecture, it is necessary to define proper quality metrics to estimate its impact and help the designer in tuning the architectural parameters, such as the depth and window size of each cone. An ideal metric should depend only on the structure of the algorithm, so that it can be computed in the early stages of the design, but it should also provide a reliable estimation of post-implementation aspects, such as area and throughput. In this context, we define such a metric, related only to the geometry of the dependency scheme, and we name it expansion rate. Two flavors of the expansion rate are proposed in this work, the first focusing on the geometry of the stencil, while the second is mainly driven by memory considerations. The two values are conceptually different as they address two separate aspects of the design, hence they can be considered as complementary while evaluating different design options.
The two flavors of the expansion rate are defined as follows:

Static Expansion Rate (SER): the SER is defined as the normalized ratio between the number of input elements to be processed and the size of the output window. In particular, the static expansion rate for a cone of depth m that produces an output area of size l × w is defined as follows:

\[ \mathrm{SER}(l, w, m) = \sqrt[m]{\frac{|\Omega(l, w, m)|}{l \cdot w}} \tag{16} \]

where Ω(l, w, m) is the set of input elements that must be processed in order to generate the output area while performing m iterations at once. The metric is purely based on geometrical considerations: Ω(l, w, m) only depends on the shape of the stencil, which in turn depends on the input algorithm. The m-th root acts as a normalization operation, which is necessary to compare cones of different depths. In fact, a cone with a higher depth likely requires a larger number of input elements to produce the same output area, but this higher overhead is compensated by the benefits of performing more iterations at once.

Dynamic Expansion Rate (DER): the DER is conceptually defined as the ratio between the number of input elements that need to be loaded from the memory and the size of the output that is produced by the cone. The amount of data to be fetched from the memory is equal to the number of elements that are necessary to compute the current output window and were not required to compute the previous one. Hence, this metric is able to evaluate the overlapping of the sliding window, and to assess how this affects the memory accesses. Formally, the DER is defined as:

\[ \mathrm{DER}(l, w, m) = \frac{\Psi(l, w, m)}{l \cdot w} \tag{17} \]

where the function Ψ(l, w, m) indicates the number of non-overlapping input elements between two consecutive applications of the cone. This value is specific to each input algorithm, and can be computed by considering either a horizontal or a vertical translation of the sliding window.

The expansion rate is equal to 1 only if the output and the input window sizes are equal, hence no overhead exists, while it assumes higher values when the number of input elements that are processed by the cone is much larger than the size of the output window. In this way, the expansion rate can be used to maximize the ratio between the number of output and of input elements. The metric is also a function of the depth of the cone, because performing a larger number of iterations at once reduces the number of intermediate results to be stored, increases performance and may balance the additional overhead of processing a larger input window.

5. IMPLEMENTATION DETAILS

This section illustrates the design of a parallel implementation of Chambolle, whose structure is based on the cone architecture proposed in Section 4. Starting from the stencil shape of the algorithm, a set of cones has been derived and further optimized using ad hoc considerations. In particular, the design of the processing elements within each cone has been specifically tuned to achieve the best possible performance, using an efficient and application-specific data reuse mechanism, described in Section 5.3, as well as a properly suited memory management system, detailed in Section 5.4. As a result of this design effort, the proposed solution largely outperforms all the existing hardware implementations of Chambolle that can be found in the literature. In the proposed architecture, the shape of the computational cone follows the stencil shape shown in Figure 2(b). Each cone aims at directly computing each element of px and py (see Algorithm 1) at iteration n + x by finding a formula that employs the values available at iteration n.
Each cone is then shifted using a sliding window mechanism, in order to span the entire area of the input matrix. As discussed in Section 4, the rationale is to divide the output frame (I_1 in Section 3.4) into overlapping sub-matrices, whose profitable areas are contiguous. This approach introduces a slight memory overhead, because certain elements are replicated in multiple sub-matrices. A computation overhead is also introduced, as the cores may process some elements which are not profitable and will not be part of the output. However, the sliding window technique enables a coarse-grained parallelization of Chambolle in spite of its recursive nature and its complex data dependencies, and this greatly improves the throughput of the proposed implementation.

The remainder of this section provides a detailed description of the computation that takes place within each computational cone. In addition, we discuss the implementation of the sliding window technique, which allows the cones to span the input matrix, including all the relevant implementation details related to the memory organization.

5.1. Expansion Rate Analysis

The expansion rate metrics, which have been introduced in Section 4, can be evaluated to guide the choice of the most suitable cone size for the Chambolle algorithm. The static expansion rate, which captures the geometrical properties of the algorithm, can be computed according to equation (16), replacing the value of Ω(l, w, N), which quantifies the number of input elements that must be processed to generate the output window, with the equation obtained in (15). The resulting equation is the following:

\[ \mathrm{SER}(l, w, m) = \sqrt[m]{\frac{\sum_{k=1}^{m} \left[ (l + 2k)(w + 2k) - 2 \sum_{h=1}^{k} h \right]}{l \cdot w}} \tag{18} \]

Fig. 6. Static and dynamic expansion rates for the Chambolle algorithm (two panels, SER and DER, each plotted against the number of iterations and the output window length)

This equation is plotted in Figure 6 for different values of the number of iterations and of the output window size. For the sake of illustration, a square output window has been assumed in the figure, so its size can be summarized using only one axis, which represents the length of its edge. It can be observed that the expansion rate is minimized with windows of large size (i.e., larger than 60 × 60), while the dependency on the number of iterations is significant only for windows of small size. This behavior is consistent with the shape of the Chambolle stencil, which requires a lot of overlapping input elements when a large output is computed.

Similarly, the dynamic expansion rate can be computed starting from equation (17), and computing the number of elements to be fetched from the memory when the cone slides to the following output window. According to the shape of the stencil illustrated in Figure 2, it can be derived that: when the cone slides horizontally, a total of l · (w + 2m) new elements of the input matrix have to be fetched; when the cone slides vertically, w · (l + 2m) new elements have to be loaded from the memory. Either sliding direction can be used to compute the dynamic expansion rate, as they eventually lead to the same conclusions. Figure 6 shows the behavior of the DER for different values of the number of iterations and of the output size: a square output window is again assumed for illustrative purposes, thus making the horizontal and vertical translations equivalent.
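The two metrics are straightforward to evaluate numerically. The sketch below is a minimal illustration, with hypothetical function names: `ser` implements the generic definition of equation (16) and takes the number of input elements as a parameter, while `der_chambolle` plugs the element counts stated above for the horizontal and vertical translations into equation (17).

```python
def ser(num_inputs, l, w, m):
    """Static expansion rate, equation (16): m-th root of |Omega| / (l * w).

    `num_inputs` is the number of input elements that must be processed
    (|Omega|); it has to be supplied by the caller for the given algorithm.
    """
    return (num_inputs / (l * w)) ** (1.0 / m)


def der_chambolle(l, w, m, direction="horizontal"):
    """Dynamic expansion rate, equation (17), for the Chambolle stencil.

    Uses the number of newly fetched elements stated in the text:
    l * (w + 2m) for a horizontal slide, w * (l + 2m) for a vertical one.
    """
    if direction == "horizontal":
        fetched = l * (w + 2 * m)
    elif direction == "vertical":
        fetched = w * (l + 2 * m)
    else:
        raise ValueError("direction must be 'horizontal' or 'vertical'")
    return fetched / (l * w)


# DER for square windows of growing edge, with cones of depth 5.
for edge in (10, 20, 40, 80):
    print(edge, der_chambolle(edge, edge, m=5))
```

For a cone of depth 5, the DER drops from 2.0 for a 10 × 10 window to 1.125 for an 80 × 80 window, in line with the preference for large windows discussed in this section.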
Similarly to the static case, the evaluation of the dynamic expansion rate also recommends the use of large output windows, with an edge larger than 80 elements. The conclusion of the analysis of SER and DER, reported in Figure 6, is that windows whose edge is longer than 60 and 80 elements, respectively, should be preferred. Taking the stricter of the two thresholds, any output window larger than 80 × 80 can effectively mitigate the effects of the computation and memory access overheads.

Finally, we use Chambolle as an example to illustrate the ability of the expansion rate to capture post-implementation design aspects, specifically area and throughput, in spite of being defined as a sole function of the geometry of the input algorithm. Figure 7 highlights the best solutions when two common design approaches are adopted. Specifically, the x-axis represents the normalized ratio between throughput and area, which corresponds to a scenario where the design goal is to maximize the performance of the system, given the available resources. The y-axis, on the other hand, represents the normalized throughput, corresponding to a scenario where performance has to be maximized without area limitations. The quantitative analysis of Figure 7 includes different window sizes and numbers of iterations, which in turn correspond to different values of the expansion rate (in this case the SER, but similar results are obtained for the DER). The window sizes range between 6 × 6 and 89 × 89, while the number of iterations varies between 1 and 5, and is represented in the picture by the size of the circles. The green data points (solid lines) highlight the top 20% of solutions in terms of SER. It can be observed that, in general, solutions with a higher expansion rate tend to have higher throughputs, and make an efficient use of the area they require. This is further supported by the results in Figure 8, which reports throughput and throughput/area values as a function of the SER, the data points being clustered and averaged in order to better highlight the correlation. The expansion rate can therefore be considered a reliable metric for design space exploration, and it can be computed by following the algorithm analysis proposed in Section 3, rather than performing a time-consuming synthesis for each candidate window size.

Fig. 7. Expansion rate estimation versus actual post-implementation area and throughput

Fig. 8. Normalized throughput and throughput/area for different ranges of the normalized SER

5.2. Overview of the Proposed Hardware Solution

Among the different implementations that satisfy the constraint identified in the previous section (windows larger than 80 × 80 elements), we herein propose as an example a solution that employs cones working on sub-matrices of 88 × 92 elements, which is close to the target threshold, in order to keep resource (especially memory) requirements low. The proposed hardware architecture slides these windows to span the entire length of the original matrix. A top-level block diagram of the proposed hardware architecture is shown in Figure 9. The hardware employs two concurrent cones moving as sliding windows (named SW1 and SW2), which work completely in parallel, each one updating the values of both u1 and u2 (we use the notation sw1_u1 to indicate the value of u1 computed by the sliding cone SW1). A cone moving as a sliding window is logically divided into two parts: an array of processing elements (PEs), and a dedicated amount of on-chip memory implemented on the BRAMs of the FPGA device. A detailed view of a cone, and in particular of the circuit that processes sw1_u1, is shown in Figure 10. The data required to compute the components of u (i.e., v, px and py, as shown in Algorithm 1) is stored in the on-chip BRAMs, in order to reduce the accesses to the off-chip memory. We have designed the cone to compute 7 elements in parallel for both u1 and u2, thus finding 14 elements of vector u at the same time. This structure not only introduces a finer level of parallelism to accelerate the execution, but also enables a significant data reuse among the PEs (as discussed in the following subsection), and reduces the accesses to both on-chip and off-chip memory. As a result, the proposed hardware is able to compute the value of one element in just 18 clock cycles: 1 cycle is required by the control unit, 1 cycle by the synchronous read from the BRAM memory, 1 cycle by the vertical rotator, and 15 cycles by the PE array. Furthermore, the processing of each one of sw1_u1, sw1_u2, sw2_u1 and sw2_u2 requires 8 BRAMs to store the respective px, py and v values, plus an additional BRAM that is necessary to exchange data between two iterations of the PEs. Hence, only 36 BRAM blocks are employed by the proposed design.

Fig. 9. Top-level block diagram of the proposed hardware implementation of Chambolle (a control unit, with inputs θ, N iterations and dt, sends address and control signals to SW1 and SW2; each of sw1_u1, sw1_u2, sw2_u1 and sw2_u2 has its own PE array, 8 BRAMs and 1 BRAM for Term)

Fig. 10. Computation of sw1_u1 within a cone (control unit, 8 dual-port BRAMs for sw1_u1, two vertical rotators, the PE array for sw1_u1 and 1 BRAM for Term)

5.3. Processing Element Arrays and Data Reuse

The proposed hardware implementation includes four PE arrays, two for each cone, to find the outputs u1 and u2 of Chambolle, which are subsequently used to update v by means of the thresholding function. Each PE array contains 14 processing elements, 7 of which are called PE-Ts and are used to calculate the values of Term and u (see Algorithm 1), while the other 7 are named PE-Vs and are used to compute px and py. Overall, there are 56 PEs in the proposed hardware, evenly divided among PE-Ts and PE-Vs. Within the cone, a ladder organization of a PE array is proposed: Figure 11 illustrates this organization on the PEs that work on the first 7 rows (also called first region) of the input matrix.
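The assignment of rows to PE-Ts and BRAMs in this organization can be written down as a small sketch. The constants reflect the numbers given in the text (7 PE-Ts per array, 8 BRAMs per array); the function names are hypothetical, and the modulo mapping from rows to BRAMs is an assumption read off the row and BRAM labels of Figure 11.

```python
PES_PER_ARRAY = 7   # PE-Ts working in parallel, one row each
NUM_BRAMS = 8       # BRAMs storing px, py and v for one PE array


def ladder_row(region, pe_index):
    """Row processed by PE-T<pe_index> (1..7) while working on `region`.

    Region 0 covers rows 0..6, region 1 covers rows 7..13, and so on.
    """
    if not 1 <= pe_index <= PES_PER_ARRAY:
        raise ValueError("pe_index must be between 1 and 7")
    return region * PES_PER_ARRAY + (pe_index - 1)


def bram_for_row(row):
    """BRAM holding the data of a given row (assumed mapping: row modulo 8)."""
    return row % NUM_BRAMS


# Rows and BRAMs touched by the seven PE-Ts in the second region.
for pe in range(1, PES_PER_ARRAY + 1):
    row = ladder_row(1, pe)
    print(f"PE-T{pe}: row {row}, BRAM {bram_for_row(row)}")
```

Because a region spans 7 rows while the data is spread over 8 BRAMs, the BRAM read by a given PE changes from one region to the next, which is why a rotation stage is needed between the memories and the PE array.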
The same figure also illustrates how the same PEs are then reused to process the following 7 rows (second region). In particular, while PE-T1 is calculating Term for the elements in the uppermost row, PE-T7 computes Term for the elements in row 6. Then, after all the PEs have completed the first 7 rows, PE-T1 starts computing Term for row 7, while PE-T7 shifts to row 13. The value of Term for one element depends on the values of px and py at the same position (we refer to these values as c_px and c_py), plus the px vector of the element on the left (l_px), and the py vector of the element above (a_py). Without any data reuse policy, each PE-T in a PE array requires 4 values to be loaded from the on-chip memory, and consequently 4 PE arrays with 7 PE-Ts each require 112 values to be read from the memory. Thanks to the proposed ladder organization of the PEs, this data transfer can be limited by propagating the intermediate results. Figure 12 shows how the 7 PE-Ts are arranged, and how they were aligned in the previous cycle (dashed boxes). Since all the PEs require their c_px and c_py vectors computed in a previous iteration, these are loaded from the BRAMs. Then, as the processing direction in a cone goes from left to right, these vectors can be reused as l_px and a_py vectors for the following cycle without accessing the memory. For instance, PE-T3 takes the l_px vector from the flip-flop that stores the c_px vector processed in the previous cycle. Similarly, c_py can be reused as a_py by the PE-Ts which are located below: for example, the c_py vector used by PE-T2 is the a_py vector of PE-T3 for the next cycle.

Fig. 11. Organization of 7 PE-Ts and 7 PE-Vs in a computational cone, and memory organization during the computation of sw1_u1 (columns 0 to 91; region 0 covers rows 0 to 6 and region 1 rows 7 to 13; rows 0 to 7 are stored in BRAMs 0 to 7, rows 8 to 13 in BRAMs 0 to 5, and so on up to row 87 in BRAM 7; BRAM-Term bridges consecutive regions)

Fig. 12. Data reuse among the 7 PE-Ts during the computation of sw1_u1 (the dashed boxes indicate the position of the PE-Ts in the previous cycle)

The PE-Vs start computing px and py for one element one cycle after the PE-Ts, and they also exploit a massive reuse of data. Algorithm 1 shows that, in order to compute the px and py vectors for an element, three Term values are required: the one of the corresponding element, the one of its right neighbor, and the one of the bottom neighbor. In the proposed implementation, the values of Term that are processed by the array of PE-Ts are reused, and propagated using pipelining flip-flops. For instance, in order to compute px and py for the element at position (2, 11), the Term values of the elements in (2, 11), (2, 12) and (3, 11) are required. PE-T3 calculates the Term value at (2, 11), and at the same time PE-T4 calculates the Term value for (3, 10). In the next clock cycle, PE-T3 and PE-T4 compute the Term values for (2, 12) and (3, 11), respectively. Then, PE-V3 takes the required Term values from PE-T3 and PE-T4, as well as the synchronized result of PE-T3 that was computed in the previous clock cycle, and determines the new px and py for element (2, 11), without reading any data from BRAM. Once the values of px and py have been determined, they are stored in BRAM for the following iterations.

5.4. Memory Organization

The proposed data reuse scheme reduces both the number of accesses to the BRAMs and the amount of memory required to store the intermediate results. As shown in Figure 12, the array of PE-Ts needs to read 15 vectors from BRAMs, but 28 vectors would be required if data reuse had not been implemented. We now illustrate how those BRAMs are organized. According to Figure 11, PE-Vs from 2 to 7 take the required values of Term from the two adjacent PE-Ts and from the result computed in the previous clock cycle by the PE-Ts that are on their right. Therefore, the computation of these six PE-Vs does not require any additional BRAM to store the intermediate values of Term computed by the PE-Ts. Only PE-V1 needs to load the Term values computed by PE-T7 in the previous region, which have to be stored in a BRAM block (called BRAM-Term).
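The Term dependencies just described can be captured in a few lines. This is only an illustrative sketch with hypothetical function names; the labels c_Term, r_Term and b_Term denote the current, right and bottom Term values, and the region height of 7 rows follows from the 7 PE-Ts of each array.

```python
REGION_ROWS = 7  # rows processed together by one PE array


def term_inputs(row, col):
    """Positions of the three Term values needed to update px and py at (row, col)."""
    return {
        "c_Term": (row, col),       # the element itself
        "r_Term": (row, col + 1),   # its right neighbor
        "b_Term": (row + 1, col),   # its bottom neighbor
    }


def needs_bram_term(row):
    """True when the bottom neighbor lies in the next region.

    In that case the two Term values are produced at different times, so
    the earlier one has to be buffered in BRAM-Term.
    """
    return row // REGION_ROWS != (row + 1) // REGION_ROWS


# The example from the text: element (2, 11).
print(term_inputs(2, 11))
print([row for row in range(14) if needs_bram_term(row)])
```

Only the last row of each region (rows 6, 13, ...) crosses a region boundary, which is consistent with a single BRAM-Term block per PE array being sufficient.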
For instance, in order to calculate px and py for row 6, the values of Term for rows 6 and 7 are required, but they cannot be computed in successive clock cycles because the two rows belong to two different regions (see Figure 11), and are processed by the PE array at two separate moments. Therefore, the Term values of row 6 are stored in a dual-port BRAM, and they are read back when PE-T1 computes the Term values of row 7. As a PE array uses 8 BRAMs for px, py and v, plus an additional BRAM-Term block as a bridge between two different regions, 9 BRAMs are required to process each region. The results computed by each PE-V are stored in the corresponding BRAMs according to the addressing shown in Figure 11. When the array completes a region and starts processing the following one, the address used to access the BRAMs needs to be increased by an offset of 92, and this step is performed by a vertical rotator, which is shown in Figure 10. Overall, the 8 BRAMs of each region are indexed using 1012 addresses, and 32-bit blocks of data are stored at each address. The 32 bits encode v, which requires 13 bits, followed by c_px and c_py, which require 9 bits each. After the PE-Vs find the new values of px and py, the values in the BRAMs are updated by using the write ports of the BRAMs, overwriting the vector values that have been read in previous cycles.

5.5. Processing Elements

We finally provide a detailed description of the PE-T and PE-V processing elements. The hardware architecture of a PE-T is shown in Figure 13, and the one of a PE-V is shown in Figure 14. The implementation of a PE-T includes the Backward operations for px and py, which are performed in parallel before computing the value of the output Term. This value is then used as r_Term (right Term) for the PE-V that is processing the same row, whereas b_Term (bottom Term) can feed the PE-V that is processing the upper row. Moreover, the value of Term is pipelined for 1 clock cycle in order to use it as c_Term (current Term). The propagation schema of the different Terms (right, bottom and current) is shown in Figure 11.

Fig. 13. Hardware architecture of a PE-T

Fig. 14. Hardware architecture of a PE-V

The hardware architecture for PE-Vs implements the Forward operations between c_Term, r_Term and b_Term in parallel, and then computes the new px and py vectors. The main issue in the design of the PE-V architecture is the square root function needed to compute px and py, as shown on line 6 of Algorithm 1. An efficient and precise hardware implementation of the square root is still an open problem [Sajid et al. 2010] [Li and Chu 1997], and there are two main techniques to handle it: iterative techniques, which achieve better precision, and look-up tables, which are faster. In the proposed implementation, a look-up table was employed to focus on timing performance, while the achieved precision is still acceptable in the context of optical flow estimation. In fact, the error of the approximated square root is below 1% in more than 90% of the tested samples. The look-up table takes a 32-bit signal represented using a fixed-point notation, where the integer part takes 24 bits and the decimal part takes 8 bits. The entries of the table are 8-bit values, thus the table contains 2^8 = 256 pre-computed values, and only requires 70 LUTs to be deployed on the FPGA. Instead of dividing the input value into 4 pieces of 8 bits each, which can index 4 different tables, a technique has been designed to increase the precision while using only one table (thus saving approximately 12200 LUTs over the 28 PE-Vs). In particular, the 8 most significant bits of the input value are considered, and used to get the result from the table, discarding the remaining bits. The 8-bit block starts in an odd position (counting from left to right), and finishes in an even one: if the first non-zero bit is located in the n-th position, where n is even, then the 8-bit block will start from the zero bit at position n − 1. In this way, if the decimal value of the 8-bit block is equal to m, and if the rightmost bit of the block is in position 2k, then the number is approximated by m · 2^{2k}, and its square root is computed by accessing the table at value m, and left-shifting the output by k positions.

6. EXPERIMENTAL RESULTS

Fig. 15. Area usage on a Xilinx Virtex-5 XC5VLX110T FPGA

The proposed cone-based parallelization of the Chambolle algorithm has been fully implemented in Verilog and synthesized for a Xilinx Virtex-5 XC5VLX110T FPGA [Xilinx 2009]. Figure 15 shows the resource usage of the Chambolle core, which reaches an operating frequency of 221 MHz after place and route. If required by the target device, the number of required DSPs can be reduced by mapping part of the multiplications on the LUTs. Figure 16 shows the comparison, in terms of frames per second, between the performance achieved by the proposed approach and the ones obtained by state-of-the-art implementations. These are implemented on either CPUs or GPUs as, to the best of our knowledge, no implementation that leverages the fine-grained parallelism of FPGAs has been proposed in the literature.
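Returning briefly to the single-table square root of Section 5.5, its principle can be emulated in software as follows. This is a sketch under stated assumptions, not the actual circuit: the table entries are assumed to store the square root with 4 fractional bits so that they fit in 8 bits, bit positions are counted from the least significant bit, and the names `TABLE`, `lut_sqrt` and `sqrt_fixed_24_8` are hypothetical.

```python
import math

FRAC_BITS = 4  # assumed fractional bits of each 8-bit table entry

# 256 pre-computed entries: sqrt(m) scaled by 2**FRAC_BITS, clamped to 8 bits.
TABLE = [min(255, round(math.sqrt(m) * (1 << FRAC_BITS))) for m in range(256)]


def lut_sqrt(x):
    """Approximate sqrt(x) for an unsigned 32-bit integer x with one table.

    The 8 most significant bits are selected so that the block ends at an
    even bit position 2k; then x ~ m * 2**(2k) and sqrt(x) ~ sqrt(m) << k.
    """
    if not 0 <= x < (1 << 32):
        raise ValueError("x must fit in 32 bits")
    shift = max(x.bit_length() - 8, 0)  # number of discarded low bits
    if shift % 2:
        shift += 1                      # align the block to an even position
    m = x >> shift                      # 8-bit table index
    k = shift // 2
    return (TABLE[m] << k) / (1 << FRAC_BITS)


def sqrt_fixed_24_8(raw):
    """Square root of a 24.8 fixed-point value given as a raw 32-bit integer."""
    # value = raw / 2**8, hence sqrt(value) = sqrt(raw) / 2**4
    return lut_sqrt(raw) / 16.0


print(sqrt_fixed_24_8(100 << 8))  # fixed-point encoding of 100.0
```

When the alignment step shifts the block by one extra position, the index m keeps only 7 significant bits, so part of the error comes from the discarded bits and part from the rounding of the table entries.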
The evaluation assumes that the images to be processed are pre-loaded in the device memory, in order to focus the measurements on the Chambolle algorithm itself rather than on the transient setup. The estimated speedup achieved by the implementation proposed in this work ranges from 16.5× to 76× on images with a resolution of 512 × 512, which is the most common format found in the literature related to Chambolle. However, the advantages of the proposed parallelization approach are even more noticeable on larger images. In fact, the proposed implementation is the only one able to achieve more than 30 fps and, hence, meet the real-time constraints on 1024 × 768 images. On the contrary, most of the existing approaches work with reasonable frame rates (higher than 20 fps) only on very small images (consisting of either 128 × 128 or 256 × 256 pixels).

Fig. 16. Performance comparison, in terms of frames per second, with respect to state-of-the-art implementations

Thus, in order to perform a fair comparison and to normalize the size of the images processed by the different approaches, we compare them in Figure 17 in terms of number of mega-pixels processed per second. In this case, the speed-up obtained by the proposed design with respect to the best state-of-the-art implementations ranges from 38× to 130× (77× on average), proving that the proposed approach scales very well with the frame size.
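The normalization used for this comparison is a simple conversion from frame rate to pixel rate. The helper below is a hypothetical illustration of that arithmetic; the two calls only restate the real-time target of 30 fps at the resolutions mentioned above and are not measured results.

```python
def megapixels_per_second(width, height, fps):
    """Pixel throughput, in mega-pixels per second, of `fps` frames of width x height."""
    return width * height * fps / 1e6


# Pixel rate needed to sustain 30 fps at two of the resolutions discussed.
print(megapixels_per_second(512, 512, 30))
print(megapixels_per_second(1024, 768, 30))
```

Expressing the results in this unit removes the dependence on the image size chosen by each implementation.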

Fig. 17. Performance comparison, in terms of mega-pixels per second, with respect to state-of-the-art implementations

7. COMPARISON WITH RESPECT TO GPU IMPLEMENTATIONS

We finally discuss a possible implementation of Chambolle on GPUs, in order to show how the fine-grained configuration capabilities of FPGAs provide a better environment for the implementation of this algorithm. Comparisons between the two architectures have already been proposed in the literature, such as in [Bodily et al. 2010], proving that GPUs do not match the flexibility provided by FPGAs when custom computation