WestminsterResearch


Parallelizing the Chambolle Algorithm for Performance-Optimized Mapping on FPGA Devices

Beretta, I., Rana, V., Akin, A., Nacci, A. A., Sciuto, D. and Atienza, D.

ACM. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in the ACM Transactions on Embedded Computing Systems (TECS) 15 (3) Article No. 44.

The WestminsterResearch online digital archive at the University of Westminster aims to make the research output of the University available to a wider audience. Copyright and Moral Rights remain with the authors and/or copyright owners. Whilst further distribution of specific materials from within this archive is forbidden, you may freely distribute the URL of WestminsterResearch.

In case of abuse or copyright appearing without permission, e-mail repository@westminster.ac.uk

Parallelizing the Chambolle Algorithm for Performance Optimized Mapping on FPGA Devices

IVAN BERETTA, École Polytechnique Fédérale de Lausanne
VINCENZO RANA, Politecnico di Milano
ABDULKADIR AKIN, École Polytechnique Fédérale de Lausanne
ALESSANDRO ANTONIO NACCI, Politecnico di Milano
DONATELLA SCIUTO, Politecnico di Milano
DAVID ATIENZA, École Polytechnique Fédérale de Lausanne

The performance and the efficiency of recent computing platforms have been deeply influenced by the widespread adoption of hardware accelerators, such as Graphics Processing Units (GPUs) or Field Programmable Gate Arrays (FPGAs), which are often employed to support the tasks of General Purpose Processors (GPPs). One of the main advantages of these accelerators over their sequential counterparts (GPPs) is their ability to perform massively parallel computation. However, in order to exploit this competitive edge, it is necessary to extract the parallelism from the target algorithm to be executed, which is in general a very challenging task. This concept is demonstrated, for instance, by the poor performance achieved on relevant multimedia algorithms, such as Chambolle, which is a well-known algorithm employed for optical flow estimation. The implementations of this algorithm that can be found in the state of the art are generally based on GPUs, but barely improve the performance that can be obtained with a powerful GPP. In this paper, we propose a novel approach to extract the parallelism from computation-intensive multimedia algorithms, which includes an analysis of their dependency schema and an assessment of their data reuse. We then perform a thorough analysis of the Chambolle algorithm, providing a formal proof of its inner data dependencies and locality properties. Then, we exploit the considerations drawn from this analysis by proposing an architectural template that takes advantage of the fine-grained parallelism of FPGA devices.
Moreover, since the proposed template can be instantiated with different parameters, we also propose a design metric, the expansion rate, to help the designer estimate the efficiency and performance of the different instances, making it possible to select the right one before the implementation phase. Finally, we show, by means of experimental results, how the proposed analysis and parallelization approach leads to the design of efficient and high-performance FPGA-based implementations that are orders of magnitude faster than the state-of-the-art ones.

Categories and Subject Descriptors: B.6.1 [Design Styles]: Parallel circuits; I.4.8 [Scene Analysis]: Motion; C.1.3 [Other Architecture Styles]: Data-flow architectures; B.8.2 [Performance Analysis and Design Aids]

General Terms: Design, Algorithms, Performance

Additional Key Words and Phrases: Chambolle, Optical flow, TV-L1, Field Programmable Gate Arrays, Parallel Architectures, Custom Hardware

ACM Transactions on Embedded Computing Systems, Vol. 15, No. 3, Article 39, Publication date: March 2016.

1. INTRODUCTION

Heterogeneous and specialized computation is forecast to grow steadily over the next years, and to establish itself as one of the main paradigms for embedded systems design [Cordes et al. 2013]. The employment of special-purpose cores to perform a complex functionality within a System-on-Chip (SoC) is motivated by higher performance and lower power consumption with respect to an equivalent execution on a general-purpose processing unit. Furthermore, in certain domains such as multimedia processing, these specialized cores perform tasks that are sufficiently general to guarantee a good reusability in a wide range of systems. For example, specialized cores can be used to accelerate common operations such as convolution filters [Jamro and Wiatr 2001] or the Jacobi operator [Sleijpen and Vorst 2000]. The design of special-purpose hardware modules traditionally aims at optimizing their computational efficiency, while meeting predefined area requirements that may be imposed when the core is part of a more complex multi-core SoC. To achieve the target performance, application-specific accelerators can be implemented on different cutting-edge platforms, such as Graphics Processing Units (GPUs) or Field Programmable Gate Arrays (FPGAs). However, even though GPUs are faster than FPGAs, they show a rigid structure designed for single instruction multiple data processing, hence they are not a good choice when dealing with algorithms with very complex data dependencies among iterations [Bodily et al. 2010]. FPGAs, on the other hand, provide a fully customizable platform where any kind of custom operation, either complex or very simple, can be implemented in hardware and applied on multiple blocks of data in parallel. Unfortunately, the design of complex and custom FPGA systems is a very challenging task, and tools to guide the designer in the definition of such architectures are still not mature.

Representative examples of important computation-intensive algorithms that greatly benefit from parallelization and performance optimization can be found in the field of multimedia processing ([Jia et al. 2013], [Ali et al. 2014]). Several researchers have devoted their efforts to some of these algorithms in recent years ([Che et al. 2012], [Ghodhbani et al. 2014]). In this paper we focus our attention on Chambolle [Chambolle 2004], which is a relevant algorithm belonging to this class and for which a high-performance parallel implementation has not yet been proposed, as we show in the analysis of the state-of-the-art approaches presented in Section 2. The Chambolle algorithm is a well-known and widely employed algorithm in such fields as motion estimation and compensation, or rolling shutter correction (see Section 2 for more details).
However, even though this algorithm is used in many applications (e.g., the TV-L1 optical flow estimation described in Section 2), no parallel and efficient implementation has been proposed so far; in fact, even the best performing implementations on GPUs are essentially sequential, and they do not achieve real-time frame rates with high resolution images [Zach et al. 2007]. This lack of performance is mainly due to the complex data dependency schemas that usually characterize this kind of algorithm. In addition to the lack of efficient GPU and multi-core implementations, no hardware implementation methodology exists to exploit the high amount of resources available on the latest programmable devices, such as FPGAs. For these reasons we believe that the Chambolle algorithm can be considered as a cornerstone for many multimedia systems that deal with challenging problems (such as the optical flow estimation [Behbahani et al. 2007]) and for which efficient implementations have not yet been found, mainly because of their complex data dependencies.

This work builds upon the Chambolle implementation we first outlined in [Akin et al. 2011], complementing it with a more detailed algorithm analysis, as well as a deep design space exploration. Specifically, we propose a breakdown of the Chambolle kernel, formally proving its dependency pattern and its locality. We then define a novel algorithmic-level metric to drive the design space exploration of iterative algorithms, which we named expansion rate. The metric makes it possible to estimate implementation aspects, such as the impact of memory transfers, as a function of the geometry of the algorithm. Finally, we extend the design space exploration to other platforms, specifically to GPUs.

The remainder of this paper is structured as follows. In Section 3, we provide a detailed analysis of the Chambolle algorithm, focusing on its main characteristics and properties. Then, we describe the proposed design strategy to efficiently tackle its complexity, parallelizing its computation in order to drastically improve its performance (Section 4).
After showing the proposed architectural template, we introduce the concept of expansion rate, another relevant contribution of this work. Section 5 reports the design space exploration for the Chambolle algorithm, and presents the implementation aspects of the proposed hardware implementation. Section 6 then describes the experimental results proving that the proposed parallelization of the Chambolle algorithm is considerably faster than the solutions found in the literature. These approaches are mainly based on GPU acceleration and do not completely exploit the implicit fine-grained parallelism of this kind of multimedia algorithm. Section 7 shows how the proposed approach, based on a finer parallelization of the input algorithm and targeting FPGA devices, is able to drastically increase the degree of parallelism that can be extracted from the algorithm, and to exploit it to increase the efficiency and the performance of the computing architecture. Finally, Section 8 concludes the paper by drawing some final considerations.

2. STATE OF THE ART

The optical flow is a vector field representing the movement of an object in a sequence of frames, and it can be determined by analyzing the variation of the brightness inside a sequence of successive images [Verri and Poggio 1989]. The estimation of this vector field is one of the most important problems in image and video processing, as it can be employed for motion estimation [Su et al. 2000] and compensation [Li et al. 1997], as well as in other fields such as robotics [Kim et al. 2007] and even medical analysis [Behbahani et al. 2007]. Another important application of the optical flow is the correction of an image acquired by CMOS optical sensors using the rolling shutter technique [Baker et al. 2010], which is nowadays used in most low-end photo cameras. In particular, rolling shutter is a method of image acquisition in which each frame is recorded by scanning across the frame either vertically or horizontally, which may generate errors and distortions in the final image. The optical flow estimation is a computationally challenging problem [Behbahani et al. 2007] because of the large number of movements that can be detected in a frame, and because of the noise that can alter the image brightness.
A wide range of different techniques, such as [Horn and Schunck 1981] [Black and Anandan 1993] [Papenberg et al. 2006], has been proposed in the past, but variational methods [Aubert et al. 1999], i.e., algorithms based on the minimization of a quantity known as total variation [Rudin et al. 1992], have emerged as one of the most successful approaches in recent years. The variational technique we consider in this work is called TV-L1 [Pock et al. 2007], which distinguishes itself from other approaches because it can handle highly varying intensities in the frames. The TV-L1 method includes both a mathematical definition of the variational problem, and a numerical scheme to compute the solution. The numerical scheme is based on a fixed-point algorithm originally proposed by Antonin Chambolle [Chambolle 2004], which iteratively refines the solution (which in this case represents the optical flow estimation) at different levels of precision. Though TV-L1 seems to be very promising from a theoretical point of view, its implementations fail to reach real-time performance (i.e., to process at least 30 frames per second), except for very small images. A multithreaded software implementation of TV-L1 that has been developed and analyzed at EPFL, for example, can take more than 15 seconds to process just one frame on a standard x86 workstation, and up to 50 seconds are required on the ARM processor of an Apple iPhone 3GS. The profiling of the estimation of the TV-L1 optical flow on both platforms shows that the Chambolle algorithm itself is the bottleneck that generates the poor timing performance. In fact, besides the execution of an outermost loop which does not require any complex matrix operation, approximately 90% of the execution time is spent on the Chambolle iterative technique, which proves to be the most critical and computationally intensive part. However, all the implementations of the Chambolle algorithm that can be found in the literature fail to achieve real-time frame rates with high resolution images [Zach et al. 2007].
Furthermore, to the best of our knowledge, a parallel implementation of this approach has never been proposed because of the complex dependencies among the intermediate results [Akin et al. 2011]. In [Pock et al. 2007] and [Zach et al. 2007], the robust TV-L1 technique to calculate the optical flow between two frames is proposed and implemented using modern GPUs. The authors proved that a real-time frame rate can be achieved by the most powerful devices for low-resolution sequences, but only very few frames that are larger than [...] can be processed in one second. A Matlab implementation of the technique in [Zach et al. 2007] requires from 5 to 6 seconds to complete the estimation of the optical flow on a high-end workstation, and it also shows some limitations in terms of memory usage. Additional hardware results of the estimation of the TV-L1 optical flow on GPUs can also be found in [Weishaupt et al. 2010], but even the fastest implementation cannot top a rate of 6 frames per second, even on [...] images. A full summary of the performance of the aforementioned state-of-the-art implementations of Chambolle is reported in Section 6, as a reference to evaluate the solutions proposed in this paper.

Fast estimations of the optical flow can be achieved by using different techniques and by simplifying the working domain. For example, the implementation proposed in [Abutaleb et al. 2009] can process up to 156 fps on [...] images, working on a low-cost FPGA device. However, the resulting optical flow is specifically suited for motion detection, and it cannot be used in other applications such as rolling shutter correction. The specific target allows the authors to filter the input frames, and in particular to apply background subtraction, which heavily reduces the amount of data to be processed for the optical flow estimation.

3. CHAMBOLLE ALGORITHM ANALYSIS

This section presents the analysis we have performed on the Chambolle algorithm, describing the structure of its dependency schema (Section 3.2) and providing a formal proof of its locality (Section 3.3).
The notation used in this section is a minor modification of the one used in [Chambolle 2004], and requires a few basic concepts that are described in Section 3.1. Finally, Section 3.4 presents a simplified pseudo-code formulation of the Chambolle algorithm.

3.1. Preliminary Definitions

In the context of multimedia processing, the input of the Chambolle algorithm is represented as a rectangular matrix of length $L$ and width $W$, which represents a picture of $L \times W$ pixels. Let $X$ be defined as the Euclidean space $X = \mathbb{R}^{L \times W}$, and let $Y$ be the Cartesian product $Y = X \times X$. Finally, let us recall the definition of the Euclidean norm $|\cdot|$ over $\mathbb{R}^2$, which is defined as $|y| = \sqrt{y_1^2 + y_2^2}$ for any point $y = (y_1, y_2) \in \mathbb{R}^2$. It is now possible to introduce the two main operators that are used in the formulation of the Chambolle algorithm: the discrete gradient and divergence operators. Given an element $x \in X$, the discrete gradient $\nabla x \in Y$ is defined as:

$$(\nabla x)_{i,j} = \left( (\nabla x)^{(1)}_{i,j},\ (\nabla x)^{(2)}_{i,j} \right) \tag{1}$$

where:

$$(\nabla x)^{(1)}_{i,j} = \begin{cases} x_{i+1,j} - x_{i,j}, & \text{if } i < L \\ 0, & \text{if } i = L \end{cases} \qquad (\nabla x)^{(2)}_{i,j} = \begin{cases} x_{i,j+1} - x_{i,j}, & \text{if } j < W \\ 0, & \text{if } j = W \end{cases} \tag{2}$$

for $i = 1, \ldots, L$ and $j = 1, \ldots, W$. The cases $i = L$ and $j = W$ are considered separately, as they refer to pixels that lie on the boundaries of the matrix.
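As an illustration of equations (1) and (2), the following Python sketch evaluates the discrete gradient of a small matrix stored as a list of rows. The function name `discrete_gradient` and the 0-based indexing are assumptions made for this example only; it is not the implementation evaluated in this paper.

```python
def discrete_gradient(x):
    """Forward-difference gradient of equations (1)-(2).

    x is a list of L rows with W values each. Indices are 0-based here,
    so the boundary cases i = L and j = W become i == L - 1 and j == W - 1.
    """
    L, W = len(x), len(x[0])
    # First component: difference along the first index, zero on the last row.
    g1 = [[x[i + 1][j] - x[i][j] if i < L - 1 else 0 for j in range(W)]
          for i in range(L)]
    # Second component: difference along the second index, zero on the last column.
    g2 = [[x[i][j + 1] - x[i][j] if j < W - 1 else 0 for j in range(W)]
          for i in range(L)]
    return g1, g2


x = [[1, 2, 4],
     [3, 5, 9]]
g1, g2 = discrete_gradient(x)
print(g1)  # [[2, 3, 5], [0, 0, 0]]
print(g2)  # [[1, 2, 0], [2, 4, 0]]
```

The zero rows and columns in the output correspond to the boundary cases that equation (2) treats separately.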

The discrete divergence operator takes an element $p \in Y$ as an operand, and returns the value $\operatorname{div} p \in X$ defined as:

$$(\operatorname{div} p)_{i,j} = \begin{cases} p^{(1)}_{i,j} - p^{(1)}_{i-1,j}, & \text{if } 1 < i < L \\ p^{(1)}_{i,j}, & \text{if } i = 1 \\ -p^{(1)}_{i-1,j}, & \text{if } i = L \end{cases} \; + \; \begin{cases} p^{(2)}_{i,j} - p^{(2)}_{i,j-1}, & \text{if } 1 < j < W \\ p^{(2)}_{i,j}, & \text{if } j = 1 \\ -p^{(2)}_{i,j-1}, & \text{if } j = W \end{cases} \tag{3}$$

As discussed in the previous sections, the Chambolle algorithm aims at minimizing a quantity known as total variation [Rudin et al. 1992]. With the concepts defined in this subsection, it is now possible to formalize this metric. Given $g \in X$ and $\theta > 0$, the minimization of the total variation can be formulated as follows:

$$\min_{x \in X} \ \frac{\|x - g\|^2}{2\theta} + \sum_{1 \le i \le L,\ 1 \le j \le W} |(\nabla x)_{i,j}| \tag{4}$$

As shown in [Chambolle 2004], the minimization problem has a closed-form solution whose analytical equation is known, but its numerical estimation is not straightforward. In order to find a solution numerically, the problem must be expressed in the following form:

$$\min \left\{ \|\theta \operatorname{div} p - g\|^2 \ :\ p \in Y,\ |p_{i,j}|^2 \le 1,\ i = 1, \ldots, L,\ j = 1, \ldots, W \right\} \tag{5}$$

This formulation can be numerically approached using a recursive technique known as semi-implicit gradient descent [Chambolle 2004], which is the core part of the Chambolle algorithm. In particular, for any $n \ge 0$, where $n$ counts the iterations (or levels), an element $p \in Y$ is recursively adjusted as follows:

$$p^{(n+1)}_{i,j} = \frac{p^{(n)}_{i,j} + \tau (\nabla \Phi^{(n)})_{i,j}}{1 + \tau\, |(\nabla \Phi^{(n)})_{i,j}|}, \qquad \Phi^{(n)} = \operatorname{div} p^{(n)} - \frac{g}{\theta} \tag{6}$$

where $\tau > 0$ is a fixed value (in general it is equal to 1/4 to guarantee the convergence of the algorithm [Chambolle 2004]), and $p^{(0)} = 0$ by definition. The matrix $\Phi^{(n)} \in X$ is defined in order to keep the notation compact.

3.2. Dependency Schema

According to equation (6), the solution of the Chambolle algorithm recursively depends on previous values (for example, there is an explicit dependency between $p^{(n+1)}_{i,j}$ and $p^{(n)}_{i,j}$), which may prevent a parallelized implementation because a large amount of data might be required to compute the value of $p^{(n+1)}_{i,j}$.
The goal of this section is to unroll the dependencies included in equation (6), and derive the full shape of the stencil. For the sake of illustration, the points on the boundaries of the matrices are omitted, therefore indices $i$ and $j$ are always strictly greater than 1, and strictly lower than $L$ and $W$, respectively. In fact, boundary values are only a special case of the proposed analysis, and they can be easily handled by substituting the corresponding values from equations (2) and (3). In equation (6), the denominator is a scalar quantity, whereas both terms in the numerator belong to $Y = X \times X$. As a consequence, $p^{(n+1)} \in Y$, thus each of its elements can be written as:

$$p^{(n+1)}_{i,j} = \left( px^{(n+1)}_{i,j},\ py^{(n+1)}_{i,j} \right) \tag{7}$$

where both $px^{(n+1)}$ and $py^{(n+1)}$ are $L \times W$ matrices computed at level $n + 1$. The term $(\nabla \Phi^{(n)})_{i,j}$ can then be unrolled according to equations (1) and (2), remembering that the point $(i, j)$ is not on the boundaries of the matrix, and obtaining:

$$(\nabla \Phi^{(n)})_{i,j} = \left( (\nabla \Phi^{(n)})^{1}_{i,j},\ (\nabla \Phi^{(n)})^{2}_{i,j} \right) = \left( \Phi^{(n)}_{i+1,j} - \Phi^{(n)}_{i,j},\ \Phi^{(n)}_{i,j+1} - \Phi^{(n)}_{i,j} \right) \tag{8}$$

By substituting this result in equation (6), and by considering the decomposition of $p^{(n+1)}_{i,j}$ shown in (7), two separate equations for $px^{(n+1)}_{i,j}$ and $py^{(n+1)}_{i,j}$ can be written:

$$px^{(n+1)}_{i,j} = \frac{px^{(n)}_{i,j} + \tau \left( \Phi^{(n)}_{i+1,j} - \Phi^{(n)}_{i,j} \right)}{1 + \tau\, |(\nabla \Phi^{(n)})_{i,j}|} \tag{9}$$

$$py^{(n+1)}_{i,j} = \frac{py^{(n)}_{i,j} + \tau \left( \Phi^{(n)}_{i,j+1} - \Phi^{(n)}_{i,j} \right)}{1 + \tau\, |(\nabla \Phi^{(n)})_{i,j}|} \tag{10}$$

Finally, $\Phi^{(n)}$ should be expressed as a function of $px^{(n)}$ and $py^{(n)}$. This can be achieved by computing the $\operatorname{div} p^{(n)}$ term according to equation (3):

$$(\operatorname{div} p^{(n)})_{i,j} = px^{(n)}_{i,j} - px^{(n)}_{i-1,j} + py^{(n)}_{i,j} - py^{(n)}_{i,j-1} \tag{11}$$

and thus getting that an element $\Phi^{(n)}_{i,j}$ can be expressed as:

$$\Phi^{(n)}_{i,j} = \left( \operatorname{div} p^{(n)} - \frac{g}{\theta} \right)_{i,j} = px^{(n)}_{i,j} - px^{(n)}_{i-1,j} + py^{(n)}_{i,j} - py^{(n)}_{i,j-1} - \frac{g_{i,j}}{\theta} \tag{12}$$

The resulting value is substituted into equations (9) and (10) in order to show the dependency between $px^{(n+1)}$ and $py^{(n+1)}$ and some points in $px^{(n)}$ and $py^{(n)}$, i.e., points referring to the previous iteration. In particular, the resulting equations are:

$$px^{(n+1)}_{i,j} = \frac{px^{(n)}_{i,j} + \tau \left[ px^{(n)}_{i+1,j} - 2 px^{(n)}_{i,j} + px^{(n)}_{i-1,j} \right] + \tau \left[ py^{(n)}_{i+1,j} - py^{(n)}_{i,j} + py^{(n)}_{i,j-1} - py^{(n)}_{i+1,j-1} + \dfrac{g_{i,j} - g_{i+1,j}}{\theta} \right]}{1 + \tau\, |(\nabla \Phi^{(n)})_{i,j}|} \tag{13}$$

$$py^{(n+1)}_{i,j} = \frac{py^{(n)}_{i,j} + \tau \left[ px^{(n)}_{i,j+1} - px^{(n)}_{i,j} + px^{(n)}_{i-1,j} - px^{(n)}_{i-1,j+1} \right] + \tau \left[ py^{(n)}_{i,j+1} - 2 py^{(n)}_{i,j} + py^{(n)}_{i,j-1} + \dfrac{g_{i,j} - g_{i,j+1}}{\theta} \right]}{1 + \tau\, |(\nabla \Phi^{(n)})_{i,j}|} \tag{14}$$

A visual representation of the dependencies extracted from equations (13) and (14) is shown in Figure 1(a), where all the intermediate matrices $px^{(n+1)}$, $py^{(n+1)}$, $px^{(n)}$, $py^{(n)}$ and $\Phi^{(n)}$ are illustrated. However, since $px^{(n)}$ and $py^{(n)}$ are only known if the element $p = (px, py)$ is known, it is possible to use a more compact representation that only considers $p^{(n+1)}$ and $p^{(n)}$, thus obtaining the schema in Figure 1(b).
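The dependency pattern derived above can also be checked numerically. The Python sketch below evaluates equations (9) to (12) at one interior point, perturbs each neighbouring element of $p^{(n)}$ in turn, and records which perturbations change the result. The helper names (`phi`, `update`, `stencil`), the random test data and the parameter values are assumptions made for this example; it is a sanity check of the stencil shape, not the authors' implementation.

```python
import math
import random


def phi(px, py, g, theta, i, j):
    """Phi = div p - g / theta at an interior point, equations (11)-(12)."""
    div = px[i][j] - px[i - 1][j] + py[i][j] - py[i][j - 1]
    return div - g[i][j] / theta


def update(px, py, g, theta, tau, i, j):
    """p^(n+1) at one interior point (i, j), equations (9)-(10)."""
    c = phi(px, py, g, theta, i, j)
    d1 = phi(px, py, g, theta, i + 1, j) - c
    d2 = phi(px, py, g, theta, i, j + 1) - c
    den = 1.0 + tau * math.hypot(d1, d2)
    return ((px[i][j] + tau * d1) / den, (py[i][j] + tau * d2) / den)


def stencil(size=7, theta=0.25, tau=0.25, seed=0):
    """Offsets (di, dj) at which changing p^(n) changes p^(n+1) at the centre."""
    rng = random.Random(seed)

    def rand():
        return [[rng.uniform(-1, 1) for _ in range(size)] for _ in range(size)]

    px, py, g = rand(), rand(), rand()
    c = size // 2
    ref = update(px, py, g, theta, tau, c, c)
    deps = set()
    for di in range(-2, 3):
        for dj in range(-2, 3):
            for field in (px, py):
                old = field[c + di][c + dj]
                field[c + di][c + dj] = old + 1.0   # perturb one element
                if update(px, py, g, theta, tau, c, c) != ref:
                    deps.add((di, dj))
                field[c + di][c + dj] = old         # restore it exactly
    return deps


# Offsets read by equations (13) and (14): the centre, its four direct
# neighbours and two of the four diagonal neighbours.
expected = {(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1), (1, -1), (-1, 1)}
print(stencil() == expected)
```

The two diagonal offsets missing from the set, (-1, -1) and (1, 1), are the corner elements that the locality analysis excludes from the ring of dependencies.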
Since Figure 1(b) depicts the dependencies between two consecutive iterations, it also graphically illustrates the shape of the stencil applied by the Chambolle algorithm.

Fig. 1. Graphical representation of the stencil shape of the Chambolle algorithm: (a) dependencies among matrices $px^{(n+1)}$, $py^{(n+1)}$, $\Phi^{(n)}$, $px^{(n)}$ and $py^{(n)}$; (b) simplified representation of the dependencies between $p^{(n+1)}$ and $p^{(n)}$.

3.3. Locality of the Algorithm

The stencil shown in Figure 1(b) can be generalized in two ways. First, it is possible to identify the dependencies when more than one element of the matrix has to be computed, as for example a sub-matrix of $p^{(n+1)}$ of size $l \times w$. Figure 2(a) shows the dependency schema when a $2 \times 1$ and a $2 \times 2$ sub-matrix are computed at level $n + 1$. Second, it is possible to increase the number of levels beyond $n + 1$, as shown in Figure 2(b) for level $n + 2$. In general, a sub-matrix of size $l \times w$ at level $n + 1$ depends on the same $l \times w$ pixels at level $n$, but it also requires a ring of additional elements at level $n$ that surrounds the sub-matrix. In the example with a $2 \times 2$ sub-matrix shown in Figure 2(a), the goal is to compute 4 points at level $n + 1$, which can be achieved starting from the same points at level $n$, and including a ring of 10 elements at level $n$ that surrounds the sub-matrix (notice that the pixels in the upper-left and in the lower-right corners are not required). Similarly, if more levels are considered at once, the elements of the ring require additional surrounding points, thus leading to a dependency schema composed of concentric rings of growing size, as shown in Figure 2(b).

Fig. 2. Generalization of the dependencies among the points in matrix $p$: (a) dependencies for the computation of 2 and 4 points of $p^{(n)}$; (b) dependencies for the computation of multiple levels ($n$ to $n + 2$).

Given the regularity of the dependency schema, it is possible to estimate the number of points that are required to compute a generic sub-matrix at an arbitrary level. Let $\Omega(l, w, N)$ be the number of elements needed to calculate a sub-matrix of size $l \times w$ (with $1 \le l \le L$ and $1 \le w \le W$) at a level $N \ge 2$. It can be observed that the case $N = 1$ is trivial, as no recursion is necessary to get the result. In addition, if a point at level $N$ has to be computed, all the values from level $N - 1$ to level 1 must be known, so that the recursion of equation (6) will terminate. In the case of Chambolle, the value of $\Omega(l, w, N)$ can be computed as follows:

$$\Omega(l, w, N) = \sum_{k=1}^{N-1} \left[ (l + 2k)(w + 2k) - 2 \sum_{h=1}^{k} h \right] \tag{15}$$

The outermost summation considers all the levels $N - k$, and computes the number of points that are required at that level. At each level, both the length and the width of the surrounding ring enlarge by two points, an effect that is captured by the $(l + 2k)(w + 2k)$ term. The innermost summation corrects the estimation by removing a level-dependent number of points from the upper-left and the lower-right corners of the ring, which are not required at that level. For example, let us consider the computation of a $2 \times 2$ sub-matrix at level $N = 3$, which is the same schema shown in Figure 2(b) when $n = 1$. For $k = 1$, level $N - k = 2$ is considered, and the number of points that are required is equal to $(2 + 2 \cdot 1)(2 + 2 \cdot 1) - 2 \cdot 1 = 14$. At $k = 2$, level $N - k = 1$ is considered, and a total of $(2 + 2 \cdot 2)(2 + 2 \cdot 2) - 2 \cdot 3 = 30$ points are needed. Overall, $14 + 30 = 44$ points are required to compute a $2 \times 2$ sub-matrix at level 3. The value of $\Omega(l, w, N)$ can be used to compute the static expansion rate metric of Chambolle that will be introduced in Section 4. It is also important to remark that, in general, $\Omega(l, w, N)$ can be considered as an upper bound of the total number of pixels, because some of the points may be located on the boundaries of the matrix, so they depend on a smaller number of neighbors. Conversely, $\Omega(l, w, N)$ is an exact estimation when the points are not located on the matrix borders. In both cases, the fact that the number of required neighbors is bounded by $\Omega(l, w, N)$ ensures that this computation can be performed locally.

3.4. A Simplified Pseudo-Code Formulation of Chambolle

In the previous subsections, the locality of the Chambolle algorithm and its dependency schema have been analyzed starting from its mathematical formulation. For the sake of clarity, a simpler pseudo-code formulation of the algorithm is now introduced. The pseudo-code form was first proposed in [Zach et al. 2007], and it introduces a set of high-level macro-operations that are better suited for hardware design, while preserving the same dependencies underlined in Figure 2. In the pseudo-code formulation, the optical flow between the two input frames $I_0$ and $I_1$, both expressed in matrix form, is represented by a bi-dimensional vector $u = (u_1, u_2)$, which is the output of the Chambolle algorithm. The vector $u$ is initialized at 0, and its final value is computed by means of an iterative sequence of levels, as discussed in the previous subsections. At each level, a support variable $v = (v_1, v_2)$ is defined using a thresholding function of $I_1$ and of the value of $u$ computed at the previous level [Zach et al. 2007].

Algorithm 1 Chambolle Algorithm
 1: for $i = 1, \ldots, N_{iterations}$ do
 2:     $div\_p = Backward_X(px_{u_1}) + Backward_Y(py_{u_1})$
 3:     $Term = div\_p - v_1 / \theta$
 4:     $Term_1 = Forward_X(Term)$
 5:     $Term_2 = Forward_Y(Term)$
 6:     $|\nabla u_1| = \sqrt{Term_1^2 + Term_2^2}$
 7:     $px_{u_1} = [px_{u_1} + \tau/\theta \cdot Term_1] \,/\, [1 + \tau/\theta \cdot |\nabla u_1|]$
 8:     $py_{u_1} = [py_{u_1} + \tau/\theta \cdot Term_2] \,/\, [1 + \tau/\theta \cdot |\nabla u_1|]$
 9:     $u_1 = v_1 - \theta \cdot div\_p$
10: end for

Then, the value of $u$ at the current level is determined using the iterative steps of the Chambolle algorithm, which are reported in Algorithm 1. For the sake of simplicity, the pseudo-code only shows the computation of $u_1$, but $u_2$ is computed in the same way, by simply substituting $u_1$ and $v_1$ with $u_2$ and $v_2$. The vector $u$ is updated by means of two intermediate values, namely $px = (px_{u_1}, px_{u_2})$ and $py = (py_{u_1}, py_{u_2})$, which are initialized at 0 [Zach et al. 2007]. In order to simplify the description, the auxiliary variables $Term$, $Term_1$, and $Term_2$ are also introduced to store the intermediate results of the computation (lines 3-5). The $Backward_X(z)$ function returns a matrix in which the left neighbor is subtracted from each element of $z$, whereas $Backward_Y$ subtracts the upper neighbor. Similarly, $Forward_X$ subtracts each element from its right neighbor, and $Forward_Y$ subtracts it from its lower neighbor, consistently with equations (2) and (3). It is worth noting that, according to the way they are invoked in Algorithm 1, these four functions generate the same stencil shape illustrated in Figure 2. Finally, the constants $\theta$ and $\tau$ are the same values that are used in the mathematical formulation of Chambolle, and determine the precision of the algorithm.
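To make the data flow of Algorithm 1 concrete, the following Python sketch transcribes its ten lines using plain lists of rows. It is only an illustrative software model under stated assumptions: the helper names `forward` and `backward`, the choice of columns as the X direction and rows as the Y direction, and the boundary handling taken from equations (2) and (3) are not specified by the pseudo-code itself, and line 6 is read as the Euclidean norm of ($Term_1$, $Term_2$).

```python
import math


def forward(z, dr, dc):
    """Neighbour at offset (dr, dc) minus the element; 0 on the last row/column."""
    R, C = len(z), len(z[0])
    return [[z[r + dr][c + dc] - z[r][c] if r + dr < R and c + dc < C else 0.0
             for c in range(C)] for r in range(R)]


def backward(z, dr, dc):
    """Element minus its neighbour at offset (-dr, -dc), boundary cases as in eq. (3)."""
    R, C = len(z), len(z[0])
    out = []
    for r in range(R):
        row = []
        for c in range(C):
            first = (r == 0) if dr else (c == 0)
            last = (r == R - 1) if dr else (c == C - 1)
            cur = 0.0 if last else z[r][c]
            prev = 0.0 if first else z[r - dr][c - dc]
            row.append(cur - prev)
        out.append(row)
    return out


def chambolle_u1(v1, theta, tau, n_iterations):
    """Software model of Algorithm 1 for the u1 component."""
    R, C = len(v1), len(v1[0])
    px = [[0.0] * C for _ in range(R)]
    py = [[0.0] * C for _ in range(R)]
    u1 = [row[:] for row in v1]
    step = tau / theta
    for _ in range(n_iterations):
        bx, by = backward(px, 0, 1), backward(py, 1, 0)            # line 2
        div_p = [[bx[r][c] + by[r][c] for c in range(C)] for r in range(R)]
        term = [[div_p[r][c] - v1[r][c] / theta for c in range(C)]
                for r in range(R)]                                 # line 3
        term1, term2 = forward(term, 0, 1), forward(term, 1, 0)    # lines 4-5
        for r in range(R):
            for c in range(C):
                norm = math.hypot(term1[r][c], term2[r][c])        # line 6
                den = 1.0 + step * norm
                px[r][c] = (px[r][c] + step * term1[r][c]) / den   # line 7
                py[r][c] = (py[r][c] + step * term2[r][c]) / den   # line 8
        u1 = [[v1[r][c] - theta * div_p[r][c] for c in range(C)]
              for r in range(R)]                                   # line 9
    return u1


print(chambolle_u1([[0.0, 1.0]], theta=1.0, tau=0.25, n_iterations=2))  # [[0.2, 0.8]]
```

In this tiny example the step between the two pixels shrinks from 1 to 0.6 after two iterations, which is the smoothing effect expected from a total variation minimization. Every value written inside the loop depends only on values produced by the previous iteration, which is the property exploited by the design strategy of Section 4.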

4. THE PROPOSED DESIGN STRATEGY

The analysis described in the previous section shows that the Chambolle algorithm is characterized by the following properties:

(1) no read-after-write (RAW) conflicts exist within a single iteration, as shown by the pseudo-code presented in Algorithm 1. This means that the computation of an element at iteration $i + 1$ cannot depend on the value of another element at iteration $i + 1$, but only on previously generated elements, i.e., those computed at iteration $i$;
(2) as shown in Section 3.3, which describes the locality of the Chambolle algorithm, the set of elements required to compute an element at iteration $i + 1$ is a small subset of the frame $f_i$ produced at the $i$-th iteration, and these elements are spatially close to the element $p$ that has to be computed;
(3) finally, the analysis of the dependency schema of the Chambolle algorithm performed in Section 3.2 shows that, given two target elements that are separated by a translation, the corresponding dependency schemas have the same shape, but they are translated by the same distance as the target element.

By exploiting these features, we have been able to propose an efficient architecture that serves as a template for the high-performance and parallel implementation of the Chambolle algorithm, as described in Section 4.1. Since the template has to be tailored to the specific needs of the designer, for instance to explore the resource-performance trade-offs, we introduce in Section 4.2 a set of metrics that can be used by the designer to tune the different architectural parameters of the proposed template.

4.1. Proposed Architectural Template

The proposed architectural template is based on a computational structure that is different from the straightforward one-entire-frame-at-a-time approach. In fact, it aims at directly computing a portion of the results of an arbitrary iteration, by loading and processing only the elements that are required to produce the output, according to the dependency schema of the algorithm.
The set of elements produced as output is typically a subset of the elements that are processed as input because of data dependencies, therefore the core that performs such a multi-iteration computation can be seen as a cone (see Figure 3).

Fig. 3. 3D representation of a generic computational cone spanning 2 iterations (layers: iteration N-2, iteration N-1, iteration N).

The knowledge of the data dependencies makes it possible to express the result of the $(i + m)$-th iteration as a function of (part of) the elements computed at the $i$-th iteration. As a consequence, given the data available from the $i$-th iteration, instead of trying to compute the whole $f_{i+1}$, the proposed approach focuses on a subset of the matrix elements and directly computes the results of a generic $m$-th subsequent iteration (with $m \ge 1$), thus obtaining a subset of $f_{i+m}$. The resulting computational cone has a depth equal to $m$.
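The size of such a cone can be estimated from the locality analysis of Section 3.3. The Python sketch below evaluates the summand of equation (15) and the total $\Omega(l, w, N)$, and reproduces the worked example given there. Reading the input footprint of a cone of depth $m$ as the $k = m$ term of equation (15) is an assumption made for this example, and the function names are hypothetical.

```python
def level_points(l, w, k):
    """Points needed k levels below an l x w output window (summand of eq. 15)."""
    corners = 2 * sum(range(1, k + 1))   # elements removed from the two unused corners
    return (l + 2 * k) * (w + 2 * k) - corners


def omega(l, w, N):
    """Omega(l, w, N) of equation (15): points needed over levels N-1 down to 1."""
    return sum(level_points(l, w, k) for k in range(1, N))


print(level_points(2, 2, 1))  # 14
print(level_points(2, 2, 2))  # 30
print(omega(2, 2, 3))         # 44
print(level_points(4, 4, 4))  # 124
```

Under this reading, a cone of depth 4 that produces a 4 x 4 output window would consume 124 input elements, against 16 produced, which gives a first impression of how quickly the base of a cone grows with its depth.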

In order to obtain the entire output frame $f_{i+m}$, multiple executions of the computational cones may be required. The proposed architectural template is defined as a combination of multiple levels of cones of different depths, which are able to compute the result of multiple iterations of the elementary transformation $t$. An instance of the proposed template is shown in Figure 4, and it works as follows: a small subset (window) of the input data, which is stored in the off-chip memory, is transferred to the on-chip memory to feed the cones of the first level of the architecture. In the example shown in Figure 4, the first level is composed of four cones: A, B, C and D. The output of each level is then used as input for the subsequent level, until all the necessary iterations are performed. The output of the last level (Level 3 in the example in Figure 4) is finally stored back into the off-chip memory, and the whole process starts over on a different window of the input data, until all the matrices have been computed. This technique, which makes it possible to span the input matrix in order to progressively produce the output, is called sliding window.

Fig. 4. An instance of the proposed cone-based architectural template (labels: INPUT; Level 1 with cones A, B, C and D, iterations 1 to 2; Level 2 with cones E, F and G, iterations 3 to 6; Level 3 with cone H, iterations 7 to 10; OUTPUT (4x4)).
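As a rough illustration of how data shrinks from one level to the next, the Python sketch below computes the bounding box of the data entering each level of an instance like the one in Figure 4, working backwards from the output window. It assumes that every iteration widens the window by one element on each side and ignores the unused corner elements, so the sizes are upper bounds; the function name and this simplification are assumptions made for the example, not part of the proposed template.

```python
def level_windows(out_l, out_w, depths):
    """Bounding-box size of the data entering each level, first level first.

    depths[k] is the number of iterations performed by level k + 1.
    Each iteration widens the box by one element on every side; the
    unused corner elements of the real stencil are ignored here.
    """
    sizes = []
    l, w = out_l, out_w
    for d in reversed(depths):           # walk from the last level back to the first
        l, w = l + 2 * d, w + 2 * d
        sizes.append((l, w))
    return sizes[::-1]


# Three levels performing 2, 4 and 4 iterations, with a 4 x 4 output window.
print(level_windows(4, 4, [2, 4, 4]))  # [(24, 24), (20, 20), (12, 12)]
```

With these assumptions, the window loaded from off-chip memory is at most 24 x 24 elements, the first level hands at most 20 x 20 elements to the second, and the last level receives at most 12 x 12 elements to produce the 4 x 4 output.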

The sliding window technique is illustrated in more detail in Figure 5. The windows are aligned in such a way that the correctly computed elements cover the entire frame, which implies a certain degree of overlap among them. The sliding window approach introduces both a memory and a computation overhead. The former is due to the fact that certain elements are replicated in multiple sub-matrices, and are processed by more than one cone. The latter is due to the structure of the cones, which are typically unaware of which part of the processed data is valid and will eventually contribute to the final output. The idea of dividing the input into a set of overlapping regions has already been proposed for a few specific algorithms in the scope of custom hardware design [Roca et al. 1999], even though it has never been methodically combined with other optimizations, such as the computation of multiple iterations within a cone.

Fig. 5. The sliding window technique to produce the whole output frame (legend: sliding window, elements not computed correctly, elements computed correctly, overlapping region, input matrix)

Since the number and the depth of the cones in the actual architecture can vary depending on the desired trade-off between resource usage and target performance, multiple instances of the proposed template may exist. In particular, each one of these instances is uniquely defined by the following two parameters:

(1) the size of the output window of each cone, defined as the number of output elements contained in the rectangle of size l × w;
(2) the depth of each cone, i.e., the number of levels into which the computation is divided or, equivalently, the number of iterations that are performed at once by each cone.

Figure 4 shows an instance of the template with an output window of 4 × 4 elements and 3 levels of computation: the first one involves 2 iterations, while the other two levels involve 4 iterations each.
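The overlap introduced by the sliding window can be made concrete with a small sketch. The code below is an illustration rather than the scheduling used in the paper: it assumes that a cone of depth m needs `m * halo` extra elements on each side of its output window, and that input windows are clipped at the frame borders. All names are hypothetical.

```python
def sliding_windows(rows, cols, out_rows, out_cols, m, halo=1):
    """Yield (output_box, input_box) pairs; boxes are (r0, c0, r1, c1), end-exclusive."""
    margin = m * halo
    for r0 in range(0, rows, out_rows):
        for c0 in range(0, cols, out_cols):
            r1, c1 = min(r0 + out_rows, rows), min(c0 + out_cols, cols)
            input_box = (max(r0 - margin, 0), max(c0 - margin, 0),
                         min(r1 + margin, rows), min(c1 + margin, cols))
            yield (r0, c0, r1, c1), input_box


def area(box):
    """Number of elements in an end-exclusive box."""
    r0, c0, r1, c1 = box
    return (r1 - r0) * (c1 - c0)


windows = list(sliding_windows(rows=8, cols=8, out_rows=4, out_cols=4, m=2))
produced = sum(area(out) for out, _ in windows)  # each output element exactly once
loaded = sum(area(inp) for _, inp in windows)    # overlapping inputs counted repeatedly
print(len(windows), produced, loaded)
```

The gap between `loaded` and `produced` is the memory overhead described above: elements in the overlapping regions are fetched and processed by more than one cone.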
It is worth noting that, since the amount of data exchanged between two levels x and x + 1 (the output of level x is the input of level x + 1) only depends on the size of the output of level x + 1 and on the number of iterations considered by the two levels of computation, the parameters previously introduced suffice to completely specify any architecture. The only requirement for an instance to be feasible is that, if cones of different depths are required to complete the computation, at least one cone of each depth must be implemented on the device. For instance, the example in Figure 4 is feasible if the

available resources are sufficient to fit cones A and E because, in this case, the first level can be implemented by sequentially executing cone A four times (in order to cover B, C and D as well), while the remaining levels can be implemented by executing cone E four times (3 executions are required for level 2, and one for level 3). Many instances are generally feasible, and the same instance may be implemented in different ways by instantiating different numbers of cones of different depths, according to the resource availability. As a consequence, multiple tradeoffs between area usage and achievable throughput (the more cones, the better) need to be evaluated. The tradeoff analysis can be performed by defining proper quality metrics, which are discussed in the following section.

Design Evaluation Using the Expansion Rate

As the definition of a computational cone spanning across the frame introduces a computation and memory overhead in the final architecture, it is necessary to define proper quality metrics to estimate its impact and help the designer in tuning the architectural parameters, such as depth and window size of each cone. An ideal metric should only depend on the structure of the algorithm in order to be computed in the early stages of the design, but on the other hand it should provide a reliable estimation of post-implementation aspects, such as area and throughput. In this context, we define such a metric, related only to the geometry of the dependency scheme, and we name it expansion rate. Two flavors of the expansion rate are proposed in this work, the first focusing on the geometry of the stencil, while the second is mainly driven by memory considerations. The two values are conceptually different as they address two separate aspects of the design, hence they can be considered as complementary while evaluating different design options.
The two flavors of the expansion rate are defined as follows:

Static Expansion Rate (SER): the SER is defined as the normalized ratio between the number of input elements to be processed and the size of the output window. In particular, the static expansion rate for a cone of depth m that produces an output area of size l × w is defined as follows:

$$\mathrm{SER}(l, w, m) = \sqrt[m]{\frac{\Omega(l, w, m)}{l \cdot w}} \qquad (16)$$

where Ω(l, w, m) is the number of input elements that must be processed in order to generate the output area while performing m iterations at once. The metric is purely based on geometrical considerations: in fact, Ω(l, w, m) only depends on the shape of the stencil, which in turn depends on the input algorithm. The m-th root acts as a normalization operation, which is necessary to compare cones of different depths. In fact, a cone with a higher depth likely requires a larger number of input elements to produce the same output area, but this higher overhead is compensated by the benefits of performing more iterations at once.

Dynamic Expansion Rate (DER): the DER is conceptually defined as the ratio between the number of input elements that need to be loaded from the memory and the size of the output that is produced by the cone. The amount of data to be fetched from the memory is equal to the number of elements that are necessary to compute the current output window and were not required to compute the previous one. Hence, this metric is able to evaluate the overlap of the sliding window, and assess how this affects the memory access. Formally, the DER is defined as:

$$\mathrm{DER}(l, w, m) = \frac{\Psi(l, w, m)}{l \cdot w} \qquad (17)$$

where the function Ψ(l, w, m) indicates the number of non-overlapping input elements between two consecutive applications of the cone. This value is specific for each input algorithm, and can be computed by considering either a horizontal or a vertical translation of the sliding window.

The expansion rate is equal to 1 only if the output and the input window sizes are equal, hence no overhead exists, while it assumes higher values when the number of input elements that are processed by the cone is much larger than the size of the output window. In this way, the expansion rate can be used to maximize the ratio between the number of output and of input elements. The metric is also a function of the depth of the cone, because performing a larger number of iterations at once reduces the number of intermediate results to be stored, increases performance and may balance the additional overhead of processing a larger input window.

5. IMPLEMENTATION DETAILS

This section illustrates the design of a parallel implementation of Chambolle, whose structure is based on the cone architecture proposed in Section 4. Starting from the stencil shape of the algorithm, a set of cones has been derived and further optimized using ad hoc considerations. In particular, the design of the processing elements within each cone has been specifically tuned to achieve the best possible performance, using an efficient and application-specific data reuse mechanism, described in Section 5.3, as well as a properly suited memory management system, detailed in Section 5.4. As a result of this design effort, the proposed solution largely outperforms all the existing hardware implementations of Chambolle that can be found in the literature.

In the proposed architecture, the shape of the computational cone follows the stencil shape shown in Figure 2(b). Each cone aims at directly computing each element of px and py (see Algorithm 1) at iteration n + x by finding a formula that employs the values available at iteration n.
Each cone is then shifted using a sliding window mechanism, in order to span the entire area of the input matrix. As discussed in Section 4, the rationale is to divide the output frame (I_1 in Section 3.4) into overlapping sub-matrices, whose profitable areas are contiguous. This approach introduces a slight memory overhead, because certain elements are replicated in multiple sub-matrices. A computation overhead is also introduced, as the cones may process some elements which are not profitable and will not be part of the output. However, the sliding window technique enables a coarse-grained parallelization of Chambolle in spite of its recursive nature and its complex data dependencies, and this greatly improves the throughput of the proposed implementation.

The remainder of this section provides a detailed description of the computation that takes place within each computational cone. In addition, we discuss the implementation of the sliding window technique, which allows the cones to span the input matrix, including all the relevant implementation details related to the memory organization.

5.1. Expansion Rate Analysis

The expansion rate metrics, which have been introduced in Section 4, can be evaluated to guide the choice of the most suitable cone size for the Chambolle algorithm. The static expansion rate, which captures the geometrical properties of the algorithm, can be computed according to equation (16), replacing the value of Ω(l, w, N), which quantifies the number of input elements that must be processed to generate the output window, with the equation obtained in (15). The resulting equation is the

following:

$$\mathrm{SER}(l, w, m) = \sqrt[m]{\frac{\sum_{k=1}^{m-1}\left[(l + 2k)(w + 2k) - 2\sum_{h=1}^{k} h\right]}{l \cdot w}} \qquad (18)$$

Fig. 6. Static and dynamic expansion rates for the Chambolle algorithm (panels: Static Expansion Rate (SER) and Dynamic Expansion Rate (DER); axes: number of iterations, output window length)

This equation is plotted in Figure 6 for different values of the number of iterations and the output window size. For the sake of illustration, a square output window has been assumed in the figure, so its size can be summarized using only one axis, which represents the length of its edge. It can be observed that the expansion rate is minimized with windows of large size (i.e., larger than 60 × 60), while the dependency on the number of iterations is significant only for windows of small size. This behavior is consistent with the shape of the Chambolle stencil, which requires a lot of overlapping input elements when a large output is computed.

Similarly, the dynamic expansion rate can be computed starting from equation (17), and computing the number of elements to be fetched from the memory when the cone slides to the following output window. According to the shape of the stencil illustrated in Figure 2, it can be derived that:

- when the cone slides horizontally, a total of l · (w + 2m) new elements of the input matrix have to be fetched;
- when the cone slides vertically, w · (l + 2m) new elements have to be loaded from the memory.

The two sliding directions can be used interchangeably to compute the dynamic expansion rate, as they eventually lead to the same conclusions. Figure 6 shows the behavior of the DER for different values of the number of iterations and the output size: a square output window is again assumed for illustrative purposes, thus making the horizontal and vertical translations equivalent. Similarly to the static case, the evaluation of the dynamic expansion rate also recommends the employment of large output windows, with an edge larger than 80 elements.
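The quantities used in this analysis are simple enough to be evaluated with a few lines of code. The sketch below is a hedged illustration: `der` implements equation (17) with the two fetch counts listed above, `ser` implements the generic definition of equation (16) for an element count `omega` supplied by the caller, and `footprint` evaluates a single bracketed term of the form (l + 2k)(w + 2k) minus two triangular corners, which is our reading of the stencil geometry. The function names are hypothetical.

```python
def footprint(l, w, k):
    """Elements of the dependency region k iterations below an l x w output window."""
    missing_corners = 2 * sum(range(1, k + 1))  # two triangular corners are not needed
    return (l + 2 * k) * (w + 2 * k) - missing_corners


def ser(omega, l, w, m):
    """Static expansion rate of equation (16) for a given element count omega."""
    return (omega / (l * w)) ** (1.0 / m)


def der(l, w, m, horizontal=True):
    """Dynamic expansion rate of equation (17) for a horizontal or vertical slide."""
    new_elements = l * (w + 2 * m) if horizontal else w * (l + 2 * m)
    return new_elements / (l * w)


# For a square window of edge e, the DER reduces to 1 + 2*m/e in both directions.
for edge in (10, 40, 80):
    print(edge, der(edge, edge, m=5), der(edge, edge, m=5, horizontal=False))
```

For a square window the DER depends only on the ratio between the depth and the edge, which is consistent with the observation that larger windows reduce the relative cost of the overlap.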
The conclusion of the analysis of SER and DER, reported in Figure 6, is that a window whose edge is longer than 60 and 80 elements, respectively, should be preferred. Taking the intersection of the two conditions, any output window with an edge larger than 80 elements can effectively mitigate the effects of the computation and memory access overheads.

Finally, we use Chambolle as an example to illustrate the ability of the expansion rate to capture post-implementation design aspects, specifically area and throughput, in spite of being defined as a sole function of the geometry of the input algorithm. Figure 7 highlights the best solutions when two common design approaches are adopted. Specifically, the x-axis represents the normalized ratio between throughput and area, which corresponds to a scenario where the design goal is to maximize the performance of the system, given the available resources. The y-axis, on the other hand, represents the normalized throughput, corresponding to a scenario where performance has to be maximized without area limitations. The quantitative analysis of Figure 7 includes different window sizes and numbers of iterations, which in turn correspond to different values of the expansion rate (in this case, the SER, but similar results are obtained for the DER). The window sizes range between 6 × 6 and 89 × 89, while the number of iterations varies between 1 and 5, and is represented in the picture by the size of the circles. The green data points (solid lines) highlight the top 20% of solutions in terms of SER. It can be observed that, in general, solutions with a higher expansion rate tend to have higher throughputs, and make an efficient use of the area they require. This is further supported by the results in Figure 8, which reports throughput and throughput/area values as a function of the SER, the data points being clustered and averaged in order to better highlight the correlation.

Fig. 7. Expansion rate estimation versus actual post-implementation area and throughput

The expansion rate can therefore be considered as a reliable metric for design space exploration, and it can be computed by following the algorithm analysis proposed in Section 3, rather than performing a time-consuming synthesis for each candidate window size.

5.2. Overview of the Proposed Hardware Solution

Among the different implementations that satisfy the constraint identified in the previous section (windows with an edge larger than 80 elements), we herein propose as an example

a solution that employs cones working on sub-matrices whose size is close to the target threshold, in order to keep resource (especially memory) requirements low. The proposed hardware architecture slides these windows to span the entire length of the original matrix.

Fig. 8. Normalized throughput and throughput/area for different ranges of the normalized SER

A top-level block diagram of the proposed hardware architecture is shown in Figure 9. The hardware employs two concurrent cones moving as sliding windows (named SW1 and SW2), which work completely in parallel, each one updating the values of both u1 and u2 (we use the notation sw1_u1 to indicate the value of u1 computed by the sliding cone SW1). A cone moving as a sliding window is logically divided into two parts: an array of processing elements (PEs), and a dedicated amount of on-chip memory implemented on the BRAMs of the FPGA device.

Fig. 9. Top-level block diagram of the proposed hardware implementation of Chambolle (blocks: control unit with inputs θ, N iterations and dt; for each of sw1_u1, sw1_u2, sw2_u1 and sw2_u2, a PE array with 8 BRAMs and 1 BRAM for Term)

A detailed view of a cone, and in particular of the circuit that processes sw1_u1, is shown in Figure 10. The data required to compute the components of u (i.e., v, px and py, as shown in Algorithm 1) is stored in the on-chip BRAMs, in order to reduce the accesses to the off-chip memory. We have designed the cone to compute 7 elements in parallel for both u1 and u2, thus finding 14 elements of vector u at the same time. This structure not only introduces a finer level of parallelism to accelerate the execution,

but also enables a significant data reuse among the PEs (as discussed in the following subsection), and reduces the accesses to both on-chip and off-chip memory. As a result, the proposed hardware is able to compute the value of one element in just 18 clock cycles: 1 cycle is required by the control unit, 1 cycle by the synchronous read from the BRAM memory, 1 cycle by the vertical rotator, and 15 cycles by the PE array. Furthermore, the processing of each one of sw1_u1, sw1_u2, sw2_u1 and sw2_u2 requires 8 BRAMs to store the respective px, py and v values, plus an additional BRAM that is necessary to exchange data between two iterations of the PEs. Hence, only 36 BRAM blocks are employed by the proposed design.

Fig. 10. Computation of sw1_u1 within a cone (blocks: control unit, 8 BRAMs for sw1_u1, 1 BRAM for Term, vertical rotators, PE array for sw1_u1)

5.3. Processing Element Arrays and Data Reuse

The proposed hardware implementation includes four PE arrays, two for each cone, to find the outputs u1 and u2 of Chambolle, which are subsequently used to update v by means of the thresholding function. Each PE array contains 14 processing elements, 7 of which are called PE-Ts and are used to calculate the values of Term and u (see Algorithm 1), while the other 7 are named PE-Vs and are used to compute px and py. Overall, there are 56 PEs in the proposed hardware, evenly divided among PE-Ts and PE-Vs. Within the cone, a ladder organization of a PE array is proposed: Figure 11 illustrates this organization on the PEs that work on the first 7 rows (also called first region) of the input matrix.
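The resource and latency figures quoted for the architecture follow from a few multiplications, which can be cross-checked with the short sketch below. It only restates numbers given in the text; the constant names are hypothetical.

```python
CONES = 2                 # sliding cones SW1 and SW2
ARRAYS_PER_CONE = 2       # one PE array for u1 and one for u2
PE_T_PER_ARRAY = 7        # PE-Ts compute Term and u
PE_V_PER_ARRAY = 7        # PE-Vs compute px and py
DATA_BRAMS_PER_ARRAY = 8  # px, py and v
TERM_BRAMS_PER_ARRAY = 1  # BRAM used to exchange Term values

# Latency contributions, in clock cycles, for the computation of one element.
LATENCY = {"control unit": 1, "BRAM read": 1, "vertical rotator": 1, "PE array": 15}

arrays = CONES * ARRAYS_PER_CONE
total_pes = arrays * (PE_T_PER_ARRAY + PE_V_PER_ARRAY)
total_brams = arrays * (DATA_BRAMS_PER_ARRAY + TERM_BRAMS_PER_ARRAY)
cycles_per_element = sum(LATENCY.values())

print(arrays, total_pes, total_brams, cycles_per_element)
```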
Figure 11 also illustrates how the same PEs are then reused to process the following 7 rows (second region). In particular, while PE-T1 is calculating Term for the elements in the uppermost row, PE-T7 computes Term for the elements in row 6. Then, after all the PEs have completed the first 7 rows, PE-T1 starts computing Term for row 7, while PE-T7 shifts to row 13.

The value of Term for one element depends on the values of px and py at the same position (we refer to these values as c_px and c_py), plus the px vector of the element on the left (l_px), and the py vector of the element above (a_py). Without any data reuse policy, each PE-T in a PE array requires 4 values to be loaded from the on-chip memory, and consequently 4 PE arrays with 7 PE-Ts each require 112 values to be read from the memory. Thanks to the proposed ladder organization of the PEs, this data transfer can be limited by propagating the intermediate results. Figure 12 shows how the 7 PE-Ts are arranged, and how they were aligned in the previous cycle (dashed boxes). Since all the PEs require their c_px and c_py vectors computed in a previous iteration, they are loaded from the BRAMs. Then, as the processing direction in a cone goes

from left to right, these vectors can be reused as l_px and a_py vectors for the following cycle without accessing the memory. For instance, PE-T3 takes the l_px vector from the flip-flop that stores the c_px vector processed in the previous cycle. Similarly, c_py can be reused as a_py by the PE-Ts which are located below: for example, the c_py vector used by PE-T2 is the a_py vector of PE-T3 for the next cycle.

Fig. 11. Organization of 7 PE-Ts and 7 PE-Vs in a computational cone, and memory organization during the computation of sw1_u1 (Region 0: rows 0 to 6; Region 1: rows 7 to 13; rows 0 to 7 are stored in BRAMs 0 to 7, rows 8 to 12 in BRAMs 0 to 4; BRAM-Term sits between the two regions)

Fig. 12. Data reuse among the 7 PE-Ts during the computation of sw1_u1 (the dashed boxes indicate the position of the PE-Ts in the previous cycle)

The PE-Vs start computing px and py for one element one cycle after the PE-Ts, and they also exploit a massive reuse of data. Algorithm 1 shows that, in order to compute px and py vectors for an element, three Term values are required: the one

of the corresponding element, the one of its right neighbor, and the one of the bottom neighbor. In the proposed implementation, the values of Term that are processed by the array of PE-Ts are reused, and propagated using pipelining flip-flops. For instance, in order to compute px and py for the element at position (2, 11), the Term values of the elements in (2, 11), (2, 12) and (3, 11) are required. PE-T3 calculates the Term value at (2, 11), and at the same time PE-T4 calculates the Term value for (3, 10). In the next clock cycle, PE-T3 and PE-T4 compute the Term values for (2, 12) and (3, 11), respectively. Then, PE-V3 takes the required Term values from PE-T3 and PE-T4, as well as the synchronized result of PE-T3 that was computed in the previous clock cycle, and determines the new px and py for element (2, 11), without reading any data from BRAM. Once the values of px and py have been determined, they are stored in BRAM for the following iterations.

5.4. Memory Organization

The proposed data reuse scheme reduces both the number of accesses to the BRAMs and the amount of memory required to store the intermediate results. As shown in Figure 12, the array of PE-Ts needs to read 15 vectors from BRAMs, but 28 vectors would be required if data reuse had not been implemented. We now illustrate how those BRAMs are organized. According to Figure 11, PE-Vs from 2 to 7 take the required values of Term from the two adjacent PE-Ts and from the result computed in the previous clock cycle by the PE-Ts that are on their right. Therefore, the computation of these six PE-Vs does not require any additional BRAM to store the intermediate values of Term computed by the PE-Ts. Only PE-V1 needs to load the Term values computed by PE-T7 in the previous region, which have to be stored in a BRAM block (called BRAM-Term).
For instance, in order to calculate px and py for row 6, the values of Term for rows 6 and 7 are required, but they cannot be computed in successive clock cycles because the two rows belong to two different regions (see Figure 11), and are processed by the PE array at two separate moments. Therefore, the Term values of row 6 are stored in a dual-port BRAM, and they are read back when PE-T1 computes the Term values of row 7. As a PE array uses 8 BRAMs for px, py and v, plus an additional BRAM-Term block as a bridge between two different regions, 9 BRAMs are required to process each region. The results computed by each PE-V are stored in the corresponding BRAMs according to the addressing shown in Figure 11. When the array completes a region and starts processing the following one, the address used to access the BRAMs needs to be increased by an offset of 92, and this step is performed by a vertical rotator, which is shown in Figure 10. Overall, the 8 BRAMs of each region are indexed using 1012 addresses, and a 32-bit block of data is stored at each address. The 32 bits encode v, which requires 13 bits, followed by c_px and c_py, which require 9 bits each. After the PE-Vs find the new values of px and py, the values in the BRAMs are updated by using the write ports of the BRAMs, overwriting the vector values that have been read in previous cycles.

5.5. Processing Elements

We finally provide a detailed description of the PE-T and PE-V processing elements. The hardware architecture of a PE-T is shown in Figure 13, and the one of a PE-V is shown in Figure 14. The implementation of a PE-T includes the Backward operations for px and py, which are performed in parallel before computing the value of the output Term, which is then used as r_Term (right Term) for the PE-V that is processing the same row, whereas b_Term (bottom Term) can feed the PE-V that is processing the upper row. Moreover, the value of Term is pipelined for 1 clock cycle in order to use it as c_Term

(current Term). The propagation schema of the different Terms (right, bottom and current) is shown in Figure 11.

Fig. 13. Hardware architecture of a PE-T

Fig. 14. Hardware architecture of a PE-V

The hardware architecture for PE-Vs implements the Forward operations between c_Term, r_Term and b_Term in parallel, and then computes the new px and py vectors. The main issue in the design of the PE-V architecture is the square root function to compute px and py, as shown on line 6 of Algorithm 1. An efficient and precise hardware implementation of the square root is still an open problem [Sajid et al. 2010] [Li and Chu 1997], and there are two main techniques to handle it: iterative techniques, which achieve better precision, and look-up tables, which are faster. In the proposed implementation, a look-up table was employed to focus on timing performance, while the achieved precision is still acceptable in the context of optical flow estimation. In fact, the error of the approximated square root is below 1% in more than 90% of the tested samples. The look-up table takes a 32-bit signal represented using a fixed-point notation, where the integer part takes 24 bits,

and the decimal part takes 8 bits. The entries of the table are 8-bit values, thus the table contains 2^8 = 256 pre-computed values, and only requires 70 LUTs to be deployed on the FPGA. Instead of dividing the input value into 4 pieces of 8 bits each, which can index 4 different tables, a technique has been designed to increase the precision while using only one table (thus saving the LUTs of three additional tables in each of the 28 PE-Vs). In particular, the 8 most significant bits of the input value are considered, and used to get the result from the table, discarding the remaining bits. The 8-bit block starts in an odd position (counting from left to right), and finishes in an even one: if the first non-zero bit is located in the n-th position, where n is even, then the 8-bit block will start from the zero bit at position n - 1. In this way, if the decimal value of the 8-bit block is equal to m, and if the rightmost bit of the block is in position 2k, then the number is approximately equal to m · 2^{2k}, and its square root is computed by accessing the table at value m, and left-shifting the output by k positions.

6. EXPERIMENTAL RESULTS

The proposed cone-based parallelization of the Chambolle algorithm has been fully implemented in Verilog and synthesized for a Xilinx Virtex-5 XC5VLX110T FPGA [Xilinx 2009]. Figure 15 shows the resource usage of the Chambolle core, which reaches an operating frequency of 221 MHz after place and route. If required by the target device, the number of required DSPs can be reduced by mapping part of the multiplications on the LUTs.

Fig. 15. Area usage on a Xilinx Virtex-5 XC5VLX110T FPGA

Figure 16 shows the comparison, in terms of frames per second, between the performance achieved by the proposed approach and the ones obtained by state-of-the-art implementations. These are implemented on either CPUs or GPUs as, to the best of our knowledge, no implementation that leverages the fine-grained parallelism of FPGAs has been proposed in the literature.
The evaluation assumes that the images to be processed are pre-loaded in the device memory, in order to focus the measurements on the Chambolle algorithm itself rather than on the transient setup. The estimated speedup achieved by the implementation proposed in this work ranges from 16.5× to 76× on images of the resolution that is most commonly found in the literature related to Chambolle. However, the advantages of the proposed parallelization approach are even more noticeable on larger images. In fact, the proposed implementation is the only one able to achieve more than 30 fps and, hence, meet the real-time constraints on such images. On the contrary, most of the existing approaches work with reasonable frame

rates (higher than 20 fps) only on very small images.

Fig. 16. Performance comparison, in terms of frames per second, with respect to state-of-the-art implementations

Thus, in order to perform a fair comparison and to normalize the size of the images processed by the different approaches, we compare them in Figure 17 in terms of number of mega-pixels processed per second. In this case, the speedup obtained by the proposed design with respect to the best state-of-the-art implementations ranges from 38× to 130× (77× on average), proving that the proposed approach scales very well with the frame size.
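The normalization used for this comparison is a plain change of units, sketched below. The frame rate and resolution in the example are hypothetical values chosen only to show the computation; they are not measurements from this work.

```python
def megapixels_per_second(fps, width, height):
    """Convert a frame rate at a given resolution into mega-pixels per second."""
    return fps * width * height / 1e6


def speedup(ours, baseline):
    """Ratio between two throughputs expressed in the same unit."""
    return ours / baseline


# Hypothetical example: 30 fps on 512 x 512 frames versus 100 fps on 128 x 128 frames.
large = megapixels_per_second(30, 512, 512)
small = megapixels_per_second(100, 128, 128)
print(large, small, speedup(large, small))
```

In this made-up example the design with the lower frame rate still processes more pixels per second, which is why the comparison is normalized by the image size.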

Fig. 17. Performance comparison, in terms of mega-pixels per second, with respect to state-of-the-art implementations

7. COMPARISON WITH RESPECT TO GPU IMPLEMENTATIONS

We finally discuss a possible implementation of Chambolle on GPUs, in order to prove how the fine-grained configuration capabilities of FPGAs provide a better environment for the implementation of this algorithm. Comparisons between the two architectures have already been proposed in the literature, such as in [Bodily et al. 2010], proving that GPUs do not match the flexibility provided by FPGAs when custom computation


More information

Data Structures and Algorithms. Analysis of Algorithms

Data Structures and Algorithms. Analysis of Algorithms Data Structures ad Algorithms Aalysis of Algorithms Outlie Ruig time Pseudo-code Big-oh otatio Big-theta otatio Big-omega otatio Asymptotic algorithm aalysis Aalysis of Algorithms Iput Algorithm Output

More information

Lecture Notes 6 Introduction to algorithm analysis CSS 501 Data Structures and Object-Oriented Programming

Lecture Notes 6 Introduction to algorithm analysis CSS 501 Data Structures and Object-Oriented Programming Lecture Notes 6 Itroductio to algorithm aalysis CSS 501 Data Structures ad Object-Orieted Programmig Readig for this lecture: Carrao, Chapter 10 To be covered i this lecture: Itroductio to algorithm aalysis

More information

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe CHAPTER 19 Query Optimizatio Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe Itroductio Query optimizatio Coducted by a query optimizer i a DBMS Goal:

More information

condition w i B i S maximum u i

condition w i B i S maximum u i ecture 10 Dyamic Programmig 10.1 Kapsack Problem November 1, 2004 ecturer: Kamal Jai Notes: Tobias Holgers We are give a set of items U = {a 1, a 2,..., a }. Each item has a weight w i Z + ad a utility

More information

Structuring Redundancy for Fault Tolerance. CSE 598D: Fault Tolerant Software

Structuring Redundancy for Fault Tolerance. CSE 598D: Fault Tolerant Software Structurig Redudacy for Fault Tolerace CSE 598D: Fault Tolerat Software What do we wat to achieve? Versios Damage Assessmet Versio 1 Error Detectio Iputs Versio 2 Voter Outputs State Restoratio Cotiued

More information

Creating Exact Bezier Representations of CST Shapes. David D. Marshall. California Polytechnic State University, San Luis Obispo, CA , USA

Creating Exact Bezier Representations of CST Shapes. David D. Marshall. California Polytechnic State University, San Luis Obispo, CA , USA Creatig Exact Bezier Represetatios of CST Shapes David D. Marshall Califoria Polytechic State Uiversity, Sa Luis Obispo, CA 93407-035, USA The paper presets a method of expressig CST shapes pioeered by

More information

. Written in factored form it is easy to see that the roots are 2, 2, i,

. Written in factored form it is easy to see that the roots are 2, 2, i, CMPS A Itroductio to Programmig Programmig Assigmet 4 I this assigmet you will write a java program that determies the real roots of a polyomial that lie withi a specified rage. Recall that the roots (or

More information

Data diverse software fault tolerance techniques

Data diverse software fault tolerance techniques Data diverse software fault tolerace techiques Complemets desig diversity by compesatig for desig diversity s s limitatios Ivolves obtaiig a related set of poits i the program data space, executig the

More information

Analysis Metrics. Intro to Algorithm Analysis. Slides. 12. Alg Analysis. 12. Alg Analysis

Analysis Metrics. Intro to Algorithm Analysis. Slides. 12. Alg Analysis. 12. Alg Analysis Itro to Algorithm Aalysis Aalysis Metrics Slides. Table of Cotets. Aalysis Metrics 3. Exact Aalysis Rules 4. Simple Summatio 5. Summatio Formulas 6. Order of Magitude 7. Big-O otatio 8. Big-O Theorems

More information

CSC165H1 Worksheet: Tutorial 8 Algorithm analysis (SOLUTIONS)

CSC165H1 Worksheet: Tutorial 8 Algorithm analysis (SOLUTIONS) CSC165H1, Witer 018 Learig Objectives By the ed of this worksheet, you will: Aalyse the ruig time of fuctios cotaiig ested loops. 1. Nested loop variatios. Each of the followig fuctios takes as iput a

More information

CSC 220: Computer Organization Unit 11 Basic Computer Organization and Design

CSC 220: Computer Organization Unit 11 Basic Computer Organization and Design College of Computer ad Iformatio Scieces Departmet of Computer Sciece CSC 220: Computer Orgaizatio Uit 11 Basic Computer Orgaizatio ad Desig 1 For the rest of the semester, we ll focus o computer architecture:

More information

arxiv: v2 [cs.ds] 24 Mar 2018

arxiv: v2 [cs.ds] 24 Mar 2018 Similar Elemets ad Metric Labelig o Complete Graphs arxiv:1803.08037v [cs.ds] 4 Mar 018 Pedro F. Felzeszwalb Brow Uiversity Providece, RI, USA pff@brow.edu March 8, 018 We cosider a problem that ivolves

More information

A Note on Least-norm Solution of Global WireWarping

A Note on Least-norm Solution of Global WireWarping A Note o Least-orm Solutio of Global WireWarpig Charlie C. L. Wag Departmet of Mechaical ad Automatio Egieerig The Chiese Uiversity of Hog Kog Shati, N.T., Hog Kog E-mail: cwag@mae.cuhk.edu.hk Abstract

More information

How do we evaluate algorithms?

How do we evaluate algorithms? F2 Readig referece: chapter 2 + slides Algorithm complexity Big O ad big Ω To calculate ruig time Aalysis of recursive Algorithms Next time: Litterature: slides mostly The first Algorithm desig methods:

More information

Evaluation scheme for Tracking in AMI

Evaluation scheme for Tracking in AMI A M I C o m m u i c a t i o A U G M E N T E D M U L T I - P A R T Y I N T E R A C T I O N http://www.amiproject.org/ Evaluatio scheme for Trackig i AMI S. Schreiber a D. Gatica-Perez b AMI WP4 Trackig:

More information

Fast Fourier Transform (FFT) Algorithms

Fast Fourier Transform (FFT) Algorithms Fast Fourier Trasform FFT Algorithms Relatio to the z-trasform elsewhere, ozero, z x z X x [ ] 2 ~ elsewhere,, ~ e j x X x x π j e z z X X π 2 ~ The DFS X represets evely spaced samples of the z- trasform

More information

EE123 Digital Signal Processing

EE123 Digital Signal Processing Last Time EE Digital Sigal Processig Lecture 7 Block Covolutio, Overlap ad Add, FFT Discrete Fourier Trasform Properties of the Liear covolutio through circular Today Liear covolutio with Overlap ad add

More information

APPLICATION NOTE PACE1750AE BUILT-IN FUNCTIONS

APPLICATION NOTE PACE1750AE BUILT-IN FUNCTIONS APPLICATION NOTE PACE175AE BUILT-IN UNCTIONS About This Note This applicatio brief is iteded to explai ad demostrate the use of the special fuctios that are built ito the PACE175AE processor. These powerful

More information

Octahedral Graph Scaling

Octahedral Graph Scaling Octahedral Graph Scalig Peter Russell Jauary 1, 2015 Abstract There is presetly o strog iterpretatio for the otio of -vertex graph scalig. This paper presets a ew defiitio for the term i the cotext of

More information

Chapter 11. Friends, Overloaded Operators, and Arrays in Classes. Copyright 2014 Pearson Addison-Wesley. All rights reserved.

Chapter 11. Friends, Overloaded Operators, and Arrays in Classes. Copyright 2014 Pearson Addison-Wesley. All rights reserved. Chapter 11 Frieds, Overloaded Operators, ad Arrays i Classes Copyright 2014 Pearso Addiso-Wesley. All rights reserved. Overview 11.1 Fried Fuctios 11.2 Overloadig Operators 11.3 Arrays ad Classes 11.4

More information

Outline. Research Definition. Motivation. Foundation of Reverse Engineering. Dynamic Analysis and Design Pattern Detection in Java Programs

Outline. Research Definition. Motivation. Foundation of Reverse Engineering. Dynamic Analysis and Design Pattern Detection in Java Programs Dyamic Aalysis ad Desig Patter Detectio i Java Programs Outlie Lei Hu Kamra Sartipi {hul4, sartipi}@mcmasterca Departmet of Computig ad Software McMaster Uiversity Caada Motivatio Research Problem Defiitio

More information

Appendix D. Controller Implementation

Appendix D. Controller Implementation COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Appedix D Cotroller Implemetatio Cotroller Implemetatios Combiatioal logic (sigle-cycle); Fiite state machie (multi-cycle, pipelied);

More information

The Magma Database file formats

The Magma Database file formats The Magma Database file formats Adrew Gaylard, Bret Pikey, ad Mart-Mari Breedt Johaesburg, South Africa 15th May 2006 1 Summary Magma is a ope-source object database created by Chris Muller, of Kasas City,

More information

CIS 121 Data Structures and Algorithms with Java Fall Big-Oh Notation Tuesday, September 5 (Make-up Friday, September 8)

CIS 121 Data Structures and Algorithms with Java Fall Big-Oh Notation Tuesday, September 5 (Make-up Friday, September 8) CIS 11 Data Structures ad Algorithms with Java Fall 017 Big-Oh Notatio Tuesday, September 5 (Make-up Friday, September 8) Learig Goals Review Big-Oh ad lear big/small omega/theta otatios Practice solvig

More information

Introduction. Nature-Inspired Computing. Terminology. Problem Types. Constraint Satisfaction Problems - CSP. Free Optimization Problem - FOP

Introduction. Nature-Inspired Computing. Terminology. Problem Types. Constraint Satisfaction Problems - CSP. Free Optimization Problem - FOP Nature-Ispired Computig Hadlig Costraits Dr. Şima Uyar September 2006 Itroductio may practical problems are costraied ot all combiatios of variable values represet valid solutios feasible solutios ifeasible

More information

Accuracy Improvement in Camera Calibration

Accuracy Improvement in Camera Calibration Accuracy Improvemet i Camera Calibratio FaJie L Qi Zag ad Reihard Klette CITR, Computer Sciece Departmet The Uiversity of Aucklad Tamaki Campus, Aucklad, New Zealad fli006, qza001@ec.aucklad.ac.z r.klette@aucklad.ac.z

More information

9.1. Sequences and Series. Sequences. What you should learn. Why you should learn it. Definition of Sequence

9.1. Sequences and Series. Sequences. What you should learn. Why you should learn it. Definition of Sequence _9.qxd // : AM Page Chapter 9 Sequeces, Series, ad Probability 9. Sequeces ad Series What you should lear Use sequece otatio to write the terms of sequeces. Use factorial otatio. Use summatio otatio to

More information

An Efficient Algorithm for Graph Bisection of Triangularizations

An Efficient Algorithm for Graph Bisection of Triangularizations A Efficiet Algorithm for Graph Bisectio of Triagularizatios Gerold Jäger Departmet of Computer Sciece Washigto Uiversity Campus Box 1045 Oe Brookigs Drive St. Louis, Missouri 63130-4899, USA jaegerg@cse.wustl.edu

More information

Performance Plus Software Parameter Definitions

Performance Plus Software Parameter Definitions Performace Plus+ Software Parameter Defiitios/ Performace Plus Software Parameter Defiitios Chapma Techical Note-TG-5 paramete.doc ev-0-03 Performace Plus+ Software Parameter Defiitios/2 Backgroud ad Defiitios

More information

Running Time ( 3.1) Analysis of Algorithms. Experimental Studies. Limitations of Experiments

Running Time ( 3.1) Analysis of Algorithms. Experimental Studies. Limitations of Experiments Ruig Time ( 3.1) Aalysis of Algorithms Iput Algorithm Output A algorithm is a step- by- step procedure for solvig a problem i a fiite amout of time. Most algorithms trasform iput objects ito output objects.

More information

Analysis of Algorithms

Analysis of Algorithms Aalysis of Algorithms Iput Algorithm Output A algorithm is a step-by-step procedure for solvig a problem i a fiite amout of time. Ruig Time Most algorithms trasform iput objects ito output objects. The

More information

Improvement of the Orthogonal Code Convolution Capabilities Using FPGA Implementation

Improvement of the Orthogonal Code Convolution Capabilities Using FPGA Implementation Improvemet of the Orthogoal Code Covolutio Capabilities Usig FPGA Implemetatio Naima Kaabouch, Member, IEEE, Apara Dhirde, Member, IEEE, Saleh Faruque, Member, IEEE Departmet of Electrical Egieerig, Uiversity

More information

Lecture 1: Introduction and Strassen s Algorithm

Lecture 1: Introduction and Strassen s Algorithm 5-750: Graduate Algorithms Jauary 7, 08 Lecture : Itroductio ad Strasse s Algorithm Lecturer: Gary Miller Scribe: Robert Parker Itroductio Machie models I this class, we will primarily use the Radom Access

More information

Exact Minimum Lower Bound Algorithm for Traveling Salesman Problem

Exact Minimum Lower Bound Algorithm for Traveling Salesman Problem Exact Miimum Lower Boud Algorithm for Travelig Salesma Problem Mohamed Eleiche GeoTiba Systems mohamed.eleiche@gmail.com Abstract The miimum-travel-cost algorithm is a dyamic programmig algorithm to compute

More information

Running Time. Analysis of Algorithms. Experimental Studies. Limitations of Experiments

Running Time. Analysis of Algorithms. Experimental Studies. Limitations of Experiments Ruig Time Aalysis of Algorithms Iput Algorithm Output A algorithm is a step-by-step procedure for solvig a problem i a fiite amout of time. Most algorithms trasform iput objects ito output objects. The

More information

A Generalized Set Theoretic Approach for Time and Space Complexity Analysis of Algorithms and Functions

A Generalized Set Theoretic Approach for Time and Space Complexity Analysis of Algorithms and Functions Proceedigs of the 10th WSEAS Iteratioal Coferece o APPLIED MATHEMATICS, Dallas, Texas, USA, November 1-3, 2006 316 A Geeralized Set Theoretic Approach for Time ad Space Complexity Aalysis of Algorithms

More information

FPGA IMPLEMENTATION OF BASE-N LOGARITHM. Salvador E. Tropea

FPGA IMPLEMENTATION OF BASE-N LOGARITHM. Salvador E. Tropea FPGA IMPLEMENTATION OF BASE-N LOGARITHM Salvador E. Tropea Electróica e Iformática Istituto Nacioal de Tecología Idustrial Bueos Aires, Argetia email: salvador@iti.gov.ar ABSTRACT I this work, we preset

More information

Bezier curves. Figure 2 shows cubic Bezier curves for various control points. In a Bezier curve, only

Bezier curves. Figure 2 shows cubic Bezier curves for various control points. In a Bezier curve, only Edited: Yeh-Liag Hsu (998--; recommeded: Yeh-Liag Hsu (--9; last updated: Yeh-Liag Hsu (9--7. Note: This is the course material for ME55 Geometric modelig ad computer graphics, Yua Ze Uiversity. art of

More information

Lecturers: Sanjam Garg and Prasad Raghavendra Feb 21, Midterm 1 Solutions

Lecturers: Sanjam Garg and Prasad Raghavendra Feb 21, Midterm 1 Solutions U.C. Berkeley CS170 : Algorithms Midterm 1 Solutios Lecturers: Sajam Garg ad Prasad Raghavedra Feb 1, 017 Midterm 1 Solutios 1. (4 poits) For the directed graph below, fid all the strogly coected compoets

More information

Basic allocator mechanisms The course that gives CMU its Zip! Memory Management II: Dynamic Storage Allocation Mar 6, 2000.

Basic allocator mechanisms The course that gives CMU its Zip! Memory Management II: Dynamic Storage Allocation Mar 6, 2000. 5-23 The course that gives CM its Zip Memory Maagemet II: Dyamic Storage Allocatio Mar 6, 2000 Topics Segregated lists Buddy system Garbage collectio Mark ad Sweep Copyig eferece coutig Basic allocator

More information

GE FUNDAMENTALS OF COMPUTING AND PROGRAMMING UNIT III

GE FUNDAMENTALS OF COMPUTING AND PROGRAMMING UNIT III GE2112 - FUNDAMENTALS OF COMPUTING AND PROGRAMMING UNIT III PROBLEM SOLVING AND OFFICE APPLICATION SOFTWARE Plaig the Computer Program Purpose Algorithm Flow Charts Pseudocode -Applicatio Software Packages-

More information

3D Model Retrieval Method Based on Sample Prediction

3D Model Retrieval Method Based on Sample Prediction 20 Iteratioal Coferece o Computer Commuicatio ad Maagemet Proc.of CSIT vol.5 (20) (20) IACSIT Press, Sigapore 3D Model Retrieval Method Based o Sample Predictio Qigche Zhag, Ya Tag* School of Computer

More information

6.854J / J Advanced Algorithms Fall 2008

6.854J / J Advanced Algorithms Fall 2008 MIT OpeCourseWare http://ocw.mit.edu 6.854J / 18.415J Advaced Algorithms Fall 2008 For iformatio about citig these materials or our Terms of Use, visit: http://ocw.mit.edu/terms. 18.415/6.854 Advaced Algorithms

More information

DETECTION OF LANDSLIDE BLOCK BOUNDARIES BY MEANS OF AN AFFINE COORDINATE TRANSFORMATION

DETECTION OF LANDSLIDE BLOCK BOUNDARIES BY MEANS OF AN AFFINE COORDINATE TRANSFORMATION Proceedigs, 11 th FIG Symposium o Deformatio Measuremets, Satorii, Greece, 2003. DETECTION OF LANDSLIDE BLOCK BOUNDARIES BY MEANS OF AN AFFINE COORDINATE TRANSFORMATION Michaela Haberler, Heribert Kahme

More information

An Efficient Algorithm for Graph Bisection of Triangularizations

An Efficient Algorithm for Graph Bisection of Triangularizations Applied Mathematical Scieces, Vol. 1, 2007, o. 25, 1203-1215 A Efficiet Algorithm for Graph Bisectio of Triagularizatios Gerold Jäger Departmet of Computer Sciece Washigto Uiversity Campus Box 1045, Oe

More information

Computer Systems - HS

Computer Systems - HS What have we leared so far? Computer Systems High Level ENGG1203 2d Semester, 2017-18 Applicatios Sigals Systems & Cotrol Systems Computer & Embedded Systems Digital Logic Combiatioal Logic Sequetial Logic

More information

EE 459/500 HDL Based Digital Design with Programmable Logic. Lecture 13 Control and Sequencing: Hardwired and Microprogrammed Control

EE 459/500 HDL Based Digital Design with Programmable Logic. Lecture 13 Control and Sequencing: Hardwired and Microprogrammed Control EE 459/500 HDL Based Digital Desig with Programmable Logic Lecture 13 Cotrol ad Sequecig: Hardwired ad Microprogrammed Cotrol Refereces: Chapter s 4,5 from textbook Chapter 7 of M.M. Mao ad C.R. Kime,

More information

Software development of components for complex signal analysis on the example of adaptive recursive estimation methods.

Software development of components for complex signal analysis on the example of adaptive recursive estimation methods. Software developmet of compoets for complex sigal aalysis o the example of adaptive recursive estimatio methods. SIMON BOYMANN, RALPH MASCHOTTA, SILKE LEHMANN, DUNJA STEUER Istitute of Biomedical Egieerig

More information

Multi-Threading. Hyper-, Multi-, and Simultaneous Thread Execution

Multi-Threading. Hyper-, Multi-, and Simultaneous Thread Execution Multi-Threadig Hyper-, Multi-, ad Simultaeous Thread Executio 1 Performace To Date Icreasig processor performace Pipeliig. Brach predictio. Super-scalar executio. Out-of-order executio. Caches. Hyper-Threadig

More information

Pseudocode ( 1.1) Analysis of Algorithms. Primitive Operations. Pseudocode Details. Running Time ( 1.1) Estimating performance

Pseudocode ( 1.1) Analysis of Algorithms. Primitive Operations. Pseudocode Details. Running Time ( 1.1) Estimating performance Aalysis of Algorithms Iput Algorithm Output A algorithm is a step-by-step procedure for solvig a problem i a fiite amout of time. Pseudocode ( 1.1) High-level descriptio of a algorithm More structured

More information

The Closest Line to a Data Set in the Plane. David Gurney Southeastern Louisiana University Hammond, Louisiana

The Closest Line to a Data Set in the Plane. David Gurney Southeastern Louisiana University Hammond, Louisiana The Closest Lie to a Data Set i the Plae David Gurey Southeaster Louisiaa Uiversity Hammod, Louisiaa ABSTRACT This paper looks at three differet measures of distace betwee a lie ad a data set i the plae:

More information

Computers and Scientific Thinking

Computers and Scientific Thinking Computers ad Scietific Thikig David Reed, Creighto Uiversity Chapter 15 JavaScript Strigs 1 Strigs as Objects so far, your iteractive Web pages have maipulated strigs i simple ways use text box to iput

More information

What are we going to learn? CSC Data Structures Analysis of Algorithms. Overview. Algorithm, and Inputs

What are we going to learn? CSC Data Structures Analysis of Algorithms. Overview. Algorithm, and Inputs What are we goig to lear? CSC316-003 Data Structures Aalysis of Algorithms Computer Sciece North Carolia State Uiversity Need to say that some algorithms are better tha others Criteria for evaluatio Structure

More information

COSC 1P03. Ch 7 Recursion. Introduction to Data Structures 8.1

COSC 1P03. Ch 7 Recursion. Introduction to Data Structures 8.1 COSC 1P03 Ch 7 Recursio Itroductio to Data Structures 8.1 COSC 1P03 Recursio Recursio I Mathematics factorial Fiboacci umbers defie ifiite set with fiite defiitio I Computer Sciece sytax rules fiite defiitio,

More information

COMP Parallel Computing. PRAM (1): The PRAM model and complexity measures

COMP Parallel Computing. PRAM (1): The PRAM model and complexity measures COMP 633 - Parallel Computig Lecture 2 August 24, 2017 : The PRAM model ad complexity measures 1 First class summary This course is about parallel computig to achieve high-er performace o idividual problems

More information

Python Programming: An Introduction to Computer Science

Python Programming: An Introduction to Computer Science Pytho Programmig: A Itroductio to Computer Sciece Chapter 6 Defiig Fuctios Pytho Programmig, 2/e 1 Objectives To uderstad why programmers divide programs up ito sets of cooperatig fuctios. To be able to

More information

Chapter 1. Introduction to Computers and C++ Programming. Copyright 2015 Pearson Education, Ltd.. All rights reserved.

Chapter 1. Introduction to Computers and C++ Programming. Copyright 2015 Pearson Education, Ltd.. All rights reserved. Chapter 1 Itroductio to Computers ad C++ Programmig Copyright 2015 Pearso Educatio, Ltd.. All rights reserved. Overview 1.1 Computer Systems 1.2 Programmig ad Problem Solvig 1.3 Itroductio to C++ 1.4 Testig

More information

CMSC Computer Architecture Lecture 12: Virtual Memory. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 12: Virtual Memory. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 12: Virtual Memory Prof. Yajig Li Uiversity of Chicago A System with Physical Memory Oly Examples: most Cray machies early PCs Memory early all embedded systems

More information

Filter design. 1 Design considerations: a framework. 2 Finite impulse response (FIR) filter design

Filter design. 1 Design considerations: a framework. 2 Finite impulse response (FIR) filter design Filter desig Desig cosideratios: a framework C ı p ı p H(f) Aalysis of fiite wordlegth effects: I practice oe should check that the quatisatio used i the implemetatio does ot degrade the performace of

More information

Computer Graphics Hardware An Overview

Computer Graphics Hardware An Overview Computer Graphics Hardware A Overview Graphics System Moitor Iput devices CPU/Memory GPU Raster Graphics System Raster: A array of picture elemets Based o raster-sca TV techology The scree (ad a picture)

More information

Outline and Reading. Analysis of Algorithms. Running Time. Experimental Studies. Limitations of Experiments. Theoretical Analysis

Outline and Reading. Analysis of Algorithms. Running Time. Experimental Studies. Limitations of Experiments. Theoretical Analysis Outlie ad Readig Aalysis of Algorithms Iput Algorithm Output Ruig time ( 3.) Pseudo-code ( 3.2) Coutig primitive operatios ( 3.3-3.) Asymptotic otatio ( 3.6) Asymptotic aalysis ( 3.7) Case study Aalysis

More information

New Results on Energy of Graphs of Small Order

New Results on Energy of Graphs of Small Order Global Joural of Pure ad Applied Mathematics. ISSN 0973-1768 Volume 13, Number 7 (2017), pp. 2837-2848 Research Idia Publicatios http://www.ripublicatio.com New Results o Eergy of Graphs of Small Order

More information

Stone Images Retrieval Based on Color Histogram

Stone Images Retrieval Based on Color Histogram Stoe Images Retrieval Based o Color Histogram Qiag Zhao, Jie Yag, Jigyi Yag, Hogxig Liu School of Iformatio Egieerig, Wuha Uiversity of Techology Wuha, Chia Abstract Stoe images color features are chose

More information

Fundamentals of Media Processing. Shin'ichi Satoh Kazuya Kodama Hiroshi Mo Duy-Dinh Le

Fundamentals of Media Processing. Shin'ichi Satoh Kazuya Kodama Hiroshi Mo Duy-Dinh Le Fudametals of Media Processig Shi'ichi Satoh Kazuya Kodama Hiroshi Mo Duy-Dih Le Today's topics Noparametric Methods Parze Widow k-nearest Neighbor Estimatio Clusterig Techiques k-meas Agglomerative Hierarchical

More information

One advantage that SONAR has over any other music-sequencing product I ve worked

One advantage that SONAR has over any other music-sequencing product I ve worked *gajedra* D:/Thomso_Learig_Projects/Garrigus_163132/z_productio/z_3B2_3D_files/Garrigus_163132_ch17.3d, 14/11/08/16:26:39, 16:26, page: 647 17 CAL 101 Oe advatage that SONAR has over ay other music-sequecig

More information

Extending The Sleuth Kit and its Underlying Model for Pooled Storage File System Forensic Analysis

Extending The Sleuth Kit and its Underlying Model for Pooled Storage File System Forensic Analysis Extedig The Sleuth Kit ad its Uderlyig Model for Pooled File System Foresic Aalysis Frauhofer Istitute for Commuicatio, Iformatio Processig ad Ergoomics Ja-Niclas Hilgert* Marti Lambertz Daiel Plohma ja-iclas.hilgert@fkie.frauhofer.de

More information

1.2 Binomial Coefficients and Subsets

1.2 Binomial Coefficients and Subsets 1.2. BINOMIAL COEFFICIENTS AND SUBSETS 13 1.2 Biomial Coefficiets ad Subsets 1.2-1 The loop below is part of a program to determie the umber of triagles formed by poits i the plae. for i =1 to for j =

More information

SPIRAL DSP Transform Compiler:

SPIRAL DSP Transform Compiler: SPIRAL DSP Trasform Compiler: Applicatio Specific Hardware Sythesis Peter A. Milder (peter.milder@stoybroo.edu) Fraz Frachetti, James C. Hoe, ad Marus Pueschel Departmet of ECE Caregie Mello Uiversity

More information

Speeding-up dynamic programming in sequence alignment

Speeding-up dynamic programming in sequence alignment Departmet of Computer Sciece Aarhus Uiversity Demark Speedig-up dyamic programmig i sequece aligmet Master s Thesis Dug My Hoa - 443 December, Supervisor: Christia Nørgaard Storm Pederse Implemetatio code

More information

Operating System Concepts. Operating System Concepts

Operating System Concepts. Operating System Concepts Chapter 4: Mass-Storage Systems Logical Disk Structure Logical Disk Structure Disk Schedulig Disk Maagemet RAID Structure Disk drives are addressed as large -dimesioal arrays of logical blocks, where the

More information

27 Refraction, Dispersion, Internal Reflection

27 Refraction, Dispersion, Internal Reflection Chapter 7 Refractio, Dispersio, Iteral Reflectio 7 Refractio, Dispersio, Iteral Reflectio Whe we talked about thi film iterferece, we said that whe light ecouters a smooth iterface betwee two trasparet

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor. Part A Datapath Design

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor. Part A Datapath Design COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Chapter The Processor Part A path Desig Itroductio CPU performace factors Istructio cout Determied by ISA ad compiler. CPI ad

More information

1. SWITCHING FUNDAMENTALS

1. SWITCHING FUNDAMENTALS . SWITCING FUNDMENTLS Switchig is the provisio of a o-demad coectio betwee two ed poits. Two distict switchig techiques are employed i commuicatio etwors-- circuit switchig ad pacet switchig. Circuit switchig

More information

Load balanced Parallel Prime Number Generator with Sieve of Eratosthenes on Cluster Computers *

Load balanced Parallel Prime Number Generator with Sieve of Eratosthenes on Cluster Computers * Load balaced Parallel Prime umber Geerator with Sieve of Eratosthees o luster omputers * Soowook Hwag*, Kyusik hug**, ad Dogseug Kim* *Departmet of Electrical Egieerig Korea Uiversity Seoul, -, Rep. of

More information

LU Decomposition Method

LU Decomposition Method SOLUTION OF SIMULTANEOUS LINEAR EQUATIONS LU Decompositio Method Jamie Traha, Autar Kaw, Kevi Marti Uiversity of South Florida Uited States of America kaw@eg.usf.edu http://umericalmethods.eg.usf.edu Itroductio

More information

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe CHAPTER 18 Strategies for Query Processig Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe Itroductio DBMS techiques to process a query Scaer idetifies

More information

Chapter 9. Pointers and Dynamic Arrays. Copyright 2015 Pearson Education, Ltd.. All rights reserved.

Chapter 9. Pointers and Dynamic Arrays. Copyright 2015 Pearson Education, Ltd.. All rights reserved. Chapter 9 Poiters ad Dyamic Arrays Copyright 2015 Pearso Educatio, Ltd.. All rights reserved. Overview 9.1 Poiters 9.2 Dyamic Arrays Copyright 2015 Pearso Educatio, Ltd.. All rights reserved. Slide 9-3

More information

Alpha Individual Solutions MAΘ National Convention 2013

Alpha Individual Solutions MAΘ National Convention 2013 Alpha Idividual Solutios MAΘ Natioal Covetio 0 Aswers:. D. A. C 4. D 5. C 6. B 7. A 8. C 9. D 0. B. B. A. D 4. C 5. A 6. C 7. B 8. A 9. A 0. C. E. B. D 4. C 5. A 6. D 7. B 8. C 9. D 0. B TB. 570 TB. 5

More information

Dynamic Programming and Curve Fitting Based Road Boundary Detection

Dynamic Programming and Curve Fitting Based Road Boundary Detection Dyamic Programmig ad Curve Fittig Based Road Boudary Detectio SHYAM PRASAD ADHIKARI, HYONGSUK KIM, Divisio of Electroics ad Iformatio Egieerig Chobuk Natioal Uiversity 664-4 Ga Deokji-Dog Jeoju-City Jeobuk

More information

Lecture 2: Spectra of Graphs

Lecture 2: Spectra of Graphs Spectral Graph Theory ad Applicatios WS 20/202 Lecture 2: Spectra of Graphs Lecturer: Thomas Sauerwald & He Su Our goal is to use the properties of the adjacecy/laplacia matrix of graphs to first uderstad

More information

FEATURE BASED RECOGNITION OF TRAFFIC VIDEO STREAMS FOR ONLINE ROUTE TRACING

FEATURE BASED RECOGNITION OF TRAFFIC VIDEO STREAMS FOR ONLINE ROUTE TRACING FEATURE BASED RECOGNITION OF TRAFFIC VIDEO STREAMS FOR ONLINE ROUTE TRACING Christoph Busch, Ralf Dörer, Christia Freytag, Heike Ziegler Frauhofer Istitute for Computer Graphics, Computer Graphics Ceter

More information
