Improving High Level Synthesis Optimization Opportunity Through Polyhedral Transformations


Wei Zuo 2,5, Yun Liang 1, Peng Li 1, Kyle Rupnow 3, Deming Chen 2,3 and Jason Cong 1,4
1 Center for Energy-Efficient Computing and Applications, School of EECS, Peking University, China
2 University of Illinois, Urbana-Champaign, USA
3 Advanced Digital Science Center, Singapore
4 Computer Science Department, University of California, Los Angeles, USA
5 School of Information and Electronics, Beijing Institute of Technology, China
{weizuo,dchen}@illinois.edu, {ericlyun,peng.li}@pku.edu.cn, k.rupnow@adsc.com.sg, cong@cs.ucla.edu

ABSTRACT
High level synthesis (HLS) is an important enabling technology for the adoption of hardware accelerator technologies. It promises the performance and energy efficiency of hardware designs with a lower barrier to entry in design expertise, and shorter design time. State-of-the-art high level synthesis now includes a wide variety of powerful optimizations that implement efficient hardware. These optimizations can implement some of the most important features generally performed in manual designs, including parallel hardware units, pipelining of execution both within a hardware unit and between units, and fine-grained data communication. We may generally classify the optimizations as those that optimize hardware implementation within a code block (intra-block) and those that optimize communication and pipelining between code blocks (inter-block). However, both optimizations are in practice difficult to apply. Real-world applications contain data-dependent blocks of code and communicate through complex data access patterns. Existing high level synthesis tools cannot apply these powerful optimizations unless the code is inherently compatible, severely limiting the optimization opportunity. In this paper we present an integrated framework to model and enable both intra- and inter-block optimizations. This integrated technique substantially improves the opportunity to use the powerful HLS optimizations that implement parallelism, pipelining, and fine-grained communication.
Our polyhedral model-based technique systematically defines a set of data access patterns, identifies effective data access patterns, and performs the loop transformations to enable the intra- and inter-block optimizations. Our framework automatically explores transformation options, performs code transformations, and inserts the appropriate HLS directives to implement the HLS optimizations. Furthermore, our framework can automatically generate the optimized communication blocks for fine-grained communication between hardware blocks. Experimental evaluation demonstrates that we can achieve an average of 6.04X speedup over the high level synthesis solution without our transformations to enable intra- and inter-block optimizations.

Categories and Subject Descriptors
B.5.2 [Hardware]: [Design Aids] automatic synthesis

General Terms
Algorithm, Design, Performance

Keywords
Polyhedral, High Level Synthesis, FPGA

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. FPGA'13, February 11-13, 2013, Monterey, California, USA. Copyright 2013 ACM.

1. INTRODUCTION
FPGAs have long been adopted for computation acceleration, especially in domains that also demand power and energy efficient computing. Their highly flexible architecture enables significant optimization opportunities. Designers can implement fine-grained computation units, highly parallel architectures, fine-grained pipelining of computation units, efficient communication structures, and customized memory and compute unit partitioning. However, the flexibility that is a strength for optimization opportunities is also a challenge for efficient programmability.
FPGA implementation remains a significant challenge for prospective users: manual design at register transfer level (RTL) is error-prone and difficult to debug. Such manual design often takes weeks or months, in comparison to the hours or days needed to implement an algorithm in software. Furthermore, efficient manual design typically requires significant specialized knowledge: efficient FPGA implementations must consider architecture-specific parameters and implement functions for efficient mapping to the FPGA's fixed-size allocation quota (LUT, BRAM, DSP, ...).

High level synthesis (HLS) seeks to address these problems. HLS offers automated translation from high level languages (e.g., C, C++, SystemC, Haskell and CUDA) to register transfer level (RTL) implementations to reduce the design effort, automated optimization to reduce the requirement of design knowledge, and FPGA-specific mapping to automate low-level optimization choices. Thus, HLS promises to be a critical bridging technology that offers the latency and power/energy benefits of FPGA-based hardware acceleration at the design effort (and expertise) of software development. State-of-the-art HLS tools cover a wide range of input source code and achieve high-quality results [8].

Figure 1: An example of two data-dependent blocks. In sub-figure (e), we assume the parallelization degree is 2.

(a) Original source code

    /* Block 1 */
    for (i = 1; i < N; i++)
      for (j = 1; j < N; j++) {
        g[i][j] = f1(u[i][j], u[i-1][j], u[i+1][j], u[i][j-1], u[i][j+1]);
      }
    /* Block 2 */
    for (i = 1; i < N; i++)
      for (j = 1; j < N; j++) {
        u[i][j] = f2(u[i][j], u[i][j+1], u[i][j-1], u[i+1][j], u[i-1][j],
                     g[i][j+1], g[i][j-1], g[i+1][j], g[i-1][j]);
      }

(b) Diagonal access (panels: dependence of Block 2; diagonal access of Block 2)

(c) Transformed source code

    /* Block 1 */
    for (i1 = 1; i1 < 2 * N - 2; i1++)
      for (j1 = max(i1 - N + 1, 0); j1 < min(N - 1, i1); j1++) {
        g[i1-j1][j1] = f1(u[i1-j1][j1], u[i1-j1][j1-1], u[i1-j1-1][j1],
                          u[i1-j1+1][j1], u[i1-j1][j1+1]);
      }
    /* Block 2 */
    for (i1 = 1; i1 < 2 * N - 2; i1++)
      for (j1 = max(i1 - N + 1, 0); j1 < min(N - 1, i1); j1++) {
        u[i1-j1][j1] = f2(u[i1-j1][j1], u[i1-j1][j1-1], u[i1-j1-1][j1],
                          u[i1-j1+1][j1], u[i1-j1][j1+1],
                          g[i1-j1][j1-1], g[i1-j1-1][j1], g[i1-j1+1][j1], g[i1-j1][j1+1]);
      }

(d) Loop skewing

(e) Parallelization and pipelining (Block 1, Block 2)

These tools have achieved significant improvement; however, recent studies show that although these tools can offer high quality designs for small kernels, there is still a significant performance gap between HLS and manual design for real-world complex applications [23, 17, 10]. For example, [23, 17] demonstrated a 40X difference between HLS and the manual design for a high-definition stereo matching implementation. Small kernels (often used as simple HLS benchmarks) contain a single block (a loop nest), but real-world applications often contain many data-dependent blocks that communicate through complex data access patterns. Whereas a video processing kernel may be a single nested loop performing a median filter, a real application would employ a sequence of data-dependent blocks, including blocks such as image up/down sampling, cost aggregation, and energy minimization. Video processing applications are often structured so that each processing step is a loop nest (block) that operates on array inputs and produces array outputs, such that block i + 1 reads the array variables written by block i.
As seen in efficient manual RTL implementations of these algorithms, it is crucial to minimize the communication granularity, pipeline the data-dependent blocks, and duplicate compute units to improve both throughput and latency. However, existing HLS tools fail to enable intra-block parallelization and inter-block pipelining when the data access patterns are complex [23, 17]. Although advanced commercial tools support these optimizations, they are not always able to efficiently identify the opportunity and transform source code to enable such optimizations. Code transformations to enable these important and powerful optimizations are thus critical for HLS tools.

One of the primary goals of HLS is to reduce design effort; thus, HLS tools must efficiently support a variety of code, including source code written by software engineers with little hardware expertise, and software not written for use with HLS. Software programmers choose data access patterns based on properties such as intuitive ordering, cache locality, and code modularity. Even software written for HLS may still contain loops that demand transformation; efficient inter-block optimization may demand non-intuitive iteration ordering, such that even experienced hardware designers may have difficulty seeing an intuitive ordering of blocks that enables optimization. Thus, it is quite natural that the default data access pattern does not support intra-block parallelization or inter-block pipelining.

Nevertheless, data access patterns can be altered via loop transformations such as permutation, reversal, and skewing [24]. Using these transformations, we can enable intra-block parallelization by reordering loop iterations so that successive iterations are independent; similarly, we can enable inter-block pipelining by ensuring that block i produces data in the order that block i + 1 consumes data (by transforming block i, block i + 1, or both). Both intra-block parallelization and inter-block pipelining are currently supported by existing HLS tools. The problem is rather that these optimizations cannot always be enabled with the default data access patterns.
Thus, the goal of this work is to enable efficient integrated use of intra- and inter-block optimizations through loop transformation.

Motivating Example. We illustrate the concept of intra- and inter-block optimization using the Denoise [11] application in medical imaging (Figure 1). The source code in Figure 1 (a) is modified to highlight the data access pattern, with computation arithmetic omitted for clarity. The application is composed of two blocks (loop nests), where the first block writes to the g array that is subsequently read by the second block. Both blocks are representative of stencil codes, where the array update operation follows a fixed input dependence pattern that reads neighboring elements. In this example, only the second block has data dependence between loop iterations, as shown in Figure 1 (b), and the first block writes the data array g in the same order that the second block consumes array g (column order). Thus, in this example, intra-block parallelization is available in the first block and inter-block pipelining is available by default between the blocks, but a transform is required to enable parallelization in the second block. However, note that if the second block is transformed to enable parallelization, then the first block must also be transformed in order to retain the opportunity for inter-block pipelining (see footnote 1).

Table 1: Comparison of different implementations

    Implementation                Cycles   Frequency   Speedup
    1  w/o transform, w/o opt     ...      ... MHz     1
    2  w/o transform, w/ opt      ...      ... MHz     ...
    3  w/ transform, w/ opt       ...      ... MHz     ...

In Table 1 we can see the performance in cycles and speedup over the original source for this example. By default, the original source code supports some optimization, as seen in the second implementation: inter-block pipelining improves the performance, but the latency of the individual blocks becomes the performance bottleneck.
Note that in the second implementation, we do not perform partial parallelization (e.g., parallelize the first block only) because if the throughputs are not matched, the buffering between the blocks would not be feasible. In the third implementation, the loop transformation is used to alleviate the dependence problem and retain the opportunity for inter-block pipelining. In Figure 1 (c) and (d), we perform loop skewing on both blocks to traverse the loop in a diagonal fashion from the bottom-right to top-left, as shown in Figure 1 (b). This loop skewing both preserves the data dependence and enables parallelization of the inner loop, because the iterations on the same diagonal line are independent. The first block can also use the loop skewing transformation safely because it does not have any data dependences. The optimized execution schedule is shown in Figure 1 (e). This transformation significantly improves the speedup opportunity from about 3X to 31X speedup. In this case, some optimizations were supported by default, but there was still significant speedup opportunity by ensuring that both optimizations were enabled. In the experiments section, we will demonstrate that some source codes support neither parallelization nor pipelining by default, but loop transformations can enable both intra-block parallelization and inter-block pipelining.

Footnote 1: For efficient implementation of inter-block pipelining, we also implement memory partition optimization and customized communication blocks, which are discussed later in the paper.

As this brief motivation discussion demonstrates, real-world applications commonly contain multiple data-dependent blocks with communication through complex data access patterns. For such applications, the powerful optimization functions of existing high level synthesis tools may not be supported by default due to the data access patterns. In this paper we develop an integrated method to model, analyze, and transform block data access patterns to support these important and powerful optimizations. Using polyhedral models [24, 13, 20, 6, 5], we represent the loop iteration space, dependences, data access patterns, and loop transformations in a linear algebraic form. For loops amenable to this algebraic representation, the polyhedral model allows us to analytically determine valid loop transformations that best support both intra-block parallelization and inter-block pipelining.
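The loop skewing of the motivating example can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the authors' generated code: `f2` is a hypothetical integer stand-in for the omitted arithmetic, the `g` operands are dropped for brevity, and the skewed bounds are derived directly for the interior domain 1..N-1, so they differ slightly from the bounds printed in Figure 1 (c). The check at the end confirms that walking diagonals d = i + j yields exactly the same result as the original row-by-row order, which is what makes the inner loop safe to unroll.

```c
#include <assert.h>
#include <string.h>

#define N 8  /* arrays are (N+1) x (N+1) so that index N is a valid halo cell */

static int imax(int a, int b) { return a > b ? a : b; }
static int imin(int a, int b) { return a < b ? a : b; }

/* Hypothetical stand-in for f2: an asymmetric integer 5-point stencil, so the
   two traversal orders can be compared bit for bit. */
static int f2(int c, int e, int w, int s, int n) {
    return (c + 2 * e + 3 * w + 5 * s + 7 * n) % 1009;
}

/* Original order: row by row. u[i][j-1] and u[i-1][j] are already updated,
   u[i][j+1] and u[i+1][j] still hold old values. */
void block2_original(int u[N + 1][N + 1]) {
    for (int i = 1; i < N; i++)
        for (int j = 1; j < N; j++)
            u[i][j] = f2(u[i][j], u[i][j + 1], u[i][j - 1], u[i + 1][j], u[i - 1][j]);
}

/* Skewed order: the outer loop walks diagonals d = i + j. Every point on one
   diagonal reads only diagonals d-1 (already updated) and d+1 (not yet
   updated), so the inner-loop iterations are mutually independent. */
void block2_skewed(int u[N + 1][N + 1]) {
    for (int d = 2; d <= 2 * (N - 1); d++)
        for (int j = imax(1, d - (N - 1)); j <= imin(N - 1, d - 1); j++) {
            int i = d - j;
            assert(i >= 1 && i < N);  /* stays inside the original domain */
            u[i][j] = f2(u[i][j], u[i][j + 1], u[i][j - 1], u[i + 1][j], u[i - 1][j]);
        }
}

/* Returns 1 when both traversal orders produce identical arrays. */
int skewed_matches_original(void) {
    int a[N + 1][N + 1], b[N + 1][N + 1];
    for (int i = 0; i <= N; i++)
        for (int j = 0; j <= N; j++)
            a[i][j] = b[i][j] = (31 * i + 17 * j) % 97;
    block2_original(a);
    block2_skewed(b);
    return memcmp(a, b, sizeof a) == 0;
}
```

Calling `skewed_matches_original()` exercises both versions on the same input; an integer stencil is used only so that equality can be tested exactly.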
This paper advances the state-of-the-art of high level synthesis with:

• An automated polyhedral model-based framework that systematically identifies effective access patterns and applies appropriate loop transformations that enable intra- and inter-block optimizations.

• An automated framework to generate communication interfaces between blocks.

• We demonstrate that our automated framework achieves an average of 6.04X speedup over HLS with all available optimizations but without the transformations that enable them.

This paper is organized as follows. Section 2 briefly provides the background of the polyhedral model and introduces the useful notation. Section 3 presents our framework, which defines data access patterns, identifies data access patterns for effective optimizations, and discusses the implementation of fine-grained data communication modules. Section 4 presents experimental results. Section 5 discusses related work, and conclusions are presented in the final section.

2. BACKGROUND AND NOTATION
In this section we briefly introduce the polyhedral model and the notation that we will use in this paper. A detailed description of polyhedral models can be found in [5, 6, 13, 20]. In this work we use the polyhedral model to consider data access patterns for communication between sets of loop nests, and to optimize this communication ordering in order to perform fine-grained communication and enable parallelization within a loop nest and pipelining between loop nests through loop transformation. In particular, we consider programs that consist of a sequence of data-dependent blocks, where each block is a loop nest containing multiple statements, and each statement may have multiple accesses to data arrays.

2.1 Polyhedral Model
Polyhedral models can be used to represent execution information of a program's loop nests, such as the loop iteration domain, statement/iteration dependences, array access functions, and scheduling functions (execution order).

DEFINITION 1 (Polyhedron). The set of all vectors $\vec{x} \in \mathbb{Z}^n$ such that $A\vec{x} + \vec{b} \ge 0$, where $A \in \mathbb{Z}^{m \times n}$ and $\vec{b} \in \mathbb{Z}^m$.
Each row of A represents a half-space that limits $A\vec{x} + \vec{b}$ to non-negative values, and the polyhedron is the intersection of the m half-spaces represented by the m rows of A. A bounded polyhedron is a polytope.

DEFINITION 2 (Iteration Vector and Domain). The iteration vector $\vec{i}$ of a loop nest represents an execution instance of the loop nest. $\vec{i}$ contains the values of the loop indices of all the surrounding loops. The iteration domain $D_L$ of loop nest L is the set of all iteration vectors that satisfy the loop-bound constraints. Because an iteration domain is bounded by these loop-bound constraints, it is a polytope.

DEFINITION 3 (Schedule Function). Given an m-dimensional loop nest, a d-dimensional ($1 \le d \le m$) schedule $F(\vec{i})$ is defined as

$$F(\vec{i}) = S\vec{i} + \vec{o} = \begin{pmatrix} C_{11} & C_{12} & \cdots & C_{1m} \\ C_{21} & C_{22} & \cdots & C_{2m} \\ \vdots & \vdots & & \vdots \\ C_{d1} & C_{d2} & \cdots & C_{dm} \end{pmatrix} \vec{i} + \begin{pmatrix} C_{10} \\ C_{20} \\ \vdots \\ C_{d0} \end{pmatrix}$$

where $S \in \mathbb{Z}^{d \times m}$, $\vec{o} \in \mathbb{Z}^d$, and $C_{ij} \in \mathbb{Z}$.

The schedule function takes the iteration vector as input and produces an ordering of the loop iterations using the matrix S and an offset vector $\vec{o}$. Thus, the schedule function creates a particular ordering of iterations by mapping each iteration in the domain to a timestamp, where mapping m dimensions into fewer (d) timestamps implies that some of the iterations can be performed in parallel. The schedule function has the additional benefit that the timestamp ordering also corresponds to a lexicographic ordering. For example, for iteration vectors $\vec{i}_1$ and $\vec{i}_2$, $\vec{i}_1$ is scheduled before $\vec{i}_2$ if $F(\vec{i}_1) < F(\vec{i}_2)$.

DEFINITION 4 (Transformation Function). In the polyhedral model, the transformation function is represented as a sequence of schedules applied to perform the transformation.

In theory, any loop transformation can be represented in the polyhedral model. In this paper we focus on a set of unimodular loop transformations [24], including loop reversal (e.g., reversing the traversal direction), loop permutation (e.g., interchanging two loops), skewing (e.g., making the inner-loop bounds dependent on the outer-loop bounds), and their composite transformations.
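Definition 3 can be illustrated with a small C sketch that evaluates a 2-dimensional schedule $F(\vec{i}) = S\vec{i} + \vec{o}$ and compares timestamps lexicographically. The type and function names (`Schedule`, `timestamp`, `runs_before`, `reversed_outer_loop`) are hypothetical and chosen only for this illustration. The example schedule models a loop nest whose outer loop counts down from n - 1 to 0, so the iteration with the largest outer index receives the earliest timestamp.

```c
#include <assert.h>

#define DIM 2  /* 2-deep loop nest, 2-dimensional schedule (d = m = 2) */

typedef struct {
    int S[DIM][DIM];  /* schedule matrix */
    int o[DIM];       /* constant offset vector */
} Schedule;

/* t = S * iter + o */
void timestamp(const Schedule *f, const int iter[DIM], int t[DIM]) {
    for (int r = 0; r < DIM; r++) {
        t[r] = f->o[r];
        for (int c = 0; c < DIM; c++)
            t[r] += f->S[r][c] * iter[c];
    }
}

/* Lexicographic "less than" on timestamps. */
int lex_less(const int a[DIM], const int b[DIM]) {
    for (int k = 0; k < DIM; k++)
        if (a[k] != b[k])
            return a[k] < b[k];
    return 0;
}

/* Returns 1 if iteration (i1, j1) is scheduled before iteration (i2, j2). */
int runs_before(Schedule f, int i1, int j1, int i2, int j2) {
    int x[DIM] = { i1, j1 }, y[DIM] = { i2, j2 };
    int tx[DIM], ty[DIM];
    timestamp(&f, x, tx);
    timestamp(&f, y, ty);
    return lex_less(tx, ty);
}

/* Schedule of: for (i = n-1; i >= 0; i--) for (j = 0; j < n; j++) ...
   The outer timestamp is (n-1) - i, the inner timestamp is j. */
Schedule reversed_outer_loop(int n) {
    assert(n > 0);
    Schedule f = { { { -1, 0 }, { 0, 1 } }, { n - 1, 0 } };
    return f;
}
```

For example, with n = 4 the call `runs_before(reversed_outer_loop(4), 3, 0, 0, 0)` returns 1, since iteration (3, 0) maps to timestamp (0, 0) and iteration (0, 0) maps to (3, 0).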
The schedule function (execution order) of the iterations of a loop nest can be altered through loop transformations.

Figure 2 illustrates the polyhedral model using a concrete example. The original source code, which consists of two data-dependent blocks (loop nests), is shown in Figure 2 (a). The first block writes to array B, and the second block subsequently reads from it. Without transformation, each individual block can be parallelized, but their execution cannot be overlapped through pipelining as they access array B in different order. In Figure 2 (b), we show the iteration domain (polytope) of the two blocks, which establishes the lower and upper bounds for each loop dimension. In Figure 2 (c), we show the scheduling function that corresponds to the original source code for each of the loop nests. In order to transform the original scheduling function to achieve a desired data access pattern, we can define a desired data access pattern and derive the necessary loop transformations; a proof that this derivation is always possible is shown in Section 3.

Figure 2: An example of polyhedral representation.
(a) Original source code. Block 1 (statement S1) runs i1 from N - 1 down to 0 and j1 from 0 to N - 1, and accumulates elements of A into B. Block 2 (statement S2) runs i2 and j2 from 0 to N - 1, and accumulates products of elements of B and C into D.
(b) Iteration domain. Each domain spans 0 to N - 1 in both dimensions.
(c) Schedule function. $F_{S1}(\vec{i}) = \begin{pmatrix} -1 & 0 \\ 0 & 1 \end{pmatrix}\vec{i} + \begin{pmatrix} N-1 \\ 0 \end{pmatrix}$ with $\vec{i} = (i_1, j_1)^T$, and $F_{S2}(\vec{i}) = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}\vec{i} + \begin{pmatrix} 0 \\ 0 \end{pmatrix}$ with $\vec{i} = (i_2, j_2)^T$.

As described above, the polyhedral model (including iteration domain, iteration vector, scheduling function and transformation function) describes the iteration domain and order for any loop nest that has all array accesses as affine expressions of loop indices. In addition, dependency relations between iterations in the program can be represented in the polyhedral model. Two iterations of a loop nest (or two instances of blocks) are dependent if they access the same array locations, and at least one of the accesses is a write operation. These dependences can be represented by additional rows in the iteration domain to establish constraints between the loop iterations. Loop transformations are valid only if they also preserve the additional program dependency constraints.

3. METHODOLOGY
Our integrated intra- and inter-block optimization framework takes a data-dependent multi-block program as input, and performs three steps, as shown in Figure 3. The goal of our optimization is to minimize the overall latency (maximize the performance speedup).
First, we systematically define a set of data access patterns, classify them, and derive the associated loop transformations (Section 3.1). Next, for each loop transformation that validly preserves data dependences, we estimate the performance improvement (Section 3.2) and choose the best estimated performance. The performance estimation models both intra- and inter-block speedup and associated implementation overheads. The intra-block parallelization degree is determined by the resource usage of the program and the available resources on the implementation platform. Finally, for the chosen transformation, we automatically perform the loop transformations, insert high level synthesis directives, and generate the communication blocks that interface the data-dependent blocks. If the communication block is a simple FIFO, we automatically insert FIFO high level synthesis directives; when the communication interface requires multiple reads or a stencil pattern, we automatically customize the communication blocks (Section 3.3). The final output of the flow is an optimized RTL design.

Note that in prior works [5, 6, 13, 18, 20, 24], polyhedral models consider data access patterns for external memory accesses and loop transformations in order to optimize data locality and maximize parallelism for CPUs. Optimizations for CPU code attempt to minimize memory bandwidth and improve cache behavior. In addition, the transformed CPU code often has complex control flow that is not suitable for efficient FPGA implementation. Thus, the transformation chosen for optimization on a CPU platform may be the wrong decision for our HLS optimization. In this work, we use the polyhedral model to define data access patterns for a different objective. We aim to optimize the inter-block communication and enable intra-block parallelization and inter-block pipelining through loop transformation for FPGAs using HLS. For our objective, we model the FPGA-specific features.
Therefore, although the candidate data access patterns might be the same as in prior work, we apply different transformations in different combinations in order to meet our optimization objective. Although the underlying techniques are all based on polyhedral models, different objectives lead to different ways to select data access patterns and loop transformations.

Figure 3: Optimization Framework. (Flow: data-dependent multi-block program; define data access pattern and derive the associated loop transformation; performance estimation covering intra-block, inter-block, and overhead; transform the computation blocks and generate the communication blocks; optimized RTL design.)

3.1 Classification of Array Access Patterns
We model the array access patterns using the polyhedral model; thus, we assume that all array accesses are affine expressions of loop indices and constants. The program inputs are composed of multiple data-dependent blocks, where each block contains a single multi-dimensional loop nest. Let us consider a loop nest of dimensionality D that accesses an N-dimensional array. The array access pattern is defined by a matrix M whose size is $N \times D$, where the rows (i) represent the data access pattern in dimension i of the data array, and the columns (j) represent the access pattern in loop level j. Given the array access pattern M, loop iteration vector $\vec{i}$, and constant offset vector $\vec{o}$, the array access vector $\vec{s}$ is defined as

$$\vec{s} = M\vec{i} + \vec{o}$$

$\vec{s}$ is a column vector of size N, where each row (i) represents array accesses in dimension i, and the offset vector is a constant offset into that dimension. Figure 4 shows an example of the array access pattern and vector for the writes to array B; similar access patterns and vectors could be derived for the reads from array A.

    for (i = 0; i < N; i++)
      for (j = 0; j < N; j++)
        B[i][j] = A[i][j] + A[i - 1][j];

$$M = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad \vec{s} = \begin{pmatrix} i \\ j \end{pmatrix}$$

Figure 4: An example of array access pattern.

In the following, we demonstrate the data access patterns for 2-dimensional arrays and 2-dimensional loop nests. For ease of illustration, we classify the access patterns below, although the polyhedral framework treats these access patterns in a uniform way. In the framework, we need to define the candidate access patterns that can be evaluated for the program; in this work we limit the access patterns to the simple set of access patterns below. The polyhedral model can be easily extended to handle a wide variety of additional access patterns, and our work can use any additional access patterns to estimate the performance benefit. We leave the extension to future work.

Now we describe how to define the array access patterns using M. Let

$$M = \begin{pmatrix} a_1 & b_1 \\ a_2 & b_2 \end{pmatrix}$$

We classify the array access patterns based on the values of $a_1$, $a_2$, $b_1$ and $b_2$. For array accesses with non-unit loop strides, we perform loop normalization as a preprocessing step so that our analysis can assume unit loop stride.

Column and Reverse Column.

$$M = \begin{pmatrix} \pm 1 & 0 \\ 0 & \pm 1 \end{pmatrix}$$

The array access vector can be obtained by $M\vec{i}$. Figure 5 shows the four patterns in this category, where the signs of $a_1$ and $b_2$ determine the traversal direction. For example, with outer loop index i and inner loop index j, if $a_1 = 1$ and $b_2 = 1$, then the outer loop traverses increasing values of i, and the inner loop traverses increasing values of j.

Row and Reverse Row.

$$M = \begin{pmatrix} 0 & \pm 1 \\ \pm 1 & 0 \end{pmatrix}$$

Similar to column and reverse column, there are four patterns in this category, and the signs of $b_1$ and $a_2$ determine the traversal directions.

Diagonal Access. In this category, the loop traverses in a diagonal line fashion. We further divide this into two cases based on the slopes of the diagonal lines.

slope $\ge$ 1.

$$M = \begin{pmatrix} \pm 1 & b_1 \\ 0 & \pm 1 \end{pmatrix}, \quad N > b_1 \ge 1$$

The array access vector can be obtained by $M\vec{i}$.
The slope of the diagonal line is determined by $b_1$, and the signs of $a_1$ and $b_2$ determine the traversal directions. When $b_1 \ge N$, the traversal order reduces to one of the column access orders. Figure 6 shows the four data access patterns for slope = 1.

slope < 1.

$$M = \begin{pmatrix} a_1 & \pm 1 \\ \pm 1 & 0 \end{pmatrix}, \quad N > a_1 > 1$$

The slope of the diagonal line is determined by $\frac{1}{a_1}$, and the signs of $a_2$ and $b_1$ determine the traversal directions. When $a_1 \ge N$, the traversal order reduces to one of the row access orders.

All of the array access patterns defined above are unimodular matrices where $|a_1 b_2 - a_2 b_1| = 1$. Thus, all of these access patterns can be achieved by unimodular loop transformations of the block(s) [24].

Loop Transformation. Loop transformations can change the schedule (e.g., execution order) of loop iterations such that the data access pattern can be changed. Thus, here we derive the loop transformation given a desired data access pattern.

THEOREM 3.1. The transformation function T required for the desired data access pattern $M_{des}$ can be obtained by

$$T = M_{des}^{-1} M_{ori} F_{ori}^{-1}$$

where $M_{ori}$ and $F_{ori}$ are the data access pattern and schedule function of the source code without transformation.

PROOF. Let $\vec{i}'$ be the loop iterator vector after transformation. Thus, the desired array access vector after transformation is $M_{des}\vec{i}'$. Let us assume the schedule function and data access pattern of the original code are $F_{ori}$ and $M_{ori}$, respectively. Thus, the schedule function after transformation is $T F_{ori}$. The schedule function maps the original loop iterations to a new ordering of the loop iterations. Thus,

$$T F_{ori}\,\vec{i} = \vec{i}'$$

Thus,

$$\vec{i} = F_{ori}^{-1} T^{-1}\,\vec{i}'$$

The data accessed by the iterations before and after transformation is the same. Thus,

$$M_{ori} F_{ori}^{-1} T^{-1}\,\vec{i}' = M_{des}\,\vec{i}' \;\Rightarrow\; T = M_{des}^{-1} M_{ori} F_{ori}^{-1}$$

The data access patterns and loop transformations are all unimodular matrices. For a unimodular matrix, we can always derive its inverse matrix.

3.2 Performance Metric
In the previous subsection we defined a set of data access patterns that include row, column, and diagonal access with different slopes and directions.
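To make Theorem 3.1 concrete before moving on, the following C sketch evaluates $T = M_{des}^{-1} M_{ori} F_{ori}^{-1}$ for 2 x 2 unimodular matrices. It is only an illustration: the `Mat2` type and all function names are hypothetical, and offset vectors are ignored, as in the proof. The first example assumes an identity access pattern and identity schedule (an access such as g[i][j] in natural loop order) and the desired diagonal pattern g[i1-j1][j1] from Figure 1 (c); the second assumes a loop nest whose outer loop is reversed and a desired forward traversal.

```c
#include <assert.h>

/* 2x2 integer matrix [a b; c d] */
typedef struct { int a, b, c, d; } Mat2;

Mat2 mat2_mul(Mat2 x, Mat2 y) {
    Mat2 r = { x.a * y.a + x.b * y.c, x.a * y.b + x.b * y.d,
               x.c * y.a + x.d * y.c, x.c * y.b + x.d * y.d };
    return r;
}

/* Inverse of a unimodular matrix: adj(m) / det with det = +1 or -1,
   so dividing by det is the same as multiplying by det. */
Mat2 mat2_inv_unimodular(Mat2 m) {
    int det = m.a * m.d - m.b * m.c;
    assert(det == 1 || det == -1);
    Mat2 r = { det * m.d, -det * m.b, -det * m.c, det * m.a };
    return r;
}

/* T = M_des^{-1} * M_ori * F_ori^{-1}  (Theorem 3.1, offsets ignored) */
Mat2 derive_transform(Mat2 m_des, Mat2 m_ori, Mat2 f_ori) {
    return mat2_mul(mat2_inv_unimodular(m_des),
                    mat2_mul(m_ori, mat2_inv_unimodular(f_ori)));
}

/* Example 1: identity access and schedule, desired access g[i1-j1][j1]. */
Mat2 skew_example(void) {
    Mat2 identity = { 1, 0, 0, 1 };
    Mat2 m_des = { 1, -1, 0, 1 };
    return derive_transform(m_des, identity, identity);
}

/* Example 2: outer loop originally reversed, desired forward traversal. */
Mat2 reversal_example(void) {
    Mat2 identity = { 1, 0, 0, 1 };
    Mat2 f_ori = { -1, 0, 0, 1 };
    return derive_transform(identity, identity, f_ori);
}
```

`skew_example()` yields T = [1 1; 0 1], that is i1 = i + j and j1 = j, which is the skewing used in the motivating example; `reversal_example()` yields T = [-1 0; 0 1], a reversal of the outer loop.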
We also described how to derive the required loop transformation for a given data access pattern. Our design objective is to maximize the performance speedup. Now, we discuss the process of evaluating all of the candidate data access patterns and choosing the candidate that maximizes application speedup. To perform this evaluation, we develop a performance metric that combines modeling of both intra- and inter-block speedup and their associated implementation overhead.

Figure 5: An example of column access pattern. There are 4 patterns with different traverse directions: (a) left to right (i), bottom to up (j); (b) left to right (i), top to down (j); (c) right to left (i), bottom to up (j); (d) right to left (i), top to down (j).

Figure 6: An example of diagonal access pattern with slope = 1. There are 4 patterns with different traverse directions: (a) bottom-left to top-right (i), bottom-right to top-left (j); (b) bottom-left to top-right (i), top-left to bottom-right (j); (c) top-right to bottom-left (i), bottom-right to top-left (j); (d) top-right to bottom-left (i), top-left to bottom-right (j).

We define a program as a sequence of K blocks $\{b_1, b_2, \ldots, b_K\}$, where each block $b_i$ has a d-dimensional loop nest that accesses an n-dimensional data array. In this work we require that the loop nest dimensionality and data-array dimensionality are equal for all sets of communicating blocks in the program, so we denote both the loop nest dimensions and the data array dimensions as N. The latency in clock cycles of each block $b_i$ without intra-block optimization is $lat_i$. If there is any control flow between the blocks, we consider the worst-case path; thus, the total baseline program latency in clock cycles (without any optimization) is the simple sum of the blocks' latencies:

$$lat_{base} = \sum_{i=1}^{K} lat_i$$

$lat_i$ for each block is estimated by performing block-wise high level synthesis and using the HLS-generated performance estimates (see footnote 2). When no optimizations are applied, we have verified that these estimates are accurate by comparing them with post-synthesis simulations. However, the HLS performance estimates can be inaccurate for modeling the parallelism, especially because the inter-block pipelining hides execution latency. Therefore, we develop an alternative performance metric that can estimate this optimization effect. Note that our performance estimation is based on clock cycles only. By default, we apply intra-block pipelining (e.g., loop pipelining within a block by setting the initiation interval (II)) for all the blocks and set the same target period for all the implementations.
Thus, the clock period tends to be similar across different implementations, and we ignore the effect of clock period in our estimation.

Footnote 2: We use the commercial AutoPilot [25] HLS tool to estimate $lat_i$.

Let P be the set of all candidate data access patterns. Given $p \in P$, we can estimate the performance of each block by the intra-block parallelization factor, and then the performance of the entire program with the inter-block pipelining that overlaps the blocks' execution. In this work, we target inter-block pipelining, and correspondingly, it always makes sense to parallelize each block to the same factor to match the blocks' throughput and minimize inter-block buffering. Thus, we can simplify the program performance estimate in clock cycles as follows:

$$lat_p = \frac{lat_{base}}{S^{intra}_p \times S^{inter}_p} + cost_p$$

where $S^{intra}_p$ and $S^{inter}_p$ represent the intra- and inter-block speedup (discussed next), respectively, and $cost_p$ represents the implementation overhead.

For each block, we can fit multiple data processing pipelines for parallel processing. In high level synthesis, duplicate data processing pipelines are implemented by unrolling the inner loop. As we discussed above, we want to match the blocks' parallelism degree; therefore we search for the maximum unrolling factor where all blocks in the program can be unrolled to the same factor. For block $b_i$, we use $R_i$ to denote its resource usage. $R_i$ is a 4-tuple that represents its resource usage in FFs, LUTs, BRAMs, and DSPs. Then the total communication cost (in resources) between block i and block i + 1 is estimated as another 4-tuple $Comm_i$, where we use the HLS estimates for $R_i$ and $Comm_i$. We have observed and experimentally validated that the HLS resource usage estimates correlate with actual resource usage after logic synthesis, although they tend to be conservatively high estimates. In particular, the HLS estimates for FF resource use and LUT resource use tend to be high. As we scale the unroll factor, this overestimate is compounded due to underestimation of resource sharing and LUT/FF packing into slices. Therefore, we empirically determined another correlation factor tuple $\alpha$ that represents this scaling; unlike the communication factor, we use the same $\alpha$ factor for all blocks in a design and for all designs in our benchmark set. In total, we predict the actual resource usage as follows:

$$\left( \sum_{i=1}^{K} (R_i + Comm_i) \right) \times \alpha \times par$$

where par represents the intra-block parallelization degree. Then, given the fixed resource budget of the FPGA, we can derive the maximally allowed parallelization degree, $Max_{par}$. We define $S^{intra}_p$ as follows:

$$S^{intra}_p = \begin{cases} Max_{par} & \text{pattern } p \text{ enables parallelization for all the blocks} \\ 1 & \text{otherwise} \end{cases}$$

Similarly, for the inter-block pipelining factor $S^{inter}_p$, if we can transform all of the blocks to follow the same data access pattern for inter-block communication, then we can fully enable inter-block pipelining. Therefore, we define $S^{inter}_p$ as follows:

$$S^{inter}_p = \begin{cases} K & \text{pattern } p \text{ enables pipelining} \\ 1 & \text{otherwise} \end{cases}$$

where K is the number of blocks in the program. Note that we ignore the pipelining fill and drain overhead in the above estimation.

Finally, the implementation of the access patterns may incur additional overhead, which can significantly affect the program's performance estimate. Row (slope = 0) and column (slope = $\infty$) patterns and their reverses are easy to implement without any overhead, but diagonal access patterns require loop skewing, as shown in Figure 1. Therefore, the inner-loop iteration domains are a function of outer-loop index values. When slope = 1, we can use min and max operations to bound the inner-loop iteration domains, as shown in Figure 1 (c); but with 1 < slope < N or 0 < slope < 1, we must use a combination of min, max, ceil and floor functions to bound the inner-loop iteration domains. As discussed previously, we apply intra-block pipelining (e.g., loop pipelining within a block) for all the blocks by default to improve the performance. However, to enable intra-block pipelining, the inner-loop bounds have to be constants for HLS tools. Thus, for diagonal access patterns, we use the maximum loop bound (N) for the inner loop but add extra if-else conditions to filter out the false loop iterations. The if-else statement compares the current inner-loop index with its domain and executes the loop body only if the condition is true.
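The constant-bound form of a skewed loop can be sketched in C as follows. This is an illustrative sketch rather than the framework's generated code: the names `IterCount`, `skewed_with_guard` and `count_guarded_iterations` are hypothetical, the domain is taken as 0..N-1 in both dimensions, and the loop body is replaced by a visit counter. The inner loop always runs N times, and the guard discards the iterations that fall outside the original domain.

```c
#include <assert.h>

#define N 8

typedef struct {
    int real_iters;   /* iterations that execute the loop body */
    int false_iters;  /* iterations filtered out by the guard */
} IterCount;

/* Diagonal (slope = 1) traversal with a constant inner trip count of N. */
IterCount skewed_with_guard(int visited[N][N]) {
    IterCount c = { 0, 0 };
    for (int d = 0; d <= 2 * (N - 1); d++) {   /* diagonal index d = i + j */
        for (int j = 0; j < N; j++) {          /* constant bound, pipelinable */
            int i = d - j;
            if (i >= 0 && i < N) {             /* guard: inside the domain? */
                visited[i][j]++;               /* stands in for the loop body */
                c.real_iters++;
            } else {
                c.false_iters++;               /* false iteration, no work */
            }
        }
    }
    return c;
}

/* Runs the traversal and checks that every point is visited exactly once. */
IterCount count_guarded_iterations(void) {
    int visited[N][N] = { { 0 } };
    IterCount c = skewed_with_guard(visited);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            assert(visited[i][j] == 1);
    return c;
}
```

With N = 8, the guard lets N * N = 64 iterations through and filters N * (N - 1) = 56 false iterations out of the (2N - 1) * N = 120 total, which is the overhead that the cost term of the performance model accounts for.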
The if-else comparison that filters out false loop iterations takes an extra cycle, and the number of false iterations depends on the domain of the outer loop. When slope > 1, the domain of the outer loop increases linearly with the slope; when slope < 1, the domain of the outer loop increases linearly with 1/slope. Hence, the performance overhead $cost_p$ is defined as follows:

$$cost_p = \begin{cases} 0 & slope = 0 \text{ or } slope = \infty \\ C \times slope & 1 \le slope < N \\ C \times \dfrac{1}{slope} & 0 < slope < 1 \end{cases}$$

where $C$ is a constant.

In our formulation, the intra-block parallelization and inter-block pipelining are performed globally as a coarse-grained optimization. It is possible that a fine-grained optimization that selects different parallelization and pipelining parameters among different sets of blocks and/or communication paths could yield better performance. We expect that this could be implemented by applying this technique separately to subsets of program blocks, with additional refinement to the performance model and buffers between the blocks; we leave these tasks for future work.

Pattern Selection. As shown previously, there are multiple data access patterns, including row, column and diagonal patterns with a variety of slopes and traversal directions. With high-dimension loop nests, this can be a large number of candidate patterns; however, we can easily prune the search space by using the data dependences to discard candidates that would violate them. For stencil applications, the update of one data item depends on its neighbors in the surrounding stencil window, which is usually small compared to the size of the entire array. For diagonal access, we only need to focus on the data access patterns with 1/n ≤ slope ≤ n, where n is the stencil window dimension. All remaining slope values (n < slope < N and 1/N < slope < 1/n) are equivalent to slope = n or slope = 1/n in terms of the traversal order of data items in the stencil window, except that they incur additional overhead in implementing the loop bounds.
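The pruning argument above can be checked against the overhead model with a small worked example. The values here are assumptions chosen only for illustration: a stencil window dimension n = 2, an array size N > 3, and C left symbolic.

$$
\begin{aligned}
slope = 1 &: \quad cost_p = C \times 1 = C \\
slope = 2 &: \quad cost_p = C \times 2 = 2C \\
slope = 3 &: \quad cost_p = C \times 3 = 3C \\
slope = \tfrac{1}{2} &: \quad cost_p = C \times \frac{1}{1/2} = 2C \\
slope = \tfrac{1}{3} &: \quad cost_p = C \times \frac{1}{1/3} = 3C
\end{aligned}
$$

With n = 2, slope = 3 visits the items of the stencil window in the same order as slope = 2 but costs 3C instead of 2C, and slope = 1/3 likewise costs more than slope = 1/2. Under this model, nothing is lost by restricting the candidates to 1/n ≤ slope ≤ n.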
This traversal order is the only factor that affects the ability to apply intra-block parallelization and inter-block pipelining. Using these dependences to prune the list, it is in practice feasible to estimate performance for each of the candidates and choose the best. Note that we only explore a subset of legal polyhedral transformations in this work. In theory, it is also possible to select the optimal pattern following the polyhedral optimization flow in [20], where all the FPGA-specific features discussed above, including resource modeling and performance metrics, have to be encoded as constraints and cost functions for the optimization problem. In practice, our pattern selection solution runs very fast for all the tested benchmarks.

3.3 Implementation

Our framework is an automatic flow consisting of three logical steps: we automatically transform source code, use HLS tools to enable optimizations and synthesis, and generate FIFO interfaces between computation and communication blocks. If the communication block requires a large communication buffer and a complex multi-read communication interface, we automatically insert customized communication blocks.

For the first step, we integrate our framework into the PoCC polyhedral framework [1]. Our framework defines data access patterns, estimates performance, selects desired patterns based on performance metrics and performs loop transformations. The PoCC framework operates at the C source-code level; we have also modified the framework to automatically produce source code compatible with our chosen HLS tool, AutoPilot [25]³.

Next, we use the AutoPilot HLS tool to enable optimizations and synthesize the computation blocks. AutoPilot provides a set of directives for optimization, including loop_unrolling and pipeline for the intra-block parallelization and pipelining, and the dataflow and AP_FIFO directives for inter-block pipelining (with communication through a FIFO interface). AutoPilot will execute unrolled loop iterations in parallel if there are no data dependences between the iterations.
Similarly, AutoPilot will overlap the execution of data-dependent blocks if they use the same access pattern and can thus communicate through a FIFO interface. However, for blocks that have multiple accesses to an array, either due to the algorithm (e.g., stencil codes read a pattern of neighboring data) or parallelization (e.g., multiple iterations execute and access data in parallel), a memory bandwidth bottleneck often prevents the design from reaching the expected intra-block parallelism and inter-block pipelining. To alleviate the memory bottleneck, we implement the memory partitioning technique [16], which partitions the arrays into several banks and enables more array accesses per clock cycle.

³ AutoESL was acquired by Xilinx; AutoPilot is now part of Vivado HLS.
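A minimal sketch of how such directives might appear in transformed C code is given below. The pragma spellings follow Vivado HLS syntax (PIPELINE, UNROLL, DATAFLOW, STREAM, ARRAY_PARTITION), which is an assumption here, since the text names the AutoPilot directives only as loop_unrolling, pipeline, dataflow and AP_FIFO. The function names, the array length LEN, the FIFO depth and the factor 4 are all hypothetical. A standard C compiler ignores these pragmas, so the code also runs as plain software.

```c
#define LEN 16  /* hypothetical data size, for illustration only */

/* Block 1: produces one element per iteration, in index order. */
static void block1(const int in[LEN], int fifo[LEN])
{
    for (int i = 0; i < LEN; i++) {
#pragma HLS PIPELINE II=1
        fifo[i] = in[i] * 2;
    }
}

/* Block 2: consumes elements in the same order block 1 produced them,
 * which is the condition for FIFO-based inter-block pipelining. */
static void block2(const int fifo[LEN], int out[LEN])
{
    for (int i = 0; i < LEN; i++) {
#pragma HLS PIPELINE II=1
        out[i] = fifo[i] + 1;
    }
}

/* Inter-block pipelining: the two blocks overlap through a FIFO. */
void top(const int in[LEN], int out[LEN])
{
#pragma HLS DATAFLOW
    int fifo[LEN];
#pragma HLS STREAM variable=fifo depth=4
    block1(in, fifo);
    block2(fifo, out);
}

/* Intra-block parallelization: iterations are independent, so the loop
 * can be unrolled; partitioning the arrays into banks lets the unrolled
 * copies access memory in the same clock cycle. */
void scale_parallel(const int a[LEN], int b[LEN])
{
#pragma HLS ARRAY_PARTITION variable=a cyclic factor=4 dim=1
#pragma HLS ARRAY_PARTITION variable=b cyclic factor=4 dim=1
    for (int i = 0; i < LEN; i++) {
#pragma HLS UNROLL factor=4
        b[i] = a[i] * 3;
    }
}
```

If block2 read the FIFO data in a different order than block1 wrote it, the DATAFLOW and STREAM directives in this sketch could no longer be honored. That is the situation the loop transformations are meant to remove.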

It is important to emphasize again that although AutoPilot provides these directives for intra- and inter-block optimizations, the optimizations cannot always be applied with the default data access patterns, and AutoPilot is unable to identify the opportunities to enable them through transformations. Thus, the importance of this work is not just that we generate AutoPilot directives that are already available, but that we automatically increase the number of situations in which the directives can be used.

In the first two steps, we can handle applications with simple inter-block communication patterns that can be transformed to use the FIFO interface automatically. However, some communication patterns are more complex; a block may perform multiple data reads per inner-loop iteration. In these situations, we need an optimized communication block. In particular, we automatically insert a reuse buffer [15] that stores data that will be temporally reused. The implementation of the reuse buffer can use multi-port BRAMs or registers depending on resource and buffer sizes, as shown in Figure 7.

[Figure 7: Communication block. Block 1 writes to a stream (write(stream, data)) inside its loop nest; the communication block, built from BRAM banks (Bank 0, Bank 1), registers and a decoder, buffers the data; Block 2 reads from the stream (read(stream, data)) inside its loop nest.]

4. EXPERIMENTAL EVALUATION

In this section, we present our experimental results using a set of real-world applications that contain multiple data-dependent blocks and communicate through complex data access patterns. We first discuss the experimental setup and the evaluated benchmarks. Then, we show the performance improvement of our proposed framework.

4.1 Experimental Setup

For our experiments, we use a set of benchmarks from PolyBench 3.0 [1] and some real-world applications from [11]. Table 2 describes the benchmark details. Our framework is based on the polyhedral compiler infrastructure PoCC 1.1 [1]. PoCC is a source-to-source compiler that includes a set of tools for polyhedral compilation. It extracts the polyhedral intermediate representation at the source code level.
We modify PoCC to define data access patterns, evaluate FPGA architecture-specific performance, and perform loop transformations and code generation. PoCC also provides libraries for program dependence checking. The output of our modified PoCC is transformed C code with AutoPilot pragmas to enable the HLS optimizations. Then, we use the AutoPilot HLS tool to synthesize the transformed C code into Verilog RTL. As previously discussed, AutoPilot supports intra- and inter-block optimizations. We automatically insert directives into the configuration file to enable these optimizations. The target FPGA platform is the Xilinx Virtex-6 LX75T. We synthesize the RTL generated by AutoPilot using Xilinx ISE 13.1 and gather area and clock period data. To compute the operating frequency, we take the frequency implied by the achievable clock period in the synthesis report and round it down to an integer. The clock cycles are collected through simulation using ModelSim 6.1. Finally, we compute latency using the operating frequency and clock cycles, and compute speedup using the latency value.

Table 2: Benchmarks

Benchmark | Description
Deconv    | Image Rician Deconvolution [11]
Denoise   | Image Rician Denoising [11]
Seg       | Image Segmentation [11]
Seidel    | Seidel stencil computation [1]
Jacobi    | Jacobi stencil computation [1]

4.2 Performance Improvement

We compare the performance of three different implementations. The first implementation is the baseline: it is the original source code without intra-block parallelization or inter-block pipelining optimization. The second implementation is an improved version: it applies the intra-block parallelization and inter-block pipelining optimizations to the original source code wherever they are supported without code transformation. The first and second implementations do not require source code transformation. Although the second implementation tries to use the optimizations, they cannot be enabled if the default data access pattern does not support them. The third implementation is our proposed implementation: it transforms the source code and then enables the intra- and inter-block optimizations.
Note that by default, we apply intra-block pipelining within each individual block for all the implementations and benchmarks.

Performance Comparison. Table 3 describes the clock cycles, resources and achieved frequencies of the three implementations for all the benchmarks. Table 4 presents the details of the chosen data access patterns and the optimizations enabled with and without transformation for all the benchmarks. Compared to the baseline design, the second and third implementations improve the latency in clock cycles at the expense of more resource usage by using the pipelining and parallelization optimizations. This demonstrates that with greater opportunity for intra-block parallelization and inter-block pipelining, HLS tools can effectively use additional FPGA resources.

The latency speedup is shown in Figure 8. The speedup is normalized to the baseline implementation. Compared to the baseline, the second implementation (without transformation, with optimization) achieves an average speedup of 4.89X. Our proposed implementation (with transformation, with optimization) achieves further speedup by enabling more optimizations through polyhedral transformations. Our proposed implementation achieves an average speedup of 29.59X over the non-optimized source code and 6.04X over code that is optimized but not transformed to enable optimizations. The average speedup is computed using the geometric mean.

The performance improvement is threefold: first, the intra-block parallelization improves computation latency within blocks; second, the inter-block pipelining improves computation latency by overlapping the execution of blocks; and third, our memory and communication block optimization improves the memory bandwidth. We again use the example of the Denoise benchmark to demonstrate these points; the second implementation employs inter-block pipelining in conjunction with memory partitioning. These two optimizations improve clock cycles by overlapping the execution of two blocks and allowing multiple memory accesses per clock cycle.
Without code transformation, the intra-block parallelization is available for the first block but not the second block, as shown in Figure 1. We do not perform partial parallelization (e.g., parallelizing the first block only) together with inter-block pipelining because, if the throughputs are not matched, it would require a buffer size proportional to the total data size, which is not feasible in general. The third implementation transforms the code to a diagonal access pattern. Then, in addition to inter-block pipelining and memory customization, it enables intra-block parallelization for both blocks, which allows multiple iterations to execute simultaneously. The third implementation maximizes resource use such that each block is parallelized to the same degree and fits in the Virtex-6 LX75T (which provides 288 DSPs alongside its LUT, FF and BRAM resources). For Denoise, the intra-block parallelization degree is 9, given the resources on the FPGA. As a result, our implementation achieves 31.09X and 9.14X speedup over the first and second implementations, respectively.

[Table 3: Performance and resource comparison of different implementations. Columns: Benchmark, Implementation, Cycles, LUT, FF, DSP, BRAM, Frequency (MHz). Rows: the three implementations (w/o trans, w/o opt; w/o trans, w/ opt; w/ trans, w/ opt) for each of Deconv, Denoise, Seg, Seidel and Jacobi.]

Table 4: Details of optimizations and data access pattern

Benchmark | Optimizations (w/o trans) | Optimizations (w/ trans)                    | Selected Data Access Pattern
Deconv    | inter-block pipeline      | inter-block pipeline & intra-block parallel | diagonal (slope = 1)
Denoise   | inter-block pipeline      | inter-block pipeline & intra-block parallel | diagonal (slope = 1)
Seg       | none                      | inter-block pipeline & intra-block parallel | diagonal (slope = 1)
Seidel    | inter-block pipeline      | inter-block pipeline & intra-block parallel | diagonal (slope = 2)
Jacobi    | intra-block parallel      | inter-block pipeline & intra-block parallel | row

In addition, our optimization produces the side benefit of improved operating frequency. As stated earlier, we assume that the clock period is not affected by our transformations and make no specific effort to improve the clock period.
However, because our technique implements simplified memory and communication interfaces, we also have the benefit of reducing the complexity of addressing and communication structures. Therefore, we also improve the design's critical path and get a corresponding modest improvement in operating frequency, as shown in Table 3.

[Figure 8: Latency speedup comparison of the three implementations (w/o trans, w/o opt; w/o trans, w/ opt; w/ trans, w/ opt) for Deconv, Denoise, Seg, Seidel and Jacobi.]

Data Access Pattern and Optimization. For benchmarks Denoise, Deconv and Seidel, inter-block pipelining can be enabled without transformation because the different blocks communicate in the same order, but intra-block parallelization cannot be enabled with the default data access order. Our implementation transforms them into diagonal access order with different slopes, which enables both intra- and inter-block optimization. The slope of the data access order is chosen based on the performance metric we developed in Section 3.2. In particular, we choose slope = 2 for Seidel as it allows simultaneous execution of the iterations on the same diagonal line and retains inter-block pipelining. For benchmark Seg, neither intra-block parallelization nor inter-block pipelining can be enabled without transformation. Thus, the first and second implementations are the same, as shown in Table 3. Our implementation transforms it into a diagonal access pattern (slope = 1), which enables both intra- and inter-block optimization. For benchmark Jacobi, intra-block parallelization is available without transformation as there are no dependences among iterations, but inter-block pipelining cannot be enabled as the two blocks produce and consume data in different orders. Our implementation transforms it to enable both intra- and inter-block optimization. For all the benchmarks, our implementation successfully enables both intra-block parallelization and inter-block pipelining optimizations.

5. RELATED WORK

High level synthesis has seen significant advances in recent years, improving the quality of input source languages.
There are many currently active HLS tools in both industry and academia, such as [8, 25, 4, 7, 19, 14]. Leading HLS tools support various intra-block and inter-block optimizations, including pipelining and parallelization. With such powerful optimizations, HLS offers increased productivity with lower design effort; however, in practice these transformations are difficult to apply: only certain data access patterns are supported, limiting the applicability of an important HLS feature. Recent studies show that there is still a significant performance gap between manual designs and HLS-generated designs [23, 17, 10], and the inability to apply these optimizations is one of the causes of this gap. Real-world applications contain multiple data-dependent blocks that communicate through complex data access patterns, with source code that was not originally intended for HLS. These applications commonly have original data access patterns that do not support the HLS optimizations; thus, it is critical to transform the source code to enable these optimizations.

Prior work on optimization opportunities for data-dependent blocks addressed pipelining and communication techniques individually [26, 22, 9]. Ziegler et al. developed coarse-grained pipelining for data-dependent loops that finds efficient communication schemes [26], but they assume that the data access order is already identical between the communicating loops. Cong et al. developed a resource-constrained scheduling formulation for the communication problem that can find the optimal communication order [9], but the technique requires completely unrolled loops, as well as additional storage and computation overhead for communication reordering. Similarly, Rodrigues et al. presented a fine-grained synchronization technique with hardware inter-stage buffers to manage the communication [22]. These prior works retain the original execution order but apply techniques to reorder communication; however, these techniques may require large inter-stage buffers in order to handle the reordering, which limits the feasible problem sizes they can support. Furthermore, these works manually optimize communication interfaces, rather than relying on automated optimization such as that performed for high level synthesis. In our technique, we instead rely on loop transformations to convert the execution order so that the communication order is optimized, while simultaneously minimizing the need for inter-stage buffers.
Polyhedral-model-based loop transformations rest on a mathematical model that can represent any composition of loop transformations using affine transformations [24, 20, 13]. In the past, polyhedral models have been used for maximizing parallelism while minimizing communication for parallel computing [18, 2]. Recently, polyhedral models have been used in high level synthesis for FPGAs to optimize on-chip memory bandwidth [12, 21], or to optimize the SDRAM bandwidth [3]. In contrast, our approach is to optimize multiple data-dependent blocks simultaneously in order to match their data access patterns and thus simultaneously optimize both intra-block parallelism and inter-block pipelining.

6. CONCLUSION

High level synthesis is a critical technology to ease the adoption of hardware accelerator resources. However, although current high level synthesis offers a variety of powerful optimizations, the implementation constraints limit the applicability and thus the impact of the optimizations. Input source codes commonly have complex data access patterns that do not inherently support important high level synthesis optimization techniques, but polyhedral models can model the data access patterns and find loop transformations that enable these important parallelization and pipelining optimizations. We have presented an integrated technique using polyhedral models to model and enable both intra- and inter-block optimizations. This integrated technique substantially improves the opportunity to use HLS optimizations for parallelism, pipelining, and fine-grained communication. Our framework automatically identifies data access patterns and candidate loop transformations, evaluates the performance benefit of each candidate to select the best option, performs the code transformations, and inserts the HLS code directives and communication structures to implement the optimized hardware. Experimental evaluation demonstrates an average of 6.04X speedup over high level synthesis without our transformations to enable intra- and inter-block optimizations.

7. ACKNOWLEDGMENTS

This work was partially supported by the National High Technology Research and Development Program of China 2012AA and A*STAR Singapore under the Human Sixth Sense Project.

8. REFERENCES

[1] PoCC. The polyhedral compiler collection. http://
[2] Nawaaz Ahmed, Nikolay Mateev, and Keshav Pingali. Synthesizing transformations for locality enhancement of imperfectly-nested loop nests. Int. J. Parallel Program., 29(5).
[3] Samuel Bayliss and George A. Constantinides. Optimizing SDRAM bandwidth for custom FPGA loop accelerators. In FPGA.
[4] Thomas Bollaert. High-Level Synthesis: From Algorithm to Digital Circuit, chapter Catapult synthesis: A practical introduction to interactive C synthesis. Springer.
[5] Uday Bondhugula et al. Automatic transformations for communication-minimized parallelization and locality optimization in the polyhedral model. In CC.
[6] Uday Bondhugula, Albert Hartono, J. Ramanujam, and P. Sadayappan. A practical automatic polyhedral parallelizer and locality optimizer. In PLDI.
[7] Andrew Canis et al. LegUp: high-level synthesis for FPGA-based processor/accelerator systems. In FPGA, pages 33-36.
[8] Jason Cong et al. High-level synthesis for FPGAs: From prototyping to deployment. IEEE TCAD.
[9] Jason Cong, Yiping Fan, Guoling Han, Wei Jiang, and Zhiru Zhang. Behavior and communication co-optimization for systems with sequential communication media. In DAC.
[10] Jason Cong, Muhuan Huang, and Yi Zou. Accelerating fluid registration algorithm on multi-FPGA platforms. In FPL, pages 50-57.
[11] Jason Cong, Vivek Sarkar, Glenn Reinman, and Alex Bui. Customizable domain-specific computing. IEEE Des. Test, 28(2):6-15.
[12] Jason Cong, Peng Zhang, and Yi Zou. Optimizing memory hierarchy allocation with loop transformations for high-level synthesis. In DAC.
[13] Paul Feautrier. Some efficient solutions to the affine scheduling problem: I. One-dimensional time. Int. J. Parallel Program., 21(5).
[14] Swathi T. Gurumani et al. High-level synthesis of multiple dependent CUDA kernels on FPGA. In ASPDAC.
[15] Ilya Issenin, Erik Brockmeyer, Miguel Miranda, and Nikil Dutt. DRDU: A data reuse analysis technique for efficient scratch-pad memory management. ACM Trans. Des. Autom. Electron. Syst., 12(2).
[16] Peng Li et al. Memory partitioning and scheduling co-optimization in behavioral synthesis. In ICCAD.
[17] Yun Liang et al. High-level synthesis: Productivity, performance, and software constraints. J. Electrical and Computer Engineering.
[18] Amy W. Lim, Gerald I. Cheong, and Monica S. Lam. An affine partitioning algorithm to maximize parallelism and minimize communication. In ICS.
[19] Alex Papakonstantinou et al. Multilevel granularity parallelism synthesis on FPGAs. In FCCM. IEEE.
[20] Louis-Noël Pouchet et al. Loop transformations: convexity, pruning and optimization. In POPL.
[21] Louis-Noël Pouchet, Peng Zhang, P. Sadayappan, and Jason Cong. Polyhedral-based data reuse optimization for configurable computing. In FPGA.
[22] Rui Rodrigues, Joao M. P. Cardoso, and Pedro C. Diniz. A data-driven approach for pipelining sequences of data-dependent loops. In FCCM.
[23] Kyle Rupnow et al. High level synthesis of stereo matching: Productivity, performance, and software constraints. In FPT, pages 1-8.
[24] Michael E. Wolf and Monica S. Lam. A loop transformation theory and an algorithm to maximize parallelism. IEEE Trans. Parallel Distrib. Syst., 2(4).
[25] Zhiru Zhang et al. High-Level Synthesis: From Algorithm to Digital Circuit, chapter AutoPilot: a platform-based ESL synthesis system. Springer.
[26] Heidi E. Ziegler, Mary W. Hall, and Pedro C. Diniz. Compiler-generated communication for pipelined FPGA applications. In DAC.


More information

Parallel Numerics. 1 Preconditioning & Iterative Solvers (From 2016)

Parallel Numerics. 1 Preconditioning & Iterative Solvers (From 2016) Technsche Unverstät München WSe 6/7 Insttut für Informatk Prof. Dr. Thomas Huckle Dpl.-Math. Benjamn Uekermann Parallel Numercs Exercse : Prevous Exam Questons Precondtonng & Iteratve Solvers (From 6)

More information

Subspace clustering. Clustering. Fundamental to all clustering techniques is the choice of distance measure between data points;

Subspace clustering. Clustering. Fundamental to all clustering techniques is the choice of distance measure between data points; Subspace clusterng Clusterng Fundamental to all clusterng technques s the choce of dstance measure between data ponts; D q ( ) ( ) 2 x x = x x, j k = 1 k jk Squared Eucldean dstance Assumpton: All features

More information

A Binarization Algorithm specialized on Document Images and Photos

A Binarization Algorithm specialized on Document Images and Photos A Bnarzaton Algorthm specalzed on Document mages and Photos Ergna Kavalleratou Dept. of nformaton and Communcaton Systems Engneerng Unversty of the Aegean kavalleratou@aegean.gr Abstract n ths paper, a

More information

Support Vector Machines

Support Vector Machines Support Vector Machnes Decson surface s a hyperplane (lne n 2D) n feature space (smlar to the Perceptron) Arguably, the most mportant recent dscovery n machne learnng In a nutshell: map the data to a predetermned

More information

Motivation. EE 457 Unit 4. Throughput vs. Latency. Performance Depends on View Point?! Computer System Performance. An individual user wants to:

Motivation. EE 457 Unit 4. Throughput vs. Latency. Performance Depends on View Point?! Computer System Performance. An individual user wants to: 4.1 4.2 Motvaton EE 457 Unt 4 Computer System Performance An ndvdual user wants to: Mnmze sngle program executon tme A datacenter owner wants to: Maxmze number of Mnmze ( ) http://e-tellgentnternetmarketng.com/webste/frustrated-computer-user-2/

More information

Cache Performance 3/28/17. Agenda. Cache Abstraction and Metrics. Direct-Mapped Cache: Placement and Access

Cache Performance 3/28/17. Agenda. Cache Abstraction and Metrics. Direct-Mapped Cache: Placement and Access Agenda Cache Performance Samra Khan March 28, 217 Revew from last lecture Cache access Assocatvty Replacement Cache Performance Cache Abstracton and Metrcs Address Tag Store (s the address n the cache?

More information

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr)

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr) Helsnk Unversty Of Technology, Systems Analyss Laboratory Mat-2.08 Independent research projects n appled mathematcs (3 cr) "! #$&% Antt Laukkanen 506 R ajlaukka@cc.hut.f 2 Introducton...3 2 Multattrbute

More information

LLVM passes and Intro to Loop Transformation Frameworks

LLVM passes and Intro to Loop Transformation Frameworks LLVM passes and Intro to Loop Transformaton Frameworks Announcements Ths class s recorded and wll be n D2L panapto. No quz Monday after sprng break. Wll be dong md-semester class feedback. Today LLVM passes

More information

Communication-Minimal Partitioning and Data Alignment for Af"ne Nested Loops

Communication-Minimal Partitioning and Data Alignment for Afne Nested Loops Communcaton-Mnmal Parttonng and Data Algnment for Af"ne Nested Loops HYUK-JAE LEE 1 AND JOSÉ A. B. FORTES 2 1 Department of Computer Scence, Lousana Tech Unversty, Ruston, LA 71272, USA 2 School of Electrcal

More information

Reducing Frame Rate for Object Tracking

Reducing Frame Rate for Object Tracking Reducng Frame Rate for Object Trackng Pavel Korshunov 1 and We Tsang Oo 2 1 Natonal Unversty of Sngapore, Sngapore 11977, pavelkor@comp.nus.edu.sg 2 Natonal Unversty of Sngapore, Sngapore 11977, oowt@comp.nus.edu.sg

More information

Cluster Analysis of Electrical Behavior

Cluster Analysis of Electrical Behavior Journal of Computer and Communcatons, 205, 3, 88-93 Publshed Onlne May 205 n ScRes. http://www.scrp.org/ournal/cc http://dx.do.org/0.4236/cc.205.350 Cluster Analyss of Electrcal Behavor Ln Lu Ln Lu, School

More information

Efficient Distributed File System (EDFS)

Efficient Distributed File System (EDFS) Effcent Dstrbuted Fle System (EDFS) (Sem-Centralzed) Debessay(Debsh) Fesehaye, Rahul Malk & Klara Naherstedt Unversty of Illnos-Urbana Champagn Contents Problem Statement, Related Work, EDFS Desgn Rate

More information

Loop Pipelining for High-Throughput Stream Computation Using Self-Timed Rings

Loop Pipelining for High-Throughput Stream Computation Using Self-Timed Rings Loop Ppelnng for Hgh-Throughput Stream Computaton Usng Self-Tmed Rngs Gennette Gll, John Hansen and Montek Sngh Dept. of Computer Scence Unv. of North Carolna, Chapel Hll, NC 27599, USA {gllg,jbhansen,montek}@cs.unc.edu

More information

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers IOSR Journal of Electroncs and Communcaton Engneerng (IOSR-JECE) e-issn: 78-834,p- ISSN: 78-8735.Volume 9, Issue, Ver. IV (Mar - Apr. 04), PP 0-07 Content Based Image Retreval Usng -D Dscrete Wavelet wth

More information

An Application of the Dulmage-Mendelsohn Decomposition to Sparse Null Space Bases of Full Row Rank Matrices

An Application of the Dulmage-Mendelsohn Decomposition to Sparse Null Space Bases of Full Row Rank Matrices Internatonal Mathematcal Forum, Vol 7, 2012, no 52, 2549-2554 An Applcaton of the Dulmage-Mendelsohn Decomposton to Sparse Null Space Bases of Full Row Rank Matrces Mostafa Khorramzadeh Department of Mathematcal

More information

CHAPTER 2 PROPOSED IMPROVED PARTICLE SWARM OPTIMIZATION

CHAPTER 2 PROPOSED IMPROVED PARTICLE SWARM OPTIMIZATION 24 CHAPTER 2 PROPOSED IMPROVED PARTICLE SWARM OPTIMIZATION The present chapter proposes an IPSO approach for multprocessor task schedulng problem wth two classfcatons, namely, statc ndependent tasks and

More information

AMath 483/583 Lecture 21 May 13, Notes: Notes: Jacobi iteration. Notes: Jacobi with OpenMP coarse grain

AMath 483/583 Lecture 21 May 13, Notes: Notes: Jacobi iteration. Notes: Jacobi with OpenMP coarse grain AMath 483/583 Lecture 21 May 13, 2011 Today: OpenMP and MPI versons of Jacob teraton Gauss-Sedel and SOR teratve methods Next week: More MPI Debuggng and totalvew GPU computng Read: Class notes and references

More information

A SYSTOLIC APPROACH TO LOOP PARTITIONING AND MAPPING INTO FIXED SIZE DISTRIBUTED MEMORY ARCHITECTURES

A SYSTOLIC APPROACH TO LOOP PARTITIONING AND MAPPING INTO FIXED SIZE DISTRIBUTED MEMORY ARCHITECTURES A SYSOLIC APPROACH O LOOP PARIIONING AND MAPPING INO FIXED SIZE DISRIBUED MEMORY ARCHIECURES Ioanns Drosts, Nektaros Kozrs, George Papakonstantnou and Panayots sanakas Natonal echncal Unversty of Athens

More information

Load Balancing for Hex-Cell Interconnection Network

Load Balancing for Hex-Cell Interconnection Network Int. J. Communcatons, Network and System Scences,,, - Publshed Onlne Aprl n ScRes. http://www.scrp.org/journal/jcns http://dx.do.org/./jcns.. Load Balancng for Hex-Cell Interconnecton Network Saher Manaseer,

More information

An Entropy-Based Approach to Integrated Information Needs Assessment

An Entropy-Based Approach to Integrated Information Needs Assessment Dstrbuton Statement A: Approved for publc release; dstrbuton s unlmted. An Entropy-Based Approach to ntegrated nformaton Needs Assessment June 8, 2004 Wllam J. Farrell Lockheed Martn Advanced Technology

More information

Programming in Fortran 90 : 2017/2018

Programming in Fortran 90 : 2017/2018 Programmng n Fortran 90 : 2017/2018 Programmng n Fortran 90 : 2017/2018 Exercse 1 : Evaluaton of functon dependng on nput Wrte a program who evaluate the functon f (x,y) for any two user specfed values

More information

Topology Design using LS-TaSC Version 2 and LS-DYNA

Topology Design using LS-TaSC Version 2 and LS-DYNA Topology Desgn usng LS-TaSC Verson 2 and LS-DYNA Wllem Roux Lvermore Software Technology Corporaton, Lvermore, CA, USA Abstract Ths paper gves an overvew of LS-TaSC verson 2, a topology optmzaton tool

More information

Algorithmic Transformation Techniques for Efficient Exploration of Alternative Application Instances

Algorithmic Transformation Techniques for Efficient Exploration of Alternative Application Instances In: Proc. 0th Int. Symposum on Hardware/Software Codesgn (CODES 02), Estes Park, Colorado, USA, May 6 8, 2002 Algorthmc Transformaton Technques for Effcent Exploraton of Alternatve Applcaton Instances

More information

Array transposition in CUDA shared memory

Array transposition in CUDA shared memory Array transposton n CUDA shared memory Mke Gles February 19, 2014 Abstract Ths short note s nspred by some code wrtten by Jeremy Appleyard for the transposton of data through shared memory. I had some

More information

Memory Modeling in ESL-RTL Equivalence Checking

Memory Modeling in ESL-RTL Equivalence Checking 11.4 Memory Modelng n ESL-RTL Equvalence Checkng Alfred Koelbl 2025 NW Cornelus Pass Rd. Hllsboro, OR 97124 koelbl@synopsys.com Jerry R. Burch 2025 NW Cornelus Pass Rd. Hllsboro, OR 97124 burch@synopsys.com

More information

TECHNIQUE OF FORMATION HOMOGENEOUS SAMPLE SAME OBJECTS. Muradaliyev A.Z.

TECHNIQUE OF FORMATION HOMOGENEOUS SAMPLE SAME OBJECTS. Muradaliyev A.Z. TECHNIQUE OF FORMATION HOMOGENEOUS SAMPLE SAME OBJECTS Muradalyev AZ Azerbajan Scentfc-Research and Desgn-Prospectng Insttute of Energetc AZ1012, Ave HZardab-94 E-mal:aydn_murad@yahoocom Importance of

More information

Mathematics 256 a course in differential equations for engineering students

Mathematics 256 a course in differential equations for engineering students Mathematcs 56 a course n dfferental equatons for engneerng students Chapter 5. More effcent methods of numercal soluton Euler s method s qute neffcent. Because the error s essentally proportonal to the

More information

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier Some materal adapted from Mohamed Youns, UMBC CMSC 611 Spr 2003 course sldes Some materal adapted from Hennessy & Patterson / 2003 Elsever Scence Performance = 1 Executon tme Speedup = Performance (B)

More information

Outline. Discriminative classifiers for image recognition. Where in the World? A nearest neighbor recognition example 4/14/2011. CS 376 Lecture 22 1

Outline. Discriminative classifiers for image recognition. Where in the World? A nearest neighbor recognition example 4/14/2011. CS 376 Lecture 22 1 4/14/011 Outlne Dscrmnatve classfers for mage recognton Wednesday, Aprl 13 Krsten Grauman UT-Austn Last tme: wndow-based generc obect detecton basc ppelne face detecton wth boostng as case study Today:

More information

Today Using Fourier-Motzkin elimination for code generation Using Fourier-Motzkin elimination for determining schedule constraints

Today Using Fourier-Motzkin elimination for code generation Using Fourier-Motzkin elimination for determining schedule constraints Fourer Motzkn Elmnaton Logstcs HW10 due Frday Aprl 27 th Today Usng Fourer-Motzkn elmnaton for code generaton Usng Fourer-Motzkn elmnaton for determnng schedule constrants Unversty Fourer-Motzkn Elmnaton

More information

UB at GeoCLEF Department of Geography Abstract

UB at GeoCLEF Department of Geography   Abstract UB at GeoCLEF 2006 Mguel E. Ruz (1), Stuart Shapro (2), June Abbas (1), Slva B. Southwck (1) and Davd Mark (3) State Unversty of New York at Buffalo (1) Department of Lbrary and Informaton Studes (2) Department

More information

Kent State University CS 4/ Design and Analysis of Algorithms. Dept. of Math & Computer Science LECT-16. Dynamic Programming

Kent State University CS 4/ Design and Analysis of Algorithms. Dept. of Math & Computer Science LECT-16. Dynamic Programming CS 4/560 Desgn and Analyss of Algorthms Kent State Unversty Dept. of Math & Computer Scence LECT-6 Dynamc Programmng 2 Dynamc Programmng Dynamc Programmng, lke the dvde-and-conquer method, solves problems

More information

Data Representation in Digital Design, a Single Conversion Equation and a Formal Languages Approach

Data Representation in Digital Design, a Single Conversion Equation and a Formal Languages Approach Data Representaton n Dgtal Desgn, a Sngle Converson Equaton and a Formal Languages Approach Hassan Farhat Unversty of Nebraska at Omaha Abstract- In the study of data representaton n dgtal desgn and computer

More information

Edge Detection in Noisy Images Using the Support Vector Machines

Edge Detection in Noisy Images Using the Support Vector Machines Edge Detecton n Nosy Images Usng the Support Vector Machnes Hlaro Gómez-Moreno, Saturnno Maldonado-Bascón, Francsco López-Ferreras Sgnal Theory and Communcatons Department. Unversty of Alcalá Crta. Madrd-Barcelona

More information

LECTURE : MANIFOLD LEARNING

LECTURE : MANIFOLD LEARNING LECTURE : MANIFOLD LEARNING Rta Osadchy Some sldes are due to L.Saul, V. C. Raykar, N. Verma Topcs PCA MDS IsoMap LLE EgenMaps Done! Dmensonalty Reducton Data representaton Inputs are real-valued vectors

More information

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance Tsnghua Unversty at TAC 2009: Summarzng Mult-documents by Informaton Dstance Chong Long, Mnle Huang, Xaoyan Zhu State Key Laboratory of Intellgent Technology and Systems, Tsnghua Natonal Laboratory for

More information

The Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique

The Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique //00 :0 AM Outlne and Readng The Greedy Method The Greedy Method Technque (secton.) Fractonal Knapsack Problem (secton..) Task Schedulng (secton..) Mnmum Spannng Trees (secton.) Change Money Problem Greedy

More information

S1 Note. Basis functions.

S1 Note. Basis functions. S1 Note. Bass functons. Contents Types of bass functons...1 The Fourer bass...2 B-splne bass...3 Power and type I error rates wth dfferent numbers of bass functons...4 Table S1. Smulaton results of type

More information

Preconditioning Parallel Sparse Iterative Solvers for Circuit Simulation

Preconditioning Parallel Sparse Iterative Solvers for Circuit Simulation Precondtonng Parallel Sparse Iteratve Solvers for Crcut Smulaton A. Basermann, U. Jaekel, and K. Hachya 1 Introducton One mportant mathematcal problem n smulaton of large electrcal crcuts s the soluton

More information

Efficient Code Generation for Automatic Parallelization and Optimization

Efficient Code Generation for Automatic Parallelization and Optimization Effcent Code Generaton for utomatc Parallelzaton and Optmzaton Cédrc Bastoul Laboratore PRSM, Unversté de Versalles Sant Quentn 5 avenue des États-Uns, 7805 Versalles Cedex, France Emal: cedrcbastoul@prsmuvsqfr

More information

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur FEATURE EXTRACTION Dr. K.Vjayarekha Assocate Dean School of Electrcal and Electroncs Engneerng SASTRA Unversty, Thanjavur613 41 Jont Intatve of IITs and IISc Funded by MHRD Page 1 of 8 Table of Contents

More information

Classifier Selection Based on Data Complexity Measures *

Classifier Selection Based on Data Complexity Measures * Classfer Selecton Based on Data Complexty Measures * Edth Hernández-Reyes, J.A. Carrasco-Ochoa, and J.Fco. Martínez-Trndad Natonal Insttute for Astrophyscs, Optcs and Electroncs, Lus Enrque Erro No.1 Sta.

More information

Active Contours/Snakes

Active Contours/Snakes Actve Contours/Snakes Erkut Erdem Acknowledgement: The sldes are adapted from the sldes prepared by K. Grauman of Unversty of Texas at Austn Fttng: Edges vs. boundares Edges useful sgnal to ndcate occludng

More information

Problem Set 3 Solutions

Problem Set 3 Solutions Introducton to Algorthms October 4, 2002 Massachusetts Insttute of Technology 6046J/18410J Professors Erk Demane and Shaf Goldwasser Handout 14 Problem Set 3 Solutons (Exercses were not to be turned n,

More information

CS 534: Computer Vision Model Fitting

CS 534: Computer Vision Model Fitting CS 534: Computer Vson Model Fttng Sprng 004 Ahmed Elgammal Dept of Computer Scence CS 534 Model Fttng - 1 Outlnes Model fttng s mportant Least-squares fttng Maxmum lkelhood estmaton MAP estmaton Robust

More information

Assembler. Building a Modern Computer From First Principles.

Assembler. Building a Modern Computer From First Principles. Assembler Buldng a Modern Computer From Frst Prncples www.nand2tetrs.org Elements of Computng Systems, Nsan & Schocken, MIT Press, www.nand2tetrs.org, Chapter 6: Assembler slde Where we are at: Human Thought

More information

Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task

Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task Proceedngs of NTCIR-6 Workshop Meetng, May 15-18, 2007, Tokyo, Japan Term Weghtng Classfcaton System Usng the Ch-square Statstc for the Classfcaton Subtask at NTCIR-6 Patent Retreval Task Kotaro Hashmoto

More information

ARTICLE IN PRESS. Signal Processing: Image Communication

ARTICLE IN PRESS. Signal Processing: Image Communication Sgnal Processng: Image Communcaton 23 (2008) 754 768 Contents lsts avalable at ScenceDrect Sgnal Processng: Image Communcaton journal homepage: www.elsever.com/locate/mage Dstrbuted meda rate allocaton

More information

Today s Outline. Sorting: The Big Picture. Why Sort? Selection Sort: Idea. Insertion Sort: Idea. Sorting Chapter 7 in Weiss.

Today s Outline. Sorting: The Big Picture. Why Sort? Selection Sort: Idea. Insertion Sort: Idea. Sorting Chapter 7 in Weiss. Today s Outlne Sortng Chapter 7 n Wess CSE 26 Data Structures Ruth Anderson Announcements Wrtten Homework #6 due Frday 2/26 at the begnnng of lecture Proect Code due Mon March 1 by 11pm Today s Topcs:

More information

Meta-heuristics for Multidimensional Knapsack Problems

Meta-heuristics for Multidimensional Knapsack Problems 2012 4th Internatonal Conference on Computer Research and Development IPCSIT vol.39 (2012) (2012) IACSIT Press, Sngapore Meta-heurstcs for Multdmensonal Knapsack Problems Zhbao Man + Computer Scence Department,

More information

CSCI 104 Sorting Algorithms. Mark Redekopp David Kempe

CSCI 104 Sorting Algorithms. Mark Redekopp David Kempe CSCI 104 Sortng Algorthms Mark Redekopp Davd Kempe Algorthm Effcency SORTING 2 Sortng If we have an unordered lst, sequental search becomes our only choce If we wll perform a lot of searches t may be benefcal

More information

Ontology Generator from Relational Database Based on Jena

Ontology Generator from Relational Database Based on Jena Computer and Informaton Scence Vol. 3, No. 2; May 2010 Ontology Generator from Relatonal Database Based on Jena Shufeng Zhou (Correspondng author) College of Mathematcs Scence, Laocheng Unversty No.34

More information

R s s f. m y s. SPH3UW Unit 7.3 Spherical Concave Mirrors Page 1 of 12. Notes

R s s f. m y s. SPH3UW Unit 7.3 Spherical Concave Mirrors Page 1 of 12. Notes SPH3UW Unt 7.3 Sphercal Concave Mrrors Page 1 of 1 Notes Physcs Tool box Concave Mrror If the reflectng surface takes place on the nner surface of the sphercal shape so that the centre of the mrror bulges

More information

REDUCING hardware design time is more than ever a

REDUCING hardware design time is more than ever a TCAD-2012-0168 1 Polyhedral Bubble Inserton: A Method to Improve Nested Loop Ppelnng for Hgh-Level Synthess Antone Morvan, Steven Derren, and Patrce Qunton Abstract Hgh-Level Synthess (HLS) allows hardware

More information

Private Information Retrieval (PIR)

Private Information Retrieval (PIR) 2 Levente Buttyán Problem formulaton Alce wants to obtan nformaton from a database, but she does not want the database to learn whch nformaton she wanted e.g., Alce s an nvestor queryng a stock-market

More information

Intra-Parametric Analysis of a Fuzzy MOLP

Intra-Parametric Analysis of a Fuzzy MOLP Intra-Parametrc Analyss of a Fuzzy MOLP a MIAO-LING WANG a Department of Industral Engneerng and Management a Mnghsn Insttute of Technology and Hsnchu Tawan, ROC b HSIAO-FAN WANG b Insttute of Industral

More information

The relation between diamond tiling and hexagonal tiling

The relation between diamond tiling and hexagonal tiling The relaton between damond tlng and hexagonal tlng Tobas Grosser INRIA and École Normale Supéreure, Pars tobas.grosser@nra.fr Sven Verdoolaege INRIA, École Normale Supéreure and KU Leuven sven.verdoolaege@nra.fr

More information

Scheduling with Integer Time Budgeting for Low-Power Optimization

Scheduling with Integer Time Budgeting for Low-Power Optimization Schedlng wth Integer Tme Bdgetng for Low-Power Optmzaton We Jang, Zhr Zhang, Modrag Potkonjak and Jason Cong Compter Scence Department Unversty of Calforna, Los Angeles Spported by NSF, SRC. Otlne Introdcton

More information

AADL : about scheduling analysis

AADL : about scheduling analysis AADL : about schedulng analyss Schedulng analyss, what s t? Embedded real-tme crtcal systems have temporal constrants to meet (e.g. deadlne). Many systems are bult wth operatng systems provdng multtaskng

More information

Range images. Range image registration. Examples of sampling patterns. Range images and range surfaces

Range images. Range image registration. Examples of sampling patterns. Range images and range surfaces Range mages For many structured lght scanners, the range data forms a hghly regular pattern known as a range mage. he samplng pattern s determned by the specfc scanner. Range mage regstraton 1 Examples

More information

NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS

NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS ARPN Journal of Engneerng and Appled Scences 006-017 Asan Research Publshng Network (ARPN). All rghts reserved. NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS Igor Grgoryev, Svetlana

More information

A Facet Generation Procedure. for solving 0/1 integer programs

A Facet Generation Procedure. for solving 0/1 integer programs A Facet Generaton Procedure for solvng 0/ nteger programs by Gyana R. Parja IBM Corporaton, Poughkeepse, NY 260 Radu Gaddov Emery Worldwde Arlnes, Vandala, Oho 45377 and Wlbert E. Wlhelm Teas A&M Unversty,

More information

Course Introduction. Algorithm 8/31/2017. COSC 320 Advanced Data Structures and Algorithms. COSC 320 Advanced Data Structures and Algorithms

Course Introduction. Algorithm 8/31/2017. COSC 320 Advanced Data Structures and Algorithms. COSC 320 Advanced Data Structures and Algorithms Course Introducton Course Topcs Exams, abs, Proects A quc loo at a few algorthms 1 Advanced Data Structures and Algorthms Descrpton: We are gong to dscuss algorthm complexty analyss, algorthm desgn technques

More information

Type-2 Fuzzy Non-uniform Rational B-spline Model with Type-2 Fuzzy Data

Type-2 Fuzzy Non-uniform Rational B-spline Model with Type-2 Fuzzy Data Malaysan Journal of Mathematcal Scences 11(S) Aprl : 35 46 (2017) Specal Issue: The 2nd Internatonal Conference and Workshop on Mathematcal Analyss (ICWOMA 2016) MALAYSIAN JOURNAL OF MATHEMATICAL SCIENCES

More information

Optimizing Document Scoring for Query Retrieval

Optimizing Document Scoring for Query Retrieval Optmzng Document Scorng for Query Retreval Brent Ellwen baellwe@cs.stanford.edu Abstract The goal of ths project was to automate the process of tunng a document query engne. Specfcally, I used machne learnng

More information

A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS

A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS Proceedngs of the Wnter Smulaton Conference M E Kuhl, N M Steger, F B Armstrong, and J A Jones, eds A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS Mark W Brantley Chun-Hung

More information

An Efficient Genetic Algorithm with Fuzzy c-means Clustering for Traveling Salesman Problem

An Efficient Genetic Algorithm with Fuzzy c-means Clustering for Traveling Salesman Problem An Effcent Genetc Algorthm wth Fuzzy c-means Clusterng for Travelng Salesman Problem Jong-Won Yoon and Sung-Bae Cho Dept. of Computer Scence Yonse Unversty Seoul, Korea jwyoon@sclab.yonse.ac.r, sbcho@cs.yonse.ac.r

More information