Improving High Level Synthesis Optimization Opportunity Through Polyhedral Transformations

Size: px

Start display at page:

Download "Improving High Level Synthesis Optimization Opportunity Through Polyhedral Transformations"

Leona Arnold
6 years ago
Views:

1 Improvng Hgh Level Synthess Optmzaton Opportunty Through Polyhedral Transformatons We Zuo 2,5, Yun Lang 1, Peng L 1, Kyle Rupnow 3, Demng Chen 2,3 and Jason Cong 1,4 1 Center for Energy-Effcent Computng and Applcatons, School of EECS, Pekng Unversty, Chna 2 Unversty of Illnos, Urbana-Champagn, USA 3 Advanced Dgtal Scence Center, Sngapore 4 Computer Scence Department, Unversty of Calforna, Los Angeles, USA 5 School of Informaton and Electroncs, Beng Insttute of technology,chna {wezuo,dchen}@llnos.edu,{erclyun,peng.l}@pku.edu.cn k.rupnow@adsc.com.sg,cong@cs.ucla.edu ABSTRACT Hgh level synthess (HLS) s an mportant enablng technology for the adopton of hardware accelerator technologes. It promses the performance and energy effcency of hardware desgns wth a lower barrer to entry n desgn expertse, and shorter desgn tme. State-of-the-art hgh level synthess now ncludes a wde varety of powerful optmzatons that mplement effcent hardware. These optmzatons can mplement some of the most mportant features generally performed n manual desgns ncludng parallel hardware unts, ppelnng of executon both wthn a hardware unt and between unts, and fne-graned data communcaton. We may generally classfy the optmzatons as those that optmze hardware mplementaton wthn a code block (ntra-block) and those that optmze communcaton and ppelnng between code blocks (nterblock). However, both optmzatons are n practce dffcult to apply. Real-world applcatons contan data-dependent blocks of code and communcate through complex data access patterns. Exstng hgh level synthess tools cannot apply these powerful optmzatons unless the code s nherently compatble, severely lmtng the optmzaton opportunty. In ths paper we present an ntegrated framework to model and enable both ntra- and nter-block optmzatons. Ths ntegrated technque substantally mproves the opportunty to use the powerful HLS optmzatons that mplement parallelsm, ppelnng, and fne-graned communcaton. Our polyhedral model-based technque systematcally defnes a set of data access patterns, dentfes effectve data access patterns, and performs the loop transformatons to enable the ntra- and nter-block optmzatons. Our framework automatcally explores transformaton optons, performs code transformatons, and nserts the approprate HLS drectves to mplement the HLS optmzatons. Furthermore, our framework can automatcally generate the optmzed communcaton blocks for fne-graned communcaton between hardware blocks. Expermen- Correspondng Author Permsson to make dgtal or hard copes of all or part of ths work for personal or classroom use s granted wthout fee provded that copes are not made or dstrbuted for proft or commercal advantage and that copes bear ths notce and the full ctaton on the frst page. To copy otherwse, to republsh, to post on servers or to redstrbute to lsts, requres pror specfc permsson and/or a fee. FPGA 13, February 11 13, 2013, Monterey, Calforna, USA. Copyrght 2013 ACM /13/02...$ tal evaluaton demonstrates that we can acheve an average of 6.04X speedup over the hgh level synthess soluton wthout our transformatons to enable ntra- and nter-block optmzatons. Categores and Subect Descrptors B.5.2 [Hardware]: [Desgn Ads] automatc synthess General Terms Algorthm, Desgn, Performance Keywords Polyhedral, Hgh Level Synthess, FPGA 1. INTRODUCTION FPGAs have long been adopted for computaton acceleraton, especally n domans that also demand power and energy effcent computng. Ther hghly flexble archtecture enables sgnfcant optmzaton opportuntes. Desgners can mplement fne-graned computaton unts, hghly parallel archtectures, fne-graned ppelnng of computaton unts, effcent communcaton structures and customzed memory and compute unt parttonng. However, the flexblty that s a strength for optmzaton opportuntes s also a challenge for effcent programmablty. FPGA mplementaton remans a sgnfcant challenge for prospectve users manual desgn at regster transfer level (RTL) s error-prone and dffcult to debug. Such manual desgn often takes weeks and months n comparson to the hours or days to mplement an algorthm n software. Furthermore, effcent manual desgn typcally requres sgnfcant specalzed knowledge effcent FPGA mplementatons must consder archtecture-specfc parameters and mplement functons for effcent mappng to the FPGAs fxed-sze allocaton quota (LUT, BRAM, DSP,...). Hgh level synthess (HLS) seeks to address these problems. HLS offers automated translaton from hgh level languages (e.g., C,C++, SystemC, Haskell and CUDA) to regster transfer level (RTL) mplementatons to reduce the desgn effort, automated optmzaton to reduce the requrement of desgn knowledge, and FPGA-specfc mappng to automate low-level optmzaton choces. Thus, HLS promses to be a crtcal brdgng technology that offers the latency and power/energy benefts of FPGA-based hardware acceleraton at the desgn effort (and expertse) of software development. Stateof-the-art HLS tools cover a wde range of nput source code and acheve hgh-qualty results [8]. 9

2 Block 1 for ( = 1; < N; ++) for ( = 1; < N; ++) { g[][] = f1(u[][], u[- 1][], u[+1][], u[][-1], u[][+1]); } Block 2 for ( = 1; < N; ++) for ( = 1; < N; ++) { u[][] = f2(u[][], u[][+1], u[][-1], u[+1][], u[-1][], g[][+1], g[][-1], g[+1][], g[-1][]); } Dependence of Block 2 Dagonal access of Block 2 Block 1 for (1 = 1; 1 < 2 * N - 2; 1++) for (1 = max(1 N + 1, 0); 1 < mn(n - 1, 1); 1++) { g[1-1][1] = f1(u[1-1][1], u[1-1][1-1], u[1-1-1][1], u[1-1+1][1], u[1-1][1+1]); } Block 2 for (1 = 1; 1 < 2 * N - 2; 1++) for (1 = max(1 N + 1, 0); 1 < mn(n 1, 1); 1++) { u[1-1][1] = f2(u[1-1][1], u[1-1][1-1], u[1-1-1][1], u[1-1+1][1], u[1-1][1+1], g[1-1][1-1], g[1-1-1][1], g[1-1+1][1], g[1-1][1+1]); } 1 1 (d) Loop skewng Block 1 Block 2 (a) Orgnal source code (b) Dagonal access (c) Transformed source code (e)parallelzaton and ppelnng Fgure 1: An example of two data-dependent blocks. In sub-fgure (e), we assume the parallelzaton degree s 2. These tools have acheved sgnfcant mprovement; however, recent studes show that although these tools can offer hgh qualty desgns for small kernels, there s stll a sgnfcant performance gap between HLS and manual desgn for real-world complex applcatons [23, 17, 10]. For example, [23, 17] demonstrated a 40X dfference between HLS and the manual desgn for a hgh-defnton stereo matchng mplementaton. Small kernels (often used as smple HLS benchmarks) contan a sngle block (a loop nest), but realworld applcatons often contan many data-dependent blocks that communcate through complex data access patterns. Whereas a vdeo processng kernel may be a sngle nested loop performng a medan flter, a real applcaton would employ a sequence of datadependent blocks, ncludng blocks such as mage up/down samplng, cost aggregaton, and energy mnmzaton. Vdeo processng applcatons are often structured so that each processng step s a loop nest (block) that operates on array nputs and produces array outputs such that block + 1 reads the array varables wrtten by block. As seen n effcent manual RTL mplementatons of these algorthms, t s crucal to mnmze the communcaton granularty, ppelne the data-dependent blocks, and duplcate compute unts to mprove both throughput and latency. However, exstng HLS tools fal to enable ntra-block parallelzaton and nter-block ppelnng when the data access patterns are complex [23, 17] and although advanced commercal tools support these optmzatons, they are not always able to effcently dentfy the opportunty and transform source code to enable such optmzatons. Code transformatons to enable these mportant and powerful optmzatons are thus crtcal for HLS tools. One of the prmary goals of HLS s to reduce desgn effort; thus, HLS tools must effcently support a varety of code ncludng source code wrtten by software engneers wth lttle hardware expertse, and software not wrtten for use wth HLS. Software programmers choose data access patterns based on propertes such as ntutve orderng, cache localty, and code modularty. Even software wrtten for HLS may stll contan loops that demand transformaton; effcent nter-block optmzaton may demand non-ntutve teraton orderng such that even experenced hardware desgners may have dffculty seeng an ntutve orderng of blocks that enables optmzaton. Thus, t s qute natural that the default data access pattern does not support ntra-block parallelzaton or nter-block ppelnng. Nevertheless, data access patterns can be altered va loop transformatons such as permutaton, reverse, and skewng [24]. Usng these transformatons, we can enable ntra-block parallelzaton by reorderng loop teratons so that successve teratons are ndependent; smlarly, we can enable nter-block ppelnng by ensurng that block produces data n the order that block + 1 consumes data (by transformng block, block + 1 or both). Both ntra-block parallelzaton and nter-block ppelnng are currently supported by exstng HLS tools. The problem s rather that these optmzatons cannot always be enabled wth the default data access patterns. Thus, the goal of ths work s to enable effcent ntegrated use of ntra- and nter-block optmzatons through loop transformaton. Motvatng Example. We llustrate the concept of ntra- and nterblock optmzaton usng the Denose [11] applcaton n medcal magng (Fgure 1). The source code n Fgure 1 (a) s modfed to hghlght the data access pattern, wth computaton arthmetc omtted for clarty. The applcaton s composed of two blocks (loop nests), where the frst block wrtes to the g array that s subsequently read by the second block. Both blocks are representatve of stencl codes where the array update operaton follows a fxed nput dependence pattern that reads neghborng elements. In ths example, only the second block has data-dependence between loop teratons, as shown n Fgure 1 (b), and the frst block wrtes the data array g n the same order that the second block consumes array g (column order). Thus, n ths example, ntra-block parallelzaton s avalable n the frst block and nter-block ppelnng s avalable by default between the blocks, but a transform s requred to enable parallelzaton n the second block. However, note that f the second block s transformed to enable parallelzaton, then the frst block must also be transformed n order to retan the opportunty for nter-block ppelnng. 1 Table 1: Comparson of dfferent mplementatons Implementaton Cycles Frequency Speedup 1 w/o transform, w/o opt MHz 1 2 w/o transform, w/ opt MHz w/ transform, w/ opt MHz In Table 1 we can see the performance n cycles and speedup over the orgnal source for ths example. By default, the orgnal source code supports some optmzaton, as seen n the second mplementaton nter-block ppelnng mproves the performance, but the latency of the ndvdual blocks become the performance bottleneck. Note that n the second mplementaton, we do not perform partal parallelzaton (e.g., parallelze the frst block only) because f the throughputs are not matched, the bufferng between 1 For effcent mplementaton of nter-block ppelnng, we also mplement memory partton optmzaton and customzed communcaton blocks whch wll be dscussed n secton

3 the blocks would not be feasble. In the thrd mplementaton, the loop transformaton s used to allevate the dependence problem and retan the opportunty for nter-block ppelnng. In Fgure 1 (c) and (d), we perform loop skewng on both blocks to traverse the loop n a dagonal fashon from the bottom-rght to top-left, as shown n Fgure 1 (b). Ths loop skewng both preserves the data dependence and enables parallelzaton of the nner loop because the teratons on the same dagonal lne are ndependent. The frst block can also use the loop skewng transformaton safely because t does not have any data dependences. The optmzed executon schedule s shown n Fgure 1 (e). Ths transformaton sgnfcantly mproves the speedup opportunty from about 3X to 31X speedup. In ths case, some optmzatons were supported by default, but there was stll sgnfcant speedup opportunty by ensurng that both optmzatons were enabled. In the experments secton, we wll demonstrate that some source codes support nether parallelzaton nor ppelnng by default, but loop transformatons can enable both ntra-block parallelzaton and nter-block ppelnng. As ths bref motvaton dscusson demonstrates, real-world applcatons commonly contan multple data-dependent blocks wth communcaton through complex data access patterns. For such applcatons, the powerful optmzaton functons of exstng hgh level synthess tools may not be supported by default due to the data access patterns. In ths paper we develop an ntegrated method to model, analyze, and transform block data access patterns to support these mportant and powerful optmzatons. Usng polyhedral models [24, 13, 20, 6, 5], we represent the loop teraton space, dependences, data access patterns, and loop transformatons n a lnear algebrac form. For loops amenable to ths algebrac representaton, the polyhedral model allows us to analytcally determne vald loop transformatons that best support both ntra-block parallelzaton and nter-block ppelnng. Ths paper advances the state-of-the-art of hgh level synthess wth An automated polyhedral model-based framework that systematcally dentfes effectve access patterns and apples approprate loop transformatons that enable ntra- and nterblock optmzatons. An automated framework to generate communcaton nterfaces between blocks. We demonstrate that our automated framework acheves an average of 6.04X speedup over HLS wth all avalable optmzatons but wthout transformatons to enable. Ths paper s organzed as follows. Secton 2 brefly provdes the background of the polyhedral model and ntroduces the useful notaton. Secton 3 presents our framework, whch defnes data access patterns, dentfes data access patterns for effectve optmzatons, and dscusses the mplementaton of fne-graned data communcaton modules. Secton 4 presents expermental results. Secton 5 dscusses related work, and conclusons are presented n Secton BACKGROUND AND NOTATION In ths secton we brefly ntroduce the polyhedral model and the notaton that we wll use n ths paper. A detaled descrpton of polyhedral models can be found n [5, 6, 13, 20]. In ths work we are usng the polyhedral model to consder data access patterns for communcaton between sets of loop nests, and to optmze ths communcaton orderng n order to perform fne-graned communcaton and enable parallelzaton wthn a loop nest and ppelnng between loop nests through loop transformaton. In partcular, n ths work we consder programs that consst of a sequence of datadependent blocks, where each block s a loop nest contanng multple statements, and each statement may have multple accesses to data arrays. 2.1 Polyhedral Model Polyhedral models can be used to represent executon nformaton of a program s loop nests, such as the loop teraton doman, statement/teraton dependences, array access functons, and schedulng functons (executon order). DEFINITION 1 (Polyhedron). The set of all vectors x Z n such that A x + b 0, where A Z m n and b Z m. Each row of A represents a half-space that lmts A x + b to nonnegatve values, and the polyhedron s the ntersecton of the m half-spaces represented by the m rows of A. A bounded polyhedron s a polytope. DEFINITION 2 (Iteraton Vector and Doman). The teraton vector of a loop nest represents an executon nstance of the loop nest. contans the values of the loop ndces of all the surroundng loops. The teraton doman D L of loop nest L s the set of all teraton vectors that satsfy the loop-bound constrants. Because an teraton doman s bounded by these loop-bound constrants, t s a polytope. DEFINITION 3 (Schedule Functon). Gven a m-dmensonal loop nest, a d-dmensonal (1 d m) schedule F( ) s defned as F( ) = S + o C 11 C C 1m C 21 C C 2m = C d1 C d2... C dm C 10 C 20. C d0 where S Z d m, o Z d, and C Z. The schedule functon takes the teraton vector as nput and produces an orderng of the loop teratons usng the matrx S and an offset vector ( o). Thus, the schedule functon creates a partcular orderng of teratons by mappng each teraton n the doman to a tmestamp, where mappng m dmensons nto fewer (d) tmestamps mples that some of the teratons can be performed n parallel. The schedule functon has the addtonal beneft that the tmestamp orderng also corresponds to a lexcographc orderng. For example, for teraton vectors 1 and 2, 1 s scheduled before 2 f F( 1) < F( 2). DEFINITION 4 (Transformaton Functon). In the polyhedral model, the transformaton functon s represented as a sequence of schedules appled to perform the transformaton. In theory, any loop transformaton can be represented n the polyhedral model. In ths paper we focus on a set of un-modular loop transformatons [24], ncludng loop reverse (e.g., reverses the traverse drecton), loop permutaton (e.g., nterchange two loops), and skewng (e.g., make the nner-loop bounds dependent on the outer-loop bounds) and ther composte transformatons. The schedule functon/executon order of teratons of a loop nest can be altered through loop transformatons. Fgure 2 llustrates the polyhedral model usng a concrete example. The orgnal source code that conssts of two data-dependent blocks (loop nests) s shown n Fgure 2 (a). The frst block wrtes to array B, and the second block subsequently reads from t. Wthout transformaton, each ndvdual block can be parallelzed, but 11

4 for (1 = N - 1; 1 >=0; 1--) for (1 = 0; 1 < N; 1++) S1 : B[1][1] = B[1][1] + A[1][1]; NN 1 0 FF sss ıı = 1 0 ıı + NN 1 0, ıı = 1 for (2 = 0; 2 < N; 2++) for( 2 = 0; 2 < N; 2++) S2: D[2] = D[2] + B[2][2] * C[2]; NN 1 0 FF sss ıı = 1 0 ıı + 0 0, ıı = 2 (a) Orgnal source code (b) Iteraton doman Fgure 2: An example of polyhedral representaton. (c) Schedule functon ther executon can not be overlapped through ppelnng as they access array B n dfferent order. In Fgure 2 (b), we show the teraton doman (polytope) of the two blocks whch establshes the lower- and upper-bounds for each loop dmenson. In Fgure 2 (c), we show the schedulng functon that corresponds to the orgnal source code for each of the loop nests. In order to transform the orgnal schedulng functon to acheve a desred data access pattern, we can defne a desred data access pattern and derve the necessary loop transformatons; a proof that ths dervaton s always possble s shown n Secton 3. As descrbed above, the polyhedral model (ncludng teraton doman, teraton vector, schedulng functon and transformaton functon) descrbes the teraton doman and order for any loop nest that has all array accesses as affne expressons of loop ndces. In addton, dependency relatons between teratons n the program can be represented n the polyhedral model. Two teratons of a loop nest (or two nstances of blocks) are dependent f they access the same array locatons, and at least one of the accesses s a wrte operaton. These dependences can be represented by addtonal rows n the teraton doman to establsh constrants between the loop teratons. Loop transformatons are vald only f they also preserve the addtonal program dependency constrants. 3. METHODOLOGY Our ntegrated ntra- and nter-block optmzaton framework takes a data-dependent mult-block program as nput, and performs three steps, as shown n Fgure 3. The goal of our optmzaton s to mnmze the overall latency (maxmze the performance speedup). Frst, we systematcally defne a set of data access patterns, classfy them, and derve the assocated loop transformatons (Secton 3.1). Next, for each loop transformaton that valdly preserves data-dependences, we estmate the performance mprovement (Secton 3.2) and choose the best estmated performance. The performance estmaton models both ntra- and nter-block speedup and assocated mplementaton overheads. The ntra-block parallelzaton degree s determned by the resource usage of the program and the avalable resource on the mplementaton platform. Fnally, for the chosen transformaton, we automatcally perform the loop transformatons, nsert hgh level synthess drectves, and generate the communcaton blocks that nterface the data-dependent blocks. If the communcaton block s a smple FIFO, we automatcally nsert FIFO hgh level synthess drectves; when the communcaton nterface requres multple reads or a stencl pattern, we automatcally customze the communcaton blocks (Secton 3.3). The fnal output of the flow s an optmzed RTL desgn. Note that n pror works [5, 6, 13, 18, 20, 24], polyhedral models consder data access patterns for external memory accesses and loop transformatons n order to optmze data localtes and maxmze parallelsm for CPUs. Optmzatons for CPU code attempt to mnmze memory bandwdth and mprove cache behavor. In addton, the transformed CPU code often has complex control flow that s not sutable for effcent FPGA mplementaton. Thus, the transformaton chosen for optmzaton on CPU platform may be the wrong decson for our HLS optmzaton. In ths work, we use polyhedral model to defne data access patterns for a dfferent obectve. We am to optmze the nterblock communcaton and enable ntra-block parallelzaton and nter-block ppelnng through loop transformaton for FPGAs usng HLS. For our obectve, we model the FPGA-specfc features. Therefore, although the canddate data access patterns mght be the same as pror work, we apply dfferent transformatons n dfferent combnatons n order to meet our optmzaton obectve. Although the underlyng technques are all based on polyhedral models, dfferent obectves lead to dfferent ways to select data access patterns and loop transformatons. Data-dependent mult-block program Defne data access pattern and derve the assocated loop transformaton Performance Estmaton Intra-block Inter-block Overhead Transform the computaton blocks and generate the communcaton blocks Optmzed RTL desgn Fgure 3: Optmzaton Framework. 3.1 Classfcaton of Array Access Patterns We model the array access patterns usng the polyhedral model, thus we assume that all array accesses are affne expressons of loop ndces and constants. The program nputs are composed of multple data-dependent blocks where each block contans a sngle mult-dmensonal loop nest. Let us consder a loop nest of dmensonalty D that accesses an N-dmensonal array. The array access pattern s defned by matrx M whose sze s N D, where the rows () represent the data access pattern n dmenson of the data array, and columns () represent the access pattern n the loop level. Gven the array access pattern M, loop teraton vector, and constant offset vector o, the array access vector s s defned as s = M + o s s column vector of sze N, where each row () represents array accesses n dmenson, and the offset vector s a constant offset nto that dmenson. Fgure 4 shows an example of array access 12

5 pattern and vector for the wrtes to array B; smlar access patterns and vectors could be derved for the reads from array A. for( =0; < N; ++) for(=0; < N; ++) B[][] = A[][] + A[ - 1][]; 1 0, s = Fgure 4: An example of array access pattern. In the followng, we demonstrate the data access patterns for 2- dmensonal arrays and 2-dmensonal loop-nests. For ease of llustraton, we classfy the access patterns below; although, the polyhedral framework treats these access patterns n a unform way. In the framework, we need to defne the canddate access patterns that can be evaluated for the program; n ths work we lmt the access patterns to the smple set of access patterns below. The polyhedral model can be easly extended to handle a wde varety of addtonal access patterns, and our work can use any addtonal access patterns to estmate the performance beneft. We leave the extenson to future work. Now we wll descrbe how to defne the array access patterns usng M. Let ( ) a1 b 1 a 2 b 2 We classfy the array access patterns based on the values of a 1, a 2, b 1 and b 2. For array accesses wth non-unt loop strdes, we perform loop normalzaton as a preprocessng step so that our analyss can assume unt loop strde. Column and Reverse Column. ( ) ±1 0 0 ±1 The array access vector can be obtaned by M. Fgure 5 shows the four patterns n ths category, where the sgns of a 1 and b 2 determne traversal drecton. For example, wth outer loop ndex and nner loop ndex, f a 1 = 1 and b 2 = 1, then the outer loop traverses ncreasng values of, and the nner loop traverses ncreasng values of. ( 0 ±1 ±1 0 Smlar to column and reverse column, there are four patterns n ths category, and the sgns of b 1 and a 2 determne the traversal drectons. Dagonal Access. In ths category, the loop traverses n a dagonal lne fashon. We further dvde ths nto two cases based on the slopes of the dagonal lnes. slope 1. ) ( ±1 N > b1 1 0 ±1 The array access vector can be obtaned by M. The slope of the dagonal lne s determned by b 1, and the sgns of a 1 and b 2 determne the traversal drectons. When b 1 N, the traversal order reduces to one of the column access orders. Fgure 6 shows the four data access patterns for slope = 1. ) slope < 1. ( N > a1 > 1 ±1 ±1 0 The slope of the dagonal lne s determned by 1 a 1, and the sgns of a 2 and b 1 determne the traversal drectons. When a 1 N, the traversal order reduces to one of the row access orders. All of the array access patterns defned above are unmodular matrces where a 1 b 2 a 2 b 1 = 1. Thus, all of these access patterns can be acheved by unmodular loop transformatons of the block(s) [24]. Loop Transformaton. Loop transformatons can change the schedule (e.g., executon order) of loop teratons such that the data access pattern can be changed. Thus, here we derve the loop transformaton gven a desred data access pattern. THEOREM 3.1. The transformaton functon T requred for the desred data access pattern M des can be obtaned by T = M 1 desm orf 1 or where M or and F or are the data access pattern and schedule functon of the source code wthout transformaton. PROOF. Let be the loop terator vector after transformaton. Thus, the desred array access vector after transformaton s M des. Let us assume the schedule functon and data access pattern of the orgnal code are F or and M or, respectvely. Thus, the schedule functon after transformaton s TF or. Schedule functon maps the orgnal loop teratons to a new orderng of the loop teratons. Thus, Thus, TF or = = F 1 ort 1 The data accessed by the teratons before and after transformaton s the same Thus, M orf 1 ort 1 = M des T = M 1 desm orf 1 or The data access pattern and loop transformatons are all unmodular matrx. For unmodular matrx, we can always derve ts reverse matrx. 3.2 Performance Metrc In the prevous subsecton we defned a set of data access patterns that nclude row, column, and dagonal access wth dfferent slopes and drectons. We also descrbed how to derve the requred loop transformaton for a gven data access pattern. Our desgn obectve s to maxmze the performance speedup. Now, we wll dscuss the process of evaluatng all of the canddate data access patterns and choosng the canddate that maxmzes applcaton speedup. To perform ths evaluaton, we develop a performance metrc that combnes modelng of both ntra- and nter-block speedup and ther assocated mplementaton overhead. We defne that a program s a sequence of K blocks {b 1, b 2,..., b K}, and ) 13

6 (a) left to rght (), bottom to up () (b) left to rght (), top to down () (c) rght to left (), bottom to up () (d) rght to left (), top to down () Fgure 5: An example of column access pattern. There are 4 patterns wth dfferent traverse drectons (a) bottom-left to top-rght (), bottom-rght to top-left () (b) bottom-left to top-rght (), top-left to bottom-rght () (c) top-rght to bottom-left (), bottom-rght to top-left () (d) top-rght to bottom-left (), top-left to bottom-rght () Fgure 6: An example of dagonal access pattern wth slope = 1. There are 4 patterns wth dfferent traverse drectons. each block b has a d dmensonal loop nest that accesses an n dmensonal data array. In ths work we restrct that the loop nest dmensonalty and data-array dmensonalty are equal for all sets of communcatng blocks n the program, so we denote the loop nest dmensons and data array dmensons as N. The latency n clock cycles of each block b wthout ntra-block optmzaton s lat. If there s any control flow between the blocks, we consder the worst-case path; thus, the total baselne program latency n clock cycles (wthout any optmzaton) s the smple sum of the blocks latences =K lat base = lat lat for each block s estmated by performng block-wse hgh level synthess, and usng the HLS-generated performance estmates 2. When no optmzatons are appled, we have verfed that these estmates are accurate by comparng them wth post-synthess smulatons. However, the HLS performance estmates can be naccurate for modelng the parallelsm, especally because the nterblock ppelnng hdes executon latency. Therefore, we develop an alternatve performance metrc that can estmate ths optmzaton effect. Note that our performance estmaton s based on clock cycles only. By default, we apply ntra-block ppelnng (e.g., loop ppelnng wthn a block by settng ntaton nterval (II)) for all the blocks and set the same target perod for all the mplementatons. Thus, the clock perod tends to be smlar across dfferent mplementatons, and we gnore the effect of clock perod n our estmaton. Let P be the set of all canddate data access patterns. Gven p P, we can estmate the performance of each block by the ntrablock parallelzaton factor, and then the performance of the entre =1 2 We use the commercal AutoPlot [25] HLS tool to estmate lat. program wth the nter-block ppelnng that overlaps the blocks executon. In ths work, we target nter-block ppelnng, and correspondngly, t always makes sense to parallelze each block to the same factor to match the blocks throughput and mnmze nterblock bufferng. Thus, we can smplfy the program performance estmate n clock cycles as follows where S ntra p lat p = S ntra p lat base S nter p + cost p and Sp nter represent the ntra- and nter-block speedup (dscussed next), respectvely and cost p represents the mplementaton overhead. For each block, we can ft multple data processng ppelnes for parallel processng. In hgh level synthess, duplcate data processng ppelnes are mplemented by unrollng the nner loop. As we dscussed above, we want to match the blocks parallelsm degree; therefore we search for the maxmum unrollng factor where all blocks n the program can be unrolled to the same factor. For block b, we use R to denote ts resource usage. R s a 4-tuple that represents ts resource usage n FFs, LUTs, BRAMs, and DSPs. Then the total communcaton cost (n resources) between block and block + 1 s estmated as another 4-tuple Comm, where we use the HLS estmates for R and Comm. We have observed and expermentally valdated that the HLS resource usage estmates correlate wth actual resource after logc synthess, although they tend to be conservatvely hgh estmates. In partcular, the HLS estmates for FF resource use and LUT resource use tend to be hgh. As we scale the unroll factor, ths overestmate s compounded due to underestmaton of resource sharng and LUT/FF packng nto Slces. Therefore, we emprcally determned another correlaton factor tuple α that represents ths scalng; unlke the communcaton factor, we use the same α factor for all blocks n a desgn and 14

7 for all desgns n our benchmark set. In total, we predct the actual resource usage as follows R + Comm ) α par =K ( =1 where par represents the ntra-block parallelzaton degree. Then, gven the fxed resource budget of the FPGA, we can derve the maxmally allowed parallelzaton degree, Max par. We defne Sp ntra as follows Sp ntra = Max par 1 otherwse pattern p enables parallelzaton for all the blocks Smlarly, for the nter-block ppelnng factor Sp nter, f we can transform all of the blocks to follow the same data access pattern for nter-block communcaton, then we can fully enable nter-block ppelnng. Therefore, we defne Sp nter as follows S nter p = { K pattern p enables ppelnng 1 otherwse where K s the number of blocks n the program. Note that we gnore the ppelnng fll and dran overhead n the above estmaton. Fnally, the mplementaton of the access patterns may ncur addtonal overhead, whch can sgnfcantly affect the program s performance estmate. Both row (slope = 0), column (slope = ) and ther reverses are easy to mplement wthout any overhead, but dagonal access patterns requre loop skewng, as shown n Fgure 1. Therefore, the nner-loop teraton domans are a functon of outer-loop ndex values. When slope = 1, we can use mn and max operatons to bound the nner-loop teraton domans, as shown n Fgure 1 (c); but wth 1 < slope < N or 0 < slope < 1, we must use a combnaton of mn, max, cel and floor functons to bound the nner-loop teraton domans. As dscussed prevously, we apply ntra-block ppelnng (e.g., loop ppelnng wthn a block) for all the blocks by default to mprove the performance. However, to enable ntra-block ppelnng, the nner-loop bounds have to be constants for HLS tools. Thus, for dagonal access patterns, we use the maxmum loop bound (N) for the nner loop but add extra f-else condtons to flter out the false loop teratons. The f-else statement compares the current nner-loop ndex wth ts doman and executes the loop body only f the condton s true. The f-else comparson to flter out false loop teratons takes extra cycle and the number of false teratons depends on the doman of the outer loop. When slope > 1, the doman of the outer loop ncreases lnearly wth the slope; when slope < 1, the doman of the 1 outer loop ncreases lnearly wth. Hence, the performance slope overhead cost p s defned as follows 0 slope = 0 or slope = cost p = C slope 1 slope < N C 1 slope 0 < slope < 1 where C s a constant. In our formulaton, the ntra-block parallelzaton and nter-block ppelnng are performed globally as a coarse-graned optmzaton. It s possble that a fne-graned optmzaton that selects dfferent parallelzaton and ppelnng parameters among dfferent sets of blocks and/or communcaton paths could yeld better performance. We expect that ths could be mplemented by applyng ths technque separately to subsets of program blocks wth addtonal refnement to the performance model and buffers between the blocks; we leave these tasks for future work. Pattern Selecton. As shown prevously, there are multple data access patterns that nclude row, column and dagonal patterns wth a varety of slopes and traversal drectons. Wth hgh-dmenson loop nests, ths can be a large number of canddate patterns; however, we can easly prune the search space usng the data dependences to prune canddates that would volate dependences. For stencl applcatons, the update of one data tem depends on ts neghbors n the surroundng stencl wndow, whch s usually small compared to the sze of the entre array. For dagonal access, we only need to focus on the data access pattern wth 1 slope n, n where n s the stencl wndow dmenson. For all remanng slope values (n < slope < N and 1 < slope < 1 ), they are equvalent N n to slope = n or slope = 1 n terms of the traversal order of data n tems n the stencl wndow, except that they ncur addtonal overhead n mplementng the loop bounds. Ths traversal order s the only factor that affects the ablty to apply ntra-block parallelzaton and nter-block ppelnng. Usng these dependences to prune the lst, t s n practce feasble to estmate performance for each of the canddates and choose the best. Note that we only explore a subset of legal polyhedral transformatons n ths work. In theory, t s also possble to select the optmal pattern followng the polyhedral optmzaton flow n [20] where all the FPGA-specfc features dscussed above ncludng resource modelng and performance metrcs have to be encoded as constrants and cost functons for the optmzaton problem. In practce, our pattern selecton soluton runs very fast for all the tested benchmarks. 3.3 Implementaton Our framework s an automatc flow consstng of three logcal steps: we automatcally transform source code, use HLS tools to enable optmzatons and synthess, and generate FIFO nterfaces between computaton and communcaton blocks. If the communcaton block requres a large communcaton buffer and a complex mult-read communcaton nterface, we automatcally nsert customzed communcaton blocks. For the frst step, we ntegrate our framework nto PoCC polyhedral framework [1]. Our framework defnes data access patterns, estmates performance, selects desred patterns based on performance metrcs and performs loop transformatons. The PoCC framework s at the C-code source level; we have also modfed the framework to automatcally produce source code compatble wth our chosen HLS tool, AutoPlot [25] 3. Next, we use the AutoPlot HLS tool to enable optmzatons and synthesze the computaton blocks. AutoPlot provdes a set of drectves for optmzaton, ncludng loop_unrollng and ppelne for the ntra-block parallelzaton and ppelnng, and the dataflow and AP _F IF O drectves for nter-block ppelnng (wth communcaton through a FIFO nterface). AutoPlot wll execute unrolled loop teratons n parallel f there are no data dependences between the teratons. Smlarly, AutoPlot wll overlap executon of datadependent blocks f they use the same access pattern and can thus communcate through a FIFO nterface. However, for the blocks that have multple accesses to an array, ether due to algorthm (e.g., stencl codes read a pattern of neghborng data) or parallelzaton (e.g., multple teratons execute and access data n parallel), memory bandwdth bottleneck often prevents the desgn from reachng the expected ntra-block parallelsm and nter-block ppelnng. To allevate the memory bottleneck, we mplement the memory partton technque [16] that parttons the arrays nto several banks and enables more array accesses per clock cycle. 3 AutoESL was acqured by Xlnx; AutoPlot s now part of Vvado HLS. 15

8 It s mportant to emphasze agan that although AutoPlot provdes these drectves for ntra- and nter-block optmzatons, these optmzatons can not always be appled wth the default data access patterns and AutoPlot s unable to dentfy the optmzaton opportuntes to enable them through transformatons. Thus, the mportance of ths work s not ust that we generate the AutoPlot drectves that are already avalable, but that we automatcally mprove the number of stuatons n whch the drectves can be used. In the frst two steps, we can handle applcatons wth smple nter-block communcaton patterns that can be transformed to use the FIFO nterface automatcally. However, some communcaton patterns are more complex; a block may perform multple data reads per nner-loop teraton. In these stuatons, we need an optmzed communcaton block. Partcularly, we automatcally nsert a reuse buffer [15] that stores data that wll be temporally reused. The mplementaton of the reuse buffer can use mult-port BRAMs or regsters dependng on resource and buffer szes, as shown n Fgure 7. Block 1 for () for() wrte(stream, data); wrte BRAMs Bank 0 Bank 1 Block 2 for () for() read(stream, data); read Communcaton Block Regsters decoder Fgure 7: Communcaton block. 4. EXPERIMENTAL EVALUATION In ths secton, we present our expermental results usng a set of real-world applcatons that contan multple data-dependent blocks and communcate through complex data access patterns. We frst dscuss the experments setup and evaluated benchmarks. Then, we show the performance mprovement of our proposed framework. 4.1 Experments Setup For our experments, we use a set of benchmarks from Poly- Bench 3.0 [1] and some real-world applcatons from [11]. Table 2 descrbes the benchmark detals. Our framework s based on the polyhedral compler nfrastructure PoCC 1.1 [1]. PoCC s a source-to-source compler that ncludes a set of tools for polyhedral complaton. It extracts the polyhedral ntermedate representaton at source code level. We modfy PoCC to defne data access patterns, evaluate FPGA archtecturespecfc performance, perform loop transformatons and code generaton. Fnally, PoCC also provdes lbrares for program dependences checkng. The output of our modfed PoCC s transformed C code wth AutoPlot pragmas to enable the HLS optmzatons. Then, we use the AutoPlot HLS tool verson to synthesze the transformed C code nto Verlog RTL. As prevously dscussed, AutoPlot supports ntra- and nter-block optmzatons. We automatcally nsert drectves nto the confguraton fle to enable these optmzatons. The target FPGA platform s Xlnx-Vrtex-6 LX75T. We synthesze the RTL generated from AutoPlot usng Xlnx ISE 13.1 and gather area and clock perod data. To compute the operatng frequency, we round down the operatng frequency determned by the synthess report s achevable clock perod to an nteger. The clock cycles are collected through smulaton usng Modelsm 6.1. Fnally, we compute latency usng the operatng frequency and clock cycles and compute speedup usng the latency value. Table 2: Benchmarks Benchmark Descrpton Deconv Image Rcan Deconvoluton [11] Denose Image Rcan Denosng [11] Seg Image Segmentaton [11] Sedel Sedel stencl computaton [1] Jacob Jacob stencl computaton [1] 4.2 Performance Improvement We compare the performance of three dfferent mplementatons. The frst mplementaton s the baselne t s the orgnal source code wthout usng ntra-block parallelzaton or nter-block ppelnng optmzaton. The second mplementaton s an mproved verson t apples the ntra-block parallelzaton and nter-block ppelnng optmzatons to the orgnal source code when supported wthout code transformaton. The frst and second mplementatons do not requre source code transformaton. Although the second mplementaton tres to use optmzatons, they cannot be enabled f the default data access pattern does not support t. The thrd mplementaton s our proposed mplementaton t transforms the source code and then enables the ntra- and nter-block optmzatons. Note that by default, we apply ntra-block ppelnng wthn each ndvdual block for all the mplementatons and benchmarks. Performance Comparson. Table 3 descrbes the clock cycles, resources and acheved frequences of the three mplementatons for all the benchmarks. Table 4 presents the detals of the chosen data access patterns and enabled optmzatons wth and wthout transformaton for all the benchmarks. Compared to the baselne desgn, the second and thrd mplementatons mprove the latency n clock cycles at the expense of more resource usage usng the ppelnng and parallelzaton optmzatons. Ths demonstrates that wth greater opportunty for ntra-block parallelzaton and nter-block ppelnng, HLS tools can effectvely use addtonal FPGA resources. The latency speedup s shown n Fgure 8. The speedup s normalzed to the baselne mplementaton. Compared to the baselne, the second mplementaton (wthout transformaton, wth optmzaton) acheves an average of 4.89X speedup. Our proposed mplementaton (wth transformaton, wth optmzaton) acheves further speedup by enablng more optmzatons through polyhedral transformatons. Our proposed mplementaton acheves an average of 29.59X speedup over the non-optmzed source code and 6.04X speedup over code that s optmzed but not transformed to enable optmzatons. The average speedup s computed usng geometrc mean. The performance mprovement s from three-fold: frst, the ntrablock parallelzaton mproves computaton latency wthn blocks; second, the nter-block ppelnng mproves computaton latency by overlappng the executon of blocks; and thrd, our memory and communcaton block optmzaton mproves the memory bandwdth. We agan use the example of the Denose benchmark to demonstrate these ponts; the second mplementaton employs nter-block ppelnng n conuncton wth memory partton. These two optmzatons mprove clock cycles by overlappng the executon of two blocks and allowng multple memory accesses per clock cycle. Wthout code transformaton, the ntra-block parallelzaton s avalable for the frst block but not the second block as shown n Fgure 1. We do not perform partal parallelzaton (e.g., parallelzng the frst block only) together wth nter-block ppelnng 16

9 Table 3: Performance and resource comparson of dfferent mplementatons Benchmark Implementaton Cycles LUT FF DSP BRAM Frequency(MHz) w/o trans, w/o opt Deconv w/o trans, w/ opt w/ trans, w/ opt w/o trans, w/o opt Denose w/o trans, w/ opt w/ trans, w/ opt w/o trans, w/o opt Seg w/o trans, w/ opt w/ trans, w/ opt w/o trans, w/o opt Sedel w/o trans, w/ opt w/ trans, w/ opt w/o trans, w/o opt Jacob w/o trans, w/ opt w/ trans, w/ opt Table 4: Detals of optmzatons and data access pattern Benchmark Optmzatons Selected Data Access Pattern w/o trans w/ trans Deconv nter-block ppelne nter-block ppelne & ntra-block parallel dagonal (slope = 1) Denose nter-block ppelne nter-block ppelne & ntra-block parallel dagonal (slope = 1) Seg none nter-block ppelne & ntra-block parallel dagonal (slope = 1) Sedel nter-block ppelne nter-block ppelne & ntra-block parallel dagonal (slope = 2) Jacob ntra-block parallel nter-block ppelne & ntra-block parallel row because f the throughputs are not matched then t would requre a buffer sze proportonal to the total data sze whch s not feasble n general. The thrd mplementaton transforms the code to a dagonal access pattern. Then, n addton to nter-block ppelnng and memory customzaton, t enables ntra-block parallelzaton for both blocks that allows multple teratons to execute smultaneously. The thrd mplementaton maxmzes resource use such that each block s parallelzed to the same degree and fts n the Vrtex 6 LX75T, whch has LUTs, FFs, 288 DSPs, and Kb BRAMs. For Denose, the ntra-block parallelzaton degree s 9, gven the resources on FPGA. As a result, our mplementaton acheves 31.09X and 9.14X speedup over the frst and second mplementatons, respectvely. In addton, our optmzaton produces a sde-beneft of mproved operatng frequency. As stated earler, we assume that the clock perod s not affected by our transformatons and make no specfc effort to mprove clock perod. However, because our technque mplements smplfed memory and communcaton nterfaces, we also have the beneft of reducng the complexty of addressng and communcaton structures. Therefore, we also mprove the desgn s crtcal path and get a correspondng modest mprovement n operatng frequency as shown n Table 3. Latency Speedup w/o trans, w/o opt w/o trans, w/ opt w/trans, w/ opt Deconv Denose Seg Sedel Jacob Fgure 8: Latency speedup comparson. Data Access Pattern and Optmzaton. For benchmarks Denose, Deconv and Sedel, wthout transformatons nter-block ppelne can be enabled as dfferent blocks communcate n the same order; but ntra-block parallelzaton cannot be enabled wth the default data access order. Our mplementaton transforms them nto dagonal access order wth dfferent slopes, whch enables both ntra- and nter-block optmzaton. The slope of the data access order s chosen based on the performance metrc we developed n secton 3.2. In partcular, we choose slope = 2 for Sedel as t allows smultaneous executon of the teratons on the same dagonal lne and retan nter-block ppelnng. For benchmark Seg, wthout transformaton nether ntra-block parallelzaton nor nter-block ppelnng can be enabled. Thus, the frst and second mplementaton are the same as shown n Table 3. Our mplementaton transforms t nto dagonal access pattern (slope = 1), whch enables both ntra- and nter-block optmzaton. For benchmarks Jacob, wthout transformaton ntra-block parallelzaton s avalable as there are no dependences among teratons, but nter-block ppelnng cannot be enabled as two blocks produce and consume data n dfferent orders. Our mplementaton transforms t to enable both ntra- and nter-block optmzaton. For all the benchmarks, our mplementaton successfully enables both ntra-block parallelzaton and nterblock ppelnng optmzatons. 5. RELATED WORK Hgh level synthess has seen sgnfcant advances n recent years, mprovng the qualty of nput source languages. There are many currently actve HLS tools n both ndustry and academa, such as [8, 25, 4, 7, 19, 14]. Leadng HLS tools support varous ntrablock and nter-block optmzatons, ncludng ppelnng and parallelzaton. Wth such powerful optmzatons, HLS offers ncreased productvty wth lower desgn effort; however, n practce these transformatons are dffcult to apply only certan data access 17

10 patterns are supported, lmtng the applcablty of an mportant HLS feature. Recent studes show that there s stll a sgnfcant performance gap between manual desgn and HLS-generated desgns [23, 17, 10], and the nablty to apply these optmzatons s one of the causes of ths gap. Real-world applcatons contan multple data-dependent blocks that communcate through complex data access patterns, wth source code that was not orgnally ntended for HLS. These applcatons commonly have orgnal data access patterns that do not support the HLS optmzatons; thus, t s crtcal to transform the source code to enable these optmzatons. Pror work n optmzaton opportuntes for data-dependent blocks ndvdually worked on ppelnng and communcaton technques [26, 22, 9]. Zegler et al. developed coarse-graned ppelnng for data-dependent loops that fnd effcent communcaton schemes [26], but they assume that the data access order s already dentcal between the communcatng loops. Cong et al. developed a resource constraned schedulng formulaton for the communcaton problem that can fnd the optmal communcaton order [9], but the technque requres completely unrolled loops, and addtonal storage and computaton overhead for communcaton reorderng. Smlarly, Rodrgues et al. presented a fne-graned synchronzaton technque wth hardware nter-stage buffers to manage the communcaton [22]. Pror works retan the orgnal executon order but apply technques to reorder communcaton; however, these technques may requre large nter-stage buffers n order to handle the reorderng, whch lmts the feasble problem szes they can support. Furthermore, these works manually optmze communcaton nterfaces, rather than automated optmzaton such as that performed for hgh level synthess. In our technque, we nstead rely on loop transformatons to convert the executon order so that the communcaton order s optmzed, whle smultaneously mnmzng the need for nterstage buffers. The polyhedral model based loop transformatons are based on a mathematcal model that can represent any composton of loop transformatons usng affne transformatons [24, 20, 13]. In the past, polyhedral models have been used for maxmzng parallelsm whle mnmzng communcaton for parallel computng [18, 2]. Recently, polyhedral models have been used n hgh level synthess for FPGAs to optmze on-chp memory bandwdth [12, 21], or to optmze the SDRAM bandwdth [3]. In contrast, our approach s to optmze multple data-dependent blocks smultaneously n order to match ther data access patterns and thus smultaneously optmze both ntra-block parallelsm and nter-block ppelnng. 6. CONCLUSION Hgh level synthess s a crtcal technology to ease the adopton of hardware accelerator resources. However, although current hgh level synthess offers a varety of powerful optmzatons, the mplementaton constrants lmt the applcablty and thus mpact of the optmzatons. Input source codes commonly have complex data access patterns that do not nherently support mportant hgh level synthess optmzaton technques, but polyhedral models can model the data access patterns and fnd loop transformatons that enable these mportant parallelzaton and ppelnng optmzatons. We have presented an ntegrated technque usng polyhedral models to model and enable both ntra- and nter-block optmzatons. Ths ntegrated technque substantally mproves the opportunty to use HLS optmzatons for parallelsm, ppelnng, and fne-graned communcaton. Our framework automatcally dentfes data access patterns and canddate loop transformatons, evaluates the performance beneft of each canddate to select the best opton, performs the code transformatons, and nserts the HLS code drectves and communcaton structures to mplement the optmzed hardware. Expermental evaluaton demonstrates an average of 6.04X speedup over hgh level synthess wthout our transformatons to enable ntra- and nter-block optmzatons. 7. ACKNOWLEDGMENTS Ths work was partally supported by Natonal Hgh Technology Research and Development Program of Chna 2012AA and A*STAR Sngapore under the Human Sxth Sense Proect. 8. REFERENCES [1] Pocc. The polyhedral compler collecton. http: // [2] Nawaaz Ahmed, Nkolay Mateev, and Keshav Pngal. Syntheszng transformatons for localty enhancement of mperfectly-nested loop nests. Int. J. Parallel Program., 29(5): , [3] Samuel Baylss and George A. Constantndes. Optmzng SDRAM bandwdth for custom FPGA loop accelerators. In FPGA, pages , [4] Thomas Bollaert. Hgh-Level Synthess: From Algorthm to Dgtal Crcut, chapter Catapult synthess: A practcal ntroducton to nteractve C synthes. Sprnger, [5] Uday Bondhugula et al. Automatc transformatons for communcaton-mnmzed parallelzaton and localty optmzaton n the polyhedral model. In CC, pages , [6] Uday Bondhugula, Albert Hartono, J. Ramanuam, and P. Sadayappan. A practcal automatc polyhedral parallelzer and localty optmzer. In PLDI, pages , [7] Andrew Cans et al. Legup: hgh-level synthess for FPGA-based processor/accelerator systems. In FPGA, pages 33 36, [8] Jason Cong et al. Hgh-level synthess for FPGAs: From prototypng to deployment. IEEE TCAD, pages , [9] Jason Cong, Ypng Fan, Guolng Han, We Jang, and Zhru Zhang. Behavor and communcaton co-optmzaton for systems wth sequental communcaton meda. In DAC, pages , [10] Jason Cong, Muhuan Huang, and Y Zou. Acceleratng flud regstraton algorthm on mult-fpga platforms. In FPL, pages 50 57, [11] Jason Cong, Vvek Sarkar, Glenn Renman, and Alex Bu. Customzable doman-specfc computng. IEEE Des. Test, 28(2):6 15. [12] Jason Cong, Peng Zhang, and Y Zou. Optmzng memory herarchy allocaton wth loop transformatons for hgh-level synthess. In DAC, pages , [13] Paul Feautrer. Some effcent solutons to the affne schedulng problem: I. one-dmensonal tme. Int. J. Parallel Program., 21(5): , [14] Swath T. Guruman et al. Hgh-level synthess of multple dependent CUDA kernels on FPGA. In ASPDAC, [15] Ilya Issenn, Erk Brockmeyer, Mguel Mranda, and Nkl Dutt. DRDU: A data reuse analyss technque for effcent scratch-pad memory management. ACM Trans. Des. Autom. Electron. Syst., 12(2), [16] Peng L et al. Memory parttonng and schedulng co-optmzaton n behavoral synthess. In ICCAD, pages , [17] Yun Lang et al. Hgh-level synthess: Productvty, performance, and software constrants. J. Electrcal and Computer Engneerng, [18] Amy W. Lm, Gerald I. Cheong, and Monca S. Lam. An affne parttonng algorthm to maxmze parallelsm and mnmze communcaton. In ICS, pages , [19] Alex Papakonstantnou et al. Multlevel granularty parallelsm synthess on FPGAs. In FCCM, pages IEEE, [20] Lous-Noël Pouchet et al. Loop transformatons: convexty, prunng and optmzaton. In POPL, pages , [21] Lous-Noël Pouchet, Peng Zhang, P. Sadayappan, and Jason Cong. Polyhedral-based data reuse optmzaton for confgurable computng. In FPGA, [22] Ru Rodrgues, Joao M. P. Cardoso, and Pedro C. Dnz. A data-drven approach for ppelnng sequences of data-dependent loops. In FCCM, pages , [23] Kyle Rupnow et al. Hgh level synthess of stereo matchng: Productvty, performance, and software constrants. In FPT, pages 1 8, [24] Mchael. E. Wolf and Monca. S. Lam. A loop transformaton theory and an algorthm to maxmze parallelsm. IEEE Trans. Parallel Dstrb. Syst., 2(4): , [25] Zhru Zhang et al. Hgh-Level Synthess: From Algorthm to Dgtal Crcut, chapter AutoPlot: a platform-based ESL synthess system. Sprnger, [26] Hed E. Zegler, Mary W. Hall, and Pedro C. Dnz. Compler-generated communcaton for ppelned FPGA applcatons. In DAC, pages ,

Parallelism for Nested Loops with Non-uniform and Flow Dependences

Parallelism for Nested Loops with Non-uniform and Flow Dependences Parallelsm for Nested Loops wth Non-unform and Flow Dependences Sam-Jn Jeong Dept. of Informaton & Communcaton Engneerng, Cheonan Unversty, 5, Anseo-dong, Cheonan, Chungnam, 330-80, Korea. seong@cheonan.ac.kr