Loop Pipelining for High-Throughput Stream Computation Using Self-Timed Rings


Gennette Gill, John Hansen and Montek Singh
Dept. of Computer Science, Univ. of North Carolina, Chapel Hill, NC 27599, USA
{gillg,jbhansen,montek}@cs.unc.edu

ABSTRACT

We present a technique for increasing the throughput of stream processing architectures by removing the bottlenecks caused by loop structures. We implement loops as self-timed pipelined rings that can operate on multiple data sets concurrently. Our contribution includes a transformation algorithm which takes as input a high-level program and gives as output the structure of an optimized pipeline ring. Our technique handles nested loops and is further enhanced by loop unrolling. Simulations run on benchmark examples show a 1.3 to 4.9x speedup without unrolling and a 2.6 to 9.7x speedup with twofold loop unrolling.

1. INTRODUCTION

This paper targets the domain of high-performance digital ICs that are implemented using pipelined dataflow architectures. We focus on stream processing architectures, i.e., those that take a stream of data items and produce a stream of processed results. High-speed stream processors are a natural match for many high-end applications, including 3D graphics rendering, image and video processing, digital filters and DSPs, cryptography, and networking processors. The development of fast stream processors is likely to be key to sustaining the explosive growth we are witnessing in consumer electronics, multimedia applications, and high-speed networking.

While stream processors are well suited to implementing algorithms that are dataflow in character, a key challenge is the efficient implementation of control constructs. In particular, the presence of conditionals (if-then-else constructs) and loops (for and while constructs) in algorithms typically creates performance bottlenecks that limit the throughput of the resulting stream processor even if the remainder of the algorithm is efficiently pipelined. While there has been some recent work on addressing this challenge for conditionals [4, 7], there is no satisfactory approach for efficiently handling loop constructs. This paper, therefore, focuses on algorithmic loops, and provides an efficient approach for their high-throughput implementation.

Existing approaches to implementing loop structures are limited in the throughputs they can achieve. They focus primarily on limited concurrency improvements: (i) shaving off delays from the critical path of an iteration, e.g. by local transformations that change sequential operations into parallel ones, and (ii) slight overlapping of adjacent iterations, i.e. overlapping the end of one iteration with the start of the next iteration of the same data item, or overlapping the end of an item's last iteration slightly with the next item's first iteration. These approaches can somewhat shorten the execution time of each data item's computation. However, none of them truly allows multiple data items to be processed concurrently; only a single data item is allowed to be processed by the loop hardware at any time. Thus, in existing approaches, even if the loop is pipelined at the circuit level, it is effectively unpipelined at the algorithm level.
This paper presents an approach for efficiently implementing iterative loops in hardware, which truly achieves loop pipelining at the algorithmic level. Unlike existing approaches that only target local concurrency improvements, our approach focuses instead on allowing multiple successive data items to be computed concurrently, thereby offering more substantial throughput benefits. Our approach is applicable to iterative algorithms in which successive data items can be computed independent of each other, but where successive iterations on the same item are allowed to be dependent on each other. This class of algorithms is quite rich and can be used in several real-world applications including networking, encryption, multimedia applications, and differential-equation solvers.

Our contribution is twofold: (i) novel architectural building blocks for implementing iterative architectures, and (ii) an automated synthesis approach that allows high-level iterative algorithms to be mapped onto these building blocks. The core architectural building block, a self-timed ring structure, directly implements iterative computations in an efficient manner. We introduce a novel ring interface composed of several helper blocks that allow the ring to operate at its ideal throughput, i.e. neither under-utilized nor congested. Our approach takes advantage of one of the key benefits of self-timed architectures: modular design. The architectural building blocks can be easily composed together because they all have the same flexible external interface, which is robust to differences in internal implementations and variations in internal delays.

Our synthesis approach takes as input a high-level specification of an iterative algorithm and generates the structure of its loop-pipelined implementation. Our synthesis approach also handles conditionals, sequential blocks, and parallel blocks, and is therefore applicable to a rich class of algorithms. The modularity of our building blocks allows our synthesis approach to easily handle nested loops, with arbitrary levels of nesting. Finally, our approach is further enhanced by incorporating loop unrolling at the hardware level, i.e. replicating loop hardware to further exploit data parallelism. Our approach has been validated by applying it to a set of complex examples with real-world applications, including iterative polynomial root-finding, ordinary differential-equation solving, greatest-common-divisor computation, a simple encryption algorithm, etc.

Each example was synthesized using a mature commercial asynchronous synthesis flow, the Haste/TDE (formerly "Tangram") tool suite from Handshake Solutions [1], a Philips subsidiary, both with and without our loop pipelining approach. The results of applying our approach are quite encouraging: a 1.3-4.9x increase in throughput without loop unrolling. When a twofold loop unrolling was applied, even higher speedups were achieved: 2.6-9.7x.

The remainder of this paper is organized as follows. Section 2 presents background on self-timed ring architectures, and discusses previous work on loop optimization. Section 3 then introduces our new approach to loop pipelining; for simplicity of presentation, this initial discussion focuses only on loops with a fixed number of iterations. The basic approach is then extended in Section 4 to provide several advanced features: handling loops with variable iteration counts; handling nested loops; and exploiting loop unrolling.

2. PREVIOUS WORK AND BACKGROUND

2.1 Previous Work

Previous Approaches in Hardware Design. Several approaches in hardware design have attempted to optimize iterative computation. However, their main objective is to reduce the latency of the loop, as opposed to improving its throughput. One such approach is the CASH compiler by Budiu and Goldstein [4], which translates an ANSI C program into data-flow hardware. A different approach by Theobald and Nowick [9] targets generation of distributed asynchronous control from control-dataflow graphs, with the objective of optimizing communication between the controllers, and between a controller and its associated datapath object.

The above approaches offer very limited throughput benefit, since they are mainly targeted to improving a loop's latency. In particular, the optimizations introduced by these approaches focus on shaving off some delays from each loop iteration, and offer modest concurrency increases: the tail-end of an iteration for a given data item is allowed to overlap somewhat with the start of that same item's next iteration, or the tail-end of an item's last iteration overlaps somewhat the start of the next item's first iteration. Neither approach allows multiple distinct data items to be truly processed concurrently. Hence, the throughput benefits of these approaches are modest, and nowhere in the 2-10x range targeted by our proposed approach.

Like our approach, work by Kapasi, Dally et al. [7] targets efficient implementation of stream algorithms. However, it does not address the critical challenge of pipelining loops and only addresses the problems related to conditional branches.

Williams and Horowitz [10, 11] introduced the classic work on self-timed rings. An iterative floating-point divider chip was designed using a self-timed ring structure [11], which was later adapted for use in the early HAL processors. However, this design did not actually allow multiple data sets to be inside the ring concurrently. In addition, their approach does not target translation of high-level algorithms directly into hardware. A key contribution of [10] was an analysis of the performance of self-timed rings which can be loaded with multiple data sets. This analysis is reviewed in Section 2.2.2.

Previous Approaches in Software Design. There are many loop optimization techniques for software compilers; we mention only a few relevant approaches here. In software pipelining [3] the goal is to re-structure a loop so as to reduce the impact of inter-instruction dependences by mixing instructions from successive iterations. Parallelizing compilers often use some combination of loop unrolling and compaction to achieve parallelization between successive loop iterations [2].
There are other approaches used by parallelizing compilers: loop skewing, index-set splitting, and node splitting [8]. However, many of these approaches apply only to loops that iterate over arrays, with each iteration operating on a subsequent array element. As such, they have limited applicability to algorithms that require multiple iterations for each data item.

2.2 Background: Self-Timed Pipelines and Rings

This section briefly reviews asynchronous, or self-timed, pipelines and rings, including their structure, operation and performance.

2.2.1 Asynchronous or Self-Timed Pipelines

Structure. Figure 1 shows the basic structure of a self-timed pipeline. Each pipeline stage consists of a controller, a storage element ("data latch"), and processing logic. A key feature is the absence of a system-wide clock for global synchronization. Instead, synchronization is achieved locally through request/acknowledge handshaking between neighboring stages.

[Figure 1: A simple self-timed pipeline]

Operation. Data is transferred from one stage to the next according to the handshake protocol chosen for the pipeline. Typically, a stage generates a request to initiate a handshake with its successor stage, indicating that new data is ready. If the successor stage is empty, it accepts the data and performs two further actions: (i) it acknowledges its predecessor for the data received, and (ii) it initiates a similar handshake with its own successor stage further down the pipeline.

Performance. There are three metrics typically used to characterize the performance of a self-timed pipeline: forward latency, reverse latency, and cycle time. The throughput of the pipeline as a whole can be derived from these three metrics. The forward latency is simply the time it takes one data item to flow through an initially empty pipeline. Thus, if the forward latency of stage i is L_i, then the latency of the entire pipeline is:

    L_{Pipeline} = \sum_i L_i    (1)

Similarly, the reverse latency characterizes the speed at which empty stages, or "holes," flow backward through an initially full pipeline. The reverse latency of the entire pipeline is simply the sum of the stage reverse latencies:

    R_{Pipeline} = \sum_i R_i    (2)

The cycle time of a stage, denoted by T_i, is the minimum time that must elapse after a data item is produced by that stage before the next data item will be generated. Since a complete cycle of a stage typically consists of transmitting a data item forward, followed by accepting a hole from the next stage, the following relationship holds:

    T_i = L_i + R_i    (3)

The maximum throughput a stage can support is simply the reciprocal of its cycle time. The cycle time of a linear pipeline is typically limited by the cycle time of its slowest stage:

    T_{Pipeline} = \max_i T_i    (4)
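To make Equations 1-4 concrete, the following small calculation (a sketch of ours, not from the paper) derives the pipeline-level metrics from a list of per-stage forward and reverse latencies; the Python names are illustrative only:

    # Hypothetical helper: stages is a list of (forward, reverse) latencies in ns.
    def pipeline_metrics(stages):
        forward = sum(l for l, r in stages)       # Eq. 1: L_Pipeline
        reverse = sum(r for l, r in stages)       # Eq. 2: R_Pipeline
        cycle = max(l + r for l, r in stages)     # Eq. 3 per stage, Eq. 4 overall
        return forward, reverse, cycle, 1.0 / cycle

    # Four fast stages plus one slow stage; the slow stage sets the cycle time.
    print(pipeline_metrics([(2.0, 1.0)] * 4 + [(5.0, 2.0)]))
    # -> (13.0, 6.0, 7.0, 0.1428...): 13 ns latency, ~0.14 items/ns throughput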

[Figure 2: A self-timed ring; an interface stage takes the "problem in" and delivers the "solution out".]

2.2.2 Self-Timed Pipeline Rings

Frequently, self-timed pipeline stages must be arranged in configurations other than simple linear pipelines to meet the requirements of an application. One configuration of special interest in this paper is a self-timed ring. The ring structure allows one data item to repeatedly pass through the same sequence of processing stages, thereby allowing iterative computations to be implemented.

Structure. Figure 2 shows the basic structure of a self-timed ring. The ring contains a number of stages which perform computation, and an interface stage whose function is to load data items into the ring and to drain results from the ring.

Operation. Once an item is introduced into the ring through the interface stage, it revolves inside the ring until some terminating condition indicates that the computation is complete. The result is then drained from the ring through the interface stage.

Performance. The classic work by Williams [10] introduced a useful metric for measuring the performance of a ring: the number of times any data item crosses the interface stage per second. We will refer to this measure as the ring frequency. The performance of the ring is highly dependent on its occupancy, i.e., the number of data items revolving inside it. When the number of data items is small, the ring frequency is low, and the pipeline is said to be data limited. On the other hand, when nearly every stage of the pipeline is filled with data items, the performance is once again limited, because holes are needed to allow data items to flow through the pipeline; in this scenario, the pipeline is said to be congested, or hole limited.

Data Limited Operation. If there are k data items in the ring, then in the time a particular data item completes one revolution around the ring (i.e., \sum_i L_i), all k items would have crossed the interface stage. Therefore, the maximum ring frequency attainable is proportional to the ring occupancy:

    Ring Frequency <= k / \sum_i L_i    (5)

Hole Limited Operation. If the ring is filled with data items in nearly all stages, then the ring frequency is limited by the number of holes in the ring. For one data item to cross the interface stage, exactly one hole must cross in the reverse direction. If there are h holes in the ring, then in the time a particular hole completes one revolution around the ring (i.e., \sum_i R_i), all h holes would have crossed the interface stage, thereby allowing h data items to advance. The maximum ring frequency is therefore proportional to the number of holes. Thus, if N is the number of stages in the ring, then h = N - k, and we have the following bound on the performance:

    Ring Frequency <= (N - k) / \sum_i R_i    (6)

Figure 3 shows a plot of the ring frequency versus its occupancy. The rising portion of the graph represents the data limited region, where performance rises linearly with the number of data items. The falling portion, similarly, represents the hole limited region, where performance drops linearly with a decrease in the number of holes.

[Figure 3: Upper bounds on the ring frequency versus ring occupancy k: a rising data-limited line of slope 1/ΣL, a falling hole-limited line of slope 1/ΣR, and a horizontal cap ("limited by a slow stage") at the maximum ideal frequency 1/(L+R).]

Limitations Due to a Slow Stage. If all the stages in the ring have similar forward and reverse latencies, then the maximum attainable performance will be the frequency at which the rising and falling lines of Figure 3 intersect. This point represents a frequency that is the inverse of the cycle time of each ring stage: 1/T = 1/(L + R). However, if some stages are slower than others, then the ring frequency will be limited by the cycle time of the slowest stage. In the figure, the horizontal line represents the maximum operating rate that can be sustained by the slowest stage in the ring [10]:

    Ring Frequency <= 1 / \max_{i=1..N} T_i    (7)

The overall ring performance will always be constrained to lie under the canopy formed by the three lines in Figure 3.
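As an illustration (again our sketch, not the paper's): for a ring of N identical stages with per-stage forward latency L and reverse latency R, the canopy of Equations 5-7 and the ideal occupancy where the rising and falling lines meet can be computed directly:

    def ring_frequency_bound(k, N, L, R):
        data_limited = k / (N * L)            # Eq. 5: rising line
        hole_limited = (N - k) / (N * R)      # Eq. 6: falling line
        slow_stage = 1.0 / (L + R)            # Eq. 7: horizontal cap
        return min(data_limited, hole_limited, slow_stage)

    def ideal_occupancy(N, L, R):
        # Intersection of Eqs. 5 and 6: k/(N*L) = (N-k)/(N*R) => k = N*L/(L+R)
        return N * L / (L + R)

    N, L, R = 10, 3.0, 1.0
    C = ideal_occupancy(N, L, R)                    # 7.5 data sets for this ring
    print(C, ring_frequency_bound(round(C), N, L, R))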
3. BASIC APPROACH: PIPELINING LOOPS

We now introduce our basic approach for converting iterative loops into self-timed pipeline rings, with the benefit of significantly higher throughput. In this section, we focus on a simple case: a for loop with a constant iteration count, i.e., one that iterates the same number of times on each data set. Our advanced approach is introduced in Section 4, which is able to handle a wider variety of specifications, including loops with variable iteration counts, nested loops, and unrolled loops.

We first motivate our work in Section 3.1 by showing how a for loop can be a bottleneck in the compute pipeline, thereby severely limiting the throughput obtained. Then we introduce our approach for eliminating this bottleneck, including hardware templates for pipelining the loop (Section 3.2), an algorithm for mapping a high-level code fragment onto this template (Section 3.3), and an analysis of the performance benefit of our approach (Section 3.4).

3.1 Motivation: The Loop Bottleneck

We illustrate our method using the simple code example shown in Figure 4. This example represents a generic code fragment for a streaming hardware system that iterates on every data set using a for loop. In the example, the symbols s1 through s8 represent statements. The code consists of a main procedure that performs communication with the environment and a compute function that does the computation. Once a data set is received from the environment by main, it is sent to compute for processing, and the result is then sent to the environment. These operations are performed repeatedly by main.

    func compute(in_context)
        s1; s2;
        for i = 1 to N
            s3; s4; s5; s6;
        end
        s7; s8;
        return(out_context)

    proc main
        while (true) do
            read(input);
            output = compute(input);
            write(output);
        end

Figure 4: Sample code for a stream processor that iterates on every data set using a for loop

[Figure 5: A simple implementation of the compute function: a linear pipeline read(), s1, s2, a loop block (the for stage cycling data through s3, s4, s5, s6), s7, s8, write().]

Rather than specifying variables in this code, we consider functions to have a context which is operated upon and modified by each statement. At the beginning of the compute function, the context consists of the entire input set sent to the function call, labelled in_context. This initial context is augmented with any locally defined variables before being sent into the function. At the end of the function, the context holds the set of outputs to be returned to the calling environment.

The operation of the code fragment is as follows. The main procedure reads a data set from the environment and invokes the compute function. The value passed from main forms the initial context (i.e., set of inputs) for compute. Inside the body of compute, a certain number of statements, e.g., s1 and s2, operate on the context. Next, the for loop operates on the context for N iterations. Then a certain number of statements, represented by s7 and s8, operate on the context after the for loop. Finally, the modified context is returned to main, which communicates the relevant portion of it to the receiving environment.

A direct translation of this code into data-driven hardware typically yields the schematic structure shown in Figure 5. Each statement in the code, s1 through s8, becomes a pipeline stage. Statements s3 through s6, along with the for statement, compose the loop block. During operation, the data streams in from the environment through the read() stage and is streamed out to the environment through the write() stage.

A key observation is that the loop block has the same external interface as an individual pipeline stage, even though internally it contains an entire ring for iterative computation. Specifically, the loop block accepts one data set from the predecessor stage, performs a calculation, and passes the computed result on to a successor stage. It does not accept new data until the results of the calculation have been accepted by the successor stage.

Interestingly, the presence of a loop in the compute body has the same effect on the performance of the stream processor as a single pipeline stage with long latency. As discussed in Section 2.2.1, the throughput of a pipeline is determined by the cycle time of the slowest stage (Equation 4), which in turn directly depends on that stage's latency (Equation 3). As a result of the long effective latency and cycle time of the loop block, the throughput of the entire stream processor is severely limited. Even if each of the individual stages s1 through s8 is implemented efficiently, the presence of a single long-latency for loop drastically diminishes the throughput obtained through pipelining.

Figure 5 illustrates this bottleneck scenario by indicating the presence of data with a dot. Stages downstream from the loop block are idle while the stages upstream from the loop block are stalled. Although some existing approaches can generate slightly more optimal implementations than that of Figure 5, they still suffer from the same bottleneck illustrated by this example, as discussed in Section 2.1. While these approaches [4, 9] somewhat increase concurrency, they only allow a very slight overlap between successive data sets.

3.2 Proposed Ring Structure

In order to overcome the bottleneck of loop computation, our approach introduces a novel structure based on a self-timed ring that significantly improves performance by operating on several data sets at once. However, there are three restrictions on the class of iterative streaming algorithms that can be translated using our method: (i) calculations on distinct data sets should not depend on each other, (ii) no communication with the caller environment should occur within the body of the loop, and (iii) different statements in the loop body should not share any resources. Note that our translation scheme still applies to algorithms in which consecutive iterations on the same data set do depend on each other.

Our proposed ring structure for the code in Figure 4 is illustrated in Figure 6.

[Figure 6: Proposed implementation of compute: stages s1 and s2 feed the ring interface (distributor, counter, arbiter); the ring body contains stages s3 through s6; an if stage tests i < N and increments i, either recirculating a data set or passing it on to s7 and s8.]

In our implementation, the ring as a whole does not have the same interface behavior as an individual pipeline stage: the ring can actually accept a new data set while the previous one is (or several previous ones are) still being operated on, space permitting. The interface is composed of special-purpose helper stages (a distributor, a counter, an arbiter, and an if stage) which allow multiple data sets to enter the ring without interfering. Figure 6 illustrates the ability of the ring to operate on multiple data sets by indicating the presence of data with a dot. Note that each data set within the ring is in a different stage of computation.

Our pipelined ring should ideally be filled with as many data sets as possible without causing throughput degradation due to congestion. As discussed in Section 2.2.2, every self-timed ring has some ideal occupancy, C, at which it achieves maximum throughput. Therefore, our strategy is to allow at most C elements to be present inside the ring at any time. The ideal occupancy can be analytically computed if the forward and reverse latencies of the stages inside the ring are known, as discussed in Section 2.2.2. If, however, these latencies are unknown, or highly data dependent, the ideal ring occupancy is determined by simulation.

The ring operates as follows. When a data set arrives from stage s2 as a new input to the loop, it is first sent to the distributor (labelled dist in the figure). If the ring is at less than ideal occupancy (i.e. if the counter value is less than C), the distributor sends the data set into the ring and increments the counter to keep track of the ring occupancy. The data set then progresses through the stages of the loop body. At the end of an iteration, the data set passes through an if stage. If the terminating condition is not met, the if stage passes the data set back to the beginning of the ring. If the terminating condition is met, the if stage sends the data set out of the ring and also sends a signal to decrement the counter. If the ring had stopped accepting new data sets because it had reached ideal occupancy, this decrementing of the counter will allow a new data set to enter the ring.

One arbiter is necessary in this hardware scheme because data can enter the beginning of the loop from two places. New data arrives via the distributor, and current data loops back via the if stage. These two events can occur at arbitrary times, making arbitration necessary.
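The admission-control behavior of this interface can be mimicked in software. The following model is a hypothetical sketch of ours (the paper describes hardware, not this code): a bounded counter stands in for the distributor/counter pair, and the if stage either recirculates an item or retires it, freeing a slot:

    from collections import deque

    def run_ring(inputs, step, done, C):
        # step(x) models one pass through the loop body; done(x) is the
        # terminating condition checked by the if stage; C is the ideal occupancy.
        pending, in_ring, results = deque(inputs), [], []
        while pending or in_ring:
            while pending and len(in_ring) < C:   # distributor: admit while counter < C
                in_ring.append(pending.popleft())
            survivors = []
            for item in map(step, in_ring):       # one revolution of the ring
                (results if done(item) else survivors).append(item)
            in_ring = survivors                   # retiring an item decrements the counter
        return results

    # Halve each value until it is odd, with at most C = 3 items in flight.
    print(run_ring([8, 12, 7, 40], lambda x: x // 2 if x % 2 == 0 else x,
                   lambda x: x % 2 == 1, C=3))
    # -> [7, 3, 1, 5]: results emerge in completion order, previewing Section 4.1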

3.3 Transformation Algorithm

The algorithm for our hardware translation scheme is shown in Figure 8. The input to our algorithm is a high-level program; the output is a pipelined structural implementation of that program. Currently, our approach handles the following types of language constructs: sequential blocks, parallel blocks, conditionals, and loops.

    Pipe (P : program)
    begin
        if P is a single assignment statement S
            then output single_stage(S)
        else if P is the sequential block "P1; P2"
            then compose_sequential( Pipe(P1), Pipe(P2) )
        else if P is the parallel block "P1 || P2"
            then compose_parallel( Pipe(P1), Pipe(P2) )
        else if P is the conditional "if (B) then P1 else P2"
            then compose_conditional( Pipe(B), Pipe(P1), Pipe(P2) )
        else if P is a loop "for (n) P1" or "while (cond) do P1"
            then compose_loop( Pipe(P1), Pipe(cond) )
    end

Figure 8: Our transformation algorithm

Our algorithm assumes the ability to perform live-variable dataflow analysis, in order to find the IN and OUT sets of sections of code [8]. In particular, the IN set is the set of data values that are required to go into a code fragment, either because they may be used inside, or because they may need to be relayed to a successor of that fragment. The OUT set of a code fragment is the union of the IN sets of its successors. These sets are used to determine which values need to be communicated between stages.

Figures 7(a)-(e) show a graphical representation of our pipeline composition functions.

[Figure 7: A graphical representation of each composition function: (a) output of single_stage(S); (b) output of compose_sequential(P1, P2); (c) output of compose_parallel(P1, P2), which uses fork and join stages; (d) output of compose_conditional(B, P1, P2), which uses fork and merge stages; (e) output of compose_loop(P1, cond), which uses the distributor, counter, arbiter, and if stages. Arrows are labelled with the contexts, e.g. IN(P1) and OUT(P1), passed between stages.]

In these figures, each box represents a set of one or more pipeline stages. Each arrow represents a communication between pipeline stages. Labels on the arrows indicate the context, or set of variables, that must be passed between stages. (Note that the context is computed using the IN and OUT sets of sections of code.) A stage or multi-stage block that is not yet connected to other stages or blocks indicates an input port with an open circle and an output port with a closed circle. These ports will be connected to ports of other stages or blocks at the next upper level of hierarchy during the recursive traversal of our algorithm.
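For illustration, the recursion of Figure 8 can be rendered in a few lines of Python over a toy AST. The tuple encoding and the stand-in composition functions below are our assumptions for the sketch, not the paper's implementation, which emits actual pipeline structure:

    # Stand-in constructors; the real algorithm emits pipeline hardware structure.
    def single_stage(s):                return ("stage", s)
    def compose_sequential(p1, p2):     return ("seq", p1, p2)
    def compose_parallel(p1, p2):       return ("fork_join", p1, p2)
    def compose_conditional(b, p1, p2): return ("branch", b, p1, p2)
    def compose_loop(body, cond):       return ("ring", body, cond)

    def pipe(p):
        kind = p[0]
        if kind == "assign": return single_stage(p[1])                        # single statement
        if kind == "seq":    return compose_sequential(pipe(p[1]), pipe(p[2]))
        if kind == "par":    return compose_parallel(pipe(p[1]), pipe(p[2]))
        if kind == "cond":   return compose_conditional(pipe(p[1]), pipe(p[2]), pipe(p[3]))
        if kind == "loop":   return compose_loop(pipe(p[2]), pipe(p[1]))      # (cond, body)
        raise ValueError("unknown construct: " + kind)

    # The for loop of Figure 4: body s3;s4;s5;s6 under the condition i <= N.
    loop = ("loop", ("assign", "i <= N"),
            ("seq", ("assign", "s3"), ("seq", ("assign", "s4"),
             ("seq", ("assign", "s5"), ("assign", "s6")))))
    print(pipe(loop))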
3.4 Performance Benefit and Overheads

Previous work on performance analysis of rings allows us to predict the speedup obtained by the use of our method. As discussed in Section 2.2.2 and shown in Figure 3, the ring frequency is proportional to the total number of data sets that are revolving inside the ring, as long as the ring is not congested (i.e., hole limited). Therefore, the maximum speedup of our approach is proportional to the ideal ring occupancy.

The speed improvement of our method is largely due to improved hardware utilization. A ring that holds only one data set has a high amount of unused hardware at any given time. By allowing multiple data sets, we are able to obtain high hardware utilization from the components within the ring.

Our approach adds some overhead that decreases the actual speedup and increases total area. Certain helper stages, such as the distributor, counter and arbiter, increase the latency of the ring and add a small area overhead. Also, the counter has some latency and therefore will not allow new data to enter immediately after old data leaves. The most notable increase to area is the extra storage elements that are required in order to hold the entire context at each ring stage. This overhead is necessary to allow each data set to hold its own copy of the loop's context, thereby enabling multiple data sets to coexist independently within the ring.

4. ADVANCED APPROACH

In Section 3 we described our basic scheme for translating a loop with a fixed number of iterations into a self-timed ring with maximal throughput. In this section, we describe three advanced techniques: (i) handling loops that have a variable (e.g., data-dependent) number of iterations, (ii) handling nested loops, and (iii) using loop unrolling to further improve performance.

4.1 Data-Dependent Loops

Many algorithms contain loops that iterate a different number of times depending on the value of the input data set. Our basic approach of translating the loop into a self-timed ring can still be applied, but one problem arises. Specifically, if each data set is allowed to leave the loop as soon as its computation has finished, the data sets will exit the loop in some order that may differ from the order in which they entered. Thus, the data sets exit the loop out of order.

One example of an algorithm that has a data-dependent iteration count is Euclid's algorithm for computing the greatest common divisor (GCD) of two integers. A pseudo-code implementation of this algorithm is shown in Figure 9.

    func GCD( a, b )
        while( b != 0 )
            s = a - b;
            if (s < 0)
                then swap(a, b)
                else a := s
        return a

Figure 9: Euclid's GCD solver

If the approach described in Section 3 were used to implement this code, the values in the output stream would be out of order. Table 1 shows the output in the original order and the output generated by the GCD ring, assuming an occupancy of three.

Table 1: Reordering in GCD computation.

    inputs          original    re-ordered output
    a       b       output      value    tag
    100     208     4           19       1
    209     190     19          3        2
    45      219     3           4        0
    252     114     6           6        3
    136     146     2           2        4
    43      169     1           5        6
    15      155     5           1        5
    133     77      7           7        7

Our solution to the reordering problem is to append a unique tag to each data set; the tag represents a sequence number. For example, the tags 0 through 7 are used in the GCD example in Table 1. This tag becomes a part of the context of the loop, ensuring that even if the results emerge from the loop out of order, they are still tagged correctly. The tag can be used in two different ways: (i) the out-of-order results are simply passed on to the environment along with their tags, or (ii) a re-order buffer is used to correctly order the items before sending them to the environment. The first method is preferable if the environment can handle tagged outputs, which can be useful in applications such as graphics renderers where outputs need not come out in order. The second method, which introduces a re-order buffer, is preferable if the calling environment must remain naive to the re-ordering problem. There has been much research on the design of re-order buffers in the field of computer architecture [6]; our approach simply leverages that work.
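As a small sketch of option (ii) (ours, not the paper's circuit), a software re-order buffer needs only the tag carried in each data set's context: it buffers out-of-order arrivals and releases results as soon as the next expected sequence number is present. Fed the ring output of Table 1, it restores the original order:

    def reorder(tagged):
        # tagged: iterable of (tag, value) pairs in completion order.
        buffered, expected = {}, 0
        for tag, value in tagged:
            buffered[tag] = value
            while expected in buffered:      # release the longest ready prefix
                yield buffered.pop(expected)
                expected += 1

    # GCD ring output from Table 1, as (tag, value) pairs in completion order:
    ring_out = [(1, 19), (2, 3), (0, 4), (3, 6), (4, 2), (6, 5), (5, 1), (7, 7)]
    print(list(reorder(ring_out)))           # -> [4, 19, 3, 6, 2, 1, 5, 7]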
4.2 Nested Loops

Nested loops are implicitly handled by the recursive pipelining algorithm shown in Figure 8. We provide an example of a nested loop transformation here to illustrate this capability. One algorithm that has nested loops is the bisection algorithm for finding a zero of a polynomial. Figure 10 shows the pseudo-code for this algorithm. The body of function bisection contains a while loop which successively halves the search interval for finding a zero of the polynomial. Each iteration of this loop requires the value of the polynomial to be evaluated at the mid-point of this interval. This evaluation is carried out by calling the function poly_eval which, in turn, is an iterative algorithm based on Horner's rule, and contains a for loop with a data-dependent iteration count.

    func bisection( coefs, tol, pos, neg )
    1    while( abs(pos - neg) > tol )
    2        mid := (pos + neg) / 2;
    3        a := poly_eval( coefs, mid );
    4        if a < 0
    5            then neg := mid;
    6            else pos := mid;
         return mid;

    func poly_eval( coefs, x, degree )
        a = coefs[degree];
        for( i = degree-1; i >= 0; i-- )
            temp = a * x;
            a = temp + coefs[i];
        return a

Figure 10: Code for bisection and polynomial evaluation

When our algorithm is applied to the bisection code in Figure 10, it begins by calling the function compose_loop on the entire program. (For convenience, the individual statements in the code will be referred to by their numbered labels.) The function compose_loop in turn triggers a call to Pipe to pipeline the body of the loop. After handling line 2, the algorithm calls compose_sequential(Pipe(3), Pipe([4-6])). The output of Pipe(3) is a ring structure that implements the polynomial loop, as shown in Figure 11.

[Figure 11: Structure for polynomial evaluation: a ring with distributor, counter, and arbiter feeding stages for temp = a * x, a = temp + coefs[i], and i--, with an if stage testing i >= 0.]

The poly_eval ring is finally composed sequentially with the rest of the statements within the bisection body to form the structure shown in Figure 12.

[Figure 12: Bisection loop with nested polynomial evaluation: the outer bisection ring (with its own distributor, counter, arbiter, and a (pos + neg) / 2 stage) contains the poly_eval ring of Figure 11 as a nested block.]

4.3 Loop Unrolling

Unrolling the loop body to form a ring with a greater number of stages can greatly improve performance when combined with our hardware translation approach. Intuitively, this improvement results from the duplication of hardware inside the loop and a corresponding increase in the ring occupancy. Thus, the unrolled loop is able to perform more work per unit time. More formally, the ring frequency (at ideal occupancy, cf. Equation 7) remains fairly unchanged when the loop is unrolled, but every "tick" at the loop's interface now represents a greater amount of work completed. In particular, if the loop is unrolled u times, then every time a data set crosses the interface stage, it indicates that u iterations have just been completed on that data set, rather than just one. Therefore, ignoring overheads, the loop's effective computation throughput increases by a factor equal to the number of times it is unrolled, u.

As a second-order effect, loop unrolling also has the benefit of somewhat reducing the overhead of the special-purpose helper stages: the distributor, the counter and the arbiter. That overhead is now amortized over a larger ring. As a result, the latency of each data set will tend to decrease somewhat, and hardware utilization will slightly increase. One possible negative effect of loop unrolling is that it can cause some data sets to be iterated over more times than necessary, thereby requiring extra checks within the unrolled loop to preserve the semantics of the computation.
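To make the needed extra checks concrete, here is a hypothetical software rendering of a twofold unrolling of the GCD loop of Figure 9 (our sketch, not the paper's hardware): the duplicated body must re-test the termination condition between the two copies so that no data set is iterated past b = 0:

    def gcd_step(a, b):
        # One iteration of the Figure 9 loop body.
        s = a - b
        return (b, a) if s < 0 else (s, b)

    def gcd_unrolled_twofold(a, b):
        while b != 0:
            a, b = gcd_step(a, b)    # first copy of the body
            if b == 0:               # extra guard introduced by unrolling
                break
            a, b = gcd_step(a, b)    # second copy of the body
        return a

    print(gcd_unrolled_twofold(100, 208))    # -> 4, matching Table 1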

Although loop unrolling is a common technique in both software and hardware optimization, our current use of it has much different goals and performance effects. In software compilers, the primary benefit of loop unrolling is to introduce more room to allow instructions to be reordered, with the purpose of reducing stalls due to branch and data hazards. In hardware translation approaches, such as [4, 9], loop unrolling is used in conjunction with compaction to increase concurrency within the loop body. However, these approaches do not allow an increase in the occupancy of the loop, thereby obtaining limited throughput improvement. In contrast, our approach increases the loop occupancy by the same factor as the number of times the loop is unrolled, thereby obtaining dramatically higher speedup.

5. RESULTS

Benchmarks. Our approach targets iterative algorithms that operate on a stream of independent data sets. Section 4 discussed several such examples: GCD, BISECT, and POLY. In addition to those examples, we tested implementations of the following algorithms as benchmarks:

BTREE: The BTREE algorithm searches a binary tree residing in ROM for a given input key. A key match returns the data value associated with the node that matches the key. Behavior is highly data-dependent: the number of loop iterations depends on how deep the input key's node is in the tree.

CRC: The cyclic redundancy check (CRC) algorithm calculates an 8-bit checksum for a given block of input data. This algorithm has a fixed iteration count (16 iterations).

ODE: The ODE example is the ordinary differential equation solver from [12, 13, 9], based on the Euler method. It receives as input the coefficients of a third-degree ordinary differential equation, along with an interval over which to integrate, initial conditions, and a step size. Its output is the final value of the dependent variable. The number of loop iterations depends on the size of the interval and the step size.

TEA: The Tiny Encryption Algorithm (TEA) encrypts a 64-bit block with a 128-bit key. The number of loop iterations depends on the number of encryption rounds chosen. For this example, the number of rounds was fixed at 32 (16 loop iterations).

Experimental Setup. The benchmark examples were synthesized using the Haste/TDE synthesis flow from Handshake Solutions [1], a Philips subsidiary. Haste (formerly "Tangram") is the only mature automated asynchronous synthesis flow available at present. While the control-dominated architectures that Haste currently produces are not an ideal match for pipelined dataflow applications, this experimental setup allows the relative performance benefit of our approach to be accurately estimated.

For each of the benchmark examples, three different implementations were synthesized: (i) a baseline version ("Original") that did not use our approach; (ii) a second version ("Pipelined") that used our loop pipelining approach, but did not use loop unrolling; and (iii) a final version ("Unrolled") that employed a twofold unrolling of its loop along with our loop pipelining approach. Simulation and area estimation tools from the Haste suite were used to quantify the latency, throughput and area of the resulting implementations. Since the performance of some of the examples is data-dependent, input streams containing as many as 100 data sets were used, and latency and throughput results were averaged over them.

Results. Tables 2-3 summarize the results of our experiments. Table 2 presents the area, latency and throughput obtained for each of the benchmarks. The final column presents the normalized throughput (relative to the "Original" version), to illustrate the performance benefit of our approach.
Finally, the area and latency overheads of our approach are summarized in Table 3, once again normalized with respect to the "Original" version.

Table 2: Synthesis Results: Performance Benefit

    Algorithm / Approach    Area (µm^2)   Latency (ns)   Throughput (M items/s)   Normalized Throughput
    BISECT   Original          28928         1946            0.51                  1
             Pipelined         98420        15960            1.03                  2.0
             Unrolled         184400        16000            1.76                  3.4
    BTREE    Original           2900           40           24.65                  1
             Pipelined          7335          110           41.91                  1.7
             Unrolled          10840           75           79.62                  3.2
    CRC      Original           4405           66           14.99                  1
             Pipelined         10730          193           19.79                  1.3
             Unrolled          15080          137           40.21                  2.7
    GCD      Original           1770          108            9.11                  1
             Pipelined          4998          390           12.24                  1.3
             Unrolled           6574          277           23.51                  2.6
    ODE      Original           8931          571            1.75                  1
             Pipelined         15610         1338            3.61                  2.1
             Unrolled          25630         1156            7.07                  4.1
    POLY     Original          23661          367            2.71                  1
             Pipelined         53880         1300            5.19                  1.9
             Unrolled          99280         1226            8.80                  3.2
    TEA      Original          30390         1205            0.83                  1
             Pipelined         96720         2529            4.04                  4.9
             Unrolled         166500         1704            8.07                  9.7

Table 3: Area and Latency (Relative Overheads)

    Algorithm    Pipelined Area   Pipelined Latency   Unrolled Area   Unrolled Latency
                 (Norm.)          (Norm.)             (Norm.)         (Norm.)
    BISECT       3.4              8.2                 6.4             8.2
    BTREE        2.5              2.7                 3.7             1.9
    CRC          2.4              2.9                 3.4             2.1
    GCD          2.8              3.6                 3.7             2.6
    ODE          1.8              2.3                 2.9             2.0
    POLY         2.3              3.3                 4.2             3.3
    TEA          3.2              2.1                 5.5             1.4

Discussion. The results demonstrate that a substantial impact on throughput is achieved by loop pipelining: up to a 9.7x speedup. Without the use of loop unrolling, our approach obtains a throughput improvement by a factor of 1.3 to 4.9. When a twofold loop unrolling was applied along with loop pipelining, the speedup obtained was even higher: a factor of 2.6 to 9.7x.

In greater detail, algorithms that were pipelined into a relatively small number of stages (CRC and GCD) had a limited potential for loop pipelining, because the maximum capacity of the self-timed ring structures in these cases was low. As a result, the throughput benefit was around 1.3x (without unrolling). On the other hand, algorithms that were highly pipelined (BISECT, ODE, and TEA) had larger ring structures which could accommodate a greater number of data sets concurrently. For these benchmarks, the throughput increase was substantially higher: a factor of 2 to 4.9x (without unrolling). As expected, the twofold unrolling led to a throughput increase in each case by a factor of 1.7 to 2.0x, yielding an overall combined throughput benefit of 2.6 to 9.7x relative to the original implementation.

Although our approach results in a significant boost in throughput, there are costs associated with the performance improvement. Shown in Table 3 are the increases in total area consumed, and in the average latency per data set. In terms of area, the pipelined version adds the overheads of loop control discussed in Section 3.4. Each pipeline stage must latch data from the previous stage, so algorithms with many stages or large contexts (e.g. BISECT) will see a large increase in area. By unrolling the loop twofold, the total area increased by a further factor of 1.3-1.9x over the pipelined version.

The average latency for a data set also increased when loop pipelining was used. This was expected because, like most traditional pipelining approaches, our approach increases throughput at the expense of latency. The latency overhead in most of the benchmarks was in the 1.4-3.6x range, except for the example that contained a nested loop, BISECT. For BISECT, the latency overhead is substantially higher (8.2x) due to the compounded overheads of its nested loops. While the area and latency overheads may seem daunting, for most applications the main performance measure is overall execution time. By increasing latency and chip area, a dramatic improvement in throughput results, reducing execution time by a large factor.

6. CONCLUSIONS AND ONGOING WORK

In this paper we presented a technique that allows loop hardware to operate on many data sets at the same time. We proposed a novel loop interface and a set of transformations from high-level code to a pipelined loop structure.

Our results showed a significant throughput improvement over the non-pipelined ring implementation, with further benefits through unrolling.

More work needs to be done to decrease the area overhead of our technique. Our current approach is conservative: each stage in the ring structure stores a distinct copy of the maximal set of live variables that it may need. We are exploring more optimal techniques for reducing this overhead. In ongoing work, we are re-implementing the above benchmarks in a true dataflow style, in order to eliminate the unnecessary control overheads introduced by the Haste synthesis flow. We anticipate significantly better performance and area results after this modification. In addition, a loop-pipelined GCD design has been implemented in silicon. Initial results indicate fully functional parts, though a complete evaluation is pending.

A full-featured transformation algorithm is the next logical step in this work. Additional features include support for resource sharing, unrestricted communication with the environment, and handling dependences across data sets. Other loop-optimizing techniques, such as compaction, are also being pursued.

7. REFERENCES

[1] Handshake Solutions, a Philips subsidiary. http://www.handshakesolutions.com/.
[2] A. Aiken and A. Nicolau. Perfect pipelining: A new loop parallelization technique. In European Symposium on Programming, pages 221-235, 1988.
[3] V. H. Allan, R. B. Jones, R. M. Lee, and S. J. Allan. Software pipelining. ACM Computing Surveys, 27(3):367-432, 1995.
[4] M. Budiu. Spatial Computation. PhD thesis, Carnegie Mellon University, Computer Science Department, December 2003. Technical report CMU-CS-03-217.
[5] A. Davis and S. M. Nowick. An introduction to asynchronous circuit design. Technical Report UUCS-97-013, Dept. of Computer Science, University of Utah, Sept. 1997.
[6] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2002.
[7] U. J. Kapasi, W. J. Dally, S. Rixner, P. R. Mattson, J. D. Owens, and B. Khailany. Efficient conditional operations for data-parallel architectures. In Proc. of IEEE/ACM Intl. Symp. on Microarchitecture, 2000.
[8] K. Kennedy and J. R. Allen. Optimizing Compilers for Modern Architectures: A Dependence-Based Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2002.
[9] M. Theobald and S. M. Nowick. Transformations for the synthesis and optimization of asynchronous distributed control. In Proc. ACM/IEEE Design Automation Conf., June 2001.
[10] T. E. Williams. Self-Timed Rings and their Application to Division. PhD thesis, Stanford University, June 1991.
[11] T. E. Williams and M. A. Horowitz. A zero-overhead self-timed 160ns 54b CMOS divider. IEEE Journal of Solid-State Circuits, 26(11):1651-1661, Nov. 1991.
[12] K. Y. Yun, P. A. Beerel, V. Vakilotojar, A. E. Dooply, and J. Arceo. The design and verification of a high-performance low-control-overhead asynchronous differential equation solver. In Proc. Int. Symp. on Advanced Research in Asynchronous Circuits and Systems, pages 140-153. IEEE Computer Society Press, Apr. 1997.
[13] K. Y. Yun, P. A. Beerel, V. Vakilotojar, A. E. Dooply, and J. Arceo. The design and verification of a high-performance low-control-overhead asynchronous differential equation solver. IEEE Trans. on VLSI Systems, 6(4):643-655, Dec. 1998.