Design of a High-Speed Asynchronous Turbo Decoder

Size: px

Start display at page:

Download "Design of a High-Speed Asynchronous Turbo Decoder"

Ariel Blankenship
6 years ago
Views:

1 Desgn of a Hgh-Speed Asynchronous Turbo Decoder Pana Golan Georgos D. Dmou, Malla Praash, Peter A. Beerel Department of Electrcal Engneerng-Systems Unversty of Southern Calforna, Los Angeles, CA {pgolandmou,mallap,pabeerel}@usc.edu Abstract Ths paper explores the advantages of hgh performance asynchronous crcuts n a sem-custom standard cell envronment for hgh-throughput turbo codng. Turbo codes are hgh-performance error correcton codes used n applcatons where maxmal nformaton transfer s needed over a lmtedbandwdth communcaton ln n the presence of data corruptng nose. Specfcally we desgned an asynchronous hgh-speed Turbo decoder that can be potentally used for new wreless communcatons protocols wth close to OC-12 throughputs. The desgn has been mplemented usng a new statc sngle-trac-full-buffer (SSTFB) standard cell lbrary n IBM 0.18µm technology that provdes low latency, fast cycle-tme, and more robustness to nose than prevously studed sngle-trac full-buffer technology (STFB). A hgh-speed synchronous counterpart usng the same hgh-speed archtecture s desgned n the same technology for comparson. The results demonstrate that for a varety of networ constrants, the asynchronous desgn provdes advantages n throughput per area. Moreover, the asynchronous desgn can support very low-latency networ constrants not achevable wth the synchronous alternatve. 1. Introducton Drven by overwhelmng desgn-tme constrants, standard-cell based synchronous desgn styles supported by mature CAD desgn tools and a largely automated flow domnate the ASSP and ASIC maret places. As devce feature szes shrn and process varablty ncreases, however, the relance on a global cloc becomes ncreasngly dffcult, yeldng far-from-optmal solutons. Because standard-cell desgns use very conservatve crcut famles and are often over-desgned to accommodate worst-case varatons, the performance and power gap between full-custom and standard-cell desgns contnuously wdens [20]. Recent research demonstrates that t s possble to narrow ths gap usng conventonal standard-cell technques wth asynchronous cell lbrares. For example, two prototype standard-cell lbrares n TSMC 0.25µm technology for two dfferent asynchronous templates have demonstrated hgh-performance [1][2]. One usng a Sngle Trac Full Buffer (STFB) lbrary [1][2] successfully operate over a wde range of temperatures and voltages, wth a measured frequency of over 1.2 GHz at a nomnal 2.5 Volts [1]. Ths should be compared to the typcal 300 MHz standard-cell synchronous desgns achevable n the same process. In addton, a second generaton sngle trac famly called Statc Sngle Trac Full Buffer (SSTFB) has been proposed whch promses to have smlar advantages whle beng more robust to hgher process varablty, crosstal nose, and leaage currents [4][6]. Ths paper explores the lbrary development and applcaton of SSTFB to the asynchronous core of an ultra-hgh-throughput turbo decoder desgned for close to OC-12 data rates [18] wth dfferent data bloc szes. We desgned both synchronous and asynchronous ASIC cores to compare area, throughput, and power. The synchronous desgn requres substantally more parallelsm to match the throughput of the asynchronous desgn. Consequently, our post-layout results on a sgnfcant porton of the desgn, ncludng the crtcal path, ndcate that we acheve up to a 2X mprovement n throughput per area for small to medum bloc szes and can support smaller bloc szes at hgher throughputs than achevable by the synchronous counterpart. The remander of ths paper s organzed as follows. Secton 2 wll cover necessary crcut and an ntroducton to turbo codng, wth emphass on the mplementaton challenges of a hgh-speed turbo decoder. Secton 3 wll ntroduce an effcent hghspeed turbo decodng tree-siso archtecture and our synchronous baselne mplementaton. Secton 4 wll then descrbe our asynchronous mplementaton,

2 coverng both the IBM 0.18µm SSTFB lbrary development as well as P&R desgn flow. Secton 5 wll then descrbe the desgn verfcaton and postlayout performance of both desgns. Secton 6 detals the comparson between synchronous and asynchronous cores. Fnally, Secton 7 concludes and outlnes areas of future wor. 2. Bacground Ths secton begns by descrbng the STFB and SSTFB templates, followed by necessary bacground on Turbo decodng STFB Template Fgure 1. Typcal STFB transstor-level schematc Fgure 1 shows a typcal STFB cell s transstorlevel dagram of a n-bt nput logc functon. When there s no toen n the dual-ral output channel (R0/R1), the 2-nput NOR acts as rght completon detecton (RCD) bloc and asserts the B sgnal, enablng the processng of next set of nput toen. In partcular, when the next set of toens arrve at the n dual-ral nput channels (L0 /L1 ) the logc functon s evaluated and one of the state sgnal S0 or S1 s lowered, dependent upon f the logc functon evaluates to 0 or 1. Smultaneously, the state sgnal lowerng causes the 2-nput NAND gate whch acts as a state completon detecton bloc (SCD) to assert A, resettng the toens from the left channels by drvng all nputs low, overpowerng the correspondng statczers (M1, M2). The presence of the output toen on the rght channel resets the B sgnal whch actvates the two PMOS transstors at the top of the N-stac, restorng S0/S1, and deactvatng the NMOS transstor at the bottom of the N-stac, thus dsablng the stage from frng whle the output channel s busy. The cycle tme of the STFB template s 6 transtons wth the forward latency of 2 transtons. Notce that the output and nput channels can have dfferent toens at the same tme, hghlghtng why ths template s a full buffer [1][2]. Notce also that the NMOS transstor stac (N-stac) s desgned to be sem-wea-condtoned n that t wll not evaluate untl all expected nput toens arrve. Ths combnaton of functonalty and nput completon detecton removes the need of usng a left envronment completon detecton bloc (LCD), reducng the template complexty, sze and cycle-tme. Alternatves such as the PCHB quas-delay-nsenstve template separate the N-stac and LCD functonalty, but are substantally larger and slower [2] Statc STFB Template The basc concern of the STFB crcut famly s that the communcaton wres can be tr-stated for some perod of tme wth only a small statczer fghtng leaage and crosstal nose. Whle effectve for 250nm, the nose margn for ths technology may be too low n deeper submcron processes. In partcular, a cross-couplng nose event on a long tr-stated wre can ether create a new toen or remove a toen from the system causng system falure (often n the form of a deadloc). Moreover, n smaller geometres leaage currents may become so hgh that the statczers would need to be made stronger to the pont that they cannot be easly over-powered. For ths reason, Ferrett et al. [6] proposed a new STFB famly called statc STFB (SSTFB) n whch the channel wres are always actvely drven wth modfed drver crcuts. The functonalty of a SSTFB cell s the same as STFB cell wth a notceable dfference that after the sender drves the lne hgh, the recever s responsble for actvely eepng the lne hgh untl t wants to drve t low as shown n Fgure 2. Smlarly, after the recever drves the lne low, the sender s responsble for actvely eepng the lne low untl t wants to drve t hgh. The lne s always statcally drven and no fght wth statczers exsts. Ths means that the eeper crcutry can be szed to a sutable strength creatng a tradeoff between performance/power/area and robustness to nose. The nverters n the eeper crcutry can be also be sewed such that they turn on early creatng an overlap between the drvng and hold logc (as suggested n [11]). Ths overlap avods the channel wre beng n a tr-state condton thus mang the crcut famly more robust to nose. The overlap also helps ensure that the channel wres are always drven close to the power supples further ncreasng nose margns [4]. In ths way, the lnes are never tr-stated and are statcally drven; explanng whey the crcut famly s called statc STFB. Fgure 3 shows a transstor level dagram of a SSTFB Buffer. 2

3 based on the operatonal condtons of the fnal system. In our desgn we have chosen to mplement a SCCC, whch s nown for ts very low error floor capabltes, but the desgn proposed could be easly modfed to decode any of the other types of codes lsted above. Fgure 2. SSTFB drver crcutry The SSTFB template s very flexble and can be expanded to mplement dfferent functonaltes and non lnear ppelne stages, ncludng a merge, for, and full-. We refer the reader to [4] for more detals. L0 L1 eeper A eeper A B B S0 S1 eeper eeper R0 R1 S0 S1 R0 R1 Fgure 3. Schematc of SSTFB BUFFER 2.3. Turbo decodng Turbo decodng s becomng a very popular soluton for error correcton especally n wreless applcatons [13]. Turbo codng started wth the ntroducton of Parallel Concatenated Convolutonal Codes (PCCC) that were proven to acheve performance that s very close to the theoretcal codng bound defned by Shannon s Capacty. Snce then several varatons have been ntroduced, such as Serally Concatenated Convolutonal Codes (SCCC) and Low Densty Party Chec Codes (LDPCC). The same Turbo-Le decodng schedule could be used to process all of the above. In the case of LDPCC ths s true when the code belongs to a class of LDPCC that can be descrbed as a Generalzed Repeat- Accumulate code (GRA) [21]. All these codes acheve Turbo-Le performance, but vary slghtly n terms of performance and computatonal complexty and the selecton for a partcular desgn s made A B Fgure 4. An example of a 4-state FSM encoder and the correspondng trells used for decodng Theoretcal bacground. The basc Turbo-Le encoder structure nvolves an nterleaver and a set of smple error correcton/detecton codes (most commonly convolutonal codes). The structure on the decoder sde loos very smlar to that of the encoder, wth two ey dfferences and s llustrated n Fgure 5. Frst every code n the encoder s replaced by a Soft-In- Soft-Out (SISO) module. Second the data flow on the decoder s b-drectonal and teratve, and an nterleaver/de-nterleaver module n the decoder replaces the nterleaver on the encoder sde. In our mplementaton one SISO s used that can perform both operatons to reduce desgn complexty. Decoded Data Outer SISO De- Interleaver Interleaver Inner SISO Receved Data Buffer Fgure 5. The decoder structure where each SISO s used to decode one CC. Durng each SISO operaton the data bloc s processed along a trells that represents the state of the encoder durng the transmsson process, as llustrated n Fgure 4. The state number s ndcated nsde each state, whle each branch s characterzed by the values of the nputs and outputs of the state machne (shown as x/y on each branch n Fgure 4). The length of the trells matches the number of bts n the data bloc. Each branch n the trells has a branch metrc assocated wth t (not shown n Fgure 6), updated each teraton, whch corresponds to a noton of the relatve probablty assumng that branch too place n the encoder. The SISO module s responsble for updatng the probablty that at tme the value b was encoded by fndng the 3

4 shortest path through the entre trells that has value b at tme. To do ths, for each trells transton the decoder computes the forward state metrcs whch represent the shortest path from the begnnng of the trells up to that pont and the bacward state metrcs whch represent the shortest path nformaton from the end of the trells up to that transton. Ths s shown n Fgure 6, where the shortest path s shown by the bold lnes n the trells and can be derved by the values of the state metrcs. The decoder then can fnd the shortest path where b =1 and the shortest path on whch b =0. Usng ths nformaton t can produce the new probablty for ths bt beng a 1 relatve to beng a 0 based on the nformaton avalable across all the branches of the trells. t = 0 t = 1 t = 2 t = t = +1 t = K-2 t = K-1 t = K F 0 0 F 1 0 F 0 1 F 1 1 F 0 2 F 1 2 F 0 F 1 B 0 +1 B 1 +1 B 0 K-2 B 1 K-2 B 0 K-1 B 1 K-1 B 0 K B 1 K Fgure 6. An example of a 2-state trells and the assocated metrcs durng the decodng process. In partcular, we have chosen to use the Mn-Sum algorthm for the SISO operaton. The Soft Input to the SISO s defned for each bt at tme as the negatve log-lelhood rato of the probablty of a 1 beng transmtted over that of a 0: b ( log ( P{ b = 1} )) ( log( P{ b = 0} )) SI. (1) = The branch metrcs between state and state (f such exsts) s the ont probablty for the partcular trells branch gven all SI, or: BM = SI. (2) b = 1 The forward and bacward state metrcs for tme nstance are then defned recursvely as follows: F B b ( F BM ) = mn (3) ( B BM ) = mn + + (4) 1 where, F s the value of the forward state metrc for state at tme. The ndex only taes the values for whch the transton from state to state s vald. The state metrc calculatons are also referred to as the Add-Compare-Select operaton or ACS and consttute the maorty of the processng tang place n the SISO. An example 4-bt ACS s shown n Fgure 7. Fgure 7. A bt-ppelned 4-bt ACS operator. The blac rectangles ndcate ppelne boundares The SISO outputs are called the Soft Outputs (SO) for all nput and output bts of the encoder FSM. To prevent the values that are sent between SISO modules n the decoder from growng ndefntely, nstead of the actual value the SISO outputs the dfferental (called the extrnsc SO) between the actual value calculated (also called ntrnsc SO) and the orgnal SI nputted. The fnal quanttes are also saturated to a fxed btwdth, to reduce the complexty of the SISO modules. So the Soft Output for bt b at tme s defned as: SO (, ) (, F + BM + B mnf + BM + B ) = mn (5) + ntrb b + 1 b 1,, = 1 = 0 SO extrb = SO nt SI. (6) rb The decodng process from a top-level standpont starts by the receved sgnal beng translated nto metrcs that represent the probabltes for each of the receved bts. Then a SISO module uses the receved sequence as nputs and produces soft outputs. The soft outputs are then nterleaved (or de-nterleaved dependng on the code) and passed onto the SISO modelng the next convolutonal encoder n the transmt sequence as Soft Inputs. Durng the decodng process the SISOs that correspond to all the codes exchange Soft Input data both n the encoder sequence and n the reverse drecton. Each SISO s fred several tmes and the process terates untl the metrcs stop mprovng, or the maxmum number of teratons s reached. For our comparsons we use 6 teratons whch acheves most of the codng gan wthout beng too computatonally ntensve Hgh-speed mplementaton challenges. The mmedate effect of ths teratve approach s that n order to acheve a certan decoded data rate the SISO b 4

5 has to run several tmes faster nternally n order to eep up wth the data. For example a system usng a code wth two convolutonal codes and decodng usng 5 teratons would have to run roughly 10 tmes faster nternally than the target throughput. The calculaton s teratve and cannot be speed up easly. Several tlng approaches have been developed that brea the bloc nto sub-blocs that can be processed n parallel, but normally the logc tself cannot be sped up further than the state metrc calculaton process ( F and B ) due to the data dependency n that calculaton. The adopted Tree-SISO archtecture addresses ths problem and wll be descrbed n detal n a later secton. Even f the state update loop s broen (as n Massera s and Tree-SISO archtectures [14][7][15]) there are other practcal problems that hnder the throughput. A popular approach to ncrease throughput s to use many unts n parallel. Although ths generally wors, t has mplementaton problems that mae the desgn of very hgh-speed Turbo decoders extremely challengng. The frst problem s related to memory access. As mentoned above the data after beng processed has to be nterleaved between SISO modules. In order to get good codng performance, the nterleaver must use a permutaton that s deally random. Therefore, wth a hgh degree of parallelsm many bts per cycle have to frst be stored nto a RAM structure and then retreved n random order. The usual approach s to have multple bans of RAM that each receves data correspondng to one processed bt. As the data comes out of those bans n random order, t then has to be multplexed and dstrbuted bac to the SISOs. Constrants are placed on the nterleaver to ensure t s clash-free. That not only adds sgnfcant complexty to the decoder, due to the crossbar swtch that has to be bult nto the nterleaver, but also places constrants on the nterleaver desgn that could yeld very sub-optmal nterleaver performance, due to lac of randomness. From a hardware standpont t also requres the nstantaton of many more RAM cores that are extremely small and shallow, that consequently requre a lot more area and power than fewer larger and narrower RAM nstances. The second problem s also a sde effect of the nterleaver presence. The entre bloc of data has to be wrtten nto the nterleaver before the data can read to start the next SISO process. Ths s due to the nterleaver s random permutaton, whch mples that the frst bts of data that have to be fetched are lely to be among the last bts of data prevously stored. Consequently, as the degree of parallelsm s ncreased the processng tme can be lnearly reduced, but the ppelne latency remans constant yeldng dmnshng benefts Throughput (Mbps) # of Parallel Processors Fgure 8. Throughput vs. # of processors Performance degradaton s more pronounced n cases wth small data bloc szes where the ppelne latency s comparable or n extreme stuatons larger than the actual processng tme. Ths does not only occur once, but occurs every SISO operaton. Fgure 8 llustrates ths pont by showng the throughput that a decoder can acheve as a functon of processors to perform one SISO operaton. The graph assumes that each processor runs at 100 MHz for 5 teratons for a code that has two convolutonal codes and a data bloc sze of 2 Kbts # of processors for fxed throughput Processor Frequency (MHz) Fgure 9. # of processors vs. processor frequency Fgure 9 llustrates ths pont from a dfferent perspectve by showng the number of processors that can be used to acheve a throughput of 540 Mbps wth varyng processor frequency (assumng 5 teratons and 2 Kbt data bloc sze). It s easy to see that ncreasng the processor frequency can acheve much larger reducton n the number of processors than the expected lnear functon. 3. Synchronous hgh speed Turbo In order to evaluate the performance of our desgn and demonstrate the capabltes that our asynchronous technology has to offer, we desgned a synchronous core as well to be able to compare area, performance and power between the two desgns. We chose to 5

6 desgn the same unt usng both technologes, usng our SSTFB lbrary for the one and the Artsan lbrary for the other Tree SISO Desgnng a very fast Turbo decoder structure has many challenges as mentoned above. Our goal based on our analyss was to desgn the fastest SISO unt possble so that we can eep the degree of parallelsm requred to a mnmum. For ths reason we chose to use a Tree SISO structure [7][15] whch removes the recursve nature n the data path and thus enables fne gran ppelnng. In order to llustrate ths mplementaton, we must defne one addtonal operaton that merges adacent transtons of the trells nto larger trells sectons that correspond to more than one tme ndex. In ths manner, several state metrcs can be computed smultaneously when the approprate state metrc becomes avalable. The new operaton, called the fuson operaton, s defned as: BM p p, = mn( BM + BM ), m (, ) (7), l, m m, l l p The mn operaton s defned over all vald combnatons of branch metrc pars of startng and endng states. The new branch metrcs correspond to the shortest possble path derved from the merged trells sectons between any par of startng and endng states. The structure used to mplement the SISO s borrowed from prefx structures, but wth the addton of a suffx path that s used to perform the operaton bacwards for the Bacward State Metrc calculaton The code We chose a typcal SCCC turbo code structure wth two 2-state convolutonal codes to reduce the sze of the decoder crcut. The data s frst encoded usng a rate ½ non-recursve convolutonal code wth polynomals [1+D, 1+D] and the results are nterleaved. Next a rate 1 recursve code wth a polynomal [1/(1+D)] s used to re-encode the nterleaved data before transmsson. Overall ths yelds a rate ½ code. Puncturng could be used to acheve hgher code rates and ncrease flexblty, wth mnor modfcatons to the desgn, but ths was not done at ths stage for smplcty. For a bloc sze of K bts the frst code has a trells length of K and the second one of 2K. Therefore the throughput equaton s as follows: f K T = 3K I 2 p + 8M where K s the bloc sze n bts, f s the cloc frequency n Hz (or n the case of the asynchronous desgn the equvalent throughput), I s the number of teratons and 8M s the number of bts that can be processed n parallel. Fnally p s the ppelne latency n terms of cycles. Each SISO operaton has to fnsh and store data bac nto memory, therefore the ppelne overhead s present for every half-teraton. The executon schedule for every teraton s shown n Fgure 10. It should also be noted that only the last teraton produces decoded data, whch s why the throughput s nversely proportonal to the number of teratons. Ppelne Overhead p Inner SISO Processng K/4M p (8) Ppelne Overhead Outer SISO Processng K/8M Fgure 10. Executon schedule of every decoder teraton In the case of our asynchronous desgn M s 1 snce we are gong to use a sngle 8-bt wde SISO processor to acheve the desred throughput. In the case of the synchronous desgn, M wll have to be hgher snce the synchronous desgn s much slower than our asynchronous one and multple processors of sze 8 would be requred to acheve the same throughput. We chose 8 snce t s the mnmum sze Tree-SISO for a 2- state code, and t should be noted that due to the structure of the Tree-SISO a 16-bt wde processor s more complex than two 8-bt wde ones P&R results After the schematc desgn was fnshed t was exported to a Verlog netlst and mported nto SOC Encounter for P&R. The lbrares for the IBM 0.18µm technology were used for characterzaton, and tmng constrants were wrtten to defne the target frequency. The core was placed as a standalone module wthout IO pads. The desgn was placed usng tmng-drven placement and was then routed usng tmng-drven routng. After routng was done the cloc tree was syntheszed and the desgn was taen through further processng to fx hold tme volatons and then the fnal tmng analyss was performed. The cloc frequency that was acheved was 475MHz for the entre 8-bt wde core. The area of the core was 2.46mm 2. We assume 6

7 that for hgher degrees of parallelsm multple copes of ths core could be routed separately and that no performance degradaton would be nduced due to the added crcutry. We also assumed that the cloc crcut would be mostly unaffected and that t would ust be multpled n sze ust le the rest of the crcutry. As a pont of comparson, the prevously publshed fastest turbo desgn acheves approxmately 1Gbps at 6 teratons n a 0.18µm process, usng 32mm 2 area and a sngle-buffered nput memory [19]. The code n [19] s a PCCC whch requres the processng of 2K trells steps/teraton, so that structure would decode approxmately 667Mbps for an SCCC code le the one we chose, whch requres processng of 3K trells steps/teraton. Our synchronous desgn has smlar throughput (653Mbps) for M=6 and usng 14.76mm 2 of core area and 2.17mm 2 of memory area, whch ndcates that our synchronous desgn s comparable wth state-of-the-art decoders found n the lterature. Snce our synchronous and asynchronous cores would use the same memory area, our comparson n Secton 6 only consders the core area. Fgure 11. Asynchronous ASIC desgn flow detaled GDSII data, an abstract vew to support placement and routng n LEF format, and fnally ts symbol. Usng ths lbrary, a largely conventonal standard-cell ASIC bac-end desgn flow usng conventonal place and route tools can be used to create the layout, as llustrated n Fgure 11. Note that we currently use Nanosm to perform analog transstor-level smulatons to verfy both correctness and measure performance. Characterzng the lbrary n LIBERTY format whch enables faster bac-annotated gate-level Verlog smulaton, as suggested by [3] s an area of future wor Lbrary desgn The computaton n the chp conssts of addtons, comparsons, and selectons. Consequently, the SSTFB lbrary needed only 14 cells along wth a varety of subcells used to smplfy the development of new lbrary cells. Extensve spce smulatons were done on the schematc to verfy that all the tmng assumptons and specfcatons were acheved. DRC and LVS checs usng DIVA and ASSURA verfcaton sutes were performed on the layout vews to verfy ther correctness. Our SSTFB lbrary currently ncludes only a sngle sze for each cell. Wth ths crcut technology there are fve ways to combat nose and process varatons: 1) ncrease the sze of the eeper transstors 2) ncrease the mnmum separaton between wres n the place and route flow 3) decrease the maxmum allowable length of any route 4) sheld long communcaton wres and 5) sew the hold nverters (INV_HI and INV_LO) shown n Fgure 3 to create more tme overlap between the drvng and hold transstors. Durng lbrary development, we focused on technques 1) and 5), enablng the use of mnmum separaton between wres wth no sheldng for a maxmum wre length of 400 µm. 4. Asynchronous Turbo Ths secton covers the asynchronous Turbo desgn, ncludng a bref descrpton of the overall desgn flow, the lbrary development, gate-level desgn, and physcal desgn Asynchronous ASIC Desgn Flow For each SSTFB cell needed we created four lbrary vews: functonal vews contans the behavoral descrpton of the cell n Verlog HDL, schematc vews contans the transstor level mplementaton of the cell, layout vew contanng 7

8 State + Metrc Par 1 State + Metrc Par Merge Output For For For For For For For For Fgure 12. An example of 4-bt ACS asynchronous bloc. The blac rectangles are slac matchng buffers and the connectons are dual ral statc sngle-trac channels 4.3. Gate-level desgn of asynchronous Turbo The mplementaton of the asynchronous Turbo core was done usng the same approach as that of the synchronous verson. Schematc entry was used for all the modules, startng wth the smaller ones and movng up the herarchy to the top level, shown n Fgure 12. The herarchcal structure for the two desgns was ept largely dentcal. The dfferences n the asynchronous mplementaton are lsted below: Use of dedcated For cells. The SSTFB lbrary mplements pont-to-pont communcaton between cells and a partcular sgnal cannot for and go to two destnatons. To solve ths problem we created and used dedcated FORK cells to support fanout Slac matchng. Specal desgn consderatons had to be taen to balance the ppelnes n the asynchronous desgn to avod ppelne stallng or starvaton. Another aspect of slac matchng, whch we have not yet taen full advantage of, s the fact that the buffer cells are generally much faster than other logc (e.g., fulls) cells. Ths allows us to use shorter buffer chans for delayng sgnals, because these chans can absorb stalled toens whle not degradng overall performance [5][8] Incorporatng FORKs nsde cells. The use of dedcated FORK cells creates addtonal ppelne stages, whch can ncrease the number of slac matchng buffers needed.. In order to mtgate ths problem we decded to ncorporate the FORK nsde some of the logc cells. For example consder a full cell wth only sum output (FA_S). In order to dstrbute the sum output to two cells, conventonally we used a dedcated FORK cell after at FA_S output. Instead, we created a specal cell, full wth two sum outputs (FA_2S) by copyng the sum output toen nternally, thus havng two sum output channels. In addton to reducng the need for slac cells, the new full cell (FA_2S) s 45% less area than the combnaton of FA_S and an external FORK. The same concept can be appled to the other logc cells also SLACK2 and SLACK4 cells. We observed that that most of our ntal desgn was domnated by SSTFB Buffer cells for slac matchng. Most of the tmes these buffer cells exsted n lnear chans.e. one buffer cell drvng another buffer cell as shown n Fgure 12. In order to save area, we decded to create two types of specal cells SLACK2 and SLACK4 cells whch are functonally equvalent to 2 SSTFB Buffers and 4 SSTFB Buffers n seres, respectvely. The transstor level dagram of a SLACK2 cell s shown n Fgure 13. L1 S0 S1 L0 A0 A0 B0 A0 B0 S1 M0 M1 S0 M1 X0 B0 X1 M0 A1 A1 B1 B1 X1 XO A1 R0 R1 Fgure 13. SLACK2 transstor level dagram In SLACK2 the nternal channel (M0-M1) s dynamcally drven by a POUT drver qute smlar to the output drver used n STFB cells [6]. Because they are short and nternal to the cell, they have mnmal potental crosstal nose due to couplng capactance and thus are relatvely safe. Moreover, because the nternal wres are short, we used mnmum transstor szes, savng both area and power. In partcular, SLACK2 and SLACK4 cells have 17% and 30% less layout area compared to 2 and 4 Buffers, respectvely. Moreover, pre-layout transstor level smulatons ndcate that SLACK2 cells and SLACK4 consume 10% and 19% less power compared to 2 Buffers and 4 Buffers, respectvely. R1 B1 R0 8

9 4.4. P&R and desgn results The desgn was place and routed usng Cadence SOC Encounter n a smlar fashon as the synchronous counterpart. Congeston based placement was performed and the routng was performed on the desgn usng Nanoroute. The fnal core has 70% utlzaton and the fnal area consumed by the logc s 6.92mm 2. Ths verson of the core s prelmnary because there s stll some slac matchng needed to acheve our target throughput and some lbrary development needed to remove all DRC volatons. However, the core s fully routed usng 6 metal layers showng that the desgn s routable. 5. Verfcaton and Smulaton Results Ths secton covers the desgn verfcaton of the asynchronous turbo and ts smulaton results Desgn Verfcaton The schematc of the desgn was converted nto a Verlog netlst and the smulaton was performed usng our Verlog models for the SSTFB cells n NC- Sm. Due to the sze of the desgn and the nstantaton of multple dentcal components, the verfcaton was performed n a bottom-up fashon. We started wth smple cells such as s and moved up the herarchy to ACS unts, state update nodes, branch metrc calculaton unts, the completon logc, and fnally the top-level module. For each module a set of vectors were generated that would test all corner cases of ts behavor and the results were verfed and cross referenced to the synchronous counterpart Post-Layout ECO and smulaton results To estmate the performance of the chp we smulated the 55K-transstor module that mplements Equatons 5 and 6. Ths module contans an 8-bt rpple carry chan of full s whch ncludes the crtcal cycle of the desgn. Ths module represents around 1/30 th of the complete desgn but s the most computatonally ntensve module n the SISO. To mprove the performance of the desgn we added SSTFB Buffers on long wres usng the ECO flow n SOC Encounter. The addton of the buffers ncreased the modules throughput by 26% and ncreased the utlzaton factor from 70% to 76% but otherwse dd not mpact area. The fnal layout was extracted usng Assura RC n coupled mode and the crcut was smulated usng Nanosm, yeldng a throughput of 1.15GHz. 6. Prelmnary Comparsons The prevous two sectons descrbed the synchronous baselne and asynchronous desgns n detal and showed ther post P&R results. Ths secton compares the performance, area, throughput/area, energy and power/area results for the two desgns. Because the SSTFB standard cell lbrary has only one sze per cell, the number we present are conservatve and may mprove f we developed and used more szes per cell. Bloc Sze (bts) Async T (Mbps) Sync T (Mbps) M Sync area (mm 2 ) T/area rato Table 1. Throughput per area comparson 6.1. Performance estmates and comparson The frequency of the post P&R synchronous core s 475 MHz. From the post layout smulaton explaned n Secton 5.2 we expect the asynchronous core frequency to be approxmately 1.15GHz. Thus, we expect the asynchronous core to run 2.4 tmes faster then ts synchronous counterpart Area comparson The logc area of the synchronous and asynchronous cores are 2.46mm 2 and 6.92mm 2, respectvely. Both asynchronous and synchronous cores mplement the exact same functon wth the same degree of parallelsm. However snce the asynchronous core s 2.4 tmes faster and has smaller ppelne latency than the synchronous core, we must nstantate the synchronous core many tmes n order to match the throughput, as descrbed n Secton Substtutng the numbers n Equaton (8), we compute that for equvalent throughput wth 6 teratons and ppelne latences of 60 cycles for the synchronous desgn and 32 equvalent cycles for the asynchronous one, the 9

10 number of requred synchronous cores vares from 11 for a throughput of 418Mbps for bloc sze of 768bts to 3 for a throughput of 490Mbps for bloc sze of 4Kbts Throughput/area comparson Throughput/area s another mportant metrc for the comparson, snce t ndcates the performance n relaton to the area used to acheve t. We chose to use throughput/area nstead of ust area for comparable throughputs as a metrc for our comparsons. Ths s because the throughput that s achevable by each desgn s not exactly equal, so the rato comparson provdes a normalzed metrc that s farer. From Table 1 we can see for example that for a bloc sze of 1Kbts whch s a very common bloc sze used n wreless applcatons we obtan a throughput per area advantage of The advantages are even bgger for smaller bloc szes, and for bloc szes of 512 or smaller, the synchronous desgn cannot match the throughput of the asynchronous counterpart, regardless of the degree of parallelsm M. As the bloc sze ncreases, latency becomes less of a crtcal factor and the two desgns become more comparable. Bloc Sze (bts) Energy per bloc (sync) Energy per bloc (async) Rato E E E E E E E E Table 2. Energy per bloc comparson 6.4. Energy Comparsons From the post-layout spce smulaton the power consumed of the selected module s 0.53W. If we extrapolate the number we expect the power of the complete Tree SISO to be approxmately 15.5W. The power for a sngle synchronous core (M=1) s 1.72W. From Table 2 we can see that for smaller bloc szes we are more energy effcent than the synchronous desgn, but for larger bloc szes the synchronous desgn more effcent. It should be stated that the power calculaton for both desgns was performed for the worst case scenaro, namely wth the unts processng data at the maxmum rate. Even though the calculaton s based on pea power, we beleve that the numbers mght be conservatve, but the rato should be representatve of the relatve power consumpton of the two desgns. We have also not ncluded leaage power comparsons n the calculatons, but gven that the asynchronous desgn requres less area for the same throughput and leaage power s proportonal to the total area, we expect to have an advantage n respect to that aspect as well Desgn Tme Comparsons Another mportant metrc to be compared s desgn tme. The desgn of synchronous turbo decoder was done usng front end vews of the standard cell lbrary provded by Artsan. The man steps nvolved were desgn conceptualzaton, schematc entry of the desgn, verfcaton of the desgn, Placement and Routng and fnally tmng enclosure of the desgn usng PrmeTme. Ths desgn effort too approxmately 3-4 graduatestudent months. The desgn of the asynchronous counterpart nvolved almost the same steps wth two notceable dfference. Frst, we had to create our own standard cell lbrary. Second, due to the lac of statc tmng analyss tools, tmng verfcaton and closure was performed usng tme-consumng Nanosm smulatons. The creaton of standard cell lbrary too 5 graduate-student months and the tmng closure and verfcaton too approxmately 10 graduate-student months. Also we want to mae clear that the asynchronous desgn s not yet complete as there remans some DRC errors to be removed n the fnal layout. 7. Summary and conclusons Our results demonstrate that SSTFB asynchronous turbo decoder s benefcal for small to medum bloc szes. Prelmnary comparsons show that the asynchronous turbo decoder can offer more than 2X advantage n throughput per area for bloc szes of 1K bts or less and smaller energy per bloc for bloc szes of 768 bts or less. Thus the asynchronous desgn s partcularly useful n low latency wreless applcatons n whch bloc sze must be small. More generally, ths desgn experment demonstrates the potental benefts of hgh-performance low-latency asynchronous lbrares and standard-cell desgn flows for processng ntensve applcatons. However, ths chp desgn also motvates a number of areas of future wor. The current SSTFB lbrary has only one sze per cell. Whle ths s suffcent to acheve hgh performance, multple szes for each cell can sgnfcantly reduce the overall capactance and power consumpton. In addton, 64% of the cell nstances n 10

11 the Tree SISO desgn are dual-ral buffers for slac matchng. If these are replaced by 1-of-4 or 1-of-8 buffers (that have less swtchng actvty per bt), sgnfcant reductons n power consumpton s lely. Fnally, n order to obtan our target performance we must perform an ECO slac-matchng flow on the entre desgn n order to mtgate the performance degradaton of long wres. Ths process s currently manual, however we hope to automate ths process wth a CAD tool that uses the slac matchng technque descrbed n [5] and the bac-annotaton flow descrbed n [3]. Acnowledgements We would le to than the anonymous revewers for helpful comments on earler drafts of ths wor. Ths wor was partally supported by NSF ITR Award No. CCR References [1] M. Ferrett and P. A. Beerel. Hgh Performance Asynchronous Desgn Usng Sngle-Trac Full-Buffer Standard Cells, IEEE Journal of Sold-State Crcuts, Vol. 41, No. 6, pp , June [2] M. Ferrett and P. A. Beerel. Sngle-Trac Asynchronous Ppelne Templates usng 1-of-N Encodng, DATE 02, Mar [3] P. Golan and P. A. Beerel. Bac-Annotaton n Hgh- Speed Asynchronous Desgn, Journal of Low power Electroncs, Vol. 2, pp , [4] P. Golan and P. A. Beerel. Hgh Speed Nose Robust Asynchronous Crcuts, ISVLSI 06, March, Pages: [5] P. A Beerel, A. Lnes, M. Daves, N.-H. Km. Slac matchng asynchronous desgns. ASYNC 06, March [6] M. Ferrett. Sngle-trac Asynchronous Ppelne Template, Ph.D. Thess, Unversty of Southern Calforna, Aug., [7] P. A. Beerel and K. M. Chugg. A Low Latency SISO wth Applcaton to Broadband Turbo Decodng, IEEE Journal on Selected Areas n Communcatons, Vol. 19 Issue 5, May, [8] R. Manohar and A. J. Martn. Slac Elastcty n Concurrent Computng. Proceedngs of the Fourth Internatonal Conference on the Mathematcs of Program Constructon, Lecture Notes n Computer Scence 1422, pp , Sprnger-Verlag [9] A. J. Martn, A. Lnes, R. Manohar, M. Nystrom, P. Penzes, R. Southworth, U. Cummngs and T. K. Lee. The Desgn of an Asynchronous MIPS R3000 Mcroprocessor. ARVLSI 97, [10] M. Nyström, E. Ou, A. J. Martn. An Eght-bt Dvder Implemented n Asynchronous Pulse Logc. ASYNC 04, Aprl [11] K. van Berel, and A. Bn. Sngle-Trac Handshae Sgnalng wth Applcaton to Mcroppelnes and Handshae Crcuts, ASYNC 06, pp , [12] I. E. Sutherland and S. Farbans. GasP: a Mnmal FIFO Control. ASYNC 01, pp , March [13] K. M. Chugg, A. Anastasopoulos, and X. Chen. Iteratve Detecton: Adaptvty, Complexty Reducton, and Applcatons. Kluwer Academc Press, [14] G. Masera, G. Pccnn M. Ruo Roch, and M. Zambon. VLSI Archtectures for Turbo Codes. IEEE Transactons on Very Large Scale Integraton (VLSI) Systems, Vol. 7, No. 3, September 1999 [15] P. Thennvboon and K. M. Chugg. A Low-Latency SISO va Message Passng on a Bnary Tree, Allerton Conf., Urbana, IL, Oct [16] A. J. Vterb. An Intutve Justfcaton and a Smplfed Implementaton of the MAP Decoder for Convolutonal Codes, IEEE Journal on Selected Areas n Communcatons, Vol. 16, No. 2, February [17] R. Dobn and M. Peleg. Parallel Interleaver Desgn and VLSI Archtecture for Low Latency Map Turbo Decoders, IEEE Transactons on VLSI Systems, Aprl [18] IEEE Worng Groups Web Page. [19] B. Bougard, A. Gulett L.Van der Perre, F. Catthoor. A Class of Power Effcent VLSI Archtectures for Hgh Speed Turbo-Decodng, IEEE Global Telecommuncatons Conference, Vol. 1, pp , 2002 [20] D. Chnery and K. Keutzer, Closng the Gap between ASIC & Custom. Tools and Technques for Hgh Performance ASIC desgn. Kluwer Academc Publshers, ISBN [21] K. M. Chugg, P. Thennvboon, G.D. Dmou, P. Gray, and J. Melzer, A new class of turbo-le codes wth unversally good performance and hgh-speed decodng, n IEEE Mltary Communcatons Conference, Vol. 5, pp , Oct

Parallelism for Nested Loops with Non-uniform and Flow Dependences

Parallelism for Nested Loops with Non-uniform and Flow Dependences Parallelsm for Nested Loops wth Non-unform and Flow Dependences Sam-Jn Jeong Dept. of Informaton & Communcaton Engneerng, Cheonan Unversty, 5, Anseo-dong, Cheonan, Chungnam, 330-80, Korea. seong@cheonan.ac.kr