A Fast Parallel Reed-Solomon Decoder On a Reconfigurable Architecture

Size: px

Start display at page:

Download "A Fast Parallel Reed-Solomon Decoder On a Reconfigurable Architecture"

Alisha Dean
6 years ago
Views:

1 A Fast Parallel Reed-Solomon Decoder On a Reconfgurable Archtecture Arezou Kooh, Nader Bagherzadeh, Chengz Pan EECS Department EECS Department EECS Department Unversty of Calforna Irvne Unversty of Calforna, Irvne Unversty of Calforna, Irvne akooh@ece.uc.edu Nader@ece.uc.edu panc@uc.edu ABSTRACT Ths paper presents a software mplementaton of a very fast parallel Reed-Solomon decoder on the second generaton of MorphoSys reconfgurable computaton platform, whch s targetng on streamed applcatons such as multmeda and DSP. Numerous modfcatons of the frst-generaton of the archtecture have made a scalable computaton and communcaton ntensve archtecture capable of extractng parallelsms of fne gran n nstructon level. Many algorthms and the whole Dgtal Vdeo Broadcastng base-band recever as well, have been mapped onto the second archtecture wth mpressng performance. The mappng of a Reed-Solomon decoder proposed n ths paper hghly parallelzes all of ts sub-algorthms, ncludng Syndrome Computaton, Berlekamp Algorthm, Chen Search, and Error Value Computaton, n a SIMD fashon. The mappng s tested on a cycle-accurate smulator, Mulate, and the performance s encouragngly better than other archtectures. The decodng speed of the RS (255,239,16) decoder usng two dfferent methods of GF multplcaton can be 1.319Gbps and 2.534Gbps, respectvely. Furthermore, snce there s no functonalty specfcally talored to Reed-Solomon decoder, the result has demonstrated the capablty of MorphoSys archtecture to extractng Instructon Level Parallelsm from streamed applcatons. Categores and Subject Descrptors C.1.2 [Computer System Organzaton]: Processor Archtectures, Multple Data Stream Archtectures (Multprocessors), Sngle-nstructon-stream, Multple-Data - Stream Processors General Terms Algorthms, Performance, Desgn, Expermentaton Keywords: Reconfgurable Archtecture, SIMD Processor, Reed_Solomon codes, Berlekamp Algorthm, Chen Search 1. INTRODUCTION Toward a comng bllon-transstor era, today s computaton platform desgn has already foreseen the end of the road for conventonal mcro-archtectures [4], and numerous new Permsson to make dgtal or hard copes of all or part of ths work for personal or classroom use s granted wthout fee provded that copes are not made or dstrbuted for proft or commercal advantage and that copes bear ths notce and the full ctaton on the frst page. To copy otherwse, or republsh, to post on servers or to redstrbute to lsts, requres pror specfc permsson and/or a fee. CODES+ISSS 03, October 1 3, 2003, Newport Beach, Calforna, USA. Copyrght 2003 ACM /03/0010 $5.00. approaches have arsen above the horzon, such as EPIC (Itanum 2) [5], RAW [6], Imagne [7], and VIRAM [8], etc. Many of them target on stream applcatons, whch have already been consumng more than 90% of total computng cycles nowadays [7]. The bggest challenge of archtecture desgn s the scalablty, only wth whch can one follow up the step of Moore s Law. The dffculty of scalablty s mposed by slower decrease of wre transmsson delay than that of transstor swtchng delay. Ths dscrepancy requres a new phlosophy on desgn of scalar operand network [9] and memory herarchy. In ths paper, we wll ntroduce the 2 nd -generaton MorphoSys reconfgurable archtecture called M2, a computaton and communcaton ntensve platform capable of extractng fne gran parallelsms at the nstructon level. Many algorthms and the whole Dgtal Vdeo Broadcastng base-band recever as well, have been mapped onto M2 wth mpressng performance. The mappng of a Reed-Solomon decoder s proposed n ths paper. RS codes are powerful block codes wdely used as an error correcton method n the areas such as dgtal communcaton, dgtal dsc error correcton, etc. Recently, concatenated codes made up of convolutonal codes followed by RS codes have been proved as an effcent way of error correcton n wreless data communcaton systems. We present M2 Archtecture at the frst secton of the paper, ncludng the archtecture dfferent parts, M2 mplementaton and ts programmng model. Secton 3 ntroduces the RS decodng Algorthms and the mappng procedure on the M2. The secton also ncludes the M2 archtecture feasblty for dfferent parts of the Algorthms. In secton 4 we compare the result of the RS (255,239,16) decoder wth the exstng benchmarks ncludng the TI64x and dfferent Ascs. The decodng speed of the RS (255,239,16) decoder usng two dfferent methods of GF multplcaton can be 1.319Gbps and 2.534Gbps, respectvely 2. MORPHOSYS RECONFIGURABLE ARCHITECTURE 2.1. Introducton of MorphoSys MorphoSys s a reconfgurable computaton platform targetng computaton ntensve data parallel applcatons, ncludng streamed applcatons. M1 [1-2], the frst prototype of MorphoSys, has been used as a platform for many applcatons such as multmeda, wreless communcaton, sgnal processng, and computer graphcs. M2, the 2 nd generaton of MorphoSys, follows the basc concepts of MorphoSys; however, t s redesgned n both scalar operand network and memory herarchy, thus greatly enhanced n performance. Feedbacks from numerous 59

2 kernel and system mappngs ponted out M1 s bottlenecks, whch have been revsted n M2. M2 archtecture conssts of three man subsystems: a core processor called TnyRISC, an array of 64 Reconfgurable Cells (RCs) organzed n SIMD fashon, and a specal data movement unt called Frame Buffer (FB). The programmng model s smple. TnyRISC takes charge of the whole Data Control Flow (DCF); RC array and Frame Buffer are only trggered by TnyRISC and executng on ther own confguratons (called context) contnuously for a gven number of cycles specfed by TnyRISC. The man dfference between M1 and M2 s descrbed n [3]. Fgure 1 shows the MorphoSys dagram. Man Memory can be ether on-chp or off-chp wthout sgnfcant dfference of the connecton nterface, as long as Man Memory s also composed of 64 banks M2 mplementaton The followng table-1 gves out the characterstcs of M2 and compares t wth the other processors. Results of frst order smulaton usng Synopss tools show that the crtcal-path delay s about 1.8~2.3ns. Hence, 450 MHz s used n our smulaton. Other data are ether based on M1 mplementaton and M2 postsynthess (current status), or projected as our desgn am, whch are chosen no more aggressve than commercal processors. Though ndependently desgned, M2 combnes the structural advantages of three mportant archtecture parameters: the overall structure of host-processor/slave-computaton-fabrc, the controlled sze of dstrbuted memory wthn each AlU clusters and the powerful sequental code, whch s hard to be parallelzed. Second, the modest sze of dstrbuted memory enables more AlUs to be ntegrated n the array and, more mportantly, reduces the latency of scalar operand network, whch s essental to extract to adopt wde range of optmzaton technques, both coarse-gran and fne-gran, and avods usage of hardware-specfc languages as StreamC. Man Memory Frame Buffer DMA TnyRISC RC Array Context Memory Fgure 1. MorphoSys Dagram Table 1. Comparson of M2 and other archtectures* the latency s analyzed n 5-tuple <send occupancy, send latency, network hop latency, receve latency, receve occupancy> [9]. VIRAM Imagne RAW M2 (wth/wthout on-chp memory) Parallelsm Capablty Scalar Operand Network Memory Herarchy Parallelsm Vector SIMD MIMD SIMD model Peak 16-bt OPS 6.4G 23.7G 3.6G (32-bt) 28.8G Clock Speed 200MHz 296MHz 225MHz 450MHz Chp area 15*18mm 2 12*12mm 2 18*18mm 2 8*8mm 2 /16*16mm 2 # Transstors 120M 21M 122M 20M/120M Power 2W(averag e) 4W 25W 4W(peak MAC) Technology 0.18µm 0.15µm 0.15µm 0.13µm # Network 8(Banks) nodes Total bandwdth 51.2Gbps 75.8Gbps 922Gbps 922Gbps Latency * <0,1,1,1,0> <0,1,1,1,0> <0,1,1~6,1,0> <0,0,1~2,0,0> Full permutaton No Yes Yes Yes 1 st level sze 64Kb(VRF 96Kb(LRF) 16.4Kb(RF) 16.4Kb(RF) ) 2 nd level sze 104Mb(DR 1Mb(SRF) 16Mb(Local 2Mb(Local Mem.) AM) Mem.) 3 rd level sze Off-chp Off-chp Off-chp 16Mb/off-chp 1 st level 205Gbps 758Gbps 230Gbps 1382Gbps bandwdth 2 nd level 51.2Gbps 829Gbps ** 334Gbps 461Gbps bandwdth 3 rd level N/A 12.4Gbps 200Gbps 115Gbps/56.7Gbps bandwdth 60

3 2.3. Programmng Model One potental drawback of M2 s the SIMD model of RC array. However, mappng experence on MorhpoSys, VIRAM, Imagne, and other SIMD extenson of general-purpose processor lke Pentum MMX, all show that SIMD s suffcent for streamed multmeda applcatons, not mentonng the much smaller code sze than that of MIMD model. For example, mappng of DVB-T system on M2 does not need MIMD generalty. Although SPMD feature mght make mappng of MPEG-2 decoder easer, other subsystems are applyng mostly affne array access and small porton of non-affne but statc access nvolvng random communcaton, whch can be done effcently by the scalar operand network. Snce SIMD programmng has long been a matured feld n general, t s reasonable to suppose that an effcent C compler for streamed applcatons s hghly possble, wth some prevous experence already [10-11].Moreover, parallel programmng n M2 can adopt doman decomposton or functon decomposton, the former more straghtforward, whle Imagne has to stck wth cumbersome functon decomposton often experencng load-unbalanced problem. The followng fgure llustrates a real code segment for The Berelekamp algorthm operaton whch also shows programmng model of MorphoSys. Sequental functons Data and Computaton Parallel (RC Array) functons CONTROL + DATA CONTROL FLOW DATA PATH TnyRISC assembly sub ldfb $1, $15,$15,3 $2, 8, 511 Ldrcex $4, 1, 0, 1, 0, 41 ldfb $2,0,0,256 ldl $10,511 Ldrcex $5, 1, 1, 42 add $2,0,0,256 ldctxt $1, 0, 80 delay4: add $4, $4, 64 ldw sub $10,$10,4 $1, 0, 0, 0, add + $5, cbcast $5, 64 0, 0, 0, 0 jal nop 0, 2, 0, 1, ldrcex sbcb $4, 1, 0, 2, 1, 1, 0, lu brle $0,$10,delay4 ldrcex $5, 1, 0, 1, 0, 44 5 dbcbc $1, 1, 4, 64 addw nop add $4, $4, 64 0, 0, 0, 1, stfb $6, 1, 0, 16 wfb 0, 0, 1, 32 Frame Buffer DMAC TnyRISC RC Array RC context set 0: 0,1 ADD KEEP R0 I H1 I ; > R6 R0; set 1: 1,1 SUB CLOAD!0x004 H0 R0 > R6 I I >2 R0; ; set 2: 2,1 ADD ADD R0 E H3 r0 LSL > R6 2 >0 R0; WE; set 3: 3,1 SUB ADD H2 XQ R0 r0 > LSL R6 2 R0; >0; set 4: 4,1 ADD SUB R0 XQ H5 r0 > LSL R6 2 R0; >0; set 5: 5,1 SUB SUB H4 E R0 r0 LSL > R6 2 >0 R0; WE; 6: ADD R0 H7 > R6 R0; set 6,1 CLOAD!0x004 I I >2 ; 7: SUB H6 R0 > R6 R0; set 7,1 KEEP I I ; Context Memory Fgure 2- Programmng Model n MorphoSys 2.4. Scalablty There s a msunderstandng n many places, that scalablty s equal to even dstrbuton of resources. Actually, ths s not true n large scale. For example, as the tle number of RAW processor grows, the scalar operand network latency ncreases, and more severely, traffc load per tle ncreases, whch wll fnally destroy the scalablty. In order to avod ths load ncrease, one would expect to keep the atomc sze of 8X8 basc array and nstead use an ncremental level of nter-array communcaton network accommodatng ths ncremental load. Ths leads to a herarchcal scalng behavor, n each level of whch there are both dstrbuted and centralzed recourses. The real bllon-transstor archtecture comng around year 2007 can be made up of 8 M2 together wth the counterpart of TnyRISC and operand network at a hgh level.3 Mappng of Reed-Solomon Decodng Algorthm 3. MAPPING OF REED-SOLOMON DECODING ALGORITHM 3.1 Reed-Solomon Decodng Algorthm [14] In the RS (n,k) code, n s the block length, k s the transmtted code word length and n-k s the mnmum dstance over GF ( 2 n ). Decoder s capable of correctng ( n k) t = error n 2 the receved code word. RS decoder ntroduced n ths paper s developed for the SIMD reconfgurable processor. It s necessary to fnd a RS decodng algorthm, whch makes the parallel processng possble. Let s consder v(x) s the transmtted code word and r(x) s the receved code word. Then the error added by channel s e( x) = r( x) v( x) (1) Consderng ν errors are occurred n the transmtted code word the syndromes are computed as follows: S = v j n υ j j j j k j ( α ) = r( α ) + e( α ) = ek ( α ) = e X l l k= 0 Eq (3) shows the key equaton to fnd the error locatons and error magntudes. 2t ( x) S( x) = Γ( x) mod x l= 1 (2) Λ (3) ( x) Λ s the error locaton polynomal and ts roots are the nverses of the error locatons. It can be presented as follows: Λ ν υ ( ) = 1+ 1x ν x = ( 1 xα l ) x (4) l= 1 After calculatng the error locaton polynomal roots the error value correspondng to each error locaton can be easly calculated usng the Eq (5) e l ( ) ( ) l Γ ( ) ( α ) l l Λ ( α ) = α (5) The Berlekamp algorthm [14] s used to fnd the error locaton polynomal for the SIMD mplementaton of ths decoder GF Multplcaton and Addton Insde Each RC GF multplcaton s the basc operaton should be consdered n the mplementaton of RS decoder. There are two ways of mplementng GF multplcaton nsde each RC. The frst method s usng look up tables for convertng the vector representaton to the power representaton. Beng able to change the power representaton to ts vector, we can consder GF elements n the whole decodng procedure as ther vector representatons. The GF multplcaton wll be smply the addton of the two powers. After addton, Mod 255 of the result should be computed. We avod ths tme consumng computaton by storng two look up tables contnuously nstead of one for convertng the power to vector. the second look up table wll be the vector correspondng for the power of 255 to 512. Ths method works f we just have the multplcaton of two GF elements and wll convert the result to ts vector presentaton soon after that. Ths s the case n 61

4 mplementng the GF Multplcaton. Total applcaton s manly conssts of GF polynomal computaton whch s the GF addtons after each GF multplcaton. In order to make each RC the computatonal cell for the GF addton and multplcaton, we are takng advantage of the local memory nsde each RC. These memores are specally mplemented for storng look up tables, whch are extremely, used n the data streamed applcatons. The second method comes from the feasblty of mplementng the GF multpler hardware nsde each RC usng the fne gran block. Usng ths way there s no need to know about the power presentaton of GF elements and all the computaton wll go on usng the vector presentaton. Usng the GF multpler nstead of the look up tables wll speed up total applcaton, as now we are able to multply two GF elements and get the result n one clock cycle. B0-B7 B8-B15 B16-B23 B24-B31 B32-B39 B40-B47 B48-B55 B56-B63 D0 D1 D2 D3 D4 D5 D6 D7 Forney s method stage and wll be used to calculate the error values Syndrome Computaton The frst step s the Syndrome calculaton out of the receved code word. Receved code word s saved n the frame buffer and wll be transmtted to each RC.Each RC computes two Syndromes n parallel wth the other RCs of the row. Ths scheme s generalzed dependng on the number of Syndromes should be computed. Ths has been made easy snce there s one bank of the Frame buffer correspondng to each RC. So we can easly scale the Syndrome numbers transmttng the data from frame buffer s bank to each RC. S 1 S 8 1 T 1 1? r 2 == 0 S 2 S 10 β3 S 3 S 11 β3 Syndrom Computaton S 4 S 12 Chen Search β4 S 5 S 13 β5 S 64 S Berlekamp Algorthm 2 3 S 4 5 S 46 T 2 T 3 ST 12 4 T 5 ST 12 6 β6 1 S 7 S 15 7 T 7 β7 S 8 S 16 8 T 8 β8 Fgure 3. context broadcast to RC columns from the context memores of each column Forney's Method 3.3. Developng the Parallel Algorthm of the RS Decoder 1 δ 1 2 δ 2 3 δ 3 S 4 Sδ δ 5 S 46 Sδ δ 7 8 δ 8 In order to acheve the hghest performance of the decoder mplemented on the Morphosys, We should develop the ntroduced decodng algorthm for the SIMD processor. Fg 3 shows the parallelsm exploted at the data level for the RS (255, 239,16). Each row of RC has been reconfgured for decodng one block of the data. 8 blocks of data wll be decoded n parallel wth each other. Ths wll decrease the throughput to 1/8 of the total tme needed for decodng 8 blocks n parallel. Fg 4 shows the task-flow dagram of the decoder. Each RC holds one coeffcent of the error locaton polynomal, error magntude polynomal and error correcton polynomal. Each RC s the basc operatonal block for GF addton and multplcaton. In the fgure S Stands for the dfferent syndromes computed durng the syndrome computaton. and T s are representng the error locaton and error correcton polynomals coeffcents respectvely whch are computed durng the Berelekamp algorthm. In the Chen search part β stands for the GF (255) element, whch should be substtuted nto the error locaton polynomal, to fnd the root of the error locaton polynomal. δ presents a coeffcent of the error magntude polynomal whch s computed durng the Fgure 4. Task dagram of the decodng procedure n the tme doman Berlekamp Algorthm Berlekamp Algorthm s consstng of three man parts. Dscrepancy computaton, calculaton of the new error locaton polynomal and error correcton polynomal. These calculatons are bascally made of addton of two polynomals together or multplyng one of them by the GF element. Addton has been made possble by storng the coeffcents of the same order n the same RC. Addton of the polynomals wll be addng of these two coeffcents together n parallel n all RCs. In order to compute the dscrepancy we need to take advantage of the data movement, between RCs. Ths data movement s possble n the same clock cycle as the multplcaton happens usng that element from the other RC. Ths s depcted n Fg 5.Takng advantage of the data movement between RCs s lke ncreasng the regster fle of each RC to the ones from the RCs of the same row and same column. Ths s the strong tool to enhance the level of parallelsm of the algorthm. Each RC can use the results of the other computaton, as t has been stored n ts regster fle. 62

5 Chen Search Chen search can be done n parallel n the 7 RCs of the row. The frst RC wll check the result for the possble error detectons. Each RC s responsble for tryng 33 GF( 2 8 ) elements. The whole search has been done n parallel n the 7 RCs. FB wll be used as a secondary memory here to store the elements of the GF ( 2 8 ). Each bank wll keep the necessary elements and ther power, snce we need up to 8 power of each element to reduce the computaton tme. In fact memory herarchy s one of the mportant feature of the Morphosys. We can take advantage of the dfferent parameter storage n the dfferent levels of the memory Error Value computaton Error value calculaton wll be performed sequentally for the dfferent error locatons. Ths way we take advantage of parallel GF addton and multplcaton n dfferent RCs. Fg 6 shows dfferent terms calculaton of the error values nsde each RC. s the dscrepancy calculated durng the Berlekamp algorthm part and S 1 S 2 s are the error locaton coeffcents. S 3 S S 7 s 6 S 5 S 4 S 3 S 2 1 S CONCLUSION AND COMPARISON MorphoSys as a reconfgurable archtecture provdes a very flexble platform for mplementng dfferent DSP applcatons. Ths s perfectly demonstrated through the RS decoder mplementaton wth dfferent error correctng length. Beng able to reconfgure each RC for the dfferent parts of the wreless communcaton recever wll provde the best functonalty of each RC for the whole applcaton. TI DSPs have the same approach to perform GF multplcaton. In comparson Morphosys takes advantage of the hgher data parallelsm obtaned through the SIMD characterstcs of t. DSP C6400 provdes the hardware support for performng the Galos Feld multples. In the absence of hardware to effectvely perform Galos feld math, prevous DSP mplementatons made use of logarthms to perform multplcaton. Ths has decreased performance to ¼ of the present one wth hardware GF multplcaton. Many dfferent Reed-Solomon ASIC Archtectures are proposed n the lterature [12-13]. These blocks are mostly specfed and optmzed for the RS decoder. They can be consdered as a sngle chp RS decoder. In comparson Morphosys can be vewed as the programmable archtecture optmzed for the whole recever Morphosys outperforms Ascs, whch have the smaller sze. We have smulated the result on the Morphosys hardware smulator called Mulate. The result s gven for the RS decoder of (255,239,16) usng two dfferent methods of GF multplcaton. The decodng speed s 1.319Gbps and 2.534Gbps, respectvely. Fg 7 shows the bt rate of the decoder and ts comparson wth the dfferent Ascs and Also TI64x 2.534G bps S 5 s 6 S 7 824M 517M 320M 1.319G Fgure 5. Data movement between RCs n order to calculate dfferent terms of the Dscrepancy n the dfferent RC Coc5128 Amphon TI-C64x Xlnx Decoder Morphosys1 Morphosys2 Fgure7. Bt rate comparson of two decoders mplemented on Morphosys wth the other Ascs and TIC64x k 6 7 k 8 Fgure 6. Dfferent terms calculaton of the error values 63

6 References: [1] H. Sngh, Lee, Lu, Bagherzadeh, Kurdah, MorphoSys: An Integrated Reconfgurable System for Data-Parallel and Computaton Intensve Applcatons, IEEE Transactons on Computers, vol. 49, No. 5, pp , May [2] Lee, Sngh, Lu, Bagherzadeh, Kurdah, Desgn and Implementaton of the MorphoSys Reconfgurable Computng Processor, Journal of VLSI Sgnal Processng Systems for Sgnal, Image, and Vdeo Technology, vol. 24, No. 2-3, Kluwer Academc Publshers, pp , Mar [3] Pan, Kamalzad, Kooh, Bagherzadeh, Desgn and Analyss of a Programmable Sngle-Chp Archtecture for DVB-T Base-Band Recever, To Appear n DATE [4] Agarwal, Hrshkesh, Keckler, Burger, Clock rate versus IPC: the end of the road for conventonal mcroarchtectures, Computer Archtecture, Proceedngs of the 27th Internatonal Symposum on, 2000 Page(s): [5] Schlansker, Rau, EPIC: Explctly Parallel Instructon Computng, Computer, Volume: 33 Issue: 2, Feb 2000 Page(s): [6] Taylor, et.al. The Raw mcroprocessor: a computatonal fabrc for software crcuts and general-purpose programs, Mcro, IEEE, Volume: 22 Issue: 2, Mar/Apr 2002, Page(s): [7] Rxner, et.al. A bandwdth-effcent archtecture for meda processng, Mcroarchtecture, MICRO-31. Proceedngs. 31st Annual ACM/IEEE Internatonal Symposum on, 30 Nov-2 Dec 1998 [8] Kozyraks, Patterson, Vector vs. superscalar and VLIW archtectures for embedded multmeda benchmarks, Mcroarchtecture, (MICRO-35). Proceedngs. 35th Annual IEEE/ACM Internatonal Symposum on, 2002 Page(s): [9] Taylor, Lee, Amarasnghe, Agarwal, Scalar Operand Networks: On-chp Interconnect for ILP n Parttoned Archtectures, HPCA 2003, February [10] Venkataraman, Najjar, Kurdah, Bagherzadeh, Bohm, A Compler Framework for Mappng Applcatons to a Coarsegraned Reconfgurable Computer Archtecture, CASES 2001, Atlanta, GA, November [11] Maestre, Kurdah, Bagherzadeh, Sngh, Hermda, Fernandez, Kernel schedulng n reconfgurable computng, DATE [12] Implementaton of hgh speed Reed-Solomon decoder. [Conference Paper] 42nd Mdwest Symposum on Crcuts and Systems (Cat. No.99CH36356). IEEE. Part vol. 2, 2000, pp vol. 2. Pscataway, NJ, USA. [13] Martna M, Masera G, Pccnn G, Vacca F, Zambon M. VLSI Reed Solomon decoder archtecture for networked multmeda applcatons. [Conference Paper] Proceedngs 14th Annual IEEE Internatonal ASIC/SOC Conference (IEEE Cat. No.01TH8558). IEEE. 2001, pp Pscataway, NJ,USASystems II-Analog & Dgtal Sgnal Processng, vol.47, no.11, Nov. 2000, pp Publsher: IEEE, USA. [14] Shu LIN/ Danel J Costello Error Control Codng Fundamentals and applcatons [15] [16] [17] 64

Parallelism for Nested Loops with Non-uniform and Flow Dependences

Parallelism for Nested Loops with Non-uniform and Flow Dependences Parallelsm for Nested Loops wth Non-unform and Flow Dependences Sam-Jn Jeong Dept. of Informaton & Communcaton Engneerng, Cheonan Unversty, 5, Anseo-dong, Cheonan, Chungnam, 330-80, Korea. seong@cheonan.ac.kr