DESIGN AND ANALYSIS OF LDPC DECODERS FOR SOFTWARE DEFINED RADIO

Sangwon Seo, Trevor Mudge
Advanced Computer Architecture Laboratory
University of Michigan at Ann Arbor
{swseo, tnm}@umich.edu

Yuming Zhu, Chaitali Chakrabarti
Department of Electrical Engineering
Arizona State University
{yuming, chaitali}@asu.edu

ABSTRACT

Low Density Parity Check (LDPC) codes are one of the most promising error correction codes that are being adopted by many wireless standards. This paper presents a case study for a scalable LDPC decoder supporting multiple code rates and multiple block sizes on a software defined radio (SDR) platform. Since technology scaling alone is not sufficient for current SDR architectures to meet the requirements of the next generation wireless standards, this paper presents three techniques to improve the throughput performance. The techniques are use of data path accelerators, addition of memory units and addition of a few assembly instructions. The proposed LDPC decoder implementation achieved 30.4 Mbps decoding throughput for the n=2304 and R=5/6 LDPC code outlined in the IEEE 802.16e standard.

Index Terms: LDPC, Min-sum iterative decoding, SDR, SODA

1. INTRODUCTION

Low density parity check (LDPC) codes have excellent error correction performance that approaches the Shannon capacity limit [1], [2]. As a result, they have been adopted in many current and next generation wireless protocols such as DVB-S2 and the IEEE 802.16e standard (WiMAX). Decoders used for LDPC codes have high throughput requirements and have been successfully implemented using ASICs and FPGAs [3]. However, the emergence of a wide variety of rapidly changing wireless protocols makes custom hardware for these decoders relatively time consuming and expensive to develop. Software Defined Radio (SDR) is a programmable hardware platform capable of supporting software implementations of wireless communication protocols for physical layers [4].
This paper presents a case study for an LDPC decoder implementation that supports multiple code rates and multiple block sizes on an SDR platform, SODA (Signal-processing On-Demand Architecture). SODA is a multiprocessor architecture, where each processor is equipped with a 32-wide SIMD (Single Instruction Multiple Data) pipeline, a scalar pipeline and scratchpad memories. When the LDPC matrix is represented by structured submatrices, the data-level parallelism can be efficiently handled by the SIMD pipeline. However, the current SODA architecture is unable to meet the high decoding throughput and the scalability requirements (multiple block sizes and multiple code rates) of the IEEE 802.16e standard. In this paper we present the use of data path accelerators, the addition of memory units and the addition of a few assembly instructions to address the throughput and scalability requirements. The proposed LDPC decoder implementation achieves 30.4 Mbps decoding throughput for the n=2304 and R=5/6 LDPC code outlined in the IEEE 802.16e standard.

The rest of the paper is organized as follows. Section 2 gives a brief overview of LDPC codes. Section 3 introduces SODA, the SIMD-based high-performance DSP processor for SDR, and the mapping of the LDPC decoder onto SODA. Section 4 describes the LDPC accelerators, memory controller/buffer organization and assembly support required for the high throughput scalable LDPC decoder implementation. Section 5 presents memory and throughput analysis of the augmented architecture. Section 6 concludes the paper.

2. LDPC BASICS

2.1. Introduction

An LDPC code is a linear block code whose codewords satisfy a set of linear parity-check constraints [1]. These constraints are typically defined by an m-by-n parity-check matrix H, whose m rows specify each of the m constraints (the number of parity checks), and n represents the length of a codeword. H is also characterized by $W_r$ and $W_c$, which represent the number of 1s in the rows and columns, respectively. An LDPC code can be represented by a bipartite graph, which consists of two types of nodes, Variable Nodes (VN) and Check Nodes (CN). Check node i is connected to variable node j whenever $h_{ij}$ of H is non-zero. Fig. 1 describes the matrix H and the corresponding bipartite graph of a simple LDPC code.

Fig. 1. LDPC matrix H and the corresponding bipartite graph

2.2. LDPC Decoding Process

LDPC codes are decoded iteratively using a message passing algorithm [1]. This algorithm involves exchanging the belief information among the variable nodes and check nodes that are connected by edges in the bipartite graph. Let $I_n$ be the intrinsic information from the received signal, $L_n$ be the reliability information for variable node n, $L_{n,m}$ be the information conveyed from variable node n to check node m, and $E_{n,m}$ be the extrinsic information generated in check node m that is passed to variable node n. The belief information is updated in an iterative manner and implemented in two phases. In the first phase, the variable nodes send their belief information, $L_{n,m}$, to the check nodes connected to them; in the second phase, the check nodes send the updated belief information (new $E_{n,m}$) to the variable nodes connected to them for updating $L_n$ (see Fig. 1). The iteration steps are summarized in Algorithm 1.

Algorithm 1: Min-sum LDPC Decoding Algorithm
1. Initialization: $E_{n,m} = 0$, $L_n = I_n$
2. VN to CN: $L_{n,m} = L_n - E_{n,m}$
3. Update $E_{n,m}$: $E_{n,m}^{new} = f(L_{n',m}),\ n' \in N(m)$
4. Update $L_n$: $L_n^{new} = L_{n,m} + E_{n,m}^{new}$
5. Repeat steps 2, 3 and 4 NUM times
6. Decide each bit based on the corresponding $L_n$ value

Here, N(m) is the set of variable nodes that are connected to check node m in the bipartite graph. Similarly, M(n) is the set of check nodes that are connected to variable node n. Theoretically, the LDPC decoding process finishes when all parity-check equations are satisfied. In practice, a predefined number of iterations (NUM) based on SNR is generally used. The decoding algorithms differ in how the function f in Step 3 of Algorithm 1 is evaluated.
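As an illustration only, Steps 1 to 6 of Algorithm 1 can be sketched in Python for a small parity-check matrix. The sketch assumes that f is the min-sum rule, that check nodes are processed one after another, and that a non-negative $L_n$ decodes to bit 0 (under this convention no leading minus sign appears in f). The function name `min_sum_decode`, the list-of-lists representation of H and the (7,4) Hamming example are assumptions made for this sketch, not part of the SODA implementation.

```python
def sign(x):
    """Return -1 for negative inputs and +1 otherwise."""
    return -1 if x < 0 else 1


def min_sum_decode(H, intrinsic, num_iters):
    """Sketch of Algorithm 1 with f taken to be the min-sum rule (assumption)."""
    m_rows, n_cols = len(H), len(H[0])
    # N(m): variable nodes connected to check node m.
    neighbors = [[n for n in range(n_cols) if H[m][n]] for m in range(m_rows)]
    L = list(intrinsic)                                   # Step 1: L_n = I_n
    E = [{n: 0.0 for n in neighbors[m]} for m in range(m_rows)]  # E_{n,m} = 0
    for _ in range(num_iters):                            # Step 5
        for m in range(m_rows):
            # Step 2: L_{n,m} = L_n - E_{n,m} for every n in N(m).
            L_nm = {n: L[n] - E[m][n] for n in neighbors[m]}
            for n in neighbors[m]:
                others = [L_nm[k] for k in neighbors[m] if k != n]
                s = 1
                for v in others:
                    s *= sign(v)
                # Step 3: sign product times minimum magnitude of the others.
                E[m][n] = s * min(abs(v) for v in others)
                # Step 4: L_n^new = L_{n,m} + E_{n,m}^new.
                L[n] = L_nm[n] + E[m][n]
    return [0 if v >= 0 else 1 for v in L]                # Step 6


# Hypothetical example: (7,4) Hamming parity checks, all-zero codeword sent,
# with one unreliable (negative) intrinsic value at position 2.
H = [
    [1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1],
]
intrinsic = [2.0, 1.5, -0.5, 2.5, 1.0, 1.8, 2.2]
decoded = min_sum_decode(H, intrinsic, num_iters=2)
print(decoded)
```

In this example the single negative intrinsic value is corrected and the decoder returns the all-zero codeword.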
There are three options for the LDPC iterative decoding algorithm: Belief Propagation (BP), λ-min and min-sum algorithms [5]. Although the BP and λ-min algorithms show better error correction performance than the min-sum algorithm, they require a look-up table for hyperbolic function values, which requires additional memory space. The min-sum algorithm is selected here because of the limited memory size and its simple computation patterns. The min-sum function f is shown below, where $n' \in N(m)$, $n' \neq n$.

$$E_{n,m}^{new} = -\Big(\prod_{n'} \mathrm{sign}(L_{n',m})\Big)\, \min_{n'} |L_{n',m}|$$

As can be seen, the operations in the min-sum LDPC decoding algorithm are limited to addition, subtraction and finding a minimum value, all of which can be supported by our SDR architecture described in Section 3.

2.3. LDPC Matrix Partition

Fig. 2. Partitioning of H into z-by-z cyclic identity matrices

An LDPC matrix H with randomly distributed 1s results in complex data routing, which is a major challenge for building a high-performance and low-power LDPC decoder. [3] and [6] show that introducing some structural regularity in the matrix does not degrade its error correction performance. Moreover, the regularity enables partially parallel implementation of LDPC decoders and has been utilized in the IEEE 802.16e standard. Fig. 2 shows the partitioning of H into z-by-z cyclic identity submatrices. Here, $I_x$ represents a cyclic identity matrix with rows shifted cyclically to the right by x positions. This characteristic reduces the routing overhead and has been exploited efficiently in our architecture. Fig. 2 also shows how the $n/z$ submatrices along a row can be grouped to form a block row. So, in essence, the H matrix can also be partitioned into $m/z$ block rows, each of size z-by-n.

3. SDR ARCHITECTURE

In this section, we present the SIMD-based SDR architecture, SODA [4]. This architecture was initially designed to support wireless protocols such as WCDMA and IEEE 802.11a.

3.1. SODA Overview

The SODA multiprocessor architecture is shown in Fig. 3.
It consists of multiple data processing elements (PEs), one control processor and a global scratchpad memory, all of which are connected through a shared bus. Each SODA PE consists of five major components:
1) a 32-way, 16-bit datapath SIMD pipeline for supporting vector operations. Each datapath includes one 16-bit ALU with multiplier and a 2 read-port, 1 write-port, 16-entry SIMD register file. Intra-processor data movements are supported through a SIMD Shuffle Network (SSN);
2) a 16-bit datapath scalar pipeline for sequential operations. The scalar pipeline executes in lock-step with the SIMD pipeline; SIMD-to-scalar and scalar-to-SIMD operations exchange data between the two pipelines;
3) two local scratchpad memories for the SIMD pipeline and the scalar pipeline;
4) an AGU (Address-Generation-Unit) pipeline for providing the addresses for local memory accesses; and
5) a programmable DMA (Direct-Memory-Access) unit to transfer data between scratchpad memories and to interface with the outside system (inter-processor data transfer).
The SIMD pipeline, the scalar pipeline and the AGU pipeline execute in a VLIW-styled lockstep manner, controlled with one program counter (PC) [4].

Fig. 3. SDR architecture: SODA [4]

3.2. LDPC on SODA

The min-sum LDPC decoding algorithm (Algorithm 1) is mapped onto SODA in the following way. Step 2 of Algorithm 1 is applied to non-zero z-by-z submatrices. However, because Step 3 uses the $L_{n,m}$ values related to check node m, the SIMD pipeline loads values of type $L_n$ and aligns the data in check node order by using the SSN before executing Step 2. The shuffled $L_{n,m}$ values for all non-zero z-by-z submatrices in one z-by-n block row are calculated in the SIMD datapath. After that, the SIMD-to-scalar unit is used for finding the minimum $E_{n,m}^{new}$ among the $W_r$ values of $L_{n,m}$ for the same check node m. Next, $E_{n,m}^{new}$ and the corresponding sign indicator are used to update an $L_n$ value (Step 4). This procedure implies that some SIMD slices execute additions and others execute subtractions based on sign values, a feature that is supported by predicated instructions in SODA. After updating the $L_n$ values, the data is inversely shuffled and stored in variable node order. This process is repeated for every z-by-n block row in every iteration.

4. SCALABLE LDPC IMPLEMENTATION

In this section, we study a scalable LDPC decoder implementation for a block size n, code rate R=k/n, and ($W_c$, $W_r$)-LDPC code as specified by the IEEE 802.16e standard on a SODA PE. We describe the enhancements that had to be made in terms of accelerators, memory units, and new assembly instructions to support multiple code rates and multiple block sizes. Fig. 4 shows the modified SIMD pipeline; the additional units are shown using shaded blocks.

Fig. 4. Modified SIMD pipeline in a SODA PE

4.1. LDPC Accelerator

In order to meet the high decoding throughput requirements, we introduce an LDPC accelerator in every SIMD slice as shown in Fig. 4. There are only two possible $E_{n,m}^{new}$ magnitudes for check node m in Step 3 of Algorithm 1 (which are selected from the $W_r$ values of type $L_{n,m}$): the minimum $E_{m1}$ and the second minimum $E_{m2}$. Each LDPC accelerator expedites finding the minimum values using two compare/store units with two $W_r$-bit special registers, a selection register $P_m$ and a sign register $S_m$, as can be seen in Fig. 5. The operation of the LDPC accelerator for the i-th input is summarized below.

The Algorithm of LDPC Accelerator

    if (|L_{n,m}| <= E_m1) {            // operations in Cmp&Store 1
        if (|L_{n,m}| < E_m1) P_m = 1 << i;
        else                  P_m = 0;
        E_m2 <= E_m1;
        E_m1 <= |L_{n,m}|;
    } else if (|L_{n,m}| < E_m2) {      // operations in Cmp&Store 2
        E_m2 <= |L_{n,m}|;
    }
    S_m = (S_m | sign(L_{n,m})) << 1;

$E_{m1}$, $E_{m2}$, $P_m$ and $S_m$ are extracted using a flush signal, and these values are used to compute $E_{n,m}$ with the following operation (Steps 7 and 14 of Algorithm 2). The input that supplied the minimum ($P_m[i] = 1$) receives the second minimum, and every other input receives the minimum.

    if (P_m[i] == 1) E_{n,m}[i] = (S_m[i]) * E_m2
    else             E_{n,m}[i] = (S_m[i]) * E_m1
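The compare/store behaviour of the accelerator can be modelled in software as follows. This is a minimal sketch for a single check node (one SIMD lane). The class name `TwoMinTracker`, its method names, the use of magnitudes, and the way the output sign is derived (parity of all input signs combined with the input's own sign bit) are assumptions made for illustration; the hardware encoding of $S_m$ may differ.

```python
class TwoMinTracker:
    """Hypothetical software model of one per-lane LDPC accelerator."""

    def __init__(self):
        self.em1 = float("inf")  # smallest magnitude seen so far (E_m1)
        self.em2 = float("inf")  # second smallest magnitude (E_m2)
        self.pm = 0              # one-hot position of a unique minimum (P_m)
        self.sm = 0              # bit i is set when input i was negative (S_m)
        self.count = 0           # index i of the next input

    def push(self, l_nm):
        """Feed the i-th L_{n,m} value of this check node."""
        i, mag = self.count, abs(l_nm)
        if mag <= self.em1:                      # Cmp&Store 1
            # A strict minimum is marked; a tie clears the marker because
            # E_m1 and E_m2 are then equal anyway.
            self.pm = (1 << i) if mag < self.em1 else 0
            self.em2 = self.em1
            self.em1 = mag
        elif mag < self.em2:                     # Cmp&Store 2
            self.em2 = mag
        if l_nm < 0:
            self.sm |= 1 << i
        self.count += 1

    def extrinsic(self, i):
        """Return E_{n,m} for input i, excluding that input's own value."""
        total_negative = bin(self.sm).count("1") % 2
        own_negative = (self.sm >> i) & 1
        out_sign = -1 if total_negative ^ own_negative else 1
        # The input that supplied the minimum gets the second minimum.
        mag = self.em2 if (self.pm >> i) & 1 else self.em1
        return out_sign * mag


# Hypothetical inputs for one check node with four connected variable nodes.
values = [3.0, -1.0, 2.0, -4.0]
tracker = TwoMinTracker()
for v in values:
    tracker.push(v)
outputs = [tracker.extrinsic(i) for i in range(len(values))]
print(outputs)
```

Only two magnitudes, one position marker and one sign word are kept per check node, which is what allows the later storage reduction for $E_{n,m}$.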

Fig. 5. LDPC accelerator

4.2. Memory Units

A major challenge in decoding LDPC codes is the large number of data alignment operations required for every z-by-z permutation matrix. The z values of type $L_n$ need to be shuffled so that they can be correctly aligned for check node processing. If z is less than the SIMD width, the data alignment can be executed in one clock cycle using the SSN. However, the IEEE 802.16e standard uses different z values (24, 28, 32, ..., 96) for different block sizes [7]. If z is larger than the SIMD width, many clock cycles are required for the data alignment operation when the SSN is used. This causes a degradation in the LDPC decoding throughput performance. To solve the alignment issue, we propose a memory controller and buffer organization (instead of using the shuffle network) as shown in Fig. 4. BUF1 and BUF2 contain aligned $L_n$ and $L_n^{update}$ (to be described in Section 4.3), respectively; BUF3 contains $E_{m1}$ and $E_{m2}$; and BUF4 contains $P_m$ and $S_m$.

The memory controller handles movement of $L_n$ data between the SIMD memory and BUF1. Since the z-by-z permutation matrices in the LDPC codes used in the IEEE 802.16e standard are circular right-shifted identity matrices, each permutation matrix can be defined by a single right-shift amount s. The alignment operation can now be achieved by the two memory copy operations described below. If the shift amount is s and the start memory address is $m_{start}$, the memory controller first copies MEM[$m_{start}+s$ ... $m_{start}+z-1$] to BUF1, and then copies MEM[$m_{start}$ ... $m_{start}+s-1$] to BUF1. This is shown in Fig. 6 for an example where s=5 and $m_{start}$=0. This is done for all $W_r$ non-zero submatrices in a z-by-n block row. At the end of this process, BUF1 contains $W_r$ groups of aligned $L_n$ data (see Fig. 6). In a similar way, the memory controller fills BUF2 with $L_n^{update}$ data using another shift amount, $((s - s_{update}) \bmod z)$ (to be described in Section 4.3). Note that the width of BUF1 and BUF2 equals the SIMD width.

Fig. 6. Data alignment in buffers

4.3. Modified Decoding Algorithm

Algorithm 2 shows the LDPC decoding algorithm on the modified SODA architecture. The $L_n$ and $L_n^{update}$ values are aligned and stored in BUF1 and BUF2 (Steps 1 and 2 of Algorithm 2). The aligned values of $L_n$ and $L_n^{update}$ (Step 5), along with $E_{m1}$, $E_{m2}$ (Step 4), $P_m$ and $S_m$ (Step 6), of the first row of the first group (see for example Group 1 in Fig. 6) are fed to the ALU unit and the LDPC accelerator in each SIMD slice. These values are updated in Steps 7, 8 and 9 of Algorithm 2. The process is repeated for the first row of the next group (see for example Group 2 in Fig. 6), and so on. After completing the processing of the first rows of all the $W_r$ groups (Step 10), the updated values of $E_{m1}$, $E_{m2}$, $P_m$ and $S_m$ are stored in their respective buffers (Steps 11, 12). The updated values are used to compute $E_{n,m}^{new}$ and $L_n^{update}$ (Steps 14 to 16) for the first row of each $W_r$ group (Step 17). The process is repeated for the second row of each $W_r$ group, and so on (Step 18). The above schedule results in high decoding throughput performance; it reduces the number of data switches and also speeds up the operation of finding the minimum values in the min-sum decoding algorithm. After processing all the data for one z-by-n block row, the data for the next z-by-n block row is loaded into BUF1 and BUF2, and the process repeats once per z-by-n block row, that is, $n(1-R)/z$ times.

Algorithm 2: LDPC decoding algorithm in the modified SODA
1. load aligned $L_n$ to BUF1
2. load aligned $L_n^{update}$ to BUF2
3. load $W_r$ for the current z-by-n block row
4. load $E_{m1}$, $E_{m2}$ from BUF3
5. load $L_n$, $L_n^{update}$ from BUF1, BUF2
6. load $P_m$, $S_m$ from BUF4
7. compute $E_{n,m}^{curr}$ using $E_{m1}$, $E_{m2}$, $P_m$ and $S_m$
8. update $L_{n,m} = L_n + L_n^{update} - E_{n,m}^{curr}$
9. update $E_{m1}$, $E_{m2}$, $P_m$ and $S_m$ using $L_{n,m}$
10. repeat step 5 to step 9 $W_r$ times
11. store updated $E_{m1}$, $E_{m2}$ ($E_{m1}^{new}$, $E_{m2}^{new}$) in BUF3
12. store updated $P_m$, $S_m$ ($P_m^{new}$, $S_m^{new}$) in BUF4
13. load $L_n^{update}$ from BUF2 again
14. compute $E_{n,m}^{new}$ using $E_{m1}^{new}$, $E_{m2}^{new}$, $P_m^{new}$ and $S_m^{new}$
15. update $L_n^{update}$ += $E_{n,m}^{new}$
16. store updated $L_n^{update}$ in MEM
17. repeat step 13 to step 16 $W_r$ times
18. repeat step 4 to step 17 for every row of the $W_r$ groups
19. repeat step 1 to step 18 $n(1-R)/z$ times
20. repeat step 1 to step 19 NUM times

In order to reduce the memory for storing $L_{n,m}$, we introduce the parameter $L_n^{update}$, which is $(-E_{n,m} + E_{n,m}^{new})$. In fact, the memory space is reduced by a factor of $m/z$ by keeping one $L_n^{update}$ value for each check node instead of storing all $L_{n,m}$ values for every n and m combination. Since updated $L_n^{update}$ values are processed in check node order, an inverse alignment operation is required to store the data in variable node order in memory. After $L_n^{update}$ is stored back in memory, the data is realigned with a different shift amount for the next z-by-n block row computation. However, these two alignment operations can be reduced to one alignment operation using another shift amount $s_{update}$: instead of performing the inverse alignment operation, $L_n^{update}$ is stored with the current shift amount $s_{update}$ and then, in the next iteration, the memory controller uses $((s - s_{update}) \bmod z)$ as the shift amount to align $L_n^{update}$.

4.4. Assembly Support

New assembly instructions are required for the proposed architecture to improve the decoding throughput performance. Steps 1 and 2 of Algorithm 2 are independent and can be executed in parallel. These are combined to form the instruction ldpc_mem2buf. Similarly, steps 5 and 6 of Algorithm 2 can be executed in parallel and are combined to form the instruction ldbufs. Steps 8 and 9 of Algorithm 2 can be executed in a pipelined manner through the ALU unit and the LDPC accelerator unit. We combine these two steps and introduce a macro-operation instruction, ldpc_in. To implement steps 11 and 12 of Algorithm 2, a new instruction, ldpc_out.(v|p), is introduced to flush $E_{m1}$, $E_{m2}$, $P_m$ and $S_m$ from the LDPC accelerators and store them in BUF3 and BUF4. The additional new assembly instructions are listed below.

The New Assembly Instructions
1. ldpc_mem2buf Addr[Mem],Addr[BUF1],Addr[BUF2],S1,S2
   : send a control signal to the memory controller
   : the controller loads $L_n$, $L_n^{update}$ from memory and aligns the data with shift amounts (S1, S2) in BUF1 and BUF2
2. ldbuf3 V3,V4,Addr[BUF3]
   : load V3=$E_{m1}$, V4=$E_{m2}$ from BUF3
3. ldbufs V1,V2,P1,P2,Addr[BUF1],Addr[BUF2],Addr[BUF4]
   : load V1=$L_n$, V2=$L_n^{update}$, P1=$P_m$, P2=$S_m$ from BUF1, BUF2, BUF4
4. ldpc_in V1,V6
   : 1) calculate $L_{n,m}$ with V1=$L_n$ and V6=$L_n^{update} - E_{n,m}^{curr}$
   : 2) update $E_{m1}$, $E_{m2}$, $P_m$, $S_m$ in the LDPC accelerators with $L_{n,m}$
5. ldpc_out.v V7,V8,Addr[BUF3]
   : extract V7=$E_{m1}$, V8=$E_{m2}$ from the LDPC accelerators and store them in BUF3
6. ldpc_out.p P3,P4,Addr[BUF4]
   : extract P3=$P_m$, P4=$S_m$ from the LDPC accelerators and store them in BUF4

The overhead of adding these new instructions is the increased instruction bit width and instruction decoder complexity.

4.5. Scalability Issues

The proposed architecture supports different values of z and $W_r$ corresponding to the different code sizes and code rates mandated by the IEEE 802.16e standard. The memory configuration described in Section 4.2 handles the more difficult case in which z exceeds the SIMD width. A larger z results in more computations, and so a larger SIMD width would help in achieving higher decoding throughput. The penalty is the larger area, both in terms of datapath and memory, and larger power. The parameter $W_r$ affects the decoding throughput (number of iterations in Algorithm 2). Since it also affects the buffer size and the $P_m$, $S_m$ registers in the LDPC accelerators, the architecture has to be designed for the largest value of $W_r$.

5. ANALYSIS

In this section, we study the required memory and buffer size, and also analyze the improvement in the decoding throughput due to the memory organization, datapath accelerators and assembly instruction support.

5.1. Memory Size Analysis

The LDPC decoding process consists of computationally simple operations and multiple memory operations. As a result, if the memory is not organized properly, it is highly likely that the SIMD pipeline would have to wait for the data to arrive.
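One way to see why the organization matters is to model the two-copy alignment of Section 4.2 and the combined shift of Section 4.3 in software. The sketch below is illustrative only: the function name `align` and the use of a Python list to stand in for the memory are assumptions, and the small value z=8 is chosen only to keep the example readable.

```python
def align(mem, m_start, s, z):
    """Two-copy alignment: MEM[m_start+s .. m_start+z-1] is copied first,
    followed by MEM[m_start .. m_start+s-1]."""
    first_copy = mem[m_start + s : m_start + z]
    second_copy = mem[m_start : m_start + s]
    return first_copy + second_copy


z = 8                      # illustrative submatrix size (assumption)
mem = list(range(z))       # stand-in for z consecutive values in memory

# Alignment with shift amount s = 5 and m_start = 0.
buf1 = align(mem, 0, 5, z)
print(buf1)

# Data left in memory with shift s_update can be brought to shift s with a
# single alignment by (s - s_update) mod z, instead of an inverse alignment
# followed by a fresh alignment.
s_update, s = 5, 3
stored = align(mem, 0, s_update, z)
realigned = align(stored, 0, (s - s_update) % z, z)
print(realigned == align(mem, 0, s, z))
```

Each alignment is a pair of contiguous copies, so its cost does not depend on whether z exceeds the SIMD width.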
In a typical implementation, there are four main values that are to be stored: $L_n$, $L_{n,m}$, $E_{n,m}$, and shuffle information. For the n=2304 and R=5/6 LDPC code outlined in the IEEE 802.16e standard, a brute-force decoding method needs 3.456 MB for storing the $L_{n,m}$ and $E_{n,m}$ values. Even if we consider only non-zero elements, the storage still requires 30 KB (15 KB + 15 KB), which is still a large memory space for an SDR platform. Therefore, a new scheme to reduce memory space should be considered. There is no way to reduce the storage of $L_n$ because the data is used to decide the final decoded bit value. However, the storage for $L_{n,m}$ and $E_{n,m}$ can be significantly reduced.
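The two storage figures quoted above can be reproduced with a short calculation. The sketch assumes 16-bit (2-byte) values, in line with the 16-bit SODA datapath, and 1 KB = 1024 B; the variable names are illustrative.

```python
# Parameters of the n=2304, R=5/6 code; 2-byte values are an assumption
# based on the 16-bit datapath.
n = 2304
m = n // 6            # number of parity checks, n * (1 - R) with R = 5/6
w_r = 20              # number of 1s per row of H
BYTES_PER_VALUE = 2


def to_kb(num_bytes):
    """Convert bytes to KB using 1 KB = 1024 B."""
    return num_bytes / 1024


# Brute force: one L_{n,m} and one E_{n,m} entry for every (n, m) pair.
brute_force_kb = to_kb(2 * n * m * BYTES_PER_VALUE)

# Non-zero entries only: W_r entries per check node, for each of the two arrays.
nonzero_each_kb = to_kb(m * w_r * BYTES_PER_VALUE)

print(brute_force_kb)        # 3456.0 KB, i.e. about 3.456 MB
print(nonzero_each_kb)       # 15.0 KB per array, 30 KB in total
```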

To reduce the $E_{n,m}$ storage size, we exploited the fact that there are only two possible $E_{n,m}^{new}$ values for check node m: $E_{m1}$ and $E_{m2}$. This two-minimum method reduces the required memory space by a factor of $W_r/2$. For the case mentioned above, the storage requirement for $E_{n,m}$ values is reduced to 1.5 KB. Also, instead of storing all $L_{n,m}$ values, we store $L_n^{update}$ values, thereby reducing the storage by a factor of $m/z$ (= 4) to 3.75 KB.

Storage                        Size (B)            Ex. (KB)
-----------------------------------------------------------
MEM:  $L_n$, $L_n^{update}$    $4n$                9
BUF1: $L_n$                    $2 z W_r$           3.75
BUF2: $L_n^{update}$           $2 z W_r$           3.75
BUF3: $E_{m1}$, $E_{m2}$       $4n(1-R)$           1.5
BUF4: $P_m$, $S_m$             $2 W_r n(1-R)$      0.94

Table 1. Memory/Buffer requirements for the n=2304 and R=5/6 LDPC code in the IEEE 802.16e standard

Table 1 summarizes the memory and buffer requirements for a block size n, code rate R=k/n, and ($W_c$, $W_r$)-LDPC code. We list the memory requirements for the n=2304 and R=5/6 LDPC code (the IEEE 802.16e standard), when the SIMD width is 32, $W_r$ = 20 and z = 96, under the column Ex. in the table.

5.2. Throughput Analysis

The data path accelerators, the memory units, and the new instructions all help in increasing the decoding throughput. For the n=2304 and R=5/6 LDPC code in the IEEE 802.16e standard and for NUM=10, the achievable clock cycle reductions for each of the enhancements are shown in Table 2. Here 40000 is the number of cycles in the original SODA implementation.

Enhancement           Red. (Cycles)     % red.
----------------------------------------------
LDPC Accelerators     5760 (40000)      14.4
Memory Units          6912 (40000)      17.3
New Instructions      4608 (40000)      11.5

Table 2. Cycle reductions due to enhancements

The proposed SODA PE is implemented in 0.18 µm technology and is clocked at 400 MHz. The LDPC decoding throughput for the n=2304 and R=5/6 LDPC code can be boosted from 18.3 Mbps to 30.4 Mbps using the proposed enhancements. With technology scaling, the decoding throughput is expected to increase to around 62.2 Mbps in 90 nm technology. The area and power overhead in the datapath and memory is quite small. For instance, the area of the memory controller and LDPC accelerators is negligible (5.37%) compared to the original design.
However, the complexity of adding CISC-type instructions requires careful evaluation.

6. CONCLUSION

In this paper, we presented a software-hardware co-design case study of an LDPC decoder for SDR. We first provided an overview of LDPC codes and then showed how LDPC decoding can be done by the SIMD-based SDR architecture. Next we showed how datapath accelerators, memory buffers and additional instructions can be used to improve the decoding throughput performance. We implemented a scalable LDPC decoder for the IEEE 802.16e standard. Our results show that we can achieve 30.4 Mbps decoding throughput for the n=2304 and R=5/6 LDPC code.

7. ACKNOWLEDGEMENT

This research is supported in part by ARM Ltd., NSF CSR-EHS 0615135, NSF ITR 0325761 and the Korea Foundation for Advanced Studies.

8. REFERENCES

[1] R. G. Gallager, Low-density parity-check codes, IRE Transactions on Information Theory, vol. IT-8, no. 1, pp. 21-28, January 1962.
[2] D. J. C. MacKay and R. M. Neal, Near Shannon-limit performance of low-density parity-check codes, Electronics Letters, vol. 32, pp. 1645-1646, August 1996.
[3] M. M. Mansour and N. R. Shanbhag, High-throughput LDPC decoders, IEEE Transactions on VLSI Systems, vol. 11, no. 6, pp. 976-996, December 2003.
[4] Y. Lin et al., SODA: A low-power architecture for software radio, Proceedings of the 33rd Annual International Symposium on Computer Architecture (ISCA), 2006.
[5] F. Guilloud, E. Boutillon and J. L. Danger, λ-min decoding algorithm of regular and irregular LDPC codes, 3rd International Symposium on Turbo Codes & Related Topics, September 2003.
[6] D. E. Hocevar, A reduced complexity decoder architecture via layered decoding of LDPC codes, IEEE Workshop on Signal Processing Systems, pp. 107-112, 2004.
[7] IEEE Std 802.16e-2005, available at http://standards.ieee.org/getieee802/download/802.16e-2005.pdf, February 2006.