Reduced-Complexity Column-Layered Decoding and Implementation for LDPC Codes

Zhiqiang Cui (1), Zhongfeng Wang (2), Senior Member, IEEE, and Xinmiao Zhang (3)

(1) Qualcomm Inc., San Diego, CA 92121, USA
(2) Broadcom Corp., Irvine, CA 92617, USA
(3) Case Western Reserve University, Cleveland, OH 44106, USA

Abstract: Layered decoding is well appreciated in Low-Density Parity-Check (LDPC) decoder implementation since it can achieve effectively high decoding throughput with low computation complexity. This work, for the first time, addresses low-complexity column-layered decoding schemes and VLSI architectures for multi-Gb/s applications. At first, the Min-Sum algorithm is incorporated into the column-layered decoding. Then algorithmic transformations and judicious approximations are explored to minimize the overall computation complexity. Compared to the original column-layered decoding, the new approach can reduce the computation complexity in check node processing for high-rate LDPC codes by up to 90% while maintaining the fast convergence speed of layered decoding. Furthermore, a relaxed pipelining scheme is presented to enable very high clock speed for VLSI implementation. Equipped with these new techniques, an efficient decoder architecture for quasi-cyclic LDPC codes is developed and implemented with 0.13 µm CMOS technology. It is shown that a decoding throughput of nearly 4 Gb/s at a maximum of 10 iterations can be achieved for a (4096, 3584) LDPC code. Hence, this work has facilitated practical applications of column-layered decoding and particularly made it very attractive in high-speed, high-rate LDPC decoder implementation.

Index Terms: Decoder, Error correction codes, Low-density parity-check (LDPC), Quasi-cyclic (QC) codes, VLSI Architecture, Layered decoding.

1 INTRODUCTION

Conventionally, LDPC codes are decoded using the Sum-Product algorithm (SPA) [1] or the modified Min-Sum algorithm (MSA) [2]. In general, the SPA has the best decoding performance. The MSA is an approximation of the SPA aimed at reducing computation complexity. It is widely employed in LDPC decoder design [3][4][5][6]. Both algorithms are based on the two-phase message-passing (TPMP) scheme. In the literature, variations of the SPA and MSA decoding approaches have been investigated to reduce the interconnect complexity in VLSI implementation [15][16].

Recently, layered LDPC decoding schemes [7][8][9][10] have attracted much attention in both academia and industry because they can effectively speed up the convergence of LDPC decoding and thus reduce the required maximum number of decoding iterations. Presently two kinds of layered decoding approaches, i.e., row-layered decoding [7][8][9] and column-layered decoding [9][10], have been proposed. In row-layered decoding, the parity check matrix of the LDPC code is partitioned into multiple row layers, and the message updating is performed row layer by row layer. Column-layered decoding employs a similar idea except that the parity check matrix is partitioned into multiple column layers, and the message computation is performed column layer by column layer.

The implementation of LDPC decoders with row-layered decoding has been widely studied [11][12][13][14]. The SPA-based column-layered decoding approach [10] was proposed in 2005. In the same year, Radosavljevic et al. [9] proposed a simplification of the SPA-based column-layered decoding approach. It has been reported that column-layered decoding has a convergence speed similar to that of row-layered decoding. However, the column-layered decoding algorithm has attracted much less attention due to its inherently high computation complexity.

In this work, we investigate and explore the benefits of the column-layered decoding approach. We first incorporate the Min-Sum algorithm into the column-layered message passing scheme because it has much lower complexity than the SPA. In addition, by closely investigating the message updating process in the Min-Sum based column-layered decoding, we develop a simplified column-layered decoding scheme, which eliminates the redundant computations in the original scheme and further reduces the computation complexity with judicious algorithmic approximation. It is shown that up to 90% of the computations in check node processing can be saved for high-rate LDPC codes.

For high-speed VLSI implementation, the proposed design has significant advantages over the conventional row-layered decoding. First, the column-layered decoding inherently has a shorter critical path. For a row-layered decoder, the messages associated with multiple sub-blocks in a row layer can be processed in one cycle to increase throughput. However, this increases the complexity of the check node unit (CNU) and requires serial concatenation of multiple comparison and selection stages in VLSI implementation. It has been reported that significant hardware overhead is required to optimize the corresponding circuitry for high clock speed [12]. In the column-layered decoding, the major implementation complexity is associated with the variable node unit (VNU), particularly when the messages corresponding to multiple sub-blocks in a column layer are processed in parallel. Because only addition operations are performed in a VNU, it is very convenient to employ arithmetic optimization to minimize the critical path. Second, in the proposed design, the overall pipeline latency in decoding a code block is equal to the number of pipeline stages, while in row-layered decoding the same amount of pipeline latency is introduced for the message updating of every layer. Moreover, the intrinsic message loading latency is minimized in the column-layered decoding because the decoding can start as soon as the intrinsic messages corresponding to one block column are available. In summary, the proposed design is well suited for very high decoding throughput and low-power LDPC decoder implementation.

2 SIMPLIFIED COLUMN-LAYERED DECODING WITH MIN-SUM ALGORITHM

The column-layered decoding scheme (also known as shuffled iterative decoding) for LDPC codes was proposed in [10]. That decoding scheme is based on the SPA. Similar to row-layered decoding [7][8][9], the maximum number of decoding iterations can be significantly reduced for the same decoding performance. Since the Min-Sum algorithm has much lower implementation complexity than the SPA, it is widely utilized in hardware implementation. In this paper, the Min-Sum algorithm is incorporated into the column-layered decoding for the first time. Then, algorithmic transformations and approximations are explored to significantly reduce the computation complexity and memory requirement.

2.1 Min-Sum Algorithm Based Column-layered Decoding

Let C be a binary (N, K) LDPC code specified by a parity-check matrix H with M rows and N columns. Each row of the parity check matrix is associated with a check node, and each column is associated with a variable node. Let $N(c) = \{v : H_{cv} = 1\}$ denote the set of variable nodes that participate in check node $c$, and $M(v) = \{c : H_{cv} = 1\}$ denote the set of check nodes associated with variable node $v$. Let $I_v$ denote the intrinsic message for variable node $v$, $R_{cv}$ represent the check-to-variable message conveyed from check node $c$ to variable node $v$, and $L_{cv}$ represent the variable-to-check message conveyed from variable node $v$ to check node $c$. Assume that the N bits of a codeword are divided into G groups of the same size, $N_0, N_1, \ldots, N_{G-1}$. Accordingly, the parity-check matrix is divided into G block columns. The proposed column-layered decoding based on the Min-Sum algorithm is described with the pseudo code below.

Min-Sum-based column-layered decoding algorithm

Initialization: $L_{cv} = I_v$ for $v = 0, 1, \ldots, N-1$, $c = 0, 1, \ldots, M-1$;
Iterative decoding:
For iter = 1, 2, ..., maximum iteration number {
    For g = 0, 1, ..., G-1 {
        Horizontal Step: For each check node $c$ that is connected to variable node $v \in N_g$, compute
            $R_{cv} = \prod_{n \in N(c) \setminus v} \mathrm{sgn}(L_{cn}) \cdot \min_{n \in N(c) \setminus v} |L_{cn}|$.  (1)
        Vertical Step: For each variable node $v \in N_g$, update $L_{cv}$ and $L_v$ as follows:
            $L_{cv} = I_v + \alpha \sum_{m \in M(v) \setminus c} R_{mv}$,  (2)
            $L_v = I_v + \alpha \sum_{m \in M(v)} R_{mv}$.  (3)
    }
    Hard decision and termination: Make the hard decision by using the sign of $L_v$; terminate the decoding if a valid codeword is found.
}

The optimum value of the scaling factor $\alpha$ is around 0.8. For the convenience of VLSI implementation, it is set to 0.75 in this work.

Assume all variable nodes are divided into 4 groups (i.e., G = 4). The computation flow of the column-layered decoding for one iteration is illustrated in Figure 1(a). The shaded sub-matrices indicate the coverage of computation in decoding each layer, where the computation in (1) must be carried out for all block columns except for the block column currently being updated. Hence, the computation complexity of the original column-layered decoding scheme per iteration is many times that of the conventional TPMP scheme, whether the MSA or the SPA is used.
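The schedule of (1)-(3) can be made concrete with a small software model. The following is a minimal sketch, not the authors' implementation: it assumes floating-point messages, a dense NumPy parity-check matrix, check nodes of degree at least 2, and the convention that a positive message favours bit 0. The function name `column_layered_min_sum` and its arguments are hypothetical. It recomputes (1) from scratch for every edge of the current layer, which is exactly the redundancy of the original scheme.

```python
import numpy as np


def column_layered_min_sum(H, llr, groups, alpha=0.75, max_iter=10):
    """Illustrative Min-Sum column-layered decoding following eqs. (1)-(3).

    H      : (M, N) 0/1 parity-check matrix
    llr    : length-N intrinsic messages I_v (positive favours bit 0, an assumption)
    groups : list of column-index lists, the layers N_0 .. N_{G-1}
    """
    H = np.asarray(H, dtype=np.uint8)
    llr = np.asarray(llr, dtype=float)
    L = H * llr              # initialization: L_cv = I_v on every edge (c, v)
    R = np.zeros(H.shape)    # check-to-variable messages R_cv
    total = llr.copy()       # a-posteriori values L_v
    hard = (total < 0).astype(np.uint8)
    for _ in range(max_iter):
        for group in groups:
            # Horizontal step, eq. (1): every edge entering the current layer.
            for v in group:
                for c in np.flatnonzero(H[:, v]):
                    others = np.flatnonzero(H[c])
                    others = others[others != v]       # N(c) \ v
                    sign = np.prod(np.sign(L[c, others]))
                    R[c, v] = sign * np.min(np.abs(L[c, others]))
            # Vertical step, eqs. (2) and (3), for the same layer.
            for v in group:
                checks = np.flatnonzero(H[:, v])
                total[v] = llr[v] + alpha * R[checks, v].sum()
                L[checks, v] = total[v] - alpha * R[checks, v]
        # Hard decision and early termination on a valid codeword.
        hard = (total < 0).astype(np.uint8)
        if not np.any(H.dot(hard) % 2):
            break
    return hard
```

Calling it with a (7, 4) Hamming parity-check matrix, two layers, and a single weakly negative intrinsic message recovers the all-zero codeword after one iteration. The inner loop over `others` is the cost that the reformulation in Section 2.2 removes.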

Figure 1. The computation flow in an iteration of column-layered decoding (horizontal and vertical steps for g = 0, 1, 2, 3): (a) original algorithm, (b) after algorithm reformulation.

2.2 Low Complexity Decoding Scheme

2.2.1 Algorithm Reformulation

A close study of the computation flow of the column-layered decoding shows that a significant amount of redundant computation is performed in the original column-layered decoding algorithm. To improve the computation efficiency, the computation in (1) for consecutive column layers can be performed incrementally.

In LDPC decoding, each variable node sends a variable-to-check message to every neighboring check node. Assume that a check node has $d$ variable node neighbors, and hence receives $d$ soft messages from its neighboring variable nodes. To clarify the main procedure of the reformulated column-layered decoding, we assume that each check node has only one variable node neighbor in each block column of the parity check matrix. It should be noted that this constraint is not indispensable for column-layered decoding in general. Let $\mathbf{m}_c^{(g)} = [m_1, m_2, \ldots, m_d]$ be a vector of the magnitudes of the soft messages received by check node $c$, sorted in ascending order. The superscript $g$ indicates that the soft messages were generated when the $g$-th block column was processed. Similarly, let $S_c^{(g)}$ be $\prod_{n \in N(c)} \mathrm{sign}(L_{cn})$ when the decoding for layer $g$ is completed. To reduce the computation complexity of Min-Sum based column-layered decoding, $\mathbf{m}_c^{(g)}$ can be computed from $\mathbf{m}_c^{(g-1)}$ in three steps.

1. For each variable node $v$ in $N_g$, remove the old $|L_{cv}|$ from $\mathbf{m}_c^{(g-1)}$ to obtain a temporary sorted vector $\tilde{\mathbf{m}}_c^{(g)}$. In addition, remove the old $\mathrm{sign}(L_{cv})$ from $S_c^{(g-1)}$ to obtain a temporary sign product $\tilde{S}_c^{(g)}$. Since the smallest value, $\tilde{m}_1$, in $\tilde{\mathbf{m}}_c^{(g)}$ is $\min_{n \in N(c) \setminus v} |L_{cn}|$ and $\tilde{S}_c^{(g)}$ is $\prod_{n \in N(c) \setminus v} \mathrm{sign}(L_{cn})$, the value of $R_{cv}$ can be computed as $\tilde{m}_1 \cdot \tilde{S}_c^{(g)}$. Send the $R_{cv}$ message to the variable node $v$.
2. Perform the variable-to-check message computations for all variable nodes belonging to $N_g$. The new $L_{cv}$ messages are sent back to the corresponding check nodes.
3. For each check node $c$, insert the updated $|L_{cv}|$ into $\tilde{\mathbf{m}}_c^{(g)}$ in sorted order to obtain $\mathbf{m}_c^{(g)}$.

The reformulated column-layered decoding procedure is summarized as follows:

The Reformulated Column-layered Decoding

Initialization: Let $L_{cv} = I_v$ for all variable nodes. For each check node $c$, sort the magnitudes of the $L_{cv}$ messages from its neighboring variable nodes. Compute the sign product for each check node, $S_c = \prod_{n \in N(c)} \mathrm{sign}(L_{cn})$.
Iterative decoding:
For iter = 1, 2, ..., maximum iteration number {
    For g = 0, 1, ..., G-1 {
        Horizontal Step-A: for each check node $c$ that connects to variable node $v \in N_g$,
            compute $\tilde{\mathbf{m}}_c^{(g)}$ by removing the old $|L_{cv}|$ from $\mathbf{m}_c^{(g-1)}$,  (4)
            and $R_{cv} = \tilde{S}_c^{(g)} \cdot \tilde{m}_1$, where $\tilde{S}_c^{(g)} = S_c^{(g-1)} \cdot \mathrm{sign}(L_{cv}^{\mathrm{old}})$.  (5)
        Vertical Step: For each variable node $v \in N_g$, compute $L_{cv}$ and $L_v$ using (2) and (3).
        Horizontal Step-B: for each check node $c$ that connects to variable node $v \in N_g$,
            compute $\mathbf{m}_c^{(g)}$ using $\tilde{\mathbf{m}}_c^{(g)}$ and $|L_{cv}|$,  (6)
            and $S_c^{(g)} = \tilde{S}_c^{(g)} \cdot \mathrm{sign}(L_{cv})$.  (7)
    }
    Hard decision and termination: Make the hard decision by using the sign of $L_v$; terminate decoding if a valid codeword is found or the maximum decoding iteration is reached.
}

In the above algorithm, $\tilde{\mathbf{m}}_c^{(0)}$ and $\tilde{S}_c^{(0)}$ in a decoding iteration are computed from $\mathbf{m}_c^{(G-1)}$ and $S_c^{(G-1)}$, respectively, which are obtained in the previous iteration. The new computation flow for one full iteration is depicted in Figure 1(b).

2.2.2 Simplification of Min-Sum Based Column-layered Decoding

The algorithm reformulation presented above removes a large amount of redundant computation in the original column-layered decoding scheme, and thus significantly reduces the overall computation complexity. However, because every vector $\mathbf{m}_c$ contains $d$ values, the reformulated algorithm still requires a considerable amount of memory to store check-to-variable messages.

For row $c$, only the two smallest values in the sorted vector $\mathbf{m}_c^{(g-1)}$ are directly involved in the message updating Step-A from layer $g-1$ to layer $g$. For example, if the smallest value in $\mathbf{m}_c^{(g-1)}$ is from layer $g$ in the previous iteration, it is removed from $\mathbf{m}_c^{(g-1)}$ in horizontal Step-A. Then the second smallest value is used as the magnitude of the check-to-variable message. In horizontal Step-B, $\mathbf{m}_c^{(g)}$ is computed with the new value, $|L_{cv}|$, from the variable node, and the pre-sorted vector $\tilde{\mathbf{m}}_c^{(g)}$ from horizontal Step-A. The new value, $|L_{cv}|$, could take any index in $\mathbf{m}_c^{(g)}$. If it takes a very small index, it has a higher chance of being used in Step-A of further computation. Otherwise, it has a much lower chance of being one of the two smallest values in the remaining computation. In other words, though $\mathbf{m}_c$ contains $d$ values, most of the values at the end of the sorted vector are unlikely to be used as reliability information for message updating. Thus, it is reasonable to reduce the length of the vector $\mathbf{m}_c$ to further reduce the implementation complexity.

Our simulations show that if the lengths of the vectors $\mathbf{m}_c$ and $\tilde{\mathbf{m}}_c$ are set to 3, the decoding convergence speed and performance have almost no degradation compared to the standard Min-Sum based column-layered decoding. In this case, we name the scheme three-min column-layered decoding. Similarly, the lengths of the vectors $\mathbf{m}_c$ and $\tilde{\mathbf{m}}_c$ can be set to 2 or even 1. Correspondingly, larger penalties in convergence speed and performance are expected.

In this approximation, $\tilde{\mathbf{m}}_c^{(g)}$ contains three values in most cases. It may contain two valid values if one value is removed from the vector $\mathbf{m}_c^{(g-1)}$ in horizontal Step-A. In the computation of horizontal Step-B, at most three comparison operations are required to sort the new value $|L_{cv}|$ from the variable node and the sorted values in $\tilde{\mathbf{m}}_c^{(g)}$. If $\tilde{\mathbf{m}}_c^{(g)}$ contains three valid values, the third comparison is needed to determine whether $|L_{cv}|$ is the third smallest value or should be discarded.

Table I shows the average number of $\mathbf{m}_c$ updating events in one iteration for the decoding of a (4096, 3584) (4, 32) QC-LDPC code at an SNR of 4.1 dB. In one iteration, the total number of $\mathbf{m}_c$ updating events for a check node is 32. A message updating event can be categorized into one of three types. Type I: a value is removed from $\mathbf{m}_c^{(g-1)}$ and then a new value is inserted into $\tilde{\mathbf{m}}_c^{(g)}$. Type II: no value is removed from $\mathbf{m}_c^{(g-1)}$ but $\mathbf{m}_c^{(g)}$ is updated. Type III: no value is removed from $\mathbf{m}_c^{(g-1)}$ and $|L_{cv}|$ is discarded. It can be seen from Table I that if no value is removed from $\mathbf{m}_c^{(g-1)}$, $|L_{cv}|$ is much more likely to be discarded than to be the third smallest value in $\mathbf{m}_c^{(g)}$.

To further reduce the decoding complexity, the third comparison in horizontal Step-B is eliminated. In the modification, $|L_{cv}|$ is discarded if no value is removed from $\mathbf{m}_c^{(g-1)}$ and $|L_{cv}|$ is larger than the second value in $\tilde{\mathbf{m}}_c^{(g)}$. This results in the simplified three-min column-layered decoding. In this case, a check node requires 3 equality comparisons to remove the old $|L_{cv}|$ from vector $\mathbf{m}_c^{(g-1)}$ in horizontal Step-A and 2 regular comparisons to update the $\mathbf{m}_c$ vector with the new $|L_{cv}|$ in horizontal Step-B. The major difference between the two schemes is that the three-min decoding requires 3 comparisons for sorting in horizontal Step-B, whereas the simplified three-min decoding requires 2. Our simulation shows that the additional approximation introduced by the simplified three-min decoding causes only a very small performance loss.

Table I shows that a sorted vector $\mathbf{m}_c$ gets updated only about 4 times in an iteration of the three-min decoding. In contrast, for TPMP or row-layered decoding, the sorting computation to find the smallest and the second smallest magnitudes for a check node restarts in every iteration. For the same LDPC code, the average number of updating activities for a sorted vector is more than 16 per iteration. This leads to significant power savings for the three-min column-layered decoding in check node message updating.

TABLE I. THE AVERAGE NUMBER OF $\mathbf{m}_c$ UPDATING EVENTS IN ONE ITERATION

| Event | In the 3rd iteration | In the 6th iteration |
|---|---|---|
| $\tilde{\mathbf{m}}_c^{(g)} \neq \mathbf{m}_c^{(g-1)}$ | 2.804 | 2.783 |
| $\tilde{\mathbf{m}}_c^{(g)} = \mathbf{m}_c^{(g-1)}$, $|L_{cv}|$ being the smallest value in $\mathbf{m}_c^{(g)}$ | 0.036 | 0.050 |
| $\tilde{\mathbf{m}}_c^{(g)} = \mathbf{m}_c^{(g-1)}$, $|L_{cv}|$ being the second smallest value in $\mathbf{m}_c^{(g)}$ | 0.140 | 0.170 |
| $\tilde{\mathbf{m}}_c^{(g)} = \mathbf{m}_c^{(g-1)}$, $|L_{cv}|$ being the third smallest value in $\mathbf{m}_c^{(g)}$ | 0.880 | 1.005 |
| $\mathbf{m}_c^{(g)} = \tilde{\mathbf{m}}_c^{(g)} = \mathbf{m}_c^{(g-1)}$, $|L_{cv}|$ being discarded | 28.140 | 27.992 |

2.2.3 Pipelining of Column-layered Decoding

Pipelining is a common practice in VLSI implementation to increase clock speed and thus data processing throughput. In general, pipelining can only be applied to feed-forward data paths in order to maintain the original function of the VLSI circuitry. In LDPC column-layered decoding algorithms, data dependency exists between consecutive layers.
Thus, the VLSI circuitry for check node, variable node, and message memories forms a logic loop, and pipelining cannot be directly applied to increase the effective clock speed of LDPC decoders. In this section, a relaxed pipelining scheme for column-layered LDPC decoding is proposed.

In the original column-layered decoding, the change between $\mathbf{m}_c^{(g-1)}$ and $\mathbf{m}_c^{(g-1+P)}$ is very small if the value of $P$ is not large. Thus, $\mathbf{m}_c^{(g-1)}$ can be used as an estimate of $\mathbf{m}_c^{(g-1+P)}$. An approximation of $R_{cv}^{(g+P)}$ can be calculated from $\mathbf{m}_c^{(g-1)}$ before $\mathbf{m}_c^{(g-1+P)}$ is obtained. Then, it is immediately used for computing $L_{cv}^{(g+P)}$. The approximation allows $P$ clock cycles to complete the message computation for each layer. When $L_{cv}^{(g+P)}$ is obtained, $\mathbf{m}_c^{(g-1+P)}$ is already available. Thus, the horizontal step for layer $g+P$ can be undertaken with $\mathbf{m}_c^{(g-1+P)}$ and $L_{cv}^{(g+P)}$. The updating method of $\mathbf{m}_c^{(g+P)}$ is the same as before. The pipelined column-layered decoding algorithm is formulated as follows.

The Pipelined Column-layered Decoding

Initialization: Let $L_{cv} = I_v$ for all variable nodes. For each check node $c$, sort the magnitudes of the $L_{cv}$ messages from its neighboring variable nodes. Compute the sign product for each check node, $S_c = \prod_{n \in N(c)} \mathrm{sign}(L_{cn})$.
Iterative decoding:
For iter = 1, 2, ..., maximum iteration number {
    For g = 0, 1, ..., G-1 {
        Compute $R_{cv}^{(g+P)}$: For each check node $c$ that connects to variable node $v \in N_{g+P}$, compute $\tilde{\mathbf{m}}_c^{(g+P)}$ by removing the old $|L_{cv}^{(g+P)}|$ from $\mathbf{m}_c^{(g-1)}$. Then,
            $R_{cv}^{(g+P)} = \tilde{S}_c^{(g+P)} \cdot \tilde{m}_1^{(g+P)}$, where $\tilde{S}_c^{(g+P)} = S_c^{(g-1)} \cdot \mathrm{sign}(L_{cv}^{(g+P),\mathrm{old}})$.  (8)
        Vertical Step: For each variable node $v \in N_{g+P}$, compute $L_{cv}^{(g+P)}$ and $L_v^{(g+P)}$ using (2) and (3).
        Horizontal Step-A: for each check node $c$ that connects to variable node $v \in N_g$,
            compute $\tilde{\mathbf{m}}_c^{(g)}$ by removing the old $|L_{cv}|$ from $\mathbf{m}_c^{(g-1)}$.  (9)
        Horizontal Step-B: for each check node $c$ that connects to variable node $v \in N_g$,
            compute $\mathbf{m}_c^{(g)}$ using $\tilde{\mathbf{m}}_c^{(g)}$ and $|L_{cv}|$,  (10)
            and $S_c^{(g)} = \tilde{S}_c^{(g)} \cdot \mathrm{sign}(L_{cv})$.  (11)
    }
    Hard decision and termination: Make the hard decision by using the sign of $L_v$; terminate decoding if a valid codeword is found or the maximum decoding iteration is reached.
}

2.2.4 Impact on the Complexity of VLSI Implementation

For the standard Min-Sum based column-layered decoding, a check node requires $d - 2$ regular comparators to compute the check-to-variable message for a new column layer. To execute (1) and (2), the variable-to-check messages corresponding to the whole H matrix must be stored. For the simplified three-min column-layered decoding algorithm, a check node requires only 2 regular comparators. The magnitudes of the variable-to-check messages can be used on the fly and are not stored. Only the sign bits of the variable-to-check messages need to be stored. For each check node, three magnitudes, three indices, and one sign bit need to be saved.

For an (N, N-M) (4, 32) structured LDPC code, approximately $(30 - 2)/30 = 93\%$ of the comparison computation can be eliminated. Assuming each message is quantized to 4 bits, the number of memory bits needed by a check node is $3 \times (3 + 5) + 1 = 25$. The percentage of memory bits that can be saved for extrinsic messages is $[(32 \times 4 - 32 - 25) \times M] / (32 \times 4 \times M) = 55.4\%$. Because the proposed column-layered decoding approaches do not save the magnitudes of variable-to-check messages, they are inherently memory-efficient.

Equipped with the proposed decoding techniques, an efficient column-layered decoder architecture for quasi-cyclic LDPC codes is developed with architectural and arithmetic optimizations. The details are provided in Section 4.
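The complexity and memory figures quoted in Section 2.2.4 can be reproduced with a few lines of arithmetic. The sketch below is an interpretation rather than part of the paper: it assumes that each stored magnitude uses the message width minus one sign bit, that an index needs enough bits to address the d block columns, and that the 32 subtracted in the memory expression is the per-row storage for the sign bits of the variable-to-check messages. The function name and parameters are hypothetical.

```python
def three_min_savings(d=32, q=4):
    """Reproduce the Section 2.2.4 estimates for a degree-d check node.

    d : check node degree (32 for the (4, 32)-regular example)
    q : message width in bits (4 in the paper's example)
    """
    index_bits = (d - 1).bit_length()             # 5 bits address 32 block columns
    # Standard scheme: d - 2 comparators; simplified three-min: 2 comparators.
    comparator_saving = ((d - 2) - 2) / (d - 2)
    # Three magnitudes (q - 1 bits each), three indices, one sign-product bit.
    cnu_bits = 3 * ((q - 1) + index_bits) + 1
    baseline_bits = d * q                         # d full messages per row
    stored_bits = d + cnu_bits                    # assumed: d edge sign bits + sorted vector
    memory_saving = (baseline_bits - stored_bits) / baseline_bits
    return comparator_saving, cnu_bits, memory_saving
```

With the default arguments this gives a comparator saving of about 0.933, 25 bits per check node, and a memory saving of about 0.555, in line with the 93%, 25 and 55.4% quoted in the text (the last figure appears to be truncated rather than rounded).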

3 PERFORMANCE SIMULATION

To evaluate the decoding performance and convergence speed of the proposed algorithms, three LDPC codes are simulated. Codes I and II are the rate-1/2 (2304, 1152) LDPC code and the rate-5/6 (2304, 1920) LDPC code, respectively, adopted in the WiMax (802.16e) standard. Code III is a (4096, 3584) (4, 32)-regular QC-LDPC code, which is constructed using the progressive edge growth (PEG) method [17][18]. This code is also used to illustrate the VLSI architecture design in Section 4. In each simulation, at least 50 frame errors are observed.

Figure 2 shows the frame error rates in decoding the WiMax rate-1/2 (2304, 1152) LDPC code with four different layered decoding algorithms, i.e., 1) the row-layered decoding, 2) the original column-layered decoding, 3) the three-min column-layered decoding, and 4) the simplified three-min column-layered decoding. The maximum number of decoding iterations is set to 10 and 20, respectively, for all decoding approaches. It can be observed that all four decoding algorithms have almost the same decoding performance with a maximum of 20 iterations. With a maximum of 10 decoding iterations, only a subtle performance difference can be observed. Figure 3 shows the frame error rates in decoding the WiMax rate-5/6 (2304, 1920) LDPC code with a maximum of 10 and 20 iterations, respectively. It shows again that the performance differences among the various layered decoding algorithms are negligible.

Figure 2. Frame error rate of various layered decoding approaches in decoding the WiMax rate-1/2 (2304, 1152) LDPC code. (Plot: frame error rate versus Eb/N0 in dB for row-layered, column-layered, 3-min column-layered, and simplified 3-min column-layered decoding, with 10 and 20 iterations.)

Figure 3. Frame error rate of various layered decoding approaches in decoding the WiMax rate-5/6 (2304, 1920) LDPC code. (Plot: frame error rate versus Eb/N0 in dB for row-layered, column-layered, 3-min column-layered, and simplified 3-min column-layered decoding, with 10 and 20 iterations.)

Figure 4 shows the frame error rate (FER) in decoding the (4096, 3584) (4, 32) QC-LDPC code. The number of columns in each column layer is 128. The maximum number of iterations is set to 10. It can be seen that the three-min column-layered decoding algorithm achieves almost the same decoding performance as the standard Min-Sum based column-layered decoding algorithm. The simplified three-min approach has about 0.02 dB performance loss. The pipelined three-min column-layered decoding methods introduce a slight performance loss compared to the non-pipelined three-min decoding algorithm. With two stages of pipelining, this loss is less than 0.02 dB.

Figure 4. Frame error rate of various decoding approaches in decoding a (4096, 3584) (4, 32) QC-LDPC code.

4 PROPOSED COLUMN-LAYERED DECODER ARCHITECTURES

In this section, an optimized QC-LDPC decoder architecture for a (4096, 3584) (4, 32)-regular QC-LDPC code using the proposed pipelined column-layered decoding scheme is presented. The architecture efficiently enables very high decoding parallelism for layered decoding while having a very short critical path. In one clock cycle, the messages corresponding to 4 circulant matrices are processed in parallel. Two-stage pipelining is employed in order to improve the clock frequency. In order to further reduce the critical path delay, we rearrange the additions in the variable node unit to take maximum advantage of carry-save addition.

4.1 Column-Layered Decoder Architecture for QC-LDPC Codes

The parity check matrix of the QC-LDPC code considered in this work consists of a 4 × 32 array of cyclically shifted permutation matrices of dimension 128 × 128. The decoding steps discussed in Section 2.2 are performed for one block column at a time. That means all extrinsic messages corresponding to a block column of the H matrix are processed in parallel in one clock cycle. The architecture of the pipelined decoder with two pipeline stages is shown in Figure 5.

Each M component in the M-register array on the left side of Figure 5 represents a 25-bit register for storing a sorted vector associated with a row of the H matrix. The Get Rcv component performs (8) to get the check-to-variable messages for layer g+P, where P = 2. The CNUa component (the portion of the check node unit for horizontal Step-A) performs (9) to remove the old variable-to-check message associated with layer g from the sorted vector. The CNUb component (the portion of the check node unit for horizontal Step-B) is utilized to perform (10) and (11). The total numbers of M-register, Get Rcv, CNUa, and CNUb components are all 512. The number of variable node units (VNUs) in the VNU array is 128. A VNU is required to perform the vertical step (2) and (3) for a variable node. The barrel shifters on the left side of the VNU array align the check-to-variable messages from row order to column order. Similarly, the barrel shifters on the right side of the VNU array align the variable-to-check messages from column order to row order.

It can be observed from Figure 5 that two types of loops are formed in column-layered decoding. The first type of loop consists of a CNUa and a CNUb component. The second type of loop is composed of a Get Rcv, a CNUb, a VNU, and two barrel shifters. With the pipelined column-layered decoding approach, the critical path in the second type of loop is drastically reduced when two-stage pipelining is applied. The Get Rcv component is needed because its outputs are the check-to-variable messages for layer g+P. In contrast, the outputs of CNUa are intermediate values for layer g. For non-pipelined decoding, the Get Rcv component can be eliminated because the output of CNUa contains the check-to-variable message for layer g. We only need one stage of barrel shifter at the input of the CNUa array to minimize the critical path. The connections among the arrays of CNUa, CNUb and VNU do not change during the decoding.

Figure 5. The top-level block diagram of pipelined column-layered decoding. (Diagram blocks include barrel shifters and pipeline registers.)

4.2 Architecture of Check Node Unit

Figure 6 shows the structure of the CNU for the three-min decoding scheme. The sub-matrices in a block row of the H matrix are processed one at a time. The CNU is composed of two concatenated stages.

The first stage, CNUa, computes $\tilde{\mathbf{m}}_c^{(g)}$, and the second stage, CNUb, generates $\mathbf{m}_c^{(g)}$. The data in Figure 6 are associated with the vectors in the horizontal decoding step as follows:

$\mathbf{m}_c^{(g-1)}$ = [min1_old, min2_old, min3_old],
$\tilde{\mathbf{m}}_c^{(g)}$ = [min1_temp, min2_temp, min3_temp],
$\mathbf{m}_c^{(g)}$ = [min1_new, min2_new, min3_new],
$\mathbf{I}_c^{(g-1)}$ = [idx1_old, idx2_old, idx3_old],
$\tilde{\mathbf{I}}_c^{(g)}$ = [idx1_temp, idx2_temp, idx3_temp],
$\mathbf{I}_c^{(g)}$ = [idx1_new, idx2_new, idx3_new].

In each M-register for a sorted vector, the three smallest magnitudes and their indices are stored. In the decoding of a column layer, if the column index is not in the vector, the magnitudes and indices in the vector are passed directly through select-logic-a to the second stage. Otherwise, min1_temp and min2_temp get their values from the two remaining smallest magnitudes, and the value of min3_temp becomes void. The temporary index values are determined in the same way. It is clear that min1_temp is the magnitude of the $R_{cv}$ message for the column layer in non-pipelined decoding. After the new $L_{cv}$ value is sent back from the VNU, it is used to compute the new value of the sorted vector. The structure of a Get Rcv component is shown in Figure 6(b). This component is not required for non-pipelined decoding.

For the proposed simplified three-min decoding approach, the adder for the computation of abs($L_{cv}$) − min3_temp is not needed. Instead, a simple three-input OR gate can be used to disable the M-register update in the condition shown in row 5 of Table I. The three inputs of the OR gate are the outputs of the three comparators in CNUa. It is shown in Figure 4 that the removal of the adder results in less than 0.02 dB performance loss.
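The two CNU stages described above can be modelled in software as follows. This is a behavioural sketch under stated assumptions, not a description of the actual circuit: the sorted vector is held as a Python list of (magnitude, index) pairs in ascending order, signs are represented as +1 or -1, each check node has one neighbor per block column, and the function names `cnu_step_a` and `cnu_step_b` are hypothetical. Step-B follows the simplified three-min rule, so a new magnitude that is not smaller than min2_temp is discarded when no slot was freed in Step-A.

```python
def cnu_step_a(vec, sign_prod, layer, old_sign):
    """Horizontal Step-A: drop the entry that came from the current layer.

    vec       : up to three (magnitude, index) pairs, ascending by magnitude
    sign_prod : product of the signs of all variable-to-check messages (+1/-1)
    layer     : index of the block column being decoded
    old_sign  : sign of the old message from this layer (+1/-1)
    """
    temp = [(mag, idx) for (mag, idx) in vec if idx != layer]  # equality checks
    temp_sign = sign_prod * old_sign      # removing a +/-1 factor by multiplying
    r_cv = temp_sign * temp[0][0]         # min1_temp is the magnitude of R_cv
    return temp, temp_sign, r_cv


def cnu_step_b(temp, temp_sign, layer, new_l):
    """Simplified Horizontal Step-B: two magnitude comparisons, no third one."""
    mag = abs(new_l)
    new_sign = temp_sign * (1 if new_l >= 0 else -1)
    if mag < temp[0][0]:
        vec = [(mag, layer)] + temp
    elif mag < temp[1][0]:
        vec = temp[:1] + [(mag, layer)] + temp[1:]
    elif len(temp) < 3:
        vec = temp + [(mag, layer)]       # a slot was freed in Step-A
    else:
        return temp, new_sign             # discarded without comparing to min3
    return vec[:3], new_sign
```

For example, starting from `[(0.5, 3), (1.0, 7), (1.5, 0)]` and decoding layer 3, Step-A yields a magnitude of 1.0 for the outgoing message, and a returning message of -2.0 fills the freed third slot. In the next layer, a returning magnitude of 1.8 is discarded even though it is smaller than the stored 2.0, which is the approximation that distinguishes the simplified scheme from the exact three-min scheme.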

Figure 6. (a) CNU architecture for the three-min column-layered decoding. (b) The structure of the Get Rcv component.

4.3 Architecture of Optimized Variable Node Unit

Figure 7 shows the structure of an optimized VNU that can simultaneously process 4 check-to-variable messages. The addition operations are rearranged so that maximum advantage can be taken of carry-save adders. At the input of a VNU, the check-to-variable messages in signed-magnitude format are converted to 2's complement representation. In the data conversion for each signed-magnitude number, the sign bit is not immediately added to the bitwise NOT of the lower bits. Instead, all sign bits are added through the adder array. Then, each two-bit sign sum is sent to an adder in the second addition stage for final summation. Because each adder in the second and third stages has three inputs, it can be implemented using a carry-save adder and a regular binary adder. The right shift operations >>1 and >>2 are used for performing the scalar multiplication by 0.75. These arithmetic optimizations aim to significantly reduce the logic delay in a VNU.
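The shift-based scaling used in the VNU can be illustrated with integer arithmetic. The sketch below is a functional model only and makes several assumptions: messages are plain Python integers already in 2's complement form, saturation and the carry-save structure are ignored, and the scaling is applied to the sum of the check-to-variable messages before the intrinsic message is added. The function names are hypothetical.

```python
def scale_by_three_quarters(x: int) -> int:
    """Approximate 0.75 * x as (x >> 1) + (x >> 2).

    Each arithmetic right shift truncates toward minus infinity, so the
    result equals 0.75 * x exactly only when x is a multiple of 4.
    """
    return (x >> 1) + (x >> 2)


def vnu_outputs(r, intrinsic):
    """Vertical step for one variable node, eqs. (2) and (3), alpha = 0.75.

    r         : list of check-to-variable messages R_cv (integers)
    intrinsic : intrinsic message I_v (integer)
    Returns (list of L_cv, one per check node, and the a-posteriori L_v).
    """
    total = sum(r)
    l_v = scale_by_three_quarters(total) + intrinsic
    # Each outgoing message excludes the incoming message on its own edge.
    l_cv = [scale_by_three_quarters(total - r_c) + intrinsic for r_c in r]
    return l_cv, l_v
```

For inputs `r = [4, 8, -4, 4]` and `intrinsic = 1`, the third output is computed from 4 + 8 + 4 = 16 and equals 13, matching 0.75 × 16 + 1. For sums that are not multiples of 4 the two shifts lose fractional bits, for example `scale_by_three_quarters(7)` gives 4 rather than 5.25.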

Figure 7. The architecture of the optimized VNU.

To illustrate the data flow of the optimized VNU, let us take the computation of $L_{3v}$ as an example. Assume that $R_{4v}$ is a negative number and the other inputs are positive numbers. The value before the shift of the scaling operation for $L_{3v}$ is $R_{1v} + R_{2v} + \mathrm{bit\_inverse}(R_{4v}) + ('0' + '0' + '1')$. The $'0' + '0' + '1'$ computation is performed by a 1-bit full adder in the adder array block associated with the $L_{3v}$ message. After the final shift and addition stage, the output $L_{3v}$ is $0.75 \times (R_{1v} + R_{2v} + R_{4v}) + I_v$.

5 HARDWARE REQUIREMENT AND DECODING THROUGHPUT

The pipelined column-layered decoder with two pipeline stages is modeled in Verilog RTL and synthesized using a Fujitsu 0.13 µm standard library. The required hardware resources and the synthesis results are summarized in Table II. Each intrinsic message is quantized to 4 bits. It takes 32 clock cycles to compute the initial sorted vectors and 32 × 10 = 320 clock cycles for 10 decoding iterations. The number of clock cycles for pipeline latency is 2. Let $f_{clk}$ denote the clock frequency of the decoder; the information (source data) decoding throughput is $f_{clk} \times (4096 - 512) / (32 + 320 + 2)$. Thus, the decoder achieves a decoding throughput of 3.928 Gb/s at a clock speed of 388 MHz.

TABLE II. THE HARDWARE RESOURCE FOR (4096, 3584) (4, 32) QC-LDPC DECODERS

| Resource | Value |
|---|---|
| VNU | 128 |
| CNU | 512 |
| Get_Rcv | 512 |
| Intrinsic message (bits) | 4 × 4096 = 16384 |
| Sorted vector (bits) | 25 × 512 = 12800 |
| Signs of variable-to-check messages (bits) | 4 × 4096 = 16384 |
| Area per VNU (µm²) | 5597.8 |
| Area per CNU (µm²) | 2100.9 |
| Area per Get_Rcv (µm²) | 152.1 |
| Clock frequency (MHz) | 388 |
| Synthesis area (mm²) | 6.755 |
| Information decoding throughput (Gb/s) | 3.928 |

Table III compares the proposed column-layered decoder with state-of-the-art row-layered decoders. To mitigate the discrepancy introduced by different implementation technologies, the area and clock speed of all these designs are scaled to 65 nm. For implementation technologies of 0.18 µm, 0.13 µm, and 90 nm, the area is scaled down by a factor of 8, 4, and 2, respectively. The corresponding clock frequency is scaled up by $1.25^3$, $1.25^2$, and 1.25, respectively. The maximum number of decoding iterations is set to 10 for all decoders with layered decoding algorithms. For area figures that are synthesis results, a scaling factor of 1/0.7 is applied to approximate the layout area.

TABLE III. DESIGN COMPARISONS WITH RECENT HIGH SPEED LDPC DECODERS

| | This work | [11] | [14] | [12] | [13] |
|---|---|---|---|---|---|
| Code | (4096, 3584) | (9600, 7200) | 802.11n | 802.16e, 2048 bits | 802.11n |
| Rate | 7/8 | 3/4 | 1/2 ~ 5/6 | 1/2 ~ 5/6 | 1/2 ~ 7/8 |
| Algorithm | Column-layered Min-Sum | Row-layered Min-Sum | Row-layered Min-Sum | Row-layered BP | Row-layered BCJR |
| LLR message quantization | 4 bits | 6 bits | 5 bits | - | 4 bits |
| Max. iteration | 10 | 10 | 5 | 10 | 10 |
| Max. decoding parallelism | 512 | 80 | 81 | 2 × 96 | - |
| Technology | 0.13 µm | 65 nm | 0.18 µm | 90 nm | 0.18 µm |
| Clock frequency (MHz) | 388 | 500 | 208 | 450 | 125 |
| Area (mm²) | 6.755 (synthesis) | 0.504 (synthesis) | 3.39 | 3.5 | 14.3 |
| Max. info. throughput (Gb/s) | 3.928 | 1.08 | 0.78 | 1.0 | 0.64 |
| Clock frequency (MHz, scaled to 65 nm) | 606 | 500 | 406 | 562 | 244 |
| Layout area (mm², scaled to 65 nm) | 2.41 | 0.72 | 0.42 | 1.75 | 1.79 |
| Normalized info. throughput (Gb/s, scaled to 65 nm) | 6.13 | 1.08 | 0.76 (10 iterations) | 1.25 | 1.25 |

Considering the normalized decoding throughput, the throughput/area ratio, and the decoding performance of the various designs, it can be concluded that the proposed simplified column-layered decoding algorithm and architecture have significant advantages in high-throughput LDPC decoder implementation.
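The throughput figure and the technology scaling applied in Table III can be checked numerically. The following sketch simply encodes the formulas stated in the text; the function names, the dictionary `SCALING`, and the identity entry for 65 nm are illustrative assumptions.

```python
def info_throughput_gbps(f_clk_mhz, n=4096, m=512, init_cycles=32,
                         cycles_per_iter=32, iterations=10, pipeline_latency=2):
    """Information throughput f_clk * (N - M) / total cycles, in Gb/s."""
    cycles = init_cycles + cycles_per_iter * iterations + pipeline_latency
    return f_clk_mhz * 1e6 * (n - m) / cycles / 1e9


# (area divisor, clock multiplier) for scaling each technology to 65 nm,
# as quoted in the text; the 65 nm entry is an assumed identity mapping.
SCALING = {
    "0.18um": (8, 1.25 ** 3),
    "0.13um": (4, 1.25 ** 2),
    "90nm": (2, 1.25),
    "65nm": (1, 1.0),
}


def scale_to_65nm(tech, f_clk_mhz, area_mm2, throughput_gbps, synthesis_area):
    """Return (scaled clock in MHz, layout area in mm^2, scaled throughput)."""
    area_div, clk_mul = SCALING[tech]
    layout = area_mm2 / area_div
    if synthesis_area:
        layout /= 0.7      # synthesis area to approximate layout area
    return f_clk_mhz * clk_mul, layout, throughput_gbps * clk_mul
```

For the proposed decoder, `info_throughput_gbps(388)` evaluates to about 3.928 Gb/s, and scaling from 0.13 µm gives roughly 606 MHz, 2.41 mm² and 6.14 Gb/s, which agrees with the "This work" column of Table III up to rounding in the last digit.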

6 CONCLUSION

In this paper, various techniques have been explored to reduce the computation complexity of column-layered decoding. As a result, the proposed method can drastically reduce the overall computation complexity of the original scheme while largely maintaining decoding performance and convergence speed. In addition, a relaxed pipelining scheme has been shown to break the data dependency between adjacent column layers and thus enhance the clock speed. Combining all the proposed techniques, a low-complexity, high-speed LDPC decoder architecture for generic QC-LDPC codes has been developed, and the implementation result for a specific example has demonstrated that the proposed column-layered decoder architecture is very competitive with state-of-the-art row-layered LDPC decoder designs.

REFERENCES

[1] R. G. Gallager, "Low-density parity-check codes," IRE Trans. Inform. Theory, vol. IT-8, pp. 21-28, Jan. 1962.
[2] J. Chen, A. Dholakia, E. Eleftheriou, M. P. C. Fossorier, and X. Y. Hu, "Reduced-Complexity Decoding of LDPC Codes," IEEE Transactions on Communications, vol. 53, issue 8, pp. 1288-1299, Aug. 2005.
[3] Z. Wang and Z. Cui, "A memory efficient partially parallel decoder architecture for QC-LDPC codes," in Proc. of the 39th Asilomar Conference on Signals, Systems & Computers, pp. 729-733, 2005.
[4] C. Lin, K. Lin, H. Chan, and C. Lee, "A 3.33 Gb/s (1200, 720) low-density parity check code decoder," in Proc. 31st Eur. Solid-State Circuits Conf., pp. 211-214, Sep. 2005.
[5] D. Oh and K. K. Parhi, "Nonuniformly quantized Min-Sum decoder architecture for low-density parity-check codes," in Proc. of the 18th ACM Great Lakes Symposium on VLSI, pp. 451-456, 2008.
[6] X. Shih, C. Zhan, C. Lin, and A. Wu, "An 8.29 mm2 52 mW Multi-Mode LDPC Decoder Design for Mobile WiMAX System in 0.13 µm CMOS Process," IEEE Journal of Solid-State Circuits, vol. 43, issue 3, pp. 672-683, March 2008.
[7] E. Sharon, S. Litsyn, and J. Goldberger, "An efficient message-passing schedule for LDPC decoding," in Proc. of the 23rd IEEE Convention of Electrical and Electronics Engineers in Israel, pp. 223-226, Sept. 2004.
[8] D. E. Hocevar, "A reduced complexity decoder architecture via layered decoding of LDPC codes," IEEE Workshop on Signal Processing Systems, pp. 107-112, 2004.
[9] P. Radosavljevic, A. Baynast, and J. R. Cavallaro, "Optimized Message Passing Schedules for LDPC Decoding," in Proc. of 39th Asilomar Conference on Signals, Systems and Computers, pp. 591-595, 2005.
[10] J. Zhang and M. P. C. Fossorier, "Shuffled iterative decoding," IEEE Transactions on Communications, vol. 53, issue 2, pp. 209-213, Feb. 2005.
[11] T. Brack, M. Alles, T. Lehnigk-Emden, F. Kienle, N. Wehn, N. E. L'Insalata, F. Rossi, M. Rovini, and L. Fanucci, "Low Complexity LDPC Code Decoders for Next Generation Standards," in Proc. of Design, Automation and Test in Europe (DATE '07), Apr. 2007.
[12] Y. Sun and J. R. Cavallaro, "A low-power 1-Gbps reconfigurable LDPC decoder design for multiple 4G wireless standards," 2008 IEEE International SOC Conference, pp. 367-370, 2008.
[13] M. M. Mansour and N. R. Shanbhag, "A 640-Mb/s 2048-bit programmable LDPC decoder chip," IEEE Journal of Solid-State Circuits, vol. 41, issue 3, pp. 684-698, 2006.
[14] C. Studer, N. Preyss, C. Roth, and A. Burg, "Configurable high-throughput decoder architecture for quasi-cyclic LDPC codes," 42nd Asilomar Conference on Signals, Systems and Computers, pp. 1137-1142, 2008.
[15] A. Darabiha, A. C. Carusone, and F. R. Kschischang, "Block-interlaced LDPC decoders with reduced interconnect complexity," IEEE Transactions on Circuits and Systems II, vol. 55, no. 1, pp. 74-78, Jan. 2008.
[16] T. Mohsenin, D. Truong, and B. Baas, "Multi-Split-Row Threshold decoding implementations for LDPC codes," 2009 IEEE International Symposium on Circuits and Systems, pp. 2449-2452.
[17] X.-Y. Hu, E. Eleftheriou, and D. M. Arnold, "Regular and irregular progressive edge-growth Tanner graphs," IEEE Trans. on Info. Theory, vol. 51, issue 1, pp. 386-398, Jan. 2005.
[18] Z. Li and B. V. K. V. Kumar, "A class of good quasi-cyclic low-density parity check codes based on progressive edge growth graph," Thirty-Eighth Asilomar Conference on Signals, Systems and Computers, vol. 2, pp. 1990-1994, 2004.