An FPGA Implementation of (3,6)-Regular Low-Density Parity-Check Code Decoder


Tong Zhang
ECSE Dept., RPI
Troy, NY
tzhang@ecse.rpi.edu

Keshab K. Parhi
ECE Dept., Univ. of Minnesota
Minneapolis, MN
parhi@ece.umn.edu

Abstract

Because of their excellent error-correcting performance, Low-Density Parity-Check (LDPC) codes have recently attracted a lot of attention. In this work, we are interested in practical LDPC code decoder hardware implementations. The direct fully parallel decoder implementation usually incurs too high hardware complexity for many real applications, thus partly parallel decoder design approaches that can achieve appropriate trade-offs between hardware complexity and decoding throughput are highly desirable. Applying a joint code and decoder design methodology, we develop a high-speed (3,k)-regular LDPC code partly parallel decoder architecture, based on which we implement a 9216-bit, rate-1/2 (3,6)-regular LDPC code decoder on a Xilinx FPGA device. This partly parallel decoder supports a maximum symbol throughput of 54 Mbps and achieves a BER of $10^{-6}$ at 2 dB over the AWGN channel while performing a maximum of 18 decoding iterations.

1 Introduction

In the past few years, the recently rediscovered Low-Density Parity-Check (LDPC) codes [1][2][3] have received a lot of attention and have been widely considered as next-generation error-correcting codes for telecommunication and magnetic storage. Defined as the null space of a very sparse $M \times N$ parity check matrix $H$, an LDPC code is typically represented by a bipartite graph (see Footnote 1), usually called a Tanner graph, in which one set of $N$ variable nodes corresponds to the set of codeword bits, another set of $M$ check nodes corresponds to the set of parity check constraints, and each edge corresponds to a non-zero entry in the parity check matrix $H$. An LDPC code is known as a $(j,k)$-regular LDPC code if each variable node has degree $j$ and each check node has degree $k$, or equivalently if each column and each row of its parity check matrix have $j$ and $k$ non-zero entries, respectively. The code rate of a $(j,k)$-regular LDPC code is $1 - j/k$ provided that the parity check matrix has full rank. The construction of LDPC codes is typically random. LDPC codes can be effectively decoded by the iterative belief-propagation (BP) algorithm [3] that, as illustrated in Fig. 1, directly matches the Tanner graph: decoding messages are iteratively computed on each variable node and check node and exchanged through the edges between the neighboring nodes.

Footnote 1: A bipartite graph is one in which the nodes can be partitioned into two sets, X and Y, so that the only edges of the graph are between the nodes in X and the nodes in Y.

Figure 1: Tanner graph representation of an LDPC code and the decoding messages flow (check nodes and variable nodes exchange check-to-variable and variable-to-check messages).

Recently, tremendous efforts have been devoted to analyzing and improving the error-correcting capability of LDPC codes, see []-[], etc. Besides their powerful error-correcting capability, another important reason why LDPC codes attract so much attention is that the iterative BP decoding algorithm is inherently fully parallel, thus a great potential decoding speed can be expected. The high-speed decoder hardware implementation is obviously one of the most crucial issues determining the extent of LDPC applications in the real world. The most natural solution for the decoder architecture design is to directly instantiate the BP decoding algorithm in hardware: each variable node and check node is physically assigned its own processor and all the processors are connected through an interconnection network reflecting the Tanner graph connectivity.
By completely exploiting the parallelism of the BP decoding algorithm, such a fully parallel decoder can achieve very high decoding speed, e.g., a 1024-bit, rate-1/2 LDPC code fully parallel decoder with a maximum symbol throughput of 1 Gbps has been physically implemented using ASIC technology [2]. The main disadvantage of such a fully parallel design is that, as the code length increases (typically the LDPC code length is very large, at least several thousands), the incurred hardware complexity becomes more and more prohibitive for many practical purposes, e.g., for 1K code length the ASIC decoder implementation [2] consumes 1.7M gates. Moreover, as pointed out in [2], the routing overhead for implementing the entire interconnection network becomes quite formidable due to the large code length and the randomness of the Tanner graph. Thus high-speed partly parallel decoder design approaches that achieve appropriate trade-offs between hardware complexity and decoding throughput are highly desirable.

For any given LDPC code, due to the randomness of its Tanner graph, it is nearly impossible to directly develop a high-speed partly parallel decoder architecture. To circumvent this difficulty, Boutillon et al. [3] proposed a decoder-first code design methodology: instead of trying to conceive the high-speed partly parallel decoder for any given random LDPC code, use an available high-speed partly parallel decoder to define a constrained random LDPC code. We may consider it as an application of the well-known "think in the reverse direction" methodology. Inspired by the decoder-first code design methodology, we proposed a joint code and decoder design methodology in [] for (3,k)-regular LDPC code partly parallel decoder design. By jointly conceiving the code construction and the partly parallel decoder architecture design, we presented a (3,k)-regular LDPC code partly parallel decoder structure in [], which not only defines very good (3,k)-regular LDPC codes but also could potentially achieve high-speed partly parallel decoding.

In this paper, applying the joint code and decoder design methodology, we develop an elaborate (3,k)-regular LDPC code high-speed partly parallel decoder architecture, based on which we implement a 9216-bit, rate-1/2 (3,6)-regular LDPC code decoder using a Xilinx Virtex FPGA (Field Programmable Gate Array) device. In this work, we significantly modify the original decoder structure [] to improve the decoding throughput and simplify the control logic design. To achieve good error-correcting capability, the LDPC code decoder architecture has to possess randomness to some extent, which makes FPGA implementations more challenging since an FPGA has fixed and regular hardware resources.
We propose a novel scheme to realize the random connectivity by concatenating two routing networks, where all the random hardwire routings are localized and the overall routing complexity is significantly reduced. Exploiting the good minimum distance property of LDPC codes, this decoder employs the parity check as an early decoding stopping criterion to achieve adaptive decoding for energy reduction. With a maximum of 18 decoding iterations, this FPGA partly parallel decoder supports a maximum symbol throughput of 54 Mbps and achieves a BER of $10^{-6}$ at 2 dB over the AWGN channel.

This paper begins with a brief description of the LDPC code decoding algorithm in Section 2. In Section 3, we briefly describe the joint code and decoder design methodology for (3,k)-regular LDPC code partly parallel decoder design. In Section 4, we present the detailed high-speed partly parallel decoder architecture design. Finally, an FPGA implementation of a (3,6)-regular LDPC code partly parallel decoder is discussed in Section 5.

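2 Decoding Algorithm

Since the direct implementation of the BP algorithm incurs too high hardware complexity due to the large number of multiplications, we introduce some logarithmic quantities to convert these complicated multiplications into additions, which leads to the Log-BP algorithm [2][5].

Before the description of the Log-BP decoding algorithm, we introduce some definitions as follows. Let $H$ denote the $M \times N$ sparse parity check matrix of the LDPC code and $H_{i,j}$ denote the entry of $H$ at position $(i,j)$. We define the set of bits $n$ that participate in parity check $m$ as $\mathcal{N}(m) = \{n : H_{m,n} = 1\}$, and the set of parity checks $m$ in which bit $n$ participates as $\mathcal{M}(n) = \{m : H_{m,n} = 1\}$. We denote the set $\mathcal{N}(m)$ with bit $n$ excluded by $\mathcal{N}(m) \setminus n$, and the set $\mathcal{M}(n)$ with parity check $m$ excluded by $\mathcal{M}(n) \setminus m$.

Algorithm 2.1 Iterative Log-BP Decoding Algorithm

Input: The prior probabilities $p_n^0 = P(x_n = 0)$ and $p_n^1 = P(x_n = 1) = 1 - p_n^0$, $n = 1, \dots, N$;

Output: Hard decision $\hat{x} = \{\hat{x}_1, \dots, \hat{x}_N\}$;

Procedure:

1. Initialization: For each $n$, compute the intrinsic (or channel) message $\gamma_n = \log \frac{p_n^0}{p_n^1}$, and for each $(m,n) \in \{(i,j) \mid H_{i,j} = 1\}$, compute

$$\alpha_{m,n} = \mathrm{sign}(\gamma_n) \log\left(\frac{1 + e^{-|\gamma_n|}}{1 - e^{-|\gamma_n|}}\right).$$

2. Iterative Decoding

Horizontal (or check node computation) step: For each $(m,n) \in \{(i,j) \mid H_{i,j} = 1\}$, compute

$$\beta_{m,n} = \log\left(\frac{1 + e^{-\alpha}}{1 - e^{-\alpha}}\right) \prod_{n' \in \mathcal{N}(m) \setminus n} \mathrm{sign}(\alpha_{m,n'}), \qquad (1)$$

where $\alpha = \sum_{n' \in \mathcal{N}(m) \setminus n} |\alpha_{m,n'}|$.

Vertical (or variable node computation) step: For each $(m,n) \in \{(i,j) \mid H_{i,j} = 1\}$, compute

$$\alpha_{m,n} = \mathrm{sign}(\gamma_{m,n}) \log\left(\frac{1 + e^{-|\gamma_{m,n}|}}{1 - e^{-|\gamma_{m,n}|}}\right), \qquad (2)$$

where $\gamma_{m,n} = \gamma_n + \sum_{m' \in \mathcal{M}(n) \setminus m} \beta_{m',n}$. For each $n$, update the pseudo-posterior log-likelihood ratio (LLR) $\lambda_n$ as:

$$\lambda_n = \gamma_n + \sum_{m \in \mathcal{M}(n)} \beta_{m,n}. \qquad (3)$$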
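Steps 1 and 2 above, together with the hard decision and parity check test that close each iteration, can be sketched in software as follows. This is a minimal floating-point illustration, not the quantized hardware datapath: the function names, the clipping bounds inside `phi`, and the default `max_iter` are assumptions made for the example.

```python
import math

def phi(x):
    """log((1 + e^-x) / (1 - e^-x)) for x > 0; the clipping bounds are an assumption."""
    x = min(max(x, 1e-9), 30.0)
    return math.log((1.0 + math.exp(-x)) / (1.0 - math.exp(-x)))

def sign(x):
    return 1.0 if x >= 0 else -1.0

def log_bp_decode(H, gamma, max_iter=10):
    """H: list of 0/1 rows; gamma: intrinsic LLRs log(p0/p1). Returns hard decisions."""
    M, N = len(H), len(H[0])
    rows = [[n for n in range(N) if H[m][n]] for m in range(M)]  # N(m)
    cols = [[m for m in range(M) if H[m][n]] for n in range(N)]  # M(n)
    # Initialization: alpha_{m,n} = sign(gamma_n) * phi(|gamma_n|)
    alpha = {(m, n): sign(gamma[n]) * phi(abs(gamma[n]))
             for m in range(M) for n in rows[m]}
    x_hat = [0 if g > 0 else 1 for g in gamma]
    for _ in range(max_iter):
        # Horizontal step, equation (1)
        beta = {}
        for m in range(M):
            for n in rows[m]:
                others = [alpha[(m, j)] for j in rows[m] if j != n]
                s = 1.0
                for a in others:
                    s *= sign(a)
                beta[(m, n)] = s * phi(sum(abs(a) for a in others))
        # Vertical step, equations (2) and (3)
        lam = []
        for n in range(N):
            total = gamma[n] + sum(beta[(m, n)] for m in cols[n])
            lam.append(total)
            for m in cols[n]:
                g = total - beta[(m, n)]  # gamma_{m,n}: exclude check m
                alpha[(m, n)] = sign(g) * phi(abs(g))
        # Decision step: hard decision, then stop early if all checks hold
        x_hat = [0 if l > 0 else 1 for l in lam]
        if all(sum(x_hat[n] for n in rows[m]) % 2 == 0 for m in range(M)):
            break
    return x_hat
```

A small parity check matrix such as that of the (7,4) Hamming code is enough to exercise the routine; the early exit mirrors the parity check stopping criterion described in the text.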

Decision step: (a) Perform hard decision on $\{\lambda_1, \dots, \lambda_N\}$ to obtain $\hat{x} = \{\hat{x}_1, \dots, \hat{x}_N\}$ such that $\hat{x}_n = 0$ if $\lambda_n > 0$ and $\hat{x}_n = 1$ if $\lambda_n \le 0$; (b) If $H \cdot \hat{x} = 0$, then the algorithm terminates, else go to the Horizontal step until the pre-set maximum number of iterations has occurred.

We call $\alpha_{m,n}$ and $\beta_{m,n}$ in the above algorithm extrinsic messages, where $\alpha_{m,n}$ is delivered from variable node to check node and $\beta_{m,n}$ is delivered from check node to variable node.

It is clear that each decoding iteration can be performed fully in parallel by physically mapping each check node to one individual Check Node processing Unit (CNU) and each variable node to one individual Variable Node processing Unit (VNU). Moreover, by delivering the hard decision $\hat{x}_n$ from each VNU to its neighboring CNUs, the parity check $H \cdot \hat{x} = 0$ can be easily performed by all the CNUs. Thanks to the good minimum distance property of LDPC codes, such an adaptive decoding scheme can effectively reduce the average energy consumption of the decoder without performance degradation.

In partly parallel decoding, the operations of a certain number of check nodes or variable nodes are time-multiplexed, or folded [], to a single CNU or VNU. For an LDPC code with $M$ check nodes and $N$ variable nodes, if its partly parallel decoder contains $M_p$ CNUs and $N_p$ VNUs, we denote $M/M_p$ as the CNU folding factor and $N/N_p$ as the VNU folding factor.

3 Joint Code and Decoder Design

In this section we briefly describe the joint (3,k)-regular LDPC code and decoder design methodology []. It is well known that the BP (or Log-BP) decoding algorithm works well if the underlying Tanner graph is 4-cycle free and does not contain too many short cycles. Thus the motivation of this joint design approach is to construct an LDPC code that not only fits a high-speed partly parallel decoder but also has an average cycle length as large as possible in its 4-cycle free Tanner graph. This joint design process is outlined as follows and the corresponding schematic flow diagram is shown in Fig. 2.

1. Explicitly construct two matrices, $H_1$ and $H_2$, in such a way that $H_{12} = [H_1^T, H_2^T]^T$ defines a (2,k)-regular LDPC code $C_{12}$ whose Tanner graph has a girth (see Footnote 2) of 12;

Footnote 2: Girth is the length of a shortest cycle in a graph.

2. Develop a partly parallel decoder that is configured by a set of constrained random parameters and defines a (3,k)-regular LDPC code ensemble, in which each code is a sub-code of $C_{12}$ and has the parity check matrix $H = [H_{12}^T, H_3^T]^T$;

3. Select a good (3,k)-regular LDPC code from the code ensemble based on the criteria of large Tanner graph average cycle length and computer simulations. Typically the parity check matrix of the selected code has only a few redundant checks, so we may assume the code rate is always $1 - 3/k$.

Figure 2: Joint design flow diagram.

Construction of $H_{12} = [H_1^T, H_2^T]^T$: The structure of $H_{12}$ is shown in Fig. 3, where both $H_1$ and $H_2$ are $L \cdot k$ by $L \cdot k^2$ submatrices. Each block matrix $I_{x,y}$ in $H_1$ is an $L \times L$ identity matrix and each block matrix $P_{x,y}$ in $H_2$ is obtained by a cyclic shift of an $L \times L$ identity matrix. Let $T$ denote the right cyclic shift operator, where $T^u(Q)$ represents right cyclic shifting of matrix $Q$ by $u$ columns; then $P_{x,y} = T^u(I)$, where $u = ((x-1) \cdot y) \bmod L$ and $I$ represents the $L \times L$ identity matrix.

Notice that in both $H_1$ and $H_2$ each row contains $k$ 1's and each column contains a single 1. Thus, the matrix $H_{12}$ defines a (2,k)-regular LDPC code $C_{12}$ with $L \cdot k^2$ variable nodes and $2L \cdot k$ check nodes. Let $G_{12}$ denote the Tanner graph of $C_{12}$; the following theorem concerns the girth of $G_{12}$.
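The block structure of $H_1$ and $H_2$ described above can be sketched for small parameters as follows. This is only an illustration: the function names are invented, the values $L = 5$, $k = 3$ in the usage note are arbitrary, and the placement of the blocks (column blocks ordered with $x$ outermost, $I_{x,y}$ in block row $x$ of $H_1$, $P_{x,y}$ in block row $y$ of $H_2$) is an assumption about the layout in Fig. 3 rather than something the text states.

```python
def cyclic_shift_identity(L, u):
    """L x L identity matrix with its columns cyclically shifted right by u."""
    return [[1 if c == (r + u) % L else 0 for c in range(L)] for r in range(L)]

def build_h12(L, k):
    """Stack H1 on top of H2 (assumed block layout, see lead-in)."""
    n_cols = L * k * k
    H1 = [[0] * n_cols for _ in range(L * k)]
    H2 = [[0] * n_cols for _ in range(L * k)]
    for x in range(1, k + 1):
        for y in range(1, k + 1):
            col0 = ((x - 1) * k + (y - 1)) * L  # assumed column-block order
            u = ((x - 1) * y) % L               # shift amount of P_{x,y}
            for r in range(L):
                H1[(x - 1) * L + r][col0 + r] = 1            # I_{x,y}
                H2[(y - 1) * L + r][col0 + (r + u) % L] = 1  # P_{x,y}
    return H1 + H2

def has_four_cycle(H):
    """True if any two rows share more than one column position."""
    supports = [{c for c, v in enumerate(row) if v} for row in H]
    for a in range(len(supports)):
        for b in range(a + 1, len(supports)):
            if len(supports[a] & supports[b]) > 1:
                return True
    return False
```

For instance, `build_h12(5, 3)` yields a 30 by 45 matrix in which every row has weight 3 and every column has weight 2, and `has_four_cycle` reports no 4-cycles. The sketch checks only the absence of 4-cycles; it does not verify the girth statement of the theorem.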

Figure 3: Structure of $H_{12} = [H_1^T, H_2^T]^T$.

Theorem 3.1 If $L$ can not be factored as $L = a \cdot b$, where $a, b \in \{0, \dots, k-1\}$, then the girth of $G_{12}$ is 12 and there is at least one 12-cycle passing each check node.

Partly Parallel Decoder: Based on the specific structure of $H_{12}$, a principal (3,k)-regular LDPC code partly parallel decoder structure was presented in []. This decoder is configured by a set of constrained random parameters and defines a (3,k)-regular LDPC code ensemble. Each code in this ensemble is essentially constructed by inserting $L \cdot k$ extra check nodes into the high-girth (2,k)-regular LDPC code $C_{12}$ under the constraint specified by the decoder. Therefore, it is reasonable to expect that the codes in this ensemble are unlikely to contain too many short cycles, so that we may easily select a good code from it. For real applications, we can select a good code from this code ensemble as follows: first find several codes in the ensemble with relatively high average cycle lengths, then select the one leading to the best result in computer simulations.

The principal partly parallel decoder structure presented in [] has the following properties:

- It contains $k^2$ memory banks, each of which consists of several RAMs to store all the decoding messages associated with $L$ variable nodes;
- Each memory bank is associated with one address generator that is configured by one element in a constrained random integer set;
- It contains a configurable random-like one-dimensional shuffle network with the routing complexity scaled by $k^2$;
- It contains $k^2$ VNUs and $k$ CNUs so that the VNU and CNU folding factors are $L \cdot k^2 / k^2 = L$ and $3L \cdot k / k = 3L$, respectively;

- Each iteration completes in $3L$ clock cycles, in which only CNUs work in the first $2L$ clock cycles and both CNUs and VNUs work in the last $L$ clock cycles.

Over all the possible address generator and shuffle network configurations, this decoder defines a (3,k)-regular LDPC code ensemble in which each code has the parity check matrix $H = [H_1^T, H_2^T, H_3^T]^T$, where the submatrix $H_3$ is jointly specified by these two sets of configuration parameters.

4 Partly Parallel Decoder Architecture

In this work, applying the joint code and decoder design methodology, we develop a high-speed (3,k)-regular LDPC code partly parallel decoder architecture, based on which a 9216-bit, rate-1/2 (3,6)-regular LDPC code partly parallel decoder has been implemented using a Xilinx Virtex FPGA device. Compared with the structure presented in [], this partly parallel decoder architecture has the following distinct characteristics:

- It employs a novel concatenated configurable random two-dimensional shuffle network implementation scheme to realize the random-like connectivity with low routing overhead, which is especially desirable for FPGA implementations;
- To improve the decoding throughput, both the VNU folding factor and the CNU folding factor are $L$, instead of $L$ and $3L$ in the structure presented in [];
- To simplify the control logic design and reduce the memory bandwidth requirement, this decoder completes each decoding iteration in $2L$ clock cycles, in which CNUs and VNUs work in the first and second $L$ clock cycles, alternately.

Following the joint design methodology, this decoder should define a (3,k)-regular LDPC code ensemble in which each code has $L \cdot k^2$ variable nodes and $3L \cdot k$ check nodes and, as illustrated in Fig. 4, the parity check matrix of each code has the form $H = [H_1^T, H_2^T, H_3^T]^T$, where $H_1$ and $H_2$ have the explicit structures shown in Fig. 3 and the random-like $H_3$ is specified by certain configuration parameters of the decoder.

To facilitate the description of the decoder architecture, we introduce some definitions as follows. We denote the submatrix consisting of the $L$ consecutive columns in $H$ that go through the block matrix $I_{x,y}$ as $H^{(x,y)}$, in which from left to right each column is labeled as $h^{(d)}_{x,y}$ with $d$ increasing from 1 to $L$, as shown in Fig. 4. We label the variable node corresponding to column $h^{(d)}_{x,y}$ as $v^{(d)}_{x,y}$, and the $L$ variable nodes $v^{(d)}_{x,y}$ for $d = 1, \dots, L$ constitute a variable node group VG$_{x,y}$. Finally, we arrange the $L \cdot k$ check nodes corresponding to all the $L \cdot k$ rows of submatrix $H_i$ into check node group CG$_i$.

Figure 4: The parity check matrix $H = [H_1^T, H_2^T, H_3^T]^T$, with the leftmost column $h^{(1)}_{k,2}$ and rightmost column $h^{(L)}_{k,2}$ of the submatrix $H^{(k,2)}$ marked.

Figure 5: The principal (3,k)-regular LDPC code partly parallel decoder structure (PE blocks PE$_{x,y}$ with VNU and RAMs, active during variable node processing; shuffle networks $\pi_1$ and $\pi_2$, regular and fixed, and $\pi_3$, random-like and configurable; CNU$_{i,j}$, active during check node processing).

Fig. 5 shows the principal structure of this partly parallel decoder. It mainly contains $k^2$ PE Blocks PE$_{x,y}$ for $1 \le x, y \le k$, three bi-directional shuffle networks $\pi_1$, $\pi_2$ and $\pi_3$, and $3 \cdot k$ CNUs. Each PE$_{x,y}$ contains one memory bank RAMs$_{x,y}$ that stores all the decoding messages, including the intrinsic and extrinsic messages and hard decisions, associated with all the $L$ variable nodes in the variable node group VG$_{x,y}$, and contains one VNU to perform the variable node computations for these $L$ variable nodes. Each bi-directional shuffle network $\pi_i$ realizes the extrinsic message exchange between all the $L \cdot k^2$ variable nodes and the $L \cdot k$ check nodes in CG$_i$. The $k$ CNU$_{i,j}$ for $j = 1, \dots, k$ perform the check node computations for all the $L \cdot k$ check nodes in CG$_i$.

This decoder completes each decoding iteration in $2L$ clock cycles, and during the first and second $L$ clock cycles it works in check node processing mode and variable node processing mode, respectively. In the check node processing mode, the decoder not only performs the computations of all the check nodes but also completes the extrinsic message exchange between neighboring nodes. In variable node processing mode, the decoder only performs the computations of all the variable nodes.

The intrinsic and extrinsic messages are all quantized to 5 bits and the iterative decoding datapaths of this partly parallel decoder are illustrated in Fig. 6, in which the datapaths in check node processing and variable node processing are represented by solid lines and dash-dot lines, respectively. As shown in Fig. 6, each PE Block PE$_{x,y}$ contains five RAM blocks: EXT_RAM_i for $i = 1, 2, 3$, INT_RAM and DEC_RAM. Each EXT_RAM_i has $L$ memory locations and the location with the address $d - 1$ ($1 \le d \le L$) contains the extrinsic messages exchanged between the variable node $v^{(d)}_{x,y}$ in VG$_{x,y}$ and its neighboring check node in CG$_i$. The INT_RAM and DEC_RAM store the intrinsic message and hard decision associated with node $v^{(d)}_{x,y}$ at the memory location with the address $d - 1$ ($1 \le d \le L$).
As we will see later, such a decoding message storage strategy greatly simplifies the control logic for generating the memory access addresses. For simplicity, in Fig. 6 we do not show the datapath from INT_RAM to the EXT_RAM_i's for extrinsic message initialization, which can be easily realized in $L$ clock cycles before the decoder enters the iterative decoding process.

4.1 Check node processing

During check node processing, the decoder performs the computations of all the check nodes and realizes the extrinsic message exchange between all the neighboring nodes. At the beginning of check node processing, in each PE$_{x,y}$ the memory location with address $d - 1$ in EXT_RAM_i contains 6-bit hybrid data that consists of the 1-bit hard decision and the 5-bit variable-to-check extrinsic message associated with the variable node $v^{(d)}_{x,y}$ in VG$_{x,y}$. In each clock cycle this decoder performs the read-shuffle-modify-unshuffle-write operations to convert one variable-to-check extrinsic message in each EXT_RAM_i to its check-to-variable counterpart.

Figure 6: Iterative decoding datapaths.

As illustrated in Fig. 6, we may outline the datapath loop in check node processing as follows:

1. Read: One 6-bit hybrid data $h^{(i)}_{x,y}$ is read from each EXT_RAM_i in each PE$_{x,y}$;
2. Shuffle: Each hybrid data $h^{(i)}_{x,y}$ goes through the shuffle network $\pi_i$ and arrives at CNU$_{i,j}$;
3. Modify: Each CNU$_{i,j}$ performs the parity check on the 6 input hard decision bits and generates the 6 output 5-bit check-to-variable extrinsic messages $\beta^{(i)}_{x,y}$ based on the 6 input 5-bit variable-to-check extrinsic messages;
4. Unshuffle: Send each check-to-variable extrinsic message $\beta^{(i)}_{x,y}$ back to the PE Block via the same path as its variable-to-check counterpart;
5. Write: Write each $\beta^{(i)}_{x,y}$ to the same memory location in EXT_RAM_i as its variable-to-check counterpart.

All the CNUs deliver the parity check results to a central control block that will, at the end of check node processing, determine whether all the parity check equations specified by the parity check matrix have been satisfied. If so, the decoding for the current code frame terminates.

To achieve higher decoding throughput, we implement the read-shuffle-modify-unshuffle-write loop operation by five-stage pipelining as shown in Fig. 7, where the CNU is one-stage pipelined. To make this pipelining scheme feasible, we realize each bi-directional I/O connection in the three shuffle networks by two distinct sets of wires with opposite directions, which means that the hybrid data from PE Blocks to CNUs and the check-to-variable extrinsic messages from CNUs to PE Blocks are carried on distinct sets of wires. Compared with sharing one set of wires in time-multiplexed fashion, this approach has higher wire routing overhead but obviates the logic gate overhead due to the realization of time-multiplexing and, more importantly, makes it feasible to directly pipeline the datapath loop for higher decoding throughput.

Figure 7: Five-stage pipelining of the check node processing datapath (Read, Shuffle, CNU 1st half, CNU 2nd half, Unshuffle, Write).

In this decoder, one address generator AG$^{(i)}_{x,y}$ is associated with one EXT_RAM_i in each PE$_{x,y}$. In check node processing, AG$^{(i)}_{x,y}$ generates the address for reading hybrid data and, due to the five-stage pipelining of the datapath loop, the address for writing back the check-to-variable message is obtained by delaying the read address by five clock cycles. It is clear that the connectivity among all the variable nodes and check nodes, or the entire parity check matrix, realized by this decoder is jointly specified by all the address generators and the three shuffle networks. Moreover, for $i = 1, 2, 3$, the connectivity among all the variable nodes and the check nodes in CG$_i$ is completely determined by AG$^{(i)}_{x,y}$ and $\pi_i$. Following the joint design methodology, we implement all the address generators and the three shuffle networks as follows.

4.1.1 Implementations of AG$^{(1)}_{x,y}$ and $\pi_1$

The bi-directional shuffle network $\pi_1$ and AG$^{(1)}_{x,y}$ realize the connectivity among all the variable nodes and all the check nodes in CG$_1$ as specified by the fixed submatrix $H_1$. Recall that node $v^{(d)}_{x,y}$ corresponds to the column $h^{(d)}_{x,y}$ as illustrated in Fig. 4 and the extrinsic messages associated with node $v^{(d)}_{x,y}$ are always stored at address $d - 1$.
Exploiting the explicit structure of $H_1$, we easily obtain the implementation schemes for AG$^{(1)}_{x,y}$ and $\pi_1$ as follows:

- Each AG$^{(1)}_{x,y}$ is realized as a $\lceil \log_2 L \rceil$-bit binary counter that is cleared to zero at the beginning of check node processing;
- The bi-directional shuffle network $\pi_1$ connects the $k$ PE$_{x,y}$ with the same $x$-index to the same CNU.

4.1.2 Implementations of AG$^{(2)}_{x,y}$ and $\pi_2$

The bi-directional shuffle network $\pi_2$ and AG$^{(2)}_{x,y}$ realize the connectivity among all the variable nodes and all the check nodes in CG$_2$ as specified by the fixed matrix $H_2$. Similarly, exploiting the extrinsic message storage strategy and the explicit structure of $H_2$, we implement AG$^{(2)}_{x,y}$ and $\pi_2$ as follows:

- Each AG$^{(2)}_{x,y}$ is realized as a $\lceil \log_2 L \rceil$-bit binary counter that only counts up to the value $L - 1$ and is loaded with the value of $((x-1) \cdot y) \bmod L$ at the beginning of check node processing;
- The bi-directional shuffle network $\pi_2$ connects the $k$ PE$_{x,y}$ with the same $y$-index to the same CNU.

Notice that the counter load value for each AG$^{(2)}_{x,y}$ directly comes from the construction of each block matrix $P_{x,y}$ in $H_2$ as described in Section 3.

4.1.3 Implementations of AG$^{(3)}_{x,y}$ and $\pi_3$

The bi-directional shuffle network $\pi_3$ and AG$^{(3)}_{x,y}$ jointly define the connectivity between all the variable nodes and all the check nodes in CG$_3$, which is represented by $H_3$ as illustrated in Fig. 4. In the above, we showed that by exploiting the specific structures of $H_1$ and $H_2$ and the extrinsic message storage strategy, we can directly obtain the implementations of each AG$^{(i)}_{x,y}$ and $\pi_i$ for $i = 1, 2$. However, the implementations of AG$^{(3)}_{x,y}$ and $\pi_3$ are not as easy because of the following requirements on $H_3$:

1. The Tanner graph corresponding to the parity check matrix $H = [H_1^T, H_2^T, H_3^T]^T$ should be 4-cycle free;
2. To make $H$ random to some extent, $H_3$ should be random-like.

As proposed in [], to simplify the design process, we separately conceive AG$^{(3)}_{x,y}$ and $\pi_3$ in such a way that the implementations of AG$^{(3)}_{x,y}$ and $\pi_3$ accomplish the first and second requirement above, respectively.

Implementations of AG$^{(3)}_{x,y}$: We implement each AG$^{(3)}_{x,y}$ as a $\lceil \log_2 L \rceil$-bit binary counter that counts up to the value $L - 1$ and is initialized with a constant value $t_{x,y}$ at the beginning of check node processing. Each $t_{x,y}$ is selected at random under the following two constraints:

1. Given $x$, $t_{x,y_1} \ne t_{x,y_2}$, $\forall y_1, y_2 \in \{1, \dots, k\}$;
2. Given $y$, $t_{x_1,y} - t_{x_2,y} \not\equiv ((x_1 - x_2) \cdot y) \bmod L$, $\forall x_1, x_2 \in \{1, \dots, k\}$.

It can be proved that the above two constraints on $t_{x,y}$ are sufficient to make the entire parity check matrix $H$ always correspond to a 4-cycle free Tanner graph no matter how we implement $\pi_3$.
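The counter-based address generators described above, together with the delayed write address used in check node processing, can be modeled in a few lines. This is a behavioral sketch only: the function name, the return format, and the use of `None` for cycles in which the pipeline has not yet filled are assumptions made for the example, and the start value stands for whichever load value (zero, the cyclic shift value, or a constant $t_{x,y}$) the generator is configured with.

```python
from collections import deque

def address_stream(L, start, n_cycles, write_delay=5):
    """Return one (read_addr, write_addr) pair per clock cycle.

    The read address is a counter that starts at `start`, counts up to
    L - 1 and wraps to 0. The write address is the read address delayed
    by `write_delay` cycles (None until the pipeline has filled).
    """
    delay_line = deque([None] * write_delay, maxlen=write_delay)
    addr = start % L
    pairs = []
    for _ in range(n_cycles):
        pairs.append((addr, delay_line[0]))  # oldest entry is the write address
        delay_line.append(addr)              # pushes out the oldest entry
        addr = 0 if addr == L - 1 else addr + 1
    return pairs
```

With `L = 8` and `start = 6`, the first eight read addresses are 6, 7, 0, 1, ..., 5, so every location is visited exactly once, and from the sixth cycle onward each write address equals the read address issued five cycles earlier.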

Implementation of $\pi_3$: Since each AG$^{(3)}_{x,y}$ is realized as a counter, the pattern of shuffle network $\pi_3$ can not be fixed; otherwise the shuffle pattern of $\pi_3$ will regularly repeat in $H_3$, which means that $H_3$ will always contain very regular connectivity patterns no matter how random-like the pattern of $\pi_3$ itself is. Thus we should make $\pi_3$ configurable to some extent. In this work, we propose the following concatenated configurable random shuffle network implementation scheme for $\pi_3$.

Figure 8: Forward path of $\pi_3$ (Stage I: intra-row shuffle, configured from ROM R; Stage II: intra-column shuffle, configured from ROM C).

Fig. 8 shows the forward path (from PE$_{x,y}$ to CNU$_{3,j}$) of the bi-directional shuffle network $\pi_3$. In each clock cycle, it realizes the data shuffle from $a_{x,y}$ to $c_{x,y}$ by two concatenated stages: intra-row shuffle and intra-column shuffle. Firstly, the $a_{x,y}$ data block, where each $a_{x,y}$ comes from PE$_{x,y}$, passes an intra-row shuffle network array in which each shuffle network $\Psi^{(r)}_x$ shuffles the $k$ input data $a_{x,y}$ to $b_{x,y}$ for $1 \le y \le k$. Each $\Psi^{(r)}_x$ is configured by the 1-bit control signal $s^{(r)}_x$, leading to the fixed random permutation $R_x$ if $s^{(r)}_x = 1$, or to the identity permutation (Id) otherwise. The reason why we use the Id pattern instead of another random shuffle pattern is to minimize the routing overhead, and our simulations suggest that there is no gain in error-correcting performance from using another random shuffle pattern instead of the Id pattern. The $k$-bit configuration word $s^{(r)}$ changes every clock cycle and all the $L$ $k$-bit control words are stored in ROM R. Next, the $b_{x,y}$ data block goes through an intra-column shuffle network array in which each $\Psi^{(c)}_y$ shuffles the $k$ data $b_{x,y}$ to $c_{x,y}$ for $1 \le x \le k$. Similarly, each $\Psi^{(c)}_y$ is configured by the 1-bit control signal $s^{(c)}_y$, leading to the fixed random permutation $C_y$ if $s^{(c)}_y = 1$, or to the identity permutation (Id) otherwise. The $k$-bit configuration word $s^{(c)}$ changes every clock cycle and all the $L$ $k$-bit control words are stored in ROM C. As the output of the forward path, the $k$ data $c_{x,y}$ with the same $x$-index are delivered to the same CNU$_{3,j}$.

To realize the bi-directional shuffle, we only need to implement each configurable shuffle network $\Psi^{(r)}_x$ and $\Psi^{(c)}_y$ as bi-directional so that $\pi_3$ can unshuffle the data backward from CNU$_{3,j}$ to PE$_{x,y}$ along the same route as the forward path on distinct sets of wires. Notice that, due to the pipelining on the datapath loop, the backward path control signals are obtained by delaying the forward path control signals by three clock cycles.

To make the connectivity realized by $\pi_3$ random-like and change each clock cycle, we only need to randomly generate the control words $s^{(r)}$ and $s^{(c)}$ for each clock cycle and the fixed shuffle patterns of each $R_x$ and $C_y$. Since most modern FPGA devices have multiple metal layers, the implementations of the two shuffle arrays can be overlapped from the bird's-eye view. Therefore, the above concatenated implementation scheme confines all the routing wires to a small area (in one row or one column), which significantly reduces the possibility of routing congestion and reduces the routing overhead.

4.2 Variable node processing

Compared with the above check node processing, the operations performed in variable node processing are quite simple since the decoder only needs to carry out all the variable node computations. Notice that at the beginning of variable node processing, the three 5-bit check-to-variable extrinsic messages associated with each variable node $v^{(d)}_{x,y}$ are stored at the address $d - 1$ of the three EXT_RAM_i in PE$_{x,y}$. The 5-bit intrinsic message associated with variable node $v^{(d)}_{x,y}$ is also stored at the address $d - 1$ of INT_RAM in PE$_{x,y}$. In each clock cycle, this decoder performs the read-modify-write operations to convert the three check-to-variable extrinsic messages associated with the same variable node to three hybrid data consisting of variable-to-check extrinsic messages and hard decisions. As shown in Fig. 6, we may outline the datapath loop in variable node processing as follows:

1. Read: In each PE$_{x,y}$, three 5-bit check-to-variable extrinsic messages $\beta^{(i)}_{x,y}$ and one 5-bit intrinsic message $\gamma_{x,y}$ associated with the same variable node are read from the three EXT_RAM_i and INT_RAM at the same address;
2. Modify: Based on the input check-to-variable extrinsic messages and intrinsic message, each VNU generates the 1-bit hard decision $\hat{x}_{x,y}$ and three 6-bit hybrid data $h^{(i)}_{x,y}$;
3. Write: Each $h^{(i)}_{x,y}$ is written back to the same memory location as its check-to-variable counterpart and $\hat{x}_{x,y}$ is written to DEC_RAM.

The forward path from memory to VNU and the backward path from VNU to memory are implemented by distinct sets of wires, and the entire read-modify-write datapath loop is pipelined by three-stage pipelining as illustrated in Fig. 9.

Figure 9: Three-stage pipelining of the variable node processing datapath (Read, VNU 1st half, VNU 2nd half, Write).

Since all the extrinsic and intrinsic messages associated with the same variable node are stored at the same address in different RAM blocks, we can use only one binary counter to generate all the read addresses. Due to the pipelining of the datapath, the write address is obtained by delaying the read address by three clock cycles.

4.3 CNU and VNU Architectures

Each CNU carries out the operations of one check node, including the parity check and the computation of check-to-variable extrinsic messages. Fig. 10 shows the CNU architecture for a check node with degree 6. Each input $x^{(i)}$ is a 6-bit hybrid data consisting of a 1-bit hard decision and a 5-bit variable-to-check extrinsic message. The parity check is performed by XORing all the six 1-bit hard decisions. Each 5-bit variable-to-check extrinsic message is represented in sign-magnitude format with a sign bit and four magnitude bits. The architecture for computing the check-to-variable extrinsic messages is directly obtained from (1) in Algorithm 2.1. The function $f(x) = \log\frac{1 + e^{-|x|}}{1 - e^{-|x|}}$ is realized by the LUT (Look-Up Table) that is implemented as a combinational logic block in the FPGA. Each output 5-bit check-to-variable extrinsic message $y^{(i)}$ is also represented in sign-magnitude format.

Each VNU generates the hard decision and all the variable-to-check extrinsic messages associated with one variable node. Fig. 11 shows the VNU architecture for a variable node with degree 3. With the input 5-bit intrinsic message $z$ and three 5-bit check-to-variable extrinsic messages $y^{(i)}$ associated with the same variable node, the VNU generates three 5-bit variable-to-check extrinsic messages and the 1-bit hard decision according to (2) and (3) in Algorithm 2.1, respectively.
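The 6-bit hybrid data word and the XOR-based parity check performed by each CNU can be illustrated as below. This is a sketch under stated assumptions: the text gives the field widths (1-bit hard decision, 1 sign bit, 4 magnitude bits) but not the bit ordering, so placing the hard decision in the most significant bit is a choice made for the example, and the function names are invented.

```python
def pack_hybrid(hard_bit, sign_bit, magnitude):
    """Pack a hard decision and a sign-magnitude message into 6 bits.

    Assumed layout: bit 5 = hard decision, bit 4 = sign, bits 3..0 = magnitude.
    """
    if hard_bit not in (0, 1) or sign_bit not in (0, 1):
        raise ValueError("hard_bit and sign_bit must be 0 or 1")
    if not 0 <= magnitude < 16:
        raise ValueError("magnitude must fit in 4 bits")
    return (hard_bit << 5) | (sign_bit << 4) | magnitude

def unpack_hybrid(word):
    """Inverse of pack_hybrid: return (hard_bit, sign_bit, magnitude)."""
    return (word >> 5) & 1, (word >> 4) & 1, word & 0xF

def parity_satisfied(words):
    """XOR the hard decision bits of all inputs; True if the check holds."""
    parity = 0
    for word in words:
        parity ^= (word >> 5) & 1
    return parity == 0
```

For a degree-6 check node, `parity_satisfied` would be called with the six hybrid words arriving at one CNU in a given clock cycle; the message fields are ignored by the parity test and are consumed only by the check-to-variable computation.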
To enable each CNU to receive the hard decisions needed for the parity check described above, the hard decision is combined with each 5-bit variable-to-check extrinsic message to form 6-bit hybrid data, as shown in Fig. 11. Since each input check-to-variable extrinsic message is represented in sign-magnitude format, we need to convert it to two's complement format before performing the additions. Before going through the LUT, each data is converted back to the sign-magnitude format.

Figure 10: Architecture for the check node processor unit (CNU) with degree 6.

Figure 11: Architecture for the variable node processor unit (VNU) with degree 3. S to T: sign-magnitude to two's complement; T to S: two's complement to sign-magnitude.

Data Input/Output

This partly parallel decoder works simultaneously on three consecutive code frames in two-stage pipelining mode: while one frame is being iteratively decoded, the next frame is loaded into the decoder and the hard decisions of the previous frame are read out from the decoder. Thus each INT RAM contains two RAM blocks to store the intrinsic messages of both the current and next frames. Similarly, each DEC RAM contains two RAM blocks to store the hard decisions of both the current and previous frames. The design scheme for intrinsic message input and hard decision output is heavily dependent on the floor planning of the PE Blocks. To minimize the routing overhead, we develop a square-shaped floor planning for the PE Blocks, as illustrated in Fig. 12, and the corresponding data input/output scheme is described in the following:

1. Intrinsic Data Input: The intrinsic messages of the next frame are loaded one symbol per clock cycle. As shown in Fig. 12, the memory location of each input intrinsic data is determined by the input load address, which has a width of $\log_2 L + \log_2 k^2$ bits: $\log_2 k^2$ bits specify which PE Block (or which INT RAM) is being accessed, and the other $\log_2 L$ bits locate the memory location in the selected INT RAM. The primary intrinsic data and load address inputs connect directly to the PE Blocks along one edge of the array, and from each PE Block the intrinsic data and load address are delivered to the adjacent PE Block in pipelined fashion.

2. Decoded Data Output: The decoded data (or hard decisions) of the previous frame are read out in pipelined fashion. As shown in Fig. 12, the primary $\log_2 L$-bit read address input connects directly to the PE Blocks along one edge of the array, and from each PE Block the read address is delivered to the adjacent block in pipelined fashion. Based on its input read address, each PE Block outputs a 1-bit hard decision per clock cycle. Therefore, as illustrated in Fig. 12, the width of the pipelined decoded data bus increases by 1 after going through one PE Block, and at the rightmost side we obtain $k$ $k$-bit decoded outputs that are combined together as the $k^2$-bit primary decoded data output.

Figure 12: Data input/output structure.

5 FPGA Implementation

Applying the above decoder architecture, we implemented a (3, 6)-regular LDPC code partly parallel decoder using the Xilinx Virtex-E XCV200E device with the package FG5. The corresponding LDPC code length is 9 9 and the code rate is 1/2. We obtain the constrained random parameter set for the implementation as follows: we first generate a large number of parameter sets, from which we find a few sets leading to relatively high Tanner graph average cycle length; we then select the one set leading to the best performance based on computer simulations.

The target XCV200E FPGA device contains 8 large on-chip block RAMs, each of which is a fully synchronous dual-port RAM. In this decoder implementation, we configure each dual-port block RAM as two independent single-port RAM blocks so that each EXT RAM can be realized by one single-port RAM block. Since each INT RAM contains two RAM blocks for storing the intrinsic messages of both the current and next code frames, we use two single-port RAM blocks to implement one INT RAM.
Due to its relatively small memory size requirement, the DEC RAM is realized by distributed RAM, which provides shallow RAM structures implemented in CLBs. Since this decoder contains $k^2$ PE Blocks, each incorporating one INT RAM and three EXT RAMs, we utilize $5k^2$ single-port RAM blocks in total (or $5k^2/2$ dual-port block RAMs). We manually configured the placement of each PE Block according to the floor planning scheme shown in Fig. 12. Notice that this placement scheme exactly matches the structure of the configurable shuffle network described earlier, so the routing overhead for implementing the shuffle network is also minimized in this FPGA implementation.
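The block RAM budget in the paragraph above is simple arithmetic, and the short Python sketch below spells it out. It assumes a square array of k x k PE Blocks, with two single-port blocks per INT RAM and one per EXT RAM as described; the value k = 6 used in the example is an assumption made for illustration, and the function name is hypothetical. The DEC RAMs are not counted because they use distributed RAM.

```python
def block_ram_budget(k, int_blocks_per_pe=2, ext_rams_per_pe=3):
    """Count RAM blocks for a k x k array of PE Blocks.

    Each PE Block holds one INT RAM (two single-port blocks, for the current
    and next frames) and three EXT RAMs (one single-port block each).
    Two single-port blocks share one physical dual-port block RAM.
    """
    pe_blocks = k * k
    single_port = pe_blocks * (int_blocks_per_pe + ext_rams_per_pe)
    dual_port = -(-single_port // 2)  # ceiling division
    return pe_blocks, single_port, dual_port


pe_blocks, single_port, dual_port = block_ram_budget(6)  # k = 6 is assumed
```

With k = 6 this gives 36 PE Blocks, 180 single-port blocks, and 90 dual-port block RAMs, which would agree with the Block RAMs count of 90 listed in the resource utilization table.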

From the architecture description in Section 4, we know that, during each clock cycle of the iterative decoding, this decoder needs to perform both a read and a write operation on each single-port RAM block EXT RAM. Therefore, for a given primary clock frequency, we must generate a clock signal at twice that frequency as the RAM control signal to achieve read-and-write operation in one clock cycle. This 2x clock signal is generated using the Delay-Locked Loop (DLL) in the XCV200E. To facilitate the entire implementation process, we extensively utilized the highly optimized Xilinx IP cores to instantiate many function blocks, namely all the RAM blocks, all the counters for generating addresses, and the ROMs used to store the control signals for the shuffle network. Moreover, all the adders in the CNUs and VNUs are implemented as ripple-carry adders, which are well suited to Xilinx FPGA implementations thanks to the on-chip dedicated fast arithmetic carry chain.

Table: FPGA resources utilization statistics.

Resource          Number   Utilization rate
Slices            ,792     %
Slice Registers   0,05     9%
4-input LUTs      5,933    3%
Bonded IOBs       8        8%
Block RAMs        90       8%
DLLs                       2%

This decoder was described in the VHDL hardware description language (HDL), and SYNOPSYS FPGA Express was used to synthesize the VHDL implementation. We used the Xilinx Development System tool suite to place and route the synthesized implementation for the target XCV200E device with the speed option -7. The table above shows the hardware resource utilization statistics. Notice that 7% of the total utilized slices, or 89 slices, were used for implementing all the CNUs and VNUs. Fig. 13 shows the placed and routed design, in which the placement of all the PE Blocks is constrained based on the on-chip RAM block locations. Based on the results reported by the Xilinx static timing analysis tool, the maximum decoder clock frequency can be 5 MHz.
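The requirement stated at the start of this section, that each single-port EXT RAM serves one read and one write per primary clock cycle, can be pictured with a small behavioral model. The Python sketch below is only an illustration under assumptions: each loop iteration stands for one primary clock cycle containing two RAM accesses (the role of the 2x clock), the read address comes from a counter, and the write address is the read address delayed by the pipeline depth of three cycles. The function name and the `modify` callback are hypothetical.

```python
from collections import deque


def sweep(ram, modify, depth=3):
    """One read-modify-write pass over a single-port RAM (modeled as a list).

    Returns a per-cycle trace of (read_addr, write_addr); None marks an idle
    access while the pipeline fills or drains.
    """
    in_flight = deque()  # values travelling through the pipeline stages
    trace = []
    n = len(ram)
    for cycle in range(n + depth):
        read_addr = cycle if cycle < n else None
        write_addr = cycle - depth if cycle >= depth else None
        if read_addr is not None:
            # First RAM access of the cycle: read and start processing.
            in_flight.append(modify(ram[read_addr]))
        if write_addr is not None:
            # Second RAM access of the cycle: write back the oldest result.
            ram[write_addr] = in_flight.popleft()
        trace.append((read_addr, write_addr))
    return trace


ram = [1, 2, 3, 4, 5]
trace = sweep(ram, lambda value: value * 2)
```

Because the write address always trails the read address by three locations, the two accesses in a cycle never touch the same word, so every location is updated in place exactly once per pass.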
If this decoder performs $S$ decoding iterations for each code frame, the total number of clock cycles for decoding one frame grows linearly with $S$, plus a fixed number of extra clock cycles due to the initialization process, and the maximum symbol decoding throughput is the code length divided by the resulting decoding time. Here we set $S$ to the maximum number of decoding iterations and obtain a maximum symbol decoding throughput of 5 Mbps. Fig. 14 shows the corresponding performance over the AWGN channel for this $S$, including the BER (Bit Error Rate), FER (Frame Error Rate), and the average iteration numbers.
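The throughput relation described above can be written as a small helper. This is a sketch under assumptions: the per-iteration and initialization cycle counts are left as parameters, the parameter names are hypothetical, and the numbers in the example are illustrative only and are not taken from this design.

```python
def symbol_throughput(code_length, clock_hz, iterations,
                      cycles_per_iteration, init_cycles):
    """Symbols decoded per second for a frame-by-frame iterative decoder.

    One frame takes iterations * cycles_per_iteration + init_cycles clock
    cycles; the throughput is the code length divided by that decoding time.
    """
    total_cycles = iterations * cycles_per_iteration + init_cycles
    return code_length * clock_hz / total_cycles


# Illustrative values only: 1000-symbol frames, 50 MHz clock, 10 iterations.
example = symbol_throughput(1000, 50e6, 10, 200, 100)
```

The helper makes the trade-off explicit: for a fixed clock frequency, raising the iteration count lowers the throughput almost in proportion, since the initialization cycles are a fixed overhead.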

Figure 13: The placed and routed decoder implementation.

Figure 14: Simulation results on BER (Bit Error Rate), FER (Frame Error Rate), and the average iteration numbers, plotted against Eb/N0 (dB).

6 Conclusion

Due to the unique characteristics of LDPC codes, we believe that jointly conceiving the code construction and the partly parallel decoder design should be a key to practical high-speed LDPC coding system implementations. In this work, applying a joint design methodology, we developed a high-speed partly parallel decoder architecture for (3, 6)-regular LDPC codes and implemented a rate-1/2 (3, 6)-regular LDPC code decoder on the Xilinx XCV200E FPGA device. The detailed decoder architecture and floor planning scheme have been presented, and a concatenated configurable random shuffle network implementation is proposed to minimize the routing overhead for the random-like shuffle network realization. With the maximum 8 decoding iterations, this decoder can achieve up to 5 Mbps symbol decoding throughput and the BER at 2 dB over the AWGN channel. Moreover, exploiting the good minimum distance property of LDPC codes, this decoder uses the parity check after each iteration as an early stopping criterion to effectively reduce the average energy consumption.
