k (check node degree) and j (variable node degree)

A Parallel Turbo Decodig Message Passig Architecture for Array LDPC Codes Kira Guam, Pakaj Bhagawat, Weihuag Wag, Gwa Choi, Mark Yeary * Dept. of Electrical Egieerig, Texas A&M Uiversity, College Statio, TX-77840 * Dept. of Electrical ad Computer Egieerig, Uiversity of Oklahoma, Norma, OK-73109 Abstract The VLSI implemetatio complexity of a low desity parity check (LDPC) decoder is largely iflueced by itercoect ad the storage requiremets. Here, the proposed layout-aware layered decoder architecture utilizes the data reuse properties of mi-sum, layered decodig ad structured properties of array LDPC codes. This results i a sigificat reductio of logic ad itercoects requiremets of the decoder whe compared to the state-of-the-art LDPC decoders. The ASIC implemetatio of the proposed fully parallel architecture achieves throughput of 4.6 Gbps (for a maximum of 15 iteratios). The chip size is.3 mm x.3 mm with a gate cout of 787 K i 0.13 micro techology. 1. Itroductio Low-Desity Parity Check (LDPC) codes ad turbo codes are amog the best kow codes that operate ear Shao limit [1]. Whe compared to the decodig of turbo codes, LDPC decoders require simpler computatioal processig ad they are more suitable for parallelizatio, low implemetatio complexity, ad low latecy. While iitial LDPC decoder desigs suffered from complex itercoect issues, structured LDPC codes simplify the itercoect complexity. LDPC codes are cosidered for error correctio codig i ext geeratio digital video broadcastig (DVB-S), MIMO- WLAN 80.11,, 80.1, 80.0, Gigabit Etheret 80.3, magetic chaels(storage/recordig systems),ad log-haul optical commuicatio systems. N, K LDPC code is a liear block A biary code of codeword legth N ad iformatio block legth K that ca be described by a sparse M N parity check matrix where M deotes the umber of parity check equatios. LDPC codes ca be decoded by the Gallager s iterative two-phase message passig algorithm (TPMP) which ivolves check-ode update ad variable-ode update as two phase schedule. Various algorithms are available for check-ode updates ad widely used algorithms are sum of products (SP), misum (MS) ad Jacobia based BCJ (amed after its discoverers Bah Cocke, Jeliik ad aviv). Masour ad Shabhag [] itroduced the cocept of turbo decodig message passig (TDMP), which is sometimes also called as layered decodig, usig BCJ for their architecture-aware LDPC (AA-LDPC) codes. TDMP offers x throughput ad sigificat memory advatages whe compared to TPMP. This is later studied ad applied for differet LDPC codes usig sum of products algorithm ad its variatios i [3]. TDMP is able to reduce the umber of iteratios required by up to 50% without performace degradatio whe compared to the stadard message passig algorithm. A quatitative performace compariso for differet check updates was give by Che ad Fossorier et al [4]. Their research showed that the performace loss of offset based misum with 5-bit quatizatio is less tha 0.1dB i SN as compared with that of floatig poit SP ad BCJ. While fully-parallel LDPC decoder desigs [5] suffered from complex itercoect issues, various semi-parallel ad parallel implemetatios based o structured LDPC codes [, 6-9] alleviate the itercoect complexity. I this paper, we propose a ovel parallel micro-architecture structure for check-ode message processig uit (CNU) for the mi-sum decodig of LDPC codes. This ew micro-architecture structure employs the miimum umber of comparator uits by exploitig the cocept of survivors i the search, which results i reduced umber of comparisos, additio operatios ad additioal reductio of the memory requiremet. TDMP is applied for the offset MS for array LDPC codes. The resultig decoder architecture has sigificatly lower requiremets of logic ad itercoects whe compared to the published decoder implemetatios. The rest of the paper is orgaized as follows. Sectio itroduces the backgroud of array LDPC codes ad mi-sum the decodig algorithm. Sectio 3 presets the value-reuse property ad proposed micro-architecture structure of CNU. The data flow graph ad parallel architecture for TDMP usig offset MS is icluded i Sectio 4. Sectio 5 shows the ASIC implemetatio results ad performace compariso with related work ad Sectio 6 cocludes the paper.. Backgroud.1 Array LDPC Codes A brief review of the array codes is provided to facilitate the TDMP ad the decoder architecture. The array LDPC parity-check matrix is specified by three parameters: a prime umber p ad two itegers k (check ode degree) ad j (variable ode degree) such that j, k < p. It is give by

H I I I I I α α... α k 1 4 ( k 1) A = I α α... α I α α α j 1 ( j 1) ( j 1)( k 1) p p (1) where I is the idetity matrix, ad α is a permutatio matrix represetig a sigle left or p p I. Power of α i H deotes right cyclic shift of multiple cyclic shifts, with the umber of shifts give by the value of the expoet. I the followig discussio, we use the α as a permutatio matrix p p represetig a sigle left cyclic shift of I.. Mi-sum decodig of LDPC Assume biary phase shift keyig (BPSK) modulatio (a 1 is mapped to -1 ad a 0 is mapped to -1) over a additive white Gaussia oise (AWGN) chael. The received values are Gaussia with mea = ±1 x ad variaceσ. y The reliability messages used i belief propagatio (BP) based mi-sum algorithm ca be computed i two phases, viz. (1) check ode processig ad () variable ode processig. The two operatios are repeated iteratively util the decodig criterio is satisfied. This is also referred to as stadard message passig or two-phase message passig (TPMP). For the i th iteratio, Q is the message from variable ode to m check ode m, is the message from check ode m to bit ode, Μ() odes for variable ode, Ν(m) is the set of the eighborig check is the set of the eighborig variable odes for check ode m. The message passig for TPMP are calculated i the followig steps as i [4]: (1) check-ode processig: for each m ad Ν(m), ( i ) ( i ) = δ κ () ( i 1) κ = = mi Q. Ν ( m) \ m The sig of check-ode message is defied as δ ( i 1) = sg Q m Ν ( m) \ δ takes value of 1 where + or 1 (4) (3) () Variable-ode processig: for each ad m Ν(), ( 0) Qm = L + m (5) m Μ ( m) \ m ( 0) where the log-likelihood ratio of bit is L = y. (3) Decisio P = L + (6) m M A hard decisio is take where x ˆ = 0 if P ( x) 0, T ad x ˆ = 1 if P ( x ) < 0. If xˆ H = 0, the decodig process is fiished with x ˆ as the decoder output; otherwise, go to step (1). If the decodig process does t ed withi some maximum iteratio it, stop ad max output a error messeger. The reliability messages computed i misum algorithm are overestimated. The overestimatio is corrected i offset mi-sum algorithm by subtractig a positive costat β from the magitude of the i the followig way: = δ max κ β,0 (7) For (3, 6) rate 0.5 code, β is computed as 0.15 usig the desity evolutio techique [4]..3 Parallel TDMP for array LDPC I TDMP, the array LDPC with j block rows ca be viewed as cocateatio of j layers or costituet sub-codes similar to observatios made for AA-LDPC codes i []. After the check-ode processig is fiished for oe block row, the messages are immediately used to update the variable odes (), whose results are the provided for processig the ext block row of check odes(1). This differs from TPMP where all check odes are processed first ad the the variable ode messages will be computed. Each decodig iteratio i TDMP has j umber of subiteratios. I the begiig of the decodig process, variable messages are iitialized as chael values ad they are used to process the check odes of the first block row. After completio of that block row, variable messages are updated with the ew check ode messages. This cocludes the first sub-iteratio. I similar fashio, result of check-ode processig of the secod block row is immediately used i the same iteratio to update the variable ode messages for third block row. The completio of check-ode processig ad associated variable-ode processig of all block rows costitutes oe iteratio. The TDMP ca be described with equatio (8-11): (0) (0) = P = L (8) 0, i = 1,, it, [Iteratio loop] max

l = 1,,, j = 1,,... k ( ) ( ) i Ql, = P l ( l, ) l, = f ( Q l, ) ( l, ) i ( l, ) i = Q +, [Sub-iteratio loop] [Block colu loop] ( i 1) [ ] [ ] [ ] [ ], (9) (10) P (11) where the vectors ( i ) ad ( i ) represet all the ad Q l, Q l, messages i each block of H matrix, s( ) deotes the shift coefficiet for the block i l th block row ad th i 1 S ( ) block colu of the H matrix [ ] deotes that i 1 the vector is cyclically shifted up by the amout s( ), k is the check-ode degree of the block row. A egative sig o s( ) idicates that it is cyclic dow shift (equivalet cyclic left shift). f ( ) deotes the check-ode processig, which ca be doe usig BCJ, SP or MS. For the proposed work we use MS as defied i equatio (-7). As soo as the output vector correspodig to each block colu i H matrix l, for a block row l is available, this could be used to S ( ) produce updated sum [ ] P (10). This could be immediately used i (9) to process block row l + 1 except that the shift s( ) imposed o P has to be udoe ad a ew shift s( l + 1, ) has to be imposed. This could be simply doe by imposig a shift correspodig to the differece of s( l + 1, ) ad s( ). Due to the structure of array LDPC H S, ) matrix, the [ ] ( l P i each block colu eed to go through either a) a cyclic dow shift of 1or b)cyclic up shift of ( 1 if we do layered decodig. I j ) additio we ca do parallel processig of block-colu loop usig p parallel CNUs. The property a) is possible due to costat differece of shift across block colus i array code s H matrix ad layers are processed i the order from 1 to j i each iteratio while the property b) is due to the fact that we eed to process layer 1 after processig layer j. We utilize this property i the proposed layout aware architecture by implemetig the two required cyclic shifts with wirig (ad cocetric layout) ad by processig all block colus i a layer i parallel. 3. Value-reuse Properties of Check Node Processig of Mi-Sum Decodig Oe of the key cotributios of this paper is the observatio that offset mi-sum decodig algorithm share the value-reuse property, as explaied below, hece reduce the logic ad message passig requiremet of the decoder. For each check ode m, Ν m takes oly values ad they are the two least miimum of iput magitude values., δ takes value of either + 1or 1 Sice Ν( m) ad ( i ) takes oly values. Equatio (3) gives rise to oly 3 possible values for the whole set Ν m. I a VLSI implemetatio, this property greatly simplifies the logic ad reduces the memory. This result would greatly simplify the umber of comparisos required as well as the memory eeded to store CNU outputs. This also meas reductio i correctio computatios. Normally CNU (check-ode uit) processig is doe usig the siged magitude arithmetic for (3) ad VNU (variable-ode uit processig) equatio (4) is doe i s complemet arithmetic. This requires s complemet to siged coversio at the iputs of CNU ad siged to s complemet at the output of CNU. This would mea that we eed to apply s complemet to oly two values istead of k values at the output of CNU. The work i [8] proposed a parallel CNU for MS that requires k( 1 + 0.5(log k 1)) comparators, 4 k 5-bit addersthe reaso for this complexity is ot usig the value reuse property of MS. I steps 1 ad below, cosider the k = case of rate 0.5 (4,8) code so that 8 ad assume the word legth of siged magitude bit ode messages is 5 so that there are 4 bits allocated for magitude. Step 1: Locate the two miimum values of the array Step1.1: Fid the first miimum usig the biary tree. Figure 1a gives a biary tree of comparators. Each comparator take two 4 bit iput words ad produces the miimum ad a flag which is set to 1 if the upper iput is less tha the lower iput Step 1.: Select the survivors by usig the comparator output flags as the cotrol iputs to multiplexes. For example i the last stage of the comparator tree the value other tha the least miimum is the survivor. No further comparisos are ecessary alog the tree path to the survivor. We trace back the survivors usig the comparator outputs. For a biary tree for iput vector of legth 8, we eed to 1, 4 to 1 ad 8 to 1 multiplexers. At ay stage of biary tree we have oly oe survivor. So there would be comparisos i sequetial fashio log k survivors ad log k 1

Step 1.3: Perform the required comparisos amog survivors i sequetial fashio. to K1 ad K to produce M1 ad M. Now compute M1 ad/or M. Now based o computed sig iformatio by XO logic(4) ad idex of K1, is set to oe of the 3 possible values M1,-M1, +/- M. (see Figure ). (a) a) Figure 1: Fider for the two least miimum i CNUBiary tree to fid the least miimum. (b) Traceback multiplexers ad comparators o survivors to fid the secod miimum. Multiplexers for selectig survivors are ot show. I Figure 1, C0, C1 ad C are 1 bit outputs correspodig to A<B coditio. 0 i C0 otatio is used to deote first level of outputs from right ad so o C[0] does mea the 1 bit comparator output at the first output of comparators at 3 rd level outputs from right ad so o. A,A1 ad A0 are magitudes(usually 4 bits wide[]) of bit ode messages. 0 i A0 otatio is used to deote first level of iputs ad so o. A0[0] does mea the 4bit iput word at the first iput of first level of iputs ad so o. It does ot mea 0 th bit of A0. K1 =A0 [C0 C1[C0] C[C0C1[C0]]] is the least miimum. The 3 bit trace back C0 C1[C0] C[C0C1[C0]] gives the locatio of the least miimum. See Figure c. The followig iputs are obtaied from the itermediate odes of the search tree. We have to use -i 1, 4-i 1 ad 8-i 1multiplexers respectively to obtai the followig survivors. B= A[c0];, B1= A1[!c1 c0]]; B0= A0[!c c1 c0] Step : This steps produces the outputs accordig to (7) i the followig optimized fashio. Apply the offset Figure : Proposed Parallel CNU based o Value-reuse property of Mi-Sum 4. Fully Parallel Architecture Usig TDMP ad Mi-Sum A ew data flow graph, see Figure 3, is desiged based o the TDMP ad o the value reuse property of mi-sum algorithm described above. For ease of discussio, we will illustrate the architecture for a specific structured code: Array code of legth 08 described i sectio, j = 3, k = 6 ad p = 347. First, fuctioality of each block i the architecture is explaied. A check ode process uit (CNU) is the parallel CNU based o offset mi-sum described i previous sectio. The CNU array is composed of p computatio uits that compute the messages for each block row i fully parallel fashio. Sice messages of previous j 1 block rows, are eeded for TDMP, the compressed iformatio of them is stored i FS register baks. There is oe register bak of depth 1, which j is i this case, coected with each CNU. Each fial state cotais M1,-M1,+/-M ad idex for M1. The sig bits are stored i sig flip-flops. The total umber of sig flip-flops is k ad each block row has pk sig flip-flops. We eed 1 j of such sig flip-flop baks i total. p umber of select uits is used for old. A select uit, whose fuctioality ad structure is the same as the block deoted as select i CNU, geerates the messages for 6( = k) edges of a check ode from 3 values stored i fial state register i parallel fashio. This uit ca be treated as de-compressor of the check ode edge iformatio which is stored i compact form i FS registers.i the begiig of the decodig process (9-11), P matrix(of dimesios px k) is set to received chael values i the first clock cycle(i.e. the first subiteratio) while the output matrix of select uit is set to zero matrix. The multiplexer array at the iput of P buffer is used for this iitializatio. The P matrix is computed by addig the shifted Q matrix(labeled as l Q shift i Figure 3)to the output matrix l of the CNU

array(labeled as ew ). The compressed iformatio is stored i the register baks FS ad sig so that these ca be used to geerate old for the l th sub-iteratio i ext iteratio. to each row i H matrix. MPU I commuicates with its 5(=k-1) adjacet eighbors (MPUs whose umbers are mod(i+1,p)...mod(i+5,p)) to achieve cyclic dow shifts of 1,,..5 respectively for block colus, 3 6 i H matrix (1) as show i Figure 4. 5. ASIC IMPLEMENTATION ESULTS Figure 3: Dataflow graph of the proposed parallel architecture for layered decoder Figure 4: a)illustratio of coectios betwee Message Processig uits to achieve cyclic dow shift of (-1) o each block colu b) Cocetric layout to accommodate 347 message processig uits. Coectios for cyclic up shift of (=(j-1).) are ot show. Note that the P matrix that is geerated ca be used immediately to geerate the Q matrix as the iput to the CNU array as CNU array is ready to process the ext block row. Now each block colu i the P message matrix will udergo a cyclic shift by the amout of differece of the shifts of the block row that is processed ad the previous block row that was just processed i the previous sub-iteratio. Figure 4 gives the layout details for the proposed decoder. The k adder uits, k select uit is termed as the parallel variable-ode uit (VNU). A message processig uit (MPU) cosists of a parallel CNU, a parallel VNU ad associated registers belogig Figure 5: BE Performace of the decoder To achieve the same BE as that of TPMP schedule o SP, TDMP schedule o offset MS eed half the umber of iteratios. This essetially doubles the throughput for the give amout of parallelizatio. Parallel decoder for TDMP or layered decodig eeds to use the hardware to decode oe layer oly while the parallel decoder for TPMP will have the hardware to decode all the layers simultaeously. So TDMP decoder takes j clock cycles to fiish j sub-iteratios(for j layers) to costitute iteratio as explaied i Sectio. If we use 5-bit precisio i ad Q messages ad 6-bit precisio i P messages to achieve almost the same performace as that of the floatig poit SP with oly 0. db performace degradatio. The proposed parallel CNU for offset MS is of very low complexity ad requires k + k 4-bit comparators ad k + 4 log 5-bit adders. The work i [8] proposed a parallel CNU for MS that requires k(1 + 0.5(log k 1)) comparators, 4 k 5-bit adders-the reaso for this complexity is ot usig the value re-use property of MS. It is also observed that istead of storig all the messages, the compressed iformatio is stored alog with the sig bits of the messages. This result i a reductio of memory is aroud of 0%-7% for 5 bit messages based o the check ode degree k of the code. So, i the proposed architecture desiged to support the code with k = 6, the savigs of memory is 0%. I additio there are sigificat savigs due to the layered decodig ad parallel architecture. We implemeted the proposed fully parallel layered decoder architecture o 0.13 micro techology. We used the stadard cells vsclib013[10].

TABLE I: COMPAISON WITH THE EXISTING WOK [] [5] [9] Proposed Decoded Throughput 640 Mbps 1.0 Gbps 1.0 Gbps 4.6 Gbps Area 14.3 mm, 5.5 mm 8.74 mm 5.9 mm Memory 51,680 bits (excludig 34,816 bits 34,816 bits 7046 bits flip-flops ad FIFO s) (scattered flip-flops) (scattered flip-flops) (scattered flip-flops) Iterleaver/outer 3.8 mm -Network 6.5 mm -Network NA 0.89 mm -Network Frequecy 15 MHz 64 MHz 64 MHz 100 MHz LDPC Code AA-LDPC, Structured adom ad irregular Modified Array Code Array Code rate 0.5 ad (3,6) code rate 0.5 code rate 0.5 ad (3,6) code rate 0.5 ad (3,6) code Block Legth 048 104 104 08 Check Node update BCJ SP SP Offset MS Decodig Schedule TDMP TPMP TPMP TDMP CMOS Techology 0.18 µ 0.16 µ 0.16 µ 0.13 µ The area of the chip is.3 mm x.3 mm ad the post routig frequecy is 100 MHz. The estimated gate cout is 787 K -iput NAND gates i 0.13 micro techology. The ASIC implemetatio of the proposed parallel architecture achieves a decoded throughput of 4.6 Gbps for 15 iteratios ad user data throughput of.3 Gbps. The area distributio of the CNU, VNU, storage flip-flops to support layered decodig, sychroizatio flip-flops to support multiple iteratios ad wirig is respectively 3%, 9%, 1%, 10% ad 17%. Figure 6: Net legth distributio of top level message passig ets for code with k=6 ad p=347 The et legth distributio of the message passig wires P is give i Figure 6. Most of the wires are short due to cocetric layout. The et legth distributio of the chael LL message ets(which eed ot be shifted as the first block row of array H matrix have all zero shift coefficiets) ad decoded bit ets has uiform bis at i*0.1 mm, i = 1,,..., 10 assumig that the iputs ad outputs are also arraged i rig fashio aroud the chip. Note that chael LLs are iputs to decoder message processig uits for every ew frame-every 45 clock cycles assumig 15 iteratios ad 45(=15*3) subiteratios. However the additioal IO circuitry (the serial to parallel ad parallel to serial coversio aroud the chip) which is applicatio depedet is ot accouted i the chip area ad is estimated ot to exceed 15% of chip area. Table I gives the performace compariso with the state of the art work. This work offers a syergetic combiatio of array LDPC codes [6], reduced complexity decodig [4], layered decodig [] by providig the ovel micro-architecture structures ad architectures to offer sigificat advatages. 6. Coclusio We preset layout-aware parallel decoder architecture for turbo decodig message passig of array LDPC codes usig the mi-sum algorithm for check ode update. Our work offers several advatages whe compared to the other state of the art LDPC decoders i terms of sigificat reductio i logic ad itercoect. efereces [1] D.J.C. MacKay ad.m. Neal. Near Shao Limit Performace of Low Desity Parity Check codes Electroics Letters, volume 3, pp. 1645-1646, Aug 1996. [] M. Masour ad N. Shabhag, "A 640-Mb/s 048-bit programmable LDPC decoder chip," IEEE Joural of Solid-State Circuits, vol. 41, o.3, pp. 684-698, March 006. [3] Hocevar, D.E., "A reduced complexity decoder architecture via layered decodig of LDPC codes," IEEE SiPS 004 pp. 107-11, 13-15 Oct. 004 [4] J. Che ad M. Fossorier, `Ǹear Optimum Uiversal Belief Propagatio Based Decodig of Low-Desity Parity Check Codes ', IEEE Trasactios o Commuicatios, vol. COM-50, pp. 406-414, March 00. [5] Blaksby, A.J.; Howlad,C.J, A 690-mW 1-Gb/s 104-b, rate-1/ low-desity parity-check code decoder, IEEE Joural of Solid-State Circuits, vol. 37, o..3, Mar 00 Pages:404-41 [6] S. Olcer, "Decoder architecture for array-code-based LDPC codes," Global Telecommuicatios Coferece, 003. GLOBECOM '03. IEEE, vol.4, o.pp. 046-050 vol.4, 1-5 Dec. 003 [7] E. Kim, G. Choi, "Diagoal low-desity parity-check code for simplified routig i decoder," IEEE SiPS 005, vol., o.pp. 756-761, -4 Nov. 005 [8] M. Karkooti ad J. Cavallaro, Semi-parallel recofigurable architectures for real-time LDPC decodig, Proceedigs of Iteratioal Coferece o Iformatio Techology, Codig ad Computig, vol. 1, pp. 579-585, 004. [9] V. Nagaraja, et a "High-throughput VLSI implemetatios of iterative decoders ad related code costructio problems,", GLOBECOM '004. IEEE, vol.1, pp. 361-365, vol. 1, 9 Nov.-3 Dec. 004. [10] Ope source stadard cell library, http://www.vlsitechology.org