THE low-density parity-check (LDPC) code is getting

Implementng the NASA Deep Space LDPC Codes for Defense Applcatons Wley H. Zhao, Jeffrey P. Long 1 Abstract Selected codes from, and extended from, the NASA s deep space low-densty party-check (LDPC) codes are mplemented for hgh speed defense applcatons. Ths s part of an effort to buld Government reference waveform mplementatons to assst defense acquston programs and to promote waveform re-use. Detals of the decoder mplementaton, ncludng memory layout, parallelzaton archtecture, layered-decodng schedulng, and feld programmable gate array (FPGA) resource utlzaton are presented. Index Terms Forward Error Correcton, Low Densty Party Check Codes, FPGA I. INTRODUCTION THE low-densty party-check (LDPC) code s gettng utlzed for hgh speed hgh performance applcatons due to the ease of ts parallel mplementaton n hardware and close to capacty performance. Waveforms employng LDPC codes nclude DVB-S2 [1] and IEEE 802.16 [2] and lately an ncreasng number of defense applcatons have begun to ncorporate LDPC codes n ther waveform desgns, such as the Global tonng Systems (GPS) [3] among others [4], [5]. As waveforms n defense applcatons are desgned or upgraded, one can expect LDPC codes to fnd wder applcaton n order to take the advantage of the hgh speed mplementaton and capacty approachng performance. In ths paper, we present the mplementaton of four LDPC codes based on the NASA deep space LDPC codes [6]. The work s a part of an effort to buld Government reference waveform mplementatons to help reduce rsk and promote waveform re-use across related defense acquston programs. The codes are mplemented n Very hgh speed ntegrated crcut Hardware Descrpton Language (VHDL) wth the ntent of employng Feld Programmable Gate Arrays (FPGA) to offer a re-programmable rado soluton. The paper begns wth a bref descrpton of the LDPC codes mplemented along wth the decodng algorthm used (Sec. II) and contnues wth a descrpton of the archtectural approach utlzed (Sec III). Ths dscusson wll nclude detals on layered processng and proper schedulng as a technque to ncrease throughput by ppe-lnng multple layers smultaneously n the decodng engne. In addton, the memory utlzaton and layout wll be explaned n the context of the layered processng approach. The paper wll conclude wth FPGA resource utlzaton for the mplementaton as well as the throughput achevable and performance curves (Sec. IV). Wley H. Zhao s wth the MITRE Corporaton, Bedford, MA, e-mal: wzhao@mtre.org. Jeffrey P. Long s wth the MITRE Corporaton, Bedford, MA, e-mal: jplong@mtre.org. Although not presented here t should be mentoned that a hgh speed parallel encoder was also developed as a complement to the decoder dscussed n ths work. II. THE IMPLEMENTED CODES AND DECODING ALGORITHM DESCRIPTION Four LDPC codes based on NASA s deep space LDPC codes are mplemented n ths effort. Two of them are rate 1/2 wth the coded block sze of 2048 and 16384 bts and the other two are rate 7/8 wth the coded block sze of 4096 and 16384 bts. These code rates and block szes accommodate both power-lmted, bandwdth-lmted and short, long delay scenaros as the applcaton requres. Among these codes, rate 1/2 2048-bt code s drectly from [6]. The other three are extensons from [6]. NASA s deep space LDPC codes are Accumulate Repeat- 4 Jagged-Accumulate (AR4JA) codes [7]. These are structured LDPC codes desgned by lftng up a protograph party check matrx nto a larger party check matrx consstng of crculants. Party check matrces constructed wth crculants are crtcal n mplementng parallel decodng algorthms for hgh speed applcatons. The DVB-S2 codes and the IEEE 802.16 codes are of ths type. However, unlke DVB-S2 codes, whch consst of supermposed crculants, the NASA deep space LDPC codes consst of only smple crculants. Ths sgnfcantly smplfes the layered decoder mplementaton[8], [9]. The layered decodng algorthm has been the preferred decodng algorthm because of ts faster convergence compared to the floodng algorthm [10]. For structured LDPC codes consstng of smple crculants, each row of crculants n the party check matrx naturally becomes a layer. The actual rows wthn a layer (.e., check nodes, c-nodes) can be processed n parallel because they update dfferent varable nodes (columns n the party check matrx, v-nodes) wthout memory access conflcts. The decodng proceeds as follows: Step 0: ntalze the v-node Log-lkelhood-rato (LLR) to the channel LLR and c-node to v-node messages to zero at the begnnng of a code block LQ = LLR(x ) (1) Lr j = 0 (2) Step 1, 2, and 3 are appled to each layer at frst, then repeated for each decodng teraton Step 1: compute the v-node to c-node messages Lq j, for each c-node j n a layer Lq j = LQ Lr j (3)

2 Step 2: compute the c-node to v-node messages Lr j = sgn(lq j) f f ( Lq j ) (4) V j\ Step 3: update v-node LLRs V j\ LQ = Lq j + Lr j (5) where LQ s the v-node LLR; LLR(x ) s the receved channel LLR for bt ; Lr j s the message from c-node j to v-node ; Lq j s the message from v-node to c-node j; V j represents the set of v-nodes connected to c-node j; V j \ represents V j excludng v-node. The functon f s defned as f(x) = log ex + 1 ( x ) e x 1 = log tan (6) 2 and t satsfes a specal addton rule ( ) f f(β ) β (7) where the parwse addton s β 1 β2 = mn(β 1, β 2 ) + log (1 + e (β1+β2)) ( log 1 + e β1 β2 ). (8) In ths mplementaton, (4) s computed usng the λ-mn algorthm [11] wth λ = 3,.e., among the v-nodes n V j \, the smallest three are used to compute the c-node to v-node messages usng (7) and (8). Equaton (8) s mplemented by a small lookup table for the log functon. III. HARDWARE ARCHITECTURE Hardware mplementaton of a LDPC decoder can be very resource ntensve. A well thought out archtecture approach s the key to developng a decoder that s both hgh throughput and small n sze. In addton extra care must be taken to talor the mplementaton for an FPGA target. The hardware archtecture approach taken here attempts to maxmze throughput and mnmze hardware complexty by utlzng a parallel layered processng scheme. Ths takes advantage of the structured nature of these NASA LDPC codes as well as the use of crculants n buldng the party check matrx. A. Decoder memory layout and parallelzaton scheme For LDPC codes wth party check matrces made of crculants of sze M (.e., each crculant s a M M rotated dentty matrx), (3), (4), and (5) can be mplemented n parallel by a factor up to M. For example, M LQ s and M Lr s are accessed n parallel to update M Lq s smultaneously. Ths s closely related to the memory layout of LQ and Lr. In the followng, we follow a smlar approach used for DVB-S2 codes [12]. Fg.1 llustrates how LQ and Lr s lad out n FPGA block RAM (BRAM). LQ s (ntally populated wth nput LLRs) are parttoned nto multple rows, each wth M values. Correspondngly, Lr s are parttoned nto rows wth M values each. Each memory access reads or wrtes one row of LQ or Lr wth M values. The parallel decodng processng s equvalent to processng crculant by crculant. Each vald par of LQ row and Lr row corresponds to a crculant n the party check matrx. Snce rotatons n each crculant are dfferent, t s necessary to algn LQ and Lr to a common reference so that the 1 to M element of a Lr row corresponds to the 1 to M element of the correspondng LQ row. Ths s acheved by crcularly left rotatng a LQ row by the amount ndcated by the correspondng crculant rotaton. After the rotaton, the frst element n the rotated LQ corresponds to the frst element of the correspondng Lr row as ndcated n Fg.1. Fgure 1. An example decoder memory layout of Lr and LQ for a hypothetcal code wth crculant sze of M = 128. Each map from a row of Lr to a row of LQ corresponds to a crculant n the party check matrx. The relaton of the crculant rotaton and the left rotaton of a LQ row to algn wth the Lr row s ndcated. The mappng of Lr rows to LQ rows and the correspondng crculant rotatons are the code parameters saved n the decoder. For the four mplemented LDPC codes, the crculant szes are 128 (for 2048 rate 1/2), 1024 (for 16384 rate 1/2), 64 (for 4096 rate 7/8), and 256 (for 16384 rate 7/8). Full parallelzaton by M would requre too much FPGA resources for large M (e.g., 1024). Further more, the four codes need to share the same decoder archtecture so that they can be swtch from one another wthout a complete FPGA reload. In the followng, a scheme of parallelzaton by L, wth L beng a factor of M, s descrbed followng a smlar approach n [13]. Wth ths scheme, L = 64 s the maxmum parallelzaton supported by all four codes mplemented. To create a parallel by L decoder, each row of LQ and Lr n Fg.1 s converted nto z = M/L rows by takng every other z th element nto a new row wth wdth L. We call these smaller rows sub-rows. After the re-arrangement, the same memory access scheme now accesses sub-rows, L elements at a tme. The layered decodng algorthm now works on sublayers. The crcular left rotatons that algn sub-rows of LQ and Lr, and the sub-row processng order are related to the orgnal

3 crculant rotaton by the followng t 1 = R M /z (9) t 2 = R M % z (10) A s = sub + t 2 ) % z (k { (11) R s = (t 1 + 1) % L A s < t 2 t 1 A s t 2 (12) where R M [0, M 1] s the rotaton of the crculant; k sub [0, z 1] s the sub-row processng order ndex; A s [0, z 1] s the ndex of the sub-row correspondng to the processng order ndex k sub ; R s [0, L 1] s the rotaton of the LQ sub-row needed to algn sub-rows of LQ and Lr. The symbol takes the nteger porton of a dvson and % s the remander operator. Fg.2 shows an example of ths operaton wth a crculant of sze M = 9 and a parallelzaton by L = 3. elements are actually repeated n the hardware 64 tmes. The memory s constructed of FPGA block RAMs arranged n parallel to effectvely buld a sngle very wde word memory. The number of block RAM requred depends on what the FPGA vendors offer n terms of the maxmum word wdth per block RAM. Snce the decoder s parallel by 64 each word n RAM s 64 tmes the wdth of each element. Detals on the choce of metrc wdth s gven n Sec. IV. The majorty of the RAM utlzaton s contaned n the Lr and LQ RAM. The LQ RAM s ntally loaded wth the nput LLRs as n step 0 and then contnually updated each teraton. The multplexer shown facltes that. The shfter s a smple 64 element wde barrel shfter ppe-lned to mprove speed. Snce the v-node to c-node calculaton of step 1 s needed for the LQ updates calculated n step 3 a smple FIFO s utlzed to temporarly store the step 1 results for step 3. A full memory s not usually needed here but the FIFO s stll requred to support the very wde word wdth. LQ RAM Lr RAM Shfter Parallel Input S l c e Parallel Output Check Node Calculaton FIFO These Elements are Repeated n Parallel Fgure 3. Decodng Core Fgure 2. The example llustrates how to convert from the full parallelzaton by the crculant sze M = 9 to a parallelzaton by L = 3. Dfferent colored elements n a row of LQ are converted to z = M/L = 3 sub-rows. If the crculant assocated wth ths row of LQ and a certan row of Lr (not shown) has a rotaton value of 2, the left rotated LQ row for parallelzaton by M s shown at the bottom left. The rotaton value of 2 s dstrbuted among z = 3 sub-rows for parallelzaton by L = 3, the frst two sub-rows are left rotated by 1 and the thrd sub-row s not rotated. The thrd sub-row s the frst to be processed. The c-node to v-node calculaton or check node calculaton (Fg. 4) of step 2 utlzes a λ-mn algorthm wth λ = 3 as mentoned n Sec. II. Ths s mplemented usng a mnmum sort routne followed by a calculaton whch performs the operaton n (4). The operaton denoted as mn* performs β 1 β 2 as n (8). The complete detals of the mn* operaton are shown n Fg. 5 As mentoned the log functon s mplemented va a small look-up table. Ths table has only a few small values so t s hard-coded and syntheszes to combnatoral logc. B. Decodng Core and Check Node Calculaton The decoder core shown n Fg. 3 s responsble for mplementng the algorthm explaned n Sec. II. Snce the decodng s done n parallel, the dagram s color coded to show whch C. Layer Schedulng The decoder processes multple layers n each teraton. As a decodng teraton s essentally a process where you read memory, calculate some results and wrte back to memory, n theory t s possble to ppelne the operaton to process

4 reg Fll Input RAM Read Input RAM Lq Sgn Sgn Array Read Output RAM DECODE Wrte Output RAM mn2 Mn Sort mn1 mn0 Mn* Mn* Lr Fgure 6. Decoder Processng Approach Top Level Control Fgure 4. Check Node Calculaton Rdy Rdy Data Vald Input Ram LDPC 64 Wde 64 Wde Core Output Ram Data Vald LLR Data (Seral) Decoded Data(1 Bt) β 1 β 2 Mn Fgure 5. Mn* Calculaton layers back to back and ncrease throughput. In practce there s latency n the calculaton step and f a memory read requres data that s stll processng through the ppelne then there wll be an error. In our decoder mplementaton, layers are reordered to avod such memory conflcts so that multple layers can be ppe-lned to ncrease throughput. For codes that cannot avod such memory conflcts, wat tmes are nserted n the ppelne flow so that the problem layer s completed and wrtten nto memory before the next layer s started. D. Overall Archtecture As was shown n Fg. 3 the decodng core processes parallel sets of LQ and Lr n a sngle clock cycle. Ths creates a very effcent processng approach but does not lend tself to a very practcal external nterface. To facltate data transfer n and out of the decoder core, two sets of memory are utlzed. Ths memory serves several functons. It converts the seral stream of LLR data comng nto the decoder nto the wde parallel word and performs the reverse wth the output data bts. It also performs the packng and unpackng of the ncomng and outgong data nto the partcular memory arrangement as mentoned n Sec. III-A. Lastly t acts smlar to a double buffer arrangement to allow data to be contnually transferred n and out whle decodng. Fg. 6 shows how decodng and data transfer occurs. Snce the decodng process requres the most amount of tme, nput data can be slowly read n and slowly read out whle the decodng completes. We then take advantage of the hgh degree of parallelzaton and quckly transfer the contents n and out of the RAMs wthout sgnfcantly mpactng throughput. A complete component (Fg. 7) ncludes ths RAM at the nput and output wth addtonal control to synchronze the decodng and transfer functons. LUT LUT y Fgure 7. Decoder Top Level Dagram IV. FPGA RESOURCE UTILIZATION, THROUGHPUT AND PERFORMANCE Ths FPGA mplementaton was targeted at a Xlnx FPGA n the Vrtex 6 famly. Whle not the latest technology offered by Xlnx ths FPGA s consdered the stable choce for many of the FPGA processng cards currently offered on the market. As the results n Tab. I show the desgn does not exceed the sze constrants of modern FPGAs and t would be possble to utlze multple cores wthn a sngle FPGA n order to accommodate very hgh data rates. Table I FPGA RESOURCE UTILIZATION Xlnx Vrtex 6 LX 130 Slce Regsters 18627 Slce LUTS 25035 Block RAM (36Kb) 36 The desgn metrc wdths shown n Tab. II are perhaps the most crtcal n terms of determnng overall sze. They drectly mpact the number of memory elements requred as well as the logc and regster usage. Obvously the goal would be to make these metrcs as small as possble but you can quckly encounter a loss n performance and may need to trade off sze for performance. Table II IMPLEMENTATION DETAILS LLR Input Metrc Wdth 5 Internal LQ Metrc Wdth 8 Lr Metrc Wdth 6 The decoder throughput s gven n Tab. III. In the example gven, the decoder s asked to do 30 teratons. The desgn as targeted for a Xlnx V6 s able to acheve a 175 Mhz clock rate. The throughput of each code s then calculated as shown. Obvously wth fewer teratons the throughput could be ncreased so agan a balance needs to be found tradng throughput aganst performance. Of note s the row enttled Extra Processng Delay. Ths s the extra wat tme n clock

5 Table III DECODER THROUGHPUT Rate 1/2 2K Rate 1/2 16K Rate 7/8 4K Rate 7/8 16K Clock Speed Acheved 175 MHz 175 MHz 175 MHz 175 MHz M 128 1024 64 256 L 64 64 64 64 Rows/Iteraton 120 960 252 1008 Iteratons 30 30 30 30 Extra Processng Delay 75 3 149 104 Uncoded Throughput 31 Mbps 50 Mbps 52 Mbps 75 Mbps cycles mentoned n Sec. III-C. Ths drectly mpacts the throughput and needs to be mnmzed as much as possble utlzng careful schedulng of the layers. Fg.8 shows the smulated Bt Error Rate (BER) and the block, here referred as Frame, Error Rate (FER) for the four mplemented codes usng a fxed pont decoder model runnng at 30 teratons. Fgure 8. Smulated BER and FER curves for the four mplemented codes usng a fxed pont decoder model runnng at 30 teratons. V. CONCLUSIONS In ths work we have successfully mplemented an extenson to the NASA codes developed for Deep Space applcatons wth potental use n varous hgh-speed hgh-throughput defense applcatons. In addton we have shown how t s possble to acheve hgh throughput wth few logc resources and stll mantan decodng performance utlzng layered parallel processng. It s felt that the results presented here wll be crtcal to helpng LDPC codes fnd more applcaton n Defense applcatons wthout the mpact to Sze Weght and Power (SWaP) as s tradtonally thought. In addton the utlzaton of FPGAs ensures use n re-programmable rado applcatons. Future work wll focus on further enhancements and optmzatons to ncrease throughput and reduce sze whle mantanng the performance shown. REFERENCES [1] Dgtal vdeo broadcastng (DVB); Second generaton framng structure, channel codng and modulaton systems for broadcastng, nteractve servces, news gatherng and other broad-band satellte applcatons, ETSI Std. EN 302 307 V1.2.1, 2009-08. [2] IEEE Standard for Local and metropoltan area networks Part 16: Ar Interface for Fxed and Moble Broadband Wreless Access Systems, IEEE Std. 802.16e, 2005. [3] Navstar GPS Space Segment/User Segment L1C Interfaces, GLOBAL POSITIONING SYSTEMS WING (GPSW) SYSTEM ENGINEERING and INTEGRATION, Verson 5.4, 15 October 2008. [4] Hgh Data Rate - Rado Frequency Ground Modem, Phase 3, Modem Performance Specfcaton, The MITRE Corporaton, Draft, 08 June 2012. [5] Jont Internet Protocol Modem Performance Specfcaton, Defense Informaton Systems Agency and Defense Communcatons and Army Transmsson Systems, IS-GPS-800, 04 Sep 2008. [6] Low Densty Party Check Codes for Use n Near-Earth and Deep Space Applcatons, Orange Book, Issue 2, Consulatve Commttee for Space Data Systems (CCSDS) Expermental Specfcaton 131.1-O-2, 2007. [7] D. Dvsalar, C. Jones, S. Dolnar, and J. Thorpe, Protograph based ldpc codes wth mnmum dstance lnearly growng wth block sze, n Global Telecommuncatons Conference, 2005. GLOBECOM 05. IEEE, vol. 3, 2005, pp. 5 pp.. [8] M. Mansour and N. Shanbhag, Hgh-throughput ldpc decoders, Very Large Scale Integraton (VLSI) Systems, IEEE Transactons on, vol. 11, no. 6, pp. 976 996, 2003. [9] D. E. Hocevar, A reduced complexty decoder archtecture va layered decodng of ldpc codes, n IEEE Workshop on Sgnal Processng Systems, 2004, p. 107. [10] R. G. Gallager, Low-Densty Party-Check Codes. Cambrdge, MA: M.I.T. Press, 1963. [11] F. Gulloud, E. Boutllon, and J.-L. Danger, Lambda-Mn decodng algorthm of regular and rregular LDPC codes, n 3rd Internatonal Symposum on Turbo Codes and related topcs, 1-5 september, Brest, France, 2003. [12] M. Eroz, F.-W. Sun, and L.-N. Lee, An nnovatve low-densty partycheck code desgn wth near-shannon-lmt performance and smple mplementaton, Communcatons, IEEE Transactons on, vol. 54, no. 1, pp. 13 17, 2006. [13] M. Gomes, G. Falcao, V. Slva, V. Ferrera, A. Sengo, and M. Falcao, Flexble parallel archtecture for dvb-s2 ldpc decoders, n Global Telecommuncatons Conference, 2007. GLOBECOM 07. IEEE, 2007, pp. 3265 3269.