MULTI-RATE HIGH-THROUGHPUT LDPC DECODER: TRADEOFF ANALYSIS BETWEEN DECODING THROUGHPUT AND AREA

Size: px

Start display at page:

Download "MULTI-RATE HIGH-THROUGHPUT LDPC DECODER: TRADEOFF ANALYSIS BETWEEN DECODING THROUGHPUT AND AREA"

Leonard Fitzgerald
5 years ago
Views:

1 MULTIRATE HIGHTHROUGHPUT LDPC DECODER: TRADEOFF ANALYSIS BETWEEN DECODING THROUGHPUT AND AREA Predrag Radosavljevic, Alexandre de Baynast, Marjan Karkooti, and Joseph R. Cavallaro Department of Electrical and Computer Engineering, Rice University {rpredrag, debaynas, marjan, ABSTRACT In order to achieve high decoding throughput (hundreds of MBits/sec and above) for multiple code rates and moderate codeword lengths, several LDPC decoder solutions with different levels of processing parallelism are possible. Selection between these solutions is based on a threefold criterion: hardware complexity, decoding throughput, and errorcorrecting performance. In this work, we determine the multirate LDPC decoder architecture with the best tradeoff in terms of area cost, errorcorrecting performance, and decoding throughput. The prototype architecture of this decoder is implemented on an FPGA. I. INTRODUCTION The recent designs of LDPC decoders are mostly based on blockstructured Parity Check Matrices (PCMs) that allow structured architecture solutions with different levels of processing parallelism (data throughput). The tradeoff between throughput and area for partially parallel LDPC decoders that support blockstructured regular PCMs has been investigated in [7, 9]. However, there still remains a challenge to construct improved PCMs to preserve at the same time the architecture modularity and to approach the performance of irregular random codes. The scalable partially parallel decoder from [10] that supports three code rates is based on optimized structured regular and irregular PCMs, but achieves only moderate decoding throughput. On the other hand, fully parallel decoders such as in [5] can be very fast but the lack of flexibility and large area is a major disadvantage in wireless systems that require support for multiple code rates and codeword sizes. Authors in [6] propose a novel optimization of LDPC codes to achieve area reduction in a fully parallel decoder, however, the codes are not generally suitable for use in flexible decoder architectures. In this paper, we investigate the complex tradeoff between the levels of architecture parallelism, area cost, and performance and target the design of flexible decoders. This analysis also requires special optimizations of blockstructured PCMs. The specific target is to design a single decoder architecture that supports multiple code rates (1/, /3, 3/4 and 5/6) and moderate codeword lengths (between 648 and 59) with high data throughput ( 1GBits/sec) which is compatible with the IEEE 80.11n standard specifications [3]. The wide range of decoder solutions with different levels of processing parallelism are analyzed. The decoder solution with the best tradeoff This work was supported in part by Nokia Corporation and by NSF under grants CNS055169, CNS and EIA between throughput, area and performance is chosen for hardware implementation. A prototype architecture is implemented on a Xilinx FPGA. II. LOW DENSITY PARITYCHECK CODES An LDPC code is a linear block code specified by a very sparse PCM as shown in Fig. 1 with random placement of the nonzero entries. Each coded bit is represented by a PCM column (variable node), while each row of the PCM represents a parity check equation (check node). The loglikelihood ratios (LLRs) are used for representation of reliability in order to simplify the arithmetic operations: R mj message denotes a check node LLR sent from the check node m to the variable node j, L(q mj ) message represents a variable node LLR sent from the variable node j to the check node m, and L(q j ) (j =1,...,n) represent the a posteriori probability ratios ( ) for all coded bits. The are initialized with the apriori(channel) reliability value of the coded bit j ( j =1,...,n). The iterative layered belief propagation (LBP) algorithm is a variation of the standard belief propagation (SBP) algorithm [7]. The convergence rate is doubled because the scheduling of reliability is optimized [8]. As is shown in Fig. 1, the typical blockstructured PCM is composed of concatenated horizontal layers (component codes) and shifted identity submatrices. The belief propagation algorithm is repeated for each component code while the updated are passed between the subcodes. For each variable node j inside the current horizontal layer, L(q mj ) that correspond to all check nodes neighbors m are computed according to: L(q mj )=L(q j ) R mj. (1) For each check node m, R mj, for all variable nodes j that participate in a particular paritycheck equation m, are computed according to: R mj = j N(m)\{j} sign (L(q mj )) Ψ Ψ(L(q mj )), () j N(m)\{j} where N(m) is the set of all variable nodes [ from )] the paritycheck equation m, and Ψ(x) = log tanh. The a ( x posteriori reliability in the current horizontal layer are updated according to: L(q j )=L(q mj )+R mj. (3) /06/$0.00 c 006 IEEE

2 If all paritycheck equations are satisfied or the predetermined maximum number of iterations is reached, the decoding algorithm stops. Otherwise, the algorithm repeats from (1) for the next horizontal layer. III. LDPC DECODERS WITH DIFFERENT PROCESSING PARALLELISM The processing parallelism consists of two parts: the parallelism in decoding of concatenated codes (horizontal layers of the PCM), and the parallelism in reading/writing of reliability from/to memory modules. Several decoder solutions with different levels of processing parallelism are considered targeting hundreds of MBits/sec data throughput: 1. L1RW1: Decoder with full decoding parallelism per one horizontal layer (L1), reading/writing of from one nonzero submatrix in each clock cycle (RW1).. L1RW: Decoder with full decoding parallelism per one horizontal layer, reading/writing of from two consecutive nonzero submatrices in each clock cycle (RW). 3. L3RW1: Decoder with the pipelining of three consecutive horizontal layers (L3), reading/writing of from one nonzero submatrix in each clock cycle. 4. L3RW: Decoder with the pipelining of three consecutive horizontal layers, reading/writing of from two consecutive nonzero submatrices in each clock cycle. 5. L6RW: Decoder with double pipelining (the pipelining of six consecutive horizontal layers, L6), reading/writing of from two consecutive nonzero submatrices in each clock cycle. 6. FULL: Fully parallel decoder with the simultaneous execution of all layers and simultaneous reading/writing of all reliability. The blockstructured irregular codes have been proposed for the IEEE 80.11n standard [3] since they can approach errorcorrecting performance of excellent random codes. Inspired by these codes, we propose a systematic construction of irregular blockstructured PCMs which allows two times higher memory access throughput. At the same time the errorcorrecting performance are preserved. An example of these novel blockstructured PCMs is shown in Fig. 1. The main constraint in the proposed PCM design is that any two consecutive nonzero submatrices from the same horizontal layer belong to odd and even PCMs blockcolumns (see Fig.1). This property is essential for achieving more parallel memory access since the from two submatrices can be simultaneously read/written from/to two independent memory modules. The PCMs with 4 blockcolumns provide better performance than the PCMs with 18, 48, 7 or 96 blockcolumns. This is due to the nearoptimal profile and small number of short cycles while supporting sufficiently high parallelism level crucial for highthroughput decoder implementations. Therefore, all analyzed decoder architectures including the fully parallel one support our novel PCMs with 4 blockcolumns. Figure (part labelled as RW1) shows the organization of memory used in solutions 1 and 3. The single dualport Rows Columns Figure 1: An example of the novel blockstructured irregular paritycheck matrix with 4 blockcolumns and 8 horizontal layers. The codeword size is 196, rate=/3, and the size of each submatrix is Blockcolumn 1 Blockcolumn Blockcolumn 3 Blockcolumn 3 Blockcolumn 4 RW1 Blockcolumn 1 Blockcolumn 3 Blockcolumn 3 RW Blockcolumn Blockcolumn 4 Blockcolumn 4 Figure : Organization of memory into single dualport RAM module, and two independent dualport RAM modules with from odd and even blockcolumns of the parity RAM block is utilized where each memory location contains the from one blockcolumn. Therefore, one full submatrix of the PCM can be accessed in each clock cycle. The width of the valid memory content depends on the codeword length (size of the submatrix), and it is up to 108 for the largest supported codeword length of 59. Meanwhile, each check node memory location also contains check node from one full nonzero submatrix. The architectureoriented constraint of equally distributed odd and even nonzero blockcolumns in every horizontal layer allows for the read/write of two submatrices per clock cycle (used in solutions, 4 and 5). As is shown in Fig. (part labelled as RW), the memory is partitioned into two independent modules. Each location in one memory module contains that correspond to one (out of twelve) odd blockcolumns. Another memory module contains from the even block columns. Meanwhile, each check node memory location contains check node that correspond to two consecutive nonzero submatrices from the same horizontal layer. The entire content of the check node memory location is loaded/stored in a single clock cycle. It is important to observe that the dualport RAMs can be replaced with single port RAMs if there is no pipelining of the horizontal layers. However, the memory organization does not depend on the number of memory ports.

3 (l) th layer (l1) th layer Time l th layer Throughput/area [MBits/sec/mm ] Average Throughput [MBits/sec] Arithmetic area [K gates] Memory size [KBits] Total area [mm /100] L3RW1 L3RW L6RW FULL x10 x10 Figure 3: Belief propagation based on pipelining of three decoding stages for three consecutive horizontal layers of the parity L1RW1 L1RW x10 (l5) th layer (l4) th layer (l3) th layer (l) th layer (l1) th layer l th layer Levels of Processing Parallelism Figure 5: The average decoding throughput, hardware complexity and tradeoff ratio for the LDPC decoder solutions with different levels of processing parallelism. Time Figure 4: Belief propagation based on pipelining of three decoding stages for six consecutive horizontal layers of the parity By construction, all rows inside a single horizontal layer are independent and can be processed in parallel without any performance loss. Every horizontal layer is decoded through three stages: memory reading stage, processing stage, and memory writing stage corresponding to Eqs. (1), () and (3), respectively. The decoding parallelism is increased if three consecutive horizontal layers are pipelined (label L3), which is visualized in Fig. 3. The threestage pipelining introduces a small performance loss due to the overlapping between the from the same blockcolumns but from the different horizontal layers. The decoding parallelism is doubled if the pipelining of six consecutive layers is employed (label L6), as is shown in Fig. 4. In order to avoid the reading/writing collisions (simultaneous reading and writing of from the identical blockcolumns but from two different horizontal layers), it is necessary to postpone the start of every second layer by one clock cycle (see Fig. 4). IV. THROUGHPUTAREAPERFORMANCE TRADEOFF ANALYSIS The proposed decoder solutions with different processing parallelism levels presented in the previous section are compared. It is assumed that each solution (except the L6RW architecture) supports multiple code rates (from 1/ to 5/6) and codeword lengths (648, 196, 1944, and 59) with novel blockstructured PCMs. The L6RW solution only supports code rates up to 3/4 since the PCM with 4blockcolumns has only four horizontal layers for the code rate of 5/6 which is insufficient to fill the pipeline (so L6RW is at a disadvantage). Figure 5 shows the decoding throughput and area complexity for all analyzed decoders: decoding parallelism and memory access parallelism increase from left to right (from L1RW1 to FULL as in section III.). The tradeoff is defined as a ratio between decoding throughput and decoder core area and therefore it is represented in MBits/sec/mm. The decoding throughput is based on the average number of decoding iterations required to achieve a frameerror rate (FER) of 10 4 for the largest supported code rate and codeword length. We assume 10 6 simulated codeword transmissions, the maximum number of iterations is set to 15 and the clock frequency is 00 MHz. The hardware complexity is measured by the area in mm assuming a 0.18 µm 6metal CMOS process. The total area is computed as a summation of the arithmetic area and the memory area. The arithmetic part of each decoder is represented as a number of standard CMOS logic gates (also shown in Fig. 5), where it is assumed that the typical TSMC design density for 0.18 µm technology is 100K logic gates per mm [1]. The memory size is given as the number of bits required for the storage of and check (SRAMs) and supported PCMs (ROMs). The corresponding memory area is computed by assuming the oneported SRAM cell size of 4.65 µm which is a typical size for 0.18 µm technology [1]. The dualport SRAM blocks are required for implementation of the L3RW1, L3RW and L6RW decoders because the memory reading and writing stages are pipelined; the memory cell area is typically doubled for the same design rule []. The fully parallel solution does not have any inherent flexibility the arithmetic logic for all 16 ratesize combinations is necessary which significantly increases the arithmetic area. However, the same set of latches can be utilized for the storage of reliability for all ratesize combinations. If pipelining of only three layers is employed then the decoding throughput is increased by approximately three times while the arithmetic area is increased only marginally. On the other hand, the memory area is doubled since then both mirror and mirror check node memories are required. Overall, three

4 Frame Error Rate LBP 6 layers pipelined (Max.Iter =15) SBP, Max.Iter=15 LBP 3 layers pipelined (Max.Iter=15) SBP, Max.Iter=5 LBP 6 layers pipelined (Max.Iter =5) LBP, no pipelining (Max.Iter =15) Eb/No [db] Figure 6: FER for PCM with 4 blockcolumns (code rate of /3, code size of 196: nonpipelined LBP, pipelined LBP, doublepipelined LBP, SBP. stage pipelining significantly improves the throughput/area ratio (see Fig. 5, L1RW1 vs. L3RW1 architecture, and L1RW vs. L3RW architecture). If the memory access parallelism is doubled then the decoding throughput is directly increased by more than 50%, while the arithmetic area is doubled and memory size remains the same (see Fig. ). Overall, if only the memory access parallelism is increased, then the throughput/area ratio is only slightly improved (see Fig. 5, L1RW1 vs. L1RW architecture, and L3RW1 vs. L3RW architecture). The further increase of decoding parallelism (L6RW and FULL solutions) does not improve the tradeoff ratio since the throughput improvements are smaller than the necessary increase in area. A similar effect is expected if the memory access parallelism is further increased; if four submatrices per clock cycle are accessed (not shown here since it requires special redesign of the PCMs) then the arithmetic area would increase two times. However, the decoding throughput would improve by only about 5% since the processing latency will start to dominate compared to the memory access latency. It can be observed from Fig. 5 that the best throughput per area ratio is achieved for the threestage pipelining approach with memory access that allows reading/writing of two submatrices per clock cycle (the L3RW solution). At the same time, the performance loss due to the pipelining is acceptable (about 0.1dB for the FER around 10 4, see Fig. 6, 10 6 codeword transmissions are simulated). In order to avoid the performance loss, the double pipelining and the SBP approaches would require additional decoding iterations which further decrease the decoding throughput. V. THREESTAGE PIPELINED LDPC DECODER WITH DOUBLEPARALLEL MEMORY ACCESS An LDPC decoder with threestage pipelining and a memory organization that supports access of two submatrices during a single clock cycle (L3RW) is chosen for hardware implemen Address Generator CHECK MEMORY CHECK MEMORY Address Generator To 1 Mirror 1 Bank of 108 Decoding FUs CONTROLLER To Mirror MODULE 1 MODULE 1 ROM 1 Modules ROM Modules ROM 1 Mirror ROM Mirror Figure 7: The block diagram of decoder architecture for block structured irregular paritycheck matrices with 4 blockcolumns. tation because it achieves the best tradeoff between decoding throughput and area as shown in section IV.. The highlevel block diagram of the proposed decoder architecture is shown in Fig. 7. The architecture supports codeword sizes of: 648, 196, 1944, and 59, and code rates of: 1/, /3, 3/4, and 5/6. Four identical permuters are required for the blockshifting of after being loaded from the memory modules (original and mirror). The scalable permuter (permutation of the blocks of sizes: 7, 54, 81, and 108) is composed of 7bit input 3:1 multiplexers organized in multiple pipelined stages. The modified minsum approximation with correcting offset [4] is employed as a central part of the decoding function unit (DFU). A single DFU is responsible for processing one row of the PCM through three pipeline stages according to Eqs. (1), (), and (3) as is shown in Fig 8. The support for different code rates and codeword lengths implies the usage of multiple PCMs. Information about each supported PCM is stored in the corresponding ROM module where a single memory location contains the blockcolumn position of one nonzero submatrix and the associated shift value. The blockcolumn position represents the reading/writing address of the appropriate memory module. In order to avoid the permutation of during the writing stage, the relative shift values (difference between two consecutive shift values of the same blockcolumn) are stored instead of the original shift values. The support of multiple code rates is insured by the control unit. The number of nonzero submatrices per horizontal layer varies with the code rate which affects the latencies of reading/writing stages, as well as the processing latency within the DFUs. The control logic provides the synchronization between the pipelined stages. It also handles the exemptions which occur if the number of nonzero submatrices in the horizontal layer is odd. In that case, only one blockcolumn per clock cycle is read/written from/to odd or even memory module. The full content of the corresponding check node memory location (width of two submatrices) is loaded even though the second half of it is not valid. The control unit then disables the appropriate part of the arithmetic logic inside the DFUs. Figure 9 shows the FER performance of the implemented

5 1 Check Memory Module s SM s SM Min Sum Unit XOR Logic Min* Unit Buffer Min* Min* To SM>'s Logic Min Min Offset Offset Buffer Modif. Min Min CMP /SEL CMP /SEL Check Memory () SM s Buffer Min Min* Min Buffer Sum Sum Unit SM s To To Module Module Reading Processing Writing Figure 8: The block diagram of the decoding function unit (DFU) with three pipeline stages and blockserial processing. Interface to and check node memories is also included. Frame Error Rate LBP pipelined, 6 bit fixed point LBP pipelined, 7 bit fixed point LBP pipelined, 8 bit fixed point LBP pipelined, floating point Table 1: Xilinx Virtex4XC4VFX60 FPGA utilization. Resource Used Utilization rate Slices 19,595 77% LUTs 36,45 7% Block RAMs % word lengths. We believe that the pipelining of multiple horizontal layers combined with a sufficiently parallel memory organization is a general tradeoff solution that can be applied to other blockstructured LDPC codes Eb/No [db] Figure 9: The FER for the implemented decoder (code rate of /3, code length of 196, maximum number of iterations is 15); pipelined LBP (floating vs. fixedpoint). flexible decoder for the case of a /3rate code and a codeword length of 196. The arithmetic precision of seven bits is chosen for representation of reliability (two s complement with one bit for the fractional part) as well as for all arithmetic operations. A prototype architecture has been implemented using Xilinx System Generator and targeted to a Xilinx Virtex4XC4VFX60 FPGA. Table 1 shows the utilization statistics. Based on the XST synthesis tool report and full place and route statistics, a maximum clock frequency of 130 MHz can be achieved which determines the maximum average throughput of approximately 1.1 GBits/sec. VI. CONCLUSION In this paper, the multirate decoder architecture that represents the best tradeoff in terms of throughput, area, and performance is found and implemented on an FPGA. An identical tradeoff analysis can be applied for a wide range of code rates and code REFERENCES [1] cell.html. [] [3] IEEE Wireless LANsWWiSE Proposal: High Throughput extension to the Standard. IEEE n. [4] J. Chen, A. Dholakai, E. Eleftheriou, M.P.C. Fossorier, and X.Y. Hu. Reducedcomplexity decoding of LDPC codes. IEEE Transactions on Communications, 53(8): , Aug [5] A. Darabiha, A.C. Carusone, and F.R. Kschischang. MultiGbit/sec low density parity check decoders with reduced interconnect complexity. In IEEE International Symposium on Circuits and Systems (ISCAS), pages , May 005. [6] E. Kim and G.S. Choi. Diagonal lowdensity paritycheck code for simplified routing in decoder. In IEEE Workshop on Signal Processing Systems Design and Implementation, pages , Nov [7] M.M. Mansour and N.R. Shanbhag. Highthroughput LDPC decoders. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 11(6): , Dec [8] P. Radosavljevic, A. de Baynast, and J.R. Cavallaro. Optimized message passing schedules for LDPC decoding. In 39th Asilomar Conference on Signals, Systems and Computers, 005, pages , Nov [9] S.H. Kang and I.C. Park. Loosely coupled memorybased decoding architecture for low density parity check codes. IEEE Transactions on Circuits and Systems, 53(5): , May 006. [10] L. Yang, H. Lui, and C.J.R. Shi. Code construction and FPGA implementation of a lowerrorfloor multirate lowdensity paritycheck code decoder. IEEE Transactions on Circuits and Systems, 53(4):89 904, April 006.

MULTI-RATE HIGH-THROUGHPUT LDPC DECODER: TRADEOFF ANALYSIS BETWEEN DECODING THROUGHPUT AND AREA

MULTI-RATE HIGH-THROUGHPUT LDPC DECODER: TRADEOFF ANALYSIS BETWEEN DECODING THROUGHPUT AND AREA Predrag Radosavljevic, Alexandre de Baynast, Marjan Karkooti, and Joseph R. Cavallaro Department of Electrical