Cost-efficient FPGA implementations of Min-Sum and Self-Corrected Min-Sum decoders

Oana Boncalo (1), Alexandru Amaricai (1), Valentin Savin (2)
(1) University Politehnica Timisoara, Romania
(2) CEA-LETI, MINATEC Campus, Grenoble, France

Research reported in this presentation has been supported by the Franco-Romanian (ANR-UEFISCDI) Joint Research Programme Blanc 2013 project DIAMOND, and by the Seventh Framework Programme of the European Union under Grant Agreement 309129 (i-RISC project).

Outline
- Min-Sum and Self-Corrected Min-Sum
  - Decoding algorithms
  - Motivation
- Proposed architecture for the MS decoder
  - Objectives
  - MS architecture
  - Results comparison
- Proposed architecture for the SCMS decoder
  - Memory requirements
  - Proposed improvements
  - Results comparison
- Conclusion

LDPC codes (Notation)
- Defined by sparse bipartite graphs
  - variable nodes: coded bits
  - check nodes: parity-check equations
- Decoded by message-passing (MP) algorithms
  - $\gamma_n$: input (a priori) LLR values
  - $\alpha_{m,n}$: variable-to-check node messages
  - $\beta_{m,n}$: check-to-variable node messages
  - $\tilde{\gamma}_n$: output (a posteriori) LLR values
- This work: Min-Sum and Self-Corrected Min-Sum
[Figure: example bipartite graph with numbered variable and check nodes, annotated with the input $\gamma_n$, the output $\tilde{\gamma}_n$, and the messages $\alpha_{m,n}$ and $\beta_{m,n}$]

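Min-Sum (MS) Decoder
Notation: $H(n)$ is the set of check nodes connected to variable node $n$; $H(m)$ is the set of variable nodes connected to check node $m$.
Initialization: for $n = 1, \dots, N$ and $m \in H(n)$
  $\gamma_n = \log \frac{\Pr(x_n = 0 \mid y_n)}{\Pr(x_n = 1 \mid y_n)}$
  $\alpha_{m,n}^{(0)} = \gamma_n$
Iterations: for $l = 1, \dots, l_{\max}$
  CNU: for $m = 1, \dots, M$ and $n \in H(m)$
    $\beta_{m,n}^{(l)} = \left( \prod_{n' \in H(m) \setminus n} \mathrm{sgn}\, \alpha_{m,n'}^{(l-1)} \right) \min_{n' \in H(m) \setminus n} \left| \alpha_{m,n'}^{(l-1)} \right|$
  AP-LLR: for $n = 1, \dots, N$
    $\tilde{\gamma}_n^{(l)} = \gamma_n + \sum_{m \in H(n)} \beta_{m,n}^{(l)}$
  VNU: for $n = 1, \dots, N$ and $m \in H(n)$
    $\alpha_{m,n}^{(l)} = \tilde{\gamma}_n^{(l)} - \beta_{m,n}^{(l)}$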
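The CNU, AP-LLR and VNU steps above can be sketched in floating point as follows. This is a minimal flooding-schedule illustration, not the hardware described in these slides; the function name `min_sum_decode`, the early stop on a zero syndrome, and the requirement that every check node has degree at least 2 are assumptions made for the example.

```python
import numpy as np

def min_sum_decode(H, gamma, max_iter=20):
    """Flooding Min-Sum decoding (illustrative sketch).

    H: (M, N) 0/1 integer array; gamma: length-N a priori LLRs.
    Assumes every check node has degree >= 2.
    """
    M, N = H.shape
    rows = [np.flatnonzero(H[m]) for m in range(M)]  # H(m) for each check node
    alpha = H * gamma               # alpha^(0)_{m,n} = gamma_n on the graph edges
    beta = np.zeros((M, N))
    for _ in range(max_iter):
        # CNU: sign product and minimum magnitude over the other edges
        for m, idx in enumerate(rows):
            a = alpha[m, idx]
            sgn = np.where(a < 0, -1.0, 1.0)
            mag = np.abs(a)
            for k, n in enumerate(idx):
                others = np.delete(np.arange(len(idx)), k)
                beta[m, n] = np.prod(sgn[others]) * np.min(mag[others])
        # AP-LLR: beta is zero off the edges, so a column sum runs over H(n)
        gamma_tilde = gamma + beta.sum(axis=0)
        # VNU: remove each check node's own contribution
        alpha = H * (gamma_tilde - beta)
        hard = (gamma_tilde < 0).astype(int)
        if not np.any(H @ hard % 2):   # all parity checks satisfied
            break
    return hard, gamma_tilde
```

The double loop mirrors the slide's equations directly; a practical decoder would instead keep only the two smallest magnitudes per check node.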

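Self-Corrected Min-Sum (SCMS) Decoder
Initialization: for $n = 1, \dots, N$ and $m \in H(n)$
  $\gamma_n = \log \frac{\Pr(x_n = 0 \mid y_n)}{\Pr(x_n = 1 \mid y_n)}$
  $\alpha_{m,n}^{(0)} = \gamma_n$
Iterations: for $l = 1, \dots, l_{\max}$
  CNU: for $m = 1, \dots, M$ and $n \in H(m)$
    $\beta_{m,n}^{(l)} = \left( \prod_{n' \in H(m) \setminus n} \mathrm{sgn}\, \alpha_{m,n'}^{(l-1)} \right) \min_{n' \in H(m) \setminus n} \left| \alpha_{m,n'}^{(l-1)} \right|$
  AP-LLR: for $n = 1, \dots, N$
    $\tilde{\gamma}_n^{(l)} = \gamma_n + \sum_{m \in H(n)} \beta_{m,n}^{(l)}$
  VNU: for $n = 1, \dots, N$ and $m \in H(n)$
    $\alpha_{m,n}^{(l)} = \tilde{\gamma}_n^{(l)} - \beta_{m,n}^{(l)}$
    Self-correction (erasure) rule: if $\mathrm{sgn}\, \alpha_{m,n}^{(l)} \neq \mathrm{sgn}\, \alpha_{m,n}^{(l-1)}$ and $\alpha_{m,n}^{(l-1)} \neq 0$, then $\alpha_{m,n}^{(l)} = 0$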
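The only difference from MS is the erasure test in the VNU. A minimal sketch of that step, assuming messages are held in dense (M, N) floating-point arrays that are zero off the graph edges; the function name `vnu_self_corrected` is hypothetical.

```python
import numpy as np

def vnu_self_corrected(gamma_tilde, beta, alpha_prev, H):
    """VNU with the self-correction (erasure) rule (illustrative sketch).

    gamma_tilde: length-N AP-LLRs; beta, alpha_prev, H: (M, N) arrays.
    """
    alpha = H * (gamma_tilde - beta)
    # Erase when the sign flipped and the previous message was not zero
    sign_changed = np.sign(alpha) != np.sign(alpha_prev)
    erase = sign_changed & (alpha_prev != 0)
    alpha[erase] = 0.0
    return alpha
```

A message that was zero in the previous iteration is never erased, so an erased edge can recover in the next iteration.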

Self-Corrected Min-Sum (SCMS) Decoder
- Relies on the erasure of unreliable variable-node messages
- Error-correction capability very close to Belief Propagation
- Compared with other MS-based variants such as Normalized Min-Sum (NMS) and Offset Min-Sum (OMS), SCMS has been shown to perform better in the error-floor region, especially for irregular LDPC codes
- SCMS has also been shown to be robust to:
  - imprecise arithmetic
  - faulty hardware
[Figure: performance plot; WiMAX, rate 1/2, AWGN]

Self-Corrected Min-Sum (SCMS) Decoder (continued)
These improvements come at a price:
- the signs of the messages and the erasure bits have to be stored
- increased memory requirements
- additional overhead in routing the messages to the appropriate processing units

Objectives
Main objective: cost-efficient implementation of the SCMS decoder
1. MS decoder
   a) Propose a cost-efficient implementation
2. SCMS decoder
   a) Evaluate the overhead due to the self-correction rule
   b) Propose solutions to reduce the overhead

Objectives: MS decoder architecture for QC-LDPC codes
- Target code: WiMAX LDPC code with rate 1/2 (N = 2304, K = 1152)
- Low cost
  - Efficient utilization of FPGA resources
  - Fit in a Virtex-5 VLX50T FPGA device (Digilent Genesys board)
- Throughput
  - Hundreds of Mbps (> 500 Mbps)
- Flexible
  - Easily adaptable to other codes and code rates

WiMAX LDPC code with rate 1/2
- Quasi-cyclic code with base matrix B of size 12 × 24
- Circulant size = 96, giving a parity-check matrix of size 1152 × 2304
- Layered structure: 12 layers, 96 rows per layer
Layered architecture
- Lower memory requirements: store only the AP-LLRs ($\tilde{\gamma}_n$) and the CN messages ($\beta_{m,n}$)
- Simpler interconnect scheme
- Faster convergence: better decoding performance when the number of iterations is small
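To make the size arithmetic concrete, the expansion of a base matrix into a quasi-cyclic parity-check matrix can be sketched as below. The convention that -1 marks an all-zero block and a value s >= 0 marks an identity matrix cyclically shifted by s columns is assumed here, and the function name `expand_qc` is hypothetical. The actual WiMAX shift values are not reproduced; only the dimensions (12 × 24 base, circulant size 96) come from the slide.

```python
import numpy as np

def expand_qc(base, z):
    """Expand a base matrix of circulant shifts into a binary H (sketch).

    base[i][j] == -1 -> z x z all-zero block (assumed convention)
    base[i][j] == s  -> z x z identity, columns cyclically shifted by s
    """
    base = np.asarray(base)
    rows, cols = base.shape
    H = np.zeros((rows * z, cols * z), dtype=np.uint8)
    eye = np.eye(z, dtype=np.uint8)
    for i in range(rows):
        for j in range(cols):
            s = int(base[i, j])
            if s >= 0:
                H[i * z:(i + 1) * z, j * z:(j + 1) * z] = np.roll(eye, s % z, axis=1)
    return H

# Placeholder 12 x 24 base matrix (all shifts 0), used only to check dimensions
H = expand_qc(np.zeros((12, 24), dtype=int), 96)
print(H.shape)  # (1152, 2304)
```

Each row of the base matrix becomes one layer of 96 rows. Within a layer, every variable node appears in at most one row, which is what allows the 96 rows to be processed in parallel.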

MS Decoder Architecture
- AP-LLR ($\tilde{\gamma}_n$) memory and routing network
  - 7 or 8 BRAM blocks (depending on quantization)
  - 2 barrel shifters for routing (read and write-back)
  - Serial read (RD) / write-back (WB) of AP-LLRs
- Check-node message ($\beta_{m,n}$) memory
  - 4 BRAM blocks
  - Double buffering + shift registers for routing
- Data processing units (VCN)
  - 96 parallel processing units
  - Parallel processing within a layer
  - Serial processing of input messages
- Control unit
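The layered schedule behind this architecture keeps only the AP-LLRs and the check-node messages in memory; the variable-to-check messages are recomputed on the fly. The sketch below is a floating-point software model under stated assumptions: rows are visited one at a time (the hardware handles the 96 rows of a layer in parallel), every check node has degree at least 2, and the name `layered_min_sum` is hypothetical.

```python
import numpy as np

def layered_min_sum(H, gamma, max_iter=10):
    """Layered Min-Sum decoding, one check row at a time (illustrative sketch)."""
    M, N = H.shape
    rows = [np.flatnonzero(H[m]) for m in range(M)]
    gamma_tilde = np.array(gamma, dtype=float)     # AP-LLR memory
    beta = [np.zeros(len(idx)) for idx in rows]    # CN message memory
    hard = (gamma_tilde < 0).astype(int)
    for _ in range(max_iter):
        for m, idx in enumerate(rows):
            # Read AP-LLRs and strip this check node's previous contribution
            alpha = gamma_tilde[idx] - beta[m]
            sgn = np.where(alpha < 0, -1.0, 1.0)
            mag = np.abs(alpha)
            order = np.argsort(mag)
            min1, min2 = mag[order[0]], mag[order[1]]
            # The edge holding min1 receives min2, all other edges receive min1
            mins = np.where(np.arange(len(idx)) == order[0], min2, min1)
            beta[m] = np.prod(sgn) * sgn * mins
            # Write back the updated AP-LLRs
            gamma_tilde[idx] = alpha + beta[m]
        hard = (gamma_tilde < 0).astype(int)
        if not np.any(H @ hard % 2):
            break
    return hard, gamma_tilde
```

Because each row sees AP-LLRs already updated by the previous rows, information propagates faster than in a flooding schedule, which is the convergence advantage noted for the layered architecture.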

MS Decoder Architecture
[Figure: block diagram of the decoder, with the AP-LLR memory and the check-node message memory labelled]

MS Decoder: AP-LLR memory
- Memory organization
  - Dual-port BRAM, 72 bits per memory word
  - 6-bit AP-LLRs: 8 BRAM blocks
- Serial processing (RD and WB) of AP-LLRs
  - Reduces the AP-LLR memory word width
  - Efficient usage of the BRAM blocks
- Pipelined barrel shifters
  - 7 clock cycles from memory read to processing, and from processing to memory write-back
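The BRAM count follows from simple arithmetic if one assumes that a single access delivers the 96 AP-LLRs of one circulant, spread across BRAMs that each contribute one 72-bit word. That reading is an assumption, although it reproduces the 8 blocks quoted here for 6-bit AP-LLRs. The helper names and the rotation direction of the barrel-shifter model are hypothetical.

```python
Z = 96           # circulant size: AP-LLRs fetched together (assumed access pattern)
WORD_BITS = 72   # BRAM word width from the slide

def ap_llr_brams(llr_bits, z=Z, word_bits=WORD_BITS):
    """Number of 72-bit-wide BRAMs needed to read z AP-LLRs at once."""
    total_bits = z * llr_bits
    return -(-total_bits // word_bits)   # integer ceiling division

def barrel_shift(values, shift):
    """Cyclic left rotation of a list, as a software model of a barrel shifter."""
    s = shift % len(values)
    return values[s:] + values[:s]

print(ap_llr_brams(6))  # 8
print(ap_llr_brams(5))  # 7
```

Rotating by `shift` on read and by `-shift` on write-back restores the original order, which is why the architecture needs one shifter in each direction.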

MS Decoder: CN message memory
- Compressed messages: Min1, Min2, IndexMin1, Signs
  - 4-bit quantization: 16 bits (compressed) vs. 28 bits (uncompressed)
- Memory word: 16 compressed messages (4 messages × 4 BRAMs)
- Reading 96 messages: 6 clock cycles
- Routing: shift registers
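The compressed format works because every outgoing check-node message has magnitude Min1, except the one on the edge that supplied Min1, which gets Min2. A minimal sketch follows, with signs encoded as 0 for positive and 1 for negative. The bit-count formula assumes (q - 1)-bit magnitudes and a check-node degree of 7, which is one breakdown that matches the 16 vs. 28 bits quoted above; both the breakdown and the function names are assumptions.

```python
def compress(alpha_mags, alpha_signs):
    """Compress one check node's outputs into (min1, min2, index, signs)."""
    idx = min(range(len(alpha_mags)), key=alpha_mags.__getitem__)
    min1 = alpha_mags[idx]
    min2 = min(m for k, m in enumerate(alpha_mags) if k != idx)
    parity = sum(alpha_signs) % 2
    # Sign of each output = product of the signs of all *other* inputs
    beta_signs = [parity ^ s for s in alpha_signs]
    return min1, min2, idx, beta_signs

def decompress(min1, min2, idx, beta_signs):
    """Rebuild the individual check-to-variable messages."""
    out = []
    for k, s in enumerate(beta_signs):
        mag = min2 if k == idx else min1
        out.append(-mag if s else mag)
    return out

def compressed_bits(dc, q):
    """Two (q-1)-bit magnitudes + index of min1 + dc sign bits (assumed layout)."""
    return 2 * (q - 1) + (dc - 1).bit_length() + dc

print(compressed_bits(7, 4), 7 * 4)  # 16 28
```

For inputs with magnitudes `[3, 1, 2, 5]` and signs `[0, 1, 0, 0]`, `decompress(*compress(...))` returns `[-1, 2, -1, -1]`, the same values the CNU equation gives edge by edge.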

MS Decoder: Processing Unit (VCN)
- Serial processing
- (6,4) quantization
  - Slice-based adders and comparators
  - Conversion between sign-magnitude (SM) and two's complement (C2)
- (5,3) quantization
  - 0.2-0.25 dB performance penalty
  - Arithmetic operations implemented as ROM
  - Only SM computations
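With (5,3) quantization the operands are narrow enough that an arithmetic operation can be precomputed into a table addressed by the concatenated operand bits. The sketch below illustrates the idea for the subtraction $\tilde{\gamma}_n - \beta_{m,n}$ with both operands in sign-magnitude form. The slides do not give the table contents or the output width, so the 5-bit saturated output, the address layout, and all function names are assumptions.

```python
def sm_encode(value, bits):
    """Encode a signed integer in sign-magnitude form with saturation (MSB = sign)."""
    max_mag = (1 << (bits - 1)) - 1
    mag = min(abs(value), max_mag)
    sign = 1 if value < 0 else 0
    return (sign << (bits - 1)) | mag

def sm_decode(code, bits):
    """Decode a sign-magnitude code back to a signed integer."""
    mag = code & ((1 << (bits - 1)) - 1)
    return -mag if code >> (bits - 1) else mag

def build_sub_rom(a_bits=5, b_bits=3, out_bits=5):
    """Precompute sat(a - b) for every (a, b) pair; address = a code, then b code."""
    rom = []
    for addr in range(1 << (a_bits + b_bits)):
        a = sm_decode(addr >> b_bits, a_bits)
        b = sm_decode(addr & ((1 << b_bits) - 1), b_bits)
        rom.append(sm_encode(a - b, out_bits))
    return rom

ROM = build_sub_rom()   # 2^(5+3) = 256 entries of 5 bits each

def rom_sub(a, b):
    """Look up sat(a - b) without using an adder."""
    addr = (sm_encode(a, 5) << 3) | sm_encode(b, 3)
    return sm_decode(ROM[addr], 5)
```

Since the table takes and returns sign-magnitude codes, no conversion to or from two's complement is needed, in line with the "only SM computations" point above.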

Results comparison

| Architecture | Code | Quantization | Device | Frequency (MHz) | Throughput (Mbps) | Slices | BRAM |
|---|---|---|---|---|---|---|---|
| Chandrasetty2012 [1] | (576, 1152) | γ: 4 bit, β: 2 bit | Virtex-5 | 138 | 11400 | 10823 | - |
| Stimming2012 [2] | (1152, 2304) | γ: 4 bit, β: 3 bit | Virtex-5 | 154 | 8900 | 21688 | - |
| Chen2011 [3] | (768, 1536) | γ: 6 bit, β: 4 bit | Virtex-4 | 149-162 | 223-888 | 1472-6100 | 24 |
| Kim2011 [4] | (336, 672) | γ: 6 bit, β: 4 bit | Virtex-4 | 335 | 822 | 27000 | 32 |
| Beuschel2008 [5] | (1152, 2304) | γ: 6 bit, β: 4 bit | Virtex-4 | 75 | 75 | 9877-19500 | 122 |
| Proposed (baseline) | (1152, 2304) | γ: 6 bit, β: 4 bit | Virtex-5 | 281 | 719 | 6529 | 12 |
| Proposed (new VCN) | (1152, 2304) | γ: 5 bit, β: 3 bit | Virtex-5 | 312 | 800 | 6326 | 11 |

[1] V. A. Chandrasetty and S. M. Aziz, "An area efficient LDPC decoder using a reduced complexity min-sum algorithm", VLSI Journal, 2012.
[2] A. B. Stimming and A. Dollas, "FPGA-based design and implementation of a multi-gbps LDPC decoder", FPL, 2012.
[3] X. Chen, J. Kang, S. Lin, and V. Akella, "Memory System Optimization for FPGA Based Implementation of Quasi-Cyclic LDPC Codes Decoders", IEEE Trans. on CAS, 2011.
[4] S. Kim, G. E. Sobelman, and H. Lee, "A Reduced-Complexity Architecture for LDPC Layered Decoding Schemes", IEEE Trans. on VLSI, 2011.
[5] C. Beuschel and H.-J. Pfleiderer, "FPGA implementation of a Flexible LDPC decoder", FPL, 2008.

Results comparison: summary
- Low-cost FPGA LDPC decoder architecture
- Efficient usage of BRAM blocks for both the AP-LLR and the CN message memory
- Novel implementation of the data processing, with arithmetic operations realized via ROM
- Adaptable architecture, suitable for other LDPC codes

SCMS Memory Requirements
Additional memory requirements
- $s_{m,n}^{old} = \mathrm{sgn}(\alpha_{m,n}^{old})$: sign of the previous $\alpha_{m,n}$ message
- $e_{m,n}^{old}$: erasure bit, indicating whether the previous $\alpha_{m,n}$ message was erased
- Total = 2 additional bits per graph edge = $2 d_c$ bits per check node

Storage per check node, 4-bit quantization:

| Decoder | Stored fields | $d_c = 6$ | $d_c = 20$ |
|---|---|---|---|
| MS, with compressed $\beta$ | index min1, min1, min2, $\beta$ signs | 15 bits | 31 bits |
| SCMS, with compressed $\beta$ | index min1, min1, min2, $\beta$ signs, $\alpha$ signs, erasure bits | 27 bits | 71 bits |

SCMS Improvements

| Variant | Stored fields | How |
|---|---|---|
| MS, with compressed $\beta$ | index min1, min1, min2, $\beta$ signs | |
| SCMS, with compressed $\beta$ | index min1, min1, min2, $\beta$ signs, $\alpha$ signs, erasure bits | |
| SCMS, with no $\beta$ signs | index min1, min1, min2, $\alpha$ signs, erasure bits | Duplicate the signs computation |
| SCMS, with no $\beta$ signs and no erasure bits | index min1, min1, min2, $\alpha$ signs | Modify the self-correction (erasure) rule |
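The storage cost of each variant can be tallied per check node. The sketch assumes (q - 1)-bit magnitudes for min1 and min2 and a ceil(log2 d_c)-bit index, a breakdown that reproduces the 15/31 and 27/71 bit figures on the memory-requirements slide. The totals for the two improved variants follow from the field lists and are not figures quoted in the slides; the function and variant names are hypothetical.

```python
def bits_per_check_node(dc, q=4, variant="ms"):
    """Stored bits per check node for each decoder variant (assumed field widths)."""
    base = (dc - 1).bit_length() + 2 * (q - 1)   # index of min1 + min1 + min2
    per_edge_bits = {
        "ms": 1,                              # beta signs
        "scms": 3,                            # beta signs + alpha signs + erasure bits
        "scms_no_beta_signs": 2,              # alpha signs + erasure bits
        "scms_no_beta_signs_no_erasure": 1,   # alpha signs only
    }
    return base + per_edge_bits[variant] * dc

for dc in (6, 20):
    print(dc, [bits_per_check_node(dc, variant=v) for v in
               ("ms", "scms", "scms_no_beta_signs", "scms_no_beta_signs_no_erasure")])
```

Under these assumptions the last variant needs exactly as many bits as plain MS, consistent with its BRAM count matching that of the MS decoder.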

Memory Efficient SCMS: Improvement 1
- Duplicate the check-node message sign computation
- Allows storing only the variable-node message signs and the erasure bits
[Figure: datapath showing the message sign computation block]

Memory Efficient SCMS: Improvement 1 (continued)
Erasure detection block
- Inputs
  - $s_{m,n}^{old}$: old sign
  - $s_{m,n}^{new}$: new sign
  - $e_{m,n}^{old}$: erasure bit
- Output
  - $e_{m,n}^{new} = \overline{e_{m,n}^{old}} \wedge \left( s_{m,n}^{old} \oplus s_{m,n}^{new} \right)$, i.e. the message is erased when its sign changes and the previous message was not erased
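Both pieces of Improvement 1 reduce to a few XOR gates, modelled below with signs encoded as 0 for positive and 1 for negative. The first function shows why the check-node message signs need not be stored: they can be recomputed from the variable-node message signs. The second is the erasure detection logic, written as NOT(e_old) AND (s_old XOR s_new), which is the combination implied by the self-correction rule. Function names are hypothetical.

```python
def beta_signs_from_alpha_signs(alpha_signs):
    """Recompute check-to-variable signs from the stored alpha signs."""
    parity = 0
    for s in alpha_signs:
        parity ^= s
    # XOR-ing out an edge's own sign leaves the parity of all other edges
    return [parity ^ s for s in alpha_signs]

def erasure_bit(e_old, s_old, s_new):
    """e_new = NOT(e_old) AND (s_old XOR s_new)."""
    return (1 - e_old) & (s_old ^ s_new)

print(beta_signs_from_alpha_signs([0, 1, 0, 0]))  # [1, 0, 1, 1]
print(erasure_bit(0, 0, 1))                       # 1
```

The cost of this trade is a second copy of the sign computation in each processing unit, in exchange for dropping d_c stored bits per check node.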

Memory Efficient SCMS: Improvement 2
- Modify the erasure rule
- Avoid storing the erasure bits: they are estimated instead
Erasure detection block
- Inputs
  - $s_{m,n}^{old}$: old sign
  - $s_{m,n}^{new}$: new sign
  - $\beta_{min1}$: min1 value
- Output
  - $e_{m,n}^{new} = \overline{e_{m,n}^{old\_est}} \wedge \left( s_{m,n}^{old} \oplus s_{m,n}^{new} \right)$, where $e_{m,n}^{old\_est} = 1$ if $\beta_{min1} = 0$, and $0$ otherwise
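The modified rule replaces the stored erasure bit with an estimate derived from the check node's stored min1 value. A minimal sketch, with signs encoded as 0 for positive and 1 for negative; the function name is hypothetical.

```python
def erasure_bit_estimated(beta_min1, s_old, s_new):
    """Erasure decision using an estimated old erasure bit.

    e_old_est = 1 when the check node's min1 is zero, else 0.
    """
    e_old_est = 1 if beta_min1 == 0 else 0
    return (1 - e_old_est) & (s_old ^ s_new)

print(erasure_bit_estimated(2, 0, 1))  # 1: sign changed, min1 non-zero
print(erasure_bit_estimated(0, 0, 1))  # 0: previous message presumed erased
```

An erased message has magnitude zero and therefore forces min1 to zero, but min1 can also be zero because of a different edge of the same check node. The value is thus only an estimate of the true erasure bit.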

Memory Efficient SCMS: Improvement 2 (continued)
- Very small (negligible) performance degradation
[Figure: Bit Error Rate (BER) versus SNR (dB), SNR from 1 to 3 dB; WiMAX, rate 1/2, AWGN, (5,3)-quantization; curves: MS, SCMS, SCMS-V2]

Results comparison

| Decoder | Slices | BRAM | Frequency (MHz) |
|---|---|---|---|
| Conventional MS | 5895 | 12 | 290 |
| Conventional SCMS | 7179 | 16 | 270 |
| SCMS with no check-node message signs (1) | 6748 | 14 | 289 |
| SCMS with no check-node message signs and no erasure bits (2) | 5946 | 12 | 266 |

- (5,3)-quantization
- Results after place and route for the Xilinx XC5VLX50T device, speed grade -3, using the Xilinx ISE 14.7 tool
- The additional overhead in the VCN units is offset by the reduction in the β routing logic, giving a cost similar to conventional MS
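The "similar cost" claim can be checked by expressing each SCMS variant's resources as an overhead relative to conventional MS. The figures are the slice and BRAM counts reported above; the helper name is hypothetical.

```python
MS_SLICES, MS_BRAM = 5895, 12

VARIANTS = {
    "Conventional SCMS": (7179, 16),
    "SCMS, no CN message signs": (6748, 14),
    "SCMS, no CN message signs, no erasure bits": (5946, 12),
}

def overhead_percent(value, reference):
    """Relative overhead in percent, rounded to one decimal."""
    return round(100.0 * (value - reference) / reference, 1)

for name, (slices, bram) in VARIANTS.items():
    print(name,
          overhead_percent(slices, MS_SLICES),   # slice overhead vs. MS
          overhead_percent(bram, MS_BRAM))       # BRAM overhead vs. MS
```

This gives slice overheads of 21.8 %, 14.5 % and 0.9 %, and BRAM overheads of 33.3 %, 16.7 % and 0.0 %, so the second improvement brings SCMS to within about 1 % of the MS slice count with the same number of BRAMs.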

Conclusion
- Two improvements for the layered SCMS architecture, suitable for both FPGA and ASIC implementations
- New self-correction rule that avoids storing the erasure bits
  - Performance close to conventional SCMS
  - Outperforms MS by 0.4 dB at a similar cost
- Low-cost FPGA architectures for both MS and SCMS
  - Efficient usage of BRAM blocks for both the AP-LLR and the CN message memory
  - Novel implementation of the data processing, with arithmetic operations realized via ROM
  - Adaptable architectures, suitable for other QC-LDPC codes

Related publications:
- O. Boncalo, A. Amaricai, A. Hera, and V. Savin, "Cost Efficient FPGA Layered LDPC Decoder with Serial AP-LLR Processing", International Conference on Field Programmable Logic and Applications (FPL), September 2014.
- O. Boncalo, A. Amaricai, and V. Savin, "Memory Efficient Implementation of Self-Corrected Min-Sum Decoder", IEEE International Conference on Electronics Circuits and Systems (ICECS), December 2014.

Thank you for your attention