Cost efficient FPGA implementations of Min-Sum and Self-Corrected Min-Sum decoders


Oana Boncalo (1), Alexandru Amaricai (1), Valentin Savin (2)
(1) University Politehnica Timisoara, Romania
(2) CEA-LETI, MINATEC Campus, Grenoble, France

The research reported in this presentation has been supported by the Franco-Romanian (ANR-UEFISCDI) Joint Research Programme Blanc 2013 project DIAMOND, and by the Seventh Framework Programme of the European Union under Grant Agreement 309129 (i-RISC project).

Outline
- Min-Sum & Self-Corrected Min-Sum
  - Decoding algorithms
  - Motivation
- Proposed Architecture for the MS decoder
  - Objectives
  - MS Architecture
  - Results Comparison
- Proposed Architecture for the SCMS decoder
  - Memory Requirements
  - Proposed Improvements
  - Results Comparison
- Conclusion

LDPC codes (Notation)
- Defined by sparse bipartite graphs
  - variable-nodes: coded bits
  - check-nodes: parity-check equations
- Decoded by message-passing (MP) algorithms
  - $\gamma_n$: input (a priori) LLR values
  - $\alpha_{m,n}$: variable-to-check node messages
  - $\beta_{m,n}$: check-to-variable node messages
  - $\tilde{\gamma}_n$: output (a posteriori) LLR values
- This work
  - Min-Sum
  - Self-Corrected Min-Sum

[Figure: bipartite graph of variable-nodes and check-nodes, annotated with the messages $\alpha_{m,n}$ and $\beta_{m,n}$, the input $\gamma_n$ and the output $\tilde{\gamma}_n$]

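Min-Sum (MS) Decoder

Initialization: $n = 1, \dots, N$; $m \in H(n)$
\[ \gamma_n = \log \frac{\Pr(x_n = 0 \mid y_n)}{\Pr(x_n = 1 \mid y_n)}, \qquad \alpha_{m,n}^{(0)} = \gamma_n \]

Iterations: $l = 1, \dots, l_{\max}$

CNU: $m = 1, \dots, M$; $n \in H(m)$
\[ \beta_{m,n}^{(l)} = \Bigl( \prod_{n' \in H(m) \setminus n} \operatorname{sgn} \alpha_{m,n'}^{(l-1)} \Bigr) \cdot \min_{n' \in H(m) \setminus n} \bigl| \alpha_{m,n'}^{(l-1)} \bigr| \]

AP-LLR: $n = 1, \dots, N$
\[ \tilde{\gamma}_n^{(l)} = \gamma_n + \sum_{m \in H(n)} \beta_{m,n}^{(l)} \]

VNU: $n = 1, \dots, N$; $m \in H(n)$
\[ \alpha_{m,n}^{(l)} = \tilde{\gamma}_n^{(l)} - \beta_{m,n}^{(l)} \]

[Figure: bipartite graph annotated with $\alpha_{m,n}$, $\beta_{m,n}$, input $\gamma_n$ and output $\tilde{\gamma}_n$]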
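The four steps on this slide (initialization, CNU, AP-LLR, VNU) can be sketched in plain Python. This is a minimal floating-point, flooding-schedule illustration and not the hardware architecture of the presentation. The function name `min_sum_decode`, the early stop on a zero syndrome, the convention that sgn(0) is positive, and the toy (7,4) Hamming matrix are all assumptions made for the example.

```python
def min_sum_decode(H, gamma, max_iter=20):
    """Flooding Min-Sum sketch; returns (hard_decision, a_posteriori_llrs)."""
    M, N = len(H), len(H[0])
    rows = [[n for n in range(N) if H[m][n]] for m in range(M)]  # H(m)
    # Initialization: alpha^(0)_{m,n} = gamma_n on every edge of the graph
    alpha = {(m, n): gamma[n] for m in range(M) for n in rows[m]}
    beta = {}
    hard, gamma_post = [int(g < 0) for g in gamma], list(gamma)
    for _ in range(max_iter):
        # CNU: sign product and minimum magnitude over the *other* inputs
        for m in range(M):
            for n in rows[m]:
                others = [alpha[m, k] for k in rows[m] if k != n]
                sign = -1 if sum(a < 0 for a in others) % 2 else 1
                beta[m, n] = sign * min(abs(a) for a in others)
        # AP-LLR: a priori LLR plus all incoming check-node messages
        gamma_post = list(gamma)
        for (m, n), b in beta.items():
            gamma_post[n] += b
        # VNU: remove the contribution of the destination check node
        for (m, n) in alpha:
            alpha[m, n] = gamma_post[n] - beta[m, n]
        hard = [int(g < 0) for g in gamma_post]
        # Assumed stopping rule: all parity checks satisfied
        if all(sum(hard[n] for n in rows[m]) % 2 == 0 for m in range(M)):
            break
    return hard, gamma_post


# Toy example (not the WiMAX code): (7,4) Hamming code, one unreliable bit
H_TOY = [[1, 1, 0, 1, 1, 0, 0],
         [1, 0, 1, 1, 0, 1, 0],
         [0, 1, 1, 1, 0, 0, 1]]
decoded, llrs = min_sum_decode(H_TOY, [2, 2, 2, -1, 2, 2, 2])
```

In the toy run the fourth bit starts with a negative LLR and is pulled back to a positive value by its three check nodes after one iteration.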

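Self-Corrected Min-Sum (SCMS) Decoder

Initialization: $n = 1, \dots, N$; $m \in H(n)$
\[ \gamma_n = \log \frac{\Pr(x_n = 0 \mid y_n)}{\Pr(x_n = 1 \mid y_n)}, \qquad \alpha_{m,n}^{(0)} = \gamma_n \]

Iterations: $l = 1, \dots, l_{\max}$

CNU: $m = 1, \dots, M$; $n \in H(m)$
\[ \beta_{m,n}^{(l)} = \Bigl( \prod_{n' \in H(m) \setminus n} \operatorname{sgn} \alpha_{m,n'}^{(l-1)} \Bigr) \cdot \min_{n' \in H(m) \setminus n} \bigl| \alpha_{m,n'}^{(l-1)} \bigr| \]

AP-LLR: $n = 1, \dots, N$
\[ \tilde{\gamma}_n^{(l)} = \gamma_n + \sum_{m \in H(n)} \beta_{m,n}^{(l)} \]

VNU: $n = 1, \dots, N$; $m \in H(n)$
\[ \alpha_{m,n}^{(l)} = \tilde{\gamma}_n^{(l)} - \beta_{m,n}^{(l)} \]

Self-correction (erasure) rule:
\[ \text{if } \operatorname{sgn} \alpha_{m,n}^{(l)} \neq \operatorname{sgn} \alpha_{m,n}^{(l-1)} \text{ and } \alpha_{m,n}^{(l-1)} \neq 0 \text{ then } \alpha_{m,n}^{(l)} = 0 \]

[Figure: bipartite graph annotated with $\alpha_{m,n}$, $\beta_{m,n}$, input $\gamma_n$ and output $\tilde{\gamma}_n$]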
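The only difference from plain Min-Sum is the erasure test applied to each new variable-node message. A minimal sketch of that single step follows. The helper names `sgn` and `scms_vnu` are hypothetical, and treating zero as positive in `sgn` is an assumption, since the slide does not define the sign of zero.

```python
def sgn(x):
    # Assumption: zero is treated as positive
    return -1 if x < 0 else 1


def scms_vnu(gamma_post_n, beta_mn, alpha_old):
    """One SCMS variable-node update for a single edge (m, n)."""
    alpha_new = gamma_post_n - beta_mn  # same computation as in Min-Sum
    # Self-correction: a sign flip against a non-erased previous message
    # marks the new message as unreliable, so it is erased (set to zero)
    if alpha_old != 0 and sgn(alpha_new) != sgn(alpha_old):
        return 0
    return alpha_new
```

For example, `scms_vnu(-1, 2, 3)` computes a new message of -3 against a previous message of +3 and therefore returns 0, while `scms_vnu(-1, 2, 0)` keeps -3 because the previous message was already erased.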


Self-Corrected Min-Sum (SCMS) Decoder
- Relies on the erasure of unreliable variable-node messages
- Error-correction capability very close to Belief-Propagation
- With respect to other MS-based variants like NMS and OMS, the SCMS has been proven to provide better performance in the error-floor region, especially for irregular LDPC codes
- SCMS has also been shown to be robust to:
  - Imprecise arithmetic
  - Faulty hardware

[Figure: decoding performance, WiMAX code, rate 1/2, AWGN channel]

Self-Corrected Min-Sum (SCMS) Decoder
- But these improvements come with a price:
  - Signs of messages and erasure bits have to be stored
  - Increased memory requirements
  - Additional overhead in terms of routing the messages to the appropriate processing units

Objectives
Main objective: cost-efficient implementation of the SCMS decoder
1. MS decoder
   a) Propose a cost-efficient implementation
2. SCMS decoder
   a) Evaluate the overhead due to the self-correction rule
   b) Propose solutions to reduce the overhead


Objectives
- MS decoder architecture for QC-LDPC codes
  - WiMAX LDPC code with rate 1/2, (N = 2304, K = 1152)
- Low cost
  - Efficient utilization of FPGA resources
  - Fits in a Virtex-5 LX50T FPGA device (Digilent Genesys board)
- Throughput
  - Hundreds of Mbps (> 500 Mbps)
- Flexible
  - Easily adaptable to other codes and code rates


WiMAX LDPC code with rate 1/2
- Quasi-Cyclic code, with base matrix B of size 12 × 24
- Circulant size = 96, giving a parity-check matrix of size 1152 × 2304
- Layered structure: 12 layers, 96 rows per layer
- Layered architecture
  - Lower memory requirements: store only AP-LLRs ($\tilde{\gamma}$) and CN messages ($\beta$)
  - Simpler interconnect scheme
  - Faster convergence: increased decoding performance when the number of iterations is small
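The relation between the base matrix, the circulant size and the full parity-check matrix can be illustrated with a small expansion routine. This is a sketch under stated assumptions: the function name `expand_qc`, the use of -1 for an all-zero block, the shift convention (row r of a block has its 1 in column (r + shift) mod Z) and the toy base matrix are illustrative choices and are not taken from the WiMAX specification or from the presentation.

```python
def expand_qc(base, Z):
    """Expand a base matrix of circulant shifts into a binary matrix.

    base[i][j] = -1 marks an all-zero Z x Z block; any value s >= 0 marks
    the Z x Z identity matrix cyclically shifted by s columns.
    """
    rows, cols = len(base), len(base[0])
    H = [[0] * (cols * Z) for _ in range(rows * Z)]
    for i in range(rows):
        for j in range(cols):
            s = base[i][j]
            if s < 0:
                continue  # zero block
            for r in range(Z):
                H[i * Z + r][j * Z + (r + s) % Z] = 1
    return H


# Toy base matrix (hypothetical), expanded with circulant size 4
TOY_BASE = [[0, 1, -1],
            [2, -1, 0]]
H_toy = expand_qc(TOY_BASE, 4)

# Dimensions quoted on the slide: 12 x 24 base matrix, circulant size 96
wimax_rows, wimax_cols = 12 * 96, 24 * 96
```

Each row of the base matrix expands into one layer of Z rows, which is why the 12 × 24 base matrix with Z = 96 gives 12 layers of 96 rows.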

MS Decoder Architecture
- AP-LLR ($\tilde{\gamma}$) memory and routing network
  - 7/8 BRAM blocks (depending on quantization)
  - 2 barrel shifters for routing (RD and WB)
  - Serial RD/WB of AP-LLRs
- Check-node message ($\beta$) memory
  - 4 BRAM blocks
  - Double buffering + shift registers for routing
- Data processing units (VCN)
  - 96 parallel processing units
  - Parallel processing within a layer
  - Serial processing of input messages
- Control Unit

MS Decoder Architecture

[Figure: decoder block diagram showing the AP-LLR memory and the check-node messages memory]

MS Decoder AP-LLR memory
- Memory organization
  - Dual-port BRAM, 72 bits per memory word
  - 6-bit AP-LLR: 8 BRAM blocks
- Serial processing (RD and WB) of AP-LLRs
  - Reduction of the AP-LLR memory word
  - Efficient usage of BRAM blocks
- Pipelined barrel shifters
  - 7 clock cycles from memory RD to processing, and from processing to memory WB
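The BRAM counts quoted for the AP-LLR memory are consistent with a simple width calculation, sketched below. The reading behind it is an assumption: one memory access delivers the 96 AP-LLRs of a circulant column, spread over BRAMs that are each 72 bits wide. The function name `ap_llr_brams` is hypothetical.

```python
def ap_llr_brams(circulant_size=96, llr_bits=6, word_bits=72):
    """Number of 72-bit-wide BRAMs needed to read one group of AP-LLRs."""
    total_bits = circulant_size * llr_bits
    return -(-total_bits // word_bits)  # integer ceiling division


brams_6bit = ap_llr_brams(llr_bits=6)  # 576 bits / 72 bits per word
brams_5bit = ap_llr_brams(llr_bits=5)  # 480 bits / 72 bits per word, rounded up
```

Under this reading, 6-bit AP-LLRs need 8 BRAMs and 5-bit AP-LLRs need 7, which matches the "7/8 BRAM blocks (depending on quantization)" figure given for the architecture.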

MS Decoder CN message memory
- Compressed $\beta$ messages
  - Min1, Min2, IndexMin1, Signs
  - 4-bit quantization: 16 bits (compressed) vs 28 bits (uncompressed)
- Memory word: 16 compressed messages (4 messages × 4 BRAMs)
- Reading 96 messages: 6 clock cycles
- Routing: shift registers
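The compressed format works because all outgoing messages of a check node share one of two magnitudes. A minimal sketch of the idea follows. The function names and the list-based representation are hypothetical, a sign bit of 1 is taken to mean a negative message, and no claim is made about the bit layout used in the actual memory word.

```python
def compress_check_node(alphas):
    """Compress all outgoing beta messages of one check node.

    Returns (min1, min2, index_of_min1, beta_sign_bits).
    """
    mags = [abs(a) for a in alphas]
    idx = min(range(len(mags)), key=mags.__getitem__)
    min1 = mags[idx]
    min2 = min(m for k, m in enumerate(mags) if k != idx)
    neg = [int(a < 0) for a in alphas]
    parity = sum(neg) % 2
    # Sign of beta_n = product of all input signs except input n
    signs = [parity ^ s for s in neg]
    return min1, min2, idx, signs


def decompress_check_node(min1, min2, idx, signs):
    """Rebuild the individual beta messages from the compressed form."""
    return [(-1 if s else 1) * (min2 if k == idx else min1)
            for k, s in enumerate(signs)]


packed = compress_check_node([3, -1, 4, -2, 5, 2])
betas = decompress_check_node(*packed)
```

The edge that supplied the smallest input receives the second minimum, and every other edge receives the first minimum, so only two magnitudes, one index and one sign bit per edge need to be stored.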

MS Decoder Processing Unit (VCN)
- Serial processing
- (6,4) quantization
  - Slice-based adders and comparators
  - Conversion between sign-magnitude (SM) and two's complement (C2)
- (5,3) quantization
  - 0.2-0.25 dB performance penalty
  - Arithmetic operations implemented as ROM
  - Only SM computations
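With (5,3) quantization the operands are narrow enough that an arithmetic operation can be replaced by a table lookup. The sketch below illustrates that idea for a saturating subtraction of a 3-bit sign-magnitude value from a 5-bit one. Which operations the presented design actually maps to ROM, and how the ROM is addressed, is not stated on the slide, so the choice of operation, the address layout and all function names here are assumptions.

```python
def sm_decode(word, bits):
    """Sign-magnitude word (MSB = sign) to a Python integer."""
    mag = word & ((1 << (bits - 1)) - 1)
    return -mag if word >> (bits - 1) else mag


def sm_encode(value, bits):
    """Python integer to a saturated sign-magnitude word."""
    max_mag = (1 << (bits - 1)) - 1
    mag = min(abs(value), max_mag)
    return ((1 << (bits - 1)) if value < 0 else 0) | mag


def build_sub_rom(g_bits=5, b_bits=3):
    """Precompute g - b for every (g, b) pair; address = g word, then b word."""
    rom = []
    for addr in range(1 << (g_bits + b_bits)):
        g = sm_decode(addr >> b_bits, g_bits)
        b = sm_decode(addr & ((1 << b_bits) - 1), b_bits)
        rom.append(sm_encode(g - b, g_bits))
    return rom


ROM = build_sub_rom()


def rom_sub(g_word, b_word):
    # One lookup replaces the subtractor and the SM/C2 conversions
    return ROM[(g_word << 3) | b_word]
```

The table has 256 entries of 5 bits, and both the operands and the result stay in sign-magnitude form, which is consistent with the "only SM computations" remark on the slide.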

Results comparison

| Architecture proposal | Code | Quantization | Device | Frequency (MHz) | Throughput (Mbps) | Slices | BRAM |
| Chandrasetty2012 [1] | (576, 1152) | γ-4 bit, β-2 bit | Virtex-5 | 138 | 11400 | 10823 | |
| Stimming2012 [2] | (1152, 2304) | γ-4 bit, β-3 bit | Virtex-5 | 154 | 8900 | 21688 | |
| Chen2011 [3] | (768, 1536) | γ-6 bit, β-4 bit | Virtex-4 | 149-162 | 223-888 | 1472-6100 | 24 |
| Kim2011 [4] | (336, 672) | γ-6 bit, β-4 bit | Virtex-4 | 335 | 822 | 27000 | 32 |
| Beuschel2008 [5] | (1152, 2304) | γ-6 bit, β-4 bit | Virtex-4 | 75 | 75 | 9877-19500 | 122 |
| Proposed (baseline) | (1152, 2304) | γ-6 bit, β-4 bit | Virtex-5 | 281 | 719 | 6529 | 12 |
| Proposed (new VCN) | (1152, 2304) | γ-5 bit, β-3 bit | Virtex-5 | 312 | 800 | 6326 | 11 |

[1] V. A. Chandrasetty and S. M. Aziz, "An area efficient LDPC decoder using a reduced complexity min-sum algorithm", VLSI Journal, 2012.
[2] A. B. Stimming and A. Dollas, "FPGA-based design and implementation of a multi-Gbps LDPC decoder", FPL, 2012.
[3] X. Chen, J. Kang, S. Lin, and V. Akella, "Memory System Optimization for FPGA Based Implementation of Quasi-Cyclic LDPC Codes Decoders", IEEE Trans. on CAS, 2011.
[4] S. Kim, G. E. Sobelman, and H. Lee, "A Reduced-Complexity Architecture for LDPC Layered Decoding Schemes", IEEE Trans. on VLSI, 2011.
[5] C. Beuschel and H.-J. Pfleiderer, "FPGA implementation of a Flexible LDPC decoder", FPL, 2008.

Results comparison
- Low cost FPGA LDPC decoder architecture
- Efficient usage of BRAM blocks, both for AP-LLR and CN message memory
- Novel implementation of data processing, with arithmetic operations implemented via ROM
- Adaptable architecture, suitable for other LDPC codes


SCMS Memory Requirements
- Additional memory requirements
  - $s^{old}_{m,n} = \operatorname{sgn}(\alpha^{old}_{m,n})$: sign of the previous $\alpha_{m,n}$ message
  - $e^{old}_{m,n}$: erasure bit, indicating if the previous $\alpha_{m,n}$ message has been erased
  - Total = 2 additional bits per graph edge = $2 d_c$ bits per check node

Stored word per check node, 4-bit quantization:

| Decoder | Stored fields | $d_c = 6$ | $d_c = 20$ |
| MS, with compressed β | Index min1, min1, min2, β signs | 15 bits | 31 bits |
| SCMS, with compressed β | Index min1, min1, min2, β signs, α signs, erasure bits | 27 bits | 71 bits |
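The bit counts in this comparison can be reproduced with a short calculation. The breakdown used here is an assumption that happens to match every figure on the slides: an index of ceil(log2(d_c)) bits, two minima stored as (q - 1)-bit magnitudes, and one sign bit per edge, plus two extra bits per edge for SCMS. The function names are hypothetical.

```python
def index_bits(d_c):
    # ceil(log2(d_c)) for d_c >= 2
    return (d_c - 1).bit_length()


def ms_word_bits(d_c, q=4):
    """Compressed MS word: index of min1, min1, min2, d_c sign bits."""
    return index_bits(d_c) + 2 * (q - 1) + d_c


def scms_word_bits(d_c, q=4):
    """Conventional SCMS adds one alpha sign and one erasure bit per edge."""
    return ms_word_bits(d_c, q) + 2 * d_c


table = {d_c: (ms_word_bits(d_c), scms_word_bits(d_c)) for d_c in (6, 20)}
```

The same formula gives 16 bits for a degree-7 check node, against 7 × 4 = 28 bits uncompressed, which agrees with the figures quoted for the MS check-node message memory.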

SCMS Improvements
- MS, with compressed β
  - Stored: Index min1, min1, min2, β signs
- SCMS, with compressed β
  - Stored: Index min1, min1, min2, β signs, α signs, erasure bits
- SCMS, with no β signs
  - Stored: Index min1, min1, min2, α signs, erasure bits
  - Obtained by duplicating the β signs computation
- SCMS, with no β signs and no erasure bits
  - Stored: Index min1, min1, min2, α signs
  - Obtained by modifying the self-correction (erasure) rule

Memory Efficient SCMS Improvement 1
- Duplicate the check-node message sign computation
- Allows the storage of only the variable-node message signs and erasure bits

[Figure: processing unit with the duplicated $\beta$ message sign computation block]

Memory Efficient SCMS Improvement 1
- Erasure Detection Block
  - Inputs
    - $s^{old}_{m,n}$: old sign
    - $s^{new}_{m,n}$: new sign
    - $e^{old}_{m,n}$: erasure bit
  - Output
\[ e^{new}_{m,n} = \overline{e^{old}_{m,n}} \wedge \bigl( s^{old}_{m,n} \oplus s^{new}_{m,n} \bigr) \]

Memory Efficient SCMS Improvement 2
- Modifying the erasure rule
  - Avoids storing the erasure bits
  - Erasure bits are estimated
- Erasure Detection Block
  - Inputs
    - $s^{old}_{m,n}$: old sign
    - $s^{new}_{m,n}$: new sign
    - $\beta_{min1}$: min1 value
  - Output
\[ e^{new}_{m,n} = \overline{e^{old\_est}_{m,n}} \wedge \bigl( s^{old}_{m,n} \oplus s^{new}_{m,n} \bigr), \qquad e^{old\_est}_{m,n} = \begin{cases} 1, & \text{if } \beta_{min1} = 0 \\ 0, & \text{otherwise} \end{cases} \]
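The two erasure detection blocks differ only in where the "previously erased" flag comes from, which a short sketch makes explicit. The function names are hypothetical, and the logic operators are an assumption chosen to agree with the self-correction rule (erase on a sign change, unless the previous message was already erased). The comment on why a zero min1 works as an estimate is likewise an interpretation and not a statement from the slides.

```python
def erasure_stored(e_old, s_old, s_new):
    """Improvement 1: the previous erasure bit is read from memory."""
    return int((not e_old) and (s_old != s_new))


def erasure_estimated(beta_min1, s_old, s_new):
    """Improvement 2: the previous erasure bit is estimated from min1.

    Interpretation: an erased alpha message has magnitude 0, so it forces
    the first minimum of its check node to 0.
    """
    e_old_est = (beta_min1 == 0)
    return int((not e_old_est) and (s_old != s_new))
```

The estimate can differ from the stored bit, for example when min1 is 0 because of a different edge of the same check node, which is one plausible source of the small performance gap between the two variants.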

Memory Efficient SCMS Improvement 2
- Very small (negligible) performance degradation

[Figure: Bit Error Rate (BER) versus SNR (dB), SNR from 1 to 3 dB and BER from 1E+00 down to 1E-07, for MS, SCMS and SCMS-V2; WiMAX, rate 1/2, AWGN, (5,3)-quantization]

Results comparison

| Decoder | Slices | BRAM | Frequency (MHz) |
| Conventional MS | 5895 | 12 | 290 |
| Conventional SCMS | 7179 | 16 | 270 |
| SCMS with no check-node message signs (1) | 6748 | 14 | 289 |
| SCMS with no check-node message signs and erasure bits (2) | 5946 | 12 | 266 |

- (5,3)-quantization
- Results after place & route
- Xilinx XC5VLX50T device, speed grade -3, using the Xilinx ISE 14.7 tool
- The additional overhead in the VCN units is offset by the reduction in the β routing logic, giving a cost similar to conventional MS

Conclusion
- Two improvements for the layered SCMS architecture, suitable for both FPGA and ASIC implementations
- New self-correction rule which avoids the requirement for erasure bits storage
  - Performance close to conventional SCMS
  - Outperforms MS by 0.4 dB, while having similar cost
- Low cost FPGA architectures for both MS and SCMS
  - Efficient usage of BRAM blocks, both for AP-LLR and CN message memory
  - Novel implementation of data processing, with arithmetic operations implemented via ROM
  - Adaptable architectures, suitable for other QC-LDPC codes

O. Boncalo, A. Amaricai, A. Hera, and V. Savin, "Cost Efficient FPGA Layered LDPC Decoder with Serial AP-LLR Processing", International Conference on Field Programmable Logic and Applications (FPL), September 2014.
O. Boncalo, A. Amaricai, and V. Savin, "Memory Efficient Implementation of Self-Corrected Min-Sum Decoder", IEEE International Conference on Electronics Circuits and Systems (ICECS), December 2014.

Thank you for your attention