MULTI-RATE HIGH-THROUGHPUT LDPC DECODER: TRADEOFF ANALYSIS BETWEEN DECODING THROUGHPUT AND AREA

Predrag Radosavljevic, Alexandre de Baynast, Marjan Karkooti, and Joseph R. Cavallaro
Department of Electrical and Computer Engineering, Rice University
{rpredrag, debaynas, marjan, cavallar}@rice.edu

ABSTRACT

In order to achieve high decoding throughput (hundreds of Mbits/sec and above) for multiple code rates and moderate codeword lengths (up to 2.5K bits), several decoder solutions with different levels of processing parallelism are possible. Selection between these solutions is based on a threefold criterion: hardware complexity, decoding throughput, and error-correcting performance. In this work, we determine the multi-rate LDPC decoder architecture with the best tradeoff in terms of area, error-correcting performance, and decoding throughput. A prototype of the optimal decoder architecture is implemented on an FPGA.

I. INTRODUCTION

Recent designs of LDPC decoders are mostly based on block-structured Parity Check Matrices (PCMs). The tradeoff between data throughput and area for structured partly-parallel LDPC decoders has been investigated in [5, 8]. However, those studies are restricted to block-structured regular codes, which do not exhibit excellent error-correcting performance. The scalable decoder from [9], which supports three code rates, is based on structured regular and irregular PCMs, but its slow convergence allows only moderate decoding throughput. Although extremely fast, the fully parallel decoder from [3] suffers from a lack of flexibility and a relatively large area occupation.

Our goal is to design a high-throughput (approximately 1 Gbit/sec) LDPC decoder that supports multiple code rates (between 1/2 and 5/6) and moderate codeword lengths (up to 2.5K bits), as defined by the IEEE 802.16e and IEEE 802.11n wireless standards [7]. The implemented decoder aims to represent the best tradeoff between decoding throughput and hardware complexity while keeping excellent error-correcting performance. A range of decoder solutions with different levels of processing parallelism is analyzed, and the solution with the best throughput-per-area ratio is chosen for hardware implementation. A prototype architecture is implemented on a Xilinx FPGA. (This work was supported in part by Nokia Corporation and by NSF under grants EIA-04458 and EIA-03166.)

II. LOW DENSITY PARITY-CHECK CODES

An LDPC code is a linear block code specified by a very sparse PCM, as shown in Fig. 1, with random placement of nonzero entries [7]. Each coded bit corresponds to a PCM column (variable node), while each row of the PCM represents a parity-check equation (check node). Log-likelihood ratios (LLRs) are used to represent reliability in order to simplify the arithmetic operations: the message R_mj denotes the check node LLR sent from check node m to variable node j, the message L(q_mj) denotes the variable node LLR sent from variable node j to check node m, and L(q_j) (j = 1, ..., n) denotes the a posteriori probability (APP) ratio of coded bit j. The L(q_j) are initialized with the a priori (channel) reliability value of the coded bit j (j = 1, ..., n).

The iterative layered belief propagation (LBP) algorithm is a variation of the standard belief propagation (SBP) algorithm [5]; it converges roughly twice as fast because of its optimized scheduling of the reliability messages [6]. As shown in Fig. 1, a typical block-structured PCM is composed of concatenated horizontal layers (component codes) built from shifted identity sub-matrices.
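To make the block structure concrete, the following sketch expands a small table of circulant shift values into a binary PCM built from shifted identity sub-matrices. It is illustrative only: the base matrix, sub-matrix size, and shift values below are made up and are not the matrices designed in this paper (Python is used purely for exposition).

```python
import numpy as np

def expand_pcm(base, z):
    """Expand a base matrix of circulant shifts into a binary PCM.

    base[i][j] = -1 denotes an all-zero z x z block; a value s >= 0 denotes
    the z x z identity matrix cyclically shifted by s columns. Each row of
    'base' corresponds to one horizontal layer of the PCM.
    """
    rows, cols = len(base), len(base[0])
    H = np.zeros((rows * z, cols * z), dtype=np.uint8)
    I = np.eye(z, dtype=np.uint8)
    for i in range(rows):
        for j in range(cols):
            s = base[i][j]
            if s >= 0:
                H[i*z:(i+1)*z, j*z:(j+1)*z] = np.roll(I, s, axis=1)
    return H

# Hypothetical 2-layer x 6-block-column base matrix (values are made up).
base = [
    [ 1, -1,  3, -1,  0, -1],
    [-1,  2, -1,  0, -1,  4],
]
H = expand_pcm(base, z=5)
print(H.shape)        # (10, 30): two horizontal layers of 5 rows each
print(H.sum(axis=1))  # row weight is 3 for every parity-check equation
```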
The belief propagation algorithm is repeated for each component code, while updated messages are passed between the sub-codes. For each variable node j inside the current horizontal layer, the messages L(q_mj) corresponding to all of its check node neighbors m are computed according to:

    L(q_{mj}) = L(q_j) - R_{mj}    (1)

For each check node m, the messages R_mj for all variable nodes j that participate in that parity-check equation are computed according to:

    R_{mj} = \Big( \prod_{j' \in N(m) \setminus \{j\}} \mathrm{sign}\big(L(q_{mj'})\big) \Big) \, \Psi\Big( \sum_{j' \in N(m) \setminus \{j\}} \Psi\big(L(q_{mj'})\big) \Big)    (2)

where N(m) is the set of all variable nodes in parity-check equation m, and \Psi(x) = -\log\big[\tanh(|x|/2)\big]. The a posteriori reliability messages in the current horizontal layer are then updated according to:

    L(q_j) = L(q_{mj}) + R_{mj}.    (3)

If all parity-check equations are satisfied or the pre-determined maximum number of iterations is reached, the decoding algorithm stops. Otherwise, the algorithm repeats from (1) for the next horizontal layer.
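A minimal software sketch of one layered sub-iteration, i.e. Eqs. (1)-(3) applied to a single horizontal layer, is given below. It uses the exact Psi-based check-node rule of Eq. (2) in floating point (not the hardware min-sum of Section V), and the data layout (an APP array Lq, a message matrix R, and a per-layer list of check equations) is a simplifying assumption for illustration.

```python
import numpy as np

def psi(x):
    # Psi(x) = -log(tanh(|x|/2)); clip to avoid log(0) at x = 0
    return -np.log(np.clip(np.tanh(np.abs(x) / 2.0), 1e-12, None))

def process_layer(Lq, R, layer_checks):
    """One LBP sub-iteration over a single horizontal layer.

    Lq           : array of APP values L(q_j), one per coded bit
    R            : matrix of check-to-variable messages R_mj
    layer_checks : list of (m, cols) pairs; cols are the variable-node
                   indices participating in parity-check equation m
    """
    for m, cols in layer_checks:
        cols = np.asarray(cols)
        Lqm = Lq[cols] - R[m, cols]                   # Eq. (1)
        sign_all = np.prod(np.sign(Lqm))
        psi_all = np.sum(psi(Lqm))
        for k, j in enumerate(cols):
            s = sign_all * np.sign(Lqm[k])            # sign product excluding j
            R[m, j] = s * psi(psi_all - psi(Lqm[k]))  # Eq. (2)
        Lq[cols] = Lqm + R[m, cols]                   # Eq. (3)
    return Lq, R

# Tiny made-up example: 6 coded bits, one layer with two parity checks
Lq = np.array([1.2, -0.8, 2.5, -1.1, 0.6, 3.0])
R = np.zeros((2, 6))
Lq, R = process_layer(Lq, R, [(0, [0, 2, 4]), (1, [1, 3, 5])])
print(np.round(Lq, 2))
```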

III. LDPC DECODERS WITH DIFFERENT PROCESSING PARALLELISM AND DESIGNED PCMS

Processing parallelism consists of two parts: parallelism in the decoding of the concatenated codes (horizontal layers of the PCM), and parallelism in the reading/writing of reliability messages from/to the memory modules. According to the different levels of processing parallelism, while targeting decoding throughput of hundreds of Mbits/sec and above, the following decoder solutions are considered:

1. L1RW1: Decoder with full decoding parallelism per one horizontal layer (LBP algorithm), reading/writing of messages from one nonzero sub-matrix in each clock cycle.
2. L1RW2: Decoder with full decoding parallelism per one horizontal layer (LBP algorithm), reading/writing of messages from two consecutive nonzero sub-matrices in each clock cycle.
3. L3RW1: Decoder with pipelining of three consecutive horizontal layers, reading/writing of messages from one nonzero sub-matrix in each clock cycle.
4. L3RW2: Decoder with pipelining of three consecutive horizontal layers, reading/writing of messages from two consecutive nonzero sub-matrices in each clock cycle.
5. L6RW2: Decoder with double pipelining (pipelining of six consecutive horizontal layers), reading/writing of messages from two consecutive nonzero sub-matrices in each clock cycle.
6. FULL: Fully parallel decoder with simultaneous processing of all layers and simultaneous reading/writing of all reliability messages (SBP decoding algorithm).

Optimized block-structured irregular PCMs with excellent error-correcting performance have been proposed for the IEEE 802.16 and 802.11n wireless standards [7]. Inspired by these codes, we propose a novel design of irregular block-structured PCMs which allows doubly parallel memory access while preserving the error-correcting performance. An example of the designed block-structured PCMs is shown in Fig. 1.

The main constraint in the proposed PCM design is that any two consecutive nonzero sub-matrices from the same horizontal layer belong to an odd and an even PCM block-column (see Fig. 1). This property is essential for achieving more parallel memory access, since messages from two sub-matrices can be simultaneously read/written from/to two independent memory modules (an illustrative check of this constraint is sketched below). PCMs with 24 block-columns provide the best performance due to a near-optimal profile and a sufficiently high parallelism degree for high-throughput decoder implementations: all proposed decoder solutions, including the fully parallel architecture, support the newly designed PCMs.

In order to analyze different levels of processing parallelism, we first introduce the different ways of parallelizing memory access, as well as the different ways to decode the horizontal layers of the PCM. Figure 2 (part labelled RW1) shows a memory organization composed of a single dual-port RAM block where each memory location contains the messages from one (out of 24) block-column. This memory organization allows read/write of one full sub-matrix of the PCM in each clock cycle. The width of the valid memory content depends on the codeword length (size of the sub-matrix), and it can be up to 108 messages for the largest supported codeword length of 2592. Meanwhile, each check node memory location contains the check node messages from a single nonzero sub-matrix.

The architecture-oriented constraint of equally distributed odd and even nonzero block-columns in every horizontal layer allows read/write of two sub-matrices per clock cycle. As shown in Fig. 2 (part labelled RW2), the message memory is partitioned into two independent modules. Each location of one memory module contains the messages that correspond to one (out of twelve) odd block-columns. The other memory module contains the messages from the even block-columns. Meanwhile, each check node memory location contains the check node messages that correspond to two consecutive nonzero sub-matrices from the same horizontal layer. The whole content of a check node memory location is loaded/stored in a single clock cycle.
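As referenced above, the odd/even block-column constraint can be checked mechanically. The sketch below is purely illustrative: it operates on a hypothetical list of 1-based nonzero block-column positions for one horizontal layer, which is not taken from the matrices designed in this paper.

```python
def alternates_odd_even(nonzero_block_cols):
    """Return True if consecutive nonzero block-columns in one layer
    alternate between odd and even indices (1-based), so that the two
    sub-matrices accessed per clock cycle always come from the two
    independent memory modules (odd / even)."""
    parities = [c % 2 for c in nonzero_block_cols]
    return all(parities[k] != parities[k + 1] for k in range(len(parities) - 1))

# Hypothetical layers of a 24-block-column PCM (indices are made up):
print(alternates_odd_even([1, 2, 5, 8, 11, 14, 19, 24]))  # True
print(alternates_odd_even([1, 3, 6, 8]))                  # False: 1 and 3 both odd
```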
It is important to observe that the dual-port RAMs can be replaced with single-port RAMs if there is no pipelining of horizontal layers. However, the memory organization is independent of the number of memory ports and remains the same.

Figure 1: Example of a designed block-structured irregular parity-check matrix with 24 block-columns and 8 horizontal layers. The codeword size is 1296, the rate is 2/3, and the size of each sub-matrix is 54 x 54.

Figure 2: Organization of the message memory into a single dual-port RAM module (RW1), and into two independent dual-port RAM modules containing the messages from the odd and even block-columns of the parity-check matrix (RW2).

By construction, all rows inside a single horizontal layer are independent and can be processed in parallel without any performance loss. Every horizontal layer is processed through three stages: a memory reading stage, a processing stage, and a memory writing stage, corresponding to Eqs. (1), (2), and (3), respectively. Processing parallelism can be increased if three consecutive horizontal layers are pipelined, as visualized in Fig. 3 (labelled L3).

Three-stage pipelined decoding introduces a certain performance loss due to the overlapping between the same messages from the pipelined layers.

Figure 3: Belief propagation based on pipelining of the three decoding stages for three consecutive horizontal layers of the parity-check matrix.

Processing parallelism is increased two times if pipelining of six consecutive layers is employed, as shown in Fig. 4 (labelled L6). In order to avoid reading/writing collisions (simultaneous reading and writing of messages from the identical block-columns but from two different horizontal layers), it is necessary to postpone the start of every second layer by one clock cycle (see Fig. 4).

Figure 4: Belief propagation based on pipelining of the three decoding stages for six consecutive horizontal layers of the parity-check matrix.

IV. THROUGHPUT-AREA-PERFORMANCE TRADEOFF ANALYSIS FOR PROPOSED LDPC DECODERS

The proposed six decoder solutions with different processing parallelism levels from Section III are compared. It is assumed that each solution (except the L6RW2 architecture) supports multiple code rates (from 1/2 to 5/6) and codeword lengths (648, 1296, 1944, and 2592), as well as the newly designed block-structured PCMs with 24 block-columns. The L6RW2 solution supports code rates of up to 3/4, since the designed PCM with 24 block-columns has only four horizontal layers for the code rate of 5/6: this solution therefore has reduced flexibility.

Figure 5 shows the decoding throughput and area complexity for the analyzed LDPC decoders: decoding parallelism and memory access parallelism increase from left to right (same order as labelled from L1RW1 to FULL in Section III). Hardware complexity is represented as an area in mm^2 assuming a 0.18 µm 6-metal CMOS process. The total area is computed as the sum of the arithmetic area and the memory area. The arithmetic part of each decoder is represented as a number of standard CMOS logic gates (also shown in Fig. 5), assuming the typical TSMC design rule for 0.18 µm technology of 100K logic gates per mm^2 [1]. The memory size is given as the number of bits required for storage of the APP and check node messages (SRAMs) and of the supported PCMs (ROMs). The corresponding memory area is computed by assuming an SRAM cell size of 4.65 µm^2, which is a typical size for 0.18 µm technology [1]. Dual-port SRAMs are required for the L3RW1, L3RW2 and L6RW2 decoder architectures since they employ multiple pipeline stages: the memory cell area is typically increased two times for the same design rule [2]. The tradeoff is defined as the ratio between decoding throughput and total area, expressed in Mbits/sec/mm^2. The decoding throughput is based on the average number of decoding iterations required to achieve a frame-error rate (FER) of 10^-4 (averaged over 10^6 codeword transmissions) for the largest supported code rate and code length, at a clock frequency of 200 MHz. Since the fully parallel solution does not have inherent flexibility, arithmetic logic dedicated to 16 different rate-size combinations is necessary, which significantly increases the arithmetic area. On the other hand, the same set of latches is utilized for the storage of the reliability messages in all 16 supported cases.
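For orientation only, the sketch below computes the tradeoff metric exactly as defined above (throughput divided by total area, with throughput derived from the clock frequency and the average iteration count, and area derived from the gate-count and SRAM-cell figures quoted in the text). All decoder-specific inputs (cycles per iteration, average iterations, gate count, memory bits) are hypothetical placeholders, not the values behind Fig. 5.

```python
def tradeoff_ratio(bits_per_codeword, f_clk_hz, cycles_per_iter, avg_iters,
                   arith_gates, gates_per_mm2, mem_bits, mm2_per_bit):
    """Illustrative tradeoff metric: decoded bits per second per mm^2."""
    throughput_bps = bits_per_codeword * f_clk_hz / (cycles_per_iter * avg_iters)
    area_mm2 = arith_gates / gates_per_mm2 + mem_bits * mm2_per_bit
    return throughput_bps, area_mm2, throughput_bps / area_mm2

# All decoder-specific inputs below are hypothetical placeholders.
tput, area, ratio = tradeoff_ratio(
    bits_per_codeword=2592,   # largest supported codeword length
    f_clk_hz=200e6,           # 200 MHz clock assumed in the analysis
    cycles_per_iter=60,       # placeholder: pipeline cycles per iteration
    avg_iters=8,              # placeholder: average iterations at FER 1e-4
    arith_gates=300e3,        # placeholder: arithmetic gate count
    gates_per_mm2=100e3,      # 100K gates per mm^2 in 0.18 um (from the text)
    mem_bits=100e3,           # placeholder: SRAM/ROM bits
    mm2_per_bit=4.65e-6,      # 4.65 um^2 per SRAM cell (from the text)
)
print(f"{tput/1e6:.0f} Mbit/s, {area:.2f} mm^2, {ratio/1e6:.0f} Mbit/s/mm^2")
```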
Figure 5: Average decoding throughput [Mbits/sec], arithmetic area [K gates], memory size [Kbits], total area [mm^2/100], and throughput/area ratio [Mbits/sec/mm^2] for the LDPC decoder solutions with different levels of processing parallelism (L1RW1, L1RW2, L3RW1, L3RW2, L6RW2, FULL), labelled as in Section III.

By employing only the pipelining of three consecutive horizontal layers, the decoding throughput is increased by approximately three times while the arithmetic area increases only marginally. On the other hand, the memory area is doubled since mirror memories are required for the storage of the APP and check node messages. Overall, three-stage pipelining improves the throughput/area ratio (see Fig. 5, L1RW1 vs. L3RW1, and L1RW2 vs. L3RW2). If the memory access parallelism is doubled (from RW1 to RW2), the throughput is directly increased by more than 50% while the arithmetic area is twice as large and the memory size stays the same.

Overall, by increasing the memory access parallelism alone, the throughput/area ratio is only slightly improved (see Fig. 5, L1RW1 vs. L1RW2, and L3RW1 vs. L3RW2). A further increase of the decoding parallelism (L6RW2 and FULL solutions) does not improve the tradeoff ratio, since the throughput improvements are smaller than the increase in area occupation. We can expect a similar effect if the memory access parallelism is further increased. As an illustration, if four sub-matrices per clock cycle are loaded/stored (not shown, since it requires a special redesign of the block-structured PCMs), the arithmetic area increases two times but the decoding throughput improves only by approximately 5%, since the processing latency inside the DFUs starts to dominate over the memory access latency.

It can be observed from Fig. 5 that the best throughput-per-area ratio is achieved for three-stage pipelining with memory access that allows reading/writing of two sub-matrices per clock cycle (the L3RW2 solution). At the same time, the performance loss due to pipelining is acceptable (about 0.1 dB at an FER around 10^-4, see Fig. 6). In order to avoid performance loss, double pipelining and SBP require additional decoding iterations, which decreases the decoding throughput.

Figure 6: FER for the PCM with 24 block-columns (code rate 2/3, code size 1296): non-pipelined LBP (Max.Iter = 15), 3-layer pipelined LBP (Max.Iter = 15), 6-layer pipelined LBP (Max.Iter = 15 and 25), and SBP (Max.Iter = 15 and 25).

V. THREE-STAGE PIPELINED LDPC DECODER WITH DOUBLE-PARALLEL MEMORY ACCESS

The LDPC decoder with three-stage pipelining and a memory organization that supports access of two sub-matrices during a single clock cycle is chosen for hardware implementation as the best tradeoff between throughput and area. A high-level block diagram of the decoder architecture is shown in Fig. 7. A single decoder architecture supports codeword sizes of 648, 1296, 1944, and 2592, and code rates of 1/2, 2/3, 3/4, and 5/6, which is compatible with the IEEE 802.16 and 802.11n standards [7]. Four identical permuters are required for block-shifting of the messages after loading them from each original and mirror memory module. The scalable permuter (permutation of blocks of sizes 27, 54, 81, and 108) is composed of 7-bit-input 3:1 multiplexers organized in multiple pipelined stages. A modified min-sum approximation with a correcting offset [4] is employed as the central part of the decoding function unit (DFU); a small illustrative sketch of this check-node update is given below. A single DFU is responsible for processing one row of the PCM through three pipeline stages according to Eqs. (1), (2), and (3), as shown in Fig. 8.

Figure 7: Block diagram of the decoder architecture for block-structured irregular parity-check matrices with 24 block-columns (message memory modules and their mirrors, check memories, address generators, ROM modules and their mirrors, a bank of 108 decoding FUs, and the controller).

Support for different code rates and codeword lengths implies the usage of multiple PCMs. Information about each supported PCM is stored in several ROM modules. A single memory location of a ROM module contains the block-column position of a nonzero sub-matrix as well as the associated shift value. The block-column position is the reading/writing address of the appropriate message memory module.
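The following sketch illustrates the offset min-sum check-node update that approximates Eq. (2): the magnitude of each outgoing message is the smallest (or second-smallest) input magnitude reduced by a correcting offset, and the result is quantized like the implemented datapath (7-bit two's complement with one fractional bit, see Section V.A). The offset value beta and the example inputs are illustrative assumptions, not the exact parameters of [4] or of the implemented DFU.

```python
import numpy as np

def quantize_7bit(x, frac_bits=1):
    """7-bit two's complement with one fractional bit: range [-32.0, 31.5]."""
    step = 2.0 ** -frac_bits
    return np.clip(np.round(x / step) * step, -64 * step, 63 * step)

def offset_min_sum_check(Lqm, beta=0.5):
    """Offset min-sum approximation of Eq. (2) for one check node.

    Lqm  : variable-to-check messages L(q_mj) of all bits in this check
    beta : correcting offset (illustrative value)
    Returns the check-to-variable messages R_mj.
    """
    Lqm = np.asarray(Lqm, dtype=float)
    mags = np.abs(Lqm)
    order = np.argsort(mags)
    min1, min2 = mags[order[0]], mags[order[1]]   # two smallest magnitudes
    sign_all = np.prod(np.sign(Lqm))
    R = np.empty_like(Lqm)
    for j in range(len(Lqm)):
        m = min2 if j == order[0] else min1       # exclude the bit's own magnitude
        R[j] = sign_all * np.sign(Lqm[j]) * max(m - beta, 0.0)
    return quantize_7bit(R)

# Tiny made-up example: one check node with four neighbouring bits
print(offset_min_sum_check([1.5, -1.0, 2.0, -3.5]))  # [ 0.5 -1.   0.5 -0.5]
```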
In order to avoid permutation of the messages during the writing stage, relative shift values (the difference between two consecutive shift values of the same block-column) are stored instead of the original shift values.

Support of multiple code rates is handled by the control unit. The number of nonzero sub-matrices per horizontal layer varies significantly with the code rate, which affects the memory latency as well as the processing latency inside the DFUs. The control logic provides synchronization between pipelined stages with variable latencies. The controller also handles the exception which occurs when the number of nonzero sub-matrices in a horizontal layer is odd. In that case, only one block-column per clock cycle is read/written from/to the odd or even memory module. The full content of the corresponding check node memory location (the width of two sub-matrices) is loaded, while the second half of it is not valid. The control unit then disables part of the arithmetic logic inside the DFUs.
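To illustrate the relative-shift idea described above, the sketch below converts a table of absolute circulant shifts into relative shifts (modulo the sub-matrix size), so that each read applies only the difference with respect to the permutation in which the messages were last written back. The shift values, layers, and block-column indices are made up for illustration.

```python
def relative_shifts(shift_table, z):
    """Convert absolute circulant shifts to relative shifts per block-column.

    shift_table: for each horizontal layer, a dict {block_column: shift}.
    Returns the same structure, but each entry holds the difference (mod z)
    from the previous shift applied to that block-column, so that messages
    never need to be un-permuted in the writing stage.
    """
    last = {}                      # last absolute shift seen per block-column
    out = []
    for layer in shift_table:
        rel = {}
        for col, s in layer.items():
            rel[col] = (s - last.get(col, 0)) % z
            last[col] = s
        out.append(rel)
    return out

# Hypothetical shift values for three layers of a PCM with z = 54 (made up):
layers = [{1: 17, 4: 30}, {1: 5, 2: 12}, {2: 40, 4: 9}]
print(relative_shifts(layers, z=54))
# [{1: 17, 4: 30}, {1: 42, 2: 12}, {2: 28, 4: 33}]
```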

Figure 8: Block diagram of a single decoding function unit (DFU) with three pipeline stages (reading, processing, writing) and block-serial processing. The interface to the message and check node memories is also included.

A. Error-Correcting Performance and Hardware Implementation

Figure 9 shows the FER performance of the implemented decoder for the 2/3-rate code with length 1296. It can be noticed that 7-bit arithmetic precision is sufficient for accurate decoding. An arithmetic precision of seven bits is therefore chosen for the representation of the reliability messages (two's complement with one bit for the fractional part) as well as for all arithmetic operations.

Figure 9: FER for the implemented decoder (code rate 2/3, code length 1296, maximum number of iterations 15): pipelined LBP with floating-point, 6-bit, 7-bit, and 8-bit fixed-point arithmetic.

A prototype architecture has been implemented in Xilinx System Generator and targeted to a Xilinx Virtex4-XC4VFX60 FPGA. Table 1 shows the utilization statistics. Based on the XST synthesis tool report, a maximum clock frequency of 130 MHz can be achieved, which yields an average throughput of approximately 1.1 Gbits/sec.

Table 1: Xilinx Virtex4-XC4VFX60 FPGA utilization statistics.

Resource     Used     Utilization rate
Slices       19,595   77%
LUTs         36,245   72%
Block RAMs   140      60%

VI. CONCLUSION

In this paper, a multi-rate decoder architecture that represents the best tradeoff in terms of throughput, area, and performance has been determined and implemented on an FPGA. An identical tradeoff analysis can be applied for a wide range of code rates and codeword lengths. We believe that pipelining of multiple horizontal layers (blocks of the PCM) combined with sufficiently parallel memory access is a general tradeoff solution that can be applied to other block-structured LDPC codes.

REFERENCES

[1] http://www.amis.com/asics/standard cell.html.
[2] http://www.taborcommunications.com/dsstar/04/04/107497.html.
[3] A. Darabiha, A. C. Carusone, and F. R. Kschischang. Multi-Gbit/sec low density parity check decoders with reduced interconnect complexity. In Proc. IEEE International Symposium on Circuits and Systems (ISCAS), May 2005.
[4] Jinghu Chen, A. Dholakia, E. Eleftheriou, M. P. C. Fossorier, and Xiao-Yu Hu. Reduced-complexity decoding of LDPC codes. IEEE Transactions on Communications, 53:1288-1299, Aug. 2005.
[5] M. M. Mansour and N. R. Shanbhag. High-throughput LDPC decoders. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 11:976-996, Dec. 2003.
[6] P. Radosavljevic, A. de Baynast, and J. R. Cavallaro. Optimized message passing schedules for LDPC decoding. In 39th Asilomar Conference on Signals, Systems and Computers, pages 591-595, Nov. 2005.
[7] Robert Xu, David Yuan, and Li Zeng. High girth LDPC coding for OFDMA PHY. Technical Report IEEE C802.16e-05/031, IEEE 802.16 Broadband Wireless Access Working Group, 2005.
[8] Se-Hyeon Kang and In-Cheol Park. Loosely coupled memory-based decoding architecture for low density parity check codes. To appear in IEEE Transactions on Circuits and Systems.
[9] L. Yang, H. Liu, and C.-J. R. Shi. Code construction and FPGA implementation of a low-error-floor multi-rate low-density parity-check code decoder. To appear in IEEE Transactions on Circuits and Systems.