DECODING OF ARRAY LDPC CODES USING ON-THE FLY COMPUTATION Kiran Gunnam, Weihuang Wang, Euncheol Kim, Gwan Choi, Mark Yeary *

Similar documents
Reduced-Complexity Column-Layered Decoding and. Implementation for LDPC Codes

Area and Energy Efficient VLSI Architectures for Low-Density Parity-Check Decoders using an On-the-fly Computation

Pipelined Multipliers for Reconfigurable Hardware

Parallel Block-Layered Nonbinary QC-LDPC Decoding on GPU

Design of High Speed Mac Unit

Partial Character Decoding for Improved Regular Expression Matching in FPGAs

AN FPGA BASED OVERLAPPED QUASI CYCLIC LDPC DECODER FOR WI-MAX

On - Line Path Delay Fault Testing of Omega MINs M. Bellos 1, E. Kalligeros 1, D. Nikolos 1,2 & H. T. Vergos 1,2

Folding. Hardware Mapped vs. Time multiplexed. Folding by N (N=folding factor) Node A. Unfolding by J A 1 A J-1. Time multiplexed/microcoded

A DYNAMIC ACCESS CONTROL WITH BINARY KEY-PAIR

Announcements. Lecture Caching Issues for Multi-core Processors. Shared Vs. Private Caches for Small-scale Multi-core

High Speed Area Efficient VLSI Architecture for DCT using Proposed CORDIC Algorithm

The AMDREL Project in Retrospective

BER Evaluation of LDPC Decoder with BPSK Scheme in AWGN Fading Channel

Partly Parallel Overlapped Sum-Product Decoder Architectures for Quasi-Cyclic LDPC Codes

We don t need no generation - a practical approach to sliding window RLNC

PERFORMANCE ANALYSIS OF HIGH EFFICIENCY LOW DENSITY PARITY-CHECK CODE DECODER FOR LOW POWER APPLICATIONS

Learning Convention Propagation in BeerAdvocate Reviews from a etwork Perspective. Abstract

Memory Efficient Decoder Architectures for Quasi-Cyclic LDPC Codes

Acoustic Links. Maximizing Channel Utilization for Underwater

HIGH-THROUGHPUT MULTI-RATE LDPC DECODER BASED ON ARCHITECTURE-ORIENTED PARITY CHECK MATRICES

Cost efficient FPGA implementations of Min- Sum and Self-Corrected-Min-Sum decoders

3-D IMAGE MODELS AND COMPRESSION - SYNTHETIC HYBRID OR NATURAL FIT?

MULTI-RATE HIGH-THROUGHPUT LDPC DECODER: TRADEOFF ANALYSIS BETWEEN DECODING THROUGHPUT AND AREA

A Dual-Hamiltonian-Path-Based Multicasting Strategy for Wormhole-Routed Star Graph Interconnection Networks

Design of a Quasi-Cyclic LDPC Decoder Using Generic Data Packing Scheme

Methods for Multi-Dimensional Robustness Optimization in Complex Embedded Systems

This fact makes it difficult to evaluate the cost function to be minimized

High-level synthesis under I/O Timing and Memory constraints

SVC-DASH-M: Scalable Video Coding Dynamic Adaptive Streaming Over HTTP Using Multiple Connections

COSSIM An Integrated Solution to Address the Simulator Gap for Parallel Heterogeneous Systems

A Dictionary based Efficient Text Compression Technique using Replacement Strategy

Piecewise Linear Approximation Based on Taylor Series of LDPC Codes Decoding Algorithm and Implemented in FPGA

MULTI-RATE HIGH-THROUGHPUT LDPC DECODER: TRADEOFF ANALYSIS BETWEEN DECODING THROUGHPUT AND AREA

Outline: Software Design

Accommodations of QoS DiffServ Over IP and MPLS Networks

Reduced Complexity of Decoding Algorithm for Irregular LDPC Codes Using a Split Row Method

The Tofu Interconnect D

A {k, n}-secret Sharing Scheme for Color Images

SSD Based First Layer File System for the Next Generation Super-computer

HIGH THROUGHPUT LOW POWER DECODER ARCHITECTURES FOR LOW DENSITY PARITY CHECK CODES

Uplink Channel Allocation Scheme and QoS Management Mechanism for Cognitive Cellular- Femtocell Networks

International Journal of Engineering Trends and Technology (IJETT) - Volume4Issue5- May 2013

Low Complexity Quasi-Cyclic LDPC Decoder Architecture for IEEE n

COST PERFORMANCE ASPECTS OF CCD FAST AUXILIARY MEMORY

Partially-Parallel LDPC Decoder Achieving High-Efficiency Message-Passing Schedule

Dynamic Backlight Adaptation for Low Power Handheld Devices 1

RECENTLY, low-density parity-check (LDPC) codes have

Analysis of input and output configurations for use in four-valued CCD programmable logic arrays

Post-K Supercomputer with Fujitsu's Original CPU, A64FX Powered by Arm ISA

Algorithms, Mechanisms and Procedures for the Computer-aided Project Generation System

Graph-Based vs Depth-Based Data Representation for Multiview Images

VLSI Architectures for Communications and Signal Processing Kiran Gunnam

Abstract. Key Words: Image Filters, Fuzzy Filters, Order Statistics Filters, Rank Ordered Mean Filters, Channel Noise. 1.

Dr.Hazeem Al-Khafaji Dept. of Computer Science, Thi-Qar University, College of Science, Iraq

Performance of Histogram-Based Skin Colour Segmentation for Arms Detection in Human Motion Analysis Application

Self-aware and Self-expressive Camera Networks

Definitions Homework. Quine McCluskey Optimal solutions are possible for some large functions Espresso heuristic. Definitions Homework

Cluster-based Cooperative Communication with Network Coding in Wireless Networks

Dynamic Programming. Lecture #8 of Algorithms, Data structures and Complexity. Joost-Pieter Katoen Formal Methods and Tools Group

Tradeoff Analysis and Architecture Design of High Throughput Irregular LDPC Decoders

Multi-Rate Reconfigurable LDPC Decoder Architectures for QC-LDPC codes in High Throughput Applications

Multi-hop Fast Conflict Resolution Algorithm for Ad Hoc Networks

A Reduced Routing Network Architecture for Partial Parallel LDPC Decoders

Quad copter Control Using Android Smartphone

A Load-Balanced Clustering Protocol for Hierarchical Wireless Sensor Networks

Automated System for the Study of Environmental Loads Applied to Production Risers Dustin M. Brandt 1, Celso K. Morooka 2, Ivan R.

LowcostLDPCdecoderforDVB-S2

Zippy - A coarse-grained reconfigurable array with support for hardware virtualization

1. Introduction. 2. The Probable Stope Algorithm

Low Complexity Opportunistic Decoder for Network Coding

Alleviating DFT cost using testability driven HLS

ERROR correcting codes are used to increase the bandwidth

Overlapped Scheduling for Folded LDPC Decoding Based on Matrix Permutation

Cross-layer Resource Allocation on Broadband Power Line Based on Novel QoS-priority Scheduling Function in MAC Layer

Facility Location: Distributed Approximation

HEXA: Compact Data Structures for Faster Packet Processing

NONLINEAR BACK PROJECTION FOR TOMOGRAPHIC IMAGE RECONSTRUCTION. Ken Sauer and Charles A. Bouman

System-Level Parallelism and Throughput Optimization in Designing Reconfigurable Computing Applications

What are Cycle-Stealing Systems Good For? A Detailed Performance Model Case Study

THROUGHPUT EVALUATION OF AN ASYMMETRICAL FDDI TOKEN RING NETWORK WITH MULTIPLE CLASSES OF TRAFFIC

Making Light Work of the Future IP Network

Improved Circuit-to-CNF Transformation for SAT-based ATPG

The recursive decoupling method for solving tridiagonal linear systems

Performance Improvement in a Multi Cluster using a Modified Scheduling and Global Memory Management with a Novel Load Balancing Mechanism

Multi-Channel Wireless Networks: Capacity and Protocols

DETECTION METHOD FOR NETWORK PENETRATING BEHAVIOR BASED ON COMMUNICATION FINGERPRINT

A Novel Validity Index for Determination of the Optimal Number of Clusters

Cluster-Based Cumulative Ensembles

Maximizing the Throughput-Area Efficiency of Fully-Parallel Low-Density Parity-Check Decoding with C-Slow Retiming and Asynchronous Deep Pipelining

Hardware Implementation

Anbuselvi et al., International Journal of Advanced Engineering Technology E-ISSN

XML Data Streams. XML Stream Processing. XML Stream Processing. Yanlei Diao. University of Massachusetts Amherst

arxiv: v1 [cs.db] 13 Sep 2017

Reevaluating the overhead of data preparation for asymmetric multicore system on graphics processing

COMPARISON OF SIMPLIFIED GRADIENT DESCENT ALGORITHMS FOR DECODING LDPC CODES

Optimizing Sparse Matrix Operations on GPUs using Merge Path

A High-Throughput FPGA Implementation of Quasi-Cyclic LDPC Decoder

On the Generation of Multiplexer Circuits for Pass Transistor Logic

35 th Design Automation Conference Copyright 1998 ACM

Transcription:

DECODING OF ARRAY LDPC CODES USING ON-THE FLY COMPUTATION Kiran Gunnam, Weihuang Wang, Eunheol Kim, Gwan Choi, Mark Yeary * Dept. of Eletrial Engineering, Texas A&M University, College Station, TX-77840 * Dept. of Eletrial and Computer Engineering, University of Oklahoma, Norman, OK-73109 ABSTRACT Message passing memory takes around 30% of hip area and onsumes from 50%-90% power of the typial semiparallel deoders for the Low Density Parity Chek Codes (LDPC). We propose a new LDPC Deoder arhiteture based on the Min Sum algorithm that redues the need of message passing memory by 80% and the routing requirements by more than 50%. This novel arhiteture is based on sheduling of omputation that results in on the fly omputation of variable node and hek node reliability messages. The results are memory-effiient and router-less implementations of (3,30) ode of length 1830 and (3,6) ode of length 1226; eah on a Xilinx Virtex 2V8000 FPGA devie ahieved 1.27 Gbps and 585 Mbps respetively. EXTENDEDED ABSTRACT Low Density Parity Chek Codes (LDPC) odes whih are among the Shannon limit odes have been given intensive attention in reent few years due to their merits in implementing a high throughput, low lateny deoder. A suboptimal deoding, Sum of Produt (SP), algorithm has been proposed for near Shannon limit performane and its approximate version, Offset Min- Sum (MS), algorithm has also been proposed [1]. Offset MS redues the omplexity of the deoding by removing the non-linear operations needed in SP. We present a new arhiteture that exploits the various properties of strutured Array LDPC odes [2] and the value reuse properties of Min-Sum algorithm to redue the memory, routing and omputational requirements. The key features of this arhiteture are: 1. 80% savings in message passing memory requirements when ompared to other semiparallel arhitetures based on MS and its variants[10-11] 2. Salable for any ode length due to the onentri and regular layout unlike the fully parallel arhiteture [3] 3. Redution of router muxes anywhere from 50% and beyond based on dynami state onept. REDUCED MESSAGE PASSING MEMORY AND ROUTER SIMPLIFICATION Array odes are defined in [2] and have three parameters ( d v, d ) and length N. Here d v is the variable node degree and d is the hek node degree. The size of irulant matrix blok in array ode is a prime number and is given by p = N / d. H matrix is given for different odes in Figure 1. Most of the previous work is in the area of semi-parallel implementation of strutured LDPC odes, however most of them are based on SP, for instane[5-9]. [10-11] proposed arhitetures based on MS and its variants. In the arhiteture of [8], the hek node messages in the H matrix are produed blok olumn wise so that all the variable messages in eah blok olumn an be produed on the fly. Again these variable-node messages an be immediately onsumed by the partial state omputation sub-units in Chek Node Units. This sheduling results in savings in message passing memory that is needed to store intermediate messages. This work extends above onepts used for SP to the Offset MS. Cyli shifters take around 10%-20% of hip area based on the deoder s parallelization and onstitute the ritial path of the deoder. We make an observation that if all the blok rows are assigned to different omputational unit arrays of Chek Node Unit(CNU) and serial CNU proessing aross blok row is employed, then we need to have a onstant wiring to ahieve any yli shift as eah subsequent shift an be realized using the feedbak of previous shifted value. This leads to the elimination of forward router between CNU and Variable Node unit (VNU) as well as the reverse router between VNU and CNU. This is possible due to the fat that blok-serial proessing is employed and Array odes have a onstant inremental shift in eah blok row. For the first blok row, the shift and inremental shift is 0. For the seond blok row, the shifts are [0,1,2,, d 1 ] and the inremental shift is 1. For the third blok row, the shifts are [0, 2,, 2*( d 1) ] and the inremental shift is 2. NEW CHECK NODE UNIT MICRO ARCHITECTURE The proposed serial and parallel Chek node unit design[13] utilizes a less known property of the

min-sum algorithm that the hek node proessing produes only two different output magnitude values irrespetive of the number of inoming variable-node messages. [10] resorts to the use of 2* d omparators and additional proessing suh as offset orretion and 2 s omplement for all d messages and does not utilize this property. This property would greatly simplify the number of omparisons required as well as the memory needed to store CNU outputs. Figure 2 shows the serial CNU arhiteture for (3, 30) ode. In the first 30 lok yles of the hek node proessing, inoming variable messages are ompared with the two up-to-date least minimum numbers (partial state, PS) to generate the new partial state, whih inlude the least minimum value, M1, the seond minimum value M2 and index of M1. Final state (FS) is then omputed by offsetting the partial state. It should be noted that the final state inludes only three signed numbers, i.e. M1, -M1, +/-M2 with offset orretion, and index of M1. VNU miroarhiteture is implemented as a parallel unit as the number of inputs is small. It takes 3 Chek node messages and one hannel value. It is a binary tree adder followed by subtrators and 2 s omplement to signed magnitude onversion to generate Variable node messages [11]. ARCHITECTURE Figures 3 and 4 present the proposed arhiteture and pipeline sheduling for the implementation of (3, 30) Array LDPC ode of length 1830 with the irulant matrix size of 61. The Chek Node proessing unit array is omposed of 3 sub-arrays. Eah sub-array ontains 61 serial CNUs whih ompute the partial state for eah blok row to produe the heknode messages for eah blok olumn of H. Blok row 1 is array of 61 simple CNUs. CNU array blok row 2 and 3 are omposed of dynami CNUs(Fig 2b). The Variable node proessing array is omposed of 61 parallel VNU units whih an proess 3*61 messages at eah lok yle. The sign bits will be stored in a FIFO (implemented as RAM), however, there is no need to subjet these values to shifts as these values are not modified in hek node proessing partial state proessing. In the array of simple serial CNU that is designed to do hek node proessing for first blok row in H matrix, the hek node proessing for eah row in H matrix is done suh that all the omparisons are performed loally with in one CNU to update the partial state eah lok yle and transfer the partial state to final state d one every yle. In the array of dynami CNU designed for seond blok row in H matrix, CNU 122 gets its partial state from CNU 121, CNU 121 gets its partial state from CNU 120 and so on. Array of dynami CNU designed for third blok row in H matrix suh that the onnetion between partial state registers among various units ahieve yli shifts of [0,2,..,58]. Similar priniple is used when making onnetions for the final state in the CNU array to ahieve reverse routing. As shown in Figure 4, initially the variable messages are available in row wise as they are set to soft log likelihood information (LLR) of the bits oming from the hannel. Q Init is an SRAM of size 2 * N and holds the hannel LLR values of two different frames. It an supply p intrinsi values to the VNUs eah lok yle. The data path of the design is set to 5 bits to provide the same BER performane as that of the floating point Sum of produts algorithm with 0.1-0.2 db SNR loss [1]. Eah iteration takes d + 3 lok yles. For (3, 30) ode this results in 6*33 lok yles to proess eah frame when a maximum number of iterations set to 6. For (3,6) ode this results in 20*9 lok yles to proess eah frame when number of iterations is set to 20. RESULTS AND PERFOMANCE COMPARISON The savings in message passing memory due to sheduling are 80% as we need to store only the sign bits of variable node messages. Forward router and reverse routers are eliminated. This results in redution of number of multiplexers from 2*( d 1) *log 2( p) * p * wl (as routers are eliminated) to ( d 1) * p *(3* wl + log 2( d ) + 1) (to support transfer of partial state to final state in the array of dynami CNU). Here wl = 5 and is the word length of the data path. Table 1 shows resoure onsumption of different omponents used in the design for (3, 30) ode of length 1830. Implementations for (3, 30) odes of lengths 1830 and (3,6) ode of length 1226 on a Xilinx Virtex 2V3000 devie ahieved 1.2 Gbps(system frequeny 153 MHz) and 340 Mbps(system frequeny 140 MHz) respetively. Up to our best knowledge our LDPC implementations ahieves the highest throughput per given FPGA resoures. Figure 5 gives omparison of design metris for our designs with the other designs [10, 12] based on similar ode parameters and Min Sum implementation. REFERENCES [1] J. Chen and M. Fossorier, ``Near Optimum Universal Belief Propagation Based Deoding of Low-Density Parity Chek Codes,'' IEEE Transations on Communiations, vol. COM-50, pp. 406-414, Marh 2002.

[2] E. Eleftheriou and S. Oler, Low density parity-hek odes for digital subsriber lines, Pro. ICC 2002, New York, pp.1752-1757(2002). [3] Blanksby, A.J.; Howland,C.J, A 690-mW 1-Gb/s 1024-b, rate-1/2 low-density parity-hek ode deoder, Solid-State Ciruits, IEEE Journal of, Vol.37, Iss.3, Mar 2002 Pages:404-412 [4] T. Zhang and Parhi, A 54 Mbps (3, 6)-regular FPGA LDPC deoder, IEEE Workshop on Signal Pro. Systems, 2002. (SIPS '02), Ot. 2002, pp. 127-132. [5] Yijun Li; Elassal, M.; Bayoumi, M., "Power effiient arhiteture for (3,6)-regular low-density parity-hek ode deoder," Ciruits and Systems, 2004. ISCAS '04. Proeedings of the 2004 International Symposium on, vol.4, no.pp. IV- 81-4 Vol.4, 23-26 May 2004 [6] A. Selvarathinam, G.Choi, K. Narayanan, A.Prabhakar, E. Kim, A Massively Salable Deoder Arhiteture for Low-Density Parity- Chek Codes, in proeedings of ISCAS 2003, Bangkok, Thailand. [7] Mansour, M.M.; Shanbhag, N.R., "A 640-Mb/s 2048-bit programmable LDPC deoder hip," Solid-State Ciruits, IEEE Journal of, vol.41, no.3pp. 684-698, Marh 2006. [8] K. Gunnam, G. Choi and M. B. Yeary, An LDPC Deoding Shedule for Memory Aess Redution, IEEE International Conferene on Aoustis, Speeh, and Signal Proessing (ICASSP 2004) [9] Bhagawat, P.; Uppal, M.; Choi, G FPGA based implementation of deoder for array low-density parity-hek odes.; Aoustis, Speeh, and Signal Proessing, 2005. Proeedings. (ICASSP '05). IEEE International Conferene on Volume 5, 18-23 Marh 2005 Page(s):v/29 - v/32 Vol. 5 [10] Karkooti, M.; Cavallaro, J.R.Semi-parallel reonfigurable arhitetures for real-time LDPC deoding ; Information Tehnology: Coding and Computing, 2004. Proeedings. ITCC 2004. International Conferene on Volume 1, 2004 Page(s):579-585 Vol.1 [11]Yeo, E.; Pakzad, P.; Nikoli, B.; Anantharam, V., High throughput low-density parity-hek deoder arhitetures in proeedings of Global Teleommuniations Conferene, 2001.Volume: 5, Page(s): 3019 3024 [12] T. Brak, F. Kienle and N. Wehn. Dislosing the LDPC Code Deoder Design Spae Design, Automation and Test in Europe (DATE) Conferene 2006, pages 200-205, Marh 2006, Munih, Germany [13] Kiran Gunnam, Gwan Choi, A Low Power Arhiteture for Min- Sum Deoding of LDPC Codes, Available online at http://dropezone.tamu.edu/tehpubs. TAMU, ECE Tehnial Report, May, 2006, Publiation No. TAMU-ECE-2006-02.

H = I I I... I I 2... 5 I... 2 4 10 0 0... 0 1 1 0... 0 0 = 0 1... 0 0............... 0 0... 1 0 I I (a) I... I H = I 2... 29 I 2 4... 58 (b) Figure 1: H matrix of LDPC odes: (a) (3, 6) ode, rate 0.5, p=211, length 1266; (b) (3, 30) ode, rate 0.9, p=61, length 1830 Figure 2: Chek node proessing unit, Q: Variable node message, R: Chek node message. (a) simple sheme; (b) dynami sheme.

Figure 3: Arhiteture Figure 4: Pipeline

Table 1: FPGA results (Devie: Xilinx 2v8000ff1152-5) No. Slies No. 4-input No. Slie Operating LUT Flip-flops frequeny(mhz) CNU simple 45 70 53 236 CNU dynami 55 102 55 193 CNU array blok row 1 2599 4285 2980 CNU array blok row 2,3 3464 6438 2967 CNU array 9230 17143 8875 196 VNU 27 42 25 169 VNU array 1623 2562 1525 169 Top 11695 19732 10733 153 Total number available 46592 93184 93184 p=61, rate 0.9, length 1830 p=211, rate 0.5, length 1266 M. Karkooti et al., rate 0.5, length 1536 T. Brak, et al., rate 0.8, length 3000 54855 1270 64.4 23040 30520 585 11352 11695 19732 15534 20374 5490 3798 18300 15360 153 12660 140121 180 127 10.7 6.2 No. Slies Message Input Frequeny No. LUT's passing buffer (MHz) memory (bits) (bits) Figure 5: Results omparison Throughput (Mbps) Throughput per LUT (Kbps)