Toward Realtime Side Information Decoding On Multi-Core Processors


MITSUBISHI ELECTRIC RESEARCH LABORATORIES
http://www.merl.com

Svetislav Momcilovic, Yige Wang, Shantanu Rane, Anthony Vetro

TR2010-100 December 2010

Abstract

Most distributed source coding schemes involve the application of a channel code to the signal and transmission of the resulting syndromes. For low-complexity encoding with superior compression performance, graph-based channel codes such as LDPC codes are used to generate the syndromes. The encoder performs simple XOR operations, while the decoder uses belief propagation (BP) decoding to recover the signal of interest using the syndromes and some correlated side information. We consider parallelization of BP decoding on general-purpose multi-core CPUs. The motivation is to make BP decoding fast enough for realtime applications. We consider three different BP decoding algorithms: Sum-Product BP, Min-Sum BP and Algorithm E. The speedup obtained by parallelizing these algorithms is examined along with the tradeoff against decoding performance. Parallelization is achieved by dividing the received syndrome vectors among different cores, and by using vector operations to simultaneously process multiple check nodes in each core. While Min-Sum BP has intermediate decoding complexity, a vectorized version of Min-Sum BP performs nearly as fast as the much simpler Algorithm E with significantly fewer decoding errors. Our experiments indicate that, for the best compromise between speed and performance, the decoder should use Min-Sum BP when the side information is of good quality and Sum-Product BP otherwise.

Multimedia Signal Processing Workshop

This work may not be copied or reproduced in whole or in part for any commercial purpose. Permission to copy in whole or in part without payment of fee is granted for nonprofit educational and research purposes provided that all such whole or partial copies include the following: a notice that such copying is by permission of Mitsubishi Electric Research Laboratories, Inc.; an acknowledgment of the authors and individual contributions to the work; and all applicable portions of the copyright notice. Copying, reproduction, or republishing for any other purpose shall require a license with payment of fee to Mitsubishi Electric Research Laboratories, Inc. All rights reserved.

Copyright © Mitsubishi Electric Research Laboratories, Inc., 2010
201 Broadway, Cambridge, Massachusetts 02139

I. INTRODUCTION

Distributed source coding is an attractive option for sensor networks and surveillance systems in which image or video is acquired and compressed using low-cost hardware. This method of compression involves encoding the acquired signal conditioned on some statistically correlated side information at the decoder.
For example, in the case of video signals, the side information for the current video frame can be furnished by a motion compensated version of the previous decoded frame. Distributed source coding draws on information-theoretic results on lossless coding of correlated sources [1] and rate-distortion tradeoffs for encoding of correlated sources [2].

Recent years have seen a revival in distributed source coding, in particular distributed video coding [3], which has exploited graph-based channel codes. Nearly all implementations of distributed compression systems involve appropriately quantizing the signal of interest, and then extracting parity or syndrome symbols from it by applying a channel code. The syndromes constitute the compressed bit stream, which is transmitted to the decoding station, where they are combined with the side information and fed to a channel decoder. The channel decoder essentially treats the side information as an error-prone version of the signal of interest and uses the received syndromes to correct the errors, thereby recovering the signal of interest. This process, encompassing the side information decoding as well as the distortion introduced by quantization, is referred to as Wyner-Ziv coding.

S. Momcilovic is with INESC-ID TU Lisbon, Portugal. This work was carried out when he was an intern at MERL.
978-1-4244-8112-5/10/$26.00 ©2010 IEEE

In theory, any channel code can be used in this way. In practice, however, graph-based channel codes such as Low-density Parity Check (LDPC) codes or Turbo Codes are preferred over hard-decision algebraic schemes such as Reed-Solomon codes or BCH codes owing to very low encoding complexity, and availability of soft-decision decoding algorithms that achieve better channel coding performance. LDPC encoding, for example, involves simple XOR operations, while LDPC decoding involves running a Belief Propagation (BP) algorithm to recover the signal of interest. An example of a distributed video coding system using an LDPC code is shown in Fig. 1. BP decoding is much more complex than the operations at the encoder.
In this paper, we consider BP decoding on general-purpose multi-core CPUs. The motivation is to make BP decoding fast enough for realtime decoding of time-sensitive signals such as surveillance videos. This is a practical requirement that has remained relatively unexplored in the literature; the emphasis has been on pure compression performance. In a detailed evaluation of the DISCOVER codec [4] on a general-purpose dual-core machine, the authors report that, to achieve high reconstruction quality even for QCIF-sized video frames, side information decoding of a single frame required 4-8 seconds depending upon the video content. By exploiting parallelization on multi-core CPUs used on consumer-level computers, we take a step toward realtime decoding of Wyner-Ziv coded signals. The techniques in this paper also apply to the recent parallel implementations of BP on Graphics Processing Units (GPUs) based on NVIDIA's CUDA-based parallel computing architecture [5], [6], [7].

LDPC codes were invented by Gallager in the 1960s [8], but were ignored because of the limited processing capabilities at the time. They were rediscovered by MacKay and Neal [9] and have since received increased attention due to their near-Shannon-limit error performance. For distributed source coding, a class of rate-adaptive codes called LDPC Accumulate (LDPCA) codes [10] has become very popular. An LDPCA code is an LDPC code concatenated with an accumulator. The accumulator allows syndromes to be transmitted incrementally until decoding succeeds. For any set of accumulated syndromes, the decoding procedure is the same as that of a conventional LDPC code. In this paper, we focus on the LDPC decoding, keeping in mind that LDPCA decoding would need repeated invocations of LDPC decoding.

Fig. 1. The main components of a distributed video coding system. Decoding of only one bitplane is shown.

The remainder of the paper is organized as follows. Section II describes the three LDPC decoding algorithms evaluated in this paper. In particular, the calculations performed at the check nodes and variable nodes are described. Section III describes how parallelization is achieved by dividing decoding tasks among multiple processor cores and by incorporating vector instructions within each core. In Section IV, the speedup obtained via parallelization of the three BP decoding algorithms is discussed along with the tradeoff in decoding performance.

II. LDPC DECODING ALGORITHMS

An (N, K) LDPC code is defined as the null space of a sparse parity check matrix $H_{M \times N}$, where N is the code length, K is the code dimension, and $M \geq N - K$. The rate of the LDPC code is $R = K/N$. An LDPC code can also be represented by a bipartite Tanner graph with two types of nodes: variable nodes and check nodes. Each row in H corresponds to a check node and each column corresponds to a variable node; the i-th check node is connected to the j-th variable node if and only if $H(i, j) = 1$.

Assume that the vectors being encoded are binary. For image and video applications, non-binary vectors constructed from blocks of pixels are converted into binary and the individual bitplanes are provided as inputs to the LDPC encoder, as shown in Fig. 1. Encoding consists of calculating the syndromes according to $s = Hc$, where c is the input binary sequence or bitplane and s is the syndrome vector. If $R < 1$, the syndrome vector s represents a compressed encoded version of the input vector c, and is transmitted to the decoder.
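The encoding step $s = Hc$ can be sketched in a few lines. This is a minimal illustration, assuming the product is taken modulo 2 (consistent with the XOR operations mentioned in the introduction) and that H is stored densely as a list of rows. The function name `syndrome` and the toy 3 x 6 matrix are hypothetical and are not taken from the paper.

```python
def syndrome(H, c):
    """Return s = H c over GF(2): each syndrome bit is the XOR of the
    bits of c selected by the ones in the corresponding row of H."""
    s = []
    for row in H:
        bit = 0
        for h, cj in zip(row, c):
            bit ^= h & cj
        s.append(bit)
    return s


# Toy 3 x 6 parity check matrix (illustrative only, not from the paper).
H = [
    [1, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 1],
]
c = [1, 0, 1, 1, 0, 0]   # input bitplane, N = 6 bits
s = syndrome(H, c)       # syndrome vector, M = 3 bits
```

Here six input bits are represented by three syndrome bits; a practical encoder would store H sparsely rather than as dense rows.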
To initialize side information decoding, the variable nodes in the LDPC code graph are populated with a hypothesis about the bits to be recovered. This hypothesis is obtained using a side information vector v which is correlated to the vector c that is to be recovered. In the simplest case, a starting hypothesis for c is the vector v itself. In general, the starting hypothesis expresses the likelihood that the bit $v_i$ in the i-th variable node has value 0 or 1. The check nodes are associated with the bits from the received syndrome vector s. Each check node specifies a constraint equation satisfied by all the variable nodes connected to it. In BP decoding, messages are passed between the variable nodes and check nodes with the aim of enforcing these constraints. The messages propagate the beliefs at a given node to the other nodes connected to it. After a few iterations of message passing, the variable nodes should satisfy the constraints imposed by the check bits, in which case the decoding is deemed to be successful.

BP decoding can be realized via a fully parallel flooding-type scheduling, a fully serial shuffled-type scheduling [11], or a partially parallel group-shuffled approach [11]. Here, we focus on the fully parallel scheme, in which the flow of operations is as indicated in Fig. 2. We consider three decoding algorithms: Sum-Product BP, Min-Sum BP and Algorithm E. The decoding operations for each of these three algorithms are detailed below.

As a setup step, using binary phase shift keying (BPSK), the sequence to be encoded, i.e., c, is mapped into x according to $x = 1 - 2c$. The destination observes a side information sequence y, where $y = 1 - 2v \in \{-1, 1\}$. Denote the set of variable nodes connected to check node j by $\mathcal{N}(j) = \{k : H_{jk} = 1\}$ and the set of check nodes connected to variable node k by $\mathcal{M}(k) = \{j : H_{jk} = 1\}$. Denote by $\mathcal{N}(j) \setminus k$ the set $\mathcal{N}(j)$ with variable node k excluded, and by $\mathcal{M}(k) \setminus j$ the set $\mathcal{M}(k)$ with check node j excluded.
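The neighbourhood sets and the BPSK mapping defined above translate directly into code. The sketch below assumes a dense list-of-rows H and zero-based node indices; the helper names `neighbourhoods` and `bpsk` and the toy matrix are hypothetical.

```python
def neighbourhoods(H):
    """Return (N, M): N[j] lists the variable nodes connected to check
    node j, and M[k] lists the check nodes connected to variable node k."""
    n_checks, n_vars = len(H), len(H[0])
    N = [[k for k in range(n_vars) if H[j][k] == 1] for j in range(n_checks)]
    M = [[j for j in range(n_checks) if H[j][k] == 1] for k in range(n_vars)]
    return N, M


def bpsk(bits):
    """Map bits {0, 1} to symbols {+1, -1} via x = 1 - 2c."""
    return [1 - 2 * b for b in bits]


# Toy 3 x 6 parity check matrix (illustrative only, not from the paper).
H = [
    [1, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 1],
]
N, M = neighbourhoods(H)
N0_without_1 = [k for k in N[0] if k != 1]   # the set N(0) \ 1
y = bpsk([1, 0, 1, 1, 0, 1])                 # side information v mapped to y
```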
The following notation is used for the i-th iteration of message passing:

$u_{mn}^{(i)}$: message from check node m to variable node n
$v_{mn}^{(i)}$: message from variable node n to check node m
$v_n^{(i)}$: belief of variable node n
$u_n$: message from the side-information channel for variable node n

A. Sum-Product BP Decoding Operations

1) Initialization: Set $i = 1$, and the maximum number of iterations to $I_{MAX}$. For $1 \leq n \leq N$, set $u_n = \ln \frac{P(x_n = 1 \mid y_n)}{P(x_n = -1 \mid y_n)}$. For each m, n, set $v_{mn}^{(0)} = u_n$.

2) Iterative decoding:

(a) Perform check node calculations, i.e., for $1 \leq m \leq M$ and each $n \in \mathcal{N}(m)$,

\[ u_{mn}^{(i)} = 2 \tanh^{-1} \left( \prod_{n' \in \mathcal{N}(m) \setminus n} \tanh \frac{v_{mn'}^{(i-1)}}{2} \right) \tag{1} \]

For details about the derivation of the above formula, the reader is referred to [12].

(b) Perform variable node calculations, i.e., for $1 \leq n \leq N$ and each $m \in \mathcal{M}(n)$,

\[ v_{mn}^{(i)} = u_n + \sum_{m' \in \mathcal{M}(n) \setminus m} u_{m'n}^{(i)} \tag{2} \]

3) Hard decision and stopping criterion test: Set $v_n^{(i)} = u_n + \sum_{m \in \mathcal{M}(n)} u_{mn}^{(i)}$. Create $\hat{c}^{(i)} = [\hat{c}_n^{(i)}]$ such that $\hat{c}_n^{(i)} = 1$ if $v_n^{(i)} < 0$, and $\hat{c}_n^{(i)} = 0$ otherwise. If $H\hat{c}^{(i)} = s$ or if the number of iterations has reached $I_{MAX}$, stop decoding and go to Step 4. Otherwise set $i := i + 1$ and go to Step 2.

4) Output $\hat{c}^{(i)}$ as the decoded codeword.
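Steps 1 to 4 can be sketched as a small flooding-schedule decoder. This is a sketch under stated assumptions, not the authors' implementation:

- The side-information channel is assumed to be a BSC with crossover probability `p`, so that $u_n = y_n \ln((1-p)/p)$.
- The check node update is multiplied by the syndrome sign $(1 - 2 s_m)$. This is the usual adaptation for syndrome decoding and is not shown in the equation as printed.
- The function name, the default `max_iter`, the clipping constant and the toy (7,4) Hamming matrix are hypothetical.

```python
import math


def sum_product_decode(H, s, v, p, max_iter=50):
    """Flooding-schedule Sum-Product BP for syndrome decoding.

    H: parity check matrix (list of rows), s: syndrome bits,
    v: side-information bits, p: assumed BSC crossover probability.
    Returns (decoded bits, number of iterations used).
    """
    n_checks, n_vars = len(H), len(H[0])
    N = [[n for n in range(n_vars) if H[m][n]] for m in range(n_checks)]
    M = [[m for m in range(n_checks) if H[m][n]] for n in range(n_vars)]

    # Step 1: channel messages u_n for y_n = 1 - 2 v_n under a BSC(p).
    llr = math.log((1 - p) / p)
    u = [(1 - 2 * vn) * llr for vn in v]
    v_msg = {(m, n): u[n] for m in range(n_checks) for n in N[m]}
    u_msg = {}
    c_hat = list(v)

    for i in range(1, max_iter + 1):
        # Step 2(a): check node update, with the syndrome sign applied
        # (assumption, see the lead-in).
        for m in range(n_checks):
            for n in N[m]:
                prod = 1.0
                for n2 in N[m]:
                    if n2 != n:
                        prod *= math.tanh(v_msg[(m, n2)] / 2)
                # Clip so that atanh stays finite.
                prod = max(-0.999999, min(0.999999, prod))
                u_msg[(m, n)] = (1 - 2 * s[m]) * 2 * math.atanh(prod)

        # Step 2(b) and Step 3: beliefs and variable node messages.
        beliefs = []
        for n in range(n_vars):
            total = u[n] + sum(u_msg[(m, n)] for m in M[n])
            beliefs.append(total)
            for m in M[n]:
                v_msg[(m, n)] = total - u_msg[(m, n)]

        # Step 3: hard decision and stopping test H c_hat = s (mod 2).
        c_hat = [1 if b < 0 else 0 for b in beliefs]
        if all(sum(c_hat[n] for n in N[m]) % 2 == s[m]
               for m in range(n_checks)):
            return c_hat, i
    return c_hat, max_iter


# Toy example on a (7,4) Hamming parity check matrix (illustrative only).
H = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]
c = [1, 0, 1, 1, 0, 0, 1]
s = [1, 0, 0]               # s = H c (mod 2)
v = [1, 0, 0, 1, 0, 0, 1]   # side information: c with bit 2 flipped
decoded, iters = sum_product_decode(H, s, v, p=0.05)
```

In this toy case the single side-information error is corrected in the first iteration. Codes this short have many short cycles, so the example illustrates the message flow only and says nothing about decoding performance.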

Fig. 2. A flow diagram containing a high-level summary of the sequence of operations performed in BP decoding. The diagram marks two parallelization points: the syndrome vectors are divided among cores, and vector instructions are used within a core.

B. Min-Sum BP Decoding Operations

The Min-Sum algorithm [13] is a simplified version of Sum-Product BP. All decoding steps are the same as those in Sum-Product BP except the check node update in (1), which is now approximated by

\[ u_{mn}^{(i)} = \prod_{n' \in \mathcal{N}(m) \setminus n} \mathrm{sgn}\left(v_{mn'}^{(i-1)}\right) \cdot \min_{n' \in \mathcal{N}(m) \setminus n} \left| v_{mn'}^{(i-1)} \right| \tag{3} \]

As decoding primarily involves additions and comparisons, Min-Sum BP is less complex than Sum-Product BP.

C. Algorithm E Decoding Operations

Algorithm E was proposed and analyzed in [14], [15]. It quantizes all the messages in Sum-Product BP to -1, 0, or +1 and can be carried out as follows:

1) Initialization: Set $i = 1$ and the maximum number of iterations to $I_{MAX}$. For each m, n, set $v_{mn}^{(0)} = y_n$.

2) Iterative decoding:

(a) Perform check node calculations as follows: for $1 \leq m \leq M$ and each $n \in \mathcal{N}(m)$,

\[ u_{mn}^{(i)} = \prod_{n' \in \mathcal{N}(m) \setminus n} v_{mn'}^{(i-1)} \]

(b) Perform variable node calculations as follows: for $1 \leq n \leq N$ and each $m \in \mathcal{M}(n)$,

\[ v_{mn}^{(i)} = \mathrm{sgn}\left( w^{(i)} y_n + \sum_{m' \in \mathcal{M}(n) \setminus m} u_{m'n}^{(i)} \right) \]

where $\mathrm{sgn}(x)$ takes values -1, 0 or +1 for $x < 0$, $x = 0$, and $x > 0$ respectively, and $w^{(i)}$ is a weight chosen to optimize performance. For example, in [15], $w^{(1)} = 2$ and $w^{(i)} = 1$ for $i \geq 2$ is found to optimize the decoding performance for a regular (3, 6) LDPC code.

3) Then, for the stopping criterion test, evaluate the variable node beliefs as

\[ v_n^{(i)} = \mathrm{sgn}\left( w^{(i)} y_n + \sum_{m \in \mathcal{M}(n)} u_{mn}^{(i)} \right) \]

and proceed as in the Sum-Product BP algorithm.

In terms of processing time per iteration, Algorithm E is faster than Sum-Product BP and Min-Sum BP owing to its simpler decoding operations.
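The per-node rules of Min-Sum BP and Algorithm E can be sketched for a single node as follows. Each function takes the list of messages arriving at one node and returns one outgoing message per edge, computed from the messages on the other edges. The function names are hypothetical, and the syndrome sign is omitted, as in the update rules as printed.

```python
def sgn(x):
    """Return -1, 0 or +1 for x < 0, x == 0 and x > 0 respectively."""
    return (x > 0) - (x < 0)


def min_sum_check(incoming):
    """Min-Sum check node: for each edge, the product of the signs times
    the minimum magnitude of the messages on all other edges."""
    out = []
    for k in range(len(incoming)):
        others = incoming[:k] + incoming[k + 1:]
        sign = 1
        for val in others:
            if val < 0:
                sign = -sign
        out.append(sign * min(abs(val) for val in others))
    return out


def algorithm_e_check(incoming):
    """Algorithm E check node: messages lie in {-1, 0, +1}; each outgoing
    message is the product of the messages on all other edges."""
    out = []
    for k in range(len(incoming)):
        prod = 1
        for val in incoming[:k] + incoming[k + 1:]:
            prod *= val
        out.append(prod)
    return out


def algorithm_e_variable(y_n, incoming, w):
    """Algorithm E variable node: sgn(w * y_n + sum of the check messages
    on all other edges), one value per outgoing edge."""
    total = w * y_n + sum(incoming)
    return [sgn(total - u) for u in incoming]


ms = min_sum_check([2.0, -0.5, 3.0, -1.5])
e_chk = algorithm_e_check([1, -1, 1, 0])
e_var = algorithm_e_variable(1, [-1, -1, 1], w=2)
```

The Min-Sum rule needs only comparisons and sign flips, and the Algorithm E rule needs only products over {-1, 0, +1}. This is why both map well onto uniform vector operations.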
In the high signal-to-noise ratio (SNR) regime, where the side information is accurate and only a few bits are estimated in error, an even faster algorithm called Active-Set Algorithm E, or Fast Algorithm E, has been proposed [16]. The rationale is that when the SNR is large, most messages converge quickly. Thus, it is not necessary to update every variable/check node at each iteration. The decoder just checks whether the messages entering a node are different from their values in the previous iteration. If none of the messages has changed, the node is not updated and overall decoding time is reduced with no loss of performance.

III. IMPLEMENTATION ON MULTI-CORE CPUS

A. Processor-level Parallelization

To speed up the execution of BP decoding, two kinds of parallelism are used. At the level of the processor cores, a single program/multiple data (SPMD) approach is used. This approach is useful in scenarios where the same set of instructions is executed in multiple iterations on different data. As there are no data dependencies between the iterations, they can be implemented independently and in any order on separate processor cores. The parallelization of such loops requires the creation of multiple threads and new thread contexts, by means of replication of the variables, e.g., counters, that will be private in each thread. All other variables are shared between the threads, and may be accessed concurrently by multiple threads. The creation of new thread contexts, and simultaneous access of the same variables by different processor cores, results in a parallelization overhead. The speedup obtained via parallel execution needs to be large enough to overcome the effect of parallelization overhead.

In BP decoding, as shown in Fig. 2, there are loops at three levels. The top-most loop is on the block level, where the decoding algorithm is repeated for each received M-length block of syndromes. With the SPMD approach, this loop can be executed in parallel by allocating the syndrome blocks to the multiple cores, as there are no data dependencies between blocks.
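The block-level loop described above can be sketched as a parallel map over independent syndrome blocks. The paper uses OpenMP in C/C++; the Python analogue below only illustrates the structure and makes several assumptions:

- `decode_block` is a hypothetical stand-in that merely tests the check equations rather than running BP.
- The toy matrix and the worker count are illustrative.
- Threads are used for simplicity. In CPython, CPU-bound work would need processes rather than threads to see a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy 3 x 6 parity check matrix (illustrative only, not from the paper).
H = [
    [1, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 1],
]


def decode_block(block):
    """Stand-in for one complete BP decoding of a syndrome block.
    It only reports whether the side information already satisfies the
    check equations; a real decoder would iterate as in Section II."""
    s, v = block
    satisfied = all(
        sum(h & b for h, b in zip(row, v)) % 2 == s_m
        for row, s_m in zip(H, s)
    )
    return v, satisfied


def decode_all(blocks, n_workers=4):
    """Top-level loop of Fig. 2: blocks share no data, so they can be
    handed to workers in any order; map() returns results in input order."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(decode_block, blocks))


# Each block is a (syndrome vector, side-information vector) pair.
blocks = [
    ([0, 1, 0], [1, 0, 1, 1, 0, 0]),
    ([0, 1, 0], [1, 0, 1, 1, 0, 1]),
    ([0, 0, 0], [0, 0, 0, 0, 0, 0]),
]
results = decode_all(blocks)
```

Each worker keeps its own loop state, while H is shared read-only. This mirrors the split between private and shared variables described above.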
The mid-level loop is on the iteration level, where at each iteration, messages are exchanged between variable nodes and check nodes. The exchanged messages are different in each iteration and dependent on the messages from the previous iteration. Therefore, the iterations cannot be executed in arbitrary order, and this loop cannot be parallelized via the SPMD approach. The innermost level consists of two loops for computing the variable node messages and check node messages. In these loops, the same calculations are performed at each variable or check node, and with small modifications, these loops can be executed in parallel using the SPMD approach. The proposed scheme parallelizes the check node loop rather than the variable node loop, because check node processing was found experimentally to occupy a larger fraction of the processing time. Vector instructions are used in order to achieve parallelization of the check node calculations within a processor core. This is elaborated below.

Fig. 3. Remapping the node indices to enable simultaneous processing of multiple check nodes. In this example, each small square is a 32-bit message.

B. Parallelization via Vector Instructions

Vector instructions allow each processor core to process several check nodes simultaneously, thereby reducing the processing time per iteration. In order to exploit vector instructions to the fullest, it is necessary that the calculation performed at each check node is simple and similar to the calculations performed at every other check node. Unfortunately, Sum-Product BP evaluates the tanh(·) function via direct computation or a table lookup, for which there is no efficient implementation using vector instructions. Also, Fast Algorithm E processes a node only if the messages entering it have changed since the last iteration, and this non-uniformity makes it unsuitable for vector instructions. On the other hand, Min-Sum BP and Algorithm E can both use vector instructions because, in these algorithms, every check node is processed in nearly the same way as every other check node. One interesting observation, elaborated in Section IV, is that, by using vector instructions for Algorithm E, the speedup obtained is enough to rival the speed of Fast Algorithm E.

To apply vector instructions, it is necessary to remap the messages between the check nodes and variable nodes, as shown in Fig. 3. In particular, it is necessary to arrange the messages such that they occupy W-bit blocks, where W is the size of the largest block on which additions and logical operations can be performed. Thus, for a given data type,

\[ \text{No. of calculations in parallel} = \frac{W}{\text{sizeof(datatype)}} \]

where sizeof(datatype) is taken in bits. Fig. 3 shows the data organization in memory for the case in which W = 128 and the messages are each 32 bits long, allowing a block of 4 check nodes to be processed in parallel.
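The lane count above, and one way of grouping check nodes into W-bit words, can be sketched as follows. The interleaved layout is one plausible arrangement and is not claimed to be the exact reordering of Fig. 3. The function names are hypothetical, and entries that have no real message are filled with a neutral padding value.

```python
def lanes(register_bits, message_bits):
    """Number of check nodes processed in parallel: W divided by the
    message size, both in bits."""
    return register_bits // message_bits


def pack_check_nodes(check_messages, n_lanes, neutral=0):
    """Group check nodes into blocks of n_lanes and interleave their
    messages so that word t holds the t-th message of every check node
    in the block. Check nodes with fewer messages, and missing check
    nodes in the last block, are padded with the neutral value."""
    blocks = []
    for start in range(0, len(check_messages), n_lanes):
        group = check_messages[start:start + n_lanes]
        group = group + [[]] * (n_lanes - len(group))
        depth = max(len(msgs) for msgs in group)
        words = [
            [msgs[t] if t < len(msgs) else neutral for msgs in group]
            for t in range(depth)
        ]
        blocks.append(words)
    return blocks


n = lanes(128, 32)          # 32-bit messages in a 128-bit register
# Incoming messages of five check nodes (toy values).
messages = [[1, 2, 3], [4, 5], [6, 7, 8], [9, 10, 11], [12, 13, 14]]
packed = pack_check_nodes(messages, n)
```

With this layout, one vector operation on a word updates the same edge position of every check node in the block. The padding value has to be chosen so that it does not change the result of the operation being vectorized.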
Each individual square contains a message from a variable node to the appropriate check node. The transposition ensures that messages related to a given check node are placed in successive locations in memory. Reordering as shown in the figure provides the most efficient way to process groups of 4 check nodes while inserting a minimum number of neutral messages. For the chosen word size of 32 bits, these neutral messages, marked X, are placed to ensure that the number of messages being processed is a multiple of 4. The value of these neutral messages is set to zero for additions and to one for multiplications so that they do not affect the calculations.

TABLE I
DIMENSIONS OF LDPC MATRICES

Matrix   Size          #edges   Rate
H_G      5940 × 1020   32635    0.83
H_H      5940 × 3000   32635    0.49
H_I      5940 × 5340   32635    0.10

To implement parallel check node operations in the above fashion for Min-Sum BP, the SSE¹ Vector Instruction Set from Intel is used. In Algorithm E, on the other hand, the check node calculations involve messages that take values -1, 0 or +1, and they are implemented entirely using logical operations on char, i.e., 8-bit values. Thus, in the vector implementation of Algorithm E, 16 check nodes can be processed simultaneously.

IV. EXPERIMENTAL RESULTS

All experiments were conducted on an Intel Core 2 Quad CPU Q9650 running at 3 GHz with 4 GB of RAM. The OpenMP Application Program Interface Version 3.0 [17] was used to implement parallelization of various BP decoding algorithms. This API provides C/C++ compiler directives and library routines to support shared-memory parallelism. The simulations were conducted on nine LDPC codes at various rates; we report results on the three largest codes in this paper. The parity check matrices are labeled H_G, H_H and H_I, and their dimensions are shown in Table I. These codes are all derived from a single LDPCA code with 5940 variable nodes. The code dimensions are motivated by a distributed video compression application². The syndrome vectors have to be decoded with the help of side information, e.g., the previous video frame or a motion compensated version of it.
Video studies have shown that a Laplacian model is close to the observed dependency between the source bitplane and the side information bitplane. In this work, we are interested primarily in the speedup from parallelization per iteration of LDPC decoding, not in choosing the LDPC code with the smallest number of check nodes or the LDPC code that converges in the smallest number of iterations. Since the parallelization speedup per iteration is independent of the channel model used, we assume a much simpler Binary Symmetric Channel (BSC) model between the source bitplane and the side information bitplane. Thus, if the crossover probability of the BSC is too large, the LDPC code will not be able to recover the source bitplane from the side information bitplane even after the maximum number of iterations, $I_{MAX}$, is reached. In all our simulations, $I_{MAX} = 1$.

¹ SSE = Streaming SIMD Extensions.
² Consider a video frame of size 720 × 528 pixels. An 8 × 8 blockwise Discrete Cosine Transform (DCT) is applied to the frame, each DCT coefficient is separately quantized, and the resulting bitplanes are input to the LDPC encoder. For a single LDPC code to be applied to a particular DCT coefficient, the number of variable nodes in the LDPC code must be $\frac{720 \times 528}{8 \times 8} = 5940$. There is one LDPC code for every coded bitplane of each of the 64 DCT coefficients, and each of these codes transmits a syndrome vector to the decoder.

First, consider the speedup obtained simply by dividing all the received syndrome vectors among the available processor cores. As shown in Fig. 4, the decoding time is the least for Algorithm E and the highest for Sum-Product BP. These results are for the parity check matrix H_I with a low BSC crossover probability of .5. For each algorithm, the decoding times are averaged over 1 decodings, i.e., the code is run 1 times with the same syndrome vectors but with side information randomly perturbed according to a BSC. The bar labeled LUT refers to an implementation of Sum-Product BP in which the tanh(·) function is read from a look-up table with 32-bit precision. The LUT variant runs faster than Sum-Product BP, which uses a C math function to compute tanh(·), but has worse performance than Sum-Product BP, i.e., a larger number of uncorrected errors. The second bar graph plots the decoding speedup factor $S(y)$ while the third graph plots the normalized speedup $\bar{S}(y)$. These factors are given by:

\[ S(y) = \frac{\text{Decoding time with 1 core}}{\text{Decoding time with } y \text{ cores}}, \qquad \bar{S}(y) = \frac{S(y)}{y} \]

where y is the number of cores used in parallel. The results show that, as more cores are added, the normalized speedup reduces because of the parallelization overhead associated with replicating thread contexts, and the contention that occurs when two threads access the same portions of memory.

Fig. 4. As new cores are added, the speed of BP decoding increases, but parallelization overhead prevents a normalized speedup of one per core. Panels: decoding time (ms), decoding speedup factor, and normalized speedup, each for Sum-Product, LUT, Min-Sum and Algorithm E.

Now, we describe the benefits of using vector instructions, which speed up the check node decoding operations within each core, as explained in Section III-B. Note that, for a fixed BP decoding algorithm, using vector instructions does not change the number of iterations needed for convergence. Thus the coding performance of a BP decoding algorithm and its vectorized version are identical; the latter version just executes faster per iteration.
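The speedup factor and normalized speedup discussed with Fig. 4 amount to two divisions, sketched below. The timings are hypothetical values chosen for illustration only; they are not measurements from the paper.

```python
def speedup(time_one_core, time_y_cores):
    """Decoding speedup factor S(y): single-core time over y-core time."""
    return time_one_core / time_y_cores


def normalized_speedup(time_one_core, time_y_cores, y):
    """Normalized speedup S(y) / y; a value of 1.0 would mean perfect
    scaling with no parallelization overhead."""
    return speedup(time_one_core, time_y_cores) / y


# Hypothetical decoding times in ms, keyed by number of cores
# (illustrative values, not measurements from the paper).
timings = {1: 120.0, 2: 64.0, 4: 40.0}
table = {
    y: (speedup(timings[1], t), normalized_speedup(timings[1], t, y))
    for y, t in timings.items()
}
```

In this made-up example the speedup grows with the number of cores while the normalized speedup falls below one, which is the qualitative behaviour the text attributes to parallelization overhead.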
As explained in Section III-B, Min-Sum BP and Algorithm E can be profitably vectorized, as shown in Figs. 5(a) and 5(b) for the code matrices H_G and H_H. Firstly, the coding performance of Sum-Product BP is the best, followed by Min-Sum BP, followed by Algorithm E. This is expected because the latter two algorithms are approximations of Sum-Product BP. Further, as the crossover probability of the BSC between the source and side information bitplanes increases, the probability of uncorrected errors increases until it plateaus at 1, which means that there are undetected or uncorrected errors in every decoded vector. As there are more check nodes in H_H than in H_G, the plateau occurs at a higher crossover probability for the H_H code. When the crossover probability increases, more BP iterations are needed to recover the encoded vector, so the decoding time increases until the number of iterations maxes out at $I_{MAX}$. A fact that is not visible from these plots is that Sum-Product BP converges in the fewest iterations but each iteration consumes more time.

Secondly, recall that Fast Algorithm E does less work per iteration by first checking whether a node needs updating. The decoding time graphs show that, by using vector instructions in the plain Algorithm E, the decoding speed approaches and, in some cases, exceeds that of Fast Algorithm E. Note that, owing to the conditional checks, Fast Algorithm E is not suitable for implementation using vector instructions.

Thirdly, Min-Sum BP provides intermediate decoding performance and decoding speed between Sum-Product BP and Algorithm E. Interestingly, however, with vector instructions, the Min-Sum BP decoding time is nearly as small as that of Algorithm E while retaining its superior decoding performance. The reason is that check node operations consume 55-65% of the decoding time in Min-Sum BP, but only 35-45% of the decoding time in Algorithm E. Since vectorization reduces check node processing time, Min-Sum BP benefits more from vectorization than Algorithm E.
We conclude that Vector Min-Sum BP is nearly always to be preferred over Algorithm E for side information decoding. When the crossover probability is low, e.g., while decoding the higher significant bitplanes of image pixels, Vector Min-Sum BP is to be preferred over Sum-Product BP because it gives the same performance in less time. However, when the crossover probability increases, e.g., while decoding the middle bitplanes of image pixels, Sum-Product BP gives significantly fewer decoding errors and must be preferred over Min-Sum BP even though it is slower.

V. DISCUSSION

To see the decoding time results in the context of a video viewing application, consider the following very rough calculation. Suppose that Min-Sum BP decoding is performed and that we can tolerate error-prone decoded blocks with probability less than .1. From Fig. 5, this requirement is satisfied for a BSC crossover probability of 0.05 for the code H_H, for example. The decoding time for Vector Min-Sum BP at this probability is 8.8 ms. With an 8 × 8 block DCT transform, there are 64 coefficients to be coded. However, not all bitplanes of each DCT coefficient are significant. At 40 dB quality in natural images, we found experimentally that, out of 64 × 8 = 512 bitplanes, it is necessary to code about 300 bitplanes. Thus, 300 Min-Sum BP decodings must be carried out per video frame. For the code H_H, this gives a total decoding time of

300 × 8.8 ms = 2.64 seconds. This implies that the decoding speed is 0.38 frames/s for standard definition video, or 1.52 frames/s for CIF video, or 6.06 frames/s for QCIF video on a general-purpose quad-core machine.

[Fig. 5: decoding time (ms) and residual block error probability versus BSC crossover probability. (a) Parity check matrix H_G, code rate 0.83. (b) Parity check matrix H_H, code rate 0.49. Timing curves: Sum-Product BP, Min-Sum BP, Vector Min-Sum BP, Fast Alg. E, Vector Alg. E. Error probability curves: Sum-Product BP, Min-Sum BP, Alg. E.]

Fig. 5. A comparison of the speeds and performance of Sum-Product BP, Min-Sum BP and Algorithm E at various crossover probabilities.

Many simplifying assumptions are made above. Firstly, different code matrices would be required for each bitplane: more reliable bitplanes would decode faster than with H_H, and less reliable bitplanes would decode slower. Secondly, motion compensation is needed to generate good side information, and this incurs additional delay. Nevertheless, it is encouraging to see that realtime Wyner-Ziv decoding is within reach on multicore CPUs and certainly on massively parallel GPUs. Our current work consists of combining parallelized BP decoding, parallelized motion compensation and improved side information decoding into a realtime distributed video decoder.

In addition to side information decoding, the benefits of parallelization and vector instructions reported herein are expected to be useful in many other applications that use BP decoding, such as disparity estimation in multiview images/video, traditional digital communications, and speech recognition, to name a few.

REFERENCES

[1] D. Slepian and J. K. Wolf, "Noiseless coding of correlated information sources," IEEE Trans. Information Theory, pp. 471–480, July 1973.
[2] A. D. Wyner and J. Ziv, "The rate-distortion function for source coding with side information at the decoder," IEEE Trans. Information Theory, vol. 22, pp. 1–10, Jan. 1976.
[3] B. Girod, A. Aaron, S. Rane, and D. Rebollo-Monedero, "Distributed video coding," Proceedings of the IEEE, Special Issue on Advances in Video Coding and Delivery, vol. 93, no. 1, pp. 71–83, Jan. 2005.
[4] X. Artigas, J. Ascenso, M. Dalai, S. Klomp, D. Kubasov, and M. Ouaret, "The DISCOVER codec: Architecture, techniques and evaluation," in Picture Coding Symposium, Lisbon, Portugal, Nov. 2007.
[5] S. Grauer-Gray, C. Kambhamettu, and K. Palaniappan, "GPU implementation of belief propagation using CUDA for cloud tracking and reconstruction," in 5th IAPR Workshop on Pattern Recognition in Remote Sensing (PRRS), Tampa, FL, Dec. 2008.
[6] S. Wang, S. Cheng, and Q. Wu, "A parallel decoding algorithm of LDPC codes using CUDA," in Proc. Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, Oct. 2008.
[7] A. D. Copeland, N. B. Chang, and S. Leung, "GPU accelerated decoding of high performance error correcting codes," in 2009 High Performance Embedded Computing (HPEC), Lexington, MA, Sept. 2009.
[8] R. G. Gallager, Low-density parity-check codes, M.I.T. Press, 1963.
[9] D. J. MacKay and R. M. Neal, "Near Shannon-limit performance of low density parity check codes," Electronics Letters, vol. 32, pp. 1645–1646, 1996.
[10] D. Varodayan, A. Aaron, and B. Girod, "Rate-adaptive codes for distributed source coding," EURASIP Signal Processing Journal, vol. 86, no. 11, pp. 3123–3130, Nov. 2006.
[11] J. Zhang and M. Fossorier, "Shuffled iterative decoding," IEEE Transactions on Communications, vol. 53, no. 2, pp. 209–213, 2005.
[12] F. R. Kschischang, B. J. Frey, and H. Loeliger, "Factor graphs and the sum-product algorithm," IEEE Trans. Information Theory, vol. 47, pp. 498–519, Feb. 2001.
[13] N. Wiberg, Codes and Decoding on General Graphs. Studies in Sci. and Technol., Dissertation no. 440, Linköping, Sweden, 1996.
[14] M. Mitzenmacher, "A note on low density parity check codes for erasures and errors," in SRC Tech. Note 1998-017, COMPAQ, 1998.
[15] T. J. Richardson and R. Urbanke, "The capacity of low-density parity-check codes under message-passing decoding," IEEE Trans. Information Theory, vol. 47, pp. 599–618, Feb. 2001.
[16] Y. Wang, J. S. Yedidia, and S. C. Draper, "Multi-stage decoding of LDPC codes," in IEEE Int. Symp. Inform. Theory, June 2009, pp. 2151–2155.
[17] OpenMP Version 3.0 Application Program Interface. OpenMP Architecture Review Board, May 2008.