Parallel Block-Layered Nonbinary QC-LDPC Decoding on GPU


Huyen Thi Pham, Sabooh Ajaz and Hanho Lee
Department of Information and Communication Engineering, Inha University, Incheon, Korea

Abstract

This paper presents an efficient implementation of a parallel block-layered nonbinary quasi-cyclic low-density parity-check (NB-QC-LDPC) decoder on a graphics processing unit (GPU) to achieve significant improvements in both flexibility and scalability. An efficient block-layered scheme and a data structure suitable for parallel computing are proposed to perform decoding on the GPU. The scheme is applied to a min-max decoding algorithm that exploits the inherent massive parallelization capabilities of the NB-QC-LDPC decoder. The results of the proposed approach demonstrate that the layered scheme can be efficiently implemented on a GPU device. Moreover, experimental results show that the proposed GPU-based block-layered NB-QC-LDPC decoder provides a faster decoding runtime than a CPU-based implementation and obtains a coding gain under a low 10^-10 BER and low 10^-7 FER.

Keywords: nonbinary; quasi-cyclic; LDPC; GPU; parallel computation; CUDA

I. INTRODUCTION

A binary low-density parity-check (LDPC) code that provides performance close to the Shannon limit for long code lengths was investigated by Gallager [1]. Recently, nonbinary LDPC (NB-LDPC) codes [2-7] have attracted a tremendous amount of research interest because of their excellent error correction capabilities. Davey and MacKay [2] showed that NB-LDPC codes provide significant performance improvements when the code lengths are short and moderate. However, the decoding algorithms for NB-LDPC codes require complex computations and large memories [6]. It has been shown that NB-LDPC codes over higher-order Galois fields GF(q) provide better performance; however, the decoding complexity grows rapidly, and simulation on a central processing unit (CPU) is extremely slow.
Therefore, it is impractical to observe the error-floor behavior of NB-LDPC codes using CPU-based simulations at a low bit error rate (BER) and frame error rate (FER). Recently, graphics processing units (GPUs) have been widely used for their high computational power, by which they can simultaneously execute numerous threads. NVIDIA presented the Compute Unified Device Architecture (CUDA), which uses the C high-level programming language and offers a software environment that facilitates the development of high-performance applications. The GPU can provide massively parallel computation threads with a many-core architecture, which can accelerate simulations of NB-LDPC decoding. Currently, the implementation of the NB-LDPC decoder on GPU devices is being actively researched [8-12]. However, the implementation of NB-LDPC codes remains very challenging. The recent work of Beermann et al. [12] applied a layered scheme for the belief propagation algorithm on the GPU. Our work demonstrates that the horizontal-layered scheme can be efficiently implemented with a min-max algorithm on the GPU device. In this paper, an efficient GPU-based implementation of the parallel block-layered NB-QC-LDPC decoder is presented to accelerate the decoding process.

The rest of this paper is organized as follows. In Section II, the NB-LDPC codes are briefly reviewed, and an efficient parallel block-layered min-max algorithm for the GPU is proposed. In Section III, the proposed parallel architecture and its implementation on a GPU using CUDA are described. Experimental results are presented in Section IV. Finally, conclusions are given in Section V.

(2015 IEEE)

II. NB-LDPC CODES AND DECODING ALGORITHM

A. NB-LDPC Codes

An (N, K) NB-LDPC code (N code symbols, K information symbols, and M = N - K parity symbols) is defined by a parity-check matrix H, which includes a small proportion of nonzero Galois-field elements. The NB-QC-LDPC code is illustrated by a Tanner graph.
Each row of the H matrix corresponds to a check node, and each column of the H matrix corresponds to a variable node on the Tanner graph. These codes offer good BER performance and efficient parallel processing. Zhou et al. [3] presented two new algebraic constructions for NB-LDPC codes based on array dispersions of matrices. In that study [3], the parity-check matrices of NB-QC-LDPC codes are row-column (RC) constrained arrays, which are extended as circulant permutation matrices (CPMs) over nonbinary Galois fields. The RC constraint is a structural constraint on the rows and columns of the H matrix. A (744, 653) NB-LDPC code over GF(2^5) is generated by using RC-constrained arrays; this structure is applied in the simulations of our study. By using this method [3], an H submatrix is generated. For a pair (dv = 3, dc = 24), dv is the column weight or variable node degree, and dc is the row weight or check node degree. Each element in submatrix H(3,24) is dispersed either as an all-zero matrix of size (q - 1) × (q - 1) or as an α-multiplied CPM of size (q - 1) × (q - 1). Fig. 1(a) shows a (744, 653) NB-QC-LDPC code, which is constructed from submatrix H(3,24). In an α-multiplied CPM, the first row of the matrix has only one nonzero entry, at the i-th position. It is generated by dispersing an element α^i; the other entries are zero. Each of the other rows is a right cyclic shift of the previous row multiplied by α. Fig. 1(b) shows an α-multiplied CPM for α^2.

Fig. 1. (a) H matrix for a (744, 653) NB-QC-LDPC code over GF(2^5). (b) Example of α-multiplied CPM for α^2.

B. Proposed GPU-based Parallel Block-layered Min-max Decoding Algorithm

In this section, the horizontal layered decoding [5, 7] is applied to decrease both the memory and the number of decoding iterations. The H matrix is divided into layers. Then, the decoder iteration is sequentially performed layer by layer. In this work, we propose a parallel block-layered min-max decoding algorithm on a GPU, as shown in Algorithm 1, in which kernels are designed to simultaneously process (q - 1) check nodes. One block layer is constructed from (q - 1) nonoverlapping rows; each column of a block layer has a weight of one.

Algorithm 1: Proposed GPU-based parallel block-layered min-max decoding algorithm

Initialization:
    $L_n(a) = \ln\left(\Pr(x_n = s_n \mid \text{channel}) / \Pr(x_n = a \mid \text{channel})\right)$;
    $L_n^{1,0} = L_n$; $R_{mn}^{0,l} = 0$;
Iterations:
    For (k = 1; k <= Imax; k++)              // Loop for each iteration
        For (l = 1; l <= L; l++)             // Loop for each layer
            For (m = 0; m < q - 1; m++)      // (q - 1) check nodes processed in parallel
                Step 1: $\tilde{L}_{n,m}^{k,l} = L_n^{k,l-1} - R_{mn}^{k-1,l}$;
                        $\tilde{L}'_{n,m} = \min_{a \in GF(q)} \tilde{L}_{n,m}^{k,l}(a)$;
                        $\tilde{L}_{n,m}^{k,l}(a) = \tilde{L}_{n,m}^{k,l}(a) - \tilde{L}'_{n,m}$;
                Step 2: $R_{mn}^{k,l}(a) = \min_{(a_{n'})_{n' \in N(m)} \in \mathcal{L}_{mn}(a)} \left( \max_{n' \in N(m) \setminus \{n\}} \tilde{L}_{n',m}^{k,l}(a_{n'}) \right)$;
                Step 3: $L_n^{k,l} = \tilde{L}_{n,m}^{k,l} + R_{mn}^{k,l}$;
            End for
        End for
        Decision: $\tilde{c}_n = \arg\min_a \left( L_n^{k,l}(a) \right)$;
    End for

Algorithm 1 is briefly summarized as follows. The layered decoding divides the H matrix rows into L = dv layers. The decoding processing for layer 1, layer 2, ..., and layer dv is sequentially performed to complete a single iteration; the extrinsic values are exchanged among the layers. This process is repeated until the number of iterations reaches the maximum value Imax or until the parity check is satisfied. The initialization of the parallel block-layered min-max decoding algorithm is as shown in Algorithm 1. In addition, the variable-to-check (V2C) messages $\tilde{L}_{n,m}^{k,l}$ of layer l in iteration k are computed based on $L_n^{k,l-1}$ and $R_{mn}^{k-1,l}$. It is noted that $L_n^{k,l-1}$ and $R_{mn}^{k-1,l}$ are the a posteriori messages of layer l - 1 in iteration k and the check-to-variable (C2V) messages of layer l in iteration k - 1, respectively. $\tilde{L}_{n,m}^{k,l}$ is expressed as follows:

$$\tilde{L}_{n,m}^{k,l} = L_n^{k,l-1} - R_{mn}^{k-1,l} \qquad (1)$$

In the first layer of the first iteration, the V2C messages $L_n^{1,0}$ are the reliability information from the channel $L_n$ ($L_n^{1,0} = L_n$), and the check node memory values (CMEM) are equal to zero ($R_{mn}^{0,l} = 0$). Let x_n be the n-th symbol in a received codeword, and let s_n be the most likely symbol of x_n. The L_n vector has q elements, including one zero element and (q - 1) positive elements. The min-max decoding, which is implemented by the forward-backward algorithm (FBA) [4], is applied in the check node process. This paper proposes a modified FBA, which removes the multiplication with nonzero elements of the H matrix in the conditional equation of the merger step to decrease the complexity of the check node processing (CNP).

III. PARALLEL BLOCK-LAYERED NB-QC-LDPC DECODING ON GPU

A. Data Flow of NB-QC-LDPC Decoding on GPU

NVIDIA GPUs are powerful arithmetic engines that can run thousands of lightweight threads in parallel. A GPU-based heterogeneous platform has one or more CPUs and GPUs and is well suited to implementing NB-LDPC decoding algorithms. In addition, the NB-LDPC decoding algorithm has a high computation-to-memory-access ratio (CMAR). The CMAR represents the complexity of operations that justify the cost of moving data to and from the device. To exploit the modern processor architecture integrated in GPUs, the application must first be assessed to identify the hotspots that can be parallelized. The runtime of the main blocks in the min-max algorithm is measured by running a serial C code on a CPU.
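As an illustration of the kind of serial C reference code used for such measurements, Steps 1 and 3 of Algorithm 1 for a single edge (m, n) might be sketched as follows. This is a minimal sketch and not the authors' implementation: the function names `v2c_update` and `posterior_update`, the array layout (one float per GF(q) element), and the fixed size Q = 32 are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define Q 32 /* assumed field size: GF(2^5), as in the simulated code */

/* Step 1 (assumed layout: one float per GF(q) element):
 * v2c(a) = post(a) - r_old(a), then subtract the minimum so that
 * the smallest entry of the V2C vector becomes zero. */
void v2c_update(const float *post, const float *r_old, float *v2c)
{
    assert(post != NULL && r_old != NULL && v2c != NULL);
    float min_val = post[0] - r_old[0];
    for (int a = 0; a < Q; ++a) {
        v2c[a] = post[a] - r_old[a];
        if (v2c[a] < min_val)
            min_val = v2c[a];
    }
    for (int a = 0; a < Q; ++a)
        v2c[a] -= min_val;
}

/* Step 3: the new a posteriori vector is the V2C vector plus the
 * freshly computed C2V vector of the current layer. */
void posterior_update(const float *v2c, const float *r_new, float *post)
{
    assert(v2c != NULL && r_new != NULL && post != NULL);
    for (int a = 0; a < Q; ++a)
        post[a] = v2c[a] + r_new[a];
}
```

The normalization in Step 1 keeps one zero element and nonnegative remaining elements in every V2C vector, which matches the form of the channel vector L_n described in Section II-B.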
The measurement shows that the check node processing is a bottleneck and accounts for 95.2% of the processing time. Hence, the computations of the CNP can be parallelized on the GPU platform. In this section, we present an efficient implementation of a parallel block-layered NB-QC-LDPC decoding scheme based on a GPU platform to accelerate the decoding processing. Fig. 2 shows a data flow for the parallel block-layered decoding scheme on CPU and GPU platforms. The CUDA program is divided into two tasks: one is for the CPU; the other is for the GPU. The host CPU handles the kernel scheduling, control of the decoding iterations, computing of BER performance, and so on. The host CPU must transfer the symbols of the received codeword to the GPU; it then receives the decoded symbols from the GPU. Most of the decoding computations are implemented on the GPU. All intermediate messages are stored in the device memory to restrict data transfer between the host and device. Each of the modules in Fig. 2 corresponds to a kernel mapped on the GPU platform.

Algorithm 2: Modified forward-backward algorithm

Forward metrics:
    First step:      $F_0(a) = L_0(h_0^{-1} a)$    (2)
    Recursive step:  for i = 1 to dc - 2
                     $F_i(a) = \min_{a', a'' \in GF(q),\; a' + h_i a'' = a} \max\left(F_{i-1}(a'), L_i(a'')\right)$    (3)
Backward metrics:
    First step:      $B_{d_c-1}(a) = L_{d_c-1}(h_{d_c-1}^{-1} a)$    (4)
    Recursive step:  for i = dc - 2 down to 1
                     $B_i(a) = \min_{a', a'' \in GF(q),\; a' + h_i a'' = a} \max\left(B_{i+1}(a'), L_i(a'')\right)$    (5)
Modified merger:
    $M_0 = B_1$; $M_{d_c-1} = F_{d_c-2}$    (6)
    $M_k(a) = \min_{a', a'' \in GF(q),\; a' + a'' = a} \max\left(F_{k-1}(a'), B_{k+1}(a'')\right)$    (7)

B. Data and Memory Structure

As mentioned in Section II, the H matrix is constituted from the (dv, dc) submatrix, where the elements of the submatrix are extended by α-multiplied CPMs of size (q - 1) × (q - 1). To take advantage of the α-multiplied CPM structure, only the submatrix is stored in memory instead of the full H matrix. This method is called the compress technique, which reduces the storage memory for the H matrix and enables fast memory access.

In this work, we propose a layered decoding scheme in which (q - 1) check nodes in a row block are simultaneously processed. Therefore, in this section, we describe the data and memory structures for single-layer processing. A total of dc × (q - 1) V2C vectors for one layer are distributed among (q - 1) check nodes. The computations for (q - 1) check nodes in one layer only need (q - 1) × dc × q messages, which are stored in a variable node processing (VNP) temporary memory.
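To make the compress technique concrete, the following minimal sketch recovers the single nonzero entry in any row of an α-multiplied CPM from the stored submatrix element alone, so the full (q - 1) × (q - 1) block never has to be stored. The type `cpm_entry` and the function `cpm_row` are hypothetical names, and the sketch assumes that the first row holds the value α^i at column i, which is how we read the description in Section II-A; the paper does not state the first-row value explicitly.

```c
#include <assert.h>

/* Location (column) and value (exponent of alpha) of the single nonzero
 * entry in one row of an alpha-multiplied CPM. */
typedef struct {
    int col; /* column index of the nonzero entry, 0 .. n-1 */
    int exp; /* the entry equals alpha^exp */
} cpm_entry;

/* Row r of the CPM that disperses alpha^i, for an n x n CPM (n = q - 1).
 * Assumption: the first row holds alpha^i at column i; every further row
 * is the previous row shifted right by one position and multiplied by
 * alpha. */
cpm_entry cpm_row(int i, int r, int n)
{
    assert(n > 0 && i >= 0 && i < n && r >= 0 && r < n);
    cpm_entry e;
    e.col = (i + r) % n; /* r right cyclic shifts */
    e.exp = (i + r) % n; /* r multiplications by alpha; alpha^n = 1 */
    return e;
}
```

For GF(2^5), n = 31, so a call such as `cpm_row(2, 1, 31)` describes the second row of the CPM for α^2. Under the stated assumption, a decoder only needs the dv × dc exponents of the submatrix to locate every variable node connected to a check node.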
Thus, the memory required for the layered scheme is reduced by a factor of the number of layers, dv, compared to the flooding scheme. Fig. 3(a) depicts a 3D structure of C2V vectors [q, dc, q - 1]. The three dimensions of the C2V vectors are as follows: the width q corresponds to the q entries in a vector, the height dc corresponds to the dc V2C vectors connecting to one check node, and the depth corresponds to the (q - 1) check nodes. This structure allows the (q - 1) check nodes, which operate in parallel, to access dc V2C vectors in alignment. If a 3D array is formatted as [width, height, depth], each element [x, y, z] of the array is uniquely indexed by [x + y × width + z × width × height] in the 1D array, as shown in Fig. 3(b). By arranging the $\tilde{L}_{n,m}^{k,l}$ V2C vectors and the $R_{mn}^{k,l}$ C2V vectors in this format, the q adjacent data entries are accessed by q adjacent threads; thus, coalesced memory access is enabled, which achieves high throughput.

Fig. 2. Data flow of parallel block-layered NB-QC-LDPC decoding on CPU and GPU platforms.

Furthermore, additions and subtractions in GF(q) are implemented as exclusive OR (XOR) operations, and divisions by α^a are computed by multiplication with α^((31 - a) % 31). The GPU's texture memory, which is available to all kernels, is employed to implement nonbinary arithmetic in GF(q). Two 2D lookup tables of size q × q exist for multiplication and addition; two 1D lookup tables of size q exist for conversion between exponential and decimal representation. A 64-KB constant memory is used to store values from the parity-check matrix, which are actually the values and indices of the bit nodes connected to each check node.

C. Parallel Forward-Backward Scheme in Check Node Processing

The proposed decoding algorithm is partitioned into four main kernels; the kernel sizes are listed in TABLE I. The configuration parameters of the kernels can be flexibly changed depending on the parameters of each GPU card used. Each check node sequentially computes forward (FD), backward (BD), and merger messages. However, FD and BD messages are independently calculated.
In this architecture, these messages are simultaneously processed by using q threads for the forward step and q threads for the backward step. Once the forward-backward messages are available, the merger computation begins. Fig. 4(a) shows the architecture of a detailed kernel for one check node. Input messages for the kernel are stored in the VNP temporary memory. Moreover, output forward-backward messages are kept in forward and backward memories to continue computing merger messages, and are stored in an on-chip local memory for computation of the next forward-backward messages. Because the F_i and B_j messages are used to compute F_{i+1} and B_{j-1} in the next step, the on-chip local memory with high bandwidth and low latency is used to store the output messages F_i and B_j to speed up the FD and BD steps. The implementation step of the FD, BD, or merger computation is called an elementary step. The memory requirement for F_i and B_j in (q - 1) check nodes is 2 × q × sizeof(float) × (q - 1). For example, 7.75 KB of local memory is required for 31 check nodes in GF(2^5). A barrier synchronization function, __syncthreads(), is performed after each forward or backward step, F_i or B_j, to ensure that threads are synchronized.

TABLE I. KERNEL ARCHITECTURE FOR MAIN BLOCKS OF THE DECODER OVER GF(2^5).

Function             Initial LLR        CNP: FD, BD         CNP: MG            VNP                Decision
No. thread blocks    dc                 q - 1               q - 1              dc                 1
No. threads          q × (q - 1)        q + q               q × dc             q × (q - 1)        dc × (q - 1)
Total no. threads    dc × q × (q - 1)   (q - 1) × (q + q)   (q - 1) × q × dc   dc × q × (q - 1)   dc × (q - 1)

Fig. 3. Data structure for coalescing memory access in CNP: (a) 3D structure of CN messages, (b) 1D structure of CN messages.

Fig. 4. (a) Forward-backward kernel implementation of the FBA on GPU, (b) Shared memory for random memory access.

Fig. 5. BER and FER performance of a (744, 653) block-layered NB-QC-LDPC code over GF(2^5) with min-max algorithm using the GPU. [Plot: bit/frame error rate versus Eb/N0 (dB); curves for BER and FER at 15 iterations.]

Using equations (3) and (5), there are q different pairs of a' and a'' that satisfy a' + h a'' = a in the forward-backward step. Suppose that the F_1 vector is to be computed when h, the current V2C vector L_1(a''), and the previous forward vector F_0(a') are known. An elementary step for forward computation has two input message vectors, F_i(a') and L_{i+1}(a''). One output message vector is defined as F_{i+1}. To compute one message of the F_1 output vector, q combinations of F_0(a') and L_1(a'') are determined by substituting the indexes into a' and a'' to satisfy a' + h a'' = a. After obtaining the q pairs, q messages are first generated by selecting the larger one in each of the q pairs (F_0(a'), L_1(a'')). Then, the minimal value of the q messages is found and taken as an output message of the forward vector F_1. For example, using h = α^2 and a = α^1, there are 32 pairs that satisfy a' + α^2 a'' = α^1: (a', a'') = {(0, α^30), (α^0, α^16), (α^1, 0), ..., (α^30, α^2)}.
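The elementary forward step described above can be sketched in serial C as follows. This is a hedged illustration, not the authors' kernel: the names `gf_init`, `gf_div`, `gf_partner`, and `forward_step` are hypothetical, and the primitive polynomial x^5 + x^2 + 1 is an assumption (the paper does not state it), chosen because it reproduces the example pairs listed above. Field elements are stored as 5-bit integers, addition is XOR, and division uses the exponent (31 - a) % 31 as described in Section III-B.

```c
#include <assert.h>
#include <float.h>

#define Q 32           /* field size, GF(2^5) */
#define PRIM_POLY 0x25 /* x^5 + x^2 + 1: an assumed primitive polynomial */

int gf_exp[Q - 1]; /* gf_exp[e] = alpha^e as a 5-bit integer */
int gf_log[Q];     /* gf_log[v] = e with alpha^e = v, valid for v != 0 */

/* Build the exponent/logarithm lookup tables once before use. */
void gf_init(void)
{
    int v = 1;
    for (int e = 0; e < Q - 1; ++e) {
        gf_exp[e] = v;
        gf_log[v] = e;
        v <<= 1;
        if (v & Q)
            v ^= PRIM_POLY; /* reduce modulo the primitive polynomial */
    }
}

/* a / h for h != 0, computed as a * alpha^((31 - log h) % 31). */
int gf_div(int a, int h)
{
    assert(h != 0);
    if (a == 0)
        return 0;
    int inv_exp = (Q - 1 - gf_log[h]) % (Q - 1);
    return gf_exp[(gf_log[a] + inv_exp) % (Q - 1)];
}

/* The unique a'' with a' + h * a'' = a; addition in GF(2^5) is XOR. */
int gf_partner(int a, int h, int a_prime)
{
    return gf_div(a ^ a_prime, h);
}

/* One forward elementary step, following the form of Eq. (3):
 * f_out(a) = min over a' of max(f_in(a'), l_in(a'')). */
void forward_step(const float *f_in, const float *l_in, int h, float *f_out)
{
    for (int a = 0; a < Q; ++a) {
        float best = FLT_MAX;
        for (int ap = 0; ap < Q; ++ap) {
            int app = gf_partner(a, h, ap);
            float m = f_in[ap] > l_in[app] ? f_in[ap] : l_in[app];
            if (m < best)
                best = m;
        }
        f_out[a] = best;
    }
}
```

After `gf_init()`, calling `gf_partner(gf_exp[1], gf_exp[2], 0)` gives α^30, which is the first pair of the example with h = α^2 and a = α^1. In a GPU kernel the outer loop over a would presumably map to the q threads of a forward step, with one thread per output entry; that mapping is our reading of the text and is not shown here.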
As mentioned above, the variable node messages L_1(a'') are stored for access in linear memory order. However, to compute the forward messages F_1 in an elementary step, the V2C messages L_1(a'') are accessed in an arbitrary order. In this case, the order in which the L_1(a'') messages are accessed follows {α^30, α^16, 0, ..., α^2}. To address this problem, the V2C messages are copied to on-chip shared memory, which has high bandwidth and low latency, before beginning the computation. Moreover, the additions and divisions are implemented by lookup tables in the texture memory. In this way, the variable node messages are first copied directly to the shared memory. Then, the output is written using the indexes, which are computed from the texture memory, as shown in Fig. 4(b). Thus, bank conflicts are not generated and memory access is sped up.
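Returning to the data layout of Section III-B, the 3D-to-1D index mapping used for the message arrays can be sketched as a small helper. The function name `index3d` is hypothetical; the formula itself is the one given in the text, [x + y × width + z × width × height].

```c
#include <assert.h>
#include <stddef.h>

/* Flatten element [x, y, z] of a [width, height, depth] array into a
 * 1D offset. x runs over the q entries of one message vector, y over
 * the dc vectors of one check node, z over the (q - 1) check nodes. */
size_t index3d(size_t x, size_t y, size_t z, size_t width, size_t height)
{
    assert(x < width && y < height);
    return x + y * width + z * width * height;
}
```

With width = q = 32 and height = dc = 24, consecutive values of x map to consecutive offsets, so q adjacent threads that each take one value of x read q adjacent floats. This is the access pattern that the text associates with coalesced memory access.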

IV. EXPERIMENTAL RESULTS

The experimental setup to evaluate the performance of the proposed NB-QC-LDPC decoder consisted of an NVIDIA GTX 650 Ti GPU and an Intel Core i7-4770 CPU. The CPU platform, an Intel Core i7-4770 CPU at 3.4 GHz with 16 GB RAM, was used to run the serial C code. An NVIDIA GTX 650 Ti GPU with 768 CUDA cores at 0.97 GHz and 1024 MB of GDDR5 device memory was used to run the CUDA C code. Moreover, an NVIDIA GTX TITAN Black graphics card was also used to run the CUDA C code. This work used CUDA toolkit v5.5 for the implementation.

A regular (744, 653) NB-QC-LDPC code constructed over GF(2^5) with a code rate of 0.877 was used in this simulation. The decoding performance of the (744, 653) NB-QC-LDPC code and its random counterpart over an additive white Gaussian noise (AWGN) channel with binary phase shift keying (BPSK) is illustrated in Fig. 5. This simulation is performed by the min-max algorithm for the NB-QC-LDPC code over GF(2^5) with 15 decoding iterations on the GPU. It demonstrates that the GPU accelerated the decoding process enough to enable the detection of error floors at approximately 10^-7 FER within days instead of the weeks of computation required in C++. As a result, the implementation on the GPU compared well with VLSI approaches. Furthermore, it is clear that the GPU processing led to superior FER and BER performance as opposed to VLSI solutions [5, 7].

TABLE II. DECODING TIME ON DIFFERENT DEVICES AT IMAX = 10.

Decoding time (ms)    Eb/N0:  0-3 dB   4 dB   5 dB   6 dB   7 dB   8 dB
CPU (Intel i7)        [values not recoverable from the source]
GTX 650 Ti            [values not recoverable from the source]
GTX TITAN Black       [values not recoverable from the source]

TABLE II shows the decoding runtime using the CPU platform and various GPU devices. Execution times were obtained with CPU timers. In (744, 653) NB-QC-LDPC decoding on the GTX TITAN Black, the decoding runtime is [...] ms at 4 dB, which is 7.5 times faster than that of the CPU-based implementation. On the other hand, different GPU devices were set up on different CPU platforms, which produced varying decoding runtimes. The GTX TITAN Black graphics card has more advantages over the GTX 650 Ti. Thus, its decoding runtime is approximately twice as fast as that of the GTX 650 Ti. Moreover, the general-purpose GPUs could perform floating-point arithmetic operations with better accuracy and lower BER than in very-large-scale integration (VLSI) LDPC decoders.

Fig. 6 shows the average total throughput of the decoder processed by the CPU platform and various GPU platforms over different channel qualities. The two platforms are almost identical in terms of BER and FER results. However, the speeds of the two platforms differ and are measured by the average runtime per decoding iteration at different Eb/N0 values. In Fig. 6, from 1 dB to 3 dB, the throughput remained fairly static at its lowest value. This was due to the poor channel quality at low Eb/N0 values, where the decoding has to be executed for the maximum of ten iterations. In addition, the throughput increased with increasing Eb/N0 because fewer decoding iterations have to be executed until a correct codeword is recovered.

Fig. 6. Average decoder throughput on GPUs and CPU at a maximum of 10 iterations. [Plot: throughput (Mbit/s) versus Eb/N0 (dB); curves for GTX TITAN Black, GTX 650 Ti, and CPU.]

Two factors affect the decoding runtime of the layered scheme in this study. First, it depends on the number of layers because decoding has to be sequentially performed on each layer to finish one iteration, instead of once per iteration as in the flooding scheme. Thus, the decoding time per iteration of the layered scheme can be estimated to be dv times higher than that of the flooding scheme. However, the layered scheme doubles the convergence speed of the iterative decoder. This means that the number of required decoding iterations can be significantly decreased compared to the flooding scheme. Second, the check node degree dc additionally impacts the decoding runtime because the FBA used in the CNP is sequentially implemented in (dc - 1) steps. It is concluded that the decoding time is a trade-off for achieving the same decoding performance between the layered and flooding schemes.
Nonetheless, the layered scheme is more memory-efficient than the flooding scheme. To take advantage of independent computation in the forward and backward steps, we used q threads for the forward step and q threads for the backward step to simultaneously process each step. Therefore, the running time for the CNP kernel is estimated to be halved compared to [8].

V. CONCLUSION

In this paper, we presented an efficient GPU-based implementation of the parallel block-layered NB-QC-LDPC decoder to accelerate the decoding process. Owing to its inherently massive parallelism, NB-QC-LDPC decoding is easier to apply to GPU implementation than binary LDPC codes. The experimental results show that the GPU-based implementation of the layered decoding scheme for the NB-QC-LDPC code provides a faster decoding runtime and a coding gain under a low 10^-10 BER and low 10^-7 FER. A new solution is thereby provided for NB-QC-LDPC decoding on a GPU, which provides greater efficiency than on a CPU platform.

ACKNOWLEDGMENT

This research was supported by the Basic Science Research Program through the NRF funded by the Ministry of Science, ICT and Future Planning (213R1A2A2A168628).

REFERENCES

[1] R. G. Gallager, "Low density parity check codes," IRE Trans. on Information Theory, vol. 8, no. 1.
[2] M. C. Davey and D. MacKay, "Low-Density Parity Check Codes over GF(q)," IEEE Communications Letters, vol. 2, no. 6, Jun.
[3] B. Zhou, J. Kang, S. Song, S. Lin, K. A. Ghaffar, and M. Xu, "Construction of non-binary Quasi-cyclic LDPC codes by arrays and array dispersions," IEEE Trans. on Communications, vol. 57, no. 6, Jun. 2009.
[4] V. Savin, "Min-Max decoding for nonbinary LDPC codes," Proc. IEEE Int. Symp. Inf. Theory, Toronto, Canada, Jul. 2008.
[5] C.-S. Choi and H. Lee, "A Block-Layered Decoder Architecture for Quasi-Cyclic Non-Binary LDPC codes," Journal of Signal Processing Systems, vol. 78, no. 2, Feb.
[6] D. Declercq and M. Fossorier, "Decoding Algorithms for Nonbinary LDPC Codes Over GF(q)," IEEE Trans. on Communications, vol. 55, no. 4, Apr. 2007.
[7] X. Zhang and F. Cai, "Efficient Partial-Parallel Decoder Architecture for Quasi-Cyclic Nonbinary LDPC Codes," IEEE Trans. on Circuits and Systems I, vol. 58, no. 2, Feb.
[8] J. Andrade, G. Falcao, V. Silva, and K. Kasai, "FFT-SPA Non-binary LDPC decoding on GPU," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, May 26-31, 2013.
[9] G. Wang, H. Shen, et al., "Parallel Nonbinary LDPC Decoding on GPU," Proc. the 46th Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, Nov. 4-7, 2012.
[10] M. Beermann, E. Monzo, L. Schmalen, and P. Vary, "High speed decoding of non-binary irregular LDPC codes using GPUs," Proc. IEEE Workshop on Signal Processing Systems, Oct. 16-18, 2013.
[11] H. Pham Thi, S. Ajaz, and H. Lee, "Efficient Min-max Nonbinary LDPC Decoding on GPU," IEEE SoC Design Conference (ISOCC), Nov. 3-6, 2014.
[12] M. Beermann, E. Monzo, L. Schmalen, and P. Vary, "GPU Accelerated Belief Propagation Decoding of Non-Binary LDPC Codes with Parallel and Sequential Scheduling," Journal of Signal Processing Systems, vol. 78, no. 1, Jan. 2015.
