496 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 26, NO. 3, MARCH 2018


Basic-Set Trellis Min-Max Decoder Architecture for Nonbinary LDPC Codes With High-Order Galois Fields

Huyen Pham Thi, Member, IEEE, and Hanho Lee, Senior Member, IEEE

Abstract: Nonbinary low-density parity-check (NB-LDPC) codes outperform their binary counterparts in terms of error-correction performance. However, NB-LDPC decoders suffer from high complexity, especially in the check node unit (CNU), and this complexity grows considerably with the Galois-field (GF) order. In this paper, a novel basic-set trellis min-max algorithm is proposed that greatly reduces both the CNU complexity and the number of messages exchanged between the check node and the variable node compared with previous studies, which makes it highly efficient for higher order GFs. In addition, the proposed CNU is designed to compute the messages in parallel. Layered decoder architectures based on the proposed algorithm were implemented for the (837, 726) NB-LDPC code over GF(32) and the (1512, 1323) code over GF(64) using 90-nm CMOS technology, and obtained a complexity reduction of 30% and 37% for the CNU, and 40% and 37.4% for the whole decoder, respectively. Moreover, the proposed decoder achieves a higher throughput, at 1.67 Gbit/s and 1.4 Gbit/s, than other state-of-the-art high-rate NB-LDPC decoders with high-order GFs.

Index Terms: Basic set (BS), check node processing, high order, layered decoding, nonbinary low-density parity-check (LDPC), trellis min-max (TMM), VLSI design.

I. INTRODUCTION

Nonbinary low-density parity-check (NB-LDPC) codes defined over Galois fields (GFs) GF(q) with q > 2 outperform their binary counterparts in terms of error-correcting performance and performance improvement in the error-floor region when the code length is moderate [1].
In addition, these codes have a good burst-error-correction capability, especially for high-order GFs. Research results in [2] and [3] demonstrate that NB-LDPC codes provide superior performance compared with the best optimized binary LDPC code over fading channels, and that combining NB-LDPC codes with high-order modulations improves both the bandwidth efficiency and the error-correction capability. Moreover, the elimination of the error floor is critical for flash memory applications, and NB-LDPC codes show much promise for multilevel flash memory applications [4]. However, the main disadvantage of NB-LDPC codes is their highly complex decoding algorithms; it is difficult to achieve maximum throughput and minimum area for their architectures. In practical implementations, NB-LDPC decoders have several drawbacks, such as a highly complex check node unit (CNU), a large area spent on storage elements, and routing congestion.

First, the belief propagation (BP) algorithm used for binary LDPC decoding was introduced for NB-LDPC decoding [1]. Then, a fast Fourier transform-BP (FFT-BP) algorithm [5] in the probability domain was proposed to reduce the computational complexity of check node processing by replacing the convolution operations with multiplications in the frequency domain.

Manuscript received June 26, 2017; revised September 17, 2017; accepted November 4, Date of publication December 8, 2017; date of current version February 22, This work was supported by the Basic Science Research Program through the NRF funded by the Ministry of Science, ICT and Future Planning under Grant 2016R1A2B (Corresponding author: Hanho Lee.) The authors are with the Department of Information and Communication Engineering, Inha University, Incheon 22212, South Korea (phamhuyenmta87@gmail.com; hhlee@inha.ac.kr). Color versions of one or more of the figures in this paper are available online at Digital Object Identifier /TVLSI
Although the probability-domain algorithm provides optimal error-correcting performance, the large number of additions and multiplications causes an exponential increase in hardware complexity. In [6], the FFT-BP algorithm in the logarithm domain used log-likelihood ratio (LLR) values instead of probability values to decode the channel messages, so that the multiplications are replaced with additions. For practical NB-LDPC decoder implementations, suboptimal algorithms such as the extended min-sum (EMS) algorithm [7] and the min-max algorithm [8] have been proposed to reduce the complexity of the CNU, which is the main bottleneck of the NB-LDPC decoder. The min-max algorithm [8] is interesting because it uses comparisons instead of the additions of [7] in the check node processing, which not only reduces the hardware complexity but also prevents numerical growth in the decoder. In addition, in [8], a forward-backward scheme was used to derive the check node output messages. This scheme involves sequential computations, which limit the throughput of the decoder architectures. Moreover, additional memories are required to store intermediate messages such as the forward and backward messages. Recently, the path construction algorithms [9], [10] and the relaxed min-max (RMM) algorithm [11] introduced the trellis representation for check node processing to avoid computing the forward-backward messages, and thus reduce the memory requirement for the intermediate messages. The RMM algorithm [11], which uses the minimum basis to generate the check node output messages, was proposed for NB-LDPC decoders and further reduces the check node complexity.

IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See for more information.

However, the sequential check node processing requires a large number of clock cycles, which limits the maximum throughput of the decoder. In [12] and [13], the trellis EMS algorithm was proposed to improve the throughput of NB-LDPC decoders, where the check node output messages are generated in parallel by means of an extra column inserted into the original trellis. A disadvantage of the decoders in [12] and [13] is their large area, which reduces the overall decoder efficiency. To take advantage of the idea in [12], the simplified trellis min-max (STMM) algorithm [14] was proposed to improve the throughput of min-max decoders with less complexity. In [15], the one-minimum-only TMM algorithm was introduced on the basis of the STMM algorithm to reduce the CNU complexity by obtaining only one minimum and estimating the second one. In [12]-[15], $q \times d_c$ check node output messages are exchanged between the check node and the variable nodes. For high-order GFs or high-rate NB-LDPC codes, the approaches in [12]-[15] have two main drawbacks. First, the number of exchanged messages increases, which causes wiring congestion and thus limits the maximum throughput of the decoders. Second, in layered decoders the check node output messages are stored in memory for the next decoding iteration. Therefore, the memory requirement becomes large, which leads to a significant growth in the decoder area for NB-LDPC codes.

To overcome the drawbacks of [12]-[15], Lacruz et al. [16] originally introduced a compression technique that reduces the messages exchanged between one check node and the variable nodes to four sets (the intrinsic information, the extrinsic information, the path coordinates, and the hard-decision symbols) with a total size of $5 \times (q-1) + d_c$ messages, without any error-correcting performance loss.
For further improvement, the research in [17] and [18] proposed to simplify the CNU architecture and reduce the exchanged messages to $4 \times (q-1) + d_c$ messages with an error-correcting performance similar to that of [16]. The approximated TMM algorithms in [19] and [20] were introduced to reduce the amount of intrinsic information from $(q-1)$ elements [16] to only two elements and $L < q$ elements, respectively, at the cost of some error-correcting performance loss. The remaining elements are calculated from approximation functions.

In this paper, a novel basic-set TMM (BS-TMM) algorithm is proposed for NB-LDPC codes based on the theory of the Galois field GF($q = 2^p$), where each field element is uniquely represented by a linear combination of p independent field elements. In the proposed BS-TMM algorithm, only the basic set, which contains the intrinsic information of $p = \log_2 q$ independent field elements in the extra column, is stored, and the other elements are constructed from this basic set. Moreover, a novel algorithm is introduced for finding, in parallel, the p independent field elements with the most reliable messages of the basic set. The BS-TMM algorithm reduces the messages exchanged between one check node and the variable nodes from $4 \times (q-1) + d_c$ [16] to $(q-1) + 3 \times p + d_c$ messages with a negligible performance loss of 0.1 dB. The proposed method provides a large area reduction and throughput improvement for NB-LDPC decoders with good error-correction performance. Therefore, it is extremely efficient for the design of high-rate and high-order NB-LDPC decoders. Two NB-LDPC decoders, for the (837, 726) code over GF(32) and the (1512, 1323) code over GF(64), were implemented on the basis of the BS-TMM algorithm.

Algorithm 1 Layered Min-Max Decoding Algorithm [17]

The rest of this paper is organized as follows. Section II reviews the decoding algorithms for the NB-LDPC codes. Section III presents the proposed BS-TMM decoding algorithm for the NB-LDPC codes.
In Section IV, the CNU architecture and the overall decoder architecture based on the BS-TMM algorithm are proposed. The implementation results and a comparison with previous works are discussed in Section V. Finally, conclusions are drawn in Section VI.

II. TRELLIS MIN-MAX DECODING ALGORITHM

A. Review of the Layered Min-Max Algorithm

A sparse parity-check matrix H with M rows and N columns defines an NB-LDPC linear block code, where each nonzero element $h_{mn}$ belongs to GF(q). A Tanner graph corresponding to H represents the NB-LDPC code graphically, where variable nodes represent the N columns of H and check nodes represent the M rows of H. Let $d_c$ and $d_v$ be the check node degree (row weight) and the variable node degree (column weight) of H, respectively. Then N(m) denotes the set of $d_c$ variable nodes connected to check node m, and M(n) denotes the set of $d_v$ check nodes connected to variable node n. Let $Q_{mn}(a)$ and $R_{mn}(a)$ be the messages exchanged from variable node n to check node m (V2C) and from check node m to variable node n (C2V), respectively, for each symbol $a \in$ GF(q). A regular NB-LDPC code with fixed values of $d_c$ and $d_v$ is considered in this paper.

A horizontal layered decoding algorithm is applied in this paper because it converges faster than the flooding decoding algorithm with similar performance. The layered decoding algorithm for the NB-LDPC codes is presented in Algorithm 1. Let $c_n$ be the nth reference symbol of a received codeword and $z_n$ be the nth hard-decision symbol with the highest reliability. The decoding process is initialized by obtaining the LLR vectors of size q from the channel information by means of $L_n(a) = \ln\left(\Pr(c_n = z_n \mid \text{channel}) / \Pr(c_n = a \mid \text{channel})\right)$. At the first layer of the first iteration, $Q_n(a)$ as the a posteriori information

for the variable node n is equal to $L_n(a)$, and $R_{mn}(a)$ is equal to zero. Let k and l be the loop indexes for the kth iteration and the lth layer, respectively. The $Q_n(a)$ messages are permuted according to the nonzero element $h_{mn}$ of matrix H to obtain $Q_n(h_{mn} a)$. Then, the V2C messages $Q_{mn}(a)$ are derived in step 3, and these messages are normalized in steps 4 and 5 to ensure that the LLR value of the most reliable symbol in each vector is equal to zero. Step 6 computes the check node output messages $R_{mn}(a)$ using a function that depends on the algorithm applied for the check node processing. The updated messages $Q_n(a)$ in step 7 must undergo the reverse permutation before a new layer starts. The decoding process is repeated until the maximum number of iterations $I_{max}$ is reached. Finally, the output symbol $c_n$ is the most reliable symbol according to the $Q_n(a)$ messages.

Algorithm 2 TMM Algorithm [14]

B. Trellis Min-Max Algorithm With Compressed Messages

In [14], the STMM algorithm was proposed for check node processing to generate the check node output messages in parallel. The STMM algorithm, which is presented in Algorithm 2, provides a good tradeoff between the error-correcting performance and the decoding complexity compared with the previous works [11], [21]. The first step transforms the input messages from the normal domain $Q_{mn}(a)$ to the delta domain $\Delta Q_{mn}(a)$. This transformation ensures that the most reliable symbols are always at the first index, corresponding to the GF symbol 0, and that the remaining indexes follow the order $\{\alpha^0, \alpha^1, \dots, \alpha^{q-2}\}$. Step 2 computes the syndrome β using the most reliable symbols $z_n$ of the V2C messages.
In step 3, the first minimum value $m1(a)$ and its column index $m1_{col}(a)$, as well as the second minimum value $m2(a)$, are calculated for each trellis row. Step 4 constructs an extra column $\Delta Q(a)$ based on the most reliable path of the configuration set conf($n_r$, $n_c$) for each symbol a. Let path 0 be the optimal path for symbol 0, which includes all nodes in the first row of the delta trellis. The configuration set conf($n_r$, $n_c$) [13], which contains the possible paths, is constructed from the $n_r$ most reliable messages and a maximum of $n_c$ deviations from path 0. The configuration set conf(1, 2) is considered in [14], where $n_r = 1$ and $n_c = 2$. Thus, only the most reliable message $m1(a)$ and a maximum of two deviations $[\eta_1(a), \eta_2(a)]$ are considered for each symbol a. In the case of one deviation, $\eta_1(a)$ is equal to $\eta_2(a)$; otherwise, they are different. The check node output messages in the delta domain $\Delta R_{mn}(a)$ are generated simultaneously in steps 5 to 13, depending on the deviation information $\eta_1(a)$ and $\eta_2(a)$. For each trellis row, if column j is not a deviation of the most reliable path, then the output message $\Delta R_{mn_j}(a)$ is assigned the extra column value $\Delta Q(a)$. If the most reliable path has one deviation at column j, then the second most reliable message $m2(a)$ is assigned to the output message. If the most reliable path is formed by two deviations, $m1(a)$ is assigned to the output message. Finally, step 14 transforms the output messages from the delta domain to the normal domain to obtain the C2V messages $R_{mn}(a)$. A suitable scaling factor λ is used to improve the performance of the decoder, which does not affect the hardware complexity. A disadvantage of the STMM algorithm [14] is the large memory required to store the $q \times d_c$ output messages $R_{mn}(a)$ of each check node, which causes a large decoder area.
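Steps 1 and 2 of Algorithm 2 (the delta-domain transformation and the syndrome) can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes that GF($2^p$) symbols are stored as p-bit integers, so that field addition is a bitwise XOR, and that the delta-domain mapping is $\Delta Q_{mn}(a + z_n) = Q_{mn}(a)$, which the text describes only in words. All function names are hypothetical.

```python
from functools import reduce
from operator import xor


def hard_decision(q_msg):
    """Most reliable symbol z_n: the index of the smallest LLR."""
    return min(range(len(q_msg)), key=q_msg.__getitem__)


def to_delta_domain(q_msg, z):
    """Assumed mapping: delta[eta] = q_msg[eta XOR z], so delta[0] holds
    the LLR of the most reliable symbol z."""
    return [q_msg[eta ^ z] for eta in range(len(q_msg))]


def syndrome(hard_decisions):
    """GF(2^p) sum (XOR) of the hard-decision symbols of one check node."""
    return reduce(xor, hard_decisions, 0)


# Three V2C vectors over GF(4); index = symbol, value = normalized LLR.
v2c = [[3, 0, 5, 2], [0, 4, 1, 6], [2, 7, 0, 3]]
z = [hard_decision(m) for m in v2c]
delta = [to_delta_domain(m, zn) for m, zn in zip(v2c, z)]
beta = syndrome(z)
```

After the transformation, every delta-domain vector starts with the zero LLR of its most reliable symbol, which is why the first trellis row forms path 0.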
In [18], a compression technique is applied to reduce the check node output messages from $q \times d_c$ values to four elementary sets, $I(a)$, $E(a)$, $P(a)$, and $z_n^*$, comprising $4 \times (q-1) + d_c$ values, without any error-correcting performance loss. Thus, the memory requirement is significantly reduced. The compressed output is as follows:

$$\text{Output: } I(a),\ E(a),\ P(a),\ z_n^* = z_n + \beta. \tag{1}$$

The set $I(a)$ is generated in a similar way to the extra column $\Delta Q(a)$. The set $E(a)$ contains the complement values, which are either $m1(a)$ or $m2(a)$ depending on the deviation information, as shown in (2). The set $P(a)$ contains the path information for updating the output messages, as shown in (3):

$$E(a) = \begin{cases} m2(a) & \text{if } \eta_1(a) = \eta_2(a) \\ m1(a) & \text{otherwise} \end{cases} \tag{2}$$

$$P(a) = \left\{ m1_{col}(\eta_1(a)),\ m1_{col}(\eta_2(a)) \right\}. \tag{3}$$

Finally, the C2V messages are updated by decompressing the check node output messages in the variable node processing as follows:

$$R_{mn_j}\left(a + z^*_{n_j}\right) = \begin{cases} I(a) & \text{if } P(a) \neq j \\ E(a) & \text{otherwise.} \end{cases} \tag{4}$$

III. BASIC-SET TRELLIS MIN-MAX DECODING ALGORITHM

A. Basic-Set Trellis Min-Max Algorithm

In this section, the novel BS-TMM algorithm is proposed to greatly reduce the complexity and the memory requirement

for check node processing as well as the messages exchanged between check nodes and variable nodes with a negligible error-correcting performance loss. The BS-TMM algorithm is highly efficient for designing decoders with high-order GFs. Without loss of generality, the field GF(q) with $q = 2^p$, whose q elements are $\{0, \alpha^0, \alpha^1, \dots, \alpha^{q-2}\}$, is considered in our work. In GF($2^p$), any field element is uniquely represented by a linear combination of p independent field elements. To take advantage of this, only a set of $p = \log_2 q$ independent field elements with the smallest LLRs, called the basic set B, is generated in the check node processing, instead of the $(q-1)$ nonzero field elements of the extra column $\Delta Q(a)$ [16], [19]. The extra column $\Delta Q(a)$ is then constructed from the basic set B in the variable node processing.

Algorithm 3 BS-TMM Algorithm

The BS-TMM algorithm is presented in Algorithm 3. Steps 1-3 are similar to steps 1-3 in Algorithm 2. Step 4 computes the basic set $B = \{m1_l, I_l, a_l\}_{1 \le l \le p}$, which contains $3 \times p$ values (p LLR values, p column indexes, and p field elements), based on the minimum values $m1(a)$ and their column indexes $I_{col}(a)$ $(1 \le a < q)$. The basic set B is found by the function in Algorithm 4. Step 5 calculates the complement values in the set $E(a)$. For the p field elements that belong to the basic set B, the complement values are set to the second minimum values $m2(a)$. For the remaining field elements, the complement values are set to the minimum values $m1(a)$. Finally, the output of the check node processing consists of three sets, B, $E(a)$, and $z_n^*$, with a total size of $3 \times p + (q-1) + d_c$ values, which are used for generating the C2V messages in the variable node processing.
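Step 5 and the size of the check node output can be illustrated with a short sketch. It is only a hedged software model of the rule stated above, not the hardware described later in the paper. The LLR values, the choice of basic-set elements, and the function names are illustrative assumptions, and symbols are labeled with plain integers.

```python
def complement_set(m1, m2, basic_elements):
    """E(a) for every nonzero symbol a: the second minimum m2(a) if a is
    in the basic set, the first minimum m1(a) otherwise."""
    return {a: (m2[a] if a in basic_elements else m1[a]) for a in m1}


def check_node_output_size(q, p, d_c):
    """Number of values in B, E(a) and z_n*: 3p + (q - 1) + d_c."""
    return 3 * p + (q - 1) + d_c


# Illustrative GF(8) values: first and second minimum per nonzero symbol.
m1 = {1: 2, 2: 10, 3: 1, 4: 26, 5: 4, 6: 3, 7: 30}
m2 = {a: v + 5 for a, v in m1.items()}   # hypothetical second minima
basic_elements = {3, 1, 6}               # assumed output of step 4

E = complement_set(m1, m2, basic_elements)
size_gf32 = check_node_output_size(q=32, p=5, d_c=27)
```

Only the p symbols of the basic set take their value from $m2(a)$; every other entry of $E(a)$ is a copy of $m1(a)$, so a single selection per symbol is enough.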
Table I shows the number of bits exchanged between check node and variable node in the proposed algorithm and previous works for the general GF($q = 2^p$) and w quantization bits for the LLR values. In addition, the number of exchanged bits for high-order GFs such as GF(32), GF(64), and GF(128) is computed with $d_c = 27$ and $w = 6$ quantization bits and illustrated in Fig. 1. It is clear that the proposed algorithm greatly reduces the exchanged bits compared with previous works. In [14], all C2V messages generated in the check node processing are exchanged, which causes an extremely high number of check node output bits. It can be seen that the exchanged bits are reduced by factors of almost 13, 16, and for GF(32), GF(64), and GF(128), respectively.

Fig. 1. Number of exchanged bits between the check node and variable node for different GFs.

TABLE I COMPARISON OF EXCHANGED MESSAGES BETWEEN CHECK NODE AND VARIABLE NODE WITH $d_c = 27$ AND $w = 6$

In [16]-[19], a small number of fixed sets, in which the size of each set is proportional to either q or $d_c$, are exchanged. Compared with the original compression technique [16], the proposed work reduces the number of exchanged bits by factors of almost 2.5 and 3.48 for GF(32) and GF(128), respectively. In comparison with the latest work [19], the reduction of the exchanged bits is 38.59% and 52.07% for GF(32) and GF(128), respectively. The BS-TMM algorithm achieves a large reduction of the exchanged bits for two reasons. First, it reduces the number of fixed sets: the basic set B, with $3 \times p$ values, is exchanged instead of the $3 \times (q-1)$ values of the two sets $I(a)$ and $P(a)$ shown in (1). Second, the size of the basic set B is proportional to $p = \log_2 q$, whereas the size of the sets $I(a)$ and $P(a)$ is proportional to q. Thus, the BS-TMM algorithm is extremely efficient for high-order GFs.
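The factors of almost 13 and 16 quoted for GF(32) and GF(64) can be checked roughly with the bit-count expression that the paper gives in Section IV-A for the CNU outputs, $d_c \times p + (q-1) \times w + p \times (w + \log_2(d_c) + p)$. The sketch below rests on two assumptions that are not stated in the text: $\log_2(d_c)$ is rounded up to a whole number of bits, and the uncompressed scheme of [14] exchanges exactly $q \times d_c \times w$ bits. The function names are hypothetical.

```python
from math import ceil, log2


def bs_tmm_bits(p, d_c, w):
    """Bits for z_n* (d_c * p), E(a) ((q - 1) * w) and the basic set B."""
    q = 2 ** p
    index_bits = ceil(log2(d_c))          # assumption: rounded up
    return d_c * p + (q - 1) * w + p * (w + index_bits + p)


def uncompressed_bits(p, d_c, w):
    """Assumed cost of sending all q * d_c C2V messages of w bits each."""
    return (2 ** p) * d_c * w


for p in (5, 6):
    ours = bs_tmm_bits(p, d_c=27, w=6)
    full = uncompressed_bits(p, d_c=27, w=6)
    print(f"GF({2 ** p}): {ours} bits vs {full} bits, factor {full / ours:.1f}")
```

Under these assumptions the basic set costs only $p \times (w + 5 + p)$ bits, so the $(q-1) \times w$ bits of $E(a)$ dominate the total as q grows.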
The function in Algorithm 4 finds the basic set B from the minimum values $m1(a)$ and their column indexes $I_{col}(a)$. Let M be the set containing the minimum values $m1(a)$ and their column indexes $I_{col}(a)$. In step 1, set M is rearranged in ascending order of the $m1(a)$ values to generate a new set $M'$. The first two field elements of set $M'$ are selected for the basic set because they are independent field elements with the smallest LLRs, as shown in steps 2-4. The remaining elements of the basic set are found by the loop in steps 5 to 11. The goal of the loop is to find

the next independent field elements with the smallest LLRs, excluding both the elements already selected and the elements that can be generated by combinations of the selected elements (steps 7 and 8). Finally, the generated basic set B contains p independent field elements with the most reliable LLR values.

Algorithm 4 Function: Finding Basic Set B

Algorithm 5 Construct Extra Column $\Delta Q(a)$ and $R_{mn}(a)$

In the variable node processing, the extra column $\Delta Q(a)$ is recovered and the C2V messages $R_{mn}(a)$ are generated from the output sets of the check node processing, B, $E(a)$, and $z_n^*$, as shown in Algorithm 5. First, the extra column $\Delta Q(a)$ and the path information $d(a)$ are calculated in steps 1-7. For the p field elements that belong to the basic set B, the $\Delta Q(a)$ value is the most reliable LLR $m1_l$, and the path information $d(a)$ has one deviation at the column index $I_l$, with $1 \le l \le p$. The remaining field elements are computed from all possible combinations of the field elements in the basic set B. Their $\Delta Q(a)$ values are the maximum of the LLR values of the combined field elements, and their path information $d(a)$ has more than one deviation and a maximum of p deviations. The C2V messages are updated in steps 8-14. For each row, if the column index j does not belong to the path information $d(a)$, the C2V message $\Delta R_{mj}(a)$ is assigned the extra column value $\Delta Q(a)$. Otherwise, the C2V message $\Delta R_{mj}(a)$ is assigned the value of the complement set $E(a)$. Finally, the C2V messages in the delta domain are converted to the normal domain in step 15.

In Fig. 2, an example of the delta trellis for GF(8) with $d_c = 4$ is presented, where the minimum values in each row are marked with a dashed square. The extra column $\Delta Q(a)$ in the rightmost column is constructed from the basic set B, as shown in Algorithm 5.
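The selection of the basic set and the recovery of the extra column can be sketched in software as follows. This is a sequential illustration of the idea behind Algorithms 4 and 5, not the parallel architecture proposed in the paper. It assumes that GF(8) symbols are stored as 3-bit integers with addition as XOR and that the field is generated by the primitive polynomial $x^3 + x + 1$, a choice made here because it reproduces the identities used in the Fig. 2 example (for instance $\alpha^3 + \alpha^0 = \alpha^1$). The function names and the `ALPHA` table are hypothetical, and the entries of `M` are the (minimum, column, symbol) triples of that example.

```python
from functools import reduce
from itertools import combinations
from operator import xor


def find_basic_set(M, p):
    """Greedy pick of p linearly independent symbols with the smallest LLRs.
    M holds (m1, column, symbol) triples for the nonzero symbols."""
    basic, span = [], {0}
    for m1, col, a in sorted(M, key=lambda t: t[0]):
        if a in span:                      # combination of selected elements
            continue
        basic.append((m1, col, a))
        span |= {s ^ a for s in span}      # extend the span with a
        if len(basic) == p:
            break
    return basic


def build_extra_column(basic):
    """For each nonzero symbol: (max LLR of the combined basic elements,
    sorted list of deviation columns)."""
    extra = {}
    for r in range(1, len(basic) + 1):
        for combo in combinations(basic, r):
            a = reduce(xor, (sym for _, _, sym in combo))
            extra[a] = (max(m for m, _, _ in combo),
                        sorted(c for _, c, _ in combo))
    return extra


# alpha^0 .. alpha^6 as integers under x^3 + x + 1 (assumed representation).
ALPHA = [1, 2, 4, 3, 6, 7, 5]
M = [(2, 1, ALPHA[0]), (10, 2, ALPHA[1]), (26, 3, ALPHA[2]), (1, 1, ALPHA[3]),
     (3, 4, ALPHA[4]), (30, 1, ALPHA[5]), (4, 1, ALPHA[6])]

B = find_basic_set(M, p=3)
extra = build_extra_column(B)
```

Because the p selected symbols are linearly independent, each of the $2^p - 1$ nonzero symbols is produced by exactly one combination, so no entry of the extra column is overwritten.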
This example demonstrates the method of building the basic set B and the extra column $\Delta Q(a)$. From the delta trellis, the set M of minimum values and their column indexes is generated as $M = \{(2, 1, \alpha^0), (10, 2, \alpha^1), (26, 3, \alpha^2), (1, 1, \alpha^3), (3, 4, \alpha^4), (30, 1, \alpha^5), (4, 1, \alpha^6)\}$. After rearranging set M in ascending order of the minimum values, the set $M' = \{(1, 1, \alpha^3), (2, 1, \alpha^0), (3, 4, \alpha^4), (4, 1, \alpha^6), (10, 2, \alpha^1), (26, 3, \alpha^2), (30, 1, \alpha^5)\}$ is obtained. The first two field elements of $M'$ are selected for the basic set as $B = \{(1, 1, \alpha^3), (2, 1, \alpha^0)\}$. The third field element selected is the one with the smallest LLR value among the remaining field elements of set $M'$, excluding the field element $\alpha^1 = \alpha^3 + \alpha^0$, or $(10, 2, \alpha^1)$, which is a combination of the two field elements in B. Hence, $(3, 4, \alpha^4)$ is selected, and the basic set $B = \{(1, 1, \alpha^3), (2, 1, \alpha^0), (3, 4, \alpha^4)\}$ contains $p = 3$ independent field elements with the most reliable messages.

Fig. 2. Example of the trellis based on GF(8) with $d_c = 4$.

Then, the extra column $\Delta Q(a)$ is constructed. For the p field elements in the extra column that belong to the basic set B, namely $\{\alpha^3, \alpha^0, \alpha^4\}$, the LLR values $\Delta Q(a)$ and the path information $d(a)$ are the same as the LLR values and column indexes in the basic set B. For the other field elements, all combinations of the field elements in B are considered as follows: $\Delta Q(\alpha^3 + \alpha^0 = \alpha^1) = \max(1, 2) = 2$

and $d(\alpha^3 + \alpha^0 = \alpha^1) = \{1, 1\}$; $\Delta Q(\alpha^0 + \alpha^4 = \alpha^5) = \max(2, 3) = 3$ and $d(\alpha^0 + \alpha^4 = \alpha^5) = \{1, 4\}$; $\Delta Q(\alpha^3 + \alpha^4 = \alpha^6) = \max(1, 3) = 3$ and $d(\alpha^3 + \alpha^4 = \alpha^6) = \{1, 4\}$; and $\Delta Q(\alpha^3 + \alpha^0 + \alpha^4 = \alpha^2) = \max(1, 2, 3) = 3$ and $d(\alpha^3 + \alpha^0 + \alpha^4 = \alpha^2) = \{1, 1, 4\}$.

B. Performance Analysis

To demonstrate the error-correcting performance of the proposed BS-TMM decoding algorithm, we performed simulations for two GFs: GF(32) and GF(64). Fig. 3 illustrates the frame error rate (FER) performance of the (837, 726) NB-LDPC code over GF(32) with $d_v = 4$ and $d_c = 27$ under the additive white Gaussian noise (AWGN) channel with binary phase shift keying modulation. As shown in Fig. 3, the floating-point simulation result of the BS-TMM algorithm with 15 iterations shows a minor performance loss of almost 0.1 dB compared with the STMM algorithm [14] and the two-extra-column TMM algorithm [17]. However, the proposed BS-TMM algorithm provides low computation complexity, a large area reduction, and a significant improvement in throughput. This is explained by the fact that the $(q-1)$ messages of the extra column $\Delta Q(a)$ in [14] and [17] are constructed directly from all reliable messages of the configuration set conf(1, 2) using $(q-1)$ processors, whereas in our work they are constructed from only the p reliable messages in the basic set. Compared with the R-TMM algorithm [11], in which the C2V messages are generated on the basis of minimum basic sets, the FER performance of the BS-TMM algorithm is almost the same.

Fig. 3. FERs of the (837, 726) NB-LDPC code over GF(32) under the AWGN channel.

Fig. 4. FERs of the (1512, 1323) NB-LDPC code over GF(64) under the AWGN channel.
Note that $d_c$ minimum basic sets are required in the R-TMM algorithm [11] to generate the C2V messages, whereas the proposed BS-TMM algorithm requires only one basic set to construct the extra column. Moreover, the sequential design implemented in [11] limits the throughput, whereas the design based on the proposed BS-TMM algorithm performs all calculations in one clock cycle. For the purpose of hardware implementation, various fixed-point simulations were performed using different quantization schemes. A scheme with 5-bit quantization and eight iterations was chosen, which shows a performance loss of almost 0.1 dB compared with the floating-point result at 15 iterations. Fig. 4 shows the FER performance of the (1512, 1323) NB-LDPC code over GF(64). It can be seen that the BS-TMM algorithm has a minor performance loss of 0.07 dB and 0.14 dB in comparison with the modified trellis min-max (mT-MM) algorithm [19] and the STMM algorithm [14], respectively. This result demonstrates that the proposed BS-TMM algorithm provides good FER performance and significantly reduced computation complexity for high-order GFs.

IV. BS-TMM DECODER ARCHITECTURE

In this section, the proposed quasi-cyclic NB-LDPC decoder architectures and the design techniques for the BS-TMM algorithm are described. The quasi-cyclic NB-LDPC codes over GF(q) are constructed by the algebraic construction method based on array dispersions of matrices in [22], where a $(q-1) \times (q-1)$ submatrix is generated first. Then, a submatrix of size $(d_v, d_c)$ is selected from the $(q-1) \times (q-1)$ submatrix. Each field element of the $(d_v, d_c)$ submatrix is dispersed into either a zero matrix or a circulant permutation matrix (CPM) of size $(q-1) \times (q-1)$. As a result, the H matrix generated from the $(d_v, d_c)$ submatrix has $M = (q-1) \times d_v$ rows and $N = (q-1) \times d_c$ columns.

A. CNU Architecture

The top-level CNU architecture for the BS-TMM algorithm is shown in Fig. 5, where each module corresponds to a step in Algorithm 3. The transformation module converts V2C messages from the normal domain to the delta domain using the control signals $z_j$. This module is built from $d_c$ reordering networks, as shown in [23], where each reordering network requires $q \times \log_2 q$ w-bit multiplexers. The

check node syndrome β is generated by a tree adder structure. The delta-to-normal domain transformation is performed later using $d_c$ reordering networks with the control signals $z_j^* = z_j + \beta$. The minimum-finding function of step 3 finds the first two minimum values and the index of the first minimum value among $d_c$ inputs using the 2-min finder. The 2-min finder applies the technique in [24], which provides a good tradeoff between area and latency. Because this function must be evaluated for the $(q-1)$ rows of the delta trellis other than the first row, a total of $(q-1)$ 2-min finders are required. Fig. 6 shows an example of the 2-min finder architecture with eight inputs.

Fig. 5. Top-level CNU architecture for BS-TMM algorithm.

Fig. 6. Two-min finder architecture with eight inputs [24].

In [14]-[19], the values in the extra column $\Delta Q(a)$ and the path information are generated from the first minimum values $m1(a)$ and their column indexes $I_{col}(a)$. A total of $(q-1)$ processors are required to generate the LLR values and the path information for the $(q-1)$ nonzero elements of $\Delta Q(a)$. When the configuration set conf(1, 2) is used, each processor constructs $q/2$ possible paths to find the LLR value and the path information of one nonzero field element. For higher order GFs (increasing q), constructing the extra column $\Delta Q(a)$ becomes more complex and more costly in terms of area. In our work, only a basic set B containing the LLR values and the column indexes of $p = \log_2 q$ independent field elements needs to be constructed, which provides a large reduction of not only the area but also the messages exchanged between check node and variable node. Note that, in our work, multiple nodes can come from the same column stage. This causes a negligible performance loss, as shown in [11].

Fig. 7. (a) Third element in the basic set. (b) Fourth element in the basic set.
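The task of one 2-min finder (first minimum, its index, and second minimum) can be modeled with a pairwise merge tree, as sketched below. This is a generic software model and an assumption on our part: it is not the architecture of [24], whose internal structure is not described in the text, and the function names are hypothetical. Indexes are zero-based.

```python
def merge(left, right):
    """Combine two partial results of the form (min1, index_of_min1, min2)."""
    a1, ia, a2 = left
    b1, ib, b2 = right
    if a1 <= b1:
        return a1, ia, min(a2, b1)
    return b1, ib, min(b2, a1)


def two_min(values):
    """Reduce the inputs level by level, as a comparator tree would."""
    inf = float("inf")
    nodes = [(v, i, inf) for i, v in enumerate(values)]
    while len(nodes) > 1:
        merged = [merge(nodes[k], nodes[k + 1])
                  for k in range(0, len(nodes) - 1, 2)]
        if len(nodes) % 2:                 # odd count: carry the last node up
            merged.append(nodes[-1])
        nodes = merged
    return nodes[0]


# One delta-trellis row with eight inputs (illustrative LLR values).
m1, m1_col, m2 = two_min([7, 3, 9, 1, 5, 8, 6, 4])
```

Each level of the tree halves the number of candidates, so the depth grows with $\log_2$ of the number of inputs, and the odd-count branch lets the same model handle $d_c = 27$ inputs.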
For example, the trellis in Fig. 2 shows that two independent field elements in the basic set, $(1, 1, \alpha^3)$ and $(2, 1, \alpha^0)$, come from the same column stage.

The architecture of the basic set constructor corresponds to the steps in Algorithm 4. The parallel sorting approach in [20] is applied in this paper to generate the rearranged minimum values $m1'(a)$ and their indexes $I'_{col}(a)$ simultaneously, in ascending order of the $m1(a)$ values, in one clock cycle. Then, the first two field elements in the basic set are taken from the first two field elements of the rearranged values, that is, $(m1_1, I_1, a_1) = (m1'(a'_1), I'_{col}(a'_1), a'_1)$ and $(m1_2, I_2, a_2) = (m1'(a'_2), I'_{col}(a'_2), a'_2)$. In this paper, we propose an architecture that obtains the $(p-2)$ remaining independent field elements in parallel. Thus, the p independent field elements in the basic set B are calculated in one clock cycle. Fig. 7 shows the proposed architectures for the next two elements in the basic set. In Fig. 7(a), the architecture is designed to generate the third element, where the combination of the first two elements, $a_1 + a_2$, is removed from the remaining rearranged field elements $\{a'_3, a'_4, \dots, a'_{q-1}\}$ by assigning the maximum

quantization bits instead of the LLR value $m1'(a'_j)$. The min1-finder architecture finds the smallest value and its index among $(q-3)$ inputs. The smallest value is the LLR value of the third element $m1_3$, and its index is used to obtain the field element $a_3$ and the column index $I_3$. The signals $c^1_3, c^1_4, \dots, c^1_{q-1}$ are used to eliminate the combination of the previous field elements, $a_1 + a_2$, when finding the current element and the next element. The fourth element, which is independent of the previous elements, is generated immediately afterward, as shown in Fig. 7(b). To ensure the independence, two eliminations are made. First, the input field elements in this stage are either the rearranged field elements $\{a'_3, a'_4, \dots, a'_{q-1}\}$ or the p-bit zero element, depending on the control signals $c^1_3, c^1_4, \dots, c^1_{q-1}$. Second, the third element $a_3$ and the combinations of the previous elements with the third element, $\{a_1 + a_3, a_2 + a_3, a_1 + a_2 + a_3\}$, are eliminated. The control signals $c^2_3, c^2_4, \dots, c^2_{q-1}$ are responsible for both eliminations when finding the fourth element and, further, the next field element. The LLR value $m1_4$, field element $a_4$, and column index $I_4$ of the fourth element are found in the same way as those of the third element. The same procedure applies to the remaining field elements in the basic set. Finally, the p independent field elements in the basic set B are generated simultaneously.

TABLE II SYNTHESIS RESULTS FOR THE PROPOSED CNU ARCHITECTURE

Fig. 8. E(a) complement generator for GF(8).

The architecture of the $E(a)$ complement generator for GF(8), shown in Fig. 8, is designed to generate the $(q-1)$ LLR values in parallel. The one-hot function generates a group of q bits, where only the bit at location $a_j$ is equal to 1 and all the other bits are equal to 0. Therefore, the control signal e[0:7] has p high bits.
The high bit locations correspond to the field elements generated from one-deviation paths, and for these the complement values $E(a)$ are assigned $m2(a)$. Otherwise, the field elements are generated from paths with two or more deviations, and the complement values $E(a)$ are assigned $m1(a)$. The outputs of the proposed CNU architecture, including $z_n$, $E(a)$, and $B = \{m1_l, I_l, a_l\}_{1 \le l \le p}$, are used to generate the C2V messages $R_{mn}(a)$ according to Algorithm 5 in the variable node processing. Thus, the total number of bits exchanged from check node to variable node is $d_c \times p + (q-1) \times w + p \times (w + \log_2(d_c) + p)$ bits.

The synthesis results of the proposed CNU architecture for the (837, 726) NB-LDPC code over GF(32) and the (1512, 1323) NB-LDPC code over GF(64), obtained using the Synopsys design tools and a TSMC 90-nm CMOS standard cell library, are presented in Table II. Compared with the works in [14] and [15], the proposed CNU reduces the gate count by 51.22% and 34.64% for GF(32), respectively, because it removes the $(q-1)$ processors for finding the extra column [14], [15] and applies the compression technique [16]. Compared with the original TMM algorithm with compressed messages in [16], the area saving is almost 30% for GF(32) and 37% for GF(64). This is due to the complexity reduction from finding the basic set with a size of $p = \log_2 q$ instead of finding the sets corresponding to the extra column with a size of $q$ [16]. In [20], $L = 4$ is chosen for designing the CNU architecture in both GF(32) and GF(64), whereas $\log_2(32) = 5$ and $\log_2(64) = 6$ values are kept in the proposed CNU architecture for GF(32) and GF(64), respectively. The number of bits exchanged between the check node and variable node in [20] is almost the same as in the proposed work. However, the proposed CNU architecture reduces the area by 18.7% for GF(32) and 14.2% for GF(64).
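As a rough check on the C2V bit count given above, the expression can be evaluated for the GF(64) code. The sketch below assumes that the three terms correspond to $z_n$, $E(a)$, and the basic set $B$, that the column index is stored with $\lceil \log_2 d_c \rceil$ bits, and that $w = 6$ quantization bits are used; the value of $w$ and the rounding are assumptions for illustration, while $d_c = 24$ is the value reported for the (1512, 1323) code. The uncompressed figure counts $d_c$ vectors of $q$ values with $w$ bits each.

```python
import math


def c2v_bits_compressed(q: int, d_c: int, w: int) -> int:
    """Bits sent per check node: z_n, E(a), and the basic set B."""
    p = q.bit_length() - 1                  # p = log2(q) for q a power of two
    idx_bits = math.ceil(math.log2(d_c))    # assumed width of a column index
    z_bits = d_c * p                        # d_c hard-decision symbols of p bits
    e_bits = (q - 1) * w                    # complement values E(a)
    b_bits = p * (w + idx_bits + p)         # p entries (m1_l, I_l, a_l)
    return z_bits + e_bits + b_bits


def c2v_bits_uncompressed(q: int, d_c: int, w: int) -> int:
    """Bits for d_c full message vectors of q values, w bits each."""
    return q * d_c * w


compressed = c2v_bits_compressed(q=64, d_c=24, w=6)
uncompressed = c2v_bits_uncompressed(q=64, d_c=24, w=6)
print(compressed, uncompressed)
```

Under these assumptions the compressed exchange is more than an order of magnitude smaller than the uncompressed one, which is consistent with the routing and storage savings claimed in the text.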
This area reduction is achieved because the work in [20] requires $(q-1)$ processors to calculate the $(q-1)$ elements of the complement set $E(a)$, whereas the proposed work uses only one module to calculate them. In addition, this improvement would increase significantly if the $L$ values in [20] were chosen to match the proposed CNU design. Compared with [17], since the proposed CNU needs to find only $p = \log_2 q$ elements instead of the $q$ elements of two extra columns, the CNU complexity is reduced by 11.6%.

B. Decoder Architecture

In this section, a complete decoder architecture based on the BS-TMM algorithm is designed for NB-LDPC codes. The proposed decoder architecture achieves a large reduction in area because of the large area reduction in the CNU architecture. In addition, the throughput improves because reducing the wires between the check node and variable node processors mitigates the routing congestion. The layered min-max algorithm for the proposed decoder is presented in Algorithm 6, where the BS-TMM in Algorithm 3 is implemented in the check node processor. In addition, the decompression network (DN) corresponding to Algorithm 5 is implemented in the variable node processor to generate the C2V messages $R_{mn}(a)$ from the outputs of the CNU architecture. The DN has three parts: 1) generating the LLR values of the extra column $Q(a)$ and the path information $d(a)$ with a maximum of $p$ deviations on the basis of the basic set $B = \{m1_l, I_l, a_l\}_{1 \le l \le p}$; 2) generating the C2V messages in the delta domain, $R_{mn}(a)$, on the basis of $Q(a)$, $E(a)$, and $d(a)$; and 3) converting the C2V messages from the delta domain to the normal domain. It is noted that two DNs are required in the variable node processor. However, the proposed decoder area is much lower than that of the conventional decoders [14], [15].

Algorithm 6 Proposed Layered Decoding Algorithm

Fig. 9. Proposed extra column and path information generator for GF(8). (a) Extra column generator for the jth element. (b) Control signal generator. (c) Path information generator for the jth element.

First, Fig. 9 shows the proposed extra column and path information generator for GF(8). The LLR value of each element in the extra column, $Q(a_j)$, is selected from one of the $p$ LLR values $m1_l$ $(1 \le l \le p)$ in the basic set $B$ depending on the $p$ control signals $s_l[j]$ $(1 \le l \le p)$, as shown in Fig. 9(a). $(q-1)$ instances of the architecture in Fig. 9(a) are required to compute the $(q-1)$ messages in $Q(a)$ simultaneously. The $p$ control signals $s_l[j]$ $(1 \le l \le p)$ are generated using the architecture in Fig. 9(b). To compute $Q(a_j)$, only one of the control signals $s_l[j]$ is equal to 1, and the others are equal to 0. Thus, only one of the $p$ LLR values is selected for the output $Q(a_j)$. To calculate the $p$ control signals $s_l[j]$, the $2^p - 1 = q - 1$ combinations of the $p$ field elements in the basic set $B$, excluding the zero element, are divided into $p$ groups, as shown in Fig. 9(b). The $p$ control signals $s_l[j]$ correspond to the $p$ outputs of the groups. The $l$th group contains the field element $a_l$ and its combinations with all possible combinations of the previous field elements $a_k$ $(0 < k < l)$. Therefore, the $l$th group $(l > 0)$ includes $2^{l-1}$ combinations of the field elements.
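The grouping of the $2^p - 1$ combinations described above can be sketched as follows. This is an illustrative software model under assumptions: field elements are $p$-bit integers with XOR as addition, the basic set is already ordered by ascending $m1_l$, and the function name `extra_column` and the sample basic set are hypothetical. The group of a combination is taken to be the largest basic-set index it contains, which yields $2^{l-1}$ members in group $l$ as stated in the text.

```python
from collections import Counter
from typing import Dict, List, Tuple


def extra_column(basic: List[Tuple[int, int, int]]) -> Dict[int, Tuple[int, int]]:
    """Map each nonzero field element to (Q(a), group index l).

    basic holds tuples (m1_l, I_l, a_l) for l = 1..p in ascending m1 order.
    """
    p = len(basic)
    q_col = {}
    for subset in range(1, 1 << p):          # every nonempty combination of basic elements
        elem, group = 0, 0
        for l in range(p):
            if (subset >> l) & 1:
                elem ^= basic[l][2]          # GF(2^p) addition is XOR
                group = l + 1                # ends as the largest index in the combination
        # Exactly one control signal s_l[j] is high, so Q(a_j) = m1_l for that group.
        q_col[elem] = (basic[group - 1][0], group)
    return q_col


# Hypothetical GF(8) basic set: (m1_l, I_l, a_l) with independent elements 5, 2, 3.
basic = [(1, 0, 5), (2, 2, 2), (4, 1, 3)]
q_col = extra_column(basic)
group_sizes = Counter(group for _, group in q_col.values())
```

Because the basic set is sorted, selecting $m1_l$ of the largest index in a combination is the same as taking the maximum LLR over that combination, which is one way to read why a single selected value suffices in a min-max decoder.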
In addition, $(q-1)$ sets of path information corresponding to the $(q-1)$ field elements are constructed, where each path information $d[j]$ has $p$ column indexes $d_l[j]$ $(1 \le l \le p)$. $(q-1)$ instances of the architecture in Fig. 9(c) are required to compute them. For the $p$ field elements $a_l$ $(1 \le l \le p)$ in the basic set, the paths have one deviation, and thus the $p$ values of the path information $\{d_l[j]\}_{1 \le l \le p}$ are all equal to the column index $I_l$. The path of the field element generated by the combination of all field elements in the basic set has the maximum of $p$ deviations; thus, its $p$ values $d_l[j]$ $(1 \le l \le p)$ correspond to the $p$ column indexes $I_l$ $(1 \le l \le p)$ in the basic set. For the other field elements, generated by the remaining combinations, the number of deviations is $k$ $(1 < k < p)$; then $d_l[j]$ is assigned the column index $I_l$ for $1 \le l \le k$, and $d_l[j]$ with $k < l \le p$ is assigned the column index $I_k$.

Fig. 10 presents the proposed C2V message generator for GF(8) with $d_c = 4$. The C2V messages $R_{mj}(a)$ $(1 \le j \le d_c)$ are produced simultaneously as the outputs of multiplexers that select either $Q(a)$ or $E(a)$. The control signals for the multiplexers depend on the column indexes and the $p$ deviations of the path information. If the column index $j$ $(1 \le j \le d_c)$ is equal to at least one of the $p$ deviations $d_l(a)$ $(1 \le l \le p)$, then the output of the multiplexer is the complement value $E(a)$. Otherwise, the output of the multiplexer is $Q(a)$.

Fig. 10. Proposed C2V message generator for GF(8) with $d_c = 4$.

Fig. 11. Top-level decoder architecture based on the BS-TMM algorithm.

Fig. 11 shows the top-level decoder architecture for the proposed layered decoding algorithm, where one row of H corresponding to one layer is processed in one clock cycle. The decoder architecture is divided into a variable node processor and a check node processor. To start the decoding process, the LLR messages from the channel information $L_n(a)$ are loaded into the variable node memory (VNMEM). For subsequent layers and iterations, the output messages of the variable node processor $Q^{k,l}_n(a)$ are stored in the VNMEM. The VNMEM includes $d_c$ memories with a depth of $(q-1)$, the size of the CPM [22], and a width of $q \times w$ bits. In each decoding step, one address is read from and one address is written to each memory. The permutation and depermutation of the variable messages in steps 4 and 9 of Algorithm 6 are implemented by modules $P$ and $P^{-1}$, respectively. Each module requires $d_c \times (q-1) \times \log_2 q$ multiplexers of $w$ bits to permute or depermute $d_c$ vectors of $(q-1)$ messages, and the control signals are based on the nonzero values $h_{mn}$ of H. The normalization module N is responsible for finding the most reliable messages and their locations $z_n$, and for generating the $Q^{k,l}_{mn}(a)$ messages for the inputs of the check node processor. In addition, normalization ensures that the smallest value in each LLR vector $Q^{k,l}_{mn}(a)$ is always equal to zero. At the last decoding iteration, the $z_n$ values are the hard-decision symbols $c_n$ stored in the output memory, and the $P$ module and subtractor are inactive during this process.

Since a layered decoding scheme is used, the outputs of the check node processor in one iteration must be stored in the check node memory (CNMEM) for the next iteration. Thus, the CNMEM in the proposed decoder has a depth of $M$ and a width of $p \times (w + \log_2(d_c) + p) + (q-1) \times w + d_c \times p$ bits, corresponding to the output bits of the check node processor. A total of $M \times [p \times (w + \log_2(d_c) + p) + (q-1) \times w + d_c \times p]$ bits are stored in one iteration. Compared with the $M \times q \times d_c \times w$ bits stored in the CNMEM in the conventional approach [14], the memory requirement for the CNMEM in the proposed decoder is greatly reduced, which leads to a large reduction in decoder area.

V. IMPLEMENTATION RESULTS AND COMPARISON

To illustrate the efficiency of our proposal for NB-LDPC codes, especially for high-order GFs, the complete decoder architectures were implemented for two codes: the (837, 726) NB-LDPC code over GF(32) and the (1512, 1323) NB-LDPC code over GF(64). Verilog HDL was used to model the architectures, and Synopsys design tools with the TSMC 90-nm CMOS standard cell library were used to implement the proposed decoder architectures. The throughput $T_p$ of the decoders is computed as shown in (5), where $seg$ is the number of pipeline stages used in the decoder architecture to improve the timing. In the proposed decoder architectures, $seg = 9$ was chosen to obtain a balance between throughput and area

$$T_p = \frac{f_{clk}\,[\mathrm{MHz}] \times (q-1) \times d_c \times p}{I_{max} \times (M + d_v \times seg) + (q-1)} \ [\mathrm{Mbps}]. \qquad (5)$$

Table III shows the implementation results of the proposed decoder in comparison with other state-of-the-art works for the (837, 726) NB-LDPC code over GF(32). The proposed decoder outperforms the other approaches in both area and throughput. Compared with the STMM algorithm with uncompressed messages [14], our work has almost 8.3 times higher efficiency and reduces the gate count by a factor of 4.3. This significant improvement is achieved by the large reduction in both the storage bits in the CNMEM and the CNU complexity, as explained previously.
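Equation (5) can be evaluated directly once the code and decoder parameters are fixed. The sketch below uses the GF(64) code parameters stated in the paper, $(d_v, d_c) = (3, 24)$ and $seg = 9$, and takes $M = (q-1) \times d_v = 189$ check rows, which is an inference from the CPM size and agrees with $1512 - 1323 = 189$. The clock frequency of 300 MHz and $I_{max} = 10$ are placeholder assumptions, not values reported in the paper, so the result is illustrative only.

```python
def decoding_cycles(q: int, d_v: int, m_rows: int, i_max: int, seg: int) -> int:
    """Denominator of (5): clock cycles needed to decode one codeword."""
    return i_max * (m_rows + d_v * seg) + (q - 1)


def throughput_mbps(f_clk_mhz: float, q: int, d_c: int, d_v: int,
                    m_rows: int, i_max: int, seg: int) -> float:
    """Throughput in Mbps according to (5)."""
    p = q.bit_length() - 1                    # bits per GF(q) symbol, q a power of two
    codeword_bits = (q - 1) * d_c * p         # (q - 1) * d_c symbols of p bits
    return f_clk_mhz * codeword_bits / decoding_cycles(q, d_v, m_rows, i_max, seg)


# GF(64) code from the paper; f_clk_mhz and i_max are hypothetical placeholders.
cycles = decoding_cycles(q=64, d_v=3, m_rows=189, i_max=10, seg=9)
tp = throughput_mbps(f_clk_mhz=300.0, q=64, d_c=24, d_v=3, m_rows=189, i_max=10, seg=9)
print(cycles, round(tp, 1))
```

The form of (5) shows the trade-off behind the choice of $seg$: each extra pipeline stage adds $d_v$ cycles per iteration to the denominator, so deeper pipelining only pays off if it raises the clock frequency by a larger factor.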
Compared with [11], our proposal not only reduces the gate count but also increases the throughput because of its reduced complexity and parallel processing in the CNU. Thus, the proposed decoder achieves almost 9.4 times higher efficiency. In [20], a reduced-complexity NB-LDPC decoder was proposed on the basis of reducing the size of the intrinsic information and the path coordinates to $L < q$ values, and the decoder performance depends on the selected $L$ value, whereas our approach reduces the size of these sets to $p = \log_2 q$ values for any GF. Because the complexity of the proposed CNU is reduced, the efficiency of the proposed decoder with $p = 5$ is almost 1.7 times higher than that in [20] implemented with $L = 4$. Compared with the decoders in [16], [19], and [17], the proposed decoder reduces the gate count by 40%, 35.4%, and 5.5%, and achieves 53%, 44.7%, and 4.5% higher efficiency, respectively. Moreover, the proposed decoder is almost 14.4 times more efficient than that of [25].

In our work, the (1512, 1323) NB-LDPC code over GF(64) is constructed from the submatrix $(d_v, d_c) = (3, 24)$ and a CPM of size $(q-1) \times (q-1)$ [22], which gives the same code rate as the (1536, 1344) NB-LDPC code in previous works, as shown in Table IV. It is noted that the size of the CPM in previous works is $q \times q$ instead of $(q-1) \times (q-1)$. The synthesis results of the proposed decoder for the (1512, 1323) NB-LDPC code and the comparison with previous works are presented in Table IV. For a fair comparison with previous works in terms of throughput, the clock frequency, after placing and routing the design, was reduced following the work in [20]. The proposed decoder reduces the gate count by 57% and achieves almost 3.8 times higher efficiency compared with the work in [14]. Compared with the works with compressed messages [16], [19], the proposed decoder improves not only the gate count but also the throughput because of the large reduction in the CNU complexity and in the messages exchanged between the check node and variable node, which helps mitigate the routing congestion. As a result, the proposed decoder reduces the gate count by 37.4% and 48.4%, and achieves 53.2% and 61.4% higher efficiency, compared with [16] and [19], respectively. Moreover, the proposed decoder exhibits almost 38.6% higher efficiency compared with the work in [20] with $L = 5$ for codes in GF(64).

TABLE III COMPARISON OF THE PROPOSED DECODER WITH OTHER WORKS FOR THE (837, 726) NB-LDPC CODE OVER GF(32)

TABLE IV IMPLEMENTATION RESULTS OF THE PROPOSED DECODER FOR THE (1512, 1323) NB-LDPC CODE OVER GF(64) IN A 90-nm CMOS PROCESS

VI. CONCLUSION

In this paper, we propose a novel basic-set trellis min-max algorithm for decoding NB-LDPC codes that reduces the complexity of the CNU architecture, the messages exchanged between the check node and the variable node, and the storage bits in the CNMEM, compared with previous works. The implementation results show that the decoder architecture based on the proposed algorithm provides a large area reduction and a throughput improvement compared with other state-of-the-art works. In addition, the results for the NB-LDPC code over GF(64) demonstrate that the proposed algorithm is especially efficient for high-rate NB-LDPC codes with high-order GFs.

REFERENCES

[1] M. C. Davey and D. MacKay, "Low-density parity check codes over GF(q)," IEEE Commun. Lett., vol. 2, no. 6, pp , Jun
[2] R. Peng and R.-R. Chen, "WLC45-2: Application of nonbinary LDPC codes for communication over fading channels using higher order modulations," in Proc. IEEE Global Telecommun. Conf. (GLOBECOM), Nov./Dec. 2006, pp
[3] M. Arabaci, I. B. Djordjevic, L. Xu, and T. Wang, "Nonbinary LDPC-coded modulation for high-speed optical fiber communication without bandwidth expansion," IEEE Photon. J., vol. 4, no. 3, pp , Jun
[4] C. A. Aslam, Y. L. Guan, and K. Cai, "Non-binary LDPC code with multiple memory reads for multi-level-cell (MLC) flash," in Proc. Asia Pacific Signal Inf. Process. Assoc., Annu. Summit Conf. (APSIPA), 2014, pp
[5] L. Barnault and D. Declercq, "Fast decoding algorithm for LDPC over GF(2^q)," in Proc. IEEE Inf. Theory Workshop, Mar./Apr. 2003, pp
[6] H. Wymeersch, H. Steendam, and M. Moeneclaey, "Log-domain decoding of LDPC codes over GF(q)," in Proc. IEEE Int. Conf. Commun., vol. 2, Jun. 2004, pp
[7] D. Declercq and M. Fossorier, "Decoding algorithms for nonbinary LDPC codes over GF(q)," IEEE Trans. Commun., vol. 55, no. 4, pp , Apr
[8] V. Savin, "Min-max decoding for non binary LDPC codes," in Proc. IEEE Int. Symp. Inf. Theory, Jul. 2008, pp
[9] X. Zhang and F. Cai, "Reduced-complexity decoder architecture for nonbinary LDPC codes," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 19, no. 7, pp , Jul
[10] K. He, J. Sha, and Z. Wang, "Nonbinary LDPC code decoder architecture with efficient check node processing," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 59, no. 6, pp , Jun


[11] F. Cai and X. Zhang, "Relaxed min-max decoder architectures for nonbinary low-density parity-check codes," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 21, no. 11, pp , Nov
[12] E. Li, D. Declercq, and K. Gunnam, "Trellis-based extended min-sum algorithm for non-binary LDPC codes and its hardware structure," IEEE Trans. Commun., vol. 61, no. 7, pp. 2600–2611, Jul. 2013.
[13] E. Li, F. García-Herrero, D. Declercq, K. Gunnam, J. O. Lacruz, and J. Valls, "Low latency T-EMS decoder for non-binary LDPC codes," in Conf. Rec. 47th Asilomar Conf. Signals, Syst. Comput. (ASILOMAR), Nov. 2013, pp
[14] J. O. Lacruz, F. García-Herrero, D. Declercq, and J. Valls, "Simplified trellis min-max decoder architecture for nonbinary low-density parity-check codes," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 23, no. 9, pp , Sep
[15] J. O. Lacruz, F. García-Herrero, J. Valls, and D. Declercq, "One minimum only trellis decoder for non-binary low-density parity-check codes," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 62, no. 1, pp , Jan
[16] J. O. Lacruz, F. García-Herrero, and J. Valls, "Reduction of complexity for nonbinary LDPC decoders with compressed messages," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 23, no. 11, pp , Nov
[17] H. P. Thi and H. Lee, "Two-extra-column trellis min-max decoder architecture for nonbinary LDPC codes," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 5, pp , May
[18] J. O. Lacruz, F. García-Herrero, M. J. Canet, J. Valls, and A. Pérez-Pascual, "A 630 Mbps non-binary LDPC decoder for FPGA," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2015, pp
[19] J. O. Lacruz, F. García-Herrero, M. J. Canet, and J. Valls, "High-performance NB-LDPC decoder with reduction of message exchange," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 24, no. 5, pp , May
[20] J. O. Lacruz, F. García-Herrero, M. J. Canet, and J.
Valls, "Reduced-complexity nonbinary LDPC decoder for high-order Galois fields based on trellis min-max algorithm," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 24, no. 8, pp , Aug
[21] Y.-L. Ueng, K.-H. Liao, H.-C. Chou, and C.-J. Yang, "A high-throughput trellis-based layered decoding architecture for non-binary LDPC codes using max-log-QSPA," IEEE Trans. Signal Process., vol. 61, no. 11, pp , Jun
[22] B. Zhou, J. Kang, S. Song, S. Lin, K. Abdel-Ghaffar, and M. Xu, "Construction of non-binary quasi-cyclic LDPC codes by arrays and array dispersions," IEEE Trans. Commun., vol. 57, no. 6, pp , Jun
[23] J. Lin, J. Sha, Z. Wang, and L. Li, "Efficient decoder design for nonbinary quasicyclic LDPC codes," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 57, no. 5, pp , May
[24] C.-L. Wey, M.-D. Shieh, and S.-Y. Lin, "Algorithms of finding the first two minimum values and their hardware implementation," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 55, no. 11, pp , Dec
[25] X. Chen and C.-L. Wang, "High-throughput efficient non-binary LDPC decoder based on the simplified min-sum algorithm," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 59, no. 11, pp , Nov

Huyen Pham Thi (M'14) received the B.S. degree from the Department of Information and Communication Engineering, Military Technical Academy, Ha Noi, Vietnam. She is currently working toward the integrated M.S. and Ph.D. degree with the Department of Information and Communication Engineering, Inha University, Incheon, South Korea. Her current research interests include algorithms and VLSI architecture design for digital signal processing, forward error correction architectures, and communication systems.

Hanho Lee (M'98–SM'13) received the Ph.D. and M.S. degrees in electrical and computer engineering from the University of Minnesota, Minneapolis, MN, USA, in 2000 and 1996, respectively. From 2000 to 2002, he was a Member of Technical Staff with Lucent Technologies (Bell Labs Innovations), Allentown, PA, USA.
From 2002 to 2004, he was an Assistant Professor with the Department of Electrical and Computer Engineering, University of Connecticut, Storrs, CT, USA. Since 2004, he has been with the Department of Information and Communication Engineering, Inha University, Incheon, Korea, where he is currently a Professor. From 2010 to 2011, he was a Visiting Scholar with Bell Labs, Alcatel-Lucent, Murray Hill, NJ, USA. His current research interests include VLSI architecture design for forward error correction coding, cryptographic, VLSI signal processing, and digital communications.
