Optimal Overlapped Message Passing Decoding of Quasi-Cyclic LDPC Codes


Optimal Overlapped Message Passing Decoding of Quasi-Cyclic LDPC Codes. Yongmei Dai and Zhiyuan Yan, Department of Electrical and Computer Engineering, Lehigh University, PA 18015, USA. E-mails: {yod304, yan}@lehigh.edu

Abstract: Efficient hardware implementation of low-density parity-check (LDPC) codes is of great interest since LDPC codes are being considered for a wide range of applications. Recently, overlapped message passing (OMP) decoding has been proposed to improve the throughput and hardware utilization efficiency (HUE) of decoder architectures for LDPC codes. In this paper, we first study the scheduling for the OMP decoding of LDPC codes, and show that maximizing the throughput amounts to minimizing the intra- and inter-iteration waiting times. We then focus on the OMP decoding of quasi-cyclic (QC) LDPC codes, and propose a new partly parallel OMP decoder architecture. For any QC LDPC code, our new OMP decoder architecture achieves the maximum throughput and HUE, and hence has higher throughput and HUE than previously proposed OMP decoder architectures while maintaining the same hardware requirements. We also show that the maximum throughput and HUE achieved by OMP decoders are both ultimately determined by the given code. Thus, we propose a coset-based construction method, which results in QC LDPC codes that allow OMP decoders to achieve higher throughput and HUE.

Index Terms: Low-density parity-check (LDPC) codes, quasi-cyclic (QC) codes, message passing, throughput, hardware utilization efficiency (HUE).

This work was financed by a grant from the Commonwealth of Pennsylvania, Department of Community and Economic Development, through the Pennsylvania Infrastructure Technology Alliance (PITA). Part of the material in this paper will be presented at Globecom 2005 and SiPS 2005.

I. INTRODUCTION

Discovered first by Gallager in the early 1960s [1], low-density parity-check (LDPC) codes were mostly ignored for over three decades by coding researchers. Among the notable exceptions, Tanner [2] introduced a graphical interpretation of LDPC codes, wherein the codeword bits and the parity check equations are represented as variable and check nodes respectively in a bipartite graph. Since their rediscovery in the mid-1990s, LDPC codes have attracted a lot of attention (see, e.g., [3-27]) because, when decoded with the iterative message passing algorithm (see, e.g., [4]), long LDPC codes can achieve error performances within a small fraction of the Shannon limit (see, e.g., [5]). Their near-capacity performances make LDPC codes strong candidates for error control in many communication and data storage systems. In addition to code design, efficient hardware implementation of LDPC decoders has also been an active research area, and various (fully parallel, fully serial, and partly parallel) decoder architectures have been proposed [7-18]. However, various challenging issues still remain. One such issue concerns the code structure of LDPC codes. In general, the random code structure often used to enhance error performances causes problems in the implementation of such codes [28], whereas a regular code structure facilitates hardware implementation; this motivates the joint code-decoder design methodology [10, 13, 14, 18, 19]. One subclass of LDPC codes with regular structure is the quasi-cyclic (QC) LDPC codes¹ [10, 20-25]. Due to their regular structure, QC LDPC codes lend themselves conveniently to efficient hardware implementations [10-12, 15] since the decoder architectures for QC LDPC codes require less memory and simpler logic control. Construction of QC LDPC codes with good error performances is therefore of both theoretical and practical interest.
Various construction methods of QC LDPC codes (see, e.g., [10, 20-23, 26]) have been proposed. In particular, the QC LDPC codes based on the multiplicative structure of finite fields (often referred to as the SFT codes) [21, 23] are shown to have good girth and minimum distance properties [21, 23-25]. Furthermore, for short to medium block lengths, the SFT codes perform comparably to or better than randomly constructed regular LDPC codes [23]. For partly parallel LDPC decoder architectures, another issue is the low throughput and hardware utilization efficiency (HUE) due to the data dependencies inherent in the multi-iteration message passing decoding [11, 27]. Each iteration of the message passing decoding consists of two sequential steps: check node updates followed by variable node updates. Two types of data dependencies, intra- and inter-iteration, exist since the variable (check, respectively) node updates of any iteration utilize the

¹ To simplify terminology, QC LDPC codes in this paper refer to the QC LDPC codes that are defined by parity check matrices consisting of cyclically shifted identity matrices. There are other constructions for QC LDPC codes (see, for example, [6]).

results of the check (variable, respectively) node updates of the same (previous, respectively) iteration. Hence, check node function units (CNFUs) and variable node function units (VNFUs), the two kinds of hardware units that implement the check and variable node updates respectively, often work one after another, resulting in under-utilization of hardware and low throughput. On the architectural level, two techniques are often used to improve HUE: folding (processor sharing) and data interleaving. Folding the check and variable node updates improves HUE, but not throughput [27]. Data interleaving, i.e., letting the CNFUs and VNFUs operate alternately on two different blocks of data at the same time, improves both throughput and HUE, but also doubles the memory requirements [9, 10]. A more efficient solution to this issue exists on the algorithmic level. Recently, an overlapped message passing (OMP) decoding algorithm, which (partially) overlaps the check and variable node updates, has been proposed for QC LDPC codes [11] and LDPC codes in general [12]. The OMP decoding takes advantage of the concurrency between the check and variable node updates, and improves both throughput and HUE while using the same memory and introducing no error performance degradation. Since it is carried out on the algorithmic level, further architecture-level improvements such as parallel processing (see, e.g., [17]) can be carried out separately. In this paper, we study the OMP decoding of LDPC codes on both the decoder-algorithmic and code-design levels. In particular, the contributions of this paper are a new OMP decoder architecture for QC LDPC codes that achieves the maximum throughput and HUE, and a new construction method leading to QC LDPC codes that allow OMP decoders to achieve higher throughput and HUE than the SFT codes while maintaining similar error performances. The details of the results in this paper are as follows.
We first present the scheduling for the OMP decoding of LDPC codes, and show that maximizing the throughput amounts to minimizing the intra- and inter-iteration waiting times. Assuming the hardware structure in [11], we then focus on the OMP decoding of QC LDPC codes and show that maximizing the throughput is an optimization problem subject to the data dependency and hardware constraints, imposed by the decoding algorithm and hardware structure respectively. Furthermore, with the data dependency constraints alone, the minimum intra- and inter-iteration waiting times are not only iteration invariant but also equal to each other. Using this property, we solve the optimization problem with both the data dependency and hardware constraints, and show that the maximum throughput gain achieved by OMP decoders is bounded above by 2 and is determined by the minimum waiting time. We then propose a reduced-complexity procedure that obtains the minimum waiting time, and an OMP decoder architecture that achieves the maximum throughput gain and HUE for any QC LDPC code. In comparison to the decoder architecture in [11], our new decoder architecture has the same hardware requirements but

achieves higher throughput and HUE. Although the OMP decoding algorithm is applicable to any QC LDPC code, previous studies have focused mostly on the SFT codes [11]. However, the upper bound of 2 on the maximum throughput gain of OMP decoders is not always achieved by the SFT codes. In order to achieve this upper bound, we propose a new construction method that is based on cosets. We show that the QC LDPC codes so constructed have girth and minimum Hamming distance bounds similar to those in [23] for the SFT codes. Furthermore, when the design parameters are fixed, our coset-based QC LDPC codes include the SFT codes [21, 23] as subsets. Hence, given any SFT code, there exists a group of coset-based QC LDPC codes (including the given SFT code) with the same parameters and error performances similar to those of the given SFT code. Replacing the given SFT code with the code among the group that allows OMP decoders to achieve the highest throughput gain improves both throughput and HUE while maintaining similar error performances. The rest of the paper is organized as follows. Section II presents the scheduling for the OMP decoding of LDPC codes. In Section III, we first briefly review QC LDPC codes and their decoder architectures, and then propose an OMP decoder that achieves the maximum throughput gain and HUE. Section IV presents our coset-based construction of QC LDPC codes. Some concluding remarks are given in Section V.

II. SCHEDULING FOR OMP DECODING

Message passing decoding (see, e.g., [4]) of LDPC codes is a multi-iteration procedure that iteratively exchanges extrinsic log likelihood ratio (LLR) information between check and variable nodes until the stopping criterion is satisfied. Usually, this means either that the parity check equations are satisfied or that the pre-set iteration number is reached. Each iteration of message passing decoding consists of two sequential steps: check node updates followed by variable node updates.
Two types of data dependencies, intra- and inter-iteration, exist since the variable (check, respectively) node updates of any iteration utilize the results of the check (variable, respectively) node updates of the same (previous, respectively) iteration. Thus, as shown in Fig. 1(a), check node updates and variable node updates are conventionally performed one after another. This is not necessary, as pointed out in [11, 12], since overlapping of check and variable node updates is possible via proper scheduling. An overlapped scheduling of the check and variable node updates is shown in Fig. 1(b), where $w^{(p)}$ and $\bar{w}^{(p)}$ represent the intra- and the inter-iteration waiting time at the $p$-th iteration respectively, and $n$ denotes the total number of iterations. The intra-iteration waiting time, $w^{(p)}$, is used to ensure that when the VNFUs start the variable node updates of the $p$-th iteration, the data operated on by the VNFUs

Fig. 1. Non-overlapped vs. overlapped scheduling: (a) non-overlapped scheduling; (b) overlapped scheduling.

are the results of the check node updates of the $p$-th iteration. Similarly, the inter-iteration waiting time, $\bar{w}^{(p+1)}$, is used to ensure that when the CNFUs start the check node updates of the $(p+1)$-th iteration, the data operated on by the CNFUs are the results of the variable node updates of the $p$-th iteration. The scheduling scheme is completely determined by $\{w^{(1)}, \bar{w}^{(2)}, w^{(2)}, \ldots, \bar{w}^{(n)}, w^{(n)}\}$. Suppose the check and variable node updates of each iteration take $f_c$ and $f_v$ units of time respectively; then $0 \le w^{(p)} \le f_c$ for all $1 \le p \le n$ and $0 \le \bar{w}^{(p)} \le f_v$ for all $2 \le p \le n$. The worst case scenario occurs when $w^{(p)} = f_c$ for all $1 \le p \le n$ and $\bar{w}^{(p)} = f_v$ for all $2 \le p \le n$, in which case the overlapped scheduling in Fig. 1(b) reduces to the conventional non-overlapped scheduling in Fig. 1(a). The throughput in this worst case scenario will be used as the reference when calculating the throughput gain achieved by OMP decoders henceforth in this paper. If $n$ iterations are performed, $n(f_c + f_v)$ units of time are needed for the non-overlapped decoding as opposed to $f_v + \sum_{p=1}^{n} w^{(p)} + \sum_{p=2}^{n} \bar{w}^{(p)}$ for the OMP decoding, and the throughput gain $G$ is given by

$$G = \frac{n (f_c + f_v)}{f_v + \sum_{p=1}^{n} w^{(p)} + \sum_{p=2}^{n} \bar{w}^{(p)}}. \quad (1)$$

When $f_c$ and $f_v$ are fixed, the maximum throughput gain is achieved when the sum $\sum_{p=1}^{n} w^{(p)} + \sum_{p=2}^{n} \bar{w}^{(p)}$

is minimized. When the inputs to the decoder are continuous, the throughput gain is given by

$$G = \frac{n f_c + (n-1) f_v}{f_c + \sum_{p=1}^{n} w^{(p)} + \sum_{p=2}^{n} \bar{w}^{(p)}}$$

instead of (1). Since the difference is negligible, we use (1) for consistency with [11] henceforth in this paper. For ease of implementation, it is desirable that the waiting times be iteration invariant. That is,

$$w^{(p)} = w \ \text{for} \ 1 \le p \le n \quad \text{and} \quad \bar{w}^{(p)} = \bar{w} \ \text{for} \ 2 \le p \le n. \quad (2)$$

III. OPTIMAL OMP DECODING OF QC LDPC CODES

QC LDPC codes not only have relatively good error performances, but can also be encoded efficiently with shift registers [24, 25] and decoded with partly parallel decoder architectures, which require simpler address generation mechanisms, less memory, and localized memory access [10, 11]. Thus, in this paper, we focus on the OMP decoding of QC LDPC codes using partly parallel decoder architectures. To make this paper self-contained, in the following, we first briefly review QC LDPC codes and their decoder architectures.

A. QC LDPC Codes and Their Decoder Architectures

The parity check matrix defining a $(j, k)$-regular QC LDPC code can be expressed as

$$H = \begin{bmatrix} I_{x_{0,0}} & I_{x_{0,1}} & I_{x_{0,2}} & \cdots & I_{x_{0,k-1}} \\ I_{x_{1,0}} & I_{x_{1,1}} & I_{x_{1,2}} & \cdots & I_{x_{1,k-1}} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ I_{x_{j-1,0}} & I_{x_{j-1,1}} & I_{x_{j-1,2}} & \cdots & I_{x_{j-1,k-1}} \end{bmatrix}, \quad (3)$$

where $I_{x_{s,t}}$ denotes an $m \times m$ identity matrix with all the rows cyclically shifted to the right by $x_{s,t}$ positions ($0 \le s \le j-1$, $0 \le t \le k-1$). There exist various ways to define the values $x_{s,t}$ [10, 20-23, 26]. For example, the construction of the SFT codes [21, 23] is based on the multiplicative structure of finite fields. Assume $m$ is prime; then $Z_m = \{0, 1, \ldots, m-1\}$ is a field under addition and multiplication modulo $m$. Let $a$ and $b$ be two non-zero elements of $Z_m$ with multiplicative orders $k$ and $j$ respectively. Setting $x_{s,t} = b^s a^t \pmod{m}$, the parity check matrix in (3) defines an SFT code. Construction based on a non-prime $m$ is also possible [23].
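As an illustration (ours, not part of the paper), the following Python sketch assembles the parity check matrix of Eq. (3) from SFT shift values $x_{s,t} = b^s a^t \pmod{m}$; all function names are our own.

```python
def circulant_identity(m, x):
    """m x m identity matrix with rows cyclically shifted right by x (I_x in Eq. (3))."""
    return [[1 if (i + x) % m == c else 0 for c in range(m)] for i in range(m)]

def sft_shifts(m, a, b, j, k):
    """SFT shift values x_{s,t} = b^s * a^t (mod m), for m prime,
    a of multiplicative order k and b of order j in Z_m."""
    return [[(pow(b, s, m) * pow(a, t, m)) % m for t in range(k)]
            for s in range(j)]

def qc_ldpc_H(m, shifts):
    """Assemble the block parity check matrix of Eq. (3) as a list of rows."""
    j, k = len(shifts), len(shifts[0])
    H = []
    for s in range(j):
        blocks = [circulant_identity(m, shifts[s][t]) for t in range(k)]
        for i in range(m):
            # concatenate row i of the k circulants in block row s
            H.append([blocks[t][i][c] for t in range(k) for c in range(m)])
    return H
```

For example, $m = 7$, $a = 2$ (order $k = 3$) and $b = 6$ (order $j = 2$) yield a $(2,3)$-regular code with a $14 \times 21$ parity check matrix of row weight 3 and column weight 2.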
We assume the same hardware structure as that in [11], which consists of $jk$ dual-port memory banks, $j$ $k$-input CNFUs, and $k$ $j$-input VNFUs. Each memory bank has $m$ units labelled with addresses

$\{0, 1, 2, \ldots, m-1\}$, and corresponds to one $I_{x_{s,t}}$ in $H$. Each 1 in $I_{x_{s,t}}$ denotes an edge between a check node and a variable node in the Tanner graph. In other words, each 1 is connected to a check node and a variable node. The $c_{s,t}$-th memory unit in the memory bank alternately stores the LLR information of the check node and the variable node that are connected to the 1 in the $c_{s,t}$-th row of $I_{x_{s,t}}$, after check node updates and variable node updates respectively. Since the 1 in the $c_{s,t}$-th row is also in the $v_{s,t}$-th column, where $v_{s,t} = c_{s,t} + x_{s,t} \pmod{m}$, $c_{s,t}$ and $v_{s,t}$ respectively are called the row and column addresses of the $c_{s,t}$-th memory unit. The $s$-th CNFU operates on the $k$ memory banks corresponding to the $s$-th block row of $H$, i.e., $[I_{x_{s,0}}, I_{x_{s,1}}, \ldots, I_{x_{s,k-1}}]$, and simultaneously updates the memory units with the same row address in these memory banks, one row per clock cycle. It starts with the memory units with row address $c_s$ in each memory bank, and goes through the memory units in a cyclically increasing order. The $t$-th VNFU operates on the $j$ memory banks corresponding to the $t$-th block column of $H$, i.e., $[I_{x_{0,t}}, I_{x_{1,t}}, \ldots, I_{x_{j-1,t}}]^T$, and simultaneously updates the memory units with the same column address in these memory banks, one column per clock cycle. It starts with the memory units with column address $v_t$ in each memory bank and updates the memory units in a cyclically increasing order. Thus, $m$ clock cycles are needed for the CNFUs (VNFUs) to finish one iteration of the check (variable) node updates, i.e., $f_c = f_v = m$. More details about the hardware structure can be found in [11]. In [12], each of the CNFUs (VNFUs) starts the check (variable) node updates as soon as one row (column) address in its own block rows (columns) is ready, whereas in [11] and this paper all the CNFUs (VNFUs) always start the check (variable) node updates at the same time. The latter approach leads to simpler control mechanisms.

B. Data Dependency Only

Given the hardware structure, the intra- and inter-iteration waiting times will henceforth be measured in terms of the number of clock cycles. Since the data dependencies are inherent in the multi-iteration message passing algorithm, in this subsection, we study the minimization of the intra- and inter-iteration waiting times subject to only the data dependency constraints. Based on the hardware structure described above, the data dependencies manifest themselves as follows. Due to the cyclic fashion in which each CNFU or VNFU operates on the memory units of every memory bank, satisfaction of the data dependencies at the start of the check or variable node updates guarantees that the data dependencies will be satisfied for the whole iteration. Intra-iteration data dependency: when any VNFU starts operating on the $j$ memory units with column address $v_t$ for the variable node updates of the $p$-th iteration, all these $j$ memory units have been

processed by the check node updates of the $p$-th iteration. Inter-iteration data dependency: when any CNFU starts operating on the $k$ memory units with row address $c_s$ for the check node updates of the $(p+1)$-th iteration, all these $k$ memory units have been processed by the variable node updates of the $p$-th iteration. For the $p$-th iteration, let $c^{(p)} = \{c^{(p)}_0, c^{(p)}_1, \ldots, c^{(p)}_{j-1}\}$ denote the starting row addresses of the $j$ CNFUs, and $v^{(p)} = \{v^{(p)}_0, v^{(p)}_1, \ldots, v^{(p)}_{k-1}\}$ denote the starting column addresses of the $k$ VNFUs. Then $c^{(p)}$ and $v^{(p)}$ (for all $1 \le p \le n$) are the parameters that we optimize to minimize the intra- and inter-iteration waiting times. Since the message passing decoding of LDPC codes always starts from the check node updates, during the first iteration the CNFUs can choose the starting row addresses to be any address between $0$ and $m-1$ so as to minimize $w^{(1)}$. For the subsequent check and variable node updates, the choices of the starting addresses of the CNFUs and VNFUs are restricted by the preceding updates due to the data dependency constraints. The range of the choices of the row and column addresses is therefore limited when minimizing $w^{(p)}$ and $\bar{w}^{(p)}$ for $2 \le p \le n$. Hence, it would seem that the minimum waiting times for each iteration are different. However, as shown below, the minimum intra- and inter-iteration waiting times are iteration invariant. To simplify our analysis, assume that the VNFUs can also choose the starting column addresses to be any address between $0$ and $m-1$ without being affected by the preceding updates, and define $\bar{w}^{(1)}$ as the corresponding inter-iteration waiting time that the VNFUs need to minimize. Let $w^{(p)}_*$ and $\bar{w}^{(p)}_*$ denote the minimum values of $w^{(p)}$ and $\bar{w}^{(p)}$ for $1 \le p \le n$ respectively; the following lemma shows that the minimum intra- and inter-iteration waiting times are equal and iteration invariant.
Lemma 1: Given the hardware structure, the data dependency constraints alone determine that the minimum intra- and inter-iteration waiting times satisfy $\bar{w}^{(p+1)}_* = w^{(p+1)}_* = \bar{w}^{(p)}_* = w^{(p)}_* = w_*$ for $1 \le p \le n-1$. That is, they are equal and iteration invariant. Furthermore, any sequence $\{w^{(1)}, \bar{w}^{(2)}, w^{(2)}, \ldots, \bar{w}^{(n)}, w^{(n)}\}$ that satisfies $w^{(p)} \ge w_*$ for $1 \le p \le n$ and $\bar{w}^{(p)} \ge w_*$ for $2 \le p \le n$ is achievable.

See Appendix A for the proof of Lemma 1. Since the minimum intra- and inter-iteration waiting times are equal to each other, henceforth $w_*$ is simply called the minimum waiting time.

C. Hardware and Data Dependency Constrained Optimization

The hardware structure described in Section III-A imposes the extra hardware constraints $w^{(p)} + \bar{w}^{(p+1)} \ge m$ and $\bar{w}^{(p+1)} + w^{(p+1)} \ge m$ for $p = 1, 2, \ldots, n-1$ on the optimization problem. That is, the starting times of the check or variable node updates of two consecutive iterations need to be at least $m$ clock cycles apart. This is because the CNFUs (or VNFUs) cannot start the updates of any iteration until the

updates of the previous iteration are completed, which take $m$ clock cycles. It is clear that waiting times satisfying only the hardware constraints can also be iteration invariant. Therefore, the waiting times that maximize the throughput gain subject to both the data dependency and hardware constraints can be iteration invariant. Henceforth, we use $w$ and $\bar{w}$ to denote the intra- and inter-iteration waiting times, respectively. The throughput gain defined in (1) becomes

$$G = \frac{2mn}{m + nw + (n-1)\bar{w}},$$

and maximizing $G$ subject to both the data dependency and hardware constraints thus reduces to determining $w$ and $\bar{w}$ subject to

$$w \ge w_* \quad \text{and} \quad \bar{w} \ge w_*, \quad (4)$$

$$w + \bar{w} \ge m. \quad (5)$$

Let us denote the maximum throughput gain achieved as $G_*$. Since

$$\frac{2mn}{m + nw + (n-1)\bar{w}} \le \frac{2mn}{mn + w} < 2$$

due to the hardware constraint in (5), we always have $G_* < 2$. When $n$ is large, $G_* \approx 2$ is achieved. This upper bound on $G_*$ is consistent with intuition: given the hardware constraint in (5), the best case scenario is for the CNFUs and VNFUs to work concurrently and continuously, which doubles the throughput and HUE of the worst case scenario where the CNFUs and VNFUs work alternately. Also, it is clear that when the hardware constraint in (5) is met with equality, most of the throughput gain is achieved, and reducing $w$ further improves the throughput gain only slightly. We show below that the solution to the optimization problem and the value of $G_*$ depend on whether $w_* \le \lfloor m/2 \rfloor$ is satisfied. When $w_* \le \lfloor m/2 \rfloor$, the throughput gain is maximized by $w = w_*$ and $\bar{w} = m - w_*$, and $G_* = \frac{2mn}{mn + w_*}$, which is approximately 2 when $n$ is large. In this case, the hardware constraint in (5) is indeed met with equality. The scheduling of the CNFUs and VNFUs is shown in Fig. 2(a). After finishing the check (variable) node updates of any iteration, the CNFUs (VNFUs) can start the check (variable) node updates of the next iteration right away. Thus, the HUE for this case is 100%.
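The two cases of this optimization (the second is derived below) can be summarized in a short Python sketch. This is ours, not the paper's; the function name is hypothetical, and the case split and formulas follow Section III-C.

```python
def optimal_omp_schedule(w_star, m, n):
    """Pick (w, wbar) maximizing G subject to constraints (4) and (5),
    and return (w, wbar, G_star, HUE). A sketch of Section III-C."""
    if w_star <= m // 2:
        w, wbar = w_star, m - w_star      # constraint (5) met with equality
        hue = 1.0                         # CNFUs/VNFUs never stall
    else:
        w = wbar = w_star                 # (5) is slack; stalls of 2*w_star - m
        hue = m / ((2 * w_star - m) + m)  # HUE = m / (w_cc + m)
    g_star = 2 * m * n / (m + n * w + (n - 1) * wbar)
    return w, wbar, g_star, hue
```

For instance, with $m = 10$ and $n = 100$: $w_* = 3$ gives $w = 3$, $\bar{w} = 7$ and $G_* = 2000/1003 \approx 1.994$ at 100% HUE, while $w_* = 7$ gives $G_* = 2000/1403 \approx 1.426$ at HUE $10/14$.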
When $w_* \ge \lfloor m/2 \rfloor + 1$, the throughput gain is maximized by $w = \bar{w} = w_*$, and $G_* = \frac{2mn}{w_*(2n-1) + m}$, which is approximately $\frac{m}{w_*}$ when $n$ is large. In this case, $w + \bar{w} > m$ and hence the hardware constraint in (5) is not met with equality. The scheduling of the CNFUs and VNFUs for this case is shown in Fig. 2(b). After finishing the check (variable) node updates of any iteration, the CNFUs (VNFUs) have to wait $w_{cc} = 2w_* - m$ ($w_{vv} = 2w_* - m$) clock cycles before they can start the check (variable) node updates of the next iteration. The HUE is given by $\frac{m}{w_{cc} + m}$, which is clearly smaller than 1. We stress that the $G_*$ given above for both cases is the maximum throughput gain that can be achieved by any OMP decoder using the hardware structure in [11]. We also observe that the $G_*$ achieved by OMP decoders for any QC LDPC code is ultimately determined by the code itself via $w_*$. Thus, $w_*$ is in a sense a measure of the inherent concurrency of the code. It is clear that only when $w_* \le \lfloor m/2 \rfloor$ can the HUE

Fig. 2. Overlapped scheduling: (a) $w_* \le \lfloor m/2 \rfloor$; (b) $w_* \ge \lfloor m/2 \rfloor + 1$.

of 100% and $G_* \approx 2$ be achieved. The condition $w_* \le \lfloor m/2 \rfloor$ does not hold in general. However, as explained in Section III-D, $w_* - 1 \le \lfloor m/2 \rfloor$ always holds when $j = 2$.

D. Minimum Waiting Time

As seen above, obtaining $w_*$ is vital to achieving the maximum throughput gain and HUE. In the following, we propose two procedures that obtain $w_*$. Since Lemma 1 implies that $w_* = w^{(1)}_*$, $w_*$ can be obtained by minimizing the intra-iteration waiting time over all possible starting row addresses $c^{(1)} = \{c^{(1)}_0, c^{(1)}_1, \ldots, c^{(1)}_{j-1}\}$ of the CNFUs. For the $t$-th VNFU ($0 \le t \le k-1$), let us use the $r_t$-th block row ($0 \le r_t \le j-1$) as a reference block row when we try to determine the waiting time for the $t$-th VNFU. We abuse the notation slightly here since, as shown below, the reference block rows can be iteration invariant. The $s$-th ($0 \le s \le j-1$) and $r_t$-th CNFUs start from the row addresses $c^{(1)}_s$ and $c^{(1)}_{r_t}$ respectively, whose corresponding column addresses are $c^{(1)}_s + x_{s,t} \pmod{m}$ and $c^{(1)}_{r_t} + x_{r_t,t} \pmod{m}$ respectively. The difference between the two column addresses, given by

$$d_{s,t} = \left( c^{(1)}_s + x_{s,t} \right) - \left( c^{(1)}_{r_t} + x_{r_t,t} \right) \pmod{m}, \quad (6)$$

implies that it takes $d_{s,t}$ clock cycles for the $r_t$-th CNFU to reach a column address that has been processed by the $s$-th CNFU. Let $r = \{r_0, r_1, \ldots, r_{k-1}\}$ denote the reference block rows for all the VNFUs. With $c^{(1)}$ and $r$ fixed, we obtain a $j \times k$ column address difference matrix $D(c^{(1)}, r) = [d_{s,t}]$.
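The column address difference matrix of Eq. (6) is straightforward to compute. The following Python fragment (ours, with hypothetical names) builds $D(c^{(1)}, r)$ from the shift values $x_{s,t}$:

```python
def address_difference_matrix(c1, r, shifts, m):
    """D(c^(1), r) = [d_{s,t}] per Eq. (6).
    c1[s] is the starting row address of CNFU s, r[t] the reference
    block row of VNFU t, and shifts[s][t] = x_{s,t}."""
    j, k = len(shifts), len(shifts[0])
    return [[(c1[s] + shifts[s][t] - c1[r[t]] - shifts[r[t]][t]) % m
             for t in range(k)]
            for s in range(j)]
```

By construction, the row of $D$ indexed by the reference block row of VNFU $t$ contains a zero in column $t$, since a block row is always "caught up" with itself.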

For the given $c^{(1)}$ and $r$, the maximum column address difference, $\max_{s,t} D(c^{(1)}, r)$, gives the maximum difference between the column addresses of the chosen reference block rows and the other block rows for all the VNFUs. This maximum column address difference can be minimized over all possible $c^{(1)}$ and $r$. Finally, note that the minimum waiting time is the number of clock cycles needed for each of the VNFUs to have at least one column address that has been processed by all the CNFUs and is ready for the variable node updates. Hence, the minimum waiting time is the maximum column address difference plus one. To summarize,

$$w_* = \min_{c^{(1)}} \left\{ \min_{r} \left[ \max_{s,t} D(c^{(1)}, r) \right] \right\} + 1. \quad (7)$$

Obviously, $w_*$ in (7) can be found by an exhaustive search over all combinations of $\{c^{(1)}_0, c^{(1)}_1, \ldots, c^{(1)}_{j-1}\}$ and $\{r_0, r_1, \ldots, r_{k-1}\}$. Without loss of generality, one of the starting row addresses, say $c^{(1)}_0$, can be fixed to zero due to the cyclic nature of the row addresses. Thus, the complexity of the exhaustive search is $O(m^{j-1} j^{k+1} k)$. Since each of the VNFUs can choose its reference block row independently, the complexity of the search can be reduced by adopting the following alternative procedure:

1) For the $t$-th VNFU, it takes $\hat{d}_t \overset{\text{def}}{=} \max_{0 \le s \le j-1} d_{s,t}$ clock cycles for the $r_t$-th CNFU to reach a column address that has been processed by all the other CNFUs, which means the $t$-th VNFU cannot start until $\hat{d}_t + 1$ clock cycles after the CNFUs have started if we use the $r_t$-th block row as the reference.

2) Since we can use any block row as the reference for the $t$-th VNFU, let $\tilde{d}_t \overset{\text{def}}{=} \min_{0 \le r_t \le j-1} \hat{d}_t$; then the $t$-th VNFU needs to wait a minimum of $\tilde{d}_t + 1$ clock cycles after the CNFUs have started.

3) Let $\tilde{d} = \max_{0 \le t \le k-1} \tilde{d}_t$; the maximum waiting time among all the VNFUs is then $\tilde{d} + 1$.

4) Repeat steps 1)-3) for all the possible starting addresses $c^{(1)}$. Let $\hat{w} = \min_{c^{(1)}} \tilde{d}$, and let $\hat{c}^{(1)}$ be the minimizing starting row addresses.
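A direct implementation of the exhaustive search over $c^{(1)}$ and $r$ in (7) can be sketched as follows (Python; ours, for illustration only, using the same shift-array representation as before):

```python
from itertools import product

def min_waiting_time_exhaustive(shifts, m):
    """w_* by brute force over c^(1) and r, per Eq. (7).
    c_0^(1) is fixed to 0 by the cyclic symmetry noted in the text."""
    j, k = len(shifts), len(shifts[0])
    best = m
    for rest in product(range(m), repeat=j - 1):
        c1 = (0,) + rest
        for r in product(range(j), repeat=k):
            # max over all entries of D(c^(1), r), per Eq. (6)
            d_max = max((c1[s] + shifts[s][t] - c1[r[t]] - shifts[r[t]][t]) % m
                        for s in range(j) for t in range(k))
            best = min(best, d_max)
    return best + 1
```

For the small example with $m = 7$ and shifts $[[1,2,4],[6,5,3]]$ ($j = 2$, $k = 3$), this search yields $w_* = 3$.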
This procedure essentially produces
$$\breve{w} = \min_{c^{(1)}} \left\{ \max_{t} \left[ \min_{r_t} \left( \max_{s} d_{s,t} \right) \right] \right\}, \qquad (8)$$
and we can show that

Lemma 2: $\bar{w} = \breve{w} + 1$.

See Appendix B for the proof of Lemma 2. $\breve{w}$ in (8) can be obtained via an exhaustive search of complexity $O(m^{j-1} j^2 k)$. The optimal starting row addresses may not be unique. Finally, since
$$\bar{d}_t = \min_{r_t} \left( \max_{s} d_{s,t} \right) = \min\left\{ \left( \max_{s} d_{s,t} \right),\ m - \left( \max_{s} d_{s,t} \right) \right\} \le \left\lfloor \frac{m}{2} \right\rfloor$$
when $j = 2$, we always have $\bar{w} \le \lfloor m/2 \rfloor + 1$.
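The reduced-complexity procedure above can be sketched in code. The sketch below is our own illustration, not the authors' implementation; the function name is ours. Here `x[s][t]` is the circulant shift $x_{s,t}$ of block $(s,t)$, $c^{(1)}_0$ is fixed to zero by cyclic symmetry, and the function returns $\bar{w} = \breve{w} + 1$ (Lemma 2) together with one set of minimizing starting row addresses and reference block rows (which need not be unique).

```python
def min_waiting_time(x, m):
    """Steps 1)-4): minimize the intra-iteration waiting time.

    x[s][t] is the circulant shift x_{s,t} of block (s, t); returns
    (w_bar, c, r): w_bar = w_breve + 1 per Lemma 2, the minimizing
    starting row addresses c (with c_0 fixed to 0), and the per-VNFU
    reference block rows r."""
    j, k = len(x), len(x[0])
    best = None
    for idx in range(m ** (j - 1)):          # enumerate c = (0, c_1, ..., c_{j-1})
        c, rem = [0], idx
        for _ in range(j - 1):
            c.append(rem % m)
            rem //= m
        # starting column address of CNFU s in block column t
        col = [[(c[s] + x[s][t]) % m for t in range(k)] for s in range(j)]
        d_bar, r = 0, []
        for t in range(k):
            # step 1: d~_t = max_s d_{s,t}; step 2: minimize over r_t
            dt, rt = min(
                (max((col[s][t] - col[q][t]) % m for s in range(j)), q)
                for q in range(j))
            r.append(rt)
            d_bar = max(d_bar, dt)           # step 3
        if best is None or d_bar < best[0]:  # step 4: minimize over c
            best = (d_bar, c, r)
    w_breve, c_opt, r_opt = best
    return w_breve + 1, c_opt, r_opt
```

For the $(2,5)$ code used as the example in Section III-F ($m = 11$, shifts $\{1,3,9,5,4\}$ and $\{10,8,2,6,7\}$), this search returns $\bar{w} = 5$, the value reported there.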

E. Address Generation Mechanisms

We have determined above the intra- and inter-iteration waiting times that maximize the throughput gain. It remains to determine the starting row and column addresses of the CNFUs and VNFUs for each iteration. For the $p$-th iteration, let $\tilde{c}^{(p)} = \{\tilde{c}^{(p)}_0, \tilde{c}^{(p)}_1, \ldots, \tilde{c}^{(p)}_{j-1}\}$ and $\tilde{v}^{(p)} = \{\tilde{v}^{(p)}_0, \tilde{v}^{(p)}_1, \ldots, \tilde{v}^{(p)}_{k-1}\}$ denote the optimal starting row and column addresses of the CNFUs and the VNFUs respectively, and let $\tilde{r} = \{\tilde{r}_0, \tilde{r}_1, \ldots, \tilde{r}_{k-1}\}$ denote the minimizing reference block rows of all the VNFUs. Since $\bar{w}$ is obtained by setting the starting row addresses of the CNFUs to $\tilde{c}^{(1)}$, $\bar{w}$ clock cycles after the CNFUs start the check node updates of the first iteration, each VNFU can find at least one column address in its own block column that has already been updated by the CNFUs. As shown above, for the $t$-th VNFU, the $\tilde{r}_t$-th block row is the minimizing reference block row given $\tilde{c}^{(1)}$. Thus, after $\bar{d}_t + 1$ clock cycles, $\tilde{c}^{(1)}_{\tilde{r}_t} + x_{\tilde{r}_t,t} + \bar{d}_t \pmod{m}$ is the only column address the $t$-th VNFU can update. After $\bar{w}$ clock cycles, all the column addresses $[\tilde{c}^{(1)}_{\tilde{r}_t} + x_{\tilde{r}_t,t} + \bar{d}_t,\ \tilde{c}^{(1)}_{\tilde{r}_t} + x_{\tilde{r}_t,t} + \breve{w}]$ are ready for the $t$-th VNFU, where $[y_1, y_2]$ denotes the set of cyclically increasing addresses $\{y_1 \pmod{m}, y_1 + 1 \pmod{m}, \ldots, y_2 \pmod{m}\}$. Clearly, the starting column addresses for some of the VNFUs are not necessarily unique. For ease of implementation, we choose to use the addresses $\tilde{v}^{(1)}_t = \tilde{c}^{(1)}_{\tilde{r}_t} + x_{\tilde{r}_t,t} + \breve{w} \pmod{m}$ ($0 \le t \le k-1$) for all the VNFUs to start the variable node updates of the first iteration. After the VNFUs have carried out the variable node updates of the first iteration for $\bar{w}$ clock cycles, each CNFU can find at least one row address in its own block row that has already been updated by the VNFUs. This row address is determined by the inter-iteration data dependency rather than by the hardware.
Therefore, when $\bar{w} \ge \lfloor m/2 \rfloor + 1$ and $\tilde{w} = \bar{w}$, similarly to the above, there exist multiple choices of the starting row addresses for some of the CNFUs, and we can use the addresses $\tilde{c}^{(1)}_s + \breve{w} \pmod{m}$ ($0 \le s \le j-1$) for all the CNFUs. When $\bar{w} \le \lfloor m/2 \rfloor$ and $\tilde{w} = m - \bar{w}$, which is $m - \bar{w} - \bar{w} = m - 2\breve{w} - 2$ clock cycles more than what the data dependency requires, all the addresses $[\tilde{c}^{(1)}_s + \breve{w},\ \tilde{c}^{(1)}_s - \breve{w} - 2]$ ($0 \le s \le j-1$) are ready for all the CNFUs. For ease of implementation, in both cases, we choose to use $\tilde{c}^{(2)}_s = \tilde{c}^{(1)}_s + \breve{w} \pmod{m}$ ($0 \le s \le j-1$) for all the CNFUs to start the check node updates of the second iteration. Since $\tilde{c}^{(2)}$ is obtained by shifting $\tilde{c}^{(1)}$ by $\breve{w}$, the column address differences defined in (6) remain the same in the second iteration. The minimizing reference block rows $\tilde{r}$ also remain the same, and $\tilde{c}^{(2)}$ indeed leads to $\bar{w}$ in the second iteration. Since the starting column addresses of all the VNFUs are determined by the starting row addresses of the CNFUs and the reference block rows, $\tilde{v}^{(2)}$ can also be obtained by shifting $\tilde{v}^{(1)}$ by $\breve{w}$. Thus, by simply shifting the starting row and column addresses by $\breve{w}$ at each iteration, the intra- and inter-iteration waiting times that maximize the throughput gain can always

be achieved. In summary, the optimal starting addresses for the CNFUs and VNFUs are generated as follows:

$\tilde{c}^{(1)} = \{\tilde{c}^{(1)}_0, \tilde{c}^{(1)}_1, \ldots, \tilde{c}^{(1)}_{j-1}\}$ is obtained from (8).

$\tilde{v}^{(1)} = \{\tilde{c}^{(1)}_{\tilde{r}_0} + x_{\tilde{r}_0,0} + \breve{w} \pmod{m},\ \tilde{c}^{(1)}_{\tilde{r}_1} + x_{\tilde{r}_1,1} + \breve{w} \pmod{m},\ \ldots,\ \tilde{c}^{(1)}_{\tilde{r}_{k-1}} + x_{\tilde{r}_{k-1},k-1} + \breve{w} \pmod{m}\}$.

$\tilde{c}^{(p+1)} = \{\tilde{c}^{(p)}_0 + \breve{w} \pmod{m},\ \tilde{c}^{(p)}_1 + \breve{w} \pmod{m},\ \ldots,\ \tilde{c}^{(p)}_{j-1} + \breve{w} \pmod{m}\}$ for $1 \le p \le n-1$.

$\tilde{v}^{(p+1)} = \{\tilde{v}^{(p)}_0 + \breve{w} \pmod{m},\ \tilde{v}^{(p)}_1 + \breve{w} \pmod{m},\ \ldots,\ \tilde{v}^{(p)}_{k-1} + \breve{w} \pmod{m}\}$ for $1 \le p \le n-1$.

Finally, note that both $\bar{w}$ and the optimal starting addresses for the CNFUs and VNFUs can be precomputed. Hence, no extra hardware is needed for the decoder architecture.

F. Results and Discussions

In [11], Chen et al. proposed a procedure to determine the waiting time and an OMP decoder architecture for QC LDPC codes. Our decoder architecture and that in [11] use the same hardware structure, and the primary difference between these two OMP decoder architectures is the scheduling of the CNFUs and VNFUs. In this subsection, we discuss the differences between our procedure in Section III-D and that in [11], and also use examples to illustrate the differences between the scheduling of the two decoder architectures. In our approach, the reference block row of each VNFU is chosen independently, whereas in [11] all the VNFUs use the same $r$-th block row as the reference block row. As opposed to $\bar{w}$ in (7), the waiting time obtained in [11], denoted as $\hat{w}$, is given by
$$\hat{w} = \min_{c^{(1)}} \left\{ \min_{r} \left[ \max_{s} \left( \max_{t} d_{s,t} \right) \right] \right\} + 1. \qquad (9)$$
It can be easily shown that $\hat{w}$ is the result of the minimization in (7) over $r_0 = r_1 = \cdots = r_{k-1} = r$ and all possible $c^{(1)}$, that is,
$$\hat{w} = \min_{c^{(1)}} \left\{ \min_{r_0 = r_1 = \cdots = r_{k-1} = r} \left[ \max_{s,t} D(c^{(1)}, r) \right] \right\} + 1. \qquad (10)$$
Since $\bar{w}$ is obtained over all possible $r$, whereas $\hat{w}$ is obtained over a subset of the values of $r$, we have $\bar{w} \le \hat{w}$, i.e., the waiting time obtained using our proposed procedure is no larger than that produced by the procedure in [11].
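The four generation rules above can be written directly in code. The sketch below is our own illustration (the function name is ours): given the optimal first-iteration row addresses, the reference block rows, the exponent matrix, and $\breve{w} = \bar{w} - 1$, it emits the starting addresses for all $n$ iterations by shifting by $\breve{w}$ modulo $m$.

```python
def starting_addresses(c1, r, x, w_breve, m, n):
    """Starting row/column addresses for n iterations (Section III-E).

    c1: optimal starting row addresses of the j CNFUs (first iteration);
    r:  minimizing reference block row of each of the k VNFUs;
    x:  j x k exponent matrix; w_breve = w_bar - 1.
    Iteration p+1 addresses are iteration p addresses plus w_breve mod m."""
    k = len(x[0])
    # first-iteration starting column addresses of the VNFUs
    v1 = [(c1[r[t]] + x[r[t]][t] + w_breve) % m for t in range(k)]
    c = [[(cs + p * w_breve) % m for cs in c1] for p in range(n)]
    v = [[(vt + p * w_breve) % m for vt in v1] for p in range(n)]
    return c, v
```

With the $m = 11$ example of Section III-F ($\tilde{c}^{(1)} = \{0, 6\}$, $\tilde{r} = \{0,0,1,1,1\}$, $\breve{w} = 4$), this reproduces the first- and second-iteration column addresses $\{5,7,1,5,6\}$ and $\{9,0,5,9,10\}$ quoted there.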
The following example illustrates how our scheduling scheme works and how it differs from that in [11]. Suppose $m = 11$, and $a = 3$ and $b = 10$ have multiplicative orders $k = 5$ and $j = 2$ in $\mathbb{Z}^*_{11}$ respectively. The parity check matrix corresponding to these parameters

Fig. 3. Memory states when the CNFUs complete the 1st iteration using [11].

is given by
$$H = \begin{bmatrix} I_1 & I_3 & I_9 & I_5 & I_4 \\ I_{10} & I_8 & I_2 & I_6 & I_7 \end{bmatrix}. \qquad (11)$$
A decoder for the code defined by $H$ above consists of two CNFUs, five VNFUs, and ten dual-port memory banks of $m = 11$ units each. In Figs. 3 and 4, the ten memory banks, each of which is shown as a block inside the bold black box, are arranged as a $2 \times 5$ array. CNFU $s$ communicates with the $k$ memory banks in the $s$-th block row, and VNFU $t$ communicates with the $j$ memory banks in the $t$-th block column. The memory addresses (also row addresses) of the memory units in each memory bank are $0, 1, \ldots, m-1$ from top to bottom, although these addresses are not shown in Figs. 3 and 4. The corresponding column address of each memory unit is shown inside the memory unit in Figs. 3 and 4. As mentioned above, the CNFUs update the memory units with the same row addresses in their respective block rows, while the VNFUs update the memory units with the same column addresses in their respective block columns. Therefore, in Figs. 3 and 4, the two CNFUs work on the memory units in the same row in their respective block rows, and the five VNFUs work on the memory units with the same numbers inside the memory units in their respective block columns. The procedure proposed in [11] leads to $\hat{w} = 8$ ($\hat{w} - 1 = 7$) no matter which block row is used as the

Fig. 4. Memory states (a) when the CNFUs complete the 1st iteration (b) when the VNFUs complete the 1st iteration.

reference, i.e., $r_0 = r_1 = \cdots = r_4 = r = 0$, and then $c^{(1)}_0 = 0$ and $c^{(1)}_1 = 2$. After eight clock cycles, the five VNFUs can start the variable node updates of the first iteration from the column addresses $c^{(1)}_0 + x_{0,t} + \hat{w} - 1 \pmod{m}$ ($0 \le t \le 4$), which are $\{8, 10, 5, 1, 0\}$. Thereafter, the CNFUs and VNFUs work concurrently on different memory addresses of the same memory banks. Fig. 3 shows the memory states when the two CNFUs complete the check node updates of the first iteration. As shown in Fig. 3, CNFU 1 cannot find a row in its own block row in which all the data have been updated by the VNFUs. Hence, the CNFUs cannot start the check node updates of the second iteration immediately and need to wait $w_{cc} = 5$ clock cycles. Similarly, the VNFUs also need to wait $w_{vv} = 5$ clock cycles before

they can start the variable node updates of the second iteration. The HUE in this case is 68.75% and a throughput gain of $G = 1.37$ is achieved. Throughout this paper, the total number of iterations is set to $n = 50$ when calculating the throughput gain. We apply our proposed procedure to the same code and find that when $\tilde{c}^{(1)}_0 = 0$ and $\tilde{c}^{(1)}_1 = 6$, we have $\bar{w} = 5$ ($\breve{w} = \bar{w} - 1 = 4$). As $\bar{w} \le \lfloor m/2 \rfloor$, the CNFUs and VNFUs can work continuously. The reduction in the waiting time is due to the use of the best reference block row for every VNFU. Here, VNFU 0 and VNFU 1 take the 0-th block row as the reference block row and the other three VNFUs take the 1-st block row as the reference block row, i.e., $\tilde{r} = \{0, 0, 1, 1, 1\}$. After five clock cycles, the VNFUs can start the variable node updates of the first iteration from the column addresses $\tilde{c}^{(1)}_{\tilde{r}_t} + x_{\tilde{r}_t,t} + \breve{w} \pmod{m}$ ($0 \le t \le 4$), which are $\{5, 7, 1, 5, 6\}$, respectively. Six clock cycles after the VNFUs start, the CNFUs finish the check node updates of the first iteration. The memory states at this moment are shown in Fig. 4(a). It can be seen that all the data at memory address $\tilde{c}^{(1)}_0 + \breve{w} \pmod{m}$ for CNFU 0 and the data at memory address $\tilde{c}^{(1)}_1 + \breve{w} \pmod{m}$ for CNFU 1 have already been updated by the VNFUs. The CNFUs can therefore start the check node updates of the second iteration right away. Similarly, as shown in Fig. 4(b), when the VNFUs finish the variable node updates of the first iteration, the variable node updates of the next iteration can start immediately from the column addresses $\{9, 0, 5, 9, 10\}$, which are their previous starting addresses plus $\breve{w}$ modulo $m$. The decoding process can proceed in this way until the desired number of iterations has been performed. The HUE in this case is 100% and a throughput gain of $G = 1.97$ is achieved.
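The gap between the two schedules can be checked numerically. The sketch below (our own illustration; the function name is ours) implements the common-reference minimization of (10); for the code in (11) it returns $\hat{w} = 8$, whereas the per-VNFU search of (8) yields $\bar{w} = 5$ for the same code, as described above.

```python
def common_ref_waiting_time(x, m):
    """Waiting time when every VNFU uses the same reference block row,
    i.e., the minimization in (10) with r_0 = r_1 = ... = r_{k-1} = r.

    x[s][t] is the circulant shift x_{s,t}; c_0 is fixed to 0."""
    j, k = len(x), len(x[0])
    best = m
    for idx in range(m ** (j - 1)):          # enumerate c = (0, c_1, ..., c_{j-1})
        c, rem = [0], idx
        for _ in range(j - 1):
            c.append(rem % m)
            rem //= m
        col = [[(c[s] + x[s][t]) % m for t in range(k)] for s in range(j)]
        for ref in range(j):                 # one common reference block row
            d = max((col[s][t] - col[ref][t]) % m
                    for s in range(j) for t in range(k))
            best = min(best, d)
    return best + 1
```

Running it on the exponent matrix of (11) with $m = 11$ confirms the waiting time of 8 clock cycles quoted for the schedule of [11].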
In Table I, we compare the waiting time and the throughput gain attained by our decoder architecture with those obtained using the method proposed in [11] for all the SFT codes listed in [23, Table I]. In all cases, the waiting times obtained using the procedure in [11] are greater than $\lfloor m/2 \rfloor$. Hence, the CNFUs and VNFUs cannot work continuously using the decoder architecture proposed in [11], and the maximum throughput gain and HUE of OMP decoders are not achieved in any of these cases. The waiting times generated by our procedure are all smaller than their respective counterparts generated using the procedure in [11], and thus the throughput gains as well as the HUE attained using our decoder architecture are all greater than those attained using the decoder architecture in [11]. We find that $\bar{w} \le \lfloor m/2 \rfloor$, and hence a throughput gain of approximately 2 and an HUE of 100% are achieved using our decoder architecture, in all cases except those where $m$ is marked by an asterisk. For the cases where $m$ is marked by an asterisk, the minimum waiting times obtained by our procedure are still greater than $\lfloor m/2 \rfloor$. In these cases, extra waiting times are needed between consecutive iterations of the check (variable) node updates. These results show that OMP decoders of the SFT codes do not always achieve the upper

bound of roughly 2 on the throughput gain, and motivate new design methods for QC LDPC codes that not only allow OMP decoders to achieve higher throughput gain but also have parameters and error performances similar to those of the SFT codes.

IV. COSET-BASED QC LDPC CODES

As shown in Section III-C, the upper bound for $G$ is approximately 2. However, as seen above, the SFT codes do not always achieve this upper bound. In order to achieve the upper bound, we propose a coset-based method to construct QC LDPC codes. The coset-based QC LDPC codes allow OMP decoders to achieve higher throughput gain and HUE while maintaining the same error performances as the SFT codes.

A. Coset-Based Construction

Suppose $m$ is prime; then $\mathbb{Z}^*_m = \{1, 2, \ldots, m-1\}$ is a group under multiplication modulo $m$. Let $a \in \mathbb{Z}^*_m$ be an element with multiplicative order $k$, denoted as $o(a) = k$, i.e., $k$ is the smallest positive integer such that $a^k \equiv 1 \pmod{m}$. It can be shown that $k m_c = m - 1$ for some integer $m_c$ and that $\langle a \rangle = \{1, a, a^2, \ldots, a^{k-1}\}$ forms a multiplicative subgroup of $\mathbb{Z}^*_m$. $\mathbb{Z}^*_m$ can be partitioned, with respect to $\langle a \rangle$, into $m_c$ disjoint cosets [29]: $\mathbb{Z}^*_m = \bigcup_{i=1}^{m_c} C_{u_i}$, where
$$C_{u_i} = \{u_i, u_i a, u_i a^2, \ldots, u_i a^{k-1}\}. \qquad (12)$$
Note that all the multiplications carried out in $\mathbb{Z}^*_m$ are modulo $m$, which is omitted for simplicity of notation. By convention, $u_i$ is usually called the coset leader of $C_{u_i}$. Note that any element in $C_{u_i}$ can be the leader of the coset, and switching to a different coset leader amounts to cyclically shifting the elements in the coset. Suppose we use $u'_i = u_i a^{k_1}$ as the coset leader of $C_{u_i}$, where $0 \le k_1 \le k-1$; then $C_{u'_i} = \{u_i a^{k_1}, \ldots, u_i a^{k-1}, u_i, u_i a, \ldots, u_i a^{k_1-1}\}$ can be obtained by cyclically shifting the elements in $C_{u_i}$ to the left by $k_1$ positions.
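The partition in (12) is straightforward to compute. The sketch below is our own illustration (function name ours); it takes the smallest unused element of $\mathbb{Z}^*_m$ as the leader of each coset.

```python
def coset_partition(m, a):
    """Partition Z*_m (m prime) into cosets of the subgroup <a>, as in
    (12); each coset is listed as [u, u*a, ..., u*a^(k-1)] mod m, with
    the smallest unused element of Z*_m taken as its leader."""
    subgroup, e = [1], a % m
    while e != 1:                 # powers of a until they wrap back to 1
        subgroup.append(e)
        e = (e * a) % m
    seen, parts = set(), []
    for u in range(1, m):         # smallest unused element leads its coset
        if u not in seen:
            coset = [(u * g) % m for g in subgroup]
            seen.update(coset)
            parts.append(coset)
    return parts
```

For $m = 11$ and $a = 3$ (order $k = 5$), this yields the two cosets $\{1, 3, 9, 5, 4\}$ and $\{2, 6, 7, 10, 8\}$; the second is the coset led by $b = 10$, i.e., exactly the second block row of the SFT code used as the example in Section III-F.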
Any $j$ distinct coset leaders can be used to construct a parity check matrix: first choose $j$ distinct coset leaders $\{u_{i_0}, u_{i_1}, \ldots, u_{i_{j-1}}\}$; then the corresponding $j$ cosets can be used to define a $(j, k)$ regular QC LDPC code by setting $x_{s,t}$ in (3) to $u_{i_s} a^t$. The resultant parity check matrix is given by
$$H = \begin{bmatrix} I_{u_{i_0}} & I_{u_{i_0} a} & I_{u_{i_0} a^2} & \cdots & I_{u_{i_0} a^{k-1}} \\ I_{u_{i_1}} & I_{u_{i_1} a} & I_{u_{i_1} a^2} & \cdots & I_{u_{i_1} a^{k-1}} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ I_{u_{i_{j-1}}} & I_{u_{i_{j-1}} a} & I_{u_{i_{j-1}} a^2} & \cdots & I_{u_{i_{j-1}} a^{k-1}} \end{bmatrix}. \qquad (13)$$
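Concretely, the exponent matrix $[x_{s,t}] = [u_{i_s} a^t \bmod m]$ behind (13) can be generated as below (our own sketch, with a function name of our choosing); each entry $x_{s,t}$ is then the circulant shift of the block $I_{x_{s,t}}$.

```python
def coset_exponent_matrix(m, a, leaders):
    """Exponent matrix [x_{s,t}] = [u_{i_s} * a^t mod m] of the parity
    check matrix in (13); row s corresponds to coset leader leaders[s],
    and the row length k is the multiplicative order of a in Z*_m."""
    k, e = 1, a % m               # compute k = o(a)
    while e != 1:
        k, e = k + 1, (e * a) % m
    return [[(u * pow(a, t, m)) % m for t in range(k)] for u in leaders]
```

With $m = 11$, $a = 3$, and leaders $\{1, 10\}$, this reproduces the exponent matrix of the SFT code in (11), consistent with the observation in Section IV-B that $S(m, j, k, a, b)$ is a coset-based code with $\{1, b, \ldots, b^{j-1}\}$ as the coset leaders.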

Let us denote the code defined by $H$ as $C(m, j, k, a, u_{i_0}, u_{i_1}, \ldots, u_{i_{j-1}})$. Clearly, this code has length $N = km$ and code rate $R \ge 1 - j/k$. Furthermore, we define the ensemble of the coset-based QC LDPC codes with design parameters $m$, $j$, and $k$ as $\mathcal{C}(m, j, k) \stackrel{\text{def}}{=} \{C(m, j, k, a, u_{i_0}, u_{i_1}, \ldots, u_{i_{j-1}}) : o(a) = k \text{ and } u_{i_0}, u_{i_1}, \ldots, u_{i_{j-1}} \text{ are distinct in } \mathbb{Z}^*_m\}$. Clearly, the cardinality of $\mathcal{C}(m, j, k)$, denoted as $|\mathcal{C}(m, j, k)|$, is $\varphi(k) P^j_{m-1}$, where $\varphi(\cdot)$ is the Euler totient function [30] and $P^j_{m-1}$ denotes the number of ordered selections of $j$ distinct elements out of $m-1$. Note that some of the codes in $\mathcal{C}(m, j, k)$ may be the same as or equivalent to each other.

B. Relation to Other Constructions

The coset-based construction is a generalization of both the construction for the SFT codes and that based on cyclotomic cosets. To understand the differences between these construction methods, the effects of three types of operations on the parity check matrix $H$ have to be discussed. The first two types of operations are permutations of the block rows and block columns of $H$. Block row permutations simply change the order of the parity check equations, and the code remains the same under operations of this type. Block column permutations permute the coordinates of the code and hence result in equivalent codes. Thus, neither type of operation affects the minimum waiting time $\bar{w}$. The third type of operations is cyclic shifts of one or more block rows of $H$. Notice the difference between operations of this type and block column permutations. In the coset-based construction, operations of this type essentially amount to choosing different coset leaders in the chosen cosets; they yield different codes and hence can potentially change the minimum waiting time $\bar{w}$. The construction for the SFT codes first chooses $b \in \mathbb{Z}^*_m$, an element with order $j$, and then sets $x_{s,t}$ in (3) to $b^s a^t$. Let us denote this code as $S(m, j, k, a, b)$ and define the ensemble of the SFT codes with design parameters $m$, $j$, and $k$ as $\mathcal{S}(m, j, k) \stackrel{\text{def}}{=} \{S(m, j, k, a, b) : o(a) = k \text{ and } o(b) = j \text{ in } \mathbb{Z}^*_m\}$.
Since there are $\varphi(k)$ and $\varphi(j)$ elements with orders $k$ and $j$ respectively in $\mathbb{Z}^*_m$, the cardinality of $\mathcal{S}(m, j, k)$, denoted as $|\mathcal{S}(m, j, k)|$, is $\varphi(k)\varphi(j)$. The parity check matrices of all the codes in $\mathcal{S}(m, j, k)$ are block row and block column permutations of each other. Hence these codes are the same as or equivalent to each other, and have the same minimum waiting time. Since $S(m, j, k, a, b)$ is essentially a coset-based code with $\{1, b, b^2, \ldots, b^{j-1}\}$ as the coset leaders, we have $\mathcal{S}(m, j, k) \subseteq \mathcal{C}(m, j, k)$. Thus, the coset-based construction is less restrictive and leads to a bigger class of codes. In [10], Mansour et al. proposed a construction method for QC LDPC codes that is based on cyclotomic cosets. The construction in [10] is the same as our coset-based construction described above except

for the usage of the term cyclotomic cosets in [10]. Cyclotomic cosets, to the best of our knowledge, are defined with respect to $\mathrm{GF}(a)$ [30, page 62], which requires $a$ to be a prime power. The example given in [10] is indeed based on cyclotomic cosets with respect to $\mathrm{GF}(2)$. However, the construction in [10] did not explicitly require $a$ to be a prime power. Despite the ambiguity in terminology, we show below that when cyclotomic cosets exist, the cyclotomic-coset-based and our coset-based constructions lead to the same class (up to block column permutations) of codes. Since there are $\varphi(k)$ elements of order $k$ in $\mathbb{Z}^*_m$, some of them may be prime powers. Suppose both $a$ and $a'$ are of order $k$, and $a$ is a prime power while $a'$ is not. Since $a$ is a prime power, $\mathrm{GF}(a)$ exists. Since $a$ and $a'$ are both of order $k$, $\langle a' \rangle = \{1, a', a'^2, \ldots, a'^{k-1}\}$ is simply a permuted version of $\langle a \rangle = \{1, a, a^2, \ldots, a^{k-1}\}$. With the same coset leader $u_i$, the coset $C_{u_i}$ with respect to $\langle a' \rangle$ is also a permuted version of $C_{u_i}$ with respect to $\langle a \rangle$. When the same set of coset leaders $\{u_{i_0}, u_{i_1}, \ldots, u_{i_{j-1}}\}$ is used, the parity check matrix $H$ for $C(m, j, k, a', u_{i_0}, u_{i_1}, \ldots, u_{i_{j-1}})$ is simply a block column permuted version of the $H$ for $C(m, j, k, a, u_{i_0}, u_{i_1}, \ldots, u_{i_{j-1}})$. However, it is possible that none of the $\varphi(k)$ elements with order $k$ is a prime power, in which case cyclotomic cosets cannot be defined. Thus, we focus only on the QC LDPC codes constructed using cosets.

C. Properties of Coset-Based QC LDPC Codes

For any LDPC code, its girth $g$ and minimum Hamming distance $d_H$ are two very important measures of error performance. A relatively large girth ensures that the error performance of the iterative message passing decoding algorithm is close to that of maximum likelihood decoding, while a large minimum Hamming distance is important to mitigate the error floor effects at high signal to noise ratio (SNR) in the decoding [31].
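For QC LDPC codes built from circulant permutation matrices, the absence of length-4 cycles (i.e., girth at least 6) can be checked directly on the exponent matrix using a standard condition due to Fossorier: $x_{s_1,t_1} - x_{s_1,t_2} + x_{s_2,t_2} - x_{s_2,t_1} \not\equiv 0 \pmod{m}$ for all $s_1 < s_2$ and $t_1 < t_2$. The sketch below is our own verification aid based on that condition, not this paper's proof technique.

```python
from itertools import combinations

def girth_at_least_six(x, m):
    """True iff the Tanner graph of the QC LDPC code with exponent
    matrix x and circulant size m contains no length-4 cycles, i.e.,
    x[s1][t1] - x[s1][t2] + x[s2][t2] - x[s2][t1] != 0 (mod m)
    for all block-row pairs s1 < s2 and block-column pairs t1 < t2."""
    j, k = len(x), len(x[0])
    for s1, s2 in combinations(range(j), 2):
        for t1, t2 in combinations(range(k), 2):
            if (x[s1][t1] - x[s1][t2] + x[s2][t2] - x[s2][t1]) % m == 0:
                return False
    return True
```

As a sanity check, the $(2,5)$ code of (11) passes this test, while an exponent matrix such as $[[1,2],[3,4]]$ with $m = 5$ fails it.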
It has been shown that $(j, k)$ regular SFT codes ($j \ge 2$ and $k \ge 3$) have girth $6 \le g \le 12$ [21], [23], and that their minimum Hamming distance $d_H$ is lower bounded by $j + 1$ for odd $j$ and by $j + 2$ for even $j$ [24] and upper bounded by $(j+1)!$ [25]. The following two lemmas give the girth and minimum Hamming distance bounds of the coset-based QC LDPC codes.

Lemma 3: A $(j, k)$ regular QC LDPC code constructed based on cosets has girth $g \ge 6$.

Lemma 4: The minimum Hamming distance of a $(j, k)$ regular QC LDPC code constructed based on cosets satisfies $j + 1 \le d_H \le (j+1)!$. Furthermore, all codewords have even Hamming weights, and hence $d_H \ge j + 2$ for even $j$.

See Appendices D and E for the proofs of Lemmas 3 and 4. We remark that Lemmas 3 and 4 are not affected by the three types of operations described in Section IV-B. The girth and minimum Hamming distance bounds given by Lemmas 3 and 4 for the coset-based QC LDPC codes are similar to those

Fig. 5. BER performance comparisons of $(4,9)$ regular QC LDPC codes with $R = 0.56$, for $m = 109$, $181$, and $397$ ($N = 981$, $1629$, and $3573$); BER versus SNR (dB). For the SFT codes, $(a, b) = (16, 33)$, $(39, 19)$, and $(14, 63)$ for $m = 109$, $181$, and $397$ respectively. For the codes based on cyclotomic cosets, $a = 16$, $43$, and $79$ for $m = 109$, $181$, and $397$ respectively, and the coset leaders are $\{1, 6, 8, 11\}$, $\{1, 90, 144, 166\}$, and $\{1, 20, 133, 306\}$ for $m = 109$, $181$, and $397$ respectively. For our coset-based codes, $a = 38$, $39$, and $14$ for $m = 109$, $181$, and $397$ respectively, and the coset leaders are $\{1, 2, 3, 4\}$, $\{1, 58, 97, 153\}$, and $\{1, 59, 111, 355\}$ for $m = 109$, $181$, and $397$ respectively.

for the SFT codes. For any given SFT code $S(m, j, k, a, b)$, all the codes in $\mathcal{C}(m, j, k)$ have the same design parameters $m$, $j$, and $k$, and hence similar code parameters (block length and code rate). Since $\mathcal{S}(m, j, k) \subseteq \mathcal{C}(m, j, k)$, there exists a set of codes in $\mathcal{C}(m, j, k)$ with error performances similar to the given SFT code, and we conjecture that this set contains more than one coset-based QC LDPC code besides the given SFT code. We have simulated the bit error rate (BER) performances of the SFT codes and of the QC LDPC codes constructed using cosets and cyclotomic cosets over additive white Gaussian noise (AWGN) channels with BPSK modulation, and our simulation results so far confirm this conjecture. In Fig. 5, we show the BER performance of a family of $(4, 9)$ regular QC LDPC codes constructed using the three different methods. The three groups of curves shown in Fig. 5 correspond to $m = 109$, $181$, and $397$ (block lengths $N = 981$, $1629$, and $3573$) respectively. Note that in each group the codes constructed using cyclotomic cosets and cosets use different coset leaders. A maximum of 50 iterations of message passing decoding is performed. All these codes have 3 redundant rows in their parity check

matrices and hence all have roughly the same code rate $R$ of 0.56. For the given length and rate, the BER performances of the three codes in each group are almost the same. Therefore, neither the SFT codes nor the cyclotomic-coset-based codes appear to be advantageous over the coset-based codes in terms of BER performance.

D. Results and Discussions

For any SFT code, there exists a set of coset-based QC LDPC codes (including the given code) that achieve similar BER performances, and we can replace the SFT code with the code having the smallest $\bar{w}$ in the set. This improves the throughput gain and HUE of OMP decoders while maintaining the BER performances. In the following, we use examples to show that when the SFT codes cannot achieve the throughput gain of roughly 2, it is possible to find coset-based QC LDPC codes with similar code parameters and BER performances that can achieve higher throughput gain and HUE. Let $m = 109$, $j = 4$, and $k = 9$. There are 6 elements, $\{16, 38, 27, 105, 66, 75\}$, with multiplicative order 9 in $\mathbb{Z}^*_{109}$. Since $a = 16$ has order 9, $\mathbb{Z}^*_{109} = \{1, 2, \ldots, 108\}$ can be partitioned into 12 disjoint cosets with respect to $\langle 16 \rangle = \{1, 16, 38, 63, 27, 105, 45, 66, 75\}$:

$C_1 = \{1, 16, 38, 63, 27, 105, 45, 66, 75\}$,
$C_2 = \{2, 32, 76, 17, 54, 101, 90, 23, 41\}$,
$C_3 = \{3, 48, 5, 80, 81, 97, 26, 89, 7\}$,
$C_4 = \{4, 64, 43, 34, 108, 93, 71, 46, 82\}$,
$C_6 = \{6, 96, 10, 51, 53, 85, 52, 69, 14\}$,
$C_8 = \{8, 19, 86, 68, 107, 77, 33, 92, 55\}$,
$C_9 = \{9, 35, 15, 22, 25, 73, 78, 49, 21\}$,
$C_{11} = \{11, 67, 91, 39, 79, 65, 59, 72, 62\}$,
$C_{12} = \{12, 83, 20, 102, 106, 61, 104, 29, 28\}$,
$C_{13} = \{13, 99, 58, 56, 24, 57, 40, 95, 103\}$,
$C_{18} = \{18, 70, 30, 44, 50, 37, 47, 98, 42\}$,
$C_{31} = \{31, 60, 88, 100, 74, 94, 87, 84, 36\}$.

Any $j$ distinct coset leaders can be used to construct a QC LDPC code. Suppose $b = 33$ is an element in $\mathbb{Z}^*_{109}$ with order $j = 4$, and consider the cosets with $\{1, b = 33, b^2 = 108, b^3 = 76\}$ as the coset leaders.
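The coset arithmetic in this example is easy to verify. The short self-check below is our own (with the smallest element of each coset taken as its leader); it reproduces the order-9 elements and the twelve cosets listed above.

```python
from math import gcd

m, a = 109, 16

# multiplicative order of a = 16 in Z*_109
k, e = 1, a % m
while e != 1:
    k, e = k + 1, (e * a) % m
assert k == 9

# the elements of order 9 are a^i with gcd(i, 9) = 1
order_nine = sorted(pow(a, i, m) for i in range(1, k) if gcd(i, k) == 1)

# partition Z*_109 into cosets of <16>, smallest unused element leading
subgroup = [pow(a, t, m) for t in range(k)]
seen, cosets = set(), {}
for u in range(1, m):
    if u not in seen:
        cosets[u] = [(u * g) % m for g in subgroup]
        seen.update(cosets[u])
```

This reports twelve cosets with leaders $\{1, 2, 3, 4, 6, 8, 9, 11, 12, 13, 18, 31\}$ and, for instance, $C_1 = \{1, 16, 38, 63, 27, 105, 45, 66, 75\}$, matching the list above.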
Note that $C_{33}$, $C_{108}$, and $C_{76}$ can be obtained by cyclically shifting $C_8$, $C_4$, and $C_2$ to the left by 6, 4, and 2 positions, respectively. Using $C_1$, $C_{33}$, $C_{108}$, and $C_{76}$, we get the parity check matrix
$$H_1 = \begin{bmatrix} I_1 & I_{16} & I_{38} & I_{63} & I_{27} & I_{105} & I_{45} & I_{66} & I_{75} \\ I_{33} & I_{92} & I_{55} & I_{8} & I_{19} & I_{86} & I_{68} & I_{107} & I_{77} \\ I_{108} & I_{93} & I_{71} & I_{46} & I_{82} & I_{4} & I_{64} & I_{43} & I_{34} \\ I_{76} & I_{17} & I_{54} & I_{101} & I_{90} & I_{23} & I_{41} & I_{2} & I_{32} \end{bmatrix},$$
which defines an SFT code with $a = 16$ and $b = 33$. Using the procedure proposed in Section III-D, we find that the minimum waiting time $\bar{w} = 57$ for this code. Since $\bar{w} \ge \lfloor m/2 \rfloor + 1$, extra waiting