Tradeoff Analysis and Architecture Design of High Throughput Irregular LDPC Decoders


IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, VOL. 1, NO. 1, NOVEMBER

Predrag Radosavljevic, Student Member, IEEE, Marjan Karkooti, Student Member, IEEE, Alexandre de Baynast, Member, IEEE, and Joseph R. Cavallaro, Senior Member, IEEE

Abstract: Low-density parity-check (LDPC) codes have attracted significant research interest thanks to their excellent error-correcting abilities and high level of processing parallelism. Recent architecture designs of LDPC decoders are mostly based on block-structured parity check matrices (PCMs) composed of horizontal layers, or component codes. Irregular block-structured LDPC codes have very good error-correcting performance and are suitable for modular semi-parallel architecture implementations. In order to achieve high decoding throughput (hundreds of Mbits/sec and above) for multiple code rates and moderate codeword lengths, different levels of processing parallelism are possible. The pipelining of multiple horizontal layers of the PCM is combined with different levels of memory access parallelism, leading to a family of structured decoders. We provide a general tradeoff analysis for the selection between various decoder architectures. The goal is to find an optimal balance between hardware complexity and decoding throughput while preserving error-correcting performance. As a proof of concept, non-pipelined and pipelined LDPC decoders are prototyped on an FPGA and synthesized for an ASIC design. Both architectures support a broad range of code rates and codeword sizes with small hardware overhead while achieving high decoding throughput.

Index Terms: Low-density parity-check (LDPC) codes, layered belief propagation, structured codes, throughput-area-performance tradeoff, flexible high-throughput decoder architecture, FPGA prototyping, ASIC design.

I. INTRODUCTION

Low-density parity-check (LDPC) codes optimized in [1] approach capacity to within a fraction of a dB. The resulting parity check matrices (PCMs) have a fully random, highly irregular structure and are suitable only for fully parallel decoder architectures. Fully parallel decoders can be very fast [2], but their lack of flexibility and large area are major disadvantages in current and future wireless systems that require support for adaptive code rates and codeword sizes. Architecture-oriented PCMs have been proposed in [3]. They consist of concatenated circularly shifted identity matrices and zero matrices, and allow support for structured semi-parallel architecture solutions with different levels of processing parallelism. The tradeoff between throughput and area for semi-parallel LDPC decoders that support block-structured regular PCMs has been investigated in [3], [4]. However, there still remains a challenge to construct improved PCMs that preserve the architecture modularity while approaching the performance of irregular fully random codes. In [5], block-structured irregular LDPC codes have been proposed for the IEEE 802.11n standard. The corresponding profiles are near-optimal and lead to excellent decoding performance. Based on irregular block-structured PCMs, several semi-parallel LDPC decoders have been designed. A high throughput decoder with fast convergence speed has been proposed in [6], but the flexibility of the architecture design is not considered. On the other hand, the scalable partially parallel decoder from [7] supports three code rates and is based on optimized structured regular and irregular PCMs, but achieves only moderate decoding throughput.

Manuscript received November 28. The authors are with the Department of Electrical and Computer Engineering, Rice University, Houston, TX USA (e-mail: rpredrag@rice.edu; marjan@rice.edu; debaynas@rice.edu; cavallar@rice.edu).
Therefore, it is important for multi-standard operation to design a flexible decoder that allows fast convergence and high processing parallelism with moderate area cost. In this paper, we propose the design of high-throughput semi-parallel LDPC decoder architectures for block-structured irregular PCMs that support multiple code rates and a set of moderate codeword lengths. Block-structured irregular PCMs allow high levels of processing parallelism with only marginal performance loss [8], [9] and fast decoding convergence. Processing parallelism can be achieved at different levels: parallelism per row and/or column of the PCM, pipelining between different stages of the decoding algorithm, parallelism in memory access of reliability messages, and any combination of these, giving tens of possibilities for increasing the decoding throughput. We provide a methodology to determine the best tradeoff among different parallelism levels that maximizes the ratio between decoding throughput and area cost for block-structured LDPC decoders. Although presented for particular code rates and codeword sizes, the methodology is generic enough to be applied to any range of code rates and sizes. Based on these results, as a proof of concept, we propose the implementation of flexible non-pipelined and pipelined gigabit LDPC decoders that support a wide range of code rates and codeword sizes.

The paper is organized as follows. Section II introduces the LDPC decoding algorithm, as well as the block-structured PCMs that we use in our implementations. Different levels of processing parallelism for high-throughput structured LDPC decoders are proposed in Section III, leading to several decoder solutions. In Section IV, the area cost and decoding throughput of these architectures are estimated and compared in order to find the one with the best balance between throughput, area cost, and error-correcting performance.
The design of flexible high-throughput multi-rate LDPC decoders is described in Section V. The hardware design results, both FPGA and ASIC implementations, are given in Section VI and

the paper is concluded in Section VII.

II. LOW DENSITY PARITY CHECK DECODING

An LDPC code is a linear block code specified by a very sparse parity check matrix (PCM) [10] where non-zero entries are typically placed at random. Each coded bit is represented by a variable node (one column of the PCM), whereas each parity check equation represents a check node (one row of the PCM). The variable connection degree i is the number of parity check equations in which a variable node is involved, and is equivalent to the number of ones in the corresponding column of the PCM. The check node connection degree j is the number of variable nodes involved in a particular parity check equation, and is equivalent to the number of ones in the corresponding row of the PCM. If the variable (respectively, check node) connection degrees are not constant for all columns (respectively, rows), the LDPC codes are called irregular and typically have better performance [11].

A. Block-Structured Irregular LDPC Codes

The block structure of PCMs is the key architecture-oriented constraint in the recent design of LDPC encoders and decoders [6], [12], [13], [3], [7], [14], [15]. A block-structured PCM consists of concatenated circular sub-matrices of size S × S, where each sub-matrix is either a zero matrix or a cyclically shifted identity matrix. An example of a systematic block-structured PCM is shown in Fig. 1 for a code rate of 2/3 and a codeword size of 1296. Block-structured PCMs are split into L horizontal layers, each one corresponding to a component code with a column degree of at most one. The PCMs are also divided into C vertical block-columns. It is shown in [16] that 24 block-columns provide the best error-correcting performance for a large range of code rates and codeword sizes.

B. Layered Belief Propagation

The standard belief propagation (SBP) algorithm is typically used to iteratively decode LDPC codes [10].
The belief propagation algorithm can be applied inside each horizontal layer, while the reliability messages are passed between pairs of layers, improving the decoding convergence of SBP by approximately two times [17], [18], [19], [20]. This is the layered belief propagation (LBP) algorithm.

[Fig. 1. Block-structured irregular parity-check matrix with C = 24 block-columns. The codeword size N = 1296, Rate = 2/3, L = 8 horizontal layers. Each sub-matrix is of size 54 × 54 and is either a zero matrix or a shifted identity matrix.]

In order to simplify the arithmetic operations, we use the log-likelihood ratios (LLRs) for the representation of the messages [21]. At the beginning of the first decoding iteration (k = 1), the a posteriori reliability messages are initialized with the a priori channel reliability values:

L(q_j) = 2r_j / σ²,  j = 1, ..., n,   (1)

where r_j is the received sample at time j, σ² is the noise variance, and n is the codeword size. Transmission over an AWGN channel with BPSK modulation is assumed. All check node messages R_mj are initialized with zeros. The following processing is repeated for each horizontal layer of the PCM. For all m rows inside the horizontal layer l, the variable node messages can be expressed as:

L(q_mj^(k,l)) = L(q_j^(k,l-p)) - R_mj^(k-1,l),   (2)

where L(q_j^(k,l-p)) is the a posteriori reliability message updated in the (l-p)-th layer during the k-th decoding iteration. The (l-p)-th layer is the last layer before the l-th layer with a non-zero entry at the j-th column position of the parity check matrix. R_mj^(k-1,l) is the check node message updated within the same l-th layer but in the previous decoding iteration.

For each check node m in the horizontal layer l, the messages R_mj^(k,l) corresponding to all variable node neighbors j are computed according to:

R_mj^(k,l) = ( ∏_{j'∈N(m)\{j}} sign(L(q_mj'^(k,l))) ) · Ψ( Σ_{j'∈N(m)\{j}} Ψ( |L(q_mj'^(k,l))| ) ),   (3)

where N(m)\{j} denotes the set of all variable nodes but j that are connected to the check node m inside the horizontal layer l, and Ψ(x) = -log[tanh(x/2)]. For implementation purposes, the updating of the check node messages in (3) can be efficiently replaced with the modified min-sum approximation [22], [23]. The updating of the check node messages in row m of decoding iteration k in horizontal layer l becomes:

R_mj^(k,l) = ( ∏_{j'∈N(m)\{j}} sign(L(q_mj'^(k,l))) ) · max{ min_{j'∈N(m)\{j}} |L(q_mj'^(k,l))| - β, 0 },   (4)

where β is a correcting factor. The a posteriori probabilities (APPs) at the k-th decoding iteration in the l-th layer are updated according to:

L(q_j^(k,l)) = L(q_mj^(k,l)) + R_mj^(k,l).   (5)

Hard decisions can be made after every horizontal layer based on the sign of the current APPs L(q_j), j = 1, ..., n, as in [6]. The decoding process stops if all parity-check equations are satisfied. Otherwise, the decoding algorithm repeats from (2) with k := k+1 until a maximum number of iterations is reached or a valid codeword is found.
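The per-layer update of (2), (4), and (5) can be sketched in a few lines. The following is an illustrative software model only, not the hardware implementation: the function name, the per-row message dictionary, and the dense indexing are assumptions, and degree-1 rows and zero-valued LLRs are assumed absent.

```python
import numpy as np

def layered_minsum_update(app, R, layer_rows, beta=0.15):
    """One layered offset min-sum update, following (2), (4), (5).

    app        : (n,) current a posteriori LLRs L(q_j)
    R          : dict {m: (cols, vals)} check messages per row m
    layer_rows : indices of the rows m belonging to this horizontal layer
    beta       : offset correcting factor
    """
    for m in layer_rows:
        cols, R_old = R[m]
        q = app[cols] - R_old                       # (2): variable node messages
        sign_prod = np.prod(np.sign(q))             # product of all edge signs
        absq = np.abs(q)
        # min over all edges excluding self: keep the two smallest magnitudes
        i1, i2 = np.argsort(absq)[:2]
        mins = np.where(np.arange(len(q)) == i1, absq[i2], absq[i1])
        # (4): extrinsic sign is total sign product divided by own sign
        R_new = np.sign(q) * sign_prod * np.maximum(mins - beta, 0.0)
        R[m] = (cols, R_new)
        app[cols] = q + R_new                       # (5): updated APPs
    return app, R
```

Keeping only the two smallest magnitudes per row is the standard trick that makes the "min excluding self" in (4) a single pass, which is also what makes block-serial DFU hardware cheap.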

III. DESIGN SPACE EXPLORATION OF LDPC DECODER ARCHITECTURES

Our goal is to design a block-structured LDPC decoder that achieves high decoding throughput, i.e., hundreds of Mbits/sec and above, and supports multiple codeword lengths and code rates. In order to achieve such a high throughput, all presented architecture solutions support full decoding parallelism per horizontal layer, which corresponds to simultaneous decoding of all rows in the layer. All decoders can support the novel architecture-oriented block-structured PCMs from [16] that allow a two times higher level of memory-access parallelism and have the same level of flexibility in terms of supported codeword sizes and code rates. The processing parallelism can be divided into two parts: parallelism in the decoding of component codes (horizontal layers of the PCM), and memory access parallelism (reading/writing of messages from/to memory modules). Both parts will be analyzed in detail.

A. Parallelism in Decoding of Component Codes

According to the number of layers that are simultaneously processed through the three decoding stages, the following levels of parallelism are considered:

1. No pipelining, which corresponds to sequential decoding of layers. One layer is processed at a time (labelled L1).
2. Pipelining of three layers, which corresponds to simultaneous processing of three layers through three different decoding stages (labelled L3).
3. Pipelining of six layers, which assumes that in each decoding stage two horizontal layers are simultaneously processed (labelled L6).

The original layered belief propagation algorithm with no pipelining of horizontal layers (component codes) is described in Section II-B. In order to improve the decoding throughput with a limited increase in area cost, we propose to pipeline several consecutive horizontal layers of the PCM.
No additional arithmetic logic is required and only a marginal performance loss is introduced. A similar approach is also applied in [24]. The decoding process based on the layered belief propagation algorithm is divided into three stages that can be simultaneously executed for three horizontal layers of the PCM, as visualized in Fig. 2. The throughput gain is approximately three since all three stages have similar latency.

The reading stage of the l-th layer: the most recently updated APPs from the columns of the PCM with non-zero entries in the l-th layer are read from the memory, as well as the check node messages R_mj from the l-th layer. The variable node messages L(q_mj) from the l-th layer are updated according to (2), where L(q_j^(k,l-p)) is the corresponding APP loaded from the memory.

The processing stage of the (l-1)-th layer: based on the updated variable node messages L(q_mj) from the (l-1)-th horizontal layer, all check node messages R_mj that belong to this layer are computed simultaneously according to (4), where l is replaced with l-1. Because of the block-serial processing of the PCM's sub-matrices, the processing stage starts as soon as the first block of variable node messages is available from the reading stage of layer l. Consequently, there is partial overlapping between the processing stage of layer l-1 and the reading stages of layers l-1 and l, as shown in Fig. 2.

The writing stage of the (l-2)-th layer: the APPs L(q_j) that correspond to all non-zero entries inside the (l-2)-th horizontal layer are updated and written back to the memory. It can be observed that (5), where l is replaced with l-2, cannot be directly applied since the most recently updated APP messages would not be utilized. After combining (2) and (5) for the (l-2)-th layer, the APP messages are updated according to:

L(q_j^(k,l-2)) = L(q_j^(k,l-2-p)) - R_mj^(k-1,l-2) + R_mj^(k,l-2),   (6)

where L(q_j^(k,l-2-p)) is the latest updated value of the corresponding APP which is passed to the (l-2)-th horizontal layer at the k-th decoding iteration. This value has been updated p horizontal layers before the (l-2)-th layer, p ≥ 1. During the writing stage, all newly updated check node messages from the (l-2)-th horizontal layer are also written back to the appropriate memory.

In order to facilitate the simultaneous execution of both (6) and (2), a mirror memory with identical content as the original memory is utilized. Simultaneous loading of different APP messages, L(q_j^(k,l-p)) and L(q_j^(k,l-2-p)), as well as storing of the updated messages L(q_j^(k,l-2)), occurs, which requires two read ports and one write port. On the other hand, conventional RAM modules have only two ports to access the data, which is not sufficient in this case. Furthermore, in order to avoid buffering of check messages loaded from the layers not yet updated in the k-th decoding iteration (messages R_mj^(k-1,l-2), R_mj^(k-1,l-1), and R_mj^(k-1,l)), we utilize a mirror check memory as well.

By employing the pipelining of three horizontal layers, approximately three times higher decoding throughput can be achieved since the latencies of all three stages are nearly identical. The arithmetic logic area is the same as for the non-pipelined implementation, while the memory required for the storage of APP and check messages is doubled. The control logic requires small modifications since only the timing of memory access is different: messages are loaded/stored from/to memory modules more frequently than in the non-pipelined case.

[Fig. 2. The belief propagation algorithm based on pipelining of three decoding stages (reading, processing, writing) between three consecutive horizontal layers of the parity check matrix.]
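The sequential and pipelined schedules described above can be folded into a rough first-order latency model. The sketch below is illustrative, not the paper's exact model: the function name and the example stage latencies are assumptions, and it approximates the pipelined case by charging each layer only the dominant processing stage plus a final write to drain the pipeline.

```python
def decoding_latency(L, iters, R, P, W, pipelined=False):
    """Rough decoding latency model (in clock cycles) for one codeword.

    L     : number of horizontal layers per iteration
    iters : number of decoding iterations
    R,P,W : per-layer latencies of the reading/processing/writing stages
    Sequential: every layer pays the full R + P + W.
    Pipelined : the three stages overlap across consecutive layers, so
                each layer costs roughly the dominant stage P, plus one
                final write W to drain the pipeline.
    """
    if pipelined:
        return iters * L * P + W
    return iters * L * (R + P + W)

# With similar stage latencies the throughput gain approaches three:
seq = decoding_latency(L=8, iters=10, R=12, P=17, W=12)
pipe = decoding_latency(L=8, iters=10, R=12, P=17, W=12, pipelined=True)
```

Because the reading and writing of a layer hide almost entirely behind the processing of its neighbors, the ratio seq/pipe tends toward three as the three stage latencies become equal, which matches the "throughput gain is approximately three" observation above.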

[Fig. 3. The belief propagation algorithm based on pipelining of three decoding stages for six consecutive horizontal layers of the parity check matrix.]

A certain decoding performance loss occurs in the case of simultaneous processing (pipelining) of multiple horizontal layers, due to the fact that some APP messages are loaded from the memory before being updated in some of the previously processed layers. The number of overlapping conflicts can be reduced if the APP sub-blocks are processed in a particular order which varies for different layers and different PCMs. This, however, requires more complex memory addressing and control for each individual supported PCM. On the other hand, from the implementation point of view it is much simpler to process the horizontal layers in a re-scheduled order to reduce the number of overlapping conflicts. It is observed that a simple modulo-4 schedule of layers provides sufficiently good results for all supported PCMs. Our pipelined architecture supports loading/updating of APP sub-matrices in the original PCM order (consecutive sub-matrices from left to right in the PCM). The performance loss due to the simultaneous processing of three horizontal layers is small - up to about 0.1 dB for a frame error rate (FER) of 10^-4 (see as an example Fig. 7 in Section IV-B).

The processing parallelism can be further increased by approximately two times if the pipelining of six consecutive layers is employed, as visualized in Fig. 3. Two horizontal layers are simultaneously processed in each of the three decoding stages. The performance loss is somewhat larger than when three layers are pipelined because more APP messages are now loaded from the memory before being updated. However, the decoding latency is substantially decreased. The pipelining of more than six horizontal layers is not suitable for practical implementation because of the relatively small number of layers in the supported PCMs, for example, eight horizontal layers for rate-2/3 IEEE 802.11n codes.

B. Memory Access Parallelism

The second kind of parallelism is based on the number of blocks of messages that are simultaneously read/written from/to memory modules. Each accessed block corresponds to one non-zero sub-matrix of the PCM. Two levels of memory access parallelism are considered: reading/writing of one (labelled RW1) or two blocks (labelled RW2) of reliability messages per clock cycle. The number of memory modules and read/write memory ports depends on how many horizontal layers are simultaneously processed (pipelined).

The organization of the APP memory when no pipelining is employed (L1 case) is shown in Fig. 4 for both levels of memory access parallelism. Each memory location contains the APP messages from one block-column of the PCM. The width of the valid memory word depends on the size of the sub-matrix and is proportional to the codeword length. If the largest supported codeword size is N_MAX and the PCM has C block-columns, the maximum valid width is S_K = N_MAX / C messages (sub-matrices of size S_K × S_K), where K is the number of supported codeword sizes. The memory organization composed of a single one-port RAM block shown in Fig. 4(a) allows either a read or a write of one full sub-matrix per clock cycle (RW1 memory access parallelism). Figure 4(b) shows the APP memory partitioned into two independent modules. Each location of one APP memory module contains the APP messages that correspond to one (out of C/2) odd block-column of the PCM.
Another APP memory module contains the APP messages from the even block-columns of the PCM. This memory organization allows access to a pair of blocks of APP messages every clock cycle (RW2 memory access parallelism).

[Fig. 4. Organization of the APP memory for the non-pipelined L1 case. (a) RW1: single one-port RAM module. (b) RW2: two independent one-port RAM modules with messages from the odd and the even block-columns of the PCM.]

The organization of the APP memory in the case of pipelining of three horizontal layers of the PCM (L3 case) is presented in Fig. 5: the RW1 memory access parallelism is shown in Fig. 5(a), and the RW2 memory access parallelism is shown in Fig. 5(b). As pointed out in Section III-A, original and mirror dual-port RAM blocks are now utilized. It is important to note that the content to be written in both the original and mirror modules is always identical: the updated APP message according to (6). A further extension into one original and three mirror dual-port RAM blocks is required when six horizontal layers are pipelined, because there are four simultaneous reading processes and two simultaneous writing processes (see Fig. 3). In the case of RW1 memory-access parallelism, each check node memory location contains the check node messages from the corresponding single non-zero sub-matrix. In the case of RW2 memory-access parallelism, each check node memory location contains the check node messages that correspond to two consecutive non-zero sub-matrices from the same horizontal layer. If the pipelining of horizontal layers of the PCM is employed, mirror check node memories are utilized in the same way as for the APP messages.

[Fig. 5. Organization of the APP memory in the case of pipelining of three horizontal layers (L3 case). (a) RW1: single dual-port RAM module (original and mirror). (b) RW2: two independent dual-port RAM modules (original and mirror) with messages from the odd and the even block-columns of the PCM.]

C. Decoders with Different Levels of Processing Parallelism

By combining the different aforementioned levels of parallelism in the decoding of horizontal layers and in memory access, the following decoder architectures are considered, going from the least parallel to the most parallel solution:

L1RW1: Decoder with sequential decoding of horizontal layers. The layers are decoded one at a time (L1); reading/writing of messages from one non-zero sub-matrix in each clock cycle (RW1). The original layered belief propagation (LBP) algorithm is utilized.

L1RW2: Decoder with sequential decoding of horizontal layers; reading/writing of messages from two consecutive non-zero sub-matrices in each clock cycle (RW2). The architecture-oriented constraint of equally distributed odd and even non-zero block-columns in every horizontal layer of the PCMs [16] allows the read/write of two sub-matrices in a single clock cycle.

L3RW1: Decoder with the pipelining of three consecutive horizontal layers (L3); reading/writing of messages from one non-zero sub-matrix in each clock cycle. The LBP algorithm with three-stage pipelining is utilized.

L3RW2: Decoder with the pipelining of three consecutive horizontal layers; reading/writing of messages from two consecutive non-zero sub-matrices in each clock cycle.

L6RW1: Decoder with the pipelining of six consecutive horizontal layers (L6); reading/writing of messages from one non-zero sub-matrix in each clock cycle. The LBP algorithm with six pipelined layers is utilized.

L6RW2: Decoder with the pipelining of six consecutive horizontal layers; reading/writing of messages from two consecutive non-zero sub-matrices in each clock cycle.

FULL: Fully parallel decoder with the simultaneous execution of all layers and simultaneous reading/writing of all reliability messages, as in [3]. This solution utilizes the standard belief propagation (SBP) algorithm.

IV. THROUGHPUT-AREA-PERFORMANCE TRADEOFF ANALYSIS

In this section we compare all the aforementioned decoder solutions based on the proposed estimation methodology. We determine the processing solution that offers the best balance between decoding throughput and area cost. Whereas the proposed methodology can be applied to any coding rate and codeword size, we consider solutions that are also compatible with the IEEE 802.11n standard. Each solution, except the L6RW1 and L6RW2 architectures, supports four code rates (1/2, 2/3, 3/4 and 5/6) and three codeword lengths (648, 1296 and 1944; the maximal supported sub-matrix size is S_K = 81), as well as the block-structured irregular PCMs from [16] with C = 24 block-columns. The L6RW1 and L6RW2 solutions only support code rates up to 3/4 since the PCM with 24 block-columns has only four horizontal layers for the code rate of 5/6, which is insufficient to fill the pipeline.

A. Decoding Latency and Area Size

Figure 6 shows the decoding throughput and area complexity of all analyzed decoders: the decoding parallelism and the memory access parallelism increase from left to right (from L1RW1 to FULL). The decoding throughput is based on the average number of decoding iterations required to achieve a frame-error rate of 10^-4 for the largest supported code rate and codeword length. A clock frequency of 200 MHz is assumed, which is a typical clock frequency for an ASIC implementation. The average decoding throughput depends on the average decoding latency. The decoding latency is based on the latencies of the reading, processing, and writing stages. Table I shows the latency (in clock cycles) of the three decoding stages for the different decoder solutions, where M represents the number of sub-matrices that are accessed per clock cycle. Table II gives the total latency in clock cycles, where Iter represents the average number of decoding iterations, and R, P, W represent the latencies of the reading, processing, and writing stages, respectively.

TABLE I
LATENCY (IN CLOCK CYCLES) OF DIFFERENT PIPELINE STAGES FOR DIFFERENT DECODER SOLUTIONS FROM SECTION III-C.

Stage         L_iRW_j (i=1,3,6; j=1,2)   FULL
Read (R)      W_R/M                      -
Process (P)   W_R/M + 5                  W_R + 5
Write (W)     W_R/M                      -

All analyzed LDPC decoders are composed of two main parts: 1) Arithmetic logic part that consists of permuters for shifting blocks of APP messages and decoding function

units (DFUs) for the decoding of individual rows (check equations) within horizontal layers. 2) Memory part for the storage of the APP messages, check messages, and supported PCMs.

TABLE II
TOTAL AVERAGE LATENCY (IN CLOCK CYCLES) FOR DIFFERENT DECODER SOLUTIONS FROM SECTION III-C.

Decoder          Latency [Clks]
L1RW1, L1RW2     Iter · L · (R + P + W)
L3RW1, L3RW2     Iter · L · P + W
L6RW1, L6RW2     Iter · (L/2) · P + W
FULL             Iter · (R + P + W)

The hardware cost of the arithmetic logic is estimated as the number of standard CMOS logic gates. The corresponding arithmetic area in mm² can be estimated by assuming that the size of one CMOS gate is equal to 4.5 µm². This particular gate size is obtained through the synthesis of a single two-input NAND gate using the Chartered Semiconductor 0.13 µm, 1.2 V, 6-metal-level standard cell CMOS technology [25] and the ASIC design flow from [26]. The total arithmetic area (in CMOS gates) is equal to the sum of the permuter area and the total area of all utilized DFUs:

Arithm_Area = P · M · Perm_Area + S_K · M · DFU_Area,   (7)

where P = 1 for the L1 case, P = 2 for the L3 case, P = 4 for the L6 case, M is the number of sub-matrices accessed per clock cycle, and S_K is the largest supported sub-matrix size. The area of a DFU does not depend on the range of code rates/sizes thanks to the block-serial processing, and no additional arithmetic logic is required in the case of pipelining. The area of the flexible permuter for block-shifting of K different block-sizes of APP messages up to the size S_K, where S_k = k · B (k = 1, ..., K, and B ∈ N), is given by:

Perm_Area = ⌈log_n S_K⌉ · S_K · Area_of_MUX(n:1)_b-bit + B · Σ_{k=2}^{K} Area_of_MUX(k:1)_b-bit + (S_K - 1) · Comparator_Area,   (8)

where ⌈log_n S_K⌉ is the number of initial multiplexer stages in the permuter. Every stage consists of S_K b-bit multiplexers. Furthermore, there is an additional stage that ensures the flexibility of the permuter.
It consists of S_K - 1 multiplexers divided into K - 1 banks, with k:1 multiplexers in the k-th bank, and of S_K - 1 comparators for the generation of the appropriate select signals. More details about the design of the flexible permuter can be found in Section V-C.

The memory size is represented as the number of bits required for the storage of the APP and check messages (RAM memory), as well as for the compact storage of all supported PCMs (ROM memory). The size of the check memory (CM) in bits is given by:

CM = P · (S_K · b) · Σ_{l=1}^{L} W_R(l),   (9)

where P is the number of utilized CM modules (P = 1 for the L1 case, P = 2 for the L3 case, P = 4 for the L6 case), S_K is the largest supported sub-block size, b is the bit-precision used to represent the check messages, L is the number of horizontal layers, and W_R(l) is the number of non-zero sub-matrices in layer l (l = 1, ..., L). The size of the APP memory in bits is given by:

APP = P · (S_K · b · C),   (10)

where C is the number of block-columns in the supported block-structured PCMs. The ROM memory is divided into three subparts: storage of the shift/offset values, storage of the non-zero block-column positions, and storage of the number of non-zero sub-matrices per horizontal layer. The ROM memory size required for the compact storage of the supported PCMs in bits equals:

ROM = F · 2⌈log₂ S_K⌉ · Σ_{l=1}^{L} W_R(l) + F · ⌈log₂ C⌉ · Σ_{l=1}^{L} W_R(l) + F · L · ⌈log₂ max_l W_R(l)⌉,   (11)

where F is the number of supported rate-size combinations, ⌈log₂ S_K⌉ is the number of bits required to represent the shift values of individual sub-matrices, the coefficient 2 in the first term indicates that both shift values and relative offset values are stored to simplify the permutation operations as described later in Section V-D.2, ⌈log₂ C⌉ is the number of bits required to represent the block-column positions of the non-zero sub-matrices, and ⌈log₂ max_l W_R(l)⌉ is the bit-precision required to represent the number of non-zero sub-matrices per horizontal layer.
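The memory budget of (9)-(11) is simple enough to evaluate directly. The following sketch mirrors those formulas; the function name and the example parameter values (bit-precision, layer profile, F = 1) are illustrative assumptions, not figures from the paper.

```python
from math import ceil, log2

def memory_bits(P, S_K, b, W_R, C, F):
    """Total memory footprint in bits, following (9)-(12).

    P   : number of check/APP memory modules (1 for L1, 2 for L3, 4 for L6)
    S_K : largest supported sub-matrix size
    b   : bit-precision of the stored messages
    W_R : list of the number of non-zero sub-matrices per horizontal layer
    C   : number of block-columns of the PCM
    F   : number of supported rate-size combinations
    """
    total_subs = sum(W_R)
    cm = P * (S_K * b) * total_subs                     # (9)  check memory
    app = P * (S_K * b * C)                             # (10) APP memory
    rom = (F * 2 * ceil(log2(S_K)) * total_subs         # (11) shift + offset values
           + F * ceil(log2(C)) * total_subs             #      block-column positions
           + F * len(W_R) * ceil(log2(max(W_R))))       #      sub-matrices per layer
    return cm + app + rom                               # (12) total

# Hypothetical example: S_K = 81, C = 24, eight layers with 10 non-zero
# sub-matrices each, b = 8 bits, one rate-size combination, no pipelining.
bits = memory_bits(P=1, S_K=81, b=8, W_R=[10] * 8, C=24, F=1)
```

The example illustrates the proportions the text relies on: the check memory dominates, the ROM for the compact PCM description is comparatively tiny, and doubling P for pipelining doubles only the RAM terms (9) and (10).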
The total memory size is therefore equal to the cumulative size of the check memory, APP memory, and ROM memory:

MEM_Size = CM + APP + ROM.   (12)

The total memory area in mm² is computed by assuming an SRAM cell size of 6.02 µm² (one-port memory), which is the cell density of the utilized Chartered Semiconductor 0.13 µm CMOS technology [25]. Dual-port SRAM blocks are required for the L3RW1, L3RW2, L6RW1, and L6RW2 decoder architectures since multiple horizontal layers are processed in parallel and simultaneous reading and writing from the same memory module may happen. For the Chartered 0.13 µm CMOS technology, the cell area of a dual-port memory is about 1.8 times larger than the cell area of a one-port memory. In the case of the fully parallel solution, arithmetic logic dedicated to all 12 supported rate-size combinations is required. The total decoder core area for all solutions is computed as the sum of the estimated arithmetic area and the estimated area of all memory modules.
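The latency and area estimates combine into the cost function used in the tradeoff analysis: decoding throughput divided by total core area. The helper below is an illustrative sketch of that combination; the function names and the example numbers (codeword length, latency, core area) are assumptions, not results from the paper.

```python
def throughput_mbps(codeword_bits, latency_cycles, clock_mhz=200.0):
    """Average decoding throughput in Mbits/sec for the given clock.

    One codeword of codeword_bits is delivered every latency_cycles
    clock cycles; at clock_mhz MHz, cycles per second = clock_mhz * 1e6,
    so Mbits/sec = codeword_bits * clock_mhz / latency_cycles.
    """
    return codeword_bits * clock_mhz / latency_cycles

def tradeoff_ratio(codeword_bits, latency_cycles, core_area_mm2,
                   clock_mhz=200.0):
    """Cost function of the tradeoff analysis: throughput per unit core
    area (Mbits/sec/mm^2). The best architecture maximizes this ratio."""
    return throughput_mbps(codeword_bits, latency_cycles,
                           clock_mhz) / core_area_mm2
```

For instance, a hypothetical decoder finishing a 1944-bit codeword in 1372 cycles at 200 MHz reaches about 283 Mbits/sec; dividing by its estimated core area ranks it against the other parallelism levels on a single axis.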

Fig. 6. The average decoding throughput (for a clock frequency of 200 MHz), the hardware complexity, and the tradeoff ratio for the analyzed LDPC decoder solutions. The different levels of processing parallelism are labelled using the notation defined in Section III-C.

Fig. 7. FER for the PCM with 24 block-columns, code rate of 2/3, and code size of 1296; the maximum number of iterations is 15. Analyzed decoders from Section III-C with different levels of parallelism: L1RW1, L1RW2, L3RW1, L3RW2, L6RW1, L6RW2, and FULL.

B. Tradeoff Ratio Analysis

The tradeoff between the level of processing parallelism (decoding throughput) and the hardware cost is crucial for an efficient architecture implementation. In this work, in order to determine the best tradeoff solution, we introduce the following cost function: the ratio between the decoding throughput and the total decoder core area, in units of MBits/sec/mm². The architecture with the best tradeoff has the largest ratio. The tradeoff ratio is plotted in Fig. 6 for all analyzed decoders. If pipelining of three horizontal layers is employed, the decoding throughput increases by approximately three times while the arithmetic area increases only marginally. On the other hand, the RAM memory size (in bits) is doubled, since both mirror APP and mirror check memories are then required. Overall, three-stage pipelining significantly improves the throughput vs. area ratio (see Fig. 6, L1RW1 vs. L3RW1, and L1RW2 vs. L3RW2).
If the memory access parallelism is doubled, the decoding throughput directly increases by more than 50%, while the arithmetic area doubles and the memory size remains the same. If only the memory access parallelism is increased, the throughput/area ratio is only slightly improved (see Fig. 6, L1RW1 vs. L1RW2, and L3RW1 vs. L3RW2). A further increase of the decoding parallelism (the L6RW1, L6RW2, and FULL solutions) does not improve the tradeoff ratio, since the throughput improvements are smaller than the corresponding increase in area. A similar effect occurs if the memory access parallelism is further increased: if four sub-matrices per clock cycle are accessed (not shown here), the arithmetic area would double compared to the L3RW2 solution, but the decoding throughput would improve by only about 25%, since the latency of a processing stage would not halve compared to L3RW2 (see Table I and the case when M increases from 2 to 4, where W_R/M becomes significantly smaller than 5). Based on the described estimation method, we obtain that the best throughput-per-area ratio is achieved for the three-stage pipelining approach with a memory organization that allows reading/writing of two sub-matrices per clock cycle (the L3RW2 solution). Furthermore, the L3RW2 decoder suffers only a small performance loss due to pipelining, as shown in Fig. 7: about 0.1 dB at a FER around 10⁻⁴ with 10⁶ simulated codeword transmissions. The fully parallel decoder and the decoder with six pipelined layers introduce additional performance loss compared to the original layered belief propagation algorithm with no pipelining.

V. SCALABLE STRUCTURED DECODER ARCHITECTURE

As a proof of concept, hardware implementations of the L1RW1 and L3RW1 structured and flexible LDPC decoders are proposed in this section.
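The cost function used in this analysis is straightforward to evaluate. The sketch below ranks candidate architectures by the throughput/area ratio; the throughput and area numbers are placeholders, not the measured values of Fig. 6:

```python
def tradeoff_ratio(throughput_mbps, core_area_mm2):
    """Cost function of the tradeoff analysis: decoding throughput per
    unit of total decoder core area, in MBits/sec/mm^2."""
    return throughput_mbps / core_area_mm2

# Hypothetical (throughput [MBits/sec], core area [mm^2]) pairs; the
# actual values come from the estimation method summarized in Fig. 6.
candidates = {
    "L1RW1": (250.0, 2.2),
    "L3RW1": (700.0, 3.3),
    "L3RW2": (1100.0, 4.4),
}
best = max(candidates, key=lambda name: tradeoff_ratio(*candidates[name]))
```

With these placeholder numbers the maximization selects the architecture with the largest MBits/sec/mm² ratio, mirroring how the best-tradeoff solution is chosen above.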
The L3RW1 decoder utilizes the pipelining of three horizontal layers of the PCM with memory reading/writing of one block of messages per clock cycle. It is nearly the best throughput/area tradeoff solution and is a natural extension of the L1RW1 architecture, where one horizontal layer is processed at a time. Figure 8 shows a block diagram of the proposed structured LDPC decoders. The design consists of memory blocks for the storage of reliability messages (RAM blocks) and PCMs (ROM blocks), processing blocks (DFUs), permuters for routing APP messages from the memory modules to the processing blocks, and the control unit. One or two permuters are used, depending on whether the pipelining of horizontal layers is implemented. The decoding process starts by selecting the code rate and the codeword size. The appropriate compact representation of the PCM is loaded from the Shift/Offset ROM, the Position ROM (read/write addresses of the APP memory), and the W_R ROM (the number of non-zero sub-matrices in each horizontal layer of the PCM), shown as one block in Fig. 8. The decoder

receives the initial (channel) reliability information for each coded bit and initializes the APP memory. After being loaded in a single clock cycle, the APP messages of the current block-column are routed through the permuters to the appropriate DFUs for processing. Once the processing is finished, the newly updated APP and check messages are written back to the APP and check memories. A new set of values (from the next horizontal layer) is then loaded from the APP memory modules, and the decoding process for the next layer starts. Parity checking and hard decisions based on the APP messages are performed at the end of each decoding iteration. The control unit manages the data flow between all units. Each block of the decoder architecture is discussed in more detail in the following text.

A. Multi-Rate Controller

This unit controls the flow of messages into the different processing blocks through the decoding iterations. The controller generates the enable/reset and hand-shaking signals necessary for the correct timing of the decoding operations. It also controls the counters that generate addresses for the ROM and RAM memories. In addition, the controller takes the codeword size and the code rate as inputs and reads the number of non-zero sub-matrices in each layer from the ROM memory. Based on these parameters, this unit controls the flow of data to the DFUs, permuters, and other blocks in the system. In the case of pipelining, the control unit generates the same reading/writing signals, enable/reset signals, etc., as in the non-pipelined case, only more frequently. This is the main modification compared to the non-pipelined decoder solution.

Fig. 8. High-level block diagram of the structured LDPC decoder L1RW1.
Note that the mirror APP and mirror check memories and the additional permuter (the section marked "pipelining of layers") are utilized only if the pipelining of layers is used. This section is added to the L1RW1 solution to create the L3RW1 architecture.

B. Decoding Function Unit

The banks of decoding function units (DFUs) in Fig. 8 represent the core of the decoder architecture, since the DFUs update the check and APP reliability messages. The number of parallel processing elements in these banks represents the parallelism degree of the decoder design. For decoding of different codeword lengths, the architecture should support the set of sub-block sizes given by:

S_k = k · B,  k = 1, 2, ..., K,   (13)

where S_K is the maximum supported sub-block size and B ∈ N. Hence, the DFUs are divided into K banks, each of which contains B DFUs. For the largest supported codeword size C · S_K, all K DFU banks are utilized (a total of S_K DFUs), whereas for a codeword size of C · S_k, k out of the K DFU banks are utilized (a total of S_k DFUs). Figure 9 shows a block diagram of a single DFU and its position in the decoder hierarchy. Each DFU is responsible for processing a single row of the PCM within the current horizontal layer. The DFU loads W_R check messages from the check memory and W_R APP messages from the APP memory, and then performs the decoding according to equations (2), (4), and (6). The check messages of the current row are subtracted from the APP messages according to (2). The sign and the magnitude of these results are separated and passed through different computation paths, which corresponds to (4). The min-sum unit serially determines the two smallest values (in the absolute sense) among the W_R variable node messages of the current row and saves the index of these messages relative to the first message. The correcting offset β is then subtracted from the minima according to (4). Then, the appropriate check messages are serially updated (W_R per row) according to (4).
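The row update just described can be modeled behaviorally. The following sketch implements the layered offset min-sum recursion in the order the DFU applies it (subtraction (2), then the offset min-sum check update (4), then the APP update (6)), using floating-point values instead of the b-bit fixed-point datapath:

```python
def serial_offset_min_sum(app, check_old, beta):
    """Behavioral sketch of one DFU row update (offset min-sum).

    app       -- W_R APP messages of the current row
    check_old -- W_R check messages from the previous iteration
                 (read from the mirror memory in the pipelined case)
    beta      -- correcting offset
    """
    # (2): subtract old check messages to obtain variable-to-check messages
    var = [a - c for a, c in zip(app, check_old)]

    # (4): serially find the two smallest magnitudes and the overall sign,
    # mimicking the serial min-sum unit
    min1 = min2 = float("inf")
    min1_idx = -1
    total_sign = 1
    for i, v in enumerate(var):
        mag = abs(v)
        total_sign *= 1 if v >= 0 else -1
        if mag < min1:
            min2, min1, min1_idx = min1, mag, i
        elif mag < min2:
            min2 = mag
    m1 = max(min1 - beta, 0.0)   # offset correction, clipped at zero
    m2 = max(min2 - beta, 0.0)

    # (4)/(6): update check messages, then APP messages
    check_new, app_new = [], []
    for i, v in enumerate(var):
        sign = total_sign * (1 if v >= 0 else -1)   # product of the other signs
        mag = m2 if i == min1_idx else m1           # exclude own magnitude
        c = sign * mag
        check_new.append(c)
        app_new.append(v + c)                       # (6)
    return check_new, app_new
```

This is a floating-point sketch of the message-update equations only; the serial scheduling, sign/magnitude separation, and packing of the hardware datapath in Fig. 9 are not modeled.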
The min-sum unit with serial input is implemented because the same unit can then be used for any value of W_R. Based on the old check and APP messages that are loaded from the mirror memory, as well as on the newly updated check messages, the new W_R APP messages are computed according to (6). The updated APP messages from the S_k DFUs are concatenated and stored at a single address of the APP memory module (see Fig. 9 when all K banks of DFUs are used). The block of concatenated APP messages is not permuted back before being stored in the memory: the APP messages are stored in the current shifted order, and the permutation in the next layer is based on values from the Offset ROM. Thus only one permuter per loaded block of APP messages is required.

C. Flexible Permuter

Once loaded from the memory, the APP messages are routed through the permuter to the corresponding decoding function units. The routing is implemented as a network of multiplexers organized in multiple stages, where the shift values of the PCM's sub-matrices determine the appropriate path of each APP message. In order to avoid the utilization of separate permuters for block-shifting of different sub-block sizes, we design a flexible permuter that permutes any of the K supported sub-block sizes with only small hardware overhead. The supported sub-block sizes can be represented as: S_k = k · nᵐ, k = 1, 2, ..., K, and n, m

∈ N.

Fig. 9. Block diagram of the decoding function unit (DFU) and its hierarchy within the decoder structure. The DFUs are organized into K banks (S_K = K · B), where each bank contains B DFUs. The index i represents the address of the APP/check memory location from which the message is loaded and to which the updated message is stored.

Fig. 10. Block diagram of the flexible permuter.

The permuter has log_n S_K initial stages of multiplexers. Each stage contains K · nᵐ b-bit multiplexers divided into K banks with nᵐ multiplexers in each bank, where b is the bit-precision of the APP messages. The permuter becomes flexible by adding an additional stage of S_K − 1 multiplexers for the final block-permutation. This stage is divided into K − 1 banks, with nᵐ b-bit k:1 multiplexers in the (K − k + 1)-th bank, where k = K, K − 1, ..., 2 (see Fig. 10). The appropriate select signals need to be routed to each multiplexer stage. The select signals of the first log_n S_K multiplexer stages are the corresponding digits of the PCM's shift values in modulo-n representation. The same select signal is used for all multiplexers within a single stage. There are i = 1, ..., S_K outputs of the initial log_n S_K stages that need to be additionally permuted in the final multiplexer stage.
The select signals of the final multiplexer stage are determined according to the size S_k of the block of APP messages being permuted (S_k < S_K) as well as according to the corresponding sub-matrix shift/offset value sh. For i > sh (i = 1, ..., S_k), the first input of the i-th multiplexer in the last stage is selected; for i > S_k (i = S_k + 1, ..., S_k + sh), the k-th multiplexer inputs of the first sh multiplexers in the last stage are selected. There is an exception to this rule if k = K − 1 and sh ≥ nᵐ: if i = 1, ..., sh − nᵐ + 1, then the first inputs of the i-th multiplexers in the last stage are chosen; if i > sh, i = nᵐ + 1, ..., S_K, then the (k+1)-th multiplexer inputs are chosen in the k-th bank of the remaining multiplexers in the last stage. S_K − 1 comparators are required to generate the select signals of the last multiplexer stage. If k < K, the last (K − k) · nᵐ outputs are discarded. If k = K (S_k = S_K), the block-shifting is already completed in the initial log_n S_K multiplexer stages, and no additional permutation is needed in the last stage. In order to be compatible with the IEEE 802.11n standard, the supported sub-block sizes are 27, 54, and 81, which correspond to codeword sizes of 648, 1296, and 1944 bits [5]. In this particular case K = 3, S_K = 81, n = 3, and m = 3. Therefore, the permuter has four multiplexer stages with 81 3:1 multiplexers in each stage, and an additional stage with 27 3:1 multiplexers, 27 2:1 multiplexers, and 54 comparators. Note that up to two flexible permuters are utilized in Fig. 8. In the case of simultaneous execution of three horizontal layers, two different blocks of APP messages are loaded at the same time from the original and mirror APP memories, and they need to be permuted independently. Therefore, two separate permuters that operate in parallel are required. The proposed flexible permuter structure can be compared with the reconfigurable interconnect network based on the Benes network structure from [27].
The permutation network in [27] supports cyclic shifting for sub-matrices of arbitrary size. However, support for arbitrary sub-matrix sizes is not a critical requirement for emerging wireless standards such as IEEE 802.11n and similar standards: it is sufficient to support a pre-determined set of sub-matrix sizes. Furthermore, the area cost of our flexible permuter is smaller despite the larger sub-matrix size: 30.5K gates (b = 8, S_K = 81, after ASIC synthesis) vs. 34.2K gates (b = 8, S_K = 64) in [27].
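Behaviorally, each of the initial permuter stages applies a cyclic shift determined by one base-n digit of the PCM shift value (that digit is the stage's select signal). The sketch below models this staged decomposition; the function name and loop structure are illustrative, and the final flexible stage for S_k < S_K is not modeled:

```python
def staged_cyclic_shift(block, shift, n=3):
    """Behavioral model of the staged permuter for a block of n**m
    messages: each multiplexer stage rotates the block by
    digit * n**stage, where digit is one base-n digit of the shift
    value (the stage's select signal)."""
    size = len(block)
    step = 1
    while step < size:
        digit = (shift // step) % n     # select signal of this stage
        offset = (digit * step) % size
        block = block[offset:] + block[:offset]
        step *= n
    return block

# IEEE 802.11n-like case: sub-block of 27 messages (n = 3, three stages).
out = staged_cyclic_shift(list(range(27)), 5)
```

The composition of the per-stage rotations equals a single cyclic shift by `shift mod size`, which is why per-digit select signals, one per stage, suffice in the hardware.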

D. Memory Organization

The proposed decoder design utilizes several ROM and RAM blocks. The RAM banks are used for the storage of check and APP messages, while several ROM blocks are utilized for the compact storage of all supported parity check matrices.

1) Check and APP Memory: In order to increase the memory throughput and take advantage of the block-structured PCMs, the compact storage of the APP and check messages is used as described in Section III-C. While the L1RW1 solution uses single-port RAM blocks, the pipelined L3RW1 architecture utilizes dual-port RAMs to simultaneously read and write APP and check messages from different horizontal layers. Furthermore, to avoid the buffering of a large number of check messages, mirror check memory modules are utilized. In order to reduce the memory word-lengths, both the APP and check memories are divided into K modules with identical addressing.

2) Storage of Multiple PCMs: The proposed flexible decoder architecture supports K LDPC codeword lengths and R code rates, which represents F = K · R different rate/size combinations: therefore, F PCMs are stored in ROM memory. The shift value and the position of each non-zero sub-matrix are used for the compact representation of the supported PCMs. In order to avoid the utilization of an additional permuter, the offset values are also saved in the memory. These are shift values relative to the shift value of the same block-column in the previous layer (see Fig. 1). If the shift value of block-column c is sh_l for horizontal layer l and sh_{l+1} for horizontal layer l+1, the stored offset value o_{l+1} is equal to: o_{l+1} = sh_{l+1} − sh_l. The permuter uses the value of o_{l+1} for block-shifting instead of the original shift values. If at the end of a decoding iteration all parity check equations are satisfied, each block-column has to be shifted back to its original order by the corresponding offset value.
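The offset precomputation can be sketched as follows. The modulo reduction and the handling of the first layer are our assumptions, since the text only defines the difference o_{l+1} = sh_{l+1} − sh_l:

```python
def offsets_from_shifts(shifts, s_k):
    """Offset values stored in the Offset ROM for one block-column:
    o_{l+1} = sh_{l+1} - sh_l.  Taking the difference modulo the
    sub-block size s_k (our assumption) keeps each offset a valid
    cyclic shift; the first entry is the layer-1 shift applied to the
    block-column in its natural order (also an assumption)."""
    offsets = [shifts[0] % s_k]
    for prev, cur in zip(shifts, shifts[1:]):
        offsets.append((cur - prev) % s_k)
    return offsets

# Hypothetical shift values of one block-column over four layers, s_k = 27.
offs = offsets_from_shifts([4, 10, 3, 20], 27)
```

Applying the offsets cumulatively reproduces the absolute shift of the last layer (4 + 6 + 20 + 17 ≡ 20 mod 27 in this example), which is exactly the accumulated shift that must be undone at the end of decoding to restore the natural order.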
This also requires additional routing of the permuter outputs to the hard-decision unit in order to produce the decoded bits in the correct order.

VI. HARDWARE IMPLEMENTATION OF LDPC DECODER

A. FPGA Prototype

Prototype architectures for the L1RW1 and L3RW1 cases (non-pipelined and three-stage pipelined versions of the layered belief propagation algorithm) have been implemented in Xilinx System Generator and targeted to a moderate-size Xilinx Virtex4-fx60 FPGA. A 7-bit (b = 7) precision is used in both cases. Both decoders support three codeword sizes (648, 1296, and 1944), four code rates (1/2, 2/3, 3/4, and 5/6), and the block-structured PCMs from [5] with 24 block-columns: K = 3, n = 3, m = 3, C = 24. Table III shows the FPGA utilization statistics for both architectures. Since there is no ROM in an FPGA, block RAMs are also used for the storage of the supported PCMs. Based on the Xilinx XST post place-and-route report, a maximum clock frequency of 160 MHz is achieved for both designs.

TABLE III
DESIGN STATISTICS FOR THE FLEXIBLE LDPC DECODER ON VIRTEX4-XC4VFX60 FPGA.

Resource                 | L1RW1  | L3RW1
Slices                   | 11,328 | 13,665
FFs                      | 12,368 | 13,617
LUTs                     | 17,104 | 21,667
Block RAMs               |        |
Utilization (Slices)     | 45 %   | 54 %
Utilization (Block RAMs) | 37.5 % | 53 %

B. ASIC Implementation

The proposed decoder architectures are also synthesized for a Chartered Semiconductor 0.13 µm, 1.2 V CMOS technology [25] using the BEE/Insecta design flow [26] and Synopsys tools. The Chartered memory compiler was used to generate efficient RAM and ROM blocks. We synthesized both the pipelined and non-pipelined decoders at a maximum operational clock frequency of 412 MHz. The total core area of the L1RW1 decoder is 2.20 mm², and it increases to 3.33 mm² for the L3RW1 decoder due to the additional permuter and the mirror RAM blocks for the storage of APP and check messages. The dynamic power dissipation (at a clock frequency of 400 MHz) is 1.11 W and 1.82 W for the L1RW1 and L3RW1 decoders, respectively.

C.
Error-Rate Performance and Decoding Throughput

Figure 11 shows the frame error rate performance of the non-pipelined L1RW1 and pipelined L3RW1 decoders. A code rate of 2/3 and a codeword length of 1296 bits are assumed, as well as 7-bit fixed-point arithmetic precision. A performance loss of up to 0.1 dB compared to the floating-point version is observed.

Fig. 11. The FER for the implemented L1RW1 and L3RW1 decoders: code rate of 2/3, code length of 1296; the maximum number of iterations is set to 15.

Considering only the information bits, the effective decoding throughput has been computed based on the average number of decoding iterations required to achieve a frame error rate of 10⁻⁴. The clock frequency is set to 160 MHz


More information

HDL Implementation of an Efficient Partial Parallel LDPC Decoder Using Soft Bit Flip Algorithm

HDL Implementation of an Efficient Partial Parallel LDPC Decoder Using Soft Bit Flip Algorithm I J C T A, 9(20), 2016, pp. 75-80 International Science Press HDL Implementation of an Efficient Partial Parallel LDPC Decoder Using Soft Bit Flip Algorithm Sandeep Kakde* and Atish Khobragade** ABSTRACT

More information

INTRODUCTION TO FPGA ARCHITECTURE

INTRODUCTION TO FPGA ARCHITECTURE 3/3/25 INTRODUCTION TO FPGA ARCHITECTURE DIGITAL LOGIC DESIGN (BASIC TECHNIQUES) a b a y 2input Black Box y b Functional Schematic a b y a b y a b y 2 Truth Table (AND) Truth Table (OR) Truth Table (XOR)

More information

4DM4 Lab. #1 A: Introduction to VHDL and FPGAs B: An Unbuffered Crossbar Switch (posted Thursday, Sept 19, 2013)

4DM4 Lab. #1 A: Introduction to VHDL and FPGAs B: An Unbuffered Crossbar Switch (posted Thursday, Sept 19, 2013) 1 4DM4 Lab. #1 A: Introduction to VHDL and FPGAs B: An Unbuffered Crossbar Switch (posted Thursday, Sept 19, 2013) Lab #1: ITB Room 157, Thurs. and Fridays, 2:30-5:20, EOW Demos to TA: Thurs, Fri, Sept.

More information

THE DESIGN OF STRUCTURED REGULAR LDPC CODES WITH LARGE GIRTH. Haotian Zhang and José M. F. Moura

THE DESIGN OF STRUCTURED REGULAR LDPC CODES WITH LARGE GIRTH. Haotian Zhang and José M. F. Moura THE DESIGN OF STRUCTURED REGULAR LDPC CODES WITH LARGE GIRTH Haotian Zhang and José M. F. Moura Department of Electrical and Computer Engineering Carnegie Mellon University, Pittsburgh, PA 523 {haotian,

More information

Design of Convolution Encoder and Reconfigurable Viterbi Decoder

Design of Convolution Encoder and Reconfigurable Viterbi Decoder RESEARCH INVENTY: International Journal of Engineering and Science ISSN: 2278-4721, Vol. 1, Issue 3 (Sept 2012), PP 15-21 www.researchinventy.com Design of Convolution Encoder and Reconfigurable Viterbi

More information

Design and Implementation of 3-D DWT for Video Processing Applications

Design and Implementation of 3-D DWT for Video Processing Applications Design and Implementation of 3-D DWT for Video Processing Applications P. Mohaniah 1, P. Sathyanarayana 2, A. S. Ram Kumar Reddy 3 & A. Vijayalakshmi 4 1 E.C.E, N.B.K.R.IST, Vidyanagar, 2 E.C.E, S.V University

More information

CPE/EE 422/522. Introduction to Xilinx Virtex Field-Programmable Gate Arrays Devices. Dr. Rhonda Kay Gaede UAH. Outline

CPE/EE 422/522. Introduction to Xilinx Virtex Field-Programmable Gate Arrays Devices. Dr. Rhonda Kay Gaede UAH. Outline CPE/EE 422/522 Introduction to Xilinx Virtex Field-Programmable Gate Arrays Devices Dr. Rhonda Kay Gaede UAH Outline Introduction Field-Programmable Gate Arrays Virtex Virtex-E, Virtex-II, and Virtex-II

More information

Scalable and Dynamically Updatable Lookup Engine for Decision-trees on FPGA

Scalable and Dynamically Updatable Lookup Engine for Decision-trees on FPGA Scalable and Dynamically Updatable Lookup Engine for Decision-trees on FPGA Yun R. Qu, Viktor K. Prasanna Ming Hsieh Dept. of Electrical Engineering University of Southern California Los Angeles, CA 90089

More information

Topics. Midterm Finish Chapter 7

Topics. Midterm Finish Chapter 7 Lecture 9 Topics Midterm Finish Chapter 7 ROM (review) Memory device in which permanent binary information is stored. Example: 32 x 8 ROM Five input lines (2 5 = 32) 32 outputs, each representing a memory

More information

EE414 Embedded Systems Ch 5. Memory Part 2/2

EE414 Embedded Systems Ch 5. Memory Part 2/2 EE414 Embedded Systems Ch 5. Memory Part 2/2 Byung Kook Kim School of Electrical Engineering Korea Advanced Institute of Science and Technology Overview 6.1 introduction 6.2 Memory Write Ability and Storage

More information

FPGA Implementation of Binary Quasi Cyclic LDPC Code with Rate 2/5

FPGA Implementation of Binary Quasi Cyclic LDPC Code with Rate 2/5 FPGA Implementation of Binary Quasi Cyclic LDPC Code with Rate 2/5 Arulmozhi M. 1, Nandini G. Iyer 2, Anitha M. 3 Assistant Professor, Department of EEE, Rajalakshmi Engineering College, Chennai, India

More information

Overlapped Scheduling for Folded LDPC Decoding Based on Matrix Permutation

Overlapped Scheduling for Folded LDPC Decoding Based on Matrix Permutation Overlapped Scheduling for Folded LDPC Decoding Based on Matrix Permutation In-Cheol Park and Se-Hyeon Kang Department of Electrical Engineering and Computer Science, KAIST {icpark, shkang}@ics.kaist.ac.kr

More information

C LDPC Coding Proposal for LBC. This contribution provides an LDPC coding proposal for LBC

C LDPC Coding Proposal for LBC. This contribution provides an LDPC coding proposal for LBC C3-27315-3 Title: Abstract: Source: Contact: LDPC Coding Proposal for LBC This contribution provides an LDPC coding proposal for LBC Alcatel-Lucent, Huawei, LG Electronics, QUALCOMM Incorporated, RITT,

More information

Chapter 6 (Lect 3) Counters Continued. Unused States Ring counter. Implementing with Registers Implementing with Counter and Decoder

Chapter 6 (Lect 3) Counters Continued. Unused States Ring counter. Implementing with Registers Implementing with Counter and Decoder Chapter 6 (Lect 3) Counters Continued Unused States Ring counter Implementing with Registers Implementing with Counter and Decoder Sequential Logic and Unused States Not all states need to be used Can

More information

BER Evaluation of LDPC Decoder with BPSK Scheme in AWGN Fading Channel

BER Evaluation of LDPC Decoder with BPSK Scheme in AWGN Fading Channel I J C T A, 9(40), 2016, pp. 397-404 International Science Press ISSN: 0974-5572 BER Evaluation of LDPC Decoder with BPSK Scheme in AWGN Fading Channel Neha Mahankal*, Sandeep Kakde* and Atish Khobragade**

More information

Performance Analysis of CORDIC Architectures Targeted by FPGA Devices

Performance Analysis of CORDIC Architectures Targeted by FPGA Devices International OPEN ACCESS Journal Of Modern Engineering Research (IJMER) Performance Analysis of CORDIC Architectures Targeted by FPGA Devices Guddeti Nagarjuna Reddy 1, R.Jayalakshmi 2, Dr.K.Umapathy

More information

Field Programmable Gate Array (FPGA)

Field Programmable Gate Array (FPGA) Field Programmable Gate Array (FPGA) Lecturer: Krébesz, Tamas 1 FPGA in general Reprogrammable Si chip Invented in 1985 by Ross Freeman (Xilinx inc.) Combines the advantages of ASIC and uc-based systems

More information

LOW-POWER IMPLEMENTATION OF A HIGH-THROUGHPUT LDPC DECODER FOR IEEE N STANDARD. Naresh R. Shanbhag

LOW-POWER IMPLEMENTATION OF A HIGH-THROUGHPUT LDPC DECODER FOR IEEE N STANDARD. Naresh R. Shanbhag LOW-POWER IMPLEMENTATION OF A HIGH-THROUGHPUT LDPC DECODER FOR IEEE 802.11N STANDARD Junho Cho Department of Electrical Engineering, Seoul National University, Seoul, 151-744, Korea Naresh R. Shanbhag

More information

! Program logic functions, interconnect using SRAM. ! Advantages: ! Re-programmable; ! dynamically reconfigurable; ! uses standard processes.

! Program logic functions, interconnect using SRAM. ! Advantages: ! Re-programmable; ! dynamically reconfigurable; ! uses standard processes. Topics! SRAM-based FPGA fabrics:! Xilinx.! Altera. SRAM-based FPGAs! Program logic functions, using SRAM.! Advantages:! Re-programmable;! dynamically reconfigurable;! uses standard processes.! isadvantages:!

More information

FPGA Architecture Overview. Generic FPGA Architecture (1) FPGA Architecture

FPGA Architecture Overview. Generic FPGA Architecture (1) FPGA Architecture FPGA Architecture Overview dr chris dick dsp chief architect wireless and signal processing group xilinx inc. Generic FPGA Architecture () Generic FPGA architecture consists of an array of logic tiles

More information

A Generic Architecture of CCSDS Low Density Parity Check Decoder for Near-Earth Applications

A Generic Architecture of CCSDS Low Density Parity Check Decoder for Near-Earth Applications A Generic Architecture of CCSDS Low Density Parity Check Decoder for Near-Earth Applications Fabien Demangel, Nicolas Fau, Nicolas Drabik, François Charot, Christophe Wolinski To cite this version: Fabien

More information

Lowering the Error Floors of Irregular High-Rate LDPC Codes by Graph Conditioning

Lowering the Error Floors of Irregular High-Rate LDPC Codes by Graph Conditioning Lowering the Error Floors of Irregular High- LDPC Codes by Graph Conditioning Wen-Yen Weng, Aditya Ramamoorthy and Richard D. Wesel Electrical Engineering Department, UCLA, Los Angeles, CA, 90095-594.

More information

FPGA. Logic Block. Plessey FPGA: basic building block here is 2-input NAND gate which is connected to each other to implement desired function.

FPGA. Logic Block. Plessey FPGA: basic building block here is 2-input NAND gate which is connected to each other to implement desired function. FPGA Logic block of an FPGA can be configured in such a way that it can provide functionality as simple as that of transistor or as complex as that of a microprocessor. It can used to implement different

More information

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS /$ IEEE

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS /$ IEEE IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS 1 Exploration of Heterogeneous FPGAs for Mapping Linear Projection Designs Christos-S. Bouganis, Member, IEEE, Iosifina Pournara, and Peter

More information

ISSCC 2003 / SESSION 8 / COMMUNICATIONS SIGNAL PROCESSING / PAPER 8.7

ISSCC 2003 / SESSION 8 / COMMUNICATIONS SIGNAL PROCESSING / PAPER 8.7 ISSCC 2003 / SESSION 8 / COMMUNICATIONS SIGNAL PROCESSING / PAPER 8.7 8.7 A Programmable Turbo Decoder for Multiple 3G Wireless Standards Myoung-Cheol Shin, In-Cheol Park KAIST, Daejeon, Republic of Korea

More information

TURBO codes, [1], [2], have attracted much interest due

TURBO codes, [1], [2], have attracted much interest due 800 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 47, NO. 2, FEBRUARY 2001 Zigzag Codes and Concatenated Zigzag Codes Li Ping, Member, IEEE, Xiaoling Huang, and Nam Phamdo, Senior Member, IEEE Abstract

More information

Partially-Parallel LDPC Decoder Achieving High-Efficiency Message-Passing Schedule

Partially-Parallel LDPC Decoder Achieving High-Efficiency Message-Passing Schedule IEICE TRANS. FUNDAMENTALS, VOL.E89 A, NO.4 APRIL 2006 969 PAPER Special Section on Selected Papers from the 18th Workshop on Circuits and Systems in Karuizawa Partially-Parallel LDPC Decoder Achieving

More information

Low Power and Memory Efficient FFT Architecture Using Modified CORDIC Algorithm

Low Power and Memory Efficient FFT Architecture Using Modified CORDIC Algorithm Low Power and Memory Efficient FFT Architecture Using Modified CORDIC Algorithm 1 A.Malashri, 2 C.Paramasivam 1 PG Student, Department of Electronics and Communication K S Rangasamy College Of Technology,

More information

AN FPGA BASED OVERLAPPED QUASI CYCLIC LDPC DECODER FOR WI-MAX

AN FPGA BASED OVERLAPPED QUASI CYCLIC LDPC DECODER FOR WI-MAX 2 th May 24. Vol. 63 No.2 25-24 JATIT & LLS. All rights reserved. ISSN: 992-8645 www.jatit.org E-ISSN: 87-395 AN FPGA BASED OVERLAPPED QUASI CYCLIC LDPC DECODER FOR WI-MAX G.AMIRTHA GOWRI, 2 S.SUBHA RANI

More information

A NOVEL HARDWARE-FRIENDLY SELF-ADJUSTABLE OFFSET MIN-SUM ALGORITHM FOR ISDB-S2 LDPC DECODER

A NOVEL HARDWARE-FRIENDLY SELF-ADJUSTABLE OFFSET MIN-SUM ALGORITHM FOR ISDB-S2 LDPC DECODER 18th European Signal Processing Conference (EUSIPCO-010) Aalborg, Denmark, August -7, 010 A NOVEL HARDWARE-FRIENDLY SELF-ADJUSTABLE OFFSET MIN-SUM ALGORITHM FOR ISDB-S LDPC DECODER Wen Ji, Makoto Hamaminato,

More information

COSC 6385 Computer Architecture - Memory Hierarchies (II)

COSC 6385 Computer Architecture - Memory Hierarchies (II) COSC 6385 Computer Architecture - Memory Hierarchies (II) Edgar Gabriel Spring 2018 Types of cache misses Compulsory Misses: first access to a block cannot be in the cache (cold start misses) Capacity

More information

HANSABA COLLEGE OF ENGINEERING & TECHNOLOGY (098) SUBJECT: DIGITAL ELECTRONICS ( ) Assignment

HANSABA COLLEGE OF ENGINEERING & TECHNOLOGY (098) SUBJECT: DIGITAL ELECTRONICS ( ) Assignment Assignment 1. What is multiplexer? With logic circuit and function table explain the working of 4 to 1 line multiplexer. 2. Implement following Boolean function using 8: 1 multiplexer. F(A,B,C,D) = (2,3,5,7,8,9,12,13,14,15)

More information

LLR-based Successive-Cancellation List Decoder for Polar Codes with Multi-bit Decision

LLR-based Successive-Cancellation List Decoder for Polar Codes with Multi-bit Decision > REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLIC HERE TO EDIT < LLR-based Successive-Cancellation List Decoder for Polar Codes with Multi-bit Decision Bo Yuan and eshab. Parhi, Fellow,

More information

Efficient VLSI Huffman encoder implementation and its application in high rate serial data encoding

Efficient VLSI Huffman encoder implementation and its application in high rate serial data encoding LETTER IEICE Electronics Express, Vol.14, No.21, 1 11 Efficient VLSI Huffman encoder implementation and its application in high rate serial data encoding Rongshan Wei a) and Xingang Zhang College of Physics

More information

It is well understood that the minimum number of check bits required for single bit error correction is specified by the relationship: D + P P

It is well understood that the minimum number of check bits required for single bit error correction is specified by the relationship: D + P P October 2012 Reference Design RD1025 Introduction This reference design implements an Error Correction Code (ECC) module for the LatticeEC and LatticeSC FPGA families that can be applied to increase memory

More information

A SCALABLE COMPUTING AND MEMORY ARCHITECTURE FOR VARIABLE BLOCK SIZE MOTION ESTIMATION ON FIELD-PROGRAMMABLE GATE ARRAYS. Theepan Moorthy and Andy Ye

A SCALABLE COMPUTING AND MEMORY ARCHITECTURE FOR VARIABLE BLOCK SIZE MOTION ESTIMATION ON FIELD-PROGRAMMABLE GATE ARRAYS. Theepan Moorthy and Andy Ye A SCALABLE COMPUTING AND MEMORY ARCHITECTURE FOR VARIABLE BLOCK SIZE MOTION ESTIMATION ON FIELD-PROGRAMMABLE GATE ARRAYS Theepan Moorthy and Andy Ye Department of Electrical and Computer Engineering Ryerson

More information

Embedded Systems Design: A Unified Hardware/Software Introduction. Outline. Chapter 5 Memory. Introduction. Memory: basic concepts

Embedded Systems Design: A Unified Hardware/Software Introduction. Outline. Chapter 5 Memory. Introduction. Memory: basic concepts Hardware/Software Introduction Chapter 5 Memory Outline Memory Write Ability and Storage Permanence Common Memory Types Composing Memory Memory Hierarchy and Cache Advanced RAM 1 2 Introduction Memory:

More information

Embedded Systems Design: A Unified Hardware/Software Introduction. Chapter 5 Memory. Outline. Introduction

Embedded Systems Design: A Unified Hardware/Software Introduction. Chapter 5 Memory. Outline. Introduction Hardware/Software Introduction Chapter 5 Memory 1 Outline Memory Write Ability and Storage Permanence Common Memory Types Composing Memory Memory Hierarchy and Cache Advanced RAM 2 Introduction Embedded

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified

More information

Address connections Data connections Selection connections

Address connections Data connections Selection connections Interface (cont..) We have four common types of memory: Read only memory ( ROM ) Flash memory ( EEPROM ) Static Random access memory ( SARAM ) Dynamic Random access memory ( DRAM ). Pin connections common

More information

Virtex-II Architecture. Virtex II technical, Design Solutions. Active Interconnect Technology (continued)

Virtex-II Architecture. Virtex II technical, Design Solutions. Active Interconnect Technology (continued) Virtex-II Architecture SONET / SDH Virtex II technical, Design Solutions PCI-X PCI DCM Distri RAM 18Kb BRAM Multiplier LVDS FIFO Shift Registers BLVDS SDRAM QDR SRAM Backplane Rev 4 March 4th. 2002 J-L

More information

Evolution of Implementation Technologies. ECE 4211/5211 Rapid Prototyping with FPGAs. Gate Array Technology (IBM s) Programmable Logic

Evolution of Implementation Technologies. ECE 4211/5211 Rapid Prototyping with FPGAs. Gate Array Technology (IBM s) Programmable Logic ECE 42/52 Rapid Prototyping with FPGAs Dr. Charlie Wang Department of Electrical and Computer Engineering University of Colorado at Colorado Springs Evolution of Implementation Technologies Discrete devices:

More information

Efficient Majority Logic Fault Detector/Corrector Using Euclidean Geometry Low Density Parity Check (EG-LDPC) Codes

Efficient Majority Logic Fault Detector/Corrector Using Euclidean Geometry Low Density Parity Check (EG-LDPC) Codes Efficient Majority Logic Fault Detector/Corrector Using Euclidean Geometry Low Density Parity Check (EG-LDPC) Codes 1 U.Rahila Begum, 2 V. Padmajothi 1 PG Student, 2 Assistant Professor 1 Department Of

More information

Design of Flash Controller for Single Level Cell NAND Flash Memory

Design of Flash Controller for Single Level Cell NAND Flash Memory Design of Flash Controller for Single Level Cell NAND Flash Memory Ashwin Bijoor 1, Sudharshana 2 P.G Student, Department of Electronics and Communication, NMAMIT, Nitte, Karnataka, India 1 Assistant Professor,

More information

LOW-density parity-check (LDPC) codes have attracted

LOW-density parity-check (LDPC) codes have attracted 2966 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 50, NO. 12, DECEMBER 2004 LDPC Block and Convolutional Codes Based on Circulant Matrices R. Michael Tanner, Fellow, IEEE, Deepak Sridhara, Arvind Sridharan,

More information

Digital Design with FPGAs. By Neeraj Kulkarni

Digital Design with FPGAs. By Neeraj Kulkarni Digital Design with FPGAs By Neeraj Kulkarni Some Basic Electronics Basic Elements: Gates: And, Or, Nor, Nand, Xor.. Memory elements: Flip Flops, Registers.. Techniques to design a circuit using basic

More information

Hardware Design Environments. Dr. Mahdi Abbasi Computer Engineering Department Bu-Ali Sina University

Hardware Design Environments. Dr. Mahdi Abbasi Computer Engineering Department Bu-Ali Sina University Hardware Design Environments Dr. Mahdi Abbasi Computer Engineering Department Bu-Ali Sina University Outline Welcome to COE 405 Digital System Design Design Domains and Levels of Abstractions Synthesis

More information

Optimized ARM-Based Implementation of Low Density Parity Check Code (LDPC) Decoder in China Digital Radio (CDR)

Optimized ARM-Based Implementation of Low Density Parity Check Code (LDPC) Decoder in China Digital Radio (CDR) Optimized ARM-Based Implementation of Low Density Parity Check Code (LDPC) Decoder in China Digital Radio (CDR) P. Vincy Priscilla 1, R. Padmavathi 2, S. Tamilselvan 3, Dr.S. Kanthamani 4 1,4 Department

More information

International Journal of Scientific & Engineering Research, Volume 4, Issue 5, May-2013 ISSN

International Journal of Scientific & Engineering Research, Volume 4, Issue 5, May-2013 ISSN 255 CORRECTIONS TO FAULT SECURE OF MAJORITY LOGIC DECODER AND DETECTOR FOR MEMORY APPLICATIONS Viji.D PG Scholar Embedded Systems Prist University, Thanjuvr - India Mr.T.Sathees Kumar AP/ECE Prist University,

More information

COE 561 Digital System Design & Synthesis Introduction

COE 561 Digital System Design & Synthesis Introduction 1 COE 561 Digital System Design & Synthesis Introduction Dr. Aiman H. El-Maleh Computer Engineering Department King Fahd University of Petroleum & Minerals Outline Course Topics Microelectronics Design

More information

Fault Grading FPGA Interconnect Test Configurations

Fault Grading FPGA Interconnect Test Configurations * Fault Grading FPGA Interconnect Test Configurations Mehdi Baradaran Tahoori Subhasish Mitra* Shahin Toutounchi Edward J. McCluskey Center for Reliable Computing Stanford University http://crc.stanford.edu

More information

The Lekha 3GPP LTE FEC IP Core meets 3GPP LTE specification 3GPP TS V Release 10[1].

The Lekha 3GPP LTE FEC IP Core meets 3GPP LTE specification 3GPP TS V Release 10[1]. Lekha IP 3GPP LTE FEC Encoder IP Core V1.0 The Lekha 3GPP LTE FEC IP Core meets 3GPP LTE specification 3GPP TS 36.212 V 10.5.0 Release 10[1]. 1.0 Introduction The Lekha IP 3GPP LTE FEC Encoder IP Core

More information

FAULT TOLERANT SYSTEMS

FAULT TOLERANT SYSTEMS FAULT TOLERANT SYSTEMS http://www.ecs.umass.edu/ece/koren/faulttolerantsystems Part 6 Coding I Chapter 3 Information Redundancy Part.6.1 Information Redundancy - Coding A data word with d bits is encoded

More information

HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes.

HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes. HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes Ian Glendinning Outline NVIDIA GPU cards CUDA & OpenCL Parallel Implementation

More information

Design and Analysis of Kogge-Stone and Han-Carlson Adders in 130nm CMOS Technology

Design and Analysis of Kogge-Stone and Han-Carlson Adders in 130nm CMOS Technology Design and Analysis of Kogge-Stone and Han-Carlson Adders in 130nm CMOS Technology Senthil Ganesh R & R. Kalaimathi 1 Assistant Professor, Electronics and Communication Engineering, Info Institute of Engineering,

More information

CHAPTER 6 FPGA IMPLEMENTATION OF ARBITERS ALGORITHM FOR NETWORK-ON-CHIP

CHAPTER 6 FPGA IMPLEMENTATION OF ARBITERS ALGORITHM FOR NETWORK-ON-CHIP 133 CHAPTER 6 FPGA IMPLEMENTATION OF ARBITERS ALGORITHM FOR NETWORK-ON-CHIP 6.1 INTRODUCTION As the era of a billion transistors on a one chip approaches, a lot of Processing Elements (PEs) could be located

More information

HIGH-PERFORMANCE RECONFIGURABLE FIR FILTER USING PIPELINE TECHNIQUE

HIGH-PERFORMANCE RECONFIGURABLE FIR FILTER USING PIPELINE TECHNIQUE HIGH-PERFORMANCE RECONFIGURABLE FIR FILTER USING PIPELINE TECHNIQUE Anni Benitta.M #1 and Felcy Jeba Malar.M *2 1# Centre for excellence in VLSI Design, ECE, KCG College of Technology, Chennai, Tamilnadu

More information

A new two-stage decoding scheme with unreliable path search to lower the error-floor for low-density parity-check codes

A new two-stage decoding scheme with unreliable path search to lower the error-floor for low-density parity-check codes IET Communications Research Article A new two-stage decoding scheme with unreliable path search to lower the error-floor for low-density parity-check codes Pilwoong Yang 1, Bohwan Jun 1, Jong-Seon No 1,

More information

Method for hardware implementation of a convolutional turbo code interleaver and a sub-block interleaver

Method for hardware implementation of a convolutional turbo code interleaver and a sub-block interleaver Method for hardware implementation of a convolutional turbo code interleaver and a sub-block interleaver isclosed is a method for hardware implementation of a convolutional turbo code interleaver and a

More information