/$ IEEE

Size: px

Start display at page:

Download "/$ IEEE"

Bertina Bell
5 years ago
Views:

1 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, VOL. 56, NO. 5, MAY Low-Power Memory-Reduced Traceback MAP Decoding for Double-Binary Convolutional Turbo Decoder Cheng-Hung Lin, Student Member, IEEE, Chun-Yu Chen, Student Member, IEEE, An-Yeu (Andy) Wu, Member, IEEE, and Tsung-Han Tsai, Member, IEEE Abstract Iterative decoding of convolutional turbo code (CTC) has a large memory power consumption. To reduce the power consumption of the state metrics cache (SMC), low-power memory-reduced traceback maximum a posteriori algorithm (MAP) decoding is proposed. Instead of storing all state metrics, the traceback MAP decoding reduces the size of the SMC by accessing difference metrics. The proposed traceback computation requires no complicated reversion checker, path selection, and reversion flag cache. For double-binary (DB) MAP decoding, radix-2 2 and radix-4 traceback structures are introduced to provide a tradeoff between power consumption and operating frequency. These two traceback structures achieve an around 20% power reduction of the SMC, and around 7% power reduction of the DB MAP decoders. In addition, a high-throughput 12-mode WiMAX CTC decoder applying the proposed radix-2 2 traceback structure is implemented by using a m CMOS process in a core area of 7.16 mm 2. Based on postlayout simulation results, the proposed decoder achieves a maximum throughput rate of Mbps and an energy efficiency of 0.43 nj/bit per iteration. Index Terms Low-power design, maximum a posteriori (MAP) algorithm, turbo decoder. I. INTRODUCTION SINGLE-BINARY (SB) convolutional turbo code (CTC), proposed in 1993 [1], has been proven to provide a high coding gain near the Shannon capacity limit. The SB-CTC has been adopted in the forward-error-control (FEC) scheme for wideband code-division multiple access (WCDMA) [2] and cdma2000 [3]. In 1999, the nonbinary CTC [4] was introduced, which has a superior coding performance compared with the SB CTC [5]. In recent years, double-binary (DB) CTC has been adopted in the FEC coding of advanced wireless communication standards, such as digital video broadcasting-return channel over satellite and terrestrial (DVB-RCS Manuscript received September 28, 2008; revised December 23, First published March 10, 2009; current version published May 20, This work was supported in part by Industrial Technology Research Institute, Taiwan, under the WiMAX FEC Project 8362B51310, and by the National Science Council, Taiwan, under Grant NSC E This paper was presented in part at the ISCAS, Kobe, Japan, May 2005, VLSI-DAT, Hsinchu, Taiwan, April 2008, and ISCAS, Seattle, WA, May This paper was recommended by Guest Editor H. Schmid. C.-H. Lin, C.-Y. Chen, and A.-Y. Wu are with the Graduate Institute of Electronics Engineering and Department of Electrical Engineering, National Taiwan University, Taipei 10617, Taiwan ( andywu@cc.ee.ntu.edu.tw). T.-H. Tasi is with the Department of Electrical Engineering, National Central University, Jhongli City 32001, Taiwan. Color version of one or more of the figures in this paper are available online at Digital Object Identifier /TCSI Fig. 1. Decoding path of the state metrics access: the (a) conventional path and (b) traceback path. and DVB-RCT) [6], and worldwide interoperability for microwave access (WiMAX) [7]. Powerful soft-input soft-output (SISO) algorithms for CTC decoding are the maximum a posteriori algorithm (MAP) [8], and its derivatives, such as the log-map (L-MAP) [9], Max-log-MAP (ML-MAP) [9], combinations of the L-MAP and ML-MAP [11], and enhanced Max-log-MAP (EML-MAP) [12]. In this paper, we use the term MAP for an abbreviation of L-MAP and (E)ML-MAP. The memory organization of MAP decoding is an important issue in facilitating the hardware implementation of CTC decoders [13]. In particular, the power reduction of state metrics cache (SMC) is critical for MAP decoders [14]. With regard to SB CTC decoding, some researches have been proposed to reduce the power consumption of the SMC [14] [21]. The reverse computations [20], [21] significantly reduce SMC power consumption with reversion checkers and reversion flag caches. However, the reversion checker and reversion flag cache prolong the critical path or decoding cycles. In addition, the computational complexities of the reverse computations are increased dramatically when the reverse computations are extended from the SB to the DB MAP. Although some researchers [22], [23] have proposed the VLSI designs of DB CTC decoders, there are presently very few papers discussing an efficient SMC power reduction of DB CTC decoders. In this paper, the traceback MAP decoding is proposed to trace the state metrics back by accessing the difference metrics. Fig. 1 illustrates the decoding paths of the conventional computation and proposed traceback computation. In the conventional path, the state metrics computed by the natural recursion processor (NRP) in the natural order are stored in the SMC /$ IEEE

2 1006 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, VOL. 56, NO. 5, MAY 2009 Powerful SISO algorithms for DB CTC decoding are the DB MAP. First, the arithmetic operations of the DB EML-MAP [24] are described as follows: (1a) Fig. 2. DB RSC encoder of WiMAX CTC scheme. (1b) Then, the state metrics are read out to compute the a posteriori log-likelihood ratio (LLR) by the log-a posteriori module (LAPO) in the reverse order. In the traceback path, the difference metrics computed by the NRP are stored in the SMC. Then, the state metrics are traced back with the stored difference metrics by the traceback recursion processor (TRP) in the reverse order. The power consumption of the SMC can be reduced by accessing the difference metrics because the number of stored metrics is lower. The computational power of TRP is small, and the overall power consumption of the traceback path is reduced. For the traceback DB MAP decoding, two traceback structures are demonstrated. The radix-2 2 traceback structure has low hardware costs, and the radix-4 traceback structure has short path delays. Experimental results show that these two traceback structures achieve an around 20% power reduction of the SMC, and around 7% power reduction of the DB MAP decoder. In addition, an application of the proposed traceback DB MAP decoding for the DB CTC is to deploy the radix-2 2 traceback structure to a high-throughput WiMAX CTC decoder. We emphasize the 12 general transmissions of the WiMAX CTC scheme without the optional hybrid automatic repeat request (HARQ) transmissions. The 12-mode CTC decoder was implemented by using a TSMC 0.13 m CMOS process. The prototyping chip in a core area of 7.16 mm achieves Mbps, with an energy efficiency of 0.43 nj/bit per iteration. The remainder of this paper is as follows. Section II provides reviews of the DB MAP decoding. Section III introduces some background information regarding SMC power reductions. Section IV describes the proposed memory-reduced traceback MAP decoding. Section V demonstrates the radix-4 traceback DB MAP decoding and the corresponding computational units. Section VI demonstrates the comparisons and experimental results of the proposed radix-4 DB traceback structures. The prototyping chip of the 12-mode WiMAX CTC decoder is described in Section VII. Finally, Section VIII concludes this paper. where branch metrics; forward recursion state metrics; backward recursion state metrics; a priori LLR; a posteriori LLR; intrinsic values; extrinsic values; two binary bits at time ; transmitted codewords BPSK; for (1c) (1d) (1e) (1f) II. REVIEWS OF DB MAP DECODING A CTC encoder is composed of a CTC interleaver and two parallel or serial concatenated recursive systematic convolutional (RSC) encoders. Fig. 2 shows the DB RSC encoder of a WiMAX CTC scheme. The constraint length of this DB RSC encoder is 4. The transmitted codewords are denoted by and. soft received codewords; number of parity bit; ; state index; scaling factor.

LIN et al.: LOW-POWER MEMORY-REDUCED TRACEBACK MAP DECODING FOR DB CONVOLUTIONAL TURBO DECODER 1007 Decisions are based on (2) Note that the values of, and are always equal to zero.

Conventional decoding procedure. (3) A lookup table (LUT) can implement the corrective term.

3 LIN et al.: LOW-POWER MEMORY-REDUCED TRACEBACK MAP DECODING FOR DB CONVOLUTIONAL TURBO DECODER 1007 Decisions are based on (2) Note that the values of, and are always equal to zero. The differences between the DB L-MAP and DB (E)ML-MAP are that a priori LLR multiplies the channel value, and MAX operations are replaced by in the DB L-MAP. The [9] is defined as Fig. 3. Conventional decoding procedure. (3) A lookup table (LUT) can implement the corrective term. The difference between the DB EML-MAP and DB ML-MAP is that the extrinsic values in the DB ML-MAP do not multiply a scaling factor. Despite the SB and DB MAP decoding, the L-MAP has the significant coding gain and the EML-MAP achieves better coding gain than the ML-MAP. Without specifying which algorithm is used, we use the term MAP for an abbreviation of the L-MAP and (E)ML-MAP in the following sections. The details of SB MAP algorithms can be referred to in [10] for the SB CTC. III. POWER REDUCTIONS OF STATE METRICS CACHE Before we introduce the proposed memory-reduced traceback MAP decoding, the background information of SMC power reduction for the MAP decoding is demonstrated in this section. A. Conventional (Windowing) Decoding Procedure Despite the SB or DB MAP decoding, forward (1b) and backward (1c) recursion states metrics are computed in chronologically reverse order, where is the constraint length of a RSC encoder. Both forward and backward recursion state metrics are required for the computation of a posteriori LLR (1d). Thus, a large SMC stores the forward (or backward) recursion state metrics to compute the a posteriori LLR until the backward (or forward) recursion state metrics are generated. The conventional decoding procedure employing the windowing technique [25] was proposed to reduce the depth of SMC from a block size to a window size, where is. (The well-known sliding window (SW) and parallel window (PW) MAP architectures can be referred to in [13] and [26], respectively.) Fig. 3 shows the conventional decoding procedure. In the natural order, the NRP, composed of add-compare-select units (ACSUs), is used to recursively compute the forward (or backward) recursion state metrics in cycles. The obtained state metrics in each cycle are immediately stored into the SMC and recursively fed back to the NRP to calculate the next state metrics. To compute the a posteriori LLR, the state metrics are read out from the SMC. The SMC of the conventional decoding procedure still accounts for more than 50% of the entire power consumption [14]. Fig. 4. Reverse decoding procedure. B. Reverse Decoding Procedure To further reduce the access power of SMC, the reverse computations [20], [21] modified the conventional decoding procedure for the radix-2 SB MAP decoding. The decoding procedure of the reverse computations is illustrated in Fig. 4. Compared with the conventional decoding procedure shown in Fig. 3, the reverse decoding procedure adds a reversion checker, a reversion flag cache, and a reverse recursion processor (RRP). The reversion checker decides whether the state metrics computed by the NRP are reversible or not. If a state metric is not reversible, this state metric is stored in one subbank of the SMC. Meanwhile, the reversion flag cache stores the path information of this state metric. To compute the a posteriori LLR, the irreversible state metric is read out from the SMC according to the path information stored in the reversion flag cache. Otherwise, the reversible state metric is computed by the RRP, composed of reverse units. The recovered state metrics are recursively fed back to the RRP to calculate the next reversible state metrics. The reverse computations reduce the power consumption of the radix-2 SB MAP decoder because the subbanks of SMC are dynamically accessed, and the computational power of the reversion checker and RRP is small. The reverse computations for the radix-2 SB L-MAP decoding achieve an around 30% power reduction with an over 20% logic overhead. However, the reversion checker and reversion flag cache prolong the critical path or decoding cycles. In addition, dividing the SMC into subbanks increases the silicon area of the SMC and consumes more overall power of the SMC if all subbanks are accessed. For the radix-2 SB MAP decoding, a -state trellis structure can be decomposed into radix-2 butterfly structures.

The symbol index is denoted by k. For the conversional decoding procedure, the computational units in (a) are the ACSUs.

4 1008 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, VOL. 56, NO. 5, MAY 2009 Fig. 6. Traceback decoding procedure. Fig. 5. Natural, reverse, and traceback recursions in the (a) (c) radix-2 butterfly structures and (d) (f) radix-4 butterfly structures. The symbol index is denoted by k. For the conversional decoding procedure, the computational units in (a) are the ACSUs. For the reverse decoding procedure, the computational units in (a) and (d) are the ACSUs and reversion checkers, and the computational units in (b) and (e) are the reverse units. For the traceback decoding procedure, the computational units in (a) and (d) are ACSUs only, and the computational units in (c) and (f) are the TBUs. Fig. 5(a) and (b) illustrates an example of the SB reverse computation in a radix-2 butterfly structure. In Fig. 5(a), the reversion checker checks all fixed paths and determines the reversible paths. In Fig. 5(b), the dashed lines denote the reversible (selective) paths determined by the reversion flag cache. One radix-2 butterfly structure has four paths and four cases to be checked. The essential DB MAP decoding is radix-4 trellis decoding. For the radix-4 DB MAP decoding, a -state trellis structure can be decomposed into radix-4 butterfly structures. Fig. 5(d) and (e) illustrates an example of the DB reverse computation in a radix-4 butterfly structure. One radix-4 butterfly structure has 16 paths and 256 cases to be checked in the reverse computations. Hence, the reverse computations for the radix-4 trellis require more complicated reversion checkers and larger reversion flag caches. The computational complexities are dramatically increased when the reverse computations are extended from the SB to the DB MAP. IV. PROPOSED MEMORY-REDUCED TRACEBACK MAP DECODING Here, the traceback MAP decoding is proposed to reduce the power consumption of the SMC. The traceback MAP decoding has five major stages/phases, given here. 1) The branch metrics are computed with the received codewords and the a priori LLR in the natural order. 2) The forward (or backward) state metrics are recursively computed by the NRP with the branch metrics in the natural order, and the difference metrics are stored into the SMC. Note that the difference metric is the difference between two state metrics and has the same bit-length of the state metric. 3) The forward (or backward) state metrics are recursively traced back by the TRP with the stored difference metrics in the reverse order. Concurrently, the backward (or forward) state metrics are recursively computed with the branch metrics. 4) The a posteriori LLR is computed with regenerated forward (or backward) state metrics, the backward (or forward) state metrics, and branch metrics by the LAPO in the reverse order. 5) The extrinsic values and hard bits are computed in the reverse order with the a posteriori LLR. In contrast to the conventional MAP decoding, the traceback MAP decoding reduces the number of stored metrics by accessing the difference metrics. Hence, the SMC power consumption is reduced. In addition, the computational power overhead of tracing the state metrics back is much smaller than the SMC power consumption. Thus, the overall power consumption of the MAP decoding is reduced. The work in [18] has introduced that not absolute values but differences between the state metrics are important for the a posteriori LLR. In the proposed traceback computation, the differences between state metrics are kept by storing the difference metrics. Hence, the traceback MAP decoding performs without losing correction ability. In addition, the proposed traceback computation works in the L-MAP and (E)ML-MAP. Fig. 6 shows the proposed traceback decoding procedure, which is a modification of the conventional decoding procedure shown in Fig. 3. Compared with the conventional decoding procedure, only an additional TRP is added. Fig. 5(a) and (c) illustrates an example of the radix-2 SB traceback computation in a radix-2 butterfly structure. In Fig. 5(a), only one difference metric (black arc) computed by a radix-2 ACSU is stored in the SMC. In Fig. 5(c), the two state metrics (white nodes) are traced back in the traceback recursion by one traceback unit (TBU) with the stored difference metric (black arc). The recovered two state metrics are recursively fed back to the TRP to calculate the next two state metrics. The stored metrics are reduced from two state metrics to one difference metric. The bit-lengths of the state metric and difference metric are the same. Thus, the radix-2 traceback SB MAP decoding requires half the SMC size of the conventional decoding procedure. Compared with the reverse computations shown in Figs. 4 and

5 LIN et al.: LOW-POWER MEMORY-REDUCED TRACEBACK MAP DECODING FOR DB CONVOLUTIONAL TURBO DECODER (b), the traceback computation has fixed paths and requires no complicated checker and path selection. The design details of the radix-2 traceback SB MAP decoding, including the radix-2 ACSU, radix-2 TBU, and overall architecture of the WCDMA CTC decoder, can be referred to in [19]. Based on the experimental results using a m CMOS process, the radix-2 SB traceback MAP decoding achieves a 21.4% power reduction of the SW L-MAP decoder with a 3.2% logic overhead. V. TRACEBACK COMPUTATION AND COMPUTATIONAL UNITS FOR THE RADIX-4 DB MAP DECODING In this paper, we emphasize the design of the radix-4 traceback DB MAP decoding and the corresponding computational units. Here, we demonstrate how the radix-4 traceback DB MAP decoding works with accessing the difference metrics. Two widely used ACSUs are then described, and the corresponding TBUs are proposed for the radix-4 traceback DB MAP decoding. A. Traceback Computation for the DB MAP Decoding The radix-4 traceback DB MAP decoding still employs the aforementioned five major stages/phases in Section IV and the traceback decoding procedure shown in Fig. 6. For the DB MAP decoding, the decoding trellis changes to the radix-4 trellis. Fig. 5(d) and (f) illustrates an example of the traceback computation in a radix-4 butterfly structure. In Fig. 5(d), the natural recursion in a radix-4 butterfly structure shows that three difference metrics (black arcs) are generated by a 4-input 1-output ACSU. These three difference metrics are stored into the SMC for the traceback computation. In Fig. 5(f), one 1-input 4-output TBU then traces the four state metrics (white nodes) back, with the three stored difference metrics (black arcs) in the traceback recursion. Taking the DB RCS encoder shown in Fig. 2, for instance, Fig. 7(a) illustrates the corresponding eight-state radix-4 trellis diagram of the natural recursion for the DB MAP decoding, where denotes the time index in the natural order, and denotes the symbol index. Fig. 7(b) illustrates the radix-4 trellis of the traceback recursion for the DB MAP decoding, where denotes the time index in the reverse order. In the conventional decoding procedure (see Fig. 3), the eight states metrics are recursively computed by the NRP. The SMC stores these eight state metrics to compute the a posteriori LLR. In the traceback decoding procedure (see Fig. 6), the six difference metrics [black arcs in Fig. 7(a)] are computed by the NRP and stored into the SMC. The TRP, composed of 2 TBUs [black nodes in Fig. 7(b)], traces the eight state metrics back with the stored six difference metrics [black arcs in Fig. 7(b)]. Compared with the conventional decoding procedure, the stored metrics are reduced from eight state metrics to six difference metrics. Thus, the traceback computation reduces the SMC power consumption for the DB MAP decoding. B. Radix-2 2 Traceback Pair For the radix-4 MAP decoding, two types of ACSUs, the radix-2 2 ACSU and the radix-4 ACSU, are widely used to constitute the NRP. Fig. 8(a) shows an example of the Fig state radix-4 trellis diagrams of the (a) natural recursion and (b) traceback recursion. The symbol index is denoted by k. The time index in the natural recursion is denoted by t, and the time index in the traceback recursion is denoted by t. The computational units in (a) are the ACSUs, and the computational units in (b) are the TBUs. radix-2 2 ACSU. The radix-2 2 ACSU consists of four front adders, three radix-2 compare-select units (CSUs), and an LUT. In the proposed traceback computation, three difference metrics (Diff 0, Diff 1, and Diff 2) generated by three radix-2 CSUs are stored in the SMC. Fig. 8(b) shows the corresponding TBU of the radix-2 2 ACSU. Fig. 8(c) illustrates a computational example of the radix-2 2 ACSU and TBU. The radix-2 2 ACSU obtains the maximal state metric B based on Diff 0, Diff 1, and Diff 2. Subsequently, the radix-2 2 TBU regenerates A, B, C, and D based on Diff 0, Diff 1, and Diff 2, since B can be initially achieved. Hence, the four state metrics can be recomputed by the radix-2 2 TBU with the difference metrics stored in the SMC. In the radix-2 2 TBU, the sign bit of the difference metric decides the paths of the two multiplexers and the operation of the binary adder/subtracter. Taking the trellis diagram shown in Fig. 7(b), for instance, six difference metrics are stored in the SMC because two current states and two TBUs can trace eight next states back. The storage of the SMC is reduced from eight state metrics to six difference metrics at each stage. Note that the values A, B, C, and D in the output end of the radix-2 2 TBU can be the input values of the LAPO to compute. This approach reduces eight adders ( in (1d)) in the LAPO. C. Radix-4 Traceback Pair Fig. 9(a) shows an example of the radix-4 ACSU. The radix-4 ACSU has a comparator to select the maximal state metric quickly. Unlike the radix-2 2 ACSU, three difference metrics related to A (Diff 0, Diff 1, and Diff 2) and two selective bits (S0 and S1) of the radix-4 ACSU are stored in the SMC. The or

6 1010 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, VOL. 56, NO. 5, MAY 2009 Fig. 8. Radix-222 traceback pair: (a) the ACSU, (b) the TBU, and (c) a computational example. corresponding TBU of the radix-4 ACSU is shown in Fig. 9(b). Fig. 9(c) illustrates a computational example of the radix-4 ACSU and TBU. The radix-4 ACSU obtains the maximal state metric B based on two selective bits generated by the six parallel subtractions in the comparator. The three difference metrics related to A are stored in the SMC. Since B can be initially achieved, A can be regenerated by the radix-4 TBU based on the stored S0 and S1. Subsequently, the radix-4 TBU regenerates A, B, C, and D based on Diff 0, Diff 1, and Diff 2. Taking the trellis diagram shown in Fig. 7(b), for instance, six difference metrics and four extra selective bits are stored in the SMC because two current states and two TBUs can trace eight next states back. The storage of the SMC is reduced from eight state metrics to six difference metrics and four extra bits at each stage. The values A, B, C, and D in the output end of the radix-4 TBU can be the input values of the LAPO to compute in. This approach reduces eight adders [ or in (1d)] in the LAPO. These two traceback pairs perform the (E)ML-MAP if the LUT of the corrective term in (3) is not implemented. Otherwise, these two traceback pairs perform the L-MAP. The proposed traceback computation and traceback pairs of the radix-4 DB MAP decoding can be simply applied to the radix-4 SB MAP decoding with a simple trellis modification.

LIN et al.: LOW-POWER MEMORY-REDUCED TRACEBACK MAP DECODING FOR DB CONVOLUTIONAL TURBO DECODER 1011 Fig. 9. Radix-4 traceback pair: (a) the ACSU, (b) the TBU, and (c) a computational example. VI.

The decoding procedures of the conventional and proposed traceback structures are only considered and evaluated in the radix-4 DB EML-MAP. A.

7 LIN et al.: LOW-POWER MEMORY-REDUCED TRACEBACK MAP DECODING FOR DB CONVOLUTIONAL TURBO DECODER 1011 Fig. 9. Radix-4 traceback pair: (a) the ACSU, (b) the TBU, and (c) a computational example. VI. COMPARISONS AND EXPERIMENTAL RESULTS Here, accurate silicon area and power evaluations are obtained by using Verilog HDL codes synthesized with the standard cell library of TSMC m CMOS process. The decoding procedures of the conventional and proposed traceback structures are only considered and evaluated in the radix-4 DB EML-MAP. A. Area and Power Evaluations and Comparisons Table I lists a summary of the conventional and traceback structures for the radix-4 DB EML-MAP decoding, where denotes the bit-length of a state metric or a difference metric. The constraint length cannot be less than 3 for exploiting the radix-4 trellis. Taking and, for instance, eight radix-2 2 (or radix-4) ACSUs generate eight state metrics at each time stage. Hence, the SMC of the conventional structures has 80 bit-lengths. However, the traceback structure composed of eight radix-2 2 ACSUs and two radix-2 2 TBUs reduces the SMC bit-lengths from 80 to 60. The traceback structure composed of eight radix-4 ACSUs and two radix-4 TBUs reduces the SMC bit-lengths from 80 to 64. For the comparison of the different structures, the practical evaluations are under the WiMAX CTC decoding. The -type TABLE I SUMMARY OF THE CONVENTIONAL AND TRACEBACK STRUCTURES parallel-window (PW) MAP decoding in [26] is adopted to achieve a high throughput rate because the decoding latency is 1. Table II lists evaluation parameters under the specification of WiMAX CTC. Ten parallel windows are used to decode an CTC block, and we achieves. The size of information bits is a double of a block size because of the DB CTC. The bit-length of a state metric ( or )or a difference metric is 10. Table III shows the evaluated results of computational units of the different structures. The silicon area and path delay are reported by Synopsis Design Vision. The power consumption on 2-dB SNR noisy data is estimated by Synopsis PrimePower at 100 MHz operating frequency. For the eight-state radix-4 trellis, eight ACSUs (as an NRP) are grouped as two examples of the conventional structures. Eight ACSUs (as an NRP) and two

TRP) are grouped as two examples of the traceback structures.

8 1012 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, VOL. 56, NO. 5, MAY 2009 TABLE II EVALUATION SPECIFICATION OF WIMAX CTC DECODING TABLE IV SMC COMPARISON OF THE DIFFERENT STRUCTURES TABLE III COMPUTATIONAL UNIT COMPARISON OF THE DIFFERENT STRUCTURES TBUs (as a TRP) are grouped as two examples of the traceback structures. The radix-2 2 ACSU has the longest path delay, because the values of the latter radix-2 CSU in the radix-2 2 ACSU are not correct until the sign (most significant) bits of the two former radix-2 CSU are stable. The radix-2 2 TBU does not suffer from this problem because the sign bits of the difference metrics are initially known. In addition, the radix-4 ACSU selects the maximal state metric quickly because the comparator generates selective bits based on the six parallel subtractions at the first level. Compared with the radix-2 2 traceback structure, the radix-4 traceback structure has shorter and more balanced path delays. However, the radix-2 2 traceback structure has less area costs and power consumptions. Because the ACSU dominate the operating frequency of a MAP decoder, the radix-4 traceback structure can derive a higher operating frequency than the radix-2 2 traceback structure. Operating at a higher operating frequency consumes more power for the radix-4 traceback structure. Thus, it is a tradeoff between the power consumption and maximum operating frequency for the radix-4 traceback DB MAP decoding. In Table IV, the SMCs composed of single-port (SP) RAMs are generated by using the TSMC m process for the comparison of the different structures. Because of the -type PW MAP decoding, the SMC can be implemented by a SP RAM with depth. Note that eight state metrics have to be stored in the SMC, despite the conventional structure composed of the radix-2 2 or radix-4 ACSUs. For the radix-2 2 traceback structure, only six difference metrics are stored in the SMC. For the radix-4 traceback structure, six difference metrics and four extra selective bits are stored in the SMC. Fig. 10 shows comparisons of silicon area and power consumption of the SMC and TRP. The radix-2 2 traceback structure achieves a 24.9% area reduction and 25% power reduction of the SMC. On the other hand, the radix-4 traceback structure achieves a 19.6% area reduction of the SMC and 19.5% power reduction of the SMC. The power overhead of the TRPs is less than 4%, and the area overhead of the TRPs is less than 19%. Thus, the radix-4 and radix-2 2 traceback structure achieve the area and power reduction of the SMC of the radix-4 and radix-2 2 conventional structure. B. Area and Power Evaluations of PW EML-MAP Decoders Fig. 11 illustrates a processing element (PE) employing the traceback -type PW EML-MAP decoding. The architecture can be divided into the forward recursion path (upper path) and backward recursion path (lower path). One branch metrics unit (BMU) for (1a), one NRP for (1b) or (1c), one SMC with depth, one proposed TRP, one LAPO for (1d), one log-extrinsic module (LEX) for (1e), and one hard decision module (HD) for (2) are used for each path. When the TRPs are removed, the PE performs the conventional -type PW EML-MAP decoding. The branch metrics are directly calculated without buffering in a branch metrics cache because of the -type PW decoding. To meet the parameters in Table II, ten PEs are constructed as a DB MAP kernel. Fig. 12 shows comparisons of silicon area and power consumption of the ten PEs composed of the conventional and traceback structures. The power consumption on 2-dB SNR noisy data is estimated at 100 MHz operating frequency. Compared with the radix-2 2 conventional structure, the radix-2 2 traceback structure achieves a 9.7% power reduction and a 2.1% area reduction. The logic overhead of radix-2 2 TRP is 9.8%. Similarly, the radix-4 traceback structure achieves a 7.9% power reduction and a 2.1% area reduction compared with the radix-4 conventional structure. The logic overhead of radix-4 TRP is 6.6%. The silicon areas of the traceback structures are not significantly reduced because the radix-4 DB MAP decoding complicates the hardware complexities of all computational units. Compared with the conventional structures, however, the power consumptions of the traceback structures are noticeably lower. The logic overhead of the radix-4 DB traceback computation is less than 10%. VII. APPLICATION TO 12-MODE WIMAX CTC DECODER Here, the 12-mode WiMAX CTC decoder for is presented. The design parameters of the CTC decoding are

Architecture of the processing element (PE) employing traceback PW EML-MAP decoding. Fig. 12.

9 LIN et al.: LOW-POWER MEMORY-REDUCED TRACEBACK MAP DECODING FOR DB CONVOLUTIONAL TURBO DECODER 1013 Fig. 10. (a) Power consumption and (b) silicon area comparisons of the SMC and TRP using the conventional and traceback structures. Fig. 11. Architecture of the processing element (PE) employing traceback PW EML-MAP decoding. Fig. 12. (a) Power consumption and (b) silicon area comparisons of 10 PW EML-MAP PEs composed of the conventional and traceback structures. shown in Table II. A prototyping CTC decoder has been implemented by using a TSMC m CMOS process. A. Overall Architecture A VLSI architecture design of the 12-mode WiMAX CTC decoder has been presented in [27] to achieve the features of high throughput and efficient reconfigurability. The overall architecture of the implemented 12-mode WiMAX CTC decoder is shown in Fig. 13. The decoder is composed of three input buffers, an internal buffer, an output buffer, a global controller, and a DB MAP kernel. The DB MAP kernel is composed of ten PW EML-MAP PEs and the conjoint WiMAX CTC interleavers (CIs). Because the radix-2 2 traceback structure has less power consumption (see Fig. 12), the radix-2 2 traceback structure is adopted in the PEs. Each buffer is divided into ten banks to be accessed simultaneously by the PE. To satisfy the

1014 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, VOL. 56, NO. 5, MAY 2009 TABLE V CHIP SUMMARY OF THE 12-MODE WIMAX CTC DECODER Fig. 13.

(+ represents the SMC composed of single-port RAM.) power-efficient 12-mode CTC decoding, the global controller activates the appropriate PEs, CIs, and buffer banks, depending on the block size. B.

14 and Table V, respectively. The prototyping chip contains 27.4 Kb RAM and integrates 635 Kgates of logic. Each PE accounts for 52.3 Kgates of logic, and each CI accounts for 0.51 Kgates of logic.

2 V, 25 C) power consumption of this CTC decoder is 197.3 mw, which is estimated at 100 MHz operating frequency on 2-dB SNR noisy data.

10 1014 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, VOL. 56, NO. 5, MAY 2009 TABLE V CHIP SUMMARY OF THE 12-MODE WIMAX CTC DECODER Fig. 13. Block diagram of the 12-mode WiMAX CTC decoder. TABLE VI COMPARISONS OF DIFFERENT CTC DECODER CHIPS Fig. 14. Chip layout of the proposed 12-mode WiMAX CTC decoder. (+ represents the SMC composed of single-port RAM.) power-efficient 12-mode CTC decoding, the global controller activates the appropriate PEs, CIs, and buffer banks, depending on the block size. B. Prototyping Chip Implementation The chip layout and summary of the prototyping chip implemented by using the TSMC m CMOS process within a core size of 7.16 mm are shown in Fig. 14 and Table V, respectively. The prototyping chip contains 27.4 Kb RAM and integrates 635 Kgates of logic. Each PE accounts for 52.3 Kgates of logic, and each CI accounts for 0.51 Kgates of logic. Based on worst-case static timing analysis and post-layout simulation results, the decoder achieves a maximum operating frequency of 100 MHz. The post-layout, gate-level, and typical-case (@1.2 V, 25 C) power consumption of this CTC decoder is mw, which is estimated at 100 MHz operating frequency on 2-dB SNR noisy data. For a single WiMAX FEC block within the 12 general transmissions, this CTC decoder can achieve a higher throughput rate of Mbps at 100 MHz than that of Mbps at 214 MHz of the Xilinx CTC decoder implemented in a FPGA [28]. C. Comparisons of CTC Decoders In Table VI, the proposed 12-mode WiMAX CTC decoder is compared with other chip designs that support multiple block sizes. The work in [29] is designed for the HSDPA standard with one radix-4 SB SW L-MAP PE and block sizes from 40 to The work in [30] is designed with seven radix-2 SB SW L-MAP PEs and block sizes from It is hard to compare these chips since the coding parameters are different from each other. However, we use normalized energy efficiency (NEE) as one performance index. The NEE indicates how much energy a decoder chip consumes to process a hard bit at an iteration. In addition, we use normalized area efficiency (NAE) as the other performance index. The NAE indicates how many hard bits per one mm for a single CTC block a decoder chip decodes. When we normalize the works from m to m technology, the normalized energy factor is V V m m. Similarly, the normalized area factor is m m. Our proposed decoder achieves a lower NEE (0.43 nj/bit/iteration) and higher NAE (0.68 bits/mm ). Note that the NEE and NAE are not used to justify which design is superior to the others, but to provide an evaluative method for reference. (4) (5)

The proposed traceback MAP decoding performs in the L-MAP and (E)ML-MAP without losing correction ability. Two pairs of the traceback computation are introduced for the radix-4 DB MAP decoding.

11 LIN et al.: LOW-POWER MEMORY-REDUCED TRACEBACK MAP DECODING FOR DB CONVOLUTIONAL TURBO DECODER 1015 VIII. CONCLUSION To reduce the power consumption of SMC, the traceback MAP decoding has been presented with a low logic overhead. The proposed traceback MAP decoding performs in the L-MAP and (E)ML-MAP without losing correction ability. Two pairs of the traceback computation are introduced for the radix-4 DB MAP decoding. The traceback radix-2 2 pair has low hardware costs, and the radix-4 traceback pair has short path delays. In addition, the proposed traceback pairs of the radix-4 DB MAP decoding can be simply applied to the radix-4 SB MAP decoding. The experimental result shows that the two traceback structures achieve an around 20% power reduction of the SMC, and around 7% power reduction of ten DB MAP decoders for the WiMAX CTC. Compared with other designs, the 12-mode WiMAX CTC decoder employing the radix-2 2 traceback structure achieves a low energy efficiency of 0.43 nj/bit per iteration. ACKNOWLEDGMENT The authors would like to thank National Chip Implementation Center (CIC) for technical support. REFERENCES [1] C. Berrou, A. Glavieux, and P. Thitimajshima, Near Shannon limit error-correcting coding and decoding: Turbo codes, in Proc. IEEE Int. Conf. Commun. (ICC), 1993, pp [2] 3rd Generation Partnership Project (3GPP). [Online]. Available: [3] 3rd Generation Partnership Project 2 (3GPP2). [Online]. Available: [4] C. Berrou and M. Jezequel, Non-binary convolutional codes for turbo coding, Electron. Lett., vol. 35, no. 1, pp , Jan [5] C. Berrou et al., The advantages of non-binary turbo codes, in Proc. IEEE Inf. Theory Workshop (ITW), 2001, pp [6] Digital Video Broadcasting (DVB). [Online]. Available: dvb.org/ [7] Worldwide Interoperability for Microwave Access (WiMAX). [Online]. Available: [8] L. R. Bahl, J. Cocke, F. Jelinek, and J. Raviv, Optimal decoding of linear codes for minimizing symbol error rate, IEEE Trans. Inf. Theory, vol. IT-20, no. 2, pp , Mar [9] P. Robertson, E. Villebrun, and P. Hoeher, A comparison of optimal and sub-optimal MAP decoding algorithms operating in the log domain, in Proc. IEEE Int. Conf. Commun. (ICC), 1995, pp [10] J. P. Woodard and L. Hanzo, Comparative study of turbo decoding techniques: An overview, IEEE Trans. Veh. Technol., vol. 49, no. 6, pp , Nov [11] S. Papaharalabos, P. Sweeney, and B. G. Evans, SISO algorithms based on combined max/max* operations for turbo decoding, Electron. Lett., vol. 41, no. 3, pp , Feb [12] J. Vogt and A. Finger, Improving the max-log-map turbo decoder, Electron. Lett., vol. 36, no. 23, pp , Nov [13] G. Masera et al., VLSI architecture for turbo codes, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 7, no. 3, pp , Aug [14] C. Schurgers, F. Catthoor, and M. Engels, Memory optimization of MAP turbo decoder algorithms, IEEE Trans. Very Large Scale Integr. (VLSI) Sys., vol. 9, no. 3, pp , Sep [15] H. Liu et al., Energy efficient turbo decoder by reducing the state metric quantization, in Proc. IEEE Workshop Signal Processing Syst. (SiPS), 2007, pp [16] Z. Wang, Z. Chi, and K. K. Parhi, Area-efficient high-speed decoding schemes for turbo decoders, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 10, no. 6, pp , Aug [17] M. M. Mansour and N. R. Shanbhag, VLSI architectures for SISO-APP decoders, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 11, no. 4, pp , Aug [18] E. Boutillon, W. J. Gross, and P. G. Gulak, VLSI architectures for the MAP algorithm, IEEE Trans. Commun., vol. 51, pp , Feb [19] T.-H. Tsai, C.-H. Lin, and A.-Y. Wu, A memory-reduced log-map kernel for turbo decoder, in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), 2005, pp [20] D.-S. Lee and I.-C. Park, Low-power log-map decoding based on reduced metric memory access, IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 53, no. 6, pp , Jun [21] H.-M. Choi, J.-H. Kim, and I.-C. Park, Low-power hybrid turbo decoding based on reverse calculation, IEICE Trans. Fund. Electron. Comm. Comput. Sci., vol. E89-A, no. 3, pp , Mar [22] J.-H. Kim and I.-C. Park, Double-binary circular turbo decoding based on border metric encoding, IEEE Trans. Circuits. Syst. II, Exp. Briefs., vol. 55, no. 1, pp , Jan [23] J.-H. Kim and I.-C. Park, Bit-level extrinsic information exchange method for double-binary turbo codes, IEEE Trans. Circuits. Syst. II, Exp. Briefs, vol. 56, no. 1, pp , Jan [24] Y. Ould-Cheikh-Mouhamedou, P. Guinand, and P. Kabal, Enhanced max-log-app and enhanced log-app decoding for DVB-RCS, in Proc. 3rd Int. Symp. Turbo Codes, Sep. 2003, pp [25] A. J. Viterbi, An intuitive justification and simplified implementation of the MAP decoder for convolutional codes, IEEE J. Sel. Areas Commun., vol. 16, no. 2, pp , Feb [26] A. Worm, H. Lamm, and N. Wehn, A high-speed MAP architectures with optimized memory and power consumption, in Proc. IEEE Workshop Signal Processing Syst. (SiPS), 2000, pp [27] C.-H. Lin, C.-Y. Chen, and A.-Y. Wu, High-throughput 12-mode CTC decoder for WiMAX standard, in Proc. IEEE Int. Symp. VLSI Design, Automation, and Test (VLSI-DAT), 2008, pp [28] IEEE e CTC Decoder Core, DS137 (V2.2), XiLinx IP Center, Jun. 5, [29] M. Bickerstaff et al., A 24Mb/s radix-4 logmap turbo decoder for 3GPP-HSDPA mobile wireless, in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, 2003, pp [30] B. Bougard et al., A scalable 8.7nj/bit 75.6Mb/s parallel concatenated convolutional (turbo-)codec, in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, 2003, pp Cheng-Hung Lin (S 05) received the B.S. degree in electronic engineering from Fu Jen Catholic University, Taipei, Taiwan, in 2002, and the M.S. degree in electrical engineering from National Central University, Taoyuan, Taiwan, in He is currently working toward the Ph.D. degree in electronic engineering at National Taiwan University, Taipei. His research interests include the design of very large-scale integration architectures and circuits for digital signal processing and communication systems. He is currently working on the hardware design for coding systems. Chun-Yu Chen (S 08) received the B.S. degree from National Chiao Tung University, Hsinchu, Taiwan, in 2007 and the M.S. degree from National Taiwan University, Taipei, in 2009, both in electronic engineering. He is currently an engineer with Silicon Motion Technology Corporation, Taipei. His research interests include the design of very large-scale integration architectures and circuits for digital signal processing and communication systems.

1016 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, VOL. 56, NO. 5, MAY 2009 An-Yeu (Andy) Wu (S 91 M 96) received the B.S. degree from National Taiwan University, Taipei, in 1987, and the M.

From August 1995 to July 1996, he was a Member of Technical Staff (MTS) with AT&T Bell Laboratories, Murray Hill, NJ, working on high-speed transmission IC designs.

12 1016 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, VOL. 56, NO. 5, MAY 2009 An-Yeu (Andy) Wu (S 91 M 96) received the B.S. degree from National Taiwan University, Taipei, in 1987, and the M.S. and Ph.D. degrees from the University of Maryland, College Park, in 1992 and 1995, respectively, all in electrical engineering. From August 1995 to July 1996, he was a Member of Technical Staff (MTS) with AT&T Bell Laboratories, Murray Hill, NJ, working on high-speed transmission IC designs. From 1996 to July 2000, he was with the Electrical Engineering Department of National Central University, Taiwan. In August 2000, he joined the faculty of the Department of Electrical Engineering and the Graduate Institute of Electronics Engineering, National Taiwan University, where he is currently a Professor. His research interests include low-power/high-performance VLSI architectures for DSP and communication applications, adaptive/multirate signal processing, reconfigurable broadband access systems and architectures, and SoC platform for software/hardware co-design. Dr. Wu served as an Associate Editor for EURASIP Journal of Applied Signal Processing from 2001 to 2004 and acted as the leading Guest Editor for a special issue on Signal Processing for Broadband Access Systems: Techniques and Implementations of the same journal (published in December 2003). He also served as the Associate Editor of the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS from 2003 to 2005 and the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS from 2007 to He is now an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS and the IEEE TRANSACTIONS ON SIGNAL PROCESSING. He has served on the technical program committees of many major IEEE International Conferences, such as ICIP, SiPS, AP-ASIC, ISCAS, ISPACS, ICME, SOC, and A-SSCC. He was the recipient of the A-class Research Award from the National Science Council, Taiwan, four times from 1997 to He received the Macronix International Corporation (MXIC) Young Chair Professor Award in 2003, the Distinguished Young Engineer Award from The Chinese Institute of Electrical Engineering, Taiwan, in 2004, and, in 2005, the Dr. Wu Ta-you Award (young scholar award) and the President Fu Si-nien Award, from the National Science Council and National Taiwan University, respectively, for his research work on VLSI system designs. Tsung-Han Tsai (S 96 M 98) received the B.S., M.S., and Ph.D. degrees in electrical engineering from the National Taiwan University, Taipei, in 1990, 1994, and 1998, respectively. From 1999 to 2000, he was a Professor of Electronic Engineering with Fu Jen Catholic University, Taipei, Taiwan. In 2000, he joined the Department of Electrical Engineering, National Central University, Taipei, where he is currently a Professor. He has been awarded 14 patents and has published more than 120 referred papers in international journals and conferences. His research interests include VLSI signal processing, video/audio coding algorithms, DSP architecture design, wireless communication, and system-on-chip design. Dr. Tsai is a member of the Audio Engineering Society (AES) and the Institute of Electronics, Information and Communication Engineers (IEICE). He was the recipient of the Industrial Cooperation Award in 2003 from the Ministry of Education, Taiwan. He is a member of the Technical Committee of the IEEE Circuits and Systems Society and serves as a Technical Program Committee member or Session Chair of several international conferences.

/$ IEEE

/$ IEEE IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS, VOL. 56, NO. 1, JANUARY 2009 81 Bit-Level Extrinsic Information Exchange Method for Double-Binary Turbo Codes Ji-Hoon Kim, Student Member,