AS TURBO codes [1], or parallel concatenated convolutional

Size: px

Start display at page:

Download "AS TURBO codes [1], or parallel concatenated convolutional"

Dwayne Gray
6 years ago
Views:

1 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 15, NO. 7, JULY SIMD Processor-Based Turbo Decoder Supporting Multiple Third-Generation Wireless Standards Myoung-Cheol Shin, Member, IEEE, and In-Cheol Park, Senior Member, IEEE Abstract A programmable turbo decoder is designed to support multiple third-generation wireless communication standards. We propose a hybrid architecture of hardware and software, which has small size, low power, and high performance like hardware implementations, as well as the flexibility and programmability of software. It mainly consists of a configurable hardware soft-inputsoft-output (SISO) decoder and a 16-b single-instruction multipledata processor, which is equipped with five processing elements and special instructions customized for interleaving in order to provide interleaved data at the speed of the hardware SISO. A fast and flexible software implementation of the block interleaving algorithm is also proposed. The interleaver generation is split into two parts, preprocessing and on-the-fly generation, to reduce the timing overhead of changing the interleaver structure. We present detailed descriptions of the interleaving implementation applied to the W-CDMA and cdma2000 standard turbo codes. The decoder occupies 8.90 mm 2 in a m CMOS with five metal layers and exhibits the maximum decoding rate of 5.48 Mb/s. Index Terms cdma2000, parallel algorithm, turbo code, turbo interleaver, W-CDMA. I. INTRODUCTION AS TURBO codes [1], or parallel concatenated convolutional codes, have extremely impressive performances, they entered the field of standardized systems in recent years. One of the most important examples is the channel coding for high-speed data transmission of the third-generation (3G) mobile communication systems such as W-CDMA [2] and cdma2000 [3]. Flexible and programmable turbo decoders are required for 3G communications because of two reasons: 1) the global roaming is recommended between different 3G standards and 2) even in a standard, the turbo coding frame size may change on a frame-by-frame basis. The turbo decoder consists of interleavers and soft-input-soft-output (SISO) decoders that decode recursive systematic convolutional (RSC) codes. Flexible and programmable implementation is especially needed for the turbo interleaver, as each 3G standard has a distinct and complicated interleaver. The simplest approach to implement an interleaver is to store the interleaved patterns in a ROM. The approach is not adequate for a turbo decoder supporting multiple wireless standards, as Manuscript received November 8, 2002; revised March 29, This work was supported in part by the Institute of Information Technology Assessment through the ITRC and IC Design Education Center (IDEC). M.-C. Shin was with the Korea Advanced Institute of Science and Technology, Daejeon , Korea. He is now with Samsung Electronics Company, Ltd., Giheung, , Korea ( mch.shin@samsung.com). I.-C. Park is with the Department of Electrical Engineering, Korea Advanced Institute of Science and Technology, Daejeon , Korea ( icpark@ee.kaist.ac.kr). Digital Object Identifier /TVLSI it needs a large ROM to store all of the possible interleaved patterns. Though some implementations based on digital signal processors (DSPs) have programmability that may support various standards, their performance is far below the maximum bit rate of up-to-date wireless systems [4], [5]. In [6], Bekooij et al. proposed a flexible turbo decoder as a form of very long instruction word (VLIW) microprocessor. However, dedicated hardware blocks, which are not programmable, are employed to implement its interleaver and SISO decoder. In this paper, we propose a multiple-standard turbo decoder implemented with a combination of the dedicated hardware part processing the computationally intensive but regular tasks such as SISO decoding and the software part running on a programmable single-instruction-multiple-data (SIMD) processor for the tasks that requires flexibility. The turbo interleaving that differs largely depending on the standards is also implemented in software. In this way, we can achieve the small area, high performance, and low power consumption of hardware, as well as the flexibility and programmability of software needed to support multiple standards. In addition, we address a software interleaving implementation that can change the interleaver structure in a very short period of time and requires only a small amount of memory. The interleaver construction is split into two parts, preprocessing and on-the-fly generation, in order to hide the timing overhead of interleaver changing effectively. Since the proposed SIMD processor is equipped with instructions specialized for turbo interleaving algorithms, the processor can provide interleaved data at the speed of the hardware SISO decoder. The remainder of this paper is organized as follows. Section II presents the fundamentals of turbo coding and iterative decoding, leading to a discussion on the turbo codes adopted by 3G wireless systems in Section III. In Section IV, the proposed turbo decoder is described in detail. Section V explains the proposed interleaver implementation, and Section VI describes its application to standards including W-CDMA and cdma2000. In Section VII, experimental results on the implemented chip are presented. Finally, we make concluding remarks in Section VIII. II. ITERATIVE DECODING OF TURBO CODES Here, we summarize the most essential part of the iterative turbo decoding and the notations to be use in next sections. A turbo encoder consists of two binary RSC encoders separated by an -bit interleaver, together with an optional puncturing mechanism [1]. A typical example is shown in Fig. 1. The function of the interleaver is to take each incoming frame of data bits and rearrange them in a pseudorandom fashion prior to /$ IEEE

2 802 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 15, NO. 7, JULY 2007 The SISO that is optimal in terms of minimizing the decoded BER should produce the a posteriori probability that each encoder input bit was 1 or 0 given the received symbol sequence y and a priori probability. Bahl et al. originally proposed this maximum a posteriori probability (MAP) algorithm [1], [7]. As it requires a huge dynamic range of intermediate values and a large number of expensive operations such as multiplications and divisions, its simplified versions, Log-MAP or Max-Log-MAP algorithms [8], are usually used in hardware implementation. If we use capital Greek letters to indicate the logarithm of the variables, for example, and, where is a current trellis state and is a previous state, the algorithms can be represented with the following equations. The branch metric of each transition is calculated as Fig. 1. Turbo encoder for cdma2000. the second encoding. Unlike the other interleavers mainly used to spread burst errors, it is important that the turbo interleaver should sort the bits as randomly as possible so that no apparent order, or uniformity, can be found. Any uniformity may degrade the error-correcting performance of turbo decoding. The puncturing mechanism periodically deletes some code bits to increase the coding rate. As shown in Fig. 1, the encoder processes the input sequence of bits to produce the systematic sequence and parity sequences, where. All of these sequences are binary bits, but the modulated sequences and are generally assumed to have the value of following the equations: (1) where. The general structure of the iterative turbo decoder is shown in Fig. 2, where the superscript represents an interleaved sequence [1]. Two component decoders are linked by interleavers in a feedback structure. Component decoders are usually called SISO decoders as they deal with the probabilities of the inputs instead of hard decisions. For an encoder input bit, the soft input and output of the SISO decoder are typically represented in terms of the so-called log likelihood ratio (LLR) given by Each SISO receives the systematic channel bits, which are modeled as the sum of modulated bits and channel noise with variance, parity bits, similarly the sum of and channel noise, and the extrinsic information coming from the other SISO as additional a priori information. As the feedback iterations go on, the bit-error-rate (BER) performance of the decoded bits improves due to the extrinsic information. Since the improvement obtained with additional iterations decreases as the number of iterations increases, four to eight iterations are usually used. (2) where is the logarithm of a priori probability, which is equal to the extrinsic LLR output ) obtained from the previous SISO, and is the channel reliability measure given as. The forward path metric is computed recursively as In the Max-Log-MAP algorithm, the operation is approximated as the real maximum operation, which causes some BER performance degradation. However, the Log-MAP algorithm calculates the operation exactly by adding a correction term as follows: The correction function can be implemented with a small lookup table (LUT). Similarly, the backward metric is calculated as The logarithm of a posteriori probability for each input bit is then obtained by and the extrinsic information to be fed to the next SISO by (3) (4) (5) (6) (7) (8)

3 SHIN AND PARK: SIMD PROCESSOR-BASED TURBO DECODER SUPPORTING MULTIPLE THIRD-GENERATION WIRELESS STANDARDS 803 Fig. 2. Iterative turbo decoder structure. TABLE I DIFFERENCES BETWEEN THE cdma2000 AND W-CDMA TURBO CODES III. TURBO CODES IN 3G STANDARDS We discuss here the similarities and differences of the turbo codes of the W-CDMA [2] and cdma2000 [3], which are shown in Table I, and why a multistandard turbo decoder requires a programmable interleaver. Fig. 1 shows the turbo encoder architecture of cdma2000. Rate-1/2, -1/3, -1/4, and -1/5 turbo codes are realized with appropriate puncturing patterns. For the W-CDMA, the code rate is fixed to 1/3 and the encoder is obtained simply by removing the path to and from Fig. 1. The code rates other than 1/3 can be implemented by applying the rate-matching process [2]. The constraint length of the RSC code for W-CDMA and cdma2000 is four, and the number of states is eight. The RSC decoder of the W-CDMA is the same as that of cdma2000 when the rate is 1/3. Since the turbo codes work in a frame-wise fashion, the trellises of the two constituent encoders need to be terminated at the end of a frame. For both cdma2000 and W-CDMA systems, turbo codes are terminated in a similar way. The dotted lines in Fig. 1 are active during the trellis termination. Tail bits come from the shift registers in order to bring the trellises back to the all-zero state. Because the contents of the shift registers are different in each constituent encoder, the tail bit sequences are different and the systematic code bit of the second encoder should be transmitted. The cdma2000 termination sequence varies depending on the rate, and the termination sequence of W-CDMA is the same as that of the rate-1/2 cdma2000 turbo code. We can conclude that a SISO compatible to both standards can be implemented without much difficulty, as the RSC code of W-CDMA is actually a subset of cdma2000. Though the turbo decoders for cdma2000 and W-CDMA share the constituent encoder structure, they have significant differences in the interleaver implementation, because their interleaver frame size, individual operations, and other parameters used in generating interleaved addresses are quite different. Compared with W-CDMA, the cdma2000 interleaver is much simpler and suitable for hardware implementation. On the other hand, the W-CDMA interleaver is more complicated and not straightforward to implement in hardware. The detailed descriptions of both interleavers are given in Section VI. Due to those differences, only a fully programmable processor is appropriate for a unified interleaver used for multiple standards. IV. TURBO DECODER ARCHITECTURE As shown in Fig. 3, the proposed turbo decoder employs hybrid implementation of hardware and software. By combining a hardware SISO decoder and a programmable processor, the performance of hardware and the flexibility of software can be achieved together. It has the simplest time-multiplex architecture [8] that contains only one SISO, one interleaver, and one extrinsic LLR ( ) memory, which are repeatedly used for both the first and the second SISO decoding processes in one iteration. The data are accessed in a sequential order for the first SISO decoding of an iteration and in an interleaved order for the second decoding. Since the write address is the same as the

4 804 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 15, NO. 7, JULY 2007 Fig. 3. Block diagram of the proposed turbo decoder. read address: 1) the content in memory is always sequential; 2) the read address can be buffered in order to use it later as the write address; and 3) the decoder contains no deinterleaver but only one interleaver. As shown by the dotted lines in Fig. 3, the SIMD processor calculates the interleaved addresses of various standards, and the address queue whose length is equal to the SISO latency saves the read addresses in order to use them again as the write addresses. During the first SISO decoding process of an iteration, which does not need an interleaver, the processor controls the hardware blocks, interfaces with external host, and processes the trellis termination and a stopping criterion. The processor stops the decoding iteration to reduce power consumption based on the cyclic redundancy check (CRC) or a simple stopping criterion presented in [10], which indicates whether or not the further iterations may improve BER. A. SIMD Processor To keep pace with the hardware SISO decoder, parallel processing is indispensable in the design of a programmable interleaver. A simple SIMD architecture depicted in Fig. 4 is chosen, as it is suitable for the simple and repetitive interleaved address generation and has simpler control and lower power consumption than VLIW or superscalar architectures. The number of processing elements (PEs) of the SIMD processor is set to five because the number of rows in the W-CDMA interleaver, which is the unit of interleaved address generation, is 5, 10, or 20. That of cdma2000 varies between 18 and 24 if we discard the rows that always produce invalid addresses. The processor employs Harvard architecture, i.e., its instruction memory and data memory are separated, and has a separate I/O port. Thus the processor can fetch an instruction, load data, and send an interleaved address through the I/O port at the same time. The bit widths of instructions and data are all 16, and the processor has four pipeline stages: IF (instruction fetch), ID (instruction decode), EX (execution), and MEM (memory operation). The write back is carried out at the end of the MEM stage. The branch delay is one cycle and the processor always executes the instruction in a branch delay slot. The first PE, PE0, plays the role of controlling the other PEs and functions as a complete scalar processor. The scalar instructions running only on PE0 include control instructions such as branch and call and multicycle operations such as multiply, divide, and remainder. Basic arithmetic/logic instructions and a few customized instructions for interleaving, which are described in detail in Section V, can be performed by all of the PEs. All of the common register files of the five PEs form a five-element vector register file of 16 entries, as shown in Fig. 4. PE0 has an additional 16-entry scalar register files to store nonvector data or special control data. The five PEs share a single data memory port and a single I/O port in order to save memory access power consumption and support simple I/O interface. For this reason, a SIMD instruction is not executed in all PEs at the same time, but executed in one PE after another so that a data memory port and an I/O port can be shared in a time-multiplexed fashion. An example of this situation is shown in Fig. 5. The rectangles filled with light gray represent the data memory instructions and the dark gray the I/O instructions. The figure shows that those instructions share a port without any conflict. B. SISO Decoder As shown in Fig. 6, the architecture of the proposed SISO decoder is similar to the memory architecture presented in [11], where the window size. Input data are read in a memory and used three times for calculating forward metric s, backward metric s, and extrinsic LLR s. The SISO decoder contains one additional memory for temporarily storing the computed s and an additional section of add-compare-select-add (ACSA) units to calculate dummy backward metrics for the sliding window algorithm [11]. To support multiple standards, the SISO decoder employs configurable ACSA units shown in Fig. 7. The ACSA can support the rate of 1/2 to 1/5 turbo codes with arbitrary transfer functions by configuring the input multiplexers in Fig. 7. For branch metric calculation, inputs are added in a carry save adder (CSA) to reduce timing delay of multiple additions. Generally, the Log-MAP algorithm outperforms the Max- Log-MAP algorithm [8]. However, Worm et al. demonstrated in [12] that the Max-Log-MAP is more tolerant to the channel estimation error than the Log-MAP algorithm. The ACSA unit is therefore designed to select one of the two algorithms with the last multiplexer in Fig. 7. If we can obtain reliable channel estimation such as received signal strength indication (RSSI), we can choose the Log-MAP algorithm for better performance. On the other hand, if nothing is known about the channel, we should use the Max-Log-MAP to avoid error caused by the misestimation of channel. The bit width of data and internal operations are determined based on the approach presented in [13]. Let us use the notation of the fixed-point representation as, where and represent the total bit width and the bit width of the fractional part, respectively. Based on the result of a performance simulation, we selected (5, 2) for inputs and (7, 2) for at the cost of slight performance degradation instead of (6, 3) and (8, 3), respectively, which produce almost ideal results. By choosing the smaller widths, the power and area of the buffer memory were

SHIN AND PARK: SIMD PROCESSOR-BASED TURBO DECODER SUPPORTING MULTIPLE THIRD-GENERATION WIRELESS STANDARDS 805 Fig. 4. Architecture of the SIMD processor. Fig. 5.

6%, and, in addition, the size and speed of the LUT for Log-MAP was improved dramatically.

TURBO INTERLEAVER IMPLEMENTATION Here, we describe common features of 3G standard turbo interleavers and present the proposed interleaved address generation algorithm and the

5 SHIN AND PARK: SIMD PROCESSOR-BASED TURBO DECODER SUPPORTING MULTIPLE THIRD-GENERATION WIRELESS STANDARDS 805 Fig. 4. Architecture of the SIMD processor. Fig. 5. Execution timing of the processing elements. Fig. 7. ASCA unit used to calculate a forward metric A (s). Fig. 6. Overall architecture of the SISO decoder. saved by 15.6%, and, in addition, the size and speed of the LUT for Log-MAP was improved dramatically. Given the bit width of inputs, we can induce the required bit width of internal calculations, which is ten in this case [13]. V. TURBO INTERLEAVER IMPLEMENTATION Here, we describe common features of 3G standard turbo interleavers and present the proposed interleaved address generation algorithm and the specialized instructions to implement these multiple turbo interleavers with the proposed SIMD processor. A. Prunable Interleavers Standard turbo codes commonly employ block interleavers. Although the complicated operations and parameters used in interleaving are quite different, W-CDMA and cdma2000 share the general concept of prunable block interleavers presented in

6 806 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 15, NO. 7, JULY 2007 Fig. 8. Prunable interleaver with N =18. (a) Incoming data indexes. (b) Intrarow permutation. (c) Interrow permutation. [14]. They can implement interleavers of arbitrary size by first building the mother interleaver of a predefined size and then pruning unnecessary indexes. Data are written in a two-dimensional (2-D) mother interleaver matrix row by row and permuted within each row. After the rows are also permuted one another, the interleaved data are read out column by column. An example of a simple prunable interleaver with is shown in Fig. 8, where the data indexes are written in a matrix form. The intrarow permutation rule applied to Fig. 8(b) is where is the permuted index of the th row and th column, base, and increment. Fig. 8(c) shows the interrow permutation result, which will be read out column by column as a sequence of since the elements exceeding the range of interest, 18 and 19, are pruned. If the interleaver size changes by more than the granularity of the mother interleavers, the entire interleaver structure should be reconstructed. All of the 3G mobile communication standards support variable bit rates that may change the interleaver size on a 10-ms or 20-ms frame basis, and W-CDMA even supports multiple separately coded transport channels. This means the interleaver structure may change at every 10 ms. Therefore, an efficient method to change the interleaver structure in a short period of time is essential to the turbo decoding of 3G communications. B. Preprocessing and On-the-Fly Generation Generating the whole interleaved address pattern at once is time-consuming and requires a large memory to store the pattern. We propose a solution to this problem by dividing the interleaver construction into two parts: preprocessing for interleaving and incremental on-the-fly address generation. When the bit rate changes, only the preprocessing is performed to prepare a relatively small number of seed parameters and variables selected to make the on-the-fly generation as simple as possible. Whenever the interleaved address sequence is required, the processor generates it column by column using the parameters and variables. This dividing technique reduces the timing overhead when the interleaver structure changes. It also does not require a large memory because it does not save the whole interleaved address pattern, but the minimal seed data to calculate it. This scheme exploits the properties of prunable interleavers. Let us explain our approach with the example of Fig. 8. In order to remove the computationally expensive multiplication and modulo operations that take a lot of clock cycles from (9), (9) Fig. 9. Proposed implementation applied to Fig. 8. (a) Preprocessing for interleaving. (b) Incremental on-the-fly address generation. we use an incremental vector w of and a cumulative vector of as follows: Then, (9) can be rewritten as and can be obtained recursively as instead of q (10) (11) (12) where and. since and. Then, (12) is equivalent to if otherwise (13) where the multiplication and modulo operations are replaced by cheaper operations, the multiplication by an addition, and the modulo by a subtract if greater or equal (SUBGE) instruction described in Section V-C. As shown in Fig. 9(a), b, w, and for the first column of the block interleaver are calculated and stored in vector registers of the SIMD processor in the preprocessing for interleaving. The figure shows that they are stored in the order of interrow permutation such as in advance so as to simplify the on-the-fly generation. In the column-by-column on-the-fly address generation shown in Fig. 9(b), the SIMD processor updates according to (13) and calculates the addresses based on (11). Calculated addresses are sent to the address queue, if they are smaller than. C. SIMD Instructions for Turbo Interleavers To speed up the interleaved address generation, we introduced three SIMD processor instructions: STOLT (store to output port if less than), SUBGE (subtract if greater or equal), and LOOP. Each of them substitutes a sequence of three ordinary instructions but takes only one clock cycle to execute.

7 SHIN AND PARK: SIMD PROCESSOR-BASED TURBO DECODER SUPPORTING MULTIPLE THIRD-GENERATION WIRELESS STANDARDS 807 The STOLT instruction is to send an address output to the queue only if the generated address is in the range of the interleaver size, which is the same as the pruning mentioned earlier. It is equivalent to a sequence of three instructions: CMP (compare), BLT (branch if less than), and STO (store to output port). Another conditional instruction SUBGE, which is equivalent to the sequence of CMP, BGE (branch if greater or equal), and SUB (subtract), is quite useful for the block interleavers that commonly use modulo operations. It substitutes a modulo or remainder operation if the condition is satisfied, which corresponds to (13). The LOOP instruction is adopted from DSP processors to reduce the loop overhead. This instruction conforms to the sequence of CMP, BNE (branch if not equal), and SUB instructions, which at once decrements the loop count and branches. Using the special instructions, we can reduce the lengths of the on-the-fly generation program loop of W-CDMA and cdma2000 to six and five instructions, respectively. This indicates that the five PEs can provide one address per cycle for cdma2000 turbo decoding. In addition to the three special instructions, there are a few instructions necessary to implement interleavers. The MUL (multiply) and REM (remainder) operations, which take several cycles to complete, are required in the preprocessing of the W-CDMA interleaver. The MASK instruction that disables one or more of PEs is convenient when the column length of the block interleaver is not a multiple of five. VI. APPLICATION TO STANDARDS We have applied the technique described in Section V to W-CDMA [2] and cdma2000 [3] turbo interleavers. We applied the proposed turbo decoder and turbo interleaver implementation to a few other standard turbo codes. Multistandard turbo decoding can be realized by loading several interleaver programs and control programs on the memory and switch them whenever needed. A. W-CDMA The W-CDMA turbo interleaver is quite similar to the example of Fig. 8 but much more complex. A prime number and its associated primitive root are involved in the whole interleaver generation process [2]. First, the number of rows of the block interleaver matrix, the prime number, and the number of columns are determined. is determined as 5, 10, or 20, then as the minimum prime number such that, and as or, depending on the range of the given interleaver frame size. The interrow permutation rule is also determined by and the table in [2]. The intrarow permutation rule of the most case is where (14) (15) and the prime integer is a least prime integer such that g.c.d, and for each Fig. 10. R = 5. Pseudocode of the W-CDMA on-the-fly address generation when. For a more detailed description of the W-CDMA interleaver, readers are referred to [2]. We can divide the complicated process into a preprocessing part and an on-the-fly generation such that no timing overhead exists in changing the frame rate. The computationally expensive modulo operations required in the on-the fly part are substituted by the SUBGE instructions. In the preprocessing part,, and are found following the method explained above. Then, the permutation function is obtained using the recursive (15) and saved in the data memory, which is the most time-consuming part of the preprocessing. Finally, for the on-the-fly generation, we save the seed variable vectors of length : address base, cumulative variable, and increment value, where. We save instead of in order to replace the modulo operations with SUBGEs in the on-the-fly generation part. Since the W-CDMA allows to be 5, 10, or 20 and the SIMD processor has five PEs, each vector is stored in four separate vector registers when. The pseudocode of the on-the-fly address generation for W-CDMA when is shown in Fig. 10. Each line corresponds to an instruction of the processor code. Line 2 corresponds to SUBGE, line 5 corresponds to STOLT, and line 6 corresponds to LOOP. The SUBGE safely substitutes because the condition is satisfied ( and before they are added). If or 20, lines 1 5 are repeated twice or four times to produce an entire column of the interleaver matrix with five PEs. The actual order of processor instructions are changed from the pseudocode to avoid data dependencies that stall the processor pipeline. B. cdma2000 The cdma2000 turbo interleaver has a dimension of, where is an integer between four and ten. The relation between two neighboring entries in a row is, where is the column index,, and is a row-specific value found in an LUT given in [3]. The rows are then shuffled according to a bit-reversal rule. This procedure is equivalent to the bit-level processing illustrated in Fig. 11 [3]. As one can expect from the figure, the turbo interleaver of the cdma2000 standard is easy to implement in hardware. The preprocessing of the cdma2000 is quite simple. Like the W-CDMA turbo interleaver, we save the address base of

8 808 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 15, NO. 7, JULY 2007 TABLE II SUMMARY OF THE CHIP IMPLEMENTATION Fig. 11. Turbo interleaver generation procedure of cdma2000. Fig. 12. Pseudocode of the cdma2000 on-the-fly address generation. for, address offset, and increment obtained by the LUT into the vector registers of the SIMD processor. If and are all discarded and their spaces are overwritten with, and, respectively. Because the address generated in the th row is always larger than or equal to, the row will always produce an invalid address if we do not discard them. After discarding those rows, the number of remaining rows becomes out of 32 for any of 12 predefined cdma2000 interleavers. The pseudocode of the on-the-fly generation part is shown in Fig. 12 and is almost identical to that of the W-CDMA. C. Other Standards The Forward Link Only (FLO) mobile multicasting standard adopted the turbo code of cdma2000 with a fixed frame size of [15]. The FLO turbo decoder can be implemented in the same way as that of cdma2000. The turbo interleaver used in the Consultative Committee for Space Data Systems (CCSDS) standard [16] also belongs to the category of prunable interleavers. The proposed SIMD processor can implement the interleaver, and only four instructions are enough to form the on-the-fly generation loop. However, the constraint length of CCSDS turbo code is five, which requires a SISO decoder that is twice as large as the other standards. The turbo codes adopted by Digital Video Broadcasting standard of return channel via satellite (DVB-RCS) [17] and IEEE Standard of broadband wireless access (BWA) systems [18] are double binary circular recursive systematic convolutional (CRSC) codes. The double binary encoders split input data into two sequences and introduce nonuniformity by simply shuffling them each other instead of applying the complex nonuniform interleaving to a single data sequence. We can implement a turbo decoder for DVB-RCS and IEEE Standard using the proposed architecture shown in Fig. 3. The control of the data flow from the memory to the SISO decoder will become a little more complex. However, the software interleavers can be very easily implemented by utilizing SUBGE instruction and four-way parallel processing of the proposed SIMD processor. All of the 3G standards adopted convolutional codes as the channel coding as well as turbo codes. Since both the turbo decoders and the Viterbi decoders are based on the similar trellis decoding, most part of the decoders can be shared and reused [19]. The proposed turbo decoder in Fig. 3 can also work as a Viterbi decoder by rearranging data flow by repeatedly using only one group of ACSA units in Fig. 6 and by utilizing the SIMD processor for traceback. VII. EXPERIMENTAL RESULTS We implemented an entire turbo decoder that supports both W-CDMA and cdma2000 1x RTT [3] turbo codes in a m CMOS technology [20]. The summary of the chip characteristics and the layout are given in Table II and Fig. 13, respectively. The maximum operating frequency of 135 MHz at 2.5 V and energy/bit/iteration of 6.89 nj was obtained by measuring 40 working samples on a test machine. The estimated maximum data rate of 5.48 Mb/s indicates that the decoder can easily cover 3G standards whose maximum rate is 2 Mb/s. Table III compares the complexity of the proposed turbo decoder and that of [19], where a turbo decoder is implemented for the W-CDMA standard. As you can see in Table III, our implementation is smaller than that in [19]. Although the work in [19] contains more circuits such as the first interleaver of the W-CDMA receiver and the bit precision of the computation is larger than ours, we can safely claim that the proposed multistandard turbo decoder based on a programmable processor is comparable in size to average hardware turbo decoder implementations. We measured the BER performance of the proposed turbo decoder by a simulation program in C language, assuming an additive white Gaussian noise (AWGN) channel. We used the W-CDMA turbo code, the Log-MAP decoding algorithm, and the CRC stopping criterion. The performances of the proposed decoder and an ideal decoder are shown in Fig. 14. The ideal turbo decoder is implemented with double-precision floating-

SHIN AND PARK: SIMD PROCESSOR-BASED TURBO DECODER SUPPORTING MULTIPLE THIRD-GENERATION WIRELESS STANDARDS 809 Fig. 14. BER performance comparison with the ideal turbo decoder.

TABLE III AREA COMPARISON OF TURBO DECODER IMPLEMENTATIONS rows of Table IV actually have a negligible effect on the system speed.

We obtained the BER curves on the right when the interleaver frame size is 1024 and the maximum number of iterations is six, and the others when the frame size is 5114 and the maximum number of

05 db compared with the ideal case, which is mainly due to the finite-precision effect such as saturation and quantization error.

9 SHIN AND PARK: SIMD PROCESSOR-BASED TURBO DECODER SUPPORTING MULTIPLE THIRD-GENERATION WIRELESS STANDARDS 809 Fig. 14. BER performance comparison with the ideal turbo decoder. TABLE IV CYCLE COUNTS FOR THE MOST CRITICAL INTERLEAVER SIZES OF W-CDMA Fig. 13. Micrograph of the chip. TABLE III AREA COMPARISON OF TURBO DECODER IMPLEMENTATIONS rows of Table IV actually have a negligible effect on the system speed. point arithmetic, and the original Log-MAP algorithm that does not exploit the sliding window technique and the stopping criterion. We obtained the BER curves on the right when the interleaver frame size is 1024 and the maximum number of iterations is six, and the others when the frame size is 5114 and the maximum number of iterations is ten. Fig. 14 shows that the performance degradation of our implementation is within 0.05 db compared with the ideal case, which is mainly due to the finite-precision effect such as saturation and quantization error. As shown in Table IV, the performance of the proposed interleaving algorithm is analyzed in terms of the cycle counts for four critical interleaver sizes of the W-CDMA turbo decoding. For brevity, the simpler and faster cdma2000 interleaver case is omitted. The last column of Table IV shows that the on-the-fly address generation is almost as fast as one address per clock cycle for large interleaver sizes, which imply high bit rates. In addition, the preprocessing time is shorter than the SISO decoding time, which completely hides the preprocessing delay as the preprocessing completes during the first SISO decoding. The second SISO decoding with the interleaved addresses can start as soon as the first SISO decoding finishes, since the SIMD processor is ready for the on-the-fly generation of the new interleaver structure. Since a small interleaver size implies a low bit rate in 3G systems, the relatively larger overheads for small-sized interleavers shown in the first and second VIII. CONCLUSION We have presented a turbo decoder designed for multiple 3G wireless communication standards. It contains a configurable hardware SISO decoder and a 16-b SIMD processor with five PEs and specialized instructions to perform incremental block interleaving. The performance and power efficiency of the hardware and the flexibility of the software are thus achieved together. In addition, a fast incremental software implementation of turbo interleavers has been proposed, which is suitable for use with real-time structure change such as VBR in 3G wireless communications. To hide the timing overhead of interleaver changing, the interleaver generation is split into two parts, preprocessing and incremental on-the-fly generation. The proposed decoder implemented in a m CMOS technology shows 5.48 Mb/s performance, which is sufficient for 3G communication standards. It can decode both of W-CDMA and cdma2000 bit streams by changing the software running on the SIMD processor. The size of the multistandard turbo decoder is comparable to average W-CDMA standard hardware turbo decoders. The proposed software implementation of turbo interleaver running on the SIMD processor is sufficiently fast that the timing overhead of interleaver changing is hidden in most cases, and the rate of the interleaved address generation for the 3G standards is almost one interleaved address per cycle.

[2] 3rd Generation Partnership Project; Technical Specification Group Radio Access Network; Multiplexing and Channel Coding (FDD), 3GPP TS 25.212 v4.6.0, Sep. 2002.

10 810 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 15, NO. 7, JULY 2007 REFERENCES [1] C. Berrou, A. Glavieux, and P. Thitimajshima, Near shannon limit error-correcting coding and decoding: Turbo-codes, in Proc. Int. Conf. Commun., 1993, pp [2] 3rd Generation Partnership Project; Technical Specification Group Radio Access Network; Multiplexing and Channel Coding (FDD), 3GPP TS v4.6.0, Sep [3] Physical Layer Standard for cdma2000 Spread Spectrum Systems, 3GPP2 C.S0002-B,ver. 1.00, Apr [4] J. P. Woodard, Implementation of high rate turbo decoders for third generation mobile communication, Proc. IEE Colloq. Turbo Codes Digit. Broadcasting, pp. 12/1 12/6, [5] U. Walther and G. P. Fettweis, DSP implementation issues for UMTSchannel coding, in Proc. ICASSP, 2000, pp [6] M. Bekooij, J. Dielissen, F. Harmsze, S. Sawitzki, J. Huisken, A. van der Werf, and J. van Meerbergen, Power-efficient application-specific VLIW processor for turbo decoding, in IEEE Int. Solid State Circuits Conf. Dig. Tech. Papers, 2001, pp [7] L. R. Bahl, J. Cocke, F. Jelinek, and J. Raviv, Optimal decoding of linear codes for minimizing symbol error rate, IEEE Trans. Inf. Theory, vol. IT-20, no. 2, pp , Mar [8] P. Robertson, E. Villebrun, and P. Hoeher, A comparison of optimal and sub-optimal MAP decoding algorithms operating in the log domain, in Proc. Int. Conf. Commun., Seattle, WA, 1995, pp [9] J. Vogt, K. Koora, A. Finger, and G. Fettweis, Comparison of different turbo decoder realization for IMT-2000, in Proc. IEEE GLOBECOM, 1999, pp [10] R. Y. Shao, S. Lin, and M. P. C. Fossorier, Two simple stopping criteria for turbo decoding, IEEE Trans. Commun., vol. 47, no. 8, pp , Aug [11] G. Masera, G. Piccinini, M. R. Roch, and M. Zamboni, VLSI architectures for turbo codes, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 7, no. 3, pp , Sep [12] A. Worm, Turbo-decoding without snr estimation, IEEE Commun. Lett., vol. 4, no. 6, pp , Jun [13] G. Montorsi and S. Benedetto, Design of fixed-point iterative decoders for concatenated codes with interleavers, IEEE J. Sel. Areas Commun., vol. 19, no. 5, pp , May [14] M. Eroz and A. R. Hammons Jr., On the design of prunable interleavers for turbo codes, in Proc. Veh. Technol. Conf., Houston, TX, 1999, pp [15] Forward Link Only Air Interface Specification for Terrestrial Mobile Multimedia Multicast, TIA-1099, rev. 6, Aug [16] Recommendation for Space Data System Standards: Telemetry Channel Coding, CCSDS B-6, Oct. 2002, Blue Book. [17] Digital Video Broadcasting (DVB): Interaction Channel for Satellite Distribution Systems, EN , v1.4.1, Sep [18] IEEE Standard for Local and Metropolitan Area Networks, Part 16: Air Interface for Fixed Broadband Wireless Access Systems, IEEE Standard , Oct [19] M. Bickerstaff, D. Garrett, T. Prokop, C. Thomas, B. Widdup, G. Zhou, C. Nicol, and R.-H. Yan, A unified turbo/viterbi channel decoder for 3GPP mobile wireless in 0.18 m CMOS, in IEEE Int. Solid State Circuits Conf. Dig. Tech. Papers, 2002, pp [20] M.-C. Shin and I.-C. Park, A programmable turbo decoder for multiple 3G wireless standards, in IEEE Int. Solid State Circuits Conf. Dig. Tech. Papers, 2003, pp Myoung-Cheol Shin (S 97 M 07) received the B.S., M.S., and Ph.D. degrees in electrical engineering and computer science from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, in 1996, 1998, and 2004, respectively. In 2003, he joined the System LSI Division, Samsung Electronics Company, Ltd., Giheung, Korea, as a Senior Engineer working on VLSI architectures and implementations of 3G wireless communication receivers and mobile digital broadcasting receivers. He holds one patent in this area. In-Cheol Park (S 88 M 92 SM 02) received the B.S. degree in electronic engineering from Seoul National University, Seoul, Korea, in 1986, and the M.S. and Ph.D. degrees in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, in 1988 and 1992, respectively. Since June 1996, he has been an Assistant Professor and is now a Professor with the School of Electrical Engineering and Computer Science, KAIST. Prior to joining KAIST, he was with the IBM T. J. Watson Research Center, Yorktown, NY, from May 1995 to May 1996, where he performed research on high-speed circuit design. His current research interest includes computer-aided design algorithms for high-level synthesis and VLSI architectures for general-purpose microprocessors. Prof. Park was the recipient of the Best Paper Award at ICCD in 1999 and the Best Design Award at ASP-DAC in 1997.

ISSCC 2003 / SESSION 8 / COMMUNICATIONS SIGNAL PROCESSING / PAPER 8.7

ISSCC 2003 / SESSION 8 / COMMUNICATIONS SIGNAL PROCESSING / PAPER 8.7 8.7 A Programmable Turbo Decoder for Multiple 3G Wireless Standards Myoung-Cheol Shin, In-Cheol Park KAIST, Daejeon, Republic of Korea