
From Parallelism Levels to a Multi-ASIP Architecture for Turbo Decoding

Olivier Muller, Member, IEEE, Amer Baghdadi, and Michel Jézéquel, Member, IEEE

Abstract— Emerging digital communication applications and the underlying architectures encounter drastically increasing performance and flexibility requirements. In this paper, we present a novel flexible multiprocessor platform for high-throughput turbo decoding. The proposed platform enables exploiting all parallelism levels of turbo decoding applications to fulfill performance requirements. To fulfill flexibility requirements, the platform is structured around configurable application-specific instruction-set processors (ASIPs) combined with an efficient memory and communication interconnect scheme. The designed ASIP has a single instruction multiple data (SIMD) architecture with a specialized and extensible instruction set and a six-stage pipeline control. The attached memories and communication interfaces enable its integration in multiprocessor architectures. These multiprocessor architectures benefit from the shuffled decoding technique recently introduced in the turbo-decoding field to achieve higher throughput. The major characteristics of the proposed platform are its flexibility and scalability, which make it reusable for all simple and double binary turbo codes of existing and emerging standards. Results obtained for double binary WiMAX turbo codes demonstrate around 250 Mb/s throughput using a 16-ASIP multiprocessor architecture.

Index Terms— Application-specific instruction-set processor (ASIP), Bahl-Cocke-Jelinek-Raviv (BCJR) algorithm, parallel processing, multiprocessor, turbo decoding.

Manuscript received April 04, 2007; revised July 11, 2007, September 05, 2007, and January 04. This work was supported in part by the European Commission through the Network of Excellence in Wireless Communications (NEWCOM). The authors are with the Electronics Department, TELECOM Bretagne, Technopôle Brest Iroise, Brest, France (e-mail: olivier.muller@telecom-bretagne.eu; amer.baghdadi@telecom-bretagne.eu; michel.jezequel@telecom-bretagne.eu).

I. INTRODUCTION

SYSTEMS on chips (SoCs) in the field of digital communication are becoming more and more diversified and complex. In this field, performance requirements, like throughput and error rates, are becoming increasingly severe. To reduce the error rate at a lower signal-to-noise ratio (SNR), i.e., closer to the Shannon limit, turbo (iterative) processing algorithms have emerged [1]. These algorithms, which originally concerned channel coding, are currently being reused over the whole digital communication system, e.g., for equalization, demodulation, synchronization, and multiple-input multiple-output (MIMO) processing. Furthermore, the severe time-to-market constraints and the continuously emerging standards and applications in this field make resorting to new design methodologies and the proposal of a flexible turbo communication platform inevitable. Flexibility can be achieved by the use of programmable/configurable processors rather than application-specific integrated circuits (ASICs). Thus, embedded multiprocessor architectures integrating an adequate communication network-on-chip (NoC) constitute an ultimate solution to preserve flexibility while achieving the required computation and throughput rates.
Algorithm parallelization of turbo decoding has been widely investigated in the last few years, and several implementations have been proposed. Some of these implementations succeeded in achieving high throughput for specific standards with a fully dedicated architecture. High-performance turbo decoders dedicated to 3GPP standards have been implemented in ASIC [2] and in field-programmable gate arrays (FPGAs) [3]. In [4], a new class of turbo codes more suitable for high-throughput implementation is proposed. However, such implementations do not take flexibility and scalability issues into account. Unlike these implementations, others include software and/or reconfigurable parts to achieve the required flexibility, at the price of a lower throughput. This is addressed, for example, in [5] with the XiRISC processor, a reconfigurable processor using an embedded FPGA, or in [6] with a digital signal processor (DSP) integrating dedicated instructions for turbo decoding. Because of their great flexibility, these solutions do not fulfill the performance requirements of all standards (e.g., 150 Mb/s for HomePlug). In fact, the concept of the application-specific instruction-set processor (ASIP) [7] constitutes the appropriate solution for fulfilling the flexibility and performance constraints of emerging and future applications. The use of ASIPs in embedded SoCs is becoming inevitable due to the rapid increase in complexity and flexibility of emerging applications and evolving standards. Two approaches are mainly proposed by EDA vendors for ASIP design. The first approach is based on an environment where the designer can select and configure predefined hardware elements to enhance a predefined basic processor core according to the application needs. User-defined hardware blocks, together with the corresponding instructions, can be added to the processor. This approach was used in a parallel multiprocessor implementation [8]. Despite the advanced heterogeneous communication network that optimizes data transfer and enables a parallel turbo-decoding implementation, the platform lacks performance due to the predefined basic processor core imposed by this approach. In the second approach, the designer has full design freedom thanks to an architecture description language (ADL), which is used to specify the instruction set and the ASIP architecture [9]. In [10], we proposed the first ASIP dedicated to turbo codes using this approach. Thanks to its performance and the multiprocessor template proposed, the solution was able to cover almost all standards while presenting a few limitations [support up to 8-state trellises; instruction-level parallelism (ILP) not fully exploited].

Fig. 1. Turbo decoding: (a) turbo decoder; (b) BCJR SISO; (c) trellis.

Another ASIP based on the same approach, proposed in [11] and [25], resolves these limitations and achieves higher throughput thanks to full exploitation of ILP through a long pipeline. In addition, this ASIP provides support for convolutional codes and integrates interfaces for a multiprocessor platform. However, its long pipeline inhibits the exploitation of the most efficient parallelism for high throughput (component-decoder parallelism, Section III-B2).

In this work, we present an original parallelism classification of turbo decoding applications and directly link the different parallelism levels of the classification to their VLSI implementation techniques and issues in a multi-ASIP platform. An improved ASIP model enabling the support of all parallelism techniques is proposed. It can be configured to decode all simple and double binary turbo codes. Besides the specific arithmetic units that make up this processor model, special care was taken with the memory organization and communication buses. Its architecture facilitates its integration in a multiprocessor scheme, enabling an efficient and flexible implementation of the turbo decoding algorithm.

The rest of this paper is organized as follows. Section II presents the turbo decoding algorithm for a better understanding of subsequent sections. Section III analyzes all parallel processing techniques of turbo decoding and proposes a three-level classification of these techniques. Section IV then details the proposed single instruction multiple data (SIMD) ASIP architecture for turbo decoding, which fully exploits the first level of parallelism. Exploiting the other parallelism levels requires resorting to multi-ASIP architectures. This is illustrated in Section V, where we make use of the second level of parallelism to achieve high throughput with reasonable hardware complexity. Finally, Section VI summarizes the results obtained and concludes this paper.

II. CONVOLUTIONAL TURBO DECODING

In iterative decoding algorithms [12], the underlying turbo principle relies on extrinsic information exchanges and iterative processing between different soft input soft output (SISO) modules. Using input information and a priori extrinsic information, each SISO module computes a posteriori extrinsic information. This a posteriori extrinsic information becomes the a priori information for the other modules and is exchanged via interleaving and deinterleaving processes.

Fig. 2. BCJR computation schemes: (a) forward-backward and (b) butterfly.

Fig. 3. Frame decomposition and sub-block parallelism.

For convolutional turbo codes [1], classically constructed with two convolutional component codes, the SISO modules process the BCJR or forward-backward algorithm [13], which is the optimal algorithm for the maximum a posteriori (MAP) decoding of convolutional codes (see Fig. 1). A BCJR SISO will first compute branch metrics (or γ metrics), where γ_k(s', s) represents the probability of a transition occurring between two trellis states (s': starting state; s: ending state). Note that a branch metric can be decomposed into an intrinsic part, due to systematic information and a priori information, and an extrinsic part, due to redundancy information.
Then a BCJR SISO computes the forward and backward recursions. The forward recursion (or α recursion) computes a trellis section (i.e., the probability of all states of the trellis regarding the kth symbol) using the previous trellis section and the branch metrics between these two sections, while the backward recursion (or β recursion) computes a trellis section using the future trellis section and the branch metrics between these two sections. With the max-log-MAP algorithm [14], this can be expressed as

    α_k(s) = max_{s'} ( α_{k-1}(s') + γ_k(s', s) )      (1)

    β_{k-1}(s') = max_{s} ( β_k(s) + γ_k(s', s) )       (2)

Finally, the extrinsic information of the kth symbol is computed for all decisions d from the forward recursion, the backward recursion, and the extrinsic part γ^ext of the branch metrics:

    Z_k(d) = max_{(s',s) : d(s',s) = d} ( α_{k-1}(s') + γ_k^ext(s', s) + β_k(s) )      (3)
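To make recursion (1) concrete, the following C sketch performs one forward step over an 8-state double binary trellis. The `next_state` table is a hypothetical, code-dependent trellis description (not part of the paper), the max-only update reflects the max-log-MAP approximation, and the backward step of (2) is symmetric.

```c
#include <float.h>

#define NS 8   /* trellis states (8-state double binary code) */
#define ND 4   /* decisions per symbol (double binary)        */

/* Hypothetical trellis table: ending state of the transition
 * leaving state sp with decision d (depends on the code).    */
extern const int next_state[NS][ND];

/* One forward step of (1):
 * alpha_k(s) = max over transitions (sp,d)->s of alpha_{k-1}(sp) + gamma_k(sp,s). */
void forward_step(const float alpha_prev[NS], const float gamma[NS][ND],
                  float alpha[NS])
{
    for (int s = 0; s < NS; s++)
        alpha[s] = -FLT_MAX;                  /* neutral element for max  */

    for (int sp = 0; sp < NS; sp++)           /* add-compare-select (ACS) */
        for (int d = 0; d < ND; d++) {
            int   s = next_state[sp][d];
            float m = alpha_prev[sp] + gamma[sp][d];
            if (m > alpha[s])
                alpha[s] = m;
        }
}
```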

III. PARALLEL PROCESSING LEVELS

In turbo decoding with the BCJR algorithm, parallelism techniques can be classified at three levels: 1) BCJR metric level parallelism; 2) BCJR-SISO decoder level parallelism; and 3) turbo-decoder level parallelism. The first (fine grain) parallelism level concerns symbol elementary computations inside a SISO decoder processing the BCJR algorithm. Parallelism between these SISO decoders, inside one turbo decoder, belongs to the second parallelism level. The third (coarse grain) parallelism level duplicates the turbo decoder itself.

A. BCJR Metric Level Parallelism

The BCJR metric level parallelism concerns the processing of all metrics involved in the decoding of each received symbol inside a BCJR SISO decoder. It exploits the inherent parallelism of the trellis structure, as well as the parallelism of the BCJR computations [15].

1) Parallelism of Trellis Transitions: Trellis-transition parallelism can easily be extracted from the trellis structure, as the same operations are repeated for all transition pairs. In the log domain [14], these operations are either add-compare-select (ACS) operations for the max-log-MAP algorithm or ACSO operations (ACS with a correction offset [14]) for the log-MAP algorithm. Each BCJR computation (1)-(3) requires a number of ACS-like operations equal to half the number of transitions per trellis section. Thus, this number, which depends on the structure of the convolutional code, constitutes the upper bound of the trellis-transition parallelism degree. Furthermore, this parallelism implies low area overhead, as only the ACS units have to be duplicated. In particular, no additional memories are required, since all the parallelized operations are executed on the same trellis section, and consequently on the same data.

2) Parallelism of BCJR Computations: A second metric parallelism can be orthogonally extracted from the BCJR algorithm through a parallel execution of the three BCJR computations. Parallel execution of the backward recursion and of the a posteriori probability (APP) computations was proposed with the original forward-backward scheme, depicted in Fig. 2(a). In this scheme, the BCJR computation parallelism degree is equal to one in the forward part and two in the backward part. To increase this parallelism degree, several schemes have been proposed [16]. Fig. 2(b) shows the butterfly scheme, which doubles the parallelism degree of the original scheme through the parallelism between the forward and backward recursion computations. This is performed without any memory increase, and only the BCJR computation resources have to be duplicated. Thus, BCJR computation parallelism is area efficient but still limited in parallelism degree.

In conclusion, BCJR metric level parallelism achieves optimal area efficiency, as it does not affect memory size, which occupies most of the area in a turbo decoder circuit. Exploiting this level of parallelism is detailed in Section IV. Nevertheless, the parallelism degree is limited by the decoding algorithm and the code structure. Thus, achieving a higher parallelism degree implies exploring higher processing levels.
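The butterfly scheme of Fig. 2(b) can be sketched as follows. The per-step helpers are assumed, not part of the ASIP description: the α and β recursions start from the two ends of a K-symbol window, and once the index pointers have crossed, each recursion meets the stored metrics of the other one, so two extrinsic values can be produced per step.

```c
/* Assumed helpers: one recursion step and one extrinsic emission. */
extern void alpha_step(int k);      /* forward unit,  symbol k       */
extern void beta_step(int k);       /* backward unit, symbol k       */
extern void emit_extrinsic(int k);  /* uses stored opposite metrics  */

/* Butterfly processing of one window of K symbols (K even for
 * simplicity; the middle symbol of an odd window needs special care). */
void butterfly_window(int K)
{
    for (int a = 0, b = K - 1; a < K; a++, b--) {
        alpha_step(a);              /* both recursions advance        */
        beta_step(b);               /* concurrently (two ACS units)   */
        if (a >= K / 2) {           /* pointers have crossed:         */
            emit_extrinsic(a);      /* alpha(a) meets stored beta(a)  */
            emit_extrinsic(b);      /* beta(b) meets stored alpha(b)  */
        }
    }
}
```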
B. BCJR-SISO Decoder Level Parallelism

The second level of parallelism concerns the SISO decoder level. It consists of the use of multiple SISO decoders, each executing the BCJR algorithm and processing a sub-block of the same frame in one of the two interleaving orders. At this level, parallelism can be applied on sub-blocks and/or on component decoders.

1) Sub-Block Parallelism: In sub-block parallelism, each frame is divided into M sub-blocks, and each sub-block is then processed on a BCJR-SISO decoder using adequate initializations [16], [17] (see Fig. 3). A formalism is proposed in [16] to compare various existing sub-block decoding schemes with respect to parallelism degree and memory efficiency. Besides the duplication of BCJR-SISO decoders, this parallelism imposes two other constraints. On the one hand, interleaving has to be parallelized in order to scale the communication bandwidth proportionally [8]. Due to the scrambling property of interleaving, this parallelism can induce communication conflicts, except for the interleavers of emerging standards that are conflict-free. These conflicts force the communication structure to implement conflict management mechanisms and imply a long and variable communication time. This issue is generally addressed by minimizing the interleaving delay with specific communication networks [8], [18]. On the other hand, the BCJR-SISO decoders have to be initialized adequately, either by acquisition or by message passing [17], [19]. The acquisition method estimates recursion metrics over an overlapping region called the acquisition window or prologue: starting from a trellis section where all the states are initialized to a uniform constant, the acquisition window is processed over its length to provide reliable recursion metrics at the sub-block ending points. The message passing method initializes a sub-block with recursion metrics computed during the last iteration in the neighboring sub-blocks; a minimal sketch is given below. In [19], we observed that message passing initialization enables a more efficient decoding, reaching better throughput at comparable hardware complexity. Thus, message passing initialization is mainly considered in the rest of this paper. Regarding the first iteration, the message passing method is undefined and the iterative process starts with a uniform initialization of the sub-block ending states. Alternatively, an initialization by acquisition can slightly improve the convergence of the iterative process, but the resulting gain is usually less than one iteration.

This parallelism is necessary to reach high throughput. Nevertheless, its efficiency for high throughput is strongly reduced, since resolving the initialization issue implies a computation overhead following Amdahl's law (due to acquisition length or additional iterations) [19].
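The following C sketch illustrates message passing initialization under stated assumptions: `bcjr_subblock()` is a hypothetical sub-block decoder, and the frame is circular (tail-biting) so that boundary metrics wrap around; at the first iteration the buffers would simply hold uniform metrics.

```c
#include <string.h>

#define M  8    /* sub-block parallelism degree */
#define NS 8    /* trellis states               */

/* Assumed helper: decodes sub-block m starting from the given boundary
 * metrics and returns its ending (alpha) and beginning (beta) metrics. */
extern void bcjr_subblock(int m,
                          const float alpha_init[NS], const float beta_init[NS],
                          float alpha_end[NS], float beta_begin[NS]);

static float alpha_in[M][NS], beta_in[M][NS];  /* uniform at iteration 0 */

void decode_one_iteration(void)
{
    float alpha_out[M][NS], beta_out[M][NS];

    for (int m = 0; m < M; m++)   /* M BCJR-SISO decoders in parallel */
        bcjr_subblock(m, alpha_in[m], beta_in[m], alpha_out[m], beta_out[m]);

    /* Message passing: ending alpha metrics initialize the right
     * neighbor, beginning beta metrics the left one (circular frame). */
    for (int m = 0; m < M; m++) {
        memcpy(alpha_in[(m + 1) % M], alpha_out[m], sizeof alpha_out[m]);
        memcpy(beta_in[(m + M - 1) % M], beta_out[m], sizeof beta_out[m]);
    }
}
```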

Fig. 4. Turbo decoding: (a) serial and (b) shuffled.

2) Component-Decoder Parallelism: Component-decoder parallelism is a new kind of BCJR-SISO decoder parallelism that has become practical with the introduction of the shuffled decoding technique [20]. The basic idea of the shuffled decoding technique is to execute all component decoders in parallel and to exchange extrinsic information as soon as it is created, so that the component decoders use more reliable a priori information. Thus, the shuffled decoding technique performs decoding (computation time) and interleaving (communication time) fully concurrently, while serial decoding implies waiting for the update of all extrinsic information before starting the next half iteration (see Fig. 4 and the sketch below). Modifying serial decoding to restart processing right after the previous half iteration, in order to save the propagation latency, was studied in [21]; the resulting decoding nevertheless requires additional control mechanisms to avoid consistency conflicts in memories. Since communication time is often considered to be the limiting factor in multiprocessor turbo decoding, saving the propagation latency is a crucial property of shuffled decoding. In addition, by doubling the number of BCJR-SISO decoders, component-decoder parallelism halves the iteration period in comparison with the originally proposed serial turbo decoding. Nevertheless, to preserve error-rate performance with shuffled decoding, an iteration overhead of between 5% and 50% is required, depending on the BCJR computation scheme, the degree of sub-block parallelism, the propagation time, and the interleaving rules [22]. In fact, this overhead decreases with the sub-block parallelism degree [19], while the computation overhead of sub-block parallelism increases. Consequently, at high throughput and comparable complexity, the computation overhead grows more by doubling the sub-block parallelism degree than by using shuffled decoding. Thus, for high throughput, shuffled decoding is more efficient than sub-block parallelism. Simulations demonstrate minor variations of the shuffled decoding overhead for low propagation latencies; above a propagation latency of three times the extrinsic information emission time, the overhead reduces the interest of the shuffled decoding technique. Finally, this level of parallelism presents great potential for scalability and high area efficiency. Exploiting this level of parallelism is detailed in Section V.
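The difference between the two schedules of Fig. 4 can be summarized by the sketch below; the decoder and exchange primitives are assumed helpers, not the ASIP interface. Serial decoding interposes a full interleaving barrier between half iterations, whereas shuffled decoding exchanges each extrinsic value as soon as it is produced.

```c
enum { DEC1, DEC2 };                    /* the two component decoders */

extern void siso_half_iteration(int dec);
extern void interleave_barrier(void);   /* wait for all exchanges     */
extern void siso_step(int dec, int k);  /* decode one symbol          */
extern void send_extrinsic(int dec, int k);

/* Serial schedule of Fig. 4(a). */
void turbo_serial(int iters)
{
    for (int it = 0; it < iters; it++) {
        siso_half_iteration(DEC1); interleave_barrier();
        siso_half_iteration(DEC2); interleave_barrier();
    }
}

/* Shuffled schedule of Fig. 4(b): both decoders run concurrently in
 * hardware; a few extra iterations compensate for the less mature
 * a priori information (5%-50% overhead, [22]).                      */
void turbo_shuffled(int iters, int K)
{
    for (int it = 0; it < iters; it++)
        for (int k = 0; k < K; k++) {
            siso_step(DEC1, k); send_extrinsic(DEC1, k);
            siso_step(DEC2, k); send_extrinsic(DEC2, k);
        }
}
```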
C. Turbo-Decoder Level Parallelism

The highest level of parallelism simply duplicates whole turbo decoders to process iterations and/or frames in parallel. Iteration parallelism occurs in a pipelined fashion with a maximum pipeline depth equal to the iteration number, whereas frame parallelism presents no limitation in parallelism degree. Nevertheless, turbo-decoder level parallelism is too area-expensive (all memories and computation resources are duplicated) and presents no gain in frame decoding latency; for these reasons, it is not considered in this work.

IV. EXPLOITING BCJR METRIC LEVEL PARALLELISM: ASIP FOR BCJR SISO DECODER

A. Context of Architectural Choices

As seen in Section III-A, the BCJR metric level parallelism that occurs inside a BCJR SISO decoder is the most area efficient level of parallelism. Thus, a hardware implementation achieving high throughput should first exploit this parallelism. The complexity of the convolutional turbo codes proposed in all existing and emerging standards is limited to 8-state double binary turbo codes or 16-state simple binary turbo codes. Hence, to fully exploit trellis-transition parallelism (Section III-A1) for all standards, a parallelism degree of 32 is required. Future, more complex codes can be supported by splitting trellis sections into sub-sections with a parallelism degree of 32 and by processing the sub-sections sequentially.

Regarding BCJR computation parallelism (Section III-A2), we choose a parallelism degree of two instead of the maximum of four: using a parallelism degree of four with the butterfly scheme leads to underutilization of the BCJR computation units (two of them would be used only half of the time). These parallelism requirements imply the use of specific hardware units. To implement these units while preserving flexibility, application-specific instruction-set processors constitute the perfect solution [7]. The BCJR SISO decoder should also have adequate communication interfaces in order to handle the inter-sub-block communications in the case of BCJR-SISO decoder parallelism (see Section V).

In this context, in order to implement shuffled decoding efficiently, the propagation time should be less than three emission periods (see Section III-B2). Let T_net be the time required to cross the network, P be the pipeline length between the load stage of the extrinsic information and its store stage, #cycle be the number of clock cycles required by the processor to compute an extrinsic information value, and f be the frequency of the processor. Then

    T_emission = #cycle / f                                  (4)

    T_propagation = T_net + P / f                            (5)

    T_propagation / T_emission = (f · T_net + P) / #cycle    (6)

To preserve a low ratio, and thus make the use of the shuffled decoding technique efficient, we choose a short pipeline that keeps P/#cycle small. A long pipeline inhibits the exploitation of the shuffled decoding technique: for example, the ASIP developed in [11], which can emit one extrinsic information value per cycle (#cycle = 1) with a P of 8, has a ratio greater than 8.

B. Architecture of the ASIP

The presented processor, dedicated to the BCJR algorithm, is an enhanced version of the ASIP proposed in [10].

1) Global View: The ASIP is mainly composed of operative and control parts, besides its communication interfaces and attached memories [see Fig. 5(a)]. The operative part is tailored to process a window of 64 symbols by means of two identical BCJR computation units, corresponding to forward and backward processing in the MAP algorithm. Each unit produces recursion metrics and extrinsic information. The storage of the recursion metrics produced by one unit, to be used by the other unit, is performed in eight cross memories per unit, so the processor integrates 16 internal cross memories in order to provide the adequate bandwidth. Another 96-bit wide internal memory (config) contains up to 256 trellis descriptions, so that the processor can be configured for the corresponding standard. Incoming data, which group the systematic and redundant information from the channel in addition to extrinsic information, are stored in external memories attached to the ASIP (input data, info ext). The input data memory has a 32-bit width to contain up to four 8-bit channel information data (systematic or redundant). The info ext memory has a 64-bit width to contain up to four 16-bit extrinsic information data, since four extrinsic information data are required by double binary codes. Depending on the application's requirements, the depth of the incoming data memories can be scaled to cover the frame-length specifications of all existing and emerging standards. The external future and past memory banks are used to initialize the state metric values for the beginning and end of each window according to the message passing method. Each bank has two 128-bit wide memories, one storing forward recursions and the other backward recursions.
These initialization memories are used as follows: 1) at the beginning of the decoding, the state metric registers are either set according to the available information about the state, or reset so that all state metrics have equal probability; 2) after computations (e.g., acquisition, if programmed, or recursion), the state metrics obtained for the beginning and end of each window can be stored in a memory bank in order to be used for initialization at the next iteration. The depth of these memories can be scaled to the number of windows required, with a maximum of 1024. For the kth window associated with the processor, initialization metrics are read from the forward and backward past memories at the addresses corresponding to the neighboring windows, and the refined state metrics are then stored back in the forward and backward past memories at the addresses of the window itself. The future memory bank is only accessible at address 0, for the state metrics of the end of the last window associated with the processor: in this case, the backward initialization metrics are read from the backward future memory at address 0 and the forward metrics are stored in the forward future memory at address 0. For all the external memories, memory latencies of one cycle in read/write access have been integrated in the ASIP pipeline.

2) BCJR Computation Unit: Each BCJR computation unit is based on a single instruction multiple data (SIMD) architecture in order to exploit trellis-transition parallelism. Thus, 32 adder nodes (one per transition) and eight max nodes are incorporated in each unit [see Fig. 5(b)]. The 32 adder nodes are organized as a 4×8 processing matrix. In this organization, for an 8-state double binary code, the row and column of an adder node correspond respectively to the considered symbol decision and to the ending state of the associated transition. For a 16-state simple binary code, transitions with ending states 0 to 7 are mapped on matrix nodes of row 0, if the transition bit decision is 0, or matrix nodes of row 1, if the transition bit decision is 1, whereas states 8 to 15 are mapped on nodes of rows 2 and 3. An adder node [see Fig. 5(c)] contains one adder, multiplexers, one register for configuration (RT), and an output register (RADD). It supports the addition required in a recursion between a state metric (coming from the state metric register bank RMC) and a branch metric (coming from the branch metric register bank RG), and also the addition required in extrinsic information generation, since it can accumulate the previous result with the state metric of the other recursion coming from the register bank RC. The max nodes [see Fig. 5(d)] are shared in the processing matrix so that the max operations can be performed on the RADD registers either row-wise or column-wise, depending on the ASIP instructions. A max node contains three max operators connected in a tree, which makes it possible to perform either one four-input maximum (using the three operators) or two two-input maximums. Results are stored either in the first rows or columns of the RADD matrix or in the RMC bank to achieve the recursion computation.
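A software view of one recursion step on this matrix organization may help. The sketch below mimics the behavior of an add instruction in metric mode (32 parallel additions) followed by a column-wise max, for the 8-state double binary mapping; the `prev_state` table and the register naming are illustrative, not the exact hardware datapath.

```c
#include <float.h>

#define ROWS 4   /* symbol decisions (double binary) */
#define COLS 8   /* ending states                    */

extern const int prev_state[ROWS][COLS]; /* code-dependent trellis table */

static float RMC[COLS];          /* state metric registers        */
static float RG[ROWS][COLS];     /* branch metric registers       */
static float RADD[ROWS][COLS];   /* adder-node output registers   */

/* One recursion step: "add m" on the 4x8 matrix, then column-wise max. */
void add_m_then_max(void)
{
    for (int d = 0; d < ROWS; d++)        /* 32 additions, performed    */
        for (int s = 0; s < COLS; s++)    /* in parallel in hardware    */
            RADD[d][s] = RMC[prev_state[d][s]] + RG[d][s];

    for (int s = 0; s < COLS; s++) {      /* shared max nodes, column-wise */
        float m = -FLT_MAX;
        for (int d = 0; d < ROWS; d++)
            if (RADD[d][s] > m)
                m = RADD[d][s];
        RMC[s] = m;                       /* updated state metrics      */
    }
}
```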

Fig. 5. (a) ASIP architecture. (b) BCJR computation unit. (c) Adder node. (d) Max node. (e) Control unit.

The BCJR computation unit also contains a GLOBAL arithmetic logic unit (ALU), which computes extrinsic information, hard decisions, and other global processing, and a branch metric (BM) generator, which performs the branch metric calculation from the extrinsic information register bank (RIE) and from the channel information available in the pipeline registers (PR). The BM generator supports cyclic puncturing patterns with a maximum pattern length of 16. The pattern length is configurable in a 4-bit register, while the puncturing patterns associated with the four channel information data are configurable through four 16-bit registers, in which each zero corresponds to a punctured bit. The BM generator supports code rates up to 1 for both double binary and simple binary codes.

3) Pipeline Strategy: The ASIP control part is based on a six-stage pipeline [see Fig. 5(e)]. The first two stages (FE, DC) fetch instructions from the program memory and decode them. Then, depending on the instruction type, the operand fetch (OPF) stage loads data from the input data memory to the pipeline registers PR, and/or data from the extrinsic information memory to the RIE registers, and/or data from the past/future memories to the RMC registers, and/or the configuration into the RT registers. In comparison to [10], a BM stage has been added to the pipeline in order to anticipate the calculation of the branch metrics performed in the BM generator, to increase the clock frequency of the ASIP, and to improve the number of cycles per symbol. The execute (EX) stage performs the rest of the BCJR operations. This choice reduces the performance of the ASIP, since the architecture does not fully exploit ILP. However, it was intentionally made to keep the pipeline length as short as possible in order to efficiently support the shuffled decoding technique. Hence, extrinsic information can cross the pipeline from the OPF to the store (ST) stage in only four cycles (see Section IV-A).

Fig. 6. Butterfly ZOL mechanism.

4) Control Structure: The control part also requires several dedicated control registers. The window size is fixed in the register R SIZE, and the current processed symbol inside BCJR computation unit A (respectively, BCJR computation unit B) is stored in the pipeline register ADDRESS A (respectively, ADDRESS B). These addresses, as well as the program counter and the corresponding instruction, are then pipelined. To correctly access the incoming data memories and the past/future memories, the processor has a 10-bit WINDOW ID register that identifies the window being computed and a 10-bit R SUB BLOCK register that sets the number of windows processed by the ASIP. Thus, one ASIP can process up to 1024 windows. In addition, the control architecture provides branch mechanisms and a zero overhead loop (ZOL) fully dedicated to the butterfly scheme (see Section III-A2). To keep the ASIP instruction set compact, the ZOL mechanism is tightly coupled with address generation (see Fig. 6). Thus, the first loop is performed while the address of the symbol processed by unit A is smaller than the address of the symbol processed by unit B. In the case of an odd window size, the middle symbol is processed by unit A when both addresses are equal. Finally, the second loop is performed while the address of the symbol processed by unit B is positive.
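This coupling between the ZOL mechanism and address generation can be paraphrased in C as follows; the loop bodies stand for the instructions enclosed by the ZOL markers, and the helpers are placeholders rather than actual instructions.

```c
extern void first_loop_body(int addrA, int addrB);   /* recursions only    */
extern void second_loop_body(int addrA, int addrB);  /* + extrinsic output */

/* Butterfly ZOL of Fig. 6: unit A walks up from 0, unit B walks down
 * from r_size-1; loop exits are driven by the address comparisons.   */
void butterfly_zol(int r_size)
{
    int addrA = 0, addrB = r_size - 1;

    while (addrA < addrB) {          /* first loop                     */
        first_loop_body(addrA, addrB);
        addrA++; addrB--;
    }
    if (addrA == addrB) {            /* odd window: middle symbol on A */
        first_loop_body(addrA, addrB);
        addrA++; addrB--;
    }
    while (addrB >= 0) {             /* second loop                    */
        second_loop_body(addrA, addrB);
        addrA++; addrB--;
    }
}
```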

Fig. 7. Assembly programs for: (a) WiMAX double binary turbo code and (b) 3GPP simple binary turbo code.

C. ASIP Instruction Set

The designed instruction set of our ASIP architecture is coded on 16 bits. The basic version contains 30 instructions that perform the basic operations of the MAP algorithm. To increase performance, the ASIP was extended with compacted instructions that can perform several operations in different pipeline stages within a single instruction. The following paragraphs detail the instructions required to perform simple decoding. These instructions are divided into three classes: control, operative, and IO.

1) Control: As mentioned previously, the butterfly ZOL instruction repeats the two loops of the butterfly scheme R SIZE times. It requires three markers to retain the relative addresses of the first-loop end instruction, the second-loop begin instruction, and the second-loop end instruction. An unconditional branch instruction has also been designed and uses the direct addressing mode. The SET SIZE instruction sets the ASIP window size, up to a maximum of 64 symbols. SET WINDOW ID and SET SUB BLOCK NB are used to set the WINDOW ID and R SUB BLOCK registers. Thus, the processor manages up to 65 536 symbols (1024 windows of 64 symbols).

2) Operative MAP: An add instruction is defined and used in two different modes: metric computation (add m) and extrinsic information computation (add i). According to the add mode and the configuration registers (RT), each processing node selects the desired operands to perform the addition and stores the result in the corresponding RADD register. In the same way, max1 and max2 instructions are defined with the same modes as the add instruction. The max1 instruction performs only one comparison-selection (two outputs per max node), while the max2 instruction cascades comparison-selection operations (one output per max node). These instructions have to be repeated as often as necessary to obtain either the extrinsic information or the recursion metrics at the considered address in the sub-block. The basic instruction set also contains the DECISION instruction to produce hard decisions on the processed symbols.

3) IO: The basic instruction set also provides input and output instructions. With these instructions, parallel multi-accesses are executed in order to: load decoder input data (LD DATA), input recursion metrics (LD REC), and the configuration (LD CONFIG); store output recursion metrics (ST REC); handle internal cross metrics between the two BCJR computation units (LD CROSS, LD ST); and send extrinsic information packets and hard decisions (ST EXT, DEC). We choose to group extrinsic information in packets for efficient IO operations. Each packet can contain a packet header and the extrinsic information of the current symbol (up to four 16-bit data in the case of double binary codes). The header typically contains the processed symbol address, including the WINDOW ID and the local address (cf. Section V-B).
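A possible C view of such a packet is given below; the field names and widths are illustrative, not the exact hardware format described in the paper.

```c
#include <stdint.h>

/* Illustrative layout of an extrinsic information packet sent by
 * ST EXT: the header carries the processed symbol address (WINDOW ID
 * plus local address inside the 64-symbol window), which the network
 * interface translates through the interleaver for routing.         */
typedef struct {
    uint16_t window_id;   /* 10-bit WINDOW ID of the source window    */
    uint16_t local_addr;  /* symbol address inside the window (0..63) */
    int16_t  ext[4];      /* up to four 16-bit extrinsic values       */
} ext_packet_t;           /* (four are used for double binary codes)  */
```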
D. Application Examples

Fig. 7 gives the ASIP assembly programs required: (a) to decode a 48-symbol sub-block with the turbo code used in the WiMAX standard and (b) to decode a 40-bit sub-block with the turbo code used in the 3GPP standard. In both cases, the first instructions load the required configuration (LD CONFIG) and initialize the recursion metrics (LD REC). Then the butterfly loops are initialized using the ZOLB instruction. The first loop (two instructions) only computes the state metrics: two max operations (max2 instruction) are required for the double binary code, whereas only one max operation (max1 instruction) is required for the simple binary code. The second loop (five instructions) computes, in addition to the state metrics, the extrinsic information for the eight-state code (using three max operations). Finally, the ASIP exports the sub-block ending metrics (ST REC) and the program branches back to the first instruction of the butterfly. Regarding the execution time, with a sub-block of S symbols, about 2 × S/2 = S cycles are needed in the first loop of the butterfly scheme and about 5 × S/2 = 2.5S cycles in the second loop, so about 3.5S cycles are needed to process the S symbols of the sub-block. Thus, roughly 3.5 cycles are needed per symbol (3.5 cycles/bit in simple binary mode and 1.75 cycles/bit in double binary mode).
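These cycle counts translate directly into per-ASIP throughput. The short program below reproduces the arithmetic at the 400-MHz clock frequency reported in Section IV-E; the printed values (114.3 and 228.6 Mb/s) approximate the 114 and 228 Mb/s figures quoted there.

```c
#include <stdio.h>

int main(void)
{
    const double f_mhz = 400.0;           /* ASIP clock (Section IV-E)  */
    const double cycles_per_symbol = 3.5; /* from the loop counts above */

    /* bits per symbol: 1 (simple binary), 2 (double binary) */
    printf("simple binary: %.1f Mb/s\n", f_mhz * 1.0 / cycles_per_symbol);
    printf("double binary: %.1f Mb/s\n", f_mhz * 2.0 / cycles_per_symbol);
    return 0;
}
```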

E. Implementation Results

In this paper, we use the Processor Designer framework from CoWare [23]. Processor Designer is based on the LISA ADL [9], which allows the automatic generation of ASIP models (VHDL, Verilog, and SystemC) for hardware synthesis and system integration, in addition to the generation of the underlying software development tools. Using this tool, a VHDL description of the ASIP architecture was generated. It was synthesized with Synopsys Design Compiler using an ST 90-nm ASIC library under worst case conditions (0.9 V, 105 °C). The optimized ASIP presents a maximum clock frequency of 400 MHz and occupies about 63.1 KGates (equivalent). Compared with the previous ASIP [10], the clock frequency is improved by 20% and the area is decreased by 35%. The presented ASIP can process 228 Mb/s in double binary mode and 114 Mb/s in simple binary mode.

Note that, as a future extension, the performance of the simple binary mode can be significantly improved when the code rate of the component codes is greater than or equal to one half (valid in most standards) by compacting the trellis [24]. Under this condition, two consecutive stages of the simple binary trellis can be compacted into one stage of a new double binary trellis without error-rate degradation. With this new double binary trellis configuration, a simple binary code can also be decoded at 228 Mb/s. Trellis compaction requires a different soft decision management implying extra operations. This feature is not implemented in the current ASIP architecture; however, it can be supported with minor modifications of the LD DATA and ST EXT instructions. The LD DATA instruction should handle the soft input bit-to-symbol conversion (new add operators in the BM stage) and the ST EXT instruction should handle the soft output symbol-to-bit conversion (new max operators in the ST stage). These elementary operators do not change the maximum clock frequency, since they introduce their own non-critical path. Thus, negligible hardware overhead is induced without any degradation in throughput. Besides, as the packet generated by the ASIP then contains two binary soft decisions, the network interface has to split the processor packet into two network packets to perform interleaving.

TABLE I. Comparison of different turbo decoding implementations for UMTS.

Table I compares the performance results of log-MAP turbo decoding implementations for the UMTS turbo code. We can observe that the designed ASIP has excellent throughput performance, thanks to a number of cycles per bit per SISO close to one; one is reached only by fully dedicated hardware implementations. Compared to [11], our ASIP presents a slightly lower throughput for an almost similar area (63 versus 56 KGates) and with a shorter pipeline depth (6 versus 11 stages), in order to make shuffled decoding possible. With the trellis compaction extension, which reveals the real potential of the proposed architecture for decoding simple binary codes, the ASIP can reach a slightly better throughput despite the use of a 90-nm target technology. Furthermore, thanks to its dedicated past/future memories, our processor can skip the acquisition phases efficiently and without degradation; the figures of Table I do not integrate this acquisition computation overhead, which can rise to around 15% in [25].

V. EXPLOITING BCJR-SISO DECODER LEVEL PARALLELISM: MULTI-ASIP PLATFORM

The ASIP presented in the previous section fully exploits the first parallelism level (BCJR metric level parallelism, Section III-A) by efficiently performing all the computations of a BCJR-SISO decoder. In order to exploit the second parallelism level (BCJR-SISO decoder level, Section III-B), a multi-ASIP architecture is required.

Fig. 8. Extrinsic information exchanges in BCJR-SISO decoder level parallelism.
A. Multi-ASIP Turbo Decoding Overview

Sub-block parallelism implies the use of one BCJR-SISO decoder, e.g., our ASIP, for each sub-block. The state metrics of a sub-block can then be initialized using the message passing technique (see Section III-B2) through the state metric interfaces of the ASIP. Component-decoder parallelism implies the use of at least one ASIP for each component decoder, where the ASIPs are executed in parallel and exchange extrinsic information concurrently (shuffled decoding). Fig. 8 illustrates the architecture template required to exploit both kinds of parallelism. Besides the multiple ASIP integration, this figure shows the need for dedicated communication structures to accommodate the massive information exchanges between the ASIPs.

Fig. 9. ASIP-based multiprocessor architecture for turbo decoding.

Regarding the communication interfaces, the developed ASIP incorporates the required state metric interfaces and the extrinsic information ones. On the other hand, the interleaving of extrinsic information has to be handled by the communication structure. As seen in Section IV-A, efficient shuffled decoding imposes a low propagation-to-emission ratio (less than three). The proposed ASIP architecture already leads to a low ratio; thus, to preserve shuffled decoding efficiency, the communication structure has to ensure a short propagation time, which can be qualified using (6).

B. Communication Structures

In order to illustrate how we implement the required communication structures, Fig. 9 presents a four-ASIP turbo decoder architecture where each component decoder is implemented using two ASIPs. This figure shows the three kinds of networks that are used: the data interface network, the state metric network, and the extrinsic information network. First, the data interface network is used to dispatch new channel data from the frame memory of the IO interface to the local input data memories of the ASIPs and, concurrently, to gather output data from the ASIPs. Second, the state metric network enables exchanges between neighboring ASIPs in a component decoder. These exchanges are mandatory to initialize sub-blocks with the message passing technique. As seen in Section IV-B1, an ASIP accesses the initialization values of the beginning and end of its sub-block at address 0 of its past/future memories. In the case of full sub-block parallelism (i.e., no windowing), these memories can be replaced by buffers, and the state metric network consists of a set of buffers between neighboring processors, reflecting the trellis termination strategy. Thus, a circular trellis termination strategy, i.e., one where the ending and beginning states of the frame are identical, implies the use of a buffer between the first and last ASIPs (see Fig. 9). Finally, the extrinsic information network is based on routers to make extrinsic information exchanges possible between the ASIPs. As the proposed ASIP supports the butterfly scheme, two packets can be sent on this network per emission and per ASIP. The packet headers generated by the ASIP are used by the network interfaces (NIs) to perform interleaving: each NI regenerates a new header with the corresponding routing information, as sketched below. The routers integrate buffering mechanisms and support up to two input ports and two output ports. Fig. 9 presents a simple topology supporting four ASIPs; architectures with more than four processors require more complex topologies. It is worth noting that these networks take advantage of packet switching communication [26]. In [18], we proposed the use of multistage interconnection networks on chip based on Butterfly and Benes topologies. These topologies are scalable and present the required bandwidth and latencies with a reasonable hardware complexity. Even if their scalability is restricted to numbers of input ports that are powers of two, they keep the mean propagation latency under control, as it grows with the logarithm of the network size. This valuable property contributes significantly to fulfilling the shuffled decoding requirements.
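The header translation performed by an NI can be sketched as follows. The interleaver table `pi`, the number of windows per ASIP, and the address split are illustrative assumptions, not the actual NI implementation.

```c
#include <stdint.h>

#define WIN     64   /* symbols per window                        */
#define SUBBLK   4   /* windows per ASIP (illustrative)           */

extern const uint32_t pi[];   /* interleaver table of the standard */

typedef struct {
    uint16_t dest_asip;       /* routing information for the NoC   */
    uint16_t window_id;
    uint16_t local_addr;
} route_t;

/* Translate a packet header (source symbol address) into the
 * destination header used by the routers.                         */
route_t ni_translate(uint16_t window_id, uint16_t local_addr)
{
    uint32_t src = (uint32_t)window_id * WIN + local_addr;
    uint32_t dst = pi[src];                 /* interleaved address */
    route_t  r;

    r.dest_asip  = (uint16_t)(dst / (WIN * SUBBLK));
    r.window_id  = (uint16_t)(dst / WIN);
    r.local_addr = (uint16_t)(dst % WIN);
    return r;
}
```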
C. Results

With the conventional turbo decoding technique, the achievable throughput of a multiprocessor architecture does not increase linearly with the number of processors, especially when this number is high [2], [8]. This degradation is mainly due to the interleaving, the IO communication delays, and the sub-block parallelism [19]. The use of the shuffled decoding technique limits this degradation. Thus, the throughput of the proposed ASIP-based multiprocessor architecture depends on the number of integrated ASIPs and on the shuffled decoding efficiency (see Section III-B2). The low-latency extrinsic information network (see Section V-B) and the short ASIP pipeline (see Section IV-B3) guarantee a high shuffled decoding efficiency. As an example, Table II summarizes the multiprocessor turbo decoding performance for the WiMAX double binary code. The results are compared at the error-rate performance level obtained with five iterations without BCJR-SISO decoder parallelism.
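A back-of-the-envelope model of this scaling, under simplifying assumptions (two shuffled component decoders sharing the P ASIPs, 3.5 cycles per symbol, seven shuffled iterations, a 400-MHz clock, and communication fully hidden behind computation), lands close to the Table II figures; the residual gap is due to IO and pipeline overheads.

```c
#include <stdio.h>

int main(void)
{
    const double f_hz = 400e6, cycles_per_symbol = 3.5;
    const int    n_sym = 752, n_bits = 1504, iterations = 7;

    for (int p = 2; p <= 16; p *= 2) {
        /* p/2 ASIPs per component decoder, each owning n_sym/(p/2) symbols */
        double sym_per_asip = (double)n_sym / (p / 2);
        double cycles       = iterations * sym_per_asip * cycles_per_symbol;
        printf("%2d ASIPs: %6.1f Mb/s\n", p, n_bits * f_hz / cycles / 1e6);
    }
    return 0;
}
```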

TABLE II. Performance for WiMAX double binary turbo decoding, N = 1504 bits (752 symbols), ST 90 nm.

The table provides the latency of the extrinsic information network using a Butterfly topology with a maximum clock frequency of 600 MHz. We can note that the latency requirement is always respected, inducing efficient shuffled decoding. Thus, shuffled decoding requires seven iterations in this application example, whatever the number of ASIPs. Observing the throughput results with respect to the degree of BCJR-SISO parallelism shows a more linear increase than in the literature [2], [8], although the block size (an important parameter of throughput linearity [19]) is smaller. This observation is explained by a halved sub-block parallelism degree (thanks to shuffled decoding) that minimizes the sub-block parallelism degradation. Table II also shows that exploiting the diverse parallelism levels of turbo decoding induces a reasonable overall area overhead (including memories, networks, and ASIPs) while achieving outstanding throughput rates. Note that the overall area is mainly dominated by the logic when the number of ASIPs increases; Table II illustrates how the memory area decreases down to 13% of the overall area. For comparison, with 16 SISOs and five iterations, the multi-ASIP architecture in [8] achieves a lower throughput at a clock frequency of 133 MHz with a 180-nm technology, and the dedicated ASIC in [2] achieves 340 Mb/s at a clock frequency of 256 MHz with a 130-nm technology. Even with technology rescaling, our flexible platform (249 Mb/s at 400 MHz, 90 nm) comes closer to the full custom design performance.

VI. CONCLUSION

In order to meet the flexibility and performance constraints of current and future digital communication applications, multiple application-specific instruction-set processors combined with dedicated communication and memory infrastructures are required. This paper provides a clear and detailed bridge between a three-level classification of turbo decoding parallelism techniques and the associated VLSI implementations. We show how a multi-ASIP platform has been derived from this classification to enable flexible high-throughput turbo decoding. The ASIP has an SIMD architecture dedicated to the first level of this classification, a specialized and extensible instruction set, and a six-stage pipeline control. It can process 228 Mb/s in double binary mode and 114 Mb/s in simple binary mode for an occupied area of 63.1 KGates. A future extension with negligible hardware overhead is also proposed in order to double the throughput in simple binary mode. The memory architecture and communication interfaces allow for the efficient assembling of multiple ASIP cores. Considering the second parallelism level, we have illustrated how ASIPs can be aggregated in a multiprocessor platform to process different sub-blocks and component decoders in parallel. The proposed ASIP-based multiprocessor architecture breaks the interleaving bottleneck thanks to the shuffled decoding technique and allows a high throughput while preserving flexibility and scalability. The presented platform supports the turbo codes of all existing and emerging standards. Results obtained for WiMAX turbo decoding with five iterations demonstrate around 250 Mb/s throughput using a 16-ASIP multiprocessor architecture. We are now planning to extend the proposed multi-ASIP platform to other digital communication applications.

REFERENCES
[1] C. Berrou, A. Glavieux, and P. Thitimajshima, "Near Shannon limit error-correcting coding and decoding: Turbo-codes," presented at the Int. Conf. Commun. (ICC), Geneva, Switzerland, May 1993.
[2] G. Prescher, T. Gemmeke, and T. Noll, "A parametrizable low-power high-throughput turbo-decoder," in Proc. ICASSP, Mar. 2005.
[3] Xilinx, San Jose, CA, "3GPP Turbo Decoder v3.1."
[4] D. Gnaëdig, E. Boutillon, M. Jézéquel, V. Gaudet, and G. Gulak, "On multiple slice turbo codes," in Proc. Int. Symp. Turbo Codes Related Topics, Brest, France, Sep. 2003.
[5] A. La Rosa, C. Passerone, F. Gregoretti, and L. Lavagno, "Implementation of a UMTS turbo-decoder on a dynamically reconfigurable platform," presented at the Des., Autom. Test Eur. (DATE) Conf., Paris, France, Feb. 2004.
[6] R. Kothandaraman and M. J. Lopez, "An efficient implementation of turbo decoder on ADI TigerSHARC TS201 DSP," in Proc. SPCOM, Dec. 2004.
[7] A. Orailoglu and A. Veidenbaum, "Application specific microprocessors (Guest Editors' Introduction)," IEEE Des. Test Comput., Jan./Feb. 2003.
[8] F. Gilbert, M. Thul, and N. Wehn, "Communication centric architectures for turbo-decoding on embedded multiprocessors," in Proc. Des., Autom. Test Eur. (DATE) Conf., Munich, Germany, Mar. 2003.
[9] A. Hoffmann, O. Schliebusch, A. Nohl, G. Braun, and H. Meyr, "A methodology for the design of application specific instruction set processors (ASIP) using the machine description language LISA," presented at the ICCAD, San Jose, CA, Nov. 2001.
[10] O. Muller, A. Baghdadi, and M. Jézéquel, "ASIP-based multiprocessor SoC design for simple and double binary turbo decoding," in Proc. Des., Autom. Test Eur. (DATE) Conf., Munich, Germany, Mar. 2006.
[11] T. Vogt and N. Wehn, "A reconfigurable application specific instruction set processor for Viterbi and log-MAP decoding," in Proc. IEEE Workshop Signal Process. Syst. (SiPS), Banff, Canada, Oct. 2006.
[12] J. Hagenauer, "The turbo principle: Tutorial introduction and state of the art," in Proc. Int. Symp. Turbo Codes Related Topics, Brest, France, Sep. 1997.

[13] L. Bahl, J. Cocke, F. Jelinek, and J. Raviv, "Optimal decoding of linear codes for minimizing symbol error rate," IEEE Trans. Inf. Theory, vol. IT-20, no. 2, pp. 284-287, Mar. 1974.
[14] P. Robertson, P. Hoeher, and E. Villebrun, "Optimal and sub-optimal maximum a posteriori algorithms suitable for turbo decoding," Eur. Trans. Telecommun. (ETT), vol. 8, no. 2, pp. 119-125, 1997.
[15] G. Masera, G. Piccinini, M. R. Roch, and M. Zamboni, "VLSI architectures for turbo codes," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 7, no. 3, pp. 369-379, Sep. 1999.
[16] Y. Zhang and K. K. Parhi, "Parallel turbo decoding," in Proc. Int. Symp. Circuits Syst., May 2004, vol. 2, pp. II-509-II-512.
[17] A. Abbasfar and K. Yao, "An efficient architecture for high speed turbo decoders," in Proc. ICASSP, Apr. 2003, pp. IV-521-IV-524.
[18] H. Moussa, O. Muller, A. Baghdadi, and M. Jézéquel, "Butterfly and Benes-based on-chip communication networks for multiprocessor turbo decoding," presented at the Des., Autom. Test Eur. (DATE) Conf., Nice, France, Apr. 2007.
[19] O. Muller, A. Baghdadi, and M. Jézéquel, "Exploring parallel processing levels for convolutional turbo decoding," in Proc. ICTTA, Apr. 2006.
[20] J. Zhang and M. P. C. Fossorier, "Shuffled iterative decoding," IEEE Trans. Commun., vol. 53, no. 2, pp. 209-213, Feb. 2005.
[21] D. Gnaedig, E. Boutillon, J. Tousch, and M. Jézéquel, "Towards an optimal parallel decoding of turbo codes," presented at the 4th Int. Symp. Turbo Codes Related Topics, Munich, Germany, Apr. 2006.
[22] O. Muller, A. Baghdadi, and M. Jézéquel, "On the parallelism of convolutional turbo decoding and interleaving interference," in Proc. IEEE Global Telecommun. Conf. (GLOBECOM), Nov. 2006.
[23] CoWare Inc. homepage. [Online]. Available: http://www.coware.com/
[24] G. Fettweis and H. Meyr, "Parallel Viterbi algorithm implementation: Breaking the ACS-bottleneck," IEEE Trans. Commun., vol. 37, no. 8, pp. 785-790, Aug. 1989.
[25] T. Vogt, C. Neeb, and N. Wehn, "A reconfigurable multi-processor platform for convolutional and turbo decoding," presented at the ReCoSoC, Montpellier, France, 2006.
[26] L. Benini and G. De Micheli, "Networks on chips: A new SoC paradigm," IEEE Computer, vol. 35, no. 1, pp. 70-78, Jan. 2002.

Olivier Muller (M'06) received the engineering (M.S.) and Ph.D. degrees in telecommunications and electrical engineering from the École Nationale Supérieure des Télécommunications de Bretagne (TELECOM Bretagne), Brest, France, in 2004 and 2007, respectively. In 2003, he worked on co-design with Motorola, Toulouse, France. He is currently a Postdoctoral Researcher with the Electronics Department, TELECOM Bretagne, Brest, France. His research interests include multiprocessor architectures, application-specific processors, on-chip networks, digital communication algorithms, and information theory.

Amer Baghdadi received the electronic engineering and M.S. degrees and the Ph.D. degree in microelectronics from the Institut National Polytechnique de Grenoble (INPG), Grenoble, France, in 1998, 1998, and 2002, respectively. He has been an Associate Professor with the Electronics Department, TELECOM Bretagne, Brest, France, since December 2003. In 2002, he was an Assistant Professor with the INPG while continuing his research activities with the System-Level Synthesis Group, TIMA Laboratory.
His research interests include system-on-chip architectures and design methodology, especially the design and exploration of application-specific multiprocessor architectures, performance estimation, and on-chip communication architectures. Recently, his research activities have targeted multiprocessor and network-on-chip architecture design for digital communication applications. Dr. Baghdadi was nominated for a Best Paper Award at the 2001 DATE Conference for his work on the design automation of application-specific multiprocessor SoCs. He serves on the technical program committees of the RSP, ICTTA, and DATE conferences.

Michel Jézéquel (M'02) was born in Saint Renan, France, on February 26. He received the Ingénieur degree in electronics from the École Nationale Supérieure de l'Électronique et de ses Applications, Paris, France. He was then a Design Engineer with CIT ALCATEL, Lannion, France. After an experience in a small company, he followed a one-year course on software design. In 1988, he joined the École Nationale Supérieure des Télécommunications de Bretagne, Brest, France, where he is currently a Professor and Head of the Electronics Department. His main research interest is circuit design for digital communications. He focuses his activities on turbo codes, the adaptation of the turbo principle to the iterative correction of intersymbol interference, the design of interleavers, and the interaction between modulation and error-correcting codes.

12 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS 1 From Parallelism Levels to a Multi-ASIP Architecture for Turbo Decoding Olivier Muller, Member, IEEE, Amer Baghdadi, and Michel Jézéquel, Member, IEEE Abstract Emerging digital communication applications and the underlying architectures encounter drastically increasing performance and flexibility requirements. In this paper, we present a novel flexible multiprocessor platform for high throughput turbo decoding. The proposed platform enables exploiting all parallelism levels of turbo decoding applications to fulfill performance requirements. In order to fulfill flexibility requirements, the platform is structured around configurable application-specific instruction-set processors (ASIP) combined with an efficient memory and communication interconnect scheme. The designed ASIP has an single instruction multiple data (SIMD) architecture with a specialized and extensible instruction-set and 6-stages pipeline control. The attached memories and communication interfaces enable its integration in multiprocessor architectures. These multiprocessor architectures benefit from the recent shuffled decoding technique introduced in the turbo-decoding field to achieve higher throughput. The major characteristics of the proposed platform are its flexibility and scalability which make it reusable for all simple and double binary turbo codes of existing and emerging standards. Results obtained for double binary WiMAX turbo codes demonstrate around 250 Mb/s throughput using 16-ASIP multiprocessor architecture. Index Terms application-specific instruction-set processor (ASIP), <PLEASE DEFINE BCJR.> BCJR, parallel processing, multiprocessor, turbo decoding. I. INTRODUCTION SYSTEMS on chips (SoCs) in the field of digital communication are becoming more and more diversified and complex. In this field, performance requirements, like throughput and error rates, are becoming increasingly severe. To reduce the error rate with a lower signal-to-noise ratio (SNR) (closer to the Shannon limit), turbo (iterative) processing algorithms have recently emerged [1]. These algorithms, which originally concerned channel coding, are currently being reused over the whole digital communication system, like for equalization, demodulation, synchronization, and multiple-input multiple-output (MIMO). Furthermore, the severe time-to-market constraints and the continuously developing new standards and applications in this digital communication, make resorting to new design methodologies and the proposal of a flexible turbo communication Manuscript received April 04, 2007; revised July 11, 2007, September 05, 2007, and January 04, This work has been supported in part by the European Commission through the Network of Excellence in Wireless Communications (NEWCOM). The authors are with the Electronics Department, TELECOM Bretagne, Technopôle Brest Iroise, Brest, France ( olivier. muller@telecom-bretagne.eu; amer.baghdadi@telecom-bretagne.eu; michel.jezequel@telecom-bretagne.eu). Digital Object Identifier /TVLSI platform inevitable. Flexibility could be achieved by the use of programmable/configurable processors rather than application-specific integrated circuits (ASICs). Thus, embedded multiprocessor architectures integrating an adequate communication network-on-chip (NoC) will constitute an ultimate solution to preserve flexibility while achieving the required computation and throughput rates. 
Algorithm parallelization of turbo decoding has been widely investigated over the last few years and several implementations have been proposed. Some of these implementations succeeded in achieving high throughput for specific standards with a fully dedicated architecture. High-performance turbo decoders dedicated to 3GPP standards have been implemented in ASIC [2] and in field-programmable gate arrays (FPGAs) [3]. In [4], a new class of turbo codes more suitable for high-throughput implementation is proposed. However, such implementations do not take flexibility and scalability issues into account. Unlike these implementations, others include software and/or reconfigurable parts to achieve the required flexibility, at the cost of lower throughput. This is addressed, for example, in [5] with the XiRISC processor, a reconfigurable processor using an embedded FPGA, or in [6] with a digital signal processor (DSP) integrating dedicated instructions for turbo decoding. The price of this great flexibility, however, is that these solutions do not fulfill the performance requirements of all standards (e.g., 150 Mb/s for HomePlug). In fact, the concept of the application-specific instruction-set processor (ASIP) [7] constitutes the appropriate solution for fulfilling the flexibility and performance constraints of emerging and future applications. The use of ASIPs in embedded SoCs is becoming inevitable due to the rapid increase in complexity and flexibility of emerging applications and evolving standards. Two approaches are mainly proposed by EDA vendors for ASIP design. The first approach is based on an environment where the designer can select and configure predefined hardware elements to enhance a predefined basic processor core according to the application needs. User-defined hardware blocks, together with the corresponding instructions, can be added to the processor. This approach was used in a parallel multiprocessor implementation [8]. Despite the advanced heterogeneous communication network that optimizes data transfers and enables a parallel turbo-decoding implementation, the platform lacks performance due to the predefined basic processor core imposed by this approach. In the second approach, the designer has full design freedom thanks to an architecture description language (ADL), which is used to specify the instruction set and the ASIP architecture [9]. In [10], we proposed the first ASIP dedicated to turbo codes using this approach. Thanks to its performance and to the multiprocessor template proposed, the solution was able to cover almost all standards, while presenting a few limitations (support limited to 8-state trellises; instruction-level parallelism (ILP) not fully exploited).

Fig. 1. Turbo decoding: (a) turbo decoder; (b) BCJR SISO; (c) trellis.

Another ASIP based on the same approach, proposed in [11] and [25], resolves these limitations and achieves higher throughput thanks to a full exploitation of ILP through a long pipeline. In addition, this ASIP provides support for convolutional codes and integrates interfaces for a multiprocessor platform. However, its long pipeline inhibits the exploitation of the most efficient parallelism for high throughput (component-decoder parallelism, Section III-B2). In this work, we present an original parallelism classification for turbo decoding applications and directly link the different parallelism levels of this classification to their VLSI implementation techniques and issues in a multi-ASIP platform. An improved ASIP model enabling the support of all these parallelism techniques is proposed. It can be configured to decode all simple and double binary turbo codes. Besides the specific arithmetic units that make up this processor model, special care was taken with the memory organization and the communication buses. Its architecture facilitates its integration in a multiprocessor scheme, enabling an efficient and flexible implementation of the turbo decoding algorithm. The rest of this paper is organized as follows. Section II presents the turbo decoding algorithm for a better understanding of the subsequent sections. Section III analyzes all parallel processing techniques of turbo decoding and proposes a three-level classification of these techniques. Section IV then details the proposed single instruction multiple data (SIMD) ASIP architecture for turbo decoding, which fully exploits the first level of parallelism. Exploiting the other parallelism levels requires resorting to multi-ASIP architectures. This is illustrated in Section V, where we make use of the second level of parallelism, which achieves high throughput with reasonable hardware complexity. Finally, Section VI summarizes the results obtained and concludes this paper.

II. CONVOLUTIONAL TURBO DECODING

In iterative decoding algorithms [12], the underlying turbo principle relies on extrinsic information exchanges and iterative processing between different soft-input soft-output (SISO) modules. Using input information and a priori extrinsic information, each SISO module computes a posteriori extrinsic information.

Fig. 2. BCJR computation schemes: (a) forward-backward and (b) butterfly.
Fig. 3. Frame decomposition and sub-block parallelism.

This a posteriori extrinsic information becomes the a priori information for the other modules and is exchanged via interleaving and deinterleaving processes. For convolutional turbo codes [1], classically constructed with two convolutional component codes, the SISO modules process the BCJR or forward-backward algorithm [13], which is the optimal algorithm for the maximum a posteriori (MAP) decoding of convolutional codes (see Fig. 1). A BCJR SISO will thus first compute the branch metrics (or $\gamma$ metrics), each representing the probability of a transition occurring between two trellis states ($s'$: starting state; $s$: ending state). Note that a branch metric can be decomposed into an intrinsic part, due to the systematic information and the a priori information, and an extrinsic part, due to the redundancy information.
Then a BCJR SISO computes the forward and backward recursions. The forward recursion (or $\alpha$ recursion) computes a trellis section (i.e., the probability of all states of the trellis regarding the $k$th symbol) using the previous trellis section and the branch metrics between these two sections, while the backward recursion (or $\beta$ recursion) computes a trellis section using the future trellis section and the branch metrics between these two sections. With the max-log-MAP algorithm [14], these recursions can be expressed as

$$\alpha_k(s) = \max_{s'} \left[ \alpha_{k-1}(s') + \gamma_k(s', s) \right] \quad (1)$$

$$\beta_k(s') = \max_{s} \left[ \beta_{k+1}(s) + \gamma_{k+1}(s', s) \right] \quad (2)$$

Finally, the extrinsic information of the $k$th symbol is computed for every decision $d$ from the forward recursion, the backward recursion, and the extrinsic part $\gamma^{e}$ of the branch metrics:

$$Z_k(d) = \max_{(s',s)\,:\,d(s',s)=d} \left[ \alpha_{k-1}(s') + \gamma_k^{e}(s', s) + \beta_k(s) \right]. \quad (3)$$
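To make the data flow of (1)-(3) concrete, the following is a minimal illustrative Python sketch of one max-log-MAP SISO pass over a sub-block. The trellis representation (a list of transitions with starting state, ending state, and decision) and all function names are our own assumptions for the example, not the paper's implementation.

```python
import numpy as np

def max_log_map_siso(gamma, gamma_ext, transitions, n_states, n_decisions):
    """One max-log-MAP pass. gamma[k][t]: branch metric of transition t at
    symbol k; gamma_ext[k][t]: its extrinsic part. Returns extrinsic Z[k][d]."""
    K = len(gamma)
    alpha = np.full((K + 1, n_states), -np.inf); alpha[0] = 0.0  # uniform init
    beta = np.full((K + 1, n_states), -np.inf); beta[K] = 0.0
    for k in range(K):                        # forward recursion (1)
        for t, (sp, sn, _) in enumerate(transitions):
            alpha[k + 1, sn] = max(alpha[k + 1, sn], alpha[k, sp] + gamma[k][t])
    for k in range(K - 1, -1, -1):            # backward recursion (2)
        for t, (sp, sn, _) in enumerate(transitions):
            beta[k, sp] = max(beta[k, sp], beta[k + 1, sn] + gamma[k][t])
    Z = np.full((K, n_decisions), -np.inf)    # extrinsic information (3)
    for k in range(K):
        for t, (sp, sn, d) in enumerate(transitions):
            Z[k, d] = max(Z[k, d], alpha[k, sp] + gamma_ext[k][t] + beta[k + 1, sn])
    return Z

# Toy usage: 2-state code, 4 transitions per section, 3 symbols, binary decisions.
T = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
g = np.random.default_rng(0).normal(size=(3, 4))
Z = max_log_map_siso(g, 0.5 * g, T, n_states=2, n_decisions=2)
```

In hardware, the inner loops over the transitions collapse into the parallel ACS operations discussed in Section III-A.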

III. PARALLEL PROCESSING LEVELS

In turbo decoding with the BCJR (Bahl-Cocke-Jelinek-Raviv) algorithm, parallelism techniques can be classified at three levels: 1) BCJR metric level parallelism; 2) BCJR SISO decoder level parallelism; and 3) turbo-decoder level parallelism. The first (fine-grain) parallelism level concerns the elementary symbol computations inside a SISO decoder processing the BCJR algorithm. Parallelism between these SISO decoders, inside one turbo decoder, belongs to the second parallelism level. The third (coarse-grain) parallelism level duplicates the turbo decoder itself.

A. BCJR Metric Level Parallelism

The BCJR metric level parallelism concerns the processing of all the metrics involved in the decoding of each received symbol inside a BCJR SISO decoder. It exploits the inherent parallelism of the trellis structure, and also the parallelism of the BCJR computations [15].

1) Parallelism of Trellis Transitions: Trellis-transition parallelism can easily be extracted from the trellis structure, as the same operations are repeated for all transition pairs (a vectorized sketch of these parallel operations is given at the end of Section III-A). In the log domain [14], these operations are either add-compare-select (ACS) operations for the max-log-MAP algorithm or ACSO operations (ACS with a correction offset [14]) for the log-MAP algorithm. Each BCJR computation (1)-(3) requires a number of ACS-like operations equal to half the number of transitions per trellis section. Thus, this number, which depends on the structure of the convolutional code, constitutes the upper bound of the trellis-transition parallelism degree. Furthermore, this parallelism implies a low area overhead, as only the ACS units have to be duplicated. In particular, no additional memories are required, since all the parallelized operations are executed on the same trellis section, and consequently on the same data.

2) Parallelism of BCJR Computations: A second metric-level parallelism can be orthogonally extracted from the BCJR algorithm through a parallel execution of the three BCJR computations. Parallel execution of the backward recursion and of the a posteriori probability (APP) computations was proposed with the original forward-backward scheme, depicted in Fig. 2(a). In this scheme, the BCJR computation parallelism degree is equal to one in the forward part and two in the backward part. To increase this parallelism degree, several schemes have been proposed [16]. Fig. 2(b) shows the butterfly scheme, which doubles the parallelism degree of the original scheme through parallelism between the forward and backward recursion computations. This is performed without any memory increase; only the BCJR computation resources have to be duplicated. Thus, BCJR computation parallelism is area efficient but limited in parallelism degree.

In conclusion, BCJR metric level parallelism achieves optimal area efficiency, as it does not affect the memory size, which occupies most of the area in a turbo decoder circuit. Exploiting this level of parallelism is detailed in Section IV. Nevertheless, the parallelism degree is limited by the decoding algorithm and the code structure. Achieving a higher parallelism degree thus implies exploring higher processing levels.
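As a toy illustration of trellis-transition parallelism, this sketch (again an assumption-laden model, not the ASIP datapath) evaluates one forward step of (1) for all transitions of a section at once and reduces per ending state with a compare-select:

```python
import numpy as np

def acs_step(alpha_prev, gamma_k, src, dst, n_states):
    """One forward step: all transition additions happen element-wise
    (SIMD-like), then one compare-select per ending state, i.e., the
    ACS operation of the max-log-MAP algorithm."""
    cand = alpha_prev[src] + gamma_k            # all transition adds in parallel
    alpha_next = np.full(n_states, -np.inf)
    np.maximum.at(alpha_next, dst, cand)        # compare-select per ending state
    return alpha_next

# Example: 8 states, 32 transitions (double binary case) with dummy metrics.
rng = np.random.default_rng(0)
src = np.repeat(np.arange(8), 4)                # hypothetical connectivity
dst = rng.integers(0, 8, size=32)
alpha = acs_step(np.zeros(8), rng.normal(size=32), src, dst, n_states=8)
```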
B. BCJR-SISO Decoder Level Parallelism

The second level of parallelism concerns the SISO decoder level. It consists of the use of multiple SISO decoders, each executing the BCJR algorithm and processing a sub-block of the same frame in one of the two interleaving orders. At this level, parallelism can be applied either on sub-blocks and/or on component decoders.

1) Sub-Block Parallelism: In sub-block parallelism, each frame is divided into M sub-blocks, and each sub-block is then processed by a BCJR-SISO decoder using adequate initializations [16], [17]. A formalism is proposed in [16] to compare various existing sub-block decoding schemes in terms of parallelism degree and memory efficiency. Besides the duplication of BCJR-SISO decoders, this parallelism imposes two other constraints. On the one hand, interleaving has to be parallelized in order to scale the communication bandwidth proportionally [8]. Due to the scrambling property of interleaving, this parallelism can induce communication conflicts, except for the interleavers of emerging standards that are conflict-free. These conflicts force the communication structure to implement conflict-management mechanisms and imply a long and variable communication time. This issue is generally addressed by minimizing the interleaving delay with specific communication networks [8], [18]. On the other hand, BCJR-SISO decoders have to be initialized adequately, either by acquisition or by message passing [17], [19]. The acquisition method estimates the recursion metrics using an overlapping region called the acquisition window or prologue. Starting from a trellis section where all the states are initialized to a uniform constant, the acquisition window is processed over its length to provide reliable recursion metrics at the sub-block ending points. The message passing method initializes a sub-block with recursion metrics computed during the last iteration in the neighboring sub-blocks; a behavioral sketch of this exchange is given at the end of this subsection. In [19], we observed that message passing initialization enables more efficient decoding, reaching better throughput at comparable hardware complexity. Thus, message passing initialization is mainly considered in the rest of this paper. Regarding the first iteration, the message passing method is undefined and the iterative process starts with a uniform initialization of the sub-block ending states. Alternatively, an initialization by acquisition can slightly improve the convergence of the iterative process, but the resulting gain is usually less than one iteration. This parallelism is necessary to reach high throughput. Nevertheless, its efficiency at high throughput is strongly reduced, since resolving the initialization issue implies a computation overhead following Amdahl's law (due to acquisition length or additional iterations) [19].
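The following sketch illustrates message-passing initialization between sub-blocks; the decoder stub and the bookkeeping of edge metrics are hypothetical stand-ins for the past/future memory mechanism detailed in Section IV-B.

```python
import numpy as np

def bcjr_stub(gamma_sub, alpha_init, beta_init):
    """Stand-in for a BCJR-SISO pass over one sub-block: returns the edge state
    metrics the real decoder would refine (dummies derived from the inputs)."""
    return alpha_init + gamma_sub.sum(), beta_init + gamma_sub.sum()

def message_passing_iteration(frame_gamma, M, n_states, boundaries=None):
    """One pass over a frame split into M sub-blocks. boundaries[m] holds the
    (alpha at sub-block start, beta at sub-block end) metrics saved during the
    previous iteration, playing the role of the past/future memories of Sec. IV-B."""
    K = len(frame_gamma) // M
    edges = []
    for m in range(M):
        a0, bK = boundaries[m] if boundaries else (np.zeros(n_states),) * 2
        edges.append(bcjr_stub(frame_gamma[m * K:(m + 1) * K], a0, bK))
    # Message passing: each sub-block is re-initialized with its neighbors' edges.
    return [(edges[m - 1][0] if m > 0 else np.zeros(n_states),
             edges[m + 1][1] if m < M - 1 else np.zeros(n_states))
            for m in range(M)]

b = None
for _ in range(3):   # the first iteration starts from uniform metrics (b is None)
    b = message_passing_iteration(np.ones(64), M=4, n_states=8, boundaries=b)
```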

Fig. 4. Turbo decoding: (a) serial and (b) shuffled.

2) Component-Decoder Parallelism: Component-decoder parallelism is a new kind of BCJR-SISO decoder parallelism that has become practical with the introduction of the shuffled decoding technique [20]. The basic idea of the shuffled decoding technique is to execute all component decoders in parallel and to exchange extrinsic information as soon as it is created, so that the component decoders use more reliable a priori information. Thus, the shuffled decoding technique performs decoding (computation time) and interleaving (communication time) fully concurrently, while serial decoding implies waiting for the update of all extrinsic information before starting the next half-iteration (see Fig. 4; a schedule sketch is given at the end of Section III). Modifying serial decoding to restart processing right after the previous half-iteration, in order to save the propagation latency, was studied in [21]. The resulting decoding nevertheless requires additional control mechanisms to avoid consistency conflicts in the memories. Since communication time is often considered to be the limiting factor in multiprocessor turbo decoding, saving the propagation latency is a crucial property of shuffled decoding. In addition, by doubling the number of BCJR SISO decoders, component-decoder parallelism halves the iteration period in comparison with the originally proposed serial turbo decoding. Nevertheless, to preserve error-rate performance with shuffled decoding, an iteration overhead of between 5% and 50% is required, depending on the BCJR computation scheme, on the degree of sub-block parallelism, on the propagation time, and on the interleaving rules [22]. In fact, this overhead decreases as the sub-block parallelism degree increases [19], while the computation overhead of sub-block parallelism increases. Consequently, at high throughput and comparable complexity, the computation overhead becomes greater when doubling the sub-block parallelism degree than when using shuffled decoding. Thus, for high throughput, shuffled decoding is more efficient than sub-block parallelism. Simulations demonstrate minor variations of the shuffled decoding overhead for low propagation latencies. Above a propagation latency of three times the extrinsic information emission time, the overhead reduces the interest of the shuffled decoding technique. Finally, this level of parallelism presents great potential for scalability and high area efficiency. Exploiting this level of parallelism is detailed in Section V.

C. Turbo-Decoder Level Parallelism

The highest level of parallelism simply duplicates whole turbo decoders to process iterations and/or frames in parallel. Iteration parallelism occurs in a pipelined fashion with a maximum pipeline depth equal to the iteration number, whereas frame parallelism presents no limitation in parallelism degree. Nevertheless, turbo-decoder level parallelism is too area-expensive (all memories and computation resources are duplicated) and presents no gain in frame decoding latency; for these reasons, it is not considered in this work.
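To contrast the two schedules of Fig. 4, here is a schematic Python sketch with a dummy component decoder (all names are ours, and the BCJR processing is elided): serial decoding completes a full half-iteration before any exchange, whereas shuffled decoding exchanges each extrinsic value as soon as it is produced.

```python
import random

class DummySISO:
    """Stand-in component decoder; the actual BCJR processing is elided."""
    def __init__(self, n): self.apriori = [0.0] * n
    def process_symbol(self, k):            # returns a fake extrinsic value
        return 0.5 * self.apriori[k] + random.random()
    def update_a_priori(self, k, v): self.apriori[k] = v

def serial_decoding(d1, d2, pi, inv, n_iter):
    # (a) Serial: a complete half-iteration, then exchange via the interleaver.
    n = len(pi)
    for _ in range(n_iter):
        ext1 = [d1.process_symbol(k) for k in range(n)]
        for k in range(n): d2.update_a_priori(pi[k], ext1[k])
        ext2 = [d2.process_symbol(k) for k in range(n)]
        for k in range(n): d1.update_a_priori(inv[k], ext2[k])

def shuffled_decoding(d1, d2, pi, inv, n_iter):
    # (b) Shuffled: both decoders advance together; every extrinsic value is
    # exchanged immediately, overlapping computation and communication.
    n = len(pi)
    for _ in range(n_iter):
        for k in range(n):
            e1, e2 = d1.process_symbol(k), d2.process_symbol(k)
            d2.update_a_priori(pi[k], e1)
            d1.update_a_priori(inv[k], e2)

pi = random.sample(range(8), 8)
inv = [pi.index(k) for k in range(8)]
shuffled_decoding(DummySISO(8), DummySISO(8), pi, inv, n_iter=2)
```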
IV. EXPLOITING BCJR METRIC LEVEL PARALLELISM: ASIP FOR BCJR SISO DECODER

A. Context of Architectural Choices

As seen in Section III-A, the BCJR metric level parallelism that occurs inside a BCJR SISO decoder is the most area-efficient level of parallelism. Thus, a hardware implementation achieving high throughput should first exploit this parallelism. The complexity of the convolutional turbo codes proposed in all existing and emerging standards is limited to eight-state double binary turbo codes or 16-state simple binary turbo codes. Hence, to fully exploit trellis-transition parallelism (Section III-A1) for all standards, a parallelism degree of 32 is required. Future, more complex codes can be supported by splitting trellis sections into sub-sections of parallelism degree 32 and by processing the sub-sections sequentially.

Regarding BCJR computation parallelism (Section III-A2), we choose a parallelism degree of two instead of four (the maximum). Using a parallelism degree of four with the butterfly scheme leads to an underutilization of the BCJR computation units (two of them are used only half of the time). These parallelism requirements imply the use of specific hardware units. To implement these units while preserving flexibility, application-specific instruction-set processors constitute the perfect solution [7]. The BCJR SISO decoder should also have adequate communication interfaces in order to handle the inter-sub-block communications in the case of BCJR-SISO decoder parallelism (see Section V). In this context, in order to implement shuffled decoding efficiently, the propagation time should be less than three emission periods (see Section III-B2). Let $T_{net}$ be the time required to cross the network, $P$ the pipeline length between the load stage and the store stage of the extrinsic information, $N_c$ the number of clock cycles required by the processor to compute an extrinsic information value, and $f$ the frequency of the processor. Then

$$T_{prop} = T_{net} + \frac{P}{f} \quad (4)$$

$$T_{em} = \frac{N_c}{f} \quad (5)$$

$$\frac{T_{prop}}{T_{em}} = \frac{f \, T_{net} + P}{N_c}. \quad (6)$$

To preserve a low $T_{prop}/T_{em}$ ratio, and thus make the use of the shuffled decoding technique efficient, we choose a short pipeline, keeping $P/N_c$ small. A long pipeline inhibits the exploitation of the shuffled decoding technique. For example, the ASIP developed in [11], which can emit one extrinsic information value per cycle ($N_c = 1$) with a $P$ of 8, has a ratio greater than 8. A numerical illustration of this constraint follows below.
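The helper below evaluates the ratio of (6), as reconstructed above, for the long-pipeline example of [11]; it is a small sanity check, not part of the design flow.

```python
def shuffled_ratio(t_net, P, n_cycles, f):
    """Propagation-to-emission ratio of (6), as reconstructed above; it must
    stay below three for shuffled decoding to remain efficient (Sec. III-B2)."""
    return (f * t_net + P) / n_cycles

# Long-pipeline ASIP of [11]: one extrinsic value per cycle (n_cycles = 1),
# P = 8. Even with an ideal zero-latency network, the ratio already reaches 8.
print(shuffled_ratio(t_net=0.0, P=8, n_cycles=1, f=400e6))
```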
B. Architecture of the ASIP

The presented processor, dedicated to the BCJR algorithm, is an enhanced version of the ASIP proposed in [10].

1) Global View: The ASIP is mainly composed of an operative part and a control part, besides its communication interfaces and attached memories [see Fig. 5(a)]. The operative part is tailored to process a window of 64 symbols by means of two identical BCJR computation units, corresponding to the forward and backward processing of the MAP algorithm. Each unit produces recursion metrics and extrinsic information. The storage of the recursion metrics produced by one unit, to be used by the other unit, is performed in eight cross memories per unit; in total, the processor integrates 16 internal cross memories in order to provide the adequate bandwidth. Another internal memory (config), 96 bits wide, contains up to 256 trellis descriptions, so that the processor can be configured for the corresponding standard. Incoming data, which group the systematic and redundant channel information in addition to the extrinsic information, are stored in external memories attached to the ASIP (input data, info ext). The input data memory is 32 bits wide, so as to contain up to four 8-bit channel information data (systematic or redundant). The info ext memory is 64 bits wide, so as to contain up to four 16-bit extrinsic information data, since four extrinsic information data are required by double binary codes. Depending on the application's requirements, the depth of the incoming data memories can be scaled to cover the frame-length specifications of all existing and emerging standards. The external future and past memory banks are used to initialize the state metric values at the beginning and end of each window, according to the message passing method. Each bank has two 128-bit-wide memories, one storing forward recursions and the other backward recursions. These initialization memories are used as follows: 1) at the beginning of the decoding, the state metric registers are either set according to the available information about the state, or reset so that all state metrics have equal probability; 2) after computations (e.g., acquisition if programmed, or recursion), the state metrics obtained for the beginning and end of each window can be stored in a memory bank in order to be used for initialization at the next iteration. The depth of these memories can be scaled to the number of windows required, with a maximum of 1024 (matching the 10-bit window registers described in Section IV-B4). For the $w$th window associated with the processor, the initialization metrics are read from the forward and backward past memories, and the refined state metrics obtained for the beginning and end of the window are then stored back into these memories. The future memory bank is only accessible at address 0, for the state metrics of the end of the last window associated with the processor: in this case, the backward initialization metrics are read from the backward future memory at address 0, and the forward metrics are stored in the forward future memory at address 0. For all the external memories, read/write latencies of one cycle have been integrated into the ASIP pipeline.

2) BCJR Computation Unit: Each BCJR computation unit is based on a single instruction multiple data (SIMD) architecture in order to exploit trellis-transition parallelism. Thus, 32 adder nodes (one per transition) and 8 max nodes are incorporated in each unit [see Fig. 5(b)]. The 32 adder nodes are organized as a 4 × 8 processing matrix. In this organization, for an 8-state double binary code, the row and the column of an adder node correspond, respectively, to the considered symbol decision and to the ending state of the associated transition. For a 16-state simple binary code, transitions with ending states 0 to 7 are mapped on matrix nodes of row 0 if the transition bit decision is 0, or on matrix nodes of row 1 if the transition bit decision is 1, whereas states 8 to 15 are mapped on nodes of rows 2 and 3. An adder node [see Fig. 5(c)] contains one adder, multiplexers, one configuration register (RT), and an output register (RADD). It supports the addition required in a recursion between a state metric (coming from the state metric register bank RMC) and a branch metric (coming from the branch metric register bank RG), and also the addition required in extrinsic information generation, since it can accumulate the previous result with the state metric of the other recursion coming from the register bank RC. The max nodes [see Fig. 5(d)] are shared in the processing matrix so that the max operations can be performed on the RADD registers either rowwise or columnwise, depending on the ASIP instruction. A max node contains three max operators connected in a tree. This makes it possible to perform either one four-input maximum (using the three operators) or two two-input maximums. Results are stored either in the first rows or columns of the RADD matrix, or in the RMC bank to achieve the recursion computation. A software model of this matrix organization is sketched below.
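This is an illustrative software model of the datapath just described (the transition mapping is assumed for the example), not RTL:

```python
import numpy as np

# Model of the 4 x 8 adder/max matrix for an 8-state double binary code:
# RADD[d, s] = alpha_prev[src(d, s)] + gamma[d, s]  (one adder node per transition)
def radd_fill(alpha_prev, gamma, src):
    return alpha_prev[src] + gamma          # 32 additions "in one cycle"

# Columnwise max (over the 4 decisions) yields the recursion update of (1):
def columnwise_max(radd):
    return radd.max(axis=0)                 # new state metrics, 8 values

# Rowwise max (over the 8 ending states) contributes to the extrinsic metrics (3):
def rowwise_max(radd_with_beta):
    return radd_with_beta.max(axis=1)       # one metric per decision, 4 values

rng = np.random.default_rng(1)
src = rng.integers(0, 8, size=(4, 8))       # hypothetical transition mapping
radd = radd_fill(rng.normal(size=8), rng.normal(size=(4, 8)), src)
print(columnwise_max(radd), rowwise_max(radd))
```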

Fig. 5. (a) ASIP architecture. (b) BCJR computation unit. (c) Adder node. (d) Max node. (e) Control unit.

The BCJR computation unit also contains a GLOBAL arithmetic logic unit (ALU), which computes the extrinsic information, the hard decisions, and other global processing, and a branch metric (BM) generator, which computes the branch metrics from the extrinsic information register bank (RIE) and from the channel information available in the pipeline registers (PR). The BM generator supports cyclic puncturing patterns with a maximum pattern length of 16. The pattern length is configurable in a 4-bit register, while the puncturing patterns associated with the four channel information data are configurable through four 16-bit registers, in which each zero corresponds to a punctured bit. The BM generator thus supports code rates from the mother rate of the component code up to 1, for both double binary and simple binary codes.

3) Pipeline Strategy: The ASIP control part is based on a six-stage pipeline [see Fig. 5(e)]. The first two stages (FE, DC) fetch instructions from the program memory and decode them. Then, depending on the instruction type, the operand fetch (OPF) stage loads data from the input data memory into the pipeline registers PR, and/or data from the extrinsic information memory into the RIE registers, and/or data from the past/future memories into the RMC registers, and/or the configuration into the RT registers. In comparison with [10], a BM stage has been added to the pipeline in order to anticipate the calculation of the branch metrics performed in the BM generator, to increase the clock frequency of the ASIP, and to improve the number of cycles per symbol. The execute (EX) stage performs the rest of the BCJR operations. This choice reduces the performance of the ASIP, since the architecture does not fully exploit ILP. However, it was made intentionally, to keep the pipeline as short as possible and thereby support the shuffled decoding technique efficiently. Hence, extrinsic information can cross the pipeline from the OPF stage to the store (ST) stage in only four cycles (see Section IV-A).

Fig. 6. Butterfly ZOL mechanism.

4) Control Structure: The control part also requires several dedicated control registers. The window size is fixed in the register R SIZE, and the current symbol processed inside BCJR computation unit A (respectively, BCJR computation unit B) is stored in the pipeline register ADDRESS A (respectively, ADDRESS B). These addresses, as well as the program counter and the corresponding instruction, are then pipelined. To correctly access the incoming data memories and the past/future memories, the processor has a 10-bit WINDOW ID register that identifies the window being computed and a 10-bit R SUB BLOCK register that sets the number of windows processed by the ASIP. Thus, one ASIP can process up to 1024 windows. In addition, the control architecture provides branch mechanisms and a zero overhead loop (ZOL) mechanism fully dedicated to the butterfly scheme (see Section III-A2). To lighten the ASIP instruction set, the ZOL mechanism is tightly coupled with address generation (see Fig. 6). Thus, the first loop is performed while the address of the symbol processed by unit A is smaller than the address of the symbol processed by unit B. In the case of an odd window size, the middle symbol is processed by unit A when both addresses are equal. Finally, the second loop is performed while the address of the symbol processed by unit B is positive.
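A behavioral sketch of this butterfly ZOL address generation (our own pseudo-model of the mechanism of Fig. 6, not the control RTL): unit A walks the window forward while unit B walks it backward, and the two loop bodies repeat without branch overhead.

```python
def butterfly_zol(window_size, loop1_body, loop2_body):
    """Addresses for units A (forward) and B (backward) under the butterfly ZOL.
    loop1_body/loop2_body stand for the instruction sequences of the two loops."""
    addr_a, addr_b = 0, window_size - 1
    # First loop: recursions only, while A's address stays below B's.
    while addr_a < addr_b:
        loop1_body(addr_a, addr_b)
        addr_a, addr_b = addr_a + 1, addr_b - 1
    if addr_a == addr_b:               # odd window size: middle symbol on unit A
        loop1_body(addr_a, addr_b)
        addr_a, addr_b = addr_a + 1, addr_b - 1
    # Second loop: recursions plus extrinsic outputs; modeled with >= 0 so that
    # unit B also processes the first symbol of the window.
    while addr_b >= 0:
        loop2_body(addr_a, addr_b)
        addr_a, addr_b = addr_a + 1, addr_b - 1

butterfly_zol(7, lambda a, b: print("L1", a, b), lambda a, b: print("L2", a, b))
```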

Fig. 7. Assembly programs for: (a) WiMAX double binary turbo code and (b) 3GPP simple binary turbo code.

C. ASIP Instruction Set

The instruction set of our ASIP architecture is coded on sixteen bits. The basic version contains 30 instructions that perform the basic operations of the MAP algorithm. To increase performance, the ASIP was extended with compacted instructions that can perform several operations in different pipeline stages within a single instruction. The following paragraphs detail the instructions required to perform simple decoding. These instructions are divided into three classes: control, operative, and IO.

1) Control: As mentioned previously, the butterfly ZOL instruction repeats the two loops of the butterfly scheme R SIZE times. It requires three markers to retain the relative addresses of the first-loop end instruction, the second-loop begin instruction, and the second-loop end instruction. An unconditional branch instruction has also been designed; it uses the direct addressing mode. The SET SIZE instruction sets the ASIP window size, up to a maximum of 64 symbols. SET WINDOW ID and SET SUB BLOCK NB are used to set the WINDOW ID and R SUB BLOCK registers. Thus, the processor manages up to 65 536 symbols (1024 windows of 64 symbols).

2) Operative MAP: An ADD instruction is defined and used in two different modes: metric computation (add m) and extrinsic information computation (add i). According to the add mode and to the configuration registers (RT), each processing node selects the desired operands, performs the addition, and stores the result in the corresponding RADD register. In the same way, max1 and max2 instructions are defined with the same modes as the ADD instruction. The max1 instruction performs only one comparison-selection (two outputs per max node), while the max2 instruction cascades comparison-selection operations (one output per max node). These instructions have to be repeated as often as necessary to obtain either the extrinsic information or the recursion metrics at the considered address in the sub-block. The basic instruction set also contains the DECISION instruction, which produces hard decisions on the processed symbols.

3) IO: The basic instruction set also provides input and output instructions. With these instructions, parallel multi-accesses are executed in order to: load decoder input data (LD DATA), input recursion metrics (LD REC), and the configuration (LD CONFIG); store output recursion metrics (ST REC); handle the internal cross metrics between the two BCJR computation units (LD CROSS, LD ST); and send extrinsic information packets and hard decisions (ST EXT, DEC). We chose to group the extrinsic information into packets for efficient IO operations. Each packet can contain a packet header and the extrinsic information of the current symbol (up to four 16-bit data in the case of double binary codes). This header typically contains the address of the processed symbol, including the WINDOW ID and the local address (cf. Section V-B); a possible packing is sketched below.
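For illustration, a hypothetical packing of such a packet; the field widths and layout are chosen for this example only, since the exact header format is not specified here.

```python
import struct

def pack_ext_packet(window_id, local_addr, ext):
    """Hypothetical extrinsic-information packet: a header carrying the symbol
    address (10-bit WINDOW ID plus 6-bit local address within a 64-symbol
    window), followed by up to four 16-bit extrinsic values (double binary)."""
    assert 0 <= window_id < 1024 and 0 <= local_addr < 64 and len(ext) <= 4
    header = (window_id << 6) | local_addr          # 16-bit header
    return struct.pack("<H4h", header, *ext, *([0] * (4 - len(ext))))

pkt = pack_ext_packet(window_id=3, local_addr=17, ext=[120, -45, 7, 0])
```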
D. Application Examples

Fig. 7 gives the ASIP assembly programs required (a) to decode a 48-symbol sub-block with the turbo code used in the WiMAX standard and (b) to decode a 40-bit sub-block with the turbo code used in the 3GPP standard. In both cases, the first instructions load the required configuration (LD CONFIG) and initialize the recursion metrics (LD REC). Then the butterfly loops are set up using the ZOLB instruction. The first loop (two instructions) computes only the state metrics: two max operations (max2 instruction) are required for the double binary code, whereas only one max operation (max1 instruction) is required for the simple binary code. The second loop (five instructions) computes, in addition to the state metrics, the extrinsic information for the eight-state code (using three max operations). Finally, the ASIP exports the sub-block ending metrics (ST REC) and the program branches back to the first instruction of the butterfly. Regarding the execution time, about $2 \times N/2$ cycles are spent in the first loop of the butterfly scheme and $5 \times N/2$ cycles in the second loop, where $N$ is the sub-block size. So, about $3.5N$ cycles are needed to process the $N$ symbols of the sub-block, i.e., roughly 3.5 cycles per symbol (3.5 cycles/bit in simple binary mode and 1.75 cycles/bit in double binary mode).
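As a quick arithmetic check, these cycle counts tie directly to the throughput figures reported in Section IV-E, which uses the 400-MHz clock frequency obtained after synthesis:

```python
# Throughput = f / (cycles per bit), per SISO decoder, ignoring iterations.
f_clk = 400e6                 # Hz, clock frequency after synthesis (Section IV-E)
cycles_per_symbol = 3.5
for mode, bits_per_symbol in (("double binary", 2), ("simple binary", 1)):
    mbps = f_clk * bits_per_symbol / cycles_per_symbol / 1e6
    print(f"{mode}: {mbps:.1f} Mb/s")   # about 228.6 and 114.3 Mb/s
```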

E. Implementation Results

In this paper, we use the Processor Designer framework from CoWare [23]. Processor Designer is based on the LISA ADL [9], which allows the automatic generation of ASIP models (VHDL, Verilog, and SystemC) for hardware synthesis and system integration, in addition to the generation of the underlying software development tools. Using this tool, a VHDL description of the ASIP architecture was generated. It was synthesized with Synopsys Design Compiler using an ST 90-nm ASIC library under worst-case conditions (0.9 V, 105 °C). The optimized ASIP reaches a maximum clock frequency of 400 MHz and occupies about 63.1 KGates (equivalent). Compared with the previous ASIP [10], the clock frequency is improved by 20% and the area is decreased by 35%. The presented ASIP can process 228 Mb/s in double binary mode and 114 Mb/s in simple binary mode. Note that, as a future extension, the performance of the simple binary mode can be significantly improved when the code rate of the component codes is greater than or equal to one half (valid in most standards), by compacting the trellis [24]. Under this condition, two consecutive stages of the simple binary trellis can be compacted into one stage of a new double binary trellis without error-rate degradation. With this new double binary trellis configuration, a simple binary code can also be decoded at 228 Mb/s. Trellis compaction requires a different soft-decision management, implying extra operations. This feature is not implemented in the current ASIP architecture. However, it can be supported with minor modifications of the LD DATA and ST EXT instructions: the LD DATA instruction should handle the soft-input bit-to-symbol conversion (new add operators in the BM stage), and the ST EXT instruction should handle the soft-output symbol-to-bit conversion (new max operators in the ST stage). These elementary operators do not change the maximum clock frequency, since they introduce their own non-critical paths. Thus, a negligible hardware overhead is induced, without any degradation in throughput. Besides, as the packet generated by the ASIP would then contain two binary soft decisions, the network interface has to split each processor packet into two network packets to perform interleaving.

TABLE I. Comparison of different turbo decoding implementations for UMTS.

Table I compares the performance results of log-MAP turbo decoding implementations for the UMTS turbo code. We can observe that the designed ASIP has excellent throughput performance thanks to a number of cycles per bit per SISO close to one, the value obtained by fully dedicated hardware implementations. Compared to [11], our ASIP presents a slightly lower throughput for an almost similar area (63 versus 56 KGates) and with a shorter pipeline (6 versus 11 stages), in order to make shuffled decoding possible. With the trellis compaction extension, which reveals the real potential of the proposed architecture for decoding simple binary codes, the ASIP can reach a slightly better throughput despite the use of a 90-nm target technology. Furthermore, thanks to its dedicated past/future memories, our processor can skip the acquisition phases efficiently and without degradation. The figures of Table I do not include this acquisition computation overhead, which can rise to around 15% in [25].

V. EXPLOITING BCJR-SISO DECODER LEVEL PARALLELISM: MULTI-ASIP PLATFORM

The ASIP presented in the previous section fully exploits the first parallelism level (BCJR metric level parallelism, Section III-A) by efficiently performing all the computations of a BCJR-SISO decoder. In order to exploit the second parallelism level (BCJR-SISO decoder level, Section III-B), a multi-ASIP architecture is required.
Fig. 8. Extrinsic information exchanges in BCJR-SISO decoder level parallelism.

A. Multi-ASIP Turbo Decoding Overview

Sub-block parallelism implies the use of one BCJR-SISO decoder, e.g., our ASIP, for each sub-block. The state metrics of a sub-block can then be initialized using the message passing technique (see Section III-B2) through the state metric interfaces of the ASIP. Component-decoder parallelism implies the use of at least one ASIP for each component decoder, where the ASIPs execute in parallel and exchange extrinsic information concurrently (shuffled decoding). Fig. 8 illustrates the architecture template required to exploit both kinds of parallelism. Besides the integration of multiple ASIPs, this figure shows the need for dedicated communication structures to accommodate the massive information exchanges between the ASIPs.

Fig. 9. ASIP-based multiprocessor architecture for turbo decoding.

Regarding communication interfaces, the developed ASIP incorporates the required state metric interfaces as well as the extrinsic information interfaces. On the other hand, the interleaving of extrinsic information has to be handled by the communication structure. As seen in Section IV-A, efficient shuffled decoding imposes a low $T_{prop}/T_{em}$ ratio (less than three), and the proposed ASIP architecture keeps this ratio low. Thus, to preserve shuffled decoding efficiency, the communication structure has to ensure a short propagation time, which can be quantified using (6).

B. Communication Structures

In order to illustrate how we implement the required communication structures, Fig. 9 presents a four-ASIP turbo decoder architecture where each component decoder is implemented using two ASIPs. This figure shows the three kinds of networks that are used: the data interface network, the state metric network, and the extrinsic information network. First, the data interface network is used to dispatch new channel data from the frame memory of the IO interface to the local input data memories of the ASIPs and, concurrently, to gather output data from the ASIPs. Second, the state metric network enables exchanges between neighboring ASIPs within a component decoder. These exchanges are mandatory to initialize the sub-blocks with the message passing technique. As seen in Section IV-B1, an ASIP accesses the initialization values of the beginning and end of its sub-block at address 0 of its past/future memories. In the case of full sub-block parallelism (i.e., no windowing), these memories can be replaced by buffers, and the state metric network consists of a set of buffers between neighboring processors, reflecting the trellis termination strategy. Thus, a circular trellis termination strategy, i.e., one where the ending and beginning states of the frame are identical, implies the use of a buffer between the first and last ASIPs (see Fig. 9). Finally, the extrinsic information network is based on routers to make extrinsic information exchanges possible between the ASIPs. As the proposed ASIP supports the butterfly scheme, two packets can be sent on this network per emission and per ASIP. The packet headers generated by the ASIP are used by the network interfaces (NIs) to perform interleaving: each NI regenerates a new header with the corresponding routing information (a sketch of this translation is given at the end of this subsection). The routers integrate buffering mechanisms and support up to two input ports and two output ports. Fig. 9 presents a simple topology supporting four ASIPs; architectures with more than four processors require more complex topologies. It is worth noting that these networks take advantage of packet-switched communication [26]. In [18], we proposed the use of multistage interconnection networks on chip based on Butterfly and Benes topologies. These topologies are scalable and present the required bandwidth and latencies at a reasonable hardware complexity. Even if their scalability is limited to numbers of input ports that are powers of two, they keep the mean propagation latency under control, as it evolves with the logarithm of the network size. This valuable property contributes significantly to fulfilling the shuffled decoding requirements.
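As an illustration of the NI's role, the following hypothetical sketch translates a processor packet header (WINDOW ID plus local address, as in Section IV-C3) into an interleaved address and a destination ASIP; the address split and the interleaver table are assumptions for the example.

```python
import random

def ni_translate(header, pi, window_size, sub_block_size):
    """Map a processor packet header to (destination ASIP, new header).
    pi: interleaver table mapping natural order to interleaved order."""
    window_id, local = header >> 6, header & 0x3F     # 10-bit / 6-bit split
    src_symbol = window_id * window_size + local      # global natural address
    dst_symbol = pi[src_symbol]                       # interleaved address
    dst_asip = dst_symbol // sub_block_size           # routing destination
    dst_local = dst_symbol % sub_block_size
    new_header = ((dst_local // window_size) << 6) | (dst_local % window_size)
    return dst_asip, new_header

# Example with a toy 256-symbol frame on 4 ASIPs (64 symbols each).
pi = list(range(256)); random.Random(0).shuffle(pi)
print(ni_translate((1 << 6) | 5, pi, window_size=64, sub_block_size=64))
```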
C. Results

With the conventional turbo decoding technique, the achievable throughput of a multiprocessor architecture does not increase linearly with the number of processors, especially when this number is high [2], [8]. This degradation is mainly due to the interleaving, the IO communication delays, and the sub-block parallelism [19]. The use of the shuffled decoding technique limits this degradation. Thus, the throughput of the proposed ASIP-based multiprocessor architecture depends on the number of integrated ASIPs and on the shuffled decoding efficiency (see Section III-B2). The low-latency extrinsic information network (see Section V-B) and the short ASIP pipeline (see Section IV-B3) guarantee a high shuffled decoding efficiency. As an example, Table II summarizes the multiprocessor turbo decoding performance for the WiMAX double binary code. Results are compared at the error-rate performance level obtained with five iterations without BCJR-SISO decoder parallelism.


More information

A generalized precompiling scheme for surviving path memory management in Viterbi decoders

A generalized precompiling scheme for surviving path memory management in Viterbi decoders A generalized precompiling scheme for surviving path memory management in Viterbi decoders Emmanuel BOUTON, Nicolas DEMASSEUX Telecom Paris, E.N.S.T, 46 rue Barrault, 75634 PARS CEDEX 3, FRANCE e-mail

More information

A scalable, fixed-shuffling, parallel FFT butterfly processing architecture for SDR environment

A scalable, fixed-shuffling, parallel FFT butterfly processing architecture for SDR environment LETTER IEICE Electronics Express, Vol.11, No.2, 1 9 A scalable, fixed-shuffling, parallel FFT butterfly processing architecture for SDR environment Ting Chen a), Hengzhu Liu, and Botao Zhang College of

More information

Effective Memory Access Optimization by Memory Delay Modeling, Memory Allocation, and Slack Time Management

Effective Memory Access Optimization by Memory Delay Modeling, Memory Allocation, and Slack Time Management International Journal of Computer Theory and Engineering, Vol., No., December 01 Effective Memory Optimization by Memory Delay Modeling, Memory Allocation, and Slack Time Management Sultan Daud Khan, Member,

More information

A Low Power Asynchronous FPGA with Autonomous Fine Grain Power Gating and LEDR Encoding

A Low Power Asynchronous FPGA with Autonomous Fine Grain Power Gating and LEDR Encoding A Low Power Asynchronous FPGA with Autonomous Fine Grain Power Gating and LEDR Encoding N.Rajagopala krishnan, k.sivasuparamanyan, G.Ramadoss Abstract Field Programmable Gate Arrays (FPGAs) are widely

More information

4. Networks. in parallel computers. Advances in Computer Architecture

4. Networks. in parallel computers. Advances in Computer Architecture 4. Networks in parallel computers Advances in Computer Architecture System architectures for parallel computers Control organization Single Instruction stream Multiple Data stream (SIMD) All processors

More information

VLSI Architectures for SISO-APP Decoders

VLSI Architectures for SISO-APP Decoders IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 11, NO. 4, AUGUST 2003 627 VLSI Architectures for SISO-APP Decoders Mohammad M. Mansour, Student Member, IEEE, and Naresh R. Shanbhag,

More information

Application of a design space exploration tool to enhance interleaver generation

Application of a design space exploration tool to enhance interleaver generation Application of a design space exploration tool to enhance interleaver generation Cyrille Chavet, Philippe Coussy, Pascal Urard, Eric Martin To cite this version: Cyrille Chavet, Philippe Coussy, Pascal

More information

A Ripple Carry Adder based Low Power Architecture of LMS Adaptive Filter

A Ripple Carry Adder based Low Power Architecture of LMS Adaptive Filter A Ripple Carry Adder based Low Power Architecture of LMS Adaptive Filter A.S. Sneka Priyaa PG Scholar Government College of Technology Coimbatore ABSTRACT The Least Mean Square Adaptive Filter is frequently

More information

Reduced complexity Log-MAP algorithm with Jensen inequality based non-recursive max operator for turbo TCM decoding

Reduced complexity Log-MAP algorithm with Jensen inequality based non-recursive max operator for turbo TCM decoding Sybis and Tyczka EURASIP Journal on Wireless Communications and Networking 2013, 2013:238 RESEARCH Open Access Reduced complexity Log-MAP algorithm with Jensen inequality based non-recursive max operator

More information

An Efficient Constant Multiplier Architecture Based On Vertical- Horizontal Binary Common Sub-Expression Elimination Algorithm

An Efficient Constant Multiplier Architecture Based On Vertical- Horizontal Binary Common Sub-Expression Elimination Algorithm Volume-6, Issue-6, November-December 2016 International Journal of Engineering and Management Research Page Number: 229-234 An Efficient Constant Multiplier Architecture Based On Vertical- Horizontal Binary

More information

High Data Rate Fully Flexible SDR Modem

High Data Rate Fully Flexible SDR Modem High Data Rate Fully Flexible SDR Modem Advanced configurable architecture & development methodology KASPERSKI F., PIERRELEE O., DOTTO F., SARLOTTE M. THALES Communication 160 bd de Valmy, 92704 Colombes,

More information

ERROR correcting codes are used to increase the bandwidth

ERROR correcting codes are used to increase the bandwidth 404 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 37, NO. 3, MARCH 2002 A 690-mW 1-Gb/s 1024-b, Rate-1/2 Low-Density Parity-Check Code Decoder Andrew J. Blanksby and Chris J. Howland Abstract A 1024-b, rate-1/2,

More information

Design and Implementation of Low Complexity Router for 2D Mesh Topology using FPGA

Design and Implementation of Low Complexity Router for 2D Mesh Topology using FPGA Design and Implementation of Low Complexity Router for 2D Mesh Topology using FPGA Maheswari Murali * and Seetharaman Gopalakrishnan # * Assistant professor, J. J. College of Engineering and Technology,

More information

Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks

Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks X. Yuan, R. Melhem and R. Gupta Department of Computer Science University of Pittsburgh Pittsburgh, PA 156 fxyuan,

More information

Design of Low-Power and Low-Latency 256-Radix Crossbar Switch Using Hyper-X Network Topology

Design of Low-Power and Low-Latency 256-Radix Crossbar Switch Using Hyper-X Network Topology JOURNAL OF SEMICONDUCTOR TECHNOLOGY AND SCIENCE, VOL.15, NO.1, FEBRUARY, 2015 http://dx.doi.org/10.5573/jsts.2015.15.1.077 Design of Low-Power and Low-Latency 256-Radix Crossbar Switch Using Hyper-X Network

More information

A Novel Area Efficient Folded Modified Convolutional Interleaving Architecture for MAP Decoder

A Novel Area Efficient Folded Modified Convolutional Interleaving Architecture for MAP Decoder A Novel Area Efficient Folded Modified Convolutional Interleaving Architecture for Decoder S.Shiyamala Department of ECE SSCET Palani, India. Dr.V.Rajamani Principal IGCET Trichy,India ABSTRACT This paper

More information

UNIT I (Two Marks Questions & Answers)

UNIT I (Two Marks Questions & Answers) UNIT I (Two Marks Questions & Answers) Discuss the different ways how instruction set architecture can be classified? Stack Architecture,Accumulator Architecture, Register-Memory Architecture,Register-

More information

CHAPTER 6 FPGA IMPLEMENTATION OF ARBITERS ALGORITHM FOR NETWORK-ON-CHIP

CHAPTER 6 FPGA IMPLEMENTATION OF ARBITERS ALGORITHM FOR NETWORK-ON-CHIP 133 CHAPTER 6 FPGA IMPLEMENTATION OF ARBITERS ALGORITHM FOR NETWORK-ON-CHIP 6.1 INTRODUCTION As the era of a billion transistors on a one chip approaches, a lot of Processing Elements (PEs) could be located

More information

A Reconfigurable Outer Modem Platform for Future Wireless Communications Systems. Timo Vogt Norbert Wehn {vogt,

A Reconfigurable Outer Modem Platform for Future Wireless Communications Systems. Timo Vogt Norbert Wehn {vogt, Microelectronic System Design TU Kaiserslautern www.eit.uni-kl.de/wehn A econfigurable Outer Modem Platform for Future Wireless Communications Systems Timo Vogt Norbert Wehn {vogt, wehn}@eit.uni-kl.de

More information

Fault-Tolerant Multiple Task Migration in Mesh NoC s over virtual Point-to-Point connections

Fault-Tolerant Multiple Task Migration in Mesh NoC s over virtual Point-to-Point connections Fault-Tolerant Multiple Task Migration in Mesh NoC s over virtual Point-to-Point connections A.SAI KUMAR MLR Group of Institutions Dundigal,INDIA B.S.PRIYANKA KUMARI CMR IT Medchal,INDIA Abstract Multiple

More information

A Normal I/O Order Radix-2 FFT Architecture to Process Twin Data Streams for MIMO

A Normal I/O Order Radix-2 FFT Architecture to Process Twin Data Streams for MIMO 2402 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 24, NO. 6, JUNE 2016 A Normal I/O Order Radix-2 FFT Architecture to Process Twin Data Streams for MIMO Antony Xavier Glittas,

More information

Towards an optimal parallel decoding of turbo codes

Towards an optimal parallel decoding of turbo codes owards an optimal parallel decoding of turbo codes David Gnaedig *, Emmanuel Boutillon +, Jacky ousch *, Michel Jézéquel * urboconcept, 115 rue Claude Chappe, 29280 PLOUZANE, France + LEER Unité CNR FRE

More information

Low Power and Memory Efficient FFT Architecture Using Modified CORDIC Algorithm

Low Power and Memory Efficient FFT Architecture Using Modified CORDIC Algorithm Low Power and Memory Efficient FFT Architecture Using Modified CORDIC Algorithm 1 A.Malashri, 2 C.Paramasivam 1 PG Student, Department of Electronics and Communication K S Rangasamy College Of Technology,

More information

ARITHMETIC operations based on residue number systems

ARITHMETIC operations based on residue number systems IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS, VOL. 53, NO. 2, FEBRUARY 2006 133 Improved Memoryless RNS Forward Converter Based on the Periodicity of Residues A. B. Premkumar, Senior Member,

More information

Implementation of FFT Processor using Urdhva Tiryakbhyam Sutra of Vedic Mathematics

Implementation of FFT Processor using Urdhva Tiryakbhyam Sutra of Vedic Mathematics Implementation of FFT Processor using Urdhva Tiryakbhyam Sutra of Vedic Mathematics Yojana Jadhav 1, A.P. Hatkar 2 PG Student [VLSI & Embedded system], Dept. of ECE, S.V.I.T Engineering College, Chincholi,

More information

Optimal M-BCJR Turbo Decoding: The Z-MAP Algorithm

Optimal M-BCJR Turbo Decoding: The Z-MAP Algorithm Wireless Engineering and Technology, 2011, 2, 230-234 doi:10.4236/wet.2011.24031 Published Online October 2011 (http://www.scirp.org/journal/wet) Optimal M-BCJR Turbo Decoding: The Z-MAP Algorithm Aissa

More information

Managing Dynamic Reconfiguration Overhead in Systems-on-a-Chip Design Using Reconfigurable Datapaths and Optimized Interconnection Networks

Managing Dynamic Reconfiguration Overhead in Systems-on-a-Chip Design Using Reconfigurable Datapaths and Optimized Interconnection Networks Managing Dynamic Reconfiguration Overhead in Systems-on-a-Chip Design Using Reconfigurable Datapaths and Optimized Interconnection Networks Zhining Huang, Sharad Malik Electrical Engineering Department

More information

LLR-based Successive-Cancellation List Decoder for Polar Codes with Multi-bit Decision

LLR-based Successive-Cancellation List Decoder for Polar Codes with Multi-bit Decision > REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLIC HERE TO EDIT < LLR-based Successive-Cancellation List Decoder for Polar Codes with Multi-bit Decision Bo Yuan and eshab. Parhi, Fellow,

More information

Implementation of Efficient Modified Booth Recoder for Fused Sum-Product Operator

Implementation of Efficient Modified Booth Recoder for Fused Sum-Product Operator Implementation of Efficient Modified Booth Recoder for Fused Sum-Product Operator A.Sindhu 1, K.PriyaMeenakshi 2 PG Student [VLSI], Dept. of ECE, Muthayammal Engineering College, Rasipuram, Tamil Nadu,

More information

Multi-path Routing for Mesh/Torus-Based NoCs

Multi-path Routing for Mesh/Torus-Based NoCs Multi-path Routing for Mesh/Torus-Based NoCs Yaoting Jiao 1, Yulu Yang 1, Ming He 1, Mei Yang 2, and Yingtao Jiang 2 1 College of Information Technology and Science, Nankai University, China 2 Department

More information

Single Pass Connected Components Analysis

Single Pass Connected Components Analysis D. G. Bailey, C. T. Johnston, Single Pass Connected Components Analysis, Proceedings of Image and Vision Computing New Zealand 007, pp. 8 87, Hamilton, New Zealand, December 007. Single Pass Connected

More information

Integrating MRPSOC with multigrain parallelism for improvement of performance

Integrating MRPSOC with multigrain parallelism for improvement of performance Integrating MRPSOC with multigrain parallelism for improvement of performance 1 Swathi S T, 2 Kavitha V 1 PG Student [VLSI], Dept. of ECE, CMRIT, Bangalore, Karnataka, India 2 Ph.D Scholar, Jain University,

More information

Optimized architectures of CABAC codec for IA-32-, DSP- and FPGAbased

Optimized architectures of CABAC codec for IA-32-, DSP- and FPGAbased Optimized architectures of CABAC codec for IA-32-, DSP- and FPGAbased platforms Damian Karwowski, Marek Domański Poznan University of Technology, Chair of Multimedia Telecommunications and Microelectronics

More information

An Area-Efficient BIRA With 1-D Spare Segments

An Area-Efficient BIRA With 1-D Spare Segments 206 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 26, NO. 1, JANUARY 2018 An Area-Efficient BIRA With 1-D Spare Segments Donghyun Kim, Hayoung Lee, and Sungho Kang Abstract The

More information

A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis

A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis Bruno da Silva, Jan Lemeire, An Braeken, and Abdellah Touhafi Vrije Universiteit Brussel (VUB), INDI and ETRO department, Brussels,

More information

Chip Design for Turbo Encoder Module for In-Vehicle System

Chip Design for Turbo Encoder Module for In-Vehicle System Chip Design for Turbo Encoder Module for In-Vehicle System Majeed Nader Email: majeed@wayneedu Yunrui Li Email: yunruili@wayneedu John Liu Email: johnliu@wayneedu Abstract This paper studies design and

More information

High Throughput and Low Power NoC

High Throughput and Low Power NoC IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 5, o 3, September 011 www.ijcsi.org 431 High Throughput and Low Power oc Magdy El-Moursy 1, Member IEEE and Mohamed Abdelgany 1 Mentor

More information

ISSN Vol.04,Issue.01, January-2016, Pages:

ISSN Vol.04,Issue.01, January-2016, Pages: WWW.IJITECH.ORG ISSN 2321-8665 Vol.04,Issue.01, January-2016, Pages:0077-0082 Implementation of Data Encoding and Decoding Techniques for Energy Consumption Reduction in NoC GORANTLA CHAITHANYA 1, VENKATA

More information

Implementation of Convolution Encoder and Viterbi Decoder Using Verilog

Implementation of Convolution Encoder and Viterbi Decoder Using Verilog International Journal of Electronics and Communication Engineering. ISSN 0974-2166 Volume 11, Number 1 (2018), pp. 13-21 International Research Publication House http://www.irphouse.com Implementation

More information

A Review on Analysis on Codes using Different Algorithms

A Review on Analysis on Codes using Different Algorithms A Review on Analysis on Codes using Different Algorithms Devansh Vats Gaurav Kochar Rakesh Joon (ECE/GITAM/MDU) (ECE/GITAM/MDU) (HOD-ECE/GITAM/MDU) Abstract-Turbo codes are a new class of forward error

More information

Designing and Characterization of koggestone, Sparse Kogge stone, Spanning tree and Brentkung Adders

Designing and Characterization of koggestone, Sparse Kogge stone, Spanning tree and Brentkung Adders Vol. 3, Issue. 4, July-august. 2013 pp-2266-2270 ISSN: 2249-6645 Designing and Characterization of koggestone, Sparse Kogge stone, Spanning tree and Brentkung Adders V.Krishna Kumari (1), Y.Sri Chakrapani

More information

FPGA IMPLEMENTATION OF FLOATING POINT ADDER AND MULTIPLIER UNDER ROUND TO NEAREST

FPGA IMPLEMENTATION OF FLOATING POINT ADDER AND MULTIPLIER UNDER ROUND TO NEAREST FPGA IMPLEMENTATION OF FLOATING POINT ADDER AND MULTIPLIER UNDER ROUND TO NEAREST SAKTHIVEL Assistant Professor, Department of ECE, Coimbatore Institute of Engineering and Technology Abstract- FPGA is

More information

LOW-DENSITY PARITY-CHECK (LDPC) codes [1] can

LOW-DENSITY PARITY-CHECK (LDPC) codes [1] can 208 IEEE TRANSACTIONS ON MAGNETICS, VOL 42, NO 2, FEBRUARY 2006 Structured LDPC Codes for High-Density Recording: Large Girth and Low Error Floor J Lu and J M F Moura Department of Electrical and Computer

More information

Non-Binary Turbo Codes Interleavers

Non-Binary Turbo Codes Interleavers Non-Binary Turbo Codes Interleavers Maria KOVACI, Horia BALTA University Polytechnic of Timişoara, Faculty of Electronics and Telecommunications, Postal Address, 30223 Timişoara, ROMANIA, E-Mail: mariakovaci@etcuttro,

More information