
From Parallelism Levels to a Multi-ASIP Architecture for Turbo Decoding

Olivier Muller, Member, IEEE, Amer Baghdadi, and Michel Jézéquel, Member, IEEE

Abstract— Emerging digital communication applications and the underlying architectures encounter drastically increasing performance and flexibility requirements. In this paper, we present a novel flexible multiprocessor platform for high-throughput turbo decoding. The proposed platform enables exploiting all parallelism levels of turbo decoding applications to fulfill performance requirements. To fulfill flexibility requirements, the platform is structured around configurable application-specific instruction-set processors (ASIPs) combined with an efficient memory and communication interconnect scheme. The designed ASIP has a single instruction multiple data (SIMD) architecture with a specialized and extensible instruction set and a six-stage pipeline control. The attached memories and communication interfaces enable its integration in multiprocessor architectures. These multiprocessor architectures benefit from the shuffled decoding technique recently introduced in the turbo-decoding field to achieve higher throughput. The major characteristics of the proposed platform are its flexibility and scalability, which make it reusable for all simple and double binary turbo codes of existing and emerging standards. Results obtained for double binary WiMAX turbo codes demonstrate around 250 Mb/s throughput using a 16-ASIP multiprocessor architecture.

Index Terms— Application-specific instruction-set processor (ASIP), Bahl-Cocke-Jelinek-Raviv (BCJR) algorithm, parallel processing, multiprocessor, turbo decoding.

Manuscript received April 04, 2007; revised July 11, 2007, September 05, 2007, and January 04. This work was supported in part by the European Commission through the Network of Excellence in Wireless Communications (NEWCOM). The authors are with the Electronics Department, TELECOM Bretagne, Technopôle Brest Iroise, Brest, France (e-mail: olivier.muller@telecom-bretagne.eu; amer.baghdadi@telecom-bretagne.eu; michel.jezequel@telecom-bretagne.eu).

I. INTRODUCTION

SYSTEMS on chips (SoCs) in the field of digital communication are becoming more and more diversified and complex. In this field, performance requirements, like throughput and error rates, are becoming increasingly severe. To reduce the error rate at a lower signal-to-noise ratio (SNR), i.e., closer to the Shannon limit, turbo (iterative) processing algorithms have emerged [1]. These algorithms, which originally concerned channel coding, are currently being reused over the whole digital communication system, e.g., for equalization, demodulation, synchronization, and multiple-input multiple-output (MIMO) processing. Furthermore, the severe time-to-market constraints and the continuously emerging standards and applications in this field make resorting to new design methodologies and the proposal of a flexible turbo communication platform inevitable. Flexibility can be achieved by the use of programmable/configurable processors rather than application-specific integrated circuits (ASICs). Thus, embedded multiprocessor architectures integrating an adequate communication network-on-chip (NoC) constitute an ultimate solution to preserve flexibility while achieving the required computation and throughput rates.
Algorithm parallelization of turbo decoding has been widely investigated in the last few years, and several implementations have been proposed. Some of these implementations succeeded in achieving high throughput for specific standards with a fully dedicated architecture. High-performance turbo decoders dedicated to 3GPP standards have been implemented in ASIC [2] and in field-programmable gate arrays (FPGAs) [3]. In [4], a new class of turbo codes more suitable for high-throughput implementation is proposed. However, such implementations do not take flexibility and scalability issues into account. Unlike these implementations, others include software and/or reconfigurable parts to achieve the required flexibility, at the price of a lower throughput. This is addressed, for example, in [5] with the XiRISC processor, a reconfigurable processor using an embedded FPGA, or in [6] with a digital signal processor (DSP) integrating dedicated instructions for turbo decoding. Because of their great flexibility, these solutions do not fulfill the performance requirements of all standards (e.g., 150 Mb/s for HomePlug). In fact, the concept of the application-specific instruction-set processor (ASIP) [7] constitutes the appropriate solution for fulfilling the flexibility and performance constraints of emerging and future applications. The use of ASIPs in embedded SoCs is becoming inevitable due to the rapid increase in complexity and flexibility of emerging applications and evolving standards. Two approaches are mainly proposed by EDA vendors for ASIP design. The first approach is based on an environment where the designer can select and configure predefined hardware elements to enhance a predefined basic processor core according to the application needs. User-defined hardware blocks, together with the corresponding instructions, can be added to the processor. This approach was used in a parallel multiprocessor implementation [8]. Despite the advanced heterogeneous communication network that optimizes data transfer and enables a parallel turbo-decoding implementation, the platform lacks performance due to the predefined basic processor core imposed by this approach. In the second approach, the designer has full design freedom thanks to an architecture description language (ADL), which is used to specify the instruction set and the ASIP architecture [9]. In [10], we proposed the first ASIP dedicated to turbo codes using this approach. Thanks to its performance and the multiprocessor template proposed, the solution was able to cover almost all standards while presenting a few limitations [support up to 8-state trellises; instruction-level parallelism (ILP) not fully exploited].

Fig. 1. Turbo decoding: (a) turbo decoder; (b) BCJR SISO; (c) trellis.

Another ASIP based on the same approach, proposed in [11] and [25], resolves these limitations and achieves higher throughput thanks to full exploitation of ILP through a long pipeline. In addition, this ASIP provides support for convolutional codes and integrates interfaces for a multiprocessor platform. However, its long pipeline inhibits the exploitation of the most efficient parallelism for high throughput (component-decoder parallelism, Section III-B2).

In this work, we present an original parallelism classification of turbo decoding applications and directly link the different parallelism levels of the classification to their VLSI implementation techniques and issues in a multi-ASIP platform. An improved ASIP model enabling the support of all parallelism techniques is proposed. It can be configured to decode all simple and double binary turbo codes. Besides the specific arithmetic units that make up this processor model, special care was taken with the memory organization and communication buses. Its architecture facilitates its integration in a multiprocessor scheme, enabling an efficient and flexible implementation of the turbo decoding algorithm.

The rest of this paper is organized as follows. Section II presents the turbo decoding algorithm for a better understanding of subsequent sections. Section III analyzes all parallel processing techniques of turbo decoding and proposes a three-level classification of these techniques. Section IV then details the proposed single instruction multiple data (SIMD) ASIP architecture for turbo decoding, which fully exploits the first level of parallelism. Exploiting the other parallelism levels requires resorting to multi-ASIP architectures. This is illustrated in Section V, where we make use of the second level of parallelism to achieve high throughput with reasonable hardware complexity. Finally, Section VI summarizes the results obtained and concludes this paper.

II. CONVOLUTIONAL TURBO DECODING

In iterative decoding algorithms [12], the underlying turbo principle relies on extrinsic information exchanges and iterative processing between different soft input soft output (SISO) modules. Using input information and a priori extrinsic information, each SISO module computes a posteriori extrinsic information. This a posteriori extrinsic information becomes the a priori information for the other modules and is exchanged via interleaving and deinterleaving processes.

Fig. 2. BCJR computation schemes: (a) forward-backward and (b) butterfly.

Fig. 3. Frame decomposition and sub-block parallelism.

For convolutional turbo codes [1], classically constructed with two convolutional component codes, the SISO modules process the BCJR or forward-backward algorithm [13], which is the optimal algorithm for the maximum a posteriori (MAP) decoding of convolutional codes (see Fig. 1). A BCJR SISO will first compute branch metrics (or γ metrics), where γ_k(s', s) represents the probability of a transition occurring between two trellis states (s': starting state; s: ending state). Note that a branch metric can be decomposed into an intrinsic part, due to systematic information and a priori information, and an extrinsic part, due to redundancy information.
Then a BCJR SISO computes the forward and backward recursions. The forward recursion (or α recursion) computes a trellis section (i.e., the probability of all states of the trellis regarding the kth symbol) using the previous trellis section and the branch metrics between these two sections, while the backward recursion (or β recursion) computes a trellis section using the future trellis section and the branch metrics between these two sections. With the max-log-MAP algorithm [14], this can be expressed as

    α_k(s) = max_{s'} ( α_{k-1}(s') + γ_k(s', s) )      (1)

    β_{k-1}(s') = max_{s} ( β_k(s) + γ_k(s', s) )       (2)

Finally, the extrinsic information of the kth symbol is computed for all decisions d from the forward recursion, the backward recursion, and the extrinsic part γ^ext of the branch metrics:

    Z_k(d) = max_{(s',s) : d(s',s) = d} ( α_{k-1}(s') + γ_k^ext(s', s) + β_k(s) )      (3)
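To make recursion (1) concrete, the following C sketch performs one forward step over an 8-state double binary trellis. The `next_state` table is a hypothetical, code-dependent trellis description (not part of the paper), the max-only update reflects the max-log-MAP approximation, and the backward step of (2) is symmetric.

```c
#include <float.h>

#define NS 8   /* trellis states (8-state double binary code) */
#define ND 4   /* decisions per symbol (double binary)        */

/* Hypothetical trellis table: ending state of the transition
 * leaving state sp with decision d (depends on the code).    */
extern const int next_state[NS][ND];

/* One forward step of (1):
 * alpha_k(s) = max over transitions (sp,d)->s of alpha_{k-1}(sp) + gamma_k(sp,s). */
void forward_step(const float alpha_prev[NS], const float gamma[NS][ND],
                  float alpha[NS])
{
    for (int s = 0; s < NS; s++)
        alpha[s] = -FLT_MAX;                  /* neutral element for max  */

    for (int sp = 0; sp < NS; sp++)           /* add-compare-select (ACS) */
        for (int d = 0; d < ND; d++) {
            int   s = next_state[sp][d];
            float m = alpha_prev[sp] + gamma[sp][d];
            if (m > alpha[s])
                alpha[s] = m;
        }
}
```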

III. PARALLEL PROCESSING LEVELS

In turbo decoding with the BCJR algorithm, parallelism techniques can be classified at three levels: 1) BCJR metric level parallelism; 2) BCJR-SISO decoder level parallelism; and 3) turbo-decoder level parallelism. The first (fine grain) parallelism level concerns symbol elementary computations inside a SISO decoder processing the BCJR algorithm. Parallelism between these SISO decoders, inside one turbo decoder, belongs to the second parallelism level. The third (coarse grain) parallelism level duplicates the turbo decoder itself.

A. BCJR Metric Level Parallelism

The BCJR metric level parallelism concerns the processing of all metrics involved in the decoding of each received symbol inside a BCJR SISO decoder. It exploits the inherent parallelism of the trellis structure, as well as the parallelism of the BCJR computations [15].

1) Parallelism of Trellis Transitions: Trellis-transition parallelism can easily be extracted from the trellis structure, as the same operations are repeated for all transition pairs. In the log domain [14], these operations are either add-compare-select (ACS) operations for the max-log-MAP algorithm or ACSO operations (ACS with a correction offset [14]) for the log-MAP algorithm. Each BCJR computation (1)-(3) requires a number of ACS-like operations equal to half the number of transitions per trellis section. Thus, this number, which depends on the structure of the convolutional code, constitutes the upper bound of the trellis-transition parallelism degree. Furthermore, this parallelism implies low area overhead, as only the ACS units have to be duplicated. In particular, no additional memories are required, since all the parallelized operations are executed on the same trellis section, and consequently on the same data.

2) Parallelism of BCJR Computations: A second metric parallelism can be orthogonally extracted from the BCJR algorithm through a parallel execution of the three BCJR computations. Parallel execution of the backward recursion and of the a posteriori probability (APP) computations was proposed with the original forward-backward scheme, depicted in Fig. 2(a). In this scheme, the BCJR computation parallelism degree is equal to one in the forward part and two in the backward part. To increase this parallelism degree, several schemes have been proposed [16]. Fig. 2(b) shows the butterfly scheme, which doubles the parallelism degree of the original scheme through the parallelism between the forward and backward recursion computations. This is performed without any memory increase, and only the BCJR computation resources have to be duplicated. Thus, BCJR computation parallelism is area efficient but still limited in parallelism degree.

In conclusion, BCJR metric level parallelism achieves optimal area efficiency, as it does not affect memory size, which occupies most of the area in a turbo decoder circuit. Exploiting this level of parallelism is detailed in Section IV. Nevertheless, the parallelism degree is limited by the decoding algorithm and the code structure. Thus, achieving a higher parallelism degree implies exploring higher processing levels.
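The butterfly scheme of Fig. 2(b) can be sketched as follows. The per-step helpers are assumed, not part of the ASIP description: the α and β recursions start from the two ends of a K-symbol window, and once the index pointers have crossed, each recursion meets the stored metrics of the other one, so two extrinsic values can be produced per step.

```c
/* Assumed helpers: one recursion step and one extrinsic emission. */
extern void alpha_step(int k);      /* forward unit,  symbol k       */
extern void beta_step(int k);       /* backward unit, symbol k       */
extern void emit_extrinsic(int k);  /* uses stored opposite metrics  */

/* Butterfly processing of one window of K symbols (K even for
 * simplicity; the middle symbol of an odd window needs special care). */
void butterfly_window(int K)
{
    for (int a = 0, b = K - 1; a < K; a++, b--) {
        alpha_step(a);              /* both recursions advance        */
        beta_step(b);               /* concurrently (two ACS units)   */
        if (a >= K / 2) {           /* pointers have crossed:         */
            emit_extrinsic(a);      /* alpha(a) meets stored beta(a)  */
            emit_extrinsic(b);      /* beta(b) meets stored alpha(b)  */
        }
    }
}
```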
B. BCJR-SISO Decoder Level Parallelism

The second level of parallelism concerns the SISO decoder level. It consists of the use of multiple SISO decoders, each executing the BCJR algorithm and processing a sub-block of the same frame in one of the two interleaving orders. At this level, parallelism can be applied on sub-blocks and/or on component decoders.

1) Sub-Block Parallelism: In sub-block parallelism, each frame is divided into M sub-blocks, and each sub-block is then processed on a BCJR-SISO decoder using adequate initializations [16], [17] (see Fig. 3). A formalism is proposed in [16] to compare various existing sub-block decoding schemes with respect to parallelism degree and memory efficiency. Besides the duplication of BCJR-SISO decoders, this parallelism imposes two other constraints. On the one hand, interleaving has to be parallelized in order to scale the communication bandwidth proportionally [8]. Due to the scrambling property of interleaving, this parallelism can induce communication conflicts, except for the interleavers of emerging standards that are conflict-free. These conflicts force the communication structure to implement conflict management mechanisms and imply a long and variable communication time. This issue is generally addressed by minimizing the interleaving delay with specific communication networks [8], [18]. On the other hand, the BCJR-SISO decoders have to be initialized adequately, either by acquisition or by message passing [17], [19]. The acquisition method estimates recursion metrics over an overlapping region called the acquisition window or prologue: starting from a trellis section where all the states are initialized to a uniform constant, the acquisition window is processed over its length to provide reliable recursion metrics at the sub-block ending points. The message passing method initializes a sub-block with recursion metrics computed during the last iteration in the neighboring sub-blocks; a minimal sketch is given below. In [19], we observed that message passing initialization enables a more efficient decoding, reaching better throughput at comparable hardware complexity. Thus, message passing initialization is mainly considered in the rest of this paper. Regarding the first iteration, the message passing method is undefined and the iterative process starts with a uniform initialization of the sub-block ending states. Alternatively, an initialization by acquisition can slightly improve the convergence of the iterative process, but the resulting gain is usually less than one iteration.

This parallelism is necessary to reach high throughput. Nevertheless, its efficiency for high throughput is strongly reduced, since resolving the initialization issue implies a computation overhead following Amdahl's law (due to acquisition length or additional iterations) [19].
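The following C sketch illustrates message passing initialization under stated assumptions: `bcjr_subblock()` is a hypothetical sub-block decoder, and the frame is circular (tail-biting) so that boundary metrics wrap around; at the first iteration the buffers would simply hold uniform metrics.

```c
#include <string.h>

#define M  8    /* sub-block parallelism degree */
#define NS 8    /* trellis states               */

/* Assumed helper: decodes sub-block m starting from the given boundary
 * metrics and returns its ending (alpha) and beginning (beta) metrics. */
extern void bcjr_subblock(int m,
                          const float alpha_init[NS], const float beta_init[NS],
                          float alpha_end[NS], float beta_begin[NS]);

static float alpha_in[M][NS], beta_in[M][NS];  /* uniform at iteration 0 */

void decode_one_iteration(void)
{
    float alpha_out[M][NS], beta_out[M][NS];

    for (int m = 0; m < M; m++)   /* M BCJR-SISO decoders in parallel */
        bcjr_subblock(m, alpha_in[m], beta_in[m], alpha_out[m], beta_out[m]);

    /* Message passing: ending alpha metrics initialize the right
     * neighbor, beginning beta metrics the left one (circular frame). */
    for (int m = 0; m < M; m++) {
        memcpy(alpha_in[(m + 1) % M], alpha_out[m], sizeof alpha_out[m]);
        memcpy(beta_in[(m + M - 1) % M], beta_out[m], sizeof beta_out[m]);
    }
}
```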

Fig. 4. Turbo decoding: (a) serial and (b) shuffled.

2) Component-Decoder Parallelism: Component-decoder parallelism is a new kind of BCJR-SISO decoder parallelism that has become practical with the introduction of the shuffled decoding technique [20]. The basic idea of the shuffled decoding technique is to execute all component decoders in parallel and to exchange extrinsic information as soon as it is created, so that the component decoders use more reliable a priori information. Thus, the shuffled decoding technique performs decoding (computation time) and interleaving (communication time) fully concurrently, while serial decoding implies waiting for the update of all extrinsic information before starting the next half iteration (see Fig. 4 and the sketch below). Modifying serial decoding to restart processing right after the previous half iteration, in order to save the propagation latency, was studied in [21]; the resulting decoding nevertheless requires additional control mechanisms to avoid consistency conflicts in memories. Since communication time is often considered to be the limiting factor in multiprocessor turbo decoding, saving the propagation latency is a crucial property of shuffled decoding. In addition, by doubling the number of BCJR-SISO decoders, component-decoder parallelism halves the iteration period in comparison with the originally proposed serial turbo decoding. Nevertheless, to preserve error-rate performance with shuffled decoding, an iteration overhead of between 5% and 50% is required, depending on the BCJR computation scheme, the degree of sub-block parallelism, the propagation time, and the interleaving rules [22]. In fact, this overhead decreases with the sub-block parallelism degree [19], while the computation overhead of sub-block parallelism increases. Consequently, at high throughput and comparable complexity, the computation overhead grows more by doubling the sub-block parallelism degree than by using shuffled decoding. Thus, for high throughput, shuffled decoding is more efficient than sub-block parallelism. Simulations demonstrate minor variations of the shuffled decoding overhead for low propagation latencies; above a propagation latency of three times the extrinsic information emission time, the overhead reduces the interest of the shuffled decoding technique. Finally, this level of parallelism presents great potential for scalability and high area efficiency. Exploiting this level of parallelism is detailed in Section V.
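The difference between the two schedules of Fig. 4 can be summarized by the sketch below; the decoder and exchange primitives are assumed helpers, not the ASIP interface. Serial decoding interposes a full interleaving barrier between half iterations, whereas shuffled decoding exchanges each extrinsic value as soon as it is produced.

```c
enum { DEC1, DEC2 };                    /* the two component decoders */

extern void siso_half_iteration(int dec);
extern void interleave_barrier(void);   /* wait for all exchanges     */
extern void siso_step(int dec, int k);  /* decode one symbol          */
extern void send_extrinsic(int dec, int k);

/* Serial schedule of Fig. 4(a). */
void turbo_serial(int iters)
{
    for (int it = 0; it < iters; it++) {
        siso_half_iteration(DEC1); interleave_barrier();
        siso_half_iteration(DEC2); interleave_barrier();
    }
}

/* Shuffled schedule of Fig. 4(b): both decoders run concurrently in
 * hardware; a few extra iterations compensate for the less mature
 * a priori information (5%-50% overhead, [22]).                      */
void turbo_shuffled(int iters, int K)
{
    for (int it = 0; it < iters; it++)
        for (int k = 0; k < K; k++) {
            siso_step(DEC1, k); send_extrinsic(DEC1, k);
            siso_step(DEC2, k); send_extrinsic(DEC2, k);
        }
}
```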
C. Turbo-Decoder Level Parallelism

The highest level of parallelism simply duplicates whole turbo decoders to process iterations and/or frames in parallel. Iteration parallelism occurs in a pipelined fashion with a maximum pipeline depth equal to the iteration number, whereas frame parallelism presents no limitation in parallelism degree. Nevertheless, turbo-decoder level parallelism is too area-expensive (all memories and computation resources are duplicated) and presents no gain in frame decoding latency; for these reasons, it is not considered in this work.

IV. EXPLOITING BCJR METRIC LEVEL PARALLELISM: ASIP FOR BCJR SISO DECODER

A. Context of Architectural Choices

As seen in Section III-A, the BCJR metric level parallelism that occurs inside a BCJR SISO decoder is the most area efficient level of parallelism. Thus, a hardware implementation achieving high throughput should first exploit this parallelism. The complexity of the convolutional turbo codes proposed in all existing and emerging standards is limited to 8-state double binary turbo codes or 16-state simple binary turbo codes. Hence, to fully exploit trellis-transition parallelism (Section III-A1) for all standards, a parallelism degree of 32 is required. Future, more complex codes can be supported by splitting trellis sections into sub-sections with a parallelism degree of 32 and by processing the sub-sections sequentially.

Regarding BCJR computation parallelism (Section III-A2), we choose a parallelism degree of two instead of the maximum of four: using a parallelism degree of four with the butterfly scheme leads to underutilization of the BCJR computation units (two of them would be used only half of the time). These parallelism requirements imply the use of specific hardware units. To implement these units while preserving flexibility, application-specific instruction-set processors constitute the perfect solution [7]. The BCJR SISO decoder should also have adequate communication interfaces in order to handle the inter-sub-block communications in the case of BCJR-SISO decoder parallelism (see Section V).

In this context, in order to implement shuffled decoding efficiently, the propagation time should be less than three emission periods (see Section III-B2). Let T_net be the time required to cross the network, P be the pipeline length between the load stage of the extrinsic information and its store stage, #cycle be the number of clock cycles required by the processor to compute an extrinsic information value, and f be the frequency of the processor. Then

    T_emission = #cycle / f                                  (4)

    T_propagation = T_net + P / f                            (5)

    T_propagation / T_emission = (f · T_net + P) / #cycle    (6)

To preserve a low ratio, and thus make the use of the shuffled decoding technique efficient, we choose a short pipeline that keeps P/#cycle small. A long pipeline inhibits the exploitation of the shuffled decoding technique: for example, the ASIP developed in [11], which can emit one extrinsic information value per cycle (#cycle = 1) with a P of 8, has a ratio greater than 8.

B. Architecture of the ASIP

The presented processor, dedicated to the BCJR algorithm, is an enhanced version of the ASIP proposed in [10].

1) Global View: The ASIP is mainly composed of operative and control parts, besides its communication interfaces and attached memories [see Fig. 5(a)]. The operative part is tailored to process a window of 64 symbols by means of two identical BCJR computation units, corresponding to forward and backward processing in the MAP algorithm. Each unit produces recursion metrics and extrinsic information. The storage of the recursion metrics produced by one unit, to be used by the other unit, is performed in eight cross memories per unit, so the processor integrates 16 internal cross memories in order to provide the adequate bandwidth. Another 96-bit wide internal memory (config) contains up to 256 trellis descriptions, so that the processor can be configured for the corresponding standard. Incoming data, which group the systematic and redundant information from the channel in addition to extrinsic information, are stored in external memories attached to the ASIP (input data, info ext). The input data memory has a 32-bit width to contain up to four 8-bit channel information data (systematic or redundant). The info ext memory has a 64-bit width to contain up to four 16-bit extrinsic information data, since four extrinsic information data are required by double binary codes. Depending on the application's requirements, the depth of the incoming data memories can be scaled to cover the frame-length specifications of all existing and emerging standards. The external future and past memory banks are used to initialize the state metric values for the beginning and end of each window according to the message passing method. Each bank has two 128-bit wide memories, one storing forward recursions and the other backward recursions.
These initialization memories are used as follows: 1) at the beginning of the decoding, the state metric registers are either set according to the available information about the state, or reset so that all state metrics have equal probability; 2) after computations (e.g., acquisition, if programmed, or recursion), the state metrics obtained for the beginning and end of each window can be stored in a memory bank in order to be used for initialization at the next iteration. The depth of these memories can be scaled to the number of windows required, with a maximum of 1024. For the kth window associated with the processor, initialization metrics are read from the forward and backward past memories at the addresses corresponding to the neighboring windows, and the refined state metrics are then stored back in the forward and backward past memories at the addresses of the window itself. The future memory bank is only accessible at address 0, for the state metrics of the end of the last window associated with the processor: in this case, the backward initialization metrics are read from the backward future memory at address 0 and the forward metrics are stored in the forward future memory at address 0. For all the external memories, memory latencies of one cycle in read/write access have been integrated in the ASIP pipeline.

2) BCJR Computation Unit: Each BCJR computation unit is based on a single instruction multiple data (SIMD) architecture in order to exploit trellis-transition parallelism. Thus, 32 adder nodes (one per transition) and eight max nodes are incorporated in each unit [see Fig. 5(b)]. The 32 adder nodes are organized as a 4×8 processing matrix. In this organization, for an 8-state double binary code, the row and column of an adder node correspond respectively to the considered symbol decision and to the ending state of the associated transition. For a 16-state simple binary code, transitions with ending states 0 to 7 are mapped on matrix nodes of row 0, if the transition bit decision is 0, or matrix nodes of row 1, if the transition bit decision is 1, whereas states 8 to 15 are mapped on nodes of rows 2 and 3. An adder node [see Fig. 5(c)] contains one adder, multiplexers, one register for configuration (RT), and an output register (RADD). It supports the addition required in a recursion between a state metric (coming from the state metric register bank RMC) and a branch metric (coming from the branch metric register bank RG), and also the addition required in extrinsic information generation, since it can accumulate the previous result with the state metric of the other recursion coming from the register bank RC. The max nodes [see Fig. 5(d)] are shared in the processing matrix so that the max operations can be performed on the RADD registers either row-wise or column-wise, depending on the ASIP instructions. A max node contains three max operators connected in a tree, which makes it possible to perform either one four-input maximum (using the three operators) or two two-input maximums. Results are stored either in the first rows or columns of the RADD matrix or in the RMC bank to achieve the recursion computation.
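A software view of one recursion step on this matrix organization may help. The sketch below mimics the behavior of an add instruction in metric mode (32 parallel additions) followed by a column-wise max, for the 8-state double binary mapping; the `prev_state` table and the register naming are illustrative, not the exact hardware datapath.

```c
#include <float.h>

#define ROWS 4   /* symbol decisions (double binary) */
#define COLS 8   /* ending states                    */

extern const int prev_state[ROWS][COLS]; /* code-dependent trellis table */

static float RMC[COLS];          /* state metric registers        */
static float RG[ROWS][COLS];     /* branch metric registers       */
static float RADD[ROWS][COLS];   /* adder-node output registers   */

/* One recursion step: "add m" on the 4x8 matrix, then column-wise max. */
void add_m_then_max(void)
{
    for (int d = 0; d < ROWS; d++)        /* 32 additions, performed    */
        for (int s = 0; s < COLS; s++)    /* in parallel in hardware    */
            RADD[d][s] = RMC[prev_state[d][s]] + RG[d][s];

    for (int s = 0; s < COLS; s++) {      /* shared max nodes, column-wise */
        float m = -FLT_MAX;
        for (int d = 0; d < ROWS; d++)
            if (RADD[d][s] > m)
                m = RADD[d][s];
        RMC[s] = m;                       /* updated state metrics      */
    }
}
```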

Fig. 5. (a) ASIP architecture. (b) BCJR computation unit. (c) Adder node. (d) Max node. (e) Control unit.

The BCJR computation unit also contains a GLOBAL arithmetic logic unit (ALU), which computes extrinsic information, hard decisions, and other global processing, and a branch metric (BM) generator, which performs the branch metric calculation from the extrinsic information register bank (RIE) and from the channel information available in the pipeline registers (PR). The BM generator supports cyclic puncturing patterns with a maximum pattern length of 16. The pattern length is configurable in a 4-bit register, while the puncturing patterns associated with the four channel information data are configurable through four 16-bit registers, in which each zero corresponds to a punctured bit. The BM generator supports code rates up to 1 for both double binary and simple binary codes.

3) Pipeline Strategy: The ASIP control part is based on a six-stage pipeline [see Fig. 5(e)]. The first two stages (FE, DC) fetch instructions from the program memory and decode them. Then, depending on the instruction type, the operand fetch (OPF) stage loads data from the input data memory to the pipeline registers PR, and/or data from the extrinsic information memory to the RIE registers, and/or data from the past/future memories to the RMC registers, and/or the configuration into the RT registers. In comparison to [10], a BM stage has been added to the pipeline in order to anticipate the calculation of the branch metrics performed in the BM generator, to increase the clock frequency of the ASIP, and to improve the number of cycles per symbol. The execute (EX) stage performs the rest of the BCJR operations. This choice reduces the performance of the ASIP, since the architecture does not fully exploit ILP. However, it was intentionally made to keep the pipeline length as short as possible in order to efficiently support the shuffled decoding technique. Hence, extrinsic information can cross the pipeline from the OPF to the store (ST) stage in only four cycles (see Section IV-A).

Fig. 6. Butterfly ZOL mechanism.

4) Control Structure: The control part also requires several dedicated control registers. The window size is fixed in the register R SIZE, and the current processed symbol inside BCJR computation unit A (respectively, BCJR computation unit B) is stored in the pipeline register ADDRESS A (respectively, ADDRESS B). These addresses, as well as the program counter and the corresponding instruction, are then pipelined. To correctly access the incoming data memories and the past/future memories, the processor has a 10-bit WINDOW ID register that identifies the window being computed and a 10-bit R SUB BLOCK register that sets the number of windows processed by the ASIP. Thus, one ASIP can process up to 1024 windows. In addition, the control architecture provides branch mechanisms and a zero overhead loop (ZOL) fully dedicated to the butterfly scheme (see Section III-A2). To keep the ASIP instruction set compact, the ZOL mechanism is tightly coupled with address generation (see Fig. 6). Thus, the first loop is performed while the address of the symbol processed by unit A is smaller than the address of the symbol processed by unit B. In the case of an odd window size, the middle symbol is processed by unit A when both addresses are equal. Finally, the second loop is performed while the address of the symbol processed by unit B is positive.
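This coupling between the ZOL mechanism and address generation can be paraphrased in C as follows; the loop bodies stand for the instructions enclosed by the ZOL markers, and the helpers are placeholders rather than actual instructions.

```c
extern void first_loop_body(int addrA, int addrB);   /* recursions only    */
extern void second_loop_body(int addrA, int addrB);  /* + extrinsic output */

/* Butterfly ZOL of Fig. 6: unit A walks up from 0, unit B walks down
 * from r_size-1; loop exits are driven by the address comparisons.   */
void butterfly_zol(int r_size)
{
    int addrA = 0, addrB = r_size - 1;

    while (addrA < addrB) {          /* first loop                     */
        first_loop_body(addrA, addrB);
        addrA++; addrB--;
    }
    if (addrA == addrB) {            /* odd window: middle symbol on A */
        first_loop_body(addrA, addrB);
        addrA++; addrB--;
    }
    while (addrB >= 0) {             /* second loop                    */
        second_loop_body(addrA, addrB);
        addrA++; addrB--;
    }
}
```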

Fig. 7. Assembly programs for: (a) WiMAX double binary turbo code and (b) 3GPP simple binary turbo code.

C. ASIP Instruction Set

The designed instruction set of our ASIP architecture is coded on 16 bits. The basic version contains 30 instructions that perform the basic operations of the MAP algorithm. To increase performance, the ASIP was extended with compacted instructions that can perform several operations in different pipeline stages within a single instruction. The following paragraphs detail the instructions required to perform simple decoding. These instructions are divided into three classes: control, operative, and IO.

1) Control: As mentioned previously, the butterfly ZOL instruction repeats the two loops of the butterfly scheme R SIZE times. It requires three markers to retain the relative addresses of the first-loop end instruction, the second-loop begin instruction, and the second-loop end instruction. An unconditional branch instruction has also been designed and uses the direct addressing mode. The SET SIZE instruction sets the ASIP window size, up to a maximum of 64 symbols. SET WINDOW ID and SET SUB BLOCK NB are used to set the WINDOW ID and R SUB BLOCK registers. Thus, the processor manages up to 65 536 symbols (1024 windows of 64 symbols).

2) Operative MAP: An add instruction is defined and used in two different modes: metric computation (add m) and extrinsic information computation (add i). According to the add mode and the configuration registers (RT), each processing node selects the desired operands to perform the addition and stores the result in the corresponding RADD register. In the same way, max1 and max2 instructions are defined with the same modes as the add instruction. The max1 instruction performs only one comparison-selection (two outputs per max node), while the max2 instruction cascades comparison-selection operations (one output per max node). These instructions have to be repeated as often as necessary to obtain either the extrinsic information or the recursion metrics at the considered address in the sub-block. The basic instruction set also contains the DECISION instruction to produce hard decisions on the processed symbols.

3) IO: The basic instruction set also provides input and output instructions. With these instructions, parallel multi-accesses are executed in order to: load decoder input data (LD DATA), input recursion metrics (LD REC), and the configuration (LD CONFIG); store output recursion metrics (ST REC); handle internal cross metrics between the two BCJR computation units (LD CROSS, LD ST); and send extrinsic information packets and hard decisions (ST EXT, DEC). We choose to group extrinsic information in packets for efficient IO operations. Each packet can contain a packet header and the extrinsic information of the current symbol (up to four 16-bit data in the case of double binary codes). The header typically contains the processed symbol address, including the WINDOW ID and the local address (cf. Section V-B).
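A possible C view of such a packet is given below; the field names and widths are illustrative, not the exact hardware format described in the paper.

```c
#include <stdint.h>

/* Illustrative layout of an extrinsic information packet sent by
 * ST EXT: the header carries the processed symbol address (WINDOW ID
 * plus local address inside the 64-symbol window), which the network
 * interface translates through the interleaver for routing.         */
typedef struct {
    uint16_t window_id;   /* 10-bit WINDOW ID of the source window    */
    uint16_t local_addr;  /* symbol address inside the window (0..63) */
    int16_t  ext[4];      /* up to four 16-bit extrinsic values       */
} ext_packet_t;           /* (four are used for double binary codes)  */
```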
D. Application Examples

Fig. 7 gives the ASIP assembly programs required: (a) to decode a 48-symbol sub-block with the turbo code used in the WiMAX standard and (b) to decode a 40-bit sub-block with the turbo code used in the 3GPP standard. In both cases, the first instructions load the required configuration (LD CONFIG) and initialize the recursion metrics (LD REC). Then the butterfly loops are initialized using the ZOLB instruction. The first loop (two instructions) only computes the state metrics: two max operations (max2 instruction) are required for the double binary code, whereas only one max operation (max1 instruction) is required for the simple binary code. The second loop (five instructions) computes, in addition to the state metrics, the extrinsic information for the eight-state code (using three max operations). Finally, the ASIP exports the sub-block ending metrics (ST REC) and the program branches back to the first instruction of the butterfly. Regarding the execution time, with a sub-block of S symbols, about 2 × S/2 = S cycles are needed in the first loop of the butterfly scheme and about 5 × S/2 = 2.5S cycles in the second loop, so about 3.5S cycles are needed to process the S symbols of the sub-block. Thus, roughly 3.5 cycles are needed per symbol (3.5 cycles/bit in simple binary mode and 1.75 cycles/bit in double binary mode).
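These cycle counts translate directly into per-ASIP throughput. The short program below reproduces the arithmetic at the 400-MHz clock frequency reported in Section IV-E; the printed values (114.3 and 228.6 Mb/s) approximate the 114 and 228 Mb/s figures quoted there.

```c
#include <stdio.h>

int main(void)
{
    const double f_mhz = 400.0;           /* ASIP clock (Section IV-E)  */
    const double cycles_per_symbol = 3.5; /* from the loop counts above */

    /* bits per symbol: 1 (simple binary), 2 (double binary) */
    printf("simple binary: %.1f Mb/s\n", f_mhz * 1.0 / cycles_per_symbol);
    printf("double binary: %.1f Mb/s\n", f_mhz * 2.0 / cycles_per_symbol);
    return 0;
}
```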

E. Implementation Results

In this paper, we use the Processor Designer framework from CoWare [23]. Processor Designer is based on the LISA ADL [9], which allows the automatic generation of ASIP models (VHDL, Verilog, and SystemC) for hardware synthesis and system integration, in addition to the generation of the underlying software development tools. Using this tool, a VHDL description of the ASIP architecture was generated. It was synthesized with Synopsys Design Compiler using an ST 90-nm ASIC library under worst case conditions (0.9 V, 105 °C). The optimized ASIP presents a maximum clock frequency of 400 MHz and occupies about 63.1 KGates (equivalent). Compared with the previous ASIP [10], the clock frequency is improved by 20% and the area is decreased by 35%. The presented ASIP can process 228 Mb/s in double binary mode and 114 Mb/s in simple binary mode.

Note that, as a future extension, the performance of the simple binary mode can be significantly improved when the code rate of the component codes is greater than or equal to one half (valid in most standards) by compacting the trellis [24]. Under this condition, two consecutive stages of the simple binary trellis can be compacted into one stage of a new double binary trellis without error-rate degradation. With this new double binary trellis configuration, a simple binary code can also be decoded at 228 Mb/s. Trellis compaction requires a different soft decision management implying extra operations. This feature is not implemented in the current ASIP architecture; however, it can be supported with minor modifications of the LD DATA and ST EXT instructions. The LD DATA instruction should handle the soft input bit-to-symbol conversion (new add operators in the BM stage) and the ST EXT instruction should handle the soft output symbol-to-bit conversion (new max operators in the ST stage). These elementary operators do not change the maximum clock frequency, since they introduce their own non-critical path. Thus, negligible hardware overhead is induced without any degradation in throughput. Besides, as the packet generated by the ASIP then contains two binary soft decisions, the network interface has to split the processor packet into two network packets to perform interleaving.

TABLE I. Comparison of different turbo decoding implementations for UMTS.

Table I compares the performance results of log-MAP turbo decoding implementations for the UMTS turbo code. We can observe that the designed ASIP has excellent throughput performance, thanks to a number of cycles per bit per SISO close to one; one is reached only by fully dedicated hardware implementations. Compared to [11], our ASIP presents a slightly lower throughput for an almost similar area (63 versus 56 KGates) and with a shorter pipeline depth (6 versus 11 stages), in order to make shuffled decoding possible. With the trellis compaction extension, which reveals the real potential of the proposed architecture for decoding simple binary codes, the ASIP can reach a slightly better throughput despite the use of a 90-nm target technology. Furthermore, thanks to its dedicated past/future memories, our processor can skip the acquisition phases efficiently and without degradation; the figures of Table I do not integrate this acquisition computation overhead, which can rise to around 15% in [25].

V. EXPLOITING BCJR-SISO DECODER LEVEL PARALLELISM: MULTI-ASIP PLATFORM

The ASIP presented in the previous section fully exploits the first parallelism level (BCJR metric level parallelism, Section III-A) by efficiently performing all the computations of a BCJR-SISO decoder. In order to exploit the second parallelism level (BCJR-SISO decoder level, Section III-B), a multi-ASIP architecture is required.

Fig. 8. Extrinsic information exchanges in BCJR-SISO decoder level parallelism.
A. Multi-ASIP Turbo Decoding Overview

Sub-block parallelism implies the use of one BCJR-SISO decoder, e.g., our ASIP, for each sub-block. The state metrics of a sub-block can then be initialized using the message passing technique (see Section III-B2) through the state metric interfaces of the ASIP. Component-decoder parallelism implies the use of at least one ASIP for each component decoder, where the ASIPs are executed in parallel and exchange extrinsic information concurrently (shuffled decoding). Fig. 8 illustrates the architecture template required to exploit both kinds of parallelism. Besides the multiple ASIP integration, this figure shows the need for dedicated communication structures to accommodate the massive information exchanges between the ASIPs.

Fig. 9. ASIP-based multiprocessor architecture for turbo decoding.

Regarding the communication interfaces, the developed ASIP incorporates the required state metric interfaces and the extrinsic information ones. On the other hand, the interleaving of extrinsic information has to be handled by the communication structure. As seen in Section IV-A, efficient shuffled decoding imposes a low propagation-to-emission ratio (less than three). The proposed ASIP architecture already leads to a low ratio; thus, to preserve shuffled decoding efficiency, the communication structure has to ensure a short propagation time, which can be qualified using (6).

B. Communication Structures

In order to illustrate how we implement the required communication structures, Fig. 9 presents a four-ASIP turbo decoder architecture where each component decoder is implemented using two ASIPs. This figure shows the three kinds of networks that are used: the data interface network, the state metric network, and the extrinsic information network. First, the data interface network is used to dispatch new channel data from the frame memory of the IO interface to the local input data memories of the ASIPs and, concurrently, to gather output data from the ASIPs. Second, the state metric network enables exchanges between neighboring ASIPs in a component decoder. These exchanges are mandatory to initialize sub-blocks with the message passing technique. As seen in Section IV-B1, an ASIP accesses the initialization values of the beginning and end of its sub-block at address 0 of its past/future memories. In the case of full sub-block parallelism (i.e., no windowing), these memories can be replaced by buffers, and the state metric network consists of a set of buffers between neighboring processors, reflecting the trellis termination strategy. Thus, a circular trellis termination strategy, i.e., one where the ending and beginning states of the frame are identical, implies the use of a buffer between the first and last ASIPs (see Fig. 9). Finally, the extrinsic information network is based on routers to make extrinsic information exchanges possible between the ASIPs. As the proposed ASIP supports the butterfly scheme, two packets can be sent on this network per emission and per ASIP. The packet headers generated by the ASIP are used by the network interfaces (NIs) to perform interleaving: each NI regenerates a new header with the corresponding routing information, as sketched below. The routers integrate buffering mechanisms and support up to two input ports and two output ports. Fig. 9 presents a simple topology supporting four ASIPs; architectures with more than four processors require more complex topologies. It is worth noting that these networks take advantage of packet switching communication [26]. In [18], we proposed the use of multistage interconnection networks on chip based on Butterfly and Benes topologies. These topologies are scalable and present the required bandwidth and latencies with a reasonable hardware complexity. Even if their scalability is restricted to numbers of input ports that are powers of two, they keep the mean propagation latency under control, as it grows with the logarithm of the network size. This valuable property contributes significantly to fulfilling the shuffled decoding requirements.
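The header translation performed by an NI can be sketched as follows. The interleaver table `pi`, the number of windows per ASIP, and the address split are illustrative assumptions, not the actual NI implementation.

```c
#include <stdint.h>

#define WIN     64   /* symbols per window                        */
#define SUBBLK   4   /* windows per ASIP (illustrative)           */

extern const uint32_t pi[];   /* interleaver table of the standard */

typedef struct {
    uint16_t dest_asip;       /* routing information for the NoC   */
    uint16_t window_id;
    uint16_t local_addr;
} route_t;

/* Translate a packet header (source symbol address) into the
 * destination header used by the routers.                         */
route_t ni_translate(uint16_t window_id, uint16_t local_addr)
{
    uint32_t src = (uint32_t)window_id * WIN + local_addr;
    uint32_t dst = pi[src];                 /* interleaved address */
    route_t  r;

    r.dest_asip  = (uint16_t)(dst / (WIN * SUBBLK));
    r.window_id  = (uint16_t)(dst / WIN);
    r.local_addr = (uint16_t)(dst % WIN);
    return r;
}
```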
C. Results

With the conventional turbo decoding technique, the achievable throughput of a multiprocessor architecture does not increase linearly with the number of processors, especially when this number is high [2], [8]. This degradation is mainly due to the interleaving, the IO communication delays, and the sub-block parallelism [19]. The use of the shuffled decoding technique limits this degradation. Thus, the throughput of the proposed ASIP-based multiprocessor architecture depends on the number of integrated ASIPs and on the shuffled decoding efficiency (see Section III-B2). The low-latency extrinsic information network (see Section V-B) and the short ASIP pipeline (see Section IV-B3) guarantee a high shuffled decoding efficiency. As an example, Table II summarizes the multiprocessor turbo decoding performance for the WiMAX double binary code. The results are compared at the error-rate performance level obtained with five iterations without BCJR-SISO decoder parallelism.
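A back-of-the-envelope model of this scaling, under simplifying assumptions (two shuffled component decoders sharing the P ASIPs, 3.5 cycles per symbol, seven shuffled iterations, a 400-MHz clock, and communication fully hidden behind computation), lands close to the Table II figures; the residual gap is due to IO and pipeline overheads.

```c
#include <stdio.h>

int main(void)
{
    const double f_hz = 400e6, cycles_per_symbol = 3.5;
    const int    n_sym = 752, n_bits = 1504, iterations = 7;

    for (int p = 2; p <= 16; p *= 2) {
        /* p/2 ASIPs per component decoder, each owning n_sym/(p/2) symbols */
        double sym_per_asip = (double)n_sym / (p / 2);
        double cycles       = iterations * sym_per_asip * cycles_per_symbol;
        printf("%2d ASIPs: %6.1f Mb/s\n", p, n_bits * f_hz / cycles / 1e6);
    }
    return 0;
}
```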

TABLE II. Performance for WiMAX double binary turbo decoding, N = 1504 bits (752 symbols), ST 90 nm.

The table provides the latency of the extrinsic information network using a Butterfly topology with a maximum clock frequency of 600 MHz. We can note that the latency requirement is always respected, inducing efficient shuffled decoding. Thus, shuffled decoding requires seven iterations in this application example, whatever the number of ASIPs. Observing the throughput results with respect to the degree of BCJR-SISO parallelism shows a more linear increase than in the literature [2], [8], although the block size (an important parameter of throughput linearity [19]) is smaller. This observation is explained by a halved sub-block parallelism degree (thanks to shuffled decoding) that minimizes the sub-block parallelism degradation. Table II also shows that exploiting the diverse parallelism levels of turbo decoding induces a reasonable overall area overhead (including memories, networks, and ASIPs) while achieving outstanding throughput rates. Note that the overall area is mainly dominated by the logic when the number of ASIPs increases; Table II illustrates how the memory area decreases down to 13% of the overall area. For comparison, with 16 SISOs and five iterations, the multi-ASIP architecture in [8] achieves a lower throughput at a clock frequency of 133 MHz with a 180-nm technology, and the dedicated ASIC in [2] achieves 340 Mb/s at a clock frequency of 256 MHz with a 130-nm technology. Even with technology rescaling, our flexible platform (249 Mb/s at 400 MHz, 90 nm) comes closer to the full custom design performance.

VI. CONCLUSION

In order to meet the flexibility and performance constraints of current and future digital communication applications, multiple application-specific instruction-set processors combined with dedicated communication and memory infrastructures are required. This paper provides a clear and detailed bridge between a three-level classification of turbo decoding parallelism techniques and the associated VLSI implementations. We show how a multi-ASIP platform has been derived from this classification to enable flexible high-throughput turbo decoding. The ASIP has an SIMD architecture dedicated to the first level of this classification, a specialized and extensible instruction set, and a six-stage pipeline control. It can process 228 Mb/s in double binary mode and 114 Mb/s in simple binary mode for an occupied area of 63.1 KGates. A future extension with negligible hardware overhead is also proposed in order to double the throughput in simple binary mode. The memory architecture and communication interfaces allow for the efficient assembling of multiple ASIP cores. Considering the second parallelism level, we have illustrated how ASIPs can be aggregated in a multiprocessor platform to process different sub-blocks and component decoders in parallel. The proposed ASIP-based multiprocessor architecture breaks the interleaving bottleneck thanks to the shuffled decoding technique and allows a high throughput while preserving flexibility and scalability. The presented platform supports the turbo codes of all existing and emerging standards. Results obtained for WiMAX turbo decoding with five iterations demonstrate around 250 Mb/s throughput using a 16-ASIP multiprocessor architecture. We are now planning to extend the proposed multi-ASIP platform to other digital communication applications.

REFERENCES
[1] C. Berrou, A. Glavieux, and P. Thitimajshima, "Near Shannon limit error-correcting coding and decoding: Turbo-codes," presented at the Int. Conf. Commun. (ICC), Geneva, Switzerland, May 1993.
[2] G. Prescher, T. Gemmeke, and T. Noll, "A parametrizable low-power high-throughput turbo-decoder," in Proc. ICASSP, Mar. 2005.
[3] Xilinx, San Jose, CA, "3GPP Turbo Decoder v3.1."
[4] D. Gnaëdig, E. Boutillon, M. Jézéquel, V. Gaudet, and G. Gulak, "On multiple slice turbo codes," in Proc. Int. Symp. Turbo Codes Related Topics, Brest, France, Sep. 2003.
[5] A. La Rosa, C. Passerone, F. Gregoretti, and L. Lavagno, "Implementation of a UMTS turbo-decoder on a dynamically reconfigurable platform," presented at the Des., Autom. Test Eur. (DATE) Conf., Paris, France, Feb. 2004.
[6] R. Kothandaraman and M. J. Lopez, "An efficient implementation of turbo decoder on ADI TigerSHARC TS201 DSP," in Proc. SPCOM, Dec. 2004.
[7] A. Orailoglu and A. Veidenbaum, "Application specific microprocessors (Guest Editors' Introduction)," IEEE Des. Test Comput., Jan./Feb. 2003.
[8] F. Gilbert, M. Thul, and N. Wehn, "Communication centric architectures for turbo-decoding on embedded multiprocessors," in Proc. Des., Autom. Test Eur. (DATE) Conf., Munich, Germany, Mar. 2003.
[9] A. Hoffmann, O. Schliebusch, A. Nohl, G. Braun, and H. Meyr, "A methodology for the design of application specific instruction set processors (ASIP) using the machine description language LISA," presented at the ICCAD, San Jose, CA, Nov. 2001.
[10] O. Muller, A. Baghdadi, and M. Jézéquel, "ASIP-based multiprocessor SoC design for simple and double binary turbo decoding," in Proc. Des., Autom. Test Eur. (DATE) Conf., Munich, Germany, Mar. 2006.
[11] T. Vogt and N. Wehn, "A reconfigurable application specific instruction set processor for Viterbi and log-MAP decoding," in Proc. IEEE Workshop Signal Process. Syst. (SiPS), Banff, Canada, Oct. 2006.
[12] J. Hagenauer, "The turbo principle: Tutorial introduction and state of the art," in Proc. Int. Symp. Turbo Codes Related Topics, Brest, France, Sep. 1997.

[13] L. Bahl, J. Cocke, F. Jelinek, and J. Raviv, "Optimal decoding of linear codes for minimizing symbol error rate," IEEE Trans. Inf. Theory, vol. IT-20, no. 2, pp. 284-287, Mar. 1974.
[14] P. Robertson, P. Hoeher, and E. Villebrun, "Optimal and sub-optimal maximum a posteriori algorithms suitable for turbo decoding," Eur. Trans. Telecommun. (ETT), vol. 8, no. 2, pp. 119-125, 1997.
[15] G. Masera, G. Piccinini, M. R. Roch, and M. Zamboni, "VLSI architectures for turbo codes," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 7, no. 3, pp. 369-379, Sep. 1999.
[16] Y. Zhang and K. K. Parhi, "Parallel turbo decoding," in Proc. Int. Symp. Circuits Syst., May 2004, vol. 2, pp. II-509-II-512.
[17] A. Abbasfar and K. Yao, "An efficient architecture for high speed turbo decoders," in Proc. ICASSP, Apr. 2003, pp. IV-521-IV-524.
[18] H. Moussa, O. Muller, A. Baghdadi, and M. Jézéquel, "Butterfly and Benes-based on-chip communication networks for multiprocessor turbo decoding," presented at the Des., Autom. Test Eur. (DATE) Conf., Nice, France, Apr. 2007.
[19] O. Muller, A. Baghdadi, and M. Jézéquel, "Exploring parallel processing levels for convolutional turbo decoding," in Proc. ICTTA, Apr. 2006.
[20] J. Zhang and M. P. C. Fossorier, "Shuffled iterative decoding," IEEE Trans. Commun., vol. 53, no. 2, pp. 209-213, Feb. 2005.
[21] D. Gnaedig, E. Boutillon, J. Tousch, and M. Jézéquel, "Towards an optimal parallel decoding of turbo codes," presented at the 4th Int. Symp. Turbo Codes Related Topics, Munich, Germany, Apr. 2006.
[22] O. Muller, A. Baghdadi, and M. Jézéquel, "On the parallelism of convolutional turbo decoding and interleaving interference," in Proc. IEEE Global Telecommun. Conf. (GLOBECOM), Nov. 2006.
[23] CoWare Inc. homepage. [Online]. Available: http://www.coware.com/
[24] G. Fettweis and H. Meyr, "Parallel Viterbi algorithm implementation: Breaking the ACS-bottleneck," IEEE Trans. Commun., vol. 37, no. 8, pp. 785-790, Aug. 1989.
[25] T. Vogt, C. Neeb, and N. Wehn, "A reconfigurable multi-processor platform for convolutional and turbo decoding," presented at the ReCoSoC, Montpellier, France, 2006.
[26] L. Benini and G. De Micheli, "Networks on chips: A new SoC paradigm," IEEE Computer, vol. 35, no. 1, pp. 70-78, Jan. 2002.

Olivier Muller (M'06) received the engineering (M.S.) and Ph.D. degrees in telecommunications and electrical engineering from the École Nationale Supérieure des Télécommunications de Bretagne (TELECOM Bretagne), Brest, France, in 2004 and 2007, respectively. In 2003, he worked on co-design with Motorola, Toulouse, France. He is currently a Postdoctoral Researcher with the Electronics Department, TELECOM Bretagne, Brest, France. His research interests include multiprocessor architectures, application-specific processors, on-chip networks, digital communication algorithms, and information theory.

Amer Baghdadi received the electronic engineering and M.S. degrees and the Ph.D. degree in microelectronics from the Institut National Polytechnique de Grenoble (INPG), Grenoble, France, in 1998, 1998, and 2002, respectively. He has been an Associate Professor with the Electronics Department, TELECOM Bretagne, Brest, France, since December 2003. In 2002, he was an Assistant Professor with the INPG while continuing his research activities with the System-Level Synthesis Group, TIMA Laboratory.
His research interests include system-on-chip architectures and design methodology, especially the design and exploration of application-specific multiprocessor architectures, performance estimation, and on-chip communication architectures. Recently, his research activities have targeted multiprocessor and network-on-chip architecture design for digital communication applications. Dr. Baghdadi was nominated for a Best Paper Award at the 2001 DATE Conference for his work on the design automation of application-specific multiprocessor SoCs. He serves on the technical program committees of the RSP, ICTTA, and DATE conferences.

Michel Jézéquel (M'02) was born in Saint Renan, France, on February 26. He received the Ingénieur degree in electronics from the École Nationale Supérieure de l'Électronique et de ses Applications, Paris, France. He was then a Design Engineer with CIT ALCATEL, Lannion, France. After an experience in a small company, he followed a one-year course on software design. In 1988, he joined the École Nationale Supérieure des Télécommunications de Bretagne, Brest, France, where he is currently a Professor and Head of the Electronics Department. His main research interest is circuit design for digital communications. He focuses his activities on turbo codes, the adaptation of the turbo principle to the iterative correction of intersymbol interference, the design of interleavers, and the interaction between modulation and error-correcting codes.

12 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS 1 From Parallelism Levels to a Multi-ASIP Architecture for Turbo Decoding Olivier Muller, Member, IEEE, Amer Baghdadi, and Michel Jézéquel, Member, IEEE Abstract Emerging digital communication applications and the underlying architectures encounter drastically increasing performance and flexibility requirements. In this paper, we present a novel flexible multiprocessor platform for high throughput turbo decoding. The proposed platform enables exploiting all parallelism levels of turbo decoding applications to fulfill performance requirements. In order to fulfill flexibility requirements, the platform is structured around configurable application-specific instruction-set processors (ASIP) combined with an efficient memory and communication interconnect scheme. The designed ASIP has an single instruction multiple data (SIMD) architecture with a specialized and extensible instruction-set and 6-stages pipeline control. The attached memories and communication interfaces enable its integration in multiprocessor architectures. These multiprocessor architectures benefit from the recent shuffled decoding technique introduced in the turbo-decoding field to achieve higher throughput. The major characteristics of the proposed platform are its flexibility and scalability which make it reusable for all simple and double binary turbo codes of existing and emerging standards. Results obtained for double binary WiMAX turbo codes demonstrate around 250 Mb/s throughput using 16-ASIP multiprocessor architecture. Index Terms application-specific instruction-set processor (ASIP), <PLEASE DEFINE BCJR.> BCJR, parallel processing, multiprocessor, turbo decoding. I. INTRODUCTION SYSTEMS on chips (SoCs) in the field of digital communication are becoming more and more diversified and complex. In this field, performance requirements, like throughput and error rates, are becoming increasingly severe. To reduce the error rate with a lower signal-to-noise ratio (SNR) (closer to the Shannon limit), turbo (iterative) processing algorithms have recently emerged [1]. These algorithms, which originally concerned channel coding, are currently being reused over the whole digital communication system, like for equalization, demodulation, synchronization, and multiple-input multiple-output (MIMO). Furthermore, the severe time-to-market constraints and the continuously developing new standards and applications in this digital communication, make resorting to new design methodologies and the proposal of a flexible turbo communication Manuscript received April 04, 2007; revised July 11, 2007, September 05, 2007, and January 04, This work has been supported in part by the European Commission through the Network of Excellence in Wireless Communications (NEWCOM). The authors are with the Electronics Department, TELECOM Bretagne, Technopôle Brest Iroise, Brest, France ( olivier. muller@telecom-bretagne.eu; amer.baghdadi@telecom-bretagne.eu; michel.jezequel@telecom-bretagne.eu). Digital Object Identifier /TVLSI platform inevitable. Flexibility could be achieved by the use of programmable/configurable processors rather than application-specific integrated circuits (ASICs). Thus, embedded multiprocessor architectures integrating an adequate communication network-on-chip (NoC) will constitute an ultimate solution to preserve flexibility while achieving the required computation and throughput rates. 
Algorithm parallelization of turbo decoding has been widely investigated over the last few years and several implementations have been proposed. Some of these implementations succeeded in achieving high throughput for specific standards with a fully dedicated architecture. High-performance turbo decoders dedicated to 3GPP standards have been implemented in ASIC [2] and in field-programmable gate arrays (FPGAs) [3]. In [4], a new class of turbo codes more suitable for high-throughput implementation is proposed. However, such implementations do not take flexibility and scalability issues into account. Unlike these implementations, others include software and/or reconfigurable parts to achieve the required flexibility, at the cost of lower throughput. This is addressed, for example, in [5] with the XiRISC processor, a reconfigurable processor using an embedded FPGA, or in [6] with a digital signal processor (DSP) integrating dedicated instructions for turbo decoding. The price of this great flexibility, however, is that these solutions do not fulfill the performance requirements of all standards (e.g., 150 Mb/s for HomePlug). In fact, the concept of the application-specific instruction-set processor (ASIP) [7] constitutes the appropriate solution for fulfilling the flexibility and performance constraints of emerging and future applications. The use of ASIPs in embedded SoCs is becoming inevitable due to the rapid increase in complexity and flexibility of emerging applications and evolving standards. Two approaches are mainly proposed by EDA vendors for ASIP design. The first approach is based on an environment where the designer can select and configure predefined hardware elements to enhance a predefined basic processor core according to the application needs. User-defined hardware blocks, together with the corresponding instructions, can be added to the processor. This approach was used in a parallel multiprocessor implementation [8]. Despite the advanced heterogeneous communication network that optimizes data transfers and enables a parallel turbo-decoding implementation, the platform lacks performance due to the predefined basic processor core imposed by this approach. In the second approach, the designer has full design freedom thanks to an architecture description language (ADL), which is used to specify the instruction set and the ASIP architecture [9]. In [10], we proposed the first ASIP dedicated to turbo codes using this approach. Thanks to its performance and to the multiprocessor template proposed, the solution was able to cover almost all standards, while presenting a few limitations (support limited to 8-state trellises; instruction-level parallelism (ILP) not fully exploited).

Fig. 1. Turbo decoding: (a) turbo decoder; (b) BCJR SISO; (c) trellis.

Another ASIP based on the same approach, proposed in [11] and [25], resolves these limitations and achieves higher throughput thanks to a full exploitation of ILP through a long pipeline. In addition, this ASIP provides support for convolutional codes and integrates interfaces for a multiprocessor platform. However, its long pipeline inhibits the exploitation of the most efficient parallelism for high throughput (component-decoder parallelism, Section III-B2). In this work, we present an original parallelism classification for turbo decoding applications and directly link the different parallelism levels of this classification to their VLSI implementation techniques and issues in a multi-ASIP platform. An improved ASIP model enabling the support of all these parallelism techniques is proposed. It can be configured to decode all simple and double binary turbo codes. Besides the specific arithmetic units that make up this processor model, special care was taken with the memory organization and the communication buses. Its architecture facilitates its integration in a multiprocessor scheme, enabling an efficient and flexible implementation of the turbo decoding algorithm. The rest of this paper is organized as follows. Section II presents the turbo decoding algorithm for a better understanding of the subsequent sections. Section III analyzes all parallel processing techniques of turbo decoding and proposes a three-level classification of these techniques. Section IV then details the proposed single instruction multiple data (SIMD) ASIP architecture for turbo decoding, which fully exploits the first level of parallelism. Exploiting the other parallelism levels requires resorting to multi-ASIP architectures. This is illustrated in Section V, where we make use of the second level of parallelism, which achieves high throughput with reasonable hardware complexity. Finally, Section VI summarizes the results obtained and concludes this paper.

II. CONVOLUTIONAL TURBO DECODING

In iterative decoding algorithms [12], the underlying turbo principle relies on extrinsic information exchanges and iterative processing between different soft-input soft-output (SISO) modules. Using input information and a priori extrinsic information, each SISO module computes a posteriori extrinsic information.

Fig. 2. BCJR computation schemes: (a) forward-backward and (b) butterfly.
Fig. 3. Frame decomposition and sub-block parallelism.

This a posteriori extrinsic information becomes the a priori information for the other modules and is exchanged via interleaving and deinterleaving processes. For convolutional turbo codes [1], classically constructed with two convolutional component codes, the SISO modules process the BCJR or forward-backward algorithm [13], which is the optimal algorithm for the maximum a posteriori (MAP) decoding of convolutional codes (see Fig. 1). A BCJR SISO will thus first compute the branch metrics (or $\gamma$ metrics), each representing the probability of a transition occurring between two trellis states ($s'$: starting state; $s$: ending state). Note that a branch metric can be decomposed into an intrinsic part, due to the systematic information and the a priori information, and an extrinsic part, due to the redundancy information.
Then a BCJR SISO computes the forward and backward recursions. The forward recursion (or $\alpha$ recursion) computes a trellis section (i.e., the probability of all states of the trellis regarding the $k$th symbol) using the previous trellis section and the branch metrics between these two sections, while the backward recursion (or $\beta$ recursion) computes a trellis section using the future trellis section and the branch metrics between these two sections. With the max-log-MAP algorithm [14], these recursions can be expressed as

$$\alpha_k(s) = \max_{s'} \left[ \alpha_{k-1}(s') + \gamma_k(s', s) \right] \quad (1)$$

$$\beta_k(s') = \max_{s} \left[ \beta_{k+1}(s) + \gamma_{k+1}(s', s) \right] \quad (2)$$

Finally, the extrinsic information of the $k$th symbol is computed for every decision $d$ from the forward recursion, the backward recursion, and the extrinsic part $\gamma^{e}$ of the branch metrics:

$$Z_k(d) = \max_{(s',s)\,:\,d(s',s)=d} \left[ \alpha_{k-1}(s') + \gamma_k^{e}(s', s) + \beta_k(s) \right]. \quad (3)$$
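To make the data flow of (1)-(3) concrete, the following is a minimal illustrative Python sketch of one max-log-MAP SISO pass over a sub-block. The trellis representation (a list of transitions with starting state, ending state, and decision) and all function names are our own assumptions for the example, not the paper's implementation.

```python
import numpy as np

def max_log_map_siso(gamma, gamma_ext, transitions, n_states, n_decisions):
    """One max-log-MAP pass. gamma[k][t]: branch metric of transition t at
    symbol k; gamma_ext[k][t]: its extrinsic part. Returns extrinsic Z[k][d]."""
    K = len(gamma)
    alpha = np.full((K + 1, n_states), -np.inf); alpha[0] = 0.0  # uniform init
    beta = np.full((K + 1, n_states), -np.inf); beta[K] = 0.0
    for k in range(K):                        # forward recursion (1)
        for t, (sp, sn, _) in enumerate(transitions):
            alpha[k + 1, sn] = max(alpha[k + 1, sn], alpha[k, sp] + gamma[k][t])
    for k in range(K - 1, -1, -1):            # backward recursion (2)
        for t, (sp, sn, _) in enumerate(transitions):
            beta[k, sp] = max(beta[k, sp], beta[k + 1, sn] + gamma[k][t])
    Z = np.full((K, n_decisions), -np.inf)    # extrinsic information (3)
    for k in range(K):
        for t, (sp, sn, d) in enumerate(transitions):
            Z[k, d] = max(Z[k, d], alpha[k, sp] + gamma_ext[k][t] + beta[k + 1, sn])
    return Z

# Toy usage: 2-state code, 4 transitions per section, 3 symbols, binary decisions.
T = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
g = np.random.default_rng(0).normal(size=(3, 4))
Z = max_log_map_siso(g, 0.5 * g, T, n_states=2, n_decisions=2)
```

In hardware, the inner loops over the transitions collapse into the parallel ACS operations discussed in Section III-A.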

III. PARALLEL PROCESSING LEVELS

In turbo decoding with the BCJR (Bahl-Cocke-Jelinek-Raviv) algorithm, parallelism techniques can be classified at three levels: 1) BCJR metric level parallelism; 2) BCJR SISO decoder level parallelism; and 3) turbo-decoder level parallelism. The first (fine-grain) parallelism level concerns the elementary symbol computations inside a SISO decoder processing the BCJR algorithm. Parallelism between these SISO decoders, inside one turbo decoder, belongs to the second parallelism level. The third (coarse-grain) parallelism level duplicates the turbo decoder itself.

A. BCJR Metric Level Parallelism

The BCJR metric level parallelism concerns the processing of all the metrics involved in the decoding of each received symbol inside a BCJR SISO decoder. It exploits the inherent parallelism of the trellis structure, and also the parallelism of the BCJR computations [15].

1) Parallelism of Trellis Transitions: Trellis-transition parallelism can easily be extracted from the trellis structure, as the same operations are repeated for all transition pairs (a vectorized sketch of these parallel operations is given at the end of Section III-A). In the log domain [14], these operations are either add-compare-select (ACS) operations for the max-log-MAP algorithm or ACSO operations (ACS with a correction offset [14]) for the log-MAP algorithm. Each BCJR computation (1)-(3) requires a number of ACS-like operations equal to half the number of transitions per trellis section. Thus, this number, which depends on the structure of the convolutional code, constitutes the upper bound of the trellis-transition parallelism degree. Furthermore, this parallelism implies a low area overhead, as only the ACS units have to be duplicated. In particular, no additional memories are required, since all the parallelized operations are executed on the same trellis section, and consequently on the same data.

2) Parallelism of BCJR Computations: A second metric-level parallelism can be orthogonally extracted from the BCJR algorithm through a parallel execution of the three BCJR computations. Parallel execution of the backward recursion and of the a posteriori probability (APP) computations was proposed with the original forward-backward scheme, depicted in Fig. 2(a). In this scheme, the BCJR computation parallelism degree is equal to one in the forward part and two in the backward part. To increase this parallelism degree, several schemes have been proposed [16]. Fig. 2(b) shows the butterfly scheme, which doubles the parallelism degree of the original scheme through parallelism between the forward and backward recursion computations. This is performed without any memory increase; only the BCJR computation resources have to be duplicated. Thus, BCJR computation parallelism is area efficient but limited in parallelism degree.

In conclusion, BCJR metric level parallelism achieves optimal area efficiency, as it does not affect the memory size, which occupies most of the area in a turbo decoder circuit. Exploiting this level of parallelism is detailed in Section IV. Nevertheless, the parallelism degree is limited by the decoding algorithm and the code structure. Achieving a higher parallelism degree thus implies exploring higher processing levels.
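As a toy illustration of trellis-transition parallelism, this sketch (again an assumption-laden model, not the ASIP datapath) evaluates one forward step of (1) for all transitions of a section at once and reduces per ending state with a compare-select:

```python
import numpy as np

def acs_step(alpha_prev, gamma_k, src, dst, n_states):
    """One forward step: all transition additions happen element-wise
    (SIMD-like), then one compare-select per ending state, i.e., the
    ACS operation of the max-log-MAP algorithm."""
    cand = alpha_prev[src] + gamma_k            # all transition adds in parallel
    alpha_next = np.full(n_states, -np.inf)
    np.maximum.at(alpha_next, dst, cand)        # compare-select per ending state
    return alpha_next

# Example: 8 states, 32 transitions (double binary case) with dummy metrics.
rng = np.random.default_rng(0)
src = np.repeat(np.arange(8), 4)                # hypothetical connectivity
dst = rng.integers(0, 8, size=32)
alpha = acs_step(np.zeros(8), rng.normal(size=32), src, dst, n_states=8)
```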
B. BCJR-SISO Decoder Level Parallelism

The second level of parallelism concerns the SISO decoder level. It consists of the use of multiple SISO decoders, each executing the BCJR algorithm and processing a sub-block of the same frame in one of the two interleaving orders. At this level, parallelism can be applied either on sub-blocks and/or on component decoders.

1) Sub-Block Parallelism: In sub-block parallelism, each frame is divided into M sub-blocks, and each sub-block is then processed by a BCJR-SISO decoder using adequate initializations [16], [17]. A formalism is proposed in [16] to compare various existing sub-block decoding schemes in terms of parallelism degree and memory efficiency. Besides the duplication of BCJR-SISO decoders, this parallelism imposes two other constraints. On the one hand, interleaving has to be parallelized in order to scale the communication bandwidth proportionally [8]. Due to the scrambling property of interleaving, this parallelism can induce communication conflicts, except for the interleavers of emerging standards that are conflict-free. These conflicts force the communication structure to implement conflict-management mechanisms and imply a long and variable communication time. This issue is generally addressed by minimizing the interleaving delay with specific communication networks [8], [18]. On the other hand, BCJR-SISO decoders have to be initialized adequately, either by acquisition or by message passing [17], [19]. The acquisition method estimates the recursion metrics using an overlapping region called the acquisition window or prologue. Starting from a trellis section where all the states are initialized to a uniform constant, the acquisition window is processed over its length to provide reliable recursion metrics at the sub-block ending points. The message passing method initializes a sub-block with recursion metrics computed during the last iteration in the neighboring sub-blocks; a behavioral sketch of this exchange is given at the end of this subsection. In [19], we observed that message passing initialization enables more efficient decoding, reaching better throughput at comparable hardware complexity. Thus, message passing initialization is mainly considered in the rest of this paper. Regarding the first iteration, the message passing method is undefined and the iterative process starts with a uniform initialization of the sub-block ending states. Alternatively, an initialization by acquisition can slightly improve the convergence of the iterative process, but the resulting gain is usually less than one iteration. This parallelism is necessary to reach high throughput. Nevertheless, its efficiency at high throughput is strongly reduced, since resolving the initialization issue implies a computation overhead following Amdahl's law (due to acquisition length or additional iterations) [19].
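The following sketch illustrates message-passing initialization between sub-blocks; the decoder stub and the bookkeeping of edge metrics are hypothetical stand-ins for the past/future memory mechanism detailed in Section IV-B.

```python
import numpy as np

def bcjr_stub(gamma_sub, alpha_init, beta_init):
    """Stand-in for a BCJR-SISO pass over one sub-block: returns the edge state
    metrics the real decoder would refine (dummies derived from the inputs)."""
    return alpha_init + gamma_sub.sum(), beta_init + gamma_sub.sum()

def message_passing_iteration(frame_gamma, M, n_states, boundaries=None):
    """One pass over a frame split into M sub-blocks. boundaries[m] holds the
    (alpha at sub-block start, beta at sub-block end) metrics saved during the
    previous iteration, playing the role of the past/future memories of Sec. IV-B."""
    K = len(frame_gamma) // M
    edges = []
    for m in range(M):
        a0, bK = boundaries[m] if boundaries else (np.zeros(n_states),) * 2
        edges.append(bcjr_stub(frame_gamma[m * K:(m + 1) * K], a0, bK))
    # Message passing: each sub-block is re-initialized with its neighbors' edges.
    return [(edges[m - 1][0] if m > 0 else np.zeros(n_states),
             edges[m + 1][1] if m < M - 1 else np.zeros(n_states))
            for m in range(M)]

b = None
for _ in range(3):   # the first iteration starts from uniform metrics (b is None)
    b = message_passing_iteration(np.ones(64), M=4, n_states=8, boundaries=b)
```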

Fig. 4. Turbo decoding: (a) serial and (b) shuffled.

2) Component-Decoder Parallelism: Component-decoder parallelism is a new kind of BCJR-SISO decoder parallelism that has become practical with the introduction of the shuffled decoding technique [20]. The basic idea of the shuffled decoding technique is to execute all component decoders in parallel and to exchange extrinsic information as soon as it is created, so that the component decoders use more reliable a priori information. Thus, the shuffled decoding technique performs decoding (computation time) and interleaving (communication time) fully concurrently, while serial decoding implies waiting for the update of all extrinsic information before starting the next half-iteration (see Fig. 4; a schedule sketch is given at the end of Section III). Modifying serial decoding to restart processing right after the previous half-iteration, in order to save the propagation latency, was studied in [21]. The resulting decoding nevertheless requires additional control mechanisms to avoid consistency conflicts in the memories. Since communication time is often considered to be the limiting factor in multiprocessor turbo decoding, saving the propagation latency is a crucial property of shuffled decoding. In addition, by doubling the number of BCJR SISO decoders, component-decoder parallelism halves the iteration period in comparison with the originally proposed serial turbo decoding. Nevertheless, to preserve error-rate performance with shuffled decoding, an iteration overhead of between 5% and 50% is required, depending on the BCJR computation scheme, on the degree of sub-block parallelism, on the propagation time, and on the interleaving rules [22]. In fact, this overhead decreases as the sub-block parallelism degree increases [19], while the computation overhead of sub-block parallelism increases. Consequently, at high throughput and comparable complexity, the computation overhead becomes greater when doubling the sub-block parallelism degree than when using shuffled decoding. Thus, for high throughput, shuffled decoding is more efficient than sub-block parallelism. Simulations demonstrate minor variations of the shuffled decoding overhead for low propagation latencies. Above a propagation latency of three times the extrinsic information emission time, the overhead reduces the interest of the shuffled decoding technique. Finally, this level of parallelism presents great potential for scalability and high area efficiency. Exploiting this level of parallelism is detailed in Section V.

C. Turbo-Decoder Level Parallelism

The highest level of parallelism simply duplicates whole turbo decoders to process iterations and/or frames in parallel. Iteration parallelism occurs in a pipelined fashion with a maximum pipeline depth equal to the iteration number, whereas frame parallelism presents no limitation in parallelism degree. Nevertheless, turbo-decoder level parallelism is too area-expensive (all memories and computation resources are duplicated) and presents no gain in frame decoding latency; for these reasons, it is not considered in this work.
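To contrast the two schedules of Fig. 4, here is a schematic Python sketch with a dummy component decoder (all names are ours, and the BCJR processing is elided): serial decoding completes a full half-iteration before any exchange, whereas shuffled decoding exchanges each extrinsic value as soon as it is produced.

```python
import random

class DummySISO:
    """Stand-in component decoder; the actual BCJR processing is elided."""
    def __init__(self, n): self.apriori = [0.0] * n
    def process_symbol(self, k):            # returns a fake extrinsic value
        return 0.5 * self.apriori[k] + random.random()
    def update_a_priori(self, k, v): self.apriori[k] = v

def serial_decoding(d1, d2, pi, inv, n_iter):
    # (a) Serial: a complete half-iteration, then exchange via the interleaver.
    n = len(pi)
    for _ in range(n_iter):
        ext1 = [d1.process_symbol(k) for k in range(n)]
        for k in range(n): d2.update_a_priori(pi[k], ext1[k])
        ext2 = [d2.process_symbol(k) for k in range(n)]
        for k in range(n): d1.update_a_priori(inv[k], ext2[k])

def shuffled_decoding(d1, d2, pi, inv, n_iter):
    # (b) Shuffled: both decoders advance together; every extrinsic value is
    # exchanged immediately, overlapping computation and communication.
    n = len(pi)
    for _ in range(n_iter):
        for k in range(n):
            e1, e2 = d1.process_symbol(k), d2.process_symbol(k)
            d2.update_a_priori(pi[k], e1)
            d1.update_a_priori(inv[k], e2)

pi = random.sample(range(8), 8)
inv = [pi.index(k) for k in range(8)]
shuffled_decoding(DummySISO(8), DummySISO(8), pi, inv, n_iter=2)
```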
IV. EXPLOITING BCJR METRIC LEVEL PARALLELISM: ASIP FOR BCJR SISO DECODER

A. Context of Architectural Choices

As seen in Section III-A, the BCJR metric level parallelism that occurs inside a BCJR SISO decoder is the most area-efficient level of parallelism. Thus, a hardware implementation achieving high throughput should first exploit this parallelism. The complexity of the convolutional turbo codes proposed in all existing and emerging standards is limited to eight-state double binary turbo codes or 16-state simple binary turbo codes. Hence, to fully exploit trellis-transition parallelism (Section III-A1) for all standards, a parallelism degree of 32 is required. Future, more complex codes can be supported by splitting trellis sections into sub-sections of parallelism degree 32 and by processing the sub-sections sequentially.

Regarding BCJR computation parallelism (Section III-A2), we choose a parallelism degree of two instead of four (the maximum). Using a parallelism degree of four with the butterfly scheme leads to an underutilization of the BCJR computation units (two of them are used only half of the time). These parallelism requirements imply the use of specific hardware units. To implement these units while preserving flexibility, application-specific instruction-set processors constitute the perfect solution [7]. The BCJR SISO decoder should also have adequate communication interfaces in order to handle the inter-sub-block communications in the case of BCJR-SISO decoder parallelism (see Section V). In this context, in order to implement shuffled decoding efficiently, the propagation time should be less than three emission periods (see Section III-B2). Let $T_{net}$ be the time required to cross the network, $P$ the pipeline length between the load stage and the store stage of the extrinsic information, $N_c$ the number of clock cycles required by the processor to compute an extrinsic information value, and $f$ the frequency of the processor. Then

$$T_{prop} = T_{net} + \frac{P}{f} \quad (4)$$

$$T_{em} = \frac{N_c}{f} \quad (5)$$

$$\frac{T_{prop}}{T_{em}} = \frac{f \, T_{net} + P}{N_c}. \quad (6)$$

To preserve a low $T_{prop}/T_{em}$ ratio, and thus make the use of the shuffled decoding technique efficient, we choose a short pipeline, keeping $P/N_c$ small. A long pipeline inhibits the exploitation of the shuffled decoding technique. For example, the ASIP developed in [11], which can emit one extrinsic information value per cycle ($N_c = 1$) with a $P$ of 8, has a ratio greater than 8. A numerical illustration of this constraint follows below.
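The helper below evaluates the ratio of (6), as reconstructed above, for the long-pipeline example of [11]; it is a small sanity check, not part of the design flow.

```python
def shuffled_ratio(t_net, P, n_cycles, f):
    """Propagation-to-emission ratio of (6), as reconstructed above; it must
    stay below three for shuffled decoding to remain efficient (Sec. III-B2)."""
    return (f * t_net + P) / n_cycles

# Long-pipeline ASIP of [11]: one extrinsic value per cycle (n_cycles = 1),
# P = 8. Even with an ideal zero-latency network, the ratio already reaches 8.
print(shuffled_ratio(t_net=0.0, P=8, n_cycles=1, f=400e6))
```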
B. Architecture of the ASIP

The presented processor, dedicated to the BCJR algorithm, is an enhanced version of the ASIP proposed in [10].

1) Global View: The ASIP is mainly composed of an operative part and a control part, besides its communication interfaces and attached memories [see Fig. 5(a)]. The operative part is tailored to process a window of 64 symbols by means of two identical BCJR computation units, corresponding to the forward and backward processing of the MAP algorithm. Each unit produces recursion metrics and extrinsic information. The storage of the recursion metrics produced by one unit, to be used by the other unit, is performed in eight cross memories per unit; in total, the processor integrates 16 internal cross memories in order to provide the adequate bandwidth. Another internal memory (config), 96 bits wide, contains up to 256 trellis descriptions, so that the processor can be configured for the corresponding standard. Incoming data, which group the systematic and redundant channel information in addition to the extrinsic information, are stored in external memories attached to the ASIP (input data, info ext). The input data memory is 32 bits wide, so as to contain up to four 8-bit channel information data (systematic or redundant). The info ext memory is 64 bits wide, so as to contain up to four 16-bit extrinsic information data, since four extrinsic information data are required by double binary codes. Depending on the application's requirements, the depth of the incoming data memories can be scaled to cover the frame-length specifications of all existing and emerging standards. The external future and past memory banks are used to initialize the state metric values at the beginning and end of each window, according to the message passing method. Each bank has two 128-bit-wide memories, one storing forward recursions and the other backward recursions. These initialization memories are used as follows: 1) at the beginning of the decoding, the state metric registers are either set according to the available information about the state, or reset so that all state metrics have equal probability; 2) after computations (e.g., acquisition if programmed, or recursion), the state metrics obtained for the beginning and end of each window can be stored in a memory bank in order to be used for initialization at the next iteration. The depth of these memories can be scaled to the number of windows required, with a maximum of 1024 (matching the 10-bit window registers described in Section IV-B4). For the $w$th window associated with the processor, the initialization metrics are read from the forward and backward past memories, and the refined state metrics obtained for the beginning and end of the window are then stored back into these memories. The future memory bank is only accessible at address 0, for the state metrics of the end of the last window associated with the processor: in this case, the backward initialization metrics are read from the backward future memory at address 0, and the forward metrics are stored in the forward future memory at address 0. For all the external memories, read/write latencies of one cycle have been integrated into the ASIP pipeline.

2) BCJR Computation Unit: Each BCJR computation unit is based on a single instruction multiple data (SIMD) architecture in order to exploit trellis-transition parallelism. Thus, 32 adder nodes (one per transition) and 8 max nodes are incorporated in each unit [see Fig. 5(b)]. The 32 adder nodes are organized as a 4 × 8 processing matrix. In this organization, for an 8-state double binary code, the row and the column of an adder node correspond, respectively, to the considered symbol decision and to the ending state of the associated transition. For a 16-state simple binary code, transitions with ending states 0 to 7 are mapped on matrix nodes of row 0 if the transition bit decision is 0, or on matrix nodes of row 1 if the transition bit decision is 1, whereas states 8 to 15 are mapped on nodes of rows 2 and 3. An adder node [see Fig. 5(c)] contains one adder, multiplexers, one configuration register (RT), and an output register (RADD). It supports the addition required in a recursion between a state metric (coming from the state metric register bank RMC) and a branch metric (coming from the branch metric register bank RG), and also the addition required in extrinsic information generation, since it can accumulate the previous result with the state metric of the other recursion coming from the register bank RC. The max nodes [see Fig. 5(d)] are shared in the processing matrix so that the max operations can be performed on the RADD registers either rowwise or columnwise, depending on the ASIP instruction. A max node contains three max operators connected in a tree. This makes it possible to perform either one four-input maximum (using the three operators) or two two-input maximums. Results are stored either in the first rows or columns of the RADD matrix, or in the RMC bank to achieve the recursion computation. A software model of this matrix organization is sketched below.
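This is an illustrative software model of the datapath just described (the transition mapping is assumed for the example), not RTL:

```python
import numpy as np

# Model of the 4 x 8 adder/max matrix for an 8-state double binary code:
# RADD[d, s] = alpha_prev[src(d, s)] + gamma[d, s]  (one adder node per transition)
def radd_fill(alpha_prev, gamma, src):
    return alpha_prev[src] + gamma          # 32 additions "in one cycle"

# Columnwise max (over the 4 decisions) yields the recursion update of (1):
def columnwise_max(radd):
    return radd.max(axis=0)                 # new state metrics, 8 values

# Rowwise max (over the 8 ending states) contributes to the extrinsic metrics (3):
def rowwise_max(radd_with_beta):
    return radd_with_beta.max(axis=1)       # one metric per decision, 4 values

rng = np.random.default_rng(1)
src = rng.integers(0, 8, size=(4, 8))       # hypothetical transition mapping
radd = radd_fill(rng.normal(size=8), rng.normal(size=(4, 8)), src)
print(columnwise_max(radd), rowwise_max(radd))
```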

Fig. 5. (a) ASIP architecture. (b) BCJR computation unit. (c) Adder node. (d) Max node. (e) Control unit.

The BCJR computation unit also contains a GLOBAL arithmetic logic unit (ALU), which computes the extrinsic information, the hard decisions, and other global processing, and a branch metric (BM) generator, which computes the branch metrics from the extrinsic information register bank (RIE) and from the channel information available in the pipeline registers (PR). The BM generator supports cyclic puncturing patterns with a maximum pattern length of 16. The pattern length is configurable in a 4-bit register, while the puncturing patterns associated with the four channel information data are configurable through four 16-bit registers, in which each zero corresponds to a punctured bit. The BM generator thus supports code rates from the mother rate of the component code up to 1, for both double binary and simple binary codes.

3) Pipeline Strategy: The ASIP control part is based on a six-stage pipeline [see Fig. 5(e)]. The first two stages (FE, DC) fetch instructions from the program memory and decode them. Then, depending on the instruction type, the operand fetch (OPF) stage loads data from the input data memory into the pipeline registers PR, and/or data from the extrinsic information memory into the RIE registers, and/or data from the past/future memories into the RMC registers, and/or the configuration into the RT registers. In comparison with [10], a BM stage has been added to the pipeline in order to anticipate the calculation of the branch metrics performed in the BM generator, to increase the clock frequency of the ASIP, and to improve the number of cycles per symbol. The execute (EX) stage performs the rest of the BCJR operations. This choice reduces the performance of the ASIP, since the architecture does not fully exploit ILP. However, it was made intentionally, to keep the pipeline as short as possible and thereby support the shuffled decoding technique efficiently. Hence, extrinsic information can cross the pipeline from the OPF stage to the store (ST) stage in only four cycles (see Section IV-A).

Fig. 6. Butterfly ZOL mechanism.

4) Control Structure: The control part also requires several dedicated control registers. The window size is fixed in the register R SIZE, and the current symbol processed inside BCJR computation unit A (respectively, BCJR computation unit B) is stored in the pipeline register ADDRESS A (respectively, ADDRESS B). These addresses, as well as the program counter and the corresponding instruction, are then pipelined. To correctly access the incoming data memories and the past/future memories, the processor has a 10-bit WINDOW ID register that identifies the window being computed and a 10-bit R SUB BLOCK register that sets the number of windows processed by the ASIP. Thus, one ASIP can process up to 1024 windows. In addition, the control architecture provides branch mechanisms and a zero overhead loop (ZOL) mechanism fully dedicated to the butterfly scheme (see Section III-A2). To lighten the ASIP instruction set, the ZOL mechanism is tightly coupled with address generation (see Fig. 6). Thus, the first loop is performed while the address of the symbol processed by unit A is smaller than the address of the symbol processed by unit B. In the case of an odd window size, the middle symbol is processed by unit A when both addresses are equal. Finally, the second loop is performed while the address of the symbol processed by unit B is positive.
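A behavioral sketch of this butterfly ZOL address generation (our own pseudo-model of the mechanism of Fig. 6, not the control RTL): unit A walks the window forward while unit B walks it backward, and the two loop bodies repeat without branch overhead.

```python
def butterfly_zol(window_size, loop1_body, loop2_body):
    """Addresses for units A (forward) and B (backward) under the butterfly ZOL.
    loop1_body/loop2_body stand for the instruction sequences of the two loops."""
    addr_a, addr_b = 0, window_size - 1
    # First loop: recursions only, while A's address stays below B's.
    while addr_a < addr_b:
        loop1_body(addr_a, addr_b)
        addr_a, addr_b = addr_a + 1, addr_b - 1
    if addr_a == addr_b:               # odd window size: middle symbol on unit A
        loop1_body(addr_a, addr_b)
        addr_a, addr_b = addr_a + 1, addr_b - 1
    # Second loop: recursions plus extrinsic outputs; modeled with >= 0 so that
    # unit B also processes the first symbol of the window.
    while addr_b >= 0:
        loop2_body(addr_a, addr_b)
        addr_a, addr_b = addr_a + 1, addr_b - 1

butterfly_zol(7, lambda a, b: print("L1", a, b), lambda a, b: print("L2", a, b))
```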

Fig. 7. Assembly programs for: (a) WiMAX double binary turbo code and (b) 3GPP simple binary turbo code.

C. ASIP Instruction Set

The instruction set of our ASIP architecture is coded on sixteen bits. The basic version contains 30 instructions that perform the basic operations of the MAP algorithm. To increase performance, the ASIP was extended with compacted instructions that can perform several operations in different pipeline stages within a single instruction. The following paragraphs detail the instructions required to perform simple decoding. These instructions are divided into three classes: control, operative, and IO.

1) Control: As mentioned previously, the butterfly ZOL instruction repeats the two loops of the butterfly scheme R SIZE times. It requires three markers to retain the relative addresses of the first-loop end instruction, the second-loop begin instruction, and the second-loop end instruction. An unconditional branch instruction has also been designed; it uses the direct addressing mode. The SET SIZE instruction sets the ASIP window size, up to a maximum of 64 symbols. SET WINDOW ID and SET SUB BLOCK NB are used to set the WINDOW ID and R SUB BLOCK registers. Thus, the processor manages up to 65 536 symbols (1024 windows of 64 symbols).

2) Operative MAP: An ADD instruction is defined and used in two different modes: metric computation (add m) and extrinsic information computation (add i). According to the add mode and to the configuration registers (RT), each processing node selects the desired operands, performs the addition, and stores the result in the corresponding RADD register. In the same way, max1 and max2 instructions are defined with the same modes as the ADD instruction. The max1 instruction performs only one comparison-selection (two outputs per max node), while the max2 instruction cascades comparison-selection operations (one output per max node). These instructions have to be repeated as often as necessary to obtain either the extrinsic information or the recursion metrics at the considered address in the sub-block. The basic instruction set also contains the DECISION instruction, which produces hard decisions on the processed symbols.

3) IO: The basic instruction set also provides input and output instructions. With these instructions, parallel multi-accesses are executed in order to: load decoder input data (LD DATA), input recursion metrics (LD REC), and the configuration (LD CONFIG); store output recursion metrics (ST REC); handle the internal cross metrics between the two BCJR computation units (LD CROSS, LD ST); and send extrinsic information packets and hard decisions (ST EXT, DEC). We chose to group the extrinsic information into packets for efficient IO operations. Each packet can contain a packet header and the extrinsic information of the current symbol (up to four 16-bit data in the case of double binary codes). This header typically contains the address of the processed symbol, including the WINDOW ID and the local address (cf. Section V-B); a possible packing is sketched below.
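For illustration, a hypothetical packing of such a packet; the field widths and layout are chosen for this example only, since the exact header format is not specified here.

```python
import struct

def pack_ext_packet(window_id, local_addr, ext):
    """Hypothetical extrinsic-information packet: a header carrying the symbol
    address (10-bit WINDOW ID plus 6-bit local address within a 64-symbol
    window), followed by up to four 16-bit extrinsic values (double binary)."""
    assert 0 <= window_id < 1024 and 0 <= local_addr < 64 and len(ext) <= 4
    header = (window_id << 6) | local_addr          # 16-bit header
    return struct.pack("<H4h", header, *ext, *([0] * (4 - len(ext))))

pkt = pack_ext_packet(window_id=3, local_addr=17, ext=[120, -45, 7, 0])
```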
D. Application Examples

Fig. 7 gives the ASIP assembly programs required (a) to decode a 48-symbol sub-block with the turbo code used in the WiMAX standard and (b) to decode a 40-bit sub-block with the turbo code used in the 3GPP standard. In both cases, the first instructions load the required configuration (LD CONFIG) and initialize the recursion metrics (LD REC). Then the butterfly loops are set up using the ZOLB instruction. The first loop (two instructions) computes only the state metrics: two max operations (max2 instruction) are required for the double binary code, whereas only one max operation (max1 instruction) is required for the simple binary code. The second loop (five instructions) computes, in addition to the state metrics, the extrinsic information for the eight-state code (using three max operations). Finally, the ASIP exports the sub-block ending metrics (ST REC) and the program branches back to the first instruction of the butterfly. Regarding the execution time, about $2 \times N/2$ cycles are spent in the first loop of the butterfly scheme and $5 \times N/2$ cycles in the second loop, where $N$ is the sub-block size. So, about $3.5N$ cycles are needed to process the $N$ symbols of the sub-block, i.e., roughly 3.5 cycles per symbol (3.5 cycles/bit in simple binary mode and 1.75 cycles/bit in double binary mode).
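As a quick arithmetic check, these cycle counts tie directly to the throughput figures reported in Section IV-E, which uses the 400-MHz clock frequency obtained after synthesis:

```python
# Throughput = f / (cycles per bit), per SISO decoder, ignoring iterations.
f_clk = 400e6                 # Hz, clock frequency after synthesis (Section IV-E)
cycles_per_symbol = 3.5
for mode, bits_per_symbol in (("double binary", 2), ("simple binary", 1)):
    mbps = f_clk * bits_per_symbol / cycles_per_symbol / 1e6
    print(f"{mode}: {mbps:.1f} Mb/s")   # about 228.6 and 114.3 Mb/s
```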

E. Implementation Results

In this paper, we use the Processor Designer framework from CoWare [23]. Processor Designer is based on the LISA ADL [9], which allows the automatic generation of ASIP models (VHDL, Verilog, and SystemC) for hardware synthesis and system integration, in addition to the generation of the underlying software development tools. Using this tool, a VHDL description of the ASIP architecture was generated. It was synthesized with Synopsys Design Compiler using an ST 90-nm ASIC library under worst-case conditions (0.9 V, 105 °C). The optimized ASIP reaches a maximum clock frequency of 400 MHz and occupies about 63.1 KGates (equivalent). Compared with the previous ASIP [10], the clock frequency is improved by 20% and the area is decreased by 35%. The presented ASIP can process 228 Mb/s in double binary mode and 114 Mb/s in simple binary mode. Note that, as a future extension, the performance of the simple binary mode can be significantly improved when the code rate of the component codes is greater than or equal to one half (valid in most standards), by compacting the trellis [24]. Under this condition, two consecutive stages of the simple binary trellis can be compacted into one stage of a new double binary trellis without error-rate degradation. With this new double binary trellis configuration, a simple binary code can also be decoded at 228 Mb/s. Trellis compaction requires a different soft-decision management, implying extra operations. This feature is not implemented in the current ASIP architecture. However, it can be supported with minor modifications of the LD DATA and ST EXT instructions: the LD DATA instruction should handle the soft-input bit-to-symbol conversion (new add operators in the BM stage), and the ST EXT instruction should handle the soft-output symbol-to-bit conversion (new max operators in the ST stage). These elementary operators do not change the maximum clock frequency, since they introduce their own non-critical paths. Thus, a negligible hardware overhead is induced, without any degradation in throughput. Besides, as the packet generated by the ASIP would then contain two binary soft decisions, the network interface has to split each processor packet into two network packets to perform interleaving.

TABLE I. Comparison of different turbo decoding implementations for UMTS.

Table I compares the performance results of log-MAP turbo decoding implementations for the UMTS turbo code. We can observe that the designed ASIP has excellent throughput performance thanks to a number of cycles per bit per SISO close to one, the value obtained by fully dedicated hardware implementations. Compared to [11], our ASIP presents a slightly lower throughput for an almost similar area (63 versus 56 KGates) and with a shorter pipeline (6 versus 11 stages), in order to make shuffled decoding possible. With the trellis compaction extension, which reveals the real potential of the proposed architecture for decoding simple binary codes, the ASIP can reach a slightly better throughput despite the use of a 90-nm target technology. Furthermore, thanks to its dedicated past/future memories, our processor can skip the acquisition phases efficiently and without degradation. The figures of Table I do not include this acquisition computation overhead, which can rise to around 15% in [25].

V. EXPLOITING BCJR-SISO DECODER LEVEL PARALLELISM: MULTI-ASIP PLATFORM

The ASIP presented in the previous section fully exploits the first parallelism level (BCJR metric level parallelism, Section III-A) by efficiently performing all the computations of a BCJR-SISO decoder. In order to exploit the second parallelism level (BCJR-SISO decoder level, Section III-B), a multi-ASIP architecture is required.
Fig. 8. Extrinsic information exchanges in BCJR-SISO decoder level parallelism.

A. Multi-ASIP Turbo Decoding Overview

Sub-block parallelism implies the use of one BCJR-SISO decoder, e.g., our ASIP, for each sub-block. The state metrics of a sub-block can then be initialized using the message passing technique (see Section III-B2) through the state metric interfaces of the ASIP. Component-decoder parallelism implies the use of at least one ASIP for each component decoder, where the ASIPs execute in parallel and exchange extrinsic information concurrently (shuffled decoding). Fig. 8 illustrates the architecture template required to exploit both kinds of parallelism. Besides the integration of multiple ASIPs, this figure shows the need for dedicated communication structures to accommodate the massive information exchanges between the ASIPs.

Fig. 9. ASIP-based multiprocessor architecture for turbo decoding.

Regarding communication interfaces, the developed ASIP incorporates the required state metric interfaces as well as the extrinsic information interfaces. On the other hand, the interleaving of extrinsic information has to be handled by the communication structure. As seen in Section IV-A, efficient shuffled decoding imposes a low $T_{prop}/T_{em}$ ratio (less than three), and the proposed ASIP architecture keeps this ratio low. Thus, to preserve shuffled decoding efficiency, the communication structure has to ensure a short propagation time, which can be quantified using (6).

B. Communication Structures

In order to illustrate how we implement the required communication structures, Fig. 9 presents a four-ASIP turbo decoder architecture where each component decoder is implemented using two ASIPs. This figure shows the three kinds of networks that are used: the data interface network, the state metric network, and the extrinsic information network. First, the data interface network is used to dispatch new channel data from the frame memory of the IO interface to the local input data memories of the ASIPs and, concurrently, to gather output data from the ASIPs. Second, the state metric network enables exchanges between neighboring ASIPs within a component decoder. These exchanges are mandatory to initialize the sub-blocks with the message passing technique. As seen in Section IV-B1, an ASIP accesses the initialization values of the beginning and end of its sub-block at address 0 of its past/future memories. In the case of full sub-block parallelism (i.e., no windowing), these memories can be replaced by buffers, and the state metric network consists of a set of buffers between neighboring processors, reflecting the trellis termination strategy. Thus, a circular trellis termination strategy, i.e., one where the ending and beginning states of the frame are identical, implies the use of a buffer between the first and last ASIPs (see Fig. 9). Finally, the extrinsic information network is based on routers to make extrinsic information exchanges possible between the ASIPs. As the proposed ASIP supports the butterfly scheme, two packets can be sent on this network per emission and per ASIP. The packet headers generated by the ASIP are used by the network interfaces (NIs) to perform interleaving: each NI regenerates a new header with the corresponding routing information (a sketch of this translation is given at the end of this subsection). The routers integrate buffering mechanisms and support up to two input ports and two output ports. Fig. 9 presents a simple topology supporting four ASIPs; architectures with more than four processors require more complex topologies. It is worth noting that these networks take advantage of packet-switched communication [26]. In [18], we proposed the use of multistage interconnection networks on chip based on Butterfly and Benes topologies. These topologies are scalable and present the required bandwidth and latencies at a reasonable hardware complexity. Even if their scalability is limited to numbers of input ports that are powers of two, they keep the mean propagation latency under control, as it evolves with the logarithm of the network size. This valuable property contributes significantly to fulfilling the shuffled decoding requirements.
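As an illustration of the NI's role, the following hypothetical sketch translates a processor packet header (WINDOW ID plus local address, as in Section IV-C3) into an interleaved address and a destination ASIP; the address split and the interleaver table are assumptions for the example.

```python
import random

def ni_translate(header, pi, window_size, sub_block_size):
    """Map a processor packet header to (destination ASIP, new header).
    pi: interleaver table mapping natural order to interleaved order."""
    window_id, local = header >> 6, header & 0x3F     # 10-bit / 6-bit split
    src_symbol = window_id * window_size + local      # global natural address
    dst_symbol = pi[src_symbol]                       # interleaved address
    dst_asip = dst_symbol // sub_block_size           # routing destination
    dst_local = dst_symbol % sub_block_size
    new_header = ((dst_local // window_size) << 6) | (dst_local % window_size)
    return dst_asip, new_header

# Example with a toy 256-symbol frame on 4 ASIPs (64 symbols each).
pi = list(range(256)); random.Random(0).shuffle(pi)
print(ni_translate((1 << 6) | 5, pi, window_size=64, sub_block_size=64))
```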
C. Results

With the conventional turbo decoding technique, the achievable throughput of a multiprocessor architecture does not increase linearly with the number of processors, especially when this number is high [2], [8]. This degradation is mainly due to the interleaving, the IO communication delays, and the sub-block parallelism [19]. The use of the shuffled decoding technique limits this degradation. Thus, the throughput of the proposed ASIP-based multiprocessor architecture depends on the number of integrated ASIPs and on the shuffled decoding efficiency (see Section III-B2). The low-latency extrinsic information network (see Section V-B) and the short ASIP pipeline (see Section IV-B3) guarantee a high shuffled decoding efficiency. As an example, Table II summarizes the multiprocessor turbo decoding performance for the WiMAX double binary code. Results are compared at the error-rate performance level obtained with five iterations without BCJR-SISO decoder parallelism.


More information

A generalized precompiling scheme for surviving path memory management in Viterbi decoders

A generalized precompiling scheme for surviving path memory management in Viterbi decoders A generalized precompiling scheme for surviving path memory management in Viterbi decoders Emmanuel BOUTON, Nicolas DEMASSEUX Telecom Paris, E.N.S.T, 46 rue Barrault, 75634 PARS CEDEX 3, FRANCE e-mail

More information

A scalable, fixed-shuffling, parallel FFT butterfly processing architecture for SDR environment

A scalable, fixed-shuffling, parallel FFT butterfly processing architecture for SDR environment LETTER IEICE Electronics Express, Vol.11, No.2, 1 9 A scalable, fixed-shuffling, parallel FFT butterfly processing architecture for SDR environment Ting Chen a), Hengzhu Liu, and Botao Zhang College of

More information

Effective Memory Access Optimization by Memory Delay Modeling, Memory Allocation, and Slack Time Management

Effective Memory Access Optimization by Memory Delay Modeling, Memory Allocation, and Slack Time Management International Journal of Computer Theory and Engineering, Vol., No., December 01 Effective Memory Optimization by Memory Delay Modeling, Memory Allocation, and Slack Time Management Sultan Daud Khan, Member,

More information

A Low Power Asynchronous FPGA with Autonomous Fine Grain Power Gating and LEDR Encoding

A Low Power Asynchronous FPGA with Autonomous Fine Grain Power Gating and LEDR Encoding A Low Power Asynchronous FPGA with Autonomous Fine Grain Power Gating and LEDR Encoding N.Rajagopala krishnan, k.sivasuparamanyan, G.Ramadoss Abstract Field Programmable Gate Arrays (FPGAs) are widely

More information

4. Networks. in parallel computers. Advances in Computer Architecture

4. Networks. in parallel computers. Advances in Computer Architecture 4. Networks in parallel computers Advances in Computer Architecture System architectures for parallel computers Control organization Single Instruction stream Multiple Data stream (SIMD) All processors

More information

VLSI Architectures for SISO-APP Decoders

VLSI Architectures for SISO-APP Decoders IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 11, NO. 4, AUGUST 2003 627 VLSI Architectures for SISO-APP Decoders Mohammad M. Mansour, Student Member, IEEE, and Naresh R. Shanbhag,

More information

Application of a design space exploration tool to enhance interleaver generation

Application of a design space exploration tool to enhance interleaver generation Application of a design space exploration tool to enhance interleaver generation Cyrille Chavet, Philippe Coussy, Pascal Urard, Eric Martin To cite this version: Cyrille Chavet, Philippe Coussy, Pascal

More information

A Ripple Carry Adder based Low Power Architecture of LMS Adaptive Filter

A Ripple Carry Adder based Low Power Architecture of LMS Adaptive Filter A Ripple Carry Adder based Low Power Architecture of LMS Adaptive Filter A.S. Sneka Priyaa PG Scholar Government College of Technology Coimbatore ABSTRACT The Least Mean Square Adaptive Filter is frequently

More information

Reduced complexity Log-MAP algorithm with Jensen inequality based non-recursive max operator for turbo TCM decoding

Reduced complexity Log-MAP algorithm with Jensen inequality based non-recursive max operator for turbo TCM decoding Sybis and Tyczka EURASIP Journal on Wireless Communications and Networking 2013, 2013:238 RESEARCH Open Access Reduced complexity Log-MAP algorithm with Jensen inequality based non-recursive max operator

More information

An Efficient Constant Multiplier Architecture Based On Vertical- Horizontal Binary Common Sub-Expression Elimination Algorithm

An Efficient Constant Multiplier Architecture Based On Vertical- Horizontal Binary Common Sub-Expression Elimination Algorithm Volume-6, Issue-6, November-December 2016 International Journal of Engineering and Management Research Page Number: 229-234 An Efficient Constant Multiplier Architecture Based On Vertical- Horizontal Binary

More information

High Data Rate Fully Flexible SDR Modem

High Data Rate Fully Flexible SDR Modem High Data Rate Fully Flexible SDR Modem Advanced configurable architecture & development methodology KASPERSKI F., PIERRELEE O., DOTTO F., SARLOTTE M. THALES Communication 160 bd de Valmy, 92704 Colombes,

More information

ERROR correcting codes are used to increase the bandwidth

ERROR correcting codes are used to increase the bandwidth 404 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 37, NO. 3, MARCH 2002 A 690-mW 1-Gb/s 1024-b, Rate-1/2 Low-Density Parity-Check Code Decoder Andrew J. Blanksby and Chris J. Howland Abstract A 1024-b, rate-1/2,

More information

Design and Implementation of Low Complexity Router for 2D Mesh Topology using FPGA

Design and Implementation of Low Complexity Router for 2D Mesh Topology using FPGA Design and Implementation of Low Complexity Router for 2D Mesh Topology using FPGA Maheswari Murali * and Seetharaman Gopalakrishnan # * Assistant professor, J. J. College of Engineering and Technology,

More information

Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks

Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks X. Yuan, R. Melhem and R. Gupta Department of Computer Science University of Pittsburgh Pittsburgh, PA 156 fxyuan,

More information

Design of Low-Power and Low-Latency 256-Radix Crossbar Switch Using Hyper-X Network Topology

Design of Low-Power and Low-Latency 256-Radix Crossbar Switch Using Hyper-X Network Topology JOURNAL OF SEMICONDUCTOR TECHNOLOGY AND SCIENCE, VOL.15, NO.1, FEBRUARY, 2015 http://dx.doi.org/10.5573/jsts.2015.15.1.077 Design of Low-Power and Low-Latency 256-Radix Crossbar Switch Using Hyper-X Network

More information

A Novel Area Efficient Folded Modified Convolutional Interleaving Architecture for MAP Decoder

A Novel Area Efficient Folded Modified Convolutional Interleaving Architecture for MAP Decoder A Novel Area Efficient Folded Modified Convolutional Interleaving Architecture for Decoder S.Shiyamala Department of ECE SSCET Palani, India. Dr.V.Rajamani Principal IGCET Trichy,India ABSTRACT This paper

More information

UNIT I (Two Marks Questions & Answers)

UNIT I (Two Marks Questions & Answers) UNIT I (Two Marks Questions & Answers) Discuss the different ways how instruction set architecture can be classified? Stack Architecture,Accumulator Architecture, Register-Memory Architecture,Register-

More information

CHAPTER 6 FPGA IMPLEMENTATION OF ARBITERS ALGORITHM FOR NETWORK-ON-CHIP

CHAPTER 6 FPGA IMPLEMENTATION OF ARBITERS ALGORITHM FOR NETWORK-ON-CHIP 133 CHAPTER 6 FPGA IMPLEMENTATION OF ARBITERS ALGORITHM FOR NETWORK-ON-CHIP 6.1 INTRODUCTION As the era of a billion transistors on a one chip approaches, a lot of Processing Elements (PEs) could be located

More information

A Reconfigurable Outer Modem Platform for Future Wireless Communications Systems. Timo Vogt Norbert Wehn {vogt,

A Reconfigurable Outer Modem Platform for Future Wireless Communications Systems. Timo Vogt Norbert Wehn {vogt, Microelectronic System Design TU Kaiserslautern www.eit.uni-kl.de/wehn A econfigurable Outer Modem Platform for Future Wireless Communications Systems Timo Vogt Norbert Wehn {vogt, wehn}@eit.uni-kl.de

More information

Fault-Tolerant Multiple Task Migration in Mesh NoC s over virtual Point-to-Point connections

Fault-Tolerant Multiple Task Migration in Mesh NoC s over virtual Point-to-Point connections Fault-Tolerant Multiple Task Migration in Mesh NoC s over virtual Point-to-Point connections A.SAI KUMAR MLR Group of Institutions Dundigal,INDIA B.S.PRIYANKA KUMARI CMR IT Medchal,INDIA Abstract Multiple

More information

A Normal I/O Order Radix-2 FFT Architecture to Process Twin Data Streams for MIMO

A Normal I/O Order Radix-2 FFT Architecture to Process Twin Data Streams for MIMO 2402 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 24, NO. 6, JUNE 2016 A Normal I/O Order Radix-2 FFT Architecture to Process Twin Data Streams for MIMO Antony Xavier Glittas,

More information

Towards an optimal parallel decoding of turbo codes

Towards an optimal parallel decoding of turbo codes owards an optimal parallel decoding of turbo codes David Gnaedig *, Emmanuel Boutillon +, Jacky ousch *, Michel Jézéquel * urboconcept, 115 rue Claude Chappe, 29280 PLOUZANE, France + LEER Unité CNR FRE

More information

Low Power and Memory Efficient FFT Architecture Using Modified CORDIC Algorithm

Low Power and Memory Efficient FFT Architecture Using Modified CORDIC Algorithm Low Power and Memory Efficient FFT Architecture Using Modified CORDIC Algorithm 1 A.Malashri, 2 C.Paramasivam 1 PG Student, Department of Electronics and Communication K S Rangasamy College Of Technology,

More information

ARITHMETIC operations based on residue number systems

ARITHMETIC operations based on residue number systems IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS, VOL. 53, NO. 2, FEBRUARY 2006 133 Improved Memoryless RNS Forward Converter Based on the Periodicity of Residues A. B. Premkumar, Senior Member,

More information

Implementation of FFT Processor using Urdhva Tiryakbhyam Sutra of Vedic Mathematics

Implementation of FFT Processor using Urdhva Tiryakbhyam Sutra of Vedic Mathematics Implementation of FFT Processor using Urdhva Tiryakbhyam Sutra of Vedic Mathematics Yojana Jadhav 1, A.P. Hatkar 2 PG Student [VLSI & Embedded system], Dept. of ECE, S.V.I.T Engineering College, Chincholi,

More information

Optimal M-BCJR Turbo Decoding: The Z-MAP Algorithm

Optimal M-BCJR Turbo Decoding: The Z-MAP Algorithm Wireless Engineering and Technology, 2011, 2, 230-234 doi:10.4236/wet.2011.24031 Published Online October 2011 (http://www.scirp.org/journal/wet) Optimal M-BCJR Turbo Decoding: The Z-MAP Algorithm Aissa

More information

Managing Dynamic Reconfiguration Overhead in Systems-on-a-Chip Design Using Reconfigurable Datapaths and Optimized Interconnection Networks

Managing Dynamic Reconfiguration Overhead in Systems-on-a-Chip Design Using Reconfigurable Datapaths and Optimized Interconnection Networks Managing Dynamic Reconfiguration Overhead in Systems-on-a-Chip Design Using Reconfigurable Datapaths and Optimized Interconnection Networks Zhining Huang, Sharad Malik Electrical Engineering Department

More information

LLR-based Successive-Cancellation List Decoder for Polar Codes with Multi-bit Decision

LLR-based Successive-Cancellation List Decoder for Polar Codes with Multi-bit Decision > REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLIC HERE TO EDIT < LLR-based Successive-Cancellation List Decoder for Polar Codes with Multi-bit Decision Bo Yuan and eshab. Parhi, Fellow,

More information

Implementation of Efficient Modified Booth Recoder for Fused Sum-Product Operator

Implementation of Efficient Modified Booth Recoder for Fused Sum-Product Operator Implementation of Efficient Modified Booth Recoder for Fused Sum-Product Operator A.Sindhu 1, K.PriyaMeenakshi 2 PG Student [VLSI], Dept. of ECE, Muthayammal Engineering College, Rasipuram, Tamil Nadu,

More information

Multi-path Routing for Mesh/Torus-Based NoCs

Multi-path Routing for Mesh/Torus-Based NoCs Multi-path Routing for Mesh/Torus-Based NoCs Yaoting Jiao 1, Yulu Yang 1, Ming He 1, Mei Yang 2, and Yingtao Jiang 2 1 College of Information Technology and Science, Nankai University, China 2 Department

More information

Single Pass Connected Components Analysis

Single Pass Connected Components Analysis D. G. Bailey, C. T. Johnston, Single Pass Connected Components Analysis, Proceedings of Image and Vision Computing New Zealand 007, pp. 8 87, Hamilton, New Zealand, December 007. Single Pass Connected

More information

Integrating MRPSOC with multigrain parallelism for improvement of performance

Integrating MRPSOC with multigrain parallelism for improvement of performance Integrating MRPSOC with multigrain parallelism for improvement of performance 1 Swathi S T, 2 Kavitha V 1 PG Student [VLSI], Dept. of ECE, CMRIT, Bangalore, Karnataka, India 2 Ph.D Scholar, Jain University,

More information

Optimized architectures of CABAC codec for IA-32-, DSP- and FPGAbased

Optimized architectures of CABAC codec for IA-32-, DSP- and FPGAbased Optimized architectures of CABAC codec for IA-32-, DSP- and FPGAbased platforms Damian Karwowski, Marek Domański Poznan University of Technology, Chair of Multimedia Telecommunications and Microelectronics

More information

An Area-Efficient BIRA With 1-D Spare Segments

An Area-Efficient BIRA With 1-D Spare Segments 206 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 26, NO. 1, JANUARY 2018 An Area-Efficient BIRA With 1-D Spare Segments Donghyun Kim, Hayoung Lee, and Sungho Kang Abstract The

More information

A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis

A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis Bruno da Silva, Jan Lemeire, An Braeken, and Abdellah Touhafi Vrije Universiteit Brussel (VUB), INDI and ETRO department, Brussels,

More information

Chip Design for Turbo Encoder Module for In-Vehicle System

Chip Design for Turbo Encoder Module for In-Vehicle System Chip Design for Turbo Encoder Module for In-Vehicle System Majeed Nader Email: majeed@wayneedu Yunrui Li Email: yunruili@wayneedu John Liu Email: johnliu@wayneedu Abstract This paper studies design and

More information

High Throughput and Low Power NoC

High Throughput and Low Power NoC IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 5, o 3, September 011 www.ijcsi.org 431 High Throughput and Low Power oc Magdy El-Moursy 1, Member IEEE and Mohamed Abdelgany 1 Mentor

More information

ISSN Vol.04,Issue.01, January-2016, Pages:

ISSN Vol.04,Issue.01, January-2016, Pages: WWW.IJITECH.ORG ISSN 2321-8665 Vol.04,Issue.01, January-2016, Pages:0077-0082 Implementation of Data Encoding and Decoding Techniques for Energy Consumption Reduction in NoC GORANTLA CHAITHANYA 1, VENKATA

More information

Implementation of Convolution Encoder and Viterbi Decoder Using Verilog

Implementation of Convolution Encoder and Viterbi Decoder Using Verilog International Journal of Electronics and Communication Engineering. ISSN 0974-2166 Volume 11, Number 1 (2018), pp. 13-21 International Research Publication House http://www.irphouse.com Implementation

More information

A Review on Analysis on Codes using Different Algorithms

A Review on Analysis on Codes using Different Algorithms A Review on Analysis on Codes using Different Algorithms Devansh Vats Gaurav Kochar Rakesh Joon (ECE/GITAM/MDU) (ECE/GITAM/MDU) (HOD-ECE/GITAM/MDU) Abstract-Turbo codes are a new class of forward error

More information

Designing and Characterization of koggestone, Sparse Kogge stone, Spanning tree and Brentkung Adders

Designing and Characterization of koggestone, Sparse Kogge stone, Spanning tree and Brentkung Adders Vol. 3, Issue. 4, July-august. 2013 pp-2266-2270 ISSN: 2249-6645 Designing and Characterization of koggestone, Sparse Kogge stone, Spanning tree and Brentkung Adders V.Krishna Kumari (1), Y.Sri Chakrapani

More information

FPGA IMPLEMENTATION OF FLOATING POINT ADDER AND MULTIPLIER UNDER ROUND TO NEAREST

FPGA IMPLEMENTATION OF FLOATING POINT ADDER AND MULTIPLIER UNDER ROUND TO NEAREST FPGA IMPLEMENTATION OF FLOATING POINT ADDER AND MULTIPLIER UNDER ROUND TO NEAREST SAKTHIVEL Assistant Professor, Department of ECE, Coimbatore Institute of Engineering and Technology Abstract- FPGA is

More information

LOW-DENSITY PARITY-CHECK (LDPC) codes [1] can

LOW-DENSITY PARITY-CHECK (LDPC) codes [1] can 208 IEEE TRANSACTIONS ON MAGNETICS, VOL 42, NO 2, FEBRUARY 2006 Structured LDPC Codes for High-Density Recording: Large Girth and Low Error Floor J Lu and J M F Moura Department of Electrical and Computer

More information

Non-Binary Turbo Codes Interleavers

Non-Binary Turbo Codes Interleavers Non-Binary Turbo Codes Interleavers Maria KOVACI, Horia BALTA University Polytechnic of Timişoara, Faculty of Electronics and Telecommunications, Postal Address, 30223 Timişoara, ROMANIA, E-Mail: mariakovaci@etcuttro,

More information