EFFICIENT RECURSIVE IMPLEMENTATION OF A QUADRATIC PERMUTATION POLYNOMIAL INTERLEAVER FOR LONG TERM EVOLUTION SYSTEMS

Rev. Roum. Sci. Techn. – Électrotechn. et Énerg., Vol. 61, 1, pp. 53–57, Bucarest, 2016

Électronique et transmission de l'information

CRISTIAN STANCIU (1), CRISTIAN ANGHEL, CONSTANTIN PALEOLOGU

(1) Politehnica University of Bucharest, Iuliu Maniu 1-3, Sect 6, Bucharest, room B10, E-mail: {cristian, canghel, pale}@comm.pub.ro

Key words: Long term evolution (LTE), Turbo codes, Interleaver, Field programmable gate array (FPGA) implementation.

This paper describes an efficient hardware implementation for the address generation block used in the interleaving procedure associated with the channel turbo coding/decoding modules in the long term evolution (LTE) standard. The solution exploits key arithmetic properties of the corresponding equation to perform the address computation in a recursive manner. The proposed method replaces divisions and multiplications by comparisons and subtractions. The new implementation model targets a Xilinx Virtex 5 XC5VFX70T field programmable gate array (FPGA) device.

1. INTRODUCTION

Turbo codes were introduced in [1–3] as an alternative (with superior performance) to classic methods from the forward error coding (FEC) group. Although their arithmetic complexity was initially prohibitive, the development of hardware platforms gradually made turbo coding mandatory in communication standards. The current technology level associated with field programmable gate arrays (FPGAs) and digital signal processors (DSPs) allows the implementation of complex processing architectures, such as the long term evolution (LTE) coding/decoding structures.

One of the most important developments introduced by LTE technology [4, 5] is related to the turbo interleaving block, which is based on a quadratic permutation polynomial (QPP) function. The block is tailored for the high transmission rates associated with parallel decoding architectures [6]. The turbo decoding is performed in an iterative manner, using at each stage the extrinsic values produced by the previous iteration [7, 8]. The functionality takes advantage of the QPP interleaver, which allows the parallelization of the decoding process. An iteration is completed when the data has been processed by the two soft input soft output (SISO) decoding units. The SISOs are interconnected so that each unit receives at its input the output values produced by the other.

The goal of the paper is to present a simplified implementation of the interleaving block used for the LTE turbo coding and decoding architectures. The standard hardware model for the included address generation block uses pipelined multiplications and divisions. The required hardware resources are costly, taking into consideration that the interleaving block is employed for both transmission and reception tasks. Other solutions [9–12] devised more efficient methods for recursively computing the output address values, but the complexity of the arithmetic operations can be simplified further. We will demonstrate that the multiplications and divisions can be completely eliminated by exploiting the properties of the modulo operation and the recursive nature of the associated QPP address generation expression.

The paper is organized as follows. Section 2 introduces the LTE turbo coding/decoding structures. In Section 3, the functionality of the address generation block is described and a new efficient version is proposed for the interleaving and de-interleaving operations. The new model requires only basic arithmetic operations to perform the calculations, using a significantly lower amount of hardware resources. Section 4 presents the hardware implementation. Results are discussed for tests performed in Modelsim [13] and for the synthesis process targeting a Xilinx Virtex 5 device [14, 15]. In Section 5, we present the final conclusions and the perspectives of this study.

2. LTE CODING/DECODING CONFIGURATIONS

The LTE coding is performed with a parallel concatenated convolutional code (PCCC), comprising two constituent encoders and one interleaving block. The structure is illustrated in Fig. 1. Each individual 8-state constituent encoder has the following transfer function:

$$G(D) = \left[ 1, \; \frac{g_1(D)}{g_0(D)} \right], \tag{1}$$

where $D$ denotes the basic delay block and

$$g_0(D) = 1 + D^2 + D^3, \qquad g_1(D) = 1 + D + D^3. \tag{2}$$

The information generated by the coding structure is formed by the input bits $C_k$ ($k = 1, \ldots, K$, where $K$ is the length of the uncoded data block), denoted as $X_k$ at the output, and the parity bits produced by the two constituent encoders, denoted as $Z_k$ and $Z'_k$, respectively. The second constituent encoder performs its operations using an interleaved (reordered) version of the input bits $C_k$, denoted by

$$C'_i = C_{\pi(i)}, \quad i = 1, \ldots, K, \tag{3}$$

where $\pi(i)$ is an address computed as

$$\pi(i) = \left( f_1 i + f_2 i^2 \right) \bmod K. \tag{4}$$

The parameters $f_1$, $f_2$, and the block length $K$ are standardized in 188 possible sets of values and can be found in Table 5.1.3-3 in [5].
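As an illustration of the address computation in (4), the following minimal Python sketch evaluates the QPP function directly, with one multiplication-based polynomial evaluation and one modulo per address. The function names `qpp_direct` and `interleave` are hypothetical, and the parameter set K = 40, f1 = 3, f2 = 10 is assumed here to be the smallest entry of Table 5.1.3-3 in [5]. Indices run from 0 to K - 1, as used later in the paper.

```python
def qpp_direct(K, f1, f2):
    """Return [pi(0), ..., pi(K-1)] with pi(i) = (f1*i + f2*i^2) mod K, as in (4)."""
    return [(f1 * i + f2 * i * i) % K for i in range(K)]


def interleave(bits, pi):
    """Reorder a block: output position i takes the input element at address pi[i]."""
    return [bits[pi[i]] for i in range(len(pi))]


# Assumed parameter set (smallest LTE block length).
K, f1, f2 = 40, 3, 10
pi = qpp_direct(K, f1, f2)

# A valid interleaver visits every address exactly once.
assert sorted(pi) == list(range(K))

print(pi[:5])  # [0, 13, 6, 19, 12]
```

This direct form is the reference against which a recursive, division-free generator can be checked.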

Fig. 1 – LTE turbo encoder [6].

The LTE turbo decoding scheme used at the receiver is illustrated in Fig. 2. The principle of turbo decoding is an iterative processing of data between the two SISO units. An iteration is completed once the data has passed through both SISO blocks. The input of one SISO uses the previously computed output of the other decoding unit. While one SISO unit is decoding the input information, the second one waits for the end of the process before starting its own decoding phase. Furthermore, the interleaving and de-interleaving blocks have the same hardware structure and process only frames of data that must be available before the beginning of computations. Thus, the hardware implementation of the LTE turbo decoder requires only one SISO unit and one interleaving/de-interleaving block.

The elements located in positions $i = 0, \ldots, K-1$ are moved according to the predetermined function presented in (3) and (4). The LTE system requires two interleaving structures for each device, associated with the transmitting and receiving procedures. Most of the hardware cost comes from the address generation function presented in (4). The apparent arithmetic requirements for the computation of the memory addresses $\pi(i)$ consist of one addition, three multiplications, and one division (which is used for the extraction of the remainder associated with the modulo operation). The divisor is the block length $K$, so each remainder (the result of the modulo computation) is smaller than the corresponding $K$. Figure 3 illustrates, for each of the possible data block lengths $K$ (i.e., each of the intervals $i = 0, \ldots, K-1$), the maximum values of the dividends and quotients associated with (4).
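To give a feel for the numeric range involved in the direct evaluation of (4), the short Python sketch below computes the largest dividend and quotient for a single parameter set. The helper name `direct_form_range` is hypothetical, and the values K = 6144, f1 = 263, f2 = 480 are assumed to be the entry for the largest LTE block length in Table 5.1.3-3 in [5]. The worst case over all 188 block lengths may differ from this single example.

```python
def direct_form_range(K, f1, f2):
    """Largest dividend f1*i + f2*i^2 over i = 0..K-1 and the matching quotient by K."""
    i = K - 1  # the polynomial is increasing in i, so the last index is the worst case
    dividend = f1 * i + f2 * i * i
    quotient = dividend // K
    return dividend, quotient


# Assumed parameter set for the largest block length.
dividend, quotient = direct_form_range(6144, 263, 480)

print(dividend, dividend.bit_length())
print(quotient, quotient.bit_length())
```

For this parameter set the dividend alone exceeds 2^34, so an unsigned representation needs 35 bits before any reduction modulo K is applied.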
It can be noticed that the minimum hardware resources necessary for finite numeric formats must account for representations of values up to billions for the dividends and millions for the quotients. The values illustrated in Fig. 3 require up to 35 bits and 23 bits, respectively, for unsigned integer representations. Furthermore, the hardware implementations must generate the values $\pi(i)$ with a minimum delay, which requires pipelined arithmetic. The large numeric ranges and the pipeline system occupy large chip areas.

Fig. 2 – LTE turbo decoding scheme.

Fig. 3 – Maximum dividends and quotients for the interleaver address generator.

3. EFFICIENT ADDRESS GENERATION FOR THE LTE INTERLEAVER

As mentioned before, the function of an interleaver is to reorder the elements comprising a data block with $K$ values. The address computation in (4) can be replaced with a more efficient technique, which works using only additions and simplified modulo operations in a recursive manner. Similar approaches were presented before in the literature [9–12], without the complete elimination of multiplications and divisions. For this purpose, we introduce the notation:

$$p(i) = f_1 i + f_2 i^2. \tag{5}$$

The value of $\pi(i)$ is obtained by applying the modulo-$K$ operator to $p(i)$. The recursive computation of $p(i)$ was introduced to reduce the arithmetic complexity [9, 10, 12]. The function can be expressed at each stage as:

$$\begin{aligned}
p(0) &= 0, \\
p(1) &= f_1 + f_2 = p(0) + f_1 + f_2, \\
p(2) &= 2 f_1 + 4 f_2 = p(1) + f_1 + 3 f_2, \\
p(3) &= 3 f_1 + 9 f_2 = p(2) + f_1 + 5 f_2, \\
&\;\;\vdots \\
p(i) &= p(i-1) + s_1 + s_2(i).
\end{aligned} \tag{6}$$

The value of $p(i)$ can be expressed using two step functions denoted $s_1$ and $s_2(i)$. The step $s_1$ is the constant $f_1$, as it represents the regularly increasing contribution of $f_1$, proportional to the value of $i$. The second step increases its contribution to $p(i)$ in a nonlinear manner, based on the constant $f_2$ and the squared step number $i$. Thus, $s_2(i)$ can be recursively expressed as:

$$s_2(i) = \begin{cases} 0, & i = 0, \\ f_2, & i = 1, \\ s_2(i-1) + 2 f_2, & i > 1. \end{cases} \tag{7}$$

Additionally, we can use (6) to rewrite the relation between $\pi(i)$ and $p(i)$:

$$\pi(i) = p(i) \bmod K = \left[ p(i-1) + s_1 + s_2(i) \right] \bmod K. \tag{8}$$

The multiplications are replaced by additions and the arithmetic complexity is reduced. Nevertheless, the division is still required for the modulo operation. Considering that the modulo operator applied to a sum of elements can be expressed as

$$\Big( \sum_k c_k \Big) \bmod K = \Big( \sum_k \left( c_k \bmod K \right) \Big) \bmod K, \tag{9}$$

we propose to modify the computation of $\pi(i)$ in (8) to considerably reduce the arithmetic complexity. The number of modulo operations increases, but the complexity of the corresponding divisions is reduced as a consequence of having smaller quotients. Consequently, using (7), (8), and (9), we obtain:

$$\begin{aligned}
\pi(i) &= \left[ p(i-1) + s_1 + s_2(i) \right] \bmod K \\
&= \left[ \pi(i-1) + (f_1 \bmod K) + \big( (s_2(i-1) + 2 f_2) \bmod K \big) \right] \bmod K \\
&= \left[ \pi(i-1) + (f_1 \bmod K) + (s_2(i-1) \bmod K) + (2 f_2 \bmod K) \right] \bmod K.
\end{aligned} \tag{10}$$

The values of $\pi(i-1)$ and $s_2(i-1) \bmod K$ are computed at the previous iteration. The computation of $f_1 \bmod K$ is not necessary, since all of the $f_1$ values are smaller than their corresponding frame lengths $K$. Also, $2 f_2 \bmod K$ is a constant parameter for a specific frame length. Figure 4 compares the values $2 f_2$ and $K$ for all of the frame lengths. It can be noticed that $2 f_2 \bmod K$ has corresponding quotients of 0 or 1. This modulo operation is straightforward and can be computed at the beginning of frame processing or pre-stored in a memory.

Fig. 4 – Comparison between the values $2 f_2$ and the corresponding frame lengths $K$.

Moreover, all of the terms in the last sum in (10) are smaller than $K$. As a result, the last modulo operation has an integer quotient no larger than 3. Alternatively, the additions can be interleaved with modulo operations, starting with the last two terms, in order to generate the value $s_2(i) \bmod K$ needed for iteration $i + 1$. The alternation between additions and modulo operations reduces the maximum possible quotient to 1, which allows the use of a comparison with $K$ and a possible subtraction, instead of a division, for the extraction of each remainder. The arithmetic complexity required for the address generation is thus reduced to three additions and three simplified modulo procedures (with the maximum quotient 1) per address value $i$.

The efficiency of the hardware implementation for the address generation block can be further increased. Since $f_1$ and $f_2$ in (10) are constant values for a given frame length $K$, their contribution to the function $p(i)$ can be included into a single pre-computed value. Therefore, the functions $s_1$ and $s_2(i)$ can be combined into a single step function:

$$s_3(i) = s_1 + s_2(i) = \begin{cases} 0, & i = 0, \\ f_1 + f_2, & i = 1, \\ s_3(i-1) + 2 f_2, & i > 1. \end{cases} \tag{11}$$

Fig. 5 – Comparison between the values $f_1 + f_2$ and the corresponding frame lengths $K$.
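The recursion in (10), with additions alternating with simplified modulo operations, can be sketched in Python as follows. This is only an illustrative software model under stated assumptions, not the authors' VHDL: the names `mod_once` and `qpp_recursive_eq10` are hypothetical, the parameters K = 40, f1 = 3, f2 = 10 are assumed to be an entry of Table 5.1.3-3 in [5], and the sketch relies on f1 < K and f2 < K (the latter follows from the statement that the quotient of 2f2 by K is 0 or 1).

```python
def mod_once(x, K):
    """Reduce x to [0, K) assuming 0 <= x < 2K: one comparison, at most one subtraction."""
    return x - K if x >= K else x


def qpp_recursive_eq10(K, f1, f2):
    """Generate pi(0..K-1) with three additions and three compare/subtract steps per address."""
    assert 0 <= f1 < K and 0 <= f2 < K  # assumed properties of the parameter table
    two_f2 = mod_once(2 * f2, K)        # constant per frame length (the doubling is a shift)
    s2 = 0                              # s2(0) mod K
    pi = [0]                            # pi(0) = 0
    for i in range(1, K):
        # s2(i) mod K from s2(i-1) mod K, following (7)
        s2 = f2 if i == 1 else mod_once(s2 + two_f2, K)
        acc = mod_once(pi[-1] + f1, K)  # pi(i-1) + f1, reduced
        acc = mod_once(acc + s2, K)     # add the quadratic step, reduced
        pi.append(acc)
    return pi


pi = qpp_recursive_eq10(40, 3, 10)
print(pi[:5])  # [0, 13, 6, 19, 12]
```

Every intermediate value stays below 2K, which is why a single comparison and conditional subtraction is enough to extract each remainder.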

By using (9) and (11) in (8), the result is:

$$\begin{aligned}
\pi(i) &= p(i) \bmod K = \left[ p(i-1) + s_3(i) \right] \bmod K \\
&= \left[ p(i-1) + s_3(i-1) + 2 f_2 \right] \bmod K \\
&= \left[ \pi(i-1) + (s_3(i-1) \bmod K) + (2 f_2 \bmod K) \right] \bmod K.
\end{aligned} \tag{12}$$

All of the values in the last stage of (12) are lower than $K$ and are either available recursively, such as $\pi(i-1)$ and $s_3(i-1) \bmod K$, or can be predetermined and stored, as in the case of $2 f_2 \bmod K$. Moreover, a comparison is depicted in Fig. 5 for the 188 possible frame lengths $K$ and the corresponding values $f_1 + f_2$ used for the initialization of the step function in (11). In several cases, the relation $f_1 + f_2 > K$ is noticeable. Thus, the value $s_3(1) \bmod K = (f_1 + f_2) \bmod K$ must also be pre-computed and stored for use at the beginning of every address generation round. For each computed address, two additions and two modulo operations (with the maximum quotient 1) are necessary, using a structure that alternates the two operations. The overall arithmetic complexity of the address generation module is reduced from $3K$ additions and $3K$ simplified modulo operations, corresponding to (10), to $2K$ additions and $2K$ simplified modulo procedures, associated with (12).

4. HARDWARE IMPLEMENTATION OF ADDRESS GENERATION

For the hardware implementation of the address generation block we targeted a Xilinx Virtex 5 XC5VFX70T FPGA. We performed the hardware implementation using the very high speed hardware description language (VHDL). The standard block requires the knowledge of parameters $K$, $f_1$, and $f_2$. Considering the corresponding values, available in Table 5.1.3-3 in [5], the total number of bits required for their representation is 13 + 9 + 10 = 32. The LTE standard has 188 sets of values for the specified parameters, which must be available in a memory. Consequently, by multiplying the size of a location by the total number of locations, we obtain a required memory of 6016 bits.

Furthermore, for the implementation of the model proposed in (12), we imposed the pre-storing of the values $K$, $2 f_2 \bmod K$, and the starting values for $s_3(i) \bmod K$, i.e., $s_3(1) \bmod K = (f_1 + f_2) \bmod K$. The required number of bits for a memory location is 13 + 10 + 11 = 34. Thus, the total amount of memory (6392 bits) is slightly larger than in the standard approach, but smaller than in the general solution presented in [9], which assumes a non-zero starting input address. The memory requirements in [9] can increase up to 188 · 13 + 39 · 188 · b bits, where b is the number of parallel address generation blocks.

The functionality of the proposed model is illustrated in Fig. 6 for stage number $i$. The algorithm works in two steps. In the first step, the value $s_3(i) \bmod K$ is determined by adding its previous value to $2 f_2 \bmod K$. The modulo operator is then applied, i.e., the result is compared with $K$, which is subtracted from the sum if necessary. In step 2, the computed value is added to the previous output of the address generation block, i.e., $\pi(i-1)$. In a similar procedure, the modulo operator is applied to the sum, with a maximum quotient of 1. Thus, a comparison and the corresponding possible subtraction are performed. The address generation results for stage $i$ are successively available with a delay of 4 clock cycles. The maximum number of operations required for a full set of address computations is reduced to $2K$ comparisons and $4K$ additions/subtractions.

The method improves on the solutions presented in [10–12] by eliminating all multiplications and divisions. The low numerical range of the operands (with numbers lower than $K$) allows the use of minimal resources for the representation of binary values (i.e., at most 14 bits per operand). The address generation is performed for consecutive input values and allows the use of a single address computation module. Our model is a particular configuration of the parallel setup in [9], which has lower data processing delays and higher arithmetic complexity, proportional to the number of address generation blocks (~2Kb additions and ~2Kb simplified modulo procedures).

Fig. 6 – Proposed address generation scheme.

Fig. 7 – Modelsim example (first and last generated addresses); K = 1280.
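The two-step scheme of Fig. 6 can be mimicked in software as a sanity check. The Python sketch below is a behavioural model under stated assumptions, not the VHDL design: the names `stored_constants` and `address_generator` are hypothetical, the parameter set K = 40, f1 = 3, f2 = 10 is assumed to be an entry of Table 5.1.3-3 in [5], and the `%` operator appears only in the offline pre-computation of the stored constants. The generator itself uses additions, comparisons and subtractions only.

```python
def stored_constants(K, f1, f2):
    """Offline pre-computation of the values kept in memory: K, 2*f2 mod K, (f1+f2) mod K."""
    return K, (2 * f2) % K, (f1 + f2) % K


def address_generator(K, two_f2_mod_K, s3_init):
    """Yield pi(0), ..., pi(K-1) following the two-step recursion of (12)."""
    pi = 0
    yield pi                       # pi(0) = 0
    s3 = s3_init                   # s3(1) mod K, read from memory
    for i in range(1, K):
        if i > 1:
            s3 += two_f2_mod_K     # step 1: update the step value
            if s3 >= K:            # simplified modulo: compare, then subtract
                s3 -= K
        pi += s3                   # step 2: add the step to the previous address
        if pi >= K:
            pi -= K
        yield pi


addresses = list(address_generator(*stored_constants(40, 3, 10)))
print(addresses[:5])  # [0, 13, 6, 19, 12]

# Memory footprint quoted in the text, for 188 parameter sets.
print(188 * (13 + 9 + 10))   # 6016 bits for K, f1, f2
print(188 * (13 + 10 + 11))  # 6392 bits for the proposed stored values
```

Because both running values stay below K after each conditional subtraction, every sum is below 2K, which is consistent with the operand width of at most 14 bits quoted for block lengths up to 6144.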

The VHDL code was tested using Modelsim SE-64 10.1c. A waveform example is illustrated in Fig. 7 for K = 1280. Furthermore, the synthesis procedure was performed using Xilinx ISE Design Suite 13.4 for the target device mentioned above. The results show that the design requires an amount of resources lower by an order of magnitude than the classical implementation presented in [8] for the same target device. The proposed hardware design requires 15 slice registers and 9 lookup tables (LUTs), with a maximum clock frequency of 224.459 MHz (equivalent to a minimum clock period of 4.455 ns).

5. CONCLUSIONS

The paper presented a hardware implementation with a low cost in chip resources for the address generation block used in the turbo QPP interleaving procedure of the LTE standard. We presented a simplified recursive mathematical model, which requires no multiplications or divisions. The corresponding modulo operations are replaced by comparisons and subtractions. Additionally, the hardware resources used for binary representations are greatly reduced. In order to demonstrate the validity of our solution, we illustrated simulation results and presented the hardware requirements generated by the synthesis process for a Xilinx Virtex 5 FPGA device.

ACKNOWLEDGMENTS

The work has been funded by the Sectoral Operational Programme Human Resources Development 2007–2013 of the Ministry of European Funds through the Financial Agreement POSDRU/159/1.5/S/134398.

Received on June 30, 2015

REFERENCES

1. C. Berrou, A. Glavieux, Near optimum error correcting coding and decoding: Turbo-Codes, IEEE Trans. Communications, 44, 10, pp. 1261–1271, 1996.
2. C. Berrou, M. Jézéquel, Non binary convolutional codes for turbo coding, Electronics Letters, 35, 1, pp. 39–40, 1999.
3. C. Berrou, A. Glavieux, P. Thitimajshima, Near Shannon limit error-correcting coding and decoding: Turbo Codes, IEEE Proceedings of the Int. Conf. on Communications, Geneva, Switzerland, 1993, pp. 1064–1070.
4. F. Khan, LTE for 4G Mobile Broadband, Cambridge University Press, New York, 2009.
5. *** 3rd Generation Partnership Project; Technical Specification Group Radio Access Network; Evolved Universal Terrestrial Radio Access (E-UTRA); Multiplexing and channel coding (Release 8), 3GPP TS 36.212 V8.7.0 (2009-05) Technical Specification.
6. S. Chae, A low complexity parallel architecture of turbo decoder based on QPP interleaver for 3GPP-LTE/LTE-A, http://www.design-reuse.com/articles/31907/turbo-decoder-architecture-qpp-interleaver-3gpplte-lte-a.html.
7. M. C. Valenti, J. Sun, The UMTS Turbo Code and an Efficient Decoder Implementation Suitable for Software-Defined Radios, International Journal of Wireless Information Networks, 8, 4, 2001.
8. C. Anghel, C. Stanciu, C. Paleologu, Efficient FPGA Implementation of a Channel Turbo Decoder for LTE Systems, Rev. Roum. Sci. Techn. – Électrotechn. et Énerg., 60, 2, pp. 163–173, 2015.
9. Yang Sun, Joseph R. Cavallaro, Efficient hardware implementation of a highly-parallel 3GPP LTE/LTE-advance turbo decoder, Integration, the VLSI Journal, 44, pp. 305–315, 2011.
10. Di Wu, R. Asghar, D. Liu, Implementation of a High-Speed Parallel Turbo Decoder for 3GPP LTE Terminals, IEEE Proceedings of the Int. Conf. on ASIC, Chengdu, China, 2009, pp. 481–484.
11. R. Asghar, Di Wu, J. Eilert, D. Liu, Memory Conflict Analysis and a Re-configurable Interleaver Architecture Supporting Unified Parallel Turbo Decoding, Journal of Signal Processing Systems, 60, 1, pp. 15–29, 2010.
12. S. Wang, L. Liu, Z. Wen, High Speed QPP Generator with Optimized Parallel Architecture for 4G LTE-A System, Int. Journal of Advancements in Computing Technology, 4, 3, pp. 355–364, 2010.
13. *** Modelsim Reference Manual, Software Version 6.5e, 2010, http://cseweb.ucsd.edu/classes/fa10/cse140l/lab/docs/modelsim_ref.pdf, Mentor Graphics Corporation.
14. *** Xilinx Virtex 5 family user guide, www.xilinx.com.
15. *** Xilinx ML507 evaluation platform user guide, www.xilinx.com.