EFFICIENT RECURSIVE IMPLEMENTATION OF A QUADRATIC PERMUTATION POLYNOMIAL INTERLEAVER FOR LONG TERM EVOLUTION SYSTEMS

Rev. Roum. Sci. Techn. – Électrotechn. et Énerg., Vol. 61, 1, pp. 53–57, Bucarest, 2016

Électronique et transmission de l'information

CRISTIAN STANCIU¹, CRISTIAN ANGHEL, CONSTANTIN PALEOLOGU

Key words: Long term evolution (LTE), Turbo codes, Interleaver, Field programmable gate array (FPGA) implementation.

This paper describes an efficient hardware implementation of the address generation block used in the interleaving procedure associated with the channel turbo coding/decoding modules of the long term evolution (LTE) standard. The solution exploits key arithmetic properties of the corresponding equation to perform the address computation recursively. The proposed method replaces divisions and multiplications with comparisons and subtractions. The new implementation model targets a Xilinx Virtex 5 XC5VFX70T field programmable gate array (FPGA) device.

1. INTRODUCTION

Turbo codes were introduced in [1–3] as an alternative (with superior performance) to the classic methods of the forward error correction (FEC) group. Although their arithmetic complexity was initially prohibitive, advances in hardware platforms gradually earned turbo coding mandatory status in communication standards. Current field programmable gate arrays (FPGAs) and digital signal processors (DSPs) allow the implementation of complex processing architectures, such as the long term evolution (LTE) coding/decoding structures. One of the most important developments introduced by LTE technology [4, 5] is the turbo interleaving block, which is based on a quadratic permutation polynomial (QPP) function. The block is tailored for the high transmission rates associated with parallel decoding architectures [6]. Turbo decoding is performed iteratively, using at each stage the extrinsic values produced by the previous iteration [7, 8].
The decoder takes advantage of the QPP interleaver, which allows the decoding process to be parallelized. An iteration is completed when the data has been processed by the two soft input soft output (SISO) decoding units. The SISOs are interconnected so that each unit receives at its input the output values produced by the other.

The goal of the paper is to present a simplified implementation of the interleaving block used in the LTE turbo coding and decoding architectures. The standard hardware model for the included address generation block uses pipelined multiplications and divisions. The required hardware resources are costly, considering that the interleaving block is employed for both transmission and reception tasks. Other solutions [9–12] devised more efficient methods for recursively computing the output address values, but the complexity of the arithmetic operations can be reduced further. We will demonstrate that the multiplications and divisions can be completely eliminated by exploiting the properties of the modulo operation and the recursive nature of the associated QPP address generation expression.

The paper is organized as follows. Section 2 introduces the LTE turbo coding/decoding structures. In Section 3, the functionality of the address generation block is described and a new efficient version is proposed for the interleaving and de-interleaving operations. The new model requires only basic arithmetic operations to perform the calculations and uses significantly fewer hardware resources. Section 4 presents the hardware implementation. Results are discussed for tests performed in Modelsim [13] and for the synthesis process targeting a Xilinx Virtex 5 device [14, 15]. In Section 5, we present the final conclusions and the perspectives of this study.
2. LTE CODING/DECODING CONFIGURATIONS

The LTE coding is performed with a parallel concatenated convolutional code (PCCC), comprising two constituent encoders and one interleaving block. The structure is illustrated in Fig. 1. Each individual 8-state constituent encoder has the following transfer function:

$$ G(D) = \left[ 1, \; \frac{g_1(D)}{g_0(D)} \right], \tag{1} $$

where $D$ denotes the basic delay block and

$$ g_0(D) = 1 + D^2 + D^3; \quad g_1(D) = 1 + D + D^3. \tag{2} $$

The information generated by the coding structure consists of the input bits $C_k$ ($k = 1, \ldots, K$, where $K$ is the length of the uncoded data block), denoted as $X_k$ at the output, and the parity bits produced by the two constituent encoders, denoted as $Z_k$ and $Z'_k$, respectively. The second constituent encoder operates on an interleaved (reordered) version of the input bits $C_k$, denoted by

$$ C'_i = C_{\pi(i)}, \quad i = 1, \ldots, K, \tag{3} $$

where $\pi(i)$ is an address computed as

$$ \pi(i) = \left( f_1 i + f_2 i^2 \right) \bmod K. \tag{4} $$

The parameters $f_1$, $f_2$, and the block length $K$ are standardized in 188 possible sets of values and can be found in Table 5.1.3-3 in [5].

¹ Politehnica University of Bucharest, Iuliu Maniu 1-3, Sect 6, Bucharest, room B10, E-mail: {cristian, canghel, pale}@comm.pub.ro
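The constituent encoder defined by (1) and (2) can be sketched in a few lines of Python. This is only an illustrative bit-level model, not the authors' VHDL: the function name `rsc_parity` is hypothetical, the register is assumed to start in the all-zero state, and trellis termination is omitted.

```python
def rsc_parity(bits):
    """Parity stream of one 8-state constituent encoder, G(D) = [1, g1(D)/g0(D)].

    Assumptions: all-zero initial state, no trellis termination.
    """
    s1 = s2 = s3 = 0  # register contents after 1, 2 and 3 delays
    parity = []
    for c in bits:
        a = c ^ s2 ^ s3             # feedback branch, g0(D) = 1 + D^2 + D^3
        parity.append(a ^ s1 ^ s3)  # feedforward branch, g1(D) = 1 + D + D^3
        s1, s2, s3 = a, s1, s2      # shift the register by one position
    return parity


if __name__ == "__main__":
    # Impulse response of the recursive encoder for an 8-bit input.
    print(rsc_parity([1, 0, 0, 0, 0, 0, 0, 0]))
```

In a full turbo encoder, the first constituent encoder would receive the bits in natural order and the second one the same bits reordered according to (3).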

Fig. 1 – LTE turbo encoder [6].

The LTE turbo decoding scheme used at the receiver is illustrated in Fig. 2. The principle of turbo decoding is an iterative processing of data between the two SISO units. An iteration is completed once the data has passed through both SISO blocks. The input of one SISO uses the previously computed output of the other decoding unit. While one SISO unit is decoding the input information, the second one waits for the end of the process before starting its own decoding phase. Furthermore, the interleaving and de-interleaving blocks have the same hardware structure and process only frames of data that must be available before the computations begin. Thus, the hardware implementation of the LTE turbo decoder requires only one SISO unit and one interleaving/de-interleaving block. The elements located in positions $i = 0, \ldots, K-1$ are moved according to the predetermined function presented in (3) and (4).

The LTE system requires two interleaving structures for each device, associated with the transmitting and receiving procedures. Most of the hardware cost is due to the address generation function presented in (4). The apparent arithmetic requirements for the computation of the memory addresses $\pi(i)$ consist of one addition, three multiplications, and one division (which is used to extract the remainder associated with the modulo operation). The associated denominators are the values of $K$, and the remainders (the results of the modulo computations) are smaller than the corresponding lengths $K$. Figure 3 illustrates, for each of the possible data block lengths $K$ (i.e., each of the intervals $i = 0, \ldots, K-1$), the maximum values of the dividends and quotients associated with (4).
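The direct evaluation of (4), together with the size of the intermediate values it produces, can be sketched as follows. The function names are hypothetical, and the two parameter sets used here, (K, f1, f2) = (40, 3, 10) and (6144, 263, 480), are quoted from memory of Table 5.1.3-3 in [5]; they should be treated as assumptions and checked against the specification.

```python
def qpp_direct(K, f1, f2):
    """Direct evaluation of (4): pi(i) = (f1*i + f2*i^2) mod K, i = 0..K-1."""
    return [(f1 * i + f2 * i * i) % K for i in range(K)]


def worst_case_widths(K, f1, f2):
    """Unsigned bit widths of the largest dividend and quotient met in (4)."""
    i = K - 1                        # the dividend grows monotonically with i
    dividend = f1 * i + f2 * i * i
    quotient = dividend // K
    return dividend.bit_length(), quotient.bit_length()


if __name__ == "__main__":
    addresses = qpp_direct(40, 3, 10)
    print(addresses[:4])
    print(sorted(addresses) == list(range(40)))   # every address used once
    print(worst_case_widths(6144, 263, 480))
```

The widths reported by this sketch describe a single parameter set only; the maxima over all 188 sets are the ones plotted in Fig. 3.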
It can be noticed that a finite numeric format must represent values up to billions for the dividends and up to millions for the quotients. The values illustrated in Fig. 3 require up to 35 and 23 bits, respectively, for unsigned integer representations. Furthermore, a hardware implementation must generate the values $\pi(i)$ with minimum delay, which calls for pipelined arithmetic. The large numeric ranges and the pipelined system occupy large chip areas.

Fig. 2 – LTE turbo decoding scheme.

Fig. 3 – Maximum dividends and quotients for the interleaver address generator.

3. EFFICIENT ADDRESS GENERATION FOR THE LTE INTERLEAVER

As mentioned before, the function of an interleaver is to reorder the elements of a data block with $K$ values. The address computation in (4) can be replaced with a more efficient technique, which works recursively using only additions and simplified modulo operations. Similar approaches have been presented in the literature [9–12], without completely eliminating the multiplications and divisions. To this end, we introduce the notation:

$$ p(i) = f_1 i + f_2 i^2. \tag{5} $$

The value of $\pi(i)$ is obtained by applying the modulo-$K$ operator to $p(i)$. The recursive computation of $p(i)$ was introduced to reduce the arithmetic complexity [9, 10, 12]. The function can be expressed at each stage as:

$$
\begin{aligned}
p(0) &= 0, \\
p(1) &= f_1 + f_2 = p(0) + f_1 + f_2, \\
p(2) &= 2 f_1 + 4 f_2 = p(1) + f_1 + 3 f_2, \\
p(3) &= 3 f_1 + 9 f_2 = p(2) + f_1 + 5 f_2, \\
&\;\;\vdots \\
p(i) &= p(i-1) + s_1 + s_2(i).
\end{aligned} \tag{6}
$$

The value of $p(i)$ can be expressed using two step functions, denoted $s_1$ and $s_2(i)$. The step $s_1$ is the constant $f_1$, as it represents the regularly increasing contribution of $f_1$, proportional to $i$. The second step increases its contribution to $p(i)$ nonlinearly, based on the constant $f_2$ and the squared index $i$. Thus, $s_2(i)$ can be recursively expressed as:

$$
s_2(i) =
\begin{cases}
0, & i = 0, \\
f_2, & i = 1, \\
s_2(i-1) + 2 f_2, & i > 1.
\end{cases} \tag{7}
$$

Fig. 4 – Comparison between the values $2 f_2$ and the corresponding frame lengths $K$.

Additionally, we can use (6) to rewrite the relation between $\pi(i)$ and $p(i)$:

$$ \pi(i) = p(i) \bmod K = \left[ p(i-1) + s_1 + s_2(i) \right] \bmod K. \tag{8} $$

The multiplications are replaced by additions and the arithmetic complexity is reduced. Nevertheless, a division is still required for the modulo operation. Considering that the modulo operator applied to a sum of elements can be expressed as

$$ \Big( \sum_k c_k \Big) \bmod K = \Big( \sum_k \left( c_k \bmod K \right) \Big) \bmod K, \tag{9} $$

we propose to modify the computation of $\pi(i)$ in (8) to considerably reduce the arithmetic complexity. The number of modulo operations increases, but the complexity of the corresponding divisions is reduced as a consequence of having smaller quotients. Consequently, using (7), (8), and (9), we obtain:

$$
\begin{aligned}
\pi(i) &= \left[ p(i-1) + s_1 + s_2(i) \right] \bmod K \\
&= \left[ \pi(i-1) + f_1 \bmod K + \left( s_2(i-1) + 2 f_2 \right) \bmod K \right] \bmod K \\
&= \left[ \pi(i-1) + f_1 \bmod K + s_2(i-1) \bmod K + (2 f_2) \bmod K \right] \bmod K.
\end{aligned} \tag{10}
$$

The values $\pi(i-1)$ and $s_2(i-1) \bmod K$ are computed at the previous iteration. The computation of $f_1 \bmod K$ is not necessary, since every $f_1$ value is smaller than its corresponding frame length $K$. Also, $(2 f_2) \bmod K$ is a constant parameter for a specific frame length. Figure 4 compares the values $2 f_2$ and $K$ for all of the frame lengths. It can be noticed that the division of $2 f_2$ by $K$ has a quotient of 0 or 1.
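As a quick check of the additive recursion in (6) to (8), taken before the modulo operation is split as in (10), the following Python sketch accumulates $p(i)$ with additions only and compares the result with the direct formula (4). The function names are hypothetical, and the parameter set (K, f1, f2) = (40, 3, 10) is assumed from Table 5.1.3-3 in [5].

```python
def qpp_additive(K, f1, f2):
    """pi(i) via (6)-(8): p(i) is accumulated with additions only."""
    two_f2 = f2 + f2                  # constant increment of the step s2(i)
    p, s2 = 0, 0                      # p(0) = 0, s2(0) = 0
    addresses = [0]                   # pi(0) = p(0) mod K
    for i in range(1, K):
        s2 = f2 if i == 1 else s2 + two_f2   # step function (7)
        p = p + f1 + s2                       # recursion (6) with s1 = f1
        addresses.append(p % K)               # (8): still a wide division
    return addresses


def qpp_reference(K, f1, f2):
    """Reference values computed directly from (4)."""
    return [(f1 * i + f2 * i * i) % K for i in range(K)]


if __name__ == "__main__":
    print(qpp_additive(40, 3, 10) == qpp_reference(40, 3, 10))
```

Note that `p` still grows without bound in this form, so the final `%` hides a division with a large dividend; removing it is the purpose of (9) and (10).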
The specified modulo operation is straightforward and can be computed at the beginning of a frame processing or pre-stored in a memory.

Fig. 5 – Comparison between the values $f_1 + f_2$ and the corresponding frame lengths $K$.

Moreover, for the last sum in (10), all of the mentioned values are smaller than $K$. As a result, the last modulo operation has an integer quotient no larger than 3. Alternatively, the additions can be interleaved with modulo operations, starting with the last two terms, in order to generate the value $s_2(i) \bmod K$ needed for iteration $i + 1$. The alternation between additions and modulo operations reduces the maximum possible quotient to 1, which allows the remainder to be extracted with a comparison with $K$ and a possible subtraction instead of a division. The arithmetic complexity required for the address generation is thus reduced to three additions and three simplified modulo procedures (with maximum quotient 1) per address index $i$.

The efficiency of the hardware implementation of the address generation block can be further increased. Since $f_1$ and $f_2$ in (10) are constant for a given frame length $K$, their contribution to the function $p(i)$ can be included in a single pre-computed value. Therefore, the functions $s_1$ and $s_2(i)$ can be combined into a single step function:

$$
s_3(i) = s_1 + s_2(i) =
\begin{cases}
0, & i = 0, \\
f_1 + f_2, & i = 1, \\
s_3(i-1) + 2 f_2, & i > 1.
\end{cases} \tag{11}
$$
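The three-addition form of (10), in which every addition is followed by a comparison and a possible subtraction, can be sketched as below. This is an illustrative software model under the assumptions stated in the text ($f_1 < K$ and $2 f_2 < 2K$); the function names are hypothetical and the parameter values are assumed from Table 5.1.3-3 in [5].

```python
def mod_once(x, K):
    """Reduced modulo for 0 <= x < 2K: one comparison, at most one subtraction."""
    return x - K if x >= K else x


def qpp_three_step(K, f1, f2):
    """Addresses via (10): three additions, each followed by a reduced modulo.

    Assumes f1 < K and 2*f2 < 2K (hence f2 < K).
    """
    two_f2 = mod_once(f2 + f2, K)     # (2*f2) mod K, constant for the frame
    s2 = 0                            # holds s2(i) mod K
    pi = 0                            # pi(0) = 0
    addresses = [pi]
    for i in range(1, K):
        s2 = f2 if i == 1 else mod_once(s2 + two_f2, K)  # last two terms of (10)
        step = mod_once(f1 + s2, K)                       # add the f1 term
        pi = mod_once(pi + step, K)                       # add pi(i-1)
        addresses.append(pi)
    return addresses


if __name__ == "__main__":
    expected = [(3 * i + 10 * i * i) % 40 for i in range(40)]
    print(qpp_three_step(40, 3, 10) == expected)
```

Every operand stays below $2K$ before reduction, so no intermediate value needs more than one bit beyond the width of $K$.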

By using (9) and (11) in (8), the result is:

$$
\begin{aligned}
\pi(i) &= p(i) \bmod K = \left[ p(i-1) + s_3(i) \right] \bmod K \\
&= \left[ p(i-1) + s_3(i-1) + 2 f_2 \right] \bmod K \\
&= \left[ \pi(i-1) + s_3(i-1) \bmod K + (2 f_2) \bmod K \right] \bmod K.
\end{aligned} \tag{12}
$$

All of the values in the last stage of (12) are lower than $K$. They are either available recursively, such as $\pi(i-1)$ and $s_3(i-1) \bmod K$, or can be predetermined and stored, as is the case for $(2 f_2) \bmod K$. Moreover, Fig. 5 compares the 188 possible frame lengths $K$ with the corresponding values $f_1 + f_2$ used for the initialization of the step function in (11). In several cases, $f_1 + f_2 > K$. Thus, the value $s_3(1) = (f_1 + f_2) \bmod K$ must also be pre-computed and stored for use at the beginning of every address generation round. For each computed address, two additions and two modulo operations (with maximum quotient 1) are necessary, using a structure that alternates the two operations. The overall arithmetic complexity of the address generation module is reduced from $3K$ additions and $3K$ simplified modulo operations, corresponding to (10), to $2K$ additions and $2K$ simplified modulo procedures, associated with (12).

Our model is a particular configuration of the parallel setup in [9], which has lower data processing delays and higher arithmetic complexity, proportional to the number $b$ of address generation blocks (~$2Kb$ additions and ~$2Kb$ simplified modulo procedures).

4. HARDWARE IMPLEMENTATION OF ADDRESS GENERATION

For the hardware implementation of the address generation block we targeted a Xilinx Virtex 5 XC5VFX70T FPGA. We performed the hardware implementation using the very high speed integrated circuit hardware description language (VHDL).

The standard block requires the knowledge of the parameters $K$, $f_1$, and $f_2$. Considering the corresponding values, available in Table 5.1.3-3 in [5], the total number of bits required for their representation is 13 + 9 + 10 = 32. The LTE standard has 188 sets of values for the specified parameters, which must be available in a memory. Consequently, by multiplying the size of a location by the total number of locations, we obtain a required memory of 6016 bits. For the implementation of the model proposed in (12), we imposed the pre-storing of the values $K$, $(2 f_2) \bmod K$, and the starting values for $s_3(i)$, i.e., $s_3(1) = (f_1 + f_2) \bmod K$. The required number of bits for a memory location is 13 + 10 + 11 = 34. Thus, the total amount of memory (6392 bits) is slightly larger than in the standard approach, but smaller than in the general solution presented in [9], which assumes a non-zero starting input address. The memory requirements in [9] can increase up to 188·13 + 39·188·b bits, where $b$ is the number of parallel address generation blocks.

The functionality of the proposed model is illustrated in Fig. 6 for stage number $i$. The algorithm works in two steps. In the first step, the value $s_3(i) \bmod K$ is determined by adding its previous value to $(2 f_2) \bmod K$. The modulo operator is then applied, i.e., the result is compared with $K$, which is subtracted from the sum if necessary. In step 2, the computed value is added to the previous output of the address generation block, i.e., $\pi(i-1)$. In a similar procedure, the modulo operator is applied to the sum, with a maximum quotient of 1. Thus, a comparison and the corresponding possible subtraction are performed. The address generation results for stage $i$ are successively available with a delay of 4 clock cycles. The maximum number of operations required for a full set of address computations is reduced to $2K$ comparisons and $4K$ additions/subtractions.

The method improves on the solutions presented in [10–12] by eliminating all multiplications and divisions. The low numerical range of the operands (with numbers lower than $2K$) allows the use of minimal resources for the representation of binary values (i.e., at most 14 bits per operand). The address generation is performed for consecutive input values and allows the use of a single address computation module.

Fig. 6 – Proposed address generation scheme.

Fig. 7 – Modelsim example (first and last generated addresses); K = 1280.
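The two-step procedure associated with (12) can be modelled in software as follows. This is a behavioural sketch rather than the VHDL design: the names `reduce_once`, `stored_parameters`, and `qpp_address_generator` are hypothetical, and the parameter set (K, f1, f2) = (56, 19, 42) is an assumed example chosen because it exercises both $f_1 + f_2 > K$ and $2 f_2 > K$.

```python
def reduce_once(x, K):
    """Modulo with maximum quotient 1: compare with K, subtract if needed."""
    return x - K if x >= K else x


def stored_parameters(K, f1, f2):
    """Values assumed to be pre-stored per frame length: (2*f2) mod K and s3(1)."""
    return (2 * f2) % K, (f1 + f2) % K


def qpp_address_generator(K, two_f2_mod_K, s3_start):
    """Yield pi(0), ..., pi(K-1) using only additions, comparisons, subtractions."""
    pi, s3 = 0, s3_start
    yield pi                                            # pi(0) = 0
    for i in range(1, K):
        if i > 1:
            s3 = reduce_once(s3 + two_f2_mod_K, K)      # step 1: update s3(i) mod K
        pi = reduce_once(pi + s3, K)                    # step 2: add to pi(i-1)
        yield pi


if __name__ == "__main__":
    K, f1, f2 = 56, 19, 42
    generated = list(qpp_address_generator(K, *stored_parameters(K, f1, f2)))
    print(generated == [(f1 * i + f2 * i * i) % K for i in range(K)])
```

The modulo operations inside `stored_parameters` run once per frame length and correspond to the pre-stored memory contents; the per-address loop itself contains no multiplication or division.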

The VHDL code was tested using Modelsim SE-64 10.1c. A waveform example is illustrated in Fig. 7 for K = 1280. Furthermore, the synthesis procedure was performed using Xilinx ISE Design Suite 13.4 for the target device mentioned above. The results show that the design requires an order of magnitude fewer resources than the classical implementation presented in [8] for the same target device. The proposed hardware design requires 15 slice registers and 9 lookup-tables (LUTs), with a maximum clock frequency of 224.459 MHz (equivalent to a minimum clock period of 4.455 ns).

5. CONCLUSIONS

The paper presented a hardware implementation, with a low cost in chip resources, of the address generation block used for the turbo QPP interleaving procedure in the LTE standard. We presented a simplified recursive mathematical model, which requires no multiplications or divisions. The corresponding modulo operations are replaced by comparisons and subtractions. Additionally, the hardware resources used for binary representations are greatly reduced. To demonstrate the validity of our solution, we illustrated simulation results and presented the hardware requirements generated by the synthesis process for a Xilinx Virtex 5 FPGA device.

ACKNOWLEDGMENTS

The work has been funded by the Sectoral Operational Programme Human Resources Development 2007–2013 of the Ministry of European Funds through the Financial Agreement POSDRU/159/1.5/S/134398.

Received on June 30, 2015

REFERENCES

1. C. Berrou, A. Glavieux, Near optimum error correcting coding and decoding: Turbo-Codes, IEEE Trans. Communications, 44, 10, pp. 1261–1271, 1996.
2. C. Berrou, M. Jézéquel, Non binary convolutional codes for turbo coding, Electronics Letters, 35, 1, pp. 39–40, 1999.
3. C. Berrou, A. Glavieux, P. Thitimajshima, Near Shannon limit error-correcting coding and decoding: Turbo Codes, IEEE Proceedings of the Int. Conf. on Communications, Geneva, Switzerland, 1993, pp. 1064–1070.
4. F. Khan, LTE for 4G Mobile Broadband, Cambridge University Press, New York, 2009.
5. *** 3rd Generation Partnership Project; Technical Specification Group Radio Access Network; Evolved Universal Terrestrial Radio Access (E-UTRA); Multiplexing and channel coding (Release 8), 3GPP TS 36.212 V8.7.0 (2009-05) Technical Specification.
6. S. Chae, A low complexity parallel architecture of turbo decoder based on QPP interleaver for 3GPP-LTE/LTE-A, http://www.design-reuse.com/articles/31907/turbo-decoder-architecture-qpp-interleaver-3gpplte-lte-a.html.
7. M. C. Valenti, J. Sun, The UMTS Turbo Code and an Efficient Decoder Implementation Suitable for Software-Defined Radios, International Journal of Wireless Information Networks, 8, 4, 2001.
8. C. Anghel, C. Stanciu, C. Paleologu, Efficient FPGA Implementation of a Channel Turbo Decoder for LTE Systems, Rev. Roum. Sci. Techn. – Électrotechn. et Énerg., 60, 2, pp. 163–173, 2015.
9. Yang Sun, Joseph R. Cavallaro, Efficient hardware implementation of a highly-parallel 3GPP LTE/LTE-advance turbo decoder, Integration, the VLSI Journal, 44, pp. 305–315, 2011.
10. Di Wu, R. Asghar, D. Liu, Implementation of a High-Speed Parallel Turbo Decoder for 3GPP LTE Terminals, IEEE Proceedings of the Int. Conf. on ASIC, Chengdu, China, 2009, pp. 481–484.
11. R. Asghar, Di Wu, J. Eilert, D. Liu, Memory Conflict Analysis and a Re-configurable Interleaver Architecture Supporting Unified Parallel Turbo Decoding, Journal of Signal Processing Systems, 60, 1, pp. 15–19, 2010.
12. S. Wang, L. Liu, Z. Wen, High Speed QPP Generator with Optimized Parallel Architecture for 4G LTE-A System, Int. Journal of Advancements in Computing Technology, 4, 3, pp. 355–364, 2010.
13. *** Modelsim Reference Manual, Software Version 6.5e, 2010, http://cseweb.ucsd.edu/classes/fa10/cse140l/lab/docs/modelsim_ref.pdf, Mentor Graphics Corporation.
14. *** Xilinx Virtex 5 family user guide, www.xilinx.com.
15. *** Xilinx ML507 evaluation platform user guide, www.xilinx.com.