Energy-efficient Reconfigurable FEC Processor for Multi-standard Wireless Communication Systems


JOURNAL OF SEMICONDUCTOR TECHNOLOGY AND SCIENCE, VOL.17, NO.3, JUNE, 2017 ISSN(Print) ISSN(Online)

Meng Li 1, Liesbet Van der Perre 2, Wim van Thillo 1, and Youngjoo Lee 3,*

Abstract: In this paper, we describe HW/SW co-optimizations for reconfigurable application-specific instruction-set processors (ASIPs). Based on our previous very long instruction word (VLIW) ASIP, the proposed framework realizes various forward error-correction (FEC) algorithms for wireless communication systems. To enhance the energy efficiency, we introduce several design methodologies, including high-radix algorithms, task-level out-of-order execution, and intensive resource allocation with loop-level rescheduling. The case study on radix-4 turbo decoding shows that the proposed techniques improve the energy efficiency by 3.7 times compared to the previous architecture.

Index Terms: Digital integrated circuits, error correction codes, programmable circuits, wireless communication

Manuscript received Mar. 21, 2016; accepted Dec. 14,
1 Interuniversity Microelectronics Center (IMEC) vzw, 3001 Leuven, Belgium
2 Department of Electrical Engineering, KU Leuven, 3001 Leuven, Belgium
3 Department of Electrical Engineering, POSTECH, 37673, Pohang, Korea
yjlee.ims@gmail.com

I. INTRODUCTION

In recent decades, numerous communication standards have been developed to improve the connectivity of mobile devices. Recent specifications must satisfy stringent demands on data rate, reliability, and bandwidth efficiency. To increase data integrity, iterative forward error-correction (FEC) codes have been widely adopted because of their powerful error-correcting capabilities [1-3].
Because the parameters differ across FEC standards, it is quite challenging to design highly optimized decoder ASICs while meeting tight time-to-market (TTM) requirements [4-6]. Previous processor-based solutions provide flexibility that reduces the TTM; however, they normally use far more hardware resources than fixed ASICs, resulting in power-hungry realizations [7-11]. To provide a flexible decoder architecture with acceptable energy efficiency, this paper presents novel design frameworks for FEC application-specific instruction-set processors (ASIPs). In contrast to the previous multi-standard approaches, which develop unified hardware resources shared among different FEC specifications [12, 13], the proposed design procedures co-optimize the hardware architecture and the software kernels. More precisely, we propose novel methodologies at the algorithm, architecture, and firmware levels based on our previous flexible ASIP [8]. In the proposed method, we first investigate hardware-friendly FEC decoding algorithms. By relaxing the severe congestion on the register files (RFs), the proposed high-level instructions allow task-level out-of-order execution, which reduces the number of operating cycles. Considering the long-latency memory requests inside a loop, the proposed loop-level rescheduling further enhances the decoding throughput by reordering instructions to eliminate waiting cycles. To show the impact of the proposed design methods, the optimized radix-4 LTE turbo decoder on the FlexFec is implemented as a prototype in a 40 nm low-power (LP) CMOS process. Compared to the previous non-optimized

architecture, the prototype improves the area and energy efficiencies by 3.3 and 3.7 times, respectively. The rest of this paper is organized as follows. Section II describes the background of this work. Section III presents our design frameworks. A case study on the radix-4 turbo decoder is described and compared to the previous works in Section IV. Conclusions are drawn in Section V.

II. BACKGROUNDS

In this section, we describe our previous FlexFec platform, whose resource configurations can be changed at design time [8]. Fig. 1 shows a block diagram of the FlexFec, which includes multiple processing units associated with two RFs, a multi-step crossbar network (Xbar) controlled by the flexible address generation unit (AGU), and high-speed host interfaces. The decoding process is performed by the flipr core, an energy-efficient VLIW processor connected to a number of on-chip SRAMs, which are drawn as gray blocks. Note that the operation is programmable by loading the proper kernels into the program memory (PM), the data memory (DM), and the AGU. The received codeword is first moved into the background memories, denoted as BM1 and BM2, which realize a double-buffering scheme. In the flipr core, one scalar and five 96-way vector operations are processed simultaneously. Two on-chip memories, VM1 and VM2, are reserved for storing intermediate data with single-cycle instructions. However, the BM cannot be accessed in a single cycle because the multi-step Xbar is unavoidable when decoding iterative FEC codes [4-8]. Based on the generalized instruction-set architecture (ISA), the FlexFec can support arbitrary LDPC, turbo and Viterbi codes, which are the most popular FEC codes.

Fig. 1. The reconfigurable FlexFec ASIP architecture [8].
However, the throughput of the FlexFec cannot be increased simply by introducing more parallel operations, because the software kernels are written with single-cycle vector instructions and therefore cause severe write congestion on the vector RF. Moreover, if the BM is frequently accessed through the Xbar, the following instructions that depend on the loaded data must wait for a number of cycles. Hence, the decoding throughput of the previous ASIP is limited by the large portion of no-operation (NOP) instructions. To reach a high throughput, ASIP-based solutions generally use multiple cores, which increases the decoding energy significantly [9-11]. As future wireless systems demand flexible decoders that are both high-throughput and energy-efficient, an advanced design framework is needed that enhances the decoding throughput without increasing the energy consumption of each ASIP core.

III. PROPOSED OPTIMIZATIONS

Before defining the FEC ASIP architecture, it is necessary to select proper high-throughput algorithms that can be realized with simple hardware resources. Numerous studies have presented attractive solutions for multi-standard FEC decoders [6-8, 12, 13]. Simplified FEC algorithms such as min-sum LDPC decoding and max-log-MAP turbo decoding are widely used, as both are conceptually based on similar max (or min) operations [4, 5]. Parallel structures for layered LDPC decoders and sliding-window-based turbo decoders have been combined into a flexible structure by sharing the same SRAM buffers [6]. In addition, as in dedicated ASIC-based decoders, high-radix decoding algorithms have also been adopted in recent ASIP-based flexible FEC decoders [10, 13]. Although high-radix algorithms are effective in reducing the size of on-chip memories, the throughput of each ASIP core cannot be enhanced drastically because the complex computations require more cycles.
In order to increase the decoding throughput of the flexible ASIP, we present several software-level optimizations that actually shorten the processing time of high-radix decoding algorithms.
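The shared max (or min) structure mentioned above can be made concrete with a small sketch. It uses the textbook forms of the max-log-MAP add-compare-select (ACS) step and the min-sum check-node update; the function names and the scalar, floating-point formulation are assumptions made for illustration and do not reproduce the FlexFec kernels, which operate on 96-way vectors.

```python
def acs(path_metrics, branch_metrics):
    """Add-compare-select for one trellis state (max-log-MAP form).

    Each candidate is a predecessor path metric plus its branch metric;
    the survivor is the maximum. A radix-2 step passes two candidates,
    a radix-4 step (two trellis stages merged) passes four.
    """
    return max(p + b for p, b in zip(path_metrics, branch_metrics))


def min_sum_check_node(llrs):
    """Min-sum check-node update: for every edge, combine all OTHER inputs.

    Output sign is the product of the other signs; output magnitude is
    the minimum of the other magnitudes.
    """
    out = []
    for i in range(len(llrs)):
        others = llrs[:i] + llrs[i + 1:]
        negatives = sum(1 for x in others if x < 0)
        sign = -1 if negatives % 2 else 1
        out.append(sign * min(abs(x) for x in others))
    return out


if __name__ == "__main__":
    print(acs([1.0, 3.0], [2.0, -1.0]))                      # radix-2: two candidates
    print(acs([1.0, 3.0, 0.5, 2.0], [2.0, -1.0, 4.0, 0.0]))  # radix-4: four candidates
    print(min_sum_check_node([2.0, -3.0, 4.0]))
```

Both routines reduce to additions followed by a compare-and-select tree, which is why one vector data-path can plausibly serve both decoder families; the radix-4 case only widens the tree from two to four candidates.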

Fig. 2. The processing sequences having data dependencies based on (a) the conventional in-order execution, (b) the proposed task-level out-of-order execution.

1. Task-level Out-of-order Execution

Conventionally, the ISA for flexible ASIPs contains simple instructions for basic vector operations that complete in one cycle. As illustrated in Fig. 2(a), let several independent tasks be issued serially to an ALU. In the figure, the shaded circle denoted as Tx represents a single-cycle instruction of the x-th task, which produces a write request on the vector RF. Within a task, the dependencies between two instructions are drawn as dotted arrows. There may also be dependencies between two tasks, which are drawn as solid arrows in Fig. 2(a). Note that a task consists of single-cycle instructions that depend on each other, which causes continuous write accesses. Even though multiple tasks can be issued at once by adding ALUs, the overall computing time is still bounded by the limited bandwidth of the RFs for reading operands and storing intermediate results. To solve this severe write congestion, we define a new ISA based on dedicated high-level instructions. Conceptually, a high-level instruction is a multi-cycle instruction composed of several arithmetic vector operations. The demands on the RFs are naturally alleviated, as the high-level instructions reduce the number of write requests. Therefore, other RF-writing instructions can be allocated to the non-RF-writing cycles of the high-level instruction.
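The effect of freeing RF write slots can be illustrated with a toy cycle model. Everything in it is an assumption made for illustration: a single RF write port, fixed-priority arbitration in which the lower-numbered ALU wins, a high-level instruction that writes only in its final cycle, and the hypothetical function name `run_cycles`. Only the task length of eight cycles is borrowed from the T1 and T2 rows of Fig. 2.

```python
def run_cycles(streams):
    """Count cycles until every ALU stream has finished.

    streams: one list per ALU; each entry is True if that cycle of the
    stream needs the (single, assumed) RF write port, False otherwise.
    An op that needs the port while it is already taken stalls one cycle.
    """
    pos = [0] * len(streams)
    cycles = 0
    while any(p < len(s) for p, s in zip(pos, streams)):
        port_taken = False
        for i, stream in enumerate(streams):
            if pos[i] >= len(stream):
                continue              # this ALU has nothing left to do
            if stream[pos[i]]:
                if port_taken:
                    continue          # write port busy: stall this ALU
                port_taken = True
            pos[i] += 1
        cycles += 1
    return cycles


# A task made of eight single-cycle instructions writes the RF every cycle.
single_cycle_task = [True] * 8
# The same work as one 8-cycle high-level instruction writes only at the end.
high_level_task = [False] * 7 + [True]

if __name__ == "__main__":
    print(run_cycles([single_cycle_task + single_cycle_task]))  # one ALU, in order
    print(run_cycles([single_cycle_task, single_cycle_task]))   # two ALUs, port-bound
    print(run_cycles([high_level_task, single_cycle_task]))     # high-level + overlap
```

Under these assumptions the two tasks take 16 cycles on one ALU and still 16 cycles on two ALUs, because the write port serializes them; replacing the first task with a high-level instruction lets the second task use the free write slots and the pair finishes in 9 cycles.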
Due to the serialized dependencies inside a task, however, it is hard to find instructions within the same task that can fill the non-RF-writing cycles. In our work, task-level out-of-order execution is proposed to issue, in parallel, following tasks that are independent of the current task. As depicted in Fig. 2(b), for example, the first task includes a high-level instruction whose internal cycles are drawn as squares, where only the shaded node makes a request on the RF. The instructions of the next independent task, i.e., T2, can be performed earlier on the second ALU, accessing the RF without any congestion. As a result, the processing cycles are shortened by the task-level out-of-order execution.

2. Multi-level ALU Architecture

In general, the previous pipelined ASIPs process all instructions sequentially on a generalized data-path [7]. Although the generalized data-path provides the maximum level of flexibility, it requires many operating cycles because of the in-order processing. To support the proposed task-level out-of-order execution effectively, as shown in Fig. 3, we introduce the n-level ALU, where each level performs a pre-defined vector operation at the corresponding processing unit (PUx). Note that a vector instruction is issued at the first level, i.e., PU1. In every cycle, each PU passes its instruction to the next PU together with the intermediate results, until the instruction reaches the last cycle defined by the ISA. Controlled by a wide multiplexor, the vector RF is accessed once per cycle by selecting the level whose instruction has completed. Note that it is unnecessary to provide an RF-writing path for every level. For each new high-level instruction, the workloads of the PUs have to be carefully distributed to restrict the number of RF-writing paths,

preserving the original critical delay as much as possible. To reduce the complexity further, the first level of the ALU can be merged with the previous ALU that performs a simple operation in one cycle. Instead of using the additional ALU of Fig. 2(b) for parallel processing, therefore, the out-of-order execution can be implemented naturally in a single multi-level ALU, as shown in Fig. 4. While the multi-cycle instruction advances from level to level, the independent second task can be issued to the first PU of the proposed ALU. In summary, the multi-level architecture supports the proposed task-level out-of-order execution, significantly reducing both the processing cycles and the complexity. In addition, because the PM size grows with each additional ALU in a VLIW architecture, the proposed ALU also yields a memory-reduced ASIP.

Fig. 3. The proposed multi-level vector ALU architecture.

Fig. 4. The task-level out-of-order execution using the proposed multi-level vector ALU.

3. Loop-level Rescheduling

In multi-standard FEC solutions, a flexible multi-step crossbar is normally used to support various interleaving patterns [7, 8, 11]. Hence, the previous software suffers from the long latency of reading a codeword. For example, the previous FlexFec uses 5 cycles for accessing the BM [8]. In the proposed framework, we exploit the fact that the iterative decoding process normally contains numerous loops that perform identical tasks at each bit position. Fig. 5(a) illustrates the conventional processing scenario of a loop that uses long-latency load instructions. For the sake of simplicity, the single-cycle arithmetic vector instructions are denoted as Ax without constructing tasks.
The 5-cycle load operation is drawn as triangular nodes, where the shaded node accesses the vector RF to store the codeword that has been read out. As in the previous figures, the dotted arrows show the data dependencies inside the loop. Note that, because of the dependency, the following operations have to wait for the load instruction to complete, although a dedicated unit performs the memory accesses in parallel. Moreover, if there are multiple loads in the loop, the idle waiting time increases significantly and degrades the throughput severely. In the proposed method, we reorganize the processing order inside the loop as shown in Fig. 5(b). With the proposed rescheduling, the load operation reads the memory for the next iteration of the loop, so the arithmetic operations of the current iteration no longer suffer from the time-consuming memory accesses. More precisely, the instruction denoted as A4 becomes the first operation of the loop in the proposed method, and the load required by A4 is performed at the end of the loop to prepare the next iteration. As the proposed rescheduling is conceptually similar to the software pipelining technique [14], some additional cycles are needed for the prologue of the first iteration and the epilogue of the last iteration. By eliminating the waiting time, the processing cycles in a loop are reduced from 15 to 10, as exemplified in Fig. 5(b). In other words, the proposed rescheduling maximizes the hardware utilization and reduces the processing time remarkably, leading to an energy-efficient FEC ASIP solution.

IV. IMPLEMENTATION RESULTS

To improve the throughput as well as the energy efficiency, we design a prototype ASIP-based flexible FEC decoder based on the proposed design methods. In this section, the radix-4 LTE turbo decoding on our ASIP is detailed as a case study.
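As a side note on the loop-level rescheduling of Section III, the trade-off between the shorter loop body and the added prologue and epilogue can be estimated with a few lines of arithmetic. The cycle counts (15 per conventional loop, 10 per rescheduled loop, 10 for the prologue, 5 for the epilogue) are the annotations of Fig. 5; the function names are hypothetical, and the accounting itself is an assumption: the rescheduled loop body is taken to run once per iteration with the prologue and epilogue added on top, which is a pessimistic bound if they actually absorb part of the first and last iterations.

```python
def conventional_cycles(iterations, loop_cycles=15):
    """Total cycles when every iteration waits for its own load (Fig. 5(a))."""
    return iterations * loop_cycles


def rescheduled_cycles(iterations, loop_cycles=10, prologue=10, epilogue=5):
    """Total cycles when the load for the next iteration overlaps the ALU work.

    Assumed accounting: fixed prologue and epilogue overhead plus one
    shortened loop body per iteration (Fig. 5(b) annotations).
    """
    return prologue + iterations * loop_cycles + epilogue


if __name__ == "__main__":
    for n in (1, 3, 4, 10, 100, 1000):
        before = conventional_cycles(n)
        after = rescheduled_cycles(n)
        print(f"iterations={n:5d}  conventional={before:6d}  "
              f"rescheduled={after:6d}  speedup={before / after:.3f}")
```

With these numbers the rescheduled loop breaks even at three iterations and wins from the fourth onward; for 100 iterations the model gives 1500 versus 1015 cycles, and the speedup approaches 15/10 = 1.5 as the iteration count grows, so the fixed overhead matters little for the long recursions of a turbo decoder.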
Based on the same design concepts, the radix-4 LDPC and Viterbi decoders are also implemented on the prototype ASIP. For the algorithm-level improvement of a turbo

decoder, we first select the radix-4 decoding algorithm, which has been adopted in recent decoder ASICs [4, 6]. At this level, the firmware is re-designed for radix-4 processing without changing the original ASIP architecture. To shorten the decoding time further, we split the firmware into several tasks and define high-level instructions in four categories, i.e., branch metric calculations, forward recursions, backward recursions, and output generations [4]. In this case study, as shown in Table 1, 11 high-level instructions are newly introduced, each taking up to 5 processing cycles. Note that all the high-level instructions are multi-cycle instructions. Therefore, as described in the previous section, the multi-level ALU supporting the task-level out-of-order execution relaxes the intensive write requests on the vector RF, leading to energy-efficient decoding operations. Fig. 6 illustrates the processing steps of the backward recursion in the turbo decoding algorithm based on the proposed high-level instructions. Compared to the conventional serialized operations, the proposed task-level out-of-order technique reduces the latency of the backward recursion by 20%.

Fig. 5. (a) The conventional loop processing with long-latency loads, (b) the processing sequence using the proposed loop-level rescheduling. Annotations: (a) 5-cycle load instructions, 15 cycles in a loop; (b) 10 cycles for prologue, 10 cycles in a loop, 5 cycles for epilogue.

Table 1. High-level instructions for Radix-4 Turbo Decoding
Categories: Branch metric, Forward recursion, Backward recursion, Output generation
Processing latency 2 cycles: Metric calculation, Reliability generation
Processing latency 3 cycles: ACS tree
Processing latency 4 and 5 cycles: Codeword collection, Extrinsic value calculation, LLR calculation, activation
ACS: Addition, comparison and selection. LLR: Log-likelihood ratio.

Fig. 6. Cycle reductions and throughput improvements. Axes: number of processing cycles and decoding throughput (Mb/s). Design steps: without optimizations, radix-4 decoding algorithm, task-level out-of-order execution, loop-level optimization.

Due to the inherent iterations of the turbo decoding process, the loop-level rescheduling is additionally applied to the forward and backward recursions, which are associated with long-latency load operations as depicted in Fig. 6. By breaking the data dependencies inside the loop, a number of instructions can be processed in parallel with the load, minimizing the overall processing cycles. Fig. 7 depicts how the proposed optimizations reduce the number of cycles for decoding a 6144b LTE turbo code on the prototype ASIPs. The maximum number of turbo iterations is set to six in all cases. By reducing the processing cycles at each design step, the proposed schemes shorten the total number of required cycles by a factor of 6.35 compared to the radix-2 turbo decoding on the previous FlexFec ASIP [8]. To investigate the impact of the proposed work, all the architectures are designed for a clock frequency of 450 MHz in a 40 nm CMOS process. As shown in Fig. 7, the

throughput is gradually increased by using the proposed schemes. Note that the fully optimized ASIP-based turbo decoder achieves a throughput of 315 Mb/s, which supports up to category 5 of LTE systems [1]. Fig. 8 shows how the proposed design methods improve the area and energy efficiencies of turbo decoders, where the efficiencies are defined as follows:

\[ \text{Area efficiency}\ (\mu\text{m}^2\,\text{s/b}) = \frac{\text{Area}\ (\text{mm}^2)}{\text{Throughput}\ (\text{Mb/s})} \tag{1} \]

\[ \text{Energy efficiency}\ (\text{nJ/b}) = \frac{\text{Decoding power}\ (\text{mW})}{\text{Throughput}\ (\text{Mb/s})} \tag{2} \]

Fig. 7. Backward recursion example in radix-4 turbo decoding on the proposed flexible processor.

By applying the radix-4 algorithm, as shown in Fig. 8, the decoder becomes cost-effective in terms of area and energy consumption. As the straightforward radix-4 firmware cannot meet the high-throughput demands, as shown in Fig. 7, the task-level out-of-order execution enhances the throughput by utilizing hardware resources in parallel. The multi-level ALU compensates for the area overhead by reducing the PM size, and the loop-level optimization finally improves the energy efficiency by changing the order of operations in a loop. As a result, the proposed work reduces the area and energy efficiencies for LTE turbo decoding by 3.3 and 3.7 times, respectively; under the definitions in (1) and (2), lower values are better.

Fig. 8. Area and energy efficiencies. Axes: area efficiency (μm² s/b) and energy efficiency (nJ/b). Design steps: without optimizations, radix-4 decoding algorithm, task-level out-of-order execution, multi-level ALU architecture, loop-level rescheduling.

The implementation results of the prototype FEC ASIP are summarized and compared to the previous works in Table 2. For a fair comparison, we normalize all the efficiencies to a 40 nm CMOS process with a reference voltage of 0.9 V. In addition, the maximum numbers of iterations for the turbo and LDPC decoding scenarios are set to six and ten, respectively. Note that the proposed ASIP can support arbitrary LDPC, turbo and Viterbi codes.

Table 2. Implementation results of ASIP-based FEC decoders
Columns: This work, [7], [8], [9], [10]
General rows: Process (nm), Voltage (V), Area (mm²), Frequency (MHz)
Rows for each of Turbo, LDPC and Viterbi: Standards, Throughput (Mb/s), Norm. area eff. (μm² s/b), Norm. energy eff. (nJ/b)

\[ \text{Norm. area eff.} = \left(\frac{40}{\text{Process}}\right)^{2} \times \text{Area efficiency} \]

\[ \text{Norm. energy eff.} = \left(\frac{40}{\text{Process}}\right) \times \left(\frac{0.9}{\text{Voltage}}\right)^{2} \times \text{Energy efficiency} \]

With the proposed optimizations, the prototype ASIP offers an attractive multi-standard FEC decoder. In the case of turbo decoding for LTE systems, for example, the proposed ASIP-based work achieves the highest decoding throughput among the existing ASIP-based turbo decoders, while providing similar area and

energy efficiencies.

V. CONCLUSION

In this paper, we have presented several design schemes to enhance the energy efficiency of the multi-standard FEC ASIP. By introducing advanced methods at the algorithm, firmware, and hardware-structure levels, the proposed work reduces the number of processing cycles. The case study on radix-4 turbo decoding shows that the proposed framework achieves a decoding throughput sufficient for recent wireless systems, while remarkably lowering the area and energy efficiencies as defined in (1) and (2).

ACKNOWLEDGMENTS

This work was supported by the National Research Foundation (NRF) of Korea grants funded by the Korea government (MSIP) (2016R1C1B ).

REFERENCES

[1] Multiplexing and Channel Coding, 3GPP TS , Rev , Jun
[2] IEEE Standard for Local and Metropolitan Area Networks, Part 16: Air Interface for Fixed Broadband Wireless Access Systems, IEEE Std e-2005,
[3] IEEE Standard for Information Technology, Telecommunications and Information Exchange between Local and Metropolitan Area Networks, Specific Requirements, Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications, IEEE Std n-2009,
[4] W. Byun, H. Kim, and J.-H. Kim, "High throughput radix-4 SISO decoding architecture with reduced memory requirement," J. Semicond. Technol. Sci., vol. 14, no. 4, pp , Aug
[5] Y.-M. Jung, C.-H. Chung, Y.-H. Jung, and J.-S. Kim, "7.7 Gbps encoder design for IEEE ac QC-LDPC codes," J. Semicond. Technol. Sci., vol. 14, no. 4, pp , Aug
[6] C. Condo, M. Martina, and G. Masera, "VLSI implementation of a multi-mode turbo/LDPC decoder architecture," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 60, no. 6, pp , June
[7] Z. Wu and D. Liu, "Flexible multistandard FEC processor design with ASIP methodology," in Proc. IEEE Int. Conf. Application-specific Systems, Architectures and Processors (ASAP), 2014, pp
[8] F. Naessens et al., "A mm mw reconfigurable LDPC and turbo encoder and decoder for n, e and 3GPP-LTE," in Proc. IEEE Symp. VLSI Circuits, 2010, pp
[9] B. Noethen et al., "A 105GOPS 36mm² heterogeneous SDR MPSoC with energy-aware dynamic scheduling and iterative detection-decoding for 4G in 65nm CMOS," in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC), 2014, pp
[10] P. Murugappa, R. Al-Khayat, A. Baghdadi, and M. Jezequel, "A flexible high throughput multi-ASIP architecture for LDPC and turbo decoding," in Proc. Design, Automation Test in Europe Conf. Exhib. (DATE), 2012, pp
[11] Z. Wu, D. Liu, Z. Yang, Q. Wang, and W. Zhou, "FPGA implementation of a multi-algorithm parallel FEC for SDR platforms," in Proc. IEEE Int. Conf. Field Programmable Logic and Applications (FPL), 2014, pp
[12] S. Kunze, E. Matus, G. Fettweis, and T. Kobori, "Combining LDPC, turbo and Viterbi decoders: Benefits and cost," in Proc. Int. Workshop on Signal Process. Syst. (SiPS), 2011, pp
[13] J. Dion, M. Hamon, P. Penard, M. Arzel, and M. Jezequel, "Multi-standard trellis-based FEC decoder," in Proc. IEEE Conf. Design and Architectures for Signal and Image Processing (DASIP), 2012, pp
[14] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, San Mateo, CA, USA: Morgan Kaufmann, 2011.

Meng Li received a Ph.D. degree in electrical engineering from Telecom Bretagne, France, in 2011. She joined the Green Radio Group of IMEC in 2012 and is now a senior research engineer. Her research interests cover high-speed and low-power digital circuit design for essential components in wireless baseband, especially the design of decoders for error control codes.

Liesbet Van der Perre received the M.Sc. degree in electrical engineering from the KU Leuven, Belgium in . The research for her thesis was completed at the Ecole Nationale Superieure de Telecommunications in Paris. She graduated summa cum laude with a Ph.D. degree in electrical engineering from the same university in . After finishing her Ph.D. on propagation modeling at the Telecommunications group of the KU Leuven, Belgium, Dr. Van der Perre joined IMEC in 1997 in the wireless group. She took up responsibilities as system architect, project leader, program manager, and scientific and program director, with a focus on energy efficiency in broadband communication till . Currently, she is a professor in the electrical engineering department of the KU Leuven. She is an author and co-author of over 300 scientific publications. She was appointed honorary doctor at the faculty of engineering LTH, Lund University, in .

Wim Van Thillo received his master's degree in electrical engineering and his undergraduate degree in business economics from the Katholieke Universiteit Leuven, Belgium. He obtained a Ph.D. degree from the same university based on his research in IMEC's wireless communications group. In 2008, he was a visiting researcher at UC Berkeley's Connectivity Lab. From 2012 to 2014, he led IMEC's 79 GHz radar research program. Since January 2015, he has been responsible for IMEC's R&D in cellular and WiFi transceivers, 60 GHz communications, 79 GHz radar and 140 GHz sensors.

Youngjoo Lee received the B.S., M.S. and Ph.D. degrees in electrical engineering from Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea, in 2008, 2010 and 2014, respectively. Since February 2017, he has been an Assistant Professor in the Department of Electrical Engineering, POSTECH, Pohang, Korea. Prior to joining POSTECH, he was with Interuniversity Microelectronics Center (IMEC), Leuven, Belgium, from May 2014 to February 2015, where he researched reconfigurable SoC platforms for software-defined radio systems. From March 2015 to February 2017, he was with the faculty of the Department of Electronic Engineering, Kwangwoon University, Seoul, Korea. His current research interests include algorithms and architectures for embedded processors, intelligent transportation systems, advanced error-correction codes, and mixed-signal circuit designs.
