
2013 8th International Conference on Communications and Networking in China (CHINACOM)

Improved Convolutional Coding and Decoding of IEEE 802.11n Based on General Purpose Processors

Yanuo Xu, Kai Niu, Zhiqiang He, Jiaru Lin
Key Lab of Universal Wireless Communications, Ministry of Education, Beijing University of Posts and Telecommunications, Beijing 100876, China
xuyanuo@gmail.com

Abstract - In this paper, the convolutional coding and decoding of 802.11n are improved on general purpose processor (GPP) software defined radio (SDR) platforms. The prototype makes extensive use of features of contemporary processor architectures to accelerate signal processing and satisfy the real-time protocol requirements of IEEE 802.11n, including large low-latency caches to store lookup tables and single instruction multiple data (SIMD) units on GPPs. In the prototype, the Viterbi decoder employs a parallel structure and a trace-back decoding algorithm to improve performance. The simulation results show that the prototype can satisfy the performance and real-time requirements of IEEE 802.11n. Considering the rapid development of GPPs, the data processing capacity of our prototype will improve further.

Keywords - convolutional code; 802.11n; GPP; Viterbi decoder; SIMD

I. INTRODUCTION

With the rapid development of wireless local area network (WLAN) technology, IEEE 802.11n has become the mainstream WLAN standard. The transmission characteristics of a real wireless channel are far from ideal: channel noise often causes errors at the receiving end and degrades the reliability of data transmission. To solve this problem, channel codes are applied to enhance the reliability of information transmission. IEEE 802.11n adopts a convolutional code, currently the most widely used channel code in practical communication systems. Convolutional codes obtain a high coding gain with a simple encoding structure.
To decode a convolutional code, the Viterbi algorithm was first proposed in 1967 [1] and is proven to achieve maximum-likelihood decoding. With a low constraint length, good decoding performance can be obtained at quite low complexity, and the hardware structure of the Viterbi algorithm is easy to implement. Thus, it is one of the best decoding algorithms for convolutional codes, and the implementation of convolutional coding and Viterbi decoding is a key technology in IEEE 802.11n.

Many existing SDR platforms are based either on programmable hardware such as field programmable gate arrays (FPGAs) or on embedded digital signal processors (DSPs). Such hardware platforms can meet the processing and timing requirements of modern high-speed wireless protocols, but programming FPGAs and specialized DSPs is difficult. Developers have to learn how to program each particular embedded architecture, often without the support of a rich environment of programming and debugging tools. Meanwhile, GPP technology provides another approach to signal processing. According to Moore's law, the capability and integration of a microprocessor double roughly every 18 months. Recently, single-core scaling has reached the limits imposed by the physical size of semiconductor-based microelectronics: although manufacturing technology keeps improving, feature sizes of 32 nm to 45 nm can hardly be reduced much further. In this situation, the trend is to make full use of the various features of the widely adopted multi-core architectures in existing GPPs. Sora [2] is a fully programmable software radio platform on commodity PC architectures that achieves performance equivalent to IEEE 802.11a/b/g; we take Sora as the reference for comparison with our prototype. In this paper, convolutional coding and decoding of IEEE 802.11n are implemented on GPP SDR platforms.

II. ARCHITECTURE COMPARED WITH SORA

This section briefly describes the advantages of our architecture and the differences compared with Sora. Sora uses the Intel Core 2 microarchitecture, while we use the newer Intel Sandy Bridge architecture. We mainly analyze the three parts used in our implementation to accelerate the signal processing of IEEE 802.11n: the instruction set, the cache, and multi-core with multi-threading.

A. Instruction set

General purpose processor platforms provide many SIMD instruction sets to optimize data processing. SIMD computations (see Figure 1) were introduced to the architecture with MMX technology, which allows SIMD computations to be performed on packed byte, word, and double-word integers held in a set of eight 64-bit MMX registers. Figure 1 shows a typical SIMD floating-point computation. OP in Figure 1 stands for the operation performed on Xi and Yi (i = 1, 2, 3, 4). Two sets of four packed floating-point data elements are operated on in parallel, with the same operation being performed on each corresponding pair of data elements.

978-1-4799-1406-7 © 2013 IEEE
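The packed operation of Figure 1 can be illustrated with SSE intrinsics. This is a minimal sketch, assuming an x86 machine with SSE2; the function name is ours:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Apply one packed operation (here: addition) to four float pairs at
 * once, mirroring the "OP" applied to each (Xi, Yi) pair in Figure 1. */
static void simd_add4(const float x[4], const float y[4], float out[4])
{
    __m128 vx = _mm_loadu_ps(x);     /* load X4..X1 into one register */
    __m128 vy = _mm_loadu_ps(y);     /* load Y4..Y1 */
    __m128 vr = _mm_add_ps(vx, vy);  /* four additions in one instruction */
    _mm_storeu_ps(out, vr);
}
```

The same pattern carries over to the wider AVX registers, where eight floats are processed per instruction.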

It is worth noting that Sora uses two cores to complete the whole processing, but each processing unit runs on only one core. To satisfy the requirements of IEEE 802.11n, we must use several cores to run the Viterbi decoder in parallel; in our implementation, we use the multi-core API provided by Windows.

Figure 1. Typical SIMD operation

Streaming SIMD Extensions (SSE) is an SIMD instruction set extension for the x86 architecture, designed by Intel and introduced in 1999 in the Pentium III series processors as a reply to AMD's 3DNow!. Compared with the platform used by Sora, the Sandy Bridge architecture supports the SSE4.2 and AVX instruction sets, and AVX widens the SIMD register file from 128 bits to 256 bits. Therefore, the instruction efficiency and the optimization opportunities are better than Sora's.

B. Cache

The cache structure of Sandy Bridge is similar to Nehalem's, and the cache structure of Nehalem in turn changed considerably compared with Intel Core 2. The Core 2 cache has two levels, with L2 shared by two cores to reduce coherency traffic. The Nehalem cache has three levels: L1 and L2 are relatively small and private, while L3 is very large and shared by all cores. A lookup table (LUT) is an optimization that trades space complexity for time complexity. The basis of the method is the one-to-one correspondence between input value and result: suppose a module takes a symbol stream as input and outputs the calculation result for each symbol. No matter how complicated the calculation formula is, we can tabulate the one-to-one correspondence between input data and output result. As is well known, a cache access is much faster than a memory access; to lower the access latency, we size the LUTs so that they fit in the core's cache. The Nehalem cache structure can also support and optimize unaligned accesses.
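As a toy illustration of the LUT idea, consider a hypothetical module (our own example, not one from the prototype) whose per-symbol result is the parity of a byte: the 256-entry table replaces the bit loop at run time and fits comfortably in L1 cache.

```c
#include <stdint.h>

static uint8_t parity_lut[256];  /* 256 one-byte entries: well inside L1 */

/* Build the table once; afterwards parity is a single indexed load. */
static void build_parity_lut(void)
{
    for (int v = 0; v < 256; v++) {
        uint8_t p = 0;
        for (int b = 0; b < 8; b++)       /* XOR of all bits of v */
            p ^= (uint8_t)((v >> b) & 1);
        parity_lut[v] = p;
    }
}
```

At run time, `parity_lut[sym]` replaces eight shift-and-XOR steps with one cached load, which is exactly the space-for-time trade described above.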
Moreover, our architecture has a large cache memory with three cache levels.

C. Multi-core and multi-threading

Given the limited processing capability of a single core, multi-core technology is more and more widely used. Compared with the Core 2 architecture, simultaneous multi-threading (SMT) is available in Nehalem and Sandy Bridge. The SMT is 2-way, which means each core can handle two threads simultaneously; with multi-threaded tasks, the latency of a single thread can be hidden. SMT improves performance all the more effectively with the larger cache and larger memory of Sandy Bridge.

III. IMPLEMENTATION

We have implemented the convolutional coding and decoding structures of IEEE 802.11n on a GPP platform. This section describes how to optimize the convolutional encoder and decoder on the GPP platform so that they satisfy the real-time requirements. The convolutional encoder uses SIMD instructions and lookup tables to accelerate data processing; the Viterbi decoder mainly uses SIMD instructions and multiple cores to accelerate decoding.

A. Convolutional encoder design

A convolutional code is a well-known forward error correction (FEC) code, denoted by (n, k, L), where k is the number of input information bits, n the number of output bits, and L the constraint length; the code rate is R = k/n. The k input bits are encoded to n output bits, and after encoding the n output bits depend not only on the k current information bits but also on the previous L - 1 information bits. In IEEE Std 802.11-2012 [3], the convolutional encoder is defined by the generator polynomials g0 = 133 (octal) and g1 = 171 (octal), giving a code of rate R = 1/2. The encoder is shown in Figure 2.

Figure 2. Convolutional encoder in IEEE Std 802.11-2012 [3]

The generator polynomial corresponding to output A is

    g0(D) = 1 + D^2 + D^3 + D^5 + D^6,

and the generator polynomial corresponding to output B is

    g1(D) = 1 + D + D^2 + D^3 + D^6.

The convolutional encoder uses the SIMD instruction set and lookup tables to accelerate the signal processing.
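The two generator polynomials can be checked with a minimal bit-serial encoder. This is a sketch under our own state convention (newest input bit shifted into bit 5 of the six-bit state); `__builtin_parity` is a GCC/Clang builtin:

```c
#include <stdint.h>

#define G0 0133  /* octal: 1 + D^2 + D^3 + D^5 + D^6 */
#define G1 0171  /* octal: 1 + D + D^2 + D^3 + D^6   */

/* Encode one input bit.  *state holds the six most recent input bits
 * (newest in bit 5).  Returns output bits A (bit 1) and B (bit 0). */
static unsigned conv_encode_bit(unsigned bit, unsigned *state)
{
    unsigned window = (bit << 6) | *state;       /* x[n] .. x[n-6] */
    unsigned a = __builtin_parity(window & G0);  /* output A */
    unsigned b = __builtin_parity(window & G1);  /* output B */
    *state = window >> 1;                        /* shift in the new bit */
    return (a << 1) | b;
}
```

Feeding a single 1 followed by zeros reproduces the tap patterns of g0 and g1 on outputs A and B, which is a quick sanity check of the octal constants.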
Given the input bits and the state of the registers, we can accurately calculate the output bits. This calculation can be avoided at run time by using a LUT to accelerate the signal processing. Figure 3 shows the encoder LUT data structure. With 64 states, one byte can be encoded to 16 output bits in one processing step. The input byte ranges from 0x00 to 0xff, and each LUT entry must store the 16 output bits plus the state of the registers after the processing step. Therefore, the LUT holds a total of 2^8 * 64 * (16 + 1) = 278,528 numbers.
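The table just described can be generated directly from the encoder definition. A minimal sketch, in which the MSB-first bit order within the byte and the split into two arrays (16 output bits and next state per entry) are our assumptions; `__builtin_parity` is a GCC/Clang builtin:

```c
#include <stdint.h>

#define G0 0133  /* octal: 1 + D^2 + D^3 + D^5 + D^6 */
#define G1 0171  /* octal: 1 + D + D^2 + D^3 + D^6   */

/* 16 output bits and next 6-bit state for every (byte, state) pair. */
static uint16_t lut_out[256][64];
static uint8_t  lut_next[256][64];

static void build_encoder_lut(void)
{
    for (int byte = 0; byte < 256; byte++) {
        for (int s0 = 0; s0 < 64; s0++) {
            unsigned state = (unsigned)s0;
            uint16_t out = 0;
            for (int i = 7; i >= 0; i--) {        /* MSB-first bit order */
                unsigned bit = (byte >> i) & 1;
                unsigned w = (bit << 6) | state;  /* x[n] .. x[n-6] */
                unsigned a = __builtin_parity(w & G0);
                unsigned b = __builtin_parity(w & G1);
                out = (uint16_t)((out << 2) | (a << 1) | b);
                state = w >> 1;
            }
            lut_out[byte][s0]  = out;
            lut_next[byte][s0] = (uint8_t)state;
        }
    }
}
```

Encoding then reduces to one table lookup per input byte, carrying `lut_next` forward as the new state; the two arrays together occupy 48 KB, small enough for a core's cache as discussed in Section II-B.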

Figure 3. Convolutional encoder LUT data structure (entries indexed by input byte 0x00-0xFF and register state S0-S63, each entry holding 16 + 1 values)

B. Viterbi decoder design

The Viterbi decoder can be divided into three functional units, i.e., a branch metric unit, an add-compare-select (ACS) unit, and a trace-back unit, as shown in Figure 4. In the Viterbi decoder, all data are represented with 8 bits, so one 128-bit SIMD instruction can handle 16 data operations simultaneously.

Figure 4. The structure of the Viterbi decoder

1) Branch metric unit: This unit calculates the branch metrics of all states at each state-transition time; we use the received data to calculate each branch metric. Using the SIMD instruction set, we can perform 16 branch metric calculations at a time. For 8-bit fixed-point data, normalization of the state metrics is required to avoid overflow.

2) Add-compare-select unit: This unit is the most important unit of the Viterbi decoder. First, the state metric of each state at the previous moment is added to the branch metric of the path that reaches the current state, generating new state metrics. Second, the two candidate path metrics are compared and the path with the minimum state metric is selected as the survivor path. Finally, the state metric of the survivor path is saved for the next ACS operation. Each state has two incoming branch paths, so one ACS operation consists of two additions and one comparison, as shown in Figure 6, and 16 ACS operations can be processed in parallel using SIMD instructions.

Figure 6. ACS processing using SIMD instructions

3) Trace-back unit: The key of the trace-back algorithm is to find the survivor decoding path, which is stored as a linked list. Each node in the linked list represents one state of the state-transition diagram. Once the state at a certain moment is known, we can walk back along this linked list and find the previous states of the state-transition diagram.
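The ACS recursion and the trace-back just described can be sketched in scalar form. This is our own sketch: the paper processes 16 states per SIMD instruction and stores the survivor path as a linked list, whereas here one state is handled at a time and per-step decision bits stand in for the list nodes; the state convention (newest input bit in bit 5) is also our assumption.

```c
#include <stdint.h>

#define NSTATES 64

/* One scalar add-compare-select step: extend the two candidate paths
 * into a state, keep the survivor (minimum metric), and return the
 * decision bit that the trace-back unit stores. */
static uint8_t acs(uint16_t m0, uint16_t bm0,
                   uint16_t m1, uint16_t bm1, uint16_t *survivor)
{
    uint16_t c0 = (uint16_t)(m0 + bm0);  /* add */
    uint16_t c1 = (uint16_t)(m1 + bm1);
    uint8_t  d  = (uint8_t)(c1 < c0);    /* compare */
    *survivor   = d ? c1 : c0;           /* select */
    return d;
}

/* Walk the stored decisions backwards.  With the newest input bit kept
 * in bit 5 of the 6-bit state, the decoded bit at step t is bit 5 of
 * the state, and the predecessor state is rebuilt from the lower five
 * state bits plus the stored decision bit. */
static void trace_back(const uint8_t (*decisions)[NSTATES],
                       int nsteps, unsigned end_state, uint8_t *bits)
{
    unsigned s = end_state;
    for (int t = nsteps - 1; t >= 0; t--) {
        bits[t] = (uint8_t)(s >> 5);              /* decoded input bit */
        s = ((s & 0x1F) << 1) | decisions[t][s];  /* predecessor state */
    }
}
```

In the SIMD version, the `acs` step maps naturally onto saturating byte adds and packed minimum instructions, which is how 16 such butterflies execute per instruction.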
In this way, the decoded bit sequence is recovered.

Figure 5. (2,1,7) convolutional code trellis diagram

Figure 5 shows a quarter of the convolutional code trellis diagram. On the left are the states before the transition; on the right, the states after it. The dashed lines illustrate the state transitions when the input bit is 0, and the solid lines those when the input bit is 1. The binary numbers show the output bits of the convolutional encoder for the corresponding state transition.

IV. SIMULATIONS

This section analyzes the simulation results for performance and throughput. The test environment of the GPP platform is shown in Table 1. The CPU is an Intel Core i7-2600K, which has 4 cores and 8 threads. The compiler is the Intel C++ Compiler (also known as ICC or ICL), a group of C and C++ compilers from Intel Corporation available for Apple Mac OS X, Linux, and Microsoft Windows, which supports the SSE instruction sets.

TABLE 1 TEST CONDITIONS
CPU:                         Intel Core i7-2600K @ 3.4 GHz (32 nm)
Architecture:                Sandy Bridge
L2 cache:                    4 x 256 KB
L3 cache:                    8 MB
SSE instruction set version: SSE4.2, AVX
Operating system:            Windows 7
Software:                    Microsoft Visual Studio 2010
Compiler:                    Intel ICC v11.1

A. Performance

In IEEE Std 802.11-2012 [3], the encoder output bits are punctured to obtain four rates (1/2, 2/3, 3/4, 5/6) on request. We test all four cases with block length 1040 (the number of data bits per OFDM symbol) over an AWGN channel under BPSK modulation. The bit error rate (BER) performances are shown in Figure 7.

Figure 7. BER performances of the four rates (BER, 10^0 down to 10^-6, versus Eb/N0 in dB; curves for R = 1/2 floating point and for R = 1/2, 2/3, 3/4, 5/6 8-bit fixed point)

In Figure 7, the dashed black curve is the theoretical performance of the R = 1/2 floating-point convolutional code [4]. From the simulation results we conclude that the fixed-point performances are very close to the floating-point performances, and that the BER performance improves as the code rate decreases.

B. Real-time

1) Throughput: The maximum throughput of the 20 MHz bandwidth mode in IEEE 802.11n is 260 Mbps, and the number of data bits per OFDM symbol is 1040. Under this condition, the throughput results of our implementation are shown in Table 2.

TABLE 2 THROUGHPUT RESULTS
Algorithm      Configuration   Throughput (Mbps)   Delay (us)
Conv. Encoder  one core        4033                0.25
Viterbi        one core        75.2                13.8
Viterbi        four cores      270.7               3.4

As shown in Table 2, our implementation satisfies the maximum throughput requirement of the 20 MHz bandwidth when using four cores. Because of thread delays and multi-core data transfers, the throughput gained by using multiple cores is not linear in the number of cores, but we can still increase the throughput by increasing the number of cores; for example, we can use eight cores to satisfy the throughput requirement of the 40 MHz bandwidth.

In our implementation, the Viterbi decoder is the most computationally intensive component. Figure 8 shows the core utilization needed to support the real-time Viterbi decoding requirements of the 32 MCSs at 20 MHz bandwidth: MCS 0-7 have one spatial stream, MCS 8-15 two, MCS 16-23 three, and MCS 24-31 four. We test the processing capability of the CPU when using 1, 2, 3, or 4 cores and, from the required throughput of each MCS, calculate its core utilization.

Figure 8. Core utilization of the 32 MCSs (core utilization, 0-4, versus MCS index at 20 MHz bandwidth)

At the receiver, a higher throughput requires higher core utilization due to the increased computational complexity of the Viterbi decoder. One core of a contemporary multi-core CPU can comfortably support MCS 0-7. Due to multi-core call delays, the core utilization is not exactly linear in the number of spatial streams: as the number of cores in use grows, the processing capability of each individual core decreases.

2) Compared with Sora: To facilitate the comparison, we used the same test conditions as Sora. It is worth noting that the CPU frequency of the Sora platform is 2.66 GHz while that of our implementation is 3.4 GHz, so our implementation has shorter delays for the same amount of computation.

TABLE 3 COMPARISON BETWEEN THE SORA IMPLEMENTATION AND OURS
Computation required (M cycles/sec), both on one core:
Algorithm      Configuration   Sora impl.   Our impl.
Conv. Encoder  24 Mbps, 1/2    1.15         20.23
Conv. Encoder  48 Mbps, 2/3    37.21        39.33
Conv. Encoder  54 Mbps, 3/4    56.23        45.52
Viterbi        24 Mbps, 1/2    140.93       104.33
Viterbi        48 Mbps, 2/3    2422.04      2210.0
Viterbi        54 Mbps, 3/4    2573.5       2439.76

Table 3 lists the computation required by the Sora implementation and by ours. Our Viterbi decoder needs fewer computing resources than Sora's. The convolutional encoder does not perform as well under low-throughput conditions, but under high-throughput conditions it outperforms Sora's. In IEEE 802.11n we mainly process and optimize the high-throughput data streams, and the

encoder needs very little delay compared with the decoder; therefore, the processing delay of the encoder is acceptable.

Compared with the Sora GPP platform, our platform has a higher CPU frequency, a newer architecture, and newer SIMD instructions. Along with the evolution of GPP platforms, our implementation will be able to handle ever larger amounts of data processing.

V. CONCLUSION

This paper describes how to implement the convolutional coding and decoding of IEEE 802.11n on a GPP platform. According to the simulation results, SIMD instructions and LUTs can greatly accelerate signal processing and satisfy the real-time requirements of IEEE 802.11n. GPP technology has many advantages compared with traditional FPGAs or DSPs. The rapid development of CPUs and the optimization of CPU architectures greatly enhance code execution efficiency with less program optimization effort, and the GPP approach has lower hardware cost and shorter code development and test cycles. Therefore, GPP technology has large development potential.

ACKNOWLEDGMENT

This work was supported by the National Basic Research Program of China (973 Program) (No. 2009CB320401), the National Natural Science Foundation of China (No. 61171099), and the National Science and Technology Major Project of China (No. 2012ZX03004005-002 and 2013ZX03003-004).

REFERENCES

[1] A. J. Viterbi, "Error bounds for convolutional codes and an asymptotically optimum decoding algorithm," IEEE Trans. Inform. Theory, vol. IT-13, no. 2, pp. 260-269, 1967.
[2] K. Tan, J. Zhang, J. Fang, H. Liu, Y. Ye, S. Wang, Y. Zhang, H. Wu, W. Wang, and G. M. Voelker, "Sora: High performance software radio using general purpose multi-core processors," Microsoft Research Asia, Beijing, China; Tsinghua University, Beijing, China; Beijing Jiaotong University, Beijing, China; UCSD, La Jolla, USA.
[3] IEEE Std 802.11-2012, Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications.
[4] T. K. Moon, Error Correction Coding: Mathematical Methods and Algorithms. Wiley, May 2005.
[5] G. Feygin and P. Gulak, "Architectural tradeoffs for survivor sequence memory management in Viterbi decoders," IEEE Transactions on Communications, vol. 41, no. 3, pp. 425-429, Mar. 1993.
[6] A. J. Viterbi, "Convolutional codes and their performance in communication systems," IEEE Trans. Commun., vol. COM-19, pp. 751-772, Oct. 1971.
[7] P. K. Singh and S. Jayasimha, "A low-complexity, reduced-power Viterbi algorithm," in Proc. 12th International Conf. on VLSI Design, Goa, India, pp. 61-66, Jan. 1999.
[8] D. A. El-Dib and M. I. Elmasry, "Modified register-exchange Viterbi decoder for low-power wireless communications," IEEE Transactions on Circuits and Systems, vol. 51, no. 2, pp. 371-378, Feb. 2004.
[9] F. Chan and D. Haccoun, "Adaptive Viterbi decoding of convolutional codes over memoryless channels," IEEE Transactions on Communications, vol. 45, no. 11, pp. 1389-1400, Nov. 1997.
[10] B. Pandita and S. K. Roy, "Design and implementation of a Viterbi decoder using FPGAs," in Proc. IEEE International Conference on VLSI Design, pp. 611-614, Jan. 1999.