FPGA Implementation of 16-Point FFT Core Using NEDA

Abhishek Mankar, Ansuman Diptisankar Das and N Prasad

Abstract--NEDA is one of the techniques used to implement digital signal processing systems that require multiply and accumulate units. FFT is one of the most widely employed blocks in communication and signal processing systems. This paper proposes an FPGA implementation of a 16-point radix-4 complex FFT core using NEDA. The proposed design improves hardware utilization compared to traditional methods. The design has been implemented on a range of FPGAs to compare the performance. The proposed design has a power consumption of 728.89 mW on XC2VP1-6FF174 FPGA at 5 MHz. The maximum frequency achieved on FPGA is 114.27 MHz, at the cost of higher power. The maximum throughput observed is 1828.32 Mbit/s and the minimum slice delay product observed is 9.18. The design is also synthesized using Synopsys DC for both 65 nm and 18 nm technology libraries.

Index Terms--Fast Fourier Transform (FFT), FPGA, New Distributed Arithmetic (NEDA), radix-4, Synopsys DC.

I. INTRODUCTION

Today's electronic systems mostly run on batteries, which requires designs to be both hardware efficient and power efficient. Application areas such as digital signal processing and communications employ digital systems that carry out complex functionalities. Hardware-efficient and power-efficient architectures are needed for these systems to achieve maximum performance.

Fast Fourier Transform (FFT) is one of the most efficient ways to implement the Discrete Fourier Transform (DFT) because it uses fewer arithmetic units. The DFT is one of the primary tools for the frequency analysis of discrete-time signals; it represents a discrete-time sequence in the frequency domain using its spectrum samples. The analysis (forward) and synthesis (inverse) equations of an N-point FFT are given below.
$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \qquad x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, W_N^{-nk} \tag{1}$$

where $W_N = e^{-j2\pi/N}$. As evident from equation (1), the basis of the synthesis and analysis equations is the same, so one architecture can serve both analysis and synthesis. Due to the increased use of FFT in modern electronic systems, higher radix FFTs such as radix-4, radix-8, radix-2^k, split radix, etc. are designed for improved timing and reduced hardware. The basic difference between these methods lies in the structure of their butterfly units.

Abhishek Mankar, Ansuman DiptiSankar Das and N Prasad are with the Department of Electronics and Communication Engineering, National Institute of Technology, Rourkela-769008, India (e-mail: abhishek.mankar1@gmail.com, ansuman.das.engg@gmail.com, prasadn57@yahoo.com). 978-4673-563-5//13/$31. 2013 IEEE

Distributed Arithmetic (DA) has become an efficient tool to implement the multiply and accumulate (MAC) unit in many digital signal processing (DSP) systems [1]. It eliminates the multiplier that is otherwise part of a MAC unit. DA implements the MAC unit by pre-computing all possible products and storing them in a read only memory (ROM). The ROM can be eliminated if one set of the inputs has a fixed value. This is done by distributing the coefficients to the inputs of the unit. This approach is called NEw Distributed Arithmetic (NEDA) [2]. Thus, using NEDA, any MAC-like unit can be implemented using only adders and shifters.

Architecture designs in [3]-[6] use either the DA approach or the CORDIC approach to implement FFT, both of which require ROM as an essential unit in the design. The proposed approach is based on NEDA, which does not require any ROM and therefore reduces hardware. The coefficients are distributed optimally to further remove redundant hardware units. The organization of the rest of the paper is as follows.
Section II briefly overviews NEDA, Section III describes the proposed design, Section IV discusses the results and performance, and Section V concludes the work done in this paper.

II. BRIEF OVERVIEW OF NEW DISTRIBUTED ARITHMETIC

The NEw Distributed Arithmetic (NEDA) technique is used in many digital signal processing systems that require a MAC unit. Transforms such as FFT, DCT, etc. involve many multiplications, which in turn require a number of multipliers. Implementing such transforms using NEDA improves the performance of the system in terms of speed, power and area. The mathematical derivation of NEDA is as follows. The inner product of two sequences may be represented as

$$Y = \sum_{k=1}^{L} A_k X_k \tag{2}$$
where $A_k$ are constant coefficients and $X_k$ are varying inputs. The matrix representation of equation (2) may be given as

$$Y = \begin{bmatrix} A_1 & A_2 & \cdots & A_L \end{bmatrix} \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_L \end{bmatrix} \tag{3}$$

Considering both $A_k$ and $X_k$ in 2's complement format, the coefficients may be expressed in the form

$$A_k = -A_k^{M} 2^{M} + \sum_{i=N}^{M-1} A_k^{i} 2^{i} \tag{4}$$

where $A_k^{i} \in \{0, 1\}$, $i = N, N+1, \ldots, M$, $A_k^{M}$ is the sign bit and $A_k^{N}$ is the least significant bit. Substituting equation (4) in equation (3) results in the following matrix product, which is modelled according to the required design.

$$Y = \begin{bmatrix} 2^{N} & 2^{N+1} & \cdots & -2^{M} \end{bmatrix} \begin{bmatrix} A_1^{N} & A_2^{N} & \cdots & A_L^{N} \\ A_1^{N+1} & A_2^{N+1} & \cdots & A_L^{N+1} \\ \vdots & \vdots & & \vdots \\ A_1^{M} & A_2^{M} & \cdots & A_L^{M} \end{bmatrix} \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_L \end{bmatrix} \tag{5}$$

The matrix containing $A_k^{i}$ is a sparse matrix whose entries are either 0 or 1. The number of rows in this matrix defines the precision of the fixed coefficients. Equation (5) is rearranged as shown below.

$$Y = \begin{bmatrix} 2^{N} & 2^{N+1} & \cdots & -2^{M} \end{bmatrix} \begin{bmatrix} W_N \\ W_{N+1} \\ \vdots \\ W_M \end{bmatrix} \tag{6}$$

where

$$W_i = \sum_{k=1}^{L} A_k^{i} X_k, \qquad i = N, N+1, \ldots, M \tag{7}$$

Each row of the matrix $W$ is a sum of those inputs selected by the corresponding coefficient bits. An example that shows the NEDA operations is discussed below. Consider evaluating equation (8).

[Equation (8): inner product with cosine coefficients]

Equation (8) can be expressed in the form of equation (5) as shown in equation (9).

[Equation (9): bit-matrix form of equation (8)]

Equation (9) may be rewritten as

[Equation (10): weighted sum of the rows $W_i$]

Applying precise shifting, we rewrite equation (10) as

[Equation (11): shifted form of equation (10)]

Implementing equation (11) further reduces the number of adders compared to implementing equation (10). Multiplication by $2^{i}$ can be realized with shifters. In equation (11), the first row of the matrix shifts right by 1 bit, the second row by 2 bits, and so on. More precisely, the shifts carried out are arithmetic right shifts. The output can be realized as a column matrix if the partial products are needed. Thus NEDA-based architecture designs have a shorter critical path than traditional MAC units.

III. THE PROPOSED DESIGN

In this paper, we propose the implementation of a 16-point complex FFT using the radix-4 method. The multiplications required during the process have been implemented using NEDA. According to the radix-4 algorithm, eight radix-4 butterflies are required to implement a 16-point FFT.
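The NEDA decomposition of Section II, equations (5) to (7), can be sketched in software as follows. This is a minimal illustration, not the authors' VHDL: the function names `quantize` and `neda_inner_product`, the 12 fraction bits, and the sample inputs are assumptions chosen for the example. The coefficient pair (cos π/8, sin π/8) is used because a sum of the form a·cos θ + b·sin θ is the kind of inner product a twiddle multiplication needs. The sketch accumulates with left shifts on integers so the result is exact, whereas the text describes arithmetic right shifts in hardware.

```python
import math

def quantize(coeff, frac_bits):
    """Round a coefficient in [-1, 1) to a signed integer with frac_bits fraction bits."""
    q = round(coeff * (1 << frac_bits))
    return max(-(1 << frac_bits), min((1 << frac_bits) - 1, q))

def neda_inner_product(coeffs_q, inputs, frac_bits):
    """Compute sum(coeffs_q[k] * inputs[k]) using only additions and shifts.

    Bit 0 plays the role of the least significant bit (index N in the text)
    and bit frac_bits plays the role of the sign bit (index M in the text).
    """
    width = frac_bits + 1
    mask = (1 << width) - 1          # two's complement view of each coefficient
    acc = 0
    for i in range(width):
        # W_i: add up the inputs whose coefficient has bit i set (equation (7))
        w_i = 0
        for q, x in zip(coeffs_q, inputs):
            if ((q & mask) >> i) & 1:
                w_i += x
        # Weight row i by 2^i; the sign-bit row carries a negative weight
        if i == width - 1:
            acc -= w_i << i
        else:
            acc += w_i << i
    return acc

# Hypothetical example values (not taken from the paper)
FRAC_BITS = 12
theta = math.pi / 8
coeffs_q = [quantize(math.cos(theta), FRAC_BITS),
            quantize(math.sin(theta), FRAC_BITS)]
a, b = 100, -37
scaled = neda_inner_product(coeffs_q, [a, b], FRAC_BITS)
approx = scaled / (1 << FRAC_BITS)   # fixed-point estimate of a*cos(theta) + b*sin(theta)
```

Because every coefficient bit is known in advance, each row sum W_i reduces to a fixed set of additions, which is why no multiplier or ROM is needed.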
Four radix-4 butterflies are used in the first stage and the other four in the second (final) stage. The input is taken in normal order and the output is produced in bit-reversal order. The output of each radix-4 butterfly is multiplied by the respective twiddle factors [7].

In the block diagram, the first stage consists of four radix-4 butterflies. The inputs to the butterflies are x(n), x(n+4), x(n+8), x(n+12), where n is 0 for the first butterfly, 1 for the second butterfly, 2 for the third butterfly, and 3 for the last butterfly of the first stage. The twiddle factors are given by $W_{16}^{0}$, $W_{16}^{q}$, $W_{16}^{2q}$, $W_{16}^{3q}$, where q is 0 for the first butterfly, 1 for the second butterfly, 2 for the third butterfly, and 3 for the last butterfly of the first stage. The outputs of the first stage are multiplied by the respective twiddle factors and are given as inputs to the second stage. As proposed in our design, the complex twiddle multiplications required at the first-stage output have been implemented using NEDA blocks. Overall, 9
NEDA blocks are required at the output of the first stage of the 16-point FFT processor.

Fig. 1. Block diagram of the proposed architecture

In the second stage, 4 more radix-4 butterfly blocks are used. The first radix-4 butterfly in the second stage takes the first output of each of the 4 radix-4 butterfly blocks of the first stage. The second radix-4 butterfly in the second stage takes the second output of each of the 4 radix-4 butterfly blocks, after the NEDA block where one is required. The same applies to the remaining radix-4 butterfly blocks of the second stage. No NEDA block is needed after the second stage because its outputs are multiplied by a twiddle factor of 1. The final output comes in bit-reversal order. The advantage of the radix-4 algorithm is that it retains the simplicity of the radix-2 algorithm and gives the output with lesser complexity [8].

The NEDA block shown in the block diagram performs the complex multiplication of a first-stage output by the respective twiddle factor. The twiddle factor values used here are as follows.

$$\begin{aligned}
W_{16}^{1} &= \cos\frac{\pi}{8} - j\sin\frac{\pi}{8} = 0.9238 - j0.3826 \\
W_{16}^{2} &= \cos\frac{\pi}{4} - j\sin\frac{\pi}{4} = 0.7071 - j0.7071 \\
W_{16}^{3} &= \cos\frac{3\pi}{8} - j\sin\frac{3\pi}{8} = 0.3826 - j0.9238 \\
W_{16}^{6} &= \cos\frac{3\pi}{4} - j\sin\frac{3\pi}{4} = -0.7071 - j0.7071 \\
W_{16}^{9} &= \cos\frac{9\pi}{8} - j\sin\frac{9\pi}{8} = -0.9238 + j0.3826
\end{aligned} \tag{12}$$

The product of a complex number $a + jb$ and a twiddle factor is given by $(a + jb)(\cos\theta - j\sin\theta) = (a\cos\theta + b\sin\theta) + j(b\cos\theta - a\sin\theta)$.

Fig. 2. Radix-4 butterfly structure (inputs x[n], x[n+4], x[n+8], x[n+12]; outputs X[4r], X[4r+1], X[4r+2], X[4r+3])

For a constant $\theta$, cosine and sine values
are constant. So, referring to equations (8) to (11), NEDA can be used to find the four real products $a\cos\theta$, $a\sin\theta$, $b\cos\theta$, and $b\sin\theta$, where $a$ and $b$ are the real and imaginary parts of a first-stage output and $\theta$ is the twiddle angle.

IV. RESULTS AND DISCUSSION

The proposed architecture has been implemented using Xilinx ISE v1.1. The FPGAs selected to map the design are XC2VP1-6FF174, XC2V8- and XC5VLX33. The results are checked for integer inputs. The design is coded in VHDL, with all inputs and outputs declared in the signed two's complement number system. The data path in the proposed design is 16 bits wide, which includes the width of the inputs and outputs.

TABLE I
COMPARISON OF OUTPUTS OBTAINED USING MATLAB AND XILINX ISE

MATLAB outputs      Xilinx ISim outputs
1783                1783
71.6-j783.9         85-j781
-286.7+j149.3       -287+j152
-96.3-j447          3-j444
-943+j228           -943+j228
477.8+j687.8        483+j693
298.7-j64.7         299-j62
-337.9+j375         -336+j381
719                 719
-337.9-j375         -335-j379
298.7+j64.7         299+j62
477.8-j687.8        485-j682
-943-j228           -943-j228
-96.3+j447          -91+j443
-286.7-j149.3       -287-j152
71.6+j783.9         74+j769

The number of real adders and multipliers required to calculate an N-point FFT using the radix-4 algorithm are given by $3N\log_2 N$ and $(3N/2)\log_2 N$ respectively. A comparison between the number of arithmetic units required in the conventional and the proposed design is given in Table II.

TABLE II
COMPARISON OF NUMBER OF ARITHMETIC UNITS REQUIRED TO PERFORM 16-POINT RADIX-4 COMPLEX FFT

Arithmetic unit          Conventional design    Proposed design
Real adder/subtractor    192                    269
Real multiplier          96                     0

Table III lists the performance metrics of the proposed design on different FPGAs, which include throughput and slice delay product.

TABLE III
PERFORMANCE METRICS OF PROPOSED DESIGN ON DIFFERENT FPGAs

FPGA             Frequency (MHz)    Throughput (Mbit/s)    Slice delay product
XC2VP1-6FF174    61.831             989.296                38.64
XC2V8-           54.478             871.648                44.49
XC5VLX33         119.27             1828.32                9.18

Table IV shows the device utilization summary of the proposed design on different FPGAs. From Table IV, it is clear that the number of slices occupied in XC5VLX33 is less than in the other two FPGAs.
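As a software counterpart to the output comparison in Table I, the two-stage radix-4 flow of Section III can be sketched and checked against a direct DFT. This is a minimal sketch under stated assumptions: the function names are hypothetical, floating-point complex arithmetic stands in for the 16-bit fixed-point data path, and the twiddle products use ordinary complex multiplication rather than NEDA blocks. The output index mapping in the final comment follows from the decomposition used in the sketch.

```python
import cmath

def radix4_butterfly(a, b, c, d):
    """4-point DFT of (a, b, c, d); needs only additions and multiplications by +-1, +-j."""
    j = 1j
    return (a + b + c + d,
            a - j * b - c + j * d,
            a - b + c - d,
            a + j * b - c - j * d)

def twiddle(exponent):
    """W16^exponent = exp(-j*2*pi*exponent/16)."""
    return cmath.exp(-2j * cmath.pi * exponent / 16)

def nontrivial_twiddle_exponents():
    """Exponents q*r of the first-stage outputs that need a real multiplication."""
    return [q * r for q in range(1, 4) for r in range(1, 4)]

def fft16_radix4(x):
    """16-point FFT built from eight radix-4 butterflies in two stages."""
    if len(x) != 16:
        raise ValueError("expected 16 input samples")
    # Stage 1: butterfly q takes x[q], x[q+4], x[q+8], x[q+12];
    # its output r is then scaled by W16^(q*r)
    stage1 = []
    for q in range(4):
        outs = radix4_butterfly(x[q], x[q + 4], x[q + 8], x[q + 12])
        stage1.append([outs[r] * twiddle(q * r) for r in range(4)])
    # Stage 2: butterfly r takes output r of every first-stage butterfly
    y = []
    for r in range(4):
        y.extend(radix4_butterfly(stage1[0][r], stage1[1][r],
                                  stage1[2][r], stage1[3][r]))
    return y  # reordered output: y[4*r + s] holds X[r + 4*s]

def dft16(x):
    """Direct 16-point DFT used as the reference."""
    return [sum(x[n] * twiddle(n * k) for n in range(16)) for k in range(16)]
```

Calling `nontrivial_twiddle_exponents()` lists nine exponents, matching the nine NEDA blocks at the first-stage output; the remaining seven first-stage outputs are scaled by $W_{16}^{0} = 1$.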
TABLE IV
DEVICE UTILIZATION SUMMARY OF PROPOSED DESIGN ON DIFFERENT FPGAs

                              Used                                     Utilization
Logic utilization             XC2VP1-6FF174   XC2V8-    XC5VLX33       XC2VP1-6FF174   XC2V8-    XC5VLX33
Number of Slices              2389            2424      196            5%              5%        2%
Number of Slice Flip Flops    1913            194       1894           2%              2%        1%
Number of 4-input LUTs        3972            454       3469           4%              4%        1%

TABLE V
POWER ANALYSIS OF PROPOSED DESIGN ON DIFFERENT FPGAs

             Total Quiescent Power (W)               Total Dynamic Power (W)                 Total Power (W)
Frequency    XC2VP1-6FF174   XC2V8-    XC5VLX33      XC2VP1-6FF174   XC2V8-    XC5VLX33      XC2VP1-6FF174   XC2V8-    XC5VLX33
(MHz)
12.5         .2438           .13785    3.32919       .3797           .47152    .7869         .51234          .6937     3.4789
2            .2438           .13785    3.55222       .37262          .57793    .57921        .57699          .71578    4.13143
3            .2438           .13785    3.6824        .4755           .69395    .83959        .61193          .8318     4.51983
4            .2438           .13785    3.81776       .45378          .79855    1.9989        .65815          .9364     4.91766
5            .2438           .13785    3.9667        .52452          .93893    1.3613        .72889          1.7678    5.32617
Table V compares the power consumption of the proposed design on different FPGAs. The software tool used to analyze power is Xilinx XPower Analyzer.

The proposed design has also been implemented in Synopsys for both 65 nm and 18 nm technologies. Table VI shows the power report and area report of the proposed design in ASIC using Synopsys for the TCBN65GPLUSTC and FSAA_C_GENERIC_CORE_FF1P98VM4C libraries. The flow used is Synopsys DC synthesis.

TABLE VI
POWER AND AREA REPORT OF PROPOSED DESIGN FOR SYNOPSYS DC IMPLEMENTATIONS

                            Technology library
                            TCBN65GPLUSTC    FSAA_C_GENERIC_CORE_FF1P98VM4C
Total Dynamic Power (mW)    22.733           32.5547
Total Cell Area             2139.48469       161733.4625

V. CONCLUSIONS

This paper reported the architecture of a 16-point radix-4 complex FFT core using NEDA, which is a ROM-less and multiplier-less method. The proposed design is efficient in terms of hardware compared to other traditional methods. The proposed design has been implemented on different FPGAs to compare the performance metrics. The number of real adders/subtractors required to implement the proposed design is 269. The maximum frequency obtained is 119.27 MHz on XC5VLX33 FPGA. The throughput and slice delay product obtained on XC5VLX33 FPGA are 1828.32 Mbit/s and 9.18. In terms of power consumption, XC2V8 FPGA shows better results than the others. The proposed design occupied 2139.48469 units of cell area for 65 nm technology and 161733.4625 units of cell area for 18 nm technology, when synthesized using Synopsys DC. Power reports from Synopsys show that the consumption is 22.733 mW for 65 nm technology and 32.5547 mW for 18 nm technology.

VI. REFERENCES

[1] Stanley A. White, "Applications of Distributed Arithmetic to Digital Signal Processing: A Tutorial Review," IEEE ASSP Magazine, vol. 6, no. 3, pp. 4-19, Jul. 1989.
[2] Wendi Pan, Ahmed Shams, and Magdy A. Bayoumi, "NEDA: A NEw Distributed Arithmetic Architecture and its Application to One Dimensional Discrete Cosine Transform," Proc.
IEEE Workshop on Signal Processing Syst., pp. 159-168, Oct. 1999.
[3] M. Rawski, M. Vojtynski, T. Wojciechowski, and P. Majkowski, "Distributed Arithmetic Based Implementation of Fourier Transform Targeted at FPGA Architectures," Proc. Intl. Conf. Mixed Design, pp. 152-156, Jun. 2007.
[4] S. Chandrasekaran and A. Amira, "Novel Sparse OBC based Distributed Arithmetic Architecture for Matrix Transforms," Proc. IEEE Intl. Sym. Circuits and Syst., pp. 327 321, May 2007.
[5] Richard M. Jiang, "An Area-Efficient FFT Architecture for OFDM Digital Video Broadcasting," IEEE Trans. Consumer Elect., vol. 53, no. 4, pp. 1322-1326, Nov. 2007.
[6] Ren-Xi Gong, Jiong-Quan Wei, Dan Sun, Ling-Ling Xie, Peng-Fei Shu and Xiao-Bi Meng, "FPGA Implementation of a CORDIC-based Radix-4 FFT Processor for Real-Time Harmonic Analyzer," Intl. Conf. on Natural Computation, pp. 1832-1835, Jul. 2011.
[7] Alban Ferizi, Bernhard Hoeher, Melanie Jung, Georg Fischer, and Alexander Koelpin, "Design and Implementation of a Fixed-Point Radix-4 FFT Optimized for Local Positioning in Wireless Sensor Networks," Intl. Multi-Conf. Syst. Signals and Devices, pp. 1-4, Mar. 2012.
[8] Li Wenqi, Wang Xuan, and Sun Xiangran, "Design of Fixed-Point High-Performance FFT Processor," Intl. Conf. Edu. Tech. and Comput., vol. 5, pp. 139-143, Jun. 2010.