FPGA Implementation of 16-Point Radix-4 Complex FFT Core Using NEDA


Abhishek Mankar, Ansuman Diptisankar Das and N Prasad

Abstract--New Distributed Arithmetic (NEDA) is a technique for implementing digital signal processing systems that require multiply-and-accumulate units. The FFT is one of the most widely used blocks in communication and signal processing systems. This paper proposes an FPGA implementation of a 16-point radix-4 complex FFT core using NEDA. The proposed design improves hardware utilization compared to traditional methods. The design has been implemented on a range of FPGAs to compare performance. It consumes 728.89 mW on the XC2VP1-6FF174 FPGA at 5 MHz. The maximum frequency achieved on FPGA is 114.27 MHz, at the cost of higher power; the maximum throughput observed is 1828.32 Mbit/s and the minimum slice delay product observed is 9.18. The design is also synthesized using Synopsys DC for both 65 nm and 18 nm technology libraries.

Index Terms--Fast Fourier Transform (FFT), FPGA, New Distributed Arithmetic (NEDA), radix-4, Synopsys DC.

I. INTRODUCTION

Today's electronic systems mostly run on batteries, which requires designs to be both hardware efficient and power efficient. Application areas such as digital signal processing and communications employ digital systems that carry out complex functions. Hardware-efficient and power-efficient architectures for these systems are needed to achieve maximum performance. The Fast Fourier Transform (FFT) is one of the most efficient ways to compute the Discrete Fourier Transform (DFT) because it uses fewer arithmetic units. The DFT is a primary tool for the frequency analysis of discrete-time signals and for representing a discrete-time sequence in the frequency domain by its spectrum samples. The analysis (forward) and synthesis (inverse) equations of an N-point DFT are given below.
Abhishek Mankar, Ansuman DiptiSankar Das and N Prasad are with the Department of Electronics and Communication Engineering, National Institute of Technology, Rourkela-7698, India (e-mail: abhishek.mankar1@gmail.com, ansuman.das.engg@gmail.com, prasadn57@yahoo.com).
978-4673-563-5//13/$31. 213 IEEE

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \qquad x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, W_N^{-nk} \tag{1}$$

where $W_N = e^{-j 2\pi / N}$.

As evident from equation (1), the analysis and synthesis equations share the same basis, so a single architecture can serve both analysis and synthesis. Because the FFT is so widely used in modern electronic systems, higher-radix FFTs such as radix-4, radix-8, radix-2^k and split-radix are designed for improved timing and reduced hardware. These methods differ mainly in the structure of their butterfly units.

Distributed Arithmetic (DA) has become an efficient tool for implementing the multiply-and-accumulate (MAC) unit in many digital signal processing (DSP) systems [1]. It eliminates the multiplier that is otherwise part of a MAC unit. DA implements the MAC unit by pre-computing all possible products and storing them in a read-only memory (ROM). The ROM can be eliminated if one set of the inputs has a fixed value; this is done by distributing the coefficients to the inputs of the unit. This approach is called NEw Distributed Arithmetic (NEDA) [2]. Thus, using NEDA, any MAC-like unit can be implemented using only adders and shifters. The architectures in [3]-[6] use either the DA approach or a CORDIC unit to implement the FFT, and they require ROM as an essential unit in the design. The proposed approach is based on NEDA, which does not require any ROM and therefore reduces hardware. The coefficients are distributed optimally to further eliminate redundant hardware units.

The rest of the paper is organized as follows.
Section II briefly overviews NEDA, Section III describes the proposed design, Section IV discusses the results and performance, and Section V concludes the paper.

II. BRIEF OVERVIEW OF NEW DISTRIBUTED ARITHMETIC

The NEw Distributed Arithmetic (NEDA) technique is used in many digital signal processing systems that require a MAC unit. Transforms such as the FFT and DCT involve many multiplications, which in turn require a number of multipliers. Implementing such transforms using NEDA improves the speed, power and area of the system. The mathematical derivation of NEDA is as follows.

The inner product of two sequences may be represented as

$$Y = \sum_{k=1}^{L} W_k X_k \tag{2}$$

where $W_k$ are constant coefficients and $X_k$ are varying inputs. The matrix representation of equation (2) is

$$Y = \begin{bmatrix} W_1 & W_2 & \cdots & W_L \end{bmatrix} \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_L \end{bmatrix} \tag{3}$$

Considering both $W_k$ and $X_k$ in 2's complement format, each coefficient may be expressed in the form

$$W_k = -W_k^{M}\, 2^{M} + \sum_{i=N}^{M-1} W_k^{i}\, 2^{i} \tag{4}$$

where $W_k^{i} \in \{0, 1\}$, $i = N, N+1, \ldots, M$, $W_k^{M}$ is the sign bit and $W_k^{N}$ is the least significant bit. Substituting equation (4) in equation (3) results in the following matrix product, which is modelled according to the required design.

$$Y = \begin{bmatrix} 2^{N} & \cdots & 2^{M-1} & -2^{M} \end{bmatrix} \begin{bmatrix} W_1^{N} & W_2^{N} & \cdots & W_L^{N} \\ \vdots & \vdots & & \vdots \\ W_1^{M-1} & W_2^{M-1} & \cdots & W_L^{M-1} \\ W_1^{M} & W_2^{M} & \cdots & W_L^{M} \end{bmatrix} \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_L \end{bmatrix} \tag{5}$$

The matrix containing $W_k^{i}$ is a sparse binary matrix: its entries are either 0 or 1. The number of rows in this matrix defines the precision of the fixed coefficients. Equation (5) is rearranged as shown below.

$$Y = \begin{bmatrix} 2^{N} & \cdots & 2^{M-1} & -2^{M} \end{bmatrix} \begin{bmatrix} Y_N \\ \vdots \\ Y_{M-1} \\ Y_M \end{bmatrix} \tag{6}$$

where

$$Y_i = \sum_{k=1}^{L} W_k^{i} X_k, \qquad i = N, N+1, \ldots, M \tag{7}$$

Each row of the matrix of $Y_i$ values consists of a sum of the inputs selected by the coefficient bits. An example that shows the NEDA operations is discussed below. Consider evaluating equation (8).

[Equation (8) is not legible in the source; it involves cosine coefficients.]

Equation (8) can be expressed in the form of equation (5) as shown in equation (9).

[Equation (9) is not legible in the source.]

Equation (9) may be rewritten as

[Equation (10) is not legible in the source.]

Applying precise shifting, we rewrite equation (10) as

[Equation (11) is not legible in the source.]

Thus implementing equation (11) further reduces the number of adders compared to implementing equation (10). Multiplication by powers of 2 can be realized with shifters. In equation (11), the first row of the matrix shifts right by 1 bit, the second row by 2 bits, and so on. More precisely, the shifts carried out are arithmetic right shifts. The output can be realized as a column matrix if the partial products are needed. Thus NEDA-based architectures have a shorter critical path than traditional MAC units.

III. THE PROPOSED DESIGN

In this paper, we propose the implementation of a 16-point complex FFT using the radix-4 method. The multiplications required during the process are implemented using NEDA. According to the radix-4 algorithm, eight radix-4 butterflies are required to implement a 16-point FFT.
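The shift-and-add idea from Section II can be illustrated with a minimal Python sketch. It is not the authors' VHDL: the function names, the 12 fractional bits, and the use of left shifts on integers (so the result stays exact and scaled by 2**frac_bits, rather than the arithmetic right shifts described in the text) are all assumptions made for illustration. Each constant coefficient is quantized to two's complement, the inputs whose coefficient has a given bit set are added together, and the per-bit sums are combined by shifting, with the sign-bit row weighted negatively.

```python
import math


def to_twos_complement_bits(value, frac_bits):
    """Quantize a constant in [-1, 1) to 1 sign bit + frac_bits fractional bits.

    Returns the bits LSB first; the last entry is the sign bit.
    (Hypothetical helper, not taken from the paper.)
    """
    width = frac_bits + 1
    q = int(round(value * (1 << frac_bits)))
    q = max(-(1 << frac_bits), min((1 << frac_bits) - 1, q))  # saturate
    q &= (1 << width) - 1  # two's complement wrap for negative values
    return [(q >> i) & 1 for i in range(width)]


def neda_inner_product(coeffs, inputs, frac_bits=12):
    """Return sum(coeffs[k] * inputs[k]) scaled by 2**frac_bits.

    Only additions, subtractions and shifts are applied to the inputs.
    """
    bit_rows = [to_twos_complement_bits(c, frac_bits) for c in coeffs]
    acc = 0
    for i in range(frac_bits + 1):
        # Per-bit sum: add the inputs whose coefficient has bit i set.
        partial = sum(x for bits, x in zip(bit_rows, inputs) if bits[i])
        if i == frac_bits:
            acc -= partial << i  # sign bit carries negative weight
        else:
            acc += partial << i
    return acc


if __name__ == "__main__":
    # Real part of (a + jb) * (cos(pi/8) - j sin(pi/8)) for a = 100, b = 37.
    c, s = math.cos(math.pi / 8), math.sin(math.pi / 8)
    scaled = neda_inner_product([c, s], [100, 37])
    print(scaled / (1 << 12))
```

Because the coefficients are fixed, the set of inputs added for each bit position is known at design time, which is why a hardware version needs no ROM and no multiplier.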
Four radix-4 butterflies are used in the first stage and the other four in the second (final) stage. The input is taken in normal order and the output is produced in bit-reversed order. The output of each radix-4 butterfly is multiplied by the respective twiddle factors [7]. In the block diagram (Fig. 1), the first stage consists of four radix-4 butterflies. The inputs to the butterflies are x(n), x(n+4), x(n+8), x(n+12), where n is 0 for the first butterfly, 1 for the second, 2 for the third, and 3 for the last butterfly of the first stage. The twiddle factors are given by $W_{16}^{0}$, $W_{16}^{q}$, $W_{16}^{2q}$, $W_{16}^{3q}$, where q is 0 for the first butterfly, 1 for the second, 2 for the third, and 3 for the last butterfly of the first stage. The outputs of the first stage are multiplied by the respective twiddle factors and given as inputs to the second stage. As proposed in our design, the complex twiddle multiplications required at the stage output are implemented using NEDA blocks. Overall, 9 NEDA blocks are required at the output of the first stage of the 16-point FFT processor.

Fig. 1. Block diagram of the proposed architecture.

In the second stage, 4 more radix-4 butterfly blocks are used. The first radix-4 butterfly in the second stage takes the first output of each of the 4 radix-4 butterfly blocks in the first stage. The second radix-4 butterfly in the second stage takes the second output of each of the 4 first-stage butterfly blocks, after the NEDA block where one is required. The same pattern continues for the remaining radix-4 butterfly blocks in the second stage. No NEDA block is needed after the second stage because its outputs are multiplied by a twiddle factor of 1. The final output is produced in bit-reversed order. The advantage of the radix-4 algorithm is that it retains the simplicity of the radix-2 algorithm while producing the output with lower complexity [8].

The NEDA block shown in the block diagram performs the complex multiplication of a first-stage output by the respective twiddle factor. The twiddle factor values used here are as follows.

$$\begin{aligned}
W_{16}^{1} &= \cos\frac{\pi}{8} - j\sin\frac{\pi}{8} = 0.9238 - j0.3826 \\
W_{16}^{2} &= \cos\frac{\pi}{4} - j\sin\frac{\pi}{4} = 0.7071 - j0.7071 \\
W_{16}^{3} &= \cos\frac{3\pi}{8} - j\sin\frac{3\pi}{8} = 0.3826 - j0.9238 \\
W_{16}^{6} &= \cos\frac{3\pi}{4} - j\sin\frac{3\pi}{4} = -0.7071 - j0.7071 \\
W_{16}^{9} &= \cos\frac{9\pi}{8} - j\sin\frac{9\pi}{8} = -0.9238 + j0.3826
\end{aligned} \tag{12}$$

The product of a complex number $a + jb$ and a twiddle factor is given by $(a + jb)(\cos\theta - j\sin\theta) = (a\cos\theta + b\sin\theta) + j(b\cos\theta - a\sin\theta)$. For a constant $\theta$, the cosine and sine values are constant. So, referring to equations (8) to (11), we can find $a\cos\theta$, $a\sin\theta$, $b\cos\theta$, and $b\sin\theta$.

Fig. 2. Radix-4 butterfly structure (inputs x[n], x[n+4], x[n+8], x[n+12]; outputs X[4r], X[4r+1], X[4r+2], X[4r+3]).

IV. RESULTS AND DISCUSSION

The proposed architecture has been implemented using Xilinx ISE v1.1. To map the design, the FPGAs selected are XC2VP1-6FF174, XC2V8- and XC5VLX33. The results are checked for integer inputs. The design is coded in VHDL, with all inputs and outputs declared in the signed two's complement number system. The data path of the proposed design is 16 bits wide, which includes the width of the inputs and outputs.

TABLE I
COMPARISON OF OUTPUTS OBTAINED USING MATLAB AND XILINX ISE

MATLAB outputs      Xilinx ISim outputs
1783                1783
71.6-j783.9         85-j781
-286.7+j149.3       -287+j152
-96.3-j447          3-j444
-943+j228           -943+j228
477.8+j687.8        483+j693
298.7-j64.7         299-j62
-337.9+j375         -336+j381
719                 719
-337.9-j375         -335-j379
298.7+j64.7         299+j62
477.8-j687.8        485-j682
-943-j228           -943-j228
-96.3+j447          -91+j443
-286.7-j149.3       -287-j152
71.6+j783.9         74+j769

The numbers of real adders and multipliers required to compute an N-point FFT using the radix-4 algorithm are given by $3N\log_2 N$ and $(3N/2)\log_2 N$ respectively. Table II compares the number of arithmetic units required in the conventional and proposed designs.

TABLE II
COMPARISON OF NUMBER OF ARITHMETIC UNITS REQUIRED TO PERFORM 16-POINT RADIX-4 COMPLEX FFT

Arithmetic unit          Conventional design    Proposed design
Real adder/subtractor    192                    269
Real multiplier          96                     0

Table III lists the performance metrics of the proposed design on different FPGAs, namely throughput and slice delay product.

TABLE III
PERFORMANCE METRICS OF PROPOSED DESIGN ON DIFFERENT FPGAs

                       XC2VP1-6FF174    XC2V8-     XC5VLX33
Frequency (MHz)        61.831           54.478     119.27
Throughput (Mbit/s)    989.296          871.648    1828.32
Slice delay product    38.64            44.49      9.18

Table IV shows the device utilization summary of the proposed design on different FPGAs. From Table IV, it is clear that the number of slices occupied in XC5VLX33 is less than in the other two FPGAs.
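The conventional-design counts in Table II can be checked with a few lines of Python. This is only a sketch: the function name is hypothetical, and taking the logarithm in the two count formulas as base 2 is an assumption, chosen because it reproduces the 192 adders and 96 multipliers listed for N = 16.

```python
def conventional_radix4_counts(n_points):
    """Real adder and real multiplier counts for a conventional design.

    Assumes adders = 3*N*log2(N) and multipliers = (3*N/2)*log2(N);
    the base-2 logarithm is an assumption that matches Table II.
    """
    if n_points < 2 or n_points & (n_points - 1):
        raise ValueError("n_points must be a power of two >= 2")
    log2_n = n_points.bit_length() - 1
    adders = 3 * n_points * log2_n
    multipliers = (3 * n_points // 2) * log2_n
    return adders, multipliers


if __name__ == "__main__":
    print(conventional_radix4_counts(16))
```

The proposed design trades the 96 multipliers for additional adders (269 in total), which is the comparison Table II makes.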
TABLE IV
DEVICE UTILIZATION SUMMARY OF PROPOSED DESIGN ON DIFFERENT FPGAs

                              Used                                   Utilization
Logic utilization             XC2VP1-6FF174  XC2V8-   XC5VLX33       XC2VP1-6FF174  XC2V8-   XC5VLX33
Number of Slices              2389           2424     196            5%             5%       2%
Number of Slice Flip Flops    1913           194      1894           2%             2%       1%
Number of 4-input LUTs        3972           454      3469           4%             4%       1%

TABLE V
POWER ANALYSIS OF PROPOSED DESIGN ON DIFFERENT FPGAs

Frequency   Total Quiescent Power (W)              Total Dynamic Power (W)                Total Power (W)
(MHz)       XC2VP1-6FF174  XC2V8-   XC5VLX33       XC2VP1-6FF174  XC2V8-   XC5VLX33       XC2VP1-6FF174  XC2V8-   XC5VLX33
12.5        .2438          .13785   3.32919        .3797          .47152   .7869          .51234         .6937    3.4789
2           .2438          .13785   3.55222        .37262         .57793   .57921         .57699         .71578   4.13143
3           .2438          .13785   3.6824         .4755          .69395   .83959         .61193         .8318    4.51983
4           .2438          .13785   3.81776        .45378         .79855   1.9989         .65815         .9364    4.91766
5           .2438          .13785   3.9667         .52452         .93893   1.3613         .72889         1.7678   5.32617
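For checking hardware outputs against a software model, as Table I does with MATLAB, the two-stage structure of Section III can be sketched in floating point. The sketch below is an assumption-laden reference model, not the authors' design: the function names are hypothetical, the twiddle multiplications use ordinary complex arithmetic in place of the NEDA blocks, and a decimation-in-frequency arrangement is assumed so that natural-order input yields digit-reversed output.

```python
import cmath


def butterfly4(a, b, c, d):
    """Radix-4 butterfly: a 4-point DFT using only adds and multiplications by +/-j."""
    return (
        a + b + c + d,
        a - 1j * b - c + 1j * d,
        a - b + c - d,
        a + 1j * b - c - 1j * d,
    )


def fft16_radix4(x):
    """16-point radix-4 FFT; returns X[0], X[4], X[8], X[12], X[1], X[5], ..."""
    if len(x) != 16:
        raise ValueError("expected 16 input samples")
    # Stage 1: butterfly q works on x[q], x[q+4], x[q+8], x[q+12].
    stage1 = [butterfly4(x[q], x[q + 4], x[q + 8], x[q + 12]) for q in range(4)]
    # Twiddles W16^(q*m); only the nine with q, m in {1, 2, 3} are non-trivial.
    twiddled = [
        [stage1[q][m] * cmath.exp(-2j * cmath.pi * q * m / 16) for m in range(4)]
        for q in range(4)
    ]
    # Stage 2: butterfly m takes output m of every first-stage butterfly.
    out = []
    for m in range(4):
        out.extend(butterfly4(*(twiddled[q][m] for q in range(4))))
    return out


def to_natural_order(digit_reversed):
    """Reorder the output of fft16_radix4 into X[0], X[1], ..., X[15]."""
    return [digit_reversed[4 * (k % 4) + k // 4] for k in range(16)]


if __name__ == "__main__":
    samples = [complex(n, 0) for n in range(16)]
    for k, value in enumerate(to_natural_order(fft16_radix4(samples))):
        print(k, value)
```

A fixed-point implementation will differ slightly from such a model because of coefficient quantization and truncation in the shifts, which is one plausible reason for small differences between simulated and reference values.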

Table V compares the power consumption of the proposed design on different FPGAs. The software tool used to analyze power is Xilinx XPower Analyzer. The proposed design has also been synthesized with Synopsys tools for both 65 nm and 18 nm technologies. Table VI shows the power report and area report of the proposed design in ASIC for the TCBN65GPLUSTC and FSAA_C_GENERIC_CORE_FF1P98VM4C libraries. The flow used is Synopsys DC synthesis.

TABLE VI
POWER AND AREA REPORT OF PROPOSED DESIGN FOR SYNOPSYS DC IMPLEMENTATIONS

                            Technology library
                            TCBN65GPLUSTC    FSAA_C_GENERIC_CORE_FF1P98VM4C
Total Dynamic Power (mW)    22.733           32.5547
Total Cell Area             2139.48469       161733.4625

V. CONCLUSIONS

This paper reported the architecture of a 16-point radix-4 complex FFT core using NEDA, which is a ROM-less and multiplier-less method. The proposed design is efficient in terms of hardware compared to traditional methods. It has been implemented on different FPGAs to compare the performance metrics. The number of real adders/subtractors required to implement the proposed design is 269. The maximum frequency obtained is 119.27 MHz on the XC5VLX33 FPGA. The throughput and slice delay product obtained on the XC5VLX33 FPGA are 1828.32 Mbit/s and 9.18. In terms of power consumption, the XC2V8 FPGA shows a better result than the others. When synthesized using Synopsys DC, the proposed design occupied 2139.48469 units of cell area for 65 nm technology and 161733.4625 units of cell area for 18 nm technology. The Synopsys power reports show a consumption of 22.733 mW for 65 nm technology and 32.5547 mW for 18 nm technology.

VI. REFERENCES

[1] Stanley A. White, "Applications of Distributed Arithmetic to Digital Signal Processing: A Tutorial Review," IEEE ASSP Magazine, vol. 6, no. 3, pp. 4-19, Jul. 1989.
[2] Wendi Pan, Ahmed Shams, and Magdy A. Bayoumi, "NEDA: A NEw Distributed Arithmetic Architecture and its Application to One Dimensional Discrete Cosine Transform," Proc. IEEE Workshop on Signal Processing Syst., pp. 159-168, Oct. 1999.
[3] M. Rawski, M. Vojtynski, T. Wojciechowski, and P. Majkowski, "Distributed Arithmetic Based Implementation of Fourier Transform Targeted at FPGA Architectures," Proc. Intl. Conf. Mixed Design, pp. 152-156, Jun. 2007.
[4] S. Chandrasekaran and A. Amira, "Novel Sparse OBC based Distributed Arithmetic Architecture for Matrix Transforms," Proc. IEEE Intl. Sym. Circuits and Syst., pp. 327 321, May 2007.
[5] Richard M. Jiang, "An Area-Efficient FFT Architecture for OFDM Digital Video Broadcasting," IEEE Trans. Consumer Elect., vol. 53, no. 4, pp. 1322-1326, Nov. 2007.
[6] Ren-Xi Gong, Jiong-Quan Wei, Dan Sun, Ling-Ling Xie, Peng-Fei Shu and Xiao-Bi Meng, "FPGA Implementation of a CORDIC-based Radix-4 FFT Processor for Real-Time Harmonic Analyzer," Intl. Conf. on Natural Computation, pp. 1832-1835, Jul. 2011.
[7] Alban Ferizi, Bernhard Hoeher, Melanie Jung, Georg Fischer, and Alexander Koelpin, "Design and Implementation of a Fixed-Point Radix-4 FFT Optimized for Local Positioning in Wireless Sensor Networks," Intl. Multi-Conf. Syst. Signals and Devices, pp. 1-4, Mar. 2012.
[8] Li Wenqi, Wang Xuan, and Sun Xiangran, "Design of Fixed-Point High-Performance FFT Processor," Intl. Conf. Edu. Tech. and Comput., vol. 5, pp. 139-143, Jun. 2010.