Low Power and Memory Efficient FFT Architecture Using Modified CORDIC Algorithm

A. Malashri¹, C. Paramasivam²
¹ PG Student, Department of Electronics and Communication, K S Rangasamy College of Technology, TN, India
² Assistant Professor, Department of Electronics and Communication, K S Rangasamy College of Technology, TN, India

Abstract - This paper presents a pipelined, reduced-memory and low-power CORDIC-based architecture for fast Fourier transform (FFT) implementation. The proposed algorithm uses a new addressing scheme and the associated angle generator logic to remove any ROM usage for storing twiddle factors. CORDIC is implemented in simple hardware through repeated shift-add operations. Low power is achieved by using the Coordinate Rotation Digital Computer (CORDIC) algorithm in place of conventional multiplication; furthermore, dynamic power consumption is reduced with no delay penalty.

Keywords - FFT, CORDIC, VLSI, Low power.

I. INTRODUCTION

The fast Fourier transform (FFT) is among the most widely used operations in digital signal processing. Often, a high-performance FFT processor is the key component and determines most of the design metrics in applications such as Orthogonal Frequency-Division Multiplexing (OFDM), Synthetic Aperture Radar (SAR) and software defined radio. For embedded systems, in particular portable devices, efficient hardware realization of the FFT with small area, low power dissipation and real-time computation is a significant challenge.

A typical FFT processor is composed of butterfly calculation units, memory banks and control logic (an address generator for data and twiddle factor accesses). In most cases, an FFT processor uses only one butterfly unit to realize all calculations iteratively, and an in-place memory access strategy is required to keep the amount of memory to a minimum.
With the in-place strategy, the outputs of a butterfly operation are stored back to the same memory locations as the inputs, reducing the memory usage by one half. However, a correct memory addressing scheme is required to avoid data conflicts. This study implements an efficient addressing scheme to realize parallel, pipelined and in-place memory access. It produces an output at every clock cycle; furthermore, the memory banks and the butterfly unit are utilized with 100% efficiency within the pipeline.

In FFT processors, the butterfly operation is the most computationally demanding stage. Traditionally, a butterfly unit is composed of complex adders and multipliers, and the multiplier is usually the speed bottleneck in the pipeline of the FFT processor. The Coordinate Rotation Digital Computer (CORDIC) [5] algorithm is an alternative method to realize the butterfly operation without using any dedicated multiplier hardware. The CORDIC algorithm is very versatile and hardware efficient since it requires only add and shift operations, making it very suitable for the butterfly operations in an FFT [6]. Instead of storing the actual twiddle factors in a ROM, a CORDIC-based FFT processor needs to store only the twiddle factor angles in a ROM for the butterfly operation. Additionally, a CORDIC-based butterfly can be twice as fast as traditional multiplier-based butterflies in VLSI implementations.

In this paper, we propose a modified CORDIC algorithm for FFT processors that eliminates the need for storing the twiddle factor angles. The algorithm generates the twiddle factor angles successively with an accumulator. With this approach, the full memory requirements of an FFT processor can be reduced by more than 20%, and the memory reduction improves with increasing radix size. Since the CORDIC angle calculation does not modify the critical path, system throughput does not change.

II. FAST FOURIER TRANSFORM

The N-point discrete Fourier transform is defined by

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \quad k = 0, 1, \ldots, N-1, \qquad W_N = e^{-j 2\pi / N}. \tag{1}$$

Fig. 1 shows the signal flow graph of a 16-point decimation-in-frequency (DIF) radix-2 FFT. The FFT algorithm is composed of butterfly calculation units:

$$x_{m+1}(p) = x_m(p) + x_m(q) \tag{2}$$

$$x_{m+1}(q) = \left[ x_m(p) - x_m(q) \right] W_N^{r} \tag{3}$$

Equations (2) and (3) describe the radix-2 butterfly operation at stage m, as shown in Fig. 1, with $W_N^{r}$ the twiddle factor of the butterfly. Each butterfly operation needs four data accesses (two reads and two writes); however, hardware realization of four-port memory units is difficult and costly. To overcome this challenge, multi-bank memory units can be used to realize parallel and in-place data accesses. Two two-port memory banks can provide four data accesses in each clock cycle, but in this case a special data addressing scheme is required to prevent data conflicts. A new addressing scheme has been proposed to realize this function, and it can easily be adopted for a CORDIC-based FFT implementation.

Fig. 1. Signal flow graph of a 16-point radix-2 FFT

ISSN: 2231-2803 http://www.ijcttjournal.org Page 1053

III. CORDIC ALGORITHM

The CORDIC algorithm is an iterative algorithm that calculates the rotation of a vector using only additions and shifts. It computes trigonometric functions, the rotation of a vector and the angle of a vector by realizing two-dimensional vector rotation in circular coordinate systems. More generally, the CORDIC algorithm rotates a vector v in the XY-plane in circular, linear or hyperbolic coordinate systems, depending on the function to be evaluated. It is an iterative convergence algorithm that performs a rotation as a series of specific incremental rotation angles, selected so that each iteration is performed by shift and add operations. The norm of a vector in these coordinate systems is defined as $\sqrt{x^2 + m y^2}$, where $m \in \{1, 0, -1\}$ represents a circular, linear or hyperbolic coordinate system, respectively. The norm-preserving rotation trajectory is the circle defined by $x^2 + y^2 = 1$ in the circular coordinate system. Similarly, the norm-preserving rotation trajectories in the hyperbolic and linear coordinate systems are defined by $x^2 - y^2 = 1$ and $x = 1$, respectively.

The CORDIC method can be employed in two different modes: 1) rotation mode and 2) vectoring mode. The rotation mode performs a general rotation by a given angle θ. The vectoring mode computes the unknown angle θ of a vector by performing a finite number of micro-rotations. It can be shown that the rotation can be simplified to:

$$x_{i+1} = x_i - y_i \, d_i \, 2^{-i} \tag{4}$$

$$y_{i+1} = y_i + x_i \, d_i \, 2^{-i} \tag{5}$$

The direction of each rotation is defined by $d_i$, and the sequence of all $d_i$ determines the final vector. $d_i$ is given as:

$$d_i = \begin{cases} -1, & z_i < 0 \\ +1, & z_i \ge 0 \end{cases} \tag{6}$$

where $z_i$ is called the angle accumulator and is given by

$$z_{i+1} = z_i - d_i \arctan\left(2^{-i}\right) \tag{7}$$

All operations described in (4)-(7) can be realized by only additions and shifts; therefore, the CORDIC algorithm does not require dedicated multipliers.

Fig. 2. Basic structure of a pipelined CORDIC unit

Although CORDIC may not be the fastest technique to perform these operations, it is attractive due to the simplicity of its hardware implementation, since the same iterative algorithm can be used for all these applications using basic shift-add operations. Keeping in view the requirements and constraints of different application environments, CORDIC algorithms and architectures have been developed to achieve high throughput and to reduce hardware complexity as well as implementation latency. Angle recoding schemes, mixed-grain rotation and higher-radix CORDIC have been developed for reduced-latency realization. Parallel and pipelined CORDIC have been suggested for high-throughput computation. The CORDIC algorithm is often realized by pipeline structures, leading to high processing speed. Fig. 2 shows the basic structure of the pipelined CORDIC unit.

As shown in (1), the key operation of FFT processing is $x(n)\, W_N^{nk}$. This is equivalent to rotating x(n) by the angle $-2\pi nk/N$, an operation that can be realized easily by the CORDIC algorithm. Without conventional complex multiplication, a CORDIC-based butterfly can be very fast.

An FFT processor needs to store the twiddle factors in memory. A CORDIC-based FFT does not have twiddle factors but needs a memory bank to store the rotation angles. For a radix-2, N-point, m-bit FFT, mN/2 bits of memory are needed to store N/2 angles. In the next section, a new CORDIC-based FFT design is presented that does not require any twiddle factor or angle memory units. This design uses a single accumulator to generate all the necessary angles on the fly and does not have any precision loss.

Conventionally, a CORDIC-based FFT processor needs a dedicated memory bank to store the necessary twiddle factor angles for the rotation. In [2], a pipelined CORDIC algorithm for FFT processors was proposed that eliminates the need for storing the twiddle factor angles, but it requires a larger number of iterations. In this paper, a modified CORDIC algorithm using the micro-rotation selection technique [1] is proposed for FFT processors; it reduces the number of iterations and the slice-delay product.

IV. PROPOSED CORDIC BASED FFT

The proposed architecture is built from a CORDIC-based butterfly structure, an angle generator, RAM, multiplexers, demultiplexers and registers. The architecture can be divided into an input block, a core block and an output block. The input block contains the RAM, demultiplexer, register and multiplexer arrangement; the input to the system is binary data.
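To make the rotation-mode iteration of equations (4)-(7) concrete, the following Python sketch applies it to the radix-2 butterfly of equations (2)-(3). It is an illustration under stated assumptions rather than the proposed hardware: it uses conventional CORDIC with explicit gain compensation (the modified processor described in this paper is scaling-free), floating-point multiplications by 2^-i stand in for hardware shifts, and the names `cordic_rotate` and `cordic_butterfly`, the twiddle index `r` and the default of 16 iterations are hypothetical choices made for the example.

```python
import math


def cordic_rotate(x, y, angle, iterations=16):
    """Rotate (x, y) by `angle` radians (|angle| <= pi) using CORDIC micro-rotations."""
    # Assumption: pre-rotate by +/-90 degrees so the residual angle stays
    # inside the convergence range of the basic iteration (about 1.74 rad).
    if angle > math.pi / 2:
        x, y, angle = -y, x, angle - math.pi / 2
    elif angle < -math.pi / 2:
        x, y, angle = y, -x, angle + math.pi / 2

    z = angle                                  # angle accumulator
    for i in range(iterations):
        d = 1 if z >= 0 else -1                # equation (6)
        shift = 2.0 ** -i                      # a right shift by i bits in hardware
        x, y = x - y * d * shift, y + x * d * shift   # equations (4) and (5)
        z -= d * math.atan(shift)              # equation (7)

    # Conventional CORDIC grows the vector by prod(sqrt(1 + 2^-2i)); undo it here.
    gain = math.prod(math.sqrt(1.0 + 4.0 ** -i) for i in range(iterations))
    return x / gain, y / gain


def cordic_butterfly(xp, xq, r, n_points, iterations=16):
    """Radix-2 DIF butterfly, equations (2)-(3), with the twiddle multiply done by rotation."""
    upper = xp + xq
    diff = xp - xq
    # Multiplying by W_N^r is a rotation by -2*pi*r/N.
    re, im = cordic_rotate(diff.real, diff.imag,
                           -2.0 * math.pi * r / n_points, iterations)
    return upper, complex(re, im)
```

With 16 iterations the residual angle error is bounded by arctan(2^-15), so the rotated output agrees with a direct complex multiplication to roughly four decimal places; more iterations trade latency for accuracy.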
The input block has a RAM in which the data are stored using incremental addressing. The data then enter the demultiplexer unit, and the output of the demultiplexer is saved in registers; the register that receives the data is chosen by the select line of the demultiplexer. The register outputs are applied to the multiplexer unit, which produces the input for the core block. The core block consists of the butterfly structure of the FFT, which is designed using the CORDIC algorithm to replace the complex multipliers. An angle generator supplies the twiddle factor rotation angles to the pipelined CORDIC structure. The core block is designed for radix-2 and radix-4 FFT structures. The output block also has a demultiplexer-multiplexer arrangement with registers, in which the output data are stored before being sent out.

Fig. 3. General block diagram of the proposed FFT (blocks: input, input block, core block comprising the butterfly structure and the angle generator, output block, output)

Although several multi-bank addressing schemes have been used to realize parallel and pipelined FFT processing, these methods are not suitable for the reduced-memory CORDIC FFT. In these schemes, the twiddle factor angles are not in regular increasing order, resulting in a more complex design for the angle generator. Here, the twiddle factor angles increase sequentially, and every angle is a multiple of the basic angle 2π/N, which is π/8 for a 16-point FFT. In every FFT stage, the angle always increases by one step per clock cycle. Hence, an angle generator circuit composed of an accumulator and an output latch can realize this function, as shown in Fig. 4.

Fig. 4. Angle generator for the CORDIC based FFT

The accumulator consists of a simple adder and a register. It adds the value fed back from the register to the input angle value.
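The accumulator-plus-latch idea can be sketched in a few lines of Python. This is a behavioural model under assumptions, not the circuit of Fig. 4: angles are kept as integer multiples of the basic angle 2π/N, the accumulator is assumed to wrap after N/2 steps so that it restarts at zero for each stage, and the latch-enable rule (update only when the S low-order bits of the butterfly counter are zero) is my reading of the latch pattern given later in this section. The names `AngleGenerator` and `stage_angle_indices` are hypothetical.

```python
import math


class AngleGenerator:
    """Accumulator (adder + register) followed by an output latch."""

    def __init__(self, n_points):
        self.n_points = n_points
        self.acc = 0      # accumulator register, in units of the basic angle 2*pi/N
        self.latch = 0    # value currently presented to the CORDIC unit

    def clock(self, latch_enable, step=1):
        """One clock cycle: update the latch if enabled, then accumulate one step."""
        if latch_enable:
            self.latch = self.acc
        # Assumption: wrap after N/2 steps so each stage starts again from zero.
        self.acc = (self.acc + step) % (self.n_points // 2)
        return self.latch

    def angle(self):
        """Latched angle in radians (conversion shown only for illustration)."""
        return self.latch * 2.0 * math.pi / self.n_points


def stage_angle_indices(n_points, stage):
    """Angle indices seen by the butterfly unit during one stage."""
    gen = AngleGenerator(n_points)
    low_bits = (1 << stage) - 1
    indices = []
    for b in range(n_points // 2):            # butterfly counter B
        # Assumed enable rule: latch only when the `stage` low bits of B are zero.
        indices.append(gen.clock(latch_enable=(b & low_bits) == 0))
    return indices
```

For a 16-point FFT this model yields the indices 0 to 7 in stage 0 and 0, 0, 2, 2, 4, 4, 6, 6 in stage 1, i.e. multiples of π/8 that never decrease within a stage. Keeping the angle as an integer count avoids rounding in the model itself; a hardware accumulator working on a fixed-point value of 2π/N would need enough bits to keep the accumulated error small.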
The control signal for the latch, which enables or disables the accumulator output, is simple: it is based on the current FFT butterfly stage and the RAM address bits b2b1b0.

MODIFIED CORDIC PROCESSOR

The proposed CORDIC processor provides the flexibility to adjust the number of iterations depending on the accuracy, area and latency requirements. This gives an area-time efficient CORDIC algorithm that completely eliminates the scale factor. A generalized micro-rotation selection technique based on high-speed most-significant-1 detection obviates the complex search algorithms for identifying the micro-rotations. A novel scaling-free CORDIC algorithm is thus proposed for area-time efficient implementation of CORDIC with an adequate range of convergence. The proposed recursive architecture has comparable or lower area complexity than other existing scaling-free CORDIC algorithms. Moreover, no scale-factor multiplications are required to extend the range of convergence to the entire coordinate space.

Fig. 5. Block diagram for the proposed CORDIC architecture

The block diagram for the proposed CORDIC architecture is shown in Fig. 5. It uses the same stage for all iterations, both for the coordinate calculations and for the generation of shift values. The stage consists of three computing blocks: 1) shift-value estimation; 2) coordinate calculation; and 3) micro-rotation sequence generator.

Fig. 6 shows the architecture of the proposed no-twiddle-factor-memory design for the radix-2 FFT. Four registers and eight 2-to-1 multiplexers are used. Registers are needed before and after the butterfly unit to buffer the intermediate data in order to group two sequential butterfly operations together. In this way, conflict-free in-place data access can be realized. This register-buffer design can be extended to FFTs of any radix. For radix-2, the structure can be simplified to use just four registers, but for a radix-r FFT, $2r^2$ registers are needed.

Fig. 6. Proposed design for radix-2 CORDIC FFT processor

For an $N = 2^n$-point FFT, the addressing and control logic are composed of several components. An (n−1)-bit butterfly counter $B = b_{n-2} b_{n-3} \ldots b_1 b_0$ provides the address sequences and the control logic of the angle generator.
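As a rough illustration of how one counter can supply the memory addresses, the sketch below assumes the per-stage rule stated in this section: in stage S the address is the (n−1)-bit butterfly counter B rotated right by S bits. The function names `rotate_right` and `stage_addresses` are hypothetical, and the model covers only the address sequence, not the bank selection or the register buffering of Fig. 6.

```python
def rotate_right(value, shift, width):
    """Rotate the `width`-bit integer `value` right by `shift` bit positions."""
    shift %= width
    mask = (1 << width) - 1
    return ((value >> shift) | (value << (width - shift))) & mask


def stage_addresses(n_points, stage):
    """Addresses generated in one stage as the butterfly counter B counts up."""
    width = n_points.bit_length() - 2      # n - 1 counter bits for N = 2**n
    return [rotate_right(b, stage, width) for b in range(n_points // 2)]


if __name__ == "__main__":
    for s in range(4):
        print(s, stage_addresses(16, s))
```

Because a rotation is a bijection on the counter values, every stage visits each of the N/2 addresses exactly once; only the visiting order changes from stage to stage.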
In stage S, the memory address is given by $b_{S-1} b_{S-2} \ldots b_0 \, b_{n-2} b_{n-3} \ldots b_S$, which is the butterfly counter B rotated right by S bits. Meanwhile, the control logic of the latch of the angle generator is determined by the pattern $b_{n-2} b_{n-3} \ldots b_S \, 0 \ldots 0$ (S zeros).

For a 16-point ($N = 2^4$) FFT, the butterfly counter has n − 1 = 3 bits, $B = b_2 b_1 b_0$, and provides the address sequences and the control logic of the angle generator as follows:

In stage 0, the memory address is $b_2 b_1 b_0$, which is B rotated right by 0 bits.
In stage 1, the memory address is $b_0 b_2 b_1$, which is B rotated right by 1 bit, and the latch control pattern is $b_2 b_1 0$ (S zeros, where S = 1).
In stage 2, the memory address is $b_1 b_0 b_2$, which is B rotated right by 2 bits, and the latch control pattern is $b_2 0 0$ (S zeros, where S = 2).
In stage 3, the memory address is $b_2 b_1 b_0$, which is B rotated right by 3 bits, and the latch control pattern is 000 (S zeros, where S = 3).

Due to the finite wordlength, precision loss accumulates as the accumulator operates. To address this issue, more bits (a wider wordlength) can be used for the fundamental angle 2π/N and the accumulator logic.

V. RESULTS AND CONCLUSION

The proposed designs for the radix-2 and radix-4 FFT architectures have been realized in VHDL. HDL synthesis has been performed for both architectures using the Xilinx 9.1 tool and the Synopsys tool.

For FFT processors, the butterfly operation is the most computationally demanding stage. Traditionally, a butterfly unit is composed of complex adders and multipliers, and the multiplier is usually the speed bottleneck in the pipeline of the FFT processor. To avoid these problems with the butterfly unit, a modified CORDIC algorithm with associated angle generator logic is proposed. The results of the HDL synthesis with the Xilinx 9.1 tool are given in Table 1.

Table 1. HDL synthesis results using Xilinx for radix-2 and radix-4 FFT architecture

Similarly, the Synopsys tool is used to perform the synthesis, and the results obtained with Design Compiler are given in Table 2.

Table 2. Synthesis results using Synopsys for radix-2 and radix-4 FFT architecture

The synthesis results shown in Table 4.2 confirm that the proposed design can reduce memory usage (by up to 39%) for FFT processors without any tangible increase in the number of logic elements. Furthermore, power consumption is reduced by as much as 70% for the radix-2 FFT and 47% for the radix-4 FFT with no delay penalty.

REFERENCES

[1] Aggarwal, S., Meher, P. K., & Khare, K. (2012). Area-time efficient scaling-free CORDIC using generalized micro-rotation selection. IEEE Transactions on VLSI Systems.
[2] Xiao, X., Oruklu, E., & Saniie, J. (2012). Low power and reduced memory architecture for CORDIC-based FFT. J Sign Process Syst.
[3] Wey, C., Lin, S., & Tang, W. (2007). Efficient memory-based FFT processors for OFDM applications. In IEEE International Conf. on Electro-Information Technology, 345-350, May.
[4] Mittal, S., Khan, M., & Srinivas, M. B. (2007). On the suitability of Bruun's FFT algorithm for software defined radio. In 2007 IEEE Sarnoff Symposium, pp. 1-5, Apr.
[5] Bi, G., & Jones, E. V. A pipelined FFT processor for word-sequence data. IEEE Trans. on Acoustics, Speech, and Signal Processing, 37, 1982-1985, December.
[6] Volder, J. (1959). The CORDIC trigonometric computing technique. IEEE Transactions on Electronic Computers, 8(8), 330-334.
[7] Despain, A. M. (1974). Fourier transform computers using CORDIC iterations. IEEE Transactions on Electronic Computers, 23(10), 993-1001.
[8] Abdullah, S. S., Nam, H., McDermot, M., & Abraham, J. A. (2009). A high throughput FFT processor with no multipliers. In IEEE International Conf. on Computer Design, pp. 485-490.
[9] Lin, C., & Wu, A. (2005). Mixed-scaling-rotation CORDIC (MSR-CORDIC) algorithm and architecture for high-performance vector rotational DSP applications. IEEE Transactions on Circuits and Systems I, 52(11), 2385-2396.
[10] Jiang, R. M. (2007). An area-efficient FFT architecture for OFDM digital video broadcasting. IEEE Transactions on Consumer Electronics, 53(4), 1322-1326.
[11] Garrido, M., & Grajal, J. (2007). Efficient memory-less CORDIC for FFT computation. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2, 113-116, Apr.
[12] Xiao, X., Oruklu, E., & Saniie, J. (2009). Fast memory addressing scheme for radix-4 FFT implementation. In IEEE International Conference on Electro/Information Technology, EIT 2009, 437-440, June.
[13] Xiao, X., Oruklu, E., & Saniie, J. (2010). Reduced memory architecture for CORDIC-based FFT. In IEEE International Symposium on Circuits and Systems, 2690-2693.
[14] Ma, Y. (1999). An effective memory addressing scheme for FFT processors. IEEE Transactions on Signal Processing, 47(3), 907-911.