Efficient Radix-4 and Radix-8 Butterfly Elements

Weidong Li and Lars Wanhammar
Electronics Systems, Department of Electrical Engineering
Linköping University, SE-581 83 Linköping, Sweden
Tel.: +46 13 28 {1721, 1344}, Fax: +46 13 139282
Email: {weidongl, larsw}@isy.liu.se

Abstract: In this paper, we present a class of high-radix butterfly elements that use (m, n)-counters to replace the adders in the conventional realization of butterfly elements. These butterfly elements reduce the hardware complexity, the delay, and the power consumption.

1. INTRODUCTION

The FFT/IFFT is one of the most important algorithms in digital signal processing. In recent years, it has frequently been applied in modern communication systems because it gives an efficient implementation of OFDM (Orthogonal Frequency Division Multiplex). Many applications, such as xDSL modems, HDTV, and mobile radio terminals, use an FFT/IFFT processor as a key component.

The butterfly elements are one of the basic building blocks of an FFT/IFFT processor. Since FFT/IFFT processors using a radix-4 architecture require fewer multiplications than processors using radix-2, radix-4 architectures are often used. Higher radices tend to reduce the memory access rate, the arithmetic workload and, hence, the power consumption [2] [3]. Efficient design of high-radix butterfly elements is therefore important.

In the following section, we give a short review of the conventional implementation of butterfly elements. We introduce the carry-save based butterfly element in Section 3, and the results are presented in Section 4.

2. A 4-POINT DFT WITH A CONVENTIONAL BUTTERFLY ELEMENT

2.1 4-point DFT

The 4-point DFT is defined as

$$X(k) = \sum_{n=0}^{3} x(n)\, e^{-j 2\pi n k / 4} \tag{1}$$

where $k = 0, 1, 2, 3$. Since $e^{-j 2\pi i / 4} = \pm 1$ or $\pm j$ for every integer $i$, all multiplications are trivial, i.e., they can be realized simply with bypass, inversion, and/or swap of the real and imaginary parts for two's-complement numbers. Hence, no multiplier is required to construct a butterfly element for a 4-point DFT (radix-4 butterfly).

2.2 Conventional butterfly elements for 4-point DFT

We can rewrite equation (1) in matrix form:

$$\begin{bmatrix} X(0) \\ X(1) \\ X(2) \\ X(3) \end{bmatrix} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -j & -1 & j \\ 1 & -1 & 1 & -1 \\ 1 & j & -1 & -j \end{bmatrix} \begin{bmatrix} x(0) \\ x(1) \\ x(2) \\ x(3) \end{bmatrix} \tag{2}$$
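The trivial multiplications in equation (2) can be illustrated with a minimal Python sketch. It assumes the forward-DFT sign convention $e^{-j 2\pi n k/4}$; the function names (`bypass`, `negate`, `times_j`, `times_minus_j`, `dft4_trivial`, `dft4_reference`) are hypothetical and chosen only for this illustration. Each matrix entry is mapped to a bypass, an inversion, or a swap with one inversion, so no multiplier is used.

```python
import cmath

def bypass(z):
    # Multiplication by +1: pass the value through unchanged.
    return z

def negate(z):
    # Multiplication by -1: invert both parts.
    return complex(-z.real, -z.imag)

def times_j(z):
    # j * (a + jb) = -b + ja: swap the parts and invert the new real part.
    return complex(-z.imag, z.real)

def times_minus_j(z):
    # -j * (a + jb) = b - ja: swap the parts and invert the new imaginary part.
    return complex(z.imag, -z.real)

# Row k, column n holds the operation that realizes e^{-j 2 pi n k / 4}.
DFT4_OPS = [
    [bypass, bypass,        bypass, bypass],
    [bypass, times_minus_j, negate, times_j],
    [bypass, negate,        bypass, negate],
    [bypass, times_j,       negate, times_minus_j],
]

def dft4_trivial(x):
    """4-point DFT using only additions and trivial multiplications."""
    return [sum((op(xn) for op, xn in zip(row, x)), 0j) for row in DFT4_OPS]

def dft4_reference(x):
    """Direct evaluation of equation (1), used only for comparison."""
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / 4) for n in range(4))
            for k in range(4)]
```

Comparing `dft4_trivial` with `dft4_reference` on arbitrary complex inputs is one way to check that the swap-and-invert rules reproduce the matrix in equation (2).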

Using the numerical strength reduction technique at the word level [4], we obtain the signal-flow graph shown in Fig. 1.

Figure 1 Signal-flow graph for 4-point DFT.

A conventional butterfly element based on an isomorphic mapping of this signal-flow graph therefore requires 8 adders/subtractors and has a delay of 2 additions/subtractions.

3. CARRY-SAVE BASED BUTTERFLY ELEMENTS

3.1 Principle of carry-save based butterfly elements for a 4-point DFT

For the sake of simplicity, we consider only the real part of one output, i.e., $X_{re}(0)$, of the 4-point butterfly operation. From eq. (2), $X_{re}(0)$ is

$$X_{re}(0) = 1 \cdot x_{re}(0) + 1 \cdot x_{re}(1) + 1 \cdot x_{re}(2) + 1 \cdot x_{re}(3) \tag{3}$$

Consider also a real multiplication $Y = B \cdot 15$. The multiplication can be expressed as

$$Y = 1 \cdot (2^0 B) + 1 \cdot (2^1 B) + 1 \cdot (2^2 B) + 1 \cdot (2^3 B) \tag{4}$$

Comparing eq. (3) with eq. (4), we find that the two equations are of the same nature, i.e., both the butterfly operation and the multiplication are additions of multiple addends. To speed up the butterfly operation, we can apply the same technique as in carry-save multipliers [1], that is, use an adder tree for the summation of the partial products and a fast adder for the vector-merging addition. In a more general scheme, the inner adders/subtractors are replaced by (m, n)-counters. This reduces the hardware complexity. Since the inputs no longer have to be added/subtracted sequentially, the execution time is reduced as well.

A simplified notation for the conventional and the carry-save based butterfly element is shown in Fig. 2.

Figure 2 Simplified notation for the conventional and the carry-save butterfly element: (i) conventional; (ii) carry-save based, with the (4,2)-counter outputs feeding a fast adder.
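The shared structure of eqs. (3) and (4) can be sketched at the word level in Python. This is an illustrative model under stated assumptions, not the circuit of the paper: the (4,2)-counter is modeled as two cascaded 3:2 carry-save stages (one common construction), Python integers stand in for two's-complement words of unlimited length, and all function names are hypothetical. An addend flagged for subtraction is bit-inverted, and its +1 correction is added at the final adder, using the identity $-x = \bar{x} + 1$.

```python
def carry_save_add(a, b, c):
    """3:2 carry-save stage on whole words: a + b + c == s + cy, no carry ripple."""
    s = a ^ b ^ c                                # bitwise sum without carries
    cy = ((a & b) | (a & c) | (b & c)) << 1      # majority bits, shifted to weight 2
    return s, cy

def compress_4_to_2(a, b, c, d):
    """Reduce four addends to a sum word and a carry word (assumed (4,2) model)."""
    s1, c1 = carry_save_add(a, b, c)
    return carry_save_add(s1, c1, d)

def butterfly_output(addends, subtract=(False, False, False, False)):
    """Sum four addends; flagged addends are subtracted via inversion plus carry-in."""
    inverted = [~x if neg else x for x, neg in zip(addends, subtract)]
    s, cy = compress_4_to_2(*inverted)
    carry_in = sum(subtract)       # one +1 correction per inverted addend
    return s + cy + carry_in       # single vector-merging (fast) addition

def times_15(b):
    """Eq. (4): 15*B as the sum of four shifted copies of B."""
    return butterfly_output([b, b << 1, b << 2, b << 3])
```

Only the last line of `butterfly_output` propagates carries, which mirrors the idea that the sequential additions are replaced by one reduction stage and one fast adder. For example, `butterfly_output([3, 9, 12, 5], (False, False, True, True))` evaluates 3 + 9 - 12 - 5.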

3.2 Implementation of carry-save based butterfly element for 4-point DFT

The carry-save based butterfly element can be realized with (m, n)-counters and a fast adder. From eq. (2), we find that some outputs require both additions and subtractions. Since we use two's-complement number representation, a subtraction can be realized by adding the negated number. The negated value is obtained by adding 1 at the LSB to the bit-complement. Hence, there are more than 4 inputs at the LSB, which would require (m, 2)-counters with m > 4. Since all bit positions except the LSB have only 4 inputs, using (m, 2)-counters with m > 4 is not efficient. Because either zero or two inputs have to be negated simultaneously, we retain the (4,2)-counters by adding the correction terms as carry inputs to the final merging adder instead of adding two correction terms in the carry-save tree (see Fig. 3).

Figure 3 Solution for subtractions: (i) straightforward realization with a (6,2)-counter; (ii) new realization with a (4,2)-counter.

For the parallel implementation of the radix-4 butterfly element, the subtractions are known in advance, and this can be used to simplify the implementation further. The XOR gates used to obtain negative values in a general adder/subtractor can be replaced with inverters. The resulting butterfly element is shown in Fig. 4. The critical path is reduced from a path consisting of two fast adders to a path consisting of an inverter, a (4,2)-counter, and a fast adder.

Figure 4 Parallel radix-4 butterfly element, built from inverters, (4,2)-counters, and fast adders, with inputs xre(0), xim(0), ..., xre(3), xim(3) and outputs Xre(0), Xim(0), ..., Xre(3), Xim(3).

This technique can be applied to other butterfly architectures as well. For a split-radix (SR) butterfly element [5], the odd-indexed terms can use this technique, which results in a simpler structure (see Fig. 5). It also gives an efficient implementation of the simplified butterfly element [6], as described in [7] (see Fig. 5).

Figure 5 Split-radix and simplified butterfly element: (i) split-radix BE; (ii) simplified BE.

3.3 Efficient implementation for radix-8 butterfly elements

High-radix butterfly elements can be realized according to their signal-flow graph with the same technique as described above. However, the routing cost for high-radix butterfly elements becomes excessive. They are therefore often implemented by cascading lower-radix butterfly elements (with twiddle factor multipliers in between). Hence, we choose the radix-8 butterfly element as the largest building block.

The signal-flow graph for a radix-8 DIF (Decimation In Frequency) butterfly is shown in Fig. 6 [8]. By moving the two complex multipliers forward according to the dashed arrows, the radix-8 butterfly element can be regarded as two radix-4 butterfly elements and four radix-2 butterfly elements with twiddle factor multiplication in between. Using the carry-save radix-4 butterfly element combined with constant multiplication by $\frac{\sqrt{2}}{2}(1 - j)$, we can implement the radix-8 butterfly element more efficiently.

Figure 6 Signal-flow graph for radix-8 DIF butterfly element.

The radix-8 butterfly element can also be implemented with a split-radix structure. The signal-flow graph for the split-radix butterfly element [9] differs slightly from that of Fig. 6. In all, it requires three SR radix-4 butterfly elements and three radix-2 butterfly elements.
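The decomposition of the radix-8 butterfly into four radix-2 butterflies, twiddle factor multiplications, and two radix-4 butterflies can be checked numerically with a short Python sketch. It assumes the forward-DFT sign convention and the textbook DIF ordering (radix-2 stage first); the exact ordering and wiring of Fig. 6 may differ, and all function names are hypothetical. Only two non-trivial multiplications remain, both by the constant $\frac{\sqrt{2}}{2}(1 - j) = e^{-j\pi/4}$.

```python
import cmath
import math

HALF_SQRT2 = math.sqrt(2.0) / 2.0

def times_minus_j(z):
    # Trivial multiplication by -j: swap the parts and invert one of them.
    return complex(z.imag, -z.real)

def times_w8(z):
    # (sqrt(2)/2)(1 - j)(a + jb) = (sqrt(2)/2)((a + b) + j(b - a)).
    return complex(HALF_SQRT2 * (z.real + z.imag), HALF_SQRT2 * (z.imag - z.real))

def radix4(x0, x1, x2, x3):
    """Multiplier-free 4-point DFT (radix-4 butterfly)."""
    a, b = x0 + x2, x0 - x2
    c, d = x1 + x3, times_minus_j(x1 - x3)
    return [a + c, b + d, a - c, b - d]

def radix8_dif(x):
    """8-point DFT as four radix-2 and two radix-4 butterflies."""
    sums = [x[n] + x[n + 4] for n in range(4)]    # four radix-2 butterflies
    diffs = [x[n] - x[n + 4] for n in range(4)]
    # Twiddle factors W8^n for n = 0..3: 1, W8, -j, and W8^3 = -j * W8.
    twiddled = [diffs[0],
                times_w8(diffs[1]),
                times_minus_j(diffs[2]),
                times_minus_j(times_w8(diffs[3]))]
    out = [0j] * 8
    out[0::2] = radix4(*sums)        # even-indexed outputs X(0), X(2), X(4), X(6)
    out[1::2] = radix4(*twiddled)    # odd-indexed outputs X(1), X(3), X(5), X(7)
    return out

def dft8_reference(x):
    """Direct 8-point DFT, used only for comparison."""
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / 8) for n in range(8))
            for k in range(8)]
```

In this model `times_w8` is called exactly twice per butterfly, which corresponds to the two complex multipliers mentioned in the text.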

4. RESULTS

In this section, we present synthesis results for the butterfly elements in AMS 0.8 µm technology to demonstrate their efficiency. Table 1 shows the area cost and delay of the main components of the butterfly element, obtained with the synthesis tool AutoLogic II from Mentor Graphics.

Component         Area cost    Delay @ 3.3 V, 25 °C
VMA (15 bits)     584.24       5.708 ns
VMA (16 bits)     643.80       5.747 ns
VMA (17 bits)     689.50       6.015 ns
(4,2)-counter     20.32        3.441 ns
Inverter          1.64         0.405 ns

Table 1: Performance of key components in butterfly elements.

With the results from Table 1, we can calculate the area and delay of different radix-4 butterfly elements. The area saving is up to 21% for the carry-save radix-4 butterfly element and up to 38% for the SR radix-4 butterfly element with carry-save. The delay is reduced by 22%.

Architecture          Area cost    Delay @ 3.3 V, 25 °C
Conventional          10504.16     12.32 ns
Carry-save            8266.48      9.59 ns
SR with carry-save    6494.80      9.59 ns

Table 2: Comparison of different 15-bit radix-4 butterfly elements.

For the radix-8 butterfly elements, the results excluding the cost of the multipliers are summarized in Table 3. These results can be improved if the carry-save technique is applied wherever possible.

Architecture          Area cost (a)    Delay (a) @ 3.3 V, 25 °C
Conventional          32250.40         18.74 ns
Carry-save            27774.88         16.01 ns
SR with carry-save    27915.84         16.01 ns

(a) Excluding the multipliers.

Table 3: Comparison of different 15-bit radix-8 butterfly elements (excluding the multipliers).

5. CONCLUSION

In this paper, we have presented an efficient method to realize higher-radix butterfly elements with the carry-save technique. The method has advantages in terms of chip area, execution time, and power consumption.
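The savings quoted in Section 4 can be recomputed from Table 2 with a few lines of Python. The sketch below only re-derives percentages from the published numbers; the names `TABLE2`, `COMPONENT_DELAY_NS`, and `saving_percent` are hypothetical. The last line adds the delays of an inverter, a (4,2)-counter, and the 16-bit VMA from Table 1; that this sum models the carry-save critical path is an assumption based on the path described in Section 3.2, not a statement made in the paper.

```python
# (area cost, delay in ns) for the 15-bit radix-4 butterfly elements of Table 2.
TABLE2 = {
    "Conventional": (10504.16, 12.32),
    "Carry-save": (8266.48, 9.59),
    "SR with carry-save": (6494.80, 9.59),
}

# Component delays in ns taken from Table 1.
COMPONENT_DELAY_NS = {"inverter": 0.405, "(4,2)-counter": 3.441, "VMA16": 5.747}

def saving_percent(baseline, value):
    """Relative reduction of `value` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - value) / baseline

base_area, base_delay = TABLE2["Conventional"]
area_saving_cs = saving_percent(base_area, TABLE2["Carry-save"][0])
area_saving_sr = saving_percent(base_area, TABLE2["SR with carry-save"][0])
delay_saving = saving_percent(base_delay, TABLE2["Carry-save"][1])

# Assumed critical path: inverter -> (4,2)-counter -> 16-bit vector-merging adder.
assumed_critical_path_ns = sum(COMPONENT_DELAY_NS.values())
```

Rounded to whole percent, the three savings agree with the 21%, 38%, and 22% stated in the text, and the assumed critical path comes out within a few picoseconds of the 9.59 ns listed in Table 2.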

ACKNOWLEDGEMENT

The authors would like to thank Thomas Johansson and Dr. Kent Palmkvist for fruitful discussions. This project is financed by SSF, the Foundation for Strategic Research in Sweden, under the program of INTELECT.

REFERENCES

[1] L. Wanhammar, DSP Integrated Circuits, Academic Press, 1999.
[2] J. Melander, Design of SIC FFT Architectures, Linköping Studies in Science and Technology, Thesis No. 618, Linköping University, Sweden, 1997.
[3] T. Widhe, Efficient Implementation of FFT Processing Elements, Linköping Studies in Science and Technology, Thesis No. 619, Linköping University, Sweden, 1997.
[4] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation, John Wiley & Sons, 1999.
[5] P. Duhamel and H. Hollmann, Split-Radix FFT Algorithm, Electronics Letters, Vol. 20, No. 1, pp. 14-16, 1984.
[6] G. Bi and E. V. Jones, A Pipelined FFT Processor for Word-Sequential Data, IEEE Trans. on Acoust., Speech, and Signal Process., Vol. ASSP-37, No. 12, pp. 1982-1985, 1989.
[7] W. Li and L. Wanhammar, A Pipeline FFT Processor, accepted for publication at IEEE Workshop on Signal Processing Systems (SiPS), 1999.
[8] T. Widhe, J. Melander, and L. Wanhammar, Design of Efficient Radix-8 Butterfly PEs for VLSI, IEEE Intern. Symp. on Circuits and Systems (ISCAS), Vol. 3, pp. 2084-2087, 1997.
[9] H. V. Sorensen, M. T. Heideman, and C. S. Burrus, On Computing the Split-Radix FFT, IEEE Trans. on Acoust., Speech, and Signal Process., Vol. ASSP-34, No. 1, pp. 152-156, 1986.