Efficient FFT Algorithm and Programming Tricks


Connexions module: m12021
Douglas L. Jones
This work is produced by The Connexions Project and licensed under the Creative Commons Attribution License.

Abstract: Many tricks and techniques have been developed to speed up the computation of FFTs. Significant reductions in computation time result from table lookup of twiddle factors, compiler-friendly or assembly-language programming, special hardware, and FFT algorithms for real-valued data. Higher-radix algorithms, fast bit-reversal, and special butterflies yield more modest but worthwhile savings.

The use of FFT algorithms^1 such as the radix-2 decimation-in-time^2 or decimation-in-frequency^3 methods results in tremendous savings in computations when computing the discrete Fourier transform^4. While most of the speed-up of FFTs comes from this, careful implementation can provide additional savings ranging from a few percent to several-fold increases in program speed.

1 Precompute twiddle factors

The twiddle factors^5, W_N^k = e^(-i 2πk/N), that multiply the intermediate data in the FFT algorithms^6 consist of cosines and sines that each take the equivalent of several multiplies to compute. However, at most N unique twiddle factors can appear in any FFT or DFT algorithm. (For example, in the radix-2 decimation-in-time FFT^7, only the N/2 twiddle factors W_N^k, k = 0, 1, 2, ..., N/2 - 1, are used.) These twiddle factors can be precomputed once, stored in an array in computer memory, and accessed in the FFT algorithm by table lookup. This simple technique yields very substantial savings and is almost always used in practice.

2 Compiler-friendly programming

On most computers, only some of the total computation time of an FFT is spent performing the FFT butterfly computations; determining indices, loading and storing data, computing loop parameters, and other operations consume the majority of cycles.
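The table-lookup trick of Section 1 can be sketched as follows. This is a minimal Python illustration (the module itself gives no code, and the function names here are my own): the N/2 twiddle factors are computed once, and the butterfly loop only reads the table.

```python
import cmath
import math

def make_twiddles(N):
    """Precompute the N/2 twiddle factors W_N^k = exp(-i*2*pi*k/N)."""
    return [cmath.exp(-2j * math.pi * k / N) for k in range(N // 2)]

def fft_dit(x, W=None):
    """Radix-2 decimation-in-time FFT (len(x) must be a power of two).

    The table W is built once at the top level; recursive calls reuse
    every second entry, since W_{N/2}^k = W_N^{2k}. No sines or cosines
    are evaluated inside the butterfly loop -- only table lookups."""
    N = len(x)
    if N == 1:
        return list(x)
    if W is None:
        W = make_twiddles(N)
    even = fft_dit(x[0::2], W[::2])
    odd = fft_dit(x[1::2], W[::2])
    out = [0j] * N
    for k in range(N // 2):
        t = W[k] * odd[k]          # table lookup, no exp() here
        out[k] = even[k] + t
        out[k + N // 2] = even[k] - t
    return out
```

A table of length N/2 suffices for every stage because each half-size subproblem needs only the even-indexed entries of its parent's table.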
Version 1.6: Feb 24, 2007 12:15 pm -0600
License: http://creativecommons.org/licenses/by/1.0

Footnotes:
1 "Overview of Fast Fourier Transform (FFT) Algorithms" <http://cnx.org/content/m12026/latest/>
2 "Decimation-in-time (DIT) Radix-2 FFT" <http://cnx.org/content/m12016/latest/>
3 "Decimation-in-Frequency (DIF) Radix-2 FFT" <http://cnx.org/content/m12018/latest/>
4 "DFT Definition and Properties" <http://cnx.org/content/m12019/latest/>
5 "Decimation-in-time (DIT) Radix-2 FFT" <http://cnx.org/content/m12016/latest/>
6 "Overview of Fast Fourier Transform (FFT) Algorithms" <http://cnx.org/content/m12026/latest/>
7 "Decimation-in-time (DIT) Radix-2 FFT" <http://cnx.org/content/m12016/latest/>

Careful programming that allows the compiler to generate efficient code can make a several-fold improvement in the run-time of an FFT. The best choice of radix in

terms of program speed may depend more on characteristics of the hardware (such as the number of CPU registers) or of the compiler than on the exact number of computations. Very often the manufacturer's library codes are carefully crafted by experts who know intimately both the hardware and the compiler architecture and how to get the most performance out of them, so use of well-written FFT libraries is generally recommended. Certain freely available programs and libraries are also very good. Perhaps the best current general-purpose library is the FFTW^8 package; information can be found at http://www.fftw.org^9. A paper by Frigo and Johnson [2] describes many of the key issues in developing compiler-friendly code.

3 Program in assembly language

While compilers continue to improve, FFT programs written directly in the assembly language of a specific machine are often several times faster than the best compiled code. This is particularly true for DSP microprocessors, which have special instructions for accelerating FFTs that compilers don't use. (I have myself seen differences of up to 26 to 1 in favor of assembly!) Very often, FFTs in the manufacturer's or high-performance third-party libraries are hand-coded in assembly. For DSP microprocessors, the codes developed by Meyer, Schuessler, and Schwarz [4] are perhaps the best ever developed; while the particular processors are now obsolete, the techniques remain equally relevant today. Most DSP processors provide special instructions and a hardware design favoring the radix-2 decimation-in-time algorithm, which is thus generally fastest on these machines.

4 Special hardware

Some processors have special hardware accelerators or co-processors specifically designed to accelerate FFT computations.
For example, AMI Semiconductor's^10 Toccata^11 ultra-low-power DSP microprocessor family, which is widely used in digital hearing aids, has on-chip FFT accelerators; it is always faster and more power-efficient to use such accelerators and whatever radix they prefer. In a surprising number of applications, almost all of the computations are FFTs. A number of special-purpose chips are designed specifically to compute FFTs, and are used in specialized high-performance applications such as radar systems. Other systems, such as OFDM^12-based communications receivers, have special FFT hardware built into the digital receiver circuit. Such hardware can run many times faster, with much less power consumption, than FFT programs on general-purpose processors.

5 Effective memory management

Cache misses or excessive data movement between registers and memory can greatly slow down an FFT computation. Efficient programs such as the FFTW package^13 are carefully designed to minimize these inefficiencies. In-place algorithms^14 reuse the data memory throughout the transform, which can reduce cache misses for longer lengths.

6 Real-valued FFTs

FFTs of real-valued signals require only half as many computations as with complex-valued data. There are several methods for reducing the computation, which are described in more detail in Sorensen et al. [3]:

1. Use DFT symmetry properties^15 to do two real-valued DFTs at once with one FFT program.

Footnotes:
8 http://www.fftw.org
9 http://www.fftw.org
10 http://www.amis.com
11 http://www.amis.com/products/dsp/toccata_plus.html
12 http://en.wikipedia.org/wiki/ofdm
13 http://www.fftw.org
14 "Decimation-in-time (DIT) Radix-2 FFT" <http://cnx.org/content/m12016/latest/>
15 "DFT Definition and Properties" <http://cnx.org/content/m12019/latest/>
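Method 1 above, packing two real sequences into one complex transform and separating the results by conjugate symmetry, can be sketched as follows. This is a hedged Python illustration with my own names; the O(N^2) dft helper stands in for any FFT routine.

```python
import cmath
import math

def dft(x):
    """Direct O(N^2) DFT, standing in for any FFT routine."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def two_real_dfts(a, b):
    """DFTs of two real sequences a and b from ONE complex transform.

    Form z = a + i*b. Because a and b are real,
        A[k] = (Z[k] + conj(Z[N-k])) / 2
        B[k] = (Z[k] - conj(Z[N-k])) / (2i)
    (indices taken mod N), so both spectra are recovered with only
    additions after the single transform."""
    N = len(a)
    Z = dft([ar + 1j * br for ar, br in zip(a, b)])
    A = [(Z[k] + Z[-k % N].conjugate()) / 2 for k in range(N)]
    B = [(Z[k] - Z[-k % N].conjugate()) / 2j for k in range(N)]
    return A, B
```

The separation step works because the DFT of a real sequence is conjugate-symmetric, so the even-symmetric part of Z belongs to a and the odd-symmetric part to b.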

2. Perform one stage of the radix-2 decimation-in-time^16 decomposition and compute the two length-N/2 DFTs using the above approach.
3. Use a direct real-valued FFT algorithm; see H.V. Sorensen et al. [3].

7 Special cases

Occasionally only certain DFT frequencies are needed, the input signal values are mostly zero, the signal is real-valued (as discussed above), or other special conditions exist for which faster algorithms can be developed. Sorensen and Burrus [5] describe slightly faster algorithms for pruned^17 or zero-padded^18 data. Goertzel's algorithm^19 is useful when only a few DFT outputs are needed. The running FFT^20 can be faster when DFTs of highly overlapped blocks of data are needed, as in a spectrogram^21.

8 Higher-radix algorithms

Higher-radix algorithms, such as the radix-4^22, radix-8, or split-radix^23 FFTs, require fewer computations and can produce modest but worthwhile savings. Even the split-radix FFT^24 reduces the multiplications by only 33% and the additions by a much lesser amount relative to the radix-2 FFTs^25; significant improvements in program speed are often due more to implicit loop-unrolling^26 or other compiler benefits than to the computational reduction itself!

9 Fast bit-reversal

Bit-reversing^27 the input or output data can consume several percent of the total run-time of an FFT program. Several fast bit-reversal algorithms have been developed that can reduce this to two percent or less, including the method published by D.M.W. Evans [1].

10 Trade additions for multiplications

When FFTs first became widely used, hardware multipliers were relatively rare on digital computers, and multiplications generally required many more cycles than additions. Methods to reduce multiplications, even at the expense of a substantial increase in additions, were often beneficial.
The prime-factor algorithms^28 and the Winograd Fourier transform algorithms^29, which required fewer multiplies and considerably more additions than the power-of-two-length algorithms^30, were developed during this period. Current processors generally have high-speed pipelined hardware multipliers, so trading multiplies for additions is often no longer beneficial. In particular, most machines now support single-cycle multiply-accumulate (MAC) operations, so balancing the number of multiplies and adds and combining them into single-cycle MACs generally results in the fastest code. Thus, the prime-factor and Winograd FFTs are rarely used today unless the application requires FFTs of a specific length.

Footnotes:
16 "Decimation-in-time (DIT) Radix-2 FFT" <http://cnx.org/content/m12016/latest/>
17 http://www.fftw.org/pruned.html
18 "Spectrum Analysis Using the Discrete Fourier Transform": Section Zero-Padding <http://cnx.org/content/m12032/latest/#zeropad>
19 "Goertzel's Algorithm" <http://cnx.org/content/m12024/latest/>
20 "Running FFT" <http://cnx.org/content/m12029/latest/>
21 "Short Time Fourier Transform" <http://cnx.org/content/m10570/latest/>
22 "Radix-4 FFT Algorithms" <http://cnx.org/content/m12027/latest/>
23 "Split-radix FFT Algorithms" <http://cnx.org/content/m12031/latest/>
24 "Split-radix FFT Algorithms" <http://cnx.org/content/m12031/latest/>
25 "Decimation-in-time (DIT) Radix-2 FFT" <http://cnx.org/content/m12016/latest/>
26 http://en.wikipedia.org/wiki/loop_unrolling
27 "Decimation-in-time (DIT) Radix-2 FFT" <http://cnx.org/content/m12016/latest/>
28 "The Prime Factor Algorithm" <http://cnx.org/content/m12033/latest/>
29 "FFTs of prime length and Rader's conversion" <http://cnx.org/content/m12023/latest/>
30 "Power-of-two FFTs" <http://cnx.org/content/m12059/latest/>
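The best-known such trade is the complex multiply done with 3 real multiplies, whose identity is given just below. A minimal Python sketch (the function name cmul3 is my own) that can be checked against an ordinary complex multiply:

```python
def cmul3(x, y, c, d, e):
    """(c + i*s) * (x + i*y) with 3 real multiplies and 3 real adds.

    Assumes d = c + s and e = c - s were precomputed; in an FFT they
    come from the twiddle factors, so they go in the lookup table."""
    z = c * (x - y)      # multiply 1, add 1
    re = e * y + z       # multiply 2, add 2: equals c*x - s*y
    im = d * x - z       # multiply 3, add 3: equals c*y + s*x
    return re, im
```

Compared with the usual 4-multiply/2-add form, this costs one multiply fewer and one add more, which mattered most when multiplies were much more expensive than adds.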

It is possible to implement a complex multiply with 3 real multiplies and 5 real adds rather than the usual 4 real multiplies and 2 real adds:

(C + iS)(X + iY) = CX - SY + i(CY + SX)

but alternatively, letting

Z = C(X - Y)
D = C + S
E = C - S

we have

CX - SY = EY + Z
CY + SX = DX - Z

In an FFT, D and E come entirely from the twiddle factors, so they can be precomputed and stored in a look-up table. This reduces the cost of the complex twiddle-factor multiply to 3 real multiplies and 3 real adds, or one less and one more, respectively, than the conventional 4/2 computation.

11 Special butterflies

Certain twiddle factors, namely W_N^0 = 1, W_N^(N/2), W_N^(N/4), W_N^(N/8), W_N^(3N/8), etc., can be implemented with no additional operations, or with fewer real operations than a general complex multiply. Programs that specially implement such butterflies in the most efficient manner throughout the algorithm can reduce the computational cost by up to several N multiplies and additions in a length-N FFT.

12 Practical Perspective

When optimizing FFTs for speed, it can be important to maintain perspective on the benefits that can be expected from any given optimization. The following list categorizes the various techniques by potential benefit; these will be somewhat situation- and machine-dependent, but clearly one should begin with the most significant and put the most effort where the pay-off is likely to be largest.

Methods to speed up computation of DFTs:

Tremendous savings:
a. FFT (N / log2 N savings)

Substantial savings:
a. Table lookup of cosine/sine
b. Compiler tricks/good programming
c. Assembly-language programming
d. Special-purpose hardware
e. Real-data FFT for real data (factor of 2)
f. Special cases

Minor savings:
a. radix-4^31, split-radix^32 (-10% to +30%)

Footnotes:
31 "Radix-4 FFT Algorithms" <http://cnx.org/content/m12027/latest/>
32 "Split-radix FFT Algorithms" <http://cnx.org/content/m12031/latest/>

b. special butterflies
c. 3-real-multiplication complex multiply
d. Fast bit-reversal (up to 6%)

note: On general-purpose machines, computation is only part of the total run time. Address generation, indexing, data shuffling, and memory access take up much or most of the cycles.

note: A well-written radix-2^33 program will run much faster than a poorly written split-radix^34 program!

References

[1] D.M.W. Evans. An improved digit-reversal permutation algorithm for the fast Fourier and Hartley transforms. IEEE Transactions on Signal Processing, 35(8):1120-1125, August 1987.

[2] M. Frigo and S.G. Johnson. The design and implementation of FFTW3. Proceedings of the IEEE, 93(2):216-231, February 2005.

[3] H.V. Sorensen, D.L. Jones, M.T. Heideman, and C.S. Burrus. Real-valued fast Fourier transform algorithms. IEEE Transactions on Signal Processing, 35(6):849-863, June 1987.

[4] R. Meyer, H.W. Schuessler, and K. Schwarz. FFT implementation on DSP chips - theory and practice. IEEE International Conference on Acoustics, Speech, and Signal Processing, 1990.

[5] H.V. Sorensen and C.S. Burrus. Efficient computation of the DFT with only a subset of input or output points. IEEE Transactions on Signal Processing, 41(3):1184-1200, March 1993.

Footnotes:
33 "Decimation-in-time (DIT) Radix-2 FFT" <http://cnx.org/content/m12016/latest/>
34 "Split-radix FFT Algorithms" <http://cnx.org/content/m12031/latest/>
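As a closing worked example, Goertzel's algorithm from the special-cases section computes a single DFT bin with only one real multiply per input sample in its main loop. This is a minimal Python sketch under my own naming, not an optimized implementation:

```python
import cmath
import math

def goertzel(x, k):
    """Goertzel's algorithm: the k-th DFT bin of sequence x.

    The recursion s[n] = x[n] + 2*cos(w)*s[n-1] - s[n-2] uses one real
    coefficient; a single complex multiply at the end recovers X[k]."""
    N = len(x)
    w = 2.0 * math.pi * k / N
    coeff = 2.0 * math.cos(w)
    s_prev, s_prev2 = 0.0, 0.0
    for sample in x:
        s = sample + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    # X[k] = e^{iw} * s[N-1] - s[N-2]
    return cmath.exp(1j * w) * s_prev - s_prev2
```

When only a handful of the N bins are needed, running this per bin is cheaper than a full FFT, which is exactly the situation the special-cases section describes.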