PIPELINE IMPLEMENTATIONS OF THE FAST FOURIER TRANSFORM (FFT)
Consulted work: Chiueh, T.D. and P.Y. Tsai, OFDM Baseband Receiver Design for Wireless Communications, John Wiley and Sons Asia, (2007). Second edition of 2012 available as e-book via UT Library.

TOPICS
Discrete Fourier Transform
Fast Fourier Transform
  Decimation in time
  Decimation in frequency
FFT pipelines:
  Radix-2 multi-path delay commutator
  Radix-2 single-path delay feedback
Delay buffer implementation
Radix-4 algorithms

DISCRETE FOURIER TRANSFORM (DFT)
Consider a block of N samples of a (possibly complex-valued) data stream: $x[0], x[1], \ldots, x[N-1]$
The discrete Fourier transform of this block is given by:
$$X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j 2\pi k n / N}, \quad k = 0, 1, \ldots, N-1$$

INVERSE DFT (IDFT)
The inverse DFT is almost the same computation as the DFT:
$$x[n] = \frac{1}{N} \sum_{k=0}^{N-1} X[k]\, e^{j 2\pi k n / N}, \quad n = 0, 1, \ldots, N-1$$
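The two definitions can be checked numerically. The following is a minimal sketch in plain Python, assuming the common convention in which the 1/N scaling sits in the inverse transform; the function names `dft` and `idft` are illustrative and do not come from the slides.

```python
import cmath


def dft(x):
    """Direct DFT: X[k] = sum over n of x[n] * exp(-j*2*pi*k*n/N)."""
    N = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
        for k in range(N)
    ]


def idft(X):
    """Inverse DFT: same sum with the opposite exponent sign and a 1/N scaling."""
    N = len(X)
    return [
        sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)) / N
        for n in range(N)
    ]


if __name__ == "__main__":
    samples = [1, 2, 3, 4]
    spectrum = dft(samples)
    recovered = idft(spectrum)
    print(spectrum)
    print(recovered)
```

Applying `idft` to the result of `dft` should return the original block up to rounding errors.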

TWIDDLE FACTORS
Define: $W_N = e^{-j 2\pi / N}$
Then DFT becomes: $X[k] = \sum_{n=0}^{N-1} x[n]\, W_N^{kn}$
$W_N^{kn}$ is called a twiddle factor (it is a number on the unit circle in the complex plane).

MATRIX REPRESENTATION OF DFT
The DFT can be expressed in matrix form: $\mathbf{X} = \mathbf{W}\,\mathbf{x}$, where the entry of $\mathbf{W}$ in row $k$ and column $n$ is $W_N^{kn}$.
Number of complex multiplications involved (including trivial ones with 1, j, etc.): $N^2$

DFT VISUALIZATION
The DFT basically matches frequencies.
The following picture originates from Wikipedia (entry: "DFT matrix"):
[Figure: visualization of the DFT matrix]

FAST FOURIER TRANSFORM
Reduces the number of calculations in a DFT from $O(N^2)$ to $O(N \log N)$.
First published by Cooley and Tukey in 1965.
See e.g. the Wikipedia page for historical background dating back to Gauss in 1805 (earlier than Fourier!).
Two variants:
  Decimation in time
  Decimation in frequency
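As an illustration of the matrix form, the sketch below builds the N x N matrix of twiddle factors and multiplies it with a sample vector. It assumes the twiddle factor is defined as exp(-j*2*pi/N); the helper names `twiddle`, `dft_matrix` and `mat_vec` are hypothetical.

```python
import cmath


def twiddle(N, exponent):
    """Twiddle factor W_N**exponent, assuming W_N = exp(-j*2*pi/N)."""
    return cmath.exp(-2j * cmath.pi * exponent / N)


def dft_matrix(N):
    """N x N matrix whose entry in row k, column n is W_N**(k*n)."""
    return [[twiddle(N, k * n) for n in range(N)] for k in range(N)]


def mat_vec(W, x):
    """Matrix-vector product: one complex multiplication per matrix entry."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


if __name__ == "__main__":
    W = dft_matrix(4)
    print(mat_vec(W, [1, 2, 3, 4]))
    print("complex multiplications:", len(W) * len(W[0]))
```

For N = 4 every entry of the matrix is one of 1, -j, -1 or j, which is why many of the N^2 multiplications are trivial.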

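DECIMATION-IN-TIME FFT (1)
Split odd and even terms in DFT definition (with twiddle factor $W_N = e^{-j 2\pi / N}$):
$$X[k] = \sum_{m=0}^{N/2-1} x[2m]\, W_N^{2mk} + \sum_{m=0}^{N/2-1} x[2m+1]\, W_N^{(2m+1)k}$$

DECIMATION-IN-TIME FFT (2)
Consider first half of outputs: $k = 0, 1, \ldots, N/2-1$
Rewrite, using $W_N^{2} = W_{N/2}$:
$$X[k] = \sum_{m=0}^{N/2-1} x[2m]\, W_{N/2}^{mk} + W_N^{k} \sum_{m=0}^{N/2-1} x[2m+1]\, W_{N/2}^{mk}$$

DECIMATION-IN-TIME FFT (3)
DFT has been expressed in terms of half-size DFTs:
$$X[k] = E[k] + W_N^{k}\, O[k]$$
$E[k]$: DFT of even samples
$O[k]$: DFT of odd samples

DECIMATION-IN-TIME FFT (4)
Now consider second half of outputs:
$$X[k+N/2] = \sum_{m=0}^{N/2-1} x[2m]\, W_{N/2}^{m(k+N/2)} + W_N^{k+N/2} \sum_{m=0}^{N/2-1} x[2m+1]\, W_{N/2}^{m(k+N/2)}, \quad k = 0, 1, \ldots, N/2-1$$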

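DECIMATION-IN-TIME FFT (5)
Make use of the following identities:
$$W_{N/2}^{m(k+N/2)} = W_{N/2}^{mk}, \qquad W_N^{k+N/2} = -W_N^{k}$$

DECIMATION-IN-TIME FFT (6)
The expression for the second half becomes the same as for the first except for a minus sign:
$$X[k+N/2] = E[k] - W_N^{k}\, O[k], \quad k = 0, 1, \ldots, N/2-1$$
Here $E[k]$ and $O[k]$ are the N/2-point DFTs of the even and odd samples.

DECIMATION-IN-TIME FFT (7)
[Figure: N-point DFT built from two N/2-point DFT blocks]

DECIMATION-IN-TIME BUTTERFLY
Elementary FFT operation with two inputs and two outputs, consisting of one complex multiplication, one complex addition and one complex subtraction:
[Figure: decimation-in-time butterfly]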
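The butterfly and a single decimation-in-time step can be sketched as follows. This is an illustration only: the half-size DFTs are computed with a direct DFT so that one step can be checked in isolation, and the names `dft_direct`, `dit_butterfly` and `dit_step` are hypothetical.

```python
import cmath


def dft_direct(x):
    """Reference DFT by direct evaluation of the definition."""
    N = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
        for k in range(N)
    ]


def dit_butterfly(upper, lower, twiddle):
    """One multiplication, one addition, one subtraction."""
    product = twiddle * lower
    return upper + product, upper - product


def dit_step(x):
    """Combine the DFTs of the even and odd samples into an N-point DFT."""
    N = len(x)
    even_dft = dft_direct(x[0::2])
    odd_dft = dft_direct(x[1::2])
    X = [0j] * N
    for k in range(N // 2):
        w = cmath.exp(-2j * cmath.pi * k / N)
        # First-half output X[k] and second-half output X[k + N/2].
        X[k], X[k + N // 2] = dit_butterfly(even_dft[k], odd_dft[k], w)
    return X


if __name__ == "__main__":
    print(dit_step([1, 2, 3, 4, 5, 6, 7, 8]))
```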

DECIMATION-IN-TIME FFT (8)
One decimation-in-time step defines an N-point DFT in terms of two N/2-point DFTs.
They are combined by means of butterflies.
Applying the principle recursively results in a computation that consists of butterflies only.

8-POINT DECIMATION-IN-TIME FFT
[Figure: 8-point decimation-in-time FFT]

COMPLEXITY REDUCTION
Number of computations has been reduced:
  From $N^2$ complex multiplications for the DFT
  To $(N/2) \log_2 N$ complex multiplications for the FFT

IMPLEMENTATION ISSUES
Buffering of input is needed because of the block-based nature.
One could e.g. use a ping-pong memory:
  While the input is filling one memory, the FFT could consume samples of the other memory.
  After processing one block of samples, the memories change roles.
Note the bit reversal of the addresses:
  Address order can be found from increasing binary addresses read in reverse order, e.g. $4 = 100_2$ becomes $1 = 001_2$.
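The bit-reversed addressing and the butterflies-only structure can be sketched together. The following is one possible in-place formulation, assuming a power-of-two block length; the names `bit_reverse` and `fft_dit` are illustrative and not taken from the slides.

```python
import cmath


def bit_reverse(index, bits):
    """Read the 'bits'-bit binary representation of index in reverse order."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def fft_dit(x):
    """Radix-2 decimation-in-time FFT that consists of butterflies only."""
    N = len(x)
    bits = N.bit_length() - 1
    if N < 1 or (1 << bits) != N:
        raise ValueError("block length must be a power of two")
    # Fetch the inputs in bit-reversed address order.
    a = [x[bit_reverse(i, bits)] for i in range(N)]
    size = 2
    while size <= N:
        half = size // 2
        for start in range(0, N, size):
            for k in range(half):
                w = cmath.exp(-2j * cmath.pi * k / size)
                top = a[start + k]
                bottom = w * a[start + k + half]
                a[start + k] = top + bottom
                a[start + k + half] = top - bottom
        size *= 2
    return a


if __name__ == "__main__":
    print([bit_reverse(i, 3) for i in range(8)])
    print(fft_dit([1, 2, 3, 4]))
```

For an 8-point block the address sequence produced by `bit_reverse` is 0, 4, 2, 6, 1, 5, 3, 7.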

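DECIMATION-IN-FREQUENCY FFT (1)
Split input block in first and second half and consider the outputs with even index:
$$X[2r] = \sum_{n=0}^{N/2-1} x[n]\, W_N^{2rn} + \sum_{n=N/2}^{N-1} x[n]\, W_N^{2rn}, \quad r = 0, 1, \ldots, N/2-1$$

DECIMATION-IN-FREQUENCY FFT (2)
Shift index in second sum:
$$X[2r] = \sum_{n=0}^{N/2-1} x[n]\, W_N^{2rn} + \sum_{n=0}^{N/2-1} x[n+N/2]\, W_N^{2r(n+N/2)}$$

DECIMATION-IN-FREQUENCY FFT (3)
Using simplification rules mentioned earlier ($W_N^{2rn} = W_{N/2}^{rn}$ and $W_N^{rN} = 1$):
$$X[2r] = \sum_{n=0}^{N/2-1} \left( x[n] + x[n+N/2] \right) W_{N/2}^{rn}$$

DECIMATION-IN-FREQUENCY FFT (4)
This is a half-size DFT applied to an input stream consisting of the sums of pairs of the original input stream: $x[n] + x[n+N/2]$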

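DECIMATION-IN-FREQUENCY FFT (5)
Now consider the outputs with odd index:
$$X[2r+1] = \sum_{n=0}^{N/2-1} x[n]\, W_N^{(2r+1)n} + \sum_{n=0}^{N/2-1} x[n+N/2]\, W_N^{(2r+1)(n+N/2)}$$

DECIMATION-IN-FREQUENCY FFT (6)
With the usual type of simplifications (including $W_N^{(2r+1)N/2} = -1$):
$$X[2r+1] = \sum_{n=0}^{N/2-1} \left[ \left( x[n] - x[n+N/2] \right) W_N^{n} \right] W_{N/2}^{rn}$$
This is a half-size DFT applied to a sequence obtained by taking the difference of pairs of the input and multiplying them with the factor $W_N^{n}$.

DECIMATION-IN-FREQUENCY FFT (7)
[Figure: N-point DFT built from two N/2-point DFT blocks, decimation in frequency]

DECIMATION-IN-FREQ. BUTTERFLY
Similar to the decimation-in-time butterfly, but the location of the multiplication is now at the output side.
[Figure: butterflies side by side: decimation in time, decimation in frequency]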
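The even-index and odd-index cases translate directly into a recursive sketch. It assumes a power-of-two block length and the twiddle factor exp(-j*2*pi/N); the name `fft_dif` is illustrative.

```python
import cmath


def fft_dif(x):
    """Recursive radix-2 decimation-in-frequency FFT (sketch)."""
    N = len(x)
    if N == 1:
        return list(x)
    half = N // 2
    # Even-index outputs: half-size DFT of the sums of pairs.
    sums = [x[n] + x[n + half] for n in range(half)]
    # Odd-index outputs: half-size DFT of the differences, multiplied
    # by the twiddle factor AFTER the subtraction.
    diffs = [
        (x[n] - x[n + half]) * cmath.exp(-2j * cmath.pi * n / N)
        for n in range(half)
    ]
    X = [0j] * N
    X[0::2] = fft_dif(sums)
    X[1::2] = fft_dif(diffs)
    return X


if __name__ == "__main__":
    print(fft_dif([1, 2, 3, 4]))
```

Note that the multiplication follows the subtraction here, whereas in the decimation-in-time butterfly it precedes the addition and subtraction.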

8-POINT DECIMATION-IN-FREQ. FFT
[Figure: 8-point decimation-in-frequency FFT]

SUMMARY FFT BASICS
Decimation-in-time (DIT) FFT:
  Divide-and-conquer approach to DFT, based on grouping even and odd inputs in DFT definition: multiply before add/subtract
Decimation-in-frequency (DIF) FFT:
  Divide-and-conquer approach based on grouping even and odd outputs: multiply after add/subtract
Both are of radix-2 type: problem size is reduced by a factor of 2 at each stage

FFT IMPLEMENTATIONS
FFTs can be realized on all kinds of platforms:
  Programmable processors
  Dedicated hardware
Here attention is paid to one type of implementation, viz. the FFT pipelines.
An FFT pipeline transforms a computation with a block-processing nature into one of a streaming nature.
Issues of concern:
  Area
  Speed
  Power
  Memory access

RADIX-2 MULTI-PATH DELAY COMMUTATOR (R2MDC)
Pipeline solution: stream-like processing of block-based algorithm.
Examples based on 8-point FFT: solutions scale for higher powers of 2.
Originally proposed in: Rabiner, L.R. and B. Gold, Theory and Application of Digital Signal Processing, Prentice-Hall, Englewood Cliffs, New Jersey, (1975).

R2MDC FOR 8-POINT FFT
Consists of:
  Commutators: switches that send the data either straight ahead or crisscross
  Butterfly blocks (either for DIT or DIF)
  Delay buffers
[Figure: R2MDC pipeline for 8-point FFT with signal labels A to L]

R2MDC STEP-BY-STEP
Sample indices at the labeled points of the pipeline (a dot marks a clock cycle without valid data):
A/C: 0 1 2 3
B:   4 5 6 7
D:   . . 4 5 6 7
E:   0 1 4 5
F:   . . 0 1 4 5
G:   . . 2 3 6 7
H:   . . 0 1 4 5
I:   . . . 2 3 6 7
J:   . . 0 2 4 6
K:   . . . 1 3 5 7
L:   . . . 0 2 4 6

R2MDC EFFICIENCY
If implemented as presented, the pipeline has a 50% hardware utilization:
  The first butterfly is idle half of the time until it can process new pairs of inputs.
  All hardware operates at input sample frequency.
Situation can be easily improved by duplicating the input buffer and operating it in ping-pong fashion:
  Hardware utilization jumps to 100%.
  All hardware operates at half the sample frequency.
As there are no feedback paths, hardware can be further pipelined to cope with possibly long computational delays.

R2MDC WITH PING-PONG INPUT BUFFER
[Figure: R2MDC pipeline with ping-pong input buffer]
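The reordering performed by the commutators and delay buffers can be modelled functionally. The sketch below is not cycle-accurate and makes several assumptions: DIF butterflies, two parallel paths held as Python lists, and a helper `commutate` that stands in for one delay/switch/delay group. All names are hypothetical.

```python
import cmath


def bit_reverse(index, bits):
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def commutate(upper, lower, block):
    """Regroup both paths so that samples half a block apart form the next pairs."""
    new_upper, new_lower = [], []
    half = block // 2
    for start in range(0, len(upper), block):
        u = upper[start:start + block]
        l = lower[start:start + block]
        new_upper += u[:half] + l[:half]
        new_lower += u[half:] + l[half:]
    return new_upper, new_lower


def r2mdc_fft(x):
    """Functional (not cycle-accurate) model of a radix-2 multi-path pipeline."""
    N = len(x)
    bits = N.bit_length() - 1
    if N < 2 or (1 << bits) != N:
        raise ValueError("block length must be a power of two, at least 2")
    # Input commutator and delay: first half on one path, second half on the other.
    upper, lower = list(x[:N // 2]), list(x[N // 2:])
    block = N // 2
    while True:
        for i in range(N // 2):
            w = cmath.exp(-2j * cmath.pi * (i % block) / (2 * block))
            a, b = upper[i], lower[i]
            upper[i], lower[i] = a + b, (a - b) * w  # DIF butterfly
        if block == 1:
            break
        upper, lower = commutate(upper, lower, block)
        block //= 2
    # The two paths deliver the outputs in bit-reversed order.
    X = [0j] * N
    for i in range(N // 2):
        X[bit_reverse(2 * i, bits)] = upper[i]
        X[bit_reverse(2 * i + 1, bits)] = lower[i]
    return X


if __name__ == "__main__":
    print(commutate([0, 1, 2, 3], [4, 5, 6, 7], 4))
    print(r2mdc_fft([1, 2, 3, 4, 5, 6, 7, 8]))
```

Applied to index lists, `commutate` reproduces the orderings 0 1 4 5 / 2 3 6 7 and 0 2 4 6 / 1 3 5 7 that appear in the step-by-step trace.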

RADIX-2 SINGLE-PATH DELAY FEEDBACK (R2SDF)
Alternative pipeline solution, with optimized memory size.
Originally proposed in: Groginsky, H.L. and G.A. Works, A Pipeline Fast Fourier Transform, IEEE Transactions on Computers, Vol.C-19(11), pp. 1015-1019, (November 1970).

R2SDF ELEMENT
Element:
  Either shifts first half of input into delay buffer and second half of output out of delay buffer (blue);
  Or shifts second half of input into butterfly together with first half from delay buffer, first half of output to next stage and second half of output into delay buffer (red).
[Figure: R2SDF element with delay line]

SYMBOLIC REPRESENTATION OF R2SDF ELEMENT
[Figure: symbolic representation of R2SDF element with delay line]

R2SDF FOR 8-POINT FFT
[Figure: R2SDF pipeline for 8-point FFT; each stage has a single forward path and a feedback path through a delay line]
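The two modes of the element can be illustrated with a small cycle-by-cycle simulation. This is a sketch under assumptions that are not stated on the slides: DIF butterflies, the twiddle multiplication applied when a stored difference leaves the delay line, and zeros fed in to flush the pipeline. The names `R2SDFStage` and `r2sdf_fft` are hypothetical.

```python
import cmath
from collections import deque


def bit_reverse(index, bits):
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


class R2SDFStage:
    """One single-path delay-feedback element with a delay line of 'delay' samples."""

    def __init__(self, delay):
        self.delay = delay
        self.buffer = deque([0j] * delay)  # feedback delay line
        self.count = 0                     # position inside a block of 2*delay samples

    def step(self, sample):
        D = self.delay
        oldest = self.buffer.popleft()
        if self.count < D:
            # First half of the input goes into the delay line; the stored
            # differences of the previous block leave it (times the twiddle).
            out = oldest * cmath.exp(-2j * cmath.pi * self.count / (2 * D))
            self.buffer.append(sample)
        else:
            # Second half meets the delayed first half in the butterfly:
            # the sum goes to the next stage, the difference is fed back.
            out = oldest + sample
            self.buffer.append(oldest - sample)
        self.count = (self.count + 1) % (2 * D)
        return out


def r2sdf_fft(x):
    """Stream one block through the pipeline, one sample per clock cycle."""
    N = len(x)
    bits = N.bit_length() - 1
    if N < 2 or (1 << bits) != N:
        raise ValueError("block length must be a power of two, at least 2")
    stages = [R2SDFStage(N >> (s + 1)) for s in range(bits)]
    latency = N - 1  # sum of all delay-line lengths
    outputs = []
    for cycle in range(N + latency):
        value = x[cycle] if cycle < N else 0j  # zeros flush the pipeline
        for stage in stages:
            value = stage.step(value)
        outputs.append(value)
    X = [0j] * N
    for i, value in enumerate(outputs[latency:]):
        X[bit_reverse(i, bits)] = value  # outputs arrive in bit-reversed order
    return X


if __name__ == "__main__":
    print(r2sdf_fft([1, 2, 3, 4]))
```

Each butterfly produces useful sums during only half of the cycles, which is one way to see the 50% utilization of this structure.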

R2SDF STEP-BY-STEP
[Figure: R2SDF pipeline with signal labels A to H]
Sample indices at the labeled points of the pipeline (a dot marks a clock cycle without valid data):
A: . . . . 0 1 2 3 4 5 6 7
B: 0 1 2 3 4 5 6 7
C: 0 1 2 3 4 5 6 7
D: . . . . 0 1 2 3 4 5 6 7
E: . . . . . . 0 1 2 3 4 5 6 7
F: . . . . 0 1 2 3 4 5 6 7
G: . . . . 0 1 2 3 4 5 6 7
H: . . . . . . 0 1 2 3 4 5 6 7

R2SDF EFFICIENCY
It needs to operate at the same speed as the input sample rate.
Hardware utilization is 50%.
Number of delay elements for N-point FFT:
  R2SDF: $N - 1$
  R2MDC: $3N/2 - 2$
R2SDF and R2MDC have the same number of (complex) adders and multipliers.

DELAY-BUFFER IMPLEMENTATION
The straightforward implementation of a delay buffer in hardware would be a shift register.
Such an implementation is not convenient (think of e.g. a 1024-point FFT):
  Memory elements implemented by D-flipflops are large!
  Shifting all data at every clock cycle consumes a lot of energy!
Much better idea to keep the data in RAM and shift the address pointer(s) instead.

DUAL-PORT RAM DELAY BUFFER
Dual-port RAM:
  One read port with its own data and address
  One write port with its own data and address
Cyclic buffer: L+1 locations to store L items.
[Figure: dual-port RAM with write address, read address, write data and read data; the addresses are incremented by 1 every cycle]
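The cyclic buffer can be modelled in a few lines. The sketch below assumes that one read and one write happen in every clock cycle and that the write address starts L locations ahead of the read address; the class name `DualPortDelayBuffer` and its fields are hypothetical.

```python
class DualPortDelayBuffer:
    """Delay of 'length' cycles using length + 1 RAM locations and two moving addresses."""

    def __init__(self, length, fill=0):
        self.size = length + 1
        self.ram = [fill] * self.size
        self.read_addr = 0
        self.write_addr = length  # L locations ahead of the read address

    def step(self, sample):
        """One clock cycle: read via one port, write via the other, bump both addresses."""
        out = self.ram[self.read_addr]
        self.ram[self.write_addr] = sample
        self.read_addr = (self.read_addr + 1) % self.size
        self.write_addr = (self.write_addr + 1) % self.size
        return out


if __name__ == "__main__":
    delay = DualPortDelayBuffer(3)
    print([delay.step(v) for v in [1, 2, 3, 4, 5, 6]])
```

Only the two addresses change per cycle; the stored samples stay in place, which is the point of replacing the shift register.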

DELAY BUFFER WITH TWO SINGLE-PORT RAMS
Idea is to alternately read from and write to the two RAMs.
Two single-port RAMs are cheaper than one dual-port RAM.
Use LSB of address to connect to write enable (wen).
[Figure: delay buffer built from RAM1 and RAM2, each with its own address and wen input, between input and output]

SPECIAL-PURPOSE RAMS
Replace address decoder of RAM by single-bit shift register.
Only one D-flipflop has output 1.
Not much switching activity in shift register (no waste of power).

RADIX-4 FFT
Has both decimation-in-time and decimation-in-frequency variants.
Idea is to express DFT as the combination of 4 DFTs whose sizes are one fourth of the original DFT.

RADIX-4 DIF (1)
Takes advantage of the following symmetries:
$$W_N^{N/4} = -j, \qquad W_N^{N/2} = -1, \qquad W_N^{3N/4} = j$$
This leads to a reduction of the number of multipliers (multiplication by j can be performed without multiplier).
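The idea of combining four quarter-size DFTs can be sketched recursively for the decimation-in-frequency variant. The code assumes a block length that is a power of 4 and the twiddle factor exp(-j*2*pi/N); the name `fft_radix4_dif` is illustrative. The multiplications by 1j inside the butterfly are written as ordinary products for readability, although hardware would realize them by swapping real and imaginary parts.

```python
import cmath


def fft_radix4_dif(x):
    """Recursive radix-4 decimation-in-frequency FFT (block length a power of 4)."""
    N = len(x)
    if N == 1:
        return list(x)
    if N % 4 != 0:
        raise ValueError("block length must be a power of four")
    quarter = N // 4
    branches = [[], [], [], []]
    for n in range(quarter):
        a, b, c, d = x[n], x[n + quarter], x[n + 2 * quarter], x[n + 3 * quarter]
        # Radix-4 butterfly: only scaling with +1, -1, +j, -j.
        t = (
            a + b + c + d,
            a - 1j * b - c + 1j * d,
            a - b + c - d,
            a + 1j * b - c - 1j * d,
        )
        for r in range(4):
            # Twiddle multiplication at the output side; trivial for r == 0.
            branches[r].append(t[r] * cmath.exp(-2j * cmath.pi * r * n / N))
    X = [0j] * N
    for r in range(4):
        X[r::4] = fft_radix4_dif(branches[r])  # outputs with index 4m + r
    return X


if __name__ == "__main__":
    print(fft_radix4_dif([1, 2, 3, 4]))
```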

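RADIX-4 DIF (2)
For outputs with index $4m + r$ ($r = 0, 1, 2, 3$):
$$X[4m+r] = \sum_{n=0}^{N/4-1} \left[ x[n] + (-j)^{r}\, x[n+N/4] + (-1)^{r}\, x[n+N/2] + j^{r}\, x[n+3N/4] \right] W_N^{rn}\, W_{N/4}^{mn}$$

RADIX-4 DIF BUTTERFLY
Scaling with +1, -1, +j, -j
[Figure: radix-4 DIF butterfly]

MULTIPLIER & ADDER COMPLEXITY
Radix-2 decomposition:
  Number of stages: $\log_2 N$
  Complex multiplications per stage: $N/2$; in total: $(N/2) \log_2 N$
  Complex additions per stage: $N$; in total: $N \log_2 N$
Radix-4 decomposition:
  Number of stages: $\log_4 N$
  Complex multiplications per stage: $3N/4$; in total: $(3N/4) \log_4 N$
  Complex additions per stage: $3N$; in total: $3N \log_4 N$

RADIX-2^2 BUTTERFLY
[Figure: radix-2^2 butterfly]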
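The operation counts can be tabulated with a few lines of code. The sketch below rests on stated assumptions: trivial multiplications are counted, a radix-2 butterfly costs 1 multiplication and 2 additions, a directly computed radix-4 butterfly costs 3 multiplications and 12 additions, and a radix-2^2 butterfly costs 3 multiplications and 8 additions. The function names are hypothetical.

```python
def _log2_exact(N):
    """Exponent of a power-of-two block length."""
    bits = N.bit_length() - 1
    if N < 2 or (1 << bits) != N:
        raise ValueError("block length must be a power of two")
    return bits


def radix2_counts(N):
    stages = _log2_exact(N)
    butterflies = N // 2  # per stage
    return {"stages": stages,
            "multiplications": butterflies * 1 * stages,
            "additions": butterflies * 2 * stages}


def radix4_counts(N, additions_per_butterfly=12):
    bits = _log2_exact(N)
    if bits % 2 != 0:
        raise ValueError("block length must be a power of four")
    stages = bits // 2
    butterflies = N // 4  # per stage
    return {"stages": stages,
            "multiplications": butterflies * 3 * stages,
            "additions": butterflies * additions_per_butterfly * stages}


def radix22_counts(N):
    # Same stage and multiplier structure as radix-4, but 8 additions per butterfly.
    return radix4_counts(N, additions_per_butterfly=8)


if __name__ == "__main__":
    for N in (16, 64, 1024):
        print(N, radix2_counts(N), radix4_counts(N), radix22_counts(N))
```

Under these assumptions, radix-4 needs fewer multiplications than radix-2, while the radix-2^2 butterfly brings the number of additions back to the radix-2 level.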

RADIX-2^2 BUTTERFLY EVALUATION
Radix-2^2 has the same number of multipliers as radix-4.
It has fewer additions: 8 instead of 12 complex additions per butterfly.
In a similar way, a radix-8 butterfly can be decomposed into a radix-2^3 butterfly.

RADIX-4 MULTI-PATH DELAY COMMUTATOR (R4MDC) PIPELINE FOR 16-POINT FFT
[Figure: R4MDC pipeline for 16-point FFT]

RADIX-4 SINGLE-PATH DELAY FEEDBACK (R4SDF) PIPELINE FOR 16-POINT FFT
[Figure: R4SDF pipeline for 16-point FFT]

EVALUATION RADIX-4 PIPELINES
The multi-path structure:
  Has a high memory overhead;
  Has 100% multiplier utilization;
  Can operate at one quarter of sample frequency.
The single-path structure:
  Has minimal memory;
  Has a 75% multiplier utilization (multiplier drawn outside butterfly!);
  Operates at sample frequency.
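The relation between the radix-4 and radix-2^2 butterflies discussed above can be checked numerically. The sketch below assumes a twiddle-free 4-point butterfly and realizes the multiplication by -j as a swap of real and imaginary parts; the function names are hypothetical.

```python
def times_minus_j(z):
    """Multiply by -j without a multiplier: (x + jy) * (-j) = y - jx."""
    return complex(z.imag, -z.real)


def radix4_butterfly_direct(a, b, c, d):
    """Each output is a sum of four terms: 3 additions per output, 12 in total."""
    return (
        a + b + c + d,
        a - 1j * b - c + 1j * d,
        a - b + c - d,
        a + 1j * b - c - 1j * d,
    )


def radix22_butterfly(a, b, c, d):
    """Two levels of radix-2 butterflies: 4 butterflies, 8 additions in total."""
    s0, d0 = a + c, a - c                  # first level, butterfly 1
    s1, d1 = b + d, times_minus_j(b - d)   # first level, butterfly 2
    return (s0 + s1, d0 + d1, s0 - s1, d0 - d1)  # second level


if __name__ == "__main__":
    inputs = (1 + 2j, 3 - 1j, -2 + 0.5j, 0.25 + 4j)
    print(radix4_butterfly_direct(*inputs))
    print(radix22_butterfly(*inputs))
```

Both functions return the same four values; only the number of additions needed to obtain them differs.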