ENERGY EFFICIENT PARAMETERIZED FFT ARCHITECTURE. Ren Chen, Hoang Le, and Viktor K. Prasanna


ENERGY EFFICIENT PARAMETERIZED FFT ARCHITECTURE

Ren Chen, Hoang Le, and Viktor K. Prasanna
Ming Hsieh Department of Electrical Engineering, University of Southern California, Los Angeles, USA 90089
Email: {renchen, hoangle, prasanna}@usc.edu

ABSTRACT

Recently, there has been growing interest within the research community in improving energy efficiency. In this paper, we revisit the classic Fast Fourier Transform (FFT) for energy efficient designs on FPGAs. A parameterized FFT architecture is proposed to identify design trade-offs in achieving energy efficiency. We first perform design space exploration by varying the algorithm mapping parameters, such as the degree of vertical and horizontal parallelism, that characterize decomposition based FFT algorithms. After empirical selection of the values of the algorithm mapping parameters, an energy-performance-area trade-off design for energy efficiency is identified by varying the architecture parameters, including the type of memory elements, the type of interconnection network, and the number of pipeline stages. The trade-offs between energy, area, and time are analyzed using two performance metrics: the Energy Area Time (EAT) composite metric and the energy efficiency (defined as the number of operations per Joule). From the experimental results, a design space is generated to demonstrate the effect of these parameters on the various performance metrics. For N-point FFT ($16 \le N \le 1024$), our designs achieve up to 28% and 38% improvement in energy efficiency and EAT, respectively, compared with a state-of-the-art design.

1. INTRODUCTION

FPGA is a promising implementation technology for computationally intensive applications such as signal, image, and network processing tasks [1, 2]. State-of-the-art FPGAs offer high operating frequency, unprecedented logic density, and a host of other features.
As FPGAs are programmed specifically for the problem to be solved, they can achieve higher performance with lower power consumption than general purpose processors. The Fast Fourier Transform (FFT) is one of the most frequently used kernels for computing the Discrete Fourier Transform (DFT) in a wide variety of image and signal processing applications. Various derivative FFT algorithms have been proposed and developed. The Radix-x Cooley-Tukey algorithm is one of the most popular algorithms for hardware implementation [3, 4, 5, 6]. Most hardware solutions for Radix-x FFT fall into the following categories: delay feedback or delay commutator architectures [4], such as the Radix-$2^2$ single-path delay feedback FFT [4], the Radix-4 single-path delay commutator FFT [5], etc. By focusing on circuit level optimizations, these solutions achieved improvement in throughput, area, or power.

Power is a key metric in computing today. To obtain an energy efficient design for FFT, we analyze the trade-offs between energy, area, and time for fixed-point FFT on a parameterized architecture, using the Cooley-Tukey algorithm. Energy efficiency can be obtained both at the algorithm mapping level and at the architecture level [7, 8]. Optimizing at these two levels allows power to be effectively traded off against other performance parameters. For example, a design consuming 2x the power but achieving 3x the system throughput performs 1.5x as many operations per Joule, i.e., it is 50% more energy efficient than the original design. We present the design space for the chosen architecture with respect to energy efficiency at the algorithm mapping level. An energy-performance-area trade-off design is achieved at the architecture level by empirical selection of the proposed architecture parameters.

In this paper, we make the following contributions:

1. A parameterized architecture of the Radix-4 Cooley-Tukey algorithm for FFT (Section 3.1).
2. A design space that demonstrates the effect of the parameters on the EAT and the energy efficiency metric (Section 4.3.2).
3. A demonstration of improved energy efficiency of the proposed trade-off design, obtained by identifying the energy hot spots and varying the proposed architecture parameters (Section 4.3.2).
4. Optimized designs achieving significant improvement in energy efficiency compared with a state-of-the-art design (Section 4.4).

The rest of the paper is organized as follows. Section 2 covers the background and related work. Section 3 describes the proposed parameterized architecture and its implementation on FPGA. Section 4 presents experimental results and analysis. Section 5 concludes the paper.

This work has been funded by DARPA under grant number HR0011-12-2-0023.

2. BACKGROUND AND RELATED WORK

2.1. Background

Given N complex numbers $x_0, \ldots, x_{N-1}$, the DFT is computed as:

$$X_k = \sum_{n=0}^{N-1} x_n e^{-i 2\pi k n / N}, \quad k = 0, \ldots, N-1.$$

The Radix-x Cooley-Tukey FFT is a well-known decomposition based algorithm for the N-point DFT. Radix-4 FFT is employed in this paper and is described in Algorithm 1. In terms of the number of real operations, the computational complexity of the N-point Radix-4 FFT is $O(N \log_4 N)$. The algorithm performs an N-point FFT in $N/m$ ($m < N$) cycles using $m$ Input/Output ports (I/Os) and $\log_4 N$ radix blocks, which are used for butterfly computations. The algorithm iteratively decomposes the entire problem into four subproblems. This feature enables us to map the algorithm by folding the FFT architecture vertically or horizontally, thus providing much freedom to implement various designs on FPGAs. We propose our parameterized architecture in Section 3.2 based on this characteristic.

Algorithm 1 Radix-4 FFT Algorithm

    q = N/4; d = N/4;
    for p := 0 to log_4 N - 1 do
        for k := 0 to 4^p - 1 do
            l = 4kq/4^p; r = l + q/4^p - 1;
            tw_1 = w[k]; tw_2 = w[2k]; tw_3 = w[3k];
            for i := l to r do
                t_0 = i; t_1 = i + d/4^p; t_2 = i + 2d/4^p; t_3 = i + 3d/4^p;
                do parallel
                    f_{p+1}[t_0] = f_p[t_0] + f_p[t_1] + f_p[t_2] + f_p[t_3];
                    f_{p+1}[t_1] = f_p[t_0] - j f_p[t_1] - f_p[t_2] + j f_p[t_3];
                    f_{p+1}[t_2] = f_p[t_0] - f_p[t_1] + f_p[t_2] - f_p[t_3];
                    f_{p+1}[t_3] = f_p[t_0] + j f_p[t_1] - f_p[t_2] - j f_p[t_3];
                end parallel
                do parallel
                    f_{p+1}[t_0] = f_{p+1}[t_0];
                    f_{p+1}[t_1] = tw_1 f_{p+1}[t_1];
                    f_{p+1}[t_2] = tw_2 f_{p+1}[t_2];
                    f_{p+1}[t_3] = tw_3 f_{p+1}[t_3];
                end parallel
            end for
        end for
    end for

2.2. Related Work

To the best of our knowledge, there has been no previous work targeted at exploring the design space for energy efficiency of FFT at both the algorithm mapping level and the architecture level on FPGAs.
Existing work has mainly focused on optimizing the performance, power, and area of the design at the circuit level. In [9], the authors designed an energy-efficient 1024-point FFT processor. A cache-based FFT algorithm was proposed for achieving low power and high performance, and an energy-time performance metric was evaluated at different processor operating points. In [10], a high-speed and low-power FFT architecture was presented: a delay balanced pipeline architecture based on the split-radix algorithm. Algorithm trade-offs for reducing computation complexity were explored, and the architecture was evaluated in terms of area, power, and timing performance.

Based on Radix-x FFT, various pipeline FFT architectures have been proposed, such as the Radix-2 single-path delay feedback FFT [3], the Radix-4 single-path delay commutator FFT [5], the Radix-2 multi-path delay commutator FFT [6], and the Radix-$2^2$ single-path delay feedback FFT [4]. These architectures can achieve high throughput per unit area with single-path or multi-path pipelines, but energy efficiency has not been explored or evaluated in these works.

In [11], a mathematical model for generating DFT soft cores was developed. This model can automatically produce an optimized design from user inputs on performance and resource constraints. The resource usage was estimated from the available parameters. However, power and performance estimation were not presented in that work. In [7], a parameterized FFT architecture for energy efficiency was presented. The optimized design was achieved by varying the chosen architecture parameters. Some energy efficient design techniques, such as clock gating and memory binding, were also employed in that work. Beyond FPGAs, techniques for energy efficient FFT have also been presented for other platforms [12, 13]. However, it is not clear how to apply these techniques on FPGAs.
In this work, we extend the work of [7] with design space exploration for energy efficiency at different levels. The design space exploration is performed on current state-of-the-art FPGAs. By exploring energy efficiency at two levels, we obtain an energy-performance-area trade-off design for FFT.

3. ARCHITECTURE AND IMPLEMENTATIONS

3.1. Architecture building blocks

The proposed N-point FFT architecture is based on the Radix-4 Cooley-Tukey FFT algorithm. Note that the choice of the radix affects the energy efficiency of the design. Compared with the Radix-2 algorithm, Radix-4 uses fewer multiply operations. The basic architecture consists of five building blocks (see Fig. 1): Radix-4 block (R4), data buffer, data path permutation (PER), parallel-to-serial/serial-to-parallel (PS/SP) multiplexer, and twiddle factor computation. A complete design for a given N-point FFT can be obtained from combinations of these basic blocks.

A. Radix-4 block

In this module, 16 signed adder/subtractors are used to complete the butterfly computations. It takes four inputs and generates four outputs in parallel. Each input contains a real and an imaginary component. The data outputs of R4 are used by the twiddle factor computation block except in the last stage (see Fig. 1a).
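As a software reference for the radix-4 butterfly and the twiddle multiplication that follows it, the sketch below implements a recursive decimation-in-frequency radix-4 FFT and compares it with a direct DFT. It is a minimal illustration, not a transcription of Algorithm 1 or of the hardware: the function names (`radix4_butterfly`, `fft_radix4`, `dft`) are hypothetical, floating-point complex numbers stand in for the 16-bit fixed-point data, and the recursion replaces the iterative stage loop.

```python
import cmath


def radix4_butterfly(a, b, c, d):
    """Four-input, four-output radix-4 butterfly (additions/subtractions only)."""
    j = 1j
    return (a + b + c + d,
            a - j * b - c + j * d,
            a - b + c - d,
            a + j * b - c - j * d)


def fft_radix4(x):
    """Recursive radix-4 decimation-in-frequency FFT; len(x) must be a power of 4."""
    n = len(x)
    if n == 1:
        return list(x)
    q = n // 4
    # sub[r][i]: butterfly output r at index i, multiplied by the twiddle w^(r*i).
    sub = [[0j] * q for _ in range(4)]
    for i in range(q):
        y = radix4_butterfly(x[i], x[i + q], x[i + 2 * q], x[i + 3 * q])
        for r in range(4):
            sub[r][i] = y[r] * cmath.exp(-2j * cmath.pi * r * i / n)
    out = [0j] * n
    for r in range(4):
        # Output bins 4m + r come from the length-n/4 FFT of sub-sequence r.
        out[r::4] = fft_radix4(sub[r])
    return out


def dft(x):
    """Direct O(N^2) DFT used only as a reference."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


if __name__ == "__main__":
    data = [complex(k % 5, -(k % 3)) for k in range(16)]
    err = max(abs(g - r) for g, r in zip(fft_radix4(data), dft(data)))
    print("max abs error vs. direct DFT:", err)
```

Each call to `radix4_butterfly` uses only additions and subtractions, since multiplying by j merely swaps and negates real and imaginary parts; this is why the hardware block needs adder/subtractors but no multipliers.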

Fig. 1: (a) Radix block, (b) data buffer, (c) data path permutation (PER), (d) parallel-to-serial/serial-to-parallel MUX (PS/SP), (e) twiddle factor computation.

Fig. 2: Data permutation in the data buffers for 16-point FFT.

B. Data buffer

The data buffer consists of a dual-port RAM having $N/m$ entries, where $m$ equals the number of I/Os. Data is written into one port and read from the other port simultaneously. The data buffers are shown in Fig. 2 for N = 16. In four cycles, 16 permuted data inputs are fed into the data buffers. In each cycle, four data outputs are read in parallel from alternating entries. For different architectural parameters, the read and write addresses are generated with different strides. For example, in Fig. 2, four data inputs ($X_0$, $X_4$, $X_8$, $X_{12}$) are written in cycle 0, cycle 1, cycle 2, and cycle 3, respectively. They are then output simultaneously in cycle 4.

C. Data permutation block

Parallel input data must be permuted before being processed by the subsequent modules. Fig. 2 shows the data permutation for 16-point FFT. In the first cycle, four data inputs ($X_0$, $X_1$, $X_2$, $X_3$) are fed into the first entry of each data buffer without permutation. In the second cycle, another four data inputs are written into the second entry of each data buffer, shifted by one location. After four cycles, the data to be output in parallel ($X_i$, $X_{i+4}$, $X_{i+8}$, $X_{(i+12) \bmod 16}$, $i = 0, 1, 2, 3$) are stored in different RAMs. These permutations are repeated every four cycles.

D. PS/SP module

This module multiplexes serial input data to parallel outputs, or parallel input data to a serial output. As shown in Fig. 3a, when the number of I/Os is limited to one, the radix-4 block still operates on four data inputs in parallel, so the PS/SP module is employed to match the data rate both before and after the radix-4 block.

E. Twiddle factor computation

This module consists of two blocks: the twiddle factor generation block and the complex number multiplier block. The twiddle factor generation block includes several lookup tables storing the twiddle factor coefficients; the read addresses are updated by the control signals. The size of the lookup tables increases with the problem size. The complex number multiplier block consists of three multipliers and three adder/subtractors.

3.2. Parameterized FFT Architecture

3.2.1. Algorithm Mapping Parameters

The decomposition based Radix-4 FFT offers much flexibility in mapping to various architectures. By folding the FFT architecture horizontally or vertically, the radix-4 blocks can be reused iteratively, connected in a pipeline, or replicated to process input data in parallel. Hence we use two algorithm mapping parameters that characterize the decomposition-based N-point FFT algorithm in our design:

1. Horizontal Parallelism ($H_p$): determines the number of radix blocks used in one pipeline ($1 \le H_p \le \log_4 N$).
2. Vertical Parallelism ($V_p$): determines the number of inputs being computed in parallel ($1 \le V_p \le N$). $V_p$ varies with the number of data channels per pipeline ($N_c$) and the number of parallel pipelines ($N_p$), with $V_p = N_c \times N_p$.

These two parameters are chosen to create a design space. Two different architectures are presented in Fig. 3. In Fig. 3a, $V_p = N_c = N_p = 1$, $H_p = 1$, and N = 16: one radix-4 block is employed and reused iteratively for the two stages, and one input is processed per cycle. This architecture achieves higher resource efficiency and consumes less I/O power, at the expense of throughput. In Fig. 3b, $V_p = 4$, $H_p = 2$, and N = 16: two radix-4 blocks are utilized. There is only one pipeline, with $N_c = 4$ and $N_p = 1$. All the stages are fully pipelined, and four inputs can be processed in parallel per cycle. Note that there is no feedback path. This architecture achieves high throughput by using more basic blocks and I/Os, at the cost of higher power consumption. We can also increase $V_p$ by replicating the basic pipeline. This replication allows several pipelines to work in parallel and significantly increases the throughput, at the cost of more complex interconnections.

Fig. 3: Parameterized architectures for 16-point FFT: (a) $H_p = 1$, $V_p = 1$; (b) $H_p = 2$, $V_p = 4$.

3.2.2. Architecture Parameters

Three architecture parameters that significantly affect energy efficiency are employed in our design and applied to different components:

1. Type of memory element: BRAM or distributed RAM (dist. RAM) can be used as memories. In our design, both data buffers and twiddle factor lookup tables can be implemented using different memory elements.
2. Type of interconnection: three different types of interconnection (see Fig. 4) are used to implement the data permutation blocks: crossbar network, complete binary tree, and dynamic network.
3. Pipeline depth: both adder/subtractors and DSP slices in the FPGA can be deeply pipelined by inserting registers, so we parameterize the arithmetic units and multipliers by pipeline depth to balance performance and resource usage.

Fig. 4: (a) Crossbar network, (b) complete binary tree, (c) dynamic network.

According to [14], when used for large memories, BRAM consumes less power than dist. RAM. This characteristic can be exploited to make a trade-off between power and performance for various problem sizes.
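The data buffer and permutation scheme of Section 3.1 (Fig. 2) can be illustrated with a small software model. The sketch below assumes that the inputs of write cycle c are rotated by c positions before being stored in entry c of each RAM; this rule is inferred from the description of the 16-point case (first cycle unpermuted, second cycle shifted by one location), and the function names `write_buffers` and `read_buffers` are hypothetical. It only covers the square case where the number of RAMs equals the number of entries per RAM, as for N = 16 with four I/Os.

```python
def write_buffers(x, m=4):
    """Write len(x) inputs into m RAMs, m inputs per cycle.

    Assumed rule: the j-th parallel input of write cycle c goes to
    entry c of RAM (j + c) mod m, i.e. cycle c is rotated by c positions.
    """
    depth = len(x) // m
    assert depth == m, "sketch covers only the square case, e.g. N = 16, m = 4"
    rams = [[None] * depth for _ in range(m)]
    for c in range(depth):          # one write cycle per RAM entry
        for j in range(m):          # m parallel inputs per cycle
            rams[(j + c) % m][c] = x[m * c + j]
    return rams


def read_buffers(rams):
    """Read one word from every RAM per cycle, each from a different entry.

    Read cycle i returns [X_i, X_(i+m), X_(i+2m), ...]; word X_(m*c + i)
    sits in entry c of RAM (i + c) mod m, so no RAM is accessed twice.
    """
    m = len(rams)
    depth = len(rams[0])
    return [[rams[(i + c) % m][c] for c in range(depth)] for i in range(m)]


if __name__ == "__main__":
    rams = write_buffers(list(range(16)))   # value k stands for input X_k
    for r, ram in enumerate(rams):
        print("RAM", r, "entries:", ram)
    for i, words in enumerate(read_buffers(rams)):
        print("read cycle", i, "->", words)
```

In this model every read cycle touches each RAM exactly once, which is what allows four operands with stride 4 to be delivered to the radix-4 block in parallel from single-read-port memories.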
As there are $2^m (H_p + 1)$ permutation modules ($m = 1$ when $V_p = 4$, otherwise $m = 0$), the choice of interconnection network can significantly affect the energy efficiency of the designs. The physical layout of the complete binary tree is similar to that of the crossbar network, but more pipeline registers can be inserted between the layers of the tree. The dynamic network can be implemented using shift registers. Among the three, the dynamic network gives high performance at the cost of more power consumption; the crossbar network consumes the least resources and power but introduces long wire delays; the complete binary tree relieves the routing burden and improves performance at the expense of more area.

4. EXPERIMENTAL RESULTS AND ANALYSIS

4.1. Experimental Setup

In this section, we present a detailed analysis of several implementation experiments performed by varying the parameters. All the designs were implemented in Verilog on a Virtex-7 FPGA (XC7VX690T, speed grade -2L) using Xilinx ISE 14.4. Inputs are 16-bit fixed-point complex numbers. The designs were verified by post place-and-route simulation, and the reported results are post place-and-route results. We used the SAIF (Switching Activity Interchange Format) file as input to Xilinx XPower Analyzer to produce accurate power dissipation estimates [14].

4.2. Performance Metrics

Two metrics for performance evaluation are considered in this paper:

1. Energy efficiency is defined as the number of operations per unit energy consumed (energy efficiency = number of operations / energy consumed by the design). For N-point FFT, energy efficiency is given by $(2N \log_2 N + \frac{9}{4} N \log_2 N)$ / energy consumed by the design, where energy consumed by the design = time taken by the design $\times$ average power dissipation of the design. Equivalently, the energy efficiency of a design is its power efficiency (power efficiency = number of operations per second / Watt).
2. Energy Area Time (EAT) is the product of three important metrics: energy, area, and time. Energy is the energy in Joules consumed by the design for one transformation of N points. Area is the area usage of the design, taken as the maximum of the number of LUTs and the number of flip-flops occupied by the entire design. The area of a design using BRAMs is taken to be equal to the area usage of the same design when only dist. RAMs are used. Time is the latency of the N-point FFT.

4.3. Design space exploration

In this section, we first present the design space exploration obtained by varying the algorithm mapping parameters. Both the dist. RAM based design and the BRAM based design are used in this experiment. The effect of the algorithm mapping parameters on energy efficiency is demonstrated using the proposed performance metrics. Next, we explore the energy-performance-area trade-off design (denoted trade-off design) by varying the architecture parameters, based on the conclusions of the design space exploration in this section.

Fig. 5: Energy efficiency (Giga operations/Joule) for various $H_p$ with varying problem size N for the dist. RAM based design.

Fig. 6: Energy efficiency (Giga operations/Joule) for various $H_p$ with varying problem size N for the BRAM based design.

Fig. 7: Energy efficiency (Giga operations/Joule) for various $V_p$ with varying problem size N for the dist. RAM based design.

Fig. 8: Energy efficiency (Giga operations/Joule) for various $V_p$ with varying problem size N for the BRAM based design.

4.3.1. Algorithm mapping level exploration

A. Horizontal Parallelism

In this experiment, we explore the energy efficiency for various degrees of horizontal parallelism, with $V_p = 4$, $N_c = 4$, and $N_p = 1$. The range of $H_p$ is $[1, \log_4 N]$. The energy efficiency for various $H_p$ is shown in Fig. 5 for the dist. RAM based design and in Fig. 6 for the BRAM based design. Based on the experimental results, we make the following observations:

- For all the considered problem sizes, increasing horizontal parallelism significantly improves energy efficiency for both the dist. RAM and the BRAM based designs.
- As the problem size N increases, the energy efficiency of the dist. RAM based design declines, whereas the energy efficiency of the BRAM based design increases.
- The improvement in energy efficiency brought by increasing $H_p$ for the dist. RAM based design is sensitive to N. For example, when N = 1024, halving $H_p$ leads to only a slight decline in energy efficiency. Reducing $H_p$ to save area is therefore a feasible alternative for larger problem sizes.
- The improvement in energy efficiency brought by increasing $H_p$ for BRAM based designs is not sensitive to N. Reducing $H_p$ to save area is not a feasible approach, because it leads to a significant decline in energy efficiency.

B. Vertical Parallelism

Vertical parallelism is determined by three values: the radix (fixed at 4), $N_c$, and $N_p$. $H_p$ was set to $\log_4 N$, and $N_c$ and $N_p$ were varied for evaluation. Both dist. RAM and BRAM based designs were evaluated. The energy efficiency for various $V_p$ is shown in Fig. 7 and Fig. 8. Based on the results, we draw the following conclusions:

- Reducing $N_c$ leads to a decline in energy efficiency.
- The BRAM based design is more scalable than the dist. RAM based design with respect to energy efficiency.
- When $N \ge 64$, the energy efficiency starts to decline for dist. RAM based designs because of the high power consumption per access of dist. RAMs with many memory entries.

- Increasing $N_c$ instead of $N_p$ is a more feasible approach to improving energy efficiency. Little extra resource is needed to increase $N_c$, whereas the pipeline has to be replicated to increase $N_p$.
- Although increasing $H_p$ leads to higher power and resource consumption, it improves energy efficiency because of the higher throughput.

4.3.2. Architecture level exploration

In this section, the trade-off design is explored at the architecture level. In this experiment, we choose $V_p = 4$ and $H_p = \log_4 N$ based on the previous experimental conclusions.

A. Energy hot spots

As shown in Fig. 10a, the dominant portion of the total power is consumed by the data buffers for the dist. RAM based 1024-point FFT. This indicates that BRAM can be utilized to improve energy efficiency for large values of N. It also suggests that I/Os consume a major share of the power for small values of N. Fig. 10b shows that the core power (excluding I/O power and static power) is dominant in the total power for BRAM based designs. We also observe that pipeline registers are the energy hot spots among the architecture components.

Fig. 10: Percentage of power consumed by the components of the 1024-point FFT architecture: (a) dist. RAM based design (data buffer 72%, static power 9%, radix block 4%); (b) BRAM based design (data buffer 28%, I/O power 26%, static power 13%, radix block 13%).

B. Trade-off design

By varying the architecture parameters, a set of implementations has been evaluated in this experiment. The effects of the architecture parameters on power, performance, and area are analyzed below:

- Energy: Reducing the number of registers can significantly reduce signal power, which is dominant in the dynamic power. A crossbar network can be considered to increase energy efficiency.
- Performance: Using BRAM can lead to a decline in peak operating frequency. For large values of N, when using BRAMs, extra pipeline stages can be used to address the performance degradation.
- Area: The area used by pipeline registers is dominant in the total design area. The number of pipeline registers can be balanced to obtain a trade-off between area and performance.

The analysis above has been applied to obtain the trade-off design in our experiment and serves as a guide for design space exploration. We use two baseline designs for comparison with our proposed trade-off design. The architecture parameters of the designs are shown in Table 1.

Table 1: Architecture parameters of designs for comparison

    Design            Memory type         Interconnection network                   Pipeline stages
                                          Type                  Components          Multiplier  Adder
    Design A          Dist. RAM           Dynamic network       Registers           5           2
    Trade-off design  Dist. RAM or BRAM   Crossbar network      LUTs                3           2
    Design C          BRAM                Complete binary tree  LUTs + Registers    2           1

Fig. 9: Energy efficiency (Giga operations/Joule) of the trade-off design and the baseline designs (Design A, Design C).

The comparison of the designs in terms of energy efficiency is shown in Fig. 9. The proposed trade-off design improves energy efficiency by up to 27% compared with the two baseline designs.

4.4. Performance comparison

Finally, we compare our proposed trade-off design with the SPIRAL FFT IP core. The SPIRAL DFT/FFT IP Generator can automatically generate customized DFT soft IP cores in synthesizable RTL Verilog from user inputs [11]. The available parameters of the DFT core generator include transform size, data precision, etc. In this comparison, we use the dist. RAM based design for $N \le 64$ and the BRAM based design for $N > 64$. For the design from SPIRAL, the code for the N-point (16-bit fixed-point) FFT is automatically generated by the SPIRAL core generator. The architecture is fully streaming and the data are presented in their natural order. As shown in Fig. 11, our proposed design improves energy efficiency by 8% to 28% and EAT by 23% to 38%, compared with the SPIRAL FFT IP cores.
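The two metrics of Section 4.2 reduce to a few lines of arithmetic. The sketch below is a minimal illustration with hypothetical function names; it assumes the operation count $2N \log_2 N + \frac{9}{4} N \log_2 N$ and takes area as the larger of the LUT and flip-flop counts. The numbers in the usage section are placeholders chosen only to exercise the functions, not measurements from this paper.

```python
import math


def fft_operations(n):
    """Operation count assumed by the metric: 2N log2 N + (9/4) N log2 N."""
    return 2 * n * math.log2(n) + 2.25 * n * math.log2(n)


def energy_joules(time_s, avg_power_w):
    """Energy = time taken by the design x average power dissipation."""
    return time_s * avg_power_w


def energy_efficiency(n, time_s, avg_power_w):
    """Operations per Joule for one N-point transform."""
    return fft_operations(n) / energy_joules(time_s, avg_power_w)


def design_area(luts, flip_flops):
    """Area: the larger of the LUT count and the flip-flop count."""
    return max(luts, flip_flops)


def eat(energy_j, area, latency_s):
    """Energy x Area x Time composite metric (lower is better)."""
    return energy_j * area * latency_s


if __name__ == "__main__":
    # Placeholder values for illustration only (not results from the paper).
    n, t, p = 1024, 3e-6, 1.0
    base = energy_efficiency(n, t, p)
    # Same transform at 3x the throughput (one third of the time) and 2x the power.
    faster = energy_efficiency(n, t / 3, 2 * p)
    print("efficiency ratio:", faster / base)
    print("EAT:", eat(energy_joules(t, p), design_area(20000, 35000), t))
```

The ratio computed in the usage section shows why a design that draws more power can still be more energy efficient: energy per transform is power multiplied by time, so tripling throughput while doubling power lowers the energy per transform to two thirds.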

Fig. 11: Comparison between the proposed trade-off design and the SPIRAL FFT IP cores for EAT and energy efficiency (left axis: energy efficiency in Giga operations/Joule; right axis: EAT ratio, i.e., EAT of the SPIRAL IP core divided by EAT of our design).

5. CONCLUSION

In this work, we presented a parameterized architecture for energy efficiency using the Radix-4 Cooley-Tukey FFT algorithm. The effect of the multi-level parameters on energy efficiency was demonstrated through design space exploration. We studied the power consumption of the components for various problem sizes, and proposed our trade-off design by empirical selection of the architecture parameters. Compared with the state-of-the-art design, our optimized architectures achieve up to 28% and 38% improvement in energy efficiency and EAT, respectively. In the future, we plan to work on an accurate high-level performance model for energy efficiency estimation, which can be used to accelerate design space exploration to obtain an energy efficient design.

6. REFERENCES

[1] N. Shirazi, P. M. Athanas, and A. L. Abbott, "Implementation of a 2-D Fast Fourier Transform on an FPGA-Based Custom Computing Machine," in Field-Programmable Logic and Applications, 1995, pp. 282-292.
[2] D. Chen, G. Yao, C. Koc, and R. Cheung, "Low complexity and hardware-friendly spectral modular multiplication," in International Conference on Field-Programmable Technology (FPT), 2012, pp. 368-375.
[3] E. H. Wold and A. M. Despain, "Pipeline and parallel-pipeline FFT processors for VLSI implementations," IEEE Transactions on Computers, vol. 100, no. 5, pp. 414-426, 1984.
[4] S. He and M. Torkelson, "A new approach to pipeline FFT processor," in Proceedings of IPPS 96, pp. 766-770.
[5] G. Bi and E. Jones, "A pipelined FFT processor for word-sequential data," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 37, no. 12, pp. 1982-1985, 1989.
[6] L. R. Rabiner and B. Gold, Theory and Application of Digital Signal Processing, Englewood Cliffs, NJ, Prentice-Hall, Inc., 1975, 777 p., vol. 1.
[7] S. Choi, R. Scrofano, V. K. Prasanna, and J.-W. Jang, "Energy-efficient signal processing using FPGAs," in Proceedings of the 2003 FPGA, pp. 225-234.
[8] D. Aravind and A. Sudarsanam, "High Level - Application Analysis Techniques & Architectures - To Explore Design Possibilities for Reduced Reconfiguration Area Overheads in FPGAs Executing Compute Intensive Applications," in Proc. of IPDPS, 2005, pp. 158a-158a.
[9] B. Baas, "A low-power, high-performance, 1024-point FFT processor," IEEE Journal of Solid-State Circuits, vol. 34, no. 3, pp. 380-387, 1999.
[10] W.-C. Yeh and C.-W. Jen, "High-speed and low-power split-radix FFT," IEEE Transactions on Signal Processing, vol. 51, no. 3, pp. 864-874, 2003.
[11] P. A. Milder, M. Ahmad, J. C. Hoe, and M. Püschel, "Fast and accurate resource estimation of automatically generated custom DFT IP cores," in Proceedings of the 2006 FPGA, pp. 211-220.
[12] T. Sugimura, H. Yamasaki, H. Noda, O. Yamamoto, Y. Okuno, and K. Arimoto, "A high-performance and energy-efficient FFT implementation on super parallel processor (MX) for mobile multimedia applications," in International Symposium on Intelligent Signal Processing and Communications Systems, 2009, pp. 1-4.
[13] H. Kimura, H. Nakamura, S. Kimura, and N. Yoshimoto, "Numerical analysis of dynamic SNR management by controlling DSP calculation precision for energy-efficient OFDM-PON," IEEE Photonics Technology Letters, vol. 24, no. 23, pp. 2132-2135, 2012.
[14] XST User Guide for Virtex-6, Spartan-6, and 7 Series Devices, http://www.xilinx.com/support/documentation.