ENERGY EFFICIENT PARAMETERIZED FFT ARCHITECTURE. Ren Chen, Hoang Le, and Viktor K. Prasanna

Ming Hsieh Department of Electrical Engineering
University of Southern California, Los Angeles, USA 90089
Email: {renchen, hoangle, prasanna}@usc.edu

ABSTRACT

In this paper, we revisit the classic Fast Fourier Transform (FFT) for energy efficient designs on FPGAs. A parameterized FFT architecture is proposed to identify the design trade-offs in achieving energy efficiency. We first perform design space exploration by varying the algorithm mapping parameters, such as the degree of vertical and horizontal parallelism, that characterize decomposition based FFT algorithms. Then we explore an energy efficient design by empirically selecting the values of the chosen architecture parameters, including the type of memory elements, the type of interconnection, and the number of pipeline stages. The trade-offs between energy, area, and time are analyzed using two performance metrics: the energy efficiency (defined as the number of operations per Joule) and the Energy × Area × Time (EAT) composite metric. From the experimental results, a design space is generated to demonstrate the effect of these parameters on the various performance metrics. For N-point FFT ($16 \le N \le 1024$), our designs achieve up to 28% and 38% improvement in the energy efficiency and EAT, respectively, compared with a state-of-the-art design.

1. INTRODUCTION

FPGA is a promising implementation technology for computationally intensive applications such as signal, image, and network processing tasks [1, 2]. State-of-the-art FPGAs offer high operating frequency, unprecedented logic density, and a host of other features. As FPGAs are programmed specifically for the problem to be solved, they can achieve higher performance with lower power consumption than general purpose processors. Fast Fourier Transform (FFT) is one of the most frequently used kernels in a wide variety of image and signal processing applications.
Various derivative FFT algorithms have been proposed and developed. The Radix-x Cooley-Tukey algorithm is one of the most popular algorithms for hardware implementation [3, 4, 5, 6]. Most hardware solutions for Radix-x FFT fall into the following categories: delay feedback or delay commutator architectures [4], such as the Radix-$2^2$ single-path delay feedback FFT [4] and the Radix-4 single-path delay commutator FFT [5]. By focusing on circuit level optimizations, these solutions achieve improvement in throughput, area, or power.

(This work has been funded by DARPA under grant number HR0011-12-2-0023.)

Energy efficiency is a key design metric. To obtain an energy efficient design for FFT, we analyze the trade-offs between energy, area, and time for fixed-point FFT on a parameterized architecture, using the Cooley-Tukey algorithm. Energy efficiency can be achieved both at the algorithm mapping level and at the architecture level [7, 8]. Optimizing at these two levels allows power to be effectively traded off against other performance parameters. For example, a design consuming 2× the power but achieving 3× the system throughput is actually 50% more energy efficient than the original design. We present the architecture design space with respect to energy efficiency at the algorithm mapping level. By empirical selection of the proposed architecture parameter values, we explore an energy efficient design at the architecture level.

In this paper, we make the following contributions:

1. A parameterized FFT architecture using the Radix-4 Cooley-Tukey algorithm (Section 3.1).
2. A design space that demonstrates the effect of the parameters on the Energy × Area × Time (EAT) composite metric and the energy efficiency (Section 4.3.2).
3. A demonstration of improved energy efficiency of the proposed design by identifying energy hot-spots and varying the chosen architecture parameters (Section 4.3.2).
4. Optimized designs achieving significant improvement in energy efficiency compared with a state-of-the-art design (Section 4.4).

The rest of the paper is organized as follows. Section 2 covers the background and related work. Section 3 describes the proposed parameterized architecture and its implementation on FPGA. Section 4 presents experimental results and analysis. Section 5 concludes the paper.
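The power-versus-throughput example in the introduction can be checked with a few lines of arithmetic. The sketch below assumes only the definition given in the abstract (energy efficiency is operations per Joule, so it scales as throughput divided by power); the function name `relative_energy_efficiency` is a hypothetical helper, not something taken from the paper.

```python
def relative_energy_efficiency(power_ratio: float, throughput_ratio: float) -> float:
    """Energy efficiency of a new design relative to a baseline design.

    Efficiency is operations per Joule, i.e. (operations / second) / Watt,
    so the ratio between two designs is throughput_ratio / power_ratio.
    """
    if power_ratio <= 0 or throughput_ratio <= 0:
        raise ValueError("ratios must be positive")
    return throughput_ratio / power_ratio


# A design with 2x the power and 3x the throughput of the baseline.
gain = relative_energy_efficiency(power_ratio=2.0, throughput_ratio=3.0)
print(f"{(gain - 1) * 100:.0f}% more energy efficient")  # prints "50% more energy efficient"
```

The same helper makes it easy to see that any design whose throughput grows faster than its power is a net win in operations per Joule, which is the reasoning behind exploring higher parallelism later in the paper.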

2. BACKGROUND AND RELATED WORK

2.1. Background

Given $N$ complex numbers $x_0, \ldots, x_{N-1}$, the Discrete Fourier Transform (DFT) is computed as:

$$X_k = \sum_{n=0}^{N-1} x_n e^{-i 2\pi k n / N}, \quad k = 0, \ldots, N-1.$$

Radix-x Cooley-Tukey FFT is a well known decomposition based algorithm for N-point DFT. In this paper, we employ Radix-4 FFT for our design. The description of Radix-4 FFT is presented in Algorithm 1. In terms of the number of real operations, the computational complexity of N-point Radix-4 FFT is $O(N \log_4 N)$. The algorithm performs N-point FFT in $N/m$ ($m < N$) cycles using $m$ Input/Output ports (I/Os) and $\log_4 N$ radix blocks, which are used for butterfly computations. The algorithm iteratively decomposes the entire problem into four subproblems. This feature enables us to map the algorithm by folding the FFT architecture vertically or horizontally, thus providing much freedom to implement various designs on FPGAs. We propose our parameterized architecture in Section 3.2 based on this feature.

Algorithm 1 Radix-4 FFT Algorithm

    q = N/4; d = N/4;
    for p := 0 to log_4 N - 1 do
        for k := 0 to 4^p - 1 do
            l = 4kq/4^p; r = l + q/4^p - 1;
            tw_1 = w[k]; tw_2 = w[2k]; tw_3 = w[3k];
            for i := l to r do
                t_0 = i; t_1 = i + d/4^p; t_2 = i + 2d/4^p; t_3 = i + 3d/4^p;
                do parallel
                    f_{p+1}[t_0] = f_p[t_0] + f_p[t_1] + f_p[t_2] + f_p[t_3];
                    f_{p+1}[t_1] = f_p[t_0] - j f_p[t_1] - f_p[t_2] + j f_p[t_3];
                    f_{p+1}[t_2] = f_p[t_0] - f_p[t_1] + f_p[t_2] - f_p[t_3];
                    f_{p+1}[t_3] = f_p[t_0] + j f_p[t_1] - f_p[t_2] - j f_p[t_3];
                end parallel
                do parallel
                    f_{p+1}[t_0] = f_{p+1}[t_0];
                    f_{p+1}[t_1] = tw_1 f_{p+1}[t_1];
                    f_{p+1}[t_2] = tw_2 f_{p+1}[t_2];
                    f_{p+1}[t_3] = tw_3 f_{p+1}[t_3];
                end parallel
            end for
        end for
    end for

2.2. Related Work

To the best of our knowledge, there has been no previous work targeted at exploring the design space for energy efficiency of FFT at both the algorithm mapping level and the architecture level on FPGAs.
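As a software cross-check of the Radix-4 decomposition described in Section 2.1, the following is a minimal sketch of an iterative radix-4 decimation-in-frequency FFT, verified against the direct DFT definition. It is not the authors' hardware design. Two points are assumptions of this sketch rather than statements from the paper: the twiddle factors are computed in the standard textbook form (by offset within the current sub-problem) instead of the schematic `w[k]`, `w[2k]`, `w[3k]` lookup of Algorithm 1, and the final base-4 digit-reversal reordering is added so the result comes out in natural order. The names `dft`, `radix4_fft`, and `digit_reverse` are hypothetical.

```python
import cmath


def dft(x):
    """Direct O(N^2) DFT, used only as a reference."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]


def radix4_fft(x):
    """Iterative radix-4 decimation-in-frequency FFT; len(x) must be a power of 4."""
    n = len(x)
    stages = 0
    while 4 ** stages < n:
        stages += 1
    if 4 ** stages != n:
        raise ValueError("length must be a power of 4")
    f = list(x)
    for p in range(stages):
        block = n // 4 ** p      # size of each sub-problem at stage p
        d = block // 4           # distance between the four butterfly inputs
        for start in range(0, n, block):
            for j in range(d):
                t0, t1, t2, t3 = (start + j + m * d for m in range(4))
                a, b, c, e = f[t0], f[t1], f[t2], f[t3]
                # Radix-4 butterfly: only multiplications by +-1 and +-j.
                y0 = a + b + c + e
                y1 = a - 1j * b - c + 1j * e
                y2 = a - b + c - e
                y3 = a + 1j * b - c - 1j * e
                # Twiddle multiplication; the first output needs none.
                w = cmath.exp(-2j * cmath.pi * j / block)
                f[t0], f[t1], f[t2], f[t3] = y0, y1 * w, y2 * w ** 2, y3 * w ** 3

    def digit_reverse(i):
        # Reverse the base-4 digits of i over `stages` digits.
        r = 0
        for _ in range(stages):
            r, i = r * 4 + i % 4, i // 4
        return r

    # Results are produced in base-4 digit-reversed order; undo that here.
    return [f[digit_reverse(k)] for k in range(n)]


x = [complex(i % 5, -(i % 3)) for i in range(16)]
err = max(abs(a - b) for a, b in zip(radix4_fft(x), dft(x)))
print(f"max |radix-4 - direct DFT| = {err:.2e}")
```

Each pass of the outer loop corresponds to one radix-4 stage, which is the unit that the architecture in Section 3 either replicates or reuses.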
Existing work has mainly focused on optimizing the performance, power, and area of the design at the circuit level. An energy-efficient 1024-point FFT processor was developed in [9]. A cache-based FFT algorithm was proposed to achieve low power and high performance, and an energy-time performance metric was evaluated at various processor operating points. In [10], a high-speed and low-power FFT architecture was presented: a delay balanced pipeline architecture based on the split-radix algorithm. Algorithms for reducing computation complexity were explored, and the architecture was evaluated in terms of area, power, and timing performance.

Based on Radix-x FFT, various pipeline FFT architectures have been proposed, such as the Radix-2 single-path delay feedback FFT [3], the Radix-4 single-path delay commutator FFT [5], the Radix-2 multi-path delay commutator FFT [6], and the Radix-$2^2$ single-path delay feedback FFT [4]. These architectures can achieve high throughput per unit area with single-path or multi-path pipelines, but energy efficiency has not been explored in these works.

In [11], a parameterized soft core generator for high throughput DFT was developed. This generator can automatically produce an optimized design from user inputs for performance and resource constraints. However, energy efficiency is not considered in that work. In [7], the authors presented a parameterized energy efficient FFT architecture. Their design is optimized to achieve high energy efficiency by varying the architecture parameters. Some energy efficient design techniques, such as clock gating and memory binding, are also employed in their work. Beyond FPGAs, techniques for energy efficient FFT have also been presented for other platforms [12, 13]. However, it is not clear how to apply these techniques on FPGAs.

In this work, we extend the work of [7] by design space exploration at multiple levels. The design space exploration is performed on current state-of-the-art FPGAs. By exploring the energy-performance-area trade-offs at multiple levels, we obtain an energy efficient design for FFT.

3. ARCHITECTURE AND IMPLEMENTATIONS

3.1. Architecture building blocks

The proposed N-point FFT architecture is based on the Radix-4 Cooley-Tukey FFT algorithm. Note that the choice of the radix affects the energy efficiency of the design. Compared with the Radix-2 algorithm, Radix-4 uses fewer multiply operations. The proposed architecture consists of five building blocks (see Fig. 1): Radix-4 block (R4), data buffer, path permutation (PER), parallel-to-serial/serial-to-parallel (PS/SP) multiplexer, and twiddle factor computation. A complete design for N-point FFT can be obtained by a combination of the basic blocks.

A. Radix-4 block

In this module, 16 signed adder/subtractors are used to complete the butterfly computations. It takes four inputs and generates four outputs in parallel. Each input contains real and imaginary components. The data outputs of R4 are used by the twiddle factor computation block, except in the last stage (see Fig. 1a).

B. Data buffer

Fig. 1: (a) Radix block, (b) Data buffer, (c) Path permutation (PER), (d) Parallel-to-serial/serial-to-parallel MUX (PS/SP), (e) Twiddle factor computation

Fig. 2: Data permutation in the data buffers for 16-point FFT

Each data buffer consists of a dual-port RAM having $N/m$ entries, where $m$ equals the number of I/Os. Data is written into one port and read from the other port simultaneously. Fig. 2 shows the data buffers for a 16-point pipelined FFT. In four cycles, 16 permuted data inputs are fed into the data buffers. In each cycle, four data outputs are read in parallel from alternating locations. For different architectural parameters, the read and write addresses are generated with different strides. For example, in Fig. 2, four data inputs ($X_0$, $X_4$, $X_8$, $X_{12}$) are written in cycle 0, cycle 1, cycle 2, and cycle 3 respectively. Then they are output simultaneously in cycle 4.

C. Data permutation block

Parallel input data must be permuted before being processed by the subsequent modules. Fig. 2 shows the data permutation for 16-point FFT. In the first cycle, four data inputs ($X_0$, $X_1$, $X_2$, $X_3$) are fed into the first entry of each data buffer without permutation. In the second cycle, another four data inputs are written into the second entry of each data buffer, rotated by one location. After four cycles, the data to be output in parallel ($X_i$, $X_{i+4}$, $X_{i+8}$, $X_{(i+12) \bmod 16}$, $i = 0, 1, 2, 3$) are stored in different RAMs. These permutations are repeated every four cycles.

D. PS/SP module

This module multiplexes serial input data to parallel output, or parallel input data to serial output. As shown in Fig. 3a, when the number of I/Os is limited to one, the radix-4 block still operates on four data inputs in parallel, so the PS/SP module is employed to match the data rate both before and after the radix-4 block.

Fig. 3: Parameterized architectures for 16-point FFT: (a) $H_p = 1$, $V_p = 1$; (b) $H_p = 2$, $V_p = 4$

E. Twiddle factor computation

This module consists of two blocks: the twiddle factor generation block and the complex number multiplier block. The twiddle factor generation block includes several lookup tables for storing twiddle factor coefficients, whose read addresses are updated by the control signals. The size of the lookup tables increases with the problem size. The complex number multiplier block consists of three multipliers and three adder/subtractors.

3.2. Parameterized FFT Architecture

3.2.1. Algorithm Mapping Parameters

Decomposition based Radix-4 FFT offers much flexibility in mapping to various architectures. Folding the FFT architecture enables the radix-4 blocks to be reused iteratively to save area, while unfolding the FFT increases spatial parallelism. Hence we use two algorithm mapping parameters that characterize the decomposition-based N-point FFT algorithm in our design:

1. Horizontal Parallelism ($H_p$): determines the number of radix-4 blocks concatenated horizontally ($1 \le H_p \le \log_4 N$).
2. Vertical Parallelism ($V_p$): determines the number of parallel I/Os ($1 \le V_p \le N$). $V_p$ is determined by the number of I/Os per pipeline ($N_c$) and the number of parallel pipelines ($N_p$), with $V_p = N_c \times N_p$. Each pipeline is a row of horizontally concatenated radix-4 blocks.

We adapt these two architectural parameters for an energy efficient design. Two different architectures are presented in Fig. 3. In Fig. 3a, $V_p = N_c = N_p = 1$, $H_p = 1$, $N = 16$: one radix-4 block is employed and iteratively used by two stages, and one input is processed per cycle. This architecture achieves higher resource efficiency and consumes less I/O power, at the expense of lower throughput. In Fig. 3b, $V_p = 4$, $H_p = 2$, $N = 16$: two radix-4 blocks are utilized. There is only one pipeline, with $N_c = 4$ and $N_p = 1$. Four inputs can be processed in parallel per cycle. Note that there is no feedback path. This architecture achieves high throughput by using more basic blocks and I/Os, at the cost of higher power consumption. We can also increase $V_p$ by replicating the basic pipeline. This replication allows several pipelines to work in parallel to significantly increase the throughput, at the cost of more complex interconnections.

3.2.2. Architecture Parameters

Three architecture parameters that significantly affect energy efficiency are employed in our design and applied to different components:

1. Type of memory element: BRAM or distributed RAM (dist. RAM) can be used as memories. In our design, both the data buffers and the twiddle factor lookup tables can be implemented using different memory elements.
2. Type of interconnection: three different types of interconnection (see Fig. 4) are used for the implementation of the data permutation blocks: crossbar, complete binary tree, and dynamic network.
3. Pipeline depth: both adder/subtractors and DSP slices in the FPGA can be deeply pipelined by inserting registers, so we parameterize the arithmetic units and multipliers with pipeline depth in our design to balance performance and resource usage.

Fig. 4: (a) Crossbar, (b) Complete binary tree, (c) Dynamic network

According to the FPGA manufacturer's user guide [14], BRAM consumes less power than dist. RAM when used for large memories. Hence this characteristic can be utilized to trade off between power and performance for various problem sizes.
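The parameter space of Section 3.2 can be summarized as a small configuration record. The sketch below is only an illustration of the stated constraints ($1 \le H_p \le \log_4 N$, $1 \le V_p \le N$, $V_p = N_c \times N_p$); the class and field names (`FFTConfig`, `Memory`, `Interconnect`, `multiplier_depth`, and so on) are hypothetical, and the memory, interconnection, and pipeline-depth values used for the two Fig. 3 configurations are placeholders, since the paper does not state them for those figures.

```python
from dataclasses import dataclass
from enum import Enum


class Memory(Enum):
    BRAM = "bram"
    DIST_RAM = "dist_ram"


class Interconnect(Enum):
    CROSSBAR = "crossbar"
    BINARY_TREE = "complete_binary_tree"
    DYNAMIC = "dynamic"


def log4(n: int) -> int:
    """Return log4(n) for n a power of 4, else raise ValueError."""
    k = 0
    while 4 ** k < n:
        k += 1
    if 4 ** k != n:
        raise ValueError("N must be a power of 4")
    return k


@dataclass(frozen=True)
class FFTConfig:
    n: int                      # problem size N
    h_p: int                    # horizontal parallelism: radix-4 blocks in a row
    n_c: int                    # I/Os per pipeline
    n_p: int                    # number of parallel pipelines
    buffer_memory: Memory       # memory element for the data buffers
    twiddle_memory: Memory      # memory element for the twiddle lookup tables
    interconnect: Interconnect  # interconnection used by the permutation blocks
    multiplier_depth: int       # pipeline depth of the multipliers
    adder_depth: int            # pipeline depth of the adder/subtractors

    @property
    def v_p(self) -> int:
        """Vertical parallelism, V_p = N_c * N_p."""
        return self.n_c * self.n_p

    def __post_init__(self):
        if not 1 <= self.h_p <= log4(self.n):
            raise ValueError("need 1 <= H_p <= log4(N)")
        if self.n_c < 1 or self.n_p < 1 or self.v_p > self.n:
            raise ValueError("need N_c, N_p >= 1 and V_p <= N")
        if self.multiplier_depth < 1 or self.adder_depth < 1:
            raise ValueError("pipeline depths must be at least 1")


# Algorithm mapping parameters of Fig. 3a and Fig. 3b; the remaining fields
# are placeholder choices made for this sketch only.
folded = FFTConfig(16, 1, 1, 1, Memory.DIST_RAM, Memory.DIST_RAM,
                   Interconnect.CROSSBAR, 3, 2)
unfolded = FFTConfig(16, 2, 4, 1, Memory.DIST_RAM, Memory.DIST_RAM,
                     Interconnect.CROSSBAR, 3, 2)
print(folded.v_p, unfolded.v_p)
```

Keeping the algorithm mapping parameters and the architecture parameters in one immutable record makes it straightforward to enumerate the design space by constructing one `FFTConfig` per candidate point.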
As there are $2^m (H_p + 1)$ permutation modules ($m = 1$ when $V_p = 4$, otherwise $m = 0$), using different interconnection networks can significantly affect the energy efficiency of the designs. The physical layout of the complete binary tree is similar to that of a crossbar, but more pipeline registers can be inserted between the layers of the tree. The dynamic network can be implemented using shift registers. Among the three types of interconnection, the dynamic network has high performance but greater power consumption, the crossbar consumes the least resources and power but has a long wire delay, and the complete binary tree has simpler routing, which improves performance at the expense of greater area usage.

4. EXPERIMENTAL RESULTS AND ANALYSIS

4.1. Experimental Setup

In this section, we present a detailed analysis of several implementation experiments obtained by varying the parameters. All the designs were implemented in Verilog on a Virtex-7 FPGA (XC7VX980T, speed grade -2L) using Xilinx ISE 14.4. Inputs are 16-bit fixed point complex numbers. The input test vectors for simulation were randomly generated and had an average toggle rate of 50%. We used the VCD (value change dump) file as input to Xilinx XPower Analyzer to produce an accurate power dissipation estimate [14]. For all the evaluated designs, the operating frequency is set to 333 MHz.

4.2. Performance Metrics

Two metrics for performance evaluation are considered in this paper:

1. Energy efficiency is defined as the number of operations per unit energy (Energy efficiency = Number of operations / Energy). For N-point Radix-4 FFT, the energy efficiency is given by $(\frac{17}{4} N \log_2 N) / \text{Energy}$. Energy is the product of the average power dissipation of the design and the latency of the FFT computation.
2. Energy × Area × Time (EAT) is measured as the product of three key metrics: energy, area, and time. For a given problem size, we use the EAT ratio to compare performance between different designs. Area is the area usage of the design, i.e. the number of LUTs or flip-flops (whichever is larger) occupied by the entire design. BRAM slices are converted to an equivalent number of LUTs based on the memory size, which gives the area of the BRAMs. Time is the latency of the pipelined N-point FFT.

4.3. Design space exploration

In this section, we first explore the design space by varying the algorithm mapping parameters. The parameter values are then chosen according to the experimental results. Based on that, we explore the energy-efficient design (denoted the empirically optimized design) by varying the architecture parameters empirically. Both the dist. RAM based design and the BRAM based design are used in this experiment. The effects of the design parameters on energy efficiency are demonstrated using the proposed performance metrics.

Fig. 5: Energy efficiency (Giga operations/Joule) for various $H_p$ with varying problem size $N$ for the dist. RAM based design

Fig. 6: Energy efficiency (Giga operations/Joule) for various $H_p$ with varying problem size $N$ for the BRAM based design

Fig. 7: Energy efficiency (Giga operations/Joule) for various $V_p$ with varying problem size $N$ for the dist. RAM based design

Fig. 8: Energy efficiency (Giga operations/Joule) for various $V_p$ with varying problem size $N$ for the BRAM based design

4.3.1. Algorithm mapping level exploration

A. Horizontal Parallelism

In this experiment, we explored energy efficiency while varying the horizontal parallelism, with $V_p = 4$, $N_c = 4$, $N_p = 1$ held fixed. The range of $H_p$ is $[1, \log_4 N]$. The energy efficiency for various $H_p$ is shown in Fig. 5 (dist. RAM based design) and Fig. 6 (BRAM based design). Based on the experimental results, we make the following observations:

- For the considered problem sizes, increasing $H_p$ significantly improves energy efficiency for all designs. Despite the extra hardware required to unfold the FFT horizontally, the reduced latency of the FFT computation enables the design to outperform the original design.
- As $N$ grows, the energy efficiency of the dist. RAM based design declines, whereas that of the BRAM based design increases. The reason is that dist. RAM power increases significantly with memory size, whereas BRAM power is mainly determined by the number of BRAM slices used [14].
- For the dist. RAM based design, the improvement in energy efficiency brought by increasing $H_p$ is sensitive to $N$.
  For example, when $N = 1024$, doubling $H_p$ only leads to a slight increase in energy efficiency. Thus, reducing $H_p$ to save area could be a feasible alternative for larger problem sizes.
- The improvement in energy efficiency brought by increasing $H_p$ for BRAM based designs is not sensitive to $N$. Reducing $H_p$ to save area can lead to a significant decline in energy efficiency for any problem size.

B. Vertical Parallelism

Vertical parallelism is determined by three values: the radix (fixed at 4), $N_c$, and $N_p$. $H_p$ was set to $\log_4 N$, and $N_c$ and $N_p$ were varied for evaluation. Both dist. RAM and BRAM based designs were evaluated. The energy efficiency for various $V_p$ is shown in Fig. 7 and Fig. 8. Note that the maximum $V_p$ is limited by the number of available I/O pins. In this experiment, we make the following observations:

- The BRAM based design is more scalable than the dist. RAM based design with respect to energy efficiency. When $N \ge 64$, energy efficiency starts to decline for dist. RAM based designs due to high power consumption with increasing memory size.
- Increasing $N_c$ instead of $N_p$ can improve energy efficiency with fewer hardware resources, since increasing $N_c$ only requires extra data buffers while increasing $N_p$ requires replicating the full pipeline.
- Given a loose area constraint, we can improve energy efficiency and throughput by increasing $N_p$. Although increasing $N_p$ leads to high power and resource consumption, the boosted throughput can offset these disadvantages.

4.3.2. Architecture level exploration

In this section, we explore an energy efficient design (the empirically optimized design) at the architecture level. We choose $V_p = 4$ and $H_p = \log_4 N$ based on the conclusions from the previous experiments.

A. Energy hot spots

As shown in Fig. 9a, the dominant portion of the total power is consumed by the data buffers for 1024-point FFT. This indicates that BRAM can be utilized to improve energy efficiency for large values of $N$. Fig. 9b shows that the share of I/O power and static power in the total power increases significantly for BRAM based designs. As I/O power and static power are constants here, this indicates that using BRAMs reduces the power of the main design components. It also suggests that I/Os consume a large portion of the power for the BRAM based design.

Fig. 9: Power profile of 1024-point FFT architecture: (a) dist. RAM based design, (b) BRAM based design (the data buffers account for 72% of the total power in (a) and 28% in (b))

B. Empirically optimized design

We first analyze the effects of the architecture parameters on energy, performance, and area. The analysis below has been applied to choose the architecture parameter values for the empirically optimized design in our experiment:

- Energy: Reducing the number of registers can significantly reduce signal power, which dominates the dynamic power. A crossbar is therefore a candidate for increasing energy efficiency.
- Performance: Using BRAM can lead to a decline in peak operating frequency. For large values of $N$, when using BRAMs, extra pipeline stages can be used to address the performance degradation.
- Area: The area usage of pipeline registers dominates the total design area. The number of pipeline registers can be balanced to trade off area against performance.

As shown in Table 2, we use two baseline designs to compare with our proposed empirically optimized design. The architecture parameters of the designs for comparison are shown in Table 2. The comparison of the designs in terms of energy efficiency is shown in Fig. 10. It shows that the proposed empirically optimized design improves energy efficiency by up to 27% compared with the other two baseline designs.

Table 2: Architecture parameters of designs for comparison

                               Design A     Empirically optimized design   Design C
    Memory type                (partly lost in extraction: "... or BRAM", "BRAM")
    Interconnection type       Dynamic      Crossbar                       Complete binary tree
    Interconnection components Registers    LUTs                           LUTs + Registers
    Pipeline depth, multiplier 5            3                              2
    Pipeline depth, adder      2            2                              1

Fig. 10: Energy efficiency (Giga operations/Joule) of the empirically optimized design and the baseline designs (Design A, Design C) versus problem size

4.4. Performance comparison

We use the SPIRAL FFT IP core to compare with our proposed empirically optimized design. The SPIRAL FFT IP cores are high performance FFT designs based on a streaming architecture. The data permutation block in their designs has been mathematically proven to be control-cost optimal [15]. Using the provided tools, customized FFT soft IP cores can be automatically generated in synthesizable RTL Verilog from user inputs [11]. The available parameters of the DFT core generator include transform size, data precision, and streaming width. In this comparison, we use the dist. RAM based design for $N \le 64$ and the BRAM based design for $N > 64$. For the design from SPIRAL, the code for N-point (16-bit fixed point) FFT is automatically generated by the SPIRAL core generator. The architecture is fully streaming and the data are presented in their natural order. As shown in Fig. 11, our proposed design improves energy efficiency by 8% to 28% and EAT by 23% to 38%, compared with the SPIRAL FFT IP cores.

Fig. 11: Comparison between the proposed empirically optimized design and the SPIRAL FFT IP cores for EAT and energy efficiency (left axis: Giga operations/Joule; right axis: EAT ratio, i.e. EAT of SPIRAL IP core / EAT of our design; horizontal axis: problem size)

5. CONCLUSION

We presented a parameterized architecture for energy efficient implementation of the Radix-4 Cooley-Tukey FFT algorithm. The effect of the two-level parameters on energy efficiency was demonstrated using design space exploration. We studied the power consumption of the components for various problem sizes, and proposed our empirically optimized design by empirical selection of architecture parameter values. Compared with the state-of-the-art design, our optimized architectures achieve up to 28% and 38% improvement in the energy efficiency and EAT respectively.
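The two metrics behind these comparisons (Section 4.2) reduce to a few products and ratios. The sketch below writes them out under the definitions stated in the paper: the operation count $\frac{17}{4} N \log_2 N$, energy as average power times latency, and EAT as energy times area times time. The function names are hypothetical, and no measured values from the paper are reproduced here; any numbers passed in are the caller's own.

```python
import math


def radix4_operations(n: int) -> float:
    """Operation count used for N-point Radix-4 FFT: (17/4) * N * log2(N)."""
    return 17 / 4 * n * math.log2(n)


def energy_joules(avg_power_w: float, latency_s: float) -> float:
    """Energy = average power dissipation * latency of the FFT computation."""
    return avg_power_w * latency_s


def energy_efficiency(n: int, avg_power_w: float, latency_s: float) -> float:
    """Operations per Joule for an N-point Radix-4 FFT."""
    return radix4_operations(n) / energy_joules(avg_power_w, latency_s)


def eat(avg_power_w: float, latency_s: float, area: float) -> float:
    """Energy x Area x Time; area is max(LUTs, flip-flops) plus BRAM equivalents."""
    return energy_joules(avg_power_w, latency_s) * area * latency_s


def eat_ratio(eat_baseline: float, eat_ours: float) -> float:
    """EAT of the baseline divided by EAT of the evaluated design (higher is better)."""
    return eat_baseline / eat_ours


print(radix4_operations(1024))  # 43520.0
```

Because latency enters EAT twice (once through energy and once as time), a design that shortens the computation improves EAT faster than it improves energy efficiency alone.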
In the future, we plan to develop an accurate high-level performance model for energy-efficiency estimation, which can be used to accelerate design space exploration and obtain an energy-efficient design.

6. REFERENCES

[1] N. Shirazi, P. M. Athanas, and A. L. Abbott, "Implementation of a 2-D Fast Fourier Transform on an FPGA-Based Custom Computing Machine," in Proceedings of Field-Programmable Logic and Applications, 1995, pp. 282-292.

[2] D. Chen, G. Yao, C. Koc, and R. Cheung, "Low complexity and hardware-friendly spectral modular multiplication," in Proceedings of Field-Programmable Technology (FPT), 2012, pp. 368-375.

[3] E. H. Wold and A. M. Despain, "Pipeline and parallel-pipeline FFT processors for VLSI implementations," IEEE Transactions on Computers, vol. C-33, no. 5, pp. 414-426, 1984.

[4] S. He and M. Torkelson, "A new approach to pipeline FFT processor," in Proceedings of IPPS '96, pp. 766-770.

[5] G. Bi and E. Jones, "A pipelined FFT processor for word-sequential data," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 37, no. 12, pp. 1982-1985, 1989.

[6] L. R. Rabiner and B. Gold, Theory and Application of Digital Signal Processing, vol. 1. Englewood Cliffs, NJ: Prentice-Hall, Inc., 1975, 777 p.

[7] S. Choi, R. Scrofano, V. K. Prasanna, and J.-W. Jang, "Energy-efficient signal processing using FPGAs," in Proceedings of FPGA '03, 2003, pp. 225-234.

[8] D. Aravind and A. Sudarsanam, "High level - application analysis techniques & architectures - to explore design possibilities for reduced reconfiguration area overheads in FPGAs executing compute intensive applications," in Proceedings of IPDPS, 2005, pp. 158a-158a.

[9] B. Baas, "A low-power, high-performance, 1024-point FFT processor," IEEE Journal of Solid-State Circuits, vol. 34, no. 3, pp. 380-387, 1999.

[10] W.-C. Yeh and C.-W. Jen, "High-speed and low-power split-radix FFT," IEEE Transactions on Signal Processing, vol. 51, no. 3, pp. 864-874, 2003.

[11] G. Nordin, P. A. Milder, J. C. Hoe, and M. Püschel, "Automatic generation of customized Discrete Fourier Transform IPs," in Proceedings of Design Automation Conference (DAC), 2005, pp. 471-474.

[12] T. Sugimura, H. Yamasaki, H. Noda, O. Yamamoto, Y. Okuno, and K. Arimoto, "A high-performance and energy-efficient FFT implementation on super parallel processor (MX) for mobile multimedia applications," in Proceedings of Intelligent Signal Processing and Communications Systems, 2009, pp. 1-4.

[13] H. Kimura, H. Nakamura, S. Kimura, and N. Yoshimoto, "Numerical analysis of dynamic SNR management by controlling DSP calculation precision for energy-efficient OFDM-PON," IEEE Photonics Technology Letters, vol. 24, no. 23, pp. 2132-2135, 2012.

[14] XST User Guide for Virtex-6, Spartan-6, and 7 Series Devices, http://www.xilinx.com/support/documentation.

[15] M. Püschel, P. A. Milder, and J. C. Hoe, "Permuting streaming data using RAMs," Journal of the ACM, vol. 56, no. 2, pp. 10:1-10:34, 2009.