ENERGY EFFICIENT PARAMETERIZED FFT ARCHITECTURE. Ren Chen, Hoang Le, and Viktor K. Prasanna


Ming Hsieh Department of Electrical Engineering, University of Southern California, Los Angeles, USA
{renchen, hoangle,

ABSTRACT

In this paper, we revisit the classic Fast Fourier Transform (FFT) for energy efficient designs on FPGAs. A parameterized FFT architecture is proposed to identify the design trade-offs in achieving energy efficiency. We first perform design space exploration by varying the algorithm mapping parameters, such as the degree of vertical and horizontal parallelism, that characterize decomposition based FFT algorithms. Then we explore an energy efficient design by empirically selecting the values of the chosen architecture parameters, including the type of memory elements, the type of interconnection and the number of pipeline stages. The trade-offs between energy, area, and time are analyzed using two performance metrics: the energy efficiency (defined as the number of operations per Joule) and the Energy Area Time (EAT) composite metric. From the experimental results, a design space is generated to demonstrate the effect of these parameters on the various performance metrics. For N-point FFT ($16 \le N \le 1024$), our designs achieve up to 28% and 38% improvement in the energy efficiency and EAT, respectively, compared with a state-of-the-art design.

1. INTRODUCTION

The FPGA is a promising implementation technology for computationally intensive applications such as signal and image processing tasks [1, 2]. State-of-the-art FPGAs offer high operating frequency, unprecedented logic density and a host of other features. As FPGAs are programmed specifically for the problem to be solved, they can achieve higher performance with lower power consumption than general purpose processors. The Fast Fourier Transform (FFT) is one of the most frequently used kernels in a wide variety of image and signal processing applications.
Various derivative FFT algorithms have been proposed and developed. The Radix-x Cooley-Tukey algorithm is one of the most popular algorithms for hardware implementation [3, 4, 5, 6]. Most hardware solutions for Radix-x FFT fall into the following categories: delay feedback or delay commutator architectures [4], such as Radix-$2^2$ single-path delay feedback FFT [4] and Radix-4 single-path delay commutator FFT [5]. By focusing on circuit level optimizations, these solutions achieve improvement in throughput, area, or power.

(This work has been funded by DARPA under grant number HR)

Energy efficiency is a key design metric. To obtain an energy efficient design for FFT, we analyze the trade-offs between energy, area, and time for fixed-point FFT on a parameterized architecture, using the Cooley-Tukey algorithm. Energy efficiency can be achieved both at the algorithm mapping level and the architecture level [7, 8]. Optimizing at these two levels allows power to be effectively traded off against other performance parameters. For example, a design consuming 2× the power but achieving 3× the system throughput is actually 50% more energy efficient than the original design. We present the architecture design space with respect to energy efficiency at the algorithm mapping level. By empirical selection of the proposed architecture parameter values, we explore an energy efficient design at the architecture level.

In this paper, we make the following contributions:

1. A parameterized FFT architecture using the Radix-4 Cooley-Tukey algorithm (Section 3.1).
2. A design space that demonstrates the effect of the parameters on the Energy Area Time (EAT) composite metric and the energy efficiency (Section 4.3.2).
3. A demonstration of improved energy efficiency of the proposed design, obtained by identifying energy hot spots and varying the chosen architecture parameters (Section 4.3.2).
4. Optimized designs achieving significant improvement in energy efficiency compared with a state-of-the-art design (Section 4.4).
The rest of the paper is organized as follows. Section 2 covers the background and related work. Section 3 describes the proposed parameterized architecture and its implementation on FPGA. Section 4 presents experimental results and analysis. Section 5 concludes the paper.
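The power/throughput example in the introduction can be checked with a short calculation. This is only a sketch under the assumption that energy efficiency is operations per Joule and throughput is operations per second; the symbols $\eta$ (energy efficiency), $T$ (throughput) and $P$ (power) are introduced here for illustration and do not appear in the original text.

```latex
\eta = \frac{\text{operations}}{\text{energy}}
     = \frac{\text{operations}/\text{second}}{\text{Joules}/\text{second}}
     = \frac{T}{P},
\qquad
\frac{\eta'}{\eta} = \frac{3T / (2P)}{T / P} = \frac{3}{2} = 1.5
```

Under these assumptions, a design with twice the power and three times the throughput spends two thirds of the energy per operation, which corresponds to a 50% gain in energy efficiency.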

2. BACKGROUND AND RELATED WORK

2.1. Background

Given N complex numbers $x_0, \ldots, x_{N-1}$, the Discrete Fourier Transform (DFT) is computed as:

$X_k = \sum_{n=0}^{N-1} x_n e^{-i 2\pi k n / N}, \quad k = 0, \ldots, N-1.$

Radix-x Cooley-Tukey FFT is a well known decomposition based algorithm for N-point DFT. In this paper, we employ Radix-4 FFT for our design. The description of Radix-4 FFT is presented in Algorithm 1. In terms of the number of real operations, the computational complexity of N-point Radix-4 FFT is $O(N \log_4 N)$. The algorithm performs N-point FFT in $N/m$ ($m < N$) cycles using $m$ Input/Output ports (I/Os) and $\log_4 N$ radix blocks, which are used for butterfly computations. The algorithm iteratively decomposes the entire problem into four subproblems. This feature enables us to map the algorithm by folding the FFT architecture vertically or horizontally, thus providing much freedom to implement various designs on FPGAs. We propose our parameterized architecture in Section 3.2 based on this feature.

Algorithm 1 Radix-4 FFT Algorithm

    q = N/4; d = N/4;
    for p := 0 to log_4 N - 1 do
      for k := 0 to 4^p - 1 do
        l = 4kq/4^p; r = l + q/4^p - 1;
        tw_1 = w[k]; tw_2 = w[2k]; tw_3 = w[3k];
        for i := l to r do
          t_0 = i; t_1 = i + d/4^p; t_2 = i + 2d/4^p; t_3 = i + 3d/4^p;
          do parallel
            f_{p+1}[t_0] = f_p[t_0] + f_p[t_1] + f_p[t_2] + f_p[t_3];
            f_{p+1}[t_1] = f_p[t_0] - j f_p[t_1] - f_p[t_2] + j f_p[t_3];
            f_{p+1}[t_2] = f_p[t_0] - f_p[t_1] + f_p[t_2] - f_p[t_3];
            f_{p+1}[t_3] = f_p[t_0] + j f_p[t_1] - f_p[t_2] - j f_p[t_3];
          end parallel
          do parallel
            f_{p+1}[t_0] = f_{p+1}[t_0];
            f_{p+1}[t_1] = tw_1 f_{p+1}[t_1];
            f_{p+1}[t_2] = tw_2 f_{p+1}[t_2];
            f_{p+1}[t_3] = tw_3 f_{p+1}[t_3];
          end parallel
        end for
      end for
    end for

2.2. Related Work

To the best of our knowledge, there has been no previous work targeted at exploring the design space for energy efficiency of FFT at both the algorithm mapping level and the architecture level on FPGAs.
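As a companion to Algorithm 1 in Section 2.1, the following is a minimal software sketch of a radix-4 decimation-in-frequency FFT, checked against a direct evaluation of the DFT definition. It is not the authors' hardware mapping: the recursive formulation, the function names `dft` and `radix4_fft`, and the use of Python's `cmath` module are assumptions made purely for illustration. Each recursion level applies the four-input butterfly followed by the twiddle factor multiplication, mirroring the two parallel steps of the algorithm.

```python
import cmath


def dft(x):
    """Direct O(N^2) evaluation of X_k = sum_n x_n * exp(-i*2*pi*k*n/N)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def radix4_fft(x):
    """Radix-4 decimation-in-frequency FFT; len(x) must be a power of 4."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    if n % 4 != 0:
        raise ValueError("length must be a power of 4")
    q = n // 4
    sub = [[0j] * q for _ in range(4)]
    for i in range(q):
        a, b, c, d = x[i], x[i + q], x[i + 2 * q], x[i + 3 * q]
        # Twiddle factors W_N^i, W_N^(2i), W_N^(3i).
        w1 = cmath.exp(-2j * cmath.pi * i / n)
        w2 = w1 * w1
        w3 = w2 * w1
        # Radix-4 butterfly followed by twiddle multiplication.
        sub[0][i] = a + b + c + d
        sub[1][i] = (a - 1j * b - c + 1j * d) * w1
        sub[2][i] = (a - b + c - d) * w2
        sub[3][i] = (a + 1j * b - c - 1j * d) * w3
    out = [0j] * n
    for r in range(4):
        # Sub-transform r produces the outputs X[4k + r].
        out[r::4] = radix4_fft(sub[r])
    return out


if __name__ == "__main__":
    samples = [complex(i, -0.5 * i) for i in range(16)]
    error = max(abs(u - v) for u, v in zip(radix4_fft(samples), dft(samples)))
    print("max deviation from direct DFT:", error)
```

The recursion depth equals $\log_4 N$, which corresponds to the number of radix blocks available for horizontal unfolding in the architecture described later.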
Existing work has mainly focused on optimizing the performance, power and area of the design at the circuit level. An energy-efficient 1024-point FFT processor was developed in [9]. A cache-based FFT algorithm was proposed to achieve low power and high performance, and an energy-time performance metric was evaluated at various processor operation points. In [10], a high-speed and low-power FFT architecture was presented: a delay balanced pipeline architecture based on the split-radix algorithm. Algorithms for reducing computation complexity were explored and the architecture was evaluated in terms of area, power and timing performance.

Based on Radix-x FFT, various pipeline FFT architectures have been proposed, such as Radix-2 single-path delay feedback FFT [3], Radix-4 single-path delay commutator FFT [5], Radix-2 multi-path delay commutator FFT [6], and Radix-$2^2$ single-path delay feedback FFT [4]. These architectures can achieve high throughput per unit area with single-path or multi-path pipelines, but energy efficiency has not been explored in these works.

In [11], a parameterized soft core generator for high throughput DFT was developed. This generator can automatically produce an optimized design with user inputs for performance and resource constraints. However, energy efficiency is not considered in this work. In [7], the authors presented a parameterized energy efficient FFT architecture. Their design is optimized to achieve high energy efficiency by varying the architecture parameters. Some energy efficient design techniques, such as clock gating and memory binding, are also employed in their work. Beyond FPGAs, techniques for energy efficient FFT have also been presented for other platforms [12, 13]. However, it is not clear how to apply these techniques on FPGAs. In this work, we extend the work of [7] by design space exploration at multiple levels. The design space exploration is performed on current state-of-the-art FPGAs.
By exploring the energy-performance-area trade-offs at multiple levels, we obtain an energy efficient design for FFT.

3. ARCHITECTURE AND IMPLEMENTATIONS

3.1. Architecture building blocks

The proposed N-point FFT architecture is based on the Radix-4 Cooley-Tukey FFT algorithm. Note that the choice of the radix affects the energy efficiency of the design. Compared with the Radix-2 algorithm, Radix-4 uses fewer multiply operations. The proposed architecture consists of five building blocks (see Fig. 1): Radix-4 block (R4), data buffer, data path permutation (PER), parallel-to-serial/serial-to-parallel (PS/SP) multiplexer, and twiddle factor computation. A complete design for N-point FFT can be obtained by a combination of the basic blocks.

A. Radix-4 block

In this module, 16 signed adder/subtractors are used to complete the butterfly computations. It takes four inputs and generates four outputs in parallel. Each input data contains real and imaginary components. The data outputs of R4 are used by the twiddle factor computation block except in the last stage (see Fig. 1a).

B. Data buffer

Fig. 1: (a) Radix block, (b) Data buffer, (c) Data path permutation (PER), (d) Parallel-to-serial/serial-to-parallel MUX (PS/SP), (e) Twiddle factor computation

Fig. 2: Data permutation in the data buffers for 16-point FFT

Each data buffer consists of a dual-port RAM having $N/m$ entries, where $m$ is the number of I/Os. Data is written into one port and read from the other port simultaneously. Fig. 2 shows the data buffers for a 16-point pipelined FFT. In four cycles, 16 permuted data inputs are fed into the data buffers. In each cycle, four data outputs are read in parallel from alternating locations. For different architectural parameters, the read and write addresses are generated with different strides. For example, in Fig. 2, four data inputs ($X_0$, $X_4$, $X_8$, $X_{12}$) are written in cycle 0, cycle 1, cycle 2, and cycle 3 respectively. Then they are output simultaneously in cycle 4.

C. Data permutation block

Parallel input data must be permuted before being processed by the subsequent modules. Fig. 2 shows the data permutation for 16-point FFT. In the first cycle, four data inputs ($X_0$, $X_1$, $X_2$, $X_3$) are fed into the first entry of each data buffer without permutation. In the second cycle, another four data inputs are written into the second entry of each data buffer, permuted by one location. The parallel output data ($X_i$, $X_{i+4}$, $X_{i+8}$, $X_{(i+12) \bmod 16}$, $i = 0, 1, 2, 3$) are stored in different RAMs after four cycles. These permutations are repeated every four cycles.

D. PS/SP module

This module multiplexes serial input data to parallel outputs, or parallel input data to serial outputs. As shown in Fig. 3a, the number of I/Os is limited to one, but the radix-4 block still operates on four data inputs in parallel; the PS/SP module is therefore employed to match the data rate both before and after the radix-4 block.

E. Twiddle factor computation

This module consists of two blocks: the twiddle factor generation block and the complex number multiplier block. The twiddle factor generation block includes several lookup tables for storing twiddle factor coefficients, whose read addresses are updated by the control signals. The size of the lookup tables increases with the problem size. The complex number multiplier block consists of three multipliers and three adder/subtractors.

3.2. Parameterized FFT Architecture

3.2.1. Algorithm Mapping Parameters

Decomposition based Radix-4 FFT offers much flexibility in mapping to various architectures. Folding the FFT architecture enables the radix-4 blocks to be reused iteratively to save area, while unfolding the FFT increases spatial parallelism. Hence we use two algorithm mapping parameters that characterize the decomposition-based N-point FFT algorithm in our design:

1. Horizontal Parallelism ($H_p$): determines the number of radix-4 blocks concatenated horizontally ($1 \le H_p \le \log_4 N$).
2. Vertical Parallelism ($V_p$): determines the number of parallel I/Os ($1 \le V_p \le N$). $V_p$ is determined by the number of I/Os per pipeline ($N_c$) and the number of parallel pipelines ($N_p$), with $V_p = N_c \times N_p$. Each pipeline is a row of horizontally concatenated radix-4 blocks.

We adapt these two architectural parameters for an energy efficient design. Two different architectures are presented in Fig. 3.

Fig. 3: Parameterized architectures for 16-point FFT: (a) $H_p = 1$, $V_p = 1$; (b) $H_p = 2$, $V_p = 4$

In Fig. 3a, $V_p = N_c = N_p = 1$, $H_p = 1$, $N = 16$; one radix-4 block is employed and iteratively used by two stages, and one input data is processed per cycle. This architecture achieves higher resource efficiency and consumes less I/O power, at the expense of lower throughput. In Fig. 3b, $V_p = 4$, $H_p = 2$, $N = 16$, and two radix-4 blocks are utilized. There is only one pipeline, with $N_c = 4$ and $N_p = 1$. Four inputs can be processed in parallel per cycle. Note that there is no feedback path. This architecture achieves high throughput by using more basic blocks and I/Os, at the cost of higher power consumption. We can also increase $V_p$ by replicating the basic pipeline. This replication allows several pipelines to work in parallel to significantly increase the throughput at the cost of more complex interconnections.

3.2.2. Architecture Parameters

Fig. 4: (a) Crossbar, (b) Complete binary tree, (c) Dynamic

Three architecture parameters that significantly affect energy efficiency are employed in our design and applied to different components:

1. Type of memory element: BRAM or distributed RAM (dist. RAM) can be used as memories. In our design, both data buffers and twiddle factor lookup tables can be implemented using different memory elements.
2. Type of interconnection: three different types of interconnection (see Fig. 4) are used to implement the data permutation blocks: crossbar, complete binary tree, and dynamic.
3. Pipeline depth: both adder/subtractors and DSP slices in the FPGA can be deeply pipelined by inserting registers, so we parameterize the arithmetic units and multipliers by pipeline depth to balance performance and resource usage.

According to the FPGA manufacturer's user guide [14], BRAM consumes less power than dist. RAM when used for large memories. This characteristic can be used to trade off power against performance for various problem sizes.
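The parameter space described above can be captured in a small configuration record. The sketch below is an illustrative assumption, not part of the original design flow: the class and field names (`FFTConfig`, `Memory`, `Interconnect`, `mult_pipeline_depth`, and so on) and the default values are hypothetical. It only encodes the constraints stated in the text, namely $1 \le H_p \le \log_4 N$, $V_p = N_c \times N_p$ and $1 \le V_p \le N$.

```python
from dataclasses import dataclass
from enum import Enum


class Memory(Enum):
    BRAM = "bram"
    DIST_RAM = "dist_ram"


class Interconnect(Enum):
    CROSSBAR = "crossbar"
    COMPLETE_BINARY_TREE = "complete_binary_tree"
    DYNAMIC = "dynamic"


def log4(n: int) -> int:
    """Return log_4(n) when n is a power of 4, otherwise raise ValueError."""
    if n < 1:
        raise ValueError("n must be positive")
    stages = 0
    while n > 1:
        if n % 4 != 0:
            raise ValueError("n must be a power of 4")
        n //= 4
        stages += 1
    return stages


@dataclass(frozen=True)
class FFTConfig:
    """Hypothetical record of one design point (names are illustrative)."""
    n: int                      # problem size N
    h_p: int                    # horizontal parallelism H_p
    n_c: int                    # I/Os per pipeline N_c
    n_p: int                    # number of parallel pipelines N_p
    memory: Memory = Memory.BRAM
    interconnect: Interconnect = Interconnect.CROSSBAR
    mult_pipeline_depth: int = 3    # placeholder default
    adder_pipeline_depth: int = 2   # placeholder default

    def __post_init__(self):
        if not 1 <= self.h_p <= log4(self.n):
            raise ValueError("H_p must satisfy 1 <= H_p <= log_4 N")
        if not 1 <= self.v_p <= self.n:
            raise ValueError("V_p must satisfy 1 <= V_p <= N")

    @property
    def v_p(self) -> int:
        """Vertical parallelism V_p = N_c * N_p."""
        return self.n_c * self.n_p


# The two 16-point design points described for Fig. 3.
fig3a = FFTConfig(n=16, h_p=1, n_c=1, n_p=1)
fig3b = FFTConfig(n=16, h_p=2, n_c=4, n_p=1)
```

A record like this makes it straightforward to enumerate design points for exploration while rejecting combinations that violate the stated parameter ranges.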
As there are $2^m (H_p + 1)$ permutation modules ($m = 1$ when $V_p = 4$, otherwise $m = 0$), using different interconnections can significantly affect the energy efficiency of the designs. The physical layout of the complete binary tree is similar to that of a crossbar, but more pipeline registers can be inserted between the layers of the tree. The dynamic interconnection can be implemented using shift registers. Among the three types of interconnection, the dynamic interconnection has high performance but greater power consumption, the crossbar consumes the least resources and power but has a long wire delay, and the complete binary tree has simpler routing, which improves performance at the expense of greater area usage.

4. EXPERIMENTAL RESULTS AND ANALYSIS

4.1. Experimental Setup

In this section, we present a detailed analysis of several implementation experiments in which the parameters are varied. All the designs were implemented in Verilog on a Virtex-7 FPGA (XC7VX980T, speed grade -2L) using Xilinx ISE. Inputs are 16-bit fixed point complex numbers. The input test vectors for simulation were randomly generated and had an average toggle rate of 50%. We used the VCD file (value change dump file) as input to Xilinx XPower Analyzer to produce accurate power dissipation estimates [14]. For all the evaluated designs, the operating frequency is set to 333 MHz.

4.2. Performance Metrics

Two metrics for performance evaluation are considered in this paper:

1. Energy efficiency is defined as the number of operations per unit energy (Energy efficiency = Number of operations / Energy). For N-point Radix-4 FFT, energy efficiency is given by $(\frac{17}{4} N \log_2 N) / \text{Energy}$. Energy is the product of the average power dissipation of the design and the latency of the FFT computation.
2. Energy Area Time (EAT) is measured as the product of three key metrics: energy, area, and time. For a given problem size, we use the EAT ratio to compare different designs. Area is the area usage of the design, i.e. the number of LUTs or flip-flops (whichever is larger) occupied by the entire design. BRAM blocks are converted to an equivalent number of LUTs based on the memory size, which gives the area of the BRAMs. Time is the latency of the pipelined N-point FFT.

4.3. Design space exploration

In this section, we first explore the design space by varying the algorithm mapping parameters. Then the parameter values are chosen according to the experimental results. Based on that, we explore the energy-efficient design (denoted the empirically optimized design) by varying the architecture parameters empirically. Both the dist. RAM based design and the BRAM based design are used in this experiment. The effects of the design parameters on energy efficiency are demonstrated using the proposed performance metrics.

Fig. 5: Energy efficiency (Giga operations/Joule) for various $H_p$ with varying problem size N for the dist. RAM based design

Fig. 6: Energy efficiency (Giga operations/Joule) for various $H_p$ with varying problem size N for the BRAM based design

Fig. 7: Energy efficiency (Giga operations/Joule) for various $V_p$ with varying problem size N for the dist. RAM based design

Fig. 8: Energy efficiency (Giga operations/Joule) for various $V_p$ with varying problem size N for the BRAM based design

4.3.1. Algorithm mapping level exploration

A. Horizontal Parallelism

In this experiment, we explored energy efficiency while varying horizontal parallelism, with $V_p = 4$, $N_c = 4$, $N_p = 1$. The range of $H_p$ is $[1, \log_4 N]$. The energy efficiency for various $H_p$ is shown in Fig. 5 and Fig. 6 for the dist. RAM based and BRAM based designs, respectively. Based on the experimental results, we make the following observations:

- For the considered problem sizes, increasing $H_p$ can significantly improve energy efficiency for all designs. Despite the extra hardware required to unfold the FFT horizontally, the reduced latency of the FFT computation enables the design to outperform the original design.
- As N grows, the energy efficiency of the dist. RAM based design declines, whereas that of the BRAM based design increases. The reason is that dist. RAM power increases significantly with memory size, whereas BRAM power is mainly determined by the number of BRAM blocks used [14].
- For the dist. RAM based design, the improvement in energy efficiency brought by increasing $H_p$ is sensitive to N.
  For example, when N = 1024, doubling $H_p$ only leads to a slight increase in energy efficiency. Thus, reducing $H_p$ to save area could be a feasible alternative for larger problem sizes.
- The improvement in energy efficiency brought by increasing $H_p$ for BRAM based designs is not sensitive to N. Reducing $H_p$ to save area can lead to a significant decline in energy efficiency for any problem size.

B. Vertical Parallelism

Vertical parallelism is determined by three values: the radix (fixed at 4), $N_c$, and $N_p$. $H_p$ was set to $\log_4 N$, and $N_c$ and $N_p$ were varied for evaluation. Both dist. RAM and BRAM based designs were evaluated. The energy efficiency for various $V_p$ is shown in Fig. 7 and Fig. 8. Note that the maximum $V_p$ is limited by the available number of I/O pins. In this experiment, we make the following observations:

- The BRAM based design is more scalable than the dist. RAM based design with respect to energy efficiency. When $N \ge 64$, energy efficiency starts to decline for dist. RAM based designs due to high power consumption with increasing memory size.
- Increasing $N_c$ instead of $N_p$ can improve energy efficiency with fewer hardware resources, since increasing $N_c$ only requires extra data buffers while increasing $N_p$ requires replicating the full pipeline.
- Given a loose area constraint, we can improve energy efficiency and throughput by increasing $N_p$. Although increasing $N_p$ leads to high power and resource consumption, the boosted throughput can offset these disadvantages.

4.3.2. Architecture level exploration

In this section, we explore an energy efficient design (the empirically optimized design) at the architecture level. We choose $V_p = 4$ and $H_p = \log_4 N$ based on the conclusions from the previous experiments.

A. Energy hot spots

Fig. 9: Power profile of the 1024-point FFT architecture. (a) Dist. RAM based design: data buffer 72%, static power 9%, radix block 4%, remainder split between I/O power, PER and twiddle factor computation. (b) BRAM based design: data buffer 28%, I/O power 26%, static power 13%, radix block 13%, PER 8%.

As shown in Fig. 9a, the dominant portion of the total power is consumed by the data buffers for 1024-point FFT. This indicates that BRAM can be utilized to improve energy efficiency for large values of N. Fig. 9b shows that the share of I/O power and static power in the total power increases significantly for BRAM based designs. As I/O power and static power are constant here, this indicates that using BRAMs reduces the power of the main design components. It also suggests that I/Os consume a large portion of the power in the BRAM based design.

B. Empirically optimized design

We first analyze the effects of the architecture parameters on energy, performance, and area. The analysis below was applied to choose the architecture parameter values for the empirically optimized design in our experiment:

- Energy: Reducing the number of registers can significantly reduce signal power, which dominates the dynamic power. A crossbar interconnection can therefore be used to increase energy efficiency.
- Performance: Using BRAM can lead to a decline in peak operating frequency. For large values of N, when using BRAMs, extra pipeline stages can be used to counter the performance degradation.
- Area: The area usage of pipeline registers dominates the total design area. The number of pipeline registers can be balanced to trade off area against performance.

As shown in Table 2, we use two baseline designs to compare with our proposed empirically optimized design. The architecture parameters of the designs for comparison are shown in Table 2.

Table 2: Architecture parameters of designs for comparison

    Parameter                     Design A     Empirically optimized design   Design C
    Interconnection type          Dynamic      Crossbar                       Complete binary tree
    Interconnection components    Registers    LUTs                           LUTs + Registers
    Pipeline stages, multiplier   5            3                              2
    Pipeline stages, adder        2            2                              1

    Memory type: … or BRAM; BRAM

Fig. 10: Energy efficiency (Giga operations/Joule) of the empirically optimized design and the baseline designs (Design A, Design C) versus problem size

The comparison results for energy efficiency are shown in Fig. 10. The energy efficiency is improved by up to 27% by the proposed empirically optimized design, compared with the other two baseline designs.

4.4. Performance comparison

We use the SPIRAL FFT IP core to compare with our proposed empirically optimized design. The SPIRAL FFT IP cores are high performance FFT designs based on a streaming architecture. The data permutation block in their designs has been mathematically proved to be control-cost optimal [15]. Using the provided tools, customized FFT soft IP cores can be automatically generated in synthesizable RTL Verilog from user inputs [11]. The available parameters of the DFT core generator include transform size, data precision, and streaming width. In this comparison, we use the dist. RAM based design for $N \le 64$ and the BRAM based design for $N > 64$. For the design from SPIRAL, the code for N-point (16-bit fixed point) FFT is automatically generated by the SPIRAL core generator. The architecture is fully streaming and the data are presented in their natural order.

Fig. 11: Comparison between the proposed empirically optimized design and the SPIRAL FFT IP cores for EAT and energy efficiency (energy efficiency in Giga operations/Joule; EAT ratio = EAT of SPIRAL IP core / EAT of our design; versus problem size)

As shown in Fig. 11, our proposed design improves energy efficiency by 8% to 28% and EAT by 23% to 38%, compared with the SPIRAL FFT IP cores.

5. CONCLUSION

We presented a parameterized architecture for energy efficient implementation of the Radix-4 Cooley-Tukey FFT algorithm. The effect of the two-level parameters on energy efficiency was demonstrated using design space exploration. We studied the power consumption of the components for various problem sizes, and proposed our empirically optimized design by empirical selection of architecture parameter values. Compared with the state-of-the-art design, our optimized architectures achieve up to 28% and 38% improvement in energy efficiency and EAT, respectively. In the future we plan to work on an accurate high-level performance model for energy-efficiency estimation, which can be used to accelerate design space exploration to obtain an energy efficient design.
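The improvement figures above are expressed in the two metrics of Section 4.2. As a rough illustration of how those metrics combine measured quantities, the sketch below computes energy efficiency and EAT from power, latency and resource counts. The function names are hypothetical, and the numeric inputs in the example are arbitrary placeholders rather than measurements from this work.

```python
import math


def fft_operations(n: int) -> float:
    """Operation count used for N-point Radix-4 FFT: (17/4) * N * log2(N)."""
    return 17 / 4 * n * math.log2(n)


def energy_joules(avg_power_w: float, latency_s: float) -> float:
    """Energy = average power dissipation * latency of the FFT computation."""
    return avg_power_w * latency_s


def energy_efficiency(n: int, avg_power_w: float, latency_s: float) -> float:
    """Operations per Joule for one N-point FFT."""
    return fft_operations(n) / energy_joules(avg_power_w, latency_s)


def area(luts: int, flip_flops: int, bram_equivalent_luts: int = 0) -> int:
    """Area = max(LUTs, flip-flops); BRAM is counted as equivalent LUTs."""
    return max(luts + bram_equivalent_luts, flip_flops)


def eat(avg_power_w: float, latency_s: float, area_units: int) -> float:
    """Energy * Area * Time composite metric (lower is better)."""
    return energy_joules(avg_power_w, latency_s) * area_units * latency_s


def eat_ratio(eat_baseline: float, eat_ours: float) -> float:
    """EAT of the baseline divided by EAT of the evaluated design."""
    return eat_baseline / eat_ours


if __name__ == "__main__":
    # Placeholder inputs, not values from the paper.
    n, power_w, latency_s = 1024, 2.0, 1e-6
    print("operations:", fft_operations(n))
    print("ops per Joule:", energy_efficiency(n, power_w, latency_s))
    print("EAT:", eat(power_w, latency_s, area(20000, 15000)))
```

Because EAT multiplies energy by latency again, a design that shortens latency at constant power improves EAT quadratically in time, which is one reason the two metrics can rank designs differently.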
6. REFERENCES

[1] N. Shirazi, P. M. Athanas, and A. L. Abbott, "Implementation of a 2-D Fast Fourier Transform on an FPGA-Based Custom Computing Machine," in Proceedings of Field-Programmable Logic and Applications, 1995.
[2] D. Chen, G. Yao, C. Koc, and R. Cheung, "Low complexity and hardware-friendly spectral modular multiplication," in Proceedings of Field-Programmable Technology (FPT), 2012.
[3] E. H. Wold and A. M. Despain, "Pipeline and parallel-pipeline FFT processors for VLSI implementations," IEEE Transactions on Computers, no. 5.
[4] S. He and M. Torkelson, "A new approach to pipeline FFT processor," in Proceedings of IPPS '96.
[5] G. Bi and E. Jones, "A pipelined FFT processor for word-sequential data," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 37, no. 12.
[6] L. R. Rabiner and B. Gold, Theory and Application of Digital Signal Processing, Englewood Cliffs, NJ: Prentice-Hall, Inc., vol. 1.
[7] S. Choi, R. Scrofano, V. K. Prasanna, and J.-W. Jang, "Energy-efficient signal processing using FPGAs," in Proceedings of FPGA '03, 2003.
[8] D. Aravind and A. Sudarsanam, "High level - application analysis techniques architectures - to explore design possibilities for reduced reconfiguration area overheads in FPGAs executing compute intensive applications," in Proceedings of IPDPS, 2005, pp. 158a-158a.
[9] B. Baas, "A low-power, high-performance, 1024-point FFT processor," IEEE Journal of Solid-State Circuits, vol. 34, no. 3.
[10] W.-C. Yeh and C.-W. Jen, "High-speed and low-power split-radix FFT," IEEE Transactions on Signal Processing, vol. 51, no. 3, 2003.
[11] G. Nordin, P. A. Milder, J. C. Hoe, and M. Püschel, "Automatic generation of customized Discrete Fourier Transform IPs," in Proceedings of Design Automation Conference (DAC), 2005.
[12] T. Sugimura, H. Yamasaki, H. Noda, O. Yamamoto, Y. Okuno, and K. Arimoto, "A high-performance and energy-efficient FFT implementation on super parallel processor (MX) for mobile multimedia applications," in Proceedings of Intelligent Signal Processing and Communications Systems, 2009.
[13] H. Kimura, H. Nakamura, S. Kimura, and N. Yoshimoto, "Numerical analysis of dynamic SNR management by controlling DSP calculation precision for energy-efficient OFDM-PON," IEEE Photonics Technology Letters, vol. 24, no. 23, 2012.
[14] XST User Guide for Virtex-6, Spartan-6, and 7 Series Devices.
[15] M. Püschel, P. A. Milder, and J. C. Hoe, "Permuting streaming data using RAMs," Journal of the ACM, vol. 56, no. 2, pp. 10:1-10:34, 2009.

ENERGY EFFICIENT PARAMETERIZED FFT ARCHITECTURE. Ren Chen, Hoang Le, and Viktor K. Prasanna

ENERGY EFFICIENT PARAMETERIZED FFT ARCHITECTURE. Ren Chen, Hoang Le, and Viktor K. Prasanna ENERGY EFFICIENT PARAMETERIZED FFT ARCHITECTURE Ren Chen, Hoang Le, and Viktor K. Prasanna Ming Hsieh Department of Electrical Engineering University of Southern California, Los Angeles, USA 989 Email:

More information

High Throughput Energy Efficient Parallel FFT Architecture on FPGAs

High Throughput Energy Efficient Parallel FFT Architecture on FPGAs High Throughput Energy Efficient Parallel FFT Architecture on FPGAs Ren Chen Ming Hsieh Department of Electrical Engineering University of Southern California Los Angeles, USA 989 Email: renchen@usc.edu

More information

Energy Optimizations for FPGA-based 2-D FFT Architecture

Energy Optimizations for FPGA-based 2-D FFT Architecture Energy Optimizations for FPGA-based 2-D FFT Architecture Ren Chen and Viktor K. Prasanna Ming Hsieh Department of Electrical Engineering University of Southern California Ganges.usc.edu/wiki/TAPAS Outline

More information

Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System

Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System Chi Zhang, Viktor K Prasanna University of Southern California {zhan527, prasanna}@usc.edu fpga.usc.edu ACM

More information

Scalable and Dynamically Updatable Lookup Engine for Decision-trees on FPGA

Scalable and Dynamically Updatable Lookup Engine for Decision-trees on FPGA Scalable and Dynamically Updatable Lookup Engine for Decision-trees on FPGA Yun R. Qu, Viktor K. Prasanna Ming Hsieh Dept. of Electrical Engineering University of Southern California Los Angeles, CA 90089

More information

Analysis of Radix- SDF Pipeline FFT Architecture in VLSI Using Chip Scope

Analysis of Radix- SDF Pipeline FFT Architecture in VLSI Using Chip Scope Analysis of Radix- SDF Pipeline FFT Architecture in VLSI Using Chip Scope G. Mohana Durga 1, D.V.R. Mohan 2 1 M.Tech Student, 2 Professor, Department of ECE, SRKR Engineering College, Bhimavaram, Andhra

More information

OPTIMIZING INTERCONNECTION COMPLEXITY FOR REALIZING FIXED PERMUTATION IN DATA AND SIGNAL PROCESSING ALGORITHMS. Ren Chen and Viktor K.

OPTIMIZING INTERCONNECTION COMPLEXITY FOR REALIZING FIXED PERMUTATION IN DATA AND SIGNAL PROCESSING ALGORITHMS. Ren Chen and Viktor K. OPTIMIZING INTERCONNECTION COMPLEXITY FOR REALIZING FIXED PERMUTATION IN DATA AND SIGNAL PROCESSING ALGORITHMS Ren Chen and Viktor K. Prasanna Ming Hsieh Department of Electrical Engineering University

More information

Evaluating Energy Efficiency of Floating Point Matrix Multiplication on FPGAs

Evaluating Energy Efficiency of Floating Point Matrix Multiplication on FPGAs Evaluating Energy Efficiency of Floating Point Matrix Multiplication on FPGAs Kiran Kumar Matam Computer Science Department University of Southern California Email: kmatam@usc.edu Hoang Le and Viktor K.

More information

A Methodology for Energy Efficient FPGA Designs Using Malleable Algorithms

Jingzhao Ou and Viktor K. Prasanna Department of Electrical Engineering, University of Southern California Los Angeles, California,

The Serial Commutator FFT

Mario Garrido Gálvez, Shen-Jui Huang, Sau-Gee Chen and Oscar Gustafsson Journal Article N.B.: When citing this work, cite the original article. 2016 IEEE. Personal use of this

ON CONFIGURATION OF RESIDUE SCALING PROCESS IN PIPELINED RADIX-4 MQRNS FFT PROCESSOR

POZNAN UNIVERSITY OF TECHNOLOGY ACADEMIC JOURNALS No 80 Electrical Engineering 2014 Robert SMYK* Maciej CZYŻAK* Residue

AN FFT PROCESSOR BASED ON 16-POINT MODULE

Weidong Li, Mark Vesterbacka and Lars Wanhammar Electronics Systems, Dept. of EE., Linköping University SE-58 8 LINKÖPING, SWEDEN E-mail: {weidongl, markv, larsw}@isy.liu.se,

Linköping University Post Print. Analysis of Twiddle Factor Memory Complexity of Radix-2^i Pipelined FFTs

Fahad Qureshi and Oscar Gustafsson N.B.: When citing this work, cite the original article. 200 IEEE. Personal

Novel design of multiplier-less FFT processors

Signal Processing 8 (00) 140 140 www.elsevier.com/locate/sigpro Yuan Zhou, J.M. Noras, S.J. Shepherd School of EDT, University of Bradford, Bradford, West

TOPICS PIPELINE IMPLEMENTATIONS OF THE FAST FOURIER TRANSFORM (FFT) DISCRETE FOURIER TRANSFORM (DFT) INVERSE DFT (IDFT) Consulted work:

Chiueh, T.D. and P.Y. Tsai, OFDM Baseband Receiver Design for Wireless Communications, John Wiley and Sons Asia, (2007). Second

Towards Performance Modeling of 3D Memory Integrated FPGA Architectures

Shreyas G. Singapura, Anand Panangadan and Viktor K. Prasanna University of Southern California, Los Angeles CA 90089, USA, {singapur,

Abstract. Literature Survey. Introduction. A.Radix-2/8 FFT algorithm for length qx2 m DFTs

Implementation of Split Radix algorithm for length 6 m DFT using VLSI. J. Nancy, PG Scholar, PSNA College of Engineering and Technology; S. Bharath, Assistant Professor, PSNA College of Engineering and Technology; J. Wilson, Assistant

MULTIPLIERLESS HIGH PERFORMANCE FFT COMPUTATION

Maheshwari.U 1, Josephine Sugan Priya. 2, 1 PG Student, Dept Of Communication Systems Engg, Idhaya Engg. College For Women, 2 Asst Prof, Dept Of Communication

Low-Power Split-Radix FFT Processors Using Radix-2 Butterfly Units

Abstract: Split-radix fast Fourier transform (SRFFT) is an ideal candidate for the implementation of a low-power FFT processor, because

Research Article International Journal of Emerging Research in Management &Technology ISSN: (Volume-6, Issue-8) Abstract:

International Journal of Emerging Research in Management & Technology, Research Article, August 27. Design and Implementation of Fast Fourier Transform (FFT) using VHDL Code. Akarshika Singhal, Anjana Goen,

Keywords: Fast Fourier Transforms (FFT), Multipath Delay Commutator (MDC), Pipelined Architecture, Radix-2 k, VLSI.

ww.semargroup.org www.ijvdcs.org ISSN 2322-0929 Vol.02, Issue.05, August-2014, Pages:0294-0298 Radix-2 k Feed Forward FFT Architectures K.KIRAN KUMAR 1, M.MADHU BABU 2 1 PG Scholar, Dept of VLSI & ES,

DESIGN OF PARALLEL PIPELINED FEED FORWARD ARCHITECTURE FOR ZERO FREQUENCY & MINIMUM COMPUTATION (ZMC) ALGORITHM OF FFT

IMPACT: International Journal of Research in Engineering & Technology (IMPACT: IJRET) ISSN(E): 2321-8843; ISSN(P): 2347-4599 Vol. 2, Issue 4, Apr 2014, 199-206 Impact Journals

Modified Welch Power Spectral Density Computation with Fast Fourier Transform

Sreelekha S 1, Sabi S 2 1 Department of Electronics and Communication, Sree Budha College of Engineering, Kerala, India 2 Professor,

Analysis of High-performance Floating-point Arithmetic on FPGAs

Gokul Govindu, Ling Zhuo, Seonil Choi and Viktor Prasanna Dept. of Electrical Engineering University of Southern California Los Angeles,

FPGA Based Design and Simulation of 32- Point FFT Through Radix-2 DIT Algorith

Sudhanshu Mohan Khare M.Tech (pursuing), Dept. of ECE Laxmi Naraian College of Technology, Bhopal, India M. Zahid Alam Associate

Research Article Design of A Novel 8-point Modified R2MDC with Pipelined Technique for High Speed OFDM Applications

Research Journal of Applied Sciences, Engineering and Technology 7(23): 5021-5025, 2014 DOI:10.19026/rjaset.7.895 ISSN: 2040-7459; e-issn: 2040-7467 2014 Maxwell Scientific Publication Corp. Submitted:

Resource-efficient Acceleration of 2-Dimensional Fast Fourier Transform Computations on FPGAs

In Proceedings of the International Conference on Distributed Smart Cameras, Como, Italy, August 2009. Hojin

Computing the Discrete Fourier Transform on FPGA Based Systolic Arrays

Chris Dick School of Electronic Engineering La Trobe University Melbourne 3083, Australia Abstract Reconfigurable logic arrays allow

FPGA Matrix Multiplier

In Hwan Baek Henri Samueli School of Engineering and Applied Science University of California Los Angeles Los Angeles, California Email: chris.inhwan.baek@gmail.com David Boeck Henri

Multi-Gigahertz Parallel FFTs for FPGA and ASIC Implementation

Doug Johnson, Applications Consultant Chris Eddington, Technical Marketing Synopsys 2013 1 Synopsys, Inc. 700 E. Middlefield Road Mountain

User Manual for FC100

Sundance Multiprocessor Technology Limited User Manual Form: QCF42 Date: 6 July 2006 Unit / Module Description: IEEE-754 Floating-point FPGA IP Core Unit / Module Number: FC100 Document Issue Number:

Energy and Memory Efficient Mapping of Bitonic Sorting on FPGA

Ren Chen, Sruja Siriyal, Viktor Prasanna Ming Hsieh Department of Electrical Engineering University of Southern California, Los Angeles, USA

Core Facts. Documentation Design File Formats. Verification Instantiation Templates Reference Designs & Application Notes Additional Items

(FFT_PIPE) Product Specification Dillon Engineering, Inc. 4974 Lincoln Drive Edina, MN USA, 55436 Phone: 952.836.2413 Fax: 952.927.6514 E-mail: info@dilloneng.com URL: www.dilloneng.com

STUDY OF A CORDIC BASED RADIX-4 FFT PROCESSOR

1 AJAY S. PADEKAR, 2 S. S. BELSARE 1 BVDU, College of Engineering, Pune, India 2 Department of E & TC, BVDU, College of Engineering, Pune, India E-mail: ajay.padekar@gmail.com,

Computer Generation of IP Cores

Peter Milder (ECE, Carnegie Mellon) James Hoe (ECE, Carnegie Mellon) Markus Püschel (CS, ETH Zürich) addfxp #(16, 1) add15282(.a(a69), .b(a70), .clk(clk), .q(t45)); addfxp

SFF The Single-Stream FPGA-Optimized Feedforward FFT Hardware Architecture

Journal of Signal Processing Systems (2018) 90:1583-1592 https://doi.org/10.1007/s11265-018-1370-y Carl Ingemarsson 1 Oscar Gustafsson

A Model-based Methodology for Application Specific Energy Efficient Data Path Design using FPGAs

Sumit Mohanty 1, Seonil Choi 1, Ju-wook Jang 2, Viktor K. Prasanna 1 1 Dept. of Electrical Engg. 2 Dept.

A SIMULINK-TO-FPGA MULTI-RATE HIERARCHICAL FIR FILTER DESIGN

Xiaoying Li 1 Fuming Sun 2 Enhua Wu 1, 3 1 University of Macau, Macao, China 2 University of Science and Technology Beijing, Beijing, China

Implementation of FFT Processor using Urdhva Tiryakbhyam Sutra of Vedic Mathematics

Yojana Jadhav 1, A.P. Hatkar 2 PG Student [VLSI & Embedded system], Dept. of ECE, S.V.I.T Engineering College, Chincholi,

DESIGN METHODOLOGY. 5.1 General

5 FFT DESIGN METHODOLOGY 5.1 General The fast Fourier transform is used to deliver a fast approach for the processing of data in the wireless transmission. The Fast Fourier Transform is one of the methods

PyGen: A MATLAB/Simulink Based Tool for Synthesizing Parameterized and Energy Efficient Designs Using FPGAs

Jingzhao Ou and Viktor K. Prasanna Department of Electrical Engineering, University of Southern

A 4096-Point Radix-4 Memory-Based FFT Using DSP Slices

Mario Garrido Gálvez, Miguel Angel Sanchez, Maria Luisa Lopez-Vallejo and Jesus Grajal Journal Article N.B.: When citing this work, cite the original

A scalable, fixed-shuffling, parallel FFT butterfly processing architecture for SDR environment

LETTER IEICE Electronics Express, Vol.11, No.2, 1-9 Ting Chen a), Hengzhu Liu, and Botao Zhang College of

Image Compression System on an FPGA

Group 1: Megan Fuller, Ezzeldin Hamed. 6.375. Contents: 1 Objective; 2 Background; 2.1 The DFT; 2.2 The DCT

Improving Reconfiguration Speed for Dynamic Circuit Specialization using Placement Constraints

Amit Kulkarni, Tom Davidson, Karel Heyse, and Dirk Stroobandt ELIS department, Computer Systems Lab, Ghent

High-throughput Online Hash Table on FPGA*

Da Tong, Shijie Zhou, Viktor K. Prasanna Ming Hsieh Dept. of Electrical Engineering University of Southern California Los Angeles, CA 989 Email: datong@usc.edu,

INTRODUCTION TO FPGA ARCHITECTURE

3/3/25 DIGITAL LOGIC DESIGN (BASIC TECHNIQUES) 2-input Black Box, Functional Schematic; Truth Table (AND), Truth Table (OR), Truth Table (XOR)

Efficient Self-Reconfigurable Implementations Using On-Chip Memory

10th International Conference on Field Programmable Logic and Applications, August 2000. Sameer Wadhwa and Andreas Dandalis University

An Efficient High Speed VLSI Architecture Based 16-Point Adaptive Split Radix-2 FFT Architecture

IJSTE - International Journal of Science Technology & Engineering Volume 2 Issue 10 April 2016 ISSN (online): 2349-784X

Dynamically Configurable Online Statistical Flow Feature Extractor on FPGA

Da Tong, Viktor Prasanna Ming Hsieh Department of Electrical Engineering University of Southern California Email: {datong, prasanna}@usc.edu

PERFORMANCE ANALYSIS OF HIGH EFFICIENCY LOW DENSITY PARITY-CHECK CODE DECODER FOR LOW POWER APPLICATIONS

American Journal of Applied Sciences 11 (4): 558-563, 2014 ISSN: 1546-9239 2014 Science Publication doi:10.3844/ajassp.2014.558.563 Published Online 11 (4) 2014 (http://www.thescipub.com/ajas.toc)

FCUDA-NoC: A Scalable and Efficient Network-on-Chip Implementation for the CUDA-to-FPGA Flow

Abstract: High-level synthesis (HLS) of data-parallel input languages, such as the Compute Unified Device Architecture

A High-Performance and Energy-efficient Architecture for Floating-point based LU Decomposition on FPGAs

Gokul Govindu, Seonil Choi, Viktor Prasanna Dept. of Electrical Engineering-Systems University of

Twiddle Factor Transformation for Pipelined FFT Processing

In-Cheol Park, WonHee Son, and Ji-Hoon Kim School of EECS, Korea Advanced Institute of Science and Technology, Daejeon, Korea icpark@ee.kaist.ac.kr,

Parallelism in Spiral

Franz Franchetti and the Spiral team (only part shown) Electrical and Computer Engineering Carnegie Mellon University Joint work with Yevgen Voronenko Markus Püschel This work was

LOW-POWER SPLIT-RADIX FFT PROCESSORS

Avinash 1, Manjunath Managuli 2, Suresh Babu D 3 ABSTRACT To design a split radix fast Fourier transform is an ideal person for the implementing of a low-power FFT

Core Facts. Documentation Design File Formats. Verification Instantiation Templates Reference Designs & Application Notes Additional Items

(FFT_MIXED) November 26, 2008 Product Specification Dillon Engineering, Inc. 4974 Lincoln Drive Edina, MN USA, 55436 Phone: 952.836.2413 Fax: 952.927.6514 E-mail: info@dilloneng.com URL: www.dilloneng.com

FAST FOURIER TRANSFORM (FFT) and inverse fast

IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 39, NO. 11, NOVEMBER 2004 2005 A Dynamic Scaling FFT Processor for DVB-T Applications Yu-Wei Lin, Hsuan-Yu Liu, and Chen-Yi Lee Abstract This paper presents an

An Area Efficient Mixed Decimation MDF Architecture for Radix. Parallel FFT

Reshma K J 1, Prof. Ebin M Manuel 2 1 M-Tech, Dept. of ECE Engineering, Government Engineering College, Idukki, Kerala, India 2 Professor,

Fused Floating Point Arithmetic Unit for Radix 2 FFT Implementation

IOSR Journal of VLSI and Signal Processing (IOSR-JVSP) Volume 6, Issue 2, Ver. I (Mar.-Apr. 2016), PP 58-65 e-issn: 2319-4200, p-issn No.: 2319-4197 www.iosrjournals.org

High-Performance 16-Point Complex FFT Features 1 Functional Description 2 Theory of Operation

April 8, 1999 Application Note This document is (c) Xilinx, Inc. 1999. No part of this file may be modified, transmitted to any third party (other than as intended

High Speed Systolic Montgomery Modular Multipliers for RSA Cryptosystems

RAVI KUMAR SATZODA, CHIP-HONG CHANG and CHING-CHUEN JONG Centre for High Performance Embedded Systems Nanyang Technological University

Adaptive FIR Filter Using Distributed Airthmetic for Area Efficient Design

International Journal of Scientific and Research Publications, Volume 5, Issue 1, January 2015 Manish Kumar *, Dr. R.Ramesh

Energy Efficient Adaptive Beamforming on Sensor Networks

Viktor K. Prasanna Bhargava Gundala, Mitali Singh Dept. of EE-Systems University of Southern California email: prasanna@usc.edu http://ceng.usc.edu/~prasanna

DUE to the high computational complexity and real-time

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 15, NO. 3, MARCH 2005 445 A Memory-Efficient Realization of Cyclic Convolution and Its Application to Discrete Cosine Transform Hun-Chen

Low Power and Memory Efficient FFT Architecture Using Modified CORDIC Algorithm

1 A.Malashri, 2 C.Paramasivam 1 PG Student, Department of Electronics and Communication K S Rangasamy College Of Technology,

A HIGH PERFORMANCE FIR FILTER ARCHITECTURE FOR FIXED AND RECONFIGURABLE APPLICATIONS

Saba Gouhar 1 G. Aruna 2 gouhar.saba@gmail.com 1 arunastefen@gmail.com 2 1 PG Scholar, Department of ECE, Shadan Women

Parallel graph traversal for FPGA

LETTER IEICE Electronics Express, Vol.11, No.7, 1-6 Shice Ni a), Yong Dou, Dan Zou, Rongchun Li, and Qiang Wang National Laboratory for Parallel and Distributed Processing,

High-Speed and Low-Power Split-Radix FFT

864 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 51, NO. 3, MARCH 2003 Wen-Chang Yeh and Chein-Wei Jen Abstract This paper presents a novel split-radix fast Fourier

FPGA Implementation of 16-Point Radix-4 Complex FFT Core Using NEDA

Abhishek Mankar, Ansuman Diptisankar Das and N Prasad Abstract: NEDA is one of the techniques to implement many digital signal processing systems that

Verilog for High Performance

Course Description This course provides all necessary theoretical and practical know-how to write synthesizable HDL code through Verilog standard language. The course goes

DESIGN AND IMPLEMENTATION OF DA- BASED RECONFIGURABLE FIR DIGITAL FILTER USING VERILOGHDL

[1] J. SOUJANYA, P.G. SCHOLAR, KSHATRIYA COLLEGE OF ENGINEERING, NIZAMABAD [2] MR. DEVENDHER KANOOR, M.TECH, ASSISTANT

IEEE-754 compliant Algorithms for Fast Multiplication of Double Precision Floating Point Numbers

International Journal of Research in Computer Science ISSN 2249-8257 Volume 1 Issue 1 (2011) pp. 1-7 White Globe Publications www.ijorcs.org

A SCALABLE COMPUTING AND MEMORY ARCHITECTURE FOR VARIABLE BLOCK SIZE MOTION ESTIMATION ON FIELD-PROGRAMMABLE GATE ARRAYS. Theepan Moorthy and Andy Ye

Theepan Moorthy and Andy Ye Department of Electrical and Computer Engineering Ryerson

Comparison of Adders for optimized Exponent Addition circuit in IEEE754 Floating point multiplier using VHDL

International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 11, Issue 07 (July 2015), PP.60-65

High Performance Pipelined Design for FFT Processor based on FPGA

A.A. Raut 1, S. M. Kate 2 1 Sinhgad Institute of Technology, Lonavala, Pune University, India 2 Sinhgad Institute of Technology, Lonavala,

DESIGN OF AN FFT PROCESSOR

Erik Nordhamn, Björn Sikström and Lars Wanhammar Department of Electrical Engineering Linköping University S-581 83 Linköping, Sweden Abstract In this paper we present a structured

Digital Signal Processing for Analog Input

Arnav Agharwal Saurabh Gupta April 25, 2009 Final Report 1 Objective The object of the project was to implement a Fast Fourier Transform. We implemented the Radix

FPGA Implementation of Discrete Fourier Transform Using CORDIC Algorithm

AMSE JOURNALS-AMSE IIETA publication-2017-series: Advances B; Vol. 60; N 2; pp 332-337 Submitted Apr. 04, 2017; Revised Sept. 25, 2017; Accepted Sept. 30, 2017

Parallelized Radix-4 Scalable Montgomery Multipliers

Nathaniel Pinckney and David Money Harris 1 1 Harvey Mudd College, 301 Platt. Blvd., Claremont, CA, USA e-mail: npinckney@hmc.edu ABSTRACT This paper

A Normal I/O Order Radix-2 FFT Architecture to Process Twin Data Streams for MIMO

2402 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 24, NO. 6, JUNE 2016 Antony Xavier Glittas,

Core Facts. Documentation Design File Formats. Verification Instantiation Templates Reference Designs & Application Notes Additional Items

(ULFFT) November 3, 2008 Product Specification Dillon Engineering, Inc. 4974 Lincoln Drive Edina, MN USA, 55436 Phone: 952.836.2413 Fax: 952.927.6514 E-mail: info@dilloneng.com URL: www.dilloneng.com

Basic FPGA Architectures. Actel FPGAs. PLD Technologies: Antifuse. 3 Digital Systems Implementation Programmable Logic Devices

3 Digital Systems Implementation Programmable Logic Devices Basic FPGA Architectures Why Programmable Logic Devices (PLDs)? Low cost, low risk way of implementing digital circuits as application specific

Design of Delay Efficient Distributed Arithmetic Based Split Radix FFT

Nisha Laguri #1, K. Anusudha *2 #1 M.Tech Student, Electronics, Department of Electronics Engineering, Pondicherry University, Puducherry, India

FPGA IMPLEMENTATION FOR REAL TIME SOBEL EDGE DETECTOR BLOCK USING 3-LINE BUFFERS

1 RONNIE O. SERFA JUAN, 2 CHAN SU PARK, 3 HI SEOK KIM, 4 HYEONG WOO CHA 1,2,3,4 CheongJu University E-mail: 1 engr_serfs@yahoo.com,

Creating Parameterized and Energy-Efficient System Generator Designs

Jingzhao Ou, Seonil Choi, Gokul Govindu, and Viktor K. Prasanna EE - Systems, University of Southern California {ouj,seonilch,govindu,prasanna}@usc.edu

55 Streaming Sorting Networks

MARCELA ZULUAGA, Department of Computer Science, ETH Zurich PETER MILDER, Department of Electrical and Computer Engineering, Stony Brook University MARKUS PÜSCHEL, Department

Accelerating Equi-Join on a CPU-FPGA Heterogeneous Platform*

Ren Chen and Viktor K. Prasanna Ming Hsieh Department of Electrical Engineering University of Southern California, Los Angeles, USA 90089 Email:

Hardware Description of Multi-Directional Fast Sobel Edge Detection Processor by VHDL for Implementing on FPGA

Arash Nosrat Faculty of Engineering Shahid Chamran University Ahvaz, Iran Yousef S. Kavian

RUN-TIME RECONFIGURABLE IMPLEMENTATION OF DSP ALGORITHMS USING DISTRIBUTED ARITHMETIC. Zoltan Baruch

Zoltan Baruch Computer Science Department, Technical University of Cluj-Napoca, 26-28, Bariţiu St., 3400 Cluj-Napoca,

FPGAs: THE HIGH-END ALTERNATIVE FOR DSP APPLICATIONS. By Dr. Chris Dick

Engineers have been using field programmable gate arrays (FPGAs) to build high performance DSP systems for several years. FPGAs are uniquely

A Software LDPC Decoder Implemented on a Many-Core Array of Programmable Processors

Brent Bohnenstiehl and Bevan Baas Department of Electrical and Computer Engineering University of California, Davis {bvbohnen,

EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs)

September 12, 2002 John Wawrzynek Fall 2002 EECS150 - Lec06-FPGA Page 1 Outline What are FPGAs? Why use FPGAs (a short history

ISSN Vol.05,Issue.09, September-2017, Pages:

WWW.IJITECH.ORG ISSN 2321-8665 Vol.05,Issue.09, September-2017, Pages:1693-1697 AJJAM PUSHPA 1, C. H. RAMA MOHAN 2 1 PG Scholar, Dept of ECE(DECS), Shirdi Sai Institute of Science and Technology, Anantapuramu,

Overview. CSE372 Digital Systems Organization and Design Lab. Hardware CAD. Two Types of Chips

Prof. Milo Martin Unit 5: Hardware Synthesis CAD (Computer Aided Design) Use computers to design computers Virtuous cycle Architectural-level,

Design of a Floating-Point Fused Add-Subtract Unit Using Verilog

International Journal of Electronics and Computer Science Engineering 1007 Available Online at www.ijecse.org ISSN- 2277-1956 Mayank Sharma,

Outline. EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs) FPGA Overview. Why FPGAs?

September 12, 2002 John Wawrzynek Outline What are FPGAs? Why use FPGAs (a short history lesson). FPGA variations Internal logic

Accelerating Equi-Join on a CPU-FPGA Heterogeneous Platform

Ren Chen, Viktor Prasanna Computer Engineering Technical Report Number CENG-05- Ming Hsieh Department of Electrical Engineering Systems University

A Hybrid Approach to CAM-Based Longest Prefix Matching for IP Route Lookup

Yan Sun and Min Sik Kim School of Electrical Engineering and Computer Science Washington State University Pullman, Washington

Multiplierless Unity-Gain SDF FFTs

Mario Garrido Gálvez, Rikard Andersson, Fahad Qureshi and Oscar Gustafsson Journal Article N.B.: When citing this work, cite the original article. 216 IEEE. Personal