ENERGY EFFICIENT PARAMETERIZED FFT ARCHITECTURE. Ren Chen, Hoang Le, and Viktor K. Prasanna

Ming Hsieh Department of Electrical Engineering
University of Southern California, Los Angeles, USA
{renchen, hoangle,

ABSTRACT

Recently, there has been growing interest within the research community in improving energy efficiency. In this paper, we revisit the classic Fast Fourier Transform (FFT) for energy efficient designs on FPGAs. A parameterized FFT architecture is proposed to identify design trade-offs in achieving energy efficiency. We first perform design space exploration by varying the algorithm mapping parameters, such as the degree of vertical and horizontal parallelism, that characterize decomposition based FFT algorithms. After empirical selection of the values of the algorithm mapping parameters, an energy-performance-area trade-off design is identified by varying the architecture parameters, including the type of memory elements, the type of interconnection network, and the number of pipeline stages. The trade-offs between energy, area, and time are analyzed using two performance metrics: the Energy Area Time (EAT) composite metric and the energy efficiency (defined as the number of operations per Joule). From the experimental results, a design space is generated to demonstrate the effect of these parameters on the various performance metrics. For N-point FFT ($16 \le N \le 1024$), our designs achieve up to 28% and 38% improvement in energy efficiency and EAT, respectively, compared with a state-of-the-art design.

1. INTRODUCTION

FPGA is a promising implementation technology for computationally intensive applications such as signal, image, and network processing tasks [1, 2]. State-of-the-art FPGAs offer high operating frequency, unprecedented logic density and a host of other features.
As FPGAs are programmed specifically for the problem to be solved, they can achieve higher performance with lower power consumption than general purpose processors.

The Fast Fourier Transform (FFT) is one of the most frequently used kernels for computing the Discrete Fourier Transform (DFT) in a wide variety of image and signal processing applications. Various derivative FFT algorithms have been proposed and developed. The Radix-x Cooley-Tukey algorithm is one of the most popular algorithms for hardware implementation [3, 4, 5, 6]. Most hardware solutions for Radix-x FFT fall into one of two categories, delay feedback or delay commutator architectures [4], such as the Radix-$2^2$ single-path delay feedback FFT [4] and the Radix-4 single-path delay commutator FFT [5]. By focusing on circuit level optimizations, these solutions achieved improvement in throughput, area, or power.

Power is a key metric in computing today. To obtain an energy efficient design for FFT, we analyze the trade-offs between energy, area, and time for fixed-point FFT on a parameterized architecture, using the Cooley-Tukey algorithm. Energy efficiency can be obtained both at the algorithm mapping level and at the architecture level [7, 8]. Optimizing at these two levels allows power to be effectively traded off against other performance parameters. For example, a design consuming 2× the power but achieving 3× the system throughput is actually 50% more energy efficient than the original design, because the number of operations per Joule scales as throughput divided by power (3/2 = 1.5). We present the design space for the chosen architecture with respect to energy efficiency at the algorithm mapping level. An energy-performance-area trade-off design is achieved at the architecture level by empirical selection of the proposed architecture parameters.

In this paper, we make the following contributions:
1. A parameterized architecture of the Radix-4 Cooley-Tukey algorithm for FFT (Section 3.1).
2. A design space that demonstrates the effect of the parameters on the EAT and the energy efficiency metric (Section 4.3.2).
3. Improved energy efficiency of the proposed trade-off design, demonstrated by identifying the energy hot spots and varying the proposed architecture parameters (Section 4.3.2).
4. Optimized designs achieving significant improvement in energy efficiency compared with a state-of-the-art design (Section 4.4).

The rest of the paper is organized as follows. Section 2 covers the background and related work. Section 3 describes the proposed parameterized architecture and its implementation on FPGA. Section 4 presents experimental results and analysis. Section 5 concludes the paper.

This work has been funded by DARPA under grant number HR

2. BACKGROUND AND RELATED WORK

2.1. Background

Given N complex numbers $x_0, \ldots, x_{N-1}$, the DFT is computed as:

$$X_k = \sum_{n=0}^{N-1} x_n e^{-i 2\pi k n / N}, \quad k = 0, \ldots, N-1.$$

Radix-x Cooley-Tukey FFT is a well-known decomposition based algorithm for N-point DFT. Radix-4 FFT is employed in this paper and is described in Algorithm 1. In terms of the number of real operations, the computational complexity of N-point Radix-4 FFT is $O(N \log_4 N)$. The algorithm performs an N-point FFT in $N/m$ ($m < N$) cycles using $m$ Input/Output ports (I/Os) and $\log_4 N$ radix blocks, which are used for butterfly computations. The algorithm iteratively decomposes the entire problem into four subproblems. This feature enables us to map the algorithm by folding the FFT architecture vertically or horizontally, thus providing much freedom to implement various designs on FPGAs. We will propose our parameterized architecture in Section 3.2 based on this characteristic.

Algorithm 1 Radix-4 FFT Algorithm
    q = N/4; d = N/4;
    for p := 0 to log_4 N - 1 do
        for k := 0 to 4^p - 1 do
            l = 4kq/4^p; r = l + q/4^p - 1;
            tw_1 = w[k]; tw_2 = w[2k]; tw_3 = w[3k];
            for i := l to r do
                t_0 = i; t_1 = i + d/4^p; t_2 = i + 2d/4^p; t_3 = i + 3d/4^p;
                do parallel
                    f_{p+1}[t_0] = f_p[t_0] + f_p[t_1] + f_p[t_2] + f_p[t_3];
                    f_{p+1}[t_1] = f_p[t_0] - j f_p[t_1] - f_p[t_2] + j f_p[t_3];
                    f_{p+1}[t_2] = f_p[t_0] - f_p[t_1] + f_p[t_2] - f_p[t_3];
                    f_{p+1}[t_3] = f_p[t_0] + j f_p[t_1] - f_p[t_2] - j f_p[t_3];
                end parallel
                do parallel
                    f_{p+1}[t_0] = f_{p+1}[t_0];
                    f_{p+1}[t_1] = tw_1 f_{p+1}[t_1];
                    f_{p+1}[t_2] = tw_2 f_{p+1}[t_2];
                    f_{p+1}[t_3] = tw_3 f_{p+1}[t_3];
                end parallel
            end for
        end for
    end for

2.2. Related Work

To the best of our knowledge, there has been no previous work targeted at exploring the design space for energy efficiency of FFT at both the algorithm mapping level and the architecture level on FPGAs.
Existing work has mainly focused on optimizing the performance, power and area of the design at the circuit level. In [9], the authors designed an energy-efficient 1024-point FFT processor. A cache-based FFT algorithm was proposed for achieving low power and high performance, and an energy-time performance metric was evaluated at different processor operating points. In [10], a high-speed and low-power FFT architecture was presented: a delay balanced pipeline architecture based on the split-radix algorithm. Algorithm trade-offs for reducing computation complexity were explored and the architecture was evaluated in terms of area, power and timing performance.

Based on Radix-x FFT, various pipeline FFT architectures have been proposed, such as the Radix-2 single-path delay feedback FFT [3], Radix-4 single-path delay commutator FFT [5], Radix-2 multi-path delay commutator FFT [6], and Radix-$2^2$ single-path delay feedback FFT [4]. These architectures can achieve high throughput per unit area with single-path or multi-path pipelines, but energy efficiency was not explored or evaluated in these works.

In [11], a mathematical model for generating a DFT soft core was developed. This model can automatically produce an optimized design from user inputs on performance and resource constraints. The resource usage was estimated from the available parameters. However, power and performance estimation were not presented in that work. In [7], a parameterized FFT architecture for energy efficiency was presented. The optimized design was obtained by varying the chosen architecture parameters. Some energy efficient design techniques, such as clock gating and memory binding, were also employed in that work.

Beyond FPGAs, techniques for energy efficient FFT have also been presented for other platforms [12, 13]. However, it is not clear how to apply these techniques on FPGAs.
In this work, we extend the work of [7] with design space exploration for energy efficiency at different levels. The design space exploration is performed on current state-of-the-art FPGAs. By exploring energy efficiency at two levels, we obtain an energy-performance-area trade-off design for FFT.

3. ARCHITECTURE AND IMPLEMENTATIONS

3.1. Architecture building blocks

The proposed N-point FFT architecture is based on the Radix-4 Cooley-Tukey FFT algorithm. Note that the choice of the radix affects the energy efficiency of the design. Compared with the Radix-2 algorithm, Radix-4 uses fewer multiply operations. The basic architecture consists of five building blocks (see Fig. 1): Radix-4 block (R4), data buffer, data path permutation (PER), parallel-to-serial/serial-to-parallel (PS/SP) multiplexer, and twiddle factor computation. A complete design for a given N-point FFT can be obtained from combinations of the basic blocks.

A. Radix-4 block

In this module, 16 signed adder/subtractors are used to complete the butterfly computations. It takes four inputs and generates four outputs in parallel. Each input contains real and imaginary components. The data outputs of R4 are used by the twiddle factor computation block except in the last stage (see Fig. 1a).
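The butterfly performed by the Radix-4 block, and its role in a Radix-4 FFT, can be illustrated in software. The following is a minimal sketch, not the authors' hardware description: it uses a recursive decimation-in-frequency formulation rather than the iterative loop structure of Algorithm 1, and the function names (`radix4_butterfly`, `fft_radix4`, `dft`) are hypothetical. The butterfly uses 8 complex additions/subtractions, which corresponds to 16 real adder/subtractors.

```python
import cmath


def radix4_butterfly(a, b, c, d):
    """4-point DFT of (a, b, c, d).

    Uses 8 complex additions/subtractions (16 real ones). The factors +/-j
    only swap and negate real/imaginary parts, so no multiplier is needed.
    """
    s0, s1 = a + c, a - c
    s2, s3 = b + d, -1j * (b - d)
    return s0 + s2, s1 + s3, s0 - s2, s1 - s3


def fft_radix4(x):
    """Recursive decimation-in-frequency radix-4 FFT; len(x) must be a power of 4."""
    n = len(x)
    if n == 1:
        return list(x)
    if n % 4:
        raise ValueError("length must be a power of 4")
    q = n // 4
    sub = [[0j] * q for _ in range(4)]
    for i in range(q):
        y = radix4_butterfly(x[i], x[i + q], x[i + 2 * q], x[i + 3 * q])
        for r in range(4):
            # Twiddle factor W_n^(r*i), applied after the butterfly.
            sub[r][i] = y[r] * cmath.exp(-2j * cmath.pi * r * i / n)
    out = [0j] * n
    for r in range(4):
        out[r::4] = fft_radix4(sub[r])  # sub-transform r yields output bins 4k + r
    return out


def dft(x):
    """Direct O(N^2) DFT, used only as a reference for checking."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]
```

Comparing `fft_radix4(x)` with `dft(x)` for a 16-point input is a quick way to check the butterfly signs and the twiddle factors.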

Fig. 1: (a) Radix block, (b) Data buffer, (c) Data path permutation (PER), (d) Parallel-to-serial/serial-to-parallel MUX (PS/SP), (e) Twiddle factor computation

Fig. 2: Data permutation in the data buffers for 16-point FFT. Memory entries 0 to 3 of the four buffers hold (X_0, X_7, X_10, X_13), (X_1, X_4, X_11, X_14), (X_2, X_5, X_8, X_15) and (X_3, X_6, X_9, X_12); one entry of each buffer is output in parallel.

B. Data buffer

The data buffer consists of a dual-port RAM with $N/m$ entries, where $m$ equals the number of I/Os. Data is written into one port and read from the other port simultaneously. The data buffers are shown in Fig. 2 for N = 16. In four cycles, 16 permuted data inputs are fed into the data buffers. In each cycle, four data outputs are read in parallel from alternating entries. For different architectural parameters, the read and write addresses are generated with different strides. For example, in Fig. 2, four data inputs ($X_0$, $X_4$, $X_8$, $X_{12}$) are written in cycle 0, cycle 1, cycle 2, and cycle 3 respectively. They are then output simultaneously in cycle 4.

C. Data permutation block

Parallel input data must be permuted before being processed by the subsequent modules. Fig. 2 shows the data permutation for 16-point FFT. In the first cycle, four data inputs ($X_0$, $X_1$, $X_2$, $X_3$) are fed into the first entry of each data buffer without permutation. In the second cycle, another four data inputs are written into the second entry of each data buffer, rotated by one location. After four cycles, the data to be output in parallel ($X_i$, $X_{i+4}$, $X_{i+8}$, $X_{(i+12) \bmod 16}$, $i = 0, 1, 2, 3$) are stored in different RAMs. These permutations repeat every four cycles.

D. PS/SP module

This module multiplexes serial input data to parallel output, or parallel input data to serial output. As shown in Fig. 3a, when the number of I/Os is limited to one, the radix-4 block still operates on four data inputs in parallel, so the PS/SP module is employed to match the data rate both before and after the radix-4 block.

Fig. 3: Parameterized architectures for 16-point FFT: (a) $H_p = 1$, $V_p = 1$; (b) $H_p = 2$, $V_p = 4$

E. Twiddle factor computation

This module consists of two blocks: the twiddle factor generation block and the complex number multiplier block. The twiddle factor generation block includes several lookup tables for storing twiddle factor coefficients, whose read addresses are updated by the control signals. The size of the lookup tables increases with the problem size. The complex number multiplier block consists of three multipliers and three adder/subtractors.

3.2. Parameterized FFT Architecture

3.2.1. Algorithm Mapping Parameters

Decomposition based Radix-4 FFT offers much flexibility in mapping to various architectures. By folding the FFT architecture horizontally or vertically, the radix-4 blocks can be reused iteratively, connected in a pipeline, or replicated to process input data in parallel. Hence we use two algorithm mapping parameters that characterize the decomposition-based N-point FFT algorithm in our design:
1. Horizontal Parallelism ($H_p$): determines the number of radix blocks used in one pipeline ($1 \le H_p \le \log_4 N$).
2. Vertical Parallelism ($V_p$): determines the number of inputs being computed in parallel ($1 \le V_p \le N$). $V_p$ varies with the number of data channels per pipeline ($N_c$) and the number of parallel pipelines ($N_p$), with $V_p = N_c \times N_p$.

These two parameters are chosen to create a design space. Two different architectures are presented in Fig. 3. In Fig. 3a, $V_p = N_c = N_p = 1$, $H_p = 1$, and N = 16: one radix-4 block is employed and used iteratively for two stages, and one input is processed per cycle. This architecture achieves higher resource efficiency and consumes less I/O power, at the expense of throughput. In Fig. 3b, $V_p = 4$, $H_p = 2$, and N = 16: two radix-4 blocks are used. There is only one pipeline, with $N_c = 4$ and $N_p = 1$. All the stages are fully pipelined, and four inputs can be processed in parallel per cycle. Note that there is no feedback path. This architecture achieves high throughput by using more basic blocks and I/Os, at the cost of higher power consumption. We can also increase $V_p$ by replicating the basic pipeline. This replication allows several pipelines to work in parallel, significantly increasing the throughput at the cost of more complex interconnections.

Fig. 4: (a) Crossbar network, (b) Complete binary tree, (c) Dynamic network

3.2.2. Architecture Parameters

Three architecture parameters that significantly affect energy efficiency are employed in our design and applied to different components:
1. Type of memory element: BRAM or distributed RAM (dist. RAM) can be used as memories. In our design, both the data buffers and the twiddle factor lookup tables can be implemented using either memory element.
2. Type of interconnection: three different types of interconnection (see Fig. 4) are used to implement the data permutation blocks: crossbar network, complete binary tree, and dynamic network.
3. Pipeline depth: both adder/subtractors and DSP slices in the FPGA can be deeply pipelined by inserting registers, so we parameterize the arithmetic units and multipliers by pipeline depth to balance performance and resource usage.

According to [14], when used for large memories, BRAM consumes less power than dist. RAM. This characteristic can be exploited to trade off power against performance for various problem sizes.
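The algorithm mapping and architecture parameters described in Sections 3.2.1 and 3.2.2 together span a design space. As a rough illustration, that space can be enumerated as below. This is a sketch under stated assumptions: the class and function names are hypothetical, the candidate values for $N_c$ and $N_p$ (1 and 4) are illustrative defaults, and pipeline depth is left out for brevity.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class DesignPoint:
    """One candidate design (hypothetical container, not from the paper)."""
    n: int             # problem size N
    h_p: int           # horizontal parallelism: radix blocks per pipeline
    n_c: int           # data channels per pipeline
    n_p: int           # number of parallel pipelines
    memory: str        # "dist_ram" or "bram"
    interconnect: str  # "crossbar", "binary_tree" or "dynamic"

    @property
    def v_p(self):
        # Vertical parallelism, V_p = N_c * N_p.
        return self.n_c * self.n_p


def log4(n):
    """Return log_4(n) for n a power of 4."""
    stages = 0
    while n > 1:
        if n % 4:
            raise ValueError("N must be a power of 4")
        n //= 4
        stages += 1
    return stages


def enumerate_designs(n, n_c_options=(1, 4), n_p_options=(1, 4)):
    """List all design points with 1 <= H_p <= log_4 N and 1 <= V_p <= N."""
    memories = ("dist_ram", "bram")
    networks = ("crossbar", "binary_tree", "dynamic")
    points = []
    for h_p, n_c, n_p, mem, net in product(
            range(1, log4(n) + 1), n_c_options, n_p_options, memories, networks):
        point = DesignPoint(n, h_p, n_c, n_p, mem, net)
        if 1 <= point.v_p <= n:
            points.append(point)
    return points
```

For example, `enumerate_designs(16)` yields every combination of the two $H_p$ values, the assumed $N_c$ and $N_p$ options, the two memory types and the three interconnection types; each point would then be implemented and measured to obtain its energy efficiency and EAT.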
As there are $2^m (H_p + 1)$ permutation modules (with $m = 1$ when $V_p = 4$ and $m = 0$ otherwise), the choice of interconnection network can significantly affect the energy efficiency of the designs. The physical layout of the complete binary tree is similar to that of the crossbar network, but more pipeline registers can be inserted between the layers of the tree. The dynamic network can be implemented using shift registers. Among the three, the dynamic network gives high performance at the cost of more power; the crossbar network consumes the least resource and power but introduces long wire delays; the complete binary tree relieves routing pressure and so improves performance, at the expense of more area.

4. EXPERIMENTAL RESULTS AND ANALYSIS

4.1. Experimental Setup

In this section, we present a detailed analysis of several implementation experiments in which the parameters are varied. All the designs were implemented in Verilog on a Virtex-7 FPGA (XC7VX690T, speed grade -2L) using Xilinx ISE. Inputs are 16-bit fixed point complex numbers. The designs were verified by post place-and-route simulation, and the reported results are post place-and-route results. We used the SAIF (Switching Activity Interchange Format) file as input to Xilinx XPower Analyzer to produce accurate power dissipation estimates [14].

4.2. Performance Metrics

Two metrics for performance evaluation are considered in this paper:
1. Energy efficiency is defined as the number of operations per unit energy consumed (Energy efficiency = number of operations / energy consumed by the design). For N-point FFT, energy efficiency is given by (2N log_2 N N log_2 N) / energy consumed by the design, where energy consumed by the design = time taken by the design × average power dissipation of the design. Equivalently, the energy efficiency of the design is its power efficiency (Power efficiency = number of operations per second / Watt).
2. Energy Area Time (EAT) is measured as the product of three important metrics: energy, area, and time. Energy is the energy in Joules consumed by the design for one transformation of N points. Area is the area usage of the design, taken as the maximum of the number of LUTs and the number of flip-flops occupied by the entire design. The area of a design using BRAMs is taken to be equal to the area of the same design using only dist. RAMs. Time is the latency of the N-point FFT.

4.3. Design space exploration

In this section, we first present the design space exploration by varying the algorithm mapping parameters. Both the dist. RAM based design and the BRAM based design are used in this experiment. The effect of the algorithm mapping parameters on energy efficiency is demonstrated using the proposed performance metrics. Next we explore the energy-performance-area trade-off design (denoted trade-off design) by varying the architecture parameters, based on the conclusions of the design space exploration in this section.

Fig. 5: Energy efficiency (Giga operations / Joule) for various $H_p$ with varying problem size N for the dist. RAM based design
Fig. 6: Energy efficiency (Giga operations / Joule) for various $H_p$ with varying problem size N for the BRAM based design
Fig. 7: Energy efficiency (Giga operations / Joule) for various $V_p$ with varying problem size N for the dist. RAM based design
Fig. 8: Energy efficiency (Giga operations / Joule) for various $V_p$ with varying problem size N for the BRAM based design

4.3.1. Algorithm mapping level exploration

A. Horizontal Parallelism

In this experiment, we explore the energy efficiency for various values of horizontal parallelism, with $V_p = 4$, $N_c = 4$, $N_p = 1$. The range of $H_p$ is $[1, \log_4 N]$. The energy efficiency for various $H_p$ is shown in Fig. 5 for the dist. RAM based design and in Fig. 6 for the BRAM based design. Based on the experimental results, we make the following observations:
- For all the considered problem sizes, increasing horizontal parallelism significantly improves energy efficiency for both the dist. RAM and the BRAM based design.
- As the problem size N increases, the energy efficiency of the dist. RAM based design declines, whereas the energy efficiency of the BRAM based design increases.
- The improvement in energy efficiency brought by increasing $H_p$ for the dist. RAM based design is sensitive to N. For example, when N = 1024, halving $H_p$ leads to only a slight decline in energy efficiency. Reducing $H_p$ to save area is therefore a feasible alternative for larger problem sizes.
- The improvement in energy efficiency brought by increasing $H_p$ for BRAM based designs is not sensitive to N. Reducing $H_p$ to save area is not a feasible approach, since it leads to a significant decline in energy efficiency.

B. Vertical Parallelism

Vertical parallelism is determined by three values: the radix (fixed at 4), $N_c$, and $N_p$. $H_p$ was set to $\log_4 N$, and $N_c$ and $N_p$ were varied for evaluation. Both dist. RAM and BRAM based designs were evaluated. The energy efficiency for various $V_p$ is shown in Fig. 7 and Fig. 8. Based on the results, we draw the following conclusions:
- Reducing $N_c$ leads to a decline in energy efficiency.
- The BRAM based design is more scalable than the dist. RAM based design with respect to energy efficiency.
- When $N \ge 64$, the energy efficiency starts to decline for dist. RAM based designs, owing to the high power consumption per access of dist. RAMs with a large number of memory entries.

- Increasing $N_c$ instead of $N_p$ is a more feasible approach to improving energy efficiency: little extra resource is needed to increase $N_c$, whereas the pipeline has to be replicated to increase $N_p$.
- Although increasing $H_p$ leads to higher power and resource consumption, it improves energy efficiency owing to the higher throughput.

4.3.2. Architecture level exploration

In this section, the trade-off design is explored at the architecture level. In this experiment, we choose $V_p = 4$ and $H_p = \log_4 N$ based on the conclusions of the previous experiments.

A. Energy hot spots

As shown in Fig. 10a, the dominant portion of the total power is consumed by the data buffers for 1024-point FFT. This indicates that BRAM can be used to improve energy efficiency for large values of N. It also suggests that the I/Os consume a major share of the power for small values of N. Fig. 10b shows that, for BRAM based designs, the core power consumption (excluding I/O power and static power) dominates the total power. We also observe that pipeline registers are the energy hot spots among the architecture components.

Fig. 10: Percentage of power consumed by the components for the 1024-point FFT architecture
(a) Dist. RAM based design: data buffer 72%, static power 9%, PER 8%, radix block 4%, twiddle factor computation 4%, I/O power 3%
(b) BRAM based design: data buffer 28%, I/O power 26%, static power 13%, radix block 13%, twiddle factor computation 12%, PER 8%

B. Trade-off design

By varying the architecture parameters, a set of implementations has been evaluated in this experiment. The effects of the architecture parameters on power, performance, and area are analyzed below:
- Energy: Reducing the number of registers can significantly reduce signal power, which dominates the dynamic power. The crossbar network is a candidate for increasing energy efficiency.
- Performance: Using BRAM can lead to a decline in peak operating frequency. For large values of N, when using BRAMs, extra pipeline stages can be used to address the performance degradation.
- Area: The area usage of pipeline registers dominates the total design area. The number of pipeline registers can be balanced to obtain a trade-off between area and performance.

The analysis above has been applied to obtain the trade-off design in our experiment and serves as a guide for design space exploration. We use two baseline designs for comparison with our proposed trade-off design. The architecture parameters of the three designs are shown in Table 1, and the comparison of their energy efficiency is shown in Fig. 9. The proposed trade-off design improves energy efficiency by up to 27% compared with the two baseline designs.

Table 1: Architecture parameters of designs for comparison

                                      Design A          Trade-off design     Design C
Memory type                           Dist. RAM         Dist. RAM or BRAM    BRAM
Interconnection network, type         Dynamic network   Crossbar network     Complete binary tree
Interconnection network, components   Registers         LUTs                 LUTs + Registers
Pipeline stages, multiplier           5                 3
Pipeline stages, adder                2                 2

Fig. 9: Energy efficiency (Giga operations / Joule) of the trade-off design and the baseline designs (Design A, Design C)

4.4. Performance comparison

Finally, we compare our proposed trade-off design with the SPIRAL FFT IP core. The SPIRAL DFT/FFT IP Generator can automatically generate customized DFT soft IP cores in synthesizable RTL Verilog from user inputs [11]. The available parameters of the DFT core generator include transform size, data precision, etc. In this comparison, we use the dist. RAM based design for $N \le 64$ and the BRAM based design for $N > 64$. For the design from SPIRAL, the code for the N-point (16-bit fixed point) FFT is automatically generated by the SPIRAL core generator. The architecture is fully streaming and the data are presented in natural order. As shown in Fig. 11, our proposed design improves energy efficiency by 8% to 28% and EAT by 23% to 38%, compared with the SPIRAL FFT IP cores.
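The two metrics of Section 4.2 that underlie this comparison can be written out as a short sketch. The function names and all numeric values below are hypothetical and chosen only for illustration; the operation count is passed in as a parameter rather than assumed. The usage lines reproduce the example from the Introduction, where a design drawing 2× the power at 3× the throughput is 50% more energy efficient.

```python
def energy_joules(avg_power_w, time_s):
    """Energy consumed = time taken x average power dissipation."""
    return avg_power_w * time_s


def energy_efficiency(num_ops, avg_power_w, time_s):
    """Number of operations per Joule."""
    return num_ops / energy_joules(avg_power_w, time_s)


def eat(avg_power_w, time_s, area):
    """Energy x Area x Time composite metric (lower is better).

    `area` is the larger of the LUT count and the flip-flop count;
    `time_s` is the latency of one N-point transform.
    """
    return energy_joules(avg_power_w, time_s) * area * time_s


# Illustrative numbers only: the same workload run on two designs.
baseline = energy_efficiency(num_ops=1000, avg_power_w=1.0, time_s=1.0)
faster = energy_efficiency(num_ops=1000, avg_power_w=2.0, time_s=1.0 / 3.0)
gain = faster / baseline  # about 1.5, i.e. 50% more operations per Joule

# Ratio in the style of the EAT axis of Fig. 11: (EAT of reference) / (EAT of ours).
eat_ratio = eat(1.0, 1.0, 100) / eat(2.0, 1.0 / 3.0, 120)
```

A ratio above 1 in the last line means the second design has the lower, and therefore better, EAT.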

Fig. 11: Comparison between the proposed trade-off design and the SPIRAL FFT IP cores for EAT and energy efficiency. Axes: energy efficiency in Giga operations / Joule (our design and the SPIRAL FFT IP core), and EAT ratio, (EAT of SPIRAL IP core) / (EAT of our design).

5. CONCLUSION

In this work, we presented a parameterized architecture for energy efficiency using the Radix-4 Cooley-Tukey FFT algorithm. The effect of the multi-level parameters on energy efficiency was demonstrated through design space exploration. We studied the power consumption of the components for various problem sizes, and proposed our trade-off design by empirical selection of the architecture parameters. Compared with the state-of-the-art design, our optimized architectures achieve up to 28% and 38% improvement in energy efficiency and EAT, respectively. In the future we plan to work on an accurate high-level performance model for energy efficiency estimation, which can be used to accelerate design space exploration for energy efficient designs.

REFERENCES

[1] N. Shirazi, P. M. Athanas, and A. L. Abbott, "Implementation of a 2-D Fast Fourier Transform on an FPGA-Based Custom Computing Machine," in Field-Programmable Logic and Applications, 1995.
[2] D. Chen, G. Yao, C. Koc, and R. Cheung, "Low complexity and hardware-friendly spectral modular multiplication," in International Conference on Field-Programmable Technology (FPT), 2012.
[3] E. H. Wold and A. M. Despain, "Pipeline and parallel-pipeline FFT processors for VLSI implementations," IEEE Transactions on Computers, vol. 100, no. 5.
[4] S. He and M. Torkelson, "A new approach to pipeline FFT processor," in Proceedings of IPPS '96.
[5] G. Bi and E. Jones, "A pipelined FFT processor for word-sequential data," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 37, no. 12.
[6] L. R. Rabiner and B. Gold, Theory and application of digital signal processing. Englewood Cliffs, NJ: Prentice-Hall, Inc., vol. 1.
[7] S. Choi, R. Scrofano, V. K. Prasanna, and J.-W. Jang, "Energy-efficient signal processing using FPGAs," in Proceedings of FPGA, 2003.
[8] D. Aravind and A. Sudarsanam, "High level - Application Analysis Techniques Architectures - To Explore Design possibilities for Reduced Reconfiguration Area Overheads in FPGAs executing Compute Intensive Applications," in Proc. of IPDPS, 2005, pp. 158a-158a.
[9] B. Baas, "A low-power, high-performance, 1024-point FFT processor," IEEE Journal of Solid-State Circuits, vol. 34, no. 3.
[10] W.-C. Yeh and C.-W. Jen, "High-speed and low-power split-radix FFT," IEEE Transactions on Signal Processing, vol. 51, no. 3, 2003.
[11] P. A. Milder, M. Ahmad, J. C. Hoe, and M. Püschel, "Fast and accurate resource estimation of automatically generated custom DFT IP cores," in Proceedings of FPGA, 2006.
[12] T. Sugimura, H. Yamasaki, H. Noda, O. Yamamoto, Y. Okuno, and K. Arimoto, "A high-performance and energy-efficient FFT implementation on super parallel processor (MX) for mobile multimedia applications," in International Symposium on Intelligent Signal Processing and Communications Systems, 2009.
[13] H. Kimura, H. Nakamura, S. Kimura, and N. Yoshimoto, "Numerical analysis of dynamic SNR management by controlling DSP calculation precision for energy-efficient OFDM-PON," IEEE Photonics Technology Letters, vol. 24, no. 23, 2012.
[14] XST User Guide for Virtex-6, Spartan-6, and 7 Series Devices.


More information

Keywords: Fast Fourier Transforms (FFT), Multipath Delay Commutator (MDC), Pipelined Architecture, Radix-2 k, VLSI.

Keywords: Fast Fourier Transforms (FFT), Multipath Delay Commutator (MDC), Pipelined Architecture, Radix-2 k, VLSI. ww.semargroup.org www.ijvdcs.org ISSN 2322-0929 Vol.02, Issue.05, August-2014, Pages:0294-0298 Radix-2 k Feed Forward FFT Architectures K.KIRAN KUMAR 1, M.MADHU BABU 2 1 PG Scholar, Dept of VLSI & ES,

More information

Analysis of High-performance Floating-point Arithmetic on FPGAs

Analysis of High-performance Floating-point Arithmetic on FPGAs Analysis of High-performance Floating-point Arithmetic on FPGAs Gokul Govindu, Ling Zhuo, Seonil Choi and Viktor Prasanna Dept. of Electrical Engineering University of Southern California Los Angeles,

More information

TOPICS PIPELINE IMPLEMENTATIONS OF THE FAST FOURIER TRANSFORM (FFT) DISCRETE FOURIER TRANSFORM (DFT) INVERSE DFT (IDFT) Consulted work:

TOPICS PIPELINE IMPLEMENTATIONS OF THE FAST FOURIER TRANSFORM (FFT) DISCRETE FOURIER TRANSFORM (DFT) INVERSE DFT (IDFT) Consulted work: 1 PIPELINE IMPLEMENTATIONS OF THE FAST FOURIER TRANSFORM (FFT) Consulted work: Chiueh, T.D. and P.Y. Tsai, OFDM Baseband Receiver Design for Wireless Communications, John Wiley and Sons Asia, (2007). Second

More information

Novel design of multiplier-less FFT processors

Novel design of multiplier-less FFT processors Signal Processing 8 (00) 140 140 www.elsevier.com/locate/sigpro Novel design of multiplier-less FFT processors Yuan Zhou, J.M. Noras, S.J. Shepherd School of EDT, University of Bradford, Bradford, West

More information

Multi-Gigahertz Parallel FFTs for FPGA and ASIC Implementation

Multi-Gigahertz Parallel FFTs for FPGA and ASIC Implementation Multi-Gigahertz Parallel FFTs for FPGA and ASIC Implementation Doug Johnson, Applications Consultant Chris Eddington, Technical Marketing Synopsys 2013 1 Synopsys, Inc. 700 E. Middlefield Road Mountain

More information

DESIGN OF PARALLEL PIPELINED FEED FORWARD ARCHITECTURE FOR ZERO FREQUENCY & MINIMUM COMPUTATION (ZMC) ALGORITHM OF FFT

DESIGN OF PARALLEL PIPELINED FEED FORWARD ARCHITECTURE FOR ZERO FREQUENCY & MINIMUM COMPUTATION (ZMC) ALGORITHM OF FFT IMPACT: International Journal of Research in Engineering & Technology (IMPACT: IJRET) ISSN(E): 2321-8843; ISSN(P): 2347-4599 Vol. 2, Issue 4, Apr 2014, 199-206 Impact Journals DESIGN OF PARALLEL PIPELINED

More information

Towards Performance Modeling of 3D Memory Integrated FPGA Architectures

Towards Performance Modeling of 3D Memory Integrated FPGA Architectures Towards Performance Modeling of 3D Memory Integrated FPGA Architectures Shreyas G. Singapura, Anand Panangadan and Viktor K. Prasanna University of Southern California, Los Angeles CA 90089, USA, {singapur,

More information

Modified Welch Power Spectral Density Computation with Fast Fourier Transform

Modified Welch Power Spectral Density Computation with Fast Fourier Transform Modified Welch Power Spectral Density Computation with Fast Fourier Transform Sreelekha S 1, Sabi S 2 1 Department of Electronics and Communication, Sree Budha College of Engineering, Kerala, India 2 Professor,

More information

Low-Power Split-Radix FFT Processors Using Radix-2 Butterfly Units

Low-Power Split-Radix FFT Processors Using Radix-2 Butterfly Units Low-Power Split-Radix FFT Processors Using Radix-2 Butterfly Units Abstract: Split-radix fast Fourier transform (SRFFT) is an ideal candidate for the implementation of a lowpower FFT processor, because

More information

MULTIPLIERLESS HIGH PERFORMANCE FFT COMPUTATION

MULTIPLIERLESS HIGH PERFORMANCE FFT COMPUTATION MULTIPLIERLESS HIGH PERFORMANCE FFT COMPUTATION Maheshwari.U 1, Josephine Sugan Priya. 2, 1 PG Student, Dept Of Communication Systems Engg, Idhaya Engg. College For Women, 2 Asst Prof, Dept Of Communication

More information

Computing the Discrete Fourier Transform on FPGA Based Systolic Arrays

Computing the Discrete Fourier Transform on FPGA Based Systolic Arrays Computing the Discrete Fourier Transform on FPGA Based Systolic Arrays Chris Dick School of Electronic Engineering La Trobe University Melbourne 3083, Australia Abstract Reconfigurable logic arrays allow

More information

FPGA Based Design and Simulation of 32- Point FFT Through Radix-2 DIT Algorith

FPGA Based Design and Simulation of 32- Point FFT Through Radix-2 DIT Algorith FPGA Based Design and Simulation of 32- Point FFT Through Radix-2 DIT Algorith Sudhanshu Mohan Khare M.Tech (perusing), Dept. of ECE Laxmi Naraian College of Technology, Bhopal, India M. Zahid Alam Associate

More information

Energy and Memory Efficient Mapping of Bitonic Sorting on FPGA

Energy and Memory Efficient Mapping of Bitonic Sorting on FPGA Energy and Memory Efficient Mapping of Bitonic Sorting on FPGA Ren Chen, Sruja Siriyal, Viktor Prasanna Ming Hsieh Department of Electrical Engineering University of Southern California, Los Angeles, USA

More information

Research Article Design of A Novel 8-point Modified R2MDC with Pipelined Technique for High Speed OFDM Applications

Research Article Design of A Novel 8-point Modified R2MDC with Pipelined Technique for High Speed OFDM Applications Research Journal of Applied Sciences, Engineering and Technology 7(23): 5021-5025, 2014 DOI:10.19026/rjaset.7.895 ISSN: 2040-7459; e-issn: 2040-7467 2014 Maxwell Scientific Publication Corp. Submitted:

More information

SFF The Single-Stream FPGA-Optimized Feedforward FFT Hardware Architecture

SFF The Single-Stream FPGA-Optimized Feedforward FFT Hardware Architecture Journal of Signal Processing Systems (2018) 90:1583 1592 https://doi.org/10.1007/s11265-018-1370-y SFF The Single-Stream FPGA-Optimized Feedforward FFT Hardware Architecture Carl Ingemarsson 1 Oscar Gustafsson

More information

STUDY OF A CORDIC BASED RADIX-4 FFT PROCESSOR

STUDY OF A CORDIC BASED RADIX-4 FFT PROCESSOR STUDY OF A CORDIC BASED RADIX-4 FFT PROCESSOR 1 AJAY S. PADEKAR, 2 S. S. BELSARE 1 BVDU, College of Engineering, Pune, India 2 Department of E & TC, BVDU, College of Engineering, Pune, India E-mail: ajay.padekar@gmail.com,

More information

A Model-based Methodology for Application Specific Energy Efficient Data Path Design using FPGAs

A Model-based Methodology for Application Specific Energy Efficient Data Path Design using FPGAs A Model-based Methodology for Application Specific Energy Efficient Data Path Design using FPGAs Sumit Mohanty 1, Seonil Choi 1, Ju-wook Jang 2, Viktor K. Prasanna 1 1 Dept. of Electrical Engg. 2 Dept.

More information

A SIMULINK-TO-FPGA MULTI-RATE HIERARCHICAL FIR FILTER DESIGN

A SIMULINK-TO-FPGA MULTI-RATE HIERARCHICAL FIR FILTER DESIGN A SIMULINK-TO-FPGA MULTI-RATE HIERARCHICAL FIR FILTER DESIGN Xiaoying Li 1 Fuming Sun 2 Enhua Wu 1, 3 1 University of Macau, Macao, China 2 University of Science and Technology Beijing, Beijing, China

More information

DESIGN METHODOLOGY. 5.1 General

DESIGN METHODOLOGY. 5.1 General 87 5 FFT DESIGN METHODOLOGY 5.1 General The fast Fourier transform is used to deliver a fast approach for the processing of data in the wireless transmission. The Fast Fourier Transform is one of the methods

More information

A 4096-Point Radix-4 Memory-Based FFT Using DSP Slices

A 4096-Point Radix-4 Memory-Based FFT Using DSP Slices A 4096-Point Radix-4 Memory-Based FFT Using DSP Slices Mario Garrido Gálvez, Miguel Angel Sanchez, Maria Luisa Lopez-Vallejo and Jesus Grajal Journal Article N.B.: When citing this work, cite the original

More information

Fused Floating Point Arithmetic Unit for Radix 2 FFT Implementation

Fused Floating Point Arithmetic Unit for Radix 2 FFT Implementation IOSR Journal of VLSI and Signal Processing (IOSR-JVSP) Volume 6, Issue 2, Ver. I (Mar. -Apr. 2016), PP 58-65 e-issn: 2319 4200, p-issn No. : 2319 4197 www.iosrjournals.org Fused Floating Point Arithmetic

More information

FPGA Matrix Multiplier

FPGA Matrix Multiplier FPGA Matrix Multiplier In Hwan Baek Henri Samueli School of Engineering and Applied Science University of California Los Angeles Los Angeles, California Email: chris.inhwan.baek@gmail.com David Boeck Henri

More information

LOW-POWER SPLIT-RADIX FFT PROCESSORS

LOW-POWER SPLIT-RADIX FFT PROCESSORS LOW-POWER SPLIT-RADIX FFT PROCESSORS Avinash 1, Manjunath Managuli 2, Suresh Babu D 3 ABSTRACT To design a split radix fast Fourier transform is an ideal person for the implementing of a low-power FFT

More information

User Manual for FC100

User Manual for FC100 Sundance Multiprocessor Technology Limited User Manual Form : QCF42 Date : 6 July 2006 Unit / Module Description: IEEE-754 Floating-point FPGA IP Core Unit / Module Number: FC100 Document Issue Number:

More information

INTRODUCTION TO FPGA ARCHITECTURE

INTRODUCTION TO FPGA ARCHITECTURE 3/3/25 INTRODUCTION TO FPGA ARCHITECTURE DIGITAL LOGIC DESIGN (BASIC TECHNIQUES) a b a y 2input Black Box y b Functional Schematic a b y a b y a b y 2 Truth Table (AND) Truth Table (OR) Truth Table (XOR)

More information

Twiddle Factor Transformation for Pipelined FFT Processing

Twiddle Factor Transformation for Pipelined FFT Processing Twiddle Factor Transformation for Pipelined FFT Processing In-Cheol Park, WonHee Son, and Ji-Hoon Kim School of EECS, Korea Advanced Institute of Science and Technology, Daejeon, Korea icpark@ee.kaist.ac.kr,

More information

Adaptive FIR Filter Using Distributed Airthmetic for Area Efficient Design

Adaptive FIR Filter Using Distributed Airthmetic for Area Efficient Design International Journal of Scientific and Research Publications, Volume 5, Issue 1, January 2015 1 Adaptive FIR Filter Using Distributed Airthmetic for Area Efficient Design Manish Kumar *, Dr. R.Ramesh

More information

Parallelism in Spiral

Parallelism in Spiral Parallelism in Spiral Franz Franchetti and the Spiral team (only part shown) Electrical and Computer Engineering Carnegie Mellon University Joint work with Yevgen Voronenko Markus Püschel This work was

More information

Low Power and Memory Efficient FFT Architecture Using Modified CORDIC Algorithm

Low Power and Memory Efficient FFT Architecture Using Modified CORDIC Algorithm Low Power and Memory Efficient FFT Architecture Using Modified CORDIC Algorithm 1 A.Malashri, 2 C.Paramasivam 1 PG Student, Department of Electronics and Communication K S Rangasamy College Of Technology,

More information

Verilog for High Performance

Verilog for High Performance Verilog for High Performance Course Description This course provides all necessary theoretical and practical know-how to write synthesizable HDL code through Verilog standard language. The course goes

More information

An Area Efficient Mixed Decimation MDF Architecture for Radix. Parallel FFT

An Area Efficient Mixed Decimation MDF Architecture for Radix. Parallel FFT An Area Efficient Mixed Decimation MDF Architecture for Radix Parallel FFT Reshma K J 1, Prof. Ebin M Manuel 2 1M-Tech, Dept. of ECE Engineering, Government Engineering College, Idukki, Kerala, India 2Professor,

More information

Resource-efficient Acceleration of 2-Dimensional Fast Fourier Transform Computations on FPGAs

Resource-efficient Acceleration of 2-Dimensional Fast Fourier Transform Computations on FPGAs In Proceedings of the International Conference on Distributed Smart Cameras, Como, Italy, August 2009. Resource-efficient Acceleration of 2-Dimensional Fast Fourier Transform Computations on FPGAs Hojin

More information

Computer Generation of IP Cores

Computer Generation of IP Cores A I n Computer Generation of IP Cores Peter Milder (ECE, Carnegie Mellon) James Hoe (ECE, Carnegie Mellon) Markus Püschel (CS, ETH Zürich) addfxp #(16, 1) add15282(.a(a69),.b(a70),.clk(clk),.q(t45)); addfxp

More information

Comparison of Adders for optimized Exponent Addition circuit in IEEE754 Floating point multiplier using VHDL

Comparison of Adders for optimized Exponent Addition circuit in IEEE754 Floating point multiplier using VHDL International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 11, Issue 07 (July 2015), PP.60-65 Comparison of Adders for optimized Exponent Addition

More information

FPGA Implementation of Discrete Fourier Transform Using CORDIC Algorithm

FPGA Implementation of Discrete Fourier Transform Using CORDIC Algorithm AMSE JOURNALS-AMSE IIETA publication-2017-series: Advances B; Vol. 60; N 2; pp 332-337 Submitted Apr. 04, 2017; Revised Sept. 25, 2017; Accepted Sept. 30, 2017 FPGA Implementation of Discrete Fourier Transform

More information

An Efficient High Speed VLSI Architecture Based 16-Point Adaptive Split Radix-2 FFT Architecture

An Efficient High Speed VLSI Architecture Based 16-Point Adaptive Split Radix-2 FFT Architecture IJSTE - International Journal of Science Technology & Engineering Volume 2 Issue 10 April 2016 ISSN (online): 2349-784X An Efficient High Speed VLSI Architecture Based 16-Point Adaptive Split Radix-2 FFT

More information

Efficient Self-Reconfigurable Implementations Using On-Chip Memory

Efficient Self-Reconfigurable Implementations Using On-Chip Memory 10th International Conference on Field Programmable Logic and Applications, August 2000. Efficient Self-Reconfigurable Implementations Using On-Chip Memory Sameer Wadhwa and Andreas Dandalis University

More information

Core Facts. Documentation Design File Formats. Verification Instantiation Templates Reference Designs & Application Notes Additional Items

Core Facts. Documentation Design File Formats. Verification Instantiation Templates Reference Designs & Application Notes Additional Items (FFT_PIPE) Product Specification Dillon Engineering, Inc. 4974 Lincoln Drive Edina, MN USA, 55436 Phone: 952.836.2413 Fax: 952.927.6514 E mail: info@dilloneng.com URL: www.dilloneng.com Core Facts Documentation

More information

Implementation of FFT Processor using Urdhva Tiryakbhyam Sutra of Vedic Mathematics

Implementation of FFT Processor using Urdhva Tiryakbhyam Sutra of Vedic Mathematics Implementation of FFT Processor using Urdhva Tiryakbhyam Sutra of Vedic Mathematics Yojana Jadhav 1, A.P. Hatkar 2 PG Student [VLSI & Embedded system], Dept. of ECE, S.V.I.T Engineering College, Chincholi,

More information

EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs)

EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs) EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs) September 12, 2002 John Wawrzynek Fall 2002 EECS150 - Lec06-FPGA Page 1 Outline What are FPGAs? Why use FPGAs (a short history

More information

High-Performance 16-Point Complex FFT Features 1 Functional Description 2 Theory of Operation

High-Performance 16-Point Complex FFT Features 1 Functional Description 2 Theory of Operation High-Performance 16-Point Complex FFT April 8, 1999 Application Note This document is (c) Xilinx, Inc. 1999. No part of this file may be modified, transmitted to any third party (other than as intended

More information

A High-Performance and Energy-efficient Architecture for Floating-point based LU Decomposition on FPGAs

A High-Performance and Energy-efficient Architecture for Floating-point based LU Decomposition on FPGAs A High-Performance and Energy-efficient Architecture for Floating-point based LU Decomposition on FPGAs Gokul Govindu, Seonil Choi, Viktor Prasanna Dept. of Electrical Engineering-Systems University of

More information

PyGen: A MATLAB/Simulink Based Tool for Synthesizing Parameterized and Energy Efficient Designs Using FPGAs

PyGen: A MATLAB/Simulink Based Tool for Synthesizing Parameterized and Energy Efficient Designs Using FPGAs PyGen: A MATLAB/Simulink Based Tool for Synthesizing Parameterized and Energy Efficient Designs Using FPGAs Jingzhao Ou and Viktor K. Prasanna Department of Electrical Engineering, University of Southern

More information

A scalable, fixed-shuffling, parallel FFT butterfly processing architecture for SDR environment

A scalable, fixed-shuffling, parallel FFT butterfly processing architecture for SDR environment LETTER IEICE Electronics Express, Vol.11, No.2, 1 9 A scalable, fixed-shuffling, parallel FFT butterfly processing architecture for SDR environment Ting Chen a), Hengzhu Liu, and Botao Zhang College of

More information

Outline. EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs) FPGA Overview. Why FPGAs?

Outline. EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs) FPGA Overview. Why FPGAs? EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs) September 12, 2002 John Wawrzynek Outline What are FPGAs? Why use FPGAs (a short history lesson). FPGA variations Internal logic

More information

DUE to the high computational complexity and real-time

DUE to the high computational complexity and real-time IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 15, NO. 3, MARCH 2005 445 A Memory-Efficient Realization of Cyclic Convolution and Its Application to Discrete Cosine Transform Hun-Chen

More information

A Normal I/O Order Radix-2 FFT Architecture to Process Twin Data Streams for MIMO

A Normal I/O Order Radix-2 FFT Architecture to Process Twin Data Streams for MIMO 2402 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 24, NO. 6, JUNE 2016 A Normal I/O Order Radix-2 FFT Architecture to Process Twin Data Streams for MIMO Antony Xavier Glittas,

More information

International Journal of Advanced Research in Electrical, Electronics and Instrumentation Engineering

International Journal of Advanced Research in Electrical, Electronics and Instrumentation Engineering An Efficient Implementation of Double Precision Floating Point Multiplier Using Booth Algorithm Pallavi Ramteke 1, Dr. N. N. Mhala 2, Prof. P. R. Lakhe M.Tech [IV Sem], Dept. of Comm. Engg., S.D.C.E, [Selukate],

More information

High-Speed and Low-Power Split-Radix FFT

High-Speed and Low-Power Split-Radix FFT 864 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 51, NO. 3, MARCH 2003 High-Speed and Low-Power Split-Radix FFT Wen-Chang Yeh and Chein-Wei Jen Abstract This paper presents a novel split-radix fast Fourier

More information

FAST FOURIER TRANSFORM (FFT) and inverse fast

FAST FOURIER TRANSFORM (FFT) and inverse fast IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 39, NO. 11, NOVEMBER 2004 2005 A Dynamic Scaling FFT Processor for DVB-T Applications Yu-Wei Lin, Hsuan-Yu Liu, and Chen-Yi Lee Abstract This paper presents an

More information

High-throughput Online Hash Table on FPGA*

High-throughput Online Hash Table on FPGA* High-throughput Online Hash Table on FPGA* Da Tong, Shijie Zhou, Viktor K. Prasanna Ming Hsieh Dept. of Electrical Engineering University of Southern California Los Angeles, CA 989 Email: datong@usc.edu,

More information

High Performance Pipelined Design for FFT Processor based on FPGA

High Performance Pipelined Design for FFT Processor based on FPGA High Performance Pipelined Design for FFT Processor based on FPGA A.A. Raut 1, S. M. Kate 2 1 Sinhgad Institute of Technology, Lonavala, Pune University, India 2 Sinhgad Institute of Technology, Lonavala,

More information

A Novel Efficient VLSI Architecture for IEEE 754 Floating point multiplier using Modified CSA

A Novel Efficient VLSI Architecture for IEEE 754 Floating point multiplier using Modified CSA RESEARCH ARTICLE OPEN ACCESS A Novel Efficient VLSI Architecture for IEEE 754 Floating point multiplier using Nishi Pandey, Virendra Singh Sagar Institute of Research & Technology Bhopal Abstract Due to

More information

Accelerating Equi-Join on a CPU-FPGA Heterogeneous Platform*

Accelerating Equi-Join on a CPU-FPGA Heterogeneous Platform* Accelerating Equi-Join on a CPU-FPGA Heterogeneous Platform* Ren Chen and Viktor K. Prasanna Ming Hsieh Department of Electrical Engineering University of Southern California, Los Angeles, USA 90089 Email:

More information

Design of a Floating-Point Fused Add-Subtract Unit Using Verilog

Design of a Floating-Point Fused Add-Subtract Unit Using Verilog International Journal of Electronics and Computer Science Engineering 1007 Available Online at www.ijecse.org ISSN- 2277-1956 Design of a Floating-Point Fused Add-Subtract Unit Using Verilog Mayank Sharma,

More information

Parallelized Radix-4 Scalable Montgomery Multipliers

Parallelized Radix-4 Scalable Montgomery Multipliers Parallelized Radix-4 Scalable Montgomery Multipliers Nathaniel Pinckney and David Money Harris 1 1 Harvey Mudd College, 301 Platt. Blvd., Claremont, CA, USA e-mail: npinckney@hmc.edu ABSTRACT This paper

More information

A HIGH PERFORMANCE FIR FILTER ARCHITECTURE FOR FIXED AND RECONFIGURABLE APPLICATIONS

A HIGH PERFORMANCE FIR FILTER ARCHITECTURE FOR FIXED AND RECONFIGURABLE APPLICATIONS A HIGH PERFORMANCE FIR FILTER ARCHITECTURE FOR FIXED AND RECONFIGURABLE APPLICATIONS Saba Gouhar 1 G. Aruna 2 gouhar.saba@gmail.com 1 arunastefen@gmail.com 2 1 PG Scholar, Department of ECE, Shadan Women

More information

PERFORMANCE ANALYSIS OF HIGH EFFICIENCY LOW DENSITY PARITY-CHECK CODE DECODER FOR LOW POWER APPLICATIONS

PERFORMANCE ANALYSIS OF HIGH EFFICIENCY LOW DENSITY PARITY-CHECK CODE DECODER FOR LOW POWER APPLICATIONS American Journal of Applied Sciences 11 (4): 558-563, 2014 ISSN: 1546-9239 2014 Science Publication doi:10.3844/ajassp.2014.558.563 Published Online 11 (4) 2014 (http://www.thescipub.com/ajas.toc) PERFORMANCE

More information

FPGA Implementation of 16-Point Radix-4 Complex FFT Core Using NEDA

FPGA Implementation of 16-Point Radix-4 Complex FFT Core Using NEDA FPGA Implementation of 16-Point FFT Core Using NEDA Abhishek Mankar, Ansuman Diptisankar Das and N Prasad Abstract--NEDA is one of the techniques to implement many digital signal processing systems that

More information

High Speed Systolic Montgomery Modular Multipliers for RSA Cryptosystems

High Speed Systolic Montgomery Modular Multipliers for RSA Cryptosystems High Speed Systolic Montgomery Modular Multipliers for RSA Cryptosystems RAVI KUMAR SATZODA, CHIP-HONG CHANG and CHING-CHUEN JONG Centre for High Performance Embedded Systems Nanyang Technological University

More information

DESIGN AND IMPLEMENTATION OF DA- BASED RECONFIGURABLE FIR DIGITAL FILTER USING VERILOGHDL

DESIGN AND IMPLEMENTATION OF DA- BASED RECONFIGURABLE FIR DIGITAL FILTER USING VERILOGHDL DESIGN AND IMPLEMENTATION OF DA- BASED RECONFIGURABLE FIR DIGITAL FILTER USING VERILOGHDL [1] J.SOUJANYA,P.G.SCHOLAR, KSHATRIYA COLLEGE OF ENGINEERING,NIZAMABAD [2] MR. DEVENDHER KANOOR,M.TECH,ASSISTANT

More information

Parallel FFT Program Optimizations on Heterogeneous Computers

Parallel FFT Program Optimizations on Heterogeneous Computers Parallel FFT Program Optimizations on Heterogeneous Computers Shuo Chen, Xiaoming Li Department of Electrical and Computer Engineering University of Delaware, Newark, DE 19716 Outline Part I: A Hybrid

More information

Energy Efficient Adaptive Beamforming on Sensor Networks

Energy Efficient Adaptive Beamforming on Sensor Networks Energy Efficient Adaptive Beamforming on Sensor Networks Viktor K. Prasanna Bhargava Gundala, Mitali Singh Dept. of EE-Systems University of Southern California email: prasanna@usc.edu http://ceng.usc.edu/~prasanna

More information

Hardware Description of Multi-Directional Fast Sobel Edge Detection Processor by VHDL for Implementing on FPGA

Hardware Description of Multi-Directional Fast Sobel Edge Detection Processor by VHDL for Implementing on FPGA Hardware Description of Multi-Directional Fast Sobel Edge Detection Processor by VHDL for Implementing on FPGA Arash Nosrat Faculty of Engineering Shahid Chamran University Ahvaz, Iran Yousef S. Kavian

More information

Fixed Point Streaming Fft Processor For Ofdm

Fixed Point Streaming Fft Processor For Ofdm Fixed Point Streaming Fft Processor For Ofdm Sudhir Kumar Sa Rashmi Panda Aradhana Raju Abstract Fast Fourier Transform (FFT) processors are today one of the most important blocks in communication systems.

More information

FPGAs: THE HIGH-END ALTERNATIVE FOR DSP APPLICATIONS. By Dr. Chris Dick

FPGAs: THE HIGH-END ALTERNATIVE FOR DSP APPLICATIONS. By Dr. Chris Dick THE HIGH-END ALTERNATIVE FOR D APPLICATIONS By Dr. Chris Dick Engineers have been using field programmable gate arrays (FPGAs) to build high performance D systems for several years. FPGAs are uniquely

More information

The Efficient Implementation of Numerical Integration for FPGA Platforms

The Efficient Implementation of Numerical Integration for FPGA Platforms Website: www.ijeee.in (ISSN: 2348-4748, Volume 2, Issue 7, July 2015) The Efficient Implementation of Numerical Integration for FPGA Platforms Hemavathi H Department of Electronics and Communication Engineering

More information

Dynamically Configurable Online Statistical Flow Feature Extractor on FPGA

Dynamically Configurable Online Statistical Flow Feature Extractor on FPGA Dynamically Configurable Online Statistical Flow Feature Extractor on FPGA Da Tong, Viktor Prasanna Ming Hsieh Department of Electrical Engineering University of Southern California Email: {datong, prasanna}@usc.edu

More information

A Software LDPC Decoder Implemented on a Many-Core Array of Programmable Processors

A Software LDPC Decoder Implemented on a Many-Core Array of Programmable Processors A Software LDPC Decoder Implemented on a Many-Core Array of Programmable Processors Brent Bohnenstiehl and Bevan Baas Department of Electrical and Computer Engineering University of California, Davis {bvbohnen,

More information

16 BIT IMPLEMENTATION OF ASYNCHRONOUS TWOS COMPLEMENT ARRAY MULTIPLIER USING MODIFIED BAUGH-WOOLEY ALGORITHM AND ARCHITECTURE.

16 BIT IMPLEMENTATION OF ASYNCHRONOUS TWOS COMPLEMENT ARRAY MULTIPLIER USING MODIFIED BAUGH-WOOLEY ALGORITHM AND ARCHITECTURE. 16 BIT IMPLEMENTATION OF ASYNCHRONOUS TWOS COMPLEMENT ARRAY MULTIPLIER USING MODIFIED BAUGH-WOOLEY ALGORITHM AND ARCHITECTURE. AditiPandey* Electronics & Communication,University Institute of Technology,

More information

Image Compression System on an FPGA

Image Compression System on an FPGA Image Compression System on an FPGA Group 1 Megan Fuller, Ezzeldin Hamed 6.375 Contents 1 Objective 2 2 Background 2 2.1 The DFT........................................ 3 2.2 The DCT........................................

More information

Accelerating Equi-Join on a CPU-FPGA Heterogeneous Platform

Accelerating Equi-Join on a CPU-FPGA Heterogeneous Platform Accelerating Equi-Join on a CPU-FPGA Heterogeneous Platform Ren Chen, Viktor Prasanna Computer Engineering Technical Report Number CENG-05- Ming Hsieh Department of Electrical Engineering Systems University

More information

FPGA Implementation of Multiplierless 2D DWT Architecture for Image Compression

FPGA Implementation of Multiplierless 2D DWT Architecture for Image Compression FPGA Implementation of Multiplierless 2D DWT Architecture for Image Compression Divakara.S.S, Research Scholar, J.S.S. Research Foundation, Mysore Cyril Prasanna Raj P Dean(R&D), MSEC, Bangalore Thejas

More information

FAST Fourier transform (FFT) is an important signal processing

FAST Fourier transform (FFT) is an important signal processing IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, VOL. 54, NO. 4, APRIL 2007 889 Balanced Binary-Tree Decomposition for Area-Efficient Pipelined FFT Processing Hyun-Yong Lee, Student Member,

More information

A Ripple Carry Adder based Low Power Architecture of LMS Adaptive Filter

A Ripple Carry Adder based Low Power Architecture of LMS Adaptive Filter A Ripple Carry Adder based Low Power Architecture of LMS Adaptive Filter A.S. Sneka Priyaa PG Scholar Government College of Technology Coimbatore ABSTRACT The Least Mean Square Adaptive Filter is frequently

More information

ENERGY, AREA AND SPEED OPTIMIZED SIGNAL PROCESSING ON FPGA

ENERGY, AREA AND SPEED OPTIMIZED SIGNAL PROCESSING ON FPGA ENERGY, AREA AND SPEED OPTIMIZED SIGNAL PROCESSING ON FPGA A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Technology In VLSI design and embedded system By DURGA

More information

Power Spectral Density Computation using Modified Welch Method

Power Spectral Density Computation using Modified Welch Method IJSTE - International Journal of Science Technology & Engineering Volume 2 Issue 4 October 2015 ISSN (online): 2349-784X Power Spectral Density Computation using Modified Welch Method Betsy Elina Thomas

More information

Overview. CSE372 Digital Systems Organization and Design Lab. Hardware CAD. Two Types of Chips

Overview. CSE372 Digital Systems Organization and Design Lab. Hardware CAD. Two Types of Chips Overview CSE372 Digital Systems Organization and Design Lab Prof. Milo Martin Unit 5: Hardware Synthesis CAD (Computer Aided Design) Use computers to design computers Virtuous cycle Architectural-level,

More information

VHDL for Synthesis. Course Description. Course Duration. Goals

VHDL for Synthesis. Course Description. Course Duration. Goals VHDL for Synthesis Course Description This course provides all necessary theoretical and practical know how to write an efficient synthesizable HDL code through VHDL standard language. The course goes

More information

Creating Parameterized and Energy-Efficient System Generator Designs

Creating Parameterized and Energy-Efficient System Generator Designs Creating Parameterized and Energy-Efficient System Generator Designs Jingzhao Ou, Seonil Choi, Gokul Govindu, and Viktor K. Prasanna EE - Systems, University of Southern California {ouj,seonilch,govindu,prasanna}@usc.edu

More information

IEEE-754 compliant Algorithms for Fast Multiplication of Double Precision Floating Point Numbers

IEEE-754 compliant Algorithms for Fast Multiplication of Double Precision Floating Point Numbers International Journal of Research in Computer Science ISSN 2249-8257 Volume 1 Issue 1 (2011) pp. 1-7 White Globe Publications www.ijorcs.org IEEE-754 compliant Algorithms for Fast Multiplication of Double

More information

AREA-DELAY EFFICIENT FFT ARCHITECTURE USING PARALLEL PROCESSING AND NEW MEMORY SHARING TECHNIQUE

AREA-DELAY EFFICIENT FFT ARCHITECTURE USING PARALLEL PROCESSING AND NEW MEMORY SHARING TECHNIQUE AREA-DELAY EFFICIENT FFT ARCHITECTURE USING PARALLEL PROCESSING AND NEW MEMORY SHARING TECHNIQUE Yousri Ouerhani, Maher Jridi, Ayman Alfalou To cite this version: Yousri Ouerhani, Maher Jridi, Ayman Alfalou.

More information

DESIGN OF AN FFT PROCESSOR

DESIGN OF AN FFT PROCESSOR 1 DESIGN OF AN FFT PROCESSOR Erik Nordhamn, Björn Sikström and Lars Wanhammar Department of Electrical Engineering Linköping University S-581 83 Linköping, Sweden Abstract In this paper we present a structured

More information

Design & Analysis of 16 bit RISC Processor Using low Power Pipelining

Design & Analysis of 16 bit RISC Processor Using low Power Pipelining International OPEN ACCESS Journal ISSN: 2249-6645 Of Modern Engineering Research (IJMER) Design & Analysis of 16 bit RISC Processor Using low Power Pipelining Yedla Venkanna 148R1D5710 Branch: VLSI ABSTRACT:-

More information

Multiplierless Unity-Gain SDF FFTs

Multiplierless Unity-Gain SDF FFTs Multiplierless Unity-Gain SDF FFTs Mario Garrido Gálvez, Rikard Andersson, Fahad Qureshi and Oscar Gustafsson Journal Article N.B.: When citing this work, cite the original article. 216 IEEE. Personal

More information