FFT ALGORITHMS FOR MULTIPLY-ADD ARCHITECTURES
FRANCHETTI Franz (AUT), KALTENBERGER Florian (AUT), UEBERHUBER Christoph W. (AUT)

Abstract. FFTs are the single most important algorithms in science and engineering. To utilize current hardware architectures, special techniques are needed. This paper introduces newly developed radix-$2^n$ FFT kernels that efficiently take advantage of fused multiply-add (FMA) instructions. If a processor provides FMA instructions, the new radix-$2^n$ kernels reduce the number of required twiddle factors from $2^n - 1$ to $n$ compared to conventional radix-$2^n$ FFT kernels. The reduction is accomplished by a hidden computation of the twiddle factors (instead of accessing a twiddle factor array), with the additional arithmetic operations hidden in FMA instructions. The new FFT kernels are fully compatible with conventional FFT kernels and can therefore easily be incorporated into existing FFT software, such as Fftw or Spiral.

1 Introduction

In the computation of an FFT, memory accesses dominate the runtime. There are numerous FFT algorithms with identical arithmetic complexity whose memory access patterns differ significantly. Accordingly, their runtime behavior differs tremendously. The state of the art in FFT software are codes like Spiral (Moura et al. [13]) or Fftw (Frigo and Johnson [4]) that adapt themselves automatically to the given memory and hardware features (Fabianek et al. [2]).

In this paper the memory access bottleneck is attacked by utilizing the parallelism inherent in FMA instructions to reduce the number of memory accesses. This efficiency enhancement is accomplished by a hidden computation of the twiddle factors (instead of accessing a twiddle factor array): the additional arithmetic operations are folded into FMA instructions, which also increases the utilization of FMA instructions.
The goal of earlier works (Linzer and Feig [10, 11], Goedecker [6], Karner et al. [9]) was to reduce the arithmetic complexity of FFT kernels. No attempt was made to reduce the number of memory accesses. The newly developed FFT kernels introduced in this paper require only $n$ twiddle factors instead of the $2^n - 1$ twiddle factors required by conventional radix-$2^n$ kernels. In contrast to existing FMA optimized FFT kernels, the newly developed kernels are fully compatible with conventional FFT kernels, since there is no need to scale the twiddle factors, and can therefore easily be incorporated into existing FFT software. This feature was tested by installing the new kernels into Fftw, the currently most sophisticated and fastest FFT package (Karner et al. [8]).

2 Fused Multiply-Add Instructions

Fused multiply-add (FMA) instructions perform the ternary operation $\pm a \pm b \times c$ in the same amount of time as needed for a single floating-point addition or multiplication. Let $\pi_{\mathrm{fma}}$ denote the FMA arithmetic complexity, i.e., the number of floating-point operations (additions, multiplications, and multiply-add operations) needed to perform a specific numerical task on a processor equipped with FMA instructions. It is useful to introduce the term FMA utilization to characterize the degree to which an algorithm takes advantage of multiply-add instructions. FMA utilization is given by

\[ F := \frac{\pi_R - \pi_{\mathrm{fma}}}{\pi_{\mathrm{fma}}} \cdot 100 \ [\%], \tag{1} \]

where $\pi_R$ denotes the real arithmetic complexity. $F$ is the percentage of floating-point operations performed by multiply-add instructions.

Some compilers are able to extract FMA operations from given source code and produce FMA optimized code. However, compilers cannot deal with all performance issues, because not all of the necessary information is available at compile time. Therefore the underlying algorithm has to be dealt with on a more abstract level. The main result of this section is that certain matrix computations occurring in FFT algorithms can be optimized with respect to FMA utilization.
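As a small illustration of definition (1), the following sketch evaluates $F$ for two hand-counted cases. The function name `fma_utilization` and the two operation counts are assumptions made for this example, not figures taken from the paper.

```python
def fma_utilization(pi_r, pi_fma):
    """FMA utilization F in percent, following definition (1).

    pi_r   -- real arithmetic complexity (additions + multiplications)
    pi_fma -- number of instructions when FMAs are available
    """
    if pi_fma <= 0:
        raise ValueError("pi_fma must be positive")
    return (pi_r - pi_fma) / pi_fma * 100.0


# Complex update a + c*b: 4 multiplications + 4 additions = 8 real operations,
# which map onto 4 FMA instructions.
full = fma_utilization(8, 4)

# Plain complex product c*b: 4 multiplications + 2 additions = 6 real operations,
# which map onto 2 multiplications + 2 FMAs = 4 instructions.
partial = fma_utilization(6, 4)

print(full, partial)
```

In the first case every instruction is an FMA, so the utilization is 100 %; in the second case only half of the instructions are FMAs.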
The rules introduced in the following can be incorporated into self-adapting numerical digital signal processing software like Spiral. Algorithms for the matrix-vector multiplication $Ab$, where $A$ has a special structure, can be carried out using FMA instructions only ($F = 100\,\%$), i.e., such computations are FMA optimized. The following lemmas characterize some special FMA optimized computations (Franchetti et al. [3]).

Lemma 2.1 For the complex matrices

\[ A = \begin{pmatrix} \pm 1 & \pm c \\ 0 & 1 \end{pmatrix}, \quad \begin{pmatrix} 0 & 1 \\ \pm 1 & \pm c \end{pmatrix}, \quad \begin{pmatrix} \pm c & \pm 1 \\ 0 & 1 \end{pmatrix}, \quad \begin{pmatrix} 0 & 1 \\ \pm c & \pm 1 \end{pmatrix}, \tag{2} \]

the calculation of $Ab$ for $b \in \mathbb{C}^2$ requires only two FMAs ($\pi_{\mathrm{fma}} = 2$, $F = 100\,\%$) if $c$ is a trivial twiddle factor (i.e., $c \in \mathbb{R}$ or $c \in i\mathbb{R}$). Otherwise the computation of $Ab$ requires four FMAs ($\pi_{\mathrm{fma}} = 4$, $F = 100\,\%$).

Corollary 2.2 If $A, A_1, A_2, \ldots, A_r$ can be FMA optimized, then matrices of the form $\prod_{i=1}^{r} A_i$, $I_m \otimes A \otimes I_n$, and the matrix conjugate $L^{mn}_m A L^{mn}_n$, where $L^{mn}_n$ and $L^{mn}_m = (L^{mn}_n)^{-1}$ are stride permutation matrices (Van Loan [7]), can also be FMA optimized.

Multiplications by non-trivial twiddle factors can also be FMA optimized.
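The operation counts of Lemma 2.1, and the remark that a multiplication by a non-trivial twiddle factor can be done entirely with FMAs, can be made concrete with a short sketch. Everything here is an assumption for illustration: the class `FmaCounter` merely emulates an FMA as `a*b + c` (without fused rounding) and counts calls, and `rotate` uses the well-known three-shear (lifting) decomposition of a plane rotation, which is one way to reach three FMAs and is not necessarily the authors' construction.

```python
import cmath
import math


class FmaCounter:
    """Counts emulated FMA operations; a*b + c is rounded twice here, not fused."""

    def __init__(self):
        self.count = 0

    def fma(self, a, b, c):
        self.count += 1
        return a * b + c


def axpy_real(ops, b1, b2, c):
    """b1 + c*b2 for real c: two FMAs (first row of A = [[1, c], [0, 1]])."""
    return complex(ops.fma(c, b2.real, b1.real), ops.fma(c, b2.imag, b1.imag))


def axpy_complex(ops, b1, b2, c):
    """b1 + c*b2 for a general complex c: four FMAs."""
    re = ops.fma(c.real, b2.real, b1.real)
    re = ops.fma(-c.imag, b2.imag, re)
    im = ops.fma(c.real, b2.imag, b1.imag)
    im = ops.fma(c.imag, b2.real, im)
    return complex(re, im)


def rotate(ops, z, theta):
    """z * exp(i*theta) by three shears, i.e. three FMAs; needs sin(theta) != 0."""
    s = math.sin(theta)          # constant, precomputed in a real kernel
    t = -math.tan(theta / 2.0)   # constant, precomputed in a real kernel
    x = ops.fma(t, z.imag, z.real)
    y = ops.fma(s, x, z.imag)
    x = ops.fma(t, y, x)
    return complex(x, y)


b1, b2 = 1.0 + 2.0j, 3.0 - 1.0j
ops = FmaCounter()
r_real = axpy_real(ops, b1, b2, 0.5)
n_real = ops.count
r_cplx = axpy_complex(ops, b1, b2, 0.6 - 0.8j)
n_cplx = ops.count - n_real
r_rot = rotate(ops, b2, 0.7)
n_rot = ops.count - n_real - n_cplx
print(n_real, n_cplx, n_rot)
```

The shear constants $\sin\theta$ and $-\tan(\theta/2)$ depend only on the twiddle factor, so for fixed twiddle factors they can be tabulated in advance.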
Lemma 2.3 A complex multiplication by $\omega \in \mathbb{C}$ with $|\omega| = 1$ requires only three FMA instructions ($\pi_{\mathrm{fma}} = 3$, $F = 100\,\%$).

3 FMA Optimized FFT Kernels

The discrete Fourier transform $y \in \mathbb{C}^N$ of an input vector $x \in \mathbb{C}^N$ is defined by the matrix-vector product $y := F_N x$, where $F_N = [\omega_N^{jk}]_{j,k = 0, 1, \ldots, N-1}$ with $\omega_N = e^{-2\pi i/N}$ is the Fourier transform matrix. The basis of the fast Fourier transform (FFT) is the Cooley-Tukey radix-$p$ splitting. For $N = pq \ge 2$ it holds that

\[ F_N = (F_p \otimes I_q)\, T^{pq}_q\, (I_p \otimes F_q)\, L^{pq}_p, \tag{3} \]

where

\[ T^{pq}_q = \bigoplus_{i=0}^{p-1} \bigoplus_{j=0}^{q-1} \omega_{pq}^{ij} = \bigoplus_{i=0}^{p-1} \operatorname{diag}(1, \omega_{pq}, \ldots, \omega_{pq}^{q-1})^{i} \tag{4} \]

is the twiddle factor matrix and $L^{pq}_p$ is the stride-by-$p$ permutation matrix (Van Loan [7]).

In Fftw the computation of an FFT is performed by so-called codelets, i.e., small pieces of highly optimized machine generated code. There are two types of codelets: twiddle codelets are used to perform multiplications by twiddle factors and to compute in-place FFTs, whereas no-twiddle codelets are used to compute out-of-place FFTs. (i) Twiddle codelets of size $p$ compute

\[ y := (F_p \otimes I_q)\, T^{pq}_q\, y, \tag{5} \]

and (ii) no-twiddle codelets of size $q$ perform an out-of-place computation of $y := F_q x$.

If $N = N_1 N_2 \cdots N_n$, then the splitting (3) can be applied recursively and the fast Fourier transform algorithm (with $O(N \ln N)$ complexity) is obtained. Each factorization of $N$ leads to an FFT program with a different memory access pattern and hence a different runtime behavior. Fftw tries to find the optimum algorithm for a given machine by applying, for instance, dynamic programming (Fabianek et al. [2]).

A no-twiddle codelet of size $q$ can be FMA optimized with existing compiler techniques [5, 9]. A twiddle codelet, on the other hand, cannot be optimized at compile time, since the twiddle factors can change from call to call. However, with some formula manipulation and the tools developed in the previous section it is possible to achieve full FMA utilization even in twiddle codelets.
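The splitting (3) can be checked numerically by applying its four factors to a vector one after the other. The sketch below is only an illustration under stated assumptions: the helper names `dft` and `cooley_tukey_step` are made up for this example, the sign convention $\omega_N = e^{-2\pi i/N}$ is assumed (the identity holds for either sign as long as it is used consistently), and the small DFTs are evaluated naively rather than recursively.

```python
import cmath


def dft(x):
    """Naive DFT with omega_N = exp(-2*pi*i/N); used for the small transforms."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]


def cooley_tukey_step(x, p, q):
    """Apply (F_p (x) I_q) T (I_p (x) F_q) L to x, factor by factor (right to left)."""
    n = p * q
    assert len(x) == n
    # L^{pq}_p: stride-by-p permutation, giving p blocks of q elements each.
    blocks = [[x[j * p + i] for j in range(q)] for i in range(p)]
    # I_p (x) F_q: a size-q DFT on every block.
    blocks = [dft(block) for block in blocks]
    # T^{pq}_q: element k2 of block i is scaled by omega_{pq}^(i*k2).
    w = cmath.exp(-2j * cmath.pi / n)
    blocks = [[w ** (i * k2) * blocks[i][k2] for k2 in range(q)] for i in range(p)]
    # F_p (x) I_q: a size-p DFT across the blocks at stride q.
    y = [0j] * n
    for k2 in range(q):
        column = dft([blocks[i][k2] for i in range(p)])
        for k1 in range(p):
            y[k1 * q + k2] = column[k1]
    return y


x = [complex(k + 1, (k * k) % 5) for k in range(15)]
error = max(abs(a - b) for a, b in zip(cooley_tukey_step(x, 3, 5), dft(x)))
print(error < 1e-9)
```

Choosing a different factorization of the same length (for example $p = 5$, $q = 3$) gives the same result but visits the data in a different order, which is the degree of freedom that adaptive FFT software exploits.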
A twiddle codelet can be written as

\[ (F_p \otimes I_q)\, T^{pq}_q = L^{pq}_p (I_q \otimes F_p) \underbrace{L^{pq}_q L^{pq}_p}_{= I_{pq}} T^{pq}_p L^{pq}_q = L^{pq}_p \left[ \bigoplus_{j=0}^{q-1} F_p \operatorname{diag}(1, \omega_{pq}^{j}, \ldots, \omega_{pq}^{(p-1)j}) \right] L^{pq}_q. \]

A factor of the form

\[ F_p \operatorname{diag}(1, \omega^{j}, \ldots, \omega^{(p-1)j}) \tag{6} \]

is called an elementary radix-$p$ FFT kernel. In the following the kernel (6) will be FMA optimized, since then, by applying Corollary 2.2, the twiddle codelet (5) can be FMA optimized as well.

Usually the twiddle factors are pre-computed, stored in an array, and accessed during the FFT computation. In the following a method is presented to reduce the number of memory accesses by a hidden on-line computation of the required twiddle factors. First, the method is illustrated by means of a radix-8 kernel computation.

FMA Optimized Radix-8 Kernel. The radix-8 kernel computation is

\[ y := F_8 \operatorname{diag}(1, \omega, \omega^2, \omega^3, \omega^4, \omega^5, \omega^6, \omega^7)\, x. \tag{7} \]

Using a radix-2 decimation-in-frequency (DIF) factorization, (7) can be written as

\[ y := R_8 (I_4 \otimes F_2) D_3 T_2 (I_2 \otimes F_2 \otimes I_2) D_2 T_1 (F_2 \otimes I_4) D_1 x, \tag{8} \]

\[ \begin{aligned} D_1 &= \operatorname{diag}(1, 1, 1, 1, \omega^4, \omega^4, \omega^4, \omega^4), \\ D_2 &= \operatorname{diag}(1, 1, \omega^2, \omega^2, 1, 1, \omega^2, \omega^2), \\ D_3 &= \operatorname{diag}(1, \omega, 1, \omega, 1, \omega, 1, \omega), \\ T_1 &= \operatorname{diag}(1, 1, 1, 1, 1, c_1, -i, c_2), \\ T_2 &= \operatorname{diag}(1, 1, 1, -i, 1, 1, 1, -i), \\ c_1 &= (1 - i)/\sqrt{2}, \quad c_2 = -(1 + i)/\sqrt{2}. \end{aligned} \tag{9} \]

$R_8$ is the bit-reversal permutation matrix (Van Loan [7]).

Arithmetic Complexity. The heart of the FMA optimized algorithm is the factorization of the initial twiddle factor scaling matrix $\operatorname{diag}(1, \omega, \omega^2, \omega^3, \omega^4, \omega^5, \omega^6, \omega^7)$ into $D_1 D_2 D_3$. $D_1$, $D_2$, and $D_3$ are distributed over the three stages of the radix-8 kernel computation (8). To fully utilize FMA instructions for computing the first stage, $T_1 (F_2 \otimes I_4) D_1 x$, the following factorization is applied (Meyer and Schwarz [12]):

\[ (F_2 \otimes I_4) D_1 = (F_2 \otimes I_4)(\operatorname{diag}(1, \omega^4) \otimes I_4) = \left[ \begin{pmatrix} \cdot & \cdot \\ \cdot & \cdot \end{pmatrix} \begin{pmatrix} \cdot & \cdot \\ \cdot & \cdot \end{pmatrix} \right] \otimes I_4. \tag{10} \]

(The entries of the two $2 \times 2$ factors, one of which contains $\omega^4$, are not legible in this transcription.)

Now, by Lemma 2.1 and Corollary 2.2, the matrix-vector product $(F_2 \otimes I_4) D_1 x$ requires 24 FMA operations ($\pi_{\mathrm{fma}} = 24$, $F = 100\,\%$). $T_1$ contains the two non-trivial twiddle factors $c_1$ and $c_2$. Therefore, by Lemma 2.3, scaling by $T_1$ can be carried out with 6 FMA instructions ($\pi_{\mathrm{fma}} = 6$, $F = 100\,\%$). Factorizations similar to (10) for stages two and three lead to an arithmetic complexity of 78 ($= 3 \cdot 24 + 6$) FMA instructions ($\pi_{\mathrm{fma}} = 78$, $F = 100\,\%$) for a radix-8 FFT kernel. From (9) it can be seen that only three twiddle factors, $\omega^4$, $\omega^2$, and $\omega$, are required.

FMA Optimized Radix-$2^n$ FFT Kernels. Using a radix-2 DIF factorization, the radix-$2^n$ kernel can be written as

\[ F_{2^n} \operatorname{diag}(1, \omega, \ldots, \omega^{2^n - 1}) = R_{2^n} \left[ \prod_{i=n}^{1} \left(I_{2^{i-1}} \otimes T^{2^{n-i+1}}_{2^{n-i}}\right) \left(I_{2^{i-1}} \otimes F_2 \otimes I_{2^{n-i}}\right) D_i \right], \tag{11} \]

\[ T^{2^{n-i+1}}_{2^{n-i}} = I_{2^{n-i}} \oplus \operatorname{diag}\!\left(1, \omega_{2^{n-i+1}}, \ldots, \omega_{2^{n-i+1}}^{2^{n-i} - 1}\right), \tag{12} \]

\[ D_i = I_{2^{i-1}} \otimes \operatorname{diag}\!\left(1, \omega^{2^{n-i}}\right) \otimes I_{2^{n-i}}. \tag{13} \]

$R_{2^n}$ is the bit-reversal permutation matrix (Van Loan [7]). (13) shows that only the $n$ twiddle factors $\omega^{2^{n-1}}, \ldots, \omega^2, \omega$ are required in the radix-$2^n$ kernel. Applying factorization (10) and Lemma 2.3 leads to the arithmetic complexity

\[ \pi_{\mathrm{fma}} = 4.5\, n\, 2^n - 9 \cdot 2^{n-1} + 6. \tag{14} \]
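The stage structure of the radix-$2^n$ kernel can be sketched in a few lines. The code below is an illustrative model under stated assumptions, not the authors' kernel: FMAs are emulated and counted by a hypothetical `FmaCounter`, the butterfly computes $t = b_1 + \omega b_2$ with four FMAs and then $2 b_1 - t$ with two more (one way to obtain six FMAs per butterfly), non-trivial entries of the $T$ matrices are applied with a three-shear rotation (three FMAs, as in Lemma 2.3), and multiplications by $1$ and $-i$ are treated as free. For $n = 3$ the count reproduces the 78 FMAs stated for the radix-8 kernel, and in general it equals $9(n-1)2^{n-1} + 6 = 4.5\,n\,2^n - 9 \cdot 2^{n-1} + 6$.

```python
import cmath


class FmaCounter:
    """Counts emulated FMAs; a*b + c is not actually fused here."""

    def __init__(self):
        self.count = 0

    def fma(self, a, b, c):
        self.count += 1
        return a * b + c


def butterfly(ops, b1, b2, w):
    """Return (b1 + w*b2, b1 - w*b2) using six FMAs."""
    tr = ops.fma(w.real, b2.real, b1.real)
    tr = ops.fma(-w.imag, b2.imag, tr)
    ti = ops.fma(w.real, b2.imag, b1.imag)
    ti = ops.fma(w.imag, b2.real, ti)
    ur = ops.fma(2.0, b1.real, -tr)      # 2*b1 - t, real part
    ui = ops.fma(2.0, b1.imag, -ti)      # 2*b1 - t, imaginary part
    return complex(tr, ti), complex(ur, ui)


def unit_multiply(ops, z, w):
    """z*w for |w| = 1 and Im(w) != 0 by three shears (three FMAs)."""
    s = w.imag
    t = (w.real - 1.0) / s               # equals -tan(theta/2); a constant
    x = ops.fma(t, z.imag, z.real)
    y = ops.fma(s, x, z.imag)
    x = ops.fma(t, y, x)
    return complex(x, y)


def radix2n_kernel(x, omega, ops):
    """Compute F_{2^n} diag(1, omega, ..., omega^(2^n - 1)) x stage by stage."""
    size = len(x)
    n = size.bit_length() - 1
    assert n >= 1 and size == 1 << n
    y = list(x)
    for i in range(1, n + 1):
        half = 1 << (n - i)
        block = 2 * half
        w_stage = omega ** half          # the one twiddle factor stage i needs
        for start in range(0, size, block):
            for c in range(half):
                lo, hi = start + c, start + c + half
                y[lo], v = butterfly(ops, y[lo], y[hi], w_stage)
                if c == 0:
                    y[hi] = v                          # twiddle 1
                elif 4 * c == block:
                    y[hi] = complex(v.imag, -v.real)   # twiddle -i
                else:
                    y[hi] = unit_multiply(
                        ops, v, cmath.exp(-2j * cmath.pi * c / block))
    reverse = lambda k: int(format(k, "0{}b".format(n))[::-1], 2)
    return [y[reverse(k)] for k in range(size)]


def reference(x, omega):
    """Direct evaluation of the same kernel for comparison."""
    size = len(x)
    scaled = [omega ** m * x[m] for m in range(size)]
    return [sum(scaled[m] * cmath.exp(-2j * cmath.pi * m * k / size)
                for m in range(size)) for k in range(size)]


omega = cmath.exp(-2j * cmath.pi * 3 / 128)
counts = {}
for n in range(1, 6):
    data = [complex(m + 1, 0.5 - m) for m in range(1 << n)]
    ops = FmaCounter()
    result = radix2n_kernel(data, omega, ops)
    expected = reference(data, omega)
    assert max(abs(a - b) for a, b in zip(result, expected)) < 1e-8
    counts[n] = ops.count
print(counts)
```

Only `w_stage` depends on $\omega$, so each call needs $n$ values derived from the twiddle factor; the constants inside the `unit_multiply` calls are independent of $\omega$ and would be compiled into a real kernel.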
Table 1: Operation counts (number TF of required twiddle factors, number of load operations, arithmetic complexity $\pi_{\mathrm{fma}}$, and FMA utilization $F$) for radix-2, -4, -8, -16, and -32 kernels. Columns: Radix; conventional kernels (TF, Loads, $\pi_{\mathrm{fma}}$, FMA utilization); FMA optimized kernels (TF, Loads, $\pi_{\mathrm{fma}}$, FMA utilization). [The numeric entries of the table are not preserved in this transcription.]

4 Runtime Experiments

Numerical experiments were performed on one processor of an SGI PowerChallenge XL R10000. In a first series of experiments, FFT algorithms using the newly developed FMA optimized radix-8 FFT kernel were compared with FFT algorithms using a conventional radix-8 kernel in a triple-loop Cooley-Tukey framework. In the experiments, cycles and L2 data cache misses were counted using the on-chip performance monitor counter of the MIPS R10000 processor (Auer et al. [1]). Run times were derived from the cycle counts.

The superscalar architecture of the R10000 processor is able to execute up to four instructions per cycle. Since the R10000 can execute one load or store and one multiply-add instruction per cycle, the compiler can generate an instruction schedule that overlaps memory references with floating-point operations. This is why the lower instruction count of the FMA optimized radix-8 kernel does not result in a significant speed-up for lengths $N \le 2^9$.

The speed-up in execution time compared to a conventional radix-8 kernel can be seen in Fig. 1. For $N \le 2^{15}$ the speed-up is due to the lower number of primary data cache misses. For $N > 2^{15}$, when secondary data cache misses make memory accesses very costly, the benefit of the new radix-8 kernel, i.e., fewer memory accesses leading to fewer secondary data cache misses, results in a substantial speed-up in execution time.
[Figure 1 panels: speed-up in percent versus vector length; normalized L2 cache misses versus vector length for the FMA optimized radix-8 kernel and the conventional radix-8 kernel.]

Figure 1: Speed-up in minimum execution time and normalized number $M_2 / (N \log_2 N)$ of L2 cache misses of the FMA optimized radix-8 FFT algorithm compared to a conventional radix-8 FFT algorithm on one processor of an SGI Power Challenge XL.
An advantage of the FFT kernels presented in this paper is their compatibility with conventional kernels, which makes it easy to incorporate them into existing FFT software. This feature has been demonstrated by incorporating the multiply-add optimized radix-8 kernel into Fftw (Karner et al. [8]). The installation of the FMA optimized radix-8 kernel in Fftw did not cause any difficulty. For transform lengths $N \ge 2^{20}$ a speed-up of more than 30 % was achieved.

5 Conclusion

The advantages of the multiply-add optimized FFT kernels presented in this paper are their low number of required load operations, their high efficiency on performance oriented computer systems, and their striking simplicity. Current work is the integration of the method presented above into Fftw's kernel generator genfft (Frigo, Kral [5]) to generate FMA optimized radix-16, -32, and -64 kernels.

6 References

1. Auer, M., et al. Performance Evaluation of FFT Algorithms Using Performance Counters. Tech. Rep. Aurora TR, Vienna University of Technology.
2. Fabianek, C., et al. Survey of Self-adapting FFT Software. Tech. Rep. Aurora TR, Vienna University of Technology.
3. Franchetti, F., et al. FMA Optimized FFT Codelets. Tech. Rep. Aurora TR, Vienna University of Technology.
4. Frigo, M. and Johnson, S. G. FFTW: An Adaptive Software Architecture for the FFT. In: Proc. ICASSP.
5. Frigo, M. and Kral, S. The Advanced FFT Program Generator GENFFT. Tech. Rep. Aurora TR, Vienna University of Technology.
6. Goedecker, S. Fast Radix 2, 3, 4, and 5 Kernels for Fast Fourier Transformations on Computers with Overlapping Multiply-Add Instructions. SIAM J. Sci. Comput., 18.
7. Golub, G. H. and Van Loan, C. F. Matrix Computations. Johns Hopkins University Press, Baltimore, 2nd edn.
8. Karner, H., et al. Accelerating FFTW by Multiply-Add Optimization. Tech. Rep. Aurora TR, Vienna University of Technology.
9. Karner, H., et al. Multiply-Add Optimized FFT Kernels. Math. Models and Methods in Appl. Sci., 11.
10. Linzer, E. N. and Feig, E. Implementation of Efficient FFT Algorithms on Fused Multiply-Add Architectures. IEEE Trans. Signal Processing, 41.
11. Linzer, E. N. and Feig, E. Modified FFTs for Fused Multiply-Add Architectures. Math. Comp., 60.
12. Meyer, R. and Schwarz, K. FFT Implementation on DSP Chips: Theory and Practice. In: Proc. ICASSP.
13. Pueschel, M., et al. Spiral: A Generator for Platform-Adapted Libraries of Signal Processing Algorithms. Journal of High Performance Computing and Applications, submitted.
Acknowledgement

This work is the result of a cooperation with our colleague and friend Herbert Karner, who passed away in October. The project was supported by the Special Research Program SFB F011 AURORA of the Austrian Science Fund FWF.

Current Address

Institute for Applied Mathematics and Numerical Analysis, Vienna University of Technology, Wiedner Hauptstrasse 8-10, A-1040 Wien, Austria, {franz,kalti,christof}@aurora.anum.tuwien.ac.at
More informationAn Adaptive Framework for Scientific Software Libraries. Ayaz Ali Lennart Johnsson Dept of Computer Science University of Houston
An Adaptive Framework for Scientific Software Libraries Ayaz Ali Lennart Johnsson Dept of Computer Science University of Houston Diversity of execution environments Growing complexity of modern microprocessors.
More informationLinköping University Post Print. Analysis of Twiddle Factor Memory Complexity of Radix-2^i Pipelined FFTs
Linköping University Post Print Analysis of Twiddle Factor Complexity of Radix-2^i Pipelined FFTs Fahad Qureshi and Oscar Gustafsson N.B.: When citing this work, cite the original article. 200 IEEE. Personal
More informationHow to Write Fast Numerical Code
How to Write Fast Numerical Code Lecture: Cost analysis and performance Instructor: Markus Püschel TA: Alen Stojanov, Georg Ofenbeck, Gagandeep Singh Technicalities Research project: Let us know (fastcode@lists.inf.ethz.ch)
More informationThe Fast Fourier Transform
Chapter 7 7.1 INTRODUCTION The Fast Fourier Transform In Chap. 6 we saw that the discrete Fourier transform (DFT) could be used to perform convolutions. In this chapter we look at the computational requirements
More informationEfficient Radix-4 and Radix-8 Butterfly Elements
Efficient Radix4 and Radix8 Butterfly Elements Weidong Li and Lars Wanhammar Electronics Systems, Department of Electrical Engineering Linköping University, SE581 83 Linköping, Sweden Tel.: +46 13 28 {1721,
More informationAN FFT PROCESSOR BASED ON 16-POINT MODULE
AN FFT PROCESSOR BASED ON 6-POINT MODULE Weidong Li, Mark Vesterbacka and Lars Wanhammar Electronics Systems, Dept. of EE., Linköping University SE-58 8 LINKÖPING, SWEDEN E-mail: {weidongl, markv, larsw}@isy.liu.se,
More informationEfficient Methods for FFT calculations Using Memory Reduction Techniques.
Efficient Methods for FFT calculations Using Memory Reduction Techniques. N. Kalaiarasi Assistant professor SRM University Kattankulathur, chennai A.Rathinam Assistant professor SRM University Kattankulathur,chennai
More informationModule 9 : Numerical Relaying II : DSP Perspective
Module 9 : Numerical Relaying II : DSP Perspective Lecture 36 : Fast Fourier Transform Objectives In this lecture, We will introduce Fast Fourier Transform (FFT). We will show equivalence between FFT and
More informationParallel Fast Fourier Transform implementations in Julia 12/15/2011
Parallel Fast Fourier Transform implementations in Julia 1/15/011 Abstract This paper examines the parallel computation models of Julia through several different multiprocessor FFT implementations of 1D
More informationParallel-computing approach for FFT implementation on digital signal processor (DSP)
Parallel-computing approach for FFT implementation on digital signal processor (DSP) Yi-Pin Hsu and Shin-Yu Lin Abstract An efficient parallel form in digital signal processor can improve the algorithm
More informationAutomatic Performance Tuning and Machine Learning
Automatic Performance Tuning and Machine Learning Markus Püschel Computer Science, ETH Zürich with: Frédéric de Mesmay PhD, Electrical and Computer Engineering, Carnegie Mellon PhD and Postdoc openings:
More informationVLSI Based Low Power FFT Implementation using Floating Point Operations
VLSI ased Low Power FFT Implementation using Floating Point Operations Pooja Andhale, Manisha Ingle Abstract This paper presents low power floating point FFT implementation based low power multiplier architectures
More informationHow to Write Fast Numerical Code Spring 2012 Lecture 20. Instructor: Markus Püschel TAs: Georg Ofenbeck & Daniele Spampinato
How to Write Fast Numerical Code Spring 2012 Lecture 20 Instructor: Markus Püschel TAs: Georg Ofenbeck & Daniele Spampinato Planning Today Lecture Project meetings Project presentations 10 minutes each
More informationDESIGN OF PARALLEL PIPELINED FEED FORWARD ARCHITECTURE FOR ZERO FREQUENCY & MINIMUM COMPUTATION (ZMC) ALGORITHM OF FFT
IMPACT: International Journal of Research in Engineering & Technology (IMPACT: IJRET) ISSN(E): 2321-8843; ISSN(P): 2347-4599 Vol. 2, Issue 4, Apr 2014, 199-206 Impact Journals DESIGN OF PARALLEL PIPELINED
More informationThe Fastest Fourier Transform in the West
: The Fastest Fourier Transform in the West Steven G. ohnson, MIT Applied Mathematics Matteo Frigo, Cilk Arts Inc. In the beginning (c. 1805): Carl Friedrich Gauss declination angle ( ) 30 25 20 15 10
More informationPRIME FACTOR CYCLOTOMIC FOURIER TRANSFORMS WITH REDUCED COMPLEXITY OVER FINITE FIELDS
PRIME FACTOR CYCLOTOMIC FOURIER TRANSFORMS WITH REDUCED COMPLEXITY OVER FINITE FIELDS Xuebin Wu, Zhiyuan Yan, Ning Chen, and Meghanad Wagh Department of ECE, Lehigh University, Bethlehem, PA 18015 PMC-Sierra
More informationLow Power and Memory Efficient FFT Architecture Using Modified CORDIC Algorithm
Low Power and Memory Efficient FFT Architecture Using Modified CORDIC Algorithm 1 A.Malashri, 2 C.Paramasivam 1 PG Student, Department of Electronics and Communication K S Rangasamy College Of Technology,
More informationAutomating the Modeling and Optimization of the Performance of Signal Transforms
IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 50, NO. 8, AUGUST 2002 2003 Automating the Modeling and Optimization of the Performance of Signal Transforms Bryan Singer and Manuela M. Veloso Abstract Fast
More informationModified Welch Power Spectral Density Computation with Fast Fourier Transform
Modified Welch Power Spectral Density Computation with Fast Fourier Transform Sreelekha S 1, Sabi S 2 1 Department of Electronics and Communication, Sree Budha College of Engineering, Kerala, India 2 Professor,
More informationHigh Throughput Energy Efficient Parallel FFT Architecture on FPGAs
High Throughput Energy Efficient Parallel FFT Architecture on FPGAs Ren Chen Ming Hsieh Department of Electrical Engineering University of Southern California Los Angeles, USA 989 Email: renchen@usc.edu
More informationPerformance Analysis of a Family of WHT Algorithms
Performance Analysis of a Family of WHT Algorithms Michael Andrews and Jeremy Johnson Department of Computer Science Drexel University Philadelphia, PA USA January, 7 Abstract This paper explores the correlation
More informationHow to Write Fast Numerical Code
How to Write Fast Numerical Code Lecture: Optimizing FFT, FFTW Instructor: Markus Püschel TA: Georg Ofenbeck & Daniele Spampinato Rest of Semester Today Lecture Project meetings Project presentations 10
More informationSusumu Yamamoto. Tokyo, , JAPAN. Fourier Transform (DFT), and the calculation time is in proportion to N logn,
Effective Implementations of Multi-Dimensional Radix-2 FFT Susumu Yamamoto Department of Applied Physics, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-8656, JAPAN Abstract The Fast Fourier Transform
More informationHow to Write Fast Numerical Code Spring 2011 Lecture 22. Instructor: Markus Püschel TA: Georg Ofenbeck
How to Write Fast Numerical Code Spring 2011 Lecture 22 Instructor: Markus Püschel TA: Georg Ofenbeck Schedule Today Lecture Project presentations 10 minutes each random order random speaker 10 Final code
More informationAdaptive Matrix Transpose Algorithms for Distributed Multicore Processors
Adaptive Matrix Transpose Algorithms for Distributed Multicore ors John C. Bowman and Malcolm Roberts Abstract An adaptive parallel matrix transpose algorithm optimized for distributed multicore architectures
More informationGenerating SIMD Vectorized Permutations
Generating SIMD Vectorized Permutations Franz Franchetti and Markus Püschel Electrical and Computer Engineering, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213 {franzf, pueschel}@ece.cmu.edu
More informationImplementation of FFT Processor using Urdhva Tiryakbhyam Sutra of Vedic Mathematics
Implementation of FFT Processor using Urdhva Tiryakbhyam Sutra of Vedic Mathematics Yojana Jadhav 1, A.P. Hatkar 2 PG Student [VLSI & Embedded system], Dept. of ECE, S.V.I.T Engineering College, Chincholi,
More informationA SEARCH OPTIMIZATION IN FFTW. by Liang Gu
A SEARCH OPTIMIZATION IN FFTW by Liang Gu A thesis submitted to the Faculty of the University of Delaware in partial fulfillment of the requirements for the degree of Master of Science in Electrical and
More informationMULTIPLIERLESS HIGH PERFORMANCE FFT COMPUTATION
MULTIPLIERLESS HIGH PERFORMANCE FFT COMPUTATION Maheshwari.U 1, Josephine Sugan Priya. 2, 1 PG Student, Dept Of Communication Systems Engg, Idhaya Engg. College For Women, 2 Asst Prof, Dept Of Communication
More informationLow Power Floating-Point Multiplier Based On Vedic Mathematics
Low Power Floating-Point Multiplier Based On Vedic Mathematics K.Prashant Gokul, M.E(VLSI Design), Sri Ramanujar Engineering College, Chennai Prof.S.Murugeswari., Supervisor,Prof.&Head,ECE.,SREC.,Chennai-600
More informationVHDL IMPLEMENTATION OF A FLEXIBLE AND SYNTHESIZABLE FFT PROCESSOR
VHDL IMPLEMENTATION OF A FLEXIBLE AND SYNTHESIZABLE FFT PROCESSOR 1 Gatla Srinivas, 2 P.Masthanaiah, 3 P.Veeranath, 4 R.Durga Gopal, 1,2[ M.Tech], 3 Associate Professor, J.B.R E.C, 4 Associate Professor,
More informationENT 315 Medical Signal Processing CHAPTER 3 FAST FOURIER TRANSFORM. Dr. Lim Chee Chin
ENT 315 Medical Signal Processing CHAPTER 3 FAST FOURIER TRANSFORM Dr. Lim Chee Chin Outline Definition and Introduction FFT Properties of FFT Algorithm of FFT Decimate in Time (DIT) FFT Steps for radix
More informationUnlabeled equivalence for matroids representable over finite fields
Unlabeled equivalence for matroids representable over finite fields November 16, 2012 S. R. Kingan Department of Mathematics Brooklyn College, City University of New York 2900 Bedford Avenue Brooklyn,
More informationEnergy Optimizations for FPGA-based 2-D FFT Architecture
Energy Optimizations for FPGA-based 2-D FFT Architecture Ren Chen and Viktor K. Prasanna Ming Hsieh Department of Electrical Engineering University of Southern California Ganges.usc.edu/wiki/TAPAS Outline
More information