FFT ALGORITHMS FOR MULTIPLY-ADD ARCHITECTURES


FRANCHETTI Franz (AUT), KALTENBERGER Florian (AUT), UEBERHUBER Christoph W. (AUT)

Abstract. FFTs are among the most important algorithms in science and engineering. To utilize current hardware architectures, special techniques are needed. This paper introduces newly developed radix-2^n FFT kernels that efficiently take advantage of fused multiply-add (FMA) instructions. On a processor that provides FMA instructions, the new radix-2^n kernels reduce the number of required twiddle factors from 2^n − 1 to n compared to conventional radix-2^n FFT kernels. This reduction is accomplished by a hidden computation of the twiddle factors (instead of accessing a twiddle factor array), hiding the additional arithmetic operations in FMA instructions. The new FFT kernels are fully compatible with conventional FFT kernels and can therefore easily be incorporated into existing FFT software, such as Fftw or Spiral.

1 Introduction

In the computation of an FFT, memory accesses dominate the runtime. There are numerous FFT algorithms with identical arithmetic complexity whose memory access patterns differ significantly; accordingly, their runtime behavior differs tremendously. State of the art in FFT software are codes like Spiral (Moura et al. [13]) or Fftw (Frigo and Johnson [4]) that are able to adapt themselves automatically to the given memory hierarchy and hardware features in the most efficient way (Fabianek et al. [2]).

In this paper the memory access bottleneck is attacked by utilizing the parallelism inherent in FMA instructions to reduce the number of memory accesses. This efficiency enhancement is accomplished by a hidden computation of the twiddle factors (instead of accessing a twiddle factor array): the additional arithmetic operations are absorbed into FMA instructions, thereby increasing the utilization of FMA instructions.

The goal of earlier work (Linzer and Feig [10, 11], Goedecker [6], Karner et al. [9]) was to reduce the arithmetic complexity of FFT kernels; no attempt was made to reduce the number of memory accesses.

The newly developed FFT kernels introduced in this paper require only n twiddle factors instead of the 2^n − 1 twiddle factors required by conventional radix-2^n kernels. Contrary to existing FMA-optimized FFT kernels, the new kernels are fully compatible with conventional FFT kernels, since there is no need to scale the twiddle factors; they can therefore easily be incorporated into existing FFT software. This feature was tested by installing the new kernels in Fftw, currently the most sophisticated and fastest FFT package (Karner et al. [8]).

2 Fused Multiply-Add Instructions

Fused multiply-add (FMA) instructions perform the ternary operation ±a ± b·c in the same amount of time as needed for a single floating-point addition or multiplication. Let π_fma denote the FMA arithmetic complexity, i.e., the number of floating-point operations (additions, multiplications, and multiply-add operations) needed to perform a specific numerical task on a processor equipped with FMA instructions.

It is useful to introduce the term FMA utilization to characterize the degree to which an algorithm takes advantage of multiply-add instructions. FMA utilization is given by

    F := (π_R − π_fma) / π_fma · 100 [%],    (1)

where π_R denotes the real arithmetic complexity. F is the percentage of floating-point operations performed by multiply-add instructions.

Some compilers are able to extract FMA operations from given source code and produce FMA-optimized code. However, compilers are not able to deal with all performance issues, because not all of the necessary information is available at compile time. Therefore the underlying algorithm has to be dealt with on a more abstract level. The main result of this section is that certain matrix computations occurring in FFT algorithms can be optimized with respect to FMA utilization. The rules introduced in the following can be incorporated into self-adapting numerical digital signal processing software like Spiral.

Algorithms for the matrix-vector multiplication Ab, where A has a special structure, can be carried out using FMA instructions only (F = 100 %), i.e., such computations are FMA optimized. The following lemmas characterize some special FMA-optimized computations (Franchetti et al. [3]).

Lemma 2.1 For the complex matrices

    A = [±1 ±c; 0 1],  [0 1; ±1 ±c],  [±c ±1; 0 1],  [0 1; ±c ±1],    (2)

the calculation of Ab for b ∈ C² requires only two FMAs (π_fma = 2, F = 100 %) if c is a trivial twiddle factor (i.e., c ∈ R or c ∈ iR). Otherwise the computation of Ab requires four FMAs (π_fma = 4, F = 100 %).

Corollary 2.2 If A, A_1, A_2, ..., A_r can be FMA optimized, then matrices of the form ⊕_{i=1}^{r} A_i, I_m ⊗ A ⊗ I_n, and the matrix conjugate L^{mn}_m A L^{mn}_n, where L^{mn}_n and L^{mn}_m = (L^{mn}_n)^{−1} are stride permutation matrices (Van Loan [7]), can also be FMA optimized.

Multiplications by non-trivial twiddle factors can also be FMA optimized.

Lemma 2.3 A complex multiplication by ω ∈ C with |ω| = 1 requires only three FMA instructions (π_fma = 3, F = 100 %).
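The lemmas state operation counts without giving the constructions here. As a concrete illustration (a minimal sketch; the function names are ours and the C99 fma() intrinsic is assumed to map to a hardware FMA instruction), the following shows the complex FMA y1 = b1 + c·b2 behind Lemma 2.1 realized with four real FMAs, and one classical way to meet the count of Lemma 2.3: multiplication by ω = e^{iθ} with |ω| = 1 is a plane rotation, which factors into three shears with precomputed constants t = tan(θ/2) and s = sin θ.

    #include <math.h>

    /* Lemma 2.1, non-trivial c: y1 = b1 + c*b2 (y2 = b2 costs nothing),
     * i.e. one complex FMA realized as four real FMA instructions. */
    static void complex_fma(double b1r, double b1i, double b2r, double b2i,
                            double cr, double ci,   /* c = cr + i*ci */
                            double *y1r, double *y1i)
    {
        double t1 = fma(cr, b2r, b1r);   /* b1r + cr*b2r */
        *y1r = fma(-ci, b2i, t1);        /*     - ci*b2i */
        double t2 = fma(cr, b2i, b1i);   /* b1i + cr*b2i */
        *y1i = fma(ci, b2r, t2);         /*     + ci*b2r */
    }

    /* Lemma 2.3: y = w*x with |w| = 1, w = exp(i*theta), in three real
     * FMAs via three shears. t = tan(theta/2) and s = sin(theta) are
     * precomputed twiddle data, so they cost no arithmetic at run time. */
    static void unit_cmul(double xr, double xi, double t, double s,
                          double *yr, double *yi)
    {
        double u = fma(-t, xi, xr);      /* 1st shear: u  = xr - t*xi */
        double v = fma( s, u,  xi);      /* 2nd shear: v  = xi + s*u  */
        *yr = fma(-t, v, u);             /* 3rd shear: yr = u  - t*v  */
        *yi = v;
    }

Expanding the shears gives yr = cos θ·xr − sin θ·xi and yi = sin θ·xr + cos θ·xi, i.e., exactly the complex product, with every operation a multiply-add (F = 100 %).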

3 FMA Optimized FFT Kernels

The discrete Fourier transform y ∈ C^N of an input vector x ∈ C^N is defined by the matrix-vector product y := F_N x, where F_N = [ω_N^{jk}]_{j,k=0,1,...,N−1} with ω_N = e^{−2πi/N} is the Fourier transform matrix. The basis of the fast Fourier transform (FFT) is the Cooley-Tukey radix-p splitting: for N = pq it holds that

    F_N = (F_p ⊗ I_q) T_q^{pq} (I_p ⊗ F_q) L_p^{pq},    (3)

where

    T_q^{pq} = ⊕_{i=0}^{p−1} ⊕_{j=0}^{q−1} ω_{pq}^{ij} = ⊕_{i=0}^{p−1} diag(1, ω_{pq}, ..., ω_{pq}^{q−1})^i    (4)

is the twiddle factor matrix and L_p^{pq} is the stride-by-p permutation matrix (Van Loan [7]).

In Fftw the computation of an FFT is performed by so-called codelets, i.e., small pieces of highly optimized machine-generated code. There are two types of codelets: twiddle codelets are used to perform multiplications by twiddle factors and to compute in-place FFTs, whereas no-twiddle codelets are used to compute out-of-place FFTs. (i) Twiddle codelets of size p compute

    y := (F_p ⊗ I_q) T_q^{pq} y,    (5)

and (ii) no-twiddle codelets of size q perform an out-of-place computation of y := F_q x.

If N = N_1 N_2 ··· N_n, then the splitting (3) can be applied recursively and the fast Fourier transform algorithm (with O(N ln N) complexity) is obtained. Every factorization of N leads to an FFT program with a different memory access pattern and hence a different runtime behavior. Fftw tries to find the optimum algorithm for a given machine by applying, for instance, dynamic programming (Fabianek et al. [2]).

A no-twiddle codelet of size q can be FMA optimized with existing compiler techniques [5, 9]. A twiddle codelet, on the other hand, cannot be optimized at compile time, since the twiddle factors can change from call to call. However, with some formula manipulation and the tools developed in the previous section it is possible to achieve full FMA utilization even in twiddle codelets. A twiddle codelet can be written as

    (F_p ⊗ I_q) T_q^{pq} = L_p^{pq} (I_q ⊗ F_p) (L_q^{pq} L_p^{pq}) T_p^{pq} L_q^{pq},    with L_q^{pq} L_p^{pq} = I_{pq},
                         = L_p^{pq} [ ⊕_{j=0}^{q−1} F_p diag(1, ω_{pq}^j, ..., ω_{pq}^{(p−1)j}) ] L_q^{pq}.

A factor of the form

    F_p diag(1, ω^j, ..., ω^{(p−1)j})    (6)

is called an elementary radix-p FFT kernel. In the following, the kernel (6) will be FMA optimized; then, by applying Corollary 2.2, the twiddle codelet (5) can be FMA optimized as well.
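For reference, the semantics of a radix-2 twiddle codelet, i.e., eq. (5) with p = 2, can be sketched in scalar C as follows (Fftw's actual codelets are machine-generated and far more optimized; the names and the twiddle array layout w[j] = ω_{2q}^j are our assumptions):

    #include <complex.h>

    /* y := (F_2 ⊗ I_q) T_q^{2q} y, computed in place on y of length 2q.
     * w[j] = exp(-2*pi*I*j/(2*q)) is the precomputed twiddle array. */
    void twiddle_codelet_radix2(double complex *y, int q,
                                const double complex *w)
    {
        for (int j = 0; j < q; ++j) {
            double complex t = w[j] * y[q + j]; /* twiddle scaling T_q^{2q} */
            double complex a = y[j];
            y[j]     = a + t;                   /* butterfly F_2 ⊗ I_q */
            y[q + j] = a - t;
        }
    }

A kernel of this conventional form loads one twiddle factor per butterfly from w; the method developed next removes most of these loads.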

Usually the twiddle factors are pre-computed, stored in an array, and accessed during the FFT computation. In the following a method is presented to reduce the number of memory accesses by a hidden on-line computation of the required twiddle factors. First, the method is illustrated by means of a radix-8 kernel computation.

FMA Optimized Radix-8 Kernel. The radix-8 kernel computation is

    y := F_8 diag(1, ω, ω², ω³, ω⁴, ω⁵, ω⁶, ω⁷) x.    (7)

Using a radix-2 decimation-in-frequency (DIF) factorization, (7) can be written as

    y := R_8 (I_4 ⊗ F_2) D_3 T_2 (I_2 ⊗ F_2 ⊗ I_2) D_2 T_1 (F_2 ⊗ I_4) D_1 x,    (8)

with

    D_1 = diag(1, 1, 1, 1, ω⁴, ω⁴, ω⁴, ω⁴),
    D_2 = diag(1, 1, ω², ω², 1, 1, ω², ω²),
    D_3 = diag(1, ω, 1, ω, 1, ω, 1, ω),
    T_1 = diag(1, 1, 1, 1, 1, c_1, −i, c_2),    (9)
    T_2 = diag(1, 1, 1, −i, 1, 1, 1, −i),
    c_1 = (1 − i)/√2,  c_2 = −(1 + i)/√2.

R_8 is the bit-reversal permutation matrix (Van Loan [7]).

Arithmetic Complexity. The heart of the FMA-optimized algorithm is the factorization of the initial twiddle factor scaling matrix diag(1, ω, ω², ..., ω⁷) into D_1 D_2 D_3. The matrices D_1, D_2, and D_3 are distributed over the three stages of the radix-8 kernel computation (8). To fully utilize FMA instructions in computing the first stage, T_1 (F_2 ⊗ I_4) D_1 x, the following factorization is applied (Meyer and Schwarz [12]):

    (F_2 ⊗ I_4) D_1 = (F_2 ⊗ I_4)(diag(1, ω⁴) ⊗ I_4) = ( [2 −1; 0 1] [1 0; 1 −ω⁴] ) ⊗ I_4.    (10)

Now, by Lemma 2.1 and Corollary 2.2, the matrix-vector product (F_2 ⊗ I_4) D_1 x requires 24 FMA operations (π_fma = 24, F = 100 %). T_1 contains the two non-trivial twiddle factors c_1 and c_2. Therefore, by Lemma 2.3, scaling by T_1 can be carried out with 6 FMA instructions (π_fma = 6, F = 100 %). Factorizations similar to (10) for stages two and three lead to an arithmetic complexity of 78 (= 3·24 + 6) FMA instructions (π_fma = 78, F = 100 %) for a radix-8 FFT kernel. From (9) it can be seen that only the three twiddle factors ω⁴, ω², and ω are required.

FMA Optimized Radix-2^n FFT Kernels. Using a radix-2 DIF factorization, the radix-2^n kernel can be written as

    F_{2^n} diag(1, ω, ..., ω^{2^n−1}) = R_{2^n} ∏_{i=n}^{1} [ (I_{2^{i−1}} ⊗ T_{2^{n−i}}^{2^{n−i+1}}) (I_{2^{i−1}} ⊗ F_2 ⊗ I_{2^{n−i}}) D_i ],    (11)

    T_{2^{n−i}}^{2^{n−i+1}} = I_{2^{n−i}} ⊕ diag(1, ω_{2^{n−i+1}}, ..., ω_{2^{n−i+1}}^{2^{n−i}−1}),    (12)

    D_i = I_{2^{i−1}} ⊗ diag(1, ω^{2^{n−i}}) ⊗ I_{2^{n−i}}.    (13)

R_{2^n} is the bit-reversal permutation matrix (Van Loan [7]). Equation (13) shows that only the n twiddle factors ω^{2^{n−1}}, ..., ω², ω are required in the radix-2^n kernel. Applying factorization (10) and Lemma 2.3 leads to the arithmetic complexity

    π_fma = 4.5·n·2^n − 9·2^{n−1} + 6.    (14)
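To make the count in (10) concrete, here is a possible scalar rendition of the first stage of the FMA-optimized radix-8 kernel (a sketch under our naming; w4 stands for ω⁴, the only twiddle factor this stage touches, and C99 fma() is again assumed to compile to a hardware FMA):

    #include <math.h>

    typedef struct { double re, im; } cplx;

    /* First stage, eq. (10): (F_2 ⊗ I_4) D_1 = ([2 −1; 0 1][1 0; 1 −w4]) ⊗ I_4.
     * Per pair: v = x[k] - w4*x[k+4] is one complex FMA (4 real FMAs) and
     * y[k] = 2*x[k] - v costs 2 real FMAs (the factor 2 is trivial), so the
     * stage takes 6 FMAs per pair, 24 in total, as stated in the text.
     * Note y[k] = 2*x[k] - v = x[k] + w4*x[k+4]: the usual DIF butterfly,
     * merely rewritten so that every operation is a multiply-add. */
    void radix8_stage1(cplx y[8], const cplx x[8], cplx w4)
    {
        for (int k = 0; k < 4; ++k) {
            cplx a = x[k], b = x[k + 4], v;
            double t1 = fma(-w4.re, b.re, a.re);
            v.re      = fma( w4.im, b.im, t1);   /* v = a - w4*b */
            double t2 = fma(-w4.re, b.im, a.im);
            v.im      = fma(-w4.im, b.re, t2);
            y[k].re = fma(2.0, a.re, -v.re);     /* y[k] = 2a - v */
            y[k].im = fma(2.0, a.im, -v.im);
            y[k + 4] = v;
        }
    }

Stages two and three follow the same pattern with ω² and ω, and the non-trivial entries c_1 and c_2 of T_1 are handled by the three-FMA multiplication of Lemma 2.3, which is how the total of 78 = 3·24 + 6 FMA instructions arises.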

                 conventional kernels           FMA optimized kernels
    Radix    TF   Loads   π_fma   FMA util.   TF   Loads   π_fma   FMA util.
      2       1       6       8      25 %      1       6       6     100 %
      4       3      14      28      21 %      2      12      24     100 %
      8       7      30      84      17 %      3      22      78     100 %
     16      15      62     220      17 %      4      40     222     100 %
     32      31     126     548      17 %      5      74     582     100 %

Table 1: Operation counts (number TF of required twiddle factors, number of load operations, arithmetic complexity π_fma, and FMA utilization F) for radix-2, -4, -8, -16, and -32 kernels.

4 Runtime Experiments

Numerical experiments were performed on one processor of an SGI PowerChallenge XL R10000. In a first series of experiments, FFT algorithms using the newly developed FMA-optimized radix-8 FFT kernel were compared with FFT algorithms using a conventional radix-8 kernel in a triple-loop Cooley-Tukey framework. In the experiments, cycles and L2 data cache misses were counted using the on-chip performance monitor counters of the MIPS R10000 processor (Auer et al. [1]). Run times were derived from the cycle counts.

The superscalar architecture of the R10000 processor is able to execute up to four instructions per cycle. Since the R10000 can execute one load or store and one multiply-add instruction per cycle, the compiler can generate an instruction schedule that overlaps memory references with floating-point operations. This is the reason why the lower instruction count of the FMA-optimized radix-8 kernel does not result in a significant speed-up for lengths N ≤ 2^9.

The speed-up in execution time compared to a conventional radix-8 kernel can be seen in Fig. 1. For N ≤ 2^15 the speed-up is due to the lower number of primary data cache misses. For N > 2^15, when secondary data cache misses make memory accesses very costly, the benefit of the new radix-8 kernel, i.e., fewer memory accesses leading to fewer secondary data cache misses, results in a substantial speed-up in execution time.

[Figure 1 comprises two plots over vector lengths 2^6 to 2^24: the speed-up (0 % to 50 %) and the normalized L2 cache misses (0 to 0.7) of the FMA-optimized versus the conventional radix-8 kernel.]

Figure 1: Speed-up in minimum execution time and normalized number M_2/(N log_2 N) of L2 cache misses of the FMA-optimized radix-8 FFT algorithm compared to a conventional radix-8 FFT algorithm on one processor of an SGI PowerChallenge XL.

An advantage of the FFT kernels presented in this paper is their compatibility with conventional kernels, making it easy to incorporate them into existing FFT software. This feature has been demonstrated by incorporating the multiply-add optimized radix-8 kernel into Fftw (Karner et al. [8]). The installation of the FMA-optimized radix-8 kernel in Fftw did not cause any difficulty. For transform lengths N ≥ 2^20 a speed-up of more than 30 % was achieved.

5 Conclusion

The advantages of the multiply-add optimized FFT kernels presented in this paper are their low number of required load operations, their high efficiency on performance-oriented computer systems, and their striking simplicity. Current work is the integration of the method presented above into Fftw's kernel generator genfft (Frigo and Kral [5]) to generate FMA-optimized radix-16, -32, and -64 kernels.

6 References

1. Auer, M., et al.: Performance Evaluation of FFT Algorithms Using Performance Counters. Tech. Rep. AURORA TR1998-20, Vienna University of Technology, 1998.
2. Fabianek, C., et al.: Survey of Self-adapting FFT Software. Tech. Rep. AURORA TR2002-01, Vienna University of Technology, 2002.
3. Franchetti, F., et al.: FMA Optimized FFT Codelets. Tech. Rep. AURORA TR1998-28, Vienna University of Technology, 2002.
4. Frigo, M., Johnson, S. G.: FFTW: An Adaptive Software Architecture for the FFT. In: Proc. ICASSP, pp. 1381–1384, 1998.
5. Frigo, M., Kral, S.: The Advanced FFT Program Generator GENFFT. Tech. Rep. AURORA TR2001-03, Vienna University of Technology, 2001.
6. Goedecker, S.: Fast Radix 2, 3, 4, and 5 Kernels for Fast Fourier Transformations on Computers with Overlapping Multiply-Add Instructions. SIAM J. Sci. Comput. 18, pp. 1605–1611, 1997.
7. Golub, G. H., Van Loan, C. F.: Matrix Computations, 2nd edn. Johns Hopkins University Press, Baltimore, 1989.
8. Karner, H., et al.: Accelerating FFTW by Multiply-Add Optimization. Tech. Rep. AURORA TR1999-13, Vienna University of Technology, 1999.
9. Karner, H., et al.: Multiply-Add Optimized FFT Kernels. Math. Models and Methods in Appl. Sci. 11, pp. 105–117, 2001.
10. Linzer, E. N., Feig, E.: Implementation of Efficient FFT Algorithms on Fused Multiply-Add Architectures. IEEE Trans. Signal Processing 41, pp. 93–107, 1993.
11. Linzer, E. N., Feig, E.: Modified FFTs for Fused Multiply-Add Architectures. Math. Comp. 60, pp. 347–361, 1993.
12. Meyer, R., Schwarz, K.: FFT Implementation on DSP Chips: Theory and Practice. In: Proc. ICASSP, pp. 1503–1506, 1990.
13. Pueschel, M., et al.: SPIRAL: A Generator for Platform-Adapted Libraries of Signal Processing Algorithms. Journal of High Performance Computing and Applications, submitted.

Acknowledgement

This work is the result of a cooperation with our colleague and friend Herbert Karner, who passed away in October 2001. The project was supported by the Special Research Program SFB F011 AURORA of the Austrian Science Fund FWF.

Current Address

Institute for Applied Mathematics and Numerical Analysis, Vienna University of Technology, Wiedner Hauptstrasse 8-10, A-1040 Wien, Austria, Tel.: +43-1-58801-11512, Email: {franz,kalti,christof}@aurora.anum.tuwien.ac.at
