FFT ALGORITHMS FOR MULTIPLY-ADD ARCHITECTURES

FRANCHETTI Franz (AUT), KALTENBERGER Florian (AUT), UEBERHUBER Christoph W. (AUT)

Abstract. FFTs are among the most important algorithms in science and engineering. To utilize current hardware architectures, special techniques are needed. This paper introduces newly developed radix-$2^n$ FFT kernels that efficiently take advantage of fused multiply-add (FMA) instructions. If a processor provides FMA instructions, the new radix-$2^n$ kernels reduce the number of required twiddle factors from $2^n - 1$ to $n$ compared with conventional radix-$2^n$ FFT kernels. This reduction is accomplished by a hidden computation of the twiddle factors (instead of accessing a twiddle factor array), hiding the additional arithmetic operations in FMA instructions. The new FFT kernels are fully compatible with conventional FFT kernels and can therefore easily be incorporated into existing FFT software such as Fftw or Spiral.

1 Introduction

In the computation of an FFT, memory accesses dominate the runtime. There are numerous FFT algorithms with identical arithmetic complexity whose memory access patterns, and hence runtimes, differ significantly. The state of the art in FFT software are codes like Spiral (Moura et al. [13]) or Fftw (Frigo and Johnson [4]), which adapt themselves automatically and efficiently to given memory and hardware features (Fabianek et al. [2]). In this paper the memory access bottleneck is attacked by utilizing the parallelism inherent in FMA instructions to reduce the number of memory accesses. This efficiency enhancement is accomplished by a hidden computation of the twiddle factors (instead of accessing a twiddle factor array), achieved by including the additional arithmetic operations in FMA instructions and thereby increasing FMA utilization.
The goal of earlier works (Linzer and Feig [10, 11], Goedecker [6], Karner et al. [9]) was to reduce the arithmetic complexity of FFT kernels. No attempt was made to reduce
the number of memory accesses. The newly developed FFT kernels introduced in this paper require only $n$ twiddle factors instead of the $2^n - 1$ twiddle factors required by conventional radix-$2^n$ kernels. Contrary to existing FMA optimized FFT kernels, the newly developed kernels are fully compatible with conventional FFT kernels, since there is no need to scale the twiddle factors, and can therefore easily be incorporated into existing FFT software. This feature was tested by installing the new kernels into Fftw, currently the most sophisticated and fastest FFT package (Karner et al. [8]).

2 Fused Multiply-Add Instructions

Fused multiply-add (FMA) instructions perform the ternary operation $\pm a \pm b\,c$ in the same amount of time as a single floating-point addition or multiplication. Let $\pi_{\mathrm{fma}}$ denote the FMA arithmetic complexity, i.e., the number of floating-point operations (additions, multiplications, and multiply-add operations) needed to perform a specific numerical task on a processor equipped with FMA instructions. It is useful to introduce the term FMA utilization to characterize the degree to which an algorithm takes advantage of multiply-add instructions. FMA utilization is given by

$$F := \frac{\pi_R - \pi_{\mathrm{fma}}}{\pi_{\mathrm{fma}}} \cdot 100\ [\%], \qquad (1)$$

where $\pi_R$ denotes the real arithmetic complexity. $F$ is the percentage of floating-point instructions that are multiply-add instructions. Some compilers are able to extract FMA operations from given source code and produce FMA optimized code. However, compilers are not able to deal with all performance issues, because not all of the necessary information is available at compile time. Therefore the underlying algorithm has to be dealt with on a more abstract level. The main result of this section is that certain matrix computations occurring in FFT algorithms can be optimized with respect to FMA utilization.
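As a small illustration of definition (1), the utilization figure for the conventional radix-2 kernel in Table 1 can be reproduced. This is a sketch, not part of the paper: the count $\pi_R = 10$ for a conventional radix-2 step (one complex twiddle multiplication at 4 multiplications plus 2 additions, followed by a complex butterfly at 4 real additions) is a standard operation count assumed here, and the helper name is illustrative.

```python
def fma_utilization(pi_r, pi_fma):
    """FMA utilization F of definition (1), in percent."""
    return (pi_r - pi_fma) / pi_fma * 100.0

# Conventional radix-2 kernel: one complex twiddle multiplication
# (4 mult + 2 add = 6 real operations) plus the complex butterfly
# (4 real additions) gives pi_R = 10; Table 1 lists pi_fma = 8.
print(fma_utilization(10, 8))  # 25.0, matching the 25 % entry in Table 1
```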
The rules introduced in the following can be incorporated into self-adapting numerical digital signal processing software like Spiral. Algorithms for matrix-vector multiplication $Ab$, where $A$ has a special structure, can be carried out using FMA instructions only ($F = 100\,\%$), i.e., such computations are FMA optimized. The following lemmas characterize some special FMA optimized computations (Franchetti et al. [3]).

Lemma 2.1 For the complex matrices

$$A = \begin{pmatrix} \pm 1 & \pm c \\ 0 & 1 \end{pmatrix},\ \begin{pmatrix} 0 & 1 \\ \pm 1 & \pm c \end{pmatrix},\ \begin{pmatrix} \pm c & \pm 1 \\ 0 & 1 \end{pmatrix},\ \begin{pmatrix} 0 & 1 \\ \pm c & \pm 1 \end{pmatrix}, \qquad (2)$$

the calculation of $Ab$ for $b \in \mathbb{C}^2$ requires only two FMAs ($\pi_{\mathrm{fma}} = 2$, $F = 100\,\%$) if $c$ is a trivial twiddle factor (i.e., $c \in \mathbb{R}$ or $c \in i\mathbb{R}$). Otherwise the computation of $Ab$ requires four FMAs ($\pi_{\mathrm{fma}} = 4$, $F = 100\,\%$).

Corollary 2.2 If $A, A_1, A_2, \ldots, A_r$ can be FMA optimized, then matrices of the form $\bigoplus_{i=1}^{r} A_i$, $I_m \otimes A \otimes I_n$, and the matrix conjugate $L^{mn}_m A L^{mn}_n$, where $L^{mn}_n$ and $L^{mn}_m = (L^{mn}_n)^{-1}$ are stride permutation matrices (Van Loan [7]), can also be FMA optimized.

Multiplications by non-trivial twiddle factors can also be FMA optimized.
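Lemma 2.1 can be made concrete with a minimal sketch for the first matrix shape, $A = \begin{pmatrix} 1 & c \\ 0 & 1 \end{pmatrix}$, keeping real and imaginary parts separate as a kernel would. The `fma` helper below is a stand-in for the hardware instruction (Python gains a true `math.fma` only in 3.13); the function name is illustrative, not from the paper.

```python
def fma(x, y, z):
    # Stand-in for a fused multiply-add instruction: x*y + z in one operation.
    return x * y + z

def apply_upper(c, b0, b1):
    """y = A b for A = [[1, c], [0, 1]], with (re, im) pairs for each entry.

    Trivial twiddle factor (c real): two FMAs.  Non-trivial c: four FMAs.
    """
    (b0r, b0i), (b1r, b1i) = b0, b1
    cr, ci = c
    if ci == 0.0:                      # trivial twiddle factor: 2 FMAs
        y0 = (fma(cr, b1r, b0r), fma(cr, b1i, b0i))
    else:                              # non-trivial twiddle factor: 4 FMAs
        y0 = (fma(cr, b1r, fma(-ci, b1i, b0r)),
              fma(cr, b1i, fma(ci, b1r, b0i)))
    return y0, (b1r, b1i)              # second component is a copy, no flops

# Check against plain complex arithmetic.
c, b0, b1 = 0.6 + 0.8j, 1.0 + 2.0j, 3.0 - 1.0j
y0, y1 = apply_upper((c.real, c.imag), (b0.real, b0.imag), (b1.real, b1.imag))
assert abs(complex(*y0) - (b0 + c * b1)) < 1e-12 and complex(*y1) == b1
```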
Lemma 2.3 A complex multiplication by $\omega \in \mathbb{C}$ with $|\omega| = 1$ requires only three FMA instructions ($\pi_{\mathrm{fma}} = 3$, $F = 100\,\%$).

3 FMA Optimized FFT Kernels

The discrete Fourier transform $y \in \mathbb{C}^N$ of an input vector $x \in \mathbb{C}^N$ is defined by the matrix-vector product $y := F_N x$, where $F_N = [\omega_N^{jk}]_{j,k=0,1,\ldots,N-1}$ with $\omega_N = e^{-2\pi i/N}$ is the Fourier transform matrix. The basis of the fast Fourier transform (FFT) is the Cooley-Tukey radix-$p$ splitting. For $N = pq$ it holds that

$$F_N = (F_p \otimes I_q)\, T^{pq}_q\, (I_p \otimes F_q)\, L^{pq}_p, \qquad (3)$$

where

$$T^{pq}_q = \bigoplus_{i=0}^{p-1} \bigoplus_{j=0}^{q-1} \omega_{pq}^{ij} = \bigoplus_{i=0}^{p-1} \operatorname{diag}(1, \omega_{pq}, \ldots, \omega_{pq}^{q-1})^{i} \qquad (4)$$

is the twiddle factor matrix and $L^{pq}_p$ is the stride-by-$p$ permutation matrix (Van Loan [7]). In Fftw the computation of an FFT is performed by so-called codelets, i.e., small pieces of highly optimized machine generated code. There are two types of codelets: twiddle codelets are used to perform multiplications by twiddle factors and to compute in-place FFTs, whereas no-twiddle codelets are used to compute out-of-place FFTs. (i) Twiddle codelets of size $p$ compute

$$y := (F_p \otimes I_q)\, T^{pq}_q\, y, \qquad (5)$$

and (ii) no-twiddle codelets of size $q$ perform an out-of-place computation of $y := F_q x$. If $N = N_1 N_2 \cdots N_n$, then the splitting (3) can be applied recursively and the fast Fourier transform algorithm (with $O(N \log N)$ complexity) is obtained. Every factorization of $N$ leads to an FFT program with a different memory access pattern and hence a different runtime behavior. Fftw tries to find the optimal algorithm for a given machine by applying, for instance, dynamic programming (Fabianek et al. [2]). A no-twiddle codelet of size $q$ can be FMA optimized with existing compiler techniques [5, 9]. A twiddle codelet, on the other hand, cannot be optimized at compile time, since the twiddle factors can change from call to call. However, with some formula manipulation and the tools developed in the previous section it is possible to get full FMA utilization even in twiddle codelets.
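The splitting (3) can be checked numerically for a small size. The sketch below builds the right-hand side of (3) for $N = 6$, $p = 2$, $q = 3$ and compares it with the direct DFT; the helper names and the stride-permutation convention (gather at stride $p$) are illustrative assumptions, not taken from the paper.

```python
import cmath

def dft(n):
    # Fourier transform matrix F_n with w_n = exp(-2*pi*i/n).
    w = cmath.exp(-2j * cmath.pi / n)
    return [[w ** (j * k) for k in range(n)] for j in range(n)]

def eye(n):
    return [[float(i == j) for j in range(n)] for i in range(n)]

def kron(a, b):
    # Kronecker (tensor) product of two square matrices.
    rb = len(b)
    return [[a[i // rb][j // rb] * b[i % rb][j % rb]
             for j in range(len(a) * rb)] for i in range(len(a) * rb)]

def matvec(a, x):
    return [sum(r[k] * x[k] for k in range(len(x))) for r in a]

def stride_perm(p, q, x):
    # L_p^{pq}: gather x at stride p, i.e. (x0, xp, x2p, ..., x1, x_{p+1}, ...).
    return [x[b * p + a] for a in range(p) for b in range(q)]

def twiddle(p, q):
    # Diagonal of T_q^{pq} per (4): block i is diag(1, w, ..., w^{q-1})^i.
    w = cmath.exp(-2j * cmath.pi / (p * q))
    return [w ** (i * j) for i in range(p) for j in range(q)]

p, q = 2, 3
N = p * q
x = [complex(k + 1, -k) for k in range(N)]

# Right-hand side of (3): (F_p (x) I_q) T_q^{pq} (I_p (x) F_q) L_p^{pq} x.
y = stride_perm(p, q, x)
y = matvec(kron(eye(p), dft(q)), y)
y = [t * v for t, v in zip(twiddle(p, q), y)]
y = matvec(kron(dft(p), eye(q)), y)

direct = matvec(dft(N), x)            # y := F_N x computed directly
assert all(abs(a - b) < 1e-9 for a, b in zip(y, direct))
```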
A twiddle codelet can be written as

$$(F_p \otimes I_q)\, T^{pq}_q = L^{pq}_p\, (I_q \otimes F_p)\, \underbrace{L^{pq}_q\, T^{pq}_q\, L^{pq}_p}_{=\,T^{pq}_p}\, L^{pq}_q = L^{pq}_p \left[ \bigoplus_{j=0}^{q-1} F_p \operatorname{diag}\big(1, \omega_{pq}^{j}, \ldots, \omega_{pq}^{(p-1)j}\big) \right] L^{pq}_q,$$

using $L^{pq}_p L^{pq}_q = I_{pq}$. A factor of the form

$$F_p \operatorname{diag}\big(1, \omega^j, \ldots, \omega^{(p-1)j}\big) \qquad (6)$$

is called an elementary radix-$p$ FFT kernel. In the following the kernel (6) will be FMA optimized; by applying Corollary 2.2, the twiddle codelet (5) can then be FMA optimized as well. Usually the twiddle factors are pre-computed, stored in an array, and accessed during the FFT computation. In the following a method is presented to reduce the
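The manipulation above rests on the commutation identity $F_p \otimes I_q = L^{pq}_p (I_q \otimes F_p) L^{pq}_q$. A quick numerical check, under the same assumed stride-permutation convention as before (row $aq + b$ of $L^{pq}_p$ picks input index $bp + a$), is:

```python
import cmath

def dft(n):
    w = cmath.exp(-2j * cmath.pi / n)
    return [[w ** (j * k) for k in range(n)] for j in range(n)]

def eye(n):
    return [[float(i == j) for j in range(n)] for i in range(n)]

def kron(a, b):
    rb = len(b)
    return [[a[i // rb][j // rb] * b[i % rb][j % rb]
             for j in range(len(a) * rb)] for i in range(len(a) * rb)]

def stride(p, q):
    # Permutation matrix L_p^{pq}: row a*q+b picks input index b*p+a.
    m = [[0.0] * (p * q) for _ in range(p * q)]
    for a in range(p):
        for b in range(q):
            m[a * q + b][b * p + a] = 1.0
    return m

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

p, q = 2, 4
lhs = kron(dft(p), eye(q))                        # F_p (x) I_q
rhs = matmul(matmul(stride(p, q), kron(eye(q), dft(p))), stride(q, p))
assert all(abs(lhs[i][j] - rhs[i][j]) < 1e-9
           for i in range(p * q) for j in range(p * q))
```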
number of memory accesses by a hidden on-line computation of the required twiddle factors. First, the method is illustrated by means of a radix-8 kernel computation.

FMA Optimized Radix-8 Kernel. The radix-8 kernel computation is

$$y := F_8 \operatorname{diag}(1, \omega, \omega^2, \omega^3, \omega^4, \omega^5, \omega^6, \omega^7)\, x. \qquad (7)$$

Using a radix-2 decimation-in-frequency (DIF) factorization, (7) can be written as

$$y := R_8 (I_4 \otimes F_2)\, D_3 T_2 (I_2 \otimes F_2 \otimes I_2)\, D_2 T_1 (F_2 \otimes I_4)\, D_1 x, \qquad (8)$$

where

$$\begin{aligned} D_1 &= \operatorname{diag}(1, 1, 1, 1, \omega^4, \omega^4, \omega^4, \omega^4), & T_1 &= \operatorname{diag}(1, 1, 1, 1, 1, c_1, -i, c_2), \\ D_2 &= \operatorname{diag}(1, 1, \omega^2, \omega^2, 1, 1, \omega^2, \omega^2), & T_2 &= \operatorname{diag}(1, 1, 1, -i, 1, 1, 1, -i), \\ D_3 &= \operatorname{diag}(1, \omega, 1, \omega, 1, \omega, 1, \omega), & c_1 &= (1 - i)/\sqrt{2}, \quad c_2 = -(1 + i)/\sqrt{2}, \end{aligned} \qquad (9)$$

and $R_8$ is the bit-reversal permutation matrix (Van Loan [7]).

Arithmetic Complexity. The heart of the FMA optimized algorithm is the factorization of the initial twiddle factor scaling matrix $\operatorname{diag}(1, \omega, \omega^2, \omega^3, \omega^4, \omega^5, \omega^6, \omega^7)$ into $D_1 D_2 D_3$. $D_1$, $D_2$, and $D_3$ are distributed over the three stages of the radix-8 kernel computation (8). To fully utilize FMA instructions for computing the first stage, $T_1 (F_2 \otimes I_4) D_1 x$, the following factorization is applied (Meyer and Schwarz [12]):

$$(F_2 \otimes I_4) D_1 = (F_2 \otimes I_4)\big(\operatorname{diag}(1, \omega^4) \otimes I_4\big) = \left[ \begin{pmatrix} 2 & -1 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 1 & -\omega^4 \end{pmatrix} \right] \otimes I_4. \qquad (10)$$

Now, by Lemma 2.1 and Corollary 2.2 the matrix-vector product $(F_2 \otimes I_4) D_1 x$ requires 24 FMA operations ($\pi_{\mathrm{fma}} = 24$, $F = 100\,\%$). $T_1$ contains the two non-trivial twiddle factors $c_1$ and $c_2$. Therefore, by Lemma 2.3, scaling by $T_1$ can be carried out with 6 FMA instructions ($\pi_{\mathrm{fma}} = 6$, $F = 100\,\%$). Factorizations similar to (10) for stages two and three lead to an arithmetic complexity of 78 ($= 3 \cdot 24 + 6$) FMA instructions ($\pi_{\mathrm{fma}} = 78$, $F = 100\,\%$) for a radix-8 FFT kernel. From (9) it can be seen that only the three twiddle factors $\omega^4$, $\omega^2$, and $\omega$ are required.

FMA Optimized Radix-$2^n$ FFT Kernels.
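The two-factor form of the scaled butterfly $F_2 \operatorname{diag}(1, \omega^4)$ can be verified numerically. Note the hedge: the extraction of this factorization is ambiguous in the source, so the sign placement below is one consistent reconstruction in which each factor matches a shape covered by Lemma 2.1 (two FMAs for the real factor, four for the factor containing $\omega^4$); $\omega$ is given a representative value as it would arise for a radix-8 kernel inside a size-64 transform.

```python
import cmath

w = cmath.exp(-2j * cmath.pi / 64)         # representative kernel twiddle base
w4 = w ** 4

a = [[2.0, -1.0], [0.0, 1.0]]              # real factor: 2 FMAs per component
b = [[1.0, 0.0], [1.0, -w4]]               # twiddle factor: 4 FMAs (non-trivial)

prod = [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
        for i in range(2)]
target = [[1.0, w4], [1.0, -w4]]           # F_2 * diag(1, w^4)
assert all(abs(prod[i][j] - target[i][j]) < 1e-12
           for i in range(2) for j in range(2))
```

In words: the second factor forms $x_0 - \omega^4 x_1$ with one complex FMA sequence, and the first recovers $x_0 + \omega^4 x_1$ as $2x_0 - (x_0 - \omega^4 x_1)$, so the twiddle multiplication never appears as a standalone operation.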
Using a radix-2 DIF factorization, the radix-$2^n$ kernel can be written as

$$F_{2^n} \operatorname{diag}\big(1, \omega, \ldots, \omega^{2^n - 1}\big) = R_{2^n} \prod_{i=n}^{1} \left[ \big(I_{2^{i-1}} \otimes T^{2^{n-i+1}}_{2^{n-i}}\big) \big(I_{2^{i-1}} \otimes F_2 \otimes I_{2^{n-i}}\big)\, D_i \right], \qquad (11)$$

where

$$T^{2^{n-i+1}}_{2^{n-i}} = I_{2^{n-i}} \oplus \operatorname{diag}\big(1, \omega_{2^{n-i+1}}, \ldots, \omega_{2^{n-i+1}}^{2^{n-i} - 1}\big), \qquad (12)$$

$$D_i = I_{2^{i-1}} \otimes \operatorname{diag}\big(1, \omega^{2^{n-i}}\big) \otimes I_{2^{n-i}}, \qquad (13)$$

and $R_{2^n}$ is the bit-reversal permutation matrix (Van Loan [7]). (13) shows that only the $n$ twiddle factors $\omega^{2^{n-1}}, \ldots, \omega^2, \omega$ are required in the radix-$2^n$ kernel. Applying factorization (10) and Lemma 2.3 leads to the arithmetic complexity

$$\pi_{\mathrm{fma}} = 4.5\, n\, 2^n - 9 \cdot 2^{n-1} + 6. \qquad (14)$$
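Formula (14) reproduces the $\pi_{\mathrm{fma}}$ column for the FMA optimized kernels in Table 1, and the load counts follow if one assumes 2 real loads per complex input and per twiddle factor (an interpretation of Table 1, not spelled out in the text):

```python
def pi_fma(n):
    # Arithmetic complexity (14) of the FMA optimized radix-2^n kernel.
    return int(4.5 * n * 2 ** n - 9 * 2 ** (n - 1) + 6)

def loads(n):
    # Assumed load model: 2^n complex inputs and n complex twiddle
    # factors, at 2 real loads each: 2^(n+1) + 2n.
    return 2 ** (n + 1) + 2 * n

# Both sequences match the "FMA optimized kernels" columns of Table 1.
assert [pi_fma(n) for n in range(1, 6)] == [6, 24, 78, 222, 582]
assert [loads(n) for n in range(1, 6)] == [6, 12, 22, 40, 74]
```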
            conventional kernels             FMA optimized kernels
Radix    TF   Loads   pi_fma   F             TF   Loads   pi_fma   F
 2        1      6       8     25 %           1      6       6     100 %
 4        3     14      28     21 %           2     12      24     100 %
 8        7     30      84     17 %           3     22      78     100 %
16       15     62     220     17 %           4     40     222     100 %
32       31    126     548     17 %           5     74     582     100 %

Table 1: Operation counts (number TF of required twiddle factors, number of load operations, arithmetic complexity $\pi_{\mathrm{fma}}$, and FMA utilization $F$) for radix-2, -4, -8, -16, and -32 kernels.

4 Runtime Experiments

Numerical experiments were performed on one processor of an SGI PowerChallenge XL R10000. In a first series of experiments, FFT algorithms using the newly developed FMA optimized radix-8 FFT kernel were compared with FFT algorithms using a conventional radix-8 kernel in a triple-loop Cooley-Tukey framework. In the experiments, cycles and L2 data cache misses were counted using the on-chip performance monitor counters of the MIPS R10000 processor (Auer et al. [1]). Runtimes were derived from the cycle counts. The superscalar architecture of the R10000 processor is able to execute up to four instructions per cycle. Since the R10000 can execute one load or store and one multiply-add instruction per cycle, the compiler can generate an instruction schedule that overlaps memory references with floating-point operations. This is the reason why the lower instruction count of the FMA optimized radix-8 kernel does not result in a significant speed-up for lengths $N \le 2^9$. The speed-up in execution time compared to a conventional radix-8 kernel can be seen in Fig. 1. For $N \le 2^{15}$ the speed-up is due to the lower number of primary data cache misses. For $N > 2^{15}$, when secondary data cache misses make memory accesses very costly, the benefits of the new radix-8 kernel, i.e., fewer memory accesses leading to fewer secondary data cache misses, result in a substantial speed-up in execution time.
[Figure: two panels over vector lengths $2^6$ to $2^{24}$, showing the speed-up in percent and the normalized number of L2 cache misses for the FMA optimized and the conventional radix-8 kernel.]

Figure 1: Speed-up in minimum execution time and normalized number $M_{L2}/(N \log_2 N)$ of L2 cache misses of the FMA optimized radix-8 FFT algorithm compared to a conventional radix-8 FFT algorithm on one processor of an SGI PowerChallenge XL.
An advantage of the FFT kernels presented in this paper is their compatibility with conventional kernels, making it easy to incorporate them into existing FFT software. This feature has been demonstrated by incorporating the multiply-add optimized radix-8 kernel into Fftw (Karner et al. [8]). The installation of the FMA optimized radix-8 kernel in Fftw did not cause any difficulty. For transform lengths $N \ge 2^{20}$ a speed-up of more than 30 % was achieved.

5 Conclusion

The advantages of the multiply-add optimized FFT kernels presented in this paper are their low number of required load operations, their high efficiency on performance oriented computer systems, and their striking simplicity. Current work is the integration of the method presented above into Fftw's kernel generator genfft (Frigo and Kral [5]) to generate FMA optimized radix-16, -32, and -64 kernels.

6 References

1. Auer, M., et al. Performance Evaluation of FFT Algorithms Using Performance Counters. Tech. Rep. Aurora TR1998-20, Vienna University of Technology, 1998.
2. Fabianek, C., et al. Survey of Self-adapting FFT Software. Tech. Rep. Aurora TR2002-01, Vienna University of Technology, 2002.
3. Franchetti, F., et al. FMA Optimized FFT Codelets. Tech. Rep. Aurora TR1998-28, Vienna University of Technology, 2002.
4. Frigo, M. and Johnson, S. G. FFTW: An Adaptive Software Architecture for the FFT. In: Proc. ICASSP, pp. 1381–1384, 1998.
5. Frigo, M. and Kral, S. The Advanced FFT Program Generator GENFFT. Tech. Rep. Aurora TR2001-03, Vienna University of Technology, 2001.
6. Goedecker, S. Fast Radix 2, 3, 4, and 5 Kernels for Fast Fourier Transformations on Computers with Overlapping Multiply-Add Instructions. SIAM J. Sci. Comput., 18, pp. 1605–1611, 1997.
7. Golub, G. H. and Van Loan, C. F. Matrix Computations. Johns Hopkins University Press, Baltimore, 2nd edn., 1989.
8. Karner, H., et al. Accelerating FFTW by Multiply-Add Optimization. Tech. Rep.
Aurora TR1999-13, Vienna University of Technology, 1999.
9. Karner, H., et al. Multiply-Add Optimized FFT Kernels. Math. Models and Methods in Appl. Sci., 11, pp. 105–117, 2001.
10. Linzer, E. N. and Feig, E. Implementation of Efficient FFT Algorithms on Fused Multiply-Add Architectures. IEEE Trans. Signal Processing, 41, pp. 93–107, 1993.
11. Linzer, E. N. and Feig, E. Modified FFTs for Fused Multiply-Add Architectures. Math. Comp., 60, pp. 347–361, 1993.
12. Meyer, R. and Schwarz, K. FFT Implementation on DSP Chips: Theory and Practice. In: Proc. ICASSP, pp. 1503–1506, 1990.
13. Pueschel, M., et al. Spiral: A Generator for Platform-Adapted Libraries of Signal Processing Algorithms. Journal of High Performance Computing and Applications, submitted.
Acknowledgement

This work is the result of a cooperation with our colleague and friend Herbert Karner, who passed away in October 2001. The project was supported by the Special Research Program SFB F011 AURORA of the Austrian Science Fund FWF.

Current Address

Institute for Applied Mathematics and Numerical Analysis, Vienna University of Technology, Wiedner Hauptstrasse 8-10, A-1040 Wien, Austria, Tel.: +43-1-58801-11512, Email: {franz,kalti,christof}@aurora.anum.tuwien.ac.at