A Fast Fourier Transform Compiler

RETROSPECTIVE: A Fast Fourier Transform Compiler

Matteo Frigo
Vanu Inc., One Porter Sq., Suite 18, Cambridge, MA 02140, USA
athena@fftw.org

(20 Years of the ACM/SIGPLAN Conference on Programming Language Design and Implementation (1979-1999): A Selection, 2003. Copyright 2003 ACM 1-58113-623-4...$5.00.)

1. HOW FFTW WAS BORN

FFTW ("the Fastest Fourier Transform in the West") is a library that computes Fourier transforms in an unusual way: instead of executing a well-defined algorithm, FFTW attempts to determine automatically which algorithm runs best on your hardware. On general-purpose processors, FFTW is usually faster than hand-optimized libraries that compute the same thing.

In 1996, it became clear to me that microprocessors had become so complicated that one could no longer program for performance directly, and that some sort of automatic performance tuning was necessary. (At the time, I was unaware that the PHiPAC project had already come to the same conclusion [1, 2] while developing an automatically tuned implementation of matrix multiplication kernels.)

The environment of the Supercomputing Technologies group at the MIT Laboratory for Computer Science, where I was a graduate student, was instrumental to the development of the FFTW philosophy. In the summer of 1996, Charles Leiserson, Keith Randall, and I had just completed the implementation of Cilk-4 [4], a multithreaded parallel language. While previous implementations of Cilk were aimed at distributed-memory supercomputers, Cilk-4 was designed to run on multiprocessor machines with shared memory. Thanks to Sun Microsystems, which generously donated nine eight-processor machines to MIT, we had such machines available for Cilk development.

At around that time, Don Dailey joined our group. Don is a computer chess expert, and he had been the first real user of Cilk. Don found Cilk convenient for developing parallel chess programs, and we received valuable feedback from him about Cilk bugs and missing features.

One day, Don came to me complaining that the new Cilk was slow. His chess program, written in C, ran in about 50 seconds, but it took about 80 seconds when compiled through the Cilk compiler. This result was completely unexpected, because Cilk was designed to run as fast as C when executing on a single processor. In an attempt to understand the problem, Don instrumented his code to measure the execution time of a certain subroutine, and he inserted a printf statement to print the result. After the insertion of the printf, his program suddenly ran in the expected 50 seconds.

The reason for this behavior of the machine is complicated. The printf statement changed the alignment of a certain tight loop, which, because of certain architectural tradeoffs in the UltraSPARC processor, caused the wrong branch-prediction bits to be used. (Later, Nate Kushman was able to explain this and related phenomena [6].)

After this event, I started looking at computers and at programming in a different way, and I actively started looking for other performance surprises. I devised a simple loop whose performance changed by a factor of 2 depending on the code alignment. Volker Strumpen later discovered a situation where performance could vary by a factor of 4. I no longer trusted any performance numbers, because it seemed that you could obtain any runtime you wanted simply by inserting no-ops.

This was my mood when, in January 1997, Steven Johnson approached me asking for a fast FFT program, which he needed for certain simulations in condensed-matter physics. As it happens, I had written an FFT routine optimized for the CM-5, a parallel computer based on SuperSPARC processors. My routine was fairly efficient on the CM-5, and therefore I gave Steven the code, asserting with confidence that it was "the fastest FFT in the West." Alas, it was not on Steven's PowerPC 603 processor. Steven carefully measured the performance of all the FFT programs he could find, and showed that, for certain input sizes, other programs were faster than my routine. To address Steven's concerns, I modified my program. The modifications made my program faster on the UltraSPARC, but slower on Steven's PowerPC. After this episode, I was looking for a way to make the FFT routine robust against both code/data-alignment problems and differences in machine architecture. Thus, I conceived an FFT program capable of choosing its algorithm from a menu of possibilities, and in this way FFTW was born.
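
To make the "menu of possibilities" idea concrete, here is a minimal sketch of empirical algorithm selection: measure every candidate routine on the target machine and keep the fastest. It is written in Caml only for consistency with the generator discussed below; it is not FFTW's actual planner interface (FFTW itself is a C library), and the timing loop and the candidate names in the closing comment are illustrative assumptions.

    (* A minimal sketch of the "menu of algorithms" idea: run every candidate
       on a representative input, keep the best observed time for each, and
       return the winner.  This is not FFTW's planner API.  Link with the
       unix library for Unix.gettimeofday. *)
    let time_once f x =
      let t0 = Unix.gettimeofday () in
      ignore (f x);
      Unix.gettimeofday () -. t0

    (* Best of [trials] runs, to reduce the influence of timing noise. *)
    let best_time ?(trials = 5) f x =
      let rec go n best =
        if n = 0 then best else go (n - 1) (min best (time_once f x))
      in
      go trials infinity

    let pick_fastest candidates x =
      match candidates with
      | [] -> invalid_arg "pick_fastest: empty menu"
      | (name0, f0) :: rest ->
          List.fold_left
            (fun (bn, bf, bt) (name, f) ->
               let t = best_time f x in
               if t < bt then (name, f, t) else (bn, bf, bt))
            (name0, f0, best_time f0 x)
            rest

    (* Hypothetical use:
       let name, fft, _ = pick_fastest [("radix-2", fft_r2); ("radix-4", fft_r4)] input *)
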
Automatic code generation has been a crucial aspect of the FFTW design since the beginning. To choose from a menu of subroutines, we needed the subroutines to begin with, but Steven and I did not want to write them. Our initial generator of FFT subroutines (later called "codelets," a name suggested by Charles Leiserson) consisted of 79 lines of Scheme code, and it was very simple-minded. It soon became clear that we really wanted the generator to incorporate compiler algorithms such as common-subexpression elimination, and that strong typing and ML-style pattern matching would have been desirable. (I had just learned the wonders of ML-style pattern matching from Arvind.) Consequently, we switched to Caml Light, which was the only variant of ML or Haskell capable of running on my i486 laptop with 8 MB of RAM.
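
The attraction of strong typing and pattern matching is that dag simplifications read almost like the algebraic rules they implement. The fragment below is only an illustrative sketch in that spirit; the constructors and the handful of rules are my own simplified assumptions, whereas genfft's real simplifier is memoized over the dag and applies many more transformations (as described in the 1999 paper).

    (* Illustrative sketch: bottom-up algebraic simplification of expression
       nodes written as ML pattern matches.  The type and the rules are made
       up for exposition; genfft's actual simplifier is considerably richer. *)
    type expr =
      | Num of float
      | Var of string
      | Plus of expr * expr
      | Times of expr * expr
      | Uminus of expr

    let rec simplify e =
      match e with
      | Num _ | Var _ -> e
      | Plus (a, b) ->
          (match simplify a, simplify b with
           | Num 0.0, x | x, Num 0.0 -> x              (* x + 0 = x *)
           | Num p, Num q -> Num (p +. q)              (* constant folding *)
           | a', b' -> Plus (a', b'))
      | Times (a, b) ->
          (match simplify a, simplify b with
           | Num 1.0, x | x, Num 1.0 -> x              (* x * 1 = x *)
           | Num 0.0, _ | _, Num 0.0 -> Num 0.0        (* x * 0 = 0 *)
           | Num p, Num q -> Num (p *. q)              (* constant folding *)
           | a', b' -> Times (a', b'))
      | Uminus a ->
          (match simplify a with
           | Uminus x -> x                             (* -(-x) = x *)
           | Num p -> Num (-. p)
           | a' -> Uminus a')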

2. WHERE FFTW IS GOING

Much development has occurred on genfft since my 1999 paper. Many of these developments are still unpublished, a fault that we hope to address soon. One major research line is how to use the SIMD (single instruction, multiple data) extensions of modern processors, such as Intel's SSE and SSE2, AMD's 3DNow!, and Motorola's AltiVec. Franz Franchetti of the Technical University of Vienna has modified genfft to produce reasonably portable C code that exploits SIMD instructions [3]. We have incorporated this contribution into the forthcoming FFTW3, which should be released by the time this retrospective is printed. Whereas Franchetti coded the FFT parallelism explicitly in the genfft compiler, Stefan Kral, also of Vienna, investigated a different approach: he modified genfft to extract two-way SIMD parallelism automatically from the sequential FFT dag produced by genfft. A special version of FFTW with his changes is available from our web page, http://www.fftw.org.

Steven Johnson and I have focused on a complete rewrite of FFTW so as to explore a larger space of FFT algorithms. We have also modified genfft to produce sine and cosine transforms and convolutions. It turns out that genfft is powerful enough to derive sine and cosine transforms automatically by exploiting symmetries in FFT dags. (See [7].)

3. RELATED RESEARCH EFFORTS

The SPIRAL project aims at the automatic generation and optimization of signal-processing transforms, including Fourier transforms, sine and cosine transforms, and Walsh-Hadamard transforms. The SPIRAL system implements its own compiler, called SPL [9], which produces C or FORTRAN code from a special-purpose language, also called SPL. The SPL language can be viewed as a matrix language (think of Matlab) with LISP-like syntax. SPL expresses loops by means of a special tensor-product operator. The SPL language obeys many formal algebraic identities, which allow, for example, two loops to be interchanged or vectors to be accessed with different strides. SPIRAL exploits these identities to generate a large space of implementations.
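
For example, in the tensor-product notation used in the SPIRAL literature (the formula below follows that literature and is not taken from this paper), the Cooley-Tukey step factors the DFT matrix as

    \mathrm{DFT}_{mn} = (\mathrm{DFT}_m \otimes I_n) \, T^{mn}_n \, (I_m \otimes \mathrm{DFT}_n) \, L^{mn}_m

where I_n is the n-by-n identity, T^{mn}_n is a diagonal matrix of twiddle factors, and L^{mn}_m is a stride permutation. A factor of the form A \otimes I_n is a loop that applies A to n subvectors accessed at stride n, while I_m \otimes A applies A to m contiguous blocks; identities such as B \otimes A = L (A \otimes B) L^{-1}, for a suitable stride permutation L, are precisely the loop interchanges and stride changes mentioned above.
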
The SPIRAL and FFTW approaches are somewhat orthogonal. SPIRAL sees the FFT optimization process as a problem of finding the best formula. My 1999 paper sees it as a more traditional compiler problem: after the initial stage, genfft operates on a standard expression dag, even though it uses FFT-specific optimizations. The genfft optimizations are powerful, and genfft can derive algorithms that were not known before, as discussed in my paper. On the other hand, the SPL compiler has a real input language, which makes it much more flexible than genfft.

The ATLAS system [8] optimizes the basic linear-algebra subroutines (BLAS) automatically. The BLAS form the performance-critical kernel of LAPACK, a widely used linear-algebra package. The problem of tuning linear-algebra subroutines automatically becomes really interesting when the matrices involved are sparse, that is, when they contain many zero entries. Usually the positions of the zeros are known in advance. For example, a circuit may be represented in a simulator by a square matrix, where each circuit node corresponds to a row and a column, and each matrix element contains the inverse of the impedance between the two nodes corresponding to the row and column of the entry. Clearly, most matrix entries are zero in this case. Typically, the circuit is fixed, and one wants to simulate it under several different conditions. If the simulation runs for hours or days, it is worthwhile to spend some time optimizing the linear-algebra operations for the specific circuit at hand. The Sparsity system [5] addresses this problem.

REFERENCES

[1] J. Bilmes, K. Asanović, J. Demmel, D. Lam, and C. W. Chin. PHiPAC: A portable, high-performance, ANSI C coding methodology and its application to matrix multiply. LAPACK Working Note 111, University of Tennessee, 1996.

[2] Jeff Bilmes, Krste Asanović, Chee-Whye Chin, and Jim Demmel. Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology. In Proceedings of the International Conference on Supercomputing, Vienna, Austria, July 1997.

[3] Franz Franchetti, Herbert Karner, Stefan Kral, and C. W. Ueberhuber. Architecture independent short vector FFTs. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 2, pages 1109-1112, 2001.

[4] Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. The implementation of the Cilk-5 multithreaded language. In Proceedings of the ACM SIGPLAN '98 Conference on Programming Language Design and Implementation (PLDI), Montreal, Canada, June 1998.

[5] Eun-Jin Im. Optimizing the Performance of Sparse Matrix-Vector Multiplication. PhD thesis, University of California at Berkeley, 2000.

[6] Nathaniel A. Kushman. Performance nonmonotonicities: A case study of the UltraSPARC processor. Master's thesis, MIT Department of Electrical Engineering and Computer Science, June 1998.

[7] R. Vuduc and J. W. Demmel. Code generators for automatic tuning of numerical kernels: Experiences with FFTW. In Proceedings of the International Workshop on Semantics, Applications, and Implementation of Program Generation, Montreal, Canada, September 2000.

[8] R. Whaley and J. Dongarra. Automatically tuned linear algebra software. Technical Report CS-97-366, Computer Science Department, University of Tennessee, Knoxville, TN, 1997.

[9] Jianxin Xiong, David Padua, and Jeremy Johnson. SPL: A language and compiler for DSP algorithms. In Proceedings of the ACM SIGPLAN '01 Conference on Programming Language Design and Implementation (PLDI), pages 298-308, 2001.
