A Fast Fourier Transform Compiler

RETROSPECTIVE: A Fast Fourier Transform Compiler

Matteo Frigo
Vanu Inc., One Porter Sq., Suite 18, Cambridge, MA 02140, USA
athena@fftw.org

(20 Years of the ACM/SIGPLAN Conference on Programming Language Design and Implementation (1979-1999): A Selection, 2003. Copyright 2003 ACM 1-58113-623-4...$5.00.)

1. HOW FFTW WAS BORN

FFTW ("the Fastest Fourier Transform in the West") is a library that computes Fourier transforms in an unusual way: instead of executing a well-defined algorithm, FFTW attempts to determine automatically which algorithm runs best on your hardware. On general-purpose processors, FFTW is usually faster than hand-optimized libraries that compute the same thing.

In 1996, it became clear to me that microprocessors had become so complicated that one could no longer program for performance directly, and that some sort of automatic performance tuning was necessary. (At the time, I was unaware that the PHiPAC project had already come to the same conclusion [1, 2] while developing an automatically tuned implementation of matrix multiplication kernels.)

The environment of the Supercomputing Technologies group at the MIT Laboratory for Computer Science, where I was a graduate student, was instrumental to the development of the FFTW philosophy. In the summer of 1996, Charles Leiserson, Keith Randall, and I had just completed the implementation of Cilk-4 [4], a multithreaded parallel language. While previous implementations of Cilk were aimed at distributed-memory supercomputers, Cilk-4 was designed to run on multiprocessor machines with shared memory. Thanks to Sun Microsystems, which generously donated nine eight-processor machines to MIT, we had such machines available for Cilk development.

At around that time, Don Dailey joined our group. Don is a computer chess expert, and he had been the first real user of Cilk. Don found Cilk convenient for developing parallel chess programs, and we received valuable feedback from him about Cilk bugs and missing features.

One day, Don came to me complaining that the new Cilk was slow. His chess program, written in C, ran in about 50 seconds, but it took about 80 seconds when compiled through the Cilk compiler. This result was completely unexpected, because Cilk was designed to run as fast as C when executing on a single processor. In an attempt to understand the problem, Don instrumented his code to measure the execution time of a certain subroutine, and he inserted a printf statement to print the result. After the insertion of the printf, his program suddenly ran in the expected 50 seconds.

The reason for this behavior of the machine is complicated. The printf statement changed the alignment of a certain tight loop, which, because of certain architectural tradeoffs in the UltraSPARC processor, caused the wrong branch-prediction bits to be used. (Later, Nate Kushman was able to explain this and related phenomena [6].)

After this event, I started looking at computers and at programming in a different way, and I actively started looking for other performance surprises. I devised a simple loop whose performance changed by a factor of 2 depending on the code alignment. Volker Strumpen later discovered a situation where performance could vary by a factor of 4. I no longer trusted any performance numbers, because it seemed that you could obtain any runtime you wanted simply by inserting no-ops.

This was my mood when, in January 1997, Steven Johnson approached me asking for a fast FFT program, which he needed for certain simulations in condensed-matter physics. As it happens, I had written an FFT routine optimized for the CM-5, a parallel computer based on SuperSPARC processors. My routine was fairly efficient on the CM-5, and therefore I gave Steven the code, asserting with confidence that it was "the fastest FFT in the West." Alas, it was not on Steven's PowerPC 603 processor. Steven carefully measured the performance of all the FFT programs he could find, and showed that, for certain input sizes, other programs were faster than my routine. To address Steven's concerns, I modified my program. The modifications made my program faster on the UltraSPARC, but slower on Steven's PowerPC. After this episode, I was looking for a way to make the FFT routine robust against both code/data-alignment problems and differences in machine architecture. Thus, I conceived an FFT program capable of choosing its algorithm from a menu of possibilities, and in this way FFTW was born.
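
To make the "menu of possibilities" idea concrete, here is a minimal sketch of empirical algorithm selection: measure every candidate routine on the target machine and keep the fastest. It is written in Caml only for consistency with the generator discussed below; it is not FFTW's actual planner interface (FFTW itself is a C library), and the timing loop and the candidate names in the closing comment are illustrative assumptions.

    (* A minimal sketch of the "menu of algorithms" idea: run every candidate
       on a representative input, keep the best observed time for each, and
       return the winner.  This is not FFTW's planner API.  Link with the
       unix library for Unix.gettimeofday. *)
    let time_once f x =
      let t0 = Unix.gettimeofday () in
      ignore (f x);
      Unix.gettimeofday () -. t0

    (* Best of [trials] runs, to reduce the influence of timing noise. *)
    let best_time ?(trials = 5) f x =
      let rec go n best =
        if n = 0 then best else go (n - 1) (min best (time_once f x))
      in
      go trials infinity

    let pick_fastest candidates x =
      match candidates with
      | [] -> invalid_arg "pick_fastest: empty menu"
      | (name0, f0) :: rest ->
          List.fold_left
            (fun (bn, bf, bt) (name, f) ->
               let t = best_time f x in
               if t < bt then (name, f, t) else (bn, bf, bt))
            (name0, f0, best_time f0 x)
            rest

    (* Hypothetical use:
       let name, fft, _ = pick_fastest [("radix-2", fft_r2); ("radix-4", fft_r4)] input *)
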
Automatic code generation has been a crucial aspect of the FFTW design since the beginning. To choose from a menu of subroutines, we needed the subroutines to begin with, but Steven and I did not want to write them. Our initial generator of FFT subroutines (later called "codelets," a name suggested by Charles Leiserson) consisted of 79 lines of Scheme code, and it was very simple-minded. It soon became clear that we really wanted the generator to incorporate compiler algorithms such as common-subexpression elimination, and that strong typing and ML-style pattern matching would have been desirable. (I had just learned the wonders of ML-style pattern matching from Arvind.) Consequently, we switched to Caml Light, which was the only variant of ML or Haskell capable of running on my i486 laptop with 8 MB of RAM.
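
The attraction of strong typing and pattern matching is that dag simplifications read almost like the algebraic rules they implement. The fragment below is only an illustrative sketch in that spirit; the constructors and the handful of rules are my own simplified assumptions, whereas genfft's real simplifier is memoized over the dag and applies many more transformations (as described in the 1999 paper).

    (* Illustrative sketch: bottom-up algebraic simplification of expression
       nodes written as ML pattern matches.  The type and the rules are made
       up for exposition; genfft's actual simplifier is considerably richer. *)
    type expr =
      | Num of float
      | Var of string
      | Plus of expr * expr
      | Times of expr * expr
      | Uminus of expr

    let rec simplify e =
      match e with
      | Num _ | Var _ -> e
      | Plus (a, b) ->
          (match simplify a, simplify b with
           | Num 0.0, x | x, Num 0.0 -> x              (* x + 0 = x *)
           | Num p, Num q -> Num (p +. q)              (* constant folding *)
           | a', b' -> Plus (a', b'))
      | Times (a, b) ->
          (match simplify a, simplify b with
           | Num 1.0, x | x, Num 1.0 -> x              (* x * 1 = x *)
           | Num 0.0, _ | _, Num 0.0 -> Num 0.0        (* x * 0 = 0 *)
           | Num p, Num q -> Num (p *. q)              (* constant folding *)
           | a', b' -> Times (a', b'))
      | Uminus a ->
          (match simplify a with
           | Uminus x -> x                             (* -(-x) = x *)
           | Num p -> Num (-. p)
           | a' -> Uminus a')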

2. WHERE FFTW IS GOING

Much development has occurred on genfft since my 1999 paper. Many of these developments are still unpublished, a fault that we hope to address soon. One major research line is how to use the SIMD (single instruction, multiple data) extensions of modern processors, such as Intel's SSE and SSE2, AMD's 3DNow!, and Motorola's AltiVec. Franz Franchetti of the Technical University of Vienna has modified genfft to produce reasonably portable C code that exploits SIMD instructions [3]. We have incorporated this contribution into the forthcoming FFTW3, which should be released by the time this retrospective is printed. Whereas Franchetti coded the FFT parallelism explicitly in the genfft compiler, Stefan Kral, also of Vienna, investigated a different approach: he modified genfft to extract two-way SIMD parallelism automatically from the sequential FFT dag produced by genfft. A special version of FFTW with his changes is available from our web page, http://www.fftw.org.

Steven Johnson and I have focused on a complete rewrite of FFTW so as to explore a larger space of FFT algorithms. We have also modified genfft to produce sine and cosine transforms and convolutions. It turns out that genfft is powerful enough to derive sine and cosine transforms automatically by exploiting symmetries in FFT dags. (See [7].)

3. RELATED RESEARCH EFFORTS

The SPIRAL project aims at the automatic generation and optimization of signal-processing transforms, including Fourier transforms, sine and cosine transforms, and Walsh-Hadamard transforms. The SPIRAL system implements its own compiler, called SPL [9], which produces C or FORTRAN code from a special-purpose language, also called SPL. The SPL language can be viewed as a matrix language (think of Matlab) with LISP-like syntax. SPL expresses loops by means of a special tensor-product operator. The SPL language obeys many formal algebraic identities, which allow, for example, two loops to be interchanged or vectors to be accessed with different strides. SPIRAL exploits these identities to generate a large space of implementations.
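
For example, in the tensor-product notation used in the SPIRAL literature (the formula below follows that literature and is not taken from this paper), the Cooley-Tukey step factors the DFT matrix as

    \mathrm{DFT}_{mn} = (\mathrm{DFT}_m \otimes I_n) \, T^{mn}_n \, (I_m \otimes \mathrm{DFT}_n) \, L^{mn}_m

where I_n is the n-by-n identity, T^{mn}_n is a diagonal matrix of twiddle factors, and L^{mn}_m is a stride permutation. A factor of the form A \otimes I_n is a loop that applies A to n subvectors accessed at stride n, while I_m \otimes A applies A to m contiguous blocks; identities such as B \otimes A = L (A \otimes B) L^{-1}, for a suitable stride permutation L, are precisely the loop interchanges and stride changes mentioned above.
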
The SPIRAL and FFTW approaches are somewhat orthogonal. SPIRAL sees the FFT optimization process as a problem of finding the best formula. My 1999 paper sees it as a more traditional compiler problem: after the initial stage, genfft operates on a standard expression dag, even though it uses FFT-specific optimizations. The genfft optimizations are powerful, and genfft can derive algorithms that were not known before, as discussed in my paper. On the other hand, the SPL compiler has a real input language, which makes it much more flexible than genfft.

The ATLAS system [8] optimizes the basic linear-algebra subroutines (BLAS) automatically. The BLAS form the performance-critical kernel of LAPACK, a widely used linear-algebra package. The problem of tuning linear-algebra subroutines automatically becomes really interesting when the matrices involved are sparse, that is, when they contain many zero entries. Usually the positions of the zeros are known in advance. For example, a circuit may be represented in a simulator by a square matrix, where each circuit node corresponds to a row and a column, and each matrix element contains the inverse of the impedance between the two nodes corresponding to the row and column of the entry. Clearly, most matrix entries are zero in this case. Typically, the circuit is fixed, and one wants to simulate it under several different conditions. If the simulation runs for hours or days, it is worthwhile to spend some time optimizing the linear-algebra operations for the specific circuit at hand. The Sparsity system [5] addresses this problem.

REFERENCES

[1] J. Bilmes, K. Asanović, J. Demmel, D. Lam, and C. W. Chin. PHiPAC: A portable, high-performance, ANSI C coding methodology and its application to matrix multiply. LAPACK Working Note 111, University of Tennessee, 1996.

[2] Jeff Bilmes, Krste Asanović, Chee-Whye Chin, and Jim Demmel. Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology. In Proceedings of the International Conference on Supercomputing, Vienna, Austria, July 1997.

[3] Franz Franchetti, Herbert Karner, Stefan Kral, and C. W. Ueberhuber. Architecture independent short vector FFTs. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 2, pages 1109-1112, 2001.

[4] Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. The implementation of the Cilk-5 multithreaded language. In Proceedings of the ACM SIGPLAN '98 Conference on Programming Language Design and Implementation (PLDI), Montreal, Canada, June 1998.

[5] Eun-Jin Im. Optimizing the Performance of Sparse Matrix-Vector Multiplication. PhD thesis, University of California at Berkeley, 2000.

[6] Nathaniel A. Kushman. Performance nonmonotonicities: A case study of the UltraSPARC processor. Master's thesis, MIT Department of Electrical Engineering and Computer Science, June 1998.

[7] R. Vuduc and J. W. Demmel. Code generators for automatic tuning of numerical kernels: Experiences with FFTW. In Proceedings of the International Workshop on Semantics, Applications, and Implementation of Program Generation, Montreal, Canada, September 2000.

[8] R. Whaley and J. Dongarra. Automatically tuned linear algebra software. Technical Report CS-97-366, Computer Science Department, University of Tennessee, Knoxville, TN, 1997.

[9] Jianxin Xiong, David Padua, and Jeremy Johnson. SPL: A language and compiler for DSP algorithms. In Proceedings of the ACM SIGPLAN '01 Conference on Programming Language Design and Implementation (PLDI), pages 298-308, 2001.
