A Fast Fourier Transform Compiler


RETROSPECTIVE: A Fast Fourier Transform Compiler

Matteo Frigo
Vanu Inc., One Porter Sq., suite 18
Cambridge, MA, 02140, USA
athena@fftw.org

1. HOW FFTW WAS BORN

FFTW ("the fastest Fourier transform in the West") is a library that computes Fourier transforms in an unusual way: instead of executing a well-defined algorithm, FFTW attempts to determine automatically which algorithm runs best on your hardware. On general-purpose processors, FFTW is usually faster than hand-optimized libraries that compute the same thing.

In 1996, it became clear to me that microprocessors had become so complicated that one could no longer program for performance directly, and that some sort of automatic performance tuning was necessary. (At the time, I was unaware that the PHiPAC project had already come to the same conclusion [1, 2] while developing an automatically tuned implementation of matrix multiplication kernels.)

The environment of the Supercomputing Technologies group at the MIT Laboratory for Computer Science, where I was a graduate student, was instrumental for the development of the FFTW philosophy. In the summer of 1996, Charles Leiserson, Keith Randall, and I had just completed the implementation of Cilk-4 [4], a multithreaded parallel language. While previous implementations of Cilk were aimed at distributed-memory supercomputers, Cilk-4 was designed to run on multiprocessor machines with shared memory. Thanks to a generous donation of nine eight-processor machines from Sun Microsystems to MIT, we had such machines available for Cilk development.

At around that time, Don Dailey joined our group. Don is a computer chess expert, and he had historically been the first real user of Cilk. Don found Cilk convenient for developing parallel chess programs, and we received valuable feedback from him about Cilk bugs and missing features. One day, Don came to me complaining that the new Cilk was slow.
His chess program, written in C, ran in about 50 seconds, but it took about 80 seconds to run when compiled through the Cilk compiler. This assertion was completely unexpected, because Cilk was designed to run as fast as C when executing on a single processor. In an attempt to understand the problem, Don instrumented his code to measure the execution time of a certain subroutine, and he inserted a printf statement to print the result. After the insertion of the printf, his program suddenly ran in the expected 50 seconds.

The reason for this behavior of the machine is complicated. The printf statement changed the alignment of a certain tight loop, which, because of certain architectural tradeoffs in the UltraSPARC processor, caused the wrong branch prediction bits to be used. (Later, Nate Kushman was able to explain this and related phenomena [6].)

20 Years of the ACM/SIGPLAN Conference on Programming Language Design and Implementation ( ): A Selection, Copyright 2003 ACM $5.00.

After this event, I started looking at computers and at programming in a different way. I actively started looking for other performance surprises. I devised a simple loop whose performance changed by a factor of 2 depending on the code alignment. Volker Strumpen later discovered a situation where performance could vary by a factor of 4. I no longer trusted any performance numbers, because it seemed like you could obtain any runtime you wanted simply by inserting no-ops.

This was my mood when, in January 1997, Steven Johnson approached me asking for a fast FFT program, which he needed to do certain simulations in condensed-matter physics. As it happens, I had written an FFT routine optimized for the CM-5, a parallel computer based on SuperSPARC processors. My routine was fairly efficient on the CM-5, and therefore I gave Steven the code, asserting with confidence that "this is the fastest FFT in the West." Alas, it was not on Steven's PowerPC 603 processor.
Steven carefully measured the performance of all the FFT programs he could find, and showed that, for certain input sizes, other programs were faster than my routine. To address Steven's concerns, I modified my program. The modifications made my program faster on the UltraSPARC, but slower on Steven's PowerPC.

After this episode, I was looking for a way to make the FFT routine robust against both the code/data alignment problems and against differences in machine architecture. Thus, I conceived an FFT program capable of choosing its algorithm from a menu of possibilities, and in this way FFTW was born.

Automatic code generation has been a crucial aspect of the FFTW design since the beginning. To choose from a menu of subroutines, we needed the subroutines to begin with, but Steven and I did not want to write them. Our initial generator of FFT subroutines (later called "codelets", as suggested by Charles Leiserson) consisted of 79 lines of Scheme code, and it was very simple-minded. It soon became clear that we really wanted the generator to incorporate compiler algorithms such as common-subexpression elimination, and that strong typing and ML-style pattern matching would have been desirable. (I had just learned the wonders of ML-style pattern matching from Arvind.) Consequently, we switched to Caml Light, which was the only variant of ML or Haskell capable of running on my i486 laptop with 8 MB of RAM.

2. WHERE FFTW IS GOING

Much development occurred on genfft after my 1999 paper. Many of these developments are still unpublished, a fault that we hope to address soon.

One major research line is how to use SIMD (single instruction, multiple data) extensions in processors such as the Intel SSE and SSE2, AMD's 3DNow!, and Motorola's AltiVec. Franz Franchetti of the Technical University of Vienna has modified genfft to produce reasonably portable C code that exploits SIMD instructions [3]. We have incorporated this contribution into the forthcoming FFTW3, which should be released by the time this retrospective is printed.

Whereas Franchetti coded the FFT parallelism explicitly in the genfft compiler, Stefan Kral, also of Vienna, investigated a different approach. He modified genfft to extract two-way SIMD parallelism automatically from the sequential FFT dag produced by genfft. A special version of FFTW with his changes is available from our web page.

Steven Johnson and I have focused on a complete rewrite of FFTW so as to explore a larger space of FFT algorithms. We have also modified genfft to produce sine and cosine transforms and convolutions. It turns out that genfft is powerful enough to derive sine and cosine transforms automatically by exploiting symmetries in FFT dags. (See [7].)

3. RELATED RESEARCH EFFORTS

The SPIRAL project aims at the automatic generation and optimization of signal-processing transforms, including Fourier transforms, sine and cosine transforms, and Walsh-Hadamard transforms. The SPIRAL system implements its own compiler, called SPL [9], which produces C or FORTRAN code from a special-purpose language, also called SPL. The SPL language can be viewed as a matrix language (think of Matlab) with LISP-like syntax. SPL expresses loops by means of a special tensor product operator. The SPL language obeys many formal algebraic identities, which allow, for example, two loops to be interchanged or vectors to be accessed with different strides. SPIRAL exploits these identities to generate a large space of implementations.

The SPIRAL and FFTW approaches are somewhat orthogonal. SPIRAL sees the FFT optimization process as a problem of finding the best formula. My 1999 paper sees it as a more traditional compiler problem: after the initial stage, genfft operates on a standard expression dag, even though it uses FFT-specific optimizations.
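
To make concrete the idea that a tensor product can stand for a loop, here is a minimal sketch in Python. It is not SPL syntax and is not taken from SPIRAL; the helper names (kron, apply_blocks, apply_strided) are hypothetical and chosen only for this illustration. It checks that multiplying by kron(I_m, A) is the same as looping over m contiguous blocks, and that multiplying by kron(A, I_m) is the same loop run over interleaved subvectors with stride m.

```python
# Illustrative sketch only: plain Python lists, hypothetical helper names.

def kron(a, b):
    """Explicit Kronecker (tensor) product of two dense matrices (lists of rows)."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]

def identity(n):
    """n-by-n identity matrix."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

def matvec(m, v):
    """Dense matrix-vector product."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]

def apply_blocks(a, m, v):
    """kron(I_m, A) applied to v, written as a loop over m contiguous blocks."""
    n = len(a)
    out = []
    for k in range(m):
        out.extend(matvec(a, v[k * n:(k + 1) * n]))
    return out

def apply_strided(a, m, v):
    """kron(A, I_m) applied to v, written as a loop over m subvectors of stride m."""
    out = [0] * len(v)
    for k in range(m):
        out[k::m] = matvec(a, v[k::m])
    return out

F2 = [[1, 1], [1, -1]]   # 2-point DFT (butterfly) matrix
x = [1, 2, 3, 4, 5, 6]

# The loop forms agree with the explicit tensor-product matrices.
assert apply_blocks(F2, 3, x) == matvec(kron(identity(3), F2), x)
assert apply_strided(F2, 3, x) == matvec(kron(F2, identity(3)), x)
```

The two loop forms differ only in how they index the vector, which is the kind of freedom (loop order, access stride) that a formula-level identity can expose to a search over implementations.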
genfft optimizations are powerful, and genfft can derive algorithms that were not known before, as discussed in my paper. On the other hand, the SPL compiler has a real input language, which makes it much more flexible than genfft.

The ATLAS system [8] optimizes the basic linear-algebra subroutines (BLAS) automatically. The BLAS form the performance-critical kernel of LAPACK, a widely used linear algebra package.

The problem of tuning linear-algebra subroutines automatically becomes really interesting when the matrices involved are sparse, that is, they contain many zero entries. Usually the position of the zeros is known in advance. For example, a circuit may be represented in a simulator by a square matrix, where each circuit node corresponds to a row and a column, and each matrix element contains the inverse of the impedance between the two nodes corresponding to the row and column of the entry. Clearly, most matrix entries are zero in this case. Typically, the circuit is fixed, and one wants to simulate the circuit under several different conditions. If the simulation runs for hours or days, it is worthwhile to spend some time optimizing the linear algebra operations for the specific circuit at hand. The Sparsity system [5] addresses this problem.

REFERENCES

[1] J. Bilmes, K. Asanović, J. Demmel, D. Lam, and C.W. Chin. PHiPAC: A portable, high-performance, ANSI C coding methodology and its application to matrix multiply. LAPACK working note 111, University of Tennessee,
[2] Jeff Bilmes, Krste Asanović, Chee-whye Chin, and Jim Demmel. Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology. In Proceedings of International Conference on Supercomputing, Vienna, Austria, July
[3] Franz Franchetti, Herbert Karner, Stefan Kral, and C. W. Ueberhuber. Architecture independent short vector FFTs. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 2, pages ,
[4] Matteo Frigo, Keith H. Randall, and Charles E. Leiserson. The implementation of the Cilk-5 multithreaded language. In Proceedings of the ACM SIGPLAN '98 Conference on Programming Language Design and Implementation (PLDI), Montreal, Canada, June
[5] Eun-Jin Im. Optimizing the Performance of Sparse Matrix-Vector Multiplication. PhD thesis, University of California at Berkeley,
[6] Nathaniel A. Kushman. Performance nonmonotonicities: A case study of the UltraSPARC processor. Master's thesis, MIT Department of Electrical Engineering and Computer Science, June
[7] R. Vuduc and J. W. Demmel. Code generators for automatic tuning of numerical kernels: Experiences with FFTW. In Proc. the Int. Workshop on Semantics, Applications, and Implementation of Program Generation, Montreal, Canada, September
[8] R. Whaley and J. Dongarra. Automatically tuned linear algebra software. Technical Report CS , Computer Science Department, University of Tennessee, Knoxville, TN,
[9] Jianxin Xiong, David Padua, and Jeremy Johnson. SPL: A language and compiler for DSP algorithms. In Proceedings of the ACM SIGPLAN '01 Conference on Programming Language Design and Implementation (PLDI), pages ,

ACM SIGPLAN 643 Best of PLDI


System Demonstration of Spiral: Generator for High-Performance Linear Transform Libraries

System Demonstration of Spiral: Generator for High-Performance Linear Transform Libraries System Demonstration of Spiral: Generator for High-Performance Linear Transform Libraries Yevgen Voronenko, Franz Franchetti, Frédéric de Mesmay, and Markus Püschel Department of Electrical and Computer

More information

A SIMD Vectorizing Compiler for Digital Signal Processing Algorithms Λ

A SIMD Vectorizing Compiler for Digital Signal Processing Algorithms Λ A SIMD Vectorizing Compiler for Digital Signal Processing Algorithms Λ Franz Franchetti Applied and Numerical Mathematics Technical University of Vienna, Austria franz.franchetti@tuwien.ac.at Markus Püschel

More information

Stochastic Search for Signal Processing Algorithm Optimization

Stochastic Search for Signal Processing Algorithm Optimization Stochastic Search for Signal Processing Algorithm Optimization Bryan Singer Manuela Veloso May, 01 CMU-CS-01-137 School of Computer Science Carnegie Mellon University Pittsburgh, PA 1213 Abstract Many

More information

Stochastic Search for Signal Processing Algorithm Optimization

Stochastic Search for Signal Processing Algorithm Optimization Stochastic Search for Signal Processing Algorithm Optimization Bryan Singer and Manuela Veloso Computer Science Department Carnegie Mellon University Pittsburgh, PA 1213 Email: {bsinger+, mmv+}@cs.cmu.edu

More information

Automatically Optimized FFT Codes for the BlueGene/L Supercomputer

Automatically Optimized FFT Codes for the BlueGene/L Supercomputer Automatically Optimized FFT Codes for the BlueGene/L Supercomputer Franz Franchetti, Stefan Kral, Juergen Lorenz, Markus Püschel, Christoph W. Ueberhuber, and Peter Wurzinger Institute for Analysis and

More information

Accurate Cache and TLB Characterization Using Hardware Counters

Accurate Cache and TLB Characterization Using Hardware Counters Accurate Cache and TLB Characterization Using Hardware Counters Jack Dongarra, Shirley Moore, Philip Mucci, Keith Seymour, and Haihang You Innovative Computing Laboratory, University of Tennessee Knoxville,

More information

Joint Runtime / Energy Optimization and Hardware / Software Partitioning of Linear Transforms

Joint Runtime / Energy Optimization and Hardware / Software Partitioning of Linear Transforms Joint Runtime / Energy Optimization and Hardware / Software Partitioning of Linear Transforms Paolo D Alberto, Franz Franchetti, Peter A. Milder, Aliaksei Sandryhaila, James C. Hoe, José M. F. Moura, Markus

More information

Automatic Performance Tuning. Jeremy Johnson Dept. of Computer Science Drexel University

Automatic Performance Tuning. Jeremy Johnson Dept. of Computer Science Drexel University Automatic Performance Tuning Jeremy Johnson Dept. of Computer Science Drexel University Outline Scientific Computation Kernels Matrix Multiplication Fast Fourier Transform (FFT) Automated Performance Tuning

More information

Automatically Tuned FFTs for BlueGene/L s Double FPU

Automatically Tuned FFTs for BlueGene/L s Double FPU Automatically Tuned FFTs for BlueGene/L s Double FPU Franz Franchetti, Stefan Kral, Juergen Lorenz, Markus Püschel, and Christoph W. Ueberhuber Institute for Analysis and Scientific Computing, Vienna University

More information

Statistical Evaluation of a Self-Tuning Vectorized Library for the Walsh Hadamard Transform

Statistical Evaluation of a Self-Tuning Vectorized Library for the Walsh Hadamard Transform Statistical Evaluation of a Self-Tuning Vectorized Library for the Walsh Hadamard Transform Michael Andrews and Jeremy Johnson Department of Computer Science, Drexel University, Philadelphia, PA USA Abstract.

More information

Generating Parallel Transforms Using Spiral

Generating Parallel Transforms Using Spiral Generating Parallel Transforms Using Spiral Franz Franchetti Yevgen Voronenko Markus Püschel Part of the Spiral Team Electrical and Computer Engineering Carnegie Mellon University Sponsors: DARPA DESA

More information

Formal Loop Merging for Signal Transforms

Formal Loop Merging for Signal Transforms Formal Loop Merging for Signal Transforms Franz Franchetti Yevgen S. Voronenko Markus Püschel Department of Electrical & Computer Engineering Carnegie Mellon University This work was supported by NSF through

More information

Program Composition and Optimization: An Introduction

Program Composition and Optimization: An Introduction Program Composition and Optimization: An Introduction Christoph W. Kessler 1, Welf Löwe 2, David Padua 3, and Markus Püschel 4 1 Linköping University, Linköping, Sweden, chrke@ida.liu.se 2 Linnaeus University,

More information

SPIRAL: A Generator for Platform-Adapted Libraries of Signal Processing Algorithms

SPIRAL: A Generator for Platform-Adapted Libraries of Signal Processing Algorithms SPIRAL: A Generator for Platform-Adapted Libraries of Signal Processing Algorithms Markus Püschel Faculty José Moura (CMU) Jeremy Johnson (Drexel) Robert Johnson (MathStar Inc.) David Padua (UIUC) Viktor

More information

Optimizing Matrix Multiply using PHiPAC: a Portable, High-Performance, ANSI C Coding Methodology

Optimizing Matrix Multiply using PHiPAC: a Portable, High-Performance, ANSI C Coding Methodology Optimizing Matrix Multiply using PHiPAC: a Portable, High-Performance, ANSI C Coding Methodology Jeff Bilmes, Krste Asanović, Chee-Whye Chin and Jim Demmel CS Division, University of California at Berkeley,

More information

Special Issue on Program Generation, Optimization, and Platform Adaptation /$ IEEE

Special Issue on Program Generation, Optimization, and Platform Adaptation /$ IEEE Scanning the Issue Special Issue on Program Generation, Optimization, and Platform Adaptation This special issue of the PROCEEDINGS OF THE IEEE offers an overview of ongoing efforts to facilitate the development

More information

FFT ALGORITHMS FOR MULTIPLY-ADD ARCHITECTURES

FFT ALGORITHMS FOR MULTIPLY-ADD ARCHITECTURES FFT ALGORITHMS FOR MULTIPLY-ADD ARCHITECTURES FRANCHETTI Franz, (AUT), KALTENBERGER Florian, (AUT), UEBERHUBER Christoph W. (AUT) Abstract. FFTs are the single most important algorithms in science and

More information

Performance Analysis of a Family of WHT Algorithms

Performance Analysis of a Family of WHT Algorithms Performance Analysis of a Family of WHT Algorithms Michael Andrews and Jeremy Johnson Department of Computer Science Drexel University Philadelphia, PA USA January, 7 Abstract This paper explores the correlation

More information

Aggressive Scheduling for Numerical Programs

Aggressive Scheduling for Numerical Programs Aggressive Scheduling for Numerical Programs Richard Veras, Flavio Cruz, Berkin Akin May 2, 2012 Abstract Certain classes of algorithms are still hard to optimize by modern compilers due to the differences

More information

1 Motivation for Improving Matrix Multiplication

1 Motivation for Improving Matrix Multiplication CS170 Spring 2007 Lecture 7 Feb 6 1 Motivation for Improving Matrix Multiplication Now we will just consider the best way to implement the usual algorithm for matrix multiplication, the one that take 2n

More information

Statistical Models for Automatic Performance Tuning

Statistical Models for Automatic Performance Tuning Statistical Models for Automatic Performance Tuning Richard Vuduc, James Demmel (U.C. Berkeley, EECS) {richie,demmel}@cs.berkeley.edu Jeff Bilmes (Univ. of Washington, EE) bilmes@ee.washington.edu May

More information

Automatic Generation of the HPC Challenge's Global FFT Benchmark for BlueGene/P

Automatic Generation of the HPC Challenge's Global FFT Benchmark for BlueGene/P Automatic Generation of the HPC Challenge's Global FFT Benchmark for BlueGene/P Franz Franchetti 1, Yevgen Voronenko 2, Gheorghe Almasi 3 1 University and SpiralGen, Inc. 2 AccuRay, Inc., 3 IBM Research

More information

How to Write Fast Numerical Code Spring 2012 Lecture 9. Instructor: Markus Püschel TAs: Georg Ofenbeck & Daniele Spampinato

How to Write Fast Numerical Code Spring 2012 Lecture 9. Instructor: Markus Püschel TAs: Georg Ofenbeck & Daniele Spampinato How to Write Fast Numerical Code Spring 2012 Lecture 9 Instructor: Markus Püschel TAs: Georg Ofenbeck & Daniele Spampinato Today Linear algebra software: history, LAPACK and BLAS Blocking (BLAS 3): key

More information

Program Generation, Optimization, and Adaptation: SPIRAL and other efforts

Program Generation, Optimization, and Adaptation: SPIRAL and other efforts Program Generation, Optimization, and Adaptation: SPIRAL and other efforts Markus Püschel Electrical and Computer Engineering University SPIRAL Team: José M. F. Moura (ECE, CMU) James C. Hoe (ECE, CMU)

More information

Too large design space?

Too large design space? 1 Multi-core HW/SW interplay and energy efficiency examples and ideas Lasse Natvig CARD group, Dept. of comp.sci. (IDI) - NTNU & HPC-section NTNU 2 Too large design space? Very young and highly dynamic

More information

Abstract HPF was originally created to simplify high-level programming of parallel computers. The inventors of HPF strove for an easy-to-use language

Abstract HPF was originally created to simplify high-level programming of parallel computers. The inventors of HPF strove for an easy-to-use language Ecient HPF Programs Harald J. Ehold 1 Wilfried N. Gansterer 2 Dieter F. Kvasnicka 3 Christoph W. Ueberhuber 2 1 VCPC, European Centre for Parallel Computing at Vienna E-Mail: ehold@vcpc.univie.ac.at 2

More information

Adaptive Scientific Software Libraries

Adaptive Scientific Software Libraries Adaptive Scientific Software Libraries Lennart Johnsson Advanced Computing Research Laboratory Department of Computer Science University of Houston Challenges Diversity of execution environments Growing

More information

Advanced Computing Research Laboratory. Adaptive Scientific Software Libraries

Advanced Computing Research Laboratory. Adaptive Scientific Software Libraries Adaptive Scientific Software Libraries and Texas Learning and Computation Center and Department of Computer Science University of Houston Challenges Diversity of execution environments Growing complexity

More information

Optimization of Vertical and Horizontal Beamforming Kernels on the PowerPC G4 Processor with AltiVec Technology

Optimization of Vertical and Horizontal Beamforming Kernels on the PowerPC G4 Processor with AltiVec Technology Optimization of Vertical and Horizontal Beamforming Kernels on the PowerPC G4 Processor with AltiVec Technology EE382C: Embedded Software Systems Final Report David Brunke Young Cho Applied Research Laboratories:

More information

The Implementation of Cilk-5 Multithreaded Language

The Implementation of Cilk-5 Multithreaded Language The Implementation of Cilk-5 Multithreaded Language By Matteo Frigo, Charles E. Leiserson, and Keith H Randall Presented by Martin Skou 1/14 The authors Matteo Frigo Chief Scientist and founder of Cilk

More information

SPIRAL Generated Modular FFTs *

SPIRAL Generated Modular FFTs * SPIRAL Generated Modular FFTs * Jeremy Johnson Lingchuan Meng Drexel University * The work is supported by DARPA DESA, NSF, and Intel. Material for SPIRAL overview provided by Franz Francheti, Yevgen Voronenko,,

More information

Parallelism in Spiral

Parallelism in Spiral Parallelism in Spiral Franz Franchetti and the Spiral team (only part shown) Electrical and Computer Engineering Carnegie Mellon University Joint work with Yevgen Voronenko Markus Püschel This work was

More information

Empirical Auto-tuning Code Generator for FFT and Trigonometric Transforms

Empirical Auto-tuning Code Generator for FFT and Trigonometric Transforms Empirical Auto-tuning Code Generator for FFT and Trigonometric Transforms Ayaz Ali and Lennart Johnsson Texas Learning and Computation Center University of Houston, Texas {ayaz,johnsson}@cs.uh.edu Dragan

More information

Automatic Tuning Matrix Multiplication Performance on Graphics Hardware

Automatic Tuning Matrix Multiplication Performance on Graphics Hardware Automatic Tuning Matrix Multiplication Performance on Graphics Hardware Changhao Jiang (cjiang@cs.uiuc.edu) Marc Snir (snir@cs.uiuc.edu) University of Illinois Urbana Champaign GPU becomes more powerful

More information

Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop

Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop http://cscads.rice.edu/ Discussion and Feedback CScADS Autotuning 07 Top Priority Questions for Discussion

More information

SPIRAL Overview: Automatic Generation of DSP Algorithms & More *

SPIRAL Overview: Automatic Generation of DSP Algorithms & More * SPIRAL Overview: Automatic Generation of DSP Algorithms & More * Jeremy Johnson (& SPIRAL Team) Drexel University * The work is supported by DARPA DESA, NSF, and Intel. Material for presentation provided

More information

FFT Program Generation for the Cell BE

FFT Program Generation for the Cell BE FFT Program Generation for the Cell BE Srinivas Chellappa, Franz Franchetti, and Markus Püschel Electrical and Computer Engineering Carnegie Mellon University Pittsburgh PA 15213, USA {schellap, franzf,

More information

Cost Model: Work, Span and Parallelism

Cost Model: Work, Span and Parallelism CSE 539 01/15/2015 Cost Model: Work, Span and Parallelism Lecture 2 Scribe: Angelina Lee Outline of this lecture: 1. Overview of Cilk 2. The dag computation model 3. Performance measures 4. A simple greedy

More information

How to Write Fast Numerical Code

How to Write Fast Numerical Code How to Write Fast Numerical Code Lecture: Autotuning and Machine Learning Instructor: Markus Püschel TA: Gagandeep Singh, Daniele Spampinato, Alen Stojanov Overview Rough classification of autotuning efforts

More information

Algorithms and Computation in Signal Processing

Algorithms and Computation in Signal Processing Algorithms and Computation in Signal Processing special topic course 18-799B spring 2005 22 nd lecture Mar. 31, 2005 Instructor: Markus Pueschel Guest instructor: Franz Franchetti TA: Srinivas Chellappa

More information

Input parameters System specifics, user options. Input parameters size, dim,... FFT Code Generator. Initialization Select fastest execution plan

Input parameters System specifics, user options. Input parameters size, dim,... FFT Code Generator. Initialization Select fastest execution plan Automatic Performance Tuning in the UHFFT Library Dragan Mirković 1 and S. Lennart Johnsson 1 Department of Computer Science University of Houston Houston, TX 7724 mirkovic@cs.uh.edu, johnsson@cs.uh.edu

More information

Plan. Introduction to Multicore Programming. Plan. University of Western Ontario, London, Ontario (Canada) Multi-core processor CPU Coherence

Plan. Introduction to Multicore Programming. Plan. University of Western Ontario, London, Ontario (Canada) Multi-core processor CPU Coherence Plan Introduction to Multicore Programming Marc Moreno Maza University of Western Ontario, London, Ontario (Canada) CS 3101 1 Multi-core Architecture 2 Race Conditions and Cilkscreen (Moreno Maza) Introduction

More information

Scientific Computing. Some slides from James Lambers, Stanford

Scientific Computing. Some slides from James Lambers, Stanford Scientific Computing Some slides from James Lambers, Stanford Dense Linear Algebra Scaling and sums Transpose Rank-one updates Rotations Matrix vector products Matrix Matrix products BLAS Designing Numerical

More information

Performance Analysis of BLAS Libraries in SuperLU_DIST for SuperLU_MCDT (Multi Core Distributed) Development

Performance Analysis of BLAS Libraries in SuperLU_DIST for SuperLU_MCDT (Multi Core Distributed) Development Available online at www.prace-ri.eu Partnership for Advanced Computing in Europe Performance Analysis of BLAS Libraries in SuperLU_DIST for SuperLU_MCDT (Multi Core Distributed) Development M. Serdar Celebi

More information

Provably Efficient Non-Preemptive Task Scheduling with Cilk

Provably Efficient Non-Preemptive Task Scheduling with Cilk Provably Efficient Non-Preemptive Task Scheduling with Cilk V. -Y. Vee and W.-J. Hsu School of Applied Science, Nanyang Technological University Nanyang Avenue, Singapore 639798. Abstract We consider the

More information

Self-adapting Numerical Software for Next Generation Applications Lapack Working Note 157, ICL-UT-02-07

Self-adapting Numerical Software for Next Generation Applications Lapack Working Note 157, ICL-UT-02-07 Self-adapting Numerical Software for Next Generation Applications Lapack Working Note 157, ICL-UT-02-07 Jack Dongarra, Victor Eijkhout December 2002 Abstract The challenge for the development of next generation

More information

FFT Program Generation for Shared Memory: SMP and Multicore

FFT Program Generation for Shared Memory: SMP and Multicore FFT Program Generation for Shared Memory: SMP and Multicore Franz Franchetti, Yevgen Voronenko, and Markus Püschel Electrical and Computer Engineering Carnegie Mellon University Abstract The chip maker

More information

Bindel, Fall 2011 Applications of Parallel Computers (CS 5220) Tuning on a single core

Bindel, Fall 2011 Applications of Parallel Computers (CS 5220) Tuning on a single core Tuning on a single core 1 From models to practice In lecture 2, we discussed features such as instruction-level parallelism and cache hierarchies that we need to understand in order to have a reasonable

More information

Advanced School in High Performance and GRID Computing November Mathematical Libraries. Part I

Advanced School in High Performance and GRID Computing November Mathematical Libraries. Part I 1967-10 Advanced School in High Performance and GRID Computing 3-14 November 2008 Mathematical Libraries. Part I KOHLMEYER Axel University of Pennsylvania Department of Chemistry 231 South 34th Street

More information

Automatic Performance Tuning and Machine Learning

Automatic Performance Tuning and Machine Learning Automatic Performance Tuning and Machine Learning Markus Püschel Computer Science, ETH Zürich with: Frédéric de Mesmay PhD, Electrical and Computer Engineering, Carnegie Mellon PhD and Postdoc openings:

More information

Code optimization techniques

Code optimization techniques & Alberto Bertoldo Advanced Computing Group Dept. of Information Engineering, University of Padova, Italy cyberto@dei.unipd.it May 19, 2009 The Four Commandments 1. The Pareto principle 80% of the effects

More information
