SPIRAL: A Generator for Platform-Adapted Libraries of Signal Processing Algorithms

1 SPIRAL: A Generator for Platform-Adapted Libraries of Signal Processing Algorithms
Markus Püschel
Faculty: José Moura (CMU), Jeremy Johnson (Drexel), Robert Johnson (MathStar Inc.), David Padua (UIUC), Viktor Prasanna (USC), Markus Püschel (CMU), Manuela Veloso (CMU)
Students: Franz Franchetti (TU Vienna), Gavin Haentjens (CMU), Pinit Kumhom (Drexel), Neungsoo Park (USC), David Sepiashvili (CMU), Bryan Singer (CMU), Yevgen Voronenko (Drexel), Jianxin Xiong (UIUC)

2 Sponsor
Work supported by DARPA (DSO), Applied & Computational Mathematics Program, OPAL, through research grant DABT administered by the Army Directorate of Contracting.

3 Moore's Law and High(est) Performance Scientific Computing (single processor, off-the-shelf)
Moore's Law:
- processor-memory bottleneck
- short life cycles of computers
- very complex architectures
- vendor specific
- special instructions (MMX, SSE, FMA, ...)
- undocumented features
Effects on software/algorithms:
- arithmetic cost model not accurate for predicting runtime (one cache miss = floating point ops)
- better performance models hard to get
- best code is machine dependent (registers/caches size, structure)
- hand-tuned code becomes obsolete as fast as it is written
- compiler limitations
- full performance requires (in part) assembly coding
Portable performance requires automation

4 SPIRAL
A library generator for highly optimized signal processing algorithms.
SPIRAL automates, for DSP algorithms that are performance critical:
- Implementation: cuts development costs; code less error-prone
- Optimization: systematic exploration of alternatives both at algorithmic and code level
- Platform-Adaptation: takes advantage of architecture specific features; porting without loss of performance

5 SPIRAL Approach
- given: DSP Transform (DFT, DCT, Wavelets etc.)
- given: Computing Platform (Pentium III, Pentium 4, Athlon, SUN, PowerPC, Alpha, ...)
- SPIRAL: Search Space (Possible Algorithms, Possible Implementations), Performance Evaluation, Intelligent Search
- result: adapted implementation

6 DSP Transform Algorithm

7 DSP Algorithms: Example 4-point DFT
Cooley/Tukey FFT (size 4):
$\mathrm{DFT}_4 = (\mathrm{DFT}_2 \otimes I_2)\, T^4_2\, (I_2 \otimes \mathrm{DFT}_2)\, L^4_2$
- $\mathrm{DFT}_2$: Fourier transform
- $\otimes$: Kronecker product
- $I_2$: identity
- $T^4_2$: diagonal matrix (twiddles)
- $L^4_2$: permutation
The algorithm is a product of structured sparse matrices, written in mathematical notation.
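The Cooley/Tukey factorization on this slide can be checked numerically. The following is a minimal sketch, not SPIRAL code: it assumes the sign convention $\omega_n = e^{-2\pi i/n}$, and the helper names (`kron`, `twiddle`, `stride_perm`, `cooley_tukey`) are hypothetical. It builds $(\mathrm{DFT}_m \otimes I_n)\, T^{mn}_n\, (I_m \otimes \mathrm{DFT}_n)\, L^{mn}_m$ as dense matrices and compares the product with the DFT definition.

```python
import cmath

def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in ra for y in rb] for ra in a for rb in b]

def matmul(a, b):
    """Dense matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

def dft(n):
    """DFT matrix with the assumed convention w = exp(-2*pi*i/n)."""
    w = cmath.exp(-2j * cmath.pi / n)
    return [[w ** (k * l) for l in range(n)] for k in range(n)]

def twiddle(m, n):
    """T^{mn}_n: diagonal matrix with w_{mn}^(i*j) at position i*n + j."""
    w = cmath.exp(-2j * cmath.pi / (m * n))
    d = [w ** (i * j) for i in range(m) for j in range(n)]
    return [[d[r] if r == c else 0 for c in range(m * n)] for r in range(m * n)]

def stride_perm(m, n):
    """L^{mn}_m: output position i*n + j reads input position j*m + i."""
    p = [[0] * (m * n) for _ in range(m * n)]
    for i in range(m):
        for j in range(n):
            p[i * n + j][j * m + i] = 1
    return p

def cooley_tukey(m, n):
    """(DFT_m (x) I_n) T (I_m (x) DFT_n) L as one dense matrix."""
    left = kron(dft(m), identity(n))
    right = kron(identity(m), dft(n))
    return matmul(matmul(matmul(left, twiddle(m, n)), right), stride_perm(m, n))

def max_diff(a, b):
    """Largest entrywise absolute difference between two matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

# Difference between the factorization and the definition for size 4
# (expected to be at rounding-error level).
print(max_diff(cooley_tukey(2, 2), dft(4)))
```

Dense matrices are used only to make the identity easy to verify; a generated program would never form them explicitly.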

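8 DSP Algorithms: Terminology
Transform: $\mathrm{DFT}_n$
- parameterized matrix
Rule: $\mathrm{DFT}_{nm} \to (\mathrm{DFT}_n \otimes I_m)\, D\, (I_n \otimes \mathrm{DFT}_m)\, P$
- a breakdown strategy
- product of sparse matrices
Ruletree: $\mathrm{DFT}_8 \to (\mathrm{DFT}_2, \mathrm{DFT}_4)$, $\mathrm{DFT}_4 \to (\mathrm{DFT}_2, \mathrm{DFT}_2)$
- recursive application of rules
- uniquely defines an algorithm
- efficient representation
- easy manipulation
Formula: $\mathrm{DFT}_8 = (F_2 \otimes I_4)\, D\, (I_2 \otimes (\cdots (I_2 \otimes F_2) \cdots L))\, P$
- few constructs and primitives
- uniquely defines an algorithm
- can be translated into code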
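To illustrate how ruletrees span a space of algorithms, here is a small sketch that enumerates every ruletree for $\mathrm{DFT}_{2^k}$ under two simplifying assumptions of mine: only the binary Cooley/Tukey rule is applied, and $\mathrm{DFT}_2$ is the only base case. A tree is encoded as the integer `1` (a $\mathrm{DFT}_2$ leaf) or a pair of subtrees; this encoding and the function names are hypothetical.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def ruletrees(k):
    """All ruletrees for DFT of size 2**k (leaf = 1 stands for DFT_2)."""
    if k == 1:
        return (1,)
    trees = []
    for a in range(1, k):                 # split 2**k = 2**a * 2**(k - a)
        for left in ruletrees(a):
            for right in ruletrees(k - a):
                trees.append((left, right))
    return tuple(trees)

def log_size(tree):
    """log2 of the transform size at the root of the tree."""
    return 1 if tree == 1 else log_size(tree[0]) + log_size(tree[1])

def show(tree):
    """Readable rendering of a ruletree."""
    if tree == 1:
        return "DFT_2"
    return f"DFT_{2 ** log_size(tree)}({show(tree[0])}, {show(tree[1])})"

for tree in ruletrees(3):
    print(show(tree))
for k in range(1, 8):
    print(k, len(ruletrees(k)))
```

Under these assumptions the counts follow the Catalan numbers, which already hints at why exhaustive search is only practical for very small sizes.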

9 DSP Transforms
- discrete Fourier transform: $\mathrm{DFT}_n = [\exp(2kl\pi i/n)]$
- Walsh-Hadamard transform: $\mathrm{WHT}_{2^k} = \mathrm{DFT}_2 \otimes \cdots \otimes \mathrm{DFT}_2$
- discrete cosine and sine transforms (6 types):
  $\mathrm{DCT}^{(II)}_n = [\cos(k(l + 1/2)\pi/n)]$
  $\mathrm{DCT}^{(IV)}_n = [\cos((k + 1/2)(l + 1/2)\pi/n)]$
  $\mathrm{DST}^{(I)}_n = [\sin(kl\pi/n)]$
- modified discrete cosine transform: $\mathrm{MDCT}_{n \times 2n} = [\cos((k + (n + 1)/2)(l + 1/2)\pi/n)]$
- two-dimensional transform: $T \otimes T$
- Others: filters, discrete wavelet transforms, Haar, Hartley, ...
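A few of these transforms can be written down directly from their definitions. The sketch below assumes the standard readings $\mathrm{DCT}^{(II)}_n = [\cos(k(l+1/2)\pi/n)]$, $\mathrm{DCT}^{(IV)}_n = [\cos((k+1/2)(l+1/2)\pi/n)]$ and $\mathrm{WHT}_{2^k} = \mathrm{DFT}_2 \otimes \cdots \otimes \mathrm{DFT}_2$ ($k$ factors), with row index $k$ and column index $l$ starting at 0; the function names are hypothetical.

```python
import math

def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in ra for y in rb] for ra in a for rb in b]

def dct2(n):
    """DCT type II matrix (unscaled)."""
    return [[math.cos(k * (l + 0.5) * math.pi / n) for l in range(n)]
            for k in range(n)]

def dct4(n):
    """DCT type IV matrix (unscaled)."""
    return [[math.cos((k + 0.5) * (l + 0.5) * math.pi / n) for l in range(n)]
            for k in range(n)]

def wht(k):
    """Walsh-Hadamard transform of size 2**k as a k-fold Kronecker product."""
    f2 = [[1, 1], [1, -1]]
    m = [[1]]
    for _ in range(k):
        m = kron(m, f2)
    return m

for row in wht(2):
    print(row)
```

These dense matrices serve as the reference that any fast algorithm for the same transform has to reproduce.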

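10 Rules = Breakdown Strategies
- $\mathrm{DCT}^{(II)}_2 \to \mathrm{diag}(1, 1/\sqrt{2})\, F_2$ (base case)
- $\mathrm{DCT}^{(II)}_n \to P\, (\mathrm{DCT}^{(II)}_{n/2} \oplus \mathrm{DCT}^{(IV)}_{n/2})\, (I_{n/2} \otimes F_2)\, Q$ (recursive)
- $\mathrm{DCT}^{(IV)}_n \to S\, \mathrm{DCT}^{(II)}_n\, D$ (translation)
- $\mathrm{DCT}^{(IV)}_n \to M_1 \cdots M_r$ (iterative)
- $\mathrm{DFT}_n \to B\, (\mathrm{DCT}^{(I)} \oplus \mathrm{DST}^{(I)})\, C$ (recursive)
- $\mathrm{DFT}_{nm} \to (\mathrm{DFT}_n \otimes I_m)\, D\, (I_n \otimes \mathrm{DFT}_m)\, P$ (recursive)
- filter and circulant rules $F_n(h)$, $\mathrm{Circ}(h)$ (recursive)
- $\mathrm{DWT}_n(W) \to (\mathrm{DWT}_{n/2}(W) \oplus I_{n/2})\, P\, (I_{n/2} \otimes_n W)\, E$ (recursive)
- $\mathrm{WHT}_{2^n} \to \prod_{i=1}^{t} (I_{2^{n_1 + \cdots + n_{i-1}}} \otimes \mathrm{WHT}_{2^{n_i}} \otimes I_{2^{n_{i+1} + \cdots + n_t}})$, where $n = n_1 + \cdots + n_t$ (iterative/recursive)
- $\mathrm{MDCT}_{n \times 2n} \to S\, \mathrm{DCT}^{(IV)}_n\, P$ (translation)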
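Two of the breakdown rules are simple enough to verify with a few lines of code. This sketch assumes my reading of the slide: the WHT rule $\mathrm{WHT}_{2^n} = \prod_{i=1}^{t} (I_{2^{n_1+\cdots+n_{i-1}}} \otimes \mathrm{WHT}_{2^{n_i}} \otimes I_{2^{n_{i+1}+\cdots+n_t}})$ with $n = n_1 + \cdots + n_t$, and the base case $\mathrm{DCT}^{(II)}_2 = \mathrm{diag}(1, 1/\sqrt{2})\, F_2$. All function names are hypothetical.

```python
import math

def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in ra for y in rb] for ra in a for rb in b]

def matmul(a, b):
    """Dense matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

def wht(k):
    """WHT of size 2**k by definition (k-fold Kronecker power of F_2)."""
    m = [[1]]
    for _ in range(k):
        m = kron(m, [[1, 1], [1, -1]])
    return m

def wht_by_rule(parts):
    """Apply the WHT rule for the partition n = parts[0] + ... + parts[-1]."""
    n = sum(parts)
    result = identity(2 ** n)
    done = 0
    for p in parts:
        factor = kron(kron(identity(2 ** done), wht(p)),
                      identity(2 ** (n - done - p)))
        result = matmul(result, factor)
        done += p
    return result

def dct2(n):
    """DCT type II matrix by definition (unscaled)."""
    return [[math.cos(k * (l + 0.5) * math.pi / n) for l in range(n)]
            for k in range(n)]

def dct2_base_case():
    """Right-hand side of the assumed base-case rule for DCT^(II)_2."""
    return matmul([[1, 0], [0, math.sqrt(0.5)]], [[1, 1], [1, -1]])

print(wht_by_rule((1, 2)) == wht(3))
```

Each factor in the WHT product touches the data with a different stride, which is one reason why different partitions of $n$ can have very different runtimes despite identical arithmetic cost.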

11 Formula for a DCT, size 6

12 DSP Transform Algorithm (Formula) Implementation

13 Formulas in SPL
( compose
  ( diagonal ( 2*cos(/6*pi) 2*cos(3/6*pi) 2*cos(5/6*pi) 2*cos(7/6*pi) ) )
  ( permutation ( ) )
  ( tensor ( I 2 ) ( F 2 ) )
  ( permutation ( ) )
  ( direct_sum
    ( compose ( F 2 ) ( diagonal ( sqrt(/2) ) ) )
    ( compose
      ( matrix ( ) ( (-) ) )
      ( diagonal ( cos(3/8*pi)-sin(3/8*pi) sin(3/8*pi) cos(3/8*pi)+sin(3/8*pi) ) )
      ( matrix ( ) ( ) ( ) )
      ( permutation ( 2 ) )

14 SPL Syntax (Subset)
matrix operations:
  (compose formula formula ...)
  (tensor formula formula ...)
  (direct_sum formula formula ...)
direct matrix description:
  (matrix (a11 a12 ...) (a21 a22 ...) ...)
  (diagonal (d1 d2 ...))
  (permutation (p1 p2 ...))
parameterized matrices:
  (I n)
  (F n)
scalars:
  .5, 2/7, cos(..), w(3), pi, .2e-4
definition of new symbols (allows extension of SPL):
  (define name formula)
  (template formula (i-code-list))
directives for code generation:
  #codetype real/complex
  #unroll on/off (controls loop unrolling)
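To make the semantics of these constructs concrete, here is a toy interpreter that turns an SPL formula into a dense matrix. It is only a sketch and not the SPL compiler: it handles `compose`, `tensor`, `direct_sum`, `(I n)`, `(F n)`, `(L mn n)` and `(T mn n)`, and it assumes that `(F n)` is the DFT with $\omega_n = e^{-2\pi i/n}$, that `(L mn n)` is the stride-$n$ permutation, and that `(T mn n)` is the twiddle diagonal. All Python names are hypothetical.

```python
import cmath
import re
from functools import reduce

def kron(a, b):
    return [[x * y for x in ra for y in rb] for ra in a for rb in b]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def direct_sum(a, b):
    """Block-diagonal matrix with blocks a and b."""
    ca, cb = len(a[0]), len(b[0])
    return [row + [0] * cb for row in a] + [[0] * ca + row for row in b]

def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

def dft(n):
    w = cmath.exp(-2j * cmath.pi / n)
    return [[w ** (k * l) for l in range(n)] for k in range(n)]

def stride_perm(size, stride):
    """Assumed meaning of (L size stride): output i*n + j reads input j*stride + i."""
    n = size // stride
    p = [[0] * size for _ in range(size)]
    for i in range(stride):
        for j in range(n):
            p[i * n + j][j * stride + i] = 1
    return p

def twiddle(size, n):
    """Assumed meaning of (T size n): diagonal entry w_size^(i*j) at position i*n + j."""
    w = cmath.exp(-2j * cmath.pi / size)
    d = [w ** (i * j) for i in range(size // n) for j in range(n)]
    return [[d[r] if r == c else 0 for c in range(size)] for r in range(size)]

def parse(tokens):
    """Parse one S-expression, consuming tokens from the front of the list."""
    tok = tokens.pop(0)
    if tok != "(":
        return tok
    items = []
    while tokens[0] != ")":
        items.append(parse(tokens))
    tokens.pop(0)  # drop the closing parenthesis
    return items

PARAMETERIZED = {"I": identity, "F": dft, "L": stride_perm, "T": twiddle}
OPERATIONS = {"compose": matmul, "tensor": kron, "direct_sum": direct_sum}

def evaluate(expr):
    op, args = expr[0], expr[1:]
    if op in PARAMETERIZED:
        return PARAMETERIZED[op](*(int(a) for a in args))
    if op in OPERATIONS:
        return reduce(OPERATIONS[op], [evaluate(a) for a in args])
    raise ValueError(f"unsupported SPL construct: {op}")

def spl(text):
    """Evaluate an SPL formula given as a string."""
    return evaluate(parse(re.findall(r"[()]|[^\s()]+", text)))

def max_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

fft4 = spl("(compose (tensor (F 2) (I 2)) (T 4 2) (tensor (I 2) (F 2)) (L 4 2))")
print(max_diff(fft4, dft(4)))
```

The real compiler translates such a formula into loop or straight-line code instead of multiplying dense matrices; the interpreter only shows what the formula denotes.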

15 SPL Compiler, 4-point FFT
Fast algorithm as formula, as SPL program:
  (compose (tensor (F 2) (I 2)) (T 4 2) (tensor (I 2) (F 2)) (L 4 2))
Generated code, #codetype complex:
  f = x() + x(3)
  f = x() - x(3)
  f2 = x(2) + x(4)
  f3 = x(2) - x(4)
  f4 = (.d,-.d)*f(3)
  y() = f + f2
  y(2) = f - f2
  y(3) = f + f4
  y(4) = f - f4
Generated code, #codetype real:
  r = x() + x(5)
  r = x() - x(5)
  r2 = x(2) + x(6)
  r3 = x(2) - x(6)
  r4 = x(3) + x(7)
  r5 = x(3) - x(7)
  r6 = x(4) + x(8)
  r7 = x(4) - x(8)
  y() = r + r4
  y(2) = r + r5
  y(3) = r - r4
  y(4) = r - r5
  y(5) = r2 + r7
  y(6) = r3 - r6
  y(7) = r2 - r7
  y(8) = r3 + r6
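Straight-line code of this kind can be derived by hand from the 4-point formula. The sketch below is my own derivation, assuming $\omega_4 = -i$ and, for the real variant, interleaved real/imaginary storage; the temporary names `f0`..`f4` and `r0`..`r7` are labels I chose and are not transcribed from the listing. Both variants are compared against the DFT definition.

```python
import cmath

def dft_direct(x):
    """Reference: DFT by definition with w = exp(-2*pi*i/n)."""
    n = len(x)
    return [sum(x[l] * cmath.exp(-2j * cmath.pi * k * l / n) for l in range(n))
            for k in range(n)]

def fft4_complex(x):
    """Unrolled 4-point FFT on complex inputs (complex code type)."""
    f0 = x[0] + x[2]          # (I_2 (x) F_2) applied after the stride permutation
    f1 = x[0] - x[2]
    f2 = x[1] + x[3]
    f3 = x[1] - x[3]
    f4 = -1j * f3             # the only non-trivial twiddle factor
    return [f0 + f2, f1 + f4, f0 - f2, f1 - f4]   # (F_2 (x) I_2)

def fft4_real(x):
    """Same algorithm on 8 reals: x = [re0, im0, re1, im1, re2, im2, re3, im3]."""
    r0 = x[0] + x[4]
    r1 = x[0] - x[4]
    r2 = x[1] + x[5]
    r3 = x[1] - x[5]
    r4 = x[2] + x[6]
    r5 = x[2] - x[6]
    r6 = x[3] + x[7]
    r7 = x[3] - x[7]
    # Multiplying (r5 + i*r7) by -i gives (r7 - i*r5), so no multiplications remain.
    return [r0 + r4, r2 + r6,
            r1 + r7, r3 - r5,
            r0 - r4, r2 - r6,
            r1 - r7, r3 + r5]

x = [1 + 2j, 3 - 1j, 0.5j, -2 + 0j]
print(fft4_complex(x))
```

The real variant shows why the code type matters: the multiplication by $-i$ disappears into index swaps and sign changes, leaving 16 real additions.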

16 SPL Compiler: Summary
Inputs: SPL program (SPL formula, symbol definition, template definition)
Compiler stages:
1. Parsing (produces abstract syntax tree, symbol table, template table)
2. Intermediate Code Generation (produces I-Code)
3. Intermediate Code Restructuring (produces I-Code)
4. Optimization (produces I-Code)
5. Target Code Generation (produces C, FORTRAN function)
Built-in optimizations:
- static single assignment (SSA) code
- no reuse of temporary vars
- only scalar temporary vars
- constants precomputed
- limited CSE
Extensible through templates

17 SIMD Short Vector Extensions
SIMD vector length = 4 (4-way)
- Extension to instruction set architecture
- Available on most current architectures (SSE on Pentium, AltiVec on Motorola G4)
- Originally for multimedia (like MMX for integers)
- Requires fine grain parallelism
- Large potential speed-up
Problems:
- SIMD instructions are architecture specific
- No common API (usually assembly hand coding)
- Performance very sensitive to memory access
- Automatic vectorization very limited

18 Vector code generation from SPL formulas
Naturally vectorizable construct: $A \otimes I_4$ (4 = vector length)
(Current) generic construct, completely vectorizable:
$\prod_{i=1}^{k} P_i\, D_i\, (A_i \otimes I_\nu)\, E_i\, Q_i$
- $P_i$, $Q_i$: permutations
- $D_i$, $E_i$: diagonals
- $A_i$: arbitrary formulas
- $\nu$: SIMD vector length
Vectorization in two steps:
1. Formula manipulation using manipulation rules
2. Code generation (vector code + C code)
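Why $A \otimes I_\nu$ vectorizes naturally can be seen in a few lines: every scalar operation of $A$ becomes the same operation on whole vectors of length $\nu$. The sketch below models a SIMD register as a Python list of length `nu`; it illustrates the idea only, and the function names are hypothetical.

```python
def kron_identity_apply(a, x, nu):
    """Compute y = (A (x) I_nu) x using only operations on length-nu vectors."""
    n_out, n_in = len(a), len(a[0])
    # Split x into n_in contiguous "registers" of nu elements each.
    blocks = [x[j * nu:(j + 1) * nu] for j in range(n_in)]
    y = []
    for i in range(n_out):
        acc = [0] * nu
        for j in range(n_in):
            # One vector multiply-add: acc += a[i][j] * blocks[j]
            acc = [s + a[i][j] * v for s, v in zip(acc, blocks[j])]
        y.extend(acc)
    return y

def dense_apply(a, x, nu):
    """Reference: build A (x) I_nu explicitly and multiply."""
    eye = [[1 if r == c else 0 for c in range(nu)] for r in range(nu)]
    big = [[p * q for p in ra for q in rb] for ra in a for rb in eye]
    return [sum(m * v for m, v in zip(row, x)) for row in big]

f2 = [[1, 1], [1, -1]]
print(kron_identity_apply(f2, list(range(8)), 4))
```

All loads and stores touch contiguous, aligned blocks, so no shuffling is needed; the permutations and diagonals of the generic construct are what require extra care in real vector code.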

19 DSP Transform Algorithm (Formula) Search Implementation

20 Why Search?
DCT, type IV, size 6
- ~3 formulas
- maaaany different formulas
- large spread in runtimes, even for modest size
- not due to arithmetic cost

21 Search Methods available in SPIRAL
- Exhaustive Search
- Dynamic Programming (DP)
- Random Search
- Hill Climbing
- STEER (similar to a genetic algorithm)

Method         Possible Sizes   Formulas Timed   Results
Exhaust        Very small       All              Best
DP             All              s-s              (very) good
Random         All              User decided     fair/good
Hill Climbing  All              s-s              Good
STEER          All              s-s              (very) good

Search over algorithm space and implementation options (degree of unrolling)
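As an illustration of the dynamic programming strategy, the following sketch searches over binary ruletrees for $\mathrm{DFT}_{2^k}$. It assumes that the best tree for a size can be built from the best trees for smaller sizes, which is the usual DP assumption and need not hold for real runtimes. The function `toy_cost` is a made-up stand-in for timing generated code, and the tree encoding (`1` for a $\mathrm{DFT}_2$ leaf, pairs for splits) is hypothetical.

```python
def log_size(tree):
    """log2 of the transform size at the root of the tree."""
    return 1 if tree == 1 else log_size(tree[0]) + log_size(tree[1])

def toy_cost(tree):
    """Synthetic cost model (an assumption, not a measured runtime)."""
    if tree == 1:
        return 1.0
    a, b = log_size(tree[0]), log_size(tree[1])
    # The left child runs 2**b times, the right child 2**a times; the last
    # term is an invented overhead that penalizes large left children.
    return (2 ** b * toy_cost(tree[0]) + 2 ** a * toy_cost(tree[1])
            + 2 ** (a + b) * (1 + 0.1 * a))

def dp_search(k_max, measure):
    """Best ruletree for each size 2**k, k = 1..k_max, built bottom-up."""
    best = {1: 1}
    for k in range(2, k_max + 1):
        # Only k - 1 candidates are evaluated per size, each reusing
        # the best subtrees found so far.
        candidates = [(best[a], best[k - a]) for a in range(1, k)]
        best[k] = min(candidates, key=measure)
    return best

best = dp_search(6, toy_cost)
print(best[6], toy_cost(best[6]))
```

In an actual system `measure` would compile and time the implementation for each candidate, which is why the number of formulas timed matters so much in the table above.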

22 STEER
- Population n
- Mutation: expand differently
- Cross-Breeding: swap expansions
- Population n+1
- Survival of Fittest
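The three ingredients on this slide can be sketched as a small evolutionary search over ruletrees. This is a STEER-like toy and not the STEER algorithm itself: the tree encoding (`1` for a $\mathrm{DFT}_2$ leaf, pairs for splits), the parameters, and the synthetic `toy_cost` (standing in for measured runtime) are all assumptions of mine.

```python
import random

def log_size(tree):
    return 1 if tree == 1 else log_size(tree[0]) + log_size(tree[1])

def toy_cost(tree):
    """Synthetic cost model (an assumption, not a measured runtime)."""
    if tree == 1:
        return 1.0
    a, b = log_size(tree[0]), log_size(tree[1])
    return (2 ** b * toy_cost(tree[0]) + 2 ** a * toy_cost(tree[1])
            + 2 ** (a + b) * (1 + 0.1 * a))

def random_tree(k, rng):
    """Random binary ruletree for size 2**k."""
    if k == 1:
        return 1
    a = rng.randint(1, k - 1)
    return (random_tree(a, rng), random_tree(k - a, rng))

def subtrees(tree, path=()):
    """Yield (path, subtree) for every node of the tree."""
    yield path, tree
    if tree != 1:
        yield from subtrees(tree[0], path + (0,))
        yield from subtrees(tree[1], path + (1,))

def replace(tree, path, new):
    """Copy of tree with the node at path replaced by new."""
    if not path:
        return new
    children = list(tree)
    children[path[0]] = replace(tree[path[0]], path[1:], new)
    return tuple(children)

def mutate(tree, rng):
    """Mutation: expand one randomly chosen node differently."""
    path, sub = rng.choice(list(subtrees(tree)))
    return replace(tree, path, random_tree(log_size(sub), rng))

def crossover(t1, t2, rng):
    """Cross-breeding: copy an expansion of the same size from t2 into t1."""
    path, sub = rng.choice(list(subtrees(t1)))
    matches = [s for _, s in subtrees(t2) if log_size(s) == log_size(sub)]
    return replace(t1, path, rng.choice(matches)) if matches else t1

def steer_like(k, measure, pop_size=20, generations=30, seed=0):
    rng = random.Random(seed)
    population = [random_tree(k, rng) for _ in range(pop_size)]
    for _ in range(generations):
        offspring = [mutate(t, rng) for t in population]
        offspring += [crossover(rng.choice(population), rng.choice(population), rng)
                      for _ in population]
        # Survival of the fittest: keep the best distinct trees.
        unique = list(dict.fromkeys(population + offspring))
        population = sorted(unique, key=measure)[:pop_size]
    return population[0]

best = steer_like(6, toy_cost)
print(best, toy_cost(best))
```

Because parents compete with their offspring, the best cost in the population never gets worse from one generation to the next.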

23 Learning to Generate Fast Algorithms
- Learns from given dataset (formulas + runtimes) how to design a fast algorithm (breakdown strategy)
- Learns from a transform of one size, generates the best algorithm for many sizes
- Tested for DFT and WHT

24 SPIRAL

25 SPIRAL system
- The user specifies a DSP transform, goes for a coffee (or an espresso for small transforms), and comes back to a platform-adapted implementation.
- Formula Generator: produces a fast algorithm as SPL formula
- SPL Compiler: produces C/Fortran/SIMD code
- Search Engine: receives the runtime (performance); controls algorithm generation and implementation options

26 SPIRAL System: Summary
- Available for download
- Easy installation (Unix: configure/make; Windows: InstallShield)
- Unix/Linux and Windows 98/ME/NT/2000/XP
- Current transforms: DFT, DHT, WHT, RHT, DCT/DST types I-IV, MDCT, Filters, Wavelets, Toeplitz, Circulants
- Extensible

27 Experimental Results
- search methods (applicable to all transforms)
- high performance code (compared with FFTW)
- different transforms

28 Vectorized code
[Figure: three plots of normalized runtime versus transform size, for WHT, DFT, and 2-dim DCT (Pentium III, SSE)]
- speed-ups up to a factor of 2.5
- beats hand-tuned Intel MKL (< 24)

29 Conclusions
- Closing the gap between math domain (algorithms) and implementation domain (programs)
  - Mathematical computer representation of algorithms
  - Automatic translation of algorithms into code
  - Implement constructs not transforms
- Optimization as intelligent search/learning in the space of alternatives
  - High level: Mathematical manipulation of algorithms
  - Low level: Coding degrees of freedom

Algorithms and Computation in Signal Processing

Algorithms and Computation in Signal Processing Algorithms and Computation in Signal Processing special topic course 8-799B spring 25 24 th and 25 th Lecture Apr. 7 and 2, 25 Instructor: Markus Pueschel TA: Srinivas Chellappa Research Projects Presentations

More information

Generating Parallel Transforms Using Spiral

Generating Parallel Transforms Using Spiral Generating Parallel Transforms Using Spiral Franz Franchetti Yevgen Voronenko Markus Püschel Part of the Spiral Team Electrical and Computer Engineering Carnegie Mellon University Sponsors: DARPA DESA

More information

Joint Runtime / Energy Optimization and Hardware / Software Partitioning of Linear Transforms

Joint Runtime / Energy Optimization and Hardware / Software Partitioning of Linear Transforms Joint Runtime / Energy Optimization and Hardware / Software Partitioning of Linear Transforms Paolo D Alberto, Franz Franchetti, Peter A. Milder, Aliaksei Sandryhaila, James C. Hoe, José M. F. Moura, Markus

More information

System Demonstration of Spiral: Generator for High-Performance Linear Transform Libraries

System Demonstration of Spiral: Generator for High-Performance Linear Transform Libraries System Demonstration of Spiral: Generator for High-Performance Linear Transform Libraries Yevgen Voronenko, Franz Franchetti, Frédéric de Mesmay, and Markus Püschel Department of Electrical and Computer

More information

A SIMD Vectorizing Compiler for Digital Signal Processing Algorithms Λ

A SIMD Vectorizing Compiler for Digital Signal Processing Algorithms Λ A SIMD Vectorizing Compiler for Digital Signal Processing Algorithms Λ Franz Franchetti Applied and Numerical Mathematics Technical University of Vienna, Austria franz.franchetti@tuwien.ac.at Markus Püschel

More information

Stochastic Search for Signal Processing Algorithm Optimization

Stochastic Search for Signal Processing Algorithm Optimization Stochastic Search for Signal Processing Algorithm Optimization Bryan Singer Manuela Veloso May, 01 CMU-CS-01-137 School of Computer Science Carnegie Mellon University Pittsburgh, PA 1213 Abstract Many

More information

Automatic Performance Tuning. Jeremy Johnson Dept. of Computer Science Drexel University

Automatic Performance Tuning. Jeremy Johnson Dept. of Computer Science Drexel University Automatic Performance Tuning Jeremy Johnson Dept. of Computer Science Drexel University Outline Scientific Computation Kernels Matrix Multiplication Fast Fourier Transform (FFT) Automated Performance Tuning

More information

How to Write Fast Code , spring rd Lecture, Apr. 9 th

How to Write Fast Code , spring rd Lecture, Apr. 9 th How to Write Fast Code 18-645, spring 2008 23 rd Lecture, Apr. 9 th Instructor: Markus Püschel TAs: Srinivas Chellappa (Vas) and Frédéric de Mesmay (Fred) Research Project Current status Today Papers due

More information

Program Generation, Optimization, and Adaptation: SPIRAL and other efforts

Program Generation, Optimization, and Adaptation: SPIRAL and other efforts Program Generation, Optimization, and Adaptation: SPIRAL and other efforts Markus Püschel Electrical and Computer Engineering University SPIRAL Team: José M. F. Moura (ECE, CMU) James C. Hoe (ECE, CMU)

More information

Stochastic Search for Signal Processing Algorithm Optimization

Stochastic Search for Signal Processing Algorithm Optimization Stochastic Search for Signal Processing Algorithm Optimization Bryan Singer and Manuela Veloso Computer Science Department Carnegie Mellon University Pittsburgh, PA 1213 Email: {bsinger+, mmv+}@cs.cmu.edu

More information

Parallelism in Spiral

Parallelism in Spiral Parallelism in Spiral Franz Franchetti and the Spiral team (only part shown) Electrical and Computer Engineering Carnegie Mellon University Joint work with Yevgen Voronenko Markus Püschel This work was

More information

SPIRAL Generated Modular FFTs *

SPIRAL Generated Modular FFTs * SPIRAL Generated Modular FFTs * Jeremy Johnson Lingchuan Meng Drexel University * The work is supported by DARPA DESA, NSF, and Intel. Material for SPIRAL overview provided by Franz Francheti, Yevgen Voronenko,,

More information

SPIRAL Overview: Automatic Generation of DSP Algorithms & More *

SPIRAL Overview: Automatic Generation of DSP Algorithms & More * SPIRAL Overview: Automatic Generation of DSP Algorithms & More * Jeremy Johnson (& SPIRAL Team) Drexel University * The work is supported by DARPA DESA, NSF, and Intel. Material for presentation provided

More information

SPIRAL: Code Generation for DSP Transforms

SPIRAL: Code Generation for DSP Transforms PROCEEDINGS OF THE IEEE SPECIAL ISSUE ON PROGRAM GENERATION, OPTIMIZATION, AND ADAPTATION 1 SPIRAL: Code Generation for DSP Transforms Markus Püschel, José M. F. Moura, Jeremy Johnson, David Padua, Manuela

More information

Automatic Derivation and Implementation of Signal Processing Algorithms

Automatic Derivation and Implementation of Signal Processing Algorithms Automatic Derivation and Implementation of Signal Processing Algorithms Sebastian Egner Philips Research Laboratories Prof. Hostlaan 4, WY21 5656 AA Eindhoven, The Netherlands sebastian.egner@philips.com

More information

Tackling Parallelism With Symbolic Computation

Tackling Parallelism With Symbolic Computation Tackling Parallelism With Symbolic Computation Markus Püschel and the Spiral team (only part shown) With: Franz Franchetti Yevgen Voronenko Electrical and Computer Engineering Carnegie Mellon University

More information

Performance Analysis of a Family of WHT Algorithms

Performance Analysis of a Family of WHT Algorithms Performance Analysis of a Family of WHT Algorithms Michael Andrews and Jeremy Johnson Department of Computer Science Drexel University Philadelphia, PA USA January, 7 Abstract This paper explores the correlation

More information

Automating the Modeling and Optimization of the Performance of Signal Transforms

Automating the Modeling and Optimization of the Performance of Signal Transforms IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 50, NO. 8, AUGUST 2002 2003 Automating the Modeling and Optimization of the Performance of Signal Transforms Bryan Singer and Manuela M. Veloso Abstract Fast

More information

Automatic Performance Programming?

Automatic Performance Programming? A I n Automatic Performance Programming? Markus Püschel Computer Science m128i t3 = _mm_unpacklo_epi16(x[0], X[1]); m128i t4 = _mm_unpackhi_epi16(x[0], X[1]); m128i t7 = _mm_unpacklo_epi16(x[2], X[3]);

More information

Automatic Generation of the HPC Challenge's Global FFT Benchmark for BlueGene/P

Automatic Generation of the HPC Challenge's Global FFT Benchmark for BlueGene/P Automatic Generation of the HPC Challenge's Global FFT Benchmark for BlueGene/P Franz Franchetti 1, Yevgen Voronenko 2, Gheorghe Almasi 3 1 University and SpiralGen, Inc. 2 AccuRay, Inc., 3 IBM Research

More information

Statistical Evaluation of a Self-Tuning Vectorized Library for the Walsh Hadamard Transform

Statistical Evaluation of a Self-Tuning Vectorized Library for the Walsh Hadamard Transform Statistical Evaluation of a Self-Tuning Vectorized Library for the Walsh Hadamard Transform Michael Andrews and Jeremy Johnson Department of Computer Science, Drexel University, Philadelphia, PA USA Abstract.

More information

Formal Loop Merging for Signal Transforms

Formal Loop Merging for Signal Transforms Formal Loop Merging for Signal Transforms Franz Franchetti Yevgen S. Voronenko Markus Püschel Department of Electrical & Computer Engineering Carnegie Mellon University This work was supported by NSF through

More information

Library Generation For Linear Transforms

Library Generation For Linear Transforms Library Generation For Linear Transforms Yevgen Voronenko May RS RS 3 RS RS RS RS 7 RS 5 Dissertation Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy in Electrical

More information

Mixed Data Layout Kernels for Vectorized Complex Arithmetic

Mixed Data Layout Kernels for Vectorized Complex Arithmetic Mixed Data Layout Kernels for Vectorized Complex Arithmetic Doru T. Popovici, Franz Franchetti, Tze Meng Low Department of Electrical and Computer Engineering Carnegie Mellon University Email: {dpopovic,

More information

Learning to Construct Fast Signal Processing Implementations

Learning to Construct Fast Signal Processing Implementations Journal of Machine Learning Research 3 (2002) 887-919 Submitted 12/01; Published 12/02 Learning to Construct Fast Signal Processing Implementations Bryan Singer Manuela Veloso Department of Computer Science

More information

Advanced Computing Research Laboratory. Adaptive Scientific Software Libraries

Advanced Computing Research Laboratory. Adaptive Scientific Software Libraries Adaptive Scientific Software Libraries and Texas Learning and Computation Center and Department of Computer Science University of Houston Challenges Diversity of execution environments Growing complexity

More information

FFT Program Generation for the Cell BE

FFT Program Generation for the Cell BE FFT Program Generation for the Cell BE Srinivas Chellappa, Franz Franchetti, and Markus Püschel Electrical and Computer Engineering Carnegie Mellon University Pittsburgh PA 15213, USA {schellap, franzf,

More information

Generating SIMD Vectorized Permutations

Generating SIMD Vectorized Permutations Generating SIMD Vectorized Permutations Franz Franchetti and Markus Püschel Electrical and Computer Engineering, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213 {franzf, pueschel}@ece.cmu.edu

More information

Automatically Optimized FFT Codes for the BlueGene/L Supercomputer

Automatically Optimized FFT Codes for the BlueGene/L Supercomputer Automatically Optimized FFT Codes for the BlueGene/L Supercomputer Franz Franchetti, Stefan Kral, Juergen Lorenz, Markus Püschel, Christoph W. Ueberhuber, and Peter Wurzinger Institute for Analysis and

More information

Computer Generation of Efficient Software Viterbi Decoders

Computer Generation of Efficient Software Viterbi Decoders Computer Generation of Efficient Software Viterbi Decoders Frédéric de Mesmay, Srinivas Chellappa, Franz Franchetti, and Markus Püschel Electrical and Computer Engineering Carnegie Mellon University Pittsburgh

More information

PARALLELIZATION OF WAVELET FILTERS USING SIMD EXTENSIONS

PARALLELIZATION OF WAVELET FILTERS USING SIMD EXTENSIONS c World Scientific Publishing Company PARALLELIZATION OF WAVELET FILTERS USING SIMD EXTENSIONS RADE KUTIL and PETER EDER Department of Computer Sciences, University of Salzburg Jakob Haringerstr. 2, 5020

More information

Spiral: Program Generation for Linear Transforms and Beyond

Spiral: Program Generation for Linear Transforms and Beyond Spiral: Program Generation for Linear Transforms and Beyond Franz Franchetti and the Spiral team (only part shown) ECE, Carnegie Mellon University www.spiral.net Co-Founder, SpiralGen www.spiralgen.com

More information

Aggressive Scheduling for Numerical Programs

Aggressive Scheduling for Numerical Programs Aggressive Scheduling for Numerical Programs Richard Veras, Flavio Cruz, Berkin Akin May 2, 2012 Abstract Certain classes of algorithms are still hard to optimize by modern compilers due to the differences

More information

Automatically Tuned FFTs for BlueGene/L s Double FPU

Automatically Tuned FFTs for BlueGene/L s Double FPU Automatically Tuned FFTs for BlueGene/L s Double FPU Franz Franchetti, Stefan Kral, Juergen Lorenz, Markus Püschel, and Christoph W. Ueberhuber Institute for Analysis and Scientific Computing, Vienna University

More information

Automatic Performance Tuning and Machine Learning

Automatic Performance Tuning and Machine Learning Automatic Performance Tuning and Machine Learning Markus Püschel Computer Science, ETH Zürich with: Frédéric de Mesmay PhD, Electrical and Computer Engineering, Carnegie Mellon PhD and Postdoc openings:

More information

Spiral. Computer Generation of Performance Libraries. José M. F. Moura Markus Püschel Franz Franchetti & the Spiral Team. Performance.

Spiral. Computer Generation of Performance Libraries. José M. F. Moura Markus Püschel Franz Franchetti & the Spiral Team. Performance. Spiral Computer Generation of Performance Libraries José M. F. Moura Markus Püschel Franz Franchetti & the Spiral Team Platforms Performance Applications What is Spiral? Traditionally Spiral Approach Spiral

More information

Formal Loop Merging for Signal Transforms

Formal Loop Merging for Signal Transforms Formal Loop Merging for Signal Transforms Franz Franchetti Yevgen Voronenko Markus Püschel Department of Electrical and Computer Engineering Carnegie Mellon University {franzf, yvoronen, pueschel}@ece.cmu.edu

More information

A Fast Fourier Transform Compiler

A Fast Fourier Transform Compiler RETROSPECTIVE: A Fast Fourier Transform Compiler Matteo Frigo Vanu Inc., One Porter Sq., suite 18 Cambridge, MA, 02140, USA athena@fftw.org 1. HOW FFTW WAS BORN FFTW (the fastest Fourier transform in the

More information

FFT ALGORITHMS FOR MULTIPLY-ADD ARCHITECTURES

FFT ALGORITHMS FOR MULTIPLY-ADD ARCHITECTURES FFT ALGORITHMS FOR MULTIPLY-ADD ARCHITECTURES FRANCHETTI Franz, (AUT), KALTENBERGER Florian, (AUT), UEBERHUBER Christoph W. (AUT) Abstract. FFTs are the single most important algorithms in science and

More information

Cache miss analysis of WHT algorithms

Cache miss analysis of WHT algorithms 2005 International Conference on Analysis of Algorithms DMTCS proc. AD, 2005, 115 124 Cache miss analysis of WHT algorithms Mihai Furis 1 and Paweł Hitczenko 2 and Jeremy Johnson 1 1 Dept. of Computer

More information

Algorithms and Computation in Signal Processing

Algorithms and Computation in Signal Processing Algorithms and Computation in Signal Processing special topic course 18-799B spring 2005 22 nd lecture Mar. 31, 2005 Instructor: Markus Pueschel Guest instructor: Franz Franchetti TA: Srinivas Chellappa

More information

Introduction to Parallel Computing

Introduction to Parallel Computing Introduction to Parallel Computing W. P. Petersen Seminar for Applied Mathematics Department of Mathematics, ETHZ, Zurich wpp@math. ethz.ch P. Arbenz Institute for Scientific Computing Department Informatik,

More information

How to Write Fast Numerical Code

How to Write Fast Numerical Code How to Write Fast Numerical Code Lecture: Autotuning and Machine Learning Instructor: Markus Püschel TA: Gagandeep Singh, Daniele Spampinato, Alen Stojanov Overview Rough classification of autotuning efforts

More information

Multimedia Macros for Portable Optimized Programs

Multimedia Macros for Portable Optimized Programs Problem Multimedia Macros for Portable Optimized Programs Juan Carlos Rojas and Miriam Leeser Department of Electrical and Computer Engineering Northeastern University, Boston Abstract Multimedia architectures

More information

Special Issue on Program Generation, Optimization, and Platform Adaptation /$ IEEE

Special Issue on Program Generation, Optimization, and Platform Adaptation /$ IEEE Scanning the Issue Special Issue on Program Generation, Optimization, and Platform Adaptation This special issue of the PROCEEDINGS OF THE IEEE offers an overview of ongoing efforts to facilitate the development

More information

Computer Generation of Hardware for Linear Digital Signal Processing Transforms

Computer Generation of Hardware for Linear Digital Signal Processing Transforms Computer Generation of Hardware for Linear Digital Signal Processing Transforms PETER MILDER, FRANZ FRANCHETTI, and JAMES C. HOE, Carnegie Mellon University MARKUS PÜSCHEL, ETH Zurich Linear signal transforms

More information

Program Composition and Optimization: An Introduction

Program Composition and Optimization: An Introduction Program Composition and Optimization: An Introduction Christoph W. Kessler 1, Welf Löwe 2, David Padua 3, and Markus Püschel 4 1 Linköping University, Linköping, Sweden, chrke@ida.liu.se 2 Linnaeus University,

More information

FFT Program Generation for Shared Memory: SMP and Multicore

FFT Program Generation for Shared Memory: SMP and Multicore FFT Program Generation for Shared Memory: SMP and Multicore Franz Franchetti, Yevgen Voronenko, and Markus Püschel Electrical and Computer Engineering Carnegie Mellon University Abstract The chip maker

More information

RECENTLY, the characteristics of signal transform algorithms

RECENTLY, the characteristics of signal transform algorithms 2120 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 52, NO. 7, JULY 2004 Dynamic Data Layouts for Cache-Conscious Implementation of a Class of Signal Transforms Neungsoo Park, Member, IEEE, and Viktor K.

More information

How to Write Fast Numerical Code Spring 2011 Lecture 22. Instructor: Markus Püschel TA: Georg Ofenbeck

How to Write Fast Numerical Code Spring 2011 Lecture 22. Instructor: Markus Püschel TA: Georg Ofenbeck How to Write Fast Numerical Code Spring 2011 Lecture 22 Instructor: Markus Püschel TA: Georg Ofenbeck Schedule Today Lecture Project presentations 10 minutes each random order random speaker 10 Final code

More information

Final Review. Image Processing CSE 166 Lecture 18

Final Review. Image Processing CSE 166 Lecture 18 Final Review Image Processing CSE 166 Lecture 18 Topics covered Basis vectors Matrix based transforms Wavelet transform Image compression Image watermarking Morphological image processing Segmentation

More information

Automatic Tuning Matrix Multiplication Performance on Graphics Hardware

Automatic Tuning Matrix Multiplication Performance on Graphics Hardware Automatic Tuning Matrix Multiplication Performance on Graphics Hardware Changhao Jiang (cjiang@cs.uiuc.edu) Marc Snir (snir@cs.uiuc.edu) University of Illinois Urbana Champaign GPU becomes more powerful

More information

A Package for Generating, Manipulating, and Testing Convolution Algorithms. Anthony F. Breitzman and Jeremy R. Johnson

A Package for Generating, Manipulating, and Testing Convolution Algorithms. Anthony F. Breitzman and Jeremy R. Johnson A Package for Generating, Manipulating, and Testing Convolution Algorithms Anthony F. Breitzman and Jeremy R. Johnson Technical Report DU-CS-03-05 Department of Computer Science Drexel University Philadelphia,

More information

Bandit-Based Optimization on Graphs with Application to Library Performance Tuning

Bandit-Based Optimization on Graphs with Application to Library Performance Tuning Bandit-Based Optimization on Graphs with Application to Library Performance Tuning Frédéric de Mesmay fdemesma@ece.cmu.edu Arpad Rimmel rimmel@lri.fr Yevgen Voronenko yvoronen@ece.cmu.edu Markus Püschel

More information

How to Write Fast Numerical Code Spring 2012 Lecture 20. Instructor: Markus Püschel TAs: Georg Ofenbeck & Daniele Spampinato

How to Write Fast Numerical Code Spring 2012 Lecture 20. Instructor: Markus Püschel TAs: Georg Ofenbeck & Daniele Spampinato How to Write Fast Numerical Code Spring 2012 Lecture 20 Instructor: Markus Püschel TAs: Georg Ofenbeck & Daniele Spampinato Planning Today Lecture Project meetings Project presentations 10 minutes each

More information

Algorithms and Computation in Signal Processing

Algorithms and Computation in Signal Processing Algorithms and Computation in Signal Processing special topic course 18-799B spring 2005 14 th Lecture Feb. 24, 2005 Instructor: Markus Pueschel TA: Srinivas Chellappa Course Evaluation Email sent out

More information

The Design and Implementation of FFTW3

The Design and Implementation of FFTW3 The Design and Implementation of FFTW3 MATTEO FRIGO AND STEVEN G. JOHNSON Invited Paper FFTW is an implementation of the discrete Fourier transform (DFT) that adapts to the hardware in order to maximize

More information

FFT Program Generation for Shared Memory: SMP and Multicore

FFT Program Generation for Shared Memory: SMP and Multicore FFT Program Generation for Shared Memory: SMP and Multicore Franz Franchetti, Yevgen Voronenko, and Markus Püschel Electrical and Computer Engineering Carnegie Mellon University {franzf, yvoronen, pueschel@ece.cmu.edu

More information

How to Write Fast Numerical Code

How to Write Fast Numerical Code How to Write Fast Numerical Code Lecture: Optimizing FFT, FFTW Instructor: Markus Püschel TA: Georg Ofenbeck & Daniele Spampinato Rest of Semester Today Lecture Project meetings Project presentations 10

More information

Empirical Auto-tuning Code Generator for FFT and Trigonometric Transforms

Empirical Auto-tuning Code Generator for FFT and Trigonometric Transforms Empirical Auto-tuning Code Generator for FFT and Trigonometric Transforms Ayaz Ali and Lennart Johnsson Texas Learning and Computation Center University of Houston, Texas {ayaz,johnsson}@cs.uh.edu Dragan

More information
