Program Generation, Optimization, and Adaptation: SPIRAL and other efforts

Size: px
Start display at page:

Download "Program Generation, Optimization, and Adaptation: SPIRAL and other efforts"

Transcription

1 Program Generation, Optimization, and Adaptation: SPIRAL and other efforts Markus Püschel Electrical and Computer Engineering University SPIRAL Team: José M. F. Moura (ECE, CMU) James C. Hoe (ECE, CMU) Jeremy Johnson (CS, Drexel) David Padua (CS, UIUC) Manuela Veloso (CS, CMU) Bryan W. Singer (CS, CMU) Jianxin Xiong (CS, UIUC) Franz Franchetti (ECE, CMU) Aca Gacic (ECE, CMU) Yevgen Voronenko (ECE, CMU) Kang Chen (CS, Drexel) Robert W. Johnson (Quarry Comp. Inc.) Nick Rizzolo (CS, UIUC) and many others Supported by: NSF ACR NSF ITR/NGS Thanks also to: Cylab, CMU Austrian Science Fund Intel ITRI, Taiwan ENSCO, Inc.

2 Organization Introduction and Overview What s the problem? Overview Program generation, optimization, and adaptation Common philosophy LAPACK, ATLAS, BeBOP FFTW SPIRAL

3 The Problem: Example Discrete Fourier Transform (DFT) x vendor library (hand-tuned assembly) but also FFTW (adaptable library) SPIRAL (generated code) DFT size reasonable implementation (Numerical recipes. GNU scientific library) yeah, but DFT is kind of difficult, so let s take something simpler

4 The Problem: Example Matrix-Matrix Mult. MFLOPS vendor library (hand-tuned assembly) x BLAS Unleashed CGw S ATLAS generated code 2 Model Size Compiler matrix size How to achieve optimal, portable performance? standard triple loop + compiler optimizations graph: Pingali, Yotov, Cornell U.

5 The Problem (Specifics) Computing platforms are complex memory hierarchy undocumented features in microarchitecture special instructions (vector, FMA, prefetching) Consequences: Performance only roughly determined by operations count Performance behaviour is hard to understand Best code is platform-dependent Computing platforms change fast code becomes obsolete (almost) as fast as it is written Compiler optimizations are very limited for numerical code people often revert to assembly coding (like 5 years ago) How to we port performance across platforms and across time?

6 Solution #: Brute Force Thousands of programmers hand-write and hand-tune (assembly) code for the same numerical problems and for every platform and whenever a new platform comes out? Hmm

7 Solution #2: New Approaches to Code Optimization and Code Creation Linear Algebra: LAPACK, ATLAS BeBOP Tensor Computations (Quantum Chemistry): Sadayappan, Baumgartner et al. (Ohio State) Finite Element Methods: Fenics (U. Chicago) Signal Processing: FFTW SPIRAL VSIPL (Kepner, Lebak et al., MIT Lincoln Labs) New Compiler Techniques (Domain aware/specific): Model-based ATLAS Broadway (Lin, Guyer, U. Texas Austin) SIMD optimizations (Ueberhuber, Univ. Techn. Vienna) Telescoping Languages (Kennedy et al., Rice) See also upcoming Proceedings of the IEEE special issue on Program Generation, Optimization, and Adaptation,

8 Philosophy Present Future Algorithm Level Implementation Level starting point: one algorithm/program high level information destroyed implementation restricted automation algorithm/implementation codesign domain knowledge used for optimization a new breed of domain-aware approaches/tools push automation beyond what is currently possible applies for software and hardware design alike a u t o m a t i o n

9 Organization Introduction and Overview What s the problem? Overview Program generation, optimization, and adaptation Common philosophy LAPACK, ATLAS, BeBOP FFTW SPIRAL

10 LAPACK/ATLAS (Demmel, Dongarra, Petitet, Whaley, ) LAPACK: library for linear algebra problems (LU, Cholesky, QR, SVD, ) static built on top of BLAS (basic linear algebra subroutines), which is meant to be reoptimized ATLAS: provides BLAS library (matrix-matrix, matrix-vector mult., ) can be generated for every machine people can contribute hand-tuned code

11 ATLAS Architecture MFLOPS Compile, Execute, Measure Detect Hardware Parameters LSize NR MulAdd L * ATLAS Search Engine (MMSearch) NB MU,NU,KU xfetch MulAdd Latency ATLAS MM Code Generator (MMCase) MiniMMM Source Search parameters: span search space determine code found by orthogonal line search Hardware parameters: LSize: size of L data cache NR: number of registers MulAdd: fused multiply-add available? L * : latency of FP multiplication source: Pingali, Yotov, Cornell U.

12 ATLAS: Details (I) K Blocking for L D cache only square blocks NB x NB x NB search NB: 6 < NB < 8, NB 2 <= LSize divides into minimmms NB M NU Register level blocking search parameters MU, NU, K clean-up code also generated divides into micrommms NB K B MU K NB A C source: Pingali, Yotov, Cornell U.

13 ATLAS: Details (II) Code: minimmm + micrommm for (int j = ; j < NB; j += NU) for (int i = ; i < NB; i += MU) load C[i..i+MU-, j..j+nu-] into registers copy to scalar var. unrolled for (int k = ; k < NB; k++) load A[i..i+MU-,k] into registers load B[k,j..j+NU-] into registers multiply A s and B s and add to C s store C[i..i+MU-, j..j+nu-] M M 2 M 3 M 4 Memory Operations A A 2 A 3 A 4 Latency=2 Computation Memory Operations Computation A Computation A 2 Memory Operations IFetch Loads L Memory Operations L 2 L 3 NFetch Loads L MU+NU NFetch Loads Scheduling interleave loads and operations interleave adds and mults using Latency using xfetch M MU*NU A MU*NU Computation A MU*NU Memory Operations Computation NFetch Loads source: Pingali, Yotov, Cornell U.

14 Model-Based ATLAS (Pingali, Stodghill, Yotov, Padua, ) Detect Hardware Parameters LSize LI$Size NR MulAdd L * Model NB MU,NU,KU xfetch MulAdd Latency ATLAS MM Code Generator (MMCase) MiniMMM Source uses more hardware parameters uses simple analytical models to compute MMM parameters graph: Pingali, Yotov, Cornell U.

15 ATLAS: Experiments Alpha 2264 Hand-written code often substantially faster (e.g., vector instructions) Model-based comparable to searchbased (except Itanium) Power 3 Power 4 R2K UltraSparc IIIi Itanium 2 Opteron 24 Model Refined Model Unleashed see papers by Demmel et al. (ATLAS), Yotov et al. (Model-based ATLAS) in special issue Athlon MP Pentium III Pentium 4 % 5% ATLAS CGw S % 5% 2% graph: Pingali, Yotov, Cornell U.

16 BeBOP (formerly SPARSITY) (Vuduc, Yelick, Im) library for sparse linear algebra problems in particular, sparse matrix-vector multiply difference to dense case: matrix only known at runtime requires runtime adaptation main parameter: degree of register blocking, determines sparse format approach: offline (machine dependent, matrix independent): performance P(r,c) for dense matrix-vector multiply in sparse format for register blocks r x c, <= r,c <= 2 for given matrix M: compute average fill ratio f(r,c,m) for all these block sizes choose r,c that maximizes P(r,c)/f(r,c,M)

17 BeBOP: : Experiments speed-up through blocking (Itan. 2) row block size r n = 2226 source: NASA structural analysis very regular 8 x 8 block structure col. block size c best block size is not 8 x 8, but 2 x 4 source Vuduc, U. Berkeley

18 Organization Introduction and Overview What s the problem? Overview Program generation, optimization, and adaptation Common philosophy LAPACK, ATLAS, BeBOP FFTW SPIRAL

19 FFTW (Frigo, Johnson) Self-adaptable library for the discrete Fourier transform (DFT) DFTs: real/complex, arbitrary dimension, scalar and vector code Design in two layers: small problem sizes (codelets): straightline code automatically generated offline (same for every platform) and highly optimized (CSE, scheduling, SSA, ) large problem sizes: an empirical dynamic programming search finds the best recursion strategy called plan (adaptivity) Usage in two stages: user specifies DFT (size, dim, etc.) planner computes by search the plan plan is applied to input data

20 Organization Introduction and Overview What s the problem? Overview Program generation, optimization, and adaptation Common philosophy LAPACK, ATLAS, BeBOP FFTW SPIRAL

21 Spiral Code generation from scratch for linear digital signal processing (DSP) transforms (DFT, DCT, DWT, filters,.) Automatic optimization and platform-tuning at the algorithmic level and the code level Different code types supported (scalar, vector, FMA, fixedpoint, multiplierless, ) Goal: A flexible, extensible code generation framework that can survive time (to whatever extent possible) for an entire domain of algorithms

22 Code Generation and Tuning as Optimization Problem a DSP transform to be implemented the target platform set of possible implementations of T on P cost of implementation I of T on P The implementation of T that is tuned to P is given by: Problems: How to characterize and generate the set of implementations? How to efficiently minimize C? Spiral exploits the domain-specific structure to implement a solver for this optimization problem

23 Spiral s architecture Domain knowledge: Generating algorithms & manipulating algorithms Architecture knowledge: by evaluating runtime

24 From Transform to Algorithm (Formula) Input: Transform specification Output: Fast algorithm as formula Domain Knowledge I: Generating the algorithm space

25 DSP Algorithms: Example 4 DSP Algorithms: Example 4-point DFT point DFT Cooley/Tukey FFT (size 4): algorithms reduce arithmetic cost O(n^2) O(nlog(n)) mathematical notation exhibits structure: SPL (signal processing language) contains all information to generate code = i i i i i ) ( ) ( L DFT I T I DFT DFT = Fourier transform Identity Permutation Diagonal matrix (twiddles) Kronecker product

26 DSP Algorithms: Spiral Terminology Transform DFTn parameterized matrix Rule DFT nm ( DFT I ) D ( I DFT ) P a breakdown strategy product of sparse matrices n m n m Ruletree DFT 8 Formula DFT 2 DFT 4 DFT DFT 2 DFT 2 recursive application of rules uniquely defines an algorithm efficient representation easy manipulation ( F I ) D ( I ( I F )) P = L few constructs and primitives uniquely defines an algorithm can be translated into code 2 2

27 Some Transforms Spiral currently contains 36 transforms

28 Some Breakdown Rules Base case rules Spiral contains + rules

29 Algorithm (Formula) Generation Given a transform: Apply breakdown rules recursively until all occurring transforms are expanded Choice of rules at each step yields (usually) exponentially large algorithms space: about equal in operations count differ in data flow k # DFTs, size 2^k # DCT IV, size 2^k ~. ^27 ~2.3 ^6 ~2.86 ^ ~.7 ^38 ~2.3 ^76 ~.6 ^53

30 From Algorithm (Formula) to Optimized Algorithm Input: Fast algorithm as formula Output: Optimized formula Domain Knowledge II: Optimizing an algorithm

31 Motivation: Fusing Loops void I4xF2_L84(double *y, double *x) { double t[8]; for (int i=; i<8; i++) t[i==7? 7 : (i*4)%7] = x[i]; for (int i=; i<4; i++){ y[2*i] = t[2*i] + t[2*i+]; y[2*i+] = t[2*i] - t[2*i+]; } Algebraic transformations } Formula manipulation Loop splitting Loop fusion void I4xF2_L84(double for (int j=; j<4; y[2*j] = x[j] + y[2*j+] = x[j] } } *y, double *x) { j++){ x[j+4]; - x[j+4];

32 Formula Level Optimization Main goals: Fusing iterative steps (fusing loops), e.g., permutations with loops Improving structure (data flow) for SIMD instructions Overcomes compiler limitations Formula manipulation through mathematical rules Implemented using multiple levels of rewriting systems Puts math knowledge into the system

33 Structure of Loop Optimization SPL formula To Σ-SPL Σ -SPL formula Join permutations Σ -SPL formula Rules: Join diagonals and monomials Σ -SPL formula

34 Vector code generation from SPL formulas Naturally vectorizable construct A I 4 A vector length x y (Current) generic construct completely vectorizable: k i= P D ( A I ) E i i i υ i Q i P i, Q i D i, E i A i ν permutations diagonals arbitrary formulas SIMD vector length Vectorization in two steps:. Formula manipulation using manipulation rules 2. Code generation (vector code + C code) Formula manipulation overcomes compiler limitations

35 Example DFT Standard FFT Formula manipulation Vector FFT for ν-way vector instructions

36 From Optimized Algorithm (Formula) to Code Input: Optimized formula Output: Intermediate Code Straightforward

37 From Code to Optimized Code Input: Intermediate Code Output: Optimized C code

38 Code Level Optimizations Precomputation of constants Loop unrolling (controlled by search module) Constant inlining SSA code, array scalarization, algebraic simplifications, CSE Code reordering for locality (optional) Conversion to FMA code (optional) Conversion to fixed point code (optional) Conversion to multiplierless code (optional) Finally: Unparsing to C (or Fortran)

39 Evaluating Code Input: Optimized C code Output: Performance Number Straightforward Examples: runtime accuracy operations count

40 Search (Learning) for the Best Input: Performance Number Output: Controls Formula Generation Controls Implementation Options

41 Search Methods Search over: Algorithmic degrees of freedom (choice of breakdown rules) Implementation degrees of freedom (degree of unrolling) Operates with the ruletree representation of an algorithm transform independent efficient Search Methods Exhaustive Search Dynamic Programming (DP) Random Search Hill Climbing STEER (an evolutionary algorithm)

42 STEER Population n: Mutation Cross-Breeding expand differently Population n+: swap expansions Survival of Fittest

43 Learning Procedure: Generate a set of ( say) algorithms and their runtimes (one transform, one size); represent algorithms by features From this data (pairs of features and runtimes), learn a set of algorithm design rules From this set, generate best algorithms (theory of Markov decision processes) Evaluation: Tested for WHT and DFT From data generated for one size (2^5) could construct best algorithms across sizes (2^2-2^8) Bryan Singer and Manuela Veloso Learning to Construct Fast Signal Processing Implementations Journal of Machine Learning Research, 22, Vol. 3, pp

44 SPIRAL Experiments

45 Performance Spread: DCT, size 32 Histograms,, algorithms P4, 3.2 GHz, gcc runtime: x2 #ops: x.8 #assembly instr: x.5 #ops vs. runtime: no correlation #fma ops: x.2 accuracy: x, most x2

46 Performance Spread: DFT 2^6 Histograms, 2, Algorithms P4, 3.2 GHz, icc 8. Generated scalar code Generated SSE code (4-way vector single precision) Generality of vectorization (all algorithms improve) Efficiency of vectorization (x 2.5 gain)

47 Benchmark: DFT, 2-powers2 P4, 3.2 GHz, icc 8. Vendor code: hand-tuned assembly? Higher is better Single precision code limitations of compiler vectorization Spiral code competitive with the best

48 DFT for one BG/L node One BG/L node PowerPC 44 FP2 5 MHz Up to 4% speedup using 2-way vector instructions Results in collaboration with Ueberhuber/Franchetti Univ. of Technology Vienna

49 Benchmark: Fixed-Point DFT, IPAQ IPAQ Xscale arch. 4 MHz Fixed-point only Higher is better Intel spend less effort? Spiral code generate without modifications to the system

50 Benchmark: DCT -dim DCT 2-dim DCT P4, 3.2 GHz, icc 8. Scalar code Scalar vs. SSE code factor 2 over vendor code another factor of 3 with special instructions

51 Platform-Dependence 5% loss when porting PIII tuned code to P4

52 Compiler Flags P4, 3.2 GHz, gcc ACOVEA: evolutionary search for best compiler flags Runtime histogram random compiler flags incl. -O march=pentium4 % improvement when applied to best Spiral generated code

53 Multiplierless DFT, IPAQ IPAQ Xscale arch. 4 MHz Fixed-point only fixed-point constant multiplications replaced by adds and shifts trade-off runtime and precision

54 Current/Future Work Complete reimplementation of system (incl. vector code, FMA code, code for SMP platforms, fixed-point code) Generation of packages (generating something like FFTW?) Spiral as educational tool Other domains: SP applications (backprojection) Cryptography (seems to be not suitable) Communications Maybe linear algebra problems Generating special purpose hardware (FPGA)

55 Summary Code generation and tuning as optimization problem over the algorithm and implementation space Exploit the structure of the domain to solve it Declarative framework for computer representation of the domainknowledge Enables algorithm generation and optimization (teaches the system the math; does not become obsolete?) Compiler to translate math description into code Search and learning to explore implementation space Closes the feedback loop gives the system intelligence Extensible, versatile Every step in the code generation is exposed

56 Questions? Check out: upcoming Proceedings of the IEEE special issue on Program Generation, Optimization, and Adaptation,

Algorithms and Computation in Signal Processing

Algorithms and Computation in Signal Processing Algorithms and Computation in Signal Processing special topic course 8-799B spring 25 24 th and 25 th Lecture Apr. 7 and 2, 25 Instructor: Markus Pueschel TA: Srinivas Chellappa Research Projects Presentations

More information

SPIRAL: A Generator for Platform-Adapted Libraries of Signal Processing Algorithms

SPIRAL: A Generator for Platform-Adapted Libraries of Signal Processing Algorithms SPIRAL: A Generator for Platform-Adapted Libraries of Signal Processing Algorithms Markus Püschel Faculty José Moura (CMU) Jeremy Johnson (Drexel) Robert Johnson (MathStar Inc.) David Padua (UIUC) Viktor

More information

Joint Runtime / Energy Optimization and Hardware / Software Partitioning of Linear Transforms

Joint Runtime / Energy Optimization and Hardware / Software Partitioning of Linear Transforms Joint Runtime / Energy Optimization and Hardware / Software Partitioning of Linear Transforms Paolo D Alberto, Franz Franchetti, Peter A. Milder, Aliaksei Sandryhaila, James C. Hoe, José M. F. Moura, Markus

More information

Generating Parallel Transforms Using Spiral

Generating Parallel Transforms Using Spiral Generating Parallel Transforms Using Spiral Franz Franchetti Yevgen Voronenko Markus Püschel Part of the Spiral Team Electrical and Computer Engineering Carnegie Mellon University Sponsors: DARPA DESA

More information

Parallelism in Spiral

Parallelism in Spiral Parallelism in Spiral Franz Franchetti and the Spiral team (only part shown) Electrical and Computer Engineering Carnegie Mellon University Joint work with Yevgen Voronenko Markus Püschel This work was

More information

How to Write Fast Numerical Code

How to Write Fast Numerical Code How to Write Fast Numerical Code Lecture: Dense linear algebra, LAPACK, MMM optimizations in ATLAS Instructor: Markus Püschel TA: Daniele Spampinato & Alen Stojanov Today Linear algebra software: history,

More information

Optimizing MMM & ATLAS Library Generator

Optimizing MMM & ATLAS Library Generator Optimizing MMM & ATLAS Library Generator Recall: MMM miss ratios L1 Cache Miss Ratio for Intel Pentium III MMM with N = 1 1300 16 32/lock 4-way 8-byte elements IJ version (large cache) DO I = 1, N//row-major

More information

Automatic Performance Tuning. Jeremy Johnson Dept. of Computer Science Drexel University

Automatic Performance Tuning. Jeremy Johnson Dept. of Computer Science Drexel University Automatic Performance Tuning Jeremy Johnson Dept. of Computer Science Drexel University Outline Scientific Computation Kernels Matrix Multiplication Fast Fourier Transform (FFT) Automated Performance Tuning

More information

SPIRAL Overview: Automatic Generation of DSP Algorithms & More *

SPIRAL Overview: Automatic Generation of DSP Algorithms & More * SPIRAL Overview: Automatic Generation of DSP Algorithms & More * Jeremy Johnson (& SPIRAL Team) Drexel University * The work is supported by DARPA DESA, NSF, and Intel. Material for presentation provided

More information

How to Write Fast Numerical Code Spring 2012 Lecture 9. Instructor: Markus Püschel TAs: Georg Ofenbeck & Daniele Spampinato

How to Write Fast Numerical Code Spring 2012 Lecture 9. Instructor: Markus Püschel TAs: Georg Ofenbeck & Daniele Spampinato How to Write Fast Numerical Code Spring 2012 Lecture 9 Instructor: Markus Püschel TAs: Georg Ofenbeck & Daniele Spampinato Today Linear algebra software: history, LAPACK and BLAS Blocking (BLAS 3): key

More information

Formal Loop Merging for Signal Transforms

Formal Loop Merging for Signal Transforms Formal Loop Merging for Signal Transforms Franz Franchetti Yevgen S. Voronenko Markus Püschel Department of Electrical & Computer Engineering Carnegie Mellon University This work was supported by NSF through

More information

SPIRAL Generated Modular FFTs *

SPIRAL Generated Modular FFTs * SPIRAL Generated Modular FFTs * Jeremy Johnson Lingchuan Meng Drexel University * The work is supported by DARPA DESA, NSF, and Intel. Material for SPIRAL overview provided by Franz Francheti, Yevgen Voronenko,,

More information

System Demonstration of Spiral: Generator for High-Performance Linear Transform Libraries

System Demonstration of Spiral: Generator for High-Performance Linear Transform Libraries System Demonstration of Spiral: Generator for High-Performance Linear Transform Libraries Yevgen Voronenko, Franz Franchetti, Frédéric de Mesmay, and Markus Püschel Department of Electrical and Computer

More information

How to Write Fast Numerical Code

How to Write Fast Numerical Code How to Write Fast Numerical Code Lecture: Autotuning and Machine Learning Instructor: Markus Püschel TA: Gagandeep Singh, Daniele Spampinato, Alen Stojanov Overview Rough classification of autotuning efforts

More information

How to Write Fast Code , spring rd Lecture, Apr. 9 th

How to Write Fast Code , spring rd Lecture, Apr. 9 th How to Write Fast Code 18-645, spring 2008 23 rd Lecture, Apr. 9 th Instructor: Markus Püschel TAs: Srinivas Chellappa (Vas) and Frédéric de Mesmay (Fred) Research Project Current status Today Papers due

More information

Automatic Performance Tuning and Machine Learning

Automatic Performance Tuning and Machine Learning Automatic Performance Tuning and Machine Learning Markus Püschel Computer Science, ETH Zürich with: Frédéric de Mesmay PhD, Electrical and Computer Engineering, Carnegie Mellon PhD and Postdoc openings:

More information

Performance Analysis of a Family of WHT Algorithms

Performance Analysis of a Family of WHT Algorithms Performance Analysis of a Family of WHT Algorithms Michael Andrews and Jeremy Johnson Department of Computer Science Drexel University Philadelphia, PA USA January, 7 Abstract This paper explores the correlation

More information

How to Write Fast Numerical Code

How to Write Fast Numerical Code How to Write Fast Numerical Code Lecture: Memory bound computation, sparse linear algebra, OSKI Instructor: Markus Püschel TA: Alen Stojanov, Georg Ofenbeck, Gagandeep Singh ATLAS Mflop/s Compile Execute

More information

How to Write Fast Numerical Code Spring 2012 Lecture 13. Instructor: Markus Püschel TAs: Georg Ofenbeck & Daniele Spampinato

How to Write Fast Numerical Code Spring 2012 Lecture 13. Instructor: Markus Püschel TAs: Georg Ofenbeck & Daniele Spampinato How to Write Fast Numerical Code Spring 2012 Lecture 13 Instructor: Markus Püschel TAs: Georg Ofenbeck & Daniele Spampinato ATLAS Mflop/s Compile Execute Measure Detect Hardware Parameters L1Size NR MulAdd

More information

Automatic Generation of the HPC Challenge's Global FFT Benchmark for BlueGene/P

Automatic Generation of the HPC Challenge's Global FFT Benchmark for BlueGene/P Automatic Generation of the HPC Challenge's Global FFT Benchmark for BlueGene/P Franz Franchetti 1, Yevgen Voronenko 2, Gheorghe Almasi 3 1 University and SpiralGen, Inc. 2 AccuRay, Inc., 3 IBM Research

More information

Tackling Parallelism With Symbolic Computation

Tackling Parallelism With Symbolic Computation Tackling Parallelism With Symbolic Computation Markus Püschel and the Spiral team (only part shown) With: Franz Franchetti Yevgen Voronenko Electrical and Computer Engineering Carnegie Mellon University

More information

Automatic Performance Programming?

Automatic Performance Programming? A I n Automatic Performance Programming? Markus Püschel Computer Science m128i t3 = _mm_unpacklo_epi16(x[0], X[1]); m128i t4 = _mm_unpackhi_epi16(x[0], X[1]); m128i t7 = _mm_unpacklo_epi16(x[2], X[3]);

More information

A SIMD Vectorizing Compiler for Digital Signal Processing Algorithms Λ

A SIMD Vectorizing Compiler for Digital Signal Processing Algorithms Λ A SIMD Vectorizing Compiler for Digital Signal Processing Algorithms Λ Franz Franchetti Applied and Numerical Mathematics Technical University of Vienna, Austria franz.franchetti@tuwien.ac.at Markus Püschel

More information

Aggressive Scheduling for Numerical Programs

Aggressive Scheduling for Numerical Programs Aggressive Scheduling for Numerical Programs Richard Veras, Flavio Cruz, Berkin Akin May 2, 2012 Abstract Certain classes of algorithms are still hard to optimize by modern compilers due to the differences

More information

SPIRAL: Code Generation for DSP Transforms

SPIRAL: Code Generation for DSP Transforms PROCEEDINGS OF THE IEEE SPECIAL ISSUE ON PROGRAM GENERATION, OPTIMIZATION, AND ADAPTATION 1 SPIRAL: Code Generation for DSP Transforms Markus Püschel, José M. F. Moura, Jeremy Johnson, David Padua, Manuela

More information

Stochastic Search for Signal Processing Algorithm Optimization

Stochastic Search for Signal Processing Algorithm Optimization Stochastic Search for Signal Processing Algorithm Optimization Bryan Singer Manuela Veloso May, 01 CMU-CS-01-137 School of Computer Science Carnegie Mellon University Pittsburgh, PA 1213 Abstract Many

More information

Special Issue on Program Generation, Optimization, and Platform Adaptation /$ IEEE

Special Issue on Program Generation, Optimization, and Platform Adaptation /$ IEEE Scanning the Issue Special Issue on Program Generation, Optimization, and Platform Adaptation This special issue of the PROCEEDINGS OF THE IEEE offers an overview of ongoing efforts to facilitate the development

More information

Stochastic Search for Signal Processing Algorithm Optimization

Stochastic Search for Signal Processing Algorithm Optimization Stochastic Search for Signal Processing Algorithm Optimization Bryan Singer and Manuela Veloso Computer Science Department Carnegie Mellon University Pittsburgh, PA 1213 Email: {bsinger+, mmv+}@cs.cmu.edu

More information

Program Composition and Optimization: An Introduction

Program Composition and Optimization: An Introduction Program Composition and Optimization: An Introduction Christoph W. Kessler 1, Welf Löwe 2, David Padua 3, and Markus Püschel 4 1 Linköping University, Linköping, Sweden, chrke@ida.liu.se 2 Linnaeus University,

More information

Statistical Evaluation of a Self-Tuning Vectorized Library for the Walsh Hadamard Transform

Statistical Evaluation of a Self-Tuning Vectorized Library for the Walsh Hadamard Transform Statistical Evaluation of a Self-Tuning Vectorized Library for the Walsh Hadamard Transform Michael Andrews and Jeremy Johnson Department of Computer Science, Drexel University, Philadelphia, PA USA Abstract.

More information

Automatic Tuning of Sparse Matrix Kernels

Automatic Tuning of Sparse Matrix Kernels Automatic Tuning of Sparse Matrix Kernels Kathy Yelick U.C. Berkeley and Lawrence Berkeley National Laboratory Richard Vuduc, Lawrence Livermore National Laboratory James Demmel, U.C. Berkeley Berkeley

More information

Advanced Computing Research Laboratory. Adaptive Scientific Software Libraries

Advanced Computing Research Laboratory. Adaptive Scientific Software Libraries Adaptive Scientific Software Libraries and Texas Learning and Computation Center and Department of Computer Science University of Houston Challenges Diversity of execution environments Growing complexity

More information

Adaptive Scientific Software Libraries

Adaptive Scientific Software Libraries Adaptive Scientific Software Libraries Lennart Johnsson Advanced Computing Research Laboratory Department of Computer Science University of Houston Challenges Diversity of execution environments Growing

More information

Spiral: Program Generation for Linear Transforms and Beyond

Spiral: Program Generation for Linear Transforms and Beyond Spiral: Program Generation for Linear Transforms and Beyond Franz Franchetti and the Spiral team (only part shown) ECE, Carnegie Mellon University www.spiral.net Co-Founder, SpiralGen www.spiralgen.com

More information

A Fast Fourier Transform Compiler

A Fast Fourier Transform Compiler RETROSPECTIVE: A Fast Fourier Transform Compiler Matteo Frigo Vanu Inc., One Porter Sq., suite 18 Cambridge, MA, 02140, USA athena@fftw.org 1. HOW FFTW WAS BORN FFTW (the fastest Fourier transform in the

More information

Library Generation For Linear Transforms

Library Generation For Linear Transforms Library Generation For Linear Transforms Yevgen Voronenko May RS RS 3 RS RS RS RS 7 RS 5 Dissertation Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy in Electrical

More information

A Comparison of Empirical and Model-driven Optimization

A Comparison of Empirical and Model-driven Optimization 1 A Comparison of Empirical and Model-driven Optimization Kamen Yotov, Xiaoming Li, Gang Ren, Maria Garzaran, David Padua, Keshav Pingali, Paul Stodghill Abstract A key step in program optimization is

More information

Cache-oblivious Programming

Cache-oblivious Programming Cache-oblivious Programming Story so far We have studied cache optimizations for array programs Main transformations: loop interchange, loop tiling Loop tiling converts matrix computations into block matrix

More information

Empirical Auto-tuning Code Generator for FFT and Trigonometric Transforms

Empirical Auto-tuning Code Generator for FFT and Trigonometric Transforms Empirical Auto-tuning Code Generator for FFT and Trigonometric Transforms Ayaz Ali and Lennart Johnsson Texas Learning and Computation Center University of Houston, Texas {ayaz,johnsson}@cs.uh.edu Dragan

More information

Performance Models for Evaluation and Automatic Tuning of Symmetric Sparse Matrix-Vector Multiply

Performance Models for Evaluation and Automatic Tuning of Symmetric Sparse Matrix-Vector Multiply Performance Models for Evaluation and Automatic Tuning of Symmetric Sparse Matrix-Vector Multiply University of California, Berkeley Berkeley Benchmarking and Optimization Group (BeBOP) http://bebop.cs.berkeley.edu

More information

Optimal Performance Numerical Code Challenges and Solutions

Optimal Performance Numerical Code Challenges and Solutions Optimal Performance Numerical Code Challenges and Solutions Markus Püschel Picture: www.tapety-na-pulpit.org Scientific Computing Physics/biology simulations Consumer Computing Audio/image/video processing

More information

Think Globally, Search Locally

Think Globally, Search Locally Think Globally, Search Locally Kamen Yotov, Keshav Pingali, Paul Stodghill, {kyotov,pingali,stodghil}@cs.cornell.edu Department of Computer Science, Cornell University, Ithaca, NY 853. ABSTRACT A key step

More information

Automatic Tuning Matrix Multiplication Performance on Graphics Hardware

Automatic Tuning Matrix Multiplication Performance on Graphics Hardware Automatic Tuning Matrix Multiplication Performance on Graphics Hardware Changhao Jiang (cjiang@cs.uiuc.edu) Marc Snir (snir@cs.uiuc.edu) University of Illinois Urbana Champaign GPU becomes more powerful

More information

Autotuning (1/2): Cache-oblivious algorithms

Autotuning (1/2): Cache-oblivious algorithms Autotuning (1/2): Cache-oblivious algorithms Prof. Richard Vuduc Georgia Institute of Technology CSE/CS 8803 PNA: Parallel Numerical Algorithms [L.17] Tuesday, March 4, 2008 1 Today s sources CS 267 (Demmel

More information

Algorithms and Computation in Signal Processing

Algorithms and Computation in Signal Processing Algorithms and Computation in Signal Processing special topic course 18-799B spring 2005 14 th Lecture Feb. 24, 2005 Instructor: Markus Pueschel TA: Srinivas Chellappa Course Evaluation Email sent out

More information

Scheduling FFT Computation on SMP and Multicore Systems Ayaz Ali, Lennart Johnsson & Jaspal Subhlok

Scheduling FFT Computation on SMP and Multicore Systems Ayaz Ali, Lennart Johnsson & Jaspal Subhlok Scheduling FFT Computation on SMP and Multicore Systems Ayaz Ali, Lennart Johnsson & Jaspal Subhlok Texas Learning and Computation Center Department of Computer Science University of Houston Outline Motivation

More information

Automated Empirical Optimizations of Software and the ATLAS project* Software Engineering Seminar Pascal Spörri

Automated Empirical Optimizations of Software and the ATLAS project* Software Engineering Seminar Pascal Spörri Automated Empirical Optimizations of Software and the ATLAS project* Software Engineering Seminar Pascal Spörri *R. Clint Whaley, Antoine Petitet and Jack Dongarra. Parallel Computing, 27(1-2):3-35, 2001.

More information

Mixed Data Layout Kernels for Vectorized Complex Arithmetic

Mixed Data Layout Kernels for Vectorized Complex Arithmetic Mixed Data Layout Kernels for Vectorized Complex Arithmetic Doru T. Popovici, Franz Franchetti, Tze Meng Low Department of Electrical and Computer Engineering Carnegie Mellon University Email: {dpopovic,

More information

How to Write Fast Code , spring st Lecture, Jan. 14 th

How to Write Fast Code , spring st Lecture, Jan. 14 th How to Write Fast Code 18-645, spring 2008 1 st Lecture, Jan. 14 th Instructor: Markus Püschel TAs: Srinivas Chellappa (Vas) and Frédéric de Mesmay (Fred) Today Motivation and idea behind this course Technicalities

More information

FFT Program Generation for the Cell BE

FFT Program Generation for the Cell BE FFT Program Generation for the Cell BE Srinivas Chellappa, Franz Franchetti, and Markus Püschel Electrical and Computer Engineering Carnegie Mellon University Pittsburgh PA 15213, USA {schellap, franzf,

More information

Autotuning (2.5/2): TCE & Empirical compilers

Autotuning (2.5/2): TCE & Empirical compilers Autotuning (2.5/2): TCE & Empirical compilers Prof. Richard Vuduc Georgia Institute of Technology CSE/CS 8803 PNA: Parallel Numerical Algorithms [L.19] Tuesday, March 11, 2008 1 Today s sources CS 267

More information

Scientific Computing. Some slides from James Lambers, Stanford

Scientific Computing. Some slides from James Lambers, Stanford Scientific Computing Some slides from James Lambers, Stanford Dense Linear Algebra Scaling and sums Transpose Rank-one updates Rotations Matrix vector products Matrix Matrix products BLAS Designing Numerical

More information

How to Write Fast Numerical Code Spring 2011 Lecture 22. Instructor: Markus Püschel TA: Georg Ofenbeck

How to Write Fast Numerical Code Spring 2011 Lecture 22. Instructor: Markus Püschel TA: Georg Ofenbeck How to Write Fast Numerical Code Spring 2011 Lecture 22 Instructor: Markus Püschel TA: Georg Ofenbeck Schedule Today Lecture Project presentations 10 minutes each random order random speaker 10 Final code

More information

Automatically Optimized FFT Codes for the BlueGene/L Supercomputer

Automatically Optimized FFT Codes for the BlueGene/L Supercomputer Automatically Optimized FFT Codes for the BlueGene/L Supercomputer Franz Franchetti, Stefan Kral, Juergen Lorenz, Markus Püschel, Christoph W. Ueberhuber, and Peter Wurzinger Institute for Analysis and

More information

Adaptable benchmarks for register blocked sparse matrix-vector multiplication

Adaptable benchmarks for register blocked sparse matrix-vector multiplication Adaptable benchmarks for register blocked sparse matrix-vector multiplication Berkeley Benchmarking and Optimization group (BeBOP) Hormozd Gahvari and Mark Hoemmen Based on research of: Eun-Jin Im Rich

More information

Is Search Really Necessary to Generate High-Performance BLAS?

Is Search Really Necessary to Generate High-Performance BLAS? 1 Is Search Really Necessary to Generate High-Performance BLAS? Kamen Yotov, Xiaoming Li, Gang Ren, Maria Garzaran, David Padua, Keshav Pingali, Paul Stodghill Abstract A key step in program optimization

More information

Automatically Tuned FFTs for BlueGene/L s Double FPU

Automatically Tuned FFTs for BlueGene/L s Double FPU Automatically Tuned FFTs for BlueGene/L s Double FPU Franz Franchetti, Stefan Kral, Juergen Lorenz, Markus Püschel, and Christoph W. Ueberhuber Institute for Analysis and Scientific Computing, Vienna University

More information

Algorithms and Computation in Signal Processing

Algorithms and Computation in Signal Processing Algorithms and Computation in Signal Processing special topic course 18-799B spring 2005 22 nd lecture Mar. 31, 2005 Instructor: Markus Pueschel Guest instructor: Franz Franchetti TA: Srinivas Chellappa

More information

Spiral. Computer Generation of Performance Libraries. José M. F. Moura Markus Püschel Franz Franchetti & the Spiral Team. Performance.

Spiral. Computer Generation of Performance Libraries. José M. F. Moura Markus Püschel Franz Franchetti & the Spiral Team. Performance. Spiral Computer Generation of Performance Libraries José M. F. Moura Markus Püschel Franz Franchetti & the Spiral Team Platforms Performance Applications What is Spiral? Traditionally Spiral Approach Spiral

More information

How to Write Fast Numerical Code Spring 2012 Lecture 20. Instructor: Markus Püschel TAs: Georg Ofenbeck & Daniele Spampinato

How to Write Fast Numerical Code Spring 2012 Lecture 20. Instructor: Markus Püschel TAs: Georg Ofenbeck & Daniele Spampinato How to Write Fast Numerical Code Spring 2012 Lecture 20 Instructor: Markus Püschel TAs: Georg Ofenbeck & Daniele Spampinato Planning Today Lecture Project meetings Project presentations 10 minutes each

More information

Automating the Modeling and Optimization of the Performance of Signal Transforms

Automating the Modeling and Optimization of the Performance of Signal Transforms IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 50, NO. 8, AUGUST 2002 2003 Automating the Modeling and Optimization of the Performance of Signal Transforms Bryan Singer and Manuela M. Veloso Abstract Fast

More information

How to Write Fast Numerical Code

How to Write Fast Numerical Code How to Write Fast Numerical Code Lecture: Optimizing FFT, FFTW Instructor: Markus Püschel TA: Georg Ofenbeck & Daniele Spampinato Rest of Semester Today Lecture Project meetings Project presentations 10

More information

Learning to Construct Fast Signal Processing Implementations

Learning to Construct Fast Signal Processing Implementations Journal of Machine Learning Research 3 (2002) 887-919 Submitted 12/01; Published 12/02 Learning to Construct Fast Signal Processing Implementations Bryan Singer Manuela Veloso Department of Computer Science

More information

FFT ALGORITHMS FOR MULTIPLY-ADD ARCHITECTURES

FFT ALGORITHMS FOR MULTIPLY-ADD ARCHITECTURES FFT ALGORITHMS FOR MULTIPLY-ADD ARCHITECTURES FRANCHETTI Franz, (AUT), KALTENBERGER Florian, (AUT), UEBERHUBER Christoph W. (AUT) Abstract. FFTs are the single most important algorithms in science and

More information

Intel Parallel Studio XE 2016 Composer Edition for OS X* Installation Guide and Release Notes

Intel Parallel Studio XE 2016 Composer Edition for OS X* Installation Guide and Release Notes Intel Parallel Studio XE 2016 Composer Edition for OS X* Installation Guide and Release Notes 30 July 2015 Table of Contents 1 Introduction... 2 1.1 Change History... 2 1.1.1 Changes since Intel Parallel

More information

Statistical Models for Automatic Performance Tuning

Statistical Models for Automatic Performance Tuning Statistical Models for Automatic Performance Tuning Richard Vuduc, James Demmel (U.C. Berkeley, EECS) {richie,demmel}@cs.berkeley.edu Jeff Bilmes (Univ. of Washington, EE) bilmes@ee.washington.edu May

More information

Generating SIMD Vectorized Permutations

Generating SIMD Vectorized Permutations Generating SIMD Vectorized Permutations Franz Franchetti and Markus Püschel Electrical and Computer Engineering, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213 {franzf, pueschel}@ece.cmu.edu

More information

Automatic Derivation and Implementation of Signal Processing Algorithms

Automatic Derivation and Implementation of Signal Processing Algorithms Automatic Derivation and Implementation of Signal Processing Algorithms Sebastian Egner Philips Research Laboratories Prof. Hostlaan 4, WY21 5656 AA Eindhoven, The Netherlands sebastian.egner@philips.com

More information

FFT Program Generation for Shared Memory: SMP and Multicore

FFT Program Generation for Shared Memory: SMP and Multicore FFT Program Generation for Shared Memory: SMP and Multicore Franz Franchetti, Yevgen Voronenko, and Markus Püschel Electrical and Computer Engineering Carnegie Mellon University Abstract The chip maker

More information

Exploring the Optimization Space of Dense Linear Algebra Kernels

Exploring the Optimization Space of Dense Linear Algebra Kernels Exploring the Optimization Space of Dense Linear Algebra Kernels Qing Yi Apan Qasem University of Texas at San Anstonio Texas State University Abstract. Dense linear algebra kernels such as matrix multiplication

More information

Parameterizing Loop Fusion for Automated Empirical Tuning

Parameterizing Loop Fusion for Automated Empirical Tuning Parameterizing Loop Fusion for Automated Empirical Tuning Yuan Zhao Qing Yi Ken Kennedy Dan Quinlan Richard Vuduc Rice University, Houston, TX, USA University of Texas at San Antonio, TX, USA Lawrence

More information

Intel Parallel Studio XE 2016 Composer Edition Update 1 for OS X* Installation Guide and Release Notes

Intel Parallel Studio XE 2016 Composer Edition Update 1 for OS X* Installation Guide and Release Notes Intel Parallel Studio XE 2016 Composer Edition Update 1 for OS X* Installation Guide and Release Notes 30 October 2015 Table of Contents 1 Introduction... 2 1.1 What s New... 2 1.2 Product Contents...

More information

OSKI: A Library of Automatically Tuned Sparse Matrix Kernels

OSKI: A Library of Automatically Tuned Sparse Matrix Kernels OSKI: A Library of Automatically Tuned Sparse Matrix Kernels Richard Vuduc (LLNL), James Demmel, Katherine Yelick Berkeley Benchmarking and OPtimization (BeBOP) Project bebop.cs.berkeley.edu EECS Department,

More information

Automatically Tuned Linear Algebra Software (ATLAS) R. Clint Whaley Innovative Computing Laboratory University of Tennessee.

Automatically Tuned Linear Algebra Software (ATLAS) R. Clint Whaley Innovative Computing Laboratory University of Tennessee. Automatically Tuned Linear Algebra Software (ATLAS) R. Clint Whaley Innovative Computing Laboratory University of Tennessee Outline Pre-intro: BLAS Motivation What is ATLAS Present release How ATLAS works

More information

Computer Generation of IP Cores

Computer Generation of IP Cores A I n Computer Generation of IP Cores Peter Milder (ECE, Carnegie Mellon) James Hoe (ECE, Carnegie Mellon) Markus Püschel (CS, ETH Zürich) addfxp #(16, 1) add15282(.a(a69),.b(a70),.clk(clk),.q(t45)); addfxp

More information

An Adaptive Framework for Scientific Software Libraries. Ayaz Ali Lennart Johnsson Dept of Computer Science University of Houston

An Adaptive Framework for Scientific Software Libraries. Ayaz Ali Lennart Johnsson Dept of Computer Science University of Houston An Adaptive Framework for Scientific Software Libraries Ayaz Ali Lennart Johnsson Dept of Computer Science University of Houston Diversity of execution environments Growing complexity of modern microprocessors.

More information

DFT Compiler for Custom and Adaptable Systems

DFT Compiler for Custom and Adaptable Systems DFT Compiler for Custom and Adaptable Systems Paolo D Alberto Electrical and Computer Engineering Carnegie Mellon University Personal Research Background Embedded and High Performance Computing Compiler:

More information

Java Performance Analysis for Scientific Computing

Java Performance Analysis for Scientific Computing Java Performance Analysis for Scientific Computing Roldan Pozo Leader, Mathematical Software Group National Institute of Standards and Technology USA UKHEC: Java for High End Computing Nov. 20th, 2000

More information

FFT Program Generation for Shared Memory: SMP and Multicore

FFT Program Generation for Shared Memory: SMP and Multicore FFT Program Generation for Shared Memory: SMP and Multicore Franz Franchetti, Yevgen Voronenko, and Markus Püschel Electrical and Computer Engineering Carnegie Mellon University {franzf, yvoronen, pueschel@ece.cmu.edu

More information

An Experimental Comparison of Cache-oblivious and Cache-aware Programs DRAFT: DO NOT DISTRIBUTE

An Experimental Comparison of Cache-oblivious and Cache-aware Programs DRAFT: DO NOT DISTRIBUTE An Experimental Comparison of Cache-oblivious and Cache-aware Programs DRAFT: DO NOT DISTRIBUTE Kamen Yotov IBM T. J. Watson Research Center kyotov@us.ibm.com Tom Roeder, Keshav Pingali Cornell University

More information

Formal Loop Merging for Signal Transforms

Formal Loop Merging for Signal Transforms Formal Loop Merging for Signal Transforms Franz Franchetti Yevgen Voronenko Markus Püschel Department of Electrical and Computer Engineering Carnegie Mellon University {franzf, yvoronen, pueschel}@ece.cmu.edu

More information

PARALLELIZATION OF WAVELET FILTERS USING SIMD EXTENSIONS

PARALLELIZATION OF WAVELET FILTERS USING SIMD EXTENSIONS c World Scientific Publishing Company PARALLELIZATION OF WAVELET FILTERS USING SIMD EXTENSIONS RADE KUTIL and PETER EDER Department of Computer Sciences, University of Salzburg Jakob Haringerstr. 2, 5020

More information

Automatic Tuning of Discrete Fourier Transforms Driven by Analytical Modeling

Automatic Tuning of Discrete Fourier Transforms Driven by Analytical Modeling Automatic Tuning of Discrete Fourier Transforms Driven by Analytical Modeling Basilio B. Fraguela Depto. de Electrónica e Sistemas Universidade da Coruña A Coruña, Spain basilio@udc.es Yevgen Voronenko,

More information

In 1986, I had degrees in math and engineering and found I wanted to compute things. What I ve mostly found is that:

In 1986, I had degrees in math and engineering and found I wanted to compute things. What I ve mostly found is that: Parallel Computing and Data Locality Gary Howell In 1986, I had degrees in math and engineering and found I wanted to compute things. What I ve mostly found is that: Real estate and efficient computation

More information

Advanced School in High Performance and GRID Computing November Mathematical Libraries. Part I

Advanced School in High Performance and GRID Computing November Mathematical Libraries. Part I 1967-10 Advanced School in High Performance and GRID Computing 3-14 November 2008 Mathematical Libraries. Part I KOHLMEYER Axel University of Pennsylvania Department of Chemistry 231 South 34th Street

More information

Parallel FFT Program Optimizations on Heterogeneous Computers

Parallel FFT Program Optimizations on Heterogeneous Computers Parallel FFT Program Optimizations on Heterogeneous Computers Shuo Chen, Xiaoming Li Department of Electrical and Computer Engineering University of Delaware, Newark, DE 19716 Outline Part I: A Hybrid

More information

Automatic Performance Tuning and Analysis of Sparse Triangular Solve p.1/31

Automatic Performance Tuning and Analysis of Sparse Triangular Solve p.1/31 Automatic Performance Tuning and Analysis of Sparse Triangular Solve Richard Vuduc, Shoaib Kamil, Jen Hsu, Rajesh Nishtala James W. Demmel, Katherine A. Yelick June 22, 2002 Berkeley Benchmarking and OPtimization

More information

The Design and Implementation of FFTW3

The Design and Implementation of FFTW3 The Design and Implementation of FFTW3 MATTEO FRIGO AND STEVEN G. JOHNSON Invited Paper FFTW is an implementation of the discrete Fourier transform (DFT) that adapts to the hardware in order to maximize

More information

Automatically Tuned Linear Algebra Software

Automatically Tuned Linear Algebra Software Automatically Tuned Linear Algebra Software R. Clint Whaley whaley@cs.utsa.edu www.cs.utsa.edu/ whaley University of Texas at San Antonio Department of Computer Science November 5, 2007 R. Clint Whaley

More information

Administration. Prerequisites. Meeting times. CS 380C: Advanced Topics in Compilers

Administration. Prerequisites. Meeting times. CS 380C: Advanced Topics in Compilers Administration CS 380C: Advanced Topics in Compilers Instructor: eshav Pingali Professor (CS, ICES) Office: POB 4.126A Email: pingali@cs.utexas.edu TA: TBD Graduate student (CS) Office: Email: Meeting

More information

Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop

Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop http://cscads.rice.edu/ Discussion and Feedback CScADS Autotuning 07 Top Priority Questions for Discussion

More information

HiLO: High Level Optimization of FFTs

HiLO: High Level Optimization of FFTs HiLO: High Level Optimization of FFTs Nick Rizzolo and David Padua University of Illinois at Urbana-Champaign 201 N. Goodwin Ave., Urbana, IL 61801-2302 {rizzolo,padua}@cs.uiuc.edu, WWW home page: http://polaris.cs.uiuc.edu

More information

Intel Math Kernel Library

Intel Math Kernel Library Intel Math Kernel Library Getting Started Tutorial: Using the Intel Math Kernel Library for Matrix Multiplication Document Number: 327356-005US Legal Information Intel Math Kernel Library Getting Started

More information

Double-precision General Matrix Multiply (DGEMM)

Double-precision General Matrix Multiply (DGEMM) Double-precision General Matrix Multiply (DGEMM) Parallel Computation (CSE 0), Assignment Andrew Conegliano (A0) Matthias Springer (A00) GID G-- January, 0 0. Assumptions The following assumptions apply

More information

An Experimental Study of Self-Optimizing Dense Linear Algebra Software

An Experimental Study of Self-Optimizing Dense Linear Algebra Software INVITED PAPER An Experimental Study of Self-Optimizing Dense Linear Algebra Software Analytical models of the memory hierarchy are used to explain the performance of self-optimizing software. By Milind

More information

Cache miss analysis of WHT algorithms

Cache miss analysis of WHT algorithms 2005 International Conference on Analysis of Algorithms DMTCS proc. AD, 2005, 115 124 Cache miss analysis of WHT algorithms Mihai Furis 1 and Paweł Hitczenko 2 and Jeremy Johnson 1 1 Dept. of Computer

More information

Intel Performance Libraries

Intel Performance Libraries Intel Performance Libraries Powerful Mathematical Library Intel Math Kernel Library (Intel MKL) Energy Science & Research Engineering Design Financial Analytics Signal Processing Digital Content Creation

More information

CS 612: Software Design for High-performance Architectures. Keshav Pingali Cornell University

CS 612: Software Design for High-performance Architectures. Keshav Pingali Cornell University CS 612: Software Design for High-performance Architectures Keshav Pingali Cornell University Administration Instructor: Keshav Pingali 457 Rhodes Hall pingali@cs.cornell.edu TA: Kamen Yotov 492 Rhodes

More information

Intel Parallel Studio XE 2017 Update 4 for macos* Installation Guide and Release Notes

Intel Parallel Studio XE 2017 Update 4 for macos* Installation Guide and Release Notes Intel Parallel Studio XE 2017 Update 4 for macos* Installation Guide and Release Notes 12 April 2017 Contents 1.1 What Every User Should Know About This Release... 2 1.2 What s New... 2 1.3 Product Contents...

More information

Scheduling FFT Computation on SMP and Multicore Systems

Scheduling FFT Computation on SMP and Multicore Systems Scheduling FFT Computation on SMP and Multicore Systems Ayaz Ali Dept. of Computer Science University of Houston Houston, TX 77, USA ayaz@cs.uh.edu Lennart Johnsson Dept. of Computer Science University

More information