Program Generation, Optimization, and Adaptation: SPIRAL and other efforts. Markus Püschel, Electrical and Computer Engineering, Carnegie Mellon University. SPIRAL Team: José M. F. Moura (ECE, CMU), James C. Hoe (ECE, CMU), Jeremy Johnson (CS, Drexel), David Padua (CS, UIUC), Manuela Veloso (CS, CMU), Bryan W. Singer (CS, CMU), Jianxin Xiong (CS, UIUC), Franz Franchetti (ECE, CMU), Aca Gacic (ECE, CMU), Yevgen Voronenko (ECE, CMU), Kang Chen (CS, Drexel), Robert W. Johnson (Quarry Comp. Inc.), Nick Rizzolo (CS, UIUC), and many others. Supported by: NSF ACR-234293, NSF ITR/NGS-325687. Thanks also to: Cylab, CMU; Austrian Science Fund; Intel; ITRI, Taiwan; ENSCO, Inc. http://www.spiral.net
Organization Introduction and Overview What's the problem? Overview Program generation, optimization, and adaptation Common philosophy LAPACK, ATLAS, BeBOP FFTW SPIRAL
The Problem: Example Discrete Fourier Transform (DFT)
[Plot over DFT size. Fast: vendor library (hand-tuned assembly), but also FFTW (adaptable library) and SPIRAL (generated code). Slow: a reasonable implementation (Numerical Recipes, GNU Scientific Library).]
Yeah, but the DFT is kind of difficult, so let's take something simpler.
The Problem: Example Matrix-Matrix Mult.
[Plot: MFLOPS versus matrix size. Curves include the vendor library (hand-tuned assembly), ATLAS generated code, and the standard triple loop + compiler optimizations. Graph: Pingali, Yotov, Cornell U.]
How to achieve optimal, portable performance?
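For reference, the "standard triple loop" baseline can be sketched as follows. This is a minimal illustration, not code from any of the libraries named on the slide; the function name `mmm_naive` and the row-major layout are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* C += A * B for n x n row-major matrices: the "standard triple loop".
   The name and the layout are illustrative assumptions. */
void mmm_naive(size_t n, const double *A, const double *B, double *C)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = C[i * n + j];
            for (size_t k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}
```

The loop touches B at stride n in the innermost loop, which is one reason a compiler alone rarely closes the gap to tuned code.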
The Problem (Specifics) Computing platforms are complex: memory hierarchy, undocumented features in microarchitecture, special instructions (vector, FMA, prefetching). Consequences: Performance only roughly determined by operations count. Performance behaviour is hard to understand. Best code is platform-dependent. Computing platforms change fast: code becomes obsolete (almost) as fast as it is written. Compiler optimizations are very limited for numerical code: people often revert to assembly coding (like 5 years ago). How do we port performance across platforms and across time?
Solution #1: Brute Force Thousands of programmers hand-write and hand-tune (assembly) code for the same numerical problems and for every platform and whenever a new platform comes out? Hmm
Solution #2: New Approaches to Code Optimization and Code Creation Linear Algebra: LAPACK, ATLAS BeBOP Tensor Computations (Quantum Chemistry): Sadayappan, Baumgartner et al. (Ohio State) Finite Element Methods: Fenics (U. Chicago) Signal Processing: FFTW SPIRAL VSIPL (Kepner, Lebak et al., MIT Lincoln Labs) New Compiler Techniques (Domain aware/specific): Model-based ATLAS Broadway (Lin, Guyer, U. Texas Austin) SIMD optimizations (Ueberhuber, Univ. Techn. Vienna) Telescoping Languages (Kennedy et al., Rice) See also upcoming Proceedings of the IEEE special issue on Program Generation, Optimization, and Adaptation, http://www.ece.cmu.edu/~spiral/special-issue.html
Philosophy
Present: the starting point is one algorithm/program; high-level information is destroyed; automation is restricted to the implementation level.
Future: algorithm/implementation codesign; domain knowledge is used for optimization; automation covers the algorithm level and the implementation level.
A new breed of domain-aware approaches/tools pushes automation beyond what is currently possible. This applies to software and hardware design alike.
LAPACK/ATLAS (Demmel, Dongarra, Petitet, Whaley, ...) LAPACK: library for linear algebra problems (LU, Cholesky, QR, SVD, ...); static; built on top of BLAS (basic linear algebra subroutines), which is meant to be reoptimized. ATLAS: provides a BLAS library (matrix-matrix, matrix-vector mult., ...); can be generated for every machine; people can contribute hand-tuned code.
ATLAS Architecture
[Diagram: Detect Hardware Parameters -> (L1Size, NR, MulAdd, L*) -> ATLAS Search Engine (MMSearch) -> (NB; MU, NU, KU; xfetch; MulAdd; Latency) -> ATLAS MM Code Generator (MMCase) -> MiniMMM Source -> Compile, Execute, Measure -> MFLOPS, fed back to the search engine. Source: Pingali, Yotov, Cornell U.]
Search parameters: span the search space; determine the generated code; found by orthogonal line search.
Hardware parameters: L1Size: size of the L1 data cache. NR: number of registers. MulAdd: fused multiply-add available? L*: latency of FP multiplication.
ATLAS: Details (I)
Blocking for the L1 data cache: only square blocks NB x NB x NB; search NB with 16 <= NB <= 80 and NB^2 <= L1Size; divides the MMM into mini-MMMs.
Register-level blocking: search parameters MU, NU, KU; clean-up code also generated; divides each mini-MMM into micro-MMMs.
[Diagrams: matrices A, B, C partitioned into NB x NB blocks, and one block of C partitioned into MU x NU register tiles. Source: Pingali, Yotov, Cornell U.]
ATLAS: Details (II)
Code: mini-MMM + micro-MMM
  for (int j = 0; j < NB; j += NU)
    for (int i = 0; i < NB; i += MU)
      load C[i..i+MU-1, j..j+NU-1] into registers    (copy to scalar variables)
      for (int k = 0; k < NB; k++)                   (unrolled)
        load A[i..i+MU-1, k] into registers
        load B[k, j..j+NU-1] into registers
        multiply A's and B's and add to C's
      store C[i..i+MU-1, j..j+NU-1]
Scheduling: interleave loads and operations (using xfetch); interleave adds and mults (using Latency).
[Diagram: schedule of MU*NU multiplies and adds, with the adds skewed by Latency, interleaved with an initial IFetch loads and then groups of NFetch loads until all MU+NU loads are issued. Source: Pingali, Yotov, Cornell U.]
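The mini-MMM/micro-MMM structure can be sketched in plain C as follows. This is an illustration under stated assumptions, not ATLAS output: the register tile is fixed at MU = NU = 2, the k loop is left rolled, no load scheduling is done, and the name `mini_mmm` is made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>

enum { MU = 2, NU = 2 };  /* hypothetical register tile; ATLAS would search these */

/* mini-MMM: C += A * B for NB x NB row-major blocks, NB divisible by MU and NU. */
void mini_mmm(size_t NB, const double *A, const double *B, double *C)
{
    assert(NB % MU == 0 && NB % NU == 0);
    for (size_t j = 0; j < NB; j += NU)
        for (size_t i = 0; i < NB; i += MU) {
            /* load the MU x NU tile of C into scalars ("registers") */
            double c00 = C[i * NB + j],       c01 = C[i * NB + j + 1];
            double c10 = C[(i + 1) * NB + j], c11 = C[(i + 1) * NB + j + 1];
            for (size_t k = 0; k < NB; k++) {        /* micro-MMM */
                double a0 = A[i * NB + k], a1 = A[(i + 1) * NB + k];
                double b0 = B[k * NB + j], b1 = B[k * NB + j + 1];
                c00 += a0 * b0; c01 += a0 * b1;
                c10 += a1 * b0; c11 += a1 * b1;
            }
            /* store the tile back */
            C[i * NB + j] = c00;       C[i * NB + j + 1] = c01;
            C[(i + 1) * NB + j] = c10; C[(i + 1) * NB + j + 1] = c11;
        }
}
```

Each micro-MMM step performs MU*NU multiply-adds from MU+NU loads, which is the ratio the register-level search tries to exploit.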
Model-Based ATLAS (Pingali, Stodghill, Yotov, Padua, ...)
[Diagram: Detect Hardware Parameters -> (L1Size, L1I$Size, NR, MulAdd, L*) -> Model -> (NB; MU, NU, KU; xfetch; MulAdd; Latency) -> ATLAS MM Code Generator (MMCase) -> MiniMMM Source. Graph: Pingali, Yotov, Cornell U.]
Uses more hardware parameters. Uses simple analytical models to compute the MMM parameters.
ATLAS: Experiments
Hand-written code is often substantially faster (e.g., vector instructions). Model-based is comparable to search-based (except Itanium).
[Bar chart: relative performance of ATLAS CGw/S, Model, Refined Model, and Unleashed on Alpha 21264, Power 3, Power 4, R12K, UltraSparc IIIi, Itanium 2, Opteron 240, Athlon MP, Pentium III, Pentium 4. Graph: Pingali, Yotov, Cornell U.]
See papers by Demmel et al. (ATLAS) and Yotov et al. (Model-based ATLAS) in the special issue.
BeBOP (formerly SPARSITY) (Vuduc, Yelick, Im) Library for sparse linear algebra problems, in particular sparse matrix-vector multiply. Difference to the dense case: the matrix is only known at runtime, which requires runtime adaptation. Main parameter: the degree of register blocking, which determines the sparse format. Approach: offline (machine dependent, matrix independent): measure the performance P(r,c) of dense matrix-vector multiply in sparse format for register blocks r x c, 1 <= r,c <= 12; for a given matrix M: compute the average fill ratio f(r,c,M) for all these block sizes and choose the r,c that maximizes P(r,c)/f(r,c,M).
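The selection rule can be sketched as follows. This is a toy illustration, not BeBOP code: the matrix is given as a dense 0/1 pattern, the fill ratio is computed exactly rather than estimated, the search range is cut down to 2 x 2, and the profile `P` and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Fill ratio f(r,c,M): entries stored by an r x c blocked format (explicit
   zeros included) divided by the true number of nonzeros. M is a dense
   row-major 0/1 pattern here, purely for illustration. */
double fill_ratio(size_t n, const unsigned char *M, size_t r, size_t c)
{
    assert(r > 0 && c > 0);
    size_t nnz = 0, blocks = 0;
    for (size_t bi = 0; bi < n; bi += r)
        for (size_t bj = 0; bj < n; bj += c) {
            int occupied = 0;
            for (size_t i = bi; i < bi + r && i < n; i++)
                for (size_t j = bj; j < bj + c && j < n; j++)
                    if (M[i * n + j]) { nnz++; occupied = 1; }
            blocks += (size_t)occupied;
        }
    assert(nnz > 0);
    return (double)(blocks * r * c) / (double)nnz;
}

enum { RMAX = 2, CMAX = 2 };  /* tiny hypothetical search range */

/* Pick (r,c) maximizing P(r,c) / f(r,c,M). P is the offline dense profile,
   stored row-major with P[(r-1)*CMAX + (c-1)]. */
void choose_block(size_t n, const unsigned char *M, const double *P,
                  size_t *best_r, size_t *best_c)
{
    double best = -1.0;
    for (size_t r = 1; r <= RMAX; r++)
        for (size_t c = 1; c <= CMAX; c++) {
            double score = P[(r - 1) * CMAX + (c - 1)] / fill_ratio(n, M, r, c);
            if (score > best) { best = score; *best_r = r; *best_c = c; }
        }
}
```

With a profile that favors 2 x 2 blocks, a matrix made of dense 2 x 2 blocks selects 2 x 2, while a diagonal matrix falls back to 1 x 1 because the fill ratio doubles the stored entries.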
BeBOP: Experiments
[Plot: speed-up through blocking (Itanium 2) as a function of row block size r and column block size c. Source: Vuduc, U. Berkeley]
Matrix: n = 2226; source: NASA structural analysis; very regular 8 x 8 block structure. The best block size is not 8 x 8, but 2 x 4.
FFTW (Frigo, Johnson) Self-adaptable library for the discrete Fourier transform (DFT). DFTs: real/complex, arbitrary dimension, scalar and vector code. Design in two layers: small problem sizes (codelets): straightline code, automatically generated offline (same for every platform) and highly optimized (CSE, scheduling, SSA, ...); large problem sizes: an empirical dynamic programming search finds the best recursion strategy, called a plan (adaptivity). Usage in two stages: the user specifies the DFT (size, dim, etc.) and the planner computes the plan by search; the plan is then applied to the input data.
Spiral Code generation from scratch for linear digital signal processing (DSP) transforms (DFT, DCT, DWT, filters, ...). Automatic optimization and platform-tuning at the algorithmic level and the code level. Different code types supported (scalar, vector, FMA, fixed-point, multiplierless, ...). Goal: a flexible, extensible code generation framework that can survive time (to whatever extent possible) for an entire domain of algorithms.
Code Generation and Tuning as Optimization Problem
T: a DSP transform to be implemented
P: the target platform
\mathcal{I}(T,P): set of possible implementations of T on P
C(T,P,I): cost of implementation I of T on P
The implementation of T that is tuned to P is given by:
  \hat{I}(P) = \arg\min_{I \in \mathcal{I}(T,P)} C(T,P,I)
Problems: How to characterize and generate the set of implementations? How to efficiently minimize C?
Spiral exploits the domain-specific structure to implement a solver for this optimization problem.
Spiral's architecture Domain knowledge: Generating algorithms & manipulating algorithms Architecture knowledge: by evaluating runtime
From Transform to Algorithm (Formula) Input: Transform specification Output: Fast algorithm as formula Domain Knowledge I: Generating the algorithm space
DSP Algorithms: Example 4-point DFT
Cooley/Tukey FFT (size 4):
  DFT_4 = (DFT_2 \otimes I_2) T^4_2 (I_2 \otimes DFT_2) L^4_2
  DFT: Fourier transform; I: identity; L: permutation; T: diagonal matrix (twiddles); \otimes: Kronecker product
Algorithms reduce arithmetic cost: O(n^2) -> O(n log(n)).
Mathematical notation exhibits structure: SPL (signal processing language) contains all information to generate code.
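Read right to left, the size-4 Cooley/Tukey factorization DFT_4 = (DFT_2 \otimes I_2) T^4_2 (I_2 \otimes DFT_2) L^4_2 maps directly to code. The sketch below is hand-written for illustration (it is not Spiral output) and assumes the forward DFT with root of unity exp(-2*pi*i/4) = -i; the name `dft4` is made up.

```c
#include <assert.h>
#include <complex.h>

/* y = DFT_4 x via (DFT_2 (x) I_2) T^4_2 (I_2 (x) DFT_2) L^4_2.
   Assumes the forward DFT convention, so the only nontrivial twiddle is -i. */
void dft4(double complex *y, const double complex *x)
{
    assert(y != x);   /* out-of-place only */
    double complex t[4];
    /* L^4_2 followed by I_2 (x) DFT_2: butterflies on (x0,x2) and (x1,x3) */
    t[0] = x[0] + x[2];  t[1] = x[0] - x[2];
    t[2] = x[1] + x[3];  t[3] = x[1] - x[3];
    /* T^4_2 = diag(1, 1, 1, -i): twiddle factors */
    t[3] *= -I;
    /* DFT_2 (x) I_2: butterflies at stride 2 */
    y[0] = t[0] + t[2];  y[1] = t[1] + t[3];
    y[2] = t[0] - t[2];  y[3] = t[1] - t[3];
}
```

For the input (1, 2, 3, 4) this yields (10, -2+2i, -2, -2-2i), matching the definition of the DFT, with 8 complex additions and one trivial multiplication instead of a full 4 x 4 matrix-vector product.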
DSP Algorithms: Spiral Terminology
Transform: DFT_n; a parameterized matrix
Rule: DFT_{nm} -> (DFT_n \otimes I_m) D (I_n \otimes DFT_m) P; a breakdown strategy; a product of sparse matrices
Ruletree: DFT_8 -> (DFT_2, DFT_4 -> (DFT_2, DFT_2)); recursive application of rules; uniquely defines an algorithm; efficient representation; easy manipulation
Formula: DFT_8 = (F_2 \otimes I_4) D (I_2 \otimes ((F_2 \otimes I_2) D' (I_2 \otimes F_2) P')) P, with D, D' diagonal and P, P' permutations; few constructs and primitives; uniquely defines an algorithm; can be translated into code
Some Transforms Spiral currently contains 36 transforms
Some Breakdown Rules Base case rules Spiral contains + rules
Algorithm (Formula) Generation Given a transform: apply breakdown rules recursively until all occurring transforms are expanded. The choice of rules at each step yields a (usually) exponentially large algorithm space; the algorithms are about equal in operations count but differ in data flow.
[Table: number of algorithms for the DFT and the DCT-IV of size 2^k, by k.]
From Algorithm (Formula) to Optimized Algorithm Input: Fast algorithm as formula Output: Optimized formula Domain Knowledge II: Optimizing an algorithm
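Motivation: Fusing Loops
Before (explicit permutation, then butterflies):
  void I4xF2_L84(double *y, double *x) {
    double t[8];
    /* stride permutation L^8_4: read the input at stride 4 */
    for (int i=0; i<8; i++)
      t[i] = x[i==7 ? 7 : (i*4)%7];
    /* I_4 (x) F_2: four butterflies on adjacent pairs */
    for (int i=0; i<4; i++){
      y[2*i] = t[2*i] + t[2*i+1];
      y[2*i+1] = t[2*i] - t[2*i+1];
    }
  }
Algebraic transformations: formula manipulation, loop splitting, loop fusion
After (permutation fused into the loop):
  void I4xF2_L84(double *y, double *x) {
    for (int j=0; j<4; j++){
      y[2*j] = x[j] + x[j+4];
      y[2*j+1] = x[j] - x[j+4];
    }
  }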
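The equivalence of the two loop structures can be checked directly. The sketch below assumes the stride permutation L^8_4 acts as a gather (t[i] = x[4i mod 7], with index 7 fixed), which is the reading under which the fused loop computes the same result; the function names are made up for this comparison.

```c
#include <assert.h>

/* Unfused: explicit stride permutation L^8_4 (as a gather), then I_4 (x) F_2. */
void i4xf2_l84_unfused(double *y, const double *x)
{
    assert(y != x);
    double t[8];
    for (int i = 0; i < 8; i++)
        t[i] = x[i == 7 ? 7 : (i * 4) % 7];
    for (int i = 0; i < 4; i++) {
        y[2 * i]     = t[2 * i] + t[2 * i + 1];
        y[2 * i + 1] = t[2 * i] - t[2 * i + 1];
    }
}

/* Fused: the permutation is folded into the index expressions,
   so the temporary array and one pass over the data disappear. */
void i4xf2_l84_fused(double *y, const double *x)
{
    assert(y != x);
    for (int j = 0; j < 4; j++) {
        y[2 * j]     = x[j] + x[j + 4];
        y[2 * j + 1] = x[j] - x[j + 4];
    }
}
```

A general-purpose compiler would have to prove that the modular index expression is a permutation before it could perform this fusion, which is why the slide places the transformation at the formula level.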
Formula Level Optimization Main goals: Fusing iterative steps (fusing loops), e.g., permutations with loops Improving structure (data flow) for SIMD instructions Overcomes compiler limitations Formula manipulation through mathematical rules Implemented using multiple levels of rewriting systems Puts math knowledge into the system
Structure of Loop Optimization SPL formula -> (to Σ-SPL) -> Σ-SPL formula -> (join permutations) -> Σ-SPL formula -> (rules: join diagonals and monomials) -> Σ-SPL formula
Vector code generation from SPL formulas
Naturally vectorizable construct: y = (A \otimes I_4) x, where 4 is the vector length
(Current) generic construct, completely vectorizable:
  \prod_{i=1}^{k} P_i D_i (A_i \otimes I_\nu) E_i Q_i
  P_i, Q_i: permutations; D_i, E_i: diagonals; A_i: arbitrary formulas; \nu: SIMD vector length
Vectorization in two steps: 1. Formula manipulation using manipulation rules 2. Code generation (vector code + C code)
Formula manipulation overcomes compiler limitations
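Why A \otimes I_\nu is naturally vectorizable can be seen from a scalar sketch: every scalar operation of y = A x turns into the same operation on \nu contiguous elements. The code below is plain C for illustration only (no SIMD intrinsics); the name `a_tensor_i`, the dense row-major A, and the fixed vector length 4 are assumptions.

```c
#include <assert.h>
#include <stddef.h>

enum { NU_VEC = 4 };   /* hypothetical SIMD vector length */

/* y = (A (x) I_nu) x for an m x m row-major A. The loops over l are the
   parts that map one-to-one onto nu-way vector instructions. */
void a_tensor_i(size_t m, const double *A, const double *x, double *y)
{
    assert(y != x);
    for (size_t i = 0; i < m; i++) {
        double acc[NU_VEC] = {0.0};
        for (size_t j = 0; j < m; j++)
            for (size_t l = 0; l < NU_VEC; l++)      /* one vector multiply-add */
                acc[l] += A[i * m + j] * x[j * NU_VEC + l];
        for (size_t l = 0; l < NU_VEC; l++)          /* one vector store */
            y[i * NU_VEC + l] = acc[l];
    }
}
```

No element ever moves between vector lanes, so no shuffle instructions are needed; the permutations and diagonals in the generic construct are where the extra work goes.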
Example DFT Standard FFT Formula manipulation Vector FFT for ν-way vector instructions
From Optimized Algorithm (Formula) to Code Input: Optimized formula Output: Intermediate Code Straightforward
From Code to Optimized Code Input: Intermediate Code Output: Optimized C code
Code Level Optimizations Precomputation of constants Loop unrolling (controlled by search module) Constant inlining SSA code, array scalarization, algebraic simplifications, CSE Code reordering for locality (optional) Conversion to FMA code (optional) Conversion to fixed point code (optional) Conversion to multiplierless code (optional) Finally: Unparsing to C (or Fortran)
Evaluating Code Input: Optimized C code Output: Performance Number Straightforward Examples: runtime accuracy operations count
Search (Learning) for the Best Input: Performance Number Output: Controls Formula Generation Controls Implementation Options
Search Methods Search over: algorithmic degrees of freedom (choice of breakdown rules); implementation degrees of freedom (degree of unrolling). Operates with the ruletree representation of an algorithm: transform independent, efficient. Methods: Exhaustive Search, Dynamic Programming (DP), Random Search, Hill Climbing, STEER (an evolutionary algorithm).
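As an illustration of dynamic programming over ruletrees, the sketch below searches the splits DFT_{2^k} -> (DFT_{2^a}, DFT_{2^b}), a + b = k, reusing the best tree found for each smaller size. It is a toy under stated assumptions: the cost functions `toy_leaf` and `toy_glue` are invented stand-ins for measured runtimes, and DP assumes that the best subtree does not depend on its context.

```c
#include <assert.h>

enum { KMAX = 20 };

typedef double (*leaf_cost_fn)(int k);         /* unrolled DFT_{2^k}; < 0 if none */
typedef double (*glue_cost_fn)(int a, int b);  /* twiddles + permutation */

/* Hypothetical cost model, for illustration only. */
double toy_leaf(int k)
{
    static const double t[] = {-1.0, 2.0, 6.0, 30.0};
    return (k >= 1 && k <= 3) ? t[k] : -1.0;
}
double toy_glue(int a, int b) { return (double)(1 << (a + b)); }

/* best[k]: minimal cost found for DFT_{2^k};
   split[k]: a of the best split, or 0 if a leaf (base case) wins. */
void dp_search(int kmax, leaf_cost_fn leaf, glue_cost_fn glue,
               double best[], int split[])
{
    assert(kmax >= 1 && kmax <= KMAX);
    for (int k = 1; k <= kmax; k++) {
        best[k] = leaf(k);
        split[k] = 0;
        for (int a = 1; a < k; a++) {
            int b = k - a;
            /* DFT_{2^a} (x) I_{2^b} runs 2^b DFTs of size 2^a, and vice versa */
            double c = (double)(1 << b) * best[a]
                     + (double)(1 << a) * best[b] + glue(a, b);
            if (best[k] < 0.0 || c < best[k]) { best[k] = c; split[k] = a; }
        }
        assert(best[k] >= 0.0);   /* some base case must exist for k = 1 */
    }
}
```

With the toy costs, size 2^4 is best split as 2^2 x 2^2, while size 2^3 prefers the split 2^1 x 2^2 over its (expensive) leaf. In a real system the cost of each candidate would come from generating, compiling, and timing code.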
STEER Population n -> Population n+1 by: Mutation (expand differently), Cross-Breeding (swap expansions), Survival of the Fittest
Learning Procedure: Generate a set of ( say) algorithms and their runtimes (one transform, one size); represent algorithms by features. From this data (pairs of features and runtimes), learn a set of algorithm design rules. From this set, generate best algorithms (theory of Markov decision processes). Evaluation: Tested for WHT and DFT. From data generated for one size (2^5) could construct best algorithms across sizes (2^2-2^8). Bryan Singer and Manuela Veloso, "Learning to Construct Fast Signal Processing Implementations," Journal of Machine Learning Research, 2002, Vol. 3, pp. 887-919
SPIRAL Experiments
Performance Spread: DCT, size 32 Histograms,, algorithms P4, 3.2 GHz, gcc runtime: x2 #ops: x.8 #assembly instr: x.5 #ops vs. runtime: no correlation #fma ops: x.2 accuracy: x, most x2
Performance Spread: DFT 2^6 Histograms, 2, Algorithms P4, 3.2 GHz, icc 8. Generated scalar code Generated SSE code (4-way vector single precision) Generality of vectorization (all algorithms improve) Efficiency of vectorization (x 2.5 gain)
Benchmark: DFT, 2-powers2 P4, 3.2 GHz, icc 8. Vendor code: hand-tuned assembly? Higher is better Single precision code limitations of compiler vectorization Spiral code competitive with the best
DFT for one BG/L node One BG/L node PowerPC 44 FP2 5 MHz Up to 4% speedup using 2-way vector instructions Results in collaboration with Ueberhuber/Franchetti Univ. of Technology Vienna
Benchmark: Fixed-Point DFT, IPAQ IPAQ Xscale arch. 4 MHz Fixed-point only Higher is better Intel spent less effort? Spiral code generated without modifications to the system
Benchmark: DCT 1-dim DCT, 2-dim DCT P4, 3.2 GHz, icc 8. Scalar code; scalar vs. SSE code. Factor 2 over vendor code; another factor of 3 with special instructions
Platform-Dependence 5% loss when porting PIII tuned code to P4
Compiler Flags P4, 3.2 GHz, gcc ACOVEA: evolutionary search for best compiler flags Runtime histogram random compiler flags incl. -O march=pentium4 % improvement when applied to best Spiral generated code
Multiplierless DFT, IPAQ IPAQ Xscale arch. 4 MHz Fixed-point only fixed-point constant multiplications replaced by adds and shifts trade-off runtime and precision
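The replacement of a fixed-point constant multiplication by adds and shifts, and the trade-off between runtime and precision, can be sketched as follows. The constant (cos(pi/4), about 0.7071), the two precisions, and the function names are assumptions chosen for illustration; they are not taken from Spiral-generated code.

```c
#include <assert.h>
#include <stdint.h>

/* x * 181/256 (181/256 = 0.70703, 8 fractional bits) with shifts and adds:
   181 = 128 + 32 + 16 + 4 + 1, i.e. 4 additions.
   Input assumed non-negative and small enough not to overflow. */
int32_t mul_sqrt_half_q8(int32_t x)
{
    assert(x >= 0 && x < (1 << 23));
    int32_t p = (x << 7) + (x << 5) + (x << 4) + (x << 2) + x;
    return p >> 8;
}

/* Cheaper and less precise: x * 11/16 (11/16 = 0.6875, 4 fractional bits):
   11 = 8 + 2 + 1, i.e. 2 additions. */
int32_t mul_sqrt_half_q4(int32_t x)
{
    assert(x >= 0 && x < (1 << 27));
    return ((x << 3) + (x << 1) + x) >> 4;
}
```

For x = 1000 the 8-bit constant gives 707 and the 4-bit constant gives 687, against an exact value of about 707.1: halving the number of additions costs roughly 3% accuracy in this example.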
Current/Future Work Complete reimplementation of system (incl. vector code, FMA code, code for SMP platforms, fixed-point code) Generation of packages (generating something like FFTW?) Spiral as educational tool Other domains: SP applications (backprojection) Cryptography (seems to be not suitable) Communications Maybe linear algebra problems Generating special purpose hardware (FPGA)
Summary Code generation and tuning as optimization problem over the algorithm and implementation space. Exploit the structure of the domain to solve it. Declarative framework for computer representation of the domain knowledge: enables algorithm generation and optimization (teaches the system the math; does not become obsolete?). Compiler to translate the math description into code. Search and learning to explore the implementation space: closes the feedback loop, gives the system "intelligence". Extensible, versatile. Every step in the code generation is exposed. www.spiral.net
Questions? Check out: upcoming Proceedings of the IEEE special issue on Program Generation, Optimization, and Adaptation, http://www.ece.cmu.edu/~spiral/special-issue.html