Program Generation, Optimization, and Adaptation: SPIRAL and other efforts


Program Generation, Optimization, and Adaptation: SPIRAL and other efforts
Markus Püschel, Electrical and Computer Engineering, Carnegie Mellon University
SPIRAL Team: José M. F. Moura (ECE, CMU), James C. Hoe (ECE, CMU), Jeremy Johnson (CS, Drexel), David Padua (CS, UIUC), Manuela Veloso (CS, CMU), Bryan W. Singer (CS, CMU), Jianxin Xiong (CS, UIUC), Franz Franchetti (ECE, CMU), Aca Gacic (ECE, CMU), Yevgen Voronenko (ECE, CMU), Kang Chen (CS, Drexel), Robert W. Johnson (Quarry Comp. Inc.), Nick Rizzolo (CS, UIUC), and many others.
Supported by NSF ACR-0234293 and NSF ITR/NGS-0325687. Thanks also to: CyLab (CMU), Austrian Science Fund, Intel, ITRI (Taiwan), ENSCO, Inc.
http://www.spiral.net

Organization
Introduction and overview: What's the problem? Overview.
Program generation, optimization, and adaptation: common philosophy; LAPACK, ATLAS, BeBOP; FFTW; SPIRAL.

The Problem: Example Discrete Fourier Transform (DFT)
(graph: DFT performance versus size; a reasonable implementation (Numerical Recipes, GNU Scientific Library) is far slower not only than the vendor library (hand-tuned assembly), but also than FFTW (adaptable library) and SPIRAL (generated code))
Yeah, but the DFT is kind of difficult, so let's take something simpler...

The Problem: Example Matrix-Matrix Multiplication (MMM)
(graph: MFLOPS versus matrix size; the vendor BLAS library (hand-tuned assembly) and the ATLAS-generated code (CGw/S, Unleashed, Model) are roughly 6x faster than the standard triple loop with compiler optimizations; graph: Pingali, Yotov, Cornell U.)
How to achieve optimal, portable performance?
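For reference, this is the kind of "standard triple loop" the compiler baseline in the graph stands for; a minimal sketch (row-major storage assumed) that leaves all optimization to the compiler:

    /* Naive MMM baseline: C = A*B, n x n, row-major.
       All performance is left to the compiler. */
    void mmm_naive(int n, const double *A, const double *B, double *C) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++)
                    sum += A[i*n + k] * B[k*n + j];
                C[i*n + j] = sum;
            }
    }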

The Problem (Specifics)
Computing platforms are complex: memory hierarchy, undocumented features in the microarchitecture, special instructions (vector, FMA, prefetching). Consequences: performance is only roughly determined by operations count; performance behavior is hard to understand; the best code is platform-dependent.
Computing platforms change fast: code becomes obsolete (almost) as fast as it is written.
Compiler optimizations are very limited for numerical code: people often revert to assembly coding (like 50 years ago).
How do we port performance across platforms and across time?

Solution #1: Brute Force
Thousands of programmers hand-write and hand-tune (assembly) code for the same numerical problems, for every platform, and whenever a new platform comes out? Hmm...

Solution #2: New Approaches to Code Optimization and Code Creation
Linear algebra: LAPACK, ATLAS, BeBOP.
Tensor computations (quantum chemistry): Sadayappan, Baumgartner et al. (Ohio State).
Finite element methods: FEniCS (U. Chicago).
Signal processing: FFTW, SPIRAL, VSIPL (Kepner, Lebak et al., MIT Lincoln Labs).
New compiler techniques (domain-aware/specific): model-based ATLAS, Broadway (Lin, Guyer, U. Texas Austin), SIMD optimizations (Ueberhuber, Univ. Techn. Vienna), telescoping languages (Kennedy et al., Rice).
See also the upcoming Proceedings of the IEEE special issue on Program Generation, Optimization, and Adaptation, http://www.ece.cmu.edu/~spiral/special-issue.html

Philosophy
Present: the starting point is one algorithm/program; high-level information is destroyed at the implementation level; automation is restricted.
Future: algorithm/implementation codesign; domain knowledge used for optimization; a new breed of domain-aware approaches/tools pushes automation beyond what is currently possible; applies to software and hardware design alike.

Organization
Introduction and overview: What's the problem? Overview.
Program generation, optimization, and adaptation: common philosophy; LAPACK, ATLAS, BeBOP; FFTW; SPIRAL.

LAPACK/ATLAS (Demmel, Dongarra, Petitet, Whaley, ...)
LAPACK: library for linear algebra problems (LU, Cholesky, QR, SVD, ...); static; built on top of BLAS (basic linear algebra subroutines), which is meant to be reoptimized.
ATLAS: provides the BLAS library (matrix-matrix and matrix-vector multiplication, ...); can be generated for every machine; people can contribute hand-tuned code.

ATLAS Architecture
(diagram: Detect Hardware Parameters (L1Size, NR, MulAdd, L*) feeds the ATLAS Search Engine (MMSearch), which drives the ATLAS MM Code Generator (MMCase) via the parameters NB, MU, NU, KU, xfetch, MulAdd, Latency; the generated mini-MMM source is compiled, executed, and measured (MFLOPS), closing the feedback loop; source: Pingali, Yotov, Cornell U.)
Search parameters: span the search space, determine the code; found by orthogonal line search.
Hardware parameters: L1Size: size of the L1 data cache; NR: number of registers; MulAdd: fused multiply-add available?; L*: latency of FP multiplication.
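To illustrate orthogonal line search (not ATLAS's actual code): each parameter is optimized in turn while the others stay fixed at their current best values. The measure() function below is a purely illustrative stand-in for "generate the mini-MMM, compile, execute, measure MFLOPS":

    #include <stdio.h>

    /* Stand-in objective; in ATLAS this would compile and time a mini-MMM. */
    static double measure(const int p[2]) {
        double nb = p[0], mu = p[1];
        return 1000.0 - (nb - 60.0)*(nb - 60.0) - 10.0*(mu - 4.0)*(mu - 4.0);
    }

    /* Orthogonal line search: sweep each parameter, freeze its best value. */
    static void line_search(int p[2], const int lo[2], const int hi[2],
                            const int step[2]) {
        for (int d = 0; d < 2; d++) {
            int best_v = p[d];
            double best = measure(p);
            for (int v = lo[d]; v <= hi[d]; v += step[d]) {
                p[d] = v;
                double m = measure(p);
                if (m > best) { best = m; best_v = v; }
            }
            p[d] = best_v;   /* keep the best value for the next sweep */
        }
    }

    int main(void) {
        int p[2] = {16, 1}, lo[2] = {16, 1}, hi[2] = {80, 8}, step[2] = {4, 1};
        line_search(p, lo, hi, step);
        printf("NB = %d, MU = %d\n", p[0], p[1]);
        return 0;
    }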

ATLAS: Details (I)
Blocking for the L1 data cache: only square NB x NB blocks; search over NB with 16 <= NB <= 80 and NB^2 <= L1Size; the MMM divides into mini-MMMs.
Register-level blocking: search parameters MU, NU, KU; clean-up code is also generated; a mini-MMM divides into micro-MMMs.
(source: Pingali, Yotov, Cornell U.)

ATLAS: Details (II)
Code: mini-MMM + micro-MMM; the C tile is copied to scalar variables and the innermost block is fully unrolled:

    for (int j = 0; j < NB; j += NU)
      for (int i = 0; i < NB; i += MU) {
        // load C[i..i+MU-1, j..j+NU-1] into registers (copy to scalar variables)
        for (int k = 0; k < NB; k++) {
          // load A[i..i+MU-1, k] into registers
          // load B[k, j..j+NU-1] into registers
          // multiply A's and B's and add to C's (unrolled)
        }
        // store C[i..i+MU-1, j..j+NU-1]
      }

Scheduling: the MU+NU loads per k step are interleaved with the computation (initial IFetch loads, then NFetch loads, controlled by xfetch); adds and multiplies are interleaved using Latency (e.g., Latency = 2). (source: Pingali, Yotov, Cornell U.)
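Concretely, a fully unrolled micro-MMM for the (hypothetical) choice MU = NU = 2 would look like this; the C tile lives in scalar variables, which the register allocator maps to registers (a sketch, not ATLAS output):

    /* Micro-MMM sketch for MU = NU = 2 on a row-major NB x NB block. */
    void micro_mmm_2x2(int NB, const double *A, const double *B, double *C,
                       int i, int j) {
        double c00 = C[(i+0)*NB + j+0], c01 = C[(i+0)*NB + j+1];
        double c10 = C[(i+1)*NB + j+0], c11 = C[(i+1)*NB + j+1];
        for (int k = 0; k < NB; k++) {
            double a0 = A[(i+0)*NB + k], a1 = A[(i+1)*NB + k];
            double b0 = B[k*NB + j+0],  b1 = B[k*NB + j+1];
            c00 += a0*b0; c01 += a0*b1;
            c10 += a1*b0; c11 += a1*b1;
        }
        C[(i+0)*NB + j+0] = c00; C[(i+0)*NB + j+1] = c01;
        C[(i+1)*NB + j+0] = c10; C[(i+1)*NB + j+1] = c11;
    }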

Model-Based ATLAS (Pingali, Stodghill, Yotov, Padua, ...)
(diagram: Detect Hardware Parameters (L1Size, L1 I-cache size, NR, MulAdd, L*) feeds an analytical Model, which supplies NB, MU, NU, KU, xfetch, MulAdd, Latency directly to the ATLAS MM Code Generator (MMCase), producing the mini-MMM source without search; graph: Pingali, Yotov, Cornell U.)
Uses more hardware parameters; uses simple analytical models to compute the MMM parameters.

ATLAS: Experiments
Platforms: Alpha 21264, Power 3, Power 4, R12K, UltraSparc IIIi, Itanium 2, Opteron 240, Athlon MP, Pentium III, Pentium 4.
(graph: relative performance of ATLAS CGw/S, Model, Refined Model, and Unleashed on each platform; graph: Pingali, Yotov, Cornell U.)
Hand-written code is often substantially faster (e.g., vector instructions). Model-based is comparable to search-based (except on Itanium).
See the papers by Demmel et al. (ATLAS) and Yotov et al. (model-based ATLAS) in the special issue.

BeBOP (formerly SPARSITY) (Vuduc, Yelick, Im)
Library for sparse linear algebra problems, in particular sparse matrix-vector multiply.
Difference to the dense case: the matrix is only known at runtime, which requires runtime adaptation.
Main parameter: the degree of register blocking, which determines the sparse format.
Approach: offline (machine-dependent, matrix-independent), measure the performance P(r,c) of dense matrix-vector multiply in sparse format for register blocks r x c, 1 <= r,c <= 12; for a given matrix M, compute the average fill ratio f(r,c,M) for all these block sizes; choose the r,c that maximizes P(r,c)/f(r,c,M).
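A minimal sketch of the resulting kernel for a 2 x 2 register blocking (block-compressed sparse row storage; the struct layout here is illustrative, not OSKI's API):

    /* y = M*x for a matrix stored as 2x2 blocks (BSR).
       Each block row accumulates into two scalars (registers). */
    typedef struct {
        int nbrows;          /* number of block rows (2 matrix rows each)  */
        const int *rowptr;   /* block-row start indices into colind/vals   */
        const int *colind;   /* block column indices (units of 2 columns)  */
        const double *vals;  /* 2x2 blocks, row-major, 4 doubles per block */
    } bsr2x2;

    void spmv_bsr2x2(const bsr2x2 *M, const double *x, double *y) {
        for (int r = 0; r < M->nbrows; r++) {
            double y0 = 0.0, y1 = 0.0;
            for (int b = M->rowptr[r]; b < M->rowptr[r+1]; b++) {
                const double *v  = &M->vals[4*b];
                const double *xb = &x[2*M->colind[b]];
                y0 += v[0]*xb[0] + v[1]*xb[1];
                y1 += v[2]*xb[0] + v[3]*xb[1];
            }
            y[2*r] = y0; y[2*r + 1] = y1;
        }
    }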

BeBOP: Experiments
(graph: speed-up through blocking on Itanium 2, as a function of row block size r and column block size c; matrix from NASA structural analysis with a very regular 8x8 block structure; source: Vuduc, U.C. Berkeley)
The best block size is not 8 x 8, but 2 x 4.

Organization
Introduction and overview: What's the problem? Overview.
Program generation, optimization, and adaptation: common philosophy; LAPACK, ATLAS, BeBOP; FFTW; SPIRAL.

FFTW (Frigo, Johnson)
Self-adaptable library for the discrete Fourier transform (DFT). DFTs: real/complex, arbitrary dimension, scalar and vector code.
Design in two layers: small problem sizes (codelets) are straight-line code, automatically generated offline (the same for every platform) and highly optimized (CSE, scheduling, SSA, ...); for large problem sizes, an empirical dynamic-programming search finds the best recursion strategy, called a plan (adaptivity).
Usage in two stages: the user specifies the DFT (size, dimension, etc.); the planner computes the plan by search; the plan is applied to the input data.
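The two-stage usage looks like this in FFTW3's API (a minimal sketch; error handling omitted):

    #include <fftw3.h>

    int main(void) {
        int n = 1024;
        fftw_complex *in  = fftw_malloc(sizeof(fftw_complex) * n);
        fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * n);

        /* stage 1: the planner searches for the best recursion strategy */
        fftw_plan p = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_MEASURE);

        /* ... fill in[] with data ... */

        /* stage 2: apply the plan (reusable for many inputs) */
        fftw_execute(p);

        fftw_destroy_plan(p);
        fftw_free(in);
        fftw_free(out);
        return 0;
    }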

Organization
Introduction and overview: What's the problem? Overview.
Program generation, optimization, and adaptation: common philosophy; LAPACK, ATLAS, BeBOP; FFTW; SPIRAL.

Spiral
Code generation from scratch for linear digital signal processing (DSP) transforms (DFT, DCT, DWT, filters, ...).
Automatic optimization and platform-tuning at the algorithmic level and the code level.
Different code types supported (scalar, vector, FMA, fixed-point, multiplierless, ...).
Goal: a flexible, extensible code generation framework that can survive time (to whatever extent possible) for an entire domain of algorithms.

Code Generation and Tuning as an Optimization Problem
Given a DSP transform T to be implemented, the target platform P, the set S(T,P) of possible implementations of T on P, and the cost C(I,P) of implementation I of T on P, the implementation of T that is tuned to P is given by

$$\hat{I} = \arg\min_{I \in S(T,P)} C(I, P)$$

Problems: How to characterize and generate the set of implementations? How to efficiently minimize C?
Spiral exploits the domain-specific structure to implement a solver for this optimization problem.

Spiral's Architecture
(diagram: the Spiral feedback loop)
Domain knowledge enters by generating and manipulating algorithms; architecture knowledge enters by evaluating runtime.

From Transform to Algorithm (Formula)
Input: transform specification. Output: fast algorithm as a formula.
Domain Knowledge I: generating the algorithm space.

DSP Algorithms: Example 4-Point DFT
Cooley/Tukey FFT (size 4):

$$\mathrm{DFT}_4 = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & i & -1 & -i \\ 1 & -1 & 1 & -1 \\ 1 & -i & -1 & i \end{bmatrix} = (\mathrm{DFT}_2 \otimes I_2)\, T^4_2\, (I_2 \otimes \mathrm{DFT}_2)\, L^4_2$$

where DFT_2 is the Fourier transform, I_2 the identity, L^4_2 a permutation, T^4_2 a diagonal matrix (twiddles), and ⊗ the Kronecker product.
Algorithms reduce the arithmetic cost from O(n^2) to O(n log n). The mathematical notation exhibits structure: SPL (signal processing language) contains all information needed to generate code.

DSP Algorithms: Spiral Terminology
Transform: DFT_n, a parameterized matrix.
Rule: a breakdown strategy, a product of sparse matrices, e.g.,

$$\mathrm{DFT}_{nm} \to (\mathrm{DFT}_n \otimes I_m)\, D\, (I_n \otimes \mathrm{DFT}_m)\, P$$

Ruletree: e.g., DFT_8 expanded into DFT_2 and DFT_4, with DFT_4 expanded into DFT_2 and DFT_2; the recursive application of rules uniquely defines an algorithm; efficient representation, easy manipulation.
Formula: e.g.,

$$\mathrm{DFT}_8 = (F_2 \otimes I_4)\, D\, \bigl(I_2 \otimes (F_2 \otimes I_2)\, D'\, (I_2 \otimes F_2)\, L^4_2\bigr)\, L^8_2$$

few constructs and primitives; uniquely defines an algorithm; can be translated into code.
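As a hedged illustration of how such a ruletree maps to code (this is not Spiral output; Spiral generates unrolled, optimized code rather than a plain recursion), here is the Cooley-Tukey rule written as a radix-2 recursion in C99:

    #include <complex.h>
    #include <math.h>

    /* Radix-2 Cooley-Tukey: DFT_n = (DFT_2 (x) I_m) D (I_2 (x) DFT_m) L,
       with n = 2m. Unoptimized reference recursion. */
    void dft(int n, const double complex *x, double complex *y) {
        if (n == 1) { y[0] = x[0]; return; }   /* base case rule */
        int m = n / 2;
        double complex e[m], o[m], E[m], O[m];
        for (int j = 0; j < m; j++) {          /* L: stride permutation */
            e[j] = x[2*j];
            o[j] = x[2*j + 1];
        }
        dft(m, e, E);                          /* I_2 (x) DFT_m */
        dft(m, o, O);
        for (int k = 0; k < m; k++) {          /* D, then DFT_2 (x) I_m */
            double complex t = cexp(-2.0 * M_PI * I * k / n) * O[k];
            y[k]     = E[k] + t;
            y[k + m] = E[k] - t;
        }
    }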

Some Transforms
Spiral currently contains 36 transforms.

Some Breakdown Rules
Spiral contains a large set of breakdown rules, plus base case rules.

Algorithm (Formula) Generation
Given a transform: apply breakdown rules recursively until all occurring transforms are expanded. The choice of rules at each step yields a (usually) exponentially large algorithm space; the algorithms are about equal in operations count but differ in data flow.
(table: number of algorithms for the DFT of size 2^k and the DCT-IV of size 2^k, k = 1, ..., 9; the counts grow rapidly, reaching magnitudes around 10^27 and beyond)

From Algorithm (Formula) to Optimized Algorithm
Input: fast algorithm as a formula. Output: optimized formula.
Domain Knowledge II: optimizing an algorithm.

Motivation: Fusing Loops
A direct implementation of (I_4 ⊗ F_2) L^8_4 uses a temporary array and two loops, one per factor:

    void I4xF2_L84(double *y, double *x) {
      double t[8];
      for (int i = 0; i < 8; i++)          // permutation L^8_4
        t[i] = x[i == 7 ? 7 : (i*4) % 7];
      for (int i = 0; i < 4; i++) {        // I_4 (x) F_2
        y[2*i]   = t[2*i] + t[2*i+1];
        y[2*i+1] = t[2*i] - t[2*i+1];
      }
    }

Loop splitting and loop fusion, achieved by algebraic transformations (formula manipulation), yield the fused version:

    void I4xF2_L84(double *y, double *x) {
      for (int j = 0; j < 4; j++) {
        y[2*j]   = x[j] + x[j+4];
        y[2*j+1] = x[j] - x[j+4];
      }
    }

Formula Level Optimization
Main goals: fusing iterative steps (fusing loops), e.g., permutations with loops; improving the structure (data flow) for SIMD instructions; overcoming compiler limitations.
Formula manipulation through mathematical rules, implemented using multiple levels of rewriting systems. This puts the math knowledge into the system.

Structure of the Loop Optimization
SPL formula → (to Σ-SPL) → Σ-SPL formula → (join permutations) → Σ-SPL formula → (join diagonals and monomials) → Σ-SPL formula. Each step is implemented with rewriting rules.

Vector Code Generation from SPL Formulas
Naturally vectorizable construct: $A \otimes I_4$ (A an arbitrary formula, vector length 4).
(Current) generic construct, completely vectorizable:

$$\prod_{i=1}^{k} P_i\, D_i\, (A_i \otimes I_\nu)\, E_i\, Q_i$$

where $P_i, Q_i$ are permutations, $D_i, E_i$ diagonals, $A_i$ arbitrary formulas, and $\nu$ is the SIMD vector length.
Vectorization in two steps: 1. formula manipulation using manipulation rules; 2. code generation (vector code + C code). Formula manipulation overcomes compiler limitations.
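For instance, applying DFT_2 ⊗ I_4 to eight floats is exactly one vector addition and one vector subtraction with 4-way SSE, which is why formulas are manipulated into this shape (a minimal sketch with SSE intrinsics):

    #include <xmmintrin.h>

    /* y = (DFT_2 (x) I_4) x for 8 floats: one vector add, one vector sub. */
    void dft2_x_i4(const float *x, float *y) {
        __m128 x0 = _mm_loadu_ps(x);        /* x[0..3] */
        __m128 x1 = _mm_loadu_ps(x + 4);    /* x[4..7] */
        _mm_storeu_ps(y,     _mm_add_ps(x0, x1));
        _mm_storeu_ps(y + 4, _mm_sub_ps(x0, x1));
    }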

Example: DFT
The standard FFT formula is manipulated into a vector FFT for ν-way vector instructions.

From Optimized Algorithm (Formula) to Code
Input: optimized formula. Output: intermediate code.
Straightforward.

From Code to Optimized Code
Input: intermediate code. Output: optimized C code.

Code Level Optimizations
Precomputation of constants; loop unrolling (controlled by the search module); constant inlining; SSA code, array scalarization, algebraic simplifications, CSE; code reordering for locality (optional); conversion to FMA code (optional); conversion to fixed-point code (optional); conversion to multiplierless code (optional). Finally: unparsing to C (or Fortran).
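As a small illustration (not Spiral output) of what array scalarization plus CSE does to a butterfly with precomputed constants:

    /* before: c[0]*x[0] and c[1]*x[1] are each computed twice */
    void butterfly_before(double *y, const double *x, const double *c) {
        y[0] = c[0]*x[0] + c[1]*x[1];
        y[1] = c[0]*x[0] - c[1]*x[1];
    }

    /* after scalarization + CSE: each product is computed once */
    void butterfly_after(double *y, const double *x, const double *c) {
        double t0 = c[0]*x[0];
        double t1 = c[1]*x[1];
        y[0] = t0 + t1;
        y[1] = t0 - t1;
    }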

Evaluating Code
Input: optimized C code. Output: performance number.
Straightforward. Examples: runtime, accuracy, operations count.
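A typical way to obtain the runtime number (a sketch; Spiral's actual measurement infrastructure may differ) is to repeat the kernel to beat the timer resolution:

    #include <time.h>

    /* Average seconds per call of 'kernel' over 'reps' repetitions. */
    double time_kernel(void (*kernel)(void), int reps) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int r = 0; r < reps; r++)
            kernel();
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return ((double)(t1.tv_sec - t0.tv_sec)
                + (t1.tv_nsec - t0.tv_nsec) * 1e-9) / reps;
    }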

Search (Learning) for the Best
Input: performance number. Output: controls for the formula generation and for the implementation options.

Search Methods
Search over: the algorithmic degrees of freedom (choice of breakdown rules) and the implementation degrees of freedom (degree of unrolling). The search operates on the ruletree representation of an algorithm: transform-independent and efficient.
Methods: exhaustive search, dynamic programming (DP), random search, hill climbing, STEER (an evolutionary algorithm).

STEER
(diagram: population n evolves into population n+1 through mutation (expand differently) and cross-breeding (swap expansions), followed by survival of the fittest)

Learning
Procedure: generate a set of algorithms and their runtimes (one transform, one size); represent the algorithms by features; from this data (pairs of features and runtimes), learn a set of algorithm design rules; from this set, generate the best algorithms (theory of Markov decision processes).
Evaluation: tested for the WHT and the DFT; from data generated for one size, the best algorithms could be constructed across a range of sizes.
Bryan Singer and Manuela Veloso, "Learning to Construct Fast Signal Processing Implementations", Journal of Machine Learning Research, 2002, Vol. 3, pp. 887-919.

SPIRAL Experiments

Performance Spread: DCT, size 32
Histograms over 10,000 algorithms (P4, 3.2 GHz, gcc): runtime spread x2; operations count x1.8; assembly instruction count x1.5; operations count vs. runtime: no correlation; FMA operations count x1.2; accuracy x10, most within x2.

Performance Spread: DFT, size 2^16
Histograms over 20,000 algorithms (P4, 3.2 GHz, icc 8): generated scalar code vs. generated SSE code (4-way vector, single precision). Generality of vectorization (all algorithms improve); efficiency of vectorization (x2.5 gain).

Benchmark: DFT, 2-powers
(graph: performance vs. size, P4, 3.2 GHz, icc 8; higher is better; vendor code is hand-tuned assembly)
Single-precision code shows the limitations of compiler vectorization. Spiral code is competitive with the best.

DFT for One BG/L Node
One BG/L node: PowerPC 440 FP2, 500 MHz. Up to 40% speedup using the 2-way vector instructions. Results in collaboration with Ueberhuber/Franchetti, Univ. of Technology Vienna.

Benchmark: Fixed-Point DFT, iPAQ
iPAQ, XScale architecture, 400 MHz, fixed-point only. (graph: higher is better.) Did Intel spend less effort here? The Spiral code was generated without modifications to the system.

Benchmark: DCT
1-dim and 2-dim DCT (P4, 3.2 GHz, icc 8), scalar vs. SSE code: a factor of 2 over vendor code, and another factor of 3 with special instructions.

Platform Dependence
Porting code tuned for the Pentium III to the Pentium 4 loses 50% of performance.

Compiler Flags
ACOVEA: evolutionary search for the best compiler flags (P4, 3.2 GHz, gcc). (graph: runtime histogram over random compiler flags, including -O and -march=pentium4.) About 10% improvement when applied to the best Spiral-generated code.

Multiplierless DFT, iPAQ
iPAQ, XScale architecture, 400 MHz, fixed-point only. Fixed-point constant multiplications are replaced by adds and shifts, trading off runtime against precision.
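The add-and-shift idea on a tiny, hand-picked example (Spiral derives such decompositions automatically; the constant here is illustrative):

    /* Multiply by 45 = 32 + 8 + 4 + 1 without a multiplier. */
    static inline int mul_by_45(int x) {
        return (x << 5) + (x << 3) + (x << 2) + x;
    }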

Current/Future Work
Complete reimplementation of the system (incl. vector code, FMA code, code for SMP platforms, fixed-point code). Generation of packages (generating something like FFTW?). Spiral as an educational tool. Other domains: SP applications (backprojection), cryptography (seems not to be suitable), communications, maybe linear algebra problems. Generating special-purpose hardware (FPGA).

Summary
Code generation and tuning as an optimization problem over the algorithm and implementation space; exploit the structure of the domain to solve it.
A declarative framework for the computer representation of the domain knowledge enables algorithm generation and optimization (teaches the system the math; does not become obsolete?).
A compiler translates the math description into code.
Search and learning explore the implementation space and close the feedback loop; this gives the system intelligence.
Extensible and versatile: every step in the code generation is exposed.
www.spiral.net

Questions?
Check out the upcoming Proceedings of the IEEE special issue on Program Generation, Optimization, and Adaptation, http://www.ece.cmu.edu/~spiral/special-issue.html