Automatic Performance Tuning and Machine Learning

Automatic Performance Tuning and Machine Learning. Markus Püschel, Computer Science, ETH Zürich; with Frédéric de Mesmay, PhD, Electrical and Computer Engineering, Carnegie Mellon.

PhD and postdoc openings: high-performance computing, compilers, theory, programming languages/generative programming.

Why Autotuning? Matrix-matrix multiplication (MMM) on a quadcore Intel platform. [Plot: performance in Gflop/s vs. matrix size up to 9,000; the best implementation reaches about 160x the triple loop.] Both versions have the same (mathematical) operation count of 2n^3 flops; the compiler underperforms by 160x.
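
For concreteness, the "triple loop" baseline in the plot is code of the following shape (a minimal sketch, assuming row-major n x n matrices):

/* Naive triple-loop MMM: the same 2n^3 flops as the best
   implementation, but no locality, blocking, or SIMD optimizations.
   Assumes c is zero-initialized. */
void mmm_naive(const double *a, const double *b, double *c, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                c[i*n + j] += a[i*n + k] * b[k*n + j];
}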

Same for All Critical Compute Functions. WiFi receiver (physical layer) on one Intel core. [Plot: throughput (Mbit/s) vs. data rate (Mbit/s); gaps of 30x and 35x between straightforward and optimized implementations.]

Solution: Autotuning. Narrow definition: search over alternative implementations or parameters to find the fastest. Broader definition: automating performance optimization with tools that complement/aid the compiler or programmer. However: search is an important tool, but expensive. Solution: machine learning.

Organization: Autotuning examples. An example use of machine learning.

Timeline of adaptation: time of implementation → time of installation (platform known) → time of use (problem parameters known).

PhiPAC/ATLAS: MMM Generator (Whaley, Bilmes, Demmel, Dongarra, ...). Computes c = a * b; blocking improves locality:

/* Multiply n x n matrices a and b, blocked with block size B
   (B is a compile-time parameter; assumes B divides n).
   c must be zero-initialized, e.g.
   c = (double *) calloc(n*n, sizeof(double)); */
void mmm(double *a, double *b, double *c, int n) {
    int i, j, k, i1, j1, k1;
    for (i = 0; i < n; i += B)
        for (j = 0; j < n; j += B)
            for (k = 0; k < n; k += B)
                /* B x B mini matrix multiplications */
                for (i1 = i; i1 < i + B; i1++)
                    for (j1 = j; j1 < j + B; j1++)
                        for (k1 = k; k1 < k + B; k1++)
                            c[i1*n + j1] += a[i1*n + k1] * b[k1*n + j1];
}

PhiPAC/ATLAS: MMM Generator. Feedback loop: detect hardware parameters (L1 size, NR, MulAdd, latency) → ATLAS search engine picks parameters (NB; MU, NU, KU; xfetch; MulAdd; latency) → ATLAS MMM code generator emits mini-MMM source → compile, execute, measure Mflop/s → repeat. (Source: Pingali, Yotov, Cornell U.)
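
ATLAS automates this full generate-compile-measure loop; as a minimal illustration of the same idea in plain C, one can time the blocked mini-MMM for several candidate block sizes and keep the fastest. Here mmm_blocked (a variant of the code above with the block size as a runtime parameter) and the coarse clock()-based timing are assumptions of the sketch:

#include <stdio.h>
#include <time.h>

/* Assumed: the blocked MMM from the slide above, with the block
   size as a runtime parameter instead of the constant B. */
void mmm_blocked(double *a, double *b, double *c, int n, int nb);

/* Empirical search in miniature: run each candidate, keep the fastest.
   (Re-zeroing c between runs is omitted for brevity.) */
int search_block_size(double *a, double *b, double *c, int n) {
    int candidates[] = { 8, 16, 32, 64, 128 };
    int best_nb = candidates[0];
    double best_t = 1e30;
    for (int i = 0; i < (int)(sizeof candidates / sizeof candidates[0]); ++i) {
        clock_t t0 = clock();
        mmm_blocked(a, b, c, n, candidates[i]);
        double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
        if (t < best_t) { best_t = t; best_nb = candidates[i]; }
    }
    printf("best block size: %d (%.3f s)\n", best_nb, best_t);
    return best_nb;
}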

On the timeline: the ATLAS MMM generator adapts at time of installation (platform known).

FFTW: Discrete Fourier Transform (DFT) (Frigo, Johnson). Installation: configure/make. Usage: d = dft(n) searches for the fastest computation strategy and precomputes twiddles; d(x,y) then executes it. Example plan for n = 1024: radix 16 splits it into sizes 16 and 64; the 16 is a base case, and the 64 splits with radix 8 into two base cases of size 8.
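
FFTW's real planner measures and caches actual sub-plans; the following is only a sketch of the underlying recursive search under a simplified additive cost model. The names time_base_case and time_radix_overhead are assumed measurement hooks, not FFTW API:

/* Sketch of a planner search in the spirit of FFTW: recursively
   find the cheapest factorization of DFT_n. Requires 1 <= n <= MAX_N. */
#define MAX_N 4096

static double best_time[MAX_N + 1];   /* memoized cost of best plan */
static int    best_radix[MAX_N + 1];  /* 0 = use the base-case codelet */

double time_base_case(int n);              /* assumed: time a codelet */
double time_radix_overhead(int n, int k);  /* assumed: twiddle/stride cost */

double plan(int n) {
    if (best_time[n] > 0.0)
        return best_time[n];               /* already planned */
    double best = time_base_case(n);
    best_radix[n] = 0;
    /* k and m = n/k play symmetric roles in this simplified model,
       so trying k <= sqrt(n) covers all splits n = k * m */
    for (int k = 2; k * k <= n; ++k) {
        if (n % k) continue;
        int m = n / k;
        /* k DFTs of size m, m DFTs of size k, plus twiddle overhead */
        double t = k * plan(m) + m * plan(k) + time_radix_overhead(n, k);
        if (t < best) { best = t; best_radix[n] = k; }
    }
    return best_time[n] = best;
}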

FFTW: Codelet Generator (Frigo). Given a size n, the DFT codelet generator produces dft_n(*x, *y, ...): a fixed-size DFT function as straight-line code.
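
For a sense of what such a codelet looks like, here is the smallest possible case, a fully unrolled size-2 DFT; this is illustrative only, as real generated codelets are far larger but have the same straight-line shape:

#include <complex.h>

/* Size-2 DFT butterfly: y0 = x0 + x1, y1 = x0 - x1. */
void dft_2(double complex *y, const double complex *x) {
    double complex t0 = x[0];
    double complex t1 = x[1];
    y[0] = t0 + t1;
    y[1] = t0 - t1;
}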

On the timeline: the FFTW codelet generator works at time of implementation, the ATLAS MMM generator at time of installation, and the FFTW adaptive library at time of use.

OSKI: Sparse Matrix-Vector Multiplication (Vuduc, Im, Yelick, Demmel). Computes y = A*x with sparse A. Blocking for registers improves locality (reuse of the input vector) but creates overhead (explicit zeros stored in each block).

OSKI: Sparse Matrix-Vector Multiplication. Worked example of the blocking trade-off: register blocking gains a factor of 1.4 on dense MVM, but for this matrix blocking stores 16 entries where only 9 are true nonzeros, an overhead of 16/9 = 1.77. Net effect: 1.4/1.77 = 0.79 < 1, so blocking yields no gain here.
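
In code, this decision amounts to comparing the machine's measured dense-MVM gain for each block size against the fill overhead that block size induces on the given matrix. A sketch with illustrative names, not OSKI's actual API:

/* gain[r][c]: offline-measured dense-MVM speedup of r x c register
   blocking on this machine; fill_ratio: stored entries / true
   nonzeros after blocking the given sparse matrix (estimated by
   sampling at run time). */
typedef struct sparse_matrix sparse_matrix;
double fill_ratio(const sparse_matrix *A, int r, int c);

void choose_block(const sparse_matrix *A, const double gain[5][5],
                  int *best_r, int *best_c) {
    double best = 1.0;                    /* 1.0 = unblocked baseline */
    *best_r = 1; *best_c = 1;
    for (int r = 1; r <= 4; ++r)
        for (int c = 1; c <= 4; ++c) {
            /* estimated speedup = dense gain / fill overhead;
               e.g. 1.4 / (16/9) = 0.79 < 1 -> stay unblocked */
            double s = gain[r][c] / fill_ratio(A, r, c);
            if (s > best) { best = s; *best_r = r; *best_c = c; }
        }
}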

On the timeline: OSKI sparse MVM adapts both at time of installation and at time of use, joining the FFTW codelet generator (implementation), the ATLAS MMM generator (installation), and the FFTW adaptive library (use).

Spiral: Linear Transforms & More. Inputs: algorithm knowledge and a platform description. Spiral produces an optimized implementation, regenerated for every new platform.

Program Generation in Spiral (sketched). Pipeline: transform (user specified) → algorithm rules expand it into a fast algorithm in SPL (many choices) → parallelization and vectorization → Σ-SPL with loop optimizations → optimized implementation (constant folding, scheduling). Optimization happens at all abstraction levels, combined with search.

On the timeline: machine learning (this talk) spans time of installation and time of use, as does Spiral for transforms of general input size; Spiral for transforms of fixed input size sits at time of implementation, next to the FFTW codelet generator; OSKI sparse MVM spans installation and use; the ATLAS MMM generator adapts at installation and the FFTW adaptive library at use.

Organization: Autotuning examples. An example use of machine learning.

Online tuning (time of use), as in FFTW: installation is plain configure/make; at use, d = dft(n) searches for the fastest computation strategy and precomputes twiddles, then d(x,y) executes. Goal, offline tuning (time of installation): installation runs configure/make plus a search for a few sizes n, from which decision trees are learned; at use, d = dft(n) just applies the decision trees and precomputes twiddles, then d(x,y) executes.

Integration with Spiral-Generated Libraries (Voronenko 2008). Spiral, given some platform information, generates an online-tunable library.

Organization: Autotuning examples. An example use of machine learning: anatomy of an adaptive discrete Fourier transform library; decision tree generation using C4.5; results.

Discrete/Fast Fourier Transform. Discrete Fourier transform (DFT): y = DFT_n · x, where DFT_n = [ω_n^{kl}], 0 ≤ k, l < n, and ω_n = e^{-2πi/n}. Cooley/Tukey fast Fourier transform (FFT), for n = km: DFT_n = (DFT_k ⊗ I_m) T_m^n (I_k ⊗ DFT_m) L_k^n, where T_m^n is a diagonal matrix of twiddle factors and L_k^n a stride permutation. Dataflow (right to left) for 16 = 4 × 4.

Adaptive Scalar Implementation (FFTW 2.x)

/* Recursive DFT; t is a temporary buffer and precomp_d holds
   precomputed twiddle factors. The calls to use_dft_base_case
   and choose_dft_radix are the choices used for adaptation. */
void dft(int n, cpx *y, cpx *x) {
    if (use_dft_base_case(n))
        dft_bc(n, y, x);
    else {
        int k = choose_dft_radix(n);
        int m = n / k;                 /* n = k * m */
        for (int i = 0; i < k; ++i)
            dft_strided(m, k, t + m*i, x + m*i);
        for (int i = 0; i < m; ++i)
            dft_scaled(k, m, precomp_d[i], y + i, t + i);
    }
}
void dft_strided(int n, int istr, cpx *y, cpx *x) { ... }
void dft_scaled(int n, int str, cpx *d, cpx *y, cpx *x) { ... }

Decision Graph of Library. The same code, viewed as a decision graph: each call to use_dft_base_case and choose_dft_radix is a decision point, and the mutually recursive functions dft, dft_strided, and dft_scaled each carry their own such choices.

Spiral-Generated Libraries. A Spiral-generated library (scalar, vectorized, threaded, and buffered variants versus the standard one) contains about 20 mutually recursive functions with 10 different choices, occurring recursively. The choices are heterogeneous (radix, threading, buffering, ...).

Our Work. Upon installation, generate decision trees for each choice. An example is sketched below.
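
What such a generated decision tree could look like once compiled into the library, with entirely hypothetical thresholds and radices:

/* Hypothetical generated decision tree for the radix choice;
   the values are invented for illustration. In the real library
   these trees are produced by C4.5 at installation time. */
int choose_dft_radix(int n) {
    if (n <= 64)   return 8;
    if (n <= 4096) return 16;
    return 32;
}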

Statistical Classification: C4.5. Given features (events) and observed decisions, C4.5 builds a decision tree. Example (play/don't-play data): P(play | windy=false) = 6/8, P(don't play | windy=false) = 2/8, P(play | windy=true) = 1/2, P(don't play | windy=true) = 1/2; hence H(windy=false) = 0.81 and H(windy=true) = 1.0. Entropy of features: H(windy) = 0.89, H(outlook) = 0.69, H(humidity) = ... C4.5 splits on the feature with the lowest conditional entropy (highest information gain), here outlook.
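
These entropy values follow from the standard definitions; a small check, assuming the classic 14-instance weather data set with 8 windy=false and 6 windy=true instances (consistent with the probabilities above):

#include <math.h>
#include <stdio.h>

/* Binary entropy H(p) = -p*log2(p) - (1-p)*log2(1-p). */
static double H(double p) {
    if (p <= 0.0 || p >= 1.0) return 0.0;
    return -p * log2(p) - (1.0 - p) * log2(1.0 - p);
}

int main(void) {
    double h_false = H(6.0 / 8.0);   /* windy=false: 6 play, 2 don't -> 0.81 */
    double h_true  = H(1.0 / 2.0);   /* windy=true: 50/50 -> 1.00 */
    /* conditional entropy, weighted by the 8:6 instance split */
    double h_windy = (8.0 * h_false + 6.0 * h_true) / 14.0;   /* 0.89 */
    printf("H(windy=false)=%.2f  H(windy=true)=%.2f  H(windy)=%.2f\n",
           h_false, h_true, h_windy);
    return 0;
}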

Application to Libraries. Features = the arguments of the functions (except the data pointers): dft(int n, cpx *y, cpx *x); dft_strided(int n, int istr, cpx *y, cpx *x); dft_scaled(int n, int str, cpx *d, cpx *y, cpx *x). At installation time: run the search for a few input sizes n. This yields the training set: features and the associated decisions (several for each size). Generate decision trees using C4.5 and insert them into the library.

Issues. Correctness of the generated decision trees: learning from sizes n in {12, 18, 24, 48} may yield a tree that proposes radix 6, which is invalid for sizes not divisible by 6; the solution is a correction pass through the decision tree. Prime factor structure: for n = 2^i · 3^j, i.e. n in {2, 3, 4, 6, 9, 12, 16, 18, 24, 27, 32, ...}, compute i and j and add them to the features.
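
Computing the two extra features is a few lines; a minimal sketch:

/* Feature extraction for the prime-factor structure: for
   n = 2^i * 3^j * rest, return the exponents i and j, which are
   added to the C4.5 feature vector alongside n itself. */
void factor_features(int n, int *i, int *j) {
    *i = 0;
    *j = 0;
    while (n % 2 == 0) { n /= 2; ++*i; }
    while (n % 3 == 0) { n /= 3; ++*j; }
}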

Experimental Setup. 3 GHz Intel Xeon 5160 (two Core 2 Duos = 4 cores), Linux 64-bit, icc 10.1. Libraries: IPP 5.3, FFTW 3.2 alpha 2, and the Spiral-generated library.

Learning works as expected

All Sizes. All sizes n ≤ 2^18 with prime factors ≤ 19.

All Sizes. All sizes n ≤ 2^18 with prime factors ≤ 19; higher-order fit of all sizes.

Related Work. Machine learning in Spiral: learning DFT recursions (Singer/Veloso 2001). Machine learning in compilation: scheduling (Moss et al. 1997, Cavazos/Moss 2004); branch prediction (Calder et al. 1997); heuristics generation (Monsifrot/Bodin/Quiniou 2002); feature generation (Leather/Bonilla/O'Boyle 2009); multicores (Wang/O'Boyle 2009). Common pattern: learn over a feature-space description of the code to produce heuristics, a model, or an optimization sequence.

This talk: Frédéric de Mesmay, Yevgen Voronenko, and Markus Püschel, "Offline Library Adaptation Using Automatically Generated Heuristics," Proc. International Parallel and Distributed Processing Symposium (IPDPS), pp. 1-10, 2010. Frédéric de Mesmay, Arpad Rimmel, Yevgen Voronenko, and Markus Püschel, "Bandit-Based Optimization on Graphs with Application to Library Performance Tuning," Proc. International Conference on Machine Learning (ICML), pp. 729-736, 2009.

Message of Talk. Machine learning should be used in autotuning: it overcomes the problem of expensive searches, is relatively easy to do, and is applicable to any search-based approach. On the timeline, machine learning operates at time of installation (platform known) and at time of use (problem parameters known).