1 Collective optimization, run-time adaptation and machine learning Grigori Fursin UNIDAPT / ALCHEMY Group INRIA Saclay, France HiPEAC member
2 My background: B.S. in Physics, M.S. in Computer Engineering from Moscow Institute of Physics & Technology, Russia (1999). Ph.D. from the University of Edinburgh, UK: iterative compilation and performance prediction (advisor: Prof. Michael O'Boyle). Postdoctoral researcher at INRIA Futurs, France (until 2007): machine learning for optimization knowledge reuse, run-time adaptation for programs with multiple datasets and heterogeneous systems, architectural design space exploration (working with the groups of Prof. Olivier Temam and Prof. Michael O'Boyle). Tenured research scientist at INRIA Saclay, France (2007-now): self-tuning computing systems based on statistical collective optimization, machine learning and run-time adaptation; building a research group; developing the HiPEAC common research compiler based on GCC; developing the Collective Tuning Center
3 Outline: Motivation; Prerequisites; Iterative feedback-directed compilation (empirical optimization); MILEPOST project: machine-learning-based self-tuning compiler adaptable to heterogeneous reconfigurable architectures (based on static and dynamic program features); Optimization sensitivity to datasets; Run-time adaptation for statically compiled programs (adaptive binaries); Recent projects: adaptive libraries (iterative compilation, dataset features and run-time adaptation), predictive code scheduling for heterogeneous (CPU/GPU-like) architectures; Statistical collective optimization; Collective Tuning Initiative; Conclusions and future work
4 Motivation Continuing innovation in science and technology requires increasing computing resources while placing strict requirements on system performance, power consumption, size, response, reliability, portability and design time. High-performance computing systems tend to evolve toward complex heterogeneous multi-core systems, which dramatically increases design & optimization time.
5 Motivation Continuing innovation in science and technology requires increasing computing resources while placing strict requirements on system performance, power consumption, size, response, reliability, portability and design time. High-performance computing systems tend to evolve toward complex heterogeneous multi-core systems, which dramatically increases design & optimization time. Optimizing compilers play a key role in producing executable codes quickly and automatically while satisfying all the above requirements for a broad range of programs and architectures.
6 Motivation Developing and tuning current compilers for rapidly evolving architectures is a tedious and time-consuming process. Current state-of-the-art compilers and optimizers often fail to deliver the best performance: hardwired ad-hoc optimization heuristics (cost models) for rapidly evolving hardware; large irregular optimization spaces; interaction between optimizations, order of optimizations; difficult to add new transformations to already-tuned optimization heuristics; inability to reuse optimization knowledge among different programs and architectures; lack of run-time information and inability to adapt to varying program and system behavior (or dataset) at run-time with low overhead
7 Motivation Developing and tuning current compilers for rapidly evolving architectures is a tedious and time-consuming process. Current state-of-the-art compilers and optimizers often fail to deliver the best performance: hardwired ad-hoc optimization heuristics (cost models) for rapidly evolving hardware; large irregular optimization spaces; interaction between optimizations, order of optimizations; difficult to add new transformations to already-tuned optimization heuristics; inability to reuse optimization knowledge among different programs and architectures; lack of run-time information and inability to adapt to varying program and system behavior (or dataset) at run-time with low overhead. We need universal self-tuning compilers (architectures and run-time systems) that can continuously and automatically adapt to any heterogeneous architecture and learn how to optimize programs
8 Outline: Motivation; Prerequisites; Iterative feedback-directed compilation (empirical optimization); MILEPOST project: machine-learning-based self-tuning compiler adaptable to heterogeneous reconfigurable architectures (based on static and dynamic program features); Optimization sensitivity to datasets; Run-time adaptation for statically compiled programs (adaptive binaries); Recent projects: adaptive libraries (iterative compilation, dataset features and run-time adaptation), predictive code scheduling for heterogeneous (CPU/GPU-like) architectures; Statistical collective optimization; Collective Tuning Initiative; Conclusions and future work
9 Iterative compilation (first steps) Optimization spaces (the set of all possible program transformations) are large and non-linear, with many local minima. Finding a good solution may be long and non-trivial: matmul, 2 transformations, search space = 2000; swim, 3 transformations. Potential solution - iterative feedback-directed compilation: learn program behavior across executions (combinations of global flags, orders of optimization phases, cost-model tuning for individual transformations - meta-optimization). High potential
10 Iterative compilation (first steps) Optimization spaces (the set of all possible program transformations) are large and non-linear, with many local minima. Finding a good solution may be long and non-trivial: matmul, 2 transformations, search space = 2000; swim, 3 transformations. Potential solution - iterative feedback-directed compilation: learn program behavior across executions (combinations of global flags, orders of optimization phases, cost-model tuning for individual transformations - meta-optimization). High potential, but: slow; often the same dataset is used; often no run-time adaptation; no optimization knowledge reuse
11 Iterative compilation (first steps) Optimization spaces (the set of all possible program transformations) are large and non-linear, with many local minima. Finding a good solution may be long and non-trivial: matmul, 2 transformations, search space = 2000; swim, 3 transformations. Potential solution - iterative feedback-directed compilation: learn program behavior across executions (combinations of global flags, orders of optimization phases, cost-model tuning for individual transformations - meta-optimization). High potential, but: slow; often the same dataset is used; often no run-time adaptation; no optimization knowledge reuse. Solving these problems is non-trivial
12 Iterative compilation (uniform random) Systematic optimization exploration. [Chart: speedup per benchmark - bitcount, susan_c, susan_e, susan_s, jpeg_c, jpeg_d, dijkstra, patricia, blowfish_d, blowfish_e, rijndael_d, rijndael_e, sha, adpcm_c, adpcm_d, CRC32, gsm, qsort1, stringsearch1] AMD - a cluster with 16 AMD Athlon processors running at 2.4GHz; IA32 - a cluster with 4 Intel Xeon processors running at 2.8GHz; IA64 - a server with an Itanium2 processor running at 1.3GHz. Traditional uniform random iterative search (GCC/Open64 global compiler flags): 500 random combinations of flags and associated passes (~100 optimizations). Can obtain high speedups, but very slow
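The uniform random search on this slide can be sketched as follows. The `evaluate` callback is a stand-in for compiling with a given flag combination and timing the resulting binary; the toy objective below, with its invented flag names and interaction, is purely illustrative:

```python
import random

def random_flag_search(flags, evaluate, trials=500, seed=0):
    """Uniform random iterative search: try random on/off combinations
    of compiler flags and keep the fastest configuration found."""
    rng = random.Random(seed)
    best_flags, best_time = None, float("inf")
    for _ in range(trials):
        # Each flag is independently enabled with probability 1/2.
        combo = tuple(f for f in flags if rng.random() < 0.5)
        t = evaluate(combo)  # compile + run + measure (stand-in here)
        if t < best_time:
            best_flags, best_time = combo, t
    return best_flags, best_time

# Toy stand-in for "compile and measure": -funroll-loops helps,
# -fslow hurts, and the two interact (flags are not independent).
def toy_evaluate(combo):
    t = 10.0
    if "-funroll-loops" in combo:
        t -= 2.0
    if "-fslow" in combo:
        t += 1.0
        if "-funroll-loops" in combo:
            t += 0.5  # interaction between optimizations
    return t

best, t = random_flag_search(["-funroll-loops", "-fslow"], toy_evaluate, trials=50)
```

In a real driver each evaluation is a full compile-and-run cycle, which is exactly why the slide calls this approach very slow.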
13 Iterative compilation (uniform random) Optimization space is not trivial
14 Iterative compilation (uniform random) [Chart: susan_corners, AMD Athlon64, GCC with ~100 flags, -O3 marked] Multi-objective optimization (also power, reliability, etc). Combinations of optimizations matter. Important to balance performance/code size, particularly for embedded applications
15 Iterative compilation (focused probabilistic search) Focused probabilistic search (similar to GA search) (SUIF source-to-source + Intel compiler, 80 different transformations). Faster than traditional search (~50 iterations), but can get stuck in local minima. B. Franke, M. O'Boyle, J. Thomson and G. Fursin. Probabilistic Source-Level Optimisation of Embedded Systems Software. Proceedings of the Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES'05), pages 78-86, Chicago, IL, USA, June 2005
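A minimal sketch of a focused probabilistic search in this spirit: it keeps a probability of applying each transformation and biases the sampling distribution toward the best sequence found so far (a PBIL-style update, not the exact algorithm of the LCTES'05 paper; the toy objective is an invented stand-in for measurement):

```python
import random

def focused_search(n_transforms, evaluate, generations=20, pop=10, lr=0.3, seed=0):
    """Focused probabilistic search (PBIL-style sketch, similar in spirit
    to the GA-like search on the slide): sample candidate transformation
    sequences, then shift the distribution toward the best one seen."""
    rng = random.Random(seed)
    probs = [0.5] * n_transforms          # start with an unbiased distribution
    best_seq, best_time = None, float("inf")
    for _ in range(generations):
        for _ in range(pop):
            seq = [rng.random() < p for p in probs]
            t = evaluate(seq)
            if t < best_time:
                best_seq, best_time = seq, t
        # Bias sampling toward the best sequence; as the slide notes,
        # such focused search can still get stuck in a local minimum.
        probs = [p + lr * (int(b) - p) for p, b in zip(probs, best_seq)]
    return best_seq, best_time

# Toy objective: transformations 0 and 2 reduce run time, 1 increases it.
def toy_eval(seq):
    gains = [-1.0, +0.7, -0.5]
    return 10.0 + sum(g for g, on in zip(gains, seq) if on)

seq, t = focused_search(3, toy_eval)
```

Compared with uniform random search, the distribution update concentrates trials on promising regions, which is why far fewer iterations (~50 on the slide) suffice.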
16 Motivation Prerequisites Outline Iterative feedback-directed directed compilation (empirical optimization) MILEPOST project: machine-learning based self-tuning compiler adaptable to heterogeneous reconfigurable architectures (based on static and dynamic program features) Optimization sensitivity to datasets Run-time adaptation for statically compiled programs (adaptive binaries) Recent projects Adaptive libraries (iterative compilation, dataset features and run-time adaptation) Predictive code scheduling for heterogeneous (CPU/GPU-like) architectures Statistical collective optimization Collective Tuning Initiative Conclusions and future work
17 MILEPOST project Machine Learning for Embedded Programs Optimization. The objective is to develop compiler technology that can automatically learn how to best optimize programs for reconfigurable heterogeneous embedded processors (with run-time adaptation) and dramatically reduce time to market. Partners: INRIA, University of Edinburgh, IBM, ARC, CAPS Enterprise. The developed techniques and software are publicly available and will hopefully influence future compiler development
18 Interactive Compilation Interface GCC Detect optimization flags GCC Controller (Pass Manager) Pass 1... Pass N GCC Data Layer AST, CFG, CF, etc
19 Interactive Compilation Interface GCC with ICI Detect optimization flags ICI IC Event GCC Controller (Pass Manager) IC Event Interactive Compilation Interface Pass 1... Pass N IC Event GCC Data Layer AST, CFG, CF, etc IC Data
20 Interactive Compilation Interface GCC with ICI High-level scripting (java, python, etc) Detect optimization flags IC Event GCC Controller (Pass Manager) IC Event ICI Interactive Compilation Interface IC Plugins <Dynamically linked shared libraries> Selecting pass sequences... Extracting static program features Pass 1... Pass N IC Event GCC Data Layer AST, CFG, CF, etc IC Data
21 Interactive Compilation Interface GCC with ICI High-level scripting (java, python, etc) Detect optimization flags IC Event GCC Controller (Pass Manager) Pass 1... GCC Data Layer AST, CFG, CF, etc IC Event Pass N IC Event IC Data ICI Interactive Compilation Interface IC Plugins <Dynamically linked shared libraries> Selecting pass sequences... Extracting static program features CCC Continuous Collective Compilation Framework ML drivers to optimize programs and tune compiler optimization heuristic
22 MILEPOST framework Training Program 1 Program N MILEPOST GCC (with ICI and ML routines) IC Plugins Recording pass sequences Extracting static program features
23 MILEPOST framework Training Program 1 Program N MILEPOST GCC (with ICI and ML routines) IC Plugins Recording pass sequences Extracting static program features CCC Continuous Collective Compilation Framework Drivers for iterative compilation and model training
24 MILEPOST framework Training Program 1 Program N MILEPOST GCC (with ICI and ML routines) IC Plugins Recording pass sequences Extracting static program features CCC Continuous Collective Compilation Framework Drivers for iterative compilation and model training Deployment New program MILEPOST GCC Extracting static program features Selecting good passes Predicting good passes to improve exec. time, code size and comp. time Grigori Fursin, Cupertino Miranda, Olivier Temam, Mircea Namolaru, Elad Yom-Tov, Ayal Zaks, Bilha Mendelson, Phil Barnard, Elton Ashton, Eric Courtois, Francois Bodin, Edwin Bonilla, John Thomson, Hugh Leather, Chris Williams, Michael O'Boyle. MILEPOST GCC: machine learning based research compiler. Proceedings of the GCC Developers' Summit, Ottawa, Canada, June 2008 (based on CGO'06 and HiPEAC'05 papers)
25 Machine learning (static features) Program characterization based on static features MILEPOST GCC feature extractor (IBM Haifa & INRIA) ft1 - Number of basic blocks in the method ft19 - Number of direct calls in the method ft20 - Number of conditional branches in the method ft21 - Number of assignment instructions in the method ft22 - Number of binary integer operations in the method ft23 - Number of binary floating point operations in the method ft24 - Number of instructions in the method ft25 - Average of number of instructions in basic blocks ft54 - Number of local variables that are pointers in the method ft55 - Number of static/extern variables that are pointers in the method How constructed: human intuition On-going work on feature generation and analysis Critical to be able to make good predictions!
26 Machine learning (static features) 14 transformations, sequences of length 5: search space = 14^5 = 537,824. Predict the best transformations from this space based on program features and previous optimization experience. Off-line training: focused search. Predicting the best transformation for a new program: static features + nearest-neighbors classifier. F. Agakov, E. Bonilla, J. Cavazos, B. Franke, G. Fursin, M.F.P. O'Boyle, J. Thomson, M. Toussaint and C.K.I. Williams. Using Machine Learning to Focus Iterative Optimization. Proceedings of the 4th Annual International Symposium on Code Generation and Optimization (CGO), New York, NY, USA, March 2006
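The prediction step can be sketched as a 1-nearest-neighbor lookup over static feature vectors: find the most similar previously seen program and reuse the optimizations that worked for it. The training tuples, feature choices and flag names below are hypothetical illustrations, not the MILEPOST training data:

```python
import math

def nearest_neighbor_predict(train, new_features):
    """1-nearest-neighbor classifier: find the training program whose
    static feature vector is closest (Euclidean distance) to the new
    program's features and reuse its best known optimizations."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    features, best_opts = min(train, key=lambda fb: dist(fb[0], new_features))
    return best_opts

# Hypothetical training data: (static feature vector, best optimizations).
# Features here stand for e.g. (basic blocks, branches, instructions)
# per method, in the spirit of ft1/ft20/ft24 on the features slide.
train = [
    ((12, 3, 80),   ["-funroll-loops", "-O3"]),
    ((4, 1, 20),    ["-Os"]),
    ((50, 20, 400), ["-O3", "-fprefetch-loop-arrays"]),
]
predicted = nearest_neighbor_predict(train, (11, 4, 75))
```

In practice the feature vectors are normalized first, since raw counts such as "number of instructions" would otherwise dominate the distance.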
27 Machine learning (dynamic features) Performance counter values for 181.mcf compiled with -O0, relative to the average values over the entire set of benchmark suites (SPECFP, SPECINT, MiBench, Polyhedron). John Cavazos, Grigori Fursin, Felix Agakov, Edwin Bonilla, Michael F.P. O'Boyle and Olivier Temam. Rapidly Selecting Good Compiler Optimizations using Performance Counters. Proceedings of the 5th Annual International Symposium on Code Generation and Optimization (CGO), San Jose, USA, March 2007
28 Machine learning (dynamic features) Performance counter values for 181.mcf compiled with -O0, relative to the average values over the entire set of benchmark suites (SPECFP, SPECINT, MiBench, Polyhedron). Problem: a greater number of memory accesses per instruction than average
29 Machine learning (dynamic features) Performance counter values for 181.mcf compiled with -O0, relative to the average values over the entire set of benchmark suites (SPECFP, SPECINT, MiBench, Polyhedron). Solving all performance issues one by one is slow and can be inefficient due to their non-linear dependencies
30 Machine learning (dynamic features) Performance counter values for 181.mcf compiled with -O0, relative to the average values over the entire set of benchmark suites (SPECFP, SPECINT, MiBench, Polyhedron). Solving all performance issues one by one is slow and can be inefficient due to their non-linear dependencies. CONSIDER ALL PERFORMANCE ISSUES AT THE SAME TIME!
31 Detecting influential counters Principal Component Analysis: most informative performance counters: 1) L1_TCA 2) L1_DCH 3) TLB_DM 4) BR_INS 5) RES_STL 6) TOT_CYC 7) L2_ICH 8) VEC_INS 9) L2_DCH 10) L2_TCA 11) L1_DCA 12) HW_INT 13) L2_TCH 14) L1_TCH 15) BR_MS Analysis of the importance of the performance counters. The data contains one good optimization sequence per benchmark. Calculating mutual information between a subset of the performance counters and good optimization sequences
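The mutual-information ranking mentioned on this slide can be sketched as follows: discretize each counter's values, then compute the empirical mutual information between those values and the known best optimization sequence per benchmark. The toy counters and sequences below are invented for illustration:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y) in bits between two discrete
    sequences - here, a discretized performance counter (X) and the best
    optimization sequence per benchmark (Y). Higher MI = more informative."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        p_joint = c / n
        # p(x,y) / (p(x) p(y)) = c*n / (count_x * count_y)
        mi += p_joint * math.log2(c * n / (px[x] * py[y]))
    return mi

# Toy data: counter A (discretized high/low) perfectly predicts the best
# sequence; counter B is constant and carries no information at all.
best_seq  = ["s1", "s1", "s2", "s2"]
counter_a = ["hi", "hi", "lo", "lo"]
counter_b = ["x",  "x",  "x",  "x"]
mi_a = mutual_information(counter_a, best_seq)
mi_b = mutual_information(counter_b, best_seq)
```

Ranking all counters by this score and keeping the top ones yields a shortlist like the 15 counters on the slide.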
32 Outline: Motivation; Prerequisites; Iterative feedback-directed compilation (empirical optimization); MILEPOST project: machine-learning-based self-tuning compiler adaptable to heterogeneous reconfigurable architectures (based on static and dynamic program features); Optimization sensitivity to datasets; Run-time adaptation for statically compiled programs (adaptive binaries); Recent projects: adaptive libraries (iterative compilation, dataset features and run-time adaptation), predictive code scheduling for heterogeneous (CPU/GPU-like) architectures; Statistical collective optimization; Collective Tuning Initiative; Conclusions and future work
33 Optimization sensitivity to datasets MiBench, 20 datasets per benchmark, 200/1000 random combinations of Open64 (GCC) compiler flags, 5 months of experiments. dijkstra (not sensitive); jpeg_d (dataset sensitive). Grigori Fursin, John Cavazos, Michael O'Boyle and Olivier Temam. MiDataSets: Creating The Conditions For A More Realistic Evaluation of Iterative Optimization. Proceedings of HiPEAC 2007, Ghent, Belgium, January 2007
34 Optimization sensitivity to datasets MiBench, 20 datasets per benchmark, 200/1000 random combinations of Open64 (GCC) compiler flags, 5 months of experiments. dijkstra (not sensitive); jpeg_d (clustering). Grigori Fursin, John Cavazos, Michael O'Boyle and Olivier Temam. MiDataSets: Creating The Conditions For A More Realistic Evaluation of Iterative Optimization. Proceedings of HiPEAC 2007, Ghent, Belgium, January 2007
35 Outline: Motivation; Prerequisites; Iterative feedback-directed compilation (empirical optimization); MILEPOST project: machine-learning-based self-tuning compiler adaptable to heterogeneous reconfigurable architectures (based on static and dynamic program features); Optimization sensitivity to datasets; Run-time adaptation for statically compiled programs (adaptive binaries); Recent projects: adaptive libraries (iterative compilation, dataset features and run-time adaptation), predictive code scheduling for heterogeneous (CPU/GPU-like) architectures; Statistical collective optimization; Collective Tuning Initiative; Conclusions and future work
36 Run-time adaptation Why adapt at run-time? Different program context Different run-time behavior Different system load Different available resources Different power consumption Different architectures & ISA For each case we want to find and use best optimization settings!
37 Current methods Some existing solutions: Application Dataset 1 Compiler Binary Output 1
38 Current methods Some existing solutions: Application Dataset 1 Compiler Binary Dynamic optimizations Output 1
39 Current methods Some existing solutions: Application Dataset 1 Compiler Binary Dynamic optimizations Output 1 Pros: run-time information, potentially more than one dataset
40 Current methods Some existing solutions: Application Dataset 1 Compiler Binary Dynamic optimizations Output 1 Pros: run-time information, potentially more than one dataset Cons: restrictions on optimization time, simple optimizations
41 Current methods Some existing solutions: Application Dataset 1 Iterative optimizations Compiler Binary Output 1 Dynamic optimizations Pros: run-time information, potentially more than one dataset Cons: restrictions on optimization time, simple optimizations
42 Current methods Some existing solutions: Application Dataset 1 Iterative optimizations Compiler Binary Output 1 Dynamic optimizations Pros: powerful transformation space exploration Pros: run-time information, potentially more than one dataset Cons: restrictions on optimization time, simple optimizations
43 Current methods Some existing solutions: Application Dataset 1 Iterative optimizations Compiler Binary Output 1 Dynamic optimizations Pros: powerful transformation space exploration Cons: slow, one dataset Pros: run-time information, potentially more than one dataset Cons: restrictions on optimization time, simple optimizations
44 Current methods Can we combine both? Application Dataset 1 Iterative optimizations Compiler Binary Output 1 Dynamic optimizations Static code multiversioning and run-time adaptation: powerful transformation space exploration + run-time information = self-tuning adaptive binaries
45 Our approach: static multiversioning Application Select most time consuming code sections
46 Our approach: static multiversioning Application Create multi-versions of time consuming code sections
47 Our approach: static multiversioning Application adapt_start adapt_start adapt_stop adapt_stop Add adaptation routines (depends on the task)
48 Our approach: static multiversioning Transformations Application adapt_start adapt_start adapt_stop adapt_stop Apply various transformations over multi-versions of code sections
49 Our approach: static multiversioning Select global or fine-grain internal compiler optimizations Transformations Application adapt_start adapt_start adapt_stop adapt_stop Apply various transformations over multi-versions of code sections
50 Our approach: static multiversioning Transformations Application adapt_start adapt_start adapt_stop adapt_stop Apply various transformations over multi-versions of code sections
51 Our approach: static multiversioning Transformations Different ISA; manual transformations, etc Application adapt_start adapt_start adapt_stop adapt_stop Apply various transformations over multi-versions of code sections
52 Our approach: static multiversioning Final instrumented program Application adapt_start adapt_start adapt_stop adapt_stop
53 Run-time program adaptation [Chart: execution times for subroutine resid of benchmark mgrid across function calls (time in sec), showing the startup phase (phase detection), the evaluation of one option, and the end of the optimization process (best option found)] 1) Consider a new optimization option evaluated after 2 consecutive executions of the code section with the same performance 2) Ignore the one next execution to avoid transitional effects 3) Check baseline performance (to verify the stability prediction)
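A minimal sketch of the evaluation protocol in steps 1-3 above, assuming a simple `record(time)` hook called after each execution of the multiversioned code section. The class, its tolerance threshold and the exact bookkeeping are our simplification, not the paper's implementation:

```python
class AdaptiveSelector:
    """Sketch of run-time version evaluation: a new optimization option
    counts as evaluated after 2 consecutive executions with (nearly) the
    same timing; the call right after a switch is ignored to avoid
    transitional effects; version 0 plays the role of the baseline."""
    def __init__(self, n_versions, tol=0.05):
        self.n, self.tol = n_versions, tol
        self.current = 0          # version being evaluated (0 = baseline)
        self.times = {}           # version -> measured stable time
        self.pending = []         # timings observed since the last switch
        self.best = None          # set once the best option is found

    def record(self, t):
        if self.best is not None:           # search already finished
            return
        self.pending.append(t)
        if len(self.pending) < 3:           # first call after a switch is
            return                          # ignored (transitional effects)
        a, b = self.pending[-2], self.pending[-1]
        if abs(a - b) <= self.tol * max(a, b):   # 2 consecutive stable runs
            self.times[self.current] = (a + b) / 2
            if self.current + 1 < self.n:
                self.current += 1           # evaluate the next version
                self.pending = []
            else:                           # all versions evaluated:
                self.best = min(self.times, key=self.times.get)
                self.current = self.best    # best option found

    def version(self):
        return self.current

# Usage: feed timings observed at each call of the hot code section.
sel = AdaptiveSelector(n_versions=2)
for t in [1.2, 1.0, 1.0]:   # version 0: transitional call, then stable
    sel.record(t)
for t in [0.9, 0.8, 0.8]:   # version 1: transitional call, then stable
    sel.record(t)
```

A production version would also periodically re-check the baseline to detect phase changes, as step 3 on the slide suggests.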
54 Run-time program adaptation [Chart: time (sec) vs function calls] Continuous adaptation: saving prediction info after execution
55 Run-time program adaptation [Chart: time (sec) vs function calls] Continuous adaptation: preloading prediction info before execution. Grigori Fursin, Albert Cohen, Michael O'Boyle and Olivier Temam. A Practical Method For Quickly Evaluating Program Optimizations. Proceedings of the 1st International Conference on High Performance Embedded Architectures & Compilers (HiPEAC 2005), number 3793 in LNCS, pages 29-46, Barcelona, Spain, November 2005
56 Possible usage Create static binaries and libraries adaptable to different program and architecture behavior (also split-compilation). Generate mixed multiple-ISA code with run-time adaptation for heterogeneous systems (CPU/GPU or CELL architectures). Determine the effect of optimizations at run-time for programs with varying datasets without a reference run (randomly selecting versions at run-time). Statistical collective optimization
57 Outline: Motivation; Prerequisites; Iterative feedback-directed compilation (empirical optimization); MILEPOST project: machine-learning-based self-tuning compiler adaptable to heterogeneous reconfigurable architectures (based on static and dynamic program features); Optimization sensitivity to datasets; Run-time adaptation for statically compiled programs (adaptive binaries); Recent projects: adaptive libraries (iterative compilation, dataset features and run-time adaptation), predictive code scheduling for heterogeneous (CPU/GPU-like) architectures; Statistical collective optimization; Collective Tuning Initiative; Conclusions and future work
58 Static multiversioning framework for dynamic optimizations Step 1 Statically compiled adaptive binaries and libraries: original hot function, Function Version 1, Function Version 2, ..., Function Version N. Iterative/collective compilation with multiple datasets
59 Static multiversioning framework for dynamic optimizations Step 2 Statically compiled adaptive binaries and libraries. Representative set of versions for the following optimization cases, to minimize execution time, power consumption and code size across all available datasets: optimizations for different datasets; optimizations/compilation for different architectures (heterogeneous or reconfigurable processors with different ISAs such as GPGPU, CELL, etc, or the same ISA with extensions such as 3dnow, SSE, etc, or virtual environments); optimizations for different program phases or different run-time environment behavior. Iterative/collective compilation with multiple datasets
60 Static multiversioning framework for dynamic optimizations Step 3 Statically compiled adaptive binaries and libraries. Extract dataset features. Selection mechanism optimized for low run-time overhead. Representative set of versions (as on the previous slide). Machine learning techniques to find a mapping between different run-time contexts and representative versions. Iterative/collective compilation with multiple datasets
61 Static multiversioning framework for dynamic optimizations Step X Statically compiled adaptive binaries and libraries. Extract dataset features. Monitor run-time behavior or architectural changes (in virtual, reconfigurable or heterogeneous environments) using timers or performance counters. Selection mechanism optimized for low run-time overhead. Representative set of versions (as on the previous slides). Machine learning techniques to find a mapping between different run-time contexts and representative versions. Iterative/collective compilation with multiple datasets (UNIDAPT)
62 Step 1: Iterative compilation with multiple datasets Open64/PathScale compiler with Interactive Compilation Interface, AMD64, random optimization selection: loop tiling (2..512); register tiling (2..8); loop unrolling (2..16); loop vectorization loop interchange; loop fusion; array prefetching (8..128)
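The random optimization selection above can be sketched as drawing one point per iterative-compilation run from the stated parameter space. Uniform sampling over the listed ranges, and treating the unparameterized transformations as on/off switches, is our assumption; the actual driver may constrain values, e.g. to powers of two:

```python
import random

# Parameter ranges taken from the slide; everything else about the
# sampling policy is an illustrative assumption.
SPACE = {
    "loop_tiling":        lambda r: r.randint(2, 512),
    "register_tiling":    lambda r: r.randint(2, 8),
    "loop_unrolling":     lambda r: r.randint(2, 16),
    "loop_vectorization": lambda r: r.choice([False, True]),
    "loop_interchange":   lambda r: r.choice([False, True]),
    "loop_fusion":        lambda r: r.choice([False, True]),
    "array_prefetching":  lambda r: r.randint(8, 128),
}

def sample_point(seed=None):
    """Draw one random point from the transformation space, to be passed
    to the compiler (here, conceptually, via the Interactive Compilation
    Interface) for a single iterative-compilation run."""
    rng = random.Random(seed)
    return {name: draw(rng) for name, draw in SPACE.items()}

point = sample_point(seed=42)
```

Each sampled point corresponds to one compiled version whose measured speedups per dataset feed the Step 2 selection algorithm.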
63 Step 2: Algorithm to select a representative optimization set. Main idea: simplify the life of developers by automating the selection of representative optimizations/datasets. The developer controls the overall performance gain/loss, the code size/number of versions and the quality of the mapping mechanism. The first version of the algorithm is greedy (expensive) and will be revisited in the future.
64 Step 2 (continued) [Table: datasets D1..DX (rows) vs code optimization variants (columns); s_ij is the speedup of variant j on dataset i]
65 Step 2 (continued) Calculate s_max = geometric mean of the best achievable speedups per dataset
66-67 Step 2 (continued) Find the code version with the maximum geometric mean of speedups across all datasets
68 Step 2 (continued) Add this version (c_0) to the representative set
69 Step 2 (continued) Calculate s_Rmax = geometric mean of the best speedups for each dataset using the representative set
70 Step 2 (continued) Check performance improvement/loss and code size explosion for the representative set
71 Step 2 (continued) If continuing, remove c_i and all datasets where c_i achieves the best speedup
72-73 Step 2 (continued) Find the code version with the maximum geometric mean of speedups across all remaining datasets
74 Step 2 (continued) Add this version (c_1) to the representative set
75 Step 2 (continued) Calculate s_Rmax = geometric mean of the best speedups for each dataset using the representative set
76 Step 2 (continued) Check performance improvement/loss and code size explosion for the representative set
77 Step 2 (continued) If continuing, remove c_i and all datasets where c_i achieves the best speedup
78 Step 2: Algorithm to select representative optimization set Main idea: Simplify life of developers by automating the process of representative optimization/dataset selection Developer controls overall performance gain/loss, code size/number of versions and quality of mapping mechanism First version is greedy (expensive) will be revisited in the future D 2 D 4 s 21 s 22 s 41 s 42 s max c o c 1 s Rmax and so on
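The greedy loop above can be written out concretely. The sketch below is illustrative stdlib Python: the function name and the `max_versions`/`min_gain` stopping thresholds are assumptions, not the authors' actual implementation.

```python
import math

def select_representative_set(speedups, max_versions=3, min_gain=0.01):
    """Greedy selection of a representative set of code versions.

    speedups[d][c] is the measured speedup of code version c on dataset d.
    Names and thresholds are illustrative, not from the original tool.
    """
    remaining = set(speedups)                        # datasets not yet covered
    versions = list(next(iter(speedups.values())))   # all code versions
    rep_set, prev_s_rmax = [], 0.0

    while remaining and len(rep_set) < max_versions:
        # 1) version with the best geometric mean of speedups
        #    across all remaining datasets
        def geomean(c):
            return math.exp(sum(math.log(speedups[d][c]) for d in remaining)
                            / len(remaining))
        best = max(versions, key=geomean)
        rep_set.append(best)

        # 2) s_Rmax: geometric mean over *all* datasets of the best
        #    speedup achievable with the current representative set
        s_rmax = math.exp(sum(
            math.log(max(speedups[d][c] for c in rep_set))
            for d in speedups) / len(speedups))

        # 3) stop when the gain no longer justifies another code version
        if s_rmax - prev_s_rmax < min_gain and len(rep_set) > 1:
            rep_set.pop()
            break
        prev_s_rmax = s_rmax

        # 4) drop the datasets on which the chosen version is already best
        remaining = {d for d in remaining
                     if max(speedups[d], key=speedups[d].get) != best}
    return rep_set
```

With three datasets where one version dominates two of them and another dominates the third, the loop picks both versions and stops once every dataset is covered.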
79 Step 3: Run-time version mapping mechanism
WEKA - machine learning suite supporting multiple standard techniques such as clustering, classification and regression
Direct classification (returns the most similar case from prior experience):
SMO - Support Vector Machine based
J48 - decision tree based
REPTree - decision tree based
JRip - rule based
PART - rule based
Ridor - rule based
Performance prediction model (probabilistic approach):
LeastMedSq - linear regression based
LinearRegression - linear regression based
PaceRegression - linear regression based
SMOreg - Support Vector Machine based
REPTree - decision tree based
M5Rules - rule based
The algorithms vary in applicability and complexity depending on the problem encountered.
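As a rough illustration of the direct-classification idea (return the most similar case from prior experience), here is a stdlib stand-in for the WEKA classifiers listed above: a 1-nearest-neighbour predictor mapping dataset features to the best-known code version. All names and the feature encoding are hypothetical.

```python
def train_direct_classifier(observations):
    """observations: list of (feature_vector, best_version) pairs collected
    during prior runs. Returns a 1-nearest-neighbour predictor, a stdlib
    stand-in for the WEKA classifiers (SMO, J48, ...) named on the slide."""
    def predict(features):
        # squared Euclidean distance between feature vectors
        dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
        _, version = min(observations,
                         key=lambda obs: dist(obs[0], features))
        return version
    return predict
```

For DGEMM-like kernels the feature could simply be the array dimension, as in the characterization used on the next slide.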
81 Error rates of different classification algorithms
Basic characterization: array dimension; 990 distinct datasets for DGEMM, 82 for FFT
Compared: direct classification vs. performance prediction model
Best performing algorithms are either decision tree or rule based
84 Outline → Predictive code scheduling for heterogeneous (CPU/GPU-like) architectures (brief)
85 Predictive code scheduling
Victor Jimenez, Isaac Gelado, Lluis Vilanova, Marisa Gil, Grigori Fursin and Nacho Navarro. Predictive runtime code scheduling for heterogeneous architectures. Proceedings of the International Conference on High Performance Embedded Architectures & Compilers (HiPEAC 2009), Paphos, Cyprus, January 2009
[The remaining slides in this section present the approach and results as figures.]
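The core idea named in the paper title, run each code region on the processing element with the smallest predicted completion time, can be sketched in a deliberately simplified form. The history/load bookkeeping and the linear scaling of past measurements below are assumptions, not the paper's actual predictor.

```python
def schedule(task_size, history, load):
    """Pick the processing element (e.g. 'cpu' or 'gpu') with the smallest
    predicted completion time. history[pe] maps a past task size to its
    measured execution time; load[pe] is the time already queued on pe.
    A simplified sketch of predictive runtime code scheduling."""
    def predicted_time(pe):
        # predict by linearly scaling the closest previous measurement
        size, time = min(history[pe].items(),
                         key=lambda kv: abs(kv[0] - task_size))
        return time * task_size / size
    return min(history, key=lambda pe: load[pe] + predicted_time(pe))
```

Note how accounting for the current load can send a task to the nominally slower device when the faster one is busy.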
89 Outline → Statistical collective optimization
90 Collective optimization framework
This framework relies only on execution time and statistical analysis. It does not require a specialized ML compiler or OS.
91 Probabilistic focused search
Probability distribution to select an optimization combination, based on continuous competition between combinations
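One way such a competition-driven distribution can be maintained is sketched below. The pairwise tournament and the `boost` update factor are illustrative choices, not the framework's exact update rule.

```python
import random

def focused_search(run, combinations, trials=100, boost=1.2, seed=0):
    """Probabilistic focused search sketch: optimization combinations
    compete in pairs and the selection distribution is biased toward
    winners. run(c) returns execution time (smaller is better)."""
    rng = random.Random(seed)
    weight = {c: 1.0 for c in combinations}
    for _ in range(trials):
        # draw two competitors according to the current distribution
        c1, c2 = rng.choices(list(weight), weights=weight.values(), k=2)
        if c1 == c2:
            continue
        winner = c1 if run(c1) < run(c2) else c2
        weight[winner] *= boost          # reward the faster combination
    total = sum(weight.values())
    return {c: w / total for c, w in weight.items()}
```

Over many runs (and, collectively, many users), probability mass concentrates on the combinations that keep winning comparisons.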
94 Maturation stages of a program
We would like to find a probability distribution of optimizations for a given program that maximizes performance across (all) datasets.
Stage 1: Program unknown. We leverage and apply optimizations suggested by the general experience collected over all programs. (Probability distribution d1)
Stage 2: Program unknown, a few runs only. Characterize the program behavior using the program reaction to optimizations [more details later]. (Probability distribution d2)
Stage 3: Program known, heavily used. We do not need the experience of other programs to select the most appropriate optimization combinations. Learning across datasets is important: use continuous competition between combinations. (Probability distribution d3)
97 Stage 2: Program known, a few runs only
Intuition: it is unlikely that all programs exhibit widely different behavior with respect to compiler optimizations (clustering).
We need to find similar programs to bias the distribution of optimizations, using some characterization:
static program features - need special tools
dynamic program features - may not be portable
Instead, we suggest using the program reaction to transformations, based on speedup only (portable).
[Table: pairwise comparison outcomes between optimization combinations (e.g. C1 > C2, C3 > C4, ...) recorded for each program P, P1, P2, P3.]
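The speedup-only characterization is cheap to implement. A minimal sketch of reaction vectors and their similarity, assuming only a `run(c)` timing function (the helper names are illustrative):

```python
def reaction_vector(run, pairs):
    """Characterize a program by its reaction to optimizations: for each
    pair (C1, C2) record whether C1 outperforms C2. run(c) returns the
    execution time under combination c, so no feature tools are needed."""
    return [run(c1) < run(c2) for c1, c2 in pairs]

def similarity(v1, v2):
    """Fraction of optimization pairs on which two programs react alike."""
    return sum(a == b for a, b in zip(v1, v2)) / len(v1)
```

Programs with similar reaction vectors can then share optimization distributions, without static or dynamic feature extraction.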
100 Selecting stages (meta-distribution)
Permanent competition between the different stages' distributions (d1, d2, d3):
1) All start with equal probability
2) Select d from (d1, d2, d3)
3) Select C1 and C2 based on probability distribution d
4) Run the program and compare C1 and C2
5) If C1 > C2 and the same holds according to d, the prediction was correct and we reward the associated distribution
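Steps 1-5 above can be sketched as a meta-level tournament between the stage distributions. The `reward` factor and the scoring scheme are illustrative assumptions, not the exact meta-distribution update.

```python
import random

def meta_select(stage_dists, run, trials=200, reward=1.1, seed=0):
    """Competition between stage distributions (e.g. d1, d2, d3).
    stage_dists[name] maps combinations to probabilities; run(c) gives
    execution time. A stage is rewarded when the faster of two sampled
    combinations is also the one it ranks higher."""
    rng = random.Random(seed)
    score = {name: 1.0 for name in stage_dists}
    for _ in range(trials):
        # 2) select a distribution d in proportion to its current score
        name, = rng.choices(list(score), weights=score.values())
        d = stage_dists[name]
        # 3) select C1 and C2 according to d
        c1, c2 = rng.choices(list(d), weights=d.values(), k=2)
        if c1 == c2:
            continue
        # 4) run and compare; 5) reward d if it predicted the winner
        faster = c1 if run(c1) < run(c2) else c2
        loser = c2 if faster == c1 else c1
        if d[faster] >= d[loser]:
            score[name] *= reward
    return max(score, key=score.get)
```

A stage whose ranking consistently agrees with measured speedups accumulates score and is consulted more often.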
101 Performance of collective optimization
d1 - improving the default compiler heuristic using collective knowledge
d2 - should improve with more collective knowledge
d3 - often better than collective but may change if more optimizations are available
102 Collective Tuning Initiative
Community-driven project to develop auto-tuning computing systems.
Developing common open-source R&D tools with unified APIs (universal compilers, computer architecture simulators, adaptive run-time systems) to optimize/parallelize programs and architectures collectively, using empirical iterative compilation, statistical and machine learning techniques.
Sharing useful optimization cases from the community for programs/libraries/OS (compiler optimizations/architecture configurations to improve execution time, code size, architecture size, power consumption, etc.) in the Collective Optimization Database.
Partially funded by HiPEAC, MILEPOST, Google Summer of Code, IBM, INRIA.
Requires lots of work - join the cTuning community!
105 Current and future work
New tools open many research opportunities and allow systematic empirical evaluation of computing systems:
analysis of program static and dynamic features to improve predictions
analysis of dataset features for better run-time adaptation
fine-grain program optimizations, polyhedral transformations and ML (supported by Google Summer of Code 2009)
detection of important cases for collective optimization
split compilation
adaptive fault-tolerance
statistical parallelization and adaptation for heterogeneous systems
106 Thank you for your attention
Collaborative R&D - partially funded by [sponsors shown as logos]
Contact and more information about research projects and software: [URLs omitted in transcription]
More informationCode Auto-Tuning with the Periscope Tuning Framework
Code Auto-Tuning with the Periscope Tuning Framework Renato Miceli, SENAI CIMATEC renato.miceli@fieb.org.br Isaías A. Comprés, TUM compresu@in.tum.de Project Participants Michael Gerndt, TUM Coordinator
More informationAccelerating Financial Applications on the GPU
Accelerating Financial Applications on the GPU Scott Grauer-Gray Robert Searles William Killian John Cavazos Department of Computer and Information Science University of Delaware Sixth Workshop on General
More informationRaced Profiles: Efficient Selection of Competing Compiler Optimizations
Raced Profiles: Efficient Selection of Competing Compiler Optimizations Abstract Many problems in embedded compilation require one set of optimizations to be selected over another based on run time performance.
More informationDirected Optimization On Stencil-based Computational Fluid Dynamics Application(s)
Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s) Islam Harb 08/21/2015 Agenda Motivation Research Challenges Contributions & Approach Results Conclusion Future Work 2
More informationAn Investigation into Value Profiling and its Applications
Manchester Metropolitan University BSc. (Hons) Computer Science An Investigation into Value Profiling and its Applications Author: Graham Markall Supervisor: Dr. Andy Nisbet No part of this project has
More informationTowards Machine Learning-Based Auto-tuning of MapReduce
3 IEEE st International Symposium on Modelling, Analysis & Simulation of Computer and Telecommunication Systems Towards Machine Learning-Based Auto-tuning of MapReduce Nezih Yigitbasi, Theodore L. Willke,
More informationA Test Suite for High-Performance Parallel Java
page 1 A Test Suite for High-Performance Parallel Java Jochem Häuser, Thorsten Ludewig, Roy D. Williams, Ralf Winkelmann, Torsten Gollnick, Sharon Brunett, Jean Muylaert presented at 5th National Symposium
More informationCollective mind: Towards practical and collaborative auto-tuning
Scientific Programming 22 (2014) 309 329 309 DOI 10.3233/SPR-140396 IOS Press Collective mind: Towards practical and collaborative auto-tuning Grigori Fursin a,, Renato Miceli b, Anton Lokhmotov c, Michael
More informationKismet: Parallel Speedup Estimates for Serial Programs
Kismet: Parallel Speedup Estimates for Serial Programs Donghwan Jeon, Saturnino Garcia, Chris Louie, and Michael Bedford Taylor Computer Science and Engineering University of California, San Diego 1 Questions
More informationArchitecture Tuning Study: the SimpleScalar Experience
Architecture Tuning Study: the SimpleScalar Experience Jianfeng Yang Yiqun Cao December 5, 2005 Abstract SimpleScalar is software toolset designed for modeling and simulation of processor performance.
More informationUnifying Big Data Workloads in Apache Spark
Unifying Big Data Workloads in Apache Spark Hossein Falaki @mhfalaki Outline What s Apache Spark Why Unification Evolution of Unification Apache Spark + Databricks Q & A What s Apache Spark What is Apache
More informationWorkloads Programmierung Paralleler und Verteilter Systeme (PPV)
Workloads Programmierung Paralleler und Verteilter Systeme (PPV) Sommer 2015 Frank Feinbube, M.Sc., Felix Eberhardt, M.Sc., Prof. Dr. Andreas Polze Workloads 2 Hardware / software execution environment
More informationExecuting Legacy Applications on a Java Operating System
Executing Legacy Applications on a Java Operating System Andreas Gal, Michael Yang, Christian Probst, and Michael Franz University of California, Irvine {gal,mlyang,probst,franz}@uci.edu May 30, 2004 Abstract
More informationATOS introduction ST/Linaro Collaboration Context
ATOS introduction ST/Linaro Collaboration Context Presenter: Christian Bertin Development team: Rémi Duraffort, Christophe Guillon, François de Ferrière, Hervé Knochel, Antoine Moynault Consumer Product
More informationTrends in HPC (hardware complexity and software challenges)
Trends in HPC (hardware complexity and software challenges) Mike Giles Oxford e-research Centre Mathematical Institute MIT seminar March 13th, 2013 Mike Giles (Oxford) HPC Trends March 13th, 2013 1 / 18
More informationAutomatic Selection of GCC Optimization Options Using A Gene Weighted Genetic Algorithm
Automatic Selection of GCC Optimization Options Using A Gene Weighted Genetic Algorithm San-Chih Lin, Chi-Kuang Chang, Nai-Wei Lin National Chung Cheng University Chiayi, Taiwan 621, R.O.C. {lsch94,changck,naiwei}@cs.ccu.edu.tw
More informationAdministration. Prerequisites. Meeting times. CS 380C: Advanced Topics in Compilers
Administration CS 380C: Advanced Topics in Compilers Instructor: eshav Pingali Professor (CS, ICES) Office: POB 4.126A Email: pingali@cs.utexas.edu TA: TBD Graduate student (CS) Office: Email: Meeting
More informationRechnerarchitektur (RA)
12 Rechnerarchitektur (RA) Sommersemester 2017 Architecture-Aware Optimizations -Software Optimizations- Jian-Jia Chen Informatik 12 Jian-jia.chen@tu-.. http://ls12-www.cs.tu.de/daes/ Tel.: 0231 755 6078
More informationLocality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives
Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives José M. Andión, Manuel Arenaz, François Bodin, Gabriel Rodríguez and Juan Touriño 7th International Symposium on High-Level Parallel
More informationPiecewise Holistic Autotuning of Compiler and Runtime Parameters
Piecewise Holistic Autotuning of Compiler and Runtime Parameters Mihail Popov, Chadi Akel, William Jalby, Pablo de Oliveira Castro University of Versailles Exascale Computing Research August 2016 C E R
More informationPerformance measurement. SMD149 - Operating Systems - Performance and processor design. Introduction. Important trends affecting performance issues
Performance measurement SMD149 - Operating Systems - Performance and processor design Roland Parviainen November 28, 2005 Performance measurement Motivation Techniques Common metrics Processor architectural
More informationDealing with Heterogeneous Multicores
Dealing with Heterogeneous Multicores François Bodin INRIA-UIUC, June 12 th, 2009 Introduction Main stream applications will rely on new multicore / manycore architectures It is about performance not parallelism
More informationParallel Algorithm Engineering
Parallel Algorithm Engineering Kenneth S. Bøgh PhD Fellow Based on slides by Darius Sidlauskas Outline Background Current multicore architectures UMA vs NUMA The openmp framework and numa control Examples
More informationEvaluating the Effects of Compiler Optimisations on AVF
Evaluating the Effects of Compiler Optimisations on AVF Timothy M. Jones, Michael F.P. O Boyle Member of HiPEAC, School of Informatics University of Edinburgh, UK {tjones1,mob}@inf.ed.ac.uk Oğuz Ergin
More informationStatistical Performance Comparisons of Computers
Tianshi Chen 1, Yunji Chen 1, Qi Guo 1, Olivier Temam 2, Yue Wu 1, Weiwu Hu 1 1 State Key Laboratory of Computer Architecture, Institute of Computing Technology (ICT), Chinese Academy of Sciences, Beijing,
More informationExecution-based Prediction Using Speculative Slices
Execution-based Prediction Using Speculative Slices Craig Zilles and Guri Sohi University of Wisconsin - Madison International Symposium on Computer Architecture July, 2001 The Problem Two major barriers
More informationAutomatic Algorithm Configuration based on Local Search
Automatic Algorithm Configuration based on Local Search Frank Hutter 1 Holger Hoos 1 Thomas Stützle 2 1 Department of Computer Science University of British Columbia Canada 2 IRIDIA Université Libre de
More informationHSA Foundation! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar Room (Bld 20)! 15 December, 2017!
Advanced Topics on Heterogeneous System Architectures HSA Foundation! Politecnico di Milano! Seminar Room (Bld 20)! 15 December, 2017! Antonio R. Miele! Marco D. Santambrogio! Politecnico di Milano! 2
More informationENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design
ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University
More informationSoftware Ecosystem for Arm-based HPC
Software Ecosystem for Arm-based HPC CUG 2018 - Stockholm Florent.Lebeau@arm.com Ecosystem for HPC List of components needed: Linux OS availability Compilers Libraries Job schedulers Debuggers Profilers
More information