1 Collective optimization, run-time adaptation and machine learning Grigori Fursin UNIDAPT / ALCHEMY Group INRIA Saclay, France HiPEAC member
2 My background: B.S. in Physics, M.S. in Computer Engineering from Moscow Institute of Physics & Technology, Russia (1999). Ph.D. from the University of Edinburgh, UK: iterative compilation and performance prediction (advisor: Prof. Michael O'Boyle). Postdoctoral researcher at INRIA Futurs, France (until 2007): machine learning for optimization knowledge reuse, run-time adaptation for programs with multiple datasets and heterogeneous systems, architectural design space exploration (working with the groups of Prof. Olivier Temam and Prof. Michael O'Boyle). Tenured research scientist at INRIA Saclay, France (2007-now): self-tuning computing systems based on statistical collective optimization, machine learning and run-time adaptation; building a research group; developing the HiPEAC common research compiler based on GCC; developing the Collective Tuning Center
3 Outline: Motivation; Prerequisites; Iterative feedback-directed compilation (empirical optimization); MILEPOST project: machine-learning-based self-tuning compiler adaptable to heterogeneous reconfigurable architectures (based on static and dynamic program features); Optimization sensitivity to datasets; Run-time adaptation for statically compiled programs (adaptive binaries); Recent projects: adaptive libraries (iterative compilation, dataset features and run-time adaptation), predictive code scheduling for heterogeneous (CPU/GPU-like) architectures; Statistical collective optimization; Collective Tuning Initiative; Conclusions and future work
4 Motivation Continuing innovation in science and technology requires increasing computing resources while placing strict requirements on system performance, power consumption, size, response, reliability, portability and design time. High-performance computing systems tend to evolve toward complex heterogeneous multi-core systems, which dramatically increases design & optimization time.
5 Motivation Continuing innovation in science and technology requires increasing computing resources while placing strict requirements on system performance, power consumption, size, response, reliability, portability and design time. High-performance computing systems tend to evolve toward complex heterogeneous multi-core systems, which dramatically increases design & optimization time. Optimizing compilers play a key role in producing executable codes quickly and automatically while satisfying all the above requirements for a broad range of programs and architectures.
6 Motivation Developing and tuning current compilers for rapidly evolving architectures is a tedious and time-consuming process. Current state-of-the-art compilers and optimizers often fail to deliver the best performance: hardwired ad-hoc optimization heuristics (cost models) for rapidly evolving hardware; large irregular optimization spaces; interaction between optimizations, order of optimizations; difficult to add new transformations to already-tuned optimization heuristics; inability to reuse optimization knowledge among different programs and architectures; lack of run-time information and inability to adapt to varying program and system behavior (or dataset) at run-time with low overhead
7 Motivation Developing and tuning current compilers for rapidly evolving architectures is a tedious and time-consuming process. Current state-of-the-art compilers and optimizers often fail to deliver the best performance: hardwired ad-hoc optimization heuristics (cost models) for rapidly evolving hardware; large irregular optimization spaces; interaction between optimizations, order of optimizations; difficult to add new transformations to already-tuned optimization heuristics; inability to reuse optimization knowledge among different programs and architectures; lack of run-time information and inability to adapt to varying program and system behavior (or dataset) at run-time with low overhead. We need universal self-tuning compilers (architectures and run-time systems) that can continuously and automatically adapt to any heterogeneous architecture and learn how to optimize programs
8 Outline: Motivation; Prerequisites; Iterative feedback-directed compilation (empirical optimization); MILEPOST project: machine-learning-based self-tuning compiler adaptable to heterogeneous reconfigurable architectures (based on static and dynamic program features); Optimization sensitivity to datasets; Run-time adaptation for statically compiled programs (adaptive binaries); Recent projects: adaptive libraries (iterative compilation, dataset features and run-time adaptation), predictive code scheduling for heterogeneous (CPU/GPU-like) architectures; Statistical collective optimization; Collective Tuning Initiative; Conclusions and future work
9 Iterative compilation (first steps) Optimization spaces (the set of all possible program transformations) are large and non-linear, with many local minima. Finding a good solution may be long and non-trivial: matmul, 2 transformations, search space = 2000; swim, 3 transformations. Potential solution - iterative feedback-directed compilation: learn program behavior across executions (combinations of global flags, orders of optimization phases, cost-model tuning for individual transformations - meta-optimization). High potential
10 Iterative compilation (first steps) Optimization spaces (the set of all possible program transformations) are large and non-linear, with many local minima. Finding a good solution may be long and non-trivial: matmul, 2 transformations, search space = 2000; swim, 3 transformations. Potential solution - iterative feedback-directed compilation: learn program behavior across executions (combinations of global flags, orders of optimization phases, cost-model tuning for individual transformations - meta-optimization). High potential, but: slow; often the same dataset is used; often no run-time adaptation; no optimization knowledge reuse
11 Iterative compilation (first steps) Optimization spaces (the set of all possible program transformations) are large and non-linear, with many local minima. Finding a good solution may be long and non-trivial: matmul, 2 transformations, search space = 2000; swim, 3 transformations. Potential solution - iterative feedback-directed compilation: learn program behavior across executions (combinations of global flags, orders of optimization phases, cost-model tuning for individual transformations - meta-optimization). High potential, but: slow; often the same dataset is used; often no run-time adaptation; no optimization knowledge reuse. Solving these problems is non-trivial
12 Iterative compilation (uniform random) Systematic optimization exploration. [Chart: speedup per benchmark - bitcount, susan_c, susan_e, susan_s, jpeg_c, jpeg_d, dijkstra, patricia, blowfish_d, blowfish_e, rijndael_d, rijndael_e, sha, adpcm_c, adpcm_d, CRC32, gsm, qsort1, stringsearch1] AMD - a cluster with 16 AMD Athlon processors running at 2.4GHz; IA32 - a cluster with 4 Intel Xeon processors running at 2.8GHz; IA64 - a server with an Itanium2 processor running at 1.3GHz. Traditional uniform random iterative search (GCC/Open64 global compiler flags): 500 random combinations of flags and associated passes (~100 optimizations). Can obtain high speedups, but very slow
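The uniform random search on this slide can be sketched as follows. The `evaluate` callback is a stand-in for compiling with a given flag combination and timing the resulting binary; the toy objective below, with its invented flag names and interaction, is purely illustrative:

```python
import random

def random_flag_search(flags, evaluate, trials=500, seed=0):
    """Uniform random iterative search: try random on/off combinations
    of compiler flags and keep the fastest configuration found."""
    rng = random.Random(seed)
    best_flags, best_time = None, float("inf")
    for _ in range(trials):
        # Each flag is independently enabled with probability 1/2.
        combo = tuple(f for f in flags if rng.random() < 0.5)
        t = evaluate(combo)  # compile + run + measure (stand-in here)
        if t < best_time:
            best_flags, best_time = combo, t
    return best_flags, best_time

# Toy stand-in for "compile and measure": -funroll-loops helps,
# -fslow hurts, and the two interact (flags are not independent).
def toy_evaluate(combo):
    t = 10.0
    if "-funroll-loops" in combo:
        t -= 2.0
    if "-fslow" in combo:
        t += 1.0
        if "-funroll-loops" in combo:
            t += 0.5  # interaction between optimizations
    return t

best, t = random_flag_search(["-funroll-loops", "-fslow"], toy_evaluate, trials=50)
```

In a real driver each evaluation is a full compile-and-run cycle, which is exactly why the slide calls this approach very slow.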
13 Iterative compilation (uniform random) Optimization space is not trivial
14 Iterative compilation (uniform random) [Chart: susan_corners, AMD Athlon64, GCC with ~100 flags, -O3 marked] Multi-objective optimization (also power, reliability, etc). Combinations of optimizations matter. Important to balance performance/code size, particularly for embedded applications
15 Iterative compilation (focused probabilistic search) Focused probabilistic search (similar to GA search) (SUIF source-to-source + Intel compiler, 80 different transformations). Faster than traditional search (~50 iterations), but can get stuck in local minima. B. Franke, M. O'Boyle, J. Thomson and G. Fursin. Probabilistic Source-Level Optimisation of Embedded Systems Software. Proceedings of the Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES'05), pages 78-86, Chicago, IL, USA, June 2005
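A minimal sketch of a focused probabilistic search in this spirit: it keeps a probability of applying each transformation and biases the sampling distribution toward the best sequence found so far (a PBIL-style update, not the exact algorithm of the LCTES'05 paper; the toy objective is an invented stand-in for measurement):

```python
import random

def focused_search(n_transforms, evaluate, generations=20, pop=10, lr=0.3, seed=0):
    """Focused probabilistic search (PBIL-style sketch, similar in spirit
    to the GA-like search on the slide): sample candidate transformation
    sequences, then shift the distribution toward the best one seen."""
    rng = random.Random(seed)
    probs = [0.5] * n_transforms          # start with an unbiased distribution
    best_seq, best_time = None, float("inf")
    for _ in range(generations):
        for _ in range(pop):
            seq = [rng.random() < p for p in probs]
            t = evaluate(seq)
            if t < best_time:
                best_seq, best_time = seq, t
        # Bias sampling toward the best sequence; as the slide notes,
        # such focused search can still get stuck in a local minimum.
        probs = [p + lr * (int(b) - p) for p, b in zip(probs, best_seq)]
    return best_seq, best_time

# Toy objective: transformations 0 and 2 reduce run time, 1 increases it.
def toy_eval(seq):
    gains = [-1.0, +0.7, -0.5]
    return 10.0 + sum(g for g, on in zip(gains, seq) if on)

seq, t = focused_search(3, toy_eval)
```

Compared with uniform random search, the distribution update concentrates trials on promising regions, which is why far fewer iterations (~50 on the slide) suffice.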
16 Motivation Prerequisites Outline Iterative feedback-directed directed compilation (empirical optimization) MILEPOST project: machine-learning based self-tuning compiler adaptable to heterogeneous reconfigurable architectures (based on static and dynamic program features) Optimization sensitivity to datasets Run-time adaptation for statically compiled programs (adaptive binaries) Recent projects Adaptive libraries (iterative compilation, dataset features and run-time adaptation) Predictive code scheduling for heterogeneous (CPU/GPU-like) architectures Statistical collective optimization Collective Tuning Initiative Conclusions and future work
17 MILEPOST project Machine Learning for Embedded Programs Optimization. The objective is to develop compiler technology that can automatically learn how to best optimize programs for reconfigurable heterogeneous embedded processors (with run-time adaptation) and dramatically reduce time to market. Partners: INRIA, University of Edinburgh, IBM, ARC, CAPS Enterprise. The developed techniques and software are publicly available and will hopefully influence future compiler development
18 Interactive Compilation Interface GCC Detect optimization flags GCC Controller (Pass Manager) Pass 1... Pass N GCC Data Layer AST, CFG, CF, etc
19 Interactive Compilation Interface GCC with ICI Detect optimization flags ICI IC Event GCC Controller (Pass Manager) IC Event Interactive Compilation Interface Pass 1... Pass N IC Event GCC Data Layer AST, CFG, CF, etc IC Data
20 Interactive Compilation Interface GCC with ICI High-level scripting (java, python, etc) Detect optimization flags IC Event GCC Controller (Pass Manager) IC Event ICI Interactive Compilation Interface IC Plugins <Dynamically linked shared libraries> Selecting pass sequences... Extracting static program features Pass 1... Pass N IC Event GCC Data Layer AST, CFG, CF, etc IC Data
21 Interactive Compilation Interface GCC with ICI High-level scripting (java, python, etc) Detect optimization flags IC Event GCC Controller (Pass Manager) Pass 1... GCC Data Layer AST, CFG, CF, etc IC Event Pass N IC Event IC Data ICI Interactive Compilation Interface IC Plugins <Dynamically linked shared libraries> Selecting pass sequences... Extracting static program features CCC Continuous Collective Compilation Framework ML drivers to optimize programs and tune compiler optimization heuristic
22 MILEPOST framework Training Program 1 Program N MILEPOST GCC (with ICI and ML routines) IC Plugins Recording pass sequences Extracting static program features
23 MILEPOST framework Training Program 1 Program N MILEPOST GCC (with ICI and ML routines) IC Plugins Recording pass sequences Extracting static program features CCC Continuous Collective Compilation Framework Drivers for iterative compilation and model training
24 MILEPOST framework Training Program 1 Program N MILEPOST GCC (with ICI and ML routines) IC Plugins Recording pass sequences Extracting static program features CCC Continuous Collective Compilation Framework Drivers for iterative compilation and model training Deployment New program MILEPOST GCC Extracting static program features Selecting good passes Predicting good passes to improve exec. time, code size and comp. time Grigori Fursin, Cupertino Miranda, Olivier Temam, Mircea Namolaru, Elad Yom-Tov, Ayal Zaks, Bilha Mendelson, Phil Barnard, Elton Ashton, Eric Courtois, Francois Bodin, Edwin Bonilla, John Thomson, Hugh Leather, Chris Williams, Michael O'Boyle. MILEPOST GCC: machine learning based research compiler. Proceedings of the GCC Developers' Summit, Ottawa, Canada, June 2008 (based on CGO'06 and HiPEAC'05 papers)
25 Machine learning (static features) Program characterization based on static features MILEPOST GCC feature extractor (IBM Haifa & INRIA) ft1 - Number of basic blocks in the method ft19 - Number of direct calls in the method ft20 - Number of conditional branches in the method ft21 - Number of assignment instructions in the method ft22 - Number of binary integer operations in the method ft23 - Number of binary floating point operations in the method ft24 - Number of instructions in the method ft25 - Average of number of instructions in basic blocks ft54 - Number of local variables that are pointers in the method ft55 - Number of static/extern variables that are pointers in the method How constructed: human intuition On-going work on feature generation and analysis Critical to be able to make good predictions!
26 Machine learning (static features) 14 transformations, sequences of length 5: search space = 14^5 = 537,824. Predict the best transformations from this space based on program features and previous optimization experience. Off-line training: focused search. Predicting the best transformation for a new program: static features + nearest-neighbors classifier. F. Agakov, E. Bonilla, J. Cavazos, B. Franke, G. Fursin, M.F.P. O'Boyle, J. Thomson, M. Toussaint and C.K.I. Williams. Using Machine Learning to Focus Iterative Optimization. Proceedings of the 4th Annual International Symposium on Code Generation and Optimization (CGO), New York, NY, USA, March 2006
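The prediction step can be sketched as a 1-nearest-neighbor lookup over static feature vectors: find the most similar previously seen program and reuse the optimizations that worked for it. The training tuples, feature choices and flag names below are hypothetical illustrations, not the MILEPOST training data:

```python
import math

def nearest_neighbor_predict(train, new_features):
    """1-nearest-neighbor classifier: find the training program whose
    static feature vector is closest (Euclidean distance) to the new
    program's features and reuse its best known optimizations."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    features, best_opts = min(train, key=lambda fb: dist(fb[0], new_features))
    return best_opts

# Hypothetical training data: (static feature vector, best optimizations).
# Features here stand for e.g. (basic blocks, branches, instructions)
# per method, in the spirit of ft1/ft20/ft24 on the features slide.
train = [
    ((12, 3, 80),   ["-funroll-loops", "-O3"]),
    ((4, 1, 20),    ["-Os"]),
    ((50, 20, 400), ["-O3", "-fprefetch-loop-arrays"]),
]
predicted = nearest_neighbor_predict(train, (11, 4, 75))
```

In practice the feature vectors are normalized first, since raw counts such as "number of instructions" would otherwise dominate the distance.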
27 Machine learning (dynamic features) Performance counter values for 181.mcf compiled with -O0, relative to the average values over the entire set of benchmark suites (SPECFP, SPECINT, MiBench, Polyhedron). John Cavazos, Grigori Fursin, Felix Agakov, Edwin Bonilla, Michael F.P. O'Boyle and Olivier Temam. Rapidly Selecting Good Compiler Optimizations using Performance Counters. Proceedings of the 5th Annual International Symposium on Code Generation and Optimization (CGO), San Jose, USA, March 2007
28 Machine learning (dynamic features) Performance counter values for 181.mcf compiled with -O0, relative to the average values over the entire set of benchmark suites (SPECFP, SPECINT, MiBench, Polyhedron). Problem: a greater number of memory accesses per instruction than average
29 Machine learning (dynamic features) Performance counter values for 181.mcf compiled with -O0, relative to the average values over the entire set of benchmark suites (SPECFP, SPECINT, MiBench, Polyhedron). Solving all performance issues one by one is slow and can be inefficient due to their non-linear dependencies
30 Machine learning (dynamic features) Performance counter values for 181.mcf compiled with -O0, relative to the average values over the entire set of benchmark suites (SPECFP, SPECINT, MiBench, Polyhedron). Solving all performance issues one by one is slow and can be inefficient due to their non-linear dependencies. CONSIDER ALL PERFORMANCE ISSUES AT THE SAME TIME!
31 Detecting influential counters Principal Component Analysis: most informative performance counters: 1) L1_TCA 2) L1_DCH 3) TLB_DM 4) BR_INS 5) RES_STL 6) TOT_CYC 7) L2_ICH 8) VEC_INS 9) L2_DCH 10) L2_TCA 11) L1_DCA 12) HW_INT 13) L2_TCH 14) L1_TCH 15) BR_MS Analysis of the importance of the performance counters. The data contains one good optimization sequence per benchmark. Calculating mutual information between a subset of the performance counters and good optimization sequences
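The mutual-information ranking mentioned on this slide can be sketched as follows: discretize each counter's values, then compute the empirical mutual information between those values and the known best optimization sequence per benchmark. The toy counters and sequences below are invented for illustration:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y) in bits between two discrete
    sequences - here, a discretized performance counter (X) and the best
    optimization sequence per benchmark (Y). Higher MI = more informative."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        p_joint = c / n
        # p(x,y) / (p(x) p(y)) = c*n / (count_x * count_y)
        mi += p_joint * math.log2(c * n / (px[x] * py[y]))
    return mi

# Toy data: counter A (discretized high/low) perfectly predicts the best
# sequence; counter B is constant and carries no information at all.
best_seq  = ["s1", "s1", "s2", "s2"]
counter_a = ["hi", "hi", "lo", "lo"]
counter_b = ["x",  "x",  "x",  "x"]
mi_a = mutual_information(counter_a, best_seq)
mi_b = mutual_information(counter_b, best_seq)
```

Ranking all counters by this score and keeping the top ones yields a shortlist like the 15 counters on the slide.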
32 Outline: Motivation; Prerequisites; Iterative feedback-directed compilation (empirical optimization); MILEPOST project: machine-learning-based self-tuning compiler adaptable to heterogeneous reconfigurable architectures (based on static and dynamic program features); Optimization sensitivity to datasets; Run-time adaptation for statically compiled programs (adaptive binaries); Recent projects: adaptive libraries (iterative compilation, dataset features and run-time adaptation), predictive code scheduling for heterogeneous (CPU/GPU-like) architectures; Statistical collective optimization; Collective Tuning Initiative; Conclusions and future work
33 Optimization sensitivity to datasets MiBench, 20 datasets per benchmark, 200/1000 random combinations of Open64 (GCC) compiler flags, 5 months of experiments. dijkstra (not sensitive); jpeg_d (dataset sensitive). Grigori Fursin, John Cavazos, Michael O'Boyle and Olivier Temam. MiDataSets: Creating The Conditions For A More Realistic Evaluation of Iterative Optimization. Proceedings of HiPEAC 2007, Ghent, Belgium, January 2007
34 Optimization sensitivity to datasets MiBench, 20 datasets per benchmark, 200/1000 random combinations of Open64 (GCC) compiler flags, 5 months of experiments. dijkstra (not sensitive); jpeg_d (clustering). Grigori Fursin, John Cavazos, Michael O'Boyle and Olivier Temam. MiDataSets: Creating The Conditions For A More Realistic Evaluation of Iterative Optimization. Proceedings of HiPEAC 2007, Ghent, Belgium, January 2007
35 Outline: Motivation; Prerequisites; Iterative feedback-directed compilation (empirical optimization); MILEPOST project: machine-learning-based self-tuning compiler adaptable to heterogeneous reconfigurable architectures (based on static and dynamic program features); Optimization sensitivity to datasets; Run-time adaptation for statically compiled programs (adaptive binaries); Recent projects: adaptive libraries (iterative compilation, dataset features and run-time adaptation), predictive code scheduling for heterogeneous (CPU/GPU-like) architectures; Statistical collective optimization; Collective Tuning Initiative; Conclusions and future work
36 Run-time adaptation Why adapt at run-time? Different program context Different run-time behavior Different system load Different available resources Different power consumption Different architectures & ISA For each case we want to find and use best optimization settings!
37 Current methods Some existing solutions: Application Dataset 1 Compiler Binary Output 1
38 Current methods Some existing solutions: Application Dataset 1 Compiler Binary Dynamic optimizations Output 1
39 Current methods Some existing solutions: Application Dataset 1 Compiler Binary Dynamic optimizations Output 1 Pros: run-time information, potentially more than one dataset
40 Current methods Some existing solutions: Application Dataset 1 Compiler Binary Dynamic optimizations Output 1 Pros: run-time information, potentially more than one dataset Cons: restrictions on optimization time, simple optimizations
41 Current methods Some existing solutions: Application Dataset 1 Iterative optimizations Compiler Binary Output 1 Dynamic optimizations Pros: run-time information, potentially more than one dataset Cons: restrictions on optimization time, simple optimizations
42 Current methods Some existing solutions: Application Dataset 1 Iterative optimizations Compiler Binary Output 1 Dynamic optimizations Pros: powerful transformation space exploration Pros: run-time information, potentially more than one dataset Cons: restrictions on optimization time, simple optimizations
43 Current methods Some existing solutions: Application Dataset 1 Iterative optimizations Compiler Binary Output 1 Dynamic optimizations Pros: powerful transformation space exploration Cons: slow, one dataset Pros: run-time information, potentially more than one dataset Cons: restrictions on optimization time, simple optimizations
44 Current methods Can we combine both? Application Dataset 1 Iterative optimizations Compiler Binary Output 1 Dynamic optimizations Static code multiversioning and run-time adaptation: powerful transformation space exploration + run-time information = self-tuning adaptive binaries
45 Our approach: static multiversioning Application Select most time consuming code sections
46 Our approach: static multiversioning Application Create multi-versions of time consuming code sections
47 Our approach: static multiversioning Application adapt_start adapt_start adapt_stop adapt_stop Add adaptation routines (depends on the task)
48 Our approach: static multiversioning Transformations Application adapt_start adapt_start adapt_stop adapt_stop Apply various transformations over multi-versions of code sections
49 Our approach: static multiversioning Select global or fine-grain internal compiler optimizations Transformations Application adapt_start adapt_start adapt_stop adapt_stop Apply various transformations over multi-versions of code sections
50 Our approach: static multiversioning Transformations Application adapt_start adapt_start adapt_stop adapt_stop Apply various transformations over multi-versions of code sections
51 Our approach: static multiversioning Transformations Different ISA; manual transformations, etc Application adapt_start adapt_start adapt_stop adapt_stop Apply various transformations over multi-versions of code sections
52 Our approach: static multiversioning Final instrumented program Application adapt_start adapt_start adapt_stop adapt_stop
53 Run-time program adaptation [Chart: execution times for subroutine resid of benchmark mgrid across function calls (time in sec), showing the startup phase (phase detection), the evaluation of one option, and the end of the optimization process (best option found)] 1) Consider a new optimization option evaluated after 2 consecutive executions of the code section with the same performance 2) Ignore the one next execution to avoid transitional effects 3) Check baseline performance (to verify the stability prediction)
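A minimal sketch of the evaluation protocol in steps 1-3 above, assuming a simple `record(time)` hook called after each execution of the multiversioned code section. The class, its tolerance threshold and the exact bookkeeping are our simplification, not the paper's implementation:

```python
class AdaptiveSelector:
    """Sketch of run-time version evaluation: a new optimization option
    counts as evaluated after 2 consecutive executions with (nearly) the
    same timing; the call right after a switch is ignored to avoid
    transitional effects; version 0 plays the role of the baseline."""
    def __init__(self, n_versions, tol=0.05):
        self.n, self.tol = n_versions, tol
        self.current = 0          # version being evaluated (0 = baseline)
        self.times = {}           # version -> measured stable time
        self.pending = []         # timings observed since the last switch
        self.best = None          # set once the best option is found

    def record(self, t):
        if self.best is not None:           # search already finished
            return
        self.pending.append(t)
        if len(self.pending) < 3:           # first call after a switch is
            return                          # ignored (transitional effects)
        a, b = self.pending[-2], self.pending[-1]
        if abs(a - b) <= self.tol * max(a, b):   # 2 consecutive stable runs
            self.times[self.current] = (a + b) / 2
            if self.current + 1 < self.n:
                self.current += 1           # evaluate the next version
                self.pending = []
            else:                           # all versions evaluated:
                self.best = min(self.times, key=self.times.get)
                self.current = self.best    # best option found

    def version(self):
        return self.current

# Usage: feed timings observed at each call of the hot code section.
sel = AdaptiveSelector(n_versions=2)
for t in [1.2, 1.0, 1.0]:   # version 0: transitional call, then stable
    sel.record(t)
for t in [0.9, 0.8, 0.8]:   # version 1: transitional call, then stable
    sel.record(t)
```

A production version would also periodically re-check the baseline to detect phase changes, as step 3 on the slide suggests.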
54 Run-time program adaptation [Chart: time (sec) vs function calls] Continuous adaptation: saving prediction info after execution
55 Run-time program adaptation [Chart: time (sec) vs function calls] Continuous adaptation: preloading prediction info before execution. Grigori Fursin, Albert Cohen, Michael O'Boyle and Olivier Temam. A Practical Method For Quickly Evaluating Program Optimizations. Proceedings of the 1st International Conference on High Performance Embedded Architectures & Compilers (HiPEAC 2005), number 3793 in LNCS, pages 29-46, Barcelona, Spain, November 2005
56 Possible usage Create static binaries and libraries adaptable to different program and architecture behavior (also split-compilation). Generate mixed multiple-ISA code with run-time adaptation for heterogeneous systems (CPU/GPU or CELL architectures). Determine the effect of optimizations at run-time for programs with varying datasets without a reference run (randomly selecting versions at run-time). Statistical collective optimization
57 Outline: Motivation; Prerequisites; Iterative feedback-directed compilation (empirical optimization); MILEPOST project: machine-learning-based self-tuning compiler adaptable to heterogeneous reconfigurable architectures (based on static and dynamic program features); Optimization sensitivity to datasets; Run-time adaptation for statically compiled programs (adaptive binaries); Recent projects: adaptive libraries (iterative compilation, dataset features and run-time adaptation), predictive code scheduling for heterogeneous (CPU/GPU-like) architectures; Statistical collective optimization; Collective Tuning Initiative; Conclusions and future work
58 Static multiversioning framework for dynamic optimizations Step 1 Statically compiled adaptive binaries and libraries: original hot function, Function Version 1, Function Version 2, ..., Function Version N. Iterative/collective compilation with multiple datasets
59 Static multiversioning framework for dynamic optimizations Step 2 Statically compiled adaptive binaries and libraries. Representative set of versions for the following optimization cases, to minimize execution time, power consumption and code size across all available datasets: optimizations for different datasets; optimizations/compilation for different architectures (heterogeneous or reconfigurable processors with different ISAs such as GPGPU, CELL, etc, or the same ISA with extensions such as 3dnow, SSE, etc, or virtual environments); optimizations for different program phases or different run-time environment behavior. Iterative/collective compilation with multiple datasets
60 Static multiversioning framework for dynamic optimizations Step 3 Statically compiled adaptive binaries and libraries. Extract dataset features. Selection mechanism optimized for low run-time overhead. Representative set of versions (as on the previous slide). Machine learning techniques to find a mapping between different run-time contexts and representative versions. Iterative/collective compilation with multiple datasets
61 Static multiversioning framework for dynamic optimizations Step X Statically compiled adaptive binaries and libraries. Extract dataset features. Monitor run-time behavior or architectural changes (in virtual, reconfigurable or heterogeneous environments) using timers or performance counters. Selection mechanism optimized for low run-time overhead. Representative set of versions (as on the previous slides). Machine learning techniques to find a mapping between different run-time contexts and representative versions. Iterative/collective compilation with multiple datasets (UNIDAPT)
62 Step 1: Iterative compilation with multiple datasets Open64/PathScale compiler with Interactive Compilation Interface, AMD64, random optimization selection: loop tiling (2..512); register tiling (2..8); loop unrolling (2..16); loop vectorization loop interchange; loop fusion; array prefetching (8..128)
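The random optimization selection above can be sketched as drawing one point per iterative-compilation run from the stated parameter space. Uniform sampling over the listed ranges, and treating the unparameterized transformations as on/off switches, is our assumption; the actual driver may constrain values, e.g. to powers of two:

```python
import random

# Parameter ranges taken from the slide; everything else about the
# sampling policy is an illustrative assumption.
SPACE = {
    "loop_tiling":        lambda r: r.randint(2, 512),
    "register_tiling":    lambda r: r.randint(2, 8),
    "loop_unrolling":     lambda r: r.randint(2, 16),
    "loop_vectorization": lambda r: r.choice([False, True]),
    "loop_interchange":   lambda r: r.choice([False, True]),
    "loop_fusion":        lambda r: r.choice([False, True]),
    "array_prefetching":  lambda r: r.randint(8, 128),
}

def sample_point(seed=None):
    """Draw one random point from the transformation space, to be passed
    to the compiler (here, conceptually, via the Interactive Compilation
    Interface) for a single iterative-compilation run."""
    rng = random.Random(seed)
    return {name: draw(rng) for name, draw in SPACE.items()}

point = sample_point(seed=42)
```

Each sampled point corresponds to one compiled version whose measured speedups per dataset feed the Step 2 selection algorithm.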
63 Step 2: Algorithm to select a representative optimization set. Main idea: simplify the life of developers by automating the selection of representative optimizations/datasets. The developer controls the overall performance gain/loss, the code size/number of versions and the quality of the mapping mechanism. The first version of the algorithm is greedy (expensive) and will be revisited in the future.
64 Step 2 (continued) [Table: datasets D1..DX (rows) vs code optimization variants (columns); s_ij is the speedup of variant j on dataset i]
65 Step 2 (continued) Calculate s_max = geometric mean of the best achievable speedups per dataset
66-67 Step 2 (continued) Find the code version with the maximum geometric mean of speedups across all datasets
68 Step 2 (continued) Add this version (c_0) to the representative set
69 Step 2 (continued) Calculate s_Rmax = geometric mean of the best speedups for each dataset using the representative set
70 Step 2 (continued) Check performance improvement/loss and code size explosion for the representative set
71 Step 2 (continued) If continuing, remove c_i and all datasets where c_i achieves the best speedup
72-73 Step 2 (continued) Find the code version with the maximum geometric mean of speedups across all remaining datasets
74 Step 2 (continued) Add this version (c_1) to the representative set
75 Step 2 (continued) Calculate s_Rmax = geometric mean of the best speedups for each dataset using the representative set
76 Step 2 (continued) Check performance improvement/loss and code size explosion for the representative set
77 Step 2 (continued) If continuing, remove c_i and all datasets where c_i achieves the best speedup
78 Step 2: Algorithm to select representative optimization set Main idea: Simplify life of developers by automating the process of representative optimization/dataset selection Developer controls overall performance gain/loss, code size/number of versions and quality of mapping mechanism First version is greedy (expensive) will be revisited in the future D 2 D 4 s 21 s 22 s 41 s 42 s max c o c 1 s Rmax and so on
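The greedy loop above can be written out concretely. The sketch below is illustrative stdlib Python: the function name and the `max_versions`/`min_gain` stopping thresholds are assumptions, not the authors' actual implementation.

```python
import math

def select_representative_set(speedups, max_versions=3, min_gain=0.01):
    """Greedy selection of a representative set of code versions.

    speedups[d][c] is the measured speedup of code version c on dataset d.
    Names and thresholds are illustrative, not from the original tool.
    """
    remaining = set(speedups)                        # datasets not yet covered
    versions = list(next(iter(speedups.values())))   # all code versions
    rep_set, prev_s_rmax = [], 0.0

    while remaining and len(rep_set) < max_versions:
        # 1) version with the best geometric mean of speedups
        #    across all remaining datasets
        def geomean(c):
            return math.exp(sum(math.log(speedups[d][c]) for d in remaining)
                            / len(remaining))
        best = max(versions, key=geomean)
        rep_set.append(best)

        # 2) s_Rmax: geometric mean over *all* datasets of the best
        #    speedup achievable with the current representative set
        s_rmax = math.exp(sum(
            math.log(max(speedups[d][c] for c in rep_set))
            for d in speedups) / len(speedups))

        # 3) stop when the gain no longer justifies another code version
        if s_rmax - prev_s_rmax < min_gain and len(rep_set) > 1:
            rep_set.pop()
            break
        prev_s_rmax = s_rmax

        # 4) drop the datasets on which the chosen version is already best
        remaining = {d for d in remaining
                     if max(speedups[d], key=speedups[d].get) != best}
    return rep_set
```

With three datasets where one version dominates two of them and another dominates the third, the loop picks both versions and stops once every dataset is covered.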
79 Step 3: Run-time version mapping mechanism
WEKA - machine learning suite supporting multiple standard techniques such as clustering, classification and regression
Direct classification (returns the most similar case from prior experience):
SMO - Support Vector Machine based
J48 - decision tree based
REPTree - decision tree based
JRip - rule based
PART - rule based
Ridor - rule based
Performance prediction model (probabilistic approach):
LeastMedSq - linear regression based
LinearRegression - linear regression based
PaceRegression - linear regression based
SMOreg - Support Vector Machine based
REPTree - decision tree based
M5Rules - rule based
The algorithms vary in applicability and complexity depending on the problem encountered.
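As a rough illustration of the direct-classification idea (return the most similar case from prior experience), here is a stdlib stand-in for the WEKA classifiers listed above: a 1-nearest-neighbour predictor mapping dataset features to the best-known code version. All names and the feature encoding are hypothetical.

```python
def train_direct_classifier(observations):
    """observations: list of (feature_vector, best_version) pairs collected
    during prior runs. Returns a 1-nearest-neighbour predictor, a stdlib
    stand-in for the WEKA classifiers (SMO, J48, ...) named on the slide."""
    def predict(features):
        # squared Euclidean distance between feature vectors
        dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
        _, version = min(observations,
                         key=lambda obs: dist(obs[0], features))
        return version
    return predict
```

For DGEMM-like kernels the feature could simply be the array dimension, as in the characterization used on the next slide.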
81 Error rates of different classification algorithms
Basic characterization: array dimension; 990 distinct datasets for DGEMM, 82 for FFT
Compared: direct classification vs. performance prediction model
Best performing algorithms are either decision tree or rule based
84 Outline → Predictive code scheduling for heterogeneous (CPU/GPU-like) architectures (brief)
85 Predictive code scheduling
Victor Jimenez, Isaac Gelado, Lluis Vilanova, Marisa Gil, Grigori Fursin and Nacho Navarro. Predictive runtime code scheduling for heterogeneous architectures. Proceedings of the International Conference on High Performance Embedded Architectures & Compilers (HiPEAC 2009), Paphos, Cyprus, January 2009
[The remaining slides in this section present the approach and results as figures.]
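The core idea named in the paper title, run each code region on the processing element with the smallest predicted completion time, can be sketched in a deliberately simplified form. The history/load bookkeeping and the linear scaling of past measurements below are assumptions, not the paper's actual predictor.

```python
def schedule(task_size, history, load):
    """Pick the processing element (e.g. 'cpu' or 'gpu') with the smallest
    predicted completion time. history[pe] maps a past task size to its
    measured execution time; load[pe] is the time already queued on pe.
    A simplified sketch of predictive runtime code scheduling."""
    def predicted_time(pe):
        # predict by linearly scaling the closest previous measurement
        size, time = min(history[pe].items(),
                         key=lambda kv: abs(kv[0] - task_size))
        return time * task_size / size
    return min(history, key=lambda pe: load[pe] + predicted_time(pe))
```

Note how accounting for the current load can send a task to the nominally slower device when the faster one is busy.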
89 Outline → Statistical collective optimization
90 Collective optimization framework
This framework relies only on execution time and statistical analysis. It does not require a specialized ML compiler or OS.
91 Probabilistic focused search
Probability distribution to select an optimization combination, based on continuous competition between combinations
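One way such a competition-driven distribution can be maintained is sketched below. The pairwise tournament and the `boost` update factor are illustrative choices, not the framework's exact update rule.

```python
import random

def focused_search(run, combinations, trials=100, boost=1.2, seed=0):
    """Probabilistic focused search sketch: optimization combinations
    compete in pairs and the selection distribution is biased toward
    winners. run(c) returns execution time (smaller is better)."""
    rng = random.Random(seed)
    weight = {c: 1.0 for c in combinations}
    for _ in range(trials):
        # draw two competitors according to the current distribution
        c1, c2 = rng.choices(list(weight), weights=weight.values(), k=2)
        if c1 == c2:
            continue
        winner = c1 if run(c1) < run(c2) else c2
        weight[winner] *= boost          # reward the faster combination
    total = sum(weight.values())
    return {c: w / total for c, w in weight.items()}
```

Over many runs (and, collectively, many users), probability mass concentrates on the combinations that keep winning comparisons.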
94 Maturation stages of a program
We would like to find a probability distribution of optimizations for a given program that maximizes performance across (all) datasets.
Stage 1: Program unknown. We leverage and apply optimizations suggested by the general experience collected over all programs. (Probability distribution d1)
Stage 2: Program unknown, a few runs only. Characterize the program behavior using the program reaction to optimizations [more details later]. (Probability distribution d2)
Stage 3: Program known, heavily used. We do not need the experience of other programs to select the most appropriate optimization combinations. Learning across datasets is important: use continuous competition between combinations. (Probability distribution d3)
97 Stage 2: Program known, a few runs only
Intuition: it is unlikely that all programs exhibit widely different behavior with respect to compiler optimizations (clustering).
We need to find similar programs to bias the distribution of optimizations, using some characterization:
static program features - need special tools
dynamic program features - may not be portable
Instead, we suggest using the program reaction to transformations, based on speedup only (portable).
[Table: pairwise comparison outcomes between optimization combinations (e.g. C1 > C2, C3 > C4, ...) recorded for each program P, P1, P2, P3.]
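The speedup-only characterization is cheap to implement. A minimal sketch of reaction vectors and their similarity, assuming only a `run(c)` timing function (the helper names are illustrative):

```python
def reaction_vector(run, pairs):
    """Characterize a program by its reaction to optimizations: for each
    pair (C1, C2) record whether C1 outperforms C2. run(c) returns the
    execution time under combination c, so no feature tools are needed."""
    return [run(c1) < run(c2) for c1, c2 in pairs]

def similarity(v1, v2):
    """Fraction of optimization pairs on which two programs react alike."""
    return sum(a == b for a, b in zip(v1, v2)) / len(v1)
```

Programs with similar reaction vectors can then share optimization distributions, without static or dynamic feature extraction.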
100 Selecting stages (meta-distribution)
Permanent competition between the different stages' distributions (d1, d2, d3):
1) All start with equal probability
2) Select d from (d1, d2, d3)
3) Select C1 and C2 based on probability distribution d
4) Run the program and compare C1 and C2
5) If C1 > C2 and the same holds according to d, the prediction was correct and we reward the associated distribution
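Steps 1-5 above can be sketched as a meta-level tournament between the stage distributions. The `reward` factor and the scoring scheme are illustrative assumptions, not the exact meta-distribution update.

```python
import random

def meta_select(stage_dists, run, trials=200, reward=1.1, seed=0):
    """Competition between stage distributions (e.g. d1, d2, d3).
    stage_dists[name] maps combinations to probabilities; run(c) gives
    execution time. A stage is rewarded when the faster of two sampled
    combinations is also the one it ranks higher."""
    rng = random.Random(seed)
    score = {name: 1.0 for name in stage_dists}
    for _ in range(trials):
        # 2) select a distribution d in proportion to its current score
        name, = rng.choices(list(score), weights=score.values())
        d = stage_dists[name]
        # 3) select C1 and C2 according to d
        c1, c2 = rng.choices(list(d), weights=d.values(), k=2)
        if c1 == c2:
            continue
        # 4) run and compare; 5) reward d if it predicted the winner
        faster = c1 if run(c1) < run(c2) else c2
        loser = c2 if faster == c1 else c1
        if d[faster] >= d[loser]:
            score[name] *= reward
    return max(score, key=score.get)
```

A stage whose ranking consistently agrees with measured speedups accumulates score and is consulted more often.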
101 Performance of collective optimization
d1 - improving the default compiler heuristic using collective knowledge
d2 - should improve with more collective knowledge
d3 - often better than collective but may change if more optimizations are available
102 Collective Tuning Initiative
Community-driven project to develop auto-tuning computing systems.
Developing common open-source R&D tools with unified APIs (universal compilers, computer architecture simulators, adaptive run-time systems) to optimize/parallelize programs and architectures collectively, using empirical iterative compilation, statistical and machine learning techniques.
Sharing useful optimization cases from the community for programs/libraries/OS (compiler optimizations/architecture configurations to improve execution time, code size, architecture size, power consumption, etc.) in the Collective Optimization Database.
Partially funded by HiPEAC, MILEPOST, Google Summer of Code, IBM, INRIA.
Requires lots of work - join the cTuning community!
105 Current and future work
New tools open many research opportunities and allow systematic empirical evaluation of computing systems:
analysis of program static and dynamic features to improve predictions
analysis of dataset features for better run-time adaptation
fine-grain program optimizations, polyhedral transformations and ML (supported by Google Summer of Code 2009)
detection of important cases for collective optimization
split compilation
adaptive fault-tolerance
statistical parallelization and adaptation for heterogeneous systems
106 Thank you for your attention
Collaborative R&D - partially funded by [sponsors shown as logos]
Contact and more information about research projects and software: [URLs omitted in transcription]
More informationCode Auto-Tuning with the Periscope Tuning Framework
Code Auto-Tuning with the Periscope Tuning Framework Renato Miceli, SENAI CIMATEC renato.miceli@fieb.org.br Isaías A. Comprés, TUM compresu@in.tum.de Project Participants Michael Gerndt, TUM Coordinator
More informationAccelerating Financial Applications on the GPU
Accelerating Financial Applications on the GPU Scott Grauer-Gray Robert Searles William Killian John Cavazos Department of Computer and Information Science University of Delaware Sixth Workshop on General
More informationRaced Profiles: Efficient Selection of Competing Compiler Optimizations
Raced Profiles: Efficient Selection of Competing Compiler Optimizations Abstract Many problems in embedded compilation require one set of optimizations to be selected over another based on run time performance.
More informationDirected Optimization On Stencil-based Computational Fluid Dynamics Application(s)
Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s) Islam Harb 08/21/2015 Agenda Motivation Research Challenges Contributions & Approach Results Conclusion Future Work 2
More informationAn Investigation into Value Profiling and its Applications
Manchester Metropolitan University BSc. (Hons) Computer Science An Investigation into Value Profiling and its Applications Author: Graham Markall Supervisor: Dr. Andy Nisbet No part of this project has
More informationTowards Machine Learning-Based Auto-tuning of MapReduce
3 IEEE st International Symposium on Modelling, Analysis & Simulation of Computer and Telecommunication Systems Towards Machine Learning-Based Auto-tuning of MapReduce Nezih Yigitbasi, Theodore L. Willke,
More informationA Test Suite for High-Performance Parallel Java
page 1 A Test Suite for High-Performance Parallel Java Jochem Häuser, Thorsten Ludewig, Roy D. Williams, Ralf Winkelmann, Torsten Gollnick, Sharon Brunett, Jean Muylaert presented at 5th National Symposium
More informationCollective mind: Towards practical and collaborative auto-tuning
Scientific Programming 22 (2014) 309 329 309 DOI 10.3233/SPR-140396 IOS Press Collective mind: Towards practical and collaborative auto-tuning Grigori Fursin a,, Renato Miceli b, Anton Lokhmotov c, Michael
More informationKismet: Parallel Speedup Estimates for Serial Programs
Kismet: Parallel Speedup Estimates for Serial Programs Donghwan Jeon, Saturnino Garcia, Chris Louie, and Michael Bedford Taylor Computer Science and Engineering University of California, San Diego 1 Questions
More informationArchitecture Tuning Study: the SimpleScalar Experience
Architecture Tuning Study: the SimpleScalar Experience Jianfeng Yang Yiqun Cao December 5, 2005 Abstract SimpleScalar is software toolset designed for modeling and simulation of processor performance.
More informationUnifying Big Data Workloads in Apache Spark
Unifying Big Data Workloads in Apache Spark Hossein Falaki @mhfalaki Outline What s Apache Spark Why Unification Evolution of Unification Apache Spark + Databricks Q & A What s Apache Spark What is Apache
More informationWorkloads Programmierung Paralleler und Verteilter Systeme (PPV)
Workloads Programmierung Paralleler und Verteilter Systeme (PPV) Sommer 2015 Frank Feinbube, M.Sc., Felix Eberhardt, M.Sc., Prof. Dr. Andreas Polze Workloads 2 Hardware / software execution environment
More informationExecuting Legacy Applications on a Java Operating System
Executing Legacy Applications on a Java Operating System Andreas Gal, Michael Yang, Christian Probst, and Michael Franz University of California, Irvine {gal,mlyang,probst,franz}@uci.edu May 30, 2004 Abstract
More informationATOS introduction ST/Linaro Collaboration Context
ATOS introduction ST/Linaro Collaboration Context Presenter: Christian Bertin Development team: Rémi Duraffort, Christophe Guillon, François de Ferrière, Hervé Knochel, Antoine Moynault Consumer Product
More informationTrends in HPC (hardware complexity and software challenges)
Trends in HPC (hardware complexity and software challenges) Mike Giles Oxford e-research Centre Mathematical Institute MIT seminar March 13th, 2013 Mike Giles (Oxford) HPC Trends March 13th, 2013 1 / 18
More informationAutomatic Selection of GCC Optimization Options Using A Gene Weighted Genetic Algorithm
Automatic Selection of GCC Optimization Options Using A Gene Weighted Genetic Algorithm San-Chih Lin, Chi-Kuang Chang, Nai-Wei Lin National Chung Cheng University Chiayi, Taiwan 621, R.O.C. {lsch94,changck,naiwei}@cs.ccu.edu.tw
More informationAdministration. Prerequisites. Meeting times. CS 380C: Advanced Topics in Compilers
Administration CS 380C: Advanced Topics in Compilers Instructor: eshav Pingali Professor (CS, ICES) Office: POB 4.126A Email: pingali@cs.utexas.edu TA: TBD Graduate student (CS) Office: Email: Meeting
More informationRechnerarchitektur (RA)
12 Rechnerarchitektur (RA) Sommersemester 2017 Architecture-Aware Optimizations -Software Optimizations- Jian-Jia Chen Informatik 12 Jian-jia.chen@tu-.. http://ls12-www.cs.tu.de/daes/ Tel.: 0231 755 6078
More informationLocality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives
Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives José M. Andión, Manuel Arenaz, François Bodin, Gabriel Rodríguez and Juan Touriño 7th International Symposium on High-Level Parallel
More informationPiecewise Holistic Autotuning of Compiler and Runtime Parameters
Piecewise Holistic Autotuning of Compiler and Runtime Parameters Mihail Popov, Chadi Akel, William Jalby, Pablo de Oliveira Castro University of Versailles Exascale Computing Research August 2016 C E R
More informationPerformance measurement. SMD149 - Operating Systems - Performance and processor design. Introduction. Important trends affecting performance issues
Performance measurement SMD149 - Operating Systems - Performance and processor design Roland Parviainen November 28, 2005 Performance measurement Motivation Techniques Common metrics Processor architectural
More informationDealing with Heterogeneous Multicores
Dealing with Heterogeneous Multicores François Bodin INRIA-UIUC, June 12 th, 2009 Introduction Main stream applications will rely on new multicore / manycore architectures It is about performance not parallelism
More informationParallel Algorithm Engineering
Parallel Algorithm Engineering Kenneth S. Bøgh PhD Fellow Based on slides by Darius Sidlauskas Outline Background Current multicore architectures UMA vs NUMA The openmp framework and numa control Examples
More informationEvaluating the Effects of Compiler Optimisations on AVF
Evaluating the Effects of Compiler Optimisations on AVF Timothy M. Jones, Michael F.P. O Boyle Member of HiPEAC, School of Informatics University of Edinburgh, UK {tjones1,mob}@inf.ed.ac.uk Oğuz Ergin
More informationStatistical Performance Comparisons of Computers
Tianshi Chen 1, Yunji Chen 1, Qi Guo 1, Olivier Temam 2, Yue Wu 1, Weiwu Hu 1 1 State Key Laboratory of Computer Architecture, Institute of Computing Technology (ICT), Chinese Academy of Sciences, Beijing,
More informationExecution-based Prediction Using Speculative Slices
Execution-based Prediction Using Speculative Slices Craig Zilles and Guri Sohi University of Wisconsin - Madison International Symposium on Computer Architecture July, 2001 The Problem Two major barriers
More informationAutomatic Algorithm Configuration based on Local Search
Automatic Algorithm Configuration based on Local Search Frank Hutter 1 Holger Hoos 1 Thomas Stützle 2 1 Department of Computer Science University of British Columbia Canada 2 IRIDIA Université Libre de
More informationHSA Foundation! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar Room (Bld 20)! 15 December, 2017!
Advanced Topics on Heterogeneous System Architectures HSA Foundation! Politecnico di Milano! Seminar Room (Bld 20)! 15 December, 2017! Antonio R. Miele! Marco D. Santambrogio! Politecnico di Milano! 2
More informationENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design
ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University
More informationSoftware Ecosystem for Arm-based HPC
Software Ecosystem for Arm-based HPC CUG 2018 - Stockholm Florent.Lebeau@arm.com Ecosystem for HPC List of components needed: Linux OS availability Compilers Libraries Job schedulers Debuggers Profilers
More information