
1 Collective optimization, run-time adaptation and machine learning Grigori Fursin UNIDAPT / ALCHEMY Group INRIA Saclay, France HiPEAC member

2 My background B.S. in Physics, M.S. in Computer Engineering from Moscow Institute of Physics & Technology, Russia (1999) Ph.D. from the University of Edinburgh, UK ( ) iterative compilation and performance prediction (advisor: Prof. Michael O'Boyle) Postdoctoral researcher at INRIA Futurs, France ( -2007) machine learning for optimization knowledge reuse, run-time adaptation for programs with multiple datasets and heterogeneous systems, architectural design space exploration (working with groups of Prof. Olivier Temam and Prof. Michael O'Boyle) Tenured research scientist at INRIA Saclay, France (2007-now) self-tuning computing systems based on statistical collective optimization, machine learning and run-time adaptation (building a research group, developing HiPEAC common research compiler based on GCC, developing Collective Tuning Center)

3 Outline: Motivation; Prerequisites; Iterative feedback-directed compilation (empirical optimization); MILEPOST project: machine-learning based self-tuning compiler adaptable to heterogeneous reconfigurable architectures (based on static and dynamic program features); Optimization sensitivity to datasets; Run-time adaptation for statically compiled programs (adaptive binaries); Recent projects: Adaptive libraries (iterative compilation, dataset features and run-time adaptation), Predictive code scheduling for heterogeneous (CPU/GPU-like) architectures, Statistical collective optimization; Collective Tuning Initiative; Conclusions and future work

4 Motivation Continuing innovation in science and technology requires increasing computing resources while placing strict requirements on system performance, power consumption, size, response, reliability, portability and design time. High-performance computing systems tend to evolve toward complex heterogeneous multi-core systems, dramatically increasing design & optimization time.

5 Motivation Continuing innovation in science and technology requires increasing computing resources while placing strict requirements on system performance, power consumption, size, response, reliability, portability and design time. High-performance computing systems tend to evolve toward complex heterogeneous multi-core systems, dramatically increasing design & optimization time. Optimizing compilers play a key role in producing executable codes quickly and automatically while satisfying all the above requirements for a broad range of programs and architectures.

6 Motivation Developing and tuning current compilers for rapidly evolving architectures is a tedious and time-consuming process. Current state-of-the-art compilers and optimizers often fail to deliver best performance: hardwired ad-hoc optimization heuristics (cost models) for rapidly evolving hardware; large irregular optimization spaces; interaction between optimizations, order of optimizations; difficult to add new transformations to already tuned optimization heuristics; inability to reuse optimization knowledge among different programs and architectures; lack of run-time information and inability to adapt to varying program and system behavior (or dataset) at run-time with low overhead

7 Motivation Developing and tuning current compilers for rapidly evolving architectures is a tedious and time-consuming process. Current state-of-the-art compilers and optimizers often fail to deliver best performance: hardwired ad-hoc optimization heuristics (cost models) for rapidly evolving hardware; large irregular optimization spaces; interaction between optimizations, order of optimizations; difficult to add new transformations to already tuned optimization heuristics; inability to reuse optimization knowledge among different programs and architectures; lack of run-time information and inability to adapt to varying program and system behavior (or dataset) at run-time with low overhead. Need universal self-tuning compilers (architectures and run-time systems) that can continuously and automatically adapt to any heterogeneous architecture and learn how to optimize programs

8 Outline: Motivation; Prerequisites; Iterative feedback-directed compilation (empirical optimization); MILEPOST project: machine-learning based self-tuning compiler adaptable to heterogeneous reconfigurable architectures (based on static and dynamic program features); Optimization sensitivity to datasets; Run-time adaptation for statically compiled programs (adaptive binaries); Recent projects: Adaptive libraries (iterative compilation, dataset features and run-time adaptation), Predictive code scheduling for heterogeneous (CPU/GPU-like) architectures, Statistical collective optimization; Collective Tuning Initiative; Conclusions and future work

9 Iterative compilation (first steps) Optimization spaces (set of all possible program transformations) are large, non-linear with many local minima. Finding a good solution may be long and non-trivial: matmul, 2 transformations, search space = 2000; swim, 3 transformations, search space = . Potential solution - iterative feedback-directed compilation: learn program behavior across executions (combination of global flags, orders of optimization phases, cost-model tuning for individual transformations - meta-optimization). High potential

10 Iterative compilation (first steps) Optimization spaces (set of all possible program transformations) are large, non-linear with many local minima. Finding a good solution may be long and non-trivial: matmul, 2 transformations, search space = 2000; swim, 3 transformations, search space = . Potential solution - iterative feedback-directed compilation: learn program behavior across executions (combination of global flags, orders of optimization phases, cost-model tuning for individual transformations - meta-optimization). High potential but: - slow - often the same dataset is used - often no run-time adaptation - no optimization knowledge reuse

11 Iterative compilation (first steps) Optimization spaces (set of all possible program transformations) are large, non-linear with many local minima. Finding a good solution may be long and non-trivial: matmul, 2 transformations, search space = 2000; swim, 3 transformations, search space = . Potential solution - iterative feedback-directed compilation: learn program behavior across executions (combination of global flags, orders of optimization phases, cost-model tuning for individual transformations - meta-optimization). High potential but: - slow - often the same dataset is used - often no run-time adaptation - no optimization knowledge reuse. Solving these problems is non-trivial

12 Iterative compilation (uniform random) Systematic optimization exploration (speedup per benchmark): bitcount, susan_c, susan_e, susan_s, jpeg_c, jpeg_d, dijkstra, patricia, blowfish_d, blowfish_e, rijndael_d, rijndael_e, sha, adpcm_c, adpcm_d, CRC32, gsm, qsort1, stringsearch1. AMD - a cluster with 16 AMD Athlon processors running at 2.4GHz; IA32 - a cluster with 4 Intel Xeon processors running at 2.8GHz; IA64 - a server with an Itanium2 processor running at 1.3GHz. Traditional uniform random iterative search (GCC/Open64 global compiler flags): 500 random combinations of flags and associated passes (~100 optimizations). Can obtain high speedups, but very slow
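
To make the search concrete, below is a minimal sketch (Python) of such a uniform random search driver. The flag list, the bench.c benchmark and the wall-clock timing are illustrative assumptions, not the exact experimental setup used here:

```python
import random
import subprocess
import time

FLAGS = ["-funroll-loops", "-ftree-vectorize", "-fomit-frame-pointer",
         "-finline-functions", "-fpeel-loops", "-ftracer"]   # ~100 in practice

def evaluate(flags):
    """Compile bench.c with the given flags and return its wall-clock time."""
    subprocess.run(["gcc", "-O1"] + flags + ["-o", "bench", "bench.c"],
                   check=True)
    start = time.time()
    subprocess.run(["./bench"], check=True)
    return time.time() - start

baseline = evaluate([])                    # reference point
best_flags, best_time = [], baseline
for _ in range(500):                       # 500 random combinations
    candidate = random.sample(FLAGS, random.randint(1, len(FLAGS)))
    t = evaluate(candidate)
    if t < best_time:
        best_flags, best_time = candidate, t
print("speedup:", baseline / best_time, "flags:", best_flags)
```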

13 Iterative compilation (uniform random) Optimization space is not trivial

14 Iterative compilation (uniform random) susan_corners on AMD Athlon64: GCC with ~100 flags vs -O3. Multi-objective optimizations (also power, reliability, etc). Combinations of optimizations matter. Important to balance performance/code size, particularly for embedded applications

15 Iterative compilation (focused probabilistic search) Focused probabilistic search (similar to GA search) (SUIF source-to-source + Intel compiler, 80 different transformations). Faster than traditional search (~50 iterations). Can get stuck in local minima. B. Franke, M. O'Boyle, J. Thomson and G. Fursin. Probabilistic Source-Level Optimisation of Embedded Systems Software. Proceedings of the Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES'05), pages 78-86, Chicago, IL, USA, June 2005

16 Outline: Motivation; Prerequisites; Iterative feedback-directed compilation (empirical optimization); MILEPOST project: machine-learning based self-tuning compiler adaptable to heterogeneous reconfigurable architectures (based on static and dynamic program features); Optimization sensitivity to datasets; Run-time adaptation for statically compiled programs (adaptive binaries); Recent projects: Adaptive libraries (iterative compilation, dataset features and run-time adaptation), Predictive code scheduling for heterogeneous (CPU/GPU-like) architectures, Statistical collective optimization; Collective Tuning Initiative; Conclusions and future work

17 MILEPOST project Machine Learning for Embedded Programs Optimization. Objective is to develop compiler technology that can automatically learn how to best optimize programs for re-configurable heterogeneous embedded processors (with run-time adaptation) and dramatically reduce the time to market. Partners: INRIA, University of Edinburgh, IBM, ARC, CAPS Enterprise. Developed techniques and software are publicly available and will hopefully influence future compiler development

18 Interactive Compilation Interface GCC Detect optimization flags GCC Controller (Pass Manager) Pass 1... Pass N GCC Data Layer AST, CFG, CF, etc

19 Interactive Compilation Interface GCC with ICI Detect optimization flags ICI IC Event GCC Controller (Pass Manager) IC Event Interactive Compilation Interface Pass 1... Pass N IC Event GCC Data Layer AST, CFG, CF, etc IC Data

20 Interactive Compilation Interface GCC with ICI High-level scripting (java, python, etc) Detect optimization flags IC Event GCC Controller (Pass Manager) IC Event ICI Interactive Compilation Interface IC Plugins <Dynamically linked shared libraries> Selecting pass sequences... Extracting static program features Pass 1... Pass N IC Event GCC Data Layer AST, CFG, CF, etc IC Data

21 Interactive Compilation Interface GCC with ICI High-level scripting (java, python, etc) Detect optimization flags IC Event GCC Controller (Pass Manager) Pass 1... GCC Data Layer AST, CFG, CF, etc IC Event Pass N IC Event IC Data ICI Interactive Compilation Interface IC Plugins <Dynamically linked shared libraries> Selecting pass sequences... Extracting static program features CCC Continuous Collective Compilation Framework ML drivers to optimize programs and tune compiler optimization heuristic

22 MILEPOST framework Training: Program 1 … Program N MILEPOST GCC (with ICI and ML routines) IC Plugins: Recording pass sequences, Extracting static program features

23 MILEPOST framework Training: Program 1 … Program N MILEPOST GCC (with ICI and ML routines) IC Plugins: Recording pass sequences, Extracting static program features CCC Continuous Collective Compilation Framework: Drivers for iterative compilation and model training

24 MILEPOST framework Training: Program 1 … Program N MILEPOST GCC (with ICI and ML routines) IC Plugins: Recording pass sequences, Extracting static program features CCC Continuous Collective Compilation Framework: Drivers for iterative compilation and model training. Deployment: New program MILEPOST GCC Extracting static program features Selecting good passes Predicting good passes to improve exec. time, code size and comp. time. Grigori Fursin, Cupertino Miranda, Olivier Temam, Mircea Namolaru, Elad Yom-Tov, Ayal Zaks, Bilha Mendelson, Phil Barnard, Elton Ashton, Eric Courtois, Francois Bodin, Edwin Bonilla, John Thomson, Hugh Leather, Chris Williams, Michael O'Boyle. MILEPOST GCC: machine learning based research compiler. Proceedings of the GCC Developers' Summit, Ottawa, Canada, June 2008 (based on CGO'06 and HiPEAC'05 papers)

25 Machine learning (static features) Program characterization based on static features. MILEPOST GCC feature extractor (IBM Haifa & INRIA): ft1 - Number of basic blocks in the method; ft19 - Number of direct calls in the method; ft20 - Number of conditional branches in the method; ft21 - Number of assignment instructions in the method; ft22 - Number of binary integer operations in the method; ft23 - Number of binary floating point operations in the method; ft24 - Number of instructions in the method; ft25 - Average number of instructions in basic blocks; ft54 - Number of local variables that are pointers in the method; ft55 - Number of static/extern variables that are pointers in the method. How constructed: human intuition. On-going work on feature generation and analysis. Critical to be able to make good predictions!

26 Machine learning (static features) 14 transformations, sequences of length 5, search space = . Predict best transformations from this space based on program features and previous optimization experience. Off-line training: focused search. Predicting best transformations for a new program: static features, nearest neighbors classifier. F. Agakov, E. Bonilla, J. Cavazos, B. Franke, G. Fursin, M.F.P. O'Boyle, J. Thomson, M. Toussaint and C.K.I. Williams. Using Machine Learning to Focus Iterative Optimization. Proceedings of the 4th Annual International Symposium on Code Generation and Optimization (CGO), New York, NY, USA, March 2006
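
The prediction step can be illustrated with a small sketch: a new program's static feature vector is compared against previously characterized programs, and the best sequence found for the nearest one seeds the search. All feature values and sequences below are invented, and real setups normalize features first:

```python
import math

# off-line training data (invented): static feature vector of each program
# and the best transformation sequence found for it by focused search
training = {
    "prog_a": ([120, 14, 35, 210, 2.3], ["unroll", "tile", "fuse"]),
    "prog_b": ([40, 3, 9, 55, 1.1], ["vectorize", "interchange"]),
}

def distance(f1, f2):
    # Euclidean distance; real setups normalize features first
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f1, f2)))

def predict(new_features):
    features, sequence = min(training.values(),
                             key=lambda rec: distance(rec[0], new_features))
    return sequence  # seed the iterative search with this sequence

print(predict([115, 12, 30, 200, 2.0]))  # -> sequence learned for prog_a
```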

27 Machine learning (dynamic features) Performance counter values for 181.mcf compiled with -O0 relative to the average values for the entire set of benchmark suites (SPECFP, SPECINT, MiBench, Polyhedron). John Cavazos, Grigori Fursin, Felix Agakov, Edwin Bonilla, Michael F.P. O'Boyle and Olivier Temam. Rapidly Selecting Good Compiler Optimizations using Performance Counters. Proceedings of the 5th Annual International Symposium on Code Generation and Optimization (CGO), San Jose, USA, March 2007

28 Machine learning (dynamic features) Performance counter values for 181.mcf compiled with -O0 relative to the average values for the entire set of benchmark suites (SPECFP, SPECINT, MiBench, Polyhedron). Problem: greater number of memory accesses per instruction than average

29 Machine learning (dynamic features) Performance counter values for 181.mcf compiled with -O0 relative to the average values for the entire set of benchmark suites (SPECFP, SPECINT, MiBench, Polyhedron). Solving all performance issues one by one is slow and can be inefficient due to their non-linear dependencies

30 Machine learning (dynamic features) Performance counter values for 181.mcf compiled with -O0 relative to the average values for the entire set of benchmark suites (SPECFP, SPECINT, MiBench, Polyhedron). Solving all performance issues one by one is slow and can be inefficient due to their non-linear dependencies. CONSIDER ALL PERFORMANCE ISSUES AT THE SAME TIME!

31 Detecting influential counters Principal Component Analysis: most informative performance counters: 1) L1_TCA 2) L1_DCH 3) TLB_DM 4) BR_INS 5) RES_STL 6) TOT_CYC 7) L2_ICH 8) VEC_INS 9) L2_DCH 10) L2_TCA 11) L1_DCA 12) HW_INT 13) L2_TCH 14) L1_TCH 15) BR_MS. Analysis of the importance of the performance counters. The data contains one good optimization sequence per benchmark. Calculating mutual information between a subset of the performance counters and good optimization sequences
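
The mutual-information step can be sketched as a plain computation over binned counter values, one label (the best sequence's id) per benchmark; all numbers below are invented for illustration:

```python
from collections import Counter
import math

def mutual_information(xs, ys):
    """I(X;Y) in bits, estimated from paired samples."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# one row per benchmark: binned L1_TCA value and id of its best sequence
l1_tca_bins = [0, 0, 1, 2, 2, 1, 0, 2]
best_seq_ids = [3, 3, 1, 0, 0, 1, 3, 0]
print(mutual_information(l1_tca_bins, best_seq_ids))  # rank counters by this
```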

32 Outline: Motivation; Prerequisites; Iterative feedback-directed compilation (empirical optimization); MILEPOST project: machine-learning based self-tuning compiler adaptable to heterogeneous reconfigurable architectures (based on static and dynamic program features); Optimization sensitivity to datasets; Run-time adaptation for statically compiled programs (adaptive binaries); Recent projects: Adaptive libraries (iterative compilation, dataset features and run-time adaptation), Predictive code scheduling for heterogeneous (CPU/GPU-like) architectures, Statistical collective optimization; Collective Tuning Initiative; Conclusions and future work

33 Optimization sensitivity to datasets MiBench, 20 datasets per benchmark, 200/1000 random combinations of Open64 (GCC) compiler flags, 5 months of experiments: dijkstra (not sensitive), jpeg_d (dataset sensitive). Grigori Fursin, John Cavazos, Michael O'Boyle and Olivier Temam. MiDataSets: Creating The Conditions For A More Realistic Evaluation of Iterative Optimization. Proceedings of HiPEAC 2007, Ghent, Belgium, January 2007

34 Optimization sensitivity to datasets MiBench, 20 datasets per benchmark, 200/1000 random combinations of Open64 (GCC) compiler flags, 5 months of experiments: dijkstra (not sensitive), jpeg_d (clustering). Grigori Fursin, John Cavazos, Michael O'Boyle and Olivier Temam. MiDataSets: Creating The Conditions For A More Realistic Evaluation of Iterative Optimization. Proceedings of HiPEAC 2007, Ghent, Belgium, January 2007

35 Outline: Motivation; Prerequisites; Iterative feedback-directed compilation (empirical optimization); MILEPOST project: machine-learning based self-tuning compiler adaptable to heterogeneous reconfigurable architectures (based on static and dynamic program features); Optimization sensitivity to datasets; Run-time adaptation for statically compiled programs (adaptive binaries); Recent projects: Adaptive libraries (iterative compilation, dataset features and run-time adaptation), Predictive code scheduling for heterogeneous (CPU/GPU-like) architectures, Statistical collective optimization; Collective Tuning Initiative; Conclusions and future work

36 Run-time adaptation Why adapt at run-time? Different program context; different run-time behavior; different system load; different available resources; different power consumption; different architectures & ISA. For each case we want to find and use the best optimization settings!

37 Current methods Some existing solutions: Application Dataset 1 Compiler Binary Output 1

38 Current methods Some existing solutions: Application Dataset 1 Compiler Binary Dynamic optimizations Output 1

39 Current methods Some existing solutions: Application Dataset 1 Compiler Binary Dynamic optimizations Output 1 Pros: run-time information, potentially more than one dataset

40 Current methods Some existing solutions: Application Dataset 1 Compiler Binary Dynamic optimizations Output 1 Pros: run-time information, potentially more than one dataset Cons: restrictions on optimization time, simple optimizations

41 Current methods Some existing solutions: Application Dataset 1 Iterative optimizations Compiler Binary Output 1 Dynamic optimizations Pros: run-time information, potentially more than one dataset Cons: restrictions on optimization time, simple optimizations

42 Current methods Some existing solutions: Application Dataset 1 Iterative optimizations Compiler Binary Output 1 Dynamic optimizations Pros: powerful transformation space exploration Pros: run-time information, potentially more than one dataset Cons: restrictions on optimization time, simple optimizations

43 Current methods Some existing solutions: Application Dataset 1 Iterative optimizations Compiler Binary Output 1 Dynamic optimizations Pros: powerful transformation space exploration Cons: slow, one dataset Pros: run-time information, potentially more than one dataset Cons: restrictions on optimization time, simple optimizations

44 Current methods Can we combine both? Application Dataset 1 Iterative optimizations Compiler Binary Output 1 Dynamic optimizations. Static code multiversioning and run-time adaptation: powerful transformation space exploration + run-time information = self-tuning adaptive binaries

45 Our approach: static multiversioning Application Select the most time-consuming code sections

46 Our approach: static multiversioning Application Create multi-versions of time-consuming code sections

47 Our approach: static multiversioning Application adapt_start adapt_start adapt_stop adapt_stop Add adaptation routines (depends on the task)

48 Our approach: static multiversioning Transformations Application adapt_start adapt_start adapt_stop adapt_stop Apply various transformations over multi-versions of code sections

49 Our approach: static multiversioning Select global or fine-grain internal compiler optimizations Transformations Application adapt_start adapt_start adapt_stop adapt_stop Apply various transformations over multi-versions of code sections

50 Our approach: static multiversioning Transformations Application adapt_start adapt_start adapt_stop adapt_stop Apply various transformations over multi-versions of code sections

51 Our approach: static multiversioning Transformations Different ISA; manual transformations, etc Application adapt_start adapt_start adapt_stop adapt_stop Apply various transformations over multi-versions of code sections

52 Our approach: static multiversioning Final instrumented program Application adapt_start adapt_start adapt_stop adapt_stop

53 Run-time program adaptation Execution times for subroutine resid of benchmark mgrid across function calls (time in sec): startup (phase detection) or end of the optimization process (best option found); evaluation of 1 option. 1) Consider a new optimization option evaluated after 2 consecutive executions of the code section with the same performance 2) Ignore one next execution to avoid transitional effects 3) Check baseline performance (to verify stability prediction)
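
A minimal sketch of how such an instrumented section could behave at run time, applying the three rules above; it assumes timings within 5% count as "the same performance", and version bodies are placeholders:

```python
import time

class Adapter:
    """Sketch of one instrumented hot section with several versions."""
    def __init__(self, versions):
        self.versions = versions           # candidate code versions
        self.current = 0                   # version under evaluation
        self.times = []                    # consecutive timings of current
        self.skip_next = False             # rule 2: ignore transitional call
        self.best, self.best_time = 0, float("inf")

    def call(self, *args):
        if self.skip_next:                 # rule 2: untimed transitional run
            self.skip_next = False
            return self.versions[self.best](*args)
        start = time.perf_counter()
        result = self.versions[self.current](*args)
        self.times.append(time.perf_counter() - start)
        if len(self.times) >= 2 and \
           abs(self.times[-1] - self.times[-2]) < 0.05 * self.times[-2]:
            t = (self.times[-1] + self.times[-2]) / 2   # rule 1: evaluated
            if t < self.best_time:
                self.best, self.best_time = self.current, t
            # rule 3 would periodically re-time the original version here
            self.current = (self.current + 1) % len(self.versions)
            self.times.clear()
            self.skip_next = True
        return result

# usage (hypothetical versions): Adapter([resid_v0, resid_v1]).call(grid)
```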

54 Run-time program adaptation (time in sec vs function calls) Continuous adaptation: saving prediction info after execution

55 Run-time program adaptation (time in sec vs function calls) Continuous adaptation: preloading prediction info before execution. Grigori Fursin, Albert Cohen, Michael O'Boyle and Olivier Temam. A Practical Method For Quickly Evaluating Program Optimizations. Proceedings of the 1st International Conference on High Performance Embedded Architectures & Compilers (HiPEAC 2005), number 3793 in LNCS, pages 29-46, Barcelona, Spain, November 2005

56 Possible usage Create static binaries and libraries adaptable to different program and architecture behavior (also split-compilation). Generate mixed multiple-ISA code with run-time adaptation for heterogeneous systems (CPU/GPU or CELL architectures). Determine the effect of optimizations at run-time for programs with varying datasets without a reference (randomly selecting versions at run-time). Statistical collective optimization

57 Outline: Motivation; Prerequisites; Iterative feedback-directed compilation (empirical optimization); MILEPOST project: machine-learning based self-tuning compiler adaptable to heterogeneous reconfigurable architectures (based on static and dynamic program features); Optimization sensitivity to datasets; Run-time adaptation for statically compiled programs (adaptive binaries); Recent projects: Adaptive libraries (iterative compilation, dataset features and run-time adaptation), Predictive code scheduling for heterogeneous (CPU/GPU-like) architectures, Statistical collective optimization; Collective Tuning Initiative; Conclusions and future work

58 Static multiversioning framework for dynamic optimizations Step 1 Statically compiled adaptive binaries and libraries: Original hot function, Function Version 1, Function Version 2, …, Function Version N. Iterative/collective compilation with multiple datasets

59 Static multiversioning framework for dynamic optimizations Step 2 Statically compiled adaptive binaries and libraries: Original hot function, Function Version 1, Function Version 2, …, Function Version N. Representative set of versions for the following optimization cases to minimize execution time, power consumption and code size across all available datasets: optimizations for different datasets; optimizations/compilation for different architectures (heterogeneous or reconfigurable processors with different ISA such as GPGPU, CELL, etc, or the same ISA with extensions such as 3dnow, SSE, etc, or virtual environments); optimizations for different program phases or different run-time environment behavior. Iterative/collective compilation with multiple datasets

60 Static multiversioning framework for dynamic optimizations Step 3 Statically compiled adaptive binaries and libraries: Extract dataset features. Selection mechanism optimized for low run-time overhead. Original hot function, Function Version 1, Function Version 2, …, Function Version N. Representative set of versions for the following optimization cases to minimize execution time, power consumption and code size across all available datasets: optimizations for different datasets; optimizations/compilation for different architectures (heterogeneous or reconfigurable processors with different ISA such as GPGPU, CELL, etc, or the same ISA with extensions such as 3dnow, SSE, etc, or virtual environments); optimizations for different program phases or different run-time environment behavior. Machine learning techniques to find mapping between different run-time contexts and representative versions. Iterative/collective compilation with multiple datasets

61 Static multiversioning framework for dynamic optimizations Step X Statically compiled adaptive binaries and libraries: Extract dataset features. Monitor run-time behavior or architectural changes (in virtual, reconfigurable or heterogeneous environments) using timers or performance counters. Selection mechanism optimized for low run-time overhead. Original hot function, Function Version 1, Function Version 2, …, Function Version N. Representative set of versions for the following optimization cases to minimize execution time, power consumption and code size across all available datasets: optimizations for different datasets; optimizations/compilation for different architectures (heterogeneous or reconfigurable processors with different ISA such as GPGPU, CELL, etc, or the same ISA with extensions such as 3dnow, SSE, etc, or virtual environments); optimizations for different program phases or different run-time environment behavior. Machine learning techniques to find mapping between different run-time contexts and representative versions. Iterative/collective compilation with multiple datasets. UNIDAPT

62 Step 1: Iterative compilation with multiple datasets Open64/PathScale compiler with Interactive Compilation Interface, AMD64, random optimization selection: loop tiling (2..512); register tiling (2..8); loop unrolling (2..16); loop vectorization; loop interchange; loop fusion; array prefetching (8..128)
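
A sketch of drawing one random point from this parameterized transformation space; the encoding of a combination (a dict of named transformations, with None meaning "not applied") is an illustrative assumption:

```python
import random

def random_combination():
    """Draw one random point from the parameterized transformation space."""
    return {
        "loop_tiling": random.choice([None] + [2 ** k for k in range(1, 10)]),
        "register_tiling": random.choice([None] + list(range(2, 9))),
        "loop_unrolling": random.choice([None] + list(range(2, 17))),
        "loop_vectorization": random.random() < 0.5,
        "loop_interchange": random.random() < 0.5,
        "loop_fusion": random.random() < 0.5,
        "array_prefetching": random.choice([None] + [2 ** k for k in range(3, 8)]),
    }

print(random_combination())
```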

63 Step 2: Algorithm to select representative optimization set Main idea: simplify life of developers by automating the process of representative optimization/dataset selection. Developer controls overall performance gain/loss, code size/number of versions and quality of mapping mechanism

64 Step 2: Algorithm to select representative optimization set Main idea: simplify life of developers by automating the process of representative optimization/dataset selection. Developer controls overall performance gain/loss, code size/number of versions and quality of mapping mechanism. First version is greedy (expensive); will be revisited in the future. Speedup matrix: datasets D_1, D_2, D_3, D_4, D_5, …, D_X; each dataset D_i has speedups s_i1, s_i2, …, s_io over the code (optimization variants)

65 Step 2: Algorithm to select representative optimization set Main idea: simplify life of developers by automating the process of representative optimization/dataset selection. Developer controls overall performance gain/loss, code size/number of versions and quality of mapping mechanism. First version is greedy (expensive); will be revisited in the future. Speedup matrix: datasets D_1, D_2, D_3, D_4, D_5, …, D_X; each dataset D_i has speedups s_i1, s_i2, …, s_io. Calculate s_max = geometric mean of the best achievable speedups

66 Step 2: Algorithm to select representative optimization set Main idea: simplify life of developers by automating the process of representative optimization/dataset selection. Developer controls overall performance gain/loss, code size/number of versions and quality of mapping mechanism. First version is greedy (expensive); will be revisited in the future. Speedup matrix: datasets D_1, D_2, D_3, D_4, D_5, …, D_X; speedups s_i1, s_i2, …, s_io; s_max. Find the code version with a max geometric mean of speedups across all datasets

67 Step 2: Algorithm to select representative optimization set Main idea: simplify life of developers by automating the process of representative optimization/dataset selection. Developer controls overall performance gain/loss, code size/number of versions and quality of mapping mechanism. First version is greedy (expensive); will be revisited in the future. Speedup matrix: datasets D_1, D_2, D_3, D_4, D_5, …, D_X; speedups s_i1, s_i2, …, s_io; s_max. Find the code version with a max geometric mean of speedups across all datasets

68 Step 2: Algorithm to select representative optimization set Main idea: simplify life of developers by automating the process of representative optimization/dataset selection. Developer controls overall performance gain/loss, code size/number of versions and quality of mapping mechanism. First version is greedy (expensive); will be revisited in the future. Speedup matrix: datasets D_1, D_2, D_3, D_4, D_5, …, D_X; speedups s_i1, s_i2, …, s_io; s_max; c_0. Add this version (c_0) to the representative set

69 Step 2: Algorithm to select representative optimization set Main idea: simplify life of developers by automating the process of representative optimization/dataset selection. Developer controls overall performance gain/loss, code size/number of versions and quality of mapping mechanism. First version is greedy (expensive); will be revisited in the future. Speedup matrix: datasets D_1, D_2, D_3, D_4, D_5, …, D_X; speedups s_i1, s_i2, …, s_io; s_max; c_0; s_Rmax. Calculate s_Rmax = geometric mean of the best speedups for each dataset using the representative set

70 Step 2: Algorithm to select representative optimization set Main idea: simplify life of developers by automating the process of representative optimization/dataset selection. Developer controls overall performance gain/loss, code size/number of versions and quality of mapping mechanism. First version is greedy (expensive); will be revisited in the future. Speedup matrix: datasets D_1, D_2, D_3, D_4, D_5, …, D_X; speedups s_i1, s_i2, …, s_io; s_max; c_0; s_Rmax. Check performance improvement/loss and code size explosion for the representative set

71 Step 2: Algorithm to select representative optimization set Main idea: simplify life of developers by automating the process of representative optimization/dataset selection. Developer controls overall performance gain/loss, code size/number of versions and quality of mapping mechanism. First version is greedy (expensive); will be revisited in the future. Speedup matrix: datasets D_1, D_2, D_3, D_4, D_5, …, D_X; speedups s_i1, s_i2, …, s_io; s_max; c_0; s_Rmax. If continuing, remove c_i and all datasets where c_i achieves the best speedup

72 Step 2: Algorithm to select representative optimization set Main idea: simplify life of developers by automating the process of representative optimization/dataset selection. Developer controls overall performance gain/loss, code size/number of versions and quality of mapping mechanism. First version is greedy (expensive); will be revisited in the future. Remaining datasets D_2, D_3, D_4, D_5 with speedups s_21, s_22; s_31, s_32; s_41, s_42; s_51, s_52; s_max; c_0; s_Rmax. Find the code version with a max geometric mean of speedups across all remaining datasets

73 Step 2: Algorithm to select representative optimization set Main idea: simplify life of developers by automating the process of representative optimization/dataset selection. Developer controls overall performance gain/loss, code size/number of versions and quality of mapping mechanism. First version is greedy (expensive); will be revisited in the future. Remaining datasets D_2, D_3, D_4, D_5 with speedups s_21, s_22; s_31, s_32; s_41, s_42; s_51, s_52; s_max; c_0; s_Rmax. Find the code version with a max geometric mean of speedups across all remaining datasets

74 Step 2: Algorithm to select representative optimization set Main idea: simplify life of developers by automating the process of representative optimization/dataset selection. Developer controls overall performance gain/loss, code size/number of versions and quality of mapping mechanism. First version is greedy (expensive); will be revisited in the future. Remaining datasets D_2, D_3, D_4, D_5 with speedups s_21, s_22; s_31, s_32; s_41, s_42; s_51, s_52; s_max; c_0; c_1; s_Rmax. Add this version (c_1) to the representative set

75 Step 2: Algorithm to select representative optimization set Main idea: simplify life of developers by automating the process of representative optimization/dataset selection. Developer controls overall performance gain/loss, code size/number of versions and quality of mapping mechanism. First version is greedy (expensive); will be revisited in the future. Remaining datasets D_2, D_3, D_4, D_5 with speedups s_21, s_22; s_31, s_32; s_41, s_42; s_51, s_52; s_max; c_0; c_1; s_Rmax. Calculate s_Rmax = geometric mean of the best speedups for each dataset using the representative set

76 Step 2: Algorithm to select representative optimization set Main idea: simplify life of developers by automating the process of representative optimization/dataset selection. Developer controls overall performance gain/loss, code size/number of versions and quality of mapping mechanism. First version is greedy (expensive); will be revisited in the future. Remaining datasets D_2, D_3, D_4, D_5 with speedups s_21, s_22; s_31, s_32; s_41, s_42; s_51, s_52; s_max; c_0; c_1; s_Rmax. Check performance improvement/loss and code size explosion for the representative set

77 Step 2: Algorithm to select representative optimization set Main idea: simplify life of developers by automating the process of representative optimization/dataset selection. Developer controls overall performance gain/loss, code size/number of versions and quality of mapping mechanism. First version is greedy (expensive); will be revisited in the future. Remaining datasets D_2, D_3, D_4, D_5 with speedups s_21, s_22; s_31, s_32; s_41, s_42; s_51, s_52; s_max; c_0; c_1; s_Rmax. If continuing, remove c_i and all datasets where c_i achieves the best speedup

78 Step 2: Algorithm to select representative optimization set Main idea: simplify life of developers by automating the process of representative optimization/dataset selection. Developer controls overall performance gain/loss, code size/number of versions and quality of mapping mechanism. First version is greedy (expensive); will be revisited in the future. Remaining datasets D_2, D_4 with speedups s_21, s_22; s_41, s_42; s_max; c_0; c_1; s_Rmax; and so on
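
The whole greedy loop from slides 64-78 can be summarized in a short sketch, assuming the speedups s[i][j] from Step 1 are available; the stopping thresholds stand in for the developer-controlled gain/loss and code-size limits, and all numbers are invented:

```python
from math import prod

def geomean(xs):
    return prod(xs) ** (1.0 / len(xs))

def select_representative_set(s, loss_threshold=0.95, max_versions=4):
    datasets = set(s)                          # remaining datasets
    versions = set(next(iter(s.values())))     # candidate code versions
    s_max = geomean([max(s[d].values()) for d in s])  # best achievable
    rep = []
    while datasets and versions and len(rep) < max_versions:
        # version with max geometric-mean speedup over remaining datasets
        c = max(versions, key=lambda v: geomean([s[d][v] for d in datasets]))
        rep.append(c)
        # s_Rmax: best speedup per dataset restricted to the representative set
        s_rmax = geomean([max(s[d][v] for v in rep) for d in s])
        if s_rmax >= loss_threshold * s_max:   # developer-controlled loss bound
            break
        # remove c and all datasets where c already achieves the best speedup
        versions.discard(c)
        datasets -= {d for d in datasets if s[d][c] == max(s[d].values())}
    return rep

speedups = {  # invented measurements: dataset -> {version: speedup}
    "D1": {"c0": 1.4, "c1": 1.1},
    "D2": {"c0": 1.0, "c1": 1.6},
    "D3": {"c0": 1.3, "c1": 1.2},
}
print(select_representative_set(speedups))     # -> ['c1', 'c0']
```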

79 Step 3: Run-time version mapping mechanism WEKA - machine learning suite supporting multiple standard techniques such as clustering, classification and regression. Direct Classification (returns the most similar case from prior experience): SMO - Support Vector Machine based; J48 - decision tree based; REPTree - decision tree based; JRip - rule based; PART - rule based; Ridor - rule based. Performance Prediction Model (probabilistic approach): LeastMedSq - linear regression based; LinearRegression - linear regression based; PaceRegression - linear regression based; SMOreg - Support Vector Machine based; REPTree - decision tree based; M5Rules - rule based

80 Step 3: Run-time version mapping mechanism WEKA - machine learning suite supporting multiple standard techniques such as clustering, classification and regression. Direct Classification (returns the most similar case from prior experience): SMO - Support Vector Machine based; J48 - decision tree based; REPTree - decision tree based; JRip - rule based; PART - rule based; Ridor - rule based. Performance Prediction Model (probabilistic approach): LeastMedSq - linear regression based; LinearRegression - linear regression based; PaceRegression - linear regression based; SMOreg - Support Vector Machine based; REPTree - decision tree based; M5Rules - rule based. Algorithms vary in applicability and complexity depending on the problem encountered
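
As an illustration of the mapping step, the sketch below uses scikit-learn's decision tree as a stand-in for the WEKA classifiers listed above (J48 and REPTree are also decision-tree learners); features and labels are invented, mapping dataset characteristics to a representative version:

```python
from sklearn.tree import DecisionTreeClassifier

# invented training data: dataset features (here just matrix dimensions)
# and the representative version that performed best on each dataset
X = [[64, 64], [128, 128], [1024, 1024], [2048, 2048]]
y = ["c0", "c0", "c1", "c1"]

model = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(model.predict([[900, 900]]))   # version to dispatch to at run time
```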

81 Error rates of different classification algorithms Basic characterization: array dimension, 990 distinct datasets for DGEMM, 82 for FFT. Direct Classification

82 Error rates of different classification algorithms Basic characterization: array dimension, 990 distinct datasets for DGEMM, 82 for FFT. Direct Classification; Performance Prediction Model

83 Error rates of different classification algorithms Basic characterization: array dimension, 990 distinct datasets for DGEMM, 82 for FFT. Direct Classification; Performance Prediction Model. Best performing algorithms are either decision tree or rule based

84 Outline: Motivation; Prerequisites; Iterative feedback-directed compilation (empirical optimization); MILEPOST project: machine-learning based self-tuning compiler adaptable to heterogeneous reconfigurable architectures (based on static and dynamic program features); Optimization sensitivity to datasets; Run-time adaptation for statically compiled programs (adaptive binaries); Recent projects: Adaptive libraries (iterative compilation, dataset features and run-time adaptation), Predictive code scheduling for heterogeneous (CPU/GPU-like) architectures (BRIEF), Statistical collective optimization; Collective Tuning Initiative; Conclusions and future work

85 Predictive code scheduling

86 Predictive code scheduling Victor Jimenez, Isaac Gelado, Lluis Vilanova, Marisa Gil, Grigori Fursin and Nacho Navarro. Predictive runtime code scheduling for heterogeneous architectures. Proceedings of the International Conference on High Performance Embedded Architectures & Compilers (HiPEAC 2009), Paphos, Cyprus, January 2009

87 Predictive code scheduling

88 Predictive code scheduling

89 Outline: Motivation; Prerequisites; Iterative feedback-directed compilation (empirical optimization); MILEPOST project: machine-learning based self-tuning compiler adaptable to heterogeneous reconfigurable architectures (based on static and dynamic program features); Optimization sensitivity to datasets; Run-time adaptation for statically compiled programs (adaptive binaries); Recent projects: Adaptive libraries (iterative compilation, dataset features and run-time adaptation), Predictive code scheduling for heterogeneous (CPU/GPU-like) architectures, Statistical collective optimization; Collective Tuning Initiative; Conclusions and future work

90 Collective optimization framework This framework relies only on execution time and statistical analysis. It does not require a specialized ML compiler or OS.

91 Probabilistic focused search Probability distribution to select an optimization combination based on continuous competition between combinations

92 Probabilistic focused search Probability distribution to select an optimization combination based on continuous competition between combinations

93 Probabilistic focused search Probability distribution to select an optimization combination based on continuous competition between combinations
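
A sketch of this continuous competition, assuming probability mass is shifted from loser to winner after each pairwise race; run_time() is a stub that favours one combination, and the update constants are arbitrary:

```python
import random

def run_time(combination):
    """Placeholder measurement: pretend C3 is the fastest combination."""
    return random.uniform(0.9, 1.1) * (1.0 if combination == "C3" else 1.2)

combos = ["C1", "C2", "C3", "C4"]
prob = {c: 1.0 / len(combos) for c in combos}     # start uniform

for _ in range(200):
    a, b = random.choices(combos, weights=[prob[c] for c in combos], k=2)
    if a == b:
        continue                                  # need two distinct competitors
    winner, loser = (a, b) if run_time(a) < run_time(b) else (b, a)
    prob[winner] += 0.5 * prob[loser]             # shift mass toward the winner
    prob[loser] *= 0.5
print(max(prob, key=prob.get))                    # tends to converge to C3
```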

94 Maturation stages of a program We would like to find a probability distribution of optimizations for a given program that maximizes performance across (all) datasets. Stage 1: Program unknown. We leverage and apply optimizations suggested by the general experience collected over all programs. (Probability distribution d_1)

95 Maturation stages of a program We would like to find a probability distribution of optimizations for a given program that maximizes performance across (all) datasets. Stage 1: Program unknown. We leverage and apply optimizations suggested by the general experience collected over all programs. (Probability distribution d_1) Stage 2: Program unknown, a few runs only. Characterize the program behavior using program reaction to optimizations [more details later]. (Probability distribution d_2)

96 Maturation stages of a program We would like to find a probability distribution of optimizations for a given program that maximizes performance across (all) datasets. Stage 1: Program unknown. We leverage and apply optimizations suggested by the general experience collected over all programs. (Probability distribution d_1) Stage 2: Program unknown, a few runs only. Characterize the program behavior using program reaction to optimizations [more details later]. (Probability distribution d_2) Stage 3: Program known, heavily used. We do not need the experience of other programs to select the most appropriate optimization combinations. Learning across datasets is important: use continuous competition between combinations. (Probability distribution d_3)

97 Stage 2: Program known, a few runs only Intuition: it is unlikely that all programs exhibit widely different behavior with respect to compiler optimizations (clustering). Need to find similar programs to bias the distribution of optimizations using some characterization: static program features - need special tools; dynamic program features - may not be portable

98 Stage 2: Program known, a few runs only Intuition: it is unlikely that all programs exhibit widely different behavior with respect to compiler optimizations (clustering). Need to find similar programs to bias the distribution of optimizations using some characterization: static program features - need special tools; dynamic program features - may not be portable. Suggestion: use program reaction to transformations based on speedup only (portable). Table of optimization-combination reactions for programs P, P_1, P_2, P_3: C_1 > C_2; C_3 > C_4; C_5 > C ; C_7 > C

99 Stage 2: Program known, a few runs only Intuition: it is unlikely that all programs exhibit widely different behavior with respect to compiler optimizations (clustering). Need to find similar programs to bias the distribution of optimizations using some characterization: static program features - need special tools; dynamic program features - may not be portable. Suggestion: use program reaction to transformations based on speedup only (portable). Table of optimization-combination reactions for programs P, P_1, P_2, P_3: C_1 > C_2; C_3 > C_4; C_5 > C ; C_7 > C
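
A sketch of matching a program by its reaction signature: the outcomes of a few pairwise competitions form a portable characterization that needs only timings. The reactions below are invented stand-ins for the table above:

```python
# invented reactions: outcomes of the pairwise competitions shown above
reactions = {
    "P1": (True, False, True, True),    # did C1 beat C2?, C3 beat C4?, ...
    "P2": (True, True, False, True),
    "P3": (False, False, True, False),
}

def most_similar(new_reaction):
    """Program whose reaction vector agrees most with the new program's."""
    return max(reactions,
               key=lambda p: sum(a == b
                                 for a, b in zip(reactions[p], new_reaction)))

# bias the optimization distribution toward what worked for this program
print(most_similar((True, False, True, False)))   # -> P1
```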

100 Selecting stages (meta-distribution) Permanent competition between the different stages' distributions (d_1, d_2, d_3): 1) All have equal probability 2) Select d from (d_1, d_2, d_3) 3) Select C_1 and C_2 based on probability distribution d 4) Run program and compare C_1 and C_2 5) If C_1 > C_2 and the same holds according to d, the prediction was correct and we reward the associated distribution
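
The five steps above can be sketched as follows; for simplicity the two competing combinations are drawn uniformly rather than according to d, the stage distributions are invented, and the reward constant is arbitrary:

```python
import random

stage_dists = {                    # invented probability over combinations
    "d1": {"C1": 0.5, "C2": 0.3, "C3": 0.2},
    "d2": {"C1": 0.2, "C2": 0.5, "C3": 0.3},
    "d3": {"C1": 0.1, "C2": 0.2, "C3": 0.7},
}
stage_weight = {d: 1.0 for d in stage_dists}       # 1) equal probability

def measure(combo):                # placeholder run time per combination
    return {"C1": 1.2, "C2": 1.1, "C3": 0.9}[combo] * random.uniform(0.95, 1.05)

for _ in range(100):
    d = random.choices(list(stage_weight),
                       weights=list(stage_weight.values()))[0]   # 2) select d
    dist = stage_dists[d]
    c1, c2 = random.sample(list(dist), 2)  # 3) simplified: uniform, not by d
    winner = c1 if measure(c1) < measure(c2) else c2             # 4) compare
    loser = c2 if winner == c1 else c1
    if dist[winner] > dist[loser]:                 # 5) d predicted the winner
        stage_weight[d] += 0.1     # reward the associated distribution
print(stage_weight)                # d3 gains weight: it favours the fastest C3
```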

101 Performance of collective optimization d_1 - improving default compiler heuristic using collective knowledge; d_2 - should improve with more collective knowledge; d_3 - often better than collective but may change if more optimizations are available

102 Collective Tuning Initiative Community-driven project to develop auto-tuning computing systems. Developing common open-source R&D tools with unified APIs (universal compilers, computer architecture simulators, adaptive run-time systems) to optimize/parallelize programs and architectures collectively using empirical iterative compilation, statistical and machine learning techniques. Sharing useful optimization cases from the community for programs/libraries/OS (compiler optimizations/architecture configurations to improve execution time, code size, architecture size, power consumption, etc) in the Collective Optimization Database. Partially funded by HiPEAC, MILEPOST, Google Summer of Code, IBM, INRIA

103 Collective Tuning Initiative Community-driven project to develop auto-tuning computing systems. Developing common open-source R&D tools with unified APIs (universal compilers, computer architecture simulators, adaptive run-time systems) to optimize/parallelize programs and architectures collectively using empirical iterative compilation, statistical and machine learning techniques. Sharing useful optimization cases from the community for programs/libraries/OS (compiler optimizations/architecture configurations to improve execution time, code size, architecture size, power consumption, etc) in the Collective Optimization Database. Partially funded by HiPEAC, MILEPOST, Google Summer of Code, IBM, INRIA. Requires lots of work

104 Collective Tuning Initiative Community-driven project to develop auto-tuning computing systems. Developing common open-source R&D tools with unified APIs (universal compilers, computer architecture simulators, adaptive run-time systems) to optimize/parallelize programs and architectures collectively using empirical iterative compilation, statistical and machine learning techniques. Sharing useful optimization cases from the community for programs/libraries/OS (compiler optimizations/architecture configurations to improve execution time, code size, architecture size, power consumption, etc) in the Collective Optimization Database. Partially funded by HiPEAC, MILEPOST, Google Summer of Code, IBM, INRIA. Requires lots of work. Join the cTuning community

105 Current and future work New tools open many research opportunities and allow systematic empirical evaluation of computing systems: analysis of program static and dynamic features to improve predictions; analysis of dataset features for better run-time adaptation; fine-grain program optimizations, polyhedral transformations and ML (supported by Google Summer of Code 2009); detection of important cases for collective optimization; split compilation; adaptive fault-tolerance; statistical parallelization and adaptation for heterogeneous systems

106 Thank you for your attention Collaborative R&D partially funded by: Contact: More information about research projects and software:


More information

Aries: Transparent Execution of PA-RISC/HP-UX Applications on IPF/HP-UX

Aries: Transparent Execution of PA-RISC/HP-UX Applications on IPF/HP-UX Aries: Transparent Execution of PA-RISC/HP-UX Applications on IPF/HP-UX Keerthi Bhushan Rajesh K Chaurasia Hewlett-Packard India Software Operations 29, Cunningham Road Bangalore 560 052 India +91-80-2251554

More information

Study of Variations of Native Program Execution Times on Multi-Core Architectures

Study of Variations of Native Program Execution Times on Multi-Core Architectures Study of Variations of Native Program Execution Times on Multi-Core Architectures Abdelhafid MAZOUZ University of Versailles Saint-Quentin, France. Sid-Ahmed-Ali TOUATI INRIA-Saclay, France. Denis BARTHOU

More information

Affine Loop Optimization using Modulo Unrolling in CHAPEL

Affine Loop Optimization using Modulo Unrolling in CHAPEL Affine Loop Optimization using Modulo Unrolling in CHAPEL Aroon Sharma, Joshua Koehler, Rajeev Barua LTS POC: Michael Ferguson 2 Overall Goal Improve the runtime of certain types of parallel computers

More information

Energy-Efficiency Prediction of Multithreaded Workloads on Heterogeneous Composite Cores Architectures using Machine Learning Techniques

Energy-Efficiency Prediction of Multithreaded Workloads on Heterogeneous Composite Cores Architectures using Machine Learning Techniques Energy-Efficiency Prediction of Multithreaded Workloads on Heterogeneous Composite Cores Architectures using Machine Learning Techniques Hossein Sayadi Department of Electrical and Computer Engineering

More information

Dynamic Autotuning. of Algorithmic Skeletons:

Dynamic Autotuning. of Algorithmic Skeletons: Dynamic Autotuning of Algorithmic Skeletons Informatics Research Proposal Chris Cummins Abstract. The rapid transition towards multicore hardware has left application programmers requiring higher-level

More information

Automatic Algorithm Configuration based on Local Search

Automatic Algorithm Configuration based on Local Search Automatic Algorithm Configuration based on Local Search Frank Hutter 1 Holger Hoos 1 Thomas Stützle 2 1 Department of Computer Science University of British Columbia Canada 2 IRIDIA Université Libre de

More information

Deconstructing Iterative Optimization

Deconstructing Iterative Optimization Deconstructing Iterative Optimization YANG CHEN, SHUANGDE FANG, and YUANJIE HUANG, State Key Laboratory of Computer Architecture, ICT, CAS, China LIEVEN EECKHOUT, Ghent University, Belgium GRIGORI FURSIN,

More information

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Yijie Huangfu and Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University {huangfuy2,wzhang4}@vcu.edu

More information

Collective mind: Towards practical and collaborative auto-tuning

Collective mind: Towards practical and collaborative auto-tuning Collective mind: Towards practical and collaborative auto-tuning Grigori Fursin, Renato Miceli, Anton Lokhmotov, Michael Gerndt, Marc Baboulin, Allen D. Malony, Zbigniew Chamski, Diego Novillo, Davide

More information

Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures

Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures Xin Huo Advisor: Gagan Agrawal Motivation - Architecture Challenges on GPU architecture

More information

MACHINE LEARNING BASED COMPILER OPTIMIZATION

MACHINE LEARNING BASED COMPILER OPTIMIZATION MACHINE LEARNING BASED COMPILER OPTIMIZATION Arash Ashari Slides have been copied and adapted from 2011 SIParCS by William Petzke; Self-Tuning Compilers Selecting a good set of compiler optimization flags

More information

Chapter 14 Performance and Processor Design

Chapter 14 Performance and Processor Design Chapter 14 Performance and Processor Design Outline 14.1 Introduction 14.2 Important Trends Affecting Performance Issues 14.3 Why Performance Monitoring and Evaluation are Needed 14.4 Performance Measures

More information

Improving Error Checking and Unsafe Optimizations using Software Speculation. Kirk Kelsey and Chen Ding University of Rochester

Improving Error Checking and Unsafe Optimizations using Software Speculation. Kirk Kelsey and Chen Ding University of Rochester Improving Error Checking and Unsafe Optimizations using Software Speculation Kirk Kelsey and Chen Ding University of Rochester Outline Motivation Brief problem statement How speculation can help Our software

More information

Evaluating Heuristic Optimization Phase Order Search Algorithms

Evaluating Heuristic Optimization Phase Order Search Algorithms Evaluating Heuristic Optimization Phase Order Search Algorithms by Prasad A. Kulkarni David B. Whalley Gary S. Tyson Jack W. Davidson Computer Science Department, Florida State University, Tallahassee,

More information

Loop Nest Optimizer of GCC. Sebastian Pop. Avgust, 2006

Loop Nest Optimizer of GCC. Sebastian Pop. Avgust, 2006 Loop Nest Optimizer of GCC CRI / Ecole des mines de Paris Avgust, 26 Architecture of GCC and Loop Nest Optimizer C C++ Java F95 Ada GENERIC GIMPLE Analyses aliasing data dependences number of iterations

More information

Predicting GPU Performance from CPU Runs Using Machine Learning

Predicting GPU Performance from CPU Runs Using Machine Learning Predicting GPU Performance from CPU Runs Using Machine Learning Ioana Baldini Stephen Fink Erik Altman IBM T. J. Watson Research Center Yorktown Heights, NY USA 1 To exploit GPGPU acceleration need to

More information

Using Machine Learning to Improve Automatic Vectorization

Using Machine Learning to Improve Automatic Vectorization Using Machine Learning to Improve Automatic Vectorization Kevin Stock Louis-Noël Pouchet P. Sadayappan The Ohio State University January 24, 2012 HiPEAC Conference Paris, France Introduction: Vectorization

More information

Empirical Study on Impact of Developer Collaboration on Source Code

Empirical Study on Impact of Developer Collaboration on Source Code Empirical Study on Impact of Developer Collaboration on Source Code Akshay Chopra University of Waterloo Waterloo, Ontario a22chopr@uwaterloo.ca Parul Verma University of Waterloo Waterloo, Ontario p7verma@uwaterloo.ca

More information

Dirk Tetzlaff Technical University of Berlin

Dirk Tetzlaff Technical University of Berlin Software Engineering g for Embedded Systems Intelligent Task Mapping for MPSoCs using Machine Learning Dirk Tetzlaff Technical University of Berlin 3rd Workshop on Mapping of Applications to MPSoCs June

More information

Contention-Aware Scheduling of Parallel Code for Heterogeneous Systems

Contention-Aware Scheduling of Parallel Code for Heterogeneous Systems Contention-Aware Scheduling of Parallel Code for Heterogeneous Systems Chris Gregg Jeff S. Brantley Kim Hazelwood Department of Computer Science, University of Virginia Abstract A typical consumer desktop

More information

IMPROVING ENERGY EFFICIENCY THROUGH PARALLELIZATION AND VECTORIZATION ON INTEL R CORE TM

IMPROVING ENERGY EFFICIENCY THROUGH PARALLELIZATION AND VECTORIZATION ON INTEL R CORE TM IMPROVING ENERGY EFFICIENCY THROUGH PARALLELIZATION AND VECTORIZATION ON INTEL R CORE TM I5 AND I7 PROCESSORS Juan M. Cebrián 1 Lasse Natvig 1 Jan Christian Meyer 2 1 Depart. of Computer and Information

More information

The Evolution of Big Data Platforms and Data Science

The Evolution of Big Data Platforms and Data Science IBM Analytics The Evolution of Big Data Platforms and Data Science ECC Conference 2016 Brandon MacKenzie June 13, 2016 2016 IBM Corporation Hello, I m Brandon MacKenzie. I work at IBM. Data Science - Offering

More information

Performance Cloning: A Technique for Disseminating Proprietary Applications as Benchmarks

Performance Cloning: A Technique for Disseminating Proprietary Applications as Benchmarks Performance Cloning: Technique for isseminating Proprietary pplications as enchmarks jay Joshi (University of Texas) Lieven Eeckhout (Ghent University, elgium) Robert H. ell Jr. (IM Corp.) Lizy John (University

More information

Applying Multi-Core Model Checking to Hardware-Software Partitioning in Embedded Systems

Applying Multi-Core Model Checking to Hardware-Software Partitioning in Embedded Systems V Brazilian Symposium on Computing Systems Engineering Applying Multi-Core Model Checking to Hardware-Software Partitioning in Embedded Systems Alessandro Trindade, Hussama Ismail, and Lucas Cordeiro Foz

More information

Guiding the optimization of parallel codes on multicores using an analytical cache model

Guiding the optimization of parallel codes on multicores using an analytical cache model Guiding the optimization of parallel codes on multicores using an analytical cache model Diego Andrade, Basilio B. Fraguela, and Ramón Doallo Universidade da Coruña, Spain {diego.andrade,basilio.fraguela,ramon.doalllo}@udc.es

More information

Exploring and Predicting the Architecture/Optimising Compiler Co-Design Space

Exploring and Predicting the Architecture/Optimising Compiler Co-Design Space Exploring and Predicting the Architecture/Optimising Compiler Co-Design Space Christophe Dubach, Timothy M. Jones and Michael F.P. O Boyle Members of HiPEAC School of Informatics University of Edinburgh

More information

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context 1 Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes

More information

CAPS Technology. ProHMPT, 2009 March12 th

CAPS Technology. ProHMPT, 2009 March12 th CAPS Technology ProHMPT, 2009 March12 th Overview of the Talk 1. HMPP in a nutshell Directives for Hardware Accelerators (HWA) 2. HMPP Code Generation Capabilities Efficient code generation for CUDA 3.

More information

Scheduling FFT Computation on SMP and Multicore Systems Ayaz Ali, Lennart Johnsson & Jaspal Subhlok

Scheduling FFT Computation on SMP and Multicore Systems Ayaz Ali, Lennart Johnsson & Jaspal Subhlok Scheduling FFT Computation on SMP and Multicore Systems Ayaz Ali, Lennart Johnsson & Jaspal Subhlok Texas Learning and Computation Center Department of Computer Science University of Houston Outline Motivation

More information

Evolving HPCToolkit John Mellor-Crummey Department of Computer Science Rice University Scalable Tools Workshop 7 August 2017

Evolving HPCToolkit John Mellor-Crummey Department of Computer Science Rice University   Scalable Tools Workshop 7 August 2017 Evolving HPCToolkit John Mellor-Crummey Department of Computer Science Rice University http://hpctoolkit.org Scalable Tools Workshop 7 August 2017 HPCToolkit 1 HPCToolkit Workflow source code compile &

More information

Analyzing the Performance of IWAVE on a Cluster using HPCToolkit

Analyzing the Performance of IWAVE on a Cluster using HPCToolkit Analyzing the Performance of IWAVE on a Cluster using HPCToolkit John Mellor-Crummey and Laksono Adhianto Department of Computer Science Rice University {johnmc,laksono}@rice.edu TRIP Meeting March 30,

More information

ADAPTIVE TASK SCHEDULING USING LOW-LEVEL RUNTIME APIs AND MACHINE LEARNING

ADAPTIVE TASK SCHEDULING USING LOW-LEVEL RUNTIME APIs AND MACHINE LEARNING ADAPTIVE TASK SCHEDULING USING LOW-LEVEL RUNTIME APIs AND MACHINE LEARNING Keynote, ADVCOMP 2017 November, 2017, Barcelona, Spain Prepared by: Ahmad Qawasmeh Assistant Professor The Hashemite University,

More information

Code Auto-Tuning with the Periscope Tuning Framework

Code Auto-Tuning with the Periscope Tuning Framework Code Auto-Tuning with the Periscope Tuning Framework Renato Miceli, SENAI CIMATEC renato.miceli@fieb.org.br Isaías A. Comprés, TUM compresu@in.tum.de Project Participants Michael Gerndt, TUM Coordinator

More information

Accelerating Financial Applications on the GPU

Accelerating Financial Applications on the GPU Accelerating Financial Applications on the GPU Scott Grauer-Gray Robert Searles William Killian John Cavazos Department of Computer and Information Science University of Delaware Sixth Workshop on General

More information

Raced Profiles: Efficient Selection of Competing Compiler Optimizations

Raced Profiles: Efficient Selection of Competing Compiler Optimizations Raced Profiles: Efficient Selection of Competing Compiler Optimizations Abstract Many problems in embedded compilation require one set of optimizations to be selected over another based on run time performance.

More information

Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s)

Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s) Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s) Islam Harb 08/21/2015 Agenda Motivation Research Challenges Contributions & Approach Results Conclusion Future Work 2

More information

An Investigation into Value Profiling and its Applications

An Investigation into Value Profiling and its Applications Manchester Metropolitan University BSc. (Hons) Computer Science An Investigation into Value Profiling and its Applications Author: Graham Markall Supervisor: Dr. Andy Nisbet No part of this project has

More information

Towards Machine Learning-Based Auto-tuning of MapReduce

Towards Machine Learning-Based Auto-tuning of MapReduce 3 IEEE st International Symposium on Modelling, Analysis & Simulation of Computer and Telecommunication Systems Towards Machine Learning-Based Auto-tuning of MapReduce Nezih Yigitbasi, Theodore L. Willke,

More information

A Test Suite for High-Performance Parallel Java

A Test Suite for High-Performance Parallel Java page 1 A Test Suite for High-Performance Parallel Java Jochem Häuser, Thorsten Ludewig, Roy D. Williams, Ralf Winkelmann, Torsten Gollnick, Sharon Brunett, Jean Muylaert presented at 5th National Symposium

More information

Collective mind: Towards practical and collaborative auto-tuning

Collective mind: Towards practical and collaborative auto-tuning Scientific Programming 22 (2014) 309 329 309 DOI 10.3233/SPR-140396 IOS Press Collective mind: Towards practical and collaborative auto-tuning Grigori Fursin a,, Renato Miceli b, Anton Lokhmotov c, Michael

More information

Kismet: Parallel Speedup Estimates for Serial Programs

Kismet: Parallel Speedup Estimates for Serial Programs Kismet: Parallel Speedup Estimates for Serial Programs Donghwan Jeon, Saturnino Garcia, Chris Louie, and Michael Bedford Taylor Computer Science and Engineering University of California, San Diego 1 Questions

More information

Architecture Tuning Study: the SimpleScalar Experience

Architecture Tuning Study: the SimpleScalar Experience Architecture Tuning Study: the SimpleScalar Experience Jianfeng Yang Yiqun Cao December 5, 2005 Abstract SimpleScalar is software toolset designed for modeling and simulation of processor performance.

More information

Unifying Big Data Workloads in Apache Spark

Unifying Big Data Workloads in Apache Spark Unifying Big Data Workloads in Apache Spark Hossein Falaki @mhfalaki Outline What s Apache Spark Why Unification Evolution of Unification Apache Spark + Databricks Q & A What s Apache Spark What is Apache

More information

Workloads Programmierung Paralleler und Verteilter Systeme (PPV)

Workloads Programmierung Paralleler und Verteilter Systeme (PPV) Workloads Programmierung Paralleler und Verteilter Systeme (PPV) Sommer 2015 Frank Feinbube, M.Sc., Felix Eberhardt, M.Sc., Prof. Dr. Andreas Polze Workloads 2 Hardware / software execution environment

More information

Executing Legacy Applications on a Java Operating System

Executing Legacy Applications on a Java Operating System Executing Legacy Applications on a Java Operating System Andreas Gal, Michael Yang, Christian Probst, and Michael Franz University of California, Irvine {gal,mlyang,probst,franz}@uci.edu May 30, 2004 Abstract

More information

ATOS introduction ST/Linaro Collaboration Context

ATOS introduction ST/Linaro Collaboration Context ATOS introduction ST/Linaro Collaboration Context Presenter: Christian Bertin Development team: Rémi Duraffort, Christophe Guillon, François de Ferrière, Hervé Knochel, Antoine Moynault Consumer Product

More information

Trends in HPC (hardware complexity and software challenges)

Trends in HPC (hardware complexity and software challenges) Trends in HPC (hardware complexity and software challenges) Mike Giles Oxford e-research Centre Mathematical Institute MIT seminar March 13th, 2013 Mike Giles (Oxford) HPC Trends March 13th, 2013 1 / 18

More information

Automatic Selection of GCC Optimization Options Using A Gene Weighted Genetic Algorithm

Automatic Selection of GCC Optimization Options Using A Gene Weighted Genetic Algorithm Automatic Selection of GCC Optimization Options Using A Gene Weighted Genetic Algorithm San-Chih Lin, Chi-Kuang Chang, Nai-Wei Lin National Chung Cheng University Chiayi, Taiwan 621, R.O.C. {lsch94,changck,naiwei}@cs.ccu.edu.tw

More information

Administration. Prerequisites. Meeting times. CS 380C: Advanced Topics in Compilers

Administration. Prerequisites. Meeting times. CS 380C: Advanced Topics in Compilers Administration CS 380C: Advanced Topics in Compilers Instructor: eshav Pingali Professor (CS, ICES) Office: POB 4.126A Email: pingali@cs.utexas.edu TA: TBD Graduate student (CS) Office: Email: Meeting

More information

Rechnerarchitektur (RA)

Rechnerarchitektur (RA) 12 Rechnerarchitektur (RA) Sommersemester 2017 Architecture-Aware Optimizations -Software Optimizations- Jian-Jia Chen Informatik 12 Jian-jia.chen@tu-.. http://ls12-www.cs.tu.de/daes/ Tel.: 0231 755 6078

More information

Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives

Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives José M. Andión, Manuel Arenaz, François Bodin, Gabriel Rodríguez and Juan Touriño 7th International Symposium on High-Level Parallel

More information

Piecewise Holistic Autotuning of Compiler and Runtime Parameters

Piecewise Holistic Autotuning of Compiler and Runtime Parameters Piecewise Holistic Autotuning of Compiler and Runtime Parameters Mihail Popov, Chadi Akel, William Jalby, Pablo de Oliveira Castro University of Versailles Exascale Computing Research August 2016 C E R

More information

Performance measurement. SMD149 - Operating Systems - Performance and processor design. Introduction. Important trends affecting performance issues

Performance measurement. SMD149 - Operating Systems - Performance and processor design. Introduction. Important trends affecting performance issues Performance measurement SMD149 - Operating Systems - Performance and processor design Roland Parviainen November 28, 2005 Performance measurement Motivation Techniques Common metrics Processor architectural

More information

Dealing with Heterogeneous Multicores

Dealing with Heterogeneous Multicores Dealing with Heterogeneous Multicores François Bodin INRIA-UIUC, June 12 th, 2009 Introduction Main stream applications will rely on new multicore / manycore architectures It is about performance not parallelism

More information

Parallel Algorithm Engineering

Parallel Algorithm Engineering Parallel Algorithm Engineering Kenneth S. Bøgh PhD Fellow Based on slides by Darius Sidlauskas Outline Background Current multicore architectures UMA vs NUMA The openmp framework and numa control Examples

More information

Evaluating the Effects of Compiler Optimisations on AVF

Evaluating the Effects of Compiler Optimisations on AVF Evaluating the Effects of Compiler Optimisations on AVF Timothy M. Jones, Michael F.P. O Boyle Member of HiPEAC, School of Informatics University of Edinburgh, UK {tjones1,mob}@inf.ed.ac.uk Oğuz Ergin

More information

Statistical Performance Comparisons of Computers

Statistical Performance Comparisons of Computers Tianshi Chen 1, Yunji Chen 1, Qi Guo 1, Olivier Temam 2, Yue Wu 1, Weiwu Hu 1 1 State Key Laboratory of Computer Architecture, Institute of Computing Technology (ICT), Chinese Academy of Sciences, Beijing,

More information

Execution-based Prediction Using Speculative Slices

Execution-based Prediction Using Speculative Slices Execution-based Prediction Using Speculative Slices Craig Zilles and Guri Sohi University of Wisconsin - Madison International Symposium on Computer Architecture July, 2001 The Problem Two major barriers

More information

Automatic Algorithm Configuration based on Local Search

Automatic Algorithm Configuration based on Local Search Automatic Algorithm Configuration based on Local Search Frank Hutter 1 Holger Hoos 1 Thomas Stützle 2 1 Department of Computer Science University of British Columbia Canada 2 IRIDIA Université Libre de

More information

HSA Foundation! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar Room (Bld 20)! 15 December, 2017!

HSA Foundation! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar Room (Bld 20)! 15 December, 2017! Advanced Topics on Heterogeneous System Architectures HSA Foundation! Politecnico di Milano! Seminar Room (Bld 20)! 15 December, 2017! Antonio R. Miele! Marco D. Santambrogio! Politecnico di Milano! 2

More information

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University

More information

Software Ecosystem for Arm-based HPC

Software Ecosystem for Arm-based HPC Software Ecosystem for Arm-based HPC CUG 2018 - Stockholm Florent.Lebeau@arm.com Ecosystem for HPC List of components needed: Linux OS availability Compilers Libraries Job schedulers Debuggers Profilers

More information