Piecewise Holistic Autotuning of Compiler and Runtime Parameters

1 Piecewise Holistic Autotuning of Compiler and Runtime Parameters
Mihail Popov, Chadi Akel, William Jalby, Pablo de Oliveira Castro
University of Versailles
Exascale Computing Research
August 2016

2 Context
- Architecture, system, and application complexities are increasing
- Systems provide default, good-enough parameter configurations
  - Compiler optimizations: -O2, -O3
  - Thread affinity: scatter
- Outperforming the default parameters leads to substantial benefits but is a costly process
- Execution-driven studies test different configurations
  - Applications have redundancies
  - Executing an application is time consuming
  - The search space is huge
- Studies reduce the exploration cost by smartly navigating through the search space

3 Piecewise Exploration
- Codelet Extractor and REplayer (CERE) decomposes applications into small pieces called codelets
- Each codelet maps a loop or a parallel region and is a standalone executable
- Extract codelets once
- Replay codelets instead of applications with different configurations to avoid redundancies

4 IS Motivating Example
IS benchmark:
    int main() {
        create_seq();
        for (i = 0; i < 11; i++)
            rank();
    }
- IS create_seq covers 40% of the execution time
- IS rank, the sorting algorithm, performs 11 invocations with the same execution time
- Piecewise exploration benefits
  - Avoid create_seq execution
  - Evaluate a single invocation of rank
- IS rank and create_seq are not sensitive to the same optimizations
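As a rough worked example of the saving: assume (the slide does not state this) that create_seq and the 11 rank invocations together account for the whole run, so rank covers the remaining 60%. The cost of replaying one rank invocation, relative to a full IS run, is then approximately:

```latex
\[
  \frac{t_{\text{one rank invocation}}}{t_{\text{IS}}}
  \approx \frac{1 - 0.40}{11}
  \approx 0.055
\]
```

Under that assumption, testing one configuration on rank costs about 5.5% of a full run, roughly 18x less, before counting replay overheads such as loading the memory dump and warming the cache.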

5 Outline
- Codelet Extractor and Replayer (CERE)
- Prediction Model
- Thread and Compiler Tuning

6 CERE Workflow
[Diagram: Applications, LLVM IR, Region Capture, Invocation & Codelet subsetting, Codelet Replay]
- Region Capture
  - Region outlining
  - Working set and cache capture
  - Working sets memory dump
- Invocation & Codelet subsetting
- Codelet Replay
  - Generate codelets wrapper
  - Warmup + Replay
  - Fast performance prediction
- Change: number of threads, affinity, runtime parameters
- Retarget for: different architectures, different optimizations
- CERE can extract codelets from:
  - Hot loops
  - OpenMP non-nested parallel regions

7 Codelet Capture and Replay
- Codelets are extracted at the LLVM Intermediate Representation level
- The user can recompile each codelet and replay it while changing compile options, runtime parameters, or the target system
- Performance-accurate replay requires capturing the cache state
- Semantically accurate replay requires capturing the memory

8 Memory Page Capture
Region to capture:
    a = malloc(256);    (memory allocation)
    a[i]++;             (memory access)
- Protect static and currently allocated process memory (/proc/self/maps)
- Intercept memory allocation functions with LD_PRELOAD
  1. Allocate memory
  2. Protect memory and return to user program
- Segmentation fault handler
  1. Dump accessed memory to disk
  2. Unlock accessed page and return to user program
- Capture access at page granularity: coarse but fast
- Small dump footprint: only touched pages are saved

9 Cache State Capture
- Cold: do not capture cache effects
- Working Set: warms all the working set during replay (optimistic)
- Page Trace: before replay, warms the last N pages accessed to restore a cache state close to the original

10 CERE Cache Warmup
    for (i = 0; i < size; i++)
        a[i] += b[i];
[Figure: memory pages of array a[] and array b[]; a FIFO of the most recently unprotected page addresses; labels: Reprotect, warmup page trace]

11 OpenMP Regions Support
C code:
    void main() {
        #pragma omp parallel
        {
            int p = omp_get_thread_num();
            printf("%d", p);
        }
    }
LLVM simplified IR (produced by the Clang OpenMP front end):
    define { entry: ... }
    define internal {
    entry:
      %p = alloca i32, align 4
      %call = call
      store i32 %call, i32* %p, align 4
      %1 = load i32* %p, align 4
    }
[Diagram: thread execution model]

12 Selecting Representative Invocations
- A region can have thousands of invocations
- Performance differs due to different working sets
- Cluster to select representative invocations
[Figure: SPEC tonto make_ft@shell2.f90:1133 execution trace; cycles per invocation, with the replayed invocations marked]
- 90% of NAS codelets can be reduced to four or fewer representatives

13 Performance Classes Across Parameters
[Figure: MG resid invocations execution time; megacycles per invocation, original trace versus replay, for O0 + 4 threads and O3 + 2 threads]
- Use three invocations to predict the application execution time
- Parameters do not change the performance classes

14 NUMA Aware Warmup
- First-touch policy: threads allocate the pages that they are the first to touch on their NUMA domain
- Detect the first thread that touches the memory pages
- During warmup, the recorded NUMA domains are restored
[Figure: BT xsolve replay; cycles versus thread number for Original, Single Thread Warmup, and NUMA Warmup, on 1 NUMA domain (compact) and 2 NUMA domains (scatter)]

15 Test Architectures and Applications
- NAS SER and NPB OpenMP 3.0 C version, CLASS A
- Blackscholes from the PARSEC benchmarks
- Reverse Time Migration (RTM) proto-application
- Compiler: LLVM 3.4

                     Sandy Bridge   Ivy Bridge
  CPU                E5             i
  Frequency (GHz)
  Sockets            2              1
  Cores per socket   8              4
  Threads per core   2              2
  L1 cache (KB)
  L2 cache (KB)
  L3 cache (MB)      20             8
  RAM (GB)

Figure: Test architectures

16 Blackscholes Thread Affinities Exploration
Different thread affinities to evaluate:
- sn: n scatter threads
- cn: n compact threads without hyper-threading
- hn: n compact threads with hyper-threading
[Figure: PARSEC Blackscholes thread configurations search; cycles for original and replay across the thread configurations 1, s2, c2, h2, s4, c4, h4, s8, c8, h8, s16, h16, h32]

17 Outperforming Default Thread Configuration
[Figure: NAS thread configurations tuning; speedup over standard (s16), original versus replay, for hyperthread (h32) and compact (c8) on BT, CG, EP, FT, IS, LU, MG, SP]

18 Autotuning LLVM Middle End Optimizations
- The LLVM middle end offers more than 50 optimization passes
- Codelet replay enables fast per-region optimization tuning
[Figure: SP ysolve codelet; cycles (Ivy Bridge, 3.4 GHz) for original and replay, by id of LLVM middle-end optimization pass combination. Schedules are random pass combinations based on the O3 passes.]
- CERE is 149x cheaper than running the full benchmark (27x cheaper when tuning codelets covering 75% of SP)

19 Hybridization
1. Application: hotspots extraction into codelets
2. Replay codelets with selected flags to test (O2, O3, sse, avx, ...)
3. Flags selection: assign to each codelet the flag that gave the best performance (for example avx, sse, sse)
4. Hybridization: extract hotspots into new files and compile files with their respective best flag
5. Result: hybrid optimized binary

20 Hybrid Compilation over the NAS
- Four parallel regions of SP cover 93% of the execution time
- No single sequence is the best for all the regions
- Codelets explore parameters for each region separately
- Produce a hybrid where each region is compiled using its best sequence
[Figure: Hybrid compilation speeds up SP OpenMP; gigacycles (compact) for rhs, zsolve, xsolve+ysolve, and total, under the compiler optimizations O2, O3, rhs best, z best, and hybrid]

21 Piecewise Exploration Benefits
[Figure: Piecewise exploration of the NAS SER; speedup over an -O baseline for the benchmarks BT, IS, and SP. Series: hybrid (original exploration), hybrid (replay exploration), monolithic best, standard. Also shown: cost of piecewise exploration and overhead of monolithic exploration, by compiler optimization sequence, for the regions zsolve, ysolve, xsolve, rank, fullverify, and createseq.]

22 Codelets Tuning Results
[Table: #Regions, Accuracy, and Acceleration, once for compiler passes and once for thread affinity; rows BT, CG, FT, IS, SP, LU, EP, MG, AVG]
- NAS SER and OpenMP benchmarks: average speedup of 1.08
- Tuning a single codelet is 13x faster than tuning full applications
- Codelet average accuracy is 94.6%
- RTM tuning through a codelet is 200x faster and achieves a speedup

23 Related Works
- Kulkarni et al. Improving Both the Performance Benefits and Speed of Optimization Phase Sequence Searches (ACM SIGPLAN Notices, 2010)
- Fursin et al. Quick and Practical Run-Time Evaluation of Multiple Program Optimizations (HiPEAC, 2007)
- Fursin et al. Milepost GCC: Machine Learning Enabled Self-Tuning Compiler (Int. J. Parallel Prog., 2011)
- Purini et al. Finding Good Optimization Sequences Covering Program Space (TACO, 2013)

24 Conclusion
Piecewise tuning with codelets
- Accelerate the exploration process
- Improve the benefits
Discussion
- Some regions are not independent: LU jacu and jacld
- Piecewise tuning sensitivity to the data set
Future Work
- Combine codelets tuning with GA
- Use a clustering approach over codelets
- Improve the parallel warmup strategy
