PCERE: Fine-grained Parallel Benchmark Decomposition for Scalability Prediction

Mihail Popov, Chadi Akel, Florent Conti, William Jalby, Pablo de Oliveira Castro
UVSQ - PRiSM - ECR
May 28, 2015

Introduction: Evaluate strong scalability
- Evaluating the strong scalability of OpenMP applications is costly and time-consuming
  - The whole application is executed multiple times with different thread configurations
- Waste of resources
  - According to Amdahl's law, sequential parts do not scale
  - Parallel regions may share similar performance across invocations

M. Popov, C. Akel, F. Conti, W. Jalby, P. Oliveira. PCERE. May 28, 2015

Introduction: PCERE, Parallel Codelet Extractor and REplayer
- Accelerate strong scalability evaluation with PCERE
- PCERE is part of the CERE (Codelet Extractor and REplayer) framework
- Decompose applications into small pieces called codelets
  - Each codelet maps a parallel region and is a standalone executable
- Extract codelets once
- Replay codelets instead of applications with different numbers of threads

Introduction: Prediction model

    int main() {
      for (i = 0; i < 3; i++) {
        // sequential code
        #pragma omp parallel
        A
        // sequential code
        #pragma omp parallel
        B
        // sequential code
      }
    }

- Executing the whole application with different thread configurations

Introduction: Prediction model
- Extract parallel regions A and B and measure the sequential execution time
- Directly replay the parallel regions

Introduction: Prediction model
- Add the sequential time (S) and the multiple invocations of parallel regions A and B

Outline
1. Overview
2. Extract and replay codelets
3. Prediction model evaluation

Overview: Codelet capture and replay
[Diagram stages: OpenMP applications, Region Capture, Codelet Replay. Labels: parallel region outlining; capture of representative working sets; working sets memory dump; generate codelets wrapper; warmup + replay; change number of threads or affinity; retarget for different architecture; fast performance prediction]

Overview: LLVM OpenMP Intermediate Representation extraction
- Extract codelets at the Intermediate Representation level for language portability and cross-architecture evaluation
[Diagram labels: C / C++ OpenMP applications; OpenMP Clang front end; LLVM IR; codelets extraction passes; LLVM opt optimization; LLVM IR; linking; LLVM llc static compiler; object files; executable binary]

Overview: Clang front end transforms source code into IR

C code:

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
      #pragma omp parallel
      {
        int p = omp_get_thread_num();  /* id of the calling thread */
        printf("%d", p);
      }
      return 0;
    }

LLVM simplified IR, produced by the Clang OpenMP front end:

    define { entry: ...
    define internal {
    entry:
      %p = alloca i32, align 4
      %call = call
      store i32 %call, i32* %p, align 4
      %1 = load i32* %p, align 4

[Diagram label: Thread execution model]

Extract and replay codelets: Deterministic codelet replay
[Diagram labels: parallel region capture; dump call; parallel region replay; restore call; direct jump to parallel region; exit]

Extract and replay codelets: Memory dump
- System memory snapshot at the beginning of each parallel region

LLVM simplified IR:

    define { entry: ...
    define internal {
    entry:
      %p = alloca i32, align 4
      %call = call
      store i32 %call, i32* %p, align 4
      %1 = load i32* %p, align 4

LLVM simplified IR after the extract + dump passes:

    define { entry: ... extracted.omp_microtask.(...) ...
    define internal extracted.omp_microtask.(...) {
    newfuncroot:
      call
    define internal { entry: ...

Extract and replay codelets: Codelet replay
- Reload the codelet working set
- Reproduce the cache state with an optimistic cache warm-up
- Multiple working sets for a single codelet

Extract and replay codelets: Codelets with different working sets
Figure: MG resid execution time over the different invocations, replayed with 4 threads
[Plot labels: Cycles; invocation; replay]

Extract and replay codelets: Lock support
- Lock support on Linux uses futexes
- Each futex allocates a kernel-space wait queue
  - Memory capture saves only the user-space memory
- A lock capture step detects all the locks accessed by a codelet
- The replay wrapper initializes the required locks in kernel space

Prediction model evaluation: Test benchmarks and architectures
- Using the NAS Parallel Benchmark OpenMP 3.0 C version, based on the Omni Compiler Project

Figure: Test architectures

                 Core2    Nehalem       Sandy Bridge   Ivy Bridge
    CPU          E7500    Xeon E5620    E5             i

[Table rows whose values are not recoverable: Frequency (GHz); Sockets; Cores per socket; Threads per core; L1 cache (KB); L2 cache (KB), one surviving entry: 3MB; L3 cache (MB); RAM (GB)]

Prediction model evaluation: Reproducing parallel regions scaling with codelets
Figure: Real vs. PCERE execution time predictions on Sandy Bridge for the SP compute_rhs codelet
[Plot labels: Runtime (cycles); Threads; series: Real, Predicted]

Prediction model evaluation: Prediction accuracy
Figure: NAS 3.0 C version average percentage error prediction accuracy
[Plot labels: BT, EP, LU, FT, SP, CG, IS, MG; series: Core2, Nehalem, Sandy Bridge, Ivy Bridge]
- On Ivy Bridge, PCERE predicts FT execution time scalability with an error of 3.4%

Prediction model evaluation: Benchmarking acceleration
Figure: NAS 3.0 C version average benchmarking acceleration
[Plot labels: BT, EP, LU, FT, SP, CG, IS, MG; series: Core2, Nehalem, Sandy Bridge, Ivy Bridge]
- On Core2, PCERE CG scalability evaluation is 24.2 times faster than with normal executions

Prediction model evaluation: PCERE prediction accuracy and benchmarking acceleration

                    Core2    Nehalem    Sandy Bridge    Ivy Bridge
    Accuracy        1.8%     2.9%       7.4%            2.8%
    Acceleration

Figure: NAS 3.0 C version average prediction accuracy and benchmarking acceleration per architecture

Prediction model evaluation: Cross micro-architecture codelet replay
- Capture-Replay is micro-architecture agnostic
  - Capture on Nehalem
  - Replay on Sandy Bridge

Figure: NAS 3.0 C version average percentage error cross replay accuracy
[Table columns: Threads; Accuracy]

Figure: NAS 3.0 C version average percentage error cross replay accuracy
[Table columns: Application (BT, EP, LU, FT, SP, CG, IS, MG); Accuracy]

Prediction model evaluation: Limitations and future work

Limitations
- No acceleration on applications with a single parallel region and no relevant sequential parts (EP)
- Prediction error when the sequential time varies across thread configurations (IS)

Future work
- Improve the warm-up strategy: use CERE page-trace warm-up
- Apply a clustering approach over codelets
- Explore the OpenMP parameter space with codelets

Conclusion
- To be released with CERE at
- Extract codelets once, replay them many times
- Cross micro-architecture and thread configuration extraction and replay
- Accelerate strong scalability evaluation 25 times
- Strong scalability prediction average error of 3.7%

Backup: Codelet replay
- Optimistic cache warm-up: assumes that the codelet working set is hot in the original run

Updated main (C code), after the extract + replay passes:

    void main() {
      int i;
      int iteration = 1;
      for (i = 0; i < iteration; i++)
        run extracted.omp_microtask.();
    }

LLVM simplified IR:

    define extracted.omp_microtask.() {
    entry:
      call
      Arrange arguments
      extracted.omp_microtask.(...)
    define internal extracted.omp_microtask.(...) {
    newfuncroot:
    define internal { entry: ...

Backup: Related work
- Cross-platform performance prediction of parallel applications using partial execution. Yang, Leo T., Ma, Xiaosong, and Mueller, Frank. SC 2005.
- Detecting Phases in Parallel Applications on Shared Memory Architectures. Perelman, Erez, Polito, Marzia, Bouguet, J.-Y., Sampson, Jack, Calder, Brad, and Dulong, Carole. IPDPS 2006.
- BarrierPoint: Sampled Simulation of Multi-Threaded Applications. Carlson, Trevor E., Heirman, Wim, Van Craeynest, Kenzo, and Eeckhout, Lieven. ISPASS 2014.
- Effective source-to-source outlining to support whole program empirical optimization. Liao, Chunhua, Quinlan, Daniel J., Vuduc, Richard, and Panas, Thomas. Languages and Compilers for Parallel Computing 2010.

Backup: Flags exploration
[Diagram labels: application source (void main() with Loop A and Loop B); CERE; loops IR extraction with no optimization; with -O2, loops profiling and extraction; representative invocations working sets (Loop inv 48, Loop inv 2); loop time; prediction model; intermediate representation; compile with an optimization point; replay representative invocations; fast optimization point evaluation; optimization space to explore; codelet optimization and replay]

Backup: Flags exploration
- For each optimization sequence, only replay the relevant parts
- Codelets matching over 200 optimization sequences

Figure: Matching error percentage per application
[Table columns: Application (CG, EP, FT, IS, LU, MG, SP, RTM); Median error; Average error]

- Speed-up evaluation versus matching error
- -O2 RTM evaluation is 237 times cheaper with codelets
