Performance Evaluation of OpenMP Applications on Virtualized Multicore Machines

Jie Tao (1), Karl Fuerlinger (2), Holger Marten (1)
jie.tao@kit.edu, karl.fuerlinger@nm.ifi.lmu.de, holger.marten@kit.edu
(1) Steinbuch Center for Computing, Karlsruhe Institute of Technology (KIT), Germany
(2) MNM-Team, Department of Computer Science, LMU München, Germany

Outline
  Introduction
    Virtualization and the impact on performance
  Experimental Setup
    NAS parallel benchmarks, SPEC OpenMP, microbenchmarks
  Study of SP (NAS Parallel Benchmarks)
    Initial performance
    Analysis using ompP
    Optimization results and microbenchmark study
  Conclusions

Virtualization
Running multiple OSs on the same hardware
[Figure: a native stack (application, operating system, hardware) next to a virtualized host machine on which a hypervisor runs VM 1 to VM 4, each with its own guest OS]
Concepts
  Hypervisor (Xen, KVM, VMware)
  Full virtualization vs. para-virtualization
Adopted for
  Server consolidation
  Cloud computing: on-demand resource provision
Performance impact

Performance Impact of Virtualization
Has been studied before, e.g., Keith Jackson et al., "Performance of HPC Applications on the Amazon Web Services Cloud"
Here: the performance impact of virtualization on OpenMP applications

Experimental Setup
Benchmarks
  NAS OpenMP (size A)
  SPEC OpenMP (reference dataset)
  EPCC OpenMP Microbenchmarks
Host machine
  AMD Opteron 2376 ("Shanghai"), 2.3 GHz, 2-socket quad-core
  Scientific Linux
  Virtualized with Xen
Virtual machines
  Hypervisor: Xen
  OS: Debian
  Compiler: gcc
  #cores: 1-8
  Memory: 4 GB

NAS Parallel Benchmarks

NAS Parallel Benchmarks (2)

SPEC OpenMP Benchmarks

SPEC OpenMP Benchmarks (2)

Execution time of NAS SP
What is going on here?

OpenMP Performance Analysis with ompP
ompP: OpenMP profiling tool
  Based on source code instrumentation
  Independent of the compiler and runtime used
  Supports HW counters through PAPI
  Uses the source code instrumenter Opari from the KOJAK/Scalasca toolset
  Available for download (GPL)
[Figure: workflow. Source code; automatic instrumentation of OpenMP constructs and manual region instrumentation; ompP library; executable; settings via environment variables (HW counters, output format); execution on parallel machine; profiling report]

Source to Source Instrumentation with Opari
Instrumenting OpenMP constructs with Opari
Preprocessor operation: original source code, preprocessor, modified (instrumented) source code
Example: instrumentation of a parallel region (the POMP_* calls and the explicit barrier are the instrumentation added by Opari)

  POMP_Parallel_fork [master]
  #pragma omp parallel
  {
    POMP_Parallel_begin [team]
    /* user code in parallel region */
    POMP_Barrier_enter [team]
    #pragma omp barrier
    POMP_Barrier_exit [team]
    POMP_Parallel_end [team]
  }
  POMP_Parallel_join [master]

ompP's Profiling Data
Example code section and performance profile:
Code:
  #pragma omp parallel
  {
    #pragma omp critical
    {
      sleep(1.0);
    }
  }
Profile:
  R00002 main.c (34-37) (default) CRITICAL
  TID execT execC bodyT enterT exitT
  SUM
Components:
  Source code location and type of region
  Timing data and execution counts, depending on the particular construct
  One line per thread, last line sums over all threads
  Hardware counter data (if PAPI is available and HW counters are selected)
  Data is exact (measured, not based on sampling)
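The enterT and bodyT columns of such a profile can be approximated by hand with wall-clock timestamps around the critical section. The following is a minimal sketch under stated assumptions, not the way ompP or Opari implement their measurements: the names critical_profile, profile_critical, busy_wait and MAX_THREADS are hypothetical, a short busy wait replaces sleep() so the example finishes quickly, and a serial fallback is provided in case the file is compiled without OpenMP support.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks so the sketch also builds without -fopenmp. */
#include <time.h>
static double omp_get_wtime(void) {
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}
static int omp_get_thread_num(void) { return 0; }
#endif

#define MAX_THREADS 4   /* hypothetical upper bound on the team size */

/* Hand-made counterpart of one profile line (one entry per thread). */
typedef struct {
    double enter_t;  /* time spent waiting to enter the critical section */
    double body_t;   /* time spent inside the critical section */
    int    exec_c;   /* number of executions */
} critical_profile;

/* Spin until the given wall-clock time has elapsed. */
static void busy_wait(double seconds) {
    double t0 = omp_get_wtime();
    while (omp_get_wtime() - t0 < seconds) { /* spin */ }
}

/* Each thread executes the critical section once.
   Returns the number of threads that took part. */
int profile_critical(critical_profile prof[MAX_THREADS], double body_seconds) {
    for (int i = 0; i < MAX_THREADS; i++) {
        prof[i].enter_t = 0.0;
        prof[i].body_t = 0.0;
        prof[i].exec_c = 0;
    }

    #pragma omp parallel num_threads(MAX_THREADS)
    {
        int tid = omp_get_thread_num();
        double t_request = omp_get_wtime();   /* thread asks for the lock */
        #pragma omp critical
        {
            double t_enter = omp_get_wtime(); /* lock acquired */
            busy_wait(body_seconds);
            double t_leave = omp_get_wtime();
            prof[tid].enter_t += t_enter - t_request;
            prof[tid].body_t  += t_leave - t_enter;
            prof[tid].exec_c  += 1;
        }
    }

    int used = 0;
    for (int i = 0; i < MAX_THREADS; i++)
        used += prof[i].exec_c;
    return used;
}
```

With several threads, the thread that enters last accumulates an enter_t of roughly the sum of the other threads' body times, which is the waiting time the slide classifies as synchronization overhead.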

ompP Overhead Analysis (1)
Certain timing categories reported by ompP can be classified as overheads.
  Example: enterT in a critical section. Threads wait to enter the critical section (synchronization overhead).
Four overhead categories are defined in ompP:
  Imbalance: waiting time incurred due to an imbalanced amount of work in a worksharing or parallel region
  Synchronization: overhead that arises because threads have to synchronize their activity, e.g. a barrier call
  Limited parallelism: idle threads due to not enough parallelism being exposed by the program
  Thread management: overhead for the creation and destruction of threads, and for signaling critical sections and locks as available
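To make the four categories concrete, the sketch below sums them and expresses the result as a percentage. The normalization is an assumption made for illustration only: overhead is related to the accumulated thread time (wall-clock time multiplied by the number of threads), and the names overheads, total_overhead and overhead_percent are hypothetical rather than taken from ompP.

```c
/* Accumulated overhead times in seconds, summed over all threads. */
typedef struct {
    double synch;   /* synchronization */
    double imbal;   /* imbalance */
    double limpar;  /* limited parallelism */
    double mgmt;    /* thread management */
} overheads;

/* Sum of the four overhead categories. */
double total_overhead(const overheads *o) {
    return o->synch + o->imbal + o->limpar + o->mgmt;
}

/* Overhead as a percentage of the accumulated thread time.
   ASSUMPTION: accumulated thread time = wallclock_s * nthreads. */
double overhead_percent(const overheads *o, double wallclock_s, int nthreads) {
    double thread_seconds = wallclock_s * (double)nthreads;
    if (thread_seconds <= 0.0)
        return 0.0;  /* nothing was measured */
    return 100.0 * total_overhead(o) / thread_seconds;
}
```

For example, 4 seconds of total overhead in a 10-second run on 4 threads would be reported as 10 percent under this assumption.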

ompP Overhead Analysis (2)
  S: Synchronization overhead
  M: Thread management overhead
  I: Imbalance overhead
  L: Limited parallelism overhead

Overhead Analysis for the NAS Benchmarks
Configurations: BT-host, BT-full, BT-para, FT-host, FT-full, FT-para, CG-host, CG-full, CG-para, EP-host, EP-full, EP-para, SP-host, SP-full, SP-para
Columns: Total Overhead (%) Synch Imbal Limpar Mgmt
(06.48) (11.47) (11.65) (35.44) (34.53) (36.34) 1.55 (08.95) 4.87 (23.59) 6.37 (26.49) 1.08 (01.17) 1.24 (01.37) (22.13) (33.03) (86.89) (77.68)

OpenMP Constructs in the NAS Parallel Benchmarks
Columns: Parallel, Loop, Single, Barrier, Critical, Master
Rows: BT, FT, CG, EP, SP

ompP Profile for SP
ompP Profiling Report for sp.c (lines )
(para-virtualized) TID execT execC bodyT exitBarT
exitBarT (native host)
SUM

exitBarT in Parallel Loops
Opari transforms the implicit barrier into an explicit barrier
[Figure: timeline of a parallel loop with the events Loop_enter, Barrier_enter, Barrier_exit, Loop_exit]
Worst-case load imbalance scenario: exitBarT = t_i
Thread i can induce at most t_i seconds of exitBarT time in each other thread
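The bound can be illustrated with an idealized model of a single execution of the loop: if the barrier itself costs nothing, each thread waits exactly until the slowest thread has finished its share of the work. The sketch below is an assumption-laden illustration, not part of ompP; the function names are hypothetical.

```c
#include <stddef.h>

/* Body time of the slowest thread for one execution of the loop. */
double max_body_time(const double *body_t, size_t nthreads) {
    double slowest = 0.0;
    for (size_t i = 0; i < nthreads; i++)
        if (body_t[i] > slowest)
            slowest = body_t[i];
    return slowest;
}

/* Idealized exit-barrier wait of every thread, assuming the barrier
   itself is free: wait_i = (slowest body time) - (own body time). */
void ideal_exit_barrier_times(const double *body_t, double *exit_bar_t,
                              size_t nthreads) {
    double slowest = max_body_time(body_t, nthreads);
    for (size_t i = 0; i < nthreads; i++)
        exit_bar_t[i] = slowest - body_t[i];
}

/* Returns 1 if a measured barrier wait is larger than load imbalance
   alone could explain in this model, 0 otherwise. */
int barrier_time_is_suspicious(double measured_exit_bar_t,
                               const double *body_t, size_t nthreads) {
    return measured_exit_bar_t > max_body_time(body_t, nthreads);
}
```

In this model no thread can wait longer than the slowest thread's body time, so a measured wait far above that value points at the cost of the barrier itself rather than at load imbalance.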

[ompP profile excerpt: columns TID, execT, execC, bodyT, exitBarT; rows for TID 0, TID 1 and SUM]
exitBarT should be at most ~80 seconds
Barrier that takes a really long time

Optimization
Move parallelization to outermost loop

Before (worksharing on the inner i loops):

  for (j = 1; j <= grid_points[1]-2; j++) {
    for (k = 1; k <= grid_points[2]-2; k++) {
      #pragma omp for
      for (i = 0; i <= grid_points[0]-1; i++) {
        ru1 = c3c4*rho_i[i][j][k];
        cv[i] = us[i][j][k];
        rhon[i] = max(dx2+con43*ru1,
                      max(dx5+c1c5*ru1,
                          max(dxmax+ru1, dx1)));
      }
      #pragma omp for
      for (i = 1; i <= grid_points[0]-2; i++) {
        lhs[0][i][j][k] = 0.0;
        lhs[1][i][j][k] = - dttx2 * cv[i-1] - dttx1 * rhon[i-1];
        lhs[2][i][j][k] = c2dttx1 * rhon[i];
        lhs[3][i][j][k] = dttx2 * cv[i+1] - dttx1 * rhon[i+1];
        lhs[4][i][j][k] = 0.0;
      }
    }
  }

After (worksharing on the outer j loop):

  #pragma omp for
  for (j = 1; j <= grid_points[1]-2; j++) {
    for (k = 1; k <= grid_points[2]-2; k++) {
      for (i = 0; i <= grid_points[0]-1; i++) {
        ru1 = c3c4*rho_i[i][j][k];
        cv[i] = us[i][j][k];
        rhon[i] = max(dx2+con43*ru1,
                      max(dx5+c1c5*ru1,
                          max(dxmax+ru1, dx1)));
      }
      for (i = 1; i <= grid_points[0]-2; i++) {
        lhs[0][i][j][k] = 0.0;
        lhs[1][i][j][k] = - dttx2 * cv[i-1] - dttx1 * rhon[i-1];
        lhs[2][i][j][k] = c2dttx1 * rhon[i];
        lhs[3][i][j][k] = dttx2 * cv[i+1] - dttx1 * rhon[i+1];
        lhs[4][i][j][k] = 0.0;
      }
    }
  }
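The effect of the transformation can be tried on a small stand-alone kernel. The sketch below is a toy with the same loop structure, not the SP source: the grid size N, the input values, the simplified stencil and all function names are assumptions made for illustration. The inner variant shares the scratch rows cv and rhon between threads and relies on the implicit barrier after each worksharing loop, while the outer variant gives each thread private scratch rows, which it needs because different threads work on different j values at the same time.

```c
#define N 16   /* hypothetical grid size */

static double rho_i[N][N][N];
static double us[N][N][N];

/* Fill the inputs with arbitrary but reproducible values. */
void fill_inputs(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++) {
                rho_i[i][j][k] = 1.0 + 0.01 * (i + 2 * j + 3 * k);
                us[i][j][k] = 0.5 + 0.001 * (i * j + k);
            }
}

/* Variant A: worksharing on the inner i loops.
   2 * (N-2) * (N-2) worksharing loops, each ending in a barrier. */
void lhs_inner(double out[N][N][N]) {
    static double cv[N], rhon[N];   /* scratch rows shared by the team */
    #pragma omp parallel
    {
        for (int j = 1; j <= N - 2; j++) {
            for (int k = 1; k <= N - 2; k++) {
                #pragma omp for
                for (int i = 0; i <= N - 1; i++) {
                    cv[i] = us[i][j][k];
                    rhon[i] = 2.0 * rho_i[i][j][k];
                }
                #pragma omp for
                for (int i = 1; i <= N - 2; i++)
                    out[i][j][k] = 0.5 * (cv[i + 1] - cv[i - 1])
                                 - 0.25 * (rhon[i + 1] + rhon[i - 1]);
            }
        }
    }
}

/* Variant B: worksharing on the outer j loop, one barrier in total. */
void lhs_outer(double out[N][N][N]) {
    #pragma omp parallel for
    for (int j = 1; j <= N - 2; j++) {
        double cv[N], rhon[N];      /* scratch rows private to each thread */
        for (int k = 1; k <= N - 2; k++) {
            for (int i = 0; i <= N - 1; i++) {
                cv[i] = us[i][j][k];
                rhon[i] = 2.0 * rho_i[i][j][k];
            }
            for (int i = 1; i <= N - 2; i++)
                out[i][j][k] = 0.5 * (cv[i + 1] - cv[i - 1])
                             - 0.25 * (rhon[i + 1] + rhon[i - 1]);
        }
    }
}

/* Runs both variants and returns the largest absolute difference. */
double compare_variants(void) {
    static double a[N][N][N], b[N][N][N];
    double worst = 0.0;
    fill_inputs();
    lhs_inner(a);
    lhs_outer(b);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++) {
                double d = a[i][j][k] - b[i][j][k];
                if (d < 0.0) d = -d;
                if (d > worst) worst = d;
            }
    return worst;
}
```

Both variants compute the same values, but with N = 16 the inner variant passes through 392 worksharing loops with implicit barriers where the outer variant needs a single one, which matters when each barrier is expensive.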

Optimization Results

EPCC Microbenchmarks
There is significant overhead in fine-grained constructs related to thread scheduling and reduction operations

Conclusion and Future Work
Virtualization introduces application-dependent overheads
  Following good-practice advice (outermost, coarse-grained parallelization) is even more important
  Hypercalls are very expensive
Future work
  Investigate this behavior with Xen tracing tools
  Other OpenMP runtimes
  Busy wait vs. yielding
  Virtualization-aware runtime
