Adaptive Power Profiling for Many-Core HPC Architectures

Size: px

Start display at page:

Download "Adaptive Power Profiling for Many-Core HPC Architectures"

Quentin Montgomery
5 years ago
Views:

1 Adaptive Power Profiling for Many-Core HPC Architectures Jaimie Kelley, Christopher Stewart The Ohio State University Devesh Tiwari, Saurabh Gupta Oak Ridge National Laboratory

2 State-of-the-Art Schedulers use profiles to assign resources. Machine 1 Workload Binary.exe Profile Scheduler Machine N 2

3 State-of-the-Art Schedulers use profiles to assign resources. Profiles contain: CPU usage Memory usage Disk usage Scheduler Machine 1 Machine N Workload Binary.exe 3

4 State-of-the-Art Schedulers use profiles to assign resources. Machine 1 Profiles inform: Scheduler Assigning only what is needed Which apps strain datacenter resources Workload Binary.exe Machine N 4

5 State-of-the-Art HPC workloads may use dozens of machines. Traditional HPC workloads use every available core. Scheduler Machine 1 Workload Binary.exe Machine N Workload Binary.exe 5

6 State-of-the-Art HPC workloads may use dozens of machines. Traditional HPC workloads use every available core. Scheduler Machine 1 Workload Binary.exe Machine N Workload Binary.exe Using every core can be inefficient in terms of power. 6

7 Core Scaling Machine i Core 1 Core 2 Core N On how many cores should a workload execute? 7

8 Core Scaling Machine i Core 1 Core 2 Core N Few workloads use every core effectively. PARSEC achieves 9% of the speedup of 442 cores with only 35 cores. [H. Esmaeilzadeh et al., ISCA, 211] 8

9 Profiling Peak Power With Idle Cores Peak power is the largest aggregate power draw that lasts at least a tenth of a second. In an offline setting, workloads are run to completion Power readings are taken Multiple runs on each number of active cores Very slow 9

10 Profiling Peak Power Online Common approaches to online profiling: Use power readings from similar workloads [C. Delimitrou et al., ASPLOS 214] K% or fixed time (1 minutes) sampling [H. Nguyen et al., ICAC 213] Accurate profiles are critical to schedulers and capacity planning. [O. Sarood et al., Supercomputing 214] 1

11 Our Contributions Empirical study of peak power on HPC 5 observations about the interaction between architecture, workload, and power usage Novel approach to speedup peak power profiling with idle cores 11

12 Experimental Setup: Architectures Values Cores L2 (KB) L3 (MB) Frequency (GHz) TDP (W) Size (nm) Intel i Intel Sandy Bridge Intel Xeon Phi N/A When cores idled, each used low-power modes: Intel i7: On-Demand CPU governor Intel Sandy Bridge: P-states Intel Xeon Phi: P-states and C3 package 12

13 Experimental Setup: Workloads 9 High Performance Computing kernels Each benchmark runs on 2 x cores. For NAS, size B is appropriate for 1 node. Values CoMD Snap MiniFE NAS benchmarks Features N-body molecular dynamics, Frequent inter-thread communication PARTISN workload used at LANL, Iterative inter-thread communication Finite-element workload composed from 4 parallelizable kernels Widely used benchmark suite (CG, EP, FT, IS, LU, MG) 13

14 Platform & Tools Platform MPI OpenMP (OMP) Features Message passing interface, limited support for shared memory Open Multi-Processing provides extensive shared memory support We used thread pinning to ensure which cores were active. MPI & OMP exhibited similar peak power under micro-benchmarks (<2%) Tools librapl likwid Micsmc Features Reports on-core, uncore and total power readings per socket Reports hardware counters related to data movement and total energy Reports power readings on the Intel Xeon Phi Coprocessor We use data movement to explain the root causes for the effects of core scaling 14

15 Observation 1: Architectural Differences Peak Power (relative to 1 core) i7 (ceiling) i7 (floor) Sandy Bridge (ceiling) Sandy Bridge (floor) Xeon Phi (ceiling) Xeon Phi (floor) Active Cores Different architectures observe significantly different relative increase in peak power as the number of active cores increases 15

16 Observation 2a: One Reference Workload Average Error (Peak Power) 35% 3% 25% 2% 15% 1% 5% % i7 Sandy Bridge Xeon Phi Target Workload Using a randomly selected reference workload to predict peak power of other workloads can result in highly variable inaccuracy. 16

17 Observation 2b: Best Reference Workload 15 Peak Power (W) Active Cores NAS-IS on i7 CoMD on i7 NAS-IS on Xeon Phi CoMD on Xeon Phi While NAS IS and CoMD have similar power on Intel i7, their peak power on Xeon Phi differ especially at high core counts. 17

18 Observation 2b: Best Reference Workload 15 Peak Power (W) CoMD+MPI on Xeon Phi CoMD+OpenMP on Xeon Phi CoMD+MPI on Sandy Bridge Active Cores CoMD+OpenMP on Sandy Bridge CoMD MPI and OpenMP have similar peak power on Intel Xeon Phi, but their peak power on Intel Sandy Bridge differ especially at high numbers of active cores. 18

19 Observation 3: Power Phases (Snap) Instantaneous Power (W) Xeon Phi.5 1 Normalized Runtime Instantaneous Power (W) Sandy Bridge.5 1 Normalized Runtime Instantaneous Power (W) i7.5 1 Normalized Runtime Interestingly, for a given application the power profiles are similar across different numbers of active cores on a fixed architecture given normalized execution time. 19

20 Our Contributions Empirical study of peak power on HPC 5 observations about the interaction between architecture, workload, and power usage Novel approach to speedup peak power profiling with idle cores 2

21 K% Sampling (Sandy Bridge Examples) Profiles for only K% of a workload Peak Power Prediction Error MPI MG Normalized Runtime (K%) Peak Power Prediction Error MPI FT Normalized Runtime (K%) Workloads exhibit similar phases for different numbers of active cores. 21

22 Adaptive Power Profiling We can acquire power profiles faster than K% by taking advantage of these similar phases. 1. Profile the highest # of cores first. 2. Only profile at other cores until the most power hungry phase has occurred (or until K%, whichever is first). 22

23 Inputs Sampling Duration (k%) Inaccuracy (a%) i.e., expected error Active cores for initial profiling run By default this is all available cores. Output The duration of normalized time kˆ % to run the workload at other active core settings. 23

24 First Trace 1. Collect a power trace PP(i) for the given number of active cores, for k% of the workload. 2. For each point in PP(i), calculate PP max (i): PP ( i) = max max( PP(1),..., PP( i)) 3. We are guaranteed that the most accurate peak power over k% will be found at PP max (k). 4. Calculate the expected error in estimating peak power PPEC(i): PPEC( i) = PP max ( k) PP max PP ( k) max ( i) 24

25 Algorithm Data: PPEC[], k, a (expected inaccuracy) Result: find sampling duration kˆ % to run at other active core settings 1. For i to k do 2. if PPEC[i] < a then 3. return i; 4. end 5. end 25

26 Varying Time Requested (i7) Duration Profiled (%) th Percentile 25th Percentile Time Requested (%) Median Inaccuracy (%) Our Method K% Method Time Requested (%) On the Intel i7, we saved up to 73% of total profiling duration, 11% on average. On average, inaccuracy stays within 1.7% of k% sampling. 26

27 Varying Time Requested (Xeon Phi) Duration Profiled (%) th Percentile 25th Percentile Median Inaccuracy (%) Our Method K% Method Time Requested (%) Time Requested (%) On the Intel Xeon Phi, we saved up to 93% of total profiling duration, 25% on average. On average, inaccuracy stays within.3% of k% sampling. 27

28 Requesting Inaccuracy (i7) Duration Profiled (%) th Percentile 25th Percentile Target Inaccuracy (%) Inaccuracy (%) th Percentile 25th Percentile Target Inaccuracy (%) On the Intel i7, we saved more than 4% of total profiling duration in the median case. For target inaccuracy above 7%, the 75th percentile of workloads was above 5%. 28

29 Conclusion Workloads exhibit similar stages under normalized runtime with different active cores. Our algorithm uses this observation to save time profiling for peak power online. We can save up to 93% of profiling time while keeping inaccuracy within 3% on average. 29

30 Questions? 3

31 Varying Time Requested (Sandy Bridge) Duration Profiled (%) th Percentile 25th Percentile 5 1 Time Requested (%) Median Inaccuracy (%) Our Method K% Method Time Requested (%) On the Intel Sandy Bridge, we saved up to 6% of total profiling duration, 12% on average. On average, inaccuracy stays within 3% of k% sampling. 34

32 Requesting Inaccuracy (Sandy Bridge) th Percentile 25th Percentile th Percentile 25th Percentile Duration Profiled (%) Inaccuracy (%) Target Inaccuracy (%) Target Inaccuracy (%) With k=1% and inaccuracy > 2% our method used a profiling duration less than 6%. 35

Reducing Network Contention with Mixed Workloads on Modern Multicore Clusters

Reducing Network Contention with Mixed Workloads on Modern Multicore Clusters Matthew Koop 1 Miao Luo D. K. Panda matthew.koop@nasa.gov {luom, panda}@cse.ohio-state.edu 1 NASA Center for Computational