System Software Solutions for Exploiting Power Limited HPC Systems

1 System Software Solutions for Exploiting Power Limited HPC Systems. Martin Schulz, LLNL/CASC. 45th SPEEDUP Workshop on High-Performance Computing, September 2016, Basel, Switzerland. LLNL-PRES. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC

2 Power Limits Will Be a Reality
- Facility issues: procuring large amounts of power is difficult (large investment costs, operating costs); agencies are setting limits (e.g., DOE: 20MW for exascale); expensive energy prices in some regions
- Heating and cooling issues: need to get power out of the system again; a problem for very dense systems
- Consequences: facility-, system-, or rack-level power bounds; limits the hardware that can be purchased
- Need to use power more efficiently!

3 Power Consumption on Vulcan, a BG/Q System [Chart: measured power draw vs. provisioned power]

4 Goal: 20MW [Chart: provisioned power vs. actual consumption, highlighting wasted power]

5 Goal: 20MW [Chart: provisioned power vs. actual consumption]

6 Need to Enforce Power Caps [Chart: provisioned power vs. the 20MW goal]
- Overprovisioned systems: more hardware than can be powered at once
- Need mechanisms to control and cap power
- The amount of capping can depend on multiple issues: physical limits of a facility, budget/end-of-year constraints, renewable energy sources
- Power as a constraint, not an optimization goal

7 Power-Capping Is Not New
- Chip vendors are starting to build this into their CPUs; most prominent: Intel's RAPL for CPUs and memory. Vendors are looking into other system components as well.
- State of the art in GPUs: limited by power draw over the PCI bus; GPUs can clock themselves down, but this is typically not exposed.
- Open questions: How do we apply it to large-scale parallel applications? What is the impact on performance, and how can we mitigate it?
- Problem definition: given a job-level power constraint, how do we optimize application performance?
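Concretely, on Linux these RAPL controls surface through the powercap sysfs tree, so a cap can be read and set without vendor tools. A minimal sketch, assuming the intel_rapl driver is loaded, package 0 appears as intel-rapl:0, and the process may write to sysfs (typically root):

```c
/* Minimal sketch: read the cumulative package energy counter and apply a
 * 65 W package cap through the Linux powercap sysfs interface. */
#include <stdio.h>

#define RAPL "/sys/class/powercap/intel-rapl:0"

int main(void) {
    long long uj = -1;
    FILE *f = fopen(RAPL "/energy_uj", "r");         /* energy in microjoules */
    if (f) { fscanf(f, "%lld", &uj); fclose(f); }
    printf("package energy: %lld uJ\n", uj);

    f = fopen(RAPL "/constraint_0_power_limit_uw", "w"); /* cap in microwatts */
    if (!f) { perror("set power cap"); return 1; }
    fprintf(f, "%lld", 65LL * 1000000);
    fclose(f);
    return 0;
}
```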

8 Naïve Scheme: Uniform, Static Allocation
- Equally distribute: enforce the power constraint over all nodes of a job; can use Running Average Power Limit (RAPL)
- Statically select a configuration under the power constraint. Configuration: number of cores, core positioning. Commonly used: packed configuration with the maximum cores possible under the power constraint
- Leads to several severe problems
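In code, the uniform scheme is simple arithmetic, which is exactly why it misses the issues below. A sketch under assumed names; set_socket_cap_watts stands in for a RAPL write like the one above:

```c
/* Sketch of the naive scheme: give every socket in the job an equal share
 * of the job-level power bound as a static cap. Node and socket counts are
 * illustrative parameters. */
#include <stdio.h>

static void set_socket_cap_watts(int socket, double watts) {
    printf("socket %d capped at %.1f W\n", socket, watts); /* RAPL write stub */
}

void uniform_static_allocation(double job_bound_watts,
                               int nodes, int sockets_per_node) {
    double per_socket = job_bound_watts / (nodes * sockets_per_node);
    for (int s = 0; s < sockets_per_node; s++)   /* repeated on every node */
        set_socket_cap_watts(s, per_socket);
}
```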

9 1st Issue: Node/Core/Cap Configurations Matter
- Experiment: sp-mz on 32 nodes of LLNL's TLCC system under a global 4500W power bound, varying nodes (8-32), cores (4-16), and per-processor power bounds (51, 65, 80, 95, 115 W)
- Naïve configuration (red): running all cores, which limits the number of nodes
- Best configuration (blue): moderate per-node power bound, reduced number of cores
- Difference: > 2x
- Consequences: determining the right configuration is critical; intuition is insufficient, so we need models
[Chart: execution time in seconds vs. processor power bound in watts]
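Since intuition is insufficient, even a brute-force sweep of the configuration space is a useful baseline before models exist. A sketch using the slide's ranges; run_and_time is a hypothetical hook that launches the benchmark at a given configuration (the dummy body only makes the sketch self-contained):

```c
/* Sketch: brute-force the (nodes, cores, cap) space under a global bound
 * and keep the fastest measured configuration. */
#include <stdio.h>

static double run_and_time(int nodes, int cores, int cap_w) {
    return 1000.0 / (nodes * cores) + 5000.0 / cap_w;  /* dummy stand-in */
}

void search_configurations(double global_bound_w) {
    const int caps[] = {51, 65, 80, 95, 115};
    double best_t = 1e300;
    int bn = 0, bc = 0, bw = 0;
    for (int nodes = 8; nodes <= 32; nodes *= 2)
        for (int cores = 4; cores <= 16; cores *= 2)
            for (int i = 0; i < 5; i++) {
                if (nodes * 2 * caps[i] > global_bound_w) /* 2 sockets/node */
                    continue;
                double t = run_and_time(nodes, cores, caps[i]);
                if (t < best_t) { best_t = t; bn = nodes; bc = cores; bw = caps[i]; }
            }
    printf("best: %d nodes, %d cores, %d W cap (%.1f s)\n", bn, bc, bw, best_t);
}
```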

10 Application and Bound Dependent

11 2nd Issue: Inhomogeneous Power Usage
- Experiment: mg.C.8 on dual-socket, 8-core nodes, 64 processors, 34 power bounds
- Power caps lead to unintended noise; power variations turn into load imbalance
- Question: How can we compensate for this?

12 Large Scale Power Capping: 95W
- Census across 2,386 Sandy Bridge processors; mg (multigrid) runs at 105W, ep (embarrassingly parallel) runs at 90W
[Chart: one point per processor in the system; x-axis: slowdown normalized to the fastest unbounded run; y-axis: CPU clock frequency; slowest processors circled]

13 Large Scale Power Capping: 80/65W [Chart: CPU clock frequency vs. normalized slowdown]

14 Large Scale Power Capping: < 51W [Chart: CPU clock frequency vs. normalized slowdown]

15 Large Scale Power Capping: Summary
- Power capping makes systems heterogeneous: we need more flexible task scheduling and the ability to absorb slack, and this must be taken into account during load balancing
- Slowdown under power caps is application specific: we can't use a single knob for processor/silicon efficiency; we need to compensate per application and per CPU

16 Variation-Aware Power-Budgeting Strategy
Compensate the power budget on each processor:
1. Pre-characterize the power variation of each processor
2. Profile the power-performance trend of the target application on a single node
3. Create an application-dependent power-performance model taking variation into account
4. Calculate the power budget for each node that maximizes performance
[Workflow: source-code analysis inserts PMMDs (Power Measurement & Management Directives) into the HPC application; test runs on a single module with the application input data produce an application power profile and an application-dependent power-model table for all modules; together with an application-independent system power-variation table, the application-level power constraint, and the module allocation from the scheduler, the variation-aware power-budgeting algorithm outputs module-level power allocations for the final application runs]
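The budgeting step can be read as an optimization: find the highest common performance level whose summed per-module power demand fits the constraint, so inefficient processors receive more power and all modules run at the same speed. A minimal sketch assuming a linear per-module power model; the model, module count, and coefficients are illustrations, not the actual model used:

```c
/* Sketch: variation-aware budgeting via binary search on a common
 * performance level. power_for_perf() uses an assumed linear model. */
#include <stdio.h>

#define N 4  /* modules */

static const double base[N]  = {40, 42, 38, 45};  /* illustrative variation */
static const double slope[N] = {60, 70, 55, 75};

static double power_for_perf(int i, double perf) {
    return base[i] + slope[i] * perf;  /* watts needed for perf in [0,1] */
}

void budget(double constraint_w, double alloc_w[N]) {
    double lo = 0.0, hi = 1.0;
    for (int it = 0; it < 40; it++) {  /* binary search on the perf level */
        double mid = 0.5 * (lo + hi), sum = 0.0;
        for (int i = 0; i < N; i++) sum += power_for_perf(i, mid);
        if (sum <= constraint_w) lo = mid; else hi = mid;
    }
    for (int i = 0; i < N; i++) {
        alloc_w[i] = power_for_perf(i, lo); /* inefficient modules get more */
        printf("module %d: %.1f W\n", i, alloc_w[i]);
    }
}
```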

17 Variation-Aware Power Budgeting: Results

18 3rd Issue: Application Imbalance Can Lead to Inefficient Power Usage
- ParaDiS: average power usage differs significantly across nodes, wasting unused power on some nodes
- Difference in power usage of up to 36%!

19 The Need for a Power-Aware Runtime
- Requirements to select a performance-optimizing configuration: cannot violate the power constraint; constrain the overhead of reallocating power; do not slow down the critical path; low-overhead, on-line profiling of the configuration space
- Need for runtime configuration and optimization: application and input dependent, hard to predict and model a priori; system variability depends on the actual hardware resources
- Conductor: a runtime system for efficient power allocation

20 Conductor Step I: Configuration Selection
- Configuration = thread level / power setting
- Profile the configuration space on-line: run each computation operation on individual nodes at distinct configurations; record the power/performance characteristics of each configuration
- Intended for iterative applications with predictable computation behavior

21 Conductor Step II: Power Reallocation
- How can we allocate power to the critical operations in an application and improve performance?
[Diagram: timelines of processes P0 and P1 before and after reallocation; before, the non-critical computation finishes early and sits in slack at the barrier; after, its power moves to the critical computation, which reaches the barrier sooner]

22 Step 1: Explore configurations
- Each MPI process runs a distinct configuration k_1, k_2, ..., k_n
- An Allreduce collects the {power, execution time} profile of every configuration
- Perform this step for each OpenMP region in a time step
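A minimal sketch of this explore step; the slide names an Allreduce, and an Allgather is used here since any collective that makes every {power, time} pair globally visible serves. The profiled region and the power measurement are stubs:

```c
/* Sketch of Conductor's configuration exploration: rank i times the target
 * OpenMP region under its own candidate configuration, then all profiles
 * are made visible on every rank. */
#include <mpi.h>
#include <omp.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Candidate configuration k_i: here simply i+1 OpenMP threads. */
    omp_set_num_threads(rank % omp_get_max_threads() + 1);

    double t0 = MPI_Wtime();
    #pragma omp parallel
    { /* ... the OpenMP region being profiled ... */ }
    double t1 = MPI_Wtime();

    double mine[2] = { 0.0 /* measured power (stub) */, t1 - t0 };
    double *all = malloc(2 * size * sizeof(double));
    MPI_Allgather(mine, 2, MPI_DOUBLE, all, 2, MPI_DOUBLE, MPI_COMM_WORLD);
    /* every rank now holds {power, time} for all n configurations */

    free(all);
    MPI_Finalize();
    return 0;
}
```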

23 Step 2: Select configuration
- Construct the Pareto frontier over the explored {power, execution time} points
- Select configuration k_OPT
- Example: 40W node-level power constraint
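One way to realize the selection, as a sketch: sort the explored configurations by power, keep the points that still improve execution time (the Pareto frontier), and take the fastest frontier point under the node-level constraint:

```c
/* Sketch: build the power/time Pareto frontier and return k_OPT, the
 * fastest configuration whose power fits the constraint. */
#include <stdlib.h>

typedef struct { double power, time; int id; } cfg_t;

static int by_power(const void *a, const void *b) {
    double d = ((const cfg_t *)a)->power - ((const cfg_t *)b)->power;
    return (d > 0) - (d < 0);
}

int select_k_opt(cfg_t *c, int n, double p_constraint) {
    qsort(c, n, sizeof(cfg_t), by_power);
    int k_opt = -1;
    double best_time = 1e300;
    for (int i = 0; i < n; i++) {
        if (c[i].time >= best_time) continue; /* dominated: off the frontier */
        best_time = c[i].time;                /* frontier point */
        if (c[i].power <= p_constraint) k_opt = c[i].id;
    }
    return k_opt;  /* -1 if no configuration fits the constraint */
}
```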

24 Step 3: Save power in non-critical tasks
- Is the computation non-critical? If yes, slow it down into its slack
- Dropping a non-critical computation from P_constraint toward P_average frees power headroom under the constraint
[Diagram: per-computation power and time before and after, with slack absorbed and the freed headroom marked]
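A hedged sketch of slack absorption: stretch a non-critical computation to just fill its slack by lowering its cap, under the deliberately crude assumption that runtime scales inversely with power near the operating point:

```c
/* Sketch: absorb slack on a non-critical task by lowering its power cap.
 * The time ~ 1/power scaling is an illustrative assumption only. */
#include <stdio.h>

static void set_socket_cap_watts(int socket, double watts) {
    printf("socket %d -> %.1f W\n", socket, watts);  /* RAPL write stub */
}

void absorb_slack(double t_mine, double t_critical, double cap_w) {
    double slack = t_critical - t_mine;
    if (slack <= 0.0) return;  /* on the critical path: keep full power */
    /* Slow down so the computation finishes just before the barrier;
     * the freed headroom (cap_w - new_cap) can go to critical ranks. */
    double new_cap = cap_w * t_mine / (t_mine + slack);
    set_socket_cap_watts(0, new_cap);
}
```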

25 Step 4: Monitor power usage
[Flowchart: after configuration selection and slack absorption, actual power usage is tracked between start-monitor and end-monitor points in each time step]

26 Step 5: Reallocate power
- Calculate the new power headroom and distribute the new power allocation
[Charts: ParaDiS per-node power usage before and after power reallocation]

27 Experimental Setup
- Test system: cab cluster at LLNL, an Intel Sandy Bridge cluster with 1296 nodes yielding close to ½ PFLOP/s; dual-socket, 8-core Intel Xeon E processors; InfiniBand QDR interconnect; socket-level power management interface (e.g., RAPL)
- Benchmarks (MPI+OpenMP):

  Benchmark       Input        Iterations
  BT-MZ, SP-MZ    Class D      500
  LULESH          32 x 32 x
  ParaDiS         Copper       500
  CoMD            40 x 40 x

- Up to 512 cores: 64 MPI processes, 8 OpenMP threads per process

28 Evaluation: Configuration Selection Up to 30% speedup over Static scheme Benefits configuration-sensitive applications

29 Evaluation: Power Reallocation Up to 13% speedup over Static scheme Benefits from process-level imbalance of power usage

30 System-Wide Power Optimization
- Power is a global resource: the system power cap must be divided among jobs; static division results in power fragmentation; dynamic management can utilize open resources
- Power-aware resource management: power as a controlled resource that is allocated, combined with user input
- Combined with runtime adaptation: detection and reallocation of unused power as part of a global operating system; transparent to the application; need to maintain fairness
[Diagram: power-management hierarchy of System, Enclave, Boards, and Nodes, built on libmsr]

31 Application-Agnostic Tuning in a Global OS [HPC system model diagram; picture courtesy of Daniel Ellsworth]

32 POWsched: The Basic Idea
- Monitor the power usage of all applications in the system: OS-level power monitors; scalable aggregation in a hierarchy representing applications and a hierarchy representing the system
- Detect wasted power: applications that don't use their allocation
- Detect needed power: applications that run against their power limit
- Shift wasted power to the applications that need it
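A sketch of one such reallocation step; the job record, the margin used to classify jobs, and the even split of the reclaimed pool are illustrative assumptions, not POWsched's actual policy:

```c
/* Sketch of a POWsched-style step: reclaim allocation that jobs leave
 * unused and grant it to jobs pressing against their limit. */
typedef struct { double alloc_w, used_w; } job_t;

void powsched_step(job_t *jobs, int n, double margin_w) {
    double pool = 0.0;
    int needy = 0;

    /* Pass 1: classify jobs and reclaim clearly wasted power. */
    for (int i = 0; i < n; i++) {
        double headroom = jobs[i].alloc_w - jobs[i].used_w;
        if (headroom > margin_w) {               /* wasting power */
            pool += headroom - margin_w;
            jobs[i].alloc_w -= headroom - margin_w;
        } else if (headroom < margin_w * 0.5) {  /* running at its limit */
            needy++;
        }
    }

    /* Pass 2: share the reclaimed pool among power-limited jobs. */
    for (int i = 0; i < n && needy > 0; i++) {
        double headroom = jobs[i].alloc_w - jobs[i].used_w;
        if (headroom < margin_w * 0.5)
            jobs[i].alloc_w += pool / needy;
    }
}
```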

33 POWsched Results: 50W Static and Dynamic

34 POWsched Results: Static vs Dynamic

35 Conclusions
- We need dynamic solutions to deal with power limits: power is a dynamic property, and on-line adaptation and tuning must be part of the solution
- Discussed projects: sources and effects of variability; critical-path power scaling; global power redistribution
- Common thread: steer power to where it is needed (to slower processors, the critical path, power-limited processes); monitor, determine headroom, redistribute headroom, and repeat
- Where to go next? Finer-grained hardware control (beyond sockets); more application configuration options (different code paths and algorithms depending on capping levels, which increases the search space); integration of and cooperation between job and system levels

36 PIPER Project: The Scalability Team
- Postdocs and staff: Abhinav Bhatele, Kathryn Mohror, Todd Gamblin, Ignacio Laguna, David Beckingsale, David Boehme, Murali Emani, Tanzima Islam, Barry Rountree, Martin Schulz, Aniruddha Marathe, Tapasya Patki, Kento Sato, Jae-Seung Yeom
- Performance analysis tools and optimization
- Correctness and debugging (incl. STAT, AutomaDeD, MUST)
- Power-aware and power-limited computing (incl. Adagio, Conductor)
- Resilience and checkpoint/restart (incl. SCR)
