Performance and Power Co-Design of Exascale Systems and Applications

1. Performance and Power Co-Design of Exascale Systems and Applications
Adolfy Hoisie
Work with Kevin Barker, Darren Kerbyson, Abhinav Vishnu
Performance and Architecture Lab (PAL), Pacific Northwest National Laboratory
5th Parallel Tools Workshop, Dresden, September 27, 2011

2. Outline
- Static performance modeling
- Dynamic modeling
- Modeling for Exascale
- Tentative conclusions

3. The fallacy of simple metrics: efficiency
Example 1: Efficiency of applications
  Solver    | % Peak | Time (s)
  Original  | 64     | 29.8
  Optimized | 25     | 8.2
The optimized solver is ~3.6x faster despite achieving a much smaller fraction of peak.
Example 2: Efficiency of systems
- Code A on Machine X (500 MFLOPS peak per CPU, 2 FLOPS per CP): Time = 522 s; 26.1 MFLOPS (5.2% of peak)
- Code A on Machine Y (3600 MFLOPS peak per CPU, 4 FLOPS per CP): Time = 91.1 s; ~112 MFLOPS (3.1% of peak)
The faster system shows the lower "efficiency": percentage of peak says nothing about time to solution.
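To make Example 2 concrete, a quick arithmetic check (the ~112 MFLOPS figure for Machine Y is back-computed from the quoted 3.1% of a 3600 MFLOPS peak, since the slide elides the value):

```python
# Back-of-envelope check of the "efficiency fallacy" in Example 2.
# Rates in MFLOPS, times in seconds; peak is per CPU.
machines = {
    "X": {"peak_mflops": 500.0,  "time_s": 522.0, "mflops": 26.1},
    "Y": {"peak_mflops": 3600.0, "time_s": 91.1,  "mflops": 0.031 * 3600.0},
}

for name, m in machines.items():
    pct_peak = 100.0 * m["mflops"] / m["peak_mflops"]
    print(f"Machine {name}: {m['mflops']:6.1f} MFLOPS "
          f"= {pct_peak:.1f}% of peak, time = {m['time_s']:.1f} s")

# The machine with the *lower* efficiency wins on the metric that matters:
print(f"Speedup of Y over X: {522.0 / 91.1:.1f}x")
```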

4. Rough taxonomy of modeling
- Simulation
  » Greatest architectural flexibility, but impractical for real applications
- Trace-driven experiments
  » Results often lack generality
- Quasi-analytical modeling
  » Can tackle full apps on full machines
  » Uses a set of input knobs
  » Tool-neutral
- Benchmarking
  » Limited to the current implementation of the code
  » Limited to currently-available architectures
  » Difficult to distinguish between real performance and machine idiosyncrasies

5. Attributes of a Performance Model
- Encapsulates application behavior
- Abstracts the application into communication and computation components
- Focuses on first-order effects, ignoring distracting details
- Separates performance concerns:
  » Inherent properties of application structure (e.g., data dependencies)
  » System performance characteristics (e.g., MPI latency)
- Execution: Code + System + problem configuration -> Code Model + System Model -> Performance Prediction
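A minimal sketch of what this separation of concerns looks like in practice, with entirely illustrative parameter values (this is not one of the PAL models, just the general shape of a quasi-analytical model):

```python
# Quasi-analytical model sketch: application structure and system
# characteristics live in separate parameter sets ("input knobs").
# All names and values here are illustrative.

# Application model: inherent structure, per iteration, weak scaling.
app = {
    "flops_per_cell": 120.0,       # computation per grid cell
    "cells_per_rank": 100 ** 3,    # local subgrid
    "neighbor_msgs": 6,            # 3D halo exchange
    "msg_bytes": 8 * 100 ** 2,     # one face of doubles
}

# System model: measured (or promised) machine characteristics.
machine = {
    "flops_per_s": 2.0e9,          # sustained rate from microbenchmarks
    "mpi_latency_s": 4.0e-6,
    "mpi_bw_Bps": 1.6e9,
}

def predicted_time(app, machine):
    t_comp = app["flops_per_cell"] * app["cells_per_rank"] / machine["flops_per_s"]
    t_comm = app["neighbor_msgs"] * (machine["mpi_latency_s"]
                                     + app["msg_bytes"] / machine["mpi_bw_Bps"])
    return t_comp + t_comm

print(f"predicted iteration time: {predicted_time(app, machine) * 1e3:.2f} ms")
```

Changing only the machine dictionary predicts the same code on a different system; changing only the app dictionary models a different code on the same system.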

6. A Performance Modeling Process Flow
- From the code, identify application characteristics: data structures, decomposition, parallel activities, frequency of use, memory usage
- From the system(s), acquire performance characteristics: run microbenchmarks on the system, use specifications, use future (promised) performance
- Combine the two to construct (or refine) the application model
- Validate: run the code on the system and compare the model to measurements; once they agree, the model can be trusted
- Use the model to:
  » Test new configurations (HW and/or SW)
  » Verify current performance
  » Compare systems
  » Propose future systems
  » Determine SW parameters
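The "acquire performance characteristics" step is typically a handful of microbenchmarks. A minimal ping-pong sketch of the kind used to obtain MPI latency and bandwidth (assumes mpi4py is available; run with `mpiexec -n 2 python pingpong.py`):

```python
# Ping-pong microbenchmark: one-way message time for several sizes.
from mpi4py import MPI
import time

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
reps = 1000

for nbytes in (8, 1024, 1024 * 1024):
    buf = bytearray(nbytes)
    comm.Barrier()
    t0 = time.perf_counter()
    for _ in range(reps):
        if rank == 0:
            comm.Send(buf, dest=1)
            comm.Recv(buf, source=1)
        else:
            comm.Recv(buf, source=0)
            comm.Send(buf, dest=0)
    t1 = time.perf_counter()
    if rank == 0:
        one_way = (t1 - t0) / (2 * reps)   # half the round-trip time
        print(f"{nbytes:8d} B: {one_way * 1e6:8.2f} us/msg, "
              f"{nbytes / one_way / 1e9:.2f} GB/s")
```

The small-message time approximates latency; the large-message rate approximates bandwidth. These become the "MPI latency" and "bandwidth" knobs of the system model.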

7. Partial list of modeled systems & codes
Machines: ASCI Q, ASCI Blue Mountain, ASCI White, ASCI Red, Cray T3E, Earth Simulator, Itanium-2 cluster, BlueGene/L, BlueGene/P, Cray X-1, ASC Red Storm, ASC Purple, IBM PERCS, IBM Blue Waters, ClearSpeed accelerators, SiCortex SC5832, Roadrunner, Jaguar, ...
Codes: SWEEP3D, SAGE, TYCHO, Partisn, LBMHD, HYCOM, MCNP, POP, KRAK, RF-CTH, CICE, S3D, VPIC, GTC, ...

8. Modeling in action as a co-design process
- Flow: application(s) -> IBM PERCS simulator -> simulated run-time (1 PE, 1 chip) -> PNNL performance model -> large-scale performance predictions
- System design parameters: cores per chip, network topology, latency, bandwidth, contention
- Modeling was used to explore and guide the design of PERCS using an application suite (HPCS phases 1 & 2)
- The design feedback loop was exercised with increasing speed
- Numerous configurations and options were explored

9. Topology comparison through co-design
- Example: 2,048-PE job (256-node system, 64-way nodes)
- Networks compared: FC (fully-connected, 1-hop), OCS (1-hop or 2-hop), 2D and 3D meshes, FT (fat-tree), OCS-D (OCS-Dynamic)
- Best hardware latency of 50 ns, 4 GB/s links
(Chart: runtime ratio of each network relative to the best-performing network — FC, OCS-FC 1-hop, OCS-FC 2-hop, OCS-D, 2D, 3D, FT — for HYCOM, LBMHD, RF-CTH2, KRAK, SAGE, Sweep3D, and POP)

10. Modeling as a co-design tool
Where is the time being spent (using probable HW parameters)?
- ~63% compute on Cell
- ~20% latency (Cell <-> AMD)
- ~5% bandwidth (Cell <-> AMD)
- ~8% latency (InfiniBand)
- ~3% bandwidth (InfiniBand)
Observations:
- The compute pipeline is unavoidable
- Latency dominates communication, and Cell <-> AMD is its major component
(Chart: stacked time breakdown — inter-node bandwidth and latency, AMD <-> Cell bandwidth and latency, Compute_Pipe (Cell), Compute_Block (Cell) — versus node count, from 1CU to 18CU)

11. An example of modeling in action
Assumptions (hypothetical system):
- Weak scaling, assumed subgrids, assumed processing time per cell
- Inter-PE (on accelerator): bandwidth = 1 GB/s, latency = 50 ns
- Inter-node (MPI): bandwidth = 1.6 GB/s, latency = 4 µs
- At the largest scale: 16,384 compute processors and 16,384 accelerators
Result: performance improvement is ~3.5x when using accelerators with 128x more PEs
(Chart: cycle time (ms) vs. compute-processor count, for accelerator (AD) counts of 1, 2, 4, 8, 16, 32, 64, and 128 PEs)
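The core of such a study is a first-order latency/bandwidth cost model. A sketch using the slide's assumed parameters (the message sizes are made up for illustration):

```python
# First-order message cost: time = latency + bytes / bandwidth.
def xfer_time(nbytes, latency_s, bw_Bps):
    return latency_s + nbytes / bw_Bps

inter_pe   = {"latency_s": 50e-9, "bw_Bps": 1.0e9}  # on-accelerator (slide)
inter_node = {"latency_s": 4e-6,  "bw_Bps": 1.6e9}  # MPI (slide)

for nbytes in (64, 4096, 1 << 20):
    t_pe = xfer_time(nbytes, **inter_pe)
    t_nd = xfer_time(nbytes, **inter_node)
    print(f"{nbytes:>8} B: on-accel {t_pe * 1e6:7.2f} us, MPI {t_nd * 1e6:7.2f} us")

# Small messages are latency-bound (4 us vs 50 ns is an 80x gap), large
# ones bandwidth-bound -- which is why both knobs appear in the model.
```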

12. Challenges ahead: performance from concurrency, with faults and power

                        1st Petascale (Roadrunner)   1st Exascale (?, two swim lanes)
  System peak           1.4 PF                       1 EF
  Power                 2.5 MW                       20 MW
  System memory         0.3 PB                       ? PB
  Node performance      425 GF                       1 TF / 10 TF
  Node concurrency      40                           O(1,000) / O(10,000)
  System size (nodes)   3,240                        O(1M) / O(100K)
  System concurrency    128,160                      ~1 billion
  MTTI                  days                         < 1 day

- System architecture: connectivity
- Technology innovations: chip architecture, chip stacking, optical networks
- Multi-dimensional: performance + power + resilience

13. Economics show the shift in importance from performance to including power & FT
- Current predictions of exaflop system power requirements (MW) from Nvidia (05/11), UHPC (2010), Intel (03/11), and IBM (12/10, BlueGene) range from the 20 MW DOE goal to 400+ MW
- Expected energy cost per year: at best $20M (at $1M per MW-year)
- If the system costs $100M, then more than half the total cost (over a 5-year system life) will be energy
What can you do with a nJ?
- ~30 flops = 1 data movement on chip
- ~60 flops = 1 data movement off-chip
- It's all about the data movement
- Locality, locality, locality
(Data based on B. Dally, IPDPS keynote, May 2011; accompanying energy-per-operation chart not reproduced)
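The cost claim is worth spelling out; a short check of the slide's arithmetic (the $1M per MW-year price is the slide's own round figure):

```python
# Energy economics at the DOE exascale power goal.
power_mw = 20.0               # DOE goal for an exaflop system
dollars_per_mw_year = 1.0e6   # slide's round figure
system_cost = 100.0e6         # hypothetical $100M machine
lifetime_years = 5

energy_cost = power_mw * dollars_per_mw_year * lifetime_years
print(f"Energy over {lifetime_years} yr: ${energy_cost / 1e6:.0f}M "
      f"vs ${system_cost / 1e6:.0f}M for the machine")
# -> $100M of electricity for a $100M system: energy is at least half
#    of the total cost of ownership.

# The nanojoule budget: if 1 nJ buys ~30 flops or ~1 on-chip data
# movement, a flop costs ~33 pJ while an on-chip move costs ~1 nJ.
print(f"~{1000 / 30:.0f} pJ per flop vs ~1000 pJ per on-chip move")
```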

14. Towards Exascale: exploration of deep memory hierarchies
Architectural factors:
- Swim lanes: multi-core vs. heterogeneous
- Fused CPU/GPU will impact memory performance
- Deeper memory hierarchies, for power as well as for performance
Application factors:
- Greater concurrency, greater locality, less synchronization
- Greater focus on data/memory factors
- Memory-access phase behavior indicates potential power-saving windows of opportunity
- Less frequently used pages can be migrated to low-power memory
(Chart: histograms of page-access frequency (Hz) for GTC and SAGE)
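The page-migration idea in miniature, as a sketch (the access counts, sampling window, and coldness threshold are all invented for illustration):

```python
# Classify pages by access frequency and assign cold ones to a
# hypothetical low-power memory pool.
from collections import Counter

page_hits = Counter({0x1000: 5000, 0x2000: 3, 0x3000: 0, 0x4000: 900})
window_s = 10.0   # sampling window
COLD_HZ = 1.0     # hypothetical cutoff frequency

for page, hits in sorted(page_hits.items()):
    freq = hits / window_s
    pool = "low-power" if freq < COLD_HZ else "fast"
    print(f"page {page:#x}: {freq:7.1f} Hz -> {pool} memory")
```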

15. Changes of direction in modeling
- Performance at what cost?
- Reliability at what cost?
- Looking at performance, power, and reliability together leads to multi-dimensional optimizations:
  » Trade-offs
  » Performance at what power
  » Reliability at what power
  » Data-movement costs
  » Power steering
(Diagram: triad of Performance, Power, and Reliability)

16. Co-Design of power-constrained systems
- Feedback loop: measurement/simulation -> modeling -> optimization
- Modeling can be used to quantify power consumption as well as performance
- Measurement of current components; simulation of future technologies
- Optimization directed by modeled predictions
- The feedback cycle can represent both off-line and on-line activities:
  1. Static design-space exploration
  2. Dynamic application/resource steering
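Static design-space exploration (activity 1) can be as simple as sweeping a parameterized model under a power cap. A minimal sketch; the performance and power models and all constants are illustrative placeholders:

```python
# Sweep node count and clock frequency; keep the fastest configuration
# whose modeled power fits the budget.
import itertools

POWER_CAP_W = 2.0e6   # hypothetical system power budget

def predicted_time_s(nodes, ghz):
    work = 1.0e18 / (nodes * ghz * 1e9 * 8)   # total flops / aggregate rate
    comm = 1e-3 * nodes ** 0.5                # crude communication penalty
    return work + comm

def predicted_power_w(nodes, ghz):
    return nodes * (50.0 + 40.0 * ghz ** 2)   # static + ~f^2 dynamic term

best = None
for nodes, ghz in itertools.product((1024, 4096, 16384), (1.0, 2.0, 3.0)):
    p, t = predicted_power_w(nodes, ghz), predicted_time_s(nodes, ghz)
    if p <= POWER_CAP_W and (best is None or t < best[0]):
        best = (t, nodes, ghz, p)

t, nodes, ghz, p = best
print(f"best feasible: {nodes} nodes @ {ghz} GHz -> {t:.0f} s at {p / 1e6:.2f} MW")
```

Dynamic steering (activity 2) runs the same kind of loop on-line, with measurements replacing some of the model terms.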

17. Measuring Power Today
- Without specialized hardware, direct power measurement is not possible, so indirect methods have been proposed:
  » Determining power from temperature: processor temperature is easy to measure, but it is difficult to correlate temperature with activity
  » Determining power from performance counters: the relationship between processor activity and power is complex
- For higher accuracy, dedicated measurement hardware is needed
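A common form of the counter-based approach fits a linear model from counter rates to metered power, then estimates power where no meter exists. A sketch; the counters chosen and all coefficients are illustrative, and any real model must be recalibrated per architecture:

```python
# Fit power ~ a * (instructions/s) + b * (LLC misses/s) + c.
import numpy as np

# Calibration samples: columns = instr/s, LLC misses/s, constant term.
counters = np.array([
    [2.0e9, 1.0e6, 1.0],
    [3.5e9, 8.0e6, 1.0],
    [1.0e9, 2.0e7, 1.0],
    [2.8e9, 4.0e6, 1.0],
])
measured_watts = np.array([55.0, 80.0, 62.0, 70.0])  # from an external meter

coeffs, *_ = np.linalg.lstsq(counters, measured_watts, rcond=None)

# Estimate power for a new counter sample:
sample = np.array([3.0e9, 5.0e6, 1.0])
print(f"estimated power: {sample @ coeffs:.1f} W")
```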

18. Measuring Power Today (continued)
Power measurement hardware comes in two flavors:
- External to the compute node (e.g., Watts Up):
  » The measurement device sits between the power socket and the compute node
  » Often relatively inexpensive (O($100)) and scalable to clusters
  » Typically low temporal and spatial fidelity (e.g., 1 Hz; cannot separate consumed power on a per-component basis)
- Internal:
  » Home-grown solutions requiring surgery inside the node
  » Single-node solutions, not scalable to clusters
  » Hardware vendors use custom boards that are not available to the research community
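A sketch of what logging such an external meter looks like, and why the 1 Hz fidelity matters; `read_watts()` is a hypothetical stand-in for whatever interface a given meter exposes:

```python
# Sample an external power meter at its ~1 Hz update rate. Application
# phases shorter than about a second are invisible at this fidelity.
import random
import time

def read_watts():
    """Placeholder for a real meter read (serial port, USB, vendor tool)."""
    return 250.0 + random.uniform(-10.0, 10.0)

samples = []
t_end = time.time() + 10.0        # log for ten seconds
while time.time() < t_end:
    samples.append((time.time(), read_watts()))
    time.sleep(1.0)               # meter updates at ~1 Hz

avg = sum(w for _, w in samples) / len(samples)
print(f"{len(samples)} samples, average {avg:.1f} W")
```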

19. Where Do We Want To Be?
Tools at the single-node level:
- Where's my Power-PAPI? Extend the concept of performance counters to power counters
  » Valid power counters may vary by architecture
  » Determining power requires sampling voltage and current, which may limit temporal resolution and lead to stale data
- Software control
  » The PNNL-Power library has this capability, but measurements are coarse-grained
  » The goal is to associate measurements with software activities
- Requires close collaboration with the hardware community
Tools at the cluster level:
- Aggregating data across nodes within the cluster (including the network)
- Again, analogous to performance tools today
- Limits to scalability?
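What a "Power-PAPI" style, software-associated measurement might look like: a region construct that attributes energy to a named software activity. The API below is entirely hypothetical; no call here corresponds to an actual PAPI or PNNL-Power interface:

```python
# Hypothetical power-counter region: attribute energy consumed between
# entry and exit of a code region to that region's name.
import time
from contextlib import contextmanager

def read_energy_uj():
    """Hypothetical cumulative energy counter, in microjoules."""
    return int(time.time() * 1e6)   # placeholder monotonic source

@contextmanager
def power_region(name):
    e0, t0 = read_energy_uj(), time.perf_counter()
    yield
    e1, t1 = read_energy_uj(), time.perf_counter()
    joules, secs = (e1 - e0) / 1e6, max(t1 - t0, 1e-9)
    print(f"[{name}] {joules:.3f} J over {secs:.3f} s = {joules / secs:.1f} W")

with power_region("solver_phase"):
    sum(i * i for i in range(10 ** 6))   # stand-in for application work
```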

20. Expanding the Modeling Methodology to Include Power
- Power modeling at scale is similar to performance modeling:
  » Application behaviors are in common
  » Resource metrics differ (time, power, etc.)
- Obtaining characteristics will be different, e.g.:
  » Cycle-accurate simulation + micro-benchmarks + ... for performance
  » Cycle-accurate power simulation + micro-benchmarks + ... for power
- Mirror the performance approach, e.g.:
  » Early design: estimate core, memory, and communication power
  » Later design: cycle-accurate power simulation and refined network/communication power
  » Implementation (small scale): measurement possible
  » Implementation (large scale): validation of system power
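Extending a slide-5 style model from time to energy is mostly bookkeeping: each modeled activity gets a power as well as a time, and static power accrues over the whole run. A sketch with placeholder numbers:

```python
# Energy = static power * total time + sum over activities of time * power.
activities = {
    # name: (time_s, active_power_w) per iteration, per node; illustrative
    "compute":  (0.060, 180.0),
    "mpi_comm": (0.0003, 90.0),
}
P_STATIC_W = 60.0   # assumed per-node static/leakage power

t_total = sum(t for t, _ in activities.values())
e_total = P_STATIC_W * t_total + sum(t * p for t, p in activities.values())
print(f"time {t_total * 1e3:.1f} ms, energy {e_total:.2f} J, "
      f"avg power {e_total / t_total:.1f} W per node")
```

Validation then compares modeled energy against small-scale measurement, mirroring the performance-model flow on slide 6.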

21. Issues
- Level of abstraction for modeling? Depends on the definition of system power and on validation against an existing system
- System space to be explored? Dimensions in the design space -> parameterization; range of the space of interest, and what would a baseline look like?
- Tool design and development
- Workload (of common interest?): use of many applications
- Analysis of the design space: power-budget allocation -> performance/energy optimization
- Analysis of dynamic possibilities: power steering
- Analysis: comparison to other possible future systems
- Use an iterative design flow

22. A few general remarks
- Modeling is applied in practice to system and application design, analysis, prediction, and testing
- Modeling is the quantitative tool of co-design
- Performance, power, and reliability modeling will be the triad on the path to Exascale
- Significant gaps exist in methodology development and practice
- Investment needs to accompany system and application development for Exascale
- Power/energy is not the sole domain of any one level of the stack, but we need dynamic, quantitative tools
