Overview of HPC and Energy Saving on KNL for Some Computations


1 Overview of HPC and Energy Saving on KNL for Some Computations
Jack Dongarra, University of Tennessee / Oak Ridge National Laboratory / University of Manchester, 2017

2 Outline
- Overview of High Performance Computing
- With extreme computing, the rules for computing have changed

3 Since 1993: H. Meuer, H. Simon, E. Strohmaier, & JD
- Listing of the 500 most powerful computers in the world
- Yardstick: Rmax from LINPACK MPP: Ax=b, dense problem; TPP performance (rate achieved for the largest problem size)
- Updated twice a year: SC xy in the States in November, meeting in Germany in June
- All data available from www.top500.org

4 Performance Development of HPC over the Last 24.5 Years from the TOP500
[Chart: aggregate (SUM), N=1, and N=500 performance per list, from Mflop/s to Eflop/s]
- Today: SUM = 749 PF, N=1 = 93 PF, N=500 = 432 TF
- In 1993: N=1 = 59.7 GFlop/s, N=500 = 400 MFlop/s
- A lag of roughly 6-8 years separates the curves
- My laptop: 166 Gflop/s

5 Performance Development (projected to 1 Eflop/s)
[Chart: N=1, N=500, and SUM trend lines extrapolated from 1 Gflop/s to 1 Eflop/s]
- Tflop/s (10^12) achieved: ASCI Red, Sandia National Lab. Today: an SW26010 or Intel KNL chip is ~3 Tflop/s.
- Pflop/s (10^15) achieved: RoadRunner, Los Alamos National Lab. Today: 1 cabinet of TaihuLight is ~3 Pflop/s.
- Eflop/s (10^18) achieved? China says 2020; the U.S. says 2021.

6 State of Supercomputing in 2017
- Pflops (> 10^15 Flop/s) computing fully established, with 138 such computer systems.
- Three technology architectures, or "swim lanes," are thriving:
  - Commodity (e.g., Intel)
  - Commodity + accelerator (e.g., GPUs) (91 systems)
  - Lightweight cores (e.g., IBM BG, Intel's Knights Landing, ShenWei 26010, ARM)
- Interest in supercomputing is now worldwide and growing in many new markets (~50% of Top500 computers are in industry).
- Exascale (10^18 Flop/s) projects exist in many countries and regions.
- Intel processors have the largest share, 92%, followed by AMD at 1%.

7 Countries' Share
China has 1/3 of the systems, while the number of systems in the US has fallen to the lowest point since the TOP500 list was created.
[Treemap: each rectangle represents one of the Top500 computers; the area of the rectangle reflects its performance.]

8 The 17 TOP500 Systems in France
- Pangea (Total Exploration Production, HPE): SGI ICE X, Xeon E5-2670 / E5-2680v3 12C 2.5GHz, Infiniband FDR
- Occigen2 (GENCI-CINES, Bull): bullx DLC 720, Xeon E5-2690v4 14C 2.6GHz, Infiniband FDR
- Prolix2 (Meteo France, Bull): bullx DLC 720, Xeon E5-2698v4 20C 2.2GHz, Infiniband FDR
- Beaufix2 (Meteo France, Bull): bullx DLC 720, Xeon E5-2698v4 20C 2.2GHz, Infiniband FDR
- Tera-1000-1 (Commissariat a l'Energie Atomique (CEA), Bull): bullx DLC 720, Xeon E5-2698v3 16C 2.3GHz, Infiniband FDR
- Sid (Atos, Bull): bullx DLC 720, Xeon E5-2695v4 18C 2.1GHz, Infiniband FDR
- Curie thin nodes (CEA/TGCC-GENCI, Bull): Bullx B510, Xeon E5-2680 8C 2.7GHz, Infiniband QDR
- Cobalt (CEA/CCRT, Bull): bullx DLC 720, Xeon E5-2680v4 14C 2.4GHz, Infiniband EDR
- Diego (Atos, Bull): bullx DLC 720, Xeon E5-2698v4 20C 2.2GHz, Infiniband FDR
- Turing (CNRS/IDRIS-GENCI, IBM): BlueGene/Q, Power BQC 16C 1.6GHz, Custom
- Tera-100 (CEA, Bull): Bull bullx super-node S6010/S6030
- Eole (EDF, Bull): Intel Cluster, Xeon E5-2680v4 14C 2.4GHz, Intel Omni-Path
- Zumbrota (EDF R&D, IBM): BlueGene/Q, Power BQC 16C 1.6GHz, Custom
- Occigen2 (Centre Informatique National (CINES), Bull): bullx DLC 720, Xeon E5-2690v4 14C 2.6GHz, Infiniband FDR
- Sator (ONERA, NEC): NEC HPC1812-Rg-2, Xeon E5-2680v4 14C 2.4GHz, Intel Omni-Path
- HPC4 (Airbus, HPE): HP POD, Cluster Platform BL460c, Intel Xeon E5-2697v2 12C 2.7GHz, Infiniband FDR
- PORTHOS (EDF R&D, IBM): IBM NeXtScale nx360 M5, Xeon E5-2697v3 14C 2.6GHz, Infiniband FDR

9 June 2017: The TOP 10 Systems
1. National Supercomputer Center in Wuxi, China: Sunway TaihuLight, SW26010 (260C) + Custom, 10,649,600 cores
2. National Supercomputer Center in Guangzhou, China: Tianhe-2, NUDT, Xeon (12C) + Intel Xeon Phi (57C), Custom, 3,120,000 cores
3. Swiss CSCS, Switzerland: Piz Daint, Cray XC50, Xeon (12C) + NVIDIA P100 (56C) + Custom, 361,760 cores
4. DOE / OS, Oak Ridge Nat Lab, USA: Titan, Cray XK7, AMD (16C) + NVIDIA Kepler GPU (14C) + Custom, 560,640 cores
5. DOE / NNSA, Lawrence Livermore Nat Lab, USA: Sequoia, BlueGene/Q (16C) + custom, 1,572,864 cores
6. DOE / OS, Lawrence Berkeley Nat Lab, USA: Cori, Cray XC40, Xeon Phi (68C) + Custom, 622,336 cores
7. Joint Center for Advanced HPC, Japan: Oakforest-PACS, Fujitsu Primergy CX1640, Xeon Phi (68C) + Omni-Path, 558,144 cores
8. RIKEN Advanced Inst for Comp Sci, Japan: K computer, Fujitsu SPARC64 VIIIfx (8C) + Custom, 705,024 cores
9. DOE / OS, Argonne Nat Lab, USA: Mira, BlueGene/Q (16C) + Custom, 786,432 cores
10. DOE / NNSA, Los Alamos & Sandia, USA: Trinity, Cray XC40, Xeon (16C) + Custom, 301,056 cores
(Also shown: an internet company in China running a Sugon system, Intel (8C) + NVIDIA.)
Slide callouts compare TaihuLight with the aggregate performance of all EU systems and of all US systems on the list.

10 Next Big Systems in the US DOE
- Oak Ridge Lab and Lawrence Livermore Lab to receive IBM- and NVIDIA-based systems: IBM Power9 + NVIDIA Volta V100.
  - Installation has started but is not yet complete; targeted at 5-10 times Titan's performance on applications.
  - 4,600 nodes, each containing six 7.5-teraflop NVIDIA V100 GPUs and two IBM Power9 CPUs; aggregate peak performance should be > 200 petaflops.
  - ORNL: maybe fully ready by June 2018 (TOP500 number).
- In 2021, Argonne Lab to receive an Intel-based system:
  - Exascale system, based on future Intel parts, not Knights Hill: Aurora 21.
  - Balanced architecture to support three pillars: large-scale simulation (PDEs, traditional HPC), data-intensive applications (science pipelines), and deep learning / emerging science AI.
  - Integrated computing, acceleration, storage.
  - Towards a common software stack.

11 Toward Exascale
China plans for exascale in 2020. Three separate developments in HPC, anything but from the US:
- Wuxi: upgrade the ShenWei, O(100) Pflops, all Chinese.
- National University for Defense Technology: Tianhe-2A, O(100) Pflops, will be a Chinese ARM processor + accelerator.
- Sugon (CAS ICT): x86 based; collaboration with AMD.
US DOE Exascale Computing Program, a 7-year program:
- Initial exascale system based on an advanced architecture, delivered in 2021.
- Enable capable exascale systems, based on ECP R&D, delivered in 2022 and deployed in 2023.
- Capable exascale: 50x the performance of today's 20 PF systems, 20-30 MW power, at most 1 perceived fault per week, software to support a broad range of applications.

12 US Department of Energy Exascale Computing Program
ECP has formulated a holistic approach that uses co-design and integration to achieve capable exascale across four areas: Application Development (science and mission applications), Software Technology (a scalable and productive software stack), Hardware Technology (hardware technology elements), and Exascale Systems (integrated exascale supercomputers).
Elements shown in the diagram include: applications co-design; correctness, visualization, and data analysis; resilience; programming models, development environments, and runtimes; system software (resource management, threading, scheduling, monitoring, and control); node OS and runtimes; math libraries and frameworks; tools; data management, I/O, and file systems; workflows; memory and burst buffer; hardware interface.
ECP's work encompasses applications, system software, hardware technologies and architectures, and workforce development.


14 The Problem with the LINPACK Benchmark
- HPL performance of computer systems is no longer strongly correlated to real application performance, especially for the broad set of HPC applications governed by partial differential equations.
- Designing a system for good HPL performance can actually lead to design choices that are wrong for the real application mix, or add unnecessary components or complexity to the system.

15 Many Problems in Computational Science Involve Solving PDEs: Large Sparse Linear Systems
Given a PDE, e.g. $-\Delta u + c \cdot \nabla u + \gamma u \equiv Pu = f$ over some domain $\Omega$ (where $P$ denotes the differential operator), plus boundary conditions:
- Discretization (e.g., Galerkin equations): find $u_h = \sum_{i=1}^{n} \Phi_i x_i$ such that $(P u_h, \Phi_j) = (f, \Phi_j)$ for every basis function $\Phi_j$, i.e. $\sum_{i=1}^{n} (P\Phi_i, \Phi_j)\, x_i = (f, \Phi_j)$, with $a_{ji} = (P\Phi_i, \Phi_j)$ and $b_j = (f, \Phi_j)$.
- This yields a sparse linear system $Ax = b$.
- The basis functions $\Phi_j$ typically have local support, leading to local interactions and hence sparse matrices; in the example mesh shown, row 1 of $A$ has only 6 non-zeros.
- Applications: modeling diffusion, fluid flow.
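As a minimal worked example of how such a discretization produces a sparse system (a 1D model problem is assumed here for illustration; the slide's example is multi-dimensional), consider $-u''(x) = f(x)$ on $(0,1)$ with $u(0) = u(1) = 0$, discretized by centered differences on $n$ interior points with spacing $h = 1/(n+1)$:

    \frac{-u_{i-1} + 2u_i - u_{i+1}}{h^2} = f_i, \qquad i = 1,\dots,n
    \qquad\Longrightarrow\qquad
    A x = b, \quad
    A = \frac{1}{h^2}
    \begin{pmatrix}
       2 & -1 &        &    \\
      -1 &  2 & \ddots &    \\
         & \ddots & \ddots & -1 \\
         &        & -1     &  2
    \end{pmatrix}

Each row of $A$ has at most 3 non-zeros; the 3D, 27-point analogue of this construction is exactly the kind of sparse, symmetric positive definite system that HPCG (next slide) sets up and solves.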

16 HPCG (hpcg-benchmark.org)
- High Performance Conjugate Gradients (HPCG). Solves Ax=b, with A large and sparse, b known, x computed.
- An optimized implementation of PCG contains essential computational and communication patterns that are prevalent in a variety of methods for the discretization and numerical solution of PDEs.
- Synthetic discretized 3D PDE (FEM, FVM, FDM). Sparse matrix: 27 nonzeros per interior row, fewer on boundary rows. Symmetric positive definite.
- Patterns: dense and sparse computations; dense and sparse collectives; multi-scale execution of kernels via a (truncated) multigrid V-cycle; data-driven parallelism (unstructured sparse triangular solves).
- Strong verification (via spectral properties of PCG).
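A minimal sketch of the conjugate gradient kernel pattern that HPCG exercises (a CSR sparse matrix-vector product, dot products, and vector updates), written here for the 1D model problem above and without the multigrid preconditioner or the 27-point 3D stencil that HPCG itself uses:

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    /* CSR sparse matrix-vector product: y = A*x */
    static void spmv(int n, const int *rowptr, const int *col, const double *val,
                     const double *x, double *y) {
        for (int i = 0; i < n; i++) {
            double s = 0.0;
            for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
                s += val[k] * x[col[k]];
            y[i] = s;
        }
    }

    static double dot(int n, const double *a, const double *b) {
        double s = 0.0;
        for (int i = 0; i < n; i++) s += a[i] * b[i];
        return s;
    }

    int main(void) {
        const int n = 1000;                       /* interior grid points */
        /* Build the 1D Poisson matrix (tridiagonal, at most 3 nonzeros/row) in CSR. */
        int *rowptr = malloc((n + 1) * sizeof(int));
        int *col = malloc(3 * n * sizeof(int));
        double *val = malloc(3 * n * sizeof(double));
        int nnz = 0;
        for (int i = 0; i < n; i++) {
            rowptr[i] = nnz;
            if (i > 0)     { col[nnz] = i - 1; val[nnz++] = -1.0; }
            col[nnz] = i; val[nnz++] = 2.0;
            if (i < n - 1) { col[nnz] = i + 1; val[nnz++] = -1.0; }
        }
        rowptr[n] = nnz;

        double *x = calloc(n, sizeof(double));    /* initial guess x = 0 */
        double *b = malloc(n * sizeof(double));
        double *r = malloc(n * sizeof(double));
        double *p = malloc(n * sizeof(double));
        double *Ap = malloc(n * sizeof(double));
        for (int i = 0; i < n; i++) b[i] = 1.0;   /* simple right-hand side */

        /* r = b - A*x = b (since x = 0); p = r */
        for (int i = 0; i < n; i++) { r[i] = b[i]; p[i] = r[i]; }
        double rr = dot(n, r, r);

        for (int it = 0; it < 1000 && sqrt(rr) > 1e-8; it++) {
            spmv(n, rowptr, col, val, p, Ap);     /* Ap = A*p             */
            double alpha = rr / dot(n, p, Ap);    /* step length          */
            for (int i = 0; i < n; i++) x[i] += alpha * p[i];
            for (int i = 0; i < n; i++) r[i] -= alpha * Ap[i];
            double rr_new = dot(n, r, r);
            double beta = rr_new / rr;            /* new search direction */
            for (int i = 0; i < n; i++) p[i] = r[i] + beta * p[i];
            rr = rr_new;
        }
        printf("final residual %e\n", sqrt(rr));
        free(rowptr); free(col); free(val); free(x); free(b); free(r); free(p); free(Ap);
        return 0;
    }

The SpMV and dot products are memory-bound with irregular access, which is why HPCG stresses a system so differently from the dense, compute-bound HPL.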

17 HPL vs. HPCG: Bookends (hpcg-benchmark.org)
- Some see HPL and HPCG as bookends of a spectrum.
- Application teams know where their codes lie on that spectrum.
- Performance on a system can be gauged using both its HPL and HPCG numbers.

18 HPCG Results, June 2017, Top 10
Rank. Site: Computer (HPCG as % of peak)
1. RIKEN Advanced Institute for Computational Science, Japan: K computer, SPARC64 VIIIfx 2.0GHz, Tofu interconnect (5.3%)
2. NSCC / Guangzhou, China: Tianhe-2, NUDT, Xeon 12C 2.2GHz + Intel Xeon Phi 57C + Custom (1.1%)
3. National Supercomputing Center in Wuxi, China: Sunway TaihuLight, Sunway MPP, SW26010 260C 1.45GHz, Sunway, NRCPC (0.4%)
4. Swiss National Supercomputing Centre (CSCS), Switzerland: Piz Daint, Cray XC50, Xeon E5-2690v3 12C 2.6GHz, Aries interconnect, NVIDIA P100, Cray (1.9%)
5. Joint Center for Advanced HPC, Japan: Oakforest-PACS, PRIMERGY CX600 M1, Intel Xeon Phi 68C 1.4GHz, Intel Omni-Path, Fujitsu (1.5%)
6. DOE/SC/LBNL/NERSC, USA: Cori, Cray XC40, Intel Xeon Phi 68C 1.4GHz, Cray Aries, Cray (1.3%)
7. DOE/NNSA/LLNL, USA: Sequoia, IBM BlueGene/Q, PowerPC A2 16C 1.6GHz, 5D Torus, IBM (1.6%)
8. DOE/SC/Oak Ridge National Lab, USA: Titan, Cray XK7, Opteron 16C 2.2GHz, Cray Gemini interconnect, NVIDIA K20x (1.2%)
9. DOE/NNSA/LANL/SNL, USA: Trinity, Cray XC40, Intel E5-2698v3, Aries custom, Cray (1.6%)
10. NASA / Mountain View, USA: Pleiades, SGI ICE X, Intel E5-2680, E5-2680v2, E5-2680v3, E5-2680v4, Infiniband FDR, HPE (2.5%)

19 Bookends: Peak and HPL Only
[Chart: Rpeak and Rmax in Pflop/s for the top systems.]

20 Bookends: Peak, HPL, and HPCG
[Chart: Rpeak, Rmax, and HPCG in Pflop/s for the top systems.]

21 Machine Learning in Computational Science
Many fields are beginning to adopt machine learning to augment modeling and simulation methods: climate, biology, drug design, epidemiology, materials, cosmology, high-energy physics.

22 IEEE 754 Half Precision (16-bit) Floating Point Standard
A lot of interest driven by machine learning.

23 Mixed Precision
- Extended to sparse direct methods.
- Iterative methods with inner and outer iterations.
- Today there are many precisions to deal with.
- Note the limited number range of half precision (16-bit floating point).
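For reference (standard facts about the IEEE 754 binary16 format and the classic mixed-precision iterative refinement scheme, not numbers taken from the slide): half precision has 1 sign bit, 5 exponent bits, and 10 fraction bits, so both its range and its accuracy are very limited, and the usual remedy is to factor once in low precision and recover full accuracy with a cheap refinement loop carried out in higher precision:

    % IEEE 754 binary16: 1 sign bit, 5 exponent bits, 10 fraction bits
    x_{\max} \approx 65504, \qquad
    x_{\min}^{\text{normal}} = 2^{-14} \approx 6.1\times 10^{-5}, \qquad
    u = 2^{-11} \approx 4.9\times 10^{-4} \ (\approx 3\ \text{decimal digits})

    % Mixed-precision iterative refinement: factor in low precision, refine in fp64
    A \approx \widehat{L}\,\widehat{U} \ \ \text{(fp16/fp32)}, \qquad
    x_0 = \widehat{U}^{-1}\widehat{L}^{-1} b, \qquad
    \begin{cases}
      r_k = b - A x_k & \text{(fp64 residual)} \\
      \widehat{L}\,\widehat{U}\, d_k = r_k & \text{(low-precision solve)} \\
      x_{k+1} = x_k + d_k & \text{(fp64 update)}
    \end{cases}

The inner solve reuses the cheap low-precision factorization; the outer loop (residual and update in fp64) restores double-precision accuracy as long as A is not too ill-conditioned for the low-precision factors.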

24 Deep Learning Needs Batched Operations
- Matrix multiply is the time-consuming part.
- Convolution layers and fully connected layers require matrix multiplies.
- There are many GEMMs of small matrices, perfectly parallel, and they can get by with 16-bit floating point.
[Figure: a convolution step and a fully connected classification layer, each expressed as a small GEMM (3x3 in this example).]

25 Standard for Batched Computations
- Define a standard API for batched BLAS and LAPACK in collaboration with Intel / NVIDIA / ECP / other users.
- Fixed size: most of BLAS and LAPACK released.
- Variable size: most of BLAS released.
- Variable-size LAPACK in the branch.
- Native GPU algorithms (Cholesky, LU, QR) in the branch.
- Tiled algorithms using batched routines on tile or LAPACK data layout in the branch.
- Framework for Deep Neural Network kernels.
- CPU, KNL and GPU routines.
- FP16 routines in progress.

26 Batched Computations: Non-Batched Baseline
Loop over the matrices one by one and compute each using multiple threads (note that, since the matrices are of small sizes, there is not enough work for all the cores). So we expect low performance, and thread contention might also affect the performance.
    for (i = 0; i < batchCount; i++)
        dgemm( ... )
A low percentage of the resources is used: there is not enough work to fill all the cores.

27 Batched Computations: Batched Computation
Distribute all the matrices over the available resources by assigning each matrix to a group of cores (CPU) or thread blocks (GPU) that operates on it independently:
- For very small matrices, assign one matrix per core (CPU) or per thread block (GPU).
- For medium sizes, a matrix goes to a team of cores (CPU) or to many thread blocks (GPU).
- For large sizes, switch to the classical multithreaded approach: one matrix per round.
    Batched_dgemm( ... )
A task manager/dispatcher, based on the kernel design, decides the number of thread blocks or threads (GPU/CPU) and schedules through the NVIDIA or OpenMP runtime. A high percentage of the resources is used.
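A minimal sketch of the very-small-matrix case of this dispatch, assuming plain C with OpenMP and a hand-written kernel rather than the MAGMA or vendor batched BLAS routines the slide refers to: each thread is handed an entire matrix, so all cores stay busy even though every individual GEMM is tiny.

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    /* One small column-major dgemm: C = alpha*A*B + beta*C, all m x m. */
    static void small_dgemm(int m, double alpha, const double *A,
                            const double *B, double beta, double *C) {
        for (int j = 0; j < m; j++)
            for (int i = 0; i < m; i++) {
                double s = 0.0;
                for (int k = 0; k < m; k++)
                    s += A[i + k * m] * B[k + j * m];
                C[i + j * m] = alpha * s + beta * C[i + j * m];
            }
    }

    int main(void) {
        const int batchCount = 10000, m = 16;        /* many tiny matrices */
        size_t elems = (size_t)m * m;
        double *A = malloc(batchCount * elems * sizeof(double));
        double *B = malloc(batchCount * elems * sizeof(double));
        double *C = malloc(batchCount * elems * sizeof(double));
        for (size_t i = 0; i < batchCount * elems; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

        double t = omp_get_wtime();
        /* Batched dispatch: each matrix is an independent task, one per thread. */
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < batchCount; i++)
            small_dgemm(m, 1.0, A + i * elems, B + i * elems, 0.0, C + i * elems);
        t = omp_get_wtime() - t;

        double gflops = 2.0 * m * m * m * batchCount / t / 1e9;
        printf("batched dgemm: %.3f s, %.1f Gflop/s, C[0]=%g\n", t, gflops, C[0]);
        free(A); free(B); free(C);
        return 0;
    }

For medium sizes each matrix would instead go to a small team of threads (or several GPU thread blocks), and for large sizes the code falls back to one multithreaded dgemm per matrix, as described above.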


30 Batched Computations: How Do We Design and Optimize?
[Chart: batched vs. standard DGEMM performance over small, medium, and large sizes; the batched approach is about 2-3x faster for the small and medium ranges, and the code switches to the non-batched path for large sizes (where the two are about 1x apart).]

31 Batched Computations: How Do We Design and Optimize?
[Chart: a second comparison; about 3x faster for small sizes, about 1.2x for medium sizes, switching to the non-batched path for large sizes.]

32 Batched DGEMM on an NVIDIA V100 GPU
[Chart: Gflop/s of batched dgemm vs. standard dgemm (BLAS 3) over small, medium, and large sizes; batched is about 1.4x faster in the medium range and switches to the non-batched path for large sizes. With ~1000 matrices, performance reaches about 97% of the peak for that size.]

33 Intel Knights Landing: FLAT Mode
- 68 cores, 1.3 GHz, peak DP = 2662 Gflop/s.
- KNL is in FLAT mode (the entire MCDRAM is used as addressable memory); MCDRAM allocations are treated similarly to DDR4 memory allocations.
- KNL Thermal Design Power (TDP) is 215 W (for the main SKUs).
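In FLAT mode the MCDRAM appears as a separate NUMA node, so data placement is up to the application (or to numactl). A minimal sketch using the memkind library's high-bandwidth allocator; this illustrates one common way to place hot arrays in MCDRAM and is not necessarily how the measurements in this talk were obtained:

    #include <stdio.h>
    #include <stdlib.h>
    #include <hbwmalloc.h>      /* memkind's high-bandwidth memory API; link with -lmemkind */

    int main(void) {
        const size_t n = 1 << 24;                  /* 16 Mi doubles = 128 MiB */

        /* Check whether high-bandwidth (MCDRAM) memory is available. */
        if (hbw_check_available() != 0)
            fprintf(stderr, "no MCDRAM detected; allocation may fall back to DDR4\n");

        /* Allocate the hot array in MCDRAM, a cold array in ordinary DDR4. */
        double *hot  = hbw_malloc(n * sizeof(double));
        double *cold = malloc(n * sizeof(double));
        if (!hot || !cold) { fprintf(stderr, "allocation failed\n"); return 1; }

        for (size_t i = 0; i < n; i++) { hot[i] = 1.0; cold[i] = 2.0; }

        /* Bandwidth-bound work (e.g., DAXPY-like) benefits from MCDRAM's ~425 GB/s. */
        double a = 3.0, s = 0.0;
        for (size_t i = 0; i < n; i++) { hot[i] = a * hot[i] + cold[i]; s += hot[i]; }
        printf("checksum %g\n", s);

        hbw_free(hot);
        free(cold);
        return 0;
    }

Alternatively, an unmodified binary can be bound to MCDRAM wholesale with something like numactl --membind=1 ./app (the NUMA node number of the MCDRAM can differ from system to system).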

34 Level 1, 2 and 3 BLAS on KNL
68-core Intel Xeon Phi KNL, 1.3 GHz; the theoretical peak in double precision is 2662 Gflop/s. Compiled with icc and using Intel MKL 2017.
[Chart: performance of Level 1, 2, and 3 BLAS out of MCDRAM; Level 3 (DGEMM) runs at roughly 2000 Gflop/s, far above the Level 2 (DGEMV, 85.3 Gflop/s) and Level 1 (DAXPY, 35.1 Gflop/s) rates; a gap of roughly 35x is highlighted on the slide.]
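The large gap between the BLAS levels follows directly from arithmetic intensity (flops per byte of memory traffic); a rough back-of-the-envelope comparison, assuming 8-byte doubles and counting only main-memory traffic (these estimates are illustrative, not measurements from the slide):

    \text{DAXPY (Level 1):}\quad \frac{2n \ \text{flops}}{24n \ \text{bytes}} \approx 0.08\ \text{flop/byte}
    \qquad
    \text{DGEMV (Level 2):}\quad \frac{2n^2 \ \text{flops}}{8n^2 \ \text{bytes}} \approx 0.25\ \text{flop/byte}
    \qquad
    \text{DGEMM (Level 3):}\quad \frac{2n^3 \ \text{flops}}{\mathcal{O}(n^2) \ \text{bytes}} = \mathcal{O}(n)\ \text{flop/byte}

At ~425 GB/s out of MCDRAM this caps DAXPY around 35 Gflop/s and DGEMV around 100 Gflop/s, consistent with the measured rates, while DGEMM can run close to the 2662 Gflop/s compute peak.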

35 PAPI for Power-Aware Computing
- We use PAPI's latest powercap component for measurement, control, and performance analysis.
- PAPI power components in the past supported only reading power information.
- The new component exposes RAPL functionality to allow users to read and write power limits.
- Study numerical building blocks of varying computational intensity.
- Use the PAPI powercap component to detect power optimization opportunities.
- Objective: cap the power on the architecture to reduce power usage while keeping the execution time constant. Energy savings!

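A minimal sketch of how the powercap component can be used to read and lower the package power limit, assuming a current PAPI build. The RAPL-zone event names below are typical of the powercap component but are an assumption here, not taken from the slide; the exact names should be confirmed with papi_native_avail on the target machine, and writing the limit normally requires elevated privileges.

    #include <stdio.h>
    #include <papi.h>

    int main(void) {
        int EventSet = PAPI_NULL;
        long long values[2];

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
            fprintf(stderr, "PAPI init failed\n"); return 1;
        }
        PAPI_create_eventset(&EventSet);

        /* Assumed powercap event names for the package RAPL zone; check
           papi_native_avail for the exact names exposed on your system. */
        const char *events[2] = {
            "powercap:::POWER_LIMIT_A_UW:ZONE0",   /* long-term limit, microwatts  */
            "powercap:::POWER_LIMIT_B_UW:ZONE0"    /* short-term limit, microwatts */
        };
        for (int i = 0; i < 2; i++)
            if (PAPI_add_named_event(EventSet, events[i]) != PAPI_OK) {
                fprintf(stderr, "could not add %s\n", events[i]); return 1;
            }

        PAPI_start(EventSet);
        PAPI_read(EventSet, values);                 /* current limits */
        printf("current limits: %lld uW, %lld uW\n", values[0], values[1]);

        /* Lower the cap, e.g. to 160 W = 160,000,000 uW, then run the kernel. */
        values[0] = 160000000LL;
        values[1] = 160000000LL;
        if (PAPI_write(EventSet, values) != PAPI_OK)
            fprintf(stderr, "PAPI_write failed (insufficient privileges?)\n");

        /* ... run DGEMM / DGEMV / DAXPY here and record time and energy ... */

        PAPI_stop(EventSet, values);
        return 0;
    }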

37 Level 3 BLAS: DGEMM on KNL using MCDRAM
68-core KNL, peak DP = 2662 Gflop/s. Bandwidth: MCDRAM ~425 GB/s, DDR4 ~90 GB/s.
[Charts: average package and DDR4 power (Watts) over time, with the maximum power limit marked; plus performance in Gflop/s, Gflop/s per Watt, and Joules. The first phase is idle, followed by DGEMM at size 18K running at the default power limit.]
DGEMM is run repeatedly for a fixed matrix size of 18K; at each step a new power limit is set in decreasing fashion, starting from the default, then 220 Watts, 200 Watts, and so on down to 120 Watts in steps of 10-20 Watts.

38 Level 3 BLAS: DGEMM on KNL using MCDRAM (continued)
[Same charts as the previous slide.]
    PAPI_read(EventSet, values);    /* values is declared as long long values[2] */
    values[0] = ... ;               /* set short-term power limit to ... Watts   */
    values[1] = 258;                /* set long-term power limit to 258 Watts    */
    PAPI_write(EventSet, values);
DGEMM is run repeatedly for a fixed matrix size of 18K; at each step a new power limit is set in decreasing fashion, starting from the default, then 220 Watts, 200 Watts, and so on down to 120 Watts.

39 Level 3 BLAS: DGEMM on KNL using MCDRAM (continued)
[Charts now show two phases: DGEMM at size 18K at the default power limit, then with the limit set to 220 W.]
DGEMM is run repeatedly for a fixed matrix size of 18K; at each step a new power limit is set in decreasing fashion, starting from the default, then 220 Watts, 200 Watts, and so on down to 120 Watts.

40 Level 3 BLAS: DGEMM on KNL using MCDRAM (continued)
[Charts show successive DGEMM phases as the limit steps down: default, 220 W, 200 W, 180 W, 160 W, 150 W, and so on.]
DGEMM is run repeatedly for a fixed matrix size of 18K; at each step a new power limit is set in decreasing fashion, starting from the default, then 220 Watts, 200 Watts, and so on down to 120 Watts.

41 Level 3 BLAS: DGEMM on KNL using MCDRAM (continued)
[Charts add the core frequency in MHz and mark the points of optimum Flop/s and optimum Flop/s per Watt.]
When the power cap kicks in, the frequency is decreased, which confirms that capping takes effect through DVFS.
DGEMM is run repeatedly for a fixed matrix size of 18K; at each step a new power limit is set in decreasing fashion, starting from the default, then 220 Watts, 200 Watts, and so on down to 120 Watts.

42 Level 3 BLAS: DGEMM on KNL, MCDRAM vs. DDR4
[Side-by-side charts for MCDRAM and DDR4: package and DDR4 power, frequency, performance, Gflop/s per Watt, and Joules, with the optimum Flop/s and optimum Flop/s-per-Watt points marked.]
Lesson for DGEMM-type operations (compute intensive): capping can reduce performance, but sometimes it can hold the energy efficiency (Gflop/s per Watt).
DGEMM is run repeatedly for a fixed matrix size of 18K; at each step a new power limit is set in decreasing fashion, starting from the default, then 220 Watts, 200 Watts, and so on down to 120 Watts.
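The measurement loop behind these plots can be summarized as below. The helpers set_power_cap_watts, run_dgemm_once, and read_energy_joules are hypothetical placeholders (stubbed here so the sketch compiles) standing in for the PAPI powercap write shown earlier, the MKL DGEMM call, and the package energy counter; the structure of the sweep is the point, not the stub values.

    #include <stdio.h>

    /* Hypothetical stubs: in the real experiment these wrap PAPI_write on the
       powercap events, a timed MKL dgemm of size n, and the RAPL energy counter. */
    static void   set_power_cap_watts(double watts) { (void)watts; }
    static double run_dgemm_once(int n)             { (void)n; return 1.0; /* seconds */ }
    static double read_energy_joules(void)          { static double e = 0.0; return e += 200.0; }

    int main(void) {
        const int n = 18000;                      /* fixed matrix size, as on the slides    */
        const double caps[] = { 220, 200, 180, 160, 150, 140, 130, 120 };  /* Watts, after a
                                                     first run at the default limit         */
        const double flops = 2.0 * (double)n * (double)n * (double)n;

        for (size_t i = 0; i < sizeof caps / sizeof caps[0]; i++) {
            set_power_cap_watts(caps[i]);         /* decrease the cap step by step          */
            double e0 = read_energy_joules();
            double t  = run_dgemm_once(n);        /* one 18K x 18K DGEMM                    */
            double e1 = read_energy_joules();

            double gflops = flops / t / 1e9;
            double watts  = (e1 - e0) / t;        /* average power during the run           */
            printf("cap %5.0f W: %8.1f Gflop/s, %6.1f W, %6.2f Gflop/s/W, %8.1f J\n",
                   caps[i], gflops, watts, gflops / watts, e1 - e0);
        }
        return 0;
    }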

43 Level 2 BLAS: DGEMV on KNL, MCDRAM vs. DDR4
[Side-by-side charts for MCDRAM and DDR4: power, frequency, performance in Gflop/s, achieved bandwidth in GB/s, Gflop/s per Watt, and Joules, with the optimum Flop/s and optimum Flop/s-per-Watt points marked.]
Lesson for DGEMV-type operations (memory bound):
- For MCDRAM, capping at 190 Watts results in a power reduction without any loss of performance (energy saving ~20%).
- For DDR4, capping at 120 Watts improves energy efficiency by ~17%.
- Overall, capping 40 Watts below the power observed at the default setting provides about 17-20% energy saving without any significant increase in time to solution.
DGEMV is run repeatedly for a fixed matrix size of 18K; at each step a new power limit is set in decreasing fashion, starting from the default, then 220 Watts, 200 Watts, and so on down to 120 Watts.

44 Level 1 BLAS: DAXPY on KNL, MCDRAM vs. DDR4
[Side-by-side charts for MCDRAM and DDR4: power, frequency, performance, achieved bandwidth, Gflop/s per Watt, and Joules.]
Lesson for DAXPY-type operations (memory bound):
- For MCDRAM, capping at 180 Watts results in a power reduction without any loss of performance (energy saving ~16%).
- For DDR4, capping at 130 Watts improves energy efficiency by ~25%.
- Overall, capping 40 Watts below the power observed at the default setting provides about 16-25% energy saving without any significant increase in time to solution.
DAXPY is run repeatedly for a fixed vector size of 1E8; at each step a new power limit is set in decreasing fashion, starting from the default, then 220 Watts, 200 Watts, and so on down to 120 Watts.

45 Intel Knights Landing: Level 1, 2 and 3 BLAS Summary
68-core Intel Xeon Phi KNL, 1.3 GHz, peak DP = 2662 Gflop/s.
[Table: for Level 1, Level 2, and Level 3 BLAS, the rate in Gflop/s and the power efficiency in Gflop/s per Watt, out of MCDRAM and out of DDR4 (e.g., DDR4 rates of roughly 21 Gflop/s for Level 2 and 7 Gflop/s for Level 1).]

46 Power-Awareness in Applications: HPCG (MCDRAM)
HPCG benchmark on a 3D grid.
[Chart: average power (Watts) over time for MCDRAM at the default 215 W limit, with the total energy in Joules.]
- Sparse matrix-vector operations; solving Ax=b with the conjugate gradient algorithm.
- At TDP: the basic power limit of 215 W.

47 Power-Awareness in Applications: HPCG (MCDRAM)
[Chart: average power over time at the 215 W and 200 W limits, with the total energy per run (about 1540 Joules at 200 W).]
- Sparse matrix-vector operations; solving Ax=b with the conjugate gradient algorithm.
- Decreasing the power limit to 200 W does not result in any loss of performance, while we observe a reduction in the power draw.

48 Power-Awareness in Applications: HPCG (MCDRAM)
[Chart: average power over time at limits of 215, 200, 180, 160, 140, and 120 W; total energy of 1522, about 1540, 1468, 1486, 1728, and about 2390 Joules respectively.]
- Sparse matrix-vector operations; solving Ax=b with the conjugate gradient algorithm.
- Decreasing the power limit to 200 W does not result in any loss of performance, while we observe a reduction in the power draw.
- Decreasing the power limit by 40 Watts (setting the limit at 160 W) provides energy saving and a power-rate reduction without any meaningful loss of performance: less than 1% increase in time to solution.

49 Power-Awareness in Applications: HPCG (DDR4)
[Chart: average power over time at limits of 215, 200, 180, 160, 140, and 120 W; total energy ranging from about 3915 Joules at the default limit down to 3158 Joules at 120 W.]
- Sparse matrix-vector operations; solving Ax=b with the conjugate gradient algorithm.
- Decreasing the power limit to 160 W does not result in any loss of performance, while we observe a reduced power draw and energy saving.
- Decreasing the power limit by 40 Watts (limit at 140 W) provides a large energy saving (15%) and a power-rate reduction without any loss of performance.
- At 120 Watts, less than 1% increase in time to solution, while a large energy saving (20%) and power-rate reduction are observed.

50 Power-Awareness in Applications: HPCG (MCDRAM & DDR4)
[Side-by-side charts for MCDRAM and DDR4 at limits from 215 W down to 120 W, with total energy per run.]
- Sparse matrix-vector operations; solving Ax=b with CG.
Lesson for HPCG:
- For MCDRAM, decreasing the power limit by 40 Watts from the power observed at default (setting the limit at 160 W) provides energy saving and a power-rate reduction without any meaningful loss of performance (less than 1% increase in time to solution).
- For DDR4, similar behavior is observed: decreasing the power limit by 40 Watts from the power observed at default (setting the limit at 120 W) provides a large energy saving (15%-20%) and a power-rate reduction without any significant increase in time to solution.

51 Power-Awareness in Applications: Jacobi Iteration
Solving the Helmholtz equation with finite differences, Jacobi iteration, on a 128x128 grid.
[Side-by-side charts for MCDRAM and DDR4 at limits from 215 W down to 120 W, with total energy per run (e.g., 826 Joules at 215 W and 1865 Joules at 120 W for MCDRAM; 2878 Joules at 200 W and 2137 Joules at 140 W for DDR4).]
Lesson for the Jacobi iteration:
- For MCDRAM, capping at 180 Watts improves energy efficiency by ~13% without any loss in time to solution, while also decreasing the power draw by about 20%.
- For DDR4, capping at 140 Watts improves energy efficiency by ~20% without any loss in time to solution.
- Overall, capping 30-40 Watts below the power observed at the default limit provides a large energy gain while keeping the same time to solution, which is similar to the behavior observed for the DGEMV and DAXPY kernels.
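For reference, the computational core of such a run is a simple memory-bound stencil sweep. A minimal sketch of a Jacobi iteration for a Helmholtz-type problem, -Δu + αu = f, on a 128x128 grid with zero boundary values; the discretization and right-hand side are assumptions for illustration, not details taken from the slide.

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    int main(void) {
        const int n = 128;                  /* grid points per dimension, as on the slide */
        const double h = 1.0 / (n - 1), alpha = 1.0;
        /* Jacobi update for the discrete Helmholtz-type operator (-Laplacian + alpha). */
        const double diag = 4.0 / (h * h) + alpha;

        double *u    = calloc((size_t)n * n, sizeof(double));
        double *unew = calloc((size_t)n * n, sizeof(double));
        double *f    = malloc((size_t)n * n * sizeof(double));
        for (int i = 0; i < n * n; i++) f[i] = 1.0;     /* simple right-hand side */

        double diff = 1.0;
        int iter = 0;
        while (diff > 1e-6 && iter < 10000) {
            diff = 0.0;
            for (int i = 1; i < n - 1; i++)
                for (int j = 1; j < n - 1; j++) {
                    int idx = i * n + j;
                    double sum = (u[idx - 1] + u[idx + 1] + u[idx - n] + u[idx + n]) / (h * h);
                    unew[idx] = (f[idx] + sum) / diag;
                    double d = unew[idx] - u[idx];
                    diff += d * d;
                }
            diff = sqrt(diff);
            double *tmp = u; u = unew; unew = tmp;      /* swap buffers */
            iter++;
        }
        printf("change %g after %d Jacobi sweeps\n", diff, iter);
        free(u); free(unew); free(f);
        return 0;
    }

Like DGEMV and DAXPY, each sweep streams the whole grid while doing only a handful of flops per point, which is why this run responds to power capping the same way the memory-bound kernels do.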

52 Power-Awareness in Applications: Lattice Boltzmann
Lattice-Boltzmann simulation of CFD (from the SPEC 2006 benchmark suite).
[Side-by-side charts for MCDRAM and DDR4 at limits from 270 W down to 120 W, with total energy per run: for MCDRAM, energy rises from roughly 2900-3000 Joules at the higher limits to 5734 Joules at 120 W; for DDR4 it falls from about 9200 Joules to 8467 Joules at 120 W.]
Lesson for Lattice Boltzmann:
- For MCDRAM, capping at 200 W results in a 5% energy saving and a power-rate reduction without any increase in time to solution.
- For DDR4, capping at 140 Watts improves energy efficiency by ~11% and reduces the power draw without any increase in time to solution.

53 Power-Awareness in Applications: Monte Carlo
Monte Carlo neutron transport application (from XSBench).
[Side-by-side charts for MCDRAM (limits from 215 W down to 140 W) and DDR4 (215 W down to 120 W), with total energy per run (e.g., 9175 Joules at 215 W and 8720 Joules at 180 W for MCDRAM).]
Lesson for Monte Carlo:
- For MCDRAM, capping at 180 W results in an energy saving (5%) and a power-rate reduction without any meaningful increase in time to solution.
- For DDR4, capping at 120 Watts improves power efficiency by ~30% without any loss of performance.

54 Power-Aware Tools
- Designing a toolset that will automatically monitor memory traffic.
- Track the performance and data movement.
- Adjust the power to keep the performance roughly the same while decreasing energy consumption.
- Better energy efficiency without significant loss of performance.
- All in an automatic fashion, without user involvement.
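One way such a tool could be structured is as a simple feedback loop: sample performance and memory traffic, keep lowering the cap while the code remains memory-bound and performance holds, and back off as soon as the time per iteration degrades. The helpers sample_counters and apply_power_cap are hypothetical placeholders (stubbed so the sketch compiles) for the PAPI powercap and memory-traffic measurements; this is a sketch of the idea, not the toolset the slide describes.

    #include <stdio.h>

    /* One measurement sample; in a real tool these would come from PAPI
       (powercap component plus memory/bandwidth counters). */
    typedef struct { double gflops; double mem_gbs; } sample_t;

    /* Hypothetical stubs standing in for the real measurement and control calls. */
    static sample_t sample_counters(void)     { sample_t s = { 80.0, 400.0 }; return s; }
    static void     apply_power_cap(double w) { printf("  cap set to %.0f W\n", w); }

    int main(void) {
        const double min_cap = 120.0, step = 10.0, slack = 0.97;   /* keep >= 97% of perf */
        const double mcdram_bw = 425.0;                            /* GB/s, KNL MCDRAM    */
        double cap = 215.0;                                        /* start at TDP        */

        sample_t base = sample_counters();    /* reference performance at the default cap */
        apply_power_cap(cap);

        for (int epoch = 0; epoch < 8; epoch++) {                  /* periodic decisions  */
            sample_t s = sample_counters();
            int memory_bound = s.mem_gbs > 0.8 * mcdram_bw;        /* near bandwidth limit */
            int perf_ok      = s.gflops >= slack * base.gflops;

            if (memory_bound && perf_ok && cap - step >= min_cap)
                cap -= step;                  /* performance held: keep lowering the cap  */
            else if (!perf_ok)
                cap += step;                  /* performance dropped: back off            */

            apply_power_cap(cap);
            printf("epoch %d: %.1f Gflop/s, %.0f GB/s, cap %.0f W\n",
                   epoch, s.gflops, s.mem_gbs, cap);
        }
        return 0;
    }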

55 Critical Issues and Challenges at Peta & Exascale for Algorithm and Software Design
- Synchronization-reducing algorithms: break the fork-join model.
- Communication-reducing algorithms: use methods that attain the lower bound on communication.
- Mixed precision methods: 2x speed of operations and 2x speed for data movement.
- Autotuning: today's machines are too complicated; build smarts into software to adapt to the hardware.
- Fault-resilient algorithms: implement algorithms that can recover from failures/bit flips.
- Reproducibility of results: today we can't guarantee this without a penalty. We understand the issues, but some of our colleagues have a hard time with this.

56 Collaborators and Support
MAGMA team; PLASMA team.
Collaborating partners: University of Tennessee, Knoxville; University of California, Berkeley; University of Colorado, Denver; University of Manchester.

It s a Multicore World. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist

It s a Multicore World. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist It s a Multicore World John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist Waiting for Moore s Law to save your serial code started getting bleak in 2004 Source: published SPECInt

More information

It s a Multicore World. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist

It s a Multicore World. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist It s a Multicore World John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist Waiting for Moore s Law to save your serial code started getting bleak in 2004 Source: published SPECInt

More information

HPCG UPDATE: ISC 15 Jack Dongarra Michael Heroux Piotr Luszczek

HPCG UPDATE: ISC 15 Jack Dongarra Michael Heroux Piotr Luszczek www.hpcg-benchmark.org 1 HPCG UPDATE: ISC 15 Jack Dongarra Michael Heroux Piotr Luszczek www.hpcg-benchmark.org 2 HPCG Snapshot High Performance Conjugate Gradient (HPCG). Solves Ax=b, A large, sparse,

More information

Presentations: Jack Dongarra, University of Tennessee & ORNL. The HPL Benchmark: Past, Present & Future. Mike Heroux, Sandia National Laboratories

Presentations: Jack Dongarra, University of Tennessee & ORNL. The HPL Benchmark: Past, Present & Future. Mike Heroux, Sandia National Laboratories HPC Benchmarking Presentations: Jack Dongarra, University of Tennessee & ORNL The HPL Benchmark: Past, Present & Future Mike Heroux, Sandia National Laboratories The HPCG Benchmark: Challenges It Presents

More information

It s a Multicore World. John Urbanic Pittsburgh Supercomputing Center

It s a Multicore World. John Urbanic Pittsburgh Supercomputing Center It s a Multicore World John Urbanic Pittsburgh Supercomputing Center Waiting for Moore s Law to save your serial code start getting bleak in 2004 Source: published SPECInt data Moore s Law is not at all

More information

Managing HPC Active Archive Storage with HPSS RAIT at Oak Ridge National Laboratory

Managing HPC Active Archive Storage with HPSS RAIT at Oak Ridge National Laboratory Managing HPC Active Archive Storage with HPSS RAIT at Oak Ridge National Laboratory Quinn Mitchell HPC UNIX/LINUX Storage Systems ORNL is managed by UT-Battelle for the US Department of Energy U.S. Department

More information

CSE5351: Parallel Procesisng. Part 1B. UTA Copyright (c) Slide No 1

CSE5351: Parallel Procesisng. Part 1B. UTA Copyright (c) Slide No 1 Slide No 1 CSE5351: Parallel Procesisng Part 1B Slide No 2 State of the Art In Supercomputing Several of the next slides (or modified) are the courtesy of Dr. Jack Dongarra, a distinguished professor of

More information

HPCG UPDATE: SC 15 Jack Dongarra Michael Heroux Piotr Luszczek

HPCG UPDATE: SC 15 Jack Dongarra Michael Heroux Piotr Luszczek 1 HPCG UPDATE: SC 15 Jack Dongarra Michael Heroux Piotr Luszczek HPCG Snapshot High Performance Conjugate Gradient (HPCG). Solves Ax=b, A large, sparse, b known, x computed. An optimized implementation

More information

It s a Multicore World. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist

It s a Multicore World. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist It s a Multicore World John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist Moore's Law abandoned serial programming around 2004 Courtesy Liberty Computer Architecture Research Group

More information

Jack Dongarra University of Tennessee Oak Ridge National Laboratory University of Manchester

Jack Dongarra University of Tennessee Oak Ridge National Laboratory University of Manchester Jack Dongarra University of Tennessee Oak Ridge National Laboratory University of Manchester 11/20/13 1 Rank Site Computer Country Cores Rmax [Pflops] % of Peak Power [MW] MFlops /Watt 1 2 3 4 National

More information

Jack Dongarra University of Tennessee Oak Ridge National Laboratory

Jack Dongarra University of Tennessee Oak Ridge National Laboratory Jack Dongarra University of Tennessee Oak Ridge National Laboratory 3/9/11 1 TPP performance Rate Size 2 100 Pflop/s 100000000 10 Pflop/s 10000000 1 Pflop/s 1000000 100 Tflop/s 100000 10 Tflop/s 10000

More information

Report on the Sunway TaihuLight System. Jack Dongarra. University of Tennessee. Oak Ridge National Laboratory

Report on the Sunway TaihuLight System. Jack Dongarra. University of Tennessee. Oak Ridge National Laboratory Report on the Sunway TaihuLight System Jack Dongarra University of Tennessee Oak Ridge National Laboratory June 24, 2016 University of Tennessee Department of Electrical Engineering and Computer Science

More information

Emerging Heterogeneous Technologies for High Performance Computing

Emerging Heterogeneous Technologies for High Performance Computing MURPA (Monash Undergraduate Research Projects Abroad) Emerging Heterogeneous Technologies for High Performance Computing Jack Dongarra University of Tennessee Oak Ridge National Lab University of Manchester

More information

It s a Multicore World. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist

It s a Multicore World. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist It s a Multicore World John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist Moore's Law abandoned serial programming around 2004 Courtesy Liberty Computer Architecture Research Group

More information

Cray XC Scalability and the Aries Network Tony Ford

Cray XC Scalability and the Aries Network Tony Ford Cray XC Scalability and the Aries Network Tony Ford June 29, 2017 Exascale Scalability Which scalability metrics are important for Exascale? Performance (obviously!) What are the contributing factors?

More information

High-Performance Computing - and why Learn about it?

High-Performance Computing - and why Learn about it? High-Performance Computing - and why Learn about it? Tarek El-Ghazawi The George Washington University Washington D.C., USA Outline What is High-Performance Computing? Why is High-Performance Computing

More information

Parallel Computing & Accelerators. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist

Parallel Computing & Accelerators. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist Parallel Computing Accelerators John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist Purpose of this talk This is the 50,000 ft. view of the parallel computing landscape. We want

More information

Confessions of an Accidental Benchmarker

Confessions of an Accidental Benchmarker Confessions of an Accidental Benchmarker http://bit.ly/hpcg-benchmark 1 Appendix B of the Linpack Users Guide Designed to help users extrapolate execution Linpack software package First benchmark report

More information

Top500

Top500 Top500 www.top500.org Salvatore Orlando (from a presentation by J. Dongarra, and top500 website) 1 2 MPPs Performance on massively parallel machines Larger problem sizes, i.e. sizes that make sense Performance

More information

High Performance Linear Algebra

High Performance Linear Algebra High Performance Linear Algebra Hatem Ltaief Senior Research Scientist Extreme Computing Research Center King Abdullah University of Science and Technology 4th International Workshop on Real-Time Control

More information

Power Profiling of Cholesky and QR Factorizations on Distributed Memory Systems

Power Profiling of Cholesky and QR Factorizations on Distributed Memory Systems International Conference on Energy-Aware High Performance Computing Hamburg, Germany Bosilca, Ltaief, Dongarra (KAUST, UTK) Power Sept Profiling, DLA Algorithms ENAHPC / 6 Power Profiling of Cholesky and

More information

TOP500 List s Twice-Yearly Snapshots of World s Fastest Supercomputers Develop Into Big Picture of Changing Technology

TOP500 List s Twice-Yearly Snapshots of World s Fastest Supercomputers Develop Into Big Picture of Changing Technology TOP500 List s Twice-Yearly Snapshots of World s Fastest Supercomputers Develop Into Big Picture of Changing Technology BY ERICH STROHMAIER COMPUTER SCIENTIST, FUTURE TECHNOLOGIES GROUP, LAWRENCE BERKELEY

More information

An Overview of High Performance Computing and Challenges for the Future

An Overview of High Performance Computing and Challenges for the Future An Overview of High Performance Computing and Challenges for the Future Jack Dongarra University of Tennessee Oak Ridge National Laboratory University of Manchester 6/15/2009 1 H. Meuer, H. Simon, E. Strohmaier,

More information

ECE 574 Cluster Computing Lecture 2

ECE 574 Cluster Computing Lecture 2 ECE 574 Cluster Computing Lecture 2 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 24 January 2019 Announcements Put your name on HW#1 before turning in! 1 Top500 List November

More information

Overview. CS 472 Concurrent & Parallel Programming University of Evansville

Overview. CS 472 Concurrent & Parallel Programming University of Evansville Overview CS 472 Concurrent & Parallel Programming University of Evansville Selection of slides from CIS 410/510 Introduction to Parallel Computing Department of Computer and Information Science, University

More information

HPC as a Driver for Computing Technology and Education

HPC as a Driver for Computing Technology and Education HPC as a Driver for Computing Technology and Education Tarek El-Ghazawi The George Washington University Washington D.C., USA NOW- July 2015: The TOP 10 Systems Rank Site Computer Cores Rmax [Pflops] %

More information

Supercomputers in ITC/U.Tokyo 2 big systems, 6 yr. cycle

Supercomputers in ITC/U.Tokyo 2 big systems, 6 yr. cycle Supercomputers in ITC/U.Tokyo 2 big systems, 6 yr. cycle FY 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 Hitachi SR11K/J2 IBM Power 5+ 18.8TFLOPS, 16.4TB Hitachi HA8000 (T2K) AMD Opteron 140TFLOPS, 31.3TB

More information

MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures

MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Stan Tomov Innovative Computing Laboratory University of Tennessee, Knoxville OLCF Seminar Series, ORNL June 16, 2010

More information

Preparing GPU-Accelerated Applications for the Summit Supercomputer

Preparing GPU-Accelerated Applications for the Summit Supercomputer Preparing GPU-Accelerated Applications for the Summit Supercomputer Fernanda Foertter HPC User Assistance Group Training Lead foertterfs@ornl.gov This research used resources of the Oak Ridge Leadership

More information

High Performance Computing in Europe and USA: A Comparison

High Performance Computing in Europe and USA: A Comparison High Performance Computing in Europe and USA: A Comparison Hans Werner Meuer University of Mannheim and Prometeus GmbH 2nd European Stochastic Experts Forum Baden-Baden, June 28-29, 2001 Outlook Introduction

More information

InfiniBand Strengthens Leadership as the Interconnect Of Choice By Providing Best Return on Investment. TOP500 Supercomputers, June 2014

InfiniBand Strengthens Leadership as the Interconnect Of Choice By Providing Best Return on Investment. TOP500 Supercomputers, June 2014 InfiniBand Strengthens Leadership as the Interconnect Of Choice By Providing Best Return on Investment TOP500 Supercomputers, June 2014 TOP500 Performance Trends 38% CAGR 78% CAGR Explosive high-performance

More information

Chapter 5 Supercomputers

Chapter 5 Supercomputers Chapter 5 Supercomputers Part I. Preliminaries Chapter 1. What Is Parallel Computing? Chapter 2. Parallel Hardware Chapter 3. Parallel Software Chapter 4. Parallel Applications Chapter 5. Supercomputers

More information

Parallel Computing & Accelerators. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist

Parallel Computing & Accelerators. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist Parallel Computing Accelerators John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist Purpose of this talk This is the 50,000 ft. view of the parallel computing landscape. We want

More information

Trends in HPC (hardware complexity and software challenges)

Trends in HPC (hardware complexity and software challenges) Trends in HPC (hardware complexity and software challenges) Mike Giles Oxford e-research Centre Mathematical Institute MIT seminar March 13th, 2013 Mike Giles (Oxford) HPC Trends March 13th, 2013 1 / 18

More information

Distributed Dense Linear Algebra on Heterogeneous Architectures. George Bosilca

Distributed Dense Linear Algebra on Heterogeneous Architectures. George Bosilca Distributed Dense Linear Algebra on Heterogeneous Architectures George Bosilca bosilca@eecs.utk.edu Centraro, Italy June 2010 Factors that Necessitate to Redesign of Our Software» Steepness of the ascent

More information

Investigating power capping toward energy-efficient scientific applications

Investigating power capping toward energy-efficient scientific applications Received: 2 October 217 Revised: 15 December 217 Accepted: 19 February 218 DOI: 1.12/cpe.4485 SPECIAL ISSUE PAPER Investigating power capping toward energy-efficient scientific applications Azzam Haidar

More information

Jack Dongarra University of Tennessee Oak Ridge National Laboratory University of Manchester

Jack Dongarra University of Tennessee Oak Ridge National Laboratory University of Manchester Jack Dongarra University of Tennessee Oak Ridge National Laboratory University of Manchester 12/24/09 1 Take a look at high performance computing What s driving HPC Future Trends 2 Traditional scientific

More information

A Standard for Batching BLAS Operations

A Standard for Batching BLAS Operations A Standard for Batching BLAS Operations Jack Dongarra University of Tennessee Oak Ridge National Laboratory University of Manchester 5/8/16 1 API for Batching BLAS Operations We are proposing, as a community

More information

MAGMA. Matrix Algebra on GPU and Multicore Architectures

MAGMA. Matrix Algebra on GPU and Multicore Architectures MAGMA Matrix Algebra on GPU and Multicore Architectures Innovative Computing Laboratory Electrical Engineering and Computer Science University of Tennessee Piotr Luszczek (presenter) web.eecs.utk.edu/~luszczek/conf/

More information

CS 5803 Introduction to High Performance Computer Architecture: Performance Metrics

CS 5803 Introduction to High Performance Computer Architecture: Performance Metrics CS 5803 Introduction to High Performance Computer Architecture: Performance Metrics A.R. Hurson 323 Computer Science Building, Missouri S&T hurson@mst.edu 1 Instructor: Ali R. Hurson 323 CS Building hurson@mst.edu

More information

What have we learned from the TOP500 lists?

What have we learned from the TOP500 lists? What have we learned from the TOP500 lists? Hans Werner Meuer University of Mannheim and Prometeus GmbH Sun HPC Consortium Meeting Heidelberg, Germany June 19-20, 2001 Outlook TOP500 Approach Snapshots

More information

Intel Xeon Phi архитектура, модели программирования, оптимизация.

Intel Xeon Phi архитектура, модели программирования, оптимизация. Нижний Новгород, 2017 Intel Xeon Phi архитектура, модели программирования, оптимизация. Дмитрий Прохоров, Дмитрий Рябцев, Intel Agenda What and Why Intel Xeon Phi Top 500 insights, roadmap, architecture

More information

Titan - Early Experience with the Titan System at Oak Ridge National Laboratory

Titan - Early Experience with the Titan System at Oak Ridge National Laboratory Office of Science Titan - Early Experience with the Titan System at Oak Ridge National Laboratory Buddy Bland Project Director Oak Ridge Leadership Computing Facility November 13, 2012 ORNL s Titan Hybrid

More information

Chapter 1. Introduction

Chapter 1. Introduction Chapter 1 Introduction Why High Performance Computing? Quote: It is hard to understand an ocean because it is too big. It is hard to understand a molecule because it is too small. It is hard to understand

More information

NERSC Site Update. National Energy Research Scientific Computing Center Lawrence Berkeley National Laboratory. Richard Gerber

NERSC Site Update. National Energy Research Scientific Computing Center Lawrence Berkeley National Laboratory. Richard Gerber NERSC Site Update National Energy Research Scientific Computing Center Lawrence Berkeley National Laboratory Richard Gerber NERSC Senior Science Advisor High Performance Computing Department Head Cori

More information

represent parallel computers, so distributed systems such as Does not consider storage or I/O issues

represent parallel computers, so distributed systems such as Does not consider storage or I/O issues Top500 Supercomputer list represent parallel computers, so distributed systems such as SETI@Home are not considered Does not consider storage or I/O issues Both custom designed machines and commodity machines

More information

HPC Technology Update Challenges or Chances?

HPC Technology Update Challenges or Chances? HPC Technology Update Challenges or Chances? Swiss Distributed Computing Day Thomas Schoenemeyer, Technology Integration, CSCS 1 Move in Feb-April 2012 1500m2 16 MW Lake-water cooling PUE 1.2 New Datacenter

More information

Short Talk: System abstractions to facilitate data movement in supercomputers with deep memory and interconnect hierarchy

Short Talk: System abstractions to facilitate data movement in supercomputers with deep memory and interconnect hierarchy Short Talk: System abstractions to facilitate data movement in supercomputers with deep memory and interconnect hierarchy François Tessier, Venkatram Vishwanath Argonne National Laboratory, USA July 19,

More information

GOING ARM A CODE PERSPECTIVE

GOING ARM A CODE PERSPECTIVE GOING ARM A CODE PERSPECTIVE ISC18 Guillaume Colin de Verdière JUNE 2018 GCdV PAGE 1 CEA, DAM, DIF, F-91297 Arpajon, France June 2018 A history of disruptions All dates are installation dates of the machines

More information

The Future of High- Performance Computing

The Future of High- Performance Computing Lecture 26: The Future of High- Performance Computing Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2017 Comparing Two Large-Scale Systems Oakridge Titan Google Data Center Monolithic

More information

Introduction to Parallel Programming for Multicore/Manycore Clusters

Introduction to Parallel Programming for Multicore/Manycore Clusters duction to Parallel Programming for Multi/Many Clusters General duction Invitation to Supercomputing Kengo Nakajima Information Technology Center The University of Tokyo Takahiro Katagiri Information Technology

More information

2018/9/25 (1), (I) 1 (1), (I)

2018/9/25 (1), (I) 1 (1), (I) 2018/9/25 (1), (I) 1 (1), (I) 2018/9/25 (1), (I) 2 1. 2. 3. 4. 5. 2018/9/25 (1), (I) 3 1. 2. MPI 3. D) http://www.compsci-alliance.jp http://www.compsci-alliance.jp// 2018/9/25 (1), (I) 4 2018/9/25 (1),

More information

Towards Exascale Computing with the Atmospheric Model NUMA

Towards Exascale Computing with the Atmospheric Model NUMA Towards Exascale Computing with the Atmospheric Model NUMA Andreas Müller, Daniel S. Abdi, Michal Kopera, Lucas Wilcox, Francis X. Giraldo Department of Applied Mathematics Naval Postgraduate School, Monterey

More information

Overview of Reedbush-U How to Login

Overview of Reedbush-U How to Login Overview of Reedbush-U How to Login Information Technology Center The University of Tokyo http://www.cc.u-tokyo.ac.jp/ Supercomputers in ITC/U.Tokyo 2 big systems, 6 yr. cycle FY 11 12 13 14 15 16 17 18

More information

Dense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends

Dense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Dense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen pauldj@aices.rwth-aachen.de ComplexHPC Spring School 2013 Heterogeneous computing - Impact

More information

IHK/McKernel: A Lightweight Multi-kernel Operating System for Extreme-Scale Supercomputing

IHK/McKernel: A Lightweight Multi-kernel Operating System for Extreme-Scale Supercomputing : A Lightweight Multi-kernel Operating System for Extreme-Scale Supercomputing Balazs Gerofi Exascale System Software Team, RIKEN Center for Computational Science 218/Nov/15 SC 18 Intel Extreme Computing

More information

MAGMA: a New Generation

MAGMA: a New Generation 1.3 MAGMA: a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Jack Dongarra T. Dong, M. Gates, A. Haidar, S. Tomov, and I. Yamazaki University of Tennessee, Knoxville Release

More information

How to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić

How to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić How to perform HPL on CPU&GPU clusters Dr.sc. Draško Tomić email: drasko.tomic@hp.com Forecasting is not so easy, HPL benchmarking could be even more difficult Agenda TOP500 GPU trends Some basics about

More information

Fujitsu s Approach to Application Centric Petascale Computing

Fujitsu s Approach to Application Centric Petascale Computing Fujitsu s Approach to Application Centric Petascale Computing 2 nd Nov. 2010 Motoi Okuda Fujitsu Ltd. Agenda Japanese Next-Generation Supercomputer, K Computer Project Overview Design Targets System Overview

More information

Jack Dongarra University of Tennessee Oak Ridge National Laboratory University of Manchester

Jack Dongarra University of Tennessee Oak Ridge National Laboratory University of Manchester Jack Dongarra University of Tennessee Oak Ridge National Laboratory University of Manchester 12/3/09 1 ! Take a look at high performance computing! What s driving HPC! Issues with power consumption! Future

More information

Presentation of the 16th List

Presentation of the 16th List Presentation of the 16th List Hans- Werner Meuer, University of Mannheim Erich Strohmaier, University of Tennessee Jack J. Dongarra, University of Tennesse Horst D. Simon, NERSC/LBNL SC2000, Dallas, TX,

More information

Overview. Idea: Reduce CPU clock frequency This idea is well suited specifically for visualization

Overview. Idea: Reduce CPU clock frequency This idea is well suited specifically for visualization Exploring Tradeoffs Between Power and Performance for a Scientific Visualization Algorithm Stephanie Labasan & Matt Larsen (University of Oregon), Hank Childs (Lawrence Berkeley National Laboratory) 26

More information

HPC Saudi Jeffrey A. Nichols Associate Laboratory Director Computing and Computational Sciences. Presented to: March 14, 2017

HPC Saudi Jeffrey A. Nichols Associate Laboratory Director Computing and Computational Sciences. Presented to: March 14, 2017 Creating an Exascale Ecosystem for Science Presented to: HPC Saudi 2017 Jeffrey A. Nichols Associate Laboratory Director Computing and Computational Sciences March 14, 2017 ORNL is managed by UT-Battelle

More information

Aim High. Intel Technical Update Teratec 07 Symposium. June 20, Stephen R. Wheat, Ph.D. Director, HPC Digital Enterprise Group

Aim High. Intel Technical Update Teratec 07 Symposium. June 20, Stephen R. Wheat, Ph.D. Director, HPC Digital Enterprise Group Aim High Intel Technical Update Teratec 07 Symposium June 20, 2007 Stephen R. Wheat, Ph.D. Director, HPC Digital Enterprise Group Risk Factors Today s s presentations contain forward-looking statements.

More information

HPC Algorithms and Applications

HPC Algorithms and Applications HPC Algorithms and Applications Intro Michael Bader Winter 2015/2016 Intro, Winter 2015/2016 1 Part I Scientific Computing and Numerical Simulation Intro, Winter 2015/2016 2 The Simulation Pipeline phenomenon,

More information

Toward portable I/O performance by leveraging system abstractions of deep memory and interconnect hierarchies

Toward portable I/O performance by leveraging system abstractions of deep memory and interconnect hierarchies Toward portable I/O performance by leveraging system abstractions of deep memory and interconnect hierarchies François Tessier, Venkatram Vishwanath, Paul Gressier Argonne National Laboratory, USA Wednesday

More information

Performance and Energy Usage of Workloads on KNL and Haswell Architectures

Performance and Energy Usage of Workloads on KNL and Haswell Architectures Performance and Energy Usage of Workloads on KNL and Haswell Architectures Tyler Allen 1 Christopher Daley 2 Doug Doerfler 2 Brian Austin 2 Nicholas Wright 2 1 Clemson University 2 National Energy Research

More information

Overview. High Performance Computing - History of the Supercomputer. Modern Definitions (II)

Overview. High Performance Computing - History of the Supercomputer. Modern Definitions (II) Overview High Performance Computing - History of the Supercomputer Dr M. Probert Autumn Term 2017 Early systems with proprietary components, operating systems and tools Development of vector computing

More information

Hybrid Architectures Why Should I Bother?

Hybrid Architectures Why Should I Bother? Hybrid Architectures Why Should I Bother? CSCS-FoMICS-USI Summer School on Computer Simulations in Science and Engineering Michael Bader July 8 19, 2013 Computer Simulations in Science and Engineering,

More information

CUDA Accelerated Linpack on Clusters. E. Phillips, NVIDIA Corporation

CUDA Accelerated Linpack on Clusters. E. Phillips, NVIDIA Corporation CUDA Accelerated Linpack on Clusters E. Phillips, NVIDIA Corporation Outline Linpack benchmark CUDA Acceleration Strategy Fermi DGEMM Optimization / Performance Linpack Results Conclusions LINPACK Benchmark

More information

High Performance Computing in Europe and USA: A Comparison

High Performance Computing in Europe and USA: A Comparison High Performance Computing in Europe and USA: A Comparison Erich Strohmaier 1 and Hans W. Meuer 2 1 NERSC, Lawrence Berkeley National Laboratory, USA 2 University of Mannheim, Germany 1 Introduction In

More information

Performance of deal.ii on a node

Performance of deal.ii on a node Performance of deal.ii on a node Bruno Turcksin Texas A&M University, Dept. of Mathematics Bruno Turcksin Deal.II on a node 1/37 Outline 1 Introduction 2 Architecture 3 Paralution 4 Other Libraries 5 Conclusions

More information

20 Jahre TOP500 mit einem Ausblick auf neuere Entwicklungen

20 Jahre TOP500 mit einem Ausblick auf neuere Entwicklungen 20 Jahre TOP500 mit einem Ausblick auf neuere Entwicklungen Hans Meuer Prometeus GmbH & Universität Mannheim hans@meuer.de ZKI Herbsttagung in Leipzig 11. - 12. September 2012 page 1 Outline Mannheim Supercomputer

More information

Why we need Exascale and why we won t get there by 2020 Horst Simon Lawrence Berkeley National Laboratory

Why we need Exascale and why we won t get there by 2020 Horst Simon Lawrence Berkeley National Laboratory Why we need Exascale and why we won t get there by 2020 Horst Simon Lawrence Berkeley National Laboratory 2013 International Workshop on Computational Science and Engineering National University of Taiwan

More information

The Mont-Blanc approach towards Exascale

The Mont-Blanc approach towards Exascale http://www.montblanc-project.eu The Mont-Blanc approach towards Exascale Alex Ramirez Barcelona Supercomputing Center Disclaimer: Not only I speak for myself... All references to unavailable products are

Motivation, Goal, Idea, Proposition for users, Study

Exploring Tradeoffs Between Power and Performance for a Scientific Visualization Algorithm. Stephanie Labasan, Computer and Information Science, University of Oregon, 23 November 2015. Overview: Motivation

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)

Natalia Gimelshein (NVIDIA), Anshul Gupta (IBM), Steve Rennich (NVIDIA), Seid Koric (NCSA). Watson Sparse Matrix Package (WSMP): Cholesky, LDL^T, LU factorization

Making a Case for a Green500 List

S. Sharma, C. Hsu, and W. Feng, Los Alamos National Laboratory / Virginia Tech. Outline: Introduction; What Is Performance?; Motivation: The Need for a Green500 List; Challenges

CRAY XK6 REDEFINING SUPERCOMPUTING. - Sanjana Rakhecha - Nishad Nerurkar

Sanjana Rakhecha, Nishad Nerurkar. Contents: Introduction; History; Specifications; Cray XK6 Architecture; Performance; Industry acceptance and applications; Summary

Parallel and Distributed Systems. Hardware Trends. Why Parallel or Distributed Computing? What is a parallel computer?

Instructor: Sandhya Dwarkadas, Department of Computer Science, University of Rochester. What is a parallel computer? A collection of processing elements that communicate and

System Packaging Solution for Future High Performance Computing May 31, 2018 Shunichi Kikuchi Fujitsu Limited

Shunichi Kikuchi, Fujitsu Limited, May 31, 2018. 2018 IEEE 68th Electronic Components and Technology Conference, San Diego, California, May 29

Intro To Parallel Computing. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist

John Urbanic, Parallel Computing Scientist, Pittsburgh Supercomputing Center. Purpose of this talk: this is the 50,000 ft. view of the parallel computing landscape. We want to orient

HPC future trends from a science perspective

Simon McIntosh-Smith, University of Bristol HPC Research Group, simonm@cs.bris.ac.uk. Business as usual? We've all got used to new machines being relatively

An Overview of High Performance Computing

IFIP Working Group 10.3 on Concurrent Systems. Jack Dongarra, University of Tennessee and Oak Ridge National Laboratory, 1/3/2006. Overview: Look at fastest computers

Mixed Precision Methods

Mixed precision: use the lowest precision required to achieve a given accuracy outcome. This improves runtime, reduces power consumption, and lowers data movement. Reformulate to find a correction (a minimal illustrative sketch of this idea appears at the end of this list).

Seagate ExaScale HPC Storage

Miro Lehocky, System Engineer, Seagate Systems Group, HPC1. 100+ PB Lustre File System; 130+ GB/s Lustre File System; 140+ GB/s Lustre File System; 55 PB Lustre File System; 1.6

High Performance Computing An introduction talk. Fengguang Song

Fengguang Song, fgsong@cs.iupui.edu. Content: What is HPC; History of supercomputing; Current supercomputers (Top 500); Common programming models, tools, and

Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with Aspen

Jeffrey S. Vetter, SPPEXA Symposium, Munich, 26 Jan 2016. ORNL is managed by UT-Battelle for the US Department of Energy

Intel Xeon Phi: architecture, programming models, optimization

Dmitry Prokhorov, Intel. Nizhny Novgorod, 2016. Agenda: What and why Intel Xeon Phi; Top 500 insights, roadmap, architecture; How Programming

Introduction to Parallel Programming for Multicore/Manycore Clusters

General Introduction: Invitation to Supercomputing. Kengo Nakajima, Information Technology Center, The University of Tokyo; Takahiro Katagiri

Oak Ridge National Laboratory Computing and Computational Sciences

OFA Update by ORNL. Presented by: Pavel Shamis (Pasha), OFA Workshop, Mar 17, 2015. Acknowledgments: Bernholdt David E., Hill Jason J., Leverman

Fujitsu's Technologies Leading to Practical Petascale Computing: K computer, PRIMEHPC FX10 and the Future

Motoi Okuda, Technical Computing Solution Unit, Fujitsu Limited, November 16th, 2011. Agenda: Achievements

EARLY EVALUATION OF THE CRAY XC40 SYSTEM THETA

Sudheer Chunduri, Scott Parker, Kevin Harms, Vitali Morozov, Chris Knight, Kalyan Kumaran; Performance Engineering Group, Argonne Leadership Computing Facility

Trends of Network Topology on Supercomputers. Michihiro Koibuchi National Institute of Informatics, Japan 2018/11/27

Michihiro Koibuchi, National Institute of Informatics, Japan, 2018/11/27. From Graph Golf to real interconnection networks: Case 1, on-chip networks; Case 2, supercomputer

Technologies for Information and Health

Energy, Defence and Global Security, Technologies for Information and Health. Atomic Energy Commission. HPC in France from a global perspective. Pierre LECA, Head of «Simulation and Information Sciences Dpt.»

HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK STEVE OBERLIN CTO, TESLA ACCELERATED COMPUTING NVIDIA

Steve Oberlin, CTO, Tesla Accelerated Computing, NVIDIA. State of the art 2012: 18,688 Tesla K20X GPUs, 27 PetaFLOPS. Flagship scientific applications

TOP500 Lists and Industrial/Commercial Applications

Hans Werner Meuer, Universität Mannheim. BMBF round-table discussion on non-numerical applications in high-performance computing, Berlin, 16./17.

HPCS HPCchallenge Benchmark Suite

David Koester, Ph.D., Jack Dongarra (UTK), Piotr Luszczek, 28 September 2004. Outline: Brief DARPA HPCS overview; architecture/application characterization; preliminary

The TOP500 Project of the Universities Mannheim and Tennessee

Hans Werner Meuer, University of Mannheim. EURO-PAR 2000, 29 August - 01 September 2000, Munich/Germany. Outline: TOP500 approach; HPC-Market as

CHAO YANG. Early Experience on Optimizations of Application Codes on the Sunway TaihuLight Supercomputer

Dr. Chao Yang is a full professor at the Laboratory of Parallel Software and Computational Sciences, Institute of Software, Chinese Academy of Sciences. His research interests include numerical

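The Mixed Precision Methods entry above describes the general idea behind mixed-precision computing: run the expensive part of a computation in the lowest precision that still meets the accuracy target, then reformulate the problem to find a correction. As a minimal, illustrative sketch (not taken from any of the listed documents), the following Python/NumPy fragment shows one common instance of this idea, mixed-precision iterative refinement for a dense linear system: the O(n^3) LU factorization runs in float32, while the cheap residual and update steps run in float64. The function name, matrix size, tolerance, and iteration limit are illustrative assumptions.

import numpy as np
from scipy.linalg import lu_factor, lu_solve

def mixed_precision_solve(A, b, tol=1e-10, max_iter=10):
    # Sketch of mixed-precision iterative refinement for Ax = b:
    # factor once in low precision, refine the solution in high precision.

    # Low-precision LU factorization (the costly O(n^3) step).
    lu, piv = lu_factor(A.astype(np.float32))

    # Initial low-precision solve, promoted to float64 for refinement.
    x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)

    for _ in range(max_iter):
        # High-precision residual (cheap O(n^2) step).
        r = b - A @ x
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        # Reuse the low-precision factors to compute a correction.
        d = lu_solve((lu, piv), r.astype(np.float32)).astype(np.float64)
        x += d
    return x

# Usage example on a synthetic, well-conditioned system.
rng = np.random.default_rng(0)
n = 1000
A = rng.standard_normal((n, n)) + n * np.eye(n)
b = rng.standard_normal(n)
x = mixed_precision_solve(A, b)
print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))

In this sketch the refinement loop typically recovers close to double-precision accuracy while the dominant cost stays in single precision, which is the runtime, data-movement, and energy argument the entry alludes to.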