Overview of HPC and Energy Saving on KNL for Some Computations
Jack Dongarra
University of Tennessee / Oak Ridge National Laboratory / University of Manchester
Outline
- Overview of High Performance Computing
- With extreme computing, the rules for computing have changed
Since 1993: H. Meuer, H. Simon, E. Strohmaier, & JD
- Listing of the 500 most powerful computers in the world
- Yardstick: Rmax from LINPACK MPP (Ax=b, dense problem), TPP performance
- Updated twice a year: SC'xy in the States in November, meeting in Germany in June
- All data available from www.top500.org
Performance Development of HPC over the Last 24.5 Years from the TOP500
[Chart: flop/s vs. year (log scale), 1994-2017. Sum of all 500 systems: 749 PF; N=1: 93 PF; N=500: 432 TF. In 1993: Sum 1.17 TFlop/s, N=1 59.7 GFlop/s, N=500 400 MFlop/s. The N=500 curve trails N=1 by roughly 6-8 years. My laptop: 166 Gflop/s.]
Performance Development (projected)
[Chart: N=1, Sum, and N=500 trend lines extrapolated through 2020.]
- Tflops (10^12) achieved: ASCI Red, Sandia NL. Today: SW26010 & Intel KNL ~ 3 Tflop/s.
- Pflops (10^15) achieved: Roadrunner, Los Alamos NL. Today: 1 cabinet of TaihuLight ~ 3 Pflop/s.
- Eflops (10^18) achieved? China says 2020; the U.S. says 2021.
State of Supercomputing in 2017
- Pflops (> 10^15 Flop/s) computing fully established, with 138 systems on the list.
- Three technology architectures, or "swim lanes," are thriving:
  - Commodity (e.g. Intel)
  - Commodity + accelerator (e.g. GPUs) (91 systems)
  - Lightweight cores (e.g. IBM BG, Intel's Knights Landing, ShenWei 26010, ARM)
- Interest in supercomputing is now worldwide and growing in many new markets (~50% of Top500 computers are in industry).
- Exascale (10^18 Flop/s) projects exist in many countries and regions.
- Intel processors have the largest share, 92%, followed by AMD at 1%.
Countries' Share
China has 1/3 of the systems, while the number of systems in the US has fallen to its lowest point since the TOP500 list was created.
[Treemap: each rectangle represents one of the Top500 computers; the area of the rectangle reflects its performance.]
17 Top500 Systems in France

| Name | Computer | Site | Manufacturer | Total Cores | Rmax (Tflop/s) |
|---|---|---|---|---|---|
| Pangea | SGI ICE X, Xeon E5-2670/E5-2680v3 12C 2.5GHz, Infiniband FDR | Total Exploration Production | HPE | 220,800 | 5,283.1 |
| occigen2 | bullx DLC 720, Xeon E5-2690v4 14C 2.6GHz, Infiniband FDR | GENCI-CINES | Bull | 85,824 | 2,494.7 |
| Prolix2 | bullx DLC 720, Xeon E5-2698v4 20C 2.2GHz, Infiniband FDR | Meteo France | Bull | 72,000 | 2,168.0 |
| Beaufix2 | bullx DLC 720, Xeon E5-2698v4 20C 2.2GHz, Infiniband FDR | Meteo France | Bull | 73,440 | 2,157.4 |
| Tera-1000-1 | bullx DLC 720, Xeon E5-2698v3 16C 2.3GHz, Infiniband FDR | Commissariat a l'energie Atomique (CEA) | Bull | 70,272 | 1,871 |
| Sid | bullx DLC 720, Xeon E5-2695v4 18C 2.1GHz, Infiniband FDR | Atos | Bull | 49,896 | 1,363.5 |
| Curie thin nodes | Bullx B510, Xeon E5-2680 8C 2.7GHz, Infiniband QDR | CEA/TGCC-GENCI | Bull | 77,184 | 1,359 |
| Cobalt | bullx DLC 720, Xeon E5-2680v4 14C 2.4GHz, Infiniband EDR | CEA/CCRT | Bull | 38,528 | 1,299.5 |
| Diego | bullx DLC 720, Xeon E5-2698v4 20C 2.2GHz, Infiniband FDR | Atos | Bull | 46,800 | 1,225.3 |
| Turing | BlueGene/Q, Power BQC 16C 1.6GHz, Custom | CNRS/IDRIS-GENCI | IBM | 98,304 | 1,073.3 |
| Tera-100 | Bull bullx super-node S6010/S6030 | Commissariat a l'energie Atomique (CEA) | Bull | 138,368 | 1,050 |
| Eole | Intel Cluster, Xeon E5-2680v4 14C 2.4GHz, Intel Omni-Path | EDF | Bull | 29,568 | 942.8 |
| Zumbrota | BlueGene/Q, Power BQC 16C 1.6GHz, Custom | EDF R&D | IBM | 65,536 | 715.6 |
| Occigen2 | bullx DLC 720, Xeon E5-2690v4 14C 2.6GHz, Infiniband FDR | Centre Informatique National (CINES) | Bull | 19,488 | 686.0 |
| Sator | NEC HPC1812-Rg-2, Xeon E5-2680v4 14C 2.4GHz, Intel Omni-Path | ONERA | NEC | 17,360 | 579.2 |
| HPC4 | HP POD, Cluster Platform BL460c, Intel Xeon E5-2697v2 12C 2.7GHz, Infiniband FDR | Airbus | HPE | 34,560 | 516.9 |
| PORTHOS | IBM NeXtScale nx360 M5, Xeon E5-2697v3 14C 2.6GHz, Infiniband FDR | EDF R&D | IBM | 16,100 | 506.4 |
June 2017: The TOP 10 Systems

| Rank | Site | Computer | Country | Cores | Rmax (Pflops) | % of Peak | Power (MW) | GFlops/Watt |
|---|---|---|---|---|---|---|---|---|
| 1 | National Supercomputer Center in Wuxi | Sunway TaihuLight, SW26010 (260C) + Custom | China | 10,649,600 | 93.0 | 74 | 15.4 | 6.04 |
| 2 | National Supercomputer Center in Guangzhou | Tianhe-2, NUDT Xeon (12C) + Intel Xeon Phi (57C) + Custom | China | 3,120,000 | 33.9 | 62 | 17.8 | 1.91 |
| 3 | Swiss CSCS | Piz Daint, Cray XC50, Xeon (12C) + Nvidia P100 (56C) + Custom | Switzerland | 361,760 | 19.6 | 77 | 2.27 | 8.6 |
| 4 | DOE / OS Oak Ridge Nat Lab | Titan, Cray XK7, AMD (16C) + Nvidia Kepler GPU (14C) + Custom | USA | 560,640 | 17.6 | 65 | 8.21 | 2.14 |
| 5 | DOE / NNSA Lawrence Livermore Nat Lab | Sequoia, BlueGene/Q (16C) + custom | USA | 1,572,864 | 17.2 | 85 | 7.89 | 2.18 |
| 6 | DOE / OS Lawrence Berkeley Nat Lab | Cori, Cray XC40, Xeon Phi (68C) + Custom | USA | 622,336 | 14.0 | 50 | 3.94 | 3.55 |
| 7 | Joint Center for Advanced HPC | Oakforest-PACS, Fujitsu Primergy CX1640, Xeon Phi (68C) + Omni-Path | Japan | 558,144 | 13.6 | 54 | 2.72 | 4.98 |
| 8 | RIKEN Advanced Inst for Comp Sci | K computer, Fujitsu SPARC64 VIIIfx (8C) + Custom | Japan | 705,024 | 10.5 | 93 | 12.7 | 0.827 |
| 9 | DOE / OS Argonne Nat Lab | Mira, BlueGene/Q (16C) + Custom | USA | 786,432 | 8.59 | 85 | 3.95 | 2.17 |
| 10 | DOE / NNSA / Los Alamos & Sandia | Trinity, Cray XC40, Xeon (16C) + Custom | USA | 301,056 | 8.10 | 80 | 4.23 | 1.92 |
| 500 | Internet company | Sugon, Intel (8C) + Nvidia | China | — | 0.432 | 71 | 1.2 | 0.36 |

Slide callout: TaihuLight alone is ~35% of the aggregate performance of all US systems on the list.
Next Big Systems in the US DOE
- Oak Ridge Lab and Lawrence Livermore Lab to receive IBM- and Nvidia-based systems
  - IBM Power9 + Nvidia Volta V100
  - Installation started, but not completed until late 2018-2019
  - 5-10 times Titan on apps
  - 4,600 nodes, each containing six 7.5-teraflop NVIDIA V100 GPUs and two IBM Power9 CPUs; aggregate peak performance should be > 200 petaflops
  - ORNL maybe fully ready June 2018 (Top500 number)
- In 2021, Argonne Lab to receive an Intel-based system
  - Exascale system, based on future Intel parts, not Knights Hill: "Aurora 21"
  - Balanced architecture to support three pillars: large-scale simulation (PDEs, traditional HPC), data-intensive applications (science pipelines), deep learning and emerging science AI
  - Integrated computing, acceleration, storage
  - Towards a common software stack
Toward Exascale
China plans for Exascale in 2020; three separate developments in HPC, anything but from the US:
- Wuxi: upgrade the ShenWei, O(100) Pflops, all Chinese
- National University for Defense Technology: Tianhe-2A, O(100) Pflops, Chinese ARM processor + accelerator
- Sugon (CAS ICT): x86-based; collaboration with AMD
US DOE Exascale Computing Program, a 7-year program:
- Initial exascale system based on advanced architecture, delivered in 2021
- Enable capable exascale systems, based on ECP R&D, delivered in 2022 and deployed in 2023
- Capable exascale: 50X the performance of today's 20 PF systems, 20-30 MW power, at most 1 perceived fault per week, SW to support a broad range of apps
The US Department of Energy Exascale Computing Program has formulated a holistic approach that uses co-design and integration to achieve capable exascale, across four focus areas:
- Application Development: science and mission applications (correctness, visualization, data analysis; applications co-design)
- Software Technology: scalable and productive software stack (resilience; programming models, development environment, and runtimes; system software for resource management, threading, scheduling, monitoring, and control; node OS and runtimes; math libraries and frameworks; tools; data management, I/O and file system; workflows)
- Hardware Technology: hardware technology elements (memory and burst buffer; hardware interface)
- Exascale Systems: integrated exascale supercomputers
ECP's work encompasses applications, system software, hardware technologies and architectures, and workforce development.
Exascale Computing Project, www.exascaleproject.org
The Problem with the LINPACK Benchmark
- HPL performance of computer systems is no longer strongly correlated to real application performance, especially for the broad set of HPC applications governed by partial differential equations.
- Designing a system for good HPL performance can actually lead to design choices that are wrong for the real application mix, or add unnecessary components or complexity to the system.
Many Problems in Computational Science Involve Solving PDEs: Large Sparse Linear Systems
- Given a PDE over some domain, plus boundary conditions, e.g. -Δu + c·∇u + γu ≡ Pu = f, where P denotes the differential operator.
- Discretization (e.g., Galerkin equations): find u_h = Σ_{i=1}^{n} Φ_i x_i such that (P u_h, Φ_j) = (f, Φ_j) for every Φ_j, i.e. Σ_{i=1}^{n} (P Φ_i, Φ_j) x_i = (f, Φ_j), with a_{ji} = (P Φ_i, Φ_j) and b_j = (f, Φ_j).
- The result is a sparse linear system A x = b: the basis functions Φ_j have local support, so interactions are local and the matrix is sparse; a given row has nonzeros only for the handful of neighboring basis functions (6 in the slide's example mesh).
- Examples: modeling diffusion, fluid flow.
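A minimal sketch (not from the slides) of the pipeline above in one dimension: discretizing -u'' = f on [0,1] with central differences produces a tridiagonal A (3 nonzeros per interior row, rather than the 6 of the slide's 2D mesh), and solving Ax = b recovers the solution. The grid size and right-hand side are illustrative.

```python
import numpy as np

n = 99                      # interior grid points, mesh width h = 1/(n+1)
h = 1.0 / (n + 1)

# Each row couples a node only to its two neighbors -> tridiagonal (sparse) A
A = (np.diag(2.0 * np.ones(n))
     + np.diag(-np.ones(n - 1), 1)
     + np.diag(-np.ones(n - 1), -1)) / h**2

xg = np.linspace(h, 1 - h, n)
f = np.pi**2 * np.sin(np.pi * xg)   # chosen so the exact solution is sin(pi x)

u = np.linalg.solve(A, f)           # in practice: a sparse/iterative solver
err = np.max(np.abs(u - np.sin(np.pi * xg)))
print(err)                          # small: O(h^2) discretization error
```

A production code would store A in a sparse format and use an iterative method; the dense solve here just keeps the sketch self-contained.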
HPCG (hpcg-benchmark.org)
- High Performance Conjugate Gradients (HPCG): solves Ax=b, with A large and sparse, b known, x computed.
- An optimized implementation of PCG contains essential computational and communication patterns that are prevalent in a variety of methods for the discretization and numerical solution of PDEs.
- Synthetic discretized 3D PDE (FEM, FVM, FDM); sparse matrix with 27 nonzeros/row in the interior, 8-18 on the boundary; symmetric positive definite.
- Patterns: dense and sparse computations; dense and sparse collectives; multi-scale execution of kernels via a (truncated) multigrid V-cycle; data-driven parallelism (unstructured sparse triangular solves).
- Strong verification (via spectral properties of PCG).
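To make the "essential patterns" concrete, here is a minimal preconditioned CG sketch, using a Jacobi (diagonal) preconditioner instead of HPCG's multigrid; the test matrix is an arbitrary SPD example, not HPCG's 3D PDE operator.

```python
import numpy as np

def pcg(A, b, M_inv_diag, tol=1e-10, maxit=500):
    """Preconditioned CG for SPD A; M_inv_diag holds 1/diag(A) (Jacobi)."""
    x = np.zeros_like(b)
    r = b - A @ x                 # residual
    z = M_inv_diag * r            # preconditioned residual
    p = z.copy()                  # search direction
    rz = r @ z
    for _ in range(maxit):
        Ap = A @ p                # sparse matrix-vector product in real HPCG
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = M_inv_diag * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# Illustrative SPD system (diagonally dominant)
n = 50
rng = np.random.default_rng(0)
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)
b = rng.standard_normal(n)
x = pcg(A, b, 1.0 / np.diag(A))
print(np.linalg.norm(A @ x - b))  # tiny residual
```

The dominant cost is the matrix-vector product, which for a sparse A is bandwidth-bound: exactly why HPCG numbers sit so far below HPL.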
HPL vs. HPCG: Bookends
- Some see HPL and HPCG as bookends of a performance spectrum.
- Applications teams know where their codes lie on the spectrum.
- They can gauge performance on a system using both its HPL and HPCG numbers.
HPCG Results, June 2017, Top 10

| Rank | Site | Computer | Cores | Rmax (Pflops) | HPCG (Pflops) | HPCG/HPL | HPCG % of Peak |
|---|---|---|---|---|---|---|---|
| 1 | RIKEN Advanced Institute for Computational Science, Japan | K computer, SPARC64 VIIIfx 2.0GHz, Tofu interconnect | 705,024 | 10.5 | 0.60 | 5.7% | 5.3% |
| 2 | NSCC / Guangzhou, China | Tianhe-2, NUDT Xeon 12C 2.2GHz + Intel Xeon Phi 57C + Custom | 3,120,000 | 33.8 | 0.58 | 1.7% | 1.1% |
| 3 | National Supercomputing Center in Wuxi, China | Sunway TaihuLight, Sunway MPP, SW26010 260C 1.45GHz, NRCPC | 10,649,600 | 93.0 | 0.48 | 0.5% | 0.4% |
| 3 | Swiss National Supercomputing Centre (CSCS), Switzerland | Piz Daint, Cray XC50, Xeon E5-2690v3 12C 2.6GHz, Aries, NVIDIA P100, Cray | 361,760 | 19.6 | 0.48 | 2.4% | 1.9% |
| 5 | Joint Center for Advanced HPC, Japan | Oakforest-PACS, PRIMERGY CX1640 M1, Intel Xeon Phi 7250 68C 1.4GHz, Intel Omni-Path, Fujitsu | 557,056 | 13.6 | 0.39 | 2.8% | 1.5% |
| 6 | DOE/SC/LBNL/NERSC, USA | Cori, XC40, Intel Xeon Phi 7250 68C 1.4GHz, Cray Aries, Cray | 632,400 | 13.8 | 0.36 | 2.6% | 1.3% |
| 7 | DOE/NNSA/LLNL, USA | Sequoia, IBM BlueGene/Q, PowerPC A2 16C 1.6GHz, 5D Torus, IBM | 1,572,864 | 17.2 | 0.33 | 1.9% | 1.6% |
| 8 | DOE/SC/Oak Ridge National Lab, USA | Titan, Cray XK7, Opteron 6274 16C 2.2GHz, Cray Gemini interconnect, NVIDIA K20x | 560,640 | 17.6 | 0.32 | 1.8% | 1.2% |
| 9 | DOE/NNSA/LANL/SNL, USA | Trinity, Cray XC40, Intel E5-2698v3, Aries custom, Cray | 301,056 | 8.10 | 0.18 | 2.3% | 1.6% |
| 10 | NASA / Mountain View, USA | Pleiades, SGI ICE X, Intel E5-2680, E5-2680v2, E5-2680v3, E5-2680v4, Infiniband FDR, HPE | 243,008 | 6.0 | 0.17 | 2.9% | 2.5% |
Bookends: Peak and HPL Only
[Chart: Rpeak and Rmax (Pflop/s, log scale) for the TOP500-ranked systems that reported HPCG results.]
Bookends: Peak, HPL, and HPCG
[Chart: Rpeak, Rmax, and HPCG (Pflop/s, log scale) for the same systems; HPCG sits roughly two orders of magnitude below Rmax.]
Machine Learning in Computational Science
Many fields are beginning to adopt machine learning to augment modeling and simulation methods:
- Climate
- Biology
- Drug design
- Epidemiology
- Materials
- Cosmology
- High-energy physics
IEEE 754 Half Precision (16-bit) Floating Point Standard
A lot of interest, driven by machine learning.
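A quick illustration (added here, not from the slides) of what fp16 actually gives you: 1 sign bit, 5 exponent bits, 10 mantissa bits, so a tiny dynamic range and about 3 decimal digits of precision.

```python
import numpy as np

info = np.finfo(np.float16)
print(info.max)   # 65504.0 -- largest finite fp16 value
print(info.eps)   # ~0.000977 = 2**-10, the unit roundoff scale

# Overflow happens almost immediately compared to fp32/fp64:
print(np.float16(65504) * np.float16(2))   # inf
```

The limited exponent range (overflow at ~6.5e4, underflow near 6e-5 for normals) is why the later slides stress "note the number range with half precision."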
Mixed Precision
- Extended to sparse direct methods
- Iterative methods with inner and outer iterations
- Today there are many precisions to deal with
- Note the limited number range with half precision (16-bit floating point)
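The inner/outer-iteration idea can be sketched as classic mixed-precision iterative refinement: factor and solve in the fast low precision, compute residuals and corrections in the high precision. This sketch uses float32/float64 (numpy has no fast fp16 solve) and re-solves instead of reusing LU factors, so it shows the numerics, not the speedup.

```python
import numpy as np

def mixed_precision_solve(A, b, iters=5):
    """Outer iteration in float64, inner solves in float32 (illustrative)."""
    A32 = A.astype(np.float32)
    # Initial solve entirely in low precision
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x                                   # residual in double
        d = np.linalg.solve(A32, r.astype(np.float32))  # correction in single
        x += d.astype(np.float64)
    return x

rng = np.random.default_rng(1)
n = 100
A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned example
b = rng.standard_normal(n)
x = mixed_precision_solve(A, b)
print(np.linalg.norm(A @ x - b))   # near double-precision accuracy
```

A real implementation would keep the single-precision LU factors and only do triangular solves in the loop, paying O(n^2) per refinement step against the O(n^3) factorization.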
Deep Learning Needs Batched Operations
- Matrix multiply is the time-consuming part.
- Convolution layers and fully connected layers require matrix multiply.
- There are many GEMMs on small matrices; they are perfectly parallel and can get by with 16-bit floating point.
- [Diagram: a fully connected classification layer y = Wx, with inputs x1..x3, weights w11..w23, and outputs y1, y2; the convolution step in this case is a 3x3 GEMM.]
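The diagram's fully connected layer is just a GEMM, and batching many inputs turns many tiny matrix-vector products into one larger multiply. A sketch (sizes and the fp16 choice are illustrative, matching the slide's 2x3 weight example):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((2, 3)).astype(np.float16)      # layer weights, fp16
X = rng.standard_normal((3, 1000)).astype(np.float16)   # 1000 input vectors

# One GEMM instead of 1000 small matrix-vector products:
Y = W @ X
print(Y.shape)   # (2, 1000): two outputs per input sample
```

GPU libraries exploit exactly this shape: the per-sample work is too small to fill the device, but the batched GEMM is not.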
Standard for Batched Computations
- Define a standard API for batched BLAS and LAPACK, in collaboration with Intel/Nvidia/ECP/other users
- Fixed size: most of BLAS and LAPACK released
- Variable size: most of BLAS released
- Variable size LAPACK in the branch
- Native GPU algorithms (Cholesky, LU, QR) in the branch
- Tiled algorithms using batched routines on tile or LAPACK data layout in the branch
- Framework for Deep Neural Network kernels
- CPU, KNL and GPU routines
- FP16 routines in progress
Batched Computations: Non-batched computation
Loop over the matrices one by one and compute each using multiple threads. Note that, since the matrices are of small size, there is not enough work for all the cores, so we expect low performance; thread contention might also hurt performance.

    for (i = 0; i < batchcount; i++)
        dgemm( ... );

There is not enough work to fulfill all the cores: a low percentage of the resources is used.
Batched Computations: Batched computation
Distribute all the matrices over the available resources by assigning each matrix to a group of cores (CPU) or thread blocks (GPU) that operates on it independently:
- For very small matrices, assign a matrix per core (CPU) or per thread block (GPU)
- For medium sizes, a matrix goes to a team of cores (CPU) or many thread blocks (GPU)
- For large sizes, switch to the classical multithreaded case: one matrix per round

    Batched_dgemm( ... );

A task manager/dispatcher, based on the kernel design, decides the number of thread blocks or threads (GPU/CPU) through the Nvidia/OpenMP scheduler. A high percentage of the resources is used.
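The batched-vs-looped contrast can be sketched with numpy, whose `matmul` already operates on stacks of matrices, analogous to a `Batched_dgemm` dispatching one small GEMM per group of cores. Batch and matrix sizes are illustrative.

```python
import numpy as np

batch, n = 10000, 8
A = np.random.rand(batch, n, n)
B = np.random.rand(batch, n, n)

# Batched: all 10000 8x8 GEMMs dispatched in one call
C = A @ B

# Equivalent non-batched loop (the slow pattern from the previous slide):
#   for i in range(batch):
#       C[i] = A[i] @ B[i]

print(C.shape)   # (10000, 8, 8)
```

The batched form lets the library choose how to map matrices to threads; the explicit loop forces one-kernel-at-a-time execution with too little work per call.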
Batched Computations: How Do We Design and Optimize?
[Chart: performance vs. matrix size. Batched routines give ~2-3x over non-batched for small sizes; for large sizes, switch to non-batched (1X).]
Batched Computations: How Do We Design and Optimize?
[Chart: ~3x over non-batched for small sizes, ~1.2x for medium sizes; switch to non-batched for large sizes.]
Batched dgemm on the Nvidia V100 GPU
[Chart: Gflop/s vs. matrix size for batched dgemm vs. standard dgemm (BLAS 3). For small sizes, batched dgemm is ~19X faster than standard dgemm; for medium sizes, ~1.4X; for large sizes, switch to non-batched, reaching ~97% of the attainable peak.]
Intel Knights Landing: FLAT Mode
- 68 cores, 1.3 GHz; peak DP = 2662 Gflop/s
- KNL in FLAT mode: the entire MCDRAM is used as addressable memory, and MCDRAM allocations are treated similarly to DDR4 memory allocations
- KNL Thermal Design Power (TDP) is 215 W (for the main SKUs)
Level 1, 2 and 3 BLAS on KNL
68-core Intel Xeon Phi KNL, 1.3 GHz; theoretical double-precision peak = 2662 Gflop/s. Compiled with icc and using Intel MKL 2017.
[Chart: Level 3 (dgemm) reaches ~2000 Gflop/s; Level 2 (dgemv) 85.3 Gflop/s; Level 1 (daxpy) 35.1 Gflop/s.]
PAPI for Power-Aware Computing
- We use PAPI's latest powercap component for measurement, control, and performance analysis.
- PAPI power components in the past supported only reading power information.
- The new component exposes RAPL functionality to allow users to read and write power limits.
- Study numerical building blocks of varying computational intensity; use the PAPI powercap component to detect power optimization opportunities.
- Objective: cap the power on the architecture to reduce power usage while keeping the execution time constant. Energy savings!
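For context, a hedged sketch of the mechanism the powercap component sits on: on Linux, RAPL limits are exposed as sysfs files holding values in microwatts (the real zone would be something like `/sys/class/powercap/intel-rapl:0`, needing root and an Intel CPU). The demo directory below stands in for sysfs so the sketch runs anywhere.

```python
from pathlib import Path
import tempfile

def read_power_limit_uw(zone: Path) -> int:
    """Read the long-term power limit (microwatts) from a powercap zone."""
    return int((zone / "constraint_0_power_limit_uw").read_text())

def write_power_limit_uw(zone: Path, microwatts: int) -> None:
    """Write a new power limit (microwatts) to a powercap zone."""
    (zone / "constraint_0_power_limit_uw").write_text(str(microwatts))

# Mimic a powercap zone with a temp dir (real path: /sys/class/powercap/...)
zone = Path(tempfile.mkdtemp())
write_power_limit_uw(zone, 215_000_000)        # 215 W: the KNL TDP default
write_power_limit_uw(zone, 160_000_000)        # cap at 160 W
print(read_power_limit_uw(zone) / 1e6, "W")    # 160.0 W
```

PAPI wraps this plumbing behind its event interface, which is what the `PAPI_read`/`PAPI_write` snippet on the DGEMM slide uses.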
Level 3 BLAS: DGEMM on KNL using MCDRAM
68-core KNL, peak DP = 2662 Gflop/s. Bandwidth: MCDRAM ~425 GB/s, DDR4 ~90 GB/s.

DGEMM is run repeatedly for a fixed matrix size of 18K; at each step a new power limit is set, decreasing from the default (TDP, 215 W) through 220 W and 200 W down to 120 W in steps of 10-20 W. Power limits are set through PAPI (values are in microwatts):

    PAPI_read(EventSet, values);
    values[0] = 268750000;   /* short-term power limit: 268.75 W */
    values[1] = 258000000;   /* long-term power limit: 258 W */
    PAPI_write(EventSet, values);

[Chart: package (PACKAGE) and memory (DDR4) power vs. time, with the cap marked at each step.]

| Power cap (W) | Gflop/s | Gflops/W | Joules |
|---|---|---|---|
| default | 1997 | 8.82 | 1330 |
| 220 | 1940 | 8.92 | 1279 |
| 200 | 1741 | 8.95 | 1267 |
| 180 | 1589 | 9.04 | 1253 |
| 160 | 1328 | 8.47 | 1345 |
| 150 | 1137 | 7.73 | 1480 |
| 140 | 956 | 6.95 | 1661 |
| 130 | 773 | 6.30 | 1940 |
| 120 | 560 | 4.72 | 2443 |

Optimum flops at the default limit; optimum flops/Watt at 180 W. When the power cap kicks in, the core frequency drops, confirming that capping acts through DVFS.
Level 3 BLAS: DGEMM on KNL, MCDRAM vs. DDR4
[Charts: the same power-capping sweep with data placed in MCDRAM and in DDR4.]

DDR4 results for the same sweep:

| Power cap (W) | Gflop/s | Gflops/W | Joules |
|---|---|---|---|
| default | 1991 | 8.22 | 1383 |
| 220 | 1910 | 8.36 | 1341 |
| 200 | 1785 | 8.58 | 1314 |
| 180 | 1615 | 8.57 | 1325 |
| 160 | 1419 | 8.20 | 1340 |
| 150 | 1154 | 7.36 | 1562 |
| 140 | 971 | 6.66 | 1715 |
| 130 | 785 | 5.81 | 1982 |
| 120 | 575 | 4.64 | 2489 |

Lesson for DGEMM-type operations (compute intensive): capping can reduce performance, but sometimes it can hold the energy efficiency (Gflops/Watt) steady.

DGEMM is run repeatedly for a fixed matrix size of 18K; at each step a new power limit is set, decreasing from the default through 220 W and 200 W down to 120 W in steps of 10-20 W.
Level 2 BLAS: DGEMV on KNL, MCDRAM vs. DDR4
[Charts: power vs. time for the same sweep; at the default limit, MCDRAM sustains ~84 Gflop/s (~335 GB/s, 0.39 Gflops/W), while DDR4 sustains only ~21 Gflop/s (~85 GB/s, 0.12 Gflops/W).]

Lesson for DGEMV-type operations (memory bound):
- For MCDRAM, capping at 190 W results in a power reduction without any loss of performance (energy saving ~20%).
- For DDR4, capping at 120 W improves energy efficiency by ~17%.
- Overall, capping 40 W below the power observed at the default setting provides about 17-20% energy saving without any significant reduction in time to solution.

DGEMV is run repeatedly for a fixed matrix size of 18K; at each step a new power limit is set, decreasing from the default through 220 W and 200 W down to 120 W in steps of 10-20 W.
Level 1 BLAS: DAXPY on KNL, MCDRAM vs. DDR4
[Charts: power vs. time for the same sweep; MCDRAM sustains ~35 Gflop/s (~420 GB/s), DDR4 only ~7 Gflop/s (~87 GB/s).]

Lesson for DAXPY-type operations (memory bound):
- For MCDRAM, capping at 180 W results in a power reduction without any loss of performance (energy saving ~16%).
- For DDR4, capping at 130 W improves energy efficiency by ~25%.
- Overall, capping 40 W below the power observed at the default setting provides about 16-25% energy saving without any significant reduction in time to solution.

DAXPY is run repeatedly for a fixed vector size of 1E8; at each step a new power limit is set, decreasing from the default through 220 W and 200 W down to 120 W in steps of 10-20 W.
Intel Knights Landing: Level 1, 2 and 3 BLAS Summary
68-core Intel Xeon Phi KNL, 1.3 GHz; peak DP = 2662 Gflop/s

| BLAS | MCDRAM rate (Gflop/s) | MCDRAM efficiency (Gflop/s per W) | DDR4 rate (Gflop/s) | DDR4 efficiency (Gflop/s per W) |
|---|---|---|---|---|
| Level 3 | 1997 | 9.04 | 1991 | 8.22 |
| Level 2 | 84 | 0.45 | 21 | 0.12 |
| Level 1 | 35 | 0.17 | 7 | 0.04 |
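The huge rate gap between BLAS levels follows directly from arithmetic intensity (flops per byte moved). A small worked example (added for illustration; traffic counts are the usual idealized ones for double precision, 8 bytes/word):

```python
def intensity(flops, words_moved):
    """Arithmetic intensity in flops per byte, double precision."""
    return flops / (8 * words_moved)

n = 18000
daxpy = intensity(2 * n, 3 * n)               # read x and y, write y
dgemv = intensity(2 * n * n, n * n + 3 * n)   # matrix traffic dominates
dgemm = intensity(2 * n**3, 4 * n * n)        # read A, B, C; write C

print(round(daxpy, 3), round(dgemv, 3), round(dgemm, 1))
# daxpy ~0.083 and dgemv ~0.25 flop/byte are bandwidth-bound;
# dgemm ~1125 flops/byte has ample data reuse and is compute-bound.
```

At ~425 GB/s from MCDRAM, 0.083 flop/byte caps DAXPY near 35 Gflop/s, matching the table above, while DGEMM's O(n) reuse lets it approach the 2662 Gflop/s compute peak.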
Power-Awareness in Applications
HPCG benchmark on a grid of size 192^3: MCDRAM
[Chart: average power vs. time for power limits from 215 W (TDP default) down to 120 W.]

Energy per run: 215 W: 1522 J; 200 W: 1540 J; 180 W: 1468 J; 160 W: 1486 J; 140 W: 1728 J; 120 W: 2390 J.

- Sparse matrix-vector operations; solving Ax=b with the Conjugate Gradient algorithm.
- Decreasing the power limit to 200 W does not result in any loss of performance, while we observe a reduction in the power consumption rate.
- Decreasing the power limit by 40 W (setting it to 160 W) provides energy saving and power reduction without any meaningful loss in performance: less than a 1% change in time to solution.
Power-Awareness in Applications
HPCG benchmark on a grid of size 192^3: DDR4
[Chart: average power vs. time for power limits from 215 W down to 120 W.]

Energy per run: 215 W: 3915 J; 200 W: 3957 J; 180 W: 3940 J; 160 W: 3710 J; 140 W: 3300 J; 120 W: 3158 J.

- Sparse matrix-vector operations; solving Ax=b with the Conjugate Gradient algorithm.
- Decreasing the power limit to 160 W does not result in any loss of performance, while reducing the power rate and saving energy.
- Setting the power limit 40 W below the power observed at default (i.e., at 140 W) provides a large energy saving (15%) and power reduction without any loss in performance.
- At 120 W, less than a 1% change in time to solution, while a large energy saving (20%) and power rate reduction are observed.
Power-Awareness in Applications
HPCG benchmark on a grid of size 192^3: MCDRAM & DDR4
[Charts: the MCDRAM and DDR4 sweeps side by side, with energies as on the previous two slides.]

Lesson for HPCG:
- For MCDRAM, decreasing the power limit 40 W below the power observed at default (setting the limit at 160 W) provides energy saving and power rate reduction without any meaningful loss in performance: less than a 1% change in time to solution.
- For DDR4, similar behavior is observed: decreasing the power limit 40 W below the power observed at default (setting the limit at 120 W) provides a large energy saving (15-20%) and power rate reduction without any significant increase in time to solution.
Power-Awareness in Applications
Solving the Helmholtz equation with finite differences, Jacobi iteration, 128x128 grid
[Charts: average power vs. time for MCDRAM and DDR4 at limits from 215 W down to 120 W. MCDRAM energy: 826, 842, 724, 830, 1660, 1865 J; DDR4 energy: 2770, 2878, 2689, 2438, 2137, 2731 J.]

Lesson for Jacobi iteration:
- For MCDRAM, capping at 180 W improves power efficiency by ~13% without any loss in time to solution, while also decreasing the power rate requirement by about 20%.
- For DDR4, capping at 140 W improves power efficiency by ~20% without any loss in time to solution.
- Overall, capping 30-40 W below the power observed at the default limit provides a large energy gain while keeping the same time to solution, similar to the behavior observed for the DGEMV and DAXPY kernels.
Power-Awareness in Applications
Lattice-Boltzmann simulation of CFD (from the SPEC 2006 benchmark)
[Charts: average power vs. time for MCDRAM and DDR4 at limits from 270 W down to 120 W, with energy per run annotated; below ~160 W (MCDRAM) energy climbs sharply, reaching 5734 J at 120 W, while DDR4 energy falls from ~9200 J to 8467 J at 120 W.]

Lesson for Lattice-Boltzmann:
- For MCDRAM, capping at 200 W results in ~5% energy saving and power rate reduction without any increase in time to solution.
- For DDR4, capping at 140 W improves energy efficiency by ~11% and reduces the power rate without any increase in time to solution.
Power-Awareness in Applications
Monte Carlo neutron transport application (from XSBench)
[Charts: average power vs. time for MCDRAM (215 W down to 140 W; energy 9175, 8955, 8720, 8724 J, then a sharp jump at 140 W) and DDR4 (215 W down to 120 W; energy 19910, 19796, 19676, 17763, 16300, 13432 J).]

Lesson for Monte Carlo:
- For MCDRAM, capping at 180 W results in ~5% energy saving and power rate reduction without any meaningful increase in time to solution.
- For DDR4, capping at 120 W improves power efficiency by ~30% without any loss in performance.
Power-Aware Tools
- Designing a toolset that will automatically monitor memory traffic
- Track the performance and data movement
- Adjust the power to keep the performance roughly the same while decreasing energy consumption
- Better power efficiency without significant loss of performance
- All in an automatic fashion, without user involvement
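A hypothetical sketch of such a toolset's control loop, consistent with the capping results above: lower the cap step by step while measured throughput stays within a tolerance of the uncapped rate. `measure_gflops` and `set_power_cap` stand in for PAPI-based helpers and are simulated here, so the whole example is an assumption-labeled illustration, not the actual tool.

```python
def autocap(measure_gflops, set_power_cap, caps, tol=0.01):
    """Pick the lowest power cap that preserves throughput within tol."""
    set_power_cap(caps[0])
    baseline = measure_gflops(caps[0])   # throughput at the default cap
    best = caps[0]
    for cap in caps[1:]:                 # try progressively lower caps
        set_power_cap(cap)
        if measure_gflops(cap) >= (1 - tol) * baseline:
            best = cap                   # same performance, less power
        else:
            break                        # performance dropped: stop lowering
    set_power_cap(best)
    return best

# Simulated memory-bound kernel: flat at 84 Gflop/s until caps below 160 W,
# mimicking the DGEMV behavior reported earlier
def perf(cap):
    return 84.0 if cap >= 160 else 84.0 * cap / 160

chosen = autocap(perf, lambda cap: None, caps=[215, 200, 180, 160, 140, 120])
print(chosen)   # 160: the knee found in the DGEMV/HPCG sweeps
```

The interesting engineering is in `measure_gflops`: for real codes it would watch memory traffic and instruction throughput via PAPI rather than a known flop rate.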
Critical Issues and Challenges at Peta- & Exascale for Algorithm and Software Design
- Synchronization-reducing algorithms: break the fork-join model
- Communication-reducing algorithms: use methods which have a lower bound on communication
- Mixed precision methods: 2x speed of operations and 2x speed for data movement
- Autotuning: today's machines are too complicated; build smarts into software to adapt to the hardware
- Fault resilient algorithms: implement algorithms that can recover from failures/bit flips
- Reproducibility of results: today we can't guarantee this without a penalty; we understand the issues, but some of our colleagues have a hard time with this
Collaborators and Support
- MAGMA team: http://icl.cs.utk.edu/magma
- PLASMA team: http://icl.cs.utk.edu/plasma
- Collaborating partners: University of Tennessee, Knoxville; University of California, Berkeley; University of Colorado, Denver; University of Manchester