Overview of HPC and Energy Saving on KNL for Some Computations

Overview of HPC and Energy Saving on KNL for Some Computations. Jack Dongarra, University of Tennessee, Oak Ridge National Laboratory, University of Manchester. 2017.

Outline: Overview of High Performance Computing. With extreme computing, the rules for computing have changed.

TOP500: Since 1993, H. Meuer, H. Simon, E. Strohmaier, & JD
- Listing of the 500 most powerful computers in the world
- Yardstick: Rmax from LINPACK MPP (Ax=b, dense problem, TPP performance)
- Updated twice a year: SC xy in the States in November, meeting in Germany in June
- All data available from www.top500.org

PERFORMANCE DEVELOPMENT OF HPC OVER THE LAST 24.5 YEARS FROM THE TOP500
(Chart: Sum, N=1, and N=500 performance, 1994-2016, from Mflop/s to Pflop/s. Current values: Sum = 749 PF, N=1 = 93 PF, N=500 = 432 TF; 1993 values: 1.17 TFlop/s, 59.7 GFlop/s, 400 MFlop/s; roughly a 6-8 year lag between N=1 and N=500. My laptop: 166 Gflop/s.)

PERFORMANCE DEVELOPMENT (projected to 2020)
(Chart: Sum, N=1, and N=500 performance, 1994-2020, from Gflop/s to Eflop/s.)
Tflops (10^12) achieved: ASCI Red, Sandia NL. Today: SW26010 & Intel KNL ~ 3 Tflop/s.
Pflops (10^15) achieved: RoadRunner, Los Alamos NL. Today: 1 cabinet of TaihuLight ~ 3 Pflop/s.
Eflops (10^18) achieved? China says 2020; U.S. says 2021.

State of Supercomputing in 2017
- Pflops (> 10^15 flop/s) computing fully established with 138 computer systems.
- Three technology architectures, or swim lanes, are thriving: commodity (e.g. Intel); commodity + accelerator, e.g. GPUs (91 systems); lightweight cores (e.g. IBM BG, Intel's Knights Landing, ShenWei 26010, ARM).
- Interest in supercomputing is now worldwide, and growing in many new markets (~50% of Top500 computers are in industry).
- Exascale (10^18 flop/s) projects exist in many countries and regions.
- Intel processors have the largest share, 92%, followed by AMD at 1%.

Countries' Share: China has 1/3 of the systems, while the number of systems in the US has fallen to the lowest point since the TOP500 list was created. (Treemap: each rectangle represents one of the Top500 computers; the area of the rectangle reflects its performance.)

Top500 Systems in France (17 systems)
Pangea: SGI ICE X, Xeon E5-2670 / E5-2680v3 12C 2.5GHz, Infiniband FDR; Total Exploration Production; HPE
Occigen2: bullx DLC 720, Xeon E5-2690v4 14C 2.6GHz, Infiniband FDR; GENCI-CINES; Bull
Prolix2: bullx DLC 720, Xeon E5-2698v4 20C 2.2GHz, Infiniband FDR; Meteo France; Bull
Beaufix2: bullx DLC 720, Xeon E5-2698v4 20C 2.2GHz, Infiniband FDR; Meteo France; Bull
Tera-1000-1: bullx DLC 720, Xeon E5-2698v3 16C 2.3GHz, Infiniband FDR; Commissariat a l'Energie Atomique (CEA); Bull
Sid: bullx DLC 720, Xeon E5-2695v4 18C 2.1GHz, Infiniband FDR; Atos; Bull
Curie thin nodes: Bullx B510, Xeon E5-2680 8C 2.7GHz, Infiniband QDR; CEA/TGCC-GENCI; Bull
Cobalt: bullx DLC 720, Xeon E5-2680v4 14C 2.4GHz, Infiniband EDR; CEA/CCRT; Bull
Diego: bullx DLC 720, Xeon E5-2698v4 20C 2.2GHz, Infiniband FDR; Atos; Bull
Turing: BlueGene/Q, Power BQC 16C 1.6GHz, Custom; CNRS/IDRIS-GENCI; IBM
Tera-100: Bull bullx super-node S6010/S6030; CEA; Bull
Eole: Intel Cluster, Xeon E5-2680v4 14C 2.4GHz, Intel Omni-Path; EDF; Bull
Zumbrota: BlueGene/Q, Power BQC 16C 1.6GHz, Custom; EDF R&D; IBM
Occigen2: bullx DLC 720, Xeon E5-2690v4 14C 2.6GHz, Infiniband FDR; Centre Informatique National (CINES); Bull
Sator: NEC HPC1812-Rg-2, Xeon E5-2680v4 14C 2.4GHz, Intel Omni-Path; ONERA; NEC
HPC4: HP POD, Cluster Platform BL460c, Intel Xeon E5-2697v2 12C 2.7GHz, Infiniband FDR; Airbus; HPE
PORTHOS: IBM NeXtScale nx360 M5, Xeon E5-2697v3 14C 2.6GHz, Infiniband FDR; EDF R&D; IBM

June 2017: The TOP 10 Systems
Rank; Site; Computer; Country; Cores; Rmax (Pflops); % of Peak; Power (MW); GFlops/Watt
1. National Supercomputer Center in Wuxi; Sunway TaihuLight, SW26010 (260C) + Custom; China; 10,649,600; 93.0; 74; 15.4; 6.04
2. National Supercomputer Center in Guangzhou; Tianhe-2, NUDT, Xeon (12C) + Intel Xeon Phi (57C) + Custom; China; 3,120,000; 33.9; 62; 17.8; 1.91
3. Swiss CSCS; Piz Daint, Cray XC50, Xeon (12C) + Nvidia P100 (56C) + Custom; Switzerland; 361,760; 19.6; 77; 2.27; 8.6
4. DOE / OS Oak Ridge Nat Lab; Titan, Cray XK7, AMD (16C) + Nvidia Kepler GPU (14C) + Custom; USA; 560,640; 17.6; 65; 8.21; 2.14
5. DOE / NNSA Lawrence Livermore Nat Lab; Sequoia, BlueGene/Q (16C) + custom; USA; 1,572,864; 17.2; 85; 7.89; 2.18
6. DOE / OS Lawrence Berkeley Nat Lab; Cori, Cray XC40, Xeon Phi (68C) + Custom; USA; 622,336; 14.0; 50; 3.94; 3.55
7. Joint Center for Advanced HPC; Oakforest-PACS, Fujitsu Primergy CX1640, Xeon Phi (68C) + Omni-Path; Japan; 558,144; 13.6; 54; 2.72; 4.98
8. RIKEN Advanced Inst for Comp Sci; K computer, Fujitsu SPARC64 VIIIfx (8C) + Custom; Japan; 705,024; 10.5; 93; 12.7; 0.827
9. DOE / OS Argonne Nat Lab; Mira, BlueGene/Q (16C) + Custom; USA; 786,432; 8.59; 85; 3.95; 2.07
10. DOE / NNSA / Los Alamos & Sandia; Trinity, Cray XC40, Xeon (16C) + Custom; USA; 301,056; 8.10; 80; 4.23; 1.92
500. Internet company; Sugon, Intel (8C) + Nvidia; China; 0.432; 71; 1.2; 0.36
(Slide annotations: TaihuLight compared with the sum of all EU systems; TaihuLight is roughly 35% of the sum of all systems in the US.)

Next Big Systems in the US DOE
Oak Ridge Lab and Lawrence Livermore Lab to receive IBM- and Nvidia-based systems: IBM Power9 + Nvidia Volta V100. Installation started but not completed until late 2018-2019; 5-10 times Titan on apps. With 4,600 nodes, each containing six 7.5-teraflop NVIDIA V100 GPUs and two IBM Power9 CPUs, its aggregate peak performance should be > 200 petaflops. ORNL maybe June 2018 fully ready (Top500 number).
In 2021, Argonne Lab to receive an Intel-based system (Aurora 21): an exascale system based on future Intel parts, not Knights Hill. Balanced architecture to support three pillars: large-scale simulation (PDEs, traditional HPC), data-intensive applications (science pipelines), and deep learning and emerging science AI. Integrated computing, acceleration, and storage; towards a common software stack.

Toward Exascale
China plans for Exascale 2020. Three separate developments in HPC, using anything but US technology:
- Wuxi: upgrade the ShenWei, O(100) Pflops, all Chinese.
- National University for Defense Technology: Tianhe-2A, O(100) Pflops, will be Chinese ARM processor + accelerator.
- Sugon / CAS ICT: x86 based; collaboration with AMD.
US DOE Exascale Computing Project, a 7-year program:
- Initial exascale system based on an advanced architecture and delivered in 2021.
- Enable capable exascale systems, based on ECP R&D, delivered in 2022 and deployed in 2023.
- Capable exascale: 50X the performance of today's 20 PF systems, 20-30 MW power, at most 1 perceived fault per week, software to support a broad range of apps.

The US Department of Energy Exascale Computing Project has formulated a holistic approach that uses co-design and integration to achieve capable exascale:
- Application Development: science and mission applications
- Software Technology: scalable and productive software stack
- Hardware Technology: hardware technology elements
- Exascale Systems: integrated exascale supercomputers
(Diagram of the stack: applications co-design; correctness, visualization, data analysis; resilience; programming models, development environment, and runtimes; system software, resource management, threading, scheduling, monitoring, and control; node OS and runtimes; math libraries and frameworks; tools; data management, I/O and file system; workflows; memory and burst buffer; hardware interface.)
ECP's work encompasses applications, system software, hardware technologies and architectures, and workforce development.
Exascale Computing Project, www.exascaleproject.org

The Problem with the LINPACK Benchmark
HPL performance of computer systems is no longer strongly correlated to real application performance, especially for the broad set of HPC applications governed by partial differential equations. Designing a system for good HPL performance can actually lead to design choices that are wrong for the real application mix, or add unnecessary components or complexity to the system.

Many Problems in Computational Science Involve Solving PDEs: Large Sparse Linear Systems
Given a PDE, e.g. $-\Delta u + c \cdot \nabla u + \gamma u \equiv Pu = f$ over some domain $\Omega$, plus boundary conditions (where $P$ denotes the differential operator).
Discretization (e.g., Galerkin equations): find $u_h = \sum_{i=1}^{n} \Phi_i x_i$ such that $(P u_h, \Phi_j) = (f, \Phi_j)$ for each $\Phi_j$, i.e.
$$\sum_{i=1}^{n} \underbrace{(P\Phi_i, \Phi_j)}_{a_{ji}}\, x_i = \underbrace{(f, \Phi_j)}_{b_j},$$
which is a sparse linear system $Ax = b$.
The basis functions $\Phi_j$ often have local support, leading to local interactions and hence sparse matrices; in the example shown, row 1 has only 6 nonzeros ($a_{1,1}$, $a_{1,21}$, $a_{1,35}$, $a_{1,115}$, $a_{1,332}$, ...).
(Example applications: modeling diffusion, fluid flow.)
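As a minimal illustration of how such a discretization produces a sparse system (this 1D example is not from the slide), take $-u'' = f$ on $(0,1)$ with $u(0) = u(1) = 0$ on a uniform grid $x_i = ih$, $h = 1/(n+1)$:

```latex
% Second-order central differences give, for i = 1,...,n,
%   (-u_{i-1} + 2u_i - u_{i+1}) / h^2 = f_i,
% i.e. a tridiagonal (hence sparse) system A x = b with
A = \frac{1}{h^2}
\begin{pmatrix}
 2 & -1 &        &    \\
-1 &  2 & \ddots &    \\
   & \ddots & \ddots & -1 \\
   &        & -1     &  2
\end{pmatrix},
\qquad b_i = f(x_i).
% Each row has at most 3 nonzeros; the analogous 3D 27-point
% discretization (as used by HPCG) has at most 27 nonzeros per row.
```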

HPCG: High Performance Conjugate Gradients (hpcg-benchmark.org)
Solves Ax = b, with A large and sparse, b known, x computed. An optimized implementation of PCG contains essential computational and communication patterns that are prevalent in a variety of methods for the discretization and numerical solution of PDEs.
Synthetic discretized 3D PDE (FEM, FVM, FDM). Sparse matrix: 27 nonzeros per row in the interior, 8 to 18 on the boundary; symmetric positive definite.
Patterns: dense and sparse computations; dense and sparse collectives; multi-scale execution of kernels via a (truncated) multigrid V-cycle; data-driven parallelism (unstructured sparse triangular solves). Strong verification (via spectral properties of PCG).
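A minimal, unpreconditioned conjugate gradient sketch in C for the 1D model problem above (matrix-free 3-point stencil) illustrates the kind of sparse kernel HPCG exercises; it is only a sketch, not the HPCG reference code, and omits the multigrid preconditioner and the MPI/OpenMP structure of the real benchmark.

```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* y = A*x for the 1D Poisson model problem (3-point stencil), scaled by 1/h^2 */
static void spmv(int n, double h2inv, const double *x, double *y) {
    for (int i = 0; i < n; i++) {
        double left  = (i > 0)     ? x[i - 1] : 0.0;
        double right = (i < n - 1) ? x[i + 1] : 0.0;
        y[i] = h2inv * (2.0 * x[i] - left - right);
    }
}

static double dot(int n, const double *a, const double *b) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += a[i] * b[i];
    return s;
}

int main(void) {
    const int n = 1000;
    const double h = 1.0 / (n + 1), h2inv = 1.0 / (h * h);
    double *x = calloc(n, sizeof *x);      /* initial guess x = 0        */
    double *b = malloc(n * sizeof *b);     /* right-hand side f(x_i) = 1 */
    double *r = malloc(n * sizeof *r), *p = malloc(n * sizeof *p),
           *Ap = malloc(n * sizeof *Ap);

    for (int i = 0; i < n; i++) { b[i] = 1.0; r[i] = b[i]; p[i] = r[i]; }

    double rr = dot(n, r, r);
    for (int it = 0; it < 10000 && sqrt(rr) > 1e-10; it++) {
        spmv(n, h2inv, p, Ap);             /* sparse matrix-vector product */
        double alpha = rr / dot(n, p, Ap);
        for (int i = 0; i < n; i++) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        double rr_new = dot(n, r, r);
        double beta = rr_new / rr;         /* CG direction update */
        for (int i = 0; i < n; i++) p[i] = r[i] + beta * p[i];
        rr = rr_new;
    }
    printf("final residual norm = %e\n", sqrt(rr));
    free(x); free(b); free(r); free(p); free(Ap);
    return 0;
}
```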

HPL vs. HPCG: Bookends
Some see HPL and HPCG as bookends of a spectrum. Applications teams know where their codes lie on the spectrum. One can gauge performance on a system using both HPL and HPCG numbers.

HPCG Results, June 2017, Top 10
Rank; Site; Computer; Cores; Rmax (Pflops); HPCG (Pflops); HPCG/HPL; % of Peak
1. RIKEN Advanced Institute for Computational Science, Japan; K computer, SPARC64 VIIIfx 2.0GHz, Tofu interconnect; 705,024; 10.5; 0.60; 5.7%; 5.3%
2. NSCC / Guangzhou, China; Tianhe-2, NUDT, Xeon 12C 2.2GHz + Intel Xeon Phi 57C + Custom; 3,120,000; 33.8; 0.58; 1.7%; 1.1%
3. National Supercomputing Center in Wuxi, China; Sunway TaihuLight, Sunway MPP, SW26010 260C 1.45GHz, Sunway, NRCPC; 10,649,600; 93.0; 0.48; 0.5%; 0.4%
4. Swiss National Supercomputing Centre (CSCS), Switzerland; Piz Daint, Cray XC50, Xeon E5-2690v3 12C 2.6GHz, Aries interconnect, NVIDIA P100, Cray; 361,760; 19.6; 0.48; 2.4%; 1.9%
5. Joint Center for Advanced HPC, Japan; Oakforest-PACS, PRIMERGY CX600 M1, Intel Xeon Phi 7250 68C 1.4GHz, Intel Omni-Path, Fujitsu; 557,056; 24.9; 0.39; 2.8%; 1.5%
6. DOE/SC/LBNL/NERSC, USA; Cori, Cray XC40, Intel Xeon Phi 7250 68C 1.4GHz, Cray Aries, Cray; 632,400; 13.8; 0.36; 2.6%; 1.3%
7. DOE/NNSA/LLNL, USA; Sequoia, IBM BlueGene/Q, PowerPC A2 16C 1.6GHz, 5D Torus, IBM; 1,572,864; 17.2; 0.33; 1.9%; 1.6%
8. DOE/SC/Oak Ridge National Lab, USA; Titan, Cray XK7, Opteron 6274 16C 2.2GHz, Cray Gemini interconnect, NVIDIA K20x; 560,640; 17.6; 0.32; 1.8%; 1.2%
9. DOE/NNSA/LANL/SNL, USA; Trinity, Cray XC40, Intel E5-2698v3, Aries custom, Cray; 301,056; 8.1; 0.18; 2.3%; 1.6%
10. NASA / Mountain View, USA; Pleiades, SGI ICE X, Intel E5-2680, E5-2680v2, E5-2680v3, E5-2680v4, Infiniband FDR, HPE; 243,008; 6.0; 0.17; 2.9%; 2.5%

Bookends: Peak and HPL
(Chart: Rpeak and Rmax, in Pflop/s, plotted by TOP500 rank for the systems with reported results.)

Bookends: Peak, HPL, and HPCG
(Chart: Rpeak, Rmax, and HPCG, in Pflop/s, plotted by TOP500 rank for the systems with reported results.)

Machine Learning in Computational Science
Many fields are beginning to adopt machine learning to augment modeling and simulation methods: climate, biology, drug design, epidemiology, materials, cosmology, high-energy physics.

IEEE 754 Half Precision (16-bit) Floating Point Standard
A lot of interest driven by machine learning.
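For reference (standard IEEE 754 facts, not taken from the slide), the binary16 format and its range are:

```latex
% IEEE 754 binary16: 1 sign bit, 5 exponent bits, 10 fraction bits
x = (-1)^{s} \cdot 2^{\,e-15} \cdot 1.f_1 f_2 \ldots f_{10}, \qquad 1 \le e \le 30
% largest finite value:      (2 - 2^{-10}) \cdot 2^{15} = 65504
% smallest positive normal:  2^{-14} \approx 6.10 \times 10^{-5}
% unit roundoff:             2^{-11} \approx 4.88 \times 10^{-4}
```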

Mixed Precision
Extended to sparse direct methods, and to iterative methods with inner and outer iterations. Today there are many precisions to deal with. Note the limited number range of half precision (16-bit floating point).
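As one concrete instance of the inner/outer idea, LAPACK's dsgesv factorizes in single precision and refines the solution in double precision. A minimal sketch using the LAPACKE C interface, assuming LAPACKE and a BLAS/LAPACK library are installed:

```c
#include <stdio.h>
#include <stdlib.h>
#include <lapacke.h>

/* Solve A x = b with single-precision LU plus double-precision iterative
   refinement (outer loop); the routine falls back to a full double-precision
   solve if refinement does not converge. */
int main(void) {
    const lapack_int n = 1000, nrhs = 1;
    double *A = malloc((size_t)n * n * sizeof *A);
    double *b = malloc((size_t)n * sizeof *b);
    double *x = malloc((size_t)n * sizeof *x);
    lapack_int *ipiv = malloc((size_t)n * sizeof *ipiv);
    lapack_int iter, info;

    /* Fill A with a diagonally dominant random matrix and b with ones. */
    for (lapack_int i = 0; i < n; i++) {
        b[i] = 1.0;
        for (lapack_int j = 0; j < n; j++)
            A[i * n + j] = (i == j) ? (double)n : (double)rand() / RAND_MAX;
    }

    info = LAPACKE_dsgesv(LAPACK_ROW_MAJOR, n, nrhs,
                          A, n, ipiv, b, nrhs, x, nrhs, &iter);

    /* iter > 0: number of refinement sweeps after the float LU;
       iter < 0: the routine fell back to a double-precision solve. */
    printf("info = %d, iterative refinement steps = %d\n",
           (int)info, (int)iter);
    free(A); free(b); free(x); free(ipiv);
    return 0;
}
```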

Deep Learning needs Batched Operations
Matrix multiply is the time-consuming part: convolution layers and fully connected layers require matrix multiply. There are many GEMMs of small matrices, perfectly parallel, and one can get by with 16-bit floating point.
(Diagram: a convolution step, in this case a 3x3 GEMM, followed by a fully connected classification layer with inputs x1, x2, x3, weights w11 ... w23, and outputs y1, y2.)
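Written out, the small fully connected layer in the diagram is just a matrix-vector product, and a batch of inputs turns it into a GEMM (the 2x3 shape matches the diagram; the batching step is an illustration):

```latex
\begin{pmatrix} y_1 \\ y_2 \end{pmatrix}
=
\begin{pmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix},
\qquad
Y = W X \ \text{for a batch } X = \bigl(x^{(1)}\; x^{(2)} \cdots x^{(k)}\bigr).
```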

Standard for Batched Computations
Define a standard API for batched BLAS and LAPACK in collaboration with Intel/Nvidia/ECP/other users.
- Fixed size: most of BLAS and LAPACK released.
- Variable size: most of BLAS released; variable-size LAPACK in the branch.
- Native GPU algorithms (Cholesky, LU, QR) in the branch.
- Tiled algorithms using batched routines on tile or LAPACK data layout in the branch.
- Framework for Deep Neural Network kernels: CPU, KNL and GPU routines; FP16 routines in progress.

Batched Computations: non-batched baseline
Loop over the matrices one by one and compute each using multiple threads. Note that, since the matrices are of small size, there is not enough work for all the cores, so we expect low performance; thread contention might also affect the performance.
for (i = 0; i < batchcount; i++)
    dgemm( ... )
There is not enough work to fulfill all the cores: a low percentage of the resources is used.

Batched Computations: batched computation
Distribute all the matrices over the available resources by assigning each matrix to a group of cores (CPU) or thread blocks (TBs, on the GPU) that operates on it independently:
- For very small matrices, assign a matrix per core (CPU) or per TB (GPU).
- For medium sizes, a matrix goes to a team of cores (CPU) or many TBs (GPU).
- For large sizes, switch to the classical multithreaded approach, one matrix per round.
Batched_dgemm( ... ) goes through a task manager / dispatcher that, based on the kernel design, decides the number of TBs or threads (GPU/CPU) and dispatches through the Nvidia/OpenMP scheduler. A high percentage of the resources is used. (A sketch of the one-matrix-per-core strategy is shown below.)
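A minimal CPU-side sketch of the one-matrix-per-core strategy, assuming OpenMP and a CBLAS implementation are available; this is an illustration of the idea, not the MAGMA or Batched BLAS API:

```c
#include <stdlib.h>
#include <cblas.h>

/* Reference "batched" dgemm for many independent small problems:
   C[i] = alpha * A[i] * B[i] + beta * C[i], one matrix per thread.
   All matrices here are square of order n, stored column-major.
   The underlying BLAS should be set to single-threaded mode so the
   parallelism comes from the batch loop, not from nested BLAS threads. */
void batched_dgemm_ref(int batchcount, int n, double alpha,
                       double *const *A, double *const *B,
                       double *const *C, double beta)
{
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < batchcount; i++) {
        /* Each small GEMM runs on one core; for medium sizes one would
           instead give each GEMM a team of threads, and for large sizes
           fall back to a single multithreaded dgemm per matrix. */
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, alpha, A[i], n, B[i], n, beta, C[i], n);
    }
}
```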

Batched Computations: How do we design and optimize?
(Chart: batched vs. non-batched dgemm performance. For small sizes the batched approach is about 2-3x faster; for large sizes the code switches to the non-batched routine, about 1X.)

Batched Computations: How do we design and optimize? (continued)
(Chart: about 3X for small sizes and about 1.2x for medium sizes; switch to non-batched for large sizes.)

Nvidia V100 GPU
(Chart: Gflop/s vs. matrix size for batched dgemm and standard dgemm over batches of many matrices. Batched dgemm is about 19X faster for small sizes and about 1.4X faster for medium sizes; for large sizes the code switches to non-batched dgemm and reaches about 97% of the dgemm peak.)

Intel Knights Landing: FLAT mode
68 cores, 1.3 GHz, peak DP = 2662 Gflop/s. The KNL is in FLAT mode, where the entire MCDRAM is exposed as addressable memory and memory allocations are treated similarly to DDR4 memory allocations. The KNL Thermal Design Power (TDP) is 215 W (for the main SKUs).
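In flat mode the application chooses which allocations go to MCDRAM; a common way to do this is the memkind library's hbwmalloc interface. The sketch below assumes memkind is installed (an unmodified binary can also be bound entirely to MCDRAM with numactl's membind option):

```c
#include <stdlib.h>
#include <hbwmalloc.h>   /* memkind's high-bandwidth-memory allocator */

int main(void) {
    const size_t n = (size_t)18000 * 18000;
    int from_hbw = 1;

    /* Allocate the bandwidth-critical buffer from MCDRAM (flat mode). */
    double *a = hbw_malloc(n * sizeof *a);
    if (a == NULL) {          /* no MCDRAM available: fall back to DDR4 */
        a = malloc(n * sizeof *a);
        from_hbw = 0;
    }
    if (a == NULL) return 1;

    /* ... fill `a` and call the compute kernel (e.g. dgemm) here ... */

    if (from_hbw) hbw_free(a);
    else          free(a);
    return 0;
}
```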

Level 1, 2 and 3 BLAS on KNL
68 cores Intel Xeon Phi KNL, 1.3 GHz; the theoretical double precision peak is 2662 Gflop/s. Compiled with icc and using Intel MKL 2017.
(Chart: dgemm reaches about 2000 Gflop/s, roughly 35x the Level 1/2 rates; dgemv about 85.3 Gflop/s; daxpy about 35.1 Gflop/s.)

PAPI for power-aware computing
We use PAPI's latest powercap component for measurement, control, and performance analysis. PAPI power components in the past supported only reading power information; the new component exposes RAPL functionality to allow users to read and write power limits.
We study numerical building blocks of varying computational intensity and use the PAPI powercap component to detect power optimization opportunities. Objective: cap the power on the architecture to reduce power usage while keeping the execution time constant. Energy savings!
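A minimal sketch of how a power cap might be read and lowered through the PAPI powercap component. The package/zone event name below is an illustrative assumption (actual names come from papi_native_avail and vary by system), and limits are expressed in microwatts as in the Linux powercap interface:

```c
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

int main(void) {
    int EventSet = PAPI_NULL;
    long long values[1];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) exit(1);
    if (PAPI_create_eventset(&EventSet) != PAPI_OK) exit(1);

    /* Illustrative event name: the package-0 long-term power limit
       exposed by the powercap component. */
    if (PAPI_add_named_event(EventSet,
            "powercap:::POWER_LIMIT_A_UW:ZONE0") != PAPI_OK) exit(1);

    if (PAPI_start(EventSet) != PAPI_OK) exit(1);

    PAPI_read(EventSet, values);              /* current limit, in uW */
    printf("current power limit: %lld uW\n", values[0]);

    values[0] = 160000000;                    /* cap the package at 160 W */
    PAPI_write(EventSet, values);             /* requires suitable permissions */

    /* ... run the kernel (e.g. DGEMM) here and measure time and energy ... */

    PAPI_stop(EventSet, values);
    return 0;
}
```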

Level 3 BLAS: DGEMM on KNL using MCDRAM
68-core KNL, peak DP = 2662 Gflop/s. Bandwidth: MCDRAM ~425 GB/s, DDR4 ~90 GB/s.
(Chart: accelerator (package) and DDR4 power usage over time, with the maximum power limit marked against the Thermal Design Power. At the default power limit, DGEMM on an 18K matrix reaches 1997 Gflop/s, 8.81 Gflops/Watt, and about 1330 Joules per run.)
DGEMM is run repeatedly for a fixed matrix size of 18K; at each step a new power limit is set, decreasing from the default through 220 Watts and 200 Watts down to 120 Watts in steps of 10-20 Watts.
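These three numbers are mutually consistent; for one run of an n = 18,000 DGEMM, using the 1997 Gflop/s, 8.81 Gflop/s per Watt and roughly 1330 J values read off the chart:

```latex
t \approx \frac{2n^3}{\text{rate}} = \frac{2\,(18000)^3}{1.997\times 10^{12}\ \text{flop/s}} \approx 5.8\ \text{s},
\qquad
P \approx \frac{1997\ \text{Gflop/s}}{8.81\ \text{Gflop/s/W}} \approx 227\ \text{W},
\qquad
E \approx P\,t \approx 1.3\times 10^{3}\ \text{J}.
```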

Level 3 BLAS: DGEMM on KNL using MCDRAM (continued)
(Same chart: at the default power limit, DGEMM 18K reaches 1997 Gflop/s, 8.81 Gflops/Watt, about 1330 Joules.)
Setting the limits through PAPI:
PAPI_read(EventSet, values)        // long long values[2]
values[0] = 268750000              // set short-term power limit to 268.75 Watts (microwatts)
values[1] = 258000000              // set long-term power limit to 258 Watts (microwatts)
PAPI_write(EventSet, values)

Level 3 BLAS: DGEMM on KNL using MCDRAM (continued)
(Chart: with the power limit set to 220 W, DGEMM 18K achieves 1940 Gflop/s, 8.92 Gflops/Watt, 1279 Joules, compared with 1997 Gflop/s, 8.81 Gflops/Watt, about 1330 Joules at the default limit.)

Level 3 BLAS: DGEMM on KNL using MCDRAM (continued)
(Chart: the full sweep. As the power cap is lowered from the default through 220, 200, 180, 160, 150, ... Watts down to 120 Watts, DGEMM performance falls from 1997 Gflop/s to roughly 560 Gflop/s; energy efficiency first rises slightly, peaking around 9 Gflops/Watt, then drops to 4.72 Gflops/Watt; energy per run is lowest, about 1253 Joules, at moderate caps and grows to 2443 Joules at 120 Watts.)

Level 3 BLAS: DGEMM on KNL using MCDRAM (continued)
(Same chart, with the core frequency in MHz overlaid and the optimum-Flops and optimum-Flops/Watt points marked. When the power cap kicks in, the frequency is decreased, which confirms that capping takes effect through DVFS.)

Level 3 BLAS: DGEMM on KNL, MCDRAM vs. DDR4
(Charts: the same power-capping sweep with data in MCDRAM and in DDR4, with the optimum-Flops and optimum-Flops/Watt points marked. The DDR4 run starts at 1991 Gflop/s, 8.22 Gflops/Watt and 1383 Joules at the default limit and degrades in a similar way as the cap is lowered.)
Lesson for DGEMM-type operations (compute intensive): capping can reduce performance, but sometimes it can hold the energy efficiency (Gflops/Watt).

Level 2 BLAS: DGEMV on KNL, MCDRAM vs. DDR4
(Charts: the same power-capping sweep for DGEMV with data in MCDRAM and in DDR4, showing power, performance, achieved bandwidth, Gflops/Watt, Joules, and frequency.)
Lesson for DGEMV-type operations (memory bound): For MCDRAM, capping at 190 Watts results in a power reduction without any loss of performance (energy saving ~20%). For DDR4, capping at 120 Watts improves energy efficiency by ~17%. Overall, capping 40 Watts below the power observed at the default setting provides about 17-20% energy saving without any significant reduction in time to solution.
DGEMV is run repeatedly for a fixed matrix size of 18K; at each step a new power limit is set, decreasing from the default down to 120 Watts.

Level 1 BLAS: DAXPY on KNL, MCDRAM vs. DDR4
(Charts: the same power-capping sweep for DAXPY with data in MCDRAM and in DDR4.)
Lesson for DAXPY-type operations (memory bound): For MCDRAM, capping at 180 Watts results in a power reduction without any loss of performance (energy saving ~16%). For DDR4, capping at 130 Watts improves energy efficiency by ~25%. Overall, capping 40 Watts below the power observed at the default setting provides about 16-25% energy saving without any significant reduction in time to solution.
DAXPY is run repeatedly for a fixed vector size of 1E8; at each step a new power limit is set, decreasing from the default down to 120 Watts.

Intel Knights Landing: Level 1, 2 and 3 BLAS
68 cores Intel Xeon Phi KNL, 1.3 GHz, peak DP = 2662 Gflop/s.
MCDRAM: Level 3 BLAS 1997 Gflop/s, 9.4 Gflop/s per W; Level 2 BLAS 84 Gflop/s, 0.45 Gflop/s per W; Level 1 BLAS 35 Gflop/s, 0.17 Gflop/s per W.
DDR4: Level 3 BLAS 1991 Gflop/s; Level 2 BLAS 21 Gflop/s; Level 1 BLAS 7 Gflop/s.

Power-awareness in applications: HPCG benchmark on a grid of size 192^3, MCDRAM
Sparse matrix-vector operations; solving Ax=b with the Conjugate Gradient algorithm.
(Chart: power trace at the basic TDP power limit of 215 W; 1522 Joules.)

Power-awareness in applications: HPCG on MCDRAM (continued)
(Chart: adding a run capped at 200 W, with essentially the same energy as the 1522 Joules at 215 W.)
Decreasing the power limit to 200 W does not result in any loss in performance, while we observe a reduction in the power consumption rate.

Power-awareness in applications: HPCG on MCDRAM (continued)
(Chart: runs capped at 215, 200, 180, 160, 140, and 120 W. Energy per run is 1522 J at the default 215 W, essentially unchanged at 200 W, drops to 1468 J at 180 W and 1486 J at 160 W, then rises to about 1728 J at 140 W and roughly 2390 J at 120 W.)
Decreasing the power limit to 200 W does not result in any loss in performance, while we observe a reduction in the power consumption rate. Decreasing the power limit by 40 Watts (setting the limit at 160 W) provides energy saving and power rate reduction without any meaningful loss in performance, less than 1% reduction in time to solution.

Power-awareness in applications: HPCG benchmark on a grid of size 192^3, DDR4
Sparse matrix-vector operations; solving Ax=b with the Conjugate Gradient algorithm.
(Chart: runs capped at 215, 200, 180, 160, 140, and 120 W; energy per run is 3915, 3957, 3940, 3710, 3300, and 3158 Joules respectively.)
Decreasing the power limit to 160 W does not result in any loss in performance, while we observe a reduction in power rate and an energy saving. Decreasing the power limit by 40 Watts (power limit at 140 W) provides a large energy saving (15%) and a power rate reduction without any loss in performance. At 120 Watts, less than 1% reduction in time to solution is observed, together with a large energy saving (20%) and power rate reduction.
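The quoted savings follow directly from the energies read off the chart above (a quick check of the arithmetic, which rounds to the ~15% and ~20% figures on the slide):

```latex
\frac{E_{215\,\mathrm{W}} - E_{140\,\mathrm{W}}}{E_{215\,\mathrm{W}}} = \frac{3915 - 3300}{3915} \approx 16\%,
\qquad
\frac{E_{215\,\mathrm{W}} - E_{120\,\mathrm{W}}}{E_{215\,\mathrm{W}}} = \frac{3915 - 3158}{3915} \approx 19\%.
```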

Power-awareness in applications: HPCG on MCDRAM and DDR4
(Charts: the MCDRAM and DDR4 power-capping sweeps side by side.)
Lesson for HPCG: For MCDRAM, decreasing the power limit by 40 Watts from the power observed at default (setting the limit at 160 W) provides energy saving and power rate reduction without any meaningful loss in performance, less than 1% reduction in time to solution. For DDR4, similar behavior is observed: decreasing the power limit by 40 Watts from the power observed at default (setting the limit at 120 W) provides a large energy saving (15%-20%) and power rate reduction without any significant reduction in time to solution.

Power-awareness in applications: Solving the Helmholtz equation with finite differences, Jacobi iteration, 128x128 grid
(Charts: MCDRAM runs capped from 215 W down to 120 W, with the minimum energy, 724 J, at 180 W versus 826 J at 215 W; DDR4 runs with the minimum energy, 2137 J, at 140 W versus 2770 J at 215 W.)
Lesson for the Jacobi iteration: For MCDRAM, capping at 180 Watts improves power efficiency by ~13% without any loss in time to solution, while also decreasing the power rate requirements by about 20%. For DDR4, capping at 140 Watts improves power efficiency by ~20% without any loss in time to solution. Overall, capping 30-40 Watts below the power observed at the default limit provides a large energy gain while keeping the same time to solution, similar to the behavior observed for the DGEMV and DAXPY kernels.

Power-awareness in applications: Lattice-Boltzmann simulation of CFD (from the SPEC 2006 benchmark suite)
(Charts: MCDRAM runs capped from 270 W down to 120 W; the minimum energy, 2872 J, occurs at 200 W, and energy rises to 5734 J at 120 W. DDR4 runs: energy decreases gradually as the cap is lowered, with the lowest values at 140 W and 120 W.)
Lesson for Lattice-Boltzmann: For MCDRAM, capping at 200 W results in a 5% energy saving and a power rate reduction without any increase in time to solution. For DDR4, capping at 140 Watts improves energy efficiency by ~11% and reduces the power rate without any increase in time to solution.

Power-awareness in applications: Monte Carlo neutron transport application (XSBench)
(Charts: MCDRAM runs capped at 215 W down to 140 W; energy drops from 9175 J at 215 W to a minimum of about 8720 J at 180 W. DDR4 runs capped at 215 W down to 120 W; energy decreases from roughly 19910 J to 13432 J.)
Lesson for Monte Carlo: For MCDRAM, capping at 180 W results in an energy saving (5%) and a power rate reduction without any meaningful increase in time to solution. For DDR4, capping at 120 Watts improves power efficiency by ~30% without any loss in performance.

Power-Aware Tools
We are designing a toolset that will automatically monitor memory traffic, track the performance and data movement, and adjust the power to keep the performance roughly the same while decreasing energy consumption: better energy efficiency without significant loss of performance, all in an automatic fashion without user involvement.

Critical Issues and Challenges at Peta- and Exascale for Algorithm and Software Design
- Synchronization-reducing algorithms: break the fork-join model.
- Communication-reducing algorithms: use methods which attain the lower bound on communication.
- Mixed precision methods: 2x the speed of operations and 2x the speed of data movement.
- Autotuning: today's machines are too complicated; build smarts into the software to adapt to the hardware.
- Fault-resilient algorithms: implement algorithms that can recover from failures/bit flips.
- Reproducibility of results: today we can't guarantee this without a penalty; we understand the issues, but some of our colleagues have a hard time with this.

Collaborators and Support
MAGMA team: http://icl.cs.utk.edu/magma
PLASMA team: http://icl.cs.utk.edu/plasma
Collaborating partners: University of Tennessee, Knoxville; University of California, Berkeley; University of Colorado, Denver; University of Manchester.