Overview of HPC and Energy Saving on KNL for Some Computations


1 Overview of HPC and Energy Saving on KNL for Some Computations
Jack Dongarra, University of Tennessee / Oak Ridge National Laboratory / University of Manchester, 2017

2 Outline
- Overview of High Performance Computing
- With extreme computing, the rules for computing have changed

3 Since 1993: H. Meuer, H. Simon, E. Strohmaier, & JD
- Listing of the 500 most powerful computers in the world
- Yardstick: Rmax from LINPACK MPP: Ax=b, dense problem; TPP performance (rate achieved for the largest problem size)
- Updated twice a year: SC xy in the States in November, meeting in Germany in June
- All data available from www.top500.org

4 Performance Development of HPC over the Last 24.5 Years from the TOP500
[Chart: aggregate (SUM), N=1, and N=500 performance per list, from Mflop/s to Eflop/s]
- Today: SUM = 749 PF, N=1 = 93 PF, N=500 = 432 TF
- In 1993: N=1 = 59.7 GFlop/s, N=500 = 400 MFlop/s
- A lag of roughly 6-8 years separates the curves
- My laptop: 166 Gflop/s

5 Performance Development (projected to 1 Eflop/s)
[Chart: N=1, N=500, and SUM trend lines extrapolated from 1 Gflop/s to 1 Eflop/s]
- Tflop/s (10^12) achieved: ASCI Red, Sandia National Lab. Today: an SW26010 or Intel KNL chip is ~3 Tflop/s.
- Pflop/s (10^15) achieved: RoadRunner, Los Alamos National Lab. Today: 1 cabinet of TaihuLight is ~3 Pflop/s.
- Eflop/s (10^18) achieved? China says 2020; the U.S. says 2021.

6 State of Supercomputing in 2017
- Pflops (> 10^15 Flop/s) computing fully established, with 138 such computer systems.
- Three technology architectures, or "swim lanes," are thriving:
  - Commodity (e.g., Intel)
  - Commodity + accelerator (e.g., GPUs) (91 systems)
  - Lightweight cores (e.g., IBM BG, Intel's Knights Landing, ShenWei 26010, ARM)
- Interest in supercomputing is now worldwide and growing in many new markets (~50% of Top500 computers are in industry).
- Exascale (10^18 Flop/s) projects exist in many countries and regions.
- Intel processors have the largest share, 92%, followed by AMD at 1%.

7 Countries' Share
China has 1/3 of the systems, while the number of systems in the US has fallen to the lowest point since the TOP500 list was created.
[Treemap: each rectangle represents one of the Top500 computers; the area of the rectangle reflects its performance.]

8 The 17 TOP500 Systems in France
- Pangea (Total Exploration Production, HPE): SGI ICE X, Xeon E5-2670 / E5-2680v3 12C 2.5GHz, Infiniband FDR
- Occigen2 (GENCI-CINES, Bull): bullx DLC 720, Xeon E5-2690v4 14C 2.6GHz, Infiniband FDR
- Prolix2 (Meteo France, Bull): bullx DLC 720, Xeon E5-2698v4 20C 2.2GHz, Infiniband FDR
- Beaufix2 (Meteo France, Bull): bullx DLC 720, Xeon E5-2698v4 20C 2.2GHz, Infiniband FDR
- Tera-1000-1 (Commissariat a l'Energie Atomique (CEA), Bull): bullx DLC 720, Xeon E5-2698v3 16C 2.3GHz, Infiniband FDR
- Sid (Atos, Bull): bullx DLC 720, Xeon E5-2695v4 18C 2.1GHz, Infiniband FDR
- Curie thin nodes (CEA/TGCC-GENCI, Bull): Bullx B510, Xeon E5-2680 8C 2.7GHz, Infiniband QDR
- Cobalt (CEA/CCRT, Bull): bullx DLC 720, Xeon E5-2680v4 14C 2.4GHz, Infiniband EDR
- Diego (Atos, Bull): bullx DLC 720, Xeon E5-2698v4 20C 2.2GHz, Infiniband FDR
- Turing (CNRS/IDRIS-GENCI, IBM): BlueGene/Q, Power BQC 16C 1.6GHz, Custom
- Tera-100 (CEA, Bull): Bull bullx super-node S6010/S6030
- Eole (EDF, Bull): Intel Cluster, Xeon E5-2680v4 14C 2.4GHz, Intel Omni-Path
- Zumbrota (EDF R&D, IBM): BlueGene/Q, Power BQC 16C 1.6GHz, Custom
- Occigen2 (Centre Informatique National (CINES), Bull): bullx DLC 720, Xeon E5-2690v4 14C 2.6GHz, Infiniband FDR
- Sator (ONERA, NEC): NEC HPC1812-Rg-2, Xeon E5-2680v4 14C 2.4GHz, Intel Omni-Path
- HPC4 (Airbus, HPE): HP POD, Cluster Platform BL460c, Intel Xeon E5-2697v2 12C 2.7GHz, Infiniband FDR
- PORTHOS (EDF R&D, IBM): IBM NeXtScale nx360 M5, Xeon E5-2697v3 14C 2.6GHz, Infiniband FDR

9 June 2017: The TOP 10 Systems
1. National Supercomputer Center in Wuxi, China: Sunway TaihuLight, SW26010 (260C) + Custom, 10,649,600 cores
2. National Supercomputer Center in Guangzhou, China: Tianhe-2, NUDT, Xeon (12C) + Intel Xeon Phi (57C), Custom, 3,120,000 cores
3. Swiss CSCS, Switzerland: Piz Daint, Cray XC50, Xeon (12C) + NVIDIA P100 (56C) + Custom, 361,760 cores
4. DOE / OS, Oak Ridge Nat Lab, USA: Titan, Cray XK7, AMD (16C) + NVIDIA Kepler GPU (14C) + Custom, 560,640 cores
5. DOE / NNSA, Lawrence Livermore Nat Lab, USA: Sequoia, BlueGene/Q (16C) + custom, 1,572,864 cores
6. DOE / OS, Lawrence Berkeley Nat Lab, USA: Cori, Cray XC40, Xeon Phi (68C) + Custom, 622,336 cores
7. Joint Center for Advanced HPC, Japan: Oakforest-PACS, Fujitsu Primergy CX1640, Xeon Phi (68C) + Omni-Path, 558,144 cores
8. RIKEN Advanced Inst for Comp Sci, Japan: K computer, Fujitsu SPARC64 VIIIfx (8C) + Custom, 705,024 cores
9. DOE / OS, Argonne Nat Lab, USA: Mira, BlueGene/Q (16C) + Custom, 786,432 cores
10. DOE / NNSA, Los Alamos & Sandia, USA: Trinity, Cray XC40, Xeon (16C) + Custom, 301,056 cores
(Also shown: an internet company in China running a Sugon system, Intel (8C) + NVIDIA.)
Slide callouts compare TaihuLight with the aggregate performance of all EU systems and of all US systems on the list.

10 Next Big Systems in the US DOE
- Oak Ridge Lab and Lawrence Livermore Lab to receive IBM- and NVIDIA-based systems: IBM Power9 + NVIDIA Volta V100.
  - Installation has started but is not yet complete; targeted at 5-10 times Titan's performance on applications.
  - 4,600 nodes, each containing six 7.5-teraflop NVIDIA V100 GPUs and two IBM Power9 CPUs; aggregate peak performance should be > 200 petaflops.
  - ORNL: maybe fully ready by June 2018 (TOP500 number).
- In 2021, Argonne Lab to receive an Intel-based system:
  - Exascale system, based on future Intel parts, not Knights Hill: Aurora 21.
  - Balanced architecture to support three pillars: large-scale simulation (PDEs, traditional HPC), data-intensive applications (science pipelines), and deep learning / emerging science AI.
  - Integrated computing, acceleration, storage.
  - Towards a common software stack.

11 Toward Exascale
China plans for exascale in 2020. Three separate developments in HPC, anything but from the US:
- Wuxi: upgrade the ShenWei, O(100) Pflops, all Chinese.
- National University for Defense Technology: Tianhe-2A, O(100) Pflops, will be a Chinese ARM processor + accelerator.
- Sugon (CAS ICT): x86 based; collaboration with AMD.
US DOE Exascale Computing Program, a 7-year program:
- Initial exascale system based on an advanced architecture, delivered in 2021.
- Enable capable exascale systems, based on ECP R&D, delivered in 2022 and deployed in 2023.
- Capable exascale: 50x the performance of today's 20 PF systems, 20-30 MW power, at most 1 perceived fault per week, software to support a broad range of applications.

12 US Department of Energy Exascale Computing Program
ECP has formulated a holistic approach that uses co-design and integration to achieve capable exascale across four areas: Application Development (science and mission applications), Software Technology (a scalable and productive software stack), Hardware Technology (hardware technology elements), and Exascale Systems (integrated exascale supercomputers).
Elements shown in the diagram include: applications co-design; correctness, visualization, and data analysis; resilience; programming models, development environments, and runtimes; system software (resource management, threading, scheduling, monitoring, and control); node OS and runtimes; math libraries and frameworks; tools; data management, I/O, and file systems; workflows; memory and burst buffer; hardware interface.
ECP's work encompasses applications, system software, hardware technologies and architectures, and workforce development.


14 The Problem with the LINPACK Benchmark
- HPL performance of computer systems is no longer strongly correlated to real application performance, especially for the broad set of HPC applications governed by partial differential equations.
- Designing a system for good HPL performance can actually lead to design choices that are wrong for the real application mix, or add unnecessary components or complexity to the system.

15 Many Problems in Computational Science Involve Solving PDEs: Large Sparse Linear Systems
Given a PDE, e.g. $-\Delta u + c \cdot \nabla u + \gamma u \equiv Pu = f$ over some domain $\Omega$ (where $P$ denotes the differential operator), plus boundary conditions:
- Discretization (e.g., Galerkin equations): find $u_h = \sum_{i=1}^{n} \Phi_i x_i$ such that $(P u_h, \Phi_j) = (f, \Phi_j)$ for every basis function $\Phi_j$, i.e. $\sum_{i=1}^{n} (P\Phi_i, \Phi_j)\, x_i = (f, \Phi_j)$, with $a_{ji} = (P\Phi_i, \Phi_j)$ and $b_j = (f, \Phi_j)$.
- This yields a sparse linear system $Ax = b$.
- The basis functions $\Phi_j$ typically have local support, leading to local interactions and hence sparse matrices; in the example mesh shown, row 1 of $A$ has only 6 non-zeros.
- Applications: modeling diffusion, fluid flow.
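As a minimal worked example of how such a discretization produces a sparse system (a 1D model problem is assumed here for illustration; the slide's example is multi-dimensional), consider $-u''(x) = f(x)$ on $(0,1)$ with $u(0) = u(1) = 0$, discretized by centered differences on $n$ interior points with spacing $h = 1/(n+1)$:

    \frac{-u_{i-1} + 2u_i - u_{i+1}}{h^2} = f_i, \qquad i = 1,\dots,n
    \qquad\Longrightarrow\qquad
    A x = b, \quad
    A = \frac{1}{h^2}
    \begin{pmatrix}
       2 & -1 &        &    \\
      -1 &  2 & \ddots &    \\
         & \ddots & \ddots & -1 \\
         &        & -1     &  2
    \end{pmatrix}

Each row of $A$ has at most 3 non-zeros; the 3D, 27-point analogue of this construction is exactly the kind of sparse, symmetric positive definite system that HPCG (next slide) sets up and solves.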

16 HPCG (hpcg-benchmark.org)
- High Performance Conjugate Gradients (HPCG). Solves Ax=b, with A large and sparse, b known, x computed.
- An optimized implementation of PCG contains essential computational and communication patterns that are prevalent in a variety of methods for the discretization and numerical solution of PDEs.
- Synthetic discretized 3D PDE (FEM, FVM, FDM). Sparse matrix: 27 nonzeros per interior row, fewer on boundary rows. Symmetric positive definite.
- Patterns: dense and sparse computations; dense and sparse collectives; multi-scale execution of kernels via a (truncated) multigrid V-cycle; data-driven parallelism (unstructured sparse triangular solves).
- Strong verification (via spectral properties of PCG).
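A minimal sketch of the conjugate gradient kernel pattern that HPCG exercises (a CSR sparse matrix-vector product, dot products, and vector updates), written here for the 1D model problem above and without the multigrid preconditioner or the 27-point 3D stencil that HPCG itself uses:

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    /* CSR sparse matrix-vector product: y = A*x */
    static void spmv(int n, const int *rowptr, const int *col, const double *val,
                     const double *x, double *y) {
        for (int i = 0; i < n; i++) {
            double s = 0.0;
            for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
                s += val[k] * x[col[k]];
            y[i] = s;
        }
    }

    static double dot(int n, const double *a, const double *b) {
        double s = 0.0;
        for (int i = 0; i < n; i++) s += a[i] * b[i];
        return s;
    }

    int main(void) {
        const int n = 1000;                       /* interior grid points */
        /* Build the 1D Poisson matrix (tridiagonal, at most 3 nonzeros/row) in CSR. */
        int *rowptr = malloc((n + 1) * sizeof(int));
        int *col = malloc(3 * n * sizeof(int));
        double *val = malloc(3 * n * sizeof(double));
        int nnz = 0;
        for (int i = 0; i < n; i++) {
            rowptr[i] = nnz;
            if (i > 0)     { col[nnz] = i - 1; val[nnz++] = -1.0; }
            col[nnz] = i; val[nnz++] = 2.0;
            if (i < n - 1) { col[nnz] = i + 1; val[nnz++] = -1.0; }
        }
        rowptr[n] = nnz;

        double *x = calloc(n, sizeof(double));    /* initial guess x = 0 */
        double *b = malloc(n * sizeof(double));
        double *r = malloc(n * sizeof(double));
        double *p = malloc(n * sizeof(double));
        double *Ap = malloc(n * sizeof(double));
        for (int i = 0; i < n; i++) b[i] = 1.0;   /* simple right-hand side */

        /* r = b - A*x = b (since x = 0); p = r */
        for (int i = 0; i < n; i++) { r[i] = b[i]; p[i] = r[i]; }
        double rr = dot(n, r, r);

        for (int it = 0; it < 1000 && sqrt(rr) > 1e-8; it++) {
            spmv(n, rowptr, col, val, p, Ap);     /* Ap = A*p             */
            double alpha = rr / dot(n, p, Ap);    /* step length          */
            for (int i = 0; i < n; i++) x[i] += alpha * p[i];
            for (int i = 0; i < n; i++) r[i] -= alpha * Ap[i];
            double rr_new = dot(n, r, r);
            double beta = rr_new / rr;            /* new search direction */
            for (int i = 0; i < n; i++) p[i] = r[i] + beta * p[i];
            rr = rr_new;
        }
        printf("final residual %e\n", sqrt(rr));
        free(rowptr); free(col); free(val); free(x); free(b); free(r); free(p); free(Ap);
        return 0;
    }

The SpMV and dot products are memory-bound with irregular access, which is why HPCG stresses a system so differently from the dense, compute-bound HPL.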

17 HPL vs. HPCG: Bookends (hpcg-benchmark.org)
- Some see HPL and HPCG as bookends of a spectrum.
- Application teams know where their codes lie on that spectrum.
- Performance on a system can be gauged using both its HPL and HPCG numbers.

18 HPCG Results, June 2017, Top 10
Rank. Site: Computer (HPCG as % of peak)
1. RIKEN Advanced Institute for Computational Science, Japan: K computer, SPARC64 VIIIfx 2.0GHz, Tofu interconnect (5.3%)
2. NSCC / Guangzhou, China: Tianhe-2, NUDT, Xeon 12C 2.2GHz + Intel Xeon Phi 57C + Custom (1.1%)
3. National Supercomputing Center in Wuxi, China: Sunway TaihuLight, Sunway MPP, SW26010 260C 1.45GHz, Sunway, NRCPC (0.4%)
4. Swiss National Supercomputing Centre (CSCS), Switzerland: Piz Daint, Cray XC50, Xeon E5-2690v3 12C 2.6GHz, Aries interconnect, NVIDIA P100, Cray (1.9%)
5. Joint Center for Advanced HPC, Japan: Oakforest-PACS, PRIMERGY CX600 M1, Intel Xeon Phi 68C 1.4GHz, Intel Omni-Path, Fujitsu (1.5%)
6. DOE/SC/LBNL/NERSC, USA: Cori, Cray XC40, Intel Xeon Phi 68C 1.4GHz, Cray Aries, Cray (1.3%)
7. DOE/NNSA/LLNL, USA: Sequoia, IBM BlueGene/Q, PowerPC A2 16C 1.6GHz, 5D Torus, IBM (1.6%)
8. DOE/SC/Oak Ridge National Lab, USA: Titan, Cray XK7, Opteron 16C 2.2GHz, Cray Gemini interconnect, NVIDIA K20x (1.2%)
9. DOE/NNSA/LANL/SNL, USA: Trinity, Cray XC40, Intel E5-2698v3, Aries custom, Cray (1.6%)
10. NASA / Mountain View, USA: Pleiades, SGI ICE X, Intel E5-2680, E5-2680v2, E5-2680v3, E5-2680v4, Infiniband FDR, HPE (2.5%)

19 Bookends: Peak and HPL Only
[Chart: Rpeak and Rmax in Pflop/s for the top systems.]

20 Bookends: Peak, HPL, and HPCG
[Chart: Rpeak, Rmax, and HPCG in Pflop/s for the top systems.]

21 Machine Learning in Computational Science
Many fields are beginning to adopt machine learning to augment modeling and simulation methods: climate, biology, drug design, epidemiology, materials, cosmology, high-energy physics.

22 IEEE 754 Half Precision (16-bit) Floating Point Standard
A lot of interest driven by machine learning.

23 Mixed Precision
- Extended to sparse direct methods.
- Iterative methods with inner and outer iterations.
- Today there are many precisions to deal with.
- Note the limited number range of half precision (16-bit floating point).
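For reference (standard facts about the IEEE 754 binary16 format and the classic mixed-precision iterative refinement scheme, not numbers taken from the slide): half precision has 1 sign bit, 5 exponent bits, and 10 fraction bits, so both its range and its accuracy are very limited, and the usual remedy is to factor once in low precision and recover full accuracy with a cheap refinement loop carried out in higher precision:

    % IEEE 754 binary16: 1 sign bit, 5 exponent bits, 10 fraction bits
    x_{\max} \approx 65504, \qquad
    x_{\min}^{\text{normal}} = 2^{-14} \approx 6.1\times 10^{-5}, \qquad
    u = 2^{-11} \approx 4.9\times 10^{-4} \ (\approx 3\ \text{decimal digits})

    % Mixed-precision iterative refinement: factor in low precision, refine in fp64
    A \approx \widehat{L}\,\widehat{U} \ \ \text{(fp16/fp32)}, \qquad
    x_0 = \widehat{U}^{-1}\widehat{L}^{-1} b, \qquad
    \begin{cases}
      r_k = b - A x_k & \text{(fp64 residual)} \\
      \widehat{L}\,\widehat{U}\, d_k = r_k & \text{(low-precision solve)} \\
      x_{k+1} = x_k + d_k & \text{(fp64 update)}
    \end{cases}

The inner solve reuses the cheap low-precision factorization; the outer loop (residual and update in fp64) restores double-precision accuracy as long as A is not too ill-conditioned for the low-precision factors.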

24 Deep Learning Needs Batched Operations
- Matrix multiply is the time-consuming part.
- Convolution layers and fully connected layers require matrix multiplies.
- There are many GEMMs of small matrices, perfectly parallel, and they can get by with 16-bit floating point.
[Figure: a convolution step and a fully connected classification layer, each expressed as a small GEMM (3x3 in this example).]

25 Standard for Batched Computations
- Define a standard API for batched BLAS and LAPACK in collaboration with Intel / NVIDIA / ECP / other users.
- Fixed size: most of BLAS and LAPACK released.
- Variable size: most of BLAS released.
- Variable-size LAPACK in the branch.
- Native GPU algorithms (Cholesky, LU, QR) in the branch.
- Tiled algorithms using batched routines on tile or LAPACK data layout in the branch.
- Framework for Deep Neural Network kernels.
- CPU, KNL and GPU routines.
- FP16 routines in progress.

26 Batched Computations: Non-Batched Baseline
Loop over the matrices one by one and compute each using multiple threads (note that, since the matrices are of small sizes, there is not enough work for all the cores). So we expect low performance, and thread contention might also affect the performance.
    for (i = 0; i < batchCount; i++)
        dgemm( ... )
A low percentage of the resources is used: there is not enough work to fill all the cores.

27 Batched Computations: Batched Computation
Distribute all the matrices over the available resources by assigning each matrix to a group of cores (CPU) or thread blocks (GPU) that operates on it independently:
- For very small matrices, assign one matrix per core (CPU) or per thread block (GPU).
- For medium sizes, a matrix goes to a team of cores (CPU) or to many thread blocks (GPU).
- For large sizes, switch to the classical multithreaded approach: one matrix per round.
    Batched_dgemm( ... )
A task manager/dispatcher, based on the kernel design, decides the number of thread blocks or threads (GPU/CPU) and schedules through the NVIDIA or OpenMP runtime. A high percentage of the resources is used.
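A minimal sketch of the very-small-matrix case of this dispatch, assuming plain C with OpenMP and a hand-written kernel rather than the MAGMA or vendor batched BLAS routines the slide refers to: each thread is handed an entire matrix, so all cores stay busy even though every individual GEMM is tiny.

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    /* One small column-major dgemm: C = alpha*A*B + beta*C, all m x m. */
    static void small_dgemm(int m, double alpha, const double *A,
                            const double *B, double beta, double *C) {
        for (int j = 0; j < m; j++)
            for (int i = 0; i < m; i++) {
                double s = 0.0;
                for (int k = 0; k < m; k++)
                    s += A[i + k * m] * B[k + j * m];
                C[i + j * m] = alpha * s + beta * C[i + j * m];
            }
    }

    int main(void) {
        const int batchCount = 10000, m = 16;        /* many tiny matrices */
        size_t elems = (size_t)m * m;
        double *A = malloc(batchCount * elems * sizeof(double));
        double *B = malloc(batchCount * elems * sizeof(double));
        double *C = malloc(batchCount * elems * sizeof(double));
        for (size_t i = 0; i < batchCount * elems; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

        double t = omp_get_wtime();
        /* Batched dispatch: each matrix is an independent task, one per thread. */
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < batchCount; i++)
            small_dgemm(m, 1.0, A + i * elems, B + i * elems, 0.0, C + i * elems);
        t = omp_get_wtime() - t;

        double gflops = 2.0 * m * m * m * batchCount / t / 1e9;
        printf("batched dgemm: %.3f s, %.1f Gflop/s, C[0]=%g\n", t, gflops, C[0]);
        free(A); free(B); free(C);
        return 0;
    }

For medium sizes each matrix would instead go to a small team of threads (or several GPU thread blocks), and for large sizes the code falls back to one multithreaded dgemm per matrix, as described above.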


30 Batched Computations: How Do We Design and Optimize?
[Chart: batched vs. standard DGEMM performance over small, medium, and large sizes; the batched approach is about 2-3x faster for the small and medium ranges, and the code switches to the non-batched path for large sizes (where the two are about 1x apart).]

31 Batched Computations: How Do We Design and Optimize?
[Chart: a second comparison; about 3x faster for small sizes, about 1.2x for medium sizes, switching to the non-batched path for large sizes.]

32 Batched DGEMM on an NVIDIA V100 GPU
[Chart: Gflop/s of batched dgemm vs. standard dgemm (BLAS 3) over small, medium, and large sizes; batched is about 1.4x faster in the medium range and switches to the non-batched path for large sizes. With ~1000 matrices, performance reaches about 97% of the peak for that size.]

33 Intel Knights Landing: FLAT Mode
- 68 cores, 1.3 GHz, peak DP = 2662 Gflop/s.
- KNL is in FLAT mode (the entire MCDRAM is used as addressable memory); MCDRAM allocations are treated similarly to DDR4 memory allocations.
- KNL Thermal Design Power (TDP) is 215 W (for the main SKUs).
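In FLAT mode the MCDRAM appears as a separate NUMA node, so data placement is up to the application (or to numactl). A minimal sketch using the memkind library's high-bandwidth allocator; this illustrates one common way to place hot arrays in MCDRAM and is not necessarily how the measurements in this talk were obtained:

    #include <stdio.h>
    #include <stdlib.h>
    #include <hbwmalloc.h>      /* memkind's high-bandwidth memory API; link with -lmemkind */

    int main(void) {
        const size_t n = 1 << 24;                  /* 16 Mi doubles = 128 MiB */

        /* Check whether high-bandwidth (MCDRAM) memory is available. */
        if (hbw_check_available() != 0)
            fprintf(stderr, "no MCDRAM detected; allocation may fall back to DDR4\n");

        /* Allocate the hot array in MCDRAM, a cold array in ordinary DDR4. */
        double *hot  = hbw_malloc(n * sizeof(double));
        double *cold = malloc(n * sizeof(double));
        if (!hot || !cold) { fprintf(stderr, "allocation failed\n"); return 1; }

        for (size_t i = 0; i < n; i++) { hot[i] = 1.0; cold[i] = 2.0; }

        /* Bandwidth-bound work (e.g., DAXPY-like) benefits from MCDRAM's ~425 GB/s. */
        double a = 3.0, s = 0.0;
        for (size_t i = 0; i < n; i++) { hot[i] = a * hot[i] + cold[i]; s += hot[i]; }
        printf("checksum %g\n", s);

        hbw_free(hot);
        free(cold);
        return 0;
    }

Alternatively, an unmodified binary can be bound to MCDRAM wholesale with something like numactl --membind=1 ./app (the NUMA node number of the MCDRAM can differ from system to system).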

34 Level 1, 2 and 3 BLAS on KNL
68-core Intel Xeon Phi KNL, 1.3 GHz; the theoretical peak in double precision is 2662 Gflop/s. Compiled with icc and using Intel MKL 2017.
[Chart: performance of Level 1, 2, and 3 BLAS out of MCDRAM; Level 3 (DGEMM) runs at roughly 2000 Gflop/s, far above the Level 2 (DGEMV, 85.3 Gflop/s) and Level 1 (DAXPY, 35.1 Gflop/s) rates; a gap of roughly 35x is highlighted on the slide.]
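The large gap between the BLAS levels follows directly from arithmetic intensity (flops per byte of memory traffic); a rough back-of-the-envelope comparison, assuming 8-byte doubles and counting only main-memory traffic (these estimates are illustrative, not measurements from the slide):

    \text{DAXPY (Level 1):}\quad \frac{2n \ \text{flops}}{24n \ \text{bytes}} \approx 0.08\ \text{flop/byte}
    \qquad
    \text{DGEMV (Level 2):}\quad \frac{2n^2 \ \text{flops}}{8n^2 \ \text{bytes}} \approx 0.25\ \text{flop/byte}
    \qquad
    \text{DGEMM (Level 3):}\quad \frac{2n^3 \ \text{flops}}{\mathcal{O}(n^2) \ \text{bytes}} = \mathcal{O}(n)\ \text{flop/byte}

At ~425 GB/s out of MCDRAM this caps DAXPY around 35 Gflop/s and DGEMV around 100 Gflop/s, consistent with the measured rates, while DGEMM can run close to the 2662 Gflop/s compute peak.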

35 PAPI for Power-Aware Computing
- We use PAPI's latest powercap component for measurement, control, and performance analysis.
- PAPI power components in the past supported only reading power information.
- The new component exposes RAPL functionality to allow users to read and write power limits.
- Study numerical building blocks of varying computational intensity.
- Use the PAPI powercap component to detect power optimization opportunities.
- Objective: cap the power on the architecture to reduce power usage while keeping the execution time constant. Energy savings!

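A minimal sketch of how the powercap component can be used to read and lower the package power limit, assuming a current PAPI build. The RAPL-zone event names below are typical of the powercap component but are an assumption here, not taken from the slide; the exact names should be confirmed with papi_native_avail on the target machine, and writing the limit normally requires elevated privileges.

    #include <stdio.h>
    #include <papi.h>

    int main(void) {
        int EventSet = PAPI_NULL;
        long long values[2];

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
            fprintf(stderr, "PAPI init failed\n"); return 1;
        }
        PAPI_create_eventset(&EventSet);

        /* Assumed powercap event names for the package RAPL zone; check
           papi_native_avail for the exact names exposed on your system. */
        const char *events[2] = {
            "powercap:::POWER_LIMIT_A_UW:ZONE0",   /* long-term limit, microwatts  */
            "powercap:::POWER_LIMIT_B_UW:ZONE0"    /* short-term limit, microwatts */
        };
        for (int i = 0; i < 2; i++)
            if (PAPI_add_named_event(EventSet, events[i]) != PAPI_OK) {
                fprintf(stderr, "could not add %s\n", events[i]); return 1;
            }

        PAPI_start(EventSet);
        PAPI_read(EventSet, values);                 /* current limits */
        printf("current limits: %lld uW, %lld uW\n", values[0], values[1]);

        /* Lower the cap, e.g. to 160 W = 160,000,000 uW, then run the kernel. */
        values[0] = 160000000LL;
        values[1] = 160000000LL;
        if (PAPI_write(EventSet, values) != PAPI_OK)
            fprintf(stderr, "PAPI_write failed (insufficient privileges?)\n");

        /* ... run DGEMM / DGEMV / DAXPY here and record time and energy ... */

        PAPI_stop(EventSet, values);
        return 0;
    }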

37 Level 3 BLAS: DGEMM on KNL using MCDRAM
68-core KNL, peak DP = 2662 Gflop/s. Bandwidth: MCDRAM ~425 GB/s, DDR4 ~90 GB/s.
[Charts: average package and DDR4 power (Watts) over time, with the maximum power limit marked; plus performance in Gflop/s, Gflop/s per Watt, and Joules. The first phase is idle, followed by DGEMM at size 18K running at the default power limit.]
DGEMM is run repeatedly for a fixed matrix size of 18K; at each step a new power limit is set in decreasing fashion, starting from the default, then 220 Watts, 200 Watts, and so on down to 120 Watts in steps of 10-20 Watts.

38 Level 3 BLAS: DGEMM on KNL using MCDRAM (continued)
[Same charts as the previous slide.]
    PAPI_read(EventSet, values);    /* values is declared as long long values[2] */
    values[0] = ... ;               /* set short-term power limit to ... Watts   */
    values[1] = 258;                /* set long-term power limit to 258 Watts    */
    PAPI_write(EventSet, values);
DGEMM is run repeatedly for a fixed matrix size of 18K; at each step a new power limit is set in decreasing fashion, starting from the default, then 220 Watts, 200 Watts, and so on down to 120 Watts.

39 Level 3 BLAS: DGEMM on KNL using MCDRAM (continued)
[Charts now show two phases: DGEMM at size 18K at the default power limit, then with the limit set to 220 W.]
DGEMM is run repeatedly for a fixed matrix size of 18K; at each step a new power limit is set in decreasing fashion, starting from the default, then 220 Watts, 200 Watts, and so on down to 120 Watts.

40 Level 3 BLAS: DGEMM on KNL using MCDRAM (continued)
[Charts show successive DGEMM phases as the limit steps down: default, 220 W, 200 W, 180 W, 160 W, 150 W, and so on.]
DGEMM is run repeatedly for a fixed matrix size of 18K; at each step a new power limit is set in decreasing fashion, starting from the default, then 220 Watts, 200 Watts, and so on down to 120 Watts.

41 Level 3 BLAS: DGEMM on KNL using MCDRAM (continued)
[Charts add the core frequency in MHz and mark the points of optimum Flop/s and optimum Flop/s per Watt.]
When the power cap kicks in, the frequency is decreased, which confirms that capping takes effect through DVFS.
DGEMM is run repeatedly for a fixed matrix size of 18K; at each step a new power limit is set in decreasing fashion, starting from the default, then 220 Watts, 200 Watts, and so on down to 120 Watts.

42 Level 3 BLAS: DGEMM on KNL, MCDRAM vs. DDR4
[Side-by-side charts for MCDRAM and DDR4: package and DDR4 power, frequency, performance, Gflop/s per Watt, and Joules, with the optimum Flop/s and optimum Flop/s-per-Watt points marked.]
Lesson for DGEMM-type operations (compute intensive): capping can reduce performance, but sometimes it can hold the energy efficiency (Gflop/s per Watt).
DGEMM is run repeatedly for a fixed matrix size of 18K; at each step a new power limit is set in decreasing fashion, starting from the default, then 220 Watts, 200 Watts, and so on down to 120 Watts.
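The measurement loop behind these plots can be summarized as below. The helpers set_power_cap_watts, run_dgemm_once, and read_energy_joules are hypothetical placeholders (stubbed here so the sketch compiles) standing in for the PAPI powercap write shown earlier, the MKL DGEMM call, and the package energy counter; the structure of the sweep is the point, not the stub values.

    #include <stdio.h>

    /* Hypothetical stubs: in the real experiment these wrap PAPI_write on the
       powercap events, a timed MKL dgemm of size n, and the RAPL energy counter. */
    static void   set_power_cap_watts(double watts) { (void)watts; }
    static double run_dgemm_once(int n)             { (void)n; return 1.0; /* seconds */ }
    static double read_energy_joules(void)          { static double e = 0.0; return e += 200.0; }

    int main(void) {
        const int n = 18000;                      /* fixed matrix size, as on the slides    */
        const double caps[] = { 220, 200, 180, 160, 150, 140, 130, 120 };  /* Watts, after a
                                                     first run at the default limit         */
        const double flops = 2.0 * (double)n * (double)n * (double)n;

        for (size_t i = 0; i < sizeof caps / sizeof caps[0]; i++) {
            set_power_cap_watts(caps[i]);         /* decrease the cap step by step          */
            double e0 = read_energy_joules();
            double t  = run_dgemm_once(n);        /* one 18K x 18K DGEMM                    */
            double e1 = read_energy_joules();

            double gflops = flops / t / 1e9;
            double watts  = (e1 - e0) / t;        /* average power during the run           */
            printf("cap %5.0f W: %8.1f Gflop/s, %6.1f W, %6.2f Gflop/s/W, %8.1f J\n",
                   caps[i], gflops, watts, gflops / watts, e1 - e0);
        }
        return 0;
    }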

43 Level 2 BLAS: DGEMV on KNL, MCDRAM vs. DDR4
[Side-by-side charts for MCDRAM and DDR4: power, frequency, performance in Gflop/s, achieved bandwidth in GB/s, Gflop/s per Watt, and Joules, with the optimum Flop/s and optimum Flop/s-per-Watt points marked.]
Lesson for DGEMV-type operations (memory bound):
- For MCDRAM, capping at 190 Watts results in a power reduction without any loss of performance (energy saving ~20%).
- For DDR4, capping at 120 Watts improves energy efficiency by ~17%.
- Overall, capping 40 Watts below the power observed at the default setting provides about 17-20% energy saving without any significant increase in time to solution.
DGEMV is run repeatedly for a fixed matrix size of 18K; at each step a new power limit is set in decreasing fashion, starting from the default, then 220 Watts, 200 Watts, and so on down to 120 Watts.

44 Level 1 BLAS: DAXPY on KNL, MCDRAM vs. DDR4
[Side-by-side charts for MCDRAM and DDR4: power, frequency, performance, achieved bandwidth, Gflop/s per Watt, and Joules.]
Lesson for DAXPY-type operations (memory bound):
- For MCDRAM, capping at 180 Watts results in a power reduction without any loss of performance (energy saving ~16%).
- For DDR4, capping at 130 Watts improves energy efficiency by ~25%.
- Overall, capping 40 Watts below the power observed at the default setting provides about 16-25% energy saving without any significant increase in time to solution.
DAXPY is run repeatedly for a fixed vector size of 1E8; at each step a new power limit is set in decreasing fashion, starting from the default, then 220 Watts, 200 Watts, and so on down to 120 Watts.

45 Intel Knights Landing: Level 1, 2 and 3 BLAS Summary
68-core Intel Xeon Phi KNL, 1.3 GHz, peak DP = 2662 Gflop/s.
[Table: for Level 1, Level 2, and Level 3 BLAS, the rate in Gflop/s and the power efficiency in Gflop/s per Watt, out of MCDRAM and out of DDR4 (e.g., DDR4 rates of roughly 21 Gflop/s for Level 2 and 7 Gflop/s for Level 1).]

46 Power-Awareness in Applications: HPCG (MCDRAM)
HPCG benchmark on a 3D grid.
[Chart: average power (Watts) over time for MCDRAM at the default 215 W limit, with the total energy in Joules.]
- Sparse matrix-vector operations; solving Ax=b with the conjugate gradient algorithm.
- At TDP: the basic power limit of 215 W.

47 Power-Awareness in Applications: HPCG (MCDRAM)
[Chart: average power over time at the 215 W and 200 W limits, with the total energy per run (about 1540 Joules at 200 W).]
- Sparse matrix-vector operations; solving Ax=b with the conjugate gradient algorithm.
- Decreasing the power limit to 200 W does not result in any loss of performance, while we observe a reduction in the power draw.

48 Power-Awareness in Applications: HPCG (MCDRAM)
[Chart: average power over time at limits of 215, 200, 180, 160, 140, and 120 W; total energy of 1522, about 1540, 1468, 1486, 1728, and about 2390 Joules respectively.]
- Sparse matrix-vector operations; solving Ax=b with the conjugate gradient algorithm.
- Decreasing the power limit to 200 W does not result in any loss of performance, while we observe a reduction in the power draw.
- Decreasing the power limit by 40 Watts (setting the limit at 160 W) provides energy saving and a power-rate reduction without any meaningful loss of performance: less than 1% increase in time to solution.

49 Power-Awareness in Applications: HPCG (DDR4)
[Chart: average power over time at limits of 215, 200, 180, 160, 140, and 120 W; total energy ranging from about 3915 Joules at the default limit down to 3158 Joules at 120 W.]
- Sparse matrix-vector operations; solving Ax=b with the conjugate gradient algorithm.
- Decreasing the power limit to 160 W does not result in any loss of performance, while we observe a reduced power draw and energy saving.
- Decreasing the power limit by 40 Watts (limit at 140 W) provides a large energy saving (15%) and a power-rate reduction without any loss of performance.
- At 120 Watts, less than 1% increase in time to solution, while a large energy saving (20%) and power-rate reduction are observed.

50 Power-Awareness in Applications: HPCG (MCDRAM & DDR4)
[Side-by-side charts for MCDRAM and DDR4 at limits from 215 W down to 120 W, with total energy per run.]
- Sparse matrix-vector operations; solving Ax=b with CG.
Lesson for HPCG:
- For MCDRAM, decreasing the power limit by 40 Watts from the power observed at default (setting the limit at 160 W) provides energy saving and a power-rate reduction without any meaningful loss of performance (less than 1% increase in time to solution).
- For DDR4, similar behavior is observed: decreasing the power limit by 40 Watts from the power observed at default (setting the limit at 120 W) provides a large energy saving (15%-20%) and a power-rate reduction without any significant increase in time to solution.

51 Power-Awareness in Applications: Jacobi Iteration
Solving the Helmholtz equation with finite differences, Jacobi iteration, on a 128x128 grid.
[Side-by-side charts for MCDRAM and DDR4 at limits from 215 W down to 120 W, with total energy per run (e.g., 826 Joules at 215 W and 1865 Joules at 120 W for MCDRAM; 2878 Joules at 200 W and 2137 Joules at 140 W for DDR4).]
Lesson for the Jacobi iteration:
- For MCDRAM, capping at 180 Watts improves energy efficiency by ~13% without any loss in time to solution, while also decreasing the power draw by about 20%.
- For DDR4, capping at 140 Watts improves energy efficiency by ~20% without any loss in time to solution.
- Overall, capping 30-40 Watts below the power observed at the default limit provides a large energy gain while keeping the same time to solution, which is similar to the behavior observed for the DGEMV and DAXPY kernels.
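For reference, the computational core of such a run is a simple memory-bound stencil sweep. A minimal sketch of a Jacobi iteration for a Helmholtz-type problem, -Δu + αu = f, on a 128x128 grid with zero boundary values; the discretization and right-hand side are assumptions for illustration, not details taken from the slide.

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    int main(void) {
        const int n = 128;                  /* grid points per dimension, as on the slide */
        const double h = 1.0 / (n - 1), alpha = 1.0;
        /* Jacobi update for the discrete Helmholtz-type operator (-Laplacian + alpha). */
        const double diag = 4.0 / (h * h) + alpha;

        double *u    = calloc((size_t)n * n, sizeof(double));
        double *unew = calloc((size_t)n * n, sizeof(double));
        double *f    = malloc((size_t)n * n * sizeof(double));
        for (int i = 0; i < n * n; i++) f[i] = 1.0;     /* simple right-hand side */

        double diff = 1.0;
        int iter = 0;
        while (diff > 1e-6 && iter < 10000) {
            diff = 0.0;
            for (int i = 1; i < n - 1; i++)
                for (int j = 1; j < n - 1; j++) {
                    int idx = i * n + j;
                    double sum = (u[idx - 1] + u[idx + 1] + u[idx - n] + u[idx + n]) / (h * h);
                    unew[idx] = (f[idx] + sum) / diag;
                    double d = unew[idx] - u[idx];
                    diff += d * d;
                }
            diff = sqrt(diff);
            double *tmp = u; u = unew; unew = tmp;      /* swap buffers */
            iter++;
        }
        printf("change %g after %d Jacobi sweeps\n", diff, iter);
        free(u); free(unew); free(f);
        return 0;
    }

Like DGEMV and DAXPY, each sweep streams the whole grid while doing only a handful of flops per point, which is why this run responds to power capping the same way the memory-bound kernels do.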

52 Power-Awareness in Applications: Lattice Boltzmann
Lattice-Boltzmann simulation of CFD (from the SPEC 2006 benchmark suite).
[Side-by-side charts for MCDRAM and DDR4 at limits from 270 W down to 120 W, with total energy per run: for MCDRAM, energy rises from roughly 2900-3000 Joules at the higher limits to 5734 Joules at 120 W; for DDR4 it falls from about 9200 Joules to 8467 Joules at 120 W.]
Lesson for Lattice Boltzmann:
- For MCDRAM, capping at 200 W results in a 5% energy saving and a power-rate reduction without any increase in time to solution.
- For DDR4, capping at 140 Watts improves energy efficiency by ~11% and reduces the power draw without any increase in time to solution.

53 Power-Awareness in Applications: Monte Carlo
Monte Carlo neutron transport application (from XSBench).
[Side-by-side charts for MCDRAM (limits from 215 W down to 140 W) and DDR4 (215 W down to 120 W), with total energy per run (e.g., 9175 Joules at 215 W and 8720 Joules at 180 W for MCDRAM).]
Lesson for Monte Carlo:
- For MCDRAM, capping at 180 W results in an energy saving (5%) and a power-rate reduction without any meaningful increase in time to solution.
- For DDR4, capping at 120 Watts improves power efficiency by ~30% without any loss of performance.

54 Power-Aware Tools
- Designing a toolset that will automatically monitor memory traffic.
- Track the performance and data movement.
- Adjust the power to keep the performance roughly the same while decreasing energy consumption.
- Better energy efficiency without significant loss of performance.
- All in an automatic fashion, without user involvement.
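One way such a tool could be structured is as a simple feedback loop: sample performance and memory traffic, keep lowering the cap while the code remains memory-bound and performance holds, and back off as soon as the time per iteration degrades. The helpers sample_counters and apply_power_cap are hypothetical placeholders (stubbed so the sketch compiles) for the PAPI powercap and memory-traffic measurements; this is a sketch of the idea, not the toolset the slide describes.

    #include <stdio.h>

    /* One measurement sample; in a real tool these would come from PAPI
       (powercap component plus memory/bandwidth counters). */
    typedef struct { double gflops; double mem_gbs; } sample_t;

    /* Hypothetical stubs standing in for the real measurement and control calls. */
    static sample_t sample_counters(void)     { sample_t s = { 80.0, 400.0 }; return s; }
    static void     apply_power_cap(double w) { printf("  cap set to %.0f W\n", w); }

    int main(void) {
        const double min_cap = 120.0, step = 10.0, slack = 0.97;   /* keep >= 97% of perf */
        const double mcdram_bw = 425.0;                            /* GB/s, KNL MCDRAM    */
        double cap = 215.0;                                        /* start at TDP        */

        sample_t base = sample_counters();    /* reference performance at the default cap */
        apply_power_cap(cap);

        for (int epoch = 0; epoch < 8; epoch++) {                  /* periodic decisions  */
            sample_t s = sample_counters();
            int memory_bound = s.mem_gbs > 0.8 * mcdram_bw;        /* near bandwidth limit */
            int perf_ok      = s.gflops >= slack * base.gflops;

            if (memory_bound && perf_ok && cap - step >= min_cap)
                cap -= step;                  /* performance held: keep lowering the cap  */
            else if (!perf_ok)
                cap += step;                  /* performance dropped: back off            */

            apply_power_cap(cap);
            printf("epoch %d: %.1f Gflop/s, %.0f GB/s, cap %.0f W\n",
                   epoch, s.gflops, s.mem_gbs, cap);
        }
        return 0;
    }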

55 Critical Issues and Challenges at Peta & Exascale for Algorithm and Software Design
- Synchronization-reducing algorithms: break the fork-join model.
- Communication-reducing algorithms: use methods that attain the lower bound on communication.
- Mixed precision methods: 2x speed of operations and 2x speed for data movement.
- Autotuning: today's machines are too complicated; build smarts into software to adapt to the hardware.
- Fault-resilient algorithms: implement algorithms that can recover from failures/bit flips.
- Reproducibility of results: today we can't guarantee this without a penalty. We understand the issues, but some of our colleagues have a hard time with this.

56 Collaborators and Support
MAGMA team; PLASMA team.
Collaborating partners: University of Tennessee, Knoxville; University of California, Berkeley; University of Colorado, Denver; University of Manchester.

It s a Multicore World. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist

It s a Multicore World. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist It s a Multicore World John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist Waiting for Moore s Law to save your serial code started getting bleak in 2004 Source: published SPECInt

More information

It s a Multicore World. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist

It s a Multicore World. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist It s a Multicore World John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist Waiting for Moore s Law to save your serial code started getting bleak in 2004 Source: published SPECInt

More information

HPCG UPDATE: ISC 15 Jack Dongarra Michael Heroux Piotr Luszczek

HPCG UPDATE: ISC 15 Jack Dongarra Michael Heroux Piotr Luszczek www.hpcg-benchmark.org 1 HPCG UPDATE: ISC 15 Jack Dongarra Michael Heroux Piotr Luszczek www.hpcg-benchmark.org 2 HPCG Snapshot High Performance Conjugate Gradient (HPCG). Solves Ax=b, A large, sparse,

More information

Presentations: Jack Dongarra, University of Tennessee & ORNL. The HPL Benchmark: Past, Present & Future. Mike Heroux, Sandia National Laboratories

Presentations: Jack Dongarra, University of Tennessee & ORNL. The HPL Benchmark: Past, Present & Future. Mike Heroux, Sandia National Laboratories HPC Benchmarking Presentations: Jack Dongarra, University of Tennessee & ORNL The HPL Benchmark: Past, Present & Future Mike Heroux, Sandia National Laboratories The HPCG Benchmark: Challenges It Presents

More information

It s a Multicore World. John Urbanic Pittsburgh Supercomputing Center

It s a Multicore World. John Urbanic Pittsburgh Supercomputing Center It s a Multicore World John Urbanic Pittsburgh Supercomputing Center Waiting for Moore s Law to save your serial code start getting bleak in 2004 Source: published SPECInt data Moore s Law is not at all

More information

Managing HPC Active Archive Storage with HPSS RAIT at Oak Ridge National Laboratory

Managing HPC Active Archive Storage with HPSS RAIT at Oak Ridge National Laboratory Managing HPC Active Archive Storage with HPSS RAIT at Oak Ridge National Laboratory Quinn Mitchell HPC UNIX/LINUX Storage Systems ORNL is managed by UT-Battelle for the US Department of Energy U.S. Department

More information

CSE5351: Parallel Procesisng. Part 1B. UTA Copyright (c) Slide No 1

CSE5351: Parallel Procesisng. Part 1B. UTA Copyright (c) Slide No 1 Slide No 1 CSE5351: Parallel Procesisng Part 1B Slide No 2 State of the Art In Supercomputing Several of the next slides (or modified) are the courtesy of Dr. Jack Dongarra, a distinguished professor of

More information

HPCG UPDATE: SC 15 Jack Dongarra Michael Heroux Piotr Luszczek

HPCG UPDATE: SC 15 Jack Dongarra Michael Heroux Piotr Luszczek 1 HPCG UPDATE: SC 15 Jack Dongarra Michael Heroux Piotr Luszczek HPCG Snapshot High Performance Conjugate Gradient (HPCG). Solves Ax=b, A large, sparse, b known, x computed. An optimized implementation

More information

It s a Multicore World. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist

It s a Multicore World. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist It s a Multicore World John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist Moore's Law abandoned serial programming around 2004 Courtesy Liberty Computer Architecture Research Group

More information

Jack Dongarra University of Tennessee Oak Ridge National Laboratory University of Manchester

Jack Dongarra University of Tennessee Oak Ridge National Laboratory University of Manchester Jack Dongarra University of Tennessee Oak Ridge National Laboratory University of Manchester 11/20/13 1 Rank Site Computer Country Cores Rmax [Pflops] % of Peak Power [MW] MFlops /Watt 1 2 3 4 National

More information

Jack Dongarra University of Tennessee Oak Ridge National Laboratory

Jack Dongarra University of Tennessee Oak Ridge National Laboratory Jack Dongarra University of Tennessee Oak Ridge National Laboratory 3/9/11 1 TPP performance Rate Size 2 100 Pflop/s 100000000 10 Pflop/s 10000000 1 Pflop/s 1000000 100 Tflop/s 100000 10 Tflop/s 10000

More information

Report on the Sunway TaihuLight System. Jack Dongarra. University of Tennessee. Oak Ridge National Laboratory

Report on the Sunway TaihuLight System. Jack Dongarra. University of Tennessee. Oak Ridge National Laboratory Report on the Sunway TaihuLight System Jack Dongarra University of Tennessee Oak Ridge National Laboratory June 24, 2016 University of Tennessee Department of Electrical Engineering and Computer Science

More information

Emerging Heterogeneous Technologies for High Performance Computing

Emerging Heterogeneous Technologies for High Performance Computing MURPA (Monash Undergraduate Research Projects Abroad) Emerging Heterogeneous Technologies for High Performance Computing Jack Dongarra University of Tennessee Oak Ridge National Lab University of Manchester

More information

It s a Multicore World. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist

It s a Multicore World. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist It s a Multicore World John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist Moore's Law abandoned serial programming around 2004 Courtesy Liberty Computer Architecture Research Group

More information

Cray XC Scalability and the Aries Network Tony Ford

Cray XC Scalability and the Aries Network Tony Ford Cray XC Scalability and the Aries Network Tony Ford June 29, 2017 Exascale Scalability Which scalability metrics are important for Exascale? Performance (obviously!) What are the contributing factors?

More information

High-Performance Computing - and why Learn about it?

High-Performance Computing - and why Learn about it? High-Performance Computing - and why Learn about it? Tarek El-Ghazawi The George Washington University Washington D.C., USA Outline What is High-Performance Computing? Why is High-Performance Computing

More information

Parallel Computing & Accelerators. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist

Parallel Computing & Accelerators. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist Parallel Computing Accelerators John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist Purpose of this talk This is the 50,000 ft. view of the parallel computing landscape. We want

More information

Confessions of an Accidental Benchmarker

Confessions of an Accidental Benchmarker Confessions of an Accidental Benchmarker http://bit.ly/hpcg-benchmark 1 Appendix B of the Linpack Users Guide Designed to help users extrapolate execution Linpack software package First benchmark report

More information

Top500

Top500 Top500 www.top500.org Salvatore Orlando (from a presentation by J. Dongarra, and top500 website) 1 2 MPPs Performance on massively parallel machines Larger problem sizes, i.e. sizes that make sense Performance

More information

High Performance Linear Algebra

High Performance Linear Algebra High Performance Linear Algebra Hatem Ltaief Senior Research Scientist Extreme Computing Research Center King Abdullah University of Science and Technology 4th International Workshop on Real-Time Control

More information

Power Profiling of Cholesky and QR Factorizations on Distributed Memory Systems

Power Profiling of Cholesky and QR Factorizations on Distributed Memory Systems International Conference on Energy-Aware High Performance Computing Hamburg, Germany Bosilca, Ltaief, Dongarra (KAUST, UTK) Power Sept Profiling, DLA Algorithms ENAHPC / 6 Power Profiling of Cholesky and

More information

TOP500 List s Twice-Yearly Snapshots of World s Fastest Supercomputers Develop Into Big Picture of Changing Technology

TOP500 List s Twice-Yearly Snapshots of World s Fastest Supercomputers Develop Into Big Picture of Changing Technology TOP500 List s Twice-Yearly Snapshots of World s Fastest Supercomputers Develop Into Big Picture of Changing Technology BY ERICH STROHMAIER COMPUTER SCIENTIST, FUTURE TECHNOLOGIES GROUP, LAWRENCE BERKELEY

More information

An Overview of High Performance Computing and Challenges for the Future

An Overview of High Performance Computing and Challenges for the Future An Overview of High Performance Computing and Challenges for the Future Jack Dongarra University of Tennessee Oak Ridge National Laboratory University of Manchester 6/15/2009 1 H. Meuer, H. Simon, E. Strohmaier,

More information

ECE 574 Cluster Computing Lecture 2

ECE 574 Cluster Computing Lecture 2 ECE 574 Cluster Computing Lecture 2 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 24 January 2019 Announcements Put your name on HW#1 before turning in! 1 Top500 List November

More information

Overview. CS 472 Concurrent & Parallel Programming University of Evansville

Overview. CS 472 Concurrent & Parallel Programming University of Evansville Overview CS 472 Concurrent & Parallel Programming University of Evansville Selection of slides from CIS 410/510 Introduction to Parallel Computing Department of Computer and Information Science, University

More information

HPC as a Driver for Computing Technology and Education

HPC as a Driver for Computing Technology and Education HPC as a Driver for Computing Technology and Education Tarek El-Ghazawi The George Washington University Washington D.C., USA NOW- July 2015: The TOP 10 Systems Rank Site Computer Cores Rmax [Pflops] %

More information

Supercomputers in ITC/U.Tokyo 2 big systems, 6 yr. cycle

Supercomputers in ITC/U.Tokyo 2 big systems, 6 yr. cycle Supercomputers in ITC/U.Tokyo 2 big systems, 6 yr. cycle FY 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 Hitachi SR11K/J2 IBM Power 5+ 18.8TFLOPS, 16.4TB Hitachi HA8000 (T2K) AMD Opteron 140TFLOPS, 31.3TB

More information

MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures

MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Stan Tomov Innovative Computing Laboratory University of Tennessee, Knoxville OLCF Seminar Series, ORNL June 16, 2010

More information

Preparing GPU-Accelerated Applications for the Summit Supercomputer

Preparing GPU-Accelerated Applications for the Summit Supercomputer Preparing GPU-Accelerated Applications for the Summit Supercomputer Fernanda Foertter HPC User Assistance Group Training Lead foertterfs@ornl.gov This research used resources of the Oak Ridge Leadership

More information

High Performance Computing in Europe and USA: A Comparison

High Performance Computing in Europe and USA: A Comparison High Performance Computing in Europe and USA: A Comparison Hans Werner Meuer University of Mannheim and Prometeus GmbH 2nd European Stochastic Experts Forum Baden-Baden, June 28-29, 2001 Outlook Introduction

More information

InfiniBand Strengthens Leadership as the Interconnect Of Choice By Providing Best Return on Investment. TOP500 Supercomputers, June 2014

InfiniBand Strengthens Leadership as the Interconnect Of Choice By Providing Best Return on Investment. TOP500 Supercomputers, June 2014 InfiniBand Strengthens Leadership as the Interconnect Of Choice By Providing Best Return on Investment TOP500 Supercomputers, June 2014 TOP500 Performance Trends 38% CAGR 78% CAGR Explosive high-performance

More information

Chapter 5 Supercomputers

Chapter 5 Supercomputers Chapter 5 Supercomputers Part I. Preliminaries Chapter 1. What Is Parallel Computing? Chapter 2. Parallel Hardware Chapter 3. Parallel Software Chapter 4. Parallel Applications Chapter 5. Supercomputers

More information

Parallel Computing & Accelerators. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist

Parallel Computing & Accelerators. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist Parallel Computing Accelerators John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist Purpose of this talk This is the 50,000 ft. view of the parallel computing landscape. We want

More information

Trends in HPC (hardware complexity and software challenges)

Trends in HPC (hardware complexity and software challenges) Trends in HPC (hardware complexity and software challenges) Mike Giles Oxford e-research Centre Mathematical Institute MIT seminar March 13th, 2013 Mike Giles (Oxford) HPC Trends March 13th, 2013 1 / 18

More information

Distributed Dense Linear Algebra on Heterogeneous Architectures. George Bosilca

Distributed Dense Linear Algebra on Heterogeneous Architectures. George Bosilca Distributed Dense Linear Algebra on Heterogeneous Architectures George Bosilca bosilca@eecs.utk.edu Centraro, Italy June 2010 Factors that Necessitate to Redesign of Our Software» Steepness of the ascent

More information

Investigating power capping toward energy-efficient scientific applications

Investigating power capping toward energy-efficient scientific applications Received: 2 October 217 Revised: 15 December 217 Accepted: 19 February 218 DOI: 1.12/cpe.4485 SPECIAL ISSUE PAPER Investigating power capping toward energy-efficient scientific applications Azzam Haidar

More information

Jack Dongarra University of Tennessee Oak Ridge National Laboratory University of Manchester

Jack Dongarra University of Tennessee Oak Ridge National Laboratory University of Manchester Jack Dongarra University of Tennessee Oak Ridge National Laboratory University of Manchester 12/24/09 1 Take a look at high performance computing What s driving HPC Future Trends 2 Traditional scientific

More information

A Standard for Batching BLAS Operations

A Standard for Batching BLAS Operations A Standard for Batching BLAS Operations Jack Dongarra University of Tennessee Oak Ridge National Laboratory University of Manchester 5/8/16 1 API for Batching BLAS Operations We are proposing, as a community

More information

MAGMA. Matrix Algebra on GPU and Multicore Architectures

MAGMA. Matrix Algebra on GPU and Multicore Architectures MAGMA Matrix Algebra on GPU and Multicore Architectures Innovative Computing Laboratory Electrical Engineering and Computer Science University of Tennessee Piotr Luszczek (presenter) web.eecs.utk.edu/~luszczek/conf/

More information

CS 5803 Introduction to High Performance Computer Architecture: Performance Metrics

CS 5803 Introduction to High Performance Computer Architecture: Performance Metrics CS 5803 Introduction to High Performance Computer Architecture: Performance Metrics A.R. Hurson 323 Computer Science Building, Missouri S&T hurson@mst.edu 1 Instructor: Ali R. Hurson 323 CS Building hurson@mst.edu

More information

What have we learned from the TOP500 lists?

What have we learned from the TOP500 lists? What have we learned from the TOP500 lists? Hans Werner Meuer University of Mannheim and Prometeus GmbH Sun HPC Consortium Meeting Heidelberg, Germany June 19-20, 2001 Outlook TOP500 Approach Snapshots

More information

Intel Xeon Phi архитектура, модели программирования, оптимизация.

Intel Xeon Phi архитектура, модели программирования, оптимизация. Нижний Новгород, 2017 Intel Xeon Phi архитектура, модели программирования, оптимизация. Дмитрий Прохоров, Дмитрий Рябцев, Intel Agenda What and Why Intel Xeon Phi Top 500 insights, roadmap, architecture

More information

Titan - Early Experience with the Titan System at Oak Ridge National Laboratory

Titan - Early Experience with the Titan System at Oak Ridge National Laboratory Office of Science Titan - Early Experience with the Titan System at Oak Ridge National Laboratory Buddy Bland Project Director Oak Ridge Leadership Computing Facility November 13, 2012 ORNL s Titan Hybrid

More information

Chapter 1. Introduction

Chapter 1. Introduction Chapter 1 Introduction Why High Performance Computing? Quote: It is hard to understand an ocean because it is too big. It is hard to understand a molecule because it is too small. It is hard to understand

More information

NERSC Site Update. National Energy Research Scientific Computing Center Lawrence Berkeley National Laboratory. Richard Gerber

NERSC Site Update. National Energy Research Scientific Computing Center Lawrence Berkeley National Laboratory. Richard Gerber NERSC Site Update National Energy Research Scientific Computing Center Lawrence Berkeley National Laboratory Richard Gerber NERSC Senior Science Advisor High Performance Computing Department Head Cori

More information

represent parallel computers, so distributed systems such as Does not consider storage or I/O issues

represent parallel computers, so distributed systems such as Does not consider storage or I/O issues Top500 Supercomputer list represent parallel computers, so distributed systems such as SETI@Home are not considered Does not consider storage or I/O issues Both custom designed machines and commodity machines

More information

HPC Technology Update Challenges or Chances?

HPC Technology Update Challenges or Chances? HPC Technology Update Challenges or Chances? Swiss Distributed Computing Day Thomas Schoenemeyer, Technology Integration, CSCS 1 Move in Feb-April 2012 1500m2 16 MW Lake-water cooling PUE 1.2 New Datacenter

More information

Short Talk: System abstractions to facilitate data movement in supercomputers with deep memory and interconnect hierarchy

Short Talk: System abstractions to facilitate data movement in supercomputers with deep memory and interconnect hierarchy Short Talk: System abstractions to facilitate data movement in supercomputers with deep memory and interconnect hierarchy François Tessier, Venkatram Vishwanath Argonne National Laboratory, USA July 19,

More information

GOING ARM A CODE PERSPECTIVE

GOING ARM A CODE PERSPECTIVE GOING ARM A CODE PERSPECTIVE ISC18 Guillaume Colin de Verdière JUNE 2018 GCdV PAGE 1 CEA, DAM, DIF, F-91297 Arpajon, France June 2018 A history of disruptions All dates are installation dates of the machines

More information

The Future of High- Performance Computing

The Future of High- Performance Computing Lecture 26: The Future of High- Performance Computing Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2017 Comparing Two Large-Scale Systems Oakridge Titan Google Data Center Monolithic

More information

Introduction to Parallel Programming for Multicore/Manycore Clusters

Introduction to Parallel Programming for Multicore/Manycore Clusters duction to Parallel Programming for Multi/Many Clusters General duction Invitation to Supercomputing Kengo Nakajima Information Technology Center The University of Tokyo Takahiro Katagiri Information Technology

More information

2018/9/25 (1), (I) 1 (1), (I)

2018/9/25 (1), (I) 1 (1), (I) 2018/9/25 (1), (I) 1 (1), (I) 2018/9/25 (1), (I) 2 1. 2. 3. 4. 5. 2018/9/25 (1), (I) 3 1. 2. MPI 3. D) http://www.compsci-alliance.jp http://www.compsci-alliance.jp// 2018/9/25 (1), (I) 4 2018/9/25 (1),

More information

Towards Exascale Computing with the Atmospheric Model NUMA

Towards Exascale Computing with the Atmospheric Model NUMA Towards Exascale Computing with the Atmospheric Model NUMA Andreas Müller, Daniel S. Abdi, Michal Kopera, Lucas Wilcox, Francis X. Giraldo Department of Applied Mathematics Naval Postgraduate School, Monterey

More information

Overview of Reedbush-U How to Login

Overview of Reedbush-U How to Login Overview of Reedbush-U How to Login Information Technology Center The University of Tokyo http://www.cc.u-tokyo.ac.jp/ Supercomputers in ITC/U.Tokyo 2 big systems, 6 yr. cycle FY 11 12 13 14 15 16 17 18

More information

Dense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends

Dense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Dense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen pauldj@aices.rwth-aachen.de ComplexHPC Spring School 2013 Heterogeneous computing - Impact

More information

IHK/McKernel: A Lightweight Multi-kernel Operating System for Extreme-Scale Supercomputing

IHK/McKernel: A Lightweight Multi-kernel Operating System for Extreme-Scale Supercomputing : A Lightweight Multi-kernel Operating System for Extreme-Scale Supercomputing Balazs Gerofi Exascale System Software Team, RIKEN Center for Computational Science 218/Nov/15 SC 18 Intel Extreme Computing

More information

MAGMA: a New Generation

MAGMA: a New Generation 1.3 MAGMA: a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Jack Dongarra T. Dong, M. Gates, A. Haidar, S. Tomov, and I. Yamazaki University of Tennessee, Knoxville Release

More information

How to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić

How to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić How to perform HPL on CPU&GPU clusters Dr.sc. Draško Tomić email: drasko.tomic@hp.com Forecasting is not so easy, HPL benchmarking could be even more difficult Agenda TOP500 GPU trends Some basics about

More information

Fujitsu s Approach to Application Centric Petascale Computing

Fujitsu s Approach to Application Centric Petascale Computing Fujitsu s Approach to Application Centric Petascale Computing 2 nd Nov. 2010 Motoi Okuda Fujitsu Ltd. Agenda Japanese Next-Generation Supercomputer, K Computer Project Overview Design Targets System Overview

More information

Jack Dongarra University of Tennessee Oak Ridge National Laboratory University of Manchester

Jack Dongarra University of Tennessee Oak Ridge National Laboratory University of Manchester Jack Dongarra University of Tennessee Oak Ridge National Laboratory University of Manchester 12/3/09 1 ! Take a look at high performance computing! What s driving HPC! Issues with power consumption! Future

More information

Presentation of the 16th List

Presentation of the 16th List Presentation of the 16th List Hans- Werner Meuer, University of Mannheim Erich Strohmaier, University of Tennessee Jack J. Dongarra, University of Tennesse Horst D. Simon, NERSC/LBNL SC2000, Dallas, TX,

More information

Overview. Idea: Reduce CPU clock frequency This idea is well suited specifically for visualization

Overview. Idea: Reduce CPU clock frequency This idea is well suited specifically for visualization Exploring Tradeoffs Between Power and Performance for a Scientific Visualization Algorithm Stephanie Labasan & Matt Larsen (University of Oregon), Hank Childs (Lawrence Berkeley National Laboratory) 26

More information

HPC Saudi Jeffrey A. Nichols Associate Laboratory Director Computing and Computational Sciences. Presented to: March 14, 2017

HPC Saudi Jeffrey A. Nichols Associate Laboratory Director Computing and Computational Sciences. Presented to: March 14, 2017 Creating an Exascale Ecosystem for Science Presented to: HPC Saudi 2017 Jeffrey A. Nichols Associate Laboratory Director Computing and Computational Sciences March 14, 2017 ORNL is managed by UT-Battelle

More information

Aim High. Intel Technical Update Teratec 07 Symposium. June 20, Stephen R. Wheat, Ph.D. Director, HPC Digital Enterprise Group

Aim High. Intel Technical Update Teratec 07 Symposium. June 20, Stephen R. Wheat, Ph.D. Director, HPC Digital Enterprise Group Aim High Intel Technical Update Teratec 07 Symposium June 20, 2007 Stephen R. Wheat, Ph.D. Director, HPC Digital Enterprise Group Risk Factors Today s s presentations contain forward-looking statements.

More information

HPC Algorithms and Applications

HPC Algorithms and Applications HPC Algorithms and Applications Intro Michael Bader Winter 2015/2016 Intro, Winter 2015/2016 1 Part I Scientific Computing and Numerical Simulation Intro, Winter 2015/2016 2 The Simulation Pipeline phenomenon,

More information

Toward portable I/O performance by leveraging system abstractions of deep memory and interconnect hierarchies

Toward portable I/O performance by leveraging system abstractions of deep memory and interconnect hierarchies Toward portable I/O performance by leveraging system abstractions of deep memory and interconnect hierarchies François Tessier, Venkatram Vishwanath, Paul Gressier Argonne National Laboratory, USA Wednesday

More information

Performance and Energy Usage of Workloads on KNL and Haswell Architectures

Performance and Energy Usage of Workloads on KNL and Haswell Architectures Performance and Energy Usage of Workloads on KNL and Haswell Architectures Tyler Allen 1 Christopher Daley 2 Doug Doerfler 2 Brian Austin 2 Nicholas Wright 2 1 Clemson University 2 National Energy Research

More information

Overview. High Performance Computing - History of the Supercomputer. Modern Definitions (II)

Overview. High Performance Computing - History of the Supercomputer. Modern Definitions (II) Overview High Performance Computing - History of the Supercomputer Dr M. Probert Autumn Term 2017 Early systems with proprietary components, operating systems and tools Development of vector computing

More information

Hybrid Architectures Why Should I Bother?

Hybrid Architectures Why Should I Bother? Hybrid Architectures Why Should I Bother? CSCS-FoMICS-USI Summer School on Computer Simulations in Science and Engineering Michael Bader July 8 19, 2013 Computer Simulations in Science and Engineering,

More information

CUDA Accelerated Linpack on Clusters. E. Phillips, NVIDIA Corporation

CUDA Accelerated Linpack on Clusters. E. Phillips, NVIDIA Corporation CUDA Accelerated Linpack on Clusters E. Phillips, NVIDIA Corporation Outline Linpack benchmark CUDA Acceleration Strategy Fermi DGEMM Optimization / Performance Linpack Results Conclusions LINPACK Benchmark

More information

High Performance Computing in Europe and USA: A Comparison

High Performance Computing in Europe and USA: A Comparison High Performance Computing in Europe and USA: A Comparison Erich Strohmaier 1 and Hans W. Meuer 2 1 NERSC, Lawrence Berkeley National Laboratory, USA 2 University of Mannheim, Germany 1 Introduction In

More information

Performance of deal.ii on a node

Performance of deal.ii on a node Performance of deal.ii on a node Bruno Turcksin Texas A&M University, Dept. of Mathematics Bruno Turcksin Deal.II on a node 1/37 Outline 1 Introduction 2 Architecture 3 Paralution 4 Other Libraries 5 Conclusions

More information

20 Jahre TOP500 mit einem Ausblick auf neuere Entwicklungen

20 Jahre TOP500 mit einem Ausblick auf neuere Entwicklungen 20 Jahre TOP500 mit einem Ausblick auf neuere Entwicklungen Hans Meuer Prometeus GmbH & Universität Mannheim hans@meuer.de ZKI Herbsttagung in Leipzig 11. - 12. September 2012 page 1 Outline Mannheim Supercomputer

More information

Why we need Exascale and why we won t get there by 2020 Horst Simon Lawrence Berkeley National Laboratory

Why we need Exascale and why we won t get there by 2020 Horst Simon Lawrence Berkeley National Laboratory Why we need Exascale and why we won t get there by 2020 Horst Simon Lawrence Berkeley National Laboratory 2013 International Workshop on Computational Science and Engineering National University of Taiwan

More information

The Mont-Blanc approach towards Exascale

The Mont-Blanc approach towards Exascale http://www.montblanc-project.eu The Mont-Blanc approach towards Exascale Alex Ramirez Barcelona Supercomputing Center Disclaimer: Not only I speak for myself... All references to unavailable products are

Motivation, Goal, Idea, Proposition for users, Study

Exploring Tradeoffs Between Power and Performance for a Scientific Visualization Algorithm. Stephanie Labasan, Computer and Information Science, University of Oregon, 23 November 2015. Overview: Motivation

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)

Natalia Gimelshein (NVIDIA), Anshul Gupta (IBM), Steve Rennich (NVIDIA), Seid Koric (NCSA). Watson Sparse Matrix Package (WSMP): Cholesky, LDL^T, LU factorization

Making a Case for a Green500 List

S. Sharma, C. Hsu, and W. Feng, Los Alamos National Laboratory / Virginia Tech. Outline: Introduction; What Is Performance?; Motivation: The Need for a Green500 List; Challenges

CRAY XK6 REDEFINING SUPERCOMPUTING. - Sanjana Rakhecha - Nishad Nerurkar

Sanjana Rakhecha, Nishad Nerurkar. Contents: Introduction; History; Specifications; Cray XK6 Architecture; Performance; Industry acceptance and applications; Summary

Parallel and Distributed Systems. Hardware Trends. Why Parallel or Distributed Computing? What is a parallel computer?

Instructor: Sandhya Dwarkadas, Department of Computer Science, University of Rochester. What is a parallel computer? A collection of processing elements that communicate and

System Packaging Solution for Future High Performance Computing May 31, 2018 Shunichi Kikuchi Fujitsu Limited

Shunichi Kikuchi, Fujitsu Limited, May 31, 2018. 2018 IEEE 68th Electronic Components and Technology Conference, San Diego, California, May 29

Intro To Parallel Computing. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist

John Urbanic, Parallel Computing Scientist, Pittsburgh Supercomputing Center. Purpose of this talk: this is the 50,000 ft. view of the parallel computing landscape. We want to orient

HPC future trends from a science perspective

Simon McIntosh-Smith, University of Bristol HPC Research Group, simonm@cs.bris.ac.uk. Business as usual? We've all got used to new machines being relatively

An Overview of High Performance Computing

IFIP Working Group 10.3 on Concurrent Systems. Jack Dongarra, University of Tennessee and Oak Ridge National Laboratory, 1/3/2006. Overview: Look at fastest computers

Mixed Precision Methods

Mixed precision: use the lowest precision required to achieve a given accuracy outcome. This improves runtime, reduces power consumption, and lowers data movement. Reformulate to find a correction (a minimal illustrative sketch of this idea appears at the end of this list).

Seagate ExaScale HPC Storage

Miro Lehocky, System Engineer, Seagate Systems Group, HPC1. 100+ PB Lustre File System; 130+ GB/s Lustre File System; 140+ GB/s Lustre File System; 55 PB Lustre File System; 1.6

High Performance Computing An introduction talk. Fengguang Song

Fengguang Song, fgsong@cs.iupui.edu. Content: What is HPC; History of supercomputing; Current supercomputers (Top 500); Common programming models, tools, and

Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with Aspen

Jeffrey S. Vetter, SPPEXA Symposium, Munich, 26 Jan 2016. ORNL is managed by UT-Battelle for the US Department of Energy

Intel Xeon Phi: architecture, programming models, optimization

Dmitry Prokhorov, Intel. Nizhny Novgorod, 2016. Agenda: What and why Intel Xeon Phi; Top 500 insights, roadmap, architecture; How Programming

Introduction to Parallel Programming for Multicore/Manycore Clusters

General Introduction: Invitation to Supercomputing. Kengo Nakajima, Information Technology Center, The University of Tokyo; Takahiro Katagiri

Oak Ridge National Laboratory Computing and Computational Sciences

OFA Update by ORNL. Presented by: Pavel Shamis (Pasha), OFA Workshop, Mar 17, 2015. Acknowledgments: Bernholdt David E., Hill Jason J., Leverman

Fujitsu's Technologies Leading to Practical Petascale Computing: K computer, PRIMEHPC FX10 and the Future

Motoi Okuda, Technical Computing Solution Unit, Fujitsu Limited, November 16th, 2011. Agenda: Achievements

EARLY EVALUATION OF THE CRAY XC40 SYSTEM THETA

Sudheer Chunduri, Scott Parker, Kevin Harms, Vitali Morozov, Chris Knight, Kalyan Kumaran; Performance Engineering Group, Argonne Leadership Computing Facility

Trends of Network Topology on Supercomputers. Michihiro Koibuchi National Institute of Informatics, Japan 2018/11/27

Michihiro Koibuchi, National Institute of Informatics, Japan, 2018/11/27. From Graph Golf to real interconnection networks: Case 1, on-chip networks; Case 2, supercomputer

Technologies for Information and Health

Energy, Defence and Global Security, Technologies for Information and Health. Atomic Energy Commission. HPC in France from a global perspective. Pierre LECA, Head of «Simulation and Information Sciences Dpt.»

HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK STEVE OBERLIN CTO, TESLA ACCELERATED COMPUTING NVIDIA

Steve Oberlin, CTO, Tesla Accelerated Computing, NVIDIA. State of the art 2012: 18,688 Tesla K20X GPUs, 27 PetaFLOPS. Flagship scientific applications

TOP500 Lists and Industrial/Commercial Applications

Hans Werner Meuer, Universität Mannheim. BMBF round-table discussion on non-numerical applications in high-performance computing, Berlin, 16./17.

HPCS HPCchallenge Benchmark Suite

David Koester, Ph.D., Jack Dongarra (UTK), Piotr Luszczek, 28 September 2004. Outline: Brief DARPA HPCS overview; architecture/application characterization; preliminary

The TOP500 Project of the Universities Mannheim and Tennessee

Hans Werner Meuer, University of Mannheim. EURO-PAR 2000, 29 August - 01 September 2000, Munich/Germany. Outline: TOP500 approach; HPC-Market as

CHAO YANG. Early Experience on Optimizations of Application Codes on the Sunway TaihuLight Supercomputer

Dr. Chao Yang is a full professor at the Laboratory of Parallel Software and Computational Sciences, Institute of Software, Chinese Academy of Sciences. His research interests include numerical

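The Mixed Precision Methods entry above describes the general idea behind mixed-precision computing: run the expensive part of a computation in the lowest precision that still meets the accuracy target, then reformulate the problem to find a correction. As a minimal, illustrative sketch (not taken from any of the listed documents), the following Python/NumPy fragment shows one common instance of this idea, mixed-precision iterative refinement for a dense linear system: the O(n^3) LU factorization runs in float32, while the cheap residual and update steps run in float64. The function name, matrix size, tolerance, and iteration limit are illustrative assumptions.

import numpy as np
from scipy.linalg import lu_factor, lu_solve

def mixed_precision_solve(A, b, tol=1e-10, max_iter=10):
    # Sketch of mixed-precision iterative refinement for Ax = b:
    # factor once in low precision, refine the solution in high precision.

    # Low-precision LU factorization (the costly O(n^3) step).
    lu, piv = lu_factor(A.astype(np.float32))

    # Initial low-precision solve, promoted to float64 for refinement.
    x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)

    for _ in range(max_iter):
        # High-precision residual (cheap O(n^2) step).
        r = b - A @ x
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        # Reuse the low-precision factors to compute a correction.
        d = lu_solve((lu, piv), r.astype(np.float32)).astype(np.float64)
        x += d
    return x

# Usage example on a synthetic, well-conditioned system.
rng = np.random.default_rng(0)
n = 1000
A = rng.standard_normal((n, n)) + n * np.eye(n)
b = rng.standard_normal(n)
x = mixed_precision_solve(A, b)
print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))

In this sketch the refinement loop typically recovers close to double-precision accuracy while the dominant cost stays in single precision, which is the runtime, data-movement, and energy argument the entry alludes to.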