Overview of HPC and Energy Saving on KNL for Some Computations
1 Overview of HPC and Energy Saving on KNL for Some Computations
Jack Dongarra, University of Tennessee / Oak Ridge National Laboratory / University of Manchester (1/2/217)
2 Outline
- Overview of High Performance Computing
- With extreme computing, the rules for computing have changed
3 Since 1993 (H. Meuer, H. Simon, E. Strohmaier, & JD):
- Listing of the 500 most powerful computers in the world
- Yardstick: Rmax from LINPACK (Ax=b, dense problem), TPP performance
- Updated twice a year: SC'xy in the States in November, meeting in Germany in June
- All data available from www.top500.org
4 Performance Development of HPC Over the Last 24.5 Years from the TOP500
[Chart: performance (flop/s) vs. time for SUM, N=1, and N=500. Annotations: SUM = 749 PF; N=1 = 93 PF; N=500 = 432 TF; in 1993 N=1 was 59.7 Gflop/s; roughly a 6-8 year lag between N=1 and N=500; "My Laptop: 166 Gflop/s".]
5 Performance Development
[Chart: performance from Gflop/s to Eflop/s vs. time for SUM, N=1, and N=500.] Milestones: Tflops (10^12) achieved by ASCI Red at Sandia NL (today SW26010 & Intel KNL reach ~3 Tflop/s); Pflops (10^15) achieved by RoadRunner at Los Alamos NL (today one cabinet of TaihuLight is ~3 Pflop/s); Eflops (10^18) achieved? China says 2020, U.S. says 2021.
6 State of Supercomputing in 2017
- Pflops (> 10^15 flop/s) computing is fully established, with 138 such systems.
- Three technology architectures, or "swim lanes," are thriving: commodity (e.g. Intel); commodity + accelerator, e.g. GPUs (91 systems); lightweight cores (e.g. IBM BG, Intel's Knights Landing, ShenWei 26010, ARM).
- Interest in supercomputing is now worldwide, and growing in many new markets (~50% of TOP500 computers are in industry).
- Exascale (10^18 flop/s) projects exist in many countries and regions.
- Intel processors have the largest share, 92%, followed by AMD, 1%.
7 Countries' Share
China has 1/3 of the systems, while the number of systems in the US has fallen to the lowest point since the TOP500 list was created. [Treemap: each rectangle represents one of the TOP500 computers; the area of the rectangle reflects its performance.]
8 17 TOP500 Systems in France (Name | Computer | Site | Manufacturer)
- Pangea | SGI ICE X, Xeon E5-2670 / E5-2680v3 12C 2.5GHz, Infiniband FDR | Total Exploration Production | HPE
- Occigen2 | bullx DLC 720, Xeon E5-2690v4 14C 2.6GHz, Infiniband FDR | GENCI-CINES | Bull
- Prolix2 | bullx DLC 720, Xeon E5-2698v4 20C 2.2GHz, Infiniband FDR | Meteo France | Bull
- Beaufix2 | bullx DLC 720, Xeon E5-2698v4 20C 2.2GHz, Infiniband FDR | Meteo France | Bull
- Tera-1000-1 | bullx DLC 720, Xeon E5-2698v3 16C 2.3GHz, Infiniband FDR | Commissariat a l'Energie Atomique (CEA) | Bull
- Sid | bullx DLC 720, Xeon E5-2695v4 18C 2.1GHz, Infiniband FDR | Atos | Bull
- Curie thin nodes | Bullx B510, Xeon 8C 2.7GHz, Infiniband QDR | CEA/TGCC-GENCI | Bull
- Cobalt | bullx DLC 720, Xeon E5-2680v4 14C 2.4GHz, Infiniband EDR | CEA/CCRT | Bull
- Diego | bullx DLC 720, Xeon E5-2698v4 20C 2.2GHz, Infiniband FDR | Atos | Bull
- Turing | BlueGene/Q, Power BQC 16C 1.6GHz, Custom | CNRS/IDRIS-GENCI | IBM
- Tera-100 | Bull bullx super-node S6010/S6030 | CEA | Bull
- Eole | Intel Cluster, Xeon E5-2680v4 14C 2.4GHz, Intel Omni-Path | EDF | Bull
- Zumbrota | BlueGene/Q, Power BQC 16C 1.6GHz, Custom | EDF R&D | IBM
- Occigen2 | bullx DLC 720, Xeon E5-2690v4 14C 2.6GHz, Infiniband FDR | Centre Informatique National (CINES) | Bull
- Sator | NEC HPC1812-Rg-2, Xeon E5-2680v4 14C 2.4GHz, Intel Omni-Path | ONERA | NEC
- HPC4 | HP POD, Cluster Platform BL460c, Intel Xeon E5-2697v2 12C 2.7GHz, Infiniband FDR | Airbus | HPE
- PORTHOS | IBM NeXtScale nx360M5, Xeon E5-2697v3 14C 2.6GHz, Infiniband FDR | EDF R&D | IBM
9 June 2017: The TOP 10 Systems (Rank | Site | Computer | Country)
1 | National Super Computer Center in Wuxi | Sunway TaihuLight, SW26010 (260C) + Custom | China
2 | National Super Computer Center in Guangzhou | Tianhe-2, NUDT, Xeon (12C) + Intel Xeon Phi (57C) + Custom | China
3 | Swiss CSCS | Piz Daint, Cray XC50, Xeon (12C) + Nvidia P100 (56C) + Custom | Switzerland
4 | DOE / OS, Oak Ridge Nat Lab | Titan, Cray XK7, AMD (16C) + Nvidia Kepler GPU (14C) + Custom | USA
5 | DOE / NNSA, Lawrence Livermore Nat Lab | Sequoia, BlueGene/Q (16C) + custom | USA
6 | DOE / OS, Lawrence Berkeley Nat Lab | Cori, Cray XC40, Xeon Phi (68C) + Custom | USA
7 | Joint Center for Advanced HPC | Oakforest-PACS, Fujitsu Primergy CX1640, Xeon Phi (68C) + Omni-Path | Japan
8 | RIKEN Advanced Inst for Comp Sci | K computer, Fujitsu SPARC64 VIIIfx (8C) + Custom | Japan
9 | DOE / OS, Argonne Nat Lab | Mira, BlueGene/Q (16C) + Custom | USA
10 | DOE / NNSA / Los Alamos & Sandia | Trinity, Cray XC40, Xeon (16C) + Custom | USA
Also shown: Internet company | Sugon, Intel (8C) + Nvidia | China. Chart annotations compare TaihuLight with the sum of all EU systems and the sum of all US systems.
10 Next Big Systems in the US
- DOE Oak Ridge Lab and Lawrence Livermore Lab to receive IBM- and Nvidia-based systems: IBM Power9 + Nvidia Volta V100. Installation has started but will not be completed until late; targeted at 5-10 times Titan on apps. With 4,600 nodes, each containing six 7.5-teraflop NVIDIA V100 GPUs and two IBM Power9 CPUs, its aggregate peak performance should be > 200 petaflops. ORNL: maybe fully ready (TOP500 number) by June 2018.
- In 2021, Argonne Lab to receive an Intel-based system: an exascale system based on future Intel parts, not Knights Hill. Aurora 21: a balanced architecture to support three pillars: large-scale simulation (PDEs, traditional HPC); data-intensive applications (science pipelines); deep learning and emerging science AI. Integrated computing, acceleration, and storage; towards a common software stack.
11 Toward Exascale
China plans for exascale by 2020. Three separate developments in HPC, anything but from the US:
- Wuxi: upgrade the ShenWei, O(100) Pflops, all Chinese
- National University for Defense Technology: Tianhe-2A, O(100) Pflops, will be a Chinese ARM processor + accelerator
- Sugon - CAS ICT: x86 based; collaboration with AMD
US DOE - Exascale Computing Program, a 7-year program:
- Initial exascale system based on an advanced architecture, delivered in 2021
- Enable capable exascale systems, based on ECP R&D, delivered in 2022 and deployed in 2023
- Capable exascale: 50X the performance of today's 20 PF systems, 20-30 MW power, 1 perceived fault per week, SW to support a broad range of apps
12 US Department of Energy Exascale Computing Program
ECP has formulated a holistic approach that uses co-design and integration to achieve capable exascale across four areas:
- Application Development: science and mission applications; correctness, visualization, data analysis; applications co-design
- Software Technology: a scalable and productive software stack; programming models, development environment, and runtimes; system software, resource management, threading, scheduling, monitoring, and control; node OS and runtimes; math libraries and frameworks; tools; data management, I/O and file system; workflows; resilience
- Hardware Technology: hardware technology elements; memory and burst buffer; hardware interface
- Exascale Systems: integrated exascale supercomputers
ECP's work encompasses applications, system software, hardware technologies and architectures, and workforce development. (Exascale Computing Project)
14 The Problem with the LINPACK Benchmark
- HPL performance of computer systems is no longer strongly correlated to real application performance, especially for the broad set of HPC applications governed by partial differential equations.
- Designing a system for good HPL performance can actually lead to design choices that are wrong for the real application mix, or add unnecessary components or complexity to the system.
15 Many Problems in Computational Science Involve Solving PDEs: Large Sparse Linear Systems
Given a PDE, e.g. -Δu + c·∇u + γu ≡ Pu = f over some domain (where P denotes the differential operator), discretization (e.g., Galerkin equations) seeks u_h = Σ_{i=1}^n Φ_i x_i such that (P u_h, Φ_j) = (f, Φ_j) for each basis function Φ_j, i.e. Σ_{i=1}^n (P Φ_i, Φ_j) x_i = (f, Φ_j). With a_ji = (P Φ_i, Φ_j) and b_j = (f, Φ_j), this is a sparse linear system A x = b. The basis functions Φ_j usually have local support, leading to local interactions (plus boundary conditions) and hence sparse matrices: a given row has only a handful of nonzeros (6 in the slide's example). Applications: modeling diffusion, fluid flow.
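The locality-implies-sparsity point above can be illustrated with a small sketch (not the slide's exact operator): assembling the hypothetical special case -Δu = f on a 2D grid with the standard 5-point stencil, using SciPy.

```python
# Sketch: a discretized PDE yields a sparse linear system A x = b.
# Illustrative example only: 2D Poisson (-Δu = f) on an n x n interior
# grid with the standard 5-point stencil, so each row has at most 5 nonzeros.
import numpy as np
import scipy.sparse as sp

def poisson_2d(n):
    """Assemble the 5-point-stencil Laplacian on an n x n interior grid."""
    I = sp.identity(n, format="csr")
    # 1D second-difference operator: tridiagonal (-1, 2, -1)
    T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")
    # Kronecker sum combines the x- and y-direction operators into 2D
    return (sp.kron(I, T) + sp.kron(T, I)).tocsr()

A = poisson_2d(8)            # 64 x 64 matrix, but only O(n^2) nonzeros
print(A.nnz, A.shape)        # far fewer entries stored than 64*64
```

Here the matrix is symmetric positive definite and has 5n^2 - 4n nonzeros, exactly the "local interactions" structure the slide describes.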
16 HPCG (hpcg-benchmark.org)
High Performance Conjugate Gradients (HPCG). Solves Ax=b: A large and sparse, b known, x computed. An optimized implementation of PCG contains essential computational and communication patterns that are prevalent in a variety of methods for the discretization and numerical solution of PDEs.
- Synthetic discretized 3D PDE (FEM, FVM, FDM). Sparse matrix: 27 nonzeros/row in the interior, fewer on the boundary. Symmetric positive definite.
- Patterns: dense and sparse computations; dense and sparse collectives; multi-scale execution of kernels via a (truncated) MG V-cycle; data-driven parallelism (unstructured sparse triangular solves).
- Strong verification (via spectral properties of PCG).
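For reference, the kernel HPCG times is conjugate gradients. A minimal unpreconditioned sketch (not the benchmark's optimized implementation, which adds a multigrid preconditioner and distributed data structures):

```python
# Minimal conjugate gradient for Ax=b with A symmetric positive definite.
# Illustrative sketch only; HPCG runs a preconditioned, distributed version.
import numpy as np

def cg(A, b, tol=1e-10, maxiter=1000):
    x = np.zeros_like(b)
    r = b - A @ x                      # initial residual
    p = r.copy()                       # initial search direction
    rs = r @ r
    for _ in range(maxiter):
        Ap = A @ p
        alpha = rs / (p @ Ap)          # optimal step length along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p      # next A-conjugate direction
        rs = rs_new
    return x

# Small SPD test system
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = cg(A, b)
```

In exact arithmetic CG converges in at most n steps; in HPCG the dominant costs are the sparse matrix-vector product (A @ p) and the dot products, which is why the benchmark stresses memory bandwidth and collectives rather than dense flops.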
17 HPL vs. HPCG: Bookends (hpcg-benchmark.org)
Some see HPL and HPCG as bookends of a performance spectrum. Application teams know where their codes lie on the spectrum, and can gauge performance on a system using both the HPL and HPCG numbers.
18 HPCG Results, June 2017, Top 10 (Rank | Site | Computer | HPCG as % of peak)
1 | RIKEN Advanced Institute for Computational Science, Japan | K computer, SPARC64 VIIIfx 2.0GHz, Tofu interconnect | 5.3%
2 | NSCC / Guangzhou, China | Tianhe-2, NUDT, Xeon 12C 2.2GHz + Intel Xeon Phi 57C + Custom | 1.1%
3 | National Supercomputing Center in Wuxi, China | Sunway TaihuLight, Sunway MPP, SW26010 260C 1.45GHz, Sunway, NRCPC | 0.4%
4 | Swiss National Supercomputing Centre (CSCS), Switzerland | Piz Daint, Cray XC50, Xeon E5-2690v3 12C 2.6GHz, Aries interconnect, NVIDIA P100, Cray | 1.9%
5 | Joint Center for Advanced HPC, Japan | Oakforest-PACS, PRIMERGY CX600 M1, Intel Xeon Phi 68C 1.4GHz, Intel Omni-Path, Fujitsu | 1.5%
6 | DOE/SC/LBNL/NERSC, USA | Cori, XC40, Intel Xeon Phi 68C 1.4GHz, Cray Aries, Cray | 1.3%
7 | DOE/NNSA/LLNL, USA | Sequoia, IBM BlueGene/Q, PowerPC A2 16C 1.6GHz, 5D Torus, IBM | 1.6%
8 | DOE/SC/Oak Ridge National Lab, USA | Titan, Cray XK7, Opteron 16C 2.2GHz, Cray Gemini interconnect, NVIDIA K20x | 1.2%
9 | DOE/NNSA/LANL/SNL, USA | Trinity, Cray XC40, Intel E5-2698v3, Aries custom, Cray | 1.6%
10 | NASA / Mountain View, USA | Pleiades, SGI ICE X, Intel E5-2680, E5-2680v2, E5-2680v3, E5-2680v4, Infiniband FDR, HPE | 2.5%
19 Bookends: Peak and HPL Only
[Chart: Rpeak and Rmax, in Pflop/s, for the top systems.]
20 Bookends: Peak, HPL, and HPCG
[Chart: Rpeak, Rmax, and HPCG, in Pflop/s, for the top systems.]
21 Machine Learning in Computational Science
Many fields are beginning to adopt machine learning to augment modeling and simulation methods: climate, biology, drug design, epidemiology, materials, cosmology, high-energy physics.
22 IEEE 754 Half Precision (16-bit) Floating Point Standard
A lot of interest, driven by machine learning.
23 Mixed Precision
- Extended to sparse direct methods
- Iterative methods with inner and outer iterations
- Today there are many precisions to deal with; note the limited number range of half precision (16-bit floating point)
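The inner/outer-iteration idea can be sketched as iterative refinement: factor and solve in low precision (fast), then accumulate the residual and correction in high precision. A minimal sketch, using float32 as the "low" precision and NumPy's dense solver as a stand-in for an LU factorization:

```python
# Sketch of mixed-precision iterative refinement.
# Inner solves run in float32; residual and correction are kept in float64.
import numpy as np

def solve_mixed(A, b, iters=5):
    A32 = A.astype(np.float32)
    # Low-precision inner solve (stand-in for a reused fp32 LU factorization)
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x                                  # residual in fp64
        d = np.linalg.solve(A32, r.astype(np.float32)) # correction in fp32
        x += d.astype(np.float64)                      # update in fp64
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50)) + 50 * np.eye(50)    # well conditioned
b = rng.standard_normal(50)
x = solve_mixed(A, b)
```

For well-conditioned problems the refined solution reaches full float64 accuracy while the expensive factorization work is done at the faster, lower precision, which is the economics behind the half-precision interest above.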
24 Deep Learning Needs Batched Operations
- Matrix multiply is the time-consuming part: convolution layers and fully connected layers require matrix multiply.
- There are many GEMMs of small matrices, perfectly parallel; can get by with 16-bit floating point.
[Diagram: a small network with inputs x1-x3, weights w11-w23, and outputs y1-y2; the convolution step (in this case a 3x3 GEMM) followed by a fully connected classification layer.]
25 Standard for Batched Computations
- Define a standard API for batched BLAS and LAPACK in collaboration with Intel/Nvidia/ECP/other users
- Fixed size: most of BLAS and LAPACK released
- Variable size: most of BLAS released; variable-size LAPACK in the branch
- Native GPU algorithms (Cholesky, LU, QR) in the branch
- Tiled algorithms using batched routines on tile or LAPACK data layout in the branch
- Framework for Deep Neural Network kernels: CPU, KNL and GPU routines
- FP16 routines in progress
26 Batched Computations: Non-batched computation
Loop over the matrices one by one and compute using multithreading (note that, since the matrices are of small sizes, there is not enough work for all the cores). So we expect low performance, and thread contention might also affect the performance.

  for (i = 0; i < batchcount; i++)
      dgemm( )

There is not enough work to fulfill all the cores; a low percentage of the resources is used.
27 Batched Computations: Batched computation
Distribute all the matrices over the available resources by assigning a matrix to each group of cores/TBs to operate on it independently:
- For very small matrices, assign a matrix per core (CPU) or per TB (GPU)
- For medium sizes, a matrix goes to a team of cores (CPU) or many TBs (GPU)
- For large sizes, switch to the classical multithreaded one-matrix-per-round approach

  Batched_dgemm( )

A task manager/dispatcher, based on the kernel design, decides the number of TBs or threads (GPU/CPU) and works through the Nvidia/OpenMP scheduler. A high percentage of the resources is used.
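The looped-vs-batched contrast can be sketched in NumPy, where `matmul` on a stacked 3-D array plays the role of a batched_dgemm over many independent small matrices (a sketch of the interface idea, not of MAGMA's kernels):

```python
# Looped ("non-batched") vs batched multiplication of many tiny matrices.
import numpy as np

rng = np.random.default_rng(1)
batch, n = 1000, 8                       # many independent 8x8 products
A = rng.standard_normal((batch, n, n))
B = rng.standard_normal((batch, n, n))

# Non-batched: one small gemm per call; little work per call, so a
# multithreaded BLAS cannot keep all cores busy
C_loop = np.stack([A[i] @ B[i] for i in range(batch)])

# Batched: a single call over the whole stack; the library is free to
# spread the independent products across all cores / thread blocks
C_batch = np.matmul(A, B)
```

Both produce the same result; the batched form simply exposes all the parallelism in one call, which is what the batched BLAS API standardizes.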
30 Batched Computations: How do we design and optimize
[Chart: performance across small, medium, and large sizes; batched is 2~3x faster than non-batched at small sizes, converging to 1X at the point where we switch to non-batched.]
31 Batched Computations: How do we design and optimize
[Chart: up to 3X gain at small sizes, about 1.2x at medium sizes, with a switch to non-batched at large sizes.]
32 Nvidia V100 GPU
[Chart, Gflop/s vs. size, for batch dgemm (BLAS 3) vs. standard dgemm (BLAS 3): ~1.4X gain at small sizes, a switch to non-batched at large sizes; for a batch of ~1000 matrices, performance reaches 97% of peak.]
33 Intel Knights Landing: FLAT mode
- 68 cores, 1.3 GHz, peak DP = 2662 Gflop/s
- KNL is in FLAT mode (the entire MCDRAM is used as addressable memory); memory allocations are treated similarly to DDR4 memory allocations
- KNL Thermal Design Power (TDP) is 215W (for main SKUs)
34 Level 1, 2 and 3 BLAS
68-core Intel Xeon Phi KNL, 1.3 GHz; the theoretical peak double precision is 2662 Gflop/s. Compiled with icc and using Intel MKL 2017b.
[Chart: Level 3 BLAS reaches ~2000 Gflop/s, roughly 35x the Level 2 and Level 1 rates of 85.3 and 35.1 Gflop/s.]
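The huge gap between the BLAS levels follows from arithmetic intensity (flops per byte moved), since Levels 1 and 2 are memory bound. A back-of-the-envelope sketch for double precision (8 bytes per value), counting the dominant data traffic only:

```python
# Arithmetic intensity (flop/byte) of representative BLAS kernels, double
# precision. Rough counts: only the dominant memory traffic is included.
def intensity(flops, bytes_moved):
    return flops / bytes_moved

n = 10_000
# daxpy y = a*x + y: 2n flops; read x and y, write y -> 3n values
daxpy = intensity(2 * n, 3 * n * 8)
# dgemv y = A*x + y: 2n^2 flops; traffic dominated by reading the matrix
dgemv = intensity(2 * n * n, (n * n + 3 * n) * 8)
# dgemm C = A*B + C: 2n^3 flops on O(n^2) data -> intensity grows with n
dgemm = intensity(2 * n**3, 4 * n * n * 8)

print(f"daxpy {daxpy:.3f}, dgemv {dgemv:.3f}, dgemm {dgemm:.1f} flop/byte")
```

Level 1 and 2 kernels do O(1) flops per byte, so their rate is set by memory bandwidth; only Level 3, with intensity growing like n, can approach the 2662 Gflop/s compute peak.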
35 PAPI for power-aware computing
- We use PAPI's latest powercap component for measurement, control, and performance analysis.
- PAPI power components in the past supported only reading power information; the new component exposes RAPL functionality to allow users to read and write power limits.
- Study numerical building blocks of varying computational intensity; use the PAPI powercap component to detect power optimization opportunities.
- Objective: cap the power on the architecture to reduce power usage while keeping the execution time constant. Energy savings!
37 Level 3 BLAS DGEMM on KNL using MCDRAM
68-core KNL, peak DP = 2662 Gflop/s. Bandwidth: MCDRAM ~425 GB/s, DDR4 ~90 GB/s.
[Chart: average power (Watts) over time for the package and DDR4, with the max power limit set; also performance in Gflop/s, Gflops/Watt, and Joules, from idle through DGEMM size 18K at the default power limit.]
DGEMM is run repeatedly for a fixed matrix size of 18K per step, and at each step a new power limit is set in decreasing fashion: starting from the default, then 220 Watts, 200 Watts, down to 120 Watts in steps of 10/20.
38 Level 3 BLAS DGEMM on KNL using MCDRAM
68-core KNL, peak DP = 2662 Gflop/s. Bandwidth: MCDRAM ~425 GB/s, DDR4 ~90 GB/s.
[Chart as on the previous slide, annotated with the powercap calls:]

  PAPI_read(EventSet, values);   /* long long values[2] */
  values[0] = ... ;              /* set short-term power limit (Watts) */
  values[1] = 258;               /* set long-term power limit to 258 Watts */
  PAPI_write(EventSet, values);

DGEMM is run repeatedly for a fixed matrix size of 18K per step, and at each step a new power limit is set in decreasing fashion from the default down to 120 Watts.
39 Level 3 BLAS DGEMM on KNL using MCDRAM
[Chart: the same measurement, now showing DGEMM size 18K at the default power limit and at a 220 W limit.]
40 Level 3 BLAS DGEMM on KNL using MCDRAM
[Chart: the full sweep for DGEMM size 18K, with power limits set at the default, 220 W, 200 W, 180 W, 160 W, 150 W, and so on down to 120 W.]
41 Level 3 BLAS DGEMM on KNL using MCDRAM
[Chart: the full sweep, annotated with the core frequency (MHz), the optimum-Flops point, and the optimum-Flops/Watt point.]
When the power cap kicks in, the frequency is decreased, which confirms that capping takes effect through DVFS.
42 Level 3 BLAS DGEMM on KNL: MCDRAM/DDR4
[Charts: the power-limit sweep for DGEMM out of MCDRAM and out of DDR4, each annotated with the optimum-Flops and optimum-Flops/Watt points.]
Lesson for DGEMM-type operations (compute intensive): capping can reduce performance, but it can sometimes hold the energy efficiency (Gflops/Watt).
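The quantities plotted on these slides reduce to a little arithmetic on the measured power and time. A small helper (illustrative numbers only, not measurements from the slides):

```python
# Helpers for the metrics used in these power-capping experiments:
# energy (joules), efficiency (Gflop/s per watt), and percent energy saving.
def energy_j(avg_power_w, time_s):
    return avg_power_w * time_s

def gflops_per_watt(gflops, avg_power_w):
    return gflops / avg_power_w

def energy_saving_pct(baseline_j, capped_j):
    return 100.0 * (baseline_j - capped_j) / baseline_j

# Made-up example: capping cuts power more than it stretches the run time,
# so total energy drops even though the run is slightly slower.
base   = energy_j(215.0, 10.0)   # 2150 J at the default (TDP) limit
capped = energy_j(170.0, 10.5)   # 1785 J under a lower cap
print(round(energy_saving_pct(base, capped), 1))
```

This is the whole trade-off in miniature: energy is power integrated over time, so a cap saves energy whenever the fractional power reduction exceeds the fractional slowdown.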
43 Level 2 BLAS DGEMV on KNL: MCDRAM/DDR4
[Charts: the power-limit sweep for DGEMV out of MCDRAM and out of DDR4, including achieved bandwidth (GB/s), frequency, Gflops/Watt, and Joules.]
Lesson for DGEMV-type operations (memory bound):
- For MCDRAM, capping at 190 Watts results in power reduction without any loss of performance (energy saving ~20%).
- For DDR4, capping at 120 Watts improves energy efficiency by ~17%.
- Overall, capping 40 Watts below the power observed at the default setting provides about 17-20% energy saving without any significant reduction in time to solution.
DGEMV is run repeatedly for a fixed matrix size of 18K per step, and at each step a new power limit is set in decreasing fashion from the default down to 120 Watts.
44 Level 1 BLAS DAXPY on KNL: MCDRAM/DDR4
[Charts: the power-limit sweep for DAXPY out of MCDRAM and out of DDR4, including achieved bandwidth (GB/s), frequency, Gflops/Watt, and Joules.]
Lesson for DAXPY-type operations (memory bound):
- For MCDRAM, capping at 180 Watts results in power reduction without any loss of performance (energy saving ~16%).
- For DDR4, capping at 130 Watts improves energy efficiency by ~25%.
- Overall, capping 40 Watts below the power observed at the default setting provides about 16-25% energy saving without any significant reduction in time to solution.
DAXPY is run repeatedly for a fixed vector size of 1E8 per step, and at each step a new power limit is set in decreasing fashion from the default down to 120 Watts.
45 Intel Knights Landing: Level 1, 2 and 3 BLAS
68-core Intel Xeon Phi KNL, 1.3 GHz, peak DP = 2662 Gflop/s.
[Summary table: rate (Gflop/s) and power efficiency (Gflop/s per Watt) for Level 1, 2 and 3 BLAS, out of MCDRAM and out of DRAM.]
46 Power-awareness in applications
HPCG benchmark on a 3D grid: MCDRAM
[Chart: average power (Watts) vs. time for MCDRAM at 215 W, with total energy in joules.]
Sparse matrix-vector operations, solving Ax=b with the conjugate gradient algorithm, at the basic TDP power limit of 215 W.
47 Power-awareness in applications
HPCG benchmark on a 3D grid: MCDRAM
[Chart: average power vs. time at the 215 W and 200 W limits, with total joules for each.]
Sparse matrix-vector operations, solving Ax=b with the conjugate gradient algorithm. Decreasing the power limit to 200 W does not result in any loss in performance, while we observe a reduction in power consumption.
48 Power-awareness in applications
HPCG benchmark on a 3D grid: MCDRAM
[Chart: average power vs. time at limits of 215, 200, 180, 160, 140, and 120 Watts, with total energy roughly 1522, 1540, 1468, 1486, 1728, and 2039 joules respectively.]
Sparse matrix-vector operations, solving Ax=b with the conjugate gradient algorithm.
- Decreasing the power limit to 200 W does not result in any loss in performance, while we observe a reduction in power consumption.
- Decreasing the power limit by 40 Watts (setting the limit at 160 W) provides energy saving and power rate reduction without any meaningful loss in performance: less than 1% reduction in time to solution.
49 Power-awareness in applications
HPCG benchmark on a 3D grid: DDR4
[Chart: average power vs. time at limits of 215, 200, 180, 160, 140, and 120 Watts, with total energy roughly 3915, 3957, 3940, 3710, 3300, and 3158 joules respectively.]
Sparse matrix-vector operations, solving Ax=b with the conjugate gradient algorithm.
- Decreasing the power limit to 160 W does not result in any loss in performance, while we observe a power rate reduction and energy saving.
- Decreasing the power limit by 40 Watts (limit at 140 W) provides a large energy saving (15%) and power rate reduction, without any loss in performance.
- At 120 Watts, less than 1% reduction in time to solution, while a large energy saving (20%) and power rate reduction are observed.
50 Power-awareness in applications
HPCG benchmark on a 3D grid: MCDRAM & DDR4
[Charts: the MCDRAM and DDR4 power sweeps side by side, with total joules per setting.]
Sparse matrix-vector operations, solving Ax=b with CG. Lesson for HPCG:
- For MCDRAM, decreasing the power limit by 40 Watts from the power observed at default (setting the limit at 160 W) provides energy saving and power rate reduction without any meaningful loss in performance: less than 1% reduction in time to solution.
- For DDR4, similar behavior is observed: decreasing the power limit by 40 Watts from the power observed at default (setting the limit at 120 W) provides a large energy saving (15%-20%) and power rate reduction without any significant reduction in time to solution.
51 Power-awareness in applications
Solving the Helmholtz equation with finite differences, Jacobi iteration, 128x128 grid.
[Charts: MCDRAM and DDR4 power sweeps from 215 W down to 120 W, with total joules per setting (MCDRAM roughly 826, 842, 724, 830, 1660, and 1865 joules).]
Lesson for the Jacobi iteration:
- For MCDRAM, capping at 180 Watts improves power efficiency by ~13% without any loss in time to solution, while also decreasing the power rate requirements by about 20%.
- For DDR4, capping at 140 Watts improves power efficiency by ~20% without any loss in time to solution.
- Overall, capping 30-40 Watts below the power observed at the default limit provides a large energy gain while keeping the same time to solution, similar to the behavior observed for the DGEMV and DAXPY kernels.
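The iteration benchmarked on this slide can be sketched for the Poisson special case of the Helmholtz problem (-Δu = f with zero boundary values) on a small grid; a minimal NumPy version, not the slide's tuned KNL code:

```python
# Jacobi iteration for -Δu = f (Poisson special case of Helmholtz),
# zero Dirichlet boundary, 5-point stencil with grid spacing h.
import numpy as np

def jacobi_poisson(f, h, iters=5000):
    u = np.zeros_like(f)
    for _ in range(iters):
        u_new = u.copy()
        # each interior point becomes the neighbor average plus the source
        u_new[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                                    u[1:-1, :-2] + u[1:-1, 2:] +
                                    h * h * f[1:-1, 1:-1])
        u = u_new
    return u

n, h = 17, 1.0 / 16          # tiny grid so the sketch converges quickly
f = np.ones((n, n))          # constant source term
u = jacobi_poisson(f, h)
```

Each sweep streams the whole grid with only a few flops per point, so like DGEMV and DAXPY it is memory bound, which is why the slide's capping results mirror those kernels.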
52 Power-awareness in applications
Lattice-Boltzmann simulation of CFD (from the SPEC 2006 benchmark).
[Charts: MCDRAM and DDR4 power sweeps from 270 W down to 120 W, with total joules per setting.]
Lesson for Lattice-Boltzmann:
- For MCDRAM, capping at 200 W results in 5% energy saving and power rate reduction without an increase in time to solution.
- For DDR4, capping at 140 Watts improves energy efficiency by ~11% and reduces the power rate without any increase in time to solution.
53 Power-awareness in applications
Monte Carlo neutron transport application (from XSBench).
[Charts: MCDRAM and DDR4 power sweeps with total joules per setting.]
Lesson for Monte Carlo:
- For MCDRAM, capping at 180 W results in energy saving (5%) and power rate reduction without any meaningful increase in time to solution.
- For DDR4, capping at 120 Watts improves power efficiency by ~30% without any loss in performance.
54 Power Aware Tools
Designing a toolset that will automatically monitor memory traffic:
- Track the performance and data movement.
- Adjust the power to keep the performance roughly the same while decreasing energy consumption.
- Better power efficiency without significant loss of performance.
- All in an automatic fashion, without user involvement.
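One way to sketch the automatic-adjustment idea: lower the cap in steps and keep the lowest setting whose measured performance stays within a tolerance of the uncapped baseline. Everything here is hypothetical scaffolding: `measure` is a user-supplied callback (a made-up model below), where a real tool would set a RAPL limit and time a kernel.

```python
# Sketch of an automatic power-cap tuner: accept the lowest cap whose
# measured performance stays within `tol` of the default-cap baseline.
def tune_power_cap(measure, caps, tol=0.01):
    baseline = measure(caps[0])             # performance at the default cap
    best = caps[0]
    for cap in caps[1:]:                    # caps given in decreasing order
        perf = measure(cap)
        if perf >= (1.0 - tol) * baseline:  # within 1% of baseline: accept
            best = cap
        else:
            break                           # further capping costs performance
    return best

# Synthetic stand-in for a measurement: performance is flat until the
# cap drops below the power the kernel actually draws (here 160 W).
model = lambda cap: min(1.0, cap / 160.0)
caps = [215, 200, 180, 160, 140, 120]
print(tune_power_cap(model, caps))
```

For a memory-bound kernel like DGEMV or Jacobi, the flat region is wide and the tuner settles well below TDP, reproducing the 17-25% savings reported on the earlier slides.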
55 Critical Issues and Challenges at Peta & Exascale for Algorithm and Software Design
- Synchronization-reducing algorithms: break the fork-join model.
- Communication-reducing algorithms: use methods which have a lower bound on communication.
- Mixed precision methods: 2x speed of ops and 2x speed for data movement.
- Autotuning: today's machines are too complicated; build smarts into software to adapt to the hardware.
- Fault resilient algorithms: implement algorithms that can recover from failures/bit flips.
- Reproducibility of results: today we can't guarantee this without a penalty. We understand the issues, but some of our colleagues have a hard time with this.
56 Collaborators and Support
MAGMA team, PLASMA team. Collaborating partners: University of Tennessee, Knoxville; University of California, Berkeley; University of Colorado, Denver; University of Manchester.
Parallel Computing Accelerators John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist Purpose of this talk This is the 50,000 ft. view of the parallel computing landscape. We want
More informationConfessions of an Accidental Benchmarker
Confessions of an Accidental Benchmarker http://bit.ly/hpcg-benchmark 1 Appendix B of the Linpack Users Guide Designed to help users extrapolate execution Linpack software package First benchmark report
More informationTop500
Top500 www.top500.org Salvatore Orlando (from a presentation by J. Dongarra, and top500 website) 1 2 MPPs Performance on massively parallel machines Larger problem sizes, i.e. sizes that make sense Performance
More informationHigh Performance Linear Algebra
High Performance Linear Algebra Hatem Ltaief Senior Research Scientist Extreme Computing Research Center King Abdullah University of Science and Technology 4th International Workshop on Real-Time Control
More informationPower Profiling of Cholesky and QR Factorizations on Distributed Memory Systems
International Conference on Energy-Aware High Performance Computing Hamburg, Germany Bosilca, Ltaief, Dongarra (KAUST, UTK) Power Sept Profiling, DLA Algorithms ENAHPC / 6 Power Profiling of Cholesky and
More informationTOP500 List s Twice-Yearly Snapshots of World s Fastest Supercomputers Develop Into Big Picture of Changing Technology
TOP500 List s Twice-Yearly Snapshots of World s Fastest Supercomputers Develop Into Big Picture of Changing Technology BY ERICH STROHMAIER COMPUTER SCIENTIST, FUTURE TECHNOLOGIES GROUP, LAWRENCE BERKELEY
More informationAn Overview of High Performance Computing and Challenges for the Future
An Overview of High Performance Computing and Challenges for the Future Jack Dongarra University of Tennessee Oak Ridge National Laboratory University of Manchester 6/15/2009 1 H. Meuer, H. Simon, E. Strohmaier,
More informationECE 574 Cluster Computing Lecture 2
ECE 574 Cluster Computing Lecture 2 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 24 January 2019 Announcements Put your name on HW#1 before turning in! 1 Top500 List November
More informationOverview. CS 472 Concurrent & Parallel Programming University of Evansville
Overview CS 472 Concurrent & Parallel Programming University of Evansville Selection of slides from CIS 410/510 Introduction to Parallel Computing Department of Computer and Information Science, University
More informationHPC as a Driver for Computing Technology and Education
HPC as a Driver for Computing Technology and Education Tarek El-Ghazawi The George Washington University Washington D.C., USA NOW- July 2015: The TOP 10 Systems Rank Site Computer Cores Rmax [Pflops] %
More informationSupercomputers in ITC/U.Tokyo 2 big systems, 6 yr. cycle
Supercomputers in ITC/U.Tokyo 2 big systems, 6 yr. cycle FY 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 Hitachi SR11K/J2 IBM Power 5+ 18.8TFLOPS, 16.4TB Hitachi HA8000 (T2K) AMD Opteron 140TFLOPS, 31.3TB
More informationMAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures
MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Stan Tomov Innovative Computing Laboratory University of Tennessee, Knoxville OLCF Seminar Series, ORNL June 16, 2010
More informationPreparing GPU-Accelerated Applications for the Summit Supercomputer
Preparing GPU-Accelerated Applications for the Summit Supercomputer Fernanda Foertter HPC User Assistance Group Training Lead foertterfs@ornl.gov This research used resources of the Oak Ridge Leadership
More informationHigh Performance Computing in Europe and USA: A Comparison
High Performance Computing in Europe and USA: A Comparison Hans Werner Meuer University of Mannheim and Prometeus GmbH 2nd European Stochastic Experts Forum Baden-Baden, June 28-29, 2001 Outlook Introduction
More informationInfiniBand Strengthens Leadership as the Interconnect Of Choice By Providing Best Return on Investment. TOP500 Supercomputers, June 2014
InfiniBand Strengthens Leadership as the Interconnect Of Choice By Providing Best Return on Investment TOP500 Supercomputers, June 2014 TOP500 Performance Trends 38% CAGR 78% CAGR Explosive high-performance
More informationChapter 5 Supercomputers
Chapter 5 Supercomputers Part I. Preliminaries Chapter 1. What Is Parallel Computing? Chapter 2. Parallel Hardware Chapter 3. Parallel Software Chapter 4. Parallel Applications Chapter 5. Supercomputers
More informationParallel Computing & Accelerators. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist
Parallel Computing Accelerators John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist Purpose of this talk This is the 50,000 ft. view of the parallel computing landscape. We want
More informationTrends in HPC (hardware complexity and software challenges)
Trends in HPC (hardware complexity and software challenges) Mike Giles Oxford e-research Centre Mathematical Institute MIT seminar March 13th, 2013 Mike Giles (Oxford) HPC Trends March 13th, 2013 1 / 18
More informationDistributed Dense Linear Algebra on Heterogeneous Architectures. George Bosilca
Distributed Dense Linear Algebra on Heterogeneous Architectures George Bosilca bosilca@eecs.utk.edu Centraro, Italy June 2010 Factors that Necessitate to Redesign of Our Software» Steepness of the ascent
More informationInvestigating power capping toward energy-efficient scientific applications
Received: 2 October 217 Revised: 15 December 217 Accepted: 19 February 218 DOI: 1.12/cpe.4485 SPECIAL ISSUE PAPER Investigating power capping toward energy-efficient scientific applications Azzam Haidar
More informationJack Dongarra University of Tennessee Oak Ridge National Laboratory University of Manchester
Jack Dongarra University of Tennessee Oak Ridge National Laboratory University of Manchester 12/24/09 1 Take a look at high performance computing What s driving HPC Future Trends 2 Traditional scientific
More informationA Standard for Batching BLAS Operations
A Standard for Batching BLAS Operations Jack Dongarra University of Tennessee Oak Ridge National Laboratory University of Manchester 5/8/16 1 API for Batching BLAS Operations We are proposing, as a community
More informationMAGMA. Matrix Algebra on GPU and Multicore Architectures
MAGMA Matrix Algebra on GPU and Multicore Architectures Innovative Computing Laboratory Electrical Engineering and Computer Science University of Tennessee Piotr Luszczek (presenter) web.eecs.utk.edu/~luszczek/conf/
More informationCS 5803 Introduction to High Performance Computer Architecture: Performance Metrics
CS 5803 Introduction to High Performance Computer Architecture: Performance Metrics A.R. Hurson 323 Computer Science Building, Missouri S&T hurson@mst.edu 1 Instructor: Ali R. Hurson 323 CS Building hurson@mst.edu
More informationWhat have we learned from the TOP500 lists?
What have we learned from the TOP500 lists? Hans Werner Meuer University of Mannheim and Prometeus GmbH Sun HPC Consortium Meeting Heidelberg, Germany June 19-20, 2001 Outlook TOP500 Approach Snapshots
More informationIntel Xeon Phi архитектура, модели программирования, оптимизация.
Нижний Новгород, 2017 Intel Xeon Phi архитектура, модели программирования, оптимизация. Дмитрий Прохоров, Дмитрий Рябцев, Intel Agenda What and Why Intel Xeon Phi Top 500 insights, roadmap, architecture
More informationTitan - Early Experience with the Titan System at Oak Ridge National Laboratory
Office of Science Titan - Early Experience with the Titan System at Oak Ridge National Laboratory Buddy Bland Project Director Oak Ridge Leadership Computing Facility November 13, 2012 ORNL s Titan Hybrid
More informationChapter 1. Introduction
Chapter 1 Introduction Why High Performance Computing? Quote: It is hard to understand an ocean because it is too big. It is hard to understand a molecule because it is too small. It is hard to understand
More informationNERSC Site Update. National Energy Research Scientific Computing Center Lawrence Berkeley National Laboratory. Richard Gerber
NERSC Site Update National Energy Research Scientific Computing Center Lawrence Berkeley National Laboratory Richard Gerber NERSC Senior Science Advisor High Performance Computing Department Head Cori
More informationrepresent parallel computers, so distributed systems such as Does not consider storage or I/O issues
Top500 Supercomputer list represent parallel computers, so distributed systems such as SETI@Home are not considered Does not consider storage or I/O issues Both custom designed machines and commodity machines
More informationHPC Technology Update Challenges or Chances?
HPC Technology Update Challenges or Chances? Swiss Distributed Computing Day Thomas Schoenemeyer, Technology Integration, CSCS 1 Move in Feb-April 2012 1500m2 16 MW Lake-water cooling PUE 1.2 New Datacenter
More informationShort Talk: System abstractions to facilitate data movement in supercomputers with deep memory and interconnect hierarchy
Short Talk: System abstractions to facilitate data movement in supercomputers with deep memory and interconnect hierarchy François Tessier, Venkatram Vishwanath Argonne National Laboratory, USA July 19,
More informationGOING ARM A CODE PERSPECTIVE
GOING ARM A CODE PERSPECTIVE ISC18 Guillaume Colin de Verdière JUNE 2018 GCdV PAGE 1 CEA, DAM, DIF, F-91297 Arpajon, France June 2018 A history of disruptions All dates are installation dates of the machines
More informationThe Future of High- Performance Computing
Lecture 26: The Future of High- Performance Computing Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2017 Comparing Two Large-Scale Systems Oakridge Titan Google Data Center Monolithic
More informationIntroduction to Parallel Programming for Multicore/Manycore Clusters
duction to Parallel Programming for Multi/Many Clusters General duction Invitation to Supercomputing Kengo Nakajima Information Technology Center The University of Tokyo Takahiro Katagiri Information Technology
More information2018/9/25 (1), (I) 1 (1), (I)
2018/9/25 (1), (I) 1 (1), (I) 2018/9/25 (1), (I) 2 1. 2. 3. 4. 5. 2018/9/25 (1), (I) 3 1. 2. MPI 3. D) http://www.compsci-alliance.jp http://www.compsci-alliance.jp// 2018/9/25 (1), (I) 4 2018/9/25 (1),
More informationTowards Exascale Computing with the Atmospheric Model NUMA
Towards Exascale Computing with the Atmospheric Model NUMA Andreas Müller, Daniel S. Abdi, Michal Kopera, Lucas Wilcox, Francis X. Giraldo Department of Applied Mathematics Naval Postgraduate School, Monterey
More informationOverview of Reedbush-U How to Login
Overview of Reedbush-U How to Login Information Technology Center The University of Tokyo http://www.cc.u-tokyo.ac.jp/ Supercomputers in ITC/U.Tokyo 2 big systems, 6 yr. cycle FY 11 12 13 14 15 16 17 18
More informationDense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends
Dense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen pauldj@aices.rwth-aachen.de ComplexHPC Spring School 2013 Heterogeneous computing - Impact
More informationIHK/McKernel: A Lightweight Multi-kernel Operating System for Extreme-Scale Supercomputing
: A Lightweight Multi-kernel Operating System for Extreme-Scale Supercomputing Balazs Gerofi Exascale System Software Team, RIKEN Center for Computational Science 218/Nov/15 SC 18 Intel Extreme Computing
More informationMAGMA: a New Generation
1.3 MAGMA: a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Jack Dongarra T. Dong, M. Gates, A. Haidar, S. Tomov, and I. Yamazaki University of Tennessee, Knoxville Release
More informationHow to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić
How to perform HPL on CPU&GPU clusters Dr.sc. Draško Tomić email: drasko.tomic@hp.com Forecasting is not so easy, HPL benchmarking could be even more difficult Agenda TOP500 GPU trends Some basics about
More informationFujitsu s Approach to Application Centric Petascale Computing
Fujitsu s Approach to Application Centric Petascale Computing 2 nd Nov. 2010 Motoi Okuda Fujitsu Ltd. Agenda Japanese Next-Generation Supercomputer, K Computer Project Overview Design Targets System Overview
More informationJack Dongarra University of Tennessee Oak Ridge National Laboratory University of Manchester
Jack Dongarra University of Tennessee Oak Ridge National Laboratory University of Manchester 12/3/09 1 ! Take a look at high performance computing! What s driving HPC! Issues with power consumption! Future
More informationPresentation of the 16th List
Presentation of the 16th List Hans- Werner Meuer, University of Mannheim Erich Strohmaier, University of Tennessee Jack J. Dongarra, University of Tennesse Horst D. Simon, NERSC/LBNL SC2000, Dallas, TX,
More informationOverview. Idea: Reduce CPU clock frequency This idea is well suited specifically for visualization
Exploring Tradeoffs Between Power and Performance for a Scientific Visualization Algorithm Stephanie Labasan & Matt Larsen (University of Oregon), Hank Childs (Lawrence Berkeley National Laboratory) 26
More informationHPC Saudi Jeffrey A. Nichols Associate Laboratory Director Computing and Computational Sciences. Presented to: March 14, 2017
Creating an Exascale Ecosystem for Science Presented to: HPC Saudi 2017 Jeffrey A. Nichols Associate Laboratory Director Computing and Computational Sciences March 14, 2017 ORNL is managed by UT-Battelle
More informationAim High. Intel Technical Update Teratec 07 Symposium. June 20, Stephen R. Wheat, Ph.D. Director, HPC Digital Enterprise Group
Aim High Intel Technical Update Teratec 07 Symposium June 20, 2007 Stephen R. Wheat, Ph.D. Director, HPC Digital Enterprise Group Risk Factors Today s s presentations contain forward-looking statements.
More informationHPC Algorithms and Applications
HPC Algorithms and Applications Intro Michael Bader Winter 2015/2016 Intro, Winter 2015/2016 1 Part I Scientific Computing and Numerical Simulation Intro, Winter 2015/2016 2 The Simulation Pipeline phenomenon,
More informationToward portable I/O performance by leveraging system abstractions of deep memory and interconnect hierarchies
Toward portable I/O performance by leveraging system abstractions of deep memory and interconnect hierarchies François Tessier, Venkatram Vishwanath, Paul Gressier Argonne National Laboratory, USA Wednesday
More informationPerformance and Energy Usage of Workloads on KNL and Haswell Architectures
Performance and Energy Usage of Workloads on KNL and Haswell Architectures Tyler Allen 1 Christopher Daley 2 Doug Doerfler 2 Brian Austin 2 Nicholas Wright 2 1 Clemson University 2 National Energy Research
More informationOverview. High Performance Computing - History of the Supercomputer. Modern Definitions (II)
Overview High Performance Computing - History of the Supercomputer Dr M. Probert Autumn Term 2017 Early systems with proprietary components, operating systems and tools Development of vector computing
More informationHybrid Architectures Why Should I Bother?
Hybrid Architectures Why Should I Bother? CSCS-FoMICS-USI Summer School on Computer Simulations in Science and Engineering Michael Bader July 8 19, 2013 Computer Simulations in Science and Engineering,
More informationCUDA Accelerated Linpack on Clusters. E. Phillips, NVIDIA Corporation
CUDA Accelerated Linpack on Clusters E. Phillips, NVIDIA Corporation Outline Linpack benchmark CUDA Acceleration Strategy Fermi DGEMM Optimization / Performance Linpack Results Conclusions LINPACK Benchmark
More informationHigh Performance Computing in Europe and USA: A Comparison
High Performance Computing in Europe and USA: A Comparison Erich Strohmaier 1 and Hans W. Meuer 2 1 NERSC, Lawrence Berkeley National Laboratory, USA 2 University of Mannheim, Germany 1 Introduction In
More informationPerformance of deal.ii on a node
Performance of deal.ii on a node Bruno Turcksin Texas A&M University, Dept. of Mathematics Bruno Turcksin Deal.II on a node 1/37 Outline 1 Introduction 2 Architecture 3 Paralution 4 Other Libraries 5 Conclusions
More information20 Jahre TOP500 mit einem Ausblick auf neuere Entwicklungen
20 Jahre TOP500 mit einem Ausblick auf neuere Entwicklungen Hans Meuer Prometeus GmbH & Universität Mannheim hans@meuer.de ZKI Herbsttagung in Leipzig 11. - 12. September 2012 page 1 Outline Mannheim Supercomputer
More informationWhy we need Exascale and why we won t get there by 2020 Horst Simon Lawrence Berkeley National Laboratory
Why we need Exascale and why we won t get there by 2020 Horst Simon Lawrence Berkeley National Laboratory 2013 International Workshop on Computational Science and Engineering National University of Taiwan
More informationThe Mont-Blanc approach towards Exascale
http://www.montblanc-project.eu The Mont-Blanc approach towards Exascale Alex Ramirez Barcelona Supercomputing Center Disclaimer: Not only I speak for myself... All references to unavailable products are
More informationMotivation Goal Idea Proposition for users Study
Exploring Tradeoffs Between Power and Performance for a Scientific Visualization Algorithm Stephanie Labasan Computer and Information Science University of Oregon 23 November 2015 Overview Motivation:
More informationGPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)
GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) NATALIA GIMELSHEIN ANSHUL GUPTA STEVE RENNICH SEID KORIC NVIDIA IBM NVIDIA NCSA WATSON SPARSE MATRIX PACKAGE (WSMP) Cholesky, LDL T, LU factorization
More informationMaking a Case for a Green500 List
Making a Case for a Green500 List S. Sharma, C. Hsu, and W. Feng Los Alamos National Laboratory Virginia Tech Outline Introduction What Is Performance? Motivation: The Need for a Green500 List Challenges
More informationCRAY XK6 REDEFINING SUPERCOMPUTING. - Sanjana Rakhecha - Nishad Nerurkar
CRAY XK6 REDEFINING SUPERCOMPUTING - Sanjana Rakhecha - Nishad Nerurkar CONTENTS Introduction History Specifications Cray XK6 Architecture Performance Industry acceptance and applications Summary INTRODUCTION
More informationParallel and Distributed Systems. Hardware Trends. Why Parallel or Distributed Computing? What is a parallel computer?
Parallel and Distributed Systems Instructor: Sandhya Dwarkadas Department of Computer Science University of Rochester What is a parallel computer? A collection of processing elements that communicate and
More informationSystem Packaging Solution for Future High Performance Computing May 31, 2018 Shunichi Kikuchi Fujitsu Limited
System Packaging Solution for Future High Performance Computing May 31, 2018 Shunichi Kikuchi Fujitsu Limited 2018 IEEE 68th Electronic Components and Technology Conference San Diego, California May 29
More informationIntro To Parallel Computing. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist
Intro To Parallel Computing John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist Purpose of this talk This is the 50,000 ft. view of the parallel computing landscape. We want to orient
More informationHPC future trends from a science perspective
HPC future trends from a science perspective Simon McIntosh-Smith University of Bristol HPC Research Group simonm@cs.bris.ac.uk 1 Business as usual? We've all got used to new machines being relatively
More informationAn Overview of High Performance Computing
IFIP Working Group 10.3 on Concurrent Systems An Overview of High Performance Computing Jack Dongarra University of Tennessee and Oak Ridge National Laboratory 1/3/2006 1 Overview Look at fastest computers
More informationMixed Precision Methods
Mixed Precision Methods Mixed precision, use the lowest precision required to achieve a given accuracy outcome " Improves runtime, reduce power consumption, lower data movement " Reformulate to find correction
More informationSeagate ExaScale HPC Storage
Seagate ExaScale HPC Storage Miro Lehocky System Engineer, Seagate Systems Group, HPC1 100+ PB Lustre File System 130+ GB/s Lustre File System 140+ GB/s Lustre File System 55 PB Lustre File System 1.6
More informationHigh Performance Computing An introduction talk. Fengguang Song
High Performance Computing An introduction talk Fengguang Song fgsong@cs.iupui.edu 1 2 Content What is HPC History of supercomputing Current supercomputers (Top 500) Common programming models, tools, and
More informationExploring Emerging Technologies in the Extreme Scale HPC Co- Design Space with Aspen
Exploring Emerging Technologies in the Extreme Scale HPC Co- Design Space with Aspen Jeffrey S. Vetter SPPEXA Symposium Munich 26 Jan 2016 ORNL is managed by UT-Battelle for the US Department of Energy
More informationIntel Xeon Phi архитектура, модели программирования, оптимизация.
Нижний Новгород, 2016 Intel Xeon Phi архитектура, модели программирования, оптимизация. Дмитрий Прохоров, Intel Agenda What and Why Intel Xeon Phi Top 500 insights, roadmap, architecture How Programming
More informationIntroduction to Parallel Programming for Multicore/Manycore Clusters
Introduction to Parallel Programming for Multicore/Manycore Clusters General Introduction Invitation to Supercomputing Kengo Nakajima Information Technology Center The University of Tokyo Takahiro Katagiri
More informationOak Ridge National Laboratory Computing and Computational Sciences
Oak Ridge National Laboratory Computing and Computational Sciences OFA Update by ORNL Presented by: Pavel Shamis (Pasha) OFA Workshop Mar 17, 2015 Acknowledgments Bernholdt David E. Hill Jason J. Leverman
More informationFujitsu s Technologies Leading to Practical Petascale Computing: K computer, PRIMEHPC FX10 and the Future
Fujitsu s Technologies Leading to Practical Petascale Computing: K computer, PRIMEHPC FX10 and the Future November 16 th, 2011 Motoi Okuda Technical Computing Solution Unit Fujitsu Limited Agenda Achievements
More informationEARLY EVALUATION OF THE CRAY XC40 SYSTEM THETA
EARLY EVALUATION OF THE CRAY XC40 SYSTEM THETA SUDHEER CHUNDURI, SCOTT PARKER, KEVIN HARMS, VITALI MOROZOV, CHRIS KNIGHT, KALYAN KUMARAN Performance Engineering Group Argonne Leadership Computing Facility
More informationTrends of Network Topology on Supercomputers. Michihiro Koibuchi National Institute of Informatics, Japan 2018/11/27
Trends of Network Topology on Supercomputers Michihiro Koibuchi National Institute of Informatics, Japan 2018/11/27 From Graph Golf to Real Interconnection Networks Case 1: On-chip Networks Case 2: Supercomputer
More informationTechnologies for Information and Health
Energy Defence and Global Security Technologies for Information and Health Atomic Energy Commission HPC in France from a global perspective Pierre LECA Head of «Simulation and Information Sciences Dpt.»
More informationHETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK STEVE OBERLIN CTO, TESLA ACCELERATED COMPUTING NVIDIA
HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK STEVE OBERLIN CTO, TESLA ACCELERATED COMPUTING NVIDIA STATE OF THE ART 2012 18,688 Tesla K20X GPUs 27 PetaFLOPS FLAGSHIP SCIENTIFIC APPLICATIONS
More informationTOP500 Listen und industrielle/kommerzielle Anwendungen
TOP500 Listen und industrielle/kommerzielle Anwendungen Hans Werner Meuer Universität Mannheim Gesprächsrunde Nichtnumerische Anwendungen im Bereich des Höchstleistungsrechnens des BMBF Berlin, 16./ 17.
More informationHPCS HPCchallenge Benchmark Suite
HPCS HPCchallenge Benchmark Suite David Koester, Ph.D. () Jack Dongarra (UTK) Piotr Luszczek () 28 September 2004 Slide-1 Outline Brief DARPA HPCS Overview Architecture/Application Characterization Preliminary
More informationThe TOP500 Project of the Universities Mannheim and Tennessee
The TOP500 Project of the Universities Mannheim and Tennessee Hans Werner Meuer University of Mannheim EURO-PAR 2000 29. August - 01. September 2000 Munich/Germany Outline TOP500 Approach HPC-Market as
More informationCHAO YANG. Early Experience on Optimizations of Application Codes on the Sunway TaihuLight Supercomputer
CHAO YANG Dr. Chao Yang is a full professor at the Laboratory of Parallel Software and Computational Sciences, Institute of Software, Chinese Academy Sciences. His research interests include numerical
More information