HPC future trends from a science perspective

Simon McIntosh-Smith
University of Bristol HPC Research Group
simonm@cs.bris.ac.uk

Business as usual?
- We've all got used to new machines being relatively simple evolutions of our previous machines.
- This will no longer be true from 2016 onwards, with huge implications for the scientific software community.

What's changing?
- Mainstream multi-core CPUs will continue to evolve, but more slowly: Ivy Bridge → Haswell → Broadwell; 12 cores → 18 cores → 22+ cores.
- To retain the levels of performance increase we have historically enjoyed, we will have no choice but to adopt radically different architectures.

What are the options?
- Many-core CPUs: Intel Xeon Phi Knights Landing (KNL), launching 2016; large KNL machines are going into US national labs. O(70) cores, 2x 512-bit vectors per cycle per core, on-package high-bandwidth memory.
- Other many-core CPUs are emerging, mostly based on the ARM64 architecture, from multiple vendors: AMD, Broadcom, Cavium, AMCC, ...

What are the options?
- GPUs: Nvidia Pascal / Volta, tightly coupled with IBM Power CPUs.
- AMD GPUs often have more memory bandwidth and FLOP/s than Nvidia's; there is an interesting focus on tight CPU/GPU integration ("fat APUs").

What other big changes are coming?
- Deeper memory hierarchies: we could have up to 5 levels of memory in-node:
  1. Registers
  2. Caches (L1 to L3, or even L4), O(10 MB)
  3. On-package high-bandwidth memory, O(10 GB)
  4. Regular DRAM, O(100 GB)
  5. NVRAM, O(1-10 TB)
- How does software use this efficiently? (A sketch follows below.)
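One concrete answer for level 3, the on-package high-bandwidth memory: on KNL it can be exposed through the memkind library's hbwmalloc interface, so bandwidth-critical arrays are placed there explicitly. A minimal sketch, assuming memkind is installed; the array name and size are illustrative:

```c
#include <stdlib.h>
#include <hbwmalloc.h>

int main(void)
{
    const size_t n = 1 << 26;   /* ~0.5 GB of doubles, illustrative size */
    int on_hbm = (hbw_check_available() == 0);

    /* Place the bandwidth-critical array in on-package memory if it
       exists, otherwise fall back to regular DRAM */
    double *field = on_hbm ? hbw_malloc(n * sizeof(double))
                           : malloc(n * sizeof(double));
    if (!field) return 1;

    for (size_t i = 0; i < n; i++)
        field[i] = 0.0;

    /* ... bandwidth-bound work on field ... */

    if (on_hbm) hbw_free(field);
    else        free(field);
    return 0;
}
```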

What other big changes are coming?
- Integrated interconnects, e.g. Intel's Omni-Path fabric (on-chip in KNL) and Nvidia's NVLink (direct CPU-GPU connection); Cray and Mellanox must be working on this too.
- Once the interconnect is integrated on-chip, the next step would be to integrate it inside the virtual memory system.
- This dramatically reduces overhead (no more drivers) and should enable efficient PGAS implementations (see the sketch below). Likely this will be a feature of Exascale machines.
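To make the PGAS point concrete, here is a minimal one-sided communication sketch in OpenSHMEM (a representative PGAS API, chosen for illustration; the slide does not name one). Puts like this are exactly the operations an interconnect integrated into the virtual memory system would make cheap:

```c
#include <shmem.h>

int main(void)
{
    shmem_init();
    int me = shmem_my_pe();

    /* Symmetric allocation: the same address is valid on every PE,
       so any PE's copy can be read or written remotely */
    long *dest = shmem_malloc(sizeof(long));
    *dest = -1;
    shmem_barrier_all();

    if (me == 0 && shmem_n_pes() > 1) {
        long value = 42;
        /* One-sided write straight into PE 1's memory; no matching
           receive, and with an integrated NIC, no driver call either */
        shmem_long_put(dest, &value, 1, 1);
    }
    shmem_barrier_all();   /* completes the put before anyone reads dest */

    shmem_free(dest);
    shmem_finalize();
    return 0;
}
```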

What other big changes are coming?
- FPGAs, now supporting OpenCL.
- Intel recently acquired Altera; hybrid CPU/FPGA systems are possible.
- Microsoft is using FPGAs for Bing searches.
- We could cast some key algorithms into optimised hardware in FPGA form (illustrative kernel below).
- I'm still not sure how big a role FPGAs will play in HPC's future.
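The "now supporting OpenCL" point means FPGA toolchains accept ordinary OpenCL C kernels and synthesise them into hardware. A hedged illustration (the kernel name and signature are my own, not from the slide); the same source also runs on GPUs and CPUs:

```c
/* OpenCL C, a C dialect: an FPGA toolchain can compile this kernel into
   a deeply pipelined datapath, while GPU/CPU runtimes execute it as-is */
__kernel void saxpy(const float a,
                    __global const float *x,
                    __global float *y)
{
    size_t i = get_global_id(0);
    y[i] = a * x[i] + y[i];   /* one fused multiply-add per work-item */
}
```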

Long-term fundamental trends
[Chart: relative improvement over time. Annotations: "We design codes for here" (today's machines) vs "We need to design codes for here!" (future machines).]
- Microprocessor performance: ~55% per annum
- Memory capacity: ~49% per annum (and slowing down?)
- Memory bandwidth: ~30% per annum (and slowing down?)
- Memory latency: <<30% per annum

"Will this affect my code?" 10

What are we going to have to do?
- Expose maximum parallelism: task, data, vector, ... (a sketch follows this list).
- Explicitly manage the memory hierarchy; we can't leave it all to the caches anymore.
- Be ready for major changes to interconnects in the next few years. A PGAS revival?
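A minimal sketch of exposing two of these levels at once, using OpenMP (my choice of notation; the slide names no particular API): the outer loop is spread across threads for data parallelism, and the inner loop is explicitly vectorised instead of being left for the compiler to guess at.

```c
#include <stddef.h>

void scale_rows(size_t nrows, size_t ncols,
                double *restrict a, const double *restrict s)
{
    #pragma omp parallel for          /* data parallelism across rows */
    for (size_t i = 0; i < nrows; i++) {
        #pragma omp simd              /* vector parallelism within a row */
        for (size_t j = 0; j < ncols; j++)
            a[i * ncols + j] *= s[i];
    }
}
```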

HPC nirvana: (re)write your code once, run efficiently everywhere
- Realistic? We don't even really have this today on CPU-based systems, e.g. between x86 clusters and Blue Gene.
- Expect to see a lot more of (i) Domain Specific Languages (DSLs), (ii) code generation, and (iii) autotuning to help make this a reality.

Performance portability: how close can we get?
- My group has been looking at this problem for 6 years.
- In the past there was only one truly cross-platform parallel programming API we could use: OpenCL.
- We are finally seeing OpenMP 4.x being adopted for GPU programming, a big step forwards (a sketch of the model follows below).
- Also Kokkos, RAJA, OmpSs, ...
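The portability appeal of OpenMP 4.x is that one set of directives covers both offload and host execution. A minimal sketch (function and names are illustrative): with an accelerator present the loop runs on the device; without one, the same code runs on the host CPU.

```c
void vec_add(int n, const double *x, const double *y, double *z)
{
    /* Offload the loop to an attached accelerator, copying x and y in
       and z back out; falls back to the host if no device is available */
    #pragma omp target teams distribute parallel for \
            map(to: x[0:n], y[0:n]) map(from: z[0:n])
    for (int i = 0; i < n; i++)
        z[i] = x[i] + y[i];
}
```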

[Slide of GPU-STREAM memory bandwidth results.] See: https://github.com/uob-hpc/gpu-stream/wiki
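GPU-STREAM times the classic STREAM kernels on accelerators to report achievable memory bandwidth. Its core measurement is the triad kernel, which in OpenCL C form looks roughly like this (an illustrative rendering, not the repository's exact source; the benchmark itself also runs in double precision):

```c
/* STREAM triad: two reads and one write per work-item, so the runtime
   is governed almost entirely by memory bandwidth */
__kernel void triad(__global float *a,
                    __global const float *b,
                    __global const float *c,
                    const float scalar)
{
    size_t i = get_global_id(0);
    a[i] = b[i] + scalar * c[i];
}
```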

TeaLeaf heat conduction mini-app
- From the Mantevo suite of benchmarks.
- Implicit, sparse, matrix-free solvers on a structured grid: Conjugate Gradient (CG), Chebyshev, Preconditioned Polynomial CG (PPCG).
- Memory bandwidth bound (see the matrix-free sketch below).
- Good strong and weak scaling on Titan & Piz Daint.
- Download TeaLeaf from: http://uk-mac.github.io/tealeaf/
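"Matrix-free" here means the sparse matrix is never stored: each solver iteration applies the 5-point heat-conduction stencil directly to the grid. A hedged sketch of such an operator apply (coefficients and names are illustrative, not TeaLeaf's actual code); five reads and one write per point is why these solvers are memory bandwidth bound:

```c
/* w = A*u, with A defined implicitly by the implicit-diffusion stencil */
void apply_operator(int nx, int ny, double rx, double ry,
                    const double *restrict u, double *restrict w)
{
    for (int j = 1; j < ny - 1; j++) {
        for (int i = 1; i < nx - 1; i++) {
            int k = j * nx + i;
            w[k] = (1.0 + 2.0 * (rx + ry)) * u[k]
                 - rx * (u[k - 1]  + u[k + 1])
                 - ry * (u[k - nx] + u[k + nx]);
        }
    }
}
```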

TeaLeaf CG performance
[Chart: performance across architectures, normalised to the lowest result; higher is better.]
- Most GPUs sustain ~50% of peak memory bandwidth.
- Achieved excellent performance portability across diverse architectures.
Source: "An Evaluation of Emerging Many-Core Parallel Programming Models", Martineau, M., McIntosh-Smith, S., Boulton, M., Gaudin, W., PMAM, Barcelona, March 2016.

Evidence?
- 5 of the top 10 leadership-class machines already rely on Xeon Phi or GPUs.
- The holdouts are mostly based on IBM's BlueGene/Q, but that line is reaching end of life.
- From 2016 onwards, expect almost all of the top 10 machines worldwide to rely on these radically different architectures.
- A performance chasm will then open up between the "haves" and the "have nots".

Summary
- A "business as usual" approach to scientific software development will result in being left in the slow lane.
- Developers face the challenge of developing performance-portable code on increasingly complex and diverse architectures.
- This will require the most significant software re-engineering effort in a generation.
- BEWARE vendor-specific languages (OpenACC, CUDA, ...); invest in open standards instead (OpenMP, OpenCL, ...).

References
- "Evaluating OpenMP 4.0's Effectiveness as a Heterogeneous Parallel Programming Model", Martineau, M., McIntosh-Smith, S. & Gaudin, W. To appear at the HIPS Workshop (IPDPS), Chicago, May 2016.
- "On the performance portability of structured grid codes on many-core computer architectures", McIntosh-Smith, S.N., Boulton, M., Curran, D. & Price, J.R. ISC, Leipzig, June 2014.
- "Assessing the Performance Portability of Modern Parallel Programming Models using TeaLeaf", Martineau, M., McIntosh-Smith, S. & Gaudin, W. Concurrency and Computation: Practice and Experience (April 2016), to appear.