COMPUTING ELEMENT EVOLUTION AND ITS IMPACT ON SIMULATION CODES


COMPUTING ELEMENT EVOLUTION AND ITS IMPACT ON SIMULATION CODES
P(ND)^2-2, October 14th, 2014
Guillaume Colin de Verdière, CEA, DAM, DIF, F-91297 Arpajon, France

Abstract: This talk covers the evolution of the processing elements we will find in forthcoming supercomputers. From this technological survey we explain the impacts that technology will force on simulation codes, and why we will not be able to avoid them.

Node architecture is more and more complex

Computer architectures are getting more and more complex:
- Hybrid architectures
- Vectorization
- Multithreading
- High core counts
- Multiple NUMA effects
- Little memory per core

[Diagram: a compute node running the user code as several processes, each with its CPU and local memory, connected through the network.]

- Collective operations are expensive: use more asynchronous operations
- Bandwidth is increasing, but latency stays the same: reduce the number of MPI tasks
- The number of nodes is increasing (>= 5000)

(A hybrid MPI + threads sketch follows below.)
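A minimal sketch (not from the talk) of the two recommendations above: fewer MPI ranks with threads filling each node, and a non-blocking collective overlapped with local work. The buffer sizes and the MPI_Iallreduce/OpenMP combination are illustrative assumptions.

```cpp
// Minimal sketch: one MPI rank per node (or NUMA domain) with OpenMP threads
// inside it, and an asynchronous collective overlapped with local work.
// Buffer sizes are arbitrary; error handling is omitted for brevity.
#include <mpi.h>
#include <vector>

int main(int argc, char **argv) {
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    std::vector<double> local(1 << 20, 1.0);
    double partial = 1.0, global = 0.0;

    // Start the expensive collective early (asynchronous operation)...
    MPI_Request req;
    MPI_Iallreduce(&partial, &global, 1, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD, &req);

    // ...and keep all cores of the node busy with threaded local work
    // instead of running one MPI task per core.
    double acc = 0.0;
    #pragma omp parallel for reduction(+ : acc)
    for (long i = 0; i < (long)local.size(); ++i)
        acc += local[i] * local[i];

    MPI_Wait(&req, MPI_STATUS_IGNORE);   // the collective completes here
    MPI_Finalize();
    return acc < 0.0 ? 1 : 0;            // keep the compiler from discarding acc
}
```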

Processor constraint: dissipated power

P = P_d + P_s
- P_d = C_e × F × V² (dynamic)
- P_s = V × I_f (static)
- I_f: leakage current
- C_e: capacitance of the material
- V: voltage
- F: frequency (itself a function of V)

To limit P:
- Limit/reduce the voltage (e.g. Pentium 4 = 1.7 V; Nehalem = 1.247 V)
- Limit/reduce the frequency (Nehalem = 2.8 GHz, GPU = 1.1 GHz)

To increase compute power at constant thermal power: increase the compute core count.

Consequence: your program must be parallel!

[Diagram: one core per processor at frequency F replaced by several cores at F/2; see the worked example below.]
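A minimal numerical sketch of the argument, with purely illustrative values for C_e, F and V (none of them come from the talk): halving the frequency lets the voltage drop as well, so several slower cores deliver more aggregate throughput for a comparable dynamic power.

```cpp
// Minimal sketch with illustrative values (not from the talk):
// dynamic power P_d = Ce * F * V^2 for one fast core versus four cores
// at half the frequency, which can also run at a lower voltage.
#include <cstdio>

double dynamic_power(double ce, double freq_hz, double volt) {
    return ce * freq_hz * volt * volt;   // P_d = Ce * F * V^2
}

int main() {
    const double ce = 1.0e-9;                 // assumed capacitance (arbitrary units)
    const double f  = 3.0e9,   v  = 1.25;     // one core at full frequency
    const double f2 = f / 2.0, v2 = 0.90;     // half frequency allows lower voltage

    const double one_fast  = dynamic_power(ce, f, v);         // ~4.7 W
    const double four_slow = 4.0 * dynamic_power(ce, f2, v2); // ~4.9 W

    // Roughly the same thermal budget, but four half-speed cores provide
    // about twice the aggregate throughput of the single fast core.
    std::printf("1 core  @ F   : %.2f W\n", one_fast);
    std::printf("4 cores @ F/2 : %.2f W\n", four_slow);
    return 0;
}
```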

More performance = more cores

The options come from market constraints and from the history of the vendors. Hybrid computers are on the rise: in the Top500, 8 of the top 20 systems are hybrid, and #1 and #2 are hybrid machines. An early attempt was Road Runner, which used the processor of a game console.

The "GPU" way (NVIDIA, AMD):
- Comes from a specific world: games and graphics
- An opportunity to broaden their market share
- A large number of simple cores (one core = one pixel)

The "Manycore" way (Intel):
- Comes from the general-purpose world
- Capitalizes on the existing software ecosystem
- A managed number of mid-sized cores

Road Runner (IBM) @ LANL
- #1 of the June 2008 Top500 with 1.026 petaflop/s
- Would still have been #36 in the June 2014 Top500!
- 6,562 dual-core AMD Opterons
- 12,240 Cell Broadband Engines
- First hybrid architecture
- Required a large programming effort to reach performance

The "GPU" track

Track of NVIDIA & AMD. Working on pixels pushes massive parallelism.
- Relies on programs using a very large number of threads
- The hardware imposes constraints on the programming style (e.g. branches, synchronizations)
- Can be seen as a big vector machine (data parallel)
- Potentially large performance, yet requires special programming of suitable algorithms
- Still needs a host CPU and memory (usually twice the size of the GPU memory) and an external link (PCI Express, NVLink), which incurs significant additional costs
- Emphasizes the importance of data locality and increases the cost of data movements (see the sketch below)

Some success stories at CEA: the ROMEO machine @ URCA, #184 in the June 2014 Top500 (384 TFlop/s), #6 Green500, NVIDIA K20X.
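As a minimal sketch of the data-locality point (not taken from the talk), the loop below keeps its array resident on the accelerator across kernel launches using OpenMP 4.x target directives; the function name, array size and scaling kernel are illustrative assumptions, and the same structure can be expressed in OpenACC or CUDA.

```cpp
// Minimal sketch using OpenMP 4.x target directives (illustrative names and
// sizes): map the array to the accelerator once and keep it resident across
// kernel launches, so the PCI Express / NVLink transfer is not paid per step.
#include <vector>

void scale_on_device(std::vector<double> &a, double s, int steps) {
    double *p = a.data();
    const long n = (long)a.size();

    #pragma omp target data map(tofrom : p[0:n])   // one transfer in, one out
    {
        for (int k = 0; k < steps; ++k) {
            // Massively parallel, branch-free loop: the SIMT-friendly pattern.
            #pragma omp target teams distribute parallel for
            for (long i = 0; i < n; ++i)
                p[i] *= s;
        }
    }
}
```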

The "Manycore" track

Track followed by Intel.
- Keeps compatibility with the leading x86 architecture
- Claims simplicity of programming; Intel follows the OpenMP standard only
- Considered by Intel as a serious milestone on the road towards the Exascale
- The entire software ecosystem will have to evolve to follow

Tianhe-2 (NUDT, China): #1 in the June 2014 Top500 with 33.86 petaflop/s, 32,000 Ivy Bridge Xeons and 48,000 Xeon Phi (KNC) coprocessors.

Large core counts are getting mainstream
- MPPA from Kalray
- Tilera, 100 cores for HPC
- Adapteva Epiphany, 64 cores

Xeon Phi Manycore: KNL (2015)

An innovative architecture with stacked memory:
- 72 cores with 4 threads per core: multithreading is unavoidable
- Introduction of stacked memory (HMC), changing bandwidth and latency; this will have a big impact on codes (array placement in memory must be controlled)
- 2 vector units per core, which will deliver most of the KNL's compute power: vector programming is mandatory
- Internal NUMA effects will appear

(A threading + vectorization sketch follows below.)
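A minimal sketch, assuming an OpenMP 4.0 compiler, of the loop style such a manycore rewards: threads across the many cores plus explicit SIMD for the per-core vector units. The function and array names are illustrative only.

```cpp
// Minimal sketch of a KNL-friendly loop (illustrative names): OpenMP threads
// feed the many cores and their hardware threads, while the simd clause asks
// the compiler to use the per-core vector units.
#include <vector>

void axpy(std::vector<float> &y, const std::vector<float> &x, float a) {
    float *yp = y.data();
    const float *xp = x.data();
    const long n = (long)y.size();

    #pragma omp parallel for simd
    for (long i = 0; i < n; ++i)
        yp[i] = a * xp[i] + yp[i];
}
```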

Fusion architecture by AMD, Tegra architecture by NVIDIA

Goal: put the GPU closer to the CPUs.
- No more communications via PCI Express
- Easier resource management for the code developer

Constraints:
- Requires modifications to the OS
- Needs a software ecosystem tailored to these architectures
- Which standards? AMD favors OpenCL and OpenMP 4.0; NVIDIA favors CUDA and OpenACC

Examples: AMD Kaveri, NVIDIA Tegra K1. These solutions are not yet applicable to HPC: their primary market is laptops, tablets and smartphones, but fast evolution is to come.

Remark: for Intel, classical processors already have this kind of architecture. The Integrated Graphics Processor (IGP) of a Haswell desktop part is very efficient for OpenCL (1.2) programs. Intel favors OpenMP 4.0, and its software ecosystem is large.

Concept of an Exaflop machine (NVIDIA)

A dense assembly of specialized components:
- LC = Latency Core: a classical CPU for the sequential parts
- SM = Symmetric Multicore: a SIMT engine (GPU type) for the massively parallel operations
- NoC = Network on Chip: coherent data exchange between the functional units

The question of the software (standards) still remains.

Conclusions

Hardware: the big trends
- Reduce the electrical consumption of the compute units: ever more cores at a reasonable frequency, systematic usage of specialized SIMD/SIMT units
- Reduce the cost of data movements: unavoidable placement of CPUs and powerful SIMD units inside the same chip, integration of high-performance network interfaces into the chip (SoC), stacked memory to increase bandwidth, more threads to hide latencies

Software
- Necessary evolution of the standards to cope with the hardware evolution
- Don't rush to new fancy options, yet learn by experimenting on prototypes
- Upgrade of the full software ecosystem: compilers, debuggers, profilers

Challenges
- Program those processors at more than 10% of their peak
- Optimize existing codes to fully utilize the platforms: use all the units, multithread your code, vectorize your algorithms, work on data locality, and keep MPI as usual (PGAS maybe one day?)