EXASCALE COMPUTING ROADMAP IMPACT ON LEGACY CODES MARCH 17 TH, MIC Workshop PAGE 1. MIC workshop Guillaume Colin de Verdière

Size: px

Start display at page:

Download "EXASCALE COMPUTING ROADMAP IMPACT ON LEGACY CODES MARCH 17 TH, MIC Workshop PAGE 1. MIC workshop Guillaume Colin de Verdière"

Allen Greene
6 years ago
Views:

1 EXASCALE COMPUTING ROADMAP IMPACT ON LEGACY CODES MIC workshop Guillaume Colin de Verdière MARCH 17 TH, 2015 MIC Workshop PAGE 1 CEA, DAM, DIF, F Arpajon, France March 17th, 2015

2 Overview Context Power Wall Memory Wall Software evolution Conclusion CEA, DAM, DIF, F Arpajon, France March 17th, 2015 PAGE 2 MIC Workshop

Local memory network Collectives operations are expensive More asynchronous ops process CPU Local memory Increasing bandwidth But same

3 Node architecture is more and more complex Computer architectures are getting more and more complex Hybrid Architecture Vectorization Multithreading High core count Multiples NUMA effects Little memory per core Compute node User Code process CPU Local memory process CPU Local memory network Collectives operations are expensive More asynchronous ops process CPU Local memory Increasing bandwidth But same latency Reduce the number of MPI tasks Increasing number of nodes (>= 5000) MIC Workshop CEA, DAM, DIF, F Arpajon, France March 17th, 2015 PAGE 3

F-91297 Arpajon, France March 17th, 2015 PAGE 4 MIC

4 Tricastin AFP/Fred Dufour The first challenge: the power wall Target = 1EFlop in 20 MW CEA, DAM, DIF, F Arpajon, France March 17th, 2015 PAGE 4 MIC Workshop Credit K.J. Roche, Jul 2011, U. of Michigan

247V Limit/reduce frequencies Nehalem = 2.8GHz, GPU = 1.

5 Processor Constraint : dissipated power : P = P d + P s P d = C e x F x V 2 dynamic P s = V x I f static o I f : leakage o C e : capacity of the material o V : voltage o F : frequency (is a function of V too) To limit P Limit/Reduce voltage ex: Pentium 4 = 1.7V; Nehalem = 1.247V Limit/reduce frequencies Nehalem = 2.8GHz, GPU = 1.1GHz To increase compute power at constant thermal power Increase compute core count Consequence Your program must be parallel! 1 core per processor Frequency F 1 core F/2 1 core F/2 1 core F/2 MIC Workshop CEA, DAM, DIF, F Arpajon, France March 17th, 2015 PAGE 5

More performance = more cores Options are coming from the market constraints and from the history of the providers Hybrid computers are on the rise Top500 : 8 hybrids

and graphics An opportunity to broaden its market shares A big number of simple cores (one core = 1 pixel) The «Manycore» way INTEL Comes from a general purpose world

6 More performance = more cores Options are coming from the market constraints and from the history of the providers Hybrid computers are on the rise Top500 : 8 hybrids in the top 20 #1 et #2 are hybrid machines An early try: Road Runner Use the processor of a game console The «GPU» way NVIDIA, AMD Comes from a specific world : game and graphics An opportunity to broaden its market shares A big number of simple cores (one core = 1 pixel) The «Manycore» way INTEL Comes from a general purpose world Capitalize on the existing software ecosystem A managed amount of cores of mid size power MIC Workshop CEA, DAM, DIF, F Arpajon, France March 17th, 2015 PAGE 6

The GPU track Track of NVIDIA & AMD Working on pixels pushes massive parallelism Relies on

programming style E.g.: branches, synchronizations Can be seen as a big vector machine (data

Still need a CPU and memory (usually twice the size of the GPU one) and an external link

locality Increases the cost of data movements Some success stories at CEA The ROMEO machine @

7 The GPU track Track of NVIDIA & AMD Working on pixels pushes massive parallelism Relies on programs using a very large number of threads The hardware introduces constraints on the programming style E.g.: branches, synchronizations Can be seen as a big vector machine (data parallel) Potentially large performance Yet requires a special programming of suitable algorithms Still need a CPU and memory (usually twice the size of the GPU one) and an external link (PCIExpress, NVlink) Incurs additional costs (big ones) Emphasizes the importance of data locality Increases the cost of data movements Some success stories at CEA The ROMEO URCA #184 Top500 06/14(384TFlop/s) #6 Green500 NVIDIA K20X MIC Workshop CEA, DAM, DIF, F Arpajon, France March 17th, 2015 PAGE 7

The Manycore track Track followed by Intel Keeps the compatibility with the

OpenMP standard only Considered by Intel as a serious milestone on the road

Tianhe-2, NUDT, China #1 Top500 06/14 33,86 petaflops 32000 Ivy-Bridge 48000

8 The Manycore track Track followed by Intel Keeps the compatibility with the leading X86 architecture Claim a simplicity of programming Intel follows OpenMP standard only Considered by Intel as a serious milestone on the road towards the Exascale Evolution of the entire software ecosystem to follow Tianhe-2, NUDT, China #1 Top500 06/14 33,86 petaflops Ivy-Bridge Xeon Phi (KNC) MIC Workshop CEA, DAM, DIF, F Arpajon, France March 17th, 2015 PAGE 8

9 Large core counts are getting mainstream MPPA from Kalray 256 cores (up to 1024) Tilera 100 cores HPC Adapteva Epiphani 64 cores Cavium Thunder X 48 cores Targets high end servers MIC Workshop CEA, DAM, DIF, F Arpajon, France March 17th, 2015 PAGE 9

vector units Goal : 90% of the Intel s Haswell MIC

10 Broadcom Vulcan A new player to follow ARMv8 CPU 64 bits 20 cores 4 threads per core NEON SIMD 128 bits wide vector units Goal : 90% of the Intel s Haswell MIC Workshop CEA, DAM, DIF, F Arpajon, France March 17th, 2015 PAGE 10

(control array placement in memory) 2 vector units for each core Will deliver most of the compute power of the KNL

11 Xeon Phi Manycore KNL (2015) Stacked memory HMC by Micron Innovative architecture 72 cores + 4 threads / core Unavoidable multithreading Introduction of the stacked memory [HMC] (B/W, latency ) Will have a big Impact on codes (control array placement in memory) 2 vector units for each core Will deliver most of the compute power of the KNL Vector programming is mandatory Internal NUMA effects will appear MIC Workshop CEA, DAM, DIF, F Arpajon, France March 17th, 2015 PAGE 11

12 Fujitsu Sparc 64TM Xlfx The groundwork for a japanese Exascale Machine Announced in August 2014 HotChips 32 cores 1 thread / core 256 bits wide SIMD FMA Strided access Gather/scatter HMC See KNL MIC Workshop CEA, DAM, DIF, F Arpajon, France March 17th, 2015 PAGE 12

13 The next challenge: the data movement (the Memory Wall) C= m/s 1ns 30cm! CEA, DAM, DIF, F Arpajon, France March 17th, 2015 PAGE 13 MIC Workshop

cy) - MCDRAM (KNL) very high latency (~140 cy), focus on bandwidth - DDR4 very high latency (~140 cy), focus on storage size NIC Credit K.

14 Moving data around is more and more expensive Computing is cheaper than moving data Memory hierarchies have associated costs L1$ small size (kb), small latency (~1cy), high B/W L2$ medium size (kb), medium latency (~14cy), high B/W L3$ larger size (MB), high latency (~40 cy) - MCDRAM (KNL) very high latency (~140 cy), focus on bandwidth - DDR4 very high latency (~140 cy), focus on storage size NIC Credit K.J. Roche, Jul 2011, U. of Michigan Credit Bill Dally, NVIDIA, SC10 MIC Workshop CEA, DAM, DIF, F Arpajon, France March 17th, 2015 PAGE 14

15 Moving data around is more and more expensive Some ways around Finer process Will find its limits soon enough Not covered here: Network interfaces, New memories Optics Memory coherency Put the functional units closer to each other MIC Workshop CEA, DAM, DIF, F Arpajon, France March 17th, 2015 PAGE 15

architectures Which standards? AMD favors OpenCL, OpenMP 4.

16 Fusion architecture by AMD Tegra architecture by NVIDIA Goal: put the GPU closer to the CPUs No more communications via the PCI-Ex Easy resource management for the code developper Constraints Required modification to the O/S Need a software ecosystem tailored to these architectures Which standards? AMD favors OpenCL, OpenMP 4.0 NVIDIA favors CUDA, OpenACC AMD Kaveri NVIDIA Tegra k1 Solutions not yet applicable to HPC Primary market = laptops, tablets, smartphones Fast evolution to come MIC Workshop CEA, DAM, DIF, F Arpajon, France March 17th, 2015 PAGE 16

Remark : for Intel, classical processors have this kind of architecture (already) The Integrated Graphic Processor (IGP) is very efficient for OpenCL (1.

17 Remark : for Intel, classical processors have this kind of architecture (already) The Integrated Graphic Processor (IGP) is very efficient for OpenCL (1.2) programs Haswell desktop Intel favors OpenMP 4.0 The software ecosystem is large MIC Workshop CEA, DAM, DIF, F Arpajon, France March 17th, 2015 PAGE 17

Faster internal links: The NVLINK Coral proposal OpenPower + NVLINK + Volta Volta http://info.nvidianews.

pdf NVLINK 2 NVLINK PCI-Express Gen3 12 GB/s (16 GB/s) Nvlink 1 64 GB/s (80 GB/s) No cache coherency First availability with the

18 Faster internal links: The NVLINK Coral proposal OpenPower + NVLINK + Volta Volta NVLINK 2 NVLINK PCI-Express Gen3 12 GB/s (16 GB/s) Nvlink 1 64 GB/s (80 GB/s) No cache coherency First availability with the Pascal GPU Nvlink 2 will have unified memory, cache coherency and more speed (>= X2) OpenPower then ARM64 Will be opened to more architectures ARM64 MIC Workshop CEA, DAM, DIF, F Arpajon, France March 17th, 2015 PAGE 18

Concept of an Exaflop machine (NVIDIA) Dense assembly of specialized components LC = Latency core : classical CPU for sequential parts SM = Symmetric Multicore : SIMT engine (GPU type) for massively

19 Concept of an Exaflop machine (NVIDIA) Dense assembly of specialized components LC = Latency core : classical CPU for sequential parts SM = Symmetric Multicore : SIMT engine (GPU type) for massively parallel operations NoC = Network on Chip : coherent data exchange between functional units The question of the software (standards) still remains MIC Workshop CEA, DAM, DIF, F Arpajon, France March 17th, 2015 PAGE 19

20 Software evolutions CEA, DAM, DIF, F Arpajon, France March 17th, 2015 PAGE 20 MIC Workshop

21 Programming languages 1/2 Fortran / C++ Decline of FORTRAN? Still an active Open Source effort (flang in LLVM) Interesting evolutions of C++ to come Better integration of threads PGAS Interesting concept yet far from production Partitioned Global Address Space MIC Workshop CEA, DAM, DIF, F Arpajon, France March 17th, 2015 PAGE 21

22 Programming languages 2/2 The MPI +X model is (still) the future The question : what is X? Non standard options CUDA Interesting to learn massive parallelism OpenACC A good lab for OpenMP 4.x Standard options OpenMP 4 The best bet for the future OpenCL Low level, widely available SYCL initiative to follow closely MIC Workshop CEA, DAM, DIF, F Arpajon, France March 17th, 2015 PAGE 22

23 Conclusion Impact on codes CEA, DAM, DIF, F Arpajon, France March 17th, 2015 PAGE 23 MIC Workshop

24 Hardware The big trends towards the Exascale Reduce the electrical consumption of the compute units Ever more cores with a reasonable frequency Systematic usage of specialized units SIMD/SIMT Reduce the cost of data movements Unavoidable placement of CPUs and powerful SIMD units inside the same chip Integration of high performance network interfaces to the chip (SoC) Stacked memory to increase bandwidth More threads per cores to hide latencies Software Necessary evolution of standards to cope with hardware evolution Don t rush on new fancy options, yet learn by experimenting on prototypes Upgrade of the full software ecosystem Compilers, debuggers, profilers Challenges Use all units Programming those processors at more than 10% of their peak Optimize existing codes to fully utilize the platforms Multithread your code Vectorize your algorithms Work on data location(s) Multithread your code MPI as usual (PGAS maybe one day?) MIC Workshop CEA, DAM, DIF, F Arpajon, France March 17th, 2015 PAGE 24

PAGE 25 CEA, DAM, DIF, F-91297 Arpajon, France March 17th, 2015 Commissariat à l énergie atomique et aux énergies alternatives Centre DAM Ile-de-France

25 PAGE 25 CEA, DAM, DIF, F Arpajon, France March 17th, 2015 Commissariat à l énergie atomique et aux énergies alternatives Centre DAM Ile-de-France Arpajon Cedex, France T. +33 (0) DAM DSSI ED MIC Workshop Etablissement public à caractère industriel et commercial RCS Paris B

COMPUTING ELEMENT EVOLUTION AND ITS IMPACT ON SIMULATION CODES

COMPUTING ELEMENT EVOLUTION AND ITS IMPACT ON SIMULATION CODES P(ND) 2-2 2014 Guillaume Colin de Verdière OCTOBER 14TH, 2014 P(ND)^2-2 PAGE 1 CEA, DAM, DIF, F-91297 Arpajon, France October 14th, 2014 Abstract: