Introduction to Parallel Programming for Multicore/Manycore Clusters
1 Introduction to Parallel Programming for Multicore/Manycore Clusters. General Introduction: Invitation to Supercomputing. Kengo Nakajima, Information Technology Center, The University of Tokyo; Takahiro Katagiri, Information Technology Center, Nagoya University.
2 Motivation for Parallel Computing (and this class). Large-scale parallel computers enable fast computing in large-scale scientific simulations with detailed models; computational science opens new frontiers of science and engineering. Why parallel computing? Faster & larger. "Larger" matters more from the viewpoint of new frontiers of science & engineering, but "faster" is also important; on top of that, models become more complicated. Ideal: scalable. Weak scaling: solving an N-times larger problem using N times the computational resources in the same computation time. Strong scaling: solving a fixed-size problem using N times the computational resources in 1/N the computation time.
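These two notions can be stated compactly. Writing T(P; n) for the time to solve a problem of size n on P processing elements (a standard formulation, not taken from the slides themselves):

    S(P) = T(1; n) / T(P; n)          strong scaling speedup   (ideal: S(P) = P)
    E(P) = T(1; n) / T(P; P*n)        weak scaling efficiency  (ideal: E(P) = 1)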
3 Scientific Computing = SMASH: Science, Modeling, Algorithm, Software, Hardware. You have to learn many things. Collaboration (or co-design) will be important for the future career of each of you, as a scientist and/or an engineer: you have to communicate with people with different backgrounds, which is more difficult than communicating with foreign scientists from the same area. (Q): Computer Science, Computational Science, or Numerical Algorithms?
4 Supercomputers and Computational Science: Overview of the Class.
5 Computer & Central Processing Unit (CPU, 中央処理装置). The CPUs used in PCs and in supercomputers are based on the same architecture. GHz: clock rate (frequency), the number of operations performed by the CPU per second; 1 GHz → 10^9 operations/sec. Simultaneous 4-8 instructions per clock.
6 Multicore CPU. A core (コア) is the central part of a CPU. Single core: 1 core/CPU; dual core: 2 cores/CPU; quad core: 4 cores/CPU. GPU: manycore, O(10^1)-O(10^2) cores. More and more cores: parallel computing. Oakleaf-FX at the University of Tokyo: 16 cores per CPU (SPARC64 IXfx). Multicore CPUs with 4-8 cores are popular. Low power.
7 GPU/Manycore Processors. GPU: Graphics Processing Unit; GPGPU: General-Purpose GPU. O(10^2) cores, high memory bandwidth, cheap; no stand-alone operation, a host CPU is needed; programming: CUDA, OpenACC. Intel Xeon Phi: manycore, >60 cores; high memory bandwidth; Unix, Fortran and C compilers; currently a host CPU is needed; stand-alone operation: Knights Landing.
8 Parallel Supercomputers: multicore CPUs are connected through a network.
9 Supercomputers with Heterogeneous/Hybrid Nodes. (Diagram: each node combines GPUs or manycore processors with many CPU cores.)
10 Performance of Supercomputers. Performance of a CPU: clock rate; FLOPS = Floating-point Operations Per Second (operations on real numbers). Recent multicore CPUs perform 4-8 FLOP per clock per core; e.g., the peak performance of one core at 3 GHz is 3 x (4 or 8) = 12 (or 24) x 10^9 FLOPS = 12 (or 24) GFLOPS. Units: 10^6 FLOPS = 1 MegaFLOPS = 1 MFLOPS; 10^9 FLOPS = 1 GigaFLOPS = 1 GFLOPS; 10^12 FLOPS = 1 TeraFLOPS = 1 TFLOPS; 10^15 FLOPS = 1 PetaFLOPS = 1 PFLOPS; 10^18 FLOPS = 1 ExaFLOPS = 1 EFLOPS.
11 Peak Performance of Reedbush-U. Intel Xeon E5-2695 v4 (Broadwell-EP), 2.1 GHz; 16 DP (double-precision) FLOP operations per clock. Peak performance of one core = 33.6 GFLOPS. Peak performance of one node: 1 socket (18 cores) = 604.8 GFLOPS; 2 sockets (36 cores) = 1,209.6 GFLOPS.
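The arithmetic behind these figures, as a worked example using the numbers above:

    peak = (clock rate) x (FLOP per clock per core) x (number of cores)
    1 core:                2.1 GHz x 16  =    33.6 GFLOPS
    1 socket (18 cores):   33.6 x 18     =   604.8 GFLOPS
    1 node (2 sockets):    604.8 x 2     = 1,209.6 GFLOPS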
12 TOP500 List: a ranking of the supercomputers in the world. Performance (FLOPS rate) is measured by the Linpack benchmark, which solves large-scale linear equations. Published since 1993 and updated twice a year (at international conferences in June and November). A Linpack iPhone version is available.
13 PFLOPS: Peta (= 10^15) FLoating-point OPerations per Second. Exa-FLOPS (= 10^18) will be attained in the coming years.
14 50th TOP500 List (November 2017). Rmax: Linpack performance (TFLOPS); Rpeak: peak performance (TFLOPS); Power: kW.

| # | Site | Computer / Year / Vendor | Cores | Rmax (TFLOPS) | Rpeak (TFLOPS) | Power (kW) |
|---|------|--------------------------|-------|---------------|----------------|------------|
| 1 | National Supercomputing Center in Wuxi, China | Sunway TaihuLight, Sunway MPP, SW26010 260C 1.45GHz, 2016, NRCPC | 10,649,600 | 93,015 (= 93.0 PF) | 125,436 | 15,371 |
| 2 | National Supercomputing Center in Tianjin, China | Tianhe-2, Intel Xeon E5-2692, TH Express-2, Xeon Phi, 2013, NUDT | 3,120,000 | 33,863 (= 33.9 PF) | 54,902 | 17,808 |
| 3 | Swiss Natl. Supercomputer Center, Switzerland | Piz Daint, Cray XC30/NVIDIA P100, 2013, Cray | 361,760 | 19,590 | 25,326 | 2,272 |
| 4 | JAMSTEC, Japan | Gyoukou, ZettaScaler-2.2 HPC, Xeon D-1571, 2017, ExaScaler | 19,860,000 | 19,136 | 28,192 | 1,350 |
| 5 | Oak Ridge National Laboratory, USA | Titan, Cray XK7/NVIDIA K20x, 2012, Cray | 560,640 | 17,590 | 27,113 | 8,209 |
| 6 | Lawrence Livermore National Laboratory, USA | Sequoia, BlueGene/Q, 2011, IBM | 1,572,864 | 17,173 | 20,133 | 7,890 |
| 7 | DOE/NNSA/LANL/SNL, USA | Trinity, Cray XC40, Intel Xeon Phi 7250 68C 1.4GHz, Cray Aries, 2017, Cray | 979,968 | 14,137 | 43,903 | 3,844 |
| 8 | DOE/SC/LBNL/NERSC, USA | Cori, Cray XC40, Intel Xeon Phi 7250 68C 1.4GHz, Cray Aries, 2016, Cray | 622,336 | 14,015 | 27,881 | 3,939 |
| 9 | Joint Center for Advanced High Performance Computing, Japan | Oakforest-PACS, PRIMERGY CX600 M1, Intel Xeon Phi 7250 68C 1.4GHz, Intel Omni-Path, 2016, Fujitsu | 557,056 | 13,555 | 24,914 | 2,719 |
| 10 | RIKEN AICS, Japan | K computer, SPARC64 VIIIfx, 2011, Fujitsu | 705,024 | 10,510 | 11,280 | 12,660 |
15 Computational Science: The 3rd Pillar of Science. Alongside theoretical and experimental science, computational science, i.e., simulations using supercomputers, forms the third pillar of science.
16 Methods for Scientific Computing. Numerical solution of PDEs (partial differential equations): grids, meshes, and particles lead to large-scale linear equations. Finer meshes provide more accurate solutions.
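As a concrete (textbook, not from the slides) instance of this pipeline: the 1D Poisson equation -u''(x) = f(x) with u(0) = u(1) = 0, discretized by finite differences on a mesh of size h = 1/(n+1) with u_i ≈ u(ih), becomes

    ( -u_{i-1} + 2 u_i - u_{i+1} ) / h^2 = f_i ,    i = 1, ..., n

i.e., a sparse linear system A u = f with n unknowns. Halving h doubles the number of unknowns (in 1D) while the discretization error shrinks as O(h^2): finer meshes mean both more accuracy and larger linear systems.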
17 3D Simulations of the Earthquake Generation Cycle: the San Andreas Fault, CA, USA. Stress accumulation at transcurrent plate boundaries.
18 Adaptive FEM: high resolution is needed at meshes with large deformation (large accumulation).
19 Typhoon Simulations by FDM, Effect of Resolution: h = 100 km, h = 50 km, h = 5 km. [JAMSTEC]
20 Simulation of Typhoon MANGKHUT in 2003 using the Earth Simulator. [JAMSTEC]
21 (1) Atmosphere-Ocean Coupling on OFP by NICAM/COCO/ppOpen-MATH/MP. A high-resolution global atmosphere-ocean coupled simulation by NICAM and COCO (ocean simulation) through ppOpen-MATH/MP has been achieved on the K computer. ppOpen-MATH/MP is coupling software for models employing various discretization methods. An O(km)-mesh NICAM-COCO coupled simulation is planned on the Oakforest-PACS system (3.5 km-0.10 deg., 5+ billion meshes): a big challenge for optimization of the codes on the new Intel Xeon Phi processor, with new insights for the understanding of global climate dynamics. [c/o M. Satoh (AORI/U.Tokyo) @SC16]
22 El Niño Simulations (1/2) [U.Tokyo, RIKEN]. The mechanism of the abrupt termination of the super El Niño of 1997/1998 has been investigated by atmosphere-ocean coupling simulations of the entire Earth using ppOpen-HPC on the K computer.
23 El Niño Simulations (2/2) [U.Tokyo, RIKEN]. A Madden-Julian Oscillation (MJO) event remotely accelerates ocean upwelling to abruptly terminate the 1997/1998 super El Niño. NICAM (atmosphere) and COCO (ocean) are coupled by ppOpen-MATH/MP; data assimilation + simulations; integration of simulations + data analyses. NICAM at 14 km, coupled with COCO, on the K computer. Next target: 3.5 km-0.10 deg. (5B meshes) on OFP.
24 (2) Simulation of Geologic CO2 Storage. [Dr. Hajime Yamamoto, Taisei]
25 25 Simulation of Geologic CO 2 Storage International/Interdisciplinary Collaborations Taisei (Science, Modeling) Lawrence Berkeley National Laboratory, USA (Modeling) Information Technology Center, the University of Tokyo (Algorithm, Software) JAMSTEC (Earth Simulator Center) (Software, Hardware) NEC (Software, Hardware) 2010 Japan Geotechnical Society (JGS) Award
26 Simulation of Geologic CO2 Storage. Science: behavior of CO2 in the supercritical state in a deep reservoir. PDEs: 3D multiphase flow (liquid/gas) + 3D mass transfer. Method for computation: the TOUGH2 code, based on FVM and developed by Lawrence Berkeley National Laboratory, USA; more than 90% of the computation time is spent solving large-scale linear equations with more than 10^7 unknowns. Numerical algorithm: a fast algorithm for large-scale linear equations developed by the Information Technology Center, The University of Tokyo. Supercomputers: Earth Simulator II (NEC SX-9, JAMSTEC, 130 TFLOPS) and Oakleaf-FX (Fujitsu PRIMEHPC FX10, U.Tokyo, 1.13 PFLOPS).
27 Diffusion-Dissolution-Convection Process. (Figure: injection well, caprock (low-permeability seal), supercritical CO2, convective mixing.) Buoyant scCO2 overrides onto groundwater; dissolution of CO2 increases the water density; the denser fluid lies on the lighter fluid, and the Rayleigh-Taylor instability invokes convective mixing of the groundwater. The mixing significantly enhances CO2 dissolution into groundwater, resulting in more stable storage. Preliminary 2D simulation (Yamamoto et al., GHGT-11). [Dr. Hajime Yamamoto, Taisei]
28 Density convections for 1,000 years (flow model). Only the far side of the vertical cross-section passing through the injection well is depicted. The meter-scale fingers gradually developed into larger ones in the field-scale model. A huge number of time steps (>10^5) was required to complete the 1,000-year simulation. The onset time (10-20 yrs) is comparable to theory (linear stability analysis: 15.5 yrs). [Dr. Hajime Yamamoto, Taisei]
29 Simulation of Geologic CO2 Storage: parallel performance. 30 million DoF (10 million grid nodes x 3 DoF/node); 3D multiphase flow (liquid/gas) + 3D mass transfer. (Plot: average time for solving the matrix for one time step vs. number of processors.) TOUGH2-MP on the Fujitsu FX10 (Oakleaf-FX), 30M DoF: 2x-3x speedup. [Dr. Hajime Yamamoto, Taisei]
30 (3) 3D Groundwater Flow Simulation with up to 4,096 nodes on the Fujitsu FX10 (GMG-CG): up to 17,179,869,184 meshes (64^3 meshes/core). Linear equations with 17B unknowns can be solved in 7 sec. (Plot: elapsed time in sec vs. core count for cases HB 8x2: C0, C1, C2, C3.)
31 Motivation for Parallel Computing, again. Large-scale parallel computers enable fast computing in large-scale scientific simulations with detailed models; computational science opens new frontiers of science and engineering. Why parallel computing? Faster & larger. "Larger" matters more from the viewpoint of new frontiers of science & engineering, but "faster" is also important; on top of that, models become more complicated. Ideal: scalable. Weak scaling: solving an N-times larger problem using N times the computational resources in the same computation time. Strong scaling: solving a fixed-size problem using N times the computational resources in 1/N the computation time.
32 Supercomputers and Computational Science: Overview of the Class.
33 Goal of SC4SC 2018. If you want to do something on supercomputers, you have to learn parallel programming!! Introduction to MPI & OpenMP. MPI (Message Passing Interface): grasp the idea of SPMD (Single Program, Multiple Data). OpenMP: multithreading. Parallel application: the finite-volume method (FVM); we start at this part. Data structures for parallel computing; parallel FVM by OpenMP; parallel FVM by OpenMP/MPI hybrid.
34 Structure of this Short Course. Feb. 22 (Th) and Feb. 23 (F): Part I (Katagiri), Introduction to OpenMP & MPI. Feb. 24 (Sa) and Feb. 25 (Su): Part II (Nakajima), Parallel FVM.
35 SPMD. PE: Processing Element (processor, domain, process). mpirun -np M <Program> launches M copies of the same program: PE #0, PE #1, PE #2, ..., PE #M-1, working on Data #0, Data #1, Data #2, ..., Data #M-1. Each process performs the same operations on different data: the large-scale data is decomposed, and each part is computed by a separate process. Ideally, the parallel program differs from the serial one only in its communication. You understand 90% of MPI if you understand this figure; a minimal sketch follows.
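A minimal SPMD sketch in Fortran (an illustrative example invented for this introduction, not course code): every process executes the same program, but fills and sums its own block of the data.

    program spmd_example
      use mpi
      implicit none
      integer, parameter :: N = 8                ! local data size per process
      integer :: ierr, my_rank, nprocs, i
      real(8) :: x(N), local_sum

      call MPI_Init (ierr)
      call MPI_Comm_rank (MPI_COMM_WORLD, my_rank, ierr)
      call MPI_Comm_size (MPI_COMM_WORLD, nprocs, ierr)

    ! every PE runs this same code, but on its own block of the global data
      do i= 1, N
        x(i)= dble(my_rank*N + i)
      enddo
      local_sum= sum(x)
      write (*,*) 'PE #', my_rank, ' of ', nprocs, ': local_sum =', local_sum

      call MPI_Finalize (ierr)
    end program spmd_example

Compiled with mpif90 and launched as mpirun -np 4 ./spmd_example, the same program runs four times; only the data, and hence the printed sums, differ.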
36 Our Current Target: Multicore Clusters. Multicore CPUs are connected through a network; each node has its own memory. OpenMP: multithreading; intra-node (intra-CPU); shared memory. MPI: message passing; inter-node (inter-CPU); distributed memory.
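For the intra-node (shared-memory) layer, a minimal OpenMP sketch, again illustrative rather than course code: a single directive distributes the loop iterations over the threads of one node.

    program omp_axpy
      use omp_lib
      implicit none
      integer, parameter :: N = 1000000
      integer :: i
      real(8), allocatable :: x(:), y(:)
      real(8) :: a

      allocate (x(N), y(N))
      a= 2.0d0
      x= 1.0d0
      y= 0.0d0

    !$omp parallel do
      do i= 1, N
        y(i)= y(i) + a*x(i)         ! iterations are shared among the threads
      enddo
    !$omp end parallel do

      write (*,*) 'threads =', omp_get_max_threads(), '  y(1) =', y(1)
      deallocate (x, y)
    end program omp_axpy

The thread count is chosen at run time, e.g. export OMP_NUM_THREADS=16, with no change to the source.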
38 Flat MPI vs. Hybrid. Flat MPI: each core runs an independent MPI process; MPI only, both within and between nodes. Hybrid: hierarchical structure; OpenMP multithreading within each node (shared memory) and MPI between nodes.
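As a concrete illustration (typical mpirun/OpenMP usage; exact options vary by MPI implementation): on a single 16-core node, flat MPI would run mpirun -np 16 ./a.out, one single-threaded process per core, while the hybrid model would run OMP_NUM_THREADS=16 mpirun -np 1 ./a.out, one MPI process with 16 OpenMP threads inside it.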
39 Example of OpenMP/MPI Hybrid: Sending Messages to Neighboring Processes. MPI: message passing; OpenMP: threading with directives.

    !C
    !C-- SEND
          do neib= 1, NEIBPETOT
            II    = (LEVEL-1)*NEIBPETOT
            istart= STACK_EXPORT(II+neib-1)
            inum  = STACK_EXPORT(II+neib  ) - istart
    !$omp parallel do
          do k= istart+1, istart+inum
            WS(k-NE0)= X(NOD_EXPORT(k))              ! OpenMP threads pack the send buffer
          enddo
            call MPI_Isend (WS(istart+1-NE0), inum, MPI_DOUBLE_PRECISION, &
         &                  NEIBPE(neib), 0, MPI_COMM_WORLD,              &
         &                  req1(neib), ierr)        ! non-blocking send to each neighbor
          enddo
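The matching receive side of such a halo exchange typically looks like the following sketch (the arrays STACK_IMPORT, NOD_IMPORT, WR and the request/status arrays req2, sta1, sta2 are assumed here by analogy with the send side; this shows the standard pattern, not necessarily the course's exact code):

    !C
    !C-- RECEIVE
          do neib= 1, NEIBPETOT
            istart= STACK_IMPORT(neib-1)
            inum  = STACK_IMPORT(neib  ) - istart
            call MPI_Irecv (WR(istart+1), inum, MPI_DOUBLE_PRECISION,     &
         &                  NEIBPE(neib), 0, MPI_COMM_WORLD,              &
         &                  req2(neib), ierr)        ! non-blocking receive from each neighbor
          enddo

          call MPI_Waitall (NEIBPETOT, req2, sta2, ierr)   ! all halo data has arrived

          do neib= 1, NEIBPETOT
            istart= STACK_IMPORT(neib-1)
            inum  = STACK_IMPORT(neib  ) - istart
    !$omp parallel do
          do k= istart+1, istart+inum
            X(NOD_IMPORT(k))= WR(k)                  ! threads unpack into halo points
          enddo
          enddo

          call MPI_Waitall (NEIBPETOT, req1, sta1, ierr)   ! complete the non-blocking sends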
40 Prerequisites. Experience with Unix/Linux. Experience of programming (Fortran or C/C++). Fundamental numerical algorithms (Gaussian elimination, LU factorization, Jacobi/Gauss-Seidel/SOR iterative solvers). Experience with the SSH public-key authentication method.
41 Preparation. Windows: Cygwin; please install gcc (C) or gfortran (Fortran) and SSH!! Also install ParaView. MacOS, UNIX/Linux: install ParaView.