Introduction to Parallel Programming for Multicore/Manycore Clusters


1 Introduction to Parallel Programming for Multicore/Manycore Clusters: Introduction. Kengo Nakajima, Information Technology Center, The University of Tokyo

2 Descriptions of the Class: Technical & Scientific Computing I (科学技術計算 Ⅰ), Department of Mathematical Informatics; Special Lecture on Computational Science I (計算科学アライアンス特別講義 Ⅰ), Department of Computer Science; Multithreaded Parallel Computing (スレッド並列コンピューティング, from 2016), Department of Electrical Engineering & Information Systems. This class is certified as a "Category F" lecture of the Computational Alliance, the University of Tokyo.

3 Introduction to FEM Programming. FEM: Finite-Element Method (有限要素法). Formerly: Summer (I): FEM programming for solid mechanics; Winter (II): parallel FEM using MPI; the 1st part (summer) was essential for the 2nd part (winter). Problems: many new (international) students in winter who did not take the 1st part in summer (they are generally more diligent than Japanese students). 2015: Summer (I): multicore programming by OpenMP; Winter (II): FEM + parallel FEM by MPI/OpenMP for heat transfer; Parts I & II are independent (maybe...). 2017: lectures are given in English, using the Reedbush-U supercomputer system.

4 Motivation for Parallel Computing (and this class). Large-scale parallel computers enable fast computing in large-scale scientific simulations with detailed models. Computational science develops new frontiers of science and engineering. Why parallel computing? Faster & larger: "larger" is more important from the viewpoint of new frontiers of science & engineering, but "faster" is also important, plus "more complicated". Ideal: scalable. Solving an N-times larger problem using N times the computational resources in the same computation time (weak scaling); solving a fixed-size problem using N times the computational resources in 1/N of the computation time (strong scaling).
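Stated compactly (a standard textbook formulation rather than a formula taken from the slides), with $T(n,p)$ the time to solve a problem of size $n$ on $p$ processes:

\[
S_{\mathrm{strong}}(p) = \frac{T(n,1)}{T(n,p)} \quad (\text{ideal: } S_{\mathrm{strong}} = p),
\qquad
E_{\mathrm{weak}}(p) = \frac{T(n,1)}{T(p\,n,\,p)} \quad (\text{ideal: } E_{\mathrm{weak}} = 1).
\]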

5 Scientific Computing = SMASH: Science, Modeling, Algorithm, Software, Hardware. You have to learn many things. Collaboration (or co-design) will be important for the future career of each of you, as a scientist and/or an engineer. You have to communicate with people with different backgrounds, which is more difficult than communicating with foreign scientists from the same area. (Q): Computer Science, Computational Science, or Numerical Algorithms?

6 This Class... (Science, Modeling, Algorithm, Software, Hardware). Target: parallel FVM (Finite-Volume Method) using OpenMP. Science: 3D Poisson equation; Modeling: FVM; Algorithm: iterative solvers etc. You have to know many components to learn FVM, although you have already learned each of these in undergraduate and high-school classes.
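For reference, the 3D Poisson equation targeted above can be written in its standard form (boundary conditions, which the class materials specify separately, are omitted here):

\[
\nabla^{2}\phi = \frac{\partial^{2}\phi}{\partial x^{2}} + \frac{\partial^{2}\phi}{\partial y^{2}} + \frac{\partial^{2}\phi}{\partial z^{2}} = f(x,y,z).
\]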

7 Road to Programming for Parallel Scientific Computing (top to bottom): Programming for Parallel Scientific Computing (e.g. parallel FEM/FDM); Programming for Real-World Scientific Computing (e.g. FEM, FDM); big gap here!!; Programming for Fundamental Numerical Analysis (e.g. Gauss-Seidel, RK etc.); Unix, Fortran, C etc.

8 The third step is important! How to parallelize applications? How to extract parallelism? If you understand the methods, algorithms, and implementations of the original code, it is easy. Data structure is important. 4. Programming for Parallel Scientific Computing (e.g. parallel FEM/FDM); 3. Programming for Real-World Scientific Computing (e.g. FEM, FDM); 2. Programming for Fundamental Numerical Analysis (e.g. Gauss-Seidel, RK etc.); 1. Unix, Fortran, C etc. How to understand the code? Reading the application code!! It seems primitive, but it is very effective. In this class, reading the source code is encouraged. In this class, 3 = FVM and 4 = parallel FVM.

9 Kengo Nakajima 中島研吾 (1/2). Current positions: Professor, Supercomputing Research Division, Information Technology Center, The University of Tokyo (情報基盤センター); Department of Mathematical Informatics, Graduate School of Information Science & Engineering, The University of Tokyo (情報理工 数理情報学); Department of Electrical Engineering and Information Systems, Graduate School of Engineering, The University of Tokyo (工 電気系工学); Deputy Director, RIKEN CCS (Center for Computational Science), Kobe (20%). Research interests: high-performance computing; parallel numerical linear algebra (preconditioning); parallel programming models; computational mechanics, computational fluid dynamics; adaptive mesh refinement, parallel visualization.

10 Kengo Nakajima (2/2). Education: B.Eng. (Aeronautics, The University of Tokyo, 1985); M.S. (Aerospace Engineering, University of Texas, 1993); Ph.D. (Quantum Engineering & System Sciences, The University of Tokyo, 2003). Professional experience: Mitsubishi Research Institute, Inc.; Research Organization for Information Science & Technology; The University of Tokyo, Department of Earth & Planetary Science; Information Technology Center (2008-); JAMSTEC (part-time); RIKEN (part-time).

11 Supercomputers and Computational Science / Overview of the Class / Future Issues

12 Computer & CPU. Central Processing Unit (中央処理装置): CPUs used in PCs and supercomputers are based on the same architecture. GHz: clock rate (frequency), the number of operations by the CPU per second; ~GHz -> 10^9 operations/sec. Simultaneous execution of 4-8 instructions per clock.

13 Core (コア). Single-core CPU: 1 core/CPU; multicore: dual = 2 cores/CPU, quad = 4 cores/CPU. A core is the central part of the CPU. Multicore CPUs with 4-8 cores are popular (low power). GPU: manycore, O(10^1)-O(10^2) cores. More and more cores -> parallel computing. Reedbush-U: 18 cores x 2, Intel Xeon Broadwell-EP.

14 GPU/Manycores. GPU: Graphics Processing Unit; GPGPU: General-Purpose GPU; O(10^2) cores; high memory bandwidth; cheap; no stand-alone operation (a host CPU is needed); programming: CUDA, OpenACC. Intel Xeon Phi: manycore, 60+ cores; high memory bandwidth; Unix, Fortran, C compilers; a host was needed in the 1st generation, but stand-alone operation is possible now (Knights Landing, KNL).
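As a small, hedged sketch of the directive-based OpenACC style mentioned above (not code from this class; array names and sizes are made up): with an OpenACC-capable compiler the annotated loop is offloaded to the GPU, while a plain C compiler simply ignores the pragma and runs it on the host.

#include <stdio.h>

#define N 1000000

int main(void) {
  static double x[N], y[N];
  for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

  /* With an OpenACC compiler this loop runs on the GPU; the data clauses
   * move x to the device and copy y back afterwards.                     */
  #pragma acc parallel loop copyin(x) copy(y)
  for (int i = 0; i < N; i++) {
    y[i] = y[i] + 2.0 * x[i];
  }

  printf("y[0] = %f\n", y[0]);
  return 0;
}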

15 Parallel Supercomputers: multicore CPUs are connected through a network.

16 Supercomputers with Heterogeneous/Hybrid Nodes: each node combines CPUs with accelerators (GPU or manycore), and the nodes are connected through a network.

17 Performance of Supercomputers. Performance of a CPU: clock rate; FLOPS (Floating-point Operations Per Second), for real numbers. Recent multicore CPUs: 4-8 FLOP operations per clock. (e.g.) Peak performance of a core with a 3 GHz clock: 3 x 4 (or 8) = 12 (or 24) x 10^9 FLOPS = 12 (or 24) GFLOPS. 10^6 FLOPS = 1 Mega FLOPS = 1 MFLOPS; 10^9 FLOPS = 1 Giga FLOPS = 1 GFLOPS; 10^12 FLOPS = 1 Tera FLOPS = 1 TFLOPS; 10^15 FLOPS = 1 Peta FLOPS = 1 PFLOPS; 10^18 FLOPS = 1 Exa FLOPS = 1 EFLOPS.

18 Peak Performance of Reedbush-U. Intel Xeon E5-2695 v4 (Broadwell-EP), 2.1 GHz, 16 DP (double-precision) FLOP operations per clock. Peak performance (1 core) = 2.1 x 16 = 33.6 GFLOPS. Peak performance per socket (18 cores): 604.8 GFLOPS; per node (2 sockets, 36 cores): 1,209.6 GFLOPS.
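A few lines of C reproducing the peak-performance arithmetic above (all numbers are the ones quoted on the slide; nothing is measured on the machine):

#include <stdio.h>

int main(void) {
  const double clock_ghz      = 2.1;   /* Broadwell-EP clock rate on Reedbush-U   */
  const double flops_per_clk  = 16.0;  /* DP FLOP operations per clock, per core  */
  const int    cores_per_sock = 18;
  const int    socks_per_node = 2;

  double core_gf = clock_ghz * flops_per_clk;    /* 33.6 GFLOPS            */
  double sock_gf = core_gf * cores_per_sock;     /* 604.8 GFLOPS           */
  double node_gf = sock_gf * socks_per_node;     /* 1209.6 GFLOPS per node */

  printf("core: %.1f GFLOPS, socket: %.1f GFLOPS, node: %.1f GFLOPS\n",
         core_gf, sock_gf, node_gf);
  return 0;
}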

19 TOP500 List: a ranking list of the supercomputers in the world. Performance (FLOPS rate) is measured by Linpack, which solves large-scale linear equations. Published since 1993 and updated twice a year (at international conferences in June and November). An iPhone version of Linpack is available.

20 PFLOPS: Peta (=10^15) FLoating-point OPerations per Second. Exa-FLOPS (=10^18) will be attained after 2021 (?).

21 Benchmarks. TOP500 (Linpack, HPL (High Performance Linpack)): direct linear solvers, FLOPS rate; regular dense matrices, continuous memory access; measures computing performance. HPCG: preconditioned iterative solvers, FLOPS rate; irregular sparse matrices derived from FEM applications, with many zero components; irregular/random memory access, closer to real applications than HPL; measures performance of memory and communications. Green500: FLOPS/W rate for HPL (TOP500).
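To make the memory-access contrast between HPL and HPCG concrete, here is a minimal sketch with illustrative arrays (this is not benchmark code): the dense product streams A and x contiguously, while the CRS (compressed row storage) sparse product gathers x through an index list.

#include <stdio.h>

/* Dense y = A*x: contiguous access to A and x (HPL-like behavior). */
static void dense_mv(int n, const double *A, const double *x, double *y) {
  for (int i = 0; i < n; i++) {
    double s = 0.0;
    for (int j = 0; j < n; j++) s += A[i * n + j] * x[j];
    y[i] = s;
  }
}

/* Sparse y = A*x in CRS format: x is gathered through item[]
   (HPCG-like, irregular/indirect memory access). */
static void crs_spmv(int n, const int *index, const int *item,
                     const double *val, const double *x, double *y) {
  for (int i = 0; i < n; i++) {
    double s = 0.0;
    for (int k = index[i]; k < index[i + 1]; k++) s += val[k] * x[item[k]];
    y[i] = s;
  }
}

int main(void) {
  /* 3x3 example: dense A and the same matrix stored in CRS (zeros dropped). */
  double A[9]   = {2, -1, 0, -1, 2, -1, 0, -1, 2};
  double x[3]   = {1, 1, 1}, y[3];
  int    index[4] = {0, 2, 5, 7};
  int    item[7]  = {0, 1, 0, 1, 2, 1, 2};
  double val[7]   = {2, -1, -1, 2, -1, -1, 2};

  dense_mv(3, A, x, y);
  printf("dense : %g %g %g\n", y[0], y[1], y[2]);
  crs_spmv(3, index, item, val, x, y);
  printf("sparse: %g %g %g\n", y[0], y[1], y[2]);
  return 0;
}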

22 50th TOP500 List (November, 2017). Rmax: Linpack performance (TFLOPS); Rpeak: peak performance (TFLOPS); Power: kW.
1. Sunway TaihuLight (Sunway MPP, Sunway SW26010 260C 1.45GHz, 2016, NRCPC), National Supercomputing Center in Wuxi, China: 10,649,600 cores, Rmax 93,015 (= 93.0 PF), Rpeak 125,436, Power 15,371
2. Tianhe-2 (Intel Xeon E5-2692, TH Express-2, Xeon Phi, 2013, NUDT), National Supercomputing Center in Tianjin, China: 3,120,000 cores, Rmax 33,863 (= 33.9 PF), Rpeak 54,902, Power 17,808
3. Piz Daint (Cray XC30/NVIDIA P100, 2013, Cray), Swiss Natl. Supercomputer Center, Switzerland: 361,760 cores, Rmax 19,590, Rpeak 25,326, Power 2,272
4. Gyoukou (ZettaScaler-2.2 HPC, Xeon D-1571, 2017, ExaScaler), JAMSTEC, Japan: 19,860,000 cores, Rmax 19,136, Rpeak 28,192, Power 1,350
5. Titan (Cray XK7/NVIDIA K20x, 2012, Cray), Oak Ridge National Laboratory, USA: 560,640 cores, Rmax 17,590, Rpeak 27,113, Power 8,209
6. Sequoia (BlueGene/Q, 2011, IBM), Lawrence Livermore National Laboratory, USA: 1,572,864 cores, Rmax 17,173, Rpeak 20,133, Power 7,890
7. Trinity (Cray XC40, Intel Xeon Phi 1.4GHz, Cray Aries, 2017, Cray), DOE/NNSA/LANL/SNL, USA: 979,968 cores, Rmax 14,137, Rpeak 43,903, Power 3,844
8. Cori (Cray XC40, Intel Xeon Phi 1.4GHz, Cray Aries, 2016, Cray), DOE/SC/LBNL/NERSC, USA: 632,400 cores, Rmax 14,015, Rpeak 27,881, Power 3,939
9. Oakforest-PACS (PRIMERGY CX600 M1, Intel Xeon Phi 1.4GHz, Intel Omni-Path, 2016, Fujitsu), Joint Center for Advanced High Performance Computing, Japan: 557,056 cores, Rmax 13,555, Rpeak 24,914, Power 2,719
10. K computer (SPARC64 VIIIfx, 2011, Fujitsu), RIKEN AICS, Japan: 705,024 cores, Rmax 10,510, Rpeak 11,280, Power 12,660

23 HPCG Ranking (November, 2017): 1. K computer; 2. Tianhe-2 (MilkyWay-2); 3. Trinity; 4. Piz Daint; 5. Sunway TaihuLight; 6. Oakforest-PACS; 7. Cori; 8. Sequoia; 9. Titan; 10. TSUBAME 3.0.

24 Green500 Ranking (November, 2016):
1. DGX SATURNV (NVIDIA DGX-1, Xeon E5-2698v4 20C 2.2GHz, Infiniband EDR, NVIDIA Tesla P100), NVIDIA Corporation
2. Piz Daint (Cray XC50, Xeon E5-2690v3 12C 2.6GHz, Aries interconnect, NVIDIA Tesla P100), Swiss National Supercomputing Centre (CSCS)
3. Shoubu (ZettaScaler-1.6 etc.), RIKEN ACCS
4. Sunway TaihuLight (Sunway MPP, Sunway SW26010 260C 1.45GHz, Sunway), National SC Center in Wuxi
5. QPACE3 (PRIMERGY CX1640 M1, Intel Xeon Phi 1.3GHz, Intel Omni-Path), SFB/TR55 at Fujitsu Tech. Solutions GmbH
6. Oakforest-PACS (PRIMERGY CX1640 M1, Intel Xeon Phi 1.4GHz, Intel Omni-Path), JCAHPC
7. Theta (Cray XC40, Intel Xeon Phi 1.3GHz, Aries interconnect), DOE/SC/Argonne National Lab.
8. XStream (Cray CS-Storm, Intel Xeon E5-2680v2 10C 2.8GHz, Infiniband FDR, Nvidia K80), Stanford Research Computing Center
9. Camphor 2 (Cray XC40, Intel Xeon Phi 1.4GHz, Aries interconnect), ACCMS, Kyoto University
10. SciPhi XVI (KOI Cluster, Intel Xeon Phi 1.3GHz, Intel Omni-Path), Jefferson Natl. Accel. Facility

25 Green500 Ranking (June, 2017):
1. TSUBAME3.0 (SGI ICE XA, IP139-SXM2, Xeon E5-2680v4, NVIDIA Tesla P100 SXM2, HPE), Tokyo Tech.
2. kukai (ZettaScaler-1.6, Xeon E5-2650Lv4, NVIDIA Tesla P100, ExaScaler), Yahoo Japan
3. AIST AI Cloud (NEC 4U-8GPU Server, Xeon E5-2630Lv4, NVIDIA Tesla P100 SXM2, NEC), AIST, Japan
4. RAIDEN GPU subsystem (NVIDIA DGX-1, Xeon E5-2698v4, NVIDIA Tesla P100, Fujitsu), CAIP, RIKEN, Japan
5. Wilkes-2 (Dell C4130, Xeon E5-2650v4, NVIDIA Tesla P100, Dell), Univ. Cambridge, UK
6. Piz Daint (Cray XC50, Xeon E5-2690v3, NVIDIA Tesla P100, Cray Inc.), Swiss Natl. SC. Center (CSCS)
7. Gyoukou (ZettaScaler-2.0 HPC system, Xeon D-1571, PEZY-SC2, ExaScaler), JAMSTEC, Japan
8. GOSAT-2 (RCF2) (SGI Rackable C1104-GP1, Xeon E5-2650v4, NVIDIA Tesla P100, NSSOL/HPE), Inst. for Env. Studies, Japan
9. Penguin Relion (Xeon E5-2698v4/E5-2650v4, NVIDIA Tesla P100, Acer Group), Facebook, USA
10. DGX Saturn V (Xeon E5-2698v4, NVIDIA Tesla P100, Nvidia), NVIDIA, USA
11. Reedbush-H (SGI Rackable C1102-GP8, Xeon E5-2695v4, NVIDIA Tesla P100 SXM2, HPE), ITC, U.Tokyo, Japan

26 Green500 Ranking (November, 2017):
1. Shoubu system B (ZettaScaler-2.2 HPC system, Xeon D-1571, PEZY-SC2, ExaScaler), RIKEN, Japan
2. Suiren2 (ZettaScaler-2.2 HPC system, Xeon D-1571, PEZY-SC2, ExaScaler), KEK, Japan
3. Sakura (ZettaScaler-2.2 HPC system, Xeon E5-2618Lv3, PEZY-SC2, ExaScaler), PEZY, Japan
4. DGX Saturn V Volta (Xeon E5-2698v4, NVIDIA Tesla V100, Nvidia), NVIDIA, USA
5. Gyoukou (ZettaScaler-2.2 HPC system, Xeon D-1571, PEZY-SC2, ExaScaler), JAMSTEC, Japan
6. TSUBAME3.0 (SGI ICE XA, IP139-SXM2, Xeon E5-2680v4, NVIDIA Tesla P100 SXM2, HPE), Tokyo Tech.
7. AIST AI Cloud (NEC 4U-8GPU Server, Xeon E5-2630Lv4, NVIDIA Tesla P100 SXM2, NEC), AIST, Japan
8. RAIDEN GPU subsystem (NVIDIA DGX-1, Xeon E5-2698v4, NVIDIA Tesla P100, Fujitsu), CAIP, RIKEN, Japan
9. Wilkes-2 (Dell C4130, Xeon E5-2650v4, NVIDIA Tesla P100, Dell), Univ. Cambridge, UK
10. Piz Daint (Cray XC50, Xeon E5-2690v3, NVIDIA Tesla P100, Cray Inc.), Swiss Natl. SC. Center (CSCS)
11. Reedbush-L (SGI Rackable C1102-GP8, Xeon E5-2695v4, NVIDIA Tesla P100 SXM2, HPE), ITC, U.Tokyo, Japan

27 Computational Science: the 3rd pillar of science, alongside theoretical and experimental science; simulations using supercomputers.

28 Methods for Scientific Computing: numerical solutions of PDEs (partial differential equations); grids, meshes, particles; large-scale linear equations. Finer meshes provide more accurate solutions.
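As a one-line illustration of how a mesh turns such a PDE into a large linear system, using the standard second-order stencil on a uniform 1D grid with spacing $h$ (boundary conditions omitted):

\[
\frac{\phi_{i-1} - 2\phi_i + \phi_{i+1}}{h^{2}} = f_i, \quad i = 1,\dots,N
\quad\Longrightarrow\quad A\,\boldsymbol{\phi} = \boldsymbol{f},
\]

with $A$ a sparse $N \times N$ matrix; refining the mesh (halving $h$ in 3D) multiplies the number of unknowns by 8, which is why finer, more accurate models need larger machines.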

29 3D Simulations for the Earthquake Generation Cycle, San Andreas Fault, CA, USA: stress accumulation at transcurrent plate boundaries.

30 Adaptive FEM: high resolution is needed where the mesh has large deformation (large stress accumulation).

31 Typhoon Simulations by FDM: effect of resolution (h=100km, h=50km, h=5km) [JAMSTEC]

32 Simulation of Typhoon MANGKHUT in 2003 using the Earth Simulator [JAMSTEC]

33 Simulation of Geologic CO2 Storage [Dr. Hajime Yamamoto, Taisei]

34 Simulation of Geologic CO2 Storage: international/interdisciplinary collaborations. Taisei (science, modeling); Lawrence Berkeley National Laboratory, USA (modeling); Information Technology Center, the University of Tokyo (algorithm, software); JAMSTEC (Earth Simulator Center) (software, hardware); NEC (software, hardware). 2010 Japan Geotechnical Society (JGS) Award.

35 Simulation of Geologic CO2 Storage. Science: behavior of CO2 in the supercritical state in a deep reservoir. PDEs: 3D multiphase flow (liquid/gas) + 3D mass transfer. Method for computation: TOUGH2 code based on FVM, developed by Lawrence Berkeley National Laboratory, USA; more than 90% of the computation time is spent solving large-scale linear equations with more than 10^7 unknowns. Numerical algorithm: a fast algorithm for large-scale linear equations developed by the Information Technology Center, the University of Tokyo. Supercomputers: Earth Simulator II (NEC SX-9, JAMSTEC, 130 TFLOPS), Oakleaf-FX (Fujitsu PRIMEHPC FX10, U.Tokyo, 1.13 PFLOPS).

36 Diffusion-Dissolution-Convection Process (figure: injection well, caprock (low-permeability seal), supercritical CO2, convective mixing). Buoyant scCO2 overrides onto groundwater; dissolution of CO2 increases the water density; denser fluid lies on lighter fluid; the Rayleigh-Taylor instability invokes convective mixing of the groundwater; the mixing significantly enhances CO2 dissolution into the groundwater, resulting in more stable storage. Preliminary 2D simulation (Yamamoto et al., GHGT11) [Dr. Hajime Yamamoto, Taisei] COPYRIGHT TAISEI CORPORATION ALL RIGHTS RESERVED

37 Density convections for 1,000 years (flow model; only the far side of the vertical cross-section passing through the injection well is depicted) [Dr. Hajime Yamamoto, Taisei]. The meter-scale fingers gradually developed into larger ones in the field-scale model. A huge number of time steps (> 10^5) was required to complete the 1,000-year simulation. The onset time (10-20 years) is comparable to theory (linear stability analysis, 15.5 years). COPYRIGHT TAISEI CORPORATION ALL RIGHTS RESERVED

38 Simulation of Geologic CO2 Storage: 30 million DoF (10 million grid cells x 3 DoF/grid node). (Figure: average time for solving the matrix for one time step, calculation time (sec) vs. number of processors, TOUGH2-MP on FX10.) [Dr. Hajime Yamamoto, Taisei] Fujitsu FX10 (Oakleaf-FX), 30M DOF: 2x-3x improvement. 3D multiphase flow (liquid/gas) + 3D mass transfer. COPYRIGHT TAISEI CORPORATION ALL RIGHTS RESERVED

39 Motivation for Parallel Computing, again. Large-scale parallel computers enable fast computing in large-scale scientific simulations with detailed models. Computational science develops new frontiers of science and engineering. Why parallel computing? Faster & larger: "larger" is more important from the viewpoint of new frontiers of science & engineering, but "faster" is also important, plus "more complicated". Ideal: scalable. Solving an N-times larger problem using N times the computational resources in the same computation time (weak scaling); solving a fixed-size problem using N times the computational resources in 1/N of the computation time (strong scaling).

40 Supercomputers and Computational Science / Overview of the Class / Future Issues

41 Our Current Target: Multicore Cluster. Multicore CPUs (each with its own memory) are connected through a network. OpenMP: multithreading, intra-node (intra-CPU), shared memory. MPI: message passing, inter-node (inter-CPU), distributed memory.

43 Our Current Target: Multicore Cluster. Multicore CPUs (each with its own memory) are connected through a network. OpenMP: multithreading, intra-node (intra-CPU), shared memory. MPI (after October): message passing, inter-node (inter-CPU), distributed memory.

44 Flat MPI vs. Hybrid. Flat MPI: each core runs an independent MPI process (MPI only, intra/inter node). Hybrid: hierarchical structure, OpenMP multithreading within each node (or socket) and MPI between them.
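A minimal MPI + OpenMP "hello world" in C (a generic sketch, not the class code) makes the two layers visible: MPI ranks correspond to the processes with their own memory, and OpenMP threads to the cores sharing that memory; with one thread per process it degenerates to flat MPI.

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv) {
  int provided, rank, size;

  MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  /* Each MPI process (distributed memory) spawns OpenMP threads that
   * share the process's memory (shared memory).                       */
  #pragma omp parallel
  {
    printf("MPI rank %d of %d, OpenMP thread %d of %d\n",
           rank, size, omp_get_thread_num(), omp_get_num_threads());
  }

  MPI_Finalize();
  return 0;
}

For example, launching it with something like mpiexec -n 4 and OMP_NUM_THREADS=9 on one 36-core Reedbush-U node (hypothetical launch settings) gives 4 processes x 9 threads, whereas OMP_NUM_THREADS=1 with 36 processes would be flat MPI.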

45 Example of OpenMP/MPI Hybrid: Sending Messages to Neighboring Processes. MPI: message passing; OpenMP: threading with directives.

!C
!C SEND
      do neib= 1, NEIBPETOT
        II= (LEVEL-1)*NEIBPETOT
        istart= STACK_EXPORT(II+neib-1)
        inum  = STACK_EXPORT(II+neib  ) - istart
!$omp parallel do
        do k= istart+1, istart+inum
          WS(k-NE0)= X(NOD_EXPORT(k))
        enddo

        call MPI_Isend (WS(istart+1-NE0), inum, MPI_DOUBLE_PRECISION,   &
     &                  NEIBPE(neib), 0, MPI_COMM_WORLD,                &
     &                  req1(neib), ierr)
      enddo

46 In order to make full use of modern supercomputer systems with multicore/manycore architectures, hybrid parallel programming with message passing and multithreading is essential. MPI for message passing and OpenMP for multithreading are the most popular approaches to parallel programming on multicore/manycore clusters. In this class, we parallelize a finite-volume method code with Krylov iterative solvers for Poisson's equation on the Reedbush-U supercomputer (Intel Broadwell-EP) at the University of Tokyo. Because of time limitations, we (mainly) focus on multithreading by OpenMP. The ICCG solver (Conjugate Gradient iterative solver with Incomplete Cholesky preconditioning) is a widely used method for solving linear equations. Because it includes data dependencies, where writes to and reads from memory could occur simultaneously, parallelization using OpenMP is not straightforward.
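To see where the data dependency comes from, here is a minimal C sketch (illustrative CRS-style arrays and values, not the lecture code) of the forward substitution used in IC/Gauss-Seidel type preconditioners: z[i] needs z-values that earlier iterations have just written, so naively adding "#pragma omp parallel for" to the outer loop would produce wrong results.

#include <stdio.h>

#define N 4

int main(void) {
  /* Strictly lower-triangular part L of a small matrix in CRS form
   * (hypothetical values), diagonal DD, and right-hand side r.        */
  int    indexL[N + 1] = {0, 0, 1, 2, 4};
  int    itemL[4]      = {0, 1, 1, 2};
  double AL[4]         = {-1.0, -1.0, -0.5, -1.0};
  double DD[N]         = {2.0, 2.0, 2.0, 2.0};
  double r[N]          = {1.0, 1.0, 1.0, 1.0};
  double z[N];

  /* Forward substitution z = (D + L)^{-1} r.
   * z[i] depends on z[itemL[k]] with itemL[k] < i, i.e. on values written
   * in EARLIER iterations of the i-loop: a loop-carried dependency.
   * "#pragma omp parallel for" here would let threads read z[j] before it
   * is written; reordering/coloring is needed to expose parallelism.     */
  for (int i = 0; i < N; i++) {
    double w = r[i];
    for (int k = indexL[i]; k < indexL[i + 1]; k++) {
      w -= AL[k] * z[itemL[k]];
    }
    z[i] = w / DD[i];
  }

  for (int i = 0; i < N; i++) printf("z[%d] = %f\n", i, z[i]);
  return 0;
}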

47 We need a certain kind of reordering in order to extract parallelism (a sketch follows below). Lectures and exercises on the following issues related to OpenMP will be conducted: Finite-Volume Method (FVM); Krylov iterative methods; preconditioning; implementation of the program; introduction to OpenMP; reordering/coloring methods; parallel FVM code using OpenMP.
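A hedged sketch of the coloring idea (illustrative array names, not the lecture implementation): after a multicolor reordering, rows of the same color have no direct dependencies among themselves, so the loop over each color can be threaded while the colors themselves are processed one after another.

/* Gauss-Seidel-like sweep after multicolor reordering (sketch).
 * colorIndex[ic] .. colorIndex[ic+1]-1 lists the rows of color ic;
 * rows of one color do not depend on each other, so the inner loop
 * can safely be parallelized with OpenMP.  Colors run sequentially. */
void colored_sweep(int ncolor, const int *colorIndex, const int *colorItem,
                   const int *index, const int *item, const double *A,
                   const double *D, const double *b, double *x) {
  for (int ic = 0; ic < ncolor; ic++) {
    #pragma omp parallel for
    for (int m = colorIndex[ic]; m < colorIndex[ic + 1]; m++) {
      int i = colorItem[m];          /* a row of color ic               */
      double w = b[i];
      for (int k = index[i]; k < index[i + 1]; k++) {
        w -= A[k] * x[item[k]];      /* couples only to rows of OTHER colors */
      }
      x[i] = w / D[i];
    }
  }
}

Building colorIndex/colorItem is exactly the reordering/coloring step listed above.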

48 Schedule (Date / ID / Title):
Apr-09 (M) CS-01 Introduction
Apr-16 (M) CS-02 FVM (1/3)
Apr-23 (M) CS-03 FVM (2/3)
Apr-30 (M) (no class) (National Holiday)
May-07 (M) CS-04 FVM (3/3)
May-14 (M) CS-05 Login to Reedbush-U, OpenMP (1/3)
May-21 (M) CS-06 OpenMP (2/3)
May-28 (M) CS-07 OpenMP (3/3)
Jun-04 (M) CS-08 Reordering (1/2)
Jun-11 (M) CS-09 Reordering (2/2)
Jun-18 (M) CS-10 Tuning
Jun-25 (M) (canceled)
Jul-02 (M) CS-11 Parallel Code by OpenMP (1/2)
Jul-09 (M) CS-12 Parallel Code by OpenMP (2/2)
Jul-23 (M) CS-13 Exercises

49 Prerequisites: fundamental physics and mathematics (linear algebra, analysis); experience with fundamental numerical algorithms (LU factorization/decomposition, Gauss-Seidel); experience in programming in C or Fortran; experience and knowledge of UNIX. A user account of ECCS2016 must be obtained.

50 Strategy. If you can develop programs by yourself, it is ideal... but difficult; this class is focused on reading code, not developing it by yourself. Programs are in C and Fortran. Lectures are done by... Lecture materials are available at noon on Friday through the web; no hardcopy is provided. Classes start at 08:30; you can enter the building after 08:00; take seats from the front row; terminals must be shut down after class.

51 Grades: 1 or 2 reports on programming.

52 If you have any questions, please feel free to contact me! Office: 3F Annex of the Information Technology Center, Room #36 (情報基盤センター別館 3F 36 号室); e-mail: nakajima(at)cc.u-tokyo.ac.jp. No specific office hours; please make an appointment by e-mail. Japanese-language materials (partial) are also available.

53 Keywords for OpenMP. OpenMP: directive-based, (seems to be) easy, many books. Data dependency: conflicts between reading from and writing to memory; appropriate reordering of the data is needed for consistent parallel computing; there is no detailed information on this in OpenMP books, and it is very complicated.
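A tiny, hedged illustration of such a read/write conflict (a generic example, not from the lecture code): without the reduction clause the threads race on sum and the result can change from run to run; the directive resolves it here, but the dependencies in ICCG-type loops cannot be fixed this simply.

#include <stdio.h>

#define N 1000000

int main(void) {
  double sum = 0.0;

  /* Every thread reads and writes the shared variable "sum": a conflict.
   * Removing "reduction(+:sum)" would make the result non-deterministic. */
  #pragma omp parallel for reduction(+:sum)
  for (int i = 0; i < N; i++) {
    sum += 1.0 / (double)(i + 1);
  }

  printf("sum = %f\n", sum);
  return 0;
}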

54 Some Technical Terms. Processor, processing unit (H/W): processor = CPU for single-core processors. Process: the unit of MPI computation, nearly equal to a core; each core (or processor) can host multiple processes (but this is not efficient). PE (Processing Element): PE originally means processor, but it is sometimes used to mean process in this class; moreover it can mean domain (next item); in multicore processors, PE generally means core. Domain: domain = process (= PE), each of the MD in SPMD, each data set. The process ID of MPI (ID of PE, ID of domain) starts from 0: if you have 8 processes (PEs, domains), the IDs are 0~7.

55 Supercomputers and Computational Science / Overview of the Class / Future Issues

56 Key Issues towards Applications/Algorithms on Exa-Scale Systems (Jack Dongarra (ORNL/U. Tennessee) at ISC 2013): hybrid/heterogeneous architectures, multicore + GPU/manycores (Intel MIC/Xeon Phi); data movement, hierarchy of memory; communication/synchronization-reducing algorithms; mixed-precision computation; auto-tuning/self-adapting; fault-resilient algorithms; reproducibility of results.

57 Supercomputers with Heterogeneous/Hybrid Nodes: each node combines CPUs with accelerators (GPU or manycore).

58 A hybrid parallel programming model is essential for post-peta/exascale computing: message passing (e.g. MPI) + multithreading (e.g. OpenMP, CUDA, OpenCL, OpenACC etc.). On the K computer and FX10, hybrid parallel programming is recommended: MPI + automatic parallelization by Fujitsu's compiler. Expectations for hybrid: the number of MPI processes (and sub-domains) is reduced, since O(10^8~10^9)-way flat MPI might not scale on exascale systems; it is easily extended to heterogeneous architectures (+GPU, +manycores, e.g. Intel MIC/Xeon Phi): MPI+X with X = OpenMP, OpenACC, CUDA, OpenCL.

59 This class is also useful for this type of parallel system (nodes with GPU or manycore accelerators).

60 Parallel Programming Models. Multicore clusters (e.g. K, FX10): MPI + OpenMP and (Fortran/C/C++). Multicore + GPU (e.g. TSUBAME): the GPU needs a host; MPI and [(Fortran/C/C++) + CUDA, OpenCL] is complicated, while MPI and [(Fortran/C/C++) with OpenACC] is close to MPI + OpenMP and (Fortran/C/C++). Multicore + Intel MIC/Xeon Phi, or Intel MIC/Xeon Phi only: MPI + OpenMP and (Fortran/C/C++) is possible, plus vectorization.

61 Future of Supercomputers (1/2). Technical issues: power consumption; reliability, fault tolerance, fault resilience; scalability (parallel performance). Petascale system: 2 MW including A/C, 2 M$/year, O(10^5~10^6) cores. Exascale system (10^3 x petascale), after 2021 (?): 2 GW (2 B$/year!), O(10^8~10^9) cores. Various types of innovation are on-going to keep power consumption at 20 MW (100x efficiency): CPU, memory, network... and reliability.

62 Future of Supercomputers (2/2). Not only hardware but also numerical models and algorithms must be improved: power-aware/reducing algorithms (省電力); fault-resilient algorithms (耐故障); communication-avoiding/reducing algorithms (通信削減). Co-design by experts from various areas (SMASH) is important. An exascale system will be a special-purpose system, not a general-purpose one.
