Extreme Simulations on ppOpen-HPC

1 Extreme Simulations on ppOpen-HPC
Kengo Nakajima, Information Technology Center, The University of Tokyo
Challenges in Extreme Engineering Algorithms, ISC'16, June 20, 2016, Frankfurt am Main, Germany

2 Summary
ppOpen-HPC is an open-source infrastructure for the development and execution of optimized and reliable simulation codes on post-peta-scale (pp) parallel computers based on many-core architectures, with automatic tuning (AT). It consists of various types of libraries covering general procedures for scientific computation.
Some recent activities:
Recent achievements in the development of ppOpen-MATH/MG, a geometric multigrid solver
Coupled earthquake simulations by ppOpen-MATH/MP

3 ppOpen-HPC
ppOpen-MATH
  ppOpen-MATH/MG: Multigrid Solver (Groundwater Flow)
  ppOpen-MATH/MP: Coupler (Coupled Earthquake Simulations)
Future/On-going Works

4 ppOpen-HPC: Overview
Application framework with automatic tuning (AT); pp: post-peta-scale
Five-year project, FY 2011-2015 (since April 2011)
P.I.: Kengo Nakajima (ITC, The University of Tokyo)
Part of "Development of System Software Technologies for Post-Peta Scale High Performance Computing", funded by JST/CREST (Supervisor: Prof. Mitsuhisa Sato, RIKEN AICS)
Target: Post-T2K system (original schedule: FY 2015); could be extended to various types of platforms
Team with 7 institutes, >50 people (5 PDs) from various fields: Co-Design
ITC/U.Tokyo, AORI/U.Tokyo, ERI/U.Tokyo, FS/U.Tokyo, Hokkaido U., Kyoto U., JAMSTEC

5 Group Leaders
Masaki Satoh (AORI/U.Tokyo)
Takashi Furumura (ERI/U.Tokyo)
Hiroshi Okuda (GSFS/U.Tokyo)
Takeshi Iwashita (Kyoto U., ITC/Hokkaido U.)
Hide Sakaguchi (IFREE/JAMSTEC)
Main Members
Takahiro Katagiri (ITC/U.Tokyo)
Masaharu Matsumoto (ITC/U.Tokyo)
Hideyuki Jitsumoto (Tokyo Tech)
Satoshi Ohshima (ITC/U.Tokyo)
Hiroyasu Hasumi (AORI/U.Tokyo)
Takashi Arakawa (RIST)
Futoshi Mori (ERI/U.Tokyo)
Takeshi Kitayama (GSFS/U.Tokyo)
Akihiro Ida (ACCMS/Kyoto U. -> University of Tokyo)
Miki Yamamoto (IFREE/JAMSTEC)
Daisuke Nishiura (IFREE/JAMSTEC)

6 (Diagram) ppOpen-HPC structure: application development frameworks, math libraries, automatic tuning (AT), and system software.

7 (Diagram) What ppOpen-HPC covers.

8 Supercomputers in ITC/U.Tokyo: 2 big systems, 6-year cycle
Hitachi SR11K/J2 (IBM Power 5+): 18.8 TFLOPS, 16.4 TB
Hitachi HA8000 (T2K, AMD Opteron): 140 TFLOPS, 31.3 TB
Yayoi: Hitachi SR16000/M1 (IBM Power): 11.2 TB
Oakleaf-FX: Fujitsu PRIMEHPC FX10 (SPARC64 IXfx): 1.13 PFLOPS, 150 TB
Oakbridge-FX: 18.4 TB
Oakforest-PACS: Fujitsu, Intel KNL: 25 PFLOPS, 919.3 TB; JCAHPC: U.Tsukuba & U.Tokyo
Post-FX10 system: PFLOPS-scale (?)
Reedbush (SGI, Broadwell + Pascal): U.Tokyo's first system with GPUs, for CSE & Big Data
(The timeline also shows the K computer and Post-K.)

9 Post-T2K System: Oakforest-PACS
25 PFLOPS, Fall 2016, by Fujitsu; 8,208 Intel Xeon Phi (KNL) nodes
Full operation starts on December 1, 2016
Joint Center for Advanced High Performance Computing (JCAHPC): University of Tsukuba & University of Tokyo
The new system will be installed at the Kashiwa-no-Ha ("Leaf of Oak") Campus of U.Tokyo, which is located between Tokyo and Tsukuba
Please visit booth #510 for more detailed information.

10 Schedule of Public Release (with English documents, MIT License)
Released at SC-XY (or can be downloaded)
Multicore/manycore cluster version (Flat MPI, OpenMP/MPI hybrid) with documents in English
We are now focusing on MIC/Xeon Phi
Collaborations are welcome
History:
SC12, Nov 2012 (Ver.0.1.0)
SC13, Nov 2013 (Ver.0.2.0)
SC14, Nov 2014 (Ver.0.3.0)
SC15, Nov 2015 (Ver.1.0.0)

11 New Features in Ver.1.0.0 (released at SC15)
HACApK library for H-matrix computations in ppOpen-APPL/BEM (OpenMP/MPI hybrid version): the first open-source H-matrix library parallelized with OpenMP/MPI hybrid
ppOpen-MATH/MP (coupler for multiphysics simulations, loose coupling of FEM & FDM)
Matrix assembly and linear solvers for ppOpen-APPL/FVM

12 Collaborations, Outreach
International collaborations: Lawrence Berkeley National Lab.; National Taiwan University; National Central University (Taiwan); ESSEX/SPPEXA/DFG, Germany; IPCC (Intel Parallel Computing Center)
Outreach and applications: large-scale simulations (geologic CO2 storage, astrophysics, earthquake simulations, etc.) using ppOpen-AT, ppOpen-MATH/VIS, ppOpen-MATH/MP, and the linear solvers
International workshops (2012, 2013, 2015); tutorials, classes

13 Example: Geologic CO2 Storage
3D multiphase flow (liquid/gas) + 3D mass transfer [Dr. Hajime Yamamoto, Taisei]

14 Density Convections for 1,000 Years: Flow Model
Only the far side of the vertical cross section passing through the injection well is depicted. [Dr. Hajime Yamamoto, Taisei]
The meter-scale fingers gradually developed into larger ones in the field-scale model.
A huge number of time steps (>10^5) was required to complete the 1,000-year simulation.
The onset time (10-20 yrs) is comparable to theory (linear stability analysis, 15.5 yrs).
COPYRIGHT TAISEI CORPORATION ALL RIGHTS RESERVED

15 Simulation of Geologic CO2 Storage
30 million DoF (10 million grid nodes x 3 DoF/node); 3D multiphase flow (liquid/gas) + 3D mass transfer
(Plot: average time for solving the matrix for one time step vs. number of processors, TOUGH2-MP on FX10.)
Fujitsu FX10 (Oakleaf-FX), 30M DOF: 2x-3x improvement [Dr. Hajime Yamamoto, Taisei]
COPYRIGHT TAISEI CORPORATION ALL RIGHTS RESERVED

16 Support
FY 2011-2015: Post-Peta CREST (JST/CREST)
From FY 2016: ESSEX-II (JST/CREST & DFG/SPPEXA (Germany) collaboration)
ESSEX: Equipping Sparse Solvers for Exascale; Leading PI: Prof. Gerhard Wellein (U. Erlangen)
ESSEX-II = ESSEX + U.Tsukuba (Sakurai) + U.Tokyo (Nakajima)

17 ppOpen-HPC
ppOpen-MATH
  ppOpen-MATH/MG: Multigrid Solver
  ppOpen-MATH/MP: Coupler
ppOpen-AT
Future/On-going Works

19 ppOpen-MATH
A set of common numerical libraries:
Multigrid solvers (ppOpen-MATH/MG)
Parallel graph libraries (ppOpen-MATH/GRAPH): multithreaded RCM for reordering (under development)
Parallel visualization (ppOpen-MATH/VIS)
Library for coupled multiphysics simulations, loose coupling (ppOpen-MATH/MP)
  Originally developed as a coupler for NICAM (atmosphere, unstructured) and COCO (ocean, structured) in global climate simulations on the K computer; both are major codes on the K computer (Prof. Masaki Satoh, AORI/U.Tokyo: NICAM; Prof. Hiroyasu Hasumi, AORI/U.Tokyo: COCO)
  The coupler has been extended to more general use: coupled seismic simulations

20 Sparse Linear Solvers in ppOpen-HPC (OpenMP+MPI hybrid)
Multicoloring/RCM/CM-RCM reordering for OpenMP (coloring procedures are NOT parallelized yet; see the sketch below)
ppOpen-APPL/FEM, FVM, FDM: ILU/BILU(p,d,t) + CG/GPBiCG/GMRES; depth of overlapping; Hierarchical Interface Decomposition (HID) [Henon & Saad 2007]; extended HID [KN 2010]
ppOpen-MATH/MG: geometric multigrid solvers/preconditioners; comm./synch. avoiding/reducing based on hCGA [KN 2014, Best Paper Award at IEEE ICPADS 2014]
ppOpen-APPL/BEM: H-matrix solver HACApK, the only open-source H-matrix solver library with OpenMP/MPI hybrid parallelization
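
The reordering item above refers to the standard pattern for running ILU/IC-type sweeps under OpenMP: rows are grouped by color so that rows of the same color have no mutual dependency. The following C sketch is illustrative only, not ppOpen-HPC source; all array names (such as color_ptr) are assumptions.

```c
/* Minimal sketch (not ppOpen-HPC source): forward substitution of a
 * lower-triangular factor parallelized by multicoloring.  Rows are
 * assumed to be renumbered so that color c owns rows
 * color_ptr[c] .. color_ptr[c+1]-1, and rows of the same color have
 * no dependencies on each other.  The CRS arrays hold only the
 * strictly lower triangular entries.                                 */
#include <omp.h>

void forward_substitution_mc(int ncolor, const int *color_ptr,
                             const int *row_ptr, const int *col_idx,
                             const double *val, const double *diag_inv,
                             const double *b, double *x)
{
    for (int c = 0; c < ncolor; c++) {            /* colors: sequential */
        #pragma omp parallel for schedule(static) /* rows of one color: parallel */
        for (int i = color_ptr[c]; i < color_ptr[c + 1]; i++) {
            double s = b[i];
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                s -= val[k] * x[col_idx[k]];
            x[i] = s * diag_inv[i];
        }
    }
}
```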

21 ppOpen-MATH/MG: pgw3d-FVM
3D groundwater flow through heterogeneous porous media, governed by the Poisson-type equation $\nabla\cdot\left(\lambda(x,y,z)\nabla\phi\right) = q$ with randomly distributed water conductivity $\lambda = 10^{-5} \sim 10^{+5}$ (average: 1.00)
Finite Volume Method on a cubic voxel mesh (a stencil sketch follows below)
MGCG solver
Storage format of coefficient matrices (serial comm.): CRS (Compressed Row Storage), ELL (Ellpack-Itpack)
Comm./synch.-reducing MG (parallel comm.): Coarse Grid Aggregation (CGA), Hierarchical CGA (hCGA, communication-reducing CGA)
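
For illustration only, here is a minimal sketch of how a 7-point finite-volume row could be assembled for this kind of heterogeneous problem on a uniform voxel mesh, using a harmonic mean for the face conductivity. This is an assumption-laden toy, not the pgw3d-FVM discretization itself.

```c
/* Illustrative only: 7-point finite-volume coefficients for the
 * diffusion operator with cell-centered conductivity lambda[] on a
 * uniform voxel mesh of spacing h.  The face value uses the harmonic
 * mean, a common choice for strongly heterogeneous media; signs
 * correspond to the usual symmetric positive-definite form.          */
static double face_coef(double lam_p, double lam_q, double h)
{
    /* harmonic mean of the two cell conductivities, times area/dist = h */
    return 2.0 * lam_p * lam_q / (lam_p + lam_q) * h;
}

/* Assemble one matrix row (cell id = i + nx*(j + ny*k)):
 * off-diagonal coefficients go to off[6], the diagonal is returned.   */
double assemble_row(int i, int j, int k, int nx, int ny, int nz,
                    const double *lambda, double h, double off[6])
{
    int id = i + nx * (j + ny * k);
    int nbr[6][3] = { {-1,0,0},{1,0,0},{0,-1,0},{0,1,0},{0,0,-1},{0,0,1} };
    double diag = 0.0;
    for (int f = 0; f < 6; f++) {
        int ii = i + nbr[f][0], jj = j + nbr[f][1], kk = k + nbr[f][2];
        if (ii < 0 || ii >= nx || jj < 0 || jj >= ny || kk < 0 || kk >= nz) {
            off[f] = 0.0;                       /* no-flux boundary face */
            continue;
        }
        int idn = ii + nx * (jj + ny * kk);
        double a = face_coef(lambda[id], lambda[idn], h);
        off[f] = -a;                            /* off-diagonal entry   */
        diag  +=  a;                            /* diagonal accumulates */
    }
    return diag;
}
```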

22 Linear Solvers
Preconditioned CG method with (geometric) multigrid preconditioning (MGCG); IC(0) as the smoothing operator (smoother): good for ill-conditioned problems
Parallel geometric multigrid method: 8 fine meshes (children) form 1 coarse mesh (parent) in an isotropic manner (octree); V-cycle (a toy V-cycle sketch follows below)
Domain-decomposition-based: localized block-Jacobi with overlapped Additive Schwarz Domain Decomposition (ASDD)
Operations at the coarsest level use a single core (redundant)
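
To make the V-cycle structure concrete, here is a toy 1-D geometric multigrid V-cycle in C with damped-Jacobi smoothing and a direct solve at the coarsest level. It only illustrates the recursion that MGCG applies as a preconditioner; ppOpen-MATH/MG itself works on 3-D voxel meshes with IC(0) smoothing, octree coarsening, and a redundant single-core coarse-grid solve.

```c
/* Toy sketch of a geometric multigrid V-cycle for -u'' = f with
 * homogeneous Dirichlet boundaries and n = 2^k - 1 interior points.  */
#include <stdlib.h>
#include <string.h>

static void residual(const double *u, const double *f, double *r, int n, double h)
{
    for (int i = 0; i < n; i++) {
        double ul = (i > 0)     ? u[i - 1] : 0.0;
        double ur = (i < n - 1) ? u[i + 1] : 0.0;
        r[i] = f[i] - (2.0 * u[i] - ul - ur) / (h * h);
    }
}

static void smooth(double *u, const double *f, int n, double h, int sweeps)
{
    const double w = 2.0 / 3.0;                     /* damped Jacobi weight */
    double *unew = malloc(n * sizeof(double));
    for (int s = 0; s < sweeps; s++) {
        for (int i = 0; i < n; i++) {
            double ul = (i > 0)     ? u[i - 1] : 0.0;
            double ur = (i < n - 1) ? u[i + 1] : 0.0;
            unew[i] = (1.0 - w) * u[i] + w * 0.5 * (ul + ur + h * h * f[i]);
        }
        memcpy(u, unew, n * sizeof(double));
    }
    free(unew);
}

/* V-cycle: pre-smooth, restrict residual, recurse, prolong correction,
 * post-smooth.  At the coarsest level (n == 1) the single equation is
 * solved directly, the analogue of the redundant coarse-grid solve.   */
void vcycle(double *u, const double *f, int n, double h)
{
    if (n == 1) { u[0] = 0.5 * h * h * f[0]; return; }
    smooth(u, f, n, h, 2);                          /* pre-smoothing */

    int nc = (n - 1) / 2;                           /* coarse interior points */
    double *r  = malloc(n  * sizeof(double));
    double *rc = calloc(nc,  sizeof(double));
    double *ec = calloc(nc,  sizeof(double));
    residual(u, f, r, n, h);
    for (int i = 0; i < nc; i++)                    /* full-weighting restriction */
        rc[i] = 0.25 * r[2*i] + 0.5 * r[2*i + 1] + 0.25 * r[2*i + 2];
    vcycle(ec, rc, nc, 2.0 * h);                    /* coarse-grid correction */
    for (int i = 0; i < nc; i++) {                  /* linear prolongation */
        u[2*i + 1] += ec[i];
        u[2*i]     += 0.5 * ec[i];
        u[2*i + 2] += 0.5 * ec[i];
    }
    free(r); free(rc); free(ec);

    smooth(u, f, n, h, 2);                          /* post-smoothing */
}
```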

23 Optimization of Serial Communication
ELL (Ellpack-Itpack) and Sliced ELL for matrix storage: (a) CRS, (b) ELL (kernels sketched below)
Separate arrays are introduced for boundary cells (AUnew6(6,N)) and pure internal cells (AUnew3(3,N))
There are a lot of "X-ELL-Y-Z" variants, mostly focusing on SpMV, e.g. SELL-C-σ [M. Kreutzer et al.]
Recently, such formats have also been applied to forward/backward substitutions with data dependency (e.g., HPCG); Gauss-Seidel is easy, while ILU/IC is much more difficult than GS
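
The point of ELL is visible directly in the SpMV kernels: compared with CRS, the padded ELL layout gives a fixed-length, easily vectorizable inner loop with contiguous accesses. The kernels below are illustrative sketches, not ppOpen-HPC code; padded ELL entries are assumed to carry value 0 and a valid column index.

```c
/* Illustrative CRS and ELL SpMV kernels (not ppOpen-HPC source). */
void spmv_crs(int n, const int *row_ptr, const int *col_idx,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double s = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)  /* variable length */
            s += val[k] * x[col_idx[k]];
        y[i] = s;
    }
}

void spmv_ell(int n, int nnz_max, const int *col_idx, /* [nnz_max*n], padded */
              const double *val,                      /* zeros in padding    */
              const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double s = 0.0;
        for (int k = 0; k < nnz_max; k++)             /* fixed trip count    */
            s += val[k * n + i] * x[col_idx[k * n + i]];
        y[i] = s;
    }
}
```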

24 Original Approach
Coarse grid solver at a single core [KN 2010]
(Diagram: restriction through Level = 1, 2, ..., m-3, m-2, m-1, m, from fine to coarse; the mesh count per MPI process reaches 1 at the coarsest level; the coarse grid solver runs on a single core with further multigrid.)

25 Coarse Grid Aggregation (CGA)
The coarse grid solver is multithreaded [KN 2012]
(Diagram: Level = 1, 2, ..., m-3, m-2, then coarse; the coarse grid solver runs on a single MPI process, multithreaded, with further multigrid.)
Communication overhead could be reduced, but the coarse grid solver is more expensive than in the original approach; if the process count is larger, this effect might be significant.

26 Computations on Fujitsu FX10
Fujitsu PRIMEHPC FX10 at U.Tokyo (Oakleaf-FX), the commercial version of K: 16 cores/node, flat/uniform access to memory, 4,800 nodes, 1.13 PF (65th on the TOP500, June 2015)
Up to 4,096 nodes (65,536 cores) used (Large-Scale HPC Challenge), max 17,179,869,184 unknowns
Flat MPI, HB 4x4, HB 8x2, HB 16x1
Weak scaling; strong scaling: 16,777,216 unknowns, from 8 to 4,096 nodes
Network topology is not specified (1D)
(Diagram: node with 16 cores, L1 caches, shared L2, and memory.)

27 Weak Scaling: up to 4,096 nodes, Flat MPI
Up to 17,179,869,184 meshes (64^3 meshes/core)
(Plot: time in sec vs. core #, DOWN is GOOD; Original, ELL-CGA, SELL-CGA.)
The coarse grid solver is more expensive than in the original approach; if the process count is larger, this effect might be significant.

28 Hierarchical CGA (hCGA): Communication-Reducing MG
Reduced number of MPI processes [KN 2014]
(Diagram: Level = 1, 2, ..., m-3, m-2, then coarse; the coarse grid solver runs on a single MPI process, multithreaded, with further multigrid.)
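
The communicator handling behind the hCGA idea can be sketched in a few lines of MPI: below a chosen switching level, only a subset of ranks keeps working on the coarse problem, so coarse-level communication involves far fewer processes. This is an assumed illustration, not the ppOpen-MATH/MG implementation; the stride parameter and function name are made up for the example.

```c
/* Minimal sketch (assumed, not ppOpen-MATH/MG code) of a reduced
 * communicator for coarse multigrid levels: world ranks
 * 0, stride, 2*stride, ... keep working, all other ranks get
 * MPI_COMM_NULL and simply wait.                                     */
#include <mpi.h>

MPI_Comm make_coarse_comm(MPI_Comm world, int stride)
{
    int rank;
    MPI_Comm_rank(world, &rank);
    int color = (rank % stride == 0) ? 0 : MPI_UNDEFINED;
    MPI_Comm coarse;
    MPI_Comm_split(world, color, rank, &coarse);
    return coarse;   /* MPI_COMM_NULL on excluded ranks */
}
```

Data owned by the excluded ranks would be gathered onto the retained ranks before the coarse-level cycles (e.g. with MPI_Gatherv on the original communicator) and scattered back afterwards.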

29 hCGA: Related Work
Not a new idea, but very few implementations; considered not effective for peta-scale systems (Dr. U. M. Yang (LLNL), developer of hypre)
Existing works: repartitioning at coarse levels
Lin, P.T., Improving multigrid performance for unstructured mesh drift-diffusion simulations on 147,000 cores, International Journal for Numerical Methods in Engineering 91 (2012) (Sandia)
Sundar, H. et al., Parallel Geometric-Algebraic Multigrid on Unstructured Forests of Octrees, Proceedings of the 2012 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC12) (2012) (UT Austin): Flat MPI, repartitioning if DOF < O(10^3) on each process

30 Weak Scaling: up to 4,096 nodes, Flat MPI
Up to 17,179,869,184 meshes (64^3 meshes/core)
(Plot: time in sec vs. core #, DOWN is GOOD; Original, ELL-CGA, SELL-CGA, SELL-hCGA; hCGA gives a 1.61x improvement at the largest scale.)

31 Strong Scaling: up to 4,096 nodes, Flat MPI
100%: efficiency at 128 cores; 268,435,456 meshes (16^3 meshes/core at 4,096 nodes)
(Plot: parallel performance (%) vs. core #, UP is GOOD; SELL+CGA, SELL+hCGA; hCGA gives a 6.27x improvement at the largest scale.)

32 Weak Scaling: up to 4,096 nodes
Up to 17,179,869,184 meshes (64^3 meshes/core); DOWN is GOOD
Cases (matrix storage / coarse grid treatment):
C0: CRS, single core
C1: ELL (original), single core
C2: ELL (original), CGA
C3: ELL (sliced), CGA
C4: ELL (sliced), hCGA
(Plots: time in sec vs. core # for Flat MPI (C3, C4), HB 4x4 (C4), HB 8x2 (C3), HB 16x1; and a comparison of C3 vs. C4 at 4,096 nodes for Flat MPI, HB 4x4, HB 8x2, HB 16x1.)

33 Summary
hCGA is effective (only for Flat MPI): x1.61 for weak scaling and x6.27 for strong scaling at 4,096 nodes of Fujitsu FX10
hCGA will be effective for HB 16x1 with more than 2.50x10^5 nodes (= 4.00x10^6 cores) of FX10 (= 60 PFLOPS)
Computation time of the coarse grid solver is significant for Flat MPI with >10^3 nodes
Future works: improvement of hCGA; CA-Multigrid (for coarser levels), CA-SPAI, pipelined methods, SELL-C-σ; strategy for automatic selection (switching level and number of processes for hCGA, optimum color #); Post-T2K (Oakforest-PACS)

34 ppOpen-HPC
ppOpen-MATH
  ppOpen-MATH/MG: Multigrid Solver
  ppOpen-MATH/MP: Coupler
ppOpen-AT
Future/On-going Works

35 ppOpen-MATH
A set of common numerical libraries:
Multigrid solvers (ppOpen-MATH/MG)
Parallel graph libraries (ppOpen-MATH/GRAPH): multithreaded RCM for reordering (under development)
Parallel visualization (ppOpen-MATH/VIS)
Library for coupled multiphysics simulations, loose coupling (ppOpen-MATH/MP)
  Originally developed as a coupler for NICAM (atmosphere, unstructured) and COCO (ocean, structured) in global climate simulations on the K computer; both are major codes on the K computer (Prof. Masaki Satoh, AORI/U.Tokyo: NICAM; Prof. Hiroyasu Hasumi, AORI/U.Tokyo: COCO)
  The coupler has been extended to more general use: coupled seismic simulations

36 (Diagram) ppOpen-MATH/MP (J-cup) Coupler
Atmospheric models: NICAM (semi-unstructured grid; A-grid and ZM-grid versions), MIROC-A (FDM/structured grid)
Ocean models: COCO (tri-polar FDM), regional COCO, Matsumura model (non-hydrostatic regional ocean model)
Coupler functions: grid transformation, multi-ensemble M x N I/O, pre- and post-processing, fault tolerance, on the post-peta-scale system (system S/W, architecture)
c/o T. Arakawa, M. Satoh

37 Sea surface temperature (OSST): left, COCO (ocean, structured); right, NICAM (atmosphere, semi-unstructured). c/o T. Arakawa, M. Satoh

38 Thickness of sea ice (OHI): left, COCO (ocean, structured); right, NICAM (atmosphere, semi-unstructured). c/o T. Arakawa, M. Satoh

39 Weakly Coupled Simulation with the ppOpen-HPC Libraries
Two kinds of applications, Seism3D+ (based on FDM, ppOpen-APPL/FDM) and FrontISTR++ (based on FEM, ppOpen-APPL/FEM), are connected by the ppOpen-MATH/MP coupler; velocity from Seism3D+ is passed through the coupler and displacement is provided to FrontISTR++.
Principal functions: make a mapping table; convert physical variables; choose the timing of data transmission. (A schematic exchange is sketched below.)
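
The one-way exchange performed at each coupling step can be sketched in C/MPI as below. All function names, the message tag, and the mapping-table layout (map_idx, map_w) are illustrative assumptions, not the ppOpen-MATH/MP API; they only show the send/receive-and-interpolate pattern that the coupler encapsulates.

```c
/* Schematic only: one-way FDM -> FEM exchange of surface velocities
 * through a precomputed mapping table.                               */
#include <mpi.h>

enum { TAG_VEL = 100 };

/* FDM side: send the 3 velocity components of its surface grid points
 * to the partner FEM rank over the coupling communicator.             */
void fdm_send_velocity(MPI_Comm coupler, int fem_rank,
                       const double *vel, int n_surface_pts)
{
    MPI_Send(vel, 3 * n_surface_pts, MPI_DOUBLE, fem_rank, TAG_VEL, coupler);
}

/* FEM side: receive the FDM surface field and interpolate it onto FEM
 * nodes; node i uses nmap FDM points map_idx[i*nmap+m] with weights
 * map_w[i*nmap+m].                                                     */
void fem_recv_velocity(MPI_Comm coupler, int fdm_rank,
                       int n_fdm_pts, int n_fem_nodes, int nmap,
                       const int *map_idx, const double *map_w,
                       double *buf /* 3*n_fdm_pts */, double *node_vel)
{
    MPI_Recv(buf, 3 * n_fdm_pts, MPI_DOUBLE, fdm_rank, TAG_VEL, coupler,
             MPI_STATUS_IGNORE);
    for (int i = 0; i < n_fem_nodes; i++)
        for (int d = 0; d < 3; d++) {
            double s = 0.0;
            for (int m = 0; m < nmap; m++)
                s += map_w[i * nmap + m] * buf[3 * map_idx[i * nmap + m] + d];
            node_vel[3 * i + d] = s;
        }
}
```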

40 Dataflow of ppOpen-MATH/MP (diagram)

41 Coupling Simulation of Seismic Waves and Building Vibrations
The coupling simulation uses one-way data communication from the FDM (seismic wave propagation) to the FEM (dynamic structure analysis).

42 Numerical Model Description (figure)

43 Implementation of the Coupling Simulation (figure)

44 Practical Simulation on Oakleaf-FX
The simulation target is the earthquake that occurred at Awaji Island on 13 April 2013. The computational domain of Seism3D+ covers 60 km around Awaji Island, and that of FrontISTR++ is the actual building of the RIKEN Advanced Institute for Computational Science (AICS), Port Island, Kobe, modeled by an unstructured mesh.
Seism3D+: grid points (x, y, z) = (1536, 1536, 1600); parallelization: 2,560 processes x 16 threads
FrontISTR++: 600 million grid points (AICS building); parallelization: 1,000 processes x 16 threads (@Port Island) and 1,000 processes x 16 threads (@Kobe Stadium)
Total: 4,560 nodes on Oakleaf-FX (Seism3D+: 2,560 nodes, FrontISTR++: 2,000 nodes)
(Figure: computational domain of Seism3D+.)

45 2,560 nodes for FDM + 2,000 nodes for FEM = 4,560 nodes of FX10
(Figure: vertical cross section from east to west down to the underground geological layers; seismic wave propagation by Seism3D+ (red: P-wave, green: S-wave); building vibration by FrontISTR++.)

46 2,560 nodes for FDM + 2,000 nodes for FEM = 4,560 nodes of FX10
The coupled simulation was executed on large-scale computational resources of the Oakleaf-FX supercomputer system. Seismic wave propagation (Seism3D+) for a simulation time of 90 sec. and building vibration (FrontISTR++) for a simulation time of 20 sec. were calculated.
Comparison between simulation time and execution time:
Seism3D+: 90 sec. simulated, 6 hours of execution
FrontISTR++: 20 sec. simulated, 16 hours of execution
It was revealed that the way memory allocation is handled in the coupler causes problems when such a large-scale simulation is performed.

47 Summary
The implementation of a multi-scale coupled simulation of seismic waves and building vibrations was introduced, and a performance evaluation was conducted. Finally, a practical simulation on Oakleaf-FX was performed. Although more investigation is needed, such multi-scale coupled simulations will be important for numerical simulations on the next generation of large-scale computational environments.

48 ppOpen-HPC
ppOpen-MATH
  ppOpen-MATH/MG: Multigrid Solver
  ppOpen-MATH/MP: Coupler
ppOpen-AT
Future/On-going Works

49 ESSEX-II: 2nd Phase for ppOpen-HPC
Japan-German collaboration: ESSEX + U.Tsukuba (Prof. Sakurai) + U.Tokyo (ppOpen-HPC)
Leading PI: Prof. Gerhard Wellein (U. Erlangen)
Preconditioned iterative solvers for quantum science
DLR (German Aerospace Center)
Automatic tuning with performance models (U. Erlangen)

50 Assumptions & Expectations towards the Post-K/Post-Moore Era
Post-K (-2020): memory wall; hierarchical memory (e.g. KNL: MCDRAM + DDR)
Post-Moore (-2025? -2029?): high bandwidth in memory and network; large capacity of memory and cache; large and heterogeneous latency due to deep hierarchy in memory and network; utilization of FPGAs
Common issues: hierarchy and latency (memory, network, etc.); large number of nodes; high concurrency with O(10^3) threads on each node under certain constraints (e.g. power, space)

51 CAE Applications in the Post-K Era
Block-structured AMR, voxel-type FEM: semi-structured/semi-unstructured; the length of inner loops is fixed, or its variation is rather small, so ELL, Sliced ELL, and SELL-C-σ can be applied; better than fully unstructured FEM
Robust/scalable GMG/AMG preconditioning
Limited or fixed number of non-zero off-diagonals in the coefficient matrices
(Figure: storage formats CRS, ELL, Sliced ELL, SELL-C-σ (SELL-2-8), chunk size C; an SpMV sketch for SELL-C-σ follows below.)
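
For reference, an SpMV kernel for the SELL-C-σ format of Kreutzer et al. might look as follows, here with chunk size C = 4 and the σ-sorting step that builds the chunks omitted; the row count is assumed to be padded to a multiple of C. This is a generic sketch, not code from ppOpen-HPC.

```c
/* Sketch of SELL-C-sigma SpMV (C = 4).  Each chunk of C consecutive
 * rows is stored column-major with its own width chunk_len[c]; padded
 * entries carry value 0 and a valid column index.                    */
#define C 4

void spmv_sell_c(int n_chunks, const int *chunk_ptr, const int *chunk_len,
                 const int *col_idx, const double *val,
                 const double *x, double *y)
{
    for (int c = 0; c < n_chunks; c++) {
        double tmp[C] = {0.0, 0.0, 0.0, 0.0};
        int base = chunk_ptr[c];                 /* start of this chunk's data */
        for (int j = 0; j < chunk_len[c]; j++)   /* columns of the chunk       */
            for (int r = 0; r < C; r++)          /* C rows: SIMD-friendly      */
                tmp[r] += val[base + j * C + r] * x[col_idx[base + j * C + r]];
        for (int r = 0; r < C; r++)
            y[c * C + r] = tmp[r];
    }
}
```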

52 Applications & Algorithms in the Post-Moore Era
Compute intensity -> data-movement intensity: implicit schemes strike back! (but not straightforwardly)
Hierarchical methods for hiding latency: Hierarchical Coarse Grid Aggregation (hCGA) in multigrid; Parallel-in-Space/Time (PiST)
Communication/synchronization avoiding/reducing algorithms: network latency is already a big bottleneck for parallel sparse linear solvers (SpMV, dot products); pipelined/asynchronous CG with MPI_Iallreduce in MPI-3 (see the sketch below)
Power-aware methods: approximate computing, power management, FPGAs
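
The MPI-3 ingredient mentioned above can be shown with a minimal overlap pattern: the global reduction for a dot product is started with MPI_Iallreduce and hidden behind local work before its result is needed. This sketch shows only the overlap idea, not a complete pipelined CG; the function-pointer SpMV interface is an assumption for the example.

```c
/* Overlapping a global dot-product reduction with local work using
 * the non-blocking MPI_Iallreduce of MPI-3.                          */
#include <mpi.h>

double overlapped_dot(MPI_Comm comm, const double *u, const double *v, int n,
                      void (*spmv)(const double *, double *, void *),
                      const double *p, double *q, void *mat)
{
    double local = 0.0, global;
    for (int i = 0; i < n; i++)
        local += u[i] * v[i];                 /* local part of the dot product */

    MPI_Request req;
    MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm, &req);

    spmv(p, q, mat);                          /* useful work hides the latency */

    MPI_Wait(&req, MPI_STATUS_IGNORE);        /* reduction result now available */
    return global;
}
```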

53 Framework for Application Development in the Post-Moore Era
ppOpen-HPC (FY 2011-2015, JST/CREST) + α: linear solvers, AT framework, space/time tiling
Pre/post-processor for Parallel-in-Space/Time (PiST)
(a) nonlinear algorithms, (b) AMR, (c) visualization, (d) coupler for multiphysics
Data movement: challenging!

54 Summary: Visit Booth #510
ppOpen-HPC: porting to Oakforest-PACS (the original target system)
ppOpen-MATH/MG
ppOpen-MATH/MP: coupled earthquake simulations
ESSEX-II
Post-K/Post-Moore issues
Related talk: Akihiro Ida (U.Tokyo), "Application of Hierarchical Matrices with Adaptive Cross Approximation to Large-Scale Simulations", Wednesday, June 22, 2016, 04:10-04:30 pm, Panorama 2, Forum "Algorithms for Extreme Scale in Practice" (03:30-04:30 pm), Chair: David Keyes

55 16th SIAM Conference on Parallel Processing for Scientific Computing (PP18)
March 7-10, 2018, Tokyo, Japan
Venue: Nishiwaseda Campus, Waseda University (near Shinjuku)
Organizing Committee Co-Chairs: Satoshi Matsuoka (Tokyo Institute of Technology, Japan), Kengo Nakajima (The University of Tokyo, Japan), Olaf Schenk (Università della Svizzera italiana, Switzerland)
Contact: Kengo Nakajima, nakajima(at)cc.u-tokyo.ac.jp
