Multigrid Method using OpenMP/MPI Hybrid Parallel Programming Model on Fujitsu FX10

Size: px

Start display at page:

Download "Multigrid Method using OpenMP/MPI Hybrid Parallel Programming Model on Fujitsu FX10"

Jane Williamson
6 years ago
Views:

Information Technology enter, The University of

1 Multigrid Method using OpenMP/MPI Hybrid Parallel Programming Model on Fujitsu FX0 Kengo Nakajima Information Technology enter, The University of Tokyo, Japan November 4 th, 0 Fujitsu Booth S Salt Lake ity, Utah, USA

2 Motivation of This Study Parallel Multigrid Solvers for FVM-type appl. on Fujitsu PRIMEHP FX0 at University of Tokyo (Oakleaf-FX) Flat MPI vs. Hybrid (OpenMP+MPI) Expectations for Hybrid Parallel Programming Model Number of MPI processes (and sub-domains) to be reduced O( )-way MPI might not scale in Exascale Systems Easily extended to Heterogeneous Architectures PU+GPU, PU+Manys (e.g. Intel MI/Xeon Phi) MPI+X: OpenMP, OpenA, UDA, OpenL

3 3 Multigrid Scalable Multi-Level Method using Multilevel Grid for Solving Linear Eqn s omputation Time ~ O(N) (N: # unknowns) Good for large-scale problems Preconditioner for Krylov Iterative Linear Solvers MGG

4 4 Flat MPI vs. Hybrid Flat-MPI:Each PE -> Independent memory memory memory Hybrid:Hierarchal Structure memory memory memory

5 5 urrent Supercomputer Systems University of Tokyo Total number of users ~,000 Earth Science, Material Science, Engineering etc. Hitachi HA8000 luster System (TK/Tokyo) (008.6-) luster based on AMD Quad-ore Opteron (Barcelona) Peak: 40. TFLOPS Hitachi SR6000/M (Yayoi) (0.0-) Power 7 based SMP with 00 GB/node Peak: 54.9 TFLOPS Fujitsu PRIMEHP FX0 (Oakleaf-FX) (0.04-) SPAR64 IXfx ommercial version of K computer Peak:.3 PFLOPS (.043 PF, st, 40 th TOP 500 in 0 Nov.)

Fujitsu@S Oakleaf-FX 6 Aggregate memory bandwidth: 398 TB/sec.

6 Oakleaf-FX 6 Aggregate memory bandwidth: 398 TB/sec. Local file system for staging with. PB of capacity and 3 GB/sec of aggregate I/O performance (for staging) Shared file system for storing data with. PB and 36 GB/sec. External file system: 3.6 PB

Fujitsu@S 7 Target Application 3D Groundwater Flow via.

7 7 Target Application 3D Groundwater Flow via. Heterogeneous Porous Media Poisson s equation Randomly distributed water conductivity Distribution of water conductivity is defined through methods in geostatistics Deutsch & Journel, 998 Finite-Volume Method on ubic Voxel Mesh Distribution of Water onductivity , ondition Number ~ 0 +0 Average:.0 yclic Distribution: 8 3 Movie

8 8 Linear Solvers Preconditioned G Method Multigrid Preconditioning (MGG) I(0) for Smoothing Operator (Smoother): good for illconditioned problems Parallel Geometric Multigrid Method 8 fine meshes (children) form coarse mesh (parent) in isotropic manner (octree) V-cycle Domain-Decomposition-based: Localized Block-Jacobi, Overlapped Additive Schwartz Domain Decomposition (ASDD) Operations using a single at the coarsest level (redundant)

9 9 Overlapped Additive Schwartz Domain Decomposition Method ASDD: Localized Block-Jacobi Precond. is stabilized Global Operation Local Operation Global Nesting orrection r Mz, r M z r M z ) ( n n n n z M z M r M z z ) ( n n n n z M z M r M z z i :Internal (i N) i :External(i>N)

10 omputations on Fujitsu FX0 Fujitsu PRIMEHP FX0 at U.Tokyo (Oakleaf-FX) 6 s/node, flat/uniform access to memory Up to 4,096 nodes (65,536 s) (Large-Scale HP hallenge) Max 7,79,869,84 unknowns Flat MPI, HB 4x4, HB 8x, HB 6x HB MxN: M-threads x N-MPI-processes on each node Memory Weak Scaling 64 3 L L L L L L L L L L L L L cells/ Strong Scaling 8 3 8= 6,777,6 unknowns, from 8 to 4,096 nodes Network Topology is not specified D L L L L

11 Memory L L L L L L L L L L L L L L L L L HB M x N Number of OpenMP threads per a single MPI process Number of MPI process per a single node

12 oarse Grid Solver on a Single ore PE#0 Original Approach PE# PE# PE#3 Size of the oarsest Grid= Number of MPI Processes Redundant Process lev= lev= lev=3 lev=4 In Flat-MPI, this size is larger

13 3 Weak Scaling: up to 4,096 nodes up to 7,79,869,84 meshes (64 3 meshes/) Although ASDD is applied, convergence is getting worse for larger number of nodes/domains, DOWN is GOOD sec Flat MPI HB 4x4: org. HB 8x: org. HB 6x: org. Iterations sec Iterations 50 Flat MPI HB 4x4: org. HB 8x: org. HB 6x: org ORE# Down is good ORE#

14 4 Strategy: oarse Grid Aggregation Decreasing number of MPI processes at coarser level. Switching to redundant processes for coarse grid solvers earlier (i.e. at finer level). Node-to-node communications at coarser levels are reduced. oarse grid solver on a single MPI proc., not a single HB 4x4: 4 s HB 8x: 8 s HB 6x: 6 s, Single Node Info. gathered to a single MPI process OpenMP is needed In post-peta/exa-scale systems, each node will consists O(0 ) of s, therefore utilization of these many s on each node should be considered. Parallel Serial/Redundant Fine oarse

15 oarse Grid Aggregation: at lev= 5 PE#0 PE# PE# PE#3 lev= lev= lev=3 lev=4

16 oarse Grid Aggregation: at lev= 6 PE#0 PE# PE# PE#3 Apply multigrid procedure on a single MPI process lev= lev= Trade-off: oarse grid solver is more expensive than original approach.

17 Results at 4,096 nodes lev: switching level to coarse grid solver Opt. Level= 7 DOWN is GOOD HB 8x HB 6x Parallel Serial/Redundant Fine oarse Rest oarse Grid Solver MPI_Allgather MPI_Isend/Irecv/Allreduce Rest oarse Grid Solver MPI_Allgather MPI_Isend/Irecv/Allreduce sec. 0.0 sec level=5 level=6 level=7 level=8: org. Switching Level for oarse Grid Solver Down is good level=5 level=6 level=7 level=8 level=9: org. Switching Level for oarse Grid Solver

Fujitsu@S 8 Weak Scaling: up to 4,096 nodes up to 7,79,869,84 meshes (64 3 meshes/) onvergence has been much improved by coarse grid aggregation, DOWN is GOOD sec. Iterations sec. 0.0 5.0 0.

18 8 Weak Scaling: up to 4,096 nodes up to 7,79,869,84 meshes (64 3 meshes/) onvergence has been much improved by coarse grid aggregation, DOWN is GOOD sec. Iterations sec Flat MPI HB 4x4: org. HB 8x: org. HB 6x: org. HB 4x4: new HB 8x: new HB 6x: new ORE# Iterations Down is good Flat MPI HB 8x: org. HB 4x4: new HB 6x: new ORE# HB 4x4: org. HB 6x: org. HB 8x: new

19 9 Strong Scaling: up to 4,096 nodes 68,435,456 meshes, only 6 3 meshes/ at 4,096 nodes UP is GOOD HB 8x 000 UP is good HB 6x 000 Parallel Performance % 00 0 Flat MPI HB 8x: original HB 8x: optimized Parallel Performance % 00 0 Flat MPI HB 6x: original HB 6x: optimized ORE# ORE#

20 0 Strong Scaling at 4,096 nodes 68,435,456 meshes, only 6 3 meshes/ at 4,096 nodes Rest oarse Grid Solver MPI_Allgather MPI_Isend/Irecv/Allreduce sec HB 4x4 Org. HB 4x4 Opt. HB 8x Org. HB 8x Opt. HB 6x Org. HB 6x Opt. Iterations Parallel performance (%)

21 Summary oarse Grid Aggregation is effective for stabilization of convergence at O(0 4 ) s for MGG Not so effective on communications HB 8x is the best at 4,096 nodes HB programming model with smaller number of MPI processes (e.g. HB 8x, HB 6x) are better, if the number of nodes are larger. Smaller problem size for coarse grid solver If the number of nodes are larger, performance is better Further Optimization/Tuning Single node/ performance for FX0 current code is optimized for TK/Tokyo (cc-numa) Overlapping of computation & communication more difficult than SpMV Automatic selection of the optimum switching level lev Gradual reduction of MPI process number (e.g )

22 Reference: Kengo Nakajima OpenMP/MPI Hybrid Parallel Multigrid Method on Fujitsu FX0 Supercomputer System IEEE Proceedings of 0 IEEE International onference on luster omputing Workshops (0 International Workshop on Parallel Algorithm and Parallel Software (IWPAPS)), p.99-06, Beijing, hina (IEEE Digital Library, Print ISBN: , Digital Object Identifier : 0.09/lusterW.0.35) 0.

23 3 Please visit the booth of Oakleaf/Kashiwa Alliance The University of Tokyo #943

Iterative Linear Solvers for Sparse Matrices

Iterative Linear Solvers for Sparse Matrices Kengo Nakajima Information Technology Center, The University of Tokyo, Japan 2013 International Summer School on HPC Challenges in Computational Sciences New