Workshop on Efficient Solvers in Biomedical Applications, Graz, July 2-5, 2012


1 This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Security, LLC under contract DE-AC52-07NA

2 Allison Baker, Rob Falgout, Tzanio Kolev, Jacob Schroder, Martin Schulz, Panayot Vassilevski
University of Illinois at Urbana-Champaign: Hormozd Gahvari, William Gropp, Luke Olson
IBM: Kirk Jordan


3 The solution of linear systems is at the core of many scientific simulation codes
Magnetohydrodynamics, Elasticity / Plasticity, Electromagnetics, Facial surgery simulations
High fidelity requires huge linear systems and large-scale (e.g., petascale) computing
We are developing parallel linear solvers and software (hypre), driven by applications


4 Multigrid solvers are essential components of LLNL simulation
[Figure: time to solution vs. number of processors (problem size) for Diag-CG and Multigrid-CG; the Multigrid-CG curve is labeled scalable]
State-of-the-art is hypre, with proven scalability to BG/L class machines
Current solvers will break down on tomorrow's exascale machines
Enormous core counts and fine-grain parallelism will degrade convergence
New mathematics R&D (convergence)
  Effective smoothers that are also highly parallel (e.g., polynomial)
Techniques for multicore (performance)
  Different programming model?
  Communication-reducing algorithms

5 Setup Phase
  Select coarse grids
  Define interpolation, $P^{(m)}$, $m = 1, 2, \ldots$
  Define restriction, $R^{(m)} = (P^{(m)})^T$
  Define coarse-grid operators, $A^{(m+1)} = R^{(m)} A^{(m)} P^{(m)}$
Solve Phase (level m)
  Smooth $A^{(m)} u^m = f^m$
  Compute $r^m = f^m - A^{(m)} u^m$
  Restrict $r^{m+1} = R^{(m)} r^m$
  Solve $A^{(m+1)} e^{m+1} = r^{m+1}$
  Interpolate $e^m = P^{(m)} e^{m+1}$
  Correct $u^m \leftarrow u^m + e^m$
  Smooth $A^{(m)} u^m = f^m$


6 Smoothers significantly affect AMG convergence and runtime
solving: $Ax = b$, $A$ SPD
smoothing: $x_{n+1} = x_n + M^{-1}(b - A x_n)$
Gauss-Seidel: highly sequential
parallel smoothers: Jacobi, hybrid Gauss-Seidel (default smoother in hypre)
[Figure: block-diagonal structure of the hybrid smoother $M_H$, one block per core (Core 0, Core 1, ..., Core p)]
Expect degraded convergence for exascale/multicore machines
  number of blocks increases with number of processors
  increased fine-grain parallelism (less memory, threading)
Objective: investigate/develop smoothers that are not affected by the parallelism

7 Matrices are distributed across P processors in contiguous blocks of rows: $A = \begin{pmatrix} A_1 \\ A_2 \\ \vdots \\ A_P \end{pmatrix}$
The smoother is threaded such that each thread works on a contiguous subset of the rows: $A_p = \begin{pmatrix} t_1 \\ t_2 \\ t_3 \\ t_4 \end{pmatrix}$
Hybrid: GS within each thread, Jacobi on thread boundaries
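A hybrid sweep of this kind can be sketched serially. The blocks stand in for the per-thread (or per-processor) row ranges: within a block the sweep uses already-updated values (Gauss-Seidel), and across block boundaries it uses values from the previous iterate (Jacobi). The function name and the even split into n_blocks row ranges are assumptions made for illustration; a dense NumPy matrix is used for readability.

```python
import numpy as np

def hybrid_gauss_seidel(A, x, b, n_blocks):
    """One hybrid GS sweep over contiguous row blocks (serial emulation)."""
    n = A.shape[0]
    x_old = x.copy()
    x_new = x.copy()
    bounds = np.linspace(0, n, n_blocks + 1).astype(int)
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        for i in range(lo, hi):
            # couplings inside the block: updated values before i, old values after i
            s = A[i, lo:i] @ x_new[lo:i] + A[i, i + 1:hi] @ x_old[i + 1:hi]
            # couplings to other blocks: always old values (Jacobi)
            s += A[i, :lo] @ x_old[:lo] + A[i, hi:] @ x_old[hi:]
            x_new[i] = (b[i] - s) / A[i, i]
    return x_new
```

With n_blocks = 1 this reduces to a forward Gauss-Seidel sweep, and with one row per block it reduces to Jacobi, which is why convergence is expected to degrade as the number of blocks grows.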


8 Grid is partitioned element-wise into nice subdomains for each processor
[Figure: 16 procs (colors indicate element numbering)]
AMG determines thread subdomains by splitting nodes (not elements)
The application's numbering of the elements/nodes becomes relevant! (i.e., threaded domains dependent on colors)

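9 Multiplicative: weighted hybrid smoother $M_\omega = \omega M_H$
  convergent if $M_\omega$ is SPD and $\omega = \lambda_{\max}(M_H^{-1/2} A M_H^{-1/2})$
Additive ($\ell_1$ smoother):
  hybrid $\ell_1$-GS: $M_{\ell_1} = M_H + D^{\ell_1}$ with $d^{\ell_1}_{ii} = \sum_j |a^o_{ij}|$ (row sums over the off-processor part $A^O$ of $A$); always convergent when GS is convergent
  hybrid $\ell_1$-Jacobi: $M = D + D^{\ell_1}$
[Figure: block structure of $A$]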
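The hybrid l1-Jacobi variant is simple enough to sketch directly. The code below builds the diagonal of M = D + D^{l1}, where the l1 term sums the absolute values of the entries of each row that lie outside the row's own block. The even split into n_blocks contiguous row blocks and both function names are assumptions for illustration, not hypre's API.

```python
import numpy as np

def l1_jacobi_diagonal(A, n_blocks):
    """Diagonal of M = D + D^{l1} for contiguous row blocks."""
    n = A.shape[0]
    bounds = np.linspace(0, n, n_blocks + 1).astype(int)
    m = np.diag(A).astype(float).copy()
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        # l1 norm of the part of each row that couples to other blocks
        off = np.abs(A[lo:hi, :lo]).sum(axis=1) + np.abs(A[lo:hi, hi:]).sum(axis=1)
        m[lo:hi] += off
    return m

def l1_jacobi_sweep(A, x, b, m):
    """One sweep x <- x + M^{-1} (b - A x) with diagonal M given by m."""
    return x + (b - A @ x) / m
```

When every row is its own block, m is the full l1 norm of each row, and for SPD A the iteration matrix I - M^{-1}A then has spectral radius below one.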

10 Polynomial smoothers: $I - M^{-1}A = p(A)$, $p(0) = 1$
Advantages:
  independent of parallelism
  MatVec kernel has been tuned
  low-degree polynomial sufficient
Disadvantage: need extreme eigenvalue estimates (~10 CG iterations)
  note: only need to damp high-freq. (30% of spectrum)
(see Adams et al., 2003, for Smoothed Aggregation-AMG)
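A Chebyshev smoother needs only matrix-vector products, which is what makes it independent of the parallel partitioning. The sketch below uses the standard three-term Chebyshev recurrence on a target interval [lmin, lmax]; the function name and the choice to pass the interval explicitly are assumptions. In practice lmax would come from an eigenvalue estimate, and lmin would be set to a fixed fraction of lmax so that only the upper part of the spectrum is damped.

```python
import numpy as np

def chebyshev_smooth(A, b, x, lmin, lmax, degree=2):
    """Apply `degree` Chebyshev steps targeting eigenvalues in [lmin, lmax].

    The error after the loop is p(A) times the initial error, with
    p(t) = T_k((theta - t) / delta) / T_k(theta / delta) and p(0) = 1.
    """
    theta = 0.5 * (lmax + lmin)    # center of the target interval
    delta = 0.5 * (lmax - lmin)    # half-width of the target interval
    sigma = theta / delta
    rho = 1.0 / sigma
    r = b - A @ x
    d = r / theta
    for _ in range(degree):
        x = x + d
        r = r - A @ d
        rho_next = 1.0 / (2.0 * sigma - rho)
        d = rho_next * rho * d + (2.0 * rho_next / delta) * r
        rho = rho_next
    return x
```

Each step costs one product with A, so a degree-2 smoother costs two MatVecs per application plus the one-time eigenvalue estimate.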

11 Test problem: $-(a(x,y)u_x)_x - (a(x,y)u_y)_y = f$
  $a(x,y) = 1$ on inner domains
  $a(x,y) = 0.001$ on outer domain
  triangular FEM, 4 subdomains
Smoothers considered here:
  Hybrid-SGS: $M = (D + L^D)^T (D - A^O)^{-1} (D + L^D)$
  $\ell_1$-SGS: $M = (D_{\ell_1} + L^D)^T (2D_{\ell_1} - D - A^O)^{-1} (D_{\ell_1} + L^D)$, with $D_{\ell_1} = D + D^{\ell_1}$
  Cheby(2): a polynomial smoother, where $p(A) = I - M^{-1}A$, $p(0) = 1$; here $p$ is the 2nd order Chebyshev polynomial for $D^{-1/2} A D^{-1/2}$

12 Polynomial smoothers and our new $\ell_1$-smoothers are unaffected by poor partitioning (left) and decreasing block sizes (right)
New smoothers now available in hypre
Baker, Falgout, Kolev, Yang, Multigrid Smoothers for Ultra-Parallel Computing, SIAM Journal on Scientific Computing

13 36 racks with 1024 compute nodes each
Quad-core 850 MHz PowerPC 450 Processor
147,456 cores
4 GB main memory per node, shared by all cores
3D torus network, isolated dedicated partitions

14 [Figure: results for pfmg-1, pfmg-2, amg, amg-bench]

15 Multi-core/multi-socket cluster
864 diskless nodes interconnected by DDR Infiniband
AMD Opteron quad core (2.3 GHz)
Fat tree network

16 Multicore cluster details (Hera):
Individual 512 KB L2 cache for each core
2 MB L3 cache shared by 4 cores
4 sockets per node, 16 cores sharing the 32 GB main memory
NUMA memory access

17 [Figure: results for PFMG-1, PFMG-2, AMG]


19 Communication plots for 128 core problem
[Figure: one panel per level: Level 0, Level 1, Level 2, Level 3, Level 4, Level 5, Level 6, Level ...]

20 [Figure: timeline per process; axes: Proc id, Time; legend: computation, idle time, MPI calls]

21 Baseline model: α-β (latency-inverse bandwidth)
[Figure: cycle steps on level i: smooth, form residual; restrict to level i+1; prolong to level i-1; smooth]
$T_i^{\text{smooth}} = 6 (C_i/P) s_i t_i + 3 (p_i \alpha + n_i \beta)$
$T_i^{\text{coarsen}} = 2 (C_{i+1}/P) \hat{s}_i t_i + \hat{p}_i \alpha + \hat{n}_i \beta$ if $i < L$, and $0$ if $i = L$
$T_i^{\text{prolong}} = 0$ if $i = 0$, and $2 (C_{i-1}/P) \hat{s}_{i-1} t_i + \hat{p}_{i-1} \alpha + \hat{n}_{i-1} \beta$ if $i > 0$
Parameters:
  $P$: number of processes
  $C_i$: number of grid points in level i
  $L$: maximum grid index
  $s_i$, $\hat{s}_i$: average number of nonzeros per row ($A_i$, $P_i$)
  $p_i$, $\hat{p}_i$: maximum number of sends per active process ($A_i$, $P_i$)
  $n_i$, $\hat{n}_i$: maximum number of elements sent per active process ($A_i$, $P_i$)
  $t_i$: time per floating-point operation on level i
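The baseline model is straightforward to evaluate once the per-level quantities are known. The sketch below encodes the three per-level terms as written on the slide. The Level container, the _hat suffix for the interpolation-operator quantities, and the summation of all terms over the levels into one cycle time are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Level:
    C: float       # number of grid points on this level
    s: float       # average nonzeros per row of A_i
    s_hat: float   # average nonzeros per row of the interpolation operator P_i
    p: float       # max sends per active process (A_i)
    p_hat: float   # max sends per active process (P_i)
    n: float       # max elements sent per active process (A_i)
    n_hat: float   # max elements sent per active process (P_i)
    t: float       # time per floating-point operation on this level

def t_smooth(levels, i, P, alpha, beta):
    lv = levels[i]
    return 6 * (lv.C / P) * lv.s * lv.t + 3 * (lv.p * alpha + lv.n * beta)

def t_coarsen(levels, i, P, alpha, beta):
    if i == len(levels) - 1:       # i = L: nothing to restrict to
        return 0.0
    lv = levels[i]
    return 2 * (levels[i + 1].C / P) * lv.s_hat * lv.t + lv.p_hat * alpha + lv.n_hat * beta

def t_prolong(levels, i, P, alpha, beta):
    if i == 0:                     # finest level: nothing to prolong to
        return 0.0
    up = levels[i - 1]
    return 2 * (up.C / P) * up.s_hat * levels[i].t + up.p_hat * alpha + up.n_hat * beta

def cycle_time(levels, P, alpha, beta):
    """Assumed total: sum of the three terms over all levels."""
    return sum(
        t_smooth(levels, i, P, alpha, beta)
        + t_coarsen(levels, i, P, alpha, beta)
        + t_prolong(levels, i, P, alpha, beta)
        for i in range(len(levels))
    )
```

Keeping the three terms as separate functions makes it easy to see which level and which operation dominates the predicted time.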

22 Take architectural features into account with penalties
Distance of communication: add time per hop γ
Lower effective bandwidth: let m = # msgs, l = # links, $B_{HW}$ = hardware bandwidth, $B_{MPI}$ = MPI bandwidth
  BG/P: multiply β by $B_{HW}/B_{MPI}$
  Other machines: multiply β by $(B_{HW}/B_{MPI} + m/l)$
Multicore penalties: let c = # cores (MPI tasks) per node, $P_i$ = # active processes on level i, P = # processes
  Multicore latency penalty: multiply α by $c P_i / P$
  Multicore distance penalty: multiply γ by $c P_i / P$
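The penalties amount to rescaling the three machine parameters before they enter the model. A minimal sketch, assuming the multipliers exactly as listed above; the function name, argument names, and the bgp switch are hypothetical, and how γ enters the cycle time is not modeled here.

```python
def penalized_parameters(alpha, beta, gamma, *, b_hw, b_mpi, c, active, total,
                         msgs=0, links=1, bgp=False):
    """Return (alpha, beta, gamma) after the bandwidth and multicore penalties.

    b_hw, b_mpi : hardware and MPI bandwidth
    c           : cores (MPI tasks) per node
    active      : active processes on the level
    total       : total number of processes
    msgs, links : message and link counts (ignored on BG/P)
    """
    bandwidth_factor = b_hw / b_mpi
    if not bgp:
        bandwidth_factor += msgs / links      # other machines: add m / l
    core_factor = c * active / total          # multicore latency/distance penalty
    return alpha * core_factor, beta * bandwidth_factor, gamma * core_factor
```

For example, with 4 cores per node and half of the processes active on a level, the latency and per-hop delay are both doubled.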


23 Model parameters: α latency, β inverse bandwidth, γ delay per extra hop
We added penalties to the basic models based on machine constraints: distance effects, reduced per core bandwidth, number of cores per node
[Figure: BG/P, AMG] On BG/P, the simple α-β-γ model with bandwidth penalty leads to best fit.
[Figure: Hera, AMG] On Hera (fat tree network), the distance penalty is the largest contributor to bad performance
Gahvari, Baker, Schulz, Yang, Jordan, Gropp, Modeling the Performance of an Algebraic Multigrid Cycle on HPC Platforms, ICS11 Proceedings
Appeared in ICS11 Proceedings; newest version includes OpenMP threads, submitted to ICPP12


25 AMG Solve Cycle on Hera, 3456 Cores
[Figure: time (s) per level for 16 MPI x 1 OMP, 8 MPI x 2 OMP, 4 MPI x 4 OMP, 2 MPI x 8 OMP, 1 MPI x 16 OMP]
MPI/OMP Mix        Cycle Time
16 MPI x 1 OMP     187 ms
8 MPI x 2 OMP      116 ms
4 MPI x 4 OMP      67.7 ms
2 MPI x 8 OMP      91.7 ms
1 MPI x 16 OMP     199 ms
3D 7-point Laplace model problem, 50 x 50 x 25 points/core

26 [Figure: seconds vs. no. of cores; series: MPI 16x1, H 8x2, H 4x4, H 2x8, OMP 1x]

27 [Figure: seconds vs. no. of cores; series: PFMG-1, MPI; PFMG-1, 4x4; PFMG-1, 1x16; PFMG-2, MPI; PFMG-2, 4x4; PFMG-2, 1x]

28 [Figure: series: PFMG-1, MPI; PFMG-1, 2x2; PFMG-1, 1x4; PFMG-2, MPI; PFMG-2, 2x2; PFMG-2, 1x4; AMG, MPI; AMG, 2x2; AMG, 1x]

29 Non-Galerkin AMG (Schroder, Falgout)
Redundant coarse grid solve (Gahvari, Gropp, Jordan, Schulz, Yang)
Additive AMG methods (Gahvari, Olson, Vassilevski, Yang, ...)

30 Choose non-Galerkin coarse grid for parallel efficiency
Sparsify to yield a new coarse-grid operator
Raises critical issues related to AMG theory
  Need good spectral equivalence
  Provably implies AMG convergence
Algorithm heuristic:
  Choose sparsity pattern for the new operator
    Initially the pattern of $P_I^T A P + P^T A P_I$
    Add more if necessary
  Choose matrix entries
    Require accuracy for near nullspace
    Remove edges in the Galerkin operator using stencil collapsing
[Figure: Collapse Edge]

31 Results: 3D Diffusion
  Classical AMG: scenarios 1, 2
  Proposed approach: scenarios 3, 4
Results: 3D anisotropy on unstructured grid
  Convergence for Scenario 4 comparable to classical AMG with substantially reduced stencil size

32 [Figure: cycle diagram with labels: smooth, form residual; restrict to level i+1; prolong to level i-1; smooth; all-gather at level i; serial AMG coarse solve]

33 [Table: columns No. of cores, then Hera, Atlas, Coastal for the 7pt 3D Laplace problem and again for the 27pt stencil]
Hera: AMD Opteron cluster
Atlas: AMD Opteron cluster
Coastal: Intel Xeon cluster
More sophisticated version under development

34 Illustration of chunk data distribution using 4 chunks
[Figure: all-gather at level i]
Cores in each chunk perform the same operations
Now use parallel AMG on k cores for each redundant coarse solve
Communication across chunks
Allows better use of cache
Allows smart choice of cores


36 [Figure: multigrid cycle diagrams with smooth and solve steps; annotation: Perform in parallel]

37 Even with slower convergence, noticeable speedups are possible on the model problem
We are now investigating an additive version of multiplicative multigrid, which preserves convergence

38 $\ell_1$ and polynomial smoothers are promising for smoothing on millions of cores
Getting efficient use out of multi-core architectures is challenging!
Reducing communication is crucial!
New methods show promise for better performance.
Use model to evaluate new algorithmic changes and to predict performance on much larger scale
Continue work on reducing communication
Approaches for more efficient implementation
  Optimized kernels
  New data structures?

