Workshop on Efficient Solvers in Biomedical Applications, Graz, July 2-5, 2012


This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory (Lawrence Livermore National Security, LLC) under contract DE-AC52-07NA27344.

Allison Baker, Rob Falgout, Tzanio Kolev, Jacob Schroder, Martin Schulz, Panayot Vassilevski; University of Illinois at Urbana-Champaign: Hormozd Gahvari, William Gropp, Luke Olson; IBM: Kirk Jordan.

The solution of linear systems is at the core of many scientific simulation codes: magnetohydrodynamics, elasticity/plasticity, electromagnetics, facial surgery simulations. High fidelity requires huge linear systems and large-scale (e.g., petascale) computing. We are developing parallel linear solvers and software (hypre), driven by applications.

[Figure: time to solution vs. number of processors (problem size) for Diag-CG and Multigrid-CG; the multigrid curve stays flat, i.e., scalable.]
Multigrid solvers are essential components of LLNL simulation. The state of the art is hypre, with proven scalability to BG/L-class machines. Current solvers will break down on tomorrow's exascale machines: enormous core counts and fine-grain parallelism will degrade convergence.
- New mathematics R&D (convergence): effective smoothers that are also highly parallel (e.g., polynomial).
- Techniques for multicore (performance): a different programming model? Communication-reducing algorithms.

Setup phase:
- Select coarse grids
- Define interpolation P^(m), m = 1, 2, ...
- Define restriction R^(m) = (P^(m))^T
- Define coarse-grid operators A^(m+1) = R^(m) A^(m) P^(m)

Solve phase (level m):
- Smooth A^(m) u_m = f_m
- Compute r_m = f_m - A^(m) u_m
- Restrict r_{m+1} = R^(m) r_m
- Solve A^(m+1) e_{m+1} = r_{m+1}
- Interpolate e_m = P^(m) e_{m+1}
- Correct u_m <- u_m + e_m
- Smooth A^(m) u_m = f_m
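As a rough illustration of the solve phase above, here is a minimal recursive V-cycle sketch in Python/NumPy. It is a sketch only, assuming dense operators and a weighted-Jacobi smoother as a placeholder; the operator list A and interpolations P are assumed to come from a setup phase like the one described, and this is not hypre's implementation.

```python
import numpy as np

def v_cycle(A, P, b, x, level=0, nu=1):
    """One V-cycle for the hierarchy A[0..L], P[0..L-1] (dense NumPy arrays).

    A[m] : operator on level m, with A[m+1] = P[m].T @ A[m] @ P[m]
    P[m] : interpolation from level m+1 to level m (restriction is P[m].T)
    nu   : number of pre-/post-smoothing sweeps (weighted Jacobi here)
    """
    def smooth(Am, bm, xm, sweeps, omega=2.0 / 3.0):
        D_inv = 1.0 / np.diag(Am)
        for _ in range(sweeps):
            xm = xm + omega * D_inv * (bm - Am @ xm)   # x <- x + omega D^{-1} (b - A x)
        return xm

    if level == len(P):                                # coarsest level: direct solve
        return np.linalg.solve(A[level], b)

    x = smooth(A[level], b, x, nu)                     # pre-smooth
    r = b - A[level] @ x                               # residual
    rc = P[level].T @ r                                # restrict: r_{m+1} = R^(m) r_m
    ec = v_cycle(A, P, rc, np.zeros_like(rc), level + 1, nu)
    x = x + P[level] @ ec                              # correct: u_m <- u_m + P^(m) e_{m+1}
    x = smooth(A[level], b, x, nu)                     # post-smooth
    return x
```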

Smoothers significantly affect AMG convergence and runtime.
- Solving A x = b, A SPD; smoothing: x_{n+1} = x_n + M^{-1} (b - A x_n).
- Gauss-Seidel is highly sequential. Parallel smoothers: Jacobi, hybrid Gauss-Seidel (the default smoother in hypre; M_H is block diagonal, one block per core: Core 0, Core 1, ..., Core p).
- Expect degraded convergence on exascale/multicore machines: the number of blocks increases with the number of processors, and fine-grain parallelism increases (less memory, threading).
- Objective: investigate/develop smoothers that are not affected by the parallelism.
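The relaxation step x_{n+1} = x_n + M^{-1} (b - A x_n) can be sketched as follows; the choice of M (Jacobi's diagonal, a Gauss-Seidel triangle, a hybrid block-diagonal matrix, ...) is what distinguishes the smoothers discussed here. A minimal NumPy sketch, using Jacobi (M = D) purely for illustration:

```python
import numpy as np

def relax(A, b, x, apply_M_inv, sweeps=1):
    """Generic stationary smoother: x <- x + M^{-1} (b - A x)."""
    for _ in range(sweeps):
        x = x + apply_M_inv(b - A @ x)
    return x

# Jacobi as the simplest parallel-friendly choice: M = D = diag(A)
def jacobi_M_inv(A):
    d = np.diag(A).copy()
    return lambda r: r / d

# usage sketch:
# x = relax(A, b, x, jacobi_M_inv(A), sweeps=2)
```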

The matrix is distributed across P processors in contiguous blocks of rows, A = [A_1; A_2; ...; A_P]. The smoother is threaded such that each thread works on a contiguous subset of the rows of its block, e.g., A_p = [t_1; t_2; t_3; t_4]. Hybrid smoother: Gauss-Seidel within each thread, Jacobi on thread boundaries.
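A sketch of the hybrid idea, assuming the locally owned rows are split into contiguous thread blocks: each block is relaxed with Gauss-Seidel using only its own rows and columns, while couplings to other blocks (and other processes) use the previous iterate, i.e., Jacobi-style. The dense matrix and the block_ranges argument are simplifications for illustration; hypre's actual threaded smoother works on its own sparse data structures.

```python
import numpy as np

def hybrid_gs_sweep(A, b, x, block_ranges):
    """One hybrid Gauss-Seidel sweep on a dense local matrix A.

    block_ranges: list of (start, end) row ranges, one per thread block.
    Within a block : Gauss-Seidel (uses already-updated values in the block).
    Across blocks  : Jacobi (uses values from the previous iterate).
    """
    x_old = x.copy()                                    # off-block couplings see the old iterate
    for (lo, hi) in block_ranges:
        for i in range(lo, hi):
            in_block = slice(lo, hi)
            # in-block contribution, excluding the diagonal (Gauss-Seidel part)
            sigma_in = A[i, in_block] @ x[in_block] - A[i, i] * x[i]
            # off-block contribution from the previous iterate (Jacobi part)
            sigma_out = A[i, :] @ x_old - A[i, in_block] @ x_old[in_block]
            x[i] = (b[i] - sigma_in - sigma_out) / A[i, i]
    return x
```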

The application partitions the grid element-wise into nice subdomains for each processor (shown for 16 processors; colors indicate the element numbering). AMG determines thread subdomains by splitting nodes (not elements), so the application's numbering of the elements/nodes becomes relevant (i.e., the threaded domains depend on that numbering).

Multiplicative weighted hybrid smoother: M_ω = ω M_H; convergent if M_ω is SPD and ω = λ_max(M_H^{-1/2} A M_H^{-1/2}).
Additive (l1) smoothers:
- hybrid l1-GS: M = M_H + D^{l1}, where D^{l1} is diagonal with d_i^{l1} = Σ_{j ∉ Ω_i} |a_ij| (the couplings outside row i's block); always convergent when GS is convergent.
- hybrid l1-Jacobi: M = D + D^{l1}.
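A sketch of the l1 modification under the same block setup as above: the diagonal compensation D^{l1} collects the absolute values of the couplings the hybrid smoother ignores (entries outside a row's own block). Shown for the l1-Jacobi variant, M = D + D^{l1}; the block ranges and dense storage are again simplifications.

```python
import numpy as np

def l1_jacobi_setup(A, block_ranges):
    """Build the l1-Jacobi scaling M = D + D^{l1} for a dense local matrix A.

    d_i^{l1} = sum of |a_ij| over columns j outside row i's block.
    """
    n = A.shape[0]
    m = np.zeros(n)
    for (lo, hi) in block_ranges:
        for i in range(lo, hi):
            off_block = np.abs(A[i, :]).sum() - np.abs(A[i, lo:hi]).sum()
            m[i] = A[i, i] + off_block            # M_ii = a_ii + d_i^{l1}
    return m

def l1_jacobi_sweep(A, b, x, m):
    """x <- x + M^{-1} (b - A x) with the diagonal M from l1_jacobi_setup."""
    return x + (b - A @ x) / m
```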

Polynomial smoothers: I - M^{-1} A = p(A), p(0) = 1.
Advantages: independent of the parallelism; the MatVec kernel has been tuned; a low-degree polynomial is sufficient.
Disadvantage: needs extreme eigenvalue estimates (~10 CG iterations). Note that we only need to damp the high-frequency part of the spectrum (~30%); see Adams et al., 2003, for smoothed-aggregation AMG.

Test problem: -(a(x,y) u_x)_x - (a(x,y) u_y)_y = f, with a(x,y) = 1 on the inner domains and a(x,y) = 0.001 on the outer domain; triangular FEM, 4 subdomains.
Smoothers considered here (L_D: strictly lower-triangular on-block part of A; A_O: off-block part):
- Hybrid-SGS: M = (D + L_D)^T (D - A_O)^{-1} (D + L_D)
- l1-SGS: M = (D + D^{l1} + L_D)^T (D + 2D^{l1} - A_O)^{-1} (D + D^{l1} + L_D)
- Cheby(2): a polynomial smoother with p(A) = I - M^{-1} A, p(0) = 1, where p is the 2nd-order Chebyshev polynomial for D^{-1/2} A D^{-1/2}.
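A sketch of a Chebyshev polynomial smoother of the flavor described above, applied to the diagonally scaled operator D^{-1/2} A D^{-1/2}. The largest eigenvalue is estimated here with a few power iterations (a simple stand-in for the CG-based estimate mentioned earlier), and the polynomial targets roughly the upper 30% of the spectrum; the degree, fraction, and estimator are illustrative choices, not the exact hypre settings.

```python
import numpy as np

def estimate_lambda_max(A_scaled, iters=10, seed=0):
    """Crude power-iteration estimate of the largest eigenvalue
    (illustrative stand-in for the CG/Lanczos estimate used in practice)."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A_scaled.shape[0])
    lam = 1.0
    for _ in range(iters):
        w = A_scaled @ v
        lam = np.linalg.norm(w)
        v = w / lam
    return lam

def chebyshev_smooth(A, b, x, degree=2, spectrum_fraction=0.3):
    """Chebyshev smoother for D^{-1/2} A D^{-1/2}, damping the upper part of the spectrum."""
    d_half = np.sqrt(np.diag(A))
    As = A / np.outer(d_half, d_half)            # D^{-1/2} A D^{-1/2}
    bs = b / d_half
    xs = x * d_half                              # scaled unknown: D^{1/2} x

    lmax = estimate_lambda_max(As)
    lmin = spectrum_fraction * lmax              # only damp the top ~30% of the spectrum
    theta, delta = 0.5 * (lmax + lmin), 0.5 * (lmax - lmin)
    sigma = theta / delta

    r = bs - As @ xs
    dvec = r / theta
    rho = 1.0 / sigma
    for _ in range(degree):                      # standard three-term Chebyshev recurrence
        xs = xs + dvec
        r = r - As @ dvec
        rho_new = 1.0 / (2.0 * sigma - rho)
        dvec = rho_new * rho * dvec + (2.0 * rho_new / delta) * r
        rho = rho_new
    return xs / d_half                           # map back: x = D^{-1/2} (D^{1/2} x)
```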

Polynomial smoothers and our new l1 smoothers are unaffected by poor partitioning (left) and by decreasing block sizes (right). The new smoothers are now available in hypre. Reference: Baker, Falgout, Kolev, Yang, "Multigrid Smoothers for Ultra-Parallel Computing", SIAM Journal on Scientific Computing.

BG/P system: 36 racks with 1,024 compute nodes each; quad-core 850 MHz PowerPC 450 processor; 147,456 cores in total; 4 GB main memory per node, shared by all cores; 3D torus network with isolated, dedicated partitions.

[Figure: pfmg-1, pfmg-2, amg, and amg-bench vs. number of cores (up to ~120,000).]

Multi-core / multi-socket cluster (Hera): 864 diskless nodes interconnected by DDR InfiniBand; quad-core AMD Opteron (2.3 GHz); fat-tree network.

Multicore cluster details (Hera): individual 512 KB L2 cache for each core; 2 MB L3 cache shared by 4 cores; 4 sockets per node, 16 cores sharing the 32 GB main memory; NUMA memory access.

[Figure: two panels comparing PFMG-1, PFMG-2, and AMG vs. number of cores (up to ~4,000).]


Communication plots for the 128-core problem, one panel per AMG level (levels 0-7).

[Figure: execution trace, process id vs. time, distinguishing computation, idle time, and MPI calls.]

Baseline model: α-β (latency-inverse bandwidth) model of one cycle level (smooth, form residual, restrict to level i+1; prolong to level i-1, smooth):

T_i^smooth  = 6 (C_i / P) s_i t_i + 3 (p_i α + n_i β)
T_i^coarsen = 2 (C_{i+1} / P) s'_i t_i + p'_i α + n'_i β   if i < L;   0 if i = L
T_i^prolong = 2 (C_{i-1} / P) s'_{i-1} t_i + p'_{i-1} α + n'_{i-1} β   if i > 0;   0 if i = 0

where
- P: number of processes
- C_i: number of grid points on level i
- L: maximum grid index
- s_i, s'_i: average number of nonzeros per row (of A_i, P_i)
- p_i, p'_i: maximum number of sends per active process (for A_i, P_i)
- n_i, n'_i: maximum number of elements sent per active process (for A_i, P_i)
- t_i: time per floating-point operation on level i
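A small sketch that evaluates the baseline model for a given hierarchy under the α-β assumptions above. The per-level counts (C_i, s_i, p_i, n_i, ...) would come from the actual AMG hierarchy and machine measurements; the dictionary keys below are placeholders of my own naming (s_P, p_P, n_P stand for the primed quantities associated with P_i).

```python
def cycle_time(levels, P, alpha, beta):
    """Baseline alpha-beta model of one AMG cycle.

    levels: list of dicts with keys
      C            grid points on the level
      s, s_P       average nonzeros per row of A_i and P_i
      p, p_P       max sends per active process for A_i and P_i
      n, n_P       max elements sent per active process for A_i and P_i
      t            time per flop on the level
    """
    L = len(levels) - 1
    total = 0.0
    for i, lv in enumerate(levels):
        t_smooth = 6 * (lv["C"] / P) * lv["s"] * lv["t"] + 3 * (lv["p"] * alpha + lv["n"] * beta)
        t_coarsen = 0.0
        if i < L:
            t_coarsen = (2 * (levels[i + 1]["C"] / P) * lv["s_P"] * lv["t"]
                         + lv["p_P"] * alpha + lv["n_P"] * beta)
        t_prolong = 0.0
        if i > 0:
            prev = levels[i - 1]
            t_prolong = (2 * (prev["C"] / P) * prev["s_P"] * lv["t"]
                         + prev["p_P"] * alpha + prev["n_P"] * beta)
        total += t_smooth + t_coarsen + t_prolong
    return total
```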

Take architectural features into account with penalties:
- Distance of communication: add a time per hop, γ.
- Lower effective bandwidth: let m = number of messages, l = number of links, B_HW = hardware bandwidth, B_MPI = MPI bandwidth. On BG/P: multiply β by B_HW / B_MPI. On other machines: multiply β by (B_HW / B_MPI + m/l).
- Multicore penalties: let c = number of cores (MPI tasks) per node, P_i = number of active processes on level i, P = number of processes. Multicore latency penalty: multiply α by c P_i / P. Multicore distance penalty: multiply γ by c P_i / P.
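The penalties can be layered on top of the baseline model by adjusting α, β, and γ per level before evaluating it. A sketch of those adjustments, treating the hop count, message/link counts, and bandwidths as measured inputs; how exactly the γ·hops term enters the level time is an assumption of this sketch.

```python
def penalized_params(alpha, beta, gamma, *, hops, B_hw, B_mpi,
                     msgs=None, links=None, cores_per_node=1,
                     P_active=None, P=None):
    """Apply the distance, bandwidth, and multicore penalties for one level.

    Returns (alpha_eff, beta_eff, extra_hop_time), where extra_hop_time is the
    additive gamma-per-hop term for this level's messages.
    """
    # lower effective bandwidth
    if msgs is None or links is None:               # BG/P-style penalty
        beta_eff = beta * (B_hw / B_mpi)
    else:                                           # other machines
        beta_eff = beta * (B_hw / B_mpi + msgs / links)
    # multicore penalties
    alpha_eff, gamma_eff = alpha, gamma
    if P_active is not None and P is not None:
        scale = cores_per_node * P_active / P
        alpha_eff *= scale                          # multicore latency penalty
        gamma_eff *= scale                          # multicore distance penalty
    return alpha_eff, beta_eff, gamma_eff * hops    # distance penalty: gamma per extra hop
```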

Model parameters: α latency, β inverse bandwidth, γ delay per extra hop. We added penalties to the basic models based on machine constraints: distance effects, reduced per-core bandwidth, number of cores per node. On BG/P, the simple α-β-γ model with the bandwidth penalty gives the best fit for AMG. On Hera (fat-tree network), the distance penalty is the largest contributor to the poor performance. Gahvari, Baker, Schulz, Yang, Jordan, Gropp, "Modeling the Performance of an Algebraic Multigrid Cycle on HPC Platforms", appeared in the ICS'11 Proceedings; the newest version includes OpenMP threads and was submitted to ICPP'12.


AMG solve cycle on Hera, 3,456 cores; 3D 7-point Laplace model problem, 50 x 50 x 25 points per core. [Figure: time (s) per level, levels 0-10, for each MPI/OpenMP mix.]

MPI/OMP mix       Cycle time
16 MPI x 1 OMP    187 ms
8 MPI x 2 OMP     116 ms
4 MPI x 4 OMP     67.7 ms
2 MPI x 8 OMP     91.7 ms
1 MPI x 16 OMP    199 ms

[Figure: time in seconds vs. number of cores (up to ~12,000) for MPI 16x1, hybrid 8x2, 4x4, 2x8, and OpenMP 1x16; inset shows the range up to ~400 cores.]

[Figure: time in seconds vs. number of cores (up to ~4,000) for PFMG-1 and PFMG-2, each with MPI, 4x4, and 1x16 configurations.]

[Figure: two panels of time vs. number of cores (up to ~100,000) for PFMG-1, PFMG-2, and AMG, each with MPI, 2x2, and 1x4 configurations.]

- Non-Galerkin AMG (Schroder, Falgout)
- Redundant coarse-grid solve (Gahvari, Gropp, Jordan, Schulz, Yang)
- Additive AMG methods (Gahvari, Olson, Vassilevski, Yang)

Choose a non-Galerkin coarse grid for parallel efficiency: sparsify the Galerkin product P^T A P to yield a new coarse-grid operator. This raises critical issues related to AMG theory: we need good spectral equivalence, which provably implies AMG convergence. Algorithm heuristic:
- Choose the sparsity pattern for the coarse operator: start from a reduced pattern based on the Galerkin product P^T A P (e.g., an injection-based P_I^T A P_I), and add more entries if necessary.
- Choose the matrix entries: require accuracy for the near-nullspace of A, and remove edges using stencil collapsing, redistributing each collapsed edge over the remaining stencil (a simplified lumping version is sketched below).
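To make the idea concrete, here is a deliberately simplified sparsification sketch: form the Galerkin operator, keep only the entries in a chosen pattern, and lump each dropped entry back into its row so that the action on a given near-nullspace vector (the constant vector for a Laplacian-like operator) is preserved. This only illustrates the flavor of "stencil collapsing"; the actual non-Galerkin AMG heuristic of Falgout and Schroder is considerably more careful about where the dropped values go.

```python
import numpy as np

def sparsify_coarse_operator(A, P, pattern, nullspace=None):
    """Illustrative non-Galerkin sparsification of A_c = P^T A P (dense sketch).

    pattern  : boolean matrix, True where entries of the coarse operator are kept
               (the diagonal is assumed to be kept)
    nullspace: near-nullspace vector whose action should be preserved
               (defaults to the constant vector; entries must be nonzero)
    """
    Ac = P.T @ A @ P
    n = Ac.shape[0]
    v = np.ones(n) if nullspace is None else nullspace
    Ahat = np.where(pattern, Ac, 0.0)
    # collapse each dropped entry a_ij onto the diagonal so that
    # (Ahat v)_i == (Ac v)_i for the chosen near-nullspace vector v
    for i in range(n):
        dropped = np.where(~pattern[i])[0]
        correction = Ac[i, dropped] @ v[dropped]
        Ahat[i, i] += correction / v[i]
    return Ahat
```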

Results for 3D diffusion (classical AMG: scenarios 1 and 2; proposed approach: scenarios 3 and 4) and for 3D anisotropy on an unstructured grid: convergence for scenario 4 is comparable to classical AMG, with a substantially reduced stencil size.

Redundant coarse-grid solve: follow the normal cycle (smooth, form the residual, restrict to level i+1) down to level i, all-gather the level-i problem, perform a serial AMG coarse solve redundantly, then prolong to level i-1 and smooth.
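A sketch of the all-gather step with mpi4py, under simplifying assumptions: each rank owns a contiguous block of rows of the level-i matrix (stored dense here) and the matching part of the residual; every rank gathers the whole coarse problem, solves it redundantly, and keeps its own slice of the correction. The real implementation works on hypre's data structures and uses a serial AMG solve rather than a dense factorization.

```python
import numpy as np
from mpi4py import MPI

def redundant_coarse_solve(comm, A_local, r_local):
    """All-gather the coarse problem and solve it redundantly on every rank.

    A_local: this rank's contiguous block of rows of the coarse matrix (dense)
    r_local: this rank's part of the coarse residual
    Returns this rank's part of the coarse correction.
    """
    A_blocks = comm.allgather(A_local)           # every rank receives all row blocks
    r_blocks = comm.allgather(r_local)
    A_full = np.vstack(A_blocks)
    r_full = np.concatenate(r_blocks)

    e_full = np.linalg.solve(A_full, r_full)     # redundant solve (stand-in for serial AMG)

    # keep only the rows this rank owns
    offsets = np.cumsum([0] + [len(b) for b in r_blocks])
    rank = comm.Get_rank()
    return e_full[offsets[rank]:offsets[rank + 1]]

# usage sketch: e_local = redundant_coarse_solve(MPI.COMM_WORLD, A_local, r_local)
```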

                7-pt 3D Laplace problem        27-pt stencil
No. of cores    Hera    Atlas   Coastal        Hera    Atlas   Coastal
128             1.77    1.11    1.62           1.40    1.03    1.31
432             2.00    1.11    2.07           1.39    1.20    1.64
1024            1.88    1.25    2.09           1.57    1.25    1.93
2000            1.79    1.18    2.04           1.62    1.25    1.66

Hera: AMD Opteron cluster. Atlas: AMD Opteron cluster. Coastal: Intel Xeon cluster. A more sophisticated version is under development.

All-gather at level i; illustration of the chunk data distribution using 4 chunks. Cores in each chunk perform the same operations. Now use parallel AMG on k cores for each redundant coarse solve, with communication across chunks. This allows better use of cache and a smart choice of cores.
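The chunked variant replaces the fully redundant serial solve with one parallel solve per chunk. With mpi4py, the chunk communicators can be formed with a communicator split; a sketch, where the number of chunks and the mapping of ranks to chunks are illustrative choices:

```python
from mpi4py import MPI

def make_chunk_comm(comm, num_chunks):
    """Split comm into num_chunks sub-communicators of (roughly) equal size.

    Every chunk later receives a full copy of the coarse problem (the all-gather
    at level i) and runs parallel AMG on its k = size/num_chunks cores.
    """
    rank = comm.Get_rank()
    size = comm.Get_size()
    chunk = rank * num_chunks // size            # contiguous ranks share a chunk
    chunk_comm = comm.Split(chunk, rank)         # color = chunk id, key = rank
    return chunk, chunk_comm

# usage sketch:
# chunk, chunk_comm = make_chunk_comm(MPI.COMM_WORLD, num_chunks=4)
# each chunk_comm then solves the gathered coarse problem with parallel AMG
```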


[Figure: additive cycle — the smooths on all levels and the coarsest-level solve are performed in parallel, rather than one after another as in a multiplicative cycle.]

Even with slower convergence, noticeable speedups are possible on the model problem. We are now investigating an additive version of multiplicative multigrid that preserves convergence.
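For comparison with the multiplicative V-cycle sketched earlier, here is a minimal sketch of a generic additive cycle: all level corrections are computed from the same fine-grid residual (so the smooths can run in parallel, as in the figure) and are then summed. This is the textbook additive pattern, not necessarily the specific variant under investigation here.

```python
import numpy as np

def additive_cycle(A, P, b, x, omega=2.0 / 3.0):
    """One additive cycle: independent level corrections from the same residual.

    A[m] : operator on level m (A[m+1] = P[m].T @ A[m] @ P[m]), dense NumPy arrays
    P[m] : interpolation from level m+1 to level m
    """
    L = len(P)
    r = b - A[0] @ x

    # restrict the fine residual to every level up front
    residuals = [r]
    for m in range(L):
        residuals.append(P[m].T @ residuals[-1])

    # level corrections are independent of each other (can run in parallel)
    corrections = []
    for m in range(L):
        corrections.append(omega * residuals[m] / np.diag(A[m]))     # one Jacobi smooth
    corrections.append(np.linalg.solve(A[L], residuals[L]))          # coarsest-level solve

    # interpolate each correction back to the finest level and sum
    update = corrections[L]
    for m in reversed(range(L)):
        update = corrections[m] + P[m] @ update
    return x + update
```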

- l1 and polynomial smoothers are promising for smoothing on millions of cores.
- Getting efficient use out of multicore architectures is challenging; reducing communication is crucial.
- New methods show promise for better performance.
- Use the model to evaluate new algorithmic changes and to predict performance at much larger scale.
- Continue work on reducing communication.
- Approaches for more efficient implementation: optimized kernels, new data structures?

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.