Workshop on Efficient Solvers in Biomedical Applications, Graz, July 2-5, 2012


This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory (Lawrence Livermore National Security, LLC) under contract DE-AC52-07NA27344.

Allison Baker, Rob Falgout, Tzanio Kolev, Jacob Schroder, Martin Schulz, Panayot Vassilevski; University of Illinois at Urbana-Champaign: Hormozd Gahvari, William Gropp, Luke Olson; IBM: Kirk Jordan.

The solution of linear systems is at the core of many scientific simulation codes: magnetohydrodynamics, elasticity/plasticity, electromagnetics, facial surgery simulations. High fidelity requires huge linear systems and large-scale (e.g., petascale) computing. We are developing parallel linear solvers and software (hypre), driven by applications.

[Figure: time to solution vs. number of processors (problem size) for Diag-CG and Multigrid-CG; the multigrid curve stays flat, i.e., scalable.]
Multigrid solvers are essential components of LLNL simulation. The state of the art is hypre, with proven scalability to BG/L-class machines. Current solvers will break down on tomorrow's exascale machines: enormous core counts and fine-grain parallelism will degrade convergence.
- New mathematics R&D (convergence): effective smoothers that are also highly parallel (e.g., polynomial).
- Techniques for multicore (performance): a different programming model? Communication-reducing algorithms.

Setup phase:
- Select coarse grids
- Define interpolation P^(m), m = 1, 2, ...
- Define restriction R^(m) = (P^(m))^T
- Define coarse-grid operators A^(m+1) = R^(m) A^(m) P^(m)

Solve phase (level m):
- Smooth A^(m) u_m = f_m
- Compute r_m = f_m - A^(m) u_m
- Restrict r_{m+1} = R^(m) r_m
- Solve A^(m+1) e_{m+1} = r_{m+1}
- Interpolate e_m = P^(m) e_{m+1}
- Correct u_m <- u_m + e_m
- Smooth A^(m) u_m = f_m
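As a rough illustration of the solve phase above, here is a minimal recursive V-cycle sketch in Python/NumPy. It is a sketch only, assuming dense operators and a weighted-Jacobi smoother as a placeholder; the operator list A and interpolations P are assumed to come from a setup phase like the one described, and this is not hypre's implementation.

```python
import numpy as np

def v_cycle(A, P, b, x, level=0, nu=1):
    """One V-cycle for the hierarchy A[0..L], P[0..L-1] (dense NumPy arrays).

    A[m] : operator on level m, with A[m+1] = P[m].T @ A[m] @ P[m]
    P[m] : interpolation from level m+1 to level m (restriction is P[m].T)
    nu   : number of pre-/post-smoothing sweeps (weighted Jacobi here)
    """
    def smooth(Am, bm, xm, sweeps, omega=2.0 / 3.0):
        D_inv = 1.0 / np.diag(Am)
        for _ in range(sweeps):
            xm = xm + omega * D_inv * (bm - Am @ xm)   # x <- x + omega D^{-1} (b - A x)
        return xm

    if level == len(P):                                # coarsest level: direct solve
        return np.linalg.solve(A[level], b)

    x = smooth(A[level], b, x, nu)                     # pre-smooth
    r = b - A[level] @ x                               # residual
    rc = P[level].T @ r                                # restrict: r_{m+1} = R^(m) r_m
    ec = v_cycle(A, P, rc, np.zeros_like(rc), level + 1, nu)
    x = x + P[level] @ ec                              # correct: u_m <- u_m + P^(m) e_{m+1}
    x = smooth(A[level], b, x, nu)                     # post-smooth
    return x
```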

Smoothers significantly affect AMG convergence and runtime.
- Solving A x = b, A SPD; smoothing: x_{n+1} = x_n + M^{-1} (b - A x_n).
- Gauss-Seidel is highly sequential. Parallel smoothers: Jacobi, hybrid Gauss-Seidel (the default smoother in hypre; M_H is block diagonal, one block per core: Core 0, Core 1, ..., Core p).
- Expect degraded convergence on exascale/multicore machines: the number of blocks increases with the number of processors, and fine-grain parallelism increases (less memory, threading).
- Objective: investigate/develop smoothers that are not affected by the parallelism.
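The relaxation step x_{n+1} = x_n + M^{-1} (b - A x_n) can be sketched as follows; the choice of M (Jacobi's diagonal, a Gauss-Seidel triangle, a hybrid block-diagonal matrix, ...) is what distinguishes the smoothers discussed here. A minimal NumPy sketch, using Jacobi (M = D) purely for illustration:

```python
import numpy as np

def relax(A, b, x, apply_M_inv, sweeps=1):
    """Generic stationary smoother: x <- x + M^{-1} (b - A x)."""
    for _ in range(sweeps):
        x = x + apply_M_inv(b - A @ x)
    return x

# Jacobi as the simplest parallel-friendly choice: M = D = diag(A)
def jacobi_M_inv(A):
    d = np.diag(A).copy()
    return lambda r: r / d

# usage sketch:
# x = relax(A, b, x, jacobi_M_inv(A), sweeps=2)
```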

The matrix is distributed across P processors in contiguous blocks of rows, A = [A_1; A_2; ...; A_P]. The smoother is threaded such that each thread works on a contiguous subset of the rows of its block, e.g., A_p = [t_1; t_2; t_3; t_4]. Hybrid smoother: Gauss-Seidel within each thread, Jacobi on thread boundaries.
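A sketch of the hybrid idea, assuming the locally owned rows are split into contiguous thread blocks: each block is relaxed with Gauss-Seidel using only its own rows and columns, while couplings to other blocks (and other processes) use the previous iterate, i.e., Jacobi-style. The dense matrix and the block_ranges argument are simplifications for illustration; hypre's actual threaded smoother works on its own sparse data structures.

```python
import numpy as np

def hybrid_gs_sweep(A, b, x, block_ranges):
    """One hybrid Gauss-Seidel sweep on a dense local matrix A.

    block_ranges: list of (start, end) row ranges, one per thread block.
    Within a block : Gauss-Seidel (uses already-updated values in the block).
    Across blocks  : Jacobi (uses values from the previous iterate).
    """
    x_old = x.copy()                                    # off-block couplings see the old iterate
    for (lo, hi) in block_ranges:
        for i in range(lo, hi):
            in_block = slice(lo, hi)
            # in-block contribution, excluding the diagonal (Gauss-Seidel part)
            sigma_in = A[i, in_block] @ x[in_block] - A[i, i] * x[i]
            # off-block contribution from the previous iterate (Jacobi part)
            sigma_out = A[i, :] @ x_old - A[i, in_block] @ x_old[in_block]
            x[i] = (b[i] - sigma_in - sigma_out) / A[i, i]
    return x
```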

The application partitions the grid element-wise into nice subdomains for each processor (shown for 16 processors; colors indicate the element numbering). AMG determines thread subdomains by splitting nodes (not elements), so the application's numbering of the elements/nodes becomes relevant (i.e., the threaded domains depend on that numbering).

Multiplicative weighted hybrid smoother: M_ω = ω M_H; convergent if M_ω is SPD and ω = λ_max(M_H^{-1/2} A M_H^{-1/2}).
Additive (l1) smoothers:
- hybrid l1-GS: M = M_H + D^{l1}, where D^{l1} is diagonal with d_i^{l1} = Σ_{j ∉ Ω_i} |a_ij| (the couplings outside row i's block); always convergent when GS is convergent.
- hybrid l1-Jacobi: M = D + D^{l1}.
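A sketch of the l1 modification under the same block setup as above: the diagonal compensation D^{l1} collects the absolute values of the couplings the hybrid smoother ignores (entries outside a row's own block). Shown for the l1-Jacobi variant, M = D + D^{l1}; the block ranges and dense storage are again simplifications.

```python
import numpy as np

def l1_jacobi_setup(A, block_ranges):
    """Build the l1-Jacobi scaling M = D + D^{l1} for a dense local matrix A.

    d_i^{l1} = sum of |a_ij| over columns j outside row i's block.
    """
    n = A.shape[0]
    m = np.zeros(n)
    for (lo, hi) in block_ranges:
        for i in range(lo, hi):
            off_block = np.abs(A[i, :]).sum() - np.abs(A[i, lo:hi]).sum()
            m[i] = A[i, i] + off_block            # M_ii = a_ii + d_i^{l1}
    return m

def l1_jacobi_sweep(A, b, x, m):
    """x <- x + M^{-1} (b - A x) with the diagonal M from l1_jacobi_setup."""
    return x + (b - A @ x) / m
```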

Polynomial smoothers: I - M^{-1} A = p(A), p(0) = 1.
Advantages: independent of the parallelism; the MatVec kernel has been tuned; a low-degree polynomial is sufficient.
Disadvantage: needs extreme eigenvalue estimates (~10 CG iterations). Note that we only need to damp the high-frequency part of the spectrum (~30%); see Adams et al., 2003, for smoothed-aggregation AMG.

Test problem: -(a(x,y) u_x)_x - (a(x,y) u_y)_y = f, with a(x,y) = 1 on the inner domains and a(x,y) = 0.001 on the outer domain; triangular FEM, 4 subdomains.
Smoothers considered here (L_D: strictly lower-triangular on-block part of A; A_O: off-block part):
- Hybrid-SGS: M = (D + L_D)^T (D - A_O)^{-1} (D + L_D)
- l1-SGS: M = (D + D^{l1} + L_D)^T (D + 2D^{l1} - A_O)^{-1} (D + D^{l1} + L_D)
- Cheby(2): a polynomial smoother with p(A) = I - M^{-1} A, p(0) = 1, where p is the 2nd-order Chebyshev polynomial for D^{-1/2} A D^{-1/2}.
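A sketch of a Chebyshev polynomial smoother of the flavor described above, applied to the diagonally scaled operator D^{-1/2} A D^{-1/2}. The largest eigenvalue is estimated here with a few power iterations (a simple stand-in for the CG-based estimate mentioned earlier), and the polynomial targets roughly the upper 30% of the spectrum; the degree, fraction, and estimator are illustrative choices, not the exact hypre settings.

```python
import numpy as np

def estimate_lambda_max(A_scaled, iters=10, seed=0):
    """Crude power-iteration estimate of the largest eigenvalue
    (illustrative stand-in for the CG/Lanczos estimate used in practice)."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A_scaled.shape[0])
    lam = 1.0
    for _ in range(iters):
        w = A_scaled @ v
        lam = np.linalg.norm(w)
        v = w / lam
    return lam

def chebyshev_smooth(A, b, x, degree=2, spectrum_fraction=0.3):
    """Chebyshev smoother for D^{-1/2} A D^{-1/2}, damping the upper part of the spectrum."""
    d_half = np.sqrt(np.diag(A))
    As = A / np.outer(d_half, d_half)            # D^{-1/2} A D^{-1/2}
    bs = b / d_half
    xs = x * d_half                              # scaled unknown: D^{1/2} x

    lmax = estimate_lambda_max(As)
    lmin = spectrum_fraction * lmax              # only damp the top ~30% of the spectrum
    theta, delta = 0.5 * (lmax + lmin), 0.5 * (lmax - lmin)
    sigma = theta / delta

    r = bs - As @ xs
    dvec = r / theta
    rho = 1.0 / sigma
    for _ in range(degree):                      # standard three-term Chebyshev recurrence
        xs = xs + dvec
        r = r - As @ dvec
        rho_new = 1.0 / (2.0 * sigma - rho)
        dvec = rho_new * rho * dvec + (2.0 * rho_new / delta) * r
        rho = rho_new
    return xs / d_half                           # map back: x = D^{-1/2} (D^{1/2} x)
```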

Polynomial smoothers and our new l1 smoothers are unaffected by poor partitioning (left) and by decreasing block sizes (right). The new smoothers are now available in hypre. Reference: Baker, Falgout, Kolev, Yang, "Multigrid Smoothers for Ultra-Parallel Computing", SIAM Journal on Scientific Computing.

BG/P system: 36 racks with 1,024 compute nodes each; quad-core 850 MHz PowerPC 450 processor; 147,456 cores in total; 4 GB main memory per node, shared by all cores; 3D torus network with isolated, dedicated partitions.

[Figure: pfmg-1, pfmg-2, amg, and amg-bench vs. number of cores (up to ~120,000).]

Multi-core / multi-socket cluster (Hera): 864 diskless nodes interconnected by DDR InfiniBand; quad-core AMD Opteron (2.3 GHz); fat-tree network.

Multicore cluster details (Hera): individual 512 KB L2 cache for each core; 2 MB L3 cache shared by 4 cores; 4 sockets per node, 16 cores sharing the 32 GB main memory; NUMA memory access.

[Figure: two panels comparing PFMG-1, PFMG-2, and AMG vs. number of cores (up to ~4,000).]


Communication plots for the 128-core problem, one panel per AMG level (levels 0-7).

[Figure: execution trace, process id vs. time, distinguishing computation, idle time, and MPI calls.]

Baseline model: α-β (latency-inverse bandwidth) model of one cycle level (smooth, form residual, restrict to level i+1; prolong to level i-1, smooth):

T_i^smooth  = 6 (C_i / P) s_i t_i + 3 (p_i α + n_i β)
T_i^coarsen = 2 (C_{i+1} / P) s'_i t_i + p'_i α + n'_i β   if i < L;   0 if i = L
T_i^prolong = 2 (C_{i-1} / P) s'_{i-1} t_i + p'_{i-1} α + n'_{i-1} β   if i > 0;   0 if i = 0

where
- P: number of processes
- C_i: number of grid points on level i
- L: maximum grid index
- s_i, s'_i: average number of nonzeros per row (of A_i, P_i)
- p_i, p'_i: maximum number of sends per active process (for A_i, P_i)
- n_i, n'_i: maximum number of elements sent per active process (for A_i, P_i)
- t_i: time per floating-point operation on level i
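A small sketch that evaluates the baseline model for a given hierarchy under the α-β assumptions above. The per-level counts (C_i, s_i, p_i, n_i, ...) would come from the actual AMG hierarchy and machine measurements; the dictionary keys below are placeholders of my own naming (s_P, p_P, n_P stand for the primed quantities associated with P_i).

```python
def cycle_time(levels, P, alpha, beta):
    """Baseline alpha-beta model of one AMG cycle.

    levels: list of dicts with keys
      C            grid points on the level
      s, s_P       average nonzeros per row of A_i and P_i
      p, p_P       max sends per active process for A_i and P_i
      n, n_P       max elements sent per active process for A_i and P_i
      t            time per flop on the level
    """
    L = len(levels) - 1
    total = 0.0
    for i, lv in enumerate(levels):
        t_smooth = 6 * (lv["C"] / P) * lv["s"] * lv["t"] + 3 * (lv["p"] * alpha + lv["n"] * beta)
        t_coarsen = 0.0
        if i < L:
            t_coarsen = (2 * (levels[i + 1]["C"] / P) * lv["s_P"] * lv["t"]
                         + lv["p_P"] * alpha + lv["n_P"] * beta)
        t_prolong = 0.0
        if i > 0:
            prev = levels[i - 1]
            t_prolong = (2 * (prev["C"] / P) * prev["s_P"] * lv["t"]
                         + prev["p_P"] * alpha + prev["n_P"] * beta)
        total += t_smooth + t_coarsen + t_prolong
    return total
```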

Take architectural features into account with penalties:
- Distance of communication: add a time per hop, γ.
- Lower effective bandwidth: let m = number of messages, l = number of links, B_HW = hardware bandwidth, B_MPI = MPI bandwidth. On BG/P: multiply β by B_HW / B_MPI. On other machines: multiply β by (B_HW / B_MPI + m/l).
- Multicore penalties: let c = number of cores (MPI tasks) per node, P_i = number of active processes on level i, P = number of processes. Multicore latency penalty: multiply α by c P_i / P. Multicore distance penalty: multiply γ by c P_i / P.
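The penalties can be layered on top of the baseline model by adjusting α, β, and γ per level before evaluating it. A sketch of those adjustments, treating the hop count, message/link counts, and bandwidths as measured inputs; how exactly the γ·hops term enters the level time is an assumption of this sketch.

```python
def penalized_params(alpha, beta, gamma, *, hops, B_hw, B_mpi,
                     msgs=None, links=None, cores_per_node=1,
                     P_active=None, P=None):
    """Apply the distance, bandwidth, and multicore penalties for one level.

    Returns (alpha_eff, beta_eff, extra_hop_time), where extra_hop_time is the
    additive gamma-per-hop term for this level's messages.
    """
    # lower effective bandwidth
    if msgs is None or links is None:               # BG/P-style penalty
        beta_eff = beta * (B_hw / B_mpi)
    else:                                           # other machines
        beta_eff = beta * (B_hw / B_mpi + msgs / links)
    # multicore penalties
    alpha_eff, gamma_eff = alpha, gamma
    if P_active is not None and P is not None:
        scale = cores_per_node * P_active / P
        alpha_eff *= scale                          # multicore latency penalty
        gamma_eff *= scale                          # multicore distance penalty
    return alpha_eff, beta_eff, gamma_eff * hops    # distance penalty: gamma per extra hop
```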

Model parameters: α latency, β inverse bandwidth, γ delay per extra hop. We added penalties to the basic models based on machine constraints: distance effects, reduced per-core bandwidth, number of cores per node. On BG/P, the simple α-β-γ model with the bandwidth penalty gives the best fit for AMG. On Hera (fat-tree network), the distance penalty is the largest contributor to the poor performance. Gahvari, Baker, Schulz, Yang, Jordan, Gropp, "Modeling the Performance of an Algebraic Multigrid Cycle on HPC Platforms", appeared in the ICS'11 Proceedings; the newest version includes OpenMP threads and was submitted to ICPP'12.


AMG solve cycle on Hera, 3,456 cores; 3D 7-point Laplace model problem, 50 x 50 x 25 points per core. [Figure: time (s) per level, levels 0-10, for each MPI/OpenMP mix.]

MPI/OMP mix       Cycle time
16 MPI x 1 OMP    187 ms
8 MPI x 2 OMP     116 ms
4 MPI x 4 OMP     67.7 ms
2 MPI x 8 OMP     91.7 ms
1 MPI x 16 OMP    199 ms

[Figure: time in seconds vs. number of cores (up to ~12,000) for MPI 16x1, hybrid 8x2, 4x4, 2x8, and OpenMP 1x16; inset shows the range up to ~400 cores.]

[Figure: time in seconds vs. number of cores (up to ~4,000) for PFMG-1 and PFMG-2, each with MPI, 4x4, and 1x16 configurations.]

[Figure: two panels of time vs. number of cores (up to ~100,000) for PFMG-1, PFMG-2, and AMG, each with MPI, 2x2, and 1x4 configurations.]

- Non-Galerkin AMG (Schroder, Falgout)
- Redundant coarse-grid solve (Gahvari, Gropp, Jordan, Schulz, Yang)
- Additive AMG methods (Gahvari, Olson, Vassilevski, Yang)

Choose a non-Galerkin coarse grid for parallel efficiency: sparsify the Galerkin product P^T A P to yield a new coarse-grid operator. This raises critical issues related to AMG theory: we need good spectral equivalence, which provably implies AMG convergence. Algorithm heuristic:
- Choose the sparsity pattern for the coarse operator: start from a reduced pattern based on the Galerkin product P^T A P (e.g., an injection-based P_I^T A P_I), and add more entries if necessary.
- Choose the matrix entries: require accuracy for the near-nullspace of A, and remove edges using stencil collapsing, redistributing each collapsed edge over the remaining stencil (a simplified lumping version is sketched below).
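To make the idea concrete, here is a deliberately simplified sparsification sketch: form the Galerkin operator, keep only the entries in a chosen pattern, and lump each dropped entry back into its row so that the action on a given near-nullspace vector (the constant vector for a Laplacian-like operator) is preserved. This only illustrates the flavor of "stencil collapsing"; the actual non-Galerkin AMG heuristic of Falgout and Schroder is considerably more careful about where the dropped values go.

```python
import numpy as np

def sparsify_coarse_operator(A, P, pattern, nullspace=None):
    """Illustrative non-Galerkin sparsification of A_c = P^T A P (dense sketch).

    pattern  : boolean matrix, True where entries of the coarse operator are kept
               (the diagonal is assumed to be kept)
    nullspace: near-nullspace vector whose action should be preserved
               (defaults to the constant vector; entries must be nonzero)
    """
    Ac = P.T @ A @ P
    n = Ac.shape[0]
    v = np.ones(n) if nullspace is None else nullspace
    Ahat = np.where(pattern, Ac, 0.0)
    # collapse each dropped entry a_ij onto the diagonal so that
    # (Ahat v)_i == (Ac v)_i for the chosen near-nullspace vector v
    for i in range(n):
        dropped = np.where(~pattern[i])[0]
        correction = Ac[i, dropped] @ v[dropped]
        Ahat[i, i] += correction / v[i]
    return Ahat
```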

Results for 3D diffusion (classical AMG: scenarios 1 and 2; proposed approach: scenarios 3 and 4) and for 3D anisotropy on an unstructured grid: convergence for scenario 4 is comparable to classical AMG, with a substantially reduced stencil size.

Redundant coarse-grid solve: follow the normal cycle (smooth, form the residual, restrict to level i+1) down to level i, all-gather the level-i problem, perform a serial AMG coarse solve redundantly, then prolong to level i-1 and smooth.
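A sketch of the all-gather step with mpi4py, under simplifying assumptions: each rank owns a contiguous block of rows of the level-i matrix (stored dense here) and the matching part of the residual; every rank gathers the whole coarse problem, solves it redundantly, and keeps its own slice of the correction. The real implementation works on hypre's data structures and uses a serial AMG solve rather than a dense factorization.

```python
import numpy as np
from mpi4py import MPI

def redundant_coarse_solve(comm, A_local, r_local):
    """All-gather the coarse problem and solve it redundantly on every rank.

    A_local: this rank's contiguous block of rows of the coarse matrix (dense)
    r_local: this rank's part of the coarse residual
    Returns this rank's part of the coarse correction.
    """
    A_blocks = comm.allgather(A_local)           # every rank receives all row blocks
    r_blocks = comm.allgather(r_local)
    A_full = np.vstack(A_blocks)
    r_full = np.concatenate(r_blocks)

    e_full = np.linalg.solve(A_full, r_full)     # redundant solve (stand-in for serial AMG)

    # keep only the rows this rank owns
    offsets = np.cumsum([0] + [len(b) for b in r_blocks])
    rank = comm.Get_rank()
    return e_full[offsets[rank]:offsets[rank + 1]]

# usage sketch: e_local = redundant_coarse_solve(MPI.COMM_WORLD, A_local, r_local)
```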

                7-pt 3D Laplace problem        27-pt stencil
No. of cores    Hera    Atlas   Coastal        Hera    Atlas   Coastal
128             1.77    1.11    1.62           1.40    1.03    1.31
432             2.00    1.11    2.07           1.39    1.20    1.64
1024            1.88    1.25    2.09           1.57    1.25    1.93
2000            1.79    1.18    2.04           1.62    1.25    1.66

Hera: AMD Opteron cluster. Atlas: AMD Opteron cluster. Coastal: Intel Xeon cluster. A more sophisticated version is under development.

All-gather at level i; illustration of the chunk data distribution using 4 chunks. Cores in each chunk perform the same operations. Now use parallel AMG on k cores for each redundant coarse solve, with communication across chunks. This allows better use of cache and a smart choice of cores.
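The chunked variant replaces the fully redundant serial solve with one parallel solve per chunk. With mpi4py, the chunk communicators can be formed with a communicator split; a sketch, where the number of chunks and the mapping of ranks to chunks are illustrative choices:

```python
from mpi4py import MPI

def make_chunk_comm(comm, num_chunks):
    """Split comm into num_chunks sub-communicators of (roughly) equal size.

    Every chunk later receives a full copy of the coarse problem (the all-gather
    at level i) and runs parallel AMG on its k = size/num_chunks cores.
    """
    rank = comm.Get_rank()
    size = comm.Get_size()
    chunk = rank * num_chunks // size            # contiguous ranks share a chunk
    chunk_comm = comm.Split(chunk, rank)         # color = chunk id, key = rank
    return chunk, chunk_comm

# usage sketch:
# chunk, chunk_comm = make_chunk_comm(MPI.COMM_WORLD, num_chunks=4)
# each chunk_comm then solves the gathered coarse problem with parallel AMG
```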


[Figure: additive cycle — the smooths on all levels and the coarsest-level solve are performed in parallel, rather than one after another as in a multiplicative cycle.]

Even with slower convergence, noticeable speedups are possible on the model problem. We are now investigating an additive version of multiplicative multigrid that preserves convergence.
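For comparison with the multiplicative V-cycle sketched earlier, here is a minimal sketch of a generic additive cycle: all level corrections are computed from the same fine-grid residual (so the smooths can run in parallel, as in the figure) and are then summed. This is the textbook additive pattern, not necessarily the specific variant under investigation here.

```python
import numpy as np

def additive_cycle(A, P, b, x, omega=2.0 / 3.0):
    """One additive cycle: independent level corrections from the same residual.

    A[m] : operator on level m (A[m+1] = P[m].T @ A[m] @ P[m]), dense NumPy arrays
    P[m] : interpolation from level m+1 to level m
    """
    L = len(P)
    r = b - A[0] @ x

    # restrict the fine residual to every level up front
    residuals = [r]
    for m in range(L):
        residuals.append(P[m].T @ residuals[-1])

    # level corrections are independent of each other (can run in parallel)
    corrections = []
    for m in range(L):
        corrections.append(omega * residuals[m] / np.diag(A[m]))     # one Jacobi smooth
    corrections.append(np.linalg.solve(A[L], residuals[L]))          # coarsest-level solve

    # interpolate each correction back to the finest level and sum
    update = corrections[L]
    for m in reversed(range(L)):
        update = corrections[m] + P[m] @ update
    return x + update
```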

- l1 and polynomial smoothers are promising for smoothing on millions of cores.
- Getting efficient use out of multicore architectures is challenging; reducing communication is crucial.
- New methods show promise for better performance.
- Use the model to evaluate new algorithmic changes and to predict performance at much larger scale.
- Continue work on reducing communication.
- Approaches for more efficient implementation: optimized kernels, new data structures?

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.