Efficient AMG on Hybrid GPU Clusters. ScicomP Jiri Kraus, Malte Förster, Thomas Brandes, Thomas Soddemann. Fraunhofer SCAI

1 Efficient AMG on Hybrid GPU Clusters
ScicomP 2012
Jiri Kraus, Malte Förster, Thomas Brandes, Thomas Soddemann
Fraunhofer SCAI

2 Illustration: Darin McInnis

4 Motivation
- Sparse iterative solvers benefit from the high memory bandwidth of GPUs
- GPUs might require different data structures / algorithms
- Separate GPU memory and data management issues can lead to poor inter-node parallel performance (e.g. MPI)
- LAMA provides solutions for common problems with sparse iterative solvers on distributed memory systems, e.g.
  - Data management
  - Asynchronous operations and concurrency
- Example: AMG

5 What is LAMA?
- (L)ibrary for (A)ccelerated (M)ath (A)pplications
- Open source C++ software (MIT license)
LAMA Design Aims:
- Inter-node parallelization: MPI (+ PGAS, current work)
- Intra-node parallelization: OpenMP, CUDA, OpenCL
- Supports different (sparse) matrix storage formats (CSR, ELL, JDS)
- Extensible towards novel accelerators (e.g. MIC) and matrix formats
- Textbook syntax at user level: z = A * x * y
- Efficient

6 Textbook Syntax: C++ Interface

    void solvecg( Vector& x, const Matrix& A, const Vector& b )
    {
        DenseVector r = b - A * x;                 // initial residual
        DenseVector p = r;                         // first search direction
        DenseVector q( A.getDistributionPtr() );   // same distribution as A
        double rho = r * r;                        // dot product
        ...
        for ( int i = 0; i < niterations; i++ )
        {
            q = A * p;
            double alpha = rho / ( p * q );        // step length
            x = x + alpha * p;                     // update solution
            r = r - alpha * q;                     // update residual
            ...
        }
    }

7 Distributed Sparse Matrix-Vector Multiplication with Halo
[Figure: A * x, with the rows of A and the vector x distributed across processes P(0), P(1), P(2)]
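The local + halo split named on this slide can be sketched for a single process as follows. This is a minimal serial illustration, not LAMA code: `CsrMatrix`, `multiplyAdd` and `distributedMultiply` are hypothetical names, and the halo vector `xHalo` is assumed to have already been received from the neighboring processes.

```cpp
#include <cstddef>
#include <vector>

// Minimal CSR storage (hypothetical helper type, not the LAMA class).
struct CsrMatrix {
    std::size_t numRows = 0;
    std::vector<std::size_t> rowPtr;  // numRows + 1 offsets into colIdx / values
    std::vector<std::size_t> colIdx;  // column index of each stored entry
    std::vector<double> values;       // value of each stored entry
};

// y += A * x for one CSR block.
void multiplyAdd(const CsrMatrix& A, const std::vector<double>& x,
                 std::vector<double>& y) {
    for (std::size_t i = 0; i < A.numRows; ++i) {
        double sum = 0.0;
        for (std::size_t k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k) {
            sum += A.values[k] * x[A.colIdx[k]];
        }
        y[i] += sum;
    }
}

// One process's share of y = A * x.
// localPart holds the columns owned by this process (indexed into xLocal),
// haloPart holds the columns owned by other processes (indexed into xHalo).
std::vector<double> distributedMultiply(const CsrMatrix& localPart,
                                        const CsrMatrix& haloPart,
                                        const std::vector<double>& xLocal,
                                        const std::vector<double>& xHalo) {
    std::vector<double> y(localPart.numRows, 0.0);
    multiplyAdd(localPart, xLocal, y);  // needs no communication
    multiplyAdd(haloPart, xHalo, y);    // needs the exchanged halo values
    return y;
}
```

Keeping the two parts separate is what allows the local product to start before the halo values have arrived.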

8 Synchronous / Asynchronous Execution: Matrix Vector Multiplication on GPU
Synchronous execution (Host / GPU, steps in sequence):
- Download X (CUDA->Host)
- MPI Halo Exchange
- Upload X Halo
- Matrix-Vector Multiplication (MVM) A * X (local + halo)
Asynchronous execution (two concurrent timelines):
- Host / GPU: Download X (CUDA->Host), MPI Halo Exchange, Upload X Halo
- GPU: MVM (local), then MVM (halo)
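The difference between the two timelines can be expressed as a small cost model. This is a sketch under stated assumptions: the struct `StepTimes` and both functions are hypothetical, the synchronous product is modeled as local time plus halo time, and the communication chain is assumed to overlap perfectly with the local product in the asynchronous case.

```cpp
#include <algorithm>

// Durations of the individual steps, in any consistent time unit
// (hypothetical inputs, e.g. taken from a profiler).
struct StepTimes {
    double download;  // copy of boundary values, GPU -> host
    double exchange;  // MPI halo exchange
    double upload;    // copy of received halo values, host -> GPU
    double mvmLocal;  // product with locally owned columns
    double mvmHalo;   // product with halo columns
};

// Synchronous execution: every step waits for the previous one.
double synchronousTime(const StepTimes& t) {
    return t.download + t.exchange + t.upload + t.mvmLocal + t.mvmHalo;
}

// Asynchronous execution: the communication chain runs concurrently with
// the local product; the halo product starts once both have finished.
double asynchronousTime(const StepTimes& t) {
    const double communication = t.download + t.exchange + t.upload;
    return std::max(communication, t.mvmLocal) + t.mvmHalo;
}
```

In this model communication is fully hidden as long as `mvmLocal` is at least as long as the communication chain.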

9 Benchmark: Hardware

CPU
  Name            CPU
  CPU             Xeon E5620
  Core f (GHz)    2.40
  L3 Cache        12 MB
  Cores / CPU     4
  Hyperthreading  Off
  Sockets         1
  Cores           4
  Memory          12 GB
  Socket-BW       26 GB/s

QDR Infiniband

GPUs
  Name            M20
  Device          M2090
  Comp. Cap.      2.0
  MP              16
  Cores           512
  Core f (GHz)    1.30
  Memory (GB)     6
  BW (GB/s)       177
  HW Cache        Yes
  ECC             on

CUDA 4.1

10 Benchmark: Testcases
Testcase: PDE discretization 3D7P
- 7-point Laplace discretization
- 3-dimensional structured grid
- n1: 1 million points / matrix rows
- Lexicographic ordering
- All benchmarks in double precision
- Access pattern common for many PDEs: entries correspond to direct neighbors
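A matrix with the 3D7P structure described above can be assembled as in the following sketch. It is not the benchmark generator: `CsrMatrix` and `buildLaplace3D7P` are hypothetical names, a cubic grid with `n` points per direction is assumed, and the stencil values (6 on the diagonal, -1 for each direct neighbor) are one common convention for the 7-point Laplace operator.

```cpp
#include <cstddef>
#include <vector>

// Minimal CSR storage (hypothetical helper type).
struct CsrMatrix {
    std::size_t numRows = 0;
    std::vector<std::size_t> rowPtr;
    std::vector<std::size_t> colIdx;
    std::vector<double> values;
};

// 7-point Laplace stencil on an n x n x n structured grid with
// lexicographic ordering: row = i + n * (j + n * k).
CsrMatrix buildLaplace3D7P(std::size_t n) {
    CsrMatrix A;
    A.numRows = n * n * n;
    A.rowPtr.push_back(0);
    auto add = [&A](std::size_t col, double value) {
        A.colIdx.push_back(col);
        A.values.push_back(value);
    };
    for (std::size_t k = 0; k < n; ++k) {
        for (std::size_t j = 0; j < n; ++j) {
            for (std::size_t i = 0; i < n; ++i) {
                const std::size_t row = i + n * (j + n * k);
                // Neighbors are added in ascending column order;
                // neighbors outside the grid are skipped.
                if (k > 0) add(row - n * n, -1.0);
                if (j > 0) add(row - n, -1.0);
                if (i > 0) add(row - 1, -1.0);
                add(row, 6.0);
                if (i + 1 < n) add(row + 1, -1.0);
                if (j + 1 < n) add(row + n, -1.0);
                if (k + 1 < n) add(row + n * n, -1.0);
                A.rowPtr.push_back(A.colIdx.size());
            }
        }
    }
    return A;
}
```

With `n = 100` this yields 1 million rows with at most 7 entries each, so the matrix-vector product is dominated by memory traffic rather than arithmetic.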

11 CG Benchmark (10 Iterations)
[Plot: CG (10 iterations) for 3D7P, 1 million unknowns / node. x-axis: # nodes; y-axis: MemBw / device [GB/s]; series: synchronous, asynchronous, measured]
Speed-up of CG (asynchronous execution), 32 GPUs: 26.6

12 Algebraic Multigrid (AMG)
- Very efficient linear solver technology for elliptic problems
- Smoothing / coarse grid correction on a hierarchy of matrices
AMG Configuration:
- Ruge-Stüben coarsening / standard interpolation (classical AMG)
- AMG V-cycle is used as preconditioner for CG
- Jacobi relaxation
[Figure: V-cycle over the matrix hierarchy A_1, A_2, A_3, A_4 with transfer operators I_1^2, I_2^1, I_2^3, I_3^2, I_3^4, I_4^3 between neighboring levels]
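The smoothing step of this configuration can be illustrated with a single Jacobi sweep on a CSR matrix. This is a minimal sketch, not the LAMA smoother: `CsrMatrix` and `jacobiSweep` are hypothetical names, the damping factor `omega` is an added parameter (`omega = 1` gives plain Jacobi), and every row is assumed to store a nonzero diagonal entry.

```cpp
#include <cstddef>
#include <vector>

// Minimal CSR storage (hypothetical helper type).
struct CsrMatrix {
    std::size_t numRows = 0;
    std::vector<std::size_t> rowPtr;
    std::vector<std::size_t> colIdx;
    std::vector<double> values;
};

// One weighted Jacobi sweep: x <- x + omega * D^{-1} * (b - A * x),
// where D is the diagonal of A. Assumes a nonzero diagonal in every row.
void jacobiSweep(const CsrMatrix& A, const std::vector<double>& b,
                 std::vector<double>& x, double omega) {
    std::vector<double> xNew(x.size());
    for (std::size_t i = 0; i < A.numRows; ++i) {
        double diagonal = 0.0;
        double residual = b[i];
        for (std::size_t k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k) {
            const std::size_t col = A.colIdx[k];
            if (col == i) {
                diagonal = A.values[k];
            }
            residual -= A.values[k] * x[col];  // uses only the old iterate
        }
        xNew[i] = x[i] + omega * residual / diagonal;
    }
    x.swap(xNew);
}
```

Each row update reads only the old iterate, so all rows can be processed independently, which suits GPU execution.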

13 (Weak) Scalability Solution Phase: CPU vs GPU
[Plot: 3D7P, 1 million unknowns / device. x-axis: # nodes; y-axis: MemBw / device [GB/s]; series: CPU, M2090, measured]

14 Optimizations for Solver Phase
- Asynchronous communication for MV multiplication (same as in CG)
- Optimization of asynchronous communication: gathering of send data on GPU
- Using CPU computations on coarser levels
Versions compared:
- sync: version with synchronous communication
- async: version with optimized asynchronous communication
- host: version with optimized asynchronous communication + host CPU on coarser levels
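The gathering of send data amounts to packing the vector entries that neighbors need into one contiguous buffer. The sketch below shows only the index logic in host C++; `gatherSendBuffer` and `sendIndices` are hypothetical names. On the GPU the same loop would presumably run as a kernel, so that only the packed buffer has to be copied to the host.

```cpp
#include <cstddef>
#include <vector>

// Pack the entries of x that other processes need into a contiguous buffer.
// sendIndices lists the local indices in the order the receivers expect.
std::vector<double> gatherSendBuffer(const std::vector<double>& x,
                                     const std::vector<std::size_t>& sendIndices) {
    std::vector<double> buffer;
    buffer.reserve(sendIndices.size());
    for (std::size_t index : sendIndices) {
        buffer.push_back(x[index]);
    }
    return buffer;
}
```

For a vector `{10, 20, 30, 40}` and send indices `{3, 0}` the buffer holds `{40, 10}`, so the download is proportional to the halo size instead of the full vector length.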

15 Asynchronous Execution Optimizations: Matrix Vector Multiplication on GPU
[Figure: two pairs of concurrent timelines.
Asynchronous execution on coarser levels: Host / GPU: Download X (CUDA->Host), MPI Halo Exchange, Upload X Halo; GPU: MVM (local), MVM (halo).
Asynchronous execution optimized: Host / GPU: G X (gather), DL X (download), MPI Halo Exchange, Upload X Halo; GPU: MVM (local), MVM (halo).]

16 Host: Using CPU on the Coarser Levels
- Coarser AMG levels have larger halos: hiding communication is no longer possible
- Optimization switches to the CPU version on coarser levels
[Figure: two V-cycle timelines over Level 1 to Level 5, labeled "Always GPU" and "GPU, host CPU for coarser levels"]
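One way to decide where to switch is to compare, per level, the time available for overlap with the communication time. The slides do not state the criterion actually used, so the following is a purely hypothetical heuristic: `LevelEstimate` and `firstHostLevel` are invented names, and the per-level estimates are assumed to come from measurements.

```cpp
#include <cstddef>
#include <vector>

// Estimated times for one AMG level (hypothetical inputs).
struct LevelEstimate {
    double localCompute;   // time of the local matrix-vector product
    double communication;  // time of download + halo exchange + upload
};

// Returns the index of the first level on which communication can no
// longer be hidden behind local computation. That level and all coarser
// levels would run on the host CPU. Returns levels.size() if every level
// can stay on the GPU.
std::size_t firstHostLevel(const std::vector<LevelEstimate>& levels) {
    for (std::size_t level = 0; level < levels.size(); ++level) {
        if (levels[level].communication > levels[level].localCompute) {
            return level;
        }
    }
    return levels.size();
}
```

The switch is made once and kept for all coarser levels, matching the picture on the slide where the finest levels stay on the GPU and the coarsest ones run on the host.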

17 Solution Phase: M2090
[Plot: 3D7P, million unknowns / device. x-axis: # nodes; y-axis: MemBw / device [GB/s]; series: M2090 (sync), M2090 (async), M2090 (host), measured]

18 AMG Setup Phase
- Parallel CPU implementation is available
- No distributed GPU implementation available
- GPU overhead: conversion CSR -> ELLPACK, host-to-device transfer
[Figure: setup timelines.
Using CPU for solver phase: construct level 1, construct level 2, construct level 3.
Using GPU for solver phase: Tr 0, construct level 1, Tr 1, construct level 2, Tr 2, construct level 3, Tr 3, where Tr = transfer to GPU + storage format conversion on GPU.]
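The CSR -> ELLPACK conversion mentioned as GPU overhead can be sketched as follows. This is a host-side illustration with hypothetical types `CsrMatrix` and `EllMatrix`, not the LAMA implementation. It assumes the usual ELLPACK layout: every row is padded to the length of the longest row, and entries are stored column-major so that neighboring GPU threads (one per row) read neighboring memory locations.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Minimal CSR storage (hypothetical helper type).
struct CsrMatrix {
    std::size_t numRows = 0;
    std::vector<std::size_t> rowPtr;
    std::vector<std::size_t> colIdx;
    std::vector<double> values;
};

// ELLPACK storage: entry number `slot` of row `i` lives at slot * numRows + i.
struct EllMatrix {
    std::size_t numRows = 0;
    std::size_t maxRowLength = 0;
    std::vector<std::size_t> colIdx;
    std::vector<double> values;
};

EllMatrix convertCsrToEll(const CsrMatrix& A) {
    EllMatrix E;
    E.numRows = A.numRows;
    for (std::size_t i = 0; i < A.numRows; ++i) {
        E.maxRowLength = std::max(E.maxRowLength, A.rowPtr[i + 1] - A.rowPtr[i]);
    }
    // Padding entries keep column 0 and value 0.0, so they add nothing
    // to a matrix-vector product.
    E.colIdx.assign(E.numRows * E.maxRowLength, 0);
    E.values.assign(E.numRows * E.maxRowLength, 0.0);
    for (std::size_t i = 0; i < A.numRows; ++i) {
        for (std::size_t k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k) {
            const std::size_t slot = k - A.rowPtr[i];
            E.colIdx[slot * E.numRows + i] = A.colIdx[k];
            E.values[slot * E.numRows + i] = A.values[k];
        }
    }
    return E;
}
```

The conversion touches every stored entry once, which is why it shows up as a per-level cost next to the host-to-device transfer.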

19 Setup: CPU vs GPU
[Plot: 3D7P, 1 million unknowns / CPU. x-axis: # nodes; y-axis: runtime [s]; series: CPU, M2090]

20 AMG Setup Phase: Optimization
GPU Preparation:
- Asynchronous host-to-device transfer
- Asynchronous conversion CSR -> ELLPACK on GPU
- Additional: asynchronous transpose of interpolation matrix -> restriction (needs thread-safe MPI)
[Figure: concurrent timelines.
CPU: construct level 1, construct level 2, construct level 3.
GPU: transfer + level 0, transfer level 1, transfer level 2, transfer level 3 (transfer to GPU + transpose + conversion on GPU).]
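Obtaining the restriction operator as the transpose of the interpolation matrix can be sketched with a counting-sort style CSR transpose. This is a serial host version under assumptions: `CsrMatrix` and `transposeCsr` are hypothetical names, and the distributed case (which is where thread-safe MPI comes in) is not covered.

```cpp
#include <cstddef>
#include <vector>

// Minimal CSR storage (hypothetical helper type).
struct CsrMatrix {
    std::size_t numRows = 0;
    std::vector<std::size_t> rowPtr;
    std::vector<std::size_t> colIdx;
    std::vector<double> values;
};

// Transpose a CSR matrix with numCols columns (e.g. interpolation P,
// fine x coarse) into its CSR transpose (restriction R, coarse x fine).
CsrMatrix transposeCsr(const CsrMatrix& P, std::size_t numCols) {
    CsrMatrix R;
    R.numRows = numCols;
    R.rowPtr.assign(numCols + 1, 0);
    // Count the entries per column of P, then prefix-sum to row offsets of R.
    for (std::size_t col : P.colIdx) {
        ++R.rowPtr[col + 1];
    }
    for (std::size_t c = 0; c < numCols; ++c) {
        R.rowPtr[c + 1] += R.rowPtr[c];
    }
    R.colIdx.resize(P.colIdx.size());
    R.values.resize(P.values.size());
    // next[c] is the next free position in row c of R.
    std::vector<std::size_t> next(R.rowPtr.begin(), R.rowPtr.end() - 1);
    for (std::size_t i = 0; i < P.numRows; ++i) {
        for (std::size_t k = P.rowPtr[i]; k < P.rowPtr[i + 1]; ++k) {
            const std::size_t position = next[P.colIdx[k]]++;
            R.colIdx[position] = i;
            R.values[position] = P.values[k];
        }
    }
    return R;
}
```

Because rows of P are visited in ascending order, the column indices inside each row of R come out sorted without an extra sorting pass.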

21 Setup: CPU only vs. CPU + GPU Prep.
[Plot: 3D7P, million unknowns / node. x-axis: # nodes; y-axis: runtime [s]; series: CPU, M2090, M2090 (async)]
Attention: thread-safe MPI implementation is inefficient

22 Summary
Scalability for GPUs: hiding communication becomes fundamental
- Switching between computations on CPU and GPU
- CPU / GPU overlap
Software design of the LAMA library is able to deal with all identified problems
- Asynchronous computations for hiding communication
- Asynchronous data transfer between devices
- Support of different sparse matrix formats

23 Further Development
- Optimization / parallelization of AMG setup for GPUs
- Release version of LAMA

24 Thank You!
Benchmark data available at: http://www.libama.
For more information on LAMA contact us at
Work funded by: Fraunhofer, ITEA Project H4H, BMBF Project GASPI
