GPU Cluster Computing for FEM

GPU Cluster Computing for FEM
Dominik Göddeke, Sven H.M. Buijssen, Hilmar Wobker and Stefan Turek
Angewandte Mathematik und Numerik, TU Dortmund, Germany
dominik.goeddeke@math.tu-dortmund.de
GPU Computing in Computational Science and Engineering
International Workshop on Computational Engineering, Special Topic Fluid-Structure Interaction
Herrsching, October 14, 2009

FEAST: Hardware-oriented Numerics

Structured vs. unstructured grids
- Fully adaptive grids: maximum flexibility, stochastic numbering, unstructured sparse matrices, indirect addressing (very slow)
- Locally structured grids: logical tensor product, fixed banded matrix structure, direct addressing (fast), r-adaptivity
- FEAST's compromise: an unstructured macro mesh of tensor-product subdomains; each hierarchically refined subdomain Ω_i (a "macro") is numbered row-wise, so the matrix-vector multiplication works on a fixed window of nine bands per macro (LL, LD, LU, DL, DD, DU, UL, UD, UU, i.e. the couplings of node I with its neighbours I±1, I±M, I±M±1)
- Exploit the local structure for tuned linear algebra and tailored multigrid smoothers
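
The banded structure is what makes the per-macro matrix-vector kernel fast: all nine bands can be applied with direct addressing instead of the indirect addressing of a general sparse format. Below is a minimal NumPy sketch of such a nine-band product; the storage convention (one array per band offset) and all names are illustrative, not FEAST's actual data layout.

    import numpy as np

    def banded_spmv(bands, x, m):
        # Nine-band matrix-vector product on a row-wise numbered, m-wide
        # tensor-product grid: bands[off][i] holds A[i, i+off] for the offsets
        # -m-1, -m, -m+1, -1, 0, 1, m-1, m, m+1 (the LL..UU bands); band entries
        # that would reach outside the matrix are assumed to be stored as zero.
        n = x.size
        y = bands[0] * x
        for off in (-m - 1, -m, -m + 1, -1, 1, m - 1, m, m + 1):
            if off > 0:
                y[:n - off] += bands[off][:n - off] * x[off:]
            else:
                y[-off:] += bands[off][-off:] * x[:n + off]
        return y

    # toy usage with random band data on an 8 x 8 subdomain
    m = 8
    n = m * m
    rng = np.random.default_rng(0)
    offsets = (-m - 1, -m, -m + 1, -1, 0, 1, m - 1, m, m + 1)
    bands = {off: rng.standard_normal(n) for off in offsets}
    y = banded_spmv(bands, np.ones(n), m)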

Solver approach: ScaRC (Scalable Recursive Clustering)
- Hybrid multilevel domain decomposition method, minimal overlap by extended Dirichlet BCs
- Inspired by parallel MG ("best of both worlds"): multiplicative between levels with a global coarse grid problem (MG-like), additive horizontally with a block-Jacobi / Schwarz smoother (DD-like)
- Hide local irregularities by multigrids within the Schwarz smoother
- Embed in a Krylov method to alleviate the block-Jacobi character

Template solver (see the sketch below):
global BiCGStab
  preconditioned by global multilevel (V 1+1)
    additively smoothed by: for all Ω_i: local multigrid
    coarse grid solver: UMFPACK
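
To make the nesting concrete, here is a deliberately simplified, runnable sketch of the ScaRC structure (not FEAST code): a global BiCGStab iteration is preconditioned by one additive sweep over the subdomains, with each local solve standing in for the local multigrid that FEAST runs on the GPU. The 1D Poisson matrix, the two-subdomain split and the sparse LU local solves are purely illustrative; the global multilevel layer and the coarse grid solver are omitted.

    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    n = 256
    A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
    b = np.ones(n)

    subdomains = [np.arange(0, n // 2), np.arange(n // 2, n)]   # the "macros" Omega_i
    local_solve = [spla.factorized(sp.csc_matrix(A[s][:, s])) for s in subdomains]

    def scarc_precond(r):
        # One additive (block-Jacobi / Schwarz) sweep: every subdomain solves its
        # own block of the residual independently; in FEAST this local solve is a
        # complete local multigrid running on the GPU, here a sparse LU for brevity.
        z = np.zeros_like(r)
        for s, solve in zip(subdomains, local_solve):
            z[s] = solve(r[s])
        return z

    M = spla.LinearOperator(A.shape, matvec=scarc_precond)
    x, info = spla.bicgstab(A, b, M=M)                          # global Krylov solver
    print("info =", info, " residual =", np.linalg.norm(A @ x - b))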

Multivariate problems: block-structured systems
- Guiding idea: tune the scalar case once per architecture instead of over and over again per application
- Equation-wise ordering of the unknowns; block-wise treatment enables multivariate ScaRC solvers
- Examples:
  Linearised elasticity with compressible material (2x2 blocks):
  \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \begin{pmatrix} u_1 \\ u_2 \end{pmatrix} = f
  Saddle point problems such as Stokes and elasticity with (nearly) incompressible material (3x3 with zero blocks):
  \begin{pmatrix} A_{11} & 0 & B_1 \\ 0 & A_{22} & B_2 \\ B_1^T & B_2^T & 0 \end{pmatrix} \begin{pmatrix} v_1 \\ v_2 \\ p \end{pmatrix} = f
  Navier-Stokes with stabilisation (3x3 blocks):
  \begin{pmatrix} A_{11} & A_{12} & B_1 \\ A_{21} & A_{22} & B_2 \\ B_1^T & B_2^T & C \end{pmatrix} \begin{pmatrix} v_1 \\ v_2 \\ p \end{pmatrix} = f
- A_11 and A_22 correspond to scalar (elliptic) operators, so the tuned linear algebra and tuned solvers carry over

Co-processor integration into FEAST

Bandwidth in a CPU/GPU node

Mixed precision multigrid
Poisson problem on the unit square, Dirichlet BCs, tensor-product grid, full matrix (not a matrix stencil); a pure single precision solver converges to the wrong solution.

    Level   Core2Duo (double)       GTX 280 (mixed)         speedup
            time(s)   MFLOP/s       time(s)   MFLOP/s
    7       0.021     1405          0.009      2788          2.3x
    8       0.094     1114          0.012      8086          7.8x
    9       0.453      886          0.026     15179         17.4x
    10      1.962      805          0.073     21406         26.9x

- 1M DOF solved with multigrid to FE accuracy in less than 0.1 seconds
- 27x faster than the CPU, exactly the same results as pure double precision
- 1.7x faster than pure double precision on the GPU
- Defect calculation alone: 46.5 GFLOP/s, a 50x speedup (single vs. single)
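
The GPU results above rely on mixed-precision defect correction: the inner multigrid runs entirely in single precision, while defects and solution updates are accumulated in double precision, which is why the results match the pure double solver. A minimal NumPy sketch of that outer loop follows; the dense single-precision direct solve merely stands in for the single-precision GPU multigrid, and the 1D Poisson matrix is illustrative.

    import numpy as np

    n = 512
    A = (np.diag(np.full(n, 2.0)) + np.diag(np.full(n - 1, -1.0), 1)
         + np.diag(np.full(n - 1, -1.0), -1))             # 1D Poisson, double precision
    b = np.random.default_rng(0).standard_normal(n)

    A32 = A.astype(np.float32)        # low-precision copy used by the inner solver
    x = np.zeros(n)                   # solution is accumulated in double precision
    for it in range(20):
        d = b - A @ x                                     # defect in double precision
        if np.linalg.norm(d) < 1e-10 * np.linalg.norm(b):
            break
        c = np.linalg.solve(A32, d.astype(np.float32))    # inner solve in single precision
        x += c.astype(np.float64)                         # double-precision correction
    print(it, np.linalg.norm(b - A @ x) / np.linalg.norm(b))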

Minimally invasive integration
global BiCGStab
  preconditioned by global multilevel (V 1+1)
    additively smoothed by: for all Ω_i: local multigrid
    coarse grid solver: UMFPACK
- All outer work: CPU, double precision
- Local multigrids: GPU, single precision
- Same accuracy and functionality are mandatory
- Oblivious of the application

Some results

Linearised elasticity
Operator (Lamé constants μ and λ, blocks A_11, A_12, A_21, A_22):
\begin{pmatrix} (2\mu+\lambda)\,\partial_{xx} + \mu\,\partial_{yy} & (\mu+\lambda)\,\partial_{xy} \\ (\mu+\lambda)\,\partial_{yx} & \mu\,\partial_{xx} + (2\mu+\lambda)\,\partial_{yy} \end{pmatrix} \begin{pmatrix} u_1 \\ u_2 \end{pmatrix} = f

Solver (see the block Gauss-Seidel sketch below):
global multivariate BiCGStab
  block-preconditioned by global multivariate multilevel (V 1+1)
    additively smoothed (block GS) by:
      for all Ω_i: solve A_11 c_1 = d_1 by local scalar multigrid
      update RHS: d_2 = d_2 - A_21 c_1
      for all Ω_i: solve A_22 c_2 = d_2 by local scalar multigrid
    coarse grid solver: UMFPACK
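
The additive smoother above treats the two displacement components with a block Gauss-Seidel sweep. A tiny NumPy sketch of one such sweep is given below; the direct solves stand in for the local scalar multigrids (the part FEAST offloads to the GPU), and the random, diagonally dominant matrices are purely illustrative.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 100
    A11 = np.diag(np.full(n, 4.0)) + 0.1 * rng.standard_normal((n, n))
    A22 = np.diag(np.full(n, 4.0)) + 0.1 * rng.standard_normal((n, n))
    A21 = 0.1 * rng.standard_normal((n, n))
    d1, d2 = rng.standard_normal(n), rng.standard_normal(n)

    c1 = np.linalg.solve(A11, d1)   # "for all Omega_i: local scalar multigrid" for the u1 block
    d2 = d2 - A21 @ c1              # update the right-hand side of the second block
    c2 = np.linalg.solve(A22, d2)   # local scalar multigrid for the u2 block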

Accuracy
Cantilever beam with anisotropies 1:1, 1:4, 1:16; a hard, very ill-conditioned CSM test.
- CG solver: more than 2x the iterations per refinement step (figure: iteration counts, 128 to 65536, vs. number of DOF, 2.1Ki to 8.4Mi, for aniso01, aniso04 and aniso16; smaller is better)
- GPU-ScaRC solver: same results as the CPU

                      Iterations     Volume                          y-displacement
    refinement L      CPU    GPU     CPU            GPU              CPU             GPU
    aniso04
    8                 4      4       1.6087641E-3   1.6087641E-3     -2.8083499E-3   -2.8083499E-3
    9                 4      4       1.6087641E-3   1.6087641E-3     -2.8083628E-3   -2.8083628E-3
    10                4.5    4.5     1.6087641E-3   1.6087641E-3     -2.8083667E-3   -2.8083667E-3
    aniso16
    8                 6      6       6.7176398E-3   6.7176398E-3     -6.6216232E-2   -6.6216232E-2
    9                 6      5.5     6.7176427E-3   6.7176427E-3     -6.6216551E-2   -6.6216552E-2
    10                5.5    5.5     6.7176516E-3   6.7176516E-3     -6.6217501E-2   -6.6217502E-2

Weak scalability
(figures, smaller is better: left, time in seconds vs. number of nodes (8 to 160) for the configurations 2c8m_Cn_L10_const, 1g8m_Cn_L10_const and 1c8m_Cn_L10_const; right, normalised time per iteration vs. number of nodes (4 to 128) for CPU and GPU)
- Outdated cluster: dual Xeon EM64T singlecore nodes, one NVIDIA Quadro FX 1400 per node (one generation behind the Xeons, 20 GB/s bandwidth)
- Poisson problem (left): up to 1.3 billion DOF, 160 nodes
- Elasticity (right): up to 1 billion DOF, 128 nodes

Absolute speedup
(figure, smaller is better: total time in seconds for the BLOCK, CRACK, PIPE and STEELFRAME configurations; CPU-single, CPU-dual, GPU-single)
- 16 nodes, Opteron 2214 dualcore, NVIDIA Quadro FX 5600 (76 GB/s bandwidth), OpenGL backend
- Problem size: 128 M DOF
- Dualcore is 1.6x faster than singlecore
- GPU is 2.6x faster than singlecore and 1.6x faster than dualcore

Acceleration analysis
- Adding GPUs increases the resources, so the correct model is strong scalability inside each node
- Accelerable fraction of the elasticity solver: R_acc = 2/3 (66%); the remaining time is spent in MPI and the outer solver
- Measured local speedup S_local = 9x, total speedup S_total = 2.6x, theoretical limit S_max = 3x
(figure, larger is better: S_total as a function of S_local for R_acc = 1/2, 2/3 and 3/4, with the SSE and GPU measurements marked)
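
The curves in the figure follow the usual strong-scaling (Amdahl-type) model; written out in the slide's notation (this formula is a reconstruction from the figure, not stated explicitly on the slide):

    S_{\text{total}} = \frac{1}{(1 - R_{\text{acc}}) + R_{\text{acc}} / S_{\text{local}}}, \qquad
    S_{\max} = \lim_{S_{\text{local}} \to \infty} S_{\text{total}} = \frac{1}{1 - R_{\text{acc}}}

For R_acc = 2/3 this yields S_max = 3, consistent with the 3x limit quoted above.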

Stationary Navier-Stokes
\begin{pmatrix} A_{11} & A_{12} & B_1 \\ A_{21} & A_{22} & B_2 \\ B_1^T & B_2^T & C \end{pmatrix} \begin{pmatrix} u_1 \\ u_2 \\ p \end{pmatrix} = \begin{pmatrix} f_1 \\ f_2 \\ g \end{pmatrix}

- 4-node cluster, Opteron 2214 dualcore, GeForce 8800 GTX (90 GB/s bandwidth), CUDA backend
- Driven cavity and channel flow around a cylinder
- Fixed point iteration: assemble the linearised subproblems and solve them with global BiCGStab (reduce the initial residual by 1 digit), using a block Schur complement preconditioner (see the sketch below):
  1) approximately solve for the velocities with global MG (V 1+0), additively smoothed by: for all Ω_i: solve for u_1 with local MG; for all Ω_i: solve for u_2 with local MG
  2) update RHS: d_3 = d_3 + B^T (c_1, c_2)^T
  3) scale: c_3 = (M_L^p)^{-1} d_3  (lumped pressure mass matrix)
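
A compact NumPy sketch of one application of this block Schur complement preconditioner is given below; the direct velocity solve replaces the global MG with its local GPU-smoothed multigrids, and all matrices, sizes and names are illustrative.

    import numpy as np

    rng = np.random.default_rng(2)
    nv, np_ = 80, 40                                   # velocity / pressure DOF (toy sizes)
    A = np.eye(2 * nv) * 4.0 + 0.1 * rng.standard_normal((2 * nv, 2 * nv))  # [A11 A12; A21 A22]
    B = 0.1 * rng.standard_normal((2 * nv, np_))                            # [B1; B2]
    ML_p = np.full(np_, 2.0)                           # lumped pressure mass matrix (diagonal)
    d_v = rng.standard_normal(2 * nv)                  # velocity defect (d1, d2)
    d_p = rng.standard_normal(np_)                     # pressure defect (d3)

    c_v = np.linalg.solve(A, d_v)                      # 1) approximate velocity solve
    d_p = d_p + B.T @ c_v                              # 2) d3 = d3 + B^T (c1, c2)^T
    c_p = d_p / ML_p                                   # 3) c3 = (M_L^p)^{-1} d3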

Navier-Stokes results
Speedup analysis:

                     R_acc            S_local          S_total
                     L9      L10      L9      L10      L9      L10
    DC Re100         41%     46%      6x      12x      1.4x    1.8x
    DC Re250         56%     58%      5.5x    11.5x    1.9x    2.1x
    Channel flow: 60%, 6x, 1.9x

Important consequence: the ratio between assembly and linear solve changes significantly (assembly : solve):

                     DC Re100           DC Re250           Channel flow
                     plain    accel.    plain    accel.    plain    accel.
                     29:71    50:48     11:89    25:75     13:87    26:74

Conclusions

Conclusions
- Hardware-oriented numerics prevents existing codes from becoming worthless within a few years
- Mixed precision schemes exploit the available bandwidth without sacrificing accuracy
- GPUs serve as local preconditioners in a large-scale parallel FEM package
- The approach is not limited to GPUs; it applies to all kinds of hardware accelerators
- Minimally invasive: no changes to the application code
- Excellent local acceleration; global acceleration is limited by the sequential part
- Future work: design solver schemes with higher acceleration potential without sacrificing numerical efficiency

Acknowledgements
- Collaborative work with the FEAST group (TU Dortmund), Robert Strzodka (Max Planck Institut Informatik), Jamaludin Mohd-Yusof and Patrick McCormick (Los Alamos National Laboratory)
- http://www.mathematik.tu-dortmund.de/~goeddeke
- Supported by DFG (projects TU 102/22-1, TU 102/22-2, TU 102/27-1, TU 102/11-3) and by BMBF ("HPC Software für skalierbare Parallelrechner", SKALB project 01IH08003D)