Using GPUs for unstructured grid CFD

Mike Giles
mike.giles@maths.ox.ac.uk
Oxford University Mathematical Institute
Oxford e-Research Centre

Schlumberger Abingdon Technology Centre, February 17th, 2011

Outline

- CPUs, GPUs and programming challenges
- OP2: a high-level library for unstructured grid applications
- OP2: implementation
- performance on a test problem

Aim to answer two questions:
- are GPUs good for unstructured grid CFD?
- what am I doing to simplify their use?

CPUs

Intel's Sandy Bridge CPUs:
- 2-8 cores, each hyperthreaded
- complex cores with out-of-order execution and branch prediction to avoid delays when waiting for data
- each core has an AVX vector unit (8 floats or 4 doubles), giving 30 DP GFlops/core (15 GFlops without AVX)
- some models also have integrated graphics units (mainly fixed function, not useful for HPC?)
- 64kB L1 and 256kB L2 cache/core
- up to 8MB shared LLC (Last Level Cache)
- bandwidth to main DDR3 memory is around 30GB/s

GPUs

NVIDIA's Fermi GPUs have 14 units called Streaming Multiprocessors (SMs), each of which has:
- 32 simple in-order SIMD cores which act as a vector unit, giving 37 DP GFlops/SM
- 16-48 threads/core to hide delays
- 32k 32-bit registers
- 16kB L1 cache
- 48kB shared memory

The GPU also has:
- 768kB unified L2 cache
- 150 GB/s bandwidth to main GDDR5 memory
- 5 GB/s bandwidth to the CPU across the PCIe bus

Differences

- very different if AVX vectors are not used; not so different if they are
- factor 5-10 difference in peak GFlops
- factor 5 difference in memory bandwidth
- slow CPU-GPU link is a potential bottleneck
- the CPU has cache coherency at the L1 level; the GPU avoids the need for it through a language construct which requires no interference between different thread blocks
- the GPU uses much more multithreading; it has a lot of registers so each thread has its own set
- NVIDIA has a much clearer software vision than Intel

Software Challenges

- HPC application developers want the benefits of the latest hardware but are very worried about the software development costs, and the level of expertise required
- the status quo is not an option: running 24 MPI processes on a single CPU would give very poor performance, plus we need to exploit the vector units
- for GPUs, I'm happy programming in NVIDIA's CUDA (extended C/C++), but like MPI it's a little low-level
- for CPUs, MPI + OpenMP may be a good starting point, and PGI/Cray are proposing OpenMP extensions which would support GPUs and vector units
- however, hardware is likely to change rapidly in the next few years, and developers cannot afford to keep changing their software implementation

Software Abstraction

To address these challenges, we need to move to a suitable level of abstraction:
- separate the user's specification of the application from the details of the parallel implementation
- aim to achieve application-level longevity, with the top-level specification not changing for perhaps 10 years
- aim to achieve near-optimal performance through re-targeting the back-end implementation to different hardware and low-level software platforms

OP2 History

OPlus (Oxford Parallel Library for Unstructured Solvers):
- developed for Rolls-Royce 15 years ago (lead developer: Paul Crumpton)
- MPI-based library for the HYDRA CFD code on clusters with up to 200 nodes

OP2:
- open source project
- keeps the OPlus abstraction, but slightly modifies the API
- an active library approach with code transformation to generate CUDA or OpenMP/AVX code

OP2 Abstraction

- sets (e.g. nodes, edges, faces)
- datasets (e.g. flow variables)
- mappings (e.g. from edges to nodes)
- parallel loops:
  - operate over all members of one set
  - datasets have at most one level of indirection
  - user specifies how data is used (e.g. read-only, write-only, increment)

OP2 Restrictions

- set elements can be processed in any order; this doesn't affect the result to machine precision
- explicit time-marching, or multigrid with an explicit smoother, is OK; Gauss-Seidel or ILU preconditioning is not
- static sets and mappings (no dynamic grid adaptation)

OP2 API

op_init(int argc, char **argv)

op_decl_set(int size, op_set *set, char *name)

op_decl_map(op_set from, op_set to, int dim, int *imap, op_map *map, char *name)

op_decl_const(int dim, char *type, T *dat, char *name)

op_decl_dat(op_set set, int dim, char *type, T *dat, op_dat *data, char *name)

op_exit()
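As a concrete illustration, here is a hedged sketch of how these declaration routines might be used to set up a tiny mesh. The header name, mesh sizes and array names are assumptions for illustration, not taken from the slides.

    #include "op_seq.h"   /* assumed OP2 header for the sequential build */

    int main(int argc, char **argv) {
      op_init(argc, argv);

      /* a tiny illustrative mesh: 4 nodes joined by 4 edges */
      int   nnode = 4, nedge = 4;
      int   edge2node[] = {0,1, 1,2, 2,3, 3,0};             /* 2 nodes per edge  */
      float x[]   = {0.f,0.f, 1.f,0.f, 1.f,1.f, 0.f,1.f};   /* 2 coords per node */
      float res[] = {0.f, 0.f, 0.f, 0.f};                   /* 1 value per node  */
      float gam   = 1.4f;                                   /* a global constant */

      op_set nodes, edges;
      op_map pedge;
      op_dat p_x, p_res;

      op_decl_set(nnode, &nodes, "nodes");
      op_decl_set(nedge, &edges, "edges");
      op_decl_map(edges, nodes, 2, edge2node, &pedge, "pedge");
      op_decl_dat(nodes, 2, "float", x,   &p_x,   "x");
      op_decl_dat(nodes, 1, "float", res, &p_res, "res");
      op_decl_const(1, "float", &gam, "gam");

      /* ... op_par_loop calls over the sets would go here ... */

      op_exit();
      return 0;
    }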

OP2 API

Example of parallel loop syntax for a sparse matrix-vector product:

    op_par_loop_3(res, "res", edges,
                  A,  -1, OP_ID,  1, "float", OP_READ,
                  u,   0, pedge2, 1, "float", OP_READ,
                  du,  0, pedge1, 1, "float", OP_INC);

This is equivalent to the C code:

    for (e = 0; e < nedges; e++)
      du[pedge1[e]] += A[e] * u[pedge2[e]];

where each edge corresponds to a non-zero element in the matrix A, and pedge1 and pedge2 give the corresponding row and column indices.
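The first argument of op_par_loop_3 is the user's per-edge kernel function, which the slides do not show; a hedged sketch consistent with the equivalent C loop above might be:

    /* per-edge kernel "res"; the argument order follows the
       loop arguments above (A, u, du) */
    void res(float *A, float *u, float *du) {
      *du += (*A) * (*u);
    }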

User build processes

Using the same source code, the user can build different executables for different target platforms:
- sequential single-thread CPU execution: purely for program development and debugging (very poor performance)
- CUDA for a single GPU
- OpenMP/AVX for multicore CPU systems
- MPI plus any of the above for clusters

GPU Parallelisation

Could have up to 10^6 threads in 3 levels of parallelism:

- MPI distributed-memory parallelism (1-100):
  - one MPI process for each GPU
  - all sets partitioned across MPI processes, so each MPI process only holds its data (and halo)
- block parallelism (50-1000):
  - on each GPU, data is broken into mini-partitions, worked on separately and in parallel by different functional units in the GPU
- thread parallelism (32-128):
  - each mini-partition is worked on by a block of threads in parallel

Data dependencies

The key technical issue is data dependency when incrementing indirectly-referenced arrays, e.g. a potential problem when two edges update the same node.

Data dependencies

Method 1: the owner of the nodal data does the edge computation
- drawback: redundant computation when the two nodes have different owners

Data dependencies

Method 2: color the edges so that no two edges of the same color update the same node
- parallel execution for each color, then synchronize
- possible loss of data reuse and some parallelism
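A minimal sketch of this coloring approach on a shared-memory CPU, assuming the edges have already been sorted by color (the names color_offset, pedge1, pedge2 and the simple flux formula are illustrative, not the actual OP2 implementation):

    #include <omp.h>

    /* edges of color c occupy positions color_offset[c] .. color_offset[c+1]-1 */
    void residual_by_color(int ncolors, const int *color_offset,
                           const int *pedge1, const int *pedge2,
                           const float *A, const float *q, float *res) {
      for (int c = 0; c < ncolors; c++) {
        /* edges of one color share no nodes, so their updates cannot conflict */
        #pragma omp parallel for
        for (int e = color_offset[c]; e < color_offset[c+1]; e++) {
          float flux = A[e] * (q[pedge2[e]] - q[pedge1[e]]);
          res[pedge1[e]] += flux;
          res[pedge2[e]] -= flux;
        }
        /* the implicit barrier here is the per-color synchronization */
      }
    }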

Data dependencies

Which method is best at each level?

MPI level: method 1
- each MPI process does the calculation needed to update its own data
- partitions are large, so relatively little redundant computation

GPU level: method 2
- plenty of blocks of each color, so still good parallelism
- data reuse within each block, not between blocks

Block level: method 2
- indirect data held in local shared memory, so we get reuse
- individual threads are colored to avoid conflicts when incrementing shared memory
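A hedged CUDA sketch of the GPU and block levels described above: one thread block per mini-partition of edges, with a per-edge color used to serialise conflicting increments. All names and the flat data layout are assumptions for illustration; the real OP2 plan additionally stages the indirect data in shared memory.

    // one thread block works on one mini-partition of edges
    __global__ void res_colored(const int *blk_start,   // first edge of each mini-partition
                                const int *blk_nedges,  // number of edges in each mini-partition
                                const int *blk_ncolors, // number of edge colors in each mini-partition
                                const int *edge_color,  // color of each edge
                                const int *pedge1, const int *pedge2,
                                const float *A, const float *u, float *du) {
      int base = blk_start[blockIdx.x];
      int n    = blk_nedges[blockIdx.x];

      for (int col = 0; col < blk_ncolors[blockIdx.x]; col++) {
        for (int e = threadIdx.x; e < n; e += blockDim.x) {
          // only edges of the current color update nodes, so no two
          // threads increment the same du entry at the same time
          if (edge_color[base + e] == col)
            du[pedge1[base + e]] += A[base + e] * u[pedge2[base + e]];
        }
        __syncthreads();   // finish one color before starting the next
      }
    }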

Current status

The initial prototype, with a code parser/generator written in MATLAB, can generate:
- CUDA code for a single GPU
- OpenMP code for multiple CPUs

The parallel loop API requires redundant information:
- simplifies MATLAB program generation (just need to parse loop arguments, not the entire code)
- numeric values for dataset dimensions enable compiler optimisation of CUDA code
- programming is easy; it's debugging which is difficult
- not time-consuming to specify the redundant information, provided consistency is checked automatically

Airfoil test code

- 2D Euler equations, cell-centred finite volume method with scalar dissipation
  (minimal compute per memory reference; should consider switching to more compute-intensive characteristic smoothing, which is more representative of real applications)
- roughly 1.5M edges, 0.75M cells
- 5 parallel loops:
  - save_soln (direct over cells)
  - adt_calc (indirect over cells)
  - res_calc (indirect over edges)
  - bres_calc (indirect over boundary edges)
  - update (direct over cells with RMS reduction)
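As an illustration of a direct loop, here is a hedged sketch of how the save_soln loop might be written with the op_par_loop syntax shown earlier; the kernel body, the dataset names p_q and p_qold, and the OP_WRITE access mode are assumptions, not the actual Airfoil source.

    /* copy the 4 flow variables of each cell into the "old" solution */
    void save_soln(float *q, float *qold) {
      for (int i = 0; i < 4; i++) qold[i] = q[i];
    }

    op_par_loop_2(save_soln, "save_soln", cells,
                  p_q,    -1, OP_ID, 4, "float", OP_READ,
                  p_qold, -1, OP_ID, 4, "float", OP_WRITE);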

Airfoil test code

Current performance relative to a single CPU thread:
- 35× speedup on a single GPU
- 7× speedup for 2 quad-core CPUs

OpenMP performance seems bandwidth-limited: loops use in excess of 20GB/s bandwidth from main memory.

CUDA performance also seems bandwidth-limited:

    count    time      GB/s      GB/s   kernel name
     1000  0.2137  107.8126            save_soln
     2000  1.3248   61.0920   63.1218  adt_calc
     2000  5.6105   32.5672   53.4745  res_calc
     2000  0.1029    4.8996   18.4947  bres_calc
     2000  0.8849  110.6465            update

Conclusions

- have created a high-level framework for parallel execution of algorithms on unstructured grids
- looks encouraging for providing ease-of-use, high performance, and longevity through new back-ends
- one GPU provides a 5× speedup compared to two CPUs
- next step is the addition of an MPI layer for cluster computing
- the key challenge then is to build a user community

Acknowledgements

- Gihan Mudalige (Oxford)
- Paul Kelly, Graham Markall (Imperial College)
- Nick Hills (Surrey) and Paul Crumpton
- Leigh Lapworth, Yoon Ho, David Radford (Rolls-Royce)
- Jamil Appa, Pierre Moinier (BAE Systems)
- Tom Bradley, Jon Cohen and others (NVIDIA)
- EPSRC, TSB, Rolls-Royce and NVIDIA for financial support
- Oxford Supercomputing Centre