Accelerating CFD with Graphics Hardware

Accelerating CFD with Graphics Hardware Graham Pullan (Whittle Laboratory, Cambridge University) 16 March 2009

Today: Motivation; CPUs and GPUs; Programming NVIDIA GPUs with CUDA; Application to turbomachinery CFD: Turbostream; Conclusions.

Part 1: Motivation

Turbomachinery: thousands of blades, arranged in rows. Each blade row has a bespoke blade profile, designed with CFD.

Approximate compute requirements. Steady models (no wake/blade interaction, etc): 1 blade, 0.5 Mcells, 1 CPU hour; 1 stage (2 blades), 1.0 Mcells, 3 CPU hours; 1 component (5 stages), 5.0 Mcells, 20 CPU hours. Unsteady models (with wakes, etc): 1 component (1000 blades), 500 Mcells, 0.1 M CPU hours; engine (4000 blades), 2 Gcells, 1 M CPU hours.

Aim: to produce an order-of-magnitude reduction in run-times for the same hardware cost.

Part 2: CPUs and GPUs

Moore's Law. "The complexity for minimum component costs has increased at a rate of roughly a factor of two per year. Certainly over the short term this rate can be expected to continue." Gordon Moore (Intel), 1965. "OK, maybe a factor of two every two years." Gordon Moore (Intel), 1975 [paraphrased].

Was Moore right? Source: Intel

Feature size Source: Intel

Clock speed. Source: Tom's Hardware

What to do with all these transistors?

Parallel computing. Multi-core chips are either instruction parallel (Multiple Instruction, Multiple Data: MIMD) or data parallel (Single Instruction, Multiple Data: SIMD).

Today's commodity MIMD chips: CPUs. Intel Core 2 Quad: 4 cores, 2.4 GHz, 65 nm features, 582 million transistors, 8 MB on-chip memory.

Today's commodity SIMD chips: GPUs. NVIDIA 8800 GTX: 128 cores, 1.35 GHz, 90 nm features, 681 million transistors, 768 MB on-board memory.

CPUs vs GPUs

CPUs vs GPUs: transistor usage. Source: NVIDIA

Graphics pipeline

(Traditional) graphics pipeline

GPUs and scientific computing GPUs are designed to apply the same shading function to many pixels simultaneously

GPUs and scientific computing GPUs are designed to apply the same function to many data simultaneously This is what most scientific computing needs!
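
This mapping can be made concrete with a minimal CUDA kernel (an illustrative sketch, not from the talk): every thread applies the same function, here the saxpy update y = a*x + y, to its own array element.

    /* One thread per element: all threads run the same function
       on their own data (data-parallel, SIMD-style). */
    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                  /* guard threads past the end */
            y[i] = a * x[i] + y[i];
    }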

Part 3: Programming GPUs with CUDA

3 Generations of GPGPU (Owens, 2008). Making it work at all: primitive functionality and tools (graphics); comparisons with the CPU not rigorous. Making it work better: easier to use (higher level); understanding of how best to do it. Doing it right: stable, portable, modular building blocks.

GPU programming for graphics: the application specifies geometry; the GPU rasterizes; each fragment is shaded (SIMD); shading can use values from memory (textures); the image can be stored for re-use. Courtesy of John Owens, UC Davis.

GPGPU programming ("old-school"): draw a quad; run a SIMD program over each fragment; gather is permitted from texture memory; the resulting buffer can be stored for re-use. Courtesy of John Owens, UC Davis.

NVIDIA G80 hardware implementation: vertex/fragment processors are replaced by unified shaders. The GPU can now be viewed as a massively parallel co-processor: a set of 16 SIMD multiprocessors of 8 cores each.

NVIDIA G80 hardware implementation: the 128 cores are divided into 16 multiprocessors (MPs). Each MP has registers, shared memory, a read-only constant cache, and a read-only texture cache.

NVIDIA's CUDA programming model: the G80 chip supports MANY active threads (12,288). Threads are lightweight: little creation overhead, instant switching; efficiency is achieved through 1000s of threads. Threads are organised into blocks (1D, 2D, 3D), and blocks are further organised into a grid.
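
As an illustration of this hierarchy (a sketch with assumed names, not from the talk), a 2D field can be covered by a 2D grid of 2D blocks:

    /* Device kernel: each thread updates one (i, j) point. */
    __global__ void scale2d(int ni, int nj, float s, float *f)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        if (i < ni && j < nj)        /* guard the domain edges */
            f[j * ni + i] *= s;
    }

    /* Host side: enough 16x16-thread blocks to cover the ni x nj
       field (d_f is assumed to be a device array). */
    void launch_scale2d(int ni, int nj, float s, float *d_f)
    {
        dim3 block(16, 16);
        dim3 grid((ni + block.x - 1) / block.x,
                  (nj + block.y - 1) / block.y);
        scale2d<<<grid, block>>>(ni, nj, s, d_f);
    }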

Kernels, grids, blocks and threads. The organisation of threads and blocks is the key abstraction. Software: threads from one block may cooperate, using data in shared memory and through synchronising (see the sketch below). Hardware: a block runs on one MP; the hardware is free to schedule any block on any MP; more than one block can reside on one MP.
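
As a small sketch of such cooperation (illustrative, not from the talk), the threads of one block can stage values in shared memory and synchronise between steps to form a block-wide sum:

    /* Each 256-thread block reduces its 256 inputs to a single sum. */
    __global__ void block_sum(const float *in, float *out)
    {
        __shared__ float s[256];     /* visible to the whole block */
        int tid = threadIdx.x;
        s[tid] = in[blockIdx.x * 256 + tid];
        __syncthreads();             /* all loads done before any reads */

        for (int step = 128; step > 0; step /= 2) {
            if (tid < step)
                s[tid] += s[tid + step];
            __syncthreads();         /* complete each step before the next */
        }
        if (tid == 0)
            out[blockIdx.x] = s[0];  /* one partial sum per block */
    }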

CUDA implementation: CUDA is implemented as extensions to C. CUDA programs explicitly manage host and device memory (allocation, transfers), set the thread blocks and grid, and launch kernels; they are compiled with the CUDA nvcc compiler.
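
A complete minimal host program following these steps might look like the sketch below (illustrative names, error checking omitted; compile a .cu file with nvcc):

    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Device kernel: each thread scales one element. */
    __global__ void scale(int n, float s, float *x)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            x[i] *= s;
    }

    int main(void)
    {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);

        /* Host allocation and initialisation */
        float *h_x = (float *)malloc(bytes);
        for (int i = 0; i < n; i++)
            h_x[i] = 1.0f;

        /* Device allocation and host-to-device transfer */
        float *d_x;
        cudaMalloc(&d_x, bytes);
        cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);

        /* Set thread blocks and grid, then launch the kernel */
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        scale<<<blocks, threads>>>(n, 2.0f, d_x);

        /* Device-to-host transfer and clean-up */
        cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);
        printf("x[0] = %f\n", h_x[0]);   /* expect 2.0 */
        cudaFree(d_x);
        free(h_x);
        return 0;
    }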

Part 4: Application to CFD

Introduction to CFD: divide the flow volume around the blade into cells.

Governing equations for each cell. Conserve: mass, momentum, energy.

Example: mass conservation. Evaluate the mass flux on each face: F_mass = (A/4) Σ ρ v_n, where the sum of ρ v_n runs over the face's four nodes and A is the face area.

Example: mass conservation. Sum the fluxes on the faces to find the density change in the cell: Δρ_cell = Σ_faces F_mass.

Example: mass conservation. Update the density at each node: Δρ_node = (1/8) Σ Δρ_cell, summing over the 8 surrounding cells (only 4 of the 8 shown).
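
In code, this node update is a simple gather/scatter loop. A sketch in C with assumed array names (not from the talk):

    /* Scatter each cell's density change to its 8 corner nodes,
       each node receiving a 1/8 share (cell_node maps cells to nodes). */
    void update_nodes(int ncells, const int (*cell_node)[8],
                      const float *drho_cell, float *drho_node)
    {
        for (int c = 0; c < ncells; c++)
            for (int n = 0; n < 8; n++)
                drho_node[cell_node[c][n]] += 0.125f * drho_cell[c];
    }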

Similarity of steps: each step uses data from surrounding nodes, i.e. a stencil operation.

Similarity of equations. For each equation (5 in all): set the relevant flux (mass, momentum, energy); sum the fluxes; update the nodes (plus smoothing, which is also a stencil, and boundary conditions, which are not a stencil).

Overall strategy: divide up the domain; assign each sub-domain to a thread block; update the nodes in each sub-domain with the most efficient stencil operation we can come up with; update the sub-domain boundaries (MPI if needed).

Efficient stencil operations: launch one thread per element in an i-k plane; load as many planes into shared memory as the stencil needs; update the elements in the plane (store in global device memory); load a new i-k plane; repeat, iterating in the j direction.

CUDA example:

    __global__ void smooth_kernel(float sf, float *a_data, float *b_data)
    {
        /* shared memory array */
        __shared__ float a[16][3][5];

        /* fetch first planes */
        a[i][0][k] = a_data[i0m10];
        a[i][1][k] = a_data[i000];
        a[i][2][k] = a_data[i0p10];
        __syncthreads();

        /* compute */
        b_data[i000] = sf1*a[i][1][k]
                     + sfd6*(a[im1][1][k] + a[ip1][1][k]
                           + a[i][0][k] + a[i][2][k]
                           + a[i][1][km1] + a[i][1][kp1]);

        /* load next "j" plane and repeat... */
    }

SBLOCK stencil framework. SBLOCK is a framework for stencil operations on structured grids: a source-to-source compiler that takes in high-level kernel definitions and produces optimised kernels in C or CUDA. It allows new stencils to be implemented quickly, and new stencil optimisation strategies to be deployed across all stencils (without typos!).

Example SBLOCK definition:

    kind = "stencil"
    bpin = ["a"]
    bpout = ["b"]
    lookup = ((-1, 0, 0), (0, 0, 0), (1, 0, 0),
              (0, -1, 0), (0, 1, 0),
              (0, 0, -1), (0, 0, 1))
    calc = {"lvalue": "b",
            "rvalue": """sf1*a[0][0][0] +
                         sfd6*(a[-1][0][0] + a[1][0][0] +
                               a[0][-1][0] + a[0][1][0] +
                               a[0][0][-1] + a[0][0][1])"""}

C implementation:

    void smooth(float sf, float *a, float *b)
    {
        for (k = 0; k < nk; k++) {
            for (j = 0; j < nj; j++) {
                for (i = 0; i < ni; i++) {
                    /* compute indices i000, im100, etc. */
                    b[i000] = sf1*a[i000]
                            + sfd6*(a[im100] + a[ip100]
                                  + a[i0m10] + a[i0p10]
                                  + a[i00m1] + a[i00p1]);
                }
            }
        }
    }

GPU implementation: the CUDA version is the smooth_kernel shown in the CUDA example above (identical code).

Turbostream: a CUDA port of an existing FORTRAN code (TBLOCK). 15,000 lines of FORTRAN become 5,000 lines of kernel definitions, which generate 30,000 lines of CUDA. Runs on CPUs or multiple GPUs. 20x speedup on a Tesla C1060 compared with all cores of a modern Intel Core 2 Quad.

Turbostream: 9 minutes on a Tesla S870 (4 GPUs) versus 12 hours on one 2.5 GHz CPU core.

Application to a 3-stage turbine

FORTRAN & CUDA comparison (figure: results from the Fortran and CUDA codes shown side by side)

Comparison to experimental data

Impact of GPU-accelerated CFD. The Tesla Personal Supercomputer enables a full turbine in 10 minutes (not 12 hours), and one blade (for design) in 2 minutes. A Tesla cluster enables interactive design of blades for the first time, and the use of higher-accuracy methods at an early stage in the design process.

Summary. Many science applications fit the SIMD model used in GPUs. CUDA enables science developers to access NVIDIA GPUs without cumbersome graphics APIs. Existing codes have to be analysed and re-coded to best fit the manycore architecture, but the speedups are such that this can be worth doing. For our application, the step-change in capability is revolutionary.

More information www.many-core.group.cam.ac.uk

Lattice Boltzmann demo