Acceleration of scientific computing using graphics hardware


Acceleration of scientific computing using graphics hardware. Graham Pullan, Whittle Lab, Engineering Department, University of Cambridge, 28 May 2008. I've added some notes that weren't on the original slides to help readers of the online PDF version.

Coming up... Background; CPUs and GPUs; GPU programming models; An example: CFD; Alternative devices; Conclusions

Part 1: Background

Whittle Lab You are here

Whittle Lab I work here You are here

Turbomachinery

Engine calculation Courtesy Vicente Jerez Fidalgo, Whittle Lab

CFD basics Body-fitted mesh

CFD basics Body-fitted mesh For each cell, conserve: mass momentum energy and update flow properties

Approximate compute requirements
Steady models (no wake/blade interaction, etc):
1 blade: 0.5 Mcells, 1 CPU hour
1 stage (2 blades): 1.0 Mcells, 3 CPU hours
1 component (5 stages): 5.0 Mcells, 20 CPU hours

Unsteady models (with wakes, etc):
1 component (1000 blades): 500 Mcells, 0.1 M CPU hours
Engine (4000 blades): 2 Gcells, 1 M CPU hours

Graham's coding experience: FORTRAN, C, MPI

Graham's coding experience: C, MPI

Part 2: CPUs and GPUs

Moore's Law: "The complexity for minimum component costs has increased at a rate of roughly a factor of two per year. Certainly over the short term this rate can be expected to continue." Gordon Moore (Intel), 1965

Moore's Law: "The complexity for minimum component costs has increased at a rate of roughly a factor of two per year. Certainly over the short term this rate can be expected to continue." Gordon Moore (Intel), 1965. "OK, maybe a factor of two every two years." Gordon Moore (Intel), 1975 [paraphrased]

Was Moore right? Source: Intel, ftp://download.intel.com/research/silicon/Gordon_Moore_ISSCC_021003.pdf

Was Moore right? Source: Intel

Was Moore right? Source: Intel

Feature size. Source: Intel, ftp://download.intel.com/research/silicon/Gordon_Moore_ISSCC_021003.pdf

Clock speed. Source: http://www.tomshardware.com/2005/11/21/the_mother_of_all_cpu_charts_2005/index.html

Power: the clock speed limiter? A 1 GHz CPU requires 25 W; a 3 GHz CPU requires 100 W.

Power: the clock speed limiter? A 1 GHz CPU requires 25 W; a 3 GHz CPU requires 100 W. "The total electricity consumed by major search engines in 2006 approaches 5 GW." Wired / AMD. Source: http://www.hotchips.org/hc19/docs/keynote2.pdf

What to do with all these transistors?

Parallel computing. Multi-core chips are either: instruction parallel (Multiple Instruction, Multiple Data: MIMD) or data parallel (Single Instruction, Multiple Data: SIMD).

Today's commodity MIMD chips: CPUs. Intel Core 2 Quad: 4 cores, 2.4 GHz, 65 nm features, 582 million transistors, 8 MB on-chip memory.

Today's commodity SIMD chips: GPUs. NVIDIA 8800 GTX: 128 cores, 1.35 GHz, 90 nm features, 681 million transistors, 768 MB on-board memory.

CPUs vs GPUs. Source: http://www.eng.cam.ac.uk/~gp10006/research/brandvik_pullan_2008a_draft.pdf

CPUs vs GPUs Transistor usage: Source: NVIDIA CUDA SDK documentation

Graphics pipeline. Source: ftp://download.nvidia.com/developer/presentations/2004/Perfect_Kitchen_Art/English_Evolution_of_GPUs.pdf

Graphics pipeline

GPUs and scientific computing GPUs are designed to apply the same shading function to many pixels simultaneously

GPUs and scientific computing GPUs are designed to apply the same function to many data simultaneously This is what most scientific computing needs!

Part 3: Programming methods

3 Generations of GPGPU (Owens, 2008)

3 Generations of GPGPU (Owens, 2008) Making it work at all: Primitive functionality and tools (graphics APIs) Comparisons with CPU not rigorous

3 Generations of GPGPU (Owens, 2008) Making it work at all: Primitive functionality and tools (graphics APIs) Comparisons with CPU not rigorous Making it work better: Easier to use (higher level APIs) Understanding of how best to do it

3 Generations of GPGPU (Owens, 2008) Making it work at all: Primitive functionality and tools (graphics APIs) Comparisons with CPU not rigorous Making it work better: Easier to use (higher level APIs) Understanding of how best to do it Doing it right: Stable, portable, modular building blocks Source: http://www.ece.ucdavis.edu/~jowens/talks/intelsantaclara-070420.pdf

GPU programming for graphics (courtesy John Owens, UC Davis): Application specifies geometry; GPU rasterizes; each fragment is shaded (SIMD); shading can use values from memory (textures); the image can be stored for re-use. Source: http://www.ece.ucdavis.edu/~jowens/talks/intelsantaclara-070420.pdf

GPGPU programming ("old-school"): Draw a quad; run a SIMD program over each fragment; gather is permitted from texture memory; the resulting buffer can be stored for re-use. Courtesy John Owens, UC Davis.

NVIDIA G80 hardware implementation: Vertex/fragment processors are replaced by unified shaders. Now view the GPU as a massively parallel co-processor: a set of 16 SIMD MultiProcessors (8 cores each). Source: http://www.ece.wisc.edu/~kati/fpga2008/fpga2008%20workshop%20-%2006%20nvidia%20-%20kirk.pdf

NVIDIA G80 hardware implementation: Divide the 128 cores into 16 MultiProcessors (MPs). Each MP has: registers, shared memory, a read-only constant cache, a read-only texture cache.

NVIDIA's CUDA programming model: The G80 chip supports MANY active threads: 12,288. Threads are lightweight: little creation overhead, instant switching; efficiency is achieved through 1000s of threads. Threads are organised into blocks (1D, 2D, 3D); blocks are further organised into a grid.

Kernels, grids, blocks and threads

Kernels, grids, blocks and threads Organisation of threads and blocks is key abstraction

Kernels, grids, blocks and threads Organisation of threads and blocks is key abstraction Software: Threads from one block may cooperate: Using data in shared memory Through synchronising

Kernels, grids, blocks and threads Organisation of threads and blocks is key abstraction Software: Threads from one block may cooperate: Using data in shared memory Through synchronising Hardware: A block runs on one MP Hardware free to schedule any block on any MP More than one block can reside on one MP

Kernels, grids, blocks and threads
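As a concrete illustration of this abstraction (a hypothetical sketch, not code from the talk; the kernel and array names are made up): in CUDA each thread uses its block and thread indices to pick out the one array element it works on. A minimal 2D kernel might look like:

__global__ void scale_kernel(float *data, float factor, int ni, int nj)
{
  /* Each thread finds its (i, j) position from its block and thread indices */
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  int j = blockIdx.y*blockDim.y + threadIdx.y;

  if (i < ni && j < nj) {
    data[i + j*ni] = factor * data[i + j*ni];   /* one element per thread */
  }
}

Blocks of, say, 16x16 of these threads are then laid out in a 2D grid that covers the whole ni x nj array, as shown in the host-side sketch after the next slide.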

CUDA implementation: CUDA is implemented as extensions to C. CUDA programs explicitly manage host and device memory (allocation, transfers), set the thread blocks and grid, launch kernels, and are compiled with the CUDA nvcc compiler.
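A minimal host-side sketch of those steps, using the hypothetical scale_kernel above (illustrative only, not code from the talk):

/* Allocate host and device memory, copy data over, launch, copy back, free */
float *h_data = (float *)malloc(ni*nj*sizeof(float));   /* host memory    */
float *d_data;
cudaMalloc((void **)&d_data, ni*nj*sizeof(float));       /* device memory  */

cudaMemcpy(d_data, h_data, ni*nj*sizeof(float), cudaMemcpyHostToDevice);

dim3 block = dim3(16, 16);                               /* thread block   */
dim3 grid  = dim3((ni+15)/16, (nj+15)/16);               /* grid of blocks */
scale_kernel<<<grid, block>>>(d_data, 2.0f, ni, nj);     /* launch kernel  */

cudaMemcpy(h_data, d_data, ni*nj*sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(d_data);
free(h_data);

The whole file is then compiled with nvcc, e.g. nvcc main.cu -o main.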

Part 4: An example CFD

Distribution function: $f = f(\mathbf{c}, \mathbf{x}, t)$, where $\mathbf{c}$ is the microscopic velocity. The macroscopic density and velocity follow as $\rho = \int f \, d\mathbf{c}$ and $\rho\mathbf{u} = \int \mathbf{c} f \, d\mathbf{c}$, where $\mathbf{u}$ is the macroscopic velocity.

Boltzmann equation. The evolution of $f$:
$$\frac{\partial f}{\partial t} + \mathbf{c} \cdot \nabla f = \left(\frac{\partial f}{\partial t}\right)_{\mathrm{collisions}}$$
Major simplification (BGK collision operator):
$$\frac{\partial f}{\partial t} + \mathbf{c} \cdot \nabla f = \frac{1}{\tau}\left(f^{eq} - f\right)$$

Lattice Boltzmann Method Uniform mesh (lattice)

Lattice Boltzmann Method. Uniform mesh (lattice). Restrict the microscopic velocities to a finite set: $\rho = \sum_\alpha f_\alpha$, $\rho\mathbf{u} = \sum_\alpha f_\alpha \mathbf{c}_\alpha$.

Macroscopic flow. For 2D, 9 velocities (D2Q9) recover the isothermal, incompressible Navier-Stokes equations, with viscosity $\nu = \left(\tau - \tfrac{1}{2}\right)\dfrac{\Delta x^2}{3\,\Delta t}$.

Solution procedure
1. Evaluate macroscopic properties: $\rho = \sum_\alpha f_\alpha$, $\rho\mathbf{u} = \sum_\alpha f_\alpha \mathbf{c}_\alpha$
2. Evaluate $f_\alpha^{eq}(\rho, \mathbf{u})$
3. Find $f_\alpha^{*} = f_\alpha - \frac{1}{\tau}\left(f_\alpha - f_\alpha^{eq}\right)$
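The slides do not spell out the form of $f_\alpha^{eq}$ used; for reference, the standard D2Q9 equilibrium (quoted here as background, not from the talk, with lattice weights $w_\alpha$ and lattice sound speed $c_s$) is
$$f_\alpha^{eq} = w_\alpha \, \rho \left[ 1 + \frac{\mathbf{c}_\alpha \cdot \mathbf{u}}{c_s^2} + \frac{(\mathbf{c}_\alpha \cdot \mathbf{u})^2}{2 c_s^4} - \frac{\mathbf{u} \cdot \mathbf{u}}{2 c_s^2} \right], \qquad c_s^2 = \frac{1}{3}\frac{\Delta x^2}{\Delta t^2}.$$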

Solution procedure

Solution procedure Simple prescriptions at boundary nodes
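One such prescription (given here purely as an illustration, not from the slides) is the common bounce-back wall condition: each distribution arriving at a solid node is reflected back along its own link. A minimal C sketch for a single D2Q9 node, assuming the velocity numbering 1:E, 2:N, 3:W, 4:S, 5:NE, 6:NW, 7:SW, 8:SE:

/* Hypothetical sketch: bounce-back at one solid D2Q9 node.
   f[] holds the 9 distributions at the node; OPP[] maps each
   direction to its opposite under the assumed numbering above. */
static const int OPP[9] = {0, 3, 4, 1, 2, 7, 8, 5, 6};

void bounce_back_node(float *f)
{
    float tmp[9];
    for (int alpha = 0; alpha < 9; alpha++) tmp[alpha] = f[alpha];
    for (int alpha = 0; alpha < 9; alpha++) f[alpha] = tmp[OPP[alpha]];  /* reflect along each link */
}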

CPU code: main.c

/* Memory allocation */
f0 = (float *)malloc(ni*nj*sizeof(float));
...

/* Main loop */
Stream(...args...);
Apply_BCs(...args...);
Collide(...args...);

GPU code: main.cu

/* allocate memory on host */
f0 = (float *)malloc(ni*nj*sizeof(float));

/* allocate memory on device */
cudaMallocPitch((void **)&f0_data, &pitch, sizeof(float)*ni, nj);
cudaMallocArray(&f0_array, &desc, ni, nj);

/* Main loop */
Stream(...args...);
Apply_BCs(...args...);
Collide(...args...);

CPU code: collide.c

for (j=0; j<nj; j++) {
  for (i=0; i<ni; i++) {
    i2d = I2D(ni,i,j);

    /* Flow properties */
    density = ...function of f's...
    vel_x   = ...
    vel_y   = ...

    /* Equilibrium f's */
    f0eq = ...function of density, vel_x, vel_y...
    f1eq = ...

    /* Collisions */
    f0[i2d] = rtau1 * f0[i2d] + rtau * f0eq;
    f1[i2d] = rtau1 * f1[i2d] + rtau * f1eq;
    ...
  }
}

GPU code: collide.cu (kernel wrapper)

void collide(...args...)
{
  /* Set thread blocks and grid */
  dim3 grid  = dim3(ni/TILE_I, nj/TILE_J);
  dim3 block = dim3(TILE_I, TILE_J);

  /* Launch kernel */
  collide_kernel<<<grid, block>>>(...args...);
}

GPU code: collide.cu (kernel)

/* Evaluate indices */
i = blockIdx.x*TILE_I + threadIdx.x;
j = blockIdx.y*TILE_J + threadIdx.y;
i2d = i + j*pitch/sizeof(float);

/* Read from device global memory */
f0now = f0_data[i2d];
f1now = f1_data[i2d];

/* Calc flow, feq, collide, as CPU code */

/* Write to device global memory */
f0_data[i2d] = rtau1 * f0now + rtau * f0eq;
f1_data[i2d] = rtau1 * f1now + rtau * f1eq;

GPU code: stream.cu (kernel wrapper)

void stream(...args...)
{
  /* Copy linear memory to CUDA array */
  cudaMemcpy2DToArray(f1_array, 0, 0, (void *)f1_data, pitch,
                      sizeof(float)*ni, nj, cudaMemcpyDeviceToDevice);

  /* Make CUDA array a texture */
  f1_tex.filterMode = cudaFilterModePoint;
  cudaBindTextureToArray(f1_tex, f1_array);

  /* Set threads and launch kernel */
  dim3 grid  = dim3(ni/TILE_I, nj/TILE_J);
  dim3 block = dim3(TILE_I, TILE_J);
  stream_kernel<<<grid, block>>>(...args...);
}

GPU code: stream.cu (kernel)

/* indices */
i = blockIdx.x*TILE_I + threadIdx.x;
j = blockIdx.y*TILE_J + threadIdx.y;
i2d = i + j*pitch/sizeof(float);

/* stream using texture fetches */
f1_data[i2d] = tex2D(f1_tex, (i-1), j);
f2_data[i2d] = tex2D(f2_tex, i, (j-1));
...

CPU / GPU demo

Results. 2D Lattice Boltzmann code: 15x speedup, GPU vs CPU. Real CFD is more complex: more kernels, 3D. To improve performance, make use of shared memory.

3D stencil operations. Most CFD operations use nearest-neighbour lookups (stencil operations), e.g. a 7-point stencil: the centre point + 6 nearest neighbours. Load data into shared memory, perform the stencil ops, export the results to device global memory, then read more data into shared memory.

Stencil operations: 3D sub-domain, threads in one plane. Source: http://www.eng.cam.ac.uk/~gp10006/research/brandvik_pullan_2008a_DRAFT.pdf

CUDA stencil kernel

__global__ void smooth_kernel(float sf, float *a_data, float *b_data)
{
  /* shared memory array */
  __shared__ float a[16][3][5];

  /* fetch first planes */
  a[i][0][k] = a_data[i0m10];
  a[i][1][k] = a_data[i000];
  a[i][2][k] = a_data[i0p10];
  __syncthreads();

  /* compute */
  b_data[i000] = sf1*a[i][1][k]
               + sfd6*(a[im1][1][k] + a[ip1][1][k] + a[i][0][k] + a[i][2][k]
                       + a[i][1][km1] + a[i][1][kp1]);

  /* load next "j" plane and repeat... */
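For context, a host-side wrapper consistent with the [16][3][5] shared-memory tile might look like the sketch below; the block dimensions and the smooth() wrapper name are assumptions, not taken from the slides. Each block holds one 16x5 plane of threads in (i, k) and marches through the j planes inside the kernel:

void smooth(float sf, float *a_data, float *b_data)
{
  /* Threads cover one (i,k) plane of the tile; assumes ni, nk divide evenly */
  dim3 grid  = dim3(ni/16, nk/5);
  dim3 block = dim3(16, 5);

  /* Launch kernel */
  smooth_kernel<<<grid, block>>>(sf, a_data, b_data);
}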

Typical grid CUDA partitioning

Typical grid CUDA partitioning Each colour to a different multiprocessor

3D results 30x speedup GPU vs CPU

Part 5: NVIDIA - the only show in town?

NVIDIA 4 Tesla HPC GPUs 500 GFLOPs peak per GPU 1.5GB per GPU

AMD Firestream HPC GPU 500 GFLOPs 2GB available?

ClearSpeed 80 GFLOPs 35 W!

IBM Cell BE 25 x 8 GFLOPs

Chip comparison (Giles 2008). Source: http://www.cardiff.ac.uk/arcca/services/events/novelarchitecture/Mike-Giles.pdf

Too much choice! Each device has different hardware characteristics, different software (C extensions), different developer tools. How can we write code for all SIMD devices, for all applications?

Big picture all devices, all problems?

Forget the big picture

Tackle the dwarves!

The View from Berkeley (7 dwarves)
1. Dense Linear Algebra
2. Sparse Linear Algebra
3. Spectral Methods
4. N-Body Methods
5. Structured Grids
6. Unstructured Grids
7. MapReduce
Source: http://view.eecs.berkeley.edu/wiki/main_page

The View from Berkeley (13 dwarves?)
1. Dense Linear Algebra
2. Sparse Linear Algebra
3. Spectral Methods
4. N-Body Methods
5. Structured Grids
6. Unstructured Grids
7. MapReduce
8. Combinational Logic
9. Graph Traversal
10. Dynamic Programming
11. Backtrack and Branch-and-Bound
12. Graphical Models
13. Finite State Machines


SBLOCK (Brandvik). Tackle the structured grid / stencil operations dwarf. Define the kernel using a high-level Python abstraction. Generate kernels for a range of devices from the same definition: CPU, GPU, Cell. Use MPI to handle multiple devices.

SBLOCK kernel definition

kind = "stencil"
bpin = ["a"]
bpout = ["b"]
lookup = ((-1, 0, 0), (0, 0, 0), (1, 0, 0),
          (0, -1, 0), (0, 1, 0),
          (0, 0, -1), (0, 0, 1))
calc = {"lvalue": "b",
        "rvalue": """sf1*a[0][0][0] +
                     sfd6*(a[-1][0][0] + a[1][0][0] +
                           a[0][-1][0] + a[0][1][0] +
                           a[0][0][-1] + a[0][0][1])"""}

SBLOCK CPU implementation (C)

void smooth(float sf, float *a, float *b)
{
  for (k=0; k < nk; k++) {
    for (j=0; j < nj; j++) {
      for (i=0; i < ni; i++) {
        /* compute indices i000, im100, etc. */
        b[i000] = sf1*a[i000] +
                  sfd6*(a[im100] + a[ip100] +
                        a[i0m10] + a[i0p10] +
                        a[i00m1] + a[i00p1]);
      }
    }
  }
}

SBLOCK GPU implementation (CUDA)

__global__ void smooth_kernel(float sf, float *a_data, float *b_data)
{
  /* shared memory array */
  __shared__ float a[16][3][5];

  /* fetch first planes */
  a[i][0][k] = a_data[i0m10];
  a[i][1][k] = a_data[i000];
  a[i][2][k] = a_data[i0p10];
  __syncthreads();

  /* compute */
  b_data[i000] = sf1*a[i][1][k]
               + sfd6*(a[im1][1][k] + a[ip1][1][k] + a[i][0][k] + a[i][2][k]
                       + a[i][1][km1] + a[i][1][kp1]);

  /* load next "j" plane and repeat... */

Benefits of SBLOCK. So long as the task fits the dwarf: the programmer need not learn every device library; optimal device code is produced; code is future-proofed (so long as back-ends are available).

Part 6: Conclusions

Conclusions. Many science applications fit the SIMD model. GPUs are commodity SIMD chips. Good speedups (10x to 100x) can be achieved.

Conclusions. Many science applications fit the SIMD model. GPUs are commodity SIMD chips. Good speedups (10x to 100x) can be achieved. GPGPU is evolving (Owens, UC Davis): 1. Making it work at all (graphics APIs) 2. Doing it better (high-level APIs) 3. Doing it right (portable, modular building blocks)


Acknowledgements and info. Research student: Tobias Brandvik (CUED). Donation of GPU hardware: NVIDIA. http://dx.doi.org/10.1109/jproc.2008.917757 http://www.gpgpu.org http://www.oerc.ox.ac.uk/research/many-core-and-reconfigurable-supercomputing