CUDA: GPU Computing
K. Cooper, Department of Mathematics, Washington State University, 2014


Review of Parallel Paradigms - MIMD Computing
Multiple Instruction, Multiple Data:
- Several separate program streams, each executing a possibly different set of instructions.
- Each instruction stream operates on different data; a stream may have access to only a fragment of the data.

Review of Parallel Paradigms - SIMD Computing
Single Instruction, Multiple Data:
- Only one program stream, though it may launch multiple threads.
- The instruction stream may be applied simultaneously to many different data elements.

Review of Parallel Paradigms - Advantages of MIMD
Advantages:
- Instructions can be wildly different for individual streams.
- Instruction streams can be separated, even onto different nodes.
- Memory is distributed, so it is limited only by the number of nodes.
- Nodes can be unsophisticated, and therefore cheap.
Disadvantages:
- Communication.

Review of Parallel Paradigms - Advantages of SIMD
Disadvantages:
- All computations must happen on a single machine: limited memory and processors.
- Hardware must be very complex, and therefore expensive.
Advantages:
- All computations happen on a single machine: fast.

Review of Parallel Paradigms - SIMT
Historically, SIMD computing involved vastly complex CPUs with many ALUs and complicated switch architectures. This is, in some sense, a description of a modern video card: ever since SGI, video cards have had small specialized processors designed for the arithmetic involved in 3-d projections. SIMT means Single Instruction - Multiple Thread: we start one program, and that program can launch many threads to perform small tasks in parallel on a Graphics Processing Unit.

CUDA Computing - NVIDIA
The company that really drives this is NVIDIA:
- Makes video cards for 3-d games.
- Provides an interface (API) for programmers to send instructions to the card: CUDA, the Compute Unified Device Architecture.
- AMD/ATI is playing too, but uses a different API.

CUDA Computing - Model
1. Start one program.
2. Write function(s) to handle the core of the computation in parallel: the kernel.
3. Allocate memory in RAM and also on the video card.
4. Copy data from the CPU to the video card.
5. Run the kernel on the card.
6. Copy data back from the card to the CPU.

CUDA Computing - Example: Assignment 1
Here is some code to parallelize.

    h = 1.0/(double)n;
    for(i=0;i<n;i++){
        x = i*h;
        u[i] = (sin(x+h)-sin(x))/h;    /* forward difference of sin: approximates cos */
        err[i] = u[i]-cos(x);
    }

CUDA Computing - Example: Kernel
Write the kernel.

    __global__ void fwd_diff(double *u, double *err, int n)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if(i < n){                      /* guard: the grid may have more threads than n */
            double h = 1.0/(double)n;
            double x = i*h;
            u[i] = (sin(x+h)-sin(x))/h;
            err[i] = u[i]-cos(x);
        }
    }

CUDA Computing - Example: Allocate
Allocate memory.

    size_t size = n*sizeof(double);
    double *u = (double *)malloc(size);
    double *err = (double *)malloc(size);
    double *d_u;
    prob = cudaMalloc((void **)&d_u,size);
    double *d_err;
    prob = cudaMalloc((void **)&d_err,size);
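The slides assign each CUDA call's return status to prob without showing its declaration. A minimal sketch of that pattern, assuming prob is a cudaError_t (the check helper is our addition, not from the slides):

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    cudaError_t prob;

    /* Abort with a readable message if the last CUDA call failed. */
    void check(const char *what)
    {
        if(prob != cudaSuccess){
            fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(prob));
            exit(EXIT_FAILURE);
        }
    }

Usage: prob = cudaMalloc((void **)&d_u,size); check("cudaMalloc d_u");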

CUDA Computing - Example: Copy to Card
Copy data to the video card.

    prob = cudaMemcpy(d_u,u,size,cudaMemcpyHostToDevice);
    prob = cudaMemcpy(d_err,err,size,cudaMemcpyHostToDevice);

CUDA Computing - Example: Run the Kernel
Run the kernel. Note the peculiar syntax.

    fwd_diff<<<blocksPerGrid,threadsPerBlock>>>(d_u,d_err,n);

Note that we pass the pointers to the device memory. Blocks and threads have to do with the device architecture.
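The slides do not show how blocksPerGrid and threadsPerBlock are set; a common sketch (the value 256 is an illustrative choice) uses ceiling division so that there is at least one thread per array element, which is why the kernel needs its i < n guard:

    int threadsPerBlock = 256;                                      /* a multiple of the 32-thread warp size */
    int blocksPerGrid = (n + threadsPerBlock - 1)/threadsPerBlock;  /* ceiling of n/threadsPerBlock */
    fwd_diff<<<blocksPerGrid,threadsPerBlock>>>(d_u,d_err,n);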

CUDA Computing - Example: Copy Back to CPU
Copy the results back to the CPU.

    prob = cudaMemcpy(u,d_u,size,cudaMemcpyDeviceToHost);
    prob = cudaMemcpy(err,d_err,size,cudaMemcpyDeviceToHost);
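For reference, here is one way the preceding fragments might be assembled into a complete, compilable program. This is a sketch, not the original assignment code: n and the launch configuration are illustrative choices, and error checking is omitted for brevity.

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>
    #include <cuda_runtime.h>

    __global__ void fwd_diff(double *u, double *err, int n)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if(i < n){
            double h = 1.0/(double)n;
            double x = i*h;
            u[i] = (sin(x+h)-sin(x))/h;   /* forward-difference approximation of cos(x) */
            err[i] = u[i]-cos(x);
        }
    }

    int main(void)
    {
        int n = 1<<20;                    /* illustrative problem size */
        size_t size = n*sizeof(double);

        double *u = (double *)malloc(size);
        double *err = (double *)malloc(size);

        double *d_u, *d_err;
        cudaMalloc((void **)&d_u, size);
        cudaMalloc((void **)&d_err, size);

        int threadsPerBlock = 256;
        int blocksPerGrid = (n + threadsPerBlock - 1)/threadsPerBlock;
        fwd_diff<<<blocksPerGrid,threadsPerBlock>>>(d_u, d_err, n);

        /* cudaMemcpy waits for the kernel to finish before copying back. */
        cudaMemcpy(u, d_u, size, cudaMemcpyDeviceToHost);
        cudaMemcpy(err, d_err, size, cudaMemcpyDeviceToHost);

        printf("u[0] = %g, err[0] = %g\n", u[0], err[0]);

        cudaFree(d_u); cudaFree(d_err);
        free(u); free(err);
        return 0;
    }

Compiled with, e.g., nvcc fwd_diff.cu -o fwd_diff.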

CUDA Computing - Comparison

Grids, Blocks, and Threads - SPs and SMs
Each device is organized as a collection of streaming multiprocessors (SMs). Each SM is composed of a collection of streaming processors (SPs) and an L1 cache. Each SP has access to some number of registers.

Grids, Blocks, and Threads - Blocks
You must write your code so that it recognizes a block structure.
- Each block is loaded onto a single SM. The number of blocks does not have to match the number of SMs; still, some recognition of the number of SMs can help efficiency.
- Blocks comprise a collection of threads. Each thread is one little program fragment.
- Threads from a given block are executed in warps (32 threads). Thus, the block size should probably be a multiple of 32.
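The SM count and warp size of a card can be queried at run time; a small sketch using the standard cudaGetDeviceProperties call (device 0 assumed):

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);           /* properties of device 0 */
        printf("%s: %d SMs, warp size %d, %zu bytes of shared memory per block\n",
               prop.name, prop.multiProcessorCount, prop.warpSize,
               prop.sharedMemPerBlock);
        return 0;
    }

On the cards listed two slides below, this reports 1 SM for the GT 610 and 14 for the GTX Titan.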

Grids, Blocks, and Threads - Threads
Spawning threads...

    fwd_diff<<<blocksPerGrid,threadsPerBlock>>>(d_u,d_err,n);

Code for one thread - the kernel.

    __global__ void fwd_diff(double *u, double *err, int n)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if(i < n){
            double h = 1.0/(double)n;
            double x = i*h;
            u[i] = (sin(x+h)-sin(x))/h;
            err[i] = u[i]-cos(x);
        }
    }

Grids, Blocks, and Threads - Memory
The main memory for a device is called global memory; it is comparable in speed to L2 cache on a CPU. Each block has access to its L1 cache, called shared memory, which is much faster than global memory. Each thread has access to some number of registers, which are much faster than shared memory.
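To illustrate why shared memory matters (and the reduction issue noted in the conclusions), here is a standard block-level sum reduction. This is a textbook pattern, not from the slides; it assumes the block size is 256 and a power of two:

    __global__ void block_sum(const double *in, double *out, int n)
    {
        __shared__ double cache[256];                /* one slot per thread in the block */
        int i = blockIdx.x*blockDim.x + threadIdx.x;

        cache[threadIdx.x] = (i < n) ? in[i] : 0.0;  /* stage global memory in fast shared memory */
        __syncthreads();

        /* Tree reduction within the block, entirely in shared memory. */
        for(int stride = blockDim.x/2; stride > 0; stride /= 2){
            if(threadIdx.x < stride)
                cache[threadIdx.x] += cache[threadIdx.x + stride];
            __syncthreads();
        }

        if(threadIdx.x == 0)
            out[blockIdx.x] = cache[0];              /* one partial sum per block */
    }

Each block writes one partial sum; the partials are then combined on the host or by a second kernel launch.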

Grids, Blocks, and Threads - Cards
Here are a few numbers from our cards.

                      GT 610       GTX Titan
    CUDA Cores        48           2688
    SMs               1            14
    Total Memory      2 GB         6 GB
    Memory Bus        64-bit       384-bit
    Shared Mem/Block  49152 bytes  49152 bytes

Performance - Block Size

Performance - Comparison to MPI
Excerpt from an email message regarding a system with 12000 particles integrated over a period of a second or so...
- scalar program: 6 days
- scalar program with parallel fudge on ethernet cluster: 1 day
- parallel program on ethernet cluster: 44 minutes
- parallel program on infiniband cluster: 20 minutes
- CUDA: 19 minutes

Performance - Conclusions
GPU computing has limitations...
- Memory
- Scalability
- Reduction is still an issue
- Difficulty in programming
Still, when our little $2K machine can compete with a $1M cluster on a real problem... this is important.