CSE 160 Lecture 24. Graphical Processing Units



Announcements
Next week we meet in 1202 on Monday 3/11 only
On Weds 3/13 we have a 2-hour session at the usual class time at the Rady School: final exam review
SDSC tour @ 2:15pm (replaces class on 3/15)
Quiz return

Quiz
1. T(n) = α + β⁻¹n; the α term dominates for short messages, the β⁻¹ term for long messages
2. Had to show that the only outcome was s,t = ONE, TWO
3. Two possible answers:
   - Use the collective MPI_Reduce
   - Use a wildcard sender to process local sums in the order they arrive rather than in a fixed order (posting many Irecvs is not scalable and requires synchronization)

Today's lecture
GPU architecture
Programming with CUDA

Processor design trends
No longer possible to use a growing population of transistors to boost single-processor performance
Can no longer increase the clock speed
Instead, we replicate the cores: a simplified design reduces power and lets us pack more cores onto the chip
Simplified core: remove architectural enhancements like branch caches, constrain memory access and control flow, partially expose the memory hierarchy

Heterogeneous processing with Graphical Processing Units
Specialized many-core processor
Explicit data motion between host and device, and within the device
[Figure: host with memory and cores C0-C2, attached to a device with processors P0-P2]

Graphical Processing Units
Process long vectors
Thousands of highly specialized cores
Vendors: NVIDIA, AMD

NVIDIA GeForce GTX 280
Hierarchically organized clusters of streaming multiprocessors
240 cores @ 1.296 GHz; peak performance 933.12 GFLOPS
1 GB device memory (frame buffer)
512-bit memory interface @ 132 GB/s
Transistor counts: GTX 280: 1.4B; Intel Penryn (dual core): 410M (110 mm²); Nehalem: 731M (263 mm²)

Streaming processor cluster
GTX-280 GPU: 10 clusters @ 3 streaming multiprocessors (vector cores)
Each vector core has 8 scalar cores: fused multiply-adder + multiplier (32 bits), which truncate the intermediate result
Shared memory (16 KB) and registers (16K × 32 bits = 64 KB)
1 64-bit fused multiply-adder + 2 super function units (2 fused multiply-adders)
1 FMA + 1 multiply per cycle = 3 flops/cycle/core × 240 cores = 720 flops/cycle; @ 1.296 GHz: 933 GFLOPS
[SM diagram: instruction L1, data L1, instruction fetch/dispatch, shared memory, scalar cores, SFUs]
David Kirk/NVIDIA and Wen-mei Hwu/UIUC

Memory Hierarchy

Name      Latency (cycles)   Cached
Global    DRAM, 100s         No
Local     DRAM, 100s         No
Constant  1s-10s-100s        Yes
Texture   1s-10s-100s        Yes
Shared    1                  --
Register  1                  --

[Figure: device grid of blocks, each thread with registers and local memory, each block with shared memory; host accesses global, constant, and texture memory]
Courtesy David Kirk/NVIDIA and Wen-mei Hwu/UIUC
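As an illustrative sketch (not from the slides) of where these memory spaces appear in code: a kernel's automatic variables live in registers (or local memory), __shared__ arrays live in per-block shared memory, and __constant__ data is read-only and visible to all threads. Names here are hypothetical.

__constant__ float coeff[4];    // constant memory: set from the host with cudaMemcpyToSymbol

__global__ void spaces(const float *g_in, float *g_out, int n)   // g_in, g_out point to global memory
{
    __shared__ float tile[128];                      // shared memory: one copy per thread block
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // i lives in a register
    if (i < n) {                                     // assumes blockDim.x == 128
        tile[threadIdx.x] = g_in[i];
        __syncthreads();                             // all threads in the block see the tile
        g_out[i] = coeff[0] * tile[threadIdx.x];
    }
}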

CUDA
Programming environment with extensions to C
Under control of the host, invoke sequences of multithreaded kernels on the device (GPU)
Many lightweight virtualized threads
CUDA: programming environment + C extensions
KernelA<<<4,8>>>()  KernelB<<<4,8>>>()  KernelC<<<4,8>>>()
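A minimal sketch (not from the slides) of such a host-driven sequence of kernel launches; the kernel names and the 4-block, 8-thread configuration simply mirror the KernelA/KernelB/KernelC picture above.

#include <cuda.h>

__global__ void KernelA(float *d, int n)
{ int i = blockIdx.x*blockDim.x + threadIdx.x; if (i < n) d[i] += 1.f; }
__global__ void KernelB(float *d, int n)
{ int i = blockIdx.x*blockDim.x + threadIdx.x; if (i < n) d[i] *= 2.f; }
__global__ void KernelC(float *d, int n)
{ int i = blockIdx.x*blockDim.x + threadIdx.x; if (i < n) d[i] -= 3.f; }

int main()
{
    const int N = 32;
    float *d_data;
    cudaMalloc((void **) &d_data, N * sizeof(float));
    cudaMemset(d_data, 0, N * sizeof(float));

    // Launches are issued by the host and execute on the device in order
    KernelA<<<4, 8>>>(d_data, N);
    KernelB<<<4, 8>>>(d_data, N);
    KernelC<<<4, 8>>>(d_data, N);
    cudaDeviceSynchronize();     // host waits for the device to finish

    cudaFree(d_data);
    return 0;
}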

Thread execution model
Kernel call spawns virtualized, hierarchically organized threads
Hardware handles dispatching with zero overhead
Compiler re-arranges loads to hide latencies
Global synchronization: kernel invocation

Thread organization
Threads all execute the same instruction (SIMT)
Threads are organized into blocks, blocks into grids
Each thread is uniquely specified by its block & thread ID
Programmer determines the mapping of virtual thread IDs to global memory locations: Z_n → Z_2 or Z_3 (θ(Π_i), Π_i ∈ Π)
[Figure: SM with MT issue unit, threads t0..tm, and shared memory, above global memory]
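A small illustrative sketch of this ID-to-memory mapping (kernel names are mine, not from the slides): each thread combines its block and thread IDs into a global index, in 1D or, for a row-major 2D array, in two dimensions.

// 1D: one array element per thread
__global__ void map1D(float *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // unique global index
    if (i < n) a[i] = (float) i;
}

// 2D: one element of a width x height row-major array per thread
__global__ void map2D(float *a, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row
    if (x < width && y < height) a[y * width + x] = (float)(y * width + x);
}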

Hierarchical thread organization
Grid, block: specify the number and geometry of threads in a block, and similarly for blocks
Blocks
  Unit of workload assignment
  Subdivide a global index domain
  Threads in a block cooperate and synchronize, with access to fast on-chip shared memory
  Threads in different blocks communicate only through slow global memory
KernelA<<<dim3(2,3), dim3(3,5)>>>() launches a 2×3 grid of blocks, each with 3×5 threads (see the sketch below)
[Figure: host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; each grid is an array of blocks, each block an array of threads]
David Kirk/NVIDIA & Wen-mei Hwu/UIUC
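As a hedged illustration of the launch geometry in the figure (a 2×3 grid of blocks, each 3×5 threads), the host describes both shapes with dim3; the empty kernel is only a stand-in.

__global__ void KernelA()
{
    // body omitted; each thread sees its own blockIdx and threadIdx as in the figure
}

int main()
{
    dim3 grid(2, 3);    // 2 x 3 = 6 blocks in the grid
    dim3 block(3, 5);   // 3 x 5 = 15 threads per block
    KernelA<<<grid, block>>>();
    cudaDeviceSynchronize();
    return 0;
}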

Thread execution
Blocks: the unit of workload assignment
Each thread has its own set of registers
All threads in a block have access to a fast on-chip shared memory
Synchronization only among the threads in a block
Threads in different blocks communicate via slow global memory
SIMT parallelism: all threads in a warp execute the same instruction
  When a warp diverges, all branches are followed and instructions are disabled for the threads not on the taken path: divergence causes serialization
[Figure: SM with MT issue unit, threads t0..tm, and shared memory; grids of blocks as on the previous slide; KernelA<<<dim3(2,3), dim3(3,5)>>>()]
David Kirk/NVIDIA & Wen-mei Hwu/UIUC
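An illustrative pair of kernels (my own naming, not from the slides) showing the divergence issue: in the first, even and odd lanes of the same warp take different branches, so both paths execute with part of the warp disabled each time; in the second, the condition depends only on the warp index (32 threads per warp), so every warp follows a single path.

__global__ void divergent(float *a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)      // neighbors in a warp disagree: both paths run serially
        a[i] = 0.f;
    else
        a[i] = 1.f;
}

__global__ void uniform(float *a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x / 32) % 2 == 0)   // whole warps agree: no divergence
        a[i] = 0.f;
    else
        a[i] = 1.f;
}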

Parallel Speedup
How much did our GPU implementation improve over the traditional processor?
Speedup S = (running time of the fastest program on conventional processors) / (running time of the accelerated program)
Baseline: a multithreaded program

How to maximize performance?
Avoid algorithms that present intrinsic barriers to utilizing the hardware
Hide the latency of host ↔ device memory transfers
Reduce global memory accesses: turn global memory accesses into fast on-chip accesses; use coalesced memory transfers (see the sketch below)
Avoid costly branches, or render them harmless
Minimize serial sections
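A hedged sketch of the coalescing point (illustrative kernels, not from the slides): when consecutive threads of a warp touch consecutive words of global memory, the accesses combine into a few wide transactions, while a strided pattern scatters them across many transactions.

__global__ void copyCoalesced(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];          // thread k reads word k: one warp touches 32 consecutive floats
}

__global__ void copyStrided(float *out, const float *in, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n)
        out[i] = in[i * stride]; // consecutive threads read words 'stride' apart: many transactions
}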

Consequences
If we don't use the parallelism, we lose it
  Amdahl's law: serial sections
  Von Neumann bottleneck: data transfer costs
  Workload imbalances
Rethink the problem-solving technique
  Hide the latency of host ↔ device memory transfers
  Turn global memory accesses into fast on-chip accesses
  Coalesced memory transfers
  Avoid costly branches, or render them harmless
Simplified processor design, but more user control over the hardware resources

Today's lecture
GPU architecture
Programming with CUDA

CUDA language extensions
Type qualifiers to declare device kernel functions: __global__ void matrixMul( ... )
Kernel launch syntax: matrixMul<<< grid, threads >>>( ... )
Keywords: blockIdx, threadIdx, blockDim, gridDim
Runtime, e.g. storage allocation: cudaMalloc, cudaFree, cudaMemcpy

Coding example: Increment Array

Serial code:

void incrementArrayOnHost(float *a, int N)
{
    int i;
    for (i = 0; i < N; i++)
        a[i] = a[i] + 1.f;
}

CUDA kernel and launch:

#include <cuda.h>

__global__ void incrementOnDevice(float *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        a[idx] = a[idx] + 1.f;
}

incrementOnDevice<<< nBlocks, blockSize >>>(a_d, N);

Rob Farber, Dr Dobb's Journal

Managing memory

float *a_h, *b_h;                 // pointers to host memory
float *a_d;                       // pointer to device memory

size_t size = N * sizeof(float);
a_h = (float *) malloc(size);     // allocate host arrays
b_h = (float *) malloc(size);
cudaMalloc((void **) &a_d, size); // allocate device array

for (int i = 0; i < N; i++)
    a_h[i] = (float) i;           // init host data

cudaMemcpy(a_d, a_h, sizeof(float)*N, cudaMemcpyHostToDevice);

Computing and returning result

int bSize = 4;
int nBlocks = N/bSize + (N % bSize == 0 ? 0 : 1);

incrementArrayOnHost(a_h, N);     // reference result on the host
incrementOnDevice<<< nBlocks, bSize >>>(a_d, N);

// Retrieve result from device and store in b_h
cudaMemcpy(b_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);

// check results
for (int i = 0; i < N; i++)
    assert(a_h[i] == b_h[i]);

// cleanup
free(a_h);
free(b_h);
cudaFree(a_d);

Experiments - increment benchmark
Total time: timing taken from the host, includes copying data to the device
Device only: time taken on the device only
N = 8388480, block size = 128, times in milliseconds, Nehalem host

Reps                        10     100    1000    10000
Device time                 8.5    83     830     8300
Kernel launch + data xfer   29     100    850     8300
Host                        16     77     770     7700

a[i] = 1 + sin(a[i]): 103 (Device); sine function on the Host: 7000 (23.6 sec)
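The slide separates "device only" time from time that includes kernel launch and data transfer. As a hedged sketch (this reuses the incrementOnDevice kernel and the a_d, N, nBlocks, bSize names from the earlier example, and is not necessarily how these measurements were taken), the device-only portion could be timed with CUDA events:

float timeDeviceOnly(float *a_d, int N, int nBlocks, int bSize, int reps)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int r = 0; r < reps; r++)
        incrementOnDevice<<< nBlocks, bSize >>>(a_d, N);  // repeated kernel launches only
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);              // wait until all reps have finished

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}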

Summary - programming issues
Branches serialize execution within a warp
Tradeoff: more blocks with fewer threads, or more threads with fewer blocks
Locality: want small blocks of data (and hence more plentiful warps) that fit into fast memory
Register consumption
Scheduling: hide latency
Shared memory and registers do not persist across kernel invocations
Next time: using shared memory
