Overview: Graphics Processing Units

1 Overview: Graphics Processing Units
- advent of GPUs
- GPU architecture
- the NVIDIA Fermi processor
- the CUDA programming model
- simple example, threads organization, memory model
- case study: matrix multiply
- memories, thread synchronization, scheduling
- case study: reductions
- performance considerations: bandwidth, scheduling, resource conflicts, instruction mix
- host-device data transfer: multiple GPUs, NVLink, Unified Memory, APUs
- the OpenCL programming model
- directive-based programming models
- refs: CUDA Toolkit Documentation; An Even Easier Introduction to CUDA (tutorial); NCI NF GPU page; Programming Massively Parallel Processors, Kirk & Hwu, Morgan Kaufmann, 2010; CUDA by Example, by Sanders and Kandrot; OpenCL web page; OpenCL in Action, by Matthew Scarpino
COMP4300/8300 L21,22: Graphics Processing Units

2 Advent of General-purpose Graphics Processing Units
- many applications have massive amounts of mostly independent calculations
  - e.g. ray tracing, image rendering, matrix computations, molecular simulations, HDTV
  - can be largely expressed in terms of SIMD operations
  - implementable with minimal control logic & caches, simple instruction sets
- design point: maximize the number of ALUs & FPUs and the memory bandwidth to take advantage of Moore's Law
- put this on a co-processor (GPU); have a normal CPU to co-ordinate, run the operating system, launch applications, etc.
- architecture/infrastructure development requires a massive economic base (the gaming industry!)
  - pre 2006: only specialized graphics operations (integer & float data)
  - 2006: General Purpose (GPGPU): general computations, but only through a graphics library (e.g. OpenGL)
  - 2009: programmable for general (numeric) calculations (e.g. CUDA, OpenCL)
- some applications have large speedups over a single CPU core
[figure: Moore's Law]

3 Graphics Processor Unit Systems
- GPU systems are a co-processor device on a CPU-based system ([O'H. & Bryant, fig 1.4])
- separate memory space (DRAMs) for CPU (host) and GPU (device)
- must allocate space on the GPU and copy data from CPU memory to GPU memory (and vice versa) via the PCIe bus
- also need a way to copy the GPU executable code and start it (kernel launch)
- issues? Why not use the same memory space?

4 Graphics Processor Unit Architecture
- GPU chip: an array of streaming multiprocessors (SMs) sharing an L2 cache
- each SM has 8 to 32 streaming processors (SPs)
  - only SPs (= cores) within an SM can (easily) synchronize and share data
- identical threads are organized into fixed-size blocks, each allocated to an SM
- blocks in turn are divided into warps
  - at any timestep, all SPs execute an instruction from a warp (SIMT mode)
  - latencies are hidden by scheduling from many warps
[figure: comparison with UltraSPARC T2 (courtesy Real World Tech)]
[figure: Tesla S2050 co-processor]
[figure: Tesla S2050 architecture (courtesy NVIDIA)]

5 The Fermi Graphics Processor Chip
- GF110 model: 1.15 GHz; 900W; 3D grid & thread blocks; warp size: 32
- max resident per SM: blocks 8, warps 48 (= 1536 threads / 32 threads per warp), threads 1536 (from NCI NF page)

6 GPU vs CPU Floating Point Speed and Memory Bandwidth

7 The Compute Unified Device Architecture (CUDA) Programming Model
- device refers to a co-processor with its own DRAM that can run many threads in parallel
- host performs serial execution, transfers data to/from the device (via DMA), and sends (highly parallel) kernels to the device
- the kernel's threads are organized into a grid of blocks
  - each block is sent to an SM
- a CUDA program is a C/C++ program with device calls & kernels (each with many threads) embedded into it
- GPU threads are very lightweight (some overheads in invoking a kernel, and dispatching each block)
  - threads are identical but have thread (& block) ids
- the CUDA compiler (e.g. nvcc) produces a normal executable with device code embedded into it
  - has the CUDA runtime (cudart) and core (cuda) libraries linked into it
[figure courtesy NCSU]

8 CUDA Program: Simple Example
reverse an array (reversearray.cu):

    __global__ void reverseArray(int *a_d, int N) {
        int idx = threadIdx.x;
        int v = a_d[N-idx-1];
        a_d[N-idx-1] = a_d[idx];
        a_d[idx] = v;
    }
    #define N (1<<10)   // N/2 = 512 threads must fit within a single block
    int main() {
        // host code may not dereference a_d!
        int a[N], *a_d, a_size = N * sizeof(int);
        ...
        cudaMalloc((void **) &a_d, a_size);
        cudaMemcpy(a_d, a, a_size, cudaMemcpyHostToDevice);
        reverseArray<<<1, N/2>>>(a_d, N);
        cudaDeviceSynchronize();   // wait till threads finish (cudaThreadSynchronize() is deprecated)
        cudaMemcpy(a, a_d, a_size, cudaMemcpyDeviceToHost);
        cudaFree(a_d);
        ...
    }

cf. OpenMP on a normal multicore: style; practicality?

    #pragma omp parallel num_threads(N/2) default(shared)
    {
        int idx = omp_get_thread_num();
        int v = a[N-idx-1];
        a[N-idx-1] = a[idx];
        a[idx] = v;
    }
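A single-block launch limits the array to twice the maximum block size. The following is a minimal sketch of a multi-block variant, assuming a block size of 256 threads; the names reverseArrayGrid, reverseOnDevice and check are hypothetical helpers introduced here, not part of the lecture code.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Each thread swaps one element of the lower half with its mirror image.
__global__ void reverseArrayGrid(int *a_d, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global thread id
    if (idx < n / 2) {                                // guard the last block
        int v = a_d[n - idx - 1];
        a_d[n - idx - 1] = a_d[idx];
        a_d[idx] = v;
    }
}

// Hypothetical helper: abort with a message if a CUDA call failed.
static void check(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// Reverses a host vector in place by doing the swaps on the device.
void reverseOnDevice(std::vector<int> &a) {
    int n = (int)a.size();
    if (n < 2) return;                                // nothing to swap
    size_t bytes = n * sizeof(int);
    int *a_d = nullptr;
    check(cudaMalloc((void **)&a_d, bytes), "cudaMalloc");
    check(cudaMemcpy(a_d, a.data(), bytes, cudaMemcpyHostToDevice), "copy in");
    int threads = 256;                                // assumed block size
    int blocks = (n / 2 + threads - 1) / threads;     // round up
    reverseArrayGrid<<<blocks, threads>>>(a_d, n);
    check(cudaGetLastError(), "kernel launch");
    check(cudaMemcpy(a.data(), a_d, bytes, cudaMemcpyDeviceToHost), "copy out");
    check(cudaFree(a_d), "cudaFree");
}
```

Calling reverseOnDevice(v) on a std::vector<int> reverses it in place. The final cudaMemcpy is blocking, so no separate synchronization call is needed in this sketch.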

9 CUDA Thread Organization and Memory Model
[figures: a 2×1 grid with 2×1 blocks; the memory model (left), which reflects that of the GPU; a 2×2 grid with blocks (courtesy Real World Tech.; courtesy NCSC)]

10 Case Study: Matrix Multiply
- perform C += AB, where C is N×N, A is N×K, B is K×N
- column-major storage: C_{i,j} is at C[i + j*N]
- 1st attempt: each thread computes one element of C, C_{i,j}
- invocation with W×W thread blocks (assume W divides N)
  - why better than using an N×N thread block? (2 reasons, both important!)
- for thread (t_x, t_y) of block (b_x, b_y), i = b_y*W + t_y and j = b_x*W + t_x
[figure courtesy xfig]

11 CUDA Matrix Multiply: Implementation
kernel:

    __global__ void matMult(int N, int K, double *A_d, double *B_d, double *C_d) {
        int i = blockIdx.y * blockDim.y + threadIdx.y;
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        double cij = C_d[i + j*N];
        for (int k = 0; k < K; k++)
            cij += A_d[i + k*N] * B_d[k + j*K];
        C_d[i + j*N] = cij;
    }

main program: needs to allocate device versions of A, B & C (A_d, B_d and C_d) and cudaMemcpy() the host versions into them

invocation with W×W thread blocks (assume W divides N):

    dim3 dimG(N/W, N/W);
    dim3 dimB(W, W);   // in kernel, blockDim.x == W
    matMult<<<dimG, dimB>>>(N, K, A_d, B_d, C_d);

- what if N % W > 0? Add the guard if (i < N && j < N) to the kernel and declare dim3 dimG((N+W-1)/W, (N+W-1)/W);
- note: SIMD nature of SPs: cycles for both branches of an if are consumed
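The guard and the rounded-up grid can be put together as follows. This is a sketch only, assuming W = 16 and column-major std::vector inputs; matMultGuarded and matMultOnDevice are hypothetical names, and error checking is left out to keep it short.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define W 16  // assumed block width: W*W = 256 threads per block

// C += A*B, column-major storage. Threads that fall outside C do nothing,
// so N need not be a multiple of W.
__global__ void matMultGuarded(int N, int K, const double *A_d,
                               const double *B_d, double *C_d) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // row of C
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // column of C
    if (i < N && j < N) {
        double cij = C_d[i + j * N];
        for (int k = 0; k < K; k++)
            cij += A_d[i + k * N] * B_d[k + j * K];
        C_d[i + j * N] = cij;
    }
}

// Hypothetical host driver. A is N x K, B is K x N, C is N x N (updated in place).
void matMultOnDevice(int N, int K, const std::vector<double> &A,
                     const std::vector<double> &B, std::vector<double> &C) {
    size_t aBytes = A.size() * sizeof(double);
    size_t bBytes = B.size() * sizeof(double);
    size_t cBytes = C.size() * sizeof(double);
    double *A_d, *B_d, *C_d;
    cudaMalloc(&A_d, aBytes);
    cudaMalloc(&B_d, bBytes);
    cudaMalloc(&C_d, cBytes);
    cudaMemcpy(A_d, A.data(), aBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(B_d, B.data(), bBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(C_d, C.data(), cBytes, cudaMemcpyHostToDevice);
    dim3 dimG((N + W - 1) / W, (N + W - 1) / W);    // round up to cover all of C
    dim3 dimB(W, W);
    matMultGuarded<<<dimG, dimB>>>(N, K, A_d, B_d, C_d);
    cudaMemcpy(C.data(), C_d, cBytes, cudaMemcpyDeviceToHost);  // blocks until the kernel is done
    cudaFree(A_d);
    cudaFree(B_d);
    cudaFree(C_d);
}
```

With N = 3 the grid is a single 16×16 block in which only 9 threads pass the guard, which illustrates the cost of the idle lanes noted above.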

12 CUDA Memories and Thread Synchronization
- GPUs can potentially suffer even more from the memory wall
  - DRAM access may still take 100s of cycles
  - bandwidth is limited for load/store intensive kernels
- the shared memory is on-chip (hence very fast)
- the __shared__ type modifier may be used to denote a (fixed) array allocated to shared memory
- threads within a block can synchronize via the (efficient: why?) __syncthreads() intrinsic
- (SM-level) atomic instructions can enforce data consistency within a block
- note: no way to synchronize between blocks, or safely ensure data consistency across blocks
  - can only be done across separate kernel invocations
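As a small illustration of a __shared__ array combined with __syncthreads(), the sketch below has each block reverse its own chunk of an array through shared memory. It assumes a block size of 128 and an array length that is a multiple of it; reverseChunks and reverseChunksOnDevice are hypothetical names.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define BLOCK 128  // assumed block size; also the size of the shared buffer

// Each block reverses its own BLOCK-element chunk of a_d in place.
__global__ void reverseChunks(int *a_d) {
    __shared__ int buf[BLOCK];          // one copy per block, on-chip
    int base = blockIdx.x * BLOCK;
    int tx = threadIdx.x;
    buf[tx] = a_d[base + tx];
    __syncthreads();                    // every load must land before any thread reads buf
    a_d[base + tx] = buf[BLOCK - 1 - tx];
}

// Hypothetical host wrapper; a.size() must be a multiple of BLOCK.
void reverseChunksOnDevice(std::vector<int> &a) {
    size_t bytes = a.size() * sizeof(int);
    if (bytes == 0) return;
    int *a_d;
    cudaMalloc(&a_d, bytes);
    cudaMemcpy(a_d, a.data(), bytes, cudaMemcpyHostToDevice);
    reverseChunks<<<(int)(a.size() / BLOCK), BLOCK>>>(a_d);
    cudaMemcpy(a.data(), a_d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(a_d);
}
```

Without the barrier, a thread could read buf[BLOCK - 1 - tx] before the thread responsible for that slot has written it. No block ever reads another block's chunk, so no cross-block synchronization is needed.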

13 Matrix Multiply Using Shared Memory
- threads (t_x,0) ... (t_x,W-1) all access B_{k, b_x*W+t_x}; threads (0,t_y) ... (W-1,t_y) all access A_{b_y*W+t_y, k}
- high ratio of load to FP instructions: harder to hide L1 cache latencies; strains memory bandwidth
- can improve the kernel by utilizing SM shared memory:

    __global__ void matMult_s(int N, int K, double *A_d, double *B_d, double *C_d) {
        __shared__ double A_s[W][W], B_s[W][W];   // W is a compile-time constant
        int ty = threadIdx.y, tx = threadIdx.x;
        int i = blockIdx.y*W + ty, j = blockIdx.x*W + tx;
        double cij = C_d[i + j*N];
        for (int k = 0; k < K; k += W) {
            A_s[ty][tx] = A_d[i + (k+tx)*N];
            B_s[ty][tx] = B_d[(k+ty) + j*K];
            __syncthreads();
            for (int w = 0; w < W; w++)
                cij += A_s[ty][w] * B_s[w][tx];
            __syncthreads();   // can this be avoided?
        }
        C_d[i + j*N] = cij;
    }

14 GPU Scheduling - Warps
- the GigaThread scheduler assigns the (independently executable) thread blocks to each SM
- each block is divided into groups (of 32) called warps
  - grouping occurs in linear order by t_x + b_x*t_y + b_x*b_y*t_z, where b_x and b_y here denote the block dimensions (the figure's example uses warp size 4)
- the warp scheduler determines which warps are ready to run
  - with 32-thread warps, suitable block sizes start at 4×8
- SIMT: each SP executes the next instruction SIMD-style (note: requires only a single instruction fetch!)
- thus, a kernel with enough blocks can scale across a GPU with any number of cores
[figures courtesy NVIDIA]
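The linear ordering can be checked directly on a device. The sketch below, with the hypothetical names recordWarpIds and warpIdsForBlock, launches one block and records which warp each thread falls into, using the built-in warpSize variable.

```cuda
#include <vector>
#include <cuda_runtime.h>

// For one block: store each thread's warp number at its linear index.
__global__ void recordWarpIds(int *warp_d) {
    int lin = threadIdx.x
            + blockDim.x * threadIdx.y
            + blockDim.x * blockDim.y * threadIdx.z;  // t_x + b_x*t_y + b_x*b_y*t_z
    warp_d[lin] = lin / warpSize;
}

// Hypothetical host wrapper: returns the warp id of every thread in a
// bx x by x bz block, indexed by linear thread id.
std::vector<int> warpIdsForBlock(int bx, int by, int bz) {
    int n = bx * by * bz;
    std::vector<int> out(n);
    int *warp_d;
    cudaMalloc(&warp_d, n * sizeof(int));
    recordWarpIds<<<1, dim3(bx, by, bz)>>>(warp_d);
    cudaMemcpy(out.data(), warp_d, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(warp_d);
    return out;
}
```

For a 16×4 block, threads with t_y of 0 or 1 share warp 0 and those with t_y of 2 or 3 share warp 1, so a condition that depends only on t_y / 2 would not diverge within a warp.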

15 Reductions and Thread Divergence
threads within a single 1D block summing A[0..N-1]:

    __global__ void sumV(int N, double *A, double *s) {
        extern __shared__ double psum[];   // blockDim.x doubles, sized at kernel launch
        int bx = blockDim.x;
        int tx = threadIdx.x, x;
        psum[tx] = ...
        for (x = bx/2; x > 0; x /= 2) {
            __syncthreads();
            if (tx < x)
                psum[tx] += psum[tx+x];
        }
        if (tx == 0)
            *s = psum[tx];
    }

- predicated execution: threads in a warp where the condition is false execute a no-op
- if-else statements thus cause thread divergence (worse when nested)
- divergence is minimized here: it occurs only when x < 32 (on one warp)
- cf. alternative algorithm:

    for (x = 1; x < bx; x *= 2) {
        __syncthreads();
        if (tx % (2*x) == 0)
            psum[tx] += psum[tx+x];
    }
[figure courtesy NVIDIA]
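A complete version of the halving reduction might look like the sketch below. It assumes a power-of-two block size of 256 and extends the single-block idea to several blocks by writing one partial sum per block; since blocks cannot synchronize with each other, the partial sums are combined on the host after the kernel has finished. The names blockSum and sumOnDevice are hypothetical.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Each block sums its slice of A into partial_d[blockIdx.x].
__global__ void blockSum(int N, const double *A, double *partial_d) {
    extern __shared__ double psum[];            // blockDim.x doubles, sized at launch
    int tx = threadIdx.x;
    int g = blockIdx.x * blockDim.x + tx;       // global element index
    psum[tx] = (g < N) ? A[g] : 0.0;            // pad the last block with zeros
    for (int x = blockDim.x / 2; x > 0; x /= 2) {
        __syncthreads();                        // previous level must be complete
        if (tx < x)
            psum[tx] += psum[tx + x];
    }
    if (tx == 0)
        partial_d[blockIdx.x] = psum[0];
}

// Hypothetical host wrapper: returns the sum of all elements of a.
double sumOnDevice(const std::vector<double> &a) {
    const int threads = 256;                    // must be a power of two for the halving loop
    int N = (int)a.size();
    if (N == 0) return 0.0;
    int blocks = (N + threads - 1) / threads;
    double *A_d, *partial_d;
    cudaMalloc(&A_d, N * sizeof(double));
    cudaMalloc(&partial_d, blocks * sizeof(double));
    cudaMemcpy(A_d, a.data(), N * sizeof(double), cudaMemcpyHostToDevice);
    blockSum<<<blocks, threads, threads * sizeof(double)>>>(N, A_d, partial_d);
    std::vector<double> partial(blocks);
    cudaMemcpy(partial.data(), partial_d, blocks * sizeof(double),
               cudaMemcpyDeviceToHost);
    cudaFree(A_d);
    cudaFree(partial_d);
    double s = 0.0;
    for (double p : partial) s += p;            // combine per-block results on the host
    return s;
}
```

The third launch parameter supplies the size of the extern __shared__ array. A second kernel launch over partial_d would be an alternative to the host loop when the number of blocks is large.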

16 Global Memory Bandwidth Issues
- in the reduction example, all threads in a warp contiguously access the (shared) array psum
- this is very important when you have global memory accesses:
  - the memory subsystem can coalesce these into a single access
  - allows DRAM banks to deliver peak bandwidth (burst mode)
  - reason: 2D organization of DRAM chips (same row address) (Lect 3, p14)
- matMult example: threads within a warp access A contiguously, but not B
  - the effect of accesses to B in this case is mitigated by the use of shared memory in the multiply
- note that this effect is opposite to normal cores, where contiguous access within a thread is most desirable (maximizes spatial locality)
- worst case scenario: memory strides in (large) powers of 2 cause memory bank conflicts
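One way to observe the effect of coalescing is to time the same copy with different strides between the addresses touched by neighbouring threads. The sketch below uses CUDA events for timing; stridedCopy and timeStridedCopy are hypothetical names, and no particular timings are claimed since they depend on the device.

```cuda
#include <cuda_runtime.h>

// Thread t copies element (t * stride) mod n. With stride 1, the threads of a
// warp touch consecutive addresses; larger strides spread them apart.
__global__ void stridedCopy(const float *in, float *out, int n, int stride) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < n) {
        int idx = (int)(((long long)t * stride) % n);
        out[idx] = in[idx];
    }
}

// Hypothetical helper: elapsed milliseconds for one pass over n floats.
float timeStridedCopy(int n, int stride) {
    float *in_d, *out_d;
    cudaMalloc(&in_d, n * sizeof(float));
    cudaMalloc(&out_d, n * sizeof(float));
    cudaMemset(in_d, 0, n * sizeof(float));
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    cudaEventRecord(start);
    stridedCopy<<<blocks, threads>>>(in_d, out_d, n, stride);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);                 // wait for the kernel and the stop event
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(in_d);
    cudaFree(out_d);
    return ms;
}
```

Comparing timeStridedCopy(1 << 24, 1) with timeStridedCopy(1 << 24, 33) would be a reasonable experiment: with n a power of two and an odd stride, every element is still copied exactly once, so only the access pattern differs. A first call may include one-off start-up costs, so repeated measurements are advisable.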

17 SM Registers and Warp Scheduling
- the SM maintains block ids of scheduled blocks, and thread ids (and block sizes) of scheduled threads
- the SM's (32K word) register file is shared between all of these
  - the block and thread ids are used to index the file for the registers allocated to a particular thread
- warps whose next instruction has its operands ready for consumption may be selected
  - round-robin is used if there are several ready
  - thus, registers need to be scoreboarded
  - can make use of this to (software) prefetch data and better hide latencies (sh. mem. matMult)
- example: if there are 4 instructions between a load & its use, on the G80, with 4 clock cycles needed to process an instruction, we need 14 active warps to tolerate a 200-cycle memory latency
[figure courtesy NVIDIA]
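One way to arrive at the figure of 14 is sketched below. It assumes that each warp contributes 4 instructions × 4 cycles of useful work before it stalls on its load, and that the waiting warp is counted in addition to the warps that cover its latency; this reading is an interpretation, not something the slide spells out.

```latex
\[
\text{work per warp} = 4 \text{ instructions} \times 4 \text{ cycles} = 16 \text{ cycles},
\qquad
\left\lceil \frac{200}{16} \right\rceil = 13 \text{ covering warps},
\qquad
13 + 1 = 14 \text{ active warps}.
\]
```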

18 Performance Considerations: Shared SM Resources
- on Fermi GPUs, may have resident on an SM: 8 blocks, 48 warps and 1536 threads; 128 KB register file, 64 KB shared memory / L1 cache
- to fully utilize block & thread slots, need at least 192 threads per block (1536 / 8)
  - assuming 4-byte operands, can have at most 16 registers per thread
- optimizations on a kernel resulting in more registers may result in fewer blocks being resident...
- resource contention can cause a dramatic loss of performance
- the CUDA occupancy calculator can help evaluate this
[figure courtesy NVIDIA]
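Besides the spreadsheet calculator, the CUDA runtime (version 6.5 and later) can report how many blocks of a given kernel fit on one SM. The sketch below queries it for a trivial kernel on device 0; scaleKernel and occupancyFor are hypothetical names, and the result depends on the device and compiler, so no value is claimed.

```cuda
#include <cuda_runtime.h>

// A trivial kernel whose resource usage is being queried.
__global__ void scaleKernel(float *x, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] *= s;
}

// Fraction of an SM's thread slots that scaleKernel can fill when launched
// with blockSize threads per block and no dynamic shared memory (device 0).
float occupancyFor(int blockSize) {
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, scaleKernel,
                                                  blockSize, 0);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    return (float)(blocksPerSM * blockSize) / prop.maxThreadsPerMultiProcessor;
}
```

Evaluating occupancyFor for several block sizes (say 64, 192, 256, 1024) shows how the block-slot and thread-slot limits interact; a kernel that uses more registers or shared memory would lower the reported number of resident blocks.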

19 Performance Considerations: Instruction Mix
- goal: keep the SPs' FPUs fully occupied doing useful operations
  - every other kind of instruction (loads, address calculations, branches) hinders this!
- matrix multiply revisited:
  - strategy 1: unroll the k loop (assumes K is even):

        for (int k = 0; k < K; k += 2)
            cij += A_d[i + k*N] * B_d[k + j*K] + A_d[i + (k+1)*N] * B_d[k+1 + j*K];

    halves loop index increments & branches
  - strategy 2: each thread computes a 2×2 tile of C instead of a single element
    - reduces load instructions; reduces branches by 4× but may require 4× the registers!
    - also increases thread granularity: may help if K is not large
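Strategy 2 could be sketched as follows, assuming N is even, column-major storage as in the earlier slides, and a block width of 16. Each iteration of the k loop issues 4 loads for 4 multiply-adds, compared with 2 loads per multiply-add in the one-element kernel. The names matMultTile2 and matMultTile2OnDevice are hypothetical.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define W 16  // assumed block width

// Each thread updates the 2x2 tile of C whose top-left corner is (i, j).
// N must be even in this sketch.
__global__ void matMultTile2(int N, int K, const double *A_d,
                             const double *B_d, double *C_d) {
    int i = 2 * (blockIdx.y * blockDim.y + threadIdx.y);
    int j = 2 * (blockIdx.x * blockDim.x + threadIdx.x);
    if (i >= N || j >= N) return;
    double c00 = C_d[i + j * N],       c10 = C_d[i + 1 + j * N];
    double c01 = C_d[i + (j + 1) * N], c11 = C_d[i + 1 + (j + 1) * N];
    for (int k = 0; k < K; k++) {
        double a0 = A_d[i + k * N], a1 = A_d[i + 1 + k * N];    // 2 loads of A
        double b0 = B_d[k + j * K], b1 = B_d[k + (j + 1) * K];  // 2 loads of B
        c00 += a0 * b0; c10 += a1 * b0;                         // 4 multiply-adds
        c01 += a0 * b1; c11 += a1 * b1;
    }
    C_d[i + j * N] = c00;       C_d[i + 1 + j * N] = c10;
    C_d[i + (j + 1) * N] = c01; C_d[i + 1 + (j + 1) * N] = c11;
}

// Hypothetical host driver (error checks omitted); C is updated in place.
void matMultTile2OnDevice(int N, int K, const std::vector<double> &A,
                          const std::vector<double> &B, std::vector<double> &C) {
    size_t aB = A.size() * sizeof(double), bB = B.size() * sizeof(double);
    size_t cB = C.size() * sizeof(double);
    double *A_d, *B_d, *C_d;
    cudaMalloc(&A_d, aB); cudaMalloc(&B_d, bB); cudaMalloc(&C_d, cB);
    cudaMemcpy(A_d, A.data(), aB, cudaMemcpyHostToDevice);
    cudaMemcpy(B_d, B.data(), bB, cudaMemcpyHostToDevice);
    cudaMemcpy(C_d, C.data(), cB, cudaMemcpyHostToDevice);
    int tiles = N / 2;                                  // tiles per dimension
    dim3 dimG((tiles + W - 1) / W, (tiles + W - 1) / W);
    dim3 dimB(W, W);
    matMultTile2<<<dimG, dimB>>>(N, K, A_d, B_d, C_d);
    cudaMemcpy(C.data(), C_d, cB, cudaMemcpyDeviceToHost);
    cudaFree(A_d); cudaFree(B_d); cudaFree(C_d);
}
```

The four accumulators and four operand temporaries are where the extra register pressure comes from; whether this pays off depends on how many blocks remain resident per SM.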

20 Host-Device Issues: Multiple GPUs, NVLink, and Unified Memory
- transfer of data between host and device is error-prone, and potentially a performance bottleneck (what if the array for an advection solver could not fit in GPU memory?)
- the problem is exacerbated when multiple GPUs are connected to one host
  - we can select the required device by cudaSetDevice():

        cudaSetDevice(0);
        cudaMalloc(&a_d, n*sizeof(int)); cudaMemcpy(a_d, a, n*sizeof(int), ...);
        reverseArray<<<1, n/2>>>(a_d, n);
        cudaDeviceSynchronize();
        cudaMemcpyPeer(b_d, 1, a_d, 0, n*sizeof(int));   // arguments: dst, dst device, src, src device, bytes
        cudaSetDevice(1);
        reverseArray<<<1, n/2>>>(b_d, n);

- fast interconnects such as NVLink will reduce the transfer costs (e.g. Sierra system)
- CUDA's Unified Memory will improve programmability issues (and in some cases, performance)
  - cudaMallocManaged(&a, n*sizeof(int)); allocates the array on host so that it can migrate, page-by-page, to/from GPU(s) transparently and on demand
- alternatively, have the device and CPU use the same memory, as on AMD's APU for Exascale Computing
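A minimal Unified Memory sketch is given below, assuming a device and CUDA version that support cudaMallocManaged (CUDA 6 or later). The same pointer is written by the host, updated by a kernel, and read back by the host with no explicit cudaMemcpy; addOne and managedDemo are hypothetical names.

```cuda
#include <cuda_runtime.h>

// Adds one to every element of a.
__global__ void addOne(int *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        a[i] += 1;
}

// Returns the sum of n elements that start at 0 on the host and are
// incremented once on the device.
int managedDemo(int n) {
    int *a = nullptr;
    cudaMallocManaged(&a, n * sizeof(int));   // one pointer valid on host and device
    for (int i = 0; i < n; i++)
        a[i] = 0;                             // host writes directly
    addOne<<<(n + 255) / 256, 256>>>(a, n);
    cudaDeviceSynchronize();                  // kernel must finish before the host reads a
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i];
    cudaFree(a);                              // managed memory is released with cudaFree
    return sum;
}
```

The explicit synchronization is the one step that cannot be dropped: kernel launches are asynchronous, so reading a on the host straight after the launch would race with the device.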

21 The Open Computing Language (OpenCL) for Devices and Regular Cores
- open standard, not proprietary like CUDA; based on C (no C++)
- design philosophy: treat GPUs and CPUs as peers; data- and task-parallel compute model
- similar execution model to CUDA:
  - NDRange (CUDA grid): operates on global data, units within cannot synch.
  - WorkGroup (CUDA block): units within can use local data (CUDA __shared__) to synch.
  - WorkItem (CUDA thread): indpt. unit of execution, also has private data
- example kernel:

    __kernel void reverseArray(__global int *a_d, int N) {
        int idx = get_global_id(0);
        int v = a_d[N-idx-1];
        a_d[N-idx-1] = a_d[idx];
        a_d[idx] = v;
    }

- recall that in CUDA, we could launch as reverseArray<<<1, N/2>>>(a_d, N), but in OpenCL...

22 OpenCL Kernel Launch
- must explicitly create device handle, compute context and work-queue, load and compile the kernel, and finally enqueue it for execution

    clGetDeviceIDs(..., CL_DEVICE_TYPE_GPU, 1, &device, ...);
    context = clCreateContext(0, 1, &device, ...);
    queue = clCreateCommandQueue(context, device, ...);
    program = clCreateProgramWithSource(context, "reverseArray.cl", ...);
    clBuildProgram(program, 1, &device, ...);
    reverseArray_k = clCreateKernel(program, "reverseArray", ...);
    clSetKernelArg(reverseArray_k, 0, sizeof(cl_mem), &a_d);
    clSetKernelArg(reverseArray_k, 1, sizeof(int), &N);
    cnDimension = N/2;   // global work size: total number of work-items
    cnBlockSize = N/2;   // local work size: work-items per work-group
    clEnqueueNDRangeKernel(queue, reverseArray_k, 1, 0, &cnDimension, &cnBlockSize, 0, 0, 0);

- note: CUDA host code is compiled into .cubin intermediate files which follow a similar sequence
- for usage on a normal core (CL_DEVICE_TYPE_CPU), a WorkItem corresponds to an item in a work queue that a number of (kernel-level) threads get work from
  - the compiler may aggregate these to reduce overheads

23 Directive-Based Programming Models
- OpenACC enables us to specify which code is to run on a device, and how to transfer data to/from it

    #pragma acc parallel loop copyin(A[0:N*K], B[0:K*N]) copy(C[0:N*N])
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double cij = C[i + j*N];
            for (int k = 0; k < K; k++)
                cij += A[i + k*N] * B[k + j*K];
            C[i + j*N] = cij;
        }

  - the data directive may be used to specify data placement across kernels
  - the code can also be compiled to run across multiple CPUs
- OpenMP 4.0 operates similarly. For the above example:

    #pragma omp target map(to: A[0:N*K], B[0:N*K]) map(tofrom: C[0:N*N])
    #pragma omp parallel for default(shared)

- studies on complex applications where all data must be kept on device indicate a productivity gain and a performance loss of 2× over CUDA (e.g. Zhe14)

24 Graphics Processing Units: Summary
- designed to exploit computations expressible in large numbers of identical, independent threads
  - grouped into blocks: each is allocated to an SM and hence can have synchronization within it
- GPU cores are designed for throughput, not single-thread speed
  - low clock speed, instructions taking several clock cycles
  - SIMT execution to hide long latencies; large amounts of hardware to maintain many thread contexts
- destructive sharing: appears as resource contention
- may lose performance due to poor utilization, but not from load imbalance
- L2 cache and memory bandwidth are an important consideration, but the main consideration in access patterns is within a warp


More information

Exotic Methods in Parallel Computing [GPU Computing]

Exotic Methods in Parallel Computing [GPU Computing] Exotic Methods in Parallel Computing [GPU Computing] Frank Feinbube Exotic Methods in Parallel Computing Dr. Peter Tröger Exotic Methods in Parallel Computing FF 2012 Architectural Shift 2 Exotic Methods

More information

CUDA Lecture 2. Manfred Liebmann. Technische Universität München Chair of Optimal Control Center for Mathematical Sciences, M17

CUDA Lecture 2. Manfred Liebmann. Technische Universität München Chair of Optimal Control Center for Mathematical Sciences, M17 CUDA Lecture 2 Manfred Liebmann Technische Universität München Chair of Optimal Control Center for Mathematical Sciences, M17 manfred.liebmann@tum.de December 15, 2015 CUDA Programming Fundamentals CUDA

More information

Introduction to CUDA (1 of n*)

Introduction to CUDA (1 of n*) Agenda Introduction to CUDA (1 of n*) GPU architecture review CUDA First of two or three dedicated classes Joseph Kider University of Pennsylvania CIS 565 - Spring 2011 * Where n is 2 or 3 Acknowledgements

More information

Stanford University. NVIDIA Tesla M2090. NVIDIA GeForce GTX 690

Stanford University. NVIDIA Tesla M2090. NVIDIA GeForce GTX 690 Stanford University NVIDIA Tesla M2090 NVIDIA GeForce GTX 690 Moore s Law 2 Clock Speed 10000 Pentium 4 Prescott Core 2 Nehalem Sandy Bridge 1000 Pentium 4 Williamette Clock Speed (MHz) 100 80486 Pentium

More information

Portland State University ECE 588/688. Graphics Processors

Portland State University ECE 588/688. Graphics Processors Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly

More information

GRAPHICS PROCESSING UNITS

GRAPHICS PROCESSING UNITS GRAPHICS PROCESSING UNITS Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 4, John L. Hennessy and David A. Patterson, Morgan Kaufmann, 2011

More information

Introduction to GPGPU and GPU-architectures

Introduction to GPGPU and GPU-architectures Introduction to GPGPU and GPU-architectures Henk Corporaal Gert-Jan van den Braak http://www.es.ele.tue.nl/ Contents 1. What is a GPU 2. Programming a GPU 3. GPU thread scheduling 4. GPU performance bottlenecks

More information

Spring Prof. Hyesoon Kim

Spring Prof. Hyesoon Kim Spring 2011 Prof. Hyesoon Kim 2 Warp is the basic unit of execution A group of threads (e.g. 32 threads for the Tesla GPU architecture) Warp Execution Inst 1 Inst 2 Inst 3 Sources ready T T T T One warp

More information

Matrix Multiplication in CUDA. A case study

Matrix Multiplication in CUDA. A case study Matrix Multiplication in CUDA A case study 1 Matrix Multiplication: A Case Study Matrix multiplication illustrates many of the basic features of memory and thread management in CUDA Usage of thread/block

More information

Lecture 5. Performance Programming with CUDA

Lecture 5. Performance Programming with CUDA Lecture 5 Performance Programming with CUDA Announcements 2011 Scott B. Baden / CSE 262 / Spring 2011 2 Today s lecture Matrix multiplication 2011 Scott B. Baden / CSE 262 / Spring 2011 3 Memory Hierarchy

More information

CS377P Programming for Performance GPU Programming - II

CS377P Programming for Performance GPU Programming - II CS377P Programming for Performance GPU Programming - II Sreepathi Pai UTCS November 11, 2015 Outline 1 GPU Occupancy 2 Divergence 3 Costs 4 Cooperation to reduce costs 5 Scheduling Regular Work Outline

More information

Technische Universität München. GPU Programming. Rüdiger Westermann Chair for Computer Graphics & Visualization. Faculty of Informatics

Technische Universität München. GPU Programming. Rüdiger Westermann Chair for Computer Graphics & Visualization. Faculty of Informatics GPU Programming Rüdiger Westermann Chair for Computer Graphics & Visualization Faculty of Informatics Overview Programming interfaces and support libraries The CUDA programming abstraction An in-depth

More information

Lecture 2: CUDA Programming

Lecture 2: CUDA Programming CS 515 Programming Language and Compilers I Lecture 2: CUDA Programming Zheng (Eddy) Zhang Rutgers University Fall 2017, 9/12/2017 Review: Programming in CUDA Let s look at a sequential program in C first:

More information

Introduction to CUDA

Introduction to CUDA Introduction to CUDA Oliver Meister November 7 th 2012 Tutorial Parallel Programming and High Performance Computing, November 7 th 2012 1 References D. Kirk, W. Hwu: Programming Massively Parallel Processors,

More information

GPU Programming. Performance Considerations. Miaoqing Huang University of Arkansas Fall / 60

GPU Programming. Performance Considerations. Miaoqing Huang University of Arkansas Fall / 60 1 / 60 GPU Programming Performance Considerations Miaoqing Huang University of Arkansas Fall 2013 2 / 60 Outline Control Flow Divergence Memory Coalescing Shared Memory Bank Conflicts Occupancy Loop Unrolling

More information

Dense Linear Algebra. HPC - Algorithms and Applications

Dense Linear Algebra. HPC - Algorithms and Applications Dense Linear Algebra HPC - Algorithms and Applications Alexander Pöppl Technical University of Munich Chair of Scientific Computing November 6 th 2017 Last Tutorial CUDA Architecture thread hierarchy:

More information

Programming in CUDA. Malik M Khan

Programming in CUDA. Malik M Khan Programming in CUDA October 21, 2010 Malik M Khan Outline Reminder of CUDA Architecture Execution Model - Brief mention of control flow Heterogeneous Memory Hierarchy - Locality through data placement

More information

HPC COMPUTING WITH CUDA AND TESLA HARDWARE. Timothy Lanfear, NVIDIA

HPC COMPUTING WITH CUDA AND TESLA HARDWARE. Timothy Lanfear, NVIDIA HPC COMPUTING WITH CUDA AND TESLA HARDWARE Timothy Lanfear, NVIDIA WHAT IS GPU COMPUTING? What is GPU Computing? x86 PCIe bus GPU Computing with CPU + GPU Heterogeneous Computing Low Latency or High Throughput?

More information

Introduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series

Introduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series Introduction to GPU Computing Using CUDA Spring 2014 Westgid Seminar Series Scott Northrup SciNet www.scinethpc.ca March 13, 2014 Outline 1 Heterogeneous Computing 2 GPGPU - Overview Hardware Software

More information

Lecture 7. Using Shared Memory Performance programming and the memory hierarchy

Lecture 7. Using Shared Memory Performance programming and the memory hierarchy Lecture 7 Using Shared Memory Performance programming and the memory hierarchy Announcements Scott B. Baden /CSE 260/ Winter 2014 2 Assignment #1 Blocking for cache will boost performance but a lot more

More information

Introduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series

Introduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series Introduction to GPU Computing Using CUDA Spring 2014 Westgid Seminar Series Scott Northrup SciNet www.scinethpc.ca (Slides http://support.scinet.utoronto.ca/ northrup/westgrid CUDA.pdf) March 12, 2014

More information

Parallel Programming Concepts. GPU Computing with OpenCL

Parallel Programming Concepts. GPU Computing with OpenCL Parallel Programming Concepts GPU Computing with OpenCL Frank Feinbube Operating Systems and Middleware Prof. Dr. Andreas Polze Agenda / Quicklinks 2 Recapitulation Motivation History of GPU Computing

More information

Real-time Graphics 9. GPGPU

Real-time Graphics 9. GPGPU Real-time Graphics 9. GPGPU GPGPU GPU (Graphics Processing Unit) Flexible and powerful processor Programmability, precision, power Parallel processing CPU Increasing number of cores Parallel processing

More information

INTRODUCTION TO GPU COMPUTING WITH CUDA. Topi Siro

INTRODUCTION TO GPU COMPUTING WITH CUDA. Topi Siro INTRODUCTION TO GPU COMPUTING WITH CUDA Topi Siro 19.10.2015 OUTLINE PART I - Tue 20.10 10-12 What is GPU computing? What is CUDA? Running GPU jobs on Triton PART II - Thu 22.10 10-12 Using libraries Different

More information

COMP 322: Fundamentals of Parallel Programming. Flynn s Taxonomy for Parallel Computers

COMP 322: Fundamentals of Parallel Programming. Flynn s Taxonomy for Parallel Computers COMP 322: Fundamentals of Parallel Programming Lecture 37: General-Purpose GPU (GPGPU) Computing Max Grossman, Vivek Sarkar Department of Computer Science, Rice University max.grossman@rice.edu, vsarkar@rice.edu

More information

Fundamental Optimizations

Fundamental Optimizations Fundamental Optimizations Paulius Micikevicius NVIDIA Supercomputing, Tutorial S03 New Orleans, Nov 14, 2010 Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access

More information

CUDA C Programming Mark Harris NVIDIA Corporation

CUDA C Programming Mark Harris NVIDIA Corporation CUDA C Programming Mark Harris NVIDIA Corporation Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory Models Programming Environment

More information

Supercomputing, Tutorial S03 New Orleans, Nov 14, 2010

Supercomputing, Tutorial S03 New Orleans, Nov 14, 2010 Fundamental Optimizations Paulius Micikevicius NVIDIA Supercomputing, Tutorial S03 New Orleans, Nov 14, 2010 Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access

More information

Parallel Systems Course: Chapter IV. GPU Programming. Jan Lemeire Dept. ETRO November 6th 2008

Parallel Systems Course: Chapter IV. GPU Programming. Jan Lemeire Dept. ETRO November 6th 2008 Parallel Systems Course: Chapter IV GPU Programming Jan Lemeire Dept. ETRO November 6th 2008 GPU Message-passing Programming with Parallel CUDAMessagepassing Parallel Processing Processing Overview 1.

More information

Analyzing CUDA Workloads Using a Detailed GPU Simulator

Analyzing CUDA Workloads Using a Detailed GPU Simulator CS 3580 - Advanced Topics in Parallel Computing Analyzing CUDA Workloads Using a Detailed GPU Simulator Mohammad Hasanzadeh Mofrad University of Pittsburgh November 14, 2017 1 Article information Title:

More information

CUDA Performance Considerations (2 of 2)

CUDA Performance Considerations (2 of 2) Administrivia CUDA Performance Considerations (2 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2011 Friday 03/04, 11:59pm Assignment 4 due Presentation date change due via email Not bonus

More information

COMP 605: Introduction to Parallel Computing Lecture : GPU Architecture

COMP 605: Introduction to Parallel Computing Lecture : GPU Architecture COMP 605: Introduction to Parallel Computing Lecture : GPU Architecture Mary Thomas Department of Computer Science Computational Science Research Center (CSRC) San Diego State University (SDSU) Posted:

More information

Threading Hardware in G80

Threading Hardware in G80 ing Hardware in G80 1 Sources Slides by ECE 498 AL : Programming Massively Parallel Processors : Wen-Mei Hwu John Nickolls, NVIDIA 2 3D 3D API: API: OpenGL OpenGL or or Direct3D Direct3D GPU Command &

More information

GPGPU. Alan Gray/James Perry EPCC The University of Edinburgh.

GPGPU. Alan Gray/James Perry EPCC The University of Edinburgh. GPGPU Alan Gray/James Perry EPCC The University of Edinburgh a.gray@ed.ac.uk Contents Introduction GPU Technology Programming GPUs GPU Performance Optimisation 2 Introduction 3 Introduction Central Processing

More information

Accelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include

Accelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include 3.1 Overview Accelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include GPUs (Graphics Processing Units) AMD/ATI

More information

ECE 574 Cluster Computing Lecture 17

ECE 574 Cluster Computing Lecture 17 ECE 574 Cluster Computing Lecture 17 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 28 March 2019 HW#8 (CUDA) posted. Project topics due. Announcements 1 CUDA installing On Linux

More information

Advanced CUDA Programming. Dr. Timo Stich

Advanced CUDA Programming. Dr. Timo Stich Advanced CUDA Programming Dr. Timo Stich (tstich@nvidia.com) Outline SIMT Architecture, Warps Kernel optimizations Global memory throughput Launch configuration Shared memory access Instruction throughput

More information

CUDA Workshop. High Performance GPU computing EXEBIT Karthikeyan

CUDA Workshop. High Performance GPU computing EXEBIT Karthikeyan CUDA Workshop High Performance GPU computing EXEBIT- 2014 Karthikeyan CPU vs GPU CPU Very fast, serial, Low Latency GPU Slow, massively parallel, High Throughput Play Demonstration Compute Unified Device

More information

Hands-on CUDA Optimization. CUDA Workshop

Hands-on CUDA Optimization. CUDA Workshop Hands-on CUDA Optimization CUDA Workshop Exercise Today we have a progressive exercise The exercise is broken into 5 steps If you get lost you can always catch up by grabbing the corresponding directory

More information

CS516 Programming Languages and Compilers II

CS516 Programming Languages and Compilers II CS516 Programming Languages and Compilers II Zheng Zhang Spring 2015 Jan 22 Overview and GPU Programming I Rutgers University CS516 Course Information Staff Instructor: zheng zhang (eddy.zhengzhang@cs.rutgers.edu)

More information

OpenCL. Matt Sellitto Dana Schaa Northeastern University NUCAR

OpenCL. Matt Sellitto Dana Schaa Northeastern University NUCAR OpenCL Matt Sellitto Dana Schaa Northeastern University NUCAR OpenCL Architecture Parallel computing for heterogenous devices CPUs, GPUs, other processors (Cell, DSPs, etc) Portable accelerated code Defined

More information

CUDA Programming. Week 1. Basic Programming Concepts Materials are copied from the reference list

CUDA Programming. Week 1. Basic Programming Concepts Materials are copied from the reference list CUDA Programming Week 1. Basic Programming Concepts Materials are copied from the reference list G80/G92 Device SP: Streaming Processor (Thread Processors) SM: Streaming Multiprocessor 128 SP grouped into

More information

GPU Programming. Alan Gray, James Perry EPCC The University of Edinburgh

GPU Programming. Alan Gray, James Perry EPCC The University of Edinburgh GPU Programming EPCC The University of Edinburgh Contents NVIDIA CUDA C Proprietary interface to NVIDIA architecture CUDA Fortran Provided by PGI OpenCL Cross platform API 2 NVIDIA CUDA CUDA allows NVIDIA

More information

B. Tech. Project Second Stage Report on

B. Tech. Project Second Stage Report on B. Tech. Project Second Stage Report on GPU Based Active Contours Submitted by Sumit Shekhar (05007028) Under the guidance of Prof Subhasis Chaudhuri Table of Contents 1. Introduction... 1 1.1 Graphic

More information

ECE 408 / CS 483 Final Exam, Fall 2014

ECE 408 / CS 483 Final Exam, Fall 2014 ECE 408 / CS 483 Final Exam, Fall 2014 Thursday 18 December 2014 8:00 to 11:00 Central Standard Time You may use any notes, books, papers, or other reference materials. In the interest of fair access across

More information

Real-time Graphics 9. GPGPU

Real-time Graphics 9. GPGPU 9. GPGPU GPGPU GPU (Graphics Processing Unit) Flexible and powerful processor Programmability, precision, power Parallel processing CPU Increasing number of cores Parallel processing GPGPU general-purpose

More information

Device Memories and Matrix Multiplication

Device Memories and Matrix Multiplication Device Memories and Matrix Multiplication 1 Device Memories global, constant, and shared memories CUDA variable type qualifiers 2 Matrix Multiplication an application of tiling runningmatrixmul in the

More information

Paralization on GPU using CUDA An Introduction

Paralization on GPU using CUDA An Introduction Paralization on GPU using CUDA An Introduction Ehsan Nedaaee Oskoee 1 1 Department of Physics IASBS IPM Grid and HPC workshop IV, 2011 Outline 1 Introduction to GPU 2 Introduction to CUDA Graphics Processing

More information

Scientific discovery, analysis and prediction made possible through high performance computing.

Scientific discovery, analysis and prediction made possible through high performance computing. Scientific discovery, analysis and prediction made possible through high performance computing. An Introduction to GPGPU Programming Bob Torgerson Arctic Region Supercomputing Center November 21 st, 2013

More information

Introduction to CUDA

Introduction to CUDA Introduction to CUDA Overview HW computational power Graphics API vs. CUDA CUDA glossary Memory model, HW implementation, execution Performance guidelines CUDA compiler C/C++ Language extensions Limitations

More information

GPU CUDA Programming

GPU CUDA Programming GPU CUDA Programming 이정근 (Jeong-Gun Lee) 한림대학교컴퓨터공학과, 임베디드 SoC 연구실 www.onchip.net Email: Jeonggun.Lee@hallym.ac.kr ALTERA JOINT LAB Introduction 차례 Multicore/Manycore and GPU GPU on Medical Applications

More information