Introduction to CUDA (1 of n*)


Patrick Cozzi
University of Pennsylvania
CIS 565 - Spring 2011
* Where n is 2 or 3

Administrivia
- Paper presentation due Wednesday, 02/23
  - Topics first come, first serve
- Assignment 4 handed out today
  - Due Friday, 03/04 at 11:59pm

Agenda
- GPU architecture review
- CUDA
  - First of two or three dedicated classes

Acknowledgements
- Many slides are from Kayvon Fatahalian's "From Shader Code to a Teraflop: How GPU Shader Cores Work":
  http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuarchteraflop_bps_siggraph2010.pdf
- David Kirk and Wen-mei Hwu's UIUC course:
  http://courses.engr.illinois.edu/ece498/al/

GPU Architecture Review
- GPUs are:
  - Parallel
  - Multithreaded
  - Many-core
- GPUs have:
  - Tremendous computational horsepower
  - High memory bandwidth

GPU Architecture Review
- GPUs are specialized for:
  - Compute-intensive, highly parallel computation
  - Graphics!
- Transistors are devoted to:
  - Processing
- Not:
  - Data caching
  - Flow control

GPU Architecture Review: Transistor Usage
Image from: http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/cuda_c_programming_guide.pdf
Slide from: http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuarchteraflop_bps_siggraph2010.pdf

[A run of slides reproduced from Kayvon Fatahalian's talk:
http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuarchteraflop_bps_siggraph2010.pdf]

Let's program this thing!

GPU Computing History
- 2001/2002: researchers see the GPU as a data-parallel coprocessor
  - The GPGPU field is born
- 2007: NVIDIA releases CUDA
  - CUDA: Compute Unified Device Architecture
  - GPGPU shifts to GPU Computing
- 2008: Khronos releases the OpenCL specification

CUDA Abstractions
- A hierarchy of thread groups
- Shared memories
- Barrier synchronization

CUDA Terminology
- Host: typically the CPU
  - Code written in ANSI C
- Device: typically the GPU (data-parallel)
  - Code written in extended ANSI C
- Host and device have separate memories
- CUDA program: contains both host and device code

CUDA Terminology
- Kernel: a data-parallel function
  - Invoking a kernel creates lightweight threads on the device
  - Threads are generated and scheduled with hardware

CUDA Kernels
- Executed N times in parallel by N different CUDA threads
- Key pieces: the declaration specifier, the thread ID, and the execution configuration (see the sketch below)
- Does a kernel remind you of a shader in OpenGL?
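A minimal sketch of those three pieces, assuming a hypothetical vector-add kernel (the names VecAdd and N are illustrative, not from the slides):

#include <cstdio>

#define N 8

// __global__ is the declaration specifier: this function runs on the
// device and is callable from the host.
__global__ void VecAdd(const float* A, const float* B, float* C)
{
    int i = threadIdx.x;   // built-in thread index: one thread per element
    C[i] = A[i] + B[i];
}

int main()
{
    float hA[N], hB[N], hC[N];
    for (int i = 0; i < N; ++i) { hA[i] = (float)i; hB[i] = 2.0f * i; }

    float *dA, *dB, *dC;
    cudaMalloc((void**)&dA, N * sizeof(float));  // device allocations
    cudaMalloc((void**)&dB, N * sizeof(float));  // (covered in detail later)
    cudaMalloc((void**)&dC, N * sizeof(float));
    cudaMemcpy(dA, hA, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, N * sizeof(float), cudaMemcpyHostToDevice);

    VecAdd<<<1, N>>>(dA, dB, dC);  // execution configuration: 1 block, N threads

    cudaMemcpy(hC, dC, N * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; ++i) printf("%.1f\n", hC[i]);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}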

CUDA Program Execution
- Grid: one or more thread blocks
  - 1D or 2D
- Block: array of threads
  - 1D, 2D, or 3D
  - Each block in a grid has the same number of threads
  - Each thread in a block can:
    - Synchronize
    - Access shared memory
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/chapter2-cudaprogrammingmodel.pdf

Block
- 1D, 2D, or 3D
- Example: index into a vector, matrix, or volume
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/chapter2-cudaprogrammingmodel.pdf

Thread ID
- Thread ID: scalar thread identifier
- Thread Index: threadIdx
- 1D: Thread ID == Thread Index
- 2D block with size (Dx, Dy):
  - Thread ID of index (x, y) == x + y * Dx
- 3D block with size (Dx, Dy, Dz):
  - Thread ID of index (x, y, z) == x + y * Dx + z * Dx * Dy

[Figures: 1D and 2D thread block examples]

Thread Block
- Group of threads
  - G80 and GT200: up to 512 threads
  - Fermi: up to 1024 threads
- Threads reside on the same processor core
- Threads share the memory of that core
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/chapter2-cudaprogrammingmodel.pdf
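A small sketch of that flattening at runtime (ThreadIdDemo is a hypothetical kernel; blockDim holds (Dx, Dy, Dz)):

__global__ void ThreadIdDemo(int* ids)
{
    // Flatten the built-in 3D index into the scalar thread ID above.
    int tid = threadIdx.x
            + threadIdx.y * blockDim.x
            + threadIdx.z * blockDim.x * blockDim.y;
    ids[tid] = tid;
}

A launch such as ThreadIdDemo<<<1, dim3(4, 2, 2)>>>(d_ids); produces thread IDs 0 through 15.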

Block Index and Dimension
- Block Index: blockIdx (1D or 2D)
- Block Dimension: blockDim

2D Thread Block Example
- N = 32
- 16x16 threads per block (independent of N)
  - threadIdx: ([0, 15], [0, 15])
- 2x2 thread blocks in the grid
  - blockIdx: ([0, 1], [0, 1])
  - blockDim = 16
- i = [0, 1] * 16 + [0, 15] (computed in the sketch below)

Thread Blocks
- Thread blocks execute independently
  - In any order: parallel or series
  - Scheduled in any order by any number of cores
- Allows code to scale with core count
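A sketch of that index arithmetic (FillIndex and d_out are hypothetical names):

__global__ void FillIndex(int* out)
{
    // With 2 blocks of 16 threads: blockIdx.x is [0, 1],
    // threadIdx.x is [0, 15], so i covers [0, 31].
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = i;
}

For the N = 32 example above, the launch would be FillIndex<<<2, 16>>>(d_out);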

Thread Synchronization
- Threads in a block share (limited) low-latency memory
- Threads in a block can synchronize execution
  - To coordinate memory accesses
  - __syncthreads()
    - Barrier: threads in the block wait until all threads reach it
    - Lightweight
Image from: http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/cuda_c_programming_guide.pdf
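A sketch of the barrier in use, assuming a hypothetical ShiftLeft kernel launched as one block of up to 256 threads:

__global__ void ShiftLeft(const float* in, float* out, int n)
{
    __shared__ float data[256];  // low-latency memory shared by the block
    int i = threadIdx.x;
    if (i < n)
        data[i] = in[i];         // each thread writes one slot
    __syncthreads();             // barrier: all writes complete past here
    if (i + 1 < n)
        out[i] = data[i + 1];    // safe: the neighbor's write has finished
}

Without the __syncthreads() call, a thread could read data[i + 1] before the owning thread has written it.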

CUDA Memory Transfers
- Host can transfer to/from device:
  - Global memory
  - Constant memory
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/chapter2-cudaprogrammingmodel.pdf

CUDA Memory Allocation
- cudaMalloc()
  - Allocates global memory on the device
  - Parameters: pointer to device memory, size in bytes
- cudaFree()
  - Frees the memory
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/chapter2-cudaprogrammingmodel.pdf
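A minimal sketch of the pair (the name Md and the width value follow the matrix example later in this lecture):

int main()
{
    const int width = 1000;
    float* Md = NULL;
    size_t size = width * width * sizeof(float);

    cudaMalloc((void**)&Md, size);  // pointer to device memory, size in bytes
    // ... launch kernels that use Md ...
    cudaFree(Md);                   // free the allocation when done
    return 0;
}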

CUDA Memory Transfers
- cudaMemcpy()
  - Memory transfer:
    - Host to host
    - Host to device
    - Device to host
    - Device to device
[Figure: host and device, each with its own global memory]
- Does this remind you of VBOs in OpenGL?

CUDA Memory Transfers: Host to Device
- cudaMemcpy(destination, source, ...)
  - Example: destination on the device, source on the host
- cudaMemcpy() is synchronous: the call blocks until the transfer completes
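A sketch of the four directions (h_* are host buffers, d_* device buffers; all names hypothetical):

void TransferDemo(float* h_in, float* h_out, size_t size)
{
    float *d_a, *d_b;
    cudaMalloc((void**)&d_a, size);
    cudaMalloc((void**)&d_b, size);

    cudaMemcpy(d_a, h_in, size, cudaMemcpyHostToDevice);    // host   -> device
    cudaMemcpy(d_b, d_a, size, cudaMemcpyDeviceToDevice);   // device -> device
    cudaMemcpy(h_out, d_b, size, cudaMemcpyDeviceToHost);   // device -> host
    cudaMemcpy(h_out, h_in, size, cudaMemcpyHostToHost);    // host   -> host

    cudaFree(d_a);
    cudaFree(d_b);
}

The last argument selects the direction; the first argument is always the destination.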

Matrix Multiply
- P = M * N
- Assume M and N are square for simplicity
- Is this data-parallel?

Matrix Multiply
- 1,000 x 1,000 matrix
- 1,000,000 dot products
  - Each is 1,000 multiplies and 1,000 adds
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/chapter2-cudaprogrammingmodel.pdf

Matrix Multiply: CPU Implementation

void MatrixMulOnHost(float* M, float* N, float* P, int width)
{
    // For each output element (i, j), accumulate the dot product
    // of row i of M with column j of N.
    for (int i = 0; i < width; ++i)
        for (int j = 0; j < width; ++j)
        {
            float sum = 0;
            for (int k = 0; k < width; ++k)
            {
                float a = M[i * width + k];
                float b = N[k * width + j];
                sum += a * b;
            }
            P[i * width + j] = sum;
        }
}

Code from: http://courses.engr.illinois.edu/ece498/al/lectures/lecture3%20cuda%20threads%20spring%202010.ppt

Matrix Multiply: CUDA Skeleton
[Skeleton code slides; the code itself did not survive the transcription]

Matrix Multiply: Step 1
- Add CUDA memory transfers to the skeleton

Matrix Multiply: Data Transfer
- Allocate input

- Allocate output
- Read back from device (see the sketch below)
- Does this remind you of GPGPU with GLSL?
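A sketch of those data-transfer steps, assuming the Md/Nd/Pd device naming used on the Kirk/Hwu slides (the slide code itself was not preserved):

void MatrixMulOnDevice(float* M, float* N, float* P, int width)
{
    size_t size = width * width * sizeof(float);
    float *Md, *Nd, *Pd;

    // Allocate and transfer input
    cudaMalloc((void**)&Md, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&Nd, size);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

    // Allocate output
    cudaMalloc((void**)&Pd, size);

    // Kernel invocation goes here (Steps 2 and 3 below)

    // Read back from device, then free device memory
    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
    cudaFree(Md);
    cudaFree(Nd);
    cudaFree(Pd);
}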

Matrix Multiply: Step 2
- Implement the kernel in CUDA C

Matrix Multiply: CUDA Kernel
- Accessing a matrix, so use a 2D block
- Each thread computes one output element
- Where did the two outer for loops in the CPU implementation go?
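A sketch of the kernel just described (single-block version, following the Kirk/Hwu lecture; the slide code was not preserved):

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int width)
{
    // The 2D thread index replaces the two outer loops of the CPU
    // version: each thread produces one element of Pd.
    int row = threadIdx.y;
    int col = threadIdx.x;

    float sum = 0;
    for (int k = 0; k < width; ++k)
        sum += Md[row * width + k] * Nd[k * width + col];

    Pd[row * width + col] = sum;
}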

Matrix Multiply: Step 3
- Invoke the kernel in CUDA C
- No locks or synchronization. Why?

Matrix Multiply: Invoke Kernel
- One block with width x width threads (sketched below)

Matrix Multiply
- One block of threads computes matrix Pd
  - Each thread computes one element of Pd
- Each thread:
  - Loads a row of matrix Md
  - Loads a column of matrix Nd
  - Performs one multiply and one add for each pair of Md and Nd elements
- Compute to off-chip memory access ratio close to 1:1 (not very high)
- Size of matrix limited by the number of threads allowed in a thread block
[Figure: a 1x1 grid whose single block computes Pd; Thread (2, 2) reads a row of Md and a column of Nd]
(c) David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign
Slide from: http://courses.engr.illinois.edu/ece498/al/lectures/lecture2%20cuda%20spring%2009.ppt
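A sketch of the launch, slotting into the "kernel invocation goes here" step of the MatrixMulOnDevice sketch above:

    dim3 dimBlock(width, width);  // one block of width x width threads
    dim3 dimGrid(1, 1);           // a single block in the grid
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, width);

No locks are needed because each thread writes a distinct element of Pd and only reads Md and Nd.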

Matrix Multiply
- What is the major performance problem with our implementation?
- What is the major limitation?