Module 3: CUDA Execution Model - I. Objective


ECE 8823A GPU Architectures
Module 3: CUDA Execution Model - I

Objective
A more detailed look at kernel execution: data-to-thread assignment, the organization and scheduling of threads, resource assignment at the block level, scheduling at the warp level, and the basics of SIMT execution.

Reading Assignment
Kirk and Hwu, Programming Massively Parallel Processors: A Hands-on Approach, Chapter 4.
Kirk and Hwu, Programming Massively Parallel Processors: A Hands-on Approach, Chapter 6.3.
Reference: CUDA Programming Guide, http://docs.nvidia.com/cuda/cuda-c-programming-guide/#abstract

A Multi-dimensional Grid Example
[Figure: the host launches Kernel 1 on Grid 1, a 2 x 2 arrangement of blocks, and Kernel 2 on Grid 2; Block (1,1) of Grid 2 is expanded to show its threads laid out in three dimensions, Thread (0,0,0) through Thread (1,0,3).]

Built-In Variables
1D-3D grid of thread blocks.
Built-in: gridDim, with components gridDim.x, gridDim.y, gridDim.z.
Built-in: blockDim, with components blockDim.x, blockDim.y, blockDim.z.
Example:

    dim3 dimGrid(32, 2, 2);             // 3D grid of thread blocks
    dim3 dimGrid(2, 2, 1);              // 2D grid of thread blocks
    dim3 dimGrid(32, 1, 1);             // 1D grid of thread blocks
    dim3 dimGrid(ceil(n/256.0), 1, 1);  // 1D grid of thread blocks
    my_kernel<<<dimGrid, dimBlock>>>(...);

Built-In Variables (2)
1D-3D arrangement of threads within a thread block.
Built-in: blockIdx, with components blockIdx.x, blockIdx.y, blockIdx.z (the block's position in the grid).
Built-in: threadIdx, with components threadIdx.x, threadIdx.y, threadIdx.z (the thread's position in its block).
All blocks have the same thread configuration.
Example:

    dim3 dimBlock(4, 2, 2);    // 3D block of threads
    dim3 dimBlock(2, 2, 1);    // 2D block of threads
    dim3 dimBlock(32, 1, 1);   // 1D block of threads
    my_kernel<<<dimGrid, dimBlock>>>(...);
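As a concrete illustration of how these built-ins combine, here is a minimal sketch (not from the slides; the kernel and variable names are hypothetical, and d_data and n are assumed to be set up elsewhere) of a 1D launch that uses the ceil-based grid size shown above:

    __global__ void scale_kernel(float* d_data, int n) {
        // Global index from block and thread coordinates
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                       // guard against the last, partially filled block
            d_data[i] = 2.0f * d_data[i];
    }

    // Host-side launch: one thread per element, 256 threads per block
    dim3 dimBlock(256, 1, 1);
    dim3 dimGrid(ceil(n / 256.0), 1, 1);
    scale_kernel<<<dimGrid, dimBlock>>>(d_data, n);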

Built-In Variables (3)
blockIdx (blockIdx.x, blockIdx.y, blockIdx.z) and threadIdx (threadIdx.x, threadIdx.y, threadIdx.z) are initialized by the runtime when the kernel is launched. Their ranges are fixed by the compute capability of the target devices; you can query the device for these limits (covered in detail later; a minimal sketch follows).
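For instance, the per-dimension limits on blocks and grids can be read from the device properties at run time. A minimal sketch (assuming device 0 and omitting error checking; this snippet is illustrative, not from the slides):

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // properties of device 0
        // Maximum threads per block and per-dimension limits
        printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
        printf("Max block dims: %d x %d x %d\n",
               prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
        printf("Max grid dims:  %d x %d x %d\n",
               prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
        return 0;
    }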

2D Examples

Processing a Picture with a 2D Grid
16 x 16 thread blocks covering a 76 x 62 pixel picture.

Row-Major Layout in C/C++
[Figure: a 4 x 4 matrix, elements (0,0) through (3,3), linearized into offsets 0 through 15.]
Element (Row, Col) of a Width-wide matrix is stored at offset Row*Width + Col; for example, element (2,1) of a 4-wide matrix is at offset 2*4 + 1 = 9.

Source Code of the Picture Kernel

    __global__ void PictureKernel(float* d_Pin, float* d_Pout, int n, int m)
    {
        // Calculate the row # of the d_Pin and d_Pout element to process
        int Row = blockIdx.y*blockDim.y + threadIdx.y;

        // Calculate the column # of the d_Pin and d_Pout element to process
        int Col = blockIdx.x*blockDim.x + threadIdx.x;

        // Each thread computes one element of d_Pout if in range
        if ((Row < m) && (Col < n)) {
            d_Pout[Row*n + Col] = 2.0f * d_Pin[Row*n + Col];
        }
    }

Approach Summary
Decide the storage layout of the data, assign each thread a unique ID, and map those IDs to the data accesses.
Figure 4.5: Covering a 76 x 62 picture with 16 x 16 blocks.
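A possible host-side launch for this kernel, assuming the 76 x 62 picture above and 16 x 16 thread blocks (a sketch, not from the slides; d_Pin and d_Pout are assumed already allocated and populated on the device, and ceil rounds the grid up so the partial edge blocks are covered):

    int n = 76, m = 62;                                 // picture width and height in pixels
    dim3 dimBlock(16, 16, 1);                           // 16 x 16 = 256 threads per block
    dim3 dimGrid(ceil(n / 16.0), ceil(m / 16.0), 1);    // 5 x 4 = 20 blocks
    PictureKernel<<<dimGrid, dimBlock>>>(d_Pin, d_Pout, n, m);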

A Simple Running Example: Matrix Multiplication
A simple illustration of the basic features of memory and thread management in CUDA programs: thread index usage, memory layout, and register usage. Assume square matrices for simplicity, and leave shared memory usage until later.

Square Matrix-Matrix Multiplication
P = M * N, all of size Width x Width. Each thread calculates one element of P. Each row of M is loaded Width times from global memory, and each column of N is loaded Width times from global memory.

Row-Major Layout in C/C++ (recap)
Element (Row, Col) of a Width-wide matrix is stored at offset Row*Width + Col, e.g. element (2,1) of a 4-wide matrix is at 2*4 + 1 = 9.

Matrix Multiplication: A Simple Host Version in C

    // Matrix multiplication on the (CPU) host in double precision
    void MatrixMulOnHost(float* M, float* N, float* P, int Width)
    {
        for (int i = 0; i < Width; ++i)
            for (int j = 0; j < Width; ++j) {
                double sum = 0;
                for (int k = 0; k < Width; ++k) {
                    double a = M[i * Width + k];
                    double b = N[k * Width + j];
                    sum += a * b;
                }
                P[i * Width + j] = sum;
            }
    }
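A small, hypothetical test harness for this host version (not from the slides); it also serves later as a reference when checking the GPU result:

    #include <stdio.h>

    int main() {
        const int Width = 4;
        float M[4*4], N[4*4], P[4*4];
        // Fill M with ones and N with twos so the expected product is easy to predict
        for (int i = 0; i < Width * Width; ++i) { M[i] = 1.0f; N[i] = 2.0f; }
        MatrixMulOnHost(M, N, P, Width);
        // Every element of P should equal Width * 1 * 2 = 8
        printf("P[0][0] = %f (expected 8.0)\n", P[0]);
        return 0;
    }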

Kernel Version: Functional Description

    for (int k = 0; k < Width; ++k)
        Pvalue += d_M[Row*Width + k] * d_N[k*Width + Col];

Which thread is at coordinate (Row, Col)?
[Figure: the thread at (Row, Col) traverses row Row of M and column Col of N to produce one element of P; block dimensions are blockDim.x by blockDim.y.]

Kernel Function - A Small Example
Have each 2D thread block compute a (TILE_WIDTH)^2 sub-matrix (tile) of the result matrix; each block has (TILE_WIDTH)^2 threads. Generate a 2D grid of (WIDTH/TILE_WIDTH)^2 blocks.
Example: WIDTH = 4, TILE_WIDTH = 2. Each block has 2*2 = 4 threads; WIDTH/TILE_WIDTH = 2, so use 2*2 = 4 blocks: Block(0,0), Block(0,1), Block(1,0), and Block(1,1), each covering a 2 x 2 tile of P (P0,0 through P3,3).

A Slightly Bigger Example
WIDTH = 8, TILE_WIDTH = 2. Each block has 2*2 = 4 threads; WIDTH/TILE_WIDTH = 4, so use 4*4 = 16 blocks. [Figure: the 8 x 8 result matrix P0,0 through P7,7 partitioned into 16 tiles of 2 x 2 elements each.]

A Slightly Bigger Example (cont.)
WIDTH = 8, TILE_WIDTH = 4. Each block has 4*4 = 16 threads; WIDTH/TILE_WIDTH = 2, so use 2*2 = 4 blocks. [Figure: the same 8 x 8 matrix partitioned into 4 tiles of 4 x 4 elements each.]
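To make the tile-to-thread mapping concrete, this small host-side sketch (hypothetical, plain C, not from the slides) enumerates which element of P each thread of each block computes for WIDTH = 4 and TILE_WIDTH = 2, mirroring the figures above:

    #include <stdio.h>

    #define WIDTH 4
    #define TILE_WIDTH 2

    int main() {
        // Loop over the blocks of the grid, then over the threads of each block
        for (int by = 0; by < WIDTH / TILE_WIDTH; ++by)
            for (int bx = 0; bx < WIDTH / TILE_WIDTH; ++bx)
                for (int ty = 0; ty < TILE_WIDTH; ++ty)
                    for (int tx = 0; tx < TILE_WIDTH; ++tx) {
                        int Row = by * TILE_WIDTH + ty;   // same formulas the kernel uses
                        int Col = bx * TILE_WIDTH + tx;
                        printf("Block(%d,%d) Thread(%d,%d) -> P[%d][%d]\n",
                               by, bx, ty, tx, Row, Col);
                    }
        return 0;
    }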

Kernel Invocation (Host-side Code)

    // Set up the execution configuration
    // TILE_WIDTH is a #define constant
    dim3 dimGrid(Width/TILE_WIDTH, Width/TILE_WIDTH, 1);
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH, 1);

    // Launch the device computation threads!
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

Kernel Function

    // Matrix multiplication kernel - per-thread code
    __global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
    {
        // Pvalue is used to store the element of the matrix
        // that is computed by the thread
        float Pvalue = 0;
        // ... (the complete kernel is shown below)
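Stepping back to the host side for a moment, here is a minimal sketch of the memory management that surrounds the invocation above (assumed, not from the slides: h_M, h_N, and h_P name the host arrays; error checking is omitted):

    int size = Width * Width * sizeof(float);
    float *Md, *Nd, *Pd;

    // 1. Allocate device memory and copy the inputs over
    cudaMalloc((void**)&Md, size);
    cudaMalloc((void**)&Nd, size);
    cudaMalloc((void**)&Pd, size);
    cudaMemcpy(Md, h_M, size, cudaMemcpyHostToDevice);
    cudaMemcpy(Nd, h_N, size, cudaMemcpyHostToDevice);

    // 2. Launch the kernel (execution configuration as above)
    dim3 dimGrid(Width/TILE_WIDTH, Width/TILE_WIDTH, 1);
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH, 1);
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

    // 3. Copy the result back and free device memory
    cudaMemcpy(h_P, Pd, size, cudaMemcpyDeviceToHost);
    cudaFree(Md); cudaFree(Nd); cudaFree(Pd);

Note that dimGrid as written assumes Width is a multiple of TILE_WIDTH; because the kernel below guards with (Row < Width) && (Col < Width), using ceil(Width / (float)TILE_WIDTH) for the grid dimensions would handle arbitrary sizes.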

Work for Block (0,0) in a TILE_WIDTH = 2 Configuration
Col = blockIdx.x * 2 + threadIdx.x and Row = blockIdx.y * 2 + threadIdx.y, so block (0,0) computes P0,0, P0,1, P1,0, and P1,1 from rows 0-1 of M and columns 0-1 of N. [Figure: the four threads of block (0,0) covering the top-left 2 x 2 tile of P.]

Work for Block (0,1)
Col = 1 * 2 + threadIdx.x and Row = 0 * 2 + threadIdx.y, so block (0,1) computes P0,2, P0,3, P1,2, and P1,3 from rows 0-1 of M and columns 2-3 of N. [Figure: the four threads of block (0,1) covering the top-right 2 x 2 tile of P.]

A Simple Matrix Multiplication Kernel

    __global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
    {
        // Calculate the row index of the d_P element and of d_M
        int Row = blockIdx.y*blockDim.y + threadIdx.y;
        // Calculate the column index of d_P and of d_N
        int Col = blockIdx.x*blockDim.x + threadIdx.x;

        if ((Row < Width) && (Col < Width)) {
            float Pvalue = 0;
            // Each thread computes one element of the block sub-matrix
            for (int k = 0; k < Width; ++k)
                Pvalue += d_M[Row*Width + k] * d_N[k*Width + Col];
            d_P[Row*Width + Col] = Pvalue;
        }
    }

QUESTIONS?

David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011, ECE408/CS483, University of Illinois, Urbana-Champaign