CIS 565 Fall 2011, Qing Sun. Outline (2011/10/8): Memory Management, Kernels, Matrix Multiplication.


Outline
CIS 565 Fall 2011
Qing Sun, sunqing@seas.upenn.edu
Memory Management
Kernels
Matrix multiplication

Managing Memory
CPU and GPU have separate memory spaces
Host (CPU) code manages device (GPU) memory:
  Allocate / free
  Copy data to and from the device
Applies to global device memory (DRAM)

GPU Memory Allocation / Release
cudaMalloc(void** pointer, size_t nbytes)
cudaMemset(void* pointer, int value, size_t count)
cudaFree(void* pointer)

    int n = 1024;
    int nbytes = 1024 * sizeof(int);
    int *d_a = 0;
    cudaMalloc((void**)&d_a, nbytes);
    cudaMemset(d_a, 0, nbytes);
    cudaFree(d_a);
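The snippet above omits error handling. As a minimal sketch that is not part of the original slides, every CUDA runtime call returns a cudaError_t that can be checked:

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        int n = 1024;
        int nbytes = n * sizeof(int);
        int *d_a = 0;

        // cudaMalloc reports failure (e.g. out of device memory) through its return value.
        cudaError_t err = cudaMalloc((void**)&d_a, nbytes);
        if (err != cudaSuccess) {
            printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
            return 1;
        }

        cudaMemset(d_a, 0, nbytes);   // zero the device array
        cudaFree(d_a);                // release device memory
        return 0;
    }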

Data Copies
cudaMemcpy(void* dst, void* src, size_t nbytes, enum cudaMemcpyKind direction);
  direction specifies the locations (host or device) of src and dst
  Blocks the CPU thread: returns after the copy is complete
  Doesn't start copying until previous CUDA calls have completed
enum cudaMemcpyKind
  cudaMemcpyHostToDevice
  cudaMemcpyDeviceToHost
  cudaMemcpyDeviceToDevice

Executing Code on the GPU
Kernels are C functions with some restrictions:
  Cannot access host memory
  Must have void return type
  No variable number of arguments
  Not recursive
  No static variables
Function arguments are automatically copied from host to device

Function Qualifiers
Kernels are designated by the function qualifier __global__
  Function is called from the host and executed on the device
  Must return void
Other CUDA function qualifiers:
  __device__  Function is called from the device and runs on the device; cannot be called from host code
  __host__    Function is called from the host and executed on the host (the default)
The __host__ and __device__ qualifiers can be combined to generate both CPU and GPU code

Launching Kernels
Modified C function call syntax: kernel<<<dim3 dG, dim3 dB>>>(...)
Execution configuration ("<<< >>>"):
  dG: dimension and size of the grid in blocks
    Two dimensional: x and y
    Blocks launched in the grid: dG.x * dG.y
  dB: dimension and size of each block in threads
    Three dimensional: x, y and z
    Threads per block: dB.x * dB.y * dB.z
Unspecified dim3 fields initialize to 1
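A minimal end-to-end sketch tying these pieces together (the kernel name scale_kernel and the scaling operation are made up for illustration, not from the slides): copy data to the device, launch a __global__ kernel with an execution configuration, and copy the result back.

    #include <stdio.h>
    #include <cuda_runtime.h>

    __global__ void scale_kernel(float *d_x, float s, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            d_x[i] *= s;          // runs on the device; cannot touch host memory
    }

    int main(void)
    {
        const int n = 256;
        const int nbytes = n * sizeof(float);
        float h_x[n];
        for (int i = 0; i < n; i++) h_x[i] = (float)i;

        float *d_x = 0;
        cudaMalloc((void**)&d_x, nbytes);
        cudaMemcpy(d_x, h_x, nbytes, cudaMemcpyHostToDevice);   // host -> device

        scale_kernel<<<(n + 63) / 64, 64>>>(d_x, 2.0f, n);       // 4 blocks of 64 threads

        cudaMemcpy(h_x, d_x, nbytes, cudaMemcpyDeviceToHost);   // device -> host (synchronous)
        cudaFree(d_x);

        printf("h_x[10] = %f\n", h_x[10]);                      // expect 20.0
        return 0;
    }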

Execution Configuration Examples
    dim3 grid, block;
    grid.x = 2; grid.y = 4;
    block.x = 8; block.y = 16;
    kernel<<<grid, block>>>(...);

    dim3 grid(2, 4), block(8, 16);
    kernel<<<grid, block>>>(...);

    kernel<<<32, 512>>>(...);

CUDA Built-in Device Variables
All __global__ and __device__ functions have access to these automatically defined variables:
  dim3 gridDim;    Dimensions of the grid in blocks (at most 2D)
  dim3 blockDim;   Dimensions of the block in threads
  dim3 blockIdx;   Block index within the grid
  dim3 threadIdx;  Thread index within the block

Unique Thread IDs
Built-in variables are used to determine unique thread IDs:
map from the local thread ID (threadIdx) to a global ID that can be used as an array index.

Increment Array Example
CPU program:
    void inc_cpu(int *a, int N)
    {
        for (int idx = 0; idx < N; idx++)
            a[idx] = a[idx] + 1;
    }

    void main()
    {
        ...
        inc_cpu(a, N);
    }

CUDA program:
    __global__ void inc_gpu(int *d_a, int N)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N)
            d_a[idx] = d_a[idx] + 1;
    }

    void main()
    {
        ...
        dim3 dimBlock(blocksize);
        dim3 dimGrid(ceil(N / (float)blocksize));
        inc_gpu<<<dimGrid, dimBlock>>>(d_a, N);
    }
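The slide's host code elides the array setup and data transfer. A fuller sketch (the value of blocksize, the array size, and the host-side initialization are assumptions added here, not from the slides) could look like:

    #include <stdio.h>
    #include <cuda_runtime.h>

    __global__ void inc_gpu(int *d_a, int N)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N)                       // guard: the last block may be partially full
            d_a[idx] = d_a[idx] + 1;
    }

    int main(void)
    {
        const int N = 1000;
        const int blocksize = 256;
        int h_a[N];
        for (int i = 0; i < N; i++) h_a[i] = i;

        int *d_a = 0;
        cudaMalloc((void**)&d_a, N * sizeof(int));
        cudaMemcpy(d_a, h_a, N * sizeof(int), cudaMemcpyHostToDevice);

        // Round the grid size up so every element is covered.
        dim3 dimBlock(blocksize);
        dim3 dimGrid((N + blocksize - 1) / blocksize);
        inc_gpu<<<dimGrid, dimBlock>>>(d_a, N);

        cudaMemcpy(h_a, d_a, N * sizeof(int), cudaMemcpyDeviceToHost);
        cudaFree(d_a);

        printf("h_a[0] = %d, h_a[%d] = %d\n", h_a[0], N - 1, h_a[N - 1]);   // expect 1 and 1000
        return 0;
    }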

Host Synchronization
All kernel launches are asynchronous
  Control returns to the CPU immediately
  Kernel executes after all previous CUDA calls have completed
cudaMemcpy() is synchronous
  Copy starts after all previous CUDA calls have completed
  Control returns to the CPU after the copy completes
cudaThreadSynchronize()
  Blocks until all previous CUDA calls complete

Device Synchronization
void __syncthreads();
  Synchronizes all threads in a block
  Generates a barrier synchronization instruction
  No thread can pass this barrier until all threads in the block reach it
  Used to avoid RAW / WAR / WAW hazards when accessing shared memory
  Allowed in conditional code only if the conditional is uniform across the entire thread block

    idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (blockIdx.x == blockToReverse)
    {
        sharedData[blockDim.x - (threadIdx.x + 1)] = a[idx];
        __syncthreads();
        a[idx] = sharedData[threadIdx.x];
    }

Matrix Multiplication
A simple matrix multiplication example that illustrates the basic features of memory and thread management in CUDA programs:
  Local, register usage
  Thread ID usage
  Memory data transfer API between host and device
  Leave shared memory usage until later

Matrix Multiplication
P = M * N
Each matrix is WIDTH * WIDTH
Data parallel
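The reverse fragment above assumes sharedData has already been declared. A fuller sketch of such a kernel (the __shared__ declaration, the fixed block size, the kernel name, and the assumption that the array covers every thread of the selected block are additions for illustration, not from the slides):

    #define BLOCK_SIZE 256

    // Reverses the elements handled by one selected block.
    // The condition blockIdx.x == blockToReverse is uniform across the block,
    // so placing __syncthreads() inside it is legal.
    __global__ void reverse_block(int *a, int blockToReverse)
    {
        __shared__ int sharedData[BLOCK_SIZE];

        int idx = blockDim.x * blockIdx.x + threadIdx.x;
        if (blockIdx.x == blockToReverse)
        {
            sharedData[blockDim.x - (threadIdx.x + 1)] = a[idx];
            __syncthreads();      // wait until all threads have written to shared memory
            a[idx] = sharedData[threadIdx.x];
        }
    }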

CPU Implementation
    void MatrixMulOnHost(float* M, float* N, float* P, int width)
    {
        for (int i = 0; i < width; i++)
            for (int j = 0; j < width; j++)
            {
                float sum = 0;
                for (int k = 0; k < width; k++)
                {
                    float a = M[i * width + k];
                    float b = N[k * width + j];
                    sum += a * b;
                }
                P[i * width + j] = sum;
            }
    }

CUDA Skeleton
    int main(void)
    {
        // 1. Allocate and initialize the matrices M, N, P
        //    I/O to read the input matrices M and N
        // 2. M * N on the device
        MatrixMulOnDevice(M, N, P, WIDTH);
        // 3. I/O to write the output matrix P
        //    Free matrices M, N, P
        return 0;
    }

Step 1: Data Transfer
    void MatrixMulOnDevice(float* M, float* N, float* P, int width)
    {
        int size = width * width * sizeof(float);
        float *d_M, *d_N, *d_P;

        // 1. Load M and N to device memory
        cudaMalloc((void**)&d_M, size);
        cudaMemcpy(d_M, M, size, cudaMemcpyHostToDevice);
        cudaMalloc((void**)&d_N, size);
        cudaMemcpy(d_N, N, size, cudaMemcpyHostToDevice);
        // Allocate P on the device
        cudaMalloc((void**)&d_P, size);

        // 2. Kernel invocation code

        // 3. Read P from the device
        cudaMemcpy(P, d_P, size, cudaMemcpyDeviceToHost);
        // Free device matrices
        cudaFree(d_M); cudaFree(d_N); cudaFree(d_P);
    }

Step 2: Implement Kernel
    __global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int width)
    {
        // 2D thread ID
        int tx = threadIdx.x;
        int ty = threadIdx.y;

        // Pvalue stores the d_P element computed by this thread
        float Pvalue = 0;
        for (int k = 0; k < width; k++)
        {
            float a = d_M[ty * width + k];
            float b = d_N[k * width + tx];
            Pvalue += a * b;
        }

        // Write the result to device memory; each thread writes one element
        d_P[ty * width + tx] = Pvalue;
    }
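One way to sanity-check the kernel, not shown in the slides, is to run the host version as a reference and compare the two results element-wise (the helper name VerifyResult and the tolerance are hypothetical):

    #include <math.h>
    #include <stdio.h>

    // Returns 1 if the GPU and CPU results agree within a small float tolerance.
    int VerifyResult(const float* P_gpu, const float* P_cpu, int width)
    {
        for (int i = 0; i < width * width; i++)
            if (fabsf(P_gpu[i] - P_cpu[i]) > 1e-3f)
            {
                printf("Mismatch at element %d: %f vs %f\n", i, P_gpu[i], P_cpu[i]);
                return 0;
            }
        return 1;
    }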

Step 3: Invoke Kernel
    void MatrixMulOnDevice(float* M, float* N, float* P, int width)
    {
        int size = width * width * sizeof(float);
        float *d_M, *d_N, *d_P;

        cudaMalloc((void**)&d_M, size);
        cudaMemcpy(d_M, M, size, cudaMemcpyHostToDevice);
        cudaMalloc((void**)&d_N, size);
        cudaMemcpy(d_N, N, size, cudaMemcpyHostToDevice);
        cudaMalloc((void**)&d_P, size);

        // Set up the execution configuration
        dim3 dimGrid(1, 1);
        dim3 dimBlock(width, width);

        // Launch the device computation threads
        MatrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P, width);

        cudaMemcpy(P, d_P, size, cudaMemcpyDeviceToHost);
        cudaFree(d_M); cudaFree(d_N); cudaFree(d_P);
    }

Simple Matrix Multiplication
One block of threads computes matrix d_P
Each thread computes one element of d_P
Each thread
  Loads a row of matrix d_M
  Loads a column of matrix d_N
  Performs one multiplication and one addition for each pair of d_M and d_N elements
Compute to off-chip memory access ratio is close to 1:1 (not very high)
Size of the matrix is limited by the number of threads allowed in a thread block
[Slide figure: Grid 1 contains a single Block 1; Thread (2, 2) computes one element of Pd from a row of Md and a column of Nd; each matrix is WIDTH x WIDTH.]
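To go beyond one block, each thread can combine blockIdx and threadIdx into a global row and column, as in the Unique Thread IDs slide. The multi-block variant below is a sketch that is not part of these slides (the kernel name, tile size, and bounds check are assumptions):

    #define TILE_WIDTH 16

    __global__ void MatrixMulKernelMultiBlock(float* d_M, float* d_N, float* d_P, int width)
    {
        // Global row and column computed from block and thread indices.
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;

        if (row < width && col < width)
        {
            float Pvalue = 0;
            for (int k = 0; k < width; k++)
                Pvalue += d_M[row * width + k] * d_N[k * width + col];
            d_P[row * width + col] = Pvalue;
        }
    }

    // Launch sketch: a 2D grid of TILE_WIDTH x TILE_WIDTH blocks covering the whole matrix.
    //   dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
    //   dim3 dimGrid((width + TILE_WIDTH - 1) / TILE_WIDTH,
    //                (width + TILE_WIDTH - 1) / TILE_WIDTH);
    //   MatrixMulKernelMultiBlock<<<dimGrid, dimBlock>>>(d_M, d_N, d_P, width);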