Parallel Computing. Lecture 19: CUDA - I


CSCI-UA.0480-003 Parallel Computing Lecture 19: CUDA - I Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com

Behind CUDA: [Figure: the CPU (host) alongside a GPU with its own local DRAM (the device). Source: http://hothardware.com/reviews/intel-core-i5-and-i7-processors-and-p55-chipset/?page=4]

Parallel Computing on a GPU: GPUs deliver 25 to 200+ GFLOPS on compiled parallel C applications and are available in laptops, desktops, and clusters (e.g., GeForce 8800, Tesla D870, Tesla S870). GPU parallelism is doubling every year, and the programming model scales transparently. GPUs are programmable in C with CUDA tools; the multithreaded SPMD model exploits application data parallelism and thread parallelism. [SPMD = Single Program Multiple Data]

CUDA (Compute Unified Device Architecture): an integrated host+device application is a single C program. Serial or modestly parallel parts run as host C code; highly parallel parts run as device SPMD kernel C code. Execution alternates: serial code on the host, then a parallel kernel on the device, e.g. KernelA<<< nBlk, nTid >>>(args); ..., then more serial host code, then another device kernel, e.g. KernelB<<< nBlk, nTid >>>(args); ...

Parallel Threads: a CUDA kernel is executed by an array of threads (SPMD). All threads run the same code (the "SP" in SPMD). Each thread has an ID that it uses to compute memory addresses and make control decisions. [Figure: threads 0, 1, 2, ..., 254, 255, each executing: i = blockIdx.x * blockDim.x + threadIdx.x; C_d[i] = A_d[i] + B_d[i];]

Thread Blocks: divide the monolithic thread array into multiple blocks. Threads within a block cooperate via shared memory, atomic operations, and barrier synchronization; threads in different blocks cannot cooperate. [Figure: three blocks, each with threads 0, 1, 2, ..., 254, 255, each thread executing: i = blockIdx.x * blockDim.x + threadIdx.x; C_d[i] = A_d[i] + B_d[i];]
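As a minimal sketch of within-block cooperation (an assumed example, not from the slides), the kernel below reverses one 256-element tile per block using shared memory and a barrier; it assumes it is launched with 256 threads per block and that the data length is a multiple of 256:

#define BLOCK_SIZE 256

__global__ void reverseTile(float* data)
{
    __shared__ float tile[BLOCK_SIZE];          // visible to every thread in this block
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;         // first element handled by this block

    tile[t] = data[base + t];                   // each thread loads one element
    __syncthreads();                            // barrier: wait until the whole tile is loaded

    data[base + t] = tile[BLOCK_SIZE - 1 - t];  // write the tile back reversed
}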

Kernel: launched by the host, very similar to a C function, executed on the device; all threads execute the same code in the kernel. Grid: 1D or 2D organization of blocks (3D available on recent GPUs); gridDim.x and gridDim.y give the size of the grid in number of blocks. Block: 1D, 2D, or 3D organization of threads; a block is assigned to an SM; blockDim.x, blockDim.y, and blockDim.z are the block dimensions counted in number of threads; blockIdx.x and blockIdx.y are the indices of the block within the grid. Thread: threadIdx.x and threadIdx.y are the indices of the thread within its block.
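As an illustrative sketch (not from the slides), the kernel below uses all four families of built-in variables just listed; the grid-stride loop lets a grid of any size cover an array of length n:

__global__ void scaleArray(float* A, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's global index
    int stride = gridDim.x * blockDim.x;             // total number of threads in the grid
    for (; i < n; i += stride)                       // grid-stride loop: handles n > total threads
        A[i] = a * A[i];
}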

IDs: each thread uses IDs to decide what data to work on. Block ID: 1D, 2D, or 3D. Thread ID: 1D, 2D, or 3D. This simplifies memory addressing when processing multidimensional data, e.g., image processing or solving PDEs on volumes. [Figure: the host launches Kernel 1 on Grid 1 with blocks (0,0), (1,0), (0,1), (1,1), and Kernel 2 on Grid 2; Block (1,1) is expanded to show its threads (0,0,0) through (3,1,1). Courtesy: NVIDIA, Figure 3.2, An Example of CUDA Thread Organization]
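For example (a hedged sketch, not from the slides), 2D block and thread IDs map naturally to pixel coordinates of an image stored in row-major order:

__global__ void invertImage(unsigned char* img, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // pixel x coordinate
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // pixel y coordinate
    if (col < width && row < height)                   // guard against partial blocks at the edges
        img[row * width + col] = 255 - img[row * width + col];
}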

A Simple Example: Vector Addition. [Figure: vectors A (A[0] ... A[N-1]) and B (B[0] ... B[N-1]) are added element by element to produce vector C (C[0] ... C[N-1]).]

A Simple Example: Vector Addition

// Compute vector sum C = A + B
void vecAdd(float* A, float* B, float* C, int n)
{
    for (int i = 0; i < n; i++)   // GPU friendly!
        C[i] = A[i] + B[i];
}

int main()
{
    // Memory allocation for A_h, B_h, and C_h
    // I/O to read A_h and B_h, N elements
    vecAdd(A_h, B_h, C_h, N);
}

A Simple Example: Vector Addition (outline of the device version)

#include <cuda.h>

void vecAdd(float* A, float* B, float* C, int n)
{
    int size = n * sizeof(float);
    float *A_d, *B_d, *C_d;

    // Part 1: allocate device memory for A, B, and C;
    //         copy A and B from host memory to device memory

    // Part 2: kernel launch code to have the device perform the actual vector addition

    // Part 3: copy C from device memory back to host memory;
    //         free the device vectors
}

CUDA Memory Model: global memory is the main means of communicating R/W data between host and device; its contents are visible to all threads; accesses have long latency. Device code can R/W per-thread registers and R/W per-grid global memory. We will cover more later. [Figure: the host accesses global memory; each block in the grid has its own shared memory, and each thread has its own registers.]

CPU & GPU Memory In CUDA, host and devices have separate memory spaces. If GPU and CPU are on the same chip, then they share memory space fusion

CUDA Device Memory Allocation: cudaMalloc() allocates an object in device global memory; it requires two parameters, the address of a pointer to the allocated object and the size of the allocated object. cudaFree() frees an object from device global memory; it takes a pointer to the freed object. [Figure: per-block shared memory, per-thread registers, and global memory accessible from the host.]

CUDA Device Memory Allocation Example:

int WIDTH = 64;
float* Md;
int size = WIDTH * WIDTH * sizeof(float);

cudaMalloc((void**)&Md, size);
cudaFree(Md);

CUDA Host-Device Data Transfer: cudaMemcpy() performs memory data transfer and requires four parameters: pointer to destination, pointer to source, number of bytes copied, and type of transfer (host to host, host to device, device to host, or device to device). An asynchronous variant (cudaMemcpyAsync) also exists. Important: cudaMemcpy() cannot be used to copy between different GPUs in a multi-GPU system.

CUDA Host-Device Data Transfer Example (arguments: destination pointer, source pointer, size in bytes, direction):

cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);

A Simple Example: Vector Addition

void vecAdd(float* A, float* B, float* C, int n)
{
    int size = n * sizeof(float);
    float *A_d, *B_d, *C_d;

    // 1. Allocate device memory for A, B, and C; transfer A and B to device memory
    cudaMalloc((void **) &A_d, size);
    cudaMemcpy(A_d, A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &B_d, size);
    cudaMemcpy(B_d, B, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &C_d, size);

    // 2. Kernel invocation code (to be shown later). How to launch a kernel?

    // 3. Transfer C from device to host
    cudaMemcpy(C, C_d, size, cudaMemcpyDeviceToHost);

    // Free device memory for A, B, C
    cudaFree(A_d); cudaFree(B_d); cudaFree(C_d);
}

void vecAdd(float* A, float* B, float* C, int n)
{
    // A_d, B_d, C_d allocations and copies omitted
    // Run ceil(n/256.0) blocks of 256 threads each
    vecAddKernel<<<ceil(n/256.0), 256>>>(A_d, B_d, C_d, n);   // <<<#blocks, #threads/block>>>
}

// Each thread performs one pair-wise addition
__global__ void vecAddKernel(float* A_d, float* B_d, float* C_d, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) C_d[i] = A_d[i] + B_d[i];
}

The Hello World of Parallel Programming: Matrix Multiplication

__device__ float DeviceFunc()   executed on: device    only callable from: device
__global__ void KernelFunc()    executed on: device    only callable from: host
__host__ float HostFunc()       executed on: host      only callable from: host

__global__ defines a kernel function and must return void. __device__ and __host__ can be used together. For functions executed on the device: no recursion, no static variable declarations inside the function, and no indirect function calls through pointers.
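A short illustration of the three qualifiers (an assumed example, not from the slides):

__device__ float square(float x)            // runs on the device, callable only from device code
{
    return x * x;
}

__global__ void squareAll(float* A, int n)  // kernel: runs on the device, launched from the host
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) A[i] = square(A[i]);
}

__host__ __device__ float twice(float x)    // compiled for both host and device
{
    return 2.0f * x;
}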

The Hello World of Parallel Programming: Matrix Multiplication. Data Parallelism: we can safely perform many arithmetic operations on the data structures in a simultaneous manner. [Figure: input matrices M and N and output matrix P, each WIDTH x WIDTH.]

The Hello World of Parallel Programming: Matrix Multiplication. C adopts a row-major placement approach when storing a 2D matrix in linear memory addresses. [Figure: a 4x4 matrix M and its linear layout: M0,0 M1,0 M2,0 M3,0, then M0,1 M1,1 M2,1 M3,1, and so on, one row after another.]
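In code, the row-major mapping means element (row, col) of a Width x Width matrix stored in a flat array lives at index row * Width + col (a small sketch, not from the slides):

float getElement(const float* M, int row, int col, int Width)
{
    return M[row * Width + col];   // row-major: rows are contiguous in memory
}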

The Hello World of Parallel Programming: Matrix Multiplication. A simple main function, executed on the host:
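The slide's code was not captured in the transcription; the following is a hypothetical sketch of such a host-side main, assuming the MatrixMulOnHost routine shown on the next slide:

#include <stdlib.h>

void MatrixMulOnHost(float* M, float* N, float* P, int Width);   // defined on the next slide

int main(void)
{
    int Width = 64;
    int size = Width * Width * sizeof(float);

    // Allocate the host matrices (error checking omitted)
    float* M = (float*) malloc(size);
    float* N = (float*) malloc(size);
    float* P = (float*) malloc(size);

    // ... initialize M and N, e.g. by reading them from a file ...

    MatrixMulOnHost(M, N, P, Width);

    free(M); free(N); free(P);
    return 0;
}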

The Hello World of Parallel Programming: Matrix Multiplication

// Matrix multiplication on the (CPU) host
void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
    for (int i = 0; i < Width; ++i)
        for (int j = 0; j < Width; ++j) {
            double sum = 0;
            for (int k = 0; k < Width; ++k) {
                double a = M[i * Width + k];
                double b = N[k * Width + j];
                sum += a * b;
            }
            P[i * Width + j] = sum;
        }
}

[Figure: element (i, j) of P is the dot product of row i of M and column j of N; all matrices are WIDTH x WIDTH.]

The Hello World of Parallel Programming: Matrix Multiplication. Launching MatrixMulKernel() from the host:
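The transcription does not include the host-side code itself; below is a hedged sketch of what such a wrapper typically looks like (the names Md, Nd, Pd, MatrixMulOnDevice and the single-block launch are assumptions, consistent with the kernel sketch after the next slide):

void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
{
    int size = Width * Width * sizeof(float);
    float *Md, *Nd, *Pd;

    // 1. Allocate device memory and copy the input matrices to the device
    cudaMalloc((void**)&Md, size);
    cudaMalloc((void**)&Nd, size);
    cudaMalloc((void**)&Pd, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

    // 2. Launch the kernel: one block of Width x Width threads
    //    (assumes Width * Width does not exceed the per-block thread limit)
    dim3 dimGrid(1, 1);
    dim3 dimBlock(Width, Width);
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

    // 3. Copy the result back and free device memory
    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
    cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
}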

The Hello World of Parallel Programming: Matrix Multiplication. The kernel function. [Figure: thread (tx, ty) reads row ty of Md and column tx of Nd, each of length WIDTH, and computes one element of Pd.]
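The kernel body itself is not in the transcription; here is a hedged sketch consistent with the figure, in which each thread computes one element of Pd from a row of Md and a column of Nd (assuming a single Width x Width block, as in the wrapper sketch above):

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    int tx = threadIdx.x;   // column of the Pd element this thread computes
    int ty = threadIdx.y;   // row of the Pd element this thread computes

    float sum = 0;
    for (int k = 0; k < Width; ++k)
        sum += Md[ty * Width + k] * Nd[k * Width + tx];   // dot product: row ty of Md, column tx of Nd

    Pd[ty * Width + tx] = sum;
}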

More On Specifying Dimensions

// Set up the execution configuration
dim3 dimGrid(x, y);
dim3 dimBlock(x, y, z);

// Launch the device computation threads!
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

Important: dimGrid and dimBlock are user defined; gridDim and blockDim are built-in, predefined variables accessible inside kernel functions.

Be Sure To Know: the maximum dimensions of a block, the maximum number of threads per block, the maximum dimensions of a grid, and the maximum number of blocks per grid.
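These limits are device dependent; as a sketch, they can be queried at runtime with cudaGetDeviceProperties() from the CUDA runtime API:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max block dimensions:  %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Max grid dimensions:   %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}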

Tools: integrated C programs with CUDA extensions are fed to the NVCC compiler, which splits them into host code and device code (PTX). The host code goes to the host C compiler/linker, the device code to the device just-in-time compiler, and the result runs on a heterogeneous computing platform with CPUs and GPUs.

Conclusions: data parallelism is the main source of scalability for parallel programs. Each CUDA source file can have a mixture of both host and device code. What we learned today about CUDA: KernelA<<< nBlk, nTid >>>(args), cudaMalloc(), cudaFree(), cudaMemcpy(), gridDim and blockDim, threadIdx.x and threadIdx.y, and dim3.