
Introduction to GPU programming (p. 1/17)

Overview (p. 2/17)
- GPUs & computing
- Principles of CUDA programming
- One good reference: David B. Kirk and Wen-mei W. Hwu, Programming Massively Parallel Processors, Morgan Kaufmann Publishers, 2010

New trends of microprocessors (p. 3/17)
- Since 2003, there have been two main trajectories for microprocessor design:
  - Multicore: a relatively small number of cores per chip, each core a full-fledged processor in the traditional sense
  - Many-core: a large number of much smaller and simpler cores
- The NVIDIA Tesla K40 GPU has 2880 cores (streaming processors); each is a heavily multithreaded, in-order, single-instruction-issue processor
- Peak performance of many-core GPUs can be 10-fold the peak performance of multicore CPUs
- GPUs also have larger memory bandwidth (simpler memory models and fewer legacy requirements)

NVIDIA's Fermi architecture (p. 4/17)
- Designed for GPU computing
- 16 streaming multiprocessors (SMs)
- 32 CUDA cores (streaming processors) in each SM (512 cores in total)
- Each streaming processor is massively threaded
- Six 64-bit DRAM memory interfaces
- Peak double-precision floating-point rate: 768 GFLOPs

NVIDIA's Kepler architecture (p. 5/17)
- Each streaming multiprocessor is more powerful than in Fermi
- Dynamic parallelism (nested kernels)
- Hyper-Q allows multiple CPU threads/processes to use a single GPU
- Peak double-precision floating-point rate of the K40: 1430 GFLOPs

Massive (but simple) parallelism (p. 6/17)
- One streaming processor is the most fundamental execution resource
- Simpler than a CPU core
- But capable of executing a large number of threads simultaneously, where the threads carry out the same instruction on different data elements: single-instruction-multiple-data (SIMD)
- A number of streaming processors constitute one streaming multiprocessor (SM)

CUDA (p. 7/17)
- CUDA: Compute Unified Device Architecture
- C-based programming model for GPUs
- Introduced together with the GeForce 8800
- Joint CPU/GPU execution (host/device)
- A CUDA program consists of one or more phases that are executed on either the host or the device
- The user needs to manage data transfer between CPU and GPU
- A CUDA program is a unified source code encompassing both host and device code

More about CUDA (p. 8/17)
- To a CUDA programmer, the computing system consists of a host (CPU) and one or more devices (GPUs)
- Data must be explicitly copied from host to device (and back)
- On the device there is so-called global memory
- Device global memory tends to have long access latencies and finite bandwidth
- On-chip memories: registers and shared memory (per SM) have limited capacity, but are much faster
- Registers are private to individual threads
- All threads within a thread block can access variables in shared memory
- Shared memory is an efficient means for threads to cooperate, by sharing input data and intermediate results (a small sketch of how these memory spaces appear in code follows below)
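As an illustration only (the kernel scale_kernel and its arguments are made up for this sketch, not taken from the slides): g_data lives in device global memory, idx and v live in per-thread registers, and s_buf is a per-block shared-memory array.

    __global__ void scale_kernel(float *g_data, float factor, int N)
    {
        // g_data points to device global memory (allocated with cudaMalloc on the host);
        // idx and v live in per-thread registers; s_buf is shared by all threads of the block.
        // Assumes the block size is at most 256 threads; staging through shared memory
        // serves no purpose here other than illustrating the memory spaces.
        __shared__ float s_buf[256];
        int idx = blockIdx.x * blockDim.x + threadIdx.x;

        float v = 0.0f;
        if (idx < N) v = g_data[idx];
        s_buf[threadIdx.x] = v;        // stage data in fast on-chip shared memory
        __syncthreads();               // make the staged data visible to the whole block
        if (idx < N) g_data[idx] = factor * s_buf[threadIdx.x];
    }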

Programming GPUs for computing (1) (p. 9/17)
- Kernel functions: the computational tasks on the GPU
- An application or library function may consist of one or more kernels
- Kernels can be written in C, extended with additional keywords to express parallelism
- Once compiled, each kernel makes use of many threads that execute the same program in parallel
- Multiple threads are grouped into thread blocks
  - All threads in a thread block run on a single SM
  - Within a thread block, threads cooperate and share memory
- A thread block is divided into warps of 32 threads
  - The warp is the fundamental unit of dispatch within an SM
- Thread blocks may execute in any order

Programming GPUs for computing (2) (p. 10/17)
- When a kernel is invoked on the host, a grid of parallel threads is generated on the device
- Threads in a grid are organized in a two-level hierarchy:
  - each grid consists of one or more thread blocks
  - all blocks in a grid have the same number of threads
  - each block has a unique 3D coordinate: blockIdx.x, blockIdx.y and blockIdx.z
  - each thread block is organized as a 3D array of threads: threadIdx.x, threadIdx.y, threadIdx.z
- The grid and thread block dimensions are set when a kernel is invoked
- A centralized hardware scheduler distributes the thread blocks to the SMs
- (a sketch of how these indices map to data elements follows below)
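As a minimal sketch of how these built-in indices are typically combined (the kernel add2d and its arguments are made up for this illustration, not taken from the slides), each thread computes its global 2D coordinates and, from them, the data element it owns:

    __global__ void add2d(float *a, const float *b, int nx, int ny)
    {
        // Global 2D coordinates of this thread within the whole grid
        int ix = blockIdx.x * blockDim.x + threadIdx.x;
        int iy = blockIdx.y * blockDim.y + threadIdx.y;

        if (ix < nx && iy < ny) {
            int idx = iy * nx + ix;   // row-major index into a contiguous 2D array
            a[idx] += b[idx];
        }
    }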

Programming GPUs for computing (3) (p. 11/17)
- Syntax for invoking a kernel:

    dim3 dimGrid(64,32,1);
    dim3 dimBlock(4,2,2);
    KernelFunction<<<dimGrid, dimBlock>>>(...);

- Inside the kernel, gridDim and blockDim contain the dimension info
- All threads in a block share the same blockIdx
- Each thread has its unique threadIdx within a block
- The values of blockIdx and threadIdx can be used to determine which data element(s) a thread is to work on
- Often, one thread is used to compute one data element (fine-grained parallelism)

Thread execution (p. 12/17)
- Launching a CUDA kernel will generate a 1D, 2D or 3D array of thread blocks, each having a 1D, 2D or 3D array of threads
- The thread blocks can execute in any order relative to each other
- The CUDA runtime system bundles several threads for simultaneous execution, by partitioning each thread block into warps (32 threads)
- Scheduling of warps is taken care of by the CUDA runtime system
- The hardware executes the same instruction for all threads in the same warp
- If-tests can cause thread divergence, which will require multiple passes over the divergent paths (involving all threads of a warp); a small example follows below
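To illustrate divergence (the kernel diverge_kernel and its branching pattern are invented for this sketch, not taken from the slides): odd and even threads of the same warp take different branches, so the hardware must run the two paths one after the other for that warp.

    __global__ void diverge_kernel(float *a, int N)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N) {
            // Threads of the same warp disagree on this test,
            // so both branches are executed in separate passes.
            if (threadIdx.x % 2 == 0)
                a[idx] = a[idx] * a[idx];
            else
                a[idx] = a[idx] + 1.0f;
        }
    }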

Simple example of CUDA program (p. 13/17)
- Use the GPU to calculate the square of each element of an array

    // Kernel that executes on the CUDA device.
    // 1D thread blocks and a 1D thread array inside each block;
    // N is the length of the data array a, stored in device memory.
    __global__ void square_array(float *a, int N)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N)
            a[idx] = a[idx] * a[idx];
    }

Simple example of CUDA program (cont'd) (p. 14/17)

    // main routine that executes on the CPU host
    int main(void)
    {
        float *a_h, *a_d;                   // Pointers to host & device arrays
        const int N = 10;                   // Number of elements in arrays
        size_t size = N * sizeof(float);
        a_h = (float *)malloc(size);        // Allocate array on host
        cudaMalloc((void **) &a_d, size);   // Allocate array on device
        // Initialize host array and copy it to CUDA device
        for (int i=0; i<N; i++) a_h[i] = (float)i;
        cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
        // Do calculation on device:
        int block_size = 4;
        int n_blocks = N/block_size + (N%block_size == 0 ? 0 : 1);
        square_array <<< n_blocks, block_size >>> (a_d, N);
        // Retrieve result from device and store it in host array
        cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
        // Print results
        for (int i=0; i<N; i++) printf("%d %f\n", i, a_h[i]);
        // Cleanup
        free(a_h);
        cudaFree(a_d);
    }
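The slides omit error handling. In practice, the CUDA runtime calls above return a cudaError_t that is worth checking; a minimal sketch follows (the CHECK macro is our own convention, not part of CUDA):

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    // Abort with a readable message if a CUDA runtime call fails.
    #define CHECK(call)                                                \
        do {                                                           \
            cudaError_t err = (call);                                  \
            if (err != cudaSuccess) {                                  \
                fprintf(stderr, "CUDA error %s at %s:%d\n",            \
                        cudaGetErrorString(err), __FILE__, __LINE__);  \
                exit(EXIT_FAILURE);                                    \
            }                                                          \
        } while (0)

    // Example usage:
    //   CHECK(cudaMalloc((void **)&a_d, size));
    //   CHECK(cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice));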

Matrix multiplication, example 1 (p. 15/17)
- We want to compute Q = M N, assuming Q, M, N are all square matrices of the same size Width x Width
- Each matrix has a 1D contiguous (row-major) data storage
- Naive kernel implementation (a possible host-side launch configuration is sketched after the code):

    __global__ void MatrixMulKernel(float* Md, float* Nd, float* Qd, int Width)
    {
        int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
        int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;
        float Qvalue = 0;
        for (int k=0; k<Width; ++k)
            Qvalue += Md[Row*Width+k] * Nd[k*Width+Col];
        Qd[Row*Width+Col] = Qvalue;
    }
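The slides do not show how this kernel is launched. A plausible host-side configuration, assuming TILE_WIDTH is a compile-time constant (say 16), Width is a multiple of TILE_WIDTH, and Md, Nd, Qd already point to device memory filled as on p. 14/17, is a 2D grid of TILE_WIDTH x TILE_WIDTH thread blocks; the helper launch_matmul below is hypothetical, and the same configuration also fits the tiled kernel on p. 17/17.

    #define TILE_WIDTH 16   // assumed tile/block size; not fixed by the slides

    // Hypothetical helper: launch MatrixMulKernel for Width x Width matrices.
    void launch_matmul(float *Md, float *Nd, float *Qd, int Width)
    {
        dim3 dimBlock(TILE_WIDTH, TILE_WIDTH, 1);             // one thread per output element of a tile
        dim3 dimGrid(Width/TILE_WIDTH, Width/TILE_WIDTH, 1);  // one block per output tile
        MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Qd, Width);
        cudaDeviceSynchronize();   // wait for the kernel to finish before using Qd
    }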

Matrix multiplication, example 2 (p. 16/17)
- The previous implementation is not memory efficient
- Each thread reads 2 x Width values from global memory
- A better approach is to let a patch of TILE_WIDTH x TILE_WIDTH threads share the 2 x TILE_WIDTH x Width data reads
- That is, each thread reads only 2 x Width/TILE_WIDTH values from global memory
- For example, with Width = 1024 and TILE_WIDTH = 16, this cuts each thread's global-memory reads from 2048 to 128
- Shared memory is important for performance!
- Also beware that the size of shared memory is limited

Matrix multiplication, example 2 (cont'd) (p. 17/17)

    __global__ void MatrixMulKernel(float* Md, float* Nd, float* Qd, int Width)
    {
        __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
        __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];
        int bx = blockIdx.x;  int by = blockIdx.y;
        int tx = threadIdx.x; int ty = threadIdx.y;
        int Row = by * TILE_WIDTH + ty;
        int Col = bx * TILE_WIDTH + tx;
        float Qvalue = 0;
        // Loop over the tiles of Md and Nd needed for this output tile
        // (assumes Width is a multiple of TILE_WIDTH)
        for (int m = 0; m < Width/TILE_WIDTH; ++m) {
            Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
            Nds[ty][tx] = Nd[(m*TILE_WIDTH + ty)*Width + Col];
            __syncthreads();   // wait until the whole tile is loaded
            for (int k=0; k<TILE_WIDTH; ++k)
                Qvalue += Mds[ty][k] * Nds[k][tx];
            __syncthreads();   // wait before the tile is overwritten in the next iteration
        }
        Qd[Row*Width+Col] = Qvalue;
    }