Introduction to GPU Programming

Introduction to GPU Programming
Mubashir Adnan Qureshi
http://www.ncsa.illinois.edu/people/kindr/projects/hpca/files/singapore_p1.pdf
http://developer.download.nvidia.com/cuda/training/nvidia_gpu_computing_webinars_cuda_memory_optimization.pdf

Tutorial Goals
- NVIDIA GPU architecture
- NVIDIA GPU application development flow
- Write and run simple NVIDIA GPU kernels in CUDA
- Be aware of performance-limiting factors and understand performance tuning strategies

Introduction
- Why use Graphics Processing Units (GPUs) for general-purpose computing
- Modern GPU architecture
- NVIDIA GPU programming overview
  - CUDA C
  - OpenCL

GPU vs. CPU Silicon Use (graph courtesy of NVIDIA)

NVIDIA GPU Architecture (figure courtesy of NVIDIA)
- N multiprocessors, called SMs
- Each SM has M cores, called SPs
- SIMD: the same instruction is executed on the SPs
- Device memory is shared across all SMs

NVIDIA GeForce 9400M G GPU (block diagram of the TPC, SMs, texture units, 128-bit interconnect, L2/ROP, and DRAM omitted)
- 16 streaming processors arranged as 2 streaming multiprocessors
- At 0.8 GHz this provides 54 GFLOPS in single precision (SP)
- 128-bit interface to off-chip GDDR3, 21 GB/s bandwidth

NVIDIA Tesla C1060 GPU (block diagram of the 10 TPCs, SMs, texture units, 512-bit interconnect, L2/ROP, and DRAM omitted)
- 240 streaming processors arranged as 30 streaming multiprocessors
- At 1.3 GHz this provides 1 TFLOPS SP and 86.4 GFLOPS DP
- 512-bit interface to off-chip GDDR3, 102 GB/s bandwidth

NVIDIA Tesla S1070 Computing Server (diagram courtesy of NVIDIA)
- 4 Tesla T10 GPUs, each with 4 GB of GDDR3 SDRAM
- GPUs connected in pairs through NVIDIA switches to PCIe x16
- Power supply, thermal management, and system monitoring in the same enclosure

GPU Use/Programming
- GPU libraries
  - NVIDIA's CUDA BLAS and FFT libraries
  - Many 3rd-party libraries
- Low-abstraction, lightweight GPU programming toolkits
  - CUDA C
  - OpenCL
(A cuBLAS sketch of the library route follows below.)
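As a minimal sketch of the library route, assuming cuBLAS (the names dx, dy and the vector length are illustrative, not from the slides), a single-precision AXPY (y = alpha*x + y) can run on the GPU without writing any kernel code:

    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        const int n = 1024;
        float alpha = 2.0f;
        float x[1024], y[1024];
        for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

        float *dx, *dy;                                 // device copies of x and y
        cudaMalloc((void**)&dx, n * sizeof(float));
        cudaMalloc((void**)&dy, n * sizeof(float));
        cudaMemcpy(dx, x, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dy, y, n * sizeof(float), cudaMemcpyHostToDevice);

        cublasHandle_t handle;                          // cuBLAS library context
        cublasCreate(&handle);
        cublasSaxpy(handle, n, &alpha, dx, 1, dy, 1);   // y = alpha*x + y, computed on the GPU
        cublasDestroy(handle);

        cudaMemcpy(y, dy, n * sizeof(float), cudaMemcpyDeviceToHost);
        printf("y[0] = %f\n", y[0]);                    // expect 4.0
        cudaFree(dx); cudaFree(dy);
        return 0;
    }

The file is compiled with nvcc and linked against the library (for example with -lcublas); the library launches its own kernels internally.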

nvcc
- Any source file containing CUDA C language extensions must be compiled with nvcc
- nvcc is a compiler driver that invokes many other tools to accomplish the job
- Basic nvcc usage:
  - nvcc <filename>.cu [-o <executable>]   builds release mode
  - nvcc -deviceemu <filename>.cu   builds device emulation mode (all code runs on the CPU)
  - nvprof <executable>   profiles the code

Anatomy of a GPU Application: host side and device side

Reference CPU Version

    // computational kernel
    void vecAdd(int N, float* A, float* B, float* C)
    {
        for (int i = 0; i < N; i++)
            C[i] = A[i] + B[i];
    }

    int main(int argc, char **argv)
    {
        int N = 16384;                                 // default vector size

        // memory allocation
        float *A = (float*)malloc(N * sizeof(float));
        float *B = (float*)malloc(N * sizeof(float));
        float *C = (float*)malloc(N * sizeof(float));

        vecAdd(N, A, B, C);                            // call compute kernel

        // memory de-allocation
        free(A); free(B); free(C);
    }

Adding GPU Support (diagram: vectors A, B, and C in host memory are mirrored by gA, gB, and gC in the GPU card's device memory)

Memory Spaces
- CPU and GPU have separate memory spaces
- Data is moved across the PCIe bus
- Use functions to allocate/set/copy memory on the GPU
- Host (CPU) manages device (GPU) memory:
  - cudaMalloc(void** pointer, size_t nbytes)
  - cudaFree(void* pointer)
  - cudaMemcpy(void* dst, void* src, size_t nbytes, enum cudaMemcpyKind direction);
    - returns after the copy is complete
    - blocks the CPU thread until all bytes have been copied
    - does not start copying until previous CUDA calls complete
  - enum cudaMemcpyKind: cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice
(A round-trip sketch of these calls follows below.)
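A minimal sketch of the allocate/copy/free round trip described above (the buffer name h_data and its length are illustrative, not from the slides):

    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        const int N = 1024;
        size_t nbytes = N * sizeof(float);
        float h_data[1024];                          // host buffer
        for (int i = 0; i < N; i++) h_data[i] = (float)i;

        float *d_data;                               // device buffer
        cudaMalloc((void**)&d_data, nbytes);         // allocate on the GPU
        cudaMemcpy(d_data, h_data, nbytes, cudaMemcpyHostToDevice);   // host -> device
        // ... kernel launches would go here ...
        cudaMemcpy(h_data, d_data, nbytes, cudaMemcpyDeviceToHost);   // device -> host
        cudaFree(d_data);                            // release device memory

        printf("h_data[10] = %f\n", h_data[10]);
        return 0;
    }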

Adding GPU Support

    int main(int argc, char **argv)
    {
        int N = 16384;                                // default vector size
        float *A = (float*)malloc(N * sizeof(float));
        float *B = (float*)malloc(N * sizeof(float));
        float *C = (float*)malloc(N * sizeof(float));

        // memory allocation on the GPU card
        float *devPtrA, *devPtrB, *devPtrC;
        cudaMalloc((void**)&devPtrA, N * sizeof(float));
        cudaMalloc((void**)&devPtrB, N * sizeof(float));
        cudaMalloc((void**)&devPtrC, N * sizeof(float));

        // copy data from the CPU (host) to the GPU (device)
        cudaMemcpy(devPtrA, A, N * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(devPtrB, B, N * sizeof(float), cudaMemcpyHostToDevice);

Adding GPU Support (continued)

        // kernel invocation
        vecAdd<<<N/512, 512>>>(devPtrA, devPtrB, devPtrC);

        // copy results from the device to the host
        cudaMemcpy(C, devPtrC, N * sizeof(float), cudaMemcpyDeviceToHost);

        // device memory de-allocation
        cudaFree(devPtrA);
        cudaFree(devPtrB);
        cudaFree(devPtrC);

        free(A); free(B); free(C);
    }

GPU Kernel

    // CPU version
    void vecAdd(int N, float* A, float* B, float* C)
    {
        for (int i = 0; i < N; i++)
            C[i] = A[i] + B[i];
    }

    // GPU version
    __global__ void vecAdd(float* A, float* B, float* C)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        C[i] = A[i] + B[i];
    }

CUDA Programming Model
- A CUDA kernel is executed by an array of threads
- All threads run the same code (SIMD)
- Each thread has an ID (threadID) that it uses to compute memory addresses and make control decisions:

    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;

- Threads are arranged as a grid of thread blocks
- Threads within a block have access to a segment of shared memory
(Diagram: a grid of thread blocks 0 to N-1, each with its own shared memory. A complete kernel built around the fragment above is sketched below.)
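As a minimal sketch, the fragment above written as a complete kernel; the __device__ helper func and the bounds parameter n are illustrative assumptions, not part of the slides:

    // func() stands in for any per-element computation
    __device__ float func(float x)
    {
        return 2.0f * x + 1.0f;
    }

    __global__ void transform(const float* input, float* output, int n)
    {
        // global thread ID built from the block and thread indices
        int threadID = blockIdx.x * blockDim.x + threadIdx.x;
        if (threadID < n) {            // guard in case n is not a multiple of the block size
            float x = input[threadID];
            float y = func(x);
            output[threadID] = y;
        }
    }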

Kernel Invocation Syntax

    vecAdd<<<32, 512>>>(devPtrA, devPtrB, devPtrC);

- The <<<...>>> arguments give the grid and thread-block dimensionality: 32 thread blocks in the grid, 512 threads per block
- Inside the kernel:

    int i = blockIdx.x * blockDim.x + threadIdx.x;

  where blockIdx.x is the block ID within the grid, blockDim.x is the number of threads per block, and threadIdx.x is the thread ID within the thread block
(A launch pattern for sizes that are not an exact multiple of the block size is sketched below.)
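The earlier launch uses N/512 blocks, which works because N = 16384 is an exact multiple of 512 (16384/512 = 32, matching the <<<32, 512>>> above). As a hedged sketch of the more general pattern (the variable names are illustrative), the block count is rounded up and the kernel is given N so it can guard against out-of-range indices, as in the transform sketch earlier:

    int threadsPerBlock = 512;
    int numBlocks = (N + threadsPerBlock - 1) / threadsPerBlock;   // round up so every element is covered
    vecAdd<<<numBlocks, threadsPerBlock>>>(devPtrA, devPtrB, devPtrC);

Note that the vecAdd kernel shown on the previous slides takes no N parameter and has no bounds check, so this rounding is only safe once the kernel is extended with an if (i < N) guard.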

Mapping Threads to the Hardware (slide courtesy of NVIDIA)
- Blocks of threads are transparently assigned to SMs
- A block of threads executes on one SM and does not migrate
- Several blocks can reside concurrently on one SM
- Blocks must be independent:
  - Any possible interleaving of blocks should be valid
  - Blocks may coordinate but not synchronize
  - Thread blocks can run in any order
(Diagram: the same kernel grid of blocks 0-7 runs on one device as consecutive waves of two blocks and on a larger device as waves of four blocks; each block can execute in any order relative to other blocks.)

GPU Memory Hierarchy: global (device) memory
- Accessible by all threads as well as the host (CPU)
- Data lifetime is from allocation to deallocation
(Diagram: the host uses cudaMemcpy() to move data to and from the global memory of Device 0 and Device 1.)

GPU Memory Hierarchy: global (device) memory, continued
(Diagram: thread blocks 0 to N-1 of Kernel 0 and of Kernel 1 all access the same per-device global memory, so data left in global memory by one kernel is visible to the next.)

GPU Memory Hierarchy: local storage and shared memory
- Local storage
  - Each thread has its own local storage
  - Mostly registers (managed by the compiler)
  - Data lifetime = thread lifetime
- Shared memory
  - Each thread block has its own shared memory
  - Accessible only by threads within that block
  - Data lifetime = block lifetime
(A shared-memory kernel sketch follows below.)
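A minimal sketch of per-block shared memory in use; the kernel, its name, and the tile size of 256 are illustrative assumptions. Each block stages its elements into shared memory, synchronizes, and writes them back in reversed order within the block:

    #define TILE 256

    // Reverses each TILE-sized segment of `in` into `out`.
    // Assumes the array length is a multiple of TILE and the kernel is
    // launched with TILE threads per block, e.g. reverseTile<<<len/TILE, TILE>>>(d_in, d_out);
    __global__ void reverseTile(const float* in, float* out)
    {
        __shared__ float tile[TILE];                  // one copy per thread block, block lifetime
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        tile[threadIdx.x] = in[i];                    // each thread stages one element
        __syncthreads();                              // wait until the whole block has written

        out[i] = tile[blockDim.x - 1 - threadIdx.x];  // read back in reversed order within the block
    }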

GPU Memory Hierarchy: summary
(Diagram: the host side has its CPU, chipset, and DRAM; on the device side, the GPU's DRAM holds the local, global, constant, and texture spaces, while each multiprocessor has registers, shared memory, and constant/texture caches.)

    Memory    Location  Cached  Access  Scope                    Lifetime
    Register  On-chip   N/A     R/W     One thread               Thread
    Local     Off-chip  No      R/W     One thread               Thread
    Shared    On-chip   N/A     R/W     All threads in a block   Block
    Global    Off-chip  No      R/W     All threads + host       Application
    Constant  Off-chip  Yes     R       All threads + host       Application
    Texture   Off-chip  Yes     R       All threads + host       Application

(A constant-memory sketch follows below.)
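For illustration, one way the Constant row of the table is used; the symbol name coeffs, the kernel, and the host-side call are a sketch rather than something from the slides:

    // read-only coefficients visible to all threads, served from the on-chip constant cache
    __constant__ float coeffs[4];

    __global__ void polyEval(const float* x, float* y, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float v = x[i];
            // evaluate coeffs[0] + coeffs[1]*v + coeffs[2]*v^2 + coeffs[3]*v^3
            y[i] = coeffs[0] + v * (coeffs[1] + v * (coeffs[2] + v * coeffs[3]));
        }
    }

    // host side: fill constant memory before launching the kernel
    // float h_coeffs[4] = {1.0f, 0.5f, 0.25f, 0.125f};
    // cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));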

GPU Kernel (recap)

    // CPU version
    void vecAdd(int N, float* A, float* B, float* C)
    {
        for (int i = 0; i < N; i++)
            C[i] = A[i] + B[i];
    }

    // GPU version
    __global__ void vecAdd(float* A, float* B, float* C)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        C[i] = A[i] + B[i];
    }

Optimizing Algorithms for GPUs
- Maximize independent parallelism
- Maximize arithmetic intensity (math/bandwidth)
- Sometimes it is better to recompute than to cache: the GPU spends its transistors on ALUs, not memory
- Do more computation on the GPU to avoid costly data transfers
  - Even low-parallelism computations can sometimes be faster than transferring data back and forth to the host

Optimize Memory Access
- Coalesced vs. non-coalesced access to global/local device memory = an order of magnitude difference
- Coalescing: contiguous threads accessing contiguous memory (see the sketch below)
- Shared memory
  - Hundreds of times faster than global memory
  - Threads can cooperate via shared memory
  - Use it to avoid non-coalesced access
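As a minimal sketch of the coalescing point above (the kernel names and the stride are illustrative assumptions): in the first kernel, consecutive threads read consecutive addresses, so each warp's loads coalesce into a few wide memory transactions; in the second, neighbouring threads read addresses far apart, scattering one warp's loads over many memory segments:

    // Coalesced: thread i reads element i, so neighbouring threads in a warp
    // touch neighbouring addresses and the hardware merges them into wide loads.
    __global__ void copyCoalesced(const float* in, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];
    }

    // Non-coalesced: neighbouring threads read addresses `stride` floats apart,
    // so one warp's loads are spread over many memory segments.
    __global__ void copyStrided(const float* in, float* out, int n, int stride)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[(i * stride) % n];   // e.g. stride = 32 gives badly scattered accesses
    }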