Lecture 8: GPU Programming. CSE599G1: Spring 2017


Announcements: Project proposal is due Thursday (4/28) at 5pm. Assignment 2 will be out today and is due in two weeks: implement GPU kernels and use the cuBLAS library; infer output shapes and do memory planning.

Overview: GPU architecture, the CUDA programming model, and case studies of efficient GPU kernels.

CPU vs GPU
(figure: CPU pipeline vs GPU) In a CPU, every instruction moves through fetch, decode, execute, and write-back stages, so much of the chip area and energy budget goes to instruction handling rather than to the computation itself: too much overhead in compute resources and energy efficiency. Vector operations (SSE / AVX) amortize this overhead across several data elements. A GPU pushes the idea further: it is a specialized accelerator in which one decode/schedule front end feeds many execution units, so most of the hardware does arithmetic.

Streaming Multiprocessor (SM): decodes and schedules the next instructions, and contains registers, single-precision (SP) float cores, double-precision (DP) float cores, load/store units, special function units, and multiple caches.

GPU Architecture

Theoretical peak FLOPS comparison * https://github.com/oxford-cs-deepnlp-2017/lectures/blob/master/lecture%206%20-%20nvidia%20rnns%20and%20gpus.pdf

Memory Hierarchy
CPU memory hierarchy: each core has its own registers and L1 cache, plus a per-core L2 cache; all cores share an L3 cache and DRAM.
GPU memory hierarchy: each SM has its own registers, L1 cache, shared memory, and read-only cache; all SMs share an L2 cache and the GPU DRAM.

Intel Xeon E7-8870v4 (price: $12,000)    Titan X Pascal (price: $1,200)
Cores: 20                                SMs: 28, with 128 cores / SM
Registers / core: ??                     Registers / SM: 256 KB
L1 cache / core: 32 KB                   L1 cache / SM: 48 KB
L2 cache / core: 256 KB                  Shared memory / SM: 64 KB
L3 cache: 50 MB                          L2 cache: 3 MB
DRAM: 100s of GB                         GPU DRAM: 12 GB

Two things stand out: an SM has more register capacity than L1 cache, and the L1-level shared memory is controlled by the programmer rather than by hardware.

GPU Memory Latency (for the Nvidia Maxwell architecture):
Registers: 0 cycles (~20 cycles for a read-after-write dependency)
L1 / texture cache: 92 cycles
Shared memory: 28 cycles
Constant L1 cache: 28 cycles
L2 cache: 200 cycles
DRAM: 350 cycles
* http://lpgpu.org/wp/wp-content/uploads/2013/05/poster_andresch_acaces2014.pdf

Memory bandwidth comparison * https://github.com/oxford-cs-deepnlp-2017/lectures/blob/master/lecture%206%20-%20nvidia%20rnns%20and%20gpus.pdf

Nvidia GPU Comparison
GPU                      Tesla K40 (2014)     Titan X (2015)       Titan X (2016)
Architecture             Kepler GK110         Maxwell GM200        Pascal GP102
Number of SMs            15                   24                   28
CUDA cores               2880 (192 x 15 SMs)  3072 (128 x 24 SMs)  3584 (128 x 28 SMs)
Max clock rate           875 MHz              1177 MHz             1531 MHz
FP32 GFLOPS              5040                 7230                 10970
32-bit registers / SM    64 K (256 KB)        64 K (256 KB)        64 K (256 KB)
Shared memory / SM       16 KB / 48 KB        96 KB                64 KB
L2 cache (total)         1.5 MB               3 MB                 3 MB
Global DRAM              12 GB                12 GB                12 GB
Power                    235 W                250 W                250 W
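Most of these figures can also be queried at runtime through the CUDA runtime API; a small sketch (not from the lecture) that prints them for each device:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int count = 0;
  cudaGetDeviceCount(&count);
  for (int dev = 0; dev < count; ++dev) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, dev);
    printf("Device %d: %s\n", dev, prop.name);
    printf("  SMs: %d, warp size: %d\n", prop.multiProcessorCount, prop.warpSize);
    printf("  32-bit registers / SM: %d\n", prop.regsPerMultiprocessor);
    printf("  Shared memory / SM: %zu KB\n", prop.sharedMemPerMultiprocessor / 1024);
    printf("  L2 cache: %d KB\n", prop.l2CacheSize / 1024);
    printf("  Global memory: %zu MB\n", prop.totalGlobalMem / (1024 * 1024));
  }
  return 0;
}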

CUDA Programming Model

Thread Hierarchy: The programmer writes code for a single thread as a simple C program. Threads are grouped into blocks; threads within the same block can synchronize their execution. Blocks are grouped into a grid; blocks are scheduled independently on the GPU and can be executed in any order. All threads execute the same code but can take different paths. A kernel is executed as a grid of blocks of threads.
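As a concrete illustration (not from the original slides; the kernel process2D and the wrapper launch are hypothetical names), a 2D launch configuration maps onto this hierarchy as follows:

// Hypothetical 2D kernel: each thread handles one (row, col) element.
__global__ void process2D(float* data, int width, int height) {
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  if (row < height && col < width) {
    data[row * width + col] *= 2.0f;   // some per-element work
  }
}

// Host-side launch: a grid of 2D blocks covering a width x height array.
void launch(float* d_data, int width, int height) {
  dim3 threads(16, 16);                               // 256 threads per block
  dim3 blocks((width  + threads.x - 1) / threads.x,
              (height + threads.y - 1) / threads.y);
  process2D<<<blocks, threads>>>(d_data, width, height);
}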

Kernel Execution: Each block is executed by one SM and does not migrate. Several concurrent blocks can reside on one SM, depending on the blocks' memory requirements and the SM's memory resources. For example, a kernel (grid) of 8 blocks runs 4 blocks per SM on a GPU with 2 SMs, and 2 blocks per SM on a GPU with 4 SMs.

Kernel Execution: A warp consists of 32 threads and is the basic scheduling unit of kernel execution; a thread block is divided into 32-thread warps. Each cycle, a warp scheduler selects one ready warp from the SM's thread block pool and dispatches it to the CUDA cores for execution.
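As an aside (added here, not on the original slide), a thread can compute which warp and lane it belongs to from its index within the block; warpInfo is a hypothetical kernel name:

#include <cstdio>

__global__ void warpInfo() {
  // Threads 0..31 of a block form warp 0, threads 32..63 form warp 1, and so on.
  int warp_id = threadIdx.x / warpSize;   // warpSize is 32 on current GPUs
  int lane_id = threadIdx.x % warpSize;   // position of the thread inside its warp
  if (lane_id == 0) {
    printf("block %d, warp %d starts at thread %d\n", blockIdx.x, warp_id, threadIdx.x);
  }
}

Launching warpInfo<<<2, 128>>>() would print one line per warp: each block of 128 threads contains four warps.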

Thread Hierarchy & Memory Hierarchy: Each thread has its own registers and local memory. Threads in the same thread block share the block's shared memory. All blocks in the grid can access global memory (GPU DRAM).
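A small sketch (assumed for illustration; memorySpaces is not from the lecture) showing where kernel variables live in this hierarchy, assuming a launch with 256 threads per block:

__global__ void memorySpaces(const float* in, float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;   // 'i' lives in a register (thread-private)
  __shared__ float tile[256];                      // shared memory: one copy per thread block
  if (i < n) {
    tile[threadIdx.x] = in[i];                     // 'in'/'out' point into global memory (GPU DRAM)
  }
  __syncthreads();
  if (i < n) {
    out[i] = 2.0f * tile[threadIdx.x];
  }
}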

Example: Vector Add — compute the vector sum C = A + B.

CPU version:

// compute vector sum C = A + B
void vecAdd_cpu(const float* A, const float* B, float* C, int n) {
  for (int i = 0; i < n; ++i)
    C[i] = A[i] + B[i];
}

CUDA kernel (device code):

__global__ void vecAddKernel(const float* A, const float* B, float* C, int n) {
  int i = blockDim.x * blockIdx.x + threadIdx.x;
  if (i < n) {
    C[i] = A[i] + B[i];
  }
}

Example: Vector Add — thread indexing. Suppose each block includes only 4 threads (blockDim.x = 4); with 3 blocks we cover 12 elements:

global index   0  1  2  3  4  5  6  7  8  9 10 11
threadIdx.x    0  1  2  3  0  1  2  3  0  1  2  3
blockIdx.x     0  0  0  0  1  1  1  1  2  2  2  2

__global__ void vecAddKernel(const float* A, const float* B, float* C, int n) {
  int i = blockDim.x * blockIdx.x + threadIdx.x;  // compute the global index
  if (i < n) {
    C[i] = A[i] + B[i];  // each thread performs only one pair-wise addition
  }
}

Example: Vector Add (Host)

#define THREADS_PER_BLOCK 512

void vecAdd(const float* A, const float* B, float* C, int n) {
  float *d_A, *d_B, *d_C;
  int size = n * sizeof(float);

  cudaMalloc((void **) &d_A, size);
  cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
  cudaMalloc((void **) &d_B, size);
  cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);
  cudaMalloc((void **) &d_C, size);

  // launch the GPU kernel asynchronously
  int nblocks = (n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
  vecAddKernel<<<nblocks, THREADS_PER_BLOCK>>>(d_A, d_B, d_C, n);

  cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);
  cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}
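A minimal host driver (not part of the slides) that could sit in the same .cu file to exercise vecAdd and spot-check the result:

#include <cstdio>
#include <cstdlib>

int main() {
  const int n = 1 << 20;
  float *A = (float*)malloc(n * sizeof(float));
  float *B = (float*)malloc(n * sizeof(float));
  float *C = (float*)malloc(n * sizeof(float));
  for (int i = 0; i < n; ++i) { A[i] = (float)i; B[i] = 2.0f * i; }

  vecAdd(A, B, C, n);   // host wrapper defined above

  // spot-check a few results
  for (int i = 0; i < n; i += n / 4) {
    if (C[i] != A[i] + B[i]) { printf("mismatch at %d\n", i); return 1; }
  }
  printf("vector add OK\n");
  free(A); free(B); free(C);
  return 0;
}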

Example: Sliding Window Sum. Consider computing the sum of a sliding window over a vector: each output element is the sum of the input elements within a radius of its position (an image blur kernel works the same way). If the radius is 3, each output element is the sum of 7 input elements.

A naive implementation

#define RADIUS 3

__global__ void windowSumNaiveKernel(const float* A, float* B, int n) {
  int out_index = blockDim.x * blockIdx.x + threadIdx.x;
  int in_index = out_index + RADIUS;  // input is padded by RADIUS elements on each side
  if (out_index < n) {
    float sum = 0.;
    for (int i = -RADIUS; i <= RADIUS; ++i) {
      sum += A[in_index + i];
    }
    B[out_index] = sum;
  }
}

A naive implementation (host)

void windowSum(const float* A, float* B, int n) {
  float *d_A, *d_B;
  int size = n * sizeof(float);

  // allocate the input with RADIUS zero-padding on each side
  cudaMalloc((void **) &d_A, (n + 2 * RADIUS) * sizeof(float));
  cudaMemset(d_A, 0, (n + 2 * RADIUS) * sizeof(float));
  cudaMemcpy(d_A + RADIUS, A, size, cudaMemcpyHostToDevice);
  cudaMalloc((void **) &d_B, size);

  dim3 threads(THREADS_PER_BLOCK, 1, 1);
  dim3 blocks((n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK, 1, 1);
  windowSumNaiveKernel<<<blocks, threads>>>(d_A, d_B, n);

  cudaMemcpy(B, d_B, size, cudaMemcpyDeviceToHost);
  cudaFree(d_A); cudaFree(d_B);
}

How to improve it? For each element of the input, how many times is it read? Each input element is read 7 times, and neighboring threads read mostly the same elements. How can we avoid this redundant reading of data?

Sharing data between threads within a block: each thread block first cooperatively loads the input data it needs (its output range plus a RADIUS-wide halo on each side) into shared memory.

Kernel with shared memory

__global__ void windowSumKernel(const float* A, float* B, int n) {
  __shared__ float temp[THREADS_PER_BLOCK + 2 * RADIUS];
  int out_index = blockDim.x * blockIdx.x + threadIdx.x;
  int in_index = out_index + RADIUS;
  int local_index = threadIdx.x + RADIUS;
  if (out_index < n) {
    // each thread loads one element; the first RADIUS threads also load the halo
    temp[local_index] = A[in_index];
    if (threadIdx.x < RADIUS) {
      temp[local_index - RADIUS] = A[in_index - RADIUS];
      temp[local_index + THREADS_PER_BLOCK] = A[in_index + THREADS_PER_BLOCK];
    }
    __syncthreads();
    float sum = 0.;
    for (int i = -RADIUS; i <= RADIUS; ++i) {
      sum += temp[local_index + i];
    }
    B[out_index] = sum;
  }
}

Performance comparison Demo! Code: https://github.com/dlsys-course/examples/blob/master/lecture8/window_sum.cu
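For reference, such a comparison can be timed with CUDA events; this is a sketch (the wrapper name timeWindowSum is ours) that assumes d_A is the padded device input and d_B the device output, as in windowSum above:

#include <cstdio>

// Time one launch of a window-sum kernel with CUDA events (result in milliseconds).
float timeWindowSum(const float* d_A, float* d_B, int n) {
  dim3 threads(THREADS_PER_BLOCK, 1, 1);
  dim3 blocks((n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK, 1, 1);

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaEventRecord(start);
  windowSumKernel<<<blocks, threads>>>(d_A, d_B, n);   // or windowSumNaiveKernel
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);   // block until the kernel has finished

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return ms;
}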

Case study of efficient GPU kernels

Case study: GEMM. Compute C = A x B, where A is an M x K matrix, B is a K x N matrix, and C is an M x N matrix. Each thread block computes a b x b tile of C. To do so, it walks over the K dimension in strips of width s: the corresponding b x s strip of A and s x b strip of B are cooperatively loaded from global memory into shared memory, so each loaded element is reused by many threads (e.g. a single element is loaded once and read by both thread 1 and thread 2). A thread block has t x t threads, and each thread accumulates a bt x bt sub-tile of C in its registers, where bt = b / t.

Case study: GEMM pseudocode

block_dim: <M / b, N / b>,  thread_dim: <t, t>

// thread function
__global__ void SGEMM(float *A, float *B, float *C, int b, int s) {
  __shared__ float sA[2][b][s], sB[2][s][b];  // shared by a thread block
  float rC[bt][bt] = {0};                     // thread-local buffer, kept in the registers

  // cooperatively fetch the first strip of A, B into sA[0], sB[0]
  __syncthreads();
  for (k = 0; k < K / s; k += 1) {
    // cooperatively fetch the next strip of A, B into sA[(k+1)%2], sB[(k+1)%2]
    // (runs in parallel with the computation below)
    __syncthreads();
    for (kk = 0; kk < s; kk += 1) {
      for (j = 0; j < bt; j += 1) {      // unroll loop
        for (i = 0; i < bt; i += 1) {    // unroll loop
          rC[j][i] += sA[k%2][threadIdx.x*bt+j][kk] * sB[k%2][kk][threadIdx.y*bt+i];
        }
      }
    }
  }
  // write rC back to C
}

Case study: GEMM. For more details, see: http://homes.cs.washington.edu/~tws10/cse599i/cse%20599%20i%20accelerated%20computing%20-%20programming%20gpus%20lecture%204.pdf and Lai, Junjie, and André Seznec. "Performance upper bound analysis and optimization of SGEMM on Fermi and Kepler GPUs." Code Generation and Optimization (CGO), 2013 IEEE/ACM International Symposium on. IEEE, 2013.

Case study: Reduction Sum. See: http://developer.download.nvidia.com/compute/cuda/1.1-beta/x86_website/projects/reduction/doc/reduction.pdf
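The linked slides develop a sequence of increasingly optimized kernels; a basic shared-memory tree reduction, sketched here for orientation (reduceSumKernel is our name, and THREADS_PER_BLOCK is assumed to be a power of two), looks like:

// Each block reduces THREADS_PER_BLOCK input elements to one partial sum.
__global__ void reduceSumKernel(const float* in, float* partial, int n) {
  __shared__ float sdata[THREADS_PER_BLOCK];
  int tid = threadIdx.x;
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  sdata[tid] = (i < n) ? in[i] : 0.0f;
  __syncthreads();

  // Tree reduction in shared memory: halve the number of active threads each step.
  for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
    if (tid < stride) {
      sdata[tid] += sdata[tid + stride];
    }
    __syncthreads();
  }
  if (tid == 0) partial[blockIdx.x] = sdata[0];   // one partial sum per block
}

The per-block partial sums are then combined by a second kernel launch or on the host.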

Tips for high performance: Use existing libraries, which are highly optimized, e.g. cuBLAS and cuDNN. Use nvprof or nvvp (the visual profiler) to debug performance. Use a high-level language to write GPU kernels.
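For example, rather than hand-writing the GEMM above, one would usually call cuBLAS; below is a minimal sketch (gemm_cublas is our name; it assumes d_A, d_B, d_C are device buffers in the column-major layout cuBLAS expects, and error checking is omitted):

#include <cublas_v2.h>

// C = alpha * A * B + beta * C, with A (M x K), B (K x N), C (M x N), column-major.
void gemm_cublas(const float* d_A, const float* d_B, float* d_C, int M, int N, int K) {
  cublasHandle_t handle;
  cublasCreate(&handle);
  const float alpha = 1.0f, beta = 0.0f;
  cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
              M, N, K,
              &alpha, d_A, M,   // lda = M
              d_B, K,           // ldb = K
              &beta, d_C, M);   // ldc = M
  cublasDestroy(handle);
}

Link with -lcublas when compiling with nvcc.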

References CUDA Programming Guide: http://docs.nvidia.com/cuda/cuda-c-programming-guide/