1 / 43 GPU Programming: CUDA Memories. Miaoqing Huang, University of Arkansas, Spring 2016

2 / 43 Outline: CUDA Memories; Atomic Operations

3 / 43 Hardware Implementation of CUDA Memories
Each thread can:
- Read/write per-thread registers
- Read/write per-thread local memory
- Read/write per-block shared memory
- Read/write per-grid global memory
- Read only per-grid constant memory
[Figure: CUDA memory model diagram - the host, a grid of thread blocks, per-thread registers, per-block shared memory, and the grid-wide global and constant memories]

4 / 43 CUDA Variable Type Qualifiers

Variable Declaration               | Memory   | Scope  | Lifetime
int var;                           | register | thread | thread
int array_var[10];                 | local    | thread | thread
__shared__ int shared_var;         | shared   | block  | block
__device__ int global_var;         | global   | grid   | application
__constant__ int constant_var;     | constant | grid   | application

- Automatic scalar variables without a qualifier reside in registers
  - The compiler will spill to thread-local memory
  - Thread-local memory physically resides in global memory
- Automatic array variables without a qualifier reside in thread-local memory
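
As a quick illustration of the table above, here is a minimal sketch (the kernel name qualifier_demo and the out parameter are made up for this example) showing where each qualifier is typically written and what the scope/lifetime columns mean in code:

    __device__   int global_var;         // global memory: one copy per grid, application lifetime
    __constant__ int constant_var = 7;   // constant memory: read-only from kernels

    __global__ void qualifier_demo(int *out)
    {
        int var = threadIdx.x;            // register: private to this thread
        int array_var[10];                // thread-local memory (physically in global memory)
        __shared__ int shared_var;        // shared memory: one copy per block, block lifetime

        if (threadIdx.x == 0)
            shared_var = constant_var;    // every thread in the grid can read constant_var
        __syncthreads();

        array_var[0] = var + shared_var;
        out[blockIdx.x * blockDim.x + threadIdx.x] = array_var[0] + global_var;
    }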

5 / 43 CUDA Memory Types
[Figure: the CUDA memory scopes - per-thread local memory, per-block shared memory, and global memory shared by all blocks of all grids (Grid 0 and Grid 1 shown)]

6 / 43 CUDA Variable Type Performance

Variable Declaration               | Memory   | Relative Penalty
int var;                           | register | 1x
int array_var[10];                 | local    | 100x
__shared__ int shared_var;         | shared   | 1x
__device__ int global_var;         | global   | 100x
__constant__ int constant_var;     | constant | 1x

- Scalar variables reside in fast, on-chip registers
- Shared variables reside in fast, on-chip memories
- Thread-local arrays and global variables reside in off-chip memory, cached in L1/L2
- Constant variables reside in cached off-chip memory
  - The capacity of the constant memory is 64 KB

7 / 43 CUDA Variable Type Scale

Variable Declaration               | Instances | Visibility
int var;                           | 100,000s  | 1
int array_var[10];                 | 100,000s  | 1
__shared__ int shared_var;         | 1,000s    | 100s
__device__ int global_var;         | 1         | 100,000s
__constant__ int constant_var;     | 1         | 100,000s

- 100Ks of per-thread variables, each read/written by 1 thread
- 100s of shared variables, each read/written by 1,000s of threads
- 1 global variable is read/written by 100Ks of threads
- 1 constant variable is readable by 100Ks of threads

8 / 43 Where to Declare Variables?

Can the host access it?    | Yes                              | No
Outside of any function    | __constant__ int constant_var;   |
                           | __device__ int global_var;       |
In the kernel              |                                  | int var;
                           |                                  | int array_var[10];
                           |                                  | __shared__ int shared_var;

9 / 43 Examples: thread-local variables

    // motivate per-thread variables with
    // Ten Nearest Neighbors application
    __global__ void ten_nn(float2 *result, float2 *ps, float2 *qs, size_t num_qs)
    {
        // p goes in a register
        float2 p = ps[threadIdx.x];

        // per-thread heap goes in off-chip (thread-local) memory
        float2 heap[10];

        // read through num_qs points, maintaining
        // the nearest 10 qs to p in the heap
        ...

        // write out the contents of heap to result
        ...
    }

10 / 43 Examples: shared variables

    // motivate shared variables with
    // Adjacent Difference application:
    // compute result[i] = input[i] - input[i-1]
    __global__ void adj_diff_naive(int *result, int *input)
    {
        // compute this thread's global index
        unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;

        if (i > 0)
        {
            // each thread loads two elements from global memory
            int x_i = input[i];
            int x_i_minus_one = input[i-1];

            result[i] = x_i - x_i_minus_one;
        }
    }

11 / 43 Examples: shared variables

    // motivate shared variables with
    // Adjacent Difference application:
    // compute result[i] = input[i] - input[i-1]
    __global__ void adj_diff_naive(int *result, int *input)
    {
        // compute this thread's global index
        unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;

        if (i > 0)
        {
            // what are the bandwidth requirements of this kernel?
            int x_i = input[i];                 // two loads
            int x_i_minus_one = input[i-1];

            result[i] = x_i - x_i_minus_one;
        }
    }

12 / 43 Examples: shared variables

    // motivate shared variables with
    // Adjacent Difference application:
    // compute result[i] = input[i] - input[i-1]
    __global__ void adj_diff_naive(int *result, int *input)
    {
        // compute this thread's global index
        unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;

        if (i > 0)
        {
            // how many times does this kernel load input[i]?
            int x_i = input[i];                 // once by thread i
            int x_i_minus_one = input[i-1];     // again by thread i+1

            result[i] = x_i - x_i_minus_one;
        }
    }

13 / 43 Examples: shared variables

    // motivate shared variables with
    // Adjacent Difference application:
    // compute result[i] = input[i] - input[i-1]
    __global__ void adj_diff_naive(int *result, int *input)
    {
        // compute this thread's global index
        unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;

        if (i > 0)
        {
            // Idea: eliminate redundancy by sharing data
            int x_i = input[i];
            int x_i_minus_one = input[i-1];

            result[i] = x_i - x_i_minus_one;
        }
    }

14 / 43 Examples: shared variables

    // optimized version of adjacent difference
    __global__ void adj_diff(int *result, int *input)
    {
        // shorthand for threadIdx.x
        int tx = threadIdx.x;

        // allocate a __shared__ array, one element per thread
        __shared__ int s_data[BLOCK_SIZE];

        // each thread reads one element to s_data
        unsigned int i = blockDim.x * blockIdx.x + tx;
        s_data[tx] = input[i];

        // avoid race condition: ensure all loads
        // complete before continuing
        __syncthreads();
        ...
        if (tx > 0)
            result[i] = s_data[tx] - s_data[tx-1];
    }

15 / 43 Examples: shared variables

    // optimized version of adjacent difference
    __global__ void adj_diff(int *result, int *input)
    {
        // shorthand for threadIdx.x
        int tx = threadIdx.x;

        // allocate a __shared__ array, one element per thread
        __shared__ int s_data[BLOCK_SIZE];

        // each thread reads one element to s_data
        unsigned int i = blockDim.x * blockIdx.x + tx;
        s_data[tx] = input[i];

        // avoid race condition: ensure all loads
        // complete before continuing
        __syncthreads();
        ...
        if (tx > 0)
            result[i] = s_data[tx] - s_data[tx-1];
        else if (i > 0)
        {
            // handle thread block boundary
            result[i] = s_data[tx] - input[i-1];
        }
    }

16 / 43 Examples: shared memory

    // when the size of the array isn't known at compile time...
    __global__ void adj_diff(int *result, int *input)
    {
        // use extern to indicate a __shared__ array that will be
        // allocated dynamically at kernel launch time
        extern __shared__ int s_data[];
        ...
    }

    // pass the size of the per-block array, in bytes, as the third
    // argument to the triple chevrons
    adj_diff<<<num_blocks, block_size, block_size*sizeof(int)>>>(r, i);

The size of the shared array is specified dynamically at kernel launch time.

17 / 43 Examples: global memory and constant memory

- Global memory: cudaMalloc() and cudaMemcpy()
- Constant memory: cudaMemcpyToSymbol() and cudaMemcpyFromSymbol()

    // outside any function
    __constant__ float constData[256];

    // host side
    float data[256];

    // host side
    cudaMemcpyToSymbol(constData, data, 256*sizeof(float));
    cudaMemcpyFromSymbol(data, constData, 256*sizeof(float));
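
The slide names cudaMalloc() and cudaMemcpy() for global memory without showing them; a minimal host-side sketch of that path (the buffer name d_data is made up for illustration) could look like this, alongside the constant-memory copies above:

    // allocate and fill a global-memory buffer from the host
    float data[256];
    float *d_data = NULL;

    cudaMalloc((void**)&d_data, 256 * sizeof(float));                        // allocate global memory
    cudaMemcpy(d_data, data, 256 * sizeof(float), cudaMemcpyHostToDevice);   // host -> device

    // ... launch a kernel that reads d_data (global) and constData (constant) ...

    cudaMemcpy(data, d_data, 256 * sizeof(float), cudaMemcpyDeviceToHost);   // device -> host
    cudaFree(d_data);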

18 / 43 About pointers

Yes, you can use them! You can point at any memory space:

    __device__ int my_global_variable;
    __constant__ int my_constant_variable = 13;

    __global__ void foo(void)
    {
        __shared__ int my_shared_variable;

        int *ptr_to_global = &my_global_variable;
        const int *ptr_to_constant = &my_constant_variable;
        int *ptr_to_shared = &my_shared_variable;
        ...
        *ptr_to_global = *ptr_to_shared;
    }

19 / 43 A Common Programming Strategy

- Global memory resides in device memory (DRAM)
  - Much slower to access than shared memory
- Tile data to take advantage of fast shared memory
  - Divide and conquer
  - Carefully partition data according to access patterns
    - Read-only data -> constant memory (fast)
    - R/W and shared within a block -> shared memory (fast)
    - R/W within each thread -> registers (fast)
    - Indexed R/W within each thread -> local memory (slow)
    - R/W inputs/results -> cudaMalloc'ed global memory (slow)

20 / 43 A Common Programming Strategy Demo 1. Partition data into subsets that fit into shared memory

21 / 43 A Common Programming Strategy Demo 2. Handle each data subset with one thread block

22 / 43 A Common Programming Strategy Demo 3. Load the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism

23 / 43 A Common Programming Strategy Demo 4. Perform the computation on the subset from shared memory

24 / 43 A Common Programming Strategy Demo 5. Copy the result from shared memory back to global memory
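
As a concrete illustration of these five steps, here is a minimal sketch of shared-memory tiling applied to matrix multiplication. It is not from the slides: it assumes square N x N matrices with N a multiple of TILE, and the names matmul_tiled, TILE, As, and Bs are made up for the example.

    #define TILE 16

    __global__ void matmul_tiled(const float *A, const float *B, float *C, int N)
    {
        // Steps 1 & 2: each block handles one TILE x TILE subset of C
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float sum = 0.0f;

        for (int t = 0; t < N / TILE; ++t) {
            // Step 3: cooperatively load one tile of A and one tile of B into shared memory
            As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
            __syncthreads();

            // Step 4: compute on the tiles held in fast shared memory
            for (int k = 0; k < TILE; ++k)
                sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }

        // Step 5: copy the result back to global memory
        C[row * N + col] = sum;
    }

    // launch: matmul_tiled<<<dim3(N/TILE, N/TILE), dim3(TILE, TILE)>>>(A, B, C, N);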

25 / 43 Outline: CUDA Memories; Atomic Operations

26 / 43 Race Conditions

    // all items in vector are initialized to zeros
    __device__ int vector[NUMBER_VECTOR];

    __global__ void race(void)
    {
        ...
        a = func_a();
        vector[0] += a;
    }

27 / 43 Race Conditions

    // all items in vector are initialized to zeros
    __device__ int vector[NUMBER_VECTOR];

    __global__ void race(void)
    {
        ...
        a = func_a();
        vector[0] += a;
    }

    Thread 0:     ... vector[0] += 5;
    Thread 1917:  ... vector[0] += 1;

What is the value of vector[0] in thread 0? What is the value of vector[0] in thread 1917?

28 / 43 Race Conditions

What is the value of vector[0] in thread 0? In thread 1917?
Answer: it is not defined by the programming model and can be arbitrary:
- Thread 0 could have finished execution before thread 1917 started
- Or the other way around
- Or both could be executing at the same time

29 / 43 Race Conditions

The outcome of unsynchronized concurrent updates is arbitrary. CUDA provides atomic operations to deal with this problem.

30 / 43 Atomics

- An atomic operation guarantees that only a single thread has access to a piece of memory until the operation completes
- The name "atomic" comes from the fact that it is uninterruptible
- No dropped updates, but ordering is still arbitrary
- Different types of atomic instructions:
  - atomicAdd, atomicSub, atomicExch, atomicMin, atomicMax, atomicInc, atomicDec, atomicCAS, atomicAnd, atomicOr, atomicXor
  - More types in Fermi

31 / 43 Example: Histogram

    // Determine the frequency of colors in a picture;
    // colors have already been converted into ints;
    // each thread looks at one pixel and increments a counter atomically
    __global__ void histogram(int *colors, int *buckets)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        int c = colors[i];
        atomicAdd(&buckets[c], 1);
    }

- Atomics are slower than normal loads/stores
- You can have the whole machine queuing on a single location in memory
- Atomics are unavailable on G80!
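
Because every thread above contends for the same global buckets array, a common refinement (anticipating the advice later in this deck that atomics to shared variables are much faster than atomics to global variables) is to accumulate a per-block histogram in shared memory first and merge it into global memory once per block. A sketch, assuming 256 buckets and the made-up name histogram_smem:

    #define NUM_BUCKETS 256

    __global__ void histogram_smem(int *colors, int *buckets, int n)
    {
        // per-block (privatized) histogram in fast shared memory
        __shared__ int s_buckets[NUM_BUCKETS];

        // cooperatively zero the shared histogram
        for (int b = threadIdx.x; b < NUM_BUCKETS; b += blockDim.x)
            s_buckets[b] = 0;
        __syncthreads();

        // atomics to shared memory: contention is limited to one block
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        if (i < n)
            atomicAdd(&s_buckets[colors[i]], 1);
        __syncthreads();

        // merge the per-block result into the global histogram
        for (int b = threadIdx.x; b < NUM_BUCKETS; b += blockDim.x)
            atomicAdd(&buckets[b], s_buckets[b]);
    }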

32 / 43 Example: Global Min/Max

    // Naive solution
    // If you require the maximum across all threads in a grid, you could
    // do it with a single global maximum value, but it will be VERY slow
    __global__ void global_max(int *values, int *gl_max)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        int val = values[i];
        atomicMax(gl_max, val);   // returns the previous max
    }

33 / 43 Example: Global Min/Max

    // Naive solution: a single global maximum value - VERY slow
    __global__ void global_max(int *values, int *gl_max)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        int val = values[i];
        atomicMax(gl_max, val);   // returns the previous max
    }

    // Better solution
    // Introduce intermediate maximum results, so that
    // most threads do not try to update the global max
    __global__ void global_max(int *values, int *gl_max, int *reg_max, int num_regions)
    {
        // i and val as before
        ...
        int region = i % num_regions;
        if (atomicMax(&reg_max[region], val) < val)
            atomicMax(gl_max, val);
    }

34 / 43 Inspiration from Global Min/Max

- A single value causes a serial bottleneck
- Create a hierarchy of values for more parallelism
- Performance will still be slow, so use atomic operations judiciously
- Normal loads/stores cannot be used for inter-thread communication because of race conditions
- Use atomic instructions for sparse and/or unpredictable global communication
- Decompose data (very limited use of a single global sum/max/min/etc.) for more parallelism

35 / 43 Resource Contention Atomic operations are not cheap! They imply serialized access to a variable

36 / 43 Resource Contention

Atomic operations are not cheap! They imply serialized access to a variable.

    __global__ void sum(int *input, int *result)
    {
        atomicAdd(result, input[threadIdx.x]);
    }
    ...
    // how many threads will contend
    // for exclusive access to result?
    sum<<<1, N>>>(input, result);

37 / 43 Hierarchical Summation

- Divide & conquer
  - Per-thread atomicAdd to a shared partial sum
  - Per-block atomicAdd to the total sum

38 / 43 Hierarchical Summation

    __global__ void sum(int *input, int *result)
    {
        __shared__ int partial_sum;

        // thread 0 is responsible for initializing partial_sum
        if (threadIdx.x == 0)
            partial_sum = 0;
        __syncthreads();

        // each thread updates the partial sum
        int tx = threadIdx.x;
        atomicAdd(&partial_sum, input[tx]);
        __syncthreads();

        // thread 0 updates the total sum
        if (threadIdx.x == 0)
            atomicAdd(result, partial_sum);
    }

    // launch the kernel
    sum<<<B, N/B>>>(input, result);

40 / 43 Hierarchical Summation

    __global__ void sum(int *input, int *result)
    {
        __shared__ int partial_sum;

        // thread 0 is responsible for initializing partial_sum
        if (threadIdx.x == 0)
            partial_sum = 0;
        __syncthreads();

        // each thread updates the partial sum
        // (note the global index, so each block reads its own slice of input)
        int tx = blockIdx.x * blockDim.x + threadIdx.x;
        atomicAdd(&partial_sum, input[tx]);
        __syncthreads();

        // thread 0 updates the total sum
        if (threadIdx.x == 0)
            atomicAdd(result, partial_sum);
    }

    // launch the kernel
    sum<<<B, N/B>>>(input, result);

41 / 43 Advice

- Use barriers such as __syncthreads() to wait until shared data is ready
- Prefer barriers to atomics when data access patterns are regular or predictable
- Prefer atomics to barriers when data access patterns are sparse or unpredictable
- Atomics to shared variables are much faster than atomics to global variables
- Do not synchronize or serialize unnecessarily

42 / 43 Final Thoughts

- Effective use of the CUDA memory hierarchy decreases bandwidth consumption and thereby increases throughput
- Use shared memory to eliminate redundant loads from global memory
  - Use __syncthreads() barriers to protect shared data
  - Use atomics if access patterns are sparse or unpredictable
- Prefer atomics to barriers when data access patterns are sparse or unpredictable
- Atomics to shared variables are much faster than atomics to global variables
- Optimization comes with a development cost
- Memory resources ultimately limit parallelism

43 / 43 Technical Specification of GPU Architectures

Technical Specification                 | G80    | GT200  | Fermi
Max x- or y-dimension of a grid         | 65,535 | 65,535 | 65,535
Max number of threads per block         | 512    | 512    | 1024
Max x- or y-dimension of a block        | 512    | 512    | 1024
Max z-dimension of a block              | 64     | 64     | 64
Max number of resident blocks per SM    | 8      | 8      | 8
Warp size                               | 32     | 32     | 32
Max number of resident warps per SM     | 24     | 32     | 48
Max number of resident threads per SM   | 768    | 1024   | 1536
Number of 32-bit registers per SM       | 8 K    | 16 K   | 32 K
Max amount of shared memory per SM      | 16 KB  | 16 KB  | 48 KB
Constant memory size                    | 64 KB  | 64 KB  | 64 KB
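
The limits in this table can also be queried at runtime with the CUDA runtime API. A minimal host-side sketch using cudaGetDeviceProperties (note that the runtime reports registers and shared memory as per-block limits rather than the per-SM figures shown above):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main(void)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // properties of device 0

        printf("Device: %s (compute capability %d.%d)\n", prop.name, prop.major, prop.minor);
        printf("Max threads per block:      %d\n", prop.maxThreadsPerBlock);
        printf("Max block dimensions:       %d x %d x %d\n",
               prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
        printf("Max grid dimensions:        %d x %d x %d\n",
               prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
        printf("Warp size:                  %d\n", prop.warpSize);
        printf("Shared memory per block:    %zu bytes\n", prop.sharedMemPerBlock);
        printf("32-bit registers per block: %d\n", prop.regsPerBlock);
        printf("Constant memory:            %zu bytes\n", prop.totalConstMem);
        return 0;
    }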