L3: Data Partitioning and Memory Organization, Synchronization


Outline
More on CUDA
L3: Data Partitioning and Organization, Synchronization
January 21, 2009

- First assignment, due Jan. 30 (details later in lecture)
- Error checking mechanisms
- Synchronization
- More on data partitioning
- Reading: GPU Gems 2, Ch. 31; CUDA 2.0 Manual, particularly Chapters 4 and 5

This lecture includes slides provided by Wen-mei Hwu (UIUC) and David Kirk (NVIDIA), see http://courses.ece.uiuc.edu/ece498/al1/, and by Austin Robison (NVIDIA).

Today's Lecture
- Establish background to write your first CUDA code
- Testing for correctness (a little)
- Debugging (a little)
- Core concepts to be reinforced: data partitioning and synchronization

Questions from Last Time
- Considering the GPU gets a single copy of the array, can two threads access the shared array at the same time?
  Yes, if there are no bank conflicts. But watch out!
- Can we have different thread functions for different subsets of data?
  Not efficiently.
- Why do I get the same results every time using a random number generator?
  The generator is deterministic if it starts with the default (or the same) seed.

My Approach to Working with New Systems
Approach with caution:
- It may not behave as expected.
- Learn about behavior through observation and measurement.
- Realize that documentation is not always clear or accurate.
Crawl, then walk, then run:
- Start with something known to work.
- Only change one variable at a time.

Debugging Using Device Emulation Mode
An executable compiled in device emulation mode (nvcc -deviceemu) runs completely on the host using the CUDA runtime:
- No device or CUDA driver is needed.
- Each device thread is emulated with a host thread.
When running in device emulation mode, one can:
- Use host native debug support (breakpoints, inspection, etc.).
- Access any device-specific data from host code and vice versa.
- Call any host function from device code (e.g., printf) and vice versa.
- Detect deadlock situations caused by improper usage of __syncthreads.
(ECE 498AL, University of Illinois, Urbana-Champaign)

Device Emulation Mode Pitfalls
- Emulated device threads execute sequentially, so simultaneous accesses to the same memory location by multiple threads could produce different results.
- Dereferencing device pointers on the host or host pointers on the device can produce correct results in device emulation mode, but will generate an error in device execution mode.
- Results of floating-point computations will differ slightly because of:
  - Different compiler outputs and instruction sets
  - Use of extended precision for intermediate results
  - There are various options to force strict single precision on the host.
(ECE 498AL, University of Illinois, Urbana-Champaign)

Run-time Functions and Macros for Error Checking
In the CUDA run-time services:
  cudaGetDeviceProperties(cudaDeviceProp *prop, int device);
checks the number and type of devices, and whether a device is present.
In libcutil.a of the Software Developer's Kit:
  cutComparef(float *ref, float *data, unsigned len);
compares output with a reference from a CPU implementation.
In cutil.h of the Software Developer's Kit (with #define _DEBUG or the -D_DEBUG compile flag):
  CUDA_SAFE_CALL(f(<args>)), CUT_SAFE_CALL(f(<args>))
check for an error in a run-time call and exit if an error is detected;
  CUT_SAFE_MALLOC(cudaMalloc(<args>));
similar to the above, but for malloc calls;
  CUT_CHECK_ERROR("error message goes here");
checks for an error immediately following kernel execution and, if an error is detected, exits with the error message.
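The cutil macros above shipped with the old CUDA SDK. As a minimal sketch of the same idea written against the plain CUDA runtime API (the macro name CHECK_CUDA is our own, not from the SDK), one might write:

  #include <stdio.h>
  #include <stdlib.h>
  #include <cuda_runtime.h>

  // Wrap a runtime call, print the error string, and exit on failure.
  #define CHECK_CUDA(call)                                           \
    do {                                                             \
      cudaError_t err = (call);                                      \
      if (err != cudaSuccess) {                                      \
        fprintf(stderr, "CUDA error %s at %s:%d\n",                  \
                cudaGetErrorString(err), __FILE__, __LINE__);        \
        exit(EXIT_FAILURE);                                          \
      }                                                              \
    } while (0)

  // Usage: wrap run-time calls, and check kernel launches afterwards:
  //   CHECK_CUDA(cudaMalloc((void**)&d_in, SIZE * sizeof(int)));
  //   kernel<<<grid, block>>>(d_in, d_out);
  //   CHECK_CUDA(cudaGetLastError());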

Simple Working Code Example from Last Time
Goal for this example:
- Really simple, but illustrative of key concepts
- Fits in one file with a simple compile command
- Can absorb during lecture
What does it do?
- Scans the elements of an array of numbers (each from 0 to 9)
- How many times does 6 appear?
- Array of 16 elements, each thread examines 4 elements, 1 block in the grid, 1 grid

  3 6 7 5 3 5 6 2 9 1 2 7 0 9 3 6

threadIdx.x = 0 examines in_array elements 0, 4, 8, 12
threadIdx.x = 1 examines in_array elements 1, 5, 9, 13
threadIdx.x = 2 examines in_array elements 2, 6, 10, 14
threadIdx.x = 3 examines in_array elements 3, 7, 11, 15
This is known as a cyclic data distribution.

Reductions (from last time)
This type of computation is called a parallel reduction:
- An operation is applied to a large data structure.
- The computed result represents the aggregate solution across the large data structure.
- Large data structure -> computed result (perhaps a single number); the dimensionality is reduced.
Why might parallel reductions be well suited to GPUs?
What if we tried to compute the final sum on the GPU? (next class and assignment)

Reminder: Gathering and Reporting Results
Global and device functions, and excerpts from host main:

  __device__ int compare(int a, int b) {
    if (a == b) return 1;
    return 0;
  }

  // Compute individual results for each thread
  __global__ void compute(int *d_in, int *d_out) {
    d_out[threadIdx.x] = 0;
    for (int i = 0; i < SIZE/BLOCKSIZE; i++) {
      int val = d_in[i*BLOCKSIZE + threadIdx.x];
      d_out[threadIdx.x] += compare(val, 6);
    }
  }

  __host__ void outer_compute(int *h_in_array, int *h_out_array) {
    ...
    compute<<<1, BLOCKSIZE, MSIZE>>>(d_in_array, d_out_array);
    cudaMemcpy(h_out_array, d_out_array, BLOCKSIZE*sizeof(int), cudaMemcpyDeviceToHost);
  }

  main(int argc, char **argv) {
    ...
    // Serialize final results gathering on host
    for (int i = 0; i < BLOCKSIZE; i++) {
      sum += out_array[i];
    }
    printf("Result = %d\n", sum);
  }

Gathering Results on GPU: Synchronization
void __syncthreads();
Functionality: synchronizes all threads in a block.
- Each thread waits at the point of this call until all other threads have reached it.
- Once all threads have reached this point, execution resumes normally.
Why is this needed?
- A thread can freely read the shared memory of its thread block or the global memory of either its block or grid.
- __syncthreads() allows the program to guarantee a partial ordering of these accesses and so prevent incorrect orderings.
Watch out!
- There is potential for deadlock when it appears in conditionals.
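The slide above shows only excerpts of the host code. As a rough sketch (an assumption, not the course's distributed sample), the omitted setup might look like the following, reusing the compute kernel shown above; the names d_in_array and d_out_array follow the excerpt, everything else is our own illustration:

  #include <stdio.h>
  #include <stdlib.h>
  #include <cuda_runtime.h>

  #define SIZE 16
  #define BLOCKSIZE 4

  int main(int argc, char **argv) {
    int h_in_array[SIZE], h_out_array[BLOCKSIZE];
    int *d_in_array, *d_out_array;
    int sum = 0;

    for (int i = 0; i < SIZE; i++)        // random digits 0..9
      h_in_array[i] = rand() % 10;

    cudaMalloc((void**)&d_in_array,  SIZE * sizeof(int));
    cudaMalloc((void**)&d_out_array, BLOCKSIZE * sizeof(int));
    cudaMemcpy(d_in_array, h_in_array, SIZE * sizeof(int),
               cudaMemcpyHostToDevice);

    // One block of BLOCKSIZE threads; the kernel uses a cyclic distribution.
    compute<<<1, BLOCKSIZE>>>(d_in_array, d_out_array);

    cudaMemcpy(h_out_array, d_out_array, BLOCKSIZE * sizeof(int),
               cudaMemcpyDeviceToHost);

    for (int i = 0; i < BLOCKSIZE; i++)   // serialize final gather on host
      sum += h_out_array[i];
    printf("Result = %d\n", sum);
    return 0;
  }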

Gathering Results on GPU for Count 6
Previous version (results gathered on the host):

  __global__ void compute(int *d_in, int *d_out) {
    d_out[threadIdx.x] = 0;
    for (int i = 0; i < SIZE/BLOCKSIZE; i++) {
      int val = d_in[i*BLOCKSIZE + threadIdx.x];
      d_out[threadIdx.x] += compare(val, 6);
    }
  }

New version (results gathered on the GPU):

  __global__ void compute(int *d_in, int *d_out, int *d_sum) {
    d_out[threadIdx.x] = 0;
    for (int i = 0; i < SIZE/BLOCKSIZE; i++) {
      int val = d_in[i*BLOCKSIZE + threadIdx.x];
      d_out[threadIdx.x] += compare(val, 6);
    }
    __syncthreads();
    if (threadIdx.x == 0) {
      for (int i = 0; i < BLOCKSIZE; i++)
        *d_sum += d_out[i];
    }
  }

Also: Atomic Update to Sum Variable
int atomicAdd(int* address, int val);
Increments the integer at address by val. Atomic means that once initiated, the operation executes to completion without interruption by other threads.
Example: atomicAdd(d_sum, d_out_array[threadIdx.x]);

Programming Assignment #1
Problem 1: In the count-6 example using synchronization, we accumulated all results into out_array[0], thus serializing the accumulation computation on the GPU. Suppose we want to exploit some parallelism in this part, which will likely be particularly important to performance as we scale the number of threads. A common idiom for reduction computations is to use a tree-structured results-gathering phase, where independent threads combine their results in parallel. For our example, with SIZE = 16 and BLOCKSIZE = 4, the tree is:
- Step 1: out[0] += out[1] and out[2] += out[3]
- Step 2: out[0] += out[2]
Your job is to write this version of the reduction in CUDA. You can start with the sample code, but adjust the problem size to be larger:
  #define SIZE 256
  #define BLOCKSIZE 32
(A sketch of this tree-structured gather appears after this section.)

Another Example: Adding Two Matrices
(Example source: Austin Robison, NVIDIA)
CPU C program:

  void add_matrix_cpu(float *a, float *b, float *c, int N) {
    int i, j, index;
    for (i = 0; i < N; i++) {
      for (j = 0; j < N; j++) {
        index = i + j*N;
        c[index] = a[index] + b[index];
      }
    }
  }

  void main() {
    ...
    add_matrix_cpu(a, b, c, N);
  }

CUDA C program:

  __global__ void add_matrix_gpu(float *a, float *b, float *c, int N) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    int j = blockIdx.y*blockDim.y + threadIdx.y;
    int index = i + j*N;
    if (i < N && j < N)
      c[index] = a[index] + b[index];
  }

  void main() {
    dim3 dimBlock(blocksize, blocksize);
    dim3 dimGrid(N/dimBlock.x, N/dimBlock.y);
    add_matrix_gpu<<<dimGrid, dimBlock>>>(a, b, c, N);
  }
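Referring back to Problem 1 above, one possible shape of the tree-structured gather is sketched below. This is only an illustration under stated assumptions (per-thread counts kept in a __shared__ array, BLOCKSIZE a power of two, and the SIZE/BLOCKSIZE/compare() definitions from the count-6 example), not the official assignment solution:

  // Tree-structured gather: each step halves the number of active threads.
  __global__ void compute_tree(int *d_in, int *d_sum) {
    __shared__ int out[BLOCKSIZE];

    // Per-thread counts, cyclic distribution as before.
    out[threadIdx.x] = 0;
    for (int i = 0; i < SIZE/BLOCKSIZE; i++) {
      int val = d_in[i*BLOCKSIZE + threadIdx.x];
      out[threadIdx.x] += compare(val, 6);
    }
    __syncthreads();

    // Strides 1, 2, 4, ...: out[0]+=out[1], out[2]+=out[3], then out[0]+=out[2], etc.
    for (int stride = 1; stride < BLOCKSIZE; stride *= 2) {
      if (threadIdx.x % (2*stride) == 0)
        out[threadIdx.x] += out[threadIdx.x + stride];
      __syncthreads();   // every thread reaches this, even inactive ones
    }

    if (threadIdx.x == 0)
      *d_sum = out[0];
  }

Note that __syncthreads() is kept outside the conditional so that every thread in the block reaches it, avoiding the deadlock hazard mentioned earlier.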

Closer Inspection of Computation and Data Partitioning
Define a 2-D set of blocks and a 2-D set of threads per block:
  dim3 dimBlock(blocksize, blocksize);
  dim3 dimGrid(N/dimBlock.x, N/dimBlock.y);
Each thread identifies the single element of the matrix it operates on:
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  int j = blockIdx.y*blockDim.y + threadIdx.y;
  int index = i + j*N;
  if (i < N && j < N)
    c[index] = a[index] + b[index];
Let blocksize = 2 and N = 4. The 16 matrix elements (indices 0 through 15, stored row by row) are covered by four 2x2 thread blocks: BLOCK (0,0) handles elements 0, 1, 4, 5; BLOCK (1,0) handles 2, 3, 6, 7; BLOCK (0,1) handles 8, 9, 12, 13; and BLOCK (1,1) handles 10, 11, 14, 15. Within each 2x2 block, each of the four threads takes one element. (A small host-side sketch that reproduces this mapping follows this section.)

Data Distribution to Threads
Many data-parallel programming languages have mechanisms for expressing how a distributed data structure is mapped to threads:
- That is, the portion of the data a thread accesses (and usually stores locally)
- Examples: HPF, Chapel, UPC
Reasons for selecting a particular distribution:
- Sharing patterns: group data together according to data reuse and communication needs
- Affinity with other data
- Locality: transfer size and the capacity limit of shared memory
- Access patterns relative to computation (avoid bank conflicts)
- Minimizing overhead: the cost of copying to the host and between global and shared memory

Concept of Data Distributions in CUDA
This concept is not completely natural for CUDA:
- It is implicit in the computation partitioning and the access expressions within threads.
- Within a block of threads, most memory is shared, so it is not really distributed.
Nevertheless, I think it is still useful:
- Spatial understanding of computation and data organization (helps conceptualize)
- Distribution of data is needed for objects too large to fit in shared memory (i.e., partitioning across blocks in a grid)
- Helpful in putting CUDA in the context of other languages
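As a quick sanity check of the mapping above, the following host-only sketch (our own illustration, not part of the lecture code) loops over the block and thread coordinates exactly as the kernel computes them and prints which matrix element each thread would touch for blocksize = 2 and N = 4:

  #include <stdio.h>

  int main(void) {
    const int BLOCKSIZE = 2, N = 4;
    for (int by = 0; by < N/BLOCKSIZE; by++)
      for (int bx = 0; bx < N/BLOCKSIZE; bx++)
        for (int ty = 0; ty < BLOCKSIZE; ty++)
          for (int tx = 0; tx < BLOCKSIZE; tx++) {
            int i = bx*BLOCKSIZE + tx;   // blockIdx.x*blockDim.x + threadIdx.x
            int j = by*BLOCKSIZE + ty;   // blockIdx.y*blockDim.y + threadIdx.y
            int index = i + j*N;
            printf("block(%d,%d) thread(%d,%d) -> element %d\n",
                   bx, by, tx, ty, index);
          }
    return 0;
  }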

Common Data Distributions
Consider a 1-dimensional array.

CYCLIC:
  for (i = 0; i < BLOCKSIZE; i++)
    ... d_in[i*BLOCKSIZE + threadIdx.x] ...
  3 6 7 5 3 5 6 2 9 1 2 7 0 9 3 6

BLOCK:
  for (i = threadIdx.x*BLOCKSIZE; i < (threadIdx.x+1)*BLOCKSIZE; i++)
    ... d_in[i] ...
  3 6 7 5 3 5 6 2 9 1 2 7 0 9 3 6

BLOCK-CYCLIC: each thread takes contiguous chunks of elements, dealt out in round-robin fashion (a sketch of all three access patterns appears after this section).

Programming Assignment #1, cont.
Problem 2: Computation Partitioning and Data Distribution
Using the matrix addition example from the Lecture 2 class notes, develop a complete CUDA program for integer matrix addition. Initialize the two input matrices using random integers from 0 to 9 (as in the count-6 example). Initialize the ith entry in both a[i] and b[i], so that everyone's matrix is initialized the same way. Your CUDA code should use a data distribution for the input and output matrices that exploits several features of the architecture to improve performance (although the advantages of these will be greater for objects allocated to shared memory and with reuse of data within a block):
- a grid of 16 square blocks of threads (>= the number of multiprocessors)
- each block executing 64 threads (> warp size)
- each thread accessing 32 elements (better if larger)
Within a grid block, you should use a BLOCK data distribution for the columns, so that each thread operates on a column. This should help reduce bank conflicts.
Return the output array.

Hardware Implementation: Memory Architecture
The local, global, constant, and texture spaces are regions of device memory. Each multiprocessor has:
- A set of 32-bit registers per processor
- On-chip shared memory, where the shared memory space resides
- A read-only constant cache, to speed up access to the constant memory space
- A read-only texture cache, to speed up access to the texture memory space
(Figure: the device consists of Multiprocessor 1 through N; each multiprocessor contains Processor 1 through M, an instruction unit, shared memory, a constant cache, and a texture cache. The global, constant, and texture memories reside in device memory.)
(ECE 498AL, University of Illinois, Urbana-Champaign)

Programming Model: Memory Spaces
Each thread can:
- Read/write per-thread registers
- Read/write per-thread local memory
- Read/write per-block shared memory
- Read/write per-grid global memory
- Read only per-grid constant memory
- Read only per-grid texture memory
The host can read/write global, constant, and texture memory.
(Figure: a grid of blocks, each block with its own shared memory and per-thread local memory and registers; global, constant, and texture memory are shared across the grid and accessible from the host.)
(ECE 498AL, University of Illinois, Urbana-Champaign)
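The block-cyclic loop is not spelled out on the slide. As a hedged sketch, the kernel below contrasts all three access patterns for a single thread; the chunk size CHUNK and the touch() placeholder are our own illustrative names, and the constants assume the SIZE = 16, BLOCKSIZE = 4 example:

  #define SIZE 16
  #define BLOCKSIZE 4
  #define CHUNK 2     // elements per contiguous chunk in the block-cyclic case

  __device__ void touch(int x) { (void)x; }   // placeholder for a real use of the element

  __global__ void distributions(int *d_in) {
    // CYCLIC: thread t touches t, t+BLOCKSIZE, t+2*BLOCKSIZE, ...
    for (int i = 0; i < SIZE/BLOCKSIZE; i++)
      touch(d_in[i*BLOCKSIZE + threadIdx.x]);

    // BLOCK: thread t touches one contiguous slice of SIZE/BLOCKSIZE elements.
    for (int i = threadIdx.x * (SIZE/BLOCKSIZE);
         i < (threadIdx.x + 1) * (SIZE/BLOCKSIZE); i++)
      touch(d_in[i]);

    // BLOCK-CYCLIC: contiguous chunks of CHUNK elements, dealt out round-robin,
    // e.g. thread 0 touches 0,1,8,9 and thread 1 touches 2,3,10,11.
    for (int base = threadIdx.x * CHUNK; base < SIZE; base += BLOCKSIZE * CHUNK)
      for (int k = 0; k < CHUNK; k++)
        touch(d_in[base + k]);
  }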

Threads, Warps, Blocks
- There are (up to) 32 threads in a warp; fewer only when there are fewer than 32 total threads.
- There are (up to) 16 warps in a block.
- Each block (and thus, each warp) executes on a single SM.
- G80 has 16 SMs, so at least 16 blocks are required to fill the device (see the run-time query sketch after this section).
- More is better: if resources (registers, thread space, shared memory) allow, more than one block can occupy each SM.
(ECE 498AL, University of Illinois, Urbana-Champaign)

Hardware Implementation: A Set of SIMD Multiprocessors
- A device has a set of multiprocessors.
- Each multiprocessor is a set of 32-bit processors with a Single Instruction Multiple Data architecture and a shared instruction unit.
- At each clock cycle, a multiprocessor executes the same instruction on a group of threads called a warp.
- The number of threads in a warp is the warp size.
(Figure: Device with Multiprocessor 1 through N; each multiprocessor has Processor 1 through M and an Instruction Unit.)
(ECE 498AL, University of Illinois, Urbana-Champaign)

Hardware Implementation: Execution Model
- Each thread block of a grid is split into warps, each of which gets executed by one multiprocessor (SM). The device processes only one grid at a time.
- Each thread block is executed by one multiprocessor, so that the shared memory space resides in the on-chip shared memory.
- A multiprocessor can execute multiple blocks concurrently. Shared memory and registers are partitioned among the threads of all concurrent blocks, so decreasing shared memory usage (per block) and register usage (per thread) increases the number of blocks that can run concurrently.
(ECE 498AL, University of Illinois, Urbana-Champaign)

Terminology Review
- device = GPU = set of multiprocessors
- multiprocessor = set of processors and shared memory
- kernel = GPU program
- grid = array of thread blocks that execute a kernel
- thread block = group of SIMD threads that execute a kernel and can communicate via shared memory

Memory spaces (location / cached / access / who):
- Local: off-chip, not cached, read/write, one thread
- Shared: on-chip, N/A (resident), read/write, all threads in a block
- Global: off-chip, not cached, read/write, all threads + host
- Constant: off-chip, cached, read only, all threads + host
- Texture: off-chip, cached, read only, all threads + host
(ECE 498AL, University of Illinois, Urbana-Champaign)
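The per-device limits discussed above (warp size, number of SMs, shared memory and registers per block) can be queried at run time with cudaGetDeviceProperties, mentioned on the error-checking slide earlier. A minimal sketch, assuming device 0:

  #include <stdio.h>
  #include <cuda_runtime.h>

  int main(void) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
      fprintf(stderr, "no CUDA device found\n");
      return 1;
    }
    printf("device: %s\n", prop.name);
    printf("warp size: %d\n", prop.warpSize);
    printf("multiprocessors (SMs): %d\n", prop.multiProcessorCount);
    printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("registers per block: %d\n", prop.regsPerBlock);
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    return 0;
  }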

Memory Access Times
- Register: dedicated HW, single cycle
- Shared memory: dedicated HW, single cycle
- Local memory: DRAM, no cache, *slow*
- Global memory: DRAM, no cache, *slow*
- Constant memory: DRAM, cached, 1 to 10s to 100s of cycles, depending on cache locality
- Texture memory: DRAM, cached, 1 to 10s to 100s of cycles, depending on cache locality
- Instruction memory (invisible): DRAM, cached
(ECE 498AL, University of Illinois, Urbana-Champaign)

Language Extensions: Variable Type Qualifiers (declaration / memory / scope / lifetime)
- __device__ __local__ int LocalVar;        local memory, thread scope, thread lifetime
- __device__ __shared__ int SharedVar;      shared memory, block scope, block lifetime
- __device__ int GlobalVar;                 global memory, grid scope, application lifetime
- __device__ __constant__ int ConstantVar;  constant memory, grid scope, application lifetime
__device__ is optional when used with __local__, __shared__, or __constant__.
Automatic variables without any qualifier reside in a register, except arrays, which reside in local memory.
(ECE 498AL, University of Illinois, Urbana-Champaign)

Variable Type Restrictions
Pointers can only point to memory allocated or declared in global memory:
- Allocated in the host and passed to the kernel: __global__ void KernelFunc(float* ptr)
- Obtained as the address of a global variable: float* ptr = &GlobalVar;
(ECE 498AL, University of Illinois, Urbana-Champaign)

Relating this Back to the Count 6 Example
- A few variables (i, val) are going in device registers.
- We are copying the input array into device memory and the output array back to host memory. These device copies of the data all reside in global memory.
- Suppose we want to access only shared memory on the device:
  - We cannot copy into shared memory from the host!
  - It requires another copy from/to global memory to/from shared memory.
- Other options:
  - Use constant memory for the input array (not helpful for the current implementation, but possibly useful; see the sketch after this section).
  - Use shared memory for the output array and just return a single result (today's version with synchronization).
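The first option above (constant memory for the input array) would look roughly like the sketch below. This is our own illustration of the variable type qualifiers in use, not the course's code; it reuses the SIZE, BLOCKSIZE, and compare() definitions from the count-6 example:

  // Place the small, read-only input array in constant memory. The kernel
  // reads c_in through the constant cache; only the host can write it,
  // using cudaMemcpyToSymbol.
  __constant__ int c_in[SIZE];           // per-grid scope, application lifetime

  __global__ void compute_const(int *d_out) {
    d_out[threadIdx.x] = 0;              // d_out still lives in global memory
    for (int i = 0; i < SIZE/BLOCKSIZE; i++)
      d_out[threadIdx.x] += compare(c_in[i*BLOCKSIZE + threadIdx.x], 6);
  }

  // Host side (replaces the host-to-device cudaMemcpy of the input array):
  //   cudaMemcpyToSymbol(c_in, h_in_array, SIZE * sizeof(int));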

Summary of Lecture
- Some error-checking techniques to determine whether a program is correct
- Synchronization constructs
- Discussion of data partitioning
- First assignment, due FRIDAY, JANUARY 30, 5PM

Next Week
- Reasoning about parallelization: when is it safe?