ECE 408 / CS 483 Final Exam, Fall 2014


Thursday 18 December 2014, 8:00 to 11:00 Central Standard Time

You may use any notes, books, papers, or other reference materials. In the interest of fair access across the class, you may NOT USE GPUs. (We don't think that they will help you, either.) No interactions with humans other than course staff are allowed.

This exam is designed to take TWO hours. To allow for any unforeseen difficulties, you are allowed THREE hours to complete it. Your exam is due promptly at 11:00 a.m. Central Standard Time.

Submit your answers in PDF form before the deadline by email to lumetta@illinois.edu. You may cc gao2@illinois.edu in case something is wrong with Prof. Lumetta's email, but if you do not send to the right address, you may get a 0. Please use the subject line "ECE408: Final Exam Submission" and clearly indicate your NetID in the body of the email. Either use or cc your Illinois email as the source of your submission.

You can write down the reasoning behind your answers for possible partial credit. Good luck!

Question 1: Short Answer (20 points)

A. (4 points) Explain how writing CUDA kernel code to have thread blocks wait for the execution of other thread blocks in the same grid to complete can lead to problems, even if the dependencies are acyclic.

B. (4 points) You need to transfer 400 MB of data to a GPU which is connected via a PCIe2 link. What is the minimum number of lanes needed for you to be able to perform the transfer in under a second?

C. (4 points) A friend writes a 3D video filtering (convolution) code in CUDA. The mask is 3 x 3 x 5 and is stored in constant memory. Assuming that shared memory is not used, and ignoring boundary effects, how many global memory accesses are needed to process each pixel?

D. (4 points) For a C2050 GPU, assuming optimal use of floating point hardware and memory bandwidth, how many floating point operations are necessary per float loaded from global memory in order to maximize the use of both resources?

E. (4 points) A friend wants your help to write CUDA code that performs a reduction for each field (A, B, C, and D, all integers) in an array of structures. The friend tried reducing A, then B, then C, then D, but got poor performance. Explain why and suggest a simple fix to solve the problem. (A sketch of the layout in question appears below.)
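For part E, the array-of-structures layout the friend describes would look something like the following minimal sketch. The struct and array names are hypothetical; only the field names A, B, C, and D come from the question.

    // Hypothetical layout for Question 1E. Each record packs the four integer
    // fields together, so the A fields of consecutive records sit 16 bytes
    // apart in memory rather than being contiguous.
    struct Record {
        int A, B, C, D;
    };

    // Hypothetical input array: reducing field A alone means each thread's
    // loads skip over the B, C, and D fields of every record it touches.
    Record records[1 << 20];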

Question 2: CUDA Basics (20 points)

For the following vector addition kernel and the corresponding kernel launch code, answer each of the questions below, assuming that the code is running on a C2050 GPU. Note that the code is slightly different from the version discussed in class.

 1  __global__ void vecAddKernel (float* A, float* B, float* C, int n)
 2  {
 3      int i = threadIdx.x + blockDim.x * blockIdx.x * 2;
 4
 5      if (i < n) { C[i] = A[i] + B[i]; }
 6      i += blockDim.x;
 7      if (i < n) { C[i] = A[i] + B[i]; }
 8  }
 9
10  int vectAdd (float* A, float* B, float* C, int n)
11  {
12      // Parameter "n" is the length of arrays A, B, and C.
13      int size = n * sizeof (float);
14      cudaMalloc ((void **)&A_d, size);
15      cudaMalloc ((void **)&B_d, size);
16      cudaMalloc ((void **)&C_d, size);
17      cudaMemcpy (A_d, A, size, cudaMemcpyHostToDevice);
18      cudaMemcpy (B_d, B, size, cudaMemcpyHostToDevice);
19
20      vecAddKernel<<<ceil (n / 2048.0), 1024>>> (A_d, B_d, C_d, n);
21      cudaMemcpy (C, C_d, size, cudaMemcpyDeviceToHost);
22  }

A. (3 points) If the size n of the A, B, and C arrays is 50,000 elements each, how many thread blocks are generated?

B. (3 points) If the size n of the A, B, and C arrays is 50,000 elements each, how many warps are there in each thread block?

C. (3 points) If the size n of the A, B, and C arrays is 50,000 elements each, how many threads in total will be created for the grid launched on line 20?

D. (5 points) If the size n of the A, B, and C arrays is 50,000 elements each, is there any control divergence during the execution of the kernel? Explain why or why not. If so, identify the block number(s) and warp number(s) that cause the control divergence. Also identify the line number(s) at which control diverges for each warp that you have identified.

E. (3 points) Explain one performance advantage of this variant of vector addition relative to the version discussed in class (which handles one element per thread rather than two; that version is sketched below for reference).

F. (3 points) Explain one performance disadvantage of this variant of vector addition relative to the version discussed in class (which handles one element per thread rather than two).
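For reference, the one-element-per-thread version that parts E and F compare against is the standard formulation shown below. This is a sketch, not taken from the exam; it also assumes the device pointers A_d, B_d, and C_d used on lines 14-21 above are declared elsewhere in the program.

    // Standard one-element-per-thread vector addition, for comparison with the
    // two-element-per-thread kernel above.
    __global__ void vecAddKernelOnePerThread (float* A, float* B, float* C, int n)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        if (i < n) { C[i] = A[i] + B[i]; }
    }

    // Corresponding launch: one thread per element instead of one per two.
    // vecAddKernelOnePerThread<<<ceil (n / 1024.0), 1024>>> (A_d, B_d, C_d, n);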

Question 3: Histograms (20 points)

Histograms are a powerful tool in many fields, such as image processing. Their implementation on GPUs is challenging because of the need for atomic operations. One way to accelerate their computation is to use privatization in the fast shared memory. The following code calculates the histogram of an image img using privatization.

 1  __global__ void histogram_kernel (unsigned int* histo,
 2                                    unsigned int* img, int size)
 3  {
 4      __shared__ unsigned int hist_s[BINS];
 5
 6      const int bx = blockIdx.x;                // block and thread indices
 7      const int tx = threadIdx.x;
 8
 9      const int begin = bx * blockDim.x + tx;   // read access constants
10      const int end = size;
11      const int step = blockDim.x * gridDim.x;
12
13      // sub-histogram initialization
14      for (int pos = tx; pos < BINS; pos += blockDim.x) {
15          hist_s[pos] = 0;
16      }
17      __syncthreads ();                         // intra-block synchronization
18
19      // main loop
20      for (int i = begin; i < end; i += step) {
21          // global memory read
22          unsigned int d = hist_func (img[i]);  // returns 0 to BINS - 1
23          // atomic increment in shared memory
24          atomicAdd (&hist_s[d], 1);
25      }
26      __syncthreads ();                         // intra-block synchronization
27
28      // merge in global memory
29      for (int pos = tx; pos < BINS; pos += blockDim.x) {
30          atomicAdd (histo + pos, hist_s[pos]);
31      }
32  }

A. (5 points) Explain why the loop starting on line 14 uses strided accesses (with stride blockDim.x) instead of initializing a contiguous block (for example, thread 0 could initialize indices 0 through (BINS-1)/blockDim.x).
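The kernel above uses BINS and hist_func without defining them; for it to compile, definitions along the following lines would be needed. These are hypothetical placeholders, not part of the exam code; the exam only states that hist_func returns a value from 0 to BINS - 1.

    #define BINS 32   // example value; part D of this question uses 256 instead

    // Hypothetical binning function: maps a pixel value to a bin in [0, BINS).
    __device__ unsigned int hist_func (unsigned int pixel)
    {
        return pixel % BINS;
    }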

As natural images are smooth (that is, they present spatial correlation), it is likely that neighboring pixels fall into the same bin. To avoid atomic conflicts, R sub-histograms per block can be used (and later merged). Consider two different ways of accessing the sub-histograms (to replace line 24):

    atomicAdd (&hist_s[(tx % R) * BINS + d], 1);   // version 1
    atomicAdd (&hist_s[(tx % R) + d * R], 1);      // version 2

The following graph (reproduced here as a table of its plotted values) shows the execution time for a histogram with 32 bins (BINS is 32):

    R (sub-histograms per block)    1      2      4      8      16     32
    Version 1 time (ms)             8.5    3.6    1.6    0.7    0.4    0.3
    Version 2 time (ms)             8.5    2.4    0.7    0.3    0.2    0.1

B. (5 points) Why does version 2 obtain better results?

C. (5 points) What would happen for a histogram with an odd number of BINS?

D. (5 points) As shown in the graph above, increasing the number R of sub-histograms tends to reduce the number of atomic conflicts, and consequently the execution time. Keeping that advantage in mind, explain what might be happening in the graph below. (Note: histograms of 256 BINS are calculated. Tests have been carried out on a Kepler GPU with a maximum of 64 warps per multiprocessor and 48 KB of shared memory. Blocks of 256 threads are used.)

    R (sub-histograms per block)    1      2      4      8      16     32
    Version 2 time (ms)             0.49   0.19   0.11   0.10   0.10   0.17
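With R sub-histograms per block, the shared array would hold R * BINS counters, and the merge on lines 29-31 would have to sum across the R copies before updating global memory. A sketch of that merge for version 2's interleaved layout follows; it is illustrative only, as the exam does not show this step.

    // Sketch: merging R interleaved (version 2) sub-histograms into global memory.
    // In this layout, bin d of copy r is stored at hist_s[r + d * R].
    for (int pos = tx; pos < BINS; pos += blockDim.x) {
        unsigned int sum = 0;
        for (int r = 0; r < R; ++r) {
            sum += hist_s[r + pos * R];
        }
        atomicAdd (histo + pos, sum);
    }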

Question 4: Parallelization (20 points)

The Floyd-Warshall algorithm is used to compute shortest paths between all pairs of nodes in a graph annotated with edge weights. The edge weights may be negative, but the graph may not contain negative-weight cycles. The pseudo code below (taken from Wikipedia, then edited) initializes a matrix dist of pairwise node distances to 0 for all self loops, to edge weights for all nodes connected by an edge, and to infinity for all other pairs of nodes. The code then relaxes the distance for every pair (i,j) by considering the use of node k as an intermediate point. When a path through k is shorter than the current path from i to j, the distance is relaxed (reduced). Note that the standard graph notation G(V,E) is used in the pseudo code: G is the graph, V is the set of vertices/nodes (indexed starting at 1), and E is the set of edges.

 1  // initialization
 2  let dist be a |V| x |V| matrix initialized to ∞ (infinity)
 3  for each vertex v in V
 4      dist[v][v] ← 0
 5  for each edge (u,v) in E
 6      dist[u][v] ← w(u,v)   // the weight of the edge (u,v)
 7
 8  // relaxation
 9  for k from 1 to |V|
10      for i from 1 to |V|
11          for j from 1 to |V|
12              dist[i][j] = minimum (dist[i][j], dist[i][k] + dist[k][j])

A classmate of yours notices that the relaxation portion of the algorithm bears a close resemblance to matrix multiplication and decides to try to map this algorithm onto GPUs using CUDA. They decide to reorder the loops and to use two dimensions of threads and thread blocks, with each thread executing the k loop, as shown in the pseudo code below. Unfortunately, the results seem to be incorrect.

 9  i = blockIdx.y * blockDim.y + threadIdx.y
10  j = blockIdx.x * blockDim.x + threadIdx.x
11  for k from 1 to |V|
12      dist[i][j] = minimum (dist[i][j], dist[i][k] + dist[k][j])

A. (8 points) Explain the problem.

B. (12 points) Suggest an alternative scheme. Specify how you want to parallelize, what synchronization is necessary, how many kernel launches are needed, whether you need additional memory (for double buffering, for example), and when data need to move between CPU and GPU memories. Do not write code (such answers will be ignored).
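For reference only, the sequential relaxation pseudo code above translates directly into C as follows (0-indexed, with INF chosen large enough to exceed any path weight but small enough that INF + INF does not overflow). This is a transcription of the original sequential algorithm, not a proposed answer to part B.

    // Sequential Floyd-Warshall relaxation, transcribed from the pseudo code above.
    // dist is a V x V matrix of path weights; INF marks "no path found yet".
    for (int k = 0; k < V; ++k)
        for (int i = 0; i < V; ++i)
            for (int j = 0; j < V; ++j)
                if (dist[i][k] + dist[k][j] < dist[i][j])
                    dist[i][j] = dist[i][k] + dist[k][j];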

Question 5: Tiling (20 points)

The CUDA program below computes the outer product of two vectors, u and v. The outer product is a specific case of matrix multiplication in which the first matrix is a (column) vector of M elements and the second matrix is the transpose of a vector (also called a row vector) of N elements. When we multiply an M x 1 matrix by a 1 x N matrix, the result is an M x N matrix, which we call the outer product of the two vectors. The code below calculates the outer product of vector u with vector v and returns the answer as matrix A.

 1  #define BLOCK_DIM_X 16
 2  #define BLOCK_DIM_Y 16
 3
 4  __global__ void outer_product_kernel (float* u, float* v, float* A,
 5                                        unsigned int M, unsigned int N)
 6  {
 7      /* Perform the outer product of u and v.
 8       * u is of size M
 9       * v is of size N
10       * A is of size M x N
11       */
12      unsigned int row = blockIdx.y * blockDim.y + threadIdx.y;
13      unsigned int col = blockIdx.x * blockDim.x + threadIdx.x;
14
15      if (row < M && col < N) {
16          A[row * N + col] = u[row] * v[col];
17      }
18  }
19
20  void outer_product (float* u, float* v, float* A,
21                      unsigned int M, unsigned int N)
22  {
23      dim3 blockDim (BLOCK_DIM_X, BLOCK_DIM_Y, 1);
24      dim3 gridDim ((N-1)/BLOCK_DIM_X + 1, (M-1)/BLOCK_DIM_Y + 1, 1);
25
26      outer_product_kernel <<< gridDim, blockDim >>> (u, v, A, M, N);
27  }

A. (14 points) Rewrite the kernel to make use of tiling and shared memory. The tile sizes should correspond to the thread block size, BLOCK_DIM_X wide by BLOCK_DIM_Y high. You should assume that both thread block dimensions are greater than 1, and that their product does not require more threads than are available in one streaming multiprocessor. You should not make other assumptions about the thread block dimensions in your code.

B. (3 points) How many times is each element of u loaded from global memory in the original version of the code? And in the tiled version?

C. (3 points) How many times is each element of v loaded from global memory in the original version of the code? And in the tiled version?
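Since outer_product launches the kernel directly on its pointer arguments, u, v, and A must already be device pointers. A minimal calling sketch under that assumption appears below; the buffer names and sizes are hypothetical, the host inputs are assumed to be filled in elsewhere, and error checking is omitted.

    // Sketch only: host and device buffers for a call to outer_product as
    // defined above. Names and sizes here are hypothetical.
    unsigned int M = 1024, N = 2048;
    float *u_h = (float *) malloc (M * sizeof (float));       // host input vector u
    float *v_h = (float *) malloc (N * sizeof (float));       // host input vector v
    float *A_h = (float *) malloc (M * N * sizeof (float));   // host result matrix

    float *u_d, *v_d, *A_d;                                   // device copies
    cudaMalloc ((void **)&u_d, M * sizeof (float));
    cudaMalloc ((void **)&v_d, N * sizeof (float));
    cudaMalloc ((void **)&A_d, M * N * sizeof (float));
    cudaMemcpy (u_d, u_h, M * sizeof (float), cudaMemcpyHostToDevice);
    cudaMemcpy (v_d, v_h, N * sizeof (float), cudaMemcpyHostToDevice);

    outer_product (u_d, v_d, A_d, M, N);                      // launches the kernel
    cudaMemcpy (A_h, A_d, M * N * sizeof (float), cudaMemcpyDeviceToHost);

    cudaFree (u_d);  cudaFree (v_d);  cudaFree (A_d);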