Assignment 7. CUDA Programming Assignment


B. Wilkinson, April 17a, 2016

The purpose of this assignment is to become familiar with writing, compiling, and executing CUDA programs. We will use cci-grid08, which has a 2496-core NVIDIA K20 GPU. In Part 1, you are asked to compile an existing CUDA program that adds two vectors, using a given makefile. In Part 2, you are asked to write a matrix multiplication program in CUDA and test it on the GPU with various grid and block sizes. In Part 3, you are asked to write a program that implements Ranksort. Part 4 asks you to add shared memory. In all your programs, a sequential version is to be executed on the CPU alone for comparison purposes.

The sample programs listed in this assignment are not in the course VM or posted; you have to create them yourself. Do not copy and paste from the pdf unless you are sure you are removing any unwanted characters.

One other GPU server is currently available, cci-grid07, which has a 448-core NVIDIA C2050 GPU. Use that server if cci-grid08 is not functioning. You may also use your own computer instead of cci-grid08 if you have an NVIDIA GPU card and you install the NVIDIA CUDA Toolkit; however, you may not get as great a speed-up. Whatever system you use, clearly state it in your report, along with the type of host processor and GPU.

On a Windows platform you can use PuTTY/WinSCP to connect to the cluster rather than the VM, which may be easier in this assignment as all the programming is done on the cluster. See the course link Additional Information > "Accessing remote servers from a Windows platform (PuTTY, WinSCP, nano)" for more information.

Preliminaries (2%)

Log in to cci-gridgw.uncc.edu and ssh to cci-grid08. You will be able to see your home directory on the cluster. Create a directory called Assign7 and cd into it.

GPU limitations. Display the details of the GPU(s) installed by issuing the command:

deviceQuery

Keep the output, as you may need it. In particular, note the maximum number of threads in a block and the maximum sizes of blocks and grid. Also invoke the bandwidth test by issuing the command:

bandwidthTest

Note the maximum host-to-device, device-to-host, and device-to-device bandwidths.
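The same limits can also be read programmatically with the CUDA runtime API, which is convenient if deviceQuery is not on your path. A minimal sketch (using the standard cudaGetDeviceProperties call; the file name is our own):

```
// query_limits.cu -- print the limits deviceQuery reports, via the runtime API
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {  // device 0
        fprintf(stderr, "No CUDA device found\n");
        return 1;
    }
    printf("Device: %s\n", prop.name);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max block dimensions:  %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Max grid dimensions:   %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}
```

Compile it with nvcc in the same way as the programs below.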

Part 1 Compiling and executing a vector addition CUDA program (28%)

In this part, you will compile and execute a CUDA program to perform vector addition. The program is given below:

// VectorAdd.cu
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

#define N 10     // size of vectors
#define B 1      // blocks in the grid
#define T 10     // threads in a block

__global__ void add(int *a, int *b, int *c) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}

int main(void) {
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;

    cudaMalloc((void**)&dev_a, N * sizeof(int));
    cudaMalloc((void**)&dev_b, N * sizeof(int));
    cudaMalloc((void**)&dev_c, N * sizeof(int));

    for (int i = 0; i < N; i++) {   // load arrays with some numbers
        a[i] = i;
        b[i] = i * 2;
    }

    cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

    add<<<B,T>>>(dev_a, dev_b, dev_c);

    cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);

    for (int i = 0; i < N; i++)
        printf("%d + %d = %d\n", a[i], b[i], c[i]);

    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    return 0;
}
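The listing above does not time anything, but the later parts of this assignment ask you to time the device computation with CUDA events. One minimal pattern, sketched here around the add kernel launch from the listing (the event variable names are our own), would go inside main():

```
// Time the kernel launch with CUDA events (elapsed time is in milliseconds).
cudaEvent_t start, stop;
float elapsed_ms;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);          // record in the default stream
add<<<B,T>>>(dev_a, dev_b, dev_c);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);         // wait until the kernel has finished

cudaEventElapsedTime(&elapsed_ms, start, stop);
printf("Kernel time: %f ms\n", elapsed_ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

The same pair of events can bracket the host-only computation as well, so that the CPU and GPU measurements use the same clock when computing the speed-up factor.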

Create a file called VectorAdd.cu containing the vector addition program. Next, create a file called Makefile and copy the following into it:

NVCC = /usr/local/cuda/bin/nvcc
CUDAPATH = /usr/local/cuda
NVCCFLAGS = -I$(CUDAPATH)/include
LFLAGS = -L$(CUDAPATH)/lib64 -lcuda -lcudart -lm

VectorAdd: VectorAdd.cu
	$(NVCC) $(NVCCFLAGS) -o VectorAdd VectorAdd.cu $(LFLAGS)

Be careful to have a tab before the build command. (Very important.)

To compile the program, type make VectorAdd (or just make, as there is only one build rule). Execute the program by typing the name of the executable, including the current directory ./, i.e. ./VectorAdd. Confirm that the results are correct.

What to submit from this part
1) Screenshot(s) of executing deviceQuery and bandwidthTest
2) Screenshot of compiling and executing VectorAdd

Part 2 Matrix multiplication (30%)

Modify the CUDA program in Part 1 to perform matrix multiplication with two N x N integer matrices, a and b, stored as flattened 1-D arrays. Initialize the matrices with randomly generated numbers on the host using the code:

int row, col;
srand(2);
for (row = 0; row < N; row++)      // load arrays with some numbers
    for (col = 0; col < N; col++) {
        a[row * N + col] = rand() % 10;
        b[row * N + col] = rand() % 10;
    }

Incorporate the following features to enable experimentation on the speed-up factor:

(a) Different sizes for the matrices. Use dynamically allocated memory and add keyboard input statements so that N can be specified.
(b) Add host code to compute the matrix multiplication on the host only.
(c) Add code to verify that both CPU and GPU versions of matrix multiplication produce the same, correct results.
(d) Different CUDA 2-D grid/block structures. Add keyboard input statements to enter different values for:
    Number of threads in a block (T)
    Number of blocks in a grid (B)

Assume a square 2-D grid and square 2-D block structure. Include checks for invalid input. Ensure that the GPU's limitations are met, using the data given by deviceQuery (Preliminaries).

(e) Timing. Add statements to time the execution of the code using CUDA events, both for the host-only (CPU) computation and for the device (GPU) computation, and display the results. Compute and display the speed-up factor.

Arrange for the code to return to keyboard input after each computation rather than re-starting the program and re-launching the kernel. Include print statements to show all input values.

During code development, it is recommended that you recompile and test the code after adding each of (a), (b), (c), (d), and (e). Modify the makefile accordingly to compile the code.

Execute your code and experiment with four different combinations of T, B, and N, collecting timing results including the speed-up factor. What is the effect of the first kernel launch with one combination of T, B, and N? That is, record the execution time the first time the kernel is executed and the subsequent execution times when the same kernel is executed with the same parameters without exiting the main program. Discuss the results.

What to submit from this part
1) A properly commented listing of your matrix multiplication program with features (a)-(e) incorporated.
2) Screenshots of executing your matrix multiplication program with four different combinations of T, B, and N, displaying the values and the speed-up factor in each case.
3) An experiment showing the effect of the first kernel launch with one combination of T, B, and N.
4) Discussion of the results.

Part 3 Sorting (30%)

Write a CUDA program that implements Ranksort. Generate the N numbers to sort on the host in an array A[N] using the random number generator:

srand(3);                  // initialize random number generator
for (i = 0; i < N; i++)
    A[i] = (int)rand();    // load array with numbers
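As a reminder of the algorithm: Ranksort computes, for each element, how many elements are smaller than it; that count is the element's final position in the sorted output. A minimal sequential sketch in plain C, in the spirit of the host-only version this part asks for (the function name and the index tie-break are our own choices):

```c
/* Rank sort: the rank of A[i] is the number of elements that precede it
   in sorted order. Equal elements are ordered by index, so every rank is
   unique and no output slot is written twice. */
void ranksort(const int *A, int *B, int n) {
    for (int i = 0; i < n; i++) {
        int rank = 0;
        for (int j = 0; j < n; j++)
            if (A[j] < A[i] || (A[j] == A[i] && j < i))
                rank++;
        B[rank] = A[i];   /* place A[i] at its final position */
    }
}
```

Each element's rank is computed independently of the others, which is what makes the algorithm a natural fit for one CUDA thread per element.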
Use a 1-dimensional block and grid structure for the kernel. Incorporate the following features to enable experimentation on the speed-up factor:

(a) Add keyboard input statements to specify the number of numbers (N).
(b) Add code to sort the numbers using Ranksort on the host only (and make sure the results are correct).
(c) Add code to verify that both CPU and GPU versions of Ranksort produce the same results.
(d) Add keyboard input statements to enter different values for:

    Number of threads in a block (T)
    Number of blocks in a grid (B)

Include checks for invalid input. Ensure that the GPU's limitations are met, using the data given by deviceQuery (Preliminaries).

(e) Add statements to time the execution of the code using CUDA events, both for the host-only (CPU) computation and for the device (GPU) computation, and display the results. Compute and display the speed-up factor.

Execute your code and experiment with four different combinations of T, B, and N, collecting timing results including the speed-up factor.

What to submit from Part 3
1) A properly commented listing of your sorting program with features (a)-(e) incorporated.
2) Screenshots of executing your sorting program with four different combinations of T, B, and N, displaying the values and the speed-up factor in each case.
3) Discussion of the results.

Part 4 Using Shared Memory (10%)

Modify the sorting code in Part 3 to also sort the numbers with Ranksort on the GPU using shared memory to hold the numbers. Include all the features listed in Part 3. The code should give the execution time with shared memory and without shared memory.

What to submit from Part 4
1) A properly commented listing of your sorting program.
2) Screenshots of executing your sorting program with four different combinations of T, B, and N, displaying the values and the speed-up factors in each case, with and without shared memory.
3) Discussion of the results.

Assignment Report Preparation and Grading

Produce a report that shows you successfully followed the instructions and performed all tasks, taking screenshots and including them in the document. Every task and subtask specified will be allocated a score, so make sure you clearly identify each part/task you did. Make sure you include everything specified in the "What to submit..." section at the end of each part.

Include all code, not as screenshots of the listings but as complete, properly documented code listings. The programs are often too long to see in a single screenshot and are also not easy to read if small. Using multiple screenshots is not an option for code. Copy/paste the code into the Word document or whatever you are using to create the report. You will lose 10 points if you submit code as screenshots.

Assignment Submission

Convert your report into a single pdf document and submit the pdf file on Moodle by the due date as described on the course home page. It is extremely important that you submit only a pdf file. You will lose 10 points if you submit in any other format (e.g. Word, OpenOffice, zip, ...); a zip with multiple files is especially irritating.