GPU Programming
Rupesh Nasre.
http://www.cse.iitm.ac.in/~rupesh
IIT Madras
July 2017

Debugging

Debugging parallel programs is difficult:
- Non-determinism due to thread scheduling: the output can differ across runs.
- Correct intermediate values may be too large to inspect by hand.

cuda-gdb is used for debugging CUDA programs on real hardware.
- It is an extension to gdb.
- It allows breakpoints, single-stepping, and reading/writing memory contents.

Sample Error

#include <cuda.h>
#include <stdio.h>

__global__ void K(int *x) {
    *x = 0;
}
int main() {
    int *x;
    K<<<2, 10>>>(x);
    cudaDeviceSynchronize();
    return 0;
}

Sample Error

#include <cuda.h>
#include <stdio.h>

__global__ void K(int *x) {
    *x = 0;
    printf("%d\n", *x);   // does not print anything
}
int main() {
    int *x;
    K<<<2, 10>>>(x);
    cudaDeviceSynchronize();
    return 0;
}

Sample Error

#include <cuda.h>
#include <stdio.h>

__global__ void K(int *x) {
    *x = 0;
    printf("%d\n", *x);
}
int main() {
    int *x;
    K<<<2, 10>>>(x);
    cudaDeviceSynchronize();
    cudaError_t err = cudaGetLastError();
    printf("error=%d, %s, %s\n", err, cudaGetErrorName(err), cudaGetErrorString(err));
    return 0;
}

Output: error=77, cudaErrorIllegalAddress, an illegal memory access was encountered
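Here the root cause is that x is a host pointer that was never allocated on the device, so the kernel dereferences an invalid address. A minimal fixed sketch, assuming the intent is simply to write to a single device integer:

#include <cuda.h>
#include <stdio.h>

__global__ void K(int *x) {
    *x = 0;
    printf("%d\n", *x);
}
int main() {
    int *x;
    cudaMalloc(&x, sizeof(int));   // allocate device memory before launching
    K<<<2, 10>>>(x);
    cudaDeviceSynchronize();
    cudaFree(x);
    return 0;
}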

CUDA Errors

cudaSuccess                    = 0   /// No errors
cudaErrorMissingConfiguration  = 1   /// Missing configuration error
cudaErrorMemoryAllocation      = 2   /// Memory allocation error
cudaErrorInitializationError   = 3   /// Initialization error
cudaErrorLaunchFailure         = 4   /// Launch failure
cudaErrorPriorLaunchFailure    = 5   /// Prior launch failure
cudaErrorLaunchTimeout         = 6   /// Launch timeout error
cudaErrorLaunchOutOfResources  = 7   /// Launch out of resources
cudaErrorInvalidDeviceFunction = 8   /// Invalid device function
cudaErrorInvalidConfiguration  = 9   /// Invalid configuration
cudaErrorInvalidDevice         = 10  /// Invalid device
...

Homework: Write programs to invoke these errors.
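Since every CUDA runtime call returns a cudaError_t, a common practice is to wrap each call in a checking macro so failures are reported where they occur. A minimal sketch (the CUDA_CHECK name is illustrative, not part of the CUDA API):

#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

// Report the file and line of the first failing CUDA runtime call, then exit.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t e = (call);                                           \
        if (e != cudaSuccess) {                                           \
            printf("%s:%d: error=%d, %s, %s\n", __FILE__, __LINE__,       \
                   (int)e, cudaGetErrorName(e), cudaGetErrorString(e));   \
            exit(1);                                                      \
        }                                                                 \
    } while (0)

__global__ void K(int *x) { *x = 0; }

int main() {
    int *x;
    CUDA_CHECK(cudaMalloc(&x, sizeof(int)));
    K<<<2, 10>>>(x);
    CUDA_CHECK(cudaGetLastError());        // catches launch-configuration errors
    CUDA_CHECK(cudaDeviceSynchronize());   // catches errors raised during kernel execution
    CUDA_CHECK(cudaFree(x));
    return 0;
}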

cuda-gdb

Generate debug information:
    nvcc -g -G file.cu
This disables optimizations and inserts symbol information.

Run under cuda-gdb:
    cuda-gdb a.out
    (cuda-gdb) run

You may have to stop the window manager when debugging on the GPU that drives the display.
Because there are many threads, cuda-gdb works with a focus (the current thread).

__global__ void K(int *x) {
    *x = 0;
    printf("%d\n", *x);
}

(cuda-gdb) run
Starting program: .../a.out
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7ffff7396700 (LWP 10305)]
[New Thread 0x7ffff696d700 (LWP 10306)]

CUDA Exception: Device Illegal Address
The exception was triggered in device 0.

Program received signal CUDA_EXCEPTION_10, Device Illegal Address.
[Switching focus to CUDA kernel 0, grid 1, block (1,0,0), thread (0,0,0), device 0, sm 13, warp 0, lane 0]
0x0000000000aa9510 in K<<<(2,1,1),(10,1,1)>>> (x=0x0) at gdb2.cu:6
6           printf("%d\n", *x);
(cuda-gdb)

cuda-gdb

(cuda-gdb) info cuda kernels
  Kernel Parent Dev Grid Status   SMs Mask    GridDim  BlockDim  Invocation
*      0      -   0    1 Active   0x00006000  (2,1,1)  (10,1,1)  K(x=0x0)

(cuda-gdb) info threads
  Id  Target Id                                   Frame
  3   Thread 0x7ffff696d700 (LWP 10497) "a.out"   0x00000038db4df113 in poll () from /lib64/libc.so.6
  2   Thread 0x7ffff7396700 (LWP 10496) "a.out"   0x00000038db4eac6f in accept4 () from /lib64/libc.so.6
* 1   Thread 0x7ffff7fca720 (LWP 10487) "a.out"   0x00007ffff77a2118 in cudbgApiDetach () from /usr/lib64/libcuda.so.1

cuda-gdb

(cuda-gdb) info cuda threads
  BlockIdx ThreadIdx  To BlockIdx ThreadIdx  Count      Virtual PC        Filename  Line
Kernel 0
* (0,0,0)  (0,0,0)       (1,0,0)  (9,0,0)       20  0x0000000000aa9f50   gdb2.cu      6

(cuda-gdb) cuda kernel block thread
kernel 0, block (0,0,0), thread (0,0,0)

(cuda-gdb) cuda block 1 thread 0
[Switching focus to CUDA kernel 0, grid 1, block (1,0,0), thread (0,0,0), device 0, sm 13, warp 0, lane 0]
0x0000000000aa9510    6    printf("%d\n", *x);

(cuda-gdb) cuda kernel block thread
kernel 0, block (1,0,0), thread (0,0,0)

Breakpoints

break main                                        // first instruction in main
break file.cu:223                                 // file:line
set cuda break_on_launch application              // kernel entry breakpoint
break file.cu:23 if threadIdx.x == 1 && i < 5     // conditional breakpoint
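The conditional breakpoint above assumes a kernel that has a local variable i in scope at file.cu:23. A hypothetical sketch of such a kernel, for illustration only:

__global__ void accumulate(int *a, int n) {
    for (int i = 0; i < n; ++i) {
        a[threadIdx.x] += i;       // suppose this is file.cu:23; the breakpoint then
    }                              // fires only for threadIdx.x == 1 while i < 5
}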

Step

Once at a breakpoint, you can single-step with:
    step, s, or <enter> (which repeats the previous command)

(cuda-gdb) info cuda sms
  SM  Active Warps Mask
Device 0
   0  0x0000000000000000
   1  0x0000000000000000
   2  0x0000000000000000
   3  0x0000000000000000
   4  0x0000000000000000
   5  0x0000000000000000
   6  0x0000000000000000
   7  0x0000000000000000
   8  0x0000000000000000
   9  0x0000000000000000
  10  0x0000000000000000
  11  0x0000000000000000
  12  0x0000000000000000
  13  0x0000000000000001
* 14  0x0000000000000001

(cuda-gdb) info cuda warps
  Wp  Active Lanes Mask  Divergent Lanes Mask  Active Physical PC   Kernel  BlockIdx  First Active ThreadIdx
Device 0 SM 14
*  0  0x000003ff          0x00000000           0x0000000000000110        0  (0,0,0)   (0,0,0)
   1  0x00000000          0x00000000           n/a                     n/a  n/a       n/a
   2  0x00000000          0x00000000           n/a                     n/a  n/a       n/a
   3  0x00000000          0x00000000           n/a                     n/a  n/a       n/a
   4  0x00000000          0x00000000           n/a                     n/a  n/a       n/a
   5  0x00000000          0x00000000           n/a                     n/a  n/a       n/a
   6  0x00000000          0x00000000           n/a                     n/a  n/a       n/a
   7  0x00000000          0x00000000           n/a                     n/a  n/a       n/a
   8  0x00000000          0x00000000           n/a                     n/a  n/a       n/a
   9  0x00000000          0x00000000           n/a                     n/a  n/a       n/a
  10  0x00000000          0x00000000           n/a                     n/a  n/a       n/a
---Type <return> to continue, or q <return> to quit---

(cuda-gdb) info cuda lanes
  Ln  State     Physical PC         ThreadIdx  Exception
Device 0 SM 14 Warp 0
*  0  active    0x0000000000000110  (0,0,0)    Device Illegal Address
   1  active    0x0000000000000110  (1,0,0)    Device Illegal Address
   2  active    0x0000000000000110  (2,0,0)    Device Illegal Address
   3  active    0x0000000000000110  (3,0,0)    Device Illegal Address
   4  active    0x0000000000000110  (4,0,0)    Device Illegal Address
   5  active    0x0000000000000110  (5,0,0)    Device Illegal Address
   6  active    0x0000000000000110  (6,0,0)    Device Illegal Address
   7  active    0x0000000000000110  (7,0,0)    Device Illegal Address
   8  active    0x0000000000000110  (8,0,0)    Device Illegal Address
   9  active    0x0000000000000110  (9,0,0)    Device Illegal Address
  10  inactive  n/a                 n/a        n/a
  11  inactive  n/a                 n/a        n/a
  12  inactive  n/a                 n/a        n/a
  ...
  29  inactive  n/a                 n/a        n/a
  30  inactive  n/a                 n/a        n/a
  31  inactive  n/a                 n/a        n/a

Homework

For the given program, what sequence of cuda-gdb commands would you use to identify the error?

__global__ void K(int *p) {
    *p = 0;
    printf("%d\n", *p);
}
int main() {
    int *x, *y;
    cudaMalloc(&x, sizeof(int));
    K<<<2, 10>>>(x);
    cudaDeviceSynchronize();
    y = x;
    cudaFree(y);
    K<<<2, 10>>>(x);
    cudaDeviceSynchronize();
    return 0;
}

Profiling

Measuring indicators of performance:
- Time taken by various kernels
- Memory utilization
- Number of cache misses
- Degree of divergence
- Degree of coalescing
- ...

CUDA Profiler

nvprof: command-line profiler
nvvp, nsight: visual profilers

An event is a measurable activity on a device; it corresponds to a hardware counter value. There are about 140 events, for example:
- tex0_cache_sector_queries
- gld_inst_8bit
- inst_executed
- ...
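To collect a specific hardware event, nvprof accepts an --events option (here shown with one of the events listed above; the exact event names available depend on the GPU):

    nvprof --events inst_executed a.out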

nvprof

No changes are required to the binary; nvprof uses defaults:
    nvprof a.out

To profile only part of a program, bracket that part with cudaProfilerStart() and cudaProfilerStop() (include cuda_profiler_api.h) and run:
    nvprof --profile-from-start off a.out
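A minimal sketch of bracketing the region of interest (the kernel K and the warm-up launch are illustrative); with --profile-from-start off, only the launches between the two calls are profiled:

#include <cuda.h>
#include <cuda_profiler_api.h>

__global__ void K(int *x) { *x = 0; }

int main() {
    int *x;
    cudaMalloc(&x, sizeof(int));

    K<<<1, 32>>>(x);             // warm-up launch, excluded from the profile
    cudaDeviceSynchronize();

    cudaProfilerStart();         // profiling begins here
    K<<<1, 32>>>(x);
    cudaDeviceSynchronize();
    cudaProfilerStop();          // profiling ends here

    cudaFree(x);
    return 0;
}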

Which kernel should you optimize?
(Which kernel consumes the most time?)

__global__ void K1(int num) {
    num += num;
    ++num;
}

__device__ int sum = 0;
__global__ void K2(int num) {
    atomicAdd(&sum, num);
}

__global__ void K3(int num) {
    __shared__ int sum;
    sum = 0;
    __syncthreads();
    sum += num;
}

int main() {
    for (unsigned ii = 0; ii < 100; ++ii) { K1<<<5, 32>>>(ii); cudaDeviceSynchronize(); }
    for (unsigned ii = 0; ii < 100; ++ii) { K2<<<5, 32>>>(ii); cudaDeviceSynchronize(); }
    for (unsigned ii = 0; ii < 100; ++ii) { K3<<<5, 32>>>(ii); cudaDeviceSynchronize(); }
    return 0;
}

$ nvprof a.out
==26519== NVPROF is profiling process 26519, command: a.out
==26519== Profiling application: a.out
==26519== Profiling result:
Time(%)      Time  Calls       Avg       Min       Max  Name
 39.46%  191.46us    100  1.9140us  1.8880us  2.1440us  K2(int)
 33.86%  164.26us    100  1.6420us  1.6000us  1.8880us  K3(int)
 26.68%  129.44us    100  1.2940us  1.2480us  1.5360us  K1(int)

==26519== API calls:
Time(%)      Time  Calls       Avg       Min       Max  Name
 95.75%  369.08ms    300  1.2303ms  10.560us  364.03ms  cudaLaunch
  2.33%  8.9986ms    728  12.360us     186ns  619.78us  cuDeviceGetAttribute
  0.91%  3.5039ms      8  437.98us  396.85us  450.61us  cuDeviceTotalMem
  0.73%  2.8134ms    300  9.3780us  6.4650us  32.547us  cudaDeviceSynchronize
  0.18%  699.99us      8  87.498us  85.431us  90.737us  cuDeviceGetName
  0.05%  194.20us    300     647ns     339ns  10.694us  cudaConfigureCall
  0.04%  156.27us    300     520ns     292ns  2.2700us  cudaSetupArgument
  0.00%  9.4130us     24     392ns     186ns     862ns  cuDeviceGet
  0.00%  5.7760us      3  1.9250us     317ns  4.7490us  cuDeviceGetCount

__global__ void K1(int num) {
    num += num;
    ++num;
}

__device__ int sum = 0;
__global__ void K2(int num) {
    atomicAdd(&sum, num);
}

__global__ void K3(int num) {
    __shared__ int sum;
    sum = 0;
    __syncthreads();
    sum += num;
}

int main() {
    // Loop fusion: a single loop launches all three kernels.
    for (unsigned ii = 0; ii < 100; ++ii) {
        K1<<<5, 32>>>(ii);
        K2<<<5, 32>>>(ii);
        K3<<<5, 32>>>(ii);
        cudaDeviceSynchronize();
    }
    return 0;
}

Loop fusion doesn't change the output much. We need kernel fusion.

__device__ int sumg = 0;
__global__ void K(int num) {
    num += num;
    ++num;
    atomicAdd(&sumg, num);
    __shared__ int sum;
    sum = 0;
    __syncthreads();
    sum += num;
}

int main() {
    for (unsigned ii = 0; ii < 100; ++ii) {
        K<<<5, 32>>>(ii);
        cudaDeviceSynchronize();
    }
    return 0;
}

Time(%)      Time  Calls       Avg       Min       Max  Name
100.00%  194.81us    100  1.9480us  1.9200us  2.3040us  K(int)

==26721== API calls:
Time(%)      Time  Calls       Avg       Min       Max  Name
 96.26%  375.85ms    100  3.7585ms  18.175us  373.93ms  cudaLaunch
  2.33%  9.1124ms    728  12.517us     183ns  661.26us  cuDeviceGetAttribute
...

Before (three separate kernels):
Time(%)      Time  Calls       Avg       Min       Max  Name
 39.46%  191.46us    100  1.9140us  1.8880us  2.1440us  K2(int)
 33.86%  164.26us    100  1.6420us  1.6000us  1.8880us  K3(int)
 26.68%  129.44us    100  1.2940us  1.2480us  1.5360us  K1(int)
==26519== API calls:
Time(%)      Time  Calls       Avg       Min       Max  Name
 95.75%  369.08ms    300  1.2303ms  10.560us  364.03ms  cudaLaunch
  2.33%  8.9986ms    728  12.360us     186ns  619.78us  cuDeviceGetAttribute

After kernel fusion (single kernel K):
Time(%)      Time  Calls       Avg       Min       Max  Name
100.00%  194.81us    100  1.9480us  1.9200us  2.3040us  K(int)
==26721== API calls:
Time(%)      Time  Calls       Avg       Min       Max  Name
 96.26%  375.85ms    100  3.7585ms  18.175us  373.93ms  cudaLaunch
  2.33%  9.1124ms    728  12.517us     183ns  661.26us  cuDeviceGetAttribute
...

__device__ int sumg = 0;
__global__ void K() {
    int num = blockIdx.x * blockDim.x + threadIdx.x;
    num += num;
    ++num;
    atomicAdd(&sumg, num);
    __shared__ int sum;
    sum = 0;
    __syncthreads();
    sum += num;
}

int main() {
    K<<<5 * 100, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}

Time(%)      Time  Calls       Avg       Min       Max  Name
100.00%  3.7120us      1  3.7120us  3.7120us  3.7120us  K(void)

==26862== API calls:
Time(%)      Time  Calls       Avg       Min       Max  Name
 96.91%  369.68ms      1  369.68ms  369.68ms  369.68ms  cudaLaunch
  2.16%  8.2407ms    728  11.319us     142ns  461.62us  cuDeviceGetAttribute

nvprof

- Supports device-specific profiling
- Supports remote profiling
- Output can be dumped to a file, e.g. as .csv
- ...