PERFORMANCE ANALYSIS AND DEBUGGING FOR VOLTA. Felix Schmitt, 11th Parallel Tools Workshop, September 11-12, 2017


INTRODUCING TESLA V100: the fastest and most productive GPU for deep learning and HPC. Volta architecture (most productive GPU), improved NVLink & HBM2 (efficient bandwidth), Volta MPS (inference utilization), improved SIMT model (new algorithms), Tensor Cores (120 programmable TFLOPS for deep learning).

AGENDA: Volta Tools Support. Independent Thread Scheduling, Unified Memory, NVLink, Conclusion.

VOLTA INDEPENDENT THREAD SCHEDULING

INDEPENDENT THREAD SCHEDULING. Volta maintains program counters and a convergence optimizer for each 32-thread warp, so the threads of a warp are scheduled independently. This enables starvation-free algorithms on Volta: threads may wait for messages from other threads.

IMPLICIT WARP SYNCHRONOUS PROGRAMMING: Unsafe and Unsupported. Warp-synchronous programming is a CUDA programming technique that leverages warp execution for efficient inter-thread communication. Implicit warp-synchronous programming builds on two unreliable assumptions: implicit thread re-convergence points, and implicit lock-step execution of the threads in a warp.

IMPLICIT WARP SYNCHRONOUS PROGRAMMING: Implicit Lock-Step Execution.
shmem[tid] += shmem[tid+16];   // unsynchronized shared-memory exchange: data race
shmem[tid] += shmem[tid+8];
shmem[tid] += shmem[tid+4];
shmem[tid] += shmem[tid+2];
shmem[tid] += shmem[tid+1];
Such code will break on Volta! Make warp-synchronous programming safe by making synchronizations explicit.

COOPERATIVE GROUPS: Flexible, Explicit Synchronization. Thread groups are explicit objects in your program (the calls below live in the cooperative_groups:: namespace).
Thread block group: thread_group block = this_thread_block();
Synchronize the threads in a group: block.sync();
Create new groups by partitioning existing groups: thread_group tile32 = tiled_partition(block, 32); thread_group tile4 = tiled_partition(tile32, 4);
Partitioned groups can also synchronize: tile4.sync();
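
A minimal sketch of how these group objects fit together inside a kernel; the kernel name and data layout are illustrative additions, only the cooperative_groups calls shown on the slide are taken from the API:

    #include <cooperative_groups.h>
    namespace cg = cooperative_groups;

    // Hypothetical kernel illustrating the group objects from the slide.
    __global__ void group_demo(int *data) {
        cg::thread_block block  = cg::this_thread_block();          // the whole thread block as a group
        cg::thread_group tile32 = cg::tiled_partition(block, 32);   // warp-sized tiles
        cg::thread_group tile4  = cg::tiled_partition(tile32, 4);   // 4-thread tiles
        data[block.thread_rank()] += 1;
        tile4.sync();    // each level of the hierarchy can synchronize independently
        tile32.sync();
        block.sync();
    }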

WARP SYNCHRONOUS BUILT-IN FUNCTIONS: New in CUDA 9.0.
Active-mask query, which threads in a warp are active: __activemask()
Synchronized data exchange between threads in a warp: __all_sync(), __any_sync(), __ballot_sync(), __shfl_sync(), __match_all_sync(), ...
CUDA 9 deprecates the non-synchronizing __shfl(), __ballot(), __any(), __all().
Thread synchronization, synchronize the threads in a warp and provide a memory fence: __syncwarp()
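
To illustrate the *_sync intrinsics, here is a small warp-level sum based on __shfl_down_sync; the function name and the hard-coded full-warp mask are assumptions for the example, not taken from the slides:

    // Warp-level sum: after the loop, lane 0 holds the total of all 32 lanes.
    __device__ int warp_sum(int val) {
        const unsigned mask = 0xffffffffu;               // assumes all 32 lanes participate
        for (int offset = 16; offset > 0; offset /= 2)
            val += __shfl_down_sync(mask, val, offset);  // synchronized data exchange
        return val;
    }

If the calling code may be divergent, the set of participating lanes can be obtained with __activemask() instead of hard-coding the full mask.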

COOPERATIVE GROUPS: Levels of Cooperation with CUDA 9.0.
For the current coalesced set of threads (warp level): auto g = coalesced_threads();
For a warp-sized group of threads (warp level): auto block = this_thread_block(); auto g = tiled_partition<32>(block);
For CUDA thread blocks (SM level): auto g = this_thread_block();
For a device-spanning grid (GPU level): auto g = this_grid();
For multiple grids spanning GPUs (multi-GPU level): auto g = this_multi_grid();
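
Grid-spanning groups only work with a cooperative launch. A hedged sketch, assuming a device that supports cooperative launches and compilation with relocatable device code; the kernel and launcher names are placeholders:

    #include <cooperative_groups.h>
    namespace cg = cooperative_groups;

    // Placeholder kernel that needs a grid-wide barrier between two phases.
    __global__ void grid_step(float *data, int n) {
        cg::grid_group grid = cg::this_grid();
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
        grid.sync();                    // all blocks of the grid reach this point
        if (i == 0) data[0] += 1.0f;    // safe: every prior write is now visible
    }

    void launch_grid_step(float *d_data, int n) {
        void *args[] = { &d_data, &n };
        dim3 block(256), grid((n + block.x - 1) / block.x);
        // Grid-wide sync requires the cooperative launch API instead of <<< >>>;
        // the grid must fit co-resident on the device.
        cudaLaunchCooperativeKernel((void *)grid_step, grid, block, args, 0, 0);
    }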

IMPLICIT WARP SYNCHRONOUS PROGRAMMING: Problem and Solution. Make the synchronization explicit.
Problem (implicit warp-level synchronization, data race):
shmem[tid] += shmem[tid+16];
shmem[tid] += shmem[tid+8];
shmem[tid] += shmem[tid+4];
shmem[tid] += shmem[tid+2];
shmem[tid] += shmem[tid+1];
Solution (explicit warp-level synchronization):
v += shmem[tid+16]; __syncwarp(); shmem[tid] = v; __syncwarp();
v += shmem[tid+8];  __syncwarp(); shmem[tid] = v; __syncwarp();
v += shmem[tid+4];  __syncwarp(); shmem[tid] = v; __syncwarp();
v += shmem[tid+2];  __syncwarp(); shmem[tid] = v; __syncwarp();
v += shmem[tid+1];  __syncwarp(); shmem[tid] = v;
(A complete kernel built around the corrected pattern is sketched below.)
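
A minimal complete kernel built around the corrected pattern, assuming 32 threads per block; the initial load and the bounds guard are additions needed to make the sketch self-contained and are not on the slide:

    // One warp per block reduces 32 ints; the explicit __syncwarp() calls make
    // the shared-memory exchange safe under Volta's independent thread scheduling.
    __global__ void warp_reduce(const int *in, int *out) {
        __shared__ int shmem[32];
        const int tid = threadIdx.x;              // blockDim.x assumed to be 32

        int v = in[blockIdx.x * 32 + tid];
        shmem[tid] = v;
        __syncwarp();

        for (int offset = 16; offset > 0; offset /= 2) {
            if (tid + offset < 32)
                v += shmem[tid + offset];         // read the partner's partial sum
            __syncwarp();                         // all reads complete before any write
            shmem[tid] = v;
            __syncwarp();                         // all writes visible before the next read
        }
        if (tid == 0) out[blockIdx.x] = v;        // thread 0 holds the block sum
    }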

CUDA-MEMCHECK: Enhancements in CUDA 9.0. Support for the Volta architecture. Support for Cooperative Groups and the new synchronization primitives. Support for shared memory atomic instructions. Detects accesses that extend beyond an allocation (Pascal and later).

CUDA-MEMCHECK: Support for Cooperative Groups. With Volta, threads in a warp do not necessarily execute in lock-step in all cases, so unsynchronized warp-level code may require updates to guarantee correctness. cuda-memcheck's racecheck tool can be used to detect such unsafe code: cuda-memcheck --tool racecheck

CUDA-MEMCHECK: Support for Cooperative Groups. Unsafe warp-level programming can be detected on Kepler and later with racecheck.
UNSAFE CODE:
__device__ char reduce(char val) {
    extern __shared__ char smem[];
    const int tid = threadIdx.x;
    #pragma unroll
    for (int i = warpSize/2; i > 0; i /= 2) {
        smem[tid] = val;
        val += smem[tid ^ i];
    }
    return val;
}
RACECHECK OUTPUT:
$ cuda-memcheck --tool racecheck --racecheck-report hazard ./a.out
========= CUDA-MEMCHECK
========= WARN:(Warp Level Programming) Potential RAW hazard detected at __shared__ 0xf in block (0, 0, 0) :
=========     Write Thread (15, 0, 0) at 0x00000e08 in /home/user/reduction.cu:32:kernel(void)
=========     Read Thread (14, 0, 0) at 0x00000ef0 in /home/user/reduction.cu:33:kernel(void)
...

CUDA-MEMCHECK: Support for Cooperative Groups. Cooperative Groups adds explicit block- and warp-level synchronization APIs.
UNSAFE CODE (NO CG):
__device__ char reduce(char val) {
    extern __shared__ char smem[];
    const int tid = threadIdx.x;
    #pragma unroll
    for (int i = warpSize/2; i > 0; i /= 2) {
        smem[tid] = val;
        val += smem[tid ^ i];
    }
    return val;
}
SAFE COOPERATIVE GROUPS CODE (robust and performant):
__device__ char reduce(char val) {
    extern __shared__ char smem[];
    const int tid = threadIdx.x;
    thread_group warp = tiled_partition(this_thread_block(), warpSize);
    #pragma unroll
    for (int i = warpSize/2; i > 0; i /= 2) {
        smem[tid] = val;
        warp.sync();
        val += smem[tid ^ i];
        warp.sync();
    }
    return val;
}
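
One possible host-side use of the safe version above; the wrapper kernel and launch configuration are assumptions for the example, and the reduce() definition from the slide is assumed to be in the same translation unit. The point to note is that the extern __shared__ array needs dynamic shared memory, sized here as one char per thread:

    // Hypothetical wrapper: each warp writes its reduced value once.
    __global__ void reduce_kernel(const char *in, char *out) {
        char v = reduce(in[blockIdx.x * blockDim.x + threadIdx.x]);
        if (threadIdx.x % warpSize == 0)
            out[blockIdx.x * (blockDim.x / warpSize) + threadIdx.x / warpSize] = v;
    }

    void launch_reduce(const char *d_in, char *d_out, int numBlocks) {
        // The third launch parameter provides blockDim.x chars for smem[].
        reduce_kernel<<<numBlocks, 256, 256 * sizeof(char)>>>(d_in, d_out);
    }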

VOLTA UNIFIED MEMORY

VOLTA + NVLINK UNIFIED MEMORY (diagram: GPUs and CPUs connected through the Page Migration Engine). Unified Memory supports a GPU-optimized and a CPU-optimized state; Volta adds access counters and new NVLink features (coherence, atomics, ATS).

CUPTI: New Measurement Library Features. New events for thrashing, throttling, and remote map; CUpti_ActivityUnifiedMemoryRemoteMapCause lists the possible causes for remote-map events. CPU page faults can be correlated with the source code. Support for tracking the allocation and freeing of memory through the new activity record CUpti_ActivityMemory (virtual base address, size, program counter, timestamps). A rough sketch of the activity-API flow follows below.
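
A rough sketch of the CUPTI activity-API flow, under the assumption that the standard buffer-callback pattern applies; selecting specific Unified Memory counters (page faults, thrashing, throttling, remote map) additionally requires cuptiActivityConfigureUnifiedMemoryCounter, which is omitted here for brevity:

    #include <cupti.h>
    #include <stdio.h>
    #include <stdlib.h>

    // Standard CUPTI buffer-handling callbacks (sketch).
    static void CUPTIAPI bufferRequested(uint8_t **buffer, size_t *size,
                                         size_t *maxNumRecords) {
        *size = 16 * 1024;
        *buffer = (uint8_t *)malloc(*size);
        *maxNumRecords = 0;    // let CUPTI fill as many records as fit
    }

    static void CUPTIAPI bufferCompleted(CUcontext ctx, uint32_t streamId,
                                         uint8_t *buffer, size_t size,
                                         size_t validSize) {
        CUpti_Activity *record = NULL;
        while (cuptiActivityGetNextRecord(buffer, validSize, &record) == CUPTI_SUCCESS) {
            if (record->kind == CUPTI_ACTIVITY_KIND_UNIFIED_MEMORY_COUNTER)
                printf("unified memory counter event\n");
            else if (record->kind == CUPTI_ACTIVITY_KIND_MEMORY)
                printf("memory allocation/free record\n");
        }
        free(buffer);
    }

    void enable_uvm_tracing(void) {
        cuptiActivityRegisterCallbacks(bufferRequested, bufferCompleted);
        // Unified Memory counter events (page faults, thrashing, throttling, remote map).
        cuptiActivityEnable(CUPTI_ACTIVITY_KIND_UNIFIED_MEMORY_COUNTER);
        // Allocation/free tracking via the new CUpti_ActivityMemory record.
        cuptiActivityEnable(CUPTI_ACTIVITY_KIND_MEMORY);
    }
    // ... run the application, then call cuptiActivityFlushAll(0) before exit.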

UNIFIED MEMORY EVENTS (Visual Profiler timeline screenshot showing Unified Memory events).

NEW UNIFIED MEMORY EVENTS: Visualize Virtual Memory Activity. The timeline shows memory thrashing, page throttling, and remote map events.

FILTER AND ANALYZE (timeline screenshots: unfiltered vs. filtered view of Unified Memory events).

FILTER AND ANALYZE. The filtered timeline for this example shows a 12.2 ms interval dominated by memory thrashing and read-access page faults; these are the events to analyze and optimize.

OPTIMIZATION
OLD:
int threadsPerBlock = 256;
int numBlocks = (length + threadsPerBlock - 1) / threadsPerBlock;
kernel<<< numBlocks, threadsPerBlock >>>(A, B, C, length);
NEW:
int threadsPerBlock = 256;
int numBlocks = (length + threadsPerBlock - 1) / threadsPerBlock;
cudaMemAdvise(A, size, cudaMemAdviseSetReadMostly, 0);
cudaMemAdvise(B, size, cudaMemAdviseSetReadMostly, 0);
kernel<<< numBlocks, threadsPerBlock >>>(A, B, C, length);
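
For context, a minimal end-to-end sketch of the advised-allocation pattern; the kernel body, sizes, and initialization are placeholders, not the application profiled on the slides:

    #include <cuda_runtime.h>

    // Placeholder kernel: C[i] = A[i] + B[i].
    __global__ void kernel(const float *A, const float *B, float *C, int length) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < length) C[i] = A[i] + B[i];
    }

    int main() {
        const int length = 1 << 20;
        const size_t size = length * sizeof(float);

        float *A, *B, *C;
        cudaMallocManaged(&A, size);
        cudaMallocManaged(&B, size);
        cudaMallocManaged(&C, size);
        for (int i = 0; i < length; ++i) { A[i] = 1.0f; B[i] = 2.0f; }

        // Read-mostly advice: read-only copies of A and B can be kept on both
        // the GPU and the CPU, avoiding the repeated migrations (thrashing).
        cudaMemAdvise(A, size, cudaMemAdviseSetReadMostly, 0);
        cudaMemAdvise(B, size, cudaMemAdviseSetReadMostly, 0);

        int threadsPerBlock = 256;
        int numBlocks = (length + threadsPerBlock - 1) / threadsPerBlock;
        kernel<<< numBlocks, threadsPerBlock >>>(A, B, C, length);
        cudaDeviceSynchronize();

        cudaFree(A); cudaFree(B); cudaFree(C);
        return 0;
    }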

OPTIMIZED APPLICATION. The same interval now takes 2.9 ms, with no device-to-host migrations and no thrashing: a speedup of about 4x (2.9 ms vs. 12.2 ms).

CPU PAGE FAULT SOURCE CORRELATION (screenshot: CPU page faults in a selected interval correlated with the application source code).

SEGMENT MODE TIMELINE (screenshot: segment-mode intervals with a heat map of CPU page faults).

VOLTA NVLINK

IMPROVED NVLINK INTERCONNECT

NVLINK VISUALIZATION on a DGX-1V (Volta). The topology view shows static properties, runtime values, and color codes for each NVLink.

NVLINK VISUALIZATION: Timeline Events. NVLink events appear on the timeline alongside the memcpy API calls, with color coding of the NVLink events.

NVLINK ANALYSIS EXAMPLE, Stage I: Data movement over PCIe takes 216 milliseconds.

NVLINK ANALYSIS EXAMPLE, Stage II: Data movement over NVLink takes 65 milliseconds. The visualization reveals minimal or unused NVLinks as well as an under-utilized NVLink.

NVLINK ANALYSIS EXAMPLE, Stage III: Data movement over NVLink with streams (one way to split a transfer across streams is sketched below).
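
The slides do not show the Stage III code; a hedged sketch of one way to spread a GPU-to-GPU copy across streams so that chunks can overlap on the interconnect. The function name and parameters are illustrative:

    #include <cuda_runtime.h>

    // Split a device-to-device transfer into per-stream chunks; with peer access
    // enabled the copies travel GPU-to-GPU (over NVLink where available).
    void copy_over_nvlink(float *dst, const float *src, size_t bytes,
                          int dstDev, int srcDev, int nStreams) {
        cudaSetDevice(srcDev);
        cudaDeviceEnablePeerAccess(dstDev, 0);   // allow a direct peer-to-peer path

        cudaStream_t *streams = new cudaStream_t[nStreams];
        for (int i = 0; i < nStreams; ++i) cudaStreamCreate(&streams[i]);

        size_t chunk = bytes / nStreams;
        for (int i = 0; i < nStreams; ++i) {
            size_t offset = i * chunk;
            size_t len = (i == nStreams - 1) ? bytes - offset : chunk;
            cudaMemcpyPeerAsync((char *)dst + offset, dstDev,
                                (const char *)src + offset, srcDev,
                                len, streams[i]);
        }
        for (int i = 0; i < nStreams; ++i) {
            cudaStreamSynchronize(streams[i]);
            cudaStreamDestroy(streams[i]);
        }
        delete[] streams;
    }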

CONCLUSION

EXECUTION MODEL CHANGES: with great power comes great responsibility. Your implicitly warp-synchronous code may break on Volta. Update it using Cooperative Groups or the new synchronization intrinsics; these deploy everywhere from Kepler to Volta. Tools can help greatly in detecting wrong and unsafe code.

TOOLS UPDATES: Volta and Beyond. CUPTI provides more detailed performance data and allows greater tool control. cuda-memcheck can be used to check program correctness on Volta and helps in porting existing applications. The Visual Profiler adds detailed insight into UVM events, correlation of page faults with source code, and analysis of NVLink utilization.