PERFORMANCE ANALYSIS AND DEBUGGING FOR VOLTA
Felix Schmitt
11th Parallel Tools Workshop, September 11-12, 2017
INTRODUCING TESLA V100
- Volta Architecture: Most Productive GPU
- Improved NVLink & HBM2: Efficient Bandwidth
- Volta MPS: Inference Utilization
- Improved SIMT Model: New Algorithms
- Tensor Core: 120 Programmable TFLOPS Deep Learning
The Fastest and Most Productive GPU for Deep Learning and HPC
AGENDA: Volta Tools Support
- Independent Thread Scheduling
- Unified Memory
- NVLink
- Conclusion
VOLTA INDEPENDENT THREAD SCHEDULING
INDEPENDENT THREAD SCHEDULING
[Diagram: a 32-thread warp with per-thread program counters and a convergence optimizer]
- Volta maintains per-thread program counters: a 32-thread warp with independent scheduling
- Enables starvation-free algorithms: threads may wait for messages from other threads in the same warp
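As an illustration of what this enables, here is a minimal sketch (not part of the original slides; the mutex flag is hypothetical) of a per-thread spin lock. On pre-Volta GPUs this pattern can deadlock when threads of the same warp contend, because lock-step execution may never reschedule the lock holder; Volta's independent thread scheduling guarantees forward progress.

    __device__ void acquire(int *mutex) {
        // Spin until this thread swaps 0 -> 1. Starvation-free within a
        // warp on Volta thanks to independent thread scheduling.
        while (atomicCAS(mutex, 0, 1) != 0) { }
    }

    __device__ void release(int *mutex) {
        __threadfence();        // make protected writes visible first
        atomicExch(mutex, 0);   // unlock
    }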
IMPLICIT WARP SYNCHRONOUS PROGRAMMING: Unsafe and Unsupported
Warp synchronous programming is a CUDA programming technique that leverages warp execution for efficient inter-thread communication.
Implicit warp synchronous programming builds on two unreliable assumptions:
- implicit thread re-convergence points
- implicit lock-step execution of threads in a warp
IMPLICIT WARP SYNCHRONOUS PROGRAMMING: Implicit Lock-Step Execution

    shmem[tid] += shmem[tid+16];   // data race
    shmem[tid] += shmem[tid+8];
    shmem[tid] += shmem[tid+4];
    shmem[tid] += shmem[tid+2];
    shmem[tid] += shmem[tid+1];

Such code will break on Volta! Make warp synchronous programming safe by making synchronizations explicit.
COOPERATIVE GROUPS: Flexible, Explicit Synchronization
- Thread groups are explicit objects in your program:
    thread_group block = this_thread_block();
- You can synchronize threads in a group:
    block.sync();
- Create new groups by partitioning existing groups:
    thread_group tile32 = tiled_partition(block, 32);
    thread_group tile4 = tiled_partition(tile32, 4);
- Partitioned groups can also synchronize:
    tile4.sync();
Note: the calls shown above are part of the cooperative_groups:: namespace.
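To show how these pieces compose, here is a minimal, self-contained sketch (not from the slides; the kernel name, the block size of 256, and the buffer layout are illustrative) of a shared-memory reduction that synchronizes per warp-sized tile instead of relying on lock-step execution:

    #include <cooperative_groups.h>
    namespace cg = cooperative_groups;

    __global__ void tile_sum(const int *in, int *out) {
        __shared__ int smem[256];               // assumes blockDim.x == 256
        cg::thread_block block = cg::this_thread_block();
        cg::thread_group tile  = cg::tiled_partition(block, 32);
        const int tid = block.thread_rank();
        int val = in[blockIdx.x * blockDim.x + tid];
        for (int i = 16; i > 0; i /= 2) {
            smem[tid] = val;
            tile.sync();                        // explicit tile-level sync
            val += smem[tid ^ i];
            tile.sync();
        }
        if (tile.thread_rank() == 0)            // one partial sum per tile
            out[blockIdx.x * (blockDim.x / 32) + tid / 32] = val;
    }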
WARP SYNCHRONOUS BUILT-IN FUNCTIONS: New in CUDA 9.0
- Active-mask query: which threads in a warp are active: __activemask()
- Synchronized data exchange: exchange data between threads in a warp: __all_sync(), __any_sync(), __ballot_sync(), __shfl_sync(), __match_all_sync(), ...
  CUDA 9 deprecates the non-synchronizing __shfl(), __ballot(), __any(), __all()
- Thread synchronization: synchronize threads in a warp and provide a memory fence: __syncwarp()
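A small sketch of the synchronized intrinsics in practice (not from the slides); the full mask 0xffffffff assumes all 32 lanes of the warp participate:

    // Warp-level sum using the CUDA 9 synchronized shuffle.
    __device__ int warp_sum(int val) {
        for (int offset = 16; offset > 0; offset /= 2)
            val += __shfl_down_sync(0xffffffff, val, offset);
        return val;   // lane 0 holds the warp total
    }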
COOPERATIVE GROUPS: Levels of Cooperation with CUDA 9.0
- Warp: for the current coalesced set of threads:
    auto g = coalesced_threads();
- Warp: for a warp-sized group of threads:
    auto block = this_thread_block();
    auto g = tiled_partition<32>(block);
- SM: for CUDA thread blocks:
    auto g = this_thread_block();
- GPU: for a device-spanning grid:
    auto g = this_grid();
- Multi-GPU: for multiple grids spanning GPUs:
    auto g = this_multi_grid();
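As one hedged example of the coalesced level (not from the slides; the global counter and function name are hypothetical), the currently active threads can aggregate an atomic so that only one atomicAdd is issued per coalesced group:

    #include <cooperative_groups.h>
    namespace cg = cooperative_groups;

    __device__ int atomic_inc_aggregated(int *counter) {
        // Only the currently active threads join the group.
        cg::coalesced_group g = cg::coalesced_threads();
        int prev;
        if (g.thread_rank() == 0)
            prev = atomicAdd(counter, g.size());    // one atomic per group
        prev = g.shfl(prev, 0);                     // broadcast the base value
        return prev + g.thread_rank();              // a distinct slot per thread
    }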
IMPLICIT WARP SYNCHRONOUS PROGRAMMING: Problem and Solution

PROBLEM (data race):
    shmem[tid] += shmem[tid+16];
    shmem[tid] += shmem[tid+8];
    shmem[tid] += shmem[tid+4];
    shmem[tid] += shmem[tid+2];
    shmem[tid] += shmem[tid+1];

SOLUTION (make the warp-level synchronization explicit with __syncwarp()):
    v += shmem[tid+16]; __syncwarp();
    shmem[tid] = v;     __syncwarp();
    v += shmem[tid+8];  __syncwarp();
    shmem[tid] = v;     __syncwarp();
    v += shmem[tid+4];  __syncwarp();
    shmem[tid] = v;     __syncwarp();
    v += shmem[tid+2];  __syncwarp();
    shmem[tid] = v;     __syncwarp();
    v += shmem[tid+1];  __syncwarp();
    shmem[tid] = v;
CUDA-MEMCHECK: Enhancements in CUDA 9.0
- Support for the Volta architecture
- Support for Cooperative Groups and the new synchronization primitives
- Support for shared memory atomic instructions
- Detects accesses that extend beyond an allocation (Pascal+)
CUDA-MEMCHECK: Support for Cooperative Groups
- With Volta, threads in a warp do not necessarily execute in lock-step in all cases
- This may require updates to unsynchronized warp-level code to guarantee correctness
- cuda-memcheck's racecheck tool can be used to detect such unsafe code:
    cuda-memcheck --tool racecheck
CUDA-MEMCHECK: Support for Cooperative Groups
Unsafe warp-level programming can be detected on Kepler and later with racecheck.

UNSAFE CODE:
    __device__ char reduce(char val) {
        extern __shared__ char smem[];
        const int tid = threadIdx.x;
        #pragma unroll
        for (int i = warpSize/2; i > 0; i /= 2) {
            smem[tid] = val;
            val += smem[tid ^ i];
        }
        return val;
    }

RACECHECK OUTPUT:
    $ cuda-memcheck --tool racecheck --racecheck-report hazard ./a.out
    ========= CUDA-MEMCHECK
    ========= WARN:(Warp Level Programming) Potential RAW hazard detected at __shared__ 0xf in block (0, 0, 0) :
    =========     Write Thread (15, 0, 0) at 0x00000e08 in /home/user/reduction.cu:32:kernel(void)
    =========     Read Thread (14, 0, 0) at 0x00000ef0 in /home/user/reduction.cu:33:kernel(void)
    ...
CUDA-MEMCHECK: Support for Cooperative Groups
Cooperative Groups adds explicit block- and warp-level synchronization APIs.

UNSAFE CODE (NO CG):
    __device__ char reduce(char val) {
        extern __shared__ char smem[];
        const int tid = threadIdx.x;
        #pragma unroll
        for (int i = warpSize/2; i > 0; i /= 2) {
            smem[tid] = val;
            val += smem[tid ^ i];
        }
        return val;
    }

SAFE COOPERATIVE GROUPS CODE (robust and performant):
    __device__ char reduce(char val) {
        extern __shared__ char smem[];
        const int tid = threadIdx.x;
        thread_group warp = tiled_partition(this_thread_block(), warpSize);
        #pragma unroll
        for (int i = warpSize/2; i > 0; i /= 2) {
            smem[tid] = val;
            warp.sync();
            val += smem[tid ^ i];
            warp.sync();
        }
        return val;
    }
VOLTA UNIFIED MEMORY
VOLTA + NVLINK UNIFIED MEMORY
[Diagram: GPUs and CPUs connected through the Page Migration Engine]
- Unified Memory, GPU-optimized state: access counters + new NVLink features (coherence, atomics, ATS)
- Unified Memory, CPU-optimized state
CUPTI: New Measurement Library Features
- New events for thrashing, throttling, and remote map
  - CUpti_ActivityUnifiedMemoryRemoteMapCause lists possible causes for remote-map events
- Correlate CPU page faults with the source code
- Support to track the allocation and freeing of memory
  - New activity record CUpti_ActivityMemory: virtual base address, size, program counter, timestamps
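A hedged sketch of how a tool might enable these counters through the CUPTI activity API, assuming the counter-kind enums and configure call as named in the CUDA 9 CUPTI headers; error checking and the buffer-request/complete callbacks that CUPTI requires are omitted for brevity:

    #include <cupti.h>

    void enableUnifiedMemoryCounters() {
        CUpti_ActivityUnifiedMemoryCounterConfig cfg[2] = {};
        cfg[0].scope    = CUPTI_ACTIVITY_UNIFIED_MEMORY_COUNTER_SCOPE_PROCESS_SINGLE_DEVICE;
        cfg[0].kind     = CUPTI_ACTIVITY_UNIFIED_MEMORY_COUNTER_KIND_THRASHING;
        cfg[0].deviceId = 0;
        cfg[0].enable   = 1;
        cfg[1]      = cfg[0];
        cfg[1].kind = CUPTI_ACTIVITY_UNIFIED_MEMORY_COUNTER_KIND_REMOTE_MAP;
        cuptiActivityConfigureUnifiedMemoryCounter(cfg, 2);
        // Records are delivered through the buffers registered with
        // cuptiActivityRegisterCallbacks().
        cuptiActivityEnable(CUPTI_ACTIVITY_KIND_UNIFIED_MEMORY_COUNTER);
    }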
UNIFIED MEMORY EVENTS
[Screenshot: Unified Memory events in the Visual Profiler]
NEW UNIFIED MEMORY EVENTS: Visualize Virtual Memory Activity
- Memory Thrashing
- Page Throttling
- Remote Map
FILTER AND ANALYZE
[Screenshots: unfiltered vs. filtered timeline]
FILTER AND ANALYZE
- Measured interval: 12.2 ms
- Timeline shows memory thrashing and read-access page faults
- Analyze read-access page faults and thrashing
OPTIMIZATION

OLD:
    int threadsPerBlock = 256;
    int numBlocks = (length + threadsPerBlock - 1) / threadsPerBlock;
    kernel<<< numBlocks, threadsPerBlock >>>(A, B, C, length);

NEW:
    int threadsPerBlock = 256;
    int numBlocks = (length + threadsPerBlock - 1) / threadsPerBlock;
    cudaMemAdvise(A, size, cudaMemAdviseSetReadMostly, 0);
    cudaMemAdvise(B, size, cudaMemAdviseSetReadMostly, 0);
    kernel<<< numBlocks, threadsPerBlock >>>(A, B, C, length);
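For context, a self-contained sketch of the pattern above (not from the slides; the float element type and initialization step are assumptions):

    // Managed allocations accessed by both CPU and GPU.
    size_t size = length * sizeof(float);
    float *A, *B, *C;
    cudaMallocManaged(&A, size);
    cudaMallocManaged(&B, size);
    cudaMallocManaged(&C, size);
    // ... initialize A and B on the host ...
    cudaMemAdvise(A, size, cudaMemAdviseSetReadMostly, 0);
    cudaMemAdvise(B, size, cudaMemAdviseSetReadMostly, 0);
    kernel<<<numBlocks, threadsPerBlock>>>(A, B, C, length);
    cudaDeviceSynchronize();

cudaMemAdviseSetReadMostly lets the driver replicate read-only pages on each processor that touches them, so the GPU reads local copies instead of repeatedly migrating pages back and forth, which is what removes the thrashing seen in the profile.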
OPTIMIZED APPLICATION
- Measured interval: 2.9 ms
- No DtoH migrations and no thrashing
- Speedup ~4x (2.9 ms vs. 12.2 ms)
CPU PAGE FAULT SOURCE CORRELATION
[Screenshot: selected interval correlated with source code]
SEGMENT MODE TIMELINE
- Segment mode interval
- Heat map for CPU page faults
VOLTA NVLINK
IMPROVED NVLINK INTERCONNECT
NVLINK VISUALIZATION: DGX-1V (Volta)
- Static properties
- Runtime values
- Color codes for NVLink
NVLINK VISUALIZATION: Timeline Events
- NVLink events on the timeline
- Memcpy API
- Color coding of NVLink events
NVLINK ANALYSIS EXAMPLE
Stage I: Data Movement Over PCIe
- Transfer time: 216 milliseconds
NVLINK ANALYSIS EXAMPLE
Stage II: Data Movement Over NVLink
- Transfer time: 65 milliseconds
- Minimal/unused NVLinks; one NVLink under-utilized
NVLINK ANALYSIS EXAMPLE
Stage III: Data Movement Over NVLink with Streams
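The slides do not include the Stage III code; the following is a hedged sketch of the general technique of splitting a peer-to-peer copy across CUDA streams so that several transfers are in flight at once (device IDs 0 and 1, the chunk count, and the names dst/src/bytes are illustrative):

    // Assumptions: 'src' resides on device 0, 'dst' on device 1,
    // 'bytes' is divisible by kChunks, and the GPUs support peer access.
    const int kChunks = 4;
    cudaStream_t streams[kChunks];
    size_t chunk = bytes / kChunks;
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);
    for (int i = 0; i < kChunks; ++i) {
        cudaStreamCreate(&streams[i]);
        cudaMemcpyPeerAsync((char *)dst + i * chunk, 1,
                            (const char *)src + i * chunk, 0,
                            chunk, streams[i]);
    }
    for (int i = 0; i < kChunks; ++i)
        cudaStreamSynchronize(streams[i]);

Issuing the chunks on independent streams keeps the copy engines and NVLink connections busy concurrently, which addresses the under-utilization observed in Stage II.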
CONCLUSION
EXECUTION MODEL CHANGES: With Great Power Comes Great Responsibility
- Your implicitly warp-synchronous code may break on Volta
- Update it using Cooperative Groups or the new synchronization intrinsics
- Tools can help greatly in detecting wrong and unsafe code
- Deploy everywhere from Kepler to Volta
TOOLS UPDATES: Volta and Beyond
- CUPTI provides more detailed performance data and allows greater tool control
- cuda-memcheck can be used to check program correctness on Volta and helps in porting existing applications
- Visual Profiler adds:
  - detailed insight into UVM events
  - correlation of page faults with source code
  - analysis of NVLink utilization