VOLTA: PROGRAMMABILITY AND PERFORMANCE
Jack Choquette, NVIDIA
Hot Chips 2017
TESLA V100
- 21B transistors, 815 mm²
- 80 SMs*, 5120 CUDA cores, 640 Tensor Cores
- 16 GB HBM2, 900 GB/s HBM2 bandwidth
- 300 GB/s NVLink
*full GV100 chip contains 84 SMs
GPU PERFORMANCE COMPARISON

                      P100            V100             Ratio
DL Training           10 TFLOPS       120 TFLOPS       12x
DL Inferencing        21 TFLOPS       120 TFLOPS       6x
FP64/FP32             5/10 TFLOPS     7.5/15 TFLOPS    1.5x
HBM2 Bandwidth        720 GB/s        900 GB/s         1.2x
STREAM Triad Perf     557 GB/s        855 GB/s         1.5x
NVLink Bandwidth      160 GB/s        300 GB/s         1.9x
L2 Cache              4 MB            6 MB              1.5x
L1 Caches             1.3 MB          10 MB             7.7x
NVLINK
NVLINK - PROGRAMMABILITY
Direct load/store access for all agents:
- Flat shared address space
- CPU mastering of NVLink
GPU & CPU atomics:
- Atomic completion at destination
- Read-modify-write primitive for unsupported atomics
Address Translation Services (ATS):
- GPU uses the CPU page tables directly
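As a minimal CUDA sketch of what the flat shared address space and GPU/CPU atomics enable, assuming a platform where the CPU and GPU can access the same allocation concurrently (e.g. over NVLink with ATS): cudaMallocManaged and atomicAdd_system are standard CUDA, while the kernel and counter names are illustrative only.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Illustrative kernel: GPU threads increment a counter that the CPU also
    // updates; atomicAdd_system makes the atomic visible system-wide.
    __global__ void gpu_increment(int *counter) {
        atomicAdd_system(counter, 1);
    }

    int main() {
        int *counter;
        cudaMallocManaged(&counter, sizeof(int));   // one flat, shared address
        *counter = 0;

        gpu_increment<<<1, 256>>>(counter);
        __sync_fetch_and_add(counter, 1);           // CPU atomic on the same address (GCC builtin)
        cudaDeviceSynchronize();

        printf("counter = %d\n", *counter);         // expect 257
        cudaFree(counter);
        return 0;
    }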
NVLINK PERFORMANCE AND POWER
Bandwidth:
- 25 Gbps signaling
- 6 NVLinks for GV100
- 1.9x bandwidth improvement over GP100
Coherence:
- Latency-sensitive CPU can cache GMEM
- Fast access in local cache hierarchy
- Probe filter in GPU
Power savings:
- Reduce the number of active lanes for a lightly loaded link
NVLINK NODES
HPC: P9 CORAL node (Summit) - two P9 CPUs NVLink-connected to six V100 GPUs
DL: hybrid cube mesh - DGX-1 with Volta, eight V100 GPUs
SM CORE
VOLTA GV100 SM
Redesigned for productivity and accessible performance:
- Twice the schedulers
- Simplified issue logic
- Large, fast L1 cache
- Improved SIMT model
- Tensor acceleration
- +50% energy efficiency vs the GP100 SM
(SM diagram: shared L1 I$, four sub-cores, shared TEX and L1 D$ & SMEM)
SM MICROARCHITECTURE
- Shared L1 I$: 4 warp instructions/clk
- 4 independently scheduled sub-cores, 1 warp instruction/clk each
- 64 B/clk data path from each sub-core to the shared MIO (TEX/L1$/SMEM)
- TEX: 1 quad/clk
- L1 D$ & SMEM: 128 KB, 128 B/clk; backed by the L2 $
SUB-CORE
- Warp scheduler: 1 warp instruction/clk; L0 I$ (fed from the shared L1 I$), constant $, branch unit (BRU, 1 branch/clk)
- Math dispatch unit: 1 warp instruction/clk, keeps 2+ datapaths busy
- Datapaths: FP32 16 FFMA/clk, INT 16/clk, FP64 8 DFMA/clk, MUFU 4/clk
- Two Tensor Cores: two 4x4x4 tensor ops/clk
- Register file: 512 * 32 threads * 32 bits
- MIO queue (load/store/TEX): holds instructions for later scheduling by the MIO scheduler (1 warp instruction / 2 clk); MIO datapath 64 B/clk
L1 AND SHARED MEMORY
Streaming L1$:
- Unlimited cache misses in flight
- Low cache hit latency
- 4x bandwidth vs GP100
- 4x capacity vs GP100
Shared memory:
- Unified storage with the L1$
- Configurable up to 96 KB
(Diagram: per-sub-core data and instruction paths at 64 B/clk into the MIO; MIO scheduler; TEX 1 quad/clk; L1 D$ & SMEM 128 KB, 128 B/clk; backed by the L2 $)
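As a minimal sketch of the configurable carveout from CUDA: opting a kernel into more than the default 48 KB of dynamic shared memory per block, up to the 96 KB GV100 allows. cudaFuncSetAttribute and cudaFuncAttributeMaxDynamicSharedMemorySize are standard CUDA 9 APIs; the kernel and sizes here are illustrative only.

    #include <cuda_runtime.h>

    // Illustrative kernel that stages data through dynamic shared memory.
    __global__ void stage_and_copy(const float *in, float *out, int n) {
        extern __shared__ float tile[];              // sized at launch time
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) tile[threadIdx.x] = in[i];
        __syncthreads();
        if (i < n) out[i] = tile[threadIdx.x];
    }

    void launch(const float *in, float *out, int n) {
        // Dynamic shared memory above 48 KB per block must be opted into;
        // the unified L1/SMEM storage on GV100 allows up to 96 KB.
        // (The kernel only touches a small part; the size just shows the opt-in.)
        int bytes = 96 * 1024;
        cudaFuncSetAttribute((const void *)stage_and_copy,
                             cudaFuncAttributeMaxDynamicSharedMemorySize, bytes);
        stage_and_copy<<<(n + 255) / 256, 256, bytes>>>(in, out, n);
    }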
NARROWING THE SHARED MEMORY GAP with the GV100 L1 cache
Cache vs. shared: easier to use, 90%+ as good
Shared vs. cache: faster atomics, more banks, more predictable
Average shared-memory benefit (directed testing, shared-memory code run from global memory): 70% of shared-memory performance on Pascal vs. 93% on Volta
INDEPENDENT THREAD SCHEDULING
Convergence Optimizer
PROGRAMMING MODEL: MEMORY ADDRESSES
Convergence is prescribed (e.g. Vector):
- Programs written using annotations, or the compiler ($(CC)) must prove convergence
GPU innovation - convergence is an optimization (e.g. NVIDIA GPUs):
- Programs written as normal
- Convergence is a performance optimization
PROGRAMMING MODEL: CONTROL ADDRESSES
Convergence is prescribed (e.g. scalar + predicated vector; pre-Volta GPUs):
- Programs written without synchronization, or the compiler ($(CC)) must prove convergence
Volta innovation - convergence is an optimization (e.g. Volta's Independent Thread Scheduling):
- Programs written as normal
- Convergence is a performance optimization
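A minimal CUDA sketch of what "convergence is an optimization" means for code on Volta (the kernel is illustrative; __syncwarp and the *_sync shuffle intrinsics are standard CUDA 9): after a divergent branch, reconvergence is requested explicitly rather than assumed before any warp-wide data exchange.

    // Illustrative warp-level reduction: under independent thread scheduling,
    // convergence after the divergent branch is requested with __syncwarp()
    // rather than assumed, and data exchange uses full-mask *_sync intrinsics.
    __global__ void warp_sum(const float *in, float *out) {
        int tid  = blockIdx.x * blockDim.x + threadIdx.x;
        int lane = threadIdx.x & 31;
        float v = in[tid];
        if (lane < 16)
            v *= 2.0f;                         // divergent work, illustrative only
        __syncwarp();                          // ask for convergence here
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_down_sync(0xffffffffu, v, offset);
        if (lane == 0)
            out[tid >> 5] = v;                 // lane 0 holds the warp's sum
    }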
NVIDIA SIMT GPUS: SCALAR THREADS ON SIMD UNITS
C, C++, FORTRAN -> array of scalar threads -> scalar compiler -> scalar ISA -> thread virtualization tech -> SIMD units
- Scalar compiler / scalar ISA: 50+ year history of optimizations
- Thread virtualization (TLP-to-ILP conversion): 30+ year history - what NVIDIA GPUs have always used
- SIMD units: where the efficiency comes from
COOPERATION SYNCHRONIZATION
Example: match.any.sync
- Warp-synchronizing operation
- e.g. our NEW lane data compare
- Result distributed across warp
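This maps directly onto a CUDA 9 warp intrinsic on Volta; a minimal sketch, with the kernel and buffer names illustrative only:

    // __match_any_sync: each lane gets back a mask of all lanes in the warp
    // whose value equals its own - the lane data compare, with the result
    // distributed across the warp.
    __global__ void match_demo(const int *values, unsigned *peer_masks) {
        int lane = threadIdx.x & 31;
        int v = values[lane];
        unsigned peers = __match_any_sync(0xffffffffu, v);   // warp-synchronizing
        peer_masks[lane] = peers;
    }

A typical use is grouping lanes that target the same histogram bin or hash bucket so that one lane per group issues a single combined atomic.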
EFFICIENT GPU THREAD SYNCHRONIZATION
Critical-section performance (relative to a single thread): CPU threads (10's) vs. GPU threads (10K's)

    struct mutex {
        void lock() {
            while (1) {
                bool state = false;
                if (locked.compare_exchange_weak(state, true, std::memory_order_acquire))
                    return;
                while (locked.load(std::memory_order_relaxed))
                    do_exponential_backoff();   // for brevity
            }
        }
        void unlock() {
            locked.store(false, std::memory_order_release);
        }
        atomic<bool> locked = ATOMIC_VAR_INIT(false);
    };

Note: Unfair spinlock with exponential back-off algorithm; x86-64 Linux 4.8.0, i7-5820K, CUDA 9.0 pre-release software, V100 pre-production hardware, Aug 2017
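The GPU-side code is not shown on the slide; below is a rough device-side sketch of the same unfair-spinlock idea built from CUDA atomics (back-off again omitted). It relies on Volta's independent thread scheduling: a thread holding the lock keeps making forward progress even while other threads in its own warp spin on it.

    __device__ int lock_word = 0;         // 0 = unlocked, 1 = locked
    __device__ int shared_counter = 0;    // protected by lock_word

    __device__ void lock() {
        while (atomicCAS(&lock_word, 0, 1) != 0) { /* spin; back-off omitted */ }
        __threadfence();                  // order the critical section after the acquire
    }

    __device__ void unlock() {
        __threadfence();                  // make critical-section writes visible first
        atomicExch(&lock_word, 0);
    }

    // Every thread repeatedly enters the critical section. On pre-Volta SIMT this
    // pattern could live-lock within a warp; with independent thread scheduling
    // it is starvation-free.
    __global__ void contend(int iters) {
        for (int i = 0; i < iters; ++i) {
            lock();
            *(volatile int *)&shared_counter += 1;   // bypass L1 inside the lock
            unlock();
        }
    }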
TENSOR CORE
TENSOR CORE
Mixed-precision matrix math on 4x4 matrices:
D = A * B + C
A and B are FP16 4x4 matrices; the accumulators C and D are FP16 or FP32 4x4 matrices.
TENSOR SYNCHRONIZATION
Full-warp 16x16 matrix math:
- Warp-synchronizing operation
- Composed matrix multiply and accumulate for 16x16 matrices
- Result distributed across warp
VOLTA TENSOR OPERATION
- FP16 storage/input
- Full-precision product
- Sum with FP32 accumulator
- Convert to FP32 result
F16 * F16 + F32 -> F32, repeated as more products are accumulated
Also supports an FP16 accumulator mode for inferencing
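From CUDA, the warp-wide 16x16 operation composed of these 4x4x4 steps is exposed through the WMMA API introduced in CUDA 9 (<mma.h>, sm_70). A minimal sketch with FP16 inputs and FP32 accumulation; pointer names and leading dimensions are illustrative only:

    #include <cuda_fp16.h>
    #include <mma.h>
    using namespace nvcuda;

    // One warp computes D = A*B + C for a single 16x16x16 tile:
    // FP16 inputs, FP32 accumulation, result distributed across the warp.
    __global__ void wmma_tile(const half *a, const half *b, float *d) {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

        wmma::fill_fragment(acc_frag, 0.0f);                  // C = 0 for this sketch
        wmma::load_matrix_sync(a_frag, a, 16);                // warp-synchronizing loads
        wmma::load_matrix_sync(b_frag, b, 16);
        wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);   // tensor-core MMA
        wmma::store_matrix_sync(d, acc_frag, 16, wmma::mem_row_major);
    }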
A GIANT LEAP FOR DEEP LEARNING
cuBLAS mixed-precision matrix multiply (FP16 input, FP32 compute), M = N = K = 2048:
V100 Tensor Cores (CUDA 9) are 9.3x faster than P100 (CUDA 8)
V100 measured on pre-production hardware.
TESLA V100
- Volta architecture: the most productive GPU
- Improved NVLink & HBM2: efficient bandwidth
- New SM core: performance & programmability
- Independent thread scheduling: new algorithms
- Tensor Core: 120 programmable TFLOPS for deep learning
More V100 features: 2x L2 atomics, int8, new memory model, copy engine page migration, MPS acceleration, and more
The Fastest and Most Productive GPU for Deep Learning and HPC
QUESTIONS?