VOLTA: PROGRAMMABILITY AND PERFORMANCE. Jack Choquette NVIDIA Hot Chips 2017

1 VOLTA: PROGRAMMABILITY AND PERFORMANCE Jack Choquette NVIDIA Hot Chips

2 TESLA V100
21B transistors, 815 mm²
80 SMs*, 5120 CUDA Cores, 640 Tensor Cores
16 GB HBM2, 900 GB/s HBM2 bandwidth
300 GB/s NVLink
*The full GV100 chip contains 84 SMs.

3 GPU PERFORMANCE COMPARISON

                      P100           V100            Ratio
DL Training           10 TFLOPS      120 TFLOPS      12x
DL Inferencing        21 TFLOPS      120 TFLOPS      6x
FP64/FP32             5/10 TFLOPS    7.5/15 TFLOPS   1.5x
HBM2 Bandwidth        720 GB/s       900 GB/s        1.2x
STREAM Triad Perf     557 GB/s       855 GB/s        1.5x
NVLink Bandwidth      160 GB/s       300 GB/s        1.9x
L2 Cache              4 MB           6 MB            1.5x
L1 Caches             1.3 MB         10 MB           7.7x
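The Ratio column can be recomputed from the P100 and V100 columns. The sketch below does that; the struct and function names are made up for illustration, and the FP64/FP32 row is split into two entries.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// One row of the slide 3 comparison; the numbers are copied from the slide.
struct Row {
    std::string metric;
    double p100;
    double v100;
};

// V100-to-P100 ratio for one row.
double ratio(const Row& r) { return r.v100 / r.p100; }

// The slide 3 rows, with FP64 and FP32 listed separately.
std::vector<Row> slide3_rows() {
    return {
        {"DL Training (TFLOPS)",     10.0, 120.0},
        {"DL Inferencing (TFLOPS)",  21.0, 120.0},
        {"FP64 (TFLOPS)",             5.0,   7.5},
        {"FP32 (TFLOPS)",            10.0,  15.0},
        {"HBM2 Bandwidth (GB/s)",   720.0, 900.0},
        {"STREAM Triad (GB/s)",     557.0, 855.0},
        {"NVLink Bandwidth (GB/s)", 160.0, 300.0},
        {"L2 Cache (MB)",             4.0,   6.0},
        {"L1 Caches (MB)",            1.3,  10.0},
    };
}

// Prints each metric with its unrounded ratio.
void print_ratios() {
    for (const Row& r : slide3_rows())
        std::printf("%-26s %.3fx\n", r.metric.c_str(), ratio(r));
}
```

The unrounded values are close to the slide's figures but not identical: 120/21 is about 5.7 (shown as 6x), 900/720 is 1.25 (shown as 1.2x), and 300/160 is 1.875 (shown as 1.9x).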

4 NVLINK

5 NVLINK - PROGRAMMABILITY
CPU Mastering: direct load/store access for all agents; flat shared address space.
GPU & CPU atomics: atomic completion at destination; read-modify-write primitive for unsupported atomics.
Address Translation Services: GPU uses CPU page tables directly.

6 NVLINK PERFORMANCE AND POWER
Bandwidth: 25 Gbps signaling; 6 NVLinks for GV100; bandwidth improvement over GP100 (1.9x in the slide 3 comparison).
Coherence: latency-sensitive CPU caches GMEM; fast access in local cache hierarchy; probe filter in GPU.
Power Savings: reduce number of active lanes for lightly loaded link.
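The 300 GB/s NVLink total can be related to the 25 Gbps signaling rate with a little arithmetic. The sketch below assumes the 300 GB/s figure counts both directions of all six links and ignores encoding overhead. Neither assumption is stated on the slide, and the lane count it derives is an inference, not a quoted specification.

```cpp
// Figures quoted on the slides.
constexpr double kTotalGBps = 300.0;  // V100 NVLink bandwidth (slides 2 and 3)
constexpr int    kLinks     = 6;      // NVLinks per GPU (slide 6)
constexpr double kLaneGbps  = 25.0;   // signaling rate per lane (slide 6)

// Assumption (not on the slide): the total counts both directions.
constexpr int kDirections = 2;

// Bandwidth of one link, both directions together, in GB/s.
constexpr double per_link_GBps() { return kTotalGBps / kLinks; }

// Bandwidth of one link in one direction, in GB/s.
constexpr double per_direction_GBps() { return per_link_GBps() / kDirections; }

// Lanes per direction implied by the signaling rate (8 bits per byte,
// no encoding overhead modeled).
constexpr double implied_lanes_per_direction() {
    return per_direction_GBps() * 8.0 / kLaneGbps;
}

static_assert(per_link_GBps() == 50.0, "300 GB/s over 6 links");
static_assert(per_direction_GBps() == 25.0, "half of a link's total");
static_assert(implied_lanes_per_direction() == 8.0, "25 GB/s at 25 Gbps per lane");
```

Under these assumptions each link carries 25 GB/s per direction, which corresponds to eight 25 Gbps lanes each way.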

7 NVLINK NODES
HPC: P9 CORAL node (Summit).
DL: hybrid cube mesh (DGX-1 w/ Volta).
[Diagrams: V100 GPUs and P9 CPUs connected by NVLink in each node topology]

8 SM CORE

9 VOLTA GV100 SM
Redesigned for Productivity and Accessible Performance
Twice the schedulers; simplified issue logic; large, fast L1 cache; improved SIMT model; tensor acceleration.
+50% energy efficiency vs GP100 SM
[Diagram: SM with L1 I$, four sub-cores, TEX, and L1 D$ & SMEM]

10 SM MICROARCHITECTURE
Shared L1 I$: 4 warp instr/clk.
4 independently scheduled sub-cores: 1 warp instr/clk each, 64 B/clk each to the MIO.
Shared MIO (TEX/L1$/SMEM): TEX 1 quad/clk; L1 D$ & SMEM 128 KB, 128 B/clk.
L2 $

11 SUB-CORE
Warp Scheduler: 1 warp instr/clk; L0 I$, branch unit (BRU, 1 branch/clk); constant $.
Math Dispatch Unit: 1 warp instr/clk; keeps 2+ datapaths busy.
    Tensor Core: two 4x4x4 tensor/clk
    FP64: 8 DFMA/clk
    INT: 16/clk
    FP32: 16 FFMA/clk
    MUFU: 4/clk
MIO Instruction Queue (load/store/TEX): hold for later scheduling.
Register File: 512 * 32 threads * 32 bits.
MIO Datapath; MIO Scheduler; Two 4x4x4 Tensor Cores (64 B/clk) (1 warp instr / 2 clk)

12 L1 AND SHARED MEMORY
Streaming L1$: unlimited cache misses in flight; low cache hit latency; 4x bandwidth vs GP100; 4x capacity vs GP100.
Shared Memory: unified storage with L1$; configurable up to 96 KB.
[Diagram: sub-cores 0 to 3 each send data at 64 B/clk, plus sub-core instructions, to the MIO scheduler; TEX 1 quad/clk; L1 D$ & SMEM 128 KB, 128 B/clk; L2 $]
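The per-sub-core and per-SM figures on the SM slides can be tallied against the chip-level numbers on slides 2 and 3. The sketch below does so under two labeled assumptions: one FP32 FFMA lane is what slide 2 counts as a "CUDA core", and one 4x4x4 tensor operation is 64 fused multiply-adds (128 FLOPs). All function names are illustrative.

```cpp
#include <cstdint>

// Figures quoted on the slides.
constexpr int kSMs                   = 80;   // slide 2: SMs on Tesla V100
constexpr int kSubCoresPerSM         = 4;    // slide 10
constexpr int kFp32PerSubCore        = 16;   // slide 11: FP32 16 FFMA/clk
constexpr int kTensorCoresPerSubCore = 2;    // slide 11: two 4x4x4 tensor cores
constexpr int kL1KBPerSM             = 128;  // slide 12: L1 D$ & SMEM
constexpr std::int64_t kRegFileBits  = 512LL * 32 * 32;  // slide 11, per sub-core

// Assumption: one FP32 FFMA lane is one "CUDA core".
constexpr int cuda_cores()   { return kSMs * kSubCoresPerSM * kFp32PerSubCore; }
constexpr int tensor_cores() { return kSMs * kSubCoresPerSM * kTensorCoresPerSubCore; }
constexpr int l1_total_kb()  { return kSMs * kL1KBPerSM; }
constexpr std::int64_t regfile_kb_per_sm() {
    return kRegFileBits / 8 / 1024 * kSubCoresPerSM;
}

// Assumption: a 4x4x4 tensor op is 64 FMAs (128 FLOPs); an FFMA is 2 FLOPs.
constexpr double tensor_flops_per_clk() { return tensor_cores() * 64.0 * 2.0; }
constexpr double fp32_flops_per_clk()   { return cuda_cores() * 2.0; }

// Clock in GHz implied by a quoted peak rate in TFLOPS.
constexpr double implied_clock_ghz(double peak_tflops, double flops_per_clk) {
    return peak_tflops * 1e12 / flops_per_clk / 1e9;
}

static_assert(cuda_cores() == 5120, "matches slide 2");
static_assert(tensor_cores() == 640, "matches slide 2");
static_assert(l1_total_kb() == 10 * 1024, "10 MB of L1, as on slide 3");
static_assert(regfile_kb_per_sm() == 256, "64 KB per sub-core");
```

With these assumptions, `implied_clock_ghz(120, tensor_flops_per_clk())` and `implied_clock_ghz(15, fp32_flops_per_clk())` both come out near 1.465 GHz, so the quoted 120 TFLOPS and 15 TFLOPS peaks are consistent with a single clock rate. The slides themselves do not state a clock.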

13 NARROWING THE SHARED MEMORY GAP with the GV100 L1 cache
Cache vs shared: easier to use; 90%+ as good.
Shared vs cache: faster atomics; more banks; more predictable.
Average Shared Memory Benefit (directed testing: shared in global): Pascal 70%, Volta 93%.

14 INDEPENDENT THREAD SCHEDULING
Convergence Optimizer

15 PROGRAMMING MODEL: MEMORY ADDRESSES
Programs written using annotations: convergence must be proven; convergence is prescribed (e.g. Vector).
Programs written as normal: convergence is a perf optimization. GPU Innovation: convergence is an optimization (e.g. NVIDIA GPUs).
[Diagram: an if( ) branch compiled with $(CC), with "causes" arrows to each outcome]

16 PROGRAMMING MODEL: CONTROL ADDRESSES
Programs written without synchronization: convergence must be proven; convergence is prescribed (e.g. Scalar + Predicated Vector; e.g. pre-Volta GPUs).
Programs written as normal: convergence is a perf optimization. Volta Innovation: convergence is an optimization (e.g. Volta's Independent Thread Scheduling).
[Diagram: an if( ) branch compiled with $(CC), with "causes" arrows to each outcome and a "vote" label]

17 NVIDIA SIMT GPUS: SCALAR THREADS ON SIMD UNITS
Stack: C, C++, ..., FORTRAN; array of scalar threads; Scalar Compiler; Scalar ISA; Thread Virtualization Tech; SIMD Units.
Annotations: 50+ year history of optimizations; NVIDIA GPUs have always used; 30+ year history of TLP-to-ILP conversion; where the efficiency comes from.

18 COOPERATION SYNCHRONIZATION
Warp-synchronizing operation. Example: match.any.sync, e.g. our NEW lane data compare.
Result distributed across warp.
[Diagram: lane values of a warp compared for equality]
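As a rough illustration of what a match-any operation returns, the host-side sketch below emulates it for one 32-lane warp: each lane receives a bitmask of every lane holding the same value. This is plain C++ that models the result only, not GPU code and not the hardware algorithm; `match_any_emulated` and `example_keys` are hypothetical names. In CUDA 9 the corresponding device intrinsic is `__match_any_sync`.

```cpp
#include <array>
#include <cstdint>

constexpr int kWarpSize = 32;

// Host-side emulation of a match-any across one warp: mask[i] has bit j set
// exactly when lane j holds the same value as lane i.
std::array<std::uint32_t, kWarpSize>
match_any_emulated(const std::array<int, kWarpSize>& lane_value) {
    std::array<std::uint32_t, kWarpSize> mask{};
    for (int i = 0; i < kWarpSize; ++i) {
        for (int j = 0; j < kWarpSize; ++j) {
            if (lane_value[i] == lane_value[j]) {
                mask[i] |= (std::uint32_t{1} << j);
            }
        }
    }
    return mask;
}

// Example input: lane i holds the key i % 4, giving four groups of eight lanes.
std::array<int, kWarpSize> example_keys() {
    std::array<int, kWarpSize> keys{};
    for (int i = 0; i < kWarpSize; ++i) keys[i] = i % 4;
    return keys;
}
```

With `example_keys()`, lane 0 gets the mask 0x11111111 (lanes 0, 4, 8, ..., 28), and every lane in the same group gets the same mask, which is what lets the lanes of a group cooperate without further communication.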

19 EFFICIENT GPU THREAD SYNCHRONIZATION

    #include <atomic>

    struct mutex {
        void lock() {
            while (1) {
                bool state = false;
                // Try to take the lock; acquire ordering on success.
                if (locked.compare_exchange_weak(state, true, std::memory_order_acquire))
                    return;
                // Poll with relaxed loads until the lock looks free.
                while (locked.load(std::memory_order_relaxed))
                    do_exponential_backoff(); // for brevity
            }
        }
        void unlock() {
            locked.store(false, std::memory_order_release);
        }
        std::atomic<bool> locked{false};
    };

[Chart: critical section performance (relative to single thread), CPU threads (10's) vs GPU threads (K's)]
Note: Unfair spinlock with exponential back-off algorithm, x86-64 Linux 4.8.0, i7-5820k, CUDA 9.0 pre-release software, V100 pre-production hardware, Aug
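A minimal host-side sketch of how such a spinlock might be exercised follows. The slide leaves `do_exponential_backoff()` out "for brevity", so the backoff here (a doubling number of `yield` calls with an arbitrary cap) is an assumption, as are the names `spin_mutex` and `contended_count`. It runs on CPU threads only and says nothing about the GPU numbers in the chart.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// The slide's unfair spinlock, with a hypothetical backoff filled in.
struct spin_mutex {
    void lock() {
        int spins = 1;                  // assumed starting backoff
        constexpr int kMaxSpins = 1024; // assumed cap, chosen arbitrarily
        while (true) {
            bool expected = false;
            // Try to take the lock; acquire ordering on success.
            if (locked.compare_exchange_weak(expected, true,
                                             std::memory_order_acquire))
                return;
            // Wait until the lock looks free, doubling the pause each round.
            while (locked.load(std::memory_order_relaxed)) {
                for (int i = 0; i < spins; ++i) std::this_thread::yield();
                if (spins < kMaxSpins) spins *= 2;
            }
        }
    }
    void unlock() { locked.store(false, std::memory_order_release); }
    std::atomic<bool> locked{false};
};

// Runs `threads` workers that each increment a shared counter `iterations`
// times inside the critical section, and returns the final count.
long contended_count(int threads, int iterations) {
    spin_mutex m;
    long counter = 0;  // protected by m
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&m, &counter, iterations] {
            for (int i = 0; i < iterations; ++i) {
                m.lock();
                ++counter;
                m.unlock();
            }
        });
    }
    for (std::thread& th : pool) th.join();
    return counter;
}
```

If the lock is correct, `contended_count(4, 2000)` returns 8000, because every increment happens while the lock is held. The acquire/release pair is what makes the plain `long` counter safe to update.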

20 TENSOR CORE

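21 TENSOR CORE
Mixed Precision Matrix Math, 4x4 matrices

$$
D =
\begin{pmatrix}
A_{0,0} & A_{0,1} & A_{0,2} & A_{0,3} \\
A_{1,0} & A_{1,1} & A_{1,2} & A_{1,3} \\
A_{2,0} & A_{2,1} & A_{2,2} & A_{2,3} \\
A_{3,0} & A_{3,1} & A_{3,2} & A_{3,3}
\end{pmatrix}
\begin{pmatrix}
B_{0,0} & B_{0,1} & B_{0,2} & B_{0,3} \\
B_{1,0} & B_{1,1} & B_{1,2} & B_{1,3} \\
B_{2,0} & B_{2,1} & B_{2,2} & B_{2,3} \\
B_{3,0} & B_{3,1} & B_{3,2} & B_{3,3}
\end{pmatrix}
+
\begin{pmatrix}
C_{0,0} & C_{0,1} & C_{0,2} & C_{0,3} \\
C_{1,0} & C_{1,1} & C_{1,2} & C_{1,3} \\
C_{2,0} & C_{2,1} & C_{2,2} & C_{2,3} \\
C_{3,0} & C_{3,1} & C_{3,2} & C_{3,3}
\end{pmatrix}
$$

Precision: D is FP16 or FP32; A is FP16; B is FP16; C is FP16 or FP32.
D = AB + C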

22 TENSOR SYNCHRONIZATION
Full Warp 16x16 Matrix Math
Warp-synchronizing operation. Composed matrix multiply and accumulate for 16x16 matrices.
Result distributed across warp.
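One way to picture a 16x16 multiply-accumulate composed from 4x4x4 operations is the host-side sketch below. It is an illustration under assumptions: it uses `float` throughout (the FP16 rounding of inputs is not modeled), the tile order is arbitrary, and it says nothing about how the hardware distributes work across a warp. All names are hypothetical.

```cpp
#include <array>

constexpr int kTile = 4;   // edge of one 4x4x4 tensor operation
constexpr int kN    = 16;  // edge of the warp-level matrices
using Matrix16 = std::array<std::array<float, kN>, kN>;

// One 4x4x4 step: the 4x4 tile of d at (i0, j0) accumulates the product of
// a's tile at (i0, k0) and b's tile at (k0, j0). That is 64 multiply-adds.
void mma_4x4x4(const Matrix16& a, const Matrix16& b, Matrix16& d,
               int i0, int j0, int k0) {
    for (int i = 0; i < kTile; ++i) {
        for (int j = 0; j < kTile; ++j) {
            float acc = d[i0 + i][j0 + j];
            for (int k = 0; k < kTile; ++k)
                acc += a[i0 + i][k0 + k] * b[k0 + k][j0 + j];
            d[i0 + i][j0 + j] = acc;
        }
    }
}

// Number of 4x4x4 steps needed for one 16x16x16 multiply-accumulate.
constexpr int tile_steps_16x16x16() {
    return (kN / kTile) * (kN / kTile) * (kN / kTile);
}

// D = A*B + C for 16x16 matrices, built entirely from 4x4x4 steps.
Matrix16 mma_16x16x16(const Matrix16& a, const Matrix16& b, const Matrix16& c) {
    Matrix16 d = c;  // start from the accumulator input
    for (int i0 = 0; i0 < kN; i0 += kTile)
        for (int j0 = 0; j0 < kN; j0 += kTile)
            for (int k0 = 0; k0 < kN; k0 += kTile)
                mma_4x4x4(a, b, d, i0, j0, k0);
    return d;
}

// Helper for examples: a 16x16 matrix with every element equal to v.
Matrix16 filled(float v) {
    Matrix16 m{};
    for (auto& row : m) row.fill(v);
    return m;
}
```

For example, with A filled with 1, B with 2, and C with 3, every element of D is 16 * (1 * 2) + 3 = 35, and the decomposition takes 64 tile steps.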

23 VOLTA TENSOR OPERATION
FP16 storage/input -> full precision product -> sum with FP32 accumulator (together with more products) -> convert to FP32 result
F16 x F16 + F32 -> F32
Also supports FP16 accumulator mode for inferencing

24 A GIANT LEAP FOR DEEP LEARNING
cuBLAS mixed precision (FP16 input, FP32 compute), matrix multiply (M=N=K=2048)
[Chart: relative performance, P100 (CUDA 8) vs V100 Tensor Cores (CUDA 9); V100 is 9.3x faster]
V100 measured on pre-production hardware.
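For a sense of scale, the operation count of the benchmarked multiply and the shortest time it could take at the quoted 120 TFLOPS peak can be worked out directly. The sketch below uses the usual 2*M*N*K count for a matrix multiply; the result is a lower bound derived from the peak figure, not a measurement from the slide, and the function names are illustrative.

```cpp
// Floating-point operations in an M x K by K x N matrix multiply,
// counting each multiply-add as 2 operations.
constexpr double gemm_flops(double m, double n, double k) {
    return 2.0 * m * n * k;
}

// Shortest possible run time in seconds if the peak rate were sustained.
constexpr double min_seconds_at_peak(double flops, double peak_tflops) {
    return flops / (peak_tflops * 1e12);
}

// M = N = K = 2048 from the slide: 2 * 2048^3 operations.
static_assert(gemm_flops(2048, 2048, 2048) == 17179869184.0,
              "about 17.2 GFLOP per multiply");
```

At 120 TFLOPS, `min_seconds_at_peak(gemm_flops(2048, 2048, 2048), 120)` is roughly 0.14 ms, so a single multiply of this size is very short even before memory traffic is considered.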

25 TESLA V100
Volta Architecture: Most Productive GPU
Improved NVLink & HBM2: Efficient Bandwidth
New SM Core: Performance & Programmability
Independent Thread Scheduling: New Algorithms
Tensor Core: 120 Programmable TFLOPS Deep Learning
More V100 Features: 2x L2 atomics, int8, new memory model, copy engine page migration, MPS acceleration, and more
The Fastest and Most Productive GPU for Deep Learning and HPC
QUESTIONS?
