Turing Architecture and CUDA 10 New Features. Minseok Lee, Developer Technology Engineer, NVIDIA



1 Turing Architecture and CUDA 10 New Features Minseok Lee, Developer Technology Engineer, NVIDIA

2 Turing Architecture: New SM Architecture; Multi-Precision Tensor Core; RT Core; Turing MPS. Inference accelerated, graphics reinvented, Volta's programmability.

3 ANNOUNCING TESLA T4: WORLD'S MOST ADVANCED INFERENCE GPU. Universal Inference Acceleration: 320 Turing Tensor Cores; 2,560 CUDA cores; 65 FP16 TFLOPS; 130 INT8 TOPS; 260 INT4 TOPS; 16 GB memory; 320 GB/s memory bandwidth.

4 Turing TU102 SM
    Per SM              | TU102
    INT32 cores         | 64
    FP32 cores          | 64
    Tensor Cores        | 8
    RT Core             | 1
    Register File       | 256 KB
    L1 and shmem        | 96 KB
    Max threads         | 1024
    Compute Capability  | 7.5 (sm_75)

5 Tensor Core: $D = A \times B + C$, where A, B, C and D are 4x4 matrices with elements $A_{i,j}$, $B_{i,j}$, $C_{i,j}$ for i, j = 0..3. Precision: D is FP32 (or FP16), A is FP16, B is FP16, C is FP32 (or FP16).

6 Multi-Precision Tensor Core
    Input Precision | Output        | Volta (Ops/Cycle/SM) | Turing (Ops/Cycle/SM)
    FP16            | FP16 or FP32  | 1024                 | 1024
    INT8            | INT32         | NA                   | 2048
    INT4            | INT32         | NA                   | 4096
    BOOL            | INT32         | NA                   | 16384
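The Ops/Cycle/SM figures above can be tied back to the T4 headline numbers with a rough calculation. The sketch below assumes 40 SMs (2,560 CUDA cores divided by 64 FP32 cores per SM, both from the earlier slides) and a boost clock of about 1.59 GHz; the clock is not stated in this deck and is an assumption here. `peakTeraOps` is a hypothetical helper name.

```cpp
#include <cassert>

// Peak tensor throughput in tera-ops per second.
// opsPerCyclePerSm comes from the Ops/Cycle/SM column; smCount and clockGHz
// are assumptions supplied by the caller (not given on this slide).
double peakTeraOps(int opsPerCyclePerSm, int smCount, double clockGHz) {
    assert(opsPerCyclePerSm > 0 && smCount > 0 && clockGHz > 0.0);
    // ops/cycle/SM * SMs * GHz gives giga-ops/s; divide by 1000 for tera-ops/s.
    return opsPerCyclePerSm * static_cast<double>(smCount) * clockGHz / 1000.0;
}
```

Under these assumptions, `peakTeraOps(2048, 40, 1.59)` comes out at roughly 130 and `peakTeraOps(4096, 40, 1.59)` at roughly 260, in line with the INT8 and INT4 TOPS quoted for the T4.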


8 Multi-Process Service (MPS): Turing MPS inherits Volta's enhanced MPS architecture. CUDA Multi-Process Service control: CPU processes A, B and C share GPU execution. Turing Multi-Process Service: hardware-accelerated work submission and hardware isolation. Benefits: reduced launch latency, improved launch throughput, improved quality of service with scheduler partitioning.

9 RT Core: Accelerate ray tracing with Turing RT Cores: Bounding Volume Hierarchy (BVH) traversal, ray/triangle intersection, 10+ Giga Rays/sec. Available in NVIDIA OptiX: single-ray shader programming model using C++, AI-accelerated rendering, free for commercial use.

10 CUDA 10 Key Features. TURING AND NEW SYSTEMS: New GPU Architecture, Tensor Cores. CUDA PLATFORM: CUDA Graphs, Warp Matrix. LIBRARIES: GPU-accelerated hybrid JPEG decoding, Symmetric Eigenvalue Solvers, FFT Scaling. DEVELOPER TOOLS: New Nsight Products (Nsight Systems and Nsight Compute).

11 Asynchronous Task Graphs: Enable execution optimization when the workflow is known up front. Use cases: DL inference, deep neural network training, loop & function offload, linear algebra, HPC simulation.

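12 CUDA Graphs: New Model for Submitting CUDA Work. CUDA work in streams can be expressed as a graph of dependencies; any CUDA stream can be mapped to a graph. [Diagram: kernels A, B, C, D, E, X, Y issued in streams with Wait dependencies, next to the equivalent dependency graph ending in End.]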

13 Definition of A CUDA Graph: A sequence of operations, connected by dependencies. Operations (nodes) are one of: Kernel Launch (CUDA kernel running on GPU); CPU Function Call (callback function on CPU); Memcpy/Memset (data management); Sub-Graph (graphs are hierarchical).
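To make the "operations connected by dependencies" idea concrete, here is a minimal host-only sketch that orders the nodes of a small dependency graph so that every node runs after its prerequisites (Kahn's algorithm). It is an illustration of the concept, not the CUDA runtime's scheduler; `Node` and `executionOrder` are hypothetical names.

```cpp
#include <cassert>
#include <queue>
#include <string>
#include <vector>

// Hypothetical host-side model of a graph node (not a CUDA type).
struct Node {
    std::string name;
    std::vector<int> deps;  // indices of nodes that must finish first
};

// Returns node names in an order that respects every dependency.
std::vector<std::string> executionOrder(const std::vector<Node>& graph) {
    const int n = static_cast<int>(graph.size());
    std::vector<int> pending(n, 0);              // unmet dependencies per node
    std::vector<std::vector<int>> dependents(n); // reverse edges
    for (int i = 0; i < n; ++i) {
        for (int d : graph[i].deps) {
            assert(d >= 0 && d < n);
            dependents[d].push_back(i);
            ++pending[i];
        }
    }
    std::queue<int> ready;
    for (int i = 0; i < n; ++i)
        if (pending[i] == 0) ready.push(i);
    std::vector<std::string> order;
    while (!ready.empty()) {
        const int i = ready.front();
        ready.pop();
        order.push_back(graph[i].name);
        for (int j : dependents[i])
            if (--pending[j] == 0) ready.push(j);
    }
    assert(order.size() == graph.size());  // fails only if the graph has a cycle
    return order;
}
```

For a diamond-shaped graph where A feeds B and C, and both feed D, this yields A first and D last; B and C have no mutual dependency, which is what would allow them to run concurrently on a GPU.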

14 Repeated Graph Execution: Generated Once, Launched Repeatedly

    for (int i = 0; i < 1000; i++) {
        launch_graph( G );
    }

15 Execution Optimization: Launch Latency & Overhead Reduction. A predefined graph launches any number of kernels in one single operation, which especially benefits short-running kernels. Timeline without graphs: Launch A, Launch B, Launch C, Launch D, Launch E, then CPU idle, while A to E execute. Timeline with graphs: Build Graph, Launch Graph, then CPU idle, while A to E execute.

16 CUDA Stream to A Graph: Construct a graph from normal CUDA stream syntax.

    // Start by initiating stream capture
    // (CUDA 10.1 and later also take a capture-mode argument)
    cudaStreamBeginCapture(stream1);

    A<<<..., stream1 >>>();
    cudaEventRecord(e1, stream1);
    B<<<..., stream1 >>>();

    cudaStreamWaitEvent(stream2, e1, 0);
    C<<<..., stream2 >>>();
    cudaEventRecord(e2, stream2);

    cudaStreamWaitEvent(stream1, e2, 0);
    D<<<..., stream1 >>>();

    // Now convert the stream to a graph
    cudaStreamEndCapture(stream1, &graph);

Resulting graph: A precedes B (stream1) and C (stream2); D waits on both B and C.

17 Capture External Work: Stream capture extends into library calls.

    // Start by initiating stream capture
    cudaStreamBeginCapture(stream);

    // Captures my kernel launches, recurses into library calls
    X<<<..., stream >>>();
    libraryCall(stream);      // Launches A, B, C, D
    Z<<<..., stream >>>();

    // Now convert the stream to a graph
    cudaStreamEndCapture(stream, &graph);

Resultant graph: X, followed by the library's kernels A, B, C, D as an inserted graph, followed by Z.

18 Construct Graph Explicitly (graph from framework)

    // Define graph of work + dependencies.
    // Argument lists are abbreviated as on the slide; cudaGraphAddNode is the
    // slide's shorthand (the CUDA 10 runtime adds kernel nodes with
    // cudaGraphAddKernelNode).
    cudaGraphCreate(&graph);
    cudaGraphAddNode(graph, kernel_a, {}, ...);
    cudaGraphAddNode(graph, kernel_b, { kernel_a }, ...);
    cudaGraphAddNode(graph, kernel_c, { kernel_a }, ...);
    cudaGraphAddNode(graph, kernel_d, { kernel_b, kernel_c }, ...);

    // Instantiate graph and apply optimizations
    cudaGraphInstantiate(&instance, graph);

    // Launch executable graph 100 times
    for (int i = 0; i < 100; i++)
        cudaGraphLaunch(instance, stream);

19 Graph Execution Semantics: Graph work can be ordered with other work in the stream.

    void launchWork(cudaGraphExec_t i1, cudaGraphExec_t i2,
                    CPU_Func cpu, cudaStream_t stream) {
        A<<< 256, 256, 0, stream >>>();       // Kernel launch
        cudaGraphLaunch(i1, stream);          // Graph1 launch
        cudaStreamAddCallback(stream, cpu);   // CPU callback
        cudaGraphLaunch(i2, stream);          // Graph2 launch
        cudaStreamSynchronize(stream);
    }

20 Graph Execution Semantics: Graphs ONLY use the stream for the start/end dependencies. Branches in a graph still execute concurrently even though the graph is launched into a stream.


21 How to Access Tensor Cores: Warp Matrix Multiply-Accumulate (WMMA) API in CUDA C++. Specialized matrix load, matrix multiply and accumulate, and matrix store. Turing (sm_75) WMMA supports 8-bit integer operation.
    WMMA 16x16x16: D (16x16) = A (16x16) x B (16x16) + C (16x16)
    WMMA 32x8x16:  D (32x8)  = A (32x16) x B (16x8)  + C (32x8)
    WMMA 8x32x16:  D (8x32)  = A (8x16)  x B (16x32) + C (8x32)

22 How Tensor Core is Used: A large matrix multiplication can be divided into a set of 16x16 matrix products, which are assigned to Tensor Cores.
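The decomposition into 16x16 products can be sketched on the host as follows. This is a plain C++ reference in FP32, assuming square row-major matrices whose size is a multiple of 16; on the GPU each tile product would be handed to a warp and its Tensor Cores instead of the inner loops. `tileProduct` and `tiledMatmul` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int kTile = 16;  // tile edge, matching the 16x16 products on the slide

// Accumulates one 16x16 tile product into C: C_tile += A_tile * B_tile.
// All matrices are n x n, row-major; (ti, tj, tk) are the tile origins.
void tileProduct(const std::vector<float>& a, const std::vector<float>& b,
                 std::vector<float>& c, int n, int ti, int tj, int tk) {
    for (int i = 0; i < kTile; ++i) {
        for (int j = 0; j < kTile; ++j) {
            float acc = c[(ti + i) * n + (tj + j)];
            for (int k = 0; k < kTile; ++k)
                acc += a[(ti + i) * n + (tk + k)] * b[(tk + k) * n + (tj + j)];
            c[(ti + i) * n + (tj + j)] = acc;
        }
    }
}

// Computes C = A * B by iterating over 16x16 tile products.
std::vector<float> tiledMatmul(const std::vector<float>& a,
                               const std::vector<float>& b, int n) {
    assert(n > 0 && n % kTile == 0);
    assert(a.size() == static_cast<std::size_t>(n) * n && b.size() == a.size());
    std::vector<float> c(a.size(), 0.0f);
    for (int ti = 0; ti < n; ti += kTile)
        for (int tj = 0; tj < n; tj += kTile)
            for (int tk = 0; tk < n; tk += kTile)
                tileProduct(a, b, c, n, ti, tj, tk);
    return c;
}
```

Each `(ti, tj)` output tile is independent of the others, which is why the tile products can be distributed across warps.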

23 FP16 WMMA Example

    // `...` marks arguments elided on the slide (layout, leading dimension)
    __device__ void tensor_op_16_16_16(half *a, half *b, float *c) {
        // Tensor Core input/output fragments
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, ...> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, ...> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float, ...> c_frag;

        // Load input matrices into input fragments
        wmma::load_matrix_sync(a_frag, a, ...);
        wmma::load_matrix_sync(b_frag, b, ...);

        // Initialize output fragment
        wmma::fill_fragment(c_frag, 0.0f);

        // Tensor Core computation
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

        // Store output matrix into memory
        wmma::store_matrix_sync(c, c_frag, ...);
    }

24 Turing INT8 WMMA Example

    // `...` marks arguments elided on the slide (layout, leading dimension)
    __device__ void tensor_op_16_16_16(char *a, char *b, int *c) {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, char, ...> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, char, ...> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, int, ...> c_frag;

        wmma::load_matrix_sync(a_frag, a, ...);
        wmma::load_matrix_sync(b_frag, b, ...);

        // Integer accumulator, so initialize with an integer zero
        wmma::fill_fragment(c_frag, 0);

        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

        wmma::store_matrix_sync(c, c_frag, ...);
    }

25 Experimental sub-byte WMMA: Support for experimental 4-bit/1-bit operations with 32-bit output. Access via special namespace: nvcuda::wmma::experimental

    namespace experimental {
        namespace precision {
            struct u4;  // 4-bit unsigned
            struct s4;  // 4-bit signed
            struct b1;  // 1-bit
        }
        enum bmmaBitOp { bmmaBitOpXOR = 1 };
        enum bmmaAccumulateOp { bmmaAccumulateOpPOPC = 1 };
    }

26 Binary Tensor Core Example: 1-bit Concept. Train neural networks on lower-precision data for faster compute and a smaller memory footprint. Data is reduced to its positive/negative sign, which fits in a single bit (1 = +ve, 0 = -ve). 1-bit weight and activation calculations are based only on the sign of the data. Ref: Binarized Neural Networks: Training Neural Networks with Weights and Activations Constrained to +1 or -1, M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, Y. Bengio.
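With the 1 = +ve, 0 = -ve encoding, a dot product of two sign vectors reduces to an XOR followed by a population count, the same pair of operations the experimental 1-bit WMMA mode exposes. The host-only sketch below uses the standard identity dot = n - 2 * popcount(a XOR b); `binaryDot` is a hypothetical helper and it handles at most 32 elements packed into one word.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

// Dot product of two length-n vectors with entries in {+1, -1},
// packed as bits (1 = +1, 0 = -1) in the low n bits of a and b.
int binaryDot(std::uint32_t a, std::uint32_t b, int n) {
    assert(n >= 1 && n <= 32);
    const std::uint32_t mask = (n == 32) ? 0xFFFFFFFFu : ((1u << n) - 1u);
    // Each differing bit contributes -1, each matching bit contributes +1.
    const int mismatches =
        static_cast<int>(std::bitset<32>((a ^ b) & mask).count());
    return n - 2 * mismatches;
}
```

For example, with n = 4, a = 0xB (bits 1011) and b = 0x9 (bits 1001) differ in one position, so the result is 4 - 2 = 2.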

27 Turing WMMA API Summary
    Types        | Input Precision                                            | Output          | Supported Sizes              | Max Ops/Clock/SM
    Native       | half *                                                     | half or float   | 16x16x16, 32x8x16, 8x32x16   | 1024
    Native       | char, unsigned char                                        | integer (int32) | 16x16x16, 32x8x16, 8x32x16   | 2048
    Experimental | precision::u4 (4-bit unsigned), precision::s4 (4-bit signed) | integer (int32) | 8x8x32                       | 4096
    Experimental | precision::b1 (1-bit)                                      | integer (int32) | 8x8x128                      | 16384
    * Also available on Volta sm_70.
    Note: WMMA requires recompilation for Turing sm_75 for peak performance.

28 CUTLASS 1.1: A collection of CUDA C++ template abstractions for implementing high-performance matrix multiplication (GEMM). Turing/CUDA 10-optimized: supports 8-bit, 4-bit, and 1-bit integers. Includes detailed documentation and various examples. Exhibits performance comparable to cuBLAS.


29 CUDA 10 Math Libraries (Deep Learning, Scientific Computing). TURING: Turing-optimized GEMMs and GEMM extensions for Tensor Cores; Turing architecture-optimized libraries. PERFORMANCE: Large FFT and 16-GPU strong scaling; symmetric eigensolver and Cholesky performance; cuSPARSE sparse-dense matrix multiply performance. NEW ALGORITHMS AND APIs: GPU-accelerated hybrid JPEG decoding; FP16 and INT8 GEMMs for TensorRT inference. COMPATIBILITY & RELEASE CADENCE: Faster and independent library releases; library and CUDA compatibility with enterprise drivers.

30 cuBLAS 10: Includes Turing-optimized mixed-precision GEMMs with Tensor Cores.

31 DL Inference Test on T4

32 cuFFT 10: Strong scaling on multi-GPU systems such as NVIDIA's DGX. [Chart: cuFFT 9.2 vs cuFFT 10.0, with a linear trend line for cuFFT 10.0.] cuFFT (10.0 and 9.2) using 3D C2C FFT, size 1024, on DGX-2.

33 cuSOLVER 10: Up to 44x faster on symmetric eigensolver (DSYEVD). Improved performance with new implementations for Cholesky factorization, symmetric and generalized symmetric eigensolver, and QR factorization. [Chart: time (s) vs matrix size for MKL 2018, CUDA 9.2 and CUDA 10.] Benchmarks use 2 x Intel Gold 6140 (Skylake) processors with Intel MKL 2018 and NVIDIA Tesla V100 (Volta) GPUs.

34 Nsight Product Family. Nsight Systems: system-wide application algorithm tuning. Nsight Compute: CUDA kernel profiling and debugging. Nsight Graphics: graphics shader profiling and debugging.

35 Nsight Systems: System-wide Performance Analysis. Observe application behavior: CPU threads, GPU traces, memory bandwidth and more. Locate optimization opportunities: CUDA and OpenGL APIs, Unified Memory transfers, user annotations using NVTX. Ready for big data: fast GUI capable of visualizing in excess of 10 million events on laptops, container support, minimum user privileges.

36 Thread/core migration; processes and threads; thread state; CUDA and OpenGL API trace; cuDNN and cuBLAS trace; kernel and memory transfer activities; multi-GPU.


37 Nsight Compute: Next Generation Kernel Profiler. Interactive CUDA API debugging and kernel profiling. Fast data collection. Improved workflow and fully customizable (baselining, programmable UI/rules). Command line, standalone, IDE integration. Platform support: OS: Linux (x86, ARM), Windows; GPUs: Pascal, Volta, Turing. Features shown: kernel profile comparisons with baseline, metric data, source correlation.

38 SEOUL, NOVEMBER 7-8, 2018
