VOLTA: PROGRAMMABILITY AND PERFORMANCE
Jack Choquette, NVIDIA, Hot Chips 2017

TESLA V100
- 21B transistors, 815 mm²
- 80 SMs*
- 5120 CUDA Cores
- 640 Tensor Cores
- 16 GB HBM2, 900 GB/s HBM2 bandwidth
- 300 GB/s NVLink
*full GV100 chip contains 84 SMs

GPU PERFORMANCE COMPARISON

                       P100           V100            Ratio
DL Training            10 TFLOPS      120 TFLOPS      12x
DL Inferencing         21 TFLOPS      120 TFLOPS      6x
FP64/FP32              5/10 TFLOPS    7.5/15 TFLOPS   1.5x
HBM2 Bandwidth         720 GB/s       900 GB/s        1.2x
STREAM Triad Perf      557 GB/s       855 GB/s        1.5x
NVLink Bandwidth       160 GB/s       300 GB/s        1.9x
L2 Cache               4 MB           6 MB            1.5x
L1 Caches              1.3 MB         10 MB           7.7x

NVLINK

NVLINK - PROGRAMMABILITY
- CPU mastering: direct load/store access for all agents
- Flat shared address space
- GPU & CPU atomics: atomic completion at destination; read-modify-write primitive for unsupported atomics
- Address Translation Services: the GPU uses the CPU's page tables directly
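
The flat shared address space plus GPU & CPU atomics means both processors can update the same word. A minimal sketch (my illustration, not from the slides), assuming a CC 6.0+ GPU and CUDA 8+ where atomicAdd_system is available:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void gpu_increment(int *counter) {
        // System-scope atomic: the update is visible to the CPU as well.
        atomicAdd_system(counter, 1);
    }

    int main() {
        int *counter;
        cudaMallocManaged(&counter, sizeof(int)); // one address, valid on CPU and GPU
        *counter = 0;

        gpu_increment<<<4, 256>>>(counter);       // 1024 GPU increments
        cudaDeviceSynchronize();

        __sync_fetch_and_add(counter, 1);         // CPU atomic on the same word
        printf("counter = %d\n", *counter);       // expect 1025
        cudaFree(counter);
        return 0;
    }

With NVLink plus Address Translation Services the same load/store path also reaches ordinary CPU-allocated memory; managed memory is used here only to keep the sketch portable.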

NVLINK PERFORMANCE AND POWER
Bandwidth:
- 25 Gbps signaling
- 6 NVLinks for GV100
- 1.9x bandwidth improvement over GP100
Coherence:
- Latency-sensitive: CPU caches GPU memory (GMEM)
- Fast access in the local cache hierarchy
- Probe filter in the GPU
Power savings:
- Reduce the number of active lanes for lightly loaded links

NVLINK NODES
- HPC: P9 CORAL node (Summit) - POWER9 CPUs NVLink-attached to V100 GPUs
- DL: hybrid cube mesh - DGX-1 w/ Volta, eight V100s
[diagram: node topologies of P9 CPUs and V100 GPUs connected by NVLink]

SM CORE

VOLTA GV100 SM
Redesigned for productivity and accessible performance:
- Twice the schedulers
- Simplified issue logic
- Large, fast L1 cache
- Improved SIMT model
- Tensor acceleration
- +50% energy efficiency vs the GP100 SM
[diagram: SM with a shared L1 I$, four sub-cores, and shared TEX / L1 D$ & SMEM]

SM MICROARCHITECTURE
- Shared L1 I$ delivers 4 warp instructions/clk
- 4 independently scheduled sub-cores, each issuing 1 warp instruction/clk
- 64 B/clk path from each sub-core to the shared MIO (TEX / L1$ / SMEM)
- TEX: 1 quad/clk
- L1 D$ & SMEM: 128 KB, with a 128 B/clk path to the L2 $

SUB-CORE
- Warp scheduler: 1 warp instruction/clk, with L0 I$ (backed by the shared L1 I$), constant $, and branch unit (BRU, 1 branch/clk)
- Math dispatch unit (1 warp instruction/clk) keeps 2+ datapaths busy:
  - FP32: 16 FFMA/clk
  - INT: 16/clk
  - FP64: 8 DFMA/clk
  - MUFU: 4/clk
  - Two 4x4x4 Tensor Cores: two 4x4x4 tensor ops/clk (64 B/clk, 1 warp instruction / 2 clk)
- Register file: 512 x 32 threads x 32 bits (64 KB)
- MIO instruction queue: holds load/store/TEX instructions for later scheduling by the MIO scheduler onto the MIO datapath

L1 AND SHARED MEMORY
Streaming L1$:
- Unlimited cache misses in flight
- Low cache hit latency
- 4x bandwidth vs GP100
- 4x capacity vs GP100
Shared memory:
- Unified storage with the L1$ (128 KB L1 D$ & SMEM)
- Configurable up to 96 KB
[diagram: four sub-cores, 64 B/clk each, feeding the shared MIO (TEX 1 quad/clk, L1 D$ & SMEM 128 KB), with a 128 B/clk path to the L2 $]
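
Opting a kernel into more than the default 48 KB of shared memory is done through a CUDA 9 runtime attribute. A minimal sketch (kernel and sizes are illustrative assumptions, not from the slides):

    #include <cuda_runtime.h>

    __global__ void stage_kernel(const float *in, float *out) {
        extern __shared__ float tile[];   // dynamic shared memory
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = in[i];
        __syncthreads();
        out[i] = tile[blockDim.x - 1 - threadIdx.x];  // reversed within the block
    }

    void launch(const float *in, float *out) {
        int smem = 96 * 1024;  // request the full Volta maximum of 96 KB
        cudaFuncSetAttribute(stage_kernel,
                             cudaFuncAttributeMaxDynamicSharedMemorySize, smem);
        stage_kernel<<<80, 256, smem>>>(in, out);
    }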

NARROWING THE SHARED MEMORY GAP
with the GV100 L1 cache
- Cache vs shared: easier to use, 90%+ as good
- Shared vs cache: faster atomics, more banks, more predictable
[chart: average shared-memory benefit, directed testing of shared vs global - Pascal 70%, Volta 93%]

INDEPENDENT THREAD SCHEDULING
Convergence Optimizer

PROGRAMMING MODEL: MEMORY ADDRESSES
Convergence is prescribed (e.g. vector machines):
- Programs written using annotations
- $(CC) causes convergence
- Convergence must be proven
Convergence is an optimization (e.g. NVIDIA GPUs) - the GPU innovation:
- Programs written as normal
- if( ) causes divergence
- Convergence is a perf optimization

PROGRAMMING MODEL: CONTROL ADDRESSES
Convergence is prescribed (e.g. scalar + predicated vector; pre-Volta GPUs):
- Programs written without synchronization
- $(CC) causes convergence
- Convergence must be proven
Convergence is an optimization (e.g. Volta's Independent Thread Scheduling) - the Volta innovation:
- Programs written as normal
- if( ) causes divergence; warp-synchronizing ops (e.g. vote) name convergence points
- Convergence is a perf optimization
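
In CUDA 9 terms, treating convergence as an optimization means intra-warp communication must name its convergence points explicitly. A minimal sketch (my illustration, not from the slides) of a per-warp reduction that is safe under independent thread scheduling, assuming one 32-thread warp per block:

    __global__ void warp_reduce(const float *in, float *out) {
        __shared__ float scratch[32];
        int lane = threadIdx.x;            // blockDim.x == 32 assumed

        scratch[lane] = in[blockIdx.x * 32 + lane];
        __syncwarp();                      // converge the warp before reading
        for (int offset = 16; offset > 0; offset /= 2) {
            float v = scratch[lane ^ offset];
            __syncwarp();                  // everyone has read this round...
            scratch[lane] += v;
            __syncwarp();                  // ...and written, before the next
        }
        if (lane == 0) out[blockIdx.x] = scratch[0];
    }

Pre-Volta code often leaned on implicit lockstep reconvergence here; on Volta, diverged lanes may interleave, so the __syncwarp() calls are what make the exchange well-defined.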

NVIDIA SIMT GPUS: SCALAR THREADS ON SIMD UNITS
- C, C++, FORTRAN, ...: written as arrays of scalar threads
- Scalar compiler and scalar ISA: 50+ year history of optimizations
- Thread virtualization technology: 30+ year history of TLP-to-ILP conversion; NVIDIA GPUs have always used it
- SIMD units: where the efficiency comes from

COOPERATION AND SYNCHRONIZATION
Example: match.any.sync
- Warp-synchronizing operation
- New lane-data compare: each lane's value is compared against every other lane's
- Result distributed across the warp
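
CUDA 9 exposes this as the __match_any_sync intrinsic. A minimal sketch (illustrative use, not from the slides): each lane receives a mask of the lanes holding an equal key, which groups a warp by value in one instruction:

    __global__ void group_by_key(const int *keys, unsigned *groups) {
        int i    = blockIdx.x * blockDim.x + threadIdx.x;
        int lane = threadIdx.x % 32;

        // Warp-synchronizing compare: bit j is set iff lane j has the same key.
        unsigned peers = __match_any_sync(0xffffffffu, keys[i]);

        // Example use: the lowest lane in each group acts as its leader.
        if (lane == __ffs(peers) - 1)
            groups[i] = peers;
    }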

EFFICIENT GPU THREAD SYNCHRONIZATION
[chart: critical-section performance relative to a single thread - 10's of CPU threads vs 10K's of GPU threads]

    struct mutex {
        void lock() {
            while (1) {
                bool state = false;
                if (locked.compare_exchange_weak(state, true,
                                                 std::memory_order_acquire))
                    return;
                while (locked.load(std::memory_order_relaxed))
                    do_exponential_backoff();  // for brevity
            }
        }
        void unlock() {
            locked.store(false, std::memory_order_release);
        }
        std::atomic<bool> locked = ATOMIC_VAR_INIT(false);
    };

Note: unfair spinlock with exponential back-off algorithm; x86-64 Linux 4.8.0, i7-5820K, CUDA 9.0 pre-release software, V100 pre-production hardware, Aug 2017
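
The GPU side of the comparison uses the same spinlock idea in device code. A minimal sketch (my illustration, not the benchmarked code): a lock like this is only safe among arbitrary GPU threads because Volta's independent thread scheduling keeps the lock holder making progress even while other lanes of its own warp spin:

    __device__ void lock(int *mtx) {
        while (atomicCAS(mtx, 0, 1) != 0)
            ;                 // spin; pre-Volta this could starve the holder
        __threadfence();      // acquire: order the critical section after the lock
    }

    __device__ void unlock(int *mtx) {
        __threadfence();      // release: publish critical-section writes first
        atomicExch(mtx, 0);
    }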

TENSOR CORE

TENSOR CORE
Mixed-precision matrix math on 4x4 matrices:

    D = A * B + C

where A and B are FP16 4x4 matrices, and the accumulator C and result D are FP16 or FP32 4x4 matrices.

TENSOR SYNCHRONIZATION
Full-warp 16x16 matrix math:
- Warp-synchronizing operation
- Composed matrix multiply and accumulate for 16x16 matrices
- Result distributed across the warp

VOLTA TENSOR OPERATION
- FP16 storage/input
- Full-precision product: F32 += F16 x F16 (+ more products)
- Sum with FP32 accumulator
- Convert to FP32 result
- Also supports an FP16 accumulator mode for inferencing
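
From CUDA 9 this warp-wide operation is programmable through the nvcuda::wmma API. A minimal sketch (tile shape and names are illustrative, not from the slides) of one warp computing a 16x16 tile with FP16 inputs and an FP32 accumulator:

    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    __global__ void wmma_16x16(const half *a, const half *b, float *d) {
        // Fragments live spread across the warp's registers.
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fb;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

        wmma::fill_fragment(acc, 0.0f);    // FP32 accumulator: D = A*B + 0
        wmma::load_matrix_sync(fa, a, 16); // warp-synchronizing loads
        wmma::load_matrix_sync(fb, b, 16);
        wmma::mma_sync(acc, fa, fb, acc);  // Tensor Core multiply-accumulate
        wmma::store_matrix_sync(d, acc, 16, wmma::mem_row_major);
    }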

A GIANT LEAP FOR DEEP LEARNING
[chart: relative performance of cuBLAS mixed-precision matrix multiply (FP16 input, FP32 compute, M=N=K=2048) - V100 Tensor Cores (CUDA 9) vs P100 (CUDA 8): 9.3x faster]
V100 measured on pre-production hardware.

TESLA V100
- Volta architecture: the most productive GPU
- Improved NVLink & HBM2: efficient bandwidth
- New SM core: performance & programmability
- Independent thread scheduling: new algorithms
- Tensor Core: 120 programmable TFLOPS for deep learning
More V100 features: 2x L2 atomics, INT8, new memory model, copy engine page migration, MPS acceleration, and more
The fastest and most productive GPU for deep learning and HPC

QUESTIONS?