TESLA PLATFORM. Jan 2018

1 TESLA PLATFORM Jan 2018

2 A NEW ERA OF COMPUTING
PC INTERNET: WinTel, Yahoo! (1 billion PC users)
MOBILE-CLOUD: iPhone, Amazon AWS (2.5 billion mobile users)
AI & IOT: Deep Learning, GPU (100s of billions of devices)

3 NVIDIA: THE AI COMPUTING COMPANY
GPU Computing, Computer Graphics, Artificial Intelligence

4 RISE OF GPU COMPUTING
GPU-computing performance: 1.5X per year, 1000X by 2025
Layers shown: Applications, Algorithms, Systems, CUDA Architecture
[Chart: GPU-computing performance versus single-threaded CPU performance over time, log scale]
Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. New plot and data by K. Rupp.
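The "1.5X per year, 1000X by 2025" figure is a compound-growth claim, and a short calculation shows how the two numbers relate. The sketch below is only an arithmetic check; the function names are illustrative and the slide does not state the baseline year, so none is assumed here.

```python
import math


def years_to_reach(target_multiple, yearly_rate):
    """Years of compound growth needed to grow by target_multiple."""
    return math.log(target_multiple) / math.log(yearly_rate)


def growth_after(years, yearly_rate):
    """Total growth multiple after a whole number of years."""
    return yearly_rate ** years


# Rates quoted on the slide: 1.5X per year, 1000X overall.
gpu_years = years_to_reach(1000, 1.5)
print(f"1.5X per year reaches 1000X after about {gpu_years:.1f} years")
print(f"Growth after 17 years at 1.5X per year: {growth_after(17, 1.5):.0f}X")
```

At 1.5X per year, a 1000X gain takes roughly 17 years of sustained compounding.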

5 ELEVEN YEARS OF GPU COMPUTING
World's First Atomic Model of HIV Capsid
GPU-Trained AI Machine Beats World Champion in Go
Oak Ridge Deploys World's Fastest Supercomputer w/ GPUs
Fermi: World's First HPC GPU
AlexNet Beats Expert Code by Huge Margin Using GPUs
Stanford Builds AI Machine Using GPUs
Google Outperforms Humans in ImageNet
Top 13 Greenest Supercomputers Powered by NVIDIA GPUs
CUDA Launched
World's First GPU Top500 System
Discovered How H1N1 Mutates to Resist Drugs
World's First 3-D Mapping of Human Genome

6 TESLA PLATFORM
World's Leading Data Center Platform for Accelerating HPC and AI
APPLICATIONS: Internet Services, Enterprise Applications, HPC (+450 applications). Industries shown: Automotive, Retail, Healthcare, Manufacturing, Finance, Defense
INDUSTRY FRAMEWORKS & TOOLS: Frameworks, Ecosystem Tools
NVIDIA SDK: Deep Learning SDK (cuDNN, TensorRT, NCCL, cuBLAS, cuSPARSE, DeepStream SDK); ComputeWorks (CUDA C/C++, FORTRAN)
TESLA GPU & SYSTEMS: Tesla GPU, NVIDIA DGX-1, NVIDIA HGX-1, System OEM, Cloud

7 MOST ADOPTED PLATFORM FOR ACCELERATING HPC
DEFINING THE NEXT GIANT WAVE IN HPC
All Top 15 HPC Apps Accelerated: VASP, AMBER, NAMD, GROMACS, Gaussian, Simulia Abaqus, WRF, OpenFOAM, ANSYS, LS-DYNA, BLAST, LAMMPS, ANSYS Fluent, Quantum Espresso, GAMESS
OAK RIDGE SUMMIT: US's next fastest supercomputer, 200+ Petaflop HPC; 3+ Exaflop of AI
ABCI Supercomputer (AIST): Japan's fastest AI supercomputer
Piz Daint: Europe's fastest supercomputer
14X GPU DEVELOPERS
500+ GPU-ACCELERATED APPLICATIONS

8 MOST ADOPTED PLATFORM FOR ACCELERATING AI
25X COMPANIES ENGAGED
EVERY DEEP LEARNING FRAMEWORK ACCELERATED
AVAILABLE EVERYWHERE: Cloud Services, Systems, Desktops

9 TESLA PLATFORM FOR HPC

10 BIG INEFFICIENCIES WITH CPU NODES
Single GPU Server 3.5x Faster than the Largest CPU Data Center
AMBER Simulation of CRISPR, Nature's Tool for Genome Editing
[Chart: ns/day versus # of CPUs; one node with 4x V100 GPUs compared with 48 CPU nodes of the Comet Supercomputer]
AMBER 16 pre-release; CRISPR based on PDB ID 5f9r, 336,898 atoms. CPU: dual-socket Intel E5-2680v3, 12 cores, 128 GB DDR4 per node, FDR IB.

11 WEAK NODES: Lots of Nodes Interconnected with Vast Network Overhead
STRONG NODES: Few Lightning-Fast Nodes with Performance of Hundreds of Weak Nodes
[Diagram labels: Network Fabric, Server Racks]

12 ARCHITECTING MODERN DATACENTERS
Strong Core CPU for Sequential Code
Volta: 5,120 CUDA Cores, 125 TFLOPS Tensor Core
NVLink for Strong Scaling

13 70% OF THE WORLD'S SUPERCOMPUTING WORKLOAD ACCELERATED
Top 15 HPC Applications: VASP, AMBER, NAMD, GROMACS, Gaussian, Simulia Abaqus, WRF, OpenFOAM, ANSYS, LS-DYNA, BLAST, LAMMPS, ANSYS Fluent, Quantum Espresso, GAMESS
500+ Accelerated Applications
Source: Intersect360 Research, Nov 2017, "HPC Application Support for GPU Computing"

14 GPU-ACCELERATED HPC APPLICATIONS
500+ APPLICATIONS
LIFE SCIENCES: 50+ apps, including Gaussian, VASP, AMBER, HOOMD-Blue, GAMESS
MFG, CAD, & CAE: 111 apps, including Ansys Fluent, Abaqus SIMULIA, AutoCAD, CST Studio Suite
PHYSICS: 20 apps, including QUDA, MILC, GTC-P
OIL & GAS: 17 apps, including RTM, SPECFEM 3D
CLIMATE & WEATHER: 4 apps, including Cosmos, Gales, WRF
DEEP LEARNING: 32 apps, including Caffe2, MXNet, TensorFlow
MEDIA & ENT.: 142 apps, including DaVinci Resolve, Premiere Pro CC, Redshift Renderer
FEDERAL & DEFENSE: 13 apps, including ArcGIS Pro, ENVI, SocetGXP
DATA SCI. & ANALYTICS: 23 apps, including MapD, Kinetica, Graphistry
SAFETY & SECURITY: 15 apps, including Cylance, FaceControl, Syndex Pro
COMP. FINANCE: 16 apps, including O-Quant Options Pricing, MUREX, MISYS
TOOLS & MGMT.: 15 apps, including Bright Cluster Manager, HPCtoolkit, Vampir

15 DEEP LEARNING COMES TO HPC
Stages: Simulation (FP64/FP32), Training (FP32/FP16), Regression Testing (FP16/INT8), Inference (FP16/INT8)
[Diagram labels: New Data, Training Set, Regression Set, Errors]

16 AI ACCELERATES SCIENTIFIC DISCOVERY
UIUC & NCSA, ASTROPHYSICS: 5,000X LIGO Signal Processing
U. FLORIDA & UNC, DRUG DISCOVERY: 300,000X Molecular Energetics Prediction
SLAC, ASTROPHYSICS: Gravitational Lensing, From Weeks to 10ms
PRINCETON & ITER, CLEAN ENERGY: 50% Higher Accuracy for Fusion Sustainment
U.S. DoE, PARTICLE PHYSICS: 33% More Accurate Neutrino Detection
U. PITT, DRUG DISCOVERY: 35% Higher Accuracy for Protein Scoring

17 ONE PLATFORM BUILT FOR BOTH DATA SCIENCE & COMPUTATIONAL SCIENCE
Tesla Platform with CUDA: Accelerating AI, Accelerating HPC

18 DRAMATICALLY MORE FOR YOUR MONEY
Save Up To $8M With Each GPU-Accelerated Rack
EQUAL THROUGHPUT WITH FEWER RACKS (# of racks, ~30 KW per rack)
GPU-accelerated: 1 rack ($0.8M), 36 CPUs + 72 V100s
RTM: 5 racks ($2.0M), 360 CPUs
VASP: 14 racks ($6.0M), 1152 CPUs
ResNet-50 (DL training): 22 racks ($9.2M), 1764 CPUs
BUDGET: SMALLER, EFFICIENT
GPU-accelerated: compute servers 85%, non-compute 15%
Traditional: compute servers 39%, non-compute 61% (rack, cabling, infrastructure, networking)
Source: Traditional Data Centers Cost model by Microsoft Research on Datacenter Costs

19 DATA CENTER SAVINGS FOR MIXED WORKLOADS
5X Better HPC TCO for Same Throughput
SAME THROUGHPUT: 1/3 THE COST, 1/4 THE SPACE, 1/5 THE POWER
MIXED WORKLOAD: Materials Science (VASP), Life Sciences (AMBER), Physics (MILC), Deep Learning (ResNet-50)
12 accelerated servers w/ 4 V100 GPUs: 20 KWatts
160 self-hosted servers: 96 KWatts
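The "1/5 the power" headline can be checked against the two server configurations quoted on the slide. The following is a small arithmetic sketch only; the variable and function names are illustrative, and the cost and space ratios are not recomputed because the slide gives no per-server cost or rack-space figures.

```python
def reduction_factor(baseline, accelerated):
    """How many times smaller the accelerated figure is than the baseline."""
    return baseline / accelerated


# Figures quoted on the slide for the same mixed-workload throughput.
cpu_servers, cpu_power_kw = 160, 96
gpu_servers, gpu_power_kw = 12, 20

power_factor = reduction_factor(cpu_power_kw, gpu_power_kw)
server_factor = reduction_factor(cpu_servers, gpu_servers)

print(f"Power: {power_factor:.1f}x lower (slide rounds this to 1/5 the power)")
print(f"Server count: {server_factor:.1f}x fewer servers")
```

The power ratio works out to 4.8, which the slide rounds to one fifth.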

20 TESLA V100
The Fastest and Most Productive GPU for AI and HPC
Volta Architecture: Most Productive GPU
Tensor Core: 125 Programmable TFLOPS Deep Learning
Improved NVLink & HBM2: Efficient Bandwidth
Volta MPS: Inference Utilization
Improved SIMT Model: New Algorithms

21 VOLTA TO FUEL SUMMIT
Next Milestone in AI Supercomputing
AI Exascale Today: 3+ EFLOPS Tensor Ops
Performance Leadership: 10X Perf Over Titan (chart values: 200 PF, 20 PF)
Accelerated Science: 5-10X Application Perf Over Titan
Applications shown: ACME, DIRAC, FLASH, GTC, HACC, LSDALTON, NAMD, NUCCOR, NWCHEM, QMCPACK, RAPTOR, SPECFEM, XGC

22 BREAKTHROUGH EFFICIENCY ON THE PATH TO EXASCALE
13/13 Greenest Supercomputers Powered by Tesla P100: TSUBAME 3.0, Kukai, AIST AI Cloud, RAIDEN GPU subsystem, Piz Daint, Wilkes-2, GOSAT-2 (RCF2), DGX Saturn V, Reedbush-H, JADE, Facebook Cluster, Cedar, DAVIDE
[Chart "Ahead of the Curve": GFLOPS per watt for top GPU systems in the Green500 list (Eurotech Aurora, Tsubame-KFC, SaturnV, Tsubame 3) with measured performance and NVIDIA projections for V100, plotted against the exascale goal]

23 POWER OF GPU COMPUTING PLATFORM
Delivered Value Grows Over Time
[Chart: AMBER performance (ns/day) across AMBER 12, AMBER 14, and AMBER 16 with successive CUDA releases on K20 (2013), K40 (2014), K80 (2015), P100 (2016), V100 (2017)]
[Chart: GoogLeNet performance (i/s) across cuDNN 2 through cuDNN 7, CUDA 6 through CUDA 9, and NCCL releases on 8X K80 (2014), 8X Maxwell (2015), DGX-1 (2016), DGX-1V (2017)]
AMBER dataset: Cellulose NVE; GoogLeNet dataset: ImageNet

24 TESLA PLATFORM FOR AI

25 AI REVOLUTIONIZING OUR WORLD
Search, Assistants, Translation, Recommendations, Shopping, Photos
Detect, Diagnose and Treat Diseases
Powering Breakthroughs in Agriculture, Manufacturing, EDA

26 NEURAL NETWORK COMPLEXITY IS EXPLODING
Bigger and More Compute Intensive
Image (GOP * Bandwidth): AlexNet, GoogleNet, ResNet-50, Inception-v2, Inception-v4 (350X)
Speech (GOP * Bandwidth): DeepSpeech, DeepSpeech 2, DeepSpeech 3 (30X)
Translation (GOP * Bandwidth): OpenNMT, GNMT, MoE (10X)

27 PLATFORM BUILT FOR AI
Delivering 125 TFLOPS of DL Performance with Volta
TENSOR CORE MATRIX DATA OPTIMIZATION: Dense Matrix of Tensor Compute
TENSOR-OP CONVERSION: FP32 to Tensor Op Data for Frameworks
VOLTA-OPTIMIZED cuDNN
VOLTA TENSOR CORE: 4x4 matrix processing array, D[FP32] = A[FP16] * B[FP16] + C[FP32], Optimized For Deep Learning
ALL MAJOR FRAMEWORKS
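The Tensor Core operation D[FP32] = A[FP16] * B[FP16] + C[FP32] mixes two precisions: the inputs are half precision, and the accumulation is single precision. The sketch below emulates that numerically on the CPU using only the standard library. It is an illustration of the arithmetic, not the hardware behavior; the helper names are hypothetical and the order in which real hardware accumulates products is not specified by the slide.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def tensor_op_4x4(a, b, c):
    """Emulate D = A * B + C on 4x4 nested lists.

    A and B are rounded to FP16 before multiplying; C and the running
    sum are kept in FP32 (assumed accumulation order: k = 0..3).
    """
    n = 4
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = to_fp32(c[i][j])
            for k in range(n):
                product = to_fp16(a[i][k]) * to_fp16(b[k][j])
                acc = to_fp32(acc + product)
            d[i][j] = acc
    return d


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
zeros = [[0.0] * 4 for _ in range(4)]
tenths = [[0.1] * 4 for _ in range(4)]

d = tensor_op_4x4(tenths, identity, zeros)
print("0.1 rounded to FP16:", to_fp16(0.1))
print("D[0][0]:", d[0][0])
```

Multiplying by the identity shows where precision is lost: the result carries the FP16 rounding of 0.1, while the FP32 accumulator adds no further error in this case.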

28 GPU DEEP LEARNING IS A NEW COMPUTING MODEL
TRAINING: Billions of Trillions of Operations. GPUs train larger models and accelerate time to market.
[Diagram labels: Training, Datacenter, Device]

29 REVOLUTIONARY AI PERFORMANCE
3X Faster DL Training Performance
Exponential Performance over Time (GoogLeNet): Over 80X DL Training Performance in 3 Years
[Chart: speedup vs K80, 0x to 100x; 1x K80 cuDNN2 (Q1 15), 4x M40 cuDNN3 (Q3 15), 8x P100 cuDNN6 (Q2 16), 8x V100 cuDNN7 (Q2 17)]
GoogLeNet training performance on versions of cuDNN vs 1x K80 cuDNN2
Relative Time to Train Improvements (LSTM): 3X Reduction in Time to Train Over P100
[Chart: 2X CPU, 15 days; 1X P100; 1X V100, 6 hours]
Neural Machine Translation training for 13 epochs, German -> English, WMT15 subset. CPU = 2x Xeon E V4

30 NVIDIA GPUS POWER WORLD'S FASTEST DEEP LEARNING PERFORMANCE
Time to train ResNet-50, all on Tesla P100:
Facebook, June '17: 60 mins
IBM, Aug '17: 48 mins
Preferred Networks, Nov: 15 mins
Dataset: ImageNet, trained for 90 epochs

31 GPU DEEP LEARNING IS A NEW COMPUTING MODEL
DATACENTER INFERENCING: 10s of billions of image, voice, video queries per day. GPU inference for fast response, maximize datacenter throughput.
[Diagram labels: Training, Datacenter, Device]

32 NVIDIA TENSORRT: PROGRAMMABLE INFERENCE ACCELERATOR
Platforms shown: Tesla P4, Jetson TX2, Drive PX 2, NVIDIA DLA, Tesla V100

33 NVIDIA TENSORRT 3
World's Fastest Inference Platform
[Chart, IMAGES: ResNet-50 throughput in images/sec (target 7ms latency) for CPU + TensorFlow, V100 + TensorFlow, V100 + TensorRT]
[Chart, TRANSLATION: OpenNMT throughput in sentences/sec (target 200ms latency) for CPU + Torch, V100 + Torch, V100 + TensorRT]

34 NVIDIA PLATFORM SAVES DATA CENTER COSTS
Game Changing Inference Performance
SAME THROUGHPUT: 1/4 THE SPACE, 1/22 THE POWER
INFERENCE WORKLOAD: Image recognition using ResNet-50
1 HGX server: 45,000 images/sec, 3 KWatts
160 CPU servers: 45,000 images/sec, 65 KWatts
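The "1/22 the power" claim follows from the throughput and power figures on the slide. The sketch below derives throughput per kilowatt for both configurations; the names are illustrative, and only the numbers quoted on the slide are used.

```python
def images_per_sec_per_kw(images_per_sec, power_kw):
    """Inference throughput delivered per kilowatt of power."""
    return images_per_sec / power_kw


# Figures quoted on the slide (ResNet-50 image recognition).
hgx_efficiency = images_per_sec_per_kw(45_000, 3)
cpu_efficiency = images_per_sec_per_kw(45_000, 65)
efficiency_gain = hgx_efficiency / cpu_efficiency
per_cpu_server = 45_000 / 160

print(f"HGX server: {hgx_efficiency:.0f} images/sec per kW")
print(f"CPU servers: {cpu_efficiency:.0f} images/sec per kW")
print(f"Efficiency gain: {efficiency_gain:.1f}x (slide rounds to 22x)")
print(f"Throughput per CPU server: {per_cpu_server:.2f} images/sec")
```

At equal throughput the efficiency gain equals the power ratio, 65 / 3, or about 21.7.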

35 GPU-ACCELERATED INFERENCE
iFLYTEK: Speech Recognition
VALOSSA: Video Intelligence
MICROSOFT BING: Visual Search

36 TESLA PRODUCT FAMILY

37 END-TO-END PRODUCT FAMILY
HYPERSCALE HPC: Tesla V100 (training & inference), Tesla P4 (most efficient inference & transcoding). Deep learning training & inference.
STRONG-SCALE HPC: Tesla V100 with NVLink. HPC and DL workloads scaling to multiple GPUs.
MIXED-APPS HPC: Tesla V100 with PCI-E. HPC workloads with mix of CPU and GPU workloads.
FULLY INTEGRATED SUPERCOMPUTER: DGX Station, DGX-1 Server. Fully integrated deep learning solution.

38 OPTIMIZED FOR DATACENTER EFFICIENCY
30% More Performance in a Rack
Max Efficiency mode: 75% Perf at Half the Power
[Chart: DL perf / watt versus watts, marking Max Efficiency and Max Performance]
MAXP, computer vision: 13 KW rack, 4 nodes of 8xV100, 1X ResNet-50 rack throughput (ResNet-50 training)
MAXQ, computer vision: 13 KW rack, 7 nodes of 8xV100

39 TESLA V100
Specification | For NVLink Servers | For PCIe Servers
Core | 5120 CUDA cores, 640 Tensor cores | 5120 CUDA cores, 640 Tensor cores
Compute | 7.8 TF DP, 15.7 TF SP, 125 TF DL | 7 TF DP, 14 TF SP, 112 TF DL
Memory | HBM2: 900 GB/s, 16 GB | HBM2: 900 GB/s, 16 GB
Interconnect | NVLink (up to 300 GB/s) + PCIe Gen3 (up to 32 GB/s) | PCIe Gen3 (up to 32 GB/s)
Power | 300W | 250W
Available | Now | Now
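The deep learning throughput figures can be related to the Tensor Core count with a short calculation. The sketch below assumes that each Tensor Core completes one 4x4x4 matrix multiply-accumulate per clock, which is 64 fused multiply-adds or 128 floating-point operations; that per-clock figure is an assumption drawn from the "4x4 matrix processing array" description elsewhere in the deck, not a number printed on this slide, and the function name is illustrative.

```python
# Assumption: one 4x4x4 matrix multiply-accumulate per Tensor Core per clock.
FMAS_PER_TENSOR_CORE_PER_CLOCK = 4 * 4 * 4
FLOPS_PER_FMA = 2  # one multiply plus one add


def implied_clock_ghz(peak_tflops, tensor_cores):
    """Clock rate (GHz) at which the given peak Tensor throughput is reached."""
    flops_per_clock = tensor_cores * FMAS_PER_TENSOR_CORE_PER_CLOCK * FLOPS_PER_FMA
    return peak_tflops * 1e12 / flops_per_clock / 1e9


nvlink_clock = implied_clock_ghz(125, 640)  # NVLink variant: 125 TF DL
pcie_clock = implied_clock_ghz(112, 640)    # PCIe variant: 112 TF DL

print(f"NVLink variant: about {nvlink_clock:.2f} GHz implied clock")
print(f"PCIe variant: about {pcie_clock:.2f} GHz implied clock")
```

Under that assumption, the gap between 125 TF and 112 TF corresponds to a lower clock on the 250W PCIe part rather than to fewer Tensor Cores, since both variants list 640.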

40 TESLA PLATFORM FOR CLOUD PROVIDERS

41 CLOUD GPU DEMAND OUTSTRIPS SUPPLY
AWS Launches P2 Instance: "P2 instance is one of the fastest growing instance in AWS history." Andrew Jassy, AWS CEO, re:Invent 2016
Azure Launches N-Series Preview: "We've had thousands of customers participate in the N-Series preview since we launched it back in August." Corey Sanders, Director of Compute, Azure

42 GLOBAL CSP OFFERINGS
Compute
AWS P3: up to 8X V100 SXM2. Available only in N. Virginia, Oregon, Ireland, Tokyo. ec2/instance-types/p3/
AWS P2: up to 8X K80 physical cards. /ec2/instance-types/p2/
GPU Server: up to 4X K80. GPU Server: up to 4X P100 PCIe, public beta available. /gpu/
GPU Server: up to 2X K80, 1X P100 PCIe (in bare-metal). oudcomputing/bluemix/gpucomputing
NC series: up to 2X K80. NC v2 & ND series: up to 4X P100 PCIe / 4X P40. Available only in US West 2 Region. en-us/pricing/details/virtualmachines/series/#n-series
X7 shape: up to 2X P100 (in bare-metal and VM). Available only in Ashburn region, Frankfurt to come in Jan. /infrastructure/compute
Virtual W/S
AWS G3: M60
GPU Server: P100 PCIe vWS, private alpha available
GPU Server: P100 PCIe vWS, public beta Jan 18
GPU Server: up to 2X M60, 2X M10
GPU Server: M. en-us/pricing/details/virtualmachines/series/#n-series
GPU Server: M60
Virtual PC
GPU Server: up to 4X K520 physical cards
GPU Server: M10, VMware Horizon Air vPC launch Jan

43 NVIDIA GPU CLOUD
AI and HPC Everywhere, For Everyone
Innovate in minutes, not weeks: removes all the DIY complexity of DL and HPC software integration.
Cross platform: containers run locally on DGX Systems and TITAN PCs, or on cloud service provider GPU instances.
Always up to date: monthly updates by NVIDIA to ensure maximum performance.
NVIDIA GPU Cloud integrates GPU-optimized deep learning frameworks, HPC apps, runtimes, libraries, and OS into a ready-to-run container, available at no charge.

44 NVIDIA GPU CLOUD: SIMPLIFYING AI & HPC
Deep Learning, HPC Apps, HPC Viz

45 NGC GPU-OPTIMIZED DEEP LEARNING CONTAINERS
A Comprehensive Catalog of Deep Learning Software
NVCaffe, Caffe2, Microsoft Cognitive Toolkit (CNTK), DIGITS, MXNet, PyTorch, TensorFlow, Theano, Torch, CUDA (base level container for developers)
NEW! NVIDIA TensorRT inference accelerator with ONNX support

46 HPC APPS COMING TO NVIDIA GPU CLOUD

47 NVIDIA GPU CLOUD FOR HPC VISUALIZATION
UNIFIED VISUALIZATION FOR LARGE DATA SETS
Large-scale Volumetric Rendering, Physically Accurate Ray Tracing, Production-quality Images, Seamless Integration with ParaView
ParaView with NVIDIA IndeX, ParaView with NVIDIA OptiX, ParaView with NVIDIA Holodeck
Early access now: sign up at nvidia.com/gpu-cloud

48 TESLA PLATFORM FOR DEVELOPERS

50 HOW GPU ACCELERATION WORKS
Application code is split in two: the compute-intensive functions (5% of code) run on the GPU, and the rest of the sequential code runs on the CPU.
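The split between GPU-accelerated functions and sequential CPU code bounds the overall speedup. One common way to reason about that bound is Amdahl's law, which is brought in here for illustration and is not named on the slide. Note that the slide's 5% refers to lines of code, not runtime; the runtime fraction and kernel speedup below are hypothetical inputs.

```python
def amdahl_speedup(accelerated_fraction, kernel_speedup):
    """Overall speedup when a fraction of the runtime is made faster.

    accelerated_fraction: share of original runtime spent in offloaded code.
    kernel_speedup: how much faster that share runs on the accelerator.
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if kernel_speedup <= 0:
        raise ValueError("kernel_speedup must be positive")
    serial_fraction = 1.0 - accelerated_fraction
    return 1.0 / (serial_fraction + accelerated_fraction / kernel_speedup)


# Hypothetical example: 90% of runtime sits in the offloaded functions,
# and those functions run 10x faster on the GPU.
overall = amdahl_speedup(0.9, 10)
print(f"Overall application speedup: {overall:.2f}x")
```

The remaining sequential portion caps the gain, which is why a small amount of code can matter so much when it accounts for most of the runtime.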

51 GPU ACCELERATED LIBRARIES
Drop-in Acceleration for Your Applications
DEEP LEARNING: cuDNN, TensorRT, DeepStream SDK
SIGNAL, IMAGE & VIDEO: cuFFT, NVIDIA NPP, CODEC SDK
LINEAR ALGEBRA and PARALLEL ALGORITHMS: cuBLAS, cuSPARSE, CUDA Math library, cuSOLVER, cuRAND, nvGRAPH, NCCL

52 CUDA TOOLKIT 9 UNLEASHES POWER OF VOLTA
Optimized for Volta: Tensor Cores, Second-Generation NVLink, HBM2 Stacked Memory
FASTER LIBRARIES: GEMM Optimizations for RNNs (cuBLAS), >20x Faster Image Processing (NPP), FFT Optimizations Across Various Sizes (cuFFT)
COOPERATIVE THREAD GROUPS: Flexible Thread Groups, Efficient Parallel Algorithms, Synchronize Across Thread Blocks in a Single GPU or Multi-GPUs
DEVELOPER TOOLS & PLATFORM UPDATES: 1.3x Faster Compiling, New OS and Compiler Support, Unified Memory Profiling, NVLink Visualization

53 WHAT IS OPENACC
OpenACC is a directives-based programming approach to parallel computing designed for performance and portability on CPUs and accelerators for HPC (OpenPOWER, Sunway, x86 CPU & Xeon Phi, NVIDIA GPU, PEZY-SC).
Add Simple Compiler Directive:
main()
{
  <serial code>
  #pragma acc kernels
  {
    <parallel code>
  }
}

54 OPENACC: EASY ONBOARD TO GPU COMPUTING
A Widely Adopted Directives Model for Parallel Programming
SIMPLE. POWERFUL. PORTABLE.
Platforms: POWER, Sunway, x86 CPU, x86 Xeon Phi, NVIDIA GPU, AMD, PEZY-SC
[Chart: AWE Hydrodynamics CloverLeaf mini-app (bm32 data set), speedup vs single Haswell core; PGI OpenACC versus Intel/IBM OpenMP on multicore Broadwell, multicore POWER8, and 1x, 2x, 4x Volta V100]
ADOPTED BY KEY HPC CODES
3 of Top 5 HPC Apps: ANSYS Fluent, VASP, Gaussian
5 CAAR Codes: GTC, XGC, ACME, FLASH, LSDalton
2017 Gordon Bell Finalist: CAM-SE on TaihuLight

55 OpenACC results by application:
LSDalton (Quantum Chemistry): 12X speedup in 1 week
Numeca (CFD): 10X faster kernels, 2X faster app
PowerGrid (Medical Imaging): 40 days to 2 hours
INCOMP3D (CFD): 3X speedup
NekCEM (Computational Electromagnetics): 2.5X speedup, 60% less energy
COSMO (Climate Weather): 40X speedup, 3X energy efficiency
CloverLeaf (CFD): 4X speedup, single CPU/GPU code
MAESTRO CASTRO (Astrophysics): 4.4X speedup, 4 weeks effort

56 OPENACC RESOURCES
Guides, Talks, Tutorials, Videos, Books, Spec, Code Samples, Teaching Materials, Events, Success Stories, Courses, Slack, Stack Overflow
Sections: Resources, Success Stories, Compilers and Tools (FREE Compilers), Events

57 NVIDIA DEEP LEARNING SDK
High performance GPU-acceleration for deep learning
Powerful tools and libraries for designing and deploying GPU-accelerated deep learning applications
High performance building blocks for training and deploying deep neural networks on NVIDIA GPUs
Industry vetted deep learning algorithms and linear algebra subroutines for developing novel deep neural networks
Multi-GPU and multi-node scaling that accelerates training on up to eight GPUs
developer.nvidia.com/deep-learning-software
"We are amazed by the steady stream of improvements made to the NVIDIA Deep Learning SDK and the speedups that they deliver." Frédéric Bastien, Team Lead (Theano), MILA

58 NVIDIA COLLECTIVE COMMUNICATIONS LIBRARY (NCCL)
Multi-GPU and multi-node collective communication primitives
High-performance multi-GPU and multi-node collective communication primitives optimized for NVIDIA GPUs
Fast routines for multi-GPU, multi-node acceleration that maximize inter-GPU bandwidth utilization
Easy to integrate and MPI compatible. Uses automatic topology detection to scale HPC and deep learning applications over PCIe and NVLink
Accelerates leading deep learning frameworks such as Caffe2, Microsoft Cognitive Toolkit, MXNet, PyTorch and more
Multi-GPU: NVLink, PCIe. Multi-Node: InfiniBand verbs, IP Sockets. Automatic Topology Detection.
[Chart: Near-Linear Multi-Node Scaling with NCCL, images/second. Microsoft Cognitive Toolkit multi-node scaling performance (images/sec), NVIDIA DGX-1 + cuDNN 6 (FP32), ResNet50, batch size: 64]
developer.nvidia.com/nccl
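A collective communication primitive such as all-reduce leaves every participant holding the element-wise sum of all participants' buffers. The sketch below simulates one well-known way to do that, a ring all-reduce, in plain Python lists. It illustrates the collective's semantics only: it does not use or reproduce the NCCL API, the ring scheme is chosen here for illustration rather than taken from the slide, and the function name is hypothetical.

```python
def ring_allreduce(buffers):
    """Simulate a sum all-reduce over a ring of ranks.

    buffers: one equal-length list per rank; the length must be divisible
    by the number of ranks. Returns the per-rank buffers after the reduce.
    """
    n = len(buffers)
    length = len(buffers[0])
    if any(len(b) != length for b in buffers) or length % n != 0:
        raise ValueError("buffers must be equal length and divisible by rank count")
    chunk = length // n
    data = [list(b) for b in buffers]

    def span(c):
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: each rank passes one chunk to its right
    # neighbour, which adds it in. After n - 1 steps, rank r holds the
    # fully summed chunk (r + 1) % n.
    for step in range(n - 1):
        messages = [((r - step) % n, data[r][span((r - step) % n)]) for r in range(n)]
        for r, (c, payload) in enumerate(messages):
            dst = (r + 1) % n
            data[dst][span(c)] = [x + y for x, y in zip(data[dst][span(c)], payload)]

    # Phase 2, all-gather: the summed chunks circulate around the ring
    # until every rank has all of them.
    for step in range(n - 1):
        messages = [((r + 1 - step) % n, data[r][span((r + 1 - step) % n)]) for r in range(n)]
        for r, (c, payload) in enumerate(messages):
            data[(r + 1) % n][span(c)] = payload

    return data


ranks = [
    [1, 2, 3, 4, 5, 6],
    [10, 20, 30, 40, 50, 60],
    [100, 200, 300, 400, 500, 600],
]
result = ring_allreduce(ranks)
print(result[0])
```

In a ring, each rank sends and receives only one chunk per step, so per-link traffic stays roughly constant as ranks are added.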

59 NVIDIA DIGITS
Interactive Deep Learning GPU Training System
Interactive deep learning training application for engineers and data scientists
Simplify deep neural network training with an interactive interface to train, validate, and visualize results
Built-in workflows for image classification, object detection and image segmentation
Improve model accuracy with pre-trained models from the DIGITS Model Store
Faster time to solution with multi-GPU acceleration
developer.nvidia.com/digits

60 NVIDIA cuDNN
Deep Learning Primitives
High performance building blocks for deep learning frameworks
Drop-in acceleration for widely used deep learning frameworks such as Caffe2, Microsoft Cognitive Toolkit, PyTorch, TensorFlow, Theano and others
Accelerates industry vetted deep learning algorithms, such as convolutions, LSTM RNNs, fully connected, and pooling layers
Fast deep learning training performance tuned for NVIDIA GPUs
[Chart: Deep Learning Training Performance, images/second, for cuDNN 2 (8x K80), cuDNN 4 (8x Maxwell), cuDNN 6 + NCCL 1.6 (DGX-1), cuDNN 7 + NCCL 2 (DGX-1V)]
developer.nvidia.com/cudnn
"NVIDIA has improved the speed of cuDNN with each release while extending the interface to more operations and devices at the same time." Evan Shelhamer, Lead Caffe Developer, UC Berkeley

61 NVIDIA TensorRT 3
Programmable Inference Accelerator
TensorRT Compiler for Optimized Neural Networks: takes a Trained Neural Network and produces a Compiled & Optimized Neural Network
Optimizations: Weight & Activation Precision Calibration, Layer & Tensor Fusion, Kernel Auto-Tuning, Dynamic Tensor Memory, Multi-Stream Execution
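Precision calibration means choosing a scale so that floating-point weights or activations can be stored as low-precision integers with little loss. The sketch below shows the idea with symmetric INT8 quantization and a simple max-absolute-value scale. This is a generic stand-in for illustration: the slide does not describe how TensorRT calibrates, and every function name here is hypothetical.

```python
def calibrate_scale(samples):
    """Pick a scale so the largest observed magnitude maps to 127."""
    max_abs = max(abs(x) for x in samples)
    return max_abs / 127.0 if max_abs > 0 else 1.0


def quantize(x, scale):
    """Map a float to a signed 8-bit integer, clamping out-of-range values."""
    q = round(x / scale)
    return max(-127, min(127, q))


def dequantize(q, scale):
    """Recover an approximate float from its INT8 code."""
    return q * scale


# Hypothetical calibration samples standing in for observed activations.
calibration_samples = [-1.0, 0.25, 0.75]
scale = calibrate_scale(calibration_samples)

code = quantize(0.25, scale)
restored = dequantize(code, scale)
print(f"scale={scale:.6f} code={code} restored={restored:.6f}")
```

Values inside the calibrated range round to within half a quantization step, while values outside it are clamped, which is why the choice of calibration data matters.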

ACCELERATED COMPUTING: THE PATH FORWARD. Jensen Huang, Founder & CEO SC17 Nov. 13, 2017

COMPUTING AFTER MOORE'S LAW
[Chart: 40 years of CPU trend data versus GPU-accelerated computing; 1.1X per year]