NEW NVIDIA PLATFORM FOR AI

Similar documents
NVIDIA PLATFORM FOR AI

POWERING THE AI REVOLUTION JENSEN HUANG, FOUNDER & CEO GTC 2017

A NEW COMPUTING ERA JENSEN HUANG, FOUNDER & CEO GTC CHINA 2017

A NEW COMPUTING ERA. Shanker Trivedi Senior Vice President Enterprise Business at NVIDIA

DEEP NEURAL NETWORKS CHANGING THE AUTONOMOUS VEHICLE LANDSCAPE. Dennis Lui August 2017

EFFICIENT INFERENCE WITH TENSORRT. Han Vanholder

A NEW COMPUTING ERA. DAVID B. KIRK, FELLOW NVIDIA AI Conference Singapore 2017

Inference Optimization Using TensorRT with Use Cases. Jack Han / 한재근 Solutions Architect NVIDIA

SYNERGIE VON HPC UND DEEP LEARNING MIT NVIDIA GPUS

ENDURING DIFFERENTIATION. Timothy Lanfear

ENDURING DIFFERENTIATION Timothy Lanfear

TESLA V100 PERFORMANCE GUIDE. Life Sciences Applications

MACHINE LEARNING WITH NVIDIA AND IBM POWER AI

Autonomous Driving Solutions

GPU FOR DEEP LEARNING. 周国峰 Wuhan University 2017/10/13

ACCELERATED COMPUTING: THE PATH FORWARD. Jensen Huang, Founder & CEO SC17 Nov. 13, 2017

SUPERCHARGE DEEP LEARNING WITH DGX-1. Markus Weber SC16 - November 2016

GTC Jensen Huang Founder & CEO

NVIDIA DGX SYSTEMS PURPOSE-BUILT FOR AI

ACCELERATED COMPUTING: THE PATH FORWARD. Jen-Hsun Huang, Co-Founder and CEO, NVIDIA SC15 Nov. 16, 2015

TESLA V100 PERFORMANCE GUIDE May 2018

RECENT TRENDS IN GPU ARCHITECTURES. Perspectives of GPU computing in Science, 26 th Sept 2016

Xilinx ML Suite Overview

TOWARDS ACCELERATED DEEP LEARNING IN HPC AND HYPERSCALE ARCHITECTURES Environnement logiciel pour l apprentissage profond dans un contexte HPC

DGX UPDATE. Customer Presentation Deck May 8, 2017

Object recognition and computer vision using MATLAB and NVIDIA Deep Learning SDK

S8765 Performance Optimization for Deep- Learning on the Latest POWER Systems

HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK STEVE OBERLIN CTO, TESLA ACCELERATED COMPUTING NVIDIA

World s most advanced data center accelerator for PCIe-based servers

Fast Hardware For AI

DGX SYSTEMS: DEEP LEARNING FROM DESK TO DATA CENTER. Markus Weber and Haiduong Vo

NVIDIA TESLA V100 GPU ARCHITECTURE THE WORLD S MOST ADVANCED DATA CENTER GPU

Characterization and Benchmarking of Deep Learning. Natalia Vassilieva, PhD Sr. Research Manager

HPE Deep Learning Cookbook: Recipes to Run Deep Learning Workloads. Natalia Vassilieva, Sergey Serebryakov

Building the Most Efficient Machine Learning System

TESLA P100 PERFORMANCE GUIDE. HPC and Deep Learning Applications

GTC was the introduction to the future of AI, a protector, a healer, a helper, a guardian, a visionary, and just a little slice of amazing.

S8688 : INSIDE DGX-2. Glenn Dearth, Vyas Venkataraman Mar 28, 2018

Building the Most Efficient Machine Learning System

GPU ACCELERATED COMPUTING. 1 st AlsaCalcul GPU Challenge, 14-Jun-2016, Strasbourg Frédéric Parienté, Tesla Accelerated Computing, NVIDIA Corporation

Cisco UCS C480 ML M5 Rack Server Performance Characterization

NVIDIA GPU CLOUD DEEP LEARNING FRAMEWORKS

NVIDIA FOR DEEP LEARNING. Bill Veenhuis

Deploying Deep Learning Networks to Embedded GPUs and CPUs

Introduction to Deep Learning in Signal Processing & Communications with MATLAB

Deep Learning mit PowerAI - Ein Überblick

S INSIDE NVIDIA GPU CLOUD DEEP LEARNING FRAMEWORK CONTAINERS

S8901 Quadro for AI, VR and Simulation

Deep learning in MATLAB From Concept to CUDA Code

The Tesla Accelerated Computing Platform

INVESTOR UPDATE. September 2018

TESLA P100 PERFORMANCE GUIDE. Deep Learning and HPC Applications

Nvidia Jetson TX2 and its Software Toolset. João Fernandes 2017/2018

IBM Deep Learning Solutions

DEEP LEARNING ALISON B LOWNDES. Deep Learning Solutions Architect & Community Manager EMEA

S8822 OPTIMIZING NMT WITH TENSORRT Micah Villmow Senior TensorRT Software Engineer

GPU-Accelerated Deep Learning

Democratizing Machine Learning on Kubernetes

Deep Learning: Transforming Engineering and Science The MathWorks, Inc.

NVIDIA DEEP LEARNING INSTITUTE

NVIDIA Update and Directions on GPU Acceleration for Earth System Models

IBM Power AC922 Server

GPUS FOR NGVLA. M Clark, April 2015

NVIDIA DLI HANDS-ON TRAINING COURSE CATALOG

Deep learning prevalence. first neuroscience department. Spiking Neuron Operant conditioning First 1 Billion transistor processor

GPU Coder: Automatic CUDA and TensorRT code generation from MATLAB

How to Build Optimized ML Applications with Arm Software

CME 213 S PRING Eric Darve

Deep Learning Inference on Openshift with GPUs

Defense Data Generation in Distributed Deep Learning System Se-Yoon Oh / ADD-IDAR

NVDIA DGX Data Center Reference Design

VOLTA: PROGRAMMABILITY AND PERFORMANCE. Jack Choquette NVIDIA Hot Chips 2017

MIXED PRECISION TRAINING OF NEURAL NETWORKS. Carl Case, Senior Architect, NVIDIA

How to Build Optimized ML Applications with Arm Software

TESLA PLATFORM. Jan 2018

April 4-7, 2016 Silicon Valley INSIDE PASCAL. Mark Harris, October 27,

NVIDIA DEEP LEARNING PLATFORM

Unified Deep Learning with CPU, GPU, and FPGA Technologies

MIXED PRECISION TRAINING: THEORY AND PRACTICE Paulius Micikevicius

Accelerated Platforms: The Future of Computing. Marc Hamilton, VP Solutions Architecture & Engineering, NVIDIA Korea AI Conference 2018

IBM CORAL HPC System Solution

Accelerating your Embedded Vision / Machine Learning design with the revision Stack. Giles Peckham, Xilinx

S THE MAKING OF DGX SATURNV: BREAKING THE BARRIERS TO AI SCALE. Presenter: Louis Capps, Solution Architect, NVIDIA,

Turing Architecture and CUDA 10 New Features. Minseok Lee, Developer Technology Engineer, NVIDIA

NVIDIA AI INFERENCE PLATFORM

NVIDIA GPU TECHNOLOGY UPDATE

NVIDIA AI BRAIN OF SELF DRIVING AND HD MAPPING. September 13, 2016

MIOVISION DEEP LEARNING TRAFFIC ANALYTICS SYSTEM FOR REAL-WORLD DEPLOYMENT. Kurtis McBride CEO, Miovision

In partnership with. VelocityAI REFERENCE ARCHITECTURE WHITE PAPER

April 2 nd, Bob Burroughs Director, HPC Solution Sales

Advancing State-of-the-Art of Autonomous Vehicles and Robotics Research using AWS GPU Instances

INTRODUCING THE DGX FAMILY. Marc Domenech May 8, 2017

An introduction to Machine Learning silicon

Deep Learning and HPC. Bill Dally, Chief Scientist and SVP of Research January 17, 2017

CafeGPI. Single-Sided Communication for Scalable Deep Learning

Shrinath Shanbhag Senior Software Engineer Microsoft Corporation

NVIDIA Accelerators Models HPE NVIDIA GV100 Nvlink Bridge Kit HPE NVIDIA Tesla V100 FHHL 16GB Computational Accelerator

DEEP NEURAL NETWORKS AND GPUS. Julie Bernauer

NVIDIA DATA LOADING LIBRARY (DALI)

CUDA: NEW AND UPCOMING FEATURES

Transcription:

NEW NVIDIA PLATFORM FOR AI Pedro Mario Cruz e Silva (pcruzesilva@nvidia.com) LinkedIn Solution Architect Manager Enterprise Latin America Global Oil & Gas Team

"GTC 2017: 'I AM AI' OPENING IN KEYNOTE" HTTPS://WWW.YOUTUBE.COM/WATCH?V=SUNPRR4O5ZA 2

3

LIFE AFTER MOORE S LAW 10 7 40 Years of Microprocessor Trend Data 10 6 10 5 10 4 Transistors (thousands) 1.1X per year 10 3 10 2 Single-threaded perf 1.5X per year 1980 1990 2000 2010 2020 Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten New plot and data collected for 2010-2015 by K. Rupp 4

Normalized Unit (Billions) 200B CORE HOURS OF LOST SCIENCE Data Center Throughput is the Most Important Thing for HPC 400 350 300 National Science Foundation (NSF XSEDE) Supercomputing Resources Computing Resources Requested Computing Resources Available 250 200 150 100 50 0 2009 2010 2011 2012 2013 2014 2015 Source: NSF XSEDE Data: https://portal.xsede.org/#/gallery NU = Normalized Computing Units are used to compare compute resources across supercomputers and are based on the result of the High Performance LINPACK benchmark run on each system 5

THE ADVANTAGES OF GPU-ACCELERATED DATA CENTER TFLOPS NVIDIA GPU x86 CPU 6.0 5.0 P100 4.0 3.0 K80 2.0 K20 K40 Fast GPU + Strong CPU 1.0 M1060 M2090 0.0 2008 2009 2010 2011 2012 2013 2014 2016 6

RISE OF GPU COMPUTING APPLICATIONS 10 7 10 6 GPU-Computing perf 1.5X per year 1000X by 2025 ALGORITHMS 10 5 1.1X per year SYSTEMS 10 4 CUDA ARCHITECTURE 10 3 1.5X per year 10 2 Single-threaded perf 1980 1990 2000 2010 2020 Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten New plot and data collected for 2010-2015 by K. Rupp 7

DEEP LEARNING 8

LEARNING FROM DATA AND SOME BUZZ WORDS ARTIFICAL INTELLIGENCE Knowledge & Reason Learning Planning Communicating Perceiving MACHINE LEARNING Learning from data Expert systems Handcrafted features DEEP LEARNING Learning from data Neural networks Computer learned features 9

TRAINING Training Data A NEW COMPUTING MODEL Trained Neural Network Input Label Output INFERENCE Input Trained Neural Network Label Output 10

A NEW COMPUTING MODEL Outperform experts, facts, rules with software that writes software 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% ImageNet Traditional CV Deep Learning 2009 2010 2011 2012 2013 2014 2015 2016 Traditional Computer Vision Experts + Time Deep Learning Object Detection DNN + Data + GPU Deep Learning Achieves Superhuman Results 11

ACCELERATING EULERIAN FLUID SIMULATION WITH CONVOLUTIONAL NETWORKS Tompson, J., Schlachter, K., Sprechmann, P., & Perlin, K. (2016). Accelerating Eulerian Fluid Simulation With Convolutional Networks. arxiv preprint arxiv:1607.03597. 12

"ACCELERATING EULERIAN FLUID SIMULATION WITH CONVOLUTIONAL NETWORKS" HTTPS://WWW.YOUTUBE.COM/WATCH?V=W71ZXKNIJFO 13

14

DEEP LEARNING SOFTWARE 15

POWERING THE DEEP LEARNING ECOSYSTEM NVIDIA SDK accelerates every major framework COMPUTER VISION SPEECH & AUDIO NATURAL LANGUAGE PROCESSING OBJECT DETECTION IMAGE CLASSIFICATION VOICE RECOGNITION LANGUAGE TRANSLATION RECOMMENDATION ENGINES SENTIMENT ANALYSIS DEEP LEARNING FRAMEWORKS Mocha.jl NVIDIA DEEP LEARNING SDK developer.nvidia.com/deep-learning-software 16

IMAGE CLASSIFICATION DEEP LEARNING WORKFLOWS OBJECT DETECTION New in DIGITS 5 IMAGE SEGMENTATION 98% Dog 2% Cat Classify images into classes or categories Object of interest could be anywhere in the image Find instances of objects in an image Objects are identified with bounding boxes Partition image into multiple regions Regions are classified at the pixel level 17

WHAT S NEW IN DIGITS? TENSORFLOW SUPPORT NEW PRE-TRAINED MODELS Train TensorFlow Models Interactively with DIGITS Image Classification: VGG-16, ResNet50 Object Detection: DetectNet 18

NVIDIA TENSORRT PROGRAMMABLE INFERENCE ACCELERATOR TESLA P4 TensorRT JETSON TX2 DRIVE PX 2 NVIDIA DLA TESLA V100 19

DEEP LEARNING APPLICATIONS 20

AI TOOL BOOSTS CUSTOMER SERVICE KLM s 235 social media service agents engage in 15K conversations a week, 24/7. To contend with the overwhelming volume of messages, KLM uses GPUaccelerated deep learning to predict the best response to an incoming message and shows it to a contact center agent for approval or personalization before sending it to the customer. The resulting time savings for KLM service agents means they can focus on customers with more pressing needs and handle a greater volume of questions while still maintaining a high degree of customer satisfaction. 21

Fraud Prevention (ML & DL) 10,000s of features make up todays fraudulent behavior. AI can detect patterns faster and more accurate than humans -Hui Wang, Senior Director of Global Risk Sciences, Pay Pal 22

SAP AI FOR THE ENTERPRISE First commercial AI offerings from SAP Brand Impact, Service Ticketing, Invoice-to-Record applications Powered by NVIDIA GPUs on DGX-1 and AWS 23

MICROSOFT - "BUILD 2017: WORKPLACE SAFETY DEMONSTRATION" HTTPS://WWW.YOUTUBE.COM/WATCH?V=PL-C00M2CNI 24

25

"FUTURE OF AI CITIES ON DISPLAY AT ISC WEST 2017" HTTPS://WWW.YOUTUBE.COM/WATCH?V=ZHBV34KOWHM 26

27

"LIP READING SENTENCES IN THE WILD" HTTPS://WWW.YOUTUBE.COM/WATCH?V=5AOGZAUPILE 28

29

TRAINING SET Features (Seismic) Labels 30

D A T A Conv n: 96 p: 100 k: 11 s: 4 ReLU Pool MAX k:3 s:2 LRN Ls: 5 alpha: 1.0e-4 beta: 0.75 Conv n: 256 p: 2 k: 5 s: 1 ReLU Pool MAX k:3 s:2 LRN Ls: 5 alpha: 1.0e-4 beta: 0.75 Conv n: 384 p: 1 k: 3 s: 1 ReLU Conv n: 384 p: 1 k: 3 s: 1 ReLU Conv n: 256 p: 1 k: 3 s: 1 ReLU Pool MAX k:3 s:2 Conv n: 4096 p: 0 k: 6 s: 1 ReLU Dropout ratio: 0.5 Conv n: 4096 p: 0 k: 1 s: 1 ReLU Dropout ratio: 0.5 Conv n: 21 p: 0 k: 1 Deconv bias: false n: 21 k: 63 s: 32 Crop axis: 2 offset: 19 Softmax with Loss L A B E L 31

RESULT 32

AUTONOMOUS CARS 33

AN AMAZING YEAR FOR SELF-DRIVING CARS Uber Enters the Race Toyota Invests $1B in AI Lab Volvo Drive Me on Public Roads in 2017 NHTSA: Computer Counts as Driver Tesla Model 3: 300K pre-orders Audi, BMW, Daimler Buy HERE Tesla Model S Auto-pilot Baidu Enters the Race Honda, Nissan, Toyota Team Up GM Buys Cruise 34

NEW AI DRIVING MAPPING KALDI LOCALIZATION DRIVENET DAVENET Training on DGX-1 NVIDIA DGX-1 NVIDIA DRIVE PX Driving with DriveWorks 35

"NVIDIA DRIVE AUTONOMOUS VEHICLE PLATFORM" HTTPS://WWW.YOUTUBE.COM/WATCH?V=0RC4RQYLTEU 36

37

"NVIDIA SELF-DRIVING CAR DEMO AT CES 2017" HTTPS://WWW.YOUTUBE.COM/WATCH?V=FMVWLR0X1SK 38

39

TESLA PLATFORM 40

TESLA V100 The Fastest and Most Productive GPU for AI and HPC Volta Architecture Tensor Core Improved NVLink & HBM2 Volta MPS Improved SIMT Model Most Productive GPU 125 Programmable TFLOPS Deep Learning Efficient Bandwidth Inference Utilization New Algorithms 41

"DGX-1: WORLD S FIRST DEEP LEARNING SUPERCOMPUTER IN A BOX" HTTPS://WWW.YOUTUBE.COM/WATCH?V=FAZS4V2AOLI&T 42

43

DGX STACK Complete Analytics and Deep Learning platform Instant productivity plug-andplay, supports every AI framework and accelerated analytics software applications Performance optimized across the entire stack Always up-to-date via the cloud Mixed framework environments baremetal and containerized Direct access to NVIDIA experts 44

ANNOUNCING TESLA V100 GIANT LEAP FOR AI & HPC VOLTA WITH NEW TENSOR CORE 21B xtors TSMC 12nm FFN 815mm 2 5,120 CUDA cores 7.5 FP64 TFLOPS 15 FP32 TFLOPS NEW 120 Tensor TFLOPS 20MB SM RF 16MB Cache 16GB HBM2 @ 900 GB/s 300 GB/s NVLink 45

NEW TENSOR CORE New CUDA TensorOp instructions & data formats 4x4 matrix processing array D[FP32] = A[FP16] * B[FP16] + C[FP32] Optimized for deep learning Activation Inputs Weights Inputs Output Results 46

TENSOR CORE 4x4x4 matrix multiply and accumulate 47

Tesla P100 vs Tesla V100 Tesla P100 (Pascal) Tesla V100 (Volta) Memory 16 GB (HBM2) 16 GB (HMB2) Memory Bandwidth 720 GB/s 900 GB/s NVLINK 160 GB/s 300 GB/s CUDA Cores (FP32) 3584 5120 CUDA Cores (FP64) 1792 2560 Tensor Cores (TC) NA 640 Peak TFLOPS/s (FP32) 10.6 15 Peak TFLOPS/s (FP64) 5.3 7.5 Peak TFLOPS/s (TC) NA 120 Power 300 W 300 W 48

ANNOUNCING NVIDIA SATURNV WITH VOLTA ANNOUNCING NVIDIA SATURNV WITH VOLTA 40 PetaFLOPS Peak FP64 Performance 660 PetaFLOPS DL FP16 Performance 660 NVIDIA DGX-1 Server Nodes 49

NVIDIA GPU CLOUD SIMPLIFYING AI & HPC DEEP LEARNING HPC APPS HPC VIZ 50

THE VALUE OF GPU COMPUTING PLATFORM FOR AI 51

AI TO TRANSFORM EVERY INDUSTRY HEALTHCARE INFRASTRUCTURE IOT >80% Accuracy & Immediate Alert to Radiologists 50% Reduction in Emergency Road Repair Costs >$6M / Year Savings and Reduced Risk of Outage 52

NEURAL NETWORK COMPLEXITY IS EXPLODING Bigger and More Compute Intensive 350X Inception-v4 30X DeepSpeech 3 10X MoE GNMT AlexNet GoogleNet ResNet-50 Inception-v2 DeepSpeech DeepSpeech 2 OpenNMT 2011 2012 2013 2014 2015 2016 2017 2013 2014 2015 2016 2017 2018 2014 2015 2016 2017 2018 Image (GOP * Bandwidth) Speech (GOP * Bandwidth) Translation (GOP * Bandwidth) 53

TESLA V100 32GB WORLD S MOST ADVANCED DATA CENTER GPU NOW WITH 2X THE MEMORY 5,120 CUDA cores 640 NEW Tensor cores 7.8 FP64 TFLOPS 15.7 FP32 TFLOPS 125 Tensor TFLOPS 20MB SM RF 16MB Cache 32GB HBM2 @ 900GB/s 300GB/s NVLink 54

FASTER RESULTS ON COMPLEX DL AND HPC Up to 50% Faster Results With 2x The Memory FASTER RESULTS HIGHER ACCURACY HIGHER RESOLUTION 1.5X Faster Language Translation 1.5X Faster Calculations 40% Lower Error Rate 4X Higher resolution 1024x1024 res images Unsupervised Image Translation Input winter photo 0.8 step/sec 1.2 step/sec 2.5TF 3.8TF Accuracy (16 layers) Accuracy (152 layers) 512x512 res images Neural Machine Translation (NMT) 3D FFT 1k x 1k x 1k VGG-16 RN-152 GAN Image to ImageGen AI converts it to summer V100 16GB V100 32GB Dual E5-2698v4 server, 512GB DDR4, Ubuntu 16.04, CUDA9, cudnn7 NMT is GNMT-like and run with TensorFlow NGC Container 18.01 (Batch Size= 128 (for 16GB) and 256 (for 32GB) FFT is with cufftbench 1k x 1k x 1k and comparing 2 V100 16GB (DGX1V) vs. 2 V100 32GB (DGX1V) R-CNN for object detection at 1080P with Caffe V100 16GB uses VGG16 V100 32GB uses Resnet-152 GAN by NVRESEARCH (https://arxiv.org/pdf/1703.00848.pdf) V100 16GB and V100 32GB with FP32 55

TESLA V100: CHOOSING BETWEEN 32GB & 16GB Application Today s Products New Deployments Benefits DL Training P100, P40, V100 16GB V100 32GB Faster Result with Larger More Complex Models Memory-Constrained HPC (Seismic, Graphs, CFD, Physics, Climate, Signal Processing, Finance) K80, P100, V100 16GB V100 32GB Faster Results with Large Datasets Compute-Bound HPC (Life Science, Molecular Dynamics) K80, P100, V100 16GB V100 16GB Higher TCO 56

V100 WITH 32GB HBM2 Maintain Form Factor Compatibility Form Factor Performance 7.8TF DP, 15.7TF SP, 125TF FP16 7TF DP, 14TF SP, 112TF FP16 Memory Size 32GB HBM2 32GB HBM2 Memory Bandwidth 900GB/s 900GB/s GPU Peer to Peer NVLink PCIe Gen3 Power 300W 250W Available From All Major OEMs SXM2 32GB P/N = 900-2G503-0010-000, PCIE 32GB P/N = 900-2G500-0010-000 57

NVSWITCH WORLD S HIGHEST BANDWIDTH ON-NODE SWITCH 7.2 Terabits/sec or 900 GB/sec 18 NVLINK ports 50GB/s per port bi-directional Fully-connected crossbar 2 billion transistors 47.5mm x 47.5mm package 58

NVSWITCH ENABLES THE WORLD S LARGEST GPU 16 Tesla V100 32GB Connected by New NVSwitch 2 petaflops of DL Compute Unified 512GB HBM2 GPU Memory Space 300GB/sec Every GPU-to-GPU 2.4TB/sec of Total Cross-section Bandwidth 59

2X HIGHER PERFORMANCE WITH NVSWITCH 2X FASTER 2.4X FASTER 2X FASTER 2.7X FASTER Physics (MILC benchmark) 4D Grid Weather (ECMWF benchmark) All-to-all Recommender (Sparse Embedding) Reduce & Broadcast Language Model (Transformer with MoE) All-to-all 2x DGX-1 (Volta) DGX-2 with NVSwitch 2 DGX-1V servers have dual socket Xeon E5 2698v4 Processor. 8 x V100 GPUs. Servers connected via 4X 100Gb IB ports DGX-2 server has dual-socket Xeon Platinum 8168 Processor. 16 V100 GPUs 60

THE WORLD S FIRST 2 PETAFLOPS SYSTEM INTRODUCING NVIDIA DGX-2 THE WORLD S MOST POWERFUL AI SYSTEM FOR THE MOST COMPLEX AI CHALLENGES DGX-2 is the newest addition to the DGX family, powered by DGX software Deliver accelerated AI-at-scale deployment and simplified operations Step up to DGX-2 for unrestricted model parallelism and faster time-to-solution 61

10X PERFORMANCE GAIN LESS THAN A YEAR 15 days 15 DGX-1, SEP 17 DGX-2, Q3 18 10 5 1.5 days 0 DGX-1V DGX-2 PyTorch Stack: Time to Train FAIRSEQ software improvements across the stack including NCCL, cudnn, etc. 62

Container Orchestration for DL Training & Inference AWS-EC2 GCP Azure DGX NVIDIA CONTAINER RUNTIME KUBERNETES NVIDIA GPUs NVIDIA GPU CLOUD KUBERNETES on NVIDIA GPUs Scale-up Thousands of GPUs Instantly Self-healing Cluster Orchestration GPU Optimized Out-of-the-Box Powered by NVIDIA Container Runtime Included with Enterprise Support on DGX Available end of April 2018 63

AI ACCELERATES SCIENCE INFERENCE ON GPUS TODAY BING Object Detection 60X Improved Latency iflytek Speech Recognition 10X Concurrent Requests Per Server DARWIN AI NN Optimizations 1700X Faster Inference VALOSSA Video Intelligence 6X Faster Video Processing ALIBABA Neural Machine Translation 3X Requests Per Server KLM Social Media Engagement 10X increase in customer responses 64

NVIDIA TESLA PLATFORM SAVES MONEY Game-Changing Inference Performance SAME THROUGHPUT INFERENCE WORKLOAD: Image recognition using Resnet 50 1/20 THE SPACE 1/22 THE POWER INFERENCE WORKLOAD: Image recognition using Resnet 50 160 CPU Servers 45,000 images/sec 65 KWatts 1 HGX Server 45,000 images/sec 3 KWatts 65

NVIDIA AI INFERENCE 30M HYPERSCALE SERVERS 190X IMAGE ResNet-50 with TensorFlow Integration TensorRT 4 TensorFlow Integration Kaldi Optimization ONNX WinML 50X NLP GNMT 45X RECOMMENDER Neural Collaborative Filtering 36X SPEECH SYNTH WaveNet 60X SPEECH RECOG DeepSpeech 2 DNN 66

TensorRT INTEGRATED WITH TensorFlow Delivers 8x Faster Inference with TensorFlow + TRT 3,000 2,500 Images/sec @ 7ms Latency ResNet-50 on TensorFlow 2,657 2,000 1,500 1,000 500 0 11 * CPU (FP32) * Best CPU latency measured at 83 ms 325 V100 (TensorFlow, FP32) V100 Tensor Cores (TensorFlow+ TensorRT) Available in TensorFlow 1.7 https://github.com/tensorflow CPU: Skylake Gold 6140, 2.5GHz, Ubuntu 16.04; 18 CPU threads. Volta V100 SXM; CUDA (384.111; v9.0.176); Batch size: CPU=1, TF_GPU=2, TF-TRT=16 w/ latency=6ms 67

NVIDIA TensorRT 4 RC NOW AVAILABLE RNN and MLP Layers ONNX Import NVIDIA DRIVE Support Maximize RNN and MLP Throughput Optimize and Deploy ONNX Models Support for NVIDIA DRIVE Xavier 50X 40X 30X 20X 10X 0X Recommendation Engine 45X Speedup CPU TensorRT Speed up speech, audio and recommender app inference performance through new layers and optimizations Easily import and accelerate inference for ONNX frameworks (PyTorch, Caffe 2, CNTK, MxNet and Chainer) Deploy optimized deep learning inference models NVIDIA DRIVE Xavier Free download to members of NVIDIA Developer Program developer.nvidia.com/tensorrt 68

GPU COMPUTING STACK CONTINUES TO EVOLVE 18X In 5 Years 10X In 6 Months 8X With TensorRT 2650 i/s DGX-2 1.5 days DGX-1 (Volta) 15 days 325 i/s V100 (FP32) V100 Tensor Core (TensorRT) Beyond Moore s Law For HPC PyTorch Stack: Time to Train FAIRSEQ Accelerating Time To Solution For AI Training ResNet-50 Inference @7ms on TensorFlow Compelling Platform For AI Inference 69

NVIDIA SUPPORT PROGRAMS 70

developer.nvidia.com 71

NVIDIA DEEP LEARNING INSTITUTE Hands-on self-paced and instructor-led training in deep learning and accelerated computing for developers Request onsite instructor-led workshops at your organization: www.nvidia.com/requestdli Deep Learning Fundamentals Autonomous Vehicles Medical Image Analysis Take self-paced labs online: www.nvidia.com/dlilabs Download the course catalog, view upcoming workshops, and learn about the University Ambassador Program: www.nvidia.com/dli Genomics Finance Intelligent Video Analytics More industryspecific training coming soon Game Development & Digital Content Accelerated Computing Fundamentals 72

NVIDIA HW GRANT PROGRAM Titan X Pascal Quadro P5000 Jetson TX2 (Dev Kit) Scientific Computing HPC Deep Learning Scientific Visualization Virtual Reality Robotics Autonomous Machines 73

INCEPTION PROGRAM http://www.nvidia.com/object/inception-program.html 74

Pedro Mario Cruz e Silva (pcruzesilva@nvidia.com) Solution Architect Manager Enterprise Latin America Global Oil & Gas Team LinkedIn