POWERING THE AI REVOLUTION JENSEN HUANG, FOUNDER & CEO GTC 2017

Size: px

Start display at page:

Download "POWERING THE AI REVOLUTION JENSEN HUANG, FOUNDER & CEO GTC 2017"

Eleanore Lee
6 years ago
Views:

1 POWERING THE AI REVOLUTION JENSEN HUANG, FOUNDER & CEO GTC 2017

LIFE AFTER MOORE S LAW 10 7 40 Years of Microprocessor Trend Data 10 6 10 5 Transistors

5X per year 10 2 Single-threaded perf 1980 1990 2000 2010 2020 Original data up to the year

2 LIFE AFTER MOORE S LAW Years of Microprocessor Trend Data Transistors (thousands) 1.1X per year X per year 10 2 Single-threaded perf Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten New plot and data collected for by K. Rupp

5X per year ARCHITECTURE Single-threaded perf 1980 1990 2000 2010 2020 Original data up to the year

3 RISE OF GPU COMPUTING APPLICATIONS ALGORITHMS SYSTEMS 10 7 GPU-Computing perf 1.5X per year X per year 1000X by 2025 CUDA X per year ARCHITECTURE Single-threaded perf Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten New plot and data collected for by K. Rupp

4 RISE OF GPU COMPUTING 7, ,000 1M GTC Attendees 3X in 5 Years GPU Developers 11X in 5 Years CUDA Downloads in 2016

5 ANNOUNCING PROJECT HOLODECK Photorealistic models Interactive physics Collaboration

6 ANNOUNCING PROJECT HOLODECK Photorealistic models Interactive physics Collaboration Early access in September

7 ERA OF MACHINE LEARNING The Master Algorithm Pedro Domingos A Quest for Intelligence Fei-Fei Li

8 BIG BANG OF MODERN AI Google Photo Arterys FDA Approved Reinforcement Learning AlphaGo Super Resolution Deep Voice Transfer Learning ImageNet Stanford & NVIDIA Large-scale DNN on GPU NVIDIA BB8 Captioning BRETT Auto Encoders Style Transfer NMT IDSIA CNN on GPU U Toronto AlexNet on GPU Baidu DuLight LSTM Superhuman ASR GAN

9 DEEP LEARNING FOR RAY TRACING

10 IRAY WITH DEEP LEARNING

11 BIG BANG OF MODERN AI 13,000 20,000 $5B NIPS, ICML, CVPR, ICLR Attendees 2X in 2 Years Udacity Students in AI Programs 100X in 2 Years AI Startups Investments 9X in 4 Years

12 POWERING THE AI REVOLUTION FRAMEWORKS GPU SYSTEMS GPU AAS HGX-1 NVAIL INTERNET SERVICES HEALTHCARE DGX-1 NVIDIA DEEP LEARNING SDK NVIDIA RESEARCH INCEPTION ENTERPRISE TESLA

13 NVIDIA INCEPTION 1,300 DEEP LEARNING STARTUPS HEALTHCARE RETAIL, ETAIL FINANCIAL SECURITY, IVA PLATFORMs & APIs DATA MANAGEMENT IOT & MANUFACTURING AUTONOMOUS MACHINES CYBER AEC DEVELOPMENT PLATFORMS BUSINESS INTELLIGENCE & VISUALIZATION

14 SAP AI FOR THE ENTERPRISE First commercial AI offerings from SAP Brand Impact, Service Ticketing, Invoice-to-Record applications Powered by NVIDIA GPUs on DGX-1 and AWS

15 MODEL COMPLEXITY IS EXPLODING 105 ExaFLOPS 8.7 Billion Parameters 7 ExaFLOPS 60 Million Parameters 20 ExaFLOPS 300 Million Parameters 2015 Microsoft ResNet 2016 Baidu Deep Speech Google NMT

16 ANNOUNCING TESLA V100 GIANT LEAP FOR AI & HPC VOLTA WITH NEW TENSOR CORE 21B xtors TSMC 12nm FFN 815mm 2 5,120 CUDA cores 7.5 FP64 TFLOPS 15 FP32 TFLOPS NEW 120 Tensor TFLOPS 20MB SM RF 16MB Cache 16GB 900 GB/s 300 GB/s NVLink

17 NEW TENSOR CORE New CUDA TensorOp instructions & data formats 4x4 matrix processing array D[FP32] = A[FP16] * B[FP16] + C[FP32] Optimized for deep learning Activation Inputs Weights Inputs Output Results

18 ANNOUNCING TESLA V100 GIANT LEAP FOR AI & HPC VOLTA WITH NEW TENSOR CORE Compared to Pascal 1.5X General-purpose FLOPS for HPC 12X Tensor FLOPS for DL training 6X Tensor FLOPS for DL inferencing

20 GALAXY

21 DEEP LEARNING FOR STYLE TRANSFER

22 ANNOUNCING NEW FRAMEWORK RELEASES FOR VOLTA CNN Training (ResNet-50) Multi-Node Training with NCCL 2.0 (ResNet-50) LSTM Training (Neural Machine Translation) 8x K80 8x P100 K80 8x P100 8x V100 P100 8x V100 64x V100 V Hours Hours Hours

23 MATT WOOD General Manager Deep Learning and AI, AWS

24 ANNOUNCING NVIDIA DGX-1 WITH TESLA V100 ESSENTIAL INSTRUMENT OF AI RESEARCH 960 Tensor TFLOPS 8x Tesla V100 NVLink Hybrid Cube From 8 days on TITAN X to 8 hours 400 servers in a box

25 ANNOUNCING NVIDIA DGX-1 WITH TESLA V100 ESSENTIAL INSTRUMENT OF AI RESEARCH 960 Tensor TFLOPS 8x Tesla V100 NVLink Hybrid Cube From 8 days on TITAN X to 8 hours 400 servers in a box

26 ANNOUNCING NVIDIA DGX-1 WITH TESLA V100 ESSENTIAL INSTRUMENT OF AI RESEARCH 960 Tensor TFLOPS 8x Tesla V100 NVLink Hybrid Cube From 8 days on TITAN X to 8 hours 400 servers in a box $149,000 Order today: nvidia.com/dgx-1

27 ANNOUNCING NVIDIA DGX STATION PERSONAL DGX 480 Tensor TFLOPS 4x Tesla V100 16GB NVLink Fully Connected 3x DisplayPort 1500W Water Cooled

28 ANNOUNCING NVIDIA DGX STATION PERSONAL DGX 480 Tensor TFLOPS 4x Tesla V100 16GB NVLink Fully Connected 3x DisplayPort 1500W Water Cooled $69,000 Order today: nvidia.com/dgx-station

29 ANNOUNCING HGX-1 WITH TESLA V100 VERSATILE GPU CLOUD COMPUTING 8x Tesla V100 with NVLINK Hybrid Cube 2C:8G 2C:4G 1C:2G Configurable NVIDIA Deep Learning, GRID graphics, CUDA HPC stacks

30 JASON ZANDER Corporate Vice President Microsoft Azure, Microsoft

31 ANNOUNCING TENSORRT next layer Relu Relu Relu Relu FOR TENSORFLOW COMPILER FOR DEEP LEARNING INFERENCING bias 1x1 conv bias 3x3 conv bias 5x5 conv bias 1x1 conv Graph optimizations for vertical and horizontal layer fusion GPU-specific optimizations Import models from Caffe and TensorFlow Relu bias 1x1 conv Relu bias 1x1 conv max pool previous layer

ANNOUNCING TENSORRT FOR TENSORFLOW COMPILER FOR DEEP LEARNING INFERENCING next layer Relu Relu Relu Relu bias bias bias bias 1x1 conv 3x3 conv 5x5 conv 1x1 conv next layer 1x1 CBR 3x3 CBR 5x5 CBR 1x1

32 ANNOUNCING TENSORRT FOR TENSORFLOW COMPILER FOR DEEP LEARNING INFERENCING next layer Relu Relu Relu Relu bias bias bias bias 1x1 conv 3x3 conv 5x5 conv 1x1 conv next layer 1x1 CBR 3x3 CBR 5x5 CBR 1x1 CBR Graph optimizations for vertical and horizontal layer fusion GPU-specific optimizations Import models from Caffe and TensorFlow Relu bias 1x1 conv Relu bias 1x1 conv max pool 1x1 CBR 1x1 CBR max pool previous layer previous layer

33 ANNOUNCING TENSORRT FOR TENSORFLOW COMPILER FOR DEEP LEARNING INFERENCING concat 1x1 CBR 3x3 CBR 5x5 CBR 1x1 CBR next layer 3x3 CBR 5x5 CBR 1x1 CBR Graph optimizations for vertical and horizontal layer fusion GPU-specific optimizations 1x1 CBR 1x1 CBR max pool 1x1 CBR max pool Import models from Caffe and TensorFlow previous layer previous layer

34 images/sec ANNOUNCING TENSORRT FOR TENSORFLOW COMPILER FOR DEEP LEARNING INFERENCING next layer 3x3 CBR 5x5 CBR 1x1 CBR Inference Throughput and Latency (ResNet-50) 20ms 19ms 300 Graph optimizations for vertical and horizontal layer fusion GPU-specific optimizations 1x1 CBR max pool ms Import models from Caffe and TensorFlow previous layer 0 Broadwell K80 Skylake (projected) P100

images/sec ANNOUNCING TENSORRT FOR TENSORFLOW COMPILER FOR DEEP LEARNING INFERENCING Graph optimizations for vertical and horizontal layer fusion GPU-specific optimizations next layer 3x3 CBR 5x5 CBR

35 images/sec ANNOUNCING TENSORRT FOR TENSORFLOW COMPILER FOR DEEP LEARNING INFERENCING Graph optimizations for vertical and horizontal layer fusion GPU-specific optimizations next layer 3x3 CBR 5x5 CBR 1x1 CBR Import models from Caffe and TensorFlow 0 1x1 CBR previous layer max pool Inference Throughput and Latency (ResNet-50) 20ms 19ms 10ms Broadwell K80 Skylake (projected) P100 7ms V100

36 ANNOUNCING TESLA V100 FOR HYPERSCALE INFERENCE 15-25X INFERENCE SPEED-UP VS SKYLAKE 150W FHHL PCIE

37 THE CASE FOR GPU ACCELERATED DATACENTERS 300K inf/s inf/s > 1K CPUs 1K CPUs > = $1.5M = 250KW Tesla V100 Reduce 15X 500 Nodes CPU Servers 33 Nodes GPU Accelerated Server

38 NVIDIA DEEP LEARNING STACK DEEP LEARNING FRAMEWORKS DEEP LEARNING LIBRARIES NVIDIA cudnn, NCCL, cublas, TensorRT CUDA DRIVER OPERATING SYSTEM GPU SYSTEM

39 ANNOUNCING NVIDIA GPU CLOUD GPU-ACCELERATED CLOUD PLATFORM OPTIMIZED FOR DEEP LEARNING Containerized in NVDocker Optimization across the full stack Always up-to-date Fully tested and maintained by NVIDIA NVIDIA GPU CLOUD Registry of Containers, Datasets, and Pre-trained models CSPs

ANNOUNCING NVIDIA GPU CLOUD GPU-ACCELERATED CLOUD PLATFORM OPTIMIZED FOR DEEP LEARNING Containerized in NVDocker Optimization across the full stack

40 ANNOUNCING NVIDIA GPU CLOUD GPU-ACCELERATED CLOUD PLATFORM OPTIMIZED FOR DEEP LEARNING Containerized in NVDocker Optimization across the full stack Always up-to-date Fully tested and maintained by NVIDIA Beta in July NVIDIA GPU CLOUD Registry of Containers, Datasets, and Pre-trained models CSPs

41 AMBER Performance (ns/day) Googlenet Performance (i/s) AMBER 16 CUDA cudnn 7 CUDA 9 NCCL 2 POWER OF GPU COMPUTING 16 AMBER 14 CUDA 5 AMBER 14 CUDA cudnn 6 CUDA 8 NCCL AMBER 12 CUDA cudnn 2 CUDA 6 cudnn 4 CUDA K20 K40 K80 P100 8x K80 8x Maxwell DGX-1 DGX-1V

42 POWERING THE AI REVOLUTION AI at the Edge NVAIL INTERNET SERVICES HEALTHCARE AUTOMOTIVE DRIVE PX NVIDIA DEEP LEARNING SDK NVIDIA RESEARCH INCEPTION ENTERPRISE PLACES ROBOTS JETSON TX

43 AI REVOLUTIONIZING TRANSPORTATION 280B Miles per Year 800M Parking Spots for 250M Cars in U.S. Domino s: 1M Pizzas Delivered per Day

44 NVIDIA DRIVE AI CAR PLATFORM 100 TOPS DRIVE PX Xavier Level 4/5 Localization Perception AI Path Planning 10 TOPS DRIVE PX 2 Parker Level 2/3 Computer Vision Libraries CUDA, cudnn, TensorRT 1 TOPS OS

45 NVIDIA DRIVE Mapping-to-Driving Co-Pilot Guardian Angel

46 ANNOUNCING TOYOTA SELECTS NVIDIA DRIVE PX FOR AUTONOMOUS VEHICLES

47 AI PROCESSOR FOR AUTONOMOUS MACHINES CPU FPGA General Purpose Architectures CUDA GPU Volta Pascal Domain Specific Accelerators DLA XAVIER 30 TOPS DL 30W Custom ARM64 CPU 512 Core Volta GPU 10 TOPS DL Accelerator Energy Efficiency

48 AI PROCESSOR FOR AUTONOMOUS MACHINES CPU General Purpose Architectures + CUDA GPU Volta Domain Specific Accelerators DLA XAVIER 30 TOPS DL 30W Custom ARM64 CPU 512 Core Volta GPU 10 TOPS DL Accelerator Energy Efficiency

Weights Sparse Weight Decompression Native Winograd Input Transform MAC Array 2048 Int8 or 1024 Int16 or

49 Command Interface ANNOUNCING Tensor Execution Micro-controller XAVIER DLA NOW OPEN SOURCE Early Access July General Release September Input DMA (Activations and Weights) Unified 512KB Input Buffer Activations and Weights Sparse Weight Decompression Native Winograd Input Transform MAC Array 2048 Int8 or 1024 Int16 or 1024 FP16 Output Accumulators Output Postprocessor (Activation Function, Pooling etc.) Output DMA Memory Interface

50 ROBOTS SEE, HEAR, TOUCH LEARN & PLAN ACT

51 ROBOTS Credit: Yevgen Chebotar, Karol Hausman, Marvin Zhang, Sergey Levine

52 ANNOUNCING ISAAC ROBOT SIMULATOR Robot & Environment Definition ISAAC Robot Simulator OpenAI GYM NVIDIA GPU Computer Virtual Jetson

53 ANNOUNCING ISAAC ROBOT SIMULATOR SEE, HEAR, TOUCH Robot & Environment Definition ISAAC Robot Simulator OpenAI GYM LEARN & PLAN ACT NVIDIA GPU Computer Virtual Jetson

54 NVIDIA POWERING THE AI REVOLUTION NVIDIA GPU CLOUD CSPs NVIDIA GPU in Every Cloud NVIDIA GPU Cloud Tensor Core TensorRT Xavier DLA Open Source Tesla V100 DGX-1 and DGX Station ISAAC

A NEW COMPUTING ERA JENSEN HUANG, FOUNDER & CEO GTC CHINA 2017

A NEW COMPUTING ERA JENSEN HUANG, FOUNDER & CEO GTC CHINA 2017 TWO FORCES DRIVING THE FUTURE OF COMPUTING 10 7 Transistors (thousands) 10 6 10 5 1.1X per year 10 4 10 3 10 2 1.5X per year Single-threaded