Accelerated Platforms: The Future of Computing. Marc Hamilton, VP Solutions Architecture & Engineering, NVIDIA Korea AI Conference 2018

1 Accelerated Platforms: The Future of Computing Marc Hamilton, VP Solutions Architecture & Engineering, NVIDIA Korea AI Conference 2018

2 Forces Shaping Computing: GPU PERFORMANCE, CPU PERFORMANCE. Beyond Moore's Law.

3 Forces Shaping Computing: GPU PERFORMANCE, CPU PERFORMANCE. Beyond Moore's Law. 1000x Every 10 Years. Accelerated Computing. [Chart: performance on a log scale from 10^2 to 10^7, by year from 1980 to 2020.]

4 Forces Shaping Computing: GPU PERFORMANCE, CPU PERFORMANCE; DATA, DEEP NEURAL NETWORK, PROGRAM. Beyond Moore's Law. 1000x Every 10 Years. Accelerated Computing. The Deep Learning Revolution.

5 NVIDIA is: Accelerators for Demanding Applications; Platforms That Provide Complete Solutions. Graphics, AI, Healthcare, Autonomous Vehicles, Data Science, Robotics.

6 Accelerating Deep Learning Accelerating Graphics Accelerating Science Accelerating Data Science Scaling Accelerating Autonomous Vehicles Accelerating Robotics Conclusion

7 Deep Learning Everywhere. Internet & Cloud: Image Classification, Speech Recognition, Language Translation, Language Processing, Sentiment Analysis, Recommendations. Medicine & Biology: Cancer Cell Detection, Diabetic Grading, Drug Discovery. Media & Entertainment: Video Captioning, Video Search, Real Time Translation. Intelligent Video Analytics: Traffic Analysis, Retail Analytics, Access Control. Transportation: Pedestrian Detection, Lane Tracking, Traffic Sign Recognition.

8 AI IMPROVES SEMICON INSPECTION ACCURACY In the semiconductor industry, inaccurate or false positive fault inspections can lead to huge product losses. SK Hynix deployed an AI fault detection solution with NVIDIA Tesla GPUs, DGX Station, Jetson TX2, CUDA and TensorRT. With its new deep learning-based tool, SK Hynix achieved over 90% inspection accuracy.

9 Deep Learning Was Enabled by Hardware

10 Deep Learning is Gated by Hardware. Image network complexity (GOPS * Bandwidth): 350X; AlexNet, GoogLeNet, ResNet-50, Inception-v2, Inception-v4. Speech network complexity (GOPS * Bandwidth): 30X; DeepSpeech, DeepSpeech 2, DeepSpeech 3. Translation network complexity (GOPS * Bandwidth): 10X; OpenNMT, GNMT, MoE.

11 Tesla V100 Tensor Core GPU: 21B Transistors; TSMC 12nm FFN; 815 mm^2; 5,120 CUDA Cores; 7.5 FP64 TFLOPS; 15 FP32 TFLOPS; 125 Tensor TFLOPS; 20 MB SM RF; 16 MB Cache; 32 GB at 900 GB/s; 300 GB/s NVLink.

12 Tensor Core: Mixed Precision Matrix Math on 4x4 Matrices. D = AB + C, where A, B and C are 4x4 matrices with elements indexed (0,0) through (3,3). FP16 or FP32.
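The operation on this slide can be emulated in software to see what mixed precision means numerically. The following is a minimal Python sketch, not NVIDIA's implementation: it assumes FP16 inputs A and B with FP32 accumulation starting from C, and the helper names `to_fp16`, `to_fp32` and `tensor_core_mma` are hypothetical. Real hardware may differ in summation order and rounding details.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def tensor_core_mma(a, b, c, n=4):
    """Emulate D = A*B + C on n x n matrices: FP16 inputs, FP32 accumulation."""
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = to_fp32(c[i][j])  # the accumulator starts at C[i][j]
            for k in range(n):
                # Inputs are rounded to FP16; their product is exact in FP32.
                prod = to_fp16(a[i][k]) * to_fp16(b[k][j])
                acc = to_fp32(acc + prod)  # accumulate in FP32
            d[i][j] = acc
    return d

# Small demonstration with values that FP16 represents exactly.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
b = [[4 * i + j + 0.5 for j in range(4)] for i in range(4)]
c = [[1.0] * 4 for _ in range(4)]
d = tensor_core_mma(identity, b, c)
print(d[0])
```

Rounding only the inputs to FP16 while keeping the running sum in FP32 is what limits error growth across the dot product; `to_fp16(0.1)` shows how coarse the input rounding is on its own.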

13 Turing Accelerates Inference. Quadro RTX 8000: 4,608 CUDA Cores; 576 Tensor Cores; 48 GB GDDR6 Memory; … TFLOPS FP16; … TOPS INT8; 522 TOPS INT4; 672 GB/s DRAM BW; 250 GB/s (Bidirectional) NVLink Channels; 295 W. Tesla T4: 2,560 CUDA Cores; 320 Tensor Cores; 16 GB GDDR6 Memory; 65 TFLOPS FP16; 130 TOPS INT8; 260 TOPS INT4; 320 GB/s DRAM BW; 70 W. [Die diagram: TENSOR CORES, SHADER COMPUTE, RT CORES.]

14 Turing Tensor Core - Optimized for Inference: Multi-Precision for AI Inference. T4: 65 TFLOPS FP16, 130 TOPS INT8, 260 TOPS INT4. RTX 8000: 130 TFLOPS FP16, 260 TOPS INT8, 520 TOPS INT4.

15 World's Most Performant Inference Platform: Up to 36X Faster Than CPUs. Accelerates All AI Workloads. Peak Performance: 6X faster INT8 ops vs. P4. Speech Inference: 21X faster DeepSpeech 2. Video Inference: 27X faster ResNet-50 (7 ms latency limit). Natural Language Processing Inference: 36X faster GNMT. [Charts: TFLOPS / TOPS for P4 and T4; speedup vs. CPU server for Tesla P4 and Tesla T4.]

16 NVIDIA TensorRT 5: Optimizer and Runtime. Multi-Precision Acceleration of All Frameworks; Containerized Inference Serving Engine; Docker and Kubernetes Integration. Optimizations: Layer & Tensor Fusion, Precision Calibration, Kernel Auto-Tuning, Dynamic Tensor Memory. Platforms: TESLA P4/T4, TESLA V100, JETSON AGX, DRIVE AGX, NVIDIA DLA.

17 TensorRT Inference Server. [Stack diagram: DNN Models, NV DL SDK, NV Docker, TensorRT Inference Server, Kubernetes.]

18 Space and Power Reduction: Game-Changing Inference Performance. Speech, NLP and video inference workload: 200 CPU servers at 60 kW = 1 T4-accelerated server at 2 kW.

19 Accelerating Deep Learning Accelerating Graphics Accelerating Science Accelerating Data Science Scaling Accelerating Autonomous Vehicles Accelerating Robotics Conclusion

20 Turing Revolutionizes Graphics. Quadro RTX 8000: 4,608 CUDA Cores; 576 Tensor Cores; 72 RT Cores; 48 GB GDDR6 Memory; 130 TFLOPS FP16; 260 TOPS INT8; 520 TOPS INT4; 336 GB/s DRAM BW; 250 GB/s NVLINK Channels; 295W. Turing SM: 14 TFLOPS + 14 TIPS; Concurrent FP & INT Execution; Variable Rate Shading. RT Core: 10 Giga Rays/sec; Ray Triangle Intersection; BVH Traversal. Tensor Core: 114 TFLOPS FP16; … TOPS INT8; 455 TOPS INT4.

21 Deep Learning for Imaging. Colorizing (UC Berkeley); In-Painting (NVIDIA); Denoising (Disney Research, Pixar, UCSB); SuperRez (NVIDIA). Turing: 9X Peak FLOPS. [Chart: FP32, INT32 and TC throughput for PASCAL TITAN Xp vs. TURING 2080 Ti.]

22 Turing Ray Tracing Performance: >10 Giga Rays. GTX 1080 Ti: 11.3 TFLOPS, 1.1 Giga Rays (10 TFLOPS / Giga Ray). RTX 2080 Ti: 68 RT Cores, 10+ Giga Rays (~10X faster than 1080 Ti). [Chart: Giga Rays/s (Primary) for the Mustang, Dragon, Veyron-NG, Blade and Buddha scenes, plus GeoMean.]

24 Turing: A Giant Leap. Gaming Reinvented; World's First Ray Tracing GPU; Universal Deep Learning Accelerator.

25 Accelerating Deep Learning Accelerating Graphics Accelerating Science Accelerating Data Science Scaling Accelerating Autonomous Vehicles Accelerating Robotics Conclusion

26 Accelerating Science: 600 Accelerated Applications. HPC Application Support for GPU Computing, Top 15 HPC Applications (Intersect360 Research, Nov 2017): VASP, AMBER, NAMD, GROMACS, Gaussian, Simulia Abaqus, WRF, OpenFOAM, ANSYS, LS-DYNA, BLAST, LAMMPS, ANSYS Fluent, Quantum Espresso, GAMESS.

27 NVIDIA Powers World's Fastest Supercomputer: Summit Becomes First System to Scale the 100 PetaFLOPS Milestone. 122 PF HPC; 3 EF AI; 27,648 Volta V100 Tensor Core GPUs.

28 NVIDIA Powers Fastest Supercomputers in US, Europe, Japan, Industry. 17 of World's 20 Most Energy-Efficient Supercomputers. ORNL Summit: World's Fastest, 27,648 GPUs, 122 PF. LLNL Sierra: US 2nd Fastest, 17,280 GPUs, 72 PF. ABCI: Japan's Fastest, 4,352 GPUs, 20 PF. Piz Daint: Europe's Fastest, 5,320 GPUs, 20 PF. ENI HPC4: Fastest Industrial, 3,200 GPUs, 12 PF.

29 AI: A New Instrument for Science. HPC: Algorithms Based on First Principles Theory; Proven Models for Accurate Results. AI: Neural Networks That Learn Patterns From Large Data Sets; Improve Predictive Accuracy and Faster Response Time. Dramatically Improves Accuracy and Time-to-Solution: commercially viable fusion energy; understanding cosmological dark energy and matter; clinically viable precision medicine; improvement and validation of the Standard Model of Physics; climate/weather forecasts with ultra-high fidelity.

30 AI for Science: Transformative Tool to Accelerate the Pace of Scientific Innovation. 90% Accuracy: Fusion Sustainment (Clean Energy). 33% Faster: Track Neutrinos (Particle Physics). 5,000X Faster: Process LIGO Signal (Understanding Universe). 300,000X Faster: Predict Molecular Energetics (Drug Discovery). 70% Accuracy: Score Protein Ligand (Drug Discovery). 11% Higher Accuracy: Monitor Earth's Vital (Climate). Weeks to 10 milliseconds: Analyze Gravitational Lensing (Astrophysics). 14X Faster: Generate Bose-Einstein Condensate (Physics). Improves Accuracy, Enabling Realization of Full Scientific Potential. Accelerates Time to Solution, Unlocking Science in Exciting New Ways.

31 AI TURNS SATELLITE IMAGES INTO VALUABLE INSIGHT Satellite imagery has many uses, including disaster recovery, crop yield prediction, urban planning, and national defense. The Satrec Initiative and SI Analytics (SIA) apply GPU-powered AI to turn satellite images into valuable data for their customers. Using the NVIDIA DGX Station to improve speed and efficiency, Satrec and SIA now analyze 30K satellite images in 3 minutes, vs. 40 minutes with previous methods.

32 Tensor Core GPU Fuses HPC & AI Computing. HPC (Simulation): FP64, FP32. AI (Deep Learning): FP16, INT8. Volta Tensor Core GPU: Multi-Precision Computing; Fusion of HPC & AI.

33 Tensor Core GPU Delivering Breakthrough Performance: Multi-Precision Computing for Advancing Science. Unlocking the Power of Superconductivity: Materials APP QMCPack (FP64, FP32). Finding Genes-to-Disease Connection: Genomics APP CoMet (FP16). Volta Tensor Core GPU. [Charts: Titan Node (1x) vs. Summit Node, with speedups of 150x and 50x.]

34 AI IMPROVES QUALITY AND PRODUCTION YIELD Delivering products of impeccable quality is a great opportunity for manufacturers to differentiate, but it raises the bar for detecting the smallest product defects. LG Consulting and Solutions (LG CNS) is using AI to identify product defects for the entire LG Electronics and LG Display production lines. With NVIDIA Tesla P4, Jetson TX2, DGX-1 and TensorRT to speed training and inference, LG CNS achieved 1.5% higher yield, 65% quality improvement, and reduced inspection-induced employee fatigue.

35 Reduced Cost, Space, Power: 5X Better HPC TCO for Same Throughput. Mixed HPC workload (Amber, CHROMA, GTC, LAMMPS, MILC, NAMD, Quantum Espresso, SPECFEM3D): 160 self-hosted Skylake CPU servers at 96 kW = 8 accelerated servers with 4 V100 GPUs at 13 kW. 1/5 the Cost; 1/7 the Space; 1/7 the Power.

36 Accelerating Deep Learning Accelerating Graphics Accelerating Science Accelerating Data Science Scaling Accelerating Autonomous Vehicles Accelerating Robotics Conclusion

37 INTERNET DEEP LEARNING $36B RETAIL HEALTHCARE FINANCIAL SERVICES LOGISTICS TELECOM AD TECH The New HPC Market $9B SCIENTIFIC COMPUTING HADOOP NUMPY SKL PANDAS SCIENTIFIC COMPUTING MACHINE LEARNING

38 The De Facto Data Science Platform. PYTHON (1991, Guido van Rossum): interpreted language emphasizing readability. NUMPY (2006, Travis Oliphant): multi-dimensional arrays, math functions. PANDAS (2008, Wes McKinney): data manipulation and analysis. SKLEARN (2010, Inria): machine learning library.

39 The De Facto Data Science Platform. Stack: PYTHON, NUMPY, PANDAS, SKLEARN, with DASK added. DASK (Matthew Rocklin): parallel processing in Python data analytics; dynamic task scheduling; collection of parallel arrays, data frames, lists.

40 RAPIDS Accelerated Data Science. Stack: PYTHON, DASK, NUMPY, PANDAS, SKLEARN, with CUDF (PANDAS-like) and CUML (SKLEARN-like) on CUDA and ARROW. ARROW (2016, Wes McKinney): cross-language platform for in-memory data; columnar memory format; vectorized execution engine; zero-copy IPC; designed with GPU in mind.

41 RAPIDS Accelerated Data Science. [Stack diagrams: PYTHON, DASK, PANDAS, SKLEARN, NUMPY; RAPIDS with CUDF, CUML, CUGRAPH, DEEP LEARNING FRAMEWORKS, CUDNN, CUDA, ARROW.]

42 RAPIDS: Dramatic ML Acceleration. [Charts: ETL, ML and End-to-End time in SECONDS for 20, 50 and 100 CPU Nodes vs. DGX-2, with annotations of 2 Hours, 1 Hour and 3 Hours.]

44 DGX GB DGX GB Enterprise-Scale Data Science DGX STATION RTX GB 96 GB TESLA V GB

45 Accelerating Deep Learning Accelerating Graphics Accelerating Science Accelerating Data Science Scaling Accelerating Autonomous Vehicles Accelerating Robotics Conclusion

46 New NVIDIA DGX-2: The Largest GPU Ever Created. 2 PFLOPS; 512 GB HBM2; 16 TB/sec Memory Bandwidth; 10 kW; 160 kg.

47 The World's Largest GPU: 16 Tesla V100 32GB Connected by NVSwitch. On-Chip Memory Fabric Semantic Extended Across All GPUs. 512 GB HBM2 and 14.4 TB/sec Aggregate; 81,920 CUDA Cores; 2,000 TFLOPS Tensor Cores.
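The aggregate figures on this slide follow from multiplying the per-GPU Tesla V100 specifications (slide 11) by 16. A small Python check, offered as a sketch; the dictionary keys and the `aggregate` helper are illustrative names, not from the deck.

```python
# Per-GPU Tesla V100 figures as listed on slide 11 of this deck.
V100 = {
    "memory_gb": 32,
    "memory_bandwidth_gb_s": 900,
    "cuda_cores": 5120,
    "tensor_tflops": 125,
}
NUM_GPUS = 16  # DGX-2 connects 16 V100 GPUs through NVSwitch

def aggregate(per_gpu, count):
    """Scale every per-GPU figure by the number of GPUs."""
    return {name: value * count for name, value in per_gpu.items()}

totals = aggregate(V100, NUM_GPUS)
for name, value in totals.items():
    print(f"{name}: {value:,}")
```

The results (512 GB, 14,400 GB/s, 81,920 cores, 2,000 TFLOPS) match the slide's aggregate numbers; they are peak totals obtained by simple multiplication, not measured throughput.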

48 NVSwitch (Parameter: Spec). Bidirectional Bandwidth per NVLink: 51.5 GB/s; NRZ Lane Rate (x8 per NVLink): 25.… Gbps; Transistors: 2 Billion; Process: TSMC 12FFN; Die Size: 106 mm^2; Bidirectional Aggregate Bandwidth: 928 GB/s; NVLink Ports: 18; Mgmt Port (Config, Maintenance, Errors): PCIe; LD/ST BW Efficiency (128B pkts): 80.0%; Copy Engine BW Efficiency (256B pkts): 88.9%. [Block diagram: NVLINK PHYS, PORT LOGIC, XBAR, MANAGEMENT.]
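Two relationships in this table can be checked with simple arithmetic. The sketch below is in Python; the 32-byte per-packet overhead is an assumption chosen because it reproduces both efficiency figures, not a number given on the slide, and `packet_efficiency` is a hypothetical helper name.

```python
PORTS = 18
PER_LINK_GB_S = 51.5  # bidirectional bandwidth per NVLink, from the table

# 18 ports at the rounded per-link figure; the table quotes 928 GB/s.
aggregate_gb_s = PORTS * PER_LINK_GB_S
print(aggregate_gb_s)

def packet_efficiency(payload_bytes, overhead_bytes=32):
    """Fraction of link bandwidth that carries payload.

    overhead_bytes=32 is an ASSUMED fixed per-packet cost (not from the slide).
    """
    return payload_bytes / (payload_bytes + overhead_bytes)

print(f"{packet_efficiency(128):.1%}")  # load/store traffic, 128-byte packets
print(f"{packet_efficiency(256):.1%}")  # copy engine traffic, 256-byte packets
```

The port product comes to 927 GB/s, within rounding of the quoted 928 GB/s, and under the assumed overhead the two packet sizes give 80.0% and 88.9%.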

49 Traditional Machine Learning Cluster: 300 Servers; $3M; 180 kW.

50 GPU-Accelerated Machine Learning Cluster: DGX-2 and RAPIDS for Predictive Analytics. 1 DGX-2; 10 kW; 1/8 the Cost; 1/15 the Space; 1/18 the Power.
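The power claims in the deck's CPU-versus-GPU comparisons can be reproduced from the quoted figures. A short Python sketch; the data structure and labels are illustrative, and the kilowatt values are copied from slides 18, 35 and 49-50.

```python
# (CPU cluster kW, GPU-accelerated kW) as quoted on the slides.
comparisons = {
    "inference, slide 18": (60, 2),
    "mixed HPC, slide 35": (96, 13),
    "machine learning, slides 49-50": (180, 10),
}

def power_ratio(cpu_kw, gpu_kw):
    """How many times more power the CPU configuration draws."""
    return cpu_kw / gpu_kw

ratios = {name: power_ratio(*kw) for name, kw in comparisons.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x less power")
```

The HPC and machine learning ratios (about 7.4 and 18) line up with the "1/7 the Power" and "1/18 the Power" claims; the inference comparison works out to a factor of 30.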

51 NVIDIA Accelerated HPC Platform. NVIDIA HPC Acceleration Stacks: SCIENCE (CUDA); DL TRAINING (cuDNN); DL INFERENCE (New TensorRT Hyperscale Inference Platform); MACHINE LEARNING (New RAPIDS). Dense HPC; Hyperscale HPC.

52 Accelerating Deep Learning Accelerating Graphics Accelerating Science Accelerating Data Science Scaling Accelerating Autonomous Vehicles Accelerating Robotics Conclusion

53 NVIDIA DRIVE: Software-Defined Car. Powerful and Efficient for AI, CV, AR, HPC. Rich Software Development Platform. 370+ Partners Developing on DRIVE. DRIVE AGX Xavier; DRIVE AGX Pegasus. [Diagram labels: Trunk Opening, Eye Gaze, Detect, RADAR, Distracted Driver, Drowsy Driver, Track, Cyclist Alert, CG, Lidar Localization, LIDAR, Path Perception, Camera Localization, Path Planning, Surround Perception, Lanes, Signs, Lights, Egomotion.]

54 NVIDIA DRIVE: TRAINING, SIMULATING, DRIVING. Cars, Pedestrians, Lanes, Path, Signs, Lights.

56 Xavier: World's First Autonomous Machine Processor. Most Complex SOC Ever Made: 9 Billion Transistors, 350 mm^2, 12nFFN; ~8,000 Engineering Years. Volta Tensor Core GPU: FP32 / FP16 / INT8 Multi-Precision; 512 CUDA Tensor Cores; 2.8 CUDA TFLOPS (FP16); 22.6 Tensor Core DL TOPS. Carmel ARM64 CPU: 8 Cores; 10-wide Superscalar; 21 SpecInt2K6. DLA: 5.7 TFLOPS FP16; 11.4 TOPS INT8. Multimedia Engines: 1.2 GPIX/s Encode; 1.8 GPIX/s Decode. 4 GPIX/s Video Image Compositor. Vision Accelerator 1.7 TOPS. Stereo & Optical Flow Engine 2x 3.1 TOPS. ISP: 2.4 GPIX/s; Native Full-Range HDR; Tile-Based Processing. Industry Standard High-Speed IO: PCIe Gen4 Root and Endpoint; USB 3.1 Gen2 Host and Device; UFS 2.1 Embedded Storage. 16 Lane CSI, 109 Gbps, CPHY 1.1. 1Gb Ethernet. 256-Bit LPDDR4X, 137 GB/s.

57 NVIDIA DRIVE: World's First Autonomous Vehicle Platform. DRIVE IX: Available Now. DRIVE AGX Xavier Developer Kit: Available Now. DRIVE AV: Available Now.

58 Accelerating Deep Learning Accelerating Graphics Accelerating Science Accelerating Data Science Scaling Accelerating Autonomous Vehicles Accelerating Robotics Conclusion

59 NVIDIA Isaac: SENSOR PROCESSING; MAPPING & LOCALIZATION; PERCEPTION; PATH & TASK PLANNING; SITUATION UNDERSTANDING; DIVERSITY & REDUNDANCY.

60 Efficient Learning of Robust Policies Randomize Physical Parameters to Match Real World Rollouts [Chebotar-Handa-Makoviychuk-Macklin-Ratliff-Fox: 18]
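The idea of randomizing physical parameters so that simulated rollouts match real ones can be illustrated with a toy example. The Python sketch below is not the cited authors' algorithm: it swaps in a simple cross-entropy-style search over a single hypothetical friction parameter, and the simulator, function names and all numeric settings are assumptions made for illustration.

```python
import random

def simulate(friction, steps=20, v0=5.0, dt=0.1):
    """Toy simulator: a sliding block with velocity-proportional friction."""
    x, v, trajectory = 0.0, v0, []
    for _ in range(steps):
        v -= friction * v * dt
        x += v * dt
        trajectory.append(x)
    return trajectory

def discrepancy(traj_a, traj_b):
    """Sum of squared position differences between two rollouts."""
    return sum((p - q) ** 2 for p, q in zip(traj_a, traj_b))

def fit_sim_params(real_traj, mean=1.0, std=1.0, iters=15,
                   samples=50, elite=10, seed=0):
    """Adapt a Gaussian over friction so simulated rollouts match real ones."""
    rng = random.Random(seed)
    for _ in range(iters):
        # Randomize the physical parameter by sampling from the distribution.
        candidates = [max(0.0, rng.gauss(mean, std)) for _ in range(samples)]
        # Keep the samples whose rollouts are closest to the real rollout.
        candidates.sort(key=lambda f: discrepancy(simulate(f), real_traj))
        best = candidates[:elite]
        # Refit the sampling distribution to those elite samples.
        mean = sum(best) / elite
        variance = sum((f - mean) ** 2 for f in best) / elite
        std = max(1e-3, variance ** 0.5)
    return mean, std

# Stand-in for rollouts recorded on a real robot (true friction 0.7).
real = simulate(0.7)
mean, std = fit_sim_params(real)
print(f"estimated friction: {mean:.3f} (std {std:.3f})")
```

A policy would then be trained in simulation with parameters drawn from the fitted distribution rather than from the initial broad guess; that training step is outside the scope of this sketch.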

61 BETTER HUMAN-ROBOT INTERACTION, BETTER SERVICE ROBOTS From assisting the elderly to helping with everyday chores, service robots hold the promise of making lives easier. But complex, crowded environments could mean collisions between robots, humans and objects. Naver Labs designs autonomous robots with better human interaction. With its obstacle avoidance, AROUND G, built on NVIDIA Jetson Xavier for the real-time preprocessing and neural network inference critical to navigating real-world environments, Naver Labs aims to popularize service robots in the near future.

62 New NVIDIA AGX: Embedded AI HPC. High-Speed SerDes 109 Gbps; … Gbps I/O; Up to 320 TOPS Tensor Ops; Up to 25 TFLOPS FP32; Up to 16 Giga Rays; Starting from 15W.

63 NVIDIA Jetson AGX: World's First Edge AI Computer. Isaac Gems; Isaac Sim. Jetson AGX Xavier Developer Kit: Available Now.

64 AI DELIVERS BUSINESS VALUE. Realizing the harmonization of AI and robot technology. Enhancing embedded deep learning with NVIDIA Jetson.

65 New NVIDIA Platforms. NVIDIA HPC Acceleration Stacks: SCIENCE (CUDA); DL TRAINING (cuDNN); DL INFERENCE (New TensorRT Hyperscale Inference Platform); MACHINE LEARNING (New RAPIDS). RAPIDS stack: PYTHON, DASK, CUDF, CUML, CUGRAPH, DEEP LEARNING FRAMEWORKS, CUDNN, CUDA, ARROW. GPUs; Ecosystem.
