The Tesla Accelerated Computing Platform


1 The Tesla Accelerated Computing Platform
Axel Koehler, Principal Solution Architect
HPC Advisory Council Meeting, Lugano, 22 March 2016

2 Agenda
Introduction
Tesla Platform for HPC
Tesla Platform for Hyperscale
Tesla Platform for Machine Learning
Tesla System Software and Tools: Data Center GPU Manager, Docker

3 GAMING PRO ENTERPRISE VISUALIZATION DATA CENTER AUTO


4 TESLA PLATFORM PRODUCT STACK
Segments: HPC; Enterprise Virtualization; DL Training; Hyperscale Web Services
Software: Accelerated Computing Toolkit; GRID 2.0; Hyperscale Suite
System Tools & Services: Enterprise Services; Data Center GPU Manager; Mesos; Docker
Accelerators: Tesla K80; Tesla M60, M6; Tesla M40; Tesla M4

5 TESLA PLATFORM FOR HPC

6 HETEROGENEOUS COMPUTING MODEL: Complementary Processors Work Together
CPU: optimized for serial tasks
GPU accelerator: optimized for parallel tasks

7 COMMON PROGRAMMING MODELS ACROSS MULTIPLE CPUS
Libraries: AmgX, cuBLAS
Compiler Directives
Programming Languages
x86

8 370 GPU-Accelerated Applications


9 TESLA K80: World's Fastest Accelerator for HPC & Data Analytics
5x faster AMBER performance: Tesla K80 server vs. dual CPU server
Simulation time from 1 month to 1 week
[Chart: number of days, dual CPU server vs. Tesla K80 server]
CUDA Cores: 2496
Peak DP: 1.9 TFLOPS
Peak DP w/ Boost: 2.9 TFLOPS
GDDR5 Memory: 24 GB
Bandwidth: 480 GB/s
Power: 300 W
GPU Boost: Dynamic
AMBER Benchmark: PME-JAC-NVE simulation for 1 microsecond. CPU: 2.3GHz, 64GB System Memory, CentOS 6.2

10 VISUALIZE DATA INSTANTLY FOR FASTER SCIENCE
Traditional (CPU supercomputer plus viz cluster): slower time to discovery. Simulation: 1 week; data transfer: days; viz: 1 day. Over multiple iterations, time to discovery = months.
Interactive (GPU-accelerated supercomputer on the Tesla Platform): faster time to discovery. Visualize while you simulate, without data transfers; restart the simulation instantly. Over multiple iterations, time to discovery = weeks.
Scalable, flexible

11 EGL CONTEXT MANAGEMENT: Leaving it to the driver
Top systems support OpenGL under X
EGL: driver-based context management
Support for full OpenGL*, not only GL ES
Available in e.g. VTK
New opportunities for CUDA/OpenGL** interop
Diagram labels: ParaView/VMD; X-server; Tesla driver with EGL; Tesla GPU
*Full OpenGL in r355.11; **CUDA interop in r


12 SCALABLE RENDERING AND COMPOSITING: NVIDIA INDEX
Large-scale (volume) data visualization
Interactive visualization of TB of data
Stand-alone or coupled into a simulation
HW-accelerated remote rendering
Plugin for ParaView available
Dataset from NCSA Blue Waters


13 NVLINK: A HIGH-SPEED GPU INTERCONNECT
GPU to GPU via NVLink
GPU to CPU via NVLink
Diagram labels: CPU (x86); CPU (NVLink enabled); PCIe switch; Pascal GPUs; links: NVLink, PCIe
HBM: 1 Tbyte/s, 16-32 GB
DDR memory: 10s-100s GB
Whitepaper:


14 U.S. TO BUILD TWO FLAGSHIP SUPERCOMPUTERS: Powered by the Tesla Platform
PFLOPS Peak
10x in Scientific App Performance
IBM POWER9 CPU + NVIDIA Volta GPU
NVLink High Speed Interconnect
40 TFLOPS per Node, >3,400 Nodes
2017
Major Step Forward on the Path to Exascale
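The per-node and node-count figures on this slide imply a floor on aggregate peak performance. The following is only a back-of-the-envelope sketch: it multiplies the two quoted numbers, treats ">3,400" as exactly 3,400, and ignores everything else about the machine. The function name is illustrative and not from the presentation.

```python
# Figures quoted on the slide; the node count is a lower bound (">3,400").
TFLOPS_PER_NODE = 40
MIN_NODES = 3400


def aggregate_pflops(tflops_per_node: float, nodes: int) -> float:
    """Aggregate peak in PFLOPS, using 1 PFLOPS = 1000 TFLOPS."""
    return tflops_per_node * nodes / 1000


print(f"at least {aggregate_pflops(TFLOPS_PER_NODE, MIN_NODES):.0f} PFLOPS peak")
```

Because the node count is a lower bound, the result is a lower bound as well.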

15 TESLA PLATFORM FOR HYPERSCALE


16 EXABYTES OF CONTENT PRODUCED DAILY: User-Generated Content Dominates Web Services
10M users; 40 years of video/day
1.7M broadcasters; users watch 1.5 hours/day
6B queries/day; 10% use speech
270M items sold/day; 43% on mobile devices
8B video views/day; 400% growth in 6 months
300 hours of video/minute; 50% on mobile devices
Challenge: Harnessing the Data Tsunami in Real-time


17 TESLA FOR HYPERSCALE
HYPERSCALE SUITE: GPU REST Engine; GPU Accelerated FFmpeg; Image Compute Engine
TESLA M40: POWERFUL, Fastest Deep Learning Performance
TESLA M4: LOW POWER, Highest Hyperscale Throughput
Example workloads: 10M users, 40 years of video/day; 270M items sold/day, 43% on mobile devices


18 GPU REST Engine (GRE) SDK: Accelerated Microservices for Web and Mobile Applications
Supercomputer performance for hyperscale datacenters: powerful nodes with low response time (~10ms)
Diagram: HTTP (~10ms) to the GPU REST Engine
Easy to develop new microservices: image scaling, image classification, speech recognition
Open source, integrates with existing infrastructure
Easy to deploy & scale: ready-to-run Docker file
developer.nvidia.com/gre
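The request path on this slide (an HTTP call in, a short-latency answer out) can be illustrated with a toy microservice. This is not GRE and does not use its API: it is a minimal sketch built on the Python standard library, with a stub where a GPU-backed model would sit. The route `/api/classify`, the `classify` stub and the helper names are all hypothetical.

```python
import json
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer


def classify(image_bytes: bytes) -> dict:
    """Hypothetical stand-in for GPU inference: only reports the payload size."""
    return {"label": "unknown", "bytes_received": len(image_bytes)}


class ClassifyHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/api/classify":  # hypothetical route
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        payload = json.dumps(classify(body)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def serve_in_background(port: int = 0) -> HTTPServer:
    """Start the service on localhost; port 0 lets the OS pick a free port."""
    server = HTTPServer(("127.0.0.1", port), ClassifyHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def post_image(server: HTTPServer, data: bytes) -> dict:
    """Send one POST request to the running service and decode the JSON reply."""
    host, port = server.server_address[:2]
    conn = HTTPConnection(host, port, timeout=5)
    try:
        conn.request("POST", "/api/classify", body=data)
        return json.loads(conn.getresponse().read())
    finally:
        conn.close()
```

In a real deployment the stub would hand the request body to a model already resident on the GPU, which is what keeps per-request latency low.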


19 TESLA M4: Highest Throughput Hyperscale Workload Acceleration
Video Processing: stabilization and enhancements
Image Processing: resize, filter, search, auto-enhance
Video Transcode: H.264 & H.265, SD & HD
Machine Learning Inference
CUDA Cores: 1024
Peak SP: 2.2 TFLOPS
GDDR5 Memory: 4 GB
Bandwidth: 88 GB/s
Form Factor: PCIe Low Profile
Power: W


20 JETSON TX1: Embedded Deep Learning
Unmatched performance under 10W
Advanced tech for autonomous machines
Smaller than a credit card
GPU: 1 TFLOP/s 256-core Maxwell
CPU: 64-bit ARM A57 CPUs
Memory: 4 GB LPDDR GB/s
Storage: 16 GB eMMC
Wifi/BT: x2 ac/BT Ready
Networking: 1 Gigabit Ethernet
Size: 50mm x 87mm
Interface: 400 pin board-to-board connector

21 HYPERSCALE DATACENTER NOW ACCELERATED: Tesla Platform
Servers for training: scale with data (exabytes of content per day) and produce the trained model
Servers for inference and web services: scale with users (billions of devices); the model is deployed on every server

22 TESLA PLATFORM FOR MACHINE LEARNING

23 DEEP LEARNING EVERYWHERE
INTERNET & CLOUD: Image Classification, Speech Recognition, Language Translation, Language Processing, Sentiment Analysis, Recommendation
MEDICINE & BIOLOGY: Cancer Cell Detection, Diabetic Grading, Drug Discovery
MEDIA & ENTERTAINMENT: Video Captioning, Video Search, Real Time Translation
SECURITY & DEFENSE: Face Detection, Video Surveillance, Satellite Imagery
AUTONOMOUS MACHINES: Pedestrian Detection, Lane Tracking, Recognize Traffic Sign

24 Why is Deep Learning Hot Now?
Big Data Availability; New ML Techniques; GPU Acceleration
350 million images uploaded per day
2.5 petabytes of customer data hourly
300 hours of video uploaded every minute

25 WHAT IS DEEP LEARNING?
Example image: Volvo XC90
Image source: Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Ng, "Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks", ICML 2009 & Comm. ACM

26 Cars That See Better And Learn
Diagram labels: camera inputs; neural net model; classified object
NVIDIA GPU deep learning supercomputer
DRIVE PX auto-pilot car computer

27 Deep Learning Platform In Medical
Medical compute center (training): produces the neural net model
Hospital/doctor (inference): medical camera or device inputs; classified object
Feedback

28 GPUS AND DEEP LEARNING
                      NEURAL NETWORKS   GPUS
Inherently Parallel   yes               yes
Matrix Operations     yes               yes
FLOPS                 yes               yes
Bandwidth             yes               yes
GPUs deliver: same or better prediction accuracy, faster results, smaller footprint, lower power

29 NVIDIA GPU: THE ENGINE OF DEEP LEARNING
Frameworks: WATSON, CHAINER, THEANO, MATCONVNET, TENSORFLOW, CNTK, TORCH, CAFFE
NVIDIA CUDA ACCELERATED COMPUTING PLATFORM


30 cuDNN Deep Learning Primitives: IGNITING ARTIFICIAL INTELLIGENCE
GPU-accelerated deep learning subroutines
High-performance neural network training
Accelerates major deep learning frameworks: Caffe, Theano, Torch
Up to 3.5x faster AlexNet training in Caffe than baseline GPU
Tiled FFT up to 2x faster than FFT
[Chart: millions of images trained per day, cuDNN 1 through cuDNN 4]
developer.nvidia.com/cudnn

31 NVIDIA DIGITS: Interactive Deep Learning GPU Training System
Process Data; Configure DNN; Monitor Progress; Visualize Layers
Test Image


32 TESLA M40: World's Fastest Accelerator for Deep Learning Training
13x faster training in Caffe: GPU server with 4x Tesla M40 vs. dual CPU server
Reduce training time from 13 days to just 1 day
[Chart: number of days, dual CPU server vs. GPU server with 4x Tesla M40]
CUDA Cores: 3072
Peak SP: 7 TFLOPS
GDDR5 Memory: 12 GB
Bandwidth: 288 GB/s
Power: 250 W
Note: Caffe benchmark with AlexNet; CPU server uses 2x E5-2680v3 12 Core 2.5GHz CPU, 128GB System Memory, Ubuntu


33 Facebook's deep learning machine: Purpose-Built for Deep Learning Training
2x faster training for faster deployment
2x larger networks for higher accuracy
Powered by eight Tesla M40 GPUs
Open Rack compliant
"Most of the major advances in machine learning and AI in the past few years have been contingent on tapping into powerful GPUs and huge data sets to build and train advanced models." Serkan Piantino, Engineering Director of Facebook AI Research


34 Designed for AI Computing at large scale
Built on the NVIDIA Tesla Platform:
8 Tesla M40s deliver an aggregate 96 GB of GDDR5 memory and 56 teraflops of SP performance
Leverages the world's leading deep learning platform to tap into frameworks such as Torch and libraries such as cuDNN
Operational Efficiency and Serviceability:
Free-air cooled design optimizes thermal and power efficiency
Components swappable without tools
Configurable PCI-e for versatility
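The aggregate figures on this slide follow from the per-card Tesla M40 numbers quoted earlier in the deck (12 GB GDDR5, 7 TFLOPS peak SP). A small sketch of that arithmetic, with illustrative type and function names that are not from the presentation:

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Accelerator:
    name: str
    memory_gb: int
    peak_sp_tflops: float


# Per-card figures as quoted on the Tesla M40 slide.
TESLA_M40 = Accelerator("Tesla M40", memory_gb=12, peak_sp_tflops=7.0)


def aggregate(card: Accelerator, count: int) -> Tuple[int, float]:
    """Total memory (GB) and total peak SP (TFLOPS) for `count` identical cards."""
    return card.memory_gb * count, card.peak_sp_tflops * count


memory_gb, sp_tflops = aggregate(TESLA_M40, 8)
print(f"8x {TESLA_M40.name}: {memory_gb} GB, {sp_tflops:.0f} TFLOPS SP")
```

These are simple sums of peak figures; they say nothing about how well a given training job scales across the eight cards.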


35 NCCL: Accelerating Multi-GPU Communications for Deep Learning
GOAL: Build a research library of accelerated collectives that is easily integrated and topology-aware, so as to improve the scalability of multi-GPU applications
APPROACH:
Pattern the library after MPI's collectives
Handle the intra-node communication in an optimal way
Provide the necessary functionality for MPI to build on top to handle inter-node communication
github.com/nvidia/nccl
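To make "collectives" concrete, here is a pure-Python simulation of a sum all-reduce, the operation that combines gradients from several GPUs so that every GPU ends up with the same total. It uses the well-known ring schedule (reduce-scatter followed by all-gather). The slide does not say which algorithm NCCL uses, so treat the ring as an illustrative assumption; the code does not call NCCL, and ranks are plain Python lists.

```python
from typing import List


def ring_allreduce(buffers: List[List[int]]) -> List[List[int]]:
    """Sum-allreduce over simulated ranks arranged in a ring.

    buffers[r] is the local vector of rank r. All vectors must have the same
    length, and for simplicity that length must be divisible by the rank count.
    Returns one result vector per rank; all of them hold the element-wise sum.
    """
    n = len(buffers)
    size = len(buffers[0])
    if any(len(b) != size for b in buffers) or size % n != 0:
        raise ValueError("need equal-length buffers divisible by the rank count")
    chunk = size // n
    data = [list(b) for b in buffers]  # work on copies

    def span(c: int) -> range:
        """Index range covered by chunk c."""
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: in step s, rank r sends chunk (r - s) mod n to
    # rank r + 1, which adds it to its own copy. Payloads are snapshotted
    # first so that all sends within a step happen "simultaneously".
    for s in range(n - 1):
        sends = [((r - s) % n, r) for r in range(n)]
        payloads = [[data[r][i] for i in span(c)] for c, r in sends]
        for (c, r), payload in zip(sends, payloads):
            dst = (r + 1) % n
            for i, v in zip(span(c), payload):
                data[dst][i] += v
    # Now rank r holds the fully reduced chunk (r + 1) mod n.

    # Phase 2, all-gather: in step s, rank r forwards chunk (r + 1 - s) mod n
    # to rank r + 1, which overwrites its stale copy.
    for s in range(n - 1):
        sends = [((r + 1 - s) % n, r) for r in range(n)]
        payloads = [[data[r][i] for i in span(c)] for c, r in sends]
        for (c, r), payload in zip(sends, payloads):
            dst = (r + 1) % n
            for i, v in zip(span(c), payload):
                data[dst][i] = v
    return data


print(ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]])[0])
```

Each rank sends and receives only one chunk per step, which is why a ring keeps every link busy regardless of the number of participants.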

36 TESLA SYSTEM SOFTWARE AND TOOLS

37 DATA CENTER GPU MANAGEMENT
Device Management (today): board-level GPU configuration & monitoring. Device identification; configuration & monitoring; clock management. All GPUs supported.
Data Center GPU Manager (DCGM), Active Diagnostics: diagnostics, recovery & system validation. GPU recovery & isolation; system validation; comprehensive diagnostics. Tesla GPUs only.
Data Center GPU Manager (DCGM), Health & Governance: proactive health, policy & power management. Real-time monitoring & analysis; governance policies; power & clock management. Tesla GPUs only.

38 DATA CENTER GPU MANAGER (DCGM)
Available as library & CLI
Ready for integration into ISV management software, e.g. Bright Cluster Manager, IBM Platform Cluster Manager
Ready for integration with HPC job schedulers, e.g. Altair PBS Works, Moab & Maui, IBM Platform LSF, SLURM, Univa Grid Engine
Diagram labels: admin; management node with DC cluster management SW; network; compute node with CLI, management SW agent, APIs, DC GPU Manager, Tesla Enterprise Driver, GPUs
DCGM currently in public beta

39 GROWING CONTAINER ADOPTION IN DATA CENTER: Across Enterprise, Cloud and HPC
>2X growth in Docker adoption in a year
"Docker spreads like wildfire, especially in the enterprise" (Rightscale 2016 Cloud Survey Report)


40 GPU CONTAINERIZATION USING NVIDIA-DOCKER
Key Highlights:
Single command-line interface to take care of all deployment steps: discovery, config/setup, device allocation
Pre-built images on Docker Hub: CUDA, Caffe, DIGITS
Reproducible builds across heterogeneous targets
Remote deployment using NVIDIA-Docker-Plugin and REST interface
Available now: NVIDIA Docker on GitHub (experimental)
Future versions (in planning): bundled with CUDA product
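As a rough illustration of the "single command-line interface" point, the sketch below assembles an `nvidia-docker run` invocation from Python without executing it. The `NV_GPU` environment variable for device selection and the `nvidia/cuda` image name are assumptions based on the nvidia-docker 1.x README rather than on this slide, and the helper function is hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def nvidia_docker_run(
    image: str,
    command: Sequence[str],
    gpus: Optional[Sequence[int]] = None,
) -> Tuple[Dict[str, str], List[str]]:
    """Build (extra_env, argv) for an `nvidia-docker run` call.

    Nothing is executed here. `gpus` restricts the container to the given
    device indices via NV_GPU (assumed nvidia-docker 1.x behaviour); when it
    is None, no restriction is requested.
    """
    extra_env: Dict[str, str] = {}
    if gpus is not None:
        extra_env["NV_GPU"] = ",".join(str(g) for g in gpus)
    argv = ["nvidia-docker", "run", "--rm", image, *command]
    return extra_env, argv


env, argv = nvidia_docker_run("nvidia/cuda", ["nvidia-smi"], gpus=[0, 1])
print(env, " ".join(argv))
```

To actually launch the container on a host with nvidia-docker installed, the pair could be passed to `subprocess.run(argv, env={**os.environ, **env})`.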

41 Axel Koehler
