Machine Learning on VMware vsphere with NVIDIA GPUs

Similar documents
Disclaimer This presentation may contain product features that are currently under development. This overview of new technology represents no commitme

An Automation Framework for Benchmarking and Optimizing Performance of Remote Desktops in the Cloud

REAL PERFORMANCE RESULTS WITH VMWARE HORIZON AND VIEWPLANNER

EVALUATING WINDOWS 10: LEARN WHY YOUR USERS NEED GPU ACCELERATION

Sharing High-Performance Devices Across Multiple Virtual Machines

Disclaimer This presentation may contain product features that are currently under development. This overview of new technology represents no commitme

EVALUATING WINDOWS 10 LEARN WHY YOUR USERS NEED GPU ACCELERATION

NVIDIA GRID APPLICATION SIZING FOR AUTODESK REVIT 2016

Delivering Real World 3D Applications with VMware Horizon, Blast Extreme and NVIDIA Grid

NLVMUG 16 maart Display protocols in Horizon

Fujitsu VDI / vgpu Virtualization

An introduction to Machine Learning silicon

Performance & Scalability Testing in Virtual Environment Hemant Gaidhani, Senior Technical Marketing Manager, VMware

GPU Consolidation for Cloud Games: Are We There Yet?

NVIDIA GRID. Ralph Stocker, GRID Sales Specialist, Central Europe

WHITE PAPER - JULY 2018 ENABLING MACHINE LEARNING AS A SERVICE (MLAAS) WITH GPU ACCELERATION USING VMWARE VREALIZE AUTOMATION. Technical White Paper

IBM Deep Learning Solutions

Contents PART I: CLOUD, BIG DATA, AND COGNITIVE COMPUTING 1

Characterization and Benchmarking of Deep Learning. Natalia Vassilieva, PhD Sr. Research Manager

Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs

EFFICIENT INFERENCE WITH TENSORRT. Han Vanholder

PERFORMANCE CHARACTERIZATION OF MICROSOFT SQL SERVER USING VMWARE CLOUD ON AWS PERFORMANCE STUDY JULY 2018

PERFORMANCE STUDY OCTOBER 2017 ORACLE MONSTER VIRTUAL MACHINE PERFORMANCE. VMware vsphere 6.5

Deep Learning on Arm Cortex-M Microcontrollers. Rod Crawford Director Software Technologies, Arm

THE NVIDIA DEEP LEARNING ACCELERATOR

S5006 YOUR HORIZON VIEW DEPLOYMENT IS GPU READY, JUST ADD NVIDIA GRID. Jeremy Main Senior Solution Architect - GRID

Delivering Transformational User Experience with Blast Extreme Adaptive Transport and NVIDIA GRID.

Profiling the Performance of Binarized Neural Networks. Daniel Lerner, Jared Pierce, Blake Wetherton, Jialiang Zhang

S8765 Performance Optimization for Deep- Learning on the Latest POWER Systems

How NVIDIA GRID Brings Amazing Graphics to the Virtualized Experience

Dell Wyse Datacenter for VMware Horizon View with High Performance Graphics

gscale: Scaling up GPU Virtualization with Dynamic Sharing of Graphics Memory Space

GaaS Workload Characterization under NUMA Architecture for Virtualized GPU

High-End Graphics Everywhere! UCS C240 and NVIDIA Grid

Recurrent Neural Networks. Deep neural networks have enabled major advances in machine learning and AI. Convolutional Neural Networks

Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks

NVIDIA DLI HANDS-ON TRAINING COURSE CATALOG

Disclaimer This presentation may contain product features that are currently under development. This overview of new technology represents no commitme

Improving CPU Performance of Xen Hypervisor in Virtualized Environment

WHAT S NEW IN GRID 7.0. Mason Wu, GRID & ProViz Solutions Architect Nov. 2018

DEEP NEURAL NETWORKS AND GPUS. Julie Bernauer

DEEP LEARNING AND DIGITS DEEP LEARNING GPU TRAINING SYSTEM

Characterizing and Benchmarking Deep Learning Systems on Modern Data Center Architectures

MoonRiver: Deep Neural Network in C++

Small is the New Big: Data Analytics on the Edge

Shrinath Shanbhag Senior Software Engineer Microsoft Corporation

Virtual GPU 을활용한 VDI 구현엔비디아서완석.

Two FPGA-DNN Projects: 1. Low Latency Multi-Layer Perceptrons using FPGAs 2. Acceleration of CNN Training on FPGA-based Clusters

Implementing Virtual NVMe for Flash- Storage Class Memory

Index. Springer Nature Switzerland AG 2019 B. Moons et al., Embedded Deep Learning,

Deep Learning Frameworks with Spark and GPUs

GPU TECHNOLOGY WORKSHOP SOUTH EAST ASIA 2014

NVIDIA GPU CLOUD DEEP LEARNING FRAMEWORKS

Unified Deep Learning with CPU, GPU, and FPGA Technologies

Onto Petaflops with Kubernetes

Cerner SkyVue Cardiology Remote Review with NVIDIA and VMware Horizon

HPC Performance in the Cloud: Status and Future Prospects

Fast forward. To your <next>

Virtualization. Michael Tsai 2018/4/16

ACCELERATED COMPUTING: THE PATH FORWARD. Jen-Hsun Huang, Co-Founder and CEO, NVIDIA SC15 Nov. 16, 2015

Citrix XenApp / Microsoft RDSH. How to get the Best User Experience and Performance with NVIDIA vgpu Technology

Shadowfax: Scaling in Heterogeneous Cluster Systems via GPGPU Assemblies

Dell Precision Appliance for Wyse - VMware Horizon

Pexip Infinity Server Design Guide

Use Tesla to provide first GPU VM Service in China

NVIDIA DEEP LEARNING INSTITUTE

Deep Learning Performance and Cost Evaluation

Cisco UCS C480 ML M5 Rack Server Performance Characterization

Object recognition and computer vision using MATLAB and NVIDIA Deep Learning SDK

Chapter 2 Parallel Hardware

Adaptable Computing The Future of FPGA Acceleration. Dan Gibbons, VP Software Development June 6, 2018

In partnership with. VelocityAI REFERENCE ARCHITECTURE WHITE PAPER

World s most advanced data center accelerator for PCIe-based servers

Boosting GPU Virtualization Performance with Hybrid Shadow Page Tables

GPU Technology Conference 2018, San Jose S8199: GPU Enabled VDI Made the Grade at USC Viterbi School of Engineering

Microsoft SQL Server in a VMware Environment on Dell PowerEdge R810 Servers and Dell EqualLogic Storage

What s New in VMware vsphere 4.1 Performance. VMware vsphere 4.1

VIRTUAL GPU SOFTWARE. DU _v5.0 through 5.2 Revision 05 March User Guide

Building the Most Efficient Machine Learning System

Deep Learning Performance and Cost Evaluation

TESLA V100 PERFORMANCE GUIDE. Life Sciences Applications

Dell DVS. Enabling user productivity and efficiency in the Virtual Era. Dennis Larsen & Henrik Christensen. End User Computing

Accelerating Digital Transformation with InterSystems IRIS and vsan

Notes Section Notes for Workload. Configuration Section Configuration

Dell EMC Ready Architectures for VDI

Accelerating Cloud Graphics

Optimizing Out-of-Core Nearest Neighbor Problems on Multi-GPU Systems Using NVLink

S8822 OPTIMIZING NMT WITH TENSORRT Micah Villmow Senior TensorRT Software Engineer

GPU Technology Conference 2016 S6194: Delivering Graphics Intensive Applications to Computing Labs and BYOD in Education Michael Goay, Executive

Virtualization of the MS Exchange Server Environment

Power your planet. Optimizing the Enterprise Data Center POWER7 Powers a Smarter Infrastructure

NVIDIA'S DEEP LEARNING ACCELERATOR MEETS SIFIVE'S FREEDOM PLATFORM. Frans Sijstermans (NVIDIA) & Yunsup Lee (SiFive)

NVIDIA GRID A True PC Experience for Everyone Anywhere

Fast Hardware For AI

Building the Most Efficient Machine Learning System

Deep Learning Accelerators

IOmark-VM. VMware VSAN Intel Servers + VMware VSAN Storage SW Test Report: VM-HC a Test Report Date: 16, August

DIGITS DEEP LEARNING GPU TRAINING SYSTEM

DELIVERING HIGH-PERFORMANCE REMOTE GRAPHICS WITH NVIDIA GRID VIRTUAL GPU. Andy Currid NVIDIA

Embedded GPGPU and Deep Learning for Industrial Market

Transcription:

Machine Learning on VMware vsphere with NVIDIA GPUs Uday Kurkure, Hari Sivaraman, Lan Vu GPU Technology Conference 2017 2016 VMware Inc. All rights reserved.

Gartner Hype Cycle for Emerging Technology 2

Machine Learning Application Areas Assets Management Medical Imaging Computational Biology Gene Sequencing Business Analysis Automation Social Networks Microarray IoT 3

Machine Learning Data Mining Data Science Artificial Intelligence Machine Learning Deep learning Big data Adapted from http://www.kdnuggets.com/2016/03/data-science-puzzle-explained.html 4

High Performance Computing for Machine Learning Multi-core CPU Systems FPGA (A field-programmable Gate Array) GPU (Graphics Process Unit) TPU (Tensor Processing Unit) 5

Why Machine Learning on VMware vsphere with Nvidia GPUs? VMware vsphere hypervisor efficiently manages servers of data center Machine learning workloads are becoming important in cloud environment VMware vsphere supports multiple GPU virtualization solutions 6

Capabilities of GPUs supported by VMware vsphere Accelerating 2D/3D Graphics workloads for VMware VDI VMware Blast Extreme protocol encoding / decoding for VDI H.264 Based ( MPEG-4) General Purpose GPU (GPGPU) Machine learning / Deep Learning Other high performance computing workloads 7

Developing Machine Learning Applications with GPUs For Nvidia GPUs, we can use CUDA with cudnn and other libraries 8

Machine Learning on VMware vsphere using Nvidia GPUs VMware DirectPath IO NVIDIA GRID vgpu 9

VMware DirectPath IO (Pass-through) Virtual Machine Virtual Machine Virtual Machine Virtual Machine Guest OS Guest OS Guest OS Guest OS Applications Applications Applications Applications GPU driver GPU driver GPU driver GPU driver Support a large set of GPU cards including GRID GPU Allow direct access to the physical GPUs A virtual machine (VM) allows one or multiple GPUs Hypervisor Pass-through Pass-through Pass-through Pass-through GPU GPU GPU GPU 10

NVIDIA GRID vgpu Virtual Machine Virtual Machine Virtual Machine Virtual Machine Guest OS Guest OS Guest OS Guest OS Applications Applications Applications Applications GPU driver GPU driver GPU driver GPU driver Support NVIDIA GRID GPUs Each virtual machine owns a vgpu vgpu profile specifies the frame buffer size Graphics mode: One or multiple vgpu per physical GPU Hypervisor Nvidia GRID vgpu manager Compute mode (for GPGPU applications like CUDA) Require the highest vgpu profile vgpu vgpu vgpu One vgpu per physical GPU M60-4Q profile GRID GPU M60-8Q profile GRID GPU Ex: M60 GPU Card has 2 Maxwell GM 204 GPUs Up to 2 CUDA VMs Up to 32 Graphics VMs 11

Nvidia GRID vgpu configurations 12

Benefits of each solution VMware DirectPath I/O Support more GPU cards Allow multiple GPUs per VM Low overhead Nvidia GRID vgpu Allow multiple VMs per physical GPU More users / VMs with vgpu Flexibility of management Virtual Machine Virtual Machine Virtual Machine Virtual Machine Guest OS Guest OS Guest OS Guest OS Applications Applications Applications Applications GPU driver GPU driver GPU driver GPU driver Virtual Machine Virtual Machine Virtual Machine Virtual Machine Guest OS Guest OS Guest OS Guest OS Applications Applications Applications Applications GPU driver GPU driver GPU driver GPU driver Hypervisor Pass-through Pass-through Pass-through Pass-through Hypervisor Nvidia GRID vgpu manager GPU GPU GPU GPU vgpu vgpu vgpu GRID GPU GRID GPU 13

Exploring Machine Learning Workload Performance Performance Performance: Native vs. Virtual, GPU vs. CPU, etc. Scalability: with # of user and # of GPUs Scalability Mixed Workloads Mixed Workloads: ML, 3D workloads 14

PERFORMANCE Performance Scalability Mixed Workloads 15

Machine Learning Framework and Neural Networks Machine Learning Framework: TensorFlow Nvidia CUDA 7.5, Nvidia cudnn 5.1 Neural Network Architectures and Workloads: Recurrent Neural Network (RNN): Language Modeling on Penn Tree Bank Convolutional Neural Network (CNN) Handwriting Recognition with MNIST Image Classifier with CIFAR-10 16

Performance: Native vs Virtual Performance Scalability Mixed Workloads 17

Testbed Configurations for Native vs Virtual Comparison VMware Test-bed for NVIDIA GRID on Horizon View Native Configuration TensorFlow 0.10 CentOS 7.2 Hyperthreading ON SSD Hard Disk Virtual Linux VMs TensorFlow 0.10 CentOS 7.2 ESX 6.X 12vCPU, 16 GB RAM, 96GB HD SSD hard disk Storage Dell R730 Intel Haswell CPUs + 2 x NVidia GRID M60 24 cores (2 x 12-core socket) E5-2680 V3 768 GB RAM Dell R730 Intel Haswell CPUs + 2 x NVidia GRID M60 24 cores (2 x 12-core socket) E5-2680 V3 768 GB GB RAM 18

Performance: Native vs Virtualized GPU Neural Network Type: Recurrent Neural Network Large Model 1500 LSTM units /layer Workload Complex Language Modelling Using Recurrent Neural Network Word Level Prediction Penn Tree Bank (PTB) Database: 929K training words 73K validation words 82K test words 10K vocabulary 19

Performance: Training Times on native GPU vs virtualized GPU 4% of overhead for both DirectPath I/O vs. GRID vgpu compared to native GPU Language Modelling with RNN on PTB Normalized Training Times Lower is etter 1.20 1.00 0.80 0.60 0.40 0.20 1 1.04 1.04 0.00 Native GRID vgpu DirectPathIO 20

Performance: GPUs vs CPUs 21

Performance: CPU vs GPU Cores Intel Haswell Intel Broadwell GM204 (M60) 12 (24 with hyperthreading) 22 (44 with hyperthreading) Implications 2048 Task Parallelism vs Data Parallelism TeraFlop ~0.26 ~0.5 ~4.X 8x to 15x the number of Flops Clock Rate GHz 2.5 and 3.30 (Turbo Boost) 2.2 and 3.6(Turbo Boost) 1.12 and 1.2(Boost) TDP 120 Watts 125 Watts 165 Watts Memory BW 224 GB/s Technology 22nm 14nm 28nm 0.5 the speed of Broadwell In spite of CPU s process technology & clock rate, GPUs outperform CPUs on ML workloads. GPUs: More Silicon is devoted to increase the number of ALUs 22

Performance: GPU vs. CPU on virtualized server Two VM Configurations: 1 VM with 1 vgpu (M60-8q) vs 1 VM without a GPU Each VM has 12 vcpus, 60GB memory, 96GB of SSD storage, CentOS 7.2 Neural Networks Types: CNN and RNN Dataset for CNN: MNIST database of handwritten digits Training set: 60,000 examples Test set: 10,000 examples. Dataset for RNN: PTB R 23

Training Times with GPU vs. without GPU on virtualized server Handwritten Recognition with CNN on MNIST Language Modeling with RNN on PTB 12 10 10 Normalized Training Time Lower is better 8 6 4 2 1.0 10.1 8 Normalized Training Time Lower is better 6 4 2 1 7.9 0 1 vgpu No GPU 0 1 vgpu No GPU 24

Performance: DirectPath IO vs GRID vgpu Performance Scalability Mixed Workloads 25

Performance: DirectPath I/O and GRID vgpu Normalized training times for MNIST and PTB Normalized Training Times Lower is better 1.4 1.2 1.0 0.8 0.6 0.4 0.2 DirectPath I/O GRID vgpu 1.05 1 1 1 0.0 CNN with MNIST RNN with PTB 26

Scalability- VMs or Users Performance Scalability Mixed Workloads 27

Scalability: Scaling number of VMs/server The number of users or VMs with ML workload per server One VM with one GPU Four VMs with one gpu each

Scalability: CIFAR-10 Workload Workload for Convolutional Neural Network Image Classifier 60K images 50K training and 10K image Convolutional Neural Network ~ One Million learning parameters ~19 million multiply-add to compute inference on a single image 29

Scalability VMs or Users Performance Scales almost linearly with number of VMs / users Performance of DirectPath I/O and GRID vgpu are comparable 5 Normalized Images / second of CIFAR-10 Normalized Images / second Higher is better 4 3 2 1 1.0 DirectPath I/O 0.9 GRID vgpu 3.5 3.3 0 1 VM 4 VMs

Scalability - GPUs Performance Scalability Mixed Workloads 31

GPU Scalability How a ML application performs as increasing number of GPUs up to maximum available on the server 1 GPU per VM 2 GPUs per VM 4 GPUs per VM VM VM VM GPU GPU GPU GPU GPU GPU GPU 32

GPU Scalability Scale almost linearly with number of GPUs Only DirectPath I/O supports multiple GPUs per VM Normalized images /sec with CNN on CIFAR-10 5.00 Normalized Images per second Higher is better 4.00 3.00 2.00 1.00 1 2.04 3.74 0.00 1 GPU 2 GPUs 4 GPUs

Resource Utilization 34

CPU Underutilization with Machine Learning with GPUs CPU Utilization (%) Lower is better Handwritten Recognition with CNN on MNIST 100% 80% 60% 40% 20% 0% 3% 78% 1 vgpu No GPU CPU Utilization (%) Lower is better Language Modeling with RNN on PTB 50% 40% 30% 20% 10% 0% 8% 43% 1 vgpu No GPU CPU Underutilization 35

Implications of CPU Under-utilization Have more users / VMs per server Mix GPU and non-gpu workloads on the same server Mix Workloads Mix CUDA and Graphics Workloads

Mixed Workloads Performance Scalability Mixed Workloads 37

Mixed Workloads - Motivation If you are setting up an environment with GPUs, consider sharing resources have a VDI environment with spare GPU capacity consider sharing resources have a setup to run ML, consider sharing GPU resources WHY? Better server utilization with minimal performance impact. 38

Mixed Workloads Goal: Quantify impact of running 3D CAD & ML on the same server concurrently. Benchmarks: ML on M60-8Q CAD: SPECapc for 3ds Max 2015 ML: mnist CentOS 7.2, TensorFlow 0.10 12vCPU, 16 GB RAM, 96GB HD Dell R730 Intel Haswell CPUs + 2 x NVidia GRID M60 24 cores (2 x 12-core socket) E5-2680 V3 768 GB GB RAM Win 7, x64, 4 vcpu, 16 GB RAM, 120 GB HD Autodesk 3ds Max 2015 39

Mixed Workloads Experiment Design 1.) First set of runs : Run SPECapc + MNIST concurrently 2.) Second set of runs: Run SPECapc ONLY 3.) Third set of runs: Run MNIST ONLY Perf. Metrics: SPECapc: Geo. Mean of run-times MNIST: Run time as measured by wall-clock Please Note: SPECapc runs on all vgpu profiles MNIST runs only on M60-8Q profile 40

Mixed Workloads - Results SPEC < 5% penalty in M60-2Q, M60-4Q, M60-8Q MNIST < 15% in M60-2Q, M60-4Q, M60-8Q % Increase [Lower is better] 14 12 10 8 6 4 2 0 3.16 % Increase in run-time for Mixed Workloads 12.25 4.86 SPEC MNIST 3.4 1.34 1.06 M60-2Q M60-4Q M60-8Q vgpu Profile 41

Mixed Workloads - Results SPEC < 5% penalty in M60-2Q, M60-4Q, M60-8Q MNIST < 15% in M60-2Q, M60-4Q, M60-8Q % Increase in run-time for Mixed Workloads % Increase [Lower is better] 100 90 80 70 60 50 40 30 20 10 0 SPEC MNIST 74.75 73.23 12.25 3.16 4.86 1.34 3.4 1.06 M60-1Q M60-2Q M60-4Q M60-8Q vgpu Profile 42

Mixed Workloads - Results Normalized Execution time 2 1.6 1.2 0.8 0.4 0 Concurrent Sequential M60-1Q M60-2Q M60-4Q M60-8Q vgpu Profile 43

Mixed Workloads - Results CPU Utilization (%) 100 80 60 40 20 93 63.3 63.3 60 32 SPEC+MNIST 29 16.8 SPEC 19.4 0 M60-1Q M60-2Q M60-4Q M60-8Q vgpu Profile 44

Takeaways Virtualized GPUs deliver near bare metal Performance for ML workloads GPUs can be used in two modes on vsphere: Direct Path IO and NVidia GRID vgpu For CUDA/ML workloads, GRID VGPU requires 1 GPU per VM i.e. the highest vgpu profile For Multi-GPU ML workloads, use Direct Path IO mode For more consolidation of GPU-based workloads, use GRID vgpu GRID vgpu combines performance of GPUs and datacenter management benefits of VMware vsphere 45

Future Work Distributed Machine Learning Cluster with Multi-Hosts & Multi-GPUs Different Machine Learning Frameworks Bidmach Caffe Torch Theano TensorFlow Performance Studies of Containers with GPUs 46

Q&A Thank you! Contact Uday Kurkure Lan Vu Hari Sivaraman {ukurkure,lanv,hsivaraman}@vmware.com Thanks to our colleagues Aravind Bappanadu, Ravi Soundararajan, Ziv Kalmanovich, Reza Taheri, Vincent Lin 47