Sharing High-Performance Devices Across Multiple Virtual Machines
Preamble
What does "sharing devices across multiple virtual machines" in our title mean? How is it different from virtual networking / NSX, which allows multiple virtual networks to share the underlying networking hardware?
Virtual networking works well for many standard workloads, but in the realm of extreme performance we need to deliver much closer to bare-metal performance to meet application requirements.
Application areas: Science & Research (HPC), Finance, Machine Learning & Big Data, etc.
This talk is about achieving both extremely high performance and device sharing.
Sharing High-Performance PCI Devices
1 Technical Background
2 Big Data Analytics with Spark
3 High Performance (Technical) Computing
Direct Device Access Technologies
Accessing PCI devices with maximum performance
DirectPath I/O
Allows PCI devices to be accessed directly by the guest OS.
Examples: GPUs for computation (GPGPU), ultra-low-latency interconnects such as InfiniBand and RDMA over Converged Ethernet (RoCE).
Downsides: no vMotion, no snapshots, etc. The full device is made available to a single virtual machine, so there is no sharing.
No ESXi driver is required; the standard vendor device driver runs in the guest.
[Diagram: the application and guest OS kernel access the device directly through VMware ESXi DirectPath I/O.]
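Inside the guest, a passthrough device shows up as an ordinary PCI device, so a quick way to confirm DirectPath I/O is working is to look for it with lspci. The sketch below is illustrative only; the Mellanox PCI vendor ID 15b3 is an assumption about the adapter being passed through, so substitute your device's vendor ID.

```python
import subprocess

# Illustrative check run inside the guest OS: list PCI devices and look for
# the passed-through adapter. Vendor ID 15b3 (Mellanox) is an assumption;
# replace it with the vendor ID of the device you passed through.
VENDOR_ID = "15b3"

def find_passthrough_devices(vendor_id: str = VENDOR_ID) -> list[str]:
    """Return lspci lines for devices from the given PCI vendor."""
    out = subprocess.run(
        ["lspci", "-d", f"{vendor_id}:"],  # -d <vendor>:<device> filters by ID
        capture_output=True, text=True, check=True
    )
    return [line for line in out.stdout.splitlines() if line.strip()]

if __name__ == "__main__":
    devices = find_passthrough_devices()
    if devices:
        print("Passthrough device(s) visible in the guest:")
        for d in devices:
            print(" ", d)
    else:
        print("No matching PCI device found; check the VM's passthrough settings.")
```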
Device Partitioning (SR-IOV)
The PCI standard includes a specification for Single Root I/O Virtualization (SR-IOV).
A single PCI device can present itself as multiple logical devices (Virtual Functions, or VFs) to ESXi and to VMs.
Downsides: no vMotion, no snapshots (but note the PVRDMA feature in ESXi 6.5).
An ESXi driver and a guest driver are required for SR-IOV.
Mellanox Technologies supports SR-IOV on ESXi for both InfiniBand and RDMA over Converged Ethernet (RoCE) interconnects.
[Diagram: a VM's application and guest OS kernel reach the network either through a VMXNET3 vNIC and the vSwitch, or directly through an NMLX5 VF; the physical adapter exposes a PF and multiple VFs.]
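For comparison with the ESXi flow (where VF creation is controlled through driver module parameters, covered in the host-configuration slides below), here is a minimal sketch of the generic Linux sysfs interface for SR-IOV on a bare-metal host. The interface name eth0 and the VF count are placeholders.

```python
from pathlib import Path

# Generic Linux sysfs view of SR-IOV, shown only for comparison with the
# ESXi workflow described later. "eth0" is a placeholder interface name.
IFACE = "eth0"
DEV = Path(f"/sys/class/net/{IFACE}/device")

def show_vfs() -> None:
    # These attributes exist only on SR-IOV-capable adapters.
    total = (DEV / "sriov_totalvfs").read_text().strip()
    current = (DEV / "sriov_numvfs").read_text().strip()
    print(f"{IFACE}: {current} VFs enabled out of {total} supported")

def enable_vfs(count: int) -> None:
    # Writing to sriov_numvfs asks the PF driver to create that many VFs
    # (requires root privileges).
    (DEV / "sriov_numvfs").write_text(str(count))

if __name__ == "__main__":
    show_vfs()
```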
Remote Direct Memory Access (RDMA)
A hardware transport protocol optimized for moving data to and from memory.
Extreme performance: ~600 ns application-to-application latencies, 100 Gb/s throughput, negligible CPU overhead.
RDMA applications:
Storage (iSER, NFS-RDMA, NVMe-oF, Lustre)
HPC (MPI, SHMEM)
Big data and analytics (Hadoop, Spark)
How does RDMA achieve high performance?
Traditional network stack challenges: per-message, per-packet, and per-byte overheads; user-kernel crossings; memory copies.
RDMA provides in hardware: isolation between applications; transport (packetizing messages, reliable delivery); address translation; user-level networking; direct hardware access for the data path.
[Diagram: user-space application buffers (AppA, AppB) and kernel consumers (iSER, NVMe-oF) exchange data directly with RDMA-capable hardware, bypassing the kernel on the data path.]
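A quick way to confirm from inside a Linux VM that an RDMA-capable device (a passthrough HCA or an SR-IOV VF) has been registered is to inspect the kernel's RDMA sysfs directory. The sketch below only reads standard sysfs paths exposed by the Linux RDMA stack; the device names (for example mlx5_0) depend on your adapter.

```python
from pathlib import Path

# Minimal sketch: enumerate RDMA devices registered with the Linux kernel.
# /sys/class/infiniband is populated by the RDMA core for both InfiniBand
# and RoCE adapters; device names such as mlx5_0 depend on the hardware.
RDMA_SYSFS = Path("/sys/class/infiniband")

def list_rdma_devices() -> None:
    if not RDMA_SYSFS.exists():
        print("No RDMA devices registered (is the guest driver loaded?)")
        return
    for dev in sorted(RDMA_SYSFS.iterdir()):
        for port in sorted((dev / "ports").iterdir()):
            state = (port / "state").read_text().strip()  # e.g. "4: ACTIVE"
            rate = (port / "rate").read_text().strip()    # e.g. "100 Gb/sec ..."
            print(f"{dev.name} port {port.name}: {state}, {rate}")

if __name__ == "__main__":
    list_rdma_devices()
```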
Host Configuration: Driver Installation
DirectPath I/O does not require an ESXi driver; InfiniBand and RoCE work with the standard guest driver in this case.
To use SR-IOV, a host driver is required:
RoCE bundle: https://my.vmware.com/web/vmware/details?downloadgroup=dt-esxi65-MELLANOX-NMLX5_CORE-41688&productId=614
InfiniBand bundle: will be GA in Q4 2017
Management tools: http://www.mellanox.com/page/management_tools
Install and configure the host driver using suitable driver parameters.
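A minimal sketch of the host-side steps is shown below as a thin Python wrapper around esxcli (ESXi ships a Python interpreter, but running the same esxcli commands from an SSH session works just as well). The bundle path and the max_vfs parameter value are assumptions for illustration; check the Mellanox driver release notes for the exact module parameter names supported by your driver version.

```python
import subprocess

# Illustrative sketch only: install the Mellanox nmlx5_core bundle and set a
# driver module parameter on the ESXi host. BUNDLE is a hypothetical local
# path, and "max_vfs=8" is an assumed parameter name/value; verify both
# against the Mellanox documentation before use. A host reboot follows.
BUNDLE = "/tmp/MLNX-NMLX5_CORE-bundle.zip"
MODULE_PARAMS = "max_vfs=8"

def run(cmd: list[str]) -> None:
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    # Install the driver VIB bundle on the ESXi host.
    run(["esxcli", "software", "vib", "install", "-d", BUNDLE])
    # Set driver module parameters (e.g. number of VFs), then reboot the host.
    run(["esxcli", "system", "module", "parameters", "set",
         "-m", "nmlx5_core", "-p", MODULE_PARAMS])
```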
Host Configuration: Verify Virtual Functions are available
1) Select the host
2) Select the Configure tab
3) Select PCI Devices
4) Check that the Virtual Function is available
Host Configuration: Assign a VF to a VM
1) Select the VM
2) Select the Manage tab
3) Select Hardware
4) Select Edit
Spark Big Data Analytics
Accelerating time to solution with a shared, high-performance interconnect
Spark Test Results: TCP vs. RDMA on vSphere (lower is better)
Setup: 16 ESXi 6.5 hosts, one Spark VM per host; one server used as the Name Node.
[Bar chart: runtime in seconds (0-250) for Average, Min, and Max samples, TCP vs. RDMA.]

Runtime samples   TCP (sec)      RDMA (sec)     Improvement
Average           222 (1.05x)    171 (1.01x)    23%
Min               213 (1.07x)    165 (1.05x)    23%
Max               233 (1.05x)    174 (1.0x)     25%
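For context, a minimal PySpark job of the shuffle-heavy kind is sketched below: the groupBy forces an all-to-all exchange between executors, which is exactly the traffic that benefits from an RDMA-capable interconnect. The input path and column names are placeholders, not the actual benchmark used in these runs.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Minimal sketch of a shuffle-heavy Spark job; the groupBy triggers an
# all-to-all shuffle between executors, the network-bound phase that an
# RDMA-capable interconnect accelerates. Paths and columns are placeholders.
spark = (SparkSession.builder
         .appName("shuffle-heavy-example")
         .getOrCreate())

df = spark.read.parquet("hdfs:///data/events")  # hypothetical dataset

result = (df.groupBy("user_id")                 # forces a shuffle
            .agg(F.count("*").alias("events"),
                 F.sum("bytes").alias("total_bytes"))
            .orderBy(F.desc("total_bytes")))

result.show(10)
spark.stop()
```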
High Performance Computing
Research, science, and engineering applications on vSphere
Two Classes of Workloads: Throughput and Tightly-Coupled
Throughput ("embarrassingly parallel"). Examples: digital movie rendering, financial risk analysis, microprocessor design, genomics analysis.
Tightly-coupled (often uses the Message Passing Interface, MPI). Examples: weather forecasting, molecular modelling, jet engine design, spaceship, airplane & automobile design. A small sketch contrasting the two classes follows below.
[Diagram: HPC cluster.]
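To make the distinction concrete, here is a small mpi4py sketch (mpi4py is used here only for illustration; the workloads above use a variety of MPI implementations and languages). The throughput-style part involves no inter-rank communication, while the tightly-coupled part performs a collective allreduce, the kind of step whose cost is dominated by interconnect latency.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Throughput style ("embarrassingly parallel"): each rank works on its own
# chunk of data and never communicates until the final result is collected.
local = np.random.rand(1_000_000)
local_sum = local.sum()

# Tightly-coupled style: ranks exchange data every step. The allreduce below
# is a collective whose cost is dominated by interconnect latency/bandwidth,
# which is why these workloads benefit from InfiniBand or RoCE.
global_sum = comm.allreduce(local_sum, op=MPI.SUM)

if rank == 0:
    print(f"{size} ranks, global sum = {global_sum:.3f}")
```

Run, for example, with `mpirun -np 60 python example.py` to mirror the 60-process jobs described in the next slides.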
InfiniBand MPI Example Cluster
All VMs: #vCPUs = #cores
100% CPU overcommit, no memory overcommit
[Diagram: two virtual clusters (Cluster 1 and Cluster 2) spanning three ESXi hosts, connected over InfiniBand.]
InfiniBand MPI Performance Test
Application: NAMD; benchmark: STMV
20-vCPU VMs for all tests, 60 MPI processes per job
Run time (seconds): two vClusters: 169.3 each; one vCluster: 98.5; bare metal: 93.4
[Bar chart: run times across three hosts (Linux on ESXi vs. bare metal); the chart carries a 10% annotation.]
Compute Accelerators
Enabling machine learning, financial, and other HPC applications on vSphere
Shared NVIDIA GPGPU Computing
Workload: TensorFlow RNN
SuperMicro dual 12-core system, 16 GB NVIDIA P100 GPU
Two VMs, each with an 8Q GPU profile
NVIDIA GRID 5.0, ESXi 6.5
Scheduling policies: fixed share, equal share, best effort
[Diagram: two Linux VMs, each running TensorFlow with the CUDA toolkit and driver, share the host's NVIDIA P100 through the GRID driver in ESXi.]
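Inside each VM the vGPU appears as an ordinary CUDA device, so TensorFlow needs no special configuration. As a minimal sanity check, the sketch below (written against the current TensorFlow 2 API, which differs from the TensorFlow version used in these 2017 tests) confirms the GPU is visible and runs a small matrix multiply on it.

```python
import tensorflow as tf

# Minimal sanity check that the vGPU assigned to this VM is visible to
# TensorFlow. Uses the TensorFlow 2 API; the original tests predate it.
gpus = tf.config.list_physical_devices("GPU")
print("Visible GPUs:", gpus)

if gpus:
    with tf.device("/GPU:0"):
        a = tf.random.normal((2048, 2048))
        b = tf.random.normal((2048, 2048))
        c = tf.matmul(a, b)  # executes on the vGPU
    print("Matmul result shape:", c.shape)
else:
    print("No GPU visible; check the VM's vGPU profile and NVIDIA guest driver.")
```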
Shared NVIDIA GPGPU Computing
Single P100, two VMs with the 8Q profile, legacy scheduler
Summary
Virtualization can support high-performance device sharing for cases in which extreme performance is a critical requirement.
Virtualization supports device sharing and delivers near bare-metal performance for:
High Performance Computing
Big Data Spark Analytics
Machine and Deep Learning with GPGPU
The VMware platform and partner ecosystem address the extreme performance needs of the most demanding emerging workloads.