Sharing High-Performance Devices Across Multiple Virtual Machines


Preamble: What does "sharing devices across multiple virtual machines" in our title mean? How is it different from virtual networking / NSX, which allows multiple virtual networks to share the underlying networking hardware? Virtual networking works well for many standard workloads, but in the realm of extreme performance we need to deliver much closer to bare-metal performance to meet application requirements. Application areas: Science & Research (HPC), Finance, Machine Learning & Big Data, etc. This talk is about achieving both extremely high performance and device sharing.

Sharing High-Performance PCI Devices (agenda): 1) Technical Background, 2) Big Data Analytics with SPARK, 3) High Performance (Technical) Computing

Direct Device Access Technologies: accessing PCI devices with maximum performance

DirectPath I/O: Allows PCI devices to be accessed directly by the guest OS. Examples: GPUs for computation (GPGPU), ultra-low latency interconnects like InfiniBand and RDMA over Converged Ethernet (RoCE). Downsides: no vMotion, no snapshots, etc., and the full device is made available to a single virtual machine, so there is no sharing. No ESXi driver is required, just the standard vendor device driver in the guest; the application and guest OS kernel reach the device through VMware ESXi DirectPath I/O.
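
As an aside, here is a minimal sketch of how one might confirm, from inside a Linux guest, that a DirectPath I/O device showed up with its native vendor driver bound. The sysfs paths and the Mellanox (0x15b3) / NVIDIA (0x10de) vendor IDs are standard Linux; nothing here is ESXi-specific.

```python
# Minimal sketch (assumption: a Linux guest with sysfs mounted at /sys).
# Lists PCI devices visible to the VM and the driver bound to each,
# filtering on the Mellanox (0x15b3) and NVIDIA (0x10de) vendor IDs.
import os

VENDORS = {"0x15b3": "Mellanox", "0x10de": "NVIDIA"}

def passthrough_devices(root="/sys/bus/pci/devices"):
    for addr in sorted(os.listdir(root)):
        dev = os.path.join(root, addr)
        with open(os.path.join(dev, "vendor")) as f:
            vendor = f.read().strip()
        if vendor not in VENDORS:
            continue
        driver_link = os.path.join(dev, "driver")
        driver = (os.path.basename(os.readlink(driver_link))
                  if os.path.islink(driver_link) else "none")
        yield addr, VENDORS[vendor], driver

if __name__ == "__main__":
    for addr, vendor, driver in passthrough_devices():
        print(f"{addr}  {vendor:8s}  driver={driver}")
```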

Device Partitioning (SR-IOV): The PCI standard includes a specification for SR-IOV, Single Root I/O Virtualization. A single PCI device can present itself as multiple logical devices (Virtual Functions, or VFs) to ESXi and to VMs. Downsides: no vMotion, no snapshots (but note the PVRDMA feature in ESXi 6.5). An ESXi driver and a guest driver are required for SR-IOV. Mellanox Technologies supports SR-IOV on ESXi for both InfiniBand and RDMA over Converged Ethernet (RoCE) interconnects. In the VM, the guest OS kernel can use either the VMXNET3 paravirtual NIC through the vSwitch or an NMLX5 VF directly; the physical adapter exposes a Physical Function (PF) and multiple VFs.
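
For illustration, a minimal sketch of how SR-IOV partitioning looks from a Linux host's point of view (on ESXi the same information is exposed through the vSphere Client, as shown later): each physical function reports how many VFs are enabled versus supported and links to their PCI addresses.

```python
# Minimal sketch (assumption: a Linux host with an SR-IOV capable adapter and
# sysfs at /sys; on ESXi, use the vSphere Client's PCI Devices view instead).
# For each physical function, prints enabled vs. supported VF counts and the
# PCI addresses of the enabled VFs.
import glob
import os

def read(path):
    with open(path) as f:
        return f.read().strip()

for numvfs_path in glob.glob("/sys/bus/pci/devices/*/sriov_numvfs"):
    pf = os.path.dirname(numvfs_path)
    enabled = read(numvfs_path)
    total = read(os.path.join(pf, "sriov_totalvfs"))
    print(f"PF {os.path.basename(pf)}: {enabled}/{total} VFs enabled")
    for link in sorted(glob.glob(os.path.join(pf, "virtfn*"))):
        print(f"  VF {os.path.basename(os.readlink(link))}")
```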

Remote Direct Memory Access (RDMA): a hardware transport protocol optimized for moving data to/from memory. Extreme performance: 600ns application-to-application latencies, 100Gbps throughput, negligible CPU overhead. RDMA applications: storage (iSER, NFS-RDMA, NVMe-oF, Lustre), HPC (MPI, SHMEM), big data and analytics (Hadoop, Spark).

How does RDMA achieve high performance? Traditional network stack challenges: per-message / per-packet / per-byte overheads, user-kernel crossings, and memory copies as data moves between application buffers and the kernel. RDMA provides in hardware: isolation between applications, the transport itself (packetizing messages, reliable delivery), address translation, and user-level networking with direct hardware access for the data path, so application buffers and storage protocols such as NVMe-oF and iSER talk to the RDMA-capable hardware without passing through the kernel.
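
To make the "per-message overheads and user-kernel crossings" point concrete, here is a minimal sketch that is deliberately not RDMA: it pushes the same 64 MB through the ordinary kernel TCP stack on loopback as small versus large messages. Every send() is a system call plus a copy, and the per-message cost visible in the timings is the overhead that RDMA's user-level, zero-copy data path removes.

```python
# Minimal sketch (standard library only): per-message overhead of the kernel
# socket path, measured by sending 64 MB over loopback as 4 KB vs. 1 MB
# messages. Each send()/recv() is a user-kernel crossing plus a copy.
import socket
import threading
import time

TOTAL = 64 * 1024 * 1024

def sink(server_sock):
    conn, _ = server_sock.accept()
    received = 0
    while received < TOTAL:
        chunk = conn.recv(1024 * 1024)
        if not chunk:
            break
        received += len(chunk)
    conn.close()

def run(msg_size):
    server = socket.create_server(("127.0.0.1", 0))
    t = threading.Thread(target=sink, args=(server,))
    t.start()
    client = socket.create_connection(server.getsockname())
    payload = b"x" * msg_size
    start = time.perf_counter()
    for _ in range(TOTAL // msg_size):
        client.sendall(payload)
    client.close()
    t.join()
    server.close()
    return time.perf_counter() - start

for size in (4 * 1024, 1024 * 1024):
    print(f"{size // 1024:5d} KB messages: {run(size):.3f} s")
```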

Host Configuration, Driver Installation: DirectPath I/O does not require an ESXi driver; InfiniBand and RoCE work with the standard guest driver in this case. To use SR-IOV, a host driver is required. RoCE bundle: https://my.vmware.com/web/vmware/details?downloadgroup=dt-esxi65-MELLANOX-NMLX5_CORE-41688&productId=614 ; InfiniBand bundle: will be GA in Q4 2017. Management tools: http://www.mellanox.com/page/management_tools . Install and configure the host driver using suitable driver parameters.
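
A minimal sketch of scripting that last configuration step, assuming the script runs on the ESXi host (which ships a Python interpreter) and that the Mellanox host driver module is nmlx5_core (the module name appears in the bundle name above). The max_vfs parameter is an example taken from Mellanox driver documentation; check the release notes for the parameters your driver version actually supports.

```python
# Minimal sketch (assumptions: executed on the ESXi host; module name
# nmlx5_core; max_vfs is an example parameter -- verify against your driver
# documentation). Lists the module's parameters, then sets one.
import subprocess

MODULE = "nmlx5_core"

def esxcli(*args):
    return subprocess.check_output(("esxcli",) + args).decode()

# Show the current driver parameters for the module.
print(esxcli("system", "module", "parameters", "list", "-m", MODULE))

# Example: request 4 virtual functions (takes effect after a host reboot).
subprocess.check_call(["esxcli", "system", "module", "parameters", "set",
                       "-m", MODULE, "-p", "max_vfs=4"])
```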

Verify Virtual Functions are available: 1) select the host, 2) select the Configure tab, 3) select PCI Devices, 4) check that the Virtual Functions are available.

Host Configuration, Assign a VF to a VM: 1) select the VM, 2) select the Manage tab, 3) select Hardware, 4) select Edit.

SPARK Big Data Analytics: accelerating time to solution with a shared, high-performance interconnect

SPARK Test Results on vSphere, TCP vs. RDMA (lower is better). Setup: 16 ESXi 6.5 hosts, one Spark VM per host; one server used as the name node. Runtime in seconds:

Runtime sample   TCP (sec)     RDMA (sec)    Improvement
Average          222 (1.05x)   171 (1.01x)   23%
Min              213 (1.07x)   165 (1.05x)   23%
Max              233 (1.05x)   174 (1.0x)    25%
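
A quick sanity check that the Improvement column follows from (TCP - RDMA) / TCP, using the runtimes in the table above:

```python
# Recompute the improvement percentages from the table (seconds).
samples = {"Average": (222, 171), "Min": (213, 165), "Max": (233, 174)}

for name, (tcp, rdma) in samples.items():
    improvement = (tcp - rdma) / tcp
    print(f"{name:7s}: {improvement:.0%}")   # ~23%, 23%, 25%
```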

High Performance Computing: Research, Science, and Engineering applications on vSphere

Two Classes of Workloads: Throughput and Tightly-Coupled. Throughput workloads are embarrassingly parallel; examples: digital movie rendering, financial risk analysis, microprocessor design, genomics analysis. Tightly-coupled workloads often use the Message Passing Interface (MPI) across an HPC cluster; examples: weather forecasting, molecular modelling, jet engine design, spaceship, airplane & automobile design.
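
For context, a minimal sketch of the tightly-coupled MPI pattern (assumes mpi4py and an MPI runtime are installed; launch with something like `mpirun -np 60 python allreduce.py`). Each rank contributes a partial result and all ranks synchronize on the sum, the kind of frequent, latency-sensitive exchange that benefits from InfiniBand/RDMA interconnects.

```python
# Minimal tightly-coupled MPI sketch (assumption: mpi4py + an MPI library).
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

partial = float(rank)              # stand-in for a locally computed value
total = comm.allreduce(partial, op=MPI.SUM)   # all ranks block until done

if rank == 0:
    print(f"{comm.Get_size()} ranks, global sum = {total}")
```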

InfiniBand MPI Example: two virtual clusters (Cluster 1 and Cluster 2) of VMs spread across three ESXi hosts connected by InfiniBand. All VMs: #vCPUs = #cores; 100% CPU overcommit; no memory overcommit.

InfiniBand MPI Performance Test. Application: NAMD; benchmark: STMV; 20-vCPU VMs for all tests; 60 MPI processes per job. Run time (seconds): bare metal 93.4; one vcluster 98.5; two vclusters running concurrently, 169.3 each, which is roughly 10% less total time than running the two jobs serially on bare metal (2 x 93.4, about 187 s).

Compute Accelerators: Enabling Machine Learning, Financial, and other HPC applications on vSphere

Shared NVIDIA GPGPU Computing. Setup: SuperMicro dual 12-core system with a 16GB NVIDIA P100 GPU; ESXi 6.5 with the NVIDIA GRID 5.0 host driver; two Linux VMs, each with an 8Q GPU profile, each running a TensorFlow RNN workload on CUDA and the NVIDIA guest driver. GRID scheduling policies: fixed share, equal share, best effort.
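
A minimal sketch of verifying, inside one of the guests, that the vGPU profile assigned to the VM is visible to the framework and can run a trivial op. It assumes a recent TensorFlow 2.x build with GPU support; the talk's setup used an earlier TensorFlow, so treat this as illustrative rather than the exact configuration.

```python
# Minimal sketch (assumption: TensorFlow 2.x with GPU support in the guest).
# Confirms the vGPU is visible and runs a small matmul on it.
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
print("Visible GPUs:", gpus)

if gpus:
    with tf.device("/GPU:0"):
        a = tf.random.uniform((1024, 1024))
        b = tf.random.uniform((1024, 1024))
        print("matmul checksum:", float(tf.reduce_sum(tf.matmul(a, b))))
```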

Shared NVIDIA GPGPU Computing: results for a single P100 shared by two 8Q VMs using the legacy scheduler.

Summary: Virtualization can support high-performance device sharing for cases in which extreme performance is a critical requirement. It delivers device sharing with near bare-metal performance for High Performance Computing, Big Data SPARK analytics, and machine and deep learning with GPGPUs. The VMware platform and partner ecosystem address the extreme performance needs of the most demanding emerging workloads.