A Case for High Performance Computing with Virtual Machines

A Case for High Performance Computing with Virtual Machines. Wei Huang*, Jiuxing Liu+, Bulent Abali+, and Dhabaleswar K. Panda* (*The Ohio State University, +IBM T. J. Watson Research Center)

Presentation Outline Virtual Machine environment and HPC Background -- VMM-bypass I/O A framework for HPC with virtual machines A prototype implementation Performance evaluation Conclusion

What is a Virtual Machine Environment? A Virtual Machine environment provides a virtualized hardware interface to VMs through a Virtual Machine Monitor (VMM). A physical node may host several VMs, each running a separate OS. Benefits: ease of management, performance isolation, system security, checkpoint/restart, live migration.

Why HPC with Virtual Machines? Ease of management. Customized OS: light-weight OSes customized for applications can potentially gain performance benefits [FastOS], but have not been widely adopted due to management difficulties; VMs make this practical. System security. [FastOS]: Forum to Address Scalable Technology for Runtime and Operating Systems

Why HPC with Virtual Machines? Ease of management. Customized OS. System security: currently, most HPC environments disallow users from performing privileged operations (e.g., loading customized kernel modules), which limits productivity and convenience. Users can do anything inside a VM; in the worst case they crash a VM, not the whole system.

But Performance? [Chart: normalized execution time of the NAS Parallel Benchmarks (MPICH over TCP) in the Xen VM environment, VM vs. native.] Communication-intensive benchmarks show poor results. Time profiling with Xenoprof shows where the CPU cycles go:

            CG      IS      EP      BT      SP
Dom0     16.6%   18.1%    0.6%    6.1%    9.7%
VMM      10.7%   13.1%    0.3%    4.0%    6.5%
DomU     72.7%   68.8%   99.0%   89.9%   83.8%

Many CPU cycles are spent in the VMM and the device domain to process network I/O requests.

Challenges I/O virtualization overhead [USENIX 06]: evaluation of VMM-bypass I/O with HPC benchmarks. A framework to virtualize the cluster environment: jobs require multiple processes distributed across multiple physical nodes, and typically all nodes must have the same setup. How to allow customized OSes? How to reduce other virtualization overheads (memory, storage, etc.)? How to reconfigure nodes and start jobs efficiently? [USENIX 06]: J. Liu, W. Huang, B. Abali, D. K. Panda. High Performance VMM-bypass I/O in Virtual Machines

Presentation Outline Virtual Machines and HPC Background -- VMM-bypass I/O A framework for HPC with virtual machines A prototype implementation Performance evaluation Conclusion

VMM-Bypass I/O [Diagram: Dom0 hosts a backend module and a privileged module on top of the VMM and the device; a guest VM runs the application and a guest module, with privileged-access and VMM-bypass access paths.] Original scheme: the guest module contacts the privileged domain to complete I/O; packets are sent to the backend module and then out through the privileged module (e.g., the drivers). The extra communication and domain switches are very costly. VMM-bypass I/O: guest modules in the guest VMs handle setup and management operations through the privileged path; once things are set up, devices can be accessed directly from the guest VMs (VMM-bypass access). This requires the device to have an OS-bypass feature, e.g., InfiniBand, and can achieve native-level performance.
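
To make the split between the privileged setup path and the VMM-bypass data path concrete, here is a minimal libibverbs-style sketch (not taken from the slides; error handling and the out-of-band queue-pair connection to the peer are omitted, and the qp/pd arguments are assumed to be set up elsewhere):

    /* Minimal sketch of OS/VMM-bypass I/O with InfiniBand verbs.
     * Resource setup (memory registration) is a privileged operation
     * that may be routed through the backend in Dom0; the data-path
     * call (ibv_post_send) goes to the HCA directly from the guest,
     * bypassing the VMM. */
    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    int send_once(struct ibv_qp *qp, struct ibv_pd *pd)
    {
        char *buf = malloc(4096);
        if (!buf) return -1;
        memset(buf, 0, 4096);

        /* Privileged (setup) path: pin and register the buffer once. */
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, 4096, IBV_ACCESS_LOCAL_WRITE);
        if (!mr) return -1;

        struct ibv_sge sge = {
            .addr = (uintptr_t)buf, .length = 4096, .lkey = mr->lkey
        };
        struct ibv_send_wr wr = {
            .sg_list = &sge, .num_sge = 1, .opcode = IBV_WR_SEND,
            .send_flags = IBV_SEND_SIGNALED
        };
        struct ibv_send_wr *bad;

        /* VMM-bypass (data) path: posted directly to the hardware. */
        return ibv_post_send(qp, &wr, &bad);
    }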

Presentation Outline Virtual Machines and HPC Background -- VMM-bypass I/O A framework for HPC with virtual machines A prototype implementation Performance evaluation Conclusion

Framework for VM-based Computing [Diagram: a front-end node, a management module, a VM image manager, storage nodes, and physical nodes running VMs on a VMM, with paths for job/VM requests, VM instantiation / job launch, and image distribution / application data.] Physical nodes: each runs the VM environment, typically with no more VM instances than physical CPUs; customized OSes are achieved through the different image versions used to instantiate the VMs. Front-end node: users submit jobs and customized versions of VMs. Management module: batch job processing, instantiating VMs and launching jobs. VM image manager: updates user VMs and matches user requests with VM image versions. Storage: stores the different versions of the VM images and application-generated data, and supports fast distribution of the VM images.

How does it work? [Diagram: user requests flow from the front-end to the management module, which queries the VM image manager for a matching image, triggers image distribution from the storage nodes, and instantiates VMs / launches jobs on the physical resources.] User requests specify the number of VMs, the number of VCPUs per VM, operating systems, kernels, libraries, etc., or refer to a previously submitted version of a VM image. Matching requests: many algorithms have been studied in grid environments, e.g., the Matchmaker in Condor.
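
As an illustration only (the slides do not give a concrete request format), such a request to the management module could be captured by a small descriptor like the following; the type and field names are hypothetical:

    /* Hypothetical descriptor for a job submission to the management
     * module; the names are illustrative, not from the prototype. */
    struct vm_job_request {
        int num_vms;                /* number of VM instances to launch       */
        int vcpus_per_vm;           /* VCPUs per VM (<= physical CPUs/node)   */
        const char *os_image;       /* requested OS / image version           */
        const char *kernel;         /* customized kernel, if any              */
        const char **libraries;     /* user-specific libraries to include     */
        const char *prev_image_id;  /* or: reuse a previously submitted image */
    };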

Prototype Setup A Xen-based VM environment on an eight-node SMP cluster with InfiniBand. Each node has dual Intel Xeon 3.0GHz processors and 2GB of memory. Xen-3.0.1: an open-source, high-performance VMM originally developed at the University of Cambridge. InfiniBand: a high-performance interconnect with OS-bypass features.

Prototype Implementation Reducing virtualization overhead. I/O overhead: Xen-IB, the VMM-bypass I/O implementation for InfiniBand in the Xen environment. Memory overhead, including the memory footprints of the VMM and the OSes in the VMs: the VMM footprint can be as small as 20KB per extra domain; the guest OSes are specifically tuned for HPC, reduced to 23MB at fresh boot-up in our prototype.

Prototype Implementation Reducing the VM image management cost: VM images must be as small as possible to be stored and distributed efficiently. Images based on ttylinux can be as small as 30MB, containing the basic system, MPI libraries, communication libraries, and any user-specific libraries. Image distribution: images are distributed through a binomial tree. VM image caching: images are cached at the physical nodes as long as there is enough local storage. Left to future work: VM-aware storage to further reduce the storage overhead, and matching and scheduling.
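
The binomial-tree distribution can be sketched as follows (a generic illustration, not the prototype's code): in each round, every node that already holds the image forwards it to one node that does not, so all N nodes are covered in ceil(log2 N) rounds.

    /* Sketch of binomial-tree image distribution across `nprocs` nodes,
     * with rank 0 holding the image initially. All nodes have the image
     * after ceil(log2(nprocs)) rounds. */

    /* Hypothetical transport helpers (e.g. over TCP or an IB stream). */
    void send_image(int dst_rank, const char *image_path);
    void recv_image(int src_rank, const char *image_path);

    void distribute_image(int rank, int nprocs, const char *image_path)
    {
        for (int step = 1; step < nprocs; step <<= 1) {
            if (rank < step) {                        /* already has the image */
                if (rank + step < nprocs)
                    send_image(rank + step, image_path);
            } else if (rank < 2 * step) {             /* receives this round   */
                recv_image(rank - step, image_path);
            }
        }
    }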

Presentation Outline Virtual Machines and HPC Background -- VMM-bypass I/O A framework for HPC with virtual machines A prototype implementation Performance evaluation Conclusion

Performance Evaluation Outline Focused on MPI applications. MVAPICH: a high-performance MPI implementation over InfiniBand from the Ohio State University, currently used by over 370 organizations across 30 countries. Micro-benchmarks, application-level benchmarks (NAS & HPL), and other virtualization overheads (memory overhead, startup time, image distribution, etc.).
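
The latency numbers on the next slide come from a standard ping-pong pattern between two MPI ranks; a minimal sketch of such a micro-benchmark (illustrative, not the exact benchmark code used for the slides) looks like this:

    /* Minimal MPI ping-pong latency sketch between ranks 0 and 1. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int iters = 1000, size = 4;   /* 4-byte messages */
        char buf[4096];
        MPI_Status st;

        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
            } else if (rank == 1) {
                MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
                MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        if (rank == 0)
            printf("one-way latency: %.2f us\n",
                   (MPI_Wtime() - t0) * 1e6 / (2.0 * iters));
        MPI_Finalize();
        return 0;
    }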

Micro-benchmarks [Charts: MPI latency (us) and bandwidth (MB/s) versus message size, Xen vs. native.] Latency and bandwidth are measured between two VMs on two different nodes; performance in the VM environment matches the native numbers. The registration cache is in effect: data are sent from the same user buffer multiple times, and since InfiniBand requires memory registration, the tests benefit from the registration cache. The registration cost (a privileged operation) is higher in the VM environment.
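
The registration cache mentioned above can be pictured as a lookup table that returns an existing memory registration when the same user buffer is reused, so the registration call (more expensive inside a VM, being a privileged operation) is paid only once per buffer. A simplified sketch, assuming libibverbs and a tiny linear-scan cache (real caches, e.g. in MVAPICH, also handle deregistration and overlapping ranges):

    /* Simplified registration-cache sketch: reuse an existing ibv_mr when
     * the same (address, length) buffer is sent again, so ibv_reg_mr runs
     * only on a miss. Purely illustrative. */
    #include <infiniband/verbs.h>
    #include <stddef.h>

    #define CACHE_SLOTS 64

    struct reg_entry { void *addr; size_t len; struct ibv_mr *mr; };
    static struct reg_entry cache[CACHE_SLOTS];
    static int cache_used;

    struct ibv_mr *get_registration(struct ibv_pd *pd, void *addr, size_t len)
    {
        for (int i = 0; i < cache_used; i++)          /* hit: no privileged call */
            if (cache[i].addr == addr && cache[i].len == len)
                return cache[i].mr;

        struct ibv_mr *mr = ibv_reg_mr(pd, addr, len, /* miss: register buffer  */
                                       IBV_ACCESS_LOCAL_WRITE);
        if (mr && cache_used < CACHE_SLOTS)
            cache[cache_used++] = (struct reg_entry){ addr, len, mr };
        return mr;
    }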

Micro-benchmarks (2) [Charts: MPI latency (us) and bandwidth (MB/s) versus message size without the registration cache, Xen vs. native.] This set of results is taken with the registration cache disabled. In MVAPICH, small messages are sent through pre-registered buffers, so the difference appears only for medium to large messages (>16KB). Latency is consistently around 200us higher in the VM environment; the bandwidth difference is smaller due to the potential overlap of registration and communication. This is the worst-case scenario: many applications show good buffer reuse.

HPC Benchmarks (NAS) [Chart: normalized execution time of the NAS Parallel Benchmarks, VM vs. native.] The NAS Parallel Benchmarks achieve similar performance in the VM and native environments. Time profiling with Xenoprof shows that most time is spent in effective computation in the DomUs:

            BT      CG      EP      FT      IS      LU      MG      SP
Dom0      0.4%    0.6%    0.6%    1.6%    3.6%    0.6%    1.8%    0.3%
VMM       0.2%    0.3%    0.3%    0.5%    1.9%    0.3%    1.0%    0.1%
DomU     99.4%   99.0%   99.3%   97.9%   94.5%   99.0%   97.3%   99.6%

HPC Benchmarks (HPL) [Chart: HPL GFLOPS on 2, 4, 8, and 16 processes, Xen vs. native.] The achievable GFLOPS in the VM and native environments are within 1% of each other.

Management Overhead

                  Startup   Shutdown   Memory
ttylinux-domU       5.3s       5.0s    23.6MB
AS4-domU           24.1s      13.2s    77.1MB
AS4-native         58.9s      18.4s    90.0MB

Image distribution time (VM image size ~30MB), by number of nodes:

                    1       2       4       8
Binomial tree     1.3s    2.8s    3.7s    5.0s
NFS               4.1s    6.2s   12.1s   16.1s

Reduced services allow a VM to be started very efficiently. The small image size and the binomial-tree distribution make image distribution fast.

Conclusion We proposed a framework for using a VM-based computing environment for HPC applications. We explained how the disadvantages of virtual machines can be addressed with current technologies, and demonstrated this with a prototype implementation of our framework. We carried out detailed performance evaluations of the overhead of VM-based computing for HPC applications, which show that the virtualization cost is marginal. Our case study holds promise for bringing the benefits of VMs to the area of HPC.

Future work Migration support for VM-based computing environments with VMM-bypass I/O. Investigating scheduling and resource management schemes. More detailed evaluations of VM-based computing environments.

Acknowledgements Our research at the Ohio State University is supported by the following organizations: [slide shows current funding and equipment sponsors]

Thank you! Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/