Designing and Building Efficient HPC Cloud with Modern Networking Technologies on Heterogeneous HPC Clusters

Designing and Building Efficient HPC Cloud with Modern Networking Technologies on Heterogeneous HPC Clusters
Jie Zhang
Dr. Dhabaleswar K. Panda (Advisor)
Department of Computer Science & Engineering, The Ohio State University

Outline
- Introduction
- Problem Statement
- Detailed Designs and Results
- Impact on HPC Community
- Conclusion

Cloud Computing and Virtualization
- Cloud computing focuses on maximizing the effectiveness of shared resources
- Virtualization is the key technology behind cloud computing and is widely adopted in industry computing environments
- IDC forecasts that worldwide public IT cloud services spending will reach $195 billion by 2020 (Courtesy: http://www.idc.com/getdoc.jsp?containerid=prus41669516)

Drivers of Modern HPC Cluster and Cloud Architecture
- Multi-/many-core processors
- Accelerators (GPUs/co-processors)
- Large memory nodes (up to 2 TB)
- Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
- Single Root I/O Virtualization (SR-IOV)
- High-performance interconnects: InfiniBand (with SR-IOV), <1 usec latency, 200 Gbps bandwidth
- Example systems: SDSC Comet, TACC Stampede, cloud deployments

Single Root I/O Virtualization (SR-IOV)
- Allows a single physical device, or Physical Function (PF), to present itself as multiple virtual devices, or Virtual Functions (VFs)
- VFs are designed based on the existing non-virtualized PFs, so no driver changes are needed
- Each VF can be dedicated to a single VM through PCI passthrough
- SR-IOV provides new opportunities to design HPC clouds with very low overhead by bypassing the hypervisor
[Figure: SR-IOV architecture - guests with VF drivers, hypervisor with the PF driver, I/O MMU, PCI Express, and SR-IOV hardware exposing one Physical Function and multiple Virtual Functions]
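As a concrete illustration of why this is attractive (a minimal sketch, not part of the talk), the program below uses libibverbs to list the HCAs visible inside a guest; with a VF passed through via PCI passthrough, the VF shows up as an ordinary InfiniBand device that unmodified verbs and MPI codes can use.

```c
/* Minimal sketch (not from the talk): enumerate the HCAs visible inside a
 * guest. With an SR-IOV Virtual Function passed through via PCI passthrough,
 * the VF appears here as an ordinary InfiniBand device, so unmodified
 * verbs/MPI codes can use it directly. Assumes libibverbs is installed in
 * the guest; build with: gcc list_hcas.c -libverbs */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num_devices = 0;
    struct ibv_device **devs = ibv_get_device_list(&num_devices);
    if (!devs) {
        perror("ibv_get_device_list");
        return 1;
    }
    for (int i = 0; i < num_devices; i++) {
        struct ibv_context *ctx = ibv_open_device(devs[i]);
        struct ibv_device_attr attr;
        if (ctx && ibv_query_device(ctx, &attr) == 0)
            printf("%s: %d port(s), max QPs %d\n",
                   ibv_get_device_name(devs[i]),
                   attr.phys_port_cnt, attr.max_qp);
        if (ctx)
            ibv_close_device(ctx);
    }
    ibv_free_device_list(devs);
    return 0;
}
```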

Does SR-IOV alone suffice to build an efficient HPC cloud? No.
- It does not support locality-aware communication: co-located VMs still have to communicate through the SR-IOV channel
- It does not support VM migration, because of device passthrough
- It does not properly manage and isolate critical virtualized resources

Problem Statements
- Can the MPI runtime be redesigned to provide virtualization support for VMs/containers when building HPC clouds?
- How much benefit can be achieved on HPC clouds with the redesigned runtime for scientific kernels and applications?
- Can fault tolerance/resilience (live migration) be supported on SR-IOV-enabled HPC clouds?
- Can we co-design with resource management and scheduling systems to enable HPC clouds on modern HPC systems?

Research Framework

MVAPICH2 Project
- High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
- MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0); started in 2001, first version available in 2002
- MVAPICH2-X (MPI + PGAS), available since 2011
- Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
- Support for virtualization (MVAPICH2-Virt), available since 2015
- Support for energy awareness (MVAPICH2-EA), available since 2015
- Support for InfiniBand network analysis and monitoring (OSU INAM) since 2015
- Used by more than 2,825 organizations in 85 countries
- More than 432,000 (> 0.4 million) downloads from the OSU site directly
- Empowering many TOP500 clusters (Jul '17 ranking):
  - 1st-ranked 10,649,640-core cluster (Sunway TaihuLight) at NSC, Wuxi, China
  - 15th-ranked 241,108-core cluster (Pleiades) at NASA
  - 20th-ranked 522,080-core cluster (Stampede) at TACC
  - 44th-ranked 74,520-core cluster (Tsubame 2.5) at Tokyo Institute of Technology
  - and many others
- Available with software stacks of many vendors and Linux distros (RedHat and SuSE)
- http://mvapich.cse.ohio-state.edu
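For context, the libraries listed above expose the standard MPI interface, so a minimal MPI program is enough to exercise them. The generic sketch below is not from the talk; mpicc and mpirun_rsh are assumed to come from an MVAPICH2 installation on the PATH.

```c
/* Generic MPI example (not specific to this work): any MVAPICH2 family
 * library can compile and run it, e.g.
 *   mpicc hello.c -o hello
 *   mpirun_rsh -np 4 -hostfile hosts ./hello
 * (mpicc and mpirun_rsh assumed to come from an MVAPICH2 installation). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);
    printf("rank %d of %d on %s\n", rank, size, name);
    MPI_Finalize();
    return 0;
}
```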

Locality-aware Communication with SR-IOV and IVShmem
[Figure: MPI library architecture in native and virtualized environments - application and ADI3 layers on top of a communication coordinator and locality detector, with SMP, IVShmem, and SR-IOV channels built over shared memory and InfiniBand device APIs]
- One MPI library runs in both native and virtualized environments
- In the virtualized environment:
  - Supports shared-memory channels (SMP, IVShmem) and the SR-IOV channel
  - Locality detection
  - Communication coordination
  - Communication optimizations on the different channels (SMP, IVShmem, SR-IOV; RC, UD)
J. Zhang, X. Lu, J. Jose and D. K. Panda, High Performance MPI Library over SR-IOV Enabled InfiniBand Clusters, The International Conference on High Performance Computing (HiPC '14), Dec 2014
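To illustrate the coordination logic, the sketch below (hypothetical names, not the MVAPICH2-Virt source) shows how a locality-aware coordinator might map a destination rank's locality onto the three channels.

```c
/* Illustrative sketch only (hypothetical names, not MVAPICH2-Virt internals):
 * how a locality-aware coordinator might pick a channel for a message.
 * - Peers in the same VM            -> shared-memory (SMP) channel
 * - Peers in co-resident VMs        -> IVShmem channel (host shared memory)
 * - Peers on a remote physical node -> SR-IOV channel (VF of the IB HCA)   */
#include <stdbool.h>
#include <stdio.h>

typedef enum { CH_SMP, CH_IVSHMEM, CH_SRIOV } channel_t;

struct peer_locality {
    bool same_vm;    /* destination rank runs in the same guest      */
    bool same_host;  /* destination rank runs in a co-resident guest */
};

static channel_t select_channel(const struct peer_locality *loc)
{
    if (loc->same_vm)
        return CH_SMP;       /* intra-VM: plain shared memory           */
    if (loc->same_host)
        return CH_IVSHMEM;   /* inter-VM, same host: bypass the network */
    return CH_SRIOV;         /* inter-node: go through the SR-IOV VF    */
}

int main(void)
{
    struct peer_locality remote = { .same_vm = false, .same_host = false };
    struct peer_locality co_vm  = { .same_vm = false, .same_host = true  };
    printf("remote peer         -> %s\n",
           select_channel(&remote) == CH_SRIOV ? "SR-IOV" : "other");
    printf("co-resident VM peer -> %s\n",
           select_channel(&co_vm) == CH_IVSHMEM ? "IVShmem" : "other");
    return 0;
}
```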

Application Performance (NAS & P3DFFT)
[Charts: NAS, 32 VMs (8 VMs per node), Class B, execution time for FT, LU, CG, MG, IS; P3DFFT, 32 VMs (8 VMs per node), 512x512x512, execution time for SPEC, INVERSE, RAND, SINE; SR-IOV vs. proposed design]
- The proposed design delivers up to 43% (IS) improvement for NAS
- The proposed design brings 29%, 33%, 29%, and 20% improvement for INVERSE, RAND, SINE, and SPEC, respectively

SR-IOV-enabled VM Migration Support on HPC Clouds

High Performance SR-IOV enabled VM Migration Framework for MPI Applications
[Figure: migration framework - source and target hosts running guest VMs with SR-IOV VFs and IVShmem, hypervisors with IB and Ethernet adapters, plus a network suspend trigger, ready-to-migrate detector, migration-done detector, reactive notifier, and migration trigger coordinated by a controller]
Two challenges:
1. Detach/re-attach virtualized devices
2. Maintain IB connections
- Challenge 1: multiple parallel MPI libraries must coordinate with the VM during migration (detach/re-attach SR-IOV/IVShmem devices, migrate the VM, track migration status)
- Challenge 2: the MPI runtime handles suspending and reactivating IB connections
- Proposed Progress Engine (PE) based and Migration Thread (MT) based designs to optimize VM migration and application performance
J. Zhang, X. Lu and D. K. Panda, High-Performance Virtual Machine Migration Framework for MPI Applications on SR-IOV Enabled InfiniBand Clusters, IPDPS 2017

Proposed Design of the MPI Runtime
[Timing diagrams comparing four cases for process P0: no migration; the Progress Engine (PE) based design; the Migration-Thread (MT) based design, typical scenario; and the MT based design, worst scenario. Each shows computation, the pre-migration, ready-to-migrate, migration, and post-migration phases, the suspend (S) and reactivate (R) points of the communication channels, lock/unlock of communication, control messages, migration signal detection, and the down-time for VM live migration.]
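The sketch below illustrates the Migration-Thread (MT) idea with invented helper names (it is not the library's code): a helper thread waits for the controller's ready-to-migrate signal, suspends the IB channels, and reactivates them once the VM and its re-attached VF are back, while the main thread keeps computing.

```c
/* Hypothetical sketch (names invented, not the actual MVAPICH2-Virt code) of
 * the migration-thread (MT) design: a helper thread coordinates with the
 * migration controller so the main rank keeps computing while the VM (with
 * its SR-IOV VF detached) migrates; IB channels are suspended only around the
 * actual move and reactivated afterwards. Build with: gcc mt.c -pthread */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void suspend_ib_channels(void)    { puts("suspend IB channels (drain QPs)"); }
static void reactivate_ib_channels(void) { puts("re-establish QPs over re-attached VF"); }
static void wait_for_signal(const char *s) { printf("waiting for %s\n", s); sleep(1); }

static void *migration_thread(void *arg)
{
    (void)arg;
    wait_for_signal("ready-to-migrate");   /* from controller via IVShmem */
    suspend_ib_channels();                 /* pre-migration phase         */
    wait_for_signal("migration-done");     /* VM moved, VF re-attached    */
    reactivate_ib_channels();              /* post-migration phase        */
    return NULL;
}

int main(void)
{
    pthread_t tid;
    pthread_create(&tid, NULL, migration_thread, NULL);
    /* Main thread: application computation continues, overlapping with the
     * coordination above (the "MT-typical" case in the results that follow). */
    for (int i = 0; i < 3; i++) { puts("compute step"); sleep(1); }
    pthread_join(tid, NULL);
    return 0;
}
```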

Application Performance
[Charts: NAS (LU.B, EP.B, MG.B, CG.B, FT.B) and Graph500 (20,10; 20,16; 20,20; 22,10) execution times for PE, MT-worst, MT-typical, and NM (no migration)]
- 8 VMs in total; 1 VM carries out migration while the application is running
- Compared with NM, MT-worst and PE incur some overhead
- MT-typical allows the migration to be completely overlapped with computation

High Performance Communication for Nested Virtualization
[Figure: two-socket NUMA node running two VMs, each hosting two containers pinned to cores, with a Container Locality Detector and VM Locality Detector feeding a Nested Locality Combiner (together the Two-Layer Locality Detector), which drives the Two-Layer NUMA Aware Communication Coordinator over the CMA, shared-memory (SHM), and network (HCA) channels]
- Two-Layer Locality Detector: dynamically detects processes in co-resident containers inside one VM as well as those in co-resident VMs
- Two-Layer NUMA Aware Communication Coordinator: leverages nested locality information, NUMA architecture information, and message characteristics to select the appropriate communication channel (CMA, SHM, or HCA)
J. Zhang, X. Lu and D. K. Panda, Designing Locality and NUMA Aware MPI Runtime for Nested Virtualization based HPC Cloud with SR-IOV Enabled InfiniBand, The 13th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE '17), April 2017

Two-Layer NUMA Aware Communication Coordinator
- Nested Locality Loader: reads the locality information of the destination process from the Two-Layer Locality Detector
- NUMA Loader: reads VM/container placement information to decide on which NUMA node the destination process is pinned
- Message Parser: obtains message attributes, e.g., message type and message size
- The coordinator combines this information to select the CMA, shared-memory (SHM), or network (HCA) channel
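To show how these pieces might fit together, the sketch below combines the two locality layers, NUMA placement, and message size to pick a channel; the structures, mapping, and threshold are hypothetical, not the library's actual policy.

```c
/* Illustrative sketch (hypothetical structures and thresholds, not the
 * library's actual policy): combining container-level and VM-level locality
 * with NUMA placement and message size to choose among the CMA,
 * shared-memory (SHM), and network (HCA) channels. */
#include <stdio.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { CH_CMA, CH_SHM, CH_HCA } nested_channel_t;

struct nested_locality {
    bool same_container;   /* from the container-level detector */
    bool same_vm;          /* from the VM-level detector        */
    int  my_numa, peer_numa;
};

static nested_channel_t coordinate(const struct nested_locality *l, size_t msg_bytes)
{
    if (!l->same_vm)
        return CH_HCA;               /* different VM or node: network channel */
    if (l->same_container && msg_bytes >= 8192)
        return CH_CMA;               /* large intra-container message: CMA
                                        (8 KB threshold is invented here)    */
    /* co-resident containers inside one VM: shared memory; NUMA info
     * (my_numa vs. peer_numa) could further bias thresholds               */
    return CH_SHM;
}

int main(void)
{
    struct nested_locality l = { .same_container = true, .same_vm = true,
                                 .my_numa = 0, .peer_numa = 1 };
    printf("64 KB intra-container message -> %s\n",
           coordinate(&l, 65536) == CH_CMA ? "CMA" : "SHM/HCA");
    return 0;
}
```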

Application Performance
[Charts: Graph500 BFS execution time (scales 22,20 through 28,16) and Class D NAS execution time (IS, MG, EP, FT, CG, LU) for the Default, 1Layer, and 2Layer-Enhanced-Hybrid designs]
- 256 processes across 64 containers on 16 nodes
- Compared with Default, the enhanced-hybrid design reduces execution time by up to 16% (28,16) for Graph500 and 10% (LU) for NAS
- Compared with the 1Layer case, the enhanced-hybrid design brings up to 12% (28,16) and 6% (LU) performance benefit

Typical Usage Scenarios
- Exclusive Allocations, Sequential Jobs (EASJ)
- Shared-host Allocations, Concurrent Jobs (SACJ)
- Exclusive Allocations, Concurrent Jobs (EACJ)
[Figure: compute nodes running VMs under each allocation scenario]

Slurm-V Architecture Overview
[Figure: the user submits an sbatch file and a VM configuration file; SLURMctld requests physical resources and obtains the physical node list; SLURMd daemons on each physical node, together with libvirtd, launch VMs (with SR-IOV VFs and IVSHMEM devices) using images from an image pool on Lustre]
VM launching/reclaiming tasks:
1. SR-IOV virtual function allocation
2. IVSHMEM device setup
3. Network setting
4. Image management
5. Launching VMs and checking availability
6. Mounting global storage, etc.

Alternative Designs of Slurm-V
- Slurm SPANK plugin based design (see the sketch after this list):
  - Utilizes a SPANK plugin to read the VM configuration and launch/reclaim VMs
  - File-based locking to detect occupied VFs and exclusively allocate free VFs
  - Assigns a unique ID to each IVSHMEM device and dynamically attaches it to each VM
  - Inherits advantages from Slurm: coordination, scalability, security
- Slurm SPANK plugin over OpenStack based design:
  - Offloads VM launch/reclaim to the underlying OpenStack framework
  - Uses a PCI whitelist to pass through free VFs to VMs
  - Extends Nova to enable IVSHMEM when launching VMs
  - Inherits advantages from both OpenStack and Slurm: component optimization, performance
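As a rough illustration of the SPANK-based approach, here is a minimal plugin skeleton. It is a sketch assuming the standard <slurm/spank.h> API and is built as a shared object (gcc -shared -fPIC); the --vm-config option name and its handling are hypothetical. The actual Slurm-V plugin additionally allocates free VFs with file-based locks, attaches the IVSHMEM device, boots the VMs, and reclaims them at job exit.

```c
/* Minimal SPANK plugin skeleton (sketch under the standard <slurm/spank.h>
 * API; the real Slurm-V plugin does far more: VF allocation with file-based
 * locks, IVSHMEM attachment, VM boot and reclamation). The --vm-config
 * option is a hypothetical example. */
#include <slurm/spank.h>

SPANK_PLUGIN(slurm_v_sketch, 1);

static const char *vm_config = NULL;   /* path passed via --vm-config=<file> */

static int vm_config_cb(int val, const char *optarg, int remote)
{
    (void)val; (void)remote;
    vm_config = optarg;
    return ESPANK_SUCCESS;
}

static struct spank_option vm_opt = {
    .name    = "vm-config",
    .arginfo = "file",
    .usage   = "VM configuration file (image, number of VFs, IVSHMEM size)",
    .has_arg = 1,
    .val     = 0,
    .cb      = vm_config_cb,
};

int slurm_spank_init(spank_t sp, int ac, char **av)
{
    (void)ac; (void)av;
    /* Register the job option so sbatch/srun accept --vm-config=<file>. */
    if (spank_option_register(sp, &vm_opt) != ESPANK_SUCCESS)
        return -1;
    return ESPANK_SUCCESS;
}

int slurm_spank_user_init(spank_t sp, int ac, char **av)
{
    (void)sp; (void)ac; (void)av;
    /* In a full implementation, VM launch would be triggered around here. */
    if (vm_config)
        slurm_info("slurm_v_sketch: would launch VMs from %s", vm_config);
    return ESPANK_SUCCESS;
}
```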

Application Performance (Graph500 with 64 Processes across 8 Nodes on Chameleon)
[Charts: Graph500 BFS execution time, VM vs. native, for the EASJ, SACJ, and EACJ scenarios at several problem sizes (scale, edgefactor), showing roughly 4%, 9%, and 6% overhead, respectively]
- 32 VMs across 8 nodes, 6 cores per VM
- EASJ: compared to native, less than 4% overhead
- SACJ and EACJ: less than 9% (SACJ) and 6% (EACJ) overhead when running NAS as a concurrent job with 64 processes

Impact on HPC and Cloud Communities
- Designs available through the MVAPICH2-Virt library: http://mvapich.cse.ohio-state.edu/download/mvapich/virt/mvapich2-virt-2.2-1.el7.centos.x86_64.rpm
- Complex appliances available on Chameleon Cloud: MPI bare-metal cluster (https://www.chameleoncloud.org/appliances/29/) and MPI + SR-IOV KVM cluster (https://www.chameleoncloud.org/appliances/28/)
- Enables users to easily and quickly deploy HPC clouds and run jobs with high performance
- Enables administrators to efficiently manage and schedule cluster resources

Conclusion
- Addresses key issues in building efficient HPC clouds
- Optimizes communication on various HPC clouds
- Presents live migration designs to provide fault tolerance on HPC clouds
- Presents co-designs with resource management and scheduling systems
- Demonstrates the corresponding benefits on modern HPC clusters
- Broader outreach through MVAPICH2-Virt public releases and complex appliances on the Chameleon Cloud testbed

Thank You! & Questions?
zhang.2794@osu.edu
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
MVAPICH Web Page: http://mvapich.cse.ohio-state.edu/