Improving Application Performance and Predictability using Multiple Virtual Lanes in Modern Multi-Core InfiniBand Clusters

Improving Application Performance and Predictability using Multiple Virtual Lanes in Modern Multi-Core InfiniBand Clusters
Hari Subramoni, Ping Lai, Sayantan Sur and Dhabaleswar K. Panda
Department of Computer Science & Engineering, The Ohio State University

Outline: Introduction & Motivation, Problem Statement, Design, Performance Evaluation and Results, Conclusions and Future Work

Introduction & Motivation
[Figure: fat-tree cluster topology built from line card switches and fabric card switches]
- Supercomputing clusters are growing in size and scale
- MPI is the predominant programming model for HPC
- High performance interconnects like InfiniBand have increased network capacity
- Compute capacity outstrips network capacity with the advent of multi-/many-core processors
- The problem is aggravated as jobs get assigned to random nodes and share links

Analysis of Traffic Pattern in a Supercomputer
- Traffic flow in the Ranger supercomputer at the Texas Advanced Computing Center (TACC) shows heavy link sharing (http://www.tacc.utexas.edu)
[Figure, courtesy TACC: dot colors denote network elements (green), line card switches (black) and fabric card switches (red); link colors denote the number of streams: black = 1, blue = 2, red = 3-4, orange = 5-8, green = > 8]

Possible Issue with Link Sharing
[Figure: compute nodes communicating through a shared switch]
- Few communicating peers: no problem
- Packets get backed up as the number of communicating peers increases
- Results in delayed arrival of packets at the destination

Frequency Distribution of Inter-Arrival Times
- Packet size 2 KB (results are the same for 1 KB to 16 KB)
- Inter-arrival time is directly proportional to the load on the links

Introduction & Motivation (Cont.)
- Can modern networks like InfiniBand alleviate this?

InfiniBand Architecture
- An industry standard for low-latency, high-bandwidth System Area Networks
- Multiple features:
  - Two communication types: channel semantics and memory semantics (RDMA mechanism)
  - Queue Pair (QP) based communication
  - Quality of Service (QoS) support through multiple Virtual Lanes (VLs); QPs are associated with VLs by means of pre-specified Service Levels (a sketch of this association follows)
- Multiple communication speeds available for Host Channel Adapters (HCAs): 10 Gbps (SDR) / 20 Gbps (DDR) / 40 Gbps (QDR)
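The binding between a connection and a virtual lane is fixed by the service level placed in the QP's address vector when the QP is moved to the ready-to-receive state; the subnet's SL-to-VL mapping tables then translate that SL into a VL at every hop. Below is a minimal sketch using plain verbs, not code taken from MVAPICH2; the remote LID/QPN/PSN and the chosen service level are assumed to come from the usual out-of-band connection setup.

```c
/* Minimal sketch: transition an RC QP (already in INIT) to RTR while
 * tagging it with a service level. The SL written here is what the
 * fabric's SL-to-VL tables later map onto a virtual lane. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int move_qp_to_rtr(struct ibv_qp *qp, uint16_t remote_lid,
                   uint32_t remote_qpn, uint32_t remote_psn, uint8_t sl)
{
    struct ibv_qp_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.qp_state           = IBV_QPS_RTR;
    attr.path_mtu           = IBV_MTU_2048;
    attr.dest_qp_num        = remote_qpn;
    attr.rq_psn             = remote_psn;
    attr.max_dest_rd_atomic = 1;
    attr.min_rnr_timer      = 12;
    attr.ah_attr.is_global  = 0;
    attr.ah_attr.dlid       = remote_lid;
    attr.ah_attr.sl         = sl;    /* this choice decides which VL the traffic uses */
    attr.ah_attr.port_num   = 1;

    return ibv_modify_qp(qp, &attr,
                         IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                         IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                         IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER);
}
```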

InfiniBand Network Buffer Architecture
[Figure: HCA buffer organization, with a common buffer pool and private buffers for Virtual Lanes 0-15 feeding a VL arbiter in front of the physical link]
- Buffers in most IB HCAs and switches are grouped into two classes: a common buffer pool and private per-VL buffers
- Most current generation MPIs use only one VL
  - Inefficient use of the available network resources
- Why not use more VLs?
  - Possible drawback: would it take more time to poll all the VLs?

Outline: Introduction & Motivation, Problem Statement, Design, Performance Evaluation and Results, Conclusions and Future Work

Problem Statement
- Can multiple virtual lanes be used to improve the performance of HPC applications?
- How can we integrate this design into an MPI library so that end applications benefit?

Outline: Introduction & Motivation, Problem Statement, Design, Performance Evaluation and Results, Conclusions and Future Work

Proposed Framework and Goals
[Figure: applications and the job scheduler sit on top of an MPI library that performs traffic distribution and traffic segregation over the InfiniBand network]
- No change to the application
- Re-design the MPI library to use multiple VLs
- Need new methods to take advantage of multiple VLs:
  - Traffic Distribution: load balance traffic across multiple VLs
  - Traffic Segregation: ensure one kind of traffic does not disturb another; distinguish between low- & high-priority traffic, and small & large messages

Proposed Design
- Re-design the MPI library to use multiple VLs
- Multiple Virtual Lanes configured with different characteristics (e.g., transmit fewer packets at high priority, more packets at lower priority)
- Multiple Service Levels (SLs) defined to match the VLs
- Queue Pairs (QPs) assigned the proper SLs at QP creation time
- Multiple ways to assign Service Levels to applications (sketched below):
  - Assign SLs with similar characteristics in a round-robin fashion (Traffic Distribution)
  - Assign SLs with the desired characteristics based on the type of application (Traffic Segregation)
  - Other designs being explored
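The two assignment policies can be illustrated with a small sketch. This is a hypothetical illustration only: the function names, the number of data SLs and the high-priority SL value are assumptions, not the actual MVAPICH2 implementation.

```c
/* Hypothetical sketch of the two SL-assignment policies described above.
 * NUM_DATA_SLS and SL_HIGH_PRIORITY are assumed values, not taken from
 * any real subnet configuration. */
#include <stddef.h>
#include <stdint.h>

#define NUM_DATA_SLS     8   /* SLs 0..7 assumed to map to equally weighted VLs */
#define SL_HIGH_PRIORITY 8   /* SL assumed to map to a high-priority VL         */

/* Traffic Distribution: spread connections round-robin over SLs whose
 * VLs have identical arbitration characteristics. */
static uint8_t pick_sl_distribution(int peer_rank)
{
    return (uint8_t)(peer_rank % NUM_DATA_SLS);
}

/* Traffic Segregation: choose the SL (and hence the VL) based on the kind
 * of traffic, so latency-sensitive messages do not queue behind bulk data. */
static uint8_t pick_sl_segregation(size_t message_size, int is_latency_sensitive)
{
    if (is_latency_sensitive || message_size <= 2048)
        return SL_HIGH_PRIORITY;
    return 0;  /* bulk traffic stays on a default, lower-priority SL */
}
```

The SL chosen by either policy would then be placed in the QP's address vector exactly as in the earlier sketch, leaving the SL-to-VL mapping and VL arbitration weights to the subnet manager configuration.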

Proposed Design (Cont.)
[Figure: the job scheduler, MPI library and application each select a Service Level; each Service Level maps to one of Virtual Lanes 0-15, which are multiplexed onto the physical link by the Virtual Lane Arbiter]

Outline: Introduction & Motivation, Problem Statement, Design, Performance Evaluation and Results, Conclusions and Future Work

Experimental Testbed
- Compute platform: Intel Nehalem
  - Intel Xeon E5530 dual quad-core processors operating at 2.40 GHz
  - 12 GB RAM, 8 MB cache, PCIe 2.0 interface
- Network equipment:
  - MT26428 QDR ConnectX HCAs
  - 36-port Mellanox QDR switch used to connect all the nodes
- Red Hat Enterprise Linux Server release 5.3 (Tikanga), OFED-1.4.2, OpenSM version 3.1.6
- Benchmarks:
  - Modified version of the OFED perftest suite for verbs-level tests
  - MPIBench collective benchmark
  - CPMD used for application-level evaluation

MVAPICH / MVAPICH2 Software
- High performance MPI library for IB and 10GE
  - MVAPICH (MPI-1) and MVAPICH2 (MPI-2)
- Used by more than 1,255 organizations in 59 countries
- More than 44,500 downloads from the OSU site directly
- Empowering many TOP500 clusters, including the 11th-ranked 62,976-core cluster (Ranger) at TACC
- Available with the software stacks of many IB, 10GE and server vendors, including the OpenFabrics Enterprise Distribution (OFED)
- Also supports the udapl device to work with any network supporting uDAPL
- http://mvapich.cse.ohio-state.edu/

Verbs Level Performance
- Tests use 8 communicating pairs, one QP per pair
- Packet size 2 KB (results are the same for 1 KB to 16 KB)
- Traffic distribution using multiple VLs results in more predictable inter-arrival times
- Slight increase in average latency
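The kind of inter-arrival data behind this evaluation can be gathered by timestamping receive completions as they are polled from a completion queue. The sketch below is an assumption-laden stand-in for the modified perftest used in the paper: the CQ and sample count are taken as given, and receive buffers are not reposted.

```c
/* Minimal sketch: record the gap (in microseconds) between consecutive
 * successful receive completions on one completion queue. */
#include <infiniband/verbs.h>
#include <stdio.h>
#include <time.h>

#define NUM_SAMPLES 10000

void measure_inter_arrival(struct ibv_cq *cq)
{
    struct ibv_wc wc;
    struct timespec prev, now;
    static double gaps_us[NUM_SAMPLES];
    int have_prev = 0, recorded = 0;

    while (recorded < NUM_SAMPLES) {
        int n = ibv_poll_cq(cq, 1, &wc);
        if (n < 0)
            break;                               /* polling error */
        if (n == 0 || wc.status != IBV_WC_SUCCESS)
            continue;                            /* nothing new; keep polling */
        clock_gettime(CLOCK_MONOTONIC, &now);
        if (have_prev)
            gaps_us[recorded++] = (now.tv_sec - prev.tv_sec) * 1e6 +
                                  (now.tv_nsec - prev.tv_nsec) / 1e3;
        prev = now;
        have_prev = 1;
        /* a full benchmark would repost the receive buffer here */
    }
    for (int i = 0; i < recorded; i++)
        printf("%.2f\n", gaps_us[i]);
}
```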

MPI Level Point-to-Point Performance
[Plots: latency (us), bandwidth (MBps) and message rate (millions of messages/s) versus message size (1 KB to 64 KB), comparing the 1-VL and 8-VLs configurations]
- Tests use 8 communicating pairs, one QP per pair
- Traffic distribution using multiple VLs results in better overall performance
- 13% performance improvement over the case with just one VL

MPI Level Collective Performance
[Plots: Alltoall latency (us) versus message size (1 KB to 16 KB). Traffic Distribution: 1-VL vs. 8-VLs. Traffic Segregation: two concurrent Alltoalls with and without segregation, compared against a single Alltoall]
- For a 64-process Alltoall, Traffic Distribution and Traffic Segregation through the use of multiple VLs result in better performance
- 20% performance improvement seen with Traffic Distribution
- 12% performance improvement seen with Traffic Segregation
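The "two concurrent Alltoalls" workload can be reproduced with a short MPI program that splits the job into two halves and runs an Alltoall in each half at the same time, so the two collectives contend for the same links. This is a sketch of such a workload, not the exact MPIBench-based benchmark used in the paper, and it assumes an even process count.

```c
/* Sketch: two concurrent Alltoalls over the two halves of MPI_COMM_WORLD.
 * Compile with mpicc and run with an even number of ranks. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, msg = 2048, iters = 1000;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Split into two sub-communicators of size/2 ranks each. */
    MPI_Comm half;
    MPI_Comm_split(MPI_COMM_WORLD, rank < size / 2, rank, &half);

    int half_size;
    MPI_Comm_size(half, &half_size);
    char *sbuf = malloc((size_t)msg * half_size);
    char *rbuf = malloc((size_t)msg * half_size);

    MPI_Barrier(MPI_COMM_WORLD);           /* start both Alltoalls together */
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Alltoall(sbuf, msg, MPI_CHAR, rbuf, msg, MPI_CHAR, half);
    double avg_us = (MPI_Wtime() - t0) / iters * 1e6;

    if (rank == 0)
        printf("average Alltoall latency: %.2f us\n", avg_us);

    free(sbuf); free(rbuf);
    MPI_Comm_free(&half);
    MPI_Finalize();
    return 0;
}
```

With segregation, the MPI library would place the two halves' QPs on different service levels (and hence different VLs), so one Alltoall's packets do not queue behind the other's.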

Application Level Performance
[Bar chart: normalized total time and time spent in Alltoall, comparing 1 VL and 8 VLs]
- CPMD application, 64 processes
- Traffic Distribution through the use of multiple VLs results in better performance
- 11% improvement in Alltoall performance
- 6% improvement in overall performance

Outline: Introduction & Motivation, Problem Statement, Design, Performance Evaluation and Results, Conclusions and Future Work

Conclusions & Future Work
- Explored the use of Virtual Lanes to improve the predictability and performance of HPC applications
- Integrated our scheme into the MVAPICH2 MPI library and conducted performance evaluations at various levels
- Consistent increase in performance in the verbs-, MPI- and application-level evaluations
- Future work: explore advanced schemes to improve performance using multiple virtual lanes
- The proposed solution will be available in future MVAPICH2 releases

Thank you!
{subramon, laipi, surs, panda}@cse.ohio-state.edu
Network-Based Computing Laboratory
http://mvapich.cse.ohio-state.edu/