Improving Application Performance and Predictability using Multiple Virtual Lanes in Modern Multi-Core InfiniBand Clusters

Improving Application Performance and Predictability using Multiple Virtual Lanes in Modern Multi-Core InfiniBand Clusters
Hari Subramoni, Ping Lai, Sayantan Sur and Dhabaleswar K. Panda
Department of Computer Science & Engineering, The Ohio State University

Outline: Introduction & Motivation, Problem Statement, Design, Performance Evaluation and Results, Conclusions and Future Work

Introduction & Motivation
[Figure: fat-tree cluster topology built from line card switches and fabric card switches]
- Supercomputing clusters are growing in size and scale
- MPI is the predominant programming model for HPC
- High performance interconnects like InfiniBand have increased network capacity
- Compute capacity outstrips network capacity with the advent of multi-/many-core processors
- The problem is aggravated as jobs get assigned to random nodes and share links

Analysis of Traffic Pattern in a Supercomputer
- Traffic flow in the Ranger supercomputer at the Texas Advanced Computing Center (TACC) shows heavy link sharing (http://www.tacc.utexas.edu)
[Figure, courtesy TACC: dot colors denote network elements (green), line card switches (black) and fabric card switches (red); link colors denote the number of streams: black = 1, blue = 2, red = 3-4, orange = 5-8, green = > 8]

Possible Issue with Link Sharing
[Figure: compute nodes communicating through a shared switch]
- Few communicating peers: no problem
- Packets get backed up as the number of communicating peers increases
- Results in delayed arrival of packets at the destination

Frequency Distribution of Inter-Arrival Times
- Packet size 2 KB (results are the same for 1 KB to 16 KB)
- Inter-arrival time is directly proportional to the load on the links

Introduction & Motivation (Cont.)
- Can modern networks like InfiniBand alleviate this?

InfiniBand Architecture
- An industry standard for low-latency, high-bandwidth System Area Networks
- Multiple features:
  - Two communication types: channel semantics and memory semantics (RDMA mechanism)
  - Queue Pair (QP) based communication
  - Quality of Service (QoS) support through multiple Virtual Lanes (VLs); QPs are associated with VLs by means of pre-specified Service Levels (a sketch of this association follows)
- Multiple communication speeds available for Host Channel Adapters (HCAs): 10 Gbps (SDR) / 20 Gbps (DDR) / 40 Gbps (QDR)
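The binding between a connection and a virtual lane is fixed by the service level placed in the QP's address vector when the QP is moved to the ready-to-receive state; the subnet's SL-to-VL mapping tables then translate that SL into a VL at every hop. Below is a minimal sketch using plain verbs, not code taken from MVAPICH2; the remote LID/QPN/PSN and the chosen service level are assumed to come from the usual out-of-band connection setup.

```c
/* Minimal sketch: transition an RC QP (already in INIT) to RTR while
 * tagging it with a service level. The SL written here is what the
 * fabric's SL-to-VL tables later map onto a virtual lane. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int move_qp_to_rtr(struct ibv_qp *qp, uint16_t remote_lid,
                   uint32_t remote_qpn, uint32_t remote_psn, uint8_t sl)
{
    struct ibv_qp_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.qp_state           = IBV_QPS_RTR;
    attr.path_mtu           = IBV_MTU_2048;
    attr.dest_qp_num        = remote_qpn;
    attr.rq_psn             = remote_psn;
    attr.max_dest_rd_atomic = 1;
    attr.min_rnr_timer      = 12;
    attr.ah_attr.is_global  = 0;
    attr.ah_attr.dlid       = remote_lid;
    attr.ah_attr.sl         = sl;    /* this choice decides which VL the traffic uses */
    attr.ah_attr.port_num   = 1;

    return ibv_modify_qp(qp, &attr,
                         IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                         IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                         IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER);
}
```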

InfiniBand Network Buffer Architecture
[Figure: HCA buffer organization, with a common buffer pool and private buffers for Virtual Lanes 0-15 feeding a VL arbiter in front of the physical link]
- Buffers in most IB HCAs and switches are grouped into two classes: a common buffer pool and private per-VL buffers
- Most current generation MPIs use only one VL
  - Inefficient use of the available network resources
- Why not use more VLs?
  - Possible drawback: would it take more time to poll all the VLs?

Outline: Introduction & Motivation, Problem Statement, Design, Performance Evaluation and Results, Conclusions and Future Work

Problem Statement
- Can multiple virtual lanes be used to improve the performance of HPC applications?
- How can we integrate this design into an MPI library so that end applications benefit?

Outline: Introduction & Motivation, Problem Statement, Design, Performance Evaluation and Results, Conclusions and Future Work

Proposed Framework and Goals
[Figure: applications and the job scheduler sit on top of an MPI library that performs traffic distribution and traffic segregation over the InfiniBand network]
- No change to the application
- Re-design the MPI library to use multiple VLs
- Need new methods to take advantage of multiple VLs:
  - Traffic Distribution: load balance traffic across multiple VLs
  - Traffic Segregation: ensure one kind of traffic does not disturb another; distinguish between low- & high-priority traffic, and small & large messages

Proposed Design
- Re-design the MPI library to use multiple VLs
- Multiple Virtual Lanes configured with different characteristics (e.g., transmit fewer packets at high priority, more packets at lower priority)
- Multiple Service Levels (SLs) defined to match the VLs
- Queue Pairs (QPs) assigned the proper SLs at QP creation time
- Multiple ways to assign Service Levels to applications (sketched below):
  - Assign SLs with similar characteristics in a round-robin fashion (Traffic Distribution)
  - Assign SLs with the desired characteristics based on the type of application (Traffic Segregation)
  - Other designs being explored
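The two assignment policies can be illustrated with a small sketch. This is a hypothetical illustration only: the function names, the number of data SLs and the high-priority SL value are assumptions, not the actual MVAPICH2 implementation.

```c
/* Hypothetical sketch of the two SL-assignment policies described above.
 * NUM_DATA_SLS and SL_HIGH_PRIORITY are assumed values, not taken from
 * any real subnet configuration. */
#include <stddef.h>
#include <stdint.h>

#define NUM_DATA_SLS     8   /* SLs 0..7 assumed to map to equally weighted VLs */
#define SL_HIGH_PRIORITY 8   /* SL assumed to map to a high-priority VL         */

/* Traffic Distribution: spread connections round-robin over SLs whose
 * VLs have identical arbitration characteristics. */
static uint8_t pick_sl_distribution(int peer_rank)
{
    return (uint8_t)(peer_rank % NUM_DATA_SLS);
}

/* Traffic Segregation: choose the SL (and hence the VL) based on the kind
 * of traffic, so latency-sensitive messages do not queue behind bulk data. */
static uint8_t pick_sl_segregation(size_t message_size, int is_latency_sensitive)
{
    if (is_latency_sensitive || message_size <= 2048)
        return SL_HIGH_PRIORITY;
    return 0;  /* bulk traffic stays on a default, lower-priority SL */
}
```

The SL chosen by either policy would then be placed in the QP's address vector exactly as in the earlier sketch, leaving the SL-to-VL mapping and VL arbitration weights to the subnet manager configuration.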

Proposed Design (Cont.)
[Figure: the job scheduler, MPI library and application each select a Service Level; each Service Level maps to one of Virtual Lanes 0-15, which are multiplexed onto the physical link by the Virtual Lane Arbiter]

Outline: Introduction & Motivation, Problem Statement, Design, Performance Evaluation and Results, Conclusions and Future Work

Experimental Testbed
- Compute platform: Intel Nehalem
  - Intel Xeon E5530 dual quad-core processors operating at 2.40 GHz
  - 12 GB RAM, 8 MB cache, PCIe 2.0 interface
- Network equipment:
  - MT26428 QDR ConnectX HCAs
  - 36-port Mellanox QDR switch used to connect all the nodes
- Red Hat Enterprise Linux Server release 5.3 (Tikanga), OFED-1.4.2, OpenSM version 3.1.6
- Benchmarks:
  - Modified version of the OFED perftest suite for verbs-level tests
  - MPIBench collective benchmark
  - CPMD used for application-level evaluation

MVAPICH / MVAPICH2 Software
- High performance MPI library for IB and 10GE
  - MVAPICH (MPI-1) and MVAPICH2 (MPI-2)
- Used by more than 1,255 organizations in 59 countries
- More than 44,500 downloads from the OSU site directly
- Empowering many TOP500 clusters, including the 11th-ranked 62,976-core cluster (Ranger) at TACC
- Available with the software stacks of many IB, 10GE and server vendors, including the OpenFabrics Enterprise Distribution (OFED)
- Also supports the udapl device to work with any network supporting uDAPL
- http://mvapich.cse.ohio-state.edu/

Verbs Level Performance
- Tests use 8 communicating pairs, one QP per pair
- Packet size 2 KB (results are the same for 1 KB to 16 KB)
- Traffic distribution using multiple VLs results in more predictable inter-arrival times
- Slight increase in average latency
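The kind of inter-arrival data behind this evaluation can be gathered by timestamping receive completions as they are polled from a completion queue. The sketch below is an assumption-laden stand-in for the modified perftest used in the paper: the CQ and sample count are taken as given, and receive buffers are not reposted.

```c
/* Minimal sketch: record the gap (in microseconds) between consecutive
 * successful receive completions on one completion queue. */
#include <infiniband/verbs.h>
#include <stdio.h>
#include <time.h>

#define NUM_SAMPLES 10000

void measure_inter_arrival(struct ibv_cq *cq)
{
    struct ibv_wc wc;
    struct timespec prev, now;
    static double gaps_us[NUM_SAMPLES];
    int have_prev = 0, recorded = 0;

    while (recorded < NUM_SAMPLES) {
        int n = ibv_poll_cq(cq, 1, &wc);
        if (n < 0)
            break;                               /* polling error */
        if (n == 0 || wc.status != IBV_WC_SUCCESS)
            continue;                            /* nothing new; keep polling */
        clock_gettime(CLOCK_MONOTONIC, &now);
        if (have_prev)
            gaps_us[recorded++] = (now.tv_sec - prev.tv_sec) * 1e6 +
                                  (now.tv_nsec - prev.tv_nsec) / 1e3;
        prev = now;
        have_prev = 1;
        /* a full benchmark would repost the receive buffer here */
    }
    for (int i = 0; i < recorded; i++)
        printf("%.2f\n", gaps_us[i]);
}
```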

MPI Level Point-to-Point Performance
[Plots: latency (us), bandwidth (MBps) and message rate (millions of messages/s) versus message size (1 KB to 64 KB), comparing the 1-VL and 8-VLs configurations]
- Tests use 8 communicating pairs, one QP per pair
- Traffic distribution using multiple VLs results in better overall performance
- 13% performance improvement over the case with just one VL

MPI Level Collective Performance
[Plots: Alltoall latency (us) versus message size (1 KB to 16 KB). Traffic Distribution: 1-VL vs. 8-VLs. Traffic Segregation: two concurrent Alltoalls with and without segregation, compared against a single Alltoall]
- For a 64-process Alltoall, Traffic Distribution and Traffic Segregation through the use of multiple VLs result in better performance
- 20% performance improvement seen with Traffic Distribution
- 12% performance improvement seen with Traffic Segregation
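The "two concurrent Alltoalls" workload can be reproduced with a short MPI program that splits the job into two halves and runs an Alltoall in each half at the same time, so the two collectives contend for the same links. This is a sketch of such a workload, not the exact MPIBench-based benchmark used in the paper, and it assumes an even process count.

```c
/* Sketch: two concurrent Alltoalls over the two halves of MPI_COMM_WORLD.
 * Compile with mpicc and run with an even number of ranks. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, msg = 2048, iters = 1000;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Split into two sub-communicators of size/2 ranks each. */
    MPI_Comm half;
    MPI_Comm_split(MPI_COMM_WORLD, rank < size / 2, rank, &half);

    int half_size;
    MPI_Comm_size(half, &half_size);
    char *sbuf = malloc((size_t)msg * half_size);
    char *rbuf = malloc((size_t)msg * half_size);

    MPI_Barrier(MPI_COMM_WORLD);           /* start both Alltoalls together */
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Alltoall(sbuf, msg, MPI_CHAR, rbuf, msg, MPI_CHAR, half);
    double avg_us = (MPI_Wtime() - t0) / iters * 1e6;

    if (rank == 0)
        printf("average Alltoall latency: %.2f us\n", avg_us);

    free(sbuf); free(rbuf);
    MPI_Comm_free(&half);
    MPI_Finalize();
    return 0;
}
```

With segregation, the MPI library would place the two halves' QPs on different service levels (and hence different VLs), so one Alltoall's packets do not queue behind the other's.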

Application Level Performance
[Bar chart: normalized total time and time spent in Alltoall, comparing 1 VL and 8 VLs]
- CPMD application, 64 processes
- Traffic Distribution through the use of multiple VLs results in better performance
- 11% improvement in Alltoall performance
- 6% improvement in overall performance

Outline: Introduction & Motivation, Problem Statement, Design, Performance Evaluation and Results, Conclusions and Future Work

Conclusions & Future Work
- Explored the use of Virtual Lanes to improve the predictability and performance of HPC applications
- Integrated our scheme into the MVAPICH2 MPI library and conducted performance evaluations at various levels
- Consistent increase in performance in the verbs-, MPI- and application-level evaluations
- Future work: explore advanced schemes to improve performance using multiple virtual lanes
- The proposed solution will be available in future MVAPICH2 releases

Thank you!
{subramon, laipi, surs, panda}@cse.ohio-state.edu
Network-Based Computing Laboratory
http://mvapich.cse.ohio-state.edu/