High Performance MPI on IBM 12x InfiniBand Architecture

High Performance MPI on IBM 12x InfiniBand Architecture
Abhinav Vishnu, Brad Benton (IBM) and Dhabaleswar K. Panda
{vishnu, panda}@cse.ohio-state.edu, brad.benton@us.ibm.com

Presentation Road-Map
- Introduction and Motivation
- Background
- Enhanced MPI Design for IBM 12x Architecture
- Performance Evaluation
- Conclusions and Future Work

Introduction and Motivation
- Demand for more compute power is driven by parallel applications: molecular dynamics (NAMD), car crash simulations (LS-DYNA), ...
- Cluster sizes keep increasing to meet these demands: ~9K processors (Sandia Thunderbird, ASCI Q)
- Larger-scale clusters are planned using upcoming multi-core architectures
- MPI is the primary programming model for writing these applications

Emergence of InfiniBand
- Interconnects with very low latency and very high throughput have become available: InfiniBand, Myrinet, Quadrics
- InfiniBand: high performance, open standard, advanced features
- PCI-Express based InfiniBand adapters are becoming popular: 8X links (1X ~ 2.5 Gbps) with Double Data Rate (DDR) support; MPI designs for these adapters are emerging
- In addition to PCI-Express, GX+ I/O bus based adapters are also emerging, with 4X and 12X link support (a rough bandwidth calculation follows)
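As a rough sanity check on these link widths (standard InfiniBand figures, not taken from the slides): each 1X lane signals at 2.5 Gbps in SDR mode and carries 2 Gbps of data after 8b/10b encoding, so a 12X SDR link provides about 3 GB/s of data bandwidth per direction, and DDR doubles that.

```latex
12 \times 2.5\,\text{Gbps} = 30\,\text{Gbps (raw, SDR)},\qquad
30\,\text{Gbps} \times \tfrac{8}{10} = 24\,\text{Gbps} \approx 3\,\text{GB/s per direction}
```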

InfiniBand Adapters
[Diagram: the HCA connects the host-side I/O bus interface to two network ports (P1, P2) running at 4X or 12X (SDR/DDR); host buses compared include PCI-X (4X bidirectional), PCI-Express (16X bidirectional) and GX+ (>24X bidirectional bandwidth)]
- MPI designs for PCI-Express based adapters are coming up
- IBM 12x InfiniBand adapters on the GX+ bus are coming up

Problem Statement
- How do we design an MPI with low overhead for the IBM 12x InfiniBand architecture?
- What are the performance benefits of the enhanced design over existing designs, for:
  - Point-to-point communication
  - Collective communication
  - MPI applications

Presentation Road-Map
- Introduction and Motivation
- Background
- Enhanced MPI Design for IBM 12x Architecture
- Performance Evaluation
- Conclusions and Future Work

Overview of InfiniBand
- An interconnect technology for connecting I/O nodes and processing nodes
- InfiniBand provides multiple transport semantics:
  - Reliable Connection: supports reliable delivery notification and Remote Direct Memory Access (RDMA)
  - Unreliable Datagram: data delivery is not reliable; send/recv is supported
  - Reliable Datagram: currently not implemented by vendors
  - Unreliable Connection: notification is not supported
- InfiniBand uses a queue pair (QP) model for data transfer (a minimal verbs sketch follows):
  - Send queue (for send operations)
  - Receive queue (not involved in RDMA-style operations)
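For concreteness, the QP model maps directly onto the InfiniBand verbs API. The following is a minimal libibverbs sketch (generic verbs code, not the MVAPICH implementation) that creates a Reliable Connection QP with its send and receive queues; error handling is abbreviated, and the QP would still need to be transitioned through the INIT/RTR/RTS states before use.

```c
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    /* Open the first available InfiniBand device. */
    int num;
    struct ibv_device **dev_list = ibv_get_device_list(&num);
    if (!dev_list || num == 0) { fprintf(stderr, "no IB devices found\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(dev_list[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* Completion queues for send-side and receive-side completions. */
    struct ibv_cq *send_cq = ibv_create_cq(ctx, 256, NULL, NULL, 0);
    struct ibv_cq *recv_cq = ibv_create_cq(ctx, 256, NULL, NULL, 0);

    /* Reliable Connection QP: one send queue and one receive queue. */
    struct ibv_qp_init_attr attr = {
        .send_cq = send_cq,
        .recv_cq = recv_cq,
        .qp_type = IBV_QPT_RC,
        .cap = {
            .max_send_wr  = 128,   /* outstanding send work requests    */
            .max_recv_wr  = 128,   /* outstanding receive work requests */
            .max_send_sge = 1,
            .max_recv_sge = 1,
        },
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &attr);
    if (!qp) { fprintf(stderr, "ibv_create_qp failed\n"); return 1; }

    printf("created RC QP 0x%x\n", qp->qp_num);

    /* A multi-QP design such as EPC would create several such QPs per port. */
    ibv_destroy_qp(qp);
    ibv_destroy_cq(send_cq);
    ibv_destroy_cq(recv_cq);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(dev_list);
    return 0;
}
```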

MultiPathing Configurations
[Diagram: multi-pathing configurations through the switch using multiple adapters and multiple ports (multi-rail configurations); multi-rail also provides multiple send/recv engines]
- A combination of these configurations is also possible

Presentation Road-Map
- Introduction and Motivation
- Background
- Enhanced MPI Design for IBM 12x Architecture
- Performance Evaluation
- Conclusions and Future Work

MPI Design for 12x Architecture
[Diagram: the ADI layer passes point-to-point and collective traffic (eager and rendezvous protocols) to EPC, which consists of a communication marker, a communication scheduler with its scheduling policies, and a completion notifier; EPC sits on top of the InfiniBand layer, which provides multiple QPs per port and completion notification]
Reference: Jiuxing Liu, Abhinav Vishnu and Dhabaleswar K. Panda, "Building Multirail InfiniBand Clusters: MPI-Level Designs and Performance Evaluation", SuperComputing 2004

Discussion on Scheduling Policies
[Diagram: Enhanced Pt-to-Pt and Collective (EPC) covers blocking, non-blocking and collective communication; its scheduling policies include binding, reverse multiplexing, round robin and even striping; striping incurs overhead from multiple stripes and multiple completions]

EPC Characteristics
[Table: pt-to-pt blocking -> striping; pt-to-pt non-blocking -> round robin; collective -> striping]
- For small messages, the round robin policy is used
- Striping leads to overhead for small messages (the scheduling logic is sketched below)
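A minimal sketch of this scheduling logic, assuming a hypothetical post_send_on_qp() primitive and an illustrative striping threshold (the real MVAPICH scheduler and its thresholds differ): small messages go round robin over the QPs, while large messages are striped evenly across all of them.

```c
#include <stdio.h>
#include <stddef.h>

#define NUM_QPS            4        /* QPs opened on the 12x port (illustrative)      */
#define STRIPING_THRESHOLD 16384    /* bytes; illustrative cutoff, not MVAPICH's value */

/* Stand-in for posting one RDMA write / send descriptor on a given QP. */
static void post_send_on_qp(int qp_index, const char *buf, size_t len)
{
    printf("QP %d: posting %zu bytes at %p\n", qp_index, len, (const void *)buf);
}

/* Small messages: round robin across QPs -- one QP per message, no striping overhead. */
static void send_small(const char *buf, size_t len)
{
    static int next_qp = 0;
    post_send_on_qp(next_qp, buf, len);
    next_qp = (next_qp + 1) % NUM_QPS;
}

/* Large messages: even striping -- split the message across all QPs so the stripes
 * move in parallel; the cost is one completion per stripe instead of one per message. */
static void send_large(const char *buf, size_t len)
{
    size_t stripe = (len + NUM_QPS - 1) / NUM_QPS;
    for (int i = 0; i < NUM_QPS; i++) {
        size_t off = (size_t)i * stripe;
        if (off >= len)
            break;
        size_t chunk = (len - off < stripe) ? (len - off) : stripe;
        post_send_on_qp(i, buf + off, chunk);
    }
}

static void epc_send(const char *buf, size_t len)
{
    if (len < STRIPING_THRESHOLD)
        send_small(buf, len);    /* round robin for small messages  */
    else
        send_large(buf, len);    /* even striping for large messages */
}

int main(void)
{
    static char msg[1 << 20];    /* 1 MB payload */
    epc_send(msg, 64);           /* small: sent on a single QP      */
    epc_send(msg, sizeof(msg));  /* large: striped across all QPs   */
    return 0;
}
```

The trade-off is visible in the code: striping issues one work request per QP and therefore generates multiple completions per message, which is pure overhead below the threshold.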

MVAPICH/MVAPICH2
- We have used MVAPICH as the MPI framework for the enhanced design
- MVAPICH/MVAPICH2: high performance MPI-1/MPI-2 implementations over InfiniBand and iWARP
- Have powered many supercomputers in the TOP500 rankings
- Currently used by more than 450 organizations (academia and industry worldwide): http://nowlab.cse.ohio-state.edu/projects/mpi-iba
- The enhanced design is available with MVAPICH and will become available with MVAPICH2 in upcoming releases

Presentation Road-Map
- Introduction and Motivation
- Background
- Enhanced MPI Design for IBM 12x Architecture
- Performance Evaluation
- Conclusions and Future Work

Experimental Test-Bed
- Power5 based systems with SLES9 SP2 (2.6.9 kernel)
- GX+ bus at 950 MHz clock speed
- 2.8 GHz processors with 8 GB of memory
- TS120 switch connecting the adapters
- One port per adapter, and one adapter, is used for communication; the objective is to see the benefit of using only one physical port

Ping-Pong Latency Test
- EPC adds insignificant overhead to small-message latency
- Large-message latency is reduced by 41% using EPC on the IBM 12x architecture
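The ping-pong test follows the standard pattern sketched below (a generic two-rank version, not the exact benchmark used on the slides): rank 0 sends a message, rank 1 echoes it back, and the one-way latency is half the averaged round-trip time.

```c
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    enum { MSG_SIZE = 8, ITERS = 1000 };   /* small message; illustrative sizes */
    char buf[MSG_SIZE];
    memset(buf, 0, sizeof(buf));

    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();

    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    double elapsed = MPI_Wtime() - start;
    if (rank == 0)
        printf("one-way latency: %.2f us\n", elapsed / (2.0 * ITERS) * 1e6);

    MPI_Finalize();
    return 0;
}
```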

Small Messages Throughput
- Unidirectional bandwidth doubles for small messages using EPC
- Bidirectional bandwidth does not improve with an increasing number of QPs, due to the copy bandwidth limitation

Large Messages Throughput
- EPC improves unidirectional and bidirectional throughput significantly for medium-size messages
- We achieve a peak unidirectional bandwidth of 2731 MB/s and a peak bidirectional bandwidth of 5421 MB/s

Collective Communication
- MPI_Alltoall shows significant benefits for large messages
- MPI_Bcast shows more benefit for very large messages
(a minimal usage example follows)
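Both collectives are invoked as in any MPI program; a minimal, self-contained example with illustrative message sizes:

```c
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* MPI_Alltoall: each rank sends COUNT ints to every other rank. */
    const int COUNT = 1024;   /* illustrative message size */
    int *sendbuf = malloc((size_t)size * COUNT * sizeof(int));
    int *recvbuf = malloc((size_t)size * COUNT * sizeof(int));
    for (int i = 0; i < size * COUNT; i++)
        sendbuf[i] = rank;
    MPI_Alltoall(sendbuf, COUNT, MPI_INT, recvbuf, COUNT, MPI_INT, MPI_COMM_WORLD);

    /* MPI_Bcast: rank 0 broadcasts a buffer to all ranks. */
    int *bcastbuf = malloc((size_t)COUNT * sizeof(int));
    if (rank == 0)
        for (int i = 0; i < COUNT; i++) bcastbuf[i] = i;
    MPI_Bcast(bcastbuf, COUNT, MPI_INT, 0, MPI_COMM_WORLD);

    free(sendbuf); free(recvbuf); free(bcastbuf);
    MPI_Finalize();
    return 0;
}
```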

NAS Parallel Benchmarks (Fourier Transform)
- For class A and class B problem sizes, the x1 configuration shows improvement
- There is no degradation for the other configurations on Fourier Transform

NAS Parallel Benchmarks (continued)
- Integer Sort shows 7-11% improvement for x1 configurations
- The other NAS Parallel Benchmarks do not show performance degradation

Presentation Road-Map
- Introduction and Motivation
- Background
- Enhanced MPI Design for IBM 12x Architecture
- Performance Evaluation
- Conclusions and Future Work

Conclusions
- We presented an enhanced design, EPC (Enhanced Point-to-Point and Collective communication), for the IBM 12x InfiniBand architecture
- We implemented the design and evaluated it with micro-benchmarks, collectives and MPI application kernels
- IBM 12x HCAs can significantly improve communication performance:
  - 41% for the ping-pong latency test
  - 63-65% for unidirectional and bidirectional bandwidth tests
  - 7-13% improvement for the NAS Parallel Benchmarks
- We achieve peak unidirectional and bidirectional bandwidths of 2731 MB/s and 5421 MB/s, respectively

Future Directions
- Evaluate EPC with multi-rail configurations on upcoming multi-core systems:
  - Multi-port configurations
  - Multi-HCA configurations
- Scalability studies of using multiple QPs on large-scale clusters
- Impact of QP caching
- Network fault tolerance

Acknowledgements
- Our research is supported by the following organizations
- [Slide shows logos of current funding and equipment sponsors; not transcribed]

Web Pointers
- http://nowlab.cse.ohio-state.edu/
- MVAPICH web page: http://mvapich.cse.ohio-state.edu
- E-mail: {vishnu, panda}@cse.ohio-state.edu, brad.benton@us.ibm.com