Screencast: OMPI OpenFabrics Protocols (v1.2 series)


Jeff Squyres, May 2008

Short Messages: For short messages, memcpy() into / out of pre-registered buffers ("short" means the memcpy cost does not matter). Once copied in, do one of two things: use eager RDMA, or use send / receive semantics.

Short Message Protocol: MPI_SEND(short_message, ...). Dequeue a pre-registered buffer from the queue, memcpy the user buffer into it, send it (via send / receive or RDMA), and return the buffer to the queue when the send completes.
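A minimal sketch of that sequence using the verbs API (assumptions: the bounce buffer was already registered with ibv_reg_mr(), the QP is connected, and the peer has a receive posted; bounce_buf and send_short are illustrative names, not Open MPI's actual btl_openib code):

/* Copy a short message into a pre-registered bounce buffer and post it
 * with send/receive semantics. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

struct bounce_buf {
    void          *addr;   /* pre-registered memory */
    struct ibv_mr *mr;     /* registration handle from ibv_reg_mr() */
};

static int send_short(struct ibv_qp *qp, struct bounce_buf *bounce,
                      const void *user_buf, size_t len)
{
    struct ibv_sge sge;
    struct ibv_send_wr wr, *bad_wr = NULL;

    /* 1. Copy the user buffer into the pre-registered buffer
     *    (cheap, because the message is short). */
    memcpy(bounce->addr, user_buf, len);

    /* 2. Describe the registered region to the HCA. */
    sge.addr   = (uintptr_t) bounce->addr;
    sge.length = (uint32_t) len;
    sge.lkey   = bounce->mr->lkey;

    /* 3. Post a two-sided send; the receiver chooses the target buffer. */
    memset(&wr, 0, sizeof(wr));
    wr.wr_id      = (uintptr_t) bounce;   /* so the completion can recycle it */
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.opcode     = IBV_WR_SEND;
    wr.send_flags = IBV_SEND_SIGNALED;

    return ibv_post_send(qp, &wr, &bad_wr);
}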

Short: RDMA vs. Send. RDMA semantics: the sender specifies the [remote] target buffer address; this requires N pre-registered buffers for each peer, which quickly becomes non-scalable. Send / receive semantics: the receiver specifies the target buffer address, so a common pool of pre-registered buffers can be used.
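For contrast, a sketch of the eager-RDMA path under the same assumptions. The per-peer slot ring (eager_rdma_peer, post_eager_rdma, next_slot) is a hypothetical illustration of "sender specifies the remote target buffer address", not an OMPI data structure:

/* The sender picks the next slot in a per-peer ring of remote buffers whose
 * addresses and rkeys were exchanged at setup time, and RDMA-writes into it.
 * The receiver must poll those slots itself; the hardware delivers the data
 * without notifying the remote software. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

struct eager_rdma_peer {
    uint64_t remote_base;   /* base address of the peer's slot array */
    uint32_t rkey;          /* remote key exchanged during setup */
    uint32_t slot_size;     /* size of one pre-registered slot */
    uint32_t num_slots;     /* N buffers reserved for *this* peer */
    uint32_t next_slot;     /* sender-side cursor */
};

static int post_eager_rdma(struct ibv_qp *qp, struct eager_rdma_peer *peer,
                           struct ibv_sge *local_sge)
{
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.sg_list    = local_sge;
    wr.num_sge    = 1;
    wr.opcode     = IBV_WR_RDMA_WRITE;   /* one-sided: no recv posted by peer */
    wr.send_flags = IBV_SEND_SIGNALED;

    /* Sender chooses the remote target address: slot i of this peer's ring. */
    wr.wr.rdma.remote_addr = peer->remote_base +
                             (uint64_t) peer->next_slot * peer->slot_size;
    wr.wr.rdma.rkey        = peer->rkey;
    peer->next_slot        = (peer->next_slot + 1) % peer->num_slots;

    return ibv_post_send(qp, &wr, &bad_wr);
}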

Short: RDMA vs. Send (illustrated with peers A and B exchanging Msg 3 and Msg 4). Eager RDMA: requires initial setup (exchanging buffer addresses) and is then completely hardware driven; each sender writes into buffers reserved specifically for it on the peer. Send / receive: less initial setup; the driver picks the receive buffer, which involves remote software; or, if no receive buffer is posted, the send might get an RNR (receiver not ready) NAK!

RDMA vs. Send
RDMA: buffers for each peer, so registered buffers must scale as (M x N); addresses must be exchanged; MPI maintains accounting / flow control; MPI must notice new received messages; unordered; one-sided; more work for MPI.
Send / receive: pool of buffers, which can be less than (M x N); the network maintains accounting / [some] flow control; the network notifies of new received messages; ordered; two-sided (ACKed); less work for MPI.

Limiting Short RDMA: Open MPI allows N short-RDMA peers; the (N+1)th peer will use send / receive for short messages. It is the receiver's choice. If M buffers are posted for each RDMA peer, the total registered memory for short RDMA is M x N x short_size.
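For illustration (these numbers are assumed, not OMPI defaults): with M = 16 buffers per peer, N = 64 eager-RDMA peers, and short_size = 12 KB, that is 16 x 64 x 12 KB = 12 MB of registered memory dedicated to eager RDMA alone, which is why the number of eager-RDMA peers is capped.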

[Shared] Receive Queue: How to receive messages? Per-peer resources: flow control is possible and receives never error out. Pooled resources: no flow control, but better utilization; you can play the statistics game, at the cost of potential retransmits. In other words: per-peer receive queues (one for peer 1, one for peer 2, one for peer 3, one for peer 4, ...) versus a single shared receive queue, which needs fewer than N x M buffers.
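A sketch of the pooled approach with the verbs shared receive queue API (make_srq is a hypothetical helper; the buffer pool, its registration, and the protection domain are assumed to exist; QPs created with .srq pointing at this SRQ then draw receives from the common pool rather than per-peer queues):

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

static struct ibv_srq *make_srq(struct ibv_pd *pd, struct ibv_mr *pool_mr,
                                char *pool, size_t buf_size, int num_bufs)
{
    struct ibv_srq_init_attr attr;
    struct ibv_srq *srq;
    int i;

    memset(&attr, 0, sizeof(attr));
    attr.attr.max_wr  = num_bufs;   /* depth of the shared queue */
    attr.attr.max_sge = 1;

    srq = ibv_create_srq(pd, &attr);
    if (srq == NULL)
        return NULL;

    /* Post the whole pool; completions report which buffer was consumed. */
    for (i = 0; i < num_bufs; ++i) {
        struct ibv_sge sge = {
            .addr   = (uintptr_t) (pool + (size_t) i * buf_size),
            .length = (uint32_t) buf_size,
            .lkey   = pool_mr->lkey,
        };
        struct ibv_recv_wr wr = { .wr_id = (uint64_t) i,
                                  .sg_list = &sge, .num_sge = 1 };
        struct ibv_recv_wr *bad_wr = NULL;

        if (ibv_post_srq_recv(srq, &wr, &bad_wr) != 0) {
            ibv_destroy_srq(srq);
            return NULL;
        }
    }
    return srq;
}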

Long Messages: For long messages, use a pipelined protocol: fragment the message into an event-driven queue, then register, send / receive (or RDMA), and unregister each fragment (skipping many details). It is more complicated if the message is not contiguous (not described here).

Long Message Protocol: MPI_SEND(long_message, ...). The long message is sent in 3 phases: (1) eager / match data, (2) send / receive data, (3) RDMA data.

Long Message Protocol, 1st phase: find the receiver match. Use (copy + send / receive) for the first fragment, sending only enough to confirm the receiver match without overwhelming receiver resources. A typical rendezvous protocol.

Long Message Protocol, 2nd phase: hide the receiver's registration latency. Use (copy + send / receive) for the next few fragments, which allows overlap of memory registration. The pipeline is subject to a maximum-depth setting to conserve resources.

Long Message Protocol, 3rd phase: RDMA the rest of the message. After the eager send and the send / receive fragments (up to the RDMA offset) complete, each remaining fragment is registered and RDMA-written in a pipeline, and fragments are unregistered as their writes complete; the final unregister completes the message.
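A heavily simplified sketch of that third phase (assumptions: the receiver's remote_addr and rkey came back in the rendezvous ACK; rdma_long_fragments and wait_for_completion are hypothetical names; real code keeps several fragments in flight so registration overlaps the in-progress writes rather than waiting on each one):

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: block until the write posted with 'wr_id' completes. */
extern void wait_for_completion(struct ibv_cq *cq, uint64_t wr_id);

static int rdma_long_fragments(struct ibv_pd *pd, struct ibv_qp *qp,
                               struct ibv_cq *cq,
                               char *buf, size_t len, size_t frag_size,
                               uint64_t remote_addr, uint32_t rkey)
{
    size_t off;

    for (off = 0; off < len; off += frag_size) {
        size_t n = (len - off < frag_size) ? len - off : frag_size;

        /* Register only this fragment (cost proportional to fragment size). */
        struct ibv_mr *mr = ibv_reg_mr(pd, buf + off, n, 0);
        if (mr == NULL)
            return -1;

        struct ibv_sge sge = { .addr   = (uintptr_t) (buf + off),
                               .length = (uint32_t) n,
                               .lkey   = mr->lkey };
        struct ibv_send_wr wr, *bad_wr = NULL;
        memset(&wr, 0, sizeof(wr));
        wr.wr_id      = off;
        wr.sg_list    = &sge;
        wr.num_sge    = 1;
        wr.opcode     = IBV_WR_RDMA_WRITE;
        wr.send_flags = IBV_SEND_SIGNALED;
        wr.wr.rdma.remote_addr = remote_addr + off;
        wr.wr.rdma.rkey        = rkey;

        if (ibv_post_send(qp, &wr, &bad_wr) != 0) {
            ibv_dereg_mr(mr);
            return -1;
        }
        wait_for_completion(cq, off);  /* real code pipelines several frags */
        ibv_dereg_mr(mr);              /* unregister as each write completes */
    }
    return 0;
}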

Why So Complex? Why not a single register / send / unregister? Because registration (and deregistration) time is directly proportional to the buffer length, so doing it in one step makes the total time longer: register, send, unregister, complete, all serialized with no overlap. Many details are skipped here; see the paper on www.open-mpi.org.

fork() Support: fork() and registered memory are problematic; the system must differentiate between parent and child registered memory. This was not properly supported until OFED v1.2, Open MPI v1.2.1, and kernel 2.6.16 or later (though some Linux distros have backported the fix, e.g., RHEL4U6). Query the MCA parameter btl_openib_have_fork_support to check.
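At the verbs level, the mechanism behind that support is libibverbs' ibv_fork_init(), which must be called before any memory is registered; a minimal sketch of the general pattern (plain libibverbs usage, not Open MPI internals):

#include <infiniband/verbs.h>
#include <stdio.h>

int main(void)
{
    /* Must run before ibv_open_device() / ibv_reg_mr() so registered pages
     * are marked to survive the copy-on-write behavior of fork(). */
    if (ibv_fork_init() != 0) {
        /* Kernel or library too old: fork() + registered memory is unsafe. */
        fprintf(stderr, "no fork support for registered memory\n");
        return 1;
    }
    /* ... open device, register memory, and later fork()/exec() safely ... */
    return 0;
}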

Striping is automatic: short messages round-robin across the lowest-latency BTL modules, while long-message fragments round-robin across all available BTL modules, with the size of each fragment proportional to that network's capacity. Developers are investigating more complex scheduling scenarios.
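For example (illustrative, assuming the capacity weighting is a simple bandwidth ratio): striping a long message across one SDR port (8 Gb/s data rate) and one DDR port (16 Gb/s data rate) would send roughly one third of the bytes through the SDR BTL module and two thirds through the DDR module.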

How to Activate Striping? Do nothing (!): mpirun -np 4 a.out. If Open MPI detects multiple active OpenFabrics ports, it will use them all, assuming each peer port is reachable (as determined by subnet ID).

Warning: Default Subnet ID. A subnet ID uniquely identifies an IB network, yet all Cisco IB switches ship with the default ID FE:80:00:00. To compute peer-process reachability, Open MPI needs physically separate networks to carry different subnet IDs; even then, ambiguities can exist.

Warning: Default Subnet ID (examples, each with an HCA connected to two switches / networks). Good: net 1 has ID ABCD and net 2 has ID EFGH, so reachability can be computed. Bad: both nets carry the default ID FE80; OMPI can't compute reachability, but will warn. Bad: both nets carry the same non-default ID ABCD; OMPI can't compute reachability.

More Information: Open MPI FAQ
General tuning: http://www.open-mpi.org/faq/?category=tuning
InfiniBand / OpenFabrics tuning: http://www.open-mpi.org/faq/?category=openfabrics
