Unified Communication X (UCX)

Unified Communication X (UCX). Pavel Shamis / Pasha, ARM Research, SC 18

UCF Consortium

Mission: collaboration between industry, laboratories, and academia to create production-grade communication frameworks and open standards for data-centric and high-performance applications.

Projects:
- UCX (Unified Communication X)
- Open RDMA

Board members:
- Jeff Kuehn, UCF Chairman (Los Alamos National Laboratory)
- Gilad Shainer, UCF President (Mellanox Technologies)
- Pavel Shamis, UCF Treasurer (ARM)
- Brad Benton, Board Member (AMD)
- Duncan Poole, Board Member (Nvidia)
- Pavan Balaji, Board Member (Argonne National Laboratory)
- Sameh Sharkawi, Board Member (IBM)
- Dhabaleswar K. (DK) Panda, Board Member (Ohio State University)
- Steve Poole, Board Member (Open Source Software Solutions)

UCX - History

A new project built on concepts and ideas from multiple generations of HPC network stacks, drawing on decades of community and industry experience in developing HPC network software:
- LA-MPI (2000): network layers
- Open MPI (2004): modular architecture
- PAMI (2010): APIs, context, thread safety, etc.
- MXM (2011): APIs, low-level optimizations
- UCCS (2012): APIs, software infrastructure optimizations
- UCX: performance, scalability, efficiency, portability

UCX Framework Mission
- Collaboration between industry, laboratories, government (DoD, DoE), and academia
- Create an open-source, production-grade communication framework for HPC applications
- Enable the highest performance through co-design of software-hardware interfaces (co-design of Exascale network APIs)
- API: exposes broad semantics that target data-centric and HPC programming models and applications
- Performance oriented: optimizing for low software overhead in the communication path allows near native-level performance
- Production quality: developed, maintained, tested, and used by industry and the research community
- Community driven: collaboration between industry, laboratories, and academia
- Research: the framework's concepts and ideas are driven by research in academia, laboratories, and industry
- Cross platform: support for InfiniBand, Cray, various shared memory (x86-64, Power, ARMv8), and GPUs

UCX Framework
- UCX is a framework for network APIs and stacks.
- UCX aims to unify different network APIs, protocols, and implementations into a single framework that is portable, efficient, and functional.
- UCX does not focus on supporting a single programming model; instead, it provides APIs and protocols that can be used to efficiently implement the functionality a particular programming model needs.
- When different programming paradigms and applications implement their functionality over UCX, their portability increases: implementing a small set of UCX APIs on top of new hardware is enough for those applications to run seamlessly, without each of them having to support the hardware itself.

UCX Architecture
The UCX framework sits between applications and the drivers/hardware and is composed of three main components:
- UCP (high-level API) is the protocol layer. It supports all the functionality exposed by the high-level APIs, emulating in software any features that are not implemented in the underlying hardware (a minimal bring-up sketch follows below).
- UCT (low-level API) is the transport layer. It aims to provide very efficient, low-overhead access to the hardware resources.
- UCS is the service layer. It provides common data structures, memory management tools, and other utilities.
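
To make the layering concrete, here is a minimal sketch of how an application typically brings up the UCP layer (a context plus a worker) before creating endpoints. This is not taken from the slides: the feature set shown (tag matching) is just one possible choice, and error handling is reduced to early returns.

    #include <ucp/api/ucp.h>

    /* Minimal UCP bring-up: read the configuration, create a context with the
     * features the application needs, then create a worker (progress engine). */
    static int init_ucp(ucp_context_h *context, ucp_worker_h *worker)
    {
        ucp_config_t *config;
        if (ucp_config_read(NULL, NULL, &config) != UCS_OK)
            return -1;

        ucp_params_t ctx_params = {
            .field_mask = UCP_PARAM_FIELD_FEATURES,
            .features   = UCP_FEATURE_TAG   /* tag-matching send/recv */
        };
        ucs_status_t status = ucp_init(&ctx_params, config, context);
        ucp_config_release(config);
        if (status != UCS_OK)
            return -1;

        ucp_worker_params_t worker_params = {
            .field_mask  = UCP_WORKER_PARAM_FIELD_THREAD_MODE,
            .thread_mode = UCS_THREAD_MODE_SINGLE
        };
        if (ucp_worker_create(*context, &worker_params, worker) != UCS_OK) {
            ucp_cleanup(*context);
            return -1;
        }
        return 0;  /* tear down later with ucp_worker_destroy() and ucp_cleanup() */
    }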

UCX High-level Overview

Applications: MPICH, Open MPI, etc.; OpenSHMEM, UPC, CAF, X10, Chapel, etc.; PaRSEC, OCR, Legion, etc.; burst buffer, ADIOS, etc.

UC-P (Protocols), the high-level API: transport selection, cross-transport multi-rail, fragmentation, and operations not supported by the hardware. API domains: message passing (tag matching, rendezvous; see the sketch below), PGAS (RMAs, atomics), task-based (active messages), and I/O (stream).

UC-T (Hardware Transports), the low-level API: RMA, atomics, tag matching, send/recv, active messages. Transports: InfiniBand Verbs (RC, UD, XRC, DCT), Gemini/Aries (GNI), intra-node host memory (SYSV, POSIX, KNEM, CMA, XPMEM), and accelerator memory (GPU).

UC-S (Services): common utilities, data structures, memory management.

Underlying drivers and hardware: OFA Verbs driver, Cray driver, OS kernel, CUDA.
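
As an illustration of the message-passing API domain, the sketch below posts a non-blocking tagged send and receive at the UCP layer and drives the worker until both complete. It assumes a context, worker, and endpoint created as in the earlier sketch; the callbacks, buffers, and tag value are illustrative, not taken from the slides.

    /* Illustrative completion callbacks; a real application would record
     * per-request completion state here. */
    static void send_cb(void *request, ucs_status_t status)
    {
        (void)request; (void)status;
    }

    static void recv_cb(void *request, ucs_status_t status,
                        ucp_tag_recv_info_t *info)
    {
        (void)request; (void)status; (void)info;
    }

    static void tag_example(ucp_worker_h worker, ucp_ep_h ep,
                            void *sbuf, void *rbuf, size_t len)
    {
        const ucp_tag_t tag = 0x1337;   /* illustrative tag value */

        /* Non-blocking tagged send on an endpoint. */
        ucs_status_ptr_t sreq = ucp_tag_send_nb(ep, sbuf, len,
                                                ucp_dt_make_contig(1),
                                                tag, send_cb);
        /* Non-blocking tagged receive, matching the full tag. */
        ucs_status_ptr_t rreq = ucp_tag_recv_nb(worker, rbuf, len,
                                                ucp_dt_make_contig(1),
                                                tag, (ucp_tag_t)-1, recv_cb);

        /* Drive communication until both requests complete. */
        while ((UCS_PTR_IS_PTR(sreq) && !ucp_request_is_completed(sreq)) ||
               (UCS_PTR_IS_PTR(rreq) && !ucp_request_is_completed(rreq))) {
            ucp_worker_progress(worker);
        }
        if (UCS_PTR_IS_PTR(sreq)) ucp_request_free(sreq);
        if (UCS_PTR_IS_PTR(rreq)) ucp_request_free(rreq);
    }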

UCX Releases in 2018

v1.3.1 - https://github.com/openucx/ucx/releases/tag/v1.3.1
- Multi-rail support for eager and rendezvous protocols
- Added stream-based communication API
- Added support for GPU platforms: NVIDIA CUDA and AMD ROCm software stacks
- Added API for client-server based connection establishment
- Added support for TCP transport (send/receive semantics)
- Support for InfiniBand hardware tag matching for DC and accelerated transports
- Added support for tag-matching communication with CUDA buffers
- Initial support for Java bindings
- Progress engine optimizations
- Improved scalability of software tag matching by using a hash table
- Added transparent huge-pages allocator
- Added non-blocking flush and disconnect semantics
- Added registration cache for KNEM
- Support for fixed-address memory allocation via ucp_mem_map()
- Added ucp_tag_send_nbr() API to avoid send request allocation
- Support for global addressing in all IB transports
- Added support for external epoll fd and edge-triggered events
- Added ucp_rkey_ptr() to obtain a pointer to a shared memory region (see the registration sketch below)
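
The last two items refer to UCP's memory registration path. The sketch below shows the typical flow under the assumption that the rkey blob is exchanged out-of-band by the application: ucp_mem_map() registers a buffer with the context, ucp_rkey_pack() produces a remote key to ship to peers, and on shared-memory transports the peer can turn the unpacked rkey into a direct load/store pointer with ucp_rkey_ptr(). The helper names are illustrative and error handling is omitted.

    #include <ucp/api/ucp.h>

    /* Owner side: register local_buf with the context and pack an rkey blob
     * that peers will unpack on their endpoints. */
    static void publish_region(ucp_context_h context, void *local_buf,
                               size_t local_len, ucp_mem_h *memh,
                               void **rkey_buf, size_t *rkey_size)
    {
        ucp_mem_map_params_t mp = {
            .field_mask = UCP_MEM_MAP_PARAM_FIELD_ADDRESS |
                          UCP_MEM_MAP_PARAM_FIELD_LENGTH,
            .address    = local_buf,
            .length     = local_len
        };
        ucp_mem_map(context, &mp, memh);
        ucp_rkey_pack(context, *memh, rkey_buf, rkey_size);
        /* rkey_buf/rkey_size are sent to peers out-of-band and later
         * released with ucp_rkey_buffer_release(). */
    }

    /* Peer side: unpack the received rkey; on shared-memory transports
     * ucp_rkey_ptr() returns a pointer for direct load/store access. */
    static void *map_remote(ucp_ep_h ep, const void *remote_rkey_buf,
                            uint64_t remote_addr, ucp_rkey_h *rkey)
    {
        void *direct_ptr = NULL;
        ucp_ep_rkey_unpack(ep, remote_rkey_buf, rkey);
        if (ucp_rkey_ptr(*rkey, remote_addr, &direct_ptr) != UCS_OK)
            direct_ptr = NULL;   /* transport does not expose a pointer */
        return direct_ptr;
    }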

UCX Releases in 2018 - continued

v1.4.0 - https://github.com/openucx/ucx/releases/tag/v1.4.0
- Support for installation with the latest AMD ROCm
- Support for the latest rdma-core
- Support for NVIDIA CUDA IPC for intra-node GPU communication
- Support for an NVIDIA CUDA memory allocation cache for memory-type detection
- Support for the latest Mellanox devices (200 Gb/s)
- Support for NVIDIA GPU managed memory
- Support for bitwise (OpenSHMEM v1.4) atomic operations (see the sketch below)
- CPU architectures: x86, Power8/9, Arm
- State-of-the-art support for GP-GPU
- Interconnects: InfiniBand, RoCE v1/v2, Gemini/Aries, shared memory, TCP/Ethernet (beta)
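
A minimal sketch of how the bitwise atomics added for OpenSHMEM v1.4 look at the UCP level, assuming an endpoint, remote address, and rkey obtained as in the earlier registration sketch; UCP_ATOMIC_POST_OP_OR is one of the bitwise opcodes (alongside AND and XOR), and the helper name is illustrative.

    /* Non-fetching bitwise OR of a 64-bit mask into remote memory. */
    static ucs_status_t remote_or64(ucp_ep_h ep, uint64_t mask,
                                    uint64_t remote_addr, ucp_rkey_h rkey)
    {
        return ucp_atomic_post(ep, UCP_ATOMIC_POST_OP_OR, mask,
                               sizeof(uint64_t), remote_addr, rkey);
    }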

UCX Releases in 2018 - continued

v1.5.0 - end of November (ADD branch here):
- New emulation mode enabling comprehensive UCX functionality (atomics, put, get, etc.) over TCP and legacy interconnects that do not implement full RDMA semantics
- New non-blocking API for all one-sided operations (see the sketch below)
- New client/server connection establishment API
- New advanced statistics capabilities (tag-matching queues)
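
A sketch of what the non-blocking one-sided API looks like from the application side: an RMA put whose request is progressed to completion, followed by a non-blocking endpoint flush to guarantee remote completion. It assumes the endpoint, remote address, and rkey were set up as in the earlier sketches; the callback and helper names are illustrative.

    /* Illustrative completion callback for put/flush requests. */
    static void op_done(void *request, ucs_status_t status)
    {
        (void)request; (void)status;
    }

    static void put_and_flush(ucp_worker_h worker, ucp_ep_h ep,
                              const void *buf, size_t len,
                              uint64_t remote_addr, ucp_rkey_h rkey)
    {
        /* Non-blocking RMA put; may complete immediately (req == NULL). */
        ucs_status_ptr_t req = ucp_put_nb(ep, buf, len, remote_addr, rkey,
                                          op_done);
        if (UCS_PTR_IS_PTR(req)) {
            while (!ucp_request_is_completed(req))
                ucp_worker_progress(worker);
            ucp_request_free(req);
        }

        /* Non-blocking flush: once it completes, the data is visible at
         * the target. */
        ucs_status_ptr_t freq = ucp_ep_flush_nb(ep, 0, op_done);
        if (UCS_PTR_IS_PTR(freq)) {
            while (!ucp_request_is_completed(freq))
                ucp_worker_progress(worker);
            ucp_request_free(freq);
        }
    }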

UCX Roadmap

v1.6 - Q1 2019:
- Bug fixes and optimizations
- iWARP support
- Active Message API

v2.0 - Q3-Q4 2019:
- Updated API, not backward compatible with 1.x: cleanup (remove deprecated APIs); UCP request object redesign to improve future backward compatibility
- Binary distribution will provide a v1.x version of the library (in addition to 2.x) for backward compatibility - all existing code should work as-is

Integrations
- Open MPI and OSHMEM: UCX replaces the openib BTL as the default transport for InfiniBand and RoCE; new UCX BTL (by LANL)
- MPICH: MPI CH4 UCX netmod
- OSSS SHMEM by Stony Brook and LANL
- OpenSHMEM-X by ORNL
- PaRSEC (UTK)
- Intel Libfabric/OFI
- 3rd-party commercial projects
Powered by UCX!

We Love Testing
- Over 100,000 tests per commit
- 220,000 CPU hours per release

Performance: MPI Latency and Bandwidth (chart slide)

Performance: Fluent (chart slide)

Cavium ThunderX2, single core: MPI bandwidth over InfiniBand (chart), MB/s vs. message size from 1 B to 4 MB; higher is better.

Cavium ThunderX2: MPI ping-pong latency over InfiniBand (chart), microseconds vs. message size from 0 B to 8 KB; lower is better.

Cavium ThunderX2: MPI message rate over InfiniBand, 28 cores (chart), millions of messages per second vs. message size from 1 B to 4 MB; higher is better.

Open MPI + UCX full scale!

UCX over ROCm: Intra-node Support (ISC, June 2018)
- Zero-copy based design: uct_rocm_cma_ep_put_zcopy, uct_rocm_cma_ep_get_zcopy
- Zero-copy based implementation, similar to the CMA UCT code in UCX; ROCm provides functions analogous to the original CMA calls for GPU memories (hsaKmtProcessVMWrite, hsaKmtProcessVMRead)
- IPC for intra-node communication: working on providing ROCm-IPC support in UCX
- Test bed: AMD Fiji GPUs, Intel CPU, Mellanox Connect-IB; OMB latency benchmark (chart: latency in microseconds vs. message size, 0 to 512 bytes)
- ROCm-CMA provides efficient support for large messages: 1.9 us for a 4-byte intra-node D-D transfer, 43 us for a 512 KB intra-node transfer

UCX Support in MPICH CH4
- UCX netmod development: MPICH team, Tommy Janjusic (Mellanox)
- MPICH 3.3rc1 just released; includes an embedded UCX 1.4.0
- Native path: pt2pt (with pack/unpack callbacks for non-contiguous buffers), contiguous put/get RMA for win_create/win_allocate windows
- Non-native path is CH4 active messages (header + data), layered over the UCX tagged API
- Not yet supported: MPI dynamic processes
- OSU latency: 0.99 us; OSU bandwidth: 12064.12 MB/s
- Test bed: Argonne JLSE Gomez cluster - Intel Haswell-EX E7-8867v3 @ 2.5 GHz, ConnectX-4 EDR, HPC-X 2.2.0, OFED 4.4-2.0.7

Verbs-level Performance: Message Rate (Network Based Computing Laboratory, SC 18)
- Charts: millions of messages per second vs. message size (1 B to 4 KB) for Read, Write, and Send, comparing regular vs. accelerated verbs
- Annotated rates: Read 7.41 / 8.73, Write 8.05 / 9.66 / 10.47, Send 8.97 / 10.20 million messages per second
- Test bed: ConnectX-5 EDR (100 Gbps), Intel Broadwell E5-2680 @ 2.4 GHz, MOFED 4.2-1, RHEL 7 (3.10.0-693.17.1.el7.x86_64)

UCX - Unified Communication X Framework
- Web: www.openucx.org, https://github.com/openucx/ucx
- Mailing list: https://elist.ornl.gov/mailman/listinfo/ucx-group (ucx-group@elist.ornl.gov)

Thank You. The UCF Consortium is a collaboration between industry, laboratories, and academia to create production-grade communication frameworks and open standards for data-centric and high-performance applications.