UCX: An Open Source Framework for HPC Network APIs and Beyond


UCX: An Open Source Framework for HPC Network APIs and Beyond
Presented by: Pavel Shamis / Pasha
ORNL is managed by UT-Battelle for the US Department of Energy

Co-Design Collaboration: The Next Generation HPC Communication Framework
A collaborative effort between industry, national laboratories, and academia

Challenges

Performance
- Portability (across various interconnects)
- Collaboration between industry and research institutions, but mostly industry (because they built the hardware)

Maintenance
- Maintaining a network stack is time consuming and expensive
- Industry has the resources and a strategic interest in this

Extensibility
- MPI+X+Y? The exascale programming environment is an ongoing debate

Challenges (CORAL)

Feature | Summit | Titan
Application performance | 5-10x Titan baseline | Baseline
Number of nodes | ~3,400 | 18,688
Node performance | > 40 TF | 1.4 TF
Memory per node | > 512 GB (HBM + DDR4) | 38 GB (GDDR5 + DDR3)
NVRAM per node | 800 GB | 0
Node interconnect | NVLink (5-12x PCIe 3) | PCIe 2
System interconnect (node injection bandwidth) | Dual-rail EDR-IB (23 GB/s) | Gemini (6.4 GB/s)
Interconnect topology | Non-blocking fat tree | 3D torus
Processors | IBM POWER9 + NVIDIA Volta | AMD Opteron + NVIDIA Kepler
File system | 120 PB, 1 TB/s, GPFS | 32 PB, 1 TB/s, Lustre
Peak power consumption | 10 MW | 9 MW

UCX: Unified Communication X Framework

- Unified: a network API for multiple network architectures that targets HPC programming models and libraries
- Communication: how to move data from location A in memory to location B in memory, considering multiple types of memory
- Framework: a collection of libraries and utilities for HPC network programmers

History

MXM
- Developed by Mellanox Technologies
- HPC communication library for InfiniBand devices and shared memory
- Primary focus: MPI, PGAS

UCCS
- Developed by ORNL, UH, UTK
- Originally based on the Open MPI BTL and OPAL layers
- HPC communication library for InfiniBand, Cray Gemini/Aries, and shared memory
- Primary focus: OpenSHMEM, PGAS; also supports MPI

PAMI
- Developed by IBM for BG/Q, PERCS, and IB VERBS network devices and shared memory
- MPI, OpenSHMEM, PGAS, CHARM++, X10
- C++ components
- Aggressive multi-threading with contexts
- Active messages
- Non-blocking collectives with hardware acceleration support

Decades of community and industry experience in the development of HPC software.

What we are doing differently

- UCX consolidates multiple industry and academic efforts: Mellanox MXM, IBM PAMI, ORNL/UTK/UH UCCS, etc.
- Supported and maintained by industry: IBM, Mellanox, NVIDIA, Pathscale

What we are doing differently

Co-design effort between national laboratories, academia, and industry:
- Applications: LAMMPS, NWCHEM, etc.
- Programming models: MPI, PGAS/GASNet, etc.
- Middleware: driver and hardware

[Architecture diagram: Applications (MPI, GASNet, PGAS, task-based runtimes, I/O) sit on top of UCX (Protocols, Services, Transports), which runs over InfiniBand, uGNI, shared memory, GPU memory, and emerging interconnects.]

A Collaboration Effort

- Mellanox co-designs the network interface and contributes MXM technology: infrastructure, transports, shared memory, protocols, integration with Open MPI/SHMEM and MPICH
- ORNL co-designs the network interface and contributes the UCCS project: InfiniBand optimizations, Cray devices, shared memory
- NVIDIA co-designs high-quality support for GPU devices: GPUDirect, GDR copy, etc.
- IBM co-designs the network interface and contributes ideas and concepts from PAMI
- UH/UTK focus on integration with their research platforms

Licensing

- Open source: BSD 3-Clause license
- Contributor License Agreement: BSD 3 based

UCX Framework Mission

- Collaboration between industry, laboratories, and academia
- Create an open-source, production-grade communication framework for HPC applications
- Enable the highest performance through co-design of software-hardware interfaces
- Unify industry, national laboratory, and academic efforts

- API: exposes broad semantics that target data-centric and HPC programming models and applications
- Performance oriented: optimization for low software overheads in the communication path allows near native-level performance
- Production quality: developed, maintained, tested, and used by industry and the research community
- Community driven: collaboration between industry, laboratories, and academia
- Research: the framework concepts and ideas are driven by research in academia, laboratories, and industry
- Cross platform: support for InfiniBand, Cray, various shared memory (x86-64 and Power), and GPUs

Co-design of exascale network APIs.

Architecture

UCX Framework

UC-P for Protocols
- High-level API that uses the UCT framework to construct protocols commonly found in applications
- Functionality: multi-rail, device selection, pending queue, rendezvous, tag matching, software atomics, etc.

UC-T for Transport
- Low-level API that exposes basic network operations supported by the underlying hardware; reliable, out-of-order delivery
- Functionality: setup and instantiation of communication operations

UC-S for Services
- Basic infrastructure for component-based programming, memory management, and useful system utilities
- Functionality: platform abstractions, data structures, debug facilities
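To make the layering concrete, the following is a minimal bring-up sketch against the public openucx UC-P API (ucp.h). It is illustrative only: the parameter fields shown follow the current headers rather than the early prototype described in this talk.

```c
/* Minimal UC-P (protocol layer) bring-up sketch, assuming the public
 * openucx API (ucp.h); field names may differ from the early prototype
 * described in the slides. */
#include <ucp/api/ucp.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    ucp_params_t params = {
        .field_mask = UCP_PARAM_FIELD_FEATURES,
        .features   = UCP_FEATURE_TAG | UCP_FEATURE_RMA, /* tag matching + RMA */
    };
    ucp_config_t  *config;
    ucp_context_h context;
    ucp_worker_h  worker;

    /* Read UCX_* environment configuration (transport selection, etc.). */
    if (ucp_config_read(NULL, NULL, &config) != UCS_OK) {
        return EXIT_FAILURE;
    }

    /* Create the application context: UC-P selects and opens UC-T transports. */
    if (ucp_init(&params, config, &context) != UCS_OK) {
        ucp_config_release(config);
        return EXIT_FAILURE;
    }
    ucp_config_release(config);

    /* A worker is the progress engine that drives communication resources. */
    ucp_worker_params_t wparams = {
        .field_mask  = UCP_WORKER_PARAM_FIELD_THREAD_MODE,
        .thread_mode = UCS_THREAD_MODE_SINGLE,
    };
    if (ucp_worker_create(context, &wparams, &worker) != UCS_OK) {
        ucp_cleanup(context);
        return EXIT_FAILURE;
    }

    printf("UCP context and worker initialized\n");

    ucp_worker_destroy(worker);
    ucp_cleanup(context);
    return EXIT_SUCCESS;
}
```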

A High-level Overview

Applications
- MPICH, Open MPI, etc.
- OpenSHMEM, UPC, CAF, X10, Chapel, etc.
- PaRSEC, OCR, Legion, etc.
- Burst buffer, ADIOS, etc.

UC-P (Protocols) - High-level API
- Transport selection, cross-transport multi-rail, fragmentation, operations not supported by hardware
- Message Passing API domain: tag matching, rendezvous
- PGAS API domain: RMAs, atomics
- Task-based API domain: active messages
- I/O API domain: stream

UC-T (Hardware Transports) - Low-level API: RMA, atomics, tag matching, send/recv, active messages
- Transport for the InfiniBand VERBs driver: RC, UD, XRC, DCT
- Transport for the Gemini/Aries drivers: GNI
- Transport for intra-node host memory communication: SYSV, POSIX, KNEM, CMA, XPMEM
- Transport for accelerator memory communication: GPU

UC-S (Services): common utilities, data structures, memory management

Drivers: OFA Verbs driver, Cray driver, OS kernel, CUDA
Hardware
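As an illustration of the Message Passing API domain (tag matching, with rendezvous handled transparently for large messages), here is a hedged sketch of a blocking tagged send built from the non-blocking UC-P primitives. It assumes the public openucx API; endpoint creation (worker-address exchange plus ucp_ep_create) is assumed to have been done elsewhere.

```c
/* A blocking tagged send built from UC-P non-blocking primitives.
 * Assumes the public openucx API (ucp.h); endpoint creation (exchanging the
 * remote worker address and calling ucp_ep_create) is done elsewhere. */
#include <ucp/api/ucp.h>

/* Completion callback required by ucp_tag_send_nb(); completion itself is
 * observed below via ucp_request_check_status(). */
static void send_cb(void *request, ucs_status_t status)
{
    (void)request;
    (void)status;
}

static ucs_status_t send_tagged_sync(ucp_worker_h worker, ucp_ep_h ep,
                                     const void *buf, size_t len, ucp_tag_t tag)
{
    /* Post a non-blocking tagged send; the receiver matches it with
     * ucp_tag_recv_nb() using the same tag. */
    ucs_status_ptr_t req = ucp_tag_send_nb(ep, buf, len,
                                           ucp_dt_make_contig(1), tag, send_cb);

    if (UCS_PTR_IS_ERR(req)) {
        return UCS_PTR_STATUS(req);      /* failed to start the send */
    }
    if (!UCS_PTR_IS_PTR(req)) {
        return UCS_OK;                   /* completed immediately (eager path) */
    }

    /* Drive progress until completion; rendezvous for large messages is
     * handled transparently by the UC-P protocol layer. */
    ucs_status_t status;
    do {
        ucp_worker_progress(worker);
        status = ucp_request_check_status(req);
    } while (status == UCS_INPROGRESS);

    ucp_request_free(req);
    return status;
}
```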

Preliminary Evaluation (UCT)

- Two HP ProLiant DL380p Gen8 servers
- Intel Xeon E5-2697 2.7 GHz CPUs
- Mellanox SX6036 switch
- Single-port Mellanox Connect-IB FDR (10.10.5056)
- Mellanox OFED 2.4-1.0.4 (VERBS)
- Prototype implementation of Accelerated VERBS (AVERBS)

[Figure: OpenSHMEM and OSHMEM (Open MPI) put latency over shared memory (intranode). Latency in usec (log scale) vs. message size (8 B to 4 MB), comparing OpenSHMEM over UCX, OpenSHMEM over UCCS, and OSHMEM; lower is better.]
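The operation measured in these benchmarks is OpenSHMEM's one-sided put. As a point of reference only (this is not the benchmark code used for the slides), a minimal put latency loop written against standard OpenSHMEM might look like the following sketch:

```c
/* Minimal put latency loop in OpenSHMEM; a sketch of the kind of
 * measurement shown above, not the actual benchmark used in the slides. */
#include <shmem.h>
#include <stdio.h>
#include <sys/time.h>

#define ITERS 10000

int main(void)
{
    shmem_init();
    int me    = shmem_my_pe();
    int npes  = shmem_n_pes();
    size_t sz = 8;                                   /* message size in bytes */
    char *buf = shmem_malloc(sz);                    /* symmetric heap buffer */

    shmem_barrier_all();

    if (me == 0 && npes > 1) {
        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        for (int i = 0; i < ITERS; i++) {
            shmem_putmem(buf, buf, sz, 1);           /* one-sided put to PE 1 */
            shmem_quiet();                           /* wait for completion */
        }
        gettimeofday(&t1, NULL);
        double usec = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
        printf("put latency: %.2f usec (%zu bytes)\n", usec / ITERS, sz);
    }

    shmem_barrier_all();
    shmem_free(buf);
    shmem_finalize();
    return 0;
}
```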

[Figure: OpenSHMEM and OSHMEM (Open MPI) put injection rate on Connect-IB (mlx5). Message rate in put operations per second vs. message size (8 B to 4 KB), comparing OpenSHMEM over UCX, OpenSHMEM over UCCS, OSHMEM, and OSHMEM over UCX; higher is better.]

[Figure: OpenSHMEM and OSHMEM (Open MPI) GUPS benchmark on Connect-IB. Billion updates per second vs. number of PEs (2 to 16, two nodes), comparing UCX (mlx5) and OSHMEM (mlx5); higher is better.]

[Figure: MPICH message rate, preliminary results on Connect-IB. Non-blocking tag-send message rate (millions of messages per second) vs. message size (1 B to 4 MB), comparing MPICH/UCX and MPICH/MXM. Slide courtesy of Pavan Balaji, ANL; sent to the UCX mailing list.]

Where is UCX being used?

- Upcoming release of Open MPI 2.0
- Upcoming release of MPICH
- OpenSHMEM reference implementation
- PaRSEC runtime, used by scientific linear algebra libraries

What Next?

- UCX Consortium! http://www.csm.ornl.gov/newsite/
- UCX specification: an early draft is available online at http://www.openucx.org/early-draft-of-ucx-specification-is-here/
- Production releases: MPICH, Open MPI, OpenSHMEM(s), GASNet, and more
- Support for more networks, applications, and libraries
- UCX Hackathon 2016! Will be announced on the mailing list and website

GitHub: https://github.com/orgs/openucx
Web: www.openucx.org
Contact: info@openucx.org
Mailing list: https://elist.ornl.gov/mailman/listinfo/ucx-group (ucx-group@elist.ornl.gov)

Acknowledgments

Thanks to all our partners!

Questions?

Unified Communication X Framework
Web: www.openucx.org
GitHub: https://github.com/orgs/openucx
Contact: info@openucx.org
Mailing list: https://elist.ornl.gov/mailman/listinfo/ucx-group (ucx-group@elist.ornl.gov)