Interconnect Your Future

Transcription:

Interconnect Your Future: Smart Interconnect for Next Generation HPC Platforms. Gilad Shainer, August 2016, 4th Annual MVAPICH User Group (MUG) Meeting.

Mellanox Connects the World's Fastest Supercomputer. The National Supercomputing Center in Wuxi, China is #1 on the TOP500 Supercomputing List: 93 Petaflop performance, 3X higher than #2 on the TOP500; 41K nodes, 10 million cores, 256 cores per CPU; Mellanox adapter and switch solutions. The TOP500 list has evolved and now includes HPC and Cloud / Web2.0 hyperscale systems. Mellanox connects 41.2% of overall TOP500 systems, 70.4% of the TOP500 HPC platforms, and 46 Petascale systems, nearly 50% of the total. InfiniBand is the interconnect of choice for HPC compute and storage infrastructures.

Accelerating HPC Leadership Systems: the Summit and Sierra systems. Proud to pave the path to Exascale.

The Ever-Growing Demand for Higher Performance. Performance development runs from Terascale to Petascale to Exascale across 2000-2020, with Roadrunner as the first Petascale system. The interconnect is the enabling technology: hardware, software, and application co-design, moving from SMP to clusters and from single-core to many-core.

The Intelligent Interconnect to Enable Exascale Performance. CPU-centric: limited to main CPU usage, which results in performance limitations; the CPU must wait for the data, creating performance bottlenecks. Co-design: creating synergies enables higher performance and scale; working on the data as it moves enables performance and scale.

Breaking the Application Latency Wall. Ten years ago: network latency ~10 microseconds, communication framework ~100 microseconds. Today: network ~0.1 microseconds, communication framework ~10 microseconds. Future, with co-design: network ~0.05 microseconds, communication framework ~1 microsecond. Today, network device latencies are on the order of 100 nanoseconds. The challenge is enabling the next order-of-magnitude improvement in application performance; the solution is creating synergies between software and hardware through an intelligent interconnect. The intelligent interconnect paves the road to Exascale performance.

Mellanox Smart Interconnect Solutions: Switch-IB 2 and ConnectX-5.

Switch-IB 2 and ConnectX-5 Smart Interconnect Solutions. SHArP enables Switch-IB 2 to execute data aggregation/reduction operations in the network: Barrier, Reduce, All-Reduce, and Broadcast; Sum, Min, Max, Min-loc, Max-loc, OR, XOR, and AND; integer and floating-point, 32/64-bit; delivering a 10X performance improvement for MPI and SHMEM/PGAS communications. Further capabilities: 100Gb/s throughput, 0.6 usec end-to-end latency, 200M messages per second, MPI collectives in hardware, MPI tag matching in hardware, in-network memory, PCIe Gen3 and Gen4, an integrated PCIe switch, and advanced dynamic routing.
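
To make the in-network reduction concrete, here is a minimal sketch of the kind of MPI Allreduce that SHArP can offload to the switch. It is plain, standard MPI; whether the reduction actually runs in the Switch-IB 2 fabric depends on using a SHArP-enabled MPI library and fabric, which is an assumption outside this code.

#include <mpi.h>
#include <stdio.h>

/* Minimal MPI_Allreduce example: the collective that SHArP can execute
 * in the switch when a SHArP-enabled MPI library is used. The
 * application code is unchanged; the offload is transparent.          */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = (double)rank;   /* each rank contributes one value  */
    double sum   = 0.0;

    /* Sum across all ranks; with SHArP the reduction tree can live in
     * the switch fabric instead of on the host CPUs.                   */
    MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over all ranks = %f\n", sum);

    MPI_Finalize();
    return 0;
}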

ConnectX-5 EDR 100G Architecture. Host side: x86, OpenPOWER, GPU, ARM, and FPGA hosts attach over PCIe Gen4 with Multi-Host technology and an integrated PCIe switch. On-chip blocks: offload engines, memory, RDMA transport, and eSwitch and routing. Network side: InfiniBand / Ethernet at 10, 20, 40, 50, 56, and 100G.

Switch-IB 2 SHArP Performance Advantage. MiniFE is a finite-element mini-application that implements kernels representative of implicit finite-element codes. For the Allreduce MPI collective, SHArP delivers a 10X to 25X performance improvement.

SHArP Performance Advantage with Intel Xeon Phi Knights Landing. Benchmarks: the OSU MPI benchmark and IMB (Intel MPI Benchmark), measuring run time, where lower is better. Maximizing KNL performance: 50% reduction in run time (customer results).

SHArP Technology Performance. OpenFOAM is a popular computational fluid dynamics application; SHArP delivers 2.2X higher performance.

BlueField System-on-a-Chip (SoC) Solution. Target uses: NVMe flash storage arrays and scale-out storage (NVMe over Fabrics); accelerating and virtualizing VNFs with Open vSwitch (OVS), SDN, and overlay networking offloads. It integrates ConnectX-5 with multicore ARM and state-of-the-art capabilities: 10 / 25 / 40 / 50 / 100G Ethernet and InfiniBand, PCIe Gen3 / Gen4, and hardware acceleration offloads for RDMA, RoCE, NVMe-oF, and RAID. A family of products covers a range of ARM core counts, I/O ports and speeds, and price/performance points.

InfiniBand Router Solutions. Isolation between different InfiniBand networks, so each network can be managed separately, with native InfiniBand connectivity between different network topologies (Fat-Tree, Torus, Dragonfly, etc.). The SB7780 router (1U) supports up to six different subnets.

Technology Roadmap: 100G, 200G, 400G, and onward to 1000G across 2016, 2019, 2021, and beyond. A standards-based interconnect, with the same software, backward and future compatible, carrying world-wide programs from Petascale (#1 on the TOP500 at 100 Petaflops) to Exascale.

GPUDirect RDMA Technology: Maximize Performance via Accelerator and GPU Offloads.

GPUs are Everywhere!

Performance of MPI with GPUDirect RDMA (source: Prof. DK Panda). GPU-GPU internode MPI latency: 2.18 usec, an 88% reduction (lower is better). GPU-GPU internode MPI bandwidth: 9.3X higher, roughly a 10X increase in throughput (higher is better).

Mellanox GPUDirect RDMA Performance Advantage. HOOMD-blue is a general-purpose molecular dynamics simulation code accelerated on GPUs. GPUDirect RDMA allows direct peer-to-peer GPU communication over InfiniBand and unlocks performance between the GPU and InfiniBand: it significantly decreases GPU-GPU communication latency and completely offloads the CPU from GPU communications across the network. Result: 102% improvement, 2X application performance!
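
As an illustration of how applications exercise GPUDirect RDMA, here is a minimal sketch assuming a CUDA-aware MPI library (for example MVAPICH2-GDR or Open MPI built with CUDA support): the GPU device pointer is passed straight to MPI, and with GPUDirect RDMA the HCA reads and writes GPU memory without staging through the host. The buffer size and ranks are illustrative.

#include <mpi.h>
#include <cuda_runtime.h>

/* Sketch: with a CUDA-aware MPI, device buffers can be handed directly
 * to MPI calls; GPUDirect RDMA then moves the data GPU-to-GPU over
 * InfiniBand without a host-memory copy.                              */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const size_t count = 1 << 20;                 /* 1M floats per message */
    float *gpu_buf;
    cudaMalloc((void **)&gpu_buf, count * sizeof(float));

    if (rank == 0) {
        cudaMemset(gpu_buf, 0, count * sizeof(float));
        MPI_Send(gpu_buf, count, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);   /* device pointer */
    } else if (rank == 1) {
        MPI_Recv(gpu_buf, count, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                                 /* lands in GPU memory */
    }

    cudaFree(gpu_buf);
    MPI_Finalize();
    return 0;
}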

GPUDirect Sync (GPUDirect 4.0). GPUDirect RDMA (3.0) provides a direct data path between the GPU and the Mellanox interconnect, but the control path still uses the CPU: the CPU prepares and queues communication tasks for the GPU, the GPU triggers communication on the HCA, and the Mellanox HCA directly accesses GPU memory. With GPUDirect Sync (GPUDirect 4.0), both the data path and the control path go directly between the GPU and the Mellanox interconnect. On a 2D stencil benchmark (average time per iteration in microseconds versus number of nodes/GPUs), RDMA+PeerSync is 23% to 27% faster than RDMA alone at 2 and 4 nodes/GPUs. Maximum performance for GPU clusters.

UCX: Unified Communication X.

UCX: Unified Communication X Framework. www.openucx.org, ucx-group@email.ornl.gov

Background. MXM: developed by Mellanox Technologies; an HPC communication library for InfiniBand devices and shared memory; primary focus on MPI and PGAS. UCCS: developed by ORNL, UH, and UTK; originally based on the Open MPI BTL and OPAL layers; an HPC communication library for InfiniBand, Cray Gemini/Aries, and shared memory; primary focus on OpenSHMEM and PGAS, with MPI also supported. PAMI: developed by IBM for BG/Q, PERCS, and InfiniBand verbs network devices plus shared memory; supports MPI, OpenSHMEM, PGAS, CHARM++, and X10; C++ components, aggressive multi-threading with contexts, active messages, and non-blocking collectives with hardware acceleration support.

High-level Overview. Applications sit on top: MPI implementations (MPICH, Open MPI, etc.), PGAS models (OpenSHMEM, UPC, CAF, X10, Chapel, etc.), task-based runtimes (PaRSEC, OCR, Legion, etc.), and I/O frameworks (burst buffer, ADIOS, etc.). UCP (Protocols) exposes the message-passing, PGAS, task-based, and I/O API domains. UCT (Transports) covers InfiniBand verbs (RC, UD, XRC, DCT), Cray Gemini/Aries (GNI), host memory (SYSV, POSIX, KNEM, CMA, XPMEM), and accelerator memory (CUDA). UCS (Services) provides common utilities, sitting above the platforms, hardware, and drivers.
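
To give a feel for the UCP layer, here is a minimal initialization sketch against the open-source UCX API from www.openucx.org; it only brings up a context and a worker, and endpoint creation plus tag-matched sends (which require exchanging worker addresses out of band) are left out. Treat it as an illustrative sketch rather than a complete program.

#include <ucp/api/ucp.h>
#include <stdio.h>

/* Minimal sketch of bringing up the UCP (protocol) layer of UCX.
 * Assumes UCX is installed; compile with -lucp -lucs.               */
int main(void)
{
    ucp_config_t  *config;
    ucp_context_h  context;
    ucp_worker_h   worker;

    /* Read UCX configuration from the environment (UCX_* variables). */
    if (ucp_config_read(NULL, NULL, &config) != UCS_OK)
        return 1;

    /* Request tag-matching (MPI-style) and RMA (PGAS-style) features. */
    ucp_params_t params = {
        .field_mask = UCP_PARAM_FIELD_FEATURES,
        .features   = UCP_FEATURE_TAG | UCP_FEATURE_RMA
    };
    if (ucp_init(&params, config, &context) != UCS_OK)
        return 1;
    ucp_config_release(config);

    /* A worker owns the communication resources for one thread.      */
    ucp_worker_params_t wparams = {
        .field_mask  = UCP_WORKER_PARAM_FIELD_THREAD_MODE,
        .thread_mode = UCS_THREAD_MODE_SINGLE
    };
    if (ucp_worker_create(context, &wparams, &worker) != UCS_OK) {
        ucp_cleanup(context);
        return 1;
    }

    printf("UCX context and worker initialized\n");

    /* Endpoints (ucp_ep_create) and tagged sends/receives would go here. */
    ucp_worker_destroy(worker);
    ucp_cleanup(context);
    return 0;
}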

Technology Roadmap (recap): 100G, 200G, 400G, and onward to 1000G across 2016, 2019, 2021, and beyond. A standards-based interconnect, with the same software, backward and future compatible, carrying world-wide programs from Petascale (#1 on the TOP500 at 100 Petaflops) to Exascale.

Thank You