The Future of High Performance Interconnects

The Future of High Performance Interconnects
Ashrut Ambastha, HPC Advisory Council, Perth, Australia, August 2017

When Algorithms Go Rogue (two illustration slides)

InfiniBand Accelerates AI
- Facebook AI Supercomputer: #31 on the TOP500 list
- NVIDIA AI Supercomputer: #32 on the TOP500 list
- EDR InfiniBand In-Network Computing technology is key for scalable deep learning systems
- RDMA accelerates deep learning performance by 2x and has become the de facto solution for AI

Mellanox Accelerates the World's Fastest Supercomputers
- Accelerates the #1 supercomputer
- InfiniBand connects 36% of the total TOP500 systems (179 systems)
- InfiniBand connects 60% of the HPC TOP500 systems
- InfiniBand accelerates 45% of the Petascale systems
- EDR InfiniBand solutions grew 2.5x in six months
- Connects all of the 40G Ethernet systems (3 systems) and the first 100G Ethernet system on the list
- InfiniBand was the most used HPC interconnect in the first half of 2017, connecting 2.5x more end-user projects than Omni-Path
InfiniBand is the interconnect of choice for HPC infrastructures, enabling machine learning, high-performance, Web 2.0, cloud, storage and big data applications.

Exponential Data Growth: The Need for Intelligent and Faster Interconnect
- CPU-Centric (Onload): must wait for the data, which creates performance bottlenecks
- Data-Centric (Offload): analyze data as it moves
Faster data speeds and In-Network Computing enable higher performance and scale.

In-Network Computing to Enable Data-Centric Data Centers
(Diagram: CPUs, GPUs, FPGA co-processors and SmartNIC adapters connected by switches in Mesh, Torus and Dragonfly+ topologies.)
In-Network Computing is key for the highest return on investment.

In-Network Computing to Enable Data-Centric Data Centers
(Diagram: the same fabric annotated with offload technologies: RDMA, CORE-Direct, Tag Matching, GPUDirect, NVMe over Fabrics, SHARP, SHIELD, security engines, programmable FPGA and ARM elements, and NVMe storage, across Mesh, Torus and Dragonfly+ topologies.)
In-Network Computing is key for the highest return on investment.

Making Interconnect Smart Again! Critical for High Performance Computing and Machine Learning Applications

Data-Centric Architecture to Overcome Latency Bottlenecks
- CPU-Centric (Onload), network only: HPC / machine learning communication latencies of 30-40 us
- Data-Centric (Offload), In-Network Computing: HPC / machine learning communication latencies of 3-4 us
Intelligent interconnect paves the road to exascale performance.

MPI Profile of an Application (e.g. WRF)
Take MPI_Reduce as an example: in this application, 4-8 byte MPI_Reduce operations account for a significant share of the communication time. SHARP can reduce this latency from ~100 us to under 10 us for a 512-node run.
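As a concrete illustration, here is a minimal mpi4py sketch of the kind of small reduction profiled here (assumptions: mpi4py and NumPy are available; SHARP offload itself is enabled outside the application, typically through the MPI launcher and the collective library, not in code):

```python
# Minimal 8-byte MPI reduction, the message size that dominates the profile above.
# With SHARP, such reductions are executed in the switch fabric instead of on the host CPUs.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
local = np.array([float(comm.Get_rank())])   # one float64 = 8 bytes
total = np.empty(1)

comm.Reduce(local, total, op=MPI.SUM, root=0)
if comm.Get_rank() == 0:
    print("sum of ranks:", total[0])
```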

SHARP Allreduce Performance Advantages
SHARP enables a 75% reduction in latency, providing scalable, flat latency.
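A small timing sketch (assuming mpi4py and an 8-byte allreduce, as in common latency benchmarks) that can be used to check whether allreduce latency stays flat as ranks are added:

```python
# Average latency of a small MPI_Allreduce over many iterations.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
sendbuf = np.zeros(1)              # 8-byte payload
recvbuf = np.empty(1)
iters = 1000

comm.Barrier()
t0 = MPI.Wtime()
for _ in range(iters):
    comm.Allreduce(sendbuf, recvbuf, op=MPI.SUM)
elapsed = MPI.Wtime() - t0

if comm.Get_rank() == 0:
    print("avg allreduce latency: %.2f us" % (elapsed / iters * 1e6))
```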

Performance of MPI with GPUDirect RDMA
(Charts: GPU-GPU internode MPI latency and bandwidth, source: Prof. DK Panda.)
GPUDirect RDMA delivers 88% lower GPU-GPU internode MPI latency (down to 2.18 usec) and roughly a 10x (9.3x measured) increase in bandwidth.
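A minimal sketch of GPU-to-GPU messaging with a CUDA-aware MPI (assumptions: mpi4py 3.1+ built against a CUDA-aware MPI library, CuPy for device buffers, and exactly two ranks); with GPUDirect RDMA the adapter reads and writes GPU memory directly, which is where the gains above come from:

```python
# GPU-to-GPU ping-pong: device buffers are passed straight to MPI, with no host staging copies.
from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
msg = cp.zeros(1 << 20, dtype=cp.uint8)   # 1 MiB message resident in GPU memory

if rank == 0:
    comm.Send(msg, dest=1, tag=0)
    comm.Recv(msg, source=1, tag=1)
elif rank == 1:
    comm.Recv(msg, source=0, tag=0)
    comm.Send(msg, dest=0, tag=1)
```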

RDMA & GPUDirect for High-Performance Machine Learning
- GPUDirect accelerates communication between servers and storage and allows peer-to-peer transfers between GPUs; it requires RDMA (RoCE on Ethernet)
- ML training scenario: billions of TFLOPS per training run, years of compute-days on Xeon; GPUDirect over a RoCE-enabled network turns years into days
- Without GPUDirect the same data is copied three times; with GPUDirect (which requires RoCE) those extra copies are avoided

Accelerating TensorFlow with gRPC over RDMA
(Chart: completion time in ms vs. message size from 1K to 256K bytes, RDMA vs. TCP; lower is better; RDMA is >2x faster.)
- TensorFlow: open-source machine learning from Google; the hottest ML software, more popular than Linux; powers 40 different Google products
- Distributed training uses the gRPC framework, Google's optimized RPC for distributed networks
- RDMA acceleration over UCX (Unified Communication X), an initiative by ORNL, Mellanox, IBM, NVIDIA and others
- Integration with upstream gRPC/TensorFlow; >2x performance gain with RDMA
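An illustrative sketch of selecting an RDMA transport for distributed TensorFlow (assumptions: a TensorFlow 1.x build compiled with the contrib verbs/RDMA transport; the hostnames, ports and task index below are placeholders; a UCX-based integration would use a different transport selection):

```python
# Distributed TensorFlow server choosing an RDMA-capable transport instead of gRPC over TCP.
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps":     ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

# protocol="grpc+verbs" routes tensor traffic over RDMA verbs; plain "grpc" uses TCP.
server = tf.train.Server(cluster, job_name="worker", task_index=0,
                         protocol="grpc+verbs")
server.join()
```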

Mellanox 100GbE RoCE Powers Baidu's Self-Driving Car
(Diagram: a core switch feeding 18 100GbE leaf switches, Leaf1-Leaf18, and 38 racks, Rack1-Rack38, of four GPU nodes each.)
- Server configuration: 16x Tesla K40M GPUs and 2x Mellanox ConnectX-4 100GbE adapters per node; 38 racks of 4U GPU boxes, 152 nodes in total
- Network configuration: 18x Mellanox Spectrum SN2700 100GbE switches with Mellanox 100GbE DAC and optical cables
- ~2x faster training with the PaddlePaddle machine learning framework
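For reference, a back-of-the-envelope sketch of the deployment totals implied by these numbers (counts taken from the slide; nodes per rack is inferred from 152 nodes across 38 racks):

```python
# Totals for the Baidu deployment described above.
racks, nodes = 38, 152
gpus_per_node, nics_per_node = 16, 2

print(nodes // racks)             # 4 GPU boxes per rack
print(nodes * gpus_per_node)      # 2432 Tesla K40M GPUs in total
print(nodes * nics_per_node)      # 304 ConnectX-4 100GbE ports into the 18 leaf switches
```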

Multi-Host Socket Direct Technology: An Innovative Solution for Dual-Socket Servers

Multi-Host Socket-Direct Adapters Increase Server Return on Investment
- 30%-60% better CPU utilization
- 50%-80% lower data latency
- 15%-28% better data throughput
- Available for all servers (x86, Power, ARM, etc.)
Highest application performance, scalability and productivity.

Highest-Performance 200Gb/s Interconnect Solutions
- Adapters: 200Gb/s, 0.6 us latency, 200 million messages per second (10 / 25 / 40 / 50 / 56 / 100 / 200 Gb/s)
- InfiniBand switches: 40 HDR (200Gb/s) ports or 80 HDR100 ports, 16 Tb/s throughput, <90 ns latency
- Ethernet switches: 16x 400GbE, 32x 200GbE or 128x 25/50GbE ports (10 / 25 / 40 / 50 / 100 / 200 GbE), 6.4 Tb/s throughput
- Cables: transceivers, active optical and copper cables (10 / 25 / 40 / 50 / 56 / 100 / 200 Gb/s) based on VCSELs, silicon photonics and copper
- Software: MPI, SHMEM/PGAS and UPC for commercial and open-source applications, leveraging the hardware accelerations

Not a Mere CAD Drawing
Design an HDR-uplinked EDR network today, then move to 2x EDR-speed nodes on the same network with ConnectX-6: a 40-port 1U HDR platform presents 80 EDR-class (HDR100) ports.
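A minimal sketch of the port arithmetic behind this migration path (assuming the standard 2:1 split of one HDR port into two HDR100 links, as on the switch slide above):

```python
# Port math for the HDR -> 2x EDR-speed migration sketched above.
# Each 200 Gb/s HDR switch port can be split into two 100 Gb/s (HDR100 / EDR-class) links.
hdr_ports = 40                          # 40-port 1U HDR platform
links_per_hdr_port = 2                  # splitter: 1x HDR -> 2x HDR100
print(hdr_ports * links_per_hdr_port)   # -> 80 node-facing 100 Gb/s ports
```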

What We Don't Want to Achieve (illustration slide)

Mellanox Technology Options to Scale for More Bandwidth
Mellanox is pushing the boundary on every dimension to increase the bandwidth of LinkX:
(Chart: scaling dimensions for link bandwidth: line rate per lane from 50 to 100 Gb/s (2x); wavelength channels (4x 25G NRZ for 100G parallel, 4x 100G WDM for 400G WDM, up to 80 wavelengths); physical channels over SMF/MMF (4-8x, up to 32); and more complex modulations, up to PAM16.)
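A rough back-of-the-envelope sketch (illustrative numbers only, ignoring FEC and encoding overhead) of how these dimensions multiply into aggregate link bandwidth:

```python
# Aggregate link rate = channels (lanes or wavelengths) x symbol rate x bits per symbol.
# NRZ carries 1 bit/symbol, PAM4 carries 2, PAM16 carries 4.
def link_rate_gbps(channels, gbaud_per_channel, bits_per_symbol):
    return channels * gbaud_per_channel * bits_per_symbol

print(link_rate_gbps(4, 25, 1))   # 4 x 25G NRZ lanes            -> 100G parallel
print(link_rate_gbps(4, 25, 2))   # same lanes, PAM4 modulation   -> 200G
print(link_rate_gbps(8, 25, 2))   # add physical channels as well -> 400G
```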

InfiniBand Delivers the Best Return on Investment
- 30%-250% higher return on investment
- Up to 50% savings on capital and operating expenses
- Highest application performance, scalability and productivity
Application results: Weather 1.3x better, Automotive 2x better, Chemistry 1.4x better, Molecular Dynamics 2.5x better, Genomics 1.3x better.

Let's Summarize
- Scalable, intelligent, flexible, high-performance, end-to-end connectivity
- Standards-based (InfiniBand, Ethernet), supported by a large ecosystem
- Supports all compute architectures: x86, Power, ARM, GPU, FPGA, etc.
- Offloading architecture: RDMA, application acceleration engines, etc.
- Flexible topologies: Fat Tree, Mesh, 3D Torus, Dragonfly+, etc.
- Converged I/O: compute, storage and management on a single fabric
- Backward and future compatible
The Future Depends on Smart Interconnect

More Questions? Ashrut Ambastha, ashrut@mellanox.com

Thank You