The Future of High Performance Interconnects
Ashrut Ambastha
HPC Advisory Council, Perth, Australia, August 2017
When Algorithms Go Rogue
InfiniBand Accelerates AI
- Facebook AI supercomputer: #31 on the TOP500 list
- NVIDIA AI supercomputer: #32 on the TOP500 list
- EDR InfiniBand In-Network Computing technology is key for scalable deep learning systems
- RDMA accelerates deep learning performance by 2X and has become the de facto solution for AI
Mellanox Accelerates the World's Fastest Supercomputers
- Accelerates the #1 supercomputer
- InfiniBand connects 36% of all TOP500 systems (179 systems)
- InfiniBand connects 60% of the HPC TOP500 systems
- InfiniBand accelerates 45% of the Petascale systems
- EDR InfiniBand solutions grew 2.5X in six months
- Connects all 40G Ethernet systems on the list (3 systems) and the first 100G Ethernet system
- InfiniBand was the most used HPC interconnect in the first half of 2017, connecting 2.5X more end-user projects than Omni-Path
- InfiniBand is the interconnect of choice for HPC infrastructures, enabling machine learning, high-performance, Web 2.0, cloud, storage, and big data applications
Exponential Data Growth: The Need for Intelligent and Faster Interconnect
- CPU-centric (onload): must wait for the data, which creates performance bottlenecks
- Data-centric (offload): analyze data as it moves
- Faster data speeds and In-Network Computing enable higher performance and scale
In-Network Computing to Enable Data-Centric Data Centers
- [Diagram: CPUs, GPUs, FPGA co-processors, and SmartNIC adapters connected through switches in Mesh, Torus, and Dragonfly+ topologies]
- In-Network Computing is key for the highest return on investment
In-Network Computing to Enable Data-Centric Data Centers
- [Diagram: the same fabric with the in-network capabilities called out: RDMA, CORE-Direct, Tag Matching, GPUDirect, NVMe over Fabrics, SHARP, SHIELD, security, NVMe storage, and programmable FPGA/ARM elements]
- In-Network Computing is key for the highest return on investment
Making the Interconnect Smart Again!
Critical for high performance computing and machine learning applications
Data-Centric Architecture to Overcome Latency Bottlenecks
- CPU-centric (onload): HPC / machine learning communication latencies of 30-40us
- Data-centric (offload) with In-Network Computing: HPC / machine learning communication latencies of 3-4us
- Intelligent interconnect paves the road to exascale performance
MPI Profile of an Application (e.g., WRF)
- Take MPI_Reduce as an example: in this application, 4-8 byte MPI_Reduce calls account for a significant share of the MPI time
- SHARP can reduce this latency from ~100us to <10us for a 512-node run
SHARP Allreduce Performance Advantages
- SHARP enables a 75% reduction in latency, providing scalable, flat latency as node counts grow
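To make the profile above concrete, here is a minimal sketch in C of the latency-bound, small-message collective that SHARP offloads into the switch fabric. The message size, iteration counts, and the SHARP-enablement flag mentioned in the comments are illustrative assumptions, not the setup behind the numbers above.

```c
/* Minimal sketch: timing an 8-byte MPI_Allreduce, the kind of
 * latency-bound collective SHARP executes inside the switches.
 * Build: mpicc -O2 allreduce_lat.c -o allreduce_lat
 * Run:   mpirun -np <N> ./allreduce_lat
 * (SHARP enablement flags vary by HPC-X release; e.g. something
 * like -x HCOLL_ENABLE_SHARP=2 -- treat that as an assumption.) */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double in = (double)rank, out = 0.0;
    const int warmup = 100, iters = 1000;

    for (int i = 0; i < warmup; i++)   /* warm up connections and caches */
        MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)    /* the measured small-message loop */
        MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("8-byte MPI_Allreduce: %.2f us per call\n",
               (t1 - t0) / iters * 1e6);
    MPI_Finalize();
    return 0;
}
```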
Performance of MPI with GPUDirect RDMA
- GPU-GPU internode MPI latency: 2.18us, 88% lower (chart: lower is better)
- GPU-GPU internode MPI bandwidth: 9.3X (~10X) higher throughput (chart: higher is better)
- Source: Prof. DK Panda
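A minimal sketch of how GPUDirect RDMA is driven from application code, assuming a CUDA-aware MPI build (e.g., Open MPI configured with CUDA support): device pointers are handed straight to MPI, and with GPUDirect RDMA the NIC reads and writes GPU memory without a host staging copy. The ping-pong pattern and build line are assumptions for illustration.

```c
/* Sketch: internode ping-pong directly on GPU buffers.
 * Assumes a CUDA-aware MPI; run with exactly 2 ranks, one GPU each.
 * Build (illustrative): mpicc gpu_pingpong.c -o gpu_pingpong -lcudart */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;               /* 1 MiB message */
    char *dbuf;
    cudaMalloc((void **)&dbuf, n);       /* buffer lives in GPU memory */
    cudaMemset(dbuf, rank, n);

    if (rank == 0) {
        /* Device pointer passed straight to MPI: with GPUDirect RDMA
         * the NIC DMAs from GPU memory, skipping the host bounce buffer. */
        MPI_Send(dbuf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(dbuf, n, MPI_CHAR, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        MPI_Recv(dbuf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(dbuf, n, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
    }

    cudaFree(dbuf);
    MPI_Finalize();
    return 0;
}
```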
RDMA and GPUDirect for High Performance Machine Learning
- GPUDirect accelerates communication between servers and storage, and allows peer-to-peer transfers between GPUs; it requires RDMA (RoCE on Ethernet)
- ML training scenario: billions of teraFLOPs per training run, i.e., years of compute-days on Xeon; GPUDirect over an RoCE-enabled network turns years into days
- Without GPUDirect, the same data is copied 3X; with GPUDirect (over RoCE), the extra copies are eliminated
Accelerating TensorFlow with gRPC over RDMA
- [Chart: completion time (ms) vs. message size (1K-256K bytes), RDMA vs. TCP; lower is better; RDMA is >2X faster]
- TensorFlow: open-source machine learning from Google; the hottest ML software, more popular than Linux; powers 40 different Google products
- Distributed training uses the gRPC framework, Google's optimized RPC for distributed networks
- RDMA acceleration over UCX: Unified Communication X (UCX) is an initiative by ORNL, Mellanox, IBM, NVIDIA, and others
- Integration with upstream gRPC/TensorFlow delivers a >2X performance gain with RDMA
Mellanox 100GbE RoCE Powers the Baidu Self-Driving Car
- [Diagram: core switch feeding leaf switches Leaf1-Leaf18 over 100GbE, connecting racks Rack1-Rack38 of 4 GPU nodes each]
- Server configuration: 38 racks of 152 4U GPU box nodes, each with 16x Tesla K40M and 2x Mellanox ConnectX-4 100GbE adapters
- Network configuration: 18x Mellanox Spectrum SN2700 100GbE switches, Mellanox 100GbE DAC and optical cables
- ~2X faster training with the PaddlePaddle machine learning framework
Multi-Host Socket Direct Technology
An innovative solution for dual-socket servers
Multi-Host Socket-Direct Adapters Increase Server Return on Investment
- 30%-60% better CPU utilization
- 50%-80% lower data latency
- 15%-28% better data throughput
- Available for all servers (x86, Power, ARM, etc.)
- Highest application performance, scalability, and productivity
Highest-Performance 200Gb/s Interconnect Solutions
- Adapters: 200Gb/s, 0.6us latency, 200 million messages per second (10 / 25 / 40 / 50 / 56 / 100 / 200Gb/s)
- InfiniBand switches: 40 HDR (200Gb/s) ports or 80 HDR100 ports; 16Tb/s throughput, <90ns latency
- Ethernet switches: 16x 400GbE, 32x 200GbE, or 128x 25/50GbE ports (10 / 25 / 40 / 50 / 100 / 200GbE); 6.4Tb/s throughput
- Cables: transceivers, active optical and copper cables (10 / 25 / 40 / 50 / 56 / 100 / 200Gb/s); VCSELs, silicon photonics, and copper
- Software: MPI, SHMEM/PGAS, UPC for commercial and open-source applications; leverages hardware accelerations
Not a Mere CAD Drawing
- Design an HDR-uplinked EDR network today; move to 2x EDR-speed nodes on the same network with ConnectX-6
- A 40-port 1U HDR platform doubles as an 80-port EDR (HDR100) switch
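One way to see the payoff of the doubled radix, as a worked sketch (the two-level fat-tree arithmetic is illustrative; real designs depend on blocking ratios and cabling):

Splitting each HDR port into two HDR100/EDR-speed ports gives $40 \times 2 = 80$ ports per 1U switch. A two-level fat tree built from radix-$r$ switches supports up to $r^2/2$ end-points, so an 80-port platform reaches $80^2/2 = 3200$ nodes, versus $36^2/2 = 648$ for a traditional 36-port EDR switch.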
What We Don't Want to Achieve
Mellanox Technology Options to Scale for More Bandwidth
- Mellanox is pushing the boundary on every dimension to increase LinkX bandwidth
- [Chart: three scaling axes - line rate per lane (25G NRZ stepping 2x toward 50/100G with complex modulations such as PAM16), physical channels (4x up to 32x parallel SMF/MMF fibers), and wavelength channels (4-8x up to 80x WDM) - combining to reach 100G parallel (4x 25G NRZ) and 400G WDM (4x 100G WDM)]
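The aggregate bandwidth of a link is simply the product of the three axes on this chart; a worked sketch (the specific combinations are illustrative):

$$B = N_{\text{channels}} \times N_{\lambda} \times R_{\text{lane}}$$

For example, $4 \times 1 \times 25\,\mathrm{Gb/s}$ (NRZ) $= 100\,\mathrm{Gb/s}$ parallel, while $1 \times 4 \times 100\,\mathrm{Gb/s}$ (WDM) $= 400\,\mathrm{Gb/s}$; higher-order modulation such as PAM16 raises $R_{\text{lane}}$ without adding fibers or wavelengths.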
InfiniBand Delivers the Best Return on Investment
- 30%-250% higher return on investment
- Up to 50% savings on capital and operating expenses
- Highest application performance, scalability, and productivity
- Weather: 1.3X better; Automotive: 2X better; Chemistry: 1.4X better; Molecular dynamics: 2.5X better; Genomics: 1.3X better
Let's Summarize
- Scalable, intelligent, flexible, high-performance, end-to-end connectivity
- Standards-based (InfiniBand, Ethernet), supported by a large ecosystem
- Supports all compute architectures: x86, Power, ARM, GPU, FPGA, etc.
- Offloading architecture: RDMA, application acceleration engines, etc.
- Flexible topologies: Fat Tree, Mesh, 3D Torus, Dragonfly+, etc.
- Converged I/O: compute, storage, and management on a single fabric
- Backward and future compatible
The future depends on smart interconnect.
Question Time
Ashrut Ambastha, ashrut@mellanox.com
Thank You