The Future of Interconnect Technology

The Future of Interconnect Technology. Michael Kagan, CTO. HPC Advisory Council, Stanford, 2014.

Exponential Data Growth: The Best Interconnect Required. Data is projected to grow 44X, from 0.8 zettabytes in 2009 to 35 zettabytes in 2020 (source: IDC).

The Power of Data: data-intensive simulations, the Internet of Things, national security, healthcare, smart cars, congestion-free traffic, and business intelligence.

Data Must Always Be Accessible, in Real Time. The data path runs from sensor data through compute and storage to archive. Lower latency, higher bandwidth, RDMA, offloads, NIC/switch routing, and overlay networks: a smart interconnect is required to unleash the power of data.

InfiniBand's Unsurpassed System Efficiency. With TOP500 systems listed according to their efficiency, InfiniBand is the key element responsible for the highest system efficiency; Mellanox delivers efficiencies of up to 96% with InfiniBand.

FDR InfiniBand Delivers the Highest Return on Investment. (Application benchmark charts, higher is better; source: HPC Advisory Council.)

Business Success Depends on a Fast Interconnect. Examples: 13 million financial transactions per day and 4 billion database inserts for real-time fraud detection, where accuracy, detail, and fast response are required, with 10X higher performance and a 50% CAPEX reduction; Microsoft Bing Maps reacting to customer needs in real time; data queries reduced from 20 minutes to 20 seconds for a 235-supermarket chain across 8 US states; and a 97% reduction in database recovery time, from 7 days to 4 hours, for a tier-1 Fortune 100 company's Web 2.0 application.

Helping to Make the World a Better Place: SANGER, Sequence Analysis and Genomics Research. Genomic analysis for pediatric cancer patients. Challenge: an individual patient's RNA analysis took 7 days, and the goal was to reduce it to 5 days. InfiniBand reduced the RNA-sequence data analysis time per patient to only 1 hour: fast interconnect for fighting pediatric cancer.

Mellanox InfiniBand Paves the Road to Exascale Computing, accelerating half of the world's Petascale systems. Mellanox Connected Petascale system examples.

NASA Ames Research Center Pleiades: 20K InfiniBand nodes with Mellanox end-to-end FDR and QDR InfiniBand. It supports a variety of scientific and engineering projects, including coupled atmosphere-ocean models, future space vehicle design, large-scale dark matter halos and galaxy evolution, and Asian monsoon water cycle high-resolution climate simulations.

InfiniBand Enables the Lowest Application Cost in the Cloud (examples). Microsoft Windows Azure: 90.2% cloud efficiency and 33% lower cost per application. Other examples: cloud application performance improved up to 10X; a 3x increase in VMs per physical server with consolidation of network and storage I/O, yielding 32% lower cost per application; and 694% higher network performance.

Dominant in Storage Interconnects. SMB Direct: market-leading performance with RDMA interconnects.

Technology Roadmap: 10Gb/s, 20Gb/s, 40Gb/s, 56Gb/s, 100Gb/s, and 200Gb/s generations spanning Terascale, Petascale, and Exascale, charted from 2000 through 2020. Milestones include the 2003 Virginia Tech (Apple) system ranked 3rd on the TOP500, the 1st-ranked Roadrunner, and Mellanox Connected mega supercomputers.

Connect-IB: Architectural Foundation for Exascale Computing

Mellanox Connect-IB: The World's Fastest Adapter. The 7th generation of Mellanox interconnect adapters and the world's first 100Gb/s interconnect adapter (dual-port FDR 56Gb/s InfiniBand), it delivers 137 million messages per second, 4X higher than the competition, and supports the new, innovative InfiniBand scalable transport: Dynamically Connected.

Connect-IB Provides the Highest Interconnect Throughput (higher is better). The bandwidth chart compares Connect-IB FDR (dual port), ConnectX-3 FDR, ConnectX-2 QDR, and a competing InfiniBand adapter (source: Prof. DK Panda). Gain your performance leadership with Connect-IB adapters.

Connect-IB Delivers the Highest Application Performance: 200% higher performance versus the competition with only 32 nodes, and the performance gap increases with cluster size.

Fabric Collective Acceleration: Solutions for MPI/SHMEM/PGAS

Collective Operation Challenges at Large Scale. Collective algorithms are not topology aware and can be inefficient; many-to-many communications cause congestion; and slow nodes and OS jitter hurt scalability and increase variability. (Chart contrasts ideal and actual behavior.)

Mellanox Collectives Acceleration Components. CORE-Direct: a US Department of Energy (DOE) funded project with ORNL and Mellanox that provides adapter-based hardware offloading for collective operations, includes floating-point capability on the adapter for data reductions, and exposes the CORE-Direct API through the Mellanox drivers. FCA: a software plug-in package that integrates into available MPIs, provides scalable, topology-aware collective operations, utilizes the powerful InfiniBand multicast and QoS capabilities, and integrates the CORE-Direct collective hardware offloads.

The Effects of System Noise on Application Performance. Minimizing the impact of system noise on applications is critical for scalability. (Chart compares the ideal case, the system-noise case, and CORE-Direct offload.)

CORE-Direct Enables Computation and Communication Overlap, providing support for overlapping computation and communication (synchronous execution versus CORE-Direct asynchronous execution).

Nonblocking Alltoall (Overlap-Wait) Benchmark: CORE-Direct offload allows the Alltoall benchmark to run with almost 100% compute overlap. The overlap pattern is sketched below.
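The overlap-wait structure behind this benchmark can be illustrated with a standard MPI-3 nonblocking collective: post the Alltoall, compute independently while the interconnect (with an offload engine such as CORE-Direct) progresses the exchange, then wait. The sketch below is a minimal illustration, not the original benchmark code; the buffer sizes and the placeholder compute loop are assumptions.

```c
/* Minimal sketch of the overlap-wait pattern: post MPI_Ialltoall, do
 * independent compute, then wait. Offload engines such as CORE-Direct
 * progress the collective in hardware while the CPU computes.
 * Buffer sizes and the compute loop are illustrative assumptions. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int count = 1024;                      /* elements per peer (assumed) */
    double *sendbuf = malloc((size_t)count * nprocs * sizeof(double));
    double *recvbuf = malloc((size_t)count * nprocs * sizeof(double));
    for (int i = 0; i < count * nprocs; i++)
        sendbuf[i] = (double)i;

    /* Post the collective; it progresses while we compute. */
    MPI_Request req;
    MPI_Ialltoall(sendbuf, count, MPI_DOUBLE,
                  recvbuf, count, MPI_DOUBLE, MPI_COMM_WORLD, &req);

    /* Independent computation overlapped with the data exchange. */
    double acc = 0.0;
    for (int i = 0; i < 1000000; i++)
        acc += (double)i * 1e-9;

    MPI_Wait(&req, MPI_STATUS_IGNORE);           /* complete the Alltoall */

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return (acc < 0.0) ? 1 : 0;
}
```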

Accelerator and GPU Offloads

GPUDirect 1.0, receive and transmit paths. (Diagrams compare the non-GPUDirect data path with the GPUDirect 1.0 path; each panel shows the CPU, system memory, chipset, GPU, GPU memory, and InfiniBand adapter.)

GPUDirect RDMA, receive and transmit paths. (Diagrams compare the GPUDirect 1.0 data path with the GPUDirect RDMA path; each panel shows the CPU, system memory, chipset, GPU, GPU memory, and InfiniBand adapter.)

Performance of MVAPICH2 with GPUDirect RDMA (source: Prof. DK Panda). GPU-to-GPU internode MPI latency drops by 67%, to 5.49 usec for small messages (lower is better), and GPU-to-GPU internode MPI bandwidth increases 5X (higher is better), measured over message sizes from 1 byte to 4 KB.
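With a CUDA-aware MVAPICH2 build and GPUDirect RDMA, GPU buffers can be handed directly to MPI calls and the adapter moves the data without staging it through host memory. Below is a minimal sketch of that usage pattern, assuming two ranks and an arbitrary 4 KB message; it is an illustration, not the benchmark used for the numbers above.

```c
/* Minimal CUDA-aware MPI sketch: pass device pointers directly to MPI.
 * With a GPUDirect-RDMA-capable stack (e.g. MVAPICH2-GDR), the transfer
 * can bypass host memory. Message size and rank roles are assumptions. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const size_t count = 4096;                /* message size in bytes (assumed) */
    char *d_buf = NULL;
    cudaMalloc((void **)&d_buf, count);       /* GPU buffer used directly by MPI */

    if (rank == 0) {
        cudaMemset(d_buf, 1, count);          /* fill the send buffer on the GPU */
        MPI_Send(d_buf, (int)count, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(d_buf, (int)count, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```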

Performance of MVAPICH2 with GPUDirect RDMA: execution time of the HSG (Heisenberg Spin Glass) application with 2 GPU nodes, plotted against problem size (source: Prof. DK Panda).

Remote GPU Access through rCUDA: GPU as a Service. On the client side, the CUDA application uses the rCUDA library in place of the local CUDA driver and runtime; on the server side, GPU servers run the rCUDA daemon on top of the CUDA driver and runtime, and the two sides communicate through their network interfaces, with virtual GPUs mapped onto the servers' physical GPUs. rCUDA provides remote access from every node to any GPU in the system.
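Because rCUDA interposes on the standard CUDA runtime API, applications need no source changes; whether the calls reach a local or a remote GPU is decided by the rCUDA library and its configuration. The sketch below is a plain CUDA runtime example written under that assumption; the buffer size and device index are illustrative.

```c
/* Unmodified CUDA runtime API usage: under rCUDA the same calls are
 * forwarded to a remote GPU server, so no source changes are needed.
 * Allocation size and the use of device 0 are illustrative assumptions. */
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    int ndev = 0;
    if (cudaGetDeviceCount(&ndev) != cudaSuccess || ndev == 0) {
        fprintf(stderr, "no GPUs visible (local or via rCUDA)\n");
        return 1;
    }
    cudaSetDevice(0);                         /* may resolve to a remote GPU */

    const size_t bytes = 1 << 20;             /* 1 MB buffer (assumed) */
    char *host = (char *)malloc(bytes);
    memset(host, 0x5a, bytes);

    void *dev = NULL;
    cudaMalloc(&dev, bytes);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);   /* travels over the network under rCUDA */
    cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dev);
    free(host);
    printf("round-trip through %d visible GPU(s) completed\n", ndev);
    return 0;
}
```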

rcuda Performance Comparison 2014 Mellanox Technologies 30

Other Developments

RDMA Accelerates OpenStack Storage: RDMA accelerates iSCSI storage. Compute servers run VM guests on a KVM hypervisor with Open-iSCSI and iSER over an RDMA adapter; storage servers run the OpenStack (Cinder) iSCSI/iSER target (tgt) with an RDMA adapter, cache, and local disks, connected over the switching fabric. Cinder volume storage performance: 1.3 GBytes/s for iSCSI over TCP versus 5.5 GBytes/s with iSER. iSER patches are available on the OpenStack branch: https://github.com/mellanox/openstack. The solution utilizes OpenStack built-in components and management (Open-iSCSI, the tgt target, Cinder) to accelerate storage access.

Next Generation Enterprises: The Generation of Open Ethernet. The freedom to choose and create any software and any management enables vendor and user differentiation with no limitations. A closed Ethernet switch is a locked-down vertical solution with proprietary management and proprietary software; an Open Ethernet switch is an open platform running the management and software of your choice.

Open Ethernet Solutions: the freedom to choose open-source software, third-party software, home-grown software, or switch-vendor software.

Futures

Thank You