The Future of Interconnect Technology

The Future of Interconnect Technology. Michael Kagan, CTO. HPC Advisory Council, Stanford, 2014.

Exponential Data Growth: The Best Interconnect Required. Data is projected to grow 44X, from 0.8 zettabytes in 2009 to 35 zettabytes in 2020 (source: IDC).

The Power of Data: data-intensive simulations, the Internet of Things, national security, healthcare, smart cars, congestion-free traffic, and business intelligence.

Data Must Always Be Accessible, in Real Time. The data path runs from sensor data through compute and storage to archive. Lower latency, higher bandwidth, RDMA, offloads, NIC/switch routing, and overlay networks: a smart interconnect is required to unleash the power of data.

InfiniBand's Unsurpassed System Efficiency. With TOP500 systems listed according to their efficiency, InfiniBand is the key element responsible for the highest system efficiency; Mellanox delivers efficiencies of up to 96% with InfiniBand.

FDR InfiniBand Delivers the Highest Return on Investment. (Application benchmark charts, higher is better; source: HPC Advisory Council.)

Business Success Depends on a Fast Interconnect. Examples: 13 million financial transactions per day and 4 billion database inserts for real-time fraud detection, where accuracy, detail, and fast response are required, with 10X higher performance and a 50% CAPEX reduction; Microsoft Bing Maps reacting to customer needs in real time; data queries reduced from 20 minutes to 20 seconds for a 235-supermarket chain across 8 US states; and a 97% reduction in database recovery time, from 7 days to 4 hours, for a tier-1 Fortune 100 company's Web 2.0 application.

Helping to Make the World a Better Place: SANGER, Sequence Analysis and Genomics Research. Genomic analysis for pediatric cancer patients. Challenge: an individual patient's RNA analysis took 7 days, and the goal was to reduce it to 5 days. InfiniBand reduced the RNA-sequence data analysis time per patient to only 1 hour: fast interconnect for fighting pediatric cancer.

Mellanox InfiniBand Paves the Road to Exascale Computing, accelerating half of the world's Petascale systems. Mellanox Connected Petascale system examples.

NASA Ames Research Center Pleiades: 20K InfiniBand nodes with Mellanox end-to-end FDR and QDR InfiniBand. It supports a variety of scientific and engineering projects, including coupled atmosphere-ocean models, future space vehicle design, large-scale dark matter halos and galaxy evolution, and Asian monsoon water cycle high-resolution climate simulations.

InfiniBand Enables the Lowest Application Cost in the Cloud (examples). Microsoft Windows Azure: 90.2% cloud efficiency and 33% lower cost per application. Other examples: cloud application performance improved up to 10X; a 3x increase in VMs per physical server with consolidation of network and storage I/O, yielding 32% lower cost per application; and 694% higher network performance.

Dominant in Storage Interconnects. SMB Direct: market-leading performance with RDMA interconnects.

Technology Roadmap: 10Gb/s, 20Gb/s, 40Gb/s, 56Gb/s, 100Gb/s, and 200Gb/s generations spanning Terascale, Petascale, and Exascale, charted from 2000 through 2020. Milestones include the 2003 Virginia Tech (Apple) system ranked 3rd on the TOP500, the 1st-ranked Roadrunner, and Mellanox Connected mega supercomputers.

Connect-IB: Architectural Foundation for Exascale Computing

Mellanox Connect-IB: The World's Fastest Adapter. The 7th generation of Mellanox interconnect adapters and the world's first 100Gb/s interconnect adapter (dual-port FDR 56Gb/s InfiniBand), it delivers 137 million messages per second, 4X higher than the competition, and supports the new, innovative InfiniBand scalable transport: Dynamically Connected.

Connect-IB Provides the Highest Interconnect Throughput (higher is better). The bandwidth chart compares Connect-IB FDR (dual port), ConnectX-3 FDR, ConnectX-2 QDR, and a competing InfiniBand adapter (source: Prof. DK Panda). Gain your performance leadership with Connect-IB adapters.

Connect-IB Delivers the Highest Application Performance: 200% higher performance versus the competition with only 32 nodes, and the performance gap increases with cluster size.

Fabric Collective Acceleration: Solutions for MPI/SHMEM/PGAS

Collective Operation Challenges at Large Scale. Collective algorithms are not topology aware and can be inefficient; many-to-many communications cause congestion; and slow nodes and OS jitter hurt scalability and increase variability. (Chart contrasts ideal and actual behavior.)

Mellanox Collectives Acceleration Components. CORE-Direct: a US Department of Energy (DOE) funded project with ORNL and Mellanox that provides adapter-based hardware offloading for collective operations, includes floating-point capability on the adapter for data reductions, and exposes the CORE-Direct API through the Mellanox drivers. FCA: a software plug-in package that integrates into available MPIs, provides scalable, topology-aware collective operations, utilizes the powerful InfiniBand multicast and QoS capabilities, and integrates the CORE-Direct collective hardware offloads.

The Effects of System Noise on Application Performance. Minimizing the impact of system noise on applications is critical for scalability. (Chart compares the ideal case, the system-noise case, and CORE-Direct offload.)

CORE-Direct Enables Computation and Communication Overlap, providing support for overlapping computation and communication (synchronous execution versus CORE-Direct asynchronous execution).

Nonblocking Alltoall (Overlap-Wait) Benchmark: CORE-Direct offload allows the Alltoall benchmark to run with almost 100% compute overlap. The overlap pattern is sketched below.
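The overlap-wait structure behind this benchmark can be illustrated with a standard MPI-3 nonblocking collective: post the Alltoall, compute independently while the interconnect (with an offload engine such as CORE-Direct) progresses the exchange, then wait. The sketch below is a minimal illustration, not the original benchmark code; the buffer sizes and the placeholder compute loop are assumptions.

```c
/* Minimal sketch of the overlap-wait pattern: post MPI_Ialltoall, do
 * independent compute, then wait. Offload engines such as CORE-Direct
 * progress the collective in hardware while the CPU computes.
 * Buffer sizes and the compute loop are illustrative assumptions. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int count = 1024;                      /* elements per peer (assumed) */
    double *sendbuf = malloc((size_t)count * nprocs * sizeof(double));
    double *recvbuf = malloc((size_t)count * nprocs * sizeof(double));
    for (int i = 0; i < count * nprocs; i++)
        sendbuf[i] = (double)i;

    /* Post the collective; it progresses while we compute. */
    MPI_Request req;
    MPI_Ialltoall(sendbuf, count, MPI_DOUBLE,
                  recvbuf, count, MPI_DOUBLE, MPI_COMM_WORLD, &req);

    /* Independent computation overlapped with the data exchange. */
    double acc = 0.0;
    for (int i = 0; i < 1000000; i++)
        acc += (double)i * 1e-9;

    MPI_Wait(&req, MPI_STATUS_IGNORE);           /* complete the Alltoall */

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return (acc < 0.0) ? 1 : 0;
}
```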

Accelerator and GPU Offloads

GPUDirect 1.0, receive and transmit paths. (Diagrams compare the non-GPUDirect data path with the GPUDirect 1.0 path; each panel shows the CPU, system memory, chipset, GPU, GPU memory, and InfiniBand adapter.)

GPUDirect RDMA, receive and transmit paths. (Diagrams compare the GPUDirect 1.0 data path with the GPUDirect RDMA path; each panel shows the CPU, system memory, chipset, GPU, GPU memory, and InfiniBand adapter.)

Performance of MVAPICH2 with GPUDirect RDMA (source: Prof. DK Panda). GPU-to-GPU internode MPI latency drops by 67%, to 5.49 usec for small messages (lower is better), and GPU-to-GPU internode MPI bandwidth increases 5X (higher is better), measured over message sizes from 1 byte to 4 KB.
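With a CUDA-aware MVAPICH2 build and GPUDirect RDMA, GPU buffers can be handed directly to MPI calls and the adapter moves the data without staging it through host memory. Below is a minimal sketch of that usage pattern, assuming two ranks and an arbitrary 4 KB message; it is an illustration, not the benchmark used for the numbers above.

```c
/* Minimal CUDA-aware MPI sketch: pass device pointers directly to MPI.
 * With a GPUDirect-RDMA-capable stack (e.g. MVAPICH2-GDR), the transfer
 * can bypass host memory. Message size and rank roles are assumptions. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const size_t count = 4096;                /* message size in bytes (assumed) */
    char *d_buf = NULL;
    cudaMalloc((void **)&d_buf, count);       /* GPU buffer used directly by MPI */

    if (rank == 0) {
        cudaMemset(d_buf, 1, count);          /* fill the send buffer on the GPU */
        MPI_Send(d_buf, (int)count, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(d_buf, (int)count, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```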

Performance of MVAPICH2 with GPUDirect RDMA: execution time of the HSG (Heisenberg Spin Glass) application with 2 GPU nodes, plotted against problem size (source: Prof. DK Panda).

Remote GPU Access through rCUDA: GPU as a Service. On the client side, the CUDA application uses the rCUDA library in place of the local CUDA driver and runtime; on the server side, GPU servers run the rCUDA daemon on top of the CUDA driver and runtime, and the two sides communicate through their network interfaces, with virtual GPUs mapped onto the servers' physical GPUs. rCUDA provides remote access from every node to any GPU in the system.
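Because rCUDA interposes on the standard CUDA runtime API, applications need no source changes; whether the calls reach a local or a remote GPU is decided by the rCUDA library and its configuration. The sketch below is a plain CUDA runtime example written under that assumption; the buffer size and device index are illustrative.

```c
/* Unmodified CUDA runtime API usage: under rCUDA the same calls are
 * forwarded to a remote GPU server, so no source changes are needed.
 * Allocation size and the use of device 0 are illustrative assumptions. */
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    int ndev = 0;
    if (cudaGetDeviceCount(&ndev) != cudaSuccess || ndev == 0) {
        fprintf(stderr, "no GPUs visible (local or via rCUDA)\n");
        return 1;
    }
    cudaSetDevice(0);                         /* may resolve to a remote GPU */

    const size_t bytes = 1 << 20;             /* 1 MB buffer (assumed) */
    char *host = (char *)malloc(bytes);
    memset(host, 0x5a, bytes);

    void *dev = NULL;
    cudaMalloc(&dev, bytes);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);   /* travels over the network under rCUDA */
    cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dev);
    free(host);
    printf("round-trip through %d visible GPU(s) completed\n", ndev);
    return 0;
}
```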

rcuda Performance Comparison 2014 Mellanox Technologies 30

Other Developments

RDMA Accelerates OpenStack Storage: RDMA accelerates iSCSI storage. Compute servers run VM guests on a KVM hypervisor with Open-iSCSI and iSER over an RDMA adapter; storage servers run the OpenStack (Cinder) iSCSI/iSER target (tgt) with an RDMA adapter, cache, and local disks, connected over the switching fabric. Cinder volume storage performance: 1.3 GBytes/s for iSCSI over TCP versus 5.5 GBytes/s with iSER. iSER patches are available on the OpenStack branch: https://github.com/mellanox/openstack. The solution utilizes OpenStack built-in components and management (Open-iSCSI, the tgt target, Cinder) to accelerate storage access.

Next Generation Enterprises: The Generation of Open Ethernet. The freedom to choose and create any software and any management enables vendor and user differentiation with no limitations. A closed Ethernet switch is a locked-down vertical solution with proprietary management and proprietary software; an Open Ethernet switch is an open platform running the management and software of your choice.

Open Ethernet Solutions: the freedom to choose open-source software, third-party software, home-grown software, or switch-vendor software.

Futures

Thank You