High Performance Interconnects: Landscape, Assessments & Rankings

Dan Olds, Partner, OrionX. April 12, 2017

[Slide graphic: the HPI market segment, plotted by Link Speed (1G, 10G, 40G, 100G), Net Protocol, and App Comm. Specialized 100G interconnects (InfiniBand, OPA) carrying MPI across single-rack and multi-rack systems form the current HPI line; 1G-40G TCP/IP networks carrying JDBC, RMI, IIOP, SOAP, etc. sit outside it.]

Three Types of HPI

Ethernet
- Sold by a host of providers: Cisco, HPE, Juniper, plus many others
- Tried and true interconnect, the easiest to implement
- While it has the bandwidth of the others, latency is quite high (microseconds rather than nanoseconds)

Proprietary
- Primarily sold by Cray, SGI, and IBM, plus a few others
- You have to purchase a system in order to get their brand of HPI
- Intel is a new entrant in this segment of the market, although without an accompanying system

InfiniBand
- Mellanox has emerged as the de facto leader
- Highest performance based on published numbers: 200 Gb/s, 200 million messages/s, 90 ns latency

Key Differences in HPI: Product Maturity/Position

Ethernet
- Ethernet has been around longer than any other HPI, but has been surpassed in performance
- Still many installations, but it has lost much of its share at the high end
- Latency (measured in microseconds, not nanoseconds) is the problem, not bandwidth

Key Differences in HPI: Product Maturity/Position

Intel Omni-Path Architecture
- Intel and Omni-Path are still in their infancy, with very few installations
- A handful of customers (some big names among them), few, if any, in production
- Claims bandwidth/latency/message rate the same as or better than InfiniBand (covered later)

Key Differences in HPI: Product Maturity/Position

InfiniBand
- Has been in the HPI market since the early 2000s
- Thousands of customers, millions of nodes
- Now makes up a large proportion of the TOP500 list (187 systems)
- Synonymous with Mellanox these days

Key Differences in HPI Technology: Onload vs. Offload

Onload
- The main CPU handles all network processing chores; the adapter and switches just pass the messages
- Examples: Intel Omni-Path Architecture, Ethernet
- Think of PC servers and old UNIX systems, where the CPUs handled every task and took interrupts on communications

Offload
- The HCA and switches handle all network processing tasks, with very little or no need for main CPU cycles, allowing the CPU to continue processing applications
- Example: Mellanox InfiniBand
- Think of mainframes, whose communication-assist processors let the CPU process applications, not communications

Offload Details

Network protocol load includes:
- Link layer: packet layout, packet forwarding, flow control, data integrity, QoS
- Network layer: adds headers, routes packets from one subnet to another
- Transport layer: in-order packet delivery, divides data into packets, reassembles them at the receiver, sends/receives acknowledgements
- MPI operations: scatter, gather, broadcast, etc.

With offload, ALL of these operations are handled by the adapter hardware, for example an InfiniBand HCA.

Onload Details

Network protocol load includes:
- Link layer: packet layout, packet forwarding, flow control, data integrity, QoS
- Network layer: adds headers, routes packets from one subnet to another
- Transport layer: in-order packet delivery, divides data into packets, reassembles them at the receiver, sends/receives acknowledgements
- MPI operations: scatter, gather, broadcast, etc.

With onload, ALL of these operations are performed by the host processor, using host memory.
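To make the MPI line above concrete, the sketch below shows the scatter/compute/gather pattern those collectives implement, written against the standard MPI C API. It is illustrative only (buffer sizes and variable names are invented): the same code runs on either fabric, and whether the message processing behind the collectives lands on the host cores (onload) or on the adapter and switch hardware (offload) is decided by the fabric and MPI library underneath.

```c
/* Illustrative scatter/compute/gather pattern, as named on the slides above.
 * The fabric and MPI library, not this code, decide whether the message
 * processing runs on the host cores (onload) or on the HCA/switch (offload). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int chunk = 1024;                  /* doubles per rank (arbitrary) */
    double *root_buf = NULL;
    double *local = malloc(chunk * sizeof(double));

    if (rank == 0) {                         /* root prepares the full array */
        root_buf = malloc((size_t)chunk * nprocs * sizeof(double));
        for (int i = 0; i < chunk * nprocs; i++) root_buf[i] = (double)i;
    }

    /* Scatter: the root distributes one chunk to every rank */
    MPI_Scatter(root_buf, chunk, MPI_DOUBLE, local, chunk, MPI_DOUBLE,
                0, MPI_COMM_WORLD);

    double partial = 0.0;                    /* local compute step */
    for (int i = 0; i < chunk; i++) partial += local[i];

    /* Gather/reduce: combine every rank's partial sum back at the root */
    double total = 0.0;
    MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("sum = %.1f\n", total);

    free(local);
    free(root_buf);
    MPI_Finalize();
    return 0;
}
```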

Onload vs. Offload
- Onload vs. offload isn't a big deal when the cluster is small

Onload vs. Offload
- But it becomes a very big deal as the cluster grows larger
- It is particularly a problem for scatter/gather-type collective operations, where the head node is overrun trying to process messages

Onload vs. Offload
- As node count increases, the performance of onload drops: higher node counts mean more messaging and more pressure on the head node, and node counts are increasing significantly
- Offload uses dedicated hardware ASICs, which are much faster than general-purpose CPUs
- MPI is not highly parallel: with onload, that means speed is limited to the slowest core's speed; it has no bearing on offload speed
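One way to see why offload preserves scalability is communication/computation overlap: a collective that progresses in the fabric leaves the host cores free to compute. A minimal sketch using the standard MPI-3 non-blocking collective API (the local-work function and buffer sizes are invented for illustration):

```c
/* Sketch of communication/computation overlap with a non-blocking collective.
 * On an offload fabric the reduction can progress in the adapter/switch while
 * do_local_work() runs; on an onload fabric the same host cores must also
 * drive the network, so the overlap largely disappears. */
#include <mpi.h>

static void do_local_work(double *x, int n) {   /* stand-in for real compute */
    for (int i = 0; i < n; i++) x[i] = x[i] * 1.0001 + 1.0;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    enum { N = 1 << 20 };
    static double sbuf[N], rbuf[N], other[N];
    for (int i = 0; i < N; i++) sbuf[i] = other[i] = 1.0;

    MPI_Request req;
    /* Start the global sum, but do not wait for it yet */
    MPI_Iallreduce(sbuf, rbuf, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);

    do_local_work(other, N);   /* useful work overlapped with the collective */

    MPI_Wait(&req, MPI_STATUS_IGNORE);   /* result needed from here on */

    MPI_Finalize();
    return 0;
}
```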

Rampant FUD War

From: The Next Platform, "Intel Stretches Deep Learning On Scalable System Framework", 5/10/16
- The cost of the HPI is typically ~15-20% of the total cluster budget
- Prices in high tech typically don't increase over time
- Price points for new products are typically the same as those of the former high-end products they replace (e.g., high-end PCs, low-end servers, etc.)

More FUD Wars...
- All images provided by Intel, all from The Next Platform story "Intel Stretches Deep Learning on Scalable System Framework", May 10th, 2016
- What else do these images have in common?

FUD Wars: Behind the Numbers

It's all in the fine print, right? Here's Intel's fine print for the graphs on the previous slide. "dapl" is the key: it's an Intel MPI mechanism that doesn't allow for offload operations a la InfiniBand.

"...48 port (B0 silicon). IOU Non-posted Prefetch disabled in BIOS. Snoop hold-off timer = 9. EDR based on internal testing: Intel MPI 5.1.3, shm:dapl fabric, RHEL 7.2 -genv I_MPI_DAPL_EAGER_MESSAGE_AGGREGATION off. Mellanox EDR ConnectX-4 Single Port Rev 3 MCX455A HCA. Mellanox SB7700 36 Port EDR InfiniBand switch. MLNX_OFED_LINUX-3.2-2.0.0.0 (OFED-3.2-2.0.0). IOU Non-posted Prefetch enabled in BIOS. 1. osu_latency 8 B message. 2. osu_bw 1 MB message. 3. osu_mbw_mr, 8 B. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors."
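For orientation, the three cited OSU micro-benchmarks measure point-to-point latency, bandwidth, and message rate. A stripped-down ping-pong in the spirit of osu_latency (not the actual OSU source; the iteration count and buffer are invented) looks like this, and which fabric and MPI settings it runs over, such as the shm:dapl transport in the fine print, determines what it actually measures:

```c
/* Stripped-down ping-pong in the spirit of the cited osu_latency test
 * (8-byte messages between two ranks). Not the actual OSU benchmark code. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 10000, msg = 8;        /* 8 B messages, as in the fine print */
    char buf[8] = {0};

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, msg, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, msg, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, msg, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, msg, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    double t1 = MPI_Wtime();
    if (rank == 0)   /* one-way latency is half the average round-trip time */
        printf("avg one-way latency: %.2f us\n", (t1 - t0) * 1e6 / (2.0 * iters));

    MPI_Finalize();
    return 0;
}
```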

Even More FUD
- 100% CPU core utilization on an offload HCA?! Does anyone believe this? It would mean that about half of the TOP500 systems are absolutely useless.
- Intel is using a CPU polling mechanism that pegs the CPU on the Mellanox box at 100%, yet it has nothing to do with network communications
- Both Intel and Mellanox have benchmarked OPA at ~65% CPU utilization

FUD Aside, Here Are the Numbers (updated for HDR)

                   Intel OPA          Mellanox EDR / HDR InfiniBand
  Bandwidth        100 Gb/s           100 Gb/s / 200 Gb/s
  Latency (µs)     0.93               0.85 or less / 0.90 or less
  Message rate     89 million/s*      150 million/s / 200 million/s

  * This number, provided by Intel, has dropped from >150 million/s in 2015.
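To put bandwidth and message rate in the same frame, a quick back-of-the-envelope calculation (assuming the 8-byte messages used in the osu_mbw_mr test quoted earlier) shows why small-message traffic is bound by message rate rather than link bandwidth:

\[
100~\text{Gb/s} = 12.5~\text{GB/s},
\qquad
\frac{12.5\times 10^{9}~\text{B/s}}{8~\text{B/msg}} \approx 1.56\times 10^{9}~\text{msg/s}
\]
\[
89\times 10^{6}~\text{msg/s}\times 8~\text{B} \approx 0.7~\text{GB/s},
\qquad
200\times 10^{6}~\text{msg/s}\times 8~\text{B} = 1.6~\text{GB/s}
\]

Both quoted rates fall an order of magnitude short of the roughly 1.56 billion 8-byte messages per second a 100 Gb/s link could in principle carry, so for small messages the message-rate row, not the bandwidth row, is the binding figure.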

HPI Roadmaps
- The InfiniBand roadmap shows HDR now (200 Gb/s) and NDR down the road (400 Gb/s? 2020?)
- OPA roadmap: formerly OPA 2 in 2018, now OPA 2 in 2020... ouch
- The Ethernet roadmap shows 200 Gb/s in 2018-19

Major HPI Choices: OrionX Analysis

  Vendor             Market (Presence / Trends / Overall)   Customer (Readiness / Needs / Overall)   Product (Capabilities / Roadmap / Overall)
  Mellanox           9 / 9 / 9                              8 / 9 / 8.5                              9 / 10 / 9.5
  Ethernet vendors   7 / 7 / 7                              9 / 6 / 7.5                              7 / 6 / 6.5
  Intel              6 / 8 / 7                              6 / 7 / 6.5                              7 / 7 / 7.0

OrionX Constellation: Mellanox, Intel, Ethernet Vendors

  Vendor             Market   Product   Customer
  Ethernet           7        6.5       7.5
  Mellanox           9        9.5       8.5
  Intel              7        7.25      6.5

Questions? Comments? Concerns?

OrionX Constellation reports