MELLANOX EDR UPDATE & GPUDIRECT MELLANOX SR. SE 정연구

Leading Supplier of End-to-End Interconnect Solutions, Enabling the Use of Data. Comprehensive end-to-end InfiniBand and Ethernet portfolio: ICs, adapter cards, switches/gateways, software and services, Metro/WAN, and cables/modules, at speeds of 10, 25, 40, 50, 56, and 100 gigabits per second.

Entering the Era of 100Gb/s:
- Adapter: 100Gb/s, 0.7us latency, 150 million messages per second (10/25/40/50/56/100Gb/s)
- InfiniBand switch: 36 EDR (100Gb/s) ports, <90ns latency, 7.2Tb/s throughput
- Ethernet switch: 32 100GbE ports or 64 25/50GbE ports (10/25/40/50/100GbE), 6.4Tb/s throughput
- Cables: copper (passive, active), optical cables (VCSEL), silicon photonics

End-to-End Interconnect Solutions for All Platforms. Highest performance and scalability for x86, POWER, GPU, ARM, and FPGA-based compute and storage platforms, at 10, 20, 25, 40, 50, 56, and 100Gb/s speeds. Smart interconnect to unleash the power of all compute architectures.

Mellanox Centers of Excellence (Applications, Infrastructure, Exascale Computing Technology): early access to new technologies (EDR, Multi-Host, HPC-X, etc.), a co-design effort to optimize and accelerate application performance and scalability, and participation in the Mellanox advisory board. Together we can develop the solutions of tomorrow.

Technology Roadmap: One-Generation Lead over the Competition. [Roadmap chart: Mellanox interconnect generations of 20, 40, 56, 100, and 200Gb/s span the terascale, petascale, and exascale eras from 2000 to 2020; Mellanox-connected milestones include the 2003 Virginia Tech (Apple) cluster, 3rd on the TOP500, and Roadrunner, 1st on the TOP500.]

EDR InfiniBand Performance, Commercial Applications. [Chart: performance gains of 15%, 25%, and 40%.]

EDR InfiniBand Performance, Weather Simulation. Weather Research and Forecasting (WRF) model, optimization effort with the HPC Advisory Council (HPCAC). EDR InfiniBand delivers 28% higher performance on a 32-node cluster, and the performance advantage increases with system size.

InfiniBand Adapters Performance Comparison (Mellanox adapters, single-port performance)

                                ConnectX-4      Connect-IB      ConnectX-3 Pro
  Port speed                    EDR 100G        FDR 56G         FDR 56G
  Uni-directional throughput    100 Gb/s        54.24 Gb/s      51.1 Gb/s
  Bi-directional throughput     195 Gb/s        107.64 Gb/s     98.4 Gb/s
  Latency                       0.61 us         0.63 us         0.64 us
  Message rate                  149.5 M msg/s   105 M msg/s     35.9 M msg/s

Mellanox Interconnect Advantages:
- Proven, scalable, high-performance end-to-end connectivity
- Flexible: supports all compute architectures (x86, POWER, ARM, GPU, FPGA, etc.)
- Standards-based (InfiniBand, Ethernet), supported by a large ecosystem
- Offloading architecture: RDMA, application acceleration engines, etc.
- Flexible topologies: Fat Tree, mesh, 3D Torus, Dragonfly+, etc.
- Converged I/O: compute, storage, and management on a single fabric
- Backward and future compatible
EDR InfiniBand delivers the highest application performance. Speed up your present, protect your future; paving the road to exascale computing together.

Mellanox PeerDirect with NVIDIA GPUDirect RDMA: native support for peer-to-peer communication between Mellanox HCA adapters and NVIDIA GPU devices.

Industry Adoption of GPUDirect RDMA. GPUDirect RDMA was released in May 2014 and is available for download from Mellanox. Adoption and development continue to grow across technical disciplines, leveraging RDMA and NVIDIA GPUs in today's energy-efficient datacenters: big data, bioscience, defense, database, green computing, government, healthcare, HPC, risk analysis, space exploration, transportation, oil & gas, physics, research/education, and financial.

What is PeerDirect?
- Natively supported by the Mellanox OFED 2.1 or later distribution
- Supports peer-to-peer communication between Mellanox adapters and third-party devices
- No unnecessary system memory copies or CPU overhead: no host buffer is needed for each device, and no shared host buffer is needed either
- Supports NVIDIA GPUDirect RDMA through a separate plug-in
- Supports the RoCE protocol over Mellanox VPI
Supported with all Mellanox ConnectX-3 and Connect-IB adapters.

PeerDirect Technology. Based on the peer-to-peer capability of PCIe, with support for any PCIe peer that can provide access to its memory (NVIDIA GPU, Xeon Phi, AMD, custom FPGA). [Diagram: PeerDirect RDMA path running directly between GPU memory on two nodes over PCIe and the HCAs, bypassing system memory and the CPUs.]
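
As a concrete illustration of how PeerDirect is exposed to applications, the following minimal sketch (not taken from the slides) allocates a buffer in GPU memory with CUDA and registers it with a Mellanox HCA through the standard verbs API. It assumes the NVIDIA peer-memory plug-in (nv_peer_mem / nvidia-peermem) is loaded so that ibv_reg_mr() can pin device memory; device selection, sizes, and error handling are simplified.

```c
/*
 * Sketch: register a CUDA device buffer with an RDMA HCA via verbs.
 * Assumes the NVIDIA peer-memory plug-in is loaded.
 * Build (roughly): gcc reg.c -lcudart -libverbs  (plus CUDA include/lib paths)
 */
#include <stdio.h>
#include <infiniband/verbs.h>
#include <cuda_runtime.h>

int main(void)
{
    /* Open the first RDMA device (e.g. a Mellanox HCA). */
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no RDMA device found\n"); return 1; }
    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* Allocate a buffer directly in GPU memory. */
    void *gpu_buf = NULL;
    size_t len = 1 << 20;                      /* 1 MiB */
    cudaMalloc(&gpu_buf, len);

    /* Register the GPU buffer with the HCA.  With the peer-memory plug-in
     * in place, the HCA can DMA to/from this buffer without staging the
     * data through host memory. */
    struct ibv_mr *mr = ibv_reg_mr(pd, gpu_buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) { fprintf(stderr, "ibv_reg_mr on GPU memory failed\n"); return 1; }
    printf("GPU buffer registered: lkey=0x%x rkey=0x%x\n", mr->lkey, mr->rkey);

    /* ... post send/RDMA work requests that reference mr->lkey / mr->rkey ... */

    ibv_dereg_mr(mr);
    cudaFree(gpu_buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```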

Evolution of GPUDirect RDMA.
- Before GPUDirect: network and third-party device drivers did not share buffers and needed to make a redundant copy in host memory.
- With GPUDirect (shared host memory pages model): the network and the GPU can share pinned (page-locked) buffers, eliminating the need for a redundant copy in host memory.

GPUDirect RDMA (GPUDirect 3.0). With GPUDirect RDMA using PeerDirect:
- Eliminates CPU bandwidth and latency bottlenecks
- Uses remote direct memory access (RDMA) transfers between GPUs
- Results in significantly improved MPI efficiency between GPUs on remote nodes
- Based on PCIe PeerDirect technology

GPUDirect Sync (GPUDirect 4.0).
- GPUDirect RDMA (3.0): direct data path between the GPU and the Mellanox interconnect, but the control path still uses the CPU. The CPU prepares and queues communication tasks on the GPU, the GPU triggers communication on the HCA, and the Mellanox HCA directly accesses GPU memory.
- GPUDirect Sync (GPUDirect 4.0): both the data path and the control path go directly between the GPU and the Mellanox interconnect.
[Chart: 2D stencil benchmark, average time per iteration (us) vs. number of nodes/GPUs; RDMA+PeerSync is 23-27% faster than RDMA only.] Maximum performance for GPU clusters.

Hardware Considerations for GPUDirect RDMA. Note: on current platforms, GPUDirect RDMA requires that the NVIDIA GPU and the Mellanox InfiniBand adapter share the same PCIe root complex (or sit behind the same PCIe switch). This is a limitation of today's hardware, not of GPUDirect RDMA itself.
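
One way to check this constraint in practice (an illustration, not from the slides) is to print the PCI addresses of the GPU and of the HCA and compare where they sit in the PCIe tree, for example against the output of lspci -t or nvidia-smi topo -m. The HCA name mlx5_0 below is an assumption; substitute the device reported on your system.

```c
/* Sketch: print the PCI addresses of GPU 0 and of an assumed HCA "mlx5_0"
 * so their placement in the PCIe topology can be compared. */
#include <stdio.h>
#include <limits.h>
#include <unistd.h>
#include <cuda_runtime.h>

int main(void)
{
    /* PCI address of the first CUDA device, e.g. "0000:81:00.0". */
    char gpu_pci[32] = "unknown";
    cudaDeviceGetPCIBusId(gpu_pci, sizeof(gpu_pci), 0);
    printf("GPU 0 PCI address  : %s\n", gpu_pci);

    /* The HCA's PCI address is the target of its sysfs "device" symlink
     * (hypothetical device name; adjust to the HCA present on the system). */
    char link[PATH_MAX];
    ssize_t n = readlink("/sys/class/infiniband/mlx5_0/device", link, sizeof(link) - 1);
    if (n > 0) {
        link[n] = '\0';
        printf("mlx5_0 device link : %s\n", link);  /* e.g. ../../../0000:82:00.0 */
    } else {
        printf("mlx5_0 not found under /sys/class/infiniband\n");
    }
    return 0;
}
```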

Mellanox PeerDirect with NVIDIA GPUDirect RDMA: HOOMD-blue. HOOMD-blue is a general-purpose molecular dynamics simulation code accelerated on GPUs. GPUDirect RDMA allows direct peer-to-peer GPU communication over InfiniBand, unlocking performance between the GPU and InfiniBand: it provides a significant decrease in GPU-to-GPU communication latency and complete CPU offload of all GPU communication across the network. Demonstrated up to a 102% performance improvement with a large number of particles. [Chart: gains of 21% and 102%.]

Performance of MVAPICH2 with GPUDirect RDMA (source: Prof. DK Panda). GPU-GPU internode MPI latency (lower is better): 88% lower latency, down to 2.37 usec. GPU-GPU internode MPI bandwidth (higher is better): an 8.6-10X increase in throughput.
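
For reference, this is roughly what using GPUDirect RDMA through a CUDA-aware MPI looks like from application code: device pointers are passed directly to MPI calls and the library moves the data GPU-to-GPU over InfiniBand. A minimal sketch, assuming an MPI build with CUDA support enabled (for MVAPICH2-GDR, typically run with MV2_USE_CUDA=1); it is not the benchmark code referenced above.

```c
/* Sketch: point-to-point transfer between GPU buffers with a CUDA-aware MPI.
 * Run with two ranks on two GPU nodes, e.g.: mpirun -np 2 ./gdr_pingpong */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;                       /* 1M floats */
    float *d_buf = NULL;
    cudaMalloc(&d_buf, n * sizeof(float));       /* buffer lives in GPU memory */

    if (rank == 0) {
        cudaMemset(d_buf, 0, n * sizeof(float)); /* illustrative payload */
        /* Device pointer handed straight to MPI; with GPUDirect RDMA the
         * HCA reads GPU memory directly, no host staging copy. */
        MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```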

Remote GPU Access through rCUDA: GPU as a Service. [Diagram: on the client side, the CUDA application calls the rCUDA library, which forwards requests over the network interface; on the server side, the rCUDA daemon executes them with the CUDA driver and runtime on the GPU servers, exposing virtual GPUs to the clients.] rCUDA provides remote access from every node to any GPU in the cluster.

GPU-Based Application Acceleration Solutions with Mellanox GPUDirect Technology (timeline):
- 2010: GPUDirect 1.0
- 2012: GPUDirect 2.0
- 2014: GPUDirect RDMA (3.0)
- 2015: GPUDirect Sync (4.0)
2X data center performance, 5X higher throughput, 5X lower latency.

THANK YOU