High-Performance GPU Clustering: GPUDirect RDMA over 40GbE iWARP

High-Performance GPU Clustering: GPUDirect RDMA over 40GbE iWARP
Tom Reu
Consulting Applications Engineer, Chelsio Communications
tomreu@chelsio.com

Chelsio Corporate Snapshot

Leader in high-speed converged Ethernet adapters
- Leading 10/40GbE adapter solution provider for servers and storage systems
- Manufacturing: ~800K ports shipped
- High-performance protocol engine: 80 MPPS, 1.5 μsec, ~5M+ IOPS
- Feature-rich solution: media streaming hardware/software, WAN optimization, security, etc.

Company facts
- Founded in 2000
- 150-strong staff
- R&D offices: USA (Sunnyvale), India (Bangalore), China (Shanghai)

Market coverage: HPC, oil and gas, media, security, finance, service/cloud, storage

RDMA Overview

Performance and efficiency in return for a new communication paradigm (Chelsio T5 RNIC)
- Direct memory-to-memory transfer
- All protocol processing handled by the NIC; must be in hardware
- Protection handled by the NIC; user-space access requires both local and remote enforcement
- Asynchronous communication model; reduced host involvement
- Performance: latency (polling), throughput, efficiency
- Zero copy, kernel bypass (user-space I/O), CPU bypass
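
As a quick, hands-on illustration of this memory-to-memory, polling-driven model, the standard perftest utilities shipped with OFED can exercise an RDMA WRITE path between two hosts. This is a minimal sketch only; the device name cxgb4_0 and the host name server-hostname are placeholders for an iWARP-capable setup, and iWARP runs typically use rdma_cm connection establishment (the -R flag).

    # Server node: wait for a connection, use rdma_cm establishment (-R),
    # sweep all message sizes (-a) on the assumed T5 device cxgb4_0
    ib_write_lat -R -d cxgb4_0 -a

    # Client node: connect to the server and measure RDMA WRITE latency
    ib_write_lat -R -d cxgb4_0 -a server-hostname

    # The same pair of commands with ib_write_bw reports RDMA WRITE throughput
    ib_write_bw -R -d cxgb4_0 --report_gbits                 # server side
    ib_write_bw -R -d cxgb4_0 --report_gbits server-hostname # client side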

iWARP: What is it?
- Provides the ability to do Remote Direct Memory Access over Ethernet using TCP/IP
- Uses well-known IB verbs; inboxed in OFED since 2008
- Runs on top of TCP/IP
- Chelsio implements the iWARP/TCP/IP stack in silicon: cut-through send, cut-through receive

Benefits
- Engineered to use typical Ethernet; no need for technologies like DCB or QCN
- Natively routable; multi-path support at Layer 3 (and Layer 2)
- Runs on TCP/IP: mature and proven, goes where TCP/IP goes (everywhere)
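
Because iWARP devices surface through the same verbs interface as other RDMA transports, a quick sanity check that the RNIC is visible to the OFED stack can look like the following sketch. The device name cxgb4_0 and the server address are assumptions for a Chelsio T5 setup.

    # List RDMA devices known to the verbs stack and show details for the
    # assumed T5 device name
    ibv_devices
    ibv_devinfo -d cxgb4_0

    # Minimal end-to-end RDMA connectivity check with rping
    # (librdmacm-utils), using ordinary IP addressing over the 40GbE ports
    rping -s -a 0.0.0.0 -v -C 10         # server side
    rping -c -a <server-ip> -v -C 10     # client side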

iWARP
- iWARP updates and enhancements are done by the IETF STORM (STORage Maintenance) working group
- RFCs:
  - RFC 5040: A Remote Direct Memory Access Protocol Specification
  - RFC 5041: Direct Data Placement over Reliable Transports
  - RFC 5044: Marker PDU Aligned Framing for TCP Specification
  - RFC 6580: IANA Registries for the RDDP Protocols
  - RFC 6581: Enhanced RDMA Connection Establishment
  - RFC 7306: Remote Direct Memory Access (RDMA) Protocol Extensions
- Support from several vendors: Chelsio, Intel, QLogic

iWARP
Increasing interest in iWARP as of late. Some use cases:
- High Performance Computing
- SMB Direct
- GPUDirect RDMA
- NFS over RDMA
- FreeBSD iWARP
- Hadoop RDMA
- Lustre RDMA
- NVMe over RDMA fabrics

iWARP: Advantages over Other RDMA Transports
- It's Ethernet: well understood and administered
- Uses TCP/IP: mature and proven
- Supports rack, cluster, datacenter, LAN/MAN/WAN and wireless
- Compatible with SSL/TLS
- No need for bolt-on technologies like DCB or QCN
- Does not require a totally new network infrastructure; reduces TCO and OpEx

iWARP vs. RoCE

iWARP
- Native TCP/IP over Ethernet, no different from NFS or HTTP
- Works with ANY Ethernet switches; works with ALL Ethernet equipment
- No need for special QoS or configuration: TRUE plug-and-play
- No need for special configuration; preserves network robustness
- TCP/IP allows reach to cloud scale
- No distance limitations; ideal for remote communication and HA
- WAN routable, uses any IP infrastructure
- Standard for the whole stack has been stable for a decade
- Transparent and open IETF standards process

RoCE
- Difficult to install and configure; needs a team of experts ("plug-and-debug")
- Requires DCB: expensive equipment upgrade
- Poor interoperability; may not work with switches from different vendors
- Fixed QoS configuration; DCB must be set up identically across all switches
- Easy to break; switch configuration can cause performance collapse
- Does not scale; requires PFC, limited to a single subnet
- Short distance; PFC range is limited to a few hundred meters maximum
- RoCEv1 not routable; RoCEv2 requires lossless IP infrastructure and restricts router configuration
- RoCEv2 incompatible with v1; more fixes to missing reliability and scalability layers required and expected
- Incomplete specification and opaque process

Chelsio's T5: a Single ASIC Does It All
- High-performance, purpose-built protocol processor
- Runs multiple protocols:
  - TCP with stateless offload and full offload
  - UDP with stateless offload
  - iWARP
  - FCoE with offload
  - iSCSI with offload
- All of these protocols run on T5 with a SINGLE FIRMWARE IMAGE; no need to reinitialize the card for different uses
- Future proof: e.g. support for NVMf, yet preserves today's investment in iSCSI

T5 ASIC Architecture
[Block diagram: PCIe x8 Gen 3 DMA engine; TX/RX application co-processors; general purpose processor; cut-through TX/RX memory; traffic manager; data-flow protocol engine; embedded Layer 2 Ethernet switch; lookup, filtering and firewall; 1G/10G/40G and 100M/1G/10G MACs; memory controller with on-chip DRAM and optional external DDR3 memory]
- High-performance, purpose-built protocol processor
- Single-processor, data-flow pipelined architecture
- Single connection at 40Gb, low latency
- Up to 1M connections
- Concurrent multi-protocol operation

Leading Unified Wire Architecture
Converged network architecture with all-in-one adapter and software:
- Storage: NVMe over Fabrics, SMB Direct, iSCSI and FCoE with T10-DIX, iSER and NFS over RDMA, pNFS (NFS 4.1) and Lustre, NAS offload, diskless boot, replication and failover
- Virtualization & Cloud: hypervisor offload, SR-IOV with embedded VEB, VEPA, VN-TAGs, VXLAN/NVGRE, NFV and SDN, OpenStack storage, Hadoop RDMA
- HFT: WireDirect technology, ultra-low latency, highest messages/sec, wire-rate classification
- HPC: iWARP RDMA over Ethernet, GPUDirect RDMA, Lustre RDMA, pNFS (NFS 4.1), OpenMPI, MVAPICH
- Networking: 4x10GbE/2x40GbE NIC, full protocol offload, Data Center Bridging, hardware firewall, wire analytics, DPDK/netmap
- Media Streaming: traffic management, video segmentation offload, large stream capacity
- Single qualification, single SKU, concurrent multi-protocol operation

GPUDirect RDMA
- Introduced by NVIDIA with the Kepler-class GPUs; available today on Tesla and Quadro GPUs as well
- Enables multiple GPUs, third-party network adapters, SSDs and other devices to read and write CUDA host and device memory
- Avoids unnecessary system memory copies and associated CPU overhead by copying data directly to and from pinned GPU memory
- One hardware limitation: the GPU and the network device MUST share the same upstream PCIe root complex
- Available with InfiniBand, RoCE, and now iWARP
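
One practical way to check the shared-root-complex requirement above is NVIDIA's topology report. This is only a sketch: the exact legend varies by driver version, and the vendor names used in the lspci filter are examples for a Tesla-plus-Chelsio node.

    # Print the GPU/NIC topology matrix; GPUDirect RDMA generally wants the
    # GPU and the RNIC under the same PCIe root complex (entries such as
    # PIX/PXB rather than paths that cross a CPU/QPI boundary).
    nvidia-smi topo -m

    # lspci can also show where the GPU and the adapter sit on the PCIe tree
    lspci -tv | grep -iE "nvidia|chelsio"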

GPUDirect RDMA with T5
- T5 iWARP RDMA over Ethernet certified with NVIDIA GPUDirect
- Read/write GPU memory directly from the network adapter
- Peer-to-peer PCIe communication
- Bypass host CPU, bypass host memory, zero copy
- Ultra-low latency, very high performance
- Scalable GPU pooling
- Any Ethernet networks
[Diagram: on each host, payload moves directly between the RNIC and GPU memory across the LAN/datacenter/WAN network, with only notifications passing through the CPU and host memory]

Modules required for GPUDirect RDMA with iWARP

Chelsio modules
- cxgb4: Chelsio adapter driver
- iw_cxgb4: Chelsio iWARP driver
- rdma_ucm: RDMA user-space connection manager

NVIDIA modules
- nvidia: NVIDIA driver
- nvidia_uvm: NVIDIA Unified Memory
- nv_peer_mem: NVIDIA peer memory
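
A minimal loading sequence for the modules listed above might look like the following sketch. It assumes the Chelsio driver package, the NVIDIA driver, and the nv_peer_mem package are already installed on the node.

    # Chelsio NIC, iWARP and RDMA user-space connection manager modules
    modprobe cxgb4
    modprobe iw_cxgb4
    modprobe rdma_ucm

    # NVIDIA driver, unified memory and peer-memory modules
    modprobe nvidia
    modprobe nvidia_uvm
    modprobe nv_peer_mem

    # Confirm everything is loaded before starting MPI jobs
    lsmod | grep -E "cxgb4|iw_cxgb4|rdma_ucm|nvidia|nv_peer_mem"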

Case Studies

HOOMD-blue
- General-purpose particle simulation toolkit
- Stands for: Highly Optimized Object-oriented Many-particle Dynamics - Blue Edition
- Runs on GPUDirect RDMA WITH NO CHANGES TO THE CODE AT ALL
- More info: www.codeblue.umich.edu/hoomd-blue

HOOMD-blue Test Configuration
- 4 nodes: Intel E5-1660 v2 @ 3.7 GHz, 64 GB RAM
- Chelsio T580-CR 40Gb adapter
- NVIDIA Tesla K80 (2 GPUs per card)
- RHEL 6.5, OpenMPI 1.10.0, OFED 3.18, CUDA Toolkit 6.5
- HOOMD-blue v1.3.1-9, Chelsio-GDR-1.0.0.0

Command line:
$MPI_HOME/bin/mpirun --allow-run-as-root -mca btl_openib_want_cuda_gdr 1 -np X -hostfile /root/hosts -mca btl openib,sm,self -mca btl_openib_if_include cxgb4_0:1 --mca btl_openib_cuda_rdma_limit 65538 -mca btl_openib_receive_queues P,131072,64 -x CUDA_VISIBLE_DEVICES=0,1 /root/hoomd-install/bin/hoomd ./bmark.py --mode=gpu|cpu

HOOMD-blue Lennard-Jones Liquid 64K Particles Benchmark
- Classic benchmark for general-purpose MD simulations
- Representative of the performance HOOMD-blue achieves for straight pair-potential simulations

HOOMD-blue Lennard-Jones Liquid 64K Particles Benchmark Results
Average timesteps per second (higher is better):

Test     CPU                  GPU w/o GPUDirect RDMA   GPU w/ GPUDirect RDMA
Test 1   26 (2 CPU cores)     488 (2 GPUs)             1,230 (2 GPUs)
Test 2   88 (8 CPU cores)     503 (4 GPUs)             1,403 (4 GPUs)
Test 3   214 (40 CPU cores)   1,089 (8 GPUs)           1,771 (8 GPUs)

HOOMD-blue Lennard-Jones Liquid 64K Particles Benchmark Results
Hours to complete 10e6 steps (shorter is better):

Test     CPU                  GPU w/o GPUDirect RDMA   GPU w/ GPUDirect RDMA
Test 1   108 (2 CPU cores)    6 (2 GPUs)               2.2 (2 GPUs)
Test 2   32 (8 CPU cores)     5.5 (4 GPUs)             1.7 (4 GPUs)
Test 3   13 (40 CPU cores)    2.5 (8 GPUs)             1.5 (8 GPUs)

HOOMD-blue Quasicrystal Benchmark
- Runs a system of particles with an oscillatory pair potential that forms an icosahedral quasicrystal
- This model is used in the research article: Engel M, et al. (2015) Computational self-assembly of a one-component icosahedral quasicrystal, Nature Materials 14 (January), p. 109-116

HOOMD-blue Quasicrystal Results
Average timesteps per second (higher is better):

Test     CPU                  GPU w/o GPUDirect RDMA   GPU w/ GPUDirect RDMA
Test 1   11 (2 CPU cores)     308 (2 GPUs)             407 (2 GPUs)
Test 2   43 (8 CPU cores)     656 (4 GPUs)             728 (4 GPUs)
Test 3   31 (40 CPU cores)    915 (8 GPUs)             1,158 (8 GPUs)

HOOMD-blue Quasicrystal Results
Hours to complete 10e6 steps (shorter is better):

Test     CPU                  GPU w/o GPUDirect RDMA   GPU w/ GPUDirect RDMA
Test 1   264 (2 CPU cores)    9 (2 GPUs)               7 (2 GPUs)
Test 2   63 (8 CPU cores)     4 (4 GPUs)               3.5 (4 GPUs)
Test 3   86 (40 CPU cores)    3 (8 GPUs)               2.4 (8 GPUs)

Caffe Deep Learning Framework
- Open-source deep learning software from the Berkeley Vision and Learning Center
- Updated to include CUDA support to utilize GPUs
- The standard version does NOT include MPI support

MPI implementations
- mpi-caffe: used to train a large network across a cluster of machines; model-parallel distributed approach
- caffe-parallel: a faster framework for deep learning; data-parallel via MPI, splits the training data across nodes

Summary: GPUDirect RDMA over 40GbE iWARP
- iWARP provides RDMA capabilities to an Ethernet network
- iWARP uses tried-and-true TCP/IP as its underlying transport mechanism
- Using iWARP does not require a whole new network infrastructure or the management requirements that come along with it
- iWARP can be used with existing software running on GPUDirect RDMA, with NO CHANGES required to the code
- Applications that use GPUDirect RDMA will see huge performance improvements
- Chelsio provides 10/40Gb iWARP TODAY, with 25/50/100Gb on the horizon

More Information: GPUDirect RDMA over 40GbE iWARP
- Visit our website, www.chelsio.com, for more white papers, benchmarks, etc.
- GPUDirect RDMA white paper: http://www.chelsio.com/wp-content/uploads/resources/t5-40gb-linux-GPUDirect.pdf
- Webinar: https://www.brighttalk.com/webcast/13671/189427
- Beta code for GPUDirect RDMA is available TODAY from our download site at service.chelsio.com
- Sales questions: sales@chelsio.com
- Support questions: support@chelsio.com

Questions?

Thank You