Generic RDMA Enablement in Linux

Generic RDMA Enablement in Linux (Why do we need it, and how). Krishna Kumar, Linux Technology Center, IBM. February 28, 2006

AGENDA: RDMA definition; Why RDMA, and how does it work; OpenRDMA history; Architectural overview; How to get there (code base); Convergence with OpenIB; Exploiters of RDMA; References

RDMA : Definition A protocol that allows applications across multiple systems to exchange data using Direct Memory Access. Reduces the CPU usage, multiple data copies, and context-switching overhead needed to perform the transfer. Preserves memory protection semantics.

RDMA : Definition (contd) RDMA is therefore useful in applications that need high-throughput, low-latency networking, without having to add extra processors and memory to get lower latency and greater bandwidth. Current devices: InfiniBand HCAs, iWARP RNICs.

RDMA : Why is it required? CPU speeds are not keeping up with networking speeds, and using the CPU for network processing reduces the performance of CPU-intensive applications. Remote Direct Memory Access allows for: zero-copy between the application and the kernel (fewer memory copies reduce memory bandwidth usage and improve latency); minimal demands on system resources such as CPU and memory bandwidth.

RDMA : Why is it required? (contd) Other, earlier methods (like page flipping) are either not general enough or don't scale. CPU utilization for network-intensive applications: a large amount of time is spent in the TCP/IP stack copying data and managing buffers. Memory bandwidth requirements: about 3 times the link bandwidth for data receives; roughly 3 GB/s of host memory bandwidth is required to support 10 GigE at full receive link bandwidth.
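As a rough check of that figure (assuming the conventional count of three memory transits per received byte: the NIC's DMA write into a kernel buffer, the copy read out of that buffer, and the copy write into the user buffer):

\[ 10\ \mathrm{Gb/s} = 1.25\ \mathrm{GB/s}, \qquad 3 \times 1.25\ \mathrm{GB/s} \approx 3.75\ \mathrm{GB/s} \]

so sustaining full-rate 10 GigE receive takes on the order of 3-4 GB/s of host memory bandwidth, in line with the ~3 GB/s quoted above.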

How RDMA works Sending / Receiving data, old method. (Diagram: on send, the CPU copies data from the user buffer in the user process into a kernel buffer, which the regular NIC interface then transmits; on receive, incoming data lands in a kernel buffer and is copied up into the user buffer, with roughly ten numbered CPU-driven steps across the two directions.)

How RDMA works (contd) Sending / Receiving data, new method. (Diagram: the user buffer in the user process exchanges data directly with the RNIC interface; the kernel is involved only in setting up the transfer, and both the data send and the incoming data bypass kernel buffers.)

How RDMA works (contd) User memory registration with the adapter allows direct adapter access to user memory and vice versa; this reduces context switches, CPU utilization, and memory bandwidth usage. Some offload to the card reduces the per-packet CPU processing overhead.
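As an illustration of the memory-registration step described above, here is a minimal sketch using the OpenFabrics libibverbs API (one concrete RDMA API; the RNIC-PI/kal interfaces the later slides describe are not shown). The file name and build line are assumptions; build roughly as: gcc reg_mr.c -libverbs

/* Register a user buffer so the adapter can DMA directly into/out of it. */
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx) {
        fprintf(stderr, "ibv_open_device failed\n");
        return 1;
    }
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* Register a 4 KB user buffer with the adapter. */
    size_t len = 4096;
    void *buf = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) {
        perror("ibv_reg_mr");
        return 1;
    }
    printf("registered %zu bytes, lkey=0x%x rkey=0x%x\n",
           len, mr->lkey, mr->rkey);

    /* The lkey/rkey are what later send/receive/RDMA operations reference,
     * so the adapter can touch this memory with no further kernel copies. */
    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    free(buf);
    return 0;
}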

OpenRDMA history (till Jan 2006) The openrdma project started in November 2004 to: enable RDMA in Linux and get acceptance on LKML; provide a common API / single stack between InfiniBand and iWARP, using standard APIs like DAPL and IT-API; and use the RNIC-PI developed by the ICSC consortium. It is an industry-supported open source project, supporting hardware-independent RDMA-based interconnects: iWARP-based RNICs and InfiniBand HCAs.

openrdma Architecture (Diagram: OpenRDMA architecture using RNIC-PI. Consumers (uDAPL/IT-API in user space; kDAPL and other kernel consumers in kernel space) sit on the User RDMA Access Layer (ual) and Kernel RDMA Access Layer (kal). The kal reaches the Kernel RDMA Verbs Provider (kvp) and the ual reaches the User RDMA Verbs Provider (uvp) through the RNIC-PI interfaces (SYS-RNICPI, P-RNICPI, SVC-RNICPI, NP-RNICPI). Privileged operations and connection/resource management go through the kernel; non-privileged fast-path operations go directly from user space to the RNIC hardware. OS-generic code is separated from IHV private code.)

RNICPI Interfaces (Diagram: an RDMA host. On the user side, non-privileged consumer APIs (IT-API, uDAPL) sit on the User Library / User Access Layer (ual), which reaches the kernel through the syscall interface (usi) and the RNIC user device driver through the user verbs interface (upi). On the kernel side, the Kernel IT-API Provider / Kernel Access Layer (kal) exposes kapi (kDAPL, IT-API) and reaches the RNIC kernel device driver through the kernel verbs + LLP management interface (kpi) and the syscall interface (ksi). The Linux TCP/IP stack (socket layer, TCP, IPv4/IPv6, ARP/ND, link-level device interface) sits alongside; the RNIC itself handles endpoint management, memory management, IA management, and QPs.)

How to get there - Code Base This architecture led to the creation of a base RDMA stack, currently in an alpha-ready state (RNIC-PI 0.95) and being worked on by the community to add functionality and fix issues. The goal is to build an RDMA ecosystem with vendor-independent RNIC support, offering multiple RDMA-capable APIs at both kernel and user level (DAPL, IT-API).

How to get there - Code Base (contd) TOE is a no-no for the Linux kernel community. TOE involves offloading TCP/IP stack processing to the RNIC: it hacks the internals of the TCP/IP stack, has maintainability issues, loses core TCP/IP stack features (e.g. netfilter), and offers questionable performance gains.

How to get there - Code Base (contd) Current code, kal layer (rigen.o): registers a special device /dev/rdma* with the system and defines open(), ioctl() and close() interfaces for user access. The device kvp layer calls into the kal's ri_svc_attach() for each device that is found, and the kvp provides rnic_ops for implementing the various verbs. The kal allocates the various data structures (QPs, CQs, etc.) and calls into the driver-specific RNIC open routine through rnic_ops.
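A minimal sketch of the kind of character-device registration the kal layer performs, using a Linux misc device for brevity. The device name, ioctl command, and handle value here are hypothetical stand-ins; the real rigen.o additionally dispatches into the kvp through ri_svc_attach()/rnic_ops, which is not shown.

/* Hypothetical sketch of a kal-style /dev/rdma* character device. */
#include <linux/module.h>
#include <linux/types.h>
#include <linux/fs.h>
#include <linux/miscdevice.h>
#include <linux/uaccess.h>

#define RDMA_GET_RNIC_HANDLE _IOR('r', 1, u64)   /* hypothetical command */

static int rdma_open(struct inode *inode, struct file *filp)
{
    /* The real kal would look up the RNIC attached via ri_svc_attach()
     * and keep per-open state in filp->private_data. */
    return 0;
}

static long rdma_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
{
    u64 handle = 1;   /* placeholder for an opaque RNIC handle */

    switch (cmd) {
    case RDMA_GET_RNIC_HANDLE:
        if (copy_to_user((void __user *)arg, &handle, sizeof(handle)))
            return -EFAULT;
        return 0;
    default:
        return -ENOTTY;
    }
}

static int rdma_release(struct inode *inode, struct file *filp)
{
    return 0;
}

static const struct file_operations rdma_fops = {
    .owner          = THIS_MODULE,
    .open           = rdma_open,
    .unlocked_ioctl = rdma_ioctl,
    .release        = rdma_release,
};

static struct miscdevice rdma_miscdev = {
    .minor = MISC_DYNAMIC_MINOR,
    .name  = "rdma0",            /* would be rdma<N> per discovered RNIC */
    .fops  = &rdma_fops,
};

static int __init rdma_kal_init(void)
{
    return misc_register(&rdma_miscdev);
}

static void __exit rdma_kal_exit(void)
{
    misc_deregister(&rdma_miscdev);
}

module_init(rdma_kal_init);
module_exit(rdma_kal_exit);
MODULE_LICENSE("GPL");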

How to get there - Code Base (contd) Current code, kal layer (contd): ioctl() provides various handles for user access: the RNIC handle; QP, CQ, PD, MR handles, etc. These handles allow various operations on the device: modify QP, register a memory region, etc. Separate kernel APIs are provided for both of the above (common code shared between the kernel and user paths).

How to get there - Code Base (contd) Current code, ual layer (libual.so): the ual init code opens the special file /dev/rdma to get the list of RNICs and the kvp handle identifying each RNIC. The user calls ri_get_rnic_handle() specifying which device to get a handle for; e.g. device 1 is /dev/rdma1. This calls open("/dev/rdma1"), loads the uvp library provided, and calls ioctl(get_rnic_handle) to trap into the kernel and retrieve a handle for accessing the RNIC.
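A minimal user-space sketch of the open()/ioctl() sequence just described, pairing with the kernel sketch above. The ioctl command name and handle type are hypothetical, and the real libual additionally discovers RNICs through /dev/rdma and loads the vendor uvp library, which is not shown.

/* Hypothetical sketch of the ual open + ioctl flow for one RNIC. */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>

#define RDMA_GET_RNIC_HANDLE _IOR('r', 1, uint64_t)   /* hypothetical */

int main(void)
{
    /* Device 1 corresponds to /dev/rdma1, per the naming described above. */
    int fd = open("/dev/rdma1", O_RDWR);
    if (fd < 0) {
        perror("open /dev/rdma1");
        return 1;
    }

    uint64_t rnic_handle = 0;
    if (ioctl(fd, RDMA_GET_RNIC_HANDLE, &rnic_handle) < 0) {
        perror("ioctl(GET_RNIC_HANDLE)");
        close(fd);
        return 1;
    }

    /* rnic_handle would now be passed to later calls that allocate PDs,
     * create QPs/CQs, register memory, and so on. */
    printf("got RNIC handle 0x%llx\n", (unsigned long long)rnic_handle);

    close(fd);
    return 0;
}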

How to get there - Code Base (contd) Current code, ual layer (contd): the user can perform operations on the RNIC handle, such as allocating a PD, creating a QP/CQ, registering memory, allocating a window, etc. Other operations for RDMA: posting RQ buffers and posting SQ buffers. free_rnic_handle() calls ioctl with FREE_RNIC_HANDLE and subsequently closes the /dev/rdma1 device.
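As an illustration of the "post RQ buffers" step, here is a minimal sketch using the OpenFabrics verbs API rather than the RNIC-PI calls the slides describe; the QP and MR are assumed to come from earlier setup such as the registration sketch above.

/* Post one receive buffer on a QP (OpenFabrics verbs, shown as one
 * concrete way to express the RQ-posting step described above). */
#include <stdint.h>
#include <infiniband/verbs.h>

static int post_one_recv(struct ibv_qp *qp, struct ibv_mr *mr,
                         void *buf, size_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,   /* registered buffer */
        .length = (uint32_t)len,
        .lkey   = mr->lkey,         /* key from ibv_reg_mr() */
    };
    struct ibv_recv_wr wr = {
        .wr_id   = (uintptr_t)buf,  /* returned in the completion */
        .sg_list = &sge,
        .num_sge = 1,
    };
    struct ibv_recv_wr *bad_wr = NULL;

    /* Hands ownership of the buffer to the adapter; incoming data is
     * DMA'd directly into it with no kernel copy. */
    return ibv_post_recv(qp, &wr, &bad_wr);
}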

OpenIB and OpenRDMA convergence (Feb 2006) InfiniBand has the same concept of RDMA, using slightly different APIs. How do we get a single API that works irrespective of the underlying fabric type? Convergence between openib and openrdma is currently being discussed. The goal is for applications not to be aware of the underlying transport, and to have a single implementation in Linux.

Exploiters of RDMA ULPs: NFS over RDMA, iSER. HPC applications based on MPI, which typically require fast access to large amounts of data and high-speed interconnects between systems. Cluster interconnects. Any application with a high data-to-control ratio, e.g. Wall Street financial applications.

References An RDMA Protocol Specification: www.ietf.org/internet-drafts/draft-ietf-rddp-rdmap-05.txt. OpenRDMA: www.openrdma.org. Interconnect Software Consortium (ICSC): www.opengroup.org/icsc. Code: browse at cvs.sourceforge.net/viewcvs.py/rdma/

Legal Statement This work represents the view of the author and does not necessarily represent the view of IBM. Linux is a trademark of Linus Torvalds in the United States, other countries, or both. Other company, product, or service names may be trademarks or service marks of others.