HIGH-PERFORMANCE NETWORKING :: USER-LEVEL NETWORKING :: REMOTE DIRECT MEMORY ACCESS

CS6410, Moontae Lee (Nov 20, 2014), Part 1

Overview 00 Background User-level Networking (U-Net) Remote Direct Memory Access (RDMA) Performance

Index 00 Background

Network Communication 01
Send: Application buffer → Socket buffer; headers are attached; data is pushed to the NIC buffer.
Receive: NIC buffer → Socket buffer; headers are parsed; data is copied into the application buffer; the application is scheduled (context switch).
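As a rough illustration of why this path is costly for small messages, the toy Python model below counts the copies one message undergoes on the traditional send path. All names here are my own, not from any real kernel:

```python
# Toy model of the traditional in-kernel send path: each hop is a copy.
# Names and the header size are illustrative, not from any real stack.
HDR_LEN = 32

def traditional_send(app_buf: bytes) -> tuple[bytes, int]:
    """Return (nic_buf, copy_count) for one message."""
    copies = 0
    # 1. Application buffer -> socket buffer, headers attached in front.
    socket_buf = bytes(HDR_LEN) + app_buf
    copies += 1
    # 2. Socket buffer -> NIC buffer.
    nic_buf = bytes(socket_buf)
    copies += 1
    return nic_buf, copies

nic, copies = traditional_send(b"ping")
assert copies == 2            # two copies before the data reaches the wire
assert nic[HDR_LEN:] == b"ping"
```

For a few-hundred-byte message, these per-message copies (plus the context switch on receive) dominate the end-to-end time, which is the processing-overhead problem the next slides target.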

Today's Theme 02 Faster, more lightweight communication!

Terms and Problems 03
Communication latency has two components:
Processing overhead: message-handling time at the sending/receiving ends
Network latency: message transmission time between the two ends (i.e., end-to-end latency)
If the network environment satisfies high bandwidth / low network latency and long connection durations / relatively few connections...

TCP Offload Engine (TOE) 04 THIS IS NOT OUR STORY!

Our Story 05
Large vs. small messages:
Large: transmission-dominant → faster networks help (e.g., video/audio streams)
Small: processing-dominant → a new paradigm helps (e.g., messages of just a few hundred bytes)
Our underlying picture: sending many small messages in a LAN, where processing overhead is overwhelming (e.g., buffer management, message copies, interrupts)

Traditional Architecture 06
Problem: messages pass through the kernel.
Low performance: several duplicate copies; multiple abstractions between the device driver and user applications.
Low flexibility: all protocol processing happens inside the kernel; hard to support new protocols and new message send/receive interfaces.

History of High-Performance Networking 07
User-level Networking (U-Net): one of the first kernel-bypass systems.
Virtual Interface Architecture (VIA): first attempt to standardize user-level communication; combines the U-Net interface with a remote DMA service.
Remote Direct Memory Access (RDMA): modern high-performance networking; known by many other names, but sharing common themes.

Index 00 U-Net

U-Net Ideas and Goals 08
Move protocol processing into user space!
Move the entire protocol stack to user space
Remove the kernel completely from the data communication path
Focusing on small messages, the key goals are high performance and high flexibility:
Low communication latency in a local-area setting
Exploit the full network bandwidth
Emphasis on protocol design and integration flexibility
Portable to off-the-shelf communication hardware

U-Net Architecture 09
Traditionally: the kernel controls the network, and all communication goes via the kernel.
U-Net: applications access the network directly via a MUX; the kernel is involved only in connection setup.
* Virtualizing the NI provides each process the illusion of owning its own interface to the network.

U-Net Building Blocks 10
Endpoints: an application's (or the kernel's) handle into the network.
Communication segments: memory buffers holding message data for sending/receiving.
Message queues: hold descriptors for messages that are to be sent or have been received.
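These three building blocks can be pictured as plain data structures. The sketch below is hypothetical (field names and sizes are mine, not the actual U-Net interface); it only shows how descriptors refer into a shared segment:

```python
# Hypothetical sketch of U-Net's building blocks; all names and sizes are
# illustrative, not the paper's actual interface.
from collections import deque
from dataclasses import dataclass, field

SEG_SIZE = 4096

@dataclass
class Descriptor:
    offset: int   # where the message starts in the communication segment
    length: int   # message length in bytes
    tag: int      # originating/destination endpoint address

@dataclass
class Endpoint:
    """The application's (or kernel's) handle into the network."""
    segment: bytearray = field(default_factory=lambda: bytearray(SEG_SIZE))
    send_q: deque = field(default_factory=deque)  # descriptors to be sent
    recv_q: deque = field(default_factory=deque)  # descriptors received
    free_q: deque = field(default_factory=deque)  # free space in the segment

ep = Endpoint()
assert len(ep.segment) == SEG_SIZE and not ep.send_q
```

The point of the split is that message *data* lives in the segment while the queues carry only small descriptors, so the NI and the process can coordinate without copying payloads.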

U-Net Communication: Initialize 11
Initialization: create one or more endpoints for each application, and associate a communication segment and send/receive/free message queues with each endpoint.

U-Net Communication: Send 12
START: compose the data in the communication segment, then push a descriptor for the message onto the send queue.
If the send queue is full, the NI exerts backpressure on the user process.
The NI picks up the message and inserts it into the network, indicating injection status via a flag; if the network is backed up, the NI leaves the descriptor in the queue.
Once the message has been injected, the associated send buffer can be reused.
Sending is as simple as changing one or two pointers!

U-Net Communication: Receive 13
START: the NI demultiplexes incoming messages to their destination endpoints, gets available space from the free queue, transfers the data into the appropriate communication segment, and pushes a message descriptor onto the receive queue.
Receive model: polling (periodically check the status of the receive queue) or event-driven (an upcall signals the arrival).
The application then reads the data using the descriptor from the receive queue.
Receiving is as simple as the NIC changing one or two pointers!
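The "one or two pointer updates" claim can be sketched as follows. This toy model (my own names, not the real U-Net interface) plays both the process and the NI in one address space: descriptors move through bounded queues while the payload stays put in the segment:

```python
# Toy model of U-Net send/receive: a send or receive is just pushing or
# popping a descriptor; message data stays in the communication segment.
# Names are illustrative, not the real U-Net interface.
from collections import deque, namedtuple

Desc = namedtuple("Desc", "offset length")
QUEUE_DEPTH = 8

def send(segment: bytearray, send_q: deque, data: bytes, offset: int) -> bool:
    """Compose data in the segment, then push one descriptor."""
    if len(send_q) >= QUEUE_DEPTH:   # queue full: NI exerts backpressure
        return False
    segment[offset:offset + len(data)] = data
    send_q.append(Desc(offset, len(data)))  # the "one pointer" update
    return True

def ni_deliver(send_q: deque, recv_q: deque) -> None:
    """The 'NI': pick messages off the send queue; here it trivially
    delivers each one by pushing its descriptor onto a receive queue."""
    while send_q:
        recv_q.append(send_q.popleft())

def receive(segment: bytearray, recv_q: deque):
    """Poll the receive queue and read data via the descriptor."""
    if not recv_q:
        return None
    d = recv_q.popleft()
    return bytes(segment[d.offset:d.offset + d.length])

seg = bytearray(4096)
sq, rq = deque(), deque()
assert send(seg, sq, b"hello", 0)
ni_deliver(sq, rq)
assert receive(seg, rq) == b"hello"
```

In the real system the sender and the NI share these queues across the user/device boundary; the simulation only shows that no payload copy is needed between them.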

U-Net Protection 14
Owning-process protection: endpoints, communication segments, and send/receive/free queues can be accessed only by the owning process.
Tag protection: outgoing messages are tagged with the originating endpoint address; incoming messages are delivered only to the correct destination endpoint.
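Tag protection amounts to a lookup the NI performs on every incoming message. A minimal sketch, with invented names and endpoint addresses:

```python
# Toy sketch of U-Net's tag protection: the NI delivers a message only to
# the endpoint whose address matches the message's tag. All names and
# addresses here are invented for illustration.
from collections import deque

endpoints = {7: deque(), 9: deque()}   # endpoint address -> receive queue

def ni_demux(tag: int, payload: bytes) -> bool:
    """Deliver an incoming message to the matching endpoint, or drop it."""
    q = endpoints.get(tag)
    if q is None:
        return False                    # no such endpoint: message dropped
    q.append(payload)
    return True

assert ni_demux(7, b"for-7")
assert not ni_demux(3, b"stray")       # unknown tag is never delivered
assert endpoints[7][0] == b"for-7" and not endpoints[9]
```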

U-Net Zero Copy 15
Base-level U-Net (might not be zero-copy): send/receive needs a buffer, which requires a copy between application data structures and the buffer in the communication segment; alternatively, the application data structures can be kept in that buffer, avoiding the copy.
Direct-access U-Net (true zero-copy): communication segments span the entire process address space, but this requires special hardware support to check addresses.

Index 00 RDMA

RDMA Ideas and Goals 16
Move buffers between two applications across the network.
Once programs implement RDMA, it aims for the lowest latency and highest throughput with the smallest CPU footprint.

RDMA Architecture (1/2) 17
Traditionally, the socket interface involves the kernel. RDMA instead provides a dedicated verbs interface: the kernel is involved only on the control path, and the RNIC can be accessed directly from user space on the data path, bypassing the kernel.

RDMA Architecture (2/2) 18
To initiate RDMA, establish a data path from the RNIC to application memory. The verbs interface provides the API to establish these data paths. Once a data path is established, the application can read from/write to buffers directly. Note that the verbs interface is different from the traditional socket interface.

RDMA Building Blocks 19
Applications use the verbs interface to:
Register memory: the kernel ensures the memory is pinned and accessible by DMA.
Create a queue pair (QP): a pair of send/receive queues.
Create a completion queue (CQ): the RNIC puts a new completion-queue element into the CQ after an operation has completed.
Send/receive data.
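The verbs workflow (register memory, create a QP and CQ, post work, poll completions) can be mimicked by a toy model. This is not the real libibverbs API; every name below is invented for illustration:

```python
# Toy model of the verbs workflow: register memory, create a queue pair and
# a completion queue, post an RDMA write, then poll for its completion.
# This is not libibverbs; all names are invented for illustration.
from collections import deque
from dataclasses import dataclass, field

@dataclass
class MemoryRegion:
    buf: bytearray
    pinned: bool = True   # a real kernel would pin these pages for DMA

@dataclass
class QueuePair:
    send_q: deque = field(default_factory=deque)
    recv_q: deque = field(default_factory=deque)

def register_memory(buf: bytearray) -> MemoryRegion:
    return MemoryRegion(buf)

def post_rdma_write(qp: QueuePair, local: MemoryRegion, remote: MemoryRegion):
    qp.send_q.append((local, remote))   # application posts a work request

def rnic_run(qp: QueuePair, cq: deque) -> None:
    """The 'RNIC': drain the send queue, move bytes, post completions."""
    while qp.send_q:
        local, remote = qp.send_q.popleft()
        remote.buf[:len(local.buf)] = local.buf  # the RDMA write itself
        cq.append("success")            # completion-queue element

src = register_memory(bytearray(b"rdma!"))
dst = register_memory(bytearray(16))
qp, cq = QueuePair(), deque()
post_rdma_write(qp, src, dst)
rnic_run(qp, cq)
assert cq.popleft() == "success" and dst.buf[:5] == b"rdma!"
```

Note the division of labor the slides describe: the application only posts work requests and polls the CQ; the byte movement happens off the CPU, which is where the small-CPU-footprint claim comes from.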

RDMA Communication (1/4)-(4/4) 20-23: Steps 1-4 (figure-only slides)

Index 00 Performance

U-Net Performance: Bandwidth 24 [Figures: UDP and TCP bandwidth vs. message size and application data-generation rate]

U-Net Performance: Latency 25 [Figure: end-to-end round-trip latency vs. message size]

RDMA Performance: CPU Load 26 [Figure: CPU load]