Advanced Computer Networks. End Host Optimization

End Host Optimization. Patrick Stuedi, Spring Semester 2017. Oriana Riva, Department of Computer Science, ETH Zürich, 263-3501-00. 1

Today's topics are end-host optimizations: NUMA-aware networking, kernel bypass, and Remote Direct Memory Access (RDMA). 2

Trends: networks get faster (10 Gb/s, 40 Gb/s, 100 Gb/s) and network latency decreases (towards 1 µs); CPUs do not get faster, but there are more of them, so latency is increasingly a software factor. Moore's law continues: more transistors can be put on a NIC. 3

Key challenges. Scaling: one fast network, lots of cores. Latency: dominated by software processing cost, and multiplied by multi-tier server architectures. CPU load: packet processing and data copying cost can be high. 4

Delay numbers, per component:
Network switch: 10-30 µs
Network adaptor: 2.5-32 µs
OS network stack: 15 µs
Speed of light (in fibre): 5 ns/m, i.e. roughly 0.05-0.5 µs for typical cable lengths
OS overhead per packet exchanged between two hosts attached to the same switch: the packet crosses two OS stacks, two adaptors and one switch, so (2*15)/(2*2.5 + 2*15 + 10) = 66% (!) 5
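As a worked check of that figure (using the numbers above: 15 µs per OS stack, 2.5 µs per adaptor, 10 µs for the switch, and ignoring the sub-microsecond propagation delay), the share of the one-way delay spent in the two operating systems is:

\[
\frac{2 \times 15\ \mu\text{s}}{2 \times 2.5\ \mu\text{s} + 2 \times 15\ \mu\text{s} + 10\ \mu\text{s}}
  = \frac{30\ \mu\text{s}}{45\ \mu\text{s}} \approx 67\%
\]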

Receiving a packet [diagram: applications on stream and datagram sockets; TCP, UDP and ICMP over IP in the kernel; receive queue; network interface] 6

Receiving a packet, step 1: the NIC DMAs the packet to host memory and raises a hardware interrupt; this adds latency. 7

Receiving a packet, step 2: software interrupt, TCP/IP processing (demultiplexing); can be CPU intensive at high data rates. 8

Receiving a packet, step 3: application scheduling (context switch), then the packet is copied to user space in the context of the application process; this adds latency and can be CPU intensive at high data rates. 9

Sending a packet [diagram: applications on stream and datagram sockets; TCP, UDP and ICMP over IP in the kernel; transmit queue; network interface] 10

Sending a packet, step 1: system call, copy from user space to the socket buffer, TCP processing; can be CPU intensive at high data rates. 11

Sending a packet, step 2: software interrupt, remaining TCP/IP processing (multiplexing); can be CPU intensive at high data rates. 12

Sending a packet, step 3: DMA transfer to the NIC, followed by an interrupt. 13
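To make the cost concrete, the path above is what an ordinary sockets program triggers on every send. Below is a minimal sketch of a UDP sender (the address 192.0.2.1 and port 9000 are placeholders): each sendto() call enters the kernel, copies the payload into a socket buffer, runs protocol processing, and eventually results in a DMA to the NIC.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    /* One datagram socket; every sendto() below crosses the user/kernel
     * boundary and copies the payload into a kernel socket buffer. */
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    struct sockaddr_in dst = { 0 };
    dst.sin_family = AF_INET;
    dst.sin_port = htons(9000);                      /* placeholder port */
    inet_pton(AF_INET, "192.0.2.1", &dst.sin_addr);  /* placeholder address */

    const char msg[] = "hello";
    for (int i = 0; i < 1000; i++)
        sendto(fd, msg, sizeof(msg), 0,
               (struct sockaddr *)&dst, sizeof(dst));

    close(fd);
    return 0;
}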

What does the hardware do? 14

Descriptor rings [diagram: a ring of descriptors in main memory, one part owned by the OS and one part owned by the device, with pointers to the last buffer used by the OS and the last buffer used by the device; descriptors reference socket buffers; the network card raises interrupts handled by the Linux kernel] 15
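The following is a minimal sketch, not any particular driver's layout, of how such a receive descriptor ring might be represented in software; the structure and field names are invented for illustration.

#include <stdint.h>

#define RING_SIZE 256           /* number of descriptors in the ring */

/* One DMA descriptor: tells the NIC where a packet buffer lives. */
struct rx_desc {
    uint64_t buf_addr;          /* physical address of the packet buffer */
    uint16_t length;            /* bytes written by the NIC on receive */
    uint16_t status;            /* e.g. a "descriptor done" bit set by the NIC */
};

/* The ring plus the two positions from the slide's diagram. */
struct rx_ring {
    struct rx_desc desc[RING_SIZE];
    uint32_t next_to_use;       /* last buffer handed to the device ("owned by device") */
    uint32_t next_to_clean;     /* last buffer processed by the OS ("owned by OS") */
};

/* OS-side processing: reclaim descriptors the NIC has filled. */
static void rx_ring_clean(struct rx_ring *r)
{
    while (r->desc[r->next_to_clean].status & 0x1) {   /* done bit */
        /* ...hand desc[next_to_clean].buf_addr to the network stack here... */
        r->desc[r->next_to_clean].status = 0;
        r->next_to_clean = (r->next_to_clean + 1) % RING_SIZE;
    }
}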

Overruns. If the device has no buffers for received packets, it starts discarding packets; this is not as bad as it sounds, since the device signals that it has lost some in a status register. If the CPU has no more slots to send packets, it must wait; it can spin polling, but that is inefficient, so instead it asks the device to interrupt it when a packet has been sent, i.e. when a buffer slot is free again. 16

Receive-Side Scaling (RSS) 17

Scaling with cores: use multiple queues (descriptor rings), one queue per core for better scaling, with per-queue interrupts directed to that core. Key question: which Rx queue gets a packet? 18

RSS: the incoming packet's 5-tuple is hashed; the hash indexes a table that selects a receive queue. Queues are serviced by different cores; each queue can trigger its own interrupt (MSI), which is steered to the corresponding core. 19
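A rough sketch of that selection logic, assuming a generic hash over the 5-tuple and a 128-entry indirection table; real NICs use a Toeplitz hash keyed by a driver-supplied secret, so the hash function here is only a placeholder.

#include <stdint.h>

#define INDIRECTION_ENTRIES 128

/* The connection 5-tuple used by RSS. */
struct five_tuple {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  protocol;
};

/* Placeholder mix; hardware uses a keyed Toeplitz hash instead. */
static uint32_t rss_hash(const struct five_tuple *t)
{
    uint32_t h = t->src_ip;
    h = h * 31 + t->dst_ip;
    h = h * 31 + (((uint32_t)t->src_port << 16) | t->dst_port);
    h = h * 31 + t->protocol;
    return h;
}

/* Indirection table mapping hash buckets to receive queues; the driver
 * typically fills it round-robin over the available cores. */
static uint8_t indirection_table[INDIRECTION_ENTRIES];

static unsigned rss_select_queue(const struct five_tuple *t)
{
    return indirection_table[rss_hash(t) % INDIRECTION_ENTRIES];
}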

More sophisticated filters: wildcard matches on 5-tuples, VLAN matching, a SYN filter (direct connection setups to a different queue), filters based on other protocols, etc. 20

What about transmit? Multiple transmit queues give better scaling (each core has its own tx queue, so no locks or synchronization between cores) and performance isolation (the NIC can schedule the different tx queues without involving the CPU scheduler). 21

Example: Intel 82599EB 10Gb NIC: 128 send queues, 128 receive queues, RSS filters, 256 5-tuple filters, a large number of "Flow Director" filters, a SYN filter, etc. 22

What this buys you (if the OS is designed accordingly): scaling with cores; reduced CPU load for synchronization, multiplexing and scheduling; performance isolation; avoidance of receive livelock. 23

What it doesn't buy you. CPU load remains for demultiplexing within a queue, protocol processing, and connection setup/teardown. Context switches remain: the kernel must still switch to the receiving process, and each packet still means entering and leaving the kernel. 24

TCP Offload 25

TCP Offload: moving IP and TCP processing to the network interface (NIC). Main justifications: reduction of host CPU cycles for protocol header processing and checksumming, fewer CPU interrupts, fewer bytes copied over the memory bus, and the potential to offload expensive features such as encryption. 26

TCP Offload Engines (TOEs) [diagram: conventional stack with application, kernel sockets, TCP/IP and NIC driver above an Ethernet NIC, versus a TOE stack where the kernel keeps only the sockets layer and NIC driver and TCP/IP runs on the Ethernet NIC itself] 27

Why TCP offload never worked. Moore's law worked against smart NICs: CPUs used to be fast enough, and TCP/IP headers don't take many CPU cycles (around 30 instructions if done with care). TOEs impose complex interfaces: the TOE-to-CPU protocol can be worse than TCP itself. Connection management overhead: for short connections, it overwhelms any savings. The OS can't control the protocol implementation: bug fixes, protocol evolution (DCTCP, D2TCP, etc.). 28

Why TOEs sometimes help. Now there are many cores, cores don't get faster, and network processing is hard to parallelize. The sweet spot might be applications with very high bandwidth, relatively low end-to-end latency network paths, long connection durations, and relatively few connections; for example, storage-server access and cluster interconnects. 29

User-level networking 30

User-level networking. TCP offloading is not enough: system call overhead, context switches and memory copies remain. The key idea of user-level networking is to remove this overhead: map individual queues into the application and allow applications to poll them, bypassing the OS (the OS is still needed for interrupts). This requires hardware that can validate queue entries and demultiplex messages to applications. 31

User-space networking (2) [diagram: traditional networking, where all applications on Node1 and Node2 communicate through the kernel to the NIC, versus user-level networking, where applications access the NIC directly]. Traditional networking: all communication goes through the kernel. User-level networking: applications access the NIC directly; the kernel is involved only during connection setup/teardown. U-Net: A User-Level Network Interface for Parallel and Distributed Computing. von Eicken, Basu, Buch, Vogels, Cornell University, 1995. 32

U-Net data structures & API [diagram: application buffers; send, receive and free message queues; communication segment]. Data structures: the communication segment is the application's handle; buffers hold data for sending and receiving; message queues hold descriptors pointing to buffers. Programming U-Net: to send, compose the data in a buffer and post a descriptor to the send queue; to receive, post a descriptor to the receive queue and poll its status. Messages carry a tag identifying the source or destination queue. 33
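A hypothetical sketch of what programming against such an interface looks like; the types and functions (unet_segment, unet_post_send, and so on) are invented for illustration and are not the actual U-Net API, but the post-descriptor/poll-for-completion pattern matches the slide.

#include <stdint.h>
#include <string.h>

/* Invented types mirroring the slide's data structures. */
struct unet_desc { void *buf; uint32_t len; uint32_t tag; };
struct unet_segment;                    /* the application's handle */

/* Invented operations: post descriptors, poll for completion. */
int unet_post_send(struct unet_segment *seg, struct unet_desc *d);
int unet_post_recv(struct unet_segment *seg, struct unet_desc *d);
int unet_poll_recv(struct unet_segment *seg, struct unet_desc *done);

static void example(struct unet_segment *seg, uint32_t peer_tag)
{
    /* Sending: compose data in a buffer, post a descriptor to the send queue. */
    static char out[64];
    strcpy(out, "read a value");
    struct unet_desc tx = { .buf = out, .len = sizeof(out), .tag = peer_tag };
    unet_post_send(seg, &tx);

    /* Receiving: post a buffer to the receive queue, then poll its status. */
    static char in[64];
    struct unet_desc rx = { .buf = in, .len = sizeof(in), .tag = peer_tag };
    unet_post_recv(seg, &rx);

    struct unet_desc done;
    while (unet_poll_recv(seg, &done) == 0)
        ;                               /* spin until a message arrives */
}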

RDMA 34

Message passing exchange [sequence diagram: the client sends "read a value" and blocks on receive; the server, blocked on receive, wakes up, sends the "result", and blocks on receive again]. So far: messages between processes. 35

Message passing exchange, on the server [same sequence diagram]: 1. packet arrives; 2. DMA transfer to host memory; 3. IRQ, schedule processor; 4. thread reads the request and the result; 5. copies it into a buffer in main memory; 6. DMA request from main memory; 7. packet leaves. So far: messages between processes. 36

Message passing exchange [same diagram and server-side steps as the previous slide]. What about removing the server software entirely? 37

Remote Direct Memory Access (RDMA): only the "client" is actively involved in the transfer ("one-sided" operation). An RDMA Write specifies where the data should be taken from locally and where it is to be placed remotely; an RDMA Read specifies where the data should be taken from remotely and where it is to be placed locally. Both require buffer advertisement prior to the data exchange. 38

RDMA Read [sequence diagram: the client sends the read request ("read a value") and blocks; the "result" is transferred from the server's src buffer into the client's dst buffer without involving a server process]. 39

RDMA Read, on the server [same diagram]: 1. request packet arrives; 2. DMA request from main memory (the src buffer); 3. response packet leaves. 40

RDMA Write [sequence diagram: the client sends "write value" from its src buffer and blocks; the data is placed into the server's dst buffer and a "result" comes back to the client]. 41

RDMA Write, on the server [same diagram]: 1. data packet arrives; 2. DMA request to main memory (the dst buffer); 3. optionally, an ACK packet leaves. 42

RDMA implementations. InfiniBand (Compaq, HP, IBM, Intel, Microsoft and Sun Microsystems, 2000): a new network built from the ground up (see the Flow Control lecture). iWARP (Internet Wide Area RDMA Protocol): RDMA semantics over offloaded TCP/IP; requires custom Ethernet NICs. RoCE (RDMA over Converged Ethernet): RDMA semantics directly over Ethernet. All implementations provide both classical message passing (send/recv, similar to U-Net) and RDMA operations (RDMA read/write). 43

Open Fabrics Enterprise Distribution (OFED) [diagram: an application can use either the traditional socket interface (socket layer and TCP/UDP/IP in the kernel, Ethernet NIC driver) or the RDMA verbs interface (RDMA verbs user library in user space, a kernel module, and an RDMA-enabled NIC)]. OFED is a unified RDMA stack across different operating systems and hardware: Linux/Windows; InfiniBand, iWARP, RoCE. Vendors write their own device drivers and user library: the device driver implements allocation of message queues on the device, and the user driver provides access to those message queues from user space. The stack provides a common application interface: the verbs interface. 44

Verbs data structures & API. Applications use the 'verbs' interface to: register application memory (the operating system makes sure the memory is pinned and accessible by DMA); create a queue pair (QP), i.e. a send and a receive queue; create a completion queue (CQ), into which the RNIC puts a new completion-queue element after an operation has completed; and send/receive/read/write data by placing a work-request element (WQE) into the send or receive queue, where the WQE points to a user buffer and defines the type of the operation (e.g., send, recv, read, write).
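As a concrete illustration, a minimal sketch using the libibverbs API (OFED's verbs library) is shown below. It assumes a protection domain, a completion queue, and an already-connected queue pair, and omits error handling and connection setup (address exchange and QP state transitions, typically done via rdma_cm or out of band); remote_addr and rkey are assumed to have been advertised by the peer.

#include <infiniband/verbs.h>
#include <stdint.h>
#include <stdlib.h>

/* Issue one RDMA Read from a remote buffer into a local buffer. */
void rdma_read_example(struct ibv_pd *pd, struct ibv_qp *qp, struct ibv_cq *cq,
                       uint64_t remote_addr, uint32_t rkey)
{
    /* 1. Register application memory: pinned and DMA-accessible. */
    size_t len = 4096;
    void *buf = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);

    /* 2. Build a work request: a local scatter/gather entry plus the
     *    remote address and key for the one-sided read. */
    struct ibv_sge sge = {
        .addr = (uintptr_t)buf, .length = (uint32_t)len, .lkey = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .opcode     = IBV_WR_RDMA_READ,
        .sg_list    = &sge,
        .num_sge    = 1,
        .send_flags = IBV_SEND_SIGNALED,   /* request a completion */
    };
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    /* 3. Post it to the send queue and poll the CQ for completion. */
    struct ibv_send_wr *bad_wr;
    ibv_post_send(qp, &wr, &bad_wr);

    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;                                  /* busy-poll until done */

    ibv_dereg_mr(mr);
    free(buf);
}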

RDMA vs TCP sockets [plots: throughput and latency comparison] 46

Typical CPU loads for three network stack implementations 47

Large RDMA Design Space

How to design a sequencer service. Paper: Design Guidelines for High Performance RDMA Systems, USENIX ATC 2016.

Sequencer Throughput

Sequencer throughput optimizations: use an unreliable transport (avoid ACKs); use unsignaled operations; put multiple messages on the queue before notifying the NIC (doorbell batching); use one QP per core; inline the data with the descriptor.
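A hedged sketch, again with libibverbs, of what some of these optimizations look like in code: a chain of work requests posted with a single ibv_post_send() call (one doorbell), unsignaled except for the last one, with small payloads inlined into the descriptor. It assumes the QP was created with sq_sig_all = 0 (so unsignaled sends are allowed) and with max_inline_data of at least 8 bytes; queue setup and completion polling are omitted.

#include <infiniband/verbs.h>
#include <stdint.h>

#define BATCH 16

/* Post BATCH small sends with one doorbell: WRs are chained via 'next',
 * only the last one is signaled, and payloads are inlined (copied into
 * the WQE by the CPU, so no DMA read of the source buffer is needed). */
void post_batched_sends(struct ibv_qp *qp, char payloads[BATCH][8])
{
    struct ibv_sge     sge[BATCH];
    struct ibv_send_wr wr[BATCH];

    for (int i = 0; i < BATCH; i++) {
        sge[i] = (struct ibv_sge){ .addr = (uintptr_t)payloads[i], .length = 8 };
        wr[i]  = (struct ibv_send_wr){
            .opcode     = IBV_WR_SEND,
            .sg_list    = &sge[i],
            .num_sge    = 1,
            .send_flags = IBV_SEND_INLINE,          /* small payload copied into the WQE */
            .next       = (i + 1 < BATCH) ? &wr[i + 1] : NULL,
        };
    }
    wr[BATCH - 1].send_flags |= IBV_SEND_SIGNALED;  /* one completion per batch */

    struct ibv_send_wr *bad_wr;
    ibv_post_send(qp, wr, &bad_wr);                 /* single doorbell ring */
}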

Summary & open questions. NICs are getting faster, CPUs are not: hence multiple tx/rx queues, sophisticated mux/demux filters, offload engines, and kernel bypass. Open questions: How to program a closely-coupled set of RDMA devices? This requires a new way to think about software construction; see Crail: www.crail.io. How to manage the sheer complexity and diversity of NICs? What new architectures are required as bottlenecks move from the NIC to PCIe? (e.g., Intel OmniPath, Direct Cache Access, etc.) 52