Towards a Software Defined Data Plane for Datacenters


Arvind Krishnamurthy
Joint work with: Antoine Kaufmann, Ming Liu, Naveen Sharma, Tom Anderson, Kishore Atreya, Changhoon Kim, Jacob Nelson, Simon Peter

Programmable Networks
- Recent and upcoming hardware allows per-packet data-plane processing, both at the switch and at the NIC
- Programmable devices run at line rate, but offer only a tiny amount of computing and a small amount of state
- Multiple hardware realizations exist, but some convergence is taking place

Reconfigurable Match Table (RMT) Architecture
[Figure: a packet stream flows through a programmable parser (Eth / IPv4 / TCP / UDP / RCP) into a pipeline of TCAM/SRAM match-action stages with registers, then into egress queues]
- TCAM/SRAM for matches, e.g., port = lookup(eth.dst_mac)
- Stateful memory for counters and meters, e.g., counter[ipv4.dst_port]++
- ALUs for modifying headers and registers, e.g., ipv4.ttl = ipv4.ttl - 1
A small software model of one stage follows.
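An illustrative software model of a single match-action stage (Python stands in for what would really be a P4 program on the device; all names here are mine, not from the talk):

```python
from dataclasses import dataclass

@dataclass
class Packet:
    eth_dst_mac: str     # just the fields this stage touches
    ipv4_ttl: int
    ipv4_dst_port: int
    out_port: int = 0

class MatchActionStage:
    def __init__(self, table):
        self.table = table   # TCAM/SRAM match table: dst MAC -> port
        self.counters = {}   # stateful memory (counters and meters)

    def process(self, pkt):
        # Match: look up the output port, as in port = lookup(eth.dst_mac).
        pkt.out_port = self.table.get(pkt.eth_dst_mac, 0)
        # ALU action on a header field: ipv4.ttl = ipv4.ttl - 1.
        pkt.ipv4_ttl -= 1
        # Stateful register update: counter[ipv4.dst_port]++.
        self.counters[pkt.ipv4_dst_port] = \
            self.counters.get(pkt.ipv4_dst_port, 0) + 1
        return pkt
```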

RMT Devices
- Growing list of RMT-like offerings at different points in the network: switches, multi-host NICs, NICs
- Key use cases for the reconfigurability:
  - Integration of new protocols (e.g., tunneling)
  - In-the-wild customization (e.g., table sizes)
  - Offloading some end-host processing (e.g., filtering)
- A modest proposal, but it comes at a cost, so we had better make good use of it!

Research Questions
- What more can we do with customizable data planes? What are the compelling applications?
- Can we cope with the hardware constraints?
- Are the performance benefits significant enough?

Research Projects
1. The intelligent switch: realize stateful network functions
   - Leverage switch visibility
   - Use approximation to cope with constraints
2. The intelligent NIC: accelerate end-host applications
   - Reduce application overheads, optimize fast-path processing

Challenge: RMT devices are not all-powerful
- Processing primitives are limited
- Available stateful memory is constrained
- Limited number of stages and limited communication across stages
- Appropriate for the originally intended use cases: custom routing and tunneling protocols such as VXLAN or MPLS, mostly packet-level transformations involving static tables

What about network-rich protocols?
- Decades' worth of protocols involve active network elements
- Persistent and mutable state for cross-packet transformations
- Require implementing some state-machine functionality
- Examples:
  - Congestion control (XCP, RCP, QCN, HULL)
  - Load balancing (Hedera, CONGA, WCMP, Ananta)
  - Fairness and QoS scheduling (Fair Queueing, Seawall, CoDel, D3)
- Can we implement these powerful protocols on RMT switches?

Case Study: RCP (Rate Control Protocol)
- Explicit, granular network feedback for rate control
[Figure: a sender starts at rate 50; Switch A advertises R_a = 25 and Switch B advertises R_b = 20, so the flow runs at the minimum rate, 20, along the path to the receiver]
- Switches periodically compute the flow rate (a reference formula follows)
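The rate formula on this slide did not survive transcription. For reference, a standard form of the periodic RCP rate update (from Dukkipati et al.; the switch implementation in this work may differ in detail) is:

$$R(t) \;=\; R(t-T) \;+\; \frac{\alpha\,\big(C - y(t)\big) \;-\; \beta\,q(t)/d_0}{\hat{N}(t)}$$

where $C$ is the link capacity, $y(t)$ the measured input rate, $q(t)$ the persistent queue size, $d_0$ the average RTT, $T$ the update interval, $\alpha, \beta$ stability constants, and $\hat{N}(t)$ the number of active flows, which this work estimates with the cardinality-estimation building block on the next slide.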

Building Block: Cardinality Estimation
- Count the number of unique flows traversing a switch
- We extend an approach from streaming algorithms (HyperLogLog):
  1. For each packet, hash the 5-tuple
  2. Keep track of the largest number of leading zeros seen (say M)
  3. Estimate the number of unique elements as 2^M
[Figure: three packets hashed to 16-bit strings; the maximum run of leading zeros is 4, giving the estimate 2^4 = 16]
A minimal sketch of this estimator follows.
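A minimal, illustrative Python sketch of the leading-zeros estimator described above. Real HyperLogLog (and the switch implementation) uses many registers plus bias correction; all names here are assumptions, not the paper's code.

```python
import hashlib

def _hash32(five_tuple):
    """Hash a flow 5-tuple to a 32-bit integer."""
    digest = hashlib.sha256(repr(five_tuple).encode()).digest()
    return int.from_bytes(digest[:4], "big")

def _leading_zeros(value, width=32):
    return width - value.bit_length()   # bit_length() of 0 is 0

class CardinalityEstimator:
    def __init__(self):
        self.max_zeros = 0              # M: largest run of leading zeros

    def observe(self, five_tuple):      # called once per packet
        self.max_zeros = max(self.max_zeros,
                             _leading_zeros(_hash32(five_tuple)))

    def estimate(self):
        return 2 ** self.max_zeros      # estimated number of unique flows
```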

RCP Implementation on a Flexible Switch
- Implemented on the XPliant CNX880xx
- 2-level fat-tree topology with 4 ToRs and 2 core switches
[Figure: flow completion time (ms, log scale from 1 to 1000) for TCP vs. RCP, broken out by short, medium, and long flows]
- Simulations show that the impact of approximation is minimal

Case Study: Fair Queueing for In-Network Enforcement
- Enforce fair allocation and isolation at switches: provide the illusion that every flow has its own queue
- Proven to give perfect isolation and fairness
  + Simplifies congestion control at the end host
  + Protects against misbehaving traffic
  + Enables bounded delay guarantees
- However, it is challenging to realize in high-speed switches

Fair Queueing without Per-Flow Queues
- Simulates an ideal round-robin scheme in which each active flow transmits a single bit of data every round (Demers et al.)
- Track a global round number
- Store and update per-flow counters
- Keep the packet buffer sorted by departure round
[Figure: ideal per-flow fair queueing next to the simulated scheme, with flow counters and a sorted packet buffer of (packet, round) pairs]
A software sketch of the simulated scheme follows.
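A minimal, illustrative sketch of the simulated scheme (all names are mine, and it omits details such as advancing the round number in proportion to the number of active flows):

```python
import heapq

class SimulatedFairQueueing:
    def __init__(self):
        self.round = 0      # global round number (virtual time)
        self.finish = {}    # per-flow counter: flow -> last finish round
        self.buffer = []    # packet buffer kept sorted by finish round
        self._seq = 0       # tie-breaker so heap entries stay comparable

    def enqueue(self, flow, size):
        # A flow that was idle resumes at the current round.
        start = max(self.round, self.finish.get(flow, 0))
        depart = start + size                   # one "bit" per round
        self.finish[flow] = depart
        heapq.heappush(self.buffer, (depart, self._seq, flow, size))
        self._seq += 1

    def dequeue(self):
        depart, _, flow, size = heapq.heappop(self.buffer)
        self.round = max(self.round, depart)    # advance virtual time
        return flow, size
```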

Realizing Fair Queueing on Reconfigurable Switches
1. Maintain a sorted packet buffer
   - Requirement: O(log N) insertion complexity
   - Constraint: limited operations per packet
2. Store per-flow counters
   - Requirement: per-flow mutable state
   - Constraint: limited switch memory
3. Access and modify the current round number
   - Requirement: synchronize state across switch modules
   - Constraint: limited cross-module communication

Our Approach: Approximate Fair Queueing
- Simulate a bit-by-bit round-robin scheme, with three approximations:
  - Coarse round numbers
  - A limited number of FIFO queues with rotating priorities, approximating a sorted packet buffer
  - Approximate per-flow counters, stored in a variation of the count-min sketch
[Figure: the ideal and simulated fair-queueing picture from the previous slide, annotated with the three approximations]

Storing Approximate Flow Counters
- A variation of the count-min sketch tracks each flow's finish round number
[Figure: an R x C array of cells; a packet is hashed by R independent hash functions, hash_1 .. hash_R, each selecting one cell (mod C) in its row]
- An update increments all of the flow's cells; a read returns their minimum
- Never underestimates, and has a provable space-accuracy trade-off

- Customized to perform a combined read-update operation: a conditional increment up to the new value, for better accuracy
- Read counter: find the minimum of the flow's cells; bytes sent = minimum + pkt.size
- Update counter: raise each of the flow's cells up to the new value, i.e., cell[x][y] = max(cell[x][y], new value)
[Figure: worked example of two flows (sizes 1000 and 500) hashing into a shared sketch; each read returns min = 0, and the updates raise the flows' cells to 0 + 1000 = 1000 and 0 + 500 = 500 respectively]
- Implemented in hardware using predicated read-write registers
A software sketch of this combined operation follows.
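A minimal, illustrative sketch of the modified count-min sketch (dimensions and names are mine, not the hardware layout):

```python
import hashlib

class FinishRoundSketch:
    def __init__(self, rows=4, cols=1024):
        self.rows, self.cols = rows, cols
        self.cells = [[0] * cols for _ in range(rows)]

    def _indices(self, flow_key):
        # R independent hash functions: one column index per row.
        for r in range(self.rows):
            digest = hashlib.sha256(f"{r}:{flow_key}".encode()).digest()
            yield r, int.from_bytes(digest[:4], "big") % self.cols

    def read_update(self, flow_key, pkt_size):
        """Combined read-update: return the flow's new byte count."""
        idx = list(self._indices(flow_key))
        # Read: the minimum cell never underestimates the true count.
        bytes_sent = min(self.cells[r][c] for r, c in idx) + pkt_size
        for r, c in idx:
            # Conditional increment: only raise cells, never lower them.
            self.cells[r][c] = max(self.cells[r][c], bytes_sent)
        return bytes_sent
```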

Buffering Packets in Approximate Sorted Order
[Figure: ideal per-flow queues contrasted with AFQ's K FIFO queues, one per coarse round, each holding up to BpR bytes per flow]
- Coarse rounds: each flow may transmit a quantum of bytes per round (BpR)
- For each packet, the outgoing round number = bytes sent / BpR, which selects the FIFO queue it joins

Rotating Strict Priority Queues
[Figure: the K FIFO queues with rotating priorities; the queue holding the lowest round is drained at the highest priority]
- Drain the queue with the lowest round number until it is empty
- Then push that queue to the lowest priority and increment the round number by 1
A combined sketch of the enqueue and dequeue logic follows.
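A minimal, illustrative sketch of the coarse-round enqueue and rotating-priority dequeue from the last two slides (queue count, BpR, and names are mine; a plain dict stands in for the count-min-sketch counters of the real design):

```python
from collections import deque

class ApproximateFairQueueing:
    def __init__(self, num_queues=8, bpr=1500):
        self.queues = [deque() for _ in range(num_queues)]
        self.bpr = bpr           # bytes per round (the quantum)
        self.round = 0           # round served by the head queue
        self.head = 0            # index of the highest-priority queue
        self.bytes_sent = {}     # per-flow counters (a sketch in hardware)

    def enqueue(self, flow, pkt_size):
        sent = self.bytes_sent.get(flow, 0) + pkt_size
        self.bytes_sent[flow] = sent
        pkt_round = max(self.round, sent // self.bpr)
        # Packets too far in the future land in the last queue (or drop).
        offset = min(pkt_round - self.round, len(self.queues) - 1)
        self.queues[(self.head + offset) % len(self.queues)].append(pkt_size)

    def dequeue(self):
        assert any(self.queues), "no packets buffered"
        # Drain the lowest-round queue; once empty, rotate it to the
        # lowest priority and advance the round number.
        while not self.queues[self.head]:
            self.head = (self.head + 1) % len(self.queues)
            self.round += 1
        return self.queues[self.head].popleft()
```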

Realizing an RSP Scheduler
- RSP can be implemented in hardware, with complexity identical to a deficit round-robin scheduler
- RSP can be emulated on current switches:
  - Use the switch CPU to periodically change priorities, or use hierarchical priority queues
  - Avoid explicit round-number synchronization by exposing queue metadata
  - Utilize dynamic buffer sharing to vary the sizes of individual queues

Summary of Techniques
1. Modified count-min sketch
   + Counters for a large number of flows in limited memory
   - Collisions can cause packets to be enqueued in a later round
2. RSP queues to approximate a sorted buffer
   + Packets are processed in a fixed number of operations
   - Packets are not strictly prioritized within a round
3. Coarse round numbers
   + Updates to shared state are no longer per-packet
   - Packets can be enqueued in an earlier round

End-Host Flow Control Protocol
- AFQ can be deployed without modifying end hosts
- Adapting the packet-pair algorithm [Keshav et al.] yields further benefits (see the sketch below):
  - The sender transmits a pair of back-to-back packets
  - The inter-arrival delay at the receiver gives an estimate of the bottleneck bandwidth
  - End hosts pace packets at the estimated rate
- This enables fast ramp-up while keeping queues small
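A minimal sketch of the packet-pair estimate and the pacing it implies (the arithmetic is standard; the function names are mine):

```python
def bottleneck_bandwidth_bps(pkt_size_bytes, arrival_gap_s):
    """Two back-to-back packets leave the bottleneck spaced by its
    transmission time, so bandwidth ~ packet size / inter-arrival gap."""
    return pkt_size_bytes * 8 / arrival_gap_s

# Example: 1500-byte packets arriving 1.2 us apart at the receiver
# give an estimate of ~10 Gbps.
rate_bps = bottleneck_bandwidth_bps(1500, 1.2e-6)

# Pace subsequent packets at the estimated rate: one packet per
# transmission time, instead of bursting a full window.
pacing_gap_s = 1500 * 8 / rate_bps
```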

Testbed Results
[Figure: normalized flow completion time (average and 99th percentile, log scale) vs. flow size in bytes, for TCP, DCTCP, and AFQ]
- Compared to TCP: 4x better average FCT, 10x better tail latency
- Compared to DCTCP: 2x better average FCT, 4x better tail latency

AFQ Compared to Ideal Fair Queueing
- Evaluated using a packet-level simulator across different loads
[Figure: average FCT in μs vs. network load (10-90%) for TCP, DCTCP, SFQ, AFQ, and Ideal FQ; two panels: all flows, and short flows < 100 KB]

More Protocols/Building Blocks

Discussion
- It is feasible to implement non-trivial stateful protocols: use approximations to overcome hardware limits
- This breaks the vicious cycle between hardware support and protocol evaluation
- Building blocks can be reused across many protocols
  - Possibility of fixed-function instantiations
  - Suggests a long-term evolution strategy for programmable network devices

Research Projects (continued)
1. The intelligent switch: realize stateful network functions
   - Leverage switch visibility
   - Use approximation to cope with constraints
2. The intelligent NIC: accelerate end-host applications
   - Reduce application overheads, optimize fast-path processing

Networks: Fast and Growing Faster
[Figure: Ethernet bandwidth in bits/s (log scale, 100 M to 1 T) vs. year of standard release (1988-2025), from 100 Mb Ethernet through 1/10/40/100 GbE to 400 GbE]
- At 100 Gbps, the inter-arrival time for 64B packets is 5 ns

...but Packet Processing is Slow
- Many cloud apps are dominated by packet processing: key-value stores, real-time analytics; particularly so with microservices
- Recv+send network stack processing overheads: Linux: 3.4 µs; kernel bypass: ~1 µs
- Can parallelize, but still too slow
- The RDMA API is rigid: difficult to traverse complex data structures

Sources of Inefficiency
- Wasted CPU cycles: packet parsing and validation are repeated in software
- Poor cache locality and extra synchronization: the NIC steers packets to cores by connection, but application locality may not match connections
- Many cycles are spent directing packets into data structures
- RMT NICs can address many of these issues!

Example: Key-Value Store
- Receive-side scaling: core = hash(connection) % N
[Figure: three clients requesting overlapping key sets (K = 3, 4; K = 1, 4, 7; K = 5, 7, 8); the NIC spreads their connections over two cores, so both cores touch the same hash-table entries]
- Result: lock contention and poor cache utilization

Optimizing Reads: Key-Based Steering
- Implemented using the RMT NIC model [Kaufmann et al., ASPLOS '16]
- Match: IF udp.port == kvs_port
- Action: core = HASH(kvs.key) % 2; DMA hash, kvs TO Cores[core]
[Figure: the NIC steers requests by key rather than by connection, partitioning the hash table across the two cores]
- Result: no locks needed, better cache utilization (a software model follows)
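An illustrative software model of the steering rule above (packet field names, the port number, and the fallback policy are assumptions, not the FlexNIC API):

```python
NUM_CORES = 2
KVS_PORT = 11211   # hypothetical key-value service port

def steer(pkt):
    """Return the core that should receive this packet."""
    if pkt["udp_dst_port"] == KVS_PORT:
        # Key-based steering: hash the application-level key so every
        # request for a given key lands on the same core (no locks).
        return hash(pkt["kvs_key"]) % NUM_CORES
    # Other traffic falls back to connection-based receive-side scaling.
    conn = (pkt["src_ip"], pkt["src_port"], pkt["dst_ip"], pkt["dst_port"])
    return hash(conn) % NUM_CORES
```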

Optimizing Writes: Custom DMA
[Figure: the NIC writes directly into application-level structures: GET requests post (client ID, hash, key) descriptors to an event queue; SET requests append the item to an item log and post (client ID, item pointer) descriptors]
- DMA directly into application-level data structures
- Requires packet validation and transformation on the NIC

Key-Based Steering Results
[Figure: throughput in M op/s vs. number of CPU cores (1-5) for FlexKVS/Flow, FlexKVS/Key, FlexKVS/Linux, and Memcached]
- Setup: 6-core Sandy Bridge Xeon 2.2 GHz, 2x 10G links; workload: 100k 32B keys, 64B values, 90% GET
- Per-key processing cycles: 1110 flow-based, 690 key-based, 440 with custom DMA

Hardware-Assisted TCP
- Transport protocols can also be optimized using RMT NICs
- Data is delivered directly from app to app
- Applications safely access the NIC directly; the kernel configures the NIC to enforce resource limits
- Goals: performance, safety, and flexibility

NIC Support for TCP
- TCP requires lots of state and computation, e.g., for corner cases like out-of-order packets; these are expensive to handle and maintain in hardware
- But the common case is simple:
  - Increment the next sequence number, send an ACK
  - Update the next ACK number, free the TX buffer
- Core idea: a fast-path/slow-path split; only the fast path needs to run on the NIC

SplitTCP at a Glance
- The kernel remains responsible for the slow path:
  - Setting up connections
  - Calculating congestion-control rate limits out of band
  - Recovering from packet drops and out-of-order packets
- Apps interact directly with the NIC for the TCP fast path: sending and receiving TCP payload
- The NIC ensures correctness: apps can only send valid segments for fast-pathed flows, and congestion information is processed safely
A sketch of the receive-side split follows.
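A minimal, illustrative sketch of the receive-side fast-path/slow-path split (state layout and field names are mine, not FlexTCP's actual implementation):

```python
from dataclasses import dataclass

@dataclass
class Conn:
    rx_next: int   # next expected receive sequence number
    tx_una: int    # oldest unacknowledged transmit sequence number

def fast_path(conn, seg):
    """Handle the in-order common case on the NIC; return False to punt
    anything unusual (out-of-order, SYN/FIN/RST, losses) to the kernel."""
    if seg["seq"] != conn.rx_next or seg["flags"]:
        return False                         # slow path: kernel handles it
    conn.rx_next += len(seg["payload"])      # advance next expected seq
    # ...DMA the payload into the application's receive buffer...
    # ...emit an ACK carrying conn.rx_next...
    if seg["ack"] > conn.tx_una:
        conn.tx_una = seg["ack"]             # ACKed data: free TX buffers
    return True
```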

FlexTCP Performance
[Figure: key-value store throughput in M op/s vs. core count (1-16) for Linux, mTCP, and FlexTCP; annotations show FlexTCP at 10.7x Linux / 4.1x mTCP and 7.2x Linux / 2.2x mTCP at different core counts]
- Key-value store throughput scalability
- Evaluated using software emulation

Summary
- Data-plane programmability is now a reality
- It can address many pressing challenges:
  - Efficient realization of stateful in-network protocol processing
  - Allowing server cores to keep up with networks
- Many interesting research directions:
  - Application-specific acceleration
  - Systems support for data-plane reconfigurability
  - Programming models and frameworks for application-specific networking

Additional Material
Talk material is drawn from the following papers:
- Approximating Fair Queueing on Reconfigurable Switches. N. Sharma, M. Liu, K. Atreya, A. Krishnamurthy. NSDI 2018.
- Evaluating the Power of Flexible Packet Processing for Network Resource Allocation. N. Sharma, A. Kaufmann, T. Anderson, C. Kim, A. Krishnamurthy, J. Nelson, S. Peter. NSDI 2017.
- High Performance Packet Processing with FlexNIC. A. Kaufmann, S. Peter, N. Sharma, T. Anderson, A. Krishnamurthy. ASPLOS 2016.