Towards a Software Defined Data Plane for Datacenters Arvind Krishnamurthy Joint work with: Antoine Kaufmann, Ming Liu, Naveen Sharma Tom Anderson, Kishore Atreya, Changhoon Kim, Jacob Nelson, Simon Peter
Programmable Networks
Recent and upcoming hardware allows per-packet data-plane processing, both at the switch and the NIC.
Programmable devices run at line rate, but with a tiny bit of computing and a small amount of state.
Multiple hardware realizations exist, but some convergence is taking place.
Reconfigurable Match Table (RMT) Architecture
[Figure: a programmable parser feeds the packet stream through a pipeline of match-action stages (TCAM, SRAM, registers, ALUs) into egress queues]
TCAM/SRAM for matches: port = lookup(eth.dst_mac)
ALUs for modifying headers and registers: ipv4.ttl = ipv4.ttl - 1
Stateful memory for counters and meters: counter[ipv4.dst_port]++
RMT Devices
Growing list of RMT-like offerings at different points in the network: switches, multi-host NICs, NICs.
Key use cases for the reconfigurability:
- Integrate new protocols (e.g., tunneling)
- Customize deployments in the wild (e.g., table sizes)
- Offload some end-host processing (e.g., filtering)
A modest proposal, but it comes at a cost, so we had better make good use of it!
Research Questions
What more can we do with customizable data planes?
What are the compelling applications?
Can we cope with the hardware constraints?
Are the performance benefits significant enough?
Research Projects
1. The intelligent switch: realize stateful network functions
   Leverage switch visibility; use approximation to cope with constraints
2. The intelligent NIC: accelerate end-host applications
   Reduce application overheads; optimize fast-path processing
Challenge: RMT devices are not all-powerful
Processing primitives are limited; available stateful memory is constrained; there are a limited number of stages and limited communication across stages.
Appropriate for the originally intended use cases: custom routing and tunneling protocols such as VxLAN or MPLS.
Most are packet-level transformations involving static tables.
What about network-rich protocols?
Decades' worth of protocols involve active network elements: persistent, mutable state for cross-packet transformations, and some state-machine functionality.
Examples:
- Congestion control (XCP, RCP, QCN, HULL)
- Load balancing (Hedera, CONGA, WCMP, Ananta)
- Fairness & QoS scheduling (Fair Queueing, Seawall, CoDel, D3)
Can we implement these powerful protocols on RMT switches?
Case Study: RCP (Rate Control Protocol)
Explicit, granular network feedback for rate control.
[Figure: a sender's rate is capped by the minimum rate along the path; e.g., Switch A advertises R_a = 25, Switch B advertises R_b = 20, so the flow runs at 20]
Switches periodically compute the flow rate:
R(t) = R(t - d) + [alpha * (C - y(t)) - beta * q(t) / d] / N(t)
where C is the link capacity, y(t) the input traffic rate, q(t) the queue size, d the average RTT, and N(t) the estimated number of flows.
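The periodic rate computation can be sketched in a few lines of plain code. This is a minimal sketch of the standard RCP update; the parameter defaults and the clamp to link capacity are illustrative, not taken from the talk:

```python
def rcp_rate(prev_rate, capacity, input_rate, queue, rtt, n_flows,
             alpha=0.5, beta=0.25):
    # RCP update: share out spare bandwidth, drain any standing queue
    # R(t) = R(t - d) + [alpha*(C - y) - beta*q/d] / N
    spare = capacity - input_rate
    rate = prev_rate + (alpha * spare - beta * queue / rtt) / n_flows
    # never advertise a negative rate or more than link capacity
    return max(0.0, min(rate, capacity))
```

For example, with 20 units of spare capacity, no queue, and two flows, each flow's rate rises by 5 per interval.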
Building Block: Cardinality Estimation
Count the number of unique flows traversing a switch.
We extend an approach from streaming algorithms (HyperLogLog):
1. For each packet, hash the 5-tuple.
2. Keep track of the largest number of leading zeros seen (say M).
3. Estimate the number of unique elements as 2^M.
[Figure: three packets hashed to bit strings; a maximum of 4 leading zeros yields an estimate of 2^4 = 16]
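The three steps above can be sketched as follows. This is a toy single-register version; real HyperLogLog keeps many registers and averages across them, and the hash choice here is illustrative:

```python
import hashlib

def _hash32(five_tuple):
    # hash the 5-tuple down to a 32-bit value
    digest = hashlib.sha256(repr(five_tuple).encode()).digest()
    return int.from_bytes(digest[:4], 'big')

def _leading_zeros(value, bits=32):
    # count leading zeros in a fixed-width value
    for i in range(bits):
        if value & (1 << (bits - 1 - i)):
            return i
    return bits

class FlowCardinality:
    def __init__(self):
        self.max_zeros = 0  # M: largest run of leading zeros seen

    def observe(self, five_tuple):
        self.max_zeros = max(self.max_zeros,
                             _leading_zeros(_hash32(five_tuple)))

    def estimate(self):
        return 2 ** self.max_zeros  # estimate = 2^M
```

Note that observing the same flow twice never changes the estimate, which is exactly why this sketch counts unique flows.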
RCP implementation on a flexible switch
Implemented on the XPliant CNX880xx.
2-level FatTree topology with 4 ToRs and 2 core switches.
[Figure: flow completion time in ms (log scale) for TCP and RCP across short, medium, and long flows]
Simulations show that the impact of approximation is minimal.
Case Study: Fair Queueing for In-Network Enforcement
Enforce fair allocation and isolation at switches: provide the illusion that every flow has its own queue.
Proven to provide perfect isolation and fairness.
+ Simplifies congestion control at the end-host
+ Protects against misbehaving traffic
+ Enables bounded delay guarantees
However, it is challenging to realize in high-speed switches.
Fair Queueing without per-flow queues
Simulates an ideal round-robin scheme in which each active flow transmits a single bit of data every round (Demers et al.).
[Figure: ideal per-flow queues vs. simulated fair queueing; the switch tracks a global round number, stores and updates per-flow counters, and keeps a packet buffer sorted by finish round, e.g., (E,7) (B,5) (C,4) (A,3) (D,2)]
Realizing Fair Queueing on Reconfigurable Switches
1. Maintain a sorted packet buffer
   Requirement: O(log N) insertion complexity. Constraint: limited operations per packet.
2. Store per-flow counters
   Requirement: per-flow mutable state. Constraint: limited switch memory.
3. Access and modify the current round number
   Requirement: synchronized state across switch modules. Constraint: limited cross-module communication.
Our approach: Approximate Fair Queueing
Simulate a bit-by-bit round-robin scheme with approximations:
- Coarse round numbers
- A limited number of FIFO queues with rotating priorities to approximate a sorted buffer
- Approximate per-flow counters stored in a variation of the count-min sketch
[Figure: ideal per-flow queues vs. the approximate scheme replacing the sorted packet buffer]
Storing Approximate Flow Counters
A variation of the count-min sketch tracks each flow's finish round number.
[Figure: R hash functions, hash_1(pkt) % C through hash_R(pkt) % C, each indexing one of R rows of C cells]
An update increments all of the flow's cells; a read returns the minimum.
Never under-estimates, and has a provable space-accuracy trade-off.
Customized to perform a combined read-update operation, with a conditional increment up to the new value for better accuracy.
Read counter: find the minimum of the flow's cells; bytes sent = minimum + pkt.size.
Update counter: raise each of the flow's cells up to the new value: cell[x,y] = max(cell[x,y], new value).
[Figure: worked example with two flows of packet sizes 1000 and 500 bytes updating the sketch]
Implemented in hardware using predicated read-write registers.
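A minimal software sketch of this combined read-update; the hash functions and table sizes are illustrative, and the hardware uses predicated read-write registers rather than loops:

```python
import hashlib

class ApproxFlowCounters:
    def __init__(self, rows=3, cols=1024):
        self.rows, self.cols = rows, cols
        self.cells = [[0] * cols for _ in range(rows)]

    def _indices(self, key):
        # one independent hash per row
        for r in range(self.rows):
            h = hashlib.sha256(f"{r}:{key}".encode()).digest()
            yield r, int.from_bytes(h[:4], 'big') % self.cols

    def read_update(self, key, pkt_size):
        idx = list(self._indices(key))
        # read: the minimum cell is the flow's byte count (never too low)
        sent = min(self.cells[r][c] for r, c in idx)
        new_value = sent + pkt_size
        # conditional increment: raise each cell only up to the new value
        for r, c in idx:
            self.cells[r][c] = max(self.cells[r][c], new_value)
        return new_value
```

The max-update (rather than a plain increment) is what keeps collisions from compounding: a colliding flow can only pull a cell up to its own count, not add to it.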
Buffering Packets in Approximate Sorted Order
[Figure: ideal per-flow queues approximated by K FIFO queues, one per coarse round]
Coarse rounds: each flow transmits a quantum of bytes per round (BpR).
For each packet, outgoing round number = bytes sent / BpR.
Rotating Strict Priority (RSP) Queues
[Figure: K FIFO queues with rotating priorities; the queue for the current round has the highest priority]
Drain the queue with the lowest round number until it is empty.
Then push that queue to the lowest priority and increment the round number by 1.
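The drain-and-rotate behavior, together with the coarse round assignment from the previous slide, can be sketched in software. The queue count K and BpR are illustrative, and clamping enqueues to the window of K live rounds is an assumption about how out-of-window packets are handled:

```python
from collections import deque

class RspScheduler:
    def __init__(self, k, bpr):
        self.k, self.bpr = k, bpr          # K FIFO queues, bytes per round
        self.queues = [deque() for _ in range(k)]
        self.head_round = 0                # round of the highest-priority queue

    def enqueue(self, pkt, bytes_sent):
        # outgoing round number = bytes sent / BpR (coarse rounds)
        rnd = bytes_sent // self.bpr
        rnd = max(rnd, self.head_round)               # no rounds in the past
        rnd = min(rnd, self.head_round + self.k - 1)  # clamp to K live queues
        self.queues[rnd % self.k].append(pkt)

    def dequeue(self):
        # drain the lowest-round queue until empty, then rotate priorities
        for _ in range(self.k):
            q = self.queues[self.head_round % self.k]
            if q:
                return q.popleft()
            self.head_round += 1   # empty queue drops to lowest priority
        return None
```

With BpR = 1000, a flow's second 1000-byte packet lands one round behind everyone's first packet, which is exactly the round-robin behavior being approximated.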
Realizing an RSP Scheduler
RSP can be implemented in hardware, with complexity identical to a Deficit Round Robin scheduler.
RSP can be emulated on current switches:
- Switch CPU periodically changes priorities
- Hierarchical priority queues
- Avoid explicit round-number synchronization by exposing queue metadata
- Utilize dynamic buffer sharing to vary the size of individual queues
Summary of Techniques
1. Modified count-min sketch
+ Counters for a large number of flows in limited memory
- Collisions cause packets to enqueue in a later round
2. RSP queues to approximate a sorted buffer
+ Process packets in a fixed number of operations
- Packets are not strictly prioritized within a round
3. Coarse round numbers
+ Updates to shared state are no longer per-packet
- Packets can enqueue in an earlier round
End-host Flow Control Protocol
AFQ can be deployed without modifying end-hosts.
For even more benefit, adapt the packet-pair algorithm [Keshav et al.]:
- The sender transmits a pair of back-to-back packets
- The inter-arrival delay estimates the bottleneck bandwidth
- End-hosts pace packets at the estimated rate
This enables fast ramp-up while keeping queue sizes small.
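The bandwidth estimate from a packet pair is essentially a one-liner; a sketch with timestamps in seconds and the result in bits per second:

```python
def packet_pair_estimate(t_first, t_second, pkt_bytes):
    # back-to-back packets get spaced out by the bottleneck link, so the
    # receive-side gap reveals the bottleneck bandwidth
    gap = t_second - t_first          # seconds between arrivals
    return pkt_bytes * 8 / gap        # bits per second
```

For example, two 1500-byte packets arriving 1.2 microseconds apart imply a 10 Gbps bottleneck.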
Testbed Results
[Figure: normalized flow completion time (log scale) vs. flow size for TCP, DCTCP, and AFQ, showing both average and 99th-percentile results]
Compared to TCP: 4x better average FCT, 10x better tail latency.
Compared to DCTCP: 2x better average FCT, 4x better tail latency.
AFQ compared to Ideal Fair Queueing
Evaluated using a packet-level simulator across different loads.
[Figure: average FCT in microseconds vs. network load (10-90%) for TCP, DCTCP, SFQ, AFQ, and Ideal FQ; one panel for all flows, one for short flows (< 100 KB)]
More Protocols/Building Blocks
Discussion
Feasible to implement non-trivial stateful protocols; use approximations to overcome hardware limits.
Breaks the vicious cycle between hardware support and protocol evaluation.
Building blocks are reused across many protocols, opening the possibility of fixed-function instantiations.
Suggests a long-term evolution strategy for programmable network devices.
Research Projects 1. The intelligent switch: Realize stateful network functions Leverage switch visibility Use approximation to cope with constraints 2. The intelligent NIC: Accelerate end-host applications Reduce application overheads, optimize fast path processing
Networks: Fast and Growing Faster
[Figure: Ethernet standard bandwidth vs. year of release, from 100 Mb Ethernet through 1/10/40/100 GbE to 400 GbE]
At 100 Gbps, 64B packets arrive every 5 ns.
... but Packet Processing is Slow
Many cloud apps are dominated by packet processing (key-value stores, real-time analytics), particularly with microservices.
Recv+send network stack processing overheads: Linux: 3.4µs; kernel bypass: ~1µs.
Can parallelize, but still too slow.
The RDMA API is rigid: difficult to traverse complex data structures.
Sources of Inefficiency
Wasted CPU cycles: packet parsing and validation repeated in software; poor cache locality; extra synchronization.
The NIC steers packets to cores by connection, but application locality may not match connection locality.
Many cycles are spent directing packets into data structures.
RMT NICs can address many of these issues!
Example: Key-Value Store
Receive-side scaling: core = hash(connection) % N
[Figure: three clients request overlapping key sets (e.g., client 1: keys 3, 4; client 2: keys 1, 4, 7; client 3: keys 5, 7, 8); the NIC spreads their connections across cores that share one hash table]
Result: lock contention and poor cache utilization.
Optimizing Reads: Key-based Steering
Implemented using the RMT NIC model [Kaufmann et al., ASPLOS 2016].
Match: IF udp.port == kvs_port
Action: core = HASH(kvs.key) % N; DMA hash, kvs TO Cores[core]
[Figure: the NIC partitions the hash table across cores by key]
No locks needed; better cache utilization.
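The contrast between connection-hash and key-hash steering can be sketched in a few lines; the hash function and core count here are illustrative:

```python
import hashlib

N_CORES = 2

def _hash32(s):
    return int.from_bytes(hashlib.sha256(s.encode()).digest()[:4], 'big')

def flow_steer(connection):
    # receive-side scaling: every request on a connection hits one core,
    # so a key touched by many clients lands on many cores
    return _hash32(connection) % N_CORES

def key_steer(key):
    # key-based steering: every request for a key hits the same core,
    # so each core owns a hash-table partition (no locks needed)
    return _hash32(key) % N_CORES
```

Under key steering, two clients requesting the same key are guaranteed to land on the same core, which is the property the partitioned hash table relies on.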
Optimizing Writes: Custom DMA
[Figure: the NIC writes GET descriptors (client ID, hash, key) and SET descriptors (client ID, item pointer) into an event queue, and appends items to an item log]
DMA directly to application-level data structures.
Requires packet validation and transformation.
Key-based steering
[Figure: throughput (M op/s) vs. number of CPU cores for FlexKVS/Flow, FlexKVS/Key, FlexKVS/Linux, and Memcached]
Setup: 6-core Sandy Bridge Xeon 2.2GHz, 2x 10G links. Workload: 100k keys, 32B keys, 64B values, 90% GET.
Per-key processing cycles: 1110 flow-based, 690 key-based, 440 with custom DMA.
Hardware-Assisted TCP
Transport protocols can also be optimized using RMT NICs.
Data is delivered directly from app to app; applications safely access the NIC directly; the kernel configures the NIC to enforce resource limits.
Goals: performance, safety, and flexibility.
NIC Support for TCP
TCP requires lots of state and computation, e.g., corner cases like out-of-order packets, which are expensive to handle and maintain in hardware.
But the common case is simple:
- Increment the next sequence number, send an ACK
- Update the next ACK number, free the TX buffer
Core idea: a fast-path/slow-path split. Only the fast path needs to run on the NIC.
SplitTCP at a Glance
The kernel remains responsible for the slow path: setting up connections, calculating congestion-control rate limits out of band, and recovering from dropped or out-of-order packets.
Apps interact directly with the NIC for the TCP fast path: sending and receiving TCP payload.
The NIC ensures correctness: apps can only send valid segments for fast-pathed flows, and congestion information is processed safely.
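A sketch of what the receive fast path might look like; the field names and the exact escalation rule are assumptions, and the real split also covers the send side and rate enforcement:

```python
SYN, FIN, RST = 1, 2, 4

class Conn:
    def __init__(self, rx_next=0):
        self.rx_next = rx_next   # next expected sequence number
        self.delivered = b''     # payload handed to the application
        self.acks = []           # ACK numbers generated by the NIC

class Pkt:
    def __init__(self, seq, payload, flags=0):
        self.seq, self.payload, self.flags = seq, payload, flags

def fastpath_rx(conn, pkt):
    # common case only: in-order data segment with no control flags
    if pkt.flags & (SYN | FIN | RST) or pkt.seq != conn.rx_next:
        return False                    # escalate to the kernel slow path
    conn.rx_next += len(pkt.payload)    # increment next expected sequence
    conn.delivered += pkt.payload       # DMA payload directly to the app
    conn.acks.append(conn.rx_next)      # send cumulative ACK
    return True
```

Everything the fast path does is a bounded, per-packet register update, which is what makes it a fit for an RMT-style NIC pipeline.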
FlexTCP Performance
[Figure: key-value store throughput (M op/s) vs. number of cores (1-16) for Linux, mtcp, and FlexTCP; FlexTCP achieves 10.7x Linux / 4.1x mtcp at one core count and 7.2x Linux / 2.2x mtcp at another]
Key-value store throughput scalability, evaluated using software emulation.
Summary
Data-plane programmability is now a reality. It can address many pressing challenges:
- Efficient realization of stateful in-network protocol processing
- Allowing server cores to keep up with networks
Many interesting research directions:
- Application-specific acceleration
- Systems support for data-plane reconfigurability
- Programming models and frameworks for application-specific networking
Additional Material
Talk material is drawn from the following papers:
Approximating Fair Queueing on Reconfigurable Switches. N. Sharma, M. Liu, K. Atreya, A. Krishnamurthy. NSDI 2018.
Evaluating the Power of Flexible Packet Processing for Network Resource Allocation. N. Sharma, A. Kaufmann, T. Anderson, C. Kim, A. Krishnamurthy, J. Nelson, S. Peter. NSDI 2017.
High Performance Packet Processing with FlexNIC. A. Kaufmann, S. Peter, N. Sharma, T. Anderson, A. Krishnamurthy. ASPLOS 2016.