Intelligent NIC Queue Management in the Dragonet Network Stack


Intelligent NIC Queue Management in the Dragonet Network Stack
Kornilios Kourtis (2), Pravin Shinde (1), Antoine Kaufmann (3), Timothy Roscoe (1)
(1) ETH Zurich, (2) IBM Research, (3) University of Washington
TRIOS, Oct. 4, 2015

NIC queues and network stacks
[Diagram: packets IN pass through the NIC's filters into queues; the NET stack (1 core/queue) serves the application sockets (sock 1, sock 2, sock 3) through SEND/RECV; packets OUT leave through the NIC.]
- 1st approach: policy in the NIC, i.e. Receive Side Scaling (RSS). Result: poor locality.
- The NIC should steer each packet to the core where the application resides (aRFS in Linux, ATR in the i82599 driver, Affinity-Accept [Eurosys12]).
- Data-plane OSes (Arrakis [OSDI14a], IX [OSDI14b]) remove the OS from the data path and demultiplex on the NIC.
- How do you configure the NIC?
- What happens if you run out of filters? Or queues?
- What if you want a different policy (e.g., QoS)?

Dragonet offers an alternative
- NIC queue policy in the OS (not in the NIC or the driver)
- A NIC model that strives to fully capture the NIC's capabilities
- NIC-agnostic policies expressed as cost functions

Talk outline
- Dragonet models the NIC as a dataflow graph (called the Physical Resource Graph, PRG)
- Using the model to manage queues in Dragonet
- Evaluation

NIC model
- F-nodes: a single input and one or more ports, each with (possibly) multiple outputs. When the node's computation is done, one port is activated; subsequently, the nodes connected to that port are activated.
- O-nodes: multiple inputs, {T, F} per operand. Operands can be short-circuited.
- Predicates: boolean expressions about the packet, built from atoms of the form k = v. Each F-node port has a predicate, e.g.:
  - prot = UDPv4 ∧ DstPort = 23
  - prot = UDPv4 ∧ DstPort = 24
  - prot = UDPv4 ∧ (dstport = 23 ∨ dstport = 24)
[Diagram: the filter nodes 5T(UDPv4, *:* -> *:23) and 5T(UDPv4, *:* -> *:24), connected through OR nodes to the queues Q1 and Q0.]
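
A minimal Python sketch of how predicates of this form could be represented and evaluated against a packet. The class names (`Atom`, `And`, `Or`), the `holds` function, and the dictionary packet representation are assumptions made for illustration; they are not Dragonet's actual data types.

```python
from dataclasses import dataclass
from typing import Dict, Tuple, Union

Packet = Dict[str, object]  # hypothetical packet representation: field name -> value


@dataclass(frozen=True)
class Atom:
    """An atom of the form k = v."""
    key: str
    value: object


@dataclass(frozen=True)
class And:
    operands: Tuple["Pred", ...]


@dataclass(frozen=True)
class Or:
    operands: Tuple["Pred", ...]


Pred = Union[Atom, And, Or]


def holds(pred: Pred, packet: Packet) -> bool:
    """Evaluate a predicate on a packet; all()/any() short-circuit their operands."""
    if isinstance(pred, Atom):
        return packet.get(pred.key) == pred.value
    if isinstance(pred, And):
        return all(holds(p, packet) for p in pred.operands)
    if isinstance(pred, Or):
        return any(holds(p, packet) for p in pred.operands)
    raise TypeError(f"not a predicate: {pred!r}")


# The three predicates from the slide.
udp = Atom("prot", "UDPv4")
to_23 = And((udp, Atom("dstport", 23)))
to_24 = And((udp, Atom("dstport", 24)))
either = And((udp, Or((Atom("dstport", 23), Atom("dstport", 24)))))

packet = {"prot": "UDPv4", "dstport": 24}
print(holds(to_23, packet), holds(to_24, packet), holds(either, packet))
```

Keeping predicates as data rather than as opaque functions is what would allow a system to reason about them symbolically, for example to decide which port a whole flow activates.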

Modeling NIC Configuration
- Modern NICs offer rich configuration options that drastically modify the behaviour of the NIC.
- These options are modelled as configuration nodes (C-nodes).
- Applying a configuration value:
  - remove the C-node and its edges
  - add a subgraph based on the configuration value

PRG configuration example (i82599: SYN filter + 5-tuple filters)
[Diagram: the unconfigured PRG contains the configuration nodes CSynFilter, CSynOut, and C5TFilter, the hardware node HWIsTCPSyn (with true/false ports), and the queues Q0-Q3, with Q0 as the default queue.]
- Step 1: apply CSynFilter = true and CSynOut = Q1. Both C-nodes are removed; the true port of HWIsTCPSyn now leads to Q1 and its false port leads to C5TFilter.
- Step 2: apply the 5-tuple filters (IPv4/UDP,*,*,*,53) -> Q2 and (IPv4/UDP,*,*,*,67) -> Q3 to C5TFilter. The C-node is replaced by a chain of filter nodes:
  - 5T(IPv4/UDP,*,*,*,53): true -> Q2, false -> next filter
  - 5T(IPv4/UDP,*,*,*,67): true -> Q3, false -> Q0 (default)
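
The expansion of a 5-tuple C-node into a chain of filter nodes can be sketched in Python as follows. This is an illustrative sketch only: the function names `configure_5t` and `steer`, the node names `5T#0`, `5T#1`, and the dictionary-based graph are assumptions, not Dragonet's implementation.

```python
from typing import Dict, List, Optional, Tuple

# (protocol, src IP, dst IP, src port, dst port); None stands for the wildcard "*".
FiveTuple = Tuple[Optional[str], Optional[str], Optional[str], Optional[int], Optional[int]]
Graph = Dict[str, dict]


def configure_5t(filters: List[Tuple[FiveTuple, str]], default: str) -> Tuple[Graph, str]:
    """Replace a (hypothetical) 5-tuple C-node by a chain of filter nodes.

    Each node's true port leads to its queue; its false port leads to the
    next filter, and the last false port leads to the default queue.
    Returns the generated subgraph and the name of its entry node.
    """
    names = [f"5T#{i}" for i in range(len(filters))]
    graph: Graph = {}
    for i, (flt, queue) in enumerate(filters):
        on_false = names[i + 1] if i + 1 < len(names) else default
        graph[names[i]] = {"filter": flt, "true": queue, "false": on_false}
    entry = names[0] if names else default
    return graph, entry


def steer(graph: Graph, entry: str, packet: FiveTuple) -> str:
    """Walk the chain for one packet and return the queue it reaches."""
    node = entry
    while node in graph:
        spec = graph[node]
        hit = all(f is None or f == p for f, p in zip(spec["filter"], packet))
        node = spec["true"] if hit else spec["false"]
    return node


# The configuration value from the slide: port 53 -> Q2, port 67 -> Q3, default Q0.
graph, entry = configure_5t(
    [(("IPv4/UDP", None, None, None, 53), "Q2"),
     (("IPv4/UDP", None, None, None, 67), "Q3")],
    default="Q0",
)
dns = ("IPv4/UDP", "10.0.0.1", "10.0.0.2", 4000, 53)  # made-up addresses
print(steer(graph, entry, dns))
```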

Managing queues
Dragonet provides:
- a NIC model (including configuration)
- boolean logic for reasoning
Policies are expressed as cost functions:
- input: how flows are mapped into queues (f -> q)
- output: cost
Example policies:
1. balancing flows across queues (and subsequently cores)
2. providing performance isolation for high-priority flows

Specifying policies with cost functions
- Load balancing: the variance of the number of flows per queue, across queues
- QoS/performance isolation:
  - high-priority (HP) flows and best-effort (BE) flows
  - HP flows get N queues; the rest go to BE flows
  - each class provides its own cost function for its flows (e.g., balancing)
  - reject all configurations that assign flows to queues of a different class
  - cost of accepted configurations: c = c_BE + c_HP
  - 20 lines of Haskell code
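
A short Python sketch of the two policies, written under stated assumptions: a flow map is a list of (flow, queue) pairs, rejection is signalled by returning `None`, and the names `balance_cost` and `isolation_cost` are hypothetical. It illustrates the idea only and is not the deck's Haskell code.

```python
from statistics import pvariance
from typing import Callable, List, Optional, Tuple

Flow = str
Queue = int
FlowMap = List[Tuple[Flow, Queue]]


def balance_cost(queues: List[Queue], fmap: FlowMap) -> float:
    """Load balancing: variance of the number of flows per queue.

    Assumes every flow in fmap is mapped to one of the given queues.
    """
    if not queues:
        return 0.0
    counts = {q: 0 for q in queues}
    for _, q in fmap:
        counts[q] += 1
    return float(pvariance(list(counts.values())))


def isolation_cost(is_hp: Callable[[Flow], bool], n_hp_queues: int,
                   queues: List[Queue], fmap: FlowMap) -> Optional[float]:
    """Performance isolation: the first n_hp_queues queues belong to HP flows.

    Returns None (reject) if a flow sits on a queue of the other class,
    otherwise c = c_BE + c_HP using balancing within each class.
    """
    hp_qs, be_qs = queues[:n_hp_queues], queues[n_hp_queues:]
    hps = [(f, q) for f, q in fmap if is_hp(f)]
    bes = [(f, q) for f, q in fmap if not is_hp(f)]
    if any(q not in hp_qs for _, q in hps) or any(q not in be_qs for _, q in bes):
        return None
    return balance_cost(hp_qs, hps) + balance_cost(be_qs, bes)


queues = [0, 1, 2, 3]
is_hp = lambda flow: flow.startswith("hp")  # hypothetical flow naming
balanced = [("hp1", 0), ("hp2", 1), ("be1", 2), ("be2", 3)]
skewed = [("hp1", 0), ("hp2", 0), ("be1", 2), ("be2", 3)]
mixed = [("hp1", 2), ("be1", 3)]
print(isolation_cost(is_hp, 2, queues, balanced),
      isolation_cost(is_hp, 2, queues, skewed),
      isolation_cost(is_hp, 2, queues, mixed))
```

Because the cost function only sees the flow-to-queue map, the same policy can be reused on any NIC for which that map can be computed.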

Computing the flow-to-queue mapping (cost function input)
[Diagram: the IN node feeds the filter nodes 5T: UDPv4 1.1.1.2:* -> *:100 and 5T: UDPv4 1.1.1.1:* -> *:100, which connect through an OR node to the queues Q1 and Q0.]
- Flow: UDPv4 / 1.1.1.1:9001 -> 1.1.1.42:100
- Flow predicate: EtherType = IPv4 ∧ IpProt = UDP ∧ srcip = 1.1.1.1 ∧ srcport = 9001 ∧ dstip = 1.1.1.42 ∧ dstport = 100
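
One way to picture this computation is to treat both the flow and each filter port as a conjunction of atoms and to check whether the flow's predicate implies the port's predicate. The Python sketch below does exactly that. The names `implies` and `qmap`, the queue assigned to each filter, and the `Qdefault` queue are assumptions for illustration; the slide's diagram does not make the assignment recoverable.

```python
from typing import Dict, List, Tuple

Pred = Dict[str, object]  # a conjunction of atoms k = v, stored as a dictionary


def implies(flow: Pred, port_pred: Pred) -> bool:
    """True if every atom required by the port predicate is fixed by the flow."""
    return all(flow.get(k) == v for k, v in port_pred.items())


def qmap(filters: List[Tuple[Pred, str]], default_queue: str,
         flows: Dict[str, Pred]) -> Dict[str, str]:
    """Map each flow to the queue of the first filter it satisfies."""
    return {
        name: next((q for pred, q in filters if implies(flow, pred)), default_queue)
        for name, flow in flows.items()
    }


UDP = {"EtherType": "IPv4", "IpProt": "UDP"}
filters = [
    ({**UDP, "srcip": "1.1.1.2", "dstport": 100}, "Q1"),  # assumed target queue
    ({**UDP, "srcip": "1.1.1.1", "dstport": 100}, "Q0"),  # assumed target queue
]
flows = {
    # The flow from the slide.
    "f1": {**UDP, "srcip": "1.1.1.1", "srcport": 9001, "dstip": "1.1.1.42", "dstport": 100},
    # Two made-up flows for comparison.
    "f2": {**UDP, "srcip": "1.1.1.2", "srcport": 9002, "dstip": "1.1.1.42", "dstport": 100},
    "f3": {**UDP, "srcip": "1.1.1.3", "srcport": 9003, "dstip": "1.1.1.42", "dstport": 100},
}
mapping = qmap(filters, "Qdefault", flows)
print(mapping)
```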

Searching the configuration space
$$c_o(\mathrm{PRG}, F_{all}) = \arg\min_{c \in C} \; cost(qmap(\mathrm{PRG}(c), F_{all}))$$
Performance concerns:
- the full search space is too big
Improving performance:
- reduce the search space (e.g., NIC-specific heuristics)
- incremental computations (flows added, removed)

Greedy search algorithm
    Input:  the set of active flows F_all
    Input:  a cost function cost
    Output: a configuration c

    c <- C_0                      // start with an empty configuration
    F <- {}                       // flows already considered
    foreach f in F_all do
        // CC_f: a set of configuration changes on f
        // that incrementally change c
        CC_f <- oracleGetConfChanges(c, f)
        F <- F + f                // add f to F
        find cc in CC_f that minimizes cost(qmap(PRG(c + cc), F))
        c <- c + cc               // apply the change to the configuration
- The algorithm generates configurations from flows.
- Oracle: NIC-specific configuration generation.
- It can be used incrementally, as flows arrive.
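
The greedy loop can be sketched in Python as below. The search function follows the slide's pseudocode; the toy oracle, the toy `qmap`, and the three-queue setup are assumptions invented for the example (a real oracle would generate NIC-specific filter configurations).

```python
from statistics import pvariance
from typing import Callable, List, Sequence, Tuple

Flow = str
Change = Tuple[Flow, int]        # toy configuration change: steer flow to queue
Conf = Tuple[Change, ...]        # a configuration is a tuple of applied changes
FlowMap = List[Tuple[Flow, int]]

QUEUES = [0, 1, 2]               # hypothetical NIC with three queues


def toy_oracle(conf: Conf, flow: Flow) -> List[Change]:
    """Toy stand-in for the NIC-specific oracle: one candidate filter per queue."""
    return [(flow, q) for q in QUEUES]


def toy_qmap(conf: Conf, flows: Sequence[Flow]) -> FlowMap:
    """Flows with a filter go to its queue; all others land on default queue 0."""
    steering = dict(conf)
    return [(f, steering.get(f, 0)) for f in flows]


def balance_cost(fmap: FlowMap) -> float:
    """Variance of the number of flows per queue."""
    counts = {q: 0 for q in QUEUES}
    for _, q in fmap:
        counts[q] += 1
    return float(pvariance(list(counts.values())))


def greedy_search(flows: Sequence[Flow],
                  cost: Callable[[FlowMap], float],
                  oracle: Callable[[Conf, Flow], List[Change]],
                  qmap: Callable[[Conf, Sequence[Flow]], FlowMap]) -> Conf:
    conf: Conf = ()              # start with an empty configuration
    seen: List[Flow] = []        # flows already considered
    for f in flows:
        changes = oracle(conf, f)
        seen.append(f)
        # Pick the change that minimizes the cost over the flows seen so far.
        best = min(changes, key=lambda cc: cost(qmap(conf + (cc,), seen)))
        conf = conf + (best,)    # apply the change to the configuration
    return conf


conf = greedy_search([f"f{i}" for i in range(6)], balance_cost, toy_oracle, toy_qmap)
print(conf)
```

Since each iteration only extends the current configuration, calling the loop body again for a newly arrived flow is the incremental use the slide mentions.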

Efficient flow-to-queue map computation
    foreach f in F_all do
        ...
        find cc in CC_f that minimizes cost(qmap(PRG(c + cc), F))
        ...
Naive:
- compute the configuration (C) from the configuration changes ([cc])
- apply C to the PRG
- compute the map
Incremental:
- maintain a partially configured PRG
- compute flow-to-port mappings for each node
- applying a cc adds new nodes; propagate the mappings

Evaluation

Implementation + Experimental setup
Implementation:
- Haskell + C
- Solarflare SFC9020 (OpenOnload)
- Intel i82599 (DPDK)
Setup:
- 10 client machines for load generation
- 1 server with 20 cores: 10 cores to Dragonet, 10 cores to the application, 10 queues

Experiment #1: basic comparison
- Goal: to show that Dragonet has reasonable performance under the same conditions
- µbench: UDP echo server
- 20 netperf clients, 16 packets in flight
- Solarflare SFC9020 (vs. the Linux stack and the OpenOnload user-level stack)
- Dragonet: load-balancing cost function; others: RSS

Echo server performance on the SFC9020 NIC
[Figure: bar charts comparing Linux, Onload, and DNet. Panels: (a) latency, 1024 bytes; (b) throughput, 1024 bytes; (c) latency, 64 bytes; (d) throughput, 64 bytes. Axes: mean latency (usecs) and TPS (Ktx/sec/client).]

Experiment #2: performance isolation/QoS
- UDP memcached, memaslap clients
- HP clients: 4 queues; BE clients: 6 queues
- 2 HP clients x 16 flows, 18 BE clients x 16 flows (320 flows in total) (stable)
- After 10 secs, we start a new HP client that runs for 50 secs
- After the new HP client is done, we start a new BE client
- We show here results for the Intel i82599 (similar results for the Solarflare SFC9020 are in the paper)

Performance isolation (Intel i82599)
[Figure: bar charts comparing Bal, Isolated, +BE, and +HP. Panels: (e) latency, 1024 bytes; (f) throughput, 1024 bytes; (g) latency, 64 bytes; (h) throughput, 64 bytes. Axes: mean latency (usecs) and TPS (Ktx/sec/client).]

Search cost (10 queues, i82599 PRG)

         Naive    Incremental
flows    full     full     +1 fl.   +10 fl.   -1 fl.    -10 fl.
10       11 ms    17 ms    2 ms     22 ms     9 µs      23.7 µs
100      1.2 s    0.6 s    9 ms     94 ms     74 µs     117 µs
250      13 s     4 s      21 ms    219 ms    190 µs    277 µs
500      76 s     17 s     43 ms    484 ms    382 µs    548 µs

Conclusion
Dragonet offers a systematic approach to managing queues:
- models the NIC using a dataflow graph
- expresses policy via cost functions
- uses incremental computations to improve performance
Code available at http://git.barrelfish.org/?p=dragonet
Thank you!
Acknowledgements: ETH Barrelfish team!

Filter configuration for the Intel i82599 NIC
- 5-tuple filters: 128 filters that match on <protocol, source IP, destination IP, source port, destination port>. Each field can be masked.
- Flow director filters: similar to 5-tuple filters, with increased flexibility at the cost of additional memory and latency (they are stored in the receive-side buffer space and implemented as a hash with linked-list chains). Flow director filters can operate in two modes: perfect match, which supports 8192 filters and matches on fields, and signature, which supports 32768 filters and matches on a hash-based signature of the fields. Flow director filters support global fine-grained masking, enabling range matching.
- Ethertype filters: match packets based on the Ethertype field (although they are not to be used for IP packets) and can be used for protocols such as Fibre Channel over Ethernet (FCoE).
- A SYN filter for directing TCP SYN packets to a given queue, for example to mitigate SYN-flood attacks.
- FCoE redirection filters for steering Fibre Channel over Ethernet packets based on FC protocol fields: Originator Exchange ID or Responder Exchange ID.
- MAC address filters for steering packets into queue pools, typically assigned to virtual machines.
- Receive Side Scaling (RSS), where packet fields are used to generate a hash value that indexes a 128-entry table with 4-bit values indicating the destination queue.
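
The RSS lookup described in the last item can be sketched in Python as follows. Only the table shape (128 entries of 4 bits each) comes from the slide. The hash is an assumption: `zlib.crc32` is used purely as a stand-in for the NIC's own hash function, and the round-robin table fill and the function names are hypothetical.

```python
import zlib
from typing import List

TABLE_SIZE = 128   # entries in the redirection table (from the slide)
QUEUE_BITS = 4     # each entry holds a 4-bit queue index (from the slide)


def make_table(n_queues: int) -> List[int]:
    """Fill the table round-robin over the queues (an assumed, common choice)."""
    if not 1 <= n_queues <= (1 << QUEUE_BITS):
        raise ValueError("a 4-bit entry can address at most 16 queues")
    return [i % n_queues for i in range(TABLE_SIZE)]


def rss_queue(table: List[int], src_ip: str, dst_ip: str,
              src_port: int, dst_port: int) -> int:
    """Hash the packet fields and use the hash to index the table."""
    key = f"{src_ip}:{src_port}>{dst_ip}:{dst_port}".encode()
    h = zlib.crc32(key)            # stand-in hash, NOT the hardware's function
    return table[h % TABLE_SIZE]   # 128 entries, so 7 bits of the hash are used


table = make_table(10)
q = rss_queue(table, "1.1.1.1", "1.1.1.42", 9001, 100)
print(q)
```

All packets of a flow share the same fields and therefore reach the same queue, but the queue is chosen by the hash rather than by where the application runs, which is the locality problem raised at the start of the talk.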

Dragonet in a nutshell
- Physical Resource Graph (PRG): hw functions, configuration
- Logical Protocol Graph (LPG): protocol state, packet processing
- Search: finds a PRG configuration, which yields the configured PRG
- Embedding: the configured PRG and the LPG yield the embedded graph, in which part of the LPG runs in hw
The rest of the talk:
- the Dragonet model (graph building blocks)
- case study: NIC queue management
- search for an optimal NIC queue configuration
- evaluation

Performance isolation cost function

    costFn ishp nhpqs queues fm
      | not hpok        = CostReject 1
      | not beok        = CostReject 1
      | length hps == 0 = CostVal balbe
      | length bes == 0 = CostVal balhp
      | otherwise       = CostVal $ balhp + balbe
      where
        hpqs = take nhpqs queues -- HP queues
        beqs = drop nhpqs queues -- BE queues
        -- partition flows to HP/BE
        (hps, bes) = partition (ishp . fst) fm
        -- check if HP (BE) flows are assigned
        -- only to HP (BE) queues
        hpok = and [q `elem` hpqs | (_, q) <- hps]
        beok = and [q `elem` beqs | (_, q) <- bes]
        -- compute costs of individual classes
        CostVal balhp = balancecost_ hpqs hps
        CostVal balbe = balancecost_ beqs bes

Incremental: C-nodes can compute flows
[Diagram (animation): an IN node connected through OR nodes to the queues Q0, Q1, and Q2; numbered nodes (1, 2) are added step by step.]

Adding/removing flows
- Adding flows: another step in the greedy search
- We remove flows lazily:
  - each cc is paired with a flow
  - remove the flow, but keep the cc (do not change the configuration)
  - the oracle repurposes the nodes generated by the cc

Configuration nodes (C-nodes)
[Diagram: a C-node C with inputs I1, I2 and outputs O1, O2, placed between the nodes X and Y. Applying a configuration value (Conf) removes C and its edges and inserts new nodes N1 and N2 between the inputs and the outputs.]

Managing NIC queues
Hardware receive filters (packets -> Rx queue):
- Receive Side Scaling (RSS): hash-based load balancing
- NIC-specific hardware filters (e.g., 5-tuples, TCP SYN packets)
Linux support:
- RSS (does not consider application locality)
- Accelerated Receive Flow Steering (aRFS):
  - aims to steer packets to the core where the application resides
  - maintains flow information
  - calls the NIC driver to steer flows
  - inlined in the protocol implementation
- Application Targeted Receive (ATR):
  - implemented in the driver
  - the driver samples transmit packets
  - uses device-specific filters to steer packets

Adding an HP client when using 64-byte requests (i82599)
We instruct clients to report results every second (the minimum possible value).
[Figure: (i) latency, Avg (us) over time (sec); (j) throughput, TPS (Ktx/sec/client) over time (sec). Series: Stable (HP), Stable (BE), +HP.]