Titan: Fair Packet Scheduling for Commodity Multiqueue NICs. Brent Stephens, Arjun Singhvi, Aditya Akella, and Mike Swift. July 13th, 2017


1 Titan: Fair Packet Scheduling for Commodity Multiqueue NICs. Brent Stephens, Arjun Singhvi, Aditya Akella, and Mike Swift. July 13th, 2017

2 Ethernet line-rates are increasing!

3 Servers need: to drive increasing line-rates; low CPU utilization networking.

4 Underlying mechanisms: segmentation offload; multiqueue NICs.

5 TCP Segmentation Offload (TSO): Many operations performed by the OS are per-packet, not per-byte. TSO allows the OS to send large segments to the NIC, and the NIC hardware generates packets from those segments. Using large segments (64KB) instead of packets can reduce CPU load.
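To make the per-packet versus per-segment point concrete, the short sketch below counts how many packets the NIC would generate from one 64KB segment. The 1448-byte payload per packet is an assumption (a 1500-byte MTU with TCP timestamps), not a figure from the slides.

```python
import math

SEGMENT_BYTES = 64 * 1024  # segment size named on the slide (64KB)
MSS_BYTES = 1448           # ASSUMPTION: TCP payload per packet, not from the slides


def packets_per_segment(segment_bytes, mss_bytes):
    """Number of packets the NIC generates from one segment."""
    return math.ceil(segment_bytes / mss_bytes)


packets = packets_per_segment(SEGMENT_BYTES, MSS_BYTES)
print(packets)  # 46
```

Under that assumption the OS performs its per-packet work once for the segment rather than 46 times, which is where the CPU savings come from.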

6 Multiqueue NICs: Multiqueue NICs enable parallelism. [Figure labels: Core 1, Core 2; per-core transmit queues TXQ-1 and TXQ-2; Locking/Polling; Packet Scheduler.]

7 Fairness Problems: TSO and multiqueue cause pervasive unfairness. [Figure: Core 1 with TXQ-1 and Core 2 with TXQ-2 feeding the packet scheduler; a fair packet schedule compared with the actual packet schedule; labels: multiqueue unfairness, TSO unfairness.]

8 Fairness is important: Fairness is needed so competing applications can share the network. Fairness is needed for predictability: unfairness leads to unpredictable completion times across runs, while perfect fairness gives perfect predictability. Fairness can improve application performance, e.g. Weighted Coflow Scheduling [Chowdhury SIGCOMM11, Chowdhury SIGCOMM14].

9 Titan Goals: drive increasing line-rates; low CPU utilization; per-flow fairness; work on commodity NICs.

10 Multiqueue Fairness in Linux: Flow arrivals to each transmit queue are dynamic. The OS statically uses a per-flow hash to assign flows to queues. The NIC scheduler statically uses deficit round-robin (DRR) to provide per-queue fairness. In the datacenter, the OS statically chooses a TSO size.
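A minimal sketch of why per-queue fairness is not per-flow fairness: if every non-empty queue receives an equal share of the link and the flows inside a queue split that share evenly, a hash collision halves the colliding flows' bandwidth. The flow names and the queue placement are hypothetical, and the model assumes every flow always has data to send.

```python
from collections import Counter


def per_flow_share(flow_to_queue):
    """Link share per flow under equal per-queue scheduling.

    Each non-empty queue gets 1/active_queues of the link, and the
    flows in a queue split that queue's share evenly.
    """
    flows_in_queue = Counter(flow_to_queue.values())
    active_queues = len(flows_in_queue)
    return {
        flow: 1.0 / active_queues / flows_in_queue[queue]
        for flow, queue in flow_to_queue.items()
    }


# HYPOTHETICAL placement: the static hash puts F1 and F2 on queue 0, F3 on queue 1.
shares = per_flow_share({"F1": 0, "F2": 0, "F3": 1})
print(shares)  # {'F1': 0.25, 'F2': 0.25, 'F3': 0.5}
```

A fair schedule would give each of the three flows one third of the link; here F3 gets twice the bandwidth of F1 and F2.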

11 Titan Design: As flows dynamically arrive and complete, the OS dynamically assigns weights to flows, tracks the flow occupancy of queues, picks queues for flows, and updates the NIC with queue weights. The NIC dynamically applies the queue weights from the OS.

12 Causes of Unfairness: multiqueue unfairness; TSO unfairness.

13 Problem: Hash collisions. [Figure: TXQ-1, TXQ-2, and TXQ-3 feeding the packet scheduler; label: multiqueue unfairness.]

14 Problem: Hash collisions. Solution: Dynamic Queue Assignment (DQA). The OS assigns a weight to each flow. DQA picks the queue with the lowest occupancy when a flow starts. Queue occupancies are updated any time a flow starts enqueuing data and any time a flow has no enqueued bytes (at most each TX interrupt). [Figure: TXQ-1, TXQ-2, and TXQ-3 feeding the packet scheduler.]

15 Problem: Hash collisions. Solution: Dynamic Queue Assignment (DQA). [Figure: TXQ-1, TXQ-2, and TXQ-3 feeding the packet scheduler.]
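The DQA rule can be sketched in a few lines. This is an illustration only, not the Titan kernel code: the class and method names are hypothetical, and it assumes a queue's occupancy is the sum of the weights of the flows that currently have data enqueued on it.

```python
class DynamicQueueAssigner:
    """Illustrative DQA sketch (hypothetical names, not the kernel implementation)."""

    def __init__(self, num_queues):
        # ASSUMPTION: occupancy = sum of weights of flows with enqueued data.
        self.occupancy = [0] * num_queues
        self.flow_queue = {}
        self.flow_weight = {}

    def flow_started(self, flow_id, weight=1):
        """A flow starts enqueuing data: pick the least-occupied queue."""
        queue = min(range(len(self.occupancy)), key=self.occupancy.__getitem__)
        self.occupancy[queue] += weight
        self.flow_queue[flow_id] = queue
        self.flow_weight[flow_id] = weight
        return queue

    def flow_drained(self, flow_id):
        """A flow has no enqueued bytes left: release its occupancy."""
        queue = self.flow_queue.pop(flow_id)
        self.occupancy[queue] -= self.flow_weight.pop(flow_id)


dqa = DynamicQueueAssigner(num_queues=3)
placement = [dqa.flow_started(f) for f in ("F1", "F2", "F3", "F4")]
print(placement)       # [0, 1, 2, 0]
dqa.flow_drained("F2")
print(dqa.flow_started("F5"))  # 1, the queue F2 just vacated
```

With four flows and three queues one queue necessarily holds two flows, which is the asymmetric oversubscription case that queue weights address.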

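16 Problem: Asymmetric Oversubscription. With every queue at weight 1, the flows that share a queue receive half the throughput of the others. [Figure: TXQ-1 (W: 1), TXQ-2 (W: 1), and TXQ-3 (W: 1) feeding the packet scheduler; flow F4 shares a queue.]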

17 Problem: Asymmetric Oversubscription. Solution: Dynamic Queue Weight Assignment (DQWA). The OS assigns weights to flows. The OS updates the NIC scheduler with queue occupancies as flows start and stop (at most each TX interrupt). The NIC updates DRR weights. This is implementable on existing commodity NICs because it only needs to update DRR weights! [Figure: TXQ-1 (W: 2), TXQ-2 (W: 1), and TXQ-3 (W: 1) feeding the packet scheduler; label: ndo_set_tx_weight.]

18 Problem: Asymmetric Oversubscription. Solution: Dynamic Queue Weight Assignment (DQWA). DQA and DQWA provide long-term fairness. This is implementable on existing commodity NICs because it only needs to update DRR weights! [Figure: TXQ-1 (W: 2), TXQ-2 (W: 1), and TXQ-3 (W: 1) feeding the packet scheduler; label: ndo_set_tx_weight.]
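The effect of the W: 2, W: 1, W: 1 weights can be checked with a little arithmetic. The sketch below is a model, not the driver code: it assumes a weighted scheduler gives each non-empty queue bandwidth in proportion to its weight, that flows within a queue split the queue's share by flow weight, and that all flows are always backlogged. The function names and the flow placement are hypothetical.

```python
from fractions import Fraction


def queue_occupancy(flow_to_queue, flow_weight, num_queues):
    """Sum of flow weights per queue; DQWA uses this as the queue weight."""
    occupancy = [0] * num_queues
    for flow, queue in flow_to_queue.items():
        occupancy[queue] += flow_weight[flow]
    return occupancy


def per_flow_share(flow_to_queue, flow_weight, drr_weights):
    """Link share per flow for the given per-queue scheduler weights."""
    occupancy = queue_occupancy(flow_to_queue, flow_weight, len(drr_weights))
    active_weight = sum(w for w, occ in zip(drr_weights, occupancy) if occ > 0)
    shares = {}
    for flow, queue in flow_to_queue.items():
        queue_share = Fraction(drr_weights[queue], active_weight)
        shares[flow] = queue_share * Fraction(flow_weight[flow], occupancy[queue])
    return shares


# HYPOTHETICAL placement: F1 and F4 share queue 0; F2 and F3 have their own queues.
flows = {"F1": 0, "F2": 1, "F3": 2, "F4": 0}
weights = {flow: 1 for flow in flows}

static = per_flow_share(flows, weights, [1, 1, 1])
dqwa = per_flow_share(flows, weights, queue_occupancy(flows, weights, 3))
print(static)  # F1 and F4 get 1/6 each, F2 and F3 get 1/3 each
print(dqwa)    # every flow gets 1/4
```

Setting each queue's weight to its occupancy makes the per-queue term cancel, so every flow ends up with a share proportional to its own weight.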

19 Problem: TSO Unfairness. Short-term unfairness can cause bursts of congestion in the network. Short-term unfairness can increase latency. [Figure: TXQ-1 (W: 2), TXQ-2 (W: 1), and TXQ-3 (W: 1) feeding the packet scheduler; label: short-term unfairness.]

20 Problem: TSO Unfairness. Solution: Dynamic Segmentation Offload Sizing (DSOS). DSOS dynamically changes the segment size during oversubscription. Same implementation as GSO. CPU vs. fairness tradeoff. Segmenting after the TCP/IP stack reduces CPU costs. [Figure: TXQ-1 (W: 2), TXQ-2 (W: 1), and TXQ-3 (W: 1) feeding the packet scheduler.]
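A rough sketch of the DSOS idea follows. It is not the GSO-based implementation: the oversubscription test (more active flows than queues) and the function names are assumptions made for illustration, and the 64KB and 16KB sizes are taken from the TSO slide and the evaluation slides respectively.

```python
DEFAULT_SEGMENT_BYTES = 64 * 1024  # large TSO segment
DSOS_SEGMENT_BYTES = 16 * 1024     # smaller segment used in the evaluation


def segment_size(active_flows, num_queues):
    """ASSUMPTION: treat 'more flows than queues' as oversubscription."""
    if active_flows > num_queues:
        return DSOS_SEGMENT_BYTES
    return DEFAULT_SEGMENT_BYTES


def resegment(payload_len, size):
    """Split a payload into (offset, length) pieces of at most `size` bytes."""
    return [
        (offset, min(size, payload_len - offset))
        for offset in range(0, payload_len, size)
    ]


size = segment_size(active_flows=4, num_queues=3)
pieces = resegment(64 * 1024, size)
print(size, len(pieces))  # 16384 4
```

Smaller segments let the NIC scheduler interleave flows more finely, at the cost of more per-segment work on the CPU.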


21 Implementation: DQA, DQWA, and DSOS are implemented in Linux. Support for ndo_set_tx_weight is implemented in the Intel ixgbe driver for the Intel Gbps NIC. Titan is open source!

22 Evaluation: Microbenchmarks. Setup: 2 servers, 1 switch, 8-queue NICs. Vary the number of flows (level of oversubscription). Measure the incremental fairness benefits of DQA, DQWA, and DSOS. DQA and DQWA are expected to improve long-term fairness; DSOS is expected to improve short-term fairness.

23 Evaluation: Fairness Metric. Normalized fairness metric (NFM), inspired by Shreedhar and Varghese: NFM = (Bytes(MaxFlow) - Bytes(MinFlow)) / Bytes(FairShare). NFM = 0 is fair; NFM > 1 is very unfair. [Figure: an ideal packet schedule with NFM = 0 and an unfair packet schedule with NFM = 1.]
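The metric is easy to compute from per-flow byte counts over a measurement window. In the sketch below, the fair share is assumed to be an equal split of the total bytes sent in the window; the slides do not define it, and with unequal flow weights it would need to be weighted.

```python
def nfm(bytes_per_flow):
    """Normalized fairness metric over one measurement window.

    ASSUMPTION: FairShare is the total bytes divided equally among flows.
    """
    fair_share = sum(bytes_per_flow) / len(bytes_per_flow)
    return (max(bytes_per_flow) - min(bytes_per_flow)) / fair_share


print(nfm([100, 100, 100, 100]))  # 0.0
print(nfm([150, 50, 100, 100]))   # 1.0
```

The second call shows how a gap between the fastest and slowest flow equal to one fair share yields NFM = 1.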

24 Microbenchmarks, 1s Timescale: Linux is unfair at all subscription levels. DQA often significantly improves fairness. At 48 flows, flow churn prevents DQA from evenly spreading flows. DQWA improves fairness when DQA cannot evenly spread flows across queues. DSOS does not have a significant impact on long-term fairness. [Plot: NFM at the 1s timescale vs. number of flows, for Linux, DQA, DQA + DQWA, and DQA + DQWA + DSOS (16KB).]

25 Microbenchmarks, 1ms Timescale: At short timescales and under oversubscription, DQA and DQWA do not significantly improve fairness. TSO is the primary cause of unfairness. DSOS (16KB) often reduces unfairness by >2x. [Plot: NFM at the 1ms timescale vs. number of flows, for Linux, DQA, DQA + DQWA, and DQA + DQWA + DSOS (16KB).]


26 Cluster Experiments: CDF of completion times in a 1GB all-to-all shuffle (24 servers). The ideal CDF would be a vertical line. Titan makes performance more predictable. Titan improves tail performance (>90th percentile). Titan improves fairness without changing the network core! [Plot: cumulative probability vs. flow completion time (s), 24 servers; legend: Linux Vanilla Titan.]


27 Additional Evaluation: Additional performance metrics. Throughput: line-rate. Latency: no significant change. CPU utilization: DQA and DQWA increase < 10%; DSOS is better than statically decreasing the TSO size. DSOS motivates creating a better TSO implementation (zero-copy). Linux network configuration trade-off study: see paper.

28 Summary: Multiqueue NICs can lead to significant flow-level unfairness. Titan significantly improves fairness by allowing the OS to dynamically interact with the NIC packet scheduler. Titan is implementable on commodity NICs!
