Network stack specialization for performance


1 Network stack specialization for performance goo.gl/1la2u6 Ilias Marinos, Robert N.M. Watson, Mark Handley* University of Cambridge, * University College London

3 Motivation
Providers are scaling out rapidly. Key aspects:
- 1 machine:N functions, N machines:1 function
- Performance is critical
- Scalability on multicore systems
- Cost & energy concerns
Are general-purpose stacks the right solution for that kind of role?

4 The Problem Conventional stacks are great for bulk transfers, but what about short ones?

9 The Problem
[Figure: network throughput (Gbps) and CPU utilization (%) versus HTTP object size (KB)]
- NIC saturation, low CPU usage
- Throughput/CPU ratio is low
Short-lived HTTP flows are a problem!

12 Why is this important?
- 95% of the HTTP requested object sizes are ≤ 50 KB
- 90% of the HTTP requested object sizes are ≤ 25 KB
Distribution based on traces from Yahoo! CDN [Al-Fares et al. 2011]

13 Design Goals
Design a network stack that:
- Allows transparent flow of memory from the NIC to the application and vice versa
- Reduces system costs (e.g., batching, cache locality, lock- and sharing-free design, CPU affinity)
- Exploits application-specific knowledge to reduce repetitive processing costs (e.g., TCP segmentation of web objects, checksums)

14 Sandstorm: specialized webserver stack
Prototyped on top of FreeBSD's netmap framework:
- libnmio: abstracts netmap-related I/O
- libeth: lightweight Ethernet layer
- libtcpip: optimized TCP/IP layer
- application: simple HTTP server that serves static content
[Diagram, user space: webserver (web_write(), web_recv()); libtcpip.so (tcpip_write(), tcpip_recv(), tcpip_fsm(), tcpip_input(), tcpip_output()); libeth.so (eth_input(), eth_output()); libnmio.so (netmap_input(), netmap_output(), netmap ioctls); zero copy on the output path]
[Diagram, kernel: device driver, reached via syscall; TX and RX buffer rings are DMA memory mapped to user space]

15 Sandstorm: specialized webserver stack
Key decisions (some of them):
- Application and stack are merged into the same process address space
- Static content is pre-segmented into network packets and loaded into DRAM a priori
- Received packet frames are processed in place on the RX rings, without memory copying or buffering
- RX/TX packet batching greatly amortizes the system call overhead
- Bufferless, synchronous model (no socket layer)
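The pre-segmentation decision can be illustrated with a minimal sketch. It is not Sandstorm's code: the struct segment layout, the function names and the 1448-byte MSS used in the usage note are assumptions made for illustration, and a real stack would also pre-build the headers and checksums for each packet.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* One pre-built payload segment of a static object (hypothetical layout). */
struct segment {
    uint32_t seq_off;  /* byte offset of this payload within the object */
    uint16_t len;      /* payload length in bytes */
    uint8_t *payload;  /* private copy of the payload bytes */
};

/* Number of segments needed for obj_len bytes at a given MSS (mss > 0). */
static size_t segment_count(size_t obj_len, size_t mss)
{
    return (obj_len + mss - 1) / mss;
}

/* Release an array produced by presegment(). */
static void free_segments(struct segment *segs, size_t nseg)
{
    if (segs == NULL)
        return;
    for (size_t i = 0; i < nseg; i++)
        free(segs[i].payload);
    free(segs);
}

/*
 * Split a static object into MSS-sized segments once, at startup, so the
 * per-request path only has to hand out ready-made payloads.
 * Returns NULL on bad arguments or allocation failure.
 */
static struct segment *presegment(const uint8_t *obj, size_t obj_len,
                                  size_t mss, size_t *nseg)
{
    if (obj == NULL || nseg == NULL || mss == 0 || mss > UINT16_MAX)
        return NULL;

    size_t n = segment_count(obj_len, mss);
    struct segment *segs = calloc(n ? n : 1, sizeof(*segs));
    if (segs == NULL)
        return NULL;

    for (size_t i = 0; i < n; i++) {
        size_t off = i * mss;
        size_t len = (obj_len - off < mss) ? obj_len - off : mss;

        segs[i].payload = malloc(len);
        if (segs[i].payload == NULL) {
            free_segments(segs, i);  /* frees segments 0..i-1 and the array */
            return NULL;
        }
        memcpy(segs[i].payload, obj + off, len);
        segs[i].seq_off = (uint32_t)off;
        segs[i].len = (uint16_t)len;
    }
    *nseg = n;
    return segs;
}
```

With an assumed MSS of 1448 bytes, the 24 KB object used in the later measurements splits into 17 segments, the last one carrying 1408 bytes. The split is paid once at load time rather than on every request.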

30 Sandstorm Architecture (10,000 ft view)
[Diagram, user space: app, tcpip, eth and nmio layers, plus the content]
[Diagram, kernel: NIC driver, with the ix0:rx and ix0:tx rings]
Labels added as the slide builds up (slides 16-30), in order of appearance:
- netmap_input()
- POLLIN
- ether_input()
- tcpip_input()
- TCP FSM
- websrv_accept(), websrv_receive()
- tcpip_output()
- ether_output()
- netmap_output()
- POLLOUT

33 Evaluation
[Figure: throughput over 6 NICs (Gbps) versus HTTP object size (KB) for nginx+FreeBSD, nginx+Linux and Sandstorm]
- Annotated speedups: ~9.8x, ~3.6x, ~1.8x
- Results start converging for sizes ≥ 256 KB

34 To copy or not to copy?

zerocopy:
    /* Get source and destination slots */
    struct netmap_slot *bf = &ppool->slot[slotindex];
    struct netmap_slot *tx = &txring->slot[cur];

    /* zero-copy packet: point the TX slot at the pre-built buffer */
    tx->buf_idx = bf->buf_idx;
    tx->len = bf->len;
    tx->flags = NS_BUF_CHANGED;

OR memcpy (same bf and tx slots as above):
    /* Get source and destination bufs */
    char *srcp = NETMAP_BUF(ppool, bf->buf_idx);
    char *dstp = NETMAP_BUF(txring, tx->buf_idx);

    /* memcpy packet: copy the payload into the TX slot's own buffer */
    memcpy(dstp, srcp, bf->len);
    tx->len = bf->len;
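The difference between the two variants can be tried without netmap using a small mock. Everything here is a stand-in: struct mock_slot, mock_bufs and MOCK_BUF_CHANGED are hypothetical names that only imitate the roles of netmap's slot, buffer pool and NS_BUF_CHANGED flag, and the flag value is made up.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MOCK_NBUFS 4
#define MOCK_BUF_SIZE 2048
#define MOCK_BUF_CHANGED 0x1  /* stand-in flag; the value is made up */

/* Simplified imitation of a ring slot: it names a buffer by index. */
struct mock_slot {
    uint32_t buf_idx;
    uint16_t len;
    uint16_t flags;
};

/* Shared buffer pool that slots index into. */
static char mock_bufs[MOCK_NBUFS][MOCK_BUF_SIZE];

/* Zero copy: only slot metadata changes; no payload bytes are touched. */
static void mock_tx_zerocopy(struct mock_slot *tx, const struct mock_slot *src)
{
    tx->buf_idx = src->buf_idx;
    tx->len = src->len;
    tx->flags |= MOCK_BUF_CHANGED;  /* tell the driver the buffer changed */
}

/*
 * memcpy: the TX slot keeps its own buffer and the payload is copied in.
 * Returns 0 on success, -1 if the payload does not fit.
 */
static int mock_tx_memcpy(struct mock_slot *tx, const struct mock_slot *src)
{
    if (src->len > MOCK_BUF_SIZE)
        return -1;
    memcpy(mock_bufs[tx->buf_idx], mock_bufs[src->buf_idx], src->len);
    tx->len = src->len;
    return 0;
}
```

The zero-copy path costs a few stores per packet regardless of payload size, while the memcpy path reads and writes every payload byte. That per-byte work is what the measurements on the following slides compare.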

36 To copy or not to copy?
[Figure: throughput (Gbps) of Sandstorm zerocopy versus Sandstorm memcpy on an Intel Core 2 (2006), serving a 24 KB HTTP object]

37 To copy or not to copy?
[Figure: throughput (Gbps) of Sandstorm zerocopy versus Sandstorm memcpy on an Intel Sandy Bridge (2013), serving a 24 KB HTTP object]

42 CPU microarchitecture ~2006
[Diagram: four cores (C), two L2 caches, FSB, Memory Controller Hub, DMA engine, PCIe links]
- Raise interrupt
- FSB bottleneck
- Extra detour to RAM

49 CPU microarchitecture ~2013
[Diagram: four cores (C), LLC, MC, PCIe links]
- Raise interrupt
- Eventual eviction from LLC
- No extra detours to DRAM
- No FSB bottleneck
- LLC utilization? (thrashing?)

50 HW/SW Intersection
Should HW architecture evolution be considered a black box for networked systems development?
[Figure: memory read throughput over 6 NICs (Gbps) versus object size (KB) for Sandstorm "zerocopy" and Sandstorm "memcpy"; lower is better]

52 Generality of Specialization
Natural fit for:
- Web & DNS servers (Sandstorm, Namestorm: check our paper)
- In-memory key-value stores
- RPC-based services
- Rate-adaptive video streaming applications (with MPEG-DASH or Apple HLS)
Limitations:
- Possibly not a good fit for CPU- and/or filesystem-intensive applications
- Blocking in the application layer cannot be tolerated

55 Conclusions
General-purpose stacks:
- Great for bulk transfers, bad for short ones (but the web is dominated by small-sized objects!)
- Picked a lot of generality in favor of flexibility (we don't need it for application-specific clusters)
- Hard to tune/profile/debug
Specialized stacks:
- 2-10x throughput improvement for web, 9x for DNS
- Linear scaling on multicore systems
- Low CPU utilization
Specialized network stacks are not only viable, but necessary!

56 Backup Slides

57 Supported TCP features
Follows RFC 793, with Reno congestion control
Limitations:
- Supports only the TCP subset required to serve incoming connections (not to initiate them)
- TCP reordering not supported (not needed with typical HTTP requests)

58 Latency
[Figure: average latency (μs) versus number of concurrent connections for Sandstorm, Linux+nginx and FreeBSD+nginx, serving a 24 KB object]

59 Overview
Problems with general-purpose stacks:
- System-call overhead
- Shared accept queue, PCB locks
- Cache-unfriendly due to asynchronous design
- Memory-related overhead (e.g., mbuf allocation/copying)
Solutions with specialized stacks:
- Packet batching
- Share- and lock-free design, per-core state
- Process-to-completion, cache-friendly, incremental checksum
- Pre-packetization, no memory copying/buffering
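The overview lists incremental checksumming ("incr. cksum") among the solutions. Assuming this refers to updating a stored Internet checksum when a single 16-bit field changes, rather than re-summing the whole packet, the standard identity from RFC 1624 (HC' = ~(~HC + ~m + m')) can be sketched as follows. The function names are illustrative and are not taken from Sandstorm.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit one's-complement accumulator down to 16 bits. */
static uint16_t fold16(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffffu) + (sum >> 16);
    return (uint16_t)sum;
}

/* Full Internet checksum over n 16-bit words (reference implementation). */
static uint16_t cksum_full(const uint16_t *words, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += words[i];
    return (uint16_t)~fold16(sum);
}

/*
 * Incremental update per RFC 1624, equation 3:
 *   HC' = ~(~HC + ~m + m')
 * hc is the old checksum, old_word the field value it covered,
 * new_word the value replacing it.
 */
static uint16_t cksum_update16(uint16_t hc, uint16_t old_word,
                               uint16_t new_word)
{
    uint32_t sum = (uint16_t)~hc;
    sum += (uint16_t)~old_word;
    sum += new_word;
    return (uint16_t)~fold16(sum);
}
```

For pre-built packets this means a checksum computed once at load time can be patched per connection (for example when a sequence number or port field changes) at a constant cost, independent of payload size.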
