Network stack specialization for performance

1 Network stack specialization for performance. goo.gl/1la2u6. Ilias Marinos, Robert N.M. Watson, Mark Handley* (University of Cambridge, *University College London)

3 Motivation: Providers are scaling out rapidly. Key aspects: 1 machine : N functions, N machines : 1 function; performance is critical; scalability on multicore systems; cost and energy concerns. Are general-purpose stacks the right solution for that kind of role?

4 The Problem: Conventional stacks are great for bulk transfers, but what about short ones?

9 The Problem: [Figure: network throughput (Gbps) and CPU utilization (%) versus HTTP object size (KB). Annotations: NIC saturation with low CPU usage; throughput/CPU ratio is low.] Short-lived HTTP flows are a problem!

12 Why is this important? 95% of requested HTTP objects are ≤ 50 KB and 90% are ≤ 25 KB. Distribution based on traces from the Yahoo! CDN [Al-Fares et al. 2011].

13 Design Goals: Design a network stack that allows transparent flow of memory from the NIC to the application and vice versa; reduces system costs (e.g., batching, cache locality, lock- and sharing-free design, CPU affinity); and exploits application-specific knowledge to reduce repetitive processing costs (e.g., TCP segmentation of web objects, checksums).

14 Sandstorm: specialized webserver stack. Prototyped on top of FreeBSD's netmap framework: libnmio abstracts netmap-related I/O; libeth is a lightweight Ethernet layer; libtcpip is an optimized TCP/IP layer; the application is a simple HTTP server that serves static content. [Diagram labels: webserver with web_write(), tcpip_write(), web_recv(), tcpip_recv(); libtcpip.so with tcpip_output(), tcpip_fsm(), tcpip_input(); libeth.so with eth_output(), eth_input(); libnmio.so with netmap_output(), netmap_input(); zero copy; netmap ioctls and syscalls between user space and the kernel; device driver; DMA; buffer rings (TX, RX) memory-mapped to user space.]

15 Sandstorm: specialized webserver stack. Key decisions (some of them): the application and the stack are merged into the same process address space; static content is pre-segmented into network packets and loaded into DRAM a priori; received packet frames are processed in place on the RX rings, without memory copying or buffering; RX/TX packet batching greatly amortizes the system-call overhead; bufferless, synchronous model (no socket layer).
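The pre-segmentation decision on this slide can be sketched in a few lines of C. This is only an illustration, not Sandstorm's code: `presegment`, `struct segment`, and the 1448-byte payload size used in the note below are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical descriptor for one pre-built packet payload. */
struct segment {
    size_t offset;  /* byte offset of this payload within the object */
    size_t len;     /* payload length, at most mss bytes */
};

/*
 * Split an object of obj_len bytes into payloads of at most mss bytes.
 * Fills up to max_segs entries of out and returns the number of segments
 * the object needs (which may exceed max_segs).
 */
size_t
presegment(size_t obj_len, size_t mss, struct segment *out, size_t max_segs)
{
    size_t nsegs = 0;

    assert(mss > 0);
    for (size_t off = 0; off < obj_len; off += mss) {
        if (nsegs < max_segs) {
            size_t remaining = obj_len - off;

            out[nsegs].offset = off;
            out[nsegs].len = remaining < mss ? remaining : mss;
        }
        nsegs++;
    }
    return nsegs;
}
```

With these assumed numbers, a 24 KB object and a 1448-byte payload give 17 segments, the last one 1408 bytes long. Doing this split once at load time is what lets the per-request path skip TCP segmentation.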

30 Sandstorm Architecture (10,000 ft view): [Diagram, built up over slides 16-30. User space: app (websrv_accept(), websrv_receive()), tcpip (tcpip_input(), TCP FSM, tcpip_output()), eth (ether_input(), ether_output()), nmio (netmap_input(), netmap_output()), and the content. Kernel: NIC driver with rings ix0:rx and ix0:tx. Receive path: POLLIN on ix0:rx, netmap_input(), ether_input(), tcpip_input(), TCP FSM, websrv_accept()/websrv_receive(). Transmit path: tcpip_output(), ether_output(), netmap_output(), ix0:tx, POLLOUT.]
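The receive-to-transmit flow in this diagram can be mimicked with a toy model. The sketch below is an assumption-laden illustration: `struct toy_ring`, `ring_put`, `ring_get`, and `process_batch` are hypothetical names, the ring is not the real netmap ring, and adding 1000 to a packet id merely stands in for the whole input, FSM, application, and output chain.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SLOTS 8

/* Simplified stand-in for a packet ring; not the real netmap ring layout. */
struct toy_ring {
    int    pkt[RING_SLOTS];  /* packet ids stand in for frame buffers */
    size_t head;             /* count of slots consumed so far */
    size_t tail;             /* count of slots filled so far */
};

/* Append one packet; returns -1 if the ring is full. */
int
ring_put(struct toy_ring *r, int pkt)
{
    if (r->tail - r->head == RING_SLOTS)
        return -1;
    r->pkt[r->tail % RING_SLOTS] = pkt;
    r->tail++;
    return 0;
}

/* Remove the oldest packet; returns -1 if the ring is empty. */
int
ring_get(struct toy_ring *r, int *pkt)
{
    if (r->head == r->tail)
        return -1;
    *pkt = r->pkt[r->head % RING_SLOTS];
    r->head++;
    return 0;
}

/*
 * One wakeup: drain the RX ring and run each frame to completion by
 * queueing its reply on the TX ring. Returns the number of frames handled.
 */
size_t
process_batch(struct toy_ring *rx, struct toy_ring *tx)
{
    size_t handled = 0;
    int pkt;

    assert(rx != NULL && tx != NULL);
    while (tx->tail - tx->head < RING_SLOTS && ring_get(rx, &pkt) == 0) {
        /* Stand-in for input processing, the TCP FSM, the app, and output. */
        ring_put(tx, pkt + 1000);
        handled++;
    }
    return handled;
}
```

The point of the model is that one wakeup handles every frame that is ready, so the cost of the wakeup is shared by the whole batch rather than paid per packet.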

33 Evaluation: [Figure: throughput with 6 NICs (Gbps) versus HTTP object size (KB) for nginx+FreeBSD, nginx+Linux, and Sandstorm. Annotated Sandstorm speedups: ~9.8x, ~3.6x, and ~1.8x.] Throughputs start converging at object sizes of 256 KB.

34 To copy or not to copy? (zerocopy versus memcpy)

    /* Get source and destination slots */
    struct netmap_slot *bf = &ppool->slot[slotindex];
    struct netmap_slot *tx = &txring->slot[cur];

    /* zero-copy packet */
    tx->buf_idx = bf->buf_idx;
    tx->len = bf->len;
    tx->flags = NS_BUF_CHANGED;

OR

    /* Get source and destination bufs */
    char *srcp = NETMAP_BUF(ppool, bf->buf_idx);
    char *dstp = NETMAP_BUF(txring, tx->buf_idx);

    /* memcpy packet */
    memcpy(dstp, srcp, bf->len);
    tx->len = bf->len;
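The slide's snippet depends on netmap headers and live rings, so here is a self-contained toy version of the same two variants. Everything in it is a stand-in: `struct toy_slot`, `TOY_BUF_CHANGED`, `toy_buf`, `send_zerocopy`, and `send_memcpy` are hypothetical names, and the buffer pool is an ordinary array rather than netmap's shared memory.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BUF_SIZE 2048
#define NBUFS    4
#define TOY_BUF_CHANGED 0x1

/* Toy stand-in for a ring slot: it names a buffer instead of holding one. */
struct toy_slot {
    uint32_t buf_idx;
    uint16_t len;
    uint16_t flags;
};

char toy_pool[NBUFS][BUF_SIZE];  /* shared buffer pool */

/* Return the buffer that a given index refers to. */
char *
toy_buf(uint32_t idx)
{
    assert(idx < NBUFS);
    return toy_pool[idx];
}

/* Zero-copy: point the TX slot at the content buffer; no payload bytes move. */
void
send_zerocopy(struct toy_slot *tx, const struct toy_slot *src)
{
    tx->buf_idx = src->buf_idx;
    tx->len = src->len;
    tx->flags |= TOY_BUF_CHANGED;
}

/* Copy: keep the TX slot's own buffer and copy the payload into it. */
void
send_memcpy(struct toy_slot *tx, const struct toy_slot *src)
{
    assert(src->len <= BUF_SIZE);
    assert(tx->buf_idx != src->buf_idx);  /* memcpy must not overlap */
    memcpy(toy_buf(tx->buf_idx), toy_buf(src->buf_idx), src->len);
    tx->len = src->len;
}
```

The zero-copy path touches only the slot metadata, while the memcpy path reads and writes every payload byte; that difference is what the following slides measure.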

36 To copy or not to copy? [Figure: throughput (Gbps) of Sandstorm zerocopy versus Sandstorm memcpy on an Intel Core 2 (2006), serving a 24 KB HTTP object.]

37 To copy or not to copy? [Figure: the same comparison on an Intel Sandy Bridge (2013), serving a 24 KB HTTP object.]

42 CPU microarchitecture ~2006: [Diagram: four cores (C), two L2 caches, FSB, Memory Controller Hub, DMA engine, PCIe links. Annotations: raise interrupt; FSB bottleneck; extra detour to RAM.]

49 CPU microarchitecture ~2013: [Diagram: four cores (C), shared LLC, memory controller (MC), PCIe links. Annotations: raise interrupt; eventual eviction from LLC; no extra detours to DRAM; no FSB bottleneck.] Open question: LLC utilization (thrashing?)

50 HW/SW Intersection: Should HW architecture evolution be considered a black box for networked systems development? [Figure: memory read throughput with 6 NICs (Gbps) versus object size (KB) for Sandstorm "zerocopy" and Sandstorm "memcpy"; lower is better.]

52 Generality of Specialization: Natural fit for: Web and DNS servers (Sandstorm, Namestorm; check our paper); in-memory key-value stores; RPC-based services; rate-adaptive video streaming applications (with MPEG-DASH or Apple HLS). Limitations: possibly not a good fit for CPU- and/or filesystem-intensive applications; blocking in the application layer cannot be tolerated.

55 Conclusions: General-purpose stacks: great for bulk transfers, bad for short ones (but the web is dominated by small objects!); picked a lot of generality in favor of flexibility (we don't need it for application-specific clusters); hard to tune, profile, and debug. Specialized stacks: 2-10x throughput improvement for web, 9x for DNS; linear scaling on multicore systems; low CPU utilization. Specialized network stacks are not only viable, but necessary!

56 Backup Slides

57 Supported TCP features: Follows RFC 793, with Reno congestion control. Limitations: supports only the TCP subset required to serve incoming connections (not to initiate them); TCP reordering is not supported (not needed with typical HTTP requests).
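The slide names Reno congestion control without showing how the stack implements it. As a reminder of what Reno does, here is a textbook sketch along the lines of RFC 5681; it is not taken from Sandstorm, and `struct reno` and the three function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-connection congestion state, all values in bytes. */
struct reno {
    uint32_t cwnd;      /* congestion window */
    uint32_t ssthresh;  /* slow-start threshold */
    uint32_t mss;       /* sender maximum segment size */
};

/* Halve the flight size, but never go below two segments. */
static uint32_t
reno_new_ssthresh(const struct reno *c, uint32_t flight_size)
{
    uint32_t half = flight_size / 2;
    uint32_t min_thresh = 2 * c->mss;

    return half > min_thresh ? half : min_thresh;
}

/* New ACK outside loss recovery: slow start or congestion avoidance. */
void
reno_on_ack(struct reno *c)
{
    assert(c->mss > 0 && c->cwnd > 0);
    if (c->cwnd < c->ssthresh) {
        c->cwnd += c->mss;  /* slow start: one segment per ACK */
    } else {
        /* Congestion avoidance: roughly one segment per round trip. */
        uint32_t inc = (uint32_t)(((uint64_t)c->mss * c->mss) / c->cwnd);

        c->cwnd += inc > 0 ? inc : 1;
    }
}

/* Third duplicate ACK: enter fast retransmit / fast recovery. */
void
reno_on_triple_dupack(struct reno *c, uint32_t flight_size)
{
    c->ssthresh = reno_new_ssthresh(c, flight_size);
    c->cwnd = c->ssthresh + 3 * c->mss;
}

/* Retransmission timeout: collapse the window to one segment. */
void
reno_on_timeout(struct reno *c, uint32_t flight_size)
{
    c->ssthresh = reno_new_ssthresh(c, flight_size);
    c->cwnd = c->mss;
}
```

For short HTTP flows most of a transfer happens in slow start, so the first branch of `reno_on_ack` is the one that matters most for the workloads discussed in this talk.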

58 Latency: [Figure: average latency (μs) versus number of concurrent connections for Sandstorm, Linux+nginx, and FreeBSD+nginx, serving a 24 KB object.]

59 Overview: Problems with general-purpose stacks: system-call overhead; shared accept queue and PCB locks; cache-unfriendly due to asynchronous design; memory-related overhead (e.g., mbuf allocation and copying). Solutions with specialized stacks: packet batching; share- and lock-free design with per-core state; process-to-completion, cache-friendly, incremental checksums; pre-packetization, no memory copying or buffering.
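The "incremental checksums" item can be illustrated with the standard update rule from RFC 1624. How Sandstorm applies it is not shown on the slide; the sketch below only assumes that a checksum computed ahead of time is patched when a single 16-bit field changes. `cksum_full` and `cksum_update` are hypothetical names, and words are taken in host byte order for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator into a 16-bit one's-complement sum. */
static uint16_t
cksum_fold(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Full Internet checksum over nwords 16-bit words. */
uint16_t
cksum_full(const uint16_t *words, size_t nwords)
{
    uint32_t sum = 0;

    assert(nwords < 65536);  /* keeps the 32-bit accumulator from overflowing */
    for (size_t i = 0; i < nwords; i++)
        sum += words[i];
    return (uint16_t)~cksum_fold(sum);
}

/*
 * Incremental update (RFC 1624, Eqn. 3): HC' = ~(~HC + ~m + m'),
 * where m is the old value of the changed word and m' is the new one.
 */
uint16_t
cksum_update(uint16_t old_cksum, uint16_t old_word, uint16_t new_word)
{
    uint32_t sum = (uint16_t)~old_cksum;

    sum += (uint16_t)~old_word;
    sum += new_word;
    return (uint16_t)~cksum_fold(sum);
}
```

The update costs three additions no matter how long the packet is, which is why it pairs naturally with pre-packetized content.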


More information

Measurement-based Analysis of TCP/IP Processing Requirements

Measurement-based Analysis of TCP/IP Processing Requirements Measurement-based Analysis of TCP/IP Processing Requirements Srihari Makineni Ravi Iyer Communications Technology Lab Intel Corporation {srihari.makineni, ravishankar.iyer}@intel.com Abstract With the

More information

Data Center Traffic and Measurements: SoNIC

Data Center Traffic and Measurements: SoNIC Center Traffic and Measurements: SoNIC Hakim Weatherspoon Assistant Professor, Dept of Computer Science CS 5413: High Performance Systems and ing November 12, 2014 Slides from USENIX symposium on ed Systems

More information

CSE 120 Principles of Operating Systems

CSE 120 Principles of Operating Systems CSE 120 Principles of Operating Systems Spring 2018 Lecture 15: Multicore Geoffrey M. Voelker Multicore Operating Systems We have generally discussed operating systems concepts independent of the number

More information

Optimizing Performance: Intel Network Adapters User Guide

Optimizing Performance: Intel Network Adapters User Guide Optimizing Performance: Intel Network Adapters User Guide Network Optimization Types When optimizing network adapter parameters (NIC), the user typically considers one of the following three conditions

More information

Question Score 1 / 19 2 / 19 3 / 16 4 / 29 5 / 17 Total / 100

Question Score 1 / 19 2 / 19 3 / 16 4 / 29 5 / 17 Total / 100 NAME: Login name: Computer Science 461 Midterm Exam March 10, 2010 3:00-4:20pm This test has five (5) questions. Put your name on every page, and write out and sign the Honor Code pledge before turning

More information

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Data Center Software Architecture: Topic 1: Distributed File Systems GFS (The Google File System) 1 Filesystems

More information

Tales of the Tail Hardware, OS, and Application-level Sources of Tail Latency

Tales of the Tail Hardware, OS, and Application-level Sources of Tail Latency Tales of the Tail Hardware, OS, and Application-level Sources of Tail Latency Jialin Li, Naveen Kr. Sharma, Dan R. K. Ports and Steven D. Gribble February 2, 2015 1 Introduction What is Tail Latency? What

More information

Enabling Fast, Dynamic Network Processing with ClickOS

Enabling Fast, Dynamic Network Processing with ClickOS Enabling Fast, Dynamic Network Processing with ClickOS Joao Martins*, Mohamed Ahmed*, Costin Raiciu, Roberto Bifulco*, Vladimir Olteanu, Michio Honda*, Felipe Huici* * NEC Labs Europe, Heidelberg, Germany

More information

RouteBricks: Exploiting Parallelism To Scale Software Routers

RouteBricks: Exploiting Parallelism To Scale Software Routers outebricks: Exploiting Parallelism To Scale Software outers Mihai Dobrescu & Norbert Egi, Katerina Argyraki, Byung-Gon Chun, Kevin Fall, Gianluca Iannaccone, Allan Knies, Maziar Manesh, Sylvia atnasamy

More information

On the cost of tunnel endpoint processing in overlay virtual networks

On the cost of tunnel endpoint processing in overlay virtual networks J. Weerasinghe; NVSDN2014, London; 8 th December 2014 On the cost of tunnel endpoint processing in overlay virtual networks J. Weerasinghe & F. Abel IBM Research Zurich Laboratory Outline Motivation Overlay

More information

CSCI-1680 Link Layer Wrap-Up Rodrigo Fonseca

CSCI-1680 Link Layer Wrap-Up Rodrigo Fonseca CSCI-1680 Link Layer Wrap-Up Rodrigo Fonseca Based partly on lecture notes by David Mazières, Phil Levis, John Jannotti Administrivia Homework I out later today, due next Thursday Today: Link Layer (cont.)

More information

Using (Suricata over) PF_RING for NIC-Independent Acceleration

Using (Suricata over) PF_RING for NIC-Independent Acceleration Using (Suricata over) PF_RING for NIC-Independent Acceleration Luca Deri Alfredo Cardigliano Outlook About ntop. Introduction to PF_RING. Integrating PF_RING with

More information

DPDK Tunneling Offload RONY EFRAIM & YONGSEOK KOH MELLANOX

DPDK Tunneling Offload RONY EFRAIM & YONGSEOK KOH MELLANOX x DPDK Tunneling Offload RONY EFRAIM & YONGSEOK KOH MELLANOX Rony Efraim Introduction to DC w/ overlay network Modern data center (DC) use overly network like Virtual Extensible LAN (VXLAN) and GENEVE

More information

Developing deterministic networking technology for railway applications using TTEthernet software-based end systems

Developing deterministic networking technology for railway applications using TTEthernet software-based end systems Developing deterministic networking technology for railway applications using TTEthernet software-based end systems Project n 100021 Astrit Ademaj, TTTech Computertechnik AG Outline GENESYS requirements

More information

MiAMI: Multi-Core Aware Processor Affinity for TCP/IP over Multiple Network Interfaces

MiAMI: Multi-Core Aware Processor Affinity for TCP/IP over Multiple Network Interfaces MiAMI: Multi-Core Aware Processor Affinity for TCP/IP over Multiple Network Interfaces Hye-Churn Jang Hyun-Wook (Jin) Jin Department of Computer Science and Engineering Konkuk University Seoul, Korea {comfact,

More information

Networks and distributed computing

Networks and distributed computing Networks and distributed computing Hardware reality lots of different manufacturers of NICs network card has a fixed MAC address, e.g. 00:01:03:1C:8A:2E send packet to MAC address (max size 1500 bytes)

More information

An Intelligent NIC Design Xin Song

An Intelligent NIC Design Xin Song 2nd International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2016) An Intelligent NIC Design Xin Song School of Electronic and Information Engineering Tianjin Vocational

More information

jverbs: Java/OFED Integration for the Cloud

jverbs: Java/OFED Integration for the Cloud jverbs: Java/OFED Integration for the Cloud Authors: Bernard Metzler, Patrick Stuedi, Animesh Trivedi. IBM Research Zurich Date: 03/27/12 www.openfabrics.org 1 Motivation The commodity Cloud is Flexible

More information

Receive Livelock. Robert Grimm New York University

Receive Livelock. Robert Grimm New York University Receive Livelock Robert Grimm New York University The Three Questions What is the problem? What is new or different? What are the contributions and limitations? Motivation Interrupts work well when I/O

More information

Light & NOS. Dan Li Tsinghua University

Light & NOS. Dan Li Tsinghua University Light & NOS Dan Li Tsinghua University Performance gain The Power of DPDK As claimed: 80 CPU cycles per packet Significant gain compared with Kernel! What we care more How to leverage the performance gain

More information

Accelerating Load Balancing programs using HW- Based Hints in XDP

Accelerating Load Balancing programs using HW- Based Hints in XDP Accelerating Load Balancing programs using HW- Based Hints in XDP PJ Waskiewicz, Network Software Engineer Neerav Parikh, Software Architect Intel Corp. Agenda Overview express Data path (XDP) Software

More information

Slides on cross- domain call and Remote Procedure Call (RPC)

Slides on cross- domain call and Remote Procedure Call (RPC) Slides on cross- domain call and Remote Procedure Call (RPC) This classic paper is a good example of a microbenchmarking study. It also explains the RPC abstraction and serves as a case study of the nuts-and-bolts

More information

A Look at Intel s Dataplane Development Kit

A Look at Intel s Dataplane Development Kit A Look at Intel s Dataplane Development Kit Dominik Scholz Chair for Network Architectures and Services Department for Computer Science Technische Universität München June 13, 2014 Dominik Scholz: A Look

More information

Be Fast, Cheap and in Control with SwitchKV. Xiaozhou Li

Be Fast, Cheap and in Control with SwitchKV. Xiaozhou Li Be Fast, Cheap and in Control with SwitchKV Xiaozhou Li Goal: fast and cost-efficient key-value store Store, retrieve, manage key-value objects Get(key)/Put(key,value)/Delete(key) Target: cluster-level

More information

LegUp: Accelerating Memcached on Cloud FPGAs

LegUp: Accelerating Memcached on Cloud FPGAs 0 LegUp: Accelerating Memcached on Cloud FPGAs Xilinx Developer Forum December 10, 2018 Andrew Canis & Ruolong Lian LegUp Computing Inc. 1 COMPUTE IS BECOMING SPECIALIZED 1 GPU Nvidia graphics cards are

More information

Introduction to Ethernet Latency

Introduction to Ethernet Latency Introduction to Ethernet Latency An Explanation of Latency and Latency Measurement The primary difference in the various methods of latency measurement is the point in the software stack at which the latency

More information

Communication Networks

Communication Networks Communication Networks Spring 2018 Laurent Vanbever nsg.ee.ethz.ch ETH Zürich (D-ITET) April 30 2018 Materials inspired from Scott Shenker & Jennifer Rexford Last week on Communication Networks We started

More information

Memory Management Strategies for Data Serving with RDMA

Memory Management Strategies for Data Serving with RDMA Memory Management Strategies for Data Serving with RDMA Dennis Dalessandro and Pete Wyckoff (presenting) Ohio Supercomputer Center {dennis,pw}@osc.edu HotI'07 23 August 2007 Motivation Increasing demands

More information

CSCI-1680 Link Layer Wrap-Up Rodrigo Fonseca

CSCI-1680 Link Layer Wrap-Up Rodrigo Fonseca CSCI-1680 Link Layer Wrap-Up Rodrigo Fonseca Based partly on lecture notes by David Mazières, Phil Levis, John Jannotti Today: Link Layer (cont.) Framing Reliability Error correction Sliding window Medium

More information

Networking at the Speed of Light

Networking at the Speed of Light Networking at the Speed of Light Dror Goldenberg VP Software Architecture MaRS Workshop April 2017 Cloud The Software Defined Data Center Resource virtualization Efficient services VM, Containers uservices

More information

Scaling Internet TV Content Delivery ALEX GUTARIN DIRECTOR OF ENGINEERING, NETFLIX

Scaling Internet TV Content Delivery ALEX GUTARIN DIRECTOR OF ENGINEERING, NETFLIX Scaling Internet TV Content Delivery ALEX GUTARIN DIRECTOR OF ENGINEERING, NETFLIX Inventing Internet TV Available in more than 190 countries 104+ million subscribers Lots of Streaming == Lots of Traffic

More information

The Network Stack. Chapter Network stack functions 216 CHAPTER 21. THE NETWORK STACK

The Network Stack. Chapter Network stack functions 216 CHAPTER 21. THE NETWORK STACK 216 CHAPTER 21. THE NETWORK STACK 21.1 Network stack functions Chapter 21 The Network Stack In comparison with some other parts of OS design, networking has very little (if any) basis in formalism or algorithms

More information

Distributed Systems 27. Process Migration & Allocation

Distributed Systems 27. Process Migration & Allocation Distributed Systems 27. Process Migration & Allocation Paul Krzyzanowski pxk@cs.rutgers.edu 12/16/2011 1 Processor allocation Easy with multiprocessor systems Every processor has access to the same memory

More information

Capriccio : Scalable Threads for Internet Services

Capriccio : Scalable Threads for Internet Services Capriccio : Scalable Threads for Internet Services - Ron von Behren &et al - University of California, Berkeley. Presented By: Rajesh Subbiah Background Each incoming request is dispatched to a separate

More information