NetSlices: Scalable Mul/- Core Packet Processing in User- Space

Similar documents
Software Routers: NetMap

The Power of Batching in the Click Modular Router

RouteBricks: Exploi2ng Parallelism to Scale So9ware Routers

PacketShader: A GPU-Accelerated Software Router

GASPP: A GPU- Accelerated Stateful Packet Processing Framework

SoNIC: Precise Real1me So3ware Access and Control of Wired Networks. Ki Suh Lee, Han Wang, Hakim Weatherspoon Cornell University

Fast packet processing in the cloud. Dániel Géhberger Ericsson Research

소프트웨어기반고성능침입탐지시스템설계및구현

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1

Implemen'ng IPv6 Segment Rou'ng in the Linux Kernel

PacketShader as a Future Internet Platform

Comparing TCP performance of tunneled and non-tunneled traffic using OpenVPN. Berry Hoekstra Damir Musulin OS3 Supervisor: Jan Just Keijser Nikhef

ntop Users Group Meeting

Demystifying Network Cards

G-NET: Effective GPU Sharing In NFV Systems

Advanced Computer Networks. End Host Optimization

IsoStack Highly Efficient Network Processing on Dedicated Cores

Arrakis: The Operating System is the Control Plane

Research on DPDK Based High-Speed Network Traffic Analysis. Zihao Wang Network & Information Center Shanghai Jiao Tong University

VALE: a switched ethernet for virtual machines

Evolution of the netmap architecture

DPDK Summit China 2017

Data Center Traffic and Measurements: SoNIC

ASPERA HIGH-SPEED TRANSFER. Moving the world s data at maximum speed

Recent Advances in Software Router Technologies

SEDA An architecture for Well Condi6oned, scalable Internet Services

Speeding up Linux TCP/IP with a Fast Packet I/O Framework

Improving DPDK Performance

Enabling Fast, Dynamic Network Processing with ClickOS

WORKLOAD CHARACTERIZATION OF INTERACTIVE CLOUD SERVICES BIG AND SMALL SERVER PLATFORMS

MiAMI: Multi-Core Aware Processor Affinity for TCP/IP over Multiple Network Interfaces

Impact of Cache Coherence Protocols on the Processing of Network Traffic

vnetwork Future Direction Howie Xu, VMware R&D November 4, 2008

Open Source Traffic Analyzer

Design and Implementa/on of a Consolidated Middlebox Architecture. Vyas Sekar Sylvia Ratnasamy Michael Reiter Norbert Egi Guangyu Shi

Maelstrom: An Enterprise Continuity Protocol for Financial Datacenters

100 Gbps Open-Source Software Router? It's Here. Jim Thompson, CTO, Netgate

COMP Parallel Computing. SMM (1) Memory Hierarchies and Shared Memory

Postprint.

The Convergence of Storage and Server Virtualization Solarflare Communications, Inc.

Network Coding: Theory and Applica7ons

May 1, Foundation for Research and Technology - Hellas (FORTH) Institute of Computer Science (ICS) A Sleep-based Communication Mechanism to

Multiprocessors. Flynn Taxonomy. Classifying Multiprocessors. why would you want a multiprocessor? more is better? Cache Cache Cache.

Input / Output. Kevin Webb Swarthmore College April 12, 2018

Programmable NICs. Lecture 14, Computer Networks (198:552)

Memory Management Strategies for Data Serving with RDMA

FAQ. Release rc2

Learning with Purpose

Performance evaluation of Linux CAN-related system calls

Reducing CPU and network overhead for small I/O requests in network storage protocols over raw Ethernet

RouteBricks: Exploiting Parallelism To Scale Software Routers

10GE network tests with UDP. Janusz Szuba European XFEL

Improving Application Performance and Predictability using Multiple Virtual Lanes in Modern Multi-Core InfiniBand Clusters

Reducing Network Contention with Mixed Workloads on Modern Multicore Clusters

Can Parallel Replication Benefit Hadoop Distributed File System for High Performance Interconnects?

Using (Suricata over) PF_RING for NIC-Independent Acceleration

Parallel Algorithm Engineering

An Experimental Analysis on Iterative Block Ciphers and Their Effects on VoIP under Different Coding Schemes

Evaluating the Impact of RDMA on Storage I/O over InfiniBand

GPGPU introduction and network applications. PacketShaders, SSLShader

Networking Servers made for BSD and Linux systems

( ZIH ) Center for Information Services and High Performance Computing. Overvi ew over the x86 Processor Architecture

Disclaimer This presentation may contain product features that are currently under development. This overview of new technology represents no commitme

Module 11: I/O Systems

EXTENDING AN ASYNCHRONOUS MESSAGING LIBRARY USING AN RDMA-ENABLED INTERCONNECT. Konstantinos Alexopoulos ECE NTUA CSLab

Falcon: Scaling IO Performance in Multi-SSD Volumes. The George Washington University

Understanding and Improving the Cost of Scaling Distributed Event Processing

Agilio CX 2x40GbE with OVS-TC

MegaPipe: A New Programming Interface for Scalable Network I/O

ECSE 425 Lecture 25: Mul1- threading

Virtualization. Introduction. Why we interested? 11/28/15. Virtualiza5on provide an abstract environment to run applica5ons.

FPGAs and Networking

Parallelizing IPsec: switching SMP to On is not even half the way

The rcuda middleware and applications

GPU Cluster Computing. Advanced Computing Center for Research and Education

Advances of parallel computing. Kirill Bogachev May 2016

LiMIC: Support for High-Performance MPI Intra-Node Communication on Linux Cluster

PVPP: A Programmable Vector Packet Processor. Sean Choi, Xiang Long, Muhammad Shahbaz, Skip Booth, Andy Keep, John Marshall, Changhoon Kim

Data Center Virtualization: VirtualWire

打造 Linux 下的高性能网络 北京酷锐达信息技术有限公司技术总监史应生.

Placement de processus (MPI) sur architecture multi-cœur NUMA

Accelerating 4G Network Performance

Lecture 13 Input/Output (I/O) Systems (chapter 13)

Networking at the Speed of Light

Intel Workstation Technology

Supporting Fine-Grained Network Functions through Intel DPDK

Chapter 13: I/O Systems

Stash Directory: A Scalable Directory for Many- Core Coherence! Socrates Demetriades and Sangyeun Cho

double split driver model

High Performance Java Remote Method Invocation for Parallel Computing on Clusters

W H I T E P A P E R. Comparison of Storage Protocol Performance in VMware vsphere 4

High Performance Packet Processing with FlexNIC

Argo: Architecture- Aware Graph Par33oning

libvnf: building VNFs made easy

Adaptive MPI Multirail Tuning for Non-Uniform Input/Output Access

NUMA replicated pagecache for Linux

Chapter 13: I/O Systems

Caches (Writing) Hakim Weatherspoon CS 3410, Spring 2012 Computer Science Cornell University. P & H Chapter 5.2 3, 5.5

Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters

Basic OS Progamming Abstrac2ons

Transcription:

NetSlices: Scalable Mul/- Core Packet Processing in - Space Tudor Marian, Ki Suh Lee, Hakim Weatherspoon Cornell University Presented by Ki Suh Lee

Packet Processors Essen/al for evolving networks Sophis/cated func/onality Complex performance enhancement protocols

Packet Processors Essen/al for evolving networks Sophis/cated func/onality Complex performance enhancement protocols Challenges: High- performance and flexibility 10GE and beyond Tradeoffs

SoMware Packet Processors Low- level (kernel) vs. High- level (userspace) Parallelism in userspace: Four major difficul/es Overheads & Conten/on Kernel network stack Lack of control over hardware resources Portability

Overheads & Conten/on Cache coherence Memory Wall Slow cores vs. Fast NICs NIC CPU Memory Memory

Kernel network stack & HW control Raw socket: all traffic from all NICs to user- space Too general, hence complex network stack Hardware and somware are loosely coupled Applica/ons have no control over resources Applica/on Network Stack Raw socket Network Applica/on Stack Network Applica/on Network Stack Stack Applica/on Network Stack Network Applica/on Network Stack Applica/on Stack Applica/on Network Network Network Stack Stack Stack Applica/on Applica/on

Portability Hardware dependencies Kernel and device driver modifica/ons Zero- copy Kernel bypass

Outline Difficul/es in building packet processors NetSlice Evalua/on Discussions Conclusion

NetSlice Give power to the applica/on Overheads & Conten/on Lack of control over hardware resources Spa/al par//oning exploi/ng NUMA architecture Kernel network stack Streamlined path for packets Portability No zero- copy, kernel & device driver modifica/ons

NetSlice Spa/al Par//oning Independent (parallel) execu/on contexts Split each Network Interface Controller (NIC) One NIC queue per NIC per context Group and split the CPU cores Implicit resources (bus and memory bandwidth) Temporal par//oning (/me- sharing) Spa/al par//oning (exclusive- access)

NetSlice Spa/al Par//oning Example 2x quad core Intel Xeon X5570 (Nehalem) Two simultaneous hyperthreads OS sees 16 CPUs Non Uniform Memory Access (NUMA) QuickPath point- to- point interconnect Shared L3 cache

Streamlined Path for Packets Inefficient conven/onal network stack One network stack to rule them all Performs too many memory accesses Pollutes cache, context switches, synchroniza/on, system calls, blocking API Heavyweight Network Stack

Portability No zero- copy Tradeoffs between portability and performance NetSlices achieves both No hardware dependency A run- /me loadable kernel module

NetSlice API Expresses fine- grained hardware control Flexible: based on ioctl! Easy: open, read, write, close 1: #include "netslice.h" 2: 3: structnetslice_rw_mul/ { 4: int flags; 5: } rw_mul/; 6: 7: structnetslice_cpu_mask { 8: cpu_set_tk_peer, u_peer; 9: } mask; 10: 11: fd = open("/dev/netslice- 1", O_RDWR); 12: 13: rw_mul/.flags = MULTI_READ MULTI_WRITE; 14: ioctl(fd, NETSLICE_RW_MULTI_SET, &rw_mul/); 15: ioctl(fd, NETSLICE_CPUMASK_GET, &mask); 16: sched_setaffinity(getpid(), sizeof(cpu_set_t), 17: &mask.u_peer); 18 19: for (;;) { 20: ssize_tcnt, wcnt = 0; 21: if ((cnt = read(fd, iov, IOVS)) < 0) 22: EXIT_FAIL_MSG("read"); 23: 24: for (i = 0; i<cnt; i++) 25: /* iov_rlen marks bytes read */ 26: scan_pkg(iov[i].iov_base, iov[i].iov_rlen); 27: do { 28: size_twr_iovs; 29: /* write iov_rlen bytes */ 30: wr_iovs = write(fd, &iov[wcnt], cnt- wcnt); 31: if (wr_iovs< 0) 32: EXIT_FAIL_MSG("write"); 33: wcnt += wr_iovs; 34: } while (wcnt<cnt); 35: }

NetSlice Evalua/on Compare against state- of- the- art RouteBricks in- kernel, Click & pcap- mmap user- space Addi/onal baseline scenario All traffic through single NIC queue (receive- livelock) What is the basic forwarding performance? How efficient is the streamlining of one NetSlice? How is NetSlice scaling with the number of cores?

Experimental Setup R710 packet processors dual socket quad core 2.93GHz Xeon X5570 (Nehalem) 8MB of shared L3 cache and 12GB of RAM 6GB connected to each of the two CPU sockets Two Myri- 10G NICs R900 client end- hosts four socket 2.40GHz Xeon E7330 (Penryn) 6MB of L2 cache and 32GB of RAM

Simple Packet Rou/ng End- to- end throughput, MTU (1500 byte) packets Throughput (Mbps) 12000 10000 8000 6000 4000 2000 9.7 9.7 9.7 7.6 7.5 5.6 74% of kernel 2.3 2.3 best configura/on receive- livelock 1/11 of NetSlice 0 kernel RouteBricks NetSlice pcap pcap- mmap Click user- space

Linear Scaling with CPUs IPsec with 128 bit key typically used by VPN AES encryp/on in Cipher- block Chaining mode Throughput (Mbps) 10000 9000 8000 7000 6000 5000 4000 3000 2000 1000 0 RouteBricks NetSlice pcap pcap- mmap Click user- space 9.2 8.5 2 4 6 8 10 12 14 16 # of CPUs used

Outline Difficul/es in building packet processors Netslice Evalua/on Discussions Conclusion

SoMware Packet Processors Can support 10GE and more at line- speed Batching Hardware, device driver, cross- domain batching Hardware support Mul/- queue, mul/- core, NUMA, GPU Removing IRQ overhead Removing memory overhead Including zero- copy Bypassing kernel network stack

SoMware Packet Processors Batching Parallelism Zero- Copy Portability Domain Raw socket RouteBricks PacketShader PF_RING netmap Kernel- bypass NetSlice Kernel

SoMware Packet Processors Batching MulA- queue Zero- Copy Portability Domain Raw socket RouteBricks PacketShader PF_RING netmap Kernel- bypass NetSlice Kernel

SoMware Packet Processors Batching MulA- queue Zero- Copy Portability Domain Raw socket RouteBricks PacketShader PF_RING netmap Kernel- bypass NetSlice Kernel

SoMware Packet Processors Batching MulA- queue Zero- Copy Portability Domain Raw socket RouteBricks PacketShader PF_RING netmap Kernel- bypass NetSlice Kernel

SoMware Packet Processors Batching Parallelism Zero- Copy Portability Domain Raw socket RouteBricks PacketShader PF_RING netmap Kernel- bypass NetSlice Kernel

SoMware Packet Processors Batching Parallelism Zero- Copy Portability Domain Raw socket RouteBricks PacketShader PF_RING netmap Kernel- bypass NetSlice OpAmized for RX path only Kernel

SoMware Packet Processors Batching Parallelism Zero- Copy Portability Domain Raw socket RouteBricks PacketShader PF_RING netmap Kernel- bypass NetSlice Kernel

Discussions 40G and beyond DPI, FEC, DEDUP, Determinis/c RSS Small packets

Conclusion NetSlices: A new abstrac/on OS support to build packet processing applica/ons Harness implicit parallelism of modern hardware to scale Highly portable Webpage: h p://netslice.cs.cornell.edu