IsoStack: Highly Efficient Network Processing on Dedicated Cores


Transcription:

IsoStack: Highly Efficient Network Processing on Dedicated Cores
Leah Shalev, Eran Borovik, Julian Satran, Muli Ben-Yehuda

Outline
- Motivation
- IsoStack architecture
- Prototype: TCP/IP over 10GE on a single core
- Performance results
- Summary

TCP/IP End System Performance Challenge
The TCP/IP stack is a major consumer of CPU cycles:
- Easy benchmark workloads can consume 80% of the CPU.
- Difficult workloads cause throughput degradation at 100% CPU.
The TCP/IP stack wastes CPU cycles: hundreds of "useful" instructions per packet end up costing tens of thousands of CPU cycles.

Long History of TCP/IP Optimizations
- Decrease per-byte overhead: checksum calculation offload.
- Decrease the number of interrupts: interrupt mitigation (coalescing).
- Decrease the number of packets (for bulk transfers): jumbo frames, Large Send Offload (TCP Segmentation Offload), Large Receive Offload.

History of TCP Optimizations (cont.)
Full offload: instead of incremental optimizations, offload the whole stack to hardware.
- TOE (TCP Offload Engine): expensive, not flexible, not robust (dependency on the device vendor), and not supported by some operating systems on principle.
- RDMA: requires support on the remote side; not applicable to legacy upper-layer protocols.
- TCP onload: offload to a dedicated main processor, i.e., use a multiprocessor system asymmetrically.

TCP/IP Parallelization
- Naïve initial transition to multiprocessor systems: one lock to protect it all.
- Incremental attempts to improve parallelism: use more locks to decrease contention; use kernel threads to perform processing in parallel.
- Hardware support to parallelize incoming packet processing: Receive-Side Scaling (RSS). The steering idea is sketched after the figure below.

Parallelizing the TCP/IP Stack Using RSS
[Diagram with two panels, "Theory (customized system)" and "Practice": application threads t1-t3 on CPU 1 and CPU 2, connections Conn A-D in the TCP/IP stack, NIC receive queues Rx 1 and Rx 2 plus Tx. In theory each connection's packets arrive on the CPU that runs its thread; in practice the placement of threads and connections does not line up with the NIC's queue assignment.]
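For concreteness, the sketch below shows the steering idea behind RSS-style hardware classification: hash the connection tuple and use an indirection table to pick a receive queue, and therefore the CPU that will process the packet. This is only an illustrative approximation; real NICs use a keyed Toeplitz hash, and the names here (flow_hash, rss_select_queue) are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

struct flow_tuple {
    uint32_t saddr, daddr;
    uint16_t sport, dport;
};

/* Illustrative placeholder hash (FNV-1a); real NICs use a keyed Toeplitz hash. */
static uint32_t flow_hash(const struct flow_tuple *t)
{
    const uint8_t *p = (const uint8_t *)t;
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < sizeof(*t); i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/* The indirection table maps hash buckets to receive queues; each receive
 * queue is serviced by a particular CPU, which is what spreads the load. */
static unsigned rss_select_queue(const struct flow_tuple *t,
                                 const uint8_t *indirection_table,
                                 size_t table_size)
{
    return indirection_table[flow_hash(t) % table_size];
}
```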

So, Where Do the Cycles Go?
- No clear hot-spots, except the lock/unlock functions.
- The CPU is misused by the network stack:
  - Interrupts, context switches, and cache pollution due to CPU sharing between applications and the stack.
  - IPIs, locking, and cache-line bouncing due to stack control state shared by different CPUs.

Our Approach: Isolate the Stack
- Dedicate CPUs to the network stack.
- Use a light-weight internal interconnect, scaling to many applications and high request rates.
- Make it transparent to applications: not just API-compatible; hide the latency of the interaction.

IsoStack Architecture
Split socket layer (a sketch of the front-end side follows below):
- Front-end in the application: maintains socket buffers and posts socket commands onto a command queue.
- Back-end in the IsoStack: runs on dedicated core[s] with connection affinity, polls for commands, and executes the socket operations asynchronously, with zero-copy.
[Diagram: applications on App CPU #1 and App CPU #2 each use a socket front-end with a shared-memory queue client; an internal interconnect connects them to the IsoStack CPU, which runs TCP/IP, the socket back-end, and the shared-memory queue server.]
Shared-memory queues for socket delegation:
- Asynchronous messaging
- Flow control and aggregation
- Data copy performed by the socket front-end
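As a rough illustration of the split socket layer, here is a minimal sketch of what the front-end side of a send could look like: the application's data is copied once into the shared socket buffer and a command descriptor is posted to the per-socket shared-memory queue, after which the call returns without waiting for the back-end. All names (sock_fe, socket_cmd, cmd_queue_enqueue, sockbuf_reserve) are hypothetical stand-ins, not the paper's actual API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>

enum cmd_op { CMD_SEND, CMD_RECV, CMD_CLOSE };

struct socket_cmd {
    enum cmd_op op;
    uint32_t    buf_offset;   /* where the data sits in the shared socket buffer */
    uint32_t    len;
};

struct sock_fe {
    char             *sockbuf;   /* socket buffer shared with the IsoStack core */
    struct cmd_queue *cmdq;      /* this socket's shared-memory command queue */
};

/* Provided elsewhere; a lock-free queue of this kind is sketched further below. */
bool     cmd_queue_enqueue(struct cmd_queue *q, const struct socket_cmd *cmd);
uint32_t sockbuf_reserve(struct sock_fe *s, size_t len);   /* hypothetical helper */

ssize_t fe_send(struct sock_fe *s, const void *data, size_t len)
{
    /* The only data copy happens here, in the socket front-end. */
    uint32_t off = sockbuf_reserve(s, len);
    memcpy(s->sockbuf + off, data, len);

    struct socket_cmd cmd = { .op = CMD_SEND, .buf_offset = off, .len = (uint32_t)len };
    if (!cmd_queue_enqueue(s->cmdq, &cmd))
        return -1;               /* queue full: the caller could aggregate or retry */

    /* Return without waiting; the back-end executes the send asynchronously. */
    return (ssize_t)len;
}
```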

IsoStack Shared Memory Command Queues
A low-overhead multiple-producers-single-consumer mechanism with non-trusted producers.
Design principles:
- Lock-free, cache-aware.
- Bypass the kernel whenever possible (problematic with the existing hardware support).
- Interrupt mitigation.
Design choices, between two extremes:
- A single command queue: con - high contention on access.
- Per-thread command queues: con - a high number of queues to be polled by the server.
- Our choice: per-socket command queues with aggregation of tx and rx data, plus per-CPU notification queues (these require kernel involvement to protect access).
One possible shape of such a queue is sketched below.
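As one possible shape of such a mechanism, the sketch below shows a bounded, lock-free multiple-producer/single-consumer ring in C11 atomics (a variant of Dmitri Vyukov's bounded queue, simplified because the consumer is the single IsoStack core). The real IsoStack queues also aggregate tx/rx data, are laid out to avoid false sharing, and must cope with non-trusted producers; none of that is shown here.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Command descriptor (same layout as in the front-end sketch above). */
enum cmd_op { CMD_SEND, CMD_RECV, CMD_CLOSE };
struct socket_cmd { enum cmd_op op; uint32_t buf_offset; uint32_t len; };

struct cmd_cell {
    _Atomic size_t    seq;    /* whose "turn" this cell is (Vyukov-style sequencing) */
    struct socket_cmd cmd;
};

struct cmd_queue {
    struct cmd_cell *cells;        /* capacity must be a power of two */
    size_t           mask;         /* capacity - 1 */
    _Atomic size_t   enqueue_pos;  /* shared by producers (application threads) */
    _Atomic size_t   dequeue_pos;  /* owned by the single consumer (IsoStack core) */
};

void cmd_queue_init(struct cmd_queue *q, struct cmd_cell *cells, size_t capacity)
{
    q->cells = cells;
    q->mask  = capacity - 1;
    for (size_t i = 0; i < capacity; i++)
        atomic_init(&cells[i].seq, i);
    atomic_init(&q->enqueue_pos, 0);
    atomic_init(&q->dequeue_pos, 0);
}

/* Producers: application threads posting socket commands. */
bool cmd_queue_enqueue(struct cmd_queue *q, const struct socket_cmd *cmd)
{
    size_t pos = atomic_load_explicit(&q->enqueue_pos, memory_order_relaxed);
    for (;;) {
        struct cmd_cell *cell = &q->cells[pos & q->mask];
        size_t seq = atomic_load_explicit(&cell->seq, memory_order_acquire);
        intptr_t diff = (intptr_t)seq - (intptr_t)pos;
        if (diff == 0) {
            /* Cell is free for this position: try to claim it. */
            if (atomic_compare_exchange_weak_explicit(&q->enqueue_pos, &pos, pos + 1,
                                                      memory_order_relaxed,
                                                      memory_order_relaxed)) {
                cell->cmd = *cmd;
                atomic_store_explicit(&cell->seq, pos + 1, memory_order_release);
                return true;
            }
        } else if (diff < 0) {
            return false;          /* queue full: previous lap not consumed yet */
        } else {
            pos = atomic_load_explicit(&q->enqueue_pos, memory_order_relaxed);
        }
    }
}

/* Single consumer: the IsoStack dispatcher polling this socket's queue. */
bool cmd_queue_dequeue(struct cmd_queue *q, struct socket_cmd *out)
{
    size_t pos = atomic_load_explicit(&q->dequeue_pos, memory_order_relaxed);
    struct cmd_cell *cell = &q->cells[pos & q->mask];
    size_t seq = atomic_load_explicit(&cell->seq, memory_order_acquire);
    if (seq != pos + 1)
        return false;              /* empty, or a producer has not finished writing */
    *out = cell->cmd;
    atomic_store_explicit(&cell->seq, pos + q->mask + 1, memory_order_release);
    atomic_store_explicit(&q->dequeue_pos, pos + 1, memory_order_relaxed);
    return true;
}
```

Producers claim a slot with a compare-and-swap on enqueue_pos and publish it by bumping the slot's sequence number; the consumer never takes a lock and detects emptiness purely from the sequence number, which keeps the common path down to a handful of cache-line accesses.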

IsoStack Prototype Implementation
- Power6 (4 x 2 cores), AIX 6.1, 10Gb/s HEA.
- Same codebase for the IsoStack and the legacy stack.
- The IsoStack runs as a single kernel-thread dispatcher (sketched below): it polls the adapter rx queue, polls the socket back-end queues, and invokes the regular TCP/IP processing.
- The network stack is [partially] optimized for serialized execution: some locks eliminated; some control data structures replicated to avoid sharing.
- Other OS services are avoided when possible (e.g., wakeup calls), just to work around HW and OS support limitations.
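A minimal sketch of that dispatcher loop follows, reusing the command layout and the queue's consumer side from the earlier sketches; the adapter and stack helpers (hea_poll_rx, tcp_input, socket_backend_execute, isostack_run_timers) are hypothetical stand-ins for the driver and the reused AIX TCP/IP code paths.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same command layout as in the front-end sketch; cmd_queue_dequeue() is the
 * consumer side of the queue sketched earlier. */
enum cmd_op { CMD_SEND, CMD_RECV, CMD_CLOSE };
struct socket_cmd { enum cmd_op op; uint32_t buf_offset; uint32_t len; };
struct cmd_queue;
bool cmd_queue_dequeue(struct cmd_queue *q, struct socket_cmd *out);

/* Hypothetical stand-ins for the adapter driver and the reused stack code. */
struct isostack;
struct packet;
struct socket_backend {
    struct socket_backend *next;
    struct cmd_queue      *cmdq;   /* this socket's shared-memory command queue */
};
struct packet *hea_poll_rx(struct isostack *is);             /* adapter rx queue */
void tcp_input(struct isostack *is, struct packet *pkt);     /* regular TCP/IP receive path */
void socket_backend_execute(struct isostack *is, struct socket_backend *sb,
                            const struct socket_cmd *cmd);   /* send/recv/close/... */
struct socket_backend *isostack_sockets(struct isostack *is);
void isostack_run_timers(struct isostack *is);

/* Single kernel-thread dispatcher: everything below runs serialized on the
 * dedicated IsoStack core, so the invoked stack code needs far fewer locks. */
void isostack_dispatcher(struct isostack *is)
{
    for (;;) {
        /* 1. Drain the adapter receive queue into TCP/IP processing. */
        struct packet *pkt;
        while ((pkt = hea_poll_rx(is)) != NULL)
            tcp_input(is, pkt);

        /* 2. Poll the per-socket command queues posted by socket front-ends. */
        struct socket_cmd cmd;
        for (struct socket_backend *sb = isostack_sockets(is); sb != NULL; sb = sb->next)
            while (cmd_queue_dequeue(sb->cmdq, &cmd))
                socket_backend_execute(is, sb, &cmd);

        /* 3. Housekeeping: TCP timers, notification queues, etc. */
        isostack_run_timers(is);
    }
}
```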

TX Performance
[Chart: CPU utilization (%) and throughput (MB/s) vs. message size, from 64 B to 64 KB, comparing Native CPU, IsoStack CPU, Native throughput, and IsoStack throughput.]

Rx Performance
[Chart: CPU utilization (%) and throughput (MB/s) vs. message size, from 64 B to 64 KB, comparing Native CPU, IsoStack CPU, Native throughput, and IsoStack throughput.]

Impact of Un-contended Locks
With an unnecessary lock re-enabled in the IsoStack:
- For a low number of connections: throughput decreased, with the same or higher CPU utilization.
- For a higher number of connections: same throughput, but higher CPU utilization.
Even when un-contended, locks have a tangible cost!
[Chart: transmit performance for 64-byte messages; CPU utilization (%) and throughput (MB/s) vs. number of connections, from 1 to 128, comparing Native, IsoStack, and IsoStack+Lock.]

IsoStack Summary (USENIX ATC 2010)
- Isolation of the network stack dramatically reduces overhead: no CPU-sharing costs, decreased memory-sharing costs, and explicit asynchronous messaging instead of blind sharing.
- Optimized for a large number of applications and for high request rates (short messages).
- Opportunity for further improvement with hardware and OS extensions: generic support for subsystem isolation.

Questions?

Backup

Using Multiple IsoStack Instances
- Utilize the adapter's packet classification capabilities: connections are assigned to IsoStack instances according to the adapter's classification function.
- Applications can request connection establishment from any stack instance, but once the connection is established, the socket back-end notifies the socket front-end which instance will handle this connection (a sketch of this handoff follows below).
[Diagram: application threads t1-t3 on App CPU 1 and App CPU 2; IsoStack CPU 1 runs TCP/IP/Eth for Conn A, Conn B, and Conn C; IsoStack CPU 2 runs TCP/IP/Eth for Conn D; both attach to the NIC.]
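A minimal sketch of that handoff, under the assumption that the back-end can evaluate the same classification function the adapter applies to incoming packets (adapter_classify below is a hypothetical stand-in for it, as are the other names): the back-end computes the owning instance when the connection is established and reports it to the front-end, which directs all subsequent socket commands to that instance's queues.

```c
#include <stdint.h>

struct conn_tuple {
    uint32_t saddr, daddr;
    uint16_t sport, dport;
};

/* Hypothetical: evaluates the same classification the adapter applies to
 * incoming packets, so packets and socket commands meet on the same core. */
unsigned adapter_classify(const struct conn_tuple *t, unsigned num_instances);

struct fe_socket {
    unsigned owner_instance;   /* which IsoStack instance handles this socket */
    /* ... per-instance shared-memory queues, socket buffer, etc. ... */
};

/* Back-end side, when the connection is established (connect/accept done). */
unsigned backend_assign_instance(const struct conn_tuple *t, unsigned num_instances)
{
    return adapter_classify(t, num_instances);
}

/* Front-end side: remember the owner reported by the back-end; all further
 * socket commands for this connection are posted to that instance's queues. */
void fe_note_owner(struct fe_socket *s, unsigned instance)
{
    s->owner_instance = instance;
}
```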

Potential for Platform Improvements
The hardware and the operating systems should provide a better infrastructure for subsystem isolation:
- Efficient interaction between a large number of applications and an isolated subsystem; in particular, better notification mechanisms, both to and from the isolated subsystem.
- Non-shared memory pools.
- Energy-efficient wait on multiple memory locations.