SLIPSTREAM: AUTOMATIC INTERPROCESS COMMUNICATION OPTIMIZATION. Will Dietz, Joshua Cranmer, Nathan Dautenhahn, Vikram Adve

Similar documents
Light & NOS. Dan Li Tsinghua University

Operating System Structure

Light: A Scalable, High-performance and Fully-compatible User-level TCP Stack. Dan Li ( 李丹 ) Tsinghua University

Operating System Structure

ECE 650 Systems Programming & Engineering. Spring 2018

Exploring Alternative Routes Using Multipath TCP

IO-Lite: A Unified I/O Buffering and Caching System

Glauber Costa, Lead Engineer

SFO17-403: Optimizing the Design and Implementation of KVM/ARM

SFO17-410: NEVE: Nested Virtualization Extensions for ARM

Scaling Internet TV Content Delivery ALEX GUTARIN DIRECTOR OF ENGINEERING, NETFLIX

SEDA: An Architecture for Well-Conditioned, Scalable Internet Services

Preserving I/O Prioritization in Virtualized OSes

Inside look at benchmarks Wim Coekaerts Senior Vice President, Linux and Virtualization Engineering. Wednesday, August 17, 11

Design Project 2: Virtual Machine Placement in a Data Center Network. Tiffany Yu-Han Chen

VM Migration, Containers (Lecture 12, cs262a)

FlexSC. Flexible System Call Scheduling with Exception-Less System Calls. Livio Soares and Michael Stumm. University of Toronto

Communication System Design Projects

Introduction to OpenOnload Building Application Transparency and Protocol Conformance into Application Acceleration Middleware

Transparent TCP Recovery

Accelerate Network Protocol Stack Performance and Adoption in the Cloud Networking via DMM

Think Small to Scale Big

TECHNICAL NOTE. Oracle Protocol Agent Overview. Version 2.5 TECN04_SG2.5_

Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation Kevin Hsieh

ARISTA: Improving Application Performance While Reducing Complexity

Process. Program Vs. process. During execution, the process may be in one of the following states

Square Pegs in a Round Pipe: Wire-Compatible Unordered Delivery In TCP and TLS

NTRDMA v0.1. An Open Source Driver for PCIe NTB and DMA. Allen Hubbe at Linux Piter 2015 NTRDMA. Messaging App. IB Verbs. dmaengine.h ntb.

L41: Lab 2 - Kernel implications of IPC

CHAPTER 2: SYSTEM STRUCTURES. By I-Chen Lin Textbook: Operating System Concepts 9th Ed.

The case for ubiquitous transport-level encryption

Kernel Level Speculative DSM

W H I T E P A P E R. Virtual Machine Monitor Execution Modes in VMware vsphere 4.0

Travis Cardwell Technical Meeting

Minion: An All-Terrain Packet Packhorse to Jump-Start Stalled Internet Transports

TCP Tuning for the Web

Fast packet processing in the cloud. Dániel Géhberger Ericsson Research

Four Components of a Computer System

Why Your Application only Uses 10Mbps Even the Link is 1Gbps?

STATE OF MODERN APPLICATIONS IN THE CLOUD

Network stack virtualization for FreeBSD 7.0. Marko Zec

Migratory TCP (MTCP) Transport Layer Support for Highly Available Network Services

CACHE ME IF YOU CAN! GETTING STARTED WITH AMAZON ELASTICACHE. AWS Charlotte Meetup / Charlotte Cloud Computing Meetup Bilal Soylu October 2013

CS519: Computer Networks. Lecture 5, Part 1: Mar 3, 2004 Transport: UDP/TCP demux and flow control / sequencing

tcpcrypt: real transport-level encryption Andrea Bittau, Mike Hamburg, Mark Handley, David Mazieres, Dan Boneh. UCL and Stanford.

Interoperation of tasks

Transport Layer Overview

CS 43: Computer Networks. 15: Transport Layer & UDP October 5, 2018

CS 43: Computer Networks. 08:Network Services and Distributed Systems 19 September

Real Time Linux patches: history and usage

Designing a Resource Pooling Transport Protocol

Information-Agnostic Flow Scheduling for Commodity Data Centers

IX: A Protected Dataplane Operating System for High Throughput and Low Latency

Advanced RDMA-based Admission Control for Modern Data-Centers

To EL2, and Beyond! connect.linaro.org. Optimizing the Design and Implementation of KVM/ARM

Optimizing your virtual switch for VXLAN. Ron Fuller, VCP-NV, CCIE#5851 (R&S/Storage) Staff Systems Engineer NSBU

MegaPipe: A New Programming Interface for Scalable Network I/O

Dr Markus Hagenbuchner CSCI319. Distributed Systems Chapter 3 - Processes

NGN Progress Report. Table of Contents

Lecture 2-ter. 2. A communication example Managing a HTTP v1.0 connection. Managing a HTTP request. transport session. Step 1 - opening transport

Network Implementation

Unikernels as Processes

Information-Agnostic Flow Scheduling for Commodity Data Centers. Kai Chen SING Group, CSE Department, HKUST May 16, Stanford University

A unified multicore programming model

Agenda. Threads. Single and Multi-threaded Processes. What is Thread. CSCI 444/544 Operating Systems Fall 2008

Joseph Faber Wonderful Talking Machine (1845)

ORB Performance: Gross vs. Net

Information Flow Control For Standard OS Abstractions

CSC 401 Data and Computer Communications Networks

Graphene-SGX. A Practical Library OS for Unmodified Applications on SGX. Chia-Che Tsai Donald E. Porter Mona Vij

/ Cloud Computing. Recitation 5 February 14th, 2017

Performance analysis, development and improvement of programs, commands and BASH scripts in GNU/Linux systems

Improved Path Exploration in shim6-based Multihoming

Disclaimer This presentation may contain product features that are currently under development. This overview of new technology represents no commitme

Memory Management Strategies for Data Serving with RDMA

Performance of Variant Memory Configurations for Cray XT Systems

ECS-087: Mobile Computing

An Analysis and Empirical Study of Container Networks

A Trace-based Java JIT Compiler Retrofitted from a Method-based Compiler

CSEN 503 Introduction to Communication Networks. Mervat AbuElkheir Hana Medhat Ayman Dayf. ** Slides are attributed to J. F.

Lecture 05 Application Layer - I

OS structure. Process management. Major OS components. CSE 451: Operating Systems Spring Module 3 Operating System Components and Structure

TincVPN Optimization. Derek Chiang, Jasdeep Hundal, Jisun Jung

Kernel Types Simple OS Examples System Calls. Operating Systems. Autumn CS4023

CSCD 330 Network Programming

Using FPGAs as Microservices

Services in the Virtualization Plane. Andrew Warfield Adjunct Professor, UBC Technical Director, Citrix Systems

Developing ILNP. Saleem Bhatti, University of St Andrews, UK FIRE workshop, Chania. (C) Saleem Bhatti.

Enabling Hybrid Parallel Runtimes Through Kernel and Virtualization Support. Kyle C. Hale and Peter Dinda

Advanced Computer Networks. End Host Optimization

Multiprocessor Systems. Chapter 8, 8.1

Fault Tolerant Java Virtual Machine. Roy Friedman and Alon Kama Technion Haifa, Israel

Last Class: OS and Computer Architecture. Last Class: OS and Computer Architecture

Difference Engine: Harnessing Memory Redundancy in Virtual Machines (D. Gupta et all) Presented by: Konrad Go uchowski

14th ANNUAL WORKSHOP 2018 A NEW APPROACH TO SWITCHING NETWORK IMPLEMENTATION. Harold E. Cook. Director of Software Engineering Lightfleet Corporation

Transport Layer. Chapter 3: Transport Layer

Network stack personality in Android phone. Cristina Opriceana, Hajime Tazaki (IIJ Research Lab.) Linux netdev 2.2, Seoul, Korea 08 Nov.

MODERN SYSTEMS: EXTENSIBLE KERNELS AND CONTAINERS

Hewlett Packard Enterprise HPE GEN10 PERSISTENT MEMORY PERFORMANCE THROUGH PERSISTENCE

Modern and Fast: A New Wave of Database and Java in the Cloud. Joost Pronk Van Hoogeveen Lead Product Manager, Oracle

Transcription:

SLIPSTREAM: AUTOMATIC INTERPROCESS COMMUNICATION OPTIMIZATION Will Dietz, Joshua Cranmer, Nathan Dautenhahn, Vikram Adve

Introduction 2 Use of TCP is ubiquitous Widely Supported Location Transparency Programmer-friendly but not always ideal Faster IPC exists for local communication Goal: Best of TCP generality with fast performance locally

Motivating Example 3 (Java, C, Python, ) Database

Motivating Example 4 (Java, C, Python, ) Database Optimization opportunity: Local communication could use UDS

Motivating Example Manually Optimized 5 (Java, C, Python, ) Database UDS vs TCP: up to 2x ops/sec! Avoid TCP layers

Motivating Example Manually Optimized 6 (Java, C, Python, ) Can we do this automatically? Database UDS vs TCP: up to 2x ops/sec! Slipstream: Automatic IPC Optimization Avoid TCP layers

What s Coming 7 Slipstream Overview How Slipstream Works Selected Results Performance Compatibility Docker Conclusions

8 Slipstream Overview

Design Principles: Automatic and Useful 9 (Java, C, Python, ) Database No App. Modifications Language Agnostic No Modifications No Network Assumptions Backwards Compatible

Slipstream Architecture: What is it? 10 (Java, C, Python, ) Database

Example + (libipc) 11 (Java, C, Python, ) Database libipc libipc libipc libipc: Shim library between application and OS

Example + (libipc + ipcd) 12 (Java, C, Python, ) Database libipc ipcd libipc libipc ipcd ipcd: Coordinating daemon for all libipc on host

Example + Slipstream 13 (Java, C, Python, ) Database libipc ipcd libipc libipc ipcd

14 How Slipstream Works

How Slipstream Works: Overview 15 Track TCP State Detect Local Communication Switch to UDS

Tracking TCP State 16 (Java, C, Python, ) Database libipc ipcd libipc libipc ipcd Track TCP State Detect Local Communication Switch to UDS

1) Tracking TCP State 17 read write listen (Java, C, Python, ) libipc Database libipc: Transparently monitors TCP operations ipcd libipc libipc ipcd Track TCP State Detect Local Communication Switch to UDS

1) Tracking TCP State 18 read write listen fork (Java, C, Python, ) libipc exec dup Database libipc: Transparently monitors TCP operations ipcd libipc libipc ipcd libipc: Maintains state throughout execution Track TCP State Detect Local Communication Switch to UDS

Detecting Local Communication 19 (Java, C, Python, ) Database libipc ipcd libipc libipc ipcd Track TCP State Detect Local Communication Switch to UDS

Detecting Local Communication 20 (Java, C, Python, ) ipcd: Uses information from libipc to perform endpoint Database matching libipc ipcd libipc libipc ipcd Track TCP State Detect Local Communication Switch to UDS

Detecting Local Communication 21 (Java, C, Python, ) libipc ipcd: Uses information Endpoint from Matching: libipc to identify local Memory 1. Store {SRC IP:Port, communication DEST Database IP:Port} 2. N-Byte Checksum of Stream Data 3. Timing Windows ipcd libipc libipc ipcd Track TCP State Detect Local Communication Switch to UDS

2) Detecting Local Communication 22 (Java, C, Python, ) Database libipc ipcd libipc libipc ipcd Local communication detected! Track TCP State Detect Local Communication Switch to UDS

Switching to UDS 23 (Java, C, Python, ) Database libipc ipcd libipc libipc ipcd Track TCP State Detect Local Communication Switch to UDS

Switching to UDS 24 (Java, C, Python, ) libipc Database ipcd: Generate optimized socketpair, send to libipc instances ipcd libipc libipc ipcd Track TCP State Detect Local Communication Switch to UDS

Switching to UDS 25 (Java, C, Python, ) libipc: Transparently switch Database to new transport libipc ipcd libipc libipc ipcd Track TCP State Detect Local Communication Switch to UDS

Switching to UDS 26 (Java, C, Python, ) libipc: Transparently switch Database to new transport libipc ipcd libipc libipc: ipcd Preserve expected socket behavior for compatibility Track TCP State Detect Local Communication Switch to UDS

Switching to UDS 27 (Java, C, Python, ) Database libipc ipcd libipc libipc ipcd Slipstream: up to 2x ops/sec Track TCP State Detect Local Communication Switch to UDS

28 Selected Results

Performance vs Local TCP 29 Memcached Up to +80% ops/sec! At least +40% ops/sec!

Performance vs Manual UDS 30 Netperf UDS Slipstream automatically achieves performance similar to manual UDS!

Performance Overhead 31 Netperf ~3.5% overhead Memcached < 3% overhead

Software Compatibility 32 Supports many popular server applications MySQL, PostgreSQL, Redis, Memcached, Apache, Jenkins, Language agnostic in practice Java, C, Python, All work correctly with Slipstream All are optimized when communicating locally Supports use with Docker!

Example with Docker 33 (Java, C, Python, ) Database Put app in container Container per component Slower comm., extra layers

Example with Docker + Slipstream? 34 (Java, C, Python, ) Database Did someone say they need Automatic IPC Optimization? Put app in container Container per component Slower comm., extra layers

Slipstream with Docker 35 (Java, C, libipc Python, ) libipc Database libipc per container

Slipstream with Docker 36 (Java, C, libipc Python, ) ipcd libipc Database libipc per container Connect to common ipcd

Slipstream with Docker 37 (Java, C, libipc Python, ) ipcd libipc Database libipc per container Connect to common ipcd ipcd can run in a container too

Slipstream with Docker 38 (Java, C, libipc Python, ) ipcd libipc Database Optimize across containers!

Slipstream with Docker 39 (Java, C, libipc Python, ) ipcd libipc Database Optimize across containers! 2x-3x ops/sec over Docker Same perf. as without Docker

Docker with NetPIPE-C 40 Slipstream shows large speedups Same perf. with or without Docker! Baseline Docker is significantly slower

Conclusions: 41 Automatically use faster IPC mechanism when available Easy to deploy, highly compatible No OS modifications No modifications to application Language-agnostic Backwards-compatible (partial deployment supported) ~2x bandwidth improvement for host-local communication Low overhead when optimization is not possible Freely Available at Thank you! Questions?

(End of presentation) 42

43 FAQ

Q: Why use UDS instead of other local IPC? 44 Simply: Socket interface of UDS facilitated prototyping Any local IPC mechanism should be possible (Patches welcome! ) Others could avoid kernel interaction for even better performance See ipc-bench paper Slipstream is not UDS-specific UDS implementation detail, idea is faster performance leveraging locality In fact, netperf sometimes slower with UDS (see paper).

Q: Why not have application do this itself? 45 Legacy: many applications don t do this today (Knowledge that UDS faster than TCP is not new!) They shouldn t have to, Slipstream! Achieves performance of manual optimization Low overhead when optimization not possible Applications have better things to do than transport selection!

Q: Why userspace solution, not in-kernel? 46 Mostly: Legacy Ease of use, ease of deployment Suitable for use on other systems (FreeBSD, etc.) May not be appropriate for kernel Correctness concerns (Opt-in approach likely appropriate) Possibility for using even more efficient transports than UDS Can t cut out kernel with a kernel-based solution

Q: What about frequent short-lived connections? 47 Current implementation not well-suited for this, but could be! Would not benefit from optimization only applied to longer-lived connections In fact, initial connection latency over localhost increased Not by design, result of implementation choices (but nevertheless) Probably better off with custom is local decision mechanism Ex: Is destination IP 127.0.0.1 Would allow optimizing before connection, get faster connect() times Thanks for the question, Joshua.

48 Extra Slides

Related Work 49 Category Name Application Transparency OS Transparency OS Feature VM-VM Userspace Networking Stack Userspace Shim Library Windows Fast-Path Solaris TCP Fusion AIX fastlo Linux TCP Friends? XWay XenSocket XenLoop mtcp Sandstorm Fable Java Fast Sockets Universal Fast Sockets FastSockets Slipstream

Building on Slipstream 50 Alternative transports (shared memory, etc.) (see ipc-bench) Select best transport for workload/configuration Support multiple locality detection mechanisms Provided one is suitable for many use-cases If can assume network topology, can directly determine this More exhaustive testing, performance investigation

Try it out today! 51 No configuration, simple deployment Big performance gains for many applications (Docker support!) Low overheads when optimization is not possible Makes most of partial deployment Open-source, CRAPL (don t judge, it s research!) Download today!

53 Deploying Slipstream

Example + Slipstream: Deployment 54 (Java, C, Python, ) Database

Example + Slipstream: Deployment 55 (Java, C, Python, ) Database libipc libipc libipc libipc deployed with application (LD_PRELOAD or /etc/ld.so.preload)

56 Challenges and Solutions

Example + Slipstream: Key Challenge 57 (Java, C, Python, ) Challenge: Which connections are communicating locally? Database libipc ipcd libipc libipc ipcd libipc only sees what goes through it Solution: Endpoint matching algorithm to identify local communication

Endpoint Matching Overview 58 It s the orange one! (Java, C, Python, ) libipc works with ipcd which has more information Database libipc ipcd libipc libipc ipcd Observed behavior from both libipc instances used by ipcd to find match

Endpoint Matching Overview 59 It s the orange one! (Java, C, Python, ) libipc works with ipcd which has more information Database libipc ipcd libipc libipc ipcd What information can be used to help pair endpoints? Observed behavior from both libipc instances used by ipcd to find match

Endpoint Matching Overview 60 Src/Dst IP and Port (Java, C, Python, ) libipc works with ipcd which has more information Database Data Sent/Recv libipc Timing Details ipcd libipc libipc ipcd and it s orange. Observed behavior from both libipc instances used by ipcd to find match

Endpoint Matching Overview 61 Src/Dst IP and Port (Java, C, Python, ) libipc works with ipcd which has more information Database Data Sent/Recv libipc Timing Details ipcd libipc libipc ipcd More details in paper! Observed behavior from both libipc instances used by ipcd to find match

Endpoint Matching Overview 62 (Java, C, Python, ) libipc works with ipcd which has more information Database libipc ipcd libipc libipc ipcd Observed behavior from both libipc instances used by ipcd to find match

Endpoint Matching Overview 63 (Java, C, Python, ) libipc works with ipcd which has more information Database libipc ipcd libipc libipc ipcd Observed behavior from both libipc instances used by ipcd to find match Local endpoints found!

Endpoint Matching: Complete 64 (Java, C, Python, ) Database libipc ipcd libipc libipc ipcd Local endpoints found!

Automatic IPC Optimization 65 (Java, C, Python, ) Database libipc ipcd libipc libipc ipcd Transparently switches to UDS

Automatic IPC Optimization 66 (Java, C, Python, ) Database libipc ipcd libipc libipc ipcd UDS vs TCP: up to 2x ops/sec

Without Modifying Applications? 67 (Java, C, Python, ) Implementation Challenge: libipc must preserve expected socket behavior Database for compatibility libipc ipcd libipc libipc ipcd Solution: Engineering!

68 Results, with slides. Source: http://www.penny-arcade.com/comic/2005/09/09

Faster than TCP? NetPIPE Microbenchmarks 69 NetPIPE-C NetPIPE-Java 1.4x to 2.5x improvement! Works with both C and Java!

Faster than TCP? Memcached 70 1.4x to 1.8x ops/sec

Faster than TCP? PostgreSQL 71 PostgreSQL TPC-B PostgreSQL - Select No improvement when CPU/IO bound Small improvement for Select-only

Faster than TCP? PostgreSQL 72 PostgreSQL TPC-B PostgreSQL - Select Application benefit depends on how much network is the bottleneck No improvement when CPU/IO bound Small improvement for Select-only

Slower than TCP? Slipstream Overhead 73 Netperf Memcached ~3.5% overhead for Netperf < 3% overhead for Memcached

Slower than TCP? Slipstream Overhead 74 Netperf Memcached Overhead incurred is low in even network-intensive applications ~3.5% overhead for Netperf < 3% overhead for Memcached

Slower than manual UDS? Netperf Microbenchmark 75 Relative to fixed benchmark (Default UDS does poorly)

Slower than manual UDS? Netperf Microbenchmark 76 Relative to fixed benchmark Automatically achieves performance close to manually using UDS! (Default UDS does poorly)

Docker with NetPIPE-C 77 Slipstream shows large speedups Same perf. with or without Docker! Baseline Docker is significantly slower

Docker with Memcached 78 Memcached with Docker Curious: Memcached without with Docker Slipstream performs better across Docker containers Baseline Docker has lower performance

Software Compatibility 79 Supports many popular server applications MySQL, PostgreSQL, Redis, Memcached, Apache, Jenkins, Language agnostic in practice Java, C, Python, And of course our microbenchmarks: NetPIPE, netperf, iperf, lmbench All work correctly with Slipstream All are optimized when communicating locally

Slipstream Overview Multiple Applications 80 ipcd: One instance per host Web App libipc libipc libipc ipcd libipc: One instance per process