Arrakis: The Operating System is the Control Plane


Arrakis: The Operating System is the Control Plane
Simon Peter, Jialin Li, Irene Zhang, Dan Ports, Doug Woos, Arvind Krishnamurthy, Tom Anderson (University of Washington)
Timothy Roscoe (ETH Zurich)

Building an OS for the Data Center
Server I/O performance matters: key-value stores, web and file servers, lock managers. Today's I/O devices are fast. Example system, a Dell PowerEdge R520 ($1,200):
- Intel X520 10G NIC: 2 us / 1KB packet
- Intel RS3 RAID with 1GB flash-backed cache: 25 us / 1KB write
- Sandy Bridge CPU: 6 cores, 2.2 GHz
Can we deliver performance close to hardware?

Can't we just use Linux?

Linux I/O Performance
Share of 1KB request time spent, for an in-memory GET and a persistent SET:
- GET (9 us total): HW 18%, kernel 62%, app 20%
- SET (163 us total): HW 13%, kernel 84%, app 3%
The kernel's share goes to the API, multiplexing, naming, access control, I/O processing, protection, resource limits, I/O scheduling, and copying on the data path. The hardware underneath is fast: 10G NIC at 2 us / 1KB packet, RAID storage at 25 us / 1KB write.
Takeaway: kernel mediation is too heavyweight.
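For scale, a minimal microbenchmark of the kernel-mediated path these numbers rest on. This only times the send-side syscall, not a full request; the destination address and port are placeholders:

```c
/* Sketch: average cost of a kernel-mediated 1KB UDP send on Linux.
 * 192.0.2.1:9 is a documentation-range placeholder destination. */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void) {
    int s = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in dst = {0};
    dst.sin_family = AF_INET;
    dst.sin_port = htons(9);                         /* placeholder port */
    inet_pton(AF_INET, "192.0.2.1", &dst.sin_addr);  /* placeholder host */

    char buf[1024];
    memset(buf, 'x', sizeof(buf));

    const int iters = 100000;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; i++)
        sendto(s, buf, sizeof(buf), 0, (struct sockaddr *)&dst, sizeof(dst));
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("avg kernel-mediated 1KB send: %.0f ns\n", ns / iters);
    return 0;
}
```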

Arrakis Goals
- Skip the kernel and deliver I/O directly to applications, reducing OS overhead.
- Keep classical server OS features: process protection, resource limits, I/O protocol flexibility, global naming.
- The hardware can help us.

Hardware I/O Virtualization
Standard on NICs, emerging on RAID controllers:
- Multiplexing: SR-IOV provides virtual PCI devices with their own registers, queues, and interrupts.
- Protection: the IOMMU lets devices use application virtual memory; packet filters and logical disks allow only eligible I/O.
- I/O scheduling: NIC rate limiters and packet schedulers.
[Diagram: an SR-IOV NIC with rate limiters and packet filters multiplexing two user-level VNICs onto the network.]
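As a concrete example of the multiplexing piece: on stock Linux, SR-IOV virtual functions are carved out of a NIC through the standard sriov_numvfs sysfs file. A minimal sketch, assuming an interface named eth0 and a request for two VFs:

```c
/* Sketch: enable SR-IOV virtual functions for a NIC on stock Linux.
 * The interface name "eth0" and the VF count are assumptions. */
#include <stdio.h>

int main(void) {
    const char *path = "/sys/class/net/eth0/device/sriov_numvfs";
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return 1; }
    /* Request 2 virtual functions; each appears as a PCI device
     * with its own registers, queues, and interrupts. */
    fprintf(f, "%d\n", 2);
    fclose(f);
    return 0;
}
```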

How to skip the kernel?
Start from the kernel-mediated stack (API, multiplexing, naming, access control, I/O processing, protection, resource limits, I/O scheduling, copying on the data path) and move pieces out, step by step:
- Protection, multiplexing, and I/O scheduling move into the I/O devices.
- The API, I/O processing, and the data path move into the application; copying disappears.
- Naming, access control, and resource limits stay with the OS.

Arrakis I/O Architecture
- Control Plane (kernel): naming, access control, resource limits.
- Data Plane (application): API, I/O processing, data path.
- I/O Devices: protection, multiplexing, I/O scheduling.

Arrakis Control Plane
- Access control: done once, when configuring the data plane; enforced via NIC filters and logical disks.
- Resource limits: program the hardware I/O schedulers.
- Global naming: the virtual file system stays in the kernel; the storage implementation lives in applications.
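To make the split concrete, here is a hypothetical sketch of how an application might talk to such a control plane. None of these names are Arrakis's actual API, and the stubs stand in for real hardware access:

```c
/* Hypothetical control-plane/data-plane split, loosely modeled on the
 * design above. All names and signatures are invented for illustration;
 * in a real system the handles would wrap hardware queues and doorbell
 * registers mapped into the application. */
#include <stdint.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct { int id; } vnic_t;  /* user-level virtual NIC handle */
typedef struct { int id; } vsa_t;   /* virtual storage area handle   */

/* Control plane: one-time, kernel-mediated setup (stubbed here). */
static vnic_t *ctl_create_vnic(void) { return calloc(1, sizeof(vnic_t)); }
static vsa_t  *ctl_create_vsa(size_t bytes) { (void)bytes; return calloc(1, sizeof(vsa_t)); }
static int ctl_install_filter(vnic_t *v, uint16_t port) { (void)v; (void)port; return 0; }
static int ctl_set_rate_limit(vnic_t *v, uint64_t bps)  { (void)v; (void)bps;  return 0; }

/* Data plane: in Arrakis these would touch hardware directly. */
static int vnic_send(vnic_t *v, const void *buf, size_t len) { (void)v; (void)buf; (void)len; return 0; }
static int vsa_write(vsa_t *s, uint64_t off, const void *buf, size_t len) { (void)s; (void)off; (void)buf; (void)len; return 0; }

int main(void) {
    /* Control plane: configured once, enforced in hardware thereafter. */
    vnic_t *v   = ctl_create_vnic();
    vsa_t  *log = ctl_create_vsa((size_t)1 << 30);
    ctl_install_filter(v, 11211);      /* only eligible traffic reaches us */
    ctl_set_rate_limit(v, 1000000000); /* kernel-programmed resource limit */

    /* Data plane: requests handled without entering the kernel. */
    char req[1024] = {0};
    vsa_write(log, 0, req, sizeof req);
    vnic_send(v, req, sizeof req);
    return 0;
}
```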

Global Naming
A Virtual Storage Area (backed by a logical disk) holds files such as /tmp/lockfile, /var/lib/key_value.db, and /etc/config.rc. The owning application reaches it with fast hardware ops. A legacy application like emacs calling open("/etc/config.rc") goes through the kernel VFS, which reaches the VSA over an indirect IPC interface.

Arrakis I/O Architecture, revisited: with the control plane covered, we turn to the data plane (API, I/O processing, and data path in the application).

Storage Data Plane: Persistent Data Structures
Examples: log, queue. Operations are immediately persistent on disk. Benefits:
- In-memory layout = on-disk layout, which eliminates marshaling.
- Metadata lives in the data structure: early allocation, spatial locality.
- Data-structure-specific caching and prefetching.
Redis modified to use the persistent log: 109 LOC changed.
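A minimal sketch of the persistent-log idea on stock Linux (Arrakis would issue these appends directly against its virtual storage area instead): the entry layout is identical in memory and on disk, so there is no marshaling step, and O_DSYNC makes each append durable before write() returns. The file name and record size are assumptions:

```c
/* Sketch: append-only persistent log with a shared in-memory/on-disk
 * layout. Fixed 1KB records; "kv.log" is an arbitrary file name. */
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

struct log_entry {          /* in-memory layout == on-disk layout */
    uint32_t len;           /* payload bytes used                 */
    char     payload[1020]; /* fixed-size record, 1KB total       */
};

int log_append(int fd, const void *data, uint32_t len) {
    struct log_entry e;
    if (len > sizeof(e.payload)) return -1;
    memset(&e, 0, sizeof(e));
    e.len = len;
    memcpy(e.payload, data, len);
    /* One write; durable on return because the fd is O_DSYNC. */
    return write(fd, &e, sizeof(e)) == sizeof(e) ? 0 : -1;
}

int main(void) {
    int fd = open("kv.log", O_WRONLY | O_CREAT | O_APPEND | O_DSYNC, 0644);
    if (fd < 0) return 1;
    log_append(fd, "SET k v", 7);
    close(fd);
    return 0;
}
```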

Evaluation

Latency
Reduced (in-memory) GET latency by 65%:
- Linux: HW 18%, kernel 62%, app 20% (9 us)
- Arrakis: HW 33%, libio 35%, app 32% (4 us)
Reduced (persistent) SET latency by 81%:
- Linux (ext4): HW 13%, kernel 84%, app 3% (163 us)
- Arrakis: HW 77%, libio 7%, app 15% (31 us)

Throughput
Improved GET throughput by 1.75x: Linux 143k transactions/s, Arrakis 250k transactions/s.
Improved SET throughput by 9x: Linux 7k transactions/s, Arrakis 63k transactions/s.

memcached Scalability
[Chart: throughput (k transactions/s) vs. number of CPU cores, Linux vs. Arrakis. Arrakis outperforms Linux by 1.8x on 1 core, 2x on 2 cores, and 3.1x on 4 cores, where it approaches the 10Gb/s interface limit.]

Single-core Performance
UDP echo benchmark, throughput in k packets/s:
[Chart: Linux = 1x baseline; Arrakis/POSIX = 2.3x; Arrakis/Zero-copy = 3.4x; raw driver = 3.6x, near the 10Gb/s interface limit.]
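The workload here is a plain UDP echo; a minimal POSIX server for it looks like the sketch below (the port number is an arbitrary choice, not from the talk). On Linux every packet pays a kernel recvfrom/sendto round trip, which is the overhead Arrakis's user-level VNIC removes:

```c
/* Sketch: single-threaded POSIX UDP echo server. */
#include <arpa/inet.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    int s = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(7777);   /* arbitrary unprivileged port */
    if (bind(s, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        return 1;

    char buf[2048];
    for (;;) {
        struct sockaddr_in src;
        socklen_t slen = sizeof(src);
        ssize_t n = recvfrom(s, buf, sizeof(buf), 0,
                             (struct sockaddr *)&src, &slen);
        if (n > 0)   /* bounce the datagram back to its sender */
            sendto(s, buf, (size_t)n, 0, (struct sockaddr *)&src, slen);
    }
}
```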

Summary
The OS is becoming an I/O bottleneck: globally shared I/O stacks are slow on the data path.
Arrakis splits the OS into a control plane and a data plane: direct application I/O on the data path, with specialized I/O libraries.
Application-level I/O stacks deliver great performance: up to 9x throughput and an 81% latency reduction; memcached scales linearly to 3x throughput.
Source code: http://arrakis.cs.washington.edu