Arachne: Core-Aware Thread Management
Henry Qin, Jacqueline Speiser, John Ousterhout


Granular Computing Platform
[Overview diagram of the Platform Lab's granular computing stack: applications, cluster scheduling, low-latency RPC, thread/app management, hardware accelerators, scalable notifications, and low-latency storage, led by Zaharia, Winstein, Levis, Kozyrakis, Ousterhout, and Qin/Speiser.]

Introduction
Computation is becoming more granular: faster networks and increased use of DRAM are pushing response times into the microsecond range. RAMCloud performs reads in 5 μs, with a service time of 2 μs.
Problem: it is hard to achieve both low latency and high throughput.
Arachne takes a core-aware approach to thread management: applications own cores for tens of ms and run user-space threads on them; each application estimates its own core needs; a core arbiter allocates cores among applications.
Results: efficient user-level threads (< 200 ns thread creation) and efficient core usage (21 μs core reallocation).

Outline
Problem: thread management in modern data centers is inefficient
Arachne overview
Core allocation
Arachne runtime
Performance benchmarks
Conclusion

Problems with Thread Management
Efficiency: kernel threading overheads are too high (5 μs to create a kernel thread).
Resource opacity: applications don't know how many cores they have, so it is hard to match internal parallelism to available resources.
Interference: other applications compete for cores, yet low latency requires exclusive use of cores, and application loads vary over time.

Thread Pools
Thread pools solve the kernel thread efficiency problem by reusing kernel threads for incoming requests. We want the number of threads to equal the number of cores, but that requires knowing how many cores are available: with too few threads, cores sit wasted; with too many, the kernel multiplexes them.
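To make the sizing problem concrete, here is a minimal, conventional pool sketch (all names are illustrative). Sizing it with std::thread::hardware_concurrency() reports the machine's core count, but says nothing about how many of those cores this application actually has to itself:

```cpp
// Minimal thread-pool sketch illustrating the sizing problem: the pool is
// sized to hardware_concurrency(), which reports the machine's core count,
// not how many cores this application can really use exclusively.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class ThreadPool {
public:
    explicit ThreadPool(unsigned n = std::thread::hardware_concurrency()) {
        for (unsigned i = 0; i < n; i++)
            workers.emplace_back([this] { workerLoop(); });
    }
    void submit(std::function<void()> task) {
        { std::lock_guard<std::mutex> lk(m); tasks.push(std::move(task)); }
        cv.notify_one();
    }
    ~ThreadPool() {
        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_all();
        for (auto& w : workers) w.join();
    }
private:
    void workerLoop() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lk(m);
                cv.wait(lk, [this] { return done || !tasks.empty(); });
                if (done && tasks.empty()) return;
                task = std::move(tasks.front());
                tasks.pop();
            }
            task();  // too few workers waste cores; too many get multiplexed
        }
    }
    std::vector<std::thread> workers;
    std::queue<std::function<void()>> tasks;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;
};
```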

Thread Pools
Dedicated machines let us know the number of cores, but they are wasteful under low load: with no incoming requests, the threads spin idly. Colocating other applications reclaims the wasted cores, but now multiple tenants compete for them and the kernel multiplexes their threads.

Thread Pools (Desired Behavior)
Reduce the number of kernel threads at low load, yielding cores to threads from other applications; displace those applications again at high load.

Arachne: Core-Aware Thread Management
Cores are dedicated to particular applications, providing isolation and eliminating the interference problem.
An application requests cores, not threads, so it always knows how many cores it owns.
Thread management moves to userspace: thread operations become very fast (100-200 ns), short-lived threads are multiplexed on the allocated cores, and threads context-switch when waiting on μs-scale operations.

System Overview
Core arbiter: an external setuid process shared by all applications. It allocates (managed) cores to applications and arbitrates between them.
Runtime: a component linked into each application. It multiplexes user threads on top of the managed cores and estimates the number of cores the application needs.
[Diagram: application user threads run on the Arachne runtime, which sends core requests to the core arbiter; the arbiter assigns cores back.]

Core Arbiter Design Overview
How do we allocate a core to an application? We must ensure that only one kernel thread runs on that core; core pinning alone is insufficient. Applications own cores for long periods (tens of ms), and the allocation must adjust as application workloads change.
[Diagram: today's world, where kernel threads share all cores, versus a managed-core world, where a kernel thread that wants its own core gets one exclusively.]

Kernel Threads
The Arachne runtime creates a pool of kernel threads; each sends its (TID, PID) to the core arbiter and blocks. When the application requests X cores, the arbiter moves X of those kernel threads onto managed cores and unblocks them; each unblocked thread now owns its core.
[Diagram: kernel threads KT0-KT3 register with the arbiter and block; on a request for 2 cores, two of them are unblocked on managed cores.]
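A minimal sketch of this registration step, assuming a Unix-domain-socket transport; the socket path, message layout, and function names below are illustrative, not Arachne's actual wire format:

```cpp
// Sketch: a runtime kernel thread registers with the core arbiter and
// blocks until granted a managed core. Protocol details are assumptions.
#include <sys/socket.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/un.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

struct RegisterMsg {
    pid_t tid;  // kernel thread id (gettid)
    pid_t pid;  // owning process
};

int connectToArbiter(const char* path) {
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    sockaddr_un addr{};
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
    if (connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0) {
        perror("connect to core arbiter");
        return -1;
    }
    return fd;
}

// Each kernel thread in the pool runs this: announce (TID, PID), then
// block in read() until the arbiter grants a core. The arbiter moves the
// thread into that core's cpuset before replying, so when read() returns
// the thread is already running on its managed core.
int waitForCore(int arbiterFd) {
    RegisterMsg msg{static_cast<pid_t>(syscall(SYS_gettid)), getpid()};
    write(arbiterFd, &msg, sizeof(msg));
    int coreId = -1;
    read(arbiterFd, &coreId, sizeof(coreId));  // blocks until granted
    return coreId;
}
```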

Linux cpusets
Idea: use Linux cpusets to manage cores. A cpuset is a subset of the machine's cores; each kernel thread assigned to a cpuset runs only on that cpuset's cores.
[Diagram: an omnipresent cpuset containing every core versus a limited cpuset containing a subset.]

Using cpusets for core allocation
The arbiter creates one cpuset per core, plus an unmanaged cpuset that initially contains all cores. To allocate a core, the arbiter removes it from the unmanaged cpuset and moves a blocked kernel thread into that core's cpuset.
[Diagram: cores C0-C3, with threads either blocked waiting for a core, running on a managed core, or running in the unmanaged cpuset.]
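The mechanics can be sketched against the legacy (v1) cpuset filesystem, assuming it is mounted at /sys/fs/cgroup/cpuset and the arbiter runs with sufficient privilege; directory names and the single-NUMA-node setting are illustrative:

```cpp
// Sketch of cpuset-based core allocation via the cgroup-v1 cpuset
// filesystem. Paths and naming are assumptions for illustration.
#include <sys/stat.h>   // mkdir
#include <sys/types.h>
#include <fstream>
#include <string>

const std::string kCpusetBase = "/sys/fs/cgroup/cpuset";

void writeFile(const std::string& path, const std::string& value) {
    std::ofstream f(path);
    f << value;
}

// Create a one-core cpuset for `core` and move kernel thread `tid` into
// it; the thread will then run only on that core.
void grantCore(int core, pid_t tid) {
    std::string dir = kCpusetBase + "/managed_core" + std::to_string(core);
    mkdir(dir.c_str(), 0755);
    writeFile(dir + "/cpuset.cpus", std::to_string(core));
    writeFile(dir + "/cpuset.mems", "0");           // single NUMA node assumed
    writeFile(dir + "/tasks", std::to_string(tid)); // pin the thread
}

// The arbiter also shrinks the unmanaged cpuset so the allocated core is
// no longer shared with everything else on the machine.
void setUnmanagedCores(const std::string& remainingCores) {
    writeFile(kCpusetBase + "/unmanaged/cpuset.cpus", remainingCores);
}
```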

Core Preemption
When the arbiter needs a core back, it asks the application: "Give me back one core." The application's runtime picks one of its kernel threads; that thread announces "I will block," blocks, and its core returns to the arbiter.
[Diagram sequence: the arbiter requests a core; KT1 blocks, releasing its core.]

Core Preemption (Misbehaving App)
If the application instead keeps working on its own computation, the arbiter waits 10 ms and then announces "I'll force you off that core." The evicted kernel thread keeps running, but only in the unmanaged cpuset.
[Diagram sequence: KT1 ignores the request past the 10 ms deadline, so the arbiter moves it to the unmanaged cpuset.]
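The cooperative-release path in these two slides can be sketched as a check the runtime performs between user threads. How the arbiter signals the request (here, a shared atomic counter) and the function names are assumptions:

```cpp
// Sketch of voluntary core release. The signaling mechanism (an atomic
// counter the arbiter increments) is an illustrative assumption.
#include <atomic>

std::atomic<int> coresRequestedBack{0};  // hypothetically set by the arbiter

void blockOnArbiter() {
    // Re-enter the wait-for-core path from the registration sketch above,
    // blocking until the arbiter grants this thread a core again.
}

// Called by the dispatcher between user threads. A well-behaved runtime
// gives the core back promptly; if it keeps computing instead, the
// arbiter forcibly moves this kernel thread to the unmanaged cpuset
// after ~10 ms.
void checkForCoreRelease() {
    int wanted = coresRequestedBack.load(std::memory_order_acquire);
    if (wanted > 0 &&
        coresRequestedBack.compare_exchange_strong(wanted, wanted - 1)) {
        blockOnArbiter();  // "I will block": release this core voluntarily
    }
}
```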

Arachne Runtime Overview
Arachne is cooperative: threads on a given core must terminate, yield, or block before other threads can run. Each kernel thread (that is, each core) perpetually looks for user threads and runs them; the search resumes whenever a thread relinquishes control.
Thread API:
createThread() - spawn a thread running a given function with given arguments
yield() - give other threads on this core a chance to run
sleep() - sleep for at least a given duration
join() - sleep until another thread completes
Synchronization: SleepLock, SpinLock, ConditionVariable, Semaphore
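A minimal usage sketch of this API, assuming the functions above live in an Arachne namespace with roughly these signatures; the header path and exact types are assumptions (see the repository linked at the end for the real interface):

```cpp
// Usage sketch under assumed signatures; treat as illustrative, not as
// the library's published API.
#include "Arachne/Arachne.h"  // assumed header path
#include <cstdio>

void handleRequest(int requestId) {
    printf("handling request %d\n", requestId);
    Arachne::yield();  // cooperative: let other threads on this core run
}

void serve() {
    // User-thread creation costs under 200 ns, so spawning one thread
    // per incoming request is affordable.
    Arachne::ThreadId t = Arachne::createThread(handleRequest, 42);
    Arachne::join(t);  // sleep until the handler completes
}
```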

Designing a highly efficient runtime
Threading performance is dominated by cache operations. The basic operations are not compute-heavy (a context switch is only 14 instructions), but they require communication between cores, and inter-core communication costs cache operations (on the order of 100 cycles each). The Arachne runtime therefore minimizes cache traffic, i.e., the data transferred across cores.
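One way to picture this design goal is a dispatcher whose per-thread scheduling state is packed into a single cache line, so polling for runnable threads stays in the local core's cache. The field names, the context limit, and the run() stand-in (in place of a real 14-instruction assembly context switch) are illustrative assumptions, not Arachne's actual data structures:

```cpp
// Sketch of a cache-conscious dispatch loop: deciding whether a thread is
// runnable touches one 64-byte line, and all state is core-local.
#include <atomic>
#include <cstdint>
#include <x86intrin.h>  // __rdtsc

struct alignas(64) ThreadContext {                 // one cache line each
    std::atomic<uint64_t> wakeupTime{UINT64_MAX};  // runnable when now >= this
    void (*run)(ThreadContext*){nullptr};          // stand-in for a real context switch
};

constexpr int kMaxThreadsPerCore = 56;             // illustrative limit
ThreadContext contexts[kMaxThreadsPerCore];        // core-local array

// Each kernel thread (core) scans its own contexts forever. The only
// cross-core cache misses occur when another core writes wakeupTime to
// wake a thread here: exactly the traffic the runtime minimizes.
void dispatcherLoop() {
    for (;;) {
        uint64_t now = __rdtsc();
        for (auto& ctx : contexts) {
            if (ctx.run &&
                ctx.wakeupTime.load(std::memory_order_acquire) <= now) {
                ctx.run(&ctx);  // resume the user thread
                now = __rdtsc();
            }
        }
    }
}
```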

Results
Configuration: 4-core Xeon X3470 @ 2.93 GHz, 24 GB DDR3 @ 800 MHz
Benchmarks: cost of core allocation; cost of scheduling primitives

What is the cost of core allocation?
The plot shows the interval from the time a core is requested to the time it is acquired. Case 1: the arbiter has an idle core available (orange line). Case 2: the arbiter must reclaim a core from another application (blue line). Median cost: 21 μs.

What is the cost of scheduling primitives?

Operation                   Arachne   std::thread   Go
Thread Creation             182 ns    5760 ns       261 ns
Condition Variable Notify   195 ns    4137 ns       317 ns
Yield                        88 ns    N/A           N/A
Null Yield                   15 ns    N/A           N/A

Yield: time to transfer control from one thread to another on the same core.
Null Yield: time for a thread to yield when no other threads are on the core.
Go creates threads on the same core as the caller, yet its cost is still higher than Arachne's even without cache misses.

Future Work
What are the right abstractions for giving applications more direct control over cores? Should applications specify different thread profiles (e.g., latency-sensitive, background, long-running)? Should they be able to pick which cores threads run on and co-locate specific threads?
What are the right policies for allocating hyperthreaded cores? Always allocate one thread of each pair first? Ensure that hyper-twins always go to the same application?
Can Arachne interact with cluster-level schedulers for greater cluster-wide efficiency?
How can Arachne play nicely with today's containerized world?

Conclusion
We built Arachne, a thread management system for granular tasks on multi-core systems. Applications own cores for tens of milliseconds and run user-space threads on them; each application estimates its own core needs; a core arbiter allocates cores among applications. Arachne solves the problems of efficiency, interference, and resource opacity, enabling granular tasks to combine low latency with high throughput.

Questions? github.com/platformlab/arachne