COLLIN LEE INITIAL DESIGN THOUGHTS FOR A GRANULAR COMPUTING PLATFORM


GOAL OF THIS TALK
- Introduce design ideas and issues for a granular computing platform.
- Stimulate discussion: straw-man ideas and opinions up for debate, structured as a collection of what-ifs and assertions. Ultimately, we are looking for safe simplifying assumptions.
- Encourage feedback and suggestions: feel free to point out the good, the bad, and the ugly.

OVERVIEW
- What is granular computing?
  - Marching forward with or without applications.
  - Anatomy of a granular application.
  - Properties of a granular application.
- Assertions:
  1. All code needs to be pre-loaded and ready to run.
  2. A shared polling infrastructure is needed.
  3. Trade memory protection for low latency.
  4. Scheduling must be decentralized.
  5. Fast resource preemption is necessary.
  6. Granules can only communicate via invocation.
  7. Durability of granules must be batched.
  8. Need reliable networks for low-latency (geo-)replication.
  9. Storage system must expose data location.
- Conclusion
- Questions and feedback
- Poster session

WHAT IS GRANULAR COMPUTING?
- A large number of tiny tasks.
- Rapid scale up/down.
- Apps composed of tiny tasks.

WHAT IS GRANULAR COMPUTING? MARCHING FORWARD WITH OR WITHOUT APPLICATIONS
- Current lack of concrete applications.
- Possible approach: wait for applications? We have yet to find one, and applications and platforms might be a chicken-and-egg problem.
- Current approach: assume application properties and proceed.
  - Risk: we could be totally wrong.
  - Belief: we will learn a lot anyway, and the results can be used as input for designing future applications and platforms.
- If you do have applications, let us know!

WHAT IS GRANULAR COMPUTING? ANATOMY OF A GRANULAR APPLICATION
- Application: composed of a collection of granule types (e.g. backyard surveillance).
- Granule type: a distinct unit of execution (e.g. Detect Cat, Detect Mountain Lion).
- A request to the application invokes a large number of granules, both in parallel and in series.

WHAT IS GRANULAR COMPUTING? PROPERTIES OF A GRANULAR APPLICATION
- Application vs. granule latency:
  - Application request latency: 1-10 ms SLOs.
  - Granule execution latency: as small as possible (10-100 µs SLOs?).
- Low duty cycle: bursty, scaling to the full cluster in short bursts; a dynamic workload.
- Unpredictable execution paths: dynamic, data-dependent code paths make granule execution unpredictable.

ASSERTIONS

ASSERTION #1: ALL CODE NEEDS TO BE PRE-LOADED AND READY TO RUN
- The cost of dynamic code loading is high: remote code shipping (100+ ms), container startup (300-400 ms), process startup (~1 ms). All of these dominate a 10-100 µs granule SLO.
- Solution: have code already loaded and ready to execute (see the sketch below).
  - A burst needs its code on every machine, so every machine must be ready to run (nearly) any granule.
  - New payment model: pay for hot-loaded code.
- Questions: Does all the code in a datacenter fit in the memory of a single machine? Would new non-volatile memory technology help?
[Figure: startup costs for shipping code, container startup, and process startup, compared against the granule SLO.]
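As a minimal sketch of the pre-loaded model (all names here are hypothetical, not from the talk): every granule type is registered as a callable at machine startup, so an invocation is just a table lookup and a function call, with no code shipping or process startup on the critical path.

    // granule_registry.cc: hypothetical sketch of pre-loaded granule code.
    #include <functional>
    #include <iostream>
    #include <string>
    #include <unordered_map>

    using GranuleFn = std::function<std::string(const std::string&)>;

    class GranuleRegistry {
      public:
        // Called once at machine startup, before any requests arrive.
        void preload(const std::string& type, GranuleFn fn) {
            table_[type] = std::move(fn);
        }
        // Invocation is a hash lookup plus a call: no dynamic loading.
        std::string invoke(const std::string& type, const std::string& arg) {
            return table_.at(type)(arg);
        }
      private:
        std::unordered_map<std::string, GranuleFn> table_;
    };

    int main() {
        GranuleRegistry registry;
        registry.preload("DetectCat", [](const std::string& frame) {
            return "no cat in " + frame;  // placeholder for real detection code
        });
        std::cout << registry.invoke("DetectCat", "frame42") << "\n";
    }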

ASSERTION #2: A SHARED POLLING INFRASTRUCTURE IS NEEDED
- Polling is needed for low-latency invocation: experience from RAMCloud shows that interrupts and signals are too slow (~500 µs).
- But a large number of polling applications would overwhelm a machine: there might be 100s of applications and 1000s of granule types, each needing to poll for the lowest invocation latency.
- Solution: shared polling for all applications on a machine (see the sketch below).
  - Reduces idle resource usage to a single poller.
  - Dispatches granule invocation requests for execution.
- Questions: Is polling necessary? Are there better ways to dispatch work without polling? How does a shared poller work?
[Figure: applications on one machine served by a single shared polling infrastructure sitting between them and the network.]
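One way a shared poller might look, as a sketch only: an in-memory queue stands in for a kernel-bypass receive queue, and a single busy-poll loop dispatches arriving requests to per-application handlers, so idle polling cost does not grow with the number of applications. All names are made up for illustration.

    // shared_poller.cc: hypothetical sketch of one poller serving all apps.
    #include <atomic>
    #include <deque>
    #include <functional>
    #include <mutex>
    #include <optional>
    #include <string>
    #include <unordered_map>

    struct Request { std::string granuleType; std::string payload; };

    class SharedPoller {
      public:
        void registerHandler(const std::string& type,
                             std::function<void(const Request&)> fn) {
            handlers_[type] = std::move(fn);
        }
        // Stands in for a packet arriving on a kernel-bypass NIC queue.
        void enqueue(Request r) {
            std::lock_guard<std::mutex> g(m_);
            queue_.push_back(std::move(r));
        }
        // The machine's single polling loop: drain the queue and dispatch
        // each request to its application's handler until told to stop.
        void run(std::atomic<bool>& stop) {
            while (!stop.load(std::memory_order_relaxed)) {
                std::optional<Request> req;
                {
                    std::lock_guard<std::mutex> g(m_);
                    if (!queue_.empty()) {
                        req = std::move(queue_.front());
                        queue_.pop_front();
                    }
                }
                if (req) handlers_.at(req->granuleType)(*req);
            }
        }
      private:
        std::mutex m_;
        std::deque<Request> queue_;
        std::unordered_map<std::string,
                           std::function<void(const Request&)>> handlers_;
    };

In the single-process model of Assertion #3, the inline dispatch in run() could instead become a fast user-thread creation.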

ASSERTION #3: TRADE MEMORY PROTECTION FOR LOW LATENCY
- Process-switch cost is too high for granule invocation: with a process per application (or per granule type), the shared poller would need to signal between processes. Signal cost is ~5 µs, process-switch cost ~0.5 ms.
- Solution: run all applications inside a single process (see the sketch below).
  - Applications become shared libraries.
  - Dispatch can be fast user-thread creation (<200 ns).
  - The price: we lose memory protection between applications.
- Questions: Is this an acceptable trade-off? Are there other ways to provide memory protection?
[Figure: one process per application behind the shared poller vs. all applications inside the poller's process.]
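A sketch of the single-process model, assuming each application ships as a shared library exporting a common entry symbol ("granule_main" is a made-up convention, not from the talk): the platform dlopen()s every application at startup, after which dispatch is a plain in-process function call.

    // load_app.cc: hypothetical sketch of apps as in-process shared libraries.
    // Build with -ldl on Linux.
    #include <dlfcn.h>
    #include <stdexcept>
    #include <string>

    using GranuleEntry = void (*)(const char* payload);

    GranuleEntry loadApplication(const std::string& path) {
        void* lib = dlopen(path.c_str(), RTLD_NOW);
        if (lib == nullptr) throw std::runtime_error(dlerror());
        // Assume every application exports the same entry symbol.
        auto fn = reinterpret_cast<GranuleEntry>(dlsym(lib, "granule_main"));
        if (fn == nullptr) throw std::runtime_error(dlerror());
        return fn;  // later invocations are plain calls: no process switch
    }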

ASSERTION #4: SCHEDULING MUST BE DECENTRALIZED
- A full cluster might see 1 to 100 billion granules per second: 1 billion/s for 100 µs granules on 100,000 cores, or 100 billion/s for 10 µs granules on 1,000,000 cores. That is too much to schedule centrally.
- Solution: a decentralized scheduler with minimal information, simplified by the low-duty-cycle property. Three levels of scheduling (see the sketch below):
  - Local: schedule on the invoking machine (<200 ns).
  - Remote random: send an RPC (3-5 µs, 1/2 RTT).
  - Remote load balanced: power-of-two choices with RPCs (8-15 µs, 1.5 RTTs).
- Questions: How much load balancing do we really need? How much latency are we willing to pay for load balancing? (Good balancing costs latency.)
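The load-balanced level could use the classic power-of-two-choices rule; a sketch under the assumption that queue depths have already been fetched by probe RPCs (each probe costing roughly half an RTT in a real system):

    // power_of_two.cc: sketch of the load-balanced scheduling level.
    #include <cstddef>
    #include <random>
    #include <vector>

    // Probe two machines picked at random and send the granule to the less
    // loaded one: near-optimal balancing for the cost of two probes. The
    // two picks may coincide, which is harmless in a sketch like this.
    std::size_t powerOfTwoChoice(const std::vector<int>& queueDepth,
                                 std::mt19937& rng) {
        std::uniform_int_distribution<std::size_t> pick(0, queueDepth.size() - 1);
        std::size_t a = pick(rng);
        std::size_t b = pick(rng);
        return queueDepth[a] <= queueDepth[b] ? a : b;
    }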

ASSERTION #5: FAST RESOURCE PREEMPTION IS NECESSARY (1)
- Handling application bursts requires provisioning resources for peak load: without sufficient resources, low latency during high load is impossible.
- Bursty applications are by definition low duty cycle, and low-duty-cycle applications leave significant idle resources during low load.
- Idle resources can be used for low-priority jobs (e.g. batch jobs, high-latency-SLO jobs), improving utilization.
- But aggressive use of idle resources means a granule invocation may require preemption: if a low-priority job is running when a burst occurs, the job must get out of the way, so invoking a granule may incur the cost of preempting it.
- No known preemption mechanism costs under 10s of µs (the best known is Arachne at ~22 µs).

ASSERTION #5: FAST RESOURCE PREEMPTION IS NECESSARY (2)
- Can only have 2 of 3: low latency, high resource utilization, slow preemption.
- Solution 0: give up on high utilization. Leave more resources idle and hope a burst can be detected early enough to mass-preempt resources.
- Solution 1: build a faster preemption mechanism (one possible direction is sketched below).
- Questions: Are there preemption mechanisms we are missing? Can we amortize the cost of preemption across a burst?
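One cheap direction, purely as an assumption for illustration (this is not the Arachne mechanism cited above): cooperative preemption, where low-priority work polls an atomic flag between short work units and yields its core when a burst arrives.

    // coop_preempt.cc: sketch of cooperative preemption via a shared flag.
    #include <atomic>
    #include <thread>

    std::atomic<bool> preemptRequested{false};  // set by the dispatcher on a burst

    // Trivial stand-ins for real batch work; a real runtime would park the
    // core and hand it to the granule dispatcher rather than spin-yield.
    static int workLeft = 1000;
    bool moreWork() { return workLeft > 0; }
    void doOneUnitOfWork() { --workLeft; }
    void yieldCoreToGranules() {
        while (preemptRequested.load()) std::this_thread::yield();
    }

    void batchJob() {
        while (moreWork()) {
            doOneUnitOfWork();  // keep units short so the flag check is frequent
            if (preemptRequested.load(std::memory_order_relaxed)) {
                yieldCoreToGranules();
            }
        }
    }

The preemption cost then depends on the work-unit length, and could be amortized across a burst by keeping the flag set for the burst's duration.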

ASSERTION #6: GRANULES CAN ONLY COMMUNICATE VIA INVOCATION
- Granule execution times are too short for significant communication: a granule might only execute for 10-100 µs, which is not enough time to find another granule (which could be anywhere) and communicate with it.
- Yet granules must have some way to communicate; otherwise the granule model may be too restrictive.
- Solution: communicate during invocation. The information is passed along with the invocation itself, so there is no need to find the target granule (see the sketch below).
- Question: Can we still write interesting applications where all communication happens via invocation?
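In code, the model might look like this (types are illustrative only): all input data travels inside the invocation, and a parent fans out children without any side channel they could use to reach each other mid-execution.

    // invocation.cc: sketch of invocation-carried communication.
    #include <string>
    #include <vector>

    struct Invocation {
        std::string granuleType;  // which granule code to run
        std::vector<char> args;   // all input data, carried by the call itself
    };

    // A parent granule fans out children by constructing invocations;
    // "ProcessChunk" is a hypothetical granule type name.
    std::vector<Invocation> fanOut(const std::vector<std::vector<char>>& chunks) {
        std::vector<Invocation> out;
        for (const auto& c : chunks) {
            out.push_back({"ProcessChunk", c});
        }
        return out;
    }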

ASSERTION #7: DURABILITY OF GRANULES MUST BE BATCHED
- Making computation results durable means a write to a storage system: RAMCloud can do small 100 B writes in 15 µs and larger 1 kB writes in 17 µs.
- Making each individual granule durable would be too slow: a storage-system write could double the latency of a granule.
- Solution: defer durability of a related set of granules and batch the write (see the sketch below).
  - Amortizes write overheads and removes the durability cost for intermediate results.
  - Creates a set of granules that externalize their results together and re-run together in case of failure.
- Question: How do we define the set of granules to be batched?
[Figure: synchronous per-granule writes vs. a single batched write.]
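A sketch of the batching idea, with storageWrite as a hypothetical stand-in for a RAMCloud-style write RPC: granules record results in memory, and one write at the end covers the whole related set.

    // durability_batch.cc: sketch of deferred, batched durability.
    #include <string>
    #include <vector>

    // Hypothetical stand-in: in a real system this would be a single RPC
    // to the storage system covering the whole batch.
    void storageWrite(const std::vector<std::string>& batch) { (void)batch; }

    class DurabilityBatch {
      public:
        // Granules record intermediate results cheaply, in memory only.
        void record(std::string result) {
            pending_.push_back(std::move(result));
        }
        // Called once when the related set of granules completes; only now
        // do results become externally visible. On a failure before this
        // point, the whole set of granules is re-run instead.
        void flush() {
            storageWrite(pending_);  // one write amortized across the set
            pending_.clear();
        }
      private:
        std::vector<std::string> pending_;
    };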

ASSERTION #8: NEED RELIABLE NETWORKS FOR LOW-LATENCY (GEO-)REPLICATION
- Replication traditionally requires synchronous updates to remote storage: intra-datacenter replication takes µs, inter-datacenter (geo-)replication takes ms.
- Replication is needed for durability, but synchronous replication (especially geo-replication) may be too slow: granular computing has granules executing in µs and applications in low ms.
- Solution: rely on replication of messages over a reliable network, assuming a "lifeboats away" model of replication: e.g. if 3 replication requests are sent on the network, the network should guarantee delivery of at least 1 (see the sketch below).
- Question: Can we trust networks to be reliable?
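The "lifeboats away" model in sketch form, with sendToReplica as a hypothetical one-way message send: fire all copies into the network and continue immediately, treating the update as durable once the sends are on the wire, under the assumption that the network delivers at least one copy.

    // lifeboats.cc: sketch of replication without waiting for acks.
    #include <cstddef>
    #include <string>
    #include <vector>

    // Hypothetical one-way send to a replica; no acknowledgment awaited.
    void sendToReplica(std::size_t replica, const std::string& update) {
        (void)replica; (void)update;
    }

    void replicate(const std::vector<std::size_t>& replicas,
                   const std::string& update) {
        for (std::size_t r : replicas) {
            sendToReplica(r, update);  // no round trip: latency stays in µs
        }
        // Under the reliable-network assumption, at least one copy arrives,
        // so the update counts as durable as soon as the sends complete.
    }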

ASSERTION #9: STORAGE SYSTEM MUST EXPOSE DATA LOCATION
- Short granules can't spend much time collecting data: a remote read costs >5 µs, and a 10-100 µs granule execution may not allow (that much) time for remote data reads.
- Colocating a granule with its working dataset can improve performance.
- Solution: have storage systems expose information about data location, so the scheduler can place granules closer to their data (see the sketch below). This requires granules to expose information about their working datasets.
- Questions: For this to work, most storage systems would need to provide location info; is there an easy way for them to do so? Is it possible for granules to know their data dependencies at invocation time?
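A sketch of location-aware placement, assuming a hypothetical locate() call that maps a key to the machine holding it (in a real system this would come from the storage system's tablet or partition map): the scheduler places the granule on that machine when it can, and otherwise falls back to the decentralized scheduler of Assertion #4.

    // locality.cc: sketch of scheduling granules near their data.
    #include <optional>
    #include <string>
    #include <unordered_map>

    class StorageLocator {
      public:
        // Hypothetical location query: key -> host currently storing it.
        std::optional<std::string> locate(const std::string& key) const {
            auto it = keyToHost_.find(key);
            if (it == keyToHost_.end()) return std::nullopt;
            return it->second;
        }
        std::unordered_map<std::string, std::string> keyToHost_;
    };

    // Place the granule with its working set if the location is known;
    // otherwise fall back to normal decentralized scheduling.
    std::string placeGranule(const StorageLocator& loc,
                             const std::string& workingSetKey,
                             const std::string& fallbackHost) {
        return loc.locate(workingSetKey).value_or(fallbackHost);
    }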

CONCLUSION
- What is a granular computing platform? A platform for applications composed of a large number of very small tasks that rapidly scale up and down: 10-100 µs granules, bursty workloads, dynamic code paths.
- Goal of the talk: stimulate discussion.
- Assertions:
  1. All code needs to be pre-loaded and ready to run.
  2. A shared polling infrastructure is needed.
  3. Trade memory protection for low latency.
  4. Scheduling must be decentralized.
  5. Fast resource preemption is necessary.
  6. Granules can only communicate via invocation.
  7. Durability of granules must be batched.
  8. Need reliable networks for low-latency (geo-)replication.
  9. Storage system must expose data location.
- All feedback welcome. Poster session later today.

THANK YOU
QUESTIONS AND FEEDBACK