Tessellation OS. Present and Future. Juan A. Colmenares, Sarah Bird, Andrew Wang, Krste Asanovid, John Kubiatowicz [Parallel Computing Laboratory]

Similar documents
Tessellation: Space-Time Partitioning in a Manycore Client OS

Swarm at the Edge of the Cloud. John Kubiatowicz UC Berkeley Swarm Lab September 29 th, 2013

Akaros. Barret Rhoden, Kevin Klues, Andrew Waterman, Ron Minnich, David Zhu, Eric Brewer UC Berkeley and Google

CS Advanced Operating Systems Structures and Implementation Lecture 18. Dominant Resource Fairness Two-Level Scheduling.

Making Performance Understandable: Towards a Standard for Performance Counters on Manycore Architectures

Hardware Acceleration for Tagging

Integrating the Par Lab Stack Running Damascene on SEJITS/ROS/RAMP Gold

Execution architecture concepts

Department of Computer Science Institute for System Architecture, Operating Systems Group REAL-TIME MICHAEL ROITZSCH OVERVIEW

ECE519 Advanced Operating Systems

BERKELEY PAR LAB. RAMP Gold Wrap. Krste Asanovic. RAMP Wrap Stanford, CA August 25, 2010

CPSC/ECE 3220 Summer 2017 Exam 2

Arachne. Core Aware Thread Management Henry Qin Jacqueline Speiser John Ousterhout

Distributed File Systems Issues. NFS (Network File System) AFS: Namespace. The Andrew File System (AFS) Operating Systems 11/19/2012 CSC 256/456 1

What s An OS? Cyclic Executive. Interrupts. Advantages Simple implementation Low overhead Very predictable

Flexible Architecture Research Machine (FARM)

Abstract. 1 Introduction

Tessellation: Space-Time Partitioning in a Manycore Client OS Rose Liu, Kevin Klues, Sarah Bird, Steven Hofmeyr, Krste Asanović, John Kubiatowicz

No Tradeoff Low Latency + High Efficiency

Re-architecting Virtualization in Heterogeneous Multicore Systems

Deterministic Memory Abstraction and Supporting Multicore System Architecture

Lithe: Enabling Efficient Composition of Parallel Libraries

Reservation-Based Scheduling for IRQ Threads

OVERVIEW. Last Week: But if frequency of high priority task increases temporarily, system may encounter overload: Today: Slide 1. Slide 3.

Multi-core Architecture and Programming

CSE 120 Principles of Operating Systems

Optimized Distributed Data Sharing Substrate in Multi-Core Commodity Clusters: A Comprehensive Study with Applications

Xen and the Art of Virtualization. CSE-291 (Cloud Computing) Fall 2016

CS250 VLSI Systems Design Lecture 9: Patterns for Processing Units and Communication Links

Multimedia-Systems. Operating Systems. Prof. Dr.-Ing. Ralf Steinmetz Prof. Dr. rer. nat. Max Mühlhäuser Prof. Dr.-Ing. Wolfgang Effelsberg

Multilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology

Moneta: A High-Performance Storage Architecture for Next-generation, Non-volatile Memories

Real-Time Cache Management for Multi-Core Virtualization

Practical Considerations for Multi- Level Schedulers. Benjamin

Real-Time Internet of Things

EECS750: Advanced Operating Systems. 2/24/2014 Heechul Yun

Department of Computer Science, Institute for System Architecture, Operating Systems Group. Real-Time Systems '08 / '09. Hardware.

Advanced RDMA-based Admission Control for Modern Data-Centers

Multiprocessor Systems. COMP s1

Abstract. Testing Parameters. Introduction. Hardware Platform. Native System

Operating System. Operating System Overview. Structure of a Computer System. Structure of a Computer System. Structure of a Computer System

RAMP-White / FAST-MP

Multiprocessor and Real- Time Scheduling. Chapter 10

CS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II

Improving Real-Time Performance on Multicore Platforms Using MemGuard

Multimedia Systems 2011/2012

Janus: A-Cross-Layer Soft Real- Time Architecture for Virtualization

a. A binary semaphore takes on numerical values 0 and 1 only. b. An atomic operation is a machine instruction or a sequence of instructions

Operating Systems Overview. Chapter 2

CS2506 Quick Revision

Diffusion TM 5.0 Performance Benchmarks

Chapter 8: Main Memory

Operating Systems Overview

Fractal: A Software Toolchain for Mapping Applications to Diverse, Heterogeneous Architecures

Reservation-Based Scheduling for IRQ Threads

Chapter 8: Memory-Management Strategies

Communication Patterns in Safety Critical Systems for ADAS & Autonomous Vehicles Thorsten Wilmer Tech AD Berlin, 5. March 2018

Multiprocessor System. Multiprocessor Systems. Bus Based UMA. Types of Multiprocessors (MPs) Cache Consistency. Bus Based UMA. Chapter 8, 8.

Hard Real-time Scheduling for Parallel Run-time Systems

Commercial Real-time Operating Systems An Introduction. Swaminathan Sivasubramanian Dependable Computing & Networking Laboratory

CS370 Operating Systems Midterm Review

Deduplication File System & Course Review

CPSC/ECE 3220 Summer 2018 Exam 2 No Electronics.

Tales of the Tail Hardware, OS, and Application-level Sources of Tail Latency

Why Study Multimedia? Operating Systems. Multimedia Resource Requirements. Continuous Media. Influences on Quality. An End-To-End Problem

Tuned Pipes: End-to-end Throughput and Delay Guarantees for USB Devices. Ahmad Golchin, Zhuoqun Cheng and Richard West Boston University

19: I/O Devices: Clocks, Power Management

CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES

Scheduling II. Today. Next Time. ! Proportional-share scheduling! Multilevel-feedback queue! Multiprocessor scheduling. !

Learning Outcomes. Scheduling. Is scheduling important? What is Scheduling? Application Behaviour. Is scheduling important?

Uniprocessor Scheduling. Basic Concepts Scheduling Criteria Scheduling Algorithms. Three level scheduling

Timers 1 / 46. Jiffies. Potent and Evil Magic

Memory management. Last modified: Adaptation of Silberschatz, Galvin, Gagne slides for the textbook Applied Operating Systems Concepts

Chapter 8: Main Memory. Operating System Concepts 9 th Edition

Device-Functionality Progression

Chapter 12: I/O Systems. I/O Hardware

CSCI-GA Operating Systems Lecture 3: Processes and Threads -Part 2 Scheduling Hubertus Franke

Main Points of the Computer Organization and System Software Module

Multiprocessor Systems. Chapter 8, 8.1

Windows Server 2003 NetBench Performance Report

Low-Latency Datacenters. John Ousterhout Platform Lab Retreat May 29, 2015

Last class: Today: Course administration OS definition, some history. Background on Computer Architecture

CS Advanced Operating Systems Structures and Implementation Lecture 13. Two-Level Scheduling (Con t) Segmentation/Paging/Virtual Memory

Chapter 8: Main Memory

Outline Background Jaluna-1 Presentation Jaluna-2 Presentation Overview Use Cases Architecture Features Copyright Jaluna SA. All rights reserved

Lecture 2: September 9

Operating System Support for Multimedia. Slides courtesy of Tay Vaughan Making Multimedia Work

CPU Scheduling. Operating Systems (Fall/Winter 2018) Yajin Zhou ( Zhejiang University

TDT 4260 lecture 3 spring semester 2015

8: Scheduling. Scheduling. Mark Handley

MiAMI: Multi-Core Aware Processor Affinity for TCP/IP over Multiple Network Interfaces

Design Principles for End-to-End Multicore Schedulers

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors

Placement de processus (MPI) sur architecture multi-cœur NUMA

COMP SCI 3SH3: Operating System Concepts (Term 2 Winter 2006) Test 2 February 27, 2006; Time: 50 Minutes ;. Questions Instructor: Dr.

Outline. Database Tuning. Ideal Transaction. Concurrency Tuning Goals. Concurrency Tuning. Nikolaus Augsten. Lock Tuning. Unit 8 WS 2013/2014

Memory Management. Disclaimer: some slides are adopted from book authors slides with permission 1

Using Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology

CIS Operating Systems Memory Management Cache and Demand Paging. Professor Qiang Zeng Spring 2018

Objectives and Functions Convenience. William Stallings Computer Organization and Architecture 7 th Edition. Efficiency

Transcription:

Tessellation OS Present and Future Juan A. Colmenares, Sarah Bird, Andrew Wang, Krste Asanovid, John Kubiatowicz [Parallel Computing Laboratory] Steven Hofmeyr, John Shalf [Lawrence Berkeley National Laboratory] Presented by Juan A. Colmenares at the Par Lab Winter Retreat January 13 th, 2011 The Granlibakken Conference Center, Tahoe City, CA

Content Tessellation OS: Goals and vision How the vision is turning into reality Future work Final remarks

Basic Goals in Tessellation OS Support a simultaneous mix of high-throughput parallel, interactive, and real-time applications Allow applications to consistently deliver performance In presence of other applications with conflicting requirements Enable rapid adaptation Provide scalability

Two-level Scheduling Monolithic CPU and Resource Scheduling Coarse-grained Resource Allocation and Distribution (Level 1) Chunks of resources distributed to application or system components Global decisions Option to simply turn off unused resources Fine-grained Application-specific Scheduling (Level 2) Applications are allowed to utilize their resources in any way they see fit Local Decisions Other components cannot interfere with their use of resources Performance Isolation

Spatial Partitioning Key for Performance Isolation Each partition receives a vector of dedicated basic resources A number of processing elements (e.g., cores) A portion of physical memory A portion of shared cache memory A fraction of memory bandwidth A partition may also receive Guaranteed fractional services from other partitions (via channels) e.g., network service and file service Exclusive access to other resources e.g., certain hardware devices

Space Space-Time Partitioning Spatial Partitioning Varies over Time grows Time Time Partitioning adapts to the needs of the system Partitions can be time multiplexed Resources are gang-scheduled Use affinity or HW partitioning to avoid cross-partition interference Multiplexing at a coarse granularity Scheduling resources as proactively as possible

CPU L1 L2 Bank The Cell: Our Partitioning Abstraction Defining the Partitioned Environment Active Cell Address Space A 2nd-level Scheduling Address Space B Task 2nd-level Memory Mgmt Tessellation Kernel (Partition Support) CPU L1 L2 Bank CPU L1 L1 Interconnect L2 Bank CPU L1 L2 Bank DRAM & I/O Interconnect CPU L1 L2 Bank CPU L1 L2 Bank Another Cell Container for parallel software components providing guaranteed access to resources Basic properties of a cell Full control over resources it owns when mapped to hardware ( Bare Metal ) One or more address spaces Communication channels DRAM DRAM DRAM DRAM DRAM DRAM (+) Fraction of memory bandwidth

Basis of a Component-based Model with Composable Performance Parallel Library Device Drivers Channel Real-time Cell Channel Core Application Application Component File Service Applications = Set of interacting components deployed on different cells Applications split into performance-incompatible and mutually distrusting cells with controlled communication OS services are independent servers that provide QoS Requires fast inter-cell communication Could use hardware acceleration for fast messaging

Tessellation Kernel Partition #1 Partition #2 Partition #3 Policy Service Cell #1 Cell #2 Cell #3 Cell Creation and Resizing Requests from Users ACK/NACK Space-Time Resource Graph (STRG) Partition Mapping and Multiplexing Layer Resource Allocation Architecture Admission Control Minor Changes All system resources Cell group with fraction of resources Cell Cell Cell Cell Major Change Request ACK/NACK STRG Validator Resource Planner Resource Allocation and Adaptation Mechanism Offline Models and Behavioral Parameters (Current Resources) Partition Multiplexing Global Policies / User Policies and Preferences Online Performance Monitoring, Model Building, and Prediction User/System Partition Mechanism Layer Partition Implementation QoS Enforcement Channel Authenticator Network Bandwidth Cache/ Physical Cores Local Store Memory Disks NICs Performance Counters Partitionable Hardware Resources

10 audio channels A Music Application on Tessellation OS Prepared by: Juan A. Colmenares, Ian Saxton, Rimas Avizienis, Eric Battenberg, Krste Asanović, David Wessel, and John Kubiatowicz [CNMAT, Par Lab Music & OS Groups] Multi-channel audio DSP graph Input Music Program Output Ethernet Switch Cell B Network Service Input buffer with 32 samples per channel generated every 0.6 ms Pure Data (pd) Patch Audio Processing & Synthesis Engine Scheduler Lithe Runtime Cell A Hardware cores CNMAT s Ethernet Audio Device

Time Multiplexing of Cells Basic Policies in Tessellation OS Always Active (Pinned) Cells Start time Core 0 Core 1 Core 2 Time-triggered Cells Type 1: Single activation per period at the beginning of each period Start time Active time Core 3 Core 4 Period Time-triggered Cells Type 2: Multiple activations per period Core 3 Core 4 Core 5 Start time Total the specified active time Period

Time Multiplexing of Cells Ongoing/Future Work Event-triggered Cells Will use an adaptation of a Dynamic Priority Server for aperiodic tasks Particularly interested in the Constant Bandwidth Server [AB04] Never exceeds assigned fraction of processor time (i.e., bandwidth), even in presence of overloads Good responsiveness and utilization Does not need to know the execution time of the tasks Events Cell Activations [AB04] L. Abeni, G. Butazzo. Resource reservations in dynamic real-time systems. Real-Time Systems, 27(2): 123-165, 2004

Time Multiplexing of Cells There are also Best-effort Cells (with no guarantees) Precedence Rule For example, no Event-triggered Cell should compromise the timing behavior of the Time-triggered Cells or other Event-triggered Cells in the system Admission Control and Adaptation Mechanisms must ensure that Plan of cell activations/suspensions in advance, whenever possible Simplicity in Kernel s Mapping and Multiplexing Layer Policy Service (i.e., Admission Control and Adaptation)

2 nd -Level Scheduling Lithe harts Cell A Application Library A Library B Sched 1 Sched 2 Sched 3 Lithe Runtime Lithe[PAH10] has been ported to Tessellation OS Next steps Port Lithe-based parallel libraries (e.g., TBB and OpenMP) Evaluate performance of parallel applications Tessellation Kernel (Partition Support) Hardware cores [PHA10] H. Pan, B. Hindman, K. Asanovic. Composing parallel software efficiently with Lithe. PLDI 10.

2 nd -Level Scheduling Ongoing/Future Work Cell B Preemptive Scheduling Framework An initial implementation exists with a basic round-robin scheduler Particular interest in EDF schedulers for multi-core (e.g., AIRS [KRI10] ) Load balancers (e.g., Speed Balancing [HIB10] ) Driving to an ABI that incorporates both preemptive and cooperative scheduling in the same framework Application Preemptive Scheduling Framework Timer interrupts Scheduler X Tessellation Kernel (Partition Support) Hardware cores [KRI10] S. Kato, R. Rajkumar, Y. Ishikawa. Supporting interactive real-time applications on multicore platforms. ECRTS 10. [HIB10] S. Hofmeyr, C. Iancu, F. Blagojevic. Load balancing on speed. PPoPP 10.

Parallel Library Inter-Cell Channels Channel Channel Channel Net Service Real-time Cell Core Application Very simple and non-optimized implementation based on the Non-Blocking Buffer (NBB) [Lam77, KCK07] Enables lock-free data passing between a single producer and a single consumer Only requirement: Atomicity of reads and writes of integers Lot of copying in this early implementation! Operate in two modes Polling Notification events and callback functions [Lam77] L. Lamport. Proving the Correctness of Multiprocess Programs. IEEE Trans. on Software Engineering, SE-3(2): 125-143, 1977. [KCK07] K. Kim, J. Colmenares, et al. Efficient adaptations of the non-blocking buffer for event message communication between real-time threads. ISORC 07.

Cell A Net Service Δt Inter-Cell Channels Preliminary Experimental Results Send( ) Receive( ) Experiment A Channel IO Handler Polling mode Music App Audio Processing & Synthesis Engine Cell B Experiment B Input Output Loopback Audio Graph Test platform 2.66 GHz Intel Core i7 (quad-core processor) Hyper threading 3GB RAM

Cell A Net Service Δt Inter-Cell Channels Preliminary Experimental Results Send( ) Receive( ) Experiment A Channel IO Handler Polling mode Music App Audio Processing & Synthesis Engine Cell B Experiment B Input Output Round-trip Time - Experiment B 100,000 measurements @ 2.66GHz, Message Size = 1280 bytes Average = 21.08 µs (56,179 cycles) Test platform 2.66 GHz Intel Core i7 (quad-core processor) Hyper threading 3GB RAM Requirement CNMAT s Ethernet Audio Device Loopback Audio Graph Input buffer with 1280 bytes generated every 0.6 ms

What has been difficult so far? Coordination between Time multiplexing of Cells 2nd-level scheduling (especially the preemptive one) Event notifications in user space (contributed by Barret Rhoden and David Zhu) Cell adaptation (i.e., resizing) Very easy to go in a direction that Negatively affects performance isolation Prevent us from providing performance guarantees And of course, expected issues Virtual Machine (e.g., QEMU)!= Real Machine Dealing with IO devices is time consuming

Partition #1 Partition #2 Partition #3 Policy Service Cell #1 Cell #2 Cell #3 Future Work Close the adaptation loop in the Policy Service and implement PACORA in Tessellation OS Cell Creation and Resizing Requests from Users ACK/NACK Space-Time Resource Graph (STRG) Network Bandwidth Admission Control Minor Changes All system resources Cell group with fraction of resources Cell Cell Cell Cell Major Change Request ACK/NACK Resource Allocation and Adaptation Mechanism Offline Models and Behavioral Parameters (Current Resources) Tessellation Kernel Cache/ Physical Cores Local Store Memory Disks NICs Global Policies / User Policies and Preferences Online Performance Monitoring, Model Building, and Prediction Performance Counters Partitionable Hardware Resources User/System

Future Work QoS-aware OS Services Focus on soft guarantees with high confidence levels Service Level Agreements (SLAs) But How do we make them practical? What might we put into our SLAs? How to enforce them? Plan to use models that tell how service properties vary with resources Close the gap between 1) the performance the user wants and 2) the required resources to achieve such performance Basic services Network Service (being modified) GUI Service (right next) [Andrew Wang s work] Others (e.g., File Service) later Internal Services Energy control (disabling of unused resources) Integration of ACPI Component Architecture (mostly working!) [Aki Kohiga s work] Extension of the Mapping Layer and Mechanism Layer

Future Work Virtual Memory Management Will use the 2-level approach and treat memory like the other resources Assign regions of physical address space to Cells Let user-level runtime do what it wants with full information about memory pages More than one address space in a Cell But What interfaces to give to runtime? What HW & SW mechanisms will enable that? What about the illusion of very large memory? Revocable and non-revocable pages Will this approach help simplify memory management? Evaluation of novel hardware partitioning mechanisms using RAMP Gold/Midas that benefits Tessellation OS e.g., shared cache, IO, and channels

Final Remarks Vision of Tessellation OS [Space-time partitioning] + [Two-level scheduling] Cells and inter-cell channels Implementation status Basic time multiplexing policies for Cells 2 nd -level runtimes Lithe (cooperative scheduling) Preemptive scheduling Channels Future work Policy service and adaptive resource allocation QoS-aware services Virtual memory management Hardware partitioning mechanisms

Questions? THANKS