Tessellation OS: Present and Future. Juan A. Colmenares, Sarah Bird, Andrew Wang, Krste Asanović, John Kubiatowicz [Parallel Computing Laboratory]

Steven Hofmeyr, John Shalf [Lawrence Berkeley National Laboratory]
Presented by Juan A. Colmenares at the Par Lab Winter Retreat, January 13th, 2011, The Granlibakken Conference Center, Tahoe City, CA

Content
- Tessellation OS: goals and vision
- How the vision is turning into reality
- Future work
- Final remarks

Basic Goals in Tessellation OS
- Support a simultaneous mix of high-throughput parallel, interactive, and real-time applications
- Allow applications to consistently deliver performance
  - Even in the presence of other applications with conflicting requirements
- Enable rapid adaptation
- Provide scalability

Two-level Scheduling
Monolithic CPU and resource scheduling is split into two levels.
- Level 1: coarse-grained resource allocation and distribution
  - Chunks of resources are distributed to applications or system components
  - Global decisions
  - Option to simply turn off unused resources
- Level 2: fine-grained, application-specific scheduling
  - Applications are allowed to utilize their resources in any way they see fit
  - Local decisions
  - Other components cannot interfere with their use of resources (performance isolation)

Spatial Partitioning
Key for performance isolation
- Each partition receives a vector of dedicated basic resources
  - A number of processing elements (e.g., cores)
  - A portion of physical memory
  - A portion of shared cache memory
  - A fraction of memory bandwidth
- A partition may also receive
  - Guaranteed fractional services from other partitions (via channels), e.g., network service and file service
  - Exclusive access to other resources, e.g., certain hardware devices
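The resource vector described above can be pictured as a small record, with admission reduced to a component-wise capacity check. The following is only an illustrative sketch: the class name, the field names, the units (cache ways, bandwidth percent), and the `can_admit` helper are all assumptions made for this example and are not Tessellation OS interfaces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceVector:
    """Dedicated basic resources of one partition (field names are illustrative)."""
    cores: int               # processing elements
    memory_mb: int           # portion of physical memory
    cache_ways: int          # portion of shared cache (assumed way-partitioned)
    mem_bandwidth_pct: int   # fraction of memory bandwidth, in percent

    def __add__(self, other):
        # Component-wise sum of two resource vectors.
        return ResourceVector(
            self.cores + other.cores,
            self.memory_mb + other.memory_mb,
            self.cache_ways + other.cache_ways,
            self.mem_bandwidth_pct + other.mem_bandwidth_pct,
        )

    def fits_within(self, total):
        # True if every component is within the corresponding total.
        return (
            self.cores <= total.cores
            and self.memory_mb <= total.memory_mb
            and self.cache_ways <= total.cache_ways
            and self.mem_bandwidth_pct <= total.mem_bandwidth_pct
        )


def can_admit(machine, existing, request):
    """Return True if the request fits next to the existing partitions."""
    demand = request
    for partition in existing:
        demand = demand + partition
    return demand.fits_within(machine)
```

Because every component is dedicated rather than shared, the check is a plain sum: a request is refused as soon as any single resource would be oversubscribed.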

Space-Time Partitioning
- Spatial partitioning varies over time
  - Partitioning adapts to the needs of the system
- Partitions can be time multiplexed
  - Resources are gang-scheduled
  - Use affinity or HW partitioning to avoid cross-partition interference
  - Multiplexing at a coarse granularity
  - Scheduling resources as proactively as possible
[Figure: spatial partitions changing over time; axes: Space, Time]

The Cell: Our Partitioning Abstraction
Defining the partitioned environment
- A cell is a container for parallel software components that provides guaranteed access to resources
- Basic properties of a cell
  - Full control over the resources it owns when mapped to hardware ("bare metal")
  - One or more address spaces
  - Communication channels
  - (+) Fraction of memory bandwidth
[Figure: an active cell (Address Space A, Address Space B, tasks, 2nd-level scheduling, 2nd-level memory management) and another cell on top of the Tessellation Kernel (Partition Support); hardware shown as CPUs with L1 caches, an L1 interconnect, L2 banks, a DRAM & I/O interconnect, and DRAM]

Basis of a Component-based Model with Composable Performance
- Applications = set of interacting components deployed on different cells
  - Applications are split into performance-incompatible and mutually distrusting cells with controlled communication
- OS services are independent servers that provide QoS
- Requires fast inter-cell communication
  - Could use hardware acceleration for fast messaging
[Figure: cells connected by channels; labels: Core Application, Application Component, Parallel Library, Real-time Cell, Device Drivers, File Service]

Resource Allocation Architecture
[Figure: block diagram of the resource allocation architecture; labels grouped by block]
- Policy Service
  - Admission Control: receives cell creation and resizing requests from users; replies with ACK/NACK
  - Minor changes; major change requests (ACK/NACK)
  - Resource Allocation and Adaptation Mechanism
  - Offline Models and Behavioral Parameters
  - Online Performance Monitoring, Model Building, and Prediction
  - Global Policies / User Policies and Preferences
- Space-Time Resource Graph (STRG): all system resources; cell groups with a fraction of resources; cells
- Tessellation Kernel
  - STRG Validator, Resource Planner (current resources)
  - Partition Mapping and Multiplexing Layer: partition multiplexing
  - Partition Mechanism Layer: partition implementation, QoS enforcement, channel authenticator
- Partitionable Hardware Resources: cores, physical memory, network bandwidth, cache/local store, disks, NICs, performance counters
- Partitions #1 to #3 hosting Cells #1 to #3 (user/system)

A Music Application on Tessellation OS
Prepared by: Juan A. Colmenares, Ian Saxton, Rimas Avizienis, Eric Battenberg, Krste Asanović, David Wessel, and John Kubiatowicz [CNMAT, Par Lab Music & OS Groups]
- Multi-channel audio DSP graph: Input, Music Program, Output
- 10 audio channels
- Input buffer with 32 samples per channel generated every 0.6 ms
- CNMAT's Ethernet Audio Device, connected through an Ethernet switch
[Figure: Cell A and Cell B on hardware cores; labels: Network Service, Pure Data (pd) patch, Audio Processing & Synthesis Engine, Scheduler, Lithe Runtime]

Time Multiplexing of Cells: Basic Policies in Tessellation OS
- Always Active (Pinned) Cells
- Time-triggered Cells, Type 1: single activation per period, at the beginning of each period
  - Parameters: start time, period, active time
- Time-triggered Cells, Type 2: multiple activations per period
  - The activations in a period total the specified active time
  - Parameters: start time, period, active time
[Figure: activation timelines for each policy on Cores 0 to 5]
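The two time-triggered policies above can be illustrated by listing the intervals during which a cell is active. This is a minimal sketch under assumptions of my own: times are plain integers in an arbitrary unit, a Type 2 cell is described by hypothetical `(offset, length)` slices inside each period, and the function names are invented for this example rather than taken from the kernel.

```python
def type2_activations(start, period, slices, horizon):
    """Active intervals of a Type 2 cell up to (not including) horizon.

    slices is a list of (offset, length) pairs inside one period; their
    lengths add up to the cell's active time per period.
    """
    for offset, length in slices:
        if offset < 0 or length <= 0 or offset + length > period:
            raise ValueError("slice must lie inside one period")
    intervals = []
    t = start
    while t < horizon:
        for offset, length in slices:
            begin = t + offset
            if begin < horizon:
                # Clip the last interval at the horizon.
                intervals.append((begin, min(begin + length, horizon)))
        t += period
    return intervals


def type1_activations(start, period, active, horizon):
    """Type 1 is the special case of one slice at the start of each period."""
    return type2_activations(start, period, [(0, active)], horizon)


def active_time_per_period(slices):
    """Total active time that a set of slices grants in one period."""
    return sum(length for _, length in slices)
```

For instance, `type1_activations(0, 10, 3, 25)` yields one interval of length 3 at the start of each of the three periods that begin before time 25.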

Time Multiplexing of Cells: Ongoing/Future Work
- Event-triggered Cells
  - Will use an adaptation of a dynamic-priority server for aperiodic tasks
  - Particularly interested in the Constant Bandwidth Server [AB04]
    - Never exceeds its assigned fraction of processor time (i.e., bandwidth), even in the presence of overloads
    - Good responsiveness and utilization
    - Does not need to know the execution time of the tasks
[Figure: events triggering cell activations]
[AB04] L. Abeni, G. Buttazzo. Resource reservation in dynamic real-time systems. Real-Time Systems, 27(2): 123-165, 2004.
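To make the bandwidth guarantee concrete, here is a minimal sketch of the textbook Constant Bandwidth Server bookkeeping (budget and deadline updates only). It is not the adaptation planned for Tessellation OS, which the slides do not detail; the class and method names are hypothetical, times are integers, and the EDF scheduler that would consume the server deadline is left out.

```python
class ConstantBandwidthServer:
    """Budget/deadline bookkeeping of a CBS with bandwidth budget/period."""

    def __init__(self, budget, period):
        if not 0 < budget <= period:
            raise ValueError("need 0 < budget <= period")
        self.max_budget = budget   # Q: execution time granted per period
        self.period = period       # T: server period
        self.budget = 0            # c: remaining budget
        self.deadline = 0          # d: current server deadline (EDF priority)

    def on_arrival(self, now):
        """An event arrives while the server is idle."""
        # If the leftover budget could exceed the bandwidth before the
        # current deadline, start afresh; otherwise keep budget and deadline.
        # Integer form of: c >= (d - now) * Q / T.
        if self.budget * self.period >= (self.deadline - now) * self.max_budget:
            self.deadline = now + self.period
            self.budget = self.max_budget

    def consume(self, amount):
        """Charge execution time to the server."""
        while amount > 0:
            used = min(amount, self.budget)
            self.budget -= used
            amount -= used
            if self.budget == 0:
                # Budget exhausted: recharge and postpone the deadline, which
                # lowers the server's priority instead of stealing time.
                self.budget = self.max_budget
                self.deadline += self.period
```

An overloaded server is never stopped outright; each exhaustion pushes its deadline one period further, so under EDF it cannot take more than budget/period of the processor from other cells. Note that the rules never need the execution time of a task in advance.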

Time Multiplexing of Cells
- There are also Best-effort Cells (with no guarantees)
- Precedence rule
  - For example, no Event-triggered Cell should compromise the timing behavior of the Time-triggered Cells or other Event-triggered Cells in the system
  - The admission control and adaptation mechanisms must ensure that this rule holds
- Plan cell activations/suspensions in advance, whenever possible
  - Simplicity in the kernel's Mapping and Multiplexing Layer
  - Policy Service (i.e., Admission Control and Adaptation)

2nd-Level Scheduling: Lithe
- Lithe [PHA10] has been ported to Tessellation OS
- Next steps
  - Port Lithe-based parallel libraries (e.g., TBB and OpenMP)
  - Evaluate performance of parallel applications
[Figure: Cell A with an application, Library A, and Library B; Sched 1 to Sched 3 on the Lithe runtime; harts; Tessellation Kernel (Partition Support); hardware cores]
[PHA10] H. Pan, B. Hindman, K. Asanović. Composing parallel software efficiently with Lithe. PLDI '10.

2nd-Level Scheduling: Ongoing/Future Work
- Preemptive Scheduling Framework
  - An initial implementation exists with a basic round-robin scheduler
  - Particular interest in
    - EDF schedulers for multi-core (e.g., AIRS [KRI10])
    - Load balancers (e.g., Speed Balancing [HIB10])
- Driving to an ABI that incorporates both preemptive and cooperative scheduling in the same framework
[Figure: Cell B with an application on the Preemptive Scheduling Framework (Scheduler X, timer interrupts); Tessellation Kernel (Partition Support); hardware cores]
[KRI10] S. Kato, R. Rajkumar, Y. Ishikawa. Supporting interactive real-time applications on multicore platforms. ECRTS '10.
[HIB10] S. Hofmeyr, C. Iancu, F. Blagojevic. Load balancing on speed. PPoPP '10.

Inter-Cell Channels
- Very simple and non-optimized implementation based on the Non-Blocking Buffer (NBB) [Lam77, KCK07]
  - Enables lock-free data passing between a single producer and a single consumer
  - Only requirement: atomicity of reads and writes of integers
  - A lot of copying in this early implementation!
- Channels operate in two modes
  - Polling
  - Notification events and callback functions
[Figure: cells connected by channels; labels: Core Application, Parallel Library, Real-time Cell, Net Service]
[Lam77] L. Lamport. Proving the Correctness of Multiprocess Programs. IEEE Trans. on Software Engineering, SE-3(2): 125-143, 1977.
[KCK07] K. Kim, J. Colmenares, et al. Efficient adaptations of the non-blocking buffer for event message communication between real-time threads. ISORC '07.
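The single-producer/single-consumer idea behind such a channel can be sketched as a ring buffer with two integer counters, each written by only one side. The code below is a simplified illustration in the spirit of the NBB, not the NBB as published and not the Tessellation implementation; the class and method names are hypothetical. A native version would additionally have to make sure that a slot write becomes visible before the counter update that publishes it.

```python
class SpscChannel:
    """Simplified lock-free ring for one producer and one consumer."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._slots = [None] * capacity
        self._capacity = capacity
        self._sent = 0       # written only by the producer
        self._received = 0   # written only by the consumer

    def try_send(self, message):
        """Producer side: return False instead of blocking when full."""
        if self._sent - self._received == self._capacity:
            return False
        self._slots[self._sent % self._capacity] = message
        self._sent += 1      # publish only after the slot is filled
        return True

    def try_receive(self):
        """Consumer side (polling mode): return (ok, message)."""
        if self._received == self._sent:
            return (False, None)
        message = self._slots[self._received % self._capacity]
        self._received += 1  # release the slot only after it has been read
        return (True, message)
```

Since each counter has a single writer, the only atomicity required is that of reading and writing one integer, which matches the requirement stated on the slide. Both operations return immediately, so a caller in polling mode simply retries.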

Inter-Cell Channels: Preliminary Experimental Results
- Round-trip time, Experiment B
  - 100,000 measurements @ 2.66 GHz, message size = 1280 bytes
  - Average = 21.08 µs (56,179 cycles)
- Requirement (CNMAT's Ethernet Audio Device): input buffer with 1280 bytes generated every 0.6 ms
- Test platform: 2.66 GHz Intel Core i7 (quad-core processor), Hyper-Threading, 3 GB RAM
[Figure: Cell A and Cell B exchanging messages over a channel with Send( ) and Receive( ), with Δt the measured time; labels: Net Service, Channel IO Handler (polling mode), Music App, Audio Processing & Synthesis Engine, Loopback Audio Graph (Input, Output), Experiment A, Experiment B]
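A quick back-of-the-envelope check puts the measured round trip in relation to the audio deadline. The figures are taken from the slides (10 channels, 32 samples per channel, 1280-byte messages, a 0.6 ms period, a 21.08 µs average round trip); the 4-byte sample width is only inferred here, on the assumption that one message carries exactly one input buffer.

```python
CHANNELS = 10              # audio channels (from the music application slide)
SAMPLES_PER_CHANNEL = 32   # samples per channel in one input buffer
MESSAGE_BYTES = 1280       # message size used in Experiment B
PERIOD_US = 600.0          # a new input buffer every 0.6 ms
ROUND_TRIP_US = 21.08      # average measured round-trip time

# Sample width implied if one message holds exactly one input buffer.
bytes_per_sample = MESSAGE_BYTES // (CHANNELS * SAMPLES_PER_CHANNEL)

# Share of each period spent on the channel round trip, and what is left
# for audio processing and everything else.
budget_fraction = ROUND_TRIP_US / PERIOD_US
slack_us = PERIOD_US - ROUND_TRIP_US
```

Under these assumptions the average round trip takes roughly 3.5% of each 0.6 ms period, which leaves close to 579 µs for the audio graph itself. Only the average is reported, so this says nothing about worst-case latency.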

What has been difficult so far?
- Coordination between
  - Time multiplexing of Cells
  - 2nd-level scheduling (especially the preemptive one)
  - Event notifications in user space (contributed by Barret Rhoden and David Zhu)
  - Cell adaptation (i.e., resizing)
- It is very easy to go in a direction that
  - Negatively affects performance isolation
  - Prevents us from providing performance guarantees
- And, of course, expected issues
  - Virtual machine (e.g., QEMU) != real machine
  - Dealing with IO devices is time consuming

Future Work
- Close the adaptation loop in the Policy Service and implement PACORA in Tessellation OS
[Figure: resource allocation architecture; labels: Policy Service (Admission Control, Resource Allocation and Adaptation Mechanism, Offline Models and Behavioral Parameters, Online Performance Monitoring, Model Building, and Prediction, Global Policies / User Policies and Preferences); cell creation and resizing requests from users (ACK/NACK); minor changes; major change requests (ACK/NACK); Space-Time Resource Graph (STRG) with all system resources, cell groups with a fraction of resources, and cells; current resources; Tessellation Kernel; Partitions #1 to #3 hosting Cells #1 to #3 (user/system); Partitionable Hardware Resources: cores, physical memory, network bandwidth, cache/local store, disks, NICs, performance counters]

Future Work: QoS-aware OS Services
- Focus on soft guarantees with high confidence levels
  - Service Level Agreements (SLAs)
  - But: How do we make them practical? What might we put into our SLAs? How do we enforce them?
- Plan to use models that tell how service properties vary with resources
  - Close the gap between 1) the performance the user wants and 2) the resources required to achieve that performance
- Basic services
  - Network Service (being modified)
  - GUI Service (next) [Andrew Wang's work]
  - Others (e.g., File Service) later
- Internal services
  - Energy control (disabling of unused resources)
  - Integration of the ACPI Component Architecture (mostly working!) [Aki Kohiga's work]
  - Extension of the Mapping Layer and Mechanism Layer

Future Work: Virtual Memory Management
- Will use the 2-level approach and treat memory like the other resources
  - Assign regions of the physical address space to Cells
  - Let the user-level runtime do what it wants, with full information about memory pages
  - More than one address space in a Cell
- But: What interfaces should the runtime get? What HW & SW mechanisms will enable that? What about the illusion of very large memory?
  - Revocable and non-revocable pages
- Will this approach help simplify memory management?
- Evaluation of novel hardware partitioning mechanisms that benefit Tessellation OS, using RAMP Gold/Midas
  - e.g., shared cache, IO, and channels

Final Remarks
- Vision of Tessellation OS
  - [Space-time partitioning] + [Two-level scheduling]
  - Cells and inter-cell channels
- Implementation status
  - Basic time multiplexing policies for Cells
  - 2nd-level runtimes
    - Lithe (cooperative scheduling)
    - Preemptive scheduling
  - Channels
- Future work
  - Policy service and adaptive resource allocation
  - QoS-aware services
  - Virtual memory management
  - Hardware partitioning mechanisms

Questions? THANKS