The Multikernel: A new OS architecture for scalable multicore systems
Baumann et al.
Presentation: Mark Smith

Review
Introduction
Optimizing the OS based on hardware
Processor changes
Shared Memory vs Messaging
Barrelfish Overview
Barrelfish Goals
Barrelfish Architecture
Barrelfish Evaluations

Improving concurrency with shared memory data structures
  Non-blocking solutions: data structure specific
    CAS and CAS2
    Load Linked/Store Conditional
  Non-blocking solutions: more general approaches
    Hazard Pointers
    RCU
    Transactional Memory
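The CAS item above can be illustrated with a retry loop. This is a minimal sketch, not the presenter's code: Python has no compare-and-swap instruction, so the hypothetical `AtomicCell` class emulates one with a private lock, and the names `atomic_increment` and `run_workers` are assumptions made for illustration.

```python
import threading


class AtomicCell:
    """Hypothetical stand-in for a machine word that supports compare-and-swap.

    Python exposes no CAS instruction, so a private lock emulates atomicity.
    """

    def __init__(self, value):
        self._value = value
        self._guard = threading.Lock()

    def load(self):
        with self._guard:
            return self._value

    def compare_and_swap(self, expected, new):
        """Store `new` only if the cell still holds `expected`; report success."""
        with self._guard:
            if self._value == expected:
                self._value = new
                return True
            return False


def atomic_increment(cell):
    """CAS retry loop: read, compute, try to publish, retry on conflict."""
    while True:
        old = cell.load()
        if cell.compare_and_swap(old, old + 1):
            return old + 1


def run_workers(cell, workers=4, increments=1000):
    """Increment the cell from several threads and return the final value."""

    def work():
        for _ in range(increments):
            atomic_increment(cell)

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cell.load()
```

No thread ever holds a lock across the read-compute-write sequence; a thread that loses the race simply retries with the fresh value.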

Improving concurrency with shared memory data structures
  Improving spin-locks
    Exponential back-off
    Time delay
    Queuing
  Several improvement techniques drawn from network arbitration schemes (CSMA)
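A rough sketch of the exponential back-off idea follows. The class name `BackoffSpinLock`, the helper `backoff_schedule`, and the delay values are assumptions for illustration; a non-blocking `threading.Lock.acquire` stands in for a hardware test-and-set.

```python
import random
import threading
import time


def backoff_schedule(base_delay, max_delay, attempts):
    """Upper bound of the wait window after each failed attempt (doubles, then caps)."""
    delays = []
    delay = base_delay
    for _ in range(attempts):
        delays.append(delay)
        delay = min(delay * 2, max_delay)
    return delays


class BackoffSpinLock:
    """Test-and-set style lock that backs off exponentially under contention."""

    def __init__(self, base_delay=1e-6, max_delay=1e-3):
        # The inner lock is used only as an atomic test-and-set bit (assumption).
        self._flag = threading.Lock()
        self.base_delay = base_delay
        self.max_delay = max_delay

    def acquire(self):
        """Spin until the lock is taken; return the number of attempts made."""
        delay = self.base_delay
        attempts = 1
        while not self._flag.acquire(blocking=False):
            # Randomised wait, as in CSMA-style arbitration, to avoid lock-step retries.
            time.sleep(random.uniform(0, delay))
            delay = min(delay * 2, self.max_delay)
            attempts += 1
        return attempts

    def release(self):
        self._flag.release()
```

Doubling the window after each failure reduces the number of cores hammering the same cache line, at the cost of some added latency once the lock becomes free.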

Updating the problem statement
  Synchronization versus locality
  Remote memory access versus messaging
  Tornado
    Object-oriented with Clustered Objects
    Microkernel approach
    Interprocess communication using Protected Procedure Calls (PPC)
  Kernel-kernel communications
    Remote memory access versus remote invocation
    Dependent on cost ratio of remote memory/local memory
    Psyche: used shared memory for communication channel

Today's computer systems contain more processor cores and increasingly diverse architectures
High-performance systems have scaled, but have done so in a system-specific manner
Today's general-purpose OS will not be able to scale fast enough to keep up with the new system designs
Consider changing OS to distributed system model

Improving read-write lock on Sun Niagara
  Exploits shared, banked L2 cache
  Concurrent writes to a shared cache line track the presence of readers
  Highly efficient improvement for Sun Niagara
  Highly inefficient for others: ping-pong between reader caches
Windows 7 optimizations
  Removal of the dispatcher lock, a single, global lock
  Based on NT kernel code designed for 32-processor systems
  Windows 7/Vista originally limited to 64-processor systems
  Replacement of dispatcher lock with finer-grained locks
  Windows 7 now scalable to 256-processor systems
  Customers still asking for 300+ processor support
RCU changes for Linux
  Original RCU API had 7 components, which increased to 31
  Changes required for the large variety of architectures and workloads
  Grace period detection limits scalability

Replacement of single shared interconnect (front-side bus)
Interconnect topology now resembling a message passing network
  Ring topologies
  Mesh network topologies

Magny-Cours with 12 cores released March 2010
HyperTransport: front-side bus replacement
  Packet-based serial protocol
  Multi-processor interconnect for NUMA systems

Nehalem-EX with 8 cores released April 2010
QPI: front-side bus replacement
  Packet-based serial protocol
  Multi-processor interconnect for NUMA systems

Duality of message-passing versus shared memory
  Lauer and Needham (1978)
  Model selection dependent on hardware architecture
  Shared memory abstraction is difficult to tune
Shared memory
  Single core update to shared memory in under 30 cycles
  Sixteen (16) cores updating data requires 12,000 cycles
  Cores are stalled on cache misses
In message passing, use lightweight RPC
  Scales linearly with number of threads
Consider QPI and HyperTransport
  Messaging abstraction better suited to architecture

[Figure legend: SHM8 = updating 8 cache lines; MSG = messaging. Note linear scale.]

OS as distributed system with units communicating via messaging
Three design principles
  Make all inter-core communications explicit
  Make OS structure hardware neutral
  View state as replicated instead of shared

Make inter-core communications explicit
  Since hardware behaves as a network, treat OS as a distributed system
  Asynchronous communications
Make OS hardware neutral
  Use messaging abstraction to avoid extensive optimizations to achieve scalability
  Focus on optimization of messaging rather than hardware/cache/memory access
View state as replicated
  Maintain state through replication rather than shared memory
  Provides improved locality
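The "state as replicated" principle can be sketched as per-core replicas that change only in response to explicit messages. This is an illustrative model under stated assumptions: `CoreReplica` and `replicate_update` are hypothetical names, and the sketch leaves out the ordering and agreement protocols a real system would need to keep replicas consistent.

```python
from collections import deque


class CoreReplica:
    """Hypothetical per-core replica: private state plus an inbox of update messages."""

    def __init__(self, core_id):
        self.core_id = core_id
        self.state = {}
        self.inbox = deque()

    def handle_pending(self):
        """Apply queued updates to the local copy; return how many were applied."""
        applied = 0
        while self.inbox:
            key, value = self.inbox.popleft()
            self.state[key] = value
            applied += 1
        return applied


def replicate_update(origin, replicas, key, value):
    """Update the origin's copy, then send an explicit message to every other replica."""
    origin.state[key] = value
    for replica in replicas:
        if replica is not origin:
            replica.inbox.append((key, value))
```

Each core reads only its own copy, which is where the locality benefit comes from; remote copies lag until their owners process the message, reflecting the asynchronous communication model.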

Comparable performance to existing OS
Demonstrate evidence of scalability to large number of cores
Can be re-targeted to different hardware
Can exploit message-passing abstraction to achieve good performance
Can exploit modularity of the OS design

Privileged-mode CPU driver
  Lightweight RPC for core-to-core communication
  Alternate optimizations (L4 raw IPC in 420 cycles)
Distinguished user-mode monitor
  Coordinate system state
User-level applications use thread libraries

CPU driver
  Shares no state with other cores
  Event driven
  Single-threaded, non-preemptable
  Serially processes events
    Traps from user processes
    Device interrupts
    Interrupts from other cores
  Implements lightweight, asynchronous communication
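The event-driven, single-threaded structure described above can be sketched as a simple dispatch loop. The class `CpuDriver`, its handler names, and the event kinds are hypothetical labels chosen for this illustration; they are not Barrelfish identifiers.

```python
from collections import deque


class CpuDriver:
    """Hypothetical per-core driver: private state, one event queue, no threads."""

    def __init__(self, core_id):
        self.core_id = core_id
        self.events = deque()   # pending events, processed strictly in order
        self.log = []           # private to this core; nothing is shared
        self.handlers = {
            "trap": self.on_trap,
            "device_interrupt": self.on_device_interrupt,
            "ipi": self.on_ipi,
        }

    def post(self, kind, payload):
        """Queue an event; it is handled later by run_until_idle."""
        self.events.append((kind, payload))

    def run_until_idle(self):
        """Handle events one at a time, each to completion, with no preemption."""
        handled = 0
        while self.events:
            kind, payload = self.events.popleft()
            self.handlers[kind](payload)
            handled += 1
        return handled

    def on_trap(self, payload):
        self.log.append(f"trap:{payload}")

    def on_device_interrupt(self, payload):
        self.log.append(f"device_interrupt:{payload}")

    def on_ipi(self, payload):
        self.log.append(f"ipi:{payload}")
```

Because every handler runs to completion before the next event is taken, the driver needs no locks around its own state.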

Monitor: coordinate system state
  Single-core, user-space application supports OS scheduling
  Handles user-level application requests for state information
    Memory allocation tables
    Address space mappings
  Virtual memory management
  Inter-core communication through URPC
    Shared memory used as channel for communication
    Sender writes message to cache line
    Receiver polls cache line to read message
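The write-then-poll pattern of URPC can be modelled with a single cache-line-sized buffer. This is a simplified sketch under assumptions: the 64-byte line size, the layout (payload, length byte, flag byte), and the names `UrpcChannel`, `try_send`, and `poll` are all illustrative, and a real channel would also need memory-ordering guarantees that Python does not expose.

```python
CACHE_LINE_BYTES = 64                  # assumed cache-line size
LENGTH_INDEX = CACHE_LINE_BYTES - 2    # second-to-last byte holds the payload length
FLAG_INDEX = CACHE_LINE_BYTES - 1      # last byte signals "message present"
MAX_PAYLOAD = LENGTH_INDEX             # 62 bytes remain for the payload
EMPTY, FULL = 0, 1


class UrpcChannel:
    """Hypothetical one-slot channel occupying a single shared cache line."""

    def __init__(self):
        self.line = bytearray(CACHE_LINE_BYTES)

    def try_send(self, payload):
        """Sender side: write the payload, then set the flag last."""
        if len(payload) > MAX_PAYLOAD:
            raise ValueError("payload does not fit in one cache line")
        if self.line[FLAG_INDEX] == FULL:
            return False                      # previous message not yet consumed
        self.line[:len(payload)] = payload
        self.line[LENGTH_INDEX] = len(payload)
        self.line[FLAG_INDEX] = FULL          # publishing the flag completes the send
        return True

    def poll(self):
        """Receiver side: check the flag; return the message or None."""
        if self.line[FLAG_INDEX] != FULL:
            return None
        length = self.line[LENGTH_INDEX]
        message = bytes(self.line[:length])
        self.line[FLAG_INDEX] = EMPTY         # clearing the flag frees the slot
        return message
```

Writing the flag last means a polling receiver never observes a partially written message, and a whole message moves between cores with a single cache-line transfer.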

Processes
  Collection of dispatcher objects, one for each core the process may execute on
  Communication is between dispatchers
  Dispatchers run in user space
  Shared address space handled in one of two ways
    Sharing a hardware page table between dispatchers
    Replicating the hardware page table among dispatchers
  Thread scheduler on dispatchers
    Exchange messages to create/unblock threads
    Migrate threads between dispatchers (and cores)

Service known as System Knowledge Base
  Populated with information from hardware discovery
  Online measurements created at boot time
  Includes pre-defined system knowledge
Useful for optimizing communications to a given hardware architecture

TLB Shootdown
  Measure maintenance of TLB consistency by invalidating entries when pages are unmapped
  Linux/Windows operation
    Writes the operation to a well-known location
    Issues inter-processor interrupts (IPIs) to every core that might have a mapping in its TLB (800 cycles/trap)
    Each core acks the interrupt by writing to a shared memory location, invalidates its TLB and resumes

TLB Shootdown in Barrelfish
  Use messages for shootdown
  Allows optimization of messaging mechanism
    Broadcast: good for AMD/HyperTransport, which is a broadcast network
    Unicast: good for small number of cores
    Multicast: good for shared, on-chip L3 cache
      Two-level tree: send message to the first core of each processor, which forwards to other cores as appropriate
      Newer 4-core AMD/HyperTransport
      Similar to Intel Nehalem/QPI, 4-core architecture
    NUMA-aware multicast
      Send messages to highest latency nodes first
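The two-level tree and the NUMA-aware ordering can be sketched together as a planning step. The functions `numa_aware_plan` and `hops`, the package layout, and the latency figures are assumptions invented for illustration; they are not measurements or interfaces from the talk.

```python
def numa_aware_plan(packages, latency_from_sender):
    """Build a two-level multicast plan.

    packages: dict mapping package id -> list of core ids on that package.
    latency_from_sender: dict mapping package id -> latency (hypothetical units).
    Returns (leader_core, follower_cores) pairs, highest-latency package first.
    """
    order = sorted(packages, key=lambda p: latency_from_sender[p], reverse=True)
    return [(packages[p][0], packages[p][1:]) for p in order]


def hops(plan):
    """List (sender, receiver) hops: origin to leaders first, then leaders to followers."""
    result = [("origin", leader) for leader, _ in plan]
    for leader, followers in plan:
        result.extend((leader, follower) for follower in followers)
    return result
```

For example, with three 4-core packages whose hypothetical latencies are 100, 300, and 200, the plan contacts the leader of the 300-latency package first, so the slowest message is in flight while the nearer ones are still being sent.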

[Figure: Unmap latency on 8x4-core AMD]
Barrelfish
  Fixed LRPC overhead
  Overhead amortized over more efficient multicore communications

Does not beat Linux in performance, but
  Limited number of cores in modeling (32)
  Linux/Windows have had significant investments towards optimizations
Barrelfish approach has similarities to large-scale multiprocessor systems like Cray T3