Multiprocessor OS. COMP9242 Advanced Operating Systems S2/2011 Week 7: Multiprocessors Part 2. Scalability of Multiprocessor OS.

Similar documents
MULTIPROCESSOR OS. Overview. COMP9242 Advanced Operating Systems S2/2013 Week 11: Multiprocessor OS. Multiprocessor OS

Mul$processor OS COMP9242 Advanced Opera$ng Systems. Mul$processor OS. Overview. Uniprocessor OS

The Multikernel A new OS architecture for scalable multicore systems

The Multikernel: A new OS architecture for scalable multicore systems Baumann et al. Presentation: Mark Smith

Design Principles for End-to-End Multicore Schedulers

Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( )

Today. SMP architecture. SMP architecture. Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( )

The Barrelfish operating system for heterogeneous multicore systems

CSE 120 Principles of Operating Systems

Tile Processor (TILEPro64)

Modern systems: multicore issues

SCALABILITY AND HETEROGENEITY MICHAEL ROITZSCH

Milestone 7: another core. Multics. Multiprocessor OSes. Multics on GE645. Multics: typical configuration

Implications of Multicore Systems. Advanced Operating Systems Lecture 10

Chapter 7: Multiprocessing Advanced Operating Systems ( L)

Virtual Machines Disco and Xen (Lecture 10, cs262a) Ion Stoica & Ali Ghodsi UC Berkeley February 26, 2018

WHY PARALLEL PROCESSING? (CE-401)

Disco: Running Commodity Operating Systems on Scalable Multiprocessors

INFLUENTIAL OS RESEARCH

Computer Architecture

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU , Spring 2013

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

NUMA replicated pagecache for Linux

Parallel Computing Platforms

New directions in OS design: the Barrelfish research operating system. Timothy Roscoe (Mothy) ETH Zurich Switzerland

Multiprocessors and Locking

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015

Parallel Computing Platforms. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

Bruce Shriver University of Tromsø, Norway. November 2010 Óbuda University, Hungary

Interconnect Routing

COMP9242 Advanced OS. Copyright Notice. The Memory Wall. Caching. These slides are distributed under the Creative Commons Attribution 3.

Multiprocessor Systems. Chapter 8, 8.1

Barrelfish Project ETH Zurich. Message Notifications

Convergence of Parallel Architecture

Cache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012)

Multiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types

CSE 392/CS 378: High-performance Computing - Principles and Practice

G Disco. Robert Grimm New York University

Master s Thesis Nr. 10

FlexSC. Flexible System Call Scheduling with Exception-Less System Calls. Livio Soares and Michael Stumm. University of Toronto

Multi-core Architectures. Dr. Yingwu Zhu

William Stallings Computer Organization and Architecture 8 th Edition. Chapter 18 Multicore Computers

Multiprocessors and Thread Level Parallelism Chapter 4, Appendix H CS448. The Greed for Speed

Chapter 5. Thread-Level Parallelism

Chapter 5. Multiprocessors and Thread-Level Parallelism

Markets Demanding More Performance

10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems

Chapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs

Lecture: Transactional Memory, Networks. Topics: TM implementations, on-chip networks

Multiprocessor System. Multiprocessor Systems. Bus Based UMA. Types of Multiprocessors (MPs) Cache Consistency. Bus Based UMA. Chapter 8, 8.

Efficient inter-kernel communication in a heterogeneous multikernel operating system. Ajithchandra Saya

COSC 6385 Computer Architecture - Multi Processor Systems

Flexible Architecture Research Machine (FARM)

Multithreading: Exploiting Thread-Level Parallelism within a Processor

Multiprocessor Systems. COMP s1

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano

COMP9242 Advanced OS. S2/2017 W03: Caches: What Every OS Designer Must

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it

Multiprocessors. Flynn Taxonomy. Classifying Multiprocessors. why would you want a multiprocessor? more is better? Cache Cache Cache.

Message Passing. Advanced Operating Systems Tutorial 5

Shared memory multiprocessors

Parallel Architecture. Hwansoo Han

Multiprocessor Support

EE382 Processor Design. Processor Issues for MP

Chapter 5 Thread-Level Parallelism. Abdullah Muzahid

Exception-Less System Calls for Event-Driven Servers

ffwd: delegation is (much) faster than you think Sepideh Roghanchi, Jakob Eriksson, Nilanjana Basu

DISTRIBUTED SYSTEMS [COMP9243] Lecture 3b: Distributed Shared Memory DISTRIBUTED SHARED MEMORY (DSM) DSM consists of two components:

How much energy can you save with a multicore computer for web applications?

Computer Architecture

VIRTUALIZATION: IBM VM/370 AND XEN

SMD149 - Operating Systems - Multiprocessing

Overview. SMD149 - Operating Systems - Multiprocessing. Multiprocessing architecture. Introduction SISD. Flynn s taxonomy

the Barrelfish multikernel: with Timothy Roscoe

Designing High Performance Communication Middleware with Emerging Multi-core Architectures

Architecture-Conscious Database Systems

Distributed File Systems Issues. NFS (Network File System) AFS: Namespace. The Andrew File System (AFS) Operating Systems 11/19/2012 CSC 256/456 1

Multi-core Architectures. Dr. Yingwu Zhu

Chapter 5. Multiprocessors and Thread-Level Parallelism

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM

Parallel Algorithm Engineering

Arquitecturas y Modelos de. Multicore

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering

Linux multi-core scalability

Multiprocessors & Thread Level Parallelism

Computer Science 146. Computer Architecture

Shared Memory and Distributed Multiprocessing. Bhanu Kapoor, Ph.D. The Saylor Foundation

PARALLEL MEMORY ARCHITECTURE

Comp. Org II, Spring

HPMMAP: Lightweight Memory Management for Commodity Operating Systems. University of Pittsburgh

Computing architectures Part 2 TMA4280 Introduction to Supercomputing

An Introduction to Parallel Programming

CS4230 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/28/12. Homework 1: Parallel Programming Basics

19: Multiprocessing and Multicore. Computer Architecture and Systems Programming , Herbstsemester 2013 Timothy Roscoe

CSE502: Computer Architecture CSE 502: Computer Architecture

Multicore Hardware and Parallelism

COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence

Comp. Org II, Spring

Transcription:

Multiprocessor OS COMP9242 Advanced Operating Systems S2/2011 Week 7: Multiprocessors Part 2 2 Key design challenges: Correctness of (shared) data structures Scalability Scalability of Multiprocessor OS Remember Amdahl s law Serialisation prevents scalability Whenever application not running on core, scalability reduced Sources of Serialisation: Locking Waiting for a lock stalls self Lock implementation: Atomic operations lock bus stalls everyone Cache coherence traffic loads bus slows down others Memory access Relatively high latency to memory stalls self Cache Processor stalled while cache line is fetched or invalidated Limited by latency of interconnect round-trips Performance depends on data size (cache lines) and contention (number of cores) More Cache Issues False sharing Unrelated data structs share the same cache line Accessed from different processors Cache coherence traffic and delay Cache line bouncing Shared R/W on many processors E.g: bouncing due to locks: each processor spinning on a lock brings it into its own cache Cache coherence traffic and delay Cache misses Potentially direct memory access When does cache miss occur? Application runs on new core Cached memory has been evicted 3 4 Optimisation for Scalability Reduce amount of code in critical sections Increases concurrency Fine grained locking Lock data not code Tradeoff: more concurrency but more locking (and locking causes serialisation) Lock free data structures Reduce false sharing Pad data structures to cache lines Reduce cache line bouncing Reduce sharing E.g: MCS locks use local data Reduce cache misses Affinity scheduling: run process on the core where it last ran. Avoid cache pollution Contemporary Multiprocessor Hardware Intel Nehalem: Beckton, Westmere AMD Opteron: Barcelona, Magny Cours ARM Cortex A9, A15 MPCore Oracle (Sun) UltraSparc T1,T2,T3,T4 (Niagara) 5 6 1

Scale and Structure ARM Cortex A9 MPCore Scale and Structure Intel Nehalem From www.dawnofthered.net/wp-content/uploads/2011/02/nehalem-ex-architecture-detailed.jpg 7 From http://www.arm.com/images/cortex-a9-mp-core_big.gif 8 Interconnect AMD Barcelona Memory Locality and Caches 9 From www.sigops.org/sosp/sosp09/slides/baumann-slides-sosp09.pdf From www.systems.ethz.ch/education/past-courses/fall-20/aos/lectures/wk-multicore.pdf Interprocessor Communication FB DIMM FB DIMM FB DIMM FB DIMM MCU MCU MCU MCU Experimental/Future Multiprocessor Hardware Intel SCC Microsoft Beehive Intel Polaris Tilera Tile64 L2$ L2$ L2$ L2$ L2$ L2$ L2$ L2$ Full Cross Bar C0 C1 C2 C3 C4 C5 C6 C7 FPU FPU FPU FPU FPU FPU FPU FPU SPU SPU SPU SPU SPU SPU SPU SPU NIU (Ethernet+) Sys I/F Buffer Switch Core PCIe From Sun/Oracle 11 2x Gigabit Ethernet Power <95 W x8 @ 2.0 GHz 12 2

Scale and Structure Tilera Tile64, Intel Polaris Cache and Memory Intel SCC DDR2 Controller 0 DDR2 Controller 1 PCIe 0 MAC/ PHY PROCESSOR Reg File P P P 2 1 0 CACHE L2 CACHE L-1I L-1D I-TLB D-TLB 2D DMA UART, HPI, I2C, JTAG,SPI Flexible I/O GbE GbE 1 SWITCH MDN TDN UDN IDN STN PCIe 1 MAC/ PHY Flexible I/O XAUI 1 MAC/ PHY DDR2 Controller 3 DDR2 Controller 2 From www.tilera.com/products/processors/tile64 From techresearch.intel.com/spaw2/uploads/files/scc_platform_overview.pdf 13 14 Interprocessor Communication Beehive 15 From projects.csail.mit.edu/beehive/beehivev5.pdf Summary Scalability 0+ cores Amdahl s law really kicks in NUMA Also variable latencies due to topology and cache coherence Cache coherence may not be possible Can t use it for locking Shared data structures require explicit work Computer is a distributed system Message passing Consistency and Synchronisation Fault tolerance Heterogeneity Heterogeneous cores, memory, etc. Properties of similar systems may vary wildly (e.g. interconnect topology and latencies between different AMD platforms) 16 OS Design for Modern (and future) Multiprocessors Avoid shared data Performance issues arise less from lock contention than from data locality Explicit communication Regain control over communication costs Sometimes it s the only option Tradeoff: parallelism vs synchronisation Synchronisation introduces serialisation Make concurrent threads independent Allocate for locality E.g. provide memory local to a core Schedule for locality With cached data With local memory Tradeoff: uniprocessor performance vs scalability Design approaches Divide and conquer Using virtualisation Using exokernel Reduced sharing By design Brute force No sharing Computer is a distributed system 17 18 3

Divide and Conquer Disco Architecture Disco http://www-flash.stanford.edu/disco/ Scalability is too hard! Context: ca. 1995, large ccnuma multiprocessors appearing Scaling OSes requires extensive modifications Idea: Implement a scalable VMM Run multiple OS instances VMM has most of the features of a scalable OS: NUMA aware allocator Page replication, remapping, etc. VMM substantially simpler/cheaper to implement Modern incarnations of this Virtual servers (Amazon, etc.) Research (Cerebrus) Running commodity OSes on scalable multiprocessors [Bugnion et al., 1997] 19 20 Disco Performance Space-Time Partitioning Tessellation http://tessellation.cs.berkeley.edu/ Space-Time partitioning 2-level scheduling Context: 2009- highly parallel multicore systems Berkeley Par Lab Tessellation: Space-Time Partitioning in a Manycore Client OS [Liu et al., 20] 21 22 Tessellation Reduce Sharing K42 Clustered Objects, Ph.D. thesis [Appavoo, 2005] http://www.research.ibm.com/k42/ Context: 1997-2006: OS for ccnuma systems IBM, U Toronto (Tornado, Hurricane) Goals: High locality Scalability Object Oriented Fine grained objects Clustered (Distributed) Objects Data locality Deferred deletion (RCU) Avoid locking NUMA aware memory allocator Memory locality 23 24 4

K42: Fine-grained objects K42: Clustered objects 25 26 Globally valid object reference Resolves to Processor local representative Sharing, locking strategy local to each object Transparency Eases complexity Controlled introduction of locality Shared counter: inc, dec: local access val: communication Fast path: Access mostly local structures K42 Performance Corey 27 2.4.19 Context http://pdos.csail.mit.edu/corey 2008, high-end multicore servers, MIT Goals: Application control of OS sharing Address Ranges Control private per core and shared address spaces Kernel Cores Dedicate cores to run specific kernel functions Shares Lookup tables for kernel objects allow control over which object identifiers are visible to other cores. Linux scalability (20 scale Linux to 48 cores) sloppy counters, per-core data structs, fine-grained lock, lock free, cache lines : 3002 lines of code changed no scalability reason to give up on traditional operating system organizations just yet. 28 Corey: An Operating System for Many Cores [Boyd-Wickizer et al., 2008] An Analysis of Linux Scalability to Many Cores [Boyd-Wickizer et al., 20] FlexSC FlexSC Results FlexSC: Flexible System Call Scheduling with Exception-Less System Calls Context: [Soares and Stumm., 20] 20, commodity multicores U Toronto Goal: Reduce context switch overhead of system calls Syscall context switch: Usual mode switch overhead But: cache and TLB pollution! Asynchronous system calls Batch system calls Run them on dedicated cores FlexSC-Threads M on N M >> N 29 30 5

No sharing Multikernel Barrelfish fos: factored operating system The Multikernel: A new OS architecture for scalable multicore systems [Baumann et al., 2009] http://www.barrelfish.org/ 31 Barrelfish Context: [Baumann et al., 2009] http://www.barrelfish.org/ 2007 large multicore machines appearing 0s of cores on the horizon NUMA (cc and non-cc) ETH Zurich and Microsoft Goals: Scale to many cores Support and manage heterogeneous hardware Approach: Structure OS as distributed system Design principles: Interprocessor communication is explicit OS structure hardware neutral State is replicated Microkernel Similar to sel4: capabilities 32 The Multikernel: A new OS architecture for scalable multicore systems Barrelfish Barrelfish: Replication Kernel + Monitor: Only memory shared for message channels Monitor: Collectively coordinate system-wide state System-wide state: Memory allocation tables Address space mappings Capability lists What state is replicated in Barrelfish Capability lists Consistency and Coordination Retype: two-phase commit to globally execute operation in order Page (re/un)mapping: one-phase commit to synchronise TLBs 33 34 Barrelfish: Communication Barrelfish: Results Different mechanisms: Intra-core Kernel endpoints Inter-core URPC URPC Uses cache coherence + polling Shared bufffer Sender writes a cache line Receiver polls on cache line (last word so no part message) Polling? Cache only changes when sender writes, so poll is cheap Switch to block and IPI if wait is too long. 35 36 Message passing vs caching Latency (cycles! 00) 12 8 6 4 2 SHM8 SHM4 SHM2 SHM1 MSG8 MSG1 Server 0 2 4 6 8 12 14 16 Cores 6

Barrelfish: Results Broadcast vs Multicast Latency (cycles! 00) 14 12 8 6 4 2 Broadcast Unicast Multicast NUMA-Aware Multicast Barrelfish: Results TLB shootdown Latency (cycles 00) 60 50 40 30 20 Windows Linux Barrelfish 0 2 4 6 8 12 14 16 18 20 22 24 26 28 30 32 37 Cores 38 0 2 4 6 8 12 14 16 18 20 22 24 26 28 30 32 Cores Summary Trends in multicore Scale (0+ cores) NUMA No cache coherence Distributed system Heterogeneity OS design guidelines Avoid shared data Explicit communication Locality Approaches to multicore OS Partition the machine (Disco, Tessellation) Reduce sharing (K42, Corey, Linux, FlexSC) No sharing (Barrelfish, fos) 39 7