Multiprocessor Systems. COMP s1


Multiprocessor Systems

Multiprocessor System
We will look at shared-memory multiprocessors: more than one processor sharing the same memory.
- A single CPU can only go so fast; use more than one CPU to improve performance
- Assumes the workload can be parallelised, and is not I/O-bound or memory-bound
- Disks and other hardware can be expensive; hardware can be shared between CPUs

Types of Multiprocessors (MPs)
UMA MP (Uniform Memory Access): access to all memory occurs at the same speed for all processors.
NUMA MP (Non-Uniform Memory Access): access to some parts of memory is faster for some processors than other parts of memory.
We will focus on UMA.

Bus Based UMA
The simplest MP is more than one processor on a single bus connected to memory (a).
Bus bandwidth becomes a bottleneck with more than just a few CPUs.

Bus Based UMA
Each processor has a cache to reduce its need for access to memory (b).
The hope is that most accesses are to the local cache.
Bus bandwidth still becomes a bottleneck with many CPUs.

Cache Consistency
What happens if one CPU writes to address 0x1234 (and it is stored in its cache) and another CPU reads from the same address (and gets what is in its own cache)?

Cache Consistency
Cache consistency is usually handled by the hardware: writes to one cache propagate to, or invalidate, the corresponding entries in other caches.
These cache transactions also consume bus bandwidth.

Bus Based UMA
To further scale the number of processors, we give each processor private local memory.
- Keeps private data local and off the shared memory bus
- Bus bandwidth still becomes a bottleneck with many CPUs accessing shared data
- Complicates application development: we have to partition data between private and shared variables

Bus Based UMA
With only a single shared bus, scalability is limited by the bandwidth of that bus, and caching only helps so much.
Alternative bus architectures do exist on high-end machines.

UMA Crossbar Switch

UMA Crossbar Switch
Pro: any CPU can access any available memory without blocking.
Con: the number of switches required scales with n²; 1000 CPUs need 1,000,000 switches.

Summary
Multiprocessors can:
- Increase computation power beyond that available from a single CPU
- Share resources such as disk and memory
However:
- Shared buses (bus bandwidth) limit scalability
- This can be reduced via hardware design, or by carefully crafted software behaviour: good cache locality together with private data where possible
Question: how do we construct an OS for a multiprocessor? What are the issues?

Each CPU has its own OS
- Statically allocate physical memory to each CPU
- Each CPU runs its own independent OS
- Peripherals are shared
- Each CPU (OS) handles its processes' system calls

Each CPU has its own OS
Used in early multiprocessor systems to get them going.
- Simple to implement
- Avoids concurrency issues by not sharing

Issues
- Each processor has its own scheduling queue: we can have one processor overloaded and the rest idle
- Each processor has its own memory partition: we can have one processor thrashing while others have free memory, with no way to move free memory from one OS to another
- Consistency is an issue with independent disk buffer caches and potentially shared files

Master-Slave Multiprocessors
The OS (mostly) runs on a single fixed CPU.
- All OS tables, queues, and buffers are present/manipulated on CPU 1
- User-level apps run on the other CPUs (and on CPU 1 if there is spare CPU time)
- All system calls are passed to CPU 1 for processing

Master-Slave Multiprocessors
- Very little synchronisation required: only one CPU accesses kernel data
- Simple to implement
- Single, centralised scheduler keeps all processors busy
- Memory can be allocated as needed to all CPUs

Issue
The master CPU can become the bottleneck.

Symmetric Multiprocessors (SMP)
The OS kernel runs on all processors; load and resources are balanced across all processors, including kernel execution.
Issue: real concurrency in the kernel. We need carefully applied synchronisation primitives to avoid disaster.

Symmetric Multiprocessors (SMP)
One alternative: a single mutex that makes the entire kernel one large critical section.
- Only one CPU can be in the kernel at a time
- Only slightly better than master-slave (better cache locality)
- The big lock becomes a bottleneck when in-kernel processing exceeds what can be done on a single CPU

Symmetric Multiprocessors (SMP)
Better alternative: identify largely independent parts of the kernel and make each of them its own critical section.
- Allows more parallelism in the kernel
Issue: this is a difficult task.
- The code is mostly similar to uniprocessor code
- The hard part is identifying independent parts that don't interfere with each other

Symmetric Multiprocessors (SMP)
Example: associate a mutex with each independent part of the kernel.
Some kernel activities require more than one part of the kernel, and so need to acquire more than one mutex.
- A great opportunity to deadlock!
- Results in potentially complex lock ordering schemes that must be adhered to
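One common ordering scheme can be sketched as follows: whenever two mutexes must be held together, every code path acquires them in the same global order, so no cycle (and hence no deadlock) can form. This is a minimal illustration, not kernel code; ordering by address is one arbitrary but consistent choice.

```c
#include <pthread.h>

/* Sketch: acquire two mutexes in a fixed global order (lowest address
 * first). Every path that needs both locks must use this helper, so
 * two CPUs can never hold one lock each while waiting for the other. */
static void lock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
    if (a > b) {                       /* normalise the order */
        pthread_mutex_t *tmp = a;
        a = b;
        b = tmp;
    }
    pthread_mutex_lock(a);
    pthread_mutex_lock(b);
}

static void unlock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
    pthread_mutex_unlock(a);
    pthread_mutex_unlock(b);
}
```

Whichever order callers pass the locks in, the actual acquisition order is identical, which is exactly the property the lock ordering scheme must guarantee.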

Symmetric Multiprocessors (SMP)
Example: given a big-lock kernel, we divide the kernel into two independent parts, each with its own lock.
- There is a good chance one of those locks will become the next bottleneck
- This leads to more subdivision, more locks, and more complex lock acquisition rules
- Subdivision in practice means making more code multithreaded

Real-life Scalability Example
- In the early 1990s, CSE wanted to run 80 X-Terminals off one or more server machines
- The winning tender was a 4-CPU bar-fridge-sized machine with 256M of RAM
- The eventual configuration had 6 CPUs and 512M of RAM
- The machine ran fine in all pre-session testing

Real-life Scalability Example
Students + assignment deadline = machine unusable.

Real-life Scalability Example
- To fix the problem, the tenderer supplied more CPUs to improve performance (the number increased to 8): no change!
- Eventually, the machine was replaced with three 2-CPU pizza-box-sized machines, each with 256M of RAM
- This was cheaper overall, and performance improved dramatically. Why?

Real-life Scalability Example
Paper: Ramesh Balan and Kurt Gollhardt, "A Scalable Implementation of Virtual Memory HAT Layer for Shared Memory Multiprocessor Machines", Proc. 1992 Summer USENIX Conference.
- The 4-8 CPU machine hit a bottleneck in the single-threaded VM code; adding more CPUs simply added them to the wait queue for the VM locks
- The 2-CPU machines did not generate that much lock contention and performed proportionally better

Lesson Learned
Building scalable multiprocessor kernels is hard; lock contention can limit overall system performance.

Multiprocessor Synchronisation
Given we need synchronisation, how can we achieve it on a multiprocessor machine?
Unlike on a uniprocessor, disabling interrupts does not work: it does not prevent other CPUs from running in parallel. We need special hardware support.

Recall Mutual Exclusion with Test-and-Set
Entering and leaving a critical region using the TSL instruction.

Test-and-Set
Hardware guarantees that the instruction executes atomically.
Atomically: as an indivisible unit; the instruction cannot stop half way through.
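A minimal sketch of a test-and-set spinlock, using C11 atomics: `atomic_flag_test_and_set` is the portable analogue of the TSL instruction, atomically setting the flag and returning its previous value. The `spinlock_t` type and function names here are illustrative, not any particular kernel's API.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative spinlock built on the test-and-set primitive. */
typedef struct {
    atomic_flag held;
} spinlock_t;

#define SPINLOCK_INIT { ATOMIC_FLAG_INIT }

static void spin_lock(spinlock_t *l)
{
    /* Keep test-and-setting until the previous value was "clear",
     * meaning we are the CPU that set it and now hold the lock. */
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ; /* busy-wait (spin) */
}

static void spin_unlock(spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

Because the test and the set happen as one indivisible operation, two CPUs can never both observe the lock as free and both acquire it.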

Test-and-Set on SMP
It does not work without some extra hardware support.

Test-and-Set on SMP
A solution: the hardware locks the bus during the TSL instruction to prevent memory accesses by any other CPU.

Test-and-Set on SMP
Test-and-Set is a busy-wait synchronisation primitive, called a spinlock.
Issue: lock contention leads to spinning on the lock.
- Spinning on a lock requires bus locking, which slows all other CPUs down, independent of whether the other CPUs need the lock or not
- This causes bus contention

Test-and-Set on SMP
Caching does not help reduce bus contention.
- Either TSL still locks the bus
- Or TSL requires exclusive access to an entry in the local cache, which requires invalidating the same entry in other caches and loading the entry into the local cache
- Many CPUs performing TSL simply bounce a single exclusive entry between all caches using the bus

Reducing Bus Contention
Read before TSL:
- Spin reading the lock variable, waiting for it to change
- When it does, use TSL to acquire the lock
This allows the lock to be shared read-only in all caches until it is released: no bus traffic until the actual release. There are no race conditions, as acquisition is still done with TSL.

start:
    while (lock == 1);
    r = TSL(lock);
    if (r == 1)
        goto start;
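The pseudocode above, often called test-and-test-and-set, can be sketched with C11 atomics; here an `atomic_int` (0 = free, 1 = held) stands in for the lock word, and `atomic_exchange` plays the role of TSL.

```c
#include <stdatomic.h>

/* Test-and-test-and-set sketch: spin on a plain read (the cache line
 * stays shared in every cache) and only attempt the atomic exchange
 * once the lock looks free. */
static void ttas_lock(atomic_int *lock)
{
    for (;;) {
        /* Read-only spin: generates no bus traffic while the lock
         * is held, because the line is cached shared. */
        while (atomic_load_explicit(lock, memory_order_relaxed) == 1)
            ;
        /* The TSL step: exchange returns the old value; 0 means we
         * were the CPU that flipped it and now hold the lock. */
        if (atomic_exchange_explicit(lock, 1, memory_order_acquire) == 0)
            return;
    }
}

static void ttas_unlock(atomic_int *lock)
{
    atomic_store_explicit(lock, 0, memory_order_release);
}
```

The release is the only write observed by the waiters, so the invalidation (and subsequent bus traffic) happens once per handover rather than once per spin iteration.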

Reducing Bus Contention
If locking fails, use an exponential back-off algorithm: wait a longer and longer time before trying again (2 instructions, then 4, 8, 16, etc., up to a maximum).
- A high maximum gives lower bus contention
- A low maximum gives more responsive locks

start:
    r = TSL(lock);
    if (r == 1) {
        exp_wait();
        goto start;
    }
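A sketch of the same idea in C11 atomics, with the `exp_wait()` step inlined as an idle counting loop; the minimum and maximum delays are illustrative constants, and a real implementation would tune them to the machine.

```c
#include <stdatomic.h>

/* Test-and-set with exponential back-off (sketch). After each failed
 * attempt, spin idly for a doubling number of iterations, capped at
 * BACKOFF_MAX, before touching the lock again. */
#define BACKOFF_MIN    2
#define BACKOFF_MAX 1024

static void backoff_lock(atomic_int *lock)
{
    unsigned delay = BACKOFF_MIN;

    while (atomic_exchange_explicit(lock, 1, memory_order_acquire) == 1) {
        /* Idle spin: no memory traffic while backing off. */
        for (volatile unsigned i = 0; i < delay; i++)
            ;
        if (delay < BACKOFF_MAX)
            delay *= 2;   /* 2, 4, 8, 16, ... up to the cap */
    }
}

static void backoff_unlock(atomic_int *lock)
{
    atomic_store_explicit(lock, 0, memory_order_release);
}
```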

MCS Locks
- Each CPU enqueues its own private lock variable into a queue and spins on it: no contention
- On lock release, the releaser unlocks the next lock in the queue: bus contention only on the actual unlock
- No starvation (the order of lock acquisition is defined by the list)
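The queue mechanics can be sketched with C11 atomics. This is a simplified illustration of the MCS idea, not a production lock: each CPU brings its own node, appends it to the tail with an atomic exchange, and spins only on its private `locked` flag.

```c
#include <stdatomic.h>
#include <stddef.h>

/* MCS lock sketch: each waiter spins on its own node, so waiting
 * generates no traffic on a shared lock word. */
typedef struct mcs_node {
    struct mcs_node *_Atomic next;
    atomic_int locked;                 /* 1 while this CPU must wait */
} mcs_node_t;

typedef struct {
    mcs_node_t *_Atomic tail;          /* last waiter in the queue */
} mcs_lock_t;

static void mcs_lock(mcs_lock_t *l, mcs_node_t *me)
{
    me->next = NULL;
    atomic_store(&me->locked, 1);
    /* Atomically append ourselves to the queue. */
    mcs_node_t *pred = atomic_exchange(&l->tail, me);
    if (pred == NULL)
        return;                        /* queue was empty: we hold the lock */
    atomic_store(&pred->next, me);     /* link in behind our predecessor */
    while (atomic_load(&me->locked))   /* spin on our private flag only */
        ;
}

static void mcs_unlock(mcs_lock_t *l, mcs_node_t *me)
{
    mcs_node_t *succ = atomic_load(&me->next);
    if (succ == NULL) {
        /* No visible successor: try to reset the queue to empty. */
        mcs_node_t *expected = me;
        if (atomic_compare_exchange_strong(&l->tail, &expected, NULL))
            return;
        /* A successor is mid-enqueue; wait for its link to appear. */
        while ((succ = atomic_load(&me->next)) == NULL)
            ;
    }
    atomic_store(&succ->locked, 0);    /* hand the lock to the next CPU */
}
```

Because each waiter spins on its own node, the only cross-CPU write is the single store that hands the lock over, and FIFO queue order rules out starvation.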

Other Hardware-Provided SMP Synchronisation Primitives
- Atomic Add/Subtract: can be used to implement counting semaphores
- Exchange
- Compare and Exchange
- Load Linked / Store Conditional: two separate instructions. Load a value using load-linked; modify it, then store it using store-conditional. If the value was changed by another processor, or an interrupt occurred, the store-conditional fails. Can be used to implement all of the above primitives, and is implemented without bus locking.
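As an illustration of how these primitives compose, here is atomic add built from compare-and-exchange using C11 atomics (an LL/SC pair would be used the same way: load-linked, compute, store-conditional, retry on failure). The function name is illustrative.

```c
#include <stdatomic.h>

/* Sketch: atomic add built from a compare-and-exchange retry loop.
 * Read the old value, compute the new one, and retry if another CPU
 * changed the variable in between. */
static int atomic_add_via_cas(atomic_int *v, int amount)
{
    int old = atomic_load(v);
    /* On failure, compare_exchange writes the current value back into
     * 'old', so each retry works with fresh data. */
    while (!atomic_compare_exchange_weak(v, &old, old + amount))
        ;
    return old + amount;               /* the value we installed */
}
```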

Spinning versus Switching
Remember that spinning (busy-waiting) on a lock made little sense on a uniprocessor: there was no other running process to release the lock, so blocking and (eventually) switching to the lock holder is the only option.
On SMP systems, the decision to spin or block is not as clear: the lock is held by another running CPU and will be freed without necessarily blocking the requestor.

Spinning versus Switching
Blocking and switching to another process takes time:
- Save the current context and restore another
- The cache contains the current process's working set, not the new one's; adjusting the cache working set also takes time
- The TLB behaves similarly to the cache
- Switching back when the lock is free incurs the same costs again
Spinning wastes CPU time directly.
Trade-off: if the lock is held for less time than the overhead of switching away and back, it is more efficient to spin.

Spinning versus Switching
The general approaches taken are:
- Spin forever
- Spin for some period of time; if the lock is not acquired, block and switch
The spin time can be:
- Fixed (related to the switch overhead)
- Dynamic, based on previous observations of the lock acquisition time
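The spin-then-switch approach can be sketched as follows; this is an illustration only, with `SPIN_LIMIT` standing in for a bound derived from the switch overhead, and POSIX `sched_yield()` standing in for properly blocking on a wait queue.

```c
#include <stdatomic.h>
#include <sched.h>

/* Sketch: spin for a bounded number of attempts, then give up the CPU
 * instead of spinning forever. SPIN_LIMIT is an arbitrary stand-in for
 * a bound tuned against the context-switch overhead. */
#define SPIN_LIMIT 1000

static void spin_then_switch_lock(atomic_int *lock)
{
    for (;;) {
        for (int i = 0; i < SPIN_LIMIT; i++)
            if (atomic_exchange_explicit(lock, 1,
                                         memory_order_acquire) == 0)
                return;                /* acquired while spinning */
        sched_yield();                 /* still held: switch away */
    }
}

static void spin_then_switch_unlock(atomic_int *lock)
{
    atomic_store_explicit(lock, 0, memory_order_release);
}
```

A dynamic variant would adjust the bound at runtime from observed hold times rather than using a fixed constant.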

Preemption and Spinlocks
Critical sections synchronised via spinlocks are expected to be short, to avoid other CPUs wasting cycles spinning.
What happens if the spinlock holder is preempted at the end of its timeslice?
- Mutual exclusion is still guaranteed
- But other CPUs will spin until the holder is scheduled again!
Spinlock implementations therefore disable interrupts in addition to acquiring locks, to avoid lock-holder preemption.

Multiprocessor Scheduling
Given X processes (or threads) and Y CPUs, how do we allocate them to the CPUs?

A Single Shared Ready Queue
When a CPU goes idle, it takes the highest-priority process from the shared ready queue.

Single Shared Ready Queue
Pros:
- Simple
- Automatic load balancing
Cons:
- Lock contention on the ready queue can be a major bottleneck, due to frequent scheduling, many CPUs, or both
- Not all CPUs are equal: the last CPU a process ran on is likely to have more related entries in its cache

Affinity Scheduling
Basic idea: try hard to run a process on the CPU it ran on last time.
One approach: two-level scheduling.

Two-level Scheduling
Each CPU has its own ready queue.
The top-level algorithm assigns processes to CPUs: it defines their affinity and roughly balances the load.
The bottom-level scheduler:
- Is the frequently invoked scheduler (e.g. on blocking on I/O, blocking on a lock, or exhausting a timeslice)
- Runs on each CPU and selects from its own ready queue, ensuring affinity
- If nothing is available in the local ready queue, runs a process from another CPU's ready queue rather than going idle
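The bottom-level scheduler's decision can be sketched in a few lines; this toy model uses plain arrays as per-CPU ready queues and ignores locking and priorities, which a real kernel would need. All names and constants here are illustrative.

```c
#include <stddef.h>

/* Toy model of per-CPU ready queues with stealing: a CPU prefers its
 * own queue (affinity) and only takes from another CPU's queue when
 * its own is empty, rather than going idle. */
#define NCPU  4
#define QLEN 16

struct ready_queue {
    int tasks[QLEN];   /* task IDs, LIFO for simplicity */
    int count;
};

static struct ready_queue rq[NCPU];

/* Bottom-level scheduler for 'cpu': local queue first, then steal.
 * Returns a task ID, or -1 if nothing is runnable anywhere. */
static int pick_next_task(int cpu)
{
    if (rq[cpu].count > 0)
        return rq[cpu].tasks[--rq[cpu].count];    /* affinity: local task */
    for (int other = 0; other < NCPU; other++)    /* otherwise steal */
        if (other != cpu && rq[other].count > 0)
            return rq[other].tasks[--rq[other].count];
    return -1;
}
```

The common case touches only the local queue (no cross-CPU lock contention); stealing is the fallback that trades a little affinity for keeping CPUs busy.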

Two-level Scheduling
Pros:
- No lock contention on per-CPU ready queues in the (hopefully) common case
- Load balancing to avoid idle queues
- Automatic affinity to a single CPU for more cache-friendly behaviour

Gang Scheduling
We have assumed thus far that processes (or threads) are independent. What happens if they are not?
Here, A x must wait until its next timeslice to get the message; things could go much faster if the A threads were scheduled together.
Similar problems can occur with synchronisation between threads.

Gang Scheduling
The approach:
- Groups of related threads are scheduled together as a gang
- All members of the gang run simultaneously, on different CPUs
- All gang members start and end their time slices together

Gang Scheduling