Advanced Operating Systems (CS 202) Memory Consistency, Cache Coherence and Synchronization

(Some cache coherence slides adapted from Ian Watson; some memory consistency slides from Sarita Adve.)

Classic Example
Suppose we have to implement a function to handle withdrawals from a bank account:

    withdraw(account, amount) {
        balance = get_balance(account);
        balance = balance - amount;
        put_balance(account, balance);
        return balance;
    }

Now suppose that you and your father share a bank account with a balance of $1000. Then you each go to separate ATM machines and simultaneously withdraw $100 from the account.
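The lost update can be reproduced deterministically by simulating the bad interleaving in a single thread. This is a sketch with names of our own choosing (simulate_lost_update and the explicit you/dad temporaries), not code from the slides:

```c
/* Shared account balance (illustrative; the names are ours). */
static int balance = 1000;

/* Simulate the bad interleaving: both "threads" read the old balance
 * before either writes back, so one withdrawal is lost. */
int simulate_lost_update(void) {
    int you = balance;      /* thread A: balance = get_balance(account) */
    int dad = balance;      /* thread B reads before A writes back      */
    you = you - 100;        /* thread A: balance = balance - amount     */
    dad = dad - 100;        /* thread B: balance = balance - amount     */
    balance = you;          /* thread A: put_balance(account, balance)  */
    balance = dad;          /* thread B overwrites A's update           */
    return balance;
}
```

Two $100 withdrawals should leave $800, but this schedule leaves $900: one update is silently lost.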

Interleaved Schedules
The problem is that the execution of the two threads can be interleaved. Execution sequence seen by the CPU:

    balance = get_balance(account);     // thread 1
    balance = balance - amount;         // thread 1
    // context switch
    balance = get_balance(account);     // thread 2
    balance = balance - amount;         // thread 2
    put_balance(account, balance);      // thread 2
    // context switch
    put_balance(account, balance);      // thread 1

What is the balance of the account now?

How Interleaved Can It Get? How contorted can the interleavings be?
We'll assume that the only atomic operations are reads and writes of individual memory locations. (Some architectures don't even give you that!)
We'll assume that a context switch can occur at any time.
We'll assume that you can delay a thread as long as you like, as long as it's not delayed forever.
(The slide shows the two copies of the withdraw body interleaved at this granularity: the get_balance calls, subtractions, and put_balance calls of the two threads can occur in any order consistent with each thread's program order.)

Mutual Exclusion
We use mutual exclusion to synchronize access to shared resources. This allows us to have larger atomic blocks. What does atomic mean?
Code that uses mutual exclusion is called a critical section.
Only one thread at a time can execute in the critical section; all other threads are forced to wait on entry. When a thread leaves a critical section, another can enter.
Example: sharing an ATM with others.
What requirements would you place on a critical section?

Using Locks

    withdraw(account, amount) {
        acquire(lock);
        balance = get_balance(account);     // critical
        balance = balance - amount;         //  section
        put_balance(account, balance);      //
        release(lock);
        return balance;
    }

In the two-thread trace on the slide, the second thread's acquire(lock) blocks until the first thread's release(lock), so the critical section executes as one atomic block.
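For comparison, here is a minimal runnable version of the slide's locked withdraw() using a POSIX mutex. get_balance/put_balance are inlined as a plain global, which is our simplification for a self-contained sketch:

```c
#include <pthread.h>

/* Stand-in for the account store; names below are ours. */
static int account_balance = 1000;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

int withdraw(int amount) {
    pthread_mutex_lock(&lock);          /* acquire(lock): enter critical section */
    int balance = account_balance;      /* balance = get_balance(account) */
    balance = balance - amount;
    account_balance = balance;          /* put_balance(account, balance)  */
    pthread_mutex_unlock(&lock);        /* release(lock) */
    return balance;
}
```

With the mutex held across read, modify, and write, concurrent withdrawals serialize and no update is lost.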

Using Test-And-Set
Here is our lock implementation with test-and-set:

    struct lock {
        int held = 0;
    };

    void acquire(lock) {
        while (test-and-set(&lock->held));
    }

    void release(lock) {
        lock->held = 0;
    }

When will the while return? What is the value of held?
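C11 exposes exactly this primitive as atomic_flag: atomic_flag_test_and_set atomically sets the flag and returns its previous value. A sketch of the slide's lock on top of it (structure and comments adapted by us):

```c
#include <stdatomic.h>

struct lock { atomic_flag held; };

void acquire(struct lock *l) {
    /* The while returns only when the previous value was clear (0):
     * we are the thread that flipped held from 0 to 1. */
    while (atomic_flag_test_and_set(&l->held))
        ;   /* held was already set: someone else holds the lock */
}

void release(struct lock *l) {
    atomic_flag_clear(&l->held);   /* lock->held = 0 */
}
```

After acquire returns, held is set; any other thread's test-and-set sees 1 and keeps spinning until release clears it.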

Overview
Before we talk deeply about synchronization, we need to get an idea about the memory model in shared memory systems.
Is synchronization only an issue in multi-processor systems?
What is a shared memory processor (SMP)?
Shared memory processors come in two primary architectures:
Bus-based/local network shared-memory machines (small-scale)
Directory-based shared-memory machines (large-scale)

Plan
Introduce and discuss cache coherence.
Discuss basic synchronization, up to MCS locks (from the paper we are reading).
Introduce memory consistency and its implications.
Is this an architecture class??? The same issues manifest in large scale distributed systems.

Crash course on cache coherence

Bus-based Shared Memory Organization
The basic picture is simple: several CPUs, each with its own cache, connected by a shared bus to a shared memory.

Organization
The bus is usually a simple physical connection (wires).
Bus bandwidth limits the number of CPUs.
There could be multiple memory elements.
For now, assume that each CPU has only a single level of cache.

Problem of Memory Coherence
Assume just single-level caches and main memory.
A processor writes to a location in its cache. Other caches may hold shared copies, and these will be out of date. Updating main memory alone is not enough.
What happens if two updates happen at (nearly) the same time? Can two different processors see them out of order?

Example
[Figure: CPUs 1-3, each with a cache, on a shared bus; shared memory holds X: 24]
Processor 1 reads X: obtains 24 from memory and caches it.
Processor 2 reads X: obtains 24 from memory and caches it.
Processor 1 writes 32 to X: its locally cached copy is updated.
Processor 3 reads X: what value should it get?
Memory and processor 2 think it is 24; processor 1 thinks it is 32.
Notice that having write-through caches is not good enough.

Cache Coherence
Try to make the system behave as if there are no caches! How?
Idea: try to make every CPU know who has a copy of its cached data? Too complex!
More practical: snoopy caches. Each CPU snoops the memory bus, looking for read/write activity concerning data addresses it has cached. What does it do with them?
This assumes a bus structure where all communication can be seen by all.
More scalable solution: directory-based coherence schemes.

Snooping Protocols: Write Invalidate
The CPU performing the write sends an invalidate message.
Snooping caches invalidate their copy.
The CPU then writes to its cached copy. Write through or write back?
Any shared read in other CPUs will now miss in cache and re-fetch the new data.

Snooping Protocols: Write Update
The CPU performing the write updates its own copy.
All snooping caches update their copy.
Note that in both schemes, the problem of simultaneous writes is taken care of by bus arbitration: only one CPU can use the bus at any one time. This is a harder problem for arbitrary networks.

Update or Invalidate?
Which should we use? Bus bandwidth is a precious commodity in shared memory multiprocessors.
Contention/cache interrogation can lead to a 10x or more drop in performance. (It is also important to minimize false sharing.)
Therefore, invalidate protocols are used in most commercial SMPs.

Cache Coherence summary
Reads and writes are atomic. What does atomic mean? As if there is no cache.
Some magic makes things work, but it has costs, and therefore implications for the performance of programs.

So, let's try our hand at some synchronization

What is synchronization?
Making sure that concurrent activities don't access shared data in inconsistent ways.

    int i = 0;  // shared
    Thread A: i = i + 1;
    Thread B: i = i - 1;

What is in i?

What are the sources of concurrency?
Multiple user-space processes, on multiple CPUs
Device interrupts
Workqueues
Tasklets
Timers

Pitfalls in scull
Race condition: the result of uncontrolled access to shared data.

    if (!dptr->data[s_pos]) {
        dptr->data[s_pos] = kmalloc(quantum, GFP_KERNEL);
        if (!dptr->data[s_pos]) {
            goto out;
        }
    }

If two threads run this check-then-allocate concurrently, both can see data[s_pos] as NULL and both call kmalloc; one allocation then overwrites the other's pointer, producing a memory leak.
Scull is the Simple Character Utility for Loading Localities (an example device driver from the Linux Device Drivers book).

Synchronization primitives
Lock/Mutex: to protect a shared variable, surround accesses to it with a lock (critical region). Only one thread can hold the lock at a time, providing mutual exclusion.
Shared locks: more than one thread allowed (hmm...)
Others? Yes, including barriers (discussed in the paper).

Synchronization primitives (cont'd)
Lock based:
Blocking (e.g., semaphores, futexes, completions)
Non-blocking (e.g., spinlocks, ...); sometimes we have to use spinlocks
Lock free (or partially lock free):
Atomic instructions
seqlocks
RCU
Transactions

How about locks?

    Lock(L):
        if (L == 0)
            L = 1;          // grab the lock
        else {
            while (L == 1); // wait
            go back;        // retry from the top
        }

    Unlock(L):
        L = 0;

The check and the lock are not atomic!
Can we do this just with atomic reads and writes? Yes, but it is not easy (Dekker's algorithm). It is easier to use read-modify-write atomic instructions.

Naïve implementation of spinlock

    Lock(L):
        while (test_and_set(L));
        // we have the lock!
        // eat, dance and be merry

    Unlock(L):
        L = 0;

Why naïve? Does it work? Yes, but it is not used in practice because of contention.
Think about the cache coherence protocol: the set in test-and-set is a write operation, so it has to go to memory, generating a lot of cache coherence traffic that is unnecessary unless the lock has just been released. Imagine if many threads are waiting to get the lock.
There is also a fairness/starvation problem.

Better implementation: spin on read
Assumption: we have cache coherence. (Not all machines do, e.g., Intel SCC.)

    Lock(L):
        while (L == locked);            // wait, spinning on a cached read
        if (test_and_set(L) == locked)
            go back;                    // lost the race: spin again

There is still a lot of chattering when there is an unlock; a spin lock with backoff helps.
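This spin-on-read scheme is often called test-and-test-and-set. A sketch in C11 atomics, with names of our choosing (ttas_lock etc.): the inner while spins on an ordinary load, which hits in the local cache once the line is shared, and the expensive atomic exchange is attempted only when the lock looks free:

```c
#include <stdatomic.h>

typedef struct { atomic_int locked; } ttas_lock;

void ttas_acquire(ttas_lock *l) {
    for (;;) {
        while (atomic_load(&l->locked))           /* spin on read: cache-local */
            ;
        if (atomic_exchange(&l->locked, 1) == 0)  /* test_and_set */
            return;                               /* we flipped 0 -> 1: got it */
        /* someone beat us to it; go back to spinning on the read */
    }
}

void ttas_release(ttas_lock *l) {
    atomic_store(&l->locked, 0);
}
```

An unlock still invalidates every waiter's cached copy at once, which is the "chattering" the slide mentions; exponential backoff between retries reduces it.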

Bakery Algorithm

    struct lock {
        int next_ticket;
        int now_serving;
    }

    Acquire_lock:
        int my_ticket = fetch_and_inc(L->next_ticket);
        while (L->now_serving != my_ticket);    // wait
        // Eat, dance and be merry!

    Release_lock:
        L->now_serving++;

Still too much chatter. Comments? Fairness? Efficiency/cache coherence?
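The slide's ticket scheme maps directly onto C11 atomics, with fetch_and_inc becoming atomic_fetch_add. A sketch with names of our choosing:

```c
#include <stdatomic.h>

/* Ticket lock: fetch_and_inc hands out tickets, now_serving
 * admits them strictly in order, so the lock is FIFO-fair. */
typedef struct {
    atomic_int next_ticket;
    atomic_int now_serving;
} ticket_lock;

void ticket_acquire(ticket_lock *l) {
    int my_ticket = atomic_fetch_add(&l->next_ticket, 1); /* fetch_and_inc */
    while (atomic_load(&l->now_serving) != my_ticket)
        ;                                 /* wait our turn */
}

void ticket_release(ticket_lock *l) {
    atomic_fetch_add(&l->now_serving, 1); /* now_serving++ */
}
```

Fairness comes for free from the ticket order, but every waiter still spins on the same now_serving word, so each release invalidates all of their cache lines.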

Anderson Lock (array lock)
Problem with the bakery algorithm: all threads listen to now_serving, causing a lot of cache coherence chatter, but only one will actually acquire the lock.
Can we have each thread wait on a different variable to reduce chatter?

Anderson's Lock
We have an array (actually a circular queue) of variables. Each variable can indicate either lock available or must wait; only one location has lock available.

    Lock(L):
        my_place = fetch_and_inc(L->queue_last);
        while (L->flags[my_place mod N] == must_wait);  // wait

    Unlock(L):
        L->flags[my_place mod N] = must_wait;           // retire our slot
        L->flags[(my_place + 1) mod N] = available;     // pass the lock on

Fair and not noisy compared to spin-on-read and the bakery algorithm. Any negative side effects?
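A sketch of Anderson's array lock in C11 atomics; the names, N, and the init convention are ours. Each waiter spins on its own slot, so a release touches only the successor's location (in a real implementation each slot would additionally be padded to its own cache line):

```c
#include <stdatomic.h>

#define N 4  /* slots = max concurrent threads (our assumption) */

enum { MUST_WAIT = 0, AVAILABLE = 1 };

typedef struct {
    atomic_int flags[N];    /* each waiter spins on its own slot */
    atomic_int queue_last;  /* next slot to hand out */
} alock;

void alock_init(alock *l) {
    atomic_store(&l->flags[0], AVAILABLE);   /* only slot 0 starts available */
    for (int i = 1; i < N; i++)
        atomic_store(&l->flags[i], MUST_WAIT);
    atomic_store(&l->queue_last, 0);
}

/* Returns my_place so release knows which slot to retire. */
int alock_acquire(alock *l) {
    int my_place = atomic_fetch_add(&l->queue_last, 1) % N;
    while (atomic_load(&l->flags[my_place]) == MUST_WAIT)
        ;                                    /* spin on our own location */
    return my_place;
}

void alock_release(alock *l, int my_place) {
    atomic_store(&l->flags[my_place], MUST_WAIT);           /* retire our slot */
    atomic_store(&l->flags[(my_place + 1) % N], AVAILABLE); /* wake successor */
}
```

One side effect is visible in the sketch: the array is sized for at most N waiters, and my_place must be threaded from acquire to release.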