CS5460: Operating Systems Lecture 9: Implementing Synchronization (Chapter 6)

Multiprocessor Memory Models Uniprocessor memory is simple Every load from a location retrieves the last value stored to that location Caches are transparent All processes / threads see the same view of memory The straightforward multiprocessor version of this memory model is sequential consistency: "A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor occur in this sequence in the order specified by its program." Operations performed by each processor occur in the specified order. This is Lamport's definition

Multiprocessor Memory Models Real multiprocessors do not provide sequential consistency Loads may be reordered after loads (IA64, Alpha) Stores may be reordered after stores (IA64, Alpha) Loads may be reordered after stores (IA64, Alpha) Stores may be reordered after loads (many, including x86 / x64) Even on a uniprocessor, compiler can reorder memory accesses

x86 / x86-64 Memory Model: TSO TSO: Total store ordering [Diagram: CPU 1 and CPU 2 each write through a private store buffer to shared RAM]

x86 / x86-64 Memory Model: TSO TSO: Total store ordering This breaks Peterson, Dekker, Bakery, etc. [Same diagram: each CPU's store buffer sits between it and RAM]

Weak Memory Example (This is the same as the code I sent out last week) Initially x and y are 0 Now run in parallel: CPU 0: x=1 ; print y CPU 1: y=1 ; print x What might be printed on a sequentially consistent machine? What might be printed on a TSO machine?
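A minimal way to try this on real hardware is the sketch below, using POSIX threads; the thread and variable names (cpu0, cpu1, r0, r1) are illustrative, and in practice you need to repeat the experiment many times before the non-sequentially-consistent outcome (both observations equal to 0) shows up.

    #include <pthread.h>
    #include <stdio.h>

    int x = 0, y = 0;          /* shared, intentionally racy */
    int r0, r1;                /* what each "CPU" observed   */

    void *cpu0(void *arg) { x = 1; r0 = y; return NULL; }
    void *cpu1(void *arg) { y = 1; r1 = x; return NULL; }

    int main(void) {
        pthread_t t0, t1;
        pthread_create(&t0, NULL, cpu0, NULL);
        pthread_create(&t1, NULL, cpu1, NULL);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        /* Sequential consistency forbids r0 == 0 && r1 == 0;
           TSO (plus compiler reordering) allows it. */
        printf("r0=%d r1=%d\n", r0, r1);
        return 0;
    }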

Memory Fences The x86 mfence instruction is your weapon against having your programs broken by TSO Loads and stores cannot be moved before or after the mfence instruction Basically you can think about it as flushing the store buffer and preventing the pipeline from reordering around the fence mfence is not cheap But see sfence and lfence which are weaker (and faster) than mfence
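For reference, here are two ways to obtain a full fence from C, sketched under the assumption of gcc/clang on x86-64; the portable C11 fence usually compiles to mfence or an equivalent locked instruction on this target.

    #include <stdatomic.h>

    static inline void full_fence_asm(void) {
        /* Raw x86 mfence, plus a compiler barrier (the "memory" clobber). */
        __asm__ __volatile__("mfence" ::: "memory");
    }

    static inline void full_fence_c11(void) {
        /* Portable equivalent: a sequentially consistent fence. */
        atomic_thread_fence(memory_order_seq_cst);
    }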

Weak Memory Example Initially x and y are 0 Now run in parallel: CPU 0: x=1 ; mfence ; print y CPU 1: y=1 ; mfence ; print x What might be printed on a sequentially consistent machine? What might be printed on a TSO machine?

Some good news for programmers If your multithreaded code is free of data races, you don't have to worry about the memory model Execution will be sequentially consistent Acquire/release of locks include fences Free of data races means every byte of memory is: not shared between threads; shared, but in a read-only fashion; or shared, but consistently protected by locks Your goal is to always write programs that are free of data races! Programs you write for this course will have data races, but these should be a rare exception

If You Do Write Data Races Accidental data race Always a serious bug Means you don't understand your code Deliberate data race Executions no longer sequentially consistent Dealing with the memory system and compiler optimizations is now your problem Always ask: Why am I writing racy code?

Writing Correct Racy Code 1. Mark all racing variables as volatile volatile int x[10]; This keeps the compiler from optimizing away and reordering memory references 2. Use memory fences, atomic instructions, etc. as needed These keep the memory system from reordering operations and breaking your code
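Applying that recipe to the earlier weak-memory example might look like the sketch below (gcc on x86-64 assumed; function names are illustrative): volatile handles step 1, the fence handles step 2.

    #include <stdio.h>

    volatile int x = 0, y = 0;   /* step 1: racing variables are volatile */

    void cpu0_body(void) {
        x = 1;
        __asm__ __volatile__("mfence" ::: "memory");   /* step 2: fence between the store and the load */
        printf("%d\n", y);
    }

    void cpu1_body(void) {
        y = 1;
        __asm__ __volatile__("mfence" ::: "memory");
        printf("%d\n", x);
    }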

Dekker Sync. Algorithm

static int f0, f1, turn;

void lock_p0(void)
{
    f0 = 1;
    while (f1) {
        if (turn != 0) {
            f0 = 0;
            while (turn != 0)
                ;          /* wait for our turn */
            f0 = 1;
        }
    }
}

Dekker Sync. Algorithm

static int f0, f1, turn;

void lock_p0(void)
{
    f0 = 1;
    while (f1) {
        if (turn != 0) {
            f0 = 0;
            while (turn != 0)
                ;
            f0 = 1;
        }
    }
}

GCC turns this into:

lock_p0:
    movl $1, f0(%rip)
    ret

Reminder For any mutual exclusion implementation, we want to be sure it guarantees: Cannot allow multiple processes in the critical section at the same time (mutual exclusion) Ensure progress (no deadlock or livelock) Ensure fairness (no starvation) We also want to know what invariants hold over the lock's data structures

Implementing Mutual Exclusion Option 1: Build on atomicity of loads and stores Peterson, Bakery, Dekker, etc. Loads and stores are weak, tedious to work with Portable solutions do not exist on modern processors Option 2: Build on more powerful atomic primitives Disable interrupts → keeps the scheduler from performing a context switch at an unfortunate time Atomic synchronization instructions» Many processors have some form of atomic Load-Op-Store» Also: Load-linked / Store-conditional (ARM, PPC, MIPS, Alpha) Common synchronization primitives: Semaphores and locks (similar) Barriers Condition variables Monitors

Lock by Disabling Interrupts V.1

class Lock {
public:
    void Acquire();
    void Release();
};

Lock::Lock() { }

Lock::Acquire() {
    disable interrupts;
}

Lock::Release() {
    enable interrupts;
}

Lock by Disabling Interrupts V.2

class Lock {
public:
    void Acquire();
    void Release();
private:
    int locked;
    Queue Q;
};

Lock::Lock() {
    locked ← 0;    // Lock free
    Q ← empty;     // Queue empty
}

Lock::Acquire(T:Thread) {
    disable interrupts;
    if (locked) {
        add T to Q;
        T→Sleep();
    }
    locked ← 1;
    enable interrupts;
}

Lock::Release() {
    disable interrupts;
    if (Q not empty) {
        remove T from Q;
        put T on readyQ;
    } else {
        locked ← 0;
    }
    enable interrupts;
}

Question: when, exactly, should Acquire() re-enable interrupts around the call to Sleep()?

Blocking vs. Not Blocking? Option 1: Spinlock Option 2: Yielding spinlock Option 3: Blocking locks Option 4: Hybrid solution: spin for a little while and then block (a sketch of this option follows below) How do we choose among these options on a uniprocessor? On a multiprocessor?
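A rough sketch of the hybrid option, using a GCC atomic builtin and sched_yield() as the "block" step; the spin limit and the yield-based fallback are illustrative choices, not the only way to do it.

    #include <sched.h>

    #define SPIN_LIMIT 1000            /* arbitrary tuning knob */

    static volatile int hybrid_locked;

    void hybrid_acquire(void) {
        for (;;) {
            for (int i = 0; i < SPIN_LIMIT; i++)
                if (!__sync_lock_test_and_set(&hybrid_locked, 1))
                    return;            /* grabbed the lock while spinning */
            sched_yield();             /* give up the CPU instead of burning it */
        }
    }

    void hybrid_release(void) {
        __sync_lock_release(&hybrid_locked);   /* store 0 with release semantics */
    }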

Problems With Disabling Interrupts Disabling interrupts for long is always bad Can result in lost interrupts and dropped data How long is too long depends on what you're doing Disabling interrupts (briefly!) is heavily used on uniprocessors But what about multiprocessors? Disabling interrupts on just the local processor is not very helpful» Unless we know that all processes are running on the local processor Disabling interrupts on all processors is expensive In practice, multiprocessor synchronization is usually done differently

Hardware Synchronization Ops test-and-set(loc, t) Atomically read original value and replace it with t compare-and-swap(loc, a, b) Atomically: if (loc == a) { loc = b; } fetch-and-add(loc, n) Atomically read the value at loc and replace it with its value incremented by n load-linked / store-conditional load-linked: loads value from specified address store-conditional: if no other thread has touched the value → store, else return error Typically used in a loop that does read-modify-write Loop checks to see if the read-modify-write sequence was interrupted
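As a point of reference, the first three primitives map fairly directly onto GCC's legacy __sync builtins, sketched below; the wrapper names mirror the slide, and LL/SC has no direct C-level equivalent.

    /* test-and-set: atomically store t, return the old value */
    int tas(int *loc, int t)        { return __sync_lock_test_and_set(loc, t); }

    /* compare-and-swap: if *loc == a, store b; always return the old value */
    int cas(int *loc, int a, int b) { return __sync_val_compare_and_swap(loc, a, b); }

    /* fetch-and-add: add n to *loc, return the old value */
    int faa(int *loc, int n)        { return __sync_fetch_and_add(loc, n); }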

Using Test&Set ("Spinlock")
test&set(loc, value): atomically tests old value and replaces it with new value
Acquire(): If free, what happens? If locked, what happens? If more than one at a time trying to acquire, what happens?
Busy waiting: while testing the lock, the process runs in a tight loop. What issues arise?

class Lock {
public:
    void Acquire(), Release();
private:
    int locked;
};

Lock::Lock() { locked ← 0; }

Lock::Acquire() {
    // Spin atomically until free
    while (test&set(locked, 1))
        ;
}

Lock::Release() { locked ← 0; }

Using Test&Set (Improved V.1)
test&set(loc, value): atomically tests old value and replaces it with new value
Acquire(): If free, what happens? If locked, what happens? If more than one at a time trying to acquire, what happens?
Busy waiting. What is new?

class Lock {
public:
    void Acquire(), Release();
private:
    int locked;
};

Lock::Lock() { locked ← 0; }

Lock::Acquire() {
    for (;;) {
        // Spin non-atomically first
        while (locked)
            ;
        if (!test&set(locked, 1))
            break;
    }
}

Lock::Release() { locked ← 0; }
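The same test-and-test-and-set idea in real C, sketched with the GCC builtin used earlier (variable and function names are illustrative):

    static volatile int ttas_locked;

    void ttas_acquire(void) {
        for (;;) {
            while (ttas_locked)                       /* non-atomic probe: spins in the local cache */
                ;
            if (!__sync_lock_test_and_set(&ttas_locked, 1))
                break;                                /* old value was 0: we now hold the lock */
        }
    }

    void ttas_release(void) {
        __sync_lock_release(&ttas_locked);            /* store 0 with release semantics */
    }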

Using Test&Set (Improved V.2)
test&set(loc, value): atomically tests old value and replaces it with new value
Acquire(): If free, what happens? If locked, what happens? If more than one at a time trying to acquire, what happens?
Queueing. What changes?

class Lock {
public:
    void Acquire(), Release();
private:
    int locked, guard;
};

Lock::Lock() { locked ← guard ← 0; }

Lock::Acquire(T:Thread) {
    while (test&set(guard, 1))
        ;
    if (locked) {
        put T to sleep and guard ← 0;
    } else {
        locked ← 1;
        guard ← 0;
    }
}

Lock::Release() {
    while (test&set(guard, 1))
        ;
    if (Q not empty) {
        wake up T;
    } else {
        locked ← 0;
    }
    guard ← 0;
}

Compare&Swap
cmp&swap(loc, a, b): atomically tests if loc contains a; if so, stores b into loc. Returns old value from loc.
Acquire(): If free, what happens? If locked, what happens? If more than one at a time trying to acquire, what happens?
Busy waiting. Can improve by: backing off if the compare fails; introducing a wait queue.

class Lock {
public:
    void Acquire(), Release();
private:
    int locked;
};

Lock::Lock() { locked ← 0; }

Lock::Acquire() {
    // Spin atomically until free
    while (cmp&swap(locked, 0, 1))
        ;
}

Lock::Release() { locked ← 0; }
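A compare-and-swap spinlock in plain C, sketched with the GCC builtin (which returns the old value, so a non-zero result means the lock was already held):

    static volatile int cas_locked;

    void cas_acquire(void) {
        /* Keep trying to change 0 -> 1; the return value is the old contents. */
        while (__sync_val_compare_and_swap(&cas_locked, 0, 1) != 0)
            ;
    }

    void cas_release(void) {
        __sync_lock_release(&cas_locked);   /* store 0 with release semantics */
    }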

Fetch&Add
fetch&add(loc, val): atomically reads loc, adds val to it, and writes the new value back
Acquire(): If free, what happens? If locked, what happens? If more than one at a time trying to acquire, what happens?
Busy waiting. Can improve by: backing off if the compare fails; introducing a wait queue.

class Lock {
public:
    void Acquire(), Release();
private:
    int counter = 0, turn = 0;
};

Lock::Acquire() {
    // Take a ticket, then spin until it is our turn
    int me = fetch&add(counter, 1);
    while (me != turn)
        ;
}

Lock::Release() {
    fetch&add(turn, 1);
}
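This is the classic ticket lock. A C sketch using the GCC builtin, with variable names mirroring the slide's counter/turn fields:

    static volatile int ticket_counter;   /* next ticket to hand out       */
    static volatile int ticket_turn;      /* ticket currently being served */

    void ticket_acquire(void) {
        int me = __sync_fetch_and_add(&ticket_counter, 1);   /* atomically take a ticket */
        while (ticket_turn != me)                            /* wait until our number is called */
            ;
    }

    void ticket_release(void) {
        __sync_fetch_and_add(&ticket_turn, 1);               /* serve the next waiter */
    }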

Load-linked, Store-conditional
LL(loc): loads loc and puts the address in a special register for the snoop hardware
SC(loc): conditionally stores to loc. Must be the same address as the last LL(). Succeeds iff loc was not modified since the LL(); fails if it was modified, or if there has been a context switch.
Typically used in try-retry loops. Optimistic concurrency control: optimize for the common case (locking → pessimistic concurrency control).

// increment counter w/ LL-SC (MIPS)
// r1: &counter
try:
    ll   r2, (r1)     // LL counter into r2
    addi r3, r2, 1    // r3 ← r2 + 1
    sc   r3, (r1)     // SC new counter
    beq  r3, 0, try   // r3 == 0 means the SC failed: retry
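There is no portable C spelling of LL/SC itself, but C11's weak compare-exchange is the usual stand-in: on ARM, POWER, and MIPS it typically compiles down to an LL/SC retry loop like the one above. A sketch:

    #include <stdatomic.h>

    void counter_inc(atomic_int *counter) {
        int old = atomic_load(counter);
        /* Retry while some other thread modified the counter between our
           load and our store; on failure, old is refreshed automatically. */
        while (!atomic_compare_exchange_weak(counter, &old, old + 1))
            ;
    }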

/* Linux's ARM ticket spinlock: ldrex/strex are ARM's LL/SC instructions. */
void arch_spin_lock(arch_spinlock_t *lock)
{
    unsigned long tmp;
    u32 newval;
    arch_spinlock_t lockval;

    asm volatile(
"1: ldrex   %0, [%3]\n"
"   add     %1, %0, %4\n"
"   strex   %2, %1, [%3]\n"
"   teq     %2, #0\n"
"   bne     1b"
    : "=&r" (lockval), "=&r" (newval), "=&r" (tmp)
    : "r" (&lock->slock), "I" (1 << TICKET_SHIFT)
    : "cc");

    while (lockval.tickets.next != lockval.tickets.owner) {
        wfe();
        lockval.tickets.owner = ACCESS_ONCE(lock->tickets.owner);
    }

    smp_mb();
}

Performance Issues Spinlocks (busy waiting) Ties up processor while thread continually tests value On an SMP, test&set() on a shared variable crosses the network Some possible optimizations:» Test and test&set(): use a non-atomic instruction to do the initial test» Back off (yield/sleep) between each probe (constant? variable?)» Implement per-lock queues → extra overhead, but fair When is spinning preferred? Consider where the lock variable resides: spinning on the local cache versus spinning over the interconnect Consider granularity of locking: lock the entire data structure versus lock individual elements
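One of these optimizations, backing off between probes, might look like the sketch below; the pause hint is x86-specific and the delay constants are purely illustrative.

    static volatile int bo_locked;

    void backoff_acquire(void) {
        int delay = 1;
        while (__sync_lock_test_and_set(&bo_locked, 1)) {
            for (int i = 0; i < delay; i++)
                __asm__ __volatile__("pause");   /* x86 spin-wait hint; cheap busy delay */
            if (delay < 1024)
                delay *= 2;                      /* exponential backoff, capped */
        }
    }

    void backoff_release(void) {
        __sync_lock_release(&bo_locked);
    }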

Summary The multiprocessor memory model is not your friend Write race-free programs whenever possible Common hardware synchronization primitives test-and-set, compare-and-swap, fetch-and-add, LL/SC Implementing spinlocks Performance issues: Spinning versus queuing Cache-aware synchronization policies Scheduling-aware synchronization policies

Questions?