Synchronization
Erik Hagersten, Uppsala University, Sweden


Execution on a sequentially consistent shared-memory machine:

    sum := 0
    thread_create        /* spawn four threads, each running: */

        while (sum < threshold)
            sum := sum + 1

    join
    printf(sum)

How many additions will get executed?
A: any value between threshold and threshold * 4.

What value will be printed?
A: any value between threshold and threshold + 3.

Need to introduce synchronization

Locking primitives are needed to ensure that only one process at a time can be in the critical section:

    LOCK(lock_variable)          /* wait for your turn */
    if (sum > threshold) {
        sum := my_sum + sum      /* critical section */
    }
    UNLOCK(lock_variable)        /* release the lock */

Components of a Synchronization Event

- Acquire method: acquire the right to the synch (enter the critical section, go past the event).
- Waiting algorithm: wait for the synch to become available when it isn't.
- Release method: enable other processors to acquire the right to the synch.
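As a concrete illustration of the critical-section idiom above, here is a minimal C sketch using POSIX threads. It is a sketch only; the names (worker, shared_sum, THRESHOLD, NTHREADS) and the chosen threshold are illustrative and not from the slides.

    #include <pthread.h>
    #include <stdio.h>

    #define THRESHOLD 1000000   /* illustrative value, not from the slides */
    #define NTHREADS  4

    static long shared_sum = 0;                          /* the shared counter */
    static pthread_mutex_t sum_lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&sum_lock);               /* acquire: wait for your turn */
            if (shared_sum >= THRESHOLD) {               /* test inside the critical section */
                pthread_mutex_unlock(&sum_lock);
                break;
            }
            shared_sum = shared_sum + 1;                 /* the critical section */
            pthread_mutex_unlock(&sum_lock);             /* release the lock */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        printf("%ld\n", shared_sum);                     /* prints exactly THRESHOLD */
        return 0;
    }

With the lock, the printed value is exactly THRESHOLD; removing the lock reproduces the race discussed above.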

Atomic Instruction to Acquire

Atomic example: test&set, TAS (SPARC: LDSTUB)
- The value at Mem(lock_addr) is loaded into the specified register.
- A constant is atomically stored into Mem(lock_addr) (SPARC: 0xFF).
- Software can determine if it won (i.e., whether the set changed the value from 0 to 1).
- Other constants could be used instead of 0 and 1.
- Looks like a store instruction to the caches/memory system. Implementation:
  1. Get an exclusive copy of the cache line.
  2. Make the atomic modification to the cached copy.

Other read-modify-write primitives can be used too:
- Swap (SWAP): atomically swap the value of REG with Mem(lock_addr).
- Compare&swap (CAS): SWAP only if Mem(lock_addr) == REG2.

Waiting Algorithms

Blocking
- Waiting processes/threads are de-scheduled.
- High overhead, but allows the processor to do other things.

Busy-waiting
- Waiting processes repeatedly test a lock_variable until it changes value; the releasing process sets the lock_variable.
- Lower overhead, but consumes processor resources and can cause network traffic.

Hybrid methods: busy-wait a while, then block.

Release Algorithm

Typically just a store. More complicated locks may require a conditional store or a wake-up.

A Bad Example: POUNDING

    proc lock(lock_variable) {
        while (TAS[lock_variable] == 1) {
            /* bang on the lock until free */
        }
    }

    proc unlock(lock_variable) {
        lock_variable := 0
    }

Assume: the function TAS (test and set) returns the current memory value and atomically writes the busy pattern (1) to memory.

Generates too much traffic!! -- spinning threads produce traffic!
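A minimal sketch of this POUNDING lock in C, assuming the C11 atomic_flag operations stand in for the TAS instruction; the identifiers are illustrative.

    #include <stdatomic.h>

    typedef struct { atomic_flag flag; } tas_lock_t;      /* flag set = lock held */

    static void tas_lock_init(tas_lock_t *l) { atomic_flag_clear(&l->flag); }

    static void tas_lock(tas_lock_t *l)
    {
        /* Each iteration is an atomic read-modify-write ("pounding"): it acquires
           the cache line in exclusive state even when the lock is already busy. */
        while (atomic_flag_test_and_set_explicit(&l->flag, memory_order_acquire))
            ;                                              /* bang on the lock until free */
    }

    static void tas_unlock(tas_lock_t *l)
    {
        atomic_flag_clear_explicit(&l->flag, memory_order_release);   /* just a store */
    }

Every failed test-and-set is a write, so each spinning thread keeps pulling the cache line exclusive -- exactly the traffic problem the slide points out.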

Optimistic Test&Set Lock spinlock (test-and-test-and-set)

    proc lock(lock_variable) {
        while (true) {
            if (TAS[lock_variable] == 0) break;   /* bang on the lock once, done if TAS returns 0 */
            while (lock_variable != 0) {
                /* spin locally in your cache until a release is observed */
            }
        }
    }

    proc unlock(lock_variable) {
        lock_variable := 0
    }

Much less coherence traffic!! -- but still lots of traffic at lock handover!

It could still get messy!
(Figure: N processors spin on the lock while one is in the critical section; when the lock is released, all N spinners read the line across the interconnect -- N reads.)

...messy (part 2): the spinners that saw the lock free then each issue a Test&Set -- N-1 writes, one after the other.

Could get even worse on a NUMA:
- Poor communication latency.
- Serialization of accesses to the same cache line: potentially ~N*N/2 reads :-(

Problem 1: contention on the interconnect slows down the processor in the critical section.
Problem 2: the lock hand-over time is N * read_throughput.

Fix: some back-off strategy -- but that is bad news for hand-over latency.
Fix: queue-based locks.

WF added a hardware optimization: TAS can bypass loads in the coherence protocol ==> N-2 loads queue up in the protocol ==> the winner's atomic TAS will bypass the loads ==> the loads will return "busy".
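A minimal C sketch of the test-and-test-and-set idea combined with the exponential back-off fix mentioned above, assuming C11 atomics; the back-off constants and identifiers are illustrative.

    #include <stdatomic.h>
    #include <sched.h>

    typedef struct { atomic_int locked; } ttas_lock_t;    /* 0 = free, 1 = held */

    static void ttas_lock(ttas_lock_t *l)
    {
        unsigned backoff = 1;                              /* illustrative constants */
        for (;;) {
            /* Optimistic part: spin on plain loads, which hit the local cache copy. */
            while (atomic_load_explicit(&l->locked, memory_order_relaxed) != 0)
                ;
            /* Lock looked free: try the atomic read-modify-write once. */
            if (atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) == 0)
                return;                                    /* we won the lock */
            /* Lost the handover race: back off before trying again. */
            for (unsigned i = 0; i < backoff; i++)
                sched_yield();
            if (backoff < 1024)
                backoff *= 2;                              /* exponential back-off, capped */
        }
    }

    static void ttas_unlock(ttas_lock_t *l)
    {
        atomic_store_explicit(&l->locked, 0, memory_order_release);
    }

The back-off reduces the burst of writes at handover, at the cost of a longer hand-over latency when only a few threads compete.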

Ticket-based queue lock: TICKET

    proc lock(lstruct) {
        int my_num;
        my_num := INC(lstruct.ticket)              /* get your unique number */
        while (my_num != lstruct.nowserving) {
            /* wait here for your turn */
        }
    }

    proc unlock(lstruct) {
        lstruct.nowserving++                       /* next in line, please */
    }

Less traffic at lock handover!

Ticket-based back-off: TBO

    proc lock(lstruct) {
        int my_num;
        my_num := INC(lstruct.ticket)              /* get your number */
        while (my_num != lstruct.nowserving) {     /* my turn? */
            idle_wait(lstruct.nowserving - my_num) /* do other shopping */
        }
    }

    proc unlock(lstruct) {
        lstruct.nowserving++                       /* next in line, please */
    }

Even less traffic at lock handover!

Queue-based lock: CLH lock -- a variation of the MCS lock

Initially, each process owns one global flag, pointed to by its private pointers *I and *P. The global lock is pointed to by the global *L.

1) Initialize the *I flag to busy (= 1).
2) Atomically, make *L point to our flag and make *P point to where *L pointed.
3) Wait until the flag *P points to becomes free (= 0).

    proc lock(int **I, **P) {
        **I = 1;                      /* initialize our flag as busy */
        swap { *P = *L; *L = *I }     /* *P now stores a pointer to the flag *L pointed to;
                                         *L now stores a pointer to our flag */
        while (**P != 0) {
            /* keep spinning until the previous owner releases the lock */
        }
    }

    proc unlock(int **I, **P) {
        **I = 0;                      /* release the lock */
        *I = *P;                      /* next time, reuse the previous guy's flag */
    }
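A minimal C11 sketch of the ticket lock shown above (the TBO variant only adds an idle_wait in the spin loop); identifiers are illustrative.

    #include <stdatomic.h>

    typedef struct {
        atomic_uint ticket;       /* next ticket number to hand out */
        atomic_uint nowserving;   /* ticket currently allowed into the critical section */
    } ticket_lock_t;

    static void ticket_lock(ticket_lock_t *l)
    {
        /* INC(lstruct.ticket): fetch-and-increment hands every waiter a unique number. */
        unsigned my_num = atomic_fetch_add_explicit(&l->ticket, 1, memory_order_relaxed);
        /* Spin on plain loads of nowserving until it is our turn; an idle_wait()
           proportional to (my_num - nowserving) would turn this into TBO. */
        while (atomic_load_explicit(&l->nowserving, memory_order_acquire) != my_num)
            ;
    }

    static void ticket_unlock(ticket_lock_t *l)
    {
        /* nowserving++: exactly one waiter's condition becomes true, in FIFO order. */
        unsigned next = atomic_load_explicit(&l->nowserving, memory_order_relaxed) + 1;
        atomic_store_explicit(&l->nowserving, next, memory_order_release);
    }

Only the lock holder writes nowserving, so the increment in unlock does not need to be an atomic read-modify-write.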

(Animation: the following slides step through the CLH lock graphically -- each arriving thread initializes its flag to busy, atomically swaps itself in as the node pointed to by *L, spins on its predecessor's flag, and on unlock clears its own flag and reuses the predecessor's.)
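A minimal C sketch of the CLH acquire and release just shown, assuming C11 atomics; the per-thread node bookkeeping is simplified and all identifiers are illustrative.

    #include <stdatomic.h>
    #include <stdbool.h>

    typedef struct { atomic_bool busy; } clh_node_t;

    /* The global lock is a pointer to the tail node (the *L of the slides).
       It must initially point to one node whose busy flag is false. */
    typedef struct { _Atomic(clh_node_t *) tail; } clh_lock_t;

    /* Per-thread state: the node we own (*I) and our predecessor's node (*P). */
    typedef struct { clh_node_t *I, *P; } clh_thread_t;

    static void clh_lock(clh_lock_t *L, clh_thread_t *t)
    {
        atomic_store_explicit(&t->I->busy, true, memory_order_relaxed);  /* 1) init to busy */
        /* 2) atomically: P = old tail, tail = our node */
        t->P = atomic_exchange_explicit(&L->tail, t->I, memory_order_acq_rel);
        /* 3) spin on the predecessor's flag until it releases the lock */
        while (atomic_load_explicit(&t->P->busy, memory_order_acquire))
            ;
    }

    static void clh_unlock(clh_thread_t *t)
    {
        atomic_store_explicit(&t->I->busy, false, memory_order_release); /* release the lock */
        t->I = t->P;                 /* reuse the previous owner's node next time */
    }

Each thread spins on a different flag (its predecessor's), which is what keeps handover traffic low.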

The CLH lock minimizes traffic at lock handover! But it may be too fair for NUMAs.

(Charts: "E68 locks" and "E68 locks (excluding POUND)" -- average time per critical-section job versus number of contenders for the POUND, SPIN, TICKET, TBO, and CLH locks.)

NUMA and NUCA

(Figures: a NUMA built from snoop-based nodes -- memory, switch, WF I/F -- where the directory latency is roughly 6x the snoop latency; and a NUCA (Non-Uniform Communication Architecture), i.e., roughly the on-chip non-uniformity of a CMP.)

Traditional chart of lock performance on a hierarchical NUMA

Benchmark (round-robin scheduling):

    for i = 1 to ... {
        lock(al)
        A := A + 1
        unlock(al)
    }

(Chart: time per critical section versus number of processors, 2-32, for TATAS spin, TATAS_EXP spin_exp, MCS-queue, and CLH-queue.)

Introducing RH locks

RH locks encourage unfairness: they favor handing the lock over within the same node.

(Charts, same benchmark: time per lock handover [seconds] and node hand-offs / node migration [%] versus number of processors, 2-32, now also including RH-locks.)

Example: Splash Raytrace

(Chart: application speedup, 1-8, versus number of processors, 4-28, for SPIN, TATAS, SPIN_EXP, TATAS_EXP, MCS, CLH, and RH locks; results from HBO@HPCA 2003 and RH@SC 2002.)

(Chart: performance under contention -- time between entries into the critical section versus contention, for TestAndSet, TestAndSet ExpBackoff, Queue-based, and RH locks.)

Barriers

Make the first threads wait for the last thread to reach a point in the program.

1. Software algorithms, implemented using locks, flags, and counters.
2. Hardware barriers:
   - A wired-AND line separate from the address/data bus: set your input high when you arrive, wait for the output to go high before leaving. (In practice, multiple wires to allow reuse.)
   - Difficult to support an arbitrary subset of processors; even harder with multiple processes per processor.
   - Difficult to dynamically change the number and identity of participants (the latter e.g. due to process migration).

A Centralized Barrier

    BARRIER(bar_name, p) {
        int loops;
        loops = 0;
        local_sense = !(local_sense);   /* toggle private sense variable each time the barrier is used */
        LOCK(bar_name.lock);
        bar_name.counter++;             /* globally increment the barrier count */
        if (bar_name.counter == p) {    /* everybody here yet? */
            bar_name.counter = 0;       /* reset the count for the next use */
            bar_name.flag = local_sense;    /* release the waiters */
            UNLOCK(bar_name.lock);
        } else {
            UNLOCK(bar_name.lock);
            while (bar_name.flag != local_sense) {   /* wait for the last guy */
                if (loops++ > UNREASONABLE) report_warning(pid);
            }
        }
    }
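A minimal C sketch of this centralized sense-reversing barrier, assuming C11 atomics, a pthread mutex, and a thread-local sense variable; the identifiers are illustrative.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <pthread.h>

    typedef struct {
        pthread_mutex_t lock;     /* protects counter */
        int counter;              /* number of threads that have arrived */
        atomic_bool flag;         /* toggled by the last arriver to release the rest */
    } barrier_t;

    static _Thread_local bool local_sense = false;     /* private sense variable */

    /* p is the number of participating threads. The barrier must be initialized
       with counter = 0, flag = false, and an initialized mutex. */
    static void barrier_wait(barrier_t *b, int p)
    {
        local_sense = !local_sense;                    /* toggle sense on every use */
        pthread_mutex_lock(&b->lock);
        b->counter++;                                  /* globally count arrivals */
        if (b->counter == p) {                         /* everybody here yet? */
            b->counter = 0;                            /* reset for the next use */
            pthread_mutex_unlock(&b->lock);
            atomic_store(&b->flag, local_sense);       /* release the waiters */
        } else {
            pthread_mutex_unlock(&b->lock);
            while (atomic_load(&b->flag) != local_sense)
                ;                                      /* wait for the last thread */
        }
    }

Sense reversal lets the same barrier be reused back-to-back without a second synchronization point.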

Centralized Barrier Performance

Latency
- Want a short critical path in the barrier.
- Centralized has a critical path length at least proportional to p.

Traffic
- Barriers are likely to be highly contended, so we want traffic to scale well.
- About 3p bus transactions in the centralized barrier.

Storage cost
- Very low: a centralized counter and flag.

Key problems for the centralized barrier are latency and traffic; especially with distributed memory, all the traffic goes to the same node. Hierarchical barriers.