Lecture 19: Synchronization CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012)

Announcements
Assignment 4 due tonight at 11:59 PM

Synchronization primitives (that we have seen or will soon see in class)
For ensuring mutual exclusion:
- Locks
- Basic atomic operations (e.g., atomicAdd)
- Transactions (next time)
For event signaling:
- Barriers
- Flags
Today's topic: efficiently implementing synchronization primitives

Three phases of a synchronization event
1. Acquire method: how a process attempts to gain access to the protected resource
2. Waiting algorithm: how a process waits for access to be granted to the shared resource
3. Release method: how a process enables other processes to gain the resource when its work in the synchronized region is complete

What you should know
- Performance issues related to various lock implementations (specifically their interaction with cache coherence)
- Performance issues related to various barrier implementations

Busy waiting and blocking
Busy waiting (a.k.a. "spinning"):

    while (condition X not true) { }
    // proceed with logic that assumes X is true

In 15-213 or in OS, you have talked about synchronization
- You have probably been taught busy-waiting is bad: why?
Blocking: if progress cannot be made, free up resources for someone else (pre-emption)

    if (condition X not true)
        block;   // OS scheduler kicks in, de-schedules process from processor

Busy waiting vs. blocking
Busy-waiting can be preferable to blocking if:
- Scheduling overhead is larger than the expected wait time
- Processor's resources are not needed for other tasks
- This is often the case in a parallel program, since we usually don't oversubscribe a system when running a performance-critical parallel app (e.g., there aren't multiple programs running at the same time)
- Clarification: be careful not to confuse this idea with the clear value of multi-threading (interleaving execution of multiple threads/tasks to hide the long latency of memory operations with other work within the same app)
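As a concrete illustration of the two waiting strategies, POSIX happens to expose both: pthread_spin_lock busy-waits, while pthread_mutex_lock may block and let the OS de-schedule the waiting thread. A minimal sketch (the function names wrapping the critical sections are mine, not the lecture's):

    #include <pthread.h>

    pthread_spinlock_t spin;   // initialize once: pthread_spin_init(&spin, PTHREAD_PROCESS_PRIVATE)
    pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

    void spin_version(void) {
        pthread_spin_lock(&spin);     // busy-waits: burns the CPU until acquired
        /* critical section */
        pthread_spin_unlock(&spin);
    }

    void blocking_version(void) {
        pthread_mutex_lock(&mutex);   // may block: OS can de-schedule this thread while it waits
        /* critical section */
        pthread_mutex_unlock(&mutex);
    }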

Locks

Warm up: a simple, but incorrect, lock

    lock:    ld   R0, mem[addr]    // load word into R0
             cmp  R0, #0           // compare R0 to 0
             bnz  lock             // if nonzero, lock is held: try again
             st   mem[addr], #1    // if 0, store 1 to take the lock

    unlock:  st   mem[addr], #0    // store 0 to address

Problem: data race because LOAD-TEST-STORE is not atomic! Two processors can both load 0, both pass the test, and both store 1, so both believe they hold the lock.

Test-and-set based lock
Test-and-set instruction:

    ts R0, mem[addr]    // atomically load mem[addr] into R0
                        // and set mem[addr] to 1

    lock:    ts  R0, mem[addr]    // load word into R0, set location to 1
             bnz R0, lock         // if R0 was nonzero, lock was held: try again
                                  // if 0, lock obtained

    unlock:  st  mem[addr], #0    // store 0 to address
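A minimal C11 sketch of the same idea, with atomic_flag_test_and_set playing the role of the ts instruction (the use of C11 atomics and the type/function names are my choices, not the slide's):

    #include <stdatomic.h>

    typedef struct { atomic_flag held; } ts_lock;

    void ts_lock_init(ts_lock* l)    { atomic_flag_clear(&l->held); }
    void ts_lock_acquire(ts_lock* l) { while (atomic_flag_test_and_set(&l->held)); }  // spin until previous value was 0
    void ts_lock_release(ts_lock* l) { atomic_flag_clear(&l->held); }                 // store 0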

Test & set lock: consider coherence traffic
[Diagram: timeline of bus transactions for three processors contending for a test-and-set lock. Every T&S attempt issues a BusRdX, updates the line in the issuing processor's cache (setting it to 1), and invalidates the line in all other caches; this traffic continues even while P1 is holding the lock. The release (update line to 0) is itself another BusRdX plus a round of invalidations, followed by yet more T&S/BusRdX traffic from the waiters.]

Test-and-set lock performance
Benchmark: total of N lock/unlock sequences (in aggregate) by P processors; critical-section time removed, so the graph plots only the time spent acquiring/releasing the lock.
Benchmark executes:

    lock(l);
    critical-section(c);
    unlock(l);

[Graph: time (us) vs. number of processors; measured time grows far above the ideal of one bus transaction per lock event.]
Bus contention increases the amount of time needed to transfer the lock (the lock holder must wait to acquire the bus to release it).
Not shown: bus contention also slows down execution of the critical section.

Desirable lock performance characteristics
Low latency
- If the lock is free, and no other processors are trying to acquire it, a processor should be able to acquire it quickly
Low traffic
- If all processors are trying to acquire the lock at once, they should acquire it in succession with as little traffic as possible
Scalability
- Latency / traffic should scale reasonably with the number of processors
Low storage cost
Fairness
- Avoid starvation or substantial unfairness
- One ideal: processors should acquire the lock in the order they request access to it
Simple test-and-set lock: low latency (under low contention), high traffic, poor scaling, low storage cost (one int), no provisions for fairness

Test-and-test-and-set lock

    void Lock(volatile int* lock) {
        while (1) {
            while (*lock != 0);            // while another processor has the lock...
            if (test&set(*lock) == 0)      // when lock is released, try to acquire it
                return;
        }
    }

    void Unlock(volatile int* lock) {
        *lock = 0;
    }

Test & test & set lock: coherence traffic
[Diagram: timeline for three processors. P1's T&S issues a BusRdX and sets the line to 1; while P1 is holding the lock, each waiting processor takes one BusRd and then spins with many reads from its local cache. On release, P1's update of the line to 0 invalidates the waiters' copies; each waiter re-reads (BusRd), and the winning T&S issues a BusRdX that sets the line to 1 and invalidates the others.]

Test & test & set characteristics
- Higher latency than test & set in the uncontended case: must test... then test-and-set
- Generates much less bus traffic: one invalidation per waiting processor per lock release
- More scalable (due to less traffic)
- Storage cost unchanged
- Still no provisions for fairness

Test-and-set lock with backoff
Upon failure to acquire the lock, delay for a while before retrying.

    void Lock(volatile int* l) {
        int amount = 1;
        while (1) {
            if (test&set(*l) == 0)
                return;
            delay(amount);
            amount *= 2;       // exponential backoff
        }
    }

- Same uncontended latency as test-and-set
- Generates less traffic than test-and-set (not continually attempting to acquire the lock)
- Improves scalability (due to less traffic)
- Storage cost unchanged
- Exponential backoff can cause severe unfairness: newer requesters back off for shorter intervals

Ticket lock
Main problem with test & set style locks: upon release, all waiting processors attempt to acquire the lock using test & set.

    struct lock {
        volatile int next_ticket;
        volatile int now_serving;
    };

    void Lock(lock* l) {
        int my_ticket = atomicincrement(&l->next_ticket);  // take a ticket
        while (my_ticket != l->now_serving);               // wait until served
    }

    void unlock(lock* l) {
        l->now_serving++;
    }
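A hedged C11 rendering of the same lock, mapping the slide's atomicincrement onto atomic_fetch_add (the type and function names are mine):

    #include <stdatomic.h>

    typedef struct {
        atomic_int next_ticket;   // ticket dispenser
        atomic_int now_serving;   // ticket currently admitted to the critical section
    } ticket_lock;

    void ticket_acquire(ticket_lock* l) {
        int my_ticket = atomic_fetch_add(&l->next_ticket, 1);  // take a ticket
        while (atomic_load(&l->now_serving) != my_ticket);     // spin until served
    }

    void ticket_release(ticket_lock* l) {
        atomic_fetch_add(&l->now_serving, 1);                  // admit the next waiter
    }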

Array-based lock
Each processor spins on a different memory address.
Use fetch&op (below: atomicincrement) to assign an address on the attempt to acquire.

    struct lock {
        volatile int status[P];   // one slot per processor
        volatile int head;
    };

    int my_element;

    void Lock(lock* l) {
        my_element = atomicincrement(&l->head);   // assume circular increment
        while (l->status[my_element] == 1);       // spin on own slot
    }

    void unlock(lock* l) {
        l->status[next(my_element)] = 0;          // release the next waiter in line
    }

O(1) traffic per release, but requires space linear in P.
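Note that the O(1)-traffic claim assumes each status slot sits on its own cache line; otherwise waiters spinning on adjacent slots would still invalidate each other via false sharing. A sketch of the usual fix (the padded_int name and the 64-byte line size are assumptions, not from the slide):

    // Hypothetical padded slot so each processor spins on a private cache line
    typedef struct {
        volatile int v;
        char pad[64 - sizeof(int)];   // pad to an assumed 64-byte cache line
    } padded_int;

    // the lock's array would then be declared as:  padded_int status[P];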

Implementing atomic fetch-and-op

    // atomicCAS: atomic compare and swap (shown here as if the body executed atomically)
    int atomicCAS(int* addr, int compare, int val) {
        int old = *addr;
        *addr = (old == compare) ? val : old;
        return old;
    }

Exercise: how can you build an atomic fetch+op out of atomicCAS()?
- Try: atomicincrement()
See the definition of atomicCAS() in the NVIDIA programmer's guide.
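One possible answer to the exercise, sketched under the assumption that atomicCAS behaves as specified above: retry the CAS until no other thread has changed the value in between.

    int atomicincrement(int* addr) {
        int old, assumed;
        do {
            assumed = *addr;                              // read current value
            old = atomicCAS(addr, assumed, assumed + 1);  // try to install value + 1
        } while (old != assumed);                         // another thread interfered: retry
        return old;                                       // value before the increment
    }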

Barriers

Implementing a centralized barrier
Based on a shared counter.

    struct Bar {
        int counter;   // initialize to 0
        int flag;
        LOCK lock;
    };

    // barrier for p processors
    void Barrier(Bar* b, int p) {
        lock(b->lock);
        if (b->counter == 0) {
            b->flag = 0;                 // first arriver clears flag
        }
        int arrived = ++(b->counter);
        unlock(b->lock);
        if (arrived == p) {              // last arriver sets flag
            b->counter = 0;
            b->flag = 1;
        } else {
            while (b->flag == 0);        // wait for flag
        }
    }

Does it work? Consider:

    do stuff...
    Barrier(b, P);
    do more stuff...
    Barrier(b, P);

(It does not: a fast thread can race ahead to the second barrier and clear the flag as its first arriver while a slow thread is still spinning on the flag at the first barrier; the slow thread then waits forever.)

Correct centralized barrier

    struct Bar {
        int arrive_counter;   // initialize to 0
        int leave_counter;    // initialize to P
        int flag;
        LOCK lock;
    };

    // barrier for p processors
    void Barrier(Bar* b, int p) {
        lock(b->lock);
        if (b->arrive_counter == 0) {        // first arriver
            unlock(b->lock);                 // release lock so leavers can update leave_counter
            while (b->leave_counter != p);   // wait for all to leave before clearing
            lock(b->lock);
            b->flag = 0;                     // first arriver clears flag
        }
        int arrived = ++(b->arrive_counter);
        unlock(b->lock);
        if (arrived == p) {                  // last arriver sets flag
            b->arrive_counter = 0;
            b->leave_counter = 0;
            b->flag = 1;
        } else {
            while (b->flag == 0);            // wait for flag
        }
        lock(b->lock);
        b->leave_counter++;                  // count this processor as having left
        unlock(b->lock);
    }

Main idea: wait for all processes to leave the first barrier before clearing the flag for the second.

Correct centralized barrier: sense reversal

    struct Bar {
        int counter;   // initialize to 0
        int flag;
        LOCK lock;
    };

    int local_sense = 0;   // private per processor

    // barrier for p processors
    void Barrier(Bar* b, int p) {
        local_sense = !local_sense;          // flip the sense this episode waits for
        lock(b->lock);
        int arrived = ++(b->counter);
        if (arrived == p) {                  // last arriver sets flag
            unlock(b->lock);
            b->counter = 0;
            b->flag = local_sense;
        } else {
            unlock(b->lock);
            while (b->flag != local_sense);  // wait for flag
        }
    }

One spin instead of two.
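A hedged C11 sketch of the same sense-reversing barrier, replacing the slide's lock with a fetch-and-add (the names and the use of _Thread_local are my assumptions):

    #include <stdatomic.h>

    typedef struct {
        atomic_int count;   // arrivals in the current episode
        atomic_int flag;    // current release sense, starts at 0
    } sense_barrier;

    _Thread_local int local_sense = 0;   // private per thread

    void barrier_wait(sense_barrier* b, int p) {
        local_sense = !local_sense;                        // flip sense for this episode
        if (atomic_fetch_add(&b->count, 1) + 1 == p) {     // last arriver
            atomic_store(&b->count, 0);                    // reset before releasing anyone
            atomic_store(&b->flag, local_sense);           // release all waiters
        } else {
            while (atomic_load(&b->flag) != local_sense);  // spin on the flag
        }
    }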

Centralized barrier: traffic
O(p) traffic on a bus:
- 2p transactions to obtain the barrier lock and update the counter
- 2 transactions to write the flag + reset the counter
- p-1 transactions to read the updated flag
But there is still serialization on a single shared variable:
- Latency is O(p)
- Can we do better?

Combining trees
[Diagram: a centralized barrier, with high contention on its single counter, vs. a combining-tree barrier, with counters arranged in a tree.]
Combining trees make better use of parallelism in interconnect topologies:
- lg(p) latency
- Strategy makes less sense on a bus (all traffic is still serialized on the single shared bus)
Acquire: when a processor arrives at the barrier, it performs an atomicincr() of its parent's counter; the process recurses to the root.
Release: beginning from the root, notify children of the release.

Next time
What if you have a shared variable for which contention is low enough that it is unlikely two processors will enter the critical section at the same time?
You could avoid the overhead of taking the lock, since it is very likely that ensuring mutual exclusion is not needed for correctness.
What happens if you take this approach and you're wrong: in the middle of the critical region, another process enters the same region?

Next time: transactional memory

    atomic {
        // begin transaction
        perform atomic computation here...
        // end transaction
    }

Instead of ensuring mutual exclusion via locks, the system will proceed as if no synchronization were necessary (speculation). The system provides hardware/software support for rolling back all loads and stores from the critical region if it detects (at runtime) that another thread has entered the same region.
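For a taste of what this looks like in practice, GCC implements a similar construct via its transactional memory extension (compile with -fgnu-tm); mapping the slide's atomic { } onto this feature is my illustration, not the lecture's:

    // Sketch using GCC's -fgnu-tm transactional memory extension
    int balance = 0;

    void deposit(int amount) {
        __transaction_atomic {
            balance += amount;   // executes atomically; conflicting transactions roll back and retry
        }
    }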