Multiprocessor Synchronization


Material in this lecture is from Hennessy and Patterson, Chapter 8, pgs. 694-708, with some material from David Patterson's slides for CS 252 at Berkeley.

Multiprogramming and Multiprocessing Imply Synchronization

Locking: critical sections and mutual exclusion. Used for exclusive access to a shared resource or shared data for some period of time, e.g., efficient update of a shared (work) queue.

Barriers: process synchronization. All processes must reach the barrier before any one can proceed (e.g., at the end of a parallel loop).

Why doesn't coherency solve these problems?

Locking

Typical use of a lock:

    while (!acquire(lock))
        /* spin */ ;
    /* some computation on shared data (critical section) */
    release(lock);

Acquire is based on a read-modify-write primitive. The basic principle is an atomic exchange; variants include test-and-set and fetch-and-increment.

Issues for Synchronization

- An uninterruptible instruction to fetch and update memory (an atomic operation);
- User-level synchronization operations built from this primitive;
- For large-scale MPs, synchronization can be a bottleneck; techniques are needed to reduce the contention and latency of synchronization.
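The acquire/spin/release pattern above can be sketched with C11 atomics. This is an illustrative sketch, not lecture code: `try_acquire`, `acquire`, and `release` are assumed names, and the atomic exchange stands in for the read-modify-write primitive.

```c
#include <stdatomic.h>

typedef atomic_int lock_t;          /* 0 = free, 1 = held */

/* One read-modify-write attempt: swap in 1, succeed if the old value was 0. */
static int try_acquire(lock_t *lock) {
    return atomic_exchange(lock, 1) == 0;
}

static void acquire(lock_t *lock) {
    while (!try_acquire(lock))
        /* spin */ ;
}

static void release(lock_t *lock) {
    atomic_store(lock, 0);          /* release simply by writing a 0 */
}
```

Because the exchange is indivisible, two processors attempting `try_acquire` at once cannot both see the old value 0, so at most one enters the critical section.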

Uninterruptible Instruction to Fetch and Update Memory

Atomic exchange: interchange a value in a register with a value in memory.
- 0 => the synchronization variable is free
- 1 => the synchronization variable is locked and unavailable

Set the register to 1 and swap. The new value in the register determines success in getting the lock: 0 if you succeeded in setting the lock (you were first), 1 if another processor had already claimed access. The key is that the exchange operation is indivisible. Release the lock simply by writing a 0. Note that every execution requires both a read and a write.

Test-and-set: tests a value and sets it if the value passes the test.

Fetch-and-increment: returns the value of a memory location and atomically increments it (again, 0 means the synchronization variable is free).
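As a sketch of these primitives' semantics (assumed wrappers, not lecture code), C11's stdatomic exposes them directly: `atomic_exchange` returns the old value, so 0 means you won the lock, and `atomic_fetch_add` returns the value before the increment.

```c
#include <stdatomic.h>

/* Atomic exchange as a lock attempt: returns the OLD memory value.
 * 0 => the variable was free and we now hold it; 1 => someone else held it. */
static int exchange_lock_attempt(atomic_int *sync_var) {
    return atomic_exchange(sync_var, 1);
}

/* Fetch-and-increment: returns the old value and atomically bumps it,
 * e.g., to hand out unique slots in a shared work queue. */
static int fetch_and_increment(atomic_int *counter) {
    return atomic_fetch_add(counter, 1);
}
```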

Load Linked & Store Conditional

It is hard to have a read and a write in one instruction (needed for atomic exchange and others): there are potential pipeline difficulties from needing two memory operations, and it makes coherence more difficult, since the hardware cannot allow any operations between the read and the write, yet must not deadlock. So, use two instructions instead: load linked (or load locked) plus store conditional. Load linked returns the initial value; store conditional returns 1 if it succeeds (no other store to the same memory location since the preceding load) and 0 otherwise. How could you implement this instruction pair?

Example: atomic exchange with LL & SC:

    try:  mov  R3,R4      ; move exchange value
          ll   R2,0(R1)   ; load linked
          sc   R3,0(R1)   ; store conditional
          beqz R3,try     ; branch if store fails (R3 = 0)
          mov  R4,R2      ; put loaded value in R4

Example: fetch-and-increment with LL & SC:

    try:  ll   R2,0(R1)   ; load linked
          addi R2,R2,#1   ; increment (register operations between ll and sc are OK)
          sc   R2,0(R1)   ; store conditional
          beqz R2,try     ; branch if store fails (R2 = 0)

Note that these code sequences only do a single atomic swap or fetch-and-increment; they do not implement a full lock acquire function (more on this later).
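Portable C has no ll/sc, but `atomic_compare_exchange_weak` is the usual stand-in: on ARM or RISC-V it typically compiles down to an ll/sc pair, and its may-fail-spuriously/retry contract mirrors the beqz-back-to-try loop above. A hedged sketch of the fetch-and-increment example in that style:

```c
#include <stdatomic.h>

/* Fetch-and-increment built from a compare-and-swap retry loop, mirroring
 * the ll/addi/sc/beqz sequence: read, compute, attempt to publish, and
 * retry from the top if the store-conditional analogue fails. */
static int cas_fetch_and_increment(atomic_int *loc) {
    int old = atomic_load(loc);                 /* the "load linked" */
    /* On failure (another store intervened, or a spurious failure),
     * the call reloads 'old' for us, so we just retry the "sc". */
    while (!atomic_compare_exchange_weak(loc, &old, old + 1))
        ;
    return old;                                 /* value before our increment */
}
```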

Spin Locks

The processor continuously tries to acquire the lock, spinning around a loop:

            li   R2,#1
    lockit: exch R2,0(R1)   ; atomic exchange
            bnez R2,lockit  ; already locked?

Spin lock in a cache-coherent environment (invalidation-based): the bus is utilized during the whole read-modify-write cycle. Since the atomic exchange writes a location in memory, it must send an invalidate (even if the lock is not acquired). In general the loop to test the lock is short, so there is lots of bus contention.

Improved Spin Locks ("test and test&set")

Be smarter about spinning: we want to spin on the cache copy to avoid the full memory latency, and we are likely to get cache hits for such variables if we can avoid invalidates. Solution: start by simply repeatedly reading the variable; when it changes, then try the exchange ("test and test&set"):

    try:    li   R2,#1
    lockit: lw   R3,0(R1)   ; load var
            bnez R3,lockit  ; not free => spin
            exch R2,0(R1)   ; atomic exchange
            bnez R2,try     ; already locked?

Any remaining performance problems?
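The test-and-test&set loop can be sketched in C11 (illustrative names, not from the lecture; the atomic exchange plays the role of `exch`):

```c
#include <stdatomic.h>

typedef atomic_int ttas_lock_t;     /* 0 = free, 1 = held */

static void ttas_acquire(ttas_lock_t *lock) {
    for (;;) {
        /* "test": spin on plain reads, which hit in the local cache and
         * generate no invalidation traffic while the lock stays held */
        while (atomic_load(lock) != 0)
            /* spin */ ;
        /* "test&set": lock looked free, so try the real atomic exchange */
        if (atomic_exchange(lock, 1) == 0)
            return;                 /* old value was 0: we got the lock */
        /* lost the race to another processor: back to read-only spinning */
    }
}

static void ttas_release(ttas_lock_t *lock) {
    atomic_store(lock, 0);
}
```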

Problems with the Improved Spin Lock

There is a race for acquiring a lock that has just been released: all waiting processors suffer read and write misses, giving O(n^2) bus transactions for n contending processes. Potential improvements:
- Queuing locks (software or hardware).
- Exponential backoff: after each failed attempt to grab the lock, wait an exponentially increasing amount of time before trying again (similar to Ethernet collision handling).

Queuing Locks

Basic idea: a queue of waiting processors is maintained in shared memory for each lock (best for bus-based machines). Each processor performs an atomic operation to obtain a memory location (an element of an array) on which to spin. Upon a release, the lock can be handed off directly to the next waiting processor.
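One common software scheme in this spirit is a ticket lock (an assumed example, not named in the lecture): fetch-and-increment hands each processor a unique ticket, and release passes the lock to the next ticket in FIFO order, so a release admits exactly one waiter instead of triggering a stampede.

```c
#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;   /* next ticket to hand out */
    atomic_uint now_serving;   /* ticket currently allowed in */
} ticket_lock_t;

static void ticket_acquire(ticket_lock_t *l) {
    /* atomically take the next ticket (fetch-and-increment) */
    unsigned my = atomic_fetch_add(&l->next_ticket, 1);
    /* spin until it is our turn; note all waiters still watch now_serving,
     * so an array-based queue lock, which gives each waiter its own slot
     * to spin on, reduces the traffic further */
    while (atomic_load(&l->now_serving) != my)
        /* spin */ ;
}

static void ticket_release(ticket_lock_t *l) {
    /* hand the lock directly to the next waiter in FIFO order */
    atomic_fetch_add(&l->now_serving, 1);
}
```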

Barriers

All processes have to wait at a synchronization point, e.g., at the end of parallel do loops: processes don't progress until they all reach the barrier. A low-performance implementation uses a counter initialized with the number of processes. When a process reaches the barrier, it decrements the counter atomically (fetch-and-add(-1)) and busy-waits; when the counter reaches zero, all processes are allowed to progress (broadcast). There are many possible optimizations (tree, butterfly, etc.). Is it important? Barriers do not occur that often (Amdahl's law).
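The counting barrier can be sketched with C11 atomics. This sketch adds a sense-reversing flag (an assumed detail, not in the lecture) so the barrier resets itself and can be reused across phases; every participating process calls `barrier_wait`.

```c
#include <stdatomic.h>

typedef struct {
    atomic_int count;    /* processes still to arrive this phase */
    int        total;    /* number of participating processes */
    atomic_int sense;    /* flips each phase; waiters spin on it */
} barrier_t;

static void barrier_wait(barrier_t *b) {
    int my_sense = !atomic_load(&b->sense);  /* value sense will take this phase */
    /* atomic decrement: the fetch-and-add(-1) from the counter scheme */
    if (atomic_fetch_sub(&b->count, 1) == 1) {
        /* last arrival: reset the counter, then release everyone */
        atomic_store(&b->count, b->total);
        atomic_store(&b->sense, my_sense);   /* the "broadcast" */
    } else {
        while (atomic_load(&b->sense) != my_sense)
            /* busy-wait */ ;
    }
}
```

With a single participant the call returns immediately and leaves the barrier reset for the next phase; with P threads, no caller returns until all P have decremented the counter.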