CS 258 Parallel Computer Architecture, Lecture 23: Hardware-Software Trade-offs in Synchronization and Data Layout (Role of Synchronization)

CS 258 Parallel Computer Architecture
Lecture 23: Hardware-Software Trade-offs in Synchronization and Data Layout
April 21, 2008
Prof. John D. Kubiatowicz
http://www.cs.berkeley.edu/~kubitron/cs258

Role of Synchronization

"A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast."

Types of synchronization:
- Mutual exclusion
- Event synchronization
  » point-to-point
  » group
  » global (barriers)

How much hardware support?
- High-level operations?
- Atomic instructions?
- Specialized interconnect?

Components of a Synchronization Event

- Acquire method: acquire the right to the synch
  » enter the critical section, go past the event
- Waiting algorithm: wait for the synch to become available when it isn't
  » busy-waiting, blocking, or hybrid
- Release method: enable other processors to acquire the right to the synch

The waiting algorithm is independent of the type of synchronization, so it makes no sense to put it in hardware.

Strawman Lock (Busy-Wait)

    lock:   ld  register, location  /* copy location to register */
            cmp register, #0        /* compare with 0 */
            bnz register, lock      /* if not 0, try again */
            st  location, #1        /* store 1 to mark it locked */
            ret                     /* return control to caller */

    unlock: st  location, #0        /* write 0 to location */
            ret                     /* return control to caller */

Why doesn't the acquire method work? The release method?
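Why the acquire fails is easiest to see when the assembly is rendered as C. Below is a minimal sketch (type and function names are illustrative, not from the lecture) of the same broken-by-design lock: the load and the store are separate memory operations, so two threads can both read 0 in the same window and both enter the critical section. The release, by contrast, is a single store and works.

    /* Strawman lock in C -- broken by design. */
    typedef volatile int lock_t;

    void strawman_lock(lock_t *l) {
        for (;;) {
            int r = *l;        /* ld:  copy location to register */
            if (r != 0)        /* cmp/bnz: if not 0, try again   */
                continue;
            /* RACE WINDOW: another thread can also read 0 here, */
            /* so both fall through and both "hold" the lock.    */
            *l = 1;            /* st: store 1 to mark it locked  */
            return;
        }
    }

    void strawman_unlock(lock_t *l) {
        *l = 0;                /* st: write 0 to release */
    }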

What to Do with Only Load and Store?

Here is a possible two-thread solution:

    Thread A                  Thread B
    Set A=1;                  Set B=1;
    while (B) { //X           if (!A) { //Y
        do nothing;               Critical Section;
    }                             Set B=0;
    Critical Section;         }
    Set A=0;

Does this work? Yes. Together the threads guarantee that only one enters the critical section at a time:
- At X: if B=0, it is safe for A to perform the critical section; otherwise A waits to find out what will happen.
- At Y: if A=0, it is safe for B to perform the critical section; otherwise A is in the critical section or waiting for B to quit.
But: it is really messy, and the generalization to more threads gets worse.

Atomic Instructions

An atomic instruction specifies a location, a register, and an atomic operation:
- The value in the location is read into the register.
- Another value (possibly a function of the value read) is stored into the location.
Many variants exist, with varying degrees of flexibility in the second part.

Simple example: test&set
- Value in the location is read into a specified register
- Constant 1 is stored into the location
- Successful if the value loaded into the register is 0
- Other constants could be used instead of 1 and 0

Zoo of Hardware Primitives

    test&set (&address) {                 /* most architectures */
        result = M[address];
        M[address] = 1;
        return result;
    }

    swap (&address, register) {           /* x86 */
        temp = M[address];
        M[address] = register;
        register = temp;
    }

    compare&swap (&address, reg1, reg2) { /* 68000 */
        if (reg1 == M[address]) {
            M[address] = reg2;
            return success;
        } else {
            return failure;
        }
    }

    load-linked & store-conditional (&address) { /* R4000, Alpha */
      loop:
        ll   r1, M[address];
        movi r2, 1;           /* can do arbitrary computation */
        sc   r2, M[address];
        beqz r2, loop;
    }

Mini Instruction-Set Debate: Atomic Read-Modify-Write Instructions

- IBM 370: included atomic compare&swap for multiprogramming
- x86: any instruction can be prefixed with a lock modifier
- High-level-language advocates want hardware locks/barriers
  » but that goes against the RISC flow, and has other problems
- SPARC: atomic register-memory ops (swap, compare&swap)
- MIPS, IBM Power: no atomic operations, but a pair of instructions
  » load-locked, store-conditional
  » later used by PowerPC and DEC Alpha too
- 68000: CCS, "compare and compare and swap"
  » no one does this any more
A rich set of trade-offs.
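For reference, the first and third primitives in the zoo map directly onto C11 atomics. A minimal sketch (the wrapper names are illustrative): atomic_flag_test_and_set returns the old value and stores "set" (the constant 1 above), and atomic_compare_exchange_strong has exactly the 68000-style compare&swap semantics.

    #include <stdatomic.h>
    #include <stdbool.h>

    /* test&set: returns the old value, stores 1 ("set") */
    bool test_and_set(atomic_flag *addr) {
        return atomic_flag_test_and_set(addr);
    }

    /* compare&swap: store desired iff the location still holds expected */
    bool compare_and_swap(atomic_int *addr, int expected, int desired) {
        return atomic_compare_exchange_strong(addr, &expected, desired);
    }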

Other Forms of Hardware Support

- Separate lock lines on the bus
- Lock locations in memory
- Lock registers (Cray X-MP)
- Hardware full/empty bits (Tera)
- QOLB (machines supporting the SCI protocol)
- Bus support for interrupt dispatch

Simple Test&Set Lock

    lock:   t&s register, location
            bnz register, lock      /* if not 0, try again */
            ret                     /* return control to caller */

    unlock: st  location, #0        /* write 0 to location */
            ret                     /* return control to caller */

Other read-modify-write primitives:
- Swap
- Fetch&op
- Compare&swap
  » three operands: location, register to compare with, register to swap with
  » not commonly supported by RISC instruction sets
The lock variable may be cacheable or uncacheable.

Performance Criteria for Synchronization Operations

- Latency (time per op): especially under light contention
- Bandwidth (ops per sec): especially under high contention
- Traffic: load on critical resources, especially on failures under contention
- Storage
- Fairness

T&S Lock Microbenchmark: SGI Challenge

[Figure: time per lock transfer (μs, roughly 0-20) versus number of processors (up to ~15) for the microbenchmark "lock; delay(c); unlock;". Curves: test&set (c = 0), test&set with exponential backoff (c = 3.64 μs), test&set with exponential backoff (c = 0), and ideal. Plain test&set degrades steeply; backoff helps.]

Why does performance degrade? What bus transactions does test&set generate? What hardware support exists in the cache-coherence protocol?
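The microbenchmark behind the figure is just a tight acquire/hold/release loop. A sketch in C, reusing test_and_set from the earlier sketch; delay() is a hypothetical calibrated busy-wait, not a real library call:

    #include <stdatomic.h>
    #include <stdbool.h>

    extern bool test_and_set(atomic_flag *addr);  /* sketch above */
    extern void delay(double usecs);              /* hypothetical */

    void microbenchmark(atomic_flag *l, int iters, double c_usecs) {
        for (int i = 0; i < iters; i++) {
            while (test_and_set(l))
                ;                     /* spin: lock */
            delay(c_usecs);           /* hold the lock for c microseconds */
            atomic_flag_clear(l);     /* unlock: write 0 */
        }
    }

Total time divided by iterations gives the per-acquisition cost plotted above; every failed test&set is still a bus transaction, which is why the plain version degrades with processor count.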

Enhancements to the Simple Lock

Reduce the frequency of issuing test&sets while waiting:
- Test&set lock with backoff
  » don't back off too much, or you will still be backed off when the lock becomes free
  » exponential backoff works quite well empirically: i-th delay = k*c^i
- Busy-wait with read operations rather than test&set: the test-and-test&set lock
  » keep testing with an ordinary load; the cached lock variable will be invalidated when a release occurs
  » when the value changes (to 0), try to obtain the lock with test&set; only one attemptor will succeed, and the others will fail and start testing again

Busy-Wait vs. Blocking

Busy-wait (i.e., spin lock):
- Keep trying to acquire the lock until successful
- Very low latency/processor overhead!
- Very high system overhead!
  » causes stress on the network while spinning
  » the processor is not doing anything else useful

Blocking:
- If the lock can't be acquired, deschedule the process (i.e., unload its state)
- Higher latency/processor overhead (1000s of cycles?)
  » takes time to unload/restart the task
  » a notification mechanism is needed
- Low system overhead
  » no stress on the network
  » the processor does something useful

Hybrid: spin for a while, then block.
- 2-competitive: spin until you have waited as long as the blocking time

Improved Hardware Primitives: LL-SC

Goals:
- Test with reads
- Failed read-modify-write attempts don't generate invalidations
- Nice if a single primitive can implement a range of r-m-w operations

Load-Locked (or load-linked), Store-Conditional:
- LL reads the variable into a register
- Follow with arbitrary instructions to manipulate its value
- SC tries to store back to the location; it succeeds if and only if no other write to the variable has occurred since this processor's LL
  » indicated by condition codes
- If the SC succeeds, all three steps happened atomically
- If it fails, it doesn't write or generate invalidations; the acquire must be retried

Simple Lock with LL-SC

    lock:   ll   reg1, location   /* LL location to reg1 */
            bnz  reg1, lock       /* if locked (nonzero), try again */
            sc   location, reg2   /* SC reg2 (holding 1) into location */
            beqz reg2, lock       /* if SC failed, start again */
            ret

    unlock: st   location, #0     /* write 0 to location */
            ret

- Can do fancier atomic ops by changing what's between the LL and the SC
  » but keep it small so the SC is likely to succeed
  » don't include instructions that would need to be undone (e.g., stores)
- SC can fail (without putting a transaction on the bus) if:
  » it detects an intervening write even before trying to get the bus
  » it tries to get the bus but another processor's SC gets the bus first
- LL and SC are not lock and unlock, respectively
  » they only guarantee no conflicting write to the lock variable between them
  » but they can be used directly to implement simple operations on shared variables
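Both enhancements combine naturally. Here is a sketch of a test-and-test&set acquire with exponential backoff in C11 atomics (the backoff constants and the usleep-based delay are illustrative choices, not from the lecture):

    #include <stdatomic.h>
    #include <unistd.h>

    void tts_acquire(atomic_int *l) {
        unsigned backoff_us = 1;
        for (;;) {
            while (atomic_load(l) != 0)       /* test: ordinary cached reads */
                ;
            if (atomic_exchange(l, 1) == 0)   /* test&set once it looks free */
                return;
            usleep(backoff_us);               /* lost the race: back off */
            if (backoff_us < 1024)
                backoff_us *= 2;              /* exponential growth, capped */
        }
    }

    void tts_release(atomic_int *l) {
        atomic_store(l, 0);   /* invalidates the spinners' cached copies */
    }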

Trade-offs So Far

Latency? Bandwidth? Traffic? Storage? Fairness?
What happens when several processors are spinning on a lock and it is released? How much traffic per lock operation?

Ticket Lock

Only one r-m-w per acquire. Two counters per lock (next_ticket, now_serving):
- Acquire: fetch&inc next_ticket; wait until now_serving equals your ticket
  » the atomic op happens when you arrive at the lock, not when it is free (so less contention)
- Release: increment now_serving

Performance:
- Low latency for low contention, if fetch&inc is cacheable
- O(p) read misses at release, since all waiters spin on the same variable
- FIFO order
  » like the simple LL-SC lock, but with no invalidation when the SC succeeds, and fair
- Backoff? Wouldn't it be nice to poll different locations...

Array-Based Queuing Locks

Waiting processes poll on different locations in an array of size p.
- Acquire:
  » fetch&inc to obtain the address on which to spin (the next array element)
  » ensure that these addresses are in different cache lines or memories
- Release:
  » set the next location in the array, thus waking up the process spinning on it
- O(1) traffic per acquire with coherent caches
- FIFO ordering, as in the ticket lock, but O(p) space per lock
- Not so great for non-cache-coherent machines with distributed memory
  » the array location I spin on is not necessarily in my local memory (solution later)

Lock Performance on SGI Challenge

[Figure: time per operation (μs, roughly 0-7) versus number of processors (1-15) for the loop "lock; delay(c); unlock; delay(d);", comparing array-based, LL-SC, LL-SC with exponential backoff, ticket, and ticket-with-proportional-backoff locks. Three panels: (a) null (c = 0, d = 0), (b) critical section (c = 3.64 μs, d = 0), (c) delay (c = 3.64 μs, d = 1.29 μs).]
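A ticket lock is only a few lines once fetch&inc is available (atomic_fetch_add in C11). A minimal sketch, with field names following the slide:

    #include <stdatomic.h>

    typedef struct {
        atomic_uint next_ticket;
        atomic_uint now_serving;
    } ticket_lock_t;

    void ticket_acquire(ticket_lock_t *t) {
        unsigned my = atomic_fetch_add(&t->next_ticket, 1);  /* fetch&inc */
        while (atomic_load(&t->now_serving) != my)
            ;                                /* FIFO wait for my turn */
    }

    void ticket_release(ticket_lock_t *t) {
        atomic_fetch_add(&t->now_serving, 1);   /* pass the lock on */
    }

The release is what causes the O(p) read misses noted above: every waiter spins on now_serving, so all of their cached copies are invalidated at once.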

Point-to-Point Event Synchronization

Software methods:
- Interrupts
- Busy-waiting: use ordinary variables as flags
- Blocking: use semaphores

Full hardware support: a full/empty bit with each word in memory
- Set when the word is full with newly produced data (i.e., when written)
- Unset when the word is empty because it has been consumed (i.e., when read)
- Natural for word-level producer-consumer synchronization
  » producer: write if empty, set to full; consumer: read if full, set to empty
- Hardware preserves the atomicity of the bit manipulation with the read or write
- Problem: flexibility
  » multiple consumers, or multiple writes before the consumer reads?
  » needs language support to specify when to use it
  » composite data structures?

Barriers

Software algorithms are implemented using locks, flags, and counters.

Hardware barriers:
- Wired-AND line separate from the address/data bus
  » set your input high when you arrive; wait for the output to go high before leaving
- In practice, multiple wires allow reuse
- Useful when barriers are global and very frequent
- Difficult to support an arbitrary subset of processors
  » even harder with multiple processes per processor
- Difficult to dynamically change the number and identity of participants
  » the latter, e.g., due to process migration
- Not common today on bus-based machines

A Simple Centralized Barrier

A shared counter maintains the number of processes that have arrived: increment on arrival (with a lock), then check until it reaches numprocs.

    struct bar_type {
        int counter;
        struct lock_type lock;
        int flag = 0;
    } bar_name;

    BARRIER (bar_name, p) {
        LOCK(bar_name.lock);
        if (bar_name.counter == 0)
            bar_name.flag = 0;            /* reset flag if first to reach */
        mycount = ++bar_name.counter;     /* mycount is private */
        UNLOCK(bar_name.lock);
        if (mycount == p) {               /* last to arrive */
            bar_name.counter = 0;         /* reset for next barrier */
            bar_name.flag = 1;            /* release waiters */
        } else
            while (bar_name.flag == 0) {} /* busy-wait for release */
    }

Problem?

A Working Centralized Barrier

Consecutively entering the same barrier doesn't work:
- Must prevent a process from entering until all have left the previous instance
- Could use another counter, but that increases latency and contention
Sense reversal: wait for the flag to take a different value in consecutive instances; toggle this value only when all processes have reached the barrier.

    BARRIER (bar_name, p) {
        local_sense = !(local_sense);     /* toggle private sense variable */
        LOCK(bar_name.lock);
        mycount = bar_name.counter++;     /* mycount is private */
        if (bar_name.counter == p) {      /* last to arrive */
            UNLOCK(bar_name.lock);
            bar_name.counter = 0;         /* reset for next barrier */
            bar_name.flag = local_sense;  /* release waiters */
        } else {
            UNLOCK(bar_name.lock);
            while (bar_name.flag != local_sense) {}
        }
    }
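As a concrete rendering, here is the sense-reversing barrier in C11 atomics; an atomic fetch&inc replaces the LOCK/UNLOCK-protected counter update (a sketch assuming C11 _Thread_local support; names are illustrative):

    #include <stdatomic.h>

    typedef struct {
        atomic_int counter;
        atomic_int flag;
    } barrier_t;

    _Thread_local int local_sense = 0;   /* private per thread */

    void barrier_wait(barrier_t *b, int p) {
        local_sense = !local_sense;               /* toggle private sense */
        if (atomic_fetch_add(&b->counter, 1) + 1 == p) {
            atomic_store(&b->counter, 0);         /* reset for next barrier */
            atomic_store(&b->flag, local_sense);  /* release waiters */
        } else {
            while (atomic_load(&b->flag) != local_sense)
                ;                                 /* spin on the sense flag */
        }
    }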

Centralized Barrier Performance

- Latency: the critical path is at least proportional to p
- Traffic: about 3p bus transactions
- Storage cost: very low (a centralized counter and flag)
- Fairness: the same processor should not always be last to exit the barrier; there is no such bias in the centralized barrier

The key problems for the centralized barrier are latency and traffic; especially with distributed memory, all the traffic goes to the same node.

Improved Barrier Algorithms for a Bus

Software combining tree: only k processors access the same location, where k is the degree of the tree (a flat counter suffers heavy contention; the tree-structured version has little contention at each node).
- Separate arrival and exit trees, and use sense reversal
- Valuable in a distributed network: communication proceeds along different paths
- On a bus, however, all traffic goes on the same bus, and there is no less total traffic
- Higher latency (log p steps of work, and O(p) serialized bus transactions)
- The advantage on a bus is the use of ordinary reads/writes instead of locks

Barrier Performance on SGI Challenge

[Figure: barrier time (μs, roughly 0-35) versus number of processors (1-8) for centralized, combining-tree, tournament, and dissemination barriers.]

- The centralized barrier does quite well
  » we will discuss fancier barrier algorithms for distributed machines
- Helpful hardware support: piggybacking of read misses on the bus
  » also helps for spinning on highly contended locks

Lock-Free Synchronization

What happens if a process grabs a lock, then goes to sleep?
- Page fault
- Processor scheduling
- Etc.

Lock-free synchronization: operations do not require mutual exclusion over multiple instructions.
- Nonblocking: some process will complete in a finite amount of time even if other processors halt
- Wait-free (Herlihy): every (nonfaulting) process will complete in a finite amount of time
Systems based on LL&SC can implement these.
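The flavor of a nonblocking operation is captured by a compare&swap retry loop: no lock is ever held, so a process that page-faults or is descheduled mid-update cannot block the others; it merely loses the race and retries. A minimal sketch in C11 (lockfree_add is an illustrative name):

    #include <stdatomic.h>

    /* Atomically add delta to *loc without any lock; returns the old value. */
    int lockfree_add(atomic_int *loc, int delta) {
        int old = atomic_load(loc);
        while (!atomic_compare_exchange_weak(loc, &old, old + delta))
            ;   /* on failure, old is reloaded; retry until the CAS succeeds */
        return old;
    }

This is nonblocking but not wait-free: an unlucky process can in principle retry forever while others keep succeeding.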

Using Compare&Swap for Queues

    compare&swap (&address, reg1, reg2) { /* 68000 */
        if (reg1 == M[address]) {
            M[address] = reg2;
            return success;
        } else {
            return failure;
        }
    }

Here is an atomic add-to-linked-list function (a portable C rendering appears after the summary below):

    addtoqueue(&object) {
        do {                      /* repeat until no conflict */
            ld r1, M[root]        /* get ptr to current head */
            st r1, M[object]      /* save link in new object */
        } until (compare&swap(&root, r1, object));
    }

[Diagram: root points at the head of a singly linked list (root → next → next → next); the new object is spliced in at the head.]

Synchronization Summary

- Rich interaction of hardware-software trade-offs
- Must evaluate hardware primitives and software algorithms together
  » the primitives determine which algorithms perform well
- Evaluation methodology is challenging
  » use of delays, microbenchmarks
  » should use both microbenchmarks and real workloads
- Simple software algorithms with common hardware primitives do well on a bus
  » we will see more sophisticated techniques for distributed machines
- Hardware support is still a subject of debate
  » theoretical research argues for swap or compare&swap, not fetch&op
  » algorithms exist that ensure constant-time access, but they are complex
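Returning to addtoqueue: the same push can be written in portable C11, with compare_exchange playing the role of compare&swap (a sketch; a production version must also deal with the classic ABA problem on pointer reuse, which LL/SC-based versions avoid):

    #include <stdatomic.h>
    #include <stddef.h>

    typedef struct node {
        struct node *next;
        /* ...payload... */
    } node_t;

    _Atomic(node_t *) root = NULL;   /* head of the list */

    void addtoqueue(node_t *object) {
        node_t *head = atomic_load(&root);  /* get ptr to current head */
        do {
            object->next = head;            /* save link in new object */
        } while (!atomic_compare_exchange_weak(&root, &head, object));
        /* on failure, head is reloaded with the current root; retry */
    }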