Module 7: Synchronization
Lecture 13: Introduction to Atomic Primitives


The Lecture Contains:

- Synchronization
- Waiting Algorithms
- Implementation
- Hardwired Locks
- Software Locks
- Hardware Support
- Atomic Exchange
- Test & Set
- Fetch & op
- Compare & Swap
- Traffic of Test & Set
- Backoff Test & Set
- Test & Test & Set
- TTS Traffic Analysis
- Goals of a Lock Algorithm

file:///d /...audhary,%20dr.%20sanjeev%20k%20aggrwal%20&%20dr.%20rajat%20moona/multi-core_architecture/lecture13/13_1.htm[6/14/2012 11:57:57 AM]

Synchronization Types

- Mutual exclusion: synchronize entry into critical sections; normally done with locks
- Point-to-point synchronization: tell a set of processors (normally the set has cardinality one) that they can proceed; normally done with flags
- Global synchronization: bring every processor to a sync point, i.e., wait at a point until everyone is there; normally done with barriers

Synchronization is normally a two-part process: acquire and release. Acquire can be further broken into two parts: intent and wait.

- Intent: express the intent to synchronize (i.e., contend for the lock, arrive at a barrier)
- Wait: wait for your turn to synchronize (i.e., wait until you get the lock)
- Release: proceed past the synchronization point and enable other contenders to synchronize

Waiting algorithms do not depend on the type of synchronization.
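The flag-based point-to-point synchronization described above can be sketched with C11 atomics; the names `producer` and `consumer` and the value 42 are illustrative, not from the lecture:

```c
#include <stdatomic.h>

/* Point-to-point synchronization via a flag: one thread produces
   data and signals; the other waits on the flag and then consumes. */
atomic_int flag = 0;
int shared_data = 0;

void producer(void) {
    shared_data = 42;                                      /* produce */
    atomic_store_explicit(&flag, 1, memory_order_release); /* signal  */
}

void consumer(void) {
    /* busy-wait until the producer signals */
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;
    /* shared_data is now guaranteed visible as 42 */
}
```

The release/acquire pair ensures the write to `shared_data` becomes visible before the flag does.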

Waiting Algorithms

Busy wait (common in multiprocessors):
- Waiting processes repeatedly poll a location (implemented as a load in a loop)
- The releasing process sets the location appropriately
- May cause network or bus transactions

Block:
- Waiting processes are de-scheduled
- Frees up processor cycles for doing something else

Busy waiting is better if:
- De-scheduling and re-scheduling take longer than the busy wait would
- There is no other active process
- Note that busy waiting does not work on a single processor (the releasing process cannot run while the waiter spins)

Hybrid policies: busy wait for some time and then block.

Implementation

The popular trend:
- Architects offer some simple atomic primitives
- Library writers use these primitives to implement synchronization algorithms
- Normally hardware primitives are provided for acquire, and possibly for release

It is hard to offer hardware solutions for waiting; also, hardwired waiting may not offer much flexibility.
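A hybrid waiting policy of the kind mentioned above might look like this in C; `SPIN_LIMIT` and the use of `sched_yield()` are assumptions (a real hybrid policy would block via the scheduler rather than merely yield):

```c
#include <sched.h>
#include <stdatomic.h>

#define SPIN_LIMIT 1000  /* assumed tuning knob, not from the lecture */

/* Hybrid waiting: busy-wait for a bounded number of iterations,
   then give up the processor instead of spinning forever. */
void hybrid_wait(atomic_int *ready) {
    int spins = 0;
    while (atomic_load(ready) == 0) {
        if (++spins >= SPIN_LIMIT) {
            sched_yield();  /* de-schedule: let another process run */
            spins = 0;
        }
        /* otherwise: cheap busy wait, just re-poll the location */
    }
}
```

The spin threshold trades the cost of re-scheduling against wasted processor cycles, exactly the comparison made above.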

Hardwired Locks

Not popular today:
- Less flexible
- Cannot support a large number of locks

Possible designs:
- A dedicated lock line on the bus, so that the lock holder keeps it asserted and waiters snoop the lock line in hardware
- A set of lock registers shared among processors, where the lock holder gets a lock register (Cray X-MP)

Software Locks

The bakery algorithm:

    Shared: choosing[P] = FALSE, ticket[P] = 0;

    Acquire:
        choosing[i] = TRUE;
        ticket[i] = max(ticket[0], ..., ticket[P-1]) + 1;
        choosing[i] = FALSE;
        for j = 0 to P-1
            while (choosing[j]);
            while (ticket[j] && ((ticket[j], j) < (ticket[i], i)));
        endfor

    Release:
        ticket[i] = 0;

Does it work for multiprocessors? Assume sequential consistency. What are the performance issues related to coherence? It has too much overhead: we need faster and simpler lock algorithms, and that requires some hardware support.
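For concreteness, the bakery algorithm above can be rendered in C11; sequentially consistent atomics stand in for the sequential-consistency assumption, and `P`, the function names, and the `max_ticket` helper are illustrative:

```c
#include <stdatomic.h>
#include <stdbool.h>

#define P 4  /* number of contending threads (assumed) */

/* Lamport's bakery lock, following the pseudocode above.
   Default (seq_cst) atomic operations provide the sequential
   consistency the algorithm requires. */
atomic_bool choosing[P];
atomic_int  ticket[P];

static int max_ticket(void) {
    int m = 0;
    for (int j = 0; j < P; j++) {
        int t = atomic_load(&ticket[j]);
        if (t > m) m = t;
    }
    return m;
}

void bakery_acquire(int i) {
    atomic_store(&choosing[i], true);
    atomic_store(&ticket[i], max_ticket() + 1);  /* take a ticket */
    atomic_store(&choosing[i], false);
    for (int j = 0; j < P; j++) {
        /* wait until thread j has finished picking its ticket */
        while (atomic_load(&choosing[j]))
            ;
        /* wait while j holds a smaller (ticket, id) pair */
        while (atomic_load(&ticket[j]) != 0 &&
               (atomic_load(&ticket[j]) <  atomic_load(&ticket[i]) ||
                (atomic_load(&ticket[j]) == atomic_load(&ticket[i]) && j < i)))
            ;
    }
}

void bakery_release(int i) {
    atomic_store(&ticket[i], 0);
}
```

The per-thread `choosing` and `ticket` arrays are exactly the coherence problem hinted at above: every acquire reads P shared locations, generating misses.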

Hardware Support

Start with a simple software lock:

    Shared: lock = 0;
    Acquire: while (lock); lock = 1;
    Release or Unlock: lock = 0;

Assembly translation:

    Lock:   lw   register, lock_addr    /* register is any processor register */
            bnez register, Lock
            addi register, register, 0x1
            sw   register, lock_addr

    Unlock: xor  register, register, register
            sw   register, lock_addr

Does it work?

Atomic Exchange

What went wrong? We wanted the read-modify-write sequence to be atomic. We can fix this if we have an atomic exchange instruction:

            addi register, r0, 0x1      /* r0 is hardwired to 0 */
    Lock:   xchg register, lock_addr    /* an atomic load and store */
            bnez register, Lock

Unlock remains unchanged.

Various processors support this type of instruction: Intel x86 has xchg, Sun UltraSPARC has ldstub (load-store unsigned byte) and also swap. It is normally easy to implement for bus-based systems: whoever wins the bus for xchg can lock the bus. It is difficult to support in distributed memory systems.
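The xchg loop above maps naturally onto C11's `atomic_exchange`; this is a sketch with illustrative names, not the lecture's assembly:

```c
#include <stdatomic.h>

/* Exchange-based spinlock mirroring the xchg loop above. */
atomic_int xchg_lock = 0;

void xchg_acquire(void) {
    /* atomically swap 1 in; if the old value was nonzero,
       someone else holds the lock, so retry */
    while (atomic_exchange(&xchg_lock, 1) != 0)
        ;
}

void xchg_release(void) {
    atomic_store(&xchg_lock, 0);  /* unlock is a plain store */
}
```

The atomicity of the swap closes the window between the load and the store that broke the naive software lock.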

Test & Set

Less general than exchange:

    Lock:   ts   register, lock_addr
            bnez register, Lock

Unlock remains unchanged.

- Test & set loads the current lock value into a register and always sets the location to 1
- Exchange, in contrast, allows swapping in any value

Fetch & op

A similar type of instruction is fetch & op: fetch the memory location into a register and apply op to the memory location. Op can be one of a set of supported operations, e.g., add, increment, decrement, store, etc. In test & set, op = set.

It is possible to implement a lock with fetch & clear-then-add (formerly supported in the BBN Butterfly 1):

            addi reg1, r0, 0x1
    Lock:   fetch&clr-then-add reg1, reg2, lock_addr  /* fetch into reg2, clear, add reg1 */
            bnez reg2, Lock

Butterfly 1 also supports fetch & clear-then-xor; Sequent Symmetry supports fetch & store.

More sophisticated is compare & swap. It takes three operands: reg1, reg2, and a memory address. It compares the value in reg1 with the contents of the address and, if they are equal, swaps the contents of reg2 and the address. It is not in line with the RISC philosophy (the same goes for fetch & add).
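C11 happens to expose both primitives discussed here: `atomic_flag_test_and_set` is a test & set, and `atomic_fetch_add` is a fetch & op with op = add. A minimal sketch (names are illustrative):

```c
#include <stdatomic.h>
#include <stdbool.h>

atomic_flag ts_lock = ATOMIC_FLAG_INIT;
atomic_int  counter = 0;

void ts_acquire(void) {
    /* test & set: returns the old value and sets the flag to 1;
       a true return means the lock was already held, so retry */
    while (atomic_flag_test_and_set(&ts_lock))
        ;
}

void ts_release(void) {
    atomic_flag_clear(&ts_lock);
}

int bump(void) {
    /* fetch & add: atomically adds 1, returns the old value */
    return atomic_fetch_add(&counter, 1);
}
```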

Compare & Swap

            addi reg1, r0, 0x0          /* reg1 has 0x0 */
            addi reg2, r0, 0x1          /* reg2 has 0x1 */
    Lock:   compare&swap reg1, reg2, lock_addr
            bnez reg2, Lock

Traffic of Test & Set

In some machines (e.g., the SGI Origin 2000) uncached fetch & op is supported. Every such instruction will generate a transaction (which may be good or bad depending on the support in the memory controller; more on this later).

Let us instead assume that the lock location is cacheable and kept coherent. Every invocation of test & set must generate a bus transaction. Why? What is the transaction? What are the possible states of the cache line holding lock_addr? Therefore all lock contenders repeatedly generate bus transactions, even while someone is still in the critical section holding the lock.

Can we improve this? Test & set with backoff.
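The compare & swap lock above can be written with C11's compare-exchange; `expected` plays the role of reg1 and the constant 1 the role of reg2 (a sketch, names illustrative):

```c
#include <stdatomic.h>
#include <stdbool.h>

atomic_int cas_lock = 0;

void cas_acquire(void) {
    int expected;
    do {
        expected = 0;  /* like reg1: we expect the lock to be free */
        /* swap 1 in only if cas_lock still equals expected;
           on failure, retry (weak CAS may also fail spuriously) */
    } while (!atomic_compare_exchange_weak(&cas_lock, &expected, 1));
}

void cas_release(void) {
    atomic_store(&cas_lock, 0);
}
```

Unlike exchange, compare & swap writes the location only when the comparison succeeds, which is what makes it general enough to build lock-free structures as well as locks.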

Backoff Test & Set

Instead of retrying immediately, wait for a while. How long to wait? Waiting too long may lead to long latency and lost opportunity. The backoff can be constant or variable; a special kind of variable backoff is exponential backoff (after the i-th attempt the delay is k*c^i, where k and c are constants). Test & set with exponential backoff works pretty well:

            delay = k
    Lock:   ts register, lock_addr
            bez register, Enter_CS
            pause(delay)        /* can be simulated as a timed loop */
            delay = delay * c
            j Lock

Test & Test & Set

To reduce traffic further, before trying test & set make sure that the lock is free:

    Lock:   ts register, lock_addr
            bez register, Enter_CS
    Test:   lw register, lock_addr
            bnez register, Test
            j Lock

How good is it? In a cacheable lock environment the Test loop executes from the cache until it receives an invalidation (due to the store in unlock); at that point the load may return a zero value after fetching the cache line. Only when the location reads zero does everyone try test & set.
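Combining the two ideas, a test & test & set lock with exponential backoff might be sketched as follows; the constants `K`, `C`, and `MAX_DELAY`, and the timed-loop `pause_for`, are assumptions, not from the lecture:

```c
#include <stdatomic.h>

/* Assumed constants: initial delay K, multiplier C, delay cap. */
enum { K = 8, C = 2, MAX_DELAY = 1 << 16 };

atomic_int tts_lock = 0;

static void pause_for(int delay) {
    for (volatile int n = 0; n < delay; n++)
        ;  /* timed loop standing in for a pause instruction */
}

void tts_acquire(void) {
    int delay = K;
    for (;;) {
        /* test: spin on a plain load, which hits in the cache
           until the holder's unlock store invalidates the line */
        while (atomic_load(&tts_lock) != 0)
            ;
        /* test & set: only now issue the read-modify-write */
        if (atomic_exchange(&tts_lock, 1) == 0)
            return;  /* got the lock */
        pause_for(delay);
        if (delay < MAX_DELAY)
            delay *= C;  /* exponential backoff: delay = K * C^i */
    }
}

void tts_release(void) {
    atomic_store(&tts_lock, 0);
}
```

The load loop keeps waiters off the bus while the lock is held; backoff then spreads out the burst of test & set attempts that follows a release.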

TTS Traffic Analysis

Recall that unlock is always a simple store. In the worst case everyone tries to enter the CS at the same time:

- First round: P transactions for ts, of which one succeeds; every other processor suffers a miss on the load in the Test loop, then loops from its cache
- The lock holder, when unlocking, generates an upgrade (why?) and invalidates all the others
- All other processors suffer a read miss and now see the value zero; they break out of the Test loop and try ts, and the process continues until everyone has visited the CS

Total traffic: (P + (P-1) + 1 + (P-1)) + ((P-1) + (P-2) + 1 + (P-2)) + ... = (3P-1) + (3P-4) + (3P-7) + ... ~ 1.5P^2 asymptotically.

For distributed shared memory the situation is worse because each invalidation becomes a separate message (more later).

Goals of a Lock Algorithm

- Low latency: if there is no contender, the lock should be acquired fast
- Low traffic: worst-case lock-acquire traffic should be low; otherwise it may affect unrelated transactions
- Scalability: traffic and latency should scale slowly with the number of processors
- Low storage cost: maintaining lock state should not impose unrealistic memory overhead
- Fairness: ideally processors should enter the CS in the order of their lock requests (TS and TTS do not guarantee this)