CS 152 Computer Architecture and Engineering. Lecture 19: Synchronization and Sequential Consistency


CS 152 Computer Architecture and Engineering
Lecture 19: Synchronization and Sequential Consistency

Krste Asanovic
Electrical Engineering and Computer Sciences, University of California, Berkeley
http://www.eecs.berkeley.edu/~krste
http://inst.cs.berkeley.edu/~cs152

Summary: Multithreaded Categories
[Figure: issue slots over time (processor cycles) for Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, and Simultaneous Multithreading machines; slots are filled by Thread 1 through Thread 5 or left as idle slots.]

Uniprocessor Performance (SPECint)
[Figure: performance (vs. VAX-11/780) on a log scale from 1 to 10,000, plotted for 1978-2006; the post-2002 flattening leaves performance roughly 3x below where the earlier trend would have been. From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006.]
- VAX: 25%/year, 1978 to 1986
- RISC + x86: 52%/year, 1986 to 2002
- RISC + x86: ??%/year, 2002 to present

Déjà vu all over again?
".. today's processors are nearing an impasse as technologies approach the speed of light .."
-- David Mitchell, The Transputer: The Time Is Now (1989)
The Transputer had bad timing (uniprocessor performance kept climbing!)
⇒ Procrastination rewarded: 2x sequential performance / 1.5 years

"We are dedicating all of our future product development to multicore designs. ... This is a sea change in computing."
-- Paul Otellini, President, Intel (2005)
All microprocessor companies switch to MP (2x CPUs / 2 years)
⇒ Procrastination penalized: 2x sequential performance / 5 years

Manufacturer/Year     AMD/'07   Intel/'07   IBM/'07   Sun/'07
Processors/chip          4          2          2         8
Threads/Processor        1          1          2         8
Threads/chip             4          2          4        64

Symmetric Multiprocessors
[Figure: two processors and memory attached to a shared CPU-Memory bus; a bridge connects it to an I/O bus with I/O controllers for graphics output, networks, and other devices.]
"Symmetric" means:
- All memory is equally far away from all processors
- Any processor can do any I/O (set up a DMA transfer)

Synchronization
The need for synchronization arises whenever there are concurrent processes in a system (even in a uniprocessor system):
- Forks and joins: in parallel programming, a parallel process may want to wait until several events have occurred (fork P1, P2 ... join)
- Producer-consumer: a consumer process must wait until the producer process has produced data
- Exclusive use of a resource: the operating system has to ensure that only one process uses a resource at a given time

A Producer-Consumer Example
[Figure: the producer appends items at the tail of a queue and the consumer removes them from the head; the producer holds Rtail, the consumer holds Rhead, Rtail, and R.]

Producer posting item x:
    Load  Rtail, (tail)
    Store (Rtail), x        -- (1)
    Rtail = Rtail + 1
    Store (tail), Rtail     -- (2)

Consumer:
        Load  Rhead, (head)
  spin: Load  Rtail, (tail)       -- (3)
        if Rhead == Rtail goto spin
        Load  R, (Rhead)          -- (4)
        Rhead = Rhead + 1
        Store (head), Rhead
        process(R)

The program is written assuming instructions are executed in order. Problems?

A Producer-Consumer Example, continued
Can the tail pointer get updated before the item x is stored? The programmer assumes that if (3) happens after (2), then (4) happens after (1). The problem sequences are:
    2, 3, 4, 1
    4, 1, 2, 3
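To make the hazard concrete, here is a minimal C sketch of the same single-producer/single-consumer queue written with plain loads and stores (names such as buf, head, tail, and QSIZE are illustrative, not from the slides). It is deliberately as naive as the slide code: nothing forces the store of the item (1) to become globally visible before the store of the updated tail pointer (2), so on a machine that reorders memory operations the consumer can observe the new tail yet read a stale slot.

    #define QSIZE 256

    static int      buf[QSIZE];          /* shared circular buffer              */
    static unsigned head = 0, tail = 0;  /* shared indices (plain, non-atomic)  */

    /* Producer: post one item (slide steps 1 and 2). */
    void produce(int x)
    {
        buf[tail % QSIZE] = x;           /* (1) store the item                  */
        tail = tail + 1;                 /* (2) publish the new tail pointer    */
    }

    /* Consumer: wait for an item, then consume it (slide steps 3 and 4). */
    int consume(void)
    {
        while (head == tail)             /* (3) spin until the tail moves       */
            ;                            /*     busy-wait                       */
        int r = buf[head % QSIZE];       /* (4) load the item                   */
        head = head + 1;
        return r;
    }

In C11 terms this version also has data races, exactly because it relies on the in-order assumption the slide questions; the fenced and atomic versions later in the lecture repair it.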

Sequential Consistency: A Memory Model
[Figure: several processors P sharing a single memory M.]

"A system is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in the order specified by the program."
-- Leslie Lamport

Sequential Consistency = arbitrary order-preserving interleaving of memory references of sequential programs

Sequential Consistency
Sequential concurrent tasks: T1, T2
Shared variables: X, Y (initially X = 0, Y = 10)

T1:
    Store (X), 1        (X = 1)
    Store (Y), 11       (Y = 11)

T2:
    Load  R1, (Y)
    Store (Y'), R1      (Y' = Y)
    Load  R2, (X)
    Store (X'), R2      (X' = X)

What are the legitimate answers for X' and Y'?
Is (X', Y') ∈ {(1,11), (0,10), (1,10), (0,11)}?
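Under sequential consistency, (0,11) is the one outcome above that cannot occur: Y' = 11 means T2 read Y after T1's second store, so in any legal interleaving T1's earlier store to X has already happened, and T2's later load of X must return 1. A hedged C11 sketch of the same litmus test, using <threads.h> and sequentially consistent atomics (thread and variable names are illustrative, not from the slides):

    #include <stdatomic.h>
    #include <stdio.h>
    #include <threads.h>

    atomic_int X = 0, Y = 10;        /* shared variables, as on the slide */
    int Xp, Yp;                      /* X' and Y', written by T2          */

    int t1(void *arg)
    {
        atomic_store(&X, 1);         /* Store (X), 1   */
        atomic_store(&Y, 11);        /* Store (Y), 11  */
        return 0;
    }

    int t2(void *arg)
    {
        Yp = atomic_load(&Y);        /* Load R1, (Y); Store (Y'), R1 */
        Xp = atomic_load(&X);        /* Load R2, (X); Store (X'), R2 */
        return 0;
    }

    int main(void)
    {
        thrd_t a, b;
        thrd_create(&a, t1, NULL);
        thrd_create(&b, t2, NULL);
        thrd_join(a, NULL);
        thrd_join(b, NULL);
        /* Under SC the only possible results are (1,11), (0,10), (1,10). */
        printf("(X', Y') = (%d, %d)\n", Xp, Yp);
        return 0;
    }

If the atomic operations were weakened to memory_order_relaxed, the (0,11) outcome would become observable on machines with relaxed memory models.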

Sequential Consistency
Sequential consistency imposes more memory-ordering constraints than those imposed by uniprocessor program dependencies. What are the additional constraints in our example?

T1:
    Store (X), 1        (X = 1)
    Store (Y), 11       (Y = 11)

T2:
    Load  R1, (Y)
    Store (Y'), R1      (Y' = Y)
    Load  R2, (X)
    Store (X'), R2      (X' = X)

Does (can) a system with caches or out-of-order execution capability provide a sequentially consistent view of the memory? (More on this later.)

Multiple Consumer Example
[Figure: one producer and two consumers share the queue; each consumer has its own Rhead, Rtail, and R.]

Producer posting item x:
    Load  Rtail, (tail)
    Store (Rtail), x
    Rtail = Rtail + 1
    Store (tail), Rtail

Consumer:
        Load  Rhead, (head)
  spin: Load  Rtail, (tail)
        if Rhead == Rtail goto spin
        Load  R, (Rhead)
        Rhead = Rhead + 1
        Store (head), Rhead
        process(R)

The consumer's dequeue sequence is a critical section: it needs to be executed atomically by one consumer ⇒ locks. What is wrong with this code?
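A minimal C sketch of why the unsynchronized dequeue breaks with two consumers (illustrative names, not from the slides): both consumers can read the same head value, both process the same item, and one increment of head is lost.

    #define QSIZE 256
    extern int      buf[QSIZE];
    extern unsigned head, tail;      /* shared, no lock */

    /* Unsafe dequeue, as on the slide: run by Consumer 1 and Consumer 2. */
    int dequeue_unsafe(void)
    {
        unsigned h = head;           /* C1 and C2 may both read h == 5       */
        while (h == tail)            /* spin: queue empty                    */
            ;
        int r = buf[h % QSIZE];      /* both then load the same item buf[5]  */
        head = h + 1;                /* both store head = 6: item 5 is       */
        return r;                    /* consumed twice, item 6 is skipped    */
    }

The whole read-modify-write of head must execute atomically by one consumer at a time, which is what the lock-based and nonblocking versions later in the lecture provide.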

Locks or Semaphores (E. W. Dijkstra, 1965)
A semaphore is a non-negative integer, with the following operations:
- P(s): if s > 0, decrement s by 1, otherwise wait
- V(s): increment s by 1 and wake up one of the waiting processes

P's and V's must be executed atomically, i.e., without interruptions or interleaved accesses to s by other processors.

Process i:
    P(s)
    <critical section>
    V(s)

The initial value of s determines the maximum number of processes in the critical section.

Implementation of Semaphores
Semaphores (mutual exclusion) can be implemented using ordinary Load and Store instructions in the Sequential Consistency memory model. However, protocols for mutual exclusion are difficult to design...
Simpler solution: atomic read-modify-write instructions.

Examples (m is a memory location, R is a register):

    Test&Set (m), R:
        R ← M[m];
        if R == 0 then M[m] ← 1;

    Fetch&Add (m), RV, R:
        R ← M[m];
        M[m] ← R + RV;

    Swap (m), R:
        Rt ← M[m];
        M[m] ← R;
        R ← Rt;
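A hedged C11 sketch of a binary semaphore (mutex) built from an atomic test-and-set, in the spirit of the Test&Set instruction above; atomic_flag_test_and_set is the standard library's test-and-set primitive, and the function names are illustrative. A counting semaphore needs a bit more machinery.

    #include <stdatomic.h>

    static atomic_flag s = ATOMIC_FLAG_INIT;    /* clear = free, set = held */

    /* P(s): spin until the test-and-set finds the flag clear. */
    void P(void)
    {
        while (atomic_flag_test_and_set(&s))
            ;                                   /* busy-wait (no sleeping)  */
    }

    /* V(s): release; one of the spinning processes will now acquire it. */
    void V(void)
    {
        atomic_flag_clear(&s);
    }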

CS152 Administrivia
- CS194-6 class description
- Quiz results
- Quiz common mistakes

CS194-6 Digital Systems Project Laboratory (Fall 2008)
- Prerequisites: EECS 150
- Units/Credit: 4, may be taken for a grade
- Meeting times: M 10:30-12 (lecture), F 10-12 (lab)
- Instructor: TBA

CS 194 is a capstone digital design project course. The projects are team projects, with 2-4 students per team. Teams propose a project in a detailed design document, implement the project in Verilog, map it to a state-of-the-art FPGA-based platform, and verify its correct operation using a test vector suite. Projects may be of two types: general-purpose computing systems based on a standard ISA (for example, a pipelined implementation of a subset of the MIPS ISA, with caches and a DRAM interface), or special-purpose digital computing systems (for example, a real-time engine to decode an MPEG video packet stream). This class requires a significant time commitment (we expect students to spend 200 hours on the project over the semester). Note that CS 194 is a pure project course (no exams or homework). Be sure to register for both CS 194 P 006 and CS 194 S 601.

Multiple Consumers Example Using the Test&Set Instruction

    P:    Test&Set (mutex), Rtemp
          if (Rtemp != 0) goto P
          -- begin critical section --
          Load  Rhead, (head)
    spin: Load  Rtail, (tail)
          if Rhead == Rtail goto spin
          Load  R, (Rhead)
          Rhead = Rhead + 1
          Store (head), Rhead
          -- end critical section --
    V:    Store (mutex), 0
          process(R)

Other atomic read-modify-write instructions (Swap, Fetch&Add, etc.) can also implement P's and V's. What if the process stops or is swapped out while in the critical section?

Nonblocking Synchronization

    Compare&Swap (m), Rt, Rs:
        if (Rt == M[m]) then
            M[m] ← Rs;
            Rs ← Rt;
            status ← success;
        else
            status ← fail;
    (status is an implicit argument)

    try:  Load  Rhead, (head)
    spin: Load  Rtail, (tail)
          if Rhead == Rtail goto spin
          Load  R, (Rhead)
          Rnewhead = Rhead + 1
          Compare&Swap (head), Rhead, Rnewhead
          if (status == fail) goto try
          process(R)
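A hedged C11 sketch of the nonblocking dequeue above, using atomic_compare_exchange_strong as the Compare&Swap (names are illustrative). The CAS both detects interference from other consumers (status == fail, so retry) and publishes the new head in a single atomic step, so no lock is held while a process could stall or be swapped out.

    #include <stdatomic.h>

    #define QSIZE 256
    extern int buf[QSIZE];
    extern atomic_uint head, tail;

    int dequeue_nonblocking(void)
    {
        unsigned h, t;
        int r;
        do {
            h = atomic_load(&head);            /* try:  Load Rhead, (head)   */
            do {
                t = atomic_load(&tail);        /* spin: Load Rtail, (tail)   */
            } while (h == t);                  /* queue empty: keep spinning */
            r = buf[h % QSIZE];                /* Load R, (Rhead)            */
            /* Compare&Swap (head), Rhead, Rnewhead: succeeds only if head
             * is still h; otherwise another consumer got there first and
             * we retry from the top.                                        */
        } while (!atomic_compare_exchange_strong(&head, &h, h + 1));
        return r;                              /* process(R) happens here    */
    }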

Load-reserve & Store-conditional
Special register(s) hold a reservation flag and address, and the outcome of the store-conditional.

    Load-reserve R, (m):
        <flag, adr> ← <1, m>;
        R ← M[m];

    Store-conditional (m), R:
        if <flag, adr> == <1, m> then
            cancel other procs' reservation on m;
            M[m] ← R;
            status ← succeed;
        else
            status ← fail;

    try:  Load-reserve Rhead, (head)
    spin: Load  Rtail, (tail)
          if Rhead == Rtail goto spin
          Load  R, (Rhead)
          Rhead = Rhead + 1
          Store-conditional (head), Rhead
          if (status == fail) goto try
          process(R)

Performance of Locks
- Blocking atomic read-modify-write instructions (e.g., Test&Set, Fetch&Add, Swap), vs.
- Non-blocking atomic read-modify-write instructions (e.g., Compare&Swap, Load-reserve/Store-conditional), vs.
- Protocols based on ordinary Loads and Stores

Performance depends on several interacting factors: degree of contention, caches, out-of-order execution of Loads and Stores. (Later.)
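Load-reserve/store-conditional pairs let one retry loop implement any atomic read-modify-write. A hedged C11 sketch of an atomic fetch-and-add built this way, using atomic_compare_exchange_weak as a portable stand-in for the store-conditional: like a store-conditional, the weak compare-exchange is allowed to fail spuriously, so it always sits in a retry loop, and on RISC machines compilers typically lower it to an lr/sc loop.

    #include <stdatomic.h>

    /* Atomically add delta to *ctr and return the old value. */
    unsigned fetch_and_add(atomic_uint *ctr, unsigned delta)
    {
        unsigned old = atomic_load(ctr);              /* "load-reserve"        */
        while (!atomic_compare_exchange_weak(ctr, &old, old + delta))
            ;   /* "store-conditional" failed: old now holds the fresh value, */
                /* so the new sum is recomputed and we try again              */
        return old;
    }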

Issues in Implementing Sequential Consistency
[Figure: several processors P sharing a single memory M.]

Implementation of SC is complicated by two issues:

Out-of-order execution capability. May the two accesses be reordered without changing uniprocessor semantics?
    Load(a);  Load(b)      yes
    Load(a);  Store(b)     yes, if a ≠ b
    Store(a); Load(b)      yes, if a ≠ b
    Store(a); Store(b)     yes, if a ≠ b

Caches. Caches can prevent the effect of a store from being seen by other processors.

Memory Fences: Instructions to Sequentialize Memory Accesses
Processors with relaxed or weak memory models (i.e., that permit Loads and Stores to different addresses to be reordered) need to provide memory fence instructions to force the serialization of memory accesses.

Examples of processors with relaxed memory models:
- Sparc V8 (TSO, PSO): Membar
- Sparc V9 (RMO): Membar #LoadLoad, Membar #LoadStore, Membar #StoreLoad, Membar #StoreStore
- PowerPC (WO): Sync, EIEIO

Memory fences are expensive operations; however, one pays the cost of serialization only when it is required.
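For reference, a hedged sketch of how these fence flavors map onto C11's portable fence primitive; the mapping is approximate, since the precise semantics of Membar, Sync, and EIEIO are ISA-specific.

    #include <stdatomic.h>

    void fence_examples(void)
    {
        /* roughly Membar #LoadLoad | #LoadStore: earlier loads complete
         * before later loads and stores                                   */
        atomic_thread_fence(memory_order_acquire);

        /* roughly Membar #LoadStore | #StoreStore: earlier loads and
         * stores complete before later stores                             */
        atomic_thread_fence(memory_order_release);

        /* full barrier, including the expensive StoreLoad ordering
         * (roughly Membar #StoreLoad, or PowerPC Sync)                    */
        atomic_thread_fence(memory_order_seq_cst);
    }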

Using Memory Fences
[Figure: producer/consumer queue with tail and head pointers, as before.]

Producer posting item x:
    Load  Rtail, (tail)
    Store (Rtail), x
    Membar_SS               -- ensures that the tail pointer is not updated before x has been stored
    Rtail = Rtail + 1
    Store (tail), Rtail

Consumer:
        Load  Rhead, (head)
  spin: Load  Rtail, (tail)
        if Rhead == Rtail goto spin
        Membar_LL           -- ensures that R is not loaded before x has been stored
        Load  R, (Rhead)
        Rhead = Rhead + 1
        Store (head), Rhead
        process(R)

Data-Race Free Programs, a.k.a. Properly Synchronized Programs

Process 1:
    Acquire(mutex);
    ...
    Release(mutex);

Process 2:
    Acquire(mutex);
    ...
    Release(mutex);

- Synchronization variables (e.g., mutex) are disjoint from data variables
- Accesses to writable shared data variables are protected in critical regions ⇒ no data races except for locks
- (A formal definition is elusive.) In general, it cannot be proven whether a program is data-race free.
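A hedged C11 sketch of the fenced producer/consumer (illustrative names, single producer and single consumer). Instead of explicit Membar_SS and Membar_LL instructions, the release store of tail and the acquire load of tail give the same two guarantees: the item is globally visible before the new tail, and the item is not read before the new tail has been observed.

    #include <stdatomic.h>

    #define QSIZE 256
    static int buf[QSIZE];
    static atomic_uint head = 0, tail = 0;

    void produce(int x)
    {
        unsigned t = atomic_load_explicit(&tail, memory_order_relaxed);
        buf[t % QSIZE] = x;                        /* Store (Rtail), x           */
        /* release acts like Membar_SS: the item is visible before the new tail */
        atomic_store_explicit(&tail, t + 1, memory_order_release);
    }

    int consume(void)
    {
        unsigned h = atomic_load_explicit(&head, memory_order_relaxed);
        /* acquire acts like Membar_LL: the item load below cannot be
         * reordered before this load of tail                                   */
        while (h == atomic_load_explicit(&tail, memory_order_acquire))
            ;                                      /* spin until an item arrives */
        int r = buf[h % QSIZE];                    /* Load R, (Rhead)            */
        atomic_store_explicit(&head, h + 1, memory_order_relaxed);
        return r;
    }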

Fences in Data-Race Free Programs

Process 1:
    Acquire(mutex);
    membar;
    ...
    membar;
    Release(mutex);

Process 2:
    Acquire(mutex);
    membar;
    ...
    membar;
    Release(mutex);

- A relaxed memory model allows reordering of instructions by the compiler or the processor, as long as the reordering is not done across a fence
- The processor also should not speculate or prefetch across fences

Mutual Exclusion Using Load/Store
A protocol based on two shared variables, c1 and c2. Initially, both c1 and c2 are 0 (not busy).

Process 1:
    c1 = 1;
L:  if c2 == 1 then go to L
    <critical section>
    c1 = 0;

Process 2:
    c2 = 1;
L:  if c1 == 1 then go to L
    <critical section>
    c2 = 0;

What is wrong?
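A hedged C sketch of this first attempt (illustrative names; default sequentially consistent atomics, so the only flaw is the protocol itself): if both processes set their flags before either tests the other's, both spin forever. That deadlock is what the next slide's second attempt addresses.

    #include <stdatomic.h>
    #include <stdbool.h>

    static atomic_bool c1 = false, c2 = false;

    void process1_enter(void)
    {
        atomic_store(&c1, true);          /* c1 = 1                          */
        while (atomic_load(&c2))          /* L: if c2 == 1 then go to L      */
            ;                             /* deadlock if both flags are set  */
    }
    void process1_exit(void) { atomic_store(&c1, false); }   /* c1 = 0 */

    void process2_enter(void)
    {
        atomic_store(&c2, true);
        while (atomic_load(&c1))
            ;
    }
    void process2_exit(void) { atomic_store(&c2, false); }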

Mutual Exclusion: Second Attempt
To avoid deadlock, let a process give up its reservation (i.e., Process 1 sets c1 to 0) while waiting.

Process 1:
L:  c1 = 1;
    if c2 == 1 then { c1 = 0; go to L }
    <critical section>
    c1 = 0;

Process 2:
L:  c2 = 1;
    if c1 == 1 then { c2 = 0; go to L }
    <critical section>
    c2 = 0;

Deadlock is not possible, but with a low probability a livelock may occur. An unlucky process may never get to enter the critical section ⇒ starvation.

A Protocol for Mutual Exclusion (T. Dekker, 1966)
A protocol based on 3 shared variables: c1, c2, and turn. Initially, both c1 and c2 are 0 (not busy).

Process 1:
    c1 = 1;
    turn = 1;
L:  if c2 == 1 & turn == 1 then go to L
    <critical section>
    c1 = 0;

Process 2:
    c2 = 1;
    turn = 2;
L:  if c1 == 1 & turn == 2 then go to L
    <critical section>
    c2 = 0;

- turn == i ensures that only process i can wait
- Variables c1 and c2 ensure mutual exclusion
- The solution for n processes was given by Dijkstra and is quite tricky!
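A hedged C11 rendering of the slide's protocol (illustrative names). Sequentially consistent atomics are essential here: with relaxed ordering, the flag stores and the later flag loads could be reordered and both processes could enter the critical section.

    #include <stdatomic.h>

    static atomic_int c1 = 0, c2 = 0, turn = 1;

    void p1_enter(void)
    {
        atomic_store(&c1, 1);                       /* c1 = 1   */
        atomic_store(&turn, 1);                     /* turn = 1 */
        while (atomic_load(&c2) == 1 &&             /* wait while the other     */
               atomic_load(&turn) == 1)             /* wants in and it is our   */
            ;                                       /* turn to defer            */
    }
    void p1_exit(void) { atomic_store(&c1, 0); }    /* c1 = 0 */

    void p2_enter(void)
    {
        atomic_store(&c2, 1);
        atomic_store(&turn, 2);
        while (atomic_load(&c1) == 1 && atomic_load(&turn) == 2)
            ;
    }
    void p2_exit(void) { atomic_store(&c2, 0); }

The process that writes turn last is the one that defers, so at most one process can be in the critical section and neither can wait forever while the other keeps requesting entry.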

Analysis of Dekker's Algorithm
Scenario 1 and Scenario 2 on the original slides trace two different interleavings of the same code:

Process 1:
    c1 = 1;
    turn = 1;
L:  if c2 == 1 & turn == 1 then go to L
    <critical section>
    c1 = 0;

Process 2:
    c2 = 1;
    turn = 2;
L:  if c1 == 1 & turn == 2 then go to L
    <critical section>
    c2 = 0;

N-process Mutual Exclusion: Lamport's Bakery Algorithm

Process i (initially num[j] = 0, for all j):

Entry code:
    choosing[i] = 1;
    num[i] = max(num[0], ..., num[N-1]) + 1;
    choosing[i] = 0;
    for (j = 0; j < N; j++) {
        while (choosing[j]);
        while (num[j] &&
               ((num[j] < num[i]) ||
                (num[j] == num[i] && j < i)));
    }

Exit code:
    num[i] = 0;
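A hedged C11 sketch of the bakery algorithm's entry and exit code for N processes (illustrative names; sequentially consistent atomics stand in for the sequentially consistent memory the algorithm assumes).

    #include <stdatomic.h>

    #define N 8                                    /* number of processes       */
    static atomic_int choosing[N];                 /* all initially 0           */
    static atomic_int num[N];                      /* all initially 0           */

    void bakery_enter(int i)
    {
        atomic_store(&choosing[i], 1);
        int max = 0;                               /* num[i] = max(num[0..N-1]) + 1 */
        for (int j = 0; j < N; j++) {
            int n = atomic_load(&num[j]);
            if (n > max) max = n;
        }
        atomic_store(&num[i], max + 1);
        atomic_store(&choosing[i], 0);

        for (int j = 0; j < N; j++) {
            while (atomic_load(&choosing[j]))      /* wait while j is choosing  */
                ;
            while (atomic_load(&num[j]) &&         /* wait while j holds a      */
                   (atomic_load(&num[j]) < atomic_load(&num[i]) ||
                    (atomic_load(&num[j]) == atomic_load(&num[i]) && j < i)))
                ;                                  /* smaller ticket, or the    */
        }                                          /* same ticket and lower id  */
    }

    void bakery_exit(int i)
    {
        atomic_store(&num[i], 0);                  /* num[i] = 0                */
    }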

Acknowledgements
These slides contain material developed and copyright by:
- Arvind (MIT)
- Krste Asanovic (MIT/UCB)
- Joel Emer (Intel/MIT)
- James Hoe (CMU)
- John Kubiatowicz (UCB)
- David Patterson (UCB)

MIT material derived from course 6.823.
UCB material derived from course CS252.