Page 1. Outline. Coherence vs. Consistency. Why Consistency is Important

Similar documents
Role of Synchronization. CS 258 Parallel Computer Architecture Lecture 23. Hardware-Software Trade-offs in Synchronization and Data Layout

CMSC Computer Architecture Lecture 15: Memory Consistency and Synchronization. Prof. Yanjing Li University of Chicago

The MESI State Transition Graph

Symmetric Multiprocessors: Synchronization and Sequential Consistency

Module 15: "Memory Consistency Models" Lecture 34: "Sequential Consistency and Relaxed Models" Memory Consistency Models. Memory consistency

NOW Handout Page 1. Memory Consistency Model. Background for Debate on Memory Consistency Models. Multiprogrammed Uniprocessor Mem.

Motivations. Shared Memory Consistency Models. Optimizations for Performance. Memory Consistency

Chapter 8. Multiprocessors. In-Cheol Park Dept. of EE, KAIST

Synchronization. Erik Hagersten Uppsala University Sweden. Components of a Synchronization Even. Need to introduce synchronization.

Introduction. Coherency vs Consistency. Lec-11. Multi-Threading Concepts: Coherency, Consistency, and Synchronization

Relaxed Memory-Consistency Models

SHARED-MEMORY COMMUNICATION

Designing Memory Consistency Models for. Shared-Memory Multiprocessors. Sarita V. Adve

Synchronization. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

Computer Architecture

Synchronization. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

CS/ECE 757: Advanced Computer Architecture II (Parallel Computer Architecture) Symmetric Multiprocessors Part 1 (Chapter 5)

CSE502: Computer Architecture CSE 502: Computer Architecture

Today s Outline: Shared Memory Review. Shared Memory & Concurrency. Concurrency v. Parallelism. Thread-Level Parallelism. CS758: Multicore Programming

Using Relaxed Consistency Models

SELECTED TOPICS IN COHERENCE AND CONSISTENCY

Lecture 13: Consistency Models. Topics: sequential consistency, requirements to implement sequential consistency, relaxed consistency models

Distributed Operating Systems Memory Consistency

Portland State University ECE 588/688. Memory Consistency Models

Lecture 19: Coherence and Synchronization. Topics: synchronization primitives (Sections )

Multiprocessors continued

Lamport Clocks: Verifying A Directory Cache-Coherence Protocol. Computer Sciences Department

Shared Memory Consistency Models: A Tutorial

Memory Consistency Models

Shared Memory Multiprocessors

Lecture: Coherence and Synchronization. Topics: synchronization primitives, consistency models intro (Sections )

Relaxed Memory Consistency

Handout 3 Multiprocessor and thread level parallelism

The need for atomicity This code sequence illustrates the need for atomicity. Explain.

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 5. Multiprocessors and Thread-Level Parallelism

740: Computer Architecture Memory Consistency. Prof. Onur Mutlu Carnegie Mellon University

Synchronization. Coherency protocols guarantee that a reading processor (thread) sees the most current update to shared data.

Module 7: Synchronization Lecture 13: Introduction to Atomic Primitives. The Lecture Contains: Synchronization. Waiting Algorithms.

Lecture: Coherence, Synchronization. Topics: directory-based coherence, synchronization primitives (Sections )

M4 Parallelism. Implementation of Locks Cache Coherence

Multiprocessors continued

Review: Multiprocessor. CPE 631 Session 21: Multiprocessors (Part 2) Potential HW Coherency Solutions. Bus Snooping Topology

Multiprocessors II: CC-NUMA DSM. CC-NUMA for Large Systems

Relaxed Memory-Consistency Models

Overview: Memory Consistency

ECE/CS 757: Advanced Computer Architecture II

Consistency Models. ECE/CS 757: Advanced Computer Architecture II. Readings. ECE 752: Advanced Computer Architecture I 1. Instructor:Mikko H Lipasti

Data-Centric Consistency Models. The general organization of a logical data store, physically distributed and replicated across multiple processes.

Switch Gear to Memory Consistency

Bus-Based Coherent Multiprocessors

Parallel Computer Architecture Spring Memory Consistency. Nikos Bellas

CS 252 Graduate Computer Architecture. Lecture 11: Multiprocessors-II

Parallel Computer Architecture Lecture 5: Cache Coherence. Chris Craik (TA) Carnegie Mellon University

Multiprocessor Synchronization

CS5460: Operating Systems

Cache Coherence and Atomic Operations in Hardware

Computer Science 146. Computer Architecture

Chapter 5. Multiprocessors and Thread-Level Parallelism

Beyond Sequential Consistency: Relaxed Memory Models

Lecture 19: Synchronization. CMU : Parallel Computer Architecture and Programming (Spring 2012)

Portland State University ECE 588/688. Directory-Based Cache Coherence Protocols

EEC 581 Computer Architecture. Lec 11 Synchronization and Memory Consistency Models (4.5 & 4.6)

Lecture 25: Thread Level Parallelism -- Synchronization and Memory Consistency

Interconnect Routing

Lecture 12: TM, Consistency Models. Topics: TM pathologies, sequential consistency, hw and hw/sw optimizations

Lecture: Consistency Models, TM

Unit 12: Memory Consistency Models. Includes slides originally developed by Prof. Amir Roth

Lecture 7: Implementing Cache Coherence. Topics: implementation details

Page 1. Lecture 12: Multiprocessor 2: Snooping Protocol, Directory Protocol, Synchronization, Consistency. Bus Snooping Topology

Chapter 5. Multiprocessors and Thread-Level Parallelism

Chapter 5. Thread-Level Parallelism

CS654 Advanced Computer Architecture Lec 14 Directory Based Multiprocessors

Announcements. ECE4750/CS4420 Computer Architecture L17: Memory Model. Edward Suh Computer Systems Laboratory

SGI Challenge Overview

Lecture 24: Multiprocessing Computer Architecture and Systems Programming ( )

Replication of Data. Data-Centric Consistency Models. Reliability vs. Availability

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013

Shared Memory Architectures. Approaches to Building Parallel Machines

Distributed Shared Memory and Memory Consistency Models

Chapter 5 Thread-Level Parallelism. Abdullah Muzahid

Parallel Computer Architecture Spring Distributed Shared Memory Architectures & Directory-Based Memory Coherence

Review: Symmetric Multiprocesors (SMP)

Advanced Operating Systems (CS 202)

Lecture: Consistency Models, TM. Topics: consistency models, TM intro (Section 5.6)

CSC2/458 Parallel and Distributed Systems Scalable Synchronization

Multiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types

CS 152 Computer Architecture and Engineering CS252 Graduate Computer Architecture. Lecture 19 Memory Consistency Models

EE382 Processor Design. Processor Issues for MP

Shared memory. Caches, Cache coherence and Memory consistency models. Diego Fabregat-Traver and Prof. Paolo Bientinesi WS15/16

EECS 470. Lecture 17 Multiprocessors I. Fall 2018 Jon Beaumont

Shared Memory Architectures. Shared Memory Multiprocessors. Caches and Cache Coherence. Cache Memories. Cache Memories Write Operation

Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution

Administrivia. p. 1/20

Consistency & Coherence. 4/14/2016 Sec5on 12 Colin Schmidt

EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy Prof. Sherief Reda School of Engineering Brown University

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015

Wenisch Portions Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Vijaykumar. Prof. Thomas Wenisch. Lecture 23 EECS 470.

Computer Architecture and Engineering CS152 Quiz #5 April 27th, 2016 Professor George Michelogiannakis Name: <ANSWER KEY>

ECE 669 Parallel Computer Architecture

EE382 Processor Design. Illinois

Transcription:

Outline ECE 259 / CPS 221 Advanced Computer Architecture II (Parallel Computer Architecture) Memory Consistency Models Copyright 2006 Daniel J. Sorin Duke University Slides are derived from work by Sarita Adve (Illinois), Babak Falsafi (CMU), Mark Hill (Wisconsin), Alvy Lebeck (Duke), Steve Reinhardt (Michigan), and J. P. Singh (Princeton). Thanks! Difference Between Coherence and Consistency Sequential Consistency Relaxed Memory Consistency Models Consistency Optimizations Synchronization Optimizations ECE 259 / CPS 221 2 Coherence vs. Consistency Intuition says load should return latest value What is latest? Coherence concerns only one memory location Consistency concerns apparent ordering for all locations A memory system is coherent if, for all locations, Can serialize all operations to that location such that, Operations performed by any processor appear in program order» Program order = order defined by program text or assembly code A read gets the value written by last store to that location Why Consistency is Important Consistency model defines correct behavior It is a contract between the system and the programmer Analogous to the ISA specification Coherence protocol is only a means to an end Coherence is not visible to software Enables new system to present same consistency model despite using newer, fancier coherence protocol Systems maintain backward compatibility for consistency (like ISA) Consistency model restricts ordering of loads/stores Does NOT care at all about ordering of coherence messages ECE 259 / CPS 221 3 ECE 259 / CPS 221 4 Page 1

Why Coherence!= Consistency /* initially, A = B = flag = 0 */ A = 1; while (flag == 0); /* spin */ B = 1; print A; flag = 1; print B; Intuition says we should print A = B = 1 Yet, in some consistency models, this isn t required! Coherence doesn t say anything why? Sequential Consistency Leslie Lamport 1979: A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program Abstraction: a multitasking uniprocessor ECE 259 / CPS 221 5 ECE 259 / CPS 221 6 sequential processors issue memory ops in program order The Memory Model Pn Memory switch randomly set after each memory op SC: Definitions and Sufficient Conditions Sequentially consistent execution Result is same as one of the possible interleavings on uniprocessor Sequentially consistent system Any possible execution corresponds to some possible total order Alternate equivalent definition of SC There exists a total order of all loads and stores (across all processors), such that the value returned by each load equals the value of the most recent store to that location ECE 259 / CPS 221 7 ECE 259 / CPS 221 8 Page 2

Definitions Memory operation Load, store, atomic read-modify-write to mem location Issue An operation is issued when it leaves processor and is presented to memory system (cache, write-buffer, local and remote memories) Perform A store is performed wrt to a processor P when a load by P returns value produced by that store or a later store A load is performed wrt to a processor when subsequent stores cannot affect value returned by that load Complete A memory operation is complete when performed wrt all processors. Program execution Memory operations for specific run only (ignore non-memory-referencing instructions) Sufficient Conditions for Sequential Consistency Processors issue memory ops in program order Processor must wait for store to complete before issuing next memory operation After load, issuing proc waits for load to complete, and store that produced value to complete before issuing next op Easily implemented with shared (physical) bus This is sufficient, but more than necessary ECE 259 / CPS 221 9 ECE 259 / CPS 221 10 SGI Origin: Preserving Sequential Consistency MIPS R10000 is dynamically scheduled Allows memory operations to issue and execute out of program order But ensures that they become visible and complete in order Doesn t satisfy sufficient conditions, but provides SC An interesting issue w.r.t. preserving SC On a write to a shared block, requestor gets two types of replies:» Exclusive reply from the home, indicates write is serialized at memory» Invalidation acks, indicate that write has completed wrt processors But microprocessor expects only one reply (as in a uniprocessor)» So replies have to be dealt with by requestor s HUB To ensure SC, Hub must wait until inval acks are received before replying to proc» Can t reply as soon as exclusive reply is received Would allow later accesses from proc to complete (writes become visible) before this write Outline Difference Between Coherence and Consistency Sequential Consistency Relaxed Memory Consistency Models Motivation Processor Consistency Weak Ordering & Release Consistency Consistency Optimizations Synchronization Optimizations ECE 259 / CPS 221 11 ECE 259 / CPS 221 12 Page 3

Why Relaxed Memory Models? Motivation with directory protocols Misses have longer latency Collecting acknowledgments can take even longer Recall SC requires strict ordering of reads/writes Each processor generates a local total order of its reads and writes (R R, R W, W W, & R W) All local total orders are interleaved into a global total order Relaxed models relax some of these constraints PC: Relax ordering from writes to (other processors ) reads RC: Relax all read/write orderings (but add fences ) Processor Consistency (PC): Relax Write to Read Order /* initial A = B = 0 */ A = 1; B = 1 r1 = B; r2 = A; Processor Consistency (PC) Allows r1==r2==0 (not allowed by SC) Examples: IBM 370, Sun Total Store Order (TSO), & Intel IA-32 Why do this?» Allows FIFO write buffers performance!» Does not confuse programmers (too much) ECE 259 / CPS 221 13 ECE 259 / CPS 221 14 Write Buffers w/ Read Bypass Also Want Causality (Transitivity) Read Flag 2 t1 Write Flag 1 Read t3 Flag 1 t2 Shared Bus Flag 1: 0 Flag 2: 0 Write Flag 2 t4 /* initially all 0 */ P3 A = 1; while (flag1==0) {}; while (flag2==0) {}; flag1 = 1; flag2 = 1; r3 = A; We expect P3 s r3=a to get value 1 All commercial versions of PC guarantee causality Flag 1 = 1 Flag 2 = 1 if (Flag 2 == 0) if (Flag 1 == 0) critical section critical section ECE 259 / CPS 221 15 ECE 259 / CPS 221 16 Page 4

So Why Not Relax All Order? /* initially all 0 */ A = 1; while (flag == 0); /* spin */ B = 1; r1 = A; flag = 1; r2 = B; We d like to be able to reorder A = 1 / B = 1 and/or r1 = A / r2 = B Useful because it could allow for OOO processors, non-fifo write buffers, delayed directory acknowledgments, etc. But, for sanity, we still would like to order A = 1 / B = 1 before flag =1 flag!= 0 before r1 = A / r2 = B Order with Synch Operations /* initially all 0 */ A = 1; while (SYNCH flag == 0); B = 1; r1 = A; SYNCH flag = 1; r2 = B; Called weak ordering (WO) or weak consistency SYNCH orders all prior and subsequent operations Alternatively, release consistency (RC) specializes Acquire: forces subsequent reads/writes after Release: forces previous reads/writes before ECE 259 / CPS 221 17 ECE 259 / CPS 221 18 Weak Ordering Example Release Consistency Example Read / Write Read/Write Read / Write Read/Write Acquire Synch Read / Write Read/Write Synch Read / Write Read/Write Release Read / Write Read/Write Read / Write Read/Write ECE 259 / CPS 221 19 ECE 259 / CPS 221 20 Page 5

Time Review: Directory Example with Sequential Consistency Directory Node P3 ST x I S M Request Response S M Invalidate Ack I Directory Example with Release Consistency Time Directory Node Acquire P3 ST x I ST y I Release start Request x Invalidate x Request y Response x M M Release completes Response y S M M Invalidate y Ack x Ack y I I ECE 259 / CPS 221 21 ECE 259 / CPS 221 22 Commercial Models Use Fences /* initially all 0 */ A = 1; while (flag == 0); B = 1; FENCE; FENCE; r1 = A; flag = 1; r2 = B; Examples: Compaq Alpha, IBM PowerPC,& Sun RMO Can specialize fences (e.g., RMO) The Programming Interface WO and RC require synchronized programs All synchronization operations must be labeled and visible to the hardware Easy (easier!) if synchronization library used Must provide language support for arbitrary Ld/St synchronization (event notification, e.g., flag) Program written for weaker model OK on stricter E.g., SC is a valid implementation of TSO, WO, or RC Intel IA-64 is RCpc (acquires & releases obey PC) ECE 259 / CPS 221 23 ECE 259 / CPS 221 24 Page 6

Outline Difference Between Coherence and Consistency Sequential Consistency Relaxed Memory Consistency Models Consistency Optimizations Synchronization Optimizations Consistency is an Abstraction We only have to implement a system that preserves the illusion of the specified consistency model Like OOO processor preserves illusion of in-order execution We can speculate and recover if speculation leads to violation of consistency model E.g., MIPS R10000 processor speculates on consistency Scheurich s Optimization (for directory protocols) Immediately acknowledge incoming Invalidation (to Shared block) Yet continue to read this block (i.e., pretend Inv didn t arrive yet) Must invalidate before asking mem system for upgrade to any block Optimization does NOT even violate SC ECE 259 / CPS 221 25 ECE 259 / CPS 221 26 Scheurich s Optimization Legal Example /* initially A=B=0, owns A, shares B */ Scheurich s Optimization Illegal Example /* initially A=B=0, owns A, shares B */ issue GETX B LD B=0 issue GETX B LD B=0 Ack Inv B ST B=1 LD B=0 /* Legal! */ ST A=1 Ack Inv B ST B=1 LD B=0 /* Legal! */ ST A=1 issue GETS A send A=1 LD A=1 LD B=0 /* Illegal */ ECE 259 / CPS 221 27 ECE 259 / CPS 221 28 Page 7

Consistency Speculation Paper (ICPP 91) Is SC+ILP=RC? DISCUSSION PRESENTATION ECE 259 / CPS 221 29 ECE 259 / CPS 221 30 Outline Difference Between Coherence and Consistency Sequential Consistency Relaxed Memory Consistency Models Consistency Optimizations Synchronization Optimizations Review of Synchronization (Section 5.5 of Culler/Singh) Synchronization Design Tradeoffs (More Section 5.5) Cool New Synchronization Schemes (Two Papers: QOLB and SLE) Synchronization Mutual exclusion (critical sections) Lock & unlock Event notification Point-to-point (producer-consumer, flags) Global (barrier) Locks, Barriers, Flags, etc. How are these implemented? ECE 259 / CPS 221 31 ECE 259 / CPS 221 32 Page 8

Anatomy of A Synchronization Operation 1) Acquire method Method for trying to obtain the lock, or proceed past barrier 2) Waiting algorithm Spin or busy wait Block (suspend) 3) Release method Method to allow other processes to get past synchronization event HW/SW Implementation Tradeoffs User wants high level (ease of programming) LOCK(lock_variable), UNLOCK(lock_variable) BARRIER(barrier_variable, Num_Procs) Hardware s advantage The Need for Speed (it s fast) Software s advantage Flexibility Goals Low latency Low traffic Scalability Low storage overhead Fairness ECE 259 / CPS 221 33 ECE 259 / CPS 221 34 How NOT To Implement Locks Atomic Read-Modify-Write Operations LOCK while(lock_variable == 1); // spin lock_variable = 1; UNLOCK lock_variable = 0; Implementation requires Mutual Exclusion! Can have two processes successfully acquire the lock Mutual exclusion requires atomic operations Test&Set(r,x) r = m[x] m[x] = 1 Swap(r,x) r = m[x], m[x] = r Compare&Swap(r1,r2,x) if (r1 == m[x]) then r2 = m[x], m[x] = r2 Fetch&Op(r,x,op) r = m[x], m[x] = op(m[x]) r is a register m[x] is memory location x ECE 259 / CPS 221 35 ECE 259 / CPS 221 36 Page 9

Aside: Load-Locked Store-Conditional Pair of instructions that can implement lots of atomic operations Load-Locked loads address A into register r1 We can then manipulate r1 for a while Store-Conditional writes r1 back to A if A hasn t been modified Invalidation Replacement Context switch lock: ll r1, memloc // load-lock memloc into r1 sc memloc, #1 // conditionally store 1 into memloc beqz lock // if sc fails, try lock again ret unlock: st location, #0 ret Performance of Test & Set LOCK while (test&set(x) == 1); UNLOCK x = 0; Problem 1: Bad performance under contention Test&Set causes GetExclusive coherence request Problem 2: Not fair Processors don t get lock in order in which they request it Both problems due to the waiting algorithm! ECE 259 / CPS 221 37 ECE 259 / CPS 221 38 Better Lock Implementations Two choices: Don t execute test&set so much test&set with back-off Spin without generating bus traffic test-and-test&set Test&Set with back-off Insert delay between test&set operations (not too long) Exponential seems good (k*c i ) Not fair Test-and-Test&Set Spin (test) on local cached copy until it gets invalidated, then issue test&set Intuition: No point in trying to set the location until we know that it s not set, which we can detect when it gets invalidated Still contention after invalidation Still not fair Lock Implementation Details To Cache or Not to Cache, that is the question Uncached - Latency for one operation increases + Fast hand-off between processes Cached - Might generate a lot of traffic if lock moves around + Might reuse lock a lot (locality), then traffic would be reduced by caching Must keep ownership for entire read-modifywrite cycle Synchronization operation is visible to the memory system ECE 259 / CPS 221 39 ECE 259 / CPS 221 40 Page 10

Fetch&Inc Based Locks Ticket Lock (like at the bakery) LOCK Obtain number via fetch&inc Spin on now-serving counter UNLOCK Increment now-serving counter Array based Lock Obtain location to spin on rather than value Fair Slight increase in storage Put locations in separate cache blocks, else same traffic as t&t&s H/W Queue-Based Locks: QOLB lock holder requester requester P P P Cache Cache Cache Distributed directory (SCI) Memory Lock bit in every cache line Maintains sharing list in a doubly-linked list Local spinning in the cache: cache hit while waiting ECE 259 / CPS 221 41 ECE 259 / CPS 221 42 S/W Queue-Based Locks: MCS To emulate QOLB in software: (master) lock resides in shared memory Requesters allocate a block in shared memory to spin on The flag the requester spins on is originally set to 1 Upon acquire, requester inserts its block in linked list It reads the address of the last requester s local block It writes its local block address to the master Upon release Releaser removes itself from the linked list Sets the next processor in line s flag to 0 Mechanisms To Reduce Lock Overhead Optimizations used by locks: Local spinning (on data in local cache) Queue-based locking Collocation of lock and the data it protects Synchronous prefetch L-spin Queue Collocation S-fetch T&S no no optional no T&T&S yes no optional no MCS yes yes partial no QOLB yes yes optional yes ECE 259 / CPS 221 43 ECE 259 / CPS 221 44 Page 11

Elapsed Time (M cycles) 15 10 Microbenchmark Analysis Lock performance with increase in processors 5 0 T&S T&T&S MCS QOLB Number of Processors 1 2 4 8 16 32 64 What are the Performance Issues? Low latency: Should be able to get a free lock quickly Scalability: Should perform well beyond a small number of processors (< 64) Low storage overhead Fairness Blocking/Non-blocking ECE 259 / CPS 221 45 ECE 259 / CPS 221 46 Performance of Locks Performance depends on many factors Amount of contention Number of processors Snooping vs. directory Hardware support for locks Test&set is good with no contention Array based (Queue) is best with high contention Reactive Synchronization by Lim & Agarwal Choose lock implementation based on contention Point-to-Point Event Synchronization Often use normal variables as flags a = f(x); while (flag == 0); flag = 1; b = g(a); If we know a=0 beforehand a = f(x) while (a == 0); b = g(a); Assumes Sequential Consistency!! Full/Empty bits are similar to flags Set on Write Cleared on Read Can t write if set, can t read if clear ECE 259 / CPS 221 47 ECE 259 / CPS 221 48 Page 12

Implementing a Centralized Barrier BARRIER(bar_name, p) { LOCK(bar_name.lock); if (bar_name.counter == 0) bar_name.flag = 0; bar_name.counter++; UNLOCK(bar_name.lock); if (bar_name.counter == p) { bar_name.counter = 0; bar_name.flag = 1; } else while(bar_name.flag = 0) {}; /* busy wait */ } } But what if doesn t see flag set to 1 in first barrier before re-enters barrier and resets flag to 0? Barrier With Sense Reversal BARRIER(bar_name, p) { local_sense =!(local_sense); /* toggle private state */ LOCK(bar_name.lock); bar_name.counter++; UNLOCK(bar_name.lock); if (bar_name.counter == p) { bar_name.counter = 0; bar_name.flag = local_sense; } else while(bar_name.flag!= local_sense) {};/* busy wait*/ } } ECE 259 / CPS 221 49 ECE 259 / CPS 221 50 PRESENTATION SLE: Speculative Lock Elision Outline Difference Between Coherence and Consistency Sequential Consistency Relaxed Memory Consistency Models Consistency Optimizations Synchronization Optimizations ECE 259 / CPS 221 51 ECE 259 / CPS 221 52 Page 13