CS758: Multicore Programming. Shared Memory Review: Shared Memory & Concurrency, Concurrency v. Parallelism, Thread-Level Parallelism

CS758: Multicore Programming
Introduction to Shared Memory
Prof. David A. Wood, University of Wisconsin-Madison

Today's Outline: Shared Memory Review
- Shared Memory & Concurrency
- Thread-Level Parallelism
- Shared Memory
- Cache Coherence
- Synchronization
- Memory Consistency

Thread-Level Parallelism

    struct acct_t { int bal; };
    shared struct acct_t accts[max_acct];
    int id, amt;
    if (accts[id].bal >= amt) {
        accts[id].bal -= amt;
        spew_cash();
    }

- Thread-level parallelism (TLP)
  - Collection of asynchronous tasks: not started and stopped together
  - Data shared loosely, dynamically
- Example: database/web server (each query is a thread)
  - accts is shared, can't register allocate even if it were scalar
  - id and amt are private variables, register allocated to r1 & r2
- (A C sketch of this race appears below.)

Concurrency v. Parallelism
- Concurrency is not (just) parallelism
- Concurrency: logically simultaneous processing; interleaved execution (perhaps on a uniprocessor)
- Parallelism: physically simultaneous processing; requires a multiprocessor
(Timeline figure: threads A, B, C interleaved on one processor vs. running simultaneously.)
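The withdrawal code above is the running example for the rest of the lecture. The sketch below is my own pthreads rendering of it, not the lecture's code; the single account, the starting balance of 500, and the withdraw function name are assumptions. Two threads run the unsynchronized read-check-update sequence on the same account, so one withdrawal can be lost.

    /* Minimal sketch (not from the slides): two threads each withdraw $100
     * from the same account with no synchronization, so the read-check-update
     * sequence can interleave and one update can be lost. */
    #include <pthread.h>
    #include <stdio.h>

    struct acct_t { int bal; };
    struct acct_t account = { 500 };      /* shared; assumed starting balance */

    static void *withdraw(void *arg) {
        int amt = 100;                    /* private, register-allocatable */
        (void)arg;
        if (account.bal >= amt) {
            account.bal -= amt;           /* compiles to ld, sub, st: not atomic */
            /* spew_cash(); */
        }
        return NULL;
    }

    int main(void) {
        pthread_t t0, t1;
        pthread_create(&t0, NULL, withdraw, NULL);
        pthread_create(&t1, NULL, withdraw, NULL);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        printf("balance = %d (expected 300; a racy run may print 400)\n", account.bal);
        return 0;
    }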

Shared Memory
- Shared memory: multiple execution contexts sharing a single address space
  - Multiple programs (MIMD)
  - Or more frequently: multiple copies of one program (SPMD)
  - Implicit (automatic) communication via loads and stores
+ Simple software
  - No need for messages, communication happens naturally... maybe too naturally
  - Supports irregular, dynamic communication patterns
  - Both DLP and TLP
- Complex hardware
  - Must create a uniform view of memory
  - Several aspects to this, as we will see

Shared-Memory Multiprocessors
- Provide a shared-memory abstraction
- Familiar and efficient for programmers
(Figure: processors P1-P4 connected to a single memory system.)

Shared-Memory Multiprocessors
- Provide a shared-memory abstraction
- Familiar and efficient for programmers
(Figure: P1-P4, each with a cache, memory module M1-M4, and interface, connected by an interconnection network.)

Paired vs. Separate Processor/Memory?
- Separate processor/memory: uniform memory access (UMA), equal latency to all memory
  + Simple software, doesn't matter where you put data
  - Lower peak performance
  - Bus-based UMAs common: symmetric multiprocessors (SMP)
- Paired processor/memory: non-uniform memory access (NUMA), faster to local memory
  - More complex software: where you put data matters
  + Higher peak performance, assuming proper data placement

Shared vs. Point-to-Point Networks
- Shared network: e.g., bus (left)
  + Low latency
  - Low bandwidth: doesn't scale beyond ~16 processors
  + Shared property simplifies cache coherence protocols (later)
- Point-to-point network: e.g., mesh or ring (right)
  - Longer latency: may need multiple hops to communicate
  + Higher bandwidth: scales to 1000s of processors
  - Cache coherence protocols are complex
(Figure: a bus connecting processors and memories vs. a mesh of router/memory (R/Mem) blocks.)

Implementation #1: Snooping Bus MP
- Two basic implementations; the first is bus-based systems
  - Typically small: 2-8 (maybe 16) processors
  - Typically processors split from memories (UMA)
  - Multiple processors on a single chip (CMP)
  - Symmetric multiprocessors (SMPs)
  - Common; most chips have 2-4 cores today

Implementation #2: Scalable MP
- General point-to-point network-based systems
  - Typically processor/memory/router blocks (NUMA)
  - Glueless MP: no need for additional glue chips
- Can be arbitrarily large: 1000s of processors
  - Massively parallel processors (MPPs)
- Increasingly used for small systems
  - Eliminates the need for buses, enables point-to-point wires
  - Coherent HyperTransport (AMD Opteron)
  - Intel QuickPath (Core i7)

Today's Outline: Shared Memory Review
- Motivation for Four-Lecture Course
- Introduction to Shared Memory
- Cache Coherence Problem
  - Snooping
  - Directories
- Synchronization
- Memory Consistency

An Example Execution
- Two $100 withdrawals from account #241 at two ATMs
  - Each transaction maps to a thread on a different processor
  - Track accts[241].bal (address is in r3)

No-Cache, No-Problem
- Scenario I: processors have no caches
- No problem
(Per-step table of the memory value of accts[241].bal omitted.)

Cache Incoherence
- Scenario II: processors have write-back caches
- Potentially 3 copies of accts[241].bal: memory, P0's cache, P1's cache
- Can get incoherent (inconsistent)
(Per-step cache/memory state table omitted.)

Write-Thru Alone Doesn't Help
- Scenario II: processors have write-thru caches
- This time only 2 (different) copies of accts[241].bal
- No problem? What if another withdrawal happens on processor 0?
(Per-step cache/memory state table omitted.)

Hardware Cache Coherence
- Coherence controller (CC) sits between the CPU, the data cache (D$ tags and data), and the bus
  - Examines bus traffic (addresses and data)
  - Executes the coherence protocol: what to do with the local copy when you see different things happening on the bus

Bus-Based Coherence Protocols
- Also called snooping or broadcast
  - ALL controllers see ALL transactions IN SAME ORDER
  - Bus is the ordering point
  - Protocol relies on all processors seeing a total order of requests
- Three processor-side events
  - R: read
  - W: write
  - WB: write-back (select block for replacement)
- Three bus-side events
  - BR: bus-read, read miss on another processor
  - BW: bus-write, write miss or write-back on another processor
  - CB: copy-back, send block back to memory or other processor
- Point-to-point network protocols also exist
  - Typical solution is a directory protocol

VI (MI) Coherence Protocol
- VI (valid-invalid) protocol, aka MI
- Two states (per block)
  - V (valid): have the block; aka M (modified) when the block has been written
  - I (invalid): don't have the block
- Protocol summary
  - If anyone else wants to read/write the block, give it up: transition to I state
  - Copy-back on replacement or other request
  - Miss gets the latest copy (memory or processor)
- This is an invalidate protocol
  - Update protocol: copy data, don't invalidate; sounds good, but wastes bandwidth
(State diagram: I -> V on R (issue BR) or W (issue BW); V -> I on observed BR/BW (copy-back) or on WB (copy-back); R/W hit in V.)

VI Protocol (Write-Back Cache)
- ld by processor 1 generates a BR
- Processor 0 responds by writing back its dirty copy, transitioning to I
(Per-step cache/memory state table omitted.)

VI -> MSI
- VI protocol is inefficient
  - Only one cached copy allowed in the entire system
  - Multiple copies can't exist even if read-only
  - Not a problem in the example; big problem in reality
- MSI (modified-shared-invalid) fixes the problem: splits the V state into two states
  - M (modified): local dirty copy
  - S (shared): local clean copy
- Allows either multiple read-only copies (S state) OR a single read/write copy (M state)

MSI Protocol (Write-Back Cache)
- ld by processor 1 generates a BR; processor 0 responds by writing back its dirty copy, transitioning to S
- st by processor 1 generates a BW; processor 0 responds by transitioning to I
- (A state-machine sketch of these transitions appears after this slide block.)
(State diagram: I -> S on R (BR), I -> M on W (BW), S -> M on W (BW), S -> I on observed BW, M -> S on observed BR (copy-back), M -> I on observed BW or WB (copy-back). Per-step state table omitted.)

More Generally: MOESI [Sweazey & Smith, ISCA 1986]
- M: Modified (dirty)
- O: Owned (dirty but shared) WHY?
- E: Exclusive (clean, unshared): only copy, not dirty
- S: Shared
- I: Invalid
- Variants: MSI, MESI, MOSI, MOESI
(Figure: the five states arranged along validity, exclusiveness, and ownership axes.)

Qualitative Sharing Patterns [Weber & Gupta, ASPLOS III]
- Read-Only
- Mostly Read: more processors imply more invalidations, but writes are rare
- Migratory Objects: manipulated by one processor at a time, often protected by a lock; usually a write causes only a single invalidation
- Synchronization Objects: often more processors imply more invalidations
- Frequently Read/Written: more processors imply more invalidations
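The MSI transitions above can be collapsed into a small per-block state machine. The sketch below is my own illustration, not the lecture's code; the type and function names (msi_state_t, msi_next, EV_*) are made up, and a real controller would also generate the data responses (CB) and touch the cache itself.

    /* Illustrative sketch (not the lecture's code): per-block MSI state
     * machine using the event names from the slides. */
    #include <stdio.h>

    typedef enum { I, S, M } msi_state_t;
    typedef enum { EV_R, EV_W, EV_WB,          /* processor-side: read, write, write-back */
                   EV_BR, EV_BW } msi_event_t; /* bus-side: bus-read, bus-write */

    msi_state_t msi_next(msi_state_t cur, msi_event_t ev) {
        switch (cur) {
        case I:
            if (ev == EV_R)  return S;   /* read miss: issue BR, get a clean copy */
            if (ev == EV_W)  return M;   /* write miss: issue BW, get exclusive copy */
            return I;                    /* bus events don't matter if we don't have it */
        case S:
            if (ev == EV_W)  return M;   /* upgrade: issue BW to invalidate other copies */
            if (ev == EV_BW) return I;   /* another processor is writing: invalidate */
            if (ev == EV_WB) return I;   /* replacement of a clean copy */
            return S;                    /* local reads and remote BRs are fine */
        case M:
            if (ev == EV_BR) return S;   /* remote read: copy-back dirty data, stay shared */
            if (ev == EV_BW) return I;   /* remote write: copy-back and invalidate */
            if (ev == EV_WB) return I;   /* replacement: copy-back */
            return M;                    /* local reads/writes hit */
        }
        return I;
    }

    int main(void) {
        /* Replay the slide's example from processor 0's point of view:
         * P0 writes (I->M), P1 reads (P0 sees BR, M->S), P1 writes (P0 sees BW, S->I). */
        msi_state_t s = I;
        s = msi_next(s, EV_W);
        s = msi_next(s, EV_BR);
        s = msi_next(s, EV_BW);
        printf("final state of P0's copy: %d (0 = I)\n", s);
        return 0;
    }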

False Sharing
- Two (or more) logically separate data items share a cache block
  - Modern caches have block sizes of 32-256 bytes (64 bytes typical)
  - Compilers may co-allocate independent fields
  - Updates may cause the block to ping-pong between caches
- Example:

    int a[16];
    foreach thread i { a[i]++; }

- Solution: pad the data structure

    struct { int d; int pad[7]; } a[16];
    foreach thread i { a[i].d++; }

- (A runnable sketch of this effect appears after this slide block.)

Scalable Cache Coherence
- Scalable cache coherence: two-part solution
- Part I: bus bandwidth
  - Replace the non-scalable bandwidth substrate (bus) with a scalable one (point-to-point network, e.g., mesh)
- Part II: processor snooping bandwidth
  - Interesting: most snoops result in no action
  - For loosely shared data, only one processor probably has it
  - Replace the non-scalable broadcast protocol (spam everyone) with a scalable directory protocol (only spam processors that care)

Directory Coherence Protocols
- Observe: the physical address space is statically partitioned
  + Can easily determine which memory module holds a given line; that memory module is sometimes called the home
  - Can't easily determine which processors have the line in their caches
- Bus-based protocol: broadcast events to all processors/caches
  ± Simple and fast, but non-scalable
- Directories: non-broadcast coherence protocol
  - Extend memory to track caching information
  - For each physical cache line whose home this is, track:
    - Owner: which processor has a dirty copy (i.e., M state)
    - Sharers: which processors have clean copies (i.e., S state)
  - A processor sends a coherence event to the home directory
  - The home directory forwards events to processors
  - Processors only receive events they care about

MSI Directory Protocol
- Processor side
  - Directory follows its own protocol (obvious in principle)
- Similar to bus-based MSI
  - Same three states
  - Same six actions (keep BR/BW names)
  - Minus grayed-out arcs/actions: bus events that would not trigger action anyway
  + Directory won't bother you unless you need to act
(State diagram similar to bus-based MSI, with ACK responses added; figures show a 2-hop miss (P0 <-> Dir) and a 3-hop miss (P0 -> Dir -> P1).)
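The foreach pseudocode above can be made concrete with threads. The sketch below is my own, not the lecture's code; the thread count, iteration count, and a 64-byte block size are assumptions. Each thread touches only its own counter, yet the unpadded counters share cache blocks and ping-pong between caches exactly as the slide describes.

    /* Sketch (not from the slides): padded vs. unpadded per-thread counters. */
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define ITERS    50000000L

    struct padded_t { int d; char pad[64 - sizeof(int)]; };  /* one counter per block */

    int             unpadded[NTHREADS];      /* counters packed into one block */
    struct padded_t padded[NTHREADS];        /* each counter in its own block */

    static void *bump_unpadded(void *arg) {
        long i = (long)arg;
        for (long n = 0; n < ITERS; n++) unpadded[i]++;   /* false sharing */
        return NULL;
    }

    static void *bump_padded(void *arg) {
        long i = (long)arg;
        for (long n = 0; n < ITERS; n++) padded[i].d++;   /* no sharing at all */
        return NULL;
    }

    static void run(void *(*fn)(void *)) {
        pthread_t t[NTHREADS];
        for (long i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, fn, (void *)i);
        for (int  i = 0; i < NTHREADS; i++) pthread_join(t[i], NULL);
    }

    int main(void) {
        run(bump_unpadded);   /* time this phase... */
        run(bump_padded);     /* ...against this one (e.g., with the time command) */
        printf("%d %d\n", unpadded[0], padded[0].d);
        return 0;
    }

On a multicore machine the padded loop typically runs several times faster, even though both versions do the same arithmetic.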

Directory MSI Protocol
- ld by P1 sends a BR to the directory
  - Directory sends a BR to P0; P0 sends P1 the data, does a WB, goes to S
- st by P1 sends a BW to the directory
  - Directory sends a BW to P0; P0 goes to I
- (A sketch of the directory entry this protocol consults appears after this slide block.)
(Per-step P0/P1/directory state table omitted.)

Directory Flip Side: Latency
- Directory protocols
  + Lower bandwidth consumption, more scalable
  - Longer latencies
- Two read-miss situations
  - Unshared block: get data from memory
    - Bus: 2 hops (P0 -> memory -> P0)
    - Directory: 2 hops (P0 -> memory -> P0)
  - Shared or exclusive block: get data from the other processor (P1), assuming the cache-to-cache transfer optimization
    - Bus: 2 hops (P0 -> P1 -> P0)
    - Directory: 3 hops (P0 -> memory -> P1 -> P0)
    - Occurs frequently with many processors: high probability that one other processor has it

Today's Outline: Shared Memory Review
- Introduction to Shared Memory
- Cache Coherence
- Synchronization
  - Problem
  - Test-and-set, etc.
- Memory Consistency

The Need for Synchronization
- We're not done; consider the following execution
  - Write-back caches (doesn't matter, though), MSI protocol
- What happened? We got it wrong, and coherence had nothing to do with it
(Per-step cache/memory state table omitted.)
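The owner/sharers bookkeeping the directory slides describe can be pictured as a small record per line at its home node. The sketch below is my own illustration, not the lecture's design; the field names, the 64-node bit vector, and the address-to-home mapping are assumptions.

    /* Illustrative sketch (not from the slides) of per-line home-directory state:
     * coherence state, the owner of a dirty copy, and a sharer bit vector. */
    #include <stdint.h>
    #include <stdio.h>

    #define NUM_NODES  64
    #define LINE_BYTES 64

    typedef enum { DIR_I, DIR_S, DIR_M } dir_state_t;

    typedef struct {
        dir_state_t state;    /* I: uncached, S: clean copies exist, M: one dirty copy */
        uint64_t    sharers;  /* bit i set => processor i holds a clean (S) copy */
        int         owner;    /* valid only in DIR_M: processor with the dirty copy */
    } dir_entry_t;

    /* Home node for a physical address: static partition by line address. */
    static int home_node(uint64_t paddr) {
        return (int)((paddr / LINE_BYTES) % NUM_NODES);
    }

    /* A read miss (BR) arriving at the home: if the line is dirty, the request is
     * forwarded to the owner (the 3-hop case); the requester becomes a sharer. */
    static void dir_handle_read(dir_entry_t *e, int requester) {
        if (e->state == DIR_M) {
            /* forward BR to e->owner; owner copies the data back and drops to S */
            e->sharers |= (1ULL << e->owner);
            e->owner = -1;
        }
        e->sharers |= (1ULL << requester);
        e->state = DIR_S;
    }

    int main(void) {
        dir_entry_t e = { DIR_M, 0, 0 };      /* P0 holds the line dirty */
        dir_handle_read(&e, 1);               /* P1 read-misses: the 3-hop miss */
        printf("home of 0x10040 is node %d; sharers = 0x%llx\n",
               home_node(0x10040), (unsigned long long)e.sharers);
        return 0;
    }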

The Need for Synchronization
- What really happened?
  - Access to accts[241].bal should conceptually be atomic
  - Transactions should not be interleaved, but that's exactly what happened
- The same thing can happen on a multiprogrammed uniprocessor!
- Solution: synchronize access to accts[241].bal
(Per-step cache/memory state table omitted.)

Synchronization
- Synchronization: the second issue for shared memory
  - Regulate access to shared data
  - Software constructs: semaphore, monitor
  - Hardware primitive: lock
    - Operations: acquire(lock) and release(lock)
    - Region between acquire and release is a critical section
    - Must interleave acquire and release; a second consecutive acquire will fail (actually, it will block)

    struct acct_t { int bal; };
    shared struct acct_t accts[max_acct];
    shared int lock;
    int id, amt;
    acquire(lock);
    if (accts[id].bal >= amt) {
        accts[id].bal -= amt;   // critical section
        spew_cash();
    }
    release(lock);

Working Spinlock: Test-And-Set
- ISA provides an atomic lock-acquisition instruction
- Example: test-and-set

    t&s r1,0(&lock)

  Atomically executes:

    mov r1,r2
    ld r1,0(&lock)
    st r2,0(&lock)

  - If the lock was initially free (0), acquires it (sets it to 1)
  - If the lock was initially busy (1), doesn't change it
- New acquire sequence:

    A0: t&s r1,0(&lock)
    A1: bnez r1,#A0

- Similar instructions: swap, compare&swap, exchange, and fetch-and-add
- (A C11 sketch of this spinlock appears after this slide block.)

Test-and-Set Lock Correctness
- Two processors race through the acquire loop; one enters the CRITICAL_SECTION, the other keeps spinning on A1: bnez r1,#A0
+ Test-and-set lock actually works
(Two-processor instruction trace omitted.)
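In portable C, the t&s instruction corresponds to an atomic exchange. Below is a minimal C11 sketch of the spinlock (my own rendering under that assumption, not the lecture's code); the acquire/release names mirror the slide's pseudocode.

    /* Sketch (not from the slides): a test-and-set spinlock with C11 atomics.
     * atomic_exchange plays the role of the t&s instruction. */
    #include <stdatomic.h>

    typedef atomic_int spinlock_t;          /* 0 = free, 1 = busy */

    void acquire(spinlock_t *lock) {
        /* A0/A1 loop: atomically set the lock to 1 and retry while the old
         * value was 1 (i.e., the lock was already held). */
        while (atomic_exchange_explicit(lock, 1, memory_order_acquire) != 0)
            ; /* spin */
    }

    void release(spinlock_t *lock) {
        atomic_store_explicit(lock, 0, memory_order_release);
    }

The withdrawal code on the slide would then bracket its critical section with acquire(&lock) and release(&lock).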

Test-and-Set Lock Performance
- But it performs poorly in doing so
- Consider 3 processors rather than 2 (third not shown)
  - Processor 0 has the lock and is in the critical section
  - But what are processors 1 and 2 doing in the meantime?
    - Loops of t&s, each of which includes a st
    - Taking turns invalidating each other's cache lines
    - Generating a ton of useless bus (network) traffic
(Two-processor instruction and cache-state trace omitted: the lock line ping-pongs between M and I in the spinners' caches.)

Test-and-Test-and-Set Locks
- Solution: test-and-test-and-set locks
- New acquire sequence:

    A0: ld r1,0(&lock)
    A1: bnez r1,A0
    A2: addi r1,1,r1
    A3: t&s r1,0(&lock)
    A4: bnez r1,A0

- Within each loop iteration, before doing a t&s
  - Spin doing a simple test (ld) to see if the lock value has changed
  - Only do a t&s (st) if the lock is actually free
- Processors can spin on a busy lock locally (in their own cache)
+ Less unnecessary bus traffic
- (A C11 sketch of this acquire loop appears after this slide block.)

Test-and-Test-and-Set Lock Performance
- Processor 0 releases the lock, invalidating processors 1 and 2
- Processors 1 and 2 race to acquire; processor 1 wins
(Two-processor instruction and cache-state trace omitted: spinners hold the lock line in S until the release, then the winning t&s takes it to M.)

A Final Word on Locking
- A single lock for the whole array may restrict parallelism
  - Will force updates to different accounts to proceed serially
- Solution: one lock per account
- Locking granularity: how much data does a lock lock?
  - A software issue, but one you need to be aware of
- Also, recall the deadlock example

    struct acct_t { int bal, lock; };
    shared struct acct_t accts[max_acct];
    int id, amt;
    acquire(accts[id].lock);
    if (accts[id].bal >= amt) {
        accts[id].bal -= amt;
        spew_cash();
    }
    release(accts[id].lock);
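The test-and-test-and-set acquire sequence above maps onto C11 atomics as follows. This is my own sketch, not the lecture's code; the function name ttas_acquire is made up. The inner load spins on the processor's locally cached (S-state) copy, and only the exchange generates bus traffic.

    /* Sketch (not from the slides): test-and-test-and-set acquire in C11. */
    #include <stdatomic.h>

    typedef atomic_int spinlock_t;          /* 0 = free, 1 = busy */

    void ttas_acquire(spinlock_t *lock) {
        for (;;) {
            /* A0/A1: read-only spin; no bus traffic while the lock stays busy */
            while (atomic_load_explicit(lock, memory_order_relaxed) != 0)
                ;
            /* A2-A4: try the real test-and-set; retry if someone beat us to it */
            if (atomic_exchange_explicit(lock, 1, memory_order_acquire) == 0)
                return;
        }
    }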

Today's Outline: Shared Memory Review
- Motivation for Four-Lecture Course
- Introduction to Shared Memory
- Cache Coherence
- Synchronization
- Memory Consistency

Memory Consistency
- Memory coherence
  - Creates a globally uniform (consistent) view of a single memory location (in other words: cache line)
  - Not enough: cache lines A and B can be individually consistent but inconsistent with respect to each other
- Memory consistency
  - Creates a globally uniform (consistent) view of all memory locations relative to each other
- Who cares? Programmers
  - Globally inconsistent memory creates mystifying behavior

Coherence vs. Consistency

    /* initially A = flag = 0 */
    P0: A = 1;              P1: while (!flag);  // spin
        flag = 1;               print A;

- Intuition says: P1 prints A = 1
- Coherence says: absolutely nothing
  - P1 can see P0's write of flag before the write of A!!!
- How?
  - Maybe the coherence event for A is delayed somewhere in the network
  - Maybe P0 has a coalescing write buffer that reorders writes
- Imagine trying to figure out why this code sometimes works and sometimes doesn't
- (Some) real systems act in this strange manner
- (A C11 rendering of this example appears after this slide block.)

Sequential Consistency (SC)
- Sequential processors issue memory ops in program order
- A switch is randomly set after each memory op
- Provides a single sequential order among all operations
(Figure: P1, P2, P3 feeding a single switch into memory.)
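The flag/A example can be written down in C11, where the allowed reorderings are explicit. The sketch below is my own rendering, not the lecture's code; with memory_order_relaxed nothing orders the write of A against the write of flag, so printing 0 is legal, while memory_order_seq_cst (or the acquire/release forms shown later) restores the intuitive outcome.

    /* Sketch (not from the slides): the coherence-vs-consistency example in C11. */
    #include <stdatomic.h>
    #include <stdio.h>
    #include <pthread.h>

    atomic_int A    = 0;
    atomic_int flag = 0;

    static void *p0(void *arg) {
        (void)arg;
        atomic_store_explicit(&A, 1, memory_order_relaxed);
        atomic_store_explicit(&flag, 1, memory_order_relaxed);
        return NULL;
    }

    static void *p1(void *arg) {
        (void)arg;
        while (atomic_load_explicit(&flag, memory_order_relaxed) == 0)
            ; /* spin */
        /* With relaxed ordering this may legally print 0. */
        printf("A = %d\n", atomic_load_explicit(&A, memory_order_relaxed));
        return NULL;
    }

    int main(void) {
        pthread_t t0, t1;
        pthread_create(&t0, NULL, p0, NULL);
        pthread_create(&t1, NULL, p1, NULL);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        return 0;
    }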

Sufficient Conditions for SC
- Every processor issues memory ops in program order
- A processor must wait for a store to complete before issuing the next memory operation
- After a load, the issuing processor waits for the load to complete, and for the store that produced the value to complete, before issuing the next op
- Easily implemented with a shared bus

Relaxed Memory Models
- Motivation 1: directory protocols
  - Misses have longer latency
  - Collecting acknowledgements can take even longer
- Motivation 2: (simpler) out-of-order processors do cache hits to get to the next miss
- Recall SC has
  - Each processor generates a total order of its reads and writes (R-->R, R-->W, W-->W, & W-->R)
  - That are interleaved into a global total order
- (Most) relaxed models
  - PC: relax ordering from writes to (other processors') reads
  - RC: relax all read/write orderings (but add "fences")

Relax Write to Read Order

    /* initial A = B = 0 */
    P1: A = 1;              P2: B = 1;
        r1 = B;                 r2 = A;

- Processor Consistent (PC) models allow r1 == r2 == 0 (precluded by SC)
- Examples: IBM 370, Sun TSO, & Intel IA-32
- Why do this?
  - Allows FIFO write buffers
  - Does not astonish programmers (too much)
- (A C11 rendering of this litmus test appears after this slide block.)

Write Buffers w/ Read Bypass
(Figure: P1 and P2 each with a write buffer in front of a shared bus holding Flag1 = 0 and Flag2 = 0; each processor's read of the other flag (t1-t4) can bypass its own buffered write.)

    /* initially Flag1 = Flag2 = 0 */
    P1: Flag1 = 1;              P2: Flag2 = 1;
        if (Flag2 == 0)             if (Flag1 == 0)
            critical section            critical section
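The "relax write to read" litmus test is easy to run. The sketch below is my own C11 rendering, not the lecture's code; with relaxed atomics (and on TSO-like hardware) both threads can observe 0 because each store sits in a write buffer while the following load bypasses it, exactly the outcome SC forbids.

    /* Sketch (not from the slides): the store-buffering litmus test in C11. */
    #include <stdatomic.h>
    #include <stdio.h>
    #include <pthread.h>

    atomic_int A = 0, B = 0;
    int r1, r2;

    static void *p1(void *arg) {
        (void)arg;
        atomic_store_explicit(&A, 1, memory_order_relaxed);
        r1 = atomic_load_explicit(&B, memory_order_relaxed);
        return NULL;
    }

    static void *p2(void *arg) {
        (void)arg;
        atomic_store_explicit(&B, 1, memory_order_relaxed);
        r2 = atomic_load_explicit(&A, memory_order_relaxed);
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, p1, NULL);
        pthread_create(&t2, NULL, p2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("r1=%d r2=%d\n", r1, r2);   /* r1 == r2 == 0 is allowed here; SC forbids it */
        return 0;
    }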

Also Want Causality

    /* initially all 0 */
    P1: A = 1;         P2: while (flag1 == 0) {};    P3: while (flag2 == 0) {};
        flag1 = 1;         flag2 = 1;                    r3 = A;

- All commercial models guarantee causality

Why Not Relax All Order?

    /* initially all 0 */
    P1: A = 1;              P2: while (flag == 0);  /* spin */
        B = 1;                  r1 = A;
        flag = 1;               r2 = B;

- Reordering of A = 1 / B = 1 or r1 = A / r2 = B
  - Via OOO processors, non-FIFO write buffers, delayed directory acknowledgements, etc.
- But programmers expect the following order:
  - A = 1 / B = 1 before flag = 1
  - flag != 0 before r1 = A / r2 = B

Order with Synch Operations

    /* initially all 0 */
    P1: A = 1;                  P2: while (SYNCH flag == 0);
        B = 1;                      r1 = A;
        SYNCH flag = 1;             r2 = B;

- Called weak ordering or weak consistency (WC)
- Alternatively, relaxed consistency (RC) specializes:
  - Acquires: force subsequent reads/writes to come after
  - Releases: force previous reads/writes to come before
- (A C11 rendering with release/acquire appears after this slide block.)

Weak Ordering Example
(Figure: groups of reads/writes separated by Synch operations, which order the groups.)
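In C11 the SYNCH flag becomes a release store paired with an acquire load, the programming-language analogue of the acquire/release specialization above. This is my own sketch, not the lecture's code; the producer/consumer function names are made up.

    /* Sketch (not from the slides): the SYNCH-flag example with C11 release/acquire.
     * The release store keeps A and B before flag; the acquire load keeps r1 and r2
     * after it, so r1 == r2 == 1 is guaranteed once the spin exits. */
    #include <stdatomic.h>

    int A = 0, B = 0;                 /* ordinary data */
    atomic_int flag = 0;              /* the synchronization variable */

    void producer(void) {
        A = 1;
        B = 1;
        atomic_store_explicit(&flag, 1, memory_order_release);
    }

    void consumer(int *r1, int *r2) {
        while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
            ; /* spin */
        *r1 = A;
        *r2 = B;
    }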

Release Consistency Example
(Figure: groups of reads/writes bounded by an Acquire and a Release; the Acquire orders only the accesses after it, the Release only those before it.)

Commercial Models use Fences

    /* initially all 0 */
    P1: A = 1;              P2: while (flag == 0);
        B = 1;                  FENCE;
        FENCE;                  r1 = A;
        flag = 1;               r2 = B;

- Examples: Compaq Alpha, IBM PowerPC, & Sun RMO
- Can specialize fences (e.g., Alpha and RMO)
- (A C11 rendering with explicit fences appears after this slide block.)
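The same handoff can be written with explicit fences. The sketch below is my own; atomic_thread_fence is the C11 stand-in for the FENCE instructions on this slide, not a claim about what any particular ISA's fence compiles to, and the flag still has to be an atomic object to avoid a data race in C.

    /* Sketch (not from the slides): the fence-based version of the flag handoff in C11. */
    #include <stdatomic.h>

    int A = 0, B = 0;
    atomic_int flag = 0;

    void producer(void) {
        A = 1;
        B = 1;
        atomic_thread_fence(memory_order_release);           /* FENCE on P1 */
        atomic_store_explicit(&flag, 1, memory_order_relaxed);
    }

    void consumer(int *r1, int *r2) {
        while (atomic_load_explicit(&flag, memory_order_relaxed) == 0)
            ; /* spin */
        atomic_thread_fence(memory_order_acquire);            /* FENCE on P2 */
        *r1 = A;
        *r2 = B;
    }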