Shared Memory Multiprocessors


Shared Memory Multiprocessors (Jesús Labarta)

Shared memory architectures
(Diagram: processors, each with a private cache, connected through an interconnect to memory.)

Concepts: memory and time
- Memory: load/store to addresses; containers of values
- Time:
  - Order: which event happens before, which happens last
  - Interval: how many things can be done between two events
  - Relativity: metrics and relations are defined for a specific observer

Concepts: time in a multiprocessor system
- Local time: program order; interval between instructions; elastic (drift, preemptions)
- Global time: observable by all processors? propagation of signals; mapping of local time to global time
- Problem: ordering of events when timescales are comparable to signal propagation, drift, ...

Concepts: memory operations
- Issued: presented to the memory subsystem
- Performed: effect achieved
- Complete: performed with respect to all processors

Issues: coherence and consistency

Coherence
- Every read sees the last write
- The order is clear when events are far apart in time, fuzzy at short distances
- Who defines "last"?

The problem (u = 5 initially):

P1: Load r1,u ; Load r2,u ; r4=r2+3 ; Store r4,u ; Load r3,u
P2: Load r1,u ; Load r2,u ; Load r3,u
P3: Load r1,u ; Store 7,u ; Load r2,u

What values can r1, r2, r3 hold on each processor?

Coherence (u = 5): local / absolute time / global order?

One possible global order:
P1:Load r1,u ; P3:Load r1,u ; P2:Load r1,u ; P2:Load r2,u ; P3:Store 7,u ; P1:Load r2,u ; P1:Store r4,u ; P2:Load r3,u ; P3:Load r2,u ; P1:Load r3,u
Results: P1: r1=5, r2=7, r3=10; P2: r1=5, r2=5, r3=10; P3: r1=5, r2=10. Coherent memory behavior.

Another order (the store moved before both second loads):
P1:Load r1,u ; P3:Load r1,u ; P2:Load r1,u ; P3:Store 7,u ; P2:Load r2,u ; P1:Load r2,u ; P1:Store r4,u ; P2:Load r3,u ; P3:Load r2,u ; P1:Load r3,u
Results: P1: r1=5, r2=7, r3=10; P2: r1=5, r2=7, r3=10; P3: r1=5, r2=10. Coherent memory behavior.

Coherence (u = 5), a third order:
P1:Load r1,u ; P3:Load r1,u ; P2:Load r1,u ; P1:Load r2,u ; P2:Load r2,u ; P2:Load r3,u ; P3:Store 7,u ; P3:Load r2,u ; P1:Store r4,u ; P1:Load r3,u
Results: P1: r1=5, r2=5, r3=8; P2: r1=5, r2=5, r3=5; P3: r1=5, r2=7. Coherent memory behavior??

Non-coherent memory behavior (u = 5). How could this happen?
Results: P1: r1=5, r2=7, r3=10; P2: r1=5, r2=10, r3=7; P3: r1=5, r2=7.
P2 observes the value 10 before the value 7: the two writes to u are seen in different orders by different processors.

Another non-coherent outcome (u = 5):
Results: P1: r1=5, r2=7, r3=10; P2: r1=5, r2=10, r3=10; P3: r1=5, r2=8.

(Diagram: several cached copies of u holding different values, 5, 7, 8, at the same time. Which value should a Load return?)

Coherence (definition)
For any execution of the program, it is possible to build a hypothetical total order of accesses to each memory location that would result in the actually observed result of the program, where:
- the local partial order is the program order, and
- the value returned by each read is the value written by the last write in the total order.

Coherence properties
- Write propagation: all writes become visible at some time to other processors
- Write serialization: all processors see writes (to a location) in the same order

Coherence mechanisms
Need a mechanism to establish a total order / global time:
- Bus: snoop-based protocols
- Memory: directory-based protocols

Bus-based coherence. The bus is a set of lines visible to all processors:
- Address: driven by the master
- Data: driven by master / slave / others
- Control: driven by the master (command), the slave (response), and others (responses)
The bus provides a global time base: bus cycles define a global time.

Snooping protocols
- All processors listen to all bus transactions
- The order of writes on the bus defines the total order of writes; this gives write propagation
- Invalidation-based: the cache state transition diagram considers events from the processor and from the bus
- Example: copy-back caches, MSI protocol; multiple reader copies or a single writer copy of a block in the caches

MSI
- Processor commands: PrRd, PrWr
- Bus transactions: BusRd (want the block, will not modify it), BusRdX (want the block to modify it), BusWB (write back the block)

MSI
Notation for controller state transitions: command (from processor or bus) / bus transaction issued.
States: Invalid (I), Shared (S), Modified (M). Transitions:
- I --PrRd/BusRd--> S ; I --PrWr/BusRdX--> M
- S --PrRd/- --> S ; S --BusRd/- --> S ; S --PrWr/BusRdX--> M ; S --BusRdX/- --> I
- M --PrRd/-, PrWr/- --> M ; M --BusRd/BusWB--> S ; M --BusRdX/BusWB--> I

Bus transfers: bus lines can also be driven by other caches
- Data lines: cache-to-cache transfers
- Control: abort command / error retry; "I am providing the data" (memory must not respond)

Other bus transactions
- BusUpgr: upgrade from Shared to Modified without a data transfer, to reduce traffic
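The MSI transitions above can be written as a table-driven step function. This is a minimal sketch, not a full controller: the enum and struct names are illustrative, and the function only reports the next state plus the bus transaction the controller must issue.

```c
#include <stddef.h>

/* Hypothetical encoding of the MSI diagram: states, events (processor
   commands PrRd/PrWr and snooped bus transactions BusRd/BusRdX), and
   the bus transaction the cache controller must issue in response. */
typedef enum { INVALID, SHARED, MODIFIED } msi_state;
typedef enum { PR_RD, PR_WR, BUS_RD, BUS_RDX } msi_event;
typedef enum { NO_TX, BUS_RD_TX, BUS_RDX_TX, BUS_WB_TX } bus_tx;

typedef struct { msi_state next; bus_tx tx; } msi_result;

msi_result msi_step(msi_state s, msi_event e) {
    switch (s) {
    case INVALID:
        if (e == PR_RD) return (msi_result){SHARED,   BUS_RD_TX};
        if (e == PR_WR) return (msi_result){MODIFIED, BUS_RDX_TX};
        return (msi_result){INVALID, NO_TX};       /* bus events: ignore */
    case SHARED:
        if (e == PR_WR)   return (msi_result){MODIFIED, BUS_RDX_TX};
        if (e == BUS_RDX) return (msi_result){INVALID,  NO_TX};
        return (msi_result){SHARED, NO_TX};        /* PrRd, BusRd: no change */
    case MODIFIED:
        if (e == BUS_RD)  return (msi_result){SHARED,  BUS_WB_TX};  /* flush */
        if (e == BUS_RDX) return (msi_result){INVALID, BUS_WB_TX};  /* flush */
        return (msi_result){MODIFIED, NO_TX};      /* PrRd/PrWr hit */
    }
    return (msi_result){INVALID, NO_TX};
}
```

A write hit in M issues no transaction, while a write in S or I always goes to the bus, which is exactly what makes the bus order a total order of writes.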

Directory-based coherence
- Interconnect: many links, no longer a single bus; point-to-point transactions
- The responsibility to define a global order falls to memory: directory protocols
- System state:
  - Caches: local state information, the same as in bus-based protocols
  - Memory: global state information (the directory), roughly the union of the local states
- Protocol: local and directory state transitions, fired by processor / network transactions, which in turn fire new transactions

Directory protocols
- Processor commands: PrRd, PrWr
- States: Invalid, Shared, Modified
- Protocol: a sequence of request/response transactions implements each state transition
  - Explicit requests
  - Non-atomic: risk of race conditions
  - Huge number of possibilities
  - How many transactions are in the critical path?

Directory protocol transactions: non-shared blocks (read/write)
- Commands: RdReq, RdXReq, WBReq; responses: RdResp, RdXResp
- 1: command, Local to Home; 2: response, Home to Local

Directory protocol transactions
Get shared block for read:
- 1: RdReq, Local to Home
- 2: RdResp, Home to Local

Get shared block for write:
- 1: RdXReq, Local to Home
- 2: ReadersList, Home to Local
- 3: Invalidate, Local to each Reader
- 4: Done, each Reader to Local

Directory protocol transactions
Get modified block for read:
- 1: RdReq, Local to Home
- 2: Owner, Home to Local
- 3: Intervention (downgrade, send it to me & write back), Local to Owner
- 3/4: Data, Owner to Local; WB_Downgraded, Owner to Home

Get modified block for write:
- 1: RdXReq, Local to Directory
- 2: Owner, Directory to Local
- 3: TransferOwnership, Local to Owner
- 4: Ownership, Owner to Local

Directory protocols and time
- Non-atomic sequences of transactions: how is global time established? When does the owner change?
- What if, while getting a shared block for write, a sharer simultaneously sends a RdXReq? Or an UpgradeReq?
- To be safe, slow down: temporary states; state transition diagrams for the directory and the nodes in the protocol above

Safe transactions: slow down on "get shared block for write"
- 1: RdXReq; 2: ReadersList; 3: Invalidate to each Reader; 4: Done from each Reader
- Do not serve later interventions to Local before it actually completes the write.

Safe transactions (continued)
- Concurrent requesters (Local, Local2, Local3) may race for the same block through Home.
- Slow down on "get shared block for write": after collecting the Done responses, Local sends 5: Revision to Home.
- Extremely conservative: avoid the arrival of later interventions to Local before it actually completes the write.

Safe transactions: slow down on "get modified block for write"
- 1: RdXReq; 2: Owner; 3: TransferOwnership; 4: Ownership; then 5: Revision to the directory
- Extremely conservative: avoid the arrival of later interventions to Local before it actually completes the write.

Directory automaton (extremely conservative; diagram): states I, S, M plus transient states Invalidating, Transferring, Intervening.
- I --RdXReq/RdXResp--> M ; I --RdReq/RdResp--> S ; S --RdReq/RdResp--> S
- S --RdXReq/ReaderList--> Invalidating --Revision/- --> M
- M --RdXReq/Owner--> Transferring --Revision/- --> M
- M --RdReq/Owner--> Intervening --WB_D/- --> S

Directory protocols: states/transitions
Directory automaton without the Revision messages (diagram): states I, S, M plus Intervening.
- I --RdXReq/RdXResp--> M ; I --RdReq/RdResp--> S ; S --RdReq/RdResp--> S
- S --RdXReq/ReaderList--> M ; M --RdXReq/Owner--> M
- M --RdReq/Owner--> Intervening --WB_D/- --> S

Local automaton (diagram): states I, S, M plus transient states Reading, Upgrading, Intervening, Invalidating, Replacing. Transition labels include: PrRd/RdReq, PrWr/RdXReq, RdResp/-, Data/-, Done*/-, ReaderList/Invalidate*, Owner/TransferOwner, TransferOwner/Ownership, Ownership/(Revision), Last Done/(Revision), Owner/Intervention, Intervention/Data,WB_D, Invalidate/Done, and the hits PrRd/-, PrWr/-.

Directory protocols: states/transitions
Local automaton, variant with unconditional Revision (diagram): same states and labels as above, with Data/Done and Intervention/Data on the Intervening path.

Open issues marked on the diagram: what happens if an Invalidation arrives while the node is in a transient state (Reading, Upgrading, Replacing)? What if an Intervention arrives while Intervening or Invalidating?

Options / optimizations: three-message miss protocols (reply forwarding)
- Reduce the number of messages in the critical path; no longer a pure request/response structure
- Example, get modified block for read:
  - Baseline: 1: RdReq, Local to Home; 2: Owner, Home to Local; 3: Intervention, Local to Owner; 4a: Data, Owner to Local; 4b: WB_D, Owner to Home
  - With forwarding: 1: Command, Local to Directory; 2: Intervention, Directory to Owner; 3: Data, Owner to Local; 4: WB_D, Owner to Home

Request/reply forwarding, directory automaton (diagram): states I, S, M plus Intervening and Invalidating. Transition labels include: RdXReq/RdXResp, RdReq/RdResp, RdXReq/Invalidations, RdXReq/Intervention (inv&fwd), RdReq/Intervention (downgrade&fwd), WB/-, Done*/-, lastdone/data, Revision(data)/-.

Reply forwarding: states/transitions
Local automaton (diagram): states I, S, M plus Reading and Upgrading. Transition labels include: PrRd/RdReq, PrWr/RdXReq, Data/-, RdResp or Data/-, Intervention(inv&fwd)/Data,Revision, Intervention(downgrade&fwd)/Data,Revision, Line WB/WB, Invalidate/Done, and the hits PrRd/-, PrWr/-.

Open issues marked on the diagram: what to do when an Invalidate or an invalidate&forward arrives while Reading or Upgrading? Options: WAIT, NACK.

Directory organization
- Centralized flat directory: full bit vector, 1 presence bit per processor + 1 dirty bit
- Hierarchical
- Distributed list: the directory holds the head of a list; every node's block holds a pointer to the next sharer
  - Features: sequential search, limited space requirement

Options / optimizations
- Replacement hint: notify the replacement of a shared block to avoid unnecessary invalidations

Consistency
About: the order between accesses to different memory locations.

The problem (A = 0, flag = 0; A = 0, B = 0):

P1: A = 1; flag = 1
P2: while (flag == 0); print A
Printed value: ?

P1: A = 1
P2: u = A; B = 1
P3: v = B; w = A
u, v, w: ?

Consistency: the problem
- First example: can the printed value be 0?
- Second example: can u,v,w be 0,0,1? 1,1,1? 1,1,0?

Memory consistency model: the constraints on the order in which memory operations must appear to be performed with respect to one another.

Sequential consistency: the effect is as if there were a single interleaving of the individual accesses, seen by all processors, that respects the program order of each processor.
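The flag example can be reproduced with C11 atomics: sequentially consistent (`seq_cst`, the default) operations ask the compiler and hardware for exactly the interleaving semantics defined above, so the printed value cannot be 0. This is a sketch with illustrative names; it assumes POSIX threads.

```c
#include <pthread.h>
#include <stdatomic.h>

/* P1: A = 1; flag = 1  -- seq_cst keeps the two stores in order
   P2: while (flag == 0); print A  -- must then observe A == 1 */
static atomic_int A, flag;

static void *producer(void *arg) {
    (void)arg;
    atomic_store(&A, 1);     /* A = 1 */
    atomic_store(&flag, 1);  /* flag = 1, ordered after the store to A */
    return NULL;
}

/* Runs one instance of the litmus test and returns the value P2 reads. */
int observed_A(void) {
    atomic_store(&A, 0);
    atomic_store(&flag, 0);
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);
    while (atomic_load(&flag) == 0)
        ;                              /* busy wait on the flag */
    int a = atomic_load(&A);           /* under SC this is always 1 */
    pthread_join(t, NULL);
    return a;
}
```

With relaxed atomics (or plain variables on a weakly ordered machine) the store to `flag` could become visible before the store to `A`, and 0 could be printed; that is exactly what the consistency model rules out here.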

Sequential consistency
SC requires write atomicity. Sufficient conditions:
- Every process issues memory operations in program order
- After issuing a write, a process waits for it to complete before issuing the next operation
- After issuing a read, a process waits for the read to complete, and for the write whose value is being returned to complete, before issuing the next operation

Needed: a mechanism to detect write completion, and a mechanism to detect read completion (the data arrives).

Sequential consistency in the snooping MSI protocol
- Write completion: the BusRdX transaction occurs and the write is performed in the cache
- Read completion: availability of the data implies that the write that produced it completed:
  1. if the read causes a bus transaction, the write's transaction came before it
  2. if the data was in M, this processor had already completed the write
  3. if the data was in S, this processor had already read the data (reduce to case 1 or 2)

Sequential consistency in directory protocols; risks:
- Different paths through the network, with different delays and congestion
- The network may not preserve the order of messages

Write completion example (A = B = 0):
P1: A = 1; ...; B = 2
P2, P3, P4: while (A == 0 && B == 0); print A, B
Printed values: P2: 1,0; P3: 1,2; P4: 0,2. P4's result shows the write A = 1 was not yet complete (visible to all) when B = 2 became visible.

Sequential consistency: write atomicity (A = B = 0)
P1: A = 1
P2: while (A == 0); B = 1
P3: while (B == 0); print A
Printed value: 0? Only possible if the write A = 1 is seen by P2 but not yet by P3, a violation of write atomicity.

Sequential consistency in directory protocols
- Write completion detection: acknowledgement of invalidations
- Write atomicity: the owner does not allow accesses to the new value until all acknowledgements have returned
- Risks: concurrent transactions under different network paths, delays and congestion; the network not preserving message order

Synchronization
Need to ensure certain orders between computations:
- Mutual exclusion
- Point-to-point events
- Collective synchronization (barrier)

Components: acquire method (acquire the right to the synchronization), waiting algorithm, release method (allow others to proceed).

Waiting algorithms
- Busy wait: faster, but consumes resources during the wait; deadlock risk
- Blocking: overhead; dependence on the scheduler
- Two-phase: wait for a while, then block
  - Sensitive to the waiting time, e.g. 1x / 2x the context-switch overhead
  - Can frequently be controlled through an environment variable

Mutual exclusion
Problem with the following code?

lock:   ld r1, location
        bnz r1, lock
        st location, #1
unlock: st location, #0

The test (load) and the set (store) are separate instructions: two processors can both load 0 and both enter the critical section.

Mutual exclusion with t&s: based on an atomic modification of the variable

lock:   t&s r1, location
        bnz r1, lock
unlock: st location, #0

Atomicity:
- Single processor: one instruction
- Multiprocessor: atomic memory read-and-update
  - Locking the bus is inefficient
  - Instead: read exclusive and do not release the block until it is updated
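The t&s lock can be sketched with C11 atomics standing in for the hardware primitive: `atomic_exchange` atomically writes 1 and returns the old value, exactly the read-and-update the slide requires. Names (`ts_lock`, `run_ts_demo`) and the thread counts are illustrative; POSIX threads are assumed.

```c
#include <pthread.h>
#include <stdatomic.h>

static atomic_int lock_word;   /* 0 = free, 1 = held */
static long counter;           /* plain variable protected by the lock */

static void ts_lock(void) {
    /* spin until the exchange returns 0, i.e. we flipped free -> held */
    while (atomic_exchange(&lock_word, 1))
        ;
}

static void ts_unlock(void) {
    atomic_store(&lock_word, 0);
}

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        ts_lock();
        counter++;             /* critical section */
        ts_unlock();
    }
    return NULL;
}

/* Spawns n threads (n <= 8) hammering the lock; returns the final count. */
long run_ts_demo(int n) {
    pthread_t t[8];
    counter = 0;
    atomic_store(&lock_word, 0);
    for (int i = 0; i < n; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < n; i++) pthread_join(t[i], NULL);
    return counter;
}
```

Without the lock the increments would race and the final count would typically fall short of n * 100000; with it, mutual exclusion makes the result exact.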

Mutual exclusion: performance goals
- Low latency
- High throughput
- Scalability
- Low storage cost
- Fairness

t&s performance
- Latency: low; cached locks allow faster reuse
- Throughput: a write on every iteration by all threads causes a lot of traffic and invalidations
- Scalability: poor
- Storage: low
- Fairness: no (cached locks have a higher probability of success)

Mx algorithms: t&s with backoff
- Insert delays after unsuccessful attempts to acquire the lock
  - Not too long: would leave the lock idle while requests are pending
  - Not too short: contention
- Exponential backoff: delay = k * c^i

t&t&s (test-and-test-and-set)
- Wait by spinning with a normal load instruction
- The lock will be cached, so the wait does not use system bandwidth

Mx algorithms: LL-SC
The SC (store-conditional) fails if another processor modified the variable; if it succeeds, the update was atomic.

lock:   ll r1, location
        bnz r1, lock
        r2 = f(r1)
        sc location, r2
        beqz lock
unlock: st location, #0

where f: #1 gives test&set, f: +1 gives fetch&increment, f: +ri gives fetch&add.
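Where ll/sc is not directly exposed (as in portable C), the same retry loop is conventionally written with compare-and-swap: a failing `atomic_compare_exchange_weak` plays the role of a failing sc, and the loop retries with the freshly observed value. A sketch under that assumption, with illustrative names:

```c
#include <stdatomic.h>

/* Generic ll/sc-style update: repeatedly read the location, compute
   f(old), and try to install it; retry if another processor changed
   the word in between (the CAS then fails and refreshes 'old'). */
int fetch_and_f(atomic_int *loc, int (*f)(int)) {
    int old = atomic_load(loc);
    while (!atomic_compare_exchange_weak(loc, &old, f(old)))
        ;                      /* like a failing sc: loop back to the ll */
    return old;                /* value seen before the update */
}

static int inc(int v)  { return v + 1; }       /* f: +1 -> fetch&increment */
static int set1(int v) { (void)v; return 1; }  /* f: #1 -> test&set */

/* Tiny single-threaded demo: 5 -> 6 (increment) -> 1 (test&set). */
int ll_sc_demo(void) {
    atomic_int x;
    atomic_init(&x, 5);
    fetch_and_f(&x, inc);
    fetch_and_f(&x, set1);
    return atomic_load(&x);
}
```

The `weak` variant may fail spuriously, which is harmless here because the loop retries anyway; it maps well onto ll/sc hardware, where a store-conditional can also fail for incidental reasons.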

Mx algorithms: LL-SC
- A failing SC does not generate invalidations
- Hardware: a lock flag and a lock address register
- Replacements (of the monitored line) might lead to livelock; avoid them:
  - Disallow memory references in the instruction sequence f
  - Do not allow memory instruction reordering across LL-SC
- The unlock still generates an invalidation to all contending processors, followed by their simultaneous attempts to reacquire the lock

Mx algorithms: ticket lock

lock:   f&i r1, counter
        while (r1 != now_serving);
unlock: now_serving++;

- Contention for now_serving after a release; use proportional backoff (r1 - now_serving)
- Fairness: yes
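The ticket lock above, sketched with a C11 `atomic_fetch_add` playing the role of f&i. Names and thread counts are illustrative; POSIX threads are assumed, and the proportional-backoff refinement is omitted for brevity.

```c
#include <pthread.h>
#include <stdatomic.h>

static atomic_int next_ticket;   /* the f&i counter */
static atomic_int now_serving;
static long ticket_counter;      /* protected by the lock */

static void ticket_lock(void) {
    int my = atomic_fetch_add(&next_ticket, 1);  /* f&i r1, counter */
    while (atomic_load(&now_serving) != my)      /* spin on now_serving */
        ;
}

static void ticket_unlock(void) {
    atomic_fetch_add(&now_serving, 1);           /* now_serving++ */
}

static void *ticket_worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        ticket_lock();
        ticket_counter++;        /* critical section */
        ticket_unlock();
    }
    return NULL;
}

/* Spawns n threads (n <= 8); returns the final count. */
long run_ticket_demo(int n) {
    pthread_t t[8];
    atomic_store(&next_ticket, 0);
    atomic_store(&now_serving, 0);
    ticket_counter = 0;
    for (int i = 0; i < n; i++) pthread_create(&t[i], NULL, ticket_worker, NULL);
    for (int i = 0; i < n; i++) pthread_join(t[i], NULL);
    return ticket_counter;
}
```

Tickets are granted in FIFO order, which is where the fairness comes from; the cost is that every waiter still spins on the single `now_serving` word, the contention the slide's proportional backoff mitigates.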

Mx algorithms: array-based lock

lock:   f&i r1, counter
        while (array[r1] == wait);
unlock: array[r1+1] = continue;

- Storage cost proportional to P; the array layout must avoid false sharing
- Fairness: yes
- Other issues: reuse of positions, initialization

Point-to-point event synchronization
Normal load/store instructions on ordinary variables:

signal: st flag, #1
wait:   ld r1, flag
        bz r1, wait

The data variable itself can be used as the flag if there is a value that the produced value will never have (MININT, 0, ...).
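A sketch of the array-based lock (in the style of Anderson's queue lock) with C11 atomics. Assumptions are labeled in the comments: 8 slots (it must be at least the number of contending threads), a 64-byte cache line for the anti-false-sharing padding, and illustrative names throughout; POSIX threads are assumed.

```c
#include <pthread.h>
#include <stdatomic.h>

#define NSLOTS 8   /* must be >= number of contending threads */

/* One flag per slot, padded to an assumed 64-byte line so that two
   waiters never spin on the same cache line (false sharing). */
static struct { atomic_int go; char pad[60]; } slot[NSLOTS];
static atomic_int tail;          /* the f&i counter handing out slots */
static long arr_counter;         /* protected by the lock */

static int arr_lock(void) {
    int my = atomic_fetch_add(&tail, 1) % NSLOTS;  /* f&i -> my slot */
    while (!atomic_load(&slot[my].go))             /* spin on private slot */
        ;
    atomic_store(&slot[my].go, 0);                 /* re-arm slot for reuse */
    return my;
}

static void arr_unlock(int my) {
    atomic_store(&slot[(my + 1) % NSLOTS].go, 1);  /* pass to next waiter */
}

static void *arr_worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        int my = arr_lock();
        arr_counter++;           /* critical section */
        arr_unlock(my);
    }
    return NULL;
}

/* Spawns n threads (n <= 8); returns the final count. */
long run_array_demo(int n) {
    pthread_t t[8];
    for (int i = 0; i < NSLOTS; i++) atomic_store(&slot[i].go, 0);
    atomic_store(&slot[0].go, 1);    /* initialization: first slot open */
    atomic_store(&tail, 0);
    arr_counter = 0;
    for (int i = 0; i < n; i++) pthread_create(&t[i], NULL, arr_worker, NULL);
    for (int i = 0; i < n; i++) pthread_join(t[i], NULL);
    return arr_counter;
}
```

Each release invalidates only the next waiter's line instead of every spinner's, which is the scalability advantage over t&s and the ticket lock; the modulo on the counter is one answer to the "reuse of positions" issue, and it is safe as long as NSLOTS bounds the number of simultaneous waiters.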

Barrier algorithms: centralized

Barrier(barr) {
  lock(barr.lock);
  if (barr.counter == 0)
    barr.flag = 0;                  // reset flag if first
  mycount = barr.counter++;
  unlock(barr.lock);
  if (mycount == P) {               // last to arrive?
    barr.counter = 0;               // reset for next barrier
    barr.flag = 1;                  // release waiting processors
  } else
    while (barr.flag == 0);         // busy wait for release
}

Potential deadlock with back-to-back barriers: the first arriver at the next barrier can reset flag while slow processors are still spinning on it.

Barrier algorithms: centralized with sense reversal

Barrier(barr) {
  local_sense = !local_sense;
  lock(barr.lock);                  // acquire
  mycount = barr.counter++;
  if (barr.counter == P) {
    unlock(barr.lock);
    barr.counter = 0;
    barr.flag = local_sense;        // release
  } else {
    unlock(barr.lock);
    while (barr.flag != local_sense);  // wait
  }
}

Remaining problem: contention on the counter.
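A runnable sketch of the sense-reversing barrier. One liberty is taken, labeled here: the lock/counter pair is replaced by a single `atomic_fetch_add`, a common variant that also removes the lock from the arrival path. P is fixed at 4 for the demo, names are illustrative, and POSIX threads are assumed.

```c
#include <pthread.h>
#include <stdatomic.h>

#define P 4                        /* number of participating threads */

static atomic_int bcount;          /* arrival counter */
static atomic_int bflag;           /* release flag, toggled each barrier */

static void barrier_wait(int *local_sense) {
    *local_sense = !*local_sense;                  /* sense reversal */
    if (atomic_fetch_add(&bcount, 1) == P - 1) {   /* last to arrive? */
        atomic_store(&bcount, 0);                  /* reset for next barrier */
        atomic_store(&bflag, *local_sense);        /* release everyone */
    } else {
        while (atomic_load(&bflag) != *local_sense)
            ;                                      /* busy wait for release */
    }
}

static atomic_int phase_sum;       /* P increments expected per phase */
static atomic_int all_ok;

static void *bworker(void *arg) {
    (void)arg;
    int sense = 0;                 /* each thread keeps its own sense */
    for (int r = 0; r < 3; r++) {
        atomic_fetch_add(&phase_sum, 1);
        barrier_wait(&sense);      /* everyone has now incremented */
        if (atomic_load(&phase_sum) != (r + 1) * P)
            atomic_store(&all_ok, 0);
        barrier_wait(&sense);      /* keep the check out of the next phase */
    }
    return NULL;
}

/* Runs 3 barrier rounds with P threads; returns 1 if every thread saw
   a complete phase after each barrier. */
int run_barrier_demo(void) {
    pthread_t t[P];
    atomic_store(&bcount, 0);
    atomic_store(&bflag, 0);
    atomic_store(&phase_sum, 0);
    atomic_store(&all_ok, 1);
    for (int i = 0; i < P; i++) pthread_create(&t[i], NULL, bworker, NULL);
    for (int i = 0; i < P; i++) pthread_join(t[i], NULL);
    return atomic_load(&all_ok);
}
```

Because each thread waits for the flag to reach its own toggled sense, a processor spinning in barrier k cannot be confused by the reset performed for barrier k+1, which is exactly the back-to-back deadlock the plain centralized barrier suffers from.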

Barrier algorithms: tree-based barriers
- Arrival (acquire) and release phases organized as trees
- Trade-off: overlap between potentially contending phases vs. increased latency
- Can be implemented with ordinary loads/stores

Further details
- MESI
- Snooping designs: atomic transactions within a non-atomic global process
- Split-transaction bus

MESI
Performance of non-sharing applications under MSI? Two transactions per modified line: BusRd, then BusRdX/Upgrade.
- Exclusive state: the only (read) copy in the system
- The transition from Exclusive to Modified does not require a bus transaction
- BusRd fires a transition to either Shared or Exclusive, depending on whether some other cache has the block

Snooping cache design
(Diagram: processor-side and bus-side controllers sharing the cache data RAM; duplicated tags & status; write-back buffer and data buffers; tag comparison for snoops; arbitration for the Cmd/Addr/Data bus lines.)

Snooping cache design
- Replicated tags & status: concurrent access by the processor and snoop controllers; the two copies should be (almost) identical
- Snoop bus lines: wired-OR control lines, for example:
  - Snoop hit: the address matches a tag in some processor
  - Modified state: some processor has the block in Modified state; memory should detect this and not respond to the request
  - Valid snoop result: all processors have performed their snoop check, so the other two lines are valid

Write backs
- The processor controller transfers the dirty line to the write-back buffer, requests the BusRd(X) transaction, then requests the WB transaction
- Snoop controller: if a transaction affects the buffered line, respond appropriately and provide the data
- Processor controller: then cancel the WB request

Snooping cache design: coordination
- Update of snoop and processor tags
  - I to S,E: no conflict, the processor is waiting
  - Invalidations:
    - Simultaneous: block the processor, or delay the transaction until the update is achieved
    - Delayed: queue the request to the processor controller; what does this mean for coherence and consistency?
- Access to the cache data
  - On an M to S transition the snoop controller needs access to the data; delay the transaction until access is achieved
- Non-atomic state transitions
  - While a request is posted, a transaction may be observed that affects the block and requires a change of state and of request type. Examples:
    - WB posted and BusRdX observed: provide the data, cancel the WB request
    - Upgrade posted and another Upgrade observed: invalidate, post a BusRdX
  - Temporary states may be encoded in communication variables between the two controllers

Snooping cache design: coherence and consistency
- When can a write be performed in the local cache? As soon as the bus is granted for the transaction; the actual block invalidation in other processors may take place later
- Commitment vs. completion

Resources
- Buffers: write-back buffers, notifications from the snoop to the processor controller
- They require associative comparisons
- If exhausted, they may cause deadlock

Snooping cache design: deadlock
- No system activity
- It is necessary to continue servicing incoming transactions while posting requests
- Resource limitations may cause deadlocks; avoidance: NACKs

Livelock
- System activity but no real progress
- Example remedy: complete the write immediately after entering the Modified state, so the block cannot be taken away before the write is performed

Snooping cache design: starvation
- Some processors do not progress
- Fair scheduling policies: FIFO arbiter; a threshold on the count of denied accesses
- May also appear at higher levels (synchronization algorithms, ...)

Bus synchronization
- Synchronous: master clock, predefined timing between requests and responses; conservative timings, designed for the worst case
- Asynchronous: handshake protocol

Snooping @ multilevel caches
- The state at L1 determines the action of the processor on hits
- The state at L2 determines the reaction of the snoop controller to bus transactions; the snoop controller must respond for the whole node
- Need to coordinate state changes between L1 and L2, and to maintain inclusion

Inclusion property
- If a block is in L1, it is also in L2
- If a block is in owned state in L1, it is also in L2

Maintaining inclusion is not immediate:
- L1 and L2 see different access patterns and may take different replacement decisions
- An L1 that is set-associative with LRU can violate inclusion
- Multiple independent L1 caches; different cache block sizes

Snooping @ multilevel caches
Maintaining inclusion is guaranteed if:
- L1 is direct-mapped
- incoming blocks are put in both L1 and L2
- L1 and L2 have the same block size
- #lines in L1 <= #sets in L2

Enforcing inclusion by using the coherence mechanisms:
- A replacement in L2 sends an invalidation/flush to L1
- Snoop actions at L2: need to invalidate L1 in reaction to bus transactions
  - Either send an intervention to L1 for every transaction relevant to L2, or keep an inclusion bit per line
- Processor writes: write-through L1
  - L2 state: modified but stale; request the block from L1 if needed

Snooping @ split-transaction bus
- Transactions are split into two independent phases: request and response
- Several ongoing requests concurrently

Issues
- Split snoop: is the snoop result attached to the request or to the response phase?
- Flow control: limited buffers for incoming requests and responses
- Out-of-order responses?
- Conflicting requests (at least one of them a write): it is not possible to simply apply the later requests; coherence states, transitions and responses must be handled properly
- Trade-off: performance vs. resources (buffers) vs. complexity vs. cost

Example: split-transaction snooping; design options
- Maximum outstanding requests: 8
- Conflicting requests: not allowed
- Flow control: NACK
- Out-of-order responses: yes
- Split snoop: the total order is established by the request phase; snoop results are presented:
  - Modified: in the request phase
  - Shared: in the response phase
- Read merge; 128-byte blocks

Split transaction phases: request and response, each with its own arbitration, followed by the actual data response.

Example: split-transaction snooping; buses
- Request bus: command & address; request identification via a 3-bit tag; 5 cycles: arbitration, resolution, address, decode, acknowledge
- Response bus: data (256 bits) & tag; 4 cycles

(Timing diagram: address-bus cycles Arb, Rslv, Addr, Dcd, Ack overlapped with data arbitration and data cycles D0..D3. During Addr/Dcd the tag is generated and the snoop tag looked up; a controller delays the transaction if it is not ready to check or to receive. During Ack the snoop lines are driven: the sharing state decides E vs. S, and memory is told whether it has to respond; the controller notes whether a response is needed, NACKs if required, takes its local action, and withdraws its own request if it conflicts with a previous one.)

Snooping @ split-transaction bus
(Diagram: controller organization; an 8-entry request table indexed by a tag, each entry holding the address, the originator, "my response" and miscellaneous information; request and response queues; issue + merge check on the request buffer; a write-back buffer and data buffers; snoop state and tag comparison fed from the cache; connections to the addr+cmd bus and the data+tag bus.)

Request issue and merge
- Request issue: wait if the request conflicts with one in the request table
- Request merge: a read request for the same block as a BusRd entry already in the request table sets a bit in the request table meaning "I want the data"
  - Another bit records "I generated the request"; if it is not set, the controller activates the snoop sharing line during the data response phase so that the block is loaded Shared instead of Exclusive
- Write commitment: in the last cycle of the request phase

Snooping @ split-transaction bus
Why not present the snoop result only in the response phase? It would allow a faster request phase.
- Memory starts servicing every request, and aborts the service if it observes a response from a cache
- When responding, memory observes the snoop lines (waiting on inhibit); if the dirty line is activated, it cancels its response

Sequential consistency
- The global order is established by the request phase; write commitment substitutes for completion
- Completion may take place much later than commitment
- At each processor: serve the incoming queue before completing a read miss or a write that generates a bus transaction

Snooping @ Split transaction bus In order responses? Potential performance loss Access patterns to interleaved memory modules 2 requests to module A followed by a request to module B Potential to support conflicting requests Snooping @ Split transaction bus Write Back? Shared lines Two logical busses may share physical lines Lines not used in one cycle by one bus may be used by the other Pipeline stages skewed in order not to conflict Must stay synchronized. Inhibits affect both busses Complexity vs. cost 51