Approaches to Building Parallel Machines. Shared Memory Architectures. Example Cache Coherence Problem. Shared Cache Architectures

Approaches to Building Parallel Machines (Arvind Krishnamurthy, Fall 2004)
- As the scale grows, the organization moves from a shared cache (a switch connecting the processors to an interleaved first-level cache, backed by interleaved main memory), to bus-based shared memory with per-processor caches, to centralized-memory "dance hall" (UMA) designs in which processors reach the memories through an interconnection network.

Shared Cache Architectures
- Alliant FX-8 (early 80s): eight 68020s with a crossbar to a 512 KB interleaved cache.
- Encore & Sequent: first 32-bit micros (NS32032), two to a board with a shared cache.
- What are the advantages and disadvantages?

Example Cache Coherence Problem
- Three processors share a location u that initially holds 5 in memory. Events: (1) P1 reads u and caches 5; (2) P3 reads u and caches 5; (3) P3 writes u = 7; (4) P1 reads u again: what value does it see? (5) P2 reads u: what value does it see? (A small C model of this scenario follows these notes.)
- Processors see different values for u after event 3.
- With write-back caches, the value written back to memory depends on the happenstance of which cache flushes or writes back its value, and when; processes accessing main memory may see a very stale value.
- Unacceptable to programs, and frequent!

Snoopy Cache-Coherence Protocols
- The bus is a broadcast medium, and caches know what they contain.
- The cache controller "snoops" all transactions on the shared bus; a transaction is relevant if it involves a cache block currently held in this cache.
- On a relevant transaction the controller takes action to ensure coherence: invalidate, update, or supply the value, depending on the state of the block and the protocol.
- Per-block state kept alongside the tag and data: valid/invalid, dirty (inconsistent with memory), shared (present in other caches). The controller updates this state in response to both processor events (loads/stores) and snoop events.

Design Choices
- A snoopy protocol is defined by a set of states, a state-transition diagram, and the actions taken on each transition.
- Basic choices: write-through vs. write-back; invalidate vs. update.
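To make the coherence problem concrete, here is a small C sketch (not from the original slides; all names are illustrative) that models each processor's cache as a private variable and takes no coherence actions, so P1 is left reading a stale copy:

    /* Toy model of the u example: private "cached" copies, no coherence. */
    #include <stdio.h>

    int main(void) {
        int memory_u = 5;         /* value of u in main memory              */
        int cache_p1, cache_p3;   /* per-processor cached copies            */

        cache_p1 = memory_u;      /* event 1: P1 reads u, caches 5          */
        cache_p3 = memory_u;      /* event 2: P3 reads u, caches 5          */
        cache_p3 = 7;             /* event 3: P3 writes u = 7; with a
                                     write-back cache, memory still holds 5 */

        printf("P3 sees u = %d\n", cache_p3);        /* 7                   */
        printf("P1 sees u = %d\n", cache_p1);        /* stale 5             */
        printf("memory holds u = %d\n", memory_u);   /* 5 until write-back  */
        return 0;
    }

On a real shared-memory machine the snoopy protocols described next exist precisely to prevent this outcome: P3's write would invalidate or update P1's copy.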

Write-through Invalidate Protocol
- Two states per block in each cache, Invalid and Valid, as in a uniprocessor.
- Cache check: compute the position(s) to check from the address, then check whether the entry is valid and whether the tag matches the required address.
- If present: a read simply uses the value; a write updates the value and sends the write onto the bus.
- Writes invalidate all other caches: a block can have multiple simultaneous readers, but a write invalidates their copies.
- Transitions: Invalid -> Valid on PrRd/BusRd; in Valid, PrRd/-- and PrWr/BusWr; an observed BusWr invalidates the block (BusWr/--).

Write-through vs. Write-back
- The write-through protocol is simple: every write is observable, because every write goes on the bus, so only one write can take place at a time in any processor.
- It uses a lot of bandwidth. Example: 3 GHz processor, CPI = 1, 10% stores of 8 bytes => 300M stores per second per processor => 2400 MB/s per processor. A 4 GB/s bus can support only about 1-2 processors without saturating.

Invalidate vs. Update
- When does one prefer an invalidation-based scheme, and when an update-based scheme?
- We need to look at program reference patterns and hardware complexity, but first: correctness.

Write-back Caches (Uniprocessor)
- 3 processor operations: read, write, replace.
- 3 states: Invalid, Valid (clean), Modified (dirty).
- 2 bus transactions: read (BusRd), write-back (BusWB).
- Transitions: Invalid -> Valid on PrRd/BusRd; Invalid -> Modified on PrWr/BusRd; Valid -> Modified on PrWr/--; Valid -> Invalid on Replace/--; Modified -> Invalid on Replace/BusWB.

Write-back MSI (Multiprocessor)
- Treat Valid as Shared (S) and Modified (M) as exclusive.
- Introduce a new bus operation, read-exclusive (BusRdX): a read performed in order to modify the block later ("read to own"). BusRdX causes all other caches to invalidate; it is issued even on a write hit in S. A plain read obtains the block in Shared state.
- Transitions: PrRd/BusRd (I -> S); PrWr/BusRdX (I -> M and S -> M); PrRd/-- and PrWr/-- in M; BusRd/Flush (M -> S); BusRdX/Flush (M -> I); BusRdX/-- (S -> I).

Lower-Level Protocol Choices
- How does memory know whether or not to supply the data on a BusRd?
- When a BusRd is observed in state M, which transition should be made (to S or to I)? It depends on the expected access patterns.
- BusRdX could be replaced by BusUpgr, with no data transfer, when the block is already cached in S.
- A read followed by a write costs 2 bus transactions even if there is no sharing: BusRd (I -> S) followed by BusRdX or BusUpgr. What happens on sequential programs? Performance degrades.
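The MSI transitions above fit naturally into a small state machine. The sketch below (not from the slides; the enum and function names are illustrative) encodes them as a (state, event) -> (next state, bus action) function for a single cache block, and picks the M -> S transition on a snooped BusRd, which is one of the lower-level choices just discussed:

    /* Hedged sketch of the write-back MSI protocol for one cache block. */
    #include <stdio.h>

    typedef enum { INVALID, SHARED, MODIFIED } State;
    typedef enum { PR_RD, PR_WR, BUS_RD, BUS_RDX } Event;
    typedef enum { NONE, ISSUE_BUSRD, ISSUE_BUSRDX, FLUSH } Action;

    static State msi(State s, Event e, Action *a) {
        *a = NONE;
        switch (s) {
        case INVALID:
            if (e == PR_RD)   { *a = ISSUE_BUSRD;  return SHARED;   }
            if (e == PR_WR)   { *a = ISSUE_BUSRDX; return MODIFIED; }
            return INVALID;                   /* snooped traffic: nothing to do    */
        case SHARED:
            if (e == PR_WR)   { *a = ISSUE_BUSRDX; return MODIFIED; } /* even on a write hit */
            if (e == BUS_RDX) { return INVALID; }    /* another cache takes ownership */
            return SHARED;                    /* PR_RD and BUS_RD leave it shared  */
        case MODIFIED:
            if (e == BUS_RD)  { *a = FLUSH; return SHARED;  }  /* supply data, drop to S */
            if (e == BUS_RDX) { *a = FLUSH; return INVALID; }
            return MODIFIED;                  /* local reads and writes hit        */
        }
        return s;
    }

    int main(void) {
        Action a;
        State s = INVALID;
        s = msi(s, PR_RD,  &a);   /* I -> S, issues BusRd   */
        s = msi(s, PR_WR,  &a);   /* S -> M, issues BusRdX  */
        s = msi(s, BUS_RD, &a);   /* M -> S, flushes data   */
        printf("final state = %d, last action = %d\n", s, a);
        return 0;
    }

A real controller would also handle replacements, tag matching, and arbitration for the shared bus; the point here is only the per-block state machine.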

Update Protocols
- If data is to be communicated between processors, invalidate protocols can seem inefficient.
- Consider a shared flag: P0 waits for it to be zero, then does its work and sets it to one; P1 waits for it to be one, then does its work and sets it to zero.
- How many transactions per hand-off under an invalidation protocol? P0: read (shared); P1: read-exclusive; P1: write 0; P0: read; P1: read (shared); P0: read-exclusive; P0: write 1; P1: read (shared); and so on. The block keeps bouncing between the two caches even though only one word is being communicated.

Shared Memory Systems
- Two variants: shared-cache systems, and separate caches with bus-based access to shared memory.
- Further variants: write-through vs. write-back systems; invalidation-based vs. update-based systems.

Write-Back Update Protocol
- Consider a system where write-backs happen when a cache line is replaced, and every write results in updates to the other caches holding the value.
- Designing the simplest write-back update protocol: how many states should it have, and what is the significance of each state? Four states.

Dragon Write-back Update Protocol
- Exclusive-clean (E): my processor and memory have the block.
- Shared-clean (Sc): my processor and other processors may have it.
- Shared-modified (Sm): my processor and other processors may have it, and memory does not have the updated value (it is my processor's responsibility to update memory).
- Modified (M): my processor has it and no one else does.
- A cache block can be: in M in one cache with no other copies; in E in one cache with no other copies; in Sc in one or more caches; or in Sm in one cache and Sc in zero or more others.
- There is no Invalid state: if a block is in the cache it cannot be invalid (but tag mismatches still have to be dealt with).
- New bus transaction BusUpd: broadcasts the single word being written and updates the other relevant caches. Bandwidth savings.

Dragon State Transition Diagram
- Questions: how can we recognize which state should currently be associated with a cache line? How do we know a line should be stored in the Exclusive, Modified, Shared-clean, or Shared-modified state?
- Transitions (S denotes the bus "shared" signal): PrRdMiss/BusRd(!S) -> E; PrRdMiss/BusRd(S) -> Sc; PrWr/-- in E (going to M) and in M; PrWr/BusUpd(S) -> Sm; PrWr/BusUpd(!S) -> M; PrWrMiss/(BusRd(S); BusUpd) -> Sm; PrWrMiss/BusRd(!S) -> M; BusUpd/Update in Sc and Sm; a snooped BusRd moves E to Sc, and in Sm and M the owning cache supplies the data.
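As a concrete illustration of the diagram, the sketch below (not from the slides; dragon_write and shared_line are illustrative names) shows how a Dragon-style controller might decide the next state on a processor write, with shared_line modeling the bus's S signal:

    /* Hedged sketch of Dragon write handling for one cache block. */
    #include <stdbool.h>
    #include <stdio.h>

    typedef enum { EXCL_CLEAN, SHARED_CLEAN, SHARED_MOD, MODIFIED } DragonState;

    static DragonState dragon_write(DragonState s, bool hit, bool shared_line) {
        if (!hit) {
            /* PrWrMiss: BusRd first; if other caches hold the block (S asserted),
               also issue BusUpd and end in Sm, otherwise we own it and end in M. */
            return shared_line ? SHARED_MOD : MODIFIED;
        }
        switch (s) {
        case EXCL_CLEAN:              /* no other copies exist                    */
        case MODIFIED:
            return MODIFIED;          /* PrWr/--: modify locally, no bus traffic  */
        case SHARED_CLEAN:
        case SHARED_MOD:
            /* PrWr/BusUpd: broadcast the written word; if the shared line is no
               longer asserted, every other copy is gone and the block is ours. */
            return shared_line ? SHARED_MOD : MODIFIED;
        }
        return s;
    }

    int main(void) {
        DragonState s = SHARED_CLEAN;
        s = dragon_write(s, true, true);   /* Sc with other sharers -> Sm via BusUpd */
        printf("state after write = %d\n", s);
        return 0;
    }

Read handling and snooped transactions (BusRd, BusUpd) would be additional functions in the same style; this fragment only covers the write-side decisions discussed above.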

Lower-level Protocol Choices (Update Protocols)
- Can the shared-modified state be eliminated? Yes, if memory is updated on BusUpd transactions, as in the DEC Firefly; the Dragon protocol does not do this (it assumes DRAM memory is slow to update).
- Should the replacement of an Sc block be broadcast? Doing so would allow the last remaining copy to go to the E state (or M state) and stop generating updates; it means more overhead on replacements but less overhead for updates.

Limits of Bus-based Shared Memory
- Assume a 3 GHz processor without a cache: about 12 GB/s of instruction bandwidth per processor (32-bit instructions) and about 3.6 GB/s of data bandwidth at 30% loads/stores.
- Suppose a 98% instruction hit rate and a 95% data hit rate: about 240 MB/s of instruction bandwidth and 180 MB/s of data bandwidth per processor, 420 MB/s combined.
- Assuming 4 GB/s of bus bandwidth, roughly 10 processors will saturate the bus.
- What is the solution to this scalability problem?

Sharing: a Performance Problem
- True sharing: frequent writes to a shared variable can create a bottleneck; sharing is fine for read-only or infrequently written data. Technique: make per-processor copies of the value, if the algorithm allows it. Example problem: the data structure that stores the freelist/heap for malloc/free.
- False sharing: the cache block itself can introduce artifacts when two distinct variables lie in the same block. Technique: allocate the data used by each processor contiguously, or at least avoid interleaving it. Example problem: an array of ints in which each processor frequently writes its own element (see the sketch at the end of this section).

Intuitive Memory Model
- Reading an address should return the last value written to that address.
- Uniprocessor: different levels of the hierarchy (L1, L2, memory, disk) can hold different values for the same address (the slide's figure shows address 100 holding the values 67, 34, and 35 at different levels); they are made consistent at replacement time.
- Multiprocessor: Approach 1: use a shared top-level cache. Approach 2: use separate caches but keep them consistent through snooping.
- Two questions remain: ordering of operations on a single variable, and ordering of operations on multiple variables.

Ordering of Memory Operations
- Each processor issues a stream of reads and writes (P0: R R R W R R ..., P1: ..., P2: ...).
- The bus establishes a total ordering on writes, assuming atomic bus transactions (later we will look at split-phase transactions).
- Writes establish only a partial order on operations from different processors; the ordering of reads is not constrained, though the bus will order read misses too.

Example
- /* Assume initial values of A and B are 0 */
  P1: (1a) A = 1;   (1b) B = 2;
  P2: (2a) print B; (2b) print A;
- Question: which of these outputs are valid? [0 0], [0 1], [2 0], [2 1]
- Question: what might cause these kinds of outputs on real machines? What is the intuition? This is the memory consistency model.
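Here is the false-sharing sketch referred to above (not from the slides; the 64-byte line size is an assumption and all names are illustrative). Each thread increments only its own counter, yet in the unpadded version the counters share cache lines, so the writes keep invalidating each other:

    /* Hedged false-sharing demonstration with POSIX threads. */
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define ITERS    10000000L

    struct padded { long value; char pad[64 - sizeof(long)]; };  /* one line each */

    static long shared_counters[NTHREADS];           /* falsely shared            */
    static struct padded padded_counters[NTHREADS];  /* padded to avoid it        */

    struct arg { int id; int use_padding; };

    static void *worker(void *p) {
        struct arg *a = p;
        for (long i = 0; i < ITERS; i++) {
            if (a->use_padding)
                padded_counters[a->id].value++;
            else
                shared_counters[a->id]++;   /* each write invalidates neighbours' copies */
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        struct arg args[NTHREADS];
        for (int i = 0; i < NTHREADS; i++) {
            args[i].id = i;
            args[i].use_padding = 0;        /* flip to 1 and compare the running time */
            pthread_create(&t[i], NULL, worker, &args[i]);
        }
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        printf("counter[0] = %ld\n", shared_counters[0]);
        return 0;
    }

Timing the two variants on a multicore machine typically shows the padded version running several times faster, which is exactly the "avoid interleaving" advice above.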

Related Example
- /* Assume initial values of A and flag are 0 */
  P1: A = 1;  flag = 1;
  P2: while (flag == 0); /* spin idly */  print A;
- The intuitive outcome (P2 prints 1) is not guaranteed by coherence alone: coherence only preserves ordering for accesses to a single variable.
- Coherence is not enough! We also expect memory to respect the order between accesses to different locations issued by a given process. (A C11 version of this example appears after the summary below.)

Sequential Consistency
- Model: processors P1 ... Pn issue memory references in program order; a switch in front of a single memory is randomly set after each memory reference.
- A total order is thus achieved by interleaving the accesses: program order is maintained, and memory operations from all processes appear to issue, execute, and complete atomically, as if there were no caches and only a single memory.
- "A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program." [Lamport, 1979]

Implementing Sequential Consistency
- Sufficient conditions: every process issues memory operations in program order; after a read is issued, the issuing process waits for the read to complete before issuing its next operation; after a write is issued, the issuing process waits for the write to complete (updating or invalidating other caches if necessary) before issuing its next memory operation.
- How can architectural enhancements violate SC? Out-of-order execution, write buffers, non-blocking operations.
- How can compilers violate SC? By reordering operations: compilers are designed to generate correct uniprocessor code, not correct multiprocessor code.

Summary
- Basic shared memory designs: a shared cache, or separate caches, which give rise to coherence problems.
- Snoopy caches: write-back vs. write-through, update vs. invalidate; cache coherence is implemented as a per-block state machine; the approach scales to about 10s of processors.
- Coherence is not enough: we also have to specify a memory consistency model, an abstract model for accesses to multiple variables.
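As promised above, here is the flag example rewritten with C11 atomics (a hedged sketch; the thread structure and names are illustrative, while atomic_store and atomic_load are the standard sequentially consistent operations). Left as plain variables, exactly the mechanisms listed under "Implementing Sequential Consistency" (compiler reordering, write buffers, out-of-order execution) can make P2 print 0; making flag a seq_cst atomic restores the intended behavior:

    /* Flag synchronization with C11 seq_cst atomics. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static int A = 0;
    static atomic_int flag = 0;

    static void *p1(void *arg) {
        A = 1;                          /* ordinary store                         */
        atomic_store(&flag, 1);         /* seq_cst store: ordered after A = 1     */
        return NULL;
    }

    static void *p2(void *arg) {
        while (atomic_load(&flag) == 0) /* seq_cst load: spin until flag is set   */
            ;                           /* spin idly                              */
        printf("A = %d\n", A);          /* guaranteed to print 1                  */
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t2, NULL, p2, NULL);
        pthread_create(&t1, NULL, p1, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }

Weaker orderings (release/acquire) would also suffice for this particular pattern; seq_cst is used here because it matches the sequential consistency model the lecture defines.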