Shared Memory Architectures. Approaches to Building Parallel Machines

Shared Memory Architectures
Arvind Krishnamurthy
Fall 2004

Approaches to Building Parallel Machines
[Figure: a scale of organizations: processors through a switch/bus to an interleaved first-level cache and interleaved main memory (shared cache); processors with private caches on a bus to centralized memory; and processors with caches connected through an interconnection network to memory modules (centralized memory, dance hall, UMA).]
Shared cache examples: Alliant FX-8 (early 80s), eight 68020s with a crossbar to a 512 KB interleaved cache; Encore & Sequent, first 32-bit micros (NS32032), two to a board with a shared cache.

Shared Cache Architectures
What are the advantages and disadvantages?

Example Cache Coherence Problem
[Figure: processors P1, P2, P3 with private caches on a bus to memory and I/O devices. P1 and P3 read u (value 5, events 1 and 2), P3 writes u = 7 (event 3), then P1 and P2 read u (events 4 and 5).]
Processors see different values for u after event 3. With write-back caches, the value written back to memory depends on the happenstance of which cache flushes or writes back its value and when, so processes accessing main memory may see a very stale value. Unacceptable to programs, and frequent!

Snoopy Cache-Coherence Protocols
[Figure: processors P1..Pn, each with a cache holding state, address tag, and data, on a shared bus with memory and I/O devices; each cache controller snoops the bus while its processor issues loads and stores.]
The bus is a broadcast medium, and caches know what they have. The cache controller snoops all transactions on the shared bus. A transaction is relevant if it involves a cache block currently contained in this cache; the controller then takes action to ensure coherence (invalidate, update, or supply the value), depending on the state of the block and the protocol.

Design Choices
Update the state of blocks in response to processor and snoop events: valid/invalid, dirty (inconsistent with memory), shared (present in other caches). A snoopy protocol consists of a set of states, a state-transition diagram, and actions.
Basic choices: write-through vs. write-back, invalidate vs. update.
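The following is a minimal sketch, not taken from the lecture, of the dispatch a snooping cache controller performs for one cache line; the type and function names are assumptions made for illustration.

    /* Illustrative sketch of a snooping cache controller's bus-side dispatch. */
    #include <stdbool.h>
    #include <stdint.h>

    typedef enum { INVALID, VALID, DIRTY } line_state_t;   /* per-block state */

    typedef struct {
        uint64_t     tag;       /* address tag                    */
        line_state_t state;     /* coherence state of this block  */
        uint8_t      data[64];  /* one cache block                */
    } cache_line_t;

    /* Called for every transaction observed on the shared bus.
     * Returns true if this cache must supply the data (i.e., it holds it dirty). */
    bool snoop(cache_line_t *line, uint64_t addr_tag, bool remote_write)
    {
        if (line->state == INVALID || line->tag != addr_tag)
            return false;                   /* not a relevant transaction      */

        if (remote_write) {
            bool supply = (line->state == DIRTY);
            line->state = INVALID;          /* invalidation-based choice       */
            return supply;
        }
        return line->state == DIRTY;        /* remote read: supply dirty data  */
    }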

Write-through Invalidate Protocol
Two states per block in each cache: Invalid (I) and Valid (V), as in a uniprocessor.
Cache check: compute the position(s) to check based on the address; check whether the entry is valid; if valid, check whether the tag matches the required address. If present: on a read, just use the value; on a write, update the value and send the update to the bus.
Writes invalidate all other caches: there can be multiple simultaneous readers of a block, but a write invalidates them.
[State diagram: I goes to V on PrRd/BusRd; V stays on PrRd/-- and PrWr/BusWr; I stays on PrWr/BusWr; V goes to I on observing BusWr.]

Write-through vs. Write-back
The write-through protocol is simple, and every write is observable. Every write goes on the bus, so only one write can take place at a time across all processors, and this uses a lot of bandwidth!
Example: a 3 GHz processor with CPI = 1 and 10% stores of 8 bytes issues 300 M stores per second per processor, i.e. 2400 MB/s per processor. A 4 GB/s bus can support only about 1-2 processors without saturating.
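As a quick sanity check of those numbers, here is a small C program (a sketch; the figures are the lecture's assumptions, not measurements) that reproduces the arithmetic.

    /* Back-of-the-envelope bandwidth of a write-through protocol. */
    #include <stdio.h>

    int main(void)
    {
        double insts_per_s = 3e9;     /* 3 GHz, CPI = 1                 */
        double store_frac  = 0.10;    /* 10% of instructions are stores */
        double store_bytes = 8.0;     /* 8 bytes written per store      */
        double bus_bw      = 4e9;     /* 4 GB/s shared bus              */

        double stores_per_s = insts_per_s * store_frac;    /* 300 M/s   */
        double bw_per_proc  = stores_per_s * store_bytes;  /* 2400 MB/s */

        printf("store traffic per processor: %.0f MB/s\n", bw_per_proc / 1e6);
        printf("processors before the bus saturates: %.2f\n",
               bus_bw / bw_per_proc);
        return 0;
    }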

Invalidate vs. Update
When does one prefer an invalidation-based scheme, and when an update-based scheme? We need to look at program reference patterns and hardware complexity, but first: correctness.

Write-back Caches (Uniprocessor)
Three processor operations: read, write, replace. Three states: Invalid (I), Valid (clean, V), Modified (dirty, M). Two bus transactions: read (BusRd) and write-back (BusWB).
[State diagram: a read miss fetches the block with BusRd (I to V); a write moves the block to M (fetching with BusRd on a write miss); reads and writes that hit in V or M need no bus transaction; replacing an M block issues BusWB, replacing a V block is silent.]

Write-back MSI (Multiprocessors)
Treat valid as shared (S) and modified as exclusive (M). Introduce a new bus operation, read-exclusive (BusRdX): a read for later modification ("read to own"). BusRdX causes other caches to invalidate, and it is issued even on a write hit in S. A plain read obtains the block in the shared state.
[State diagram: I goes to S on PrRd/BusRd; I and S go to M on PrWr/BusRdX; M goes to S on observing BusRd (flushing the block); M and S go to I on observing BusRdX (M flushes); reads and writes that hit in M, and reads that hit in S, need no bus transaction.]

Lower-Level Protocol Choices
How does memory know whether or not to supply the data on a BusRd? When BusRd is observed in the M state, which transition should be made, M to I or M to S? It depends on the expected access patterns. BusRdX could be replaced by BusUpgr, which transfers no data. A read followed by a write costs two bus transactions even when there is no sharing: BusRd (I to S) followed by BusRdX or BusUpgr. What happens on sequential programs? Performance degrades.
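Below is a compact C sketch of the MSI transitions just described, written only to make the state machine concrete; the enum and function names are my own, and a real controller would also arbitrate for the bus and handle races.

    /* Illustrative MSI state machine: processor side and snoop side. */
    typedef enum { MSI_I, MSI_S, MSI_M } msi_state_t;
    typedef enum { PR_RD, PR_WR } proc_op_t;
    typedef enum { BUS_NONE, BUS_RD, BUS_RDX } bus_op_t;

    /* Processor-side event: returns the bus transaction to issue, if any. */
    bus_op_t msi_proc(msi_state_t *st, proc_op_t op)
    {
        switch (*st) {
        case MSI_I: *st = (op == PR_RD) ? MSI_S : MSI_M;
                    return (op == PR_RD) ? BUS_RD : BUS_RDX;
        case MSI_S: if (op == PR_WR) { *st = MSI_M; return BUS_RDX; } /* write hit in S */
                    return BUS_NONE;
        case MSI_M: return BUS_NONE;             /* read/write hits, no bus   */
        }
        return BUS_NONE;
    }

    /* Snoop-side event: returns 1 if this cache must flush the block. */
    int msi_snoop(msi_state_t *st, bus_op_t op)
    {
        int flush = (*st == MSI_M && op != BUS_NONE);  /* only M has fresh data */
        if (op == BUS_RD  && *st == MSI_M) *st = MSI_S;
        if (op == BUS_RDX && *st != MSI_I) *st = MSI_I;
        return flush;
    }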

Update Protocols
If data is to be communicated between processors, invalidate protocols seem inefficient. Consider a shared flag: P0 waits for it to be zero, then does work and sets it to one; P1 waits for it to be one, then does work and sets it to zero. How many transactions?
P0: Read (shared); P1: Read Exclusive; P1: Write 0; P0: Read; P1: Read (shared); P0: Read Exclusive; P0: Write 1; P1: Read.

Shared Memory Systems
Two variants: shared-cache systems, and separate caches with bus-based access to shared memory. Further variants: write-through vs. write-back systems, and invalidation-based vs. update-based systems.

Write-Back Update Protocol
Let's have a system where write-backs happen when a cache line is replaced, and all writes result in updates of the other caches caching the value. Let's design the simplest write-back update protocol: how many states should it have, and what is the significance of each state? Four states.

Dragon Write-back Update Protocol
Exclusive-clean (E): my processor and memory have it.
Shared-clean (Sc): my processor and other processors may have it.
Shared-modified (Sm): my processor and other processors may have it, but memory does not have the updated value (it is my processor's responsibility to update memory).
Modified (M): my processor has it, and no one else does.
A cache block can be: in M in one cache, with no one else holding the block; in E in one cache, with no one else holding the block; in Sc in one or more caches; or in Sm in one cache and Sc in zero or more caches. There is no invalid state: if a block is in the cache, it cannot be invalid (though tag mismatches must still be handled).
New bus transaction: BusUpd broadcasts the single word written on the bus and updates the other relevant caches, saving bandwidth.

Questions: How can we recognize which state should currently be associated with a cache line? How do we know that a cache line should be stored in the exclusive, modified, shared-clean, or shared-modified state?

Dragon State Transition Diagram
[State diagram over E, Sc, Sm, M. A read miss enters E if no other cache holds the block (PrRdMiss/BusRd(!S)) or Sc if one does (PrRdMiss/BusRd(S)); a write miss enters M (PrWrMiss/BusRd(!S)) or Sm (PrWrMiss/(BusRd(S); BusUpd)). A write hit in E moves silently to M; a write hit in Sc or Sm issues BusUpd and moves to Sm if the line is still shared (S) or to M if not (!S). Observing BusRd moves E to Sc and makes M and Sm flush (M moving to Sm); observing BusUpd updates the local copy and moves Sm to Sc. Read hits need no bus transaction.]
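To make the write-hit arcs of that diagram concrete, here is a minimal C sketch (my own names; only the processor-write transitions are shown, with the bus's shared line passed in as a flag).

    /* Illustrative Dragon write-hit transitions. */
    typedef enum { DR_E, DR_SC, DR_SM, DR_M } dragon_state_t;

    /* Processor write hit: returns 1 if a BusUpd must be broadcast. */
    int dragon_write_hit(dragon_state_t *st, int line_shared)
    {
        switch (*st) {
        case DR_E:  *st = DR_M;                         /* silent upgrade      */
                    return 0;
        case DR_SC:
        case DR_SM: *st = line_shared ? DR_SM : DR_M;   /* update other copies */
                    return 1;
        case DR_M:  return 0;                           /* already exclusive   */
        }
        return 0;
    }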

Lower-level Protocol Choices
Can the shared-modified state be eliminated? It can if memory is updated on BusUpd transactions (DEC Firefly); the Dragon protocol does not do this (it assumes DRAM memory is slow to update).
Should the replacement of an Sc block be broadcast? It would allow the last copy to go to the E state (or M state) and stop generating updates: more overhead on replacements, less overhead for updates.

Limits of Bus-based Shared Memory
[Figure: processors with private caches on a shared bus to memory and I/O.]
Assume a 3 GHz processor without a cache: 12 GB/s instruction bandwidth per processor (32-bit instructions) and 3.6 GB/s data bandwidth at 30% load-store. Suppose a 98% instruction hit rate and a 95% data hit rate: then 240 MB/s instruction bandwidth and 180 MB/s data bandwidth per processor, 420 MB/s combined. Assuming 4 GB/s bus bandwidth, about 10 processors will saturate the bus. What is the solution to this scalability problem?
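The same arithmetic, written out as a short C program so the assumptions (the lecture's hit rates and bus bandwidth) are explicit:

    /* Scalability limit of a shared bus under the slide's assumptions. */
    #include <stdio.h>

    int main(void)
    {
        double inst_bw   = 12e9;          /* 3 GHz * 4-byte instructions      */
        double data_bw   = 3.6e9;         /* 30% load-store * 4 bytes, 3 GHz  */
        double inst_miss = 1.0 - 0.98;    /* 98% instruction hit rate         */
        double data_miss = 1.0 - 0.95;    /* 95% data hit rate                */
        double bus_bw    = 4e9;           /* 4 GB/s shared bus                */

        double per_proc = inst_bw * inst_miss + data_bw * data_miss;
        printf("bus traffic per processor: %.0f MB/s\n", per_proc / 1e6);
        printf("processors before saturation: %.1f\n", bus_bw / per_proc);
        return 0;
    }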

Sharing: a Performance Problem
True sharing: frequent writes to a variable can create a bottleneck; read-only or infrequently written data is fine. Technique: make copies of the value, one per processor, if the algorithm allows it. Example problem: the data structure that stores the freelist/heap for malloc/free.
False sharing: cache blocks can also introduce artifacts when two distinct variables sit in the same cache block. Technique: allocate the data used by each processor contiguously, or at least avoid interleaving. Example problem: an array of ints, one written frequently by each processor (see the sketch below).

Intuitive Memory Model
[Figure: a memory hierarchy in which L1, L2, memory, and disk hold different values (67, 35, 34) for the same address 100.]
Reading an address should return the last value written to that address. In a uniprocessor, different levels can hold different values, made consistent at replacement time. In a multiprocessor there are two approaches: (1) use shared top-level caches, or (2) use separate caches but keep them consistent through snooping.
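The sketch below illustrates the false-sharing artifact mentioned above: each thread writes only its own counter, yet without padding the counters share a cache block and the block ping-pongs between caches. The 64-byte block size, names, and structure are assumptions for illustration; thread creation is omitted.

    /* False sharing: per-processor counters with and without padding. */
    #include <stdint.h>

    #define NPROC       8
    #define CACHE_LINE 64

    /* Bad: all eight counters fall within one or two cache blocks, so every
     * increment invalidates (or updates) the other processors' copies. */
    int64_t counters_bad[NPROC];

    /* Better: pad each counter out to its own cache block. */
    struct padded_counter {
        int64_t value;
        char    pad[CACHE_LINE - sizeof(int64_t)];
    };
    struct padded_counter counters_good[NPROC];

    void worker(int id, long iters)      /* run by one thread per processor */
    {
        for (long i = 0; i < iters; i++)
            counters_good[id].value++;   /* no interference from neighbors  */
    }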

Ordering of operations on a single variable
[Figure: interleaved reads and writes to a single variable by P0, P1, and P2.]
The bus establishes a total ordering on writes, assuming atomic bus transactions (later we will look at split-phase transactions). Writes establish a partial order on operations from different processors; this doesn't constrain the ordering of reads, though the bus will order read misses too.

Ordering of operations on multiple variables
/* Assume initial values of A and B are 0 */
P1: (1a) A = 1;      (1b) B = 2;
P2: (2a) print B;    (2b) print A;
Question: which of these outputs are valid? [ 0 0 ], [ 0 1 ], [ 2 0 ], [ 2 1 ]
Question: what might cause these kinds of outputs on real machines? What's the intuition? This is the memory consistency model.
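Here is a runnable version of that example (the thread functions and the use of C11 atomics are my additions). Under sequential consistency the output [ 2 0 ] is impossible: if P2 sees the later write B = 2, it must also see the earlier write A = 1. With seq_cst atomics, as below, that outcome is ruled out; with plain variables, a real machine may still produce it.

    /* Two threads: P1 writes A then B, P2 prints B then A. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    atomic_int A = 0, B = 0;

    void *p1(void *arg)                 /* (1a) A = 1;  (1b) B = 2;      */
    {
        atomic_store(&A, 1);
        atomic_store(&B, 2);
        return NULL;
    }

    void *p2(void *arg)                 /* (2a) print B;  (2b) print A;  */
    {
        int b = atomic_load(&B);
        int a = atomic_load(&A);
        printf("[ %d %d ]\n", b, a);    /* [ 2 0 ] cannot appear here    */
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, p1, NULL);
        pthread_create(&t2, NULL, p2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }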

Related Example
/* Assume initial values of A and flag are 0 */
P1: A = 1;                              flag = 1;
P2: while (flag == 0); /* spin idly */  print A;
The intuition (that P2 prints 1) is not guaranteed by coherence. Coherence only preserves order for a single variable; coherence is not enough! We also expect memory to respect the order between accesses to different locations issued by a given process.

Sequential Consistency
[Figure: processors P1..Pn issue memory references in program order to a single memory through a switch that is randomly set after each memory reference.]
A total order is achieved by interleaving accesses: program order is maintained, and memory operations from all processes appear to issue, execute, and complete atomically, as if there were no caches and only a single memory.
"A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program." [Lamport, 1979]
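The same flag example in C11, as a sketch with my own function names: with relaxed accesses the print would not be guaranteed to see A == 1, while release/acquire (or seq_cst) ordering on flag restores the intuitive behaviour.

    /* Producer sets A then flag; consumer spins on flag, then reads A. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    int A = 0;
    atomic_int flag = 0;

    void *producer(void *arg)
    {
        A = 1;                                                /* ordinary store */
        atomic_store_explicit(&flag, 1, memory_order_release);
        return NULL;
    }

    void *consumer(void *arg)
    {
        while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
            ;                                                 /* spin idly      */
        printf("A = %d\n", A);                                /* prints A = 1   */
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t2, NULL, consumer, NULL);
        pthread_create(&t1, NULL, producer, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }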

Implementing Sequential Consistency
Sufficient conditions: every process issues memory operations in program order; after a read is issued, the issuing process waits for the read to complete before issuing its next operation; after a write is issued, the issuing process waits for the write to complete (updating other caches if necessary) before issuing its next memory operation.
How can architectural enhancements violate SC? Out-of-order execution, write buffers, non-blocking operations.
How can compilers violate SC? By reordering operations: compilers are designed to generate correct uniprocessor code, not correct multiprocessor code.

Summary
Basic shared memory designs: a shared cache, or separate caches, which create coherence problems. Snoopy caches: write-back vs. write-through, update vs. invalidate; the coherence logic is a state machine implemented by each cache. This approach scales to about tens of processors. Coherence is not enough: we also need to specify a memory consistency model, an abstract model for accesses to multiple variables.