SMP Review: Multiprocessors and Bus-Based Cache Coherence

SMP Review: Multiprocessors (CS6810)

Today's topics
» SMP cache coherence: general cache coherence issues, snooping protocols
» improved interaction: lots of questions; warning: I'm going to wait for answers
» granted, it's an experiment; the pace will be slower

Characteristics

Global physical address space
» UMA, and hence symmetric
Each processor has its own cache
» for now, assume one level to simplify things
Physically shared main memory
» easy export of the shared-memory programming model

Bus-Based Coherence

Cache coherence for shared lines
» simple version: all copies of a cached line have the same contents; simultaneous update is hard
» complex version: any read returns the value of the last write
» problem: 2 processors write the same location at the same time; how is order determined?
» need a single atomic decider [Bush-ism ack'd]
The bus is a single thing, so it becomes the decider
» limited scalability: even 4 cores is a stretch at today's clock speeds
» clear broadcast win: all caches see whatever happens on the bus
» bus order is the write order; when that is not good enough, the programmer needs to synchronize
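The bus's role as the single atomic decider can be sketched in a few lines of toy Python (class and variable names are illustrative, not from the lecture): every write is broadcast in bus-grant order, so all caches observe the same write order and converge on the same value.

```python
class Cache:
    def __init__(self):
        self.lines = {}          # address -> value (one word per line here)

    def snoop(self, addr, value):
        if addr in self.lines:   # update our copy only if we hold the line
            self.lines[addr] = value

class Bus:
    def __init__(self, caches):
        self.caches = caches

    def write(self, addr, value):
        # Winning arbitration is atomic: exactly one write owns the bus,
        # and every cache sees it before the next write can issue.
        for c in self.caches:
            c.snoop(addr, value)

caches = [Cache(), Cache()]
caches[0].lines[0x40] = 0        # both caches hold line 0x40
caches[1].lines[0x40] = 0

bus = Bus(caches)
bus.write(0x40, 1)               # P0 wins arbitration first
bus.write(0x40, 2)               # P1 goes second: bus order IS the write order

assert caches[0].lines[0x40] == caches[1].lines[0x40] == 2
```

Whichever processor loses arbitration goes second, and every cache agrees; without a single serialization point there would be no fact of the matter about "the last write".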

Private vs. Shared Data

SMP should support both
» private: normal cache policies and benefits
» shared: 2 options
  NCC-UMA: forces all shared data to go through main memory
    too slow; forces the programmer to deal with all synchronization
    requires write- and read-no-allocate instructions; otherwise caching could create a problem (how?)
  CC-UMA: today's focus
How to partition shared vs. private?
» variable declarations in the code
» partition by page or segment

Other Sharing Issues

Consider conventional cache wisdom
» write-back is good (faster): problems?
» large line sizes help exploit spatial locality: problems?
» valid and dirty tag bits: are they enough?
» TLB: what changes with page-sized private:shared partitioning?
» bus requests: normally always mastered from the cache side; what changes?

Consistency vs. Coherence

Terminology
» some confusion in the literature, but it's rare; so be clear and avoid mutt status
» the key is that they are different
Coherence defines what value is returned by a read
» e.g. the value of the last write
Consistency defines when things are coherent
» a bigger issue as systems get bigger
» sequential consistency: the value of the last write, as determined by the decider
Both are critical for correctness
» it varies as to whether consistency is exposed to the programmer
» sequential consistency doesn't need to be exposed: same as the usual sequential programming model
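The coherence-vs-consistency distinction shows up in the classic two-location litmus test (my example, not the lecture's). Coherence only constrains reads of a single location; consistency orders accesses across locations. Enumerating every sequentially consistent interleaving of two tiny threads shows that the outcome r1 = r2 = 0 can never occur under SC, even though no single location is ever incoherent:

```python
# Enumerate all interleavings of two threads that preserve each
# thread's program order, execute each against a shared memory, and
# collect the possible (r1, r2) outcomes.

def interleavings(a, b):
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

t0 = [("st", "x", 1), ("ld", "y", "r1")]   # P0: x = 1; r1 = y
t1 = [("st", "y", 1), ("ld", "x", "r2")]   # P1: y = 1; r2 = x

outcomes = set()
for sched in interleavings(t0, t1):
    mem, regs = {"x": 0, "y": 0}, {}
    for op in sched:
        if op[0] == "st":
            mem[op[1]] = op[2]
        else:
            regs[op[2]] = mem[op[1]]
    outcomes.add((regs["r1"], regs["r2"]))

assert (0, 0) not in outcomes                  # forbidden under SC
assert outcomes == {(0, 1), (1, 0), (1, 1)}
```

A machine that reorders a load ahead of an independent store can produce (0, 0); that is a consistency violation, not a coherence one, which is why SMP hardware must complete writes in program order even when they look independent.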

Coherence Implications

Additional cost
» caches now need to snoop the bus: watch for writes, tag-compare, and update if they have a copy (update options?)
Ordering constraints
» reordering reads is OK, but not involving writes; same as the uniprocessor world
» writes must finish in program order, EVEN if they are independent
  since there may be a hidden dependency in the other processors
  also because cache management is by line, not by variable
» this can be relaxed; more on this later

2 SMP Protocol Options

Write-invalidate
» the writer needs an exclusive copy: a write forces other copies to be invalidated
» the next read by others is a miss, and they get a fresh line
» 2 writers: one wins bus arbitration, and the decider has spoken
» the bus broadcast doesn't need the write value, only the address
Write-update
» broadcast the write value & address; if other copies exist, the appropriate line is updated
What haven't we considered so far? (hint: LOTS)

Consider All Cases

Cross product: (read, write) x (miss, hit) x (valid copy in cache, memory) x (write-invalidate, write-update)
Simple with write-through caches
» memory always has an updated copy
» a new writer gets a valid copy, either by cache-to-cache transfer or from memory
Harder with write-back caches
» a good idea if the cache is mostly holding private data, but memory may not be up to date
» what happens on a write miss or read miss?
  force the write-back to memory, then invalidate
  or the snoop grabs the latest copy: cache-to-cache copy, no update of memory
  if write-update and the previous owner keeps a copy, then its D bit must be cleared
» key: only 1 D bit can exist; at most a single exclusive owner

Performance Issues

Too many cases to exhaustively list
Key protocol-choice issue: multiple writes to the same line
» write-invalidate: less bus traffic
  1st write: a bus invalidate, plus a data transfer on a write miss
  subsequent writes stay local as long as they hit
  typically Wr-Inv is the best choice when a line is hammered by one processor at a time
» write-update: every write generates bus traffic
  bus scalability is an issue, so it easily saturates
  still, it wins when a line is hammered by multiple processors, and when there is 1 writer and the rest are consumers
Programs share variables, not cache lines: issues?
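The invalidate-vs-update tradeoff can be made concrete with a rough traffic counter. This is a deliberately simplified model of my own (one line, every processor starts with a copy, every bus transaction costs 1), not the lecture's accounting:

```python
def invalidate_traffic(stream):
    procs = {p for p, _ in stream}
    valid, owner, bus = set(procs), None, 0   # everyone starts with a copy
    for p, op in stream:
        if op == "r":
            if p not in valid:
                bus += 1                 # BusRd: re-fetch after invalidation
                valid.add(p)
                owner = None             # old owner drops back to shared
        else:                            # write
            if owner != p:
                bus += 1                 # BusRdX: invalidate all other copies
                valid, owner = {p}, p
    return bus

def update_traffic(stream):
    # copies are never invalidated, so reads always hit;
    # every write broadcasts its value on the bus
    return sum(1 for _, op in stream if op == "w")

hammer = [("P0", "w")] * 4 + [("P1", "w")] * 4        # one writer at a time
share  = [("P0", "w"), ("P1", "r"), ("P2", "r")] * 3  # 1 writer, 2 consumers

assert invalidate_traffic(hammer) == 2 and update_traffic(hammer) == 8
assert invalidate_traffic(share) == 9 and update_traffic(share) == 3
```

With one processor hammering the line, invalidate pays once per ownership change (2 vs 8 transactions); with a producer and consumers, update wins (3 vs 9) because the consumers' copies stay valid.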

Snooping Cache Complexities

The cache now has 2 masters
» processor side: same as before
  each line still has state (I, D tags), but add private/shared
  the tag match is the same; the action can vary:
    write miss & shared? write-back, clean, and shared? write-back, dirty, and shared?
» bus side: sees write transactions
  write-back & dirty? clean?
End result: the cache-controller FSM now gets more complicated
» it will vary with cache policy, organization, and sharing protocol

Simplest Possible Protocol

Write-through, write-allocate, write-invalidate
» processor transactions: PrRd, PrWr
» bus transactions: BusRd, BusWr
» line state: I/V; no D bit, since write-through

MSI Protocol

Write-invalidate protocol with a write-back cache
» line states: Modified, Shared, Invalid
» processor-side events: PrRd, PrWr
Bus transactions
» BusRd: asks for a copy of the line with no intent to modify
  e.g. on a PrRd miss; the line is supplied by either main memory or another cache
» BusRdX: asks for an exclusive copy of the line
  on a PrWr miss, or a PrWr hit to a clean (not Modified) line
  note: a new type of bus transaction
» BusWB: imposed by the write-back policy choice
  note: the write data is an entire line

MSI State Diagram

(state diagram omitted)
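The MSI transitions just listed can be flattened into a table. This is an illustrative sketch of the standard machine, not a transcription of the lecture's exact diagram; each entry maps (state, event) to (next state, bus transaction generated).

```python
M, S, I = "M", "S", "I"

msi = {
    # processor-side events
    (I, "PrRd"):   (S, "BusRd"),    # read miss: fetch a shared copy
    (I, "PrWr"):   (M, "BusRdX"),   # write miss: fetch an exclusive copy
    (S, "PrRd"):   (S, None),       # read hit
    (S, "PrWr"):   (M, "BusRdX"),   # upgrade: invalidate other copies
    (M, "PrRd"):   (M, None),
    (M, "PrWr"):   (M, None),       # writes stay local while Modified
    # bus-side (snooped) events
    (S, "BusRd"):  (S, None),
    (S, "BusRdX"): (I, None),       # another writer: invalidate our copy
    (M, "BusRd"):  (S, "BusWB"),    # flush the dirty line, stay as a sharer
    (M, "BusRdX"): (I, "BusWB"),    # flush, then invalidate
}

# P1 requests exclusive access to a line P0 holds Modified:
# P0 must flush the whole line and invalidate its copy.
assert msi[(M, "BusRdX")] == (I, "BusWB")

# One processor hammering a line pays the bus once, then writes locally.
assert msi[(I, "PrWr")] == (M, "BusRdX")
assert msi[(M, "PrWr")] == (M, None)
```

The table makes the two-master structure explicit: the top half is driven by the processor, the bottom half by the snooper, and both read and write the same per-line state.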

MSI Analysis

Sequential consistency
» write completion: BusRdX & data return complete
» bus atomicity, the decider point, makes this easy
» other cache snoopers can see when their pending write is issued: when the controller wins arbitration
Other options
» BusRd in M: go to I rather than S
  a migratory protocol: the line always just migrates to the writer; the Synapse machine's choice
  tradeoff: if another owner is likely to write soon, I is better; if the old owner is likely to read soon, S is better
  a hybrid is possible with an extra protocol bit; the choice of the Sequent Symmetry and MIT Alewife machines
» flexibility: potential for increased cost & performance; the question is how much of each

MESI Protocol

Add an Exclusive state
» deals with the PrRd-followed-by-PrWr problem
» meanings change a bit:
  E = exclusive clean: memory is consistent
  M = exclusive dirty: memory is inconsistent
  S = 2 or more sharers, no writers, memory consistent
  I = same as always
The new S semantics adds an additional problem
» a shared signal must be added to the bus: a single wired-OR wire is sufficient
  note the scaling problem: doesn't work well at today's frequencies
» BusRd(S): shared signal asserted
» BusRd(S'): shared signal not asserted
» BusRd: don't care about the shared signal
» FLUSH: optional, for cache-to-cache copy?

MESI State Machine

(state diagram omitted; errors exist: find them!!)
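Why E earns its keep can be shown in a few lines (a sketch under the slide's state names; the helper functions are my illustration). On a read miss with the shared signal deasserted, the line arrives Exclusive-clean, so a later write upgrades silently with no bus transaction; MSI would have paid a BusRdX for the same read-then-write of private data.

```python
def read_miss(shared_signal_asserted):
    # BusRd(S): some other cache has a copy -> come in Shared
    # BusRd(S'): nobody else has it -> come in Exclusive-clean
    return "S" if shared_signal_asserted else "E"

def write_hit(state):
    # returns (next state, bus transaction or None)
    return {"E": ("M", None),        # silent upgrade: no other copies exist
            "S": ("M", "BusRdX"),    # must invalidate the other sharers
            "M": ("M", None)}[state]

# Private data (shared signal deasserted): no upgrade transaction needed.
st = read_miss(shared_signal_asserted=False)
assert st == "E"
assert write_hit(st) == ("M", None)

# Genuinely shared data still pays for the invalidation.
st = read_miss(shared_signal_asserted=True)
assert write_hit(st) == ("M", "BusRdX")
```

The cost of this optimization is exactly the wired-OR shared signal the slide describes: without it, a cache could not tell at miss time whether it is the only holder.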

MESI Analysis

Flush issues
» we don't want redundant suppliers when a new sharer comes on line
» the last exclusive owner knows who they are, so that one does the flush
» if there is no sharing, the line is supplied from memory; this complicates the bus
» the Stanford DASH & SGI Origin series choice
What haven't we worried about yet?
» what happens when a line gets victimized?
» exercise: figure out the new state machine

Dragon Protocol

Write-back and write-update*
» Xerox PARC Dragon, subsequently modified somewhat for Sun's SparcServer
States
» E: exclusive clean
» SC: shared clean
» SM: shared modified; this one is used to update memory
» M: exclusive dirty
» no explicit I state; it is there implicitly
New bus transactions
» BusUpd: an update request, with the same S and S' variants

Dragon FSM

(state diagram omitted)

Key Points

Status tags need to encode the local line status
» protocol dependent
2-ported cache controller
» priority becomes the bus, since it's the atomicity point; this possibly stalls processor requests
New miss source, the 4th C: Coherence
» true sharing miss: reads and writes to the same target
» false sharing miss: reads and writes to different targets in the same line
Increased bus pressure due to coherence traffic
» increased power: already a scaling problem
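One hypothetical way to separate the two kinds of coherence miss just defined (my illustration, not the lecture's answer): when a snooped write invalidates a line, record which words the remote writer dirtied; on the next miss to that line, compare against the word actually being accessed.

```python
remote_writes = {}   # line address -> set of word offsets written remotely

def snoop_invalidate(line, word):
    # a remote BusRdX took the line away; remember what it modified
    remote_writes.setdefault(line, set()).add(word)

def classify_coherence_miss(line, accessed_word):
    if accessed_word in remote_writes.get(line, set()):
        return "true-sharing"    # we needed the remote writer's new data
    return "false-sharing"       # same line, different word:
                                 # a line-granularity artifact

snoop_invalidate(0x40, word=3)   # remote processor wrote word 3 of line 0x40
assert classify_coherence_miss(0x40, 3) == "true-sharing"
assert classify_coherence_miss(0x40, 5) == "false-sharing"
```

A false-sharing miss is pure overhead: the two processors never touched the same word, and the miss exists only because "programs share variables, not cache lines".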

Classifying Misses

For a particular reference stream
» define the lifetime for a block in the cache
» do per-word accounting: e.g. a remote reference from processor x causes an eviction

Miss Classification

Ideas for how to do this?

Getting More Real

A 2-level cache hierarchy is likely
» Harvard L1$, unified L2
» L2 is the one that sits on the bus now
L2 is coherent, but what about the L1s?
» L1 write and read misses don't cause much problem: they percolate through to L2, and then the rest is similar
» L1 read hit: no problem
» L1 write hit: the write has to percolate all the way to the bus
L2 line eviction (due to an invalidate)
» L2 needs to pass the eviction up and evict L1's entry
More synchronization through the cache hierarchy will slow things down
» the question is how much
How well does it work? see the Chap. 4.3 data

Concluding Remarks

Is SMP dead because buses are dead?
» in a small way, SMP may make sense: on a multi-core socket, or in small clusters where the socket is multi-cluster
» short buses aren't so bad: it's easy enough to extend their life with point-to-point interconnect
Next we move on to DSM, a variant of CC-NUMA
» the protocol ideas are still valid, hence the time spent understanding these protocols is well spent (note: an exam question is highly likely)
» main difference with DSM: lines have both
  local state: similar to today's discussion
  global state: more on that next lecture