Snooping coherence protocols (cont.)

A four-state update protocol [§5.3.3]

When there is a high degree of sharing, invalidation-based protocols perform poorly. Blocks are often invalidated, and then have to be re-fetched from memory.

Wouldn't it be better to send new values out rather than invalidation signals? This is the motivation behind update-based protocols.

We will look at the Dragon protocol, initially proposed for Xerox's Dragon multiprocessor, and more recently used in Sun SparcServer multiprocessors.

This is a four-state protocol, with two of the states identical to those in the four-state invalidation protocol:

The E (exclusive) state indicates that a block is in use by a single processor, but has not been modified.

The M (modified) state indicates that a block is present in only this cache, and main memory is not up to date.

There are also two new states.

The Sc (shared-clean) state indicates that potentially two or more caches hold this block, and main memory may or may not be up to date.

The Sm (shared-modified) state indicates that potentially two or more caches hold this block, main memory is not up to date, and it is this cache's responsibility to update main memory when the block is purged (i.e., ).

A block can be in Sm state in only one cache at a time. However, while a block is in Sm state in one cache, it can be in Sc state in others.

Lecture 14 Architecture of Parallel Computers

It is possible for a block to be in Sc state in some caches without being in Sm state in any cache. In this case, main memory is up to date.

Why is there no I (invalid) state?

Here is a state-transition diagram for this protocol.

[Figure: state-transition diagram for the Dragon protocol over the states E, Sc, Sm, and M. Arc labels: PrRd/-, PrWr/-, PrWr/BusUpd(S), PrRdMiss/BusRd(S), PrWrMiss/BusRd(S), PrWrMiss/(BusRd(S); BusUpd), BusRd/-, BusRd/Flush, BusUpd/Update.]

In diagrams for previous protocols, if a block not in the cache was referenced, we showed the transition as coming out of the I (invalid) state. In this protocol, we don't have an invalid state. So, looking at the diagram above, can you see what is supposed to happen when a referenced block is not in the cache?

What happens if there is a read-miss and
    the shared line is asserted?
    the shared line is not asserted?

What happens if there is a write-miss and
    the shared line is asserted?
    the shared line is not asserted?

If there's a write-miss and the shared line is asserted, what else happens?

Why is only a single word broadcast?

Let us first consider the transitions out of the Exclusive state.

What happens if this processor reads a word?

What happens if this processor writes a word?

There is one more transition out of this state. What causes it, and what happens?

Now let us consider the transitions out of the Shared-Clean state.

What happens if this processor reads a word?

What happens if this processor writes a word?

There is one more transition out of this state. What causes it, and what happens?

Next, let's look at the transitions out of the Shared-Modified state.

What happens if this processor reads a word?

What happens if this processor writes a word?

How many more transitions are there out of this state?

What causes the first one, and what happens?

What causes the second one, and what happens?

Finally, let's look at the transitions out of the Modified state.

What happens if this processor reads a word?

What happens if this processor writes a word?

What happens if another processor reads a word?

Let's go through the same example as we did for the 3-state invalidation protocol.
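One way to check your answers to the questions above is to write the transitions down as a small simulator. The following is a minimal Python sketch, not part of the original notes. It tracks a single block, uses `None` for "block not present" in place of an invalid state, and assumes that a cache in M or Sm supplies the data on a bus read while memory supplies it otherwise. The names `DragonSim` and `run_example` are hypothetical, and replacements are not modeled.

```python
class DragonSim:
    """Sketch: one memory block tracked across n caches (Dragon protocol)."""

    def __init__(self, n_caches):
        # None means "block not present"; Dragon has no explicit I state.
        self.state = [None] * n_caches

    def states(self):
        return [s if s is not None else "-" for s in self.state]

    def _others(self, p):
        return [i for i, s in enumerate(self.state) if i != p and s is not None]

    def _bus_rd(self, p):
        """BusRd from cache p. Returns (shared line asserted, data source)."""
        source = "memory"
        others = self._others(p)
        for i in others:
            if self.state[i] in ("M", "Sm"):
                source = f"P{i + 1} cache"  # assumed: the owner flushes the block
                self.state[i] = "Sm"        # M -> Sm, Sm stays Sm
            else:
                self.state[i] = "Sc"        # E -> Sc, Sc stays Sc
        return bool(others), source

    def _bus_upd(self, p):
        """BusUpd from cache p. Returns whether the shared line was asserted."""
        others = self._others(p)
        for i in others:
            self.state[i] = "Sc"            # copies take the new word; Sm -> Sc
        return bool(others)

    def read(self, p):
        if self.state[p] is not None:       # read hit: no bus action
            return ("none", "own cache")
        shared, source = self._bus_rd(p)
        self.state[p] = "Sc" if shared else "E"
        return ("BusRd", source)

    def write(self, p):
        s = self.state[p]
        if s in ("E", "M"):                 # no other copies: write locally
            self.state[p] = "M"
            return ("none", "own cache")
        if s is None:                       # write miss: fetch, then maybe update
            shared, source = self._bus_rd(p)
            if not shared:
                self.state[p] = "M"
                return ("BusRd", source)
            self._bus_upd(p)
            self.state[p] = "Sm"
            return ("BusRd; BusUpd", source)
        shared = self._bus_upd(p)           # write hit in Sc or Sm
        self.state[p] = "Sm" if shared else "M"
        return ("BusUpd", f"P{p + 1} cache")


def run_example():
    """Replay the five accesses to u used in the example."""
    sim = DragonSim(3)
    rows = []
    for op, p in [("read", 0), ("read", 2), ("write", 2), ("read", 0), ("read", 1)]:
        bus, source = getattr(sim, op)(p)
        rows.append((f"P{p + 1} {op}s u", *sim.states(), bus, source))
    return rows


if __name__ == "__main__":
    for row in run_example():
        print(" | ".join(row))
```

Each row returned by `run_example()` holds the processor action, the state in P1, P2, and P3, the bus action, and the data source, in that order.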

[Figure: processors P1, P2, and P3, each with its own cache ($), connected by a bus to memory and I/O devices. Memory holds u:5. Five numbered events mark the accesses to u, including P3 setting u = 7.]

Processor action | State in P1 | State in P2 | State in P3 | Bus action | Data supplied by
P1 reads u       |             |             |             |            |
P3 reads u       |             |             |             |            |
P3 writes u      |             |             |             |            |
P1 reads u       |             |             |             |            |
P2 reads u       |             |             |             |            |

A three-state update protocol

Whenever a bus update is generated, suppose that main memory as well as the caches updates its contents.

Then which state don't we need?

What's the advantage, then, of having the fourth state?

The Firefly protocol, named after a multiprocessor workstation developed by DEC, is an example of such a protocol.

Here is a state diagram for the Firefly protocol:

[Figure: state diagram for the Firefly protocol over the states V, S, and D, showing processor-induced and bus-induced transitions.]

Key:
    CRM  CPU read miss
    CWM  CPU write miss
    CWH  CPU write hit
    BR   bus read
    BW   bus write

A check mark following a transition means SharedLine was asserted. An x means it was not. Read hits do not cause state transitions and are not shown.

What do you think the states are, and how do they correspond to the states in

The scheme works as follows:

On a read hit, the data is returned immediately to the processor, and no caches change state.

On a read miss,

    If one or more other caches had a copy of the block, one of them supplies it directly to the requesting cache and raises the SharedLine. The bus timing is fixed so all caches respond in the same cycle. All caches, including the requestor, set the state to shared. If the owning cache had the block in state dirty, the block is written to main memory at the same time.

    If no other cache had a copy of the block, it is read from main memory and assigned state valid-exclusive.

On a write hit,

    If the block is already dirty, the write proceeds to the cache without delay.

    If the block is valid-exclusive, the write proceeds without delay and the state is changed to dirty.

    If the block is in state shared, the write is delayed until the bus is acquired and a write-word to main memory is initiated. Other caches pick the data off the bus and update their copies (if any). They also raise the SharedLine. The writing cache can determine whether the block is still being shared by testing this line. If the SharedLine is not asserted, no other cache has a copy of the block, and the writing cache changes to state valid-exclusive. If the SharedLine is asserted, the block remains in state shared.

On a write miss,

    If any other caches have a copy of the block, they supply it. By inspecting the SharedLine, the requesting processor determines that the block has been supplied by another cache, and sets its state to shared. The block is also written to memory, and other caches pick the data off the bus and update their copies (if any).

    If no other cache has a copy of the block, the block is loaded from memory in state dirty.
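The read-miss, write-hit, and write-miss rules above can be sketched in the same style. This is an illustrative Python model of one block under the Firefly rules as described here, not DEC's implementation. The class name `FireflySim`, the `"BR+BW"` label for a write miss that finds sharers, and the `evict` helper are hypothetical additions; `evict` exists only so that the shared-to-valid-exclusive case can be exercised.

```python
class FireflySim:
    """Sketch: one block tracked across n caches (Firefly states V, S, D)."""

    def __init__(self, n_caches):
        self.state = [None] * n_caches  # None: block not cached

    def states(self):
        return [s if s is not None else "-" for s in self.state]

    def _others(self, p):
        return [i for i, s in enumerate(self.state) if i != p and s is not None]

    def read(self, p):
        """Processor p reads. Returns the bus action ('none' or 'BR')."""
        if self.state[p] is not None:       # read hit: no state change
            return "none"
        others = self._others(p)
        if others:                          # SharedLine asserted
            for i in others:
                self.state[i] = "S"         # a dirty copy is written to memory too
            self.state[p] = "S"
        else:
            self.state[p] = "V"             # supplied by memory
        return "BR"

    def write(self, p):
        """Processor p writes. Returns the bus action."""
        s = self.state[p]
        others = self._others(p)
        if s == "D":
            return "none"
        if s == "V":
            self.state[p] = "D"
            return "none"
        if s == "S":                        # write-word goes to memory and sharers
            self.state[p] = "S" if others else "V"
            return "BW"
        if others:                          # write miss, SharedLine asserted
            for i in others:
                self.state[i] = "S"
            self.state[p] = "S"
            return "BR+BW"                  # hypothetical label: fetch, then write through
        self.state[p] = "D"                 # write miss, no other copies
        return "BR"

    def evict(self, p):
        """Hypothetical helper: drop p's copy; True if a write-back was needed."""
        wrote_back = self.state[p] == "D"
        self.state[p] = None
        return wrote_back

    def memory_up_to_date(self):
        # Shared writes go through to memory, so only a D copy leaves it stale.
        return "D" not in self.state


if __name__ == "__main__":
    sim = FireflySim(3)
    for op, p in [("read", 0), ("read", 2), ("write", 2), ("read", 0), ("read", 1)]:
        print(f"P{p + 1} {op}: bus={getattr(sim, op)(p)} states={sim.states()}")
```

Because every write to a shared block is written through to memory, the model needs no counterpart to Dragon's Sm state: memory is stale only while some cache holds the block in D.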

In update protocols in general, since all writes appear on the bus, write serialization, write-completion detection, and write atomicity are simple.

Performance results [§5.4]

What cache line size performs best? Which protocol is best to use?

Questions like these can be answered by simulation. However, getting the answer right is part art and part science.

Parameters need to be chosen for the simulator. The authors selected a single-level 4-way set-associative 1 MB cache with 64-byte lines.

The simulation assumes an idealized memory model, in which every reference takes constant time. Why is this not realistic?

The simulated workload consists of 6 parallel programs from the SPLASH-2 suite and one multiprogrammed workload, consisting of mainly serial programs.

Effect of coherence protocol [§5.4.3]

Three coherence protocols were compared:

The Illinois MESI protocol (Ill, left bar).

The three-state invalidation protocol (3St) with bus upgrade for S → M transitions. (This means that instead of rereading data from main memory when a block moves to the M state, we just issue a bus transaction invalidating the other copies.)

The three-state invalidation protocol without bus upgrade (3St-BusRdX). (This means that when a block moves to the M state, we reread it from main memory.)

[Figure: address-bus and data-bus traffic (MB/s) for Barnes, LU, Ocean, Radiosity, Radix, and Raytrace, each under the Ill, 3St, and 3St-RdEx protocols.]

In our parallel programs, which protocol seems to be best?

Somewhat surprisingly, the result turns out to be the same for the multiprogrammed workload.

The reason for this? The advantage of the four-state protocol is that no bus traffic is generated on E → M transitions. But E → M transitions are very rare (less than 1 per 1K references).

Effect of cache line size [§5.4.4]

Recall from Lecture 11 that cache misses can be classified into four categories:

Cold misses (called compulsory misses in the previous discussion) occur the first time that a block is referenced.

Conflict misses are misses that would not occur if the cache were fully associative with LRU replacement.

Capacity misses occur when the cache size is not sufficient to hold data between references.

Coherence misses are misses caused by the coherence protocol.

Coherence misses can be divided into those caused by true sharing and those caused by false sharing.

False-sharing misses are those caused by having a line size larger than one word. Can you explain?

True-sharing misses, on the other hand, occur when a processor writes some words into a cache block, invalidating the block in another processor's cache, after which the other processor reads one of the modified words.

How could we attack each of the four kinds of misses?

To reduce capacity misses, we could

To reduce conflict misses, we could

To reduce cold misses, we could

To reduce coherence misses, we could

If we increase the line size, the number of coherence misses might go up or down. Why?

Increasing the line size has other disadvantages.

It increases conflict misses. Why?

It increases bus traffic. Why?

So it is not clear which line size will work best.
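The distinction between true and false sharing, and its dependence on line size, can be illustrated with a toy model. The sketch below is an assumption-laden simplification, not the classification procedure used in the simulations: caches are infinite (so there are no capacity or conflict misses), a write invalidates every other copy of the block, and a coherence miss counts as true sharing only if the word being accessed was written by another processor since the copy was lost. The names `classify_misses` and `alternating_trace` are hypothetical.

```python
def classify_misses(trace, words_per_line, n_procs=2):
    """Count cold, true-sharing, and false-sharing misses for a trace.

    trace is a list of (processor, op, word_address) with op "r" or "w".
    """
    present = [set() for _ in range(n_procs)]  # blocks currently valid per cache
    seen = [set() for _ in range(n_procs)]     # blocks ever loaded per cache
    stale = [dict() for _ in range(n_procs)]   # block -> words written by others
    counts = {"cold": 0, "true": 0, "false": 0}

    for proc, op, word in trace:
        block = word // words_per_line
        if block not in present[proc]:
            if block not in seen[proc]:
                counts["cold"] += 1
            elif word in stale[proc].get(block, set()):
                counts["true"] += 1    # the accessed word really was modified
            else:
                counts["false"] += 1   # only a neighboring word was modified
            present[proc].add(block)
            seen[proc].add(block)
            stale[proc].pop(block, None)
        if op == "w":
            for other in range(n_procs):
                if other == proc:
                    continue
                if block in present[other]:
                    present[other].discard(block)  # invalidate the copy
                    stale[other][block] = set()
                if block in stale[other]:
                    stale[other][block].add(word)
    return counts


def alternating_trace(rounds, writer_word, reader_word):
    """P0 writes one word, then P1 reads a word, repeated `rounds` times."""
    trace = []
    for _ in range(rounds):
        trace.append((0, "w", writer_word))
        trace.append((1, "r", reader_word))
    return trace


if __name__ == "__main__":
    for words_per_line in (1, 2, 4):
        print(words_per_line, classify_misses(alternating_trace(4, 0, 1), words_per_line))
```

With P0 writing word 0 and P1 reading word 1, one-word lines put the two words in different blocks and only the cold misses remain. With two or more words per line, every read by P1 after the first becomes a false-sharing miss in this model.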

[Figure: miss rate (%), broken down into upgrade, false-sharing, true-sharing, capacity, and cold misses, for Barnes, LU, and Radiosity at line sizes 8, 16, 32, 64, 128, and 256.]

Results for the first three applications seem to show that which line size is best?

[Figure: miss rate (%), broken down into upgrade, false-sharing, true-sharing, capacity, and cold misses, for Ocean, Radix, and Raytrace at line sizes 8 through 256.]

For the second set of applications, Radix shows a greatly increasing number of false-sharing misses with increasing block size.

However, this is not the whole story. Larger line sizes also create more bus traffic.

[Figure: address-bus and data-bus traffic (bytes/instruction) for Barnes, Radiosity, and Raytrace at line sizes 8, 16, 32, 64, 128, and 256.]

With this in mind, which line size would you say is best?

Invalidate vs. update [§5.4.5]

Which is better, an update or an invalidation protocol?

At first glance, it might seem that update schemes would always be superior to write-invalidate schemes.

Why might this be true?

Why might this not be true?

When there are not many external rereads,

When there is a high degree of sharing,

For example, in a producer-consumer pattern,

Update and invalidation schemes can be combined (see §5.4.5).

Let's look at real programs.

[Figure: miss rate (%), broken down into false-sharing, true-sharing, capacity, and cold misses, for LU (inv, upd), Ocean (inv, mix, upd), Raytrace (inv, upd), and Radix (inv, mix, upd).]

Where there are many coherence misses,

If there were many capacity misses,

So let's look at bus traffic

[Figure: upgrade/update rate (%) for LU (inv, upd), Ocean (inv, mix, upd), Raytrace (inv, upd), and Radix (inv, mix, upd).]

Note that in two of the applications, updates in an update protocol are much more prevalent than upgrades in an invalidation protocol.

Each of these operations produces bus traffic; therefore, the update protocol causes more traffic.

The main problem is that one processor tends to write a block multiple times before another processor reads it.

This causes several bus transactions instead of one, as there would be in an invalidation protocol.

In addition, updates cause problems in non-bus-based multiprocessors.
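The cost of repeated writes can be put into rough numbers. The sketch below is a back-of-the-envelope model, not a measurement: it assumes a steady-state producer-consumer round in which one processor writes the same shared block some number of times and another processor then reads it once. Under update, every write is one bus update carrying one word. Under invalidation, the first write is an upgrade carrying no data, later writes hit locally, and the reader then takes one miss that transfers a whole line. The 64-byte line matches the simulated cache above; the 4-byte word and the function name `bus_cost_per_round` are assumptions.

```python
def bus_cost_per_round(writes_per_round, protocol, line_bytes=64, word_bytes=4):
    """Return (bus transactions, data bytes moved) for one write-run plus one read."""
    if writes_per_round < 1:
        raise ValueError("need at least one write per round")
    if protocol == "update":
        # One single-word bus update per write; the reader then hits.
        return writes_per_round, writes_per_round * word_bytes
    if protocol == "invalidate":
        # One upgrade (no data) plus one read miss that moves a full line.
        return 2, line_bytes
    raise ValueError(f"unknown protocol: {protocol}")


if __name__ == "__main__":
    for k in (1, 2, 8, 32):
        print(k, bus_cost_per_round(k, "update"), bus_cost_per_round(k, "invalidate"))
```

In this model the update protocol issues fewer transactions only when a block is written once between reads; with longer write runs it issues more, although each update moves far less data than a line refill.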


More information
