Snooping coherence protocols (cont.)

A four-state update protocol [§5.3.3]

When there is a high degree of sharing, invalidation-based protocols perform poorly. Blocks are repeatedly invalidated, and then have to be re-fetched from memory. Wouldn't it be better to send out the new values rather than invalidation signals? This is the motivation behind update-based protocols.

We will look at the Dragon protocol, initially proposed for Xerox's Dragon multiprocessor, and more recently used in Sun SPARCserver multiprocessors.

This is a four-state protocol, with two of the states identical to those in the four-state invalidation protocol:

- The E (exclusive) state indicates that a block is in use by a single processor, but has not been modified.
- The M (modified) state indicates that a block is present in only this cache, and main memory is not up to date.

There are also two new states.

- The Sc (shared-clean) state indicates that potentially two or more caches hold this block, and main memory may or may not be up to date.
- The Sm (shared-modified) state indicates that potentially two or more caches hold this block, main memory is not up to date, and it is this cache's responsibility to update main memory when the block is purged (i.e., replaced).

A block can be in Sm state in only one cache at a time. However, while a block is in Sm state in one cache, it can be in Sc state in others.
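To make the state semantics concrete, here is a minimal sketch (not from the lecture; the controller interface names are my own assumptions) of how a cache controller might encode the Dragon states and handle a processor write hit. It follows the PrWr arcs in the state diagram below: a write in Sc or Sm broadcasts a BusUpd and then samples the bus's shared line.

```c
#include <stdbool.h>

/* Assumed hooks into the bus/cache-controller interface;
   illustrative names, not a real API. */
void bus_update(void);            /* BusUpd: broadcast the written word */
bool shared_line_asserted(void);  /* sample the bus's shared (S) line   */

/* The four Dragon states described above. */
typedef enum { E, Sc, Sm, M } dragon_state_t;

/* Next state for a processor write hit on a block in `state`. */
dragon_state_t dragon_write_hit(dragon_state_t state) {
    switch (state) {
    case E:  return M;           /* sole clean copy: modify it, no bus work */
    case M:  return M;           /* sole dirty copy: no bus work            */
    case Sc:
    case Sm:
        bus_update();            /* possibly shared: broadcast the new word */
        return shared_line_asserted()
             ? Sm                /* still shared: this cache becomes owner  */
             : M;                /* no other copies remain: go exclusive    */
    }
    return state;                /* unreachable */
}
```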

It is possible for a block to be in Sc state in some caches without being in Sm state in any cache. In this case, main memory is up to date.

Why is there no I (invalid) state?

Here is a state-transition diagram for this protocol.

[Dragon state-transition diagram: states E, Sc, Sm, and M. Transitions are labeled with processor actions (PrRd, PrWr, PrRdMiss, PrWrMiss) and the resulting bus actions (BusRd(S), BusUpd(S), Flush); a BusUpd observed on the bus causes an Update in other caches holding the block, e.g. PrRdMiss/BusRd(S) into E or Sc, PrWrMiss/(BusRd(S); BusUpd) into Sm or M, and BusRd/Flush out of Sm and M.]

In diagrams for previous protocols, if a block not in the cache was referenced, we showed the transition as coming out of the I (invalid) state. In this protocol, we don't have an invalid state. So, looking at the diagram above, can you see what is supposed to happen when a referenced block is not in the cache?

What happens if there is a read miss and
- the shared line is asserted?
- the shared line is not asserted?

What happens if there is a write miss and
- the shared line is asserted?
- the shared line is not asserted?

If there's a write miss and the shared line is asserted, what else happens?

Why is only a single word broadcast?

Let us first consider the transitions out of the Exclusive state.

What happens if this processor reads a word?

What happens if this processor writes a word?

There is one more transition out of this state. What causes it, and what happens?

Now let us consider the transitions out of the Shared-Clean state.

What happens if this processor reads a word?

What happens if this processor writes a word?

There is one more transition out of this state. What causes it, and what happens?

Next, let's look at the transitions out of the Shared-Modified state.

What happens if this processor reads a word?

What happens if this processor writes a word?

How many more transitions are there out of this state?

What causes the first one, and what happens?

What causes the second one, and what happens?

Finally, let's look at the transitions out of the Modified state.

What happens if this processor reads a word?

What happens if this processor writes a word?

What happens if another processor reads a word?

Let's go through the same example as we did for the 3-state invalidation protocol.
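Since there is no Invalid state, misses allocate a fresh line. Continuing the sketch from above (same assumed hooks, plus an assumed bus_read()), here is how the miss-handling arcs of the diagram, PrRdMiss/BusRd(S) and PrWrMiss/(BusRd(S); BusUpd), can be encoded; treat it as a cross-check after you have answered the questions above.

```c
void bus_read(void);  /* BusRd: fetch the block over the bus (assumed hook) */

dragon_state_t dragon_read_miss(void) {
    bus_read();                           /* BusRd(S)                       */
    return shared_line_asserted() ? Sc    /* another cache also holds it    */
                                  : E;    /* we have the only copy, clean   */
}

dragon_state_t dragon_write_miss(void) {
    bus_read();                           /* BusRd(S) to fetch the block... */
    bus_update();                         /* ...then BusUpd the new word    */
    return shared_line_asserted() ? Sm    /* shared: we become the owner    */
                                  : M;    /* unshared: modified, exclusive  */
}
```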

[Figure: processors P1, P2, and P3, each with a private cache, connected by a bus to main memory and I/O devices. P1 and P3 have read location u (value 5) into their caches; P3 then writes u = 7, and P1 and P2 subsequently read u.]

Processor action | State in P1 | State in P2 | State in P3 | Bus action | Data supplied by
-----------------|-------------|-------------|-------------|------------|-----------------
P1 reads u       |             |             |             |            |
P3 reads u       |             |             |             |            |
P3 writes u      |             |             |             |            |
P1 reads u       |             |             |             |            |
P2 reads u       |             |             |             |            |

A three-state update protocol

Whenever a bus update is generated, suppose that main memory, as well as the caches, updates its contents. Then which state don't we need?

What's the advantage, then, of having the fourth state?

The Firefly protocol, named after a multiprocessor workstation developed by DEC, is an example of such a protocol.

Here is a state diagram for the Firefly protocol:

[Firefly state-transition diagram: states V, S, and D. Processor-induced transitions (CRM, CWM, CWH) and bus-induced transitions (BR, BW) connect the states; for example, CWHx takes V to D, CWMx leads to D, and BR or BW moves a block to S. Read hits do not cause state transitions and are not shown.]

Key:
- CRM: CPU read miss
- CWM: CPU write miss
- CWH: CPU write hit
- BR: bus read
- BW: bus write

A check mark following a transition means the SharedLine was asserted; an x means it was not.

What do you think the states are, and how do they correspond to the states in the Dragon protocol?

The scheme works as follows:

On a read hit, the data is returned immediately to the processor, and no cache changes state.

On a read miss:
- If another cache (or several) has a copy of the block, one of them supplies it directly to the requesting cache and raises the SharedLine. The bus timing is fixed so that all caches respond in the same cycle. All caches, including the requestor, set the state to shared. If the owning cache had the block in state dirty, the block is written to main memory at the same time.
- If no other cache has a copy of the block, it is read from main memory and assigned state valid-exclusive.

On a write hit:
- If the block is already dirty, the write proceeds to the cache without delay.
- If the block is valid-exclusive, the write proceeds without delay and the state is changed to dirty.
- If the block is in state shared, the write is delayed until the bus is acquired and a write-word to main memory is initiated. Other caches pick the data off the bus and update their copies (if any); they also raise the SharedLine. The writing cache can determine whether the block is still being shared by testing this line: if the SharedLine is not asserted, no other cache has a copy of the block, and the block changes to state valid-exclusive; if the SharedLine is asserted, the block remains in state shared.

On a write miss:
- If any other caches have a copy of the block, they supply it. By inspecting the SharedLine, the requesting processor determines that the block has been supplied by another cache, and sets its state to shared. The block is also written to memory, and other caches pick the data off the bus and update their copies (if any).
- If no other cache has a copy of the block, the block is loaded from memory in state dirty.
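As a concrete illustration of the write-hit rules just described, here is a minimal sketch (my own, not from the lecture; the bus-interface names are assumptions) of a Firefly controller's write-hit logic:

```c
#include <stdbool.h>

/* Assumed hooks, as before; illustrative names only. */
void bus_write_word(void);        /* acquire the bus and write the word
                                     through to main memory; other caches
                                     snoop it, update their copies, and
                                     raise the SharedLine                  */
bool shared_line_asserted(void);  /* test the SharedLine afterward        */

typedef enum { VALID_EXCLUSIVE, SHARED, DIRTY } firefly_state_t;

/* Next state for a processor write hit on a block in `state`. */
firefly_state_t firefly_write_hit(firefly_state_t state) {
    switch (state) {
    case DIRTY:           return DIRTY;  /* write locally, no delay       */
    case VALID_EXCLUSIVE: return DIRTY;  /* write locally, mark dirty     */
    case SHARED:
        bus_write_word();                /* write-through; sharers update */
        return shared_line_asserted()
             ? SHARED                    /* others still hold the block   */
             : VALID_EXCLUSIVE;          /* no sharers; memory up to date */
    }
    return state;                        /* unreachable */
}
```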

In update protocols in general, since all writes appear on the bus, write serialization, write-completion detection, and write atomicity are all simple.

Performance results [§5.4]

What cache line size performs best? Which protocol is best to use? Questions like these can be answered by simulation. However, getting the answer right is part art and part science.

Parameters need to be chosen for the simulator. The authors selected a single-level 4-way set-associative 1 MB cache with 64-byte lines.

The simulation assumes an idealized memory model, in which references take constant time. Why is this not realistic?

The simulated workload consists of six parallel programs from the SPLASH-2 suite and one multiprogrammed workload, consisting of mainly serial programs.

Effect of coherence protocol [§5.4.3]

Three coherence protocols were compared:

- The Illinois MESI protocol (Ill, left bar).
- The three-state invalidation protocol (3St) with bus upgrade for S → M transitions. (This means that instead of rereading data from main memory when a block moves to the M state, we just issue a bus transaction invalidating the other copies.)
- The three-state invalidation protocol without bus upgrade (3St-BusRdX). (This means that when a block moves to the M state, we reread it from main memory.)
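To see what the last two variants actually do differently, here is a short sketch (assumed hook names, my own) of the S → M write transition under each:

```c
/* Assumed bus hooks; illustrative names only. */
void bus_upgrade(void);         /* BusUpgr: address-only transaction that
                                   invalidates other copies, no data      */
void bus_read_exclusive(void);  /* BusRdX: refetch the whole block from
                                   memory and invalidate other copies     */

/* 3St (with bus upgrade): we already hold valid data, so only an
   invalidation is needed -- address traffic, no data transfer. */
void write_to_shared_3St(void) {
    bus_upgrade();
    /* S -> M, then perform the write locally */
}

/* 3St-BusRdX (without bus upgrade): the block is reread from memory
   even though our copy is valid -- extra data-bus traffic. */
void write_to_shared_3St_BusRdX(void) {
    bus_read_exclusive();
    /* S -> M, then perform the write locally */
}
```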

[Bar chart: address-bus and data-bus traffic (MB/s) for Barnes, LU, Ocean, Radiosity, Radix, and Raytrace under the Ill, 3St, and 3St-RdEx protocols.]

In our parallel programs, which protocol seems to be best?

Somewhat surprisingly, the result turns out to be the same for the multiprogrammed workload. The reason for this? The advantage of the four-state protocol is that no bus traffic is generated on E → M transitions. But E → M transitions are very rare (less than 1 per 1K references).

Effect of cache line size [§5.4.4]

Recall from Lecture 11 that cache misses can be classified into four categories:

Cold misses (called compulsory misses in the previous discussion) occur the first time that a block is referenced.

Conflict misses are misses that would not occur if the cache were fully associative with LRU replacement.

Capacity misses occur when the cache is not large enough to hold the data between references.

Coherence misses are misses caused by the coherence protocol. Coherence misses can be divided into those caused by true sharing and those caused by false sharing.

False-sharing misses are those caused by having a line size larger than one word. Can you explain?
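To make false sharing concrete, here is a small illustrative C program (my own sketch, not from the lecture; the 64-byte line size is an assumption). Two threads increment logically independent counters, but because the counters sit in the same cache line, every write by one thread invalidates, or forces an update of, the line in the other thread's cache:

```c
#include <pthread.h>
#include <stdio.h>

#define ITERS 10000000L

/* Two logically independent counters that happen to share a cache line. */
struct { long a; long b; } shared_line;

/* One possible fix (unused here, shown for contrast): pad so each
   counter gets its own 64-byte line. */
struct { long a; char pad[56]; long b; } padded_line;

void *bump_a(void *arg) {
    for (long i = 0; i < ITERS; i++)
        shared_line.a++;   /* each write invalidates/updates the line in
                              the other thread's cache, though it never
                              touches b */
    return NULL;
}

void *bump_b(void *arg) {
    for (long i = 0; i < ITERS; i++)
        shared_line.b++;   /* the line ping-pongs between the two caches */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("a = %ld, b = %ld\n", shared_line.a, shared_line.b);
    return 0;
}
```

Neither thread ever reads the other's counter, so no real communication occurs; every coherence miss here is a false-sharing miss, and switching to the padded layout makes them disappear.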

True-sharing misses, on the other hand, occur when a processor writes some words into a cache block, invalidating the block in another processor's cache, after which the other processor reads one of the modified words.

How could we attack each of the four kinds of misses?

To reduce capacity misses, we could

To reduce conflict misses, we could

To reduce cold misses, we could

To reduce coherence misses, we could

If we increase the line size, the number of coherence misses might go up or down. Why?

Increasing the line size has other disadvantages.

It increases conflict misses. Why?

It increases bus traffic. Why?

So it is not clear which line size will work best.

[Bar chart: miss rate (%) broken down into cold, capacity, true-sharing, false-sharing, and upgrade components for Barnes, LU, and Radiosity at line sizes of 8, 16, 32, 64, 128, and 256 bytes.]

Results for the first three applications seem to show that which line size is best?

[Bar chart: miss rate (%) with the same breakdown for Ocean, Radix, and Raytrace at line sizes of 8 to 256 bytes.]

For the second set of applications, Radix shows a greatly increasing number of false-sharing misses with increasing block size.

However, this is not the whole story. Larger line sizes also create more bus traffic.

[Bar chart: address-bus and data-bus traffic (bytes/instruction) for Barnes, Radiosity, and Raytrace at line sizes of 8 to 256 bytes.]

With this in mind, which line size would you say is best?

Invalidate vs. update [§5.4.5]

Which is better, an update or an invalidation protocol? At first glance, it might seem that update schemes would always be superior to write-invalidate schemes. Why might this be true? Why might this not be true?

When there are not many external rereads,

When there is a high degree of sharing,

For example, in a producer-consumer pattern,
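As a concrete instance of the producer-consumer pattern just mentioned, here is a small C sketch (my own; the lecture gives no code, and the use of C11 atomics is an assumption). Under an update protocol, each write by the producer refreshes the consumer's cached copy in place, so the consumer's reads hit; under an invalidation protocol, every write invalidates the consumer's copy and forces a miss on its next read:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

/* One word repeatedly written by one thread and read by another:
   the case where the two protocol families behave differently. */
atomic_int shared_value = 0;

void *producer(void *arg) {
    for (int i = 1; i <= 1000; i++)
        atomic_store(&shared_value, i);  /* update: refresh the consumer's
                                            copy; invalidate: kill it, so
                                            the consumer misses next time */
    return NULL;
}

void *consumer(void *arg) {
    long sum = 0;
    while (atomic_load(&shared_value) < 1000)  /* reads hit under update;  */
        sum += atomic_load(&shared_value);     /* miss repeatedly under    */
    return (void *)sum;                        /* invalidation             */
}

int main(void) {
    pthread_t p, c;
    void *sum;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, &sum);
    printf("consumer accumulated %ld\n", (long)sum);
    return 0;
}
```

Note the flip side, taken up below: if the producer writes the value many times before the consumer reads it, each extra write is wasted bus traffic under an update protocol but costs nothing extra under invalidation.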

Update and invalidation schemes can be combined (see §5.4.5).

Let's look at real programs.

[Bar charts: miss rate (%) broken down into cold, capacity, true-sharing, and false-sharing components for LU, Ocean, Raytrace, and Radix under invalidation (inv), update (upd), and mixed (mix) protocols.]

Where there are many coherence misses,

If there were many capacity misses,

So let's look at bus traffic.

Note that in two of the applications, updates in an update protocol are much more prevalent than upgrades in an invalidation protocol.

[Bar charts: upgrade/update rate (%) for LU, Ocean, Raytrace, and Radix under invalidation (inv), update (upd), and mixed (mix) protocols.]

Each of these operations produces bus traffic; therefore, the update protocol causes more traffic.

The main problem is that one processor tends to write a block multiple times before another processor reads it. This causes several bus transactions instead of one, as there would be in an invalidation protocol.

In addition, updates cause problems in non-bus-based multiprocessors.