A three-state update protocol


Whenever a bus update is generated, suppose that main memory, as well as the caches, updates its contents. Then which state don't we need? What's the advantage, then, of having the fourth state?

The Firefly protocol, named after a multiprocessor workstation developed by DEC, is an example of such a protocol. Here is a state diagram for the Firefly protocol:

[State-transition diagram: states V, S, and D, with processor-induced transitions (CWH, CRM, CWM) and bus-induced transitions (BR, BW).]

Key:
  CRM  CPU read miss
  CWM  CPU write miss
  CWH  CPU write hit
  BR   bus read
  BW   bus write

A ′ (prime) following a transition means SharedLine was asserted; an x means it was not. Read hits do not cause state transitions and are not shown.

What do you think the states are, and how do they correspond to the states in MESI?

The scheme works as follows:

Lecture 10 Architecture of Parallel Computers 1

On a read hit, the data is returned immediately to the processor, and no cache changes state.

On a read miss:
o If one or more other caches have a copy of the block, one of them supplies it directly to the requesting cache and raises the SharedLine. The bus timing is fixed so that all caches respond in the same cycle. All caches, including the requestor, set the state to shared. If the owning cache had the block in state dirty, the block is written to main memory at the same time.
o If no other cache had a copy of the block, it is read from main memory and assigned state valid-exclusive.

On a write hit:
o If the block is already dirty, the write proceeds to the cache without delay.
o If the block is valid-exclusive, the write proceeds without delay, and the state is changed to dirty.
o If the block is in state shared, the write is delayed until the bus is acquired and a write-word to main memory is initiated. Other caches pick the data off the bus and update their copies (if any); they also raise the SharedLine. The writing cache can determine whether the block is still being shared by testing this line:
  - If the SharedLine is not asserted, no other cache has a copy of the block, and the requesting cache changes to state valid-exclusive.
  - If the SharedLine is asserted, the block remains in state shared.

© 2010 Edward F. Gehringer CSC/ECE 506 Lecture Notes, Spring 2010 2

On a write miss:
o If any other caches have a copy of the block, they supply it. By inspecting the SharedLine, the requesting processor determines that the block has been supplied by another cache, and sets its state to shared. The block is also written to memory, and other caches pick the data off the bus and update their copies (if any).
o If no other cache has a copy of the block, the block is loaded from memory in state dirty.

In update protocols in general, since all writes appear on the bus, write serialization, write-completion detection, and write atomicity are all simple.

Performance of coherence protocols

[§5.4] What cache line size performs best? Which protocol is best to use? Questions like these can be answered by simulation. However, getting the answer right is part art and part science. Parameters need to be chosen for the simulator.

Culler & Singh (1998) selected a single-level 4-way set-associative 1 MB cache with 64-byte lines. The simulation assumes an idealized memory model, in which references take constant time. Why is this not realistic?

The simulated workload consists of six parallel programs from the SPLASH-2 suite and one multiprogrammed workload, consisting mainly of serial programs.

Effect of coherence protocol

[CS&G §5.4.3] Three coherence protocols were compared:

o The Illinois MESI protocol (Ill, left bar).
o The three-state invalidation protocol (3St) with bus upgrade for S→M transitions. (This means that instead of rereading data from main memory when a block moves to the M state, we just issue a bus transaction invalidating the other copies.)
o The three-state invalidation protocol without bus upgrade (3St-RdEx). (This means that when a block moves to the M state, we reread it from main memory.)

[Bar chart: address-bus and data-bus traffic (MB/s, 0–200) for Barnes, LU, Ocean, Radiosity, Radix, and Raytrace under the Ill, 3St, and 3St-RdEx protocols.]

In our parallel programs, which protocol seems to be best?

Somewhat surprisingly, the result turns out to be the same for the multiprogrammed workload. The reason for this? The advantage of the four-state protocol is that no bus traffic is generated on E→M transitions. But E→M transitions are very rare (less than 1 per 1K references).

Invalidate vs. update

[CS&G §5.4.5] Which is better, an update or an invalidation protocol? Let's look at real programs.

[Bar charts: miss rates (%) broken down into cold, capacity, true-sharing, and false-sharing misses for LU, Ocean, Raytrace, and Radix under invalidation (inv), update (upd), and mixed (mix) protocols.]

Where there are many coherence misses, …

If there were many capacity misses, …

So let's look at bus traffic…

Note that in two of the applications, updates in an update protocol are much more prevalent than upgrades in an invalidation protocol.

[Bar charts: upgrade/update rates (%) for LU, Ocean, Raytrace, and Radix under the inv, mix, and upd protocols.]

Each of these operations produces bus traffic; therefore, the update protocol causes more traffic.

The main problem is that one processor tends to write a block multiple times before another processor reads it. This causes several bus transactions instead of one, as there would be in an invalidation protocol.

In addition, updates cause problems in non-bus-based multiprocessors.

Effect of cache line size

[CS&G §5.4.4] Cache misses can be classified into four categories:

o Cold misses (or "compulsory misses") occur the first time that a block is referenced.
o Conflict misses are misses that would not occur if the cache were fully associative with LRU replacement.
o Capacity misses occur when the cache size is not sufficient to hold data between references.

o Coherence misses are misses caused by the coherence protocol. Coherence misses can be divided into those caused by true sharing and those caused by false sharing.
  - False-sharing misses are those caused by having a line size larger than one word. Can you explain?
  - True-sharing misses, on the other hand, occur when a processor writes some words into a cache block, invalidating the block in another processor's cache, after which the other processor reads one of the modified words.

How could we attack each of the four kinds of misses?

o To reduce capacity misses, we could …
o To reduce conflict misses, we could …
o To reduce cold misses, we could …
o To reduce coherence misses, we could …

If we increase the line size, the number of coherence misses might go up or down. What happens to the number of false-sharing misses? What happens to the number of true-sharing misses?

If we increase the line size, what happens to capacity misses?

conflict misses?

bus traffic?

So it is not clear which line size will work best.

[Bar chart: miss rates (%, 0–0.6) for Barnes, LU, and Radiosity at line sizes of 8 to 256 bytes, broken down into cold, capacity, true-sharing, false-sharing, and upgrade components.]

Results for the first three applications seem to show that which line size is best?

For the second set of applications, which do not fit in cache, Radix shows a greatly increasing number of false-sharing misses with increasing block size.

[Bar chart: miss rates (%, 0–12) for Ocean, Radix, and Raytrace at line sizes of 8 to 256 bytes, broken down into cold, capacity, true-sharing, false-sharing, and upgrade components.]

However, larger line sizes also create more bus traffic.

[Bar chart: address-bus and data-bus traffic (bytes/instruction, 0–0.18) for Barnes, Radiosity, and Raytrace at line sizes of 8 to 256 bytes.]

With this in mind, which line size would you say is best? 32 or 64.

Write propagation in multilevel caches

[§8.4.2] The coherence protocols we have seen so far have been based on one-level caches. Suppose each processor has its own L1 cache and L2 cache, and the L2 caches are kept coherent. Writes must be propagated both upstream and downstream. Define the terms.

o Downstream write propagation: …
o Upstream write propagation: …

Which makes downstream propagation simpler, a write-through or a write-back L1 cache? Why?

For upstream write propagation:
o An invalidation/intervention received by the L2 must be propagated to the L1 (in case the L1 has the block).
o The inclusion property cuts down the number of such upstream invalidations/interventions.

Lock Implementations

[§9.1] Recall the three kinds of synchronization from Lecture 6:
o Point-to-point
o Lock

Performance metrics for lock implementations

Uncontended latency
o Time to acquire the lock when there is no contention

Traffic
o Lock acquisition when the lock is already locked
o Lock acquisition when the lock is free
o Lock release

Fairness
o Degree to which a thread can acquire the lock relative to others

Storage
o As a function of the number of threads/processors

The need for atomicity

This code sequence illustrates the need for atomicity. Explain.

void lock(int *lockvar) {
    while (*lockvar == 1) {}  // wait until released
    *lockvar = 1;             // acquire lock
}

void unlock(int *lockvar) {
    *lockvar = 0;
}

In assembly language, the sequence looks like this:

lock:   ld  R1, &lockvar   // R1 = lockvar
        bnz R1, lock       // jump to lock if R1 != 0
        st  &lockvar, #1   // lockvar = 1
        ret                // return to caller

unlock: st  &lockvar, #0   // lockvar = 0
        ret                // return to caller

The ld-to-st sequence must be executed atomically:
o The sequence appears to execute in its entirety.
o Multiple sequences are serialized.

Examples of atomic instructions

o test-and-set Rx, M — read the value stored in memory location M, test the value against a constant (e.g., 0), and if they match, write the value in register Rx to memory location M.
o fetch-and-op M — read the value stored in memory location M, perform op on it (e.g., increment, decrement, addition, subtraction), then store the new value to memory location M.
o exchange Rx, M — atomically exchange (swap) the value in memory location M with the value in register Rx.
o compare-and-swap Rx, Ry, M — compare the value in memory location M with the value in register Rx. If they match, write the value in register Ry to M, and copy the value in Rx to Ry.

How to ensure that only one atomic instruction is executed at a time:

1. Reserve the bus until done.
   o Other atomic instructions cannot get to the bus.
2. Reserve the cache block involved until done.
   o Obtain exclusive permission (e.g., M in MESI).
   o Reject or delay any invalidation or intervention requests until done.
3. Provide the illusion of atomicity instead.
   o Use load-linked/store-conditional (to be discussed later).

Test and set

test-and-set is implemented like this:

lock:   t&s R1, &lockvar   // R1 = MEM[&lockvar];
                           // if (R1 == 0) MEM[&lockvar] = 1
        bnz R1, lock       // jump to lock if R1 != 0
        ret                // return to caller

unlock: st  &lockvar, #0   // MEM[&lockvar] = 0
        ret                // return to caller

What value does lockvar have when the lock is acquired? When it is free?

Here is an example of test-and-set execution. Describe what it shows.

Let's look at how a sequence of test-and-sets by three processors plays out:

Request       P1   P2   P3   Bus request
-----------   --   --   --   -----------
Initially     -    -    -
P1: t&s       M              BusRdX
P2: t&s       I    M         BusRdX
P3: t&s       I    I    M    BusRdX
P2: t&s       I    M    I    BusRdX
P1: unlock    M    I    I    BusRdX
P2: t&s       I    M    I    BusRdX
P3: t&s       I    I    M    BusRdX
P3: t&s       I    I    M
P2: unlock    I    M    I    BusRdX
P3: t&s       I    I    M    BusRdX
P3: unlock    I    I    M

How does test-and-set perform on the four metrics listed above?

o Uncontended latency: …
o Fairness: …
o Traffic: …
o Storage: …