Lecture 9 Outline. Lower-Level Protocol Choices. MESI (4-state) Invalidation Protocol. MESI: Processor-Initiated Transactions


Outline
- MESI protocol
- Dragon update-based protocol
- Impact of protocol optimizations

Lower-Level Protocol Choices
BusRd observed in M state: what transition to make?
- Change to S: assume I'll read again soon. Good for mostly-read data.
- What about "migratory" data (I read and write, then you read and write, then X reads and writes...)? Thus, change to I: assume another processor will write to it (Synapse).
- Sequent Symmetry and MIT Alewife use adaptive protocols.

Gehringer, based on slides by Yan Solihin

MESI (4-state) Invalidation Protocol
Problem with the MSI protocol:
- A Rd, Wr sequence incurs two bus transactions even when no one is sharing (e.g., a serial program!): BusRd (I -> S) followed by BusRdX or BusUpgr (S -> M).
- In general, penalizing serial programs is unacceptable.
Add an exclusive state:
- Invalid
- Modified (dirty)
- Shared (two or more caches may have copies)
- Exclusive (only this cache has a clean copy, same value as in memory)
How to decide between I -> E and I -> S?
- Need to check whether someone else has a copy.
- Shared signal on bus: wired-OR line asserted in response to BusRd.

MESI: Processor-Initiated Transactions
[State transition diagram. Recoverable edge labels: PrRd/-, PrWr/-, PrWr/BusRdX, PrRd/BusRd(S), PrRd/BusRd(~S).]

MESI: Bus-Initiated Transactions
[State transition diagram. Recoverable edge labels: BusRd/Flush, BusRdX/Flush, BusRd/Flush1, BusRdX/Flush1.]

MESI State Transition Diagram
[Combined diagram of the processor-initiated and bus-initiated transitions.]
(S) means the shared line is asserted on a BusRd transaction.
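The MESI transitions described above can be sketched as a small single-block simulator. This is a minimal illustration, not the lecture's own code: the names `State`, `mesi_access`, and `run_trace` are hypothetical, processors are numbered from 0, the access sequence is made up for the demonstration, and a write hit in S is assumed to use BusUpgr rather than BusRdX.

```python
from enum import Enum


class State(Enum):
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"


def mesi_access(states, proc, op):
    """Apply one processor read ("R") or write ("W") to a single block.

    states is a list with one State per cache and is updated in place.
    Returns the bus transaction issued, or None when no bus traffic is needed.
    """
    me = states[proc]
    others = [i for i in range(len(states)) if i != proc]
    if op == "R":
        if me is not State.I:
            return None                   # PrRd hit in M, E or S
        # Wired-OR shared line: asserted if any other cache holds the block.
        shared = any(states[i] is not State.I for i in others)
        for i in others:                  # snoopers react to BusRd
            if states[i] in (State.M, State.E):
                states[i] = State.S       # an M holder also flushes its data
        states[proc] = State.S if shared else State.E
        return "BusRd"
    if op == "W":
        if me is State.M:
            return None                   # PrWr hit in M
        if me is State.E:
            states[proc] = State.M        # E -> M needs no bus transaction
            return None
        bus = "BusUpgr" if me is State.S else "BusRdX"
        for i in others:
            states[i] = State.I           # invalidate every other copy
        states[proc] = State.M
        return bus
    raise ValueError(op)


def run_trace(trace, n_caches=3):
    """Run (proc, op) pairs from an all-Invalid start; return states and bus log."""
    states = [State.I] * n_caches
    bus_log = [mesi_access(states, proc, op) for proc, op in trace]
    return states, bus_log


# Illustrative sequence: R0, W0, R2, W2, R0, R2, R1.
example = [(0, "R"), (0, "W"), (2, "R"), (2, "W"), (0, "R"), (2, "R"), (1, "R")]
final_states, bus_log = run_trace(example)
print([s.name for s in final_states], bus_log)
```

In this sketch a read followed by a write on a single processor issues only one BusRd, which is the saving the Exclusive state is meant to provide for serial programs.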

Flush vs. Flush1 (Flush' in textbook)
- Flush: mandatory
- Flush' (Flush1): happens only when cache-to-cache sharing is used, and only one cache flushes the data

Visualization
[Animation frames showing the caches, the bus, and main memory; most values of X were lost in transcription. One frame shows a write to X ("wr &X").]
One less bus request due to the Exclusive state, esp. for serial programs.

Visualization (continued)
[Animation frames; values of X partly lost in transcription. Recoverable annotations: a write to X ("wr &X") issues BusUpgr; later frames show Flush and Flush1.]
Note: BusUpgr instead of BusRdX.
Flush1 is referred to as cache-to-cache transfer in the Illinois protocol.

Example (Cache-to-Cache Transfer)
[Table with columns Proc Action, State (one per cache), Bus Action, Data From. Recoverable entries: writes W1 and W3, a Flush1 bus action, and data supplied from memory or from another cache.]
* Data from memory if no cache-to-cache transfer.

Example (Cache-to-Cache Transfer + BusUpgr)
[Table with columns Proc Action, State (one per cache), Bus Action, Data From. Recoverable entries: writes W1 and W3, a BusUpgr bus action, and data supplied from memory or from another cache.]
* Data from memory if no cache-to-cache transfer.

Lower-Level Protocol Choices
Who supplies data on a miss when the block is not in M state: memory or cache?
- Original Illinois MESI: cache. Assumes cache is faster than memory (cache-to-cache transfer).
  - Not necessarily true.
- Adds complexity:
  - How does memory know it should supply data? (It must wait for the caches.)
  - A selection algorithm is needed if multiple caches have valid data.
- Valuable for distributed memory:
  - May be cheaper to obtain from a nearby cache than from distant memory.
  - Especially when constructed out of SMP nodes (Stanford DASH).

Outline
- MESI protocol
- Dragon update-based protocol
- Impact of protocol optimizations

Dragon Write-back Update Protocol
Four states:
- Exclusive-clean (E): I and memory have it.
- Shared clean (Sc): I, others, and maybe memory, but I'm not the owner.
- Shared modified (Sm): I and others but not memory, and I'm the owner.
  - Sm and Sc can coexist in different caches, with at most one Sm.
- Modified or dirty (M): I have it, and no one else.
On replacement: Sc can silently drop, Sm has to flush.
No invalid state:
- If in cache, cannot be invalid.
- If not present in cache, can view as being in a not-present or invalid state.
New processor events: PrRdMiss, PrWrMiss
- Introduced to specify actions when the block is not present in the cache.
New bus transaction: BusUpd
- Broadcasts the single word written on the bus; updates other relevant caches.

Dragon State Transition Diagram
[Combined diagram over states E, Sc, Sm, M.]

Dragon: Processor-Initiated Transactions
[State transition diagram. Recoverable edge labels: PrRd/-, PrWr/-, PrRdMiss/BusRd(S), PrRdMiss/BusRd(~S), PrWr/BusUpd(S), PrWr/BusUpd(~S), PrWrMiss/(BusRd(S); BusUpd), PrWrMiss/BusRd(~S), BusUpd/Update.]
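The Dragon states and events listed above can be sketched in the same single-block style. This is a hedged illustration rather than the lecture's implementation: `dragon_access` and `dragon_evict` are hypothetical names, states are plain strings with `None` standing for "not present", processors are numbered from 0, and the write-back behaviour of M and the silent drop of E on replacement are assumptions (the slides state only the Sc and Sm cases).

```python
def dragon_access(states, proc, op):
    """Apply one read ("R") or write ("W") by cache `proc` to a single block.

    states holds "E", "Sc", "Sm", "M", or None (block not present) per cache
    and is updated in place. Returns the list of bus transactions issued.
    """
    if op not in ("R", "W"):
        raise ValueError(op)
    others = [i for i in range(len(states)) if i != proc]
    # Shared line: asserted when any other cache holds the block. Dragon never
    # invalidates, so the value stays correct for the whole access.
    shared = any(states[i] is not None for i in others)
    bus = []
    me = states[proc]
    if me is None:                        # PrRdMiss or PrWrMiss
        bus.append("BusRd")
        for i in others:
            if states[i] == "E":
                states[i] = "Sc"
            elif states[i] == "M":
                states[i] = "Sm"          # owner supplies the data (Flush)
        me = "Sc" if shared else "E"
    if op == "W":
        if me in ("Sc", "Sm"):
            bus.append("BusUpd")          # broadcast the written word
            for i in others:
                if states[i] == "Sm":
                    states[i] = "Sc"      # ownership moves to the writer
            me = "Sm" if shared else "M"  # BusUpd(S) vs. BusUpd(~S)
        else:
            me = "M"                      # E or M: write without bus traffic
    states[proc] = me
    return bus


def dragon_evict(states, proc):
    """Replace the block. Sm must flush; M is assumed to write back as well."""
    dirty = states[proc] in ("Sm", "M")
    states[proc] = None
    return "Flush" if dirty else None


# Illustrative sequence: R0, W0, R2, W2, R0, R2, R1.
states = [None, None, None]
trace = [(0, "R"), (0, "W"), (2, "R"), (2, "W"), (0, "R"), (2, "R"), (1, "R")]
log = [dragon_access(states, proc, op) for proc, op in trace]
print(states, log)
```

Unlike an invalidation protocol, the second read by processor 0 in this sequence is a hit, because the earlier BusUpd refreshed its copy instead of invalidating it.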

Dragon: Bus-Initiated Transactions
[State transition diagram over states E, Sc, Sm, M. Recoverable edge label: BusUpd/Update; the remaining labels were lost in transcription.]

[Animation frames showing the caches, the bus, and main memory; most values of X were lost in transcription. One frame shows a write to X ("wr &X").]
One less bus request due to the Exclusive state, esp. for serial programs.

[Animation frames, continued: copies of X held in Sm and Sc states; a write to X ("wr &X") issues BusUpd.]
Note: BusUpd instead of BusUpgr (no invalidation is performed).
This is a miss in the MSI and MESI protocols.
Note: only the cache holding the block in Sm is responsible for cache-to-cache transfer.

Dragon Example
[Table with columns Proc Action, State (one per cache), Bus Action, Data From. Recoverable entries: writes W1 and W3; states including Sc and Sm; a BusUpd/Upd bus action; data supplied from memory or from another cache.]
On replacement of X, the owner is responsible for writing back to memory, vs. MSI or MESI, where writeback happens only when the line is in M state.

Lower-Level Protocol Choices
Can the shared-modified state be eliminated?
- Yes, if memory is updated as well on BusUpd transactions (DEC Firefly).
- The Dragon protocol doesn't (it assumes DRAM memory is slow to update).
Should replacement of an Sc block be broadcast?
- Would allow the last copy to go to Exclusive state and not generate updates.
- The replacement bus transaction is not in the critical path; a later update may be.
Shouldn't update the local copy on a write hit before the controller gets the bus.
- Can mess up serialization.
Coherence and consistency considerations are much like the write-through case.
In general, there are many subtle race conditions in protocols.
But first, let's illustrate quantitative assessment at the logical level.

Outline
- MESI protocol
- Dragon update-based protocol
- Impact of protocol optimizations

Assessing Protocol Tradeoffs
Methodology:
- Use a simulator; choose parameters per the earlier methodology (defaults as transcribed, with some digits lost: "1 MB, [?]-way cache, [?]-byte block, 1[?] processors; [?]K cache for some").
- Focus on frequencies, not end performance for now.
  - This transcends architectural details, but is not what we're really after.
- Use an idealized memory performance model to avoid changes of reference interleaving across processors with machine parameters.
  - Cheap simulation: no need to model contention.

Impact of Protocol Optimizations
[Two bar charts of Traffic (MB/s) comparing three protocol variants; the protocol names in the title were lost in transcription, but the second variant uses BusUpgr and the third uses BusRdX. Bars per workload: Ill, 3St, 3St-RdEx. Workloads: Barnes, LU, Ocean, Radiosity, Radix, Raytrace in one chart; Appl-Code, Appl-Data, OS-Code, OS-Data in the other.]
- Upgrades instead of read-exclusive helps.
- Same story when working sets don't fit for Ocean, Radix, Raytrace.

Impact of Cache-Block Size
Multiprocessors add a new kind of miss to cold, capacity, and conflict:
- Coherence misses: due to invalidations
  - True sharing: write to the same word
  - False sharing: write to different words
Reducing misses architecturally in an invalidation protocol:
- Capacity: enlarge cache; increase block size (if spatial locality)
- Conflict: increase associativity
- Cold and coherence: only block size
Increasing block size has advantages and disadvantages:
- Can reduce misses if spatial locality is good
- Can hurt too:
  - increase misses due to false sharing if spatial locality is not good
  - increase misses due to conflicts in a fixed-size cache
  - increase traffic due to fetching unnecessary data and due to false sharing
  - can increase miss penalty and perhaps hit cost

Impact of Block Size on Miss Rate
For default problem size: vary block/line size from 8 to 256 bytes.
[Two stacked bar charts of miss rate (%) by block size, one for Barnes, LU, Radiosity and one for Ocean, Radix, Raytrace. Legend: Upgrade, False sharing, True sharing, Capacity, Cold.]
- Decreases with larger lines: cold, capacity (due to spatial locality), true sharing (due to spatial locality)
- Increases with larger lines: false sharing
- Working set doesn't fit: impact of capacity misses is large (Ocean, Radix)

Impact of Block Size on Traffic
Traffic affects performance indirectly through contention.
[Three charts of traffic by block size: bytes/instruction for Barnes, Radiosity, Raytrace; bytes/instruction for Radix; bytes/FLOP for LU, Ocean.]
- Results differ from those for miss rate: traffic almost always increases.
- When the working set fits, overall traffic is still small, except for Radix.
- Fixed overhead is a significant component, so total traffic is often minimized at 16-32 byte blocks, not smaller.
- Working set doesn't fit: even a 1[?]-byte block (digits lost in transcription) is good for Ocean due to capacity.
- [...] traffic behaves in the opposite way to the data bus traffic.
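The effect of block size on coherence misses can be illustrated with a toy model. This is a sketch under stated assumptions, not the simulator used for the results above: `count_misses` is a hypothetical helper, caches are infinite (so there are no capacity or conflict misses), the protocol is write-invalidate, block size is given in words, and a write to a block the writer already holds is treated as an upgrade rather than a miss.

```python
def count_misses(trace, block_words):
    """Count cold and coherence misses for a trace of (proc, op, word_addr).

    op is "R" or "W". Caches are infinite and a write invalidates every
    other copy of the block.
    """
    valid = {}     # block number -> set of processors with a valid copy
    seen = set()   # (proc, block) pairs that have been loaded at least once
    misses = {"cold": 0, "coherence": 0}
    for proc, op, addr in trace:
        block = addr // block_words
        holders = valid.setdefault(block, set())
        if proc not in holders:
            # First touch is a cold miss; otherwise the copy was invalidated.
            kind = "coherence" if (proc, block) in seen else "cold"
            misses[kind] += 1
            seen.add((proc, block))
            holders.add(proc)
        if op == "W":
            valid[block] = {proc}         # invalidate all other copies
    return misses


# Two processors repeatedly write to different, adjacent words.
trace = [(0, "W", 0), (1, "W", 1)] * 4
for block_words in (1, 2):
    print(block_words, count_misses(trace, block_words))
```

With one word per block the two processors never interfere. With two words per block every access after the first pair misses, and since the processors never touch the same word, all of those coherence misses are false sharing.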


CMSC 611: Advanced. Distributed & Shared Memory

CMSC 611: Advanced. Distributed & Shared Memory CMSC 611: Advanced Computer Architecture Distributed & Shared Memory Centralized Shared Memory MIMD Processors share a single centralized memory through a bus interconnect Feasible for small processor

More information

Scalable Cache Coherence

Scalable Cache Coherence arallel Computing Scalable Cache Coherence Hwansoo Han Hierarchical Cache Coherence Hierarchies in cache organization Multiple levels of caches on a processor Large scale multiprocessors with hierarchy

More information

Caches Concepts Review

Caches Concepts Review Caches Concepts Review What is a block address? Why not bring just what is needed by the processor? What is a set associative cache? Write-through? Write-back? Then we ll see: Block allocation policy on

More information

Slide Set 9. for ENCM 369 Winter 2018 Section 01. Steve Norman, PhD, PEng

Slide Set 9. for ENCM 369 Winter 2018 Section 01. Steve Norman, PhD, PEng Slide Set 9 for ENCM 369 Winter 2018 Section 01 Steve Norman, PhD, PEng Electrical & Computer Engineering Schulich School of Engineering University of Calgary March 2018 ENCM 369 Winter 2018 Section 01

More information

Lecture 8: Snooping and Directory Protocols. Topics: split-transaction implementation details, directory implementations (memory- and cache-based)

Lecture 8: Snooping and Directory Protocols. Topics: split-transaction implementation details, directory implementations (memory- and cache-based) Lecture 8: Snooping and Directory Protocols Topics: split-transaction implementation details, directory implementations (memory- and cache-based) 1 Split Transaction Bus So far, we have assumed that a

More information

Chapter Seven. SRAM: value is stored on a pair of inverting gates very fast but takes up more space than DRAM (4 to 6 transistors)

Chapter Seven. SRAM: value is stored on a pair of inverting gates very fast but takes up more space than DRAM (4 to 6 transistors) Chapter Seven emories: Review SRA: value is stored on a pair of inverting gates very fast but takes up more space than DRA (4 to transistors) DRA: value is stored as a charge on capacitor (must be refreshed)

More information

Scalable Cache Coherence

Scalable Cache Coherence Scalable Cache Coherence [ 8.1] All of the cache-coherent systems we have talked about until now have had a bus. Not only does the bus guarantee serialization of transactions; it also serves as convenient

More information

Lecture 3: Directory Protocol Implementations. Topics: coherence vs. msg-passing, corner cases in directory protocols

Lecture 3: Directory Protocol Implementations. Topics: coherence vs. msg-passing, corner cases in directory protocols Lecture 3: Directory Protocol Implementations Topics: coherence vs. msg-passing, corner cases in directory protocols 1 Future Scalable Designs Intel s Single Cloud Computer (SCC): an example prototype

More information

Scalable Cache Coherent Systems Scalable distributed shared memory machines Assumptions:

Scalable Cache Coherent Systems Scalable distributed shared memory machines Assumptions: Scalable ache oherent Systems Scalable distributed shared memory machines ssumptions: rocessor-ache-memory nodes connected by scalable network. Distributed shared physical address space. ommunication assist

More information

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies TDT4255 Lecture 10: Memory hierarchies Donn Morrison Department of Computer Science 2 Outline Chapter 5 - Memory hierarchies (5.1-5.5) Temporal and spacial locality Hits and misses Direct-mapped, set associative,

More information

SIGNET: NETWORK-ON-CHIP FILTERING FOR COARSE VECTOR DIRECTORIES. Natalie Enright Jerger University of Toronto

SIGNET: NETWORK-ON-CHIP FILTERING FOR COARSE VECTOR DIRECTORIES. Natalie Enright Jerger University of Toronto SIGNET: NETWORK-ON-CHIP FILTERING FOR COARSE VECTOR DIRECTORIES University of Toronto Interaction of Coherence and Network 2 Cache coherence protocol drives network-on-chip traffic Scalable coherence protocols

More information

The Cache Write Problem

The Cache Write Problem Cache Coherency A multiprocessor and a multicomputer each comprise a number of independent processors connected by a communications medium, either a bus or more advanced switching system, such as a crossbar

More information

ECE7660 Parallel Computer Architecture. Shared Memory Multiprocessors

ECE7660 Parallel Computer Architecture. Shared Memory Multiprocessors ECE7660 Parallel Computer Architecture Shared Memory Multiprocessors 1 Layer Perspective CAD Database Scientific modeling Parallel applications Multipr ogramming Shar ed addr ess Message passing Data parallel

More information

Cache Coherence: Part II Scalable Approaches

Cache Coherence: Part II Scalable Approaches ache oherence: art II Scalable pproaches Hierarchical ache oherence Todd. Mowry S 74 October 27, 2 (a) 1 2 1 2 (b) 1 Topics Hierarchies Directory rotocols Hierarchies arise in different ways: (a) processor

More information

Lecture 24: Board Notes: Cache Coherency

Lecture 24: Board Notes: Cache Coherency Lecture 24: Board Notes: Cache Coherency Part A: What makes a memory system coherent? Generally, 3 qualities that must be preserved (SUGGESTIONS?) (1) Preserve program order: - A read of A by P 1 will

More information

Cray XE6 Performance Workshop

Cray XE6 Performance Workshop Cray XE6 Performance Workshop Cache Coherency Mark Bull David Henty EPCC, University of Edinburgh ymmetric MultiProcessing Each processor in an MP has equal access to all parts of memory same latency and

More information

A Scalable SAS Machine

A Scalable SAS Machine arallel omputer Organization and Design : Lecture 8 er Stenström. 2008, Sally. ckee 2009 Scalable ache oherence Design principles of scalable cache protocols Overview of design space (8.1) Basic operation

More information

Bus-based shared-memory multiprocessors, or symmetric multiprocessors,

Bus-based shared-memory multiprocessors, or symmetric multiprocessors, Caching in Distributed ystems Aleksandar ilenkovic University of Belgrade n bus-based shared-memory multiprocessors, several techniques reduce cache misses and bus traffic, the key obstacles to high performance.

More information

Bus-Based Coherent Multiprocessors

Bus-Based Coherent Multiprocessors Bus-Based Coherent Multiprocessors Lecture 13 (Chapter 7) 1 Outline Bus-based coherence Memory consistency Sequential consistency Invalidation vs. update coherence protocols Several Configurations for

More information

Outline. EEL 5764 Graduate Computer Architecture. Chapter 4 - Multiprocessors and TLP. Déjà vu all over again?

Outline. EEL 5764 Graduate Computer Architecture. Chapter 4 - Multiprocessors and TLP. Déjà vu all over again? Outline EEL 5764 Graduate Computer Architecture Chapter 4 - Multiprocessors and TLP Ann Gordon-Ross Electrical and Computer Engineering University of Florida MP Motivation ID v. IMD v. MIMD Centralized

More information

Scalable Cache Coherence. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

Scalable Cache Coherence. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University Scalable Cache Coherence Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Hierarchical Cache Coherence Hierarchies in cache organization Multiple levels

More information

Performance metrics for caches

Performance metrics for caches Performance metrics for caches Basic performance metric: hit ratio h h = Number of memory references that hit in the cache / total number of memory references Typically h = 0.90 to 0.97 Equivalent metric:

More information

CMSC 411 Computer Systems Architecture Lecture 21 Multiprocessors 3

CMSC 411 Computer Systems Architecture Lecture 21 Multiprocessors 3 MS 411 omputer Systems rchitecture Lecture 21 Multiprocessors 3 Outline Review oherence Write onsistency dministrivia Snooping Building Blocks Snooping protocols and examples oherence traffic and performance

More information

ESE 545 Computer Architecture Symmetric Multiprocessors and Snoopy Cache Coherence Protocols CA SMP and cache coherence

ESE 545 Computer Architecture Symmetric Multiprocessors and Snoopy Cache Coherence Protocols CA SMP and cache coherence Computer Architecture ESE 545 Computer Architecture Symmetric Multiprocessors and Snoopy Cache Coherence Protocols 1 Shared Memory Multiprocessor Memory Bus P 1 Snoopy Cache Physical Memory P 2 Snoopy

More information

Introduction. Memory Hierarchy

Introduction. Memory Hierarchy Introduction Why memory subsystem design is important CPU speeds increase 25%-30% per year DRAM speeds increase 2%-11% per year 1 Memory Hierarchy Levels of memory with different sizes & speeds close to

More information

Lecture 11 Cache. Peng Liu.

Lecture 11 Cache. Peng Liu. Lecture 11 Cache Peng Liu liupeng@zju.edu.cn 1 Associative Cache Example 2 Associative Cache Example 3 Associativity Example Compare 4-block caches Direct mapped, 2-way set associative, fully associative

More information

Lecture 4: Directory Protocols and TM. Topics: corner cases in directory protocols, lazy TM

Lecture 4: Directory Protocols and TM. Topics: corner cases in directory protocols, lazy TM Lecture 4: Directory Protocols and TM Topics: corner cases in directory protocols, lazy TM 1 Handling Reads When the home receives a read request, it looks up memory (speculative read) and directory in

More information

SISTEMI EMBEDDED. Computer Organization Memory Hierarchy, Cache Memory. Federico Baronti Last version:

SISTEMI EMBEDDED. Computer Organization Memory Hierarchy, Cache Memory. Federico Baronti Last version: SISTEMI EMBEDDED Computer Organization Memory Hierarchy, Cache Memory Federico Baronti Last version: 20160524 Ideal memory is fast, large, and inexpensive Not feasible with current memory technology, so

More information

Computer Science 432/563 Operating Systems The College of Saint Rose Spring Topic Notes: Memory Hierarchy

Computer Science 432/563 Operating Systems The College of Saint Rose Spring Topic Notes: Memory Hierarchy Computer Science 432/563 Operating Systems The College of Saint Rose Spring 2016 Topic Notes: Memory Hierarchy We will revisit a topic now that cuts across systems classes: memory hierarchies. We often

More information

Introduction to Multiprocessors (Part II) Cristina Silvano Politecnico di Milano

Introduction to Multiprocessors (Part II) Cristina Silvano Politecnico di Milano Introduction to Multiprocessors (Part II) Cristina Silvano Politecnico di Milano Outline The problem of cache coherence Snooping protocols Directory-based protocols Prof. Cristina Silvano, Politecnico

More information

Limitations of parallel processing

Limitations of parallel processing Your professor du jour: Steve Gribble gribble@cs.washington.edu 323B Sieg Hall all material in this lecture in Henessey and Patterson, Chapter 8 635-640 645, 646 654-665 11/8/00 CSE 471 Multiprocessors

More information

Introduction. Coherency vs Consistency. Lec-11. Multi-Threading Concepts: Coherency, Consistency, and Synchronization

Introduction. Coherency vs Consistency. Lec-11. Multi-Threading Concepts: Coherency, Consistency, and Synchronization Lec-11 Multi-Threading Concepts: Coherency, Consistency, and Synchronization Coherency vs Consistency Memory coherency and consistency are major concerns in the design of shared-memory systems. Consistency

More information

Page 1. Multilevel Memories (Improving performance using a little cash )

Page 1. Multilevel Memories (Improving performance using a little cash ) Page 1 Multilevel Memories (Improving performance using a little cash ) 1 Page 2 CPU-Memory Bottleneck CPU Memory Performance of high-speed computers is usually limited by memory bandwidth & latency Latency

More information

Suggested Readings! What makes a memory system coherent?! Lecture 27" Cache Coherency! ! Readings! ! Program order!! Sequential writes!! Causality!

Suggested Readings! What makes a memory system coherent?! Lecture 27 Cache Coherency! ! Readings! ! Program order!! Sequential writes!! Causality! 1! 2! Suggested Readings! Readings!! H&P: Chapter 5.8! Could also look at material on CD referenced on p. 538 of your text! Lecture 27" Cache Coherency! 3! Processor components! Multicore processors and

More information

Cache Coherence (II) Instructor: Josep Torrellas CS533. Copyright Josep Torrellas

Cache Coherence (II) Instructor: Josep Torrellas CS533. Copyright Josep Torrellas Cache Coherence (II) Instructor: Josep Torrellas CS533 Copyright Josep Torrellas 2003 1 Sparse Directories Since total # of cache blocks in machine is much less than total # of memory blocks, most directory

More information

Lecture 12. Memory Design & Caches, part 2. Christos Kozyrakis Stanford University

Lecture 12. Memory Design & Caches, part 2. Christos Kozyrakis Stanford University Lecture 12 Memory Design & Caches, part 2 Christos Kozyrakis Stanford University http://eeclass.stanford.edu/ee108b 1 Announcements HW3 is due today PA2 is available on-line today Part 1 is due on 2/27

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

Cache Performance, System Performance, and Off-Chip Bandwidth... Pick any Two

Cache Performance, System Performance, and Off-Chip Bandwidth... Pick any Two Cache Performance, System Performance, and Off-Chip Bandwidth... Pick any Two Bushra Ahsan and Mohamed Zahran Dept. of Electrical Engineering City University of New York ahsan bushra@yahoo.com mzahran@ccny.cuny.edu

More information

CSC/ECE 506: Architecture of Parallel Computers Sample Final Examination with Answers

CSC/ECE 506: Architecture of Parallel Computers Sample Final Examination with Answers CSC/ECE 506: Architecture of Parallel Computers Sample Final Examination with Answers This was a 180-minute open-book test. You were to answer five of the six questions. Each question was worth 20 points.

More information