CS 152 Computer Architecture and Engineering. Lecture 19: Directory-Based Cache Protocols

CS 152 Computer Architecture and Engineering, Lecture 19: Directory-Based Cache Protocols. Krste Asanovic, Electrical Engineering and Computer Sciences, University of California, Berkeley. http://www.eecs.berkeley.edu/~krste http://inst.cs.berkeley.edu/~cs152

Recap: Snoopy Cache Protocols. [Figure: processors M1, M2, and M3, each with a snoopy cache, plus a DMA engine and disks, all attached to a shared memory bus and physical memory.] Use the snoopy mechanism to keep all processors' view of memory coherent.

MESI: An Enhanced MSI Protocol, giving increased performance for private data. Each cache line carries state bits alongside its address tag. The states are M (Modified Exclusive), E (Exclusive but unmodified), S (Shared), and I (Invalid). [Figure: state-transition diagram for the cache state in processor P1. A write miss takes I to M; a read miss takes I to S if the block is shared, or to E if not. P1's intent to write takes S or E to M, and a P1 write takes E to M. Another processor's read takes E to S, and takes M to S with a write-back by P1. Another processor's intent to write takes S or E to I, and takes M to I with a write-back by P1. P1 reads and writes leave M unchanged; reads by any processor leave S unchanged; P1 reads leave E unchanged.]
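
These transitions can be summarized as two small handler functions, one for the local processor's accesses and one for snooped requests from other processors. This is a minimal sketch assuming standard MESI behavior as drawn on the slide; the function and parameter names are illustrative, not from the lecture.

```c
#include <stdbool.h>

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

/* Local access by P1 to a line (covers both hits and serviced misses).
 * 'shared_copy_exists' is what the bus reports on a read miss. */
mesi_t local_access(mesi_t s, bool is_write, bool shared_copy_exists) {
    if (is_write)                        /* write hit, or write miss with intent-to-write */
        return MODIFIED;                 /* I/S/E/M -> M */
    if (s == INVALID)                    /* read miss */
        return shared_copy_exists ? SHARED : EXCLUSIVE;
    return s;                            /* read hit: state unchanged */
}

/* Reaction to a request snooped from another processor. */
mesi_t snoop(mesi_t s, bool other_intends_to_write) {
    if (other_intends_to_write)          /* M writes back first, then any state -> I */
        return INVALID;
    if (s == MODIFIED || s == EXCLUSIVE) /* another processor reads: write back if M, */
        return SHARED;                   /* then downgrade to S */
    return s;
}
```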

Performance of Symmetric Shared-Memory Multiprocessors. Cache performance is a combination of (1) uniprocessor cache-miss traffic and (2) traffic caused by communication, which results in invalidations and subsequent cache misses. This adds a fourth C, the coherence miss (sometimes called a communication miss), joining compulsory, capacity, and conflict.

Coherency Misses. (1) True sharing misses arise from the communication of data through the cache-coherence mechanism: invalidations due to the first write to a shared block, and reads by another CPU of a block modified in a different cache. Such a miss would still occur if the block size were one word. (2) False sharing misses occur when a block is invalidated because some word in the block, other than the one being read, is written. The invalidation does not communicate a new value; it only causes an extra cache miss. The block is shared, but no word in the block is actually shared, and the miss would not occur if the block size were one word.

Example: True vs. False Sharing vs. Hit? Assume x1 and x2 are in the same cache block, and P1 and P2 have both read x1 and x2 previously.

Time 1, P1 writes x1: true miss; invalidates x1 in P2.
Time 2, P2 reads x2: false miss; x1 irrelevant to P2.
Time 3, P1 writes x1: false miss; x1 irrelevant to P2.
Time 4, P2 writes x2: false miss; x1 irrelevant to P2.
Time 5, P1 reads x2: true miss; invalidates x2 in P1.
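
To make false sharing concrete, here is a minimal sketch (not from the slides) in which two threads each update a different counter that will typically land in the same cache block; the names and the 64-byte line-size figure mentioned in the comment are assumptions for illustration.

```c
#include <pthread.h>
#include <stdio.h>

/* Two counters in adjacent words normally share one cache block, so each
 * thread's writes invalidate the block in the other thread's cache (false
 * sharing) even though no word is logically shared. Padding each counter
 * out to a full line (e.g., 64 bytes on many machines) removes the misses. */
struct { long a; long b; } counters;

void *bump_a(void *arg) {
    (void)arg;
    for (long i = 0; i < 10000000; i++) counters.a++;
    return NULL;
}

void *bump_b(void *arg) {
    (void)arg;
    for (long i = 0; i < 10000000; i++) counters.b++;
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("a=%ld b=%ld\n", counters.a, counters.b);
    return 0;
}
```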

MP Performance, 4-Processor Commercial Workload: OLTP, decision support (database), and search engine. True sharing and false sharing misses are essentially unchanged going from a 1 MB to an 8 MB L3 cache, while uniprocessor cache misses (instruction, capacity/conflict, compulsory) improve as cache size increases. [Figure: memory cycles per instruction, from 0 to about 3.25, versus L3 cache size (1 MB, 2 MB, 4 MB, 8 MB), broken down into instruction, capacity/conflict, cold, false sharing, and true sharing components.]

MP Performance, 2 MB Cache, Commercial Workload: OLTP, decision support (database), and search engine. True sharing and false sharing misses increase going from 1 to 8 CPUs. [Figure: memory cycles per instruction, from 0 to 3, versus processor count (1, 2, 4, 6, 8), broken down into instruction, conflict/capacity, cold, false sharing, and true sharing components.]

A Cache-Coherent System Must: provide a set of states, a state-transition diagram, and actions, and manage the coherence protocol, which means (0) determining when to invoke the coherence protocol, (a) finding information about the state of the address in other caches to determine whether it needs to communicate with other cached copies, (b) locating the other copies, and (c) communicating with those copies (invalidate/update). Step (0) is done the same way on all systems: the state of the line is maintained in the cache, and the protocol is invoked if an access fault occurs on the line. Different approaches are distinguished by how they handle (a) through (c).

Bus-Based Coherence. All of (a), (b), and (c) are done through broadcast on the bus: the faulting processor sends out a search, and the others respond to the search probe and take the necessary action. This could be done on a scalable network too, by broadcasting to all processors and letting them respond. It is conceptually simple, but broadcast doesn't scale with the number of processors P: on a bus, bus bandwidth doesn't scale, and on a scalable network every fault leads to at least P network transactions. Scalable coherence can keep the same cache states and state-transition diagram while using different mechanisms to manage the protocol.

Scalable Approach: Directories. Every memory block has associated directory information that keeps track of copies of cached blocks and their states. On a miss, the requester finds the directory entry, looks it up, and communicates only with the nodes that have copies, if necessary. In scalable networks, communication with the directory and the copies is carried out through network transactions. There are many alternatives for organizing the directory information.

Basic Operation of a Directory. [Figure: k processors, each with a cache, connected through an interconnection network to memory and its directory.] With each cache block in memory the directory keeps k presence bits and 1 dirty bit; with each cache block in a cache there is 1 valid bit and 1 dirty (owner) bit.

Read from main memory by processor i: if the dirty bit is OFF, read from main memory and turn p[i] ON. If the dirty bit is ON, recall the line from the dirty processor (downgrading its cache state to shared), update memory, turn the dirty bit OFF, turn p[i] ON, and supply the recalled data to i.

Write to main memory by processor i: if the dirty bit is OFF, send invalidations to all caches that have the block, turn the dirty bit ON, supply the data to i, turn p[i] ON, ...
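
These rules map almost directly onto a per-block directory entry. The sketch below is a rough illustration of the slide's model (k presence bits plus one dirty bit); the network helper functions are hypothetical placeholders named only for this example, and only the dirty-bit-OFF write case shown on the slide is handled.

```c
#include <stdbool.h>
#include <stdint.h>

#define K 16                           /* number of processors (assumed) */

typedef struct {
    uint32_t presence;                 /* bit i set => processor i may hold a copy */
    bool     dirty;                    /* set => exactly one cache holds a modified copy */
} dir_entry_t;

/* Hypothetical helpers for this sketch only. */
void recall_and_downgrade(int owner, uint64_t block);  /* owner writes back, keeps line shared */
void send_invalidation(int sharer, uint64_t block);
void supply_block(int requester, uint64_t block);

/* Read of 'block' by processor i. */
void handle_read(dir_entry_t *d, uint64_t block, int i) {
    if (d->dirty) {
        int owner = __builtin_ctz(d->presence);       /* the single presence bit names the owner */
        recall_and_downgrade(owner, block);           /* updates memory with the dirty line */
        d->dirty = false;
    }
    d->presence |= 1u << i;                           /* turn p[i] ON */
    supply_block(i, block);
}

/* Write of 'block' by processor i, dirty-bit-OFF case only. */
void handle_write(dir_entry_t *d, uint64_t block, int i) {
    for (int s = 0; s < K; s++)
        if ((d->presence & (1u << s)) && s != i)
            send_invalidation(s, block);              /* invalidate every other cached copy */
    d->presence = 1u << i;
    d->dirty = true;
    supply_block(i, block);
}
```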

CS152 Administrivia. Final quiz: Wednesday, April 27. Covers multiprocessors, memory models, and cache coherence (Lectures 17-19, PS 5, Lab 5).

Directory Cache Protocol (Handout 6). [Figure: several CPUs, each with its own cache, connected through an interconnection network to multiple directory controllers, each attached to a DRAM bank.] Assumptions: reliable network, with FIFO message delivery between any given source-destination pair.

Cache States. For each cache line there are 4 possible states: C-invalid (= Nothing): the accessed data is not resident in the cache. C-shared (= Sh): the accessed data is resident in the cache, and possibly also cached at other sites; the data in memory is valid. C-modified (= Ex): the accessed data is exclusively resident in this cache and has been modified; memory does not have the most up-to-date data. C-transient (= Pending): the accessed data is in a transient state (for example, the site has just issued a protocol request but has not yet received the corresponding protocol reply).

Home Directory States. For each memory block there are 4 possible states: R(dir): the memory block is shared by the sites specified in dir (dir is a set of sites); the data in memory is valid in this state; if dir is empty (i.e., dir = ε), the memory block is not cached by any site. W(id): the memory block is exclusively cached at site id and has been modified at that site; memory does not have the most up-to-date data. TR(dir): the memory block is in a transient state, waiting for the acknowledgements to the invalidation requests that the home site has issued. TW(id): the memory block is in a transient state, waiting for the block exclusively cached at site id (i.e., in C-modified state) to make the memory block at the home site up-to-date.
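
One way to picture these four states is as a small tagged union kept per memory block. This is only an illustrative encoding; the bit-vector width and the type names are assumptions, not part of the handout.

```c
#include <stdint.h>

typedef enum { DIR_R, DIR_W, DIR_TR, DIR_TW } dir_state_t;

/* Per-block home-directory state (illustrative encoding). */
typedef struct {
    dir_state_t state;
    union {
        uint64_t dir;   /* R(dir), TR(dir): bit-vector of sharer sites; 0 means dir is empty */
        int      id;    /* W(id), TW(id): the one site holding the block in C-modified state */
    } u;
} home_entry_t;

/* Example: block cached read-only at sites 0 and 3, i.e., state R({0, 3}). */
home_entry_t example = { .state = DIR_R, .u = { .dir = (1u << 0) | (1u << 3) } };
```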

Protocol Messages. There are 10 different protocol messages:

Cache-to-memory requests: ShReq, ExReq.
Memory-to-cache requests: WbReq, InvReq, FlushReq.
Cache-to-memory responses: WbRep(v), InvRep, FlushRep(v).
Memory-to-cache responses: ShRep(v), ExRep(v).
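
As a rough sketch of how the cache side uses these messages (illustrative only, not the handout's exact transition tables): on a load miss the line goes from C-invalid to C-transient while a ShReq is outstanding, and the matching ShRep moves it to C-shared; an ExReq/ExRep pair plays the same role for stores, ending in C-modified.

```c
typedef enum { C_NOTHING, C_SH, C_EX, C_PENDING } cstate_t;

typedef enum {
    SH_REQ, EX_REQ,                      /* cache-to-memory requests   */
    WB_REQ, INV_REQ, FLUSH_REQ,          /* memory-to-cache requests   */
    WB_REP, INV_REP, FLUSH_REP,          /* cache-to-memory responses  */
    SH_REP, EX_REP                       /* memory-to-cache responses  */
} msg_t;

/* Illustrative cache-side handling of a processor load that misses. */
cstate_t on_load_miss(cstate_t s, msg_t *send) {
    if (s == C_NOTHING) {
        *send = SH_REQ;                  /* ask the home site for a shared copy */
        return C_PENDING;                /* wait in the transient state */
    }
    return s;                            /* Sh/Ex already hold valid data; Pending must wait */
}

/* Illustrative handling of the home site's reply while the line is pending. */
cstate_t on_reply(cstate_t s, msg_t reply) {
    if (s == C_PENDING && reply == SH_REP) return C_SH;  /* shared copy arrives */
    if (s == C_PENDING && reply == EX_REP) return C_EX;  /* exclusive copy granted */
    return s;
}
```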

Cache State Transitions (from the invalid state). [Figure: state-transition diagram.]

Cache State Transitions (from the shared state). [Figure: state-transition diagram.]

Cache State Transitions (from the exclusive state). [Figure: state-transition diagram.]

Cache State Transitions (from the pending state). [Figure: state-transition diagram.]

Home Directory State Transitions: messages sent from site id. [Figure: state-transition diagram.]

Home Directory State Transitions: messages sent from site id (continued). [Figure: state-transition diagram.]

Home Directory State Transitions: messages sent from site id (continued). [Figure: state-transition diagram.]

Home Directory State Transitions: messages sent from site id (continued). [Figure: state-transition diagram.]

Acknowledgements. These slides contain material developed and copyrighted by Arvind (MIT), Krste Asanovic (MIT/UCB), Joel Emer (Intel/MIT), James Hoe (CMU), John Kubiatowicz (UCB), and David Patterson (UCB). MIT material is derived from course 6.823; UCB material is derived from course CS252.