CS 152 Computer Architecture and Engineering


CS 152 Computer Architecture and Engineering
Lecture 18: Directory-Based Cache Protocols
John Wawrzynek
EECS, University of California at Berkeley
http://inst.eecs.berkeley.edu/~cs152


Recap: Snoopy Cache Protocols

[Figure: processors M1, M2, M3, each with its own snoopy cache, plus a DMA engine and disks, all attached to a shared memory bus and physical memory.]

Use the snoopy mechanism to keep all processors' view of memory coherent.

Recap: MESI, an Enhanced MSI Protocol

MESI adds an Exclusive state to MSI for increased performance on private data. Each cache line has state bits alongside its address tag:
M: Modified Exclusive (dirty; no other copies)
E: Exclusive but unmodified
S: Shared
I: Invalid

[State diagram for the cache state in processor P1: transitions among M, E, S, and I, driven by P1 reads and writes, P1 intent-to-write, other processors' reads and intent-to-write, and the associated write-backs.]

Performance of Symmetric Shared-Memory Multiprocessors

Cache performance is a combination of:
1. Uniprocessor cache miss traffic
2. Miss traffic caused by communication: shared memory causes invalidations and subsequent cache misses

These coherence misses (sometimes called communication misses) are misses that result from a remote core's activity:
Read miss: a remote core wrote the line
Write miss: a remote core wrote or read the line
They form the 4th C of cache misses, alongside Compulsory, Capacity, and Conflict.

Coherence Misses

1. True sharing misses arise from the communication of data through the cache coherence mechanism:
Invalidates due to the first write to a shared line
Reads by another CPU of a line modified in a different cache
The miss would still occur if the line size were one word.

2. False sharing misses occur when a line is invalidated because some word in the line, other than the one being read, is written into:
The invalidation does not communicate a new value; it only causes an extra cache miss
The line is shared, but no word in the line is actually shared
The miss would not occur if the line size were one word.

[Cache line layout: state | line addr | data0 | data1 | ... | dataN]

Example: True vs. False Sharing vs. Hit?

Assume x1 and x2 are in the same cache line, and P1 and P2 have both read x1 and x2 previously.

Time  P1        P2        True, False, or Hit? Why?
1     Write x1
2               Read x2
3     Write x1
4               Write x2
5     Read x2

Example: True vs. False Sharing vs. Hit?

Assume x1 and x2 are in the same cache line, and P1 and P2 have both read x1 and x2 previously.

Time  P1        P2        True, False, or Hit? Why?
1     Write x1            No miss; invalidate x1 in P2
2               Read x2   False miss; x1 irrelevant to P2
3     Write x1            No miss; invalidate x1 in P2
4               Write x2  False miss; x1 irrelevant to P2
5     Read x2             True communication miss
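
The classifications in this example can be reproduced with a small sketch: a single cache line under a write-invalidate, MSI-style protocol, where each processor remembers which words the other processor wrote while its own copy was invalid. This is an illustrative model of the slide's bookkeeping (processor and word names are just labels), not a full coherence simulator.

```python
# Toy classifier for the 5-step trace above. "hit" corresponds to the
# slide's "no miss" case: a write to a shared copy is an upgrade, not a miss.

def classify_trace(trace):
    state = {"P1": "S", "P2": "S"}              # both read the line earlier
    remote_writes = {"P1": set(), "P2": set()}  # words the other CPU wrote
                                                # since this copy was lost
    results = []
    for proc, op, word in trace:
        other = "P2" if proc == "P1" else "P1"
        if state[proc] == "I":
            # Miss: true sharing if the accessed word was actually written
            # by the other processor, false sharing otherwise.
            kind = "true miss" if word in remote_writes[proc] else "false miss"
            remote_writes[proc].clear()
        else:
            kind = "hit"  # includes the upgrade case (write to an S copy)
        if op == "write":
            state[proc] = "M"
            state[other] = "I"                  # invalidate the other copy
            remote_writes[other].add(word)
        else:
            state[proc] = "S"
            if state[other] == "M":
                state[other] = "S"              # forces a writeback; both shared
        results.append(kind)
    return results

trace = [("P1", "write", "x1"),
         ("P2", "read",  "x2"),
         ("P1", "write", "x1"),
         ("P2", "write", "x2"),
         ("P1", "read",  "x2")]
print(classify_trace(trace))
```

Running the trace reproduces the answer column: hit, false miss, hit, false miss, true miss.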

MP Performance, 4-Processor Commercial Workload: OLTP, Decision Support (Database), Search Engine

Uniprocessor cache misses (instruction, capacity/conflict, compulsory) improve as cache size increases, while true sharing and false sharing misses are essentially unchanged going from a 1 MB to an 8 MB L3 cache.

[Chart: memory cycles per instruction (0 to 3.25) vs. L3 cache size (1 MB, 2 MB, 4 MB, 8 MB), broken down into Instruction, Capacity/Conflict, Cold, False Sharing, and True Sharing components.]

MP Performance, 2 MB Cache Commercial Workload: OLTP, Decision Support (Database), Search Engine

True sharing and false sharing misses increase going from 1 to 8 CPUs.

[Chart: memory cycles per instruction (0 to 3) vs. processor count (1, 2, 4, 6, 8), broken down into Instruction, Conflict/Capacity, Cold, False Sharing, and True Sharing components.]

What if We Had 128 Cores?

[Figure: the same snoopy-cache bus diagram as before (processors M1, M2, M3 with snoopy caches, DMA, and disks on a shared memory bus).]

The Bus Is a Synchronization Point

So far, any message that enters the bus reaches all cores and gets its replies before another message can enter the bus (actions appear instantaneous). The bus therefore forces an ordering on all coherence transactions.

[Figure: snoopy-cache bus diagram with M1, M2, M3, DMA, disks, and physical memory.]

Scaling Snoopy/Broadcast Coherence

When any processor misses, it must probe every other cache. Scaling up to more processors is limited by:
Communication bandwidth over the bus
Snoop bandwidth into the tags

Bandwidth can be improved by using multiple interleaved buses with interleaved tag banks. E.g., two address bits pick which of four buses and four tag banks to use (bits 7:6 of the address pick the bus/tag bank, bits 5:0 pick the byte in a 64-byte line). Even so, buses don't scale to a large number of connections.
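
The interleaving example can be sketched in a few lines; the constants follow the slide's numbers (64-byte lines, four buses), so bits 5:0 are the byte offset and bits 7:6 select the bus/tag-bank pair.

```python
# Address routing for the interleaved-bus example: consecutive cache
# lines rotate across the four buses, so snoop traffic is spread out.

LINE_BYTES = 64    # bits 5:0 = byte offset within the line
NUM_BUSES = 4      # bits 7:6 = bus / tag-bank index

def route(addr):
    byte_in_line = addr & (LINE_BYTES - 1)
    bus = (addr >> 6) & (NUM_BUSES - 1)
    return bus, byte_in_line

# Consecutive cache lines land on buses 0, 1, 2, 3, then wrap to 0:
for a in (0x000, 0x040, 0x080, 0x0C0, 0x100):
    print(hex(a), route(a))
```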

Dynamically Switched Interconnection Network

[Figure: a dynamically switched interconnection network.]

Scalable Approach: Directories

With a point-to-point network connecting a larger number of nodes, we could broadcast snoop requests to emulate the operation of a bus, but that swamps the network and is limited by tag bandwidth at the caches. The insight: most snoops fail to find a match!

The better approach is for every memory line to have associated directory information that keeps track of copies of cached lines and their states:
On a miss, find the directory entry, look it up, and communicate only with the nodes that have copies, if necessary
Communication with the directory and the copies happens through network transactions

There are many alternatives for organizing the directory information.

What if There Is No Bus to Synchronize?

What if a cache broadcasts invalidations in order to transition to Modified (M), and before that completes it receives an invalidation from another core's transition to Modified?

[MESI state diagram for processor P1, as before: transitions among M, E, S, and I, driven by P1 reads/writes, intent-to-write, other processors' reads and intent-to-write, and write-backs.]

Directory Cache System

[Figure: CPUs, each with a cache, connected through an interconnection network to directory controllers, each fronting a DRAM bank.]

Each line in a cache has a state field plus a tag. Each line in memory has a state field plus a bit-vector directory with one bit per processor.

Assumptions: reliable network, FIFO message delivery between any given source-destination pair. The directory becomes the point of serialization.

Alternative Architecture Model: Non-Uniform Memory Access (NUMA)

Vector Directory Bit Mask

Each directory entry holds a state field, a sharer bit mask, and the data. With four cores, the mask 0 1 1 0 means that cores 2 and 3 have that line in their local caches. The directory can instead keep a list of core IDs. Each cache still keeps its own state for the line.
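
A directory entry like this can be sketched as a state field plus a sharer bit vector; the set and clear operations correspond to the directory updates in the protocol walk-throughs on the later slides. This is a hypothetical sketch (names and the 0-based core numbering are my choices; the slide numbers cores from 1).

```python
# One directory entry: a state plus a bit vector with one bit per core.
# With four cores, bits 1 and 2 set (mask 0b0110) marks two sharers,
# matching the figure's "0 1 1 0" example under 0-based numbering.

class DirEntry:
    def __init__(self, num_cores):
        self.num_cores = num_cores
        self.state = "R"     # R = shared/uncached; W = modified at one site
        self.sharers = 0     # bit i set => core i holds the line

    def add_sharer(self, core):
        self.sharers |= 1 << core

    def remove_sharer(self, core):
        self.sharers &= ~(1 << core)

    def sharer_ids(self):
        return [c for c in range(self.num_cores) if self.sharers >> c & 1]

entry = DirEntry(4)
entry.add_sharer(1)
entry.add_sharer(2)
print(bin(entry.sharers), entry.sharer_ids())
```

A bit vector costs one bit per core per line; the alternative list-of-IDs representation wins when lines usually have few sharers and the core count is large.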

Cache States

For each cache line, there are four possible states (based on MSI):
C-invalid (= Nothing): The accessed data is not resident in the cache.
C-shared (= Sh): The accessed data is resident in the cache, and possibly also cached at other sites. The data in memory is valid.
C-modified (= Ex): The accessed data is exclusively resident in this cache, and has been modified. Memory does not have the most up-to-date data.
C-transient (= Pending): The accessed data is in a transient state (for example, the site has just issued a protocol request but has not yet received the corresponding protocol reply).
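
One plausible way to write these states down is as a transition table keyed by (state, event); the message names come from the protocol message table later in the lecture. Treat this as a hedged sketch of a few transitions, not the complete FSM (in particular, it ignores the races that the Pending state exists to handle).

```python
# A few cache-side transitions as (state, event) -> (next state, reply).
# Which request triggers which downgrade is my reading of the protocol,
# so this table is illustrative rather than authoritative.

TRANSITIONS = {
    ("Nothing", "load"):   ("Pending", "ShReq"),    # miss: ask the directory
    ("Nothing", "store"):  ("Pending", "ExReq"),
    ("Pending", "ShRep"):  ("Sh", None),            # data arrives, shared
    ("Pending", "ExRep"):  ("Ex", None),            # data arrives, exclusive
    ("Sh", "InvReq"):      ("Nothing", "InvRep"),   # invalidate on request
    ("Ex", "WbReq"):       ("Sh", "WbRep"),         # write back, keep shared
    ("Ex", "FlushReq"):    ("Nothing", "FlushRep"), # write back and drop
}

def step(state, event):
    next_state, reply = TRANSITIONS[(state, event)]
    return next_state, reply

print(step("Nothing", "load"))
```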

Home Directory States

For each memory line, there are four possible directory states:
R(dir): The memory line is shared by the sites specified in dir (dir is a set of sites). The data in memory is up to date. If dir is empty (i.e., dir = ε), the memory line is not cached by any site.
W(id): The memory line is exclusively cached at site id, and has been modified at that site. Memory does not have the most up-to-date data.
TR(dir): The memory line is in a transient state, waiting for the acknowledgements to the invalidation requests that the home site has issued.
TW(id): The memory line is in a transient state, waiting for a line exclusively cached at site id (i.e., in C-modified state) to make the memory line at the home site up to date.

Note: the directory uses different states than the caches.
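
For reference, the four directory states can be written down directly in the slide's notation, as a small sketch (the tagged-tuple encoding is my choice; `dir` is a set of site ids):

```python
# Tagged-tuple encoding of the four home-directory states above.
def R(dir):    return ("R", frozenset(dir))    # shared by sites in dir
def W(site):   return ("W", site)              # modified at one site
def TR(dir):   return ("TR", frozenset(dir))   # awaiting invalidation acks
def TW(site):  return ("TW", site)             # awaiting writeback from site

uncached = R(set())    # dir = empty set: the line is not cached anywhere
print(uncached)
```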

Protocol Messages (MESI)

There are 10 different protocol messages:

Category                     Messages
Cache-to-memory requests     ShReq, ExReq
Memory-to-cache requests     WbReq, InvReq, FlushReq
Cache-to-memory responses    WbRep(v), InvRep, FlushRep(v)
Memory-to-cache responses    ShRep(v), ExRep(v)

Read Miss, to an Uncached or Shared Line

1. Load request at the head of the CPU-to-cache queue.
2. Load misses in the cache.
3. Send a ShReq message to the directory.
4. Message received at the directory controller.
5. Access the state and directory for the line. The line's state is R, with zero or more sharers.
6. Update the directory by setting the bit for the new processor sharer.
7. Send a ShRep message with the contents of the cache line.
8. ShRep arrives at the cache.
9. Update the cache tag and data and return the load data to the CPU.
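
The directory's part of this walk-through (steps 4 through 7) can be sketched as a single handler, assuming the line is already in state R. Directory entries here are (state, sharer-set) pairs, and the network is reduced to a returned message tuple; names follow the protocol message table.

```python
# Directory-side ShReq handling for a line in state R: record the new
# sharer, then reply with the line's data. Queueing and the network
# layers are omitted for clarity.

def handle_shreq(directory, memory, line_addr, requester):
    state, sharers = directory[line_addr]
    assert state == "R", "this sketch only covers the shared/uncached case"
    directory[line_addr] = ("R", sharers | {requester})  # step 6: add sharer
    return ("ShRep", line_addr, memory[line_addr])       # step 7: send data

directory = {0x40: ("R", set())}
memory = {0x40: 1234}
reply = handle_shreq(directory, memory, 0x40, "P1")
print(reply, directory[0x40])
```

After the reply, the directory records P1 as a sharer, so a later write by a different core knows exactly whom to invalidate.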

Write Miss, to a Read-Shared Line

1. Store request at the head of the CPU-to-cache queue.
2. Store misses in the cache.
3. Send an ExReq message to the directory.
4. ExReq message received at the directory controller.
5. Access the state and directory for the line. The line's state is R, with some set of sharers.
6. Send one InvReq message to each sharer.
7. InvReq arrives at each sharing cache.
8. Each sharer invalidates its cache line and sends an InvRep to the directory.
9. As each InvRep is received, clear down that sharer's bit.
10. When there are no more sharers, send an ExRep to the requesting cache.
11. ExRep arrives at the cache.
12. Update the cache tag and data, then store the data from the CPU.
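
The directory side of this flow (steps 4 through 10) can be sketched the same way, with a transient TR-like state holding the set of outstanding invalidations. This is a simplified model: `send` stands in for the network, and the data transfer carried by ExRep is omitted.

```python
# Directory-side ExReq handling: invalidate every sharer, count the acks,
# and grant exclusive ownership only after the last InvRep arrives.

def handle_exreq(directory, line_addr, writer, send):
    state, sharers = directory[line_addr]
    assert state == "R", "this sketch only covers the read-shared case"
    pending = set(sharers) - {writer}
    for s in sorted(pending):                       # step 6: InvReq per sharer
        send(s, ("InvReq", line_addr))
    directory[line_addr] = ("TR", pending, writer)  # transient: await acks

def handle_invrep(directory, line_addr, sharer, send):
    state, pending, writer = directory[line_addr]
    pending = pending - {sharer}                    # step 9: clear sharer bit
    if pending:
        directory[line_addr] = ("TR", pending, writer)
    else:                                           # step 10: last ack
        directory[line_addr] = ("W", writer)
        send(writer, ("ExRep", line_addr))

sent = []
directory = {0x40: ("R", {"P2", "P3"})}
handle_exreq(directory, 0x40, "P1", lambda d, m: sent.append((d, m)))
for s in ("P2", "P3"):                              # steps 7-8 happen at sharers
    handle_invrep(directory, 0x40, s, lambda d, m: sent.append((d, m)))
print(sent)
print(directory[0x40])
```

Note how the directory serializes the race: while the entry sits in the transient state, no other request for this line can be granted, which is exactly the ordering the bus used to provide for free.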

Concurrency Management

The protocol would be easy to design if only one transaction were in flight across the entire system. But we want greater throughput, and we don't want to coordinate across the entire system. The great complexity lies in managing multiple outstanding concurrent transactions to cache lines: there can be multiple requests in flight to the same cache line!

This Is the Standard MESI

[MESI state diagram for the cache state in processor P1, as before: transitions among M, E, S, and I.]

With Transients (Cache), Based on MESI

[Figure: cache-side state diagram including transient states.]

With Transients (Directory)

With MESI there are 17 states! If you are curious: http://www.m5sim.org/mesi_two_level

[Figure based on MSI: 13 states.]

More Complex Coherence Protocols

Cache State Transitions (from the invalid state)

Cache State Transitions (from the shared state)

Cache State Transitions (from the exclusive state)

Cache State Transitions (from the pending state)

Home Directory State Transitions: Messages Sent from Site id

[Four figure slides showing the home-directory transition diagrams.]

Acknowledgements

These slides contain material developed and copyright by:
Arvind (MIT)
Krste Asanovic (MIT/UCB)
Joel Emer (Intel/MIT)
James Hoe (CMU)
John Kubiatowicz (UCB)
David Patterson (UCB)
Mark Hill (U. Wisconsin-Madison)
Dana Vantrease, Mikko Lipasti, Nathan Binkert (U. Wisconsin-Madison & HP)

MIT material derived from course 6.823. UCB material derived from course CS252.