EECS 570 Lecture 11. Directory-based Coherence. Winter 2019 Prof. Thomas Wenisch


Directory-based Coherence Winter 2019 Prof. Thomas Wenisch http://www.eecs.umich.edu/courses/eecs570/ Slides developed in part by Profs. Adve, Falsafi, Hill, Lebeck, Martin, Narayanasamy, Nowatzyk, Reinhardt, Roth, Smith, Singh, and Wenisch. Slide 1

Announcements Project Milestone, Monday 2/25 Midterm, Wednesday 2/27 Slide 2

For today: Readings
- Chaiken et al., "Directory-Based Cache Coherence Protocols for Large-Scale Multiprocessors," IEEE Computer, pp. 49-58, June 1990.
- Daniel J. Sorin, Mark D. Hill, and David A. Wood, A Primer on Memory Consistency and Cache Coherence, Chapter 8.
For Monday 2/25:
- Gupta et al., "Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes," 1990.
- Fredrik Dahlgren and Josep Torrellas, "Cache-only memory architectures," IEEE Computer 32(6):72-79, 1999.
Slide 3

Directory-Based Coherence Slide 4

Scalable Cache Coherence
Scalable cache coherence: two-part solution
Part I: bus bandwidth
- Replace the non-scalable bandwidth substrate (bus) with a scalable-bandwidth one (point-to-point network, e.g., mesh)
Part II: processor snooping bandwidth
- Interesting: most snoops result in no action
- Replace the non-scalable broadcast protocol (spam everyone) with a scalable directory protocol (only spam processors that care)
Slide 5

Directory Coherence Protocols
Observe: the physical address space is statically partitioned
+ Can easily determine which memory module holds a given line; that memory module is sometimes called the home
- Can't easily determine which processors have the line in their caches
Bus-based protocol: broadcast events to all processors/caches (simple and fast, but non-scalable)
Directories: a non-broadcast coherence protocol
- Extend memory to track caching information
- For each physical cache line whose home this is, track:
  - Owner: which processor has a dirty copy (i.e., M state)
  - Sharers: which processors have clean copies (i.e., S state)
- A processor sends coherence events to the home directory
- The home directory only sends events to processors that care
Slide 6
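The per-line bookkeeping described above can be sketched as a small class. This is an illustrative sketch, not any specific machine's implementation; the class and method names are hypothetical, and real directories encode this state far more compactly.

```python
# Hypothetical sketch of the per-line state a home directory tracks:
# one owner for a dirty (M) copy, or a set of sharers for clean (S) copies.
class DirectoryEntry:
    def __init__(self):
        self.state = "I"        # I (invalid), S (shared), or M (modified)
        self.owner = None       # node id holding the dirty copy (M state)
        self.sharers = set()    # node ids with clean copies (S state)

    def handle_get_s(self, requester):
        """Read miss: requester joins the sharer list."""
        msgs = []
        if self.state == "M":
            # The owner must supply its dirty copy and downgrade to S.
            msgs.append(("Fwd-Get-S", self.owner))
            self.sharers.add(self.owner)
            self.owner = None
        self.state = "S"
        self.sharers.add(requester)
        return msgs  # messages sent only to processors that care

    def handle_get_m(self, requester):
        """Write miss: invalidate all other copies, make requester the owner."""
        msgs = [("Invalidate", n) for n in self.sharers if n != requester]
        if self.state == "M" and self.owner != requester:
            msgs.append(("Fwd-Get-M", self.owner))
        self.state = "M"
        self.sharers = set()
        self.owner = requester
        return msgs
```

Note that the returned message lists go only to the affected nodes, which is precisely the bandwidth advantage over broadcast.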

Basic Operation: Read
Node #1 has a load miss on line A:
1: Node #1 sends Get-S A to the directory
2: The directory responds with Data A
Directory state becomes A: Shared, {#1}
Slide 7

Basic Operation: Write
Node #1 first reads A (miss): Read A to the directory, Fill A back; directory state is A: Shared, {#1}
Node #2 then writes A:
1: Node #2 sends Get-M A to the directory
2: The directory sends Invalidate A to Node #1
3: Node #1 responds with Inv-Ack A
4: The directory sends Data A to Node #2; state becomes A: Modified, {#2}
Slide 8

Centralized Directory
A single directory contains a copy of the cache tags from all nodes
Advantages:
- Central serialization point: easy to get memory consistency (just like a bus)
Problems:
- Not scalable (imagine the traffic from 1000s of nodes)
- Directory size/organization changes with the number of nodes
Slide 9

Distributed Directory
Distribute the directory among the memory modules
- Memory block = coherence block (usually = cache line)
- Home node: the node holding the directory entry
  - Usually also holds dedicated main-memory storage for the cache line
- Scalable: the directory grows with memory capacity
  - Common trick: steal bits from ECC for directory state
- The directory can no longer serialize accesses across all addresses
  - Memory consistency becomes the responsibility of the CPU interface
Slide 10
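Because the address space is statically partitioned, any node can compute a line's home locally. A minimal sketch, assuming illustrative values for line size and node count (real machines pick these parameters per design):

```python
# Assumed, illustrative parameters; not from any particular machine.
LINE_BYTES = 64   # coherence block size
NUM_NODES = 16    # number of home memory modules

def home_node(paddr: int) -> int:
    """Low-order line-address bits select the home memory module,
    interleaving consecutive cache lines across nodes."""
    line_addr = paddr // LINE_BYTES
    return line_addr % NUM_NODES
```

With this interleaving, consecutive lines map to consecutive homes, spreading directory traffic across the machine.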

What is in the directory?
Directory state
- Invalid, Exclusive, Shared, ... (stable states)
- # of outstanding invalidation messages, ... (transient states)
Pointer to the exclusive owner
Sharer list
- List of caches that may have a copy
- May include the local node
- Not necessarily precise, but always conservative
Slide 11

Directory State
Few stable states: 2-3 bits usually enough
Transient states
- Often 10s of states (plus the need to remember node ids, ...)
- Transient state changes frequently; needs fast read-modify-write access
- Design options:
  - Keep in the directory: scalable (high concurrency), but slow
  - Keep in a separate memory
  - Keep in the directory, with a cache to accelerate access
  - Keep in the protocol controller: a Transaction State Register File, like MSHRs
Slide 12

Pointer to Exclusive Owner
Simple node id: log2(# nodes) bits
Can share storage with the sharer list (don't need both)
May point to a group of caches that internally maintain coherence (e.g., via snooping)
May treat the local node differently
Slide 13

Sharer List Representation
Key to scalability: must efficiently represent node subsets
Observation: most blocks are cached by only 1 or 2 nodes
- But there are important exceptions (synchronization variables)
[Figure: distribution of sharer counts for an OLTP workload; data from Nowatzyk]
Slide 14

Idea #1: Sharer Bit Vectors
One bit per processor / node / cache (e.g., 0 1 1 0 0 0 0 1)
- Storage requirement grows with system size
Slide 15
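A full-map bit vector is simple to manipulate with ordinary bitwise operations. A minimal sketch (the node count is an assumption for illustration):

```python
NUM_NODES = 8  # illustrative; storage is NUM_NODES bits per entry,
               # regardless of how few sharers a line actually has

def add_sharer(vec: int, node: int) -> int:
    """Set the bit for a node that just obtained a clean copy."""
    return vec | (1 << node)

def sharers(vec: int) -> list:
    """Exact list of nodes whose bit is set (precise, not conservative)."""
    return [n for n in range(NUM_NODES) if vec & (1 << n)]
```

Precision is the strength of this scheme; the weakness, as the slide notes, is that the vector width scales linearly with the number of nodes.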

Idea #2: Limited Pointers
A fixed number (e.g., 4) of pointers to node ids
If there are more than n sharers:
- Recycle one pointer (force an invalidation)
- Revert to broadcast
- Handle in software (maintain a longer list elsewhere)
Slide 16
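One of the overflow policies above, revert-to-broadcast, can be sketched as follows. The class is hypothetical and picks just that one policy; the slide's other options (pointer recycling, software lists) would replace the overflow branch.

```python
MAX_PTRS = 4  # fixed pointer budget per directory entry (illustrative)

class LimitedPointerEntry:
    def __init__(self):
        self.ptrs = []          # up to MAX_PTRS sharer node ids
        self.overflow = False   # once set, precision is lost

    def add_sharer(self, node):
        if self.overflow or node in self.ptrs:
            return
        if len(self.ptrs) < MAX_PTRS:
            self.ptrs.append(node)
        else:
            # More than MAX_PTRS sharers: revert to broadcast.
            self.overflow = True

    def invalidation_targets(self, all_nodes):
        """Precise list while pointers suffice; every node after overflow."""
        return list(all_nodes) if self.overflow else list(self.ptrs)
```

The scheme exploits the observation from the sharer-list slide: since most blocks have 1-2 sharers, a handful of pointers is almost always enough, and the broadcast fallback is rare.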

Idea #3: Linked Lists
Each node has fixed storage for the next (and previous) sharer
- Doubly-linked (Scalable Coherent Interface, SCI)
- Singly-linked (S3.mp)
Poor performance:
- Long invalidation latency
- Replacements are difficult to get out of the sharer list
  - Especially with a singly-linked list: how would you do it?
Slide 17

Directory representation optimizations
- Coarse Vectors (CV)
- Cruise Missile Invalidations (CMI)
- Tree Extensions (TE)
- List-based Overflow (LO)
Slide 18
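Of these, the coarse vector is the simplest to sketch: each bit covers a group of nodes rather than a single node, trading precision for storage. The group size below is an illustrative assumption.

```python
GROUP_SIZE = 4  # nodes per vector bit (illustrative)

def set_coarse_bit(vec: int, node: int) -> int:
    """Record a sharer by setting the bit for its whole group."""
    return vec | (1 << (node // GROUP_SIZE))

def coarse_targets(vec: int, num_nodes: int) -> list:
    """All nodes covered by set group bits: a conservative superset
    of the true sharers, so invalidations are never missed."""
    return [n for n in range(num_nodes) if vec & (1 << (n // GROUP_SIZE))]
```

This keeps the "always conservative" property the directory requires: some nodes may receive unnecessary invalidations, but no sharer is ever skipped.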

Clean Eviction Notification
Should the directory learn when clean blocks are evicted?
Advantages:
- Avoids broadcast; frees pointers in limited-pointer schemes
- Avoids unnecessary invalidate messages
Disadvantages:
- Read-only data is never invalidated, so its evict messages are pure overhead
- Notification traffic is unnecessary
- New protocol races
Slide 19

Sparse Directories
Most of memory is invalid; why waste directory storage on it?
Instead, use a directory cache
- Any address without an entry is invalid
- If full, need to evict & invalidate a victim entry
- Generally needs to be highly associative
Slide 20
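The two key behaviors, miss-means-invalid and evict-means-invalidate, can be sketched as below. The class is hypothetical; it is fully associative with an arbitrary victim choice for brevity, whereas a real sparse directory is set-associative with a real replacement policy.

```python
class SparseDirectory:
    """Sketch of a directory cache: only lines with cached copies
    get an entry; everything absent is implicitly in state I."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = {}   # line address -> set of sharer node ids

    def lookup(self, addr):
        # A miss means no node caches this line (state I).
        return self.entries.get(addr, set())

    def add_sharer(self, addr, node):
        invalidations = []
        if addr not in self.entries and len(self.entries) >= self.capacity:
            # Full: evict a victim entry, and invalidate the copies it
            # tracked so the "absent == invalid" invariant still holds.
            victim_addr, victim_sharers = next(iter(self.entries.items()))
            del self.entries[victim_addr]
            invalidations = [(victim_addr, n) for n in victim_sharers]
        self.entries.setdefault(addr, set()).add(node)
        return invalidations   # (addr, node) pairs to invalidate
```

The forced invalidations on eviction are why the slide calls for high associativity: conflict evictions in the directory cache invalidate otherwise-useful data in processor caches.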

Cache Invalidation Patterns
Hypothesis: on a write to a shared location, the number of caches to be invalidated is typically small
If this isn't true, a directory is no better than broadcast/snooping
Experience tends to validate this hypothesis
Slide 21

Common Sharing Patterns
Code and read-only objects
- No problem, since rarely written
Migratory objects
- Even as the number of caches grows, only 1-2 invalidations
Mostly-read objects
- Invalidations are expensive but infrequent, so OK
Frequently read/written objects (e.g., task queues)
- Invalidations are frequent, but hence the sharer list is usually small
Synchronization objects
- Low-contention locks result in few invalidations
- High-contention locks may need special support (e.g., MCS locks)
Badly-behaved objects
Slide 22

Designing a Directory Protocol: Nomenclature
- Local Node (L): the node initiating the transaction we care about
- Home Node (H): the node where the directory/main memory for the block lives
- Remote Node (R): any other node that participates in the transaction
Slide 23

Read Transaction
L has a cache miss on a load instruction:
1: L sends Get-S to H
2: H responds with Data
Slide 24

4-hop Read Transaction
L has a cache miss on a load instruction; the block was previously in modified state at R (directory state: M, owner: R)
1: L sends Get-S to H
2: H sends Recall to R
3: R sends Data to H
4: H sends Data to L
Slide 25

3-hop Read Transaction
L has a cache miss on a load instruction; the block was previously in modified state at R (directory state: M, owner: R)
1: L sends Get-S to H
2: H sends Fwd-Get-S to R
3: R sends Data to both L and H
Slide 26
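The 3-hop flow above can be written out as an explicit message sequence; this is a sketch for illustration, with hypothetical tuple encoding (hop number, message type, source, destination).

```python
def three_hop_read(L, H, R):
    """Message sequence for L's read miss on a line dirty at owner R.
    Unlike the 4-hop Recall flow, the owner sends data directly to the
    requester, removing one network traversal from the critical path."""
    return [
        (1, "Get-S",     L, H),   # hop 1: request goes to the home
        (2, "Fwd-Get-S", H, R),   # hop 2: home forwards to the owner
        (3, "Data",      R, L),   # hop 3: owner supplies the requester
        (3, "Data",      R, H),   # owner also updates the home (off the
                                  # critical path, same hop count)
    ]
```

Counting only the critical path, the requester sees its data after 3 network traversals instead of 4, at the cost of extra protocol complexity at the home (it must track the in-flight forward).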

An Example Race: Writeback & Read
L has a dirty copy and wants to write it back to H; R concurrently sends a read to H.
Race! The Put-M and the Fwd-Get-S cross in flight. To make your head really hurt:
1: L sends Put-M+Data to H (L: M -> MI^A)
2: R sends Get-S to H
3: H sends Fwd-Get-S to L
4: L handles the Fwd-Get-S while its Put-M is outstanding (L: MI^A -> SI^A)
5: L sends Data to R
6: H processes the writeback (directory: initial state M, owner L; final state S, sharers L, R)
7: H sends Put-Ack to L
Can optimize away SI^A & the Put-Ack! L and H each know the race happened; they don't need more messages.
Slide 27

Store-Store Race
The line is invalid; both L and R race to obtain write permission.
1: L sends Get-M to H (L: I -> IM^AD)
2: R sends Get-M to H (R: I -> IM^AD)
3: H receives L's Get-M first (directory: state M, owner L)
4: H sends Data [ack=0] to L
5: H receives R's Get-M (new owner: R)
6: H sends Fwd-Get-M to L
Race! L sees the Fwd-Get-M before its own Data: stall for the Data, do 1 store, then forward to R
7: L's Data arrives; L performs one store
8: L sends Data [ack=0] to R
Slide 28

Worst-case scenario?
L evicts a dirty copy; R concurrently seeks write permission.
Race! L's Put-M is still floating around; must wait until it's gone.
1: L sends Put-M to H (L: M -> MI^A)
2: R sends Get-M to H
3: H sends Fwd-Get-M to L
4: L sends Data [ack=0] to R (L: MI^A -> II^A, waiting to ensure the Put-M is gone)
5: H receives the Put-M from a non-owner: race! (directory state: M, owner was L)
6: H sends Put-Ack to L
Slide 29

Design Principles
Think of sending and receiving messages as separate events
At each step, consider what new requests can occur
- E.g., can a new writeback overtake an older one?
Two messages traversing the same direction implies a race
- Need to consider both delivery orders
  - Usually results in a branch in the coherence FSM to handle both orderings
- Need to make sure messages can't stick around lost
  - Every request needs an ack; extra states to clean up messages
- Often, only one node knows how a race resolves
  - Might need to send messages to tell others what to do
Slide 30