
Lecture 8: Snooping and Directory Protocols
Topics: split-transaction implementation details, directory implementations (memory- and cache-based)

Split Transaction Bus
- So far, we have assumed that a coherence operation (request, snoops, responses, update) happens atomically
- What would it take to implement the protocol correctly while assuming a split-transaction bus?
- Split-transaction bus: a cache puts out a request, releases the bus (so others can use it), and receives its response much later
- Assumptions: only one request per block can be outstanding; separate lines for address (request) and data (response)

Split Transaction Bus
[Figure: three processors (Proc 1, Proc 2, Proc 3), each with a cache and a buffer, attached to separate request (address) lines and response (data) lines]

Design Issues
- Could be a 3-stage pipeline: request/snoop/response, or (much simpler) a 2-stage pipeline: request-snoop/response (note that the response is slowest and needs to be hidden)
- Buffers track the outstanding transactions; the buffers are identical in each core; an entry is freed when the response is seen; the next operation uses any free entry; every bus operation carries its buffer entry number as a tag (a minimal sketch follows below)
- Must check the buffer before broadcasting a new operation; must ensure only one outstanding operation per block
- What determines the write order: requests or responses?
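To make the buffer (request-table) mechanics concrete, here is a minimal C sketch. The table size, the function names, and the stall-on-conflict policy are illustrative assumptions, not details of any real bus:

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_ENTRIES 8   /* assumed table size; real designs vary */

typedef struct {
    bool     valid;       /* entry holds an outstanding transaction */
    uint64_t block_addr;  /* block address of the pending request   */
} ReqEntry;

static ReqEntry req_table[NUM_ENTRIES];  /* identical copy per core */

/* Before arbitrating for the bus, a core checks the table: if another
 * transaction for the same block is outstanding, it must stall (only
 * one outstanding operation per block); otherwise it claims a free
 * entry, whose index is the tag carried by the request and, later,
 * by the matching response. Returns the tag, or -1 to stall. */
int req_table_try_issue(uint64_t block_addr)
{
    int free_idx = -1;
    for (int i = 0; i < NUM_ENTRIES; i++) {
        if (req_table[i].valid && req_table[i].block_addr == block_addr)
            return -1;                /* conflict: stall this request */
        if (!req_table[i].valid && free_idx < 0)
            free_idx = i;
    }
    if (free_idx < 0)
        return -1;                    /* table full: stall */
    req_table[free_idx].valid = true;
    req_table[free_idx].block_addr = block_addr;
    return free_idx;                  /* tag for the bus operation */
}

/* When the tagged response is seen on the bus, the entry is freed. */
void req_table_complete(int tag)
{
    req_table[tag].valid = false;
}

int main(void)
{
    int tag = req_table_try_issue(0x1000);  /* claims entry 0    */
    int dup = req_table_try_issue(0x1000);  /* same block: stall */
    req_table_complete(tag);
    return (tag == 0 && dup == -1) ? 0 : 1;
}
```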

Design Issues II
- What happens if processor-A is arbitrating for the bus and witnesses another bus transaction for the same address or the same buffer entry?
- What if processor-A was trying to do an upgrade?
- What if processor-A was trying to do a read and there is already a matching read in the request table?
- Processor-cache handshake: after acquiring the block in exclusive state, the processor must complete the write before handing the block to other writers; otherwise, there's a livelock

Directory-Based Protocol
- For each block, there is a centralized directory that maintains the state of the block in the different caches
- The directory is co-located with the corresponding memory
- Requests and replies on the interconnect are no longer seen by everyone; the directory serializes writes
[Figure: two nodes, each with a processor (P), cache (C), directory (Dir), memory (Mem), and communication assist (CA), connected by an interconnect]

Definitions
- Home node: the node that stores memory and directory state for the cache block in question
- Dirty node: the node that has a cache copy in modified state
- Owner node: the node responsible for supplying data (usually either the home or the dirty node)
- Also: exclusive node, local node, requesting node, etc.
[Figure: two nodes, each with a processor (P), cache (C), directory (Dir), memory (Mem), and communication assist (CA)]

Directory Organizations
- Centralized directory: one fixed location, a bottleneck!
- Flat directories: directory info is in a fixed place, determined by examining the address; can be further categorized as memory-based or cache-based
- Hierarchical directories: the processors are organized as a logical tree, and each parent keeps track of which of its immediate children has a copy of the block; more searching, but can exploit locality

Flat Memory-Based Directories
- The directory is associated with memory and stores info for all cached copies
- A presence vector stores a bit for every processor, for every memory block; the overhead is a function of memory/block size and the number of processors
- Reducing directory overhead (the presence vector and the pointer scheme are sketched in code below):
  - Width: pointers (keep track of the processor ids of sharers; needs an overflow strategy), or organize processors into clusters
  - Height: increase the block size, or track info only for blocks that are cached (note: cache size << memory size)
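A minimal C sketch of the two directory-entry organizations just described: a full presence vector and a width-reduced entry with a few explicit sharer pointers plus an overflow flag. Field widths, names, and the value of NPTRS are assumptions for illustration:

```c
#include <stdint.h>

enum dir_state { UNOWNED, SHARED, EXCLUSIVE };

/* Full presence vector: one bit per node (64 nodes assumed). */
typedef struct {
    enum dir_state state;
    uint64_t presence;        /* bit i set => node i may hold a copy */
} DirEntryFull;

/* Width-reduced entry: a few explicit sharer pointers plus an
 * overflow flag (one possible overflow strategy is to fall back to
 * broadcast once more than NPTRS nodes share the block). */
#define NPTRS 4               /* assumed pointer count */

typedef struct {
    enum dir_state state;
    uint8_t sharers[NPTRS];   /* node ids of up to NPTRS sharers */
    uint8_t nsharers;
    uint8_t overflow;         /* set once NPTRS is exceeded      */
} DirEntryLimited;

int main(void)
{
    DirEntryFull    full = { SHARED, 1ULL << 5 };  /* node 5 shares */
    DirEntryLimited lim  = { SHARED, { 5 }, 1, 0 };
    (void)full; (void)lim;
    return 0;
}
```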

Flat Cache-Based Directories
- The directory at the memory home node stores only a pointer to the first cached copy; the caches store pointers to the next and previous sharers (a doubly linked list; a sketch follows below)
- Potentially lower storage, no bottleneck for network traffic
- Invalidates are now serialized (it takes longer to acquire exclusive access), replacements must update the linked list, and race conditions must be handled while updating the list
[Figure: main memory's head pointer chains through Cache 7, Cache 3, and Cache 26]
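A minimal C sketch of the per-block pointer state in a cache-based scheme, with a hypothetical add_sharer that prepends a new sharer at the head of the list. The message exchanges and the race conditions mentioned above are deliberately ignored:

```c
#include <stdint.h>

#define NO_NODE 0xFF

typedef struct {     /* per-block state at the home node        */
    uint8_t head;    /* first sharer in the list, or NO_NODE    */
} HomeEntry;

typedef struct {     /* per-block pointer state in each cache   */
    uint8_t next;    /* next sharer in the list                 */
    uint8_t prev;    /* previous sharer (NO_NODE means the home)*/
} CachePtrs;

/* Prepend a new sharer at the head of the doubly linked list:
 * it points at the old head, the old head's prev is patched, and
 * the home's head pointer is updated. Invalidation would walk this
 * list one sharer at a time, which is why invalidates serialize. */
void add_sharer(HomeEntry *home, CachePtrs caches[], uint8_t node)
{
    caches[node].next = home->head;
    caches[node].prev = NO_NODE;
    if (home->head != NO_NODE)
        caches[home->head].prev = node;
    home->head = node;
}

int main(void)
{
    HomeEntry home = { NO_NODE };
    CachePtrs caches[64];
    add_sharer(&home, caches, 7);
    add_sharer(&home, caches, 3);   /* list is now 3 -> 7 */
    return (home.head == 3 && caches[3].next == 7) ? 0 : 1;
}
```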

Flat Memory-Based Directories
- Block size = 128 B; memory in each node = 1 GB; cache in each node = 1 MB
- For 64 nodes and a 64-bit directory entry per block: directory size = 4 GB
- For 64 nodes and a 12-bit directory entry per block: directory size = 0.75 GB
[Figure: main memory directory tracking Cache 1, Cache 2, ..., Cache 64]

Flat Cache-Based Directories
- Block size = 128 B; memory in each node = 1 GB; cache in each node = 1 MB
- 6-bit storage (the head pointer) in DRAM for each memory block: DRAM overhead = 0.375 GB
- 12-bit storage (next and previous pointers) in SRAM for each cache block: SRAM overhead = 0.75 MB
[Figure: main memory's head pointer chains through Cache 7, Cache 3, and Cache 26]

Flat Memory-Based Directories
- Block size = 64 B; L3 cache in each node = 2 MB; L2 cache in each node = 256 KB
- For 64 nodes and a 64-bit directory entry per block: directory size = 16 MB
- For 64 nodes and a 12-bit directory entry per block: directory size = 3 MB
[Figure: shared L2 cache directory tracking L1 Cache 1, L1 Cache 2, ..., L1 Cache 64]

Flat Cache-Based Directories
- Block size = 64 B; L3 cache in each node = 2 MB; L2 cache in each node = 256 KB
- 6-bit storage in L3 for each block: L3 overhead = 1.5 MB
- 12-bit storage in L2 for each block: L2 overhead = 384 KB
[Figure: the head pointer chains through Cache 7, Cache 3, and Cache 26]
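The storage numbers on the last four slides can be recomputed mechanically. The following sketch does exactly that, assuming power-of-two units (1 GB = 2^30 B) and directory bits packed 8 per byte:

```c
#include <stdio.h>

int main(void)
{
    const long long GB = 1LL << 30, MB = 1LL << 20, KB = 1LL << 10;

    /* Off-chip example: 64 nodes, 128 B blocks,
     * 1 GB memory and 1 MB cache per node. */
    long long mem_blocks   = 64 * (1 * GB) / 128;   /* 2^29 blocks */
    long long cache_blocks = 64 * (1 * MB) / 128;   /* 2^19 blocks */

    printf("memory-based, 64-bit vector  : %lld GB\n",
           mem_blocks * 64 / 8 / GB);                /* 4 GB     */
    printf("memory-based, 12-bit entry   : %.2f GB\n",
           mem_blocks * 12 / 8.0 / GB);              /* 0.75 GB  */
    printf("cache-based, 6 b/blk in DRAM : %.3f GB\n",
           mem_blocks * 6 / 8.0 / GB);               /* 0.375 GB */
    printf("cache-based, 12 b/blk in SRAM: %.2f MB\n",
           cache_blocks * 12 / 8.0 / MB);            /* 0.75 MB  */

    /* On-chip example: 64 nodes, 64 B blocks,
     * 2 MB L3 and 256 KB L2 per node. */
    long long l3_blocks = 64 * (2 * MB) / 64;        /* 2^21 blocks */
    long long l2_blocks = 64 * (256 * KB) / 64;      /* 2^18 blocks */

    printf("memory-based, 64-bit vector  : %lld MB\n",
           l3_blocks * 64 / 8 / MB);                 /* 16 MB  */
    printf("memory-based, 12-bit entry   : %lld MB\n",
           l3_blocks * 12 / 8 / MB);                 /* 3 MB   */
    printf("cache-based, 6 b/blk in L3   : %.1f MB\n",
           l3_blocks * 6 / 8.0 / MB);                /* 1.5 MB */
    printf("cache-based, 12 b/blk in L2  : %lld KB\n",
           l2_blocks * 12 / 8 / KB);                 /* 384 KB */
    return 0;
}
```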

SGI Origin 2000
- Flat memory-based directory protocol
- Uses a bit-vector directory representation
- Two processors per node; combining multiple processors in a node reduces cost
[Figure: a node with two processors (each with an L2 cache), a communication assist (CA), and memory/directory (M/D), attached to the interconnect]

Directory Structure
- The system supports either a 16-bit or a 64-bit directory (fixed cost); for small systems, the directory works as a full bit-vector representation
- Seven states, of which 3 are stable
- For larger systems, a coarse vector is employed: each bit represents p/64 nodes (a sketch follows below)
- State is maintained for each node, not each processor; the communication assist broadcasts requests to both processors
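A minimal sketch of the coarse-vector mapping, assuming p is a multiple of 64; the function names are illustrative, not Origin interfaces:

```c
#include <stdint.h>
#include <stdio.h>

/* Map a node id to its directory bit: with a fixed 64-bit vector and
 * p nodes (p assumed to be a multiple of 64), each bit covers p/64
 * nodes. An invalidation for a set bit must then be sent to every
 * node in that group, not just the actual sharers. */
static int coarse_bit(int node, int p)
{
    int group_size = p / 64;      /* nodes per directory bit */
    return node / group_size;
}

static void mark_sharer(uint64_t *vec, int node, int p)
{
    *vec |= 1ULL << coarse_bit(node, p);
}

int main(void)
{
    uint64_t vec = 0;
    int p = 256;                  /* 256 nodes: 4 nodes per bit */
    mark_sharer(&vec, 9, p);      /* node 9 lives in group 2    */
    printf("bit for node 9: %d\n", coarse_bit(9, p));
    return 0;
}
```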

Handling Reads
- SGI Origin 2000 case study. Directory states: 3 stable states, 3 busy states, and 1 poison state. Cache states: invalid, shared, excl-clean, excl-modified
- When the home receives a read request, it looks up memory (a speculative read) and the directory in parallel
- Actions taken for each directory state (sketched in code below):
  - shared or unowned: data is returned to the requestor; the state is changed to excl if there are no other sharers
  - busy: a NACK is sent to the requestor
  - exclusive: the home is not the owner; the request is forwarded to the owner, and the owner sends data to the requestor and the home
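The per-state actions can be summarized as a switch at the home node. This is an illustrative sketch, not Origin 2000 code; the states follow the slide, while the message-sending helpers are hypothetical printf stubs:

```c
#include <stdio.h>

enum dstate { UNOWNED, SHARED, BUSY, EXCLUSIVE };

static void send_data(int node) { printf("data -> node %d\n", node); }
static void send_nack(int node) { printf("NACK -> node %d\n", node); }
static void fwd_to_owner(int owner, int req)
{ printf("fwd request of node %d -> owner %d\n", req, owner); }

void home_handle_read(enum dstate *state, int requestor, int owner)
{
    switch (*state) {
    case UNOWNED:
        send_data(requestor);   /* no other sharers: grant excl */
        *state = EXCLUSIVE;
        break;
    case SHARED:
        send_data(requestor);   /* data comes from memory       */
        break;
    case BUSY:
        send_nack(requestor);   /* a transaction is in flight   */
        break;
    case EXCLUSIVE:
        /* home is not the owner: forward; the owner sends data to
         * both the requestor and the home (details on next slide) */
        fwd_to_owner(owner, requestor);
        *state = BUSY;          /* busy-exclusive until resolved */
        break;
    }
}

int main(void)
{
    enum dstate s = EXCLUSIVE;
    home_handle_read(&s, /*requestor=*/2, /*owner=*/5);
    return 0;
}
```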

Inner Details of Handling the Read
- The block is in exclusive state; memory may or may not have a clean copy, but it is speculatively read anyway
- The directory state is set to busy-exclusive and the presence vector is updated
- In addition to forwarding the request to the owner, the memory copy is speculatively forwarded to the requestor
- Case 1 (excl-dirty): the owner sends the block to the requestor and the home; the speculatively sent data is overwritten
- Case 2 (excl-clean): the owner sends an ack (without data) to the requestor and the home; the requestor waits for this ack before it moves on with the speculatively sent data

Inner Details II
- Why did we send the block speculatively to the requestor if it does not save traffic or latency?
  - The R10K cache controller is programmed to not respond with data if it has a block in excl-clean state
  - When an excl-clean block is replaced from the cache, the directory need not be updated
  - Hence, the directory cannot rely on the owner to provide data, and speculatively provides data on its own

Handling Write Requests
- The home node must invalidate all sharers, and all invalidations must be acked (to the requestor); the requestor is informed of the number of invalidates to expect
- Actions taken for each state (see the sketch after the next slide):
  - shared: invalidates are sent, the state is changed to excl, and the data and number of sharers are sent to the requestor; the requestor cannot continue until it receives all acks (note: the directory does not maintain a busy state here, so subsequent requests will be forwarded to the new owner, and they must be buffered until the previous write has completed)

Handling Writes II
- Actions taken for each state (continued):
  - unowned: if the request was an upgrade and not a read-exclusive, is there a problem?
  - exclusive: is there a problem if the request was an upgrade? In the case of a read-exclusive: the directory is set to busy, a speculative reply is sent to the requestor, an invalidate is sent to the owner, the owner sends data to the requestor (if dirty) and a transfer-of-ownership message (no data) to the home to change out of busy
  - busy: the request is NACKed and the requestor must try again
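A sketch combining the write-handling actions from this slide and the previous one. The states and transitions follow the slides; the helpers and the simple sharer array are hypothetical:

```c
#include <stdio.h>

enum dstate { UNOWNED, SHARED, BUSY, EXCLUSIVE };

static void send_inval(int node) { printf("inval -> node %d\n", node); }
static void send_nack(int node)  { printf("NACK -> node %d\n", node); }
static void send_data_and_count(int node, int acks_expected)
{ printf("data + expect %d acks -> node %d\n", acks_expected, node); }

void home_handle_write(enum dstate *state, int requestor,
                       const int sharers[], int nsharers, int owner)
{
    switch (*state) {
    case SHARED:
        /* invalidate every sharer; tell the requestor how many acks
         * to collect before its write may complete */
        for (int i = 0; i < nsharers; i++)
            send_inval(sharers[i]);
        send_data_and_count(requestor, nsharers);
        *state = EXCLUSIVE;
        break;
    case UNOWNED:
        send_data_and_count(requestor, 0);
        *state = EXCLUSIVE;
        break;
    case EXCLUSIVE:
        /* read-exclusive case: go busy and invalidate the owner; the
         * owner sends data to the requestor (if dirty) and transfers
         * ownership back to the home to clear the busy state */
        send_inval(owner);
        *state = BUSY;
        break;
    case BUSY:
        send_nack(requestor);    /* requestor must retry */
        break;
    }
}

int main(void)
{
    enum dstate s = SHARED;
    int sharers[] = { 1, 4 };
    home_handle_write(&s, /*requestor=*/2, sharers, 2, /*owner=*/-1);
    return 0;
}
```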

Handling Write-Back
- When a dirty block is replaced, a writeback is generated and the home sends back an ack
- Can the directory state be shared when a writeback is received by the directory?
- Actions taken for each directory state (sketched below):
  - exclusive: change the directory state to unowned and send an ack
  - busy: a request and the writeback have crossed paths; the writeback changes the directory state to shared or excl (depending on the busy state), memory is updated, the home sends data to the requestor, and the intervention request is dropped
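The corresponding sketch for writeback handling at the home. The helpers are stubs, and the pending_requestor argument is an assumption used to show the crossed-paths case:

```c
#include <stdio.h>

enum dstate { UNOWNED, SHARED, BUSY, EXCLUSIVE };

void home_handle_writeback(enum dstate *state, int writer,
                           int pending_requestor)
{
    switch (*state) {
    case EXCLUSIVE:
        /* normal case: the owner replaced its dirty block */
        *state = UNOWNED;
        printf("ack -> node %d\n", writer);
        break;
    case BUSY:
        /* a request and the writeback crossed paths: use the
         * written-back data to serve the pending requestor, update
         * memory, and drop the now-stale intervention request */
        printf("update memory; data -> node %d\n", pending_requestor);
        *state = SHARED;  /* or EXCLUSIVE, depending on busy state */
        break;
    default:
        break;  /* shared/unowned: left unhandled in this sketch */
    }
}

int main(void)
{
    enum dstate s = EXCLUSIVE;
    home_handle_writeback(&s, /*writer=*/1, /*pending_requestor=*/-1);
    return 0;
}
```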

Writeback Cases
[Figure: P1 sends a Wback to directory D3, whose state is E: P1; D3 responds to P1 with an Ack]
- This is the normal case; D3 sends back an Ack

Writeback Cases
[Figure: P1's Wback crosses with a Fwd from D3, which has gone from E: P1 to busy on a Rd or Wr from P2]
- If someone else is acquiring the block in exclusive, D3 moves to busy; if the Wback is received meanwhile, D3 uses it to serve the requester
- If we didn't use a busy state when transitioning from E:P1 to E:P2, D3 may not have known whom to service (since ownership may have been passed on to P3, P4, ...)
- (This problem can instead be solved by NACKing the Wback and having P1 buffer its "strange" intervention requests, but this could lead to other corner cases)

Writeback Cases
[Figure: D3 (E: P1, then busy) forwards P2's request to P1; P1 sends Data to P2 and a transfer-of-ownership message to D3; meanwhile, a Wback from P2 reaches D3 while it is still busy]
- If the Wback is from the new requester, D3 sends back a NACK; floating unresolved messages are a problem
- Alternatively, D3 can accept the Wback and move to some new busy state
- Conclusion: we could have gotten rid of the busy state between E:P1 and E:P2, but only with Wback ACK/NACK and other buffering; or we could have kept the busy state between E:P1 and E:P2 and gotten rid of the ACK/NACK, but we would need one new busy state
