Lecture 2: Snooping and Directory Protocols. Topics: Snooping wrap-up and directory implementations

Similar documents
Lecture 8: Snooping and Directory Protocols. Topics: split-transaction implementation details, directory implementations (memory- and cache-based)

Lecture 8: Directory-Based Cache Coherence. Topics: scalable multiprocessor organizations, directory protocol design issues

Lecture 5: Directory Protocols. Topics: directory-based cache coherence implementations

Lecture 3: Snooping Protocols. Topics: snooping-based cache coherence implementations

Lecture 7: Implementing Cache Coherence. Topics: implementation details

Scalable Cache Coherence

Cache Coherence in Scalable Machines

Lecture 25: Multiprocessors

Scalable Cache Coherence. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

Scalable Cache Coherence

Lecture 25: Multiprocessors. Today s topics: Snooping-based cache coherence protocol Directory-based cache coherence protocol Synchronization

Lecture 3: Directory Protocol Implementations. Topics: coherence vs. msg-passing, corner cases in directory protocols

Cache Coherence in Scalable Machines

Cache Coherence: Part II Scalable Approaches

Lecture 18: Coherence Protocols. Topics: coherence protocols for symmetric and distributed shared-memory multiprocessors (Sections

Lecture 18: Coherence and Synchronization. Topics: directory-based coherence protocols, synchronization primitives (Sections

Special Topics. Module 14: "Directory-based Cache Coherence" Lecture 33: "SCI Protocol" Directory-based Cache Coherence: Sequent NUMA-Q.

EC 513 Computer Architecture

ECE 669 Parallel Computer Architecture

Lecture 24: Virtual Memory, Multiprocessors

Shared Memory SMP and Cache Coherence (cont) Adapted from UCB CS252 S01, Copyright 2001 USB

Scalable Cache Coherent Systems

Scalable Cache Coherent Systems Scalable distributed shared memory machines Assumptions:

NOW Handout Page 1. Context for Scalable Cache Coherence. Cache Coherence in Scalable Machines. A Cache Coherent System Must:

A Scalable SAS Machine

Cache Coherence (II) Instructor: Josep Torrellas CS533. Copyright Josep Torrellas

Shared Memory Multiprocessors. Symmetric Shared Memory Architecture (SMP) Cache Coherence. Cache Coherence Mechanism. Interconnection Network

Page 1. Cache Coherence

Lecture 7: PCM Wrap-Up, Cache coherence. Topics: handling PCM errors and writes, cache coherence intro

A More Sophisticated Snooping-Based Multi-Processor

Cache Coherence. Todd C. Mowry CS 740 November 10, Topics. The Cache Coherence Problem Snoopy Protocols Directory Protocols

CMSC 411 Computer Systems Architecture Lecture 21 Multiprocessors 3

Lecture 4: Directory Protocols and TM. Topics: corner cases in directory protocols, lazy TM

CMSC Computer Architecture Lecture 15: Memory Consistency and Synchronization. Prof. Yanjing Li University of Chicago

ECSE 425 Lecture 30: Directory Coherence

CMSC 611: Advanced Computer Architecture

Lecture 24: Thread Level Parallelism -- Distributed Shared Memory and Directory-based Coherence Protocol

A Basic Snooping-Based Multi-Processor Implementation

Multiprocessors II: CC-NUMA DSM. CC-NUMA for Large Systems

CMSC 611: Advanced. Distributed & Shared Memory

Limitations of parallel processing

Lect. 6: Directory Coherence Protocol

Introduction to Multiprocessors (Part II) Cristina Silvano Politecnico di Milano

Cache Coherence in Scalable Machines

Lecture 7: PCM, Cache coherence. Topics: handling PCM errors and writes, cache coherence intro

Computer Architecture

Multiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types

Multiprocessors & Thread Level Parallelism

Module 14: "Directory-based Cache Coherence" Lecture 31: "Managing Directory Overhead" Directory-based Cache Coherence: Replacement of S blocks

Lecture 1: Introduction

Shared Memory Multiprocessors

Suggested Readings! What makes a memory system coherent?! Lecture 27" Cache Coherency! ! Readings! ! Program order!! Sequential writes!! Causality!

Lecture 13. Shared memory: Architecture and programming

5008: Computer Architecture

CSC 631: High-Performance Computer Architecture

Lecture 30: Multiprocessors Flynn Categories, Large vs. Small Scale, Cache Coherency Professor Randy H. Katz Computer Science 252 Spring 1996

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU , Spring 2013

Cache Coherence Protocols: Implementation Issues on SMP s. Cache Coherence Issue in I/O

Cache Coherence. (Architectural Supports for Efficient Shared Memory) Mainak Chaudhuri

EECS 570 Lecture 11. Directory-based Coherence. Winter 2019 Prof. Thomas Wenisch

Lecture 24: Board Notes: Cache Coherency

Review: Multiprocessor. CPE 631 Session 21: Multiprocessors (Part 2) Potential HW Coherency Solutions. Bus Snooping Topology

COMP Parallel Computing. CC-NUMA (1) CC-NUMA implementation

Scalable Multiprocessors

Aleksandar Milenkovich 1

Page 1. SMP Review. Multiprocessors. Bus Based Coherence. Bus Based Coherence. Characteristics. Cache coherence. Cache coherence

Lecture: Coherence Protocols. Topics: wrap-up of memory systems, multi-thread programming models, snooping-based protocols

Lecture 26: Multiprocessors. Today s topics: Directory-based coherence Synchronization Consistency Shared memory vs message-passing

Aleksandar Milenkovic, Electrical and Computer Engineering University of Alabama in Huntsville

Parallel Computer Architecture Lecture 5: Cache Coherence. Chris Craik (TA) Carnegie Mellon University

Fall 2015 :: CSE 610 Parallel Computer Architectures. Cache Coherence. Nima Honarmand

Computer Architecture Lecture 10: Thread Level Parallelism II (Chapter 5) Chih Wei Liu 劉志尉 National Chiao Tung University

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015

Lecture: Coherence, Synchronization. Topics: directory-based coherence, synchronization primitives (Sections )

CS252 Spring 2017 Graduate Computer Architecture. Lecture 12: Cache Coherence

Chapter 5. Thread-Level Parallelism

Chapter 5. Multiprocessors and Thread-Level Parallelism

Memory Hierarchy in a Multiprocessor

Chapter 5. Multiprocessors and Thread-Level Parallelism

COSC 6385 Computer Architecture. - Thread Level Parallelism (III)

Handout 3 Multiprocessor and thread level parallelism

A Basic Snooping-Based Multi-Processor Implementation

Lecture 18: Shared-Memory Multiprocessors. Topics: coherence protocols for symmetric shared-memory multiprocessors (Sections

5 Chip Multiprocessors (II) Robert Mullins

Chapter 5 Thread-Level Parallelism. Abdullah Muzahid

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 5. Multiprocessors and Thread-Level Parallelism

Cache Coherence and Atomic Operations in Hardware

Shared Memory Architectures. Approaches to Building Parallel Machines

Multicast Snooping: A Multicast Address Network. A New Coherence Method Using. With sponsorship and/or participation from. Mark Hill & David Wood

Snoop-Based Multiprocessor Design III: Case Studies

Lecture-22 (Cache Coherence Protocols) CS422-Spring

CS654 Advanced Computer Architecture Lec 14 Directory Based Multiprocessors

Atomic Coherence: Leveraging Nanophotonics to Build Race-Free Cache Coherence Protocols. Dana Vantrease, Mikko Lipasti, Nathan Binkert

Interconnect Routing

Lecture 8: Transactional Memory TCC. Topics: lazy implementation (TCC)

... The Composibility Question. Composing Scalability and Node Design in CC-NUMA. Commodity CC node approach. Get the node right Approach: Origin

CS 152 Computer Architecture and Engineering. Lecture 19: Directory-Based Cache Protocols

Lecture 24: Memory, VM, Multiproc

CS 152 Computer Architecture and Engineering

Bus-Based Coherent Multiprocessors

Transcription:

Lecture 2: Snooping and Directory Protocols Topics: Snooping wrap-up and directory implementations 1

Split Transaction Bus So far, we have assumed that a coherence operation (request, snoops, responses, update) happens atomically What would it take to implement the protocol correctly while assuming a split transaction bus? Split transaction bus: a cache puts out a request, releases the bus (so others can use the bus), receives its response much later Assumptions: only one request per block can be outstanding separate lines for addr (request) and data (response) 2

Split Transaction Bus Proc 1 Proc 2 Proc 3 Cache Buf Cache Buf Cache Buf Request lines Response lines 3

Design Issues Could be a 3-stage pipeline: request/snoop/response or (much simpler) 2-stage pipeline: request-snoop/response (note that the response is slowest and needs to be hidden) Buffers track the outstanding transactions; buffers are identical in each core; an entry is freed when the response is seen; the next operation uses any free entry; every bus operation carries the buffer entry number as a tag Must check the buffer before broadcasting a new operation; must ensure only one outstanding operation per block What determines the write order requests or responses? 4

Design Issues II What happens if a processor is arbitrating for the bus and witnesses another bus transaction for the same address? If the processor issues a read miss and there is already a matching read in the request table, can we reduce bus traffic? 5

Non-Atomic State Transitions Other subtle considerations for correctness Example I: block A in shared state in P1 and P2; both issue a write; the bus controllers are ready to issue an upgrade request and try to acquire the bus; is there a problem? Must realize that the nature of the transaction might have changed between the time of queuing and time of issue Snoops should also look at the queue; or, eliminate upgrades and use a wired-or signal to suppress response 6

Livelock Example II: Livelock can happen if the processor-cache handshake is not designed correctly Before the processor can attempt the write, it must acquire the block in exclusive state If all processors are writing to the same block, one of them acquires the block first if another exclusive request is seen on the bus, the cache controller must wait for the processor to complete the write before releasing the block -- else, the processor s write will fail again because the block would be in invalid state 7

Directory-Based Protocol For each block, there is a centralized directory that maintains the state of the block in different caches The directory is co-located with the corresponding memory Requests and replies on the interconnect are no longer seen by everyone the directory serializes writes P P C C Dir Mem CA Dir Mem CA 8

Definitions Home node: the node that stores memory and directory state for the cache block in question Dirty node: the node that has a cache copy in modified state Owner node: the node responsible for supplying data (usually either the home or dirty node) Also, exclusive node, local node, requesting node, etc. P P C C Dir Mem CA Dir Mem CA 9

Directory Organizations Centralized Directory: one fixed location bottleneck! Flat Directories: directory info is in a fixed place, determined by examining the address can be further categorized as memory-based or cache-based Hierarchical Directories: the processors are organized as a logical tree structure and each parent keeps track of which of its immediate children has a copy of the block more searching, can exploit locality 10

Flat Memory-Based Directories Directory is associated with memory and stores info for all cache copies A presence vector stores a bit for every processor, for every memory block the overhead is a function of memory/block size and #processors Reducing directory overhead: 11

Flat Memory-Based Directories Directory is associated with memory and stores info for all cache copies A presence vector stores a bit for every processor, for every memory block the overhead is a function of memory/block size and #processors Reducing directory overhead: Width: pointers (keep track of processor ids of sharers) (need overflow strategy), organize processors into clusters Height: increase block size, track info only for blocks that are cached (note: cache size << memory size) 12

Flat Cache-Based Directories The directory at the memory home node only stores a pointer to the first cached copy the caches store pointers to the next and previous sharers (a doubly linked list) Main memory Cache 7 Cache 3 Cache 26 13

Flat Cache-Based Directories The directory at the memory home node only stores a pointer to the first cached copy the caches store pointers to the next and previous sharers (a doubly linked list) Potentially lower storage, no bottleneck for network traffic Invalidates are now serialized (takes longer to acquire exclusive access), replacements must update linked list, must handle race conditions while updating list 14

Flat Memory-Based Directories Block size = 128 B Memory in each node = 1 GB Cache in each node = 1 MB For 64 nodes and 64-bit directory, Directory size = 4 GB For 64 nodes and 12-bit directory, Directory size = 0.75 GB Main memory Cache 1 Cache 2 Cache 64 15

Flat Cache-Based Directories Block size = 128 B Memory in each node = 1 GB Cache in each node = 1 MB 6-bit storage in DRAM for each block; DRAM overhead = 0.375 GB 12-bit storage in SRAM for each block; SRAM overhead = 0.75 MB Main memory Cache 7 Cache 3 Cache 26 16

Flat Memory-Based Directories Block size = 64 B L2 cache in each node = 256 KB L1 Cache in each node = 64 KB For 64 nodes and 64-bit directory, Directory size = 2 MB For 64 nodes and 12-bit directory, Directory size = 375 KB L2 cache L1 Cache 1 L1 Cache 2 L1 Cache 64 17

Flat Cache-Based Directories Block size = 64 B L2 cache in each node = 256 KB L1 Cache in each node = 64 KB 6-bit storage in L2 for each block; L2 overhead = 187 KB 12-bit storage in L1 for each block; L1 overhead = 96 KB Main memory Cache 7 Cache 3 Cache 26 18

Data Sharing Patterns Two important metrics that guide our design choices: invalidation frequency and invalidation size turns out that invalidation size is rarely greater than four Read-only data: constantly read, never updated (raytrace) Producer-consumer: flag-based synchronization, updates from neighbors (Ocean) Migratory: reads and writes from a single processor for a period of time (global sum) Irregular: unpredictable accesses (distributed task queue) 19

SGI Origin 2000 Flat memory-based directory protocol Uses a bit vector directory representation Two processors per node combining multiple processors in a node reduces cost P L2 P L2 CA Interconnect M/D 20

Directory Structure The system supports either a 16-bit or 64-bit directory (fixed cost); for small systems, the directory works as a full bit vector representation Seven states, of which 3 are stable For larger systems, a coarse vector is employed each bit represents p/64 nodes State is maintained for each node, not each processor the communication assist broadcasts requests to both processors 21

Title Bullet 22