Scalable Cache Coherence

[§8.1] All of the cache-coherent systems we have talked about until now have had a bus. Not only does the bus guarantee serialization of transactions; it also serves as a convenient broadcast mechanism to ensure that each transaction is propagated to all other processors' caches.

Now we want to consider how cache coherence can be provided on a machine with physically distributed memory and no globally snoopable interconnect. These machines provide a shared address space, so they must be able to satisfy a cache miss transparently from local or remote memory. This replicates data. How can it be kept coherent?

[Figure: a scalable distributed-memory machine. Each node contains a processor (P), a cache ($), memory (M), and a communication assist (CA); the nodes are connected by a scalable network of switches.]

Scalable distributed-memory machines consist of P-C-M (processor-cache-memory) nodes connected by a network.

The communication assist interprets network transactions and forms the node's interface to the network.

A coherent system must do these things:

Provide a set of states, a state-transition diagram, and actions. (A minimal code sketch of this appears below.)
Manage the coherence protocol:
(0) Determine when to invoke the coherence protocol.
(a) Find the source of information about the state of this block in other caches.
(b) Find out where the other copies are.
(c) Communicate with those copies (invalidate/update).

Step (0) is done the same way on all systems: the state of the line is maintained in the cache, and the protocol is invoked if an access fault occurs on the line. The different approaches to scalable cache coherence are distinguished by their approaches to (a), (b), and (c).

Bus-based coherence

In a bus-based coherence scheme, all of (a), (b), and (c) are done through broadcast on the bus. The faulting processor sends out a "search"; the other processors respond to the search probe and take the necessary action.

We could do this in a scalable network too: broadcast to all processors, and let them respond. Why don't we? Because broadcast traffic and snoop bandwidth grow with the number of processors, which is exactly what a scalable design must avoid.
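By way of illustration, here is a minimal C sketch of what "a set of states" and the step-(0) decision of when to invoke the protocol might look like. The type and function names are invented for this sketch; a real machine encodes this logic in the cache controller hardware.

```c
#include <stdbool.h>

/* Hypothetical names; a real controller implements this in hardware. */
typedef enum { INVALID, SHARED, MODIFIED } line_state_t;

typedef struct {
    line_state_t state;
    /* tag, data, ... omitted */
} cache_line_t;

/* Step (0): decide when to invoke the coherence protocol.
 * The protocol runs only on an access fault; hits in a sufficient
 * state are satisfied locally with no protocol action at all. */
static bool needs_protocol(const cache_line_t *line, bool is_write)
{
    if (line->state == INVALID) return true;            /* miss */
    if (is_write && line->state == SHARED) return true; /* upgrade needed */
    return false;                                       /* ordinary hit */
}
```

Steps (a) through (c) are what the rest of this lecture varies: bus-based systems answer them by broadcast, and directory-based systems by lookup.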

Approach #1: Hierarchical snooping

Extend the snooping approach to a hierarchy of broadcast media. The interconnection network is not a bus but a tree of buses or rings (e.g., the KSR-1). The processors are in the bus- or ring-based multiprocessors at the leaves of the network. Parents and children are connected by two-way snoopy interfaces.

Functions (a) through (c) are performed by a hierarchical extension of the broadcast and snooping mechanism: a processor puts a search request on its bus, and the request is propagated up and down the hierarchy as needed, based on snoop results.

Problems: every level of the hierarchy adds latency, and the buses near the root must carry the combined coherence traffic of their subtrees, making them bandwidth bottlenecks. Hierarchical snooping schemes are much less common than directory-based schemes. We won't consider them further; for details, see §8.10.2.

Approach #2: Directories

The basic idea of a directory-based approach is this: every memory block has associated directory information, which keeps track of the copies of the cached block and their states. On a miss, the protocol finds the block's directory entry, looks it up, and communicates only with the nodes that have copies (if necessary).

In scalable networks, communication with the directory and the copies is through network transactions.

Let us start off by considering a simple directory-based approach due to Tang (1976). It uses:

- A central directory (called a present table) of which caches contain which main-memory blocks. Entry P[i, c] tells whether the ith block is cached in the cth cache.
- A central modified table that tells, for each block of main memory, whether the block is up to date (or whether some cache holds a more recent copy).
- A local table for each cache that keeps track of whether each line in the cache is also cached somewhere else.

Until a block is modified, it can be cached in any number of places. Once it is modified, only one cached copy is allowed to exist until it is written back to main memory.

Here is a diagram of the three tables:

[Figure: the three tables. Each of the n caches has a local table of local/valid flags, with entry L[k, c] for the cache line containing block k. The present table has a row for each of the N memory blocks (block 0 through block N-1) and a column for each cache (0 through n-1); entry P[i, c] covers block i in cache c. The modified table has one entry M[i] per block i.]

A cached block can be in one of three states:

- RO (read-only): several cached copies of the block exist, but none has been modified since it was loaded from main memory.
- EX (exclusive read-only): this is the only cached copy of the block, but it hasn't been modified since it was loaded.
- RW (read/write): the block has been written since it was loaded from main memory. This is the only cached copy of the block.

When a processor writes a block, it changes the block's status to RW and, if the previous status was RO, invalidates the other cached copies.

Here is a flowchart of the coherence checks on a read reference:

[Flowchart: coherence checks on a read reference. Fetch block i in local cache c. On a hit, the word is sent to the processor directly. On a miss, a block j is chosen for replacement (and written back, MM ← block[j, c], if it is RW); the global table is then consulted for a copy of block i in a remote cache r. If no copy exists, the block is simply cached, with a data fetch marked EX (it is the only copy) and an instruction fetch marked RO. If an RO or EX copy exists, the remote state L[i, r] is set to RO and the local copy is cached with L[i, c] = RO. If an RW copy exists, main memory is first updated from cache r (MM ← block[i, r], M[i] ← 0) and both copies become RO. Finally the word is sent to the processor. Legend: GT = possible access to global table; WB = possible main-memory update; MM = main memory; M[i] = modified bit of block i in the GT; MM ← block[i, r] = update MM with the modified copy of i in cache r; L[i, r] = state of block i in cache r.]

Here is a flowchart of the coherence checks on a write reference:

[Flowchart: coherence checks on a write reference. Store into block i in local cache c. On a hit, the local state L[i, c] is checked: if RW, the store proceeds immediately; if EX, the block is recorded as RW in the GT and the store proceeds; if RO, all copies of block i in other caches are invalidated first. On a miss, a block j is chosen for replacement (written back, MM ← block[j, c], if it is RW), and the GT is consulted for other copies of block i: if one RW copy exists in cache r, block i is written back to MM and invalidated in cache r; if RO copies exist, they are all invalidated. Block i is then fetched with RW access, recorded as RW in the GT, and stored into cache c. Legend: GT = possible access to global table; WB = possible write-back to memory; I = pure invalidation; MM = main memory.]
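The two flowcharts translate almost directly into table lookups. Below is a minimal sketch in C under some simplifying assumptions: small fixed table sizes, stub functions standing in for the real cache and memory operations, and no replacement handling. It is an illustration of Tang's checks, not a full implementation.

```c
#include <stdio.h>

#define N_BLOCKS 1024   /* illustrative sizes, not from the lecture */
#define N_CACHES 8

typedef enum { ABSENT, RO, EX, RW } state_t;

state_t L[N_BLOCKS][N_CACHES]; /* local tables: state of block i in cache c */
int     M[N_BLOCKS];           /* modified table: 1 if a cache holds a newer copy */

/* Stubs standing in for real cache/memory operations. */
static void write_back(int i, int r)  { printf("MM <- block[%d, cache %d]\n", i, r); }
static void invalidate(int i, int r)  { printf("invalidate block %d in cache %d\n", i, r); }
static void fetch_block(int i, int c) { printf("fetch block %d into cache %d\n", i, c); }

/* Read reference to block i by cache c that missed (replacement omitted). */
void read_miss(int i, int c)
{
    for (int r = 0; r < N_CACHES; r++) {
        if (r == c || L[i][r] == ABSENT) continue;
        if (L[i][r] == RW) { write_back(i, r); M[i] = 0; } /* update MM first */
        L[i][r] = RO;                                      /* downgrade remote copy */
    }
    fetch_block(i, c);
    L[i][c] = RO;  /* conservatively RO; Tang marks a lone data fetch EX */
}

/* Write reference to block i by cache c (hit or miss). */
void write_ref(int i, int c)
{
    if (L[i][c] == RW) return;  /* RW hit: no protocol action needed */
    for (int r = 0; r < N_CACHES; r++) {
        if (r == c || L[i][r] == ABSENT) continue;
        if (L[i][r] == RW) write_back(i, r); /* at most one RW copy exists */
        invalidate(i, r);                    /* pure invalidation for RO copies */
        L[i][r] = ABSENT;
    }
    if (L[i][c] == ABSENT) fetch_block(i, c); /* fetch with RW access */
    L[i][c] = RW;                             /* record block as RW in the GT */
    M[i] = 1;
}

int main(void)
{
    read_miss(7, 0);   /* cache 0 reads block 7 */
    write_ref(7, 1);   /* cache 1 writes it: cache 0's copy is invalidated */
    return 0;
}
```

In a centralized scheme, all of L, M, and the loop over caches live in one place; the distributed schemes below spread this state across the machine.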

Operation of a simple directory scheme

The scheme we have just discussed is called a centralized directory scheme, because the directory is kept together in one place. Let us now assume that the directory is distributed, with each node holding the directory information for the blocks it contains. This node is called the home node for those blocks.

What happens on a read-miss?

[Figure: a read-miss to a block that is dirty in another node's cache. 1. The requestor sends a read request to the directory node for the block. 2. The directory node replies with the owner's identity. 3. The requestor sends a read request to the owner (the node with the dirty copy). 4a. The owner sends a data reply to the requestor. 4b. The owner sends a revision message to the directory.]

The requesting node sends a request transaction over the network to the home node. The home node responds with the identity of the owner, the node that currently holds a valid copy of the block. The requesting node then gets the data from the owner, and the directory entry is revised accordingly.
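As a rough illustration of the transaction types involved, here is a hedged C sketch of the five messages in the sequence above; the message names are invented for illustration and do not correspond to any particular machine's protocol.

```c
/* The five transactions in the read-miss-to-dirty-block sequence above.
 * Names are hypothetical, chosen to mirror the figure's labels. */
typedef enum {
    READ_REQ_TO_DIR,    /* 1:  requestor -> directory (home) node */
    OWNER_IDENTITY,     /* 2:  directory -> requestor             */
    READ_REQ_TO_OWNER,  /* 3:  requestor -> node with dirty copy  */
    DATA_REPLY,         /* 4a: owner -> requestor                 */
    REVISION_TO_DIR     /* 4b: owner -> directory                 */
} read_miss_msg_t;
```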

On a write-miss, the directory identifies the copies of the block, and invalidation or update messages may be sent to those copies.

[Figure: a write-miss to a block with two sharers. 1. The requestor sends a read-exclusive (RdEx) request to the directory node. 2. The directory replies with the sharers' identities. 3a, 3b. The requestor sends invalidation requests to each sharer. 4a, 4b. Each sharer returns an invalidation acknowledgment.]

One major difference from bus-based schemes: what information will be held in the directory? There will be a dirty bit telling whether the block is dirty in some cache. Not all of the caches' state information (MESI, etc.) needs to be kept in the directory, only enough to determine what actions to take.

Sometimes the state information in the directory will be out of date. Why? Messages take time to cross the network, and a cache may, for example, replace a shared block without notifying the directory. So sometimes a directory will send a message to a cache that is no longer correct when it is received.

Let us consider an example system:

- Three stable cache states (MSI).
- Single-level caches.
- One processor per node.

The directory entry for each block holds a dirty bit and presence bits p[i], one for each processor.

On a read-miss from main memory by processor i:

- If the dirty bit is off, then { read from main memory; turn p[i] on }.
- If the dirty bit is on, then { recall the line from the dirty processor (setting its cache state to shared); update memory; turn dirty off; turn p[i] on; supply the recalled data to processor i }.

On a write-miss to main memory by processor i:

- If the dirty bit is off, then { supply the line to i, along with the p vector; send invalidations to all caches that have the block; turn dirty on; turn p[i] on and the other p bits off in the directory entry }.
- If the dirty bit is on, then { recall the line from the dirty processor d; change d's cache state to invalid; supply the line to i; turn p[d] off and p[i] on; the dirty bit stays on }.

On the replacement of a dirty block by node i, the data is written back to memory and the directory is updated to turn off the dirty bit and p[i]. On the replacement of a shared block, the directory may or may not be updated. (A code sketch of these directory actions follows below.)

How does a directory help? It keeps track of which nodes have copies of a block, eliminating the need to broadcast on every miss. Would directories be valuable if most data were shared by most of the nodes in the system? (Not very: almost every write would then have to contact almost every node, which is close to broadcast anyway.)
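Here is a minimal sketch of those read-miss and write-miss actions in C. The bit-vector representation and the stub functions (recall, send_inval, supply) are hypothetical stand-ins for real network transactions; replacement, races, and error handling are omitted.

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define N_PROCS 64   /* illustrative machine size */

typedef struct {
    bool     dirty;     /* some cache holds a newer copy than memory   */
    uint64_t presence;  /* bit i set => processor i's cache has a copy */
} dir_entry_t;

/* Stubs standing in for network transactions. */
static void recall(int b, int d, bool to_shared)
{ printf("recall block %d from proc %d -> %s\n", b, d, to_shared ? "shared" : "invalid"); }
static void send_inval(int b, int s) { printf("invalidate block %d at proc %d\n", b, s); }
static void supply(int b, int i)     { printf("supply block %d to proc %d\n", b, i); }

void dir_read_miss(dir_entry_t *e, int b, int i)
{
    if (e->dirty) {
        int d = __builtin_ctzll(e->presence); /* GCC/Clang: index of the sole owner */
        recall(b, d, true);   /* owner's copy becomes shared; memory updated */
        e->dirty = false;
    }
    e->presence |= 1ULL << i; /* turn p[i] on */
    supply(b, i);
}

void dir_write_miss(dir_entry_t *e, int b, int i)
{
    if (e->dirty) {
        int d = __builtin_ctzll(e->presence);
        recall(b, d, false);              /* owner's copy becomes invalid */
    } else {
        for (int s = 0; s < N_PROCS; s++) /* invalidate every current sharer */
            if (s != i && (e->presence & (1ULL << s)))
                send_inval(b, s);
    }
    e->presence = 1ULL << i;  /* p[i] on, all other p bits off */
    e->dirty = true;
    supply(b, i);
}

int main(void)
{
    dir_entry_t e = { false, 0 };
    dir_read_miss(&e, 5, 2);   /* proc 2 reads block 5 */
    dir_read_miss(&e, 5, 7);   /* proc 7 reads it too  */
    dir_write_miss(&e, 5, 3);  /* proc 3 writes: procs 2 and 7 are invalidated */
    return 0;
}
```

Note that a read or write hit to a block in a sufficient state involves no directory traffic at all, which is where the scheme saves over broadcast.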

Fortunately, the number of valid copies of the data on most writes is small.

Scaling with the number of processors [§8.2.2]

Scaling of the memory and directory bandwidth provided: a centralized directory is a bandwidth bottleneck, just like a centralized memory. How can directory information be maintained in a distributed way?

Scaling of performance characteristics:

- Traffic: the number of network transactions each time the protocol is invoked.
- Latency: the number of network transactions in the critical path each time.

Scaling of directory storage requirements: the number of presence bits needed grows with the number of processors. E.g., with a 32-byte block size and 1024 processors, a block holds 32 × 8 = 256 bits of data, while its directory entry needs 1024 presence bits (plus a dirty bit), four times the size of the block itself.

How a directory is organized affects all of these: performance at a target scale, as well as coherence-management issues.

Alternatives for organizing directories [§8.2.3]

When a miss occurs, how do we find the directory information? There are two main alternatives:

- A flat directory scheme. Directory information is in a fixed place, usually at the home (where the memory is located). On a miss, a transaction is sent to the home node.
- A hierarchical directory scheme. Directory information is organized as a tree, with the processing nodes at the leaves. Each node keeps track of which, if any, of its (immediate) children have a copy of the block.

When a miss occurs, the directory information is found by traversing up the hierarchy, level by level, until the block is found in the appropriate state. The state indicates, e.g., whether copies of the block exist outside the subtree of this directory.

[Figure: taxonomy of directory schemes. Directory schemes are centralized or distributed. Distributed schemes differ in how to find the source of directory information (flat vs. hierarchical) and, for flat schemes, in how to locate the copies (memory-based vs. cache-based).]

How do flat schemes store information about copies?

Memory-based schemes store the information about all cached copies at the home node of the block. The directory schemes we considered earlier are memory-based.

Cache-based schemes distribute the information about copies among the copies themselves. The home contains a pointer to one cached copy of the block; each copy contains the identity of the next node that has a copy. The location of the copies is therefore determined through network transactions.

[Figure: a cache-based sharing list. Main memory at the home node points to the copy in Node 0's cache; Node 0's copy points to Node 1's, and Node 1's to Node 2's.]

When do hierarchical schemes outperform flat schemes? When sharing is mostly local, so that misses can be satisfied within a small subtree without traveling to a distant home. Why might hierarchical schemes be slower than flat schemes? Because even a simple miss may require directory lookups at several levels of the tree, each adding latency.

Organizing a memory-based directory scheme

All information about the copies is colocated with the block itself at the home. Such schemes work just like the centralized scheme, except for being distributed.

Scaling of performance characteristics: traffic on a write is proportional to the number of sharers. Latency? Invalidations can be issued in parallel, so the latency of a write need not grow as fast as its traffic.

Scaling of storage overhead? Assume the representation is a full bit vector: one presence bit per processor for each memory block, so total directory storage grows as the product of the number of processors p and the number of memory blocks m.

Optimizations for full bit-vector schemes:

- Increase the block size (this reduces the storage overhead proportionally).

- Use multiprocessor nodes (one bit per multiprocessor node, not per processor). Storage still scales as pm, but this is only a problem for very large machines: with 256 processors, 4 per cluster, and 128-byte lines, the overhead is 64 bits per 1024-bit line, or 6.25%.

- Reduce the "width", addressing the p term. Observation: most blocks are cached by only a few nodes. So don't keep a bit per node; instead, make the entry contain a few pointers. With p = 1024, pointers are 10 bits each, so one could use 100 pointers and still save space over a full bit vector. Sharing patterns indicate that a few pointers should suffice (five or so), but an overflow strategy is needed when there are more sharers (more on this later).

- Reduce the "height", addressing the m term. Observation: the number of memory blocks >> the number of cache lines, so most blocks will not be cached at any particular time; therefore, most directory entries are useless at any given time. So organize the directory as a cache, rather than having one entry per memory block.

The short program below works through these overhead figures.
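This small, self-contained C program reproduces the arithmetic above; the sizes are the examples from the text, not parameters of any real machine.

```c
#include <stdio.h>

/* Directory storage overhead = directory bits per entry / data bits per block. */
static double overhead(int dir_bits, int block_bytes)
{
    return 100.0 * dir_bits / (block_bytes * 8);
}

int main(void)
{
    /* Full bit vector: 1024 processors, 32-byte blocks. */
    printf("full vector, 1024 procs, 32B block:  %.1f%%\n",
           overhead(1024, 32));                    /* prints 400.0% */

    /* One bit per 4-processor cluster: 256 procs, 128-byte lines. */
    printf("256 procs, 4 per cluster, 128B line: %.2f%%\n",
           overhead(256 / 4, 128));                /* prints 6.25% */

    /* Limited pointers: 100 pointers of 10 bits vs. a 1024-bit vector. */
    printf("100 x 10-bit pointers vs. full vector: %d vs. %d bits\n",
           100 * 10, 1024);
    return 0;
}
```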

Organizing a cache-based directory scheme

In a cache-based scheme, the home only holds a pointer to the rest of the directory information. The copies are linked together via a distributed list that weaves through the caches: each cache tag has a pointer to the next cache with a copy.

On a read, a processor adds itself to the head of the list (communication is needed to do so). On a write, it makes itself the head node of the list, then propagates a chain of invalidations down the list; each invalidation must be acknowledged. On a write-back, the node must delete itself from the list (and therefore communicate with the nodes before and after it). A sketch of these list operations appears below.

Disadvantages:

- All operations require communicating with at least three nodes (the node being operated on, the previous node, and the next node).
- Write latency is proportional to the number of sharers.
- Synchronization is needed to keep the list consistent when several nodes insert or delete concurrently.

Advantages:

- Directory overhead is small.
- The linked list records the order in which accesses were made, making it easier to avoid starvation.
- The work of performing invalidations can be distributed among the sharers.

The IEEE Scalable Coherent Interface (SCI) has formalized protocols for handling cache-based directory schemes.
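For concreteness, here is a doubly linked sharing list in C, in the spirit of SCI but greatly simplified: the representation and names are invented, everything runs in one address space, and real SCI serializes these steps with carefully specified transaction sequences.

```c
#define NIL (-1)

/* Per-cache sharing-list state for one block (invented representation). */
typedef struct {
    int prev;   /* node before us in the list, NIL if we are the head */
    int next;   /* node after us in the list,  NIL if we are the tail */
    int valid;  /* do we hold a copy at all? */
} list_entry_t;

typedef struct {
    int head;              /* the home's pointer to the first copy */
    list_entry_t node[64]; /* one entry per node; 64 is illustrative */
} sharing_list_t;

/* Read: the requestor inserts itself at the head (it must talk to the
 * home and to the old head -- the "communication needed" above). */
void list_insert_head(sharing_list_t *l, int me)
{
    l->node[me].prev = NIL;
    l->node[me].next = l->head;
    if (l->head != NIL)
        l->node[l->head].prev = me;  /* old head learns its new neighbor */
    l->head = me;
    l->node[me].valid = 1;
}

/* Write-back/replacement: delete ourselves, touching both prev and next. */
void list_delete(sharing_list_t *l, int me)
{
    int p = l->node[me].prev, n = l->node[me].next;
    if (p != NIL) l->node[p].next = n; else l->head = n;
    if (n != NIL) l->node[n].prev = p;
    l->node[me].valid = 0;
}

/* Write: assumes the writer has already made itself the head; it then
 * walks the chain invalidating copies. Each step would be a network
 * transaction plus an acknowledgment in a real system. */
void list_invalidate_rest(sharing_list_t *l, int writer)
{
    int n = l->node[writer].next;
    while (n != NIL) {
        int after = l->node[n].next;
        l->node[n].valid = 0;
        l->node[n].prev = l->node[n].next = NIL;
        n = after;
    }
    l->node[writer].next = NIL;  /* the writer now holds the only copy */
}

int main(void)
{
    sharing_list_t l;
    l.head = NIL;
    for (int i = 0; i < 64; i++) {
        l.node[i].prev = l.node[i].next = NIL;
        l.node[i].valid = 0;
    }
    list_insert_head(&l, 2);      /* node 2 reads            */
    list_insert_head(&l, 5);      /* node 5 reads            */
    list_insert_head(&l, 9);      /* node 9 prepares a write */
    list_invalidate_rest(&l, 9);  /* nodes 5 and 2 invalidated */
    return 0;
}
```

The walk in list_invalidate_rest is why write latency is proportional to the number of sharers, and the prev/next updates in list_delete are why even a replacement touches three nodes.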

Summary

Flat schemes:

- Issue (a), finding the source of the directory data: go to the home node, based on the address.
- Issue (b), finding out where the copies are:
  memory-based: all the information is in the directory at the home;
  cache-based: the home has a pointer to the first element of a distributed linked list.
- Issue (c), communicating with those copies:
  memory-based: point-to-point messages (perhaps coarser on overflow), which can be multicast or overlapped;
  cache-based: communication is part of the point-to-point linked-list traversal used to find the copies, and is serialized.

Hierarchical schemes:

- All three issues are handled by sending messages up and down the tree.
- There is no single explicit list of sharers.
- The only direct communication is between parents and children.