Cache Coherence in Scalable Machines

Cache Coherence in Scalable Machines
CSE 661: Parallel and Vector Architectures
Prof. Muhamed Mudawar
Computer Engineering Department, King Fahd University of Petroleum and Minerals

Generic Scalable Multiprocessor
[Figure: nodes 1..n, each with a processor, memory, and communication assist (CA), connected by a scalable interconnection network carrying many parallel transactions]
Scalable distributed memory machine: P-C-M (processor, cache, memory) nodes connected by a scalable network
  Scalable memory bandwidth at reasonable latency
Communication Assist (CA)
  Interprets network transactions, forms an interface to the network
  Provides a shared address space in hardware

What Must a Coherent System Do?
Provide a set of states, a transition diagram, and actions
Determine when to invoke the coherence protocol
  Done the same way on all systems
  The state of the line is maintained in the cache
  The protocol is invoked if an access fault occurs on the line: a read miss, a write miss, or a write to a shared block
Manage the coherence protocol
  1. Find information about the state of the block in other caches, to decide whether to communicate with other cached copies
  2. Locate the other cached copies
  3. Communicate with those copies (invalidate / update)
  Handled differently by different approaches

Bus-Based Cache Coherence
Functions 1, 2, and 3 are accomplished through broadcast and snooping on the bus
  The processor initiating a bus transaction sends out a search; others respond and take the necessary action
Could be done in a scalable network too: broadcast to all processors and let them respond
  Conceptually simple, but broadcast doesn't scale
  On a bus, bus bandwidth doesn't scale
  On a scalable network, every access fault leads to at least p network transactions
Scalable cache coherence needs different mechanisms to manage the protocol
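
As a concrete illustration of the "when to invoke the protocol" decision above, here is a minimal C sketch (my own, not from the lecture); the state names follow a generic MSI protocol and everything else is illustrative.

/* Sketch: deciding when a cache access must invoke the coherence protocol,
 * per the three cases listed above (read miss, write miss, write to shared). */
#include <stdbool.h>
#include <stdio.h>

typedef enum { INVALID, SHARED, MODIFIED } line_state_t;
typedef enum { READ, WRITE } access_t;

/* Returns true when the access cannot be satisfied locally and the
 * coherence protocol (snooping or directory) must be invoked. */
static bool needs_protocol(line_state_t state, access_t op) {
    if (state == INVALID)               return true;   /* read or write miss   */
    if (state == SHARED && op == WRITE) return true;   /* write to shared line */
    return false;                                      /* hit: M, or read to S */
}

int main(void) {
    printf("%d\n", needs_protocol(SHARED, WRITE));   /* 1: must invalidate others */
    printf("%d\n", needs_protocol(MODIFIED, WRITE)); /* 0: hit, handled silently  */
    return 0;
}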

Directory-Based Cache Coherence
[Figure: nodes 1..n, each with a directory alongside its memory and a communication assist (CA), connected by a scalable interconnection network carrying many parallel transactions]
Scalable cache coherence is based on directories
  Distributed directories for distributed memories
  Each cache-line-sized block of memory has a directory entry that keeps track of its state and of the nodes that are currently sharing the block

Simple Directory Scheme
A simple way to organize a directory is to associate every memory block with a corresponding directory entry
  To keep track of copies of cached blocks and their states
On a miss
  Locate the home node and the corresponding directory entry
  Look it up to find the block's state and the nodes that have copies
  Communicate with the nodes that have copies, if necessary
On a read miss, the directory indicates from which node the data may be obtained
On a write miss, the directory identifies the shared copies to be invalidated or updated
Many alternatives exist for organizing the directory

Directory Information
[Figure: each directory entry, stored next to memory at its node, holds a dirty bit and presence bits for all nodes]
A simple organization of a directory entry is to have
  A bit vector of p presence bits, one for each of the p nodes, indicating which nodes are sharing the block
  One or more state bits per block reflecting the memory view
  One dirty bit indicating whether the block is modified
  If the dirty bit is 1, the block is in modified state in exactly one node

Definitions
Home node: node in whose main memory the block is allocated
Dirty node: node that has a copy of the block in its cache in modified state
Owner node: node that currently holds the valid copy of the block; can be either the home node or the dirty node
Exclusive node: node that has the block in its cache in an exclusive state, either exclusive clean or exclusive modified
Local or requesting node: node that issued the request for the block
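
For concreteness, a minimal C sketch (my own, not the lecture's) of the full-bit-vector entry just described: p presence bits plus a dirty bit per memory block. NUM_NODES and the helper names are illustrative.

#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define NUM_NODES 64

typedef struct {
    uint64_t presence[(NUM_NODES + 63) / 64]; /* one presence bit per node      */
    bool     dirty;                           /* 1 => exactly one sharer, which
                                                 holds a modified copy          */
} dir_entry_t;

static void set_presence(dir_entry_t *e, int node)     { e->presence[node / 64] |=  (1ULL << (node % 64)); }
static void clear_presence(dir_entry_t *e, int node)   { e->presence[node / 64] &= ~(1ULL << (node % 64)); }
static bool is_present(const dir_entry_t *e, int node) { return (e->presence[node / 64] >> (node % 64)) & 1; }
static void clear_all(dir_entry_t *e)                  { memset(e->presence, 0, sizeof e->presence); e->dirty = false; }

int main(void) {
    dir_entry_t e;
    clear_all(&e);
    set_presence(&e, 9);
    set_presence(&e, 7);
    clear_presence(&e, 7);
    return (is_present(&e, 9) && !is_present(&e, 7) && !e.dirty) ? 0 : 1;
}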

Basic Operation on a Read Miss
Read miss by processor i
  Requestor sends a read request message to the home node
  The assist looks up the directory entry; if the dirty bit is OFF
    Reads the block from memory, sends a data reply message containing the block to the requestor, and turns Presence[i] ON
  If the dirty bit is ON
    Requestor sends a read request message to the dirty node
    Dirty node sends a data reply message to the requestor
    Dirty node also sends a revision message to the home node, and its cache block state is changed to shared
    Home node updates its main memory and directory entry: turns the dirty bit OFF and Presence[i] ON

Read Miss to a Block in Dirty State
The numbers 1, 2, and so on show the serialization of transactions; letters on the same number indicate overlapped transactions
Total of 5 transactions; messages 4a and 4b are performed in parallel
  1. Read request to home node
  2. Response with owner identity
  3. Read request to owner (dirty node)
  4a. Data reply to requestor
  4b. Revision message to home node
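
For illustration, a minimal C sketch (my own, not the lecture's protocol code) of the home node's part of this read-miss handling; for brevity a single 64-bit word stands in for the presence vector (up to 64 nodes), and the send_*() helpers are hypothetical stand-ins for network messages.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct { uint64_t presence; bool dirty; } dir_entry_t;

static void send_data_reply(int to, unsigned long addr)              { printf("data reply -> node %d (block %lx)\n", to, addr); }
static void send_owner_identity(int to, int owner, unsigned long a)  { printf("owner is node %d -> node %d (block %lx)\n", owner, to, a); }

static int find_owner(const dir_entry_t *e) {   /* the only sharer when dirty */
    for (int n = 0; n < 64; n++)
        if ((e->presence >> n) & 1) return n;
    return -1;
}

void home_read_miss(dir_entry_t *e, int requestor, unsigned long addr) {
    if (!e->dirty) {
        send_data_reply(requestor, addr);       /* block comes from memory   */
        e->presence |= 1ULL << requestor;       /* Presence[requestor] = ON  */
    } else {
        /* Transaction 2: reply with the owner's identity; the requestor then
         * contacts the dirty node (3, 4a, 4b). On the later revision message
         * the home clears the dirty bit and sets Presence[requestor]. */
        send_owner_identity(requestor, find_owner(e), addr);
    }
}

int main(void) {
    dir_entry_t e = { .presence = 1ULL << 5, .dirty = true };
    home_read_miss(&e, 2, 0x40);                /* block dirty at node 5, miss by node 2 */
    return 0;
}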

Basic Operation on a Write Miss
Write miss caused by processor i
  Requestor sends a read-exclusive message to the home node
  The assist looks up the directory entry; if the dirty bit is OFF
    Home node sends a data reply message to processor i, containing the data and the presence bits identifying all sharers
    Requestor node sends invalidation messages to all sharers
    Sharer nodes invalidate their cached copies and send acknowledgement messages to the requestor node
  If the dirty bit is ON
    Home node sends a response message identifying the dirty node
    Requestor sends a read-exclusive message to the dirty node
    Dirty node sends a data reply message to processor i and changes its cache state to Invalid
  In both cases, the home node clears the presence bits, but turns Presence[i] and the dirty bit ON

Write Miss to a Block with Sharers
The protocol is orchestrated by the assists, called coherence or directory controllers
  1. RdEx request to home node
  2. Data reply with the sharers' identities
  3a, 3b. Invalidation requests to the sharers
  4a, 4b. Invalidation acknowledgements to the requestor
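
Again for illustration only, a minimal C sketch (my own) of the write-miss path described above, with the same simplified 64-bit presence word and printf stubs standing in for network messages.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct { uint64_t presence; bool dirty; } dir_entry_t;

static void send_data_and_sharers(int to, uint64_t sharers) { printf("data + sharer mask %llx -> node %d\n", (unsigned long long)sharers, to); }
static void send_dirty_identity(int to, int dirty_node)     { printf("dirty node is %d -> node %d\n", dirty_node, to); }
static void send_invalidation(int to)                       { printf("invalidate -> node %d\n", to); }

void home_write_miss(dir_entry_t *e, int requestor) {
    if (!e->dirty) {
        /* Home replies with the data and the current sharer set; the
         * requestor invalidates those sharers itself (see below). */
        send_data_and_sharers(requestor, e->presence);
    } else {
        int owner = 0;
        while (!((e->presence >> owner) & 1)) owner++;
        send_dirty_identity(requestor, owner);   /* requestor sends RdEx to owner */
    }
    /* In both cases: clear presence, then set Presence[requestor] and dirty. */
    e->presence = 1ULL << requestor;
    e->dirty    = true;
}

/* At the requestor: invalidate every sharer named in the reply and count the
 * acknowledgements that must be collected. */
int requestor_invalidate(uint64_t sharers, int self) {
    int acks_expected = 0;
    for (int n = 0; n < 64; n++)
        if (((sharers >> n) & 1) && n != self) { send_invalidation(n); acks_expected++; }
    return acks_expected;
}

int main(void) {
    dir_entry_t e = { .presence = (1ULL << 3) | (1ULL << 7), .dirty = false };
    home_write_miss(&e, 1);
    requestor_invalidate((1ULL << 3) | (1ULL << 7), 1);
    return 0;
}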

Data Sharing Patterns
Provide insight into directory requirements
  If most misses involved O(P) transactions, broadcast might be a good solution; however, there are generally few sharers at a write
Important to understand two aspects of data sharing
  Frequency of shared writes, or invalidating writes: writes that miss or that hit a block in the shared state; called the invalidation frequency in invalidation-based protocols
  Distribution of the number of sharers, called the invalidation size distribution
The invalidation size distribution also provides insight into how to organize and store directory information

Cache Invalidation Patterns
Simulation running on 64 processors; the MSI protocol is used
[Figures: invalidation size distributions, % of shared writes vs. number of invalidations (0 to 63). LU: 8.75% of shared writes cause 0 invalidations and 91.22% cause exactly 1, with a negligible tail. Ocean: 80.98% cause 0, 15.06% cause 1, 3.04% cause 2, and larger sizes are rare.]

Cache Invalidation Patterns (cont'd)
Infinite per-processor caches are used, to capture inherent sharing patterns
[Figures: invalidation size distributions. Barnes-Hut: 1.27% of shared writes cause 0 invalidations, 48.35% cause 1, 22.87% cause 2, 10.56% cause 3, with a small tail out to 63. Radiosity: 6.68% cause 0, 58.35% cause 1, 12.04% cause 2, again with a long but small tail.]

Framework for Sharing Patterns
Shared-data access patterns can be categorized:
  Code and read-only data structures are never written: not an issue for directories
  Producer-consumer: one processor produces data and others consume it; the invalidation size is determined by the number of consumers
  Migratory data: data migrates from one processor to another (example: computing a global sum, where the sum migrates); the invalidation size is small (typically 1) even as P scales
  Irregular read-write: example: a distributed task queue, where processes probe the head pointer; invalidations usually remain small, though frequent

Sharing Patterns Summary
Generally there are few sharers at a write, and the number scales slowly with P
  A write may send 0 invalidations in the MSI protocol, since the block is loaded in shared state; this would not happen in the MESI protocol
  Infinite per-processor caches are used to capture inherent sharing patterns
  Finite caches send replacement hints on block replacement, which turn off presence bits and reduce the number of invalidations; however, traffic is not reduced
  Non-zero frequencies of very large invalidation sizes are due to all processors spinning on a synchronization variable

Alternatives for Directories
Directory schemes: centralized or distributed
Finding the source of directory information: flat or hierarchical
Locating copies:
  Memory-based: directory information is co-located with the memory module in the home node for each memory block; examples: Stanford DASH/FLASH, MIT Alewife, SGI Origin, HAL
  Cache-based: the caches holding a copy of the memory block form a linked list; memory holds a pointer to the head of the list; examples: IEEE SCI, Sequent NUMA-Q

Flat Memory-based Schemes
Information about cached (shared) copies is stored with the block at the home node
Performance scaling
  Traffic on a shared write: proportional to the number of sharers
  Latency on a shared write: invalidations can be issued in parallel
Simplest representation: one presence bit per node
  Storage overhead is proportional to P x M, for P nodes and M memory blocks, ignoring state bits
  Directory storage overhead scales poorly with P; given a 64-byte cache block size:
    64 nodes: 64 presence bits / 64 bytes = 12.5% overhead
    256 nodes: 256 presence bits / 64 bytes = 50% overhead
    1024 nodes: 1024 presence bits / 64 bytes = 200% overhead

Reducing Storage Overhead
Optimizations for full bit vector schemes
  Increase the cache block size: reduces storage overhead proportionally
  Use more than one processor (SMP) per node: the presence bit is per node, not per processor
  But storage still scales as P x M
  Reasonable and simple enough for all but very large machines
  Example: 256 processors, 4-processor nodes, 128-byte lines: overhead = (256/4) / (128 x 8) = 64 / 1024 = 6.25% (attractive)
Need to reduce the width (addressing the P term) and the height (addressing the M term) of the directory
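
As a quick check of the arithmetic above, a small C program (my own; the overhead_percent helper is illustrative) reproduces the overhead figures for the configurations listed.

#include <stdio.h>

static double overhead_percent(int nodes, int block_bytes) {
    /* one presence bit per node, per block, ignoring state bits */
    return 100.0 * nodes / (block_bytes * 8.0);
}

int main(void) {
    printf("64 nodes,   64B blocks: %.2f%%\n", overhead_percent(64, 64));       /* 12.50  */
    printf("256 nodes,  64B blocks: %.2f%%\n", overhead_percent(256, 64));      /* 50.00  */
    printf("1024 nodes, 64B blocks: %.2f%%\n", overhead_percent(1024, 64));     /* 200.00 */
    /* 256 processors grouped into 4-processor SMP nodes, 128-byte lines */
    printf("64 nodes,  128B blocks: %.2f%%\n", overhead_percent(256 / 4, 128)); /* 6.25   */
    return 0;
}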

Directory Storage Reductions
Width observation: most blocks are cached (shared) by only a few nodes
  Don't need a bit per node; sharing patterns indicate that a few pointers should suffice
  An entry can contain a few (5 or so) pointers to sharing nodes
  For P = 1024, pointers are 10 bits each, so 5 pointers need only 50 bits rather than 1024 bits
  Need an overflow strategy for when there are more sharers
Height observation: the number of memory blocks >> the number of cache blocks
  Most directory entries are useless at any given time
  Organize the directory as a cache, rather than having one entry per memory block

Limited Pointers
Dir_i: only i pointers are used per entry
  Works in most cases, when the number of sharers <= i
  An overflow mechanism is needed when the number of sharers > i; an overflow bit indicates that the number of sharers exceeds i
  Overflow methods include: broadcast, no-broadcast, coarse vector, software overflow, dynamic pointers
[Figure: a directory entry with an overflow bit and 2 pointers, currently holding node IDs 9 and 7, in a 16-node machine]
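
A minimal sketch (my own), assuming a Dir_i entry with i = 2 as in the figure; the overflow policy itself (broadcast, coarse vector, software, ...) is deliberately left out, and add_sharer just reports when one is needed.

#include <stdbool.h>
#include <stdint.h>

#define I_POINTERS 2              /* i = 2, matching the figure */

typedef struct {
    uint16_t ptr[I_POINTERS];     /* node IDs of sharers (10 bits suffice for 1024 nodes) */
    uint8_t  count;               /* how many pointers are valid */
    bool     overflow;            /* set once the sharers exceed i */
} limited_dir_entry_t;

/* Record a new sharer; returns false when the entry has overflowed and some
 * overflow method must take over. */
static bool add_sharer(limited_dir_entry_t *e, uint16_t node) {
    for (int k = 0; k < e->count; k++)
        if (e->ptr[k] == node) return true;   /* already recorded */
    if (e->count < I_POINTERS) {
        e->ptr[e->count++] = node;
        return true;
    }
    e->overflow = true;                       /* # sharers > i */
    return false;
}

int main(void) {
    limited_dir_entry_t e = {0};
    add_sharer(&e, 9);
    add_sharer(&e, 7);
    add_sharer(&e, 3);            /* third sharer: overflow bit is set */
    return e.overflow ? 0 : 1;
}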

Overflow Methods
Broadcast (Dir_i B)
  A broadcast bit is turned on upon overflow
  Invalidations are broadcast to all nodes when the block is written, regardless of whether or not they were sharing the block
  Network bandwidth may be wasted in overflow cases; latency increases if the processor waits for the acknowledgements
No-broadcast (Dir_i NB)
  Avoids broadcast by never allowing the number of sharers to exceed i
  When the number of sharers equals i, a new sharer replaces (invalidates) one of the old ones and frees up its pointer in the directory entry
  Major drawback: does not deal with widely shared data

Coarse Vector Overflow Method
Coarse vector (Dir_i CV_r)
  Uses i pointers in its initial representation, but on overflow changes representation to a coarse vector
  Each bit indicates a unique group of r nodes
  On a write, invalidate all r nodes that a set bit corresponds to
[Figure: before overflow, the entry holds 2 pointers (nodes 9 and 7); after overflow, it becomes an 8-bit coarse vector over the 16-node machine, each bit covering a group of 2 nodes]
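
A minimal sketch (my own, not the DASH/Origin implementation) of the Dir_i CV_r idea: the entry starts as i exact pointers and, on overflow, is reinterpreted as a coarse vector with one bit per group of r nodes. Sizes match the 16-node figure (i = 2, r = 2); the price of overflow is that whole groups get invalidated, not exact sharers.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NODES  16
#define I_PTRS 2
#define GROUP  2                        /* r nodes per coarse-vector bit */

typedef struct {
    bool overflow;
    union {
        uint8_t ptr[I_PTRS];            /* exact sharer IDs while count <= i  */
        uint8_t coarse;                 /* NODES/GROUP = 8 coarse bits after  */
    } u;
    uint8_t count;
} cv_entry_t;

static void add_sharer(cv_entry_t *e, uint8_t node) {
    if (!e->overflow && e->count < I_PTRS) {
        e->u.ptr[e->count++] = node;
        return;
    }
    if (!e->overflow) {                 /* switch representations */
        uint8_t coarse = 0;
        for (int k = 0; k < e->count; k++) coarse |= 1u << (e->u.ptr[k] / GROUP);
        e->u.coarse = coarse;
        e->overflow = true;
    }
    e->u.coarse |= 1u << (node / GROUP);
}

/* On a write, invalidate every node in every group whose bit is set. */
static void invalidate_sharers(const cv_entry_t *e) {
    if (!e->overflow) {
        for (int k = 0; k < e->count; k++) printf("invalidate node %d\n", e->u.ptr[k]);
        return;
    }
    for (int g = 0; g < NODES / GROUP; g++)
        if ((e->u.coarse >> g) & 1)
            for (int n = g * GROUP; n < (g + 1) * GROUP; n++)
                printf("invalidate node %d\n", n);
}

int main(void) {
    cv_entry_t e = {0};
    add_sharer(&e, 9); add_sharer(&e, 7); add_sharer(&e, 3);  /* third sharer overflows */
    invalidate_sharers(&e);                                   /* invalidates groups {2,3}, {6,7}, {8,9} */
    return 0;
}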

Robustness of Coarse Vector
[Figure: normalized invalidations (full bit vector = 100) for Dir_4 B, Dir_4 NB, and Dir_4 CV_4 on LocusRoute, Cholesky, and Barnes-Hut; the y-axis runs to 800]
64 processors (one per node), 4 pointers per entry
16-bit coarse vector, 4 processors per group
Normalized to the full bit vector (100 invalidations)
Conclusion: the coarse vector scheme is quite robust

Software Overflow Schemes
Software (Dir_i SW)
  On overflow, trap to system software
  Overflow pointers are saved in local memory, freeing the directory entry to handle i new sharers in hardware
  Used by MIT Alewife: 5 pointers, plus one bit for the local node
  But large overhead for interrupt processing: 84 cycles for 5 invalidations, but 707 cycles for 6
Dynamic pointers (Dir_i DP), a variation of Dir_i SW
  Each directory entry contains an extra pointer into local memory
  The extra pointer uses a free list in a special area of memory
  The free list is manipulated by hardware, not software
  Example: the Stanford FLASH multiprocessor

Reducing Directory Height
Sparse directory: reducing the M term in P x M
  Observation: cached entries << main memory, so most directory entries are unused most of the time
  Example: with a 2MB cache and 512MB of local memory per node, 510/512 or 99.6% of directory entries are unused
Organize the directory as a cache to save space
  Dynamically allocate directory entries, as cache lines
  Allows use of faster SRAMs instead of slower DRAMs, reducing the access time to directory information on the critical path
  When an entry is replaced, send invalidations to all sharers
  It handles references from potentially all processors, so it is essential that it be large enough and have enough associativity

Alternatives for Directories
Directory schemes: centralized or distributed
Finding the source of directory information: flat or hierarchical
Locating copies: memory-based (Stanford DASH/FLASH, MIT Alewife, SGI Origin, HAL) or cache-based (IEEE SCI, Sequent NUMA-Q)
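
A quick check (my own) of the sparse-directory arithmetic above: with far more memory blocks than cache blocks, almost all per-block directory entries are idle at any instant.

#include <stdio.h>

int main(void) {
    double cache_mb  = 2.0;     /* per-node cache        */
    double memory_mb = 512.0;   /* per-node local memory */
    /* At most cache_mb worth of blocks can be cached at once, so at least
     * this fraction of the per-block directory entries is unused. */
    double unused = (memory_mb - cache_mb) / memory_mb;
    printf("unused directory entries >= %.1f%% (510/512)\n", 100.0 * unused); /* 99.6 */
    return 0;
}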

Flat Cache-based Schemes
How they work:
  The home only holds a pointer to the rest of the directory information
  A distributed linked list of copies weaves through the caches
  Each cache tag has a pointer to the next cache with a copy
  On a read, add yourself to the head of the list
  On a write, propagate a chain of invalidations down the list
Scalable Coherent Interface (SCI), an IEEE standard, uses a doubly linked list
[Figure: main memory at the home points to the head of a doubly linked list of caches at nodes 0, 1, and 2]

Scaling Properties (Cache-based)
Traffic on a write: proportional to the number of sharers
Latency on a write: also proportional to the number of sharers
  The identity of the next sharer is not known until the current one is reached
  Also assist processing at each node along the way
  Even reads involve more than one assist: the home and the first sharer on the list
Storage overhead
  Quite good scaling along both axes: only one head pointer per memory block; the rest is proportional to cache size
  But a complex hardware implementation
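
A minimal sketch (my own, far simpler than real SCI: no pending states and no doubly-linked deletion protocol) of the cache-based organization above; malloc stands in for cache-tag storage.

#include <stdio.h>
#include <stdlib.h>

typedef struct sharer {
    int node_id;
    struct sharer *prev, *next;   /* held in the cache tags in real hardware */
} sharer_t;

typedef struct { sharer_t *head; } home_entry_t;   /* all the home keeps */

/* Read miss: the new sharer inserts itself at the head of the list. */
static void add_sharer(home_entry_t *home, int node_id) {
    sharer_t *s = malloc(sizeof *s);
    s->node_id = node_id;
    s->prev = NULL;
    s->next = home->head;
    if (home->head) home->head->prev = s;
    home->head = s;
}

/* Write: walk the list, invalidating each copy in turn; latency grows with
 * the number of sharers, as noted above. */
static void invalidate_list(home_entry_t *home) {
    sharer_t *s = home->head;
    while (s) {
        sharer_t *next = s->next;
        printf("invalidate copy at node %d\n", s->node_id);
        free(s);
        s = next;
    }
    home->head = NULL;
}

int main(void) {
    home_entry_t home = { NULL };
    add_sharer(&home, 0); add_sharer(&home, 1); add_sharer(&home, 2);
    invalidate_list(&home);   /* invalidates node 2, then 1, then 0 */
    return 0;
}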

Protocol Design Optimizations
Reduce bandwidth demands, by reducing the number of protocol transactions per operation
Reduce latency
  By reducing the number of network transactions on the critical path
  Overlap network activities or make them faster
Reduce endpoint assist occupancy per transaction, especially when the assists are programmable
Traffic, latency, and occupancy should not scale up too quickly with the number of nodes

Reducing Read Miss Latency
L: local or requesting node, H: home node, D: dirty node
Strict request-response
  1: request (L to H), 2: response with owner identity (H to L), 3: intervention (L to D), 4a: revise (D to H), 4b: data reply (D to L)
  4 transactions on the critical path, 5 transactions in all
Intervention forwarding: the home forwards the intervention directly to the owner's cache and keeps track of the requestor
  1: request (L to H), 2: intervention (H to D), 3: revise (D to H), 4: data reply (H to L)
  4 transactions in all
Reply forwarding: the owner replies directly to the requestor
  1: request (L to H), 2: intervention (H to D), 3a: revise (D to H), 3b: data reply (D to L)
  3 transactions on the critical path

Reducing Invalidation Latency
In a cache-based protocol, invalidations are sent from the home H to all sharer nodes S_i (s = number of sharers)
Strict request-response
  1: inval (H to S1), 2: ack, 3: inval (H to S2), 4: ack, 5: inval (H to S3), 6: ack
  2s transactions in total, 2s on the critical path
Invalidation forwarding: each sharer forwards the invalidation to the next sharer
  1: inval (H to S1), 2a: inval (S1 to S2), 2b: ack (S1 to H), 3a: inval (S2 to S3), 3b: ack, 4: ack
  s acknowledgements to the home, s+1 transactions on the critical path
Single acknowledgement, sent by the last sharer
  1: inval (H to S1), 2: inval (S1 to S2), 3: inval (S2 to S3), 4: ack (S3 to H)
  s+1 transactions in total

Correctness
Ensure the basics of coherence at the state transition level
  The relevant lines are invalidated, updated, and retrieved
  The correct state transitions and actions happen
Ensure serialization and ordering constraints
  Ensure serialization for coherence (on a single location)
  Ensure atomicity for consistency (across multiple locations)
Avoid deadlock, livelock, and starvation
Problems:
  Multiple copies and multiple paths through the network
  Large latency makes optimizations attractive, but optimizations complicate correctness

Serialization for Coherence
Serialization means that writes to a given location are seen in the same order by all processors
In a bus-based system, there are multiple copies, but write serialization is imposed by the bus
In a scalable multiprocessor with cache coherence
  The home node can impose serialization on a given location: all relevant operations go to the home node first
  But the home node cannot satisfy all requests: the valid copy may not be in main memory but in a dirty node
  Requests may reach the home node in one order but reach the dirty node in a different order; then writes to the same location are not seen in the same order by all

Ensuring Serialization
Use additional busy or pending directory states
  They indicate that a previous operation on the block is in progress
  Further operations on the same location must be delayed
Ensure serialization using one of the following:
  Buffer at the home node, until the previous (in-progress) request has completed
  Buffer at the requestor nodes, by constructing a distributed list of pending requests
  NAK and retry: a negative acknowledgement is sent to the requestor, and the request is retried later by the requestor's assist
  Forward to the dirty node: the dirty node determines the serialization while the block is dirty
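
As a sketch of just one of these options (NAK and retry guarded by a busy state), here is a small C illustration of my own; the names are hypothetical, and buffering at the home or forwarding to the dirty node would replace the NAK branch.

#include <stdbool.h>
#include <stdio.h>

typedef enum { DIR_IDLE, DIR_BUSY } dir_serial_state_t;
typedef enum { ACCEPTED, NAK } dir_response_t;

typedef struct {
    dir_serial_state_t serial;   /* busy while a previous request is in progress */
} dir_block_t;

dir_response_t directory_request(dir_block_t *b, int requestor) {
    if (b->serial == DIR_BUSY) {
        printf("NAK to node %d: block busy, retry later\n", requestor);
        return NAK;
    }
    b->serial = DIR_BUSY;        /* serialize: one operation on this block at a time */
    printf("request from node %d accepted\n", requestor);
    return ACCEPTED;
}

void directory_complete(dir_block_t *b) { b->serial = DIR_IDLE; }

int main(void) {
    dir_block_t b = { DIR_IDLE };
    directory_request(&b, 1);    /* accepted        */
    directory_request(&b, 2);    /* NAKed           */
    directory_complete(&b);
    directory_request(&b, 2);    /* retry, accepted */
    return 0;
}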

Sequential Consistency
Bus-based systems
  Write completion: wait until the write gets on the bus
  Write atomicity: the bus plus buffer ordering provide it
Non-coherent scalable case
  Write completion: need to wait for an explicit acknowledgement from memory
  Write atomicity: easy, due to the single copy
Now, with multiple copies and distributed network pathways
  Write completion: need explicit acknowledgements from the copies themselves
  Writes are not easily atomic
  ... in addition to the earlier issues of the bus-based and non-coherent cases

Alternatives for Directories
Directory schemes: centralized or distributed
Finding the source of directory information: flat or hierarchical
Locating copies: memory-based (Stanford DASH/FLASH, MIT Alewife, SGI Origin, HAL) or cache-based (IEEE SCI, Sequent NUMA-Q)

Finding Directory Information
Flat schemes
  The directory is distributed with memory; the information is at the home node
  The location is based on the address: the transaction is sent to the home
Hierarchical schemes
  The directory is organized as a hierarchical data structure
  Leaves are processing nodes; internal nodes hold only directory information
  A directory entry says whether its subtree caches a block
  To find directory information, send a search message up to the parent; it routes itself through directory lookups
  Point-to-point messages between children and parents

Locating Shared Copies of a Block
Flat schemes
  Memory-based schemes: information about all copies is stored at the home with the block; examples: Stanford DASH/FLASH, MIT Alewife, SGI Origin
  Cache-based schemes: information about copies is distributed among the copies themselves; inside the caches, each copy points to the next to form a linked list; Scalable Coherent Interface (SCI, an IEEE standard)
Hierarchical schemes
  Copies are located through the directory hierarchy; each directory has presence bits for its parent and children subtrees

Hierarchical Directories
The directory is a hierarchical data structure
  Leaves are processing nodes; internal nodes are just directories
  A level-1 directory tracks which of its children (processing nodes) have a copy of the memory block; it also tracks which local memory blocks are cached outside its subtree; inclusion is maintained between the level-1 directory and the processor caches
  A level-2 directory tracks which of its children (level-1 directories) have a copy of the memory block; it also tracks which local memory blocks are cached outside its subtree; inclusion is maintained between the level-2 and level-1 directories
  This is a logical hierarchy, not necessarily a physical one; it can be embedded in a general network topology

Two-Level Hierarchy
Individual nodes are multiprocessors; example: a mesh of SMPs
Coherence across nodes is directory-based: the directory keeps track of nodes, not individual processors
Coherence within a node is snooping or directory
  Orthogonal, but needs a good interface of functionality
Examples:
  Convex Exemplar: directory-directory
  Sequent, Data General, HAL: directory-snoopy

Example Two-level Hierarchies
[Figure: four organizations of two-level machines built from bus-based nodes: (a) snooping-snooping, with snooping adapters connecting node buses B1 to a global bus B2; (b) snooping-directory, with assists and per-node directories over a network; (c) directory-directory, with directory adapters joining per-node networks through a second-level network; (d) directory-snooping, with directory/snoopy adapters joining per-node networks to a bus or ring]

Advantages of MP Nodes
Potential for cost and performance advantages
  Amortization of fixed node costs over multiple processors; can use commodity SMPs
  Fewer nodes for the directory to keep track of
  Much communication may be contained within a node
  Nodes prefetch data for each other (fewer remote misses)
  Combining of requests (like hierarchical schemes, only two-level)
  Can even share caches (overlapping of working sets)
Benefits depend on the sharing pattern (and the mapping)
  Good for widely read-shared data, e.g. tree data in Barnes-Hut
  Good for nearest-neighbor communication, if properly mapped
  Not so good for all-to-all communication

Disadvantages of MP Nodes
Bandwidth is shared among nodes
  The bus increases latency to local memory
  With coherence, the node typically waits for local snoop results before sending remote requests
  The snoopy bus at the remote node increases delays there too, increasing latency and reducing bandwidth
May hurt performance if the sharing patterns don't comply