Portland State University ECE 588/688. Directory-Based Cache Coherence Protocols

Portland State University ECE 588/688
Directory-Based Cache Coherence Protocols
Copyright by Alaa Alameldeen and Haitham Akkary, 2018

Why Directory Protocols?
- Snooping-based protocols may not scale
  - All requests must be broadcast to all processors
  - All processors must monitor all requests on the shared interconnect
  - Shared interconnect utilization can be high, leading to very long wait times
- Directory protocols
  - Coherence state is maintained in a directory associated with memory
  - Requests to a memory block do not need broadcasts
    - Served by local nodes if possible
    - Otherwise, sent to the owning node
- Note: some snooping-based protocols do not require broadcast, and are therefore more scalable
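
To make the contrast concrete, here is a minimal sketch of a full-bit-vector directory entry and a home-node read handler. All names, sizes, and the flat structure are illustrative assumptions, not the format used by any of the machines discussed below; the point is only that the directory lets the home answer with at most one forwarded message instead of a broadcast.

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_NODES 64 /* assumed system size: one presence bit per node */

/* Hypothetical directory entry, one per memory block at its home node. */
typedef struct {
    enum { UNCACHED, SHARED, DIRTY } state;
    uint64_t sharers; /* bit i set => node i holds a copy */
    int      owner;   /* meaningful only when state == DIRTY */
} dir_entry_t;

/* Placeholder message primitives. */
static void reply_with_memory_data(int node)     { printf("data -> node %d\n", node); }
static void forward_to_owner(int owner, int req) { printf("fwd node %d -> owner %d\n", req, owner); }

/* Serve a read at the home node: consult the directory, never broadcast. */
void handle_read(dir_entry_t *e, int requester) {
    if (e->state == DIRTY) {
        /* Exactly one cache holds the latest data: a single forwarded
         * message, instead of a broadcast to all NUM_NODES nodes. */
        forward_to_owner(e->owner, requester);
    } else {
        /* Memory is up to date: reply directly and record the sharer. */
        reply_with_memory_data(requester);
        e->sharers |= 1ULL << requester;
        e->state = SHARED;
    }
}
```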

Design Issues for Distributed Coherence Protocols
- Correctness
  - Memory consistency model: performance vs. ease of programming
  - Deadlock avoidance
  - Error handling (fault tolerance)
- Performance
  - Latency
  - Bandwidth
- Distributed control and complexity
- Scalability

The Stanford DASH Prototype
- DASH: Directory Architecture for SHared memory
- Architecture consists of many clusters
  - Each cluster contains 4 processors
- Processor caches
  - L1 instruction: 64KB, direct-mapped
  - L1 data: 64KB, direct-mapped, write-through
  - L2: 256KB, direct-mapped, write-back
  - 4-word write buffer
- Snooping implemented within a cluster (Illinois protocol, similar to MESI)
- General architecture of DASH: paper figure 1
- Sample 2x2 DASH system: paper figure 2

DASH Directories
- Directory controller (DC)
  - Holds the directory memory corresponding to the cluster's portion of main memory
  - Initiates outbound network requests and replies
- Pseudo-CPU (PCPU)
  - Buffers incoming requests and issues them on the cluster bus
  - Mimics a CPU on behalf of remote processors (except for bus replies, which are sent by the DC)
- Reply controller (RC)
  - Remote access cache (RAC) tracks outstanding requests by local processors
  - Receives and buffers the corresponding replies from remote clusters
  - RAC snoops on the bus
- Requests and replies are sent on two different networks using wormhole routing (discussed later in this course)
- Directory block diagram: paper figure 3
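
One way to picture the RC's bookkeeping is a table of per-request entries. The fields below are illustrative guesses at the kind of state a RAC entry must carry, not the actual DASH encoding:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical RAC entry: the Reply Controller allocates one per
 * outstanding remote request and frees it once the reply has been
 * delivered to the stalled local processor. */
typedef struct {
    bool     valid;          /* entry in use                           */
    uint64_t block_addr;     /* block a local processor is waiting on  */
    int      requesting_cpu; /* which local CPU to release on reply    */
    bool     reply_received; /* set when the remote cluster answers    */
    uint8_t  data[16];       /* buffered reply data (16B block assumed) */
} rac_entry_t;
```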

DASH Coherence Protocol
- Terminology
  - Local cluster: cluster containing the processor originating a request
  - Home cluster: cluster containing the main memory and directory for a given memory address
  - Remote cluster: any cluster other than the local and home clusters
  - Local memory: main memory associated with the local cluster
  - Remote memory: any memory whose home is not the local cluster
- Invalidation-based protocol
- Cache states: invalid, shared, and dirty
- Directory states (for all local memory blocks)
  - Uncached-remote: not cached by any remote cluster
  - Shared-remote: cached, unmodified, by one or more remote clusters
  - Dirty-remote: cached, modified, by one remote cluster
- The owning cluster for a block is the home cluster, except when the block is dirty-remote
- The owning cluster responds to requests and updates the directory state
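
The directory states and the ownership rule translate directly into a few lines of C. The names here are mine, but the logic mirrors the slide: the home cluster owns a block unless it is dirty-remote.

```c
/* DASH directory states, one per block of the home cluster's memory. */
typedef enum {
    UNCACHED_REMOTE, /* no remote cluster caches the block              */
    SHARED_REMOTE,   /* >= 1 remote cluster caches it, unmodified       */
    DIRTY_REMOTE     /* exactly one remote cluster caches it, modified  */
} dash_dir_state_t;

/* The owning cluster responds to requests and updates directory state. */
int owning_cluster(dash_dir_state_t s, int home_cluster, int dirty_cluster) {
    return (s == DIRTY_REMOTE) ? dirty_cluster : home_cluster;
}
```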

Read Requests
- Initiated by a CPU load instruction
- If the address is in the L1 cache, L1 supplies the data; otherwise, a fill request is sent to L2
- If the address is in L2, L2 supplies the data; otherwise, a read request is sent on the bus
- If the address is in the cache of another processor in the cluster, or in the RAC, that cache responds
  - Shared: data transferred over the bus to the requester
  - Dirty: data transferred over the bus to the requester; the RAC takes ownership of the cache line
- If the address is not in the local cluster, the processor retries the bus operation, the request is sent to the home cluster, and a RAC entry is allocated
- Requests to remote nodes are explained in paper figure 4
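
The whole lookup chain above can be read as one fall-through function. This is a schematic under assumed placeholder helpers for each level of the hierarchy; the actual DASH hardware overlaps these steps and retries on the bus rather than calling functions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Trivial placeholders standing in for real cache/bus/network actions. */
static bool l1_lookup(uint64_t a, uint32_t *d)        { (void)a; (void)d; return false; }
static bool l2_lookup(uint64_t a, uint32_t *d)        { (void)a; (void)d; return false; }
static bool cluster_bus_read(uint64_t a, uint32_t *d) { (void)a; (void)d; return false; }
static void allocate_rac_entry(uint64_t a)            { (void)a; }
static void send_read_to_home(uint64_t a)             { (void)a; }
static uint32_t retry_until_reply(uint64_t a)         { (void)a; return 0; }

/* Schematic of the read path for a single load. */
uint32_t cpu_load(uint64_t addr) {
    uint32_t data;
    if (l1_lookup(addr, &data)) return data; /* L1 hit  */
    if (l2_lookup(addr, &data)) return data; /* L2 fill */
    /* Bus read within the cluster: a shared copy is forwarded as-is; a
     * dirty responder also passes ownership of the line to the RAC. */
    if (cluster_bus_read(addr, &data)) return data;
    /* Not in the local cluster: allocate a RAC entry, send the request to
     * the home cluster, and retry the bus operation until the reply arrives. */
    allocate_rac_entry(addr);
    send_read_to_home(addr);
    return retry_until_reply(addr);
}
```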

Read-Exclusive Requests
- Initiated by a CPU store instruction
- Data is written through L1 and buffered in a write buffer
- If L2 has ownership permission, the write is retired; otherwise, a read-exclusive request is sent on the local bus
  - The write buffer is stalled
- If the address is dirty in one of the cluster's caches or in the RAC
  - The owning cache sends data and ownership to the requester
  - The owning cache invalidates its copy
- If the address is not in the local cluster
  - The processor retries the bus operation
  - The request is sent to the home cluster
  - A RAC entry is allocated
- Requests to remote nodes are explained in paper figure 5
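
At the home cluster, the directory side of a read-exclusive is the interesting part: every stale copy must be invalidated before the requester may write. Below is a minimal sketch under the same assumptions as the earlier dir_entry_t example (full bit vector, no transient states, no NACK/retry corner cases):

```c
#include <stdint.h>

#define NUM_NODES 64

typedef struct {
    enum { UNCACHED, SHARED, DIRTY } state;
    uint64_t sharers; /* one presence bit per node */
    int      owner;   /* meaningful only when state == DIRTY */
} dir_entry_t;

static void send_invalidate(int node)             { (void)node; }
static void forward_ownership(int owner, int req) { (void)owner; (void)req; }
static void reply_with_memory_data(int req)       { (void)req; }

/* Home handling of a read-exclusive (write-ownership) request. */
void handle_read_exclusive(dir_entry_t *e, int requester) {
    if (e->state == DIRTY) {
        /* The current owner sends data plus ownership to the requester
         * and invalidates its own copy. */
        forward_ownership(e->owner, requester);
    } else {
        /* Invalidate every shared copy other than the requester's,
         * then supply the data from memory. */
        uint64_t stale = e->sharers & ~(1ULL << requester);
        for (int n = 0; n < NUM_NODES; n++)
            if (stale & (1ULL << n))
                send_invalidate(n);
        reply_with_memory_data(requester);
    }
    e->state   = DIRTY;
    e->owner   = requester;
    e->sharers = 1ULL << requester;
}
```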

Other Implementation Details
- Writeback requests: issued when a dirty block is replaced
  - Home is the local cluster: write the data to main memory
  - Home is a remote cluster: send the data to the home, which updates memory and sets the state to uncached-remote
- Latency for memory operations: paper figure 6
- Exception conditions
  - A request arrives for a dirty block at a remote cluster after it gave up ownership
  - Ownership bounces back and forth between two remote clusters while a third cluster requests the block
  - Multiple paths in the system lead to requests being received out of order
- The amount of information stored in the directory affects scalability
  - For each memory block, DASH stores a state and a bit vector of the other clusters
  - For a more scalable system, this overhead needs to be lower
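
The last bullet is easy to quantify. For a full-bit-vector directory the per-block overhead grows linearly with the number of nodes while the block size stays fixed; the short program below works this out for a few assumed configurations (64-byte blocks and 2 state bits are illustrative choices, not DASH's actual parameters).

```c
#include <stdio.h>

/* Full-bit-vector directory overhead: (presence bits + state bits) per
 * block, relative to the size of the block itself. */
int main(void) {
    const int block_bits = 64 * 8; /* assumed 64-byte memory blocks */
    const int state_bits = 2;      /* assumed per-block state field */
    for (int nodes = 16; nodes <= 1024; nodes *= 4) {
        double pct = 100.0 * (nodes + state_bits) / block_bits;
        printf("%4d nodes: directory is %5.1f%% of memory\n", nodes, pct);
    }
    return 0;
}
```

Past a few hundred nodes the directory would outgrow the data it tracks, which is exactly the scalability pressure the slide points at.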

The SGI Origin
- Cache-coherent non-uniform memory access (CC-NUMA)
- Up to 512 nodes
- Scalable CrayLink network (hypercube)
- 1 or 2 MIPS R10000 processors per node
- Up to 4 GB of memory per node
- Each node connects to a portion of the I/O subsystem
- No snooping within a node

Key Goals
- Scale to a large number of processors
- Provide higher performance per processor
- Maintain a cache-coherent, globally addressable memory model
  - For ease of programming
- Keep the entry-level and incremental cost of the system lower than that of a high-performance SMP

Origin Architecture
- Distributed shared memory (DSM)
- Directory-based cache coherence
- Designed to minimize the latency difference between local and remote memory
- Hardware and software are provided to ensure that most memory references are local
- Origin block diagram: paper figure 1
- Cache coherence does not require in-order message delivery
- The I/O subsystem is also distributed and globally addressable
  - I/O can DMA to and from all memory in the system
- The cluster bus is multiplexed but is not a snoopy bus
  - Reduces local and remote memory latency
  - Fewer processors on the bus
  - A remote request does not need to wait for a snoop response

Origin Architecture (Cont.)
- Non-snoopy node bus tradeoff
  - Disadvantage: remote bandwidth needs to match local bandwidth, unlike in SMP-node systems
  - Advantage: easier migration path for existing SMP software
- Page migration and replication ensure that most references are local
  - Memory-reference hardware counters
  - Copy engine to copy at near peak memory bandwidth
- Rich synchronization primitives
  - Fetch-and-op primitives are not cached and are performed at memory
  - Useful for highly contended locks
- The HUB implements a 4-way full crossbar between the processors, memory, I/O, and the network
- RAS features
  - ECC in the external cache and memory
  - Automatic retries of faulty packets
  - Modular design provides highly available hardware
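
The fetch-and-op point deserves a small illustration: a ticket lock built on fetch-and-add is the classic use case, since each acquire is a single atomic that an Origin-style at-memory fetch-and-op can perform without bouncing an exclusive cache line among the spinning processors. Portable C11 atomics are shown here, not Origin's uncached primitives.

```c
#include <stdatomic.h>

/* Ticket lock built on fetch-and-add. */
typedef struct {
    atomic_uint next_ticket;
    atomic_uint now_serving;
} ticket_lock_t;

void ticket_lock(ticket_lock_t *l) {
    unsigned me = atomic_fetch_add(&l->next_ticket, 1); /* take a ticket */
    while (atomic_load(&l->now_serving) != me)
        ; /* spin; a real implementation would add backoff */
}

void ticket_unlock(ticket_lock_t *l) {
    atomic_fetch_add(&l->now_serving, 1); /* admit the next waiter */
}
```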

Network
- Six-ported router chip
- Fat-hypercube topology (paper figures 3 and 4)
- Low-latency wormhole routing
- Four virtual channels per physical channel
- Congestion control allows messages to adaptively switch between two of the virtual channels
- Support for 256 levels of message priority
  - Priority increases via packet aging
- Automatic packet retries
- Software-programmable routing tables
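
As a toy illustration of the aging idea (entirely hypothetical; the real router arbitrates in hardware and interacts with the virtual channels): a packet's priority creeps upward while it waits, so low-priority traffic cannot be starved indefinitely.

```c
#include <stdint.h>

/* Toy output-port arbitration with packet aging: the highest-priority
 * packet wins, and every loser ages one step (256 priority levels). */
typedef struct {
    uint8_t priority; /* 0..255; bumped while the packet waits */
} packet_t;

int arbitrate(packet_t *cand, int n) {
    int winner = 0;
    for (int i = 1; i < n; i++)
        if (cand[i].priority > cand[winner].priority)
            winner = i;
    for (int i = 0; i < n; i++)
        if (i != winner && cand[i].priority < 255)
            cand[i].priority++; /* aging: losers gain priority */
    return winner;
}
```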

Cache Coherence Protocol
- Similar to the DASH protocol, but with significant improvements
- The MESI protocol is fully supported
  - Single fetch from memory for read-modify-writes
  - Permits a processor to replace an E block in its cache without informing the directory
  - Requests from processors that have replaced E blocks can be satisfied immediately from memory
- Support for upgrade requests from S to E without a data transfer
- Cache coherence requests (see paper)
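
The silent-replacement rule has a neat consequence that the slide hints at, sketched below with hypothetical names: a clean-exclusive block never dirtied memory, so if the directory still records the requester itself as the exclusive owner, the home can infer the copy was silently dropped and answer straight from memory.

```c
typedef enum { UNOWNED, SHARED_CLEAN, EXCL } origin_state_t;

typedef struct {
    origin_state_t state;
    int owner; /* valid only when state == EXCL */
} origin_dir_entry_t;

static void reply_with_memory_data(int node)       { (void)node; }
static void intervene_at_owner(int owner, int req) { (void)owner; (void)req; }

/* A read request arriving at the home node (simplified). */
void origin_handle_read(origin_dir_entry_t *e, int requester) {
    if (e->state == EXCL && e->owner == requester) {
        /* The recorded owner is asking again, so it must have silently
         * replaced its clean-exclusive copy; memory is still valid. */
        reply_with_memory_data(requester);
    } else if (e->state == EXCL) {
        /* Another processor may hold a modified copy: intervene there. */
        intervene_at_owner(e->owner, requester);
    } else {
        reply_with_memory_data(requester);
        e->state = SHARED_CLEAN;
    }
}
```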

Configuration and Performance
- CPU configuration
  - MIPS R10000, 195 MHz, 4-way out-of-order
  - 4 MB L2 cache
  - Bus connected to the HUB chip
- Latency variation: paper table 4
- High memory bandwidth: paper figures 11 and 12
- Synchronization: paper figure 13

Reading Assignment
- Wednesday: C. Seitz, "The Cosmic Cube," Communications of the ACM, 1985 (Read)
- Monday:
  - Erik Hagersten et al., "DDM - A Cache-Only Memory Architecture," IEEE Computer, 1992 (Review)
  - Babak Falsafi and David Wood, "Reactive NUMA: A Design for Unifying S-COMA and CC-NUMA," ISCA 1997 (Skim)
- Homework 3 due Wednesday
  - Submission instructions are the same as for HW2; text and C files should be submitted in D2L