
CMSC 611: Advanced Computer Architecture
Distributed & Shared Memory

Centralized Shared Memory: MIMD processors share a single centralized memory through a bus interconnect, which is feasible only for small processor counts that limit memory contention. Caches serve to increase bandwidth relative to the bus and memory and to reduce access latency, and they are valuable for both private and shared data. Access to shared data is optimized by replication, which decreases latency, increases memory bandwidth, and reduces contention. Replication, however, introduces the problem of cache coherence.

Cache Coherency: A cache coherence problem arises when a cache reflects a view of memory that differs from reality. A memory system is coherent if: (1) P reads X, P writes X, no other processor writes X, P reads X: the read always returns the value written by P. (2) P reads X, Q writes X, P reads X: the read returns the value written by Q (provided sufficient write/read separation). (3) P writes X, Q writes X: the two writes are seen in the same order by all processors.
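
This is easiest to see by breaking the second condition. Below is a minimal sketch (the names are ours, not the course's) of two per-processor private caches with no coherence mechanism at all: after Q writes X, P keeps reading its stale cached copy.

```python
# Two private caches with NO coherence protocol: a stale-read demo.
memory = {"X": 0}
cache_P, cache_Q = {}, {}        # private per-processor caches

def read(cache, addr):
    if addr not in cache:        # miss: fill the line from memory
        cache[addr] = memory[addr]
    return cache[addr]           # hit: may return a stale copy

def write(cache, addr, value):
    cache[addr] = value
    memory[addr] = value         # write-through, for simplicity

print(read(cache_P, "X"))        # P reads X -> 0 (fills P's cache)
write(cache_Q, "X", 42)          # Q writes X = 42
print(read(cache_P, "X"))        # P reads X -> still 0: incoherent view
```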

Potential HW Coherency Solutions: Snooping solution (snoopy bus): send all requests for data to all processors; processors snoop to see if they have a copy and respond accordingly. This requires broadcast, since the caching information is at the processors, but it works well with a bus, a natural broadcast medium, and it dominates for small-scale machines (most of the market). Directory-based schemes: keep track of what is being shared in one centralized place; with distributed memory, distribute the directory as well for scalability (avoiding bottlenecks), and send point-to-point requests to processors via the network. Directories scale better than snooping and actually existed before snooping-based schemes. * Slide is partially a courtesy of Dave Patterson

Basic Snooping Protocols: Write-Invalidate Protocol: On a write to shared data, an invalidate is sent to all caches, which snoop and invalidate any copies they hold. Invalidation forces a cache miss on the next access to the modified shared item. With multiple simultaneous writers, only one wins the race, ensuring serialization of the write operations. Read miss: with write-through caches, memory is always up-to-date; with write-back caches, snoop in the caches to find the most recent copy. * Slide is partially a courtesy of Dave Patterson
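
As a rough illustration, here is a sketch of the invalidate path (the class names and the write-through simplification are ours): on every write, the writer broadcasts an invalidate over the bus object, every other cache snoops it and drops its copy, and the next read by another processor misses and fetches the new value.

```python
class Bus:
    """Stands in for the shared bus: a broadcast medium all caches snoop."""
    def __init__(self):
        self.caches, self.memory = [], {}

    def invalidate(self, addr, origin):
        for c in self.caches:              # every controller snoops
            if c is not origin:
                c.lines.pop(addr, None)    # drop any copy

    def fetch(self, addr):
        return self.memory.get(addr, 0)

class SnoopyCache:
    def __init__(self, bus):
        self.lines, self.bus = {}, bus
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:                  # read miss
            self.lines[addr] = self.bus.fetch(addr)
        return self.lines[addr]

    def write(self, addr, value):
        self.bus.invalidate(addr, origin=self)      # broadcast invalidate
        self.lines[addr] = value
        self.bus.memory[addr] = value               # write-through, for brevity

bus = Bus()
p1, p2 = SnoopyCache(bus), SnoopyCache(bus)
p1.read("A")                # p1 caches A = 0
p2.write("A", 7)            # invalidates p1's copy
print(p1.read("A"))         # 7: p1 misses and refetches the new value
```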

Basic Snooping Protocols: Write-Broadcast (Update) Protocol, typically used with write-through caches: on a write to shared data, the new value is broadcast on the bus; processors snoop and update any copies they hold. To limit the impact on bandwidth, track data sharing so that written data that is not shared need not be broadcast. Read miss: memory is always up-to-date. Write serialization: the bus serializes requests! * Slide is partially a courtesy of Dave Patterson
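
For contrast, the same sketch with update semantics (again, the names and the write-through choice are ours): the writer broadcasts the new value, and snooping caches that hold the block update it in place, so existing sharers keep hitting.

```python
class UpdateBus:
    def __init__(self):
        self.caches, self.memory = [], {}

    def broadcast(self, addr, value, origin):
        self.memory[addr] = value                # write-through keeps memory current
        for c in self.caches:
            if c is not origin and addr in c.lines:
                c.lines[addr] = value            # update in place, don't invalidate

class UpdateCache:
    def __init__(self, bus):
        self.lines, self.bus = {}, bus
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:               # read miss
            self.lines[addr] = self.bus.memory.get(addr, 0)
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value
        self.bus.broadcast(addr, value, origin=self)

bus = UpdateBus()
p1, p2 = UpdateCache(bus), UpdateCache(bus)
p1.read("A")               # p1 caches A = 0
p2.write("A", 7)           # updates p1's copy in place
print(p1.read("A"))        # 7, and it was a hit rather than a miss
```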

Invalidate vs. Update: Write-invalidate has emerged as the winner for the vast majority of designs. Qualitative performance differences: Spatial locality: write-invalidate needs one bus transaction per cache block, while write-update needs one broadcast per word. Latency: write-update gives lower write-to-read latency between processors, whereas under write-invalidate the reader must reload the new value into its cache.
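
A back-of-envelope calculation makes the spatial-locality point concrete (the block size and write count below are assumptions, not figures from the slides):

```python
words_per_block = 8     # assumed: 8-word (e.g., 64-byte) cache blocks
writes = 1024           # sequential writes with perfect spatial locality

wi_transactions = writes // words_per_block   # one invalidate per block
wu_transactions = writes                      # one broadcast per word
print(wi_transactions, wu_transactions)       # 128 vs. 1024 bus transactions
```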

Invalidate vs. Update: Because bus and memory bandwidth are usually in high demand, write-invalidate protocols are very popular. Write-update can cause problems for some memory consistency models, reducing the potential performance gain it could bring, and its high bandwidth demand limits its scalability to large numbers of processors.

An Example Snoopy Protocol: An invalidation protocol for write-back caches. Each block of memory is in one state: clean in all caches and up-to-date in memory (Shared), OR dirty in exactly one cache (Exclusive), OR not in any cache. Each cache block is in one state (which the cache tracks): Shared: the block can be read; OR Exclusive: this cache has the only copy, which is writable and dirty; OR Invalid: the block contains no data. Read misses cause all caches to snoop the bus, and writes to a clean line are treated as misses. * Slide is partially a courtesy of Dave Patterson
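
The per-block state machine can be written down directly as a transition table. This is a sketch (the event and action names are ours; a real controller has bus-side subtleties beyond this): note that a write to a Shared line goes to the bus as a miss, exactly as the slide says.

```python
INVALID, SHARED, EXCLUSIVE = "Invalid", "Shared", "Exclusive"

# (current state, event) -> (next state, bus action or None)
transitions = {
    (INVALID,   "cpu read"):    (SHARED,    "place read miss on bus"),
    (INVALID,   "cpu write"):   (EXCLUSIVE, "place write miss on bus"),
    (SHARED,    "cpu read"):    (SHARED,    None),
    (SHARED,    "cpu write"):   (EXCLUSIVE, "place write miss on bus"),  # clean write = miss
    (EXCLUSIVE, "cpu read"):    (EXCLUSIVE, None),
    (EXCLUSIVE, "cpu write"):   (EXCLUSIVE, None),
    # events snooped from other processors on the bus:
    (SHARED,    "snoop write"): (INVALID,   None),
    (EXCLUSIVE, "snoop write"): (INVALID,   "write back block"),
    (EXCLUSIVE, "snoop read"):  (SHARED,    "write back block"),
}

state = INVALID
for event in ["cpu read", "cpu write", "snoop read"]:
    state, action = transitions[(state, event)]
    print(f"{event:11s} -> {state:9s} {action or ''}")
```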

Snoopy-Cache Controller Complications: The cache cannot be updated until the bus is obtained, so a miss is a two-step process: arbitrate for the bus, then place the miss on the bus and complete the operation. With a split-transaction bus, a bus transaction is not atomic: multiple misses can interleave, allowing two caches to grab a block in the Exclusive state. The controller must track misses and prevent multiple outstanding misses for one block.
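
One plausible way to do this bookkeeping (a sketch of the idea, not how any particular controller implements it): keep the set of blocks with an in-flight miss and NACK a second miss to the same block until the first completes.

```python
pending = set()                     # blocks with an outstanding bus miss

def issue_miss(block):
    if block in pending:
        return "NACK: retry later"  # a miss for this block is already in flight
    pending.add(block)
    return "miss placed on bus"

def miss_complete(block):
    pending.discard(block)

print(issue_miss("A1"))   # miss placed on bus
print(issue_miss("A1"))   # NACK: retry later (prevents two Exclusive owners)
miss_complete("A1")
print(issue_miss("A1"))   # miss placed on bus
```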

Example: memory blocks A1 and A2 map to the same cache block, and the initial cache state is Invalid. [The slides step through the snooping protocol's state table for a short access sequence; the table itself did not survive transcription.]

Distributed Directory Multiprocessors: A directory tracks the state of every block in every cache: which caches have a copy of the block, dirty vs. clean, and so on. Should the information be kept per memory block or per cache block? PLUS: keeping it with memory gives a simpler protocol (centralized, one location). MINUS: kept with memory, the directory's size grows with memory size rather than cache size. To prevent the directory from becoming a bottleneck, distribute the directory entries along with the memory, each node keeping track of which processors have copies of its blocks. * Slide is partially a courtesy of Dave Patterson

Directory Protocol: Similar to the snoopy protocol, with three states: Shared: one or more processors have the block cached, and the copy in memory (as well as in all those caches) is up-to-date. Uncached: no processor has a copy of the block (it is not valid in any cache). Exclusive: exactly one processor (the owner) has the block cached, and the copy in memory is out-of-date (the block is dirty). In addition to the cache state, the directory must track which processors have the data when it is in the shared state: usually a bit vector, with bit i set to 1 if processor i has a copy.
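
The directory entry described here is small and concrete; a sketch with assumed field names:

```python
NUM_PROCS = 8
UNCACHED, SHARED, EXCLUSIVE = "Uncached", "Shared", "Exclusive"

class DirEntry:
    """One directory entry per memory block: a state plus a sharer bit vector."""
    def __init__(self):
        self.state = UNCACHED
        self.sharers = 0                 # bit i set => processor i has a copy

    def add_sharer(self, p):
        self.sharers |= 1 << p

    def sharer_list(self):
        return [p for p in range(NUM_PROCS) if self.sharers & (1 << p)]

entry = DirEntry()
entry.state = SHARED
entry.add_sharer(1)
entry.add_sharer(5)
print(entry.state, entry.sharer_list())   # Shared [1, 5]
```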

Keep it simple(r): Writes to non-exclusive data cause a write miss, the processor blocks until the access completes, and messages are assumed to be received and acted upon in the order sent. Terms (typically three processors are involved): the local node is where the request originates; the home node is where the memory location of the address resides; a remote node holds a copy of the cache block, whether exclusive or shared. There is no bus and we do not want to broadcast: the interconnect is no longer a single arbitration point, and all messages have explicit responses.
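
How an address finds its home node is not spelled out on the slide; a common arrangement (assumed here) interleaves blocks across the nodes' local memories, so the block number picks the home:

```python
BLOCK_BYTES = 64    # assumed block size
NUM_NODES = 16      # assumed machine size

def home_node(addr):
    block = addr // BLOCK_BYTES
    return block % NUM_NODES     # round-robin block interleaving (one option)

print(home_node(0x0000))   # node 0
print(home_node(0x0040))   # node 1: the next block lives on the next node
```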

Example Directory Protocol: A message sent to the directory causes two actions: updating the directory, and sending further messages to satisfy the request. We assume operations are atomic, but they are not; reality is much harder, and the protocol must avoid deadlock when the network runs out of buffers.

Directory Protocol Messages

Message type     | Source         | Destination    | Contents | Function
Read miss        | local cache    | home directory | P, A     | P has a read miss at A; request the data and make P a read sharer
Write miss       | local cache    | home directory | P, A     | P has a write miss at A; request the data and make P the exclusive owner
Invalidate       | home directory | remote cache   | A        | Invalidate the shared copy of A
Fetch            | home directory | remote cache   | A        | Fetch block A home; change A's remote state to shared
Fetch/invalidate | home directory | remote cache   | A        | Fetch block A home; invalidate the remote copy
Data value reply | home directory | local cache    | D        | Return the data value from home memory
Data write back  | remote cache   | home directory | A, D     | Write back the data value for A
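
The table transcribes naturally into a message type plus its routing; a small sketch (the enum and its string summaries are ours):

```python
from enum import Enum

class Msg(Enum):
    READ_MISS        = "local cache -> home directory, carries P, A"
    WRITE_MISS       = "local cache -> home directory, carries P, A"
    INVALIDATE       = "home directory -> remote cache, carries A"
    FETCH            = "home directory -> remote cache, carries A"
    FETCH_INVALIDATE = "home directory -> remote cache, carries A"
    DATA_VALUE_REPLY = "home directory -> local cache, carries D"
    DATA_WRITE_BACK  = "remote cache -> home directory, carries A, D"

for m in Msg:                     # print the routing summary of each message
    print(f"{m.name:16s} {m.value}")
```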

Cache Controller State: The states are identical to the snoopy case, and the transactions are very similar: misses become messages to the home directory, plus explicit invalidate and data-fetch requests. State machine for CPU requests, one per memory block. * Slide is partially a courtesy of Dave Patterson

Directory Controller State: Same states and structure as the transition diagram for an individual cache. Actions: update the directory state and send messages to satisfy requests. The directory tracks all copies of each memory block; the sharers set can be implemented as a bit vector with one bit per processor for each block. State machine for directory requests, one per memory block. * Slide is partially a courtesy of Dave Patterson
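
Putting the two state machines together, here is a compact, self-contained sketch (our own simplification: the names are ours, each processor gets a one-line cache, and messages are modeled as direct method calls rather than network packets). The demo at the bottom replays the trace worked through on the next slide.

```python
UNCACHED, SHARED, EXCLUSIVE = "Uncached", "Shared", "Exclusive"

class Node:
    """One processor with a single-line cache (enough for this example)."""
    def __init__(self, name):
        self.name, self.addr, self.state, self.value = name, None, "Invalid", None

class Directory:
    def __init__(self):
        self.entries = {}   # addr -> [state, set of sharer nodes]
        self.memory = {}    # addr -> value

    def entry(self, addr):
        return self.entries.setdefault(addr, [UNCACHED, set()])

    def read_miss(self, node, addr):                # "RdMs"
        st = self.entry(addr)
        if st[0] == EXCLUSIVE:                      # "Fetch": bring dirty copy home
            owner = next(iter(st[1]))
            self.memory[addr] = owner.value
            owner.state = "Shared"
        st[0] = SHARED
        st[1].add(node)
        return self.memory.get(addr, 0)             # "Data value reply"

    def write_miss(self, node, addr):               # "WrMs"
        st = self.entry(addr)
        for sharer in st[1] - {node}:
            if st[0] == EXCLUSIVE:                  # "Fetch/invalidate"
                self.memory[addr] = sharer.value
            sharer.state, sharer.addr = "Invalid", None   # "Invalidate"
        st[0], st[1] = EXCLUSIVE, {node}
        return self.memory.get(addr, 0)

    def write_back(self, node, addr, value):        # "WrBk"
        self.memory[addr] = value
        self.entries[addr] = [UNCACHED, set()]

def cpu_write(node, directory, addr, value):
    if node.state == "Exclusive" and node.addr not in (None, addr):
        directory.write_back(node, node.addr, node.value)  # evict dirty block
    if not (node.addr == addr and node.state == "Exclusive"):
        directory.write_miss(node, addr)
    node.addr, node.state, node.value = addr, "Exclusive", value

def cpu_read(node, directory, addr):
    if node.addr != addr or node.state == "Invalid":
        node.value = directory.read_miss(node, addr)
        node.addr, node.state = addr, "Shared"
    return node.value

d, p1, p2 = Directory(), Node("P1"), Node("P2")
cpu_write(p1, d, "A1", 10)   # A1 Excl. {P1}
cpu_read(p1, d, "A1")        # hit
cpu_read(p2, d, "A1")        # fetch from P1; A1 Shar. {P1,P2}; memory A1 = 10
cpu_write(p2, d, "A1", 20)   # invalidate P1; A1 Excl. {P2}; memory still 10
cpu_write(p2, d, "A2", 40)   # write back A1 = 20; A2 Excl. {P2}
print(d.memory)                            # {'A1': 20}
print(d.entries["A1"], p2.state, p2.addr)  # ['Uncached', set()] Exclusive A2
```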

Example: memory blocks A1 and A2 map to the same cache block, and all caches start in the Invalid state. The slides build up the following trace step by step (the operations are inferred from the message sequence; only the "P2: Write 20 to A1" label survived transcription):

Step | Operation          | P1 cache    | P2 cache    | Interconnect messages                    | Directory                  | Memory
1    | P1 writes 10 to A1 | Excl. A1 10 | Inv.        | WrMs P1 A1; DaRp P1 A1 0                 | A1 Ex {P1}                 | A1 = 0
2    | P1 reads A1        | Excl. A1 10 | Inv.        | (hit, no traffic)                        | A1 Ex {P1}                 | A1 = 0
3    | P2 reads A1        | Shar. A1 10 | Shar. A1 10 | RdMs P2 A1; Ftch P1 A1 10; DaRp P2 A1 10 | A1 Shar. {P1,P2}           | A1 = 10 (write back)
4    | P2 writes 20 to A1 | Inv.        | Excl. A1 20 | WrMs P2 A1; Inval. P1 A1                 | A1 Excl. {P2}              | A1 = 10 (stale)
5    | P2 writes 40 to A2 | Inv.        | Excl. A2 40 | WrMs P2 A2; WrBk P2 A1 20; DaRp P2 A2 0  | A1 Unca. {}; A2 Excl. {P2} | A1 = 20; A2 = 0

Note in step 5 that writing A2 forces P2 to write back its dirty copy of A1 (value 20), at which point A1 becomes Uncached in the directory and memory finally holds the up-to-date value.