Lecture 25: Multiprocessors

Similar documents
Lecture 25: Multiprocessors. Today s topics: Snooping-based cache coherence protocol Directory-based cache coherence protocol Synchronization

Lecture 18: Coherence and Synchronization. Topics: directory-based coherence protocols, synchronization primitives (Sections

Lecture 26: Multiprocessors. Today s topics: Directory-based coherence Synchronization Consistency Shared memory vs message-passing

Lecture 24: Virtual Memory, Multiprocessors

Lecture: Coherence, Synchronization. Topics: directory-based coherence, synchronization primitives (Sections )

Lecture 18: Coherence Protocols. Topics: coherence protocols for symmetric and distributed shared-memory multiprocessors (Sections

Lecture 8: Snooping and Directory Protocols. Topics: split-transaction implementation details, directory implementations (memory- and cache-based)

Lecture: Coherence Protocols. Topics: wrap-up of memory systems, multi-thread programming models, snooping-based protocols

Lecture 26: Multiprocessors. Today s topics: Synchronization Consistency Shared memory vs message-passing

Lecture 2: Snooping and Directory Protocols. Topics: Snooping wrap-up and directory implementations

Lecture: Coherence and Synchronization. Topics: synchronization primitives, consistency models intro (Sections )

Lecture 3: Snooping Protocols. Topics: snooping-based cache coherence implementations

Parallel Computer Architecture Spring Distributed Shared Memory Architectures & Directory-Based Memory Coherence

Lecture 7: PCM Wrap-Up, Cache coherence. Topics: handling PCM errors and writes, cache coherence intro

EC 513 Computer Architecture

CMSC Computer Architecture Lecture 15: Memory Consistency and Synchronization. Prof. Yanjing Li University of Chicago

Chapter 5. Multiprocessors and Thread-Level Parallelism

Lecture 1: Introduction

250P: Computer Systems Architecture. Lecture 14: Synchronization. Anton Burtsev March, 2019

ECE 669 Parallel Computer Architecture

Page 1. SMP Review. Multiprocessors. Bus Based Coherence. Bus Based Coherence. Characteristics. Cache coherence. Cache coherence

Lecture 19: Coherence and Synchronization. Topics: synchronization primitives (Sections )

Lecture 4: Directory Protocols and TM. Topics: corner cases in directory protocols, lazy TM

Multiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types

Multiprocessors II: CC-NUMA DSM. CC-NUMA for Large Systems

Lecture 5: Directory Protocols. Topics: directory-based cache coherence implementations

Lecture 7: Implementing Cache Coherence. Topics: implementation details

M4 Parallelism. Implementation of Locks Cache Coherence

Cache Coherence. Bryan Mills, PhD. Slides provided by Rami Melhem

Introduction to Multiprocessors (Part II) Cristina Silvano Politecnico di Milano

Lecture 3: Directory Protocol Implementations. Topics: coherence vs. msg-passing, corner cases in directory protocols

Lecture 7: PCM, Cache coherence. Topics: handling PCM errors and writes, cache coherence intro

ECSE 425 Lecture 30: Directory Coherence

Lecture: Coherence Protocols. Topics: wrap-up of memory systems, multi-thread programming models, snooping-based protocols

Introduction. Coherency vs Consistency. Lec-11. Multi-Threading Concepts: Coherency, Consistency, and Synchronization

Suggested Readings! What makes a memory system coherent?! Lecture 27" Cache Coherency! ! Readings! ! Program order!! Sequential writes!! Causality!

5008: Computer Architecture

Scalable Cache Coherence

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 5. Multiprocessors and Thread-Level Parallelism

Lecture 24: Memory, VM, Multiproc

Multiprocessors & Thread Level Parallelism

Chapter 5. Multiprocessors and Thread-Level Parallelism

COMP Parallel Computing. CC-NUMA (1) CC-NUMA implementation

Computer Organization

S = 32 2 d kb (1) L = 32 2 D B (2) A = 2 2 m mod 4 (3) W = 16 2 y mod 4 b (4)

COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence

CMSC 611: Advanced. Distributed & Shared Memory

Shared Memory SMP and Cache Coherence (cont) Adapted from UCB CS252 S01, Copyright 2001 USB

Shared Memory Multiprocessors

Lecture: Memory, Coherence Protocols. Topics: wrap-up of memory systems, multi-thread programming models, snooping-based protocols

Computer Architecture Lecture 10: Thread Level Parallelism II (Chapter 5) Chih Wei Liu 劉志尉 National Chiao Tung University

Lecture 8: Directory-Based Cache Coherence. Topics: scalable multiprocessor organizations, directory protocol design issues

EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy Prof. Sherief Reda School of Engineering Brown University

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU , Spring 2013

Performance study example ( 5.3) Performance study example

Cache Coherence. Introduction to High Performance Computing Systems (CS1645) Esteban Meneses. Spring, 2014

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015

Memory Hierarchy in a Multiprocessor

CMSC 611: Advanced Computer Architecture

Lecture 18: Shared-Memory Multiprocessors. Topics: coherence protocols for symmetric shared-memory multiprocessors (Sections

Data-Centric Consistency Models. The general organization of a logical data store, physically distributed and replicated across multiple processes.

Lecture 24: Thread Level Parallelism -- Distributed Shared Memory and Directory-based Coherence Protocol

Lecture: Consistency Models, TM

The MESI State Transition Graph

Computer Architecture

Aleksandar Milenkovich 1

Interconnect Routing

Page 1. Cache Coherence

CS654 Advanced Computer Architecture Lec 14 Directory Based Multiprocessors

ESE 545 Computer Architecture Symmetric Multiprocessors and Snoopy Cache Coherence Protocols CA SMP and cache coherence

Review: Multiprocessor. CPE 631 Session 21: Multiprocessors (Part 2) Potential HW Coherency Solutions. Bus Snooping Topology

Lecture 8: Transactional Memory TCC. Topics: lazy implementation (TCC)

CMSC 411 Computer Systems Architecture Lecture 21 Multiprocessors 3

Module 7: Synchronization Lecture 13: Introduction to Atomic Primitives. The Lecture Contains: Synchronization. Waiting Algorithms.

Page 1. Lecture 12: Multiprocessor 2: Snooping Protocol, Directory Protocol, Synchronization, Consistency. Bus Snooping Topology

Cache Coherence Protocols: Implementation Issues on SMP s. Cache Coherence Issue in I/O

Midterm Exam 02/09/2009

EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy

CS 152 Computer Architecture and Engineering. Lecture 19: Directory-Based Cache Protocols

Chapter 8. Multiprocessors. In-Cheol Park Dept. of EE, KAIST

CS 152 Computer Architecture and Engineering

Thread- Level Parallelism. ECE 154B Dmitri Strukov

Cache Coherence. (Architectural Supports for Efficient Shared Memory) Mainak Chaudhuri

Chapter 5 Thread-Level Parallelism. Abdullah Muzahid

Cache Coherence and Atomic Operations in Hardware

Foundations of Computer Systems

Aleksandar Milenkovic, Electrical and Computer Engineering University of Alabama in Huntsville

Handout 3 Multiprocessor and thread level parallelism

Lecture: Consistency Models, TM. Topics: consistency models, TM intro (Section 5.6)

CSE502: Computer Architecture CSE 502: Computer Architecture

Aleksandar Milenkovic, Electrical and Computer Engineering University of Alabama in Huntsville. Review: Small-Scale Shared Memory

Computer Architecture and Engineering CS152 Quiz #5 April 27th, 2016 Professor George Michelogiannakis Name: <ANSWER KEY>

Lecture 24: Multiprocessing Computer Architecture and Systems Programming ( )

CS 433 Homework 5. Assigned on 11/7/2017 Due in class on 11/30/2017

CS 152 Computer Architecture and Engineering. Lecture 19: Directory-Based Cache Protocols

1. Memory technology & Hierarchy

Cache Organizations for Multi-cores

EE382 Processor Design. Processor Issues for MP

Overview: Shared Memory Hardware. Shared Address Space Systems. Shared Address Space and Shared Memory Computers. Shared Memory Hardware

Overview: Shared Memory Hardware

Transcription:

Lecture 25: Multiprocessors Today s topics: Virtual memory wrap-up Snooping-based cache coherence protocol Directory-based cache coherence protocol Synchronization 1

TLB and Cache Is the cache indexed with virtual or physical address? To index with a physical address, we will have to first look up the TLB, then the cache longer access time Multiple virtual addresses can map to the same physical address must ensure that these different virtual addresses will map to the same location in cache else, there will be two different copies of the same physical memory word Does the tag array store virtual or physical addresses? Since multiple virtual addresses can map to the same physical address, a virtual tag comparison can flag a miss even if the correct physical memory word is present 2

Cache and TLB Pipeline Virtual page number TLB Physical page number Virtual address Virtual index Tag array Offset Data array Physical tag comparion Physical tag Virtually Indexed; Physically Tagged Cache 3

Snooping-Based Protocols Three states for a block: invalid, shared, modified A write is placed on the bus and sharers invalidate themselves The protocols are referred to as MSI, MESI, etc. Caches Caches Caches Caches Main Memory I/O System 4

Example P1 reads X: not found in cache-1, request sent on bus, memory responds, X is placed in cache-1 in shared state P2 reads X: not found in cache-2, request sent on bus, everyone snoops this request, cache-1does nothing because this is just a read request, memory responds, X is placed in cache-2 in shared state P1 P2 Cache-1 Cache-2 Main Memory P1 writes X: cache-1 has data in shared state (shared only provides read perms), request sent on bus, cache-2 snoops and then invalidates its copy of X, cache-1 moves its state to modified P2 reads X: cache-2 has data in invalid state, request sent on bus, cache-1 snoops and realizes it has the only valid copy, so it downgrades itself to shared state and responds with data, X is placed in cache-2 in shared state, memory is also updated 5

Example Request Cache Hit/Miss Request on the bus Who responds State in Cache 1 State in Cache 2 State in Cache 3 State in Cache 4 Inv Inv Inv Inv P1: Rd X Miss Rd X Memory S Inv Inv Inv P2: Rd X Miss Rd X Memory S S Inv Inv P2: Wr X Perms Miss P3: Wr X Write Miss Upgrade X No response. Other caches invalidate. Inv M Inv Inv Wr X P2 responds Inv Inv M Inv P3: Rd X Read Hit - - Inv Inv M Inv P4: Rd X Read Miss Rd X P3 responds. Mem wrtbk Inv Inv S S 6

Cache Coherence Protocols Directory-based: A single location (directory) keeps track of the sharing status of a block of memory Snooping: Every cache block is accompanied by the sharing status of that block all cache controllers monitor the shared bus so they can update the sharing status of the block, if necessary Write-invalidate: a processor gains exclusive access of a block before writing by invalidating all other copies Write-update: when a processor writes, it updates other shared copies of that block 7

Coherence in Distributed Memory Multiprocs Distributed memory systems are typically larger bus-based snooping may not work well Option 1: software-based mechanisms message-passing systems or software-controlled cache coherence Option 2: hardware-based mechanisms directory-based cache coherence 8

Distributed Memory Multiprocessors & Caches & Caches & Caches & Caches Memory I/O Memory I/O Memory I/O Memory I/O Directory Directory Directory Directory Interconnection network 9

Directory-Based Cache Coherence The physical memory is distributed among all processors The directory is also distributed along with the corresponding memory The physical address is enough to determine the location of memory The (many) processing nodes are connected with a scalable interconnect (not a bus) hence, messages are no longer broadcast, but routed from sender to receiver since the processing nodes can no longer snoop, the directory keeps track of sharing state 10

Cache Block States What are the different states a block of memory can have within the directory? Note that we need information for each cache so that invalidate messages can be sent The directory now serves as the arbitrator: if multiple write attempts happen simultaneously, the directory determines the ordering 11

Directory-Based Example Memory Directory & Caches I/O Memory Directory X & Caches I/O Memory Directory Y & Caches I/O A: Rd X B: Rd X C: Rd X A: Wr X A: Wr X C: Wr X B: Rd X A: Rd X A: Rd Y B: Wr X B: Rd Y B: Wr X B: Wr Y Interconnection network 12

Example Request Cache Hit/Miss Messages Dir State State in C1 State in C2 State in C3 State in C4 Inv Inv Inv Inv P1: Rd X Miss Rd-req to Dir. Dir responds. X: S: 1 S Inv Inv Inv P2: Rd X Miss Rd-req to Dir. Dir responds. X: S: 1, 2 S S Inv Inv P2: Wr X Perms Miss P3: Wr X Write Miss Upgr-req to Dir. Dir sends INV to P1. P1 sends ACK to Dir. Dir grants perms to P2. Wr-req to Dir. Dir fwds request to P2. P2 sends data to Dir. Dir sends data to P3. X: M: 2 Inv M Inv Inv X: M: 3 Inv Inv M Inv P3: Rd X Read Hit - - Inv Inv M Inv P4: Rd X Read Miss Rd-req to Dir. Dir fwds request to P3. P3 sends data to Dir. Memory wrtbk. Dir sends data to P4. X: S: 3, 4 Inv Inv S S 13

Directory Actions If block is in uncached state: Read miss: send data, make block shared Write miss: send data, make block exclusive If block is in shared state: Read miss: send data, add node to sharers list Write miss: send data, invalidate sharers, make excl If block is in exclusive state: Read miss: ask owner for data, write to memory, send data, make shared, add node to sharers list Data write back: write to memory, make uncached Write miss: ask owner for data, write to memory, send data, update identity of new owner, remain exclusive 14

Constructing Locks Applications have phases (consisting of many instructions) that must be executed atomically, without other parallel processes modifying the data A lock surrounding the data/code ensures that only one program can be in a critical section at a time The hardware must provide some basic primitives that allow us to construct locks with different properties Bank balance $1000 Parallel (unlocked) banking transactions Rd $1000 Add $100 Wr $1100 Rd $1000 Add $200 Wr $1200 15

Synchronization The simplest hardware primitive that greatly facilitates synchronization implementations (locks, barriers, etc.) is an atomic read-modify-write Atomic exchange: swap contents of register and memory Special case of atomic exchange: test & set: transfer memory location into register and write 1 into memory (if memory has 0, lock is free) lock: t&s register, location bnz register, lock CS st location, #0 When multiple parallel threads execute this code, only one will be able to enter CS 16

Title Bullet 17