Chapter 5: Multiprocessor Cache Coherence and Thread-Level Parallelism


From ILP to TLP

- ILP became inefficient in terms of power consumption and silicon cost, and workloads did not justify it: independent computations on large data sets dominate.
- Thread-level parallelism is a form of MIMD: it uses the MIMD model and has multiple program counters.
- It is targeted at tightly-coupled shared-memory multiprocessors.
- For N processors, you need N threads; the amount of computation assigned to each thread is the grain size.
- Threads can also be used for data-level parallelism, but the overheads may outweigh the benefit.

Memory System is Coherent If...

- P writes X, then P reads X with no intervening writes: the read returns the value written by P.
- Q writes X, then after a sufficient time period P reads X: the read returns the value written by Q.
- Writes are serialized: written values are seen in the same order by all processors.
- Coherency defines what values can be returned by a read; consistency defines when a written value is returned by a read.
- Example: if processor P writes to location A and then to B, any processor Q that sees the new value in B must also see the new value in A.

Multiprocessor Types

- Distributed shared-memory: memory is distributed among the processors, giving non-uniform memory access/latency (NUMA); processors are connected via direct (switched) and non-direct (multi-hop) interconnection networks.
- Centralized shared-memory: symmetric multiprocessors (SMP); a small number of cores share a single memory with uniform main-memory latency.

Enforcing Cache Coherence

- Coherent caches provide migration (movement of data) and replication (multiple copies of shared data).
- Cache coherence protocols:
  - Directory based: the sharing status of each block is kept in one location (a single directory for small core counts, or distributed directories).
  - Snooping: each core tracks the sharing status of each block it caches; requires a broadcast medium, e.g. a shared data bus; may be combined with cache directories for large core counts.
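The need to enforce coherence can be illustrated with a toy model of two per-core caches over one shared memory word: without invalidation a core keeps reading a stale cached value after the other core writes, while a write-invalidate broadcast discards the stale copy. The class and the write-through simplification below are illustrative sketches, not part of any real protocol.

```python
# Toy model: two private caches over one shared backing store.
class ToyCache:
    def __init__(self, memory):
        self.memory = memory   # shared backing store: {"X": value}
        self.lines = {}        # private copies: addr -> value

    def read(self, addr):
        if addr not in self.lines:          # miss: fetch from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value, peers=()):
        self.lines[addr] = value
        self.memory[addr] = value           # write-through, for simplicity
        for peer in peers:                  # write-invalidate broadcast
            peer.lines.pop(addr, None)

mem = {"X": 0}
c0, c1 = ToyCache(mem), ToyCache(mem)
assert c1.read("X") == 0        # core 1 caches X = 0

c0.write("X", 7)                # no invalidation sent...
stale = c1.read("X")            # ...so core 1 still sees 0: incoherent

c0.write("X", 8, peers=[c1])    # write-invalidate: purge core 1's copy
fresh = c1.read("X")            # miss, refetch: core 1 now sees 8
print(stale, fresh)             # -> 0 8
```

The second write shows the invariant the protocols below maintain: after a write completes, no cache holds an old value of the block.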

Snooping Cache Protocols

- Write invalidate protocol (the most common protocol): upon a write, make sure there is only one valid copy; write serialization happens through exclusive access to the data bus.
- Write update protocol (write broadcast): upon a write, make sure all copies are the same; requires more bandwidth.

Write Invalidate Cache Coherence Protocol

(Figure: two caches and memory; the sequence 1: read, 2: read, 3: write, 4: invalidate leaves the other cache's copy invalid.)

State Transition for Write Invalidate: CPU Requests

(State diagram; the states are Invalid, Shared (read only), and Exclusive (read/write).)
- Invalid to Shared: CPU read miss; place read miss on bus.
- Invalid to Exclusive: CPU write miss; place write miss on bus.
- Shared to Shared: CPU read hit, or CPU read miss (place read miss on bus).
- Shared to Exclusive: CPU write; place invalidate on bus.
- Exclusive to Exclusive: CPU read hit or CPU write hit; on a CPU write miss, write back the cache block and place a write miss on the bus.
- Exclusive to Shared: CPU read miss; write back the block and place a read miss on the bus.

State Transition for Write Invalidate: Bus Requests

- Shared to Invalid: write miss or invalidate for this block.
- Exclusive to Shared: read miss for this block; write back the block and abort the memory access.
- Exclusive to Invalid: write miss for this block; write back the block and abort the memory access.

Snooping Protocol Implementation Details

- Where is the most recent value of the data?
  - Write-through cache: in the cache or in memory.
  - Write-back cache: in another core's private cache, or in the shared cache/memory.
- Getting the most recent value of the data:
  - If it is in the shared cache/memory: request it as usual.
  - If it is in a private cache: ask the other core for the data and wait; invalidate the data in the other core's cache.
- Data buffers need to be treated as another cache.
- The basic protocol (Modified-Shared-Invalid = MSI) has problems: state transitions are not atomic; they involve many steps (detect a change/request of the data, acquire the bus, send the request, receive the response).
- Multi-processor (multi-socket) systems need an intra- and inter-chip solution: Intel QuickPath Interconnect, AMD HyperTransport.
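The CPU-request half of the write-invalidate state machine can be encoded as a small table-driven sketch. The states and bus actions follow the diagram above; the dictionary encoding and function name are mine.

```python
# CPU-request transitions of the write-invalidate (MSI-style) protocol:
# (state, cpu_event) -> (next_state, bus_action or None)
MSI = {
    ("Invalid",   "read miss"):  ("Shared",    "place read miss on bus"),
    ("Invalid",   "write miss"): ("Exclusive", "place write miss on bus"),
    ("Shared",    "read hit"):   ("Shared",    None),
    ("Shared",    "read miss"):  ("Shared",    "place read miss on bus"),
    ("Shared",    "write"):      ("Exclusive", "place invalidate on bus"),
    ("Exclusive", "read hit"):   ("Exclusive", None),
    ("Exclusive", "write hit"):  ("Exclusive", None),
    ("Exclusive", "read miss"):  ("Shared",    "write-back block; place read miss on bus"),
    ("Exclusive", "write miss"): ("Exclusive", "write-back block; place write miss on bus"),
}

def run(events, state="Invalid"):
    """Apply a sequence of CPU events to one block; return visited states."""
    trace = [state]
    for ev in events:
        state, _action = MSI[(state, ev)]
        trace.append(state)
    return trace

# A block is read (I -> S), then written (S -> E), then read again (hit).
print(run(["read miss", "write", "read hit"]))
# -> ['Invalid', 'Shared', 'Exclusive', 'Exclusive']
```

A real controller would also process the bus-request transitions listed above; this sketch covers only the processor side.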

Extensions to Simple Coherence Protocols

- MESI = Modified-Exclusive-Shared-Invalid: a coherence bit is added for each cache block; a dirty bit indicates that the block was modified.
- MESIF = Modified-Exclusive-Shared-Invalid-Forward (Intel i7): if a core has a block in the Forward state, it will respond to misses; this prevents all cores from responding to a single miss.
- MOESI (AMD): the Owned state indicates that the cache owns the block and memory is out-of-date; the transition from M to O prevents a (costly) write to memory. Note: blocks can be shared while in the M state.

True and False Sharing

- True sharing miss: a write to a shared block, or a read of a word in an invalidated block that was actually modified.
- False sharing miss: an access to an unmodified word in an invalidated block.
- Example (X[1] and X[2] reside in the same cache block, initially cached by both processors):

      Step  P1           P2
      1     Write X[1]
      2                  Read X[2]
      3     Write X[1]
      4                  Write X[2]
      5     Read X[2]

  Steps 1 and 5 are true sharing misses (the accessed word itself is communicated); steps 2-4 are false sharing misses (only the other word of the block was involved).

Limitations of SMP and Snooping Protocols

- Practical scalability limit: about 8 cores; the scalability bottleneck is snooping bandwidth.
- Solution: duplicate the address tags to keep the private cache separate: one set of tags for internal cache operation, one set for comparisons when misses appear on the snooping bus.
- Solution: add a directory to the outermost shared cache; the directory maps blocks to the processors that own them.
- Solution: pair crossbar or point-to-point interconnects with a multi-banked shared cache/memory (banks 0-3 in the figure) for better bandwidth.
- A larger cache cannot alleviate snooping broadcasts: a faster processor means more traffic.
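The cost of false sharing can be sketched by counting block ping-pongs between two cores that write different words. This is a deliberately simplified model that tracks only exclusive ownership at block granularity; the function name and trace are illustrative.

```python
# Count coherence misses (ownership transfers) for a sequence of writes.
def coherence_misses(accesses, block_size):
    """accesses: list of (core, addr) writes. Returns the miss count."""
    owner = {}   # block number -> core currently holding it exclusively
    misses = 0
    for core, addr in accesses:
        block = addr // block_size
        if owner.get(block) != core:   # must invalidate the other copy
            misses += 1
            owner[block] = core
    return misses

# Core 0 repeatedly writes word 1; core 1 repeatedly writes word 2.
trace = [(0, 1), (1, 2)] * 4

print(coherence_misses(trace, block_size=4))  # same block: ping-pong -> 8
print(coherence_misses(trace, block_size=1))  # separate blocks -> 2
```

With a 4-word block the two words share a block and every write steals ownership (false sharing); with 1-word blocks the writes are independent and only the two cold misses remain.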
Directory-Based Coherence

- Snooping broadcasts are a major drain on cache and bus bandwidth. Distributing resources localizes the traffic, but the snooping broadcasts must be removed.
- The alternative to snooping is the directory protocol. The directory keeps the state of every cache block and records which caches have a copy of the block.
- It could be as simple as an inclusive L3 cache, where each L3 block has a bit vector of length equal to the number of cores.
- A single directory scheme is not scalable: the directory must be distributed.

Snooping Protocol Extensions: AMD Opteron

- Snooping is used inside each multicore chip; separate memory is connected to each chip, giving a NUMA organization.
- Coherence is implemented with point-to-point links to up to three other chips; the links are not part of a shared bus.
- The request for a cache block is broadcast on each link (like snooping); an acknowledgement completes the memory operation. This is similar to directory-based protocols and is used for serialization of operations.

Directory Protocol Basics

- Storage size: number of cached memory blocks x number of nodes. A node is a single core, a single multicore socket, or a collection of multicore chips with internal coherence (usually based on snooping).
- For each block, the directory maintains a state. For example:
  - Shared: one or more nodes have the block cached and the value in memory is up-to-date; the sharers are recorded as a set of node IDs or a bit vector.
  - Uncached: no node has a copy of the cache block; it has been evicted, or it is somewhere in the lower (below-directory) cache levels.
  - Modified: exactly one node has a copy of the cache block and the value in memory is out-of-date; the directory records the owner node ID.
- The directory maintains the block states and sends invalidation messages.
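The storage-size formula above turns into a quick back-of-the-envelope calculation: one bit per node per tracked block, plus a few state bits per entry. The parameter values below are made-up examples, not from the slides.

```python
# Directory overhead: blocks * (bit-vector of one bit per node + state bits).
def directory_bits(memory_bytes, block_bytes, nodes, state_bits=2):
    blocks = memory_bytes // block_bytes
    return blocks * (nodes + state_bits)

# Example: 1 GiB of memory, 64-byte blocks, 64 nodes.
bits = directory_bits(1 << 30, 64, 64)
print(bits // (8 * (1 << 20)), "MiB of directory state")   # -> 132 MiB
```

The linear growth in the node count is exactly why a flat bit-vector directory stops scaling and must be distributed (or compressed) for large machines.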

Protocol Terms

- The protocol has two sides: local (the originator) and remote.
- The local node is where the request originates; the home node is where the memory location (address) and its directory entry reside.
- Invalid state: the state of a cache block from the perspective of the cache holding it.
- Uncached state: the state of a cache block from the perspective of the directory.

Protocol Messages

Message type      Source          Destination     Contents  Function of this message
Read miss         Local cache     Home directory  P, A      Node P has a read miss at address A; request the data and make P a sharer of A.
Write miss        Local cache     Home directory  P, A      Node P has a write miss at address A; request the data and make P the exclusive owner.
Invalidate        Local cache     Home directory  A         Request that invalidates be sent to all remote caches that cache A.
Invalidate        Home directory  Remote cache    A         Invalidate the copy of the data at A.
Fetch             Home directory  Remote cache    A         Fetch the block at A and send it to the home directory; change the state of A in the remote cache to Shared.
Fetch/Invalidate  Home directory  Remote cache    A         Fetch the block at A and send it to the home directory; invalidate the block in the cache.
Data value reply  Home directory  Local cache     D         Return a data value from the home memory.
Data write-back   Remote cache    Home directory  A, D      Write back the data value for address A.

Protocol Details: Uncached and Shared Blocks

For an uncached block:
- Read miss: the requesting node is sent the requested data and is made the only sharing node; the block is now Shared.
- Write miss: the requesting node is sent the requested data and becomes the sharing node; the block is now Exclusive.
For a shared block:
- Read miss: the requesting node is sent the requested data from memory and is added to the sharing set.
- Write miss: the requesting node is sent the value; all nodes in the sharing set are sent invalidate messages; the sharing set contains only the requesting node; the block is now Exclusive.

Protocol Details: Exclusive Block

For an exclusive block:
- Read miss: the owner is sent a data fetch message and the block becomes Shared; the owner sends the data to the directory, where it is written back to memory; the sharer set contains the old owner and the requestor.
- Data write-back: the block becomes Uncached and the sharer set is empty.
- Write miss: a message is sent to the old owner to invalidate the block and send the value to the directory; the requestor becomes the new owner; the block remains Exclusive.

Hardware Synchronization Overview

- There is always a need for synchronizing threads: start, stop, increment a counter, ...
- Software-based solutions are slower and are reserved for software synchronization.
- Basic synchronization concepts: atomicity and transactions (concepts also used in databases).
- Elementary synchronization primitives:
  - Atomic exchange: swaps a register with a memory location.
  - Test-and-set: sets a value if a condition is met.
  - Fetch-and-increment: returns a value from memory and increments it after the read.
- Requirement: the read and the write execute as a single, uninterruptible instruction.
- The coherence protocol must guarantee atomicity and the absence of deadlock and livelock.

Hardware Synchronization: Implementation

- A special instruction with atomic semantics requires complications to the coherence protocol for caches, directories, ...
- Alternative: a special instruction pair, attempt and then see if it succeeded. Example:
  - Load linked (or load locked): initiates a load of the contents of memory location A.
  - Store conditional: it might succeed if there was no intervening change to location A; it fails if a write to A or a context switch occurred.
- Sample implementation of an atomic exchange between R4 and 0(R1):

      try: mov       R3, R4     ; moves between registers are atomic
           LoadLink  R2, 0(R1)  ; initiate a load
           StoreCond R3, 0(R1)  ; returns 0 in R3 on failure
           beqz      R3, try    ; keep trying on failure
           mov       R4, R2     ; R2 got the value from LoadLink

- Is the memory still coherent? Is deadlock possible? Livelock?
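The load-linked/store-conditional semantics can be sketched in software with a version counter standing in for the coherence machinery that real hardware uses: the store conditional succeeds only if nobody wrote the location since the load linked. All class and method names here are illustrative.

```python
# Software sketch of LL/SC semantics via a per-location version counter.
class Memory:
    def __init__(self, value=0):
        self.value, self.version = value, 0

    def load_linked(self):
        return self.value, self.version      # remember what we observed

    def store(self, value):                  # a plain (intervening) write
        self.value, self.version = value, self.version + 1

    def store_conditional(self, value, seen_version):
        if self.version != seen_version:
            return False                     # someone wrote in between: fail
        self.store(value)
        return True

def atomic_exchange(mem, new):
    """Swap `new` into mem and return the old value, retrying on failure."""
    while True:
        old, v = mem.load_linked()
        if mem.store_conditional(new, v):
            return old

m = Memory(5)
old = atomic_exchange(m, 9)
print(old, m.value)                      # -> 5 9

# An intervening write makes the store conditional fail.
m2 = Memory(1)
_, ver = m2.load_linked()
m2.store(2)                              # another core writes first
print(m2.store_conditional(7, ver))      # -> False
```

Real hardware fails the store conditional via the coherence protocol (losing the block, or a context switch) rather than a counter, but the retry-loop structure is the same as in the assembly sample above.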

Memory Consistency: Introduction

- How consistent must the memory view be? What is the observable order of writes from different processors? (An observation is a read from memory.)
- Consistency defines properties between reads and writes. The strongest form of consistency is sequential consistency.
- Example (A and B initially 0):

      Processor 1:        Processor 2:
      A = 1;              B = 1;
      if (B == 0) ...     if (A == 0) ...

Memory Consistency: Coherence Protocol View

- What does the write A = 1 do?
  - If A is not in a cached block: send a request to get it.
  - Send an invalidate if A is in a shared block in some processor.
  - Wait: for the block containing A, and for the invalidate to complete.

Memory Consistency: Sample Interleaved Execution

(Figure: several interleavings of the two instruction streams in the example above.)
- How many possible interleaved sequences are there?

Strongest Form of Memory Consistency

- Sequential consistency requires that accesses from each processor are kept in order, while accesses among different processors are arbitrarily interleaved.
- Sample implementations: delay each memory access until all invalidation messages complete, or delay the next memory access until the previous one completes.
- Either scheme affects performance negatively: delay means more memory-related stalls.
- To improve performance: make the coherence protocol faster and more effective (by hiding latency, etc.), or relax the consistency model; a more stringent consistency model may be implemented in software, and only as required.

Relaxed Consistency Models

- The rule X → Y means that operation X must complete before operation Y is done.
- Sequential consistency requires all orderings to be maintained: R → W, R → R, W → R, W → W.
- Relaxing W → R is called total store ordering, or processor consistency.
- Relaxing W → W is called partial store order.
- Relaxing R → W and R → R is called weak ordering, the PowerPC consistency model, and release consistency.
- Relaxing any of the orderings is done to increase performance.
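Sequential consistency for the two-processor example (P1: A = 1; read B. P2: B = 1; read A) can be checked exhaustively: enumerate every interleaving that preserves each processor's program order and confirm that no interleaving lets both reads return 0. The enumeration also answers the interleaving-count question for two 2-instruction streams. The encoding below is an illustrative sketch.

```python
# Enumerate sequentially consistent interleavings of two instruction streams.
from itertools import combinations

P1 = [("write", "A"), ("read", "B")]   # A = 1; then read B
P2 = [("write", "B"), ("read", "A")]   # B = 1; then read A

def interleavings(p1, p2):
    """Yield all merges of p1 and p2 preserving each program's order."""
    n = len(p1) + len(p2)
    for slots in combinations(range(n), len(p1)):   # slots taken by p1
        seq, i1, i2 = [], iter(p1), iter(p2)
        for k in range(n):
            seq.append(next(i1) if k in slots else next(i2))
        yield seq

def run(seq):
    """Execute one interleaving; return what each read observed."""
    mem = {"A": 0, "B": 0}
    reads = {}
    for op, loc in seq:
        if op == "write":
            mem[loc] = 1
        else:
            reads[loc] = mem[loc]
    return reads

results = [run(s) for s in interleavings(P1, P2)]
print(len(results))                                       # -> 6 interleavings
print(any(r["A"] == 0 and r["B"] == 0 for r in results))  # -> False
```

Under sequential consistency both reads can never return 0, since each read follows its own processor's write; relaxed models that reorder W → R (such as total store ordering) do permit that outcome, which is why fences are needed in such code.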