Cache Coherence. Introduction to High Performance Computing Systems (CS1645) Esteban Meneses. Spring, 2014


Supercomputer Galore Starting around 1983, the number of companies building supercomputers exploded: Control Data Corporation (CDC), Thinking Machines Corporation, IBM, Cray, Burroughs, nCUBE, MasPar. Several different designs were explored. Two major designs persist even today: Symmetric Multiprocessor System (SMP). Massively Parallel Processor (MPP). 2

Top 500 Source: www.top500.org 3

Symmetric Multiprocessor System Source: www.wikipedia.org 4

The Coherence Problem Data may not be consistent, i.e., the same memory location may hold different values in different caches. Example:
Time i: Memory holds X=1.
Time i+1: A reads X. Cache A: X=1. Memory: X=1.
Time i+2: B reads X. Cache A: X=1. Cache B: X=1. Memory: X=1.
Time i+3: A writes X=0. Cache A: X=0. Cache B: X=1 (stale). Memory: X=0 under write-through (WT), X=1 under write-back (WB). 5
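The scenario above can be mimicked in a few lines of Python. This is a toy sketch, not a cache model: the two private caches are plain dicts over one shared memory dict, and the write-back policy is modeled simply by not updating memory on A's write.

```python
# Toy illustration of the coherence problem under write-back (not a real cache).
memory = {"X": 1}                # time i: X=1 in memory
cache_a = {}
cache_b = {}

cache_a["X"] = memory["X"]       # time i+1: A reads X
cache_b["X"] = memory["X"]       # time i+2: B reads X
cache_a["X"] = 0                 # time i+3: A writes X=0; write-back defers the memory update

print(cache_a["X"], cache_b["X"], memory["X"])   # -> 0 1 1: B and memory hold stale X
```

Under write-through, the final assignment would also update memory (X=0), but cache B would still hold the stale value 1 until it is invalidated or updated.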

Approaches Do not cache shared data. Do not cache writeable shared data. Use snoopy caches (if processors are connected by a bus). If there is no shared bus: Use broadcast to emulate a shared bus. Use directory-based protocols (a directory keeps track of the concerned parties so that only they are contacted). 6

Snoopy Caches Each processor monitors the activity on the bus. On a read, all caches check to see if they have a copy of the requested block. If yes, they may have to supply the data. On a write, all caches check to see if they have a copy of the data. If yes, they either invalidate the local copy, or update it with the new value. Can use either a write-back or a write-through policy (the former is usually more efficient, but makes coherence harder to maintain). Source: Prof. Rami Melhem 7

MSI Protocol Invalidation protocol for write-back caches. Each block of memory can be: - Clean: in one or more caches and up-to-date in memory. - Dirty: in exactly one cache. - Uncached: not in any cache. Correspondingly, the state of each block in a cache is recorded: - Invalid: block contains invalid data. - Shared: a clean block (can be shared by other caches). - Exclusive/Modified: a dirty block (cannot be in any other cache). MSI protocol (modified/shared/invalid) makes sure that if a block is dirty in one cache, it is not valid in any other cache and that a read request gets the most updated data. 8

MSI Protocol (cont.) A read miss to a block in a cache, C1, generates a bus transaction: - If another cache, C2, has the block exclusively, it has to write back the block before memory supplies it. - C1 gets the data from the bus and the block becomes shared in both caches. A write hit to a shared block in C1: - All other caches that have the block should invalidate that block - The block becomes exclusive in C1. A write hit to an exclusive block does not generate a write back or change of state. 9

MSI Protocol (cont.) A write miss (to an invalid block) in C1 generates a bus transaction: - If a cache, C2, has the block as shared, it invalidates it. - If a cache, C2, has the block in exclusive, it writes back the block and changes its state in C2 to invalid. - If no cache supplies the block, the memory will supply it. - When C1 gets the block, it sets its state to exclusive. 10

MSI Protocol (cont.) Source: Computer Architecture: A Quantitative Approach (Hennessy and Patterson) 11
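The MSI transitions described on the previous slides can be condensed into a toy, single-block snooping simulator. This is a sketch, not protocol hardware: each cache holds one line, data fetch on a write miss is omitted, and the class and method names (MSICache, Bus, read_miss, invalidate_others) are invented for illustration.

```python
# Toy single-line MSI simulator: states 'M' (exclusive/modified), 'S', 'I'.
class MSICache:
    def __init__(self, bus):
        self.state, self.value, self.bus = 'I', None, bus
        bus.caches.append(self)

    def read(self):
        if self.state == 'I':                      # read miss: bus transaction
            self.value = self.bus.read_miss(self)
            self.state = 'S'
        return self.value                          # read hit: no bus traffic

    def write(self, value):
        if self.state != 'M':                      # write hit to S, or write miss
            self.bus.invalidate_others(self)
            self.state = 'M'
        self.value = value                         # write hit to M: no bus traffic

class Bus:
    def __init__(self, memory_value=0):
        self.caches, self.memory = [], memory_value

    def read_miss(self, requester):
        for c in self.caches:
            if c is not requester and c.state == 'M':
                self.memory = c.value              # exclusive owner writes back...
                c.state = 'S'                      # ...and becomes a sharer
        return self.memory

    def invalidate_others(self, requester):
        for c in self.caches:
            if c is not requester and c.state != 'I':
                if c.state == 'M':
                    self.memory = c.value          # write back before invalidating
                c.state = 'I'

bus = Bus(memory_value=1)
p1, p2 = MSICache(bus), MSICache(bus)
p1.write(0)                  # p1 becomes exclusive
print(p2.read())             # p1 writes back and both become shared; prints 0
```

The final read reproduces the rule on slide 9: a read miss to a block held exclusively elsewhere forces a write-back, after which the block is shared in both caches.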

Example Both blocks B1 and B2 map to cache line L. Initially, neither B1 nor B2 is cached (line L is invalid in both P1's and P2's cache).
P1 writes 10 to B1 (write miss): P1's L: B1=10 (exclusive). P2's L: invalid. Bus: write miss B1.
P1 reads B1 (read hit): no change and no bus transaction.
P2 reads B1 (read miss): P1 writes B1 back. P1's L: B1=10 (shared). P2's L: B1=10 (shared). Bus: read miss B1.
P2 writes 20 to B1 (write miss): P1's L: invalid. P2's L: B1=20 (exclusive). Bus: write miss B1.
P2 writes 40 to B2 (write miss): P2 writes B1 back. P1's L: invalid. P2's L: B2=40 (exclusive). Bus: write miss B2. 12

Example (cont.)
P1 reads B1 (read miss): P1's L: B1=20 (shared). P2's L: B2=40 (exclusive). Bus: read miss B1.
P1 writes 30 to B1 (write miss): P1's L: B1=30 (exclusive). P2's L: B2=40 (exclusive). Bus: write miss B1.
P2 writes 50 to B1 (write miss): P1 writes B1 back and invalidates its copy; P2 writes B2 back to make room. P1's L: invalid. P2's L: B1=50 (exclusive). Bus: write miss B1.
P1 reads B1 (read miss): P2 writes B1 back. P1's L: B1=50 (shared). P2's L: B1=50 (shared). Bus: read miss B1.
P2 reads B2 (read miss): P1's L: B1=50 (shared). P2's L: B2=40 (shared). Bus: read miss B2.
P1 writes 60 to B2 (write miss): P1's L: B2=60 (exclusive). P2's L: invalid. Bus: write miss B2. 13

Hierarchical Snooping B1 buses follow the standard snooping protocol to keep L1 copies consistent with the content of their shared L2. The B2 bus keeps copies in L2 consistent with memory. A change of state of a cache line in L2 can be triggered by a transaction on either of the two buses (B1 or B2) and may require propagation to the other bus. Source: Prof. Rami Melhem 14

Directory-based Coherence Assumes a shared memory space which is physically distributed. Goal: implement a directory that keeps track of where each copy of a block is cached and its state in each cache (snooping keeps the state of a block only in the cache). Processors must consult the directory before loading blocks from memory into their caches. When a block in memory is updated (written), the directory is consulted to either update or invalidate other cached copies. Eliminates the overhead of broadcasting/snooping (bus bandwidth); hence, it scales up to numbers of processors that would saturate a single bus, but it is slower in terms of latency. Source: Prof. Rami Melhem 15

Types of Directories Centralized or distributed. Alternatives: Distributed memory, centralized directory. Centralized memory, distributed directory. Source: Prof. Rami Melhem 16

Directory Operation The location (home) of each memory block is determined by its address. A controller decides if access is local or remote. As in snooping caches, the state of every block in every cache is tracked in that cache (exclusive/dirty, shared/clean, invalid) to avoid the need for write through and unnecessary write back. In addition, with each block in memory, a directory entry keeps track of where the block is cached. Accordingly, a block can be in one of the following states: - Uncached: no processor has it (not valid in any cache). - Shared/clean: cached in one or more processors and memory is up-to-date. - Exclusive/dirty: one processor (owner) has data; memory out-of-date. 17
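As a sketch, a directory entry per memory block only needs a state and a set of sharers, and the home node can be computed from the block address. All names below, the 4-node system size, and the block-interleaved address mapping are illustrative assumptions, not a specific machine's scheme.

```python
from dataclasses import dataclass, field

@dataclass
class DirectoryEntry:
    # "uncached" | "shared" (clean, 1+ sharers) | "exclusive" (dirty, 1 owner)
    state: str = "uncached"
    sharers: set = field(default_factory=set)   # ids of nodes caching the block

NUM_NODES = 4          # assumed system size
BLOCK_SIZE = 64        # assumed block size in bytes

def home_node(address):
    # The home of a block is determined by its address (simple block interleaving).
    return (address // BLOCK_SIZE) % NUM_NODES

directory = {addr: DirectoryEntry() for addr in range(0, 1024, BLOCK_SIZE)}
directory[128].state = "shared"
directory[128].sharers = {0, 2}     # block 128 is clean in the caches of nodes 0 and 2
print(home_node(128))               # -> 2: node 2 stores block 128 and its entry
```

A snooping cache keeps only its own line states; the sharer set here is exactly the extra bookkeeping that lets the directory contact concerned parties instead of broadcasting.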

Coherence Protocol Coherence is enforced by exchanging messages between nodes. Three types of nodes may be involved: - Local requestor node (L): the node that reads or writes the cache block. - Home node (H): the node that stores the block in its memory -- may be the same as L. - Remote nodes (R): other nodes that have a cached copy of the requested block. When L encounters a Read Hit, it just reads the data. When L encounters a Read Miss, it sends a message to the home node, H, of the requested block. Three cases may arise: - The directory indicates that the block is not cached. - The directory indicates that the block is shared/clean and may supply the list of sharers. - The directory indicates that the block is exclusive. 18

Read Miss Read miss to a clean or uncached block: - L sends request to H. - H sends the block to L. - The state of the block is set to shared in the directory. - The state of the block is set to shared in L. Read miss to a block that is exclusive (in another cache): - L sends request to H. - H informs L about the block owner, R. - L requests the block from R. - R sends the block to L. - L and R set the state of the block to shared. - R informs H that it should change the state of the block to shared. Source: Prof. Rami Melhem 19

Write Miss Write miss to an uncached block: - similar to a read miss to an uncached block, except that the state of the block is set to exclusive. Write miss to a block that is exclusive in another cache: - similar to a read miss to an exclusive block, except that the state of the block is set to exclusive. Write miss to a shared block: - L sends request to H. - H sets the state to exclusive. - H sends the block to L. - H sends to L the list of other sharers. - L sets the block's state to exclusive. - L sends invalidating messages to each sharer (R). - Each R sets its block's state to invalid. Source: Prof. Rami Melhem 20

Write Hit If the block is exclusive in L, just write the data. If the block is shared in L: - L sends a request to H to have the block as exclusive. - H sets the state to exclusive. - H informs L of the block's other sharers. - L sets the block's state to exclusive. - L sends invalidating messages to each sharer (R). - Each R sets its block's state to invalid. Source: Prof. Rami Melhem 21
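The message flows on the last three slides can be sketched directory-side as two handlers (a write hit to a shared block follows the write-miss path minus the data transfer). This is a sketch under simplifying assumptions: messages are plain function calls, data movement and write-backs appear only in comments, and the function and field names are invented for illustration.

```python
from types import SimpleNamespace

def read_miss(directory, caches, block, L):
    entry = directory[block]
    if entry.state == "exclusive":        # H forwards the request to the owner R,
        (R,) = entry.sharers              # who supplies the block and downgrades
        caches[R][block] = "shared"
    entry.state = "shared"
    entry.sharers.add(L)
    caches[L][block] = "shared"

def write_miss(directory, caches, block, L):
    entry = directory[block]
    for R in entry.sharers - {L}:         # invalidate every other cached copy
        caches[R][block] = "invalid"      # (an exclusive owner writes back first)
    entry.state, entry.sharers = "exclusive", {L}
    caches[L][block] = "exclusive"

# Block X starts exclusive at node 1; node 2 reads it, then node 0 writes it.
directory = {"X": SimpleNamespace(state="exclusive", sharers={1})}
caches = {0: {}, 1: {"X": "exclusive"}, 2: {}}
read_miss(directory, caches, "X", 2)
print(caches[1]["X"], caches[2]["X"])                  # shared shared
write_miss(directory, caches, "X", 0)
print(caches[0]["X"], caches[1]["X"], caches[2]["X"])  # exclusive invalid invalid
```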

Exercise A system with several processors (A, B, C, and more) uses a directory-based cache coherence protocol. A is the home node of block X. What would happen in the following scenarios? 1. X is in the uncached state. B reads X. B writes to X. 2. X is in the exclusive state, owned by B. C reads X. 3. X is in the exclusive state, owned by B. C writes to X. 4. X is in the shared state, clean in both B and C. B reads X. C writes to X. 22

MESI Protocol Splits MSI's combined Exclusive/Modified state into two states, giving four states in each cache: - Invalid/uncached: no processor has it (not valid in any cache). - Shared: cached in more than one processor and memory is up-to-date. - Exclusive: one processor (owner) has the data and it is clean. - Modified: one processor (owner) has the data, but it is dirty. If MESI is implemented using a directory, then the information kept for each block in the directory is the same as in the three-state protocol: - Shared in MESI = shared/clean with more than one sharer. - Exclusive in MESI = shared/clean with only one sharer. - Modified in MESI = Exclusive/Modified/dirty. - However, at each cached copy, a distinction is made between shared and exclusive. 23

MESI Protocol (cont.) On a read miss (local block is invalid), load the block and change its state to: - exclusive if it was uncached in memory. - shared if it was already shared, modified or exclusive. - If it was modified, get the clean copy from the owner. - If it was modified or exclusive, change the copy at the previous owner to shared. On a write miss, load the block and mark it modified. Invalidate other cached copies (if any). On a write hit to a modified block, do nothing. On a write hit to an exclusive block, change the block to modified. On a write hit to a shared block, change the block to modified and invalidate the other cached copies. When a modified block is evicted, write it back. 24
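The rules above can be written down as a transition table for one cache's copy. This is a sketch: the event names are invented, only a representative subset of events is listed, and bus-side effects (write-backs, invalidations, supplying data) are noted only in comments.

```python
# MESI transitions as (current_state, event) -> next_state for one cached copy.
MESI = {
    ("I", "read_miss_uncached"):  "E",   # no other cache has the block
    ("I", "read_miss_shared"):    "S",   # fetched; other caches keep copies
    ("I", "write_miss"):          "M",   # others invalidate their copies
    ("S", "write_hit"):           "M",   # must still invalidate other sharers
    ("E", "write_hit"):           "M",   # silent upgrade: no bus transaction
    ("M", "write_hit"):           "M",
    ("E", "remote_read"):         "S",
    ("M", "remote_read"):         "S",   # supply the data / write back first
    ("S", "remote_write"):        "I",
    ("E", "remote_write"):        "I",
    ("M", "remote_write"):        "I",   # write back, then invalidate
}

state = "I"
for event in ["read_miss_uncached", "write_hit", "remote_read"]:
    state = MESI[(state, event)]
print(state)   # I -> E -> M -> S
```

The trace shows the advantage discussed next: the read miss to an uncached block lands in E, so the following write upgrades to M without a bus transaction.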

MESI Protocol (cont.) If MESI is implemented as a snooping protocol, then the main advantage over the three-state protocol appears when a read to an uncached block is followed by a write to that block. - After the uncached block is read, it is marked exclusive. - Note that, when writing to a shared block, the transaction has to be posted on the bus so that other sharers invalidate their copies. - But when writing to an exclusive block, there is no need to post the transaction on the bus. - Hence, by distinguishing between the shared and exclusive states, we can avoid bus transactions when writing to an exclusive block. - However, a cache that holds an exclusive block now has to monitor the bus for any read to that block; such a read changes the state to shared. This advantage disappears in a directory protocol, since after a write to an exclusive block the directory has to be notified to change the state to modified. 25

MESI Protocol (cont.) Source: www.wikipedia.org 26

Example Both blocks B1 and B2 map to cache line L. Initially, neither B1 nor B2 is cached (line L is invalid in both caches).
P1 writes 10 to B1 (write miss): P1's L: B1=10 (modified). P2's L: invalid.
P1 reads B1 (read hit): P1's L: B1=10 (modified). P2's L: invalid.
P2 reads B1 (read miss): P1 writes B1 back. P1's L: B1=10 (shared). P2's L: B1=10 (shared).
P2 writes 20 to B1 (write hit): P1's L: invalid. P2's L: B1=20 (modified).
P2 writes 40 to B2 (write miss): P2 writes B1 back. P1's L: invalid. P2's L: B2=40 (modified). 27

Example (cont.)
P1 reads B1 (read miss): P1's L: B1=20 (exclusive). P2's L: B2=40 (modified).
P1 writes 30 to B1 (write hit): P1's L: B1=30 (modified), with no bus transaction. P2's L: B2=40 (modified).
P2 writes 50 to B1 (write miss): P1 writes B1 back and invalidates its copy; P2 writes B2 back to make room. P1's L: invalid. P2's L: B1=50 (modified).
P1 reads B1 (read miss): P2 writes B1 back. P1's L: B1=50 (shared). P2's L: B1=50 (shared).
P2 reads B2 (read miss): P1's L: B1=50 (shared). P2's L: B2=40 (exclusive).
P1 writes 60 to B2 (write miss): P1's L: B2=60 (modified). P2's L: invalid. 28

Memory Coherence Informally, a memory system is coherent if a read of X returns the most recent value written to X. Formally: Local ordering: if P writes α to X, and then reads from X, it should retrieve α, assuming there are no remote operations on X in between. Coordination: if P writes α to X, and Q reads from X, it should retrieve α, assuming there is enough time between the two operations and there are no other operations on X in between. Write serialization: write operations to the same location by multiple processors are seen in the same order by all processors. 29

Memory Consistency A model to determine when a write to a memory location is seen by other processors. Different consistency models provide different orders of execution. There is no single absolutely correct behavior. Example: Processor A: X=0; ...; X=1; if (Y==0) ... Processor B: Y=0; ...; Y=1; if (X==0) ... 30

Consistency Models Strong consistency: Enforces a total ordering among memory operations by different processors. Simple conceptual model (like multitasking on a single CPU). Potentially incurs a high performance penalty. Weak consistency: Allows out-of-order execution within individual processors. Programmers must be aware of the consistency model and adapt their code to it. Allows more concurrency and improves performance. 31

Sequential Consistency Requirements: Memory operations must be atomic. Memory operations from the same processor execute in program order. There is a serialization (interleaving) of all memory operations from all processors. Example: Processor A Processor B W(Y)5 W(X)0 W(X)1 R(X)0 R(Y)0 32
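For short executions, sequential consistency can be checked by brute force: search for one interleaving that respects each processor's program order and in which every read returns the latest value written. The sketch below assumes all locations start at 0, and the function name is invented; operations are (op, location, value) tuples.

```python
# Brute-force search for a sequentially consistent interleaving (a sketch).
def sequentially_consistent(streams, mem=None, idx=None):
    mem = dict(mem or {})                        # current memory contents
    idx = list(idx or [0] * len(streams))        # next op index per processor
    if all(i == len(s) for i, s in zip(idx, streams)):
        return True                              # every operation was placed
    for p, stream in enumerate(streams):
        if idx[p] == len(stream):
            continue
        op, var, val = stream[idx[p]]
        if op == "R" and mem.get(var, 0) != val:
            continue                             # this read cannot go next
        nxt = dict(mem)
        if op == "W":
            nxt[var] = val
        if sequentially_consistent(streams, nxt, idx[:p] + [idx[p] + 1] + idx[p+1:]):
            return True
    return False

A = [("W", "Y", 5), ("W", "X", 0), ("W", "Y", 0)]
B = [("R", "Y", 5), ("W", "X", 1), ("R", "X", 0), ("R", "Y", 0)]
print(sequentially_consistent([A, B]))   # True
```

The streams A and B are the first pair from the exercise on the next slide; the search finds a witness interleaving such as W(Y)5, R(Y)5, W(X)1, W(X)0, R(X)0, W(Y)0, R(Y)0.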

Exercise Show that the following executions are sequentially consistent.
1. Processor A: W(Y)5, W(X)0, W(Y)0. Processor B: R(Y)5, W(X)1, R(X)0, R(Y)0.
2. Processor A: W(Z)1, W(X)7, W(Y)9. Processor B: W(X)0, R(Y)9. Processor C: R(Z)1, R(X)7. 33

Summary The SMP and MPP designs for supercomputers are still valid. Cache coherence protocols enforce natural expectations from the behavior of the memory system. Directory-based protocols are more scalable than snoopy protocols. Memory consistency models determine how soon a write operation is seen by other processors. 34

Acknowledgments Lecture notes by Prof. Rami Melhem. Computer Architecture: A Quantitative Approach by John L. Hennessy and David A. Patterson. Morgan-Kaufmann, fourth edition, 2006. An Introduction to Parallel Programming by Peter S. Pacheco. Morgan-Kaufmann, 2011. 35