Suggested Readings! What makes a memory system coherent?! Lecture 27" Cache Coherency! ! Readings! ! Program order!! Sequential writes!! Causality!

Similar documents
Lecture 24: Board Notes: Cache Coherency

Incoherent each cache copy behaves as an individual copy, instead of as the same memory location.

Lecture 29 Review" CPU time: the best metric" Be sure you understand CC, clock period" Common (and good) performance metrics"

Introduction to Multiprocessors (Part II) Cristina Silvano Politecnico di Milano

Cache Coherence. Silvina Hanono Wachman Computer Science & Artificial Intelligence Lab M.I.T.

Chapter 5. Multiprocessors and Thread-Level Parallelism

Cache Coherence. Silvina Hanono Wachman Computer Science & Artificial Intelligence Lab M.I.T.

Lecture 18: Coherence Protocols. Topics: coherence protocols for symmetric and distributed shared-memory multiprocessors (Sections

COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU , Spring 2013

Cache Coherence. Bryan Mills, PhD. Slides provided by Rami Melhem

Shared Memory Multiprocessors. Symmetric Shared Memory Architecture (SMP) Cache Coherence. Cache Coherence Mechanism. Interconnection Network

Advanced OpenMP. Lecture 3: Cache Coherency

1. Memory technology & Hierarchy

EC 513 Computer Architecture

CS252 Spring 2017 Graduate Computer Architecture. Lecture 12: Cache Coherence

Lecture 25: Multiprocessors

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 5. Multiprocessors and Thread-Level Parallelism

Multicore Workshop. Cache Coherency. Mark Bull David Henty. EPCC, University of Edinburgh

Scalable Cache Coherence

Multiprocessors & Thread Level Parallelism

CS377P Programming for Performance Multicore Performance Cache Coherence

Lecture 24: Virtual Memory, Multiprocessors

Lecture 25: Multiprocessors. Today s topics: Snooping-based cache coherence protocol Directory-based cache coherence protocol Synchronization

ECE 485/585 Microprocessor System Design

Lecture-22 (Cache Coherence Protocols) CS422-Spring

The Cache Write Problem

CSC 631: High-Performance Computer Architecture

Chapter 5 Thread-Level Parallelism. Abdullah Muzahid

Cache Coherence. Introduction to High Performance Computing Systems (CS1645) Esteban Meneses. Spring, 2014

ECSE 425 Lecture 30: Directory Coherence

Cache Coherence and Atomic Operations in Hardware

Multiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015

Chap. 4 Multiprocessors and Thread-Level Parallelism

Chapter 5. Multiprocessors and Thread-Level Parallelism

Flynn s Classification

CS 152 Computer Architecture and Engineering. Lecture 19: Directory-Based Cache Protocols

Lecture 30: Multiprocessors Flynn Categories, Large vs. Small Scale, Cache Coherency Professor Randy H. Katz Computer Science 252 Spring 1996

Parallel Computers. CPE 631 Session 20: Multiprocessors. Flynn s Tahonomy (1972) Why Multiprocessors?

Limitations of parallel processing

Shared Memory SMP and Cache Coherence (cont) Adapted from UCB CS252 S01, Copyright 2001 USB

Multiprocessor Cache Coherency. What is Cache Coherence?

EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy

ESE 545 Computer Architecture Symmetric Multiprocessors and Snoopy Cache Coherence Protocols CA SMP and cache coherence

Lecture 3: Directory Protocol Implementations. Topics: coherence vs. msg-passing, corner cases in directory protocols

Overview: Shared Memory Hardware. Shared Address Space Systems. Shared Address Space and Shared Memory Computers. Shared Memory Hardware

Overview: Shared Memory Hardware

Lecture 4: Directory Protocols and TM. Topics: corner cases in directory protocols, lazy TM

Computer Architecture Memory hierarchies and caches

Cache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012)

Foundations of Computer Systems

Cache Coherence. (Architectural Supports for Efficient Shared Memory) Mainak Chaudhuri

Chapter 5. Thread-Level Parallelism

CSC 631: High-Performance Computer Architecture

10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems

Introduction. Coherency vs Consistency. Lec-11. Multi-Threading Concepts: Coherency, Consistency, and Synchronization

Interconnect Routing

CS 152 Computer Architecture and Engineering. Lecture 19: Directory-Based Cache Protocols

Lecture 8: Snooping and Directory Protocols. Topics: split-transaction implementation details, directory implementations (memory- and cache-based)

Multiprocessors II: CC-NUMA DSM. CC-NUMA for Large Systems

Computer Architecture and Engineering CS152 Quiz #5 April 27th, 2016 Professor George Michelogiannakis Name: <ANSWER KEY>

Aleksandar Milenkovich 1

Portland State University ECE 588/688. Cache Coherence Protocols

3/13/2008 Csci 211 Lecture %/year. Manufacturer/Year. Processors/chip Threads/Processor. Threads/chip 3/13/2008 Csci 211 Lecture 8 4

Lecture 24: Multiprocessing Computer Architecture and Systems Programming ( )

Scalable Multiprocessors

Lecture 5: Directory Protocols. Topics: directory-based cache coherence implementations

CS 152 Computer Architecture and Engineering. Lecture 19: Directory-Based Cache Protocols

Shared Symmetric Memory Systems

CS 152 Computer Architecture and Engineering

Computer Architecture

Computer Systems Architecture

CMSC 611: Advanced. Distributed & Shared Memory

Lecture 1: Introduction

Lecture 18: Coherence and Synchronization. Topics: directory-based coherence protocols, synchronization primitives (Sections

CMSC Computer Architecture Lecture 15: Memory Consistency and Synchronization. Prof. Yanjing Li University of Chicago

Aleksandar Milenkovic, Electrical and Computer Engineering University of Alabama in Huntsville

Cache Coherence (II) Instructor: Josep Torrellas CS533. Copyright Josep Torrellas

Thread- Level Parallelism. ECE 154B Dmitri Strukov

Computer Organization

EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy Prof. Sherief Reda School of Engineering Brown University

Shared memory. Caches, Cache coherence and Memory consistency models. Diego Fabregat-Traver and Prof. Paolo Bientinesi WS15/16

Portland State University ECE 588/688. Directory-Based Cache Coherence Protocols

Data-Centric Consistency Models. The general organization of a logical data store, physically distributed and replicated across multiple processes.

CMSC 611: Advanced Computer Architecture

Lecture 2: Snooping and Directory Protocols. Topics: Snooping wrap-up and directory implementations

Page 1. Cache Coherence

CS/ECE 757: Advanced Computer Architecture II (Parallel Computer Architecture) Symmetric Multiprocessors Part 1 (Chapter 5)

Chapter 6. Parallel Processors from Client to Cloud Part 2 COMPUTER ORGANIZATION AND DESIGN. Homogeneous & Heterogeneous Multicore Architectures

Lecture 7: PCM Wrap-Up, Cache coherence. Topics: handling PCM errors and writes, cache coherence intro

Implementing caches. Example. Client. N. America. Client System + Caches. Asia. Client. Africa. Client. Client. Client. Client. Client.

Computer Architecture

CSE 5306 Distributed Systems

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM (PART 1)

4 Chip Multiprocessors (I) Chip Multiprocessors (ACS MPhil) Robert Mullins

Computer Systems Architecture

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM

Page 1. SMP Review. Multiprocessors. Bus Based Coherence. Bus Based Coherence. Characteristics. Cache coherence. Cache coherence

Lecture 8: Directory-Based Cache Coherence. Topics: scalable multiprocessor organizations, directory protocol design issues

Transcription:

1! 2! Suggested Readings! Readings!! H&P: Chapter 5.8! Could also look at material on CD referenced on p. 538 of your text! Lecture 27" Cache Coherency! 3! Processor components! Multicore processors and programming! What makes a memory system coherent?! Processor comparison! vs.! 4! Program order! Sequential writes! Causality! Explain & articulate why modern microprocessors now have more than one core and how SW must adapt. " CSE 30321! Writing more! efficient code! The right HW for the right application! HLL code translation! Part A!

Coherency and Caches! One centralized shared cache / memory is not practical!! Data must be cached locally! Consider the following!! 1 node works with data no other node uses! Why not cache it?!! If data is frequently modified, do we always tell everyone else?!! What if data is cached and read by 2 nodes and now one of them wants to do a write?! How is this handled?!!! 5! Maintaining Cache Coherence! Hardware schemes!! Shared Caches! Trivially enforces coherence! Not scalable (L1 cache quickly becomes a bottleneck)!! Snooping! Needs a broadcast network (like a bus) to enforce coherence! Each cache that has a block tracks its sharing state on its own!! Directory! Can enforce coherence even with a point-to-point network! A block has just one place where its full sharing state is kept! Bus! State Tag How Snooping Works! CPU! Data! Often 2 sets of tags why?" CPU references check cache tags (as usual)! Cache misses filled from memory (as usual)!!!+! Other read/write on bus must check tags, too, and possibly invalidate! 7! Update vs. Invalidate! A burst of writes by a processor to one address!! Update: each sends an update!! Invalidate: only the first invalidation is sent! Writes to different words of a block!! Update: update sent for each word!! Invalidate: only the first invalidation is sent! Producer-consumer communication latency!! Update: producer sends an update," consumer reads new value from its! Invalidate: producer invalidates consumer#s copy," consumer#s read misses and has to request the block! Which is better depends on application!! But write-invalidate usually wins! Part B! Part C!

Write invalidate example! Write update example! Processor Activity! Bus Activity! Contents of CPU A#s Assumes neither cache had value/location X in it 1 st! When 2 nd miss by B occurs, CPU A responds with value canceling response from memory.! Update B#s cache & memory contents of X updated! Typical and simple! CPU B#s memory location X! CPU A reads X! Cache miss for X! 0! 0! CPU B reads X! Cache miss for X! 0! 0! 0! CPU A writes a 1 to X! Invalidation for X! 1! 0! CPU B reads X! Cache miss for X! 1! 1! 1! 0! Processor Activity! Bus Activity! Contents of CPU A#s Assumes neither cache had value/location X in it 1 st! CPU and memory contents show value after processor and bus activity both completed! When CPU A broadcasts the write, cache in CPU B and memory location X are updated! CPU B#s memory location X! CPU A reads X! Cache miss for X! 0! 0! CPU B reads X! Cache miss for X! 0! 0! 0! CPU A writes a 1 to X! Write broadcast of X! 0! 1! 1! 1! CPU B reads X! 1! 1! 1! (Shaded parts are different than before)! M(E)SI Snoopy Protocols for $ coherency! State of block B in cache C can be!! Invalid: B is not cached in C! To read or write, must make a request on the bus!! Modified: B is dirty in C! C has the block, no other cache has the block," and C must update memory when it displaces B! Can read or write B without going to the bus!! Shared: B is clean in C! C has the block, other caches have the block," and C need not update memory when it displaces B! Can read B without going to bus! To write, must send an upgrade request to the bus!! Exclusive: B is exclusive to cache C! Can help to eliminate bus traffic! E state not absolutely necessary! See notes and board! MSI protocol! 12! Part D!

13! MESI protocol! Cache to Cache transfers! See notes and board! Problem!! P1 has block B in M state!! P2 wants to read B, puts a RdReq on bus!! If P1 does nothing, memory will supply the data to P2!! What does P1 do?! Solution 1: abort/retry!! P1 cancels P2#s request, issues a write back!! P2 later retries RdReq and gets data from memory!! Too slow (two memory latencies to move data from P1 to P2)! Solution 2: intervention!! P1 indicates it will supply the data ( intervention bus signal)!! Memory sees that, does not supply the data, and waits for P1#s data!! P1 starts sending the data on the bus, memory is updated!! P2 snoops the transfer during the write-back and gets the block! Part E! Part F! Directory-Based Coherence for DSM! Typically in distributed shared memory! For every local memory block," local directory has an entry! Directory entry indicates!! Who has cached copies of the block!! In what state do they have the block! Each entry has! Basic Directory Scheme!! One dirty bit (1 if there is a dirty cached copy)!! A presence vector (1 bit for each node)" Tells which nodes may have cached copies! All misses sent to block#s home! Directory does needed coherence actions! Eventually, directory responds with data!

Read Miss! Processor P k has a read miss on block B," sends request to home node of the block! Directory controller!! Finds entry for B, checks D bit!! If D=0! Read memory and send data back, set P[k]!! If D=1! Request block from processor whose P bit is 1! When block arrives, update memory, clear D bit," send block to P k and set P[k]! Directory Operation! Network controller connected to each bus!! A proxy for remote caches and memories! Requests for remote addresses forwarded to home," responses from home placed on the bus! Requests from home placed on the bus," cache responses sent back to home node! Each cache still has its own coherence state!! Directory is there just to avoid broadcasts" and order accesses to each location! Simplest scheme:" If access A1 to block B still not fully processed by directory" when A2 arrives, A2 waits in a queue until A1 is done! Shared Memory Performance! Another C for cache misses!! Still have Compulsory, Capacity, Conflict!! Now have Coherence, too! We had it in our cache and it was invalidated! Two sources for coherence misses!! True sharing! Different processors access the same data!! False sharing! Different processors access different data," but they happen to be in the same block!