Lecture-22 (Cache Coherence Protocols) CS422-Spring

Lecture-22 (Cache Coherence Protocols) CS422-Spring 2018 Biswa@CSE-IITK

Single Core
[Figure: Core 0 with a private L1 cache and a private L2, connected over a bus (packet scheduling) to DRAM.]
CS422: Spring 2018 Biswabandan Panda, CSE@IITK

Multicore
[Figure: Core 0, Core 1, ..., Core k, each with a private L1 cache and a private L2 cache, connected through an interconnect (packet scheduling) to a shared last-level cache (L3) and DRAM.]

Cache Coherence
[Figure sequence: two cores with private caches on a bus to LLC/DRAM. Each cached line is marked valid (V) or dirty (D). DRAM holds sum=0.]
1. P1 reads sum (rd &sum): P1's cache holds sum=0, V. DRAM: sum=0.
2. A second core reads sum: both caches hold sum=0, V.
3. P1 writes sum=3 (wr &sum): P1's copy becomes sum=3 and goes V->D. The second core still holds sum=0, V.
4. The second core writes sum=7: its copy becomes sum=7 and goes V->D. P1 still holds sum=3, D.
5. Both cores read sum and print it: P1 prints sum=3, the second core prints sum=7, and DRAM still holds sum=0. Err!
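The sequence above can be replayed with a small model. This is only a sketch under stated assumptions: the class `PrivateCache`, the names `c1` and `c2`, and the write-allocate, write-back behaviour are illustrative choices, not something the slides specify.

```python
# Hypothetical model: private write-back caches with only a valid/dirty
# flag per line and no coherence protocol between them.

class PrivateCache:
    def __init__(self, dram):
        self.dram = dram      # shared backing store: address -> value
        self.lines = {}       # address -> [value, dirty_flag]

    def read(self, addr):
        if addr not in self.lines:                    # miss: fetch a clean (V) copy
            self.lines[addr] = [self.dram[addr], False]
        return self.lines[addr][0]

    def write(self, addr, value):
        self.read(addr)                               # write-allocate (assumed)
        self.lines[addr] = [value, True]              # V -> D; DRAM is not updated


dram = {"sum": 0}
c1, c2 = PrivateCache(dram), PrivateCache(dram)

c1.read("sum")        # first core reads sum
c2.read("sum")        # second core reads sum
c1.write("sum", 3)    # first core writes 3
c2.write("sum", 7)    # second core writes 7

# Each core sees only its own write, and DRAM sees neither.
print(c1.read("sum"), c2.read("sum"), dram["sum"])    # 3 7 0
```

Nothing in this model ever tells one cache about the other's write, which is exactly the gap a coherence protocol has to close.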

The Problem
If multiple cores cache the same block, how do they ensure they all see a consistent state?
Solution:
1. Write Propagation: All writes eventually become visible to other cores.
2. Write Serialization: All cores see writes to a cache line in the same order.
We need a protocol that ensures (1) and (2); it is called a cache coherence protocol.

Non-solutions
Keeping caches coherent is software's responsibility
(+) Makes the microarchitect's life (mine) easier
(-) Makes the average programmer's life (yours) much harder

Non-solutions
All caches are shared between all processors
(+) No need for coherence
(-) Shared cache becomes the bandwidth bottleneck
(-) Scalability: 100 monkeys, 1 banana

Who Is Responsible?
(1) ISA: What if the ISA provided cache flush/invalidate instructions, such as FLUSH-LOCAL, FLUSH-GLOBAL, FLUSH-CACHE?
(2) Hardware: Invalidate all other copies of a block, say A, when a core writes to it. Simplifies your job.

Invalidate vs Update
Invalidate: Write to a cache line, and simultaneously broadcast an invalidation of the address to all others. Other cores clear/invalidate their cache lines.
Update: Write to a cache line, and simultaneously broadcast the written data to all others. Other cores update their caches if the data was present.
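The difference between the two policies can be sketched with caches modelled as plain dictionaries. The function names `write_invalidate` and `write_update` are hypothetical, and the broadcast is reduced to a loop over the other caches.

```python
# Sketch only: each cache is a dict mapping address -> value.

def write_invalidate(caches, writer, addr, value):
    """Writer keeps the new value; every other copy is dropped."""
    for i, cache in enumerate(caches):
        if i != writer:
            cache.pop(addr, None)          # clear/invalidate the line if present
    caches[writer][addr] = value


def write_update(caches, writer, addr, value):
    """Writer keeps the new value; other caches refresh an existing copy."""
    for i, cache in enumerate(caches):
        if i == writer or addr in cache:   # only caches that already hold the data
            cache[addr] = value


inv = [{"A": 1}, {"A": 1}, {}]
write_invalidate(inv, 0, "A", 2)
print(inv)   # [{'A': 2}, {}, {}]

upd = [{"A": 1}, {"A": 1}, {}]
write_update(upd, 0, "A", 2)
print(upd)   # [{'A': 2}, {'A': 2}, {}]
```

With invalidation the second cache takes a miss on its next read; with update it keeps hitting, at the cost of sending data on every write.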

MSI Coherence Protocol
Extend the single valid bit per line to three states:
M(odified): present in only one cache; memory not updated.
S(hared): present in one or more caches; memory copy is up to date.
I(nvalid): not present.
Read: read request (BusRd) on the bus, transitions to S.
Write: ReadEx request (BusRdX), transitions to M state.

State Transitions: Core Side
(notation: processor event / bus request issued)
M: PrRd/- and PrWr/- stay in M
S: PrRd/- stays in S; PrWr/BusRdX goes to M
I: PrRd/BusRd goes to S; PrWr/BusRdX goes to M

State Transitions: Bus Side
(notation: snooped bus request / action taken)
M: BusRd/Flush goes to S; BusRdX/Flush goes to I
S: BusRd/- stays in S; BusRdX/- goes to I
I: BusRd/- and BusRdX/- stay in I
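One way to read the two MSI diagrams is as lookup tables. The sketch below assumes the standard three-state MSI transitions; the names `MSI_CORE`, `MSI_BUS`, `core_access`, and `snoop` are illustrative, not taken from the slides.

```python
# (current state, processor event) -> (next state, bus request or None)
MSI_CORE = {
    ("M", "PrRd"): ("M", None),
    ("M", "PrWr"): ("M", None),
    ("S", "PrRd"): ("S", None),
    ("S", "PrWr"): ("M", "BusRdX"),
    ("I", "PrRd"): ("S", "BusRd"),
    ("I", "PrWr"): ("M", "BusRdX"),
}

# (current state, snooped bus request) -> (next state, action or None)
MSI_BUS = {
    ("M", "BusRd"): ("S", "Flush"),
    ("M", "BusRdX"): ("I", "Flush"),
    ("S", "BusRd"): ("S", None),
    ("S", "BusRdX"): ("I", None),
    ("I", "BusRd"): ("I", None),
    ("I", "BusRdX"): ("I", None),
}


def core_access(state, event):
    """Transition taken by the cache whose own core issues the access."""
    return MSI_CORE[(state, event)]


def snoop(state, bus_request):
    """Transition taken by a cache that observes another cache's request."""
    return MSI_BUS[(state, bus_request)]


print(core_access("I", "PrWr"))   # ('M', 'BusRdX')
print(snoop("M", "BusRd"))        # ('S', 'Flush')
```

Only a cache in M ever answers with Flush, because it is the only one holding data that memory does not have.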

MSI: An Example
[Figure sequence: each cache line holds state bits, a tag, and data (bits 31..0). Three caches, each with a snooper, share a bus to DRAM, which initially holds X=1. The caches are taken to be P1, P2, and P3, matching the R1/W1/R3/W3/R2 actions of the trace table.]
1. P1 reads X (rd &X): BusRd. P1 gets X=1 in state S.
2. P1 writes X=2 (wr &X): P1's copy goes S->M. DRAM still holds X=1.
3. P3 reads X (rd &X): BusRd. P1's snooper responds with Flush and P1 goes M->S. The DRAM read is cancelled, DRAM is updated to X=2, and P3 gets X=2 in state S.
4. P3 writes X=3 (wr &X): BusRdX. P1 goes S->I and P3 goes S->M. The LLC still holds X=2.
5. P1 reads X (rd &X): BusRd. P3 responds with Flush and goes M->S. P1 goes I->S with X=3. DRAM is updated to X=3 as well.
6. P3 reads X (rd &X): hit in state S, no bus transaction. DRAM: X=3.

MSI: Example Trace (states to be filled in)
Proc Action   State P1   State P2   State P3   Bus Action   Data From
R1            -          -          -          BusRd        Mem
W1            -          -          -          BusRdX       Mem
R3            -          -          -          BusRd        P1 cache
W3            -          -          -          BusRdX       Mem
R1            -          -          -          BusRd        P3 cache
R3            -          -          -          -            -
R2            -          -          -          BusRd        Mem

MSI: Example Trace (completed)
Proc Action   State P1   State P2   State P3   Bus Action   Data From
R1            S          -          -          BusRd        Mem
W1            M          -          -          BusRdX       Mem
R3            S          -          S          BusRd        P1 cache
W3            I          -          M          BusRdX       Mem
R1            S          -          S          BusRd        P3 cache
R3            S          -          S          -            -
R2            S          S          S          BusRd        Mem
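The trace table can be reproduced mechanically. The sketch below is a minimal model of one block on a snooping MSI bus; it assumes the three caches are P1, P2, and P3, and the function name `msi_trace` and the row layout are hypothetical. A "-" state means the block was never cached, which behaves the same as I.

```python
INVALID = ("-", "I")   # "-" = never cached, "I" = invalidated


def msi_trace(ops, n_procs=3):
    """Replay ops such as 'R1' or 'W3'; return one table row per op."""
    state = ["-"] * n_procs                 # per-cache state of the single block
    rows = []
    for op in ops:
        kind, p = op[0], int(op[1:]) - 1
        bus, source = "-", "-"
        if kind == "R" and state[p] in INVALID:
            bus = "BusRd"                   # read miss
        elif kind == "W" and state[p] != "M":
            bus = "BusRdX"                  # write without ownership
        if bus != "-":
            # A cache holding the block in M supplies it (Flush); else memory does.
            source = f"P{state.index('M') + 1} cache" if "M" in state else "Mem"
            for q in range(n_procs):
                if q == p or state[q] in INVALID:
                    continue
                if bus == "BusRdX":
                    state[q] = "I"          # every other copy is invalidated
                elif state[q] == "M":
                    state[q] = "S"          # owner downgrades after flushing
            state[p] = "S" if bus == "BusRd" else "M"
        rows.append((op, *state, bus, source))
    return rows


msi_rows = msi_trace(["R1", "W1", "R3", "W3", "R1", "R3", "R2"])
for row in msi_rows:
    print(row)
```

Each printed tuple corresponds to one row of the table: action, the three states, the bus action, and where the data came from.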

1. Write Propagation: All writes eventually become visible to other cores. In MSI: through invalidation, and Flush on subsequent BusRds.
2. Write Serialization: All cores see writes to a cache line in the same order. In MSI: writes (BusRdX) that go to the bus appear in bus order, and are handled by snoopers in bus order!

Issues
A Rd, Wr sequence incurs 2 bus transactions even when no one else is sharing the block (e.g., a serial program!): BusRd (I->S) followed by BusRdX (S->M).
In general, penalizing serial programs is unacceptable.

Solution - MESI
Add an Exclusive state:
Invalid
Modified (dirty)
Shared (two or more caches may have copies)
Exclusive (only this cache has a clean copy, same value as in memory)
How to decide I->E or I->S? Need to check whether someone else has a copy: a separate signal on the bus, a wired-OR line asserted in response to BusRd. Call it C = copy exists.

State Transitions: Processor Level
(notation: processor event / bus request issued)
M: PrRd/- and PrWr/- stay in M
E: PrRd/- stays in E; PrWr/- goes to M
S: PrRd/- stays in S; PrWr/BusUpgr goes to M
I: PrRd/BusRd(!C) goes to E; PrRd/BusRd(C) goes to S; PrWr/BusRdX goes to M

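State Transitions: Bus Level
(notation: snooped bus request / action taken)
M: BusRd/Flush goes to S; BusRdX/Flush goes to I
E: BusRd/FlushOpt goes to S; BusRdX/FlushOpt goes to I
S: BusRd/FlushOpt stays in S; BusRdX/FlushOpt goes to I; BusUpgr/- goes to I
I: BusRd/-, BusRdX/-, and BusUpgr/- stay in I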

MESI: An Example
[Figure sequence: three caches, each with a snooper, share a bus to the memory controller (Mem Ctrl) and main memory, which initially holds X=1. The caches are taken to be P1, P2, and P3, matching the R1/W1/R3/W3/R2 actions of the trace table.]
1. P1 reads X (rd &X): BusRd. No other cache has a copy, so P1 gets X=1 in state E.
2. P1 writes X=2 (wr &X): P1's copy goes E->M with no bus transaction. Memory still holds X=1.
3. P3 reads X (rd &X): BusRd. P1 responds with Flush and goes M->S. P3 gets X=2 in state S, and memory is updated to X=2.
4. P3 writes X=3 (wr &X): BusUpgr. P1 goes S->I and P3 goes S->M. Memory still holds X=2. Note: BusUpgr instead of BusRdX. Mem Ctrl knows it can ignore the request.
5. P1 reads X (rd &X): BusRd. P3 goes M->S and P1 goes I->S with X=3. Memory is updated to X=3.
6. P3 reads X (rd &X): hit in state S, no bus transaction.
7. P2 reads X (rd &X): BusRd, answered with FlushOpt. All three caches hold X=3 in state S. Referred to as cache-to-cache transfer in the Illinois MESI protocol.

MESI: Example Trace
Proc Action   State P1   State P2   State P3   Bus Action       Data From
R1            E          -          -          BusRd            Mem
W1            M          -          -          -                -
R3            S          -          S          BusRd/Flush      P1's cache
W3            I          -          M          BusUpgr          -
R1            S          -          S          BusRd/Flush      P3's cache
R3            S          -          S          -                -
R2            S          S          S          BusRd/FlushOpt   P1/P3's cache
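The MESI trace can be replayed the same way. This is a sketch under stated assumptions: the three caches are taken to be P1, P2, and P3, the wired-OR line C is modelled as "some other cache holds a valid copy", and the name `mesi_trace` and the row layout are hypothetical. A "-" state means the block was never cached, which behaves the same as I.

```python
INVALID = ("-", "I")   # "-" = never cached, "I" = invalidated


def mesi_trace(ops, n_procs=3):
    """Replay ops such as 'R1' or 'W3'; return one table row per op."""
    state = ["-"] * n_procs
    rows = []
    for op in ops:
        kind, p = op[0], int(op[1:]) - 1
        # Other caches with a valid copy; a non-empty list means C is asserted.
        sharers = [q for q in range(n_procs)
                   if q != p and state[q] not in INVALID]
        bus, source = "-", "-"
        miss = state[p] in INVALID
        if miss:
            bus = "BusRd" if kind == "R" else "BusRdX"
            if sharers:
                dirty = any(state[q] == "M" for q in sharers)
                bus += "/Flush" if dirty else "/FlushOpt"
                source = "/".join(f"P{q + 1}" for q in sharers) + " cache"
            else:
                source = "Mem"
        elif kind == "W" and state[p] == "S":
            bus = "BusUpgr"                 # data already present; invalidate others
        if kind == "R":
            if miss:
                state[p] = "S" if sharers else "E"   # I->S if C, I->E if !C
                for q in sharers:
                    state[q] = "S"
        else:
            for q in sharers:
                state[q] = "I"
            state[p] = "M"                  # E->M and M->M need no bus transaction
        rows.append((op, *state, bus, source))
    return rows


mesi_rows = mesi_trace(["R1", "W1", "R3", "W3", "R1", "R3", "R2"])
for row in mesi_rows:
    print(row)
```

Compared with MSI, the first read-then-write pair by P1 costs one bus transaction instead of two, because the write finds the line in E and upgrades silently.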

Coherence misses: misses due to accesses to blocks that were invalidated by coherence events.
True sharing: misses that occur when multiple threads share the same data item (byte/word).
False sharing: misses that occur when multiple threads share different data items located in a single cache block.
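Whether two accesses can interfere comes down to address arithmetic. The sketch below assumes a 64-byte cache block and treats "same data item" as "same byte address"; both are illustrative assumptions, as are the names `BLOCK_SIZE`, `block_of`, and `classify`.

```python
BLOCK_SIZE = 64   # bytes per cache block (assumed, not from the slides)


def block_of(addr):
    """Index of the cache block that contains this byte address."""
    return addr // BLOCK_SIZE


def classify(addr_a, addr_b):
    """Kind of sharing when two threads write addr_a and addr_b."""
    if addr_a == addr_b:
        return "true sharing"
    if block_of(addr_a) == block_of(addr_b):
        return "false sharing"          # different items, same block
    return "no sharing"


print(classify(0x1000, 0x1000))   # true sharing
print(classify(0x1000, 0x1008))   # false sharing
print(classify(0x1000, 0x1040))   # no sharing
```

The last call shows the usual fix: placing the second item one full block away (padding) moves it into a different block, so writes to it no longer invalidate the first.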

Parameters               True Sharing Misses   False Sharing Misses
Larger cache size        increased             increased
Larger block/line size   decreased             increased
Larger associativity     unclear               unclear

Directory Based
[Figure: five processors (P), each with a private cache ($), connected through an interconnection network to the LLC. Each block C(k), C(k+1), ..., C(k+j) in the LLC has a directory entry shown as a bit vector.]
1 modified bit for each cache block in LLC/memory.
1 presence bit for each core, for each cache block in LLC/memory.
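A directory entry of this shape can be sketched as a small class. The slide only describes the bits; the `read` and `write` behaviour below (write back and share on a read, invalidate all other sharers on a write) is an assumed invalidation-based policy, and the name `DirectoryEntry` is hypothetical.

```python
class DirectoryEntry:
    """Per-block directory state: 1 modified bit plus 1 presence bit per core."""

    def __init__(self, n_cores):
        self.modified = False
        self.presence = [False] * n_cores

    def read(self, core):
        # Assumed policy: any modified owner writes back first, so the block
        # is clean afterwards and the reader is recorded as a sharer.
        self.modified = False
        self.presence[core] = True

    def write(self, core):
        """Record core as sole owner; return the cores that must be invalidated."""
        invalidated = [c for c, present in enumerate(self.presence)
                       if present and c != core]
        self.presence = [c == core for c in range(len(self.presence))]
        self.modified = True
        return invalidated


entry = DirectoryEntry(n_cores=5)
entry.read(1)
entry.read(2)
targets = entry.write(4)
print(targets)          # [1, 2]
print(entry.presence)   # [False, False, False, False, True]
```

Unlike snooping, the invalidations go only to the cores whose presence bits are set, so no broadcast is needed.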