CS252 Spring 2017 Graduate Computer Architecture. Lecture 12: Cache Coherence


CS252 Spring 2017 Graduate Computer Architecture
Lecture 12: Cache Coherence
Lisa Wu, Krste Asanovic
http://inst.eecs.berkeley.edu/~cs252/sp17
WU UCB CS252 SP17

Last Time in Lecture 11: Memory Systems
- DRAM design/packaging
- Uniprocessor cache design
  - Capacity, associativity, line size
  - 3 C's: Compulsory, Capacity, Conflict
- Multilevel caches
- Prefetching

Review: Store Write Policies
Cache hit:
- Write-Through: writes to both cache and memory
- Write-Back: writes to cache and waits until later to write to memory (i.e., when the line is evicted)
- No-Write: invalidates the cache line and writes to memory directly
Cache miss:
- Write-Allocate: allocates a line in the cache for the data (puts the store in the cache)
- Write-No-Allocate: writes directly to memory without allocating a line in the cache
Material inspired by Hakim Weatherspoon, Cornell University, Spring 2013

Review: Cache Policies (Write-Through vs. Write-Back)
- Write-through is slower, but memory is always consistent
  - Allows updating only the modified portion of a cache line, since memory always has the most up-to-date copy
  - Evictions do not need to write to memory (they do with a write-back policy)
- Write-back is faster, but more complicated when multiple cores share memory
  - Updates must happen at cache-line granularity: there is only one dirty bit per cache line, so memory cannot know which words/bytes are dirty
- Both policies read an entire cache line on a cache miss
Material inspired by Hakim Weatherspoon, Cornell University, Spring 2013
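The contrast between the two policies can be sketched as a toy model of a single cache line. All names here (line_t, store_write_through, evict) are illustrative, not from the lecture; the point is that write-through keeps memory consistent per word, while write-back defers a full-line flush behind one dirty bit.

```c
/* Toy model of write-through vs. write-back for one cache line.
 * Hypothetical names; a sketch of the policies, not real hardware. */
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define WORDS_PER_LINE 4

typedef struct {
    int  data[WORDS_PER_LINE];
    bool dirty;        /* one dirty bit for the whole line (write-back) */
} line_t;

static int memory[WORDS_PER_LINE];  /* backing memory for this line */

/* Write-through: update cache AND memory; memory stays consistent,
 * and only the modified word needs to reach memory. */
static void store_write_through(line_t *c, int word, int value) {
    c->data[word] = value;
    memory[word]  = value;
}

/* Write-back: update the cache only and mark the whole line dirty;
 * memory no longer knows which word changed. */
static void store_write_back(line_t *c, int word, int value) {
    c->data[word] = value;
    c->dirty = true;
}

/* Eviction: only a dirty line must flush, at full-line granularity. */
static void evict(line_t *c) {
    if (c->dirty) {
        memcpy(memory, c->data, sizeof memory);
        c->dirty = false;
    }
}
```

Note how `evict` copies the entire line: with a single dirty bit, write-back has no choice but line-granularity updates, exactly as the slide states.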

What Does Coherency Mean?
- Informally: any read must return the most recent write
  - Too strict and very difficult to implement
- Better: any write must eventually be seen by a read, and all writes are seen in proper order ("serialization")
- Two rules to ensure this:
  - If P writes x and P1 reads it, P's write will be seen by P1 if the read and write are sufficiently far apart
  - Writes to a single location are serialized: seen in one order
    - The latest write will be seen
    - Otherwise reads could see writes in illogical order (an older value after a newer value)
Dave Patterson, CS252, Fall 1996

Potential Solutions
- Snooping solution (snoopy bus):
  - Send all requests for data to all processors
  - Processors snoop to see if they have a copy and respond accordingly
  - Requires broadcast, since caching information is at the processors
  - Works well with a bus (natural broadcast medium)
  - Dominates for small-scale machines (most of the market)
- Directory-based schemes:
  - Keep track of what is being shared in one centralized place
  - Distributed memory => distributed directory (avoids bottlenecks)
  - Send point-to-point requests to processors
  - Scales better than snooping
  - Actually existed BEFORE snoop-based schemes
Dave Patterson, CS252, Fall 1996

Shared Memory Multiprocessor
[Figure: three CPUs, each with a snoopy cache, share a memory bus with main memory (DRAM) and DMA devices (disk, network).]
- Use the snoopy mechanism to keep all processors' view of memory coherent

Basic Snoopy Protocols
- Write-invalidate protocol: multiple readers, single writer
  - Write to shared data: an invalidate is sent to all caches, which snoop and invalidate any copies
  - Read miss:
    - Write-through: memory is always up-to-date
    - Write-back: snoop in caches to find the most recent copy
- Write-broadcast protocol:
  - Write to shared data: broadcast on the bus; processors snoop and update their copies
  - Read miss: memory is always up-to-date
- Write serialization: the bus serializes requests
  - The bus is the single point of arbitration
Dave Patterson, CS252, Fall 1996

Write Broadcast (Update) vs. Write Invalidate
Mikko Lipasti, University of Wisconsin, Spring 2009

Snoopy Cache (Goodman 1983)
- Idea: have the cache watch (or "snoop upon") other memory transactions, and then do the right thing
- Snoopy cache tags are dual-ported:
  - One port is used to drive the memory bus when the cache is bus master
  - A snoopy read port is attached to the memory bus
[Figure: the processor-side address/data/read-write port and the bus-side snoop port both access the tags-and-state array and the data lines.]

Snoopy Cache Coherence Protocols
- Write miss: the address is invalidated in all other caches before the write is performed
- Read miss: if a dirty copy is found in some cache, a write-back is performed before the memory is read

Cache State Transition Diagram: The MSI Protocol
Each cache line has state bits alongside its address tag:
- M: Modified
- S: Shared
- I: Invalid
Transitions for the cache state in processor P1:
- I -> M: write miss (P1 gets line from memory)
- I -> S: read miss (P1 gets line from memory)
- M -> M: P1 reads or writes
- M -> S: other processor reads (P1 writes back)
- M -> I: other processor intent to write (P1 writes back)
- S -> S: read by any processor
- S -> I: other processor intent to write

Two Processor Example (reading and writing the same cache line)
Reference sequence: P1 reads, P1 writes, P2 reads, P2 writes, P1 reads, P1 writes, P2 writes, P1 writes
P1's state transitions:
- I -> S: read miss
- S -> M: P1 intent to write
- M -> S: P2 reads, P1 writes back
- M -> I: P2 intent to write
- I -> M: write miss
- M -> M: P1 reads or writes
P2's state transitions (symmetric):
- I -> S: read miss
- S -> M: P2 intent to write
- M -> S: P1 reads, P2 writes back
- M -> I: P1 intent to write
- I -> M: write miss
- M -> M: P2 reads or writes

Observation
- If a line is in the M state, then no other cache can have a copy of the line!
- Memory stays coherent: multiple differing copies cannot exist

MSI State Transition Diagram
Notation: A / B means that if action A is observed by the cache controller, action B is taken.
Processor-initiated transactions:
- M -> M: PrRd / --, PrWr / --
- S -> M: PrWr / BusRdX
- S -> S: PrRd / --
- I -> S: PrRd / BusRd
- I -> M: PrWr / BusRdX
Bus-initiated (snooped) transactions:
- M -> S: BusRd / flush
- M -> I: BusRdX / flush
- S -> S: BusRd / --
- S -> I: BusRdX / --
Alternative state names: E (exclusive, read/write access), S (potentially shared, read-only access), I (invalid, no access)
(CMU 15-418, Spring 2012)
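The transitions above can be sketched as a pair of C transition functions for one line in one cache: one for processor-initiated events and one for snooped bus transactions. Type and function names (msi_state, on_proc, on_snoop) are illustrative, not from the lecture.

```c
/* Minimal sketch of the MSI transition function from the diagram above.
 * Hypothetical names; a model of the protocol, not a hardware design. */
#include <assert.h>

typedef enum { I, S, M } msi_state;
typedef enum { PrRd, PrWr } proc_event;
typedef enum { BUS_NONE, BUS_RD, BUS_RDX } bus_op;

/* Processor-initiated side: implements "PrRd / BusRd" etc.
 * Writes the bus transaction to issue (if any) through *issue. */
static msi_state on_proc(msi_state st, proc_event ev, bus_op *issue) {
    *issue = BUS_NONE;
    switch (st) {
    case M:                                 /* PrRd / --, PrWr / -- */
        return M;
    case S:
        if (ev == PrWr) { *issue = BUS_RDX; return M; }
        return S;                           /* PrRd / -- */
    case I:
        *issue = (ev == PrRd) ? BUS_RD : BUS_RDX;
        return (ev == PrRd) ? S : M;
    }
    return st;
}

/* Snooped (bus-initiated) side: *flush reports a write-back of dirty data. */
static msi_state on_snoop(msi_state st, bus_op op, int *flush) {
    *flush = (st == M);                     /* M flushes on BusRd/BusRdX */
    if (op == BUS_RDX) return I;            /* another cache wants to write */
    if (op == BUS_RD && st == M) return S;  /* downgrade and supply data */
    return st;
}
```

Driving these two functions with the two-processor reference sequence from the earlier slide reproduces that example's state trace.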

MESI Invalidation Protocol
- MSI requires two bus transactions for the common case of reading data, then writing to it:
  - Transaction 1: BusRd to move from I to S state
  - Transaction 2: BusRdX to move from S to M state
- This inefficiency exists even if the application has no sharing at all
- Solution: add an additional state E ("exclusive clean")
  - Line not modified, but only this cache has a copy
  - Decouples exclusivity from line ownership (line not dirty, so the copy in memory is a valid copy of the data)
  - Upgrade from E to M does not require a bus transaction
(CMU 15-418, Spring 2012)

MESI: An Enhanced MSI Protocol (increased performance for private data)
Each cache line has state bits alongside its address tag:
- M: Modified Exclusive
- E: Exclusive but unmodified
- S: Shared
- I: Invalid
Transitions for the cache state in processor P1:
- I -> M: write miss
- I -> E: read miss, not shared
- I -> S: read miss, shared
- E -> M: P1 write (no bus transaction)
- S -> M: P1 intent to write
- M -> M: P1 write or read
- M -> S: other processor reads, P1 writes back
- M -> I: other processor intent to write, P1 writes back
- E -> E: P1 read
- E -> S: other processor reads
- E -> I: other processor intent to write
- S -> S: read by any processor
- S -> I: other processor intent to write
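The key MESI refinement can be sketched in a few lines of C: on a read miss the fill state depends on whether any other cache asserts a shared signal on the bus, and a later local write then upgrades E to M silently. The names (fill_on_read_miss, shared_wire) are assumptions for illustration.

```c
/* Sketch of the MESI read-miss fill and write-upgrade decisions.
 * Hypothetical names; the "shared wire" models other caches snooping
 * the fill request and asserting that they also hold the line. */
#include <assert.h>
#include <stdbool.h>

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_state;

/* Read miss: allocate in E when no other cache holds the line ("read
 * miss, not shared"), otherwise in S ("read miss, shared"). */
static mesi_state fill_on_read_miss(bool shared_wire) {
    return shared_wire ? SHARED : EXCLUSIVE;
}

/* Local write: E upgrades to M with no bus transaction; S (or I)
 * must first issue BusRdX to invalidate other copies. */
static mesi_state on_write(mesi_state st, bool *needs_bus_rdx) {
    *needs_bus_rdx = (st == SHARED || st == INVALID);
    return MODIFIED;
}
```

This captures why MESI helps private data: a read-then-write to an unshared line costs one bus transaction (the fill) instead of MSI's two.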

Implementation Complications
- Write races: a cache cannot update until the bus is obtained
  - Otherwise, another processor may get the bus first and write the same cache block
- Two-step process:
  - Arbitrate for the bus
  - Place the miss on the bus and complete the operation
- If a miss occurs to the block while waiting for the bus, handle the miss (an invalidate may be needed) and then restart
- Split-transaction bus:
  - A bus transaction is not atomic: there can be multiple outstanding transactions for a block
  - Multiple misses can interleave, allowing two caches to grab the block in the Exclusive state
  - Must track and prevent multiple misses for one block
- Must support interventions and invalidations
Dave Patterson, CS252, Fall 1996

Implementing Snooping Caches
- The bus serializes writes; getting the bus ensures no one else can perform a memory operation
- On a miss in a write-back cache, another cache may have the desired copy and it may be dirty, so that cache must reply
- Add an extra state bit per cache line to record whether it is shared
- Since every bus transaction checks cache tags, snooping could interfere with the CPU just to check:
  - Solution 1: a duplicate set of tags for the L1 caches, to allow checks in parallel with the CPU
  - Solution 2: an L2 cache that obeys inclusion with the L1 cache
Dave Patterson, CS252, Fall 1996

Optimized Snoop with Level-2 Caches
[Figure: four CPUs, each with a private L1 and L2 cache; a snooper sits between each L2 and the bus.]
- Processors often have two-level caches: small L1, large L2 (usually both on chip now)
- Inclusion property: entries in L1 must also be in L2
  - An invalidation in L2 forces an invalidation in L1
- Snooping on L2 does not affect CPU-L1 bandwidth

Intervention
[Figure: cache-1 holds A = 200 (modified); memory holds the stale value A = 100; cache-2 misses on A.]
- When a read miss for A occurs in cache-2, a read request for A is placed on the bus
- Cache-1 needs to supply the data and change its state to Shared
- The memory may respond to the request also! Does memory know it has stale data?
- Cache-1 needs to intervene through the memory controller to supply the correct data to cache-2

False Sharing
[Figure: a cache line holds state bits, a line address tag, and multiple data words data0, data1, ..., dataN.]
- A cache line contains more than one word
- Cache coherence is done at line granularity, not word granularity
- Suppose M1 writes word i and M2 writes word k, and both words have the same line address. What can happen?

Performance of Symmetric Multiprocessors (SMPs)
Cache performance is a combination of:
- Uniprocessor cache miss traffic
- Traffic caused by communication
  - Results in invalidations and subsequent cache misses
- Coherence misses
  - Sometimes called a "communication miss"
  - The 4th C of cache misses, along with Compulsory, Capacity, and Conflict

Coherency Misses
- True sharing misses arise from the communication of data through the cache coherence mechanism
  - Invalidates due to the first write to a shared line
  - Reads by another CPU of a modified line in a different cache
  - The miss would still occur if the line size were 1 word
- False sharing misses occur when a line is invalidated because some word in the line, other than the one being read, is written
  - The invalidation does not cause a new value to be communicated, but only causes an extra cache miss
  - The line is shared, but no word in the line is actually shared => the miss would not occur if the line size were 1 word

False Sharing Example
- CPU 0 writes the red word: CPU 0 updates its line state from S to M because the line was Shared, invalidating the copy in CPU 1
- CPU 1 reads the blue word: CPU 1 takes a read miss because the line was invalidated (a FALSE SHARING miss)
https://software.intel.com/en-us/articles/avoiding-and-identifying-false-sharing-among-threads
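The standard software mitigation for this situation is to pad or align per-thread data so that each item sits on its own cache line. A minimal sketch, assuming a 64-byte line size (the common case; the type names are illustrative):

```c
/* False-sharing mitigation sketch: give each per-thread counter its
 * own 64-byte cache line via C11 alignment. The 64-byte line size is
 * an assumption; query the target machine in portable code. */
#include <assert.h>
#include <stdalign.h>

#define LINE_SIZE 64

/* Unpadded: adjacent counters share one line, so writes by different
 * threads ping-pong the line between caches (false sharing misses). */
typedef struct { long count; } counter_bad;

/* Aligned: each counter occupies a full line by itself, so writes by
 * different threads touch different lines and never conflict. */
typedef struct { alignas(LINE_SIZE) long count; } counter_good;
```

The aligned version trades memory (64 bytes per counter instead of 8) for the elimination of false sharing misses, which is almost always the right trade for hot, frequently written per-thread data.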

MP Performance, 4 Processors
Commercial workload: OLTP, decision support (database), search engine
- Uniprocessor cache misses (instruction, capacity/conflict, compulsory) improve as cache size increases
- True sharing and false sharing misses are unchanged going from a 1 MB to an 8 MB L3 cache
[Figure: memory cycles per instruction vs. L3 cache size (1 MB, 2 MB, 4 MB, 8 MB), broken down into instruction, capacity/conflict, cold, false sharing, and true sharing misses.]

MP Performance, 2 MB Cache
Commercial workload: OLTP, decision support (database), search engine
- True sharing and false sharing misses increase going from 1 to 8 CPUs
[Figure: memory cycles per instruction vs. processor count (1, 2, 4, 6, 8), broken down into instruction, conflict/capacity, cold, false sharing, and true sharing misses.]

Scaling Snoopy/Broadcast Coherence
- When any processor takes a miss, it must probe every other cache
- Scaling up to more processors is limited by:
  - Communication bandwidth over the bus
  - Snoop bandwidth into the tags
- Bandwidth can be improved by using multiple interleaved buses with interleaved tag banks
  - E.g., two bits of the address pick which of four buses and four tag banks to use (e.g., bits 7:6 of the address pick the bus/tag bank, bits 5:0 pick the byte in a 64-byte line)
- Buses don't scale to a large number of connections, so a point-to-point network can be used for larger numbers of nodes, but scaling is then limited by tag bandwidth when broadcasting snoop requests
- Insight: most snoops fail to find a match!
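The interleaving arithmetic above is just bit extraction. A sketch, following the slide's example of four buses/tag banks and 64-byte lines (function names are illustrative):

```c
/* Address interleaving sketch: with 64-byte lines, bits 5:0 select the
 * byte within the line and bits 7:6 select one of four buses/tag banks,
 * so consecutive lines rotate across the banks. */
#include <assert.h>
#include <stdint.h>

static unsigned bus_index(uint64_t addr)    { return (addr >> 6) & 0x3; }
static unsigned byte_in_line(uint64_t addr) { return addr & 0x3F; }
```

Because the bank-select bits sit just above the line-offset bits, sequential accesses stream across all four buses, spreading both bus traffic and tag-lookup (snoop) bandwidth evenly.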

Acknowledgements
This course is partly inspired by previous MIT 6.823 and Berkeley CS252 computer architecture courses created by my collaborators and colleagues:
- Arvind (MIT)
- Joel Emer (Intel/MIT)
- James Hoe (CMU)
- John Kubiatowicz (UCB)
- David Patterson (UCB)
Online material from Cornell University, University of Wisconsin, and CMU