
CS 152 Computer Architecture and Engineering Lecture 18: Snoopy Caches Dr. George Michelogiannakis EECS, University of California at Berkeley CRD, Lawrence Berkeley National Laboratory http://inst.eecs.berkeley.edu/~cs152!

Administrivia: Lab 4 due now. PS 5 due next week Wednesday (the 20th). Lecture 20 will be on datacenters. Quiz 4 and PS 3 will be returned tomorrow. 2

Last time in Lecture 17: Two kinds of synchronization between processors. Producer-Consumer: the consumer must wait until the producer has produced the value; the software version of a read-after-write hazard. Mutual Exclusion: only one processor can be in a critical section at a time; the critical section guards shared data that can be written. Producer-consumer synchronization is implementable with just loads and stores, but you need to know the ISA's memory model! Mutual exclusion can also be implemented with loads and stores, but it is tricky and slow, so ISAs add atomic read-modify-write instructions to implement locks. 3

Sequential Consistency. Cost: prevents aggressive compiler reordering optimizations; constrains hardware utilization (e.g., store buffer). 4

Sequential Consistency. Sequential consistency imposes more memory ordering constraints than those imposed by uniprocessor program dependencies. What are these in our example? T1: Store (X), 1 (X = 1); Store (Y), 11 (Y = 11). T2: Load R1, (Y); Store (Y'), R1 (Y' = Y); Load R2, (X); Store (X'), R2 (X' = X). The cross-thread orderings between T1's stores and T2's loads are the additional SC requirements. 5
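
A minimal sketch of the same two threads with C11 atomics (not the lecture's code; the thread functions, pthread usage, and the omission of T2's stores to Y' and X' are my assumptions). With sequentially consistent atomics, the outcome r1 == 11 and r2 == 0 is forbidden, which is exactly the extra cross-thread constraint SC adds beyond uniprocessor dependencies.

    /* SC example sketch: X and Y start at 0 and 10 as in the slides. */
    #include <stdatomic.h>
    #include <pthread.h>

    static _Atomic int X = 0, Y = 10;
    static int r1, r2;

    static void *t1(void *arg) {
        atomic_store(&X, 1);      /* Store (X), 1  */
        atomic_store(&Y, 11);     /* Store (Y), 11 */
        return arg;
    }

    static void *t2(void *arg) {
        r1 = atomic_load(&Y);     /* Load R1, (Y) -- stores to Y'/X' omitted */
        r2 = atomic_load(&X);     /* Load R2, (X) */
        return arg;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, t1, NULL);
        pthread_create(&b, NULL, t2, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        /* Under SC, (r1, r2) can be (10,0), (10,1), or (11,1), never (11,0). */
        return 0;
    }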

Load-reserve & Store-conditional. Special register(s) hold a reservation flag and address, and the outcome of store-conditional. Load-reserve R, (m): <flag, adr> ← <1, m>; R ← M[m]. Store-conditional (m), R: if <flag, adr> == <1, m> then cancel other processors' reservations on m; M[m] ← R; status ← succeed; else status ← fail. Example (lock-free dequeue): try: Load-reserve Rhead, (head); spin: Load Rtail, (tail); if Rhead == Rtail goto spin; Load R, (Rhead); Rhead = Rhead + 1; Store-conditional (head), Rhead; if (status == fail) goto try; process(R). 6
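
Load-reserve/store-conditional is not directly exposed in C, so here is a hedged sketch of the same dequeue loop using a C11 compare-and-swap as a stand-in for the reserve/conditional pair. The names head, tail, and buffer, and the lack of index wraparound, are illustrative assumptions that mirror the slide.

    #include <stdatomic.h>
    #include <stddef.h>

    extern _Atomic size_t head, tail;   /* queue indices, assumed defined elsewhere */
    extern int buffer[];                /* queue storage, assumed defined elsewhere */

    int dequeue_one(void) {
        for (;;) {                                     /* try:                          */
            size_t h = atomic_load(&head);             /* "load-reserve" Rhead, (head)  */
            while (h == atomic_load(&tail))            /* spin: queue appears empty     */
                ;
            int r = buffer[h];                         /* Load R, (Rhead)               */
            /* "store-conditional" (head), Rhead+1: succeeds only if no other
             * consumer advanced head since we read it; otherwise retry.      */
            if (atomic_compare_exchange_weak(&head, &h, h + 1))
                return r;                              /* process(R) happens in the caller */
        }
    }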

Performance of Locks. Blocking atomic read-modify-write instructions (e.g., Test&Set, Fetch&Add, Swap) vs. non-blocking atomic read-modify-write instructions (e.g., Compare&Swap, Load-reserve/Store-conditional) vs. protocols based on ordinary loads and stores. Performance depends on several interacting factors: degree of contention, caches, out-of-order execution of loads and stores; more later... 7

Amdahl's Law begins with a simple software assumption (a limit argument): a fraction F of execution time is perfectly parallelizable, with no overhead for scheduling, communication, synchronization, etc. F is the parallel fraction; the remaining fraction 1 − F is completely serial. Time on 1 core = (1 − F)/1 + F/1 = 1. Time on N cores = (1 − F)/1 + F/N. 8
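
The speedup on N cores follows directly from the two times above (the closed form is implied by the slide rather than written on it):

    \[
      \text{Speedup}(N) \;=\; \frac{\text{Time on 1 core}}{\text{Time on } N \text{ cores}}
      \;=\; \frac{1}{(1-F) + F/N}
      \;\longrightarrow\; \frac{1}{1-F} \quad \text{as } N \to \infty .
    \]

For example, with F = 0.9 the speedup can never exceed 10x, no matter how many cores are added.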

Strong Consistency An execution is strongly consistent (linearizable) if the method calls can be correctly arranged retaining the mutual order of calls that do not overlap in time, regardless of what thread calls them. 9

Quiescent Consistency An execution is quiescently consistent if the method calls can be correctly arranged retaining the mutual order of calls separated by quiescence, a period of time where no method is being called in any thread. 10

Relaxed Memory Models Need Fences. [Figure: producer and consumer sharing a circular buffer with head and tail pointers.] Producer posting item x: Load Rtail, (tail); Store (Rtail), x; Membar_SS (ensures that the tail pointer is not updated before x has been stored); Rtail = Rtail + 1; Store (tail), Rtail. Consumer: Load Rhead, (head); spin: Load Rtail, (tail); if Rhead == Rtail goto spin; Membar_LL (ensures that R is not loaded before x has been stored); Load R, (Rhead); Rhead = Rhead + 1; Store (head), Rhead; process(R). 11
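
As a hedged illustration (not the lecture's code), the same two fences can be expressed with C11 acquire/release atomics: the release store to tail plays the role of Membar_SS, and the acquire load of tail plays the role of Membar_LL. The ring-buffer names, the single-producer/single-consumer assumption, and the omitted buffer-full check are mine.

    #include <stdatomic.h>

    #define N 1024
    static int buf[N];
    static _Atomic unsigned head = 0, tail = 0;

    void produce(int x) {
        unsigned t = atomic_load_explicit(&tail, memory_order_relaxed);
        buf[t % N] = x;                                  /* Store (Rtail), x            */
        /* release: buf[t] is visible to other processors before the new tail value */
        atomic_store_explicit(&tail, t + 1, memory_order_release);
    }

    int consume(void) {
        unsigned h = atomic_load_explicit(&head, memory_order_relaxed);
        /* acquire: the element is not read before the new tail value has been seen */
        while (h == atomic_load_explicit(&tail, memory_order_acquire))
            ;                                            /* spin: queue empty           */
        int r = buf[h % N];                              /* Load R, (Rhead)             */
        atomic_store_explicit(&head, h + 1, memory_order_relaxed);
        return r;                                        /* process(r) in the caller    */
    }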

Memory Coherence in SMPs. [Figure: CPU-1 and CPU-2, each with a cache holding A = 100, connected over the CPU-memory bus to memory holding A = 100.] Suppose CPU-1 updates A to 200. With write-back caches, memory and cache-2 have stale values; with write-through caches, cache-2 has a stale value. Do these stale values matter? What is the view of shared memory for programming? 12

Write-back Caches & SC. [Worked trace for prog T1 (ST X, 1; ST Y, 11) running on cache-1 and prog T2 (LD Y, R1; ST Y', R1; LD X, R2; ST X', R2) running on cache-2, showing the contents of cache-1, memory, and cache-2 as: T1 is executed; cache-1 writes back Y; T2 is executed; cache-1 writes back X; cache-2 writes back X' and Y'.] Because cache-1 writes back Y before X, T2 observes the new Y (= 11) but the old X (= 0), so write-back caches do not preserve sequential consistency. 13

Write-through Caches & SC. [Worked trace for the same programs with write-through caches, showing cache-1, memory, and cache-2 after T1 executes and after T2 executes: cache-2 cached X = 0 before T1 ran and never sees the update, so T2 again observes Y = 11 but X = 0.] Write-through caches don't preserve sequential consistency either. 14

Maintaining Cache Coherence. Hardware support is required such that only one processor at a time has write permission for a location, and no processor can load a stale copy of the location after a write → cache coherence protocols. 15

Cache Coherence vs. Memory Consistency. A cache coherence protocol ensures that all writes by one processor are eventually visible to other processors, for one memory address; i.e., updates are not lost. There is no guarantee of when an update should be seen, and no guarantee of the order in which updates (to different addresses) should be seen. A cache coherence protocol is not enough to ensure sequential consistency; but if a machine is sequentially consistent, then its caches must be coherent. 16

Cache Coherence vs. Memory Consistency. A memory consistency model gives the rules on when a write by one processor can be observed by a read on another processor, across different addresses, as seen in the previous examples. The combination of the cache coherence protocol and the processor's memory reorder buffer is used to implement a given architecture's memory consistency model. 17

Warmup: Parallel I/O. [Figure: processor with a cache connected over the memory bus (address A, data D, R/W) to physical memory and to a DMA engine attached to a disk.] Either the cache or the DMA engine can be the bus master and effect transfers. Page transfers occur while the processor is running. (DMA stands for Direct Memory Access; it means the I/O device can read/write memory autonomously from the CPU.) 18

Problems with Parallel I/O. [Figure: cached portions of a page in the processor's cache, with DMA transfers between physical memory and the disk.] Memory → Disk: physical memory may be stale if the cache copy is dirty. Disk → Memory: the cache may hold stale data and not see the memory writes. 19

Snoopy Cache (Goodman 1983). Idea: have the cache watch (or snoop upon) DMA transfers, and then do the right thing. Snoopy cache tags are dual-ported: one port is used to drive the memory bus when the cache is bus master, and a snoopy read port is attached to the memory bus. [Figure: processor port and snoopy bus port into the tags-and-state array and data lines of the cache.] 20

Snoopy Cache Actions for DMA. Observed bus cycle / cache state → cache action. DMA read (memory → disk): address not cached → no action; cached, unmodified → no action; cached, modified → cache intervenes. DMA write (disk → memory): address not cached → no action; cached, unmodified → cache purges its copy; cached, modified → ??? 21
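
The table can also be read as a small decision function; this sketch simply restates the rows above (the type and function names are mine).

    typedef enum { NOT_CACHED, CACHED_CLEAN, CACHED_MODIFIED } line_state_t;
    typedef enum { NO_ACTION, CACHE_INTERVENES, CACHE_PURGES, UNDEFINED } snoop_action_t;

    snoop_action_t dma_snoop(int is_dma_write, line_state_t s) {
        if (!is_dma_write) {                       /* DMA read: memory -> disk      */
            return (s == CACHED_MODIFIED) ? CACHE_INTERVENES : NO_ACTION;
        } else {                                   /* DMA write: disk -> memory     */
            if (s == CACHED_CLEAN)    return CACHE_PURGES;
            if (s == CACHED_MODIFIED) return UNDEFINED;   /* the slide's "???"      */
            return NO_ACTION;
        }
    }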

Shared Memory Multiprocessor. [Figure: processors M1, M2, and M3, each with a snoopy cache, plus a DMA engine and disks, all connected to physical memory over the memory bus.] Use the snoopy mechanism to keep all processors' view of memory coherent. 22

Snoopy Cache Coherence Protocols. Write miss: the address is invalidated in all other caches before the write is performed. Read miss: if a dirty copy is found in some cache, a write-back is performed before the memory is read. 23

Cache State Transition Diagram: the MSI protocol. Each cache line has state bits (M: Modified, S: Shared, I: Invalid) next to its address tag. [State-transition diagram for the cache state in processor P1: a write miss takes the line from I to M (P1 gets the line from memory); P1 reads or writes keep it in M; another processor's read takes M to S (P1 writes back); a read miss takes the line from I to S (P1 gets the line from memory); reads by any processor keep it in S; another processor's intent to write takes S to I, and takes M to I with a P1 write-back.] 24
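
A compact sketch of the same MSI transitions as code. The encoding, the event names, and the function name are assumptions; generation of the actual bus requests is omitted.

    typedef enum { MSI_I, MSI_S, MSI_M } msi_state_t;
    typedef enum {
        P1_READ, P1_WRITE,          /* local processor accesses          */
        OTHER_READ, OTHER_WRITE     /* bus events from another processor */
    } msi_event_t;

    /* Returns the next state; *writeback is set when P1 must write the line back. */
    msi_state_t msi_next(msi_state_t s, msi_event_t e, int *writeback) {
        *writeback = 0;
        switch (s) {
        case MSI_M:
            if (e == OTHER_READ)  { *writeback = 1; return MSI_S; }  /* other reads: write back, go Shared */
            if (e == OTHER_WRITE) { *writeback = 1; return MSI_I; }  /* other intends to write: invalidate  */
            return MSI_M;                                            /* P1 reads or writes stay in M        */
        case MSI_S:
            if (e == P1_WRITE)    return MSI_M;                      /* write miss/upgrade: get exclusive   */
            if (e == OTHER_WRITE) return MSI_I;                      /* other intends to write: invalidate  */
            return MSI_S;                                            /* reads by any processor stay Shared  */
        case MSI_I:
        default:
            if (e == P1_WRITE)    return MSI_M;                      /* write miss: fetch line, go Modified */
            if (e == P1_READ)     return MSI_S;                      /* read miss: fetch line, go Shared    */
            return MSI_I;
        }
    }

The diagram's "other processor intends to write" corresponds to the OTHER_WRITE event here.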

Two Processor Example (reading and writing the same cache line). Access sequence: P1 reads, P1 writes, P2 reads, P2 writes, P1 reads, P1 writes, P2 writes, P1 writes. [Two MSI state diagrams, one per processor: in P1's diagram, a read miss goes to S, a write miss or P1's intent to write goes to M, P2's read forces P1 to write back and drop to S, and P2's intent to write invalidates to I; P2's diagram is symmetric with the roles swapped.] 25

Observation. [Repeat of the MSI state diagram.] If a line is in the M state, then no other cache can have a copy of the line. Memory stays coherent; multiple differing copies cannot exist. 26

MESI: An Enhanced MSI Protocol, with increased performance for private data. Each cache line has state bits (M: Modified Exclusive, E: Exclusive but unmodified, S: Shared, I: Invalid) next to its address tag. [State-transition diagram for the cache state in processor P1: a write miss takes the line to M; a read miss to a line that is shared elsewhere takes it to S; a read miss to a line that is not shared takes it to E; a P1 write takes E to M (no bus transaction needed, which is the gain for private data); P1's intent to write takes S to M; another processor's read takes M to S (P1 writes back) or E to S; another processor's intent to write takes S or E to I, and takes M to I with a P1 write-back.] 27

Optimized Snoop with Level-2 Caches. [Figure: four CPUs, each with an L1 cache, an L2 cache, and a snooper attached to the memory bus.] Processors often have two-level caches: a small L1 and a large L2 (on chip). Inclusion property: entries in L1 must also be in L2, so an invalidation in L2 forces an invalidation in L1. Snooping on L2 does not affect CPU-L1 bandwidth. What problem could occur? 28

Intervention. [Figure: CPU-1's cache holds A = 200 (modified); CPU-2's cache does not hold A; memory still holds the stale value A = 100.] When a read miss for A occurs in cache-2, a read request for A is placed on the bus. Cache-1 needs to supply the data and change its state to Shared. The memory may respond to the request as well! Does memory know it has stale data? Cache-1 needs to intervene through the memory controller to supply the correct data to cache-2. 29

False Sharing. [Cache line layout: state, line address, data0, data1, ..., dataN.] A cache line contains more than one word, and cache coherence is done at the line level, not the word level. Suppose M1 writes word i and M2 writes word k, and both words have the same line address. What can happen? 30
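
A hedged C sketch of the scenario in the question: each thread writes only its own word, but the two words share a line, so under an invalidation-based protocol the line bounces between the two caches even though the data is logically private. Padding the layout to an assumed 64-byte line size removes the effect. All names, counts, and the line size are illustrative.

    #include <pthread.h>

    struct shared_line {
        long a;     /* written only by thread 1                                  */
        long b;     /* written only by thread 2, but likely in the same line     */
    };

    struct padded {                        /* the fix, shown for comparison      */
        long a;
        char pad[64 - sizeof(long)];       /* assume a 64-byte cache line        */
        long b;                            /* now in a different line            */
    };

    static struct shared_line sl;

    static void *writer_a(void *arg) {
        for (long i = 0; i < 100000000L; i++) sl.a = i;   /* M1 writes word i    */
        return arg;
    }
    static void *writer_b(void *arg) {
        for (long i = 0; i < 100000000L; i++) sl.b = i;   /* M2 writes word k    */
        return arg;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, writer_a, NULL);
        pthread_create(&t2, NULL, writer_b, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }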

Synchronization and Caches: Performance Issues. Processors 1, 2, and 3 each spin on the same lock: R ← 1; L: swap (mutex), R; if <R> then goto L; <critical section>; M[mutex] ← 0. [Figure: the three processors' caches on the CPU-memory bus, with mutex = 1 cached.] Cache-coherence protocols will cause the mutex to ping-pong between P1's and P2's caches. Ping-ponging can be reduced by first reading the mutex location (non-atomically) and executing the swap only if it is found to be zero. 31
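
A minimal sketch of that fix (test-and-test-and-set) with C11 atomics; the lock encoding (0 = free, 1 = held) and the function names are assumptions, not the lecture's code.

    #include <stdatomic.h>

    static atomic_int mutex = 0;   /* 0 = free, 1 = held */

    void lock(void) {
        for (;;) {
            /* test: plain read; the line can stay Shared in every spinner's cache */
            while (atomic_load_explicit(&mutex, memory_order_relaxed) != 0)
                ;
            /* test&set: issue the invalidating atomic swap only when the lock looks free */
            if (atomic_exchange_explicit(&mutex, 1, memory_order_acquire) == 0)
                return;
        }
    }

    void unlock(void) {
        atomic_store_explicit(&mutex, 0, memory_order_release);
    }

While the lock is held, every waiter spins on a Shared copy of mutex in its own cache; only a release followed by an observed zero triggers the bus traffic of the swap.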

Load-reserve & Store-conditional. Special register(s) hold the reservation flag and address, and the outcome of store-conditional. Load-reserve R, (a): <flag, adr> ← <1, a>; R ← M[a]. Store-conditional (a), R: if <flag, adr> == <1, a> then cancel other processors' reservations on a; M[a] ← <R>; status ← succeed; else status ← fail. If the snooper sees a store transaction to the address in the reserve register, the reserve bit is set to 0. Several processors may reserve a simultaneously. These instructions are like ordinary loads and stores with respect to the bus traffic. 32

Out-of-Order Loads/Stores & CC. [Figure: the CPU/memory interface: CPU with load/store buffers, a cache with a snooper (line states I/S/E), and memory; bus traffic includes Wb-req, Inv-req, Inv-rep, pushout (Wb-rep), and S-req/E-req with S-rep/E-rep responses.] Blocking caches: one request at a time, plus CC, gives SC. Non-blocking caches: multiple requests (to different addresses) in flight concurrently, plus CC, gives relaxed memory models. CC ensures that all processors observe the same order of loads and stores to a given address. 33

Acknowledgements. These slides contain material developed and copyright by: Arvind (MIT), Krste Asanovic (MIT/UCB), Joel Emer (Intel/MIT), James Hoe (CMU), John Kubiatowicz (UCB), David Patterson (UCB). MIT material derived from course 6.823; UCB material derived from course CS252. Also: "New microprocessor claims 10x energy improvement," from ExtremeTech; "Exploring the diverse world of programming," by Pavel Shved; "Amdahl's Law in the Multicore Era," by Mark D. Hill and Michael R. Marty. 34