Module 14: "Directory-based Cache Coherence" Lecture 31: "Managing Directory Overhead" Directory-based Cache Coherence: Replacement of S blocks

Directory-based Cache Coherence

The lecture contains:
- Replacement of S blocks
- Serialization
- VN deadlock
- Starvation
- Overflow schemes
- Sparse directory
- Remote access cache
- COMA
- Latency tolerance
- Page migration
- Queue lock in hardware

[From Chapter 8 of Culler, Singh, Gupta]
[SGI Origin 2000 material taken from Laudon and Lenoski, ISCA 1997]
[GS320 material taken from Gharachorloo et al., ASPLOS 2000]

Replacement of S blocks

- Send a notification to the directory? Can save a future invalidation
- Does it reduce overall traffic?
- Origin 2000 does not use replacement hints
  - No notification to the directory
  - Why? Replacements of E blocks are hinted and require acknowledgments also (why?)
- Summary of transaction types
  - Coherence: 9 request transaction types, 6 invalidation/intervention types, 39 reply types
  - Non-coherent (I/O, synch, special): 19 requests, 14 replies

Serialization

- Home is used to serialize requests
  - The order determined by the home is final
  - No node should violate this order
- Example: read-invalidate races
  - P0, P1, and P2 are trying to access a cache block
  - P0 and P2 want to read while P1 wants to write
  - The requests from P0 and P2 reach home first; home replies and marks both in the sharer vector, but the reply message to P0 gets delayed in the network
  - P1's write causes home to send out invalidations to P0 and P2; P0's invalidation reaches P0 before the read reply
  - P0's hub sends an acknowledgment to P1 and also forwards the invalidation to P0's processor cache
  - What happens when P0's reply arrives? Can the data be used?
- Requester's viewpoint
  - When the read reply arrives it finds the OTT entry has the inv bit set
  - Under what conditions can this happen? Seen one in the last slide
  - Can replacement hints help?
  - What about upgrade-invalidation races? What about readx-invalidation races?
  - (A sketch of one possible requester-side policy appears after the VN deadlock discussion below)

VN deadlock

- Origin 2000 has only two virtual networks, but has three-hop transactions
  - Resorts to back-off invalidate or intervention to fall back to strict request-reply
  - Does it really solve the problem or just move the problem elsewhere?
- Stanford DASH has the same problem
  - Uses NACKs after a time-out period if the outgoing network doesn't free up
  - Worse compared to Origin because NACKs inflate total traffic and may lead to livelock
  - DASH avoids livelocks by sizing the queues according to the machine size (not a scalable solution)
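To make the requester's viewpoint concrete, here is a minimal sketch in C of how a requesting hub's outstanding transaction table (OTT) entry might record the early invalidation and then handle the delayed read reply. The structure, field names, and helper functions are hypothetical (not the Origin 2000's actual hub logic); the policy shown, use the reply data once but do not cache the block, is one common textbook resolution of this race.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed messaging helpers provided by the hub (hypothetical). */
void send_inval_ack_to_writer(uint64_t addr);
void deliver_value_once(uint64_t addr, const void *data);
void install_block_shared(uint64_t addr, const void *data);

/* Hypothetical outstanding transaction table (OTT) entry kept by the
 * requesting hub for each miss it has issued but not yet completed. */
typedef struct {
    uint64_t block_addr;    /* cache block this miss is for           */
    bool     valid;         /* entry tracks an outstanding miss       */
    bool     inv_received;  /* invalidation arrived before the reply  */
} ott_entry_t;

/* An invalidation from home races ahead of the read reply: acknowledge it
 * immediately (so the writer can complete) and remember it in the OTT. */
void on_invalidation(ott_entry_t *e, uint64_t addr)
{
    send_inval_ack_to_writer(addr);
    if (e->valid && e->block_addr == addr)
        e->inv_received = true;
}

/* When the delayed read reply arrives with the inv bit set, the data is
 * stale with respect to the already-acknowledged write.  One safe policy:
 * let the processor consume the value once (it is legal in the order the
 * home chose, since the read was serialized before the write), but do not
 * install the block in the cache, so the next access misses and re-fetches. */
void on_read_reply(ott_entry_t *e, uint64_t addr, const void *data)
{
    if (e->inv_received)
        deliver_value_once(addr, data);
    else
        install_block_shared(addr, data);
    e->valid = false;
    e->inv_received = false;
}
```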

Starvation

- NACKs can cause starvation
- Build a FIFO list of waiters either in home memory (Chaudhuri and Heinrich, 2004) or as a distributed linked list (IEEE Scalable Coherent Interface)
  - The former imposes large occupancy on home, yet offers better performance through read combining
  - The latter is an extremely complex protocol with a large number of transient states and 29 stable states, but does distribute the occupancy across the system
- Origin 2000 devotes extra bits in the directory to raise the priority of requests NACKed too many times (above a threshold)
- Use delay between retries
- Use the Alpha GS320 protocol (will discuss later)

Overflow schemes

- How to make the directory size independent of the number of processors?
- The basic idea is to use a bit-vector scheme as long as the number of sharers fits within the directory entry width
- When the number of sharers overflows, the hardware resorts to an overflow scheme
- Dir_i B: i sharer bits, broadcast invalidation on overflow
- Dir_i NB: pick one sharer and invalidate it
- Dir_i CV: assign one bit to a group of nodes of size P/i; broadcast invalidations to that group on a write (a sketch follows this slide)
  - May generate useless invalidations
- Dir_i DP (Stanford FLASH)
  - DP stands for dynamic pointer
  - Allocate directory entries from a free list pool maintained in memory
  - Needs replacement hints
  - Still may run into reclamation mode if the free list pool is not sized properly at boot time
  - How do you size it? If replacement hints are not supported, assume k sharers on average per cache block (k=8 is found to be good)
  - Reclamation algorithms? Pick a random cache line and invalidate it
- Dir_i SW (MIT Alewife)
  - Trap to software on overflow
  - Software maintains the information about the overflown sharers
  - MIT Alewife has a directory entry of five pointers and a local bit (i.e., the overflow threshold is five or six)
  - A remote read before overflow takes 40 cycles and after overflow takes 425 cycles
  - Five invalidations take 84 cycles while six invalidations take 707 cycles
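As a concrete illustration of the Dir_i CV scheme, the following is a minimal sketch in C with illustrative sizes (P = 64 nodes, i = 16 coarse-vector bits, so each bit covers a group of 4 nodes); sending an invalidation message is replaced by a print statement.

```c
#include <stdint.h>
#include <stdio.h>

/* Coarse-vector (Dir_i CV) directory entry: each of the I bits covers a
 * group of P/I nodes.  P and I are illustrative values. */
#define P 64
#define I 16
#define GROUP_SIZE (P / I)

/* Record node 'node' as a sharer: set the bit for its group. */
static inline uint16_t cv_add_sharer(uint16_t cv, int node)
{
    return cv | (uint16_t)(1u << (node / GROUP_SIZE));
}

/* On a write, every node in every marked group must be invalidated, even
 * nodes that never cached the block (the "useless invalidations" above). */
static void cv_send_invalidations(uint16_t cv)
{
    for (int g = 0; g < I; g++) {
        if (cv & (1u << g)) {
            for (int n = g * GROUP_SIZE; n < (g + 1) * GROUP_SIZE; n++)
                printf("invalidate node %d\n", n);  /* stand-in for a message */
        }
    }
}

int main(void)
{
    uint16_t cv = 0;
    cv = cv_add_sharer(cv, 5);   /* falls in group 1  */
    cv = cv_add_sharer(cv, 42);  /* falls in group 10 */
    cv_send_invalidations(cv);   /* invalidates nodes 4-7 and 40-43 */
    return 0;
}
```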

Sparse directory

- How to reduce the height of the directory?
- Observation: the total number of cache blocks in all processors is far less than the total number of memory blocks
  - Assume a 32 MB L3 cache and 4 GB of memory: less than 1% of directory entries are active at any point in time
- The idea is to organize the directory as a highly associative cache
- On a directory entry eviction, send invalidations to all sharers, or retrieve the line if dirty
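The sketch below shows the eviction behavior just described for a sparse directory organized as a 4-way set-associative cache of directory entries. The sizes, field widths, victim choice, and the protocol-message helpers are all assumptions for illustration, not any particular machine's design.

```c
#include <stdbool.h>
#include <stdint.h>

#define DIR_SETS 4096
#define DIR_WAYS 4

typedef struct {
    bool     valid;
    bool     dirty;        /* block is exclusively owned              */
    uint64_t block_addr;   /* which memory block this entry tracks    */
    uint64_t sharers;      /* bit vector of sharing nodes (up to 64)  */
    int      owner;        /* owning node when dirty                  */
} dir_entry_t;

static dir_entry_t dir[DIR_SETS][DIR_WAYS];

/* Assumed messaging helpers (not implemented here). */
void send_invalidation(int node, uint64_t block_addr);
void retrieve_dirty_line(int owner, uint64_t block_addr);

/* Allocate an entry for block_addr, evicting a victim if the set is full.
 * An eviction must first clean up the victim's block: invalidate all
 * sharers, or pull the line back from its owner if it is dirty. */
dir_entry_t *dir_allocate(uint64_t block_addr)
{
    int set = (int)(block_addr % DIR_SETS);

    for (int w = 0; w < DIR_WAYS; w++)       /* reuse an empty way first */
        if (!dir[set][w].valid)
            return &dir[set][w];

    dir_entry_t *victim = &dir[set][0];      /* trivial victim choice    */
    if (victim->dirty) {
        retrieve_dirty_line(victim->owner, victim->block_addr);
    } else {
        for (int n = 0; n < 64; n++)
            if (victim->sharers & (1ull << n))
                send_invalidation(n, victim->block_addr);
    }
    victim->valid = false;
    return victim;
}
```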

Remote access cache

- Essentially a large tertiary cache
- Captures remote cache blocks evicted from the local cache hierarchy
- Also visible to the coherence protocol, so inclusion must be maintained with the processor caches
  - Must be highly associative and larger than the outermost level of cache
- Usually part of DRAM is reserved for the RAC
- For multiprocessor nodes, requests from different processors to the same cache block can be merged together; there is also a prefetching effect
- Used in Stanford DASH
- Disadvantages: latency and space

COMA

- Cache-only memory architecture
- Solves the space problem of the RAC
- The home node only maintains the directory entries, but may not have the cache block in memory
- A node requesting a cache block brings it into its local memory and local cache as usual
- The entire memory is treated as a large tertiary cache, known as the attraction memory (AM)
- Home, as well as any node holding a cache block, maintains a directory entry for the block
- A request first looks up the local AM directory state and, if the block is unowned there, gets forwarded to home which, in turn, forwards it to one of the sharers (a sketch follows this slide)
- To start with, home has the cache blocks; it retains a cache block until it is replaced by some other migration
- There is always a master copy of each cache block: the last valid copy
  - What happens on a replacement of the master copy? Swap it with the source of the migrating cache block
- The latency problem remains at the requester
- Inclusion problems between the AM and the processor cache hierarchy complicate the protocol
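The following is a minimal sketch of the COMA request path just described: consult the local attraction memory first, and only go through home when the block is absent or invalid locally. The state names, the AM lookup, and the forwarding helpers are hypothetical placeholders, not any particular machine's protocol.

```c
#include <stdbool.h>

/* Hypothetical attraction-memory (AM) block states. */
typedef enum { AM_INVALID, AM_SHARED, AM_MASTER } am_state_t;

typedef struct {
    am_state_t state;   /* state of the block in the local attraction memory */
} am_entry_t;

/* Assumed helpers. */
bool lookup_local_am(unsigned long block, am_entry_t *out); /* true if present */
int  home_of(unsigned long block);           /* address-to-home mapping        */
void forward_to_home(int home, unsigned long block);
void satisfy_locally(unsigned long block);

/* A read request first consults the local AM; only if the block is absent
 * (or invalid) does it go to home, whose directory entry redirects the
 * request to one of the current sharers. */
void coma_read_request(unsigned long block)
{
    am_entry_t e;
    if (lookup_local_am(block, &e) && e.state != AM_INVALID)
        satisfy_locally(block);                     /* hit in local AM   */
    else
        forward_to_home(home_of(block), block);     /* go through home   */
}
```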

Latency tolerance

Page placement

- Application-directed page placement is used in many cases to minimize the number of remote misses
- The application provides the kernel (via a system call) the starting and ending addresses of a chunk of memory (a multiple of pages) and also the node number where these pages should be mapped
- Thus an application writer can specify (based on the sharing pattern) which shared pages to map in a node's local memory (private pages and stack pages are mapped to local memory by default)
- The page fault handler of a NUMA kernel is normally equipped with some default policies, e.g., round-robin mapping or first-touch mapping
- Examples: matrix-vector multiplication, matrix transpose

Software prefetching

- Even after rigorous analysis of a parallel application it may not be possible to map all pages used by a node in its local memory: the same page may be used by multiple nodes at different times (example: matrix-vector multiplication)
- Two options are available: dynamic page migration (very costly; coming next, stay tuned) or software prefetching
- Today, most microprocessors support prefetch and prefetch-exclusive instructions: use prefetch to initiate a cache line read miss long before the application actually accesses it; use prefetch-exclusive if you know for sure that you will write to the line and no one else will need the line before you write to it
- Prefetches must be used carefully (a code sketch appears after the hardware prefetching comparison below)
  - Early prefetches may evict useful cache blocks and may themselves get evicted before use; they may generate extra invalidations or interventions in multiprocessors
  - Late prefetches may not be fully effective, but are at least less harmful than early prefetches
  - Wrong prefetches are the most dangerous: they bring in cache blocks that may not be used at all in the near future; in multiprocessors this can severely hurt performance by generating extra invalidations and interventions
  - Wrong prefetches waste bandwidth and pollute the cache
- Software prefetching usually offers better control than hardware prefetching

Software prefetching vs. hardware prefetching

- Software prefetching requires close analysis of the program; profile information may help
- Hardware prefetching tries to detect patterns in the accessed addresses and, using the detected patterns, predicts future addresses
- AMD Athlon has a simple next-line prefetcher (works perfectly for most numerical applications)
- Intel Pentium 4 has a very sophisticated stream prefetcher
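As an illustration of software prefetching, here is a minimal sketch of a matrix-vector multiply (y = A*x) that issues read prefetches ahead of the current position and a write ("prefetch exclusive") for the next result element. It uses the GCC/Clang __builtin_prefetch intrinsic; the prefetch distance PFDIST is an illustrative value that would need tuning to avoid the early/late prefetch problems listed above.

```c
#include <stddef.h>

#define PFDIST 16   /* elements ahead of the current access (illustrative) */

void matvec(const double *a, const double *x, double *y, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        double sum = 0.0;
        for (size_t j = 0; j < n; j++) {
            /* Read prefetch: fetch A[i][j + PFDIST] well before it is used. */
            if (j + PFDIST < n)
                __builtin_prefetch(&a[i * n + j + PFDIST], 0 /* read */, 1);
            sum += a[i * n + j] * x[j];
        }
        /* Write prefetch ("prefetch exclusive"): we know we will write
         * y[i + 1] and no one else needs that line before we write it. */
        if (i + 1 < n)
            __builtin_prefetch(&y[i + 1], 1 /* write */, 1);
        y[i] = sum;
    }
}
```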

Page migration Page migration changes the existing VA to PA mapping of the migrated page Requires notifying all TLBs caching the old mapping Introduces a TLB coherence problem Origin 2000 uses a smart page migration algorithm: allows the page copy and TLB shootdown to proceed in parallel Array of 64 page reference counters per directory entry to decide whether to migrate a page or not: compare requester s counter against home s and send an interrupt to home if migration is required What does the interrupt handler do? Access all directory entries of the lines belonging to the to-be migrated page Send invalidations to sharers or interventions to owners; at the end all cache lines of that page must be in memory Set the poison bits in the directory entries of all the cache lines of the page Start a block transfer of the page from home to requester at this point (30 µs to copy 16 KB) An access to a poisoned cache line from a node results in a bus error which invalidates the TLB entry for that page in the requesting node (avoids broadcast shootdown) Until the page is completely migrated and is assigned a physical page frame on target node, all nodes accessing a poisoned line wait in a pending queue After the page copy is completed the waiting nodes are served one by one; however, the directory entries and the page itself are moved to a poisoned list and are not yet freed at the home (i.e. you still cannot use that physical page frame) On every scheduler tick the kernel invalidates one TLB entry per processor After a time equal to TLB entries per processor multiplied by scheduling quantum the page frame is marked free and is removed from the poisoned list Major advantage: requesting nodes only see the page copy latency including invalidation and interventions in critical path, but not the TLB shootdown latency Queue lock in hardware Stanford DASH Memory controller recognizes lock accesses Requires changes in compiler and instruction set Marks the directory entry with contenders On unlock a contender is chosen and lock is granted to that node Unlock is forced to generate a notification message to home Possibly requires special cache state for lock variables or special uncached instructions for unlock if lock variables are not allowed to be cached file:///e /parallel_com_arch/lecture31/31_6.htm[6/13/2012 12:14:04 PM]