Scalable Cache Coherence

Parallel Computing: Scalable Cache Coherence. Hwansoo Han

Hierarchical Cache Coherence. Hierarchies arise in cache organization: multiple levels of caches on a processor, and large-scale multiprocessors built from a hierarchy of buses. [Figure: processors with private first-level caches sharing second-level caches.]

Multilevel Caches. Inclusion property: everything in the L1 cache is also present in the L2 cache, and if L1 has a block in owned state (e.g., modified in MESI), L2 must have it in modified state. With inclusion, snooping at L2 takes responsibility for flushing or invalidating data in response to remote requests. It often helps if the block size in L1 is smaller than or the same as that in the L2 cache.
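
As a concrete illustration of the L2's role under inclusion, here is a minimal sketch in C of back-invalidation when L2 gives up a block; the l1_* helper names are hypothetical, not from any particular design.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical L1 interface; names and signatures are assumptions. */
extern bool l1_has_block(uint64_t block);
extern bool l1_block_is_dirty(uint64_t block);
extern void l1_writeback(uint64_t block);   /* push owned data down to L2 */
extern void l1_invalidate(uint64_t block);

/* Called when L2 drops a block, either for its own replacement or
 * because a remote snoop requires it. Inclusion is what lets L2 act
 * as a snoop filter: only blocks present in L2 can ever be in L1. */
void l2_drop_block(uint64_t block)
{
    if (l1_has_block(block)) {              /* back-invalidate L1 */
        if (l1_block_is_dirty(block))
            l1_writeback(block);            /* owned data must be flushed */
        l1_invalidate(block);
    }
    /* ... L2 then writes back or invalidates its own copy ... */
}
```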

Hierarchical Snoopy Cache Coherence. A hierarchy of buses is the simplest way to build large-scale cache-coherent MPs: use snoopy coherence at each level. Two alternatives for memory location: main memory centralized at the global (B2) bus, or main memory distributed among the clusters. In the distributed case, L2 may not include local data held in the L1 caches, but it still needs to snoop for local data. [Figure: clusters of processors and L2 caches on B1 buses, connected by a global B2 bus; main memory sits either at B2 or inside each cluster.]

Hierarchies with Global Memory. First-level caches: highest-performance SRAM caches; B1 follows the standard snoopy protocol. Second-level caches: much larger than the L1 caches (set associative); must maintain the inclusion property; the L2 cache acts as a filter for the B1 bus and the L1 caches; the L2 cache can be DRAM based, since fewer references reach it. [Figure: two clusters, each with L1 caches and an L2 cache on a B1 bus, connected to main memory on the B2 bus.]

Hierarchies with Global Memory (cont'd). Advantages: misses require only a single traversal to the root of the hierarchy (the global memory); placement of shared data is not an issue. Disadvantages: misses to local data structures (e.g., the stack) also have to traverse the hierarchy, resulting in larger traffic and higher latency; memory at the global bus must be highly interleaved, otherwise bandwidth to the global memory will not scale.

Hierarchies with Distributed Memory. Main memory is distributed among the clusters. This reduces global bus traffic for local data and suitably placed shared data, and it reduces latency, since there is less contention and local accesses are faster. Observation: the L2 cache mainly holds data from remote memories, so it can be replaced by a tag-only router/coherence switch. [Figure: clusters with per-cluster memory and L2 caches on B1 buses, connected by a global B2 bus.]

Alternative: Hierarchy of Rings. A hierarchical ring network instead of a bus; examples include U of Toronto's Hector and NUMAchine, and Kendall Square Research (KSR). Caches snoop on requests passing by on the ring. The point-to-point structure of a ring offers potentially higher bandwidth than buses, but latency is still high.

Hierarchies. Advantages: conceptually simple to build (apply snooping recursively); merging and combining of requests in hardware. Disadvantages: physical hierarchies do not provide enough bandwidth, since the root becomes a bottleneck (a patch solution is multiple buses/rings at the higher levels); latencies are often larger than those in direct networks.

Directory-Based Cache Coherence. A more scalable solution for large-scale machines. Motivation: snoopy schemes do not scale, as they rely on broadcast; directory-based schemes allow scaling. Broadcasts are avoided by keeping track of all processors (PEs) that cache each memory block and by using point-to-point messages to maintain coherence. This works on any scalable point-to-point interconnect, with no reliance on buses or other broadcast-based interconnects.

Basic Scheme of Directory Coherence. Assume processors P0..Pn. With each cache block in memory: P presence bits and 1 dirty bit. With each cache block in a cache: 1 valid bit and 1 dirty (owner) bit. [Figure: memory with a directory entry per block, holding the presence bits and the dirty bit.] Read from main memory by processor Pi: if the dirty bit is OFF, read from main memory and turn p[i] ON; if the dirty bit is ON, recall the line from the dirty cache (its cache state goes to shared), update memory, turn the dirty bit OFF, turn p[i] ON, and supply the recalled data to Pi. Write to main memory by processor Pi: if the dirty bit is OFF, supply the data to Pi, send invalidations to all Ps caching that block, turn the dirty bit ON, turn p[i] ON, and turn the others OFF; if the dirty bit is ON, recall the line from the dirty cache (its cache state goes to invalid), update memory, supply the data to Pi, turn p[i] ON, and turn the others OFF.
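
The same scheme, written out as a minimal C sketch of the home node's handlers. The messaging/recall helpers and the single-threaded structure are assumptions for illustration; a real protocol also has to deal with races, acknowledgements, and transient states.

```c
#include <stdint.h>
#include <stdbool.h>

#define P 64                              /* assumed number of processors */

typedef struct {
    bool presence[P];                     /* p[0..P-1] presence bits      */
    bool dirty;                           /* 1 dirty bit per memory block */
} dir_entry_t;

/* Hypothetical messaging/recall helpers. */
extern void supply_data(int dst, uint64_t block);
extern void send_invalidation(int dst, uint64_t block);
extern void recall_to_shared(int owner, uint64_t block);   /* owner -> S  */
extern void recall_to_invalid(int owner, uint64_t block);  /* owner -> I  */
extern int  find_owner(const dir_entry_t *e);

/* Read from main memory by processor i. */
void dir_read(dir_entry_t *e, uint64_t block, int i)
{
    if (e->dirty) {                       /* recall line from the owner,  */
        recall_to_shared(find_owner(e), block);   /* which updates memory */
        e->dirty = false;
    }
    e->presence[i] = true;                /* turn p[i] ON                 */
    supply_data(i, block);
}

/* Write to main memory by processor i. */
void dir_write(dir_entry_t *e, uint64_t block, int i)
{
    if (e->dirty) {                       /* recall and invalidate owner  */
        recall_to_invalid(find_owner(e), block);  /* also updates memory  */
    } else {                              /* invalidate all other sharers */
        for (int j = 0; j < P; j++)
            if (e->presence[j] && j != i)
                send_invalidation(j, block);
    }
    for (int j = 0; j < P; j++)           /* p[i] ON, all others OFF      */
        e->presence[j] = (j == i);
    e->dirty = true;
    supply_data(i, block);
}
```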

Directory Protocol Examples. [Figure: two transaction diagrams. Read miss to a block in dirty state: (1) read request to the directory, (2) reply with the owner's identity, (3) read request to the owner, (4a) data reply to the requestor, (4b) revision message to the directory. Write miss to a block with two sharers: (1) RdEx request to the directory, (2) reply with the sharers' identities, (3a/3b) invalidation requests to the sharers, (4a/4b) invalidation acks to the requestor.]

Scaling with the Number of Processors. Scaling of memory and directory: a centralized directory is a bandwidth bottleneck, just as centralized memory is, so directory information is maintained in a distributed way. Performance characteristics: traffic (number of network transactions each time the protocol is invoked) and latency (number of network transactions on the critical path each time). Directory storage requirements: the number of presence bits needed grows as the number of processors grows, leading to potentially large storage overhead. Characteristics that matter: the frequency of write misses, and how many sharers there are on a write miss.

Cache Invalidation Patterns. Number of sharers = number of invalidations. Most write misses involve a small number of sharers (LU, Ocean); other applications have several sharers, but still only a few for most write misses (Barnes-Hut, Radiosity).

Sharing Patterns. Generally there are only a few sharers at a write, and the number scales slowly with p. Code and read-only objects: no problem, as they are rarely written. Migratory objects: even as the number of processors scales, only 1-2 invalidations. Mostly-read objects: invalidations are infrequent. Frequently read/written objects: invalidations are frequent but usually remain small. Synchronization objects (e.g., lock variables): low-contention locks result in few invalidations; high-contention locks need special support (software trees, queuing locks). This implies directories are very useful in containing traffic; if organized properly, traffic and latency shouldn't scale too badly.

Organizing Directories. Directory schemes are either centralized or distributed. For distributed schemes, two questions structure the design space: how to find the source of directory information (flat vs. hierarchical schemes), and, for flat schemes, how to locate the copies (memory-based vs. cache-based).

Finding Directory Information. Centralized memory and directory: easy but not scalable. Distributed memory and directory: in flat schemes, the directory is distributed with the memory at the home node; the location of the directory is obtained from the address (hashing), so messages can be sent directly to the home. In hierarchical schemes, the directory is organized as a hierarchical data structure: the leaves are processors and the internal nodes hold directory state; a directory entry specifies whether its subtree caches the block; a search message is sent to the parent and passed down toward the leaves.
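
The flat case is just address arithmetic. A minimal sketch in C, with the block size and node count as assumed, illustrative constants:

```c
#include <stdint.h>

#define BLOCK_BITS 6      /* assumed 64-byte cache blocks             */
#define NUM_NODES  64     /* assumed machine size; both hypothetical  */

/* Flat directory lookup: the home node (and thus the directory entry)
 * is derived purely from the physical address, so a requester can send
 * its coherence message directly to the home. */
static inline unsigned home_node(uint64_t paddr)
{
    uint64_t block = paddr >> BLOCK_BITS;   /* block number            */
    return (unsigned)(block % NUM_NODES);   /* simple interleaving     */
}
```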

Locating Copies (Presence Bits). Hierarchical schemes: each directory has presence bits for its children (subtrees) plus a dirty bit. Flat schemes come in two flavors with different storage overheads and performance characteristics. Memory-based schemes: information about all copies is stored at the home, with the memory block (e.g., DASH, Alewife, SGI Origin, FLASH). Cache-based schemes: information about copies is distributed among the copies themselves; each copy points to the next, forming a distributed doubly linked list (Scalable Coherent Interface, IEEE 1596-1992 SCI standard).

Flat, Memory-Based Schemes. All information about copies is co-located with the block itself at the home, so the scheme works just like the centralized one, except that it is physically distributed. Scaling of performance characteristics: traffic on a write is proportional to the number of sharers, since an invalidation is performed for each sharer individually; the latency of a write can be reduced by issuing the invalidations to the sharers in parallel.

Flat, Memory-Based Schemes (cont'd). How does storage overhead scale? The simplest representation is a full bit vector, i.e., one presence bit per node. Directory storage overhead is then proportional to P * M, where P = number of processors (or nodes) and M = number of blocks in memory, so it does not scale well with P. Assume a 64-byte cache line and a full bit vector per line: 64 nodes give 12.5% overhead, 256 nodes give 50%, and 1024 nodes give 200%. [Figure: the directory as an array of M lines, each P bits wide.]
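
The overhead figures follow directly from the ratio of directory bits to data bits per line:

\[
\text{overhead} = \frac{P \text{ bits}}{64 \times 8 \text{ bits}} = \frac{P}{512},
\qquad
\frac{64}{512} = 12.5\%,\quad
\frac{256}{512} = 50\%,\quad
\frac{1024}{512} = 200\%.
\]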

Flat, Memory-Based Schemes (cont'd). Reducing storage overhead: optimize the full bit vector scheme; use limited pointer schemes, which reduce the width of the directory; or use sparse directories, which reduce the height of the directory.

Flat, Memory-Based Schemes (cont'd). Full bit vector schemes have the best invalidation traffic, because the sharing information is exact. Optimizations for full bit vector schemes: increase the cache block size, which reduces the storage overhead proportionally (any problems with this approach?); or use multiprocessor nodes, keeping one bit per multiprocessor node rather than per processor. The overhead still scales as P * M, but larger machines can be tolerated: with 128-byte lines and 256 processors grouped into quad-processor nodes, the overhead is 6.25% (256/4 = 64 bits per 128-byte line, i.e., 1/16).

Flat, Memory-Based Schemes (cont'd). Limited pointer schemes. Observation: data resides in only a few caches at any time, so a limited number of pointers per directory entry should be sufficient. An overflow strategy is needed for the case where the number of sharers exceeds the number of available pointers; many different schemes exist, based on differing overflow strategies.

Flat, Memory-Based Schemes (cont'd). Overflow schemes for limited pointers. Broadcast (Dir_i B): a broadcast bit is turned on upon overflow, and invalidations are then broadcast to all nodes on a write. No-broadcast (Dir_i NB): on overflow, the new sharer replaces one of the old ones (and the old one is invalidated). Coarse vector (Dir_i CV): on overflow, the representation changes to a coarse vector with 1 bit per k nodes; on a write, all nodes corresponding to a set bit are invalidated. [Figure: a directory entry with an overflow bit and 2 pointers, and the same entry reinterpreted as an overflow bit plus an 8-bit coarse vector over 16 nodes.]
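
A minimal C sketch of a Dir_i CV style entry, with 2 exact pointers that degrade to a coarse vector on overflow. The machine size, field widths, and helper callback are illustrative assumptions, and a real implementation would reuse the same directory bits for both encodings rather than separate fields.

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_NODES    16                 /* assumed machine size        */
#define NUM_PTRS      2                 /* i = 2 exact pointers        */
#define CV_BITS       8                 /* coarse vector width         */
#define NODES_PER_BIT (NUM_NODES / CV_BITS)   /* 1 bit per k = 2 nodes */

typedef struct {
    bool    overflow;                   /* representation switched?    */
    uint8_t ptrs[NUM_PTRS];             /* exact sharer ids            */
    uint8_t nptrs;
    uint8_t coarse;                     /* coarse vector after overflow*/
} dir_entry_t;

/* Record a new sharer, switching to the coarse vector on overflow. */
void dir_add_sharer(dir_entry_t *e, uint8_t node)
{
    if (!e->overflow && e->nptrs < NUM_PTRS) {
        e->ptrs[e->nptrs++] = node;
        return;
    }
    if (!e->overflow) {                 /* overflow: re-encode pointers */
        e->overflow = true;
        e->coarse = 0;
        for (int i = 0; i < e->nptrs; i++)
            e->coarse |= 1u << (e->ptrs[i] / NODES_PER_BIT);
    }
    e->coarse |= 1u << (node / NODES_PER_BIT);
}

/* On a write, invalidate every node covered by the sharing info; after
 * overflow this over-approximates the true sharer set. */
void dir_invalidate_sharers(const dir_entry_t *e,
                            void (*send_inval)(uint8_t node))
{
    if (!e->overflow) {
        for (int i = 0; i < e->nptrs; i++)
            send_inval(e->ptrs[i]);
    } else {
        for (uint8_t n = 0; n < NUM_NODES; n++)
            if (e->coarse & (1u << (n / NODES_PER_BIT)))
                send_inval(n);
    }
}
```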

Flat, Memory-Based Schemes (cont'd). Overflow schemes (cont'd). Software (Dir_i SW): trap to software and use any number of pointers (no loss of precision); MIT Alewife uses 5 pointers plus one bit for the local node; the extra cost is interrupt processing in software, i.e., processor overhead and occupancy (latency of 40~425 cycles for a remote read in Alewife). Dynamic pointers (Dir_i DP): use pointers from a hardware free list in a portion of memory, with manipulation done by a hardware assist rather than software (e.g., Stanford FLASH).

Data on Invalidations. [Figure: normalized number of invalidations for the broadcast, no-broadcast, and coarse vector schemes; 64 processors, 4 pointers, normalized to the full bit vector.] The coarse vector shows quite robust performance. General conclusion: the full bit vector is simple and good for moderate-scale machines; several optimized schemes exist for large-scale machines, with no clear winner.

Flat, Memory-Based Schemes (cont'd). Reducing height: sparse directories. Observation: the total number of cache entries is much smaller than the total amount of memory, so most directory entries are idle most of the time (with a 1 MB cache and 64 MB of memory per node, 98.5% of entries are idle). Organize the directory as a cache: no backup store is needed, but invalidations must be sent to all sharers when an entry is replaced. One entry per line, so there is no spatial locality; the access patterns differ from a data cache's (requests come from many processors, but are filtered); SRAM can be used, which matters since the directory can be on the critical path; high associativity is needed, and the directory should be large enough. Width and height can be traded off.
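
A minimal C sketch of the replacement behavior that makes a backup store unnecessary: before a valid entry is evicted, every sharer it records is invalidated. The geometry, full-bit-vector format, and helper are assumptions, and a real design would use LRU rather than the simplistic victim choice here.

```c
#include <stdint.h>
#include <string.h>

#define SETS   1024                       /* assumed directory geometry */
#define WAYS      8
#define P        64                       /* assumed node count (<= 64) */

typedef struct {
    uint64_t tag;
    uint64_t presence;                    /* full bit vector            */
    uint8_t  valid, dirty;
} sparse_ent_t;

static sparse_ent_t dir[SETS][WAYS];

extern void send_invalidation(unsigned node, uint64_t block);

/* Find or allocate the entry for a block. Because there is no backup
 * store, evicting a valid victim must first invalidate its sharers. */
sparse_ent_t *sparse_lookup(uint64_t block)
{
    unsigned set = (unsigned)(block % SETS);
    sparse_ent_t *victim = &dir[set][0];  /* fallback victim: way 0     */

    for (int w = 0; w < WAYS; w++) {
        sparse_ent_t *e = &dir[set][w];
        if (e->valid && e->tag == block)
            return e;                     /* hit                        */
        if (!e->valid)
            victim = e;                   /* prefer an idle way         */
    }
    if (victim->valid) {                  /* replacement: shoot down    */
        for (unsigned n = 0; n < P; n++)  /* every recorded sharer      */
            if (victim->presence & (1ull << n))
                send_invalidation(n, victim->tag);
    }
    memset(victim, 0, sizeof *victim);
    victim->valid = 1;
    victim->tag = block;
    return victim;
}
```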

Flat, Cache-Based Schemes. Operations: the home holds only a pointer to the rest of the directory information; a distributed linked list of copies weaves through the caches, with each cache tag holding a pointer to the next cache that has a copy. On a read, the requester adds itself to the head of the list (communication is needed); on a write, a chain of invalidations is propagated down the list. The Scalable Coherent Interface (SCI) IEEE standard uses a doubly linked list.
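
A minimal C sketch of such a sharing list, using an SCI-style doubly linked list but none of SCI's actual packet formats or states; the node ids and the invalidation helper are illustrative.

```c
#include <stddef.h>

/* The home stores only the head pointer; each cached copy stores
 * forward and backward pointers to its neighbours in the list. */
typedef struct copy {
    unsigned     node;         /* cache holding this copy             */
    struct copy *next;         /* toward the tail of the list         */
    struct copy *prev;         /* toward the head / home              */
} copy_t;

typedef struct {
    copy_t *head;              /* the only state kept at the home     */
} home_entry_t;

extern void send_invalidation(unsigned node);

/* Read miss: the new sharer inserts itself at the head of the list.
 * In hardware this takes messages to the home and to the old head. */
void list_insert_head(home_entry_t *home, copy_t *me)
{
    me->prev = NULL;
    me->next = home->head;
    if (home->head)
        home->head->prev = me;
    home->head = me;
}

/* Write: walk the list and invalidate every other copy. Latency is
 * proportional to the number of sharers, because each copy must be
 * reached before the identity of the next one is known. */
void list_invalidate(home_entry_t *home, copy_t *writer)
{
    for (copy_t *c = home->head; c != NULL; c = c->next)
        if (c != writer)
            send_invalidation(c->node);
    home->head = writer;       /* writer becomes the only copy        */
    writer->next = writer->prev = NULL;
}
```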

Flat, Cache-Based Schemes (cont'd). Scaling properties. Traffic on a write is proportional to the number of sharers. Latency on a write is also proportional to the number of sharers, since the identity of the next sharer is not known until the current one is reached, and assist processing is required at each node along the way; even reads involve more than one other assist (the home and the first sharer in the list, in order to insert an entry). Storage overhead scales quite well: only one head pointer per memory block, and the rest is proportional to cache size (though it must be stored in SRAM). Other properties: good, a mature IEEE standard and fairness; bad, complexity.

Directory Organization. Flat schemes face three issues. Finding the source of directory data: go to the home, based on the address. Finding out where the copies are: memory-based schemes keep all the information in the directory at the home, while cache-based schemes keep a pointer at the home to the first element of a distributed linked list. Communicating with those copies: memory-based schemes use point-to-point messages, which can be multicast or overlapped; cache-based schemes use a point-to-point linked-list traversal to find them, which is serialized. Hierarchical schemes handle all three issues by sending messages up and down the tree; there is no single explicit list of sharers, and the only direct communication is between parents and children.

Directory Protocols. Directory approaches offer scalable coherence on general networks, with no need for broadcast media, and there are many possibilities for organizing the directory and managing the protocol. Hierarchical vs. flat directories: hierarchical directories are not used much, owing to high latency, many network transactions, and a bandwidth bottleneck near the root. Both memory-based and cache-based flat schemes are popular; for memory-based schemes, a full bit vector suffices for moderate-scale machines.

Summary. Scalable coherence schemes for large machines: hierarchical cache coherence protocols and directory-based cache coherence protocols. Hierarchical coherence: the inclusion property must be maintained. Directory-based coherence: the flat, memory-based directory is dominant; with a limited number of pointers, an overflow strategy is needed.