Lecture 7: Implementing Cache Coherence

Topics: implementation details

Implementing Coherence Protocols

- Correctness and performance are not the only metrics; the implementation must also avoid the following pathologies
- Deadlock: a cycle of resource dependencies, where each process holds shared resources in a non-preemptible fashion (see the sketch below)
- Livelock: similar to deadlock, but transactions continue to flow through the system without any process making forward progress
- Starvation: an extreme case of unfairness, where some process waits indefinitely
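For instance, the textbook two-lock cycle below deadlocks when each thread holds one lock while waiting for the other. This is a minimal illustration, not from the lecture; all names are made up.

```c
/* Minimal deadlock illustration: two threads acquire the same two
 * locks in opposite orders. If thread A holds lock1 and thread B
 * holds lock2, each waits forever for the other: a cycle of
 * non-preemptible resource dependencies. */
#include <pthread.h>

pthread_mutex_t lock1 = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t lock2 = PTHREAD_MUTEX_INITIALIZER;

void *thread_a(void *arg) {
    pthread_mutex_lock(&lock1);
    pthread_mutex_lock(&lock2);   /* may block forever */
    pthread_mutex_unlock(&lock2);
    pthread_mutex_unlock(&lock1);
    return 0;
}

void *thread_b(void *arg) {
    pthread_mutex_lock(&lock2);
    pthread_mutex_lock(&lock1);   /* may block forever */
    pthread_mutex_unlock(&lock1);
    pthread_mutex_unlock(&lock2);
    return 0;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, 0, thread_a, 0);
    pthread_create(&b, 0, thread_b, 0);
    pthread_join(a, 0);
    pthread_join(b, 0);
    return 0;
}
```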

Basic Implementation

- Assume a single level of cache and atomic bus transactions
- It is simpler to implement a processor-side cache controller that monitors requests from the processor and a bus-side cache controller that services the bus
- Both controllers are constantly trying to read tags, so tags can be duplicated (moderate area overhead); unlike data, tags are rarely updated, and a tag update stalls the other controller (sketched below)
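A minimal sketch of the duplicated tag arrays; the sizes, names, and layout are assumptions, not from the lecture:

```c
/* Sketch of duplicated tags: the processor-side and bus-side
 * controllers each probe a private copy of the tag array, so lookups
 * never contend; only the rare tag *update* (miss fill, state change)
 * must write both copies and stall the other controller. */
#define NUM_SETS 128
#define NUM_WAYS 2

typedef struct {
    unsigned tag;
    unsigned state;      /* e.g., a MESI state */
} tag_entry_t;

typedef struct {
    tag_entry_t proc_tags[NUM_SETS][NUM_WAYS];  /* read by processor side */
    tag_entry_t bus_tags[NUM_SETS][NUM_WAYS];   /* read by snoop side */
} dual_tag_array_t;

void update_tags(dual_tag_array_t *t, int set, int way, tag_entry_t e) {
    /* in hardware, both controllers are stalled during this update */
    t->proc_tags[set][way] = e;
    t->bus_tags[set][way]  = e;
}
```

The duplication pays off because lookups dominate: each controller probes its own copy every cycle, while updates are rare enough that briefly stalling both controllers is acceptable.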

Reporting Snoop Results

- Uniprocessor system: the initiator places an address on the bus, all devices monitor the address, one device acks by raising a wired-OR signal, and data is transferred
- In a multiprocessor, memory has to wait for the snoop result before it chooses to respond; this requires three wired-OR signals: (i) a cache has a copy, (ii) a cache has a modified copy, (iii) the snoop has not completed yet (modeled below)
- Ensuring timely snoops: the time to respond could be fixed or variable (with the third wired-OR signal indicating "not yet done"), or the memory could track whether any cache holds the block in M state
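A software model of the three result lines, with assumed names; in hardware each line is a wired-OR that any cache can assert:

```c
/* Model of the three wired-OR snoop-result lines: the bus sees the OR
 * of every cache's contribution, computed here with a loop. */
#include <stdbool.h>

#define NUM_CACHES 4

typedef struct {
    bool has_copy;       /* cache holds the snooped block (any state) */
    bool has_modified;   /* cache holds it in M state */
    bool snoop_pending;  /* this cache hasn't finished its tag lookup */
} snoop_reply_t;

typedef struct {
    bool shared;    /* wired-OR line (i) */
    bool dirty;     /* wired-OR line (ii) */
    bool not_done;  /* wired-OR line (iii) */
} snoop_result_t;

snoop_result_t collect_snoop(const snoop_reply_t replies[NUM_CACHES]) {
    snoop_result_t r = { false, false, false };
    for (int i = 0; i < NUM_CACHES; i++) {
        r.shared   |= replies[i].has_copy;
        r.dirty    |= replies[i].has_modified;
        r.not_done |= replies[i].snoop_pending;
    }
    /* Memory responds only once not_done is clear and dirty is clear;
       if dirty is set, the owning cache supplies the data instead. */
    return r;
}
```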

Non-Atomic State Transitions

- Note that a cache controller's actions are not all atomic: tag look-up, bus arbitration, bus transaction, data/tag update
- Consider this: block A is in shared state in P1 and P2; both issue a write; both bus controllers are ready to issue an upgrade request and try to acquire the bus; is there a problem?
- The controller can keep track of additional intermediate states so it can react to bus traffic while its own request is pending (e.g., S→M, I→M, I→S/E); see the sketch below
- Alternatively, eliminate the upgrade request: use the shared wire to suppress memory's response to an exclusive read
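A sketch of how such transient states might look in a MESI controller; the state names and the handler interface are assumptions:

```c
/* Transient states let the snoop side react to bus traffic that
 * arrives before the controller's own pending transaction completes. */
typedef enum {
    I, S, E, M,          /* stable MESI states */
    SM_PENDING,          /* S -> M: upgrade issued, not yet on bus */
    IM_PENDING,          /* I -> M: read-exclusive issued */
    ISE_PENDING          /* I -> S or E: read issued; final state
                            depends on the snoop's "shared" line */
} line_state_t;

/* Snoop-side reaction when another processor's exclusive request for
 * the same block is observed on the bus. */
line_state_t on_other_exclusive(line_state_t cur) {
    switch (cur) {
    case SM_PENDING: return IM_PENDING;  /* lost the race; must re-fetch */
    case S: case E:  return I;           /* plain invalidation */
    case M:          return I;           /* also flush the dirty data */
    default:         return cur;
    }
}
```

In the two-upgrade race above, the loser's SM_PENDING entry is downgraded to IM_PENDING: its shared copy has just been invalidated, so the upgrade must be reissued as a full read-exclusive.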

Serialization

- Write serialization is an important requirement for coherence and sequential consistency: writes must be seen by all processors in the same order
- On a write, the processor hands the request to the cache controller, and some time elapses before the bus transaction happens (i.e., before the external world sees the write)
- If the writing processor continued execution after handing the write to the controller, different processors might observe different write orders; hence, the processor is not allowed to continue until the write has completed (modeled below)
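A toy model of this handshake, with all names assumed; in real hardware the completion signal comes from the bus-side controller once the transaction has appeared on the bus:

```c
/* Toy model: the processor does not proceed past a store until the
 * cache controller reports that the corresponding bus transaction has
 * occurred, i.e., the write has been serialized. */
#include <stdbool.h>

static volatile bool bus_txn_done;

static void hand_write_to_controller(unsigned addr, unsigned data) {
    (void)addr; (void)data;
    /* ...controller arbitrates for the bus and issues the upgrade or
       invalidate; in this model the transaction finishes immediately... */
    bus_txn_done = true;
}

static void processor_store(unsigned addr, unsigned data) {
    bus_txn_done = false;
    hand_write_to_controller(addr, data);
    while (!bus_txn_done)
        ;  /* stall: letting the processor run ahead here could let
              different processors observe different write orders */
}
```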

Livelock

- Livelock can happen if the processor-cache handshake is not designed correctly
- Before the processor can attempt its write, it must acquire the block in exclusive state
- If all processors are writing to the same block, one of them acquires the block first; if another exclusive request is then seen on the bus, the cache controller must let the processor complete its write before releasing the block (see the sketch below); otherwise, the processor's write will fail again because the block would be back in invalid state
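A sketch of the handshake that prevents this livelock, with assumed names: a snooped invalidation for a block with a pending local write is held off until that one write completes. Without this rule, processors could endlessly steal the block from each other with no one ever writing.

```c
#include <stdbool.h>

typedef struct {
    bool write_pending;   /* processor has a store waiting on this block */
    bool exclusive;       /* block currently held in M/E state */
} block_ctl_t;

/* Called when another processor's exclusive request is snooped. */
void on_snooped_invalidate(block_ctl_t *b) {
    if (b->exclusive && b->write_pending) {
        /* perform the one pending local write first... */
        b->write_pending = false;
    }
    /* ...and only then give up the block */
    b->exclusive = false;
}
```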

Atomic Instructions

- A test&set instruction acquires the block in exclusive state and does not release the block until both the read and the write have completed
- Should an LL (load-linked) bring the block in exclusive state to avoid bus traffic during the SC (store-conditional)?
- Note that for the SC to succeed, a bit associated with the cache block must still be set (the bit is reset when a write to that block is observed or when the block is evicted); see the sketch below
- What happens if an instruction between the LL and SC always causes the LL-SC block to be replaced?
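A sketch of LL/SC semantics with assumed names: a per-processor "link" bit guards the SC, and any observed write to the block, or an eviction of the block, clears the bit and makes the SC fail.

```c
#include <stdbool.h>

static bool     link_valid;
static unsigned link_addr;

unsigned mem[1024];   /* toy memory, indexed by block address */

unsigned load_linked(unsigned addr) {
    link_valid = true;
    link_addr  = addr;
    return mem[addr];
}

/* Called by the snoop logic on an observed write or an eviction. */
void clear_link_on(unsigned addr) {
    if (link_valid && link_addr == addr)
        link_valid = false;
}

/* Returns true iff the store succeeded. */
bool store_conditional(unsigned addr, unsigned val) {
    if (!link_valid || link_addr != addr)
        return false;
    mem[addr] = val;
    link_valid = false;
    return true;
}

/* Typical use: atomic increment retries until the SC succeeds. */
void atomic_inc(unsigned addr) {
    unsigned v;
    do {
        v = load_linked(addr);
    } while (!store_conditional(addr, v + 1));
}
```

This also answers the last bullet: if some instruction between the LL and the SC always evicts the linked block, clear_link_on fires on every iteration and the retry loop never terminates.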

Multilevel Cache Hierarchies

- Ideally, the snooping protocol employed for the L2 must be duplicated for the L1; this is redundant work because of blocks common to L1 and L2
- Inclusion greatly simplifies the implementation

Maintaining Inclusion

- Assuming equal block size, if the L1 is 8KB 2-way and the L2 is 256KB 8-way, is the hierarchy inclusive? (Assume that an L1 miss brings the block into both L1 and L2.)
- Assuming equal block size, if the L1 is 8KB direct-mapped and the L2 is 256KB 8-way, is the hierarchy inclusive?
- To maintain inclusion, an L2 replacement must also evict the relevant blocks in L1 (a back-invalidation); the arithmetic below shows why this is needed
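To see why inclusion is not automatic, work out the set counts, assuming (an assumption not stated on the slide) 64-byte blocks: the L1 has 8KB / (64B x 2 ways) = 64 sets, while the L2 has 256KB / (64B x 8 ways) = 512 sets, so the two caches index with different address bits. More importantly, an L1 hit does not update the L2's replacement state, so the L2's LRU policy can choose a victim that is still live (and recently used) in the L1. In neither configuration is inclusion automatic; it must be enforced with back-invalidations.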

Intra-Hierarchy Protocol

- Some coherence traffic needs to be propagated up to the L1; likewise, L1 write traffic needs to be propagated down to the L2
- What is the best way to implement the above? More traffic? More state?
- In general, external requests propagate upward from L3 to L1, and processor requests percolate down from L1 to L3
- Dual tags are not as important here, because the L2 can filter out bus transactions for the L1, and the L1 can filter out processor requests for the L2

Split Transaction Bus

- What would it take to implement the protocol correctly on a split-transaction bus?
- Split-transaction bus: a cache puts out a request and releases the bus (so others can use it), then receives its response much later
- Assumptions: only one request per block can be outstanding (enforced by the request table sketched below); separate lines for address (request) and data (response)
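A sketch of a bus-wide request table; the sizes and names are assumptions (the SGI Challenge described later allows eight outstanding requests). Every controller keeps an identical copy so that no one issues a second request for a block with one already outstanding, and so that responses, which carry a small tag instead of a full address, can be matched back to their requests.

```c
#include <stdbool.h>

#define TABLE_ENTRIES 8

typedef struct {
    bool     valid;
    unsigned addr;        /* block address of the outstanding request */
    unsigned requester;   /* which controller issued it */
} req_entry_t;

static req_entry_t req_table[TABLE_ENTRIES];

/* Returns the table slot (the "tag" carried by the response), or -1 if
 * the block already has an outstanding request or the table is full. */
int try_issue(unsigned addr, unsigned requester) {
    int free_slot = -1;
    for (int i = 0; i < TABLE_ENTRIES; i++) {
        if (req_table[i].valid && req_table[i].addr == addr)
            return -1;                    /* conflict: must wait */
        if (!req_table[i].valid && free_slot < 0)
            free_slot = i;
    }
    if (free_slot >= 0)
        req_table[free_slot] = (req_entry_t){ true, addr, requester };
    return free_slot;                     /* -1 if the table is full */
}

/* On response: free the entry so a new request for the block may issue. */
void on_response(int slot) {
    req_table[slot].valid = false;
}
```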

Split Transaction Bus

[Figure: three processors (Proc 1, Proc 2, Proc 3), each with its own cache, attached to a bus with separate request (address) lines and response (data) lines]

Design Issues

- When does the snoop complete? What if the snoop takes a long time?
- What if the buffer in a processor/memory is full? When does the buffer release an entry? Are the buffers identical?
- How does each processor ensure that a block does not have multiple outstanding requests?
- What determines the write order: requests or responses?

Design Issues II

- What happens if a processor is arbitrating for the bus and witnesses another bus transaction for the same address?
- If a processor issues a read miss and there is already a matching read in the request table, can we reduce bus traffic?
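One standard technique (not spelled out on the slide) answers the second question: the later requester can note the matching entry in the request table and simply grab, or "snarf", the response off the bus when it appears, so the block is transferred once but fills both caches.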

Shared Cache Designs

- There are benefits to sharing the first-level cache among many processors (for example, in a CMP):
  - no coherence protocol is needed at this level
  - low-cost communication between processors
  - better prefetching (one processor can prefetch data for another)
  - working-set overlap allows the shared cache to be smaller than the combined size of private caches, and improves utilization
- Disadvantages:
  - high contention for cache ports
  - longer hit latency (size and proximity)
  - more conflict misses

TLBs

- Recall that a TLB caches virtual-to-physical page translations
- While swapping a page out, can we have a problem in a multiprocessor system? All matching entries in every processor's TLB must be removed
- TLB shootdown (sketched below): the initiating processor sends a request to the other processors asking them to invalidate the page-table entry in their TLBs
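A sketch of the shootdown sequence, with all names assumed. On real hardware the request is typically delivered as an inter-processor interrupt; here it is modeled as a direct call so the sketch is self-contained.

```c
#include <stdbool.h>

#define NUM_CPUS 4

static volatile bool ack[NUM_CPUS];

static void local_tlb_invalidate(int cpu, unsigned vpn) {
    (void)cpu; (void)vpn;   /* platform-specific single-entry TLB flush */
}

/* Handler run on a remote CPU when the shootdown request arrives. */
static void shootdown_handler(int cpu, unsigned vpn) {
    local_tlb_invalidate(cpu, vpn);
    ack[cpu] = true;
}

static void send_shootdown_ipi(int cpu, unsigned vpn) {
    shootdown_handler(cpu, vpn);   /* modeled synchronously here */
}

void tlb_shootdown(int my_cpu, unsigned vpn) {
    /* 1. the initiator has already removed/changed the page-table entry */
    for (int cpu = 0; cpu < NUM_CPUS; cpu++) {
        if (cpu == my_cpu) continue;
        ack[cpu] = false;
        send_shootdown_ipi(cpu, vpn);
    }
    local_tlb_invalidate(my_cpu, vpn);
    /* 2. the page frame may be reused only after every CPU has acked */
    for (int cpu = 0; cpu < NUM_CPUS; cpu++)
        if (cpu != my_cpu)
            while (!ack[cpu])
                ;  /* spin until acknowledged */
}
```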

Case Study: SGI Challenge

- Supports 18 or 36 MIPS processors
- Employs a 1.2 GB/s, 47.6 MHz system bus (Powerpath-2)
- The bus is non-multiplexed, with 256-bit-wide data, a 40-bit-wide address, plus 33 other signals
- Split-transaction, supporting eight outstanding requests
- Employs the MESI protocol by default; also supports update transactions

Processor Board

- Each board has four processors (to reduce the number of slots on the bus from 36 to 9)
- The A-chip holds the request tables, arbitration logic, etc.

[Figure: four MIPS processors, each with an L2 cache and a Tags/CC (cache-controller) chip, all connecting to a shared A-chip (address) and D-chip (data) that interface to the bus]

Latencies

- 75 ns for an L2 cache hit
- 300 ns for a cache miss to percolate down to the A-chip
- An additional 400 ns for the data to be delivered to the D-chips across the bus (includes a 250 ns memory latency)
- Another 300 ns for the data to reach the processor
- Note that the system bus can carry 256 bits of data, while the CC-chip-to-processor interface can handle only 64 bits at a time
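Summing the slide's numbers, a miss serviced by memory costs roughly 300 + 400 + 300 = 1000 ns end to end, more than 13 times the 75 ns L2 hit latency.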

Sun Enterprise 6000

- Supports 30 UltraSPARC processors
- 2.67 GB/s, 83.5 MHz Gigaplane system bus
- Non-multiplexed bus with 256 bits of data, 41 bits of address, and 91 bits of control, error correction, etc.
- Split-transaction bus with up to 112 outstanding requests
- Each node speculatively drives the bus (in parallel with arbitration)
- L2 hits take 40 ns; memory access takes 300 ns
