Lecture 7: PCM Wrap-Up, Cache coherence. Topics: handling PCM errors and writes, cache coherence intro



Optimizations for Writes (Energy, Lifetime)
- Read a line before writing and only write the modified bits (Zhou et al., ISCA'09)
- Write either the line or its inverted version, whichever causes fewer bit-flips (Cho and Lee, MICRO'09)
- Only write dirty lines in a PCM page (when a page is evicted from a DRAM cache) (Lee et al., Qureshi et al., ISCA'09)
- When a page is brought in from disk, place it only in the DRAM cache and move it to PCM upon eviction (Qureshi et al., ISCA'09)
- Wear-leveling: rotate every new page, shift a row periodically, swap segments (Zhou et al., Qureshi et al., ISCA'09)
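The inverted-write idea (Cho and Lee's scheme) can be sketched in a few lines; the function name, integer line encoding, and flag representation are illustrative assumptions, not the paper's implementation:

```python
def flip_n_write(old_line: int, new_line: int, width: int = 512):
    """Store new_line either as-is or inverted, whichever differs from the
    currently stored old_line in fewer bits. One extra flag bit records the
    choice so a reader can undo the inversion. (Illustrative sketch.)"""
    mask = (1 << width) - 1
    inverted = new_line ^ mask
    flips_plain = bin((old_line ^ new_line) & mask).count("1")
    flips_inverted = bin((old_line ^ inverted) & mask).count("1")
    if flips_inverted < flips_plain:
        return inverted, 1, flips_inverted  # store inverted data, set flip bit
    return new_line, 0, flips_plain
```

A reader recovers the data as `stored ^ mask` when the flip bit is set; for example, writing 0xFF over an all-zero 8-bit line then costs zero bit-flips instead of eight.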

Hard Error Tolerance in PCM
- PCM cells will eventually fail; it is important that capacity degrades gradually when this happens
- Pairing: from the pool of faulty pages, pair two pages that have faults in different locations and replicate data across the two pages (Ipek et al., ASPLOS'10)
- Errors are detected with parity bits; replica reads are issued if the initial read is faulty
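The pairing step can be sketched as a greedy matching over fault maps; the function name and the page-id/fault-set representation are assumptions for illustration, not the paper's algorithm:

```python
def pair_pages(faulty: dict) -> list:
    """Greedy sketch of fault-aware page pairing: two faulty pages can back
    each other up if no bit position is bad in both, so a good copy of every
    bit exists on at least one page of the pair. faulty maps page-id -> set
    of failed bit positions. Pages left unmatched stay decommissioned."""
    unmatched, pairs = list(faulty), []
    while unmatched:
        a = unmatched.pop()
        for b in unmatched:
            if not (faulty[a] & faulty[b]):   # fault sets are disjoint
                unmatched.remove(b)
                pairs.append((a, b))
                break
    return pairs
```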

ECP (Schechter et al., ISCA'10)
- Instead of using ECC to handle a few transient faults (as in DRAM), use error-correcting pointers to handle hard errors in specific locations
- For a 512-bit line with one failed bit, maintain a 9-bit field to track the failed location and another bit to store the value that belongs in that location
- Can store multiple such pointers and can recover from faults in the pointers too
- ECC has similar storage overhead and can also handle soft errors, but ECC has high entropy and can hasten wearout
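The pointer mechanism can be sketched as follows; the dict-based fault model and function names are illustrative simplifications, not ECP's actual encoding:

```python
def ecp_encode(data: int, stuck: dict):
    """For each failed cell, keep a pointer: (bit position, the bit value
    that belongs there). stuck maps position -> the value the cell is stuck
    at; those cells hold their stuck value no matter what we write."""
    pointers = [(pos, (data >> pos) & 1) for pos in stuck]
    stored = data
    for pos, val in stuck.items():
        stored = (stored & ~(1 << pos)) | (val << pos)  # fault corrupts the cell
    return stored, pointers

def ecp_decode(stored: int, pointers) -> int:
    """On a read, patch each failed position with the bit saved in its pointer."""
    for pos, bit in pointers:
        stored = (stored & ~(1 << pos)) | (bit << pos)
    return stored
```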

SAFER (Seong et al., MICRO 2010)
- Most PCM hard errors are stuck-at faults (stuck-at-0 or stuck-at-1)
- Either write the word or its flipped version, so that the failed bit is made to store its stuck-at value
- For multi-bit errors, the line can be partitioned such that each partition has a single error
- Errors are detected by verifying a write; recently failed bit locations are cached so repeated write attempts can be avoided
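For one partition with a single stuck-at cell, the flip trick reduces to the following sketch (the function names and single-partition simplification are mine, not SAFER's full design):

```python
def safer_write(data: int, stuck_pos: int, stuck_val: int, width: int = 32):
    """Write data unchanged if the stuck cell already agrees with it;
    otherwise write the bitwise complement, so the stuck-at value becomes
    the correct bit. A per-partition flip bit records the choice."""
    if (data >> stuck_pos) & 1 == stuck_val:
        return data, 0
    return data ^ ((1 << width) - 1), 1

def safer_read(stored: int, flip_bit: int, width: int = 32) -> int:
    """Undo the inversion (if any) on the way out."""
    return stored ^ ((1 << width) - 1) if flip_bit else stored
```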

FREE-p (Yoon et al., HPCA 2011)
- When a PCM block (64B) is unusable because the number of hard errors has exceeded the ECC capability, it is remapped to another address; the pointer to this new address is stored in the failed block itself; needs another bit per block
- The pointer can be replicated many times within the failed block to tolerate the multiple errors in that block
- Handling a failed block requires two accesses; this overhead can be reduced by caching the pointer at the memory controller
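The remap-and-cache flow can be sketched as follows; the class layout and the dict standing in for the replicated in-block pointer are illustrative assumptions:

```python
class FreePMemory:
    """Sketch of FREE-p-style block retirement: a worn-out block holds a
    forwarding pointer to its replacement; the memory controller caches
    pointers so repeated remapped reads cost one access, not two."""
    def __init__(self):
        self.blocks = {}      # address -> data
        self.forward = {}     # failed address -> replacement address
        self.ptr_cache = {}   # memory-controller pointer cache

    def retire(self, bad_addr, new_addr):
        self.forward[bad_addr] = new_addr  # pointer lives in the dead block

    def read(self, addr):
        """Returns (data, number of memory accesses performed)."""
        if addr in self.ptr_cache:                      # cached pointer: 1 access
            return self.blocks[self.ptr_cache[addr]], 1
        if addr in self.forward:                        # chase pointer: 2 accesses
            self.ptr_cache[addr] = self.forward[addr]
            return self.blocks[self.forward[addr]], 2
        return self.blocks.get(addr), 1                 # healthy block
```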

Multi-Core Cache Organizations (I)
- Private L1 caches
- Shared L2 cache
- Bus between the L1s and a single L2 cache controller
- Snooping-based coherence between L1s

Multi-Core Cache Organizations (II)
- Private L1 caches
- Shared L2 cache, but physically distributed
- Bus connecting the four L1s and the four L2 banks
- Snooping-based coherence between L1s

Multi-Core Cache Organizations (III)
- Private L1 caches
- Shared L2 cache, but physically distributed
- Scalable network
- Directory-based coherence between L1s

Multi-Core Cache Organizations (IV)
- Private L1 caches
- Private L2 caches
- Scalable network
- Directory-based coherence between L2s (through a separate directory)

Shared Memory vs. Message Passing
- Shared memory: a single copy of (shared) data in memory; threads communicate by reading and writing to a shared location
- Message passing: each thread has a copy of data in its own private memory that other threads cannot access; threads communicate by passing values with SEND/RECEIVE message pairs
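The two models can be contrasted in a few lines of Python; the variable names and the use of a `queue.Queue` as the SEND/RECEIVE channel are illustrative choices:

```python
import threading
import queue

# Shared-memory style: both sides touch one location, guarded by a lock.
shared = {"x": 0}
lock = threading.Lock()

def sm_worker():
    with lock:
        shared["x"] += 1

# Message-passing style: state is private; values move only via the channel.
def producer(ch):
    ch.put(41)                 # SEND

def consumer(ch, out):
    out.append(ch.get() + 1)   # RECEIVE (blocks until a value arrives)

ch, out = queue.Queue(), []
threads = [threading.Thread(target=sm_worker),
           threading.Thread(target=producer, args=(ch,)),
           threading.Thread(target=consumer, args=(ch, out))]
for t in threads:
    t.start()
for t in threads:
    t.join()
```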

Cache Coherence
A multiprocessor system is cache coherent if:
- a value written by a processor is eventually visible to reads by other processors (write propagation)
- two writes to the same location by two processors are seen in the same order by all processors (write serialization)
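Write serialization can be checked mechanically: every processor must see the distinct written values appear for the first time in the same order. A deliberately simplified checker (the name and trace format are mine):

```python
def writes_serialized(per_proc_reads) -> bool:
    """per_proc_reads: for each processor, the sequence of values it read
    from one location. This simplified check requires all processors to
    agree on the order in which distinct values first appeared (it assumes
    every processor eventually observes every write)."""
    def first_seen(seq):
        seen, order = set(), []
        for v in seq:
            if v not in seen:
                seen.add(v)
                order.append(v)
        return order
    orders = [first_seen(s) for s in per_proc_reads]
    return all(o == orders[0] for o in orders)
```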

Cache Coherence Protocols
- Directory-based: a single location (the directory) keeps track of the sharing status of a block of memory
- Snooping: every cache block is accompanied by its sharing status; all cache controllers monitor the shared bus so they can update a block's sharing status when necessary
- Write-invalidate: a processor gains exclusive access to a block before writing by invalidating all other copies
- Write-update: when a processor writes, it updates the other shared copies of that block

Protocol I: MSI
- A 3-state, write-back, invalidation-based, bus-based snooping protocol
- Each block can be in one of three states: invalid, shared, or modified (exclusive)
- A processor must acquire the block in exclusive state in order to write to it; this is done by placing an exclusive read request on the bus, which invalidates every other cached copy
- When some other processor tries to read an exclusive block, the block is demoted to shared
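The transitions above can be sketched as bus-triggered state updates over a list of caches (the dict representation and function names are illustrative, not a full protocol implementation):

```python
I, S, M = "Invalid", "Shared", "Modified"

def msi_write(caches, writer):
    """The writer broadcasts an exclusive read request (BusRdX): every
    other cached copy is invalidated and the writer moves to Modified."""
    for c in caches:
        if c is not writer:
            c["state"] = I
    writer["state"] = M

def msi_read(caches, reader):
    """A read demotes a remote Modified copy to Shared (that cache supplies
    the data) and gives the reader a Shared copy."""
    for c in caches:
        if c is not reader and c["state"] == M:
            c["state"] = S
    if reader["state"] == I:
        reader["state"] = S
```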

Design Issues, Optimizations
- When does memory get updated? On a demotion from modified to shared? On a move from modified in one cache to modified in another?
- Who responds with data: memory, or the cache that has the block in exclusive state? Does it help if sharers respond?
- We can assume that bus, memory, and cache state transactions are atomic; if not, we will need more states
- A transition from shared to modified only requires an upgrade request and no transfer of data

Reporting Snoop Results
- In a multiprocessor, memory has to wait for the snoop result before it chooses to respond; this needs three wired-OR signals: (i) a cache has a copy, (ii) a cache has a modified copy, (iii) the snoop has not yet completed
- Ensuring timely snoops: the time to respond can be fixed or variable (using the third wired-OR signal)
- Tags are usually duplicated because they are frequently accessed by both the processor (regular loads/stores) and the bus (snoops)
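The three wired-OR lines behave like ORs over per-cache contributions; a minimal sketch, with the state-letter and flag representations assumed for illustration:

```python
def snoop_lines(states, snoop_done):
    """states: per-cache MSI state letters for the snooped block.
    snoop_done: per-cache flags saying that cache finished its snoop.
    Returns the three wired-OR results:
    (copy exists, modified copy exists, snoop still pending)."""
    has_copy = any(s in ("S", "M") for s in states)
    has_modified = any(s == "M" for s in states)
    pending = not all(snoop_done)
    return has_copy, has_modified, pending
```

Memory responds only when the third line drops and the second line is low; otherwise the cache with the modified copy supplies the data.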

4- and 5-State Protocols
- Multiprocessors also execute many single-threaded programs
- A read followed by a write generates bus transactions to acquire the block in exclusive state even though there are no sharers (this leads to the MESI protocol)
- Also, to promote cache-to-cache sharing, one cache must be designated as the responder (this leads to the MOESI protocol)
- Note that we can optimize protocols by adding more states, but this increases design/verification complexity

MESI Protocol
- The new state is exclusive-clean: the cache can service read requests, and no other cache has the same block
- When the processor attempts a write, the block is upgraded to exclusive-modified without generating a bus transaction
- When a processor makes a read request, it must detect whether it has the only cached copy; the interconnect must include an additional signal that is asserted by each cache holding a valid copy of the block
- When another copy is silently evicted, a remaining block may in fact be exclusive-clean, but it will not realize it
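The two MESI-specific decisions (fill state on a read miss; silent E-to-M upgrade) can be sketched as follows; the bus-transaction names follow common convention and are assumptions, not this lecture's notation:

```python
def mesi_fill_state(other_has_copy: bool) -> str:
    """On a read miss, the wired 'shared' signal decides the fill state:
    exclusive-clean (E) if no other cache asserted it, else Shared (S)."""
    return "S" if other_has_copy else "E"

def mesi_write(state: str):
    """Returns (new state, bus transaction needed, if any)."""
    if state in ("E", "M"):
        return "M", None          # silent upgrade: no bus transaction
    if state == "S":
        return "M", "BusUpgr"     # upgrade request, no data transfer
    return "M", "BusRdX"          # Invalid: full read-exclusive
```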

MOESI Protocol
- The first reader or the last writer is usually designated as the owner of a block
- The owner is responsible for responding to requests from other caches
- There is no need to update memory when a block transitions from the M state to the S state (it moves to O instead)
- The cache holding the block in the O state is responsible for writing back the dirty block when it is evicted
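The owner's role can be sketched with two events; the list-of-state-letters representation and function names are illustrative:

```python
def moesi_remote_read(states, holder, reader):
    """A remote read of a Modified block makes its cache the Owner: the
    dirty data stays there (memory is not updated) and the reader gets a
    Shared copy."""
    if states[holder] == "M":
        states[holder] = "O"
    states[reader] = "S"
    return states

def moesi_evict(states, idx, writebacks):
    """Evicting an Owned (or Modified) copy is what finally writes the
    dirty data back to memory."""
    if states[idx] in ("O", "M"):
        writebacks.append(idx)
    states[idx] = "I"
    return states
```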
