
Token Coherence Milo M. K. Martin Dissertation Defense Wisconsin Multifacet Project http://www.cs.wisc.edu/multifacet/ University of Wisconsin-Madison (C) 2003 Milo Martin

Overview Technology and software trends are changing multiprocessor design: workload trends → snooping protocols; technology trends → directory protocols. Three desired attributes: fast cache-to-cache misses, no bus-like interconnect, bandwidth efficiency (moderate). Our approach: Token Coherence. Fast: directly respond to unordered requests (1, 2). Correct: count tokens, prevent starvation. Efficient: use prediction to reduce request traffic (3). slide 2

Key Insight Goal of invalidation-based coherence. Invariant: many readers -or- a single writer, enforced by globally coordinated actions. Key insight: enforce this invariant directly using tokens. Fixed number of tokens per block; one token to read, all tokens to write. Guarantees safety in all cases: a global invariant enforced with only local rules, independent of races, request ordering, etc. slide 3

Contributions 1. Token counting rules for enforcing safety. 2. Persistent requests for preventing starvation. 3. Decoupling correctness and performance in cache coherence protocols (correctness substrate + performance policy). 4. Exploration of three performance policies. slide 4

Outline Motivation: Three Desirable Attributes. Fast but Incorrect Approach. Correctness Substrate: Enforcing Safety with Token Counting, Preventing Starvation with Persistent Requests. Performance Policies: TokenB, TokenD, TokenM. Methods and Evaluation. Related Work. Contributions. slide 5

Motivation: Three Desirable Attributes Low-latency cache-to-cache misses No bus-like interconnect Bandwidth efficient Dictated by workload and technology trends slide 6

Workload Trends Commercial workloads have many cache-to-cache misses; clusters of small multiprocessors. Goals: direct cache-to-cache misses (2 hops, not 3 hops), moderate scalability. [Diagram: a directory protocol resolves a cache-to-cache miss in 3 hops through the directory; a snooping protocol does it in 2 hops.] Workload trends → snooping protocols. slide 7

Workload Trends Low-latency cache-to-cache misses No bus-like interconnect Bandwidth efficient slide 8

Workload Trends Snooping Protocols Low-latency cache-to-cache misses (Yes: direct request/response) No bus-like interconnect (No: requires a "virtual bus") Bandwidth efficient (No: broadcast always) slide 9

Technology Trends High-speed point-to-point links, no (multi-drop) buses. Increasing design integration: glueless multiprocessors improve cost & latency. Desire: a low-latency interconnect that avoids "virtual bus" ordering, as enabled by directory protocols. Technology trends → unordered interconnects. slide 10

Technology Trends Low-latency cache-to-cache misses No bus-like interconnect Bandwidth efficient slide 11

Technology Trends Directory Protocols Low-latency cache-to-cache misses (No: indirection through directory) No bus-like interconnect (Yes: no ordering required) Bandwidth efficient (Yes: avoids broadcast) slide 12

Goal: All Three Attributes Step #1: low-latency cache-to-cache misses and no bus-like interconnect. Step #2: bandwidth efficiency. slide 13

Outline Motivation: Three Desirable Attributes Fast but Incorrect Approach Correctness Substrate Enforcing Safety with Token Counting Preventing Starvation with Persistent Requests Performance Policies TokenB TokenD TokenM Methods and Evaluation Related Work Contributions slide 14

Basic Approach Fast cache-to-cache misses: broadcast with direct responses, as in snooping protocols. Fast interconnect: use an unordered interconnect, as in directory protocols (low latency, high bandwidth, low cost). [Diagram: 2-hop request/response among processors and memory.] Fast, and works fine with no races... but what happens in the case of a race? slide 15

Basic approach but not yet correct [Diagram: P0 (No Copy), P1 (No Copy), P2 (Read/Write).] P0 issues a request to write (delayed in the interconnect on its way to P2); P1 issues a request to read. slide 16

Basic approach but not yet correct P2 responds with data to P1 (P2: Read/Write → Read-only; P1: No Copy → Read-only). slide 17

Basic approach but not yet correct P0's delayed request to write arrives at P2. slide 18

Basic approach but not yet correct P2 responds to P0 (P2: Read-only → No Copy; P0: No Copy → Read/Write). slide 19

Basic approach but not yet correct Problem: P0 (Read/Write) and P1 (Read-only) are now in inconsistent states. Locally correct operation, globally inconsistent. slide 20

Outline Motivation: Three Desirable Attributes Fast but Incorrect Approach Correctness Substrate Enforcing Safety with Token Counting Preventing Starvation with Persistent Requests Performance Policies TokenB TokenD TokenM Methods and Evaluation Related Work Contributions slide 21

Enforcing Safety with Token Counting Definition of safety: all reads and writes are coherent, i.e., they maintain the coherence invariant; the processor uses this property to enforce consistency. Approach: token counting. Associate a fixed number of tokens with each block; at least one token to read, all tokens to write; tokens live in memory, caches, and messages. Present the rules as successive refinement... but first, revisit the example. slide 22

Token Coherence Example [Diagram: P0 (T=0), P1 (T=0), P2 (T=16, R/W, holding the maximum number of tokens).] P0 issues a request to write (delayed in the interconnect on its way to P2); P1 issues a request to read. slide 23

Token Coherence Example P2 responds with data and one token to P1 (P1: T=1, R; P2: T=15, R). slide 24

Token Coherence Example P0's delayed request to write arrives at P2. slide 25

Token Coherence Example P2 responds to P0 with data and its 15 tokens. slide 26

Token Coherence Example Resulting state: P0 (T=15, R), P1 (T=1, R), P2 (T=0). slide 27

Token Coherence Example State: P0 (T=15, R), P1 (T=1, R), P2 (T=0). Now what? (P0 still wants all tokens.) Before addressing the starvation issue, more depth on safety. slide 28

Simple Rules Conservation of Tokens: Components do not create or destroy tokens. Write Rule: A processor can write a block only if it holds all the block s tokens. Read Rule: A processor can read a block only if it holds at least one token. Data Transfer Rule: A message with one or more tokens must contain data. slide 29
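
To make the four rules concrete, here is a minimal sketch in Python of how a component (cache or memory controller) might enforce them. The 16-token count, message format, and class names are illustrative assumptions, not taken from the dissertation:

```python
TOKENS_PER_BLOCK = 16  # fixed per block at design time (illustrative)

class Component:
    """A cache or memory controller holding tokens and data per block."""
    def __init__(self):
        self.tokens = {}  # block address -> tokens held
        self.data = {}    # block address -> data contents

    def can_read(self, block):
        # Read Rule: at least one token is required to read
        return self.tokens.get(block, 0) >= 1

    def can_write(self, block):
        # Write Rule: all of the block's tokens are required to write
        return self.tokens.get(block, 0) == TOKENS_PER_BLOCK

    def send_tokens(self, block, n):
        # Conservation of Tokens: only send tokens actually held,
        # never create or destroy them
        assert 0 < n <= self.tokens.get(block, 0)
        self.tokens[block] -= n
        # Data Transfer Rule: a message with tokens must contain data
        return {"block": block, "tokens": n, "data": self.data[block]}

    def receive(self, msg):
        block = msg["block"]
        self.tokens[block] = self.tokens.get(block, 0) + msg["tokens"]
        self.data[block] = msg["data"]
```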

Deficiency of Simple Rules Tokens must always travel with data! Bandwidth inefficient: 1. When collecting many tokens (much like invalidation acknowledgements). 2. When evicting tokens in the shared state (Token Coherence does not support silent eviction; the simple rules require a data writeback on all evictions). 3. When evicting tokens in the exclusive state. Solution: distinguish the clean/dirty state of a block. slide 30

Revised Rules (1 of 2) Conservation of Tokens: Tokens may not be created or destroyed. One token is the owner token that is clean or dirty. Write Rule: A processor can write a block only if it holds all the block s tokens and has valid data. The owner token of a block is set to dirty when the block is written. Read Rule: A processor can read a block only if it holds at least one token and has valid data. Data Transfer Rule: A message with a dirty owner token must contain data. slide 31

Revised Rules (2 of 2) Valid-Data Bit Rule: Set the valid-data bit when data and token(s) arrive. Clear the valid-data bit when the component no longer holds any tokens. The memory sets the valid-data bit whenever it receives the owner token (even if the message does not contain data). Clean Rule: Whenever the memory receives the owner token, the memory sets the owner token to clean. Result: reduced traffic, encodes all MOESI states. slide 32
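
A sketch of how the revised rules change the simple-rules model above; the owner/dirty encoding and the memory_receive helper are illustrative assumptions:

```python
class BlockState:
    """Per-block state at a component under the revised rules (illustrative)."""
    def __init__(self):
        self.tokens = 0           # total tokens held, including the owner token
        self.has_owner = False    # one of the tokens is the owner token
        self.owner_dirty = False  # clean/dirty state of the owner token
        self.valid_data = False   # Valid-Data Bit Rule
        self.data = None

def send_tokens(state, n, send_owner):
    # Conservation of Tokens: only send what is held
    assert 0 < n <= state.tokens
    assert not send_owner or state.has_owner
    msg = {"tokens": n, "owner": send_owner, "data": None}
    # Revised Data Transfer Rule: a message with a *dirty owner* token must
    # contain data; other tokens may now travel (and be evicted) without data
    if send_owner and state.owner_dirty:
        msg["data"] = state.data
    state.tokens -= n
    if send_owner:
        state.has_owner = False
    if state.tokens == 0:
        state.valid_data = False  # clear the valid-data bit with the last token
    return msg

def memory_receive(state, msg):
    state.tokens += msg["tokens"]
    if msg["data"] is not None:
        state.data = msg["data"]
        state.valid_data = True
    if msg["owner"]:
        state.has_owner = True
        state.owner_dirty = False  # Clean Rule: memory cleans the owner token
        state.valid_data = True    # memory's copy is (again) up to date
```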

Token Counting Overheads Token storage in caches: 64 tokens, owner, dirty/clean = 8 bits; 1 byte per 64-byte block is 2% overhead. Transferring tokens in messages: data message, similar to above; control message, 1 byte added to a 7-byte message is 13%. Non-silent eviction overheads: clean, an 8-byte eviction message per 72-byte data is 11%; dirty, data + token message is 2%. Token storage in memory: similar to a directory protocol, but fewer bits; like a directory, can use ECC bits or a directory cache. slide 33
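
The percentages above can be checked with a few lines of arithmetic (a sketch; the 6-bit token count assumes the owner token is tracked separately from the remaining 0-63 tokens):

```python
# Cache overhead: 6-bit token count + owner bit + dirty/clean bit = 8 bits
print(8 / (64 * 8))  # 0.0156 ~= 2% per 64-byte block
# Control message: 1 token byte added to a 7-byte message (8 bytes total)
print(1 / 8)         # 0.125 ~= 13%
# Clean eviction: an 8-byte token message vs. a 72-byte data message
print(8 / 72)        # 0.111 ~= 11%
```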

Other Token Counting Issues Stray data: tokens can arrive at any time; ingest them or redirect them to memory. Handling I/O: DMA issues read requests and write requests; memory-mapped I/O is unaffected. Block-write instructions: send the clean owner token without data. Reliability: assumes reliable delivery, the same as other coherence protocols. slide 34

Outline Motivation: Three Desirable Attributes Fast but Incorrect Approach Correctness Substrate Enforcing Safety with Token Counting Preventing Starvation with Persistent Requests Performance Policies TokenB TokenD TokenM Methods and Evaluation Related Work Contributions slide 35

Preventing Starvation via Persistent Requests Definition of starvation-freedom: all loads and stores must eventually complete. Basic idea: invoke after a timeout (wait 4x the average miss latency); send to all components; each component remembers it in a small table and continually redirects all tokens to the requestor; deactivate when complete. As described later, not for the common case. Back to the example... slide 36

Token Coherence Example State: P0 (T=15, R), P1 (T=1, R), P2 (T=0). P0 still wants all tokens. slide 37

Token Coherence Example Timeout at P0! P0 issues a persistent request; P1 responds with its token. slide 38

Token Coherence Example P0's request is completed (P0: T=16, R/W; P1: T=0; P2: T=0); P0 deactivates its persistent request. slide 39
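
A minimal sketch of the small table each component keeps for persistent requests, assuming (as in the distributed scheme described next) at most one outstanding persistent request per processor; the method names are illustrative:

```python
class PersistentRequestTable:
    """Per-component table of active persistent requests (illustrative)."""
    def __init__(self):
        self.active = {}  # block address -> starving processor id

    def on_activate(self, block, requestor):
        self.active[block] = requestor   # remember the persistent request

    def on_deactivate(self, block):
        self.active.pop(block, None)     # requestor's miss has completed

    def redirect(self, block, tokens_msg, send):
        # While an entry is active, continually redirect all arriving (and
        # held) tokens for the block to the starving requestor.
        if block in self.active:
            send(self.active[block], tokens_msg)
            return True
        return False
```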

Persistent Request Arbitration Problem: many processors may issue persistent requests for the same block. Solution: use starvation-free arbitration. Single arbiter (in dissertation); banked arbiters (in dissertation); distributed arbitration (my focus today). slide 40

Distributed Arbitration One persistent request per processor; one table entry per processor. The lowest processor number has the highest priority, calculated per block. Forward all tokens for the block (those held now and any received later). When invoking, mark all valid entries in the local table; don't issue another persistent request until the marked entries are deactivated. Based on arbitration techniques from FutureBus. slide 41
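
A sketch of the priority rule, assuming each component sees one table entry per processor; the marking bookkeeping follows the slide's description, but the exact data layout is an assumption:

```python
def winner(table, block):
    """Lowest processor number has highest priority, computed per block.
    table: processor id -> block of that processor's active persistent
    request (or None if it has none)."""
    active = [p for p, b in table.items() if b == block]
    return min(active) if active else None

def on_invoke(table, my_id, marked):
    # When invoking a persistent request, mark all currently valid entries.
    marked.update(p for p, b in table.items() if b is not None and p != my_id)

def may_issue(marked):
    # Don't issue another persistent request until every marked entry
    # has been deactivated (removed from the set as deactivations arrive).
    return not marked
```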

Distributed Arbitration System slide 42

Other Persistent Request Issues The all-tokens-but-no-data problem: bounce the clean owner token to memory. Persistent read requests: keep only one (non-owner) token; add a read/write bit to each table entry. Preventing reordering of activation and deactivation messages: 1. point-to-point ordering; 2. explicit acknowledgements; 3. acknowledgement aggregation; 4. large sequence numbers. Scalability of persistent requests. slide 43

Outline Motivation: Three Desirable Attributes Fast but Incorrect Approach Correctness Substrate Enforcing Safety with Token Counting Preventing Starvation with Persistent Requests Performance Policies TokenB TokenD TokenM Methods and Evaluation Related Work Contributions slide 44

Performance Policies The correctness substrate is sufficient: it enforces safety with token counting and prevents starvation with persistent requests. A performance policy can do better: faster, less traffic, lower overheads. It directs when and to whom tokens/data are sent, with no correctness requirements; even a random policy is correct, because the correctness substrate has the final word. slide 45

Decoupled Correctness and Performance Cache Coherence Protocol = Correctness Substrate (all cases) + Performance Policy (common cases). Correctness substrate: safety via token counting (simple rules, refined rules) and starvation freedom via persistent requests (centralized, distributed). Performance policies: TokenB, TokenD, TokenM. Just some of the implementation choices. slide 46

TokenB Performance Policy Goal: snooping without an ordered interconnect. Broadcast unordered transient requests: hints for the recipients to send tokens/data. Reissue requests once (if necessary) after 2x the average miss latency; the substrate invokes a persistent request, as before, after 4x the average miss latency. Processors & memory respond to requests as in other MOESI protocols. Uses the migratory sharing optimization (as do our base-case protocols). slide 47
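
A sketch of the reissue schedule TokenB layers on the substrate, using the timeouts named on this slide; the moving-average bookkeeping is an illustrative assumption:

```python
def next_action(waited, avg_miss_latency, already_reissued):
    """Decide what to do about an outstanding miss (illustrative)."""
    if waited > 4 * avg_miss_latency:
        return "persistent-request"   # substrate guarantees completion
    if waited > 2 * avg_miss_latency and not already_reissued:
        return "reissue-transient"    # TokenB rebroadcasts once
    return "wait"                     # transient requests are only hints
```

Because transient requests are only hints, a lost or ignored request costs latency, never correctness.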

TokenB Potential Low-latency cache-to-cache misses (Yes: direct request/response) No bus-like interconnect (Yes: unordered protocol) Bandwidth efficient (No: broadcast always) slide 48

Beyond TokenB Broadcast is not required. [Diagram: instead of broadcasting, P0's request can go directly to the processor holding all 16 tokens.] TokenD: a directory-like performance policy. TokenM: multicast to a predicted destination-set, based on past history; the prediction need not be correct (fall back on a persistent request). Enables larger or more cost-effective systems. slide 49

TokenD Performance Policy Goal: the traffic & performance of a directory protocol. Operation: send all requests to a soft-state directory at memory, which forwards them (like a directory protocol); processors respond as in a MOESI directory protocol. Reissue requests: identical to TokenB. Enhancement: track a pending set of processors; send a completion message to update the directory. slide 50
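
A sketch of the soft-state directory at memory. Since the substrate already guarantees safety, stale directory state can only cost performance, so the forwarding logic can stay simple; the names and the completion-message payload are assumptions:

```python
class SoftStateDirectory:
    def __init__(self):
        self.holders = {}  # block -> set of processors believed to hold tokens

    def on_request(self, block, requestor, forward):
        # Forward the request (like a directory protocol) to the processors
        # believed to hold tokens; they respond as in MOESI directory protocols.
        for p in self.holders.get(block, set()) - {requestor}:
            forward(p, block, requestor)
        self.holders.setdefault(block, set()).add(requestor)  # now pending

    def on_completion(self, block, token_holders):
        # The completion message refreshes the soft state with the true holders.
        self.holders[block] = set(token_holders)
```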

TokenD Potential Low-latency cache-to-cache misses (No: indirection through directory) No bus-like interconnect (Yes: unordered protocol) Bandwidth efficient (Yes: avoids broadcast) slide 51

TokenM Performance Policy Goals: less traffic than TokenB, faster than TokenD. Builds on TokenD, but uses prediction: predict a destination set of processors; the soft-state directory forwards the request to any missing processors. slide 52

Destination-Set Prediction Observe past behavior to predict the future, leveraging prior work on coherence prediction. Training events: other processors' requests and data responses. Mostly subsumes TokenD. [Chart: average miss latency vs. bandwidth usage; TokenD and TokenB anchor the extremes, with the Owner, Group, and Bcast-if-shared predictors falling between them, near the ideal.] slide 53

Destination-Set Predictors Three predictors (ISCA 03 paper): broadcast-if-shared, group, owner. All are simple cache-like (tagged) predictors: 4-way set-associative, 8k entries (32KB to 64KB), 1024-byte macroblock-based indexing. Prediction: on a tag miss, send only to memory; otherwise, generate a prediction. slide 54
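
A sketch of the shared predictor organization (4-way set-associative, 8k entries, macroblock indexing); the index/tag split and the naive replacement policy are illustrative, and each predictor (detailed in the backup slides) supplies its own entry contents:

```python
MACROBLOCK_BYTES = 1024
WAYS = 4
SETS = 8192 // WAYS  # 8k entries total

class PredictorTable:
    def __init__(self, make_entry):
        self.sets = [dict() for _ in range(SETS)]  # tag -> entry per set
        self.make_entry = make_entry               # per-predictor entry factory

    def _index(self, addr):
        mb = addr // MACROBLOCK_BYTES   # macroblock-based indexing
        return mb % SETS, mb // SETS    # (set index, tag)

    def lookup(self, addr):
        s, tag = self._index(addr)
        return self.sets[s].get(tag)    # None => tag miss => send only to memory

    def train(self, addr):
        s, tag = self._index(addr)
        if tag not in self.sets[s]:
            if len(self.sets[s]) >= WAYS:           # naive eviction, not true LRU
                self.sets[s].pop(next(iter(self.sets[s])))
            self.sets[s][tag] = self.make_entry()
        return self.sets[s][tag]
```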

TokenM Potential Low-latency cache-to-cache misses (Yes: direct request/response, with a bandwidth/latency tradeoff) No bus-like interconnect (Yes: unordered protocol) Bandwidth efficient (Yes: reduces broadcasts using prediction) slide 55

Outline Motivation: Three Desirable Attributes Fast but Incorrect Approach Correctness Substrate Enforcing Safety with Token Counting Preventing Starvation with Persistent Requests Performance Policies TokenB TokenD TokenM Methods and Evaluation Related Work Contributions slide 56

Evaluation Methods Non-goal: exact speedup numbers Many assumptions and parameters (next slide) Goal: Quantitative evidence for qualitative behavior Simulation methods Full-system simulation with Simics Dynamically scheduled processor model Detailed memory system model Multiple simulations due to workload variability slide 57

Evaluation Parameters Processors: 16 processors, SPARC ISA, 2 GHz, 11 pipe stages, 4-wide fetch/execute, dynamically scheduled, 128-entry ROB, 64-entry scheduler. Memory system: 64-byte cache lines; 64KB L1 instruction and data caches, 4-way set-associative, 2 ns (4 cycles); 4MB L2, 4-way set-associative, 6 ns (12 cycles); 4GB main memory, 80 ns (160 cycles). Interconnect: 15 ns link latency (30 cycles), 4 ns to enter/exit the interconnect; switched tree (4 link latencies, 256-cycle 2-hop round trip) and 2D torus (2 link latencies on average, 136-cycle 2-hop round trip). Coherence protocols: aggressive snooping; Alpha 21364-like directory; 72-byte data messages, 8-byte request messages. slide 58

Three Commercial Workloads All workloads use Solaris 9 for SPARC. OLTP: on-line transaction processing; IBM's DB2 v7.2 DBMS; TPC-C-like workload; 5GB database, 25,000 warehouses; 8 raw disks, additional log disk; 256 concurrent users. SPECjbb: Java middleware workload; Sun's HotSpot 1.4.1-b21 Server JVM; 24 threads, 24 warehouses (~500MB). Apache: static web serving workload; 80,000 files, 6400 concurrent users. slide 59

Are reissued and persistent requests rare? (percent of all misses)
Outcome              SPECjbb  Apache  OLTP
Not Reissued         99.5%    99.1%   97.6%
Reissued Once        0.2%     0.7%    1.5%
Persistent Requests  0.3%     0.2%    0.9%
TokenB results (TokenD/TokenM are similar). Yes, reissued requests are rare. slide 60

Runtime: Snooping vs. TokenB (tree switched interconnect) Similar performance on the same interconnect. slide 61

Runtime: Snooping vs. TokenB (torus interconnect) Snooping is not applicable on the torus. slide 62

Runtime: Snooping vs. TokenB TokenB can outperform snooping (23-34% faster). Why? A lower-latency interconnect. slide 63

Runtime: Directory vs. TokenB TokenB outperforms the directory protocol (12-64% or 7-27%). Why? It avoids the directory lookup and the third hop. slide 64

Interconnect Traffic: TokenB and Directory TokenB's additional traffic is moderate (18-35% more). Why? (1) Requests are smaller than data (8B vs. 64B); (2) broadcast routing. Analytical model: at 64 processors the ratio is 2.3x; at 256 processors, 3.9x. slide 65

Runtime: TokenD and Directory Similar runtime, still slower than TokenB slide 66

Runtime: TokenD and TokenM slide 67

Interconnect Traffic: TokenD and Directory Similar traffic, still less than TokenB slide 68

Interconnect Traffic: TokenD and TokenM slide 69

Evaluation Summary TokenB is faster than: snooping, due to a faster/cheaper interconnect; directories, as it avoids the directory lookup & third hop. TokenB uses more traffic than directories, especially as system size increases. TokenD is similar to directories in runtime and traffic. TokenM provides intermediate design points: Owner is 6-16% faster than TokenD with negligible additional bandwidth; Bcast-if-shared is only 1-6% slower than TokenB, but uses 7-14% less traffic (more for larger systems). slide 70

Outline Motivation: Three Desirable Attributes Fast but Incorrect Approach Correctness Substrate Enforcing Safety with Token Counting Preventing Starvation with Persistent Requests Performance Policies TokenB TokenD TokenM Methods and Evaluation Related Work Contributions slide 71

Architecture Related Work (1 of 2) Many, many previous coherence protocols, including many hybrids and adaptive protocols. Coherence prediction: early work on the migratory sharing optimization [ISCA 93]; later, DSI, LTP, Cosmos. Destination-Set Prediction [ISCA 03]. Multicast Snooping [ISCA 99, TPDS]. Acacio et al. [SC 02, PACT 02]. Cachet [ICS 99]. slide 72

Non-Architecture Related Work (2 of 2) Much tangentially related work, but little directly related. Many single-token schemes (token-based synchronization), or schemes that use multiple tokens for faults (quorum commit). Fortran-M uses message passing of many tokens for protecting shared variables [Foster]. Reader/writer locks are not implemented using tokens. slide 73

Contributions 1. Token counting rules for enforcing safety. 2. Persistent requests for preventing starvation. 3. Decoupling correctness and performance in cache coherence protocols (correctness substrate + performance policy). 4. Developing and evaluating three performance policies. slide 74

Backup Slides slide 75

Centralized Arbiter System slide 76

Centralized Arbiter Example slide 77

Banked Arbiter System slide 78

Distributed Arbitration System slide 79

Distributed Arbitration Example slide 80

Predictor #1: Broadcast-if-shared The performance of snooping, with fewer broadcasts: broadcast for shared data, a minimal set for private data. Each entry: a valid bit and a 2-bit counter. Decrement on data from memory; increment on data from a processor; increment on another processor's request. Prediction: if the counter > 1, then broadcast; otherwise, send only to memory. slide 81
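
A sketch of the broadcast-if-shared entry's training and prediction, following the counter rules on this slide (the saturation bounds are an assumption):

```python
class BcastIfSharedEntry:
    def __init__(self):
        self.counter = 0  # 2-bit saturating counter

    def on_data_from_memory(self):
        self.counter = max(0, self.counter - 1)  # looks private

    def on_data_from_processor(self):
        self.counter = min(3, self.counter + 1)  # came from a cache: shared

    def on_other_request(self):
        self.counter = min(3, self.counter + 1)  # another processor is active

    def predict(self, all_processors, memory):
        # counter > 1 => broadcast; otherwise the minimal set (memory only)
        return set(all_processors) | {memory} if self.counter > 1 else {memory}
```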

Predictor #2: Owner Traffic similar to a directory protocol, with fewer indirections. Predict one extra processor (the "owner"); captures pairwise sharing and the write part of migratory sharing. Each entry: a valid bit and a predicted owner ID. Set the owner on data from another processor; set the owner on another processor's request to write; unset the owner on a response from memory. Prediction: if valid, predict owner + memory; otherwise, send only to memory. slide 82

Predictor #3: Group Try to achieve the ideal bandwidth/latency tradeoff by detecting groups of sharers: temporary groups or logical partitions (LPARs). Each entry: N 2-bit counters. On a response or request from another processor, increment the corresponding counter; train down by occasionally decrementing all counters (every 2N increments). Prediction: for each processor, if the corresponding counter > 1, add it to the predicted set; send to the predicted set + memory. slide 83
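
And a sketch of the group entry, with one 2-bit counter per processor and the occasional train-down; the decay bookkeeping is an illustrative reading of "every 2N increments":

```python
class GroupEntry:
    def __init__(self, n_processors):
        self.counters = [0] * n_processors  # one 2-bit counter per processor
        self.increments = 0

    def train(self, proc):
        # Response or request observed from another processor
        self.counters[proc] = min(3, self.counters[proc] + 1)
        self.increments += 1
        if self.increments >= 2 * len(self.counters):  # train down occasionally
            self.counters = [max(0, c - 1) for c in self.counters]
            self.increments = 0

    def predict(self, memory):
        group = {p for p, c in enumerate(self.counters) if c > 1}
        return group | {memory}  # send to the predicted set + memory
```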