
ECE 669 Parallel Computer Architecture Lecture 18 Scalable Parallel Caches

Overview
- Most cache protocols are more complicated than two-state
- Snooping is not effective for network-based systems
- Consider three alternative cache coherence approaches: full-map, limited directory, chained
- Caching will affect network performance
- LimitLESS protocol: gives the appearance of full-map
- Practical issues of processor-memory system interaction

Context for Scalable Cache Coherence
(figure: scalable network of switches connecting nodes, each with a processor P, cache $, and communication assist CA, over scalable distributed memory)
- Realizing programming models through network transaction protocols
  - efficient node-to-network interface
  - interprets transactions
- Scalable networks: many simultaneous transactions
- Caches naturally replicate data
  - coherence through bus snooping protocols
  - consistency
- Need cache coherence protocols that scale!
  - no broadcast or single point of order

Generic Solution: Directories
(figure: nodes, each with a processor, cache, directory, memory, and communication assist, connected by a scalable interconnection network)
- Maintain the state vector explicitly
  - associated with each memory block
  - records the state of the block in each cache
- On a miss, communicate with the directory
  - determine the location of cached copies
  - determine the action to take
  - conduct the protocol to maintain coherence

A Cache Coherent System Must:
- Provide a set of states, a state transition diagram, and actions
- Manage the coherence protocol
  (0) Determine when to invoke the coherence protocol
  (a) Find info about the state of the block in other caches to determine the action
      - whether it needs to communicate with other cached copies
  (b) Locate the other copies
  (c) Communicate with those copies (invalidate/update)
- (0) is done the same way on all systems
  - the state of the line is maintained in the cache
  - the protocol is invoked if an access fault occurs on the line
- Different approaches are distinguished by (a) to (c) (see the interface sketch below)
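
To make steps (a) to (c) concrete, here is a minimal sketch in C of the hooks such a protocol must provide. The struct and function names are illustrative assumptions, not from the lecture or any real system.

    /* A coherence protocol must supply states and hooks for steps (a)-(c). */
    typedef enum { LINE_INVALID, LINE_SHARED, LINE_MODIFIED } line_state_t;

    typedef struct {
        /* (a) find out what state the block is in elsewhere */
        line_state_t (*lookup_state)(unsigned long block_addr);
        /* (b) locate the other copies; returns how many sharer ids were written */
        int (*locate_copies)(unsigned long block_addr, int *sharers, int max);
        /* (c) invalidate or update those copies */
        void (*invalidate_or_update)(unsigned long block_addr, const int *sharers, int n);
    } coherence_protocol_t;

    /* (0) is common to all systems: the protocol is invoked when an access
     * faults because the local line state does not permit the access. */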

Coherence in small machines: Snooping Caches
(figure: two processors with dual-ported caches and per-cache directories on a shared bus; a write (1) is broadcast (2), the other cache snoops (3), matches the address (4), and purges its copy (5))
- Broadcast the address on a shared write
- Everyone listens (snoops) on the bus to see if any of their own addresses match
- How do you know when to broadcast, invalidate, ...?
  - State associated with each cache line

State diagram for ownership protocols
(figure: per-block state diagram with states invalid, read-clean, and write-dirty; transitions labeled Local Read, Local Write (broadcast address), Remote Read, and Remote Write / Replace)
- In an ownership protocol, the writer owns the exclusive copy
- One such state machine exists for each shared-data cache block (see the sketch below)
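
The three-state ownership protocol can be rendered as a small per-block state machine. The following C sketch is illustrative; the enum names and the broadcast hook are assumptions consistent with the transitions described above.

    /* A per-block state machine for the three-state ownership protocol.
     * The writer owns the exclusive (write-dirty) copy. */
    typedef enum { OWN_INVALID, OWN_READ_CLEAN, OWN_WRITE_DIRTY } own_state_t;
    typedef enum { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE_OR_REPLACE } own_event_t;

    /* Assumed hook: broadcast the written address on the bus/network. */
    static void broadcast_address(unsigned long a) { (void)a; }

    own_state_t next_state(own_state_t s, own_event_t e, unsigned long a) {
        switch (e) {
        case LOCAL_READ:
            return (s == OWN_INVALID) ? OWN_READ_CLEAN : s;    /* fetch a clean copy  */
        case LOCAL_WRITE:
            broadcast_address(a);                              /* announce the write  */
            return OWN_WRITE_DIRTY;                            /* become the owner    */
        case REMOTE_READ:
            return (s == OWN_WRITE_DIRTY) ? OWN_READ_CLEAN : s; /* owner downgrades   */
        case REMOTE_WRITE_OR_REPLACE:
            return OWN_INVALID;                                /* copy is invalidated */
        }
        return s;
    }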

Maintaining coherence in large machines
- Software
- Hardware: directories
Software coherence
- Typically yields weak coherence, i.e. coherence only at sync points (or fence points)
- E.g., when using critical sections for shared operations (foo1..foo4 shown residing in a cache block):

    GET_FOO_LOCK
    /* MUNGE WITH FOOs */
    Foo1 = ...
    X    = Foo2
    Foo3 = ...
    ...
    RELEASE_FOO_LOCK

- How do you make this work?

Situation
(figure: the writing processor holds foo1..foo4 in its cache and must flush foo* back to the home node, waiting until the flush completes)
- Flush foo* from the cache, wait till done
- Issues
  - Lock?
  - Must be conservative: lose some locality
  - But can exploit application characteristics, e.g. TSP, chaotic relaxation: allow some inconsistency
- Need special processor instructions, e.g. flush and unlock operations on addresses (a sketch follows below)
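
A minimal sketch of how the critical section might be made coherent in software, assuming a hypothetical flush instruction exposed as a flush_range() intrinsic and simple lock primitives; none of these names come from the lecture.

    #include <stddef.h>

    static int foo_lock;
    static int foo1, foo2, foo3, x;

    /* Assumed primitives: a processor flush instruction exposed as an
     * intrinsic, plus lock acquire/release. Stubs here for illustration. */
    static void flush_range(void *addr, size_t len) { (void)addr; (void)len; }
    static void acquire(int *lock) { (void)lock; }
    static void release(int *lock) { (void)lock; }

    void munge_with_foos(void) {
        acquire(&foo_lock);
        foo1 = 1;                          /* munge with foos inside the critical section */
        x    = foo2;
        foo3 = x + 1;
        flush_range(&foo1, sizeof foo1);   /* flush foo* back to home memory...           */
        flush_range(&foo2, sizeof foo2);
        flush_range(&foo3, sizeof foo3);   /* ...and wait until the flushes complete      */
        release(&foo_lock);                /* only then is it safe to release the lock    */
    }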

Scalable Approach: Directories
- Every memory block has associated directory information
  - keeps track of copies of cached blocks and their states
  - on a miss, find the directory entry, look it up, and communicate only with the nodes that have copies, if necessary
  - in scalable networks, communication with the directory and the copies is through network transactions
- Many alternatives for organizing directory information

Basic Operation of Directory
(figure: k processors with caches connected through an interconnection network to memory and its directory, which holds presence bits and a dirty bit per block)
- With each cache block in memory: k presence bits, 1 dirty bit
- With each cache block in a cache: 1 valid bit and 1 dirty (owner) bit
- Read from main memory by processor i:
  - If the dirty bit is OFF, then { read from main memory; turn p[i] ON; }
  - If the dirty bit is ON, then { recall the line from the dirty processor (its cache state goes to shared); update memory; turn the dirty bit OFF; turn p[i] ON; supply the recalled data to i; }
- Write to main memory by processor i:
  - If the dirty bit is OFF, then { supply data to i; send invalidations to all caches that have the block; turn the dirty bit ON; turn p[i] ON; ... }
(a C rendering of these two cases follows below)
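
The read and write cases above can be written out directly. This is a hedged C sketch with a presence-bit vector and dirty bit per memory block; the message-sending helpers and the processor count are assumptions, not a real machine's interface.

    #include <stdbool.h>
    #define K 64                                    /* number of processors (example) */

    /* Full-map directory entry: k presence bits plus a dirty bit. */
    typedef struct {
        bool p[K];        /* presence bit per processor          */
        bool dirty;       /* exactly one cached writer when set  */
    } dir_entry_t;

    /* Assumed message helpers (stubs). */
    static void recall_line_to_shared(int owner, unsigned long blk) { (void)owner; (void)blk; }
    static void send_invalidate(int node, unsigned long blk)        { (void)node;  (void)blk; }
    static void supply_data(int node, unsigned long blk)            { (void)node;  (void)blk; }

    void dir_read(dir_entry_t *d, unsigned long blk, int i) {
        if (d->dirty) {                             /* one dirty copy exists somewhere      */
            for (int j = 0; j < K; j++)
                if (d->p[j]) recall_line_to_shared(j, blk);  /* owner writes back, goes shared */
            d->dirty = false;
        }
        d->p[i] = true;                             /* record the new sharer                */
        supply_data(i, blk);
    }

    void dir_write(dir_entry_t *d, unsigned long blk, int i) {
        for (int j = 0; j < K; j++)                 /* invalidate every other cached copy   */
            if (d->p[j] && j != i) { send_invalidate(j, blk); d->p[j] = false; }
        d->dirty = true;                            /* i becomes the exclusive owner        */
        d->p[i]  = true;
        supply_data(i, blk);
    }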

Scalable dynamic schemes
- Limited directories
- Chained directories
- LimitLESS schemes: use software
- Other approach: full map (not scalable)

Scalable hardware schemes
(figure: directories (DIR) distributed with the memories (MEM), above caches (C) and processors (P))
- General directories: on a write, check the directory; if the block is shared, send invalidate messages
- Distribute the directories with the memories
- Limited directories, chained directories, LimitLESS directories
- Directory bandwidth scales in proportion to memory bandwidth
- Most directory accesses occur during a memory access, so not too many extra network requests (except a write to a variable that others are reading)

Memory controller (directory) state diagram for a memory block
(figure: states uncached, "1 or more read copies" (i pointers), and "1 write copy" (1 pointer); transitions labeled read, write, replace, update, write/invalidations request, and read-i/update request)


Limited directories
(figure: network connecting memories with small directories to caches (C) and processors (P))
- Exploit working-set behavior: invalidate one copy if a 5th processor comes along (sometimes a broadcast-invalidate bit can be set instead)
- Rarely do more than 2 processors share a block
- Insight: the set of 4 pointers can be managed like a fully associative 4-entry cache on the virtual space of all pointers (see the sketch below)
- But what do you do about widely shared data?
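
One way to read the "fully associative 4-entry cache of pointers" insight, sketched in C. The eviction choice and the names are assumptions; a real implementation could pick any victim or set a broadcast bit instead.

    /* Limited directory: at most 4 sharer pointers per block.
     * On a 5th sharer, evict (invalidate) one existing pointer,
     * just like replacement in a 4-entry fully associative cache. */
    #define NPTR 4
    #define NONE (-1)

    typedef struct {
        int ptr[NPTR];            /* sharer node ids; slots start as NONE */
    } limited_dir_t;

    static void send_invalidate(int node) { (void)node; /* assumed message hook */ }

    void add_sharer(limited_dir_t *d, int node) {
        for (int s = 0; s < NPTR; s++)
            if (d->ptr[s] == node) return;                        /* already recorded */
        for (int s = 0; s < NPTR; s++)
            if (d->ptr[s] == NONE) { d->ptr[s] = node; return; }  /* free pointer slot */
        send_invalidate(d->ptr[0]);     /* 5th sharer: invalidate one existing copy    */
        d->ptr[0] = node;               /* (victim choice is a policy decision)        */
    }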

LimitLESS directories: Limited directories Locally Extended through Software Support
(figure: the 5th request (1) traps the local processor (2), which extends the directory into local memory (3))
- Trap the processor when the 5th request comes
- The processor extends the directory into local memory (a sketch follows below)
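
A hedged sketch of what the software side of LimitLESS might look like: when the hardware pointers overflow, a trap handler records the extra sharer in a structure kept in local memory. The handler and structure names are illustrative, not Alewife's actual interface.

    #include <stdlib.h>

    /* Sharers beyond the hardware pointers, kept as a list in local memory. */
    typedef struct overflow_node {
        int node_id;
        struct overflow_node *next;
    } overflow_node_t;

    typedef struct {
        unsigned long block_addr;
        overflow_node_t *extra_sharers;
    } soft_dir_entry_t;

    /* Trap handler invoked when a 5th (or later) read request arrives. */
    void limitless_overflow_trap(soft_dir_entry_t *e, int requester) {
        overflow_node_t *n = malloc(sizeof *n);
        if (!n) return;                   /* a real handler would recover here       */
        n->node_id = requester;
        n->next = e->extra_sharers;       /* push onto the software sharer list      */
        e->extra_sharers = n;
        /* the hardware then supplies the data to the requester as usual */
    }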

Zero-pointer LimitLESS: all-software coherence
(figure: with zero hardware pointers the directory traps on every request; the local node's processor handles coherence in software, communicating with the memory, cache, and communication assist of remote nodes)

Chained directories
(figure: on a write, invalidations propagate along a chain linking the caches (C) that share the block)
- Simply a different data structure for the directory: link all cache entries
- But longer latencies
- Also more complex hardware
- Must handle replacements of elements in the chain due to misses! (see the sketch below)
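
An illustrative C sketch of walking a chained directory on a write: the directory holds the head of a chain through the sharing caches, and each entry points at the next sharer. The names are assumptions; a real design keeps these links in the cache tags.

    #include <stddef.h>

    /* Singly linked sharing chain threaded through the caches. */
    typedef struct cache_entry {
        int node_id;
        struct cache_entry *next;    /* next sharer in the chain */
    } cache_entry_t;

    static void send_invalidate(int node) { (void)node; /* assumed message hook */ }

    /* Invalidate every sharer in turn; returns the new (empty) chain head.
     * Each hop adds latency, which is the cost noted above. */
    cache_entry_t *invalidate_chain(cache_entry_t *head) {
        for (cache_entry_t *e = head; e != NULL; e = e->next)
            send_invalidate(e->node_id);
        return NULL;
    }

A doubly linked chain (next slide) makes it easier to unlink an entry when a cache replaces the block, at the cost of a second pointer per entry.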

Doubly linked chains
(figure: caches linked in a doubly linked chain across the network)
- Of course, these data structures can also be implemented through software plus messages, as in LimitLESS


Full map
(figure: a write request (1) causes the directory to send invalidations (2); after the acknowledgments return (3), write permission is granted (4))
- Problem: does not scale -- needs N pointers per block (see the overhead sketch below)
- Directories are distributed, so they are not a bottleneck
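
The "needs N pointers" cost can be made concrete with a small back-of-the-envelope calculation; the block size and processor counts below are examples, not figures from the lecture.

    #include <stdio.h>

    /* Full-map directory overhead: one presence bit per processor per block.
     * With 64-byte (512-bit) blocks, the overhead grows linearly with N. */
    int main(void) {
        const int block_bits = 64 * 8;                 /* 64-byte memory block */
        const int procs[] = { 16, 64, 256, 1024 };
        for (int i = 0; i < 4; i++) {
            double overhead = 100.0 * procs[i] / block_bits;  /* presence bits / block bits */
            printf("%4d processors: %6.1f%% directory overhead\n", procs[i], overhead);
        }
        return 0;
    }

At 256 processors the presence bits already cost 50% of the memory they describe, which is why the lecture turns to limited, chained, and LimitLESS schemes.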

Hierarchical schemes
(figure, shown in three steps: a tree of caches beneath memory (and disks?); a read of A propagates up the hierarchy until a copy of A is found)
- Hierarchical, e.g. KSR (which actually has rings...)

Widely shared data
1. Synchronization variables
   - Software combining tree: processors spin on local flags; a write releases them
2. Read-only objects
   - Instructions, read-only data
   - Can mark these and bypass the coherence protocol
3. But the biggest problem: data that is frequently read but rarely written and does not fall into known patterns like synchronization variables

All-software coherence
(figure: bar chart of execution time (cycles, 0.00 to 1.60) for all-software, LimitLESS 1, LimitLESS 2, LimitLESS 4, and all-hardware protocols)

All-software coherence
(figure: execution time in megacycles vs. matrix size, 4x4 to 128x128, for the LimitLESS 4 protocol and LimitLESS 0 (all software); cycle performance for block matrix multiplication on 16 processors)

All-software coherence
(figure: execution time in units of 10,000 cycles vs. node count, 2 to 64 nodes, for the LimitLESS 4 protocol and LimitLESS 0 (all software); performance for a single global barrier, from first INVR to last RDATA)

Summary
- Tradeoffs in caching are an important issue
- The LimitLESS protocol provides a software extension to hardware caching
- Goal: maintain coherence but minimize network traffic
- Full map is not scalable and too costly
- Distributed memory makes caching more of a challenge