CSE 502: Computer Architecture Shared-Memory Multi-Processors

Shared-Memory Multiprocessors. Multiple threads use a shared memory (address space), e.g., SysV Shared Memory or threads in software. Communication is implicit, via loads and stores; this is the opposite of explicit message-passing multiprocessors. Theoretical foundation: the PRAM model. [Figure: P1..P4 connected to a shared memory system.]

Why Shared Memory? Pluses: the application sees a multitasking uniprocessor; the OS needs only evolutionary extensions; communication happens without involving the OS. Minuses: synchronization is complex; communication is implicit (hard to optimize); hard to implement in hardware. Result: SMPs and CMPs are the most successful machines to date, and the first with multi-billion-dollar markets.

Paired vs. Separate Processor/Memory? Separate CPU/memory gives uniform memory access (UMA): equal latency to all of memory, but low peak performance. Paired CPU/memory gives non-uniform memory access (NUMA): faster local memory and high peak performance, but data placement matters. [Figure: UMA places all memories across the interconnect; NUMA attaches a memory to each node.]

Shared vs. Point-to-Point Networks. Shared network (example: a bus): low latency, low bandwidth, doesn't scale beyond roughly 16 cores, simple cache coherence. Point-to-point network (examples: mesh, ring): high latency (many hops), higher bandwidth, scales to 1000s of cores, complex cache coherence.

Organizing Point-To-Point Networks. Network topology: the organization of the network, trading off performance (connectivity, latency, bandwidth) against cost. Networks with separate router chips are indirect; networks with the processor, memory, and router on the same chip are direct (fewer components: "glueless MP").

Issues for Shared Memory Systems. Two big ones: cache coherence and the memory consistency model. They are closely related, and often confused.

Cache Coherence: The Problem (1/2). Variable A initially has value 0. P1 stores value 1 into A. P2 then loads A from memory and sees the old value 0. [Figure: t1: Store A=1 lands in P1's L1; t2: P2's Load A reads 0 from main memory over the bus.] We need to do something to keep P2's cache coherent.

Cache Coherence: The Problem (2/2). P1 and P2 both have variable A (value 0) in their caches. P1 stores value 1 into A. P2 loads A from its own cache and sees the old value 0. [Figure: t1: Store A=1 updates P1's L1 copy; t2: P2's Load A hits its stale cached copy.] We need to do something to keep P2's cache coherent.

Approaches to Cache Coherence. Software-based solutions: mark cache blocks or memory pages as cacheable/non-cacheable, and add Flush and Invalidate instructions; this could be done by the compiler or run-time system, but it is difficult to get perfect (e.g., what about memory aliasing?). Hardware solutions are far more common: the system ensures everyone always sees the latest value.

Coherence with Write-through Caches. Allows multiple readers, but writes go through to the bus. Requires a write-through, no-write-allocate cache. All caches must monitor (aka "snoop") all bus traffic. This needs only a simple state machine for each cache frame. [Figure: t1: P1 stores A=1, writing through; t2: BusWr A=1 updates memory; t3: P2 snoops the bus write and invalidates its copy of A.]

Valid-Invalid Snooping Protocol. Processor actions: Ld, St. Bus messages: BusRd, BusWr. Track 1 bit per cache frame: Valid/Invalid. Transitions: in Invalid, a Load issues BusRd and moves the frame to Valid, while a Store issues BusWr and stays Invalid (no-write-allocate). In Valid, a Load is silent (Load / --); a Store issues BusWr; and a snooped BusWr invalidates the frame (BusWr / --).
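The Valid-Invalid machine is small enough to write down directly. Below is a sketch encoding its transitions as a pure per-frame function; the type and function names are my own invention, only the protocol itself follows the slide.

```c
typedef enum { VI_INVALID, VI_VALID } vi_state_t;
typedef enum { VI_LOAD, VI_STORE, VI_SNOOP_BUSRD, VI_SNOOP_BUSWR } vi_event_t;
typedef enum { VI_BUS_NONE, VI_BUS_RD, VI_BUS_WR } vi_msg_t;

/* Next state for one cache frame; *out is the bus message to issue. */
vi_state_t vi_next(vi_state_t s, vi_event_t e, vi_msg_t *out) {
    *out = VI_BUS_NONE;
    switch (e) {
    case VI_LOAD:                       /* miss fetches; hit is silent       */
        if (s == VI_INVALID) *out = VI_BUS_RD;
        return VI_VALID;
    case VI_STORE:                      /* write-through: every store -> bus */
        *out = VI_BUS_WR;               /* no-write-allocate: I stays I      */
        return s;
    case VI_SNOOP_BUSWR:                /* someone else wrote: drop our copy */
        return VI_INVALID;
    default:                            /* snooped BusRd: other readers OK   */
        return s;
    }
}
```

Writing the protocol this way makes the snooping requirement concrete: `vi_next` must be invoked for bus events (the `VI_SNOOP_*` cases) as well as for the local processor's own loads and stores.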

Supporting Write-Back Caches. Write-back caches are good: they drastically reduce bus write bandwidth. Add a notion of ownership to Valid-Invalid. When the owner has the only replica of a cache block, it may update the block freely. Multiple readers are OK, but a cache is not allowed to write without first gaining ownership. On a read, the system must check whether there is an owner; if yes, take the ownership away.

Modified-Shared-Invalid (MSI) States. Processor actions: Load, Store, Evict. Bus messages: BusRd, BusRdX, BusInv, BusWB, BusReply (here for simplicity; some messages can be combined). Track 3 states per cache frame. Invalid: the cache does not have a copy. Shared: the cache has a read-only copy; the block is clean, i.e., memory (or later caches) is up to date. Modified: the cache has the only valid copy; the block is writable and dirty, i.e., memory (or later caches) is out of date.

Simple MSI Protocol (1/9). Transition introduced: Invalid, Load / BusRd, to Shared. Example: both caches start with A Invalid; memory holds A=0. P1 loads A: (1) P1 issues the Load, (2) BusRd A goes on the bus, (3) memory sends BusReply A; P1's frame moves I -> S with A=0.

Simple MSI Protocol (2/9). Transitions added: Shared, Load / --; Shared, snooped BusRd / [BusReply]. Example: P2 now loads A: (1) P2 issues the Load, (2) BusRd A goes on the bus, (3) a BusReply supplies the data; P2's frame also moves I -> S with A=0.

Simple MSI Protocol (3/9). Transition added: Shared, Evict / --. Example: P2 evicts A; a clean copy can be dropped silently, so P2's frame moves S -> I with no bus traffic.

Simple MSI Protocol (4/9). Transitions added: Invalid, Store / BusRdX, to Modified; Shared, snooped BusRdX / [BusReply], to Invalid; Modified, Load or Store / --. Example: P2 stores to A: (1) P2 issues the Store, (2) BusRdX A goes on the bus, (3) a BusReply supplies the data; P1's shared copy is invalidated (S -> I) and P2's frame moves I -> M with A=1.

Simple MSI Protocol (5/9). Transition added: Modified, snooped BusRd / BusReply, to Shared. Example: P1 loads A: (1) P1 issues the Load, (2) BusRd A goes on the bus, (3) P2 (the owner) supplies BusReply A and downgrades M -> S, (4) memory snarfs the reply, updating A from 0 to 1; P1's frame moves I -> S with A=1.

Simple MSI Protocol (6/9). Transition added: Shared, Store / BusInv (aka an Upgrade), to Modified; the Shared-to-Invalid arc now covers snooped BusRdX or BusInv / [BusReply]. Example: P1 stores to A while in S: (1) P1 issues the Store, (2) BusInv A goes on the bus; P2 invalidates (S -> I) and P1's frame moves S -> M with A=2.

Simple MSI Protocol (7/9). Transition added: Modified, snooped BusRdX / BusReply, to Invalid. Example: P2 stores to A: (1) P2 issues the Store, (2) BusRdX A goes on the bus, (3) P1 (the owner) supplies BusReply A and invalidates (M -> I); P2's frame moves I -> M with A=3.

Simple MSI Protocol (8/9). Transition added: Modified, Evict / BusWB. Example: P2 evicts its dirty copy of A: (1) P2 issues the Evict, (2) BusWB A writes the data back, updating memory from A=1 to A=3; P2's frame moves M -> I.

Simple MSI Protocol (9/9). The complete protocol. Cache actions: Load, Store, Evict. Bus messages: BusRd, BusRdX, BusInv, BusWB, BusReply. Transitions: Invalid: Load / BusRd to Shared; Store / BusRdX to Modified. Shared: Load / --; Evict / --; Store / BusInv to Modified; snooped BusRd / [BusReply]; snooped BusRdX or BusInv / [BusReply] to Invalid. Modified: Load or Store / --; Evict / BusWB to Invalid; snooped BusRd / BusReply to Shared; snooped BusRdX / BusReply to Invalid. This is a usable coherence protocol.
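One way to check the completed protocol is to encode the whole diagram as a transition function. The C sketch below folds processor and snoop events into one per-frame function; the identifiers are mine, but the states, events, and message names follow the slides.

```c
typedef enum { MSI_I, MSI_S, MSI_M } msi_state_t;
typedef enum { MSI_LOAD, MSI_STORE, MSI_EVICT,
               MSI_SNOOP_BUSRD, MSI_SNOOP_BUSRDX, MSI_SNOOP_BUSINV } msi_event_t;
typedef enum { MSG_NONE, MSG_BUSRD, MSG_BUSRDX,
               MSG_BUSINV, MSG_BUSWB, MSG_BUSREPLY } msi_msg_t;

/* Next state for one cache frame; *out is the bus message to issue. */
msi_state_t msi_next(msi_state_t s, msi_event_t e, msi_msg_t *out) {
    *out = MSG_NONE;
    switch (s) {
    case MSI_I:                                      /* no copy               */
        if (e == MSI_LOAD)  { *out = MSG_BUSRD;  return MSI_S; }
        if (e == MSI_STORE) { *out = MSG_BUSRDX; return MSI_M; }
        return MSI_I;                                /* snoops: nothing to do */
    case MSI_S:                                      /* clean, read-only copy */
        if (e == MSI_STORE) { *out = MSG_BUSINV; return MSI_M; } /* upgrade   */
        if (e == MSI_EVICT) return MSI_I;            /* clean: silent drop    */
        if (e == MSI_SNOOP_BUSRDX || e == MSI_SNOOP_BUSINV) return MSI_I;
        return MSI_S;                                /* loads, BusRd: silent  */
    default: /* MSI_M: the only valid copy, dirty */
        if (e == MSI_EVICT)        { *out = MSG_BUSWB;    return MSI_I; }
        if (e == MSI_SNOOP_BUSRD)  { *out = MSG_BUSREPLY; return MSI_S; }
        if (e == MSI_SNOOP_BUSRDX) { *out = MSG_BUSREPLY; return MSI_I; }
        return MSI_M;                                /* own loads/stores      */
    }
}
```

Replaying the nine-step walkthrough against `msi_next` reproduces each animation step, which is a quick sanity check that no transition was lost in the diagram.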

Scalable Cache Coherence. Part I: bus bandwidth. Replace the non-scalable-bandwidth substrate (the bus) with a scalable-bandwidth one (e.g., a mesh). Part II: processor snooping bandwidth. Most snoops result in no action, so replace the non-scalable broadcast protocol (spam everyone) with a scalable directory protocol (contact only the cores that care). This requires a directory to keep track of sharers.

Directory Coherence Protocols. Extend memory to track caching information. For each physical cache line, a home directory tracks: Owner, the core that has a dirty copy (i.e., in M state); and Sharers, the cores that have clean copies (i.e., in S state). Cores send coherence events to the home directory, and the home directory sends events only to the cores that care.
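A minimal sketch of one directory entry and its read-miss handling, under the assumption of at most 64 cores and a bitmask sharer list (all identifiers here are mine, not from the slides):

```c
#include <stdint.h>

#define NO_OWNER (-1)

typedef struct {
    int      owner;    /* core id holding the dirty (M) copy, or NO_OWNER */
    uint64_t sharers;  /* bit i set => core i holds a clean (S) copy      */
} dir_entry_t;

/* The home directory handles a read miss from core c. Returns the core
 * the request must be forwarded to (the old owner), or NO_OWNER if
 * memory can reply directly. */
int dir_read_miss(dir_entry_t *d, int c) {
    int fwd = d->owner;
    if (fwd != NO_OWNER) {                   /* the 3-/4-hop case below    */
        d->sharers |= UINT64_C(1) << fwd;    /* old owner downgrades M->S  */
        d->owner = NO_OWNER;
    }
    d->sharers |= UINT64_C(1) << c;          /* requester becomes a sharer */
    return fwd;
}
```

The return value distinguishes the two read transactions on the next slides: `NO_OWNER` is the simple 2-hop read, and a real core id triggers the recall/forward flows.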

Read Transaction. L has a cache miss on a load instruction. [Figure: 1: L sends a Read Req to the home node H; 2: H sends a Read Reply back to L.]

4-hop Read Transaction. L has a cache miss on a load instruction, and the block was previously in Modified state at R. [Figure: directory state: M, owner: R. 1: L sends a Read Req to H; 2: H sends a Recall Req to R; 3: R sends a Recall Reply to H; 4: H sends a Read Reply to L.]

3-hop Read Transaction. Same miss, but the home forwards the request. [Figure: directory state: M, owner: R. 1: L sends a Read Req to H; 2: H sends a forwarded Read Req to R; 3: R sends a Read Reply directly to L and a forwarded-read Ack to H.]

An Example Race: Writeback & Read. L has a dirty copy and wants to write it back to H; R concurrently sends a read to H, so the writeback and the forwarded read race. [Figure: L sends a WB Req (1) while R sends a Read Req (2); H sends a forwarded Read Req to L (3), and L sends a Read Reply to R (5); no acks are needed for the writeback or the forwarded read; final directory state: S.] Races require complex intermediate states.

Basic Operation: Read. [Figure: L issues Read A (miss) to the directory; the directory entry becomes A: Shared, sharer #1.] This is the typical way to reason about directories.

Basic Operation: Write. [Figure: after the read, a write miss on A from the other core updates the directory entry from A: Shared, #1 to A: Modified, #2.]

Coherence vs. Consistency. Coherence concerns only one memory location; consistency concerns ordering across all locations. A memory system is coherent if all operations to a given location can be serialized: the operations performed by any core appear in program order, and a read returns the value written by the last store to that location. A memory system is consistent if it follows the rules of its memory model: operations on memory locations appear in some defined order.

Why Coherence != Consistency. /* initial A = B = flag = 0 */ P1 runs: A = 1; B = 1; flag = 1; while P2 runs: while (flag == 0); /* spin */ print A; print B; Intuition says we see 1 printed twice (for both A and B). But coherence doesn't say anything here: A, B, and flag are different memory locations, and uniprocessor ordering (the LSQ) won't help. Consistency is what defines the correct behavior.
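For contrast, the same flag idiom written with C11 acquire/release atomics does guarantee the intuitive outcome, because the required cross-location ordering is declared explicitly. This is my own sketch (function names invented), not code from the slides.

```c
#include <pthread.h>
#include <stdatomic.h>

static int A, B;                 /* ordinary shared data                  */
static atomic_int flag;          /* the synchronization variable          */

static void *p1(void *arg) {
    (void)arg;
    A = 1;
    B = 1;
    atomic_store_explicit(&flag, 1, memory_order_release);  /* publish    */
    return NULL;
}

static void *p2(void *arg) {
    int *out = arg;
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;                        /* spin until P1 publishes               */
    out[0] = A;                  /* acquire pairs with release: both are 1 */
    out[1] = B;
    return NULL;
}

void run_flag_idiom(int out[2]) {
    pthread_t t1, t2;
    pthread_create(&t2, NULL, p2, out);
    pthread_create(&t1, NULL, p1, NULL);
    pthread_join(&t1, NULL);
    pthread_join(&t2, NULL);
}
```

With plain non-atomic accesses to `flag`, neither the compiler nor a relaxed machine is obliged to preserve the ordering the intuition relies on; the release/acquire pair is what restores it.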

Sequential Consistency (SC). Processors issue memory ops in program order. [Figure: P1, P2, P3 feed into a single switch that is set randomly after each memory op, then into Memory.] SC defines a single sequential order among all ops.

Sufficient Conditions for SC. "A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program." (Lamport, 1979). Every processor issues memory ops in program order, and memory ops happen (start and end) atomically: on a Store, wait for it to commit before issuing the next memory op; on a Load, wait for it to write back before issuing the next op. Easy to reason about, very slow (without ugly tricks).

Mutual Exclusion Example. Mutually exclusive access to a critical region. P1 runs: locka: A = 1; if (B != 0) { A = 0; goto locka; } /* critical section */ A = 0; P2 runs: lockb: B = 1; if (A != 0) { B = 0; goto lockb; } /* critical section */ B = 0; This works as advertised under Sequential Consistency, but fails if P1 and P2 see different Load/Store orders: OoO execution allows P1 to read B before writing (committing) A.

Problems with the SC Memory Model. SC is difficult to implement efficiently in hardware: straightforward implementations allow no concurrency among memory accesses and strictly order memory accesses at each node, which essentially precludes out-of-order CPUs. SC is also unnecessarily restrictive: most parallel programs won't notice out-of-order accesses, and it conflicts with latency-hiding techniques.

Mutex Example w/ Store Buffer. Same code as before: P1 runs locka: A = 1; if (B != 0) { A = 0; goto locka; } /* critical section */ A = 0; and P2 runs lockb: B = 1; if (A != 0) { B = 0; goto lockb; } /* critical section */ B = 0; [Figure: P1's Write A (t3) and P2's Write B (t4) sit in store buffers while P1 reads B (t1) and P2 reads A (t2) over the shared bus; with A=0 and B=0 still in memory, both reads return 0.] The lock does not work: both threads can enter the critical section.

Relaxed Consistency Models. Notation: X -> Y means X must complete before Y. Sequential Consistency (SC) enforces R -> W, R -> R, W -> R, and W -> W. Total Store Ordering (TSO) relaxes W -> R, keeping R -> W, R -> R, and W -> W. Partial Store Ordering (PSO) additionally relaxes W -> W (allowing a coalescing write buffer), keeping R -> W and R -> R. Weak Ordering and Release Consistency (RC) require all ordering to be explicitly declared: use fences to define ordering boundaries, and acquire/release operations to force flushing of values.
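As a sketch of how a fence restores an ordering a relaxed model drops: under a TSO-like model the mutex above fails because the store to A can sit in the store buffer past the read of B; a full fence between them reinstates the W -> R edge. This is my own C11 rendering of P1's entry attempt (function name invented).

```c
#include <stdatomic.h>

static atomic_int A, B;          /* the two lock variables, initially 0 */

/* P1's entry attempt from the mutex example, with an explicit fence. */
int p1_try_enter(void) {
    atomic_store_explicit(&A, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);   /* order W(A) before R(B)  */
    if (atomic_load_explicit(&B, memory_order_relaxed) != 0) {
        atomic_store_explicit(&A, 0, memory_order_relaxed);
        return 0;                                /* back off (goto locka)   */
    }
    return 1;                                    /* in the critical section */
}
```

P2's side would carry the symmetric fence between its store to B and its load of A; with both fences in place, at most one of the two loads can return 0.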

Atomic Operations & Synchronization. Atomic operations perform multiple actions together, and each of these primitives can implement the others. Compare-and-Swap (CAS): if the memory value matches the expected value, overwrite it with a new value. Test-and-Set: write a new value and return the old value. Fetch-and-Increment: increment the value in memory and return the old value. Load-Linked/Store-Conditional (LL/SC): two operations, where the Store succeeds iff the value is unchanged since the Load.
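A sketch of the "each can implement the others" claim: Test-and-Set built from Compare-and-Swap using C11 `stdatomic`, plus the classic spinlock on top of it (function names are mine).

```c
#include <stdatomic.h>

/* Test-and-Set built from CAS: write 1, return the old value. */
int tas_from_cas(atomic_int *p) {
    int old;
    do {
        old = atomic_load(p);
    } while (!atomic_compare_exchange_weak(p, &old, 1));
    return old;                 /* 0 => we set it; 1 => it was already set */
}

/* The classic test-and-set spinlock on top. */
void spin_lock(atomic_int *l)   { while (tas_from_cas(l) != 0) ; }
void spin_unlock(atomic_int *l) { atomic_store(l, 0); }
```

The weak CAS may fail spuriously, which is why it sits in a retry loop; the loop also covers the case where another core changes the value between the load and the CAS.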