Lec-11. Multi-Threading Concepts: Coherency, Consistency, and Synchronization


Title slide
  Lec-11: Multi-Threading Concepts: Coherency, Consistency, and Synchronization
  ECE-720-t4: Innovations in Processor Design, 2008t1 (Winter)
  Mark Aagaard, University of Waterloo

Coherency vs Consistency
  Memory coherency and consistency are major concerns in the design of shared-memory systems.
  - Coherency: guarantees that a single processor can execute a sequential program correctly; defines the relative order of read and write operations from multiple processors to a single memory location.
  - Consistency: guarantees that multiple processors can execute a concurrent program correctly; defines the relative order of write operations from multiple processors to multiple memory locations.

Introduction
  [Figure: example sequences of writes and reads to locations A and B by processors P1, P2, and P3.]

Simple Memory Coherency
  Three requirements for simple coherency:
  - If Pi writes to A, and then reads from A without any intervening writes by Pi or another processor, then the read will return the value written by Pi.
  - If Pi writes to A, and then after a sufficiently long time without any intervening writes by Pi or another processor Pj reads A, then the read will return the value written by Pi.
  - If Pi writes 1 to A and Pj writes 2 to A, then all processors will agree on the relative order of the two write operations.

Simple Memory Consistency
  Two requirements for simple memory consistency:
  - A write operation is defined to be completed when all processors have seen the effect of the write.
  - Each processor preserves the order of writes with respect to both reads and writes.

Synchronization
  Basic idea: force multiple processors to stop at particular points in their programs.
  [Figure: P1, P2, and P3 each reach a synchronization point.]
  Purpose of synchronization in hardware: provide basic building blocks for messages and critical sections in software.

Coherency: Introduction

Terminology for Shared Memory
  - Location of memory (is access time uniform?): Symmetric vs Distributed.
  - Communication mechanism: Shared vs Private.

Distributed Memory Systems
  Distributed shared memory vs distributed private memory:
  - Shared: HW is responsible for coherency; HW is complicated; SW uses normal load and store instructions to communicate.
  - Private: SW is responsible for coherency; HW is simple; SW uses messages and remote procedure calls to communicate.
  Q: Why shared = hw-coherence and private = sw-coherence?

Multiprocessor Memory Models
  - UMA: Ld/St communication, Symmetric memory, HW coherency, snooping protocol.
  - NUMA: Ld/St communication, Distributed memory, HW coherency, directory protocol.
  - NUMA: message communication, SW coherency.

HW vs SW for Coherence
  HW has a dynamic view of the program; SW has a static view.
  - HW view: small horizon (can see only a few instructions at a time); fine detail (e.g. knows physical addresses).
  - SW view: large horizon (can see many instructions at a time); coarse detail (e.g. doesn't know physical addresses).
  To guarantee coherence, SW must treat any piece of data that might be shared as if it were shared. HW knows at each instant whether a piece of data is shared.
  SW must behave conservatively: for example, generate invalidate messages that might not be needed.

Try SW-Coherence for Shared Memory
  Caches introduce coherency problems.
  - Option 1: remove caches. Exactly one copy of each memory block; no coherency problems! No performance!
  - Option 2: only private data may be cached. Same effect as Option 1.
  - Option 3: software controls the cache.
    - 3a: compiler generates cache control code. Feasible only for specialized applications (e.g. arrays).
    - 3b: programmer controls the cache. For efficiency, want to move blocks of memory, but normal ISAs provide load/store for at most double words.

Summary of HW vs SW
  - Shared vs Private: shared has a simple programming model.
  - Symmetric vs Distributed: symmetric uses a snooping protocol, which doesn't scale.
  - SW coherency is difficult.
  Goal: find a HW-coherency protocol for distributed shared memory: simple programming model, scales to 64-256 processors.
  Answer: the directory protocol, a HW-based coherency protocol for distributed memory.

Coherence Protocols: Snoop vs Directory
  Two major categories of memory coherence protocols, distinguished by the location of the status information for a block of physical memory.
  - Symmetric, snooping: status kept with each cache (C1, C2) on the shared bus.
  - Distributed, directory: status kept with each memory module (M1, M2).
  [Figure: block diagrams of the two organizations.]

Coherence Protocols: Snoop vs Directory (comparison)
  [Table: compares snooping vs directory protocols by location of status, how status is updated, and shared memory type.]

Coherency: Snooping Protocols

Maintaining Coherence on Write
  Applicable to both snooping and directory protocols.
  - Write invalidate: before Pi writes to memory location A, all other processors must invalidate location A. This guarantees that Pi has exclusive access to A.
  - Write update: when Pi writes to memory location A, it broadcasts the write to all other processors.
  Write invalidate usually has better performance than write update, and is much more popular.

Snooping Protocols
  Each cache keeps a copy of the status of each memory block that it contains. Status values:
  - Invalid: must refetch the block on next access.
  - Shared: block held by multiple caches, all are clean.
  - Modified: local processor has modified the block (other processors should be Invalid, main memory is out of date).
  - Exclusive: local processor is the only processor that contains the block (block is clean, main memory is up to date).
  - Owned: local processor owns the block and must service all requests for it.
  Options for the set of status values: MSI, MESI, MOSI, MOESI.

Snooping Protocol Example
  H&P Fig 6.8: write-through, write-invalidate, Status = {Valid, Invalid}.
  [Figure: P1 reads A (miss, block fetched over the bus), P2 reads A (miss, fetched), then P1 writes β to A: the write goes through to Mem[A] and an invalidate on the bus removes C2's copy.]
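
As a rough illustration of the write-through, write-invalidate protocol with Status = {Valid, Invalid} described above, here is a minimal C sketch (the naming and structure are my own, not from the slides): a read refetches the block on Invalid, and a write goes through to memory while invalidating every other cached copy over the "bus".

    /* Minimal sketch of a write-through, write-invalidate protocol with
     * Status = {Valid, Invalid}. Bus traffic is modeled as direct updates. */
    #include <stdio.h>

    typedef enum { INVALID, VALID } LineState;
    typedef struct { LineState state; int data; } CacheLine;

    static int mem_A = 0;                       /* main memory copy of block A */

    int proc_read(CacheLine *line) {
        if (line->state == INVALID) {           /* read miss */
            line->data  = mem_A;                /* bus read refetches the block */
            line->state = VALID;
        }
        return line->data;
    }

    void proc_write(CacheLine *line, int value, CacheLine *others[], int n) {
        line->data  = value;
        line->state = VALID;
        mem_A = value;                          /* write-through to memory */
        for (int i = 0; i < n; i++)
            others[i]->state = INVALID;         /* bus invalidate of other copies */
    }

    int main(void) {
        CacheLine c1 = {INVALID, 0}, c2 = {INVALID, 0};
        CacheLine *othersOfC1[] = { &c2 };
        proc_read(&c1);                         /* P1: Rd A (miss) */
        proc_read(&c2);                         /* P2: Rd A (miss) */
        proc_write(&c1, 42, othersOfC1, 1);     /* P1: Wr A <- beta; C2 invalidated */
        printf("C2 is %s after P1's write\n",
               c2.state == INVALID ? "INVALID" : "VALID");
        return 0;
    }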

Snooping Protocol Example (competing writes)
  Write-through, write-invalidate, Status = {Valid, Invalid}. Illustrates competing writes.
  [Figure: P1 and P2 both read A, then P1 writes β and P2 writes γ; the invalidates on the bus order the two writes, and Mem[A] ends up with the later value.]

Advantages of Rich Status
  The basic protocol provides only the Invalid status. Q: Why add more status values? With richer status values (Shared, Modified, Exclusive, Owned) the protocol can track:
  - which processors have the memory block,
  - which processor has the memory block exclusively,
  - whether a processor has written to the block.

Coherency: Directory Protocols

Memory Block Status
  - Uncached: no processor has a copy of the memory block.
  - Shared: memory block held by multiple caches, all are clean.
  - Exclusive: exactly one processor has a copy of the memory block; that processor might have written to the block.
  Implementation of status: a few bits for the status value, plus a one-hot encoding of the sharing processors (one share bit per processor, e.g. P9 ... P0).
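
The "few bits of status plus a one-hot sharer vector" layout just described can be pictured with a small C struct; this is a hypothetical sketch, and the names are illustrative rather than taken from the lecture.

    /* Directory entry: block status plus a one-hot sharer vector. */
    #include <stdint.h>
    #include <stdio.h>

    typedef enum { UNCACHED, SHARED, EXCLUSIVE } BlockStatus;

    typedef struct {
        BlockStatus status;
        uint32_t    sharers;     /* bit i set => processor Pi has a copy */
    } DirEntry;

    static void add_sharer(DirEntry *e, int p)  { e->sharers |=  (1u << p); }
    static void drop_sharer(DirEntry *e, int p) { e->sharers &= ~(1u << p); }

    int main(void) {
        DirEntry a = { UNCACHED, 0 };
        add_sharer(&a, 1);                       /* P1 reads the block */
        add_sharer(&a, 2);                       /* P2 reads the block */
        a.status = SHARED;
        printf("status=%d sharers=0x%x\n", a.status, a.sharers);   /* 0x6 */
        drop_sharer(&a, 1);                      /* P1's copy invalidated */
        printf("status=%d sharers=0x%x\n", a.status, a.sharers);   /* 0x4 */
        return 0;
    }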

Directory Protocol
  Two fundamental operations:
  - Read miss.
  - Write to a non-exclusive block: handle it as a write miss (e.g. read miss; then write); wait until exclusive access is held before changing memory.
  Complication: there is a complex interconnection network rather than a single bus. Snooping relies on the bus to provide a consistent ordering of write operations.

Directory Messages (Simple Version)
  Terminology:
  - home: owner of the block's data and directory entry.
  - local: source of the request.
  - remote: holder of a copy of the block.
  Possible to have: local = home, local = remote, or remote = home.
  Message types: read miss, write miss, data reply, fetch/invalidate, fetch, invalidate, data writeback.

Directory Protocol Example
  Same scenario as the first snooping protocol example in Lec-18.
  [Figure: A starts uncached (unc). P1's RdMiss A makes the directory entry exclusive (exc) for P1; P2's RdMiss A changes it to shared (sh) with P1 and P2.]

Directory Protocol Example (write to a shared block)
  P2 writes to A when A is shared.
  [Figure: P2 sends WrMiss A to the home directory; the directory sends Invalidate A to C1; the entry goes from sh to exc for P2; P2 then writes β, so C2 holds A = β.]
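
A hedged sketch (not the lecture's exact protocol) of how the home node could handle the two fundamental operations named above. Following the example trace, the first reader gets the block Exclusive and a second reader downgrades it to Shared; a write miss invalidates all other copies. Message sends are stubbed out as printfs, and all names are mine.

    #include <stdint.h>
    #include <stdio.h>

    typedef enum { UNCACHED, SHARED, EXCLUSIVE } BlockStatus;
    typedef struct { BlockStatus status; uint32_t sharers; } DirEntry;

    static void send(const char *msg, int p) { printf("  %s -> P%d\n", msg, p); }

    void read_miss(DirEntry *e, int requester) {
        if (e->status == EXCLUSIVE)              /* current owner may hold dirty data */
            for (int p = 0; p < 32; p++)
                if (e->sharers >> p & 1u) send("fetch (writeback)", p);
        e->sharers |= 1u << requester;
        e->status = (e->sharers & (e->sharers - 1)) ? SHARED : EXCLUSIVE;
        send("data reply", requester);
    }

    void write_miss(DirEntry *e, int requester) {
        for (int p = 0; p < 32; p++)             /* invalidate every other copy */
            if ((e->sharers >> p & 1u) && p != requester) send("invalidate", p);
        e->sharers = 1u << requester;            /* requester becomes sole owner */
        e->status  = EXCLUSIVE;
        send("data reply", requester);
    }

    int main(void) {
        DirEntry a = { UNCACHED, 0 };
        printf("P1 RdMiss A\n"); read_miss(&a, 1);    /* A: exc, owner P1 */
        printf("P2 RdMiss A\n"); read_miss(&a, 2);    /* A: sh, sharers P1, P2 */
        printf("P2 WrMiss A\n"); write_miss(&a, 2);   /* invalidate P1; A: exc, P2 */
        return 0;
    }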

Directory Protocol Example (competing writes)
  P1 and P2 try to write to A simultaneously.
  [Figure: both processors send WrMiss A to the home directory at M2; the directory serializes them, using fetch/invalidate (FetInv) to take the block away from the first writer before granting exclusive access to the second.]

Consistency Performance Example
  H&P 6.12: use distributed memory with bus communication.
  - 64B cache blocks, negligible I-cache misses.
  - Total D-cache miss rate: 2%
  - Misses to private data: 60%
  - Shared-data misses that are uncached: 20%
  - Shared-data misses that are dirty: 80%
  - Local memory hit: 100 ns
  - Remote access to uncached data: 1800 ns
  - Remote access to dirty data: 2200 ns
  - 1 GHz clock, CPI with 100% cache hits = 1.0, 40% of instructions are loads or stores.
  Find the real CPI. (One way to work this is sketched after the next slide.)

Quagmire of Consistency
  P1: A ← 0; ... A ← 1; if (B == 0) { print "P2 is slow" }
  P2: B ← 0; ... B ← 1; if (A == 0) { print "P1 is slow" }
  Q: Is it possible for both P1 and P2 to think that the other is slow?
  The issue is consistency: the relative order of accesses to A and B by a single processor.
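
Here is one plausible way to work the "find the real CPI" question above, assuming the 2% miss rate is per data reference, the latencies convert one-for-one to cycles at 1 GHz, and private-data misses are served by local memory. Treat it as a sketch, not the textbook's official solution.

    #include <stdio.h>

    int main(void) {
        double mem_refs_per_instr = 0.40;          /* 40% of instrs are ld/st */
        double miss_rate          = 0.02;          /* total D-cache miss rate */
        double frac_private       = 0.60;          /* misses to private data */
        double frac_shared        = 0.40;
        double shared_uncached    = 0.20, shared_dirty = 0.80;
        double local_ns = 100, remote_uncached_ns = 1800, remote_dirty_ns = 2200;

        /* 1 GHz clock => 1 ns per cycle, so ns and cycles are interchangeable. */
        double avg_miss_penalty =
            frac_private * local_ns +
            frac_shared  * (shared_uncached * remote_uncached_ns +
                            shared_dirty    * remote_dirty_ns);    /* ~908 cycles */

        double stall_cpi = mem_refs_per_instr * miss_rate * avg_miss_penalty;
        printf("real CPI = %.2f\n", 1.0 + stall_cpi);              /* ~8.26 */
        return 0;
    }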

Tradeoffs in Consistency
  There are many definitions of consistency.
  - Strict (strong) definitions: make everything sequential; easy to understand; less parallelism, hurts performance.
  - Loose (weak) definitions: lots of parallelism and out-of-orderedness; hard to understand; improves performance.

Sequential Consistency
  The overall order of operations is consistent with the order seen by each individual processor.
  Example: P1: a; b; c; d; e; f;  P2: m; n; o; p; q;
  - OK:  a; b; c; d; e; f; m; n; o; p; q;
  - OK:  a; m; n; b; o; c; p; d; q; e; f;
  - BAD: a; m; n; b; p; c; o; d; q; e; f;
  Sequential consistency is incompatible with loads passing stores in out-of-order execution.
  Idea: weak ordering models of consistency.

Idea: Weak Orderings
  Allow operations to different addresses to complete out of order; use synchronization where order must be enforced.
  Relaxation modes:
  - total-store ordering (relax RAW order)
  - partial-store ordering (relax WAW order)
  - weak ordering, Alpha ordering, PowerPC ordering (relax WAR and RAR order)
  Programming with relaxed ordering is very stressful.

Speculation with Sequential Ordering
  An alternative to weak ordering: speculate that performing memory operations out of order will produce the same result as preserving sequential order.
  P1: A ← 0; A ← 1; Ld R3, B; BNZ R3, end; print "P2 is slow"; end:
  P2: B ← 0; B ← 1; Ld R3, A; BNZ R3, end; print "P1 is slow"; end:
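
A small C11 sketch of the A/B flag example above. With relaxed atomics, each thread's store and following load may be reordered (or the store may simply not yet be visible), so both threads can read 0 and both messages can print; using memory_order_seq_cst instead gives the sequentially consistent behaviour the slides assume. The thread names and structure are mine.

    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    atomic_int A, B;

    void *p1(void *arg) {
        (void)arg;
        atomic_store_explicit(&A, 1, memory_order_relaxed);     /* A <- 1 */
        if (atomic_load_explicit(&B, memory_order_relaxed) == 0)
            puts("P2 is slow");
        return NULL;
    }

    void *p2(void *arg) {
        (void)arg;
        atomic_store_explicit(&B, 1, memory_order_relaxed);     /* B <- 1 */
        if (atomic_load_explicit(&A, memory_order_relaxed) == 0)
            puts("P1 is slow");
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, p1, NULL);
        pthread_create(&t2, NULL, p2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }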

Synchronization
  Challenges:
  - Correctness (solution to bugs): hardware provides simple, fast primitives; low-level programmers build libraries on top of the ISA-specific primitives that provide high-level constructs to programmers.
  - Performance: when there is no contention, minimize latency; when there is high contention, minimize communication.

Synchronization (basic idea)
  Force multiple processors to stop at particular points in their programs.
  [Figure: P1, P2, and P3 each reach a synchronization point.]
  Purpose of synchronization in hardware: provide basic building blocks for messages and critical sections in software.
  Requirements on a primitive operation for synchronization:
  - Read a memory location.
  - Change the value of the memory location.
  - Either guarantee that, or test whether, the (read; change) pair is atomic.

Historical Approaches
  Hardware provides a single instruction that is guaranteed to be atomic:
  - test-and-set reg, addr:         reg ← M[addr]; M[addr] ← #1
  - fetch-and-increment reg, addr:  reg ← M[addr]; M[addr] ← M[addr] + 1
  - atomic exchange reg, addr:      tmp ← M[addr]; M[addr] ← reg; reg ← tmp
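
For reference, the three historical primitives just listed can be expressed with C11 <stdatomic.h>, which the compiler lowers to the ISA's atomic instructions; this is a hedged sketch with illustrative names, not code from the lecture.

    #include <stdatomic.h>
    #include <stdio.h>

    atomic_int lockvar, counter, slot;

    int test_and_set(atomic_int *addr)        { return atomic_exchange(addr, 1); }
    int fetch_and_increment(atomic_int *addr) { return atomic_fetch_add(addr, 1); }
    int atomic_xchg(atomic_int *addr, int v)  { return atomic_exchange(addr, v); }

    int main(void) {
        printf("test-and-set old value: %d\n", test_and_set(&lockvar));       /* 0 */
        printf("fetch-and-increment:    %d\n", fetch_and_increment(&counter));
        printf("atomic exchange:        %d\n", atomic_xchg(&slot, 7));
        return 0;
    }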

Historical Approaches (continued)
  Disadvantage: requires that a read-write sequence is atomic. Memory coherence requires that no other memory operations occur between the read and the write.
  Potential problems with deadlock.
  [Figure: a sequence R1 = sqrt(...); Ld R3, (R1); Ex R3, A, where the load of (R1) misses while the atomic exchange on A is in flight, and the directory messages (P1 WrMiss A, P2 RdMiss (R1)) can block each other.]
  Q: what's the problem?

Modern Approaches
  Don't guarantee atomicity of (read; write).
  Instead, provide two operations, a read and a write, plus a test of whether the (read; write) pair was atomic.
  If the test returns FALSE, then the (read; write) was not atomic, so try the (read; write) again.

Load-Link + Store-Conditional (Idea)
  Load-linked: load the memory value into a register and into a special link register.
    LL reg, addr:  reg ← M[addr]; link ← reg
  Store conditional: if the memory value = link register, then store the register value to memory, else fail.
    SC reg, addr:  if M[addr] = link { M[addr] ← reg } else { reg ← #0 }
  Q: what's the problem with SC?

Use of Load-Link
  Assembly code to implement atomic exchange with load-link. The code is the equivalent of: atomic-exchange R4, (R1).
    try: R3 ← R4        -- R3 holds working copy of R4
         LL   R2, (R1)  -- R2 ← M[R1]
         SC   R3, (R1)  -- M[R1] ← R3 (if it succeeds)
         BEQZ R3, try   -- jump to try if the store failed
         R4 ← R2        -- R4 ← M[R1]
  Q: why do we need R3, rather than using just R4?
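
As a hedged analogue of the LL/SC retry loop above, the same atomic exchange can be written with C11 compare-and-swap (on many RISC ISAs the compiler lowers this to an actual LL/SC loop). The function name and structure are mine.

    #include <stdatomic.h>
    #include <stdio.h>

    int atomic_exchange_cas(atomic_int *addr, int newval) {
        int old = atomic_load(addr);                 /* like LL: read current value */
        /* retry until the (read; write) pair was effectively atomic */
        while (!atomic_compare_exchange_weak(addr, &old, newval))
            ;                                        /* like a failed SC: try again */
        return old;                                  /* like R4 <- R2 */
    }

    int main(void) {
        atomic_int x = 5;
        int prev = atomic_exchange_cas(&x, 9);
        printf("prev=%d new=%d\n", prev, atomic_load(&x));   /* prev=5 new=9 */
        return 0;
    }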

Load-Link + Store-Conditional (Reality)
  Load-linked: load the memory value into a register and copy the address into the special link register.
    LL reg, addr:  reg ← M[addr]; link ← addr
  Store conditional: if addr = link register, then store the register value to memory, else fail.
    SC reg, addr:  if addr = link { M[addr] ← reg } else { reg ← #0 }
  The link register is special.

Atomicity of LL+SC
  The processor must reset the link register if the memory location that the link register points to might have been modified.
  [Figure: P1 executes LL R2, A and then SC R3, A; in between, P2 stores β to A (ST A ← β), which invalidates C1's copy of A and clears Link1. Q: what is R3, and what is the status of M[A]?]

Spin-Lock
  Locks are used to guarantee mutual exclusion. With a spin-lock, a process spins in a tight loop until it can acquire the lock.
  Pseudo code for use of a spin-lock in mutual exclusion:
    while ( locked = #1 ) {};  -- loop waiting for locked = #0
                               -- know that locked = #0
    locked ← #1;               -- set locked to prevent others
    -- critical section --
    locked ← #0;               -- release locked

Spin-Lock with LL+SC
    loop: LL   R2, locked     -- loop waiting for locked = #0
          BNEZ R2, loop
          R2 ← #1             -- set locked to prevent others
          SC   R2, locked
          BEQZ R2, loop       -- jump back to top if store failed
          -- critical section --
          ST   locked, #0     -- release locked
  Q: what would happen if the SC were replaced with an ST?
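
A hedged C11 sketch of the spin-lock above: the inner load loop mirrors the LL/BNEZ spin on a non-zero lock, the exchange plays the role of the SC attempt, and a plain store releases the lock. Names and the worker test are mine.

    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    atomic_int locked;              /* 0 = free, 1 = held */
    long shared_counter;

    void spin_lock(atomic_int *l) {
        for (;;) {
            while (atomic_load(l) != 0)        /* spin while locked = #1 */
                ;
            if (atomic_exchange(l, 1) == 0)    /* attempt to take it atomically */
                return;                        /* success: we moved it 0 -> 1 */
        }
    }

    void spin_unlock(atomic_int *l) { atomic_store(l, 0); }   /* release */

    void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            spin_lock(&locked);
            shared_counter++;                  /* critical section */
            spin_unlock(&locked);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", shared_counter);   /* 200000 */
        return 0;
    }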

Caches and Spin-Lock with LL+SC
  [Figure: one processor spins with repeated LL R2, A on its cached copy of the lock (value 1); when the lock holder releases it with ST A ← 0, the spinning processor's copy is invalidated, it takes a read miss, reads 0, and can then try to acquire the lock.]

Performance Analysis
  - Bus and memory traffic are low while spinning (good).
  - Lots of communication when the lock is transferred (bad):
    - must invalidate all copies of the lock when it is released (the ST);
    - must invalidate all copies of the lock when it is acquired (the SC).
  - Observation: only one processor will acquire the lock. For a losing processor: locked=0, INV, RdMiss, locked=0.
  Idea: pick the winning processor and don't talk to the losers (queuing lock).

Queuing Lock
  Augment the memory directory with a queue of processors waiting for the lock.
  [Figure: Mem holds the current locker and the queue; P2, P3, and P4 send request (req), acquire (acq), and release (rel) messages, and the lock is handed from one queued processor to the next.]

Exponential Backoff
  Another technique to reduce memory/bus traffic with spin locks.
  Each time a processor fails to acquire the lock, it waits twice as long before retrying.
  Implemented in a low-level software library; no change to hardware.
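
The backoff idea can be sketched in a few lines of C on top of a test-and-set style lock; the delay loop and its cap are purely illustrative assumptions (a real library would likely also randomize the wait).

    #include <stdatomic.h>

    void backoff_lock(atomic_int *l) {
        unsigned delay = 1;
        while (atomic_exchange(l, 1) != 0) {        /* failed to acquire */
            for (volatile unsigned i = 0; i < delay; i++)
                ;                                   /* wait */
            if (delay < (1u << 16))
                delay *= 2;                         /* wait twice as long next time */
        }
    }

    void backoff_unlock(atomic_int *l) { atomic_store(l, 0); }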

Barrier Synchronization
  A barrier is a point in a program where all processors must synchronize:
  - force all processors to wait at the barrier;
  - release all processors after the last one has arrived at the barrier.
  Implement a barrier with two spin-locks:
  - one protects the critical section that counts arrivals;
  - processors spin-lock at the barrier, waiting for the barrier to be unlocked.
  Performance problem with many processors: a queuing lock forces serialization of the arrival counter.
  Idea: use a combining tree of locks, rather than a single lock.

Combining Tree
  [Figure: processors P1-P9 are grouped into subtrees; each group (e.g. P1-P3) has its own spin-lock protecting its arrival counter, and the counts combine up the tree. Once all processors have arrived, the barrier is unlocked.]
  Can also use a combining tree for the spin-lock that lets processors past the barrier. Q: when is this beneficial?
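
A hedged C11 sketch of the centralized barrier described above: an atomic arrival counter plus a flag the processors spin on until the last arrival releases them. Sense reversal is added so the barrier can be reused across phases; the names and thread setup are mine, not from the slides.

    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    #define NPROC 4

    atomic_int arrivals;            /* counts processors at the barrier */
    atomic_int release_sense;       /* flips each time the barrier opens */

    void barrier(void) {
        int my_sense = !atomic_load(&release_sense);
        if (atomic_fetch_add(&arrivals, 1) == NPROC - 1) {   /* last to arrive */
            atomic_store(&arrivals, 0);
            atomic_store(&release_sense, my_sense);          /* unlock the barrier */
        } else {
            while (atomic_load(&release_sense) != my_sense)  /* spin until open */
                ;
        }
    }

    void *worker(void *arg) {
        long id = (long)arg;
        for (int phase = 0; phase < 3; phase++) {
            printf("P%ld reached barrier %d\n", id, phase);
            barrier();
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[NPROC];
        for (long i = 0; i < NPROC; i++) pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < NPROC; i++)  pthread_join(t[i], NULL);
        return 0;
    }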