
CMSC 22200 Computer Architecture Lecture 15: Memory Consistency and Synchronization Prof. Yanjing Li University of Chicago

Administrative Stuff
- Lab 5 (multi-core)
  - Basic requirements: out later today
  - Extra credit: out later this week
  - Due: 11:59pm, Thursday, Dec. 1st
  - Two late days with penalty
- My office hours this week are canceled

Lecture Outline
- Cache coherence (continued)
- Memory consistency
- Synchronization

Parallel Computer Architecture
- Important for both computer architects and programmers
- Why do programmers need to know about parallel computer architecture?
  - They need to get parallel programs to be correct
  - They need to optimize performance in the presence of bottlenecks

Main Multi-Core Design Issues
- Cache coherence
  - Ensure correct operation in the presence of private caches
- Memory consistency: ordering of memory operations
  - What should the programmer expect the hardware to provide?
- Shared memory synchronization
  - Hardware support for synchronization primitives
- We will discuss the above issues
- Others: shared resource management, interconnects, ...

Memory Coherence Discussions, Continued

Review: Cache Coherence
- Intuition: reading the value at memory location A should return the last value written to A by any processor
- What does "last" mean?
- Single processor: easy; everything follows program order
- Multi-core:
  - What if two processors write at the same time?
  - What if a read follows a write so closely in time that it is physically impossible to communicate the new value?
  - We need all processors to see the same write order within a single execution (ordering in different executions can differ)

Properties of Coherence
- I. Program order on each processor (von Neumann model)
- II. Write propagation: guarantee that updates will propagate
- III. Write serialization: provide a consistent global order seen by all processors (this store ordering needs a global point of serialization)
- Check yourself: locks/barriers etc. do not solve the coherence problem. Why?
- Aside: do uniprocessors have coherence issues?

Review: Snooping Cache Coherence
- Idea
  - Use a shared bus to provide a single point of serialization
  - All caches now have two ends, the processor and the bus, and must observe/respond to both
  - All caches serve memory requests from their own processors
  - All caches also snoop the bus to see what everyone else is doing, and take actions accordingly to keep things coherent
- Protocols: VI, MSI, MESI, ...
- Tradeoffs: simple vs. complex protocols, cache-to-cache transfer vs. memory access, update vs. invalidate protocols

Atomic Bus Assumption
- We assume that bus operations are atomic, i.e., one operation finishes before the next one can begin
- Simple, but low throughput
- (Figure: an atomic bus serializes Req 1 -> delay -> Response 1, then Req 2 -> delay -> Response 2; a non-atomic bus overlaps Req 1, Req 2, Resp 1, Resp 2)
- Non-atomic -> transient states -> more complex!

Scalability
- Snooping cache protocols are easy to understand and implement
- Good for small scale
- But what if you would like to have a 1000-core CMP?

Directory-Based Coherence
- Idea: a logically central directory keeps track of where the copies of each cache block reside; caches consult this directory to ensure coherence
- An example mechanism:
  - For each cache block in memory, store P+1 bits in the directory
    - One bit per cache, indicating whether the block is in that cache
    - Exclusive bit: indicates that one cache has the only copy of the block and can update it without notifying others
  - On a read: set the requesting cache's bit and arrange the supply of data
  - On a write: invalidate all caches that have the block and reset their bits
  - Also keep an exclusive bit with each block in each cache (so that the cache can update an exclusive block silently)

Directory-Based Coherence (figure)

Snooping vs. Directory Coherence
- Snooping
  - + Simple
  - + Miss latency (critical path) is short: request -> bus transaction to memory
  - + Global serialization is easy: the bus provides this already (arbitration)
  - - Relies on broadcast messages being seen by all caches (in the same order); the single point of serialization (the bus) is not scalable
- Directory
  - + Does not require broadcast to all caches
  - + Much more scalable than a bus
  - - Adds indirection to miss latency (critical path): request -> directory -> memory
  - - Requires extra storage space to track sharer sets
  - - Protocols and race conditions are more complex (for high performance)

False Sharing
- P1 repeatedly loads and stores word0; P2 repeatedly loads and stores word3
- But both words live in the same cache block/line: | word0 | word1 | word2 | word3 |
- The block ping-pongs between the two caches even though the processors never access the same word

Quick Tip to Avoid False Sharing
- DO
  - Map variables written by different processors to different cache blocks
  - Group variables written by the same processor into the same cache block
- DON'T
  - Group variables written by different processors into the same cache block

Which Is Better?

Option 1 (parallel arrays indexed by processor number):

    int sum[NUM_PROCS];
    int product[NUM_PROCS];
    sum[mynum]++;
    product[mynum] *= 2;

Option 2 (one struct per processor):

    typedef struct {
        int sum;
        int product;
    } Proc;
    Proc x[NUM_PROCS];
    x[mynum].sum++;
    x[mynum].product *= 2;

Takeaway
- Cache coherence is critical for ensuring correctness
- Software-managed cache coherence is very difficult
- Hardware coherence protocols help programmers write correct and high-performance programs
  - Snooping cache protocols: VI, MSI, MESI
  - How do they work?
  - Various design decisions and tradeoffs
- Programmers: be aware of and avoid false sharing!

Main Multi-Core Design Issues
- Cache coherence
  - Ensure correct operation in the presence of private caches
- Memory consistency: ordering of memory operations
  - What should the programmer expect the hardware to provide?
- Shared memory synchronization
  - Hardware support for synchronization primitives
- We will discuss the above issues
- Others: shared resource management, interconnects, ...

Memory Consistency

Motivational Example
- Dekker's algorithm for critical sections [Adve, WRL Research Report 95]
- Can the two processors be in the critical section at the same time, given that they both obey the von Neumann model?
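The code shown on this slide did not survive transcription; the following is a reconstruction of the two-flag entry protocol from Adve's report, with the labels A, B, X, Y that the later slides refer to:

```
/* Initially: Flag1 = 0, Flag2 = 0 */

P1:                              P2:
  Flag1 = 1;         /* A */       Flag2 = 1;         /* X */
  if (Flag2 == 0)    /* B */       if (Flag1 == 0)    /* Y */
      /* critical section */           /* critical section */
```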

Motivational Example
- Intuition: assume P1 is in the critical section. Then Flag2 must be 0, which means P2 cannot have executed Flag2 = 1, which means P2 cannot be in the critical section. [Adve, WRL Research Report 95]

Both Processors in Critical Section!
- Consider a store buffer (aka write buffer)
  - Remember this from out-of-order execution?
  - Can also be used with in-order execution
- (Figure: loads go from the processor straight to the cache; stores drain through the store buffer, with load bypassing)

Both Processors in Critical Section!
- Cycle 1 (A): value written into P1's store buffer; P1 thinks A has executed, but memory is not updated until cycle 51
- Cycle 1 (X): value written into P2's store buffer; P2 thinks X has executed, but memory is not updated until cycle 52
- Cycle 2 (B): P1 still sees 0 in Flag2, so it enters the critical section
- Cycle 2 (Y): P2 still sees 0 in Flag1, so it enters the critical section
[Adve, WRL Research Report 95]

Both Processors in Critical Section!
- What happened?

    P1's view of memory operations    P2's view of memory operations
    A (cycle 1)                       X (cycle 1)
    B (cycle 2)                       Y (cycle 2)
    X (cycle 51)                      A (cycle 52)

- To P1, A appeared to happen before X; to P2, X appeared to happen before A

The Problem
- The two processors did NOT see the same order of operations to memory
- The "happened before" relationship between multiple updates to memory was inconsistent between the two processors' points of view
- As a result, each processor thought the other was not in the critical section

How Can We Solve the Problem?
- Idea: sequential consistency
- I. All processors see the same order of operations to memory
  - i.e., all memory operations happen in an order (called the global total order) that is consistent across all processors
- II. Within this global order, each processor's operations appear in sequential order with respect to its own operations

Sequentially Consistent Operation Orders
- Potential correct global orders (all are correct):
  - A B X Y
  - A X B Y
  - A X Y B
  - X A B Y
  - X A Y B
  - X Y A B
- Which order (interleaving) is observed depends on the implementation and dynamic latencies
[Adve, WRL Research Report 95]

The General Problem of Memory Ordering
- A contract between software and hardware, specified by the ISA
  - The ISA specifies what programmers can assume about memory ordering, e.g., whether sequential consistency (or another memory consistency model) is provided
- Preserving an intuitive model (e.g., sequential consistency) simplifies the programmer's life
- But it makes the hardware designer's life difficult (limits the performance optimizations that can be used)

Memory Ordering in a Single Processor
- Specified by the von Neumann model
- Sequential consistency is trivially satisfied
  - Hardware executes the load and store operations in the order specified by the sequential program
  - Out-of-order execution does not change the semantics

Memory Ordering in a Multi-Core Design
- Each processor's memory operations are in sequential order with respect to the thread running on that processor (assume each processor obeys the von Neumann model)
- Multiple processors execute memory operations concurrently
  - Can we have incorrect execution if the order of memory operations appears different from the point of view of different processors?
- How does memory ordering affect performance and ease of debugging?

Memory Consistency vs. Cache Coherence
- Consistency is about the ordering of all memory operations from different processors (i.e., to different memory locations)
  - Global ordering of accesses to all memory locations
- Coherence is about the ordering of operations from different processors to the same memory location
  - Local ordering of accesses to each cache block

Memory Consistency Models

Sequential Consistency (SC)
- Lamport, "How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs," IEEE Transactions on Computers, 1979
- A multiprocessor system is sequentially consistent if:
  - the result of any execution is the same as if the operations of all the processors were executed in some sequential order, AND
  - the operations of each individual processor appear in this sequence in the order specified by its program

Another Way of Interpreting SC
- The whole system (all processors and memory) sees the same order of all four memory operation combinations performed by any processor:
  - Load -> load
  - Load -> store
  - Store -> store
  - Store -> load

Sequential Consistency Abstraction
- Memory is a switch that services one load or store at a time, from any processor (P1, P2, P3, ..., Pn)
- All processors see the currently serviced load or store at the same time
- Each processor's operations are serviced in program order

Consequences of Sequential Consistency
1. Within the same execution, all processors see the same global order of operations to memory
   -> No correctness issue
   -> Satisfies the "happened before" intuition
2. Across different executions, different global orders can be observed (each of which is sequentially consistent)
   -> Debugging can still be difficult (as the order changes across runs)

Issues with Sequential Consistency (SC)?
- Nice abstraction for programming; intuitive
- Two issues:
  - Ordering requirements are too conservative
  - Limits the aggressiveness of performance-enhancement techniques, e.g., can't use a store buffer

Weaker Memory Consistency
- The ordering of operations is important when the order affects operations on shared data, i.e., when processors need to synchronize
- Relaxing sequential consistency
  - Idea: the programmer specifies regions in which memory operations do not need to be ordered
  - Memory fence instructions delineate those regions
    - All memory operations before a fence must complete before the fence executes
    - All memory operations after the fence must wait for the fence to complete
    - Fences complete in program order

Tradeoffs: Weaker Consistency
- Advantage
  - No need to guarantee a very strict order of memory operations
    -> Simpler hardware implementation of performance-enhancement techniques
    -> Can be higher performance than stricter ordering
- Disadvantage
  - More burden on the programmer or software (need to get the fences correct)
- Another example of the programmer-microarchitect tradeoff

Total Store Order (TSO)
- Recall that under sequential consistency, the whole system (all processors and memory) sees the same order of all four memory operation combinations: load -> load, load -> store, store -> store, store -> load
- TSO relaxes the store -> load ordering requirement
  - Major benefit: a FIFO-based store buffer can be used
- Modern ISAs that use the TSO model
  - SPARC
  - x86 is also similar

Total Store Order (TSO) Example
- TSO allows both P1 and P2 to be in the critical section
- P2 is allowed to see B (load) before A (store)
- P1 is allowed to see Y (load) before X (store)
- How should a programmer fix Dekker's algorithm?
[Adve, WRL Research Report 95]

Takeaway
- To write correct parallel programs, it is crucial to understand memory consistency models
- To ensure correctness:
  - DON'T rely on intuition
  - DON'T use only normal memory operations for synchronization
  - DO use the special synchronization instructions provided by the ISA, e.g., memory fences, ACQUIRE/RELEASE pairs, etc.
- Different ISAs define different consistency models
  - This affects the portability of programs

Main Multi-Core Design Issues
- Cache coherence
  - Ensure correct operation in the presence of private caches
- Memory consistency: ordering of memory operations
  - What should the programmer expect the hardware to provide?
- Shared memory synchronization
  - Hardware support for synchronization primitives
- We will discuss the above issues
- Others: shared resource management, interconnects, ...

How NOT To Implement Locks

    Lock:   while (lock_var == 1);
            lock_var = 1;
    Unlock: lock_var = 0;

- What's the problem?
  - Testing whether lock_var is 1 and setting it to 1 are not atomic
  - i.e., another processor can set lock_var to 1 in between
    -> Multiple processors acquire the lock!

Atomic Read & Write Instructions
- Aka read-modify-write
- Specify a memory location and a register
  - I. The value in the location is read into the register
  - II. Another value is stored into the location
  - Many variants, based on what values are allowed in II
- Simple example: test&set
  - Read the memory location into the specified register
  - Store the constant 1 into the location
  - The acquisition is successful if the value loaded into the register is 0