CMSC 22200 Computer Architecture Lecture 15: Memory Consistency and Synchronization Prof. Yanjing Li University of Chicago
Administrative Stuff! Lab 5 (multi-core) " Basic requirements: out later today " Extra credit: out later this week " Due: 11:59pm, Dec. 1st, Thursday " Two late days with penalty! My office hours this week are canceled
Lecture Outline! Cache coherence (continued)! Memory consistency! Synchronization
Parallel Computer Architecture! Important for both computer architects and programmers! Why do programmers need to know about parallel computer architecture? " They need to get parallel programs to be correct " They need to optimize performance in the presence of bottlenecks
Main Multi-Core Design Issues! Cache coherence " Ensure correct operation in the presence of private caches! Memory consistency: ordering of memory operations " What should the programmer expect the hardware to provide?! Shared memory synchronization " Hardware support for synchronization primitives! We will discuss the above issues! Others " Shared resource management, interconnects, etc.
Memory Coherence Discussions Continued
Review: Cache Coherence! Intuition: reading the value at memory location A should return the last value written to A by any processor! What is "last"?! Single processor: easy; everything follows program order! Multi-core " What if two processors write at the same time? " What if a read follows a write so closely in time that it is physically impossible to communicate the new value? " We need all processors to see the same write order within a single execution (ordering in different executions can be different)
Properties of Coherence! I. Program order on each processor (von Neumann model)! II. Write propagation: guarantee that updates will propagate! III. Write serialization: provide a consistent global order seen by all processors (need a global point of serialization for this store ordering)! Check yourself: locks/barriers etc. do not solve the coherence issue. Why?! Aside: do uniprocessors have coherence issues?
Review: Snooping Cache Coherence! Idea " Use a shared bus to provide a single point of serialization " All caches now have two ends, the processor and the bus, and they must observe/respond to both " All caches serve memory requests from their own processors " All caches also snoop the bus to see what everyone else is doing, and take actions accordingly to keep things coherent! Protocols " VI, MSI, MESI, etc.! Tradeoffs " Simple vs. complex protocols, cache-to-cache transfer vs. memory access, update vs. invalidate protocols
Atomic Bus Assumption! We assume that bus operations are atomic " i.e., one operation finishes before the next one can begin " Simple, but low throughput! Atomic: Req 1 → delay → Response 1, then Req 2 → delay → Response 2! Non-atomic: Req 1, Req 2, Resp 1, Resp 2 (requests and responses overlap)! Non-atomic → transient states " More complex!
Scalability! Snooping cache protocols are easy to understand and implement! Good for small scale! But what if you would like to have a 1000-core CMP?
Directory Based Coherence! Idea: A logically-central directory keeps track of where the copies of each cache block reside. Caches consult this directory to ensure coherence.! An example mechanism: " For each cache block in memory, store P+1 bits in the directory! One bit for each cache, indicating whether the block is in that cache! Exclusive bit: indicates that a cache has the only copy of the block and can update it without notifying others " On a read: set the cache's bit and arrange the supply of data " On a write: invalidate all caches that have the block and reset their bits " Have an exclusive bit associated with each block in each cache (so that the cache can update the exclusive block silently)
Directory Based Coherence
Snooping vs. Directory Coherence! Snooping + Simple + Miss latency (critical path) is short: request → bus transaction → memory + Global serialization is easy: the bus provides this already (arbitration) - Relies on broadcast messages seen by all caches (in the same order) - Single point of serialization (bus): not scalable! Directory + Does not require broadcast to all caches + Much more scalable than a bus - Adds indirection to miss latency (critical path): request → directory → memory - Requires extra storage space to track sharer sets - Protocols and race conditions are more complex (for high performance)
False Sharing! (Figure: P1 repeatedly loads and stores word0 while P2 repeatedly loads and stores word3; word0, word1, word2, and word3 all sit in the same cache block/line, so the block bounces between the two caches even though P1 and P2 never touch the same word)
Quick Tip to Avoid False Sharing! DO " Map variables written by different processors to different cache blocks " Group variables written by the same processor into the same cache block! DON'T " Group variables written by different processors into the same cache block
Which Is Better? Option 1: int sum[NUM_PROCS]; int product[NUM_PROCS]; sum[mynum]++; product[mynum] *= 2; Option 2: typedef struct { int sum; int product; } Proc; Proc x[NUM_PROCS]; x[mynum].sum++; x[mynum].product *= 2;
Takeaway! Cache coherence is critical for ensuring correctness! Software-managed cache coherence is very difficult! Hardware coherence protocols help programmers write correct and high-performance programs " Snooping cache protocols: VI, MSI, MESI " How do they work? " Various design decisions and tradeoffs! Programmers, be aware of and avoid false sharing!
Main Multi-Core Design Issues! Cache coherence " Ensure correct operation in the presence of private caches! Memory consistency: ordering of memory operations " What should the programmer expect the hardware to provide?! Shared memory synchronization " Hardware support for synchronization primitives! We will discuss the above issues! Others " Shared resource management, interconnects, etc.
Memory Consistency
Motivational Example! Dekker's algorithm for critical sections [Adve WRL Research Report 95]! Can the two processors be in the critical section at the same time given that they both obey the von Neumann model?
Motivational Example! Intuition: Assume P1 is in the critical section, which means Flag2 must be 0, which means P2 cannot have executed Flag2 = 1, which means P2 cannot be in the critical section. [Adve WRL Research Report 95]
Both Processors in Critical Section!! Consider a store buffer (a.k.a. write buffer) " Remember this from OoO? " Can also be used with in-order execution! (Figure: stores go from the processor into the store buffer before reaching the cache; loads can bypass buffered stores)
Both Processors in Critical Section!! Cycle 1 (A): value written into P1's store buffer; P1 thinks A is executed, but memory is not updated until cycle 51! Cycle 1 (X): value written into P2's store buffer; P2 thinks X is executed, but memory is not updated until cycle 52! Cycle 2 (B): P1 still sees 0 in Flag2, so it enters the critical section! Cycle 2 (Y): P2 still sees 0 in Flag1, so it enters the critical section (Figure: operations labeled A, B, X, Y) [Adve WRL Research Report 95]
Both Processors in Critical Section!! What happened? P1's view of memory operations: A (cycle 1), B (cycle 2), X (cycle 51) → A appeared to happen before X. P2's view of memory operations: X (cycle 1), Y (cycle 2), A (cycle 52) → X appeared to happen before A.
The Problem! The two processors did NOT see the same order of operations to memory! The "happened before" relationship between multiple updates to memory was inconsistent between the two processors' points of view! As a result, each processor thought the other was not in the critical section
How Can We Solve The Problem?! Idea: Sequential consistency! I. All processors see the same order of operations to memory " i.e., all memory operations happen in an order (called the global total order) that is consistent across all processors! II. Within this global order, each processor s operations appear in sequential order with respect to its own operations.
Sequentially Consistent Operation Orders! Potential correct global orders (all are correct):! A B X Y! A X B Y! A X Y B! X A B Y! X A Y B! X Y A B [Adve WRL Research Report 95]! Which order (interleaving) is observed depends on implementation and dynamic latencies
The General Problem of Memory Ordering! A contract between software and hardware specified by the ISA " The ISA specifies what programmers can assume about memory ordering, e.g., whether sequential consistency (or another memory consistency model) is provided! Preserving an intuitive model (e.g., sequential consistency) simplifies the programmer's life! But makes the hardware designer's life difficult (limits the performance optimizations that can be used)
Memory Ordering in a Single Processor! Specified by the von Neumann model! Sequential consistency is trivially satisfied " Hardware executes the load and store operations in the order specified by the sequential program " Out-of-order execution does not change the semantics
Memory Ordering in a Multi-Core Design! Each processor s memory operations are in sequential order with respect to the thread running on that processor (assume each processor obeys the von Neumann model)! Multiple processors execute memory operations concurrently " Can we have incorrect execution if the order of memory operations is different from the point of view of different processors?! How does memory ordering affect performance and ease of debugging?
Memory Consistency vs. Cache Coherence! Consistency is about ordering of all memory operations from different processors (i.e., to different memory locations) " Global ordering of accesses to all memory locations! Coherence is about ordering of operations from different processors to the same memory location " Local ordering of accesses to each cache block
Memory Consistency Models
Sequential Consistency (SC)! Lamport, "How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs," IEEE Transactions on Computers, 1979! A multiprocessor system is sequentially consistent if: " the result of any execution is the same as if the operations of all the processors were executed in some sequential order AND " the operations of each individual processor appear in this sequence in the order specified by its program
Another Way of Interpreting SC! The whole system (all processors and memory) sees the same order for all four combinations of memory operations performed by any processor! Load → load! Load → store! Store → store! Store → load
Sequential Consistency Abstraction! Memory is a switch that services one load or store at a time from any processor! All processors see the currently serviced load or store at the same time! Each processor's operations are serviced in program order (Figure: P1, P2, P3, …, Pn all connected through the switch to a single MEMORY)
Consequences of Sequential Consistency 1. Within the same execution, all processors see the same global order of operations to memory → No correctness issue → Satisfies the "happened before" intuition 2. Across different executions, different global orders can be observed (each of which is sequentially consistent) → Debugging can still be difficult (as the order changes across runs)
Issues with Sequential Consistency (SC)?! Nice abstraction for programming, intuitive! Two issues " Ordering requirements are too conservative " Limits the aggressiveness of performance enhancement techniques! E.g., can't use a store buffer
Weaker Memory Consistency! The ordering of operations is important when the order affects operations on shared data → i.e., when processors need to synchronize! Relaxing sequential consistency " Idea: Programmer specifies regions in which memory operations do not need to be ordered " Memory fence instructions delineate those regions! All memory operations before a fence must complete before the fence is executed! All memory operations after the fence must wait for the fence to complete! Fences complete in program order
Tradeoffs: Weaker Consistency! Advantage " No need to guarantee a very strict order of memory operations → Enables the hardware implementation of performance enhancement techniques to be simpler → Can be higher performance than stricter ordering! Disadvantage " More burden on the programmer or software (need to get the fences correct)! Another example of the programmer-microarchitect tradeoff
Total Store Order (TSO)! Remember, for sequential consistency, the whole system (all processors and memory) sees the same order for all four combinations of memory operations performed by any processor " Load → load, load → store, store → store, store → load! TSO relaxes the store → load ordering requirement " Major benefit: a FIFO-based store buffer can be used! Modern ISAs that use the TSO model " SPARC " x86 is also similar
Total Store Order (TSO) Example! TSO allows both P1 and P2 to be in the critical section! P2 is allowed to see B (load) before A (store)! P1 is allowed to see Y (load) before X (store)! How should a programmer fix Dekker's algorithm? [Adve WRL Research Report 95]
Takeaway! To write correct parallel programs, it is crucial to understand memory consistency models! To ensure correctness! DON'T rely on intuition! DON'T use only normal memory operations for synchronization! DO use special synchronization instructions provided by the ISA " E.g., memory fences, ACQUIRE/RELEASE pairs, etc.! Different ISAs define different consistency models! Affects portability of programs
Main Multi-Core Design Issues! Cache coherence " Ensure correct operation in the presence of private caches! Memory consistency: ordering of memory operations " What should the programmer expect the hardware to provide?! Shared memory synchronization " Hardware support for synchronization primitives! We will discuss the above issues! Others " Shared resource management, interconnects, etc.
How NOT To Implement Locks! Lock: while (lock_var == 1); lock_var = 1;! Unlock: lock_var = 0;! What's the problem? " Testing if lock_var is 1 and setting it to 1 are not atomic " i.e., another processor can set lock_var to 1 in between → Multiple processors acquire the lock!
Atomic Read & Write Instructions! A.k.a. read-modify-write! Specify a memory location and a register " I. Value in location read into a register " II. Another value stored into location " Many variants based on what values are allowed in II! Simple example: test&set " Read memory location into specified register " Store constant 1 into location " Successful if value loaded into register is 0