CS533 Concepts of Operating Systems. Jonathan Walpole



Shared Memory Consistency Models: A Tutorial

Outline Concurrent programming on a uniprocessor The effect of optimizations on a uniprocessor The effect of the same optimizations on a multiprocessor Methods for restoring sequential consistency Conclusion


Dekker's Algorithm (simplified entry protocol)

Process 1::                Process 2::
Flag1 = 1                  Flag2 = 1
If (Flag2 == 0)            If (Flag1 == 0)
   critical section           critical section

(Memory initially: Flag1 = 0, Flag2 = 0)

Step through the interleavings: whichever process writes its flag later sees the other's flag already set and backs off, so at most one process enters. The critical section is protected! Both processes can block, but the critical section is still protected!

Outline Concurrent programming on a uniprocessor The effect of optimizations on a uniprocessor The effect of the same optimizations on a multiprocessor Methods for restoring sequential consistency Conclusion

Write Buffer With Bypass
SpeedUp:
- A write takes 100 cycles
- Buffering a write takes 1 cycle
- So buffer the write and keep going!
Problem: what about a read from a location with a buffered write pending?

Dekker's Algorithm with a Write Buffer

Process 1::                Process 2::
Flag1 = 1                  Flag2 = 1
If (Flag2 == 0)            If (Flag1 == 0)
   critical section           critical section

(Memory initially: Flag1 = 0, Flag2 = 0)

If reads may bypass buffered writes unconditionally, a read can complete while the write of the flag it tests is still sitting in the write buffer. Both tests can see 0, and both processes enter. The critical section is not protected!

Write Buffer With Bypass
Rule:
- If a write is issued, buffer it and keep executing
- Unless there is a read from the same location (subsequent writes don't matter) - then wait for the write to complete

Dekker's Algorithm with the Bypass Rule

Process 1::                Process 2::
Flag1 = 1                  Flag2 = 1
If (Flag2 == 0)            If (Flag1 == 0)
   critical section           critical section

Now a read of a location with a buffered write pending must stall until that write completes. On a uniprocessor the stall forces the flag writes into memory before the flags are tested, and the algorithm works again.

Is This a General Solution? - If each CPU has a write buffer with bypass, and follows the rules, will the algorithm still work correctly?

Outline Concurrent programming on a uniprocessor The effect of optimizations on a uniprocessor The effect of the same optimizations on a multiprocessor Methods for restoring sequential consistency Conclusion

Dekker's Algorithm on a Multiprocessor

Process 1::                Process 2::
Flag1 = 1                  Flag2 = 1
If (Flag2 == 0)            If (Flag1 == 0)
   critical section           critical section

(Memory initially: Flag1 = 0, Flag2 = 0)

Each CPU buffers its own write, then reads the other CPU's flag from memory - where it is still 0. Both processes enter the critical section.

It's Broken! How did that happen?
- Write buffers are processor specific
- Writes are not visible to other processors until they hit memory

Generalization of the Problem Dekker's algorithm has the form: Process 1 writes Flag1, then reads Flag2; Process 2 writes Flag2, then reads Flag1. - The write buffer delays the writes until after the reads! - It reorders the reads and writes - Both processes can read the flag value from before the other's write!

Label the four memory operations W(Flag1) and R(Flag2) from Process 1, and W(Flag2) and R(Flag1) from Process 2. There are 4! or 24 possible orderings. If either W(Flag1) < R(Flag1) or W(Flag2) < R(Flag2), then the critical section is protected (correct behavior). 18 of the 24 orderings are OK. But the other 6 are trouble!

Another Example What happens if reads and writes can be delayed by the interconnect? - non-uniform memory access time - cache misses - complex interconnects

Non-Uniform Write Delays

Process 1::                Process 2::
Data = 2000;               While (Head == 0) {;}
Head = 1;                  LocalValue = Data

(Memory initially: Head = 0, Data = 0)

If the interconnect delivers the write of Head before the write of Data, Process 2 can exit its loop and read Data while memory still holds Data = 0. WRONG DATA! The write of Data = 2000 arrives only afterwards.

What Went Wrong? Maybe we need to acknowledge each write before proceeding to the next?

Write Acknowledgement? But what about reordering of reads? - Non-Blocking Reads - Lockup-free Caches - Speculative execution - Dynamic scheduling... all allow execution to proceed past a read Acknowledging writes may not help!

General Interconnect Delays

Process 1::                Process 2::
Data = 2000;               While (Head == 0) {;}
Head = 1;                  LocalValue = Data

(Memory initially: Head = 0, Data = 0)

Now let the interconnect delay the reads as well: Process 2's read of Data can be satisfied early, returning 0, before its read of Head observes Head = 1. Even though memory holds Data = 2000 by the time the loop exits, LocalValue ends up 0. WRONG DATA!

Generalization of the Problem This algorithm has the form: Process 1 writes Data, then writes Head; Process 2 reads Head, then reads Data. - The interconnect reorders reads and writes

Label the operations W(Data) and W(Head) from Process 1, and R(Head) and R(Data) from Process 2. Correct behavior requires W(Data) < R(Data) and W(Head) < R(Head); program order requires R(Head) < R(Data). => 6 correct orders out of 24. Write acknowledgment means W(Data) < W(Head). Does that help? It disallows only 12 of the 24 orderings - 9 of the remaining 12 are still incorrect!

Outline Concurrent programming on a uniprocessor The effect of optimizations on a uniprocessor The effect of the same optimizations on a multiprocessor Methods for restoring sequential consistency Conclusion

Sequential Consistency for MPs Why is it surprising that these code examples break on a multi-processor? What ordering property are we assuming (incorrectly!) that multiprocessors support? We are assuming they are sequentially consistent!

Sequential Consistency Sequential Consistency requires that the result of any execution be the same as if the memory accesses executed by each processor were kept in order and the accesses among different processors were interleaved arbitrarily....appears as if a memory operation executes atomically or instantaneously with respect to other memory operations (Hennessy and Patterson, 4th ed.)

Understanding Ordering
- Program Order
- Compiled Order
- Interleaving Order
- Execution Order

Reordering Writes reach memory, and reads see memory, in an order different than that in the program! - Caused by Processor - Caused by Multiprocessors (and Cache) - Caused by Compilers

What Are the Choices? If we want our results to be the same as those of a sequentially consistent model, do we: - Enforce Sequential Consistency at the memory level? - Use coherent (consistent) caches? - Or what?

Enforce Sequential Consistency? It removes virtually all optimizations - too slow!


Cache Coherence Multiple processors have a consistent view of memory (e.g. via the MESI protocol). But this does not say when a processor must see a value updated by another processor. Cache coherence does not guarantee Sequential Consistency! Example: a write-through cache acts just like a write buffer with bypass.


Involve the Programmer Someone's got to tell your CPU about concurrency! Use memory barrier / fence instructions when order really matters!

Memory Barrier Instructions A way to prevent reordering - Also known as a safety net - Require previous instructions to complete before allowing further execution on that CPU Not cheap, but perhaps not often needed? - Must be placed by the programmer - Memory consistency model for processor tells you what reordering is possible

Using Memory Barriers

Process 1::                Process 2::
Flag1 = 1                  Flag2 = 1
>>Mem_Bar<<                >>Mem_Bar<<
If (Flag2 == 0)            If (Flag1 == 0)
   critical section           critical section

The fence in Process 1 enforces W(Flag1) < R(Flag2); the fence in Process 2 enforces W(Flag2) < R(Flag1).

There are 4! or 24 possible orderings. If either W(Flag1) < R(Flag1) or W(Flag2) < R(Flag2), then the critical section is protected (correct behavior). 18 of the 24 orderings are OK, but the other 6 are trouble! Enforce W(Flag1) < R(Flag2) and W(Flag2) < R(Flag1): only 6 of the 18 good orderings remain allowed, but all 6 bad ones are forbidden!

Example 2

Process 1::                Process 2::
Data = 2000;               While (Head == 0) {;}
>>Mem_Bar<<                >>Mem_Bar<<
Head = 1;                  LocalValue = Data

The fence in Process 1 enforces W(Data) < W(Head); the fence in Process 2 enforces R(Head) < R(Data).

Correct behavior requires W(Data) < R(Data) and W(Head) < R(Head). => 6 correct orders out of 24. We can require W(Data) < W(Head) and R(Head) < R(Data). Is that enough? The program also requires W(Head) < R(Head) - the loop only exits once the write to Head is seen. Thus W(Data) < W(Head) < R(Head) < R(Data); hence W(Data) < R(Data) and W(Head) < R(Head). Only 2 of the 6 good orderings are allowed - but all 18 incorrect orderings are forbidden.

Memory Consistency Models Every CPU architecture has one! - It defines what reordering of memory operations that CPU may perform - The CPU's instruction set contains memory barrier instructions of various kinds - These can be used to constrain reordering where necessary - The programmer must understand both the memory consistency model and the memory barrier instruction semantics!


Code Portability? Linux provides a carefully chosen set of memory-barrier primitives, as follows: - smp_mb(): memory barrier that orders both loads and stores. This means loads and stores preceding the memory barrier are committed to memory before any loads and stores following the memory barrier. - smp_rmb(): read memory barrier that orders only loads. - smp_wmb(): write memory barrier that orders only stores.

Words of Advice - The difficult problem is identifying the ordering constraints that are necessary for correctness. -...the programmer must still resort to reasoning with low-level reordering optimizations to determine whether sufficient orders are enforced. -...deep knowledge of each CPU's memory-consistency model can be helpful when debugging, to say nothing of writing architecture-specific code or synchronization primitives.

Programmer's View - What does a programmer need to do? - How do they know when to do it? - Compilers and libraries can help, but you still need to use these primitives in truly concurrent programs - Assuming the worst and synchronizing everything results in sequential consistency - too slow, but it may be a good way to start

Outline Concurrent programming on a uniprocessor The effect of optimizations on a uniprocessor The effect of the same optimizations on a multiprocessor Methods for restoring sequential consistency Conclusion

Conclusion - Parallel programming on a multiprocessor that relaxes the sequentially consistent memory model presents new challenges - Know the memory consistency models for the processors you use - Use barrier (fence) instructions to allow optimizations while protecting your code - Simple examples were used here; there are others that are much more subtle

References - Shared Memory Consistency Models: A Tutorial, Sarita Adve and Kourosh Gharachorloo - Memory Ordering in Modern Microprocessors, Part I, Paul E. McKenney, Linux Journal, June 2005 - Computer Architecture: A Quantitative Approach, Hennessy and Patterson, 4th Ed., 2007