CS 152 Computer Architecture and Engineering CS252 Graduate Computer Architecture. Lecture 19 Memory Consistency Models

Similar documents
CS252 Graduate Computer Architecture Fall 2015 Lecture 14: Synchroniza>on and Memory Models

CS252 Spring 2017 Graduate Computer Architecture. Lecture 14: Multithreading Part 2 Synchronization 1

CS252 Spring 2017 Graduate Computer Architecture. Lecture 15: Synchronization and Memory Models Part 2

Symmetric Multiprocessors: Synchronization and Sequential Consistency

Dr. George Michelogiannakis. EECS, University of California at Berkeley CRD, Lawrence Berkeley National Laboratory

CS 152 Computer Architecture and Engineering. Lecture 19: Synchronization and Sequential Consistency

CS 252 Graduate Computer Architecture. Lecture 11: Multiprocessors-II

CS 152 Computer Architecture and Engineering. Lecture 19: Synchronization and Sequential Consistency

CS252 Spring 2017 Graduate Computer Architecture. Lecture 12: Cache Coherence

CS 152 Computer Architecture and Engineering CS252 Graduate Computer Architecture. Lecture 18 Cache Coherence

Consistency & Coherence. 4/14/2016 Sec5on 12 Colin Schmidt

CS 152 Computer Architecture and Engineering. Lecture 19: Directory-Based Cache Protocols

CS 152 Computer Architecture and Engineering. Lecture 19: Directory-Based Cache Protocols

Announcements. ECE4750/CS4420 Computer Architecture L17: Memory Model. Edward Suh Computer Systems Laboratory

CS 152 Computer Architecture and Engineering. Lecture 19: Directory-Based Cache Protocols

Module 15: "Memory Consistency Models" Lecture 34: "Sequential Consistency and Relaxed Models" Memory Consistency Models. Memory consistency

Lecture 12 Branch Prediction and Advanced Out-of-Order Superscalars

Overview: Memory Consistency

Beyond Sequential Consistency: Relaxed Memory Models

ESE 545 Computer Architecture Symmetric Multiprocessors and Snoopy Cache Coherence Protocols CA SMP and cache coherence

Portland State University ECE 588/688. Memory Consistency Models

CS 152 Computer Architecture and Engineering

Memory Consistency Models

CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars

Atomic Instructions! Hakim Weatherspoon CS 3410, Spring 2011 Computer Science Cornell University. P&H Chapter 2.11

NOW Handout Page 1. Memory Consistency Model. Background for Debate on Memory Consistency Models. Multiprogrammed Uniprocessor Mem.

Motivations. Shared Memory Consistency Models. Optimizations for Performance. Memory Consistency

Lecture 9 - Virtual Memory

CS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II

Relaxed Memory-Consistency Models

CS 152 Computer Architecture and Engineering. Lecture 19: Directory- Based Cache Protocols. Recap: Snoopy Cache Protocols

CS 152 Computer Architecture and Engineering. Lecture 10 - Complex Pipelines, Out-of-Order Issue, Register Renaming

Chapter 8. Multiprocessors. In-Cheol Park Dept. of EE, KAIST

CMSC Computer Architecture Lecture 15: Memory Consistency and Synchronization. Prof. Yanjing Li University of Chicago

Parallel Computer Architecture Spring Memory Consistency. Nikos Bellas

CS252 Spring 2017 Graduate Computer Architecture. Lecture 17: Virtual Memory and Caches

Relaxed Memory Consistency

Lecture 24: Multiprocessing Computer Architecture and Systems Programming ( )

Unit 12: Memory Consistency Models. Includes slides originally developed by Prof. Amir Roth

Page 1. Outline. Coherence vs. Consistency. Why Consistency is Important

CS 152 Computer Architecture and Engineering. Lecture 13 - Out-of-Order Issue and Register Renaming

CS 152 Computer Architecture and Engineering CS252 Graduate Computer Architecture. Lecture 7 Memory III

CS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II

Using Relaxed Consistency Models

CS 152 Computer Architecture and Engineering. Lecture 9 - Virtual Memory

Lecture 25: Thread Level Parallelism -- Synchronization and Memory Consistency

CS 152 Computer Architecture and Engineering. Lecture 9 - Address Translation

CS 152 Computer Architecture and Engineering. Lecture 11 - Virtual Memory and Caches

CS 152 Computer Architecture and Engineering. Lecture 9 - Address Translation

CS252 Graduate Computer Architecture Lecture 6. Recall: Software Pipelining Example

Shared Memory Consistency Models: A Tutorial

Distributed Operating Systems Memory Consistency

SELECTED TOPICS IN COHERENCE AND CONSISTENCY

Lecture 14: Multithreading

CS 152 Computer Architecture and Engineering. Lecture 14: Multithreading

Computer Architecture and Engineering CS152 Quiz #5 April 27th, 2016 Professor George Michelogiannakis Name: <ANSWER KEY>

CS252 Spring 2017 Graduate Computer Architecture. Lecture 8: Advanced Out-of-Order Superscalar Designs Part II

Computer Architecture

CS 152 Computer Architecture and Engineering. Lecture 8 - Memory Hierarchy-III

Lecture 13: Consistency Models. Topics: sequential consistency, requirements to implement sequential consistency, relaxed consistency models

C 1. Last time. CSE 490/590 Computer Architecture. Complex Pipelining I. Complex Pipelining: Motivation. Floating-Point Unit (FPU) Floating-Point ISA

CS252 Graduate Computer Architecture Spring 2014 Lecture 17: I/O

CS 152 Computer Architecture and Engineering. Lecture 18: Multithreading

Lecture 13 - VLIW Machines and Statically Scheduled ILP

Lecture 9 Virtual Memory

Lecture 24: Thread Level Parallelism -- Distributed Shared Memory and Directory-based Coherence Protocol

Sequential Consistency & TSO. Subtitle

CS 152 Computer Architecture and Engineering. Lecture 8 - Address Translation

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 5. Multiprocessors and Thread-Level Parallelism

CS 152, Spring 2011 Section 8

Distributed Shared Memory and Memory Consistency Models

Constructing a Weak Memory Model

CS 152 Computer Architecture and Engineering. Lecture 8 - Memory Hierarchy-III

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013

CS252 Spring 2017 Graduate Computer Architecture. Lecture 18: Virtual Machines

Signaling and Hardware Support

RELAXED CONSISTENCY 1

CS 152 Computer Architecture and Engineering. Lecture 8 - Memory Hierarchy-III

CS 152 Computer Architecture and Engineering. Lecture 8 - Address Translation

CS 152 Computer Architecture and Engineering CS252 Graduate Computer Architecture. Lecture 9 Virtual Memory

CS252 Graduate Computer Architecture Spring 2014 Lecture 13: Mul>threading

CSE502: Computer Architecture CSE 502: Computer Architecture

Cache Coherence and Atomic Operations in Hardware

Lecture 7 - Memory Hierarchy-II

Lecture 13: Memory Consistency. + Course-So-Far Review. Parallel Computer Architecture and Programming CMU /15-618, Spring 2014

Computer Architecture ELEC3441

CS 152 Computer Architecture and Engineering. Lecture 16: Vector Computers

Lecture: Consistency Models, TM

Lecture 4 - Pipelining

Introduction. What is Computer Architecture? Meltdown & Spectre. Meltdown & Spectre. Computer Architecture ELEC /1/17. Dr. Hayden Kwok-Hay So

CS5460: Operating Systems

Lecture 4 Pipelining Part II

CS 152 Computer Architecture and Engineering. Lecture 13 - VLIW Machines and Statically Scheduled ILP

Computer Architecture and Parallel Computing 并行结构与计算. Lecture 6 Coherence Protocols

Relaxed Memory-Consistency Models

Lecture 23: Thread Level Parallelism -- Introduction, SMP and Snooping Cache Coherence Protocol

ECE 552 / CPS 550 Advanced Computer Architecture I. Lecture 9 Instruction-Level Parallelism Part 2

ECE 552 / CPS 550 Advanced Computer Architecture I. Lecture 15 Very Long Instruction Word Machines

Lecture 11: Relaxed Consistency Models. Topics: sequential consistency recap, relaxing various SC constraints, performance comparison

CS 152 Computer Architecture and Engineering. Lecture 16 - VLIW Machines and Statically Scheduled ILP

Transcription:

CS 152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Lecture 19 Memory Consistency Models Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste http://inst.eecs.berkeley.edu/~cs152

Last Time in Lecture 18 Cache coherence, making sure every store to memory is eventually visible to any load to same memory address Cache line states: M,S,I or M,E,S,I Cache miss if tag not present, or line has wrong state Write to a shared line is handled as a miss Snoopy coherence: Broadcast updates and probe all cache tags on any miss of any processor, used to be bus conneceon now ofen broadcast over point-to-point iinks Lower latency, but consumes lots of bandwidth on both the communicaeon bus and for probing the cache tags Directory coherence: Structure keeps track of which caches can have copies of data, and only send messages/probes to those caches Complicated to get right with all the possible overlapping cache transaceons 2

SynchronizaAon The need for synchronizaeon arises whenever there are concurrent processes in a system (even in a uniprocessor system). Two classes of synchronizaeon: Producer-Consumer: A consumer process must wait unel the producer process has produced data producer consumer Mutual Exclusion: Ensure that only one process uses a resource at a given Eme P1 P2 Shared Resource 3

Simple Producer-Consumer Example xflagp xdatap Producer processor flag data Shared Memory xflagp xdatap Consumer processor IniEally flag=0 sw xdata, (xdatap) li xflag, 1 sw xflag, (xflagp) spin: lw xflag, (xflagp) beqz xflag, spin lw xdata, (xdatap) Is this correct? 4

Memory Consistency Model SequenEal ISA only specifies that each processor sees its own memory operaeons in program order Memory consistency model describes what values can be returned by load instruceons across muleple hardware threads Coherence describes the legal values a single memory address should return Consistency describes properees across all memory addresses 5

Simple Producer-Consumer Example Producer flag data IniEally flag=0 Consumer sw xdata, (xdatap) li xflag, 1 sw xflag, (xflagp) spin: lw xflag, (xflagp) beqz xflag, spin lw xdata, (xdatap) Can consumer read flag=1 before data writen by producer visible to consumer? 6

SequenAal Consistency (SC) A Memory Model P P P P P P M A system is sequen?ally consistent if the result of any execueon is the same as if the operaeons of all the processors were executed in some sequeneal order, and the operaeons of each individual processor appear in the order specified by the program Leslie Lamport SequenEal Consistency = arbitrary order-preserving interleaving of memory references of sequeneal programs 7

Simple Producer-Consumer Example Producer flag data Consumer sw xdata, (xdatap) li xflag, 1 sw xflag, (xflagp) IniEally flag =0 spin: lw xflag, (xflagp) beqz xflag, spin lw xdata, (xdatap) Dependencies from sequeneal ISA Dependencies added by sequeneally consistent memory model 8

Most real machines are not SC Only a few commercial ISAs require SC Neither x86 nor ARM are SC Originally, architects developed uniprocessors with opemized memory systems (e.g., store buffer) When uniprocessors were lashed together to make muleprocessors, resuleng machines were not SC Requiring SC would make simpler machines slower, or requires adding complex hardware to retain performance Architects/language designers/applicaeons developers work hard to explain weak memory behavior Resulted in weak memory models with fewer guarantees 9

Store Buffer OpAmizaAon CPU CPU Store Buffer Store Buffer Shared Memory Common opemizaeon allows stores to be buffered while waieng for access to shared memory Load opemizaeons: Later loads can go ahead of buffered stores if to different address Later loads can bypass value from earlier buffered store if to same address 10

TSO example Allows local buffering of stores by processor Initially M[X] = M[Y] = 0 P1: li x1, 1 sw x1, X lw x2, Y P2: li x1, 1 sw x1, Y lw x2, X Possible Outcomes P1.x2 P2.x2 SC TSO 0 0 N Y 0 1 Y Y 1 0 Y Y 1 1 Y Y TSO is the strongest memory model in common use 11

Strong versus Weak Memory Consistency Models Stronger models provide more guarantees on ordering of loads and stores across different hardware threads Easier ISA-level programming model Can require more hardware to ensure orderings (e.g., MIPS R10K was SC, with hardware to speculate on load/stores and squash when ordering violaeons detected across cores) Weaker models provide fewer guarantees Much more complex ISA-level programming model Extremely difficult to understand, even for experts Simpler to achieve high performance, as weaker models allow many reorderings to be exposed to sofware AddiEonal instruceons (fences) are provided to allow sofware to specify which orderings are required 12

Fences in Producer-Consumer Example Producer flag data IniEally flag =0 Consumer sd xdata, (xdatap) li xflag, 1 fence w,w //Write-write fence sd xflag, (xflagp) spin: ld xflag, (xflagp) beqz xflag, spin fence r,r // Read-read fence ld xdata, (xdatap) 13

CS152 Administrivia Lab 4 due Monday April 9 Midterm 2 in class Wednesday April 11 covers lectures 10-17, plus associated problem sets, labs, and readings 14

CS252 Administrivia Monday April 9 th Project Checkpoint, 4-5pm, 405 Soda Prepare 10-minute presentaeon on current status CS252 15

Range of Memory Consistency Models SC SequenEal Consistency MIPS R10K TSO Total Store Order processor can see its own writes before others do (store buffer) IBM-370 TSO, x86 TSO, SPARC TSO (default), RISC-V RVTSO (opeonal) Weak, mule-copy-atomic memory models all processors see writes by another processor in same order Revised ARM v8 memory model RISC-V RVWMO, baseline weak memory model for RISC-V Weak, non-mule-copy-atomic memory models processors can see another s writes in different orders ARM v7, original ARM v8 IBM POWER Digital Alpha Recent consensus is that this appears to be too weak 16

MulA-Copy Atomic models CPU CPU Buffer Buffer Point of global visibility Shared Memory Each hardware thread must view its own memory operaeons in program order, but can buffer these locally and reorder accesses around the buffer But once a local store is made visible to one other hardware thread in system, all other hardware threads must also be able to observe it (this is what is meant by atomic ) 17

Hierarchical Shared Buffering CPU CPU CPU CPU Intermediate Buffer Intermediate Buffer Shared Memory Common in large systems to have shared intermediate buffers on path between CPUs and global memory PotenEal opemizaeon is to allow some CPUs see some writes by a CPU before other CPUs Shared memory stores are not seen to happen atomically by other threads (non mule-copy atomic) 18

Non-MulA-Copy Atomic Initially M[X] = M[Y] = 0 P1: li x1, 1 sw x1, X P2: lw x1, X sw x1, Y P3: lw x1, Y fence r,r lw x2, X Can P3.x1 = 1, and P3.x2 = 0? In general, Non-MCA is very difficult to reason about SoFware in one thread cannot assume all data it sees is visible to other threads, so how to share data structures? Adding local fences to require ordering of each thread s accesses is insufficient need a more global memory barrier to ensure all writes are made visible CS252 19

Relaxed Memory Models Not all dependencies assumed by SC are supported, and sofware has to explicitly insert addieonal dependencies were needed Which dependencies are dropped depends on the parecular memory model IBM370, TSO, PSO, WO, PC, Alpha, RMO, Some ISAs allow several memory models, some machines have switchable memory models How to introduce needed dependencies varies by system Explicit FENCE instruceons (someemes called sync or memory barrier instruceons) Implicit effects of atomic memory instruceons How on earth are programmers supposed to work with this???? 20

But compilers reorder too! //Producer code *datap = x/y; *flagp = 1; //Consumer code while (!*flagp) ; d = *datap; Compiler can reorder/remove memory operaeons: InstrucEon scheduling, move loads before stores if to different address Register allocaeon, cache load value in register, don t check memory ProhibiEng these opemizaeons would result in very poor performance 21

Language-Level Memory Models Programming languages have memory models too Hide details of each ISA s memory model underneath language standard c.f. C funceon declaraeons versus ISA-specific subrouene linkage conveneon Language memory models: C/C++, Java Describe legal behaviors of threaded code in each language and what opemizaeons are legal for compiler to make E.g., C11/C++11: atomic_load(memory_order_seq_cst) maps to RISC-V fence rw,rw; lw; fence r,rw 22

Release Consistency ObservaEon that consistency only maters when processes communicate data Only need to have consistent view when one process shares its updates to other processes Other processes only need to ensure they receive updates afer they acquire access to shared data P1 P2 Ensure criecal seceon updates visible before release visible Acquire CriEcal Release Other Code Other Code Acquire CriEcal Release Ensure acquire happened before criecal seceon reads data 23

Release Consistency Adopted Memory model for C/C++ and Java uses release consistency Programmer has to idenefy synchronizaeon operaeons, and if all data accesses are protected by synchronizaeon, appears like SC to programmer ARM v8.1 and RISC-V ISA adopt release consistency semanecs on AMOs 24

Acknowledgements This course is partly inspired by previous MIT 6.823 and Berkeley CS252 computer architecture courses created by my collaborators and colleagues: Arvind (MIT) Joel Emer (Intel/MIT) James Hoe (CMU) John Kubiatowicz (UCB) David PaTerson (UCB) 25