Memory Consistency Models


Calcolatori Elettronici e Sistemi Operativi

Sources of out-of-order memory accesses
- Compiler optimizations
- Store buffers: FIFOs for uncommitted writes
- Invalidate queues (for cache coherency)
- Data prefetch
- Banked cache architectures
- Networked interconnects
- Non-uniform memory access (NUMA) architectures: different accesses to memory have different latencies

Compiler optimizations
The language semantics does not consider:
1. side-effects of memory accesses
2. multi-threading
3. asynchronous execution
The compiler can therefore reorder instructions and eliminate operations. Some compiler optimizations can be controlled with the volatile qualifier.

Example 1: the whole loop is folded into a shift; this function always returns 8*x, and the compiler can optimize the code accordingly.

    /* C code */                    @ ARM assembly
    int add3 (int x) {              add3:
        int i;                          mov r0, r0, asl #3
        for (i = 0; i < 3; i++)         mov pc, lr
            x += x;
        return x;
    }

Example 2: the result does not depend on the access order, so the compiler is free to change the order of the loads.

    /* C code */                    @ ARM assembly
    int add_vals (int *vec) {       add_vals:
        int y = vec[1];                 ldr r3, [r0]
        y += vec[0];                    ldr r0, [r0, #4]
        return y;                       add r0, r0, r3
    }                                   mov pc, lr

Example 3: the compiler does not need to consider that someone else can change *ptr, so the load is hoisted out of the loop and the wait becomes an infinite loop.

    /* C code */                    @ ARM assembly
    void waitval (int *ptr) {       waitval:
        while (*ptr == 0)               ldr r3, [r0]
            continue;                   cmp r3, #0
    }                                   movne pc, lr
                                    loop:
                                        b loop

Volatile
Semantics (this is the C/C++ semantics; the Java semantics differs, as it also implies atomicity):
- Each read from a volatile variable requires an actual load and may return a different value; compiler optimizations cannot merge reads from the same address.
- Each write to a volatile variable requires an actual store; compiler optimizations cannot cancel stores.
- Volatile is required to access the I/O address space.

Examples:

    int *ptr;                              /* pointer to int */
    volatile int *ptr_to_vol;              /* pointer to volatile int */
    int *volatile vol_ptr;                 /* volatile pointer to int */
    volatile int *volatile vol_ptr_to_vol; /* volatile pointer to volatile int */

Beware the semantics:
- a = *ptr_to_vol;  is a volatile access
- a = *vol_ptr;     is not a volatile access
Inconsistent qualification causes errors.

Volatile does not enforce ordering with non-volatile accesses:

    int A;
    volatile int B;
    A = 1; /* these two lines can be */
    B = 1; /* reordered by the compiler */

Volatile does not enforce the order in which accesses are actually performed:

    volatile int A;
    volatile int B;
    A = 1; /* these two lines won't be reordered by the compiler, */
    B = 1; /* but the accesses can still be reordered by the HW */

Volatile does not mean atomic:

    volatile int X;
    X = 1; /* this assignment can be interrupted or preempted */

Compiler memory barrier: implementation in GCC

    asm volatile ("" : : : "memory");

This inline assembly code:
1. contains no instructions
2. may read or write all of RAM
Hence the compiler is not allowed to reorder memory accesses around the barrier, in either direction.

Store buffer
- Records a store in the buffer until the write is actually performed.
- Hides memory latency (cache latency, cache miss on write): the processor can execute other instructions meanwhile.
- Data dependency (RAW): either wait until the write is actually performed in memory or in cache, or read the data directly from the store buffer (store forwarding).
- Data dependency (WAW): either add a new entry in the store buffer, or replace the previous write in the store buffer.

Example
P1 executes: 1) store A; 2) store B. A and B are shared with P2: A is in P2's cache, B is in both caches. Each processor has a store buffer and a cache, connected by the interconnect.

Execution:
1. store A: cache miss on P1; the updated value is written in the store buffer and a read request is sent (the data will come from P2's cache). Several clock cycles are needed, but P1 can proceed (the new value is in the store buffer); P2 does not see the write.
2. store B: cache hit; the data is written in P1's cache and a coherence message is sent to P2; P2 sees the write.
3. A is loaded in P1's cache.
4. A is updated in P1's cache; a coherence message is sent to P2; P2 sees the write.

P2 sees the store on B first, then the store on A.

Consequence
Initially A=0 and B=0; A and B are volatile.

    P1:                    P2:
    A = 1;                 while (B == 0) continue;
    B = 1;                 assert(A == 1);  /* this can fail! */

If P2 sees the stores performed by P1 in reverse order, the assertion fails.

Invalidate queue
Cache coherency can require cache line invalidation: a processor sends an invalidate message to another one, and the target processor must invalidate the line. An invalidate queue stores the invalidate requests while the cache is busy; the line is invalidated when the cache is ready.

Data prefetch
The processor can read data before the actual load instruction, to hide memory latency: data is preloaded in the cache.

Banked cache architectures
Caches are split in several banks: while accesses to busy banks must wait, accesses to idle banks can proceed.

Speculative execution
The processor executes the instructions after a branch before the branch itself.

Definitions
- Program order: the order of operations as specified by the software.
- Execution order: the order of operations as executed by a processor.
- Perceived order: the order of operations as seen by processors and memories.
- Memory consistency model: the rules that specify the allowed behaviors of programs in terms of memory accesses. The rules are order restrictions.

Performed accesses
- A write by processor i is performed with respect to processor k when a read issued by k to the same address returns the value stored by i.
- A read by processor i is performed with respect to processor k when a write issued by k to the same address can no longer affect the value read by i.

Globally performed
- An access is globally performed when it is performed with respect to all processors.
- A write is globally performed when its modification has been propagated to all processors.
- A read is globally performed when the value it returns is bound and the write that wrote this value is globally performed.

Memory consistency models
Rules on access ordering can regard:
- location (address of the access)
- direction (read, write, read-write)
- value
- causality (the behavior of an access depends on the behavior of another one)
- category of the accesses

Uniform consistency models: the rules do not concern the category of the accesses.
Hybrid consistency models: the category of the accesses matters (shared / private, synchronizing / not synchronizing).

Uniform consistency models

Local consistency (LC)
- Each processor sees its own accesses in program order.
- There is no restriction on the order of the accesses seen by other processors: different processors may see different orders.
- The weakest consistency model: it only guarantees sequential behavior on uniprocessor systems; it is not usable to program in parallel environments.

Sequential consistency (SC)
- There is a global total order of all memory accesses (of all processors); all processors agree on such a global order, which can change at each run.
- Each processor sees its own accesses in program order.
- It is the model implied by a cacheless system, with a single memory device, with processors unable to perform out-of-order execution.
- It offsets many architectural optimizations, but it is easy to use.

SC: consequence
Initially A=0 and B=0; A and B are volatile.

    P1: 1a. A = 1;   2a. B = 1;
    P2: 1b. while (B==0) continue;   2b. assert(A==1);

The assertion cannot fail. A possible history (notation: W(A)1 = write value 1 to variable A, R(B)0 = read value 0 from B, time flowing left to right):

    P1: W(A)1  W(B)1
    P2:        R(B)0  R(B)0  R(B)1  R(A)1

Sequential consistency: sufficient conditions
In a cache-based system, with no constraint on the interconnect:
- all processors issue their accesses in program order;
- a processor does not issue an access until its previous accesses have been globally performed (this requires waiting for acknowledgements from the other processors).
Then each processor sees its own accesses in program order, all processors agree on a global order, and it is easy to enforce the order between accesses from different processors (e.g., access 1a is before access 2b in the previous example). The price: no out-of-order execution, and a write hit on the cache must wait for the answers. This offsets many architectural optimizations, but the model is easy to use.

Comparing consistency models
The union of all the perceived orders can be valid or not for a given consistency model.
- Consistency model A is stronger than consistency model B (and B is weaker than A) if each execution valid on A is also valid on B.
- If there exist some execution E1 valid on A and not valid on B, and some execution E2 valid on B and not valid on A, then A and B are incomparable.

Example: P1 executes I1: store A; I2: store B. P2 executes I3: load A; I4: load B.
1) P2 sees I1, I3, I4, I2: a valid execution for sequential consistency (total order implied: I1, I3, I4, I2).
2) P2 sees I3, I2, I4, I1: an invalid execution for sequential consistency (there is no unique total order), but a valid execution for local consistency (P1 and P2 each see their own accesses in order).

Uniform consistency models

Causal consistency (Causal)
All processors agree on the order of causally related events; causally unrelated events can be observed in different orders.

Example: X is initially 0.
- event 1: a processor writes 1 to X
- event 2: another processor reads X and obtains 1
- event 3: that processor writes 2 to X
Hence event 1 happened before event 2, and event 2 happened before event 3; all processors must agree on such an ordering.

Example: X is initially 0.
- event 1: a processor reads X and obtains 0
- event 2: that processor writes 1 to X
- event 3: another processor reads X and obtains 1
Hence event 1 happened before event 2, and event 2 happened before event 3; all processors must agree on such an ordering.

Causal consistency: example 1
Initially X=0.

    P1: 1a: X = 1;   2a: X = 3;
    P2: 1b: A = X;   2b: X = 2;
    P3: 1c: B = X;   2c: D = X;   3c: F = X;
    P4: 1d: C = X;   2d: E = X;   3d: G = X;

Result: A=1; B=1; C=1; D=3; E=2; F=2; G=3.

History:

    P1: W(X)1  W(X)3
    P2: R(X)1  W(X)2
    P3: R(X)1  R(X)3  R(X)2
    P4: R(X)1  R(X)2  R(X)3

For P3: 2a < 2b; for P4: 2b < 2a. A single global order is not possible (2a vs 2b): the execution is not sequentially consistent. But 2a and 2b are not causally related, and there are no contradictions on causal dependencies: the execution is causally consistent.

Causal consistency: example 2
Initially X=0, Y=0.

    P1: 1a: X = 1;   2a: X = 2;
    P2: 1b: A = X;   2b: Y = 3;
    P3: 1c: B = Y;   2c: C = X;

Result: A=2; B=3; C=1.

History:

    P1: W(X)1  W(X)2
    P2: R(X)2  W(Y)3
    P3: R(Y)3  R(X)1

For P2: 2a < 2b (since A=2); for P3: 2b < 2a (since B=3 and C=1). P2 and P3 disagree on the order between 2a and 2b, but 2a and 2b are causally related (constraint due to A=2): the execution is not causally consistent.

Causal consistency: example 3
Initially X=0, Y=0; same program as example 2:

    P1: 1a: X = 1;   2a: X = 2;
    P2: 1b: A = X;   2b: Y = 3;
    P3: 1c: B = Y;   2c: C = X;

Result: A=2; B=3; C=2. There is no disagreement on causally related accesses: the execution is causally consistent.

Uniform consistency models

PRAM (pipelined RAM) consistency (PRAM)
Writes performed by a single process are seen by all other processes in the order in which they were issued; the perceived order of the complete set of writes can be different for each process.

Cache consistency (CC)
All writes to the same memory location are performed in some sequential order: all processes see the same order of writes for each location (but the order of the complete set of writes can differ).

PRAM consistency: example
Initially X=0.

    P1: 1a: X = 1;
    P2: 1b: A = X;   2b: X = 2;
    P3: 1c: B = X;   2c: D = X;
    P4: 1d: C = X;   2d: E = X;

Result: A=1; B=1; C=2; D=2; E=1.

History:

    P1: W(X)1
    P2: R(X)1  W(X)2
    P3: R(X)1  R(X)2
    P4: R(X)2  R(X)1

All processors see the same order for the writes of each single process (trivially, since each process issues at most one write, 1a or 2b): the execution is PRAM consistent. A single global order is not possible: the execution is not sequentially consistent. P3 and P4 do not agree on the causal relation between 1a and 2b: the execution is not causally consistent.

Cache consistency: example
Initially X=0, Y=0.

    P1: 1a: X = 1;   2a: A = Y;
    P2: 1b: Y = 1;   2b: B = X;

Result: A=0; B=0.

History:

    P1: W(X)1  R(Y)0
    P2: W(Y)1  R(X)0

For P1: 1a < 1b (since A=0); for P2: 1b < 1a (since B=0). A single global order is not possible: the execution is not sequentially consistent. However, all processors see the same order for the writes on X (only 1a) and on Y (only 1b), trivially: the execution is cache consistent.

Uniform consistency models

Processor consistency (PC)
PRAM consistent and cache consistent.
- The tie-breaker (Peterson's) algorithm executes correctly under processor consistency; the bakery algorithm needs sequential consistency.
- Processor consistent machines are easier to build than sequentially consistent systems.

Processor consistency: example
Initially X=0, Y=0.

    P1: 1a: X = 1;   2a: c = 1;   3a: A = Y;
    P2: 1b: Y = 1;   2b: c = 2;   3b: B = X;

Result: A=0; B=0.

History:

    P1: W(X)1  W(c)1  R(Y)0
    P2: W(Y)1  W(c)2  R(X)0

A=0 implies, for P1, 3a < 1b; B=0 implies, for P2, 3b < 1a. Hence:
- for P1: 1a < 2a < 3a < 1b < 2b
- for P2: 1b < 2b < 3b < 1a < 2a
The two processors see different orders for the writes on c: the execution is not processor consistent.

Uniform consistency models

Slow consistency
- All processors agree on the order of the observed writes to each location by a single processor.
- Writes by a process must be immediately visible to the process itself.
- It models a system where writes propagate slowly to memory and to the other processors.

Uniform consistency models: from stronger to weaker
- Sequential consistency
- Causal consistency
- Processor consistency
- PRAM consistency
- Cache consistency
- Slow consistency
- Local consistency

SC vs PC: example
Initially A=0 and B=0; A and B are volatile.

    P1: 1a. A = 1;   2a. X = B;
    P2: 1b. B = 1;   2b. Y = A;

On sequentially consistent systems, X==0 and Y==0 is not possible; on processor consistent systems, X==0 and Y==0 is possible.

SC vs PC: example
Initially A=0, B=0, C=0, D=0, E=0; A, B, C, D, E are volatile.

    P1: 1a. A = 1;   2a. B = D;   3a. C = 1;
    P2: 1b. D = 1;   2b. E = A;   3b. while (C==0) continue;   4b. assert(B==1 || E==1);

The assertion 4b cannot fail on sequentially consistent systems, but can fail on processor consistent systems.

Consistency models and synchronization
For 2 processes, many synchronization patterns work in the same way on processor consistent systems as on sequentially consistent systems. It is possible to construct a situation in which processor ordering fails, but there are few chances that such code is somewhat useful.

Signaling
Initially A=0 and B=0; A and B are volatile.

    P1: 1a. A = 1;   2a. B = 1;
    P2: 1b. while (B==0) continue;   2b. assert(A==1);

The assertion cannot fail on sequentially consistent and on processor consistent systems:

    P1: W(A)1  W(B)1
    P2:        R(B)0  R(B)0  R(B)1  R(A)1

Barrier
Initially A=0, B=0, C=0, D=0; A, B, C, D are volatile.

    P1: 1a. A = 1;   2a. B = 1;   3a. while (D==0) continue;   4a. assert(A==1 && C==1);
    P2: 1b. C = 1;   2b. D = 1;   3b. while (B==0) continue;   4b. assert(A==1 && C==1);

The assertions cannot fail on sequentially consistent and on processor consistent systems.

Consistency models and synchronization
For 3 or more processes, there are simple synchronization patterns that work on sequentially consistent systems but not on processor consistent systems. However, it is easy to introduce small changes to obtain a correct synchronization even on processor consistent systems.

Signaling with 3 processes
Initially A=0, B=0, C=0; A, B, C are volatile.

    P1: 1a. A = 1;   2a. B = 1;
    P2: 1b. while (B==0) continue;   2b. C = 1;
    P3: 1c. while (C==0) continue;   2c. assert(A==1);

The assertion 2c cannot fail on sequentially consistent systems, but can fail on processor consistent systems: W(A)1 and W(C)1 are performed on different variables by different cores, so on PC systems no order is enforced between them.

Signaling exploiting cache coherency
Initially A=0 and B=0; A and B are volatile.

    P1: 1a. A = 1;   2a. B = 1;
    P2: 1b. while (B==0) continue;   2b. B = 2;
    P3: 1c. while (B!=2) continue;   2c. assert(A==1);

The assertion 2c cannot fail on processor consistent systems: W(B)1 and W(B)2 are performed by different cores on the same variable, so cache coherency enforces the access ordering.

Signaling exploiting cache coherency
Initially A=0, B=0, C=0; A, B, C are volatile.

    P1: 1a. A = 1;   2a. B = 1;
    P2: 1b. while (B==0) continue;   2b. B = 1;   3b. C = 1;
    P3: 1c. while (C==0) continue;   2c. assert(A==1);

The assertion 2c cannot fail on processor consistent systems: W(B)1 (2a) and W(B)1 (2b) are performed by different cores on the same variable, so cache coherency enforces the access ordering; W(B)1 and W(C)1 are performed by the same processor, so their order is enforced by PRAM consistency.

Hybrid consistency models
- Weak consistency (WC)
- Release consistency (RC)
- Entry consistency (EC)
- Others: scope consistency, location consistency, DAG consistency

Weak consistency (WC)
Two types of accesses: not synchronizing (read, write, read-write) and synchronizing.
- Accesses to synchronization variables are sequentially consistent.
- No access to a synchronization variable is issued by a processor before all its previous data accesses have been performed.
- No access is issued by a processor before a previous access to a synchronization variable has been performed.
- Standard reads and writes obey local consistency.
A synchronization access works as a fence.

Weak consistency: example

    P1: 1a: W(X)1        2a: sync_w(Y)1
    P2: 1b: sync_r(Y)1   2b: R(X)1

1a < 2a: they cannot be reordered, since 2a is a synchronizing access. 1b < 2b: they cannot be reordered, since 1b is a synchronizing access. Y=1 implies 2a < 1b. Hence, in 2b, X must be 1.

Data race
- Conflicting accesses: accesses to the same address from different processors, where at least one is a write.
- The access order can be enforced by the consistency model (SC) or by using a synchronization access.
- Data race: two conflicting accesses with no ordering imposed.

SC-DRF
A program executing on a weakly consistent system appears sequentially consistent if:
- there are no data races (i.e., no competing accesses);
- synchronization is visible to the memory system.
Sequential consistency for data-race-free programs.

Hybrid consistency models

Release consistency (RC)
Two kinds of synchronization accesses:
- acquire: only delays future accesses; often associated to a read (load_acquire);
- release: only waits for previous accesses; often associated to a write (store_release).
Synchronization accesses are processor consistent. Acquire and release act as semi-permeable barriers.

Acquire and release

    1: access1
    2: access2
    3: acquire
    4: access3
    5: access4
    6: release
    7: access5
    8: access6

- access1 and access2 can be reordered before and after the acquire, but not after the release;
- access3 and access4 can be reordered only between the acquire and the release;
- access5 and access6 can be reordered before and after the release, but not before the acquire.

Release consistency: example

    P1: 1a: W(X)1       2a: W_rel(Y)1
    P2: 1b: R_acq(Y)1   2b: R(X)1

1a < 2a: they cannot be reordered, since 2a is a release access. 1b < 2b: they cannot be reordered, since 1b is an acquire access. Y=1 implies 2a < 1b. Hence, in 2b, X must be 1.

Hybrid consistency models

Entry consistency (EC)
Similar to RC; the differences:
- each shared variable is associated to a synchronizing variable, and the association can change dynamically under program control;
- a synchronizing variable is a lock or a barrier;
- acquire accesses can be exclusive or non-exclusive.

Synchronizing accesses
A synchronizing access plays two roles: (1) a reordering constraint and (2) a memory access.
- Full fences: weak consistency.
- Release and acquire: release consistency.

Memory barriers
Synchronizing accesses without the memory-access part: a mechanism to control out-of-order execution. They are instructions that prevent memory access reordering:
- read barriers: prevent reordering of reads (e.g., wait until the invalidate queue is empty);
- write barriers: prevent reordering of writes (e.g., wait until the store buffer is empty);
- full barriers: act on all accesses.

Example: barrier on the writer only.

    P1: 1a: W(X)1   2a: barrier 1   3a: W(Y)1
    P2: 1b: R(Y)1   2b: R(X)?

1a and 3a are executed in order, but 1b and 2b can be executed out of order: for P2 it is the same as if 1a and 3a could be executed out of order. Y=1 implies 3a < 1b; still, in 2b, X can be either 0 or 1.

Example: barriers on both sides.

    P1: 1a: W(X)1   2a: barrier 1   3a: W(Y)1
    P2: 1b: R(Y)1   2b: barrier 2   3b: R(X)1

For P1, barrier 1 gives 1a < 2a < 3a (program order); for P2, barrier 2 gives 1b < 2b < 3b (program order). Y=1 implies 3a < 1b. Hence 1a < 2a < 3a < 1b < 2b < 3b: in 3b, X must be 1.

CPUs' memory consistency models
- Processors implement out-of-order execution (store buffer, cache coherency, ...).
- CPU specifications provide rules about the possible reorderings; different memory areas can have different rules.
- ISAs provide instructions to control reordering: barriers.

Memory consistency model: Alpha
There is a partial order, BEFORE (or <=): a global relation (memory order). Processors can perform accesses out of order. Accesses: instruction fetch (IF), read (R), write (W); when addresses overlap, order is maintained for IF-IF, IF-W, R-R, R-W, W-W. The I-cache and the pipeline are not coherent. Three kinds of barriers:
- MB: forces no reordering between reads and writes;
- WMB: forces no reordering between writes;
- IMB: forces no reordering for reads, writes and instruction fetches.

Order is not enforced for data dependency. Example, with global_ptr initially NULL:

    P1:                          P2:
    1a: ptr = malloc(...);       1b: while (global_ptr==NULL) continue;
    2a: ptr->key = val;          2b: mb
    3a: ptr->data = data;        3b: myval = global_ptr->key;
    4a: wmb                      4b: mydata = global_ptr->data;
    5a: global_ptr = ptr;

There is a data dependency from 1b to 3b, but the addresses do not overlap: a barrier is required on the reader side too.

Memory consistency model: ARMv7
- No global memory order; accesses to a single address are seen in the same order by all processors (cache coherency).
- Instruction fetches, data reads and data writes can be performed out of order; data-dependent loads are not reordered.
- The I-cache and the pipeline are not coherent.
Three kinds of barriers:
- DMB (Data Memory Barrier): all specified memory accesses before the barrier must be completed before any (specified) memory access after the barrier is started;
- DSB (Data Synchronization Barrier): all specified memory accesses before the barrier must be completed before any instruction after the barrier is started;
- ISB (Instruction Synchronization Barrier): flushes the pipeline.
The DMB, DSB and ISB instructions were added in ARMv7. Previous versions use CP15 operations to implement barriers: in ARMv6 the barrier operations are always defined; in ARMv4 and ARMv5 the barrier operations may not exist.

Memory consistency model: ARMv7 memory types

Normal memory
- Three levels of shareability: Non-shareable (Normal memory used by only a single processor), Inner Shareable (Normal memory shared between several processors), Outer Shareable (Normal memory shared between processors and devices).
- Cacheability: Non-cacheable; Write-Through Cacheable; Write-Back Write-Allocate Cacheable; Write-Back no Write-Allocate Cacheable.

Device memory
- Accesses are strongly ordered: all memory accesses occur in program order.
- Shareability: Shareable (memory-mapped peripherals shared by several processors) or Non-shareable (memory-mapped peripherals used only by a single processor).
- Cacheability: Non-cacheable. A write to Device memory is permitted to complete before it reaches the target.

Strongly-ordered memory
- Accesses are strongly ordered.
- Shareability: all Strongly-ordered regions are assumed to be Shareable.
- Cacheability: Non-cacheable. A write to Strongly-ordered memory can complete only when it reaches the target.

Memory consistency model: ARMv7 barrier options
- DMB (or DSB) sy: barrier for all memory accesses of the Outer Shareable domain (full system barrier).
- DMB (or DSB) st: barrier for writes of the Outer Shareable domain.
- DMB (or DSB) ish (alias sh): barrier for all memory accesses of the Inner Shareable domain.
- DMB (or DSB) ishst (alias shst): barrier for writes of the Inner Shareable domain.
- DMB (or DSB) nsh (alias un): barrier for all memory accesses of the Non-shareable domain.
- DMB (or DSB) nshst (alias unst): barrier for writes of the Non-shareable domain.

Memory consistency model: MIPS32
Barriers are optional. Two families:
- Completion barriers: all specified memory accesses before the barrier must be completed (globally performed) before the memory accesses after the barrier are started. SYNC (or SYNC 0) acts on R and W, and is required in all implementations.
- Ordering barriers: all specified memory accesses before the barrier must be ordered before the specified memory accesses after the barrier:
  - SYNC_WMB (or SYNC 4): acts on W
  - SYNC_MB (or SYNC 16): acts on R and W
  - SYNC_ACQUIRE (or SYNC 17): acts on R (before) and R and W (after)
  - SYNC_RELEASE (or SYNC 18): acts on R and W (before) and W (after)
  - SYNC_RMB (or SYNC 19): acts on R

Instruction cache: SYNCI (Synchronize Caches to Make Instruction Writes Effective) updates an I-cache line so that it can be used after a code change.

Memory consistency model: IA-32
Memory areas can be:
- UC (uncacheable): strong ordering is enforced; useful for memory-mapped devices.
- WC (write-combining): writes are combined in special buffers, coherence is not enforced; useful for framebuffers (the order of writes is not relevant).
- WB (cacheable, write-back policy): coherence is enforced.
- WT (cacheable, write-through policy): coherence is enforced; useful for devices that access memory (DMA-capable devices) without implementing cache coherency protocols.

For WB and WT:
- there is a global memory ordering;
- order is maintained for R-R, R-W, W-W;
- order is not maintained for W-R: the read can obtain data from the forwarding path;
- some streaming store instructions (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, MOVNTPD) allow W-W reordering, and string operations allow W-W reordering.

For WB memory areas:
- Individual processors use the same ordering principles as in a single-processor system.
- Writes by a single processor are observed in the same order by all processors.
- Writes from an individual processor are NOT ordered with respect to the writes from other processors.
- Memory ordering obeys causality (memory ordering respects transitive visibility).
- Any two stores are seen in a consistent order by processors other than those performing the stores.
- Locked instructions have a total order.

Three kinds of barriers:
- MFENCE: serializes load and store operations; guarantees that all loads and stores specified before the fence are globally observable prior to any loads or stores carried out after the fence.
- LFENCE: serializes load operations; guarantees ordering between two loads and prevents speculative loads from passing the load fence.
- SFENCE: serializes store operations; guarantees that every store instruction that precedes the SFENCE in program order becomes globally visible before any store instruction that follows the SFENCE.

Memory consistency models and the OS
The OS must provide primitives to enforce access ordering:
- processor vs processor accesses: not required on uniprocessor systems;
- processor vs device accesses: required even on uniprocessor systems.
Multi-architecture issue: portable code must use the weakest model among all supported architectures. The Linux weakest model is the Alpha one, whose consistency model does not guarantee ordering between data-dependent accesses.

Linux memory barriers
- Compiler barrier: barrier() prevents compiler reordering of accesses; the processor can still perform out-of-order accesses (it is a compiler directive, no instruction is emitted).
- Processor vs processor barriers: smp_mb() (full memory barrier), smp_rmb() (memory barrier for reads), smp_wmb() (memory barrier for writes), smp_read_barrier_depends() (memory barrier for data dependency).
- Processor vs anything barriers: mb() (full memory barrier), rmb() (memory barrier for reads), wmb() (memory barrier for writes), read_barrier_depends() (memory barrier for data dependency).

Linux memory barriers: implementation examples

On uniprocessor systems the smp_* barriers reduce to compiler barriers, while the mb() family maps to the architecture instructions:

                               Alpha    ARMv7     MIPS32   IA-32
    mb()                       mb       dsb       sync     mfence
    rmb()                      mb       dsb       sync     lfence
    wmb()                      wmb      dsb st    sync     sfence
    read_barrier_depends()     mb       (no-op)   (no-op)  (no-op)

On multiprocessor systems the smp_* barriers also map to instructions:

                                   Alpha    ARMv7       MIPS32   IA-32
    smp_mb()                       mb       dmb ish     sync     mfence
    smp_rmb()                      mb       dmb ish     sync     lfence
    smp_wmb()                      wmb      dmb ishst   sync     sfence
    smp_read_barrier_depends()     mb       (no-op)     (no-op)  (no-op)


More information

CS533 Concepts of Operating Systems. Jonathan Walpole

CS533 Concepts of Operating Systems. Jonathan Walpole CS533 Concepts of Operating Systems Jonathan Walpole Shared Memory Consistency Models: A Tutorial Outline Concurrent programming on a uniprocessor The effect of optimizations on a uniprocessor The effect

More information

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013 Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the

More information

ARMv8-A Software Development

ARMv8-A Software Development ARMv8-A Software Development Course Description ARMv8-A software development is a 4 days ARM official course. The course goes into great depth and provides all necessary know-how to develop software for

More information

Data-Centric Consistency Models. The general organization of a logical data store, physically distributed and replicated across multiple processes.

Data-Centric Consistency Models. The general organization of a logical data store, physically distributed and replicated across multiple processes. Data-Centric Consistency Models The general organization of a logical data store, physically distributed and replicated across multiple processes. Consistency models The scenario we will be studying: Some

More information

Introduction. Coherency vs Consistency. Lec-11. Multi-Threading Concepts: Coherency, Consistency, and Synchronization

Introduction. Coherency vs Consistency. Lec-11. Multi-Threading Concepts: Coherency, Consistency, and Synchronization Lec-11 Multi-Threading Concepts: Coherency, Consistency, and Synchronization Coherency vs Consistency Memory coherency and consistency are major concerns in the design of shared-memory systems. Consistency

More information

Shared Memory Consistency Models: A Tutorial

Shared Memory Consistency Models: A Tutorial Shared Memory Consistency Models: A Tutorial By Sarita Adve, Kourosh Gharachorloo WRL Research Report, 1995 Presentation: Vince Schuster Contents Overview Uniprocessor Review Sequential Consistency Relaxed

More information

Distributed Shared Memory and Memory Consistency Models

Distributed Shared Memory and Memory Consistency Models Lectures on distributed systems Distributed Shared Memory and Memory Consistency Models Paul Krzyzanowski Introduction With conventional SMP systems, multiple processors execute instructions in a single

More information

Systèmes d Exploitation Avancés

Systèmes d Exploitation Avancés Systèmes d Exploitation Avancés Instructor: Pablo Oliveira ISTY Instructor: Pablo Oliveira (ISTY) Systèmes d Exploitation Avancés 1 / 32 Review : Thread package API tid thread create (void (*fn) (void

More information

SELECTED TOPICS IN COHERENCE AND CONSISTENCY

SELECTED TOPICS IN COHERENCE AND CONSISTENCY SELECTED TOPICS IN COHERENCE AND CONSISTENCY Michel Dubois Ming-Hsieh Department of Electrical Engineering University of Southern California Los Angeles, CA90089-2562 dubois@usc.edu INTRODUCTION IN CHIP

More information

RELAXED CONSISTENCY 1

RELAXED CONSISTENCY 1 RELAXED CONSISTENCY 1 RELAXED CONSISTENCY Relaxed Consistency is a catch-all term for any MCM weaker than TSO GPUs have relaxed consistency (probably) 2 XC AXIOMS TABLE 5.5: XC Ordering Rules. An X Denotes

More information

Relaxed Memory-Consistency Models

Relaxed Memory-Consistency Models Relaxed Memory-Consistency Models Review. Why are relaxed memory-consistency models needed? How do relaxed MC models require programs to be changed? The safety net between operations whose order needs

More information

<atomic.h> weapons. Paolo Bonzini Red Hat, Inc. KVM Forum 2016

<atomic.h> weapons. Paolo Bonzini Red Hat, Inc. KVM Forum 2016 weapons Paolo Bonzini Red Hat, Inc. KVM Forum 2016 The real things Herb Sutter s talks atomic Weapons: The C++ Memory Model and Modern Hardware Lock-Free Programming (or, Juggling Razor Blades)

More information

DISTRIBUTED SHARED MEMORY

DISTRIBUTED SHARED MEMORY DISTRIBUTED SHARED MEMORY COMP 512 Spring 2018 Slide material adapted from Distributed Systems (Couloris, et. al), and Distr Op Systems and Algs (Chow and Johnson) 1 Outline What is DSM DSM Design and

More information

Example: The Dekker Algorithm on SMP Systems. Memory Consistency The Dekker Algorithm 43 / 54

Example: The Dekker Algorithm on SMP Systems. Memory Consistency The Dekker Algorithm 43 / 54 Example: The Dekker Algorithm on SMP Systems Memory Consistency The Dekker Algorithm 43 / 54 Using Memory Barriers: the Dekker Algorithm Mutual exclusion of two processes with busy waiting. //flag[] is

More information

Computer Architecture

Computer Architecture Jens Teubner Computer Architecture Summer 2016 1 Computer Architecture Jens Teubner, TU Dortmund jens.teubner@cs.tu-dortmund.de Summer 2016 Jens Teubner Computer Architecture Summer 2016 83 Part III Multi-Core

More information

An introduction to weak memory consistency and the out-of-thin-air problem

An introduction to weak memory consistency and the out-of-thin-air problem An introduction to weak memory consistency and the out-of-thin-air problem Viktor Vafeiadis Max Planck Institute for Software Systems (MPI-SWS) CONCUR, 7 September 2017 Sequential consistency 2 Sequential

More information

Parallel Computer Architecture Spring Memory Consistency. Nikos Bellas

Parallel Computer Architecture Spring Memory Consistency. Nikos Bellas Parallel Computer Architecture Spring 2018 Memory Consistency Nikos Bellas Computer and Communications Engineering Department University of Thessaly Parallel Computer Architecture 1 Coherence vs Consistency

More information

Portland State University ECE 588/688. Memory Consistency Models

Portland State University ECE 588/688. Memory Consistency Models Portland State University ECE 588/688 Memory Consistency Models Copyright by Alaa Alameldeen 2018 Memory Consistency Models Formal specification of how the memory system will appear to the programmer Places

More information

The Cache-Coherence Problem

The Cache-Coherence Problem The -Coherence Problem Lecture 12 (Chapter 6) 1 Outline Bus-based multiprocessors The cache-coherence problem Peterson s algorithm Coherence vs. consistency Shared vs. Distributed Memory What is the difference

More information

Chapter 8. Multiprocessors. In-Cheol Park Dept. of EE, KAIST

Chapter 8. Multiprocessors. In-Cheol Park Dept. of EE, KAIST Chapter 8. Multiprocessors In-Cheol Park Dept. of EE, KAIST Can the rapid rate of uniprocessor performance growth be sustained indefinitely? If the pace does slow down, multiprocessor architectures will

More information

Memory Consistency Models. CSE 451 James Bornholt

Memory Consistency Models. CSE 451 James Bornholt Memory Consistency Models CSE 451 James Bornholt Memory consistency models The short version: Multiprocessors reorder memory operations in unintuitive, scary ways This behavior is necessary for performance

More information

The Java Memory Model

The Java Memory Model The Java Memory Model The meaning of concurrency in Java Bartosz Milewski Plan of the talk Motivating example Sequential consistency Data races The DRF guarantee Causality Out-of-thin-air guarantee Implementation

More information

1. Memory technology & Hierarchy

1. Memory technology & Hierarchy 1. Memory technology & Hierarchy Back to caching... Advances in Computer Architecture Andy D. Pimentel Caches in a multi-processor context Dealing with concurrent updates Multiprocessor architecture In

More information

CMSC Computer Architecture Lecture 15: Memory Consistency and Synchronization. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 15: Memory Consistency and Synchronization. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 15: Memory Consistency and Synchronization Prof. Yanjing Li University of Chicago Administrative Stuff! Lab 5 (multi-core) " Basic requirements: out later today

More information

Sequential Consistency & TSO. Subtitle

Sequential Consistency & TSO. Subtitle Sequential Consistency & TSO Subtitle Core C1 Core C2 data = 0, 1lag SET S1: store data = NEW S2: store 1lag = SET L1: load r1 = 1lag B1: if (r1 SET) goto L1 L2: load r2 = data; Will r2 always be set to

More information

Foundations of the C++ Concurrency Memory Model

Foundations of the C++ Concurrency Memory Model Foundations of the C++ Concurrency Memory Model John Mellor-Crummey and Karthik Murthy Department of Computer Science Rice University johnmc@rice.edu COMP 522 27 September 2016 Before C++ Memory Model

More information

Administrivia. Review: Thread package API. Program B. Program A. Program C. Correct answers. Please ask questions on Google group

Administrivia. Review: Thread package API. Program B. Program A. Program C. Correct answers. Please ask questions on Google group Administrivia Please ask questions on Google group - We ve had several questions that would likely be of interest to more students x86 manuals linked on reference materials page Next week, guest lecturer:

More information

Computer Science 146. Computer Architecture

Computer Science 146. Computer Architecture Computer Architecture Spring 24 Harvard University Instructor: Prof. dbrooks@eecs.harvard.edu Lecture 2: More Multiprocessors Computation Taxonomy SISD SIMD MISD MIMD ILP Vectors, MM-ISAs Shared Memory

More information

SPIN, PETERSON AND BAKERY LOCKS

SPIN, PETERSON AND BAKERY LOCKS Concurrent Programs reasoning about their execution proving correctness start by considering execution sequences CS4021/4521 2018 jones@scss.tcd.ie School of Computer Science and Statistics, Trinity College

More information

General Purpose Processors

General Purpose Processors Calcolatori Elettronici e Sistemi Operativi Specifications Device that executes a program General Purpose Processors Program list of instructions Instructions are stored in an external memory Stored program

More information

Martin Kruliš, v

Martin Kruliš, v Martin Kruliš 1 Optimizations in General Code And Compilation Memory Considerations Parallelism Profiling And Optimization Examples 2 Premature optimization is the root of all evil. -- D. Knuth Our goal

More information

Symmetric Multiprocessors: Synchronization and Sequential Consistency

Symmetric Multiprocessors: Synchronization and Sequential Consistency Constructive Computer Architecture Symmetric Multiprocessors: Synchronization and Sequential Consistency Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology November

More information

Multiprocessors and Locking

Multiprocessors and Locking Types of Multiprocessors (MPs) Uniform memory-access (UMA) MP Access to all memory occurs at the same speed for all processors. Multiprocessors and Locking COMP9242 2008/S2 Week 12 Part 1 Non-uniform memory-access

More information

Coherence and Consistency

Coherence and Consistency Coherence and Consistency 30 The Meaning of Programs An ISA is a programming language To be useful, programs written in it must have meaning or semantics Any sequence of instructions must have a meaning.

More information

Handout 3 Multiprocessor and thread level parallelism

Handout 3 Multiprocessor and thread level parallelism Handout 3 Multiprocessor and thread level parallelism Outline Review MP Motivation SISD v SIMD (SIMT) v MIMD Centralized vs Distributed Memory MESI and Directory Cache Coherency Synchronization and Relaxed

More information

CS510 Advanced Topics in Concurrency. Jonathan Walpole

CS510 Advanced Topics in Concurrency. Jonathan Walpole CS510 Advanced Topics in Concurrency Jonathan Walpole Threads Cannot Be Implemented as a Library Reasoning About Programs What are the valid outcomes for this program? Is it valid for both r1 and r2 to

More information

Lecture 6: Lazy Transactional Memory. Topics: TM semantics and implementation details of lazy TM

Lecture 6: Lazy Transactional Memory. Topics: TM semantics and implementation details of lazy TM Lecture 6: Lazy Transactional Memory Topics: TM semantics and implementation details of lazy TM 1 Transactions Access to shared variables is encapsulated within transactions the system gives the illusion

More information

Cortex-A9 MPCore Software Development

Cortex-A9 MPCore Software Development Cortex-A9 MPCore Software Development Course Description Cortex-A9 MPCore software development is a 4 days ARM official course. The course goes into great depth and provides all necessary know-how to develop

More information

CSE502: Computer Architecture CSE 502: Computer Architecture

CSE502: Computer Architecture CSE 502: Computer Architecture CSE 502: Computer Architecture Shared-Memory Multi-Processors Shared-Memory Multiprocessors Multiple threads use shared memory (address space) SysV Shared Memory or Threads in software Communication implicit

More information

Page 1. Outline. Coherence vs. Consistency. Why Consistency is Important

Page 1. Outline. Coherence vs. Consistency. Why Consistency is Important Outline ECE 259 / CPS 221 Advanced Computer Architecture II (Parallel Computer Architecture) Memory Consistency Models Copyright 2006 Daniel J. Sorin Duke University Slides are derived from work by Sarita

More information

Beyond Sequential Consistency: Relaxed Memory Models

Beyond Sequential Consistency: Relaxed Memory Models 1 Beyond Sequential Consistency: Relaxed Memory Models Computer Science and Artificial Intelligence Lab M.I.T. Based on the material prepared by and Krste Asanovic 2 Beyond Sequential Consistency: Relaxed

More information

RA3 - Cortex-A15 implementation

RA3 - Cortex-A15 implementation Formation Cortex-A15 implementation: This course covers Cortex-A15 high-end ARM CPU - Processeurs ARM: ARM Cores RA3 - Cortex-A15 implementation This course covers Cortex-A15 high-end ARM CPU OBJECTIVES

More information

Shared Memory Consistency Models: A Tutorial

Shared Memory Consistency Models: A Tutorial Shared Memory Consistency Models: A Tutorial By Sarita Adve & Kourosh Gharachorloo Slides by Jim Larson Outline Concurrent programming on a uniprocessor The effect of optimizations on a uniprocessor The

More information

Hardware models: inventing a usable abstraction for Power/ARM. Friday, 11 January 13

Hardware models: inventing a usable abstraction for Power/ARM. Friday, 11 January 13 Hardware models: inventing a usable abstraction for Power/ARM 1 Hardware models: inventing a usable abstraction for Power/ARM Disclaimer: 1. ARM MM is analogous to Power MM all this is your next phone!

More information

Relaxed Memory: The Specification Design Space

Relaxed Memory: The Specification Design Space Relaxed Memory: The Specification Design Space Mark Batty University of Cambridge Fortran meeting, Delft, 25 June 2013 1 An ideal specification Unambiguous Easy to understand Sound w.r.t. experimentally

More information

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM UNIT III MULTIPROCESSORS AND THREAD LEVEL PARALLELISM 1. Symmetric Shared Memory Architectures: The Symmetric Shared Memory Architecture consists of several processors with a single physical memory shared

More information

Order Is A Lie. Are you sure you know how your code runs?

Order Is A Lie. Are you sure you know how your code runs? Order Is A Lie Are you sure you know how your code runs? Order in code is not respected by Compilers Processors (out-of-order execution) SMP Cache Management Understanding execution order in a multithreaded

More information

Lecture 29 Review" CPU time: the best metric" Be sure you understand CC, clock period" Common (and good) performance metrics"

Lecture 29 Review CPU time: the best metric Be sure you understand CC, clock period Common (and good) performance metrics Be sure you understand CC, clock period Lecture 29 Review Suggested reading: Everything Q1: D[8] = D[8] + RF[1] + RF[4] I[15]: Add R2, R1, R4 RF[1] = 4 I[16]: MOV R3, 8 RF[4] = 5 I[17]: Add R2, R2, R3

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

Advanced Caching Techniques (2) Department of Electrical Engineering Stanford University

Advanced Caching Techniques (2) Department of Electrical Engineering Stanford University Lecture 4: Advanced Caching Techniques (2) Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee282 Lecture 4-1 Announcements HW1 is out (handout and online) Due on 10/15

More information

Review: Thread package API

Review: Thread package API Review: Thread package API tid thread create (void (*fn) (void *), void *arg); - Create a new thread that calls fn with arg void thread exit (); void thread join (tid thread); The execution of multiple

More information

Lecture 12: Hardware/Software Trade-Offs. Topics: COMA, Software Virtual Memory

Lecture 12: Hardware/Software Trade-Offs. Topics: COMA, Software Virtual Memory Lecture 12: Hardware/Software Trade-Offs Topics: COMA, Software Virtual Memory 1 Capacity Limitations P P P P B1 C C B1 C C Mem Coherence Monitor Mem Coherence Monitor B2 In a Sequent NUMA-Q design above,

More information

Computer System Architecture Final Examination Spring 2002

Computer System Architecture Final Examination Spring 2002 Computer System Architecture 6.823 Final Examination Spring 2002 Name: This is an open book, open notes exam. 180 Minutes 22 Pages Notes: Not all questions are of equal difficulty, so look over the entire

More information

Relaxed Memory-Consistency Models

Relaxed Memory-Consistency Models Relaxed Memory-Consistency Models [ 9.1] In small multiprocessors, sequential consistency can be implemented relatively easily. However, this is not true for large multiprocessors. Why? This is not the

More information

Review of last lecture. Peer Quiz. DPHPC Overview. Goals of this lecture. Is coherence everything?

Review of last lecture. Peer Quiz. DPHPC Overview. Goals of this lecture. Is coherence everything? Review of last lecture Design of Parallel and High-Performance Computing Fall Lecture: Memory Models Motivational video: https://www.youtube.com/watch?v=twhtg4ous Instructor: Torsten Hoefler & Markus Püschel

More information

Cache Coherence in Bus-Based Shared Memory Multiprocessors

Cache Coherence in Bus-Based Shared Memory Multiprocessors Cache Coherence in Bus-Based Shared Memory Multiprocessors Shared Memory Multiprocessors Variations Cache Coherence in Shared Memory Multiprocessors A Coherent Memory System: Intuition Formal Definition

More information

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU , Spring 2013

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU , Spring 2013 Lecture 10: Cache Coherence: Part I Parallel Computer Architecture and Programming Cache design review Let s say your code executes int x = 1; (Assume for simplicity x corresponds to the address 0x12345604

More information

This section covers the MIPS instruction set.

This section covers the MIPS instruction set. This section covers the MIPS instruction set. 1 + I am going to break down the instructions into two types. + a machine instruction which is directly defined in the MIPS architecture and has a one to one

More information

EECS 570 Final Exam - SOLUTIONS Winter 2015

EECS 570 Final Exam - SOLUTIONS Winter 2015 EECS 570 Final Exam - SOLUTIONS Winter 2015 Name: unique name: Sign the honor code: I have neither given nor received aid on this exam nor observed anyone else doing so. Scores: # Points 1 / 21 2 / 32

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology

More information

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

Parallel Programming: Background Information

Parallel Programming: Background Information 1 Parallel Programming: Background Information Mike Bailey mjb@cs.oregonstate.edu parallel.background.pptx Three Reasons to Study Parallel Programming 2 1. Increase performance: do more work in the same

More information

Parallel Programming: Background Information

Parallel Programming: Background Information 1 Parallel Programming: Background Information Mike Bailey mjb@cs.oregonstate.edu parallel.background.pptx Three Reasons to Study Parallel Programming 2 1. Increase performance: do more work in the same

More information

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015 Lecture 10: Cache Coherence: Part I Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Marble House The Knife (Silent Shout) Before starting The Knife, we were working

More information

Chapter 5. Thread-Level Parallelism

Chapter 5. Thread-Level Parallelism Chapter 5 Thread-Level Parallelism Instructor: Josep Torrellas CS433 Copyright Josep Torrellas 1999, 2001, 2002, 2013 1 Progress Towards Multiprocessors + Rate of speed growth in uniprocessors saturated

More information

Multiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types

Multiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types Chapter 5 Multiprocessor Cache Coherence Thread-Level Parallelism 1: read 2: read 3: write??? 1 4 From ILP to TLP Memory System is Coherent If... ILP became inefficient in terms of Power consumption Silicon

More information

MIPS Coherence Protocol Specification

MIPS Coherence Protocol Specification MIPS Coherence Protocol Specification Document Number: MD00605 Revision 01.01 September 14, 2015 Public. This publication contains proprietary information which is subject to change without notice and

More information

Potential violations of Serializability: Example 1

Potential violations of Serializability: Example 1 CSCE 6610:Advanced Computer Architecture Review New Amdahl s law A possible idea for a term project Explore my idea about changing frequency based on serial fraction to maintain fixed energy or keep same

More information

Designing Memory Consistency Models for. Shared-Memory Multiprocessors. Sarita V. Adve

Designing Memory Consistency Models for. Shared-Memory Multiprocessors. Sarita V. Adve Designing Memory Consistency Models for Shared-Memory Multiprocessors Sarita V. Adve Computer Sciences Department University of Wisconsin-Madison The Big Picture Assumptions Parallel processing important

More information

ROEVER ENGINEERING COLLEGE DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

ROEVER ENGINEERING COLLEGE DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING ROEVER ENGINEERING COLLEGE DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING 16 MARKS CS 2354 ADVANCE COMPUTER ARCHITECTURE 1. Explain the concepts and challenges of Instruction-Level Parallelism. Define

More information

Adapted from David Patterson s slides on graduate computer architecture

Adapted from David Patterson s slides on graduate computer architecture Mei Yang Adapted from David Patterson s slides on graduate computer architecture Introduction Ten Advanced Optimizations of Cache Performance Memory Technology and Optimizations Virtual Memory and Virtual

More information

Chapter 5. Multiprocessors and Thread-Level Parallelism

Chapter 5. Multiprocessors and Thread-Level Parallelism Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model

More information

Weak memory models. Mai Thuong Tran. PMA Group, University of Oslo, Norway. 31 Oct. 2014

Weak memory models. Mai Thuong Tran. PMA Group, University of Oslo, Norway. 31 Oct. 2014 Weak memory models Mai Thuong Tran PMA Group, University of Oslo, Norway 31 Oct. 2014 Overview 1 Introduction Hardware architectures Compiler optimizations Sequential consistency 2 Weak memory models TSO

More information

Computer Systems Architecture I. CSE 560M Lecture 18 Guest Lecturer: Shakir James

Computer Systems Architecture I. CSE 560M Lecture 18 Guest Lecturer: Shakir James Computer Systems Architecture I CSE 560M Lecture 18 Guest Lecturer: Shakir James Plan for Today Announcements No class meeting on Monday, meet in project groups Project demos < 2 weeks, Nov 23 rd Questions

More information

Review: Creating a Parallel Program. Programming for Performance

Review: Creating a Parallel Program. Programming for Performance Review: Creating a Parallel Program Can be done by programmer, compiler, run-time system or OS Steps for creating parallel program Decomposition Assignment of tasks to processes Orchestration Mapping (C)

More information

C11 Compiler Mappings: Exploration, Verification, and Counterexamples

C11 Compiler Mappings: Exploration, Verification, and Counterexamples C11 Compiler Mappings: Exploration, Verification, and Counterexamples Yatin Manerkar Princeton University manerkar@princeton.edu http://check.cs.princeton.edu November 22 nd, 2016 1 Compilers Must Uphold

More information

DISTRIBUTED COMPUTER SYSTEMS

DISTRIBUTED COMPUTER SYSTEMS DISTRIBUTED COMPUTER SYSTEMS CONSISTENCY AND REPLICATION CONSISTENCY MODELS Dr. Jack Lange Computer Science Department University of Pittsburgh Fall 2015 Consistency Models Background Replication Motivation

More information

Memory Cache. Memory Locality. Cache Organization -- Overview L1 Data Cache

Memory Cache. Memory Locality. Cache Organization -- Overview L1 Data Cache Memory Cache Memory Locality cpu cache memory Memory hierarchies take advantage of memory locality. Memory locality is the principle that future memory accesses are near past accesses. Memory hierarchies

More information