Memory Consistency Models


Calcolatori Elettronici e Sistemi Operativi

Sources of out-of-order memory accesses
- Compiler optimizations
- Store buffers: FIFOs for uncommitted writes
- Invalidate queues (for cache coherency)
- Data prefetch
- Banked cache architectures
- Networked interconnects
- Non-uniform memory access (NUMA) architectures: different accesses to memory have different latencies

Compiler optimizations
The language semantics does not consider:
1. side-effects of memory accesses
2. multi-threading
3. asynchronous execution
The compiler can therefore reorder instructions and eliminate operations. Some compiler optimizations can be controlled with the volatile qualifier.

Example 1: the whole loop is folded into a shift; this function always returns 8*x, and the compiler can optimize the code accordingly.

    /* C code */                    @ ARM assembly
    int add3 (int x) {              add3:
        int i;                          mov r0, r0, asl #3
        for (i = 0; i < 3; i++)         mov pc, lr
            x += x;
        return x;
    }

Example 2: the result does not depend on the access order, so the compiler is free to change the order of the loads.

    /* C code */                    @ ARM assembly
    int add_vals (int *vec) {       add_vals:
        int y = vec[1];                 ldr r3, [r0]
        y += vec[0];                    ldr r0, [r0, #4]
        return y;                       add r0, r0, r3
    }                                   mov pc, lr

Example 3: the compiler does not need to consider that someone else can change *ptr, so the load is hoisted out of the loop and the wait becomes an infinite loop.

    /* C code */                    @ ARM assembly
    void waitval (int *ptr) {       waitval:
        while (*ptr == 0)               ldr r3, [r0]
            continue;                   cmp r3, #0
    }                                   movne pc, lr
                                    loop:
                                        b loop

Volatile
Semantics (this is the C/C++ semantics; the Java semantics differs, as it also implies atomicity):
- Each read from a volatile variable requires an actual load and may return a different value; compiler optimizations cannot merge reads from the same address.
- Each write to a volatile variable requires an actual store; compiler optimizations cannot cancel stores.
- Volatile is required to access the I/O address space.

Examples:

    int *ptr;                              /* pointer to int */
    volatile int *ptr_to_vol;              /* pointer to volatile int */
    int *volatile vol_ptr;                 /* volatile pointer to int */
    volatile int *volatile vol_ptr_to_vol; /* volatile pointer to volatile int */

Beware the semantics:
- a = *ptr_to_vol;  is a volatile access
- a = *vol_ptr;     is not a volatile access
Inconsistent qualification causes errors.

Volatile does not enforce ordering with non-volatile accesses:

    int A;
    volatile int B;
    A = 1; /* these two lines can be */
    B = 1; /* reordered by the compiler */

Volatile does not enforce the order in which accesses are actually performed:

    volatile int A;
    volatile int B;
    A = 1; /* these two lines won't be reordered by the compiler, */
    B = 1; /* but the accesses can still be reordered by the HW */

Volatile does not mean atomic:

    volatile int X;
    X = 1; /* this assignment can be interrupted or preempted */

Compiler memory barrier: implementation in GCC

    asm volatile ("" : : : "memory");

This inline assembly code:
1. contains no instructions
2. may read or write all of RAM
Hence the compiler is not allowed to reorder memory accesses around the barrier, in either direction.

Store buffer
- Records a store in the buffer until the write is actually performed.
- Hides memory latency (cache latency, cache miss on write): the processor can execute other instructions meanwhile.
- Data dependency (RAW): either wait until the write is actually performed in memory or in cache, or read the data directly from the store buffer (store forwarding).
- Data dependency (WAW): either add a new entry in the store buffer, or replace the previous write in the store buffer.

Example
P1 executes: 1) store A; 2) store B. A and B are shared with P2: A is in P2's cache, B is in both caches. Each processor has a store buffer and a cache, connected by the interconnect.

Execution:
1. store A: cache miss on P1; the updated value is written in the store buffer and a read request is sent (the data will come from P2's cache). Several clock cycles are needed, but P1 can proceed (the new value is in the store buffer); P2 does not see the write.
2. store B: cache hit; the data is written in P1's cache and a coherence message is sent to P2; P2 sees the write.
3. A is loaded in P1's cache.
4. A is updated in P1's cache; a coherence message is sent to P2; P2 sees the write.

P2 sees the store on B first, then the store on A.

Consequence
Initially A=0 and B=0; A and B are volatile.

    P1:                    P2:
    A = 1;                 while (B == 0) continue;
    B = 1;                 assert(A == 1);  /* this can fail! */

If P2 sees the stores performed by P1 in reverse order, the assertion fails.

Invalidate queue
Cache coherency can require cache line invalidation: a processor sends an invalidate message to another one, and the target processor must invalidate the line. An invalidate queue stores the invalidate requests while the cache is busy; the line is invalidated when the cache is ready.

Data prefetch
The processor can read data before the actual load instruction, to hide memory latency: data is preloaded in the cache.

Banked cache architectures
Caches are split in several banks: while accesses to busy banks must wait, accesses to idle banks can proceed.

Speculative execution
The processor executes the instructions after a branch before the branch itself.

Definitions
- Program order: the order of operations as specified by the software.
- Execution order: the order of operations as executed by a processor.
- Perceived order: the order of operations as seen by processors and memories.
- Memory consistency model: the rules that specify the allowed behaviors of programs in terms of memory accesses. The rules are order restrictions.

Performed accesses
- A write by processor i is performed with respect to processor k when a read issued by k to the same address returns the value stored by i.
- A read by processor i is performed with respect to processor k when a write issued by k to the same address can no longer affect the value read by i.

Globally performed
- An access is globally performed when it is performed with respect to all processors.
- A write is globally performed when its modification has been propagated to all processors.
- A read is globally performed when the value it returns is bound and the write that wrote this value is globally performed.

Memory consistency models
Rules on access ordering can regard:
- location (address of the access)
- direction (read, write, read-write)
- value
- causality (the behavior of an access depends on the behavior of another one)
- category of the accesses

Uniform consistency models: the rules do not concern the category of the accesses.
Hybrid consistency models: the category of the accesses matters (shared / private, synchronizing / not synchronizing).

Uniform consistency models

Local consistency (LC)
- Each processor sees its own accesses in program order.
- There is no restriction on the order of the accesses seen by other processors: different processors may see different orders.
- The weakest consistency model: it only guarantees sequential behavior on uniprocessor systems; it is not usable to program in parallel environments.

Sequential consistency (SC)
- There is a global total order of all memory accesses (of all processors); all processors agree on such a global order, which can change at each run.
- Each processor sees its own accesses in program order.
- It is the model implied by a cacheless system, with a single memory device, with processors unable to perform out-of-order execution.
- It offsets many architectural optimizations, but it is easy to use.

SC: consequence
Initially A=0 and B=0; A and B are volatile.

    P1: 1a. A = 1;   2a. B = 1;
    P2: 1b. while (B==0) continue;   2b. assert(A==1);

The assertion cannot fail. A possible history (notation: W(A)1 = write value 1 to variable A, R(B)0 = read value 0 from B, time flowing left to right):

    P1: W(A)1  W(B)1
    P2:        R(B)0  R(B)0  R(B)1  R(A)1

Sequential consistency: sufficient conditions
In a cache-based system, with no constraint on the interconnect:
- all processors issue their accesses in program order;
- a processor does not issue an access until its previous accesses have been globally performed (this requires waiting for acknowledgements from the other processors).
Then each processor sees its own accesses in program order, all processors agree on a global order, and it is easy to enforce the order between accesses from different processors (e.g., access 1a is before access 2b in the previous example). The price: no out-of-order execution, and a write hit on the cache must wait for the answers. This offsets many architectural optimizations, but the model is easy to use.

Comparing consistency models
The union of all the perceived orders can be valid or not for a given consistency model.
- Consistency model A is stronger than consistency model B (and B is weaker than A) if each execution valid on A is also valid on B.
- If there exist some execution E1 valid on A and not valid on B, and some execution E2 valid on B and not valid on A, then A and B are incomparable.

Example: P1 executes I1: store A; I2: store B. P2 executes I3: load A; I4: load B.
1) P2 sees I1, I3, I4, I2: a valid execution for sequential consistency (total order implied: I1, I3, I4, I2).
2) P2 sees I3, I2, I4, I1: an invalid execution for sequential consistency (there is no unique total order), but a valid execution for local consistency (P1 and P2 each see their own accesses in order).

Uniform consistency models

Causal consistency (Causal)
All processors agree on the order of causally related events; causally unrelated events can be observed in different orders.

Example: X is initially 0.
- event 1: a processor writes 1 to X
- event 2: another processor reads X and obtains 1
- event 3: that processor writes 2 to X
Hence event 1 happened before event 2, and event 2 happened before event 3; all processors must agree on such an ordering.

Example: X is initially 0.
- event 1: a processor reads X and obtains 0
- event 2: that processor writes 1 to X
- event 3: another processor reads X and obtains 1
Hence event 1 happened before event 2, and event 2 happened before event 3; all processors must agree on such an ordering.

Causal consistency: example 1
Initially X=0.

    P1: 1a: X = 1;   2a: X = 3;
    P2: 1b: A = X;   2b: X = 2;
    P3: 1c: B = X;   2c: D = X;   3c: F = X;
    P4: 1d: C = X;   2d: E = X;   3d: G = X;

Result: A=1; B=1; C=1; D=3; E=2; F=2; G=3.

History:

    P1: W(X)1  W(X)3
    P2: R(X)1  W(X)2
    P3: R(X)1  R(X)3  R(X)2
    P4: R(X)1  R(X)2  R(X)3

For P3: 2a < 2b; for P4: 2b < 2a. A single global order is not possible (2a vs 2b): the execution is not sequentially consistent. But 2a and 2b are not causally related, and there are no contradictions on causal dependencies: the execution is causally consistent.

Causal consistency: example 2
Initially X=0, Y=0.

    P1: 1a: X = 1;   2a: X = 2;
    P2: 1b: A = X;   2b: Y = 3;
    P3: 1c: B = Y;   2c: C = X;

Result: A=2; B=3; C=1.

History:

    P1: W(X)1  W(X)2
    P2: R(X)2  W(Y)3
    P3: R(Y)3  R(X)1

For P2: 2a < 2b (since A=2); for P3: 2b < 2a (since B=3 and C=1). P2 and P3 disagree on the order between 2a and 2b, but 2a and 2b are causally related (constraint due to A=2): the execution is not causally consistent.

Causal consistency: example 3
Initially X=0, Y=0; same program as example 2:

    P1: 1a: X = 1;   2a: X = 2;
    P2: 1b: A = X;   2b: Y = 3;
    P3: 1c: B = Y;   2c: C = X;

Result: A=2; B=3; C=2. There is no disagreement on causally related accesses: the execution is causally consistent.

Uniform consistency models

PRAM (pipelined RAM) consistency (PRAM)
Writes performed by a single process are seen by all other processes in the order in which they were issued; the perceived order of the complete set of writes can be different for each process.

Cache consistency (CC)
All writes to the same memory location are performed in some sequential order: all processes see the same order of writes for each location (but the order of the complete set of writes can differ).

PRAM consistency: example
Initially X=0.

    P1: 1a: X = 1;
    P2: 1b: A = X;   2b: X = 2;
    P3: 1c: B = X;   2c: D = X;
    P4: 1d: C = X;   2d: E = X;

Result: A=1; B=1; C=2; D=2; E=1.

History:

    P1: W(X)1
    P2: R(X)1  W(X)2
    P3: R(X)1  R(X)2
    P4: R(X)2  R(X)1

All processors see the same order for the writes of each single process (trivially, since each process issues at most one write, 1a or 2b): the execution is PRAM consistent. A single global order is not possible: the execution is not sequentially consistent. P3 and P4 do not agree on the causal relation between 1a and 2b: the execution is not causally consistent.

Cache consistency: example
Initially X=0, Y=0.

    P1: 1a: X = 1;   2a: A = Y;
    P2: 1b: Y = 1;   2b: B = X;

Result: A=0; B=0.

History:

    P1: W(X)1  R(Y)0
    P2: W(Y)1  R(X)0

For P1: 1a < 1b (since A=0); for P2: 1b < 1a (since B=0). A single global order is not possible: the execution is not sequentially consistent. However, all processors see the same order for the writes on X (only 1a) and on Y (only 1b), trivially: the execution is cache consistent.

Uniform consistency models

Processor consistency (PC)
PRAM consistent and cache consistent.
- The tie-breaker (Peterson's) algorithm executes correctly under processor consistency; the bakery algorithm needs sequential consistency.
- Processor consistent machines are easier to build than sequentially consistent systems.

Processor consistency: example
Initially X=0, Y=0.

    P1: 1a: X = 1;   2a: c = 1;   3a: A = Y;
    P2: 1b: Y = 1;   2b: c = 2;   3b: B = X;

Result: A=0; B=0.

History:

    P1: W(X)1  W(c)1  R(Y)0
    P2: W(Y)1  W(c)2  R(X)0

A=0 implies, for P1, 3a < 1b; B=0 implies, for P2, 3b < 1a. Hence:
- for P1: 1a < 2a < 3a < 1b < 2b
- for P2: 1b < 2b < 3b < 1a < 2a
The two processors see different orders for the writes on c: the execution is not processor consistent.

Uniform consistency models

Slow consistency
- All processors agree on the order of the observed writes to each location by a single processor.
- Writes by a process must be immediately visible to the process itself.
- It models a system where writes propagate slowly to memory and to the other processors.

Uniform consistency models: from stronger to weaker
- Sequential consistency
- Causal consistency
- Processor consistency
- PRAM consistency
- Cache consistency
- Slow consistency
- Local consistency

SC vs PC: example
Initially A=0 and B=0; A and B are volatile.

    P1: 1a. A = 1;   2a. X = B;
    P2: 1b. B = 1;   2b. Y = A;

On sequentially consistent systems, X==0 and Y==0 is not possible; on processor consistent systems, X==0 and Y==0 is possible.

SC vs PC: example
Initially A=0, B=0, C=0, D=0, E=0; A, B, C, D, E are volatile.

    P1: 1a. A = 1;   2a. B = D;   3a. C = 1;
    P2: 1b. D = 1;   2b. E = A;   3b. while (C==0) continue;   4b. assert(B==1 || E==1);

The assertion 4b cannot fail on sequentially consistent systems, but can fail on processor consistent systems.

Consistency models and synchronization
For 2 processes, many synchronization patterns work in the same way on processor consistent systems as on sequentially consistent systems. It is possible to construct a situation in which processor ordering fails, but there are few chances that such code is somewhat useful.

Signaling
Initially A=0 and B=0; A and B are volatile.

    P1: 1a. A = 1;   2a. B = 1;
    P2: 1b. while (B==0) continue;   2b. assert(A==1);

The assertion cannot fail on sequentially consistent and on processor consistent systems:

    P1: W(A)1  W(B)1
    P2:        R(B)0  R(B)0  R(B)1  R(A)1

Barrier
Initially A=0, B=0, C=0, D=0; A, B, C, D are volatile.

    P1: 1a. A = 1;   2a. B = 1;   3a. while (D==0) continue;   4a. assert(A==1 && C==1);
    P2: 1b. C = 1;   2b. D = 1;   3b. while (B==0) continue;   4b. assert(A==1 && C==1);

The assertions cannot fail on sequentially consistent and on processor consistent systems.

Consistency models and synchronization
For 3 or more processes, there are simple synchronization patterns that work on sequentially consistent systems but not on processor consistent systems. However, it is easy to introduce small changes to obtain a correct synchronization even on processor consistent systems.

Signaling with 3 processes
Initially A=0, B=0, C=0; A, B, C are volatile.

    P1: 1a. A = 1;   2a. B = 1;
    P2: 1b. while (B==0) continue;   2b. C = 1;
    P3: 1c. while (C==0) continue;   2c. assert(A==1);

The assertion 2c cannot fail on sequentially consistent systems, but can fail on processor consistent systems: W(A)1 and W(C)1 are performed on different variables by different cores, so on PC systems no order is enforced between them.

Signaling exploiting cache coherency
Initially A=0 and B=0; A and B are volatile.

    P1: 1a. A = 1;   2a. B = 1;
    P2: 1b. while (B==0) continue;   2b. B = 2;
    P3: 1c. while (B!=2) continue;   2c. assert(A==1);

The assertion 2c cannot fail on processor consistent systems: W(B)1 and W(B)2 are performed by different cores on the same variable, so cache coherency enforces the access ordering.

Signaling exploiting cache coherency
Initially A=0, B=0, C=0; A, B, C are volatile.

    P1: 1a. A = 1;   2a. B = 1;
    P2: 1b. while (B==0) continue;   2b. B = 1;   3b. C = 1;
    P3: 1c. while (C==0) continue;   2c. assert(A==1);

The assertion 2c cannot fail on processor consistent systems: W(B)1 (2a) and W(B)1 (2b) are performed by different cores on the same variable, so cache coherency enforces the access ordering; W(B)1 and W(C)1 are performed by the same processor, so their order is enforced by PRAM consistency.

Hybrid consistency models
- Weak consistency (WC)
- Release consistency (RC)
- Entry consistency (EC)
- Others: scope consistency, location consistency, DAG consistency

Weak consistency (WC)
Two types of accesses: not synchronizing (read, write, read-write) and synchronizing.
- Accesses to synchronization variables are sequentially consistent.
- No access to a synchronization variable is issued by a processor before all its previous data accesses have been performed.
- No access is issued by a processor before a previous access to a synchronization variable has been performed.
- Standard reads and writes obey local consistency.
A synchronization access works as a fence.

Weak consistency: example

    P1: 1a: W(X)1        2a: sync_w(Y)1
    P2: 1b: sync_r(Y)1   2b: R(X)1

1a < 2a: they cannot be reordered, since 2a is a synchronizing access. 1b < 2b: they cannot be reordered, since 1b is a synchronizing access. Y=1 implies 2a < 1b. Hence, in 2b, X must be 1.

Data race
- Conflicting accesses: accesses to the same address from different processors, where at least one is a write.
- The access order can be enforced by the consistency model (SC) or by using a synchronization access.
- Data race: two conflicting accesses with no ordering imposed.

SC-DRF
A program executing on a weakly consistent system appears sequentially consistent if:
- there are no data races (i.e., no competing accesses);
- synchronization is visible to the memory system.
Sequential consistency for data-race-free programs.

Hybrid consistency models

Release consistency (RC)
Two kinds of synchronization accesses:
- acquire: only delays future accesses; often associated to a read (load_acquire);
- release: only waits for previous accesses; often associated to a write (store_release).
Synchronization accesses are processor consistent. Acquire and release act as semi-permeable barriers.

Acquire and release

    1: access1
    2: access2
    3: acquire
    4: access3
    5: access4
    6: release
    7: access5
    8: access6

- access1 and access2 can be reordered before and after the acquire, but not after the release;
- access3 and access4 can be reordered only between the acquire and the release;
- access5 and access6 can be reordered before and after the release, but not before the acquire.

Release consistency: example

    P1: 1a: W(X)1       2a: W_rel(Y)1
    P2: 1b: R_acq(Y)1   2b: R(X)1

1a < 2a: they cannot be reordered, since 2a is a release access. 1b < 2b: they cannot be reordered, since 1b is an acquire access. Y=1 implies 2a < 1b. Hence, in 2b, X must be 1.

Hybrid consistency models

Entry consistency (EC)
Similar to RC; the differences:
- each shared variable is associated to a synchronizing variable, and the association can change dynamically under program control;
- a synchronizing variable is a lock or a barrier;
- acquire accesses can be exclusive or non-exclusive.

Synchronizing accesses
A synchronizing access plays two roles: (1) a reordering constraint and (2) a memory access.
- Full fences: weak consistency.
- Release and acquire: release consistency.

Memory barriers
Synchronizing accesses without the memory-access part: a mechanism to control out-of-order execution. They are instructions that prevent memory access reordering:
- read barriers: prevent reordering of reads (e.g., wait until the invalidate queue is empty);
- write barriers: prevent reordering of writes (e.g., wait until the store buffer is empty);
- full barriers: act on all accesses.

Example: barrier on the writer only.

    P1: 1a: W(X)1   2a: barrier 1   3a: W(Y)1
    P2: 1b: R(Y)1   2b: R(X)?

1a and 3a are executed in order, but 1b and 2b can be executed out of order: for P2 it is the same as if 1a and 3a could be executed out of order. Y=1 implies 3a < 1b; still, in 2b, X can be either 0 or 1.

Example: barriers on both sides.

    P1: 1a: W(X)1   2a: barrier 1   3a: W(Y)1
    P2: 1b: R(Y)1   2b: barrier 2   3b: R(X)1

For P1, barrier 1 gives 1a < 2a < 3a (program order); for P2, barrier 2 gives 1b < 2b < 3b (program order). Y=1 implies 3a < 1b. Hence 1a < 2a < 3a < 1b < 2b < 3b: in 3b, X must be 1.

CPUs' memory consistency models
- Processors implement out-of-order execution (store buffer, cache coherency, ...).
- CPU specifications provide rules about the possible reorderings; different memory areas can have different rules.
- ISAs provide instructions to control reordering: barriers.

Memory consistency model: Alpha
There is a partial order, BEFORE (or <=): a global relation (memory order). Processors can perform accesses out of order. Accesses: instruction fetch (IF), read (R), write (W); when addresses overlap, order is maintained for IF-IF, IF-W, R-R, R-W, W-W. The I-cache and the pipeline are not coherent. Three kinds of barriers:
- MB: forces no reordering between reads and writes;
- WMB: forces no reordering between writes;
- IMB: forces no reordering for reads, writes and instruction fetches.

Order is not enforced for data dependency. Example, with global_ptr initially NULL:

    P1:                          P2:
    1a: ptr = malloc(...);       1b: while (global_ptr==NULL) continue;
    2a: ptr->key = val;          2b: mb
    3a: ptr->data = data;        3b: myval = global_ptr->key;
    4a: wmb                      4b: mydata = global_ptr->data;
    5a: global_ptr = ptr;

There is a data dependency from 1b to 3b, but the addresses do not overlap: a barrier is required on the reader side too.

Memory consistency model: ARMv7
- No global memory order; accesses to a single address are seen in the same order by all processors (cache coherency).
- Instruction fetches, data reads and data writes can be performed out of order; data-dependent loads are not reordered.
- The I-cache and the pipeline are not coherent.
Three kinds of barriers:
- DMB (Data Memory Barrier): all specified memory accesses before the barrier must be completed before any (specified) memory access after the barrier is started;
- DSB (Data Synchronization Barrier): all specified memory accesses before the barrier must be completed before any instruction after the barrier is started;
- ISB (Instruction Synchronization Barrier): flushes the pipeline.
The DMB, DSB and ISB instructions were added in ARMv7. Previous versions use CP15 operations to implement barriers: in ARMv6 the barrier operations are always defined; in ARMv4 and ARMv5 the barrier operations may not exist.

Memory consistency model: ARMv7 memory types

Normal memory
- Three levels of shareability: Non-shareable (Normal memory used by only a single processor), Inner Shareable (Normal memory shared between several processors), Outer Shareable (Normal memory shared between processors and devices).
- Cacheability: Non-cacheable; Write-Through Cacheable; Write-Back Write-Allocate Cacheable; Write-Back no Write-Allocate Cacheable.

Device memory
- Accesses are strongly ordered: all memory accesses occur in program order.
- Shareability: Shareable (memory-mapped peripherals shared by several processors) or Non-shareable (memory-mapped peripherals used only by a single processor).
- Cacheability: Non-cacheable. A write to Device memory is permitted to complete before it reaches the target.

Strongly-ordered memory
- Accesses are strongly ordered.
- Shareability: all Strongly-ordered regions are assumed to be Shareable.
- Cacheability: Non-cacheable. A write to Strongly-ordered memory can complete only when it reaches the target.

Memory consistency model: ARMv7 barrier options
- DMB (or DSB) sy: barrier for all memory accesses of the Outer Shareable domain (full system barrier).
- DMB (or DSB) st: barrier for writes of the Outer Shareable domain.
- DMB (or DSB) ish (alias sh): barrier for all memory accesses of the Inner Shareable domain.
- DMB (or DSB) ishst (alias shst): barrier for writes of the Inner Shareable domain.
- DMB (or DSB) nsh (alias un): barrier for all memory accesses of the Non-shareable domain.
- DMB (or DSB) nshst (alias unst): barrier for writes of the Non-shareable domain.

Memory consistency model: MIPS32
Barriers are optional. Two families:
- Completion barriers: all specified memory accesses before the barrier must be completed (globally performed) before the memory accesses after the barrier are started. SYNC (or SYNC 0) acts on R and W, and is required in all implementations.
- Ordering barriers: all specified memory accesses before the barrier must be ordered before the specified memory accesses after the barrier:
  - SYNC_WMB (or SYNC 4): acts on W
  - SYNC_MB (or SYNC 16): acts on R and W
  - SYNC_ACQUIRE (or SYNC 17): acts on R (before) and R and W (after)
  - SYNC_RELEASE (or SYNC 18): acts on R and W (before) and W (after)
  - SYNC_RMB (or SYNC 19): acts on R

Instruction cache: SYNCI (Synchronize Caches to Make Instruction Writes Effective) updates an I-cache line so that it can be used after a code change.

Memory consistency model: IA-32
Memory areas can be:
- UC (uncacheable): strong ordering is enforced; useful for memory-mapped devices.
- WC (write-combining): writes are combined in special buffers, coherence is not enforced; useful for framebuffers (the order of writes is not relevant).
- WB (cacheable, write-back policy): coherence is enforced.
- WT (cacheable, write-through policy): coherence is enforced; useful for devices that access memory (DMA-capable devices) without implementing cache coherency protocols.

For WB and WT:
- there is a global memory ordering;
- order is maintained for R-R, R-W, W-W;
- order is not maintained for W-R: the read can obtain data from the forwarding path;
- some streaming store instructions (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, MOVNTPD) allow W-W reordering, and string operations allow W-W reordering.

For WB memory areas:
- Individual processors use the same ordering principles as in a single-processor system.
- Writes by a single processor are observed in the same order by all processors.
- Writes from an individual processor are NOT ordered with respect to the writes from other processors.
- Memory ordering obeys causality (memory ordering respects transitive visibility).
- Any two stores are seen in a consistent order by processors other than those performing the stores.
- Locked instructions have a total order.

Three kinds of barriers:
- MFENCE: serializes load and store operations; guarantees that all loads and stores specified before the fence are globally observable prior to any loads or stores carried out after the fence.
- LFENCE: serializes load operations; guarantees ordering between two loads and prevents speculative loads from passing the load fence.
- SFENCE: serializes store operations; guarantees that every store instruction that precedes the SFENCE in program order becomes globally visible before any store instruction that follows the SFENCE.

Memory consistency models and the OS
The OS must provide primitives to enforce access ordering:
- processor vs processor accesses: not required on uniprocessor systems;
- processor vs device accesses: required even on uniprocessor systems.
Multi-architecture issue: portable code must use the weakest model among all supported architectures. The Linux weakest model is the Alpha one, whose consistency model does not guarantee ordering between data-dependent accesses.

Linux memory barriers
- Compiler barrier: barrier() prevents compiler reordering of accesses; the processor can still perform out-of-order accesses (it is a compiler directive, no instruction is emitted).
- Processor vs processor barriers: smp_mb() (full memory barrier), smp_rmb() (memory barrier for reads), smp_wmb() (memory barrier for writes), smp_read_barrier_depends() (memory barrier for data dependency).
- Processor vs anything barriers: mb() (full memory barrier), rmb() (memory barrier for reads), wmb() (memory barrier for writes), read_barrier_depends() (memory barrier for data dependency).

Linux memory barriers: implementation examples

On uniprocessor systems the smp_* barriers reduce to compiler barriers, while the mb() family maps to the architecture instructions:

                               Alpha    ARMv7     MIPS32   IA-32
    mb()                       mb       dsb       sync     mfence
    rmb()                      mb       dsb       sync     lfence
    wmb()                      wmb      dsb st    sync     sfence
    read_barrier_depends()     mb       (no-op)   (no-op)  (no-op)

On multiprocessor systems the smp_* barriers also map to instructions:

                                   Alpha    ARMv7       MIPS32   IA-32
    smp_mb()                       mb       dmb ish     sync     mfence
    smp_rmb()                      mb       dmb ish     sync     lfence
    smp_wmb()                      wmb      dmb ishst   sync     sfence
    smp_read_barrier_depends()     mb       (no-op)     (no-op)  (no-op)


More information

CS533 Concepts of Operating Systems. Jonathan Walpole

CS533 Concepts of Operating Systems. Jonathan Walpole CS533 Concepts of Operating Systems Jonathan Walpole Shared Memory Consistency Models: A Tutorial Outline Concurrent programming on a uniprocessor The effect of optimizations on a uniprocessor The effect

More information

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013 Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the

More information

ARMv8-A Software Development

ARMv8-A Software Development ARMv8-A Software Development Course Description ARMv8-A software development is a 4 days ARM official course. The course goes into great depth and provides all necessary know-how to develop software for

More information

Data-Centric Consistency Models. The general organization of a logical data store, physically distributed and replicated across multiple processes.

Data-Centric Consistency Models. The general organization of a logical data store, physically distributed and replicated across multiple processes. Data-Centric Consistency Models The general organization of a logical data store, physically distributed and replicated across multiple processes. Consistency models The scenario we will be studying: Some

More information

Introduction. Coherency vs Consistency. Lec-11. Multi-Threading Concepts: Coherency, Consistency, and Synchronization

Introduction. Coherency vs Consistency. Lec-11. Multi-Threading Concepts: Coherency, Consistency, and Synchronization Lec-11 Multi-Threading Concepts: Coherency, Consistency, and Synchronization Coherency vs Consistency Memory coherency and consistency are major concerns in the design of shared-memory systems. Consistency

More information

Shared Memory Consistency Models: A Tutorial

Shared Memory Consistency Models: A Tutorial Shared Memory Consistency Models: A Tutorial By Sarita Adve, Kourosh Gharachorloo WRL Research Report, 1995 Presentation: Vince Schuster Contents Overview Uniprocessor Review Sequential Consistency Relaxed

More information

Distributed Shared Memory and Memory Consistency Models

Distributed Shared Memory and Memory Consistency Models Lectures on distributed systems Distributed Shared Memory and Memory Consistency Models Paul Krzyzanowski Introduction With conventional SMP systems, multiple processors execute instructions in a single

More information

Systèmes d Exploitation Avancés

Systèmes d Exploitation Avancés Systèmes d Exploitation Avancés Instructor: Pablo Oliveira ISTY Instructor: Pablo Oliveira (ISTY) Systèmes d Exploitation Avancés 1 / 32 Review : Thread package API tid thread create (void (*fn) (void

More information

SELECTED TOPICS IN COHERENCE AND CONSISTENCY

SELECTED TOPICS IN COHERENCE AND CONSISTENCY SELECTED TOPICS IN COHERENCE AND CONSISTENCY Michel Dubois Ming-Hsieh Department of Electrical Engineering University of Southern California Los Angeles, CA90089-2562 dubois@usc.edu INTRODUCTION IN CHIP

More information

RELAXED CONSISTENCY 1

RELAXED CONSISTENCY 1 RELAXED CONSISTENCY 1 RELAXED CONSISTENCY Relaxed Consistency is a catch-all term for any MCM weaker than TSO GPUs have relaxed consistency (probably) 2 XC AXIOMS TABLE 5.5: XC Ordering Rules. An X Denotes

More information

Relaxed Memory-Consistency Models

Relaxed Memory-Consistency Models Relaxed Memory-Consistency Models Review. Why are relaxed memory-consistency models needed? How do relaxed MC models require programs to be changed? The safety net between operations whose order needs

More information

<atomic.h> weapons. Paolo Bonzini Red Hat, Inc. KVM Forum 2016

<atomic.h> weapons. Paolo Bonzini Red Hat, Inc. KVM Forum 2016 weapons Paolo Bonzini Red Hat, Inc. KVM Forum 2016 The real things Herb Sutter s talks atomic Weapons: The C++ Memory Model and Modern Hardware Lock-Free Programming (or, Juggling Razor Blades)

More information

DISTRIBUTED SHARED MEMORY

DISTRIBUTED SHARED MEMORY DISTRIBUTED SHARED MEMORY COMP 512 Spring 2018 Slide material adapted from Distributed Systems (Couloris, et. al), and Distr Op Systems and Algs (Chow and Johnson) 1 Outline What is DSM DSM Design and

More information

Example: The Dekker Algorithm on SMP Systems. Memory Consistency The Dekker Algorithm 43 / 54

Example: The Dekker Algorithm on SMP Systems. Memory Consistency The Dekker Algorithm 43 / 54 Example: The Dekker Algorithm on SMP Systems Memory Consistency The Dekker Algorithm 43 / 54 Using Memory Barriers: the Dekker Algorithm Mutual exclusion of two processes with busy waiting. //flag[] is

More information

Computer Architecture

Computer Architecture Jens Teubner Computer Architecture Summer 2016 1 Computer Architecture Jens Teubner, TU Dortmund jens.teubner@cs.tu-dortmund.de Summer 2016 Jens Teubner Computer Architecture Summer 2016 83 Part III Multi-Core

More information

An introduction to weak memory consistency and the out-of-thin-air problem

An introduction to weak memory consistency and the out-of-thin-air problem An introduction to weak memory consistency and the out-of-thin-air problem Viktor Vafeiadis Max Planck Institute for Software Systems (MPI-SWS) CONCUR, 7 September 2017 Sequential consistency 2 Sequential

More information

Parallel Computer Architecture Spring Memory Consistency. Nikos Bellas

Parallel Computer Architecture Spring Memory Consistency. Nikos Bellas Parallel Computer Architecture Spring 2018 Memory Consistency Nikos Bellas Computer and Communications Engineering Department University of Thessaly Parallel Computer Architecture 1 Coherence vs Consistency

More information

Portland State University ECE 588/688. Memory Consistency Models

Portland State University ECE 588/688. Memory Consistency Models Portland State University ECE 588/688 Memory Consistency Models Copyright by Alaa Alameldeen 2018 Memory Consistency Models Formal specification of how the memory system will appear to the programmer Places

More information

The Cache-Coherence Problem

The Cache-Coherence Problem The -Coherence Problem Lecture 12 (Chapter 6) 1 Outline Bus-based multiprocessors The cache-coherence problem Peterson s algorithm Coherence vs. consistency Shared vs. Distributed Memory What is the difference

More information

Chapter 8. Multiprocessors. In-Cheol Park Dept. of EE, KAIST

Chapter 8. Multiprocessors. In-Cheol Park Dept. of EE, KAIST Chapter 8. Multiprocessors In-Cheol Park Dept. of EE, KAIST Can the rapid rate of uniprocessor performance growth be sustained indefinitely? If the pace does slow down, multiprocessor architectures will

More information

Memory Consistency Models. CSE 451 James Bornholt

Memory Consistency Models. CSE 451 James Bornholt Memory Consistency Models CSE 451 James Bornholt Memory consistency models The short version: Multiprocessors reorder memory operations in unintuitive, scary ways This behavior is necessary for performance

More information

The Java Memory Model

The Java Memory Model The Java Memory Model The meaning of concurrency in Java Bartosz Milewski Plan of the talk Motivating example Sequential consistency Data races The DRF guarantee Causality Out-of-thin-air guarantee Implementation

More information

1. Memory technology & Hierarchy

1. Memory technology & Hierarchy 1. Memory technology & Hierarchy Back to caching... Advances in Computer Architecture Andy D. Pimentel Caches in a multi-processor context Dealing with concurrent updates Multiprocessor architecture In

More information

CMSC Computer Architecture Lecture 15: Memory Consistency and Synchronization. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 15: Memory Consistency and Synchronization. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 15: Memory Consistency and Synchronization Prof. Yanjing Li University of Chicago Administrative Stuff! Lab 5 (multi-core) " Basic requirements: out later today

More information

Sequential Consistency & TSO. Subtitle

Sequential Consistency & TSO. Subtitle Sequential Consistency & TSO Subtitle Core C1 Core C2 data = 0, 1lag SET S1: store data = NEW S2: store 1lag = SET L1: load r1 = 1lag B1: if (r1 SET) goto L1 L2: load r2 = data; Will r2 always be set to

More information

Foundations of the C++ Concurrency Memory Model

Foundations of the C++ Concurrency Memory Model Foundations of the C++ Concurrency Memory Model John Mellor-Crummey and Karthik Murthy Department of Computer Science Rice University johnmc@rice.edu COMP 522 27 September 2016 Before C++ Memory Model

More information

Administrivia. Review: Thread package API. Program B. Program A. Program C. Correct answers. Please ask questions on Google group

Administrivia. Review: Thread package API. Program B. Program A. Program C. Correct answers. Please ask questions on Google group Administrivia Please ask questions on Google group - We ve had several questions that would likely be of interest to more students x86 manuals linked on reference materials page Next week, guest lecturer:

More information

Computer Science 146. Computer Architecture

Computer Science 146. Computer Architecture Computer Architecture Spring 24 Harvard University Instructor: Prof. dbrooks@eecs.harvard.edu Lecture 2: More Multiprocessors Computation Taxonomy SISD SIMD MISD MIMD ILP Vectors, MM-ISAs Shared Memory

More information

SPIN, PETERSON AND BAKERY LOCKS

SPIN, PETERSON AND BAKERY LOCKS Concurrent Programs reasoning about their execution proving correctness start by considering execution sequences CS4021/4521 2018 jones@scss.tcd.ie School of Computer Science and Statistics, Trinity College

More information

General Purpose Processors

General Purpose Processors Calcolatori Elettronici e Sistemi Operativi Specifications Device that executes a program General Purpose Processors Program list of instructions Instructions are stored in an external memory Stored program

More information

Martin Kruliš, v

Martin Kruliš, v Martin Kruliš 1 Optimizations in General Code And Compilation Memory Considerations Parallelism Profiling And Optimization Examples 2 Premature optimization is the root of all evil. -- D. Knuth Our goal

More information

Symmetric Multiprocessors: Synchronization and Sequential Consistency

Symmetric Multiprocessors: Synchronization and Sequential Consistency Constructive Computer Architecture Symmetric Multiprocessors: Synchronization and Sequential Consistency Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology November

More information

Multiprocessors and Locking

Multiprocessors and Locking Types of Multiprocessors (MPs) Uniform memory-access (UMA) MP Access to all memory occurs at the same speed for all processors. Multiprocessors and Locking COMP9242 2008/S2 Week 12 Part 1 Non-uniform memory-access

More information

Coherence and Consistency

Coherence and Consistency Coherence and Consistency 30 The Meaning of Programs An ISA is a programming language To be useful, programs written in it must have meaning or semantics Any sequence of instructions must have a meaning.

More information

Handout 3 Multiprocessor and thread level parallelism

Handout 3 Multiprocessor and thread level parallelism Handout 3 Multiprocessor and thread level parallelism Outline Review MP Motivation SISD v SIMD (SIMT) v MIMD Centralized vs Distributed Memory MESI and Directory Cache Coherency Synchronization and Relaxed

More information

CS510 Advanced Topics in Concurrency. Jonathan Walpole

CS510 Advanced Topics in Concurrency. Jonathan Walpole CS510 Advanced Topics in Concurrency Jonathan Walpole Threads Cannot Be Implemented as a Library Reasoning About Programs What are the valid outcomes for this program? Is it valid for both r1 and r2 to

More information

Lecture 6: Lazy Transactional Memory. Topics: TM semantics and implementation details of lazy TM

Lecture 6: Lazy Transactional Memory. Topics: TM semantics and implementation details of lazy TM Lecture 6: Lazy Transactional Memory Topics: TM semantics and implementation details of lazy TM 1 Transactions Access to shared variables is encapsulated within transactions the system gives the illusion

More information

Cortex-A9 MPCore Software Development

Cortex-A9 MPCore Software Development Cortex-A9 MPCore Software Development Course Description Cortex-A9 MPCore software development is a 4 days ARM official course. The course goes into great depth and provides all necessary know-how to develop

More information

CSE502: Computer Architecture CSE 502: Computer Architecture

CSE502: Computer Architecture CSE 502: Computer Architecture CSE 502: Computer Architecture Shared-Memory Multi-Processors Shared-Memory Multiprocessors Multiple threads use shared memory (address space) SysV Shared Memory or Threads in software Communication implicit

More information

Page 1. Outline. Coherence vs. Consistency. Why Consistency is Important

Page 1. Outline. Coherence vs. Consistency. Why Consistency is Important Outline ECE 259 / CPS 221 Advanced Computer Architecture II (Parallel Computer Architecture) Memory Consistency Models Copyright 2006 Daniel J. Sorin Duke University Slides are derived from work by Sarita

More information

Beyond Sequential Consistency: Relaxed Memory Models

Beyond Sequential Consistency: Relaxed Memory Models 1 Beyond Sequential Consistency: Relaxed Memory Models Computer Science and Artificial Intelligence Lab M.I.T. Based on the material prepared by and Krste Asanovic 2 Beyond Sequential Consistency: Relaxed

More information

RA3 - Cortex-A15 implementation

RA3 - Cortex-A15 implementation Formation Cortex-A15 implementation: This course covers Cortex-A15 high-end ARM CPU - Processeurs ARM: ARM Cores RA3 - Cortex-A15 implementation This course covers Cortex-A15 high-end ARM CPU OBJECTIVES

More information

Shared Memory Consistency Models: A Tutorial

Shared Memory Consistency Models: A Tutorial Shared Memory Consistency Models: A Tutorial By Sarita Adve & Kourosh Gharachorloo Slides by Jim Larson Outline Concurrent programming on a uniprocessor The effect of optimizations on a uniprocessor The

More information

Hardware models: inventing a usable abstraction for Power/ARM. Friday, 11 January 13

Hardware models: inventing a usable abstraction for Power/ARM. Friday, 11 January 13 Hardware models: inventing a usable abstraction for Power/ARM 1 Hardware models: inventing a usable abstraction for Power/ARM Disclaimer: 1. ARM MM is analogous to Power MM all this is your next phone!

More information

Relaxed Memory: The Specification Design Space

Relaxed Memory: The Specification Design Space Relaxed Memory: The Specification Design Space Mark Batty University of Cambridge Fortran meeting, Delft, 25 June 2013 1 An ideal specification Unambiguous Easy to understand Sound w.r.t. experimentally

More information

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM UNIT III MULTIPROCESSORS AND THREAD LEVEL PARALLELISM 1. Symmetric Shared Memory Architectures: The Symmetric Shared Memory Architecture consists of several processors with a single physical memory shared

More information

Order Is A Lie. Are you sure you know how your code runs?

Order Is A Lie. Are you sure you know how your code runs? Order Is A Lie Are you sure you know how your code runs? Order in code is not respected by Compilers Processors (out-of-order execution) SMP Cache Management Understanding execution order in a multithreaded

More information

Lecture 29 Review" CPU time: the best metric" Be sure you understand CC, clock period" Common (and good) performance metrics"

Lecture 29 Review CPU time: the best metric Be sure you understand CC, clock period Common (and good) performance metrics Be sure you understand CC, clock period Lecture 29 Review Suggested reading: Everything Q1: D[8] = D[8] + RF[1] + RF[4] I[15]: Add R2, R1, R4 RF[1] = 4 I[16]: MOV R3, 8 RF[4] = 5 I[17]: Add R2, R2, R3

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

Advanced Caching Techniques (2) Department of Electrical Engineering Stanford University

Advanced Caching Techniques (2) Department of Electrical Engineering Stanford University Lecture 4: Advanced Caching Techniques (2) Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee282 Lecture 4-1 Announcements HW1 is out (handout and online) Due on 10/15

More information

Review: Thread package API

Review: Thread package API Review: Thread package API tid thread create (void (*fn) (void *), void *arg); - Create a new thread that calls fn with arg void thread exit (); void thread join (tid thread); The execution of multiple

More information

Lecture 12: Hardware/Software Trade-Offs. Topics: COMA, Software Virtual Memory

Lecture 12: Hardware/Software Trade-Offs. Topics: COMA, Software Virtual Memory Lecture 12: Hardware/Software Trade-Offs Topics: COMA, Software Virtual Memory 1 Capacity Limitations P P P P B1 C C B1 C C Mem Coherence Monitor Mem Coherence Monitor B2 In a Sequent NUMA-Q design above,

More information

Computer System Architecture Final Examination Spring 2002

Computer System Architecture Final Examination Spring 2002 Computer System Architecture 6.823 Final Examination Spring 2002 Name: This is an open book, open notes exam. 180 Minutes 22 Pages Notes: Not all questions are of equal difficulty, so look over the entire

More information

Relaxed Memory-Consistency Models

Relaxed Memory-Consistency Models Relaxed Memory-Consistency Models [ 9.1] In small multiprocessors, sequential consistency can be implemented relatively easily. However, this is not true for large multiprocessors. Why? This is not the

More information

Review of last lecture. Peer Quiz. DPHPC Overview. Goals of this lecture. Is coherence everything?

Review of last lecture. Peer Quiz. DPHPC Overview. Goals of this lecture. Is coherence everything? Review of last lecture Design of Parallel and High-Performance Computing Fall Lecture: Memory Models Motivational video: https://www.youtube.com/watch?v=twhtg4ous Instructor: Torsten Hoefler & Markus Püschel

More information

Cache Coherence in Bus-Based Shared Memory Multiprocessors

Cache Coherence in Bus-Based Shared Memory Multiprocessors Cache Coherence in Bus-Based Shared Memory Multiprocessors Shared Memory Multiprocessors Variations Cache Coherence in Shared Memory Multiprocessors A Coherent Memory System: Intuition Formal Definition

More information

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU , Spring 2013

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU , Spring 2013 Lecture 10: Cache Coherence: Part I Parallel Computer Architecture and Programming Cache design review Let s say your code executes int x = 1; (Assume for simplicity x corresponds to the address 0x12345604

More information

This section covers the MIPS instruction set.

This section covers the MIPS instruction set. This section covers the MIPS instruction set. 1 + I am going to break down the instructions into two types. + a machine instruction which is directly defined in the MIPS architecture and has a one to one

More information

EECS 570 Final Exam - SOLUTIONS Winter 2015

EECS 570 Final Exam - SOLUTIONS Winter 2015 EECS 570 Final Exam - SOLUTIONS Winter 2015 Name: unique name: Sign the honor code: I have neither given nor received aid on this exam nor observed anyone else doing so. Scores: # Points 1 / 21 2 / 32

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology

More information

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

Parallel Programming: Background Information

Parallel Programming: Background Information 1 Parallel Programming: Background Information Mike Bailey mjb@cs.oregonstate.edu parallel.background.pptx Three Reasons to Study Parallel Programming 2 1. Increase performance: do more work in the same

More information

Parallel Programming: Background Information

Parallel Programming: Background Information 1 Parallel Programming: Background Information Mike Bailey mjb@cs.oregonstate.edu parallel.background.pptx Three Reasons to Study Parallel Programming 2 1. Increase performance: do more work in the same

More information

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015 Lecture 10: Cache Coherence: Part I Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Marble House The Knife (Silent Shout) Before starting The Knife, we were working

More information

Chapter 5. Thread-Level Parallelism

Chapter 5. Thread-Level Parallelism Chapter 5 Thread-Level Parallelism Instructor: Josep Torrellas CS433 Copyright Josep Torrellas 1999, 2001, 2002, 2013 1 Progress Towards Multiprocessors + Rate of speed growth in uniprocessors saturated

More information

Multiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types

Multiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types Chapter 5 Multiprocessor Cache Coherence Thread-Level Parallelism 1: read 2: read 3: write??? 1 4 From ILP to TLP Memory System is Coherent If... ILP became inefficient in terms of Power consumption Silicon

More information

MIPS Coherence Protocol Specification

MIPS Coherence Protocol Specification MIPS Coherence Protocol Specification Document Number: MD00605 Revision 01.01 September 14, 2015 Public. This publication contains proprietary information which is subject to change without notice and

More information

Potential violations of Serializability: Example 1

Potential violations of Serializability: Example 1 CSCE 6610:Advanced Computer Architecture Review New Amdahl s law A possible idea for a term project Explore my idea about changing frequency based on serial fraction to maintain fixed energy or keep same

More information

Designing Memory Consistency Models for. Shared-Memory Multiprocessors. Sarita V. Adve

Designing Memory Consistency Models for. Shared-Memory Multiprocessors. Sarita V. Adve Designing Memory Consistency Models for Shared-Memory Multiprocessors Sarita V. Adve Computer Sciences Department University of Wisconsin-Madison The Big Picture Assumptions Parallel processing important

More information

ROEVER ENGINEERING COLLEGE DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

ROEVER ENGINEERING COLLEGE DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING ROEVER ENGINEERING COLLEGE DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING 16 MARKS CS 2354 ADVANCE COMPUTER ARCHITECTURE 1. Explain the concepts and challenges of Instruction-Level Parallelism. Define

More information

Adapted from David Patterson s slides on graduate computer architecture

Adapted from David Patterson s slides on graduate computer architecture Mei Yang Adapted from David Patterson s slides on graduate computer architecture Introduction Ten Advanced Optimizations of Cache Performance Memory Technology and Optimizations Virtual Memory and Virtual

More information

Chapter 5. Multiprocessors and Thread-Level Parallelism

Chapter 5. Multiprocessors and Thread-Level Parallelism Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model

More information

Weak memory models. Mai Thuong Tran. PMA Group, University of Oslo, Norway. 31 Oct. 2014

Weak memory models. Mai Thuong Tran. PMA Group, University of Oslo, Norway. 31 Oct. 2014 Weak memory models Mai Thuong Tran PMA Group, University of Oslo, Norway 31 Oct. 2014 Overview 1 Introduction Hardware architectures Compiler optimizations Sequential consistency 2 Weak memory models TSO

More information

Computer Systems Architecture I. CSE 560M Lecture 18 Guest Lecturer: Shakir James

Computer Systems Architecture I. CSE 560M Lecture 18 Guest Lecturer: Shakir James Computer Systems Architecture I CSE 560M Lecture 18 Guest Lecturer: Shakir James Plan for Today Announcements No class meeting on Monday, meet in project groups Project demos < 2 weeks, Nov 23 rd Questions

More information

Review: Creating a Parallel Program. Programming for Performance

Review: Creating a Parallel Program. Programming for Performance Review: Creating a Parallel Program Can be done by programmer, compiler, run-time system or OS Steps for creating parallel program Decomposition Assignment of tasks to processes Orchestration Mapping (C)

More information

C11 Compiler Mappings: Exploration, Verification, and Counterexamples

C11 Compiler Mappings: Exploration, Verification, and Counterexamples C11 Compiler Mappings: Exploration, Verification, and Counterexamples Yatin Manerkar Princeton University manerkar@princeton.edu http://check.cs.princeton.edu November 22 nd, 2016 1 Compilers Must Uphold

More information

DISTRIBUTED COMPUTER SYSTEMS

DISTRIBUTED COMPUTER SYSTEMS DISTRIBUTED COMPUTER SYSTEMS CONSISTENCY AND REPLICATION CONSISTENCY MODELS Dr. Jack Lange Computer Science Department University of Pittsburgh Fall 2015 Consistency Models Background Replication Motivation

More information

Memory Cache. Memory Locality. Cache Organization -- Overview L1 Data Cache

Memory Cache. Memory Locality. Cache Organization -- Overview L1 Data Cache Memory Cache Memory Locality cpu cache memory Memory hierarchies take advantage of memory locality. Memory locality is the principle that future memory accesses are near past accesses. Memory hierarchies

More information