19: Multiprocessing and Multicore. Computer Architecture and Systems Programming, Herbstsemester 2013, Timothy Roscoe


1 19: Multiprocessing and Multicore. Computer Architecture and Systems Programming, Herbstsemester 2013, Timothy Roscoe

2 Very simple start. Problem: one processor isn't fast enough. Observation: some jobs can parallelize, and multiple jobs can run at the same time. Solution: attach multiple processors to the system bus.

3 Symmetric multiprocessing (SMP). [Diagram: CPU 0-3, each with its own cache, sharing a bus to RAM.] SMP only works because of caches! Shared memory rapidly becomes the bottleneck.

4 19.1: Consistency and Coherence

5 Coherency and Consistency As with DMA, memory can change under a cache Writes from other processors to memory Leads to 2 important concepts: 1. Coherency: Values in caches all match each other Processors all see a coherent view of memory 2. Consistency: The order in which changes to memory are seen by different processors

6 Cache coherency. Most CPU cores on a modern machine are cache coherent: they behave as if all were accessing a single memory array. We'll see what this really means in a moment. Big advantage: ease of programming; shared memory programming models work (Pthreads, OpenMP, etc.). Disadvantages: complex to implement (lots of transistors, bug-prone), and memory is slower as a result.

7 Memory consistency. When several processors are reading and writing memory, what value is read by each processor? Not an easy question to answer. "Last value written": by which processor? What do we mean by "last"? Important to have an answer: it defines the semantics of order-dependent operations. E.g. does Dekker's algorithm work? How do we ensure that it does? There are many memory consistency models.

8 Consistency models: terminology. Program order: the order in which a program on a processor appears to issue reads and writes. Refers only to local reads/writes. Even on a uniprocessor this need not be the order the CPU actually issues them (write-back caches, write buffers, out-of-order execution, etc.). Visibility order: the order in which all reads and writes are seen by one or more processors. Refers to all operations in the machine, and might not be the same for each processor. Each processor reads the value written by the last write in visibility order.

9 19.2: Sequential Consistency

10 Sequential consistency. 1. Operations from a processor appear (to all others) in program order. 2. Every processor's visibility order is the same interleaving of all the program orders. Requirements: each processor issues memory ops in program order; RAM totally orders all operations; memory operations are atomic.

11 Sequential consistency example. CPU A: a1: *p = 1; a2: *q = 1. CPU B: b1: u = *q; b2: v = *p. Result (u=1, v=1): possible under SC, e.g. the interleaving (a1, a2, b1, b2); (a1, a2) and (b1, b2) are both program orders. Result (u=1, v=0): impossible under SC; no interleaving of the program orders generates this result, since it would require a2 → b1 → b2 → a1, contradicting a1 coming before a2.

12 Sequential consistency example. CPU A: a1: *p = 1; a2: u = *q. CPU B: b1: *q = 1; b2: v = *p. Result (u=1, v=1): possible under SC, e.g. the interleaving (a1, b1, a2, b2); (a1, a2) and (b1, b2) are both program orders. Result (u=0, v=0): impossible under SC; no interleaving of the program orders generates this result, since it would require a2 → b1 → b2 → a1, which together with the program orders forms a cycle.
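The slide-12 example can be run as an ordinary C program. The following sketch (not from the lecture; the thread names and the harness are illustrative) uses POSIX threads; under sequential consistency (u, v) = (0, 0) is impossible, but a real compiler and CPU may still produce it unless fences are added, which is exactly why the memory model matters.

    #include <pthread.h>
    #include <stdio.h>

    volatile int x = 0, y = 0;          /* the two shared locations      */
    volatile int *p = &x, *q = &y;      /* named as on the slide         */
    int u = -1, v = -1;

    void *cpu_a(void *arg) { (void)arg; *p = 1; u = *q; return NULL; }  /* a1; a2 */
    void *cpu_b(void *arg) { (void)arg; *q = 1; v = *p; return NULL; }  /* b1; b2 */

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, cpu_a, NULL);
        pthread_create(&b, NULL, cpu_b, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("u=%d v=%d\n", u, v);    /* (0,0) would violate SC; x86 can show it */
        return 0;                       /* compile with -pthread; run many times   */
    }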

13 Sequential consistency. Advantages: easy to understand for the programmer; easy to write correct code for; easy to analyze automatically. Disadvantages: hard to build a fast implementation; cannot reorder reads/writes, even in the compiler, even from a single processor; cannot combine writes to the same cache line (write buffer); serializing ops at the memory controller is too restrictive (see NUMA later).

14 19.3: Cache coherence with snooping

15 Implementing SC with a snoopy cache. The cache snoops on reads/writes from other processors. If a line is valid in the local cache: a remote (other processor) write to the line ⇒ invalidate the local line. Requires a write-through cache! But the coherency mechanism ⇒ sequential consistency. A line can be valid in many caches, until a write.

16 What about write back caches? Cache lines can now be dirty (modified) Requires a cache coherency protocol Simplest protocol: MSI Each line has 3 states: Modified, Shared, Invalid Line can only be dirty in one cache Cache logic must respond to: Processor reads and writes Remote bus reads and writes and must: Change cache line state Write back data (flush) if required

17 MSI state machine: all transitions. [State-machine diagram over Invalid, Shared, Modified: a local read miss takes I→S, a local write takes I or S→M; a remote write miss invalidates the line (writing the block back first if it was Modified); a remote read miss on a Modified line writes the block back and drops to Shared; eviction of a Modified line writes the block back; local reads hit in S, and local reads or writes hit in M, with no state change.]
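As an illustration of the protocol just described (not the lecture's code), the MSI transitions can be written down as a small function; write-backs and invalidations are only noted in comments, evictions are omitted, and the event names are my own.

    enum msi_state { INVALID, SHARED, MODIFIED };

    enum msi_event {
        LOCAL_READ, LOCAL_WRITE,              /* from our own processor */
        REMOTE_READ_MISS, REMOTE_WRITE_MISS   /* snooped from the bus   */
    };

    enum msi_state msi_next(enum msi_state s, enum msi_event e) {
        switch (s) {
        case INVALID:
            if (e == LOCAL_READ)  return SHARED;    /* fetch; others may share       */
            if (e == LOCAL_WRITE) return MODIFIED;  /* read-exclusive, then dirty    */
            return INVALID;                         /* remote traffic: nothing to do */
        case SHARED:
            if (e == LOCAL_WRITE)       return MODIFIED;  /* invalidate other copies */
            if (e == REMOTE_WRITE_MISS) return INVALID;   /* our copy is now stale   */
            return SHARED;                                /* reads keep it shared    */
        case MODIFIED:
            if (e == REMOTE_READ_MISS)  return SHARED;    /* write back block, share */
            if (e == REMOTE_WRITE_MISS) return INVALID;   /* write back block, drop  */
            return MODIFIED;                              /* local hits stay dirty   */
        }
        return INVALID;    /* not reached */
    }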

18 MSI issues. Assumes we can distinguish remote processor read and write misses. In the I state, executing a write miss: need to first read the line (allocate); if someone else has it in the M state, need to wait for the flush. In the M state, on observing a remote read: must flush the line (required), but then what? Invalidate the line? But what if you want read sharing? Extra cache miss! Transition to Shared? But what if it's actually a remote write miss? Extra invalidate!

19 19.4: The MESI cache coherence protocol

20 MESI protocol Add a new line state: exclusive Modified: This is the only copy, it s dirty Exclusive: This is the only copy, it s clean Shared: This is one of several copies, all clean Invalid Add a new bus signal: RdX Read exclusive Cache can load into either shared or exclusive states Other caches can see the type of read Also: HIT signal Signals to a remote processor that its read hit in local cache. First x86 appearance in the Pentium

21 MESI invariants. Allowed combinations of states for one line between any pair of caches (✓ = allowed):

         M    E    S    I
    M    -    -    -    ✓
    E    -    -    -    ✓
    S    -    -    ✓    ✓
    I    ✓    ✓    ✓    ✓

The protocol must preserve these invariants. For comparison, the MSI invariants:

         M    S    I
    M    -    -    ✓
    S    -    ✓    ✓
    I    ✓    ✓    ✓

22 MESI state machine. Terminology: PrRd: processor read; PrWr: processor write; BusRd: bus read; BusRdX: bus read exclusive; BusWr: bus write. [State-machine diagram with processor-initiated and snoop-initiated transitions: a PrRd from Invalid issues a BusRd and moves to Shared if another cache signals HIT, otherwise to Exclusive; a PrWr issues a BusRdX and moves to Modified; a snooped BusRdX discards the local copy (Modified writes the block back first); a snooped BusRd on Exclusive or Modified drops to Shared (Modified writes back); local hits in Shared, Exclusive, and Modified require no bus transaction.]

23 MESI observations. Dirty data always goes back through memory: no cache-to-cache transfers. An invalidation-based protocol. Data is always either: 1. dirty in one cache (must be written back before a remote read), or 2. clean (can be safely fetched from memory). Good if: latency of memory << latency of a remote cache.

24 MOESI protocol. Add a new Owner state: allows the line to be modified while unmodified copies exist in other caches. Modified: no other cached copies exist; the local copy is dirty. Owner: unmodified copies may exist; the local copy is dirty. Exclusive: no other cached copies exist; the local copy is clean. Shared: other cached copies exist; the local copy is clean (one other copy might be dirty, in state Owner). Invalid: not in the cache.

25 MOESI invariants. Allowed combinations of states for one line between any pair of caches (✓ = allowed):

         M    O    E    S    I
    M    -    -    -    -    ✓
    O    -    -    -    ✓    ✓
    E    -    -    -    -    ✓
    S    -    ✓    -    ✓    ✓
    I    ✓    ✓    ✓    ✓    ✓

MOESI can satisfy a read request in state I from a remote cache in state O, for example. Good if: latency of a remote cache < latency of main memory.

26 19.5: Relaxing sequential consistency

27 Relaxing sequential consistency. Recall the program-order requirement for SC, and the example: CPU A: a1: *p = 1; a2: *q = 1. CPU B: b1: u = *q; b2: v = *p. Out-of-order execution might reorder (b2, b1). A write buffer might reorder (a1, a2): a1 might miss in the cache while a2 hits. The compiler might reorder operations in each thread, or optimize out entire reads or writes. What can be done?

28 Relaxing sequential consistency Many, many different ways to do this! E.g.: Write to read: later reads can bypass earlier writes Write to write: later writes can bypass earlier writes Break write atomicity (no single visibility order) Weak ordering: no implicit order guarantees at all Explicit synchronization instructions x86: lfence (load fence), sfence (store fence), mfence (memory fence) Alpha: mb (memory barrier), wmb (write memory barrier)

29 Processor Consistency Also PRAM (Pipelined Random Access Memory) Implemented in Pentium Pro, now part of x86 architecture. Write to read relaxation: later reads can bypass earlier writes All processors see writes from one processor in the order they were issued. Processors can see different interleavings of writes from different processors.

30 Processor (PRAM) consistency example. CPU A: a1: *p = 1. CPU B: b1: u = *p; b2: *q = 1. CPU C: c1: v = *q; c2: w = *p. (u,v,w) = (1,1,0) is possible under PC: B sees visibility order (a1, b2), while C sees visibility order (b2, a1).

31 Other consistency models. [Table comparing architectures (Alpha, PA-RISC, Power, x86_32, x86_64, IA-64, zSeries) by which reorderings they permit: reads after reads, reads after writes, writes after reads, writes after writes, dependent reads, instruction fetch after write.] An incoherent I-cache requires explicit I-cache flushes for self-modifying code. With dependent-read reordering, the read of a value can be seen before the read of the address of the value! Not shown: SPARC, which supports 3 different memory models. Portable languages like Java must define their own memory model, and enforce it!

32 19.6: Barriers and fences

33 Barriers and Fences. General rule: the weaker the consistency model, the faster/cheaper the hardware can be. We've seen that visibility order is essential for the correct functioning of some algorithms, yet difficult to guarantee with many compilers and memory models. The solution is to use barriers (also called fences). Compiler barriers prevent the compiler from reordering statements; memory barriers prevent the CPU from reordering instructions.

35 Compiler barriers. Prevent the compiler from reordering visible loads and stores; it may still reorder (private) register accesses. Typically compiler intrinsics: GCC: asm volatile ("" ::: "memory"); Intel ICC: __memory_barrier(); Microsoft Visual C/C++: _ReadWriteBarrier().
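A minimal sketch of why a compiler barrier matters (illustrative; 'flag' and wait_for_flag() are made-up names). Without the barrier, the compiler is free to hoist the load of flag out of the loop and spin forever on a stale register copy; note that this restrains only the compiler, not the CPU.

    int flag;                                /* set to 1 by some other thread (not shown) */

    static inline void compiler_barrier(void) {
        __asm__ volatile("" ::: "memory");   /* GCC/Clang: emits no code, clobbers "memory" */
    }

    void wait_for_flag(void) {
        while (flag == 0)
            compiler_barrier();              /* forces flag to be re-loaded each iteration */
    }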

36-39 Memory barriers on x86: the MFENCE instruction prevents the CPU from reordering any loads or stores past it. [Diagram, built up over these slides: a program-order sequence of stores and loads with an MFENCE in the middle; reorderings among instructions on the same side of the fence are allowed, but a reordering that would cross the fence is not.] Also: LFENCE orders loads, SFENCE orders stores.
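A sketch of MFENCE in use on the store-buffering pattern from earlier (illustrative wrapper and variable names; on GCC one could equally use __sync_synchronize() or a C11 atomic_thread_fence). With the fence, each store is globally visible before the following load, so the outcome in which both threads read 0 is ruled out.

    static inline void mfence(void) {
        __asm__ volatile("mfence" ::: "memory");   /* x86 full memory fence */
    }

    volatile int X = 0, Y = 0;

    int thread_a(void) {      /* runs on CPU A */
        X = 1;
        mfence();             /* store to X drains before the load of Y */
        return Y;
    }

    int thread_b(void) {      /* runs on CPU B */
        Y = 1;
        mfence();
        return X;             /* at least one of the two threads now returns 1 */
    }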

40 19.7: Multicore synchronization: Test and Set

41 Synchronization. Two ways to synchronize: 1. Atomic operations on shared memory, e.g. test-and-set, compare-and-swap; these also have ordering constraints specified in the memory model. 2. Interprocessor interrupts (IPIs): invoke an interrupt handler on a remote CPU; very slow (500+ cycles on Intel), often avoided except in the OS. The two are used for different purposes (e.g. locks vs. asynchronous notification).

42 Test-And-Set (TAS). One of the simplest non-trivial atomic operations: 1. read a memory location's value into a register; 2. store 1 into the location. A read-modify-write cycle is required: the memory bus must be locked during the instruction. Can also appear as a register: reading returns the value and sets it to 1; writes of zero reset the register. (Not to be confused with TAZ.)
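A sketch of TAS in C (not the lecture's code). The commented-out body shows the intended semantics; as plain C it would not be atomic, so the wrapper below uses a GCC/Clang builtin that is.

    /* Semantics, as if executed atomically:
     *     int TAS(int *lock) { int old = *lock; *lock = 1; return old; }
     */
    static inline int TAS(volatile int *lock) {
        /* atomically writes 1 and returns the previous value (acquire barrier) */
        return __sync_lock_test_and_set(lock, 1);
    }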

43 Using Test-And-Set. Acquire a mutex with TAS. This is a spinlock: keep trying in a tight loop; often fastest if the lock is not held for long. Release is simple:

    void acquire(int *lock) {
        while (TAS(lock) == 1)
            ;                  // spin until TAS returns 0 (lock was free, and we now hold it)
    }

    void release(int *lock) {
        *lock = 0;             // mark the lock free again
    }

44 Performance of a TAS-based spinlock. How bad can this be? It turns out that TAS can be expensive: memory must be locked while a long operation occurs, i.e. we must do a read followed by a write while no one else can access memory. If we are spinning, this slows everything down. Can we make it faster?

45 Test-And-Test-And-Set. Replace most of the RMW cycles with simple reads:

    void acquire(int *lock) {
        do {
            while (*lock == 1)
                ;                      // plain read: spins in our own cache
        } while (TAS(lock) == 1);      // only do the expensive RMW when the lock looks free
    }

Think about the cache traffic: the reads hit in the spinner's cache. A write due to release() invalidates the cache line ⇒ the next load comes from main memory and returns 0 ⇒ this triggers one further RMW cycle from the spinner, which is highly likely to succeed (unless there is contention).

46 Beware of Test-And-Test-And-Set! You may be tempted to use this pattern in your regular code: test whether a value has changed outside a lock, and if so, acquire the lock and check again. Don't. It doesn't work. You need to understand: whether the compiler will reorder reads and writes; whether the processor will reorder reads and writes; whether the memory consistency model will change your code's semantics. In Java, this is almost bound to happen.

47 Test-And-Test-And-Set in systems code. Why does it work in systems code? TAS is a hardware instruction, and TAS is serializing: the processor won't reorder instructions past a TAS, and the compiler won't either (we hope), or at least the compiler and processor are told not to (memory barriers or fences, use of the volatile keyword). Even in an OS, the code depends on the processor architecture (its memory consistency model).

48 Other primitives: Fetch-And-Add. Atomically { add to a memory location; read the previous value }. Allows sequencing of operations: the event counts and sequencers model (contrast with mutexes and condition variables).

49 Mutual exclusion with Fetch-And-Add. Atomically { add to a memory location; read the previous value }:

    struct seq_t {
        int sequencer;     // initially 0
        int event_count;   // initially 0
    };

    void acquire(struct seq_t *s) {
        int ticket = FAA(&s->sequencer, 1);     // take the next ticket
        while (s->event_count != ticket)
            ;                                   // wait until our ticket is served
    }

    void release(struct seq_t *s) {
        FAA(&s->event_count, 1);                // admit the holder of the next ticket
    }
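The same sequencer / event-count idea written with C11 atomics, where FAA becomes atomic_fetch_add; this is the classic ticket lock. A sketch, not the lecture's code; the names are illustrative.

    #include <stdatomic.h>

    struct ticket_lock {
        atomic_int next_ticket;    /* the sequencer; initialise both fields to 0 */
        atomic_int now_serving;    /* the event count                            */
    };

    void ticket_acquire(struct ticket_lock *l) {
        int ticket = atomic_fetch_add(&l->next_ticket, 1);    /* take a ticket     */
        while (atomic_load(&l->now_serving) != ticket)
            ;                                                  /* wait for our turn */
    }

    void ticket_release(struct ticket_lock *l) {
        atomic_fetch_add(&l->now_serving, 1);                  /* admit the next ticket */
    }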

50 19.8: Compare and Swap

51 Compare-and-Swap. CAS(location, old, new), atomically: { 1. load location into value; 2. if value == old then store new to location; 3. return value }. Interesting features: theoretically more powerful than TAS, FAA, etc.; can implement almost all wait-free data structures; requires bus locking, or similar, in the memory system.
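The usual way CAS is used is a read-compute-CAS retry loop. A sketch (not from the slides) implementing an atomic add on top of the GCC/Clang CAS builtin:

    static inline int atomic_add_via_cas(volatile int *loc, int delta) {
        int old, desired;
        do {
            old = *loc;                   /* 1. read the current value                 */
            desired = old + delta;        /* 2. compute the value we want              */
        } while (__sync_val_compare_and_swap(loc, old, desired) != old);
        return desired;                   /* 3. CAS returned old => our swap succeeded */
    }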

52 Use of CAS for mutual exclusion. CAS is generally used in lock-free data structures: essentially, concurrent updates do not require locks. Such data structures are typically much larger than 1 word of memory. General pattern: read-copy-update. Readers all read the same data structure; writers take a copy, modify it, then write back the copy; the old version is deleted when all the readers are finished. CAS is used to change the pointer to the new version, as long as nothing else has changed it!
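A sketch of the read-copy-update publish step described above, using a GCC/Clang CAS builtin. copy_of() and the reclamation of the old version (reference counting, as in the next slides) are assumed and not shown; a real implementation would also discard a failed copy before retrying.

    struct data;                                   /* the shared structure            */
    struct data *global_ptr;                       /* readers follow this pointer     */
    struct data *copy_of(struct data *old);        /* assumed: allocate copy + modify */

    void writer_update(void) {
        struct data *old_version, *new_version;
        do {
            old_version = global_ptr;              /* snapshot the current version    */
            new_version = copy_of(old_version);    /* private copy, modified          */
        } while (!__sync_bool_compare_and_swap(&global_ptr, old_version, new_version));
        /* old_version is freed later, once all readers have dropped their references */
    }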

53 CAS for lock-free update. [Diagram: a global pointer to the data structure, Ref = 2; CPU 0 is a reader, CPU 1 a writer.] 1. The reader follows the global pointer and increments the ref count.

54 CAS for lock-free update. [Diagram: the global pointer still points at the original data structure (Ref = 2); the writer now holds a private copy (Ref = 1).] 2. The writer also reads the data structure and increments the ref count. 3. The writer copies the data and decrements the ref count on the original.

55 CAS for lock-free update. [Diagram: the original data structure now has Ref = 1; the updated copy has Ref = 2 and the global pointer points at it.] 3. The writer copies the data, decrements the ref count on the original, and updates the copy. 4. The writer uses CAS to swap the global pointer over to the updated copy.

56 CAS for lock-free update. [Diagram: the old data structure has been freed; the global pointer points at the copy (Ref = 1).] 5. The reader finishes and decrements the ref count on the old version; it is now 0, so the old data is freed. 6. The writer finishes and decrements the ref count on the copy; it is now 1.

57 The ABA problem. CAS has a problem: it reports whether a single location holds a different value, but not whether it has been written (possibly with the same value). This leads to the ABA problem: 1. CPU A reads the value as x. 2. CPU B writes y to the value. 3. CPU B writes x to the value. 4. CPU A reads the value as x and concludes nothing has changed. Many problems result: e.g., what if the value is a software stack pointer?

58 Solving the ABA problem. Basic problem: the value used for the CAS comparison has not changed, but the data has; CAS doesn't say whether a write has occurred, only whether a value has changed. Solution: ensure the value always changes! Split the value into: the original value plus a monotonically increasing counter, and CAS both halves in a single instruction.
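A sketch of the value-plus-counter trick (illustrative field names; assumes a 64-bit target and C11). Packing a 32-bit index and a 32-bit counter into one word means a single CAS covers both, so a reused index no longer looks unchanged.

    #include <stdint.h>

    typedef union {
        uint64_t raw;
        struct {
            uint32_t head;       /* e.g. index of the top free-list node */
            uint32_t counter;    /* bumped on every successful update    */
        };
    } versioned_t;

    _Static_assert(sizeof(versioned_t) == 8, "must fit in one CAS");

    int try_set_head(volatile uint64_t *loc, versioned_t expected, uint32_t new_head) {
        versioned_t desired;
        desired.head    = new_head;
        desired.counter = expected.counter + 1;    /* value changes even if head repeats */
        return __sync_bool_compare_and_swap(loc, expected.raw, desired.raw);
    }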

59 Double Compare-And-Swap (CAS2 or DCAS). CAS2(loc1, old1, new1, loc2, old2, new2) atomically: 1. compares two memory locations with different values; 2. if both match, each is updated with a different new value; 3. if not, the existing values in the locations are loaded. Rarely implemented: MC680x0. Was claimed to be more useful than plain CAS; shown more recently to be not so useful: everything can be done fast with CAS, if you're slightly clever.

60 What about x86? Bewildering array of options! XCHG: atomic exchange of register and memory location equivalent to Test And Set LOCK prefix (e.g. LOCK ADD, LOCK BTS, ) Executes instruction with bus locked LOCK XADD: Atomic Fetch and add CMPXCHG: 32 bit compare and swap CMPXCHG8B: 64 bit compare and swap Useful for solving ABA problem Not the same as CAS2! CMPXCHG16B: 128 bit compare and swap (x86_64 only!)

61 19.9: Load Locked / Store Conditional

62 Load-Locked / Store-Conditional. CAS requires bus locking for the read-modify-write: this complicates the memory bus logic, may be a memory bottleneck, and is very CISC-y, not a good match for load-store RISC. Alternative: load-locked / store-conditional (also called load-linked, or load-and-reserve), implemented in MIPS, PowerPC, ARM, Alpha, ... Theoretically as powerful as CAS: provides a lock-free atomic read-modify-write, factored into two instructions.

63 LL/SC usage. Atomic read-modify-write with multiple instructions: 1. load-locked <location> → register; 2. modify the value in the register; 3. store-conditional register → <location>; 4. test for SC success, and retry if not. The store will probably succeed if no updates to the location occur in the meantime. Note: LL/SC does not suffer from the ABA problem!
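C11's atomic_compare_exchange_weak is allowed to fail spuriously precisely so that it can be compiled to an LL/SC pair on machines like ARM or POWER, and it is meant for exactly this retry pattern. A sketch (not the lecture's code):

    #include <stdatomic.h>

    int atomic_increment(atomic_int *loc) {
        int old = atomic_load(loc);                              /* roughly: load-locked */
        while (!atomic_compare_exchange_weak(loc, &old, old + 1))
            ;   /* store-conditional failed (or value changed): old was refreshed, retry */
        return old + 1;
    }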

64 Implementation of LL/SC. Remember one address and re-use cache coherence: the processor remembers the physical address of the locked load and snoops on cache invalidates for that cache line; the store fails if an invalidate has been seen. Limitations: cache-line granularity (danger of false sharing); context switches, interrupts, etc. produce false positives. Referred to by theorists as "weak" LL/SC.

65 Further reading: the definitive work on shared-memory synchronization and lock-free or wait-free algorithms.

66 19.10: Simultaneous multithreading

67 Performance limits of SMP. Cache-coherent SMP still has memory as a bottleneck: all accesses to main memory stall the processor. This limits scalability: remember Amdahl's law? MOESI allows reads to be serviced from another cache, but the cache itself can also be slow.

68 Performance limits of SMP. Cache-coherent SMP still has memory as a bottleneck: all accesses to main memory stall the processor. This limits scalability: remember Amdahl's law? MOESI allows reads to be serviced from another cache, but the cache itself can also be slow. Memory stalls halt the processor, and other processors accessing memory; and the more processors there are, the more often this will happen.

69 Simultaneous Multithreading. Can we do anything useful while the processor is waiting for memory (or another cache)? Most functional units (ALU, etc.) are idle, and many instructions don't require the memory unit. But ILP is limited: we can't execute many other instructions in the same thread because of data dependencies. So what about instructions in other threads? Add multiple fetch/decode units and register sets, and re-use the superscalar functional units.

70 Conventional superscalar. [Block diagram: instruction control (fetch control, instruction cache, instruction decode, retirement unit, register file, branch prediction) issuing operations to the execution unit's functional units (integer/branch, general integer, FP add, FP multiply/divide, load, store) and the data cache, with operation results and register updates flowing back.]

71 SMT (or Hyperthreading). Label instructions in hardware with a thread ID: a logical extension of superscalar techniques, now with multiple independent instruction streams. Fine-grained multithreading: select among threads on a per-instruction basis. Coarse-grained multithreading: switch between threads on a memory stall. A multithreaded CPU appears to the OS as multiple CPUs!

72 Is it worth it? Well... it can be slower than a single thread! It might get no advantage from memory accesses, and multiple threads compete for the cache; the advantage is limited (10-20% is typical). BUT: it's cheap (in transistors). It depends on the workload: not necessarily good for scientific computing, but good for lots of memory-intensive requests, e.g. web servers (the target application domain for Sun Niagara, Rock, etc.).

73 19.11: Non Uniform Memory Access (NUMA)

74 SMP architecture. [Diagram: Core 1, Core 2, Core 3, and Core 4 sharing a single path to main memory.]

75 SMP architecture. More cores bring more cycles, but not necessarily proportionately more cache... nor more off-chip bandwidth or total RAM capacity. [Diagram: more cores still sharing the same path to main memory.]

76 SMP architecture. More cores and faster cores use more memory bandwidth; buses are replaced with interconnection networks. [Diagram: per-core caches connected to main memory and the I/O system over an interconnection network.]

77 Distributed memory architecture. Adding more CPUs brings more of most other things: each node has CPUs, its own RAM, and a directory. Locality property: only go to the interconnect for real I/O or sharing. This covers DSM/NUMA and message passing (e.g. clusters), and could scale to 100s or 1000s of cores. [Diagram: many nodes, each with CPUs, local RAM, and a directory, attached to an interconnect.]

78 Non-Uniform Memory Access. [Diagram: CPU 0 and CPU 1, each with a cache and local RAM, joined by an interconnect; local accesses are fast(ish), remote accesses over the interconnect are slow.] All memory is globally addressable, but local memory is faster.

79 Non-Uniform Memory Access. [Diagram: CPU 0 and CPU 1, each with a cache and local RAM, connected by an interconnect.] NUMA interconnects are not new, but are somewhat new in PCs: AMD HyperTransport (originally the Alpha EV7 interconnect), Intel CSI/QuickPath.

80 19.12: NUMA cache coherence

81 NUMA cache coherence. Can't snoop on the bus any more: it's not a bus! NUMA machines use a message-passing interconnect. Solution 1: bus emulation. Much more complicated than it sounds; similar to snooping, but without a shared bus. Each node sends a message to all other nodes (e.g. "read exclusive") and waits for a reply from all nodes (e.g. "acknowledge") before proceeding. Example: AMD coherent HyperTransport.

82 Solution 2: cache directory. Augment each node's local memory with a cache directory. [Diagram: for each cache line (e.g. 64 bytes) of the memory itself (usually DRAM), the directory records the node ID of the owner of the cache line and 1 bit per node indicating presence of the line.]
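What one directory entry might look like, as a sketch for a machine with up to 64 nodes (field names and widths are illustrative, not taken from any real implementation):

    #include <stdint.h>

    struct dir_entry {           /* one per cache line of this node's local memory       */
        uint8_t  owner;          /* node ID of the line's current owner                  */
        uint64_t present;        /* bit i set => node i may hold a copy (up to 64 nodes) */
    };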

83 Directory-based cache coherence. The home node maintains the set of nodes that may have the line. Used in large multiprocessors, plus AMD HT Assist and Intel Beckton (QuickPath). More efficient when: 1. lines are not widely shared, and 2. there are lots of NUMA nodes. Avoids broadcast/incast; reduces interconnect traffic and the load at each node. But it requires lots of fast memory: HT Assist can lose up to 33% of the L3!

84 19.13: Multicore

85 Moore's law: the free lunch. [Chart: transistor counts over time for Intel processors: 8086, 486, Pentium, Pentium II, Pentium 4, Itanium.]

86 The power wall. [Image: AMD Athlon 1500 processor.]

87 The power wall. Dynamic power dissipation = capacitive load × voltage² × frequency. An increase in clock frequency means more power dissipated and more cooling required; a decrease in voltage reduces dynamic power but increases static power (leakage). We've reached the practical power limit for cooling commodity microprocessors: we can't increase the clock frequency without expensive cooling.
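A quick worked illustration of the formula above (the 20% figures are just an example):

    P_{dyn} \propto C \cdot V^2 \cdot f, \qquad
    \frac{P'}{P} = \left(\frac{V'}{V}\right)^{2} \cdot \frac{f'}{f}
                 = 0.8^2 \times 0.8 \approx 0.51

so scaling both voltage and frequency down to 80% roughly halves dynamic power; conversely, pushing the clock up usually needs a higher voltage, so power grows much faster than frequency.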

88 The memory wall. [Plot: operations per µs for CPU vs. memory over time, log scale; CPU clock rates have grown far faster than memory speeds. Annotation: the CPU clock in MHz vs. 500 ns to access memory in 1985.]

89 The ILP wall. ILP = instruction-level parallelism: implicit parallelism between instructions in one thread. The processor can re-order and pipeline instructions, split them into micro-instructions, do aggressive branch prediction, etc., but this requires hardware safeguards to prevent potential errors from out-of-order execution, and it increases execution-unit complexity and the associated power consumption. Diminishing returns: serial performance acceleration using ILP has stalled.

90 End of the road for serial hardware. Power wall + ILP wall + memory wall = brick wall. Power wall: can't clock processors any faster. Memory wall: for many workloads, performance is dominated by memory access times. ILP wall: can't keep the functional units busy while waiting for memory accesses. There is also a complexity wall, but chip designers don't like to talk about it.

91 Multicore processors Multiple processor cores per chip This is the future (and present) of computing Most multicore chips so far are shared memory multiprocessors (SMP) Single physical address space shared by all processors Communication between processors happens through shared variables in memory Hardware typically provides cache coherence

92 Implications for software. [Plot: log sequential performance vs. year. Historically, single-thread performance gains came from improved clock rates and transistors spent extracting ILP; transistor counts are still growing, but are now delivered as additional cores and accelerators.] The things that would have used the lost sequential performance must now be written to use those cores and accelerators.

93 Intel Nehalem EX (Beckton) 2 threads per core

94 Intel Nehalem EX 8 socket configuration Interconnect is in fact a cube with added main diagonals:

95 8-socket, 32-core AMD Barcelona. [Diagram: eight sockets, each with its own L3 cache and bank of RAM, interconnected, with attached I/O: PCIe, PCI, SATA, 1GbE, floppy disk drive.]

96 Cache access latency. L1, L2, L3: [latency figures missing from the transcription]. Memory is a bit faster than L3, but it's complicated.

97 19.14: Performance implications of multicore

98 Performance implications. 1. Memory latency: how long does it take to load from / store to memory? It depends on the physical address; high-performance code allocates memory accordingly! 2. Contention: loading a cache line from RAM is slow, but so is loading it from another cache; avoid cache-to-cache transfers wherever possible.

99 Memory latency. [Diagram: eight NUMA nodes, each with cores, an L3 cache, and a local bank of RAM. The thread runs on a core of one node and should allocate its private data from that node's bank of RAM; memory on nearby nodes is slower to access from the thread, and memory on distant nodes much slower.]
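On Linux, one way to follow this advice is libnuma (link with -lnuma). A sketch, with node 0 chosen purely as an example:

    #include <numa.h>
    #include <stdio.h>

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this system\n");
            return 1;
        }
        int node = 0;                                /* illustrative choice           */
        numa_run_on_node(node);                      /* run on the cores of that node */
        size_t sz = 1 << 20;
        void *buf = numa_alloc_onnode(sz, node);     /* RAM physically local to node  */
        /* ... use buf from threads running on this node ... */
        numa_free(buf, sz);
        return 0;
    }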

100 False sharing. [Diagram: a single cache line containing one word written often by thread A and another written often by thread B.] For best performance on a single CPU, pack data into a single cache line: better locality, more cache coverage. For threads on different CPUs this is very bad: the cache line ping-pongs between the caches on every write!
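A sketch of the usual fix: give each writer its own cache line by padding/aligning (64 bytes is the common x86 line size; adjust for the actual machine).

    #include <stdalign.h>

    struct padded_counter {
        alignas(64) long count;            /* each counter starts a fresh 64-byte line  */
    };                                     /* sizeof(struct padded_counter) is now 64   */

    struct padded_counter per_thread[2];   /* [0] written by thread A, [1] by thread B  */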

101 19.15: Optimization example: MCS locks

102 Example. There are many tricks to making things go fast on a multiprocessor, closely related to lock-free data structures, etc. Example: MCS locks. Possibly the best-known locking system for multiprocessors, with excellent cache properties: only spin on local data, and only one processor wakes up on release().

103 MCS locks [Mellor-Crummey and Scott, 1991]. Problem: the cache line containing the lock is a hot spot, continuously invalidated as every processor tries to acquire it; it dominates interconnect traffic. Solution: when acquiring, a processor enqueues itself on a list of waiting processors and spins on its own entry in the list; when releasing, only the next processor is woken.

104 MCS lock pseudocode (acquire). The lock variable points at the last element of a queue of spinning processors; each caller passes a qnode held in local memory:

    struct qnode {
        struct qnode *next;
        int locked;
    };
    typedef struct qnode *lock_t;   // the lock: tail pointer of the queue

    void acquire(lock_t *lock, struct qnode *local) {
        local->next = NULL;
        struct qnode *prev = XCHG(lock, local);   // swap ourselves in as the new tail
        if (prev) {                               // queue was non-empty
            local->locked = 1;
            prev->next = local;                   // point the previous tail at us
            while (local->locked)
                ;                                 // spin on our own qnode
        }
    }

1. Add ourselves to the end of the queue using XCHG. 2. If the queue was empty, we have the lock! 3. If not, point the previous tail at us, and spin.

105 MCS lock pseudocode (release).

    void release(lock_t *lock, struct qnode *local) {
        if (local->next == NULL) {                // nobody visibly waiting
            if (CAS(lock, local, NULL))           // still the tail: lock is now free
                return;
            while (local->next == NULL)
                ;                                 // someone is mid-acquire: wait for them to enqueue
        }
        local->next->locked = 0;                  // hand the lock to the next waiter
    }

1. We have the lock. Is someone waiting after us? 2. If yes, tell them, and they will do the rest (see acquire()). 3. If no, set the lock to NULL, unless someone appears in the meantime. 4. If they do, wait for them to enqueue, and then go to (2).
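The XCHG and CAS operations assumed by the pseudocode above can be expressed with GCC/Clang atomic builtins; these wrappers are illustrative, not the lecture's code.

    struct qnode;    /* as declared on the previous slides */

    static inline struct qnode *XCHG(struct qnode **lock, struct qnode *node) {
        /* atomically: old = *lock; *lock = node; return old; */
        return __atomic_exchange_n(lock, node, __ATOMIC_ACQ_REL);
    }

    static inline int CAS(struct qnode **lock, struct qnode *expected, struct qnode *desired) {
        /* atomically: if (*lock == expected) { *lock = desired; return 1; } else return 0; */
        return __sync_bool_compare_and_swap(lock, expected, desired);
    }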

106 MCS lock performance. [Graph: lock throughput vs. number of cores on a 4x4-core AMD Opteron running Linux; Boyd-Wickizer et al., 2008.]

107 19.16: A quick look at the future

108 The future: processors are becoming more diverse Different numbers of cores, threads Different sizes and numbers of caches Shared in different ways between cores Different extensions to an instruction set Crypto instructions Graphics acceleration Different consistency models And potentially incoherent memory

109 Sun Niagara 2 (UltraSPARC T2) 8 threads per core

110 Intel Knight's Ferry (now "Xeon Phi")

111 Tilera TilePro64

112 Beyond multicore: heterogeneous manycore. Cores will vary: some large, superscalar, out-of-order cores; many small, in-order cores (less power); some will be specialized (graphics, crypto, network, etc.); some will have different instructions, or different instruction sets. Cores will come and go: constantly powered on or off to save energy; it may not even be possible to run them all at the same time. Every machine will be different: hardware is now changing faster than system software; you can't tune your OS to one type of machine any more; caches may not be coherent any more.

113 Example: Intel Single Chip Cloud Computer

114 [SCC layout:] 24 tiles, each with 2 P54C processors (the 100 MHz Pentium; in-order execution), 256 KB of cache per core (non-coherent, no write-allocate), 16 KB of fast SRAM (the "MPB", message-passing buffer), an 8-port router, and configuration registers.

115 Our part of the picture: Barrelfish. An open-source OS written from scratch; a collaboration between ETHZ and Microsoft Research. Goals: scale to many cores; adapt to different hardware; not rely on cache coherence; keep up with trends.

116 20: Summary

117 Course summary. How to program in C: a close-to-the-metal language; you can understand the disassembly from the compiler. What hardware looks like to software, how hardware designers make it go fast, and what this means for software. What the compiler does: optimizations, transformations. How to make code correct (e.g. floating point, memory consistency, devices, etc.) and fast (caches, pipelines, vectors, DMA, etc.).

118 Course goals. Make you into an extraordinary programmer. Most programmers don't know this material, and it shows in their code. Truly great programmers understand all this, and it shows in their code.

119 Thank you! You've been a wonderful audience. Good luck in the exam! Have a great holiday.


More information

Synchronization. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

Synchronization. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University Synchronization Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu SSE3054: Multicore Systems, Spring 2017, Jinkyu Jeong (jinkyu@skku.edu) Types of Synchronization

More information

Snooping-Based Cache Coherence

Snooping-Based Cache Coherence Lecture 10: Snooping-Based Cache Coherence Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2017 Tunes Elle King Ex s & Oh s (Love Stuff) Once word about my code profiling skills

More information

Computer Systems Architecture

Computer Systems Architecture Computer Systems Architecture Lecture 23 Mahadevan Gomathisankaran April 27, 2010 04/27/2010 Lecture 23 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student

More information

Relaxed Memory-Consistency Models

Relaxed Memory-Consistency Models Relaxed Memory-Consistency Models [ 9.1] In Lecture 13, we saw a number of relaxed memoryconsistency models. In this lecture, we will cover some of them in more detail. Why isn t sequential consistency

More information

Kernel Synchronization I. Changwoo Min

Kernel Synchronization I. Changwoo Min 1 Kernel Synchronization I Changwoo Min 2 Summary of last lectures Tools: building, exploring, and debugging Linux kernel Core kernel infrastructure syscall, module, kernel data structures Process management

More information

Chapter Seven Morgan Kaufmann Publishers

Chapter Seven Morgan Kaufmann Publishers Chapter Seven Memories: Review SRAM: value is stored on a pair of inverting gates very fast but takes up more space than DRAM (4 to 6 transistors) DRAM: value is stored as a charge on capacitor (must be

More information

Advanced Topic: Efficient Synchronization

Advanced Topic: Efficient Synchronization Advanced Topic: Efficient Synchronization Multi-Object Programs What happens when we try to synchronize across multiple objects in a large program? Each object with its own lock, condition variables Is

More information

5008: Computer Architecture

5008: Computer Architecture 5008: Computer Architecture Chapter 4 Multiprocessors and Thread-Level Parallelism --II CA Lecture08 - multiprocessors and TLP (cwliu@twins.ee.nctu.edu.tw) 09-1 Review Caches contain all information on

More information

Multiple Issue and Static Scheduling. Multiple Issue. MSc Informatics Eng. Beyond Instruction-Level Parallelism

Multiple Issue and Static Scheduling. Multiple Issue. MSc Informatics Eng. Beyond Instruction-Level Parallelism Computing Systems & Performance Beyond Instruction-Level Parallelism MSc Informatics Eng. 2012/13 A.J.Proença From ILP to Multithreading and Shared Cache (most slides are borrowed) When exploiting ILP,

More information

Parallelism, Multicore, and Synchronization

Parallelism, Multicore, and Synchronization Parallelism, Multicore, and Synchronization Hakim Weatherspoon CS 3410 Computer Science Cornell University [Weatherspoon, Bala, Bracy, McKee, and Sirer, Roth, Martin] xkcd/619 3 Big Picture: Multicore

More information

Portland State University ECE 588/688. Cray-1 and Cray T3E

Portland State University ECE 588/688. Cray-1 and Cray T3E Portland State University ECE 588/688 Cray-1 and Cray T3E Copyright by Alaa Alameldeen 2014 Cray-1 A successful Vector processor from the 1970s Vector instructions are examples of SIMD Contains vector

More information

Synchronization. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

Synchronization. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University Synchronization Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Types of Synchronization Mutual Exclusion Locks Event Synchronization Global or group-based

More information

Multi-core Architectures. Dr. Yingwu Zhu

Multi-core Architectures. Dr. Yingwu Zhu Multi-core Architectures Dr. Yingwu Zhu What is parallel computing? Using multiple processors in parallel to solve problems more quickly than with a single processor Examples of parallel computing A cluster

More information

Lecture 11: Cache Coherence: Part II. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015

Lecture 11: Cache Coherence: Part II. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015 Lecture 11: Cache Coherence: Part II Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Bang Bang (My Baby Shot Me Down) Nancy Sinatra (Kill Bill Volume 1 Soundtrack) It

More information

EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy

EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy Prof. Sherief Reda School of Engineering Brown University Material from: Parallel Computer Organization and Design by Debois,

More information

Comp. Org II, Spring

Comp. Org II, Spring Lecture 11 Parallel Processor Architectures Flynn s taxonomy from 1972 Parallel Processing & computers 8th edition: Ch 17 & 18 Earlier editions contain only Parallel Processing (Sta09 Fig 17.1) 2 Parallel

More information

CS377P Programming for Performance Multicore Performance Multithreading

CS377P Programming for Performance Multicore Performance Multithreading CS377P Programming for Performance Multicore Performance Multithreading Sreepathi Pai UTCS October 14, 2015 Outline 1 Multiprocessor Systems 2 Programming Models for Multicore 3 Multithreading and POSIX

More information

ESE 545 Computer Architecture Symmetric Multiprocessors and Snoopy Cache Coherence Protocols CA SMP and cache coherence

ESE 545 Computer Architecture Symmetric Multiprocessors and Snoopy Cache Coherence Protocols CA SMP and cache coherence Computer Architecture ESE 545 Computer Architecture Symmetric Multiprocessors and Snoopy Cache Coherence Protocols 1 Shared Memory Multiprocessor Memory Bus P 1 Snoopy Cache Physical Memory P 2 Snoopy

More information

Parallel Algorithm Engineering

Parallel Algorithm Engineering Parallel Algorithm Engineering Kenneth S. Bøgh PhD Fellow Based on slides by Darius Sidlauskas Outline Background Current multicore architectures UMA vs NUMA The openmp framework and numa control Examples

More information