19: Multiprocessing and Multicore. Computer Architecture and Systems Programming, Herbstsemester 2013, Timothy Roscoe


1 19: Multiprocessing and Multicore. Computer Architecture and Systems Programming, Herbstsemester 2013, Timothy Roscoe

2 Very simple start. Problem: one processor isn't fast enough. Observation: some jobs can parallelize, and multiple jobs can run at the same time. Solution: attach multiple processors to the system bus.

3 Symmetric multiprocessing (SMP). [Diagram: CPU 0-3, each with its own cache, sharing a bus to RAM.] SMP only works because of caches! Shared memory rapidly becomes the bottleneck.

4 19.1: Consistency and Coherence

5 Coherency and Consistency As with DMA, memory can change under a cache Writes from other processors to memory Leads to 2 important concepts: 1. Coherency: Values in caches all match each other Processors all see a coherent view of memory 2. Consistency: The order in which changes to memory are seen by different processors

6 Cache coherency. Most CPU cores on a modern machine are cache coherent: they behave as if all were accessing a single memory array. We'll see what this really means in a moment. Big advantage: ease of programming; shared memory programming models work (Pthreads, OpenMP, etc.). Disadvantages: complex to implement (lots of transistors, bug-prone), and memory is slower as a result.

7 Memory consistency. When several processors are reading and writing memory, what value is read by each processor? Not an easy question to answer. "Last value written": by which processor? What do we mean by "last"? Important to have an answer: it defines the semantics of order-dependent operations. E.g. does Dekker's algorithm work? How do we ensure that it does? There are many memory consistency models.

8 Consistency models: terminology. Program order: the order in which a program on a processor appears to issue reads and writes. Refers only to local reads/writes. Even on a uniprocessor this need not be the order the CPU actually issues them (write-back caches, write buffers, out-of-order execution, etc.). Visibility order: the order in which all reads and writes are seen by one or more processors. Refers to all operations in the machine, and might not be the same for each processor. Each processor reads the value written by the last write in visibility order.

9 19.2: Sequential Consistency

10 Sequential consistency. 1. Operations from a processor appear (to all others) in program order. 2. Every processor's visibility order is the same interleaving of all the program orders. Requirements: each processor issues memory ops in program order; RAM totally orders all operations; memory operations are atomic.

11 Sequential consistency example. CPU A: a1: *p = 1; a2: *q = 1. CPU B: b1: u = *q; b2: v = *p. Result (u=1, v=1): possible under SC, e.g. the interleaving (a1, a2, b1, b2); (a1, a2) and (b1, b2) are both program orders. Result (u=1, v=0): impossible under SC; no interleaving of the program orders generates this result, since it would require a2 → b1 → b2 → a1, contradicting a1 coming before a2.

12 Sequential consistency example. CPU A: a1: *p = 1; a2: u = *q. CPU B: b1: *q = 1; b2: v = *p. Result (u=1, v=1): possible under SC, e.g. the interleaving (a1, b1, a2, b2); (a1, a2) and (b1, b2) are both program orders. Result (u=0, v=0): impossible under SC; no interleaving of the program orders generates this result, since it would require a2 → b1 → b2 → a1, which together with the program orders forms a cycle.
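The slide-12 example can be run as an ordinary C program. The following sketch (not from the lecture; the thread names and the harness are illustrative) uses POSIX threads; under sequential consistency (u, v) = (0, 0) is impossible, but a real compiler and CPU may still produce it unless fences are added, which is exactly why the memory model matters.

    #include <pthread.h>
    #include <stdio.h>

    volatile int x = 0, y = 0;          /* the two shared locations      */
    volatile int *p = &x, *q = &y;      /* named as on the slide         */
    int u = -1, v = -1;

    void *cpu_a(void *arg) { (void)arg; *p = 1; u = *q; return NULL; }  /* a1; a2 */
    void *cpu_b(void *arg) { (void)arg; *q = 1; v = *p; return NULL; }  /* b1; b2 */

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, cpu_a, NULL);
        pthread_create(&b, NULL, cpu_b, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("u=%d v=%d\n", u, v);    /* (0,0) would violate SC; x86 can show it */
        return 0;                       /* compile with -pthread; run many times   */
    }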

13 Sequential consistency. Advantages: easy to understand for the programmer; easy to write correct code for; easy to analyze automatically. Disadvantages: hard to build a fast implementation; cannot reorder reads/writes, even in the compiler, even from a single processor; cannot combine writes to the same cache line (write buffer); serializing ops at the memory controller is too restrictive (see NUMA later).

14 19.3: Cache coherence with snooping

15 Implementing SC with a snoopy cache. The cache snoops on reads/writes from other processors. If a line is valid in the local cache: a remote (other processor) write to the line ⇒ invalidate the local line. Requires a write-through cache! But the coherency mechanism ⇒ sequential consistency. A line can be valid in many caches, until a write.

16 What about write back caches? Cache lines can now be dirty (modified) Requires a cache coherency protocol Simplest protocol: MSI Each line has 3 states: Modified, Shared, Invalid Line can only be dirty in one cache Cache logic must respond to: Processor reads and writes Remote bus reads and writes and must: Change cache line state Write back data (flush) if required

17 MSI state machine: all transitions. [State-machine diagram over Invalid, Shared, Modified: a local read miss takes I→S, a local write takes I or S→M; a remote write miss invalidates the line (writing the block back first if it was Modified); a remote read miss on a Modified line writes the block back and drops to Shared; eviction of a Modified line writes the block back; local reads hit in S, and local reads or writes hit in M, with no state change.]
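As an illustration of the protocol just described (not the lecture's code), the MSI transitions can be written down as a small function; write-backs and invalidations are only noted in comments, evictions are omitted, and the event names are my own.

    enum msi_state { INVALID, SHARED, MODIFIED };

    enum msi_event {
        LOCAL_READ, LOCAL_WRITE,              /* from our own processor */
        REMOTE_READ_MISS, REMOTE_WRITE_MISS   /* snooped from the bus   */
    };

    enum msi_state msi_next(enum msi_state s, enum msi_event e) {
        switch (s) {
        case INVALID:
            if (e == LOCAL_READ)  return SHARED;    /* fetch; others may share       */
            if (e == LOCAL_WRITE) return MODIFIED;  /* read-exclusive, then dirty    */
            return INVALID;                         /* remote traffic: nothing to do */
        case SHARED:
            if (e == LOCAL_WRITE)       return MODIFIED;  /* invalidate other copies */
            if (e == REMOTE_WRITE_MISS) return INVALID;   /* our copy is now stale   */
            return SHARED;                                /* reads keep it shared    */
        case MODIFIED:
            if (e == REMOTE_READ_MISS)  return SHARED;    /* write back block, share */
            if (e == REMOTE_WRITE_MISS) return INVALID;   /* write back block, drop  */
            return MODIFIED;                              /* local hits stay dirty   */
        }
        return INVALID;    /* not reached */
    }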

18 MSI issues. Assumes we can distinguish remote processor read and write misses. In the I state, executing a write miss: need to first read the line (allocate); if someone else has it in the M state, need to wait for the flush. In the M state, on observing a remote read: must flush the line (required), but then what? Invalidate the line? But what if you want read sharing? Extra cache miss! Transition to Shared? But what if it's actually a remote write miss? Extra invalidate!

19 19.4: The MESI cache coherence protocol

20 MESI protocol Add a new line state: exclusive Modified: This is the only copy, it s dirty Exclusive: This is the only copy, it s clean Shared: This is one of several copies, all clean Invalid Add a new bus signal: RdX Read exclusive Cache can load into either shared or exclusive states Other caches can see the type of read Also: HIT signal Signals to a remote processor that its read hit in local cache. First x86 appearance in the Pentium

21 MESI invariants. Allowed combinations of states for one line between any pair of caches (✓ = allowed):

         M    E    S    I
    M    -    -    -    ✓
    E    -    -    -    ✓
    S    -    -    ✓    ✓
    I    ✓    ✓    ✓    ✓

The protocol must preserve these invariants. For comparison, the MSI invariants:

         M    S    I
    M    -    -    ✓
    S    -    ✓    ✓
    I    ✓    ✓    ✓

22 MESI state machine. Terminology: PrRd: processor read; PrWr: processor write; BusRd: bus read; BusRdX: bus read exclusive; BusWr: bus write. [State-machine diagram with processor-initiated and snoop-initiated transitions: a PrRd from Invalid issues a BusRd and moves to Shared if another cache signals HIT, otherwise to Exclusive; a PrWr issues a BusRdX and moves to Modified; a snooped BusRdX discards the local copy (Modified writes the block back first); a snooped BusRd on Exclusive or Modified drops to Shared (Modified writes back); local hits in Shared, Exclusive, and Modified require no bus transaction.]

23 MESI observations. Dirty data always goes back through memory: no cache-to-cache transfers. An invalidation-based protocol. Data is always either: 1. dirty in one cache (must be written back before a remote read), or 2. clean (can be safely fetched from memory). Good if: latency of memory << latency of a remote cache.

24 MOESI protocol. Add a new Owner state: allows the line to be modified while unmodified copies exist in other caches. Modified: no other cached copies exist; the local copy is dirty. Owner: unmodified copies may exist; the local copy is dirty. Exclusive: no other cached copies exist; the local copy is clean. Shared: other cached copies exist; the local copy is clean (one other copy might be dirty, in state Owner). Invalid: not in the cache.

25 MOESI invariants. Allowed combinations of states for one line between any pair of caches (✓ = allowed):

         M    O    E    S    I
    M    -    -    -    -    ✓
    O    -    -    -    ✓    ✓
    E    -    -    -    -    ✓
    S    -    ✓    -    ✓    ✓
    I    ✓    ✓    ✓    ✓    ✓

MOESI can satisfy a read request in state I from a remote cache in state O, for example. Good if: latency of a remote cache < latency of main memory.

26 19.5: Relaxing sequential consistency

27 Relaxing sequential consistency. Recall the program-order requirement for SC, and the example: CPU A: a1: *p = 1; a2: *q = 1. CPU B: b1: u = *q; b2: v = *p. Out-of-order execution might reorder (b2, b1). A write buffer might reorder (a1, a2): a1 might miss in the cache while a2 hits. The compiler might reorder operations in each thread, or optimize out entire reads or writes. What can be done?

28 Relaxing sequential consistency Many, many different ways to do this! E.g.: Write to read: later reads can bypass earlier writes Write to write: later writes can bypass earlier writes Break write atomicity (no single visibility order) Weak ordering: no implicit order guarantees at all Explicit synchronization instructions x86: lfence (load fence), sfence (store fence), mfence (memory fence) Alpha: mb (memory barrier), wmb (write memory barrier)

29 Processor Consistency Also PRAM (Pipelined Random Access Memory) Implemented in Pentium Pro, now part of x86 architecture. Write to read relaxation: later reads can bypass earlier writes All processors see writes from one processor in the order they were issued. Processors can see different interleavings of writes from different processors.

30 Processor (PRAM) consistency example. CPU A: a1: *p = 1. CPU B: b1: u = *p; b2: *q = 1. CPU C: c1: v = *q; c2: w = *p. (u,v,w) = (1,1,0) is possible under PC: B sees visibility order (a1, b2), while C sees visibility order (b2, a1).

31 Other consistency models. [Table comparing architectures (Alpha, PA-RISC, Power, x86_32, x86_64, IA-64, zSeries) by which reorderings they permit: reads after reads, reads after writes, writes after reads, writes after writes, dependent reads, instruction fetch after write.] An incoherent I-cache requires explicit I-cache flushes for self-modifying code. With dependent-read reordering, the read of a value can be seen before the read of the address of the value! Not shown: SPARC, which supports 3 different memory models. Portable languages like Java must define their own memory model, and enforce it!

32 19.6: Barriers and fences

33 Barriers and Fences. General rule: the weaker the consistency model, the faster/cheaper the hardware can be. We've seen that visibility order is essential for the correct functioning of some algorithms, yet difficult to guarantee with many compilers and memory models. The solution is to use barriers (also called fences). Compiler barriers prevent the compiler from reordering statements; memory barriers prevent the CPU from reordering instructions.

35 Compiler barriers. Prevent the compiler from reordering visible loads and stores; it may still reorder (private) register accesses. Typically compiler intrinsics: GCC: asm volatile ("" ::: "memory"); Intel ICC: __memory_barrier(); Microsoft Visual C/C++: _ReadWriteBarrier().
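A minimal sketch of why a compiler barrier matters (illustrative; 'flag' and wait_for_flag() are made-up names). Without the barrier, the compiler is free to hoist the load of flag out of the loop and spin forever on a stale register copy; note that this restrains only the compiler, not the CPU.

    int flag;                                /* set to 1 by some other thread (not shown) */

    static inline void compiler_barrier(void) {
        __asm__ volatile("" ::: "memory");   /* GCC/Clang: emits no code, clobbers "memory" */
    }

    void wait_for_flag(void) {
        while (flag == 0)
            compiler_barrier();              /* forces flag to be re-loaded each iteration */
    }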

36-39 Memory barriers on x86: the MFENCE instruction prevents the CPU from reordering any loads or stores past it. [Diagram, built up over these slides: a program-order sequence of stores and loads with an MFENCE in the middle; reorderings among instructions on the same side of the fence are allowed, but a reordering that would cross the fence is not.] Also: LFENCE orders loads, SFENCE orders stores.
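A sketch of MFENCE in use on the store-buffering pattern from earlier (illustrative wrapper and variable names; on GCC one could equally use __sync_synchronize() or a C11 atomic_thread_fence). With the fence, each store is globally visible before the following load, so the outcome in which both threads read 0 is ruled out.

    static inline void mfence(void) {
        __asm__ volatile("mfence" ::: "memory");   /* x86 full memory fence */
    }

    volatile int X = 0, Y = 0;

    int thread_a(void) {      /* runs on CPU A */
        X = 1;
        mfence();             /* store to X drains before the load of Y */
        return Y;
    }

    int thread_b(void) {      /* runs on CPU B */
        Y = 1;
        mfence();
        return X;             /* at least one of the two threads now returns 1 */
    }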

40 19.7: Multicore synchronization: Test and Set

41 Synchronization. Two ways to synchronize: 1. Atomic operations on shared memory, e.g. test-and-set, compare-and-swap; these also have ordering constraints specified in the memory model. 2. Interprocessor interrupts (IPIs): invoke an interrupt handler on a remote CPU; very slow (500+ cycles on Intel), often avoided except in the OS. The two are used for different purposes (e.g. locks vs. asynchronous notification).

42 Test-And-Set (TAS). One of the simplest non-trivial atomic operations: 1. read a memory location's value into a register; 2. store 1 into the location. A read-modify-write cycle is required: the memory bus must be locked during the instruction. Can also appear as a register: reading returns the value and sets it to 1; writes of zero reset the register. (Not to be confused with TAZ.)
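A sketch of TAS in C (not the lecture's code). The commented-out body shows the intended semantics; as plain C it would not be atomic, so the wrapper below uses a GCC/Clang builtin that is.

    /* Semantics, as if executed atomically:
     *     int TAS(int *lock) { int old = *lock; *lock = 1; return old; }
     */
    static inline int TAS(volatile int *lock) {
        /* atomically writes 1 and returns the previous value (acquire barrier) */
        return __sync_lock_test_and_set(lock, 1);
    }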

43 Using Test-And-Set. Acquire a mutex with TAS. This is a spinlock: keep trying in a tight loop; often fastest if the lock is not held for long. Release is simple:

    void acquire(int *lock) {
        while (TAS(lock) == 1)
            ;                  // spin until TAS returns 0 (lock was free, and we now hold it)
    }

    void release(int *lock) {
        *lock = 0;             // mark the lock free again
    }

44 Performance of a TAS-based spinlock. How bad can this be? It turns out that TAS can be expensive: memory must be locked while a long operation occurs, i.e. we must do a read followed by a write while no one else can access memory. If we are spinning, this slows everything down. Can we make it faster?

45 Test-And-Test-And-Set. Replace most of the RMW cycles with simple reads:

    void acquire(int *lock) {
        do {
            while (*lock == 1)
                ;                      // plain read: spins in our own cache
        } while (TAS(lock) == 1);      // only do the expensive RMW when the lock looks free
    }

Think about the cache traffic: the reads hit in the spinner's cache. A write due to release() invalidates the cache line ⇒ the next load comes from main memory and returns 0 ⇒ this triggers one further RMW cycle from the spinner, which is highly likely to succeed (unless there is contention).

46 Beware of Test-And-Test-And-Set! You may be tempted to use this pattern in your regular code: test whether a value has changed outside a lock, and if so, acquire the lock and check again. Don't. It doesn't work. You need to understand: whether the compiler will reorder reads and writes; whether the processor will reorder reads and writes; whether the memory consistency model will change your code's semantics. In Java, this is almost bound to happen.

47 Test-And-Test-And-Set in systems code. Why does it work in systems code? TAS is a hardware instruction, and TAS is serializing: the processor won't reorder instructions past a TAS, and the compiler won't either (we hope), or at least the compiler and processor are told not to (memory barriers or fences, use of the volatile keyword). Even in an OS, the code depends on the processor architecture (its memory consistency model).

48 Other primitives: Fetch-And-Add. Atomically { add to a memory location; read the previous value }. Allows sequencing of operations: the event counts and sequencers model (contrast with mutexes and condition variables).

49 Mutual exclusion with Fetch-And-Add. Atomically { add to a memory location; read the previous value }:

    struct seq_t {
        int sequencer;     // initially 0
        int event_count;   // initially 0
    };

    void acquire(struct seq_t *s) {
        int ticket = FAA(&s->sequencer, 1);     // take the next ticket
        while (s->event_count != ticket)
            ;                                   // wait until our ticket is served
    }

    void release(struct seq_t *s) {
        FAA(&s->event_count, 1);                // admit the holder of the next ticket
    }
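The same sequencer / event-count idea written with C11 atomics, where FAA becomes atomic_fetch_add; this is the classic ticket lock. A sketch, not the lecture's code; the names are illustrative.

    #include <stdatomic.h>

    struct ticket_lock {
        atomic_int next_ticket;    /* the sequencer; initialise both fields to 0 */
        atomic_int now_serving;    /* the event count                            */
    };

    void ticket_acquire(struct ticket_lock *l) {
        int ticket = atomic_fetch_add(&l->next_ticket, 1);    /* take a ticket     */
        while (atomic_load(&l->now_serving) != ticket)
            ;                                                  /* wait for our turn */
    }

    void ticket_release(struct ticket_lock *l) {
        atomic_fetch_add(&l->now_serving, 1);                  /* admit the next ticket */
    }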

50 19.8: Compare and Swap

51 Compare-and-Swap. CAS(location, old, new), atomically: { 1. load location into value; 2. if value == old then store new to location; 3. return value }. Interesting features: theoretically more powerful than TAS, FAA, etc.; can implement almost all wait-free data structures; requires bus locking, or similar, in the memory system.
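The usual way CAS is used is a read-compute-CAS retry loop. A sketch (not from the slides) implementing an atomic add on top of the GCC/Clang CAS builtin:

    static inline int atomic_add_via_cas(volatile int *loc, int delta) {
        int old, desired;
        do {
            old = *loc;                   /* 1. read the current value                 */
            desired = old + delta;        /* 2. compute the value we want              */
        } while (__sync_val_compare_and_swap(loc, old, desired) != old);
        return desired;                   /* 3. CAS returned old => our swap succeeded */
    }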

52 Use of CAS for mutual exclusion. CAS is generally used in lock-free data structures: essentially, concurrent updates do not require locks. Such data structures are typically much larger than 1 word of memory. General pattern: read-copy-update. Readers all read the same data structure; writers take a copy, modify it, then write back the copy; the old version is deleted when all the readers are finished. CAS is used to change the pointer to the new version, as long as nothing else has changed it!
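A sketch of the read-copy-update publish step described above, using a GCC/Clang CAS builtin. copy_of() and the reclamation of the old version (reference counting, as in the next slides) are assumed and not shown; a real implementation would also discard a failed copy before retrying.

    struct data;                                   /* the shared structure            */
    struct data *global_ptr;                       /* readers follow this pointer     */
    struct data *copy_of(struct data *old);        /* assumed: allocate copy + modify */

    void writer_update(void) {
        struct data *old_version, *new_version;
        do {
            old_version = global_ptr;              /* snapshot the current version    */
            new_version = copy_of(old_version);    /* private copy, modified          */
        } while (!__sync_bool_compare_and_swap(&global_ptr, old_version, new_version));
        /* old_version is freed later, once all readers have dropped their references */
    }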

53 CAS for lock-free update. [Diagram: a global pointer to the data structure, Ref = 2; CPU 0 is a reader, CPU 1 a writer.] 1. The reader follows the global pointer and increments the ref count.

54 CAS for lock-free update. [Diagram: the global pointer still points at the original data structure (Ref = 2); the writer now holds a private copy (Ref = 1).] 2. The writer also reads the data structure and increments the ref count. 3. The writer copies the data and decrements the ref count on the original.

55 CAS for lock-free update. [Diagram: the original data structure now has Ref = 1; the updated copy has Ref = 2 and the global pointer points at it.] 3. The writer copies the data, decrements the ref count on the original, and updates the copy. 4. The writer uses CAS to swap the global pointer over to the updated copy.

56 CAS for lock-free update. [Diagram: the old data structure has been freed; the global pointer points at the copy (Ref = 1).] 5. The reader finishes and decrements the ref count on the old version; it is now 0, so the old data is freed. 6. The writer finishes and decrements the ref count on the copy; it is now 1.

57 The ABA problem. CAS has a problem: it reports whether a single location holds a different value, but not whether it has been written (possibly with the same value). This leads to the ABA problem: 1. CPU A reads the value as x. 2. CPU B writes y to the value. 3. CPU B writes x to the value. 4. CPU A reads the value as x and concludes nothing has changed. Many problems result: e.g., what if the value is a software stack pointer?

58 Solving the ABA problem. Basic problem: the value used for the CAS comparison has not changed, but the data has; CAS doesn't say whether a write has occurred, only whether a value has changed. Solution: ensure the value always changes! Split the value into: the original value plus a monotonically increasing counter, and CAS both halves in a single instruction.
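A sketch of the value-plus-counter trick (illustrative field names; assumes a 64-bit target and C11). Packing a 32-bit index and a 32-bit counter into one word means a single CAS covers both, so a reused index no longer looks unchanged.

    #include <stdint.h>

    typedef union {
        uint64_t raw;
        struct {
            uint32_t head;       /* e.g. index of the top free-list node */
            uint32_t counter;    /* bumped on every successful update    */
        };
    } versioned_t;

    _Static_assert(sizeof(versioned_t) == 8, "must fit in one CAS");

    int try_set_head(volatile uint64_t *loc, versioned_t expected, uint32_t new_head) {
        versioned_t desired;
        desired.head    = new_head;
        desired.counter = expected.counter + 1;    /* value changes even if head repeats */
        return __sync_bool_compare_and_swap(loc, expected.raw, desired.raw);
    }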

59 Double Compare-And-Swap (CAS2 or DCAS). CAS2(loc1, old1, new1, loc2, old2, new2) atomically: 1. compares two memory locations with different values; 2. if both match, each is updated with a different new value; 3. if not, the existing values in the locations are loaded. Rarely implemented: MC680x0. Was claimed to be more useful than plain CAS; shown more recently to be not so useful: everything can be done fast with CAS, if you're slightly clever.

60 What about x86? Bewildering array of options! XCHG: atomic exchange of register and memory location equivalent to Test And Set LOCK prefix (e.g. LOCK ADD, LOCK BTS, ) Executes instruction with bus locked LOCK XADD: Atomic Fetch and add CMPXCHG: 32 bit compare and swap CMPXCHG8B: 64 bit compare and swap Useful for solving ABA problem Not the same as CAS2! CMPXCHG16B: 128 bit compare and swap (x86_64 only!)

61 19.9: Load Locked / Store Conditional

62 Load-Locked / Store-Conditional. CAS requires bus locking for the read-modify-write: this complicates the memory bus logic, may be a memory bottleneck, and is very CISC-y, not a good match for load-store RISC. Alternative: load-locked / store-conditional (also called load-linked, or load-and-reserve), implemented in MIPS, PowerPC, ARM, Alpha, ... Theoretically as powerful as CAS: provides a lock-free atomic read-modify-write, factored into two instructions.

63 LL/SC usage. Atomic read-modify-write with multiple instructions: 1. load-locked <location> → register; 2. modify the value in the register; 3. store-conditional register → <location>; 4. test for SC success, and retry if not. The store will probably succeed if no updates to the location occur in the meantime. Note: LL/SC does not suffer from the ABA problem!
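C11's atomic_compare_exchange_weak is allowed to fail spuriously precisely so that it can be compiled to an LL/SC pair on machines like ARM or POWER, and it is meant for exactly this retry pattern. A sketch (not the lecture's code):

    #include <stdatomic.h>

    int atomic_increment(atomic_int *loc) {
        int old = atomic_load(loc);                              /* roughly: load-locked */
        while (!atomic_compare_exchange_weak(loc, &old, old + 1))
            ;   /* store-conditional failed (or value changed): old was refreshed, retry */
        return old + 1;
    }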

64 Implementation of LL/SC. Remember one address and re-use cache coherence: the processor remembers the physical address of the locked load and snoops on cache invalidates for that cache line; the store fails if an invalidate has been seen. Limitations: cache-line granularity (danger of false sharing); context switches, interrupts, etc. produce false positives. Referred to by theorists as "weak" LL/SC.

65 Further reading: the definitive work on shared-memory synchronization and lock-free or wait-free algorithms.

66 19.10: Simultaneous multithreading

67 Performance limits of SMP. Cache-coherent SMP still has memory as a bottleneck: all accesses to main memory stall the processor. This limits scalability: remember Amdahl's law? MOESI allows reads to be serviced from another cache, but the cache itself can also be slow.

68 Performance limits of SMP. Cache-coherent SMP still has memory as a bottleneck: all accesses to main memory stall the processor. This limits scalability: remember Amdahl's law? MOESI allows reads to be serviced from another cache, but the cache itself can also be slow. Memory stalls halt the processor, and other processors accessing memory; and the more processors there are, the more often this will happen.

69 Simultaneous Multithreading. Can we do anything useful while the processor is waiting for memory (or another cache)? Most functional units (ALU, etc.) are idle, and many instructions don't require the memory unit. But ILP is limited: we can't execute many other instructions in the same thread because of data dependencies. So what about instructions in other threads? Add multiple fetch/decode units and register sets, and re-use the superscalar functional units.

70 Conventional superscalar. [Block diagram: instruction control (fetch control, instruction cache, instruction decode, retirement unit, register file, branch prediction) issuing operations to the execution unit's functional units (integer/branch, general integer, FP add, FP multiply/divide, load, store) and the data cache, with operation results and register updates flowing back.]

71 SMT (or Hyperthreading). Label instructions in hardware with a thread ID: a logical extension of superscalar techniques, now with multiple independent instruction streams. Fine-grained multithreading: select among threads on a per-instruction basis. Coarse-grained multithreading: switch between threads on a memory stall. A multithreaded CPU appears to the OS as multiple CPUs!

72 Is it worth it? Well... it can be slower than a single thread! It might get no advantage from memory accesses, and multiple threads compete for the cache; the advantage is limited (10-20% is typical). BUT: it's cheap (in transistors). It depends on the workload: not necessarily good for scientific computing, but good for lots of memory-intensive requests, e.g. web servers (the target application domain for Sun Niagara, Rock, etc.).

73 19.11: Non Uniform Memory Access (NUMA)

74 SMP architecture. [Diagram: Core 1, Core 2, Core 3, and Core 4 sharing a single path to main memory.]

75 SMP architecture. More cores bring more cycles, but not necessarily proportionately more cache... nor more off-chip bandwidth or total RAM capacity. [Diagram: more cores still sharing the same path to main memory.]

76 SMP architecture. More cores and faster cores use more memory bandwidth; buses are replaced with interconnection networks. [Diagram: per-core caches connected to main memory and the I/O system over an interconnection network.]

77 Distributed memory architecture. Adding more CPUs brings more of most other things: each node has CPUs, its own RAM, and a directory. Locality property: only go to the interconnect for real I/O or sharing. This covers DSM/NUMA and message passing (e.g. clusters), and could scale to 100s or 1000s of cores. [Diagram: many nodes, each with CPUs, local RAM, and a directory, attached to an interconnect.]

78 Non-Uniform Memory Access. [Diagram: CPU 0 and CPU 1, each with a cache and local RAM, joined by an interconnect; local accesses are fast(ish), remote accesses over the interconnect are slow.] All memory is globally addressable, but local memory is faster.

79 Non-Uniform Memory Access. [Diagram: CPU 0 and CPU 1, each with a cache and local RAM, connected by an interconnect.] NUMA interconnects are not new, but are somewhat new in PCs: AMD HyperTransport (originally the Alpha EV7 interconnect), Intel CSI/QuickPath.

80 19.12: NUMA cache coherence

81 NUMA cache coherence. Can't snoop on the bus any more: it's not a bus! NUMA machines use a message-passing interconnect. Solution 1: bus emulation. Much more complicated than it sounds; similar to snooping, but without a shared bus. Each node sends a message to all other nodes (e.g. "read exclusive") and waits for a reply from all nodes (e.g. "acknowledge") before proceeding. Example: AMD coherent HyperTransport.

82 Solution 2: cache directory. Augment each node's local memory with a cache directory. [Diagram: for each cache line (e.g. 64 bytes) of the memory itself (usually DRAM), the directory records the node ID of the owner of the cache line and 1 bit per node indicating presence of the line.]
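What one directory entry might look like, as a sketch for a machine with up to 64 nodes (field names and widths are illustrative, not taken from any real implementation):

    #include <stdint.h>

    struct dir_entry {           /* one per cache line of this node's local memory       */
        uint8_t  owner;          /* node ID of the line's current owner                  */
        uint64_t present;        /* bit i set => node i may hold a copy (up to 64 nodes) */
    };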

83 Directory-based cache coherence. The home node maintains the set of nodes that may have the line. Used in large multiprocessors, plus AMD HT Assist and Intel Beckton (QuickPath). More efficient when: 1. lines are not widely shared, and 2. there are lots of NUMA nodes. Avoids broadcast/incast; reduces interconnect traffic and the load at each node. But it requires lots of fast memory: HT Assist can lose up to 33% of the L3!

84 19.13: Multicore

85 Moore's law: the free lunch. [Chart: transistor counts over time for Intel processors: 8086, 486, Pentium, Pentium II, Pentium 4, Itanium.]

86 The power wall. [Image: AMD Athlon 1500 processor.]

87 The power wall. Dynamic power dissipation = capacitive load × voltage² × frequency. An increase in clock frequency means more power dissipated and more cooling required; a decrease in voltage reduces dynamic power but increases static power (leakage). We've reached the practical power limit for cooling commodity microprocessors: we can't increase the clock frequency without expensive cooling.
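A quick worked illustration of the formula above (the 20% figures are just an example):

    P_{dyn} \propto C \cdot V^2 \cdot f, \qquad
    \frac{P'}{P} = \left(\frac{V'}{V}\right)^{2} \cdot \frac{f'}{f}
                 = 0.8^2 \times 0.8 \approx 0.51

so scaling both voltage and frequency down to 80% roughly halves dynamic power; conversely, pushing the clock up usually needs a higher voltage, so power grows much faster than frequency.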

88 The memory wall. [Plot: operations per µs for CPU vs. memory over time, log scale; CPU clock rates have grown far faster than memory speeds. Annotation: the CPU clock in MHz vs. 500 ns to access memory in 1985.]

89 The ILP wall. ILP = instruction-level parallelism: implicit parallelism between instructions in one thread. The processor can re-order and pipeline instructions, split them into micro-instructions, do aggressive branch prediction, etc., but this requires hardware safeguards to prevent potential errors from out-of-order execution, and it increases execution-unit complexity and the associated power consumption. Diminishing returns: serial performance acceleration using ILP has stalled.

90 End of the road for serial hardware. Power wall + ILP wall + memory wall = brick wall. Power wall: can't clock processors any faster. Memory wall: for many workloads, performance is dominated by memory access times. ILP wall: can't keep the functional units busy while waiting for memory accesses. There is also a complexity wall, but chip designers don't like to talk about it.

91 Multicore processors Multiple processor cores per chip This is the future (and present) of computing Most multicore chips so far are shared memory multiprocessors (SMP) Single physical address space shared by all processors Communication between processors happens through shared variables in memory Hardware typically provides cache coherence

92 Implications for software. [Plot: log sequential performance vs. year. Historically, single-thread performance gains came from improved clock rates and transistors spent extracting ILP; transistor counts are still growing, but are now delivered as additional cores and accelerators.] The things that would have used the lost sequential performance must now be written to use those cores and accelerators.

93 Intel Nehalem EX (Beckton) 2 threads per core

94 Intel Nehalem EX 8 socket configuration Interconnect is in fact a cube with added main diagonals:

95 8-socket, 32-core AMD Barcelona. [Diagram: eight sockets, each with its own L3 cache and bank of RAM, interconnected, with attached I/O: PCIe, PCI, SATA, 1GbE, floppy disk drive.]

96 Cache access latency. L1, L2, L3: [latency figures missing from the transcription]. Memory is a bit faster than L3, but it's complicated.

97 19.14: Performance implications of multicore

98 Performance implications. 1. Memory latency: how long does it take to load from / store to memory? It depends on the physical address; high-performance code allocates memory accordingly! 2. Contention: loading a cache line from RAM is slow, but so is loading it from another cache; avoid cache-to-cache transfers wherever possible.

99 Memory latency. [Diagram: eight NUMA nodes, each with cores, an L3 cache, and a local bank of RAM. The thread runs on a core of one node and should allocate its private data from that node's bank of RAM; memory on nearby nodes is slower to access from the thread, and memory on distant nodes much slower.]
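On Linux, one way to follow this advice is libnuma (link with -lnuma). A sketch, with node 0 chosen purely as an example:

    #include <numa.h>
    #include <stdio.h>

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this system\n");
            return 1;
        }
        int node = 0;                                /* illustrative choice           */
        numa_run_on_node(node);                      /* run on the cores of that node */
        size_t sz = 1 << 20;
        void *buf = numa_alloc_onnode(sz, node);     /* RAM physically local to node  */
        /* ... use buf from threads running on this node ... */
        numa_free(buf, sz);
        return 0;
    }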

100 False sharing. [Diagram: a single cache line containing one word written often by thread A and another written often by thread B.] For best performance on a single CPU, pack data into a single cache line: better locality, more cache coverage. For threads on different CPUs this is very bad: the cache line ping-pongs between the caches on every write!
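A sketch of the usual fix: give each writer its own cache line by padding/aligning (64 bytes is the common x86 line size; adjust for the actual machine).

    #include <stdalign.h>

    struct padded_counter {
        alignas(64) long count;            /* each counter starts a fresh 64-byte line  */
    };                                     /* sizeof(struct padded_counter) is now 64   */

    struct padded_counter per_thread[2];   /* [0] written by thread A, [1] by thread B  */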

101 19.15: Optimization example: MCS locks

102 Example. There are many tricks to making things go fast on a multiprocessor, closely related to lock-free data structures, etc. Example: MCS locks. Possibly the best-known locking system for multiprocessors, with excellent cache properties: only spin on local data, and only one processor wakes up on release().

103 MCS locks [Mellor-Crummey and Scott, 1991]. Problem: the cache line containing the lock is a hot spot, continuously invalidated as every processor tries to acquire it; it dominates interconnect traffic. Solution: when acquiring, a processor enqueues itself on a list of waiting processors and spins on its own entry in the list; when releasing, only the next processor is woken.

104 MCS lock pseudocode (acquire). The lock variable points at the last element of a queue of spinning processors; each caller passes a qnode held in local memory:

    struct qnode {
        struct qnode *next;
        int locked;
    };
    typedef struct qnode *lock_t;   // the lock: tail pointer of the queue

    void acquire(lock_t *lock, struct qnode *local) {
        local->next = NULL;
        struct qnode *prev = XCHG(lock, local);   // swap ourselves in as the new tail
        if (prev) {                               // queue was non-empty
            local->locked = 1;
            prev->next = local;                   // point the previous tail at us
            while (local->locked)
                ;                                 // spin on our own qnode
        }
    }

1. Add ourselves to the end of the queue using XCHG. 2. If the queue was empty, we have the lock! 3. If not, point the previous tail at us, and spin.

105 MCS lock pseudocode (release).

    void release(lock_t *lock, struct qnode *local) {
        if (local->next == NULL) {                // nobody visibly waiting
            if (CAS(lock, local, NULL))           // still the tail: lock is now free
                return;
            while (local->next == NULL)
                ;                                 // someone is mid-acquire: wait for them to enqueue
        }
        local->next->locked = 0;                  // hand the lock to the next waiter
    }

1. We have the lock. Is someone waiting after us? 2. If yes, tell them, and they will do the rest (see acquire()). 3. If no, set the lock to NULL, unless someone appears in the meantime. 4. If they do, wait for them to enqueue, and then go to (2).
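The XCHG and CAS operations assumed by the pseudocode above can be expressed with GCC/Clang atomic builtins; these wrappers are illustrative, not the lecture's code.

    struct qnode;    /* as declared on the previous slides */

    static inline struct qnode *XCHG(struct qnode **lock, struct qnode *node) {
        /* atomically: old = *lock; *lock = node; return old; */
        return __atomic_exchange_n(lock, node, __ATOMIC_ACQ_REL);
    }

    static inline int CAS(struct qnode **lock, struct qnode *expected, struct qnode *desired) {
        /* atomically: if (*lock == expected) { *lock = desired; return 1; } else return 0; */
        return __sync_bool_compare_and_swap(lock, expected, desired);
    }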

106 MCS lock performance. [Graph: lock throughput vs. number of cores on a 4x4-core AMD Opteron running Linux; Boyd-Wickizer et al., 2008.]

107 19.16: A quick look at the future

108 The future: processors are becoming more diverse Different numbers of cores, threads Different sizes and numbers of caches Shared in different ways between cores Different extensions to an instruction set Crypto instructions Graphics acceleration Different consistency models And potentially incoherent memory

109 Sun Niagara 2 (UltraSPARC T2) 8 threads per core

110 Intel Knight's Ferry (now "Xeon Phi")

111 Tilera TilePro64

112 Beyond multicore: heterogeneous manycore. Cores will vary: some large, superscalar, out-of-order cores; many small, in-order cores (less power); some will be specialized (graphics, crypto, network, etc.); some will have different instructions, or different instruction sets. Cores will come and go: constantly powered on or off to save energy; it may not even be possible to run them all at the same time. Every machine will be different: hardware is now changing faster than system software; you can't tune your OS to one type of machine any more; caches may not be coherent any more.

113 Example: Intel Single Chip Cloud Computer

114 [SCC layout:] 24 tiles, each with 2 P54C processors (the 100 MHz Pentium; in-order execution), 256 KB of cache per core (non-coherent, no write-allocate), 16 KB of fast SRAM (the "MPB", message-passing buffer), an 8-port router, and configuration registers.

115 Our part of the picture: Barrelfish. An open-source OS written from scratch; a collaboration between ETHZ and Microsoft Research. Goals: scale to many cores; adapt to different hardware; not rely on cache coherence; keep up with trends.

116 20: Summary

117 Course summary. How to program in C: a close-to-the-metal language; you can understand the disassembly from the compiler. What hardware looks like to software, how hardware designers make it go fast, and what this means for software. What the compiler does: optimizations, transformations. How to make code correct (e.g. floating point, memory consistency, devices, etc.) and fast (caches, pipelines, vectors, DMA, etc.).

118 Course goals. Make you into an extraordinary programmer. Most programmers don't know this material, and it shows in their code. Truly great programmers understand all this, and it shows in their code.

119 Thank you! You've been a wonderful audience. Good luck in the exam! Have a great holiday.


More information

Synchronization. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

Synchronization. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University Synchronization Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu SSE3054: Multicore Systems, Spring 2017, Jinkyu Jeong (jinkyu@skku.edu) Types of Synchronization

More information

Snooping-Based Cache Coherence

Snooping-Based Cache Coherence Lecture 10: Snooping-Based Cache Coherence Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2017 Tunes Elle King Ex s & Oh s (Love Stuff) Once word about my code profiling skills

More information

Computer Systems Architecture

Computer Systems Architecture Computer Systems Architecture Lecture 23 Mahadevan Gomathisankaran April 27, 2010 04/27/2010 Lecture 23 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student

More information

Relaxed Memory-Consistency Models

Relaxed Memory-Consistency Models Relaxed Memory-Consistency Models [ 9.1] In Lecture 13, we saw a number of relaxed memoryconsistency models. In this lecture, we will cover some of them in more detail. Why isn t sequential consistency

More information

Kernel Synchronization I. Changwoo Min

Kernel Synchronization I. Changwoo Min 1 Kernel Synchronization I Changwoo Min 2 Summary of last lectures Tools: building, exploring, and debugging Linux kernel Core kernel infrastructure syscall, module, kernel data structures Process management

More information

Chapter Seven Morgan Kaufmann Publishers

Chapter Seven Morgan Kaufmann Publishers Chapter Seven Memories: Review SRAM: value is stored on a pair of inverting gates very fast but takes up more space than DRAM (4 to 6 transistors) DRAM: value is stored as a charge on capacitor (must be

More information

Advanced Topic: Efficient Synchronization

Advanced Topic: Efficient Synchronization Advanced Topic: Efficient Synchronization Multi-Object Programs What happens when we try to synchronize across multiple objects in a large program? Each object with its own lock, condition variables Is

More information

5008: Computer Architecture

5008: Computer Architecture 5008: Computer Architecture Chapter 4 Multiprocessors and Thread-Level Parallelism --II CA Lecture08 - multiprocessors and TLP (cwliu@twins.ee.nctu.edu.tw) 09-1 Review Caches contain all information on

More information

Multiple Issue and Static Scheduling. Multiple Issue. MSc Informatics Eng. Beyond Instruction-Level Parallelism

Multiple Issue and Static Scheduling. Multiple Issue. MSc Informatics Eng. Beyond Instruction-Level Parallelism Computing Systems & Performance Beyond Instruction-Level Parallelism MSc Informatics Eng. 2012/13 A.J.Proença From ILP to Multithreading and Shared Cache (most slides are borrowed) When exploiting ILP,

More information

Parallelism, Multicore, and Synchronization

Parallelism, Multicore, and Synchronization Parallelism, Multicore, and Synchronization Hakim Weatherspoon CS 3410 Computer Science Cornell University [Weatherspoon, Bala, Bracy, McKee, and Sirer, Roth, Martin] xkcd/619 3 Big Picture: Multicore

More information

Portland State University ECE 588/688. Cray-1 and Cray T3E

Portland State University ECE 588/688. Cray-1 and Cray T3E Portland State University ECE 588/688 Cray-1 and Cray T3E Copyright by Alaa Alameldeen 2014 Cray-1 A successful Vector processor from the 1970s Vector instructions are examples of SIMD Contains vector

More information

Synchronization. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

Synchronization. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University Synchronization Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Types of Synchronization Mutual Exclusion Locks Event Synchronization Global or group-based

More information

Multi-core Architectures. Dr. Yingwu Zhu

Multi-core Architectures. Dr. Yingwu Zhu Multi-core Architectures Dr. Yingwu Zhu What is parallel computing? Using multiple processors in parallel to solve problems more quickly than with a single processor Examples of parallel computing A cluster

More information

Lecture 11: Cache Coherence: Part II. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015

Lecture 11: Cache Coherence: Part II. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015 Lecture 11: Cache Coherence: Part II Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Bang Bang (My Baby Shot Me Down) Nancy Sinatra (Kill Bill Volume 1 Soundtrack) It

More information

EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy

EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy Prof. Sherief Reda School of Engineering Brown University Material from: Parallel Computer Organization and Design by Debois,

More information

Comp. Org II, Spring

Comp. Org II, Spring Lecture 11 Parallel Processor Architectures Flynn s taxonomy from 1972 Parallel Processing & computers 8th edition: Ch 17 & 18 Earlier editions contain only Parallel Processing (Sta09 Fig 17.1) 2 Parallel

More information

CS377P Programming for Performance Multicore Performance Multithreading

CS377P Programming for Performance Multicore Performance Multithreading CS377P Programming for Performance Multicore Performance Multithreading Sreepathi Pai UTCS October 14, 2015 Outline 1 Multiprocessor Systems 2 Programming Models for Multicore 3 Multithreading and POSIX

More information

ESE 545 Computer Architecture Symmetric Multiprocessors and Snoopy Cache Coherence Protocols CA SMP and cache coherence

ESE 545 Computer Architecture Symmetric Multiprocessors and Snoopy Cache Coherence Protocols CA SMP and cache coherence Computer Architecture ESE 545 Computer Architecture Symmetric Multiprocessors and Snoopy Cache Coherence Protocols 1 Shared Memory Multiprocessor Memory Bus P 1 Snoopy Cache Physical Memory P 2 Snoopy

More information

Parallel Algorithm Engineering

Parallel Algorithm Engineering Parallel Algorithm Engineering Kenneth S. Bøgh PhD Fellow Based on slides by Darius Sidlauskas Outline Background Current multicore architectures UMA vs NUMA The openmp framework and numa control Examples

More information