Modern High-Performance Locking

Size: px
Start display at page:

Download "Modern High-Performance Locking"

Transcription

1 Modern High-Performance Locking Nir Shavit Slides based in part on The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit

2 Locks (Mutual Exclusion) public interface Lock { public void lock(); public void unlock(); } Art of Multiprocessor Programming 2

3 Locks (Mutual Exclusion) public interface Lock { public void lock(); acquire lock public void unlock(); } Art of Multiprocessor Programming 3

4 Locks (Mutual Exclusion) public interface Lock { public void lock(); public void unlock(); } acquire lock release lock Art of Multiprocessor Programming 4

5 Mutual Exclusion Properties Mutual Exclusion At most one thread holds the lock (has completed lock() and not completed unlock()) at any time Art of Multiprocessor Programming 5

6 Mutual Exclusion Properties Freedom from Deadlock If a thread calls lock() or unlock() and never returns, then other threads must complete invocations of lock() and unlock() infinitely often. Art of Multiprocessor Programming 6

7 Mutual Exclusion Properties Freedom from Starvation Every call to lock() or unlock() eventually returns. Art of Multiprocessor Programming 7

8 Locking and Amdahl s Law Coarse Grained Speedup= 1 1 p + p n Fine Grained 25% Shared Parallel fraction 25% Shared 75% Unshared 75% Unshared Concurrent Program Concurrent Program

9 Locking and Amdahl s Law Honk! Honk! Honk! Amdahls Law: only 2.9x speedup with 8 processors Coarse Grained Fine Grained 25% Shared 25% Shared 75% Unshared 75% Unshared Concurrent Program Concurrent Program

10 Locking and Amdahl s Law Honk! Honk! Honk! Only 4x speedup with processors Coarse Grained Fine Grained 25% Shared 25% Shared 75% Unshared 75% Unshared Concurrent Program Concurrent Program

11 Locking and Amdahl s Law Honk! Honk! Honk! Why fine-grained low overhead locking maters Coarse Grained Fine Grained 25% Shared 25% Shared 75% Unshared 75% Unshared Concurrent Program Concurrent Program

12 What Should you do if you can t get a lock? Keep trying spin or busy-wait Good if delays are short Give up the processor Good if delays are long Always good on uniprocessor Art of Multiprocessor Programming 12 (1)

13 What Should you do if you can t get a lock? Keep trying spin or busy-wait Good if delays are short Give up the processor Good if delays are long Always good on uniprocessor our focus Art of Multiprocessor Programming 13

14 Basic Spin-Lock CS. spin lock critical section Resets lock upon exit Art of Multiprocessor Programming 14

15 Basic Spin-Lock lock introduces sequential bottleneck CS. spin lock critical section Resets lock upon exit Art of Multiprocessor Programming 15

16 Basic Spin-Lock lock suffers from contention CS. spin lock critical section Resets lock upon exit Art of Multiprocessor Programming 16

17 Basic Spin-Lock lock suffers from contention CS. spin lock critical section Resets lock upon exit These are distinct phenomena Art of Multiprocessor Programming 17

18 Basic Spin-Lock lock suffers from contention CS. spin lock critical section Resets lock upon exit Seq Bottleneck à no parallelism Art of Multiprocessor Programming 18

19 Basic Spin-Lock lock suffers from contention CS. spin lock critical section Resets lock upon exit Contention à overloaded communication medium Art of Multiprocessor Programming 19

20 Mutual Exclusion What do we want to optimize? Bus bandwidth used by spinning threads Release/Acquire latency Acquire latency for idle lock Art of Multiprocessor Programming 20

21 Review: Test-and-Set public class AtomicBoolean { boolean value; public synchronized boolean getandset(boolean newvalue) { boolean prior = value; value = newvalue; return prior; } } Art of Multiprocessor Programming 21 (5)

22 Review: Test-and-Set public class AtomicBoolean { boolean value; public synchronized boolean getandset(boolean newvalue) { boolean prior = value; value = newvalue; return prior; } } Package java.util.concurrent.atomic Art of Multiprocessor Programming 22

23 Review: Test-and-Set public class AtomicBoolean { boolean value; public synchronized boolean getandset(boolean newvalue) { boolean prior = value; value = newvalue; return prior; } } Swap old and new values. Art of Multiprocessor Programming 23

24 Review: Test-and-Set AtomicBoolean lock = new AtomicBoolean(false) boolean prior = lock.getandset(true) Art of Multiprocessor Programming 24

25 Review: Test-and-Set AtomicBoolean lock = new AtomicBoolean(false) boolean prior = lock.getandset(true) Swapping in true is called test-and-set or TAS. Both Swap and TAS available in hardware. Art of Multiprocessor Programming 25 (5)

26 Test-and-Set Locks Locking Lock is free: value is false Lock is taken: value is true Acquire lock by calling TAS If result is false, you win If result is true, you lose Release lock by writing false Art of Multiprocessor Programming 26

27 Simple TASLock TAS invalidates cache lines Spinners Miss in cache Go to bus Thread wants to release lock delayed behind spinners Art of Multiprocessor Programming 27

28 Test-and-Test-and-Set Locks Lurking stage Wait until lock looks free Spin while read returns true (lock taken) Pouncing state As soon as lock looks available Read returns false (lock free) Call TAS to acquire lock If TAS loses, back to lurking Art of Multiprocessor Programming 28

29 Test-and-test-and-set Lock class TTASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (true) { while (state.get()) {} if (!state.getandset(true)) return; } } Art of Multiprocessor Programming 29

30 Test-and-test-and-set Lock class TTASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (true) { while (state.get()) {} if (!state.getandset(true)) return; } } Wait until lock looks free Art of Multiprocessor Programming 30

31 Test-and-test-and-set Lock class TTASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (true) { while (state.get()) {} if (!state.getandset(true)) return; } } Then try to acquire it Art of Multiprocessor Programming 31

32 Test-and-test-and-set Wait until lock looks free Spin on local cache No bus use while lock busy Problem: when lock is released Invalidation storm Art of Multiprocessor Programming 32

33 Problems Everyone misses Reads satisfied sequentially Everyone does TAS Invalidates others caches Eventually quiesces after lock acquired Quiescence time often linear in number of cores Art of Multiprocessor Programming 33

34 Solution: Introduce Delay If the lock looks free - but I fail to get it There must be contention - better to back off than to collide again time r 2 d r 1 d d spin lock Art of Multiprocessor Programming 34

35 Dynamic Example: Exponential Backoff time 4d 2d d spin lock If I fail to get lock Wait random duration before retry Each subsequent failure doubles expected wait Art of Multiprocessor Programming 35

36 Exponential Backoff Lock public class Backoff implements lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!state.getandset(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; }}} Art of Multiprocessor Programming 36

37 Exponential Backoff Lock public class Backoff implements lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!state.getandset(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; }}} Fix minimum delay Art of Multiprocessor Programming 37

38 Exponential Backoff Lock public class Backoff implements lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!state.getandset(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; }}} Wait until lock looks free Art of Multiprocessor Programming 38

39 Exponential Backoff Lock public class Backoff implements lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!state.getandset(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; }}} If we win, return Art of Multiprocessor Programming 39

40 Exponential Backoff Lock public class Backoff implements lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!state.getandset(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; }}} Back off for random duration Art of Multiprocessor Programming 40

41 Exponential Backoff Lock public class Backoff implements lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!state.getandset(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; }}} Double max delay, within reason Art of Multiprocessor Programming 41

42 Actual Data on 40-Core Machine Art of Multiprocessor Programming 42

43 Backoff: Other Issues Good Easy to implement Beats TTAS lock Bad Must choose parameters carefully Not portable across platforms Art of Multiprocessor Programming 43

44 Idea Avoid useless invalidations By keeping a queue of threads Each thread Notifies next in line Without bothering the others Art of Multiprocessor Programming 44

45 CLH Lock First Come First Served order Small, constant-size overhead per thread Art of Multiprocessor Programming 45

46 Initially idle tail false Art of Multiprocessor Programming 46

47 Initially idle tail false Queue tail Art of Multiprocessor Programming 47

48 Initially idle tail false Lock is free Art of Multiprocessor Programming 48

49 Initially idle tail false Art of Multiprocessor Programming 49

50 Purple Wants the Lock acquiring tail false Art of Multiprocessor Programming 50

51 Purple Wants the Lock acquiring tail false true Art of Multiprocessor Programming 51

52 Purple Wants the Lock acquiring Swap tail false true Art of Multiprocessor Programming 52

53 Purple Has the Lock acquired tail false true Art of Multiprocessor Programming 53

54 Red Wants the Lock acquired acquiring tail false true true Art of Multiprocessor Programming 54

55 Red Wants the Lock acquired acquiring tail false Swap true true Art of Multiprocessor Programming 55

56 Red Wants the Lock acquired acquiring tail false true true Art of Multiprocessor Programming 56

57 Red Wants the Lock acquired acquiring tail false true true Art of Multiprocessor Programming 57

58 Red Wants the Lock acquired acquiring Implicit Linked list tail false true true Art of Multiprocessor Programming 58

59 Red Wants the Lock acquired acquiring tail false true true Art of Multiprocessor Programming 59

60 Red Wants the Lock acquired acquiring tail false true true true Actually, it spins on cached copy Art of Multiprocessor Programming 60

61 Purple Releases release acquiring false Bingo! tail false false true Art of Multiprocessor Programming 61

62 Purple Releases released acquired tail true Art of Multiprocessor Programming 62

63 CLH Queue Lock class Qnode { AtomicBoolean locked = } new AtomicBoolean(true); Art of Multiprocessor Programming 63

64 CLH Queue Lock class Qnode { AtomicBoolean locked = } new AtomicBoolean(true); Not released yet Art of Multiprocessor Programming 64

65 CLH Queue Lock class CLHLock implements Lock { AtomicReference<Qnode> tail; ThreadLocal<Qnode> mynode = new Qnode(); public void lock() { Qnode pred = tail.getandset(mynode); while (pred.locked) {} }} Art of Multiprocessor Programming 65

66 CLH Queue Lock class CLHLock implements Lock { AtomicReference<Qnode> tail; ThreadLocal<Qnode> mynode = new Qnode(); public void lock() { Qnode pred = tail.getandset(mynode); while (pred.locked) {} }} Queue tail Art of Multiprocessor Programming 66

67 CLH Queue Lock class CLHLock implements Lock { AtomicReference<Qnode> tail; ThreadLocal<Qnode> mynode = new Qnode(); public void lock() { Qnode pred = tail.getandset(mynode); while (pred.locked) {} }} Thread-local Qnode Art of Multiprocessor Programming 67

68 CLH Queue Lock class CLHLock implements Lock { AtomicReference<Qnode> tail; ThreadLocal<Qnode> mynode = new Qnode(); public void lock() { Qnode pred = tail.getandset(mynode); while (pred.locked) {} }} Swap in my node Art of Multiprocessor Programming 68

69 CLH Queue Lock class CLHLock implements Lock { AtomicReference<Qnode> tail; ThreadLocal<Qnode> mynode = new Qnode(); public void lock() { Qnode pred = tail.getandset(mynode); while (pred.locked) {} }} Spin until predecessor releases lock Art of Multiprocessor Programming 69

70 CLH Queue Lock Class CLHLock implements Lock { } } public void unlock() { mynode.locked.set(false); mynode = pred; Art of Multiprocessor Programming 70

71 CLH Queue Lock Class CLHLock implements Lock { } } public void unlock() { mynode.locked.set(false); mynode = pred; Notify successor Art of Multiprocessor Programming 71

72 CLH Queue Lock Class CLHLock implements Lock { } } public void unlock() { mynode.locked.set(false); mynode = pred; Recycle predecessor s node Art of Multiprocessor Programming 72

73 CLH Queue Lock Class CLHLock implements Lock { } } public void unlock() { mynode.locked.set(false); mynode = pred; (Here we don t actually reuse mynode. Can see how it s done in Art of Multiprocessor Programming book) Art of Multiprocessor Programming 73

74 CLH Lock Good Lock release affects predecessor only Small, constant-sized space Bad Doesn t work for uncached NUMA architectures Art of Multiprocessor Programming 74

75 NUMA and cc-numa Architectures Acronym: Non-Uniform Memory Architecture ccnuma = cache coherent NUMA Illusion: Flat shared memory Truth: No caches (sometimes) Some memory regions faster than others Art of Multiprocessor Programming 75

76 NUMA Machines Spinning on local memory is fast Art of Multiprocessor Programming 76

77 NUMA Machines Spinning on remote memory is slow Art of Multiprocessor Programming 77

78 CLH Lock Each thread spins on predecessor s memory Could be far away Art of Multiprocessor Programming 78

79 MCS Lock FCFS order Spin on local memory only Small, Constant-size overhead Art of Multiprocessor Programming 79

80 Initially idle tail false Art of Multiprocessor Programming 80

81 Acquiring acquiring (allocate Qnode) tail false true Art of Multiprocessor Programming 81

82 Acquiring acquired tail swap false true Art of Multiprocessor Programming 82

83 Acquiring acquired tail false true Art of Multiprocessor Programming 83

84 Acquired acquired tail false true Art of Multiprocessor Programming 84

85 Acquiring acquired acquiring tail swap false true Art of Multiprocessor Programming 85

86 Acquiring acquired acquiring tail false true Art of Multiprocessor Programming 86

87 Acquiring acquired acquiring tail false true Art of Multiprocessor Programming 87

88 Acquiring acquired acquiring tail false true Art of Multiprocessor Programming 88

89 Acquiring acquired acquiring tail true false true Art of Multiprocessor Programming 89

90 Acquiring acquired acquiring tail true Yes! false true Art of Multiprocessor Programming 90

91 MCS Queue Lock class Qnode { volatile boolean locked = false; volatile qnode next = null; } Art of Multiprocessor Programming 91

92 MCS Queue Lock class MCSLock implements Lock { AtomicReference tail; public void lock() { Qnode qnode = new Qnode(); Qnode pred = tail.getandset(qnode); if (pred!= null) { qnode.locked = true; pred.next = qnode; while (qnode.locked) {} }}} Art of Multiprocessor Programming 92

93 MCS Queue Lock class MCSLock implements Lock { AtomicReference tail; public void lock() { Qnode qnode = new Qnode(); Qnode pred = tail.getandset(qnode); if (pred!= null) { qnode.locked = true; pred.next = qnode; while (qnode.locked) {} }}} Make a QNode Art of Multiprocessor Programming 93

94 MCS Queue Lock class MCSLock implements Lock { AtomicReference tail; public void lock() { Qnode qnode = new Qnode(); Qnode pred = tail.getandset(qnode); if (pred!= null) { qnode.locked = true; pred.next = qnode; while (qnode.locked) {} }}} add my Node to the tail of queue Art of Multiprocessor Programming 94

95 MCS Queue Lock class MCSLock implements Lock { AtomicReference tail; public void lock() { Qnode qnode = new Qnode(); Qnode pred = tail.getandset(qnode); if (pred!= null) { qnode.locked = true; pred.next = qnode; while (qnode.locked) {} }}} Fix if queue was non-empty Art of Multiprocessor Programming 95

96 MCS Queue Lock class MCSLock implements Lock { AtomicReference tail; public void lock() { Qnode qnode = new Qnode(); Qnode pred = tail.getandset(qnode); if (pred!= null) { qnode.locked = true; pred.next = qnode; while (qnode.locked) {} }}} Wait until unlocked Art of Multiprocessor Programming 96

97 Purple Release releasing swap false false Art of Multiprocessor Programming 97

98 Purple Release releasing I don t see a successor. But by looking at the queue, I see another thread is active swap false false Art of Multiprocessor Programming 98

99 Purple Release releasing I don t see a successor. But by looking at the queue, I see another thread is active swap false false I have to release that thread so must wait for it to identify its node Art of Multiprocessor Programming 99

100 Purple Release releasing prepare to spin true false Art of Multiprocessor Programming 100

101 Purple Release releasing spinning true false Art of Multiprocessor Programming 101

102 Purple Release releasing spinning false true false Art of Multiprocessor Programming 102

103 Purple Release releasing Acquired lock false true false Art of Multiprocessor Programming 103

104 MCS Queue Unlock class MCSLock implements Lock { AtomicReference tail; public void unlock() { if (qnode.next == null) { if (tail.cas(qnode, null) return; while (qnode.next == null) {} } qnode.next.locked = false; }} Art of Multiprocessor Programming 104

105 MCS Queue Lock class MCSLock implements Lock { AtomicReference tail; public void unlock() { if (qnode.next == null) { if (tail.cas(qnode, null) return; while (qnode.next == null) {} } qnode.next.locked = false; }} Missing successor? Art of Multiprocessor Programming 105

106 MCS Queue Lock class MCSLock implements Lock { If really no successor, AtomicReference tail; public void unlock() { if (qnode.next == null) { if (tail.cas(qnode, null) return; while (qnode.next == null) {} } qnode.next.locked = false; }} return Art of Multiprocessor Programming 106

107 MCS Queue Lock class MCSLock implements Lock { Otherwise wait for AtomicReference tail; public void unlock() { if (qnode.next == null) { if (tail.cas(qnode, null) return; while (qnode.next == null) {} } qnode.next.locked = false; }} successor to catch up Art of Multiprocessor Programming 107

108 MCS Queue Lock class MCSLock implements Lock { AtomicReference queue; public void unlock() { if (qnode.next == null) { if (tail.cas(qnode, null) return; while (qnode.next == null) {} } qnode.next.locked = false; }} Pass lock to successor Art of Multiprocessor Programming 108

109 Abortable Locks What if you want to give up waiting for a lock? For example Timeout Database transaction aborted by user Art of Multiprocessor Programming 109

110 Back-off Lock Aborting is trivial Just return from lock() call Extra benefit: No cleaning up Wait-free Immediate return Art of Multiprocessor Programming 110

111 Queue Locks Can t just quit Thread in line behind will starve Need a graceful way out Art of Multiprocessor Programming 111

112 Queue Locks spinning spinning spinning true true true Art of Multiprocessor Programming 112

113 Queue Locks locked spinning spinning false true true Art of Multiprocessor Programming 113

114 Queue Locks locked spinning false true Art of Multiprocessor Programming 114

115 Queue Locks locked false Art of Multiprocessor Programming 115

116 Queue Locks spinning spinning spinning true true true Art of Multiprocessor Programming 116

117 Queue Locks spinning spinning true true true Art of Multiprocessor Programming 117

118 Queue Locks locked spinning false true true Art of Multiprocessor Programming 118

119 Queue Locks spinning false true Art of Multiprocessor Programming 119

120 Queue Locks pwned false true Art of Multiprocessor Programming 120

121 Abortable CLH Lock When a thread gives up Removing node in a wait-free way is hard Idea: let successor deal with it. Art of Multiprocessor Programming 121

122 Initially idle Pointer to predecessor (or null) tail A Art of Multiprocessor Programming 122

123 Initially idle Distinguished available node means lock is free tail A Art of Multiprocessor Programming 123

124 acquiring Acquiring tail A Art of Multiprocessor Programming 124

125 acquiring Acquiring Null predecessor means lock not available and not aborted A Art of Multiprocessor Programming 125

126 acquiring Acquiring Swap A Art of Multiprocessor Programming 126

127 acquiring Acquiring A Art of Multiprocessor Programming 127

128 Acquired locked Reference to AVAILABLE means lock is free. A Art of Multiprocessor Programming 128

129 Normal Case locked spinning spinning Null means lock is not free & request not aborted Art of Multiprocessor Programming 129

130 One Thread Aborts locked Timed out spinning Art of Multiprocessor Programming 130

131 Successor Notices locked Timed out spinning Non-Null means predecessor aborted Art of Multiprocessor Programming 131

132 Recycle Predecessor s Node locked spinning Art of Multiprocessor Programming 132

133 Spin on Earlier Node locked spinning Art of Multiprocessor Programming 133

134 Spin on Earlier Node released spinning A The lock is now mine Art of Multiprocessor Programming 134

135 Time-out Lock public class TOLock implements Lock { static Qnode AVAILABLE = new Qnode(); AtomicReference<Qnode> tail; ThreadLocal<Qnode> mynode; Art of Multiprocessor Programming 135

136 Time-out Lock public class TOLock implements Lock { static Qnode AVAILABLE = new Qnode(); AtomicReference<Qnode> tail; ThreadLocal<Qnode> mynode; AVAILABLE node signifies free lock Art of Multiprocessor Programming 136

137 Time-out Lock public class TOLock implements Lock { static Qnode AVAILABLE = new Qnode(); AtomicReference<Qnode> tail; ThreadLocal<Qnode> mynode; Tail of the queue Art of Multiprocessor Programming 137

138 Time-out Lock public class TOLock implements Lock { static Qnode AVAILABLE = new Qnode(); AtomicReference<Qnode> tail; ThreadLocal<Qnode> mynode; Remember my node Art of Multiprocessor Programming 138

139 Time-out Lock public boolean lock(long timeout) { Qnode qnode = new Qnode(); mynode.set(qnode); qnode.prev = null; Qnode mypred = tail.getandset(qnode); if (mypred== null } mypred.prev == AVAILABLE) { return true; Art of Multiprocessor Programming 139

140 Time-out Lock public boolean lock(long timeout) { Qnode qnode = new Qnode(); mynode.set(qnode); qnode.prev = null; Qnode mypred = tail.getandset(qnode); if (mypred == null } mypred.prev == AVAILABLE) { return true; Create & initialize node Art of Multiprocessor Programming 140

141 Time-out Lock public boolean lock(long timeout) { Qnode qnode = new Qnode(); mynode.set(qnode); qnode.prev = null; Qnode mypred = tail.getandset(qnode); if (mypred == null } mypred.prev == AVAILABLE) { return true; Swap with tail Art of Multiprocessor Programming 141

142 Time-out Lock public boolean lock(long timeout) { Qnode qnode = new Qnode(); mynode.set(qnode); qnode.prev = null; Qnode mypred = tail.getandset(qnode); if (mypred == null }... mypred.prev == AVAILABLE) { return true; If predecessor absent or released, we are done Art of Multiprocessor Programming 142

143 locked Time-out Lock spinning spinning long start = now(); while (now()- start < timeout) { } Qnode predpred = mypred.prev; if (predpred == AVAILABLE) { return true; } else if (predpred!= null) { mypred = predpred; } Art of Multiprocessor Programming 143

144 Time-out Lock long start = now(); while (now()- start < timeout) { } Qnode predpred = mypred.prev; if (predpred == AVAILABLE) { return true; } else if (predpred!= null) { mypred = predpred; } Keep trying for a while Art of Multiprocessor Programming 144

145 Time-out Lock long start = now(); while (now()- start < timeout) { } Qnode predpred = mypred.prev; if (predpred == AVAILABLE) { return true; } else if (predpred!= null) { mypred = predpred; } Spin on predecessor s prev field Art of Multiprocessor Programming 145

146 Time-out Lock long start = now(); while (now()- start < timeout) { } Qnode predpred = mypred.prev; if (predpred == AVAILABLE) { return true; } else if (predpred!= null) { mypred = predpred; } Predecessor released lock Art of Multiprocessor Programming 146

147 Time-out Lock long start = now(); while (now()- start < timeout) { } Qnode predpred = mypred.prev; if (predpred == AVAILABLE) { return true; } else if (predpred!= null) { mypred = predpred; } Predecessor aborted, advance one Art of Multiprocessor Programming 147

148 Time-out Lock if (!tail.compareandset(qnode, mypred)) qnode.prev = mypred; } } return false; What do I do when I time out? Art of Multiprocessor Programming 148

149 Time-out Lock if (!tail.compareandset(qnode, mypred)) qnode.prev = mypred; } } return false; Do I have a successor? If CAS succeeds: no successor, tail just set to my pred, simply return false Art of Multiprocessor Programming 149

150 Time-out Lock if (!tail.compareandset(qnode, mypred)) qnode.prev = mypred; } } return false; If CAS fails, I do have a successor. Tell it about mypred Art of Multiprocessor Programming 150

151 Time-Out Unlock public void unlock() { } Qnode qnode = mynode.get(); if (!tail.compareandset(qnode, null)) qnode.prev = AVAILABLE; Art of Multiprocessor Programming 151

152 Time-out Unlock public void unlock() { } Qnode qnode = mynode.get(); if (!tail.compareandset(qnode, null)) qnode.prev = AVAILABLE; If CAS failed: successor exists, notify it can enter Art of Multiprocessor Programming 152

153 Timing-out Lock public void unlock() { } Qnode qnode = mynode.get(); if (!tail.compareandset(qnode, null)) qnode.prev = AVAILABLE; CAS successful: set tail to null, no clean up since no successor waiting Art of Multiprocessor Programming 153

154 Fairness and NUMA Locks MCS lock mechanics are aware of NUMA Lock Fairness is FCFS Is this a good fit with NUMA and Cache-Coherent NUMA machines?

155 Lock Data Access in NUMA Machine Node 1 CS MCS lock various memory locations Node 2

156 Who s the Unfairest of Them All? locality crucial to NUMA performance Big gains if threads from same node/ cluster obtain lock consecutively Unfairness pays

157 Hierarchical Backoff Lock (HBO) Back off less for thread from same node time 4d 2d d CS Unfairness is key to performance Global T&T&S lock time 4d 2d d

158 Hierarchical Backoff Lock (HBO) Advantages: Simple, improves locality Disadvantages: Requires platform specific tuning Unstable Unfair Continuous invalidations on shared global lock word

159 Hierarchical CLH Lock (HCLH) Each thread spins on cached copy of predecessor s node Local Tail Thread at local head splices local queue into global queue Local CLH queue Global Tail CAS() CAS() Local Tail Local CLH queue CS CAS()

160 Hierarchical CLH Lock (HCLH) 7e+06 6e+06 hbo hclh Throughput 5e+06 4e+06 3e+06 2e+06 1e+06 HCLH HBO Number of Threads Threads access 4 cache lines in CS

161 Hierarchical CLH Lock (HCLH) Advantages: Improved locality Local spinning Fair Disadvantages: Complex code implies long common path Splicing into both local and global requires CAS Hard to get long local sequences

162

163 Lock Cohorting General technique for converting almost any lock into a NUMA lock Allows combining different lock types But need these locks to have certain properties (will discuss shortly)

164 Lock Cohorting Non-empty cohort empty cohort Acquire local lock and proceed to critical section Local Lock On release: if nonempty cohort of waiting threads, release only local lock; leave mark Thread that acquired local lock can now acquire global lock Local Lock CS Global Lock On release: since cohort is empty must release global lock to avoid deadlock

165 Thread Obliviousness A lock is thread-oblivious if After being acquired by one thread, Can be released by another Art of Multiprocessor Programming 165

166 Cohort Detection A lock provides cohort detection if It can tell whether any thread is trying to acquire it Art of Multiprocessor Programming 166

167 Lock Cohorting Two levels of locking Global lock: thread oblivious Thread acquiring the lock can be different than one releasing it Local lock: cohort detection Thread releasing can detect if some thread is waiting to acquire it

168 Two new states: acquire local and acquire global. Do we own global lock? Lock Cohorting: C-BO-MCS In Lock, cohort Lock Local MCS lock tail detection by checking successor pointer CAS() False False True Global backoff lock Bound number of Local consecutive MCS lock acquires to control tail unfairness time 4d 2d d CS CAS() False False True BO Lock is thread oblivious by definition

169 How to add cohort detection Lock Cohorting: property to BO lock? C-BO-BO Lock 4d 2d d Global backoff lock CS time 4d 2d d 4d 2d d As noted BO Lock is thread oblivious

170 Write successorexists Lock Cohorting: C-BO-BO Lock field before attempting to acquire local lock. successorexists reset on lock release. 4d 2d d Release might overwrite another successor s Global write but we don t backoff care why? lock CS time 4d 2d d 4d 2d d

171 C-BO-BO Aborting thread is a resets Time-Out successorexists field before leaving local lock. Spinning threads set it to true. NUMA Lock 4d 2d d BO locks trivially abortable Global backoff lock If releasing thread finds successorexists time false, 4d it releases global lock 2d d CS 4d 2d d

172 Time-Out NUMA Lock Local time-out queue Local Time-Out locks have cohort detection property why? Not enough swap() Local time-out queue time 4d 2d d CS swap()

173 Lock Cohorting Advantages: Great locality Low contention on shared lock Practically no tuning Has whatever properties you want: Can be more or less fair, abortable just choose the appropriate type of locks Disadvantages: Must tune fairness parameters

174 Lock Cohorting Throughput 7e+06 6e+06 5e+06 4e+06 3e+06 C-BO-MCS C-BO-BO hbo hclh fc-mcs nmcs nclh nbo nticket HCLH 2e+06 1e+06 HBO Number of Threads

175 Time-Out (Abortable) Lock Cohorting A-BO-CLH (time-out lock + BO) Throughput in CR-nCRs per sec 4e e+06 3e e+06 2e e+06 1e A-BO-BO a-clh a-hbo a-bo-bo a-bo-clh (CAS) a-bo-clh Number of Threads Abortable CLH (our time-out lock) and HBO

176 One Lock To Rule Them All? TTAS+Backoff, CLH, MCS, ToLock Each better than others in some way There is no one solution Lock we pick really depends on: the application the hardware which properties are important Art of Multiprocessor Programming 176

177 Yeahy! Amdahl s Law Works Fine grained parallelism gives great performance benefit Coarse Grained Fine Grained 25% Shared 25% Shared 75% Unshared 75% Unshared

178 But Can we always draw the right conclusions from Amdahl s law? Claim: sometimes the overhead of finegrained synchronization is so high that it is better to have a single thread do all the work sequentially in order to avoid it

179 object Oyama et. al Mutex object lock CAS() Head Improves cache behavior and reduces contention on object Not great since every request involves CAS Release lock CAS() a b c d Apply a,b,c, and d to object Combine lock requests return responses

180 Flat Combining Have single lock holder collect and perform requests of all others Without using CAS operations to coordinate requests With (non-naïve) combining of requests (if cost of k batched operations is less than that of k operations in sequence à we win)

181 Flat-Combining Most requests do not involve a CAS, in fact, not even a memory barrier object counter 54 object lock Collect Again try requests to collect requests Head CAS() Publication list Deq() 53 Deq() null 54 Enq(d) null 12 Enq(d) 54 Apply requests to object Deq() Deq() Enq(d)

182 Flat-Combining Pub-List object counter 54 Cleanup Every combiner increments counter and updates record s time stamp when returning response Traverse and remove from list records with old time stamp Cleanup requires no CAS, only reads and writes Head Publication list Deq() 53 null 12 Enq(d) 54 Enq(d) 54 Enq(d) If thread reappears must add itself to pub list

183 Fine-Grained Lock-free FIFO Queue Head CAS() Tail CAS() a CAS() b c d P: Dequeue() => a Q: Enqueue(d)

184 Flat-Combining FIFO Queue counter object lock 54 Head CAS() Publication list Head Tail null 54 Enq(b) 12 Enq(b) 54 Deq() Enq(a) Enq(b) Sequential FIFO Queue

185 Flat-Combining FIFO Queue OK, but can do better combining: collect all items into a fat node, enqueue Fat in Node one step easy sequentially but object cannot counter lock be done in concurrent alg without CAS 54 Head Tail Head CAS() Publication list Deq() 54 Enq(b) 12 Enq(b) 54 a Deq() Enq(a) Enq(b) Sequential Fat Node FIFO Queue

186 Linearizable FIFO Queue Flat Combining MS queue, Oyama Combining tree

187 Benefit s of Flat Combining Flat Combining in Red

188 Linearizable Stack Flat Combining A Bug Elimination Stack Treiber Lockfree Stack

189 Concurrent Priority Queue (Chapter 15) k deletemin opera.ons take O(k log n) deletemin() traverses CASing un.l you manage to mark a node, then use skiplist remove your marked node 1

190 Flat-Combining Priority Queue counter object lock 54 Head CAS() Publica.on list cvccc Deq() 54 Enq(b) 12 Enq(b) 54 Deq() Enq(a) Enq(b)

191 Flat Combining Priority Queue k deletemin opera.ons take O(k+log n) Traverse Collect skiplist k deletemin towards requests kth key, removing all nodes below your path Remove traverse to find kth key, collect values to be returned

192 Flat Combining Skiplist based queue Whats this? Priority Queue lock-based SkipQueue lock-free SkipQueue

193 Priority Queue Flat combining with sequen.al pairing heap plugged in

194 Priority Queue on Intel Flat combining with sequen.al pairing heap plugged in

195 Don t be Afraid of the Big Bad Lock Fine grained parallelism comes with an overhead not always worth the effort. Sometimes using a single global lock is a win. Art of Multiprocessor Programming 195

196 Thanks! Art of Multiprocessor Programming 196

197 This work is licensed under a Creative Commons Attribution-ShareAlike 2.5 License. You are free: to Share to copy, distribute and transmit the work to Remix to adapt the work Under the following conditions: Attribution. You must attribute the work to The Art of Multiprocessor Programming (but not in any way that suggests that the authors endorse you or your use of the work). Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under the same, similar or a compatible license. For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to Any of the above conditions can be waived if you get permission from the copyright holder. Nothing in this license impairs or restricts the author's moral rights. Art of Multiprocessor Programming 197

Spin Locks and Contention. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit

Spin Locks and Contention. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Spin Locks and Contention Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Focus so far: Correctness and Progress Models Accurate (we never lied to you) But idealized

More information

Spin Locks and Contention. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit

Spin Locks and Contention. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Spin Locks and Contention Companion slides for The Programming by Maurice Herlihy & Nir Shavit Focus so far: Correctness Models Accurate (we never lied to you) But idealized (so we forgot to mention a

More information

Spin Locks and Contention. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit

Spin Locks and Contention. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Spin Locks and Contention Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit MIMD Architectures memory Shared Bus Memory Contention Communication Contention Communication

More information

Spin Locks and Contention. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit

Spin Locks and Contention. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Spin Locks and Contention Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Real world concurrency Understanding hardware architecture What is contention How to

More information

Spin Locks and Contention

Spin Locks and Contention Spin Locks and Contention Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Modified for Software1 students by Lior Wolf and Mati Shomrat Kinds of Architectures

More information

Spin Locks and Contention. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit

Spin Locks and Contention. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Spin Locks and Contention Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Focus so far: Correctness and Progress Models Accurate (we never lied to you) But idealized

More information

Spin Locks and Contention. Companion slides for Chapter 7 The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit

Spin Locks and Contention. Companion slides for Chapter 7 The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Spin Locks and Contention Companion slides for Chapter 7 The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Focus so far: Correctness and Progress Models Accurate (we never lied to you)

More information

Locks Chapter 4. Overview. Introduction Spin Locks. Queue locks. Roger Wattenhofer. Test-and-Set & Test-and-Test-and-Set Backoff lock

Locks Chapter 4. Overview. Introduction Spin Locks. Queue locks. Roger Wattenhofer. Test-and-Set & Test-and-Test-and-Set Backoff lock Locks Chapter 4 Roger Wattenhofer ETH Zurich Distributed Computing www.disco.ethz.ch Overview Introduction Spin Locks Test-and-Set & Test-and-Test-and-Set Backoff lock Queue locks 4/2 1 Introduction: From

More information

Locking. Part 2, Chapter 11. Roger Wattenhofer. ETH Zurich Distributed Computing

Locking. Part 2, Chapter 11. Roger Wattenhofer. ETH Zurich Distributed Computing Locking Part 2, Chapter 11 Roger Wattenhofer ETH Zurich Distributed Computing www.disco.ethz.ch Overview Introduction Spin Locks Test-and-Set & Test-and-Test-and-Set Backoff lock Queue locks 11/2 Introduction:

More information

Introduction to Multiprocessor Synchronization

Introduction to Multiprocessor Synchronization Introduction to Multiprocessor Synchronization Maurice Herlihy http://cs.brown.edu/courses/cs176/lectures.shtml Moore's Law Transistor count still rising Clock speed flattening sharply Art of Multiprocessor

More information

Spin Locks and Contention Management

Spin Locks and Contention Management Chapter 7 Spin Locks and Contention Management 7.1 Introduction We now turn our attention to the performance of mutual exclusion protocols on realistic architectures. Any mutual exclusion protocol poses

More information

Agenda. Lecture. Next discussion papers. Bottom-up motivation Shared memory primitives Shared memory synchronization Barriers and locks

Agenda. Lecture. Next discussion papers. Bottom-up motivation Shared memory primitives Shared memory synchronization Barriers and locks Agenda Lecture Bottom-up motivation Shared memory primitives Shared memory synchronization Barriers and locks Next discussion papers Selecting Locking Primitives for Parallel Programming Selecting Locking

More information

9/28/2014. CS341: Operating System. Synchronization. High Level Construct: Monitor Classical Problems of Synchronizations FAQ: Mid Semester

9/28/2014. CS341: Operating System. Synchronization. High Level Construct: Monitor Classical Problems of Synchronizations FAQ: Mid Semester CS341: Operating System Lect22: 18 th Sept 2014 Dr. A. Sahu Dept of Comp. Sc. & Engg. Indian Institute of Technology Guwahati Synchronization Hardware Support TAS, CAS, TTAS, LL LC, XCHG, Compare, FAI

More information

Locking Granularity. CS 475, Spring 2019 Concurrent & Distributed Systems. With material from Herlihy & Shavit, Art of Multiprocessor Programming

Locking Granularity. CS 475, Spring 2019 Concurrent & Distributed Systems. With material from Herlihy & Shavit, Art of Multiprocessor Programming Locking Granularity CS 475, Spring 2019 Concurrent & Distributed Systems With material from Herlihy & Shavit, Art of Multiprocessor Programming Discussion: HW1 Part 4 addtolist(key1, newvalue) Thread 1

More information

Linked Lists: Locking, Lock- Free, and Beyond. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit

Linked Lists: Locking, Lock- Free, and Beyond. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Linked Lists: Locking, Lock- Free, and Beyond Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Coarse-Grained Synchronization Each method locks the object Avoid

More information

Introduction. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit

Introduction. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Introduction Companion slides for The by Maurice Herlihy & Nir Shavit Moore s Law Transistor count still rising Clock speed flattening sharply 2 Moore s Law (in practice) 3 Nearly Extinct: the Uniprocesor

More information

Programming Paradigms for Concurrency Lecture 3 Concurrent Objects

Programming Paradigms for Concurrency Lecture 3 Concurrent Objects Programming Paradigms for Concurrency Lecture 3 Concurrent Objects Based on companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Modified by Thomas Wies New York University

More information

Coarse-grained and fine-grained locking Niklas Fors

Coarse-grained and fine-grained locking Niklas Fors Coarse-grained and fine-grained locking Niklas Fors 2013-12-05 Slides borrowed from: http://cs.brown.edu/courses/cs176course_information.shtml Art of Multiprocessor Programming 1 Topics discussed Coarse-grained

More information

Linked Lists: Locking, Lock-Free, and Beyond. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit

Linked Lists: Locking, Lock-Free, and Beyond. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Linked Lists: Locking, Lock-Free, and Beyond Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Concurrent Objects Adding threads should not lower throughput Contention

More information

9/23/2014. Concurrent Programming. Book. Process. Exchange of data between threads/processes. Between Process. Thread.

9/23/2014. Concurrent Programming. Book. Process. Exchange of data between threads/processes. Between Process. Thread. Dr A Sahu Dept of Computer Science & Engineering IIT Guwahati Course Structure & Book Basic of thread and process Coordination and synchronization Example of Parallel Programming Shared memory : C/C++

More information

Concurrent Skip Lists. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit

Concurrent Skip Lists. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Concurrent Skip Lists Companion slides for The by Maurice Herlihy & Nir Shavit Set Object Interface Collection of elements No duplicates Methods add() a new element remove() an element contains() if element

More information

Solution: a lock (a/k/a mutex) public: virtual void unlock() =0;

Solution: a lock (a/k/a mutex) public: virtual void unlock() =0; 1 Solution: a lock (a/k/a mutex) class BasicLock { public: virtual void lock() =0; virtual void unlock() =0; ; 2 Using a lock class Counter { public: int get_and_inc() { lock_.lock(); int old = count_;

More information

Computer Engineering II Solution to Exercise Sheet Chapter 10

Computer Engineering II Solution to Exercise Sheet Chapter 10 Distributed Computing FS 2017 Prof. R. Wattenhofer Computer Engineering II Solution to Exercise Sheet Chapter 10 Quiz 1 Quiz a) The AtomicBoolean utilizes an atomic version of getandset() implemented in

More information

Practice: Small Systems Part 2, Chapter 3

Practice: Small Systems Part 2, Chapter 3 Practice: Small Systems Part 2, Chapter 3 Roger Wattenhofer Overview Introduction Spin Locks Test and Set & Test and Test and Set Backoff lock Queue locks Concurrent Linked List Fine grained synchronization

More information

Lecture 9: Multiprocessor OSs & Synchronization. CSC 469H1F Fall 2006 Angela Demke Brown

Lecture 9: Multiprocessor OSs & Synchronization. CSC 469H1F Fall 2006 Angela Demke Brown Lecture 9: Multiprocessor OSs & Synchronization CSC 469H1F Fall 2006 Angela Demke Brown The Problem Coordinated management of shared resources Resources may be accessed by multiple threads Need to control

More information

Distributed Computing

Distributed Computing HELLENIC REPUBLIC UNIVERSITY OF CRETE Distributed Computing Graduate Course Section 3: Spin Locks and Contention Panagiota Fatourou Department of Computer Science Spin Locks and Contention In contrast

More information

NON-BLOCKING DATA STRUCTURES AND TRANSACTIONAL MEMORY. Tim Harris, 3 Nov 2017

NON-BLOCKING DATA STRUCTURES AND TRANSACTIONAL MEMORY. Tim Harris, 3 Nov 2017 NON-BLOCKING DATA STRUCTURES AND TRANSACTIONAL MEMORY Tim Harris, 3 Nov 2017 Lecture 1/3 Introduction Basic spin-locks Queue-based locks Hierarchical locks Reader-writer locks Reading without locking Flat

More information

Concurrent Preliminaries

Concurrent Preliminaries Concurrent Preliminaries Sagi Katorza Tel Aviv University 09/12/2014 1 Outline Hardware infrastructure Hardware primitives Mutual exclusion Work sharing and termination detection Concurrent data structures

More information

Lock cohorting: A general technique for designing NUMA locks

Lock cohorting: A general technique for designing NUMA locks Lock cohorting: A general technique for designing NUMA locks The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation As Published

More information

Locking. Part 2, Chapter 7. Roger Wattenhofer. ETH Zurich Distributed Computing

Locking. Part 2, Chapter 7. Roger Wattenhofer. ETH Zurich Distributed Computing Locking Part 2, Chapter 7 Roger Wattenhofer ETH Zurich Distributed Computing www.disco.ethz.ch Overview Introduction Spin Locks Test-and-Set & Test-and-Test-and-Set Backoff lock Queue locks Concurrent

More information

Advance Operating Systems (CS202) Locks Discussion

Advance Operating Systems (CS202) Locks Discussion Advance Operating Systems (CS202) Locks Discussion Threads Locks Spin Locks Array-based Locks MCS Locks Sequential Locks Road Map Threads Global variables and static objects are shared Stored in the static

More information

Lecture 7: Mutual Exclusion 2/16/12. slides adapted from The Art of Multiprocessor Programming, Herlihy and Shavit

Lecture 7: Mutual Exclusion 2/16/12. slides adapted from The Art of Multiprocessor Programming, Herlihy and Shavit Principles of Concurrency and Parallelism Lecture 7: Mutual Exclusion 2/16/12 slides adapted from The Art of Multiprocessor Programming, Herlihy and Shavit Time Absolute, true and mathematical time, of

More information

Models of concurrency & synchronization algorithms

Models of concurrency & synchronization algorithms Models of concurrency & synchronization algorithms Lecture 3 of TDA383/DIT390 (Concurrent Programming) Carlo A. Furia Chalmers University of Technology University of Gothenburg SP3 2016/2017 Today s menu

More information

Linked Lists: The Role of Locking. Erez Petrank Technion

Linked Lists: The Role of Locking. Erez Petrank Technion Linked Lists: The Role of Locking Erez Petrank Technion Why Data Structures? Concurrent Data Structures are building blocks Used as libraries Construction principles apply broadly This Lecture Designing

More information

Concurrent Queues, Monitors, and the ABA problem. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit

Concurrent Queues, Monitors, and the ABA problem. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Concurrent Queues, Monitors, and the ABA problem Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Queues Often used as buffers between producers and consumers

More information

NON-BLOCKING DATA STRUCTURES AND TRANSACTIONAL MEMORY. Tim Harris, 14 November 2014

NON-BLOCKING DATA STRUCTURES AND TRANSACTIONAL MEMORY. Tim Harris, 14 November 2014 NON-BLOCKING DATA STRUCTURES AND TRANSACTIONAL MEMORY Tim Harris, 14 November 2014 Lecture 6 Introduction Amdahl s law Basic spin-locks Queue-based locks Hierarchical locks Reader-writer locks Reading

More information

Advanced Multiprocessor Programming: Locks

Advanced Multiprocessor Programming: Locks Advanced Multiprocessor Programming: Locks Martin Wimmer, Jesper Larsson Träff TU Wien 20th April, 2015 Wimmer, Träff AMP SS15 1 / 31 Locks we have seen so far Peterson Lock ilter Lock Bakery Lock These

More information

Spinlocks. Spinlocks. Message Systems, Inc. April 8, 2011

Spinlocks. Spinlocks. Message Systems, Inc. April 8, 2011 Spinlocks Samy Al Bahra Devon H. O Dell Message Systems, Inc. April 8, 2011 Introduction Mutexes A mutex is an object which implements acquire and relinquish operations such that the execution following

More information

Multiprocessor Systems. Chapter 8, 8.1

Multiprocessor Systems. Chapter 8, 8.1 Multiprocessor Systems Chapter 8, 8.1 1 Learning Outcomes An understanding of the structure and limits of multiprocessor hardware. An appreciation of approaches to operating system support for multiprocessor

More information

6.852: Distributed Algorithms Fall, Class 15

6.852: Distributed Algorithms Fall, Class 15 6.852: Distributed Algorithms Fall, 2009 Class 15 Today s plan z z z z z Pragmatic issues for shared-memory multiprocessors Practical mutual exclusion algorithms Test-and-set locks Ticket locks Queue locks

More information

Shared-Memory Computability

Shared-Memory Computability Shared-Memory Computability 10011 Universal Object Wait-free/Lock-free computable = Threads with methods that solve n- consensus Art of Multiprocessor Programming Copyright Herlihy- Shavit 2007 93 GetAndSet

More information

Whatever can go wrong will go wrong. attributed to Edward A. Murphy. Murphy was an optimist. authors of lock-free programs LOCK FREE KERNEL

Whatever can go wrong will go wrong. attributed to Edward A. Murphy. Murphy was an optimist. authors of lock-free programs LOCK FREE KERNEL Whatever can go wrong will go wrong. attributed to Edward A. Murphy Murphy was an optimist. authors of lock-free programs LOCK FREE KERNEL 251 Literature Maurice Herlihy and Nir Shavit. The Art of Multiprocessor

More information

Whatever can go wrong will go wrong. attributed to Edward A. Murphy. Murphy was an optimist. authors of lock-free programs 3.

Whatever can go wrong will go wrong. attributed to Edward A. Murphy. Murphy was an optimist. authors of lock-free programs 3. Whatever can go wrong will go wrong. attributed to Edward A. Murphy Murphy was an optimist. authors of lock-free programs 3. LOCK FREE KERNEL 309 Literature Maurice Herlihy and Nir Shavit. The Art of Multiprocessor

More information

Distributed Computing Group

Distributed Computing Group Distributed Computing Group HS 2009 Prof. Dr. Roger Wattenhofer, Thomas Locher, Remo Meier, Benjamin Sigg Assigned: December 11, 2009 Discussion: none Distributed Systems Theory exercise 6 1 ALock2 Have

More information

Fast and Lock-Free Concurrent Priority Queues for Multi-Thread Systems

Fast and Lock-Free Concurrent Priority Queues for Multi-Thread Systems Fast and Lock-Free Concurrent Priority Queues for Multi-Thread Systems Håkan Sundell Philippas Tsigas Outline Synchronization Methods Priority Queues Concurrent Priority Queues Lock-Free Algorithm: Problems

More information

Universality of Consensus. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit

Universality of Consensus. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Universality of Consensus Companion slides for The by Maurice Herlihy & Nir Shavit Turing Computability 1 0 1 1 0 1 0 A mathematical model of computation Computable = Computable on a T-Machine 2 Shared-Memory

More information

CSL373: Lecture 5 Deadlocks (no process runnable) + Scheduling (> 1 process runnable)

CSL373: Lecture 5 Deadlocks (no process runnable) + Scheduling (> 1 process runnable) CSL373: Lecture 5 Deadlocks (no process runnable) + Scheduling (> 1 process runnable) Past & Present Have looked at two constraints: Mutual exclusion constraint between two events is a requirement that

More information

250P: Computer Systems Architecture. Lecture 14: Synchronization. Anton Burtsev March, 2019

250P: Computer Systems Architecture. Lecture 14: Synchronization. Anton Burtsev March, 2019 250P: Computer Systems Architecture Lecture 14: Synchronization Anton Burtsev March, 2019 Coherence and Synchronization Topics: synchronization primitives (Sections 5.4-5.5) 2 Constructing Locks Applications

More information

Multiprocessor Synchronization

Multiprocessor Synchronization Multiprocessor Synchronization Material in this lecture in Henessey and Patterson, Chapter 8 pgs. 694-708 Some material from David Patterson s slides for CS 252 at Berkeley 1 Multiprogramming and Multiprocessing

More information

Multiprocessor System. Multiprocessor Systems. Bus Based UMA. Types of Multiprocessors (MPs) Cache Consistency. Bus Based UMA. Chapter 8, 8.

Multiprocessor System. Multiprocessor Systems. Bus Based UMA. Types of Multiprocessors (MPs) Cache Consistency. Bus Based UMA. Chapter 8, 8. Multiprocessor System Multiprocessor Systems Chapter 8, 8.1 We will look at shared-memory multiprocessors More than one processor sharing the same memory A single CPU can only go so fast Use more than

More information

Multiprocessor Systems. COMP s1

Multiprocessor Systems. COMP s1 Multiprocessor Systems 1 Multiprocessor System We will look at shared-memory multiprocessors More than one processor sharing the same memory A single CPU can only go so fast Use more than one CPU to improve

More information

Peterson s Algorithm

Peterson s Algorithm Peterson s Algorithm public void lock() { flag[i] = true; victim = i; while (flag[j] && victim == i) {}; } public void unlock() { flag[i] = false; } 24/03/10 Art of Multiprocessor Programming 1 Mutual

More information

Linked Lists: Locking, Lock- Free, and Beyond. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit

Linked Lists: Locking, Lock- Free, and Beyond. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Linked Lists: Locking, Lock- Free, and Beyond Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Last Lecture: Spin-Locks CS. spin lock critical section Resets lock

More information

A simple correctness proof of the MCS contention-free lock. Theodore Johnson. Krishna Harathi. University of Florida. Abstract

A simple correctness proof of the MCS contention-free lock. Theodore Johnson. Krishna Harathi. University of Florida. Abstract A simple correctness proof of the MCS contention-free lock Theodore Johnson Krishna Harathi Computer and Information Sciences Department University of Florida Abstract Mellor-Crummey and Scott present

More information

Unit 6: Indeterminate Computation

Unit 6: Indeterminate Computation Unit 6: Indeterminate Computation Martha A. Kim October 6, 2013 Introduction Until now, we have considered parallelizations of sequential programs. The parallelizations were deemed safe if the parallel

More information

Mutual Exclusion Algorithms with Constant RMR Complexity and Wait-Free Exit Code

Mutual Exclusion Algorithms with Constant RMR Complexity and Wait-Free Exit Code Mutual Exclusion Algorithms with Constant RMR Complexity and Wait-Free Exit Code Rotem Dvir 1 and Gadi Taubenfeld 2 1 The Interdisciplinary Center, P.O.Box 167, Herzliya 46150, Israel rotem.dvir@gmail.com

More information

Lecture 21: Transactional Memory. Topics: Hardware TM basics, different implementations

Lecture 21: Transactional Memory. Topics: Hardware TM basics, different implementations Lecture 21: Transactional Memory Topics: Hardware TM basics, different implementations 1 Transactions New paradigm to simplify programming instead of lock-unlock, use transaction begin-end locks are blocking,

More information

Multiprocessors and Locking

Multiprocessors and Locking Types of Multiprocessors (MPs) Uniform memory-access (UMA) MP Access to all memory occurs at the same speed for all processors. Multiprocessors and Locking COMP9242 2008/S2 Week 12 Part 1 Non-uniform memory-access

More information

Last Lecture: Spin-Locks

Last Lecture: Spin-Locks Linked Lists: Locking, Lock- Free, and Beyond Last Lecture: Spin-Locks. spin lock CS critical section Resets lock upon exit Today: Concurrent Objects Adding threads should not lower throughput Contention

More information

Transactional Memory: Architectural Support for Lock-Free Data Structures Maurice Herlihy and J. Eliot B. Moss ISCA 93

Transactional Memory: Architectural Support for Lock-Free Data Structures Maurice Herlihy and J. Eliot B. Moss ISCA 93 Transactional Memory: Architectural Support for Lock-Free Data Structures Maurice Herlihy and J. Eliot B. Moss ISCA 93 What are lock-free data structures A shared data structure is lock-free if its operations

More information

Concurrent Objects. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit

Concurrent Objects. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Concurrent Objects Companion slides for The by Maurice Herlihy & Nir Shavit Concurrent Computation memory object object 2 Objectivism What is a concurrent object? How do we describe one? How do we implement

More information

Advanced Topic: Efficient Synchronization

Advanced Topic: Efficient Synchronization Advanced Topic: Efficient Synchronization Multi-Object Programs What happens when we try to synchronize across multiple objects in a large program? Each object with its own lock, condition variables Is

More information

Revisiting the Combining Synchronization Technique

Revisiting the Combining Synchronization Technique Revisiting the Combining Synchronization Technique Panagiota Fatourou Department of Computer Science University of Crete & FORTH ICS faturu@csd.uoc.gr Nikolaos D. Kallimanis Department of Computer Science

More information

Programming Paradigms for Concurrency Introduction

Programming Paradigms for Concurrency Introduction Programming Paradigms for Concurrency Introduction Based on companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Modified by Thomas Wies New York University Moore

More information

Programming Paradigms for Concurrency Lecture 6 Synchronization of Concurrent Objects

Programming Paradigms for Concurrency Lecture 6 Synchronization of Concurrent Objects Programming Paradigms for Concurrency Lecture 6 Synchronization of Concurrent Objects Based on companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Modified by Thomas

More information

Other consistency models

Other consistency models Last time: Symmetric multiprocessing (SMP) Lecture 25: Synchronization primitives Computer Architecture and Systems Programming (252-0061-00) CPU 0 CPU 1 CPU 2 CPU 3 Timothy Roscoe Herbstsemester 2012

More information

Scalable Locking. Adam Belay

Scalable Locking. Adam Belay Scalable Locking Adam Belay Problem: Locks can ruin performance 12 finds/sec 9 6 Locking overhead dominates 3 0 0 6 12 18 24 30 36 42 48 Cores Problem: Locks can ruin performance the locks

More information

Concurrent Queues and Stacks. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit

Concurrent Queues and Stacks. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Concurrent Queues and Stacks Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit The Five-Fold Path Coarse-grained locking Fine-grained locking Optimistic synchronization

More information

Part 1: Concepts and Hardware- Based Approaches

Part 1: Concepts and Hardware- Based Approaches Part 1: Concepts and Hardware- Based Approaches CS5204-Operating Systems Introduction Provide support for concurrent activity using transactionstyle semantics without explicit locking Avoids problems with

More information

CS377P Programming for Performance Multicore Performance Synchronization

CS377P Programming for Performance Multicore Performance Synchronization CS377P Programming for Performance Multicore Performance Synchronization Sreepathi Pai UTCS October 21, 2015 Outline 1 Synchronization Primitives 2 Blocking, Lock-free and Wait-free Algorithms 3 Transactional

More information

Cache Coherence and Atomic Operations in Hardware

Cache Coherence and Atomic Operations in Hardware Cache Coherence and Atomic Operations in Hardware Previously, we introduced multi-core parallelism. Today we ll look at 2 things: 1. Cache coherence 2. Instruction support for synchronization. And some

More information

Cache-Aware Lock-Free Queues for Multiple Producers/Consumers and Weak Memory Consistency

Cache-Aware Lock-Free Queues for Multiple Producers/Consumers and Weak Memory Consistency Cache-Aware Lock-Free Queues for Multiple Producers/Consumers and Weak Memory Consistency Anders Gidenstam Håkan Sundell Philippas Tsigas School of business and informatics University of Borås Distributed

More information

Multiprocessors II: CC-NUMA DSM. CC-NUMA for Large Systems

Multiprocessors II: CC-NUMA DSM. CC-NUMA for Large Systems Multiprocessors II: CC-NUMA DSM DSM cache coherence the hardware stuff Today s topics: what happens when we lose snooping new issues: global vs. local cache line state enter the directory issues of increasing

More information

Lecture 10: Multi-Object Synchronization

Lecture 10: Multi-Object Synchronization CS 422/522 Design & Implementation of Operating Systems Lecture 10: Multi-Object Synchronization Zhong Shao Dept. of Computer Science Yale University Acknowledgement: some slides are taken from previous

More information

Introduction. Coherency vs Consistency. Lec-11. Multi-Threading Concepts: Coherency, Consistency, and Synchronization

Introduction. Coherency vs Consistency. Lec-11. Multi-Threading Concepts: Coherency, Consistency, and Synchronization Lec-11 Multi-Threading Concepts: Coherency, Consistency, and Synchronization Coherency vs Consistency Memory coherency and consistency are major concerns in the design of shared-memory systems. Consistency

More information

Non-blocking Array-based Algorithms for Stacks and Queues!

Non-blocking Array-based Algorithms for Stacks and Queues! Non-blocking Array-based Algorithms for Stacks and Queues! Niloufar Shafiei! Department of Computer Science and Engineering York University ICDCN 09 Outline! Introduction! Stack algorithm! Queue algorithm!

More information

Last Class: CPU Scheduling! Adjusting Priorities in MLFQ!

Last Class: CPU Scheduling! Adjusting Priorities in MLFQ! Last Class: CPU Scheduling! Scheduling Algorithms: FCFS Round Robin SJF Multilevel Feedback Queues Lottery Scheduling Review questions: How does each work? Advantages? Disadvantages? Lecture 7, page 1

More information

Chapter 6: Process [& Thread] Synchronization. CSCI [4 6] 730 Operating Systems. Why does cooperation require synchronization?

Chapter 6: Process [& Thread] Synchronization. CSCI [4 6] 730 Operating Systems. Why does cooperation require synchronization? Chapter 6: Process [& Thread] Synchronization CSCI [4 6] 730 Operating Systems Synchronization Part 1 : The Basics Why is synchronization needed? Synchronization Language/Definitions:» What are race conditions?»

More information

NUMA-Aware Reader-Writer Locks PPoPP 2013

NUMA-Aware Reader-Writer Locks PPoPP 2013 NUMA-Aware Reader-Writer Locks PPoPP 2013 Irina Calciu Brown University Authors Irina Calciu @ Brown University Dave Dice Yossi Lev Victor Luchangco Virendra J. Marathe Nir Shavit @ MIT 2 Cores Chip (node)

More information

Lock Oscillation: Boosting the Performance of Concurrent Data Structures

Lock Oscillation: Boosting the Performance of Concurrent Data Structures Lock Oscillation: Boosting the Performance of Concurrent Data Structures Panagiota Fatourou FORTH ICS & University of Crete Nikolaos D. Kallimanis FORTH ICS The Multicore Era The dominance of Multicore

More information

Advanced Multiprocessor Programming: Locks

Advanced Multiprocessor Programming: Locks Advanced Multiprocessor Programming: Locks Martin Wimmer, Jesper Larsson Träff TU Wien 2nd May, 2016 Wimmer, Träff AMP SS15 1 / 35 Locks we have seen so far Peterson Lock ilter Lock Bakery Lock These locks......

More information

Queue Delegation Locking

Queue Delegation Locking Queue Delegation Locking David Klaftenegger Konstantinos Sagonas Kjell Winblad Department of Information Technology, Uppsala University, Sweden Abstract The scalability of parallel programs is often bounded

More information

Multiprocessor Support

Multiprocessor Support CSC 256/456: Operating Systems Multiprocessor Support John Criswell University of Rochester 1 Outline Multiprocessor hardware Types of multi-processor workloads Operating system issues Where to run the

More information

Mutual Exclusion. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit

Mutual Exclusion. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Mutual Exclusion Companion slides for The by Maurice Herlihy & Nir Shavit Mutual Exclusion Today we will try to formalize our understanding of mutual exclusion We will also use the opportunity to show

More information

CPSC/ECE 3220 Summer 2018 Exam 2 No Electronics.

CPSC/ECE 3220 Summer 2018 Exam 2 No Electronics. CPSC/ECE 3220 Summer 2018 Exam 2 No Electronics. Name: Write one of the words or terms from the following list into the blank appearing to the left of the appropriate definition. Note that there are more

More information

Adaptive Lock. Madhav Iyengar < >, Nathaniel Jeffries < >

Adaptive Lock. Madhav Iyengar < >, Nathaniel Jeffries < > Adaptive Lock Madhav Iyengar < miyengar@andrew.cmu.edu >, Nathaniel Jeffries < njeffrie@andrew.cmu.edu > ABSTRACT Busy wait synchronization, the spinlock, is the primitive at the core of all other synchronization

More information

CS510 Concurrent Systems. Jonathan Walpole

CS510 Concurrent Systems. Jonathan Walpole CS510 Concurrent Systems Jonathan Walpole Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms utline Background Non-Blocking Queue Algorithm Two Lock Concurrent Queue Algorithm

More information

Fine-grained synchronization & lock-free programming

Fine-grained synchronization & lock-free programming Lecture 17: Fine-grained synchronization & lock-free programming Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2016 Tunes Minnie the Moocher Robbie Williams (Swings Both Ways)

More information

Module 7: Synchronization Lecture 13: Introduction to Atomic Primitives. The Lecture Contains: Synchronization. Waiting Algorithms.

Module 7: Synchronization Lecture 13: Introduction to Atomic Primitives. The Lecture Contains: Synchronization. Waiting Algorithms. The Lecture Contains: Synchronization Waiting Algorithms Implementation Hardwired Locks Software Locks Hardware Support Atomic Exchange Test & Set Fetch & op Compare & Swap Traffic of Test & Set Backoff

More information

6.852: Distributed Algorithms Fall, Class 21

6.852: Distributed Algorithms Fall, Class 21 6.852: Distributed Algorithms Fall, 2009 Class 21 Today s plan Wait-free synchronization. The wait-free consensus hierarchy Universality of consensus Reading: [Herlihy, Wait-free synchronization] (Another

More information

The Art of Multiprocessor Programming Copyright 2007 Elsevier Inc. All rights reserved. October 3, 2007 DRAFT COPY

The Art of Multiprocessor Programming Copyright 2007 Elsevier Inc. All rights reserved. October 3, 2007 DRAFT COPY The Art of Multiprocessor Programming Copyright 2007 Elsevier Inc. All rights reserved Maurice Herlihy October 3, 2007 Nir Shavit 2 Contents 1 Introduction 13 1.1 Shared Objects and Synchronization..............

More information

Algorithms for Scalable Synchronization on Shared Memory Multiprocessors by John M. Mellor Crummey Michael L. Scott

Algorithms for Scalable Synchronization on Shared Memory Multiprocessors by John M. Mellor Crummey Michael L. Scott Algorithms for Scalable Synchronization on Shared Memory Multiprocessors by John M. Mellor Crummey Michael L. Scott Presentation by Joe Izraelevitz Tim Kopp Synchronization Primitives Spin Locks Used for

More information

Introduction. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit

Introduction. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Introduction Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Moore s Law Transistor count still rising Clock speed flattening sharply Art of Multiprocessor Programming

More information

Hashing and Natural Parallism. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit

Hashing and Natural Parallism. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Hashing and Natural Parallism Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Sequential Closed Hash Map 0 1 2 3 16 9 buckets 2 Items h(k) = k mod 4 Art of Multiprocessor

More information

Midterm Exam Amy Murphy 6 March 2002

Midterm Exam Amy Murphy 6 March 2002 University of Rochester Midterm Exam Amy Murphy 6 March 2002 Computer Systems (CSC2/456) Read before beginning: Please write clearly. Illegible answers cannot be graded. Be sure to identify all of your

More information

CS 241 Honors Concurrent Data Structures

CS 241 Honors Concurrent Data Structures CS 241 Honors Concurrent Data Structures Bhuvan Venkatesh University of Illinois Urbana Champaign March 27, 2018 CS 241 Course Staff (UIUC) Lock Free Data Structures March 27, 2018 1 / 43 What to go over

More information

CS Advanced Operating Systems Structures and Implementation Lecture 8. Synchronization Continued. Goals for Today. Synchronization Scheduling

CS Advanced Operating Systems Structures and Implementation Lecture 8. Synchronization Continued. Goals for Today. Synchronization Scheduling Goals for Today CS194-24 Advanced Operating Systems Structures and Implementation Lecture 8 Synchronization Continued Synchronization Scheduling Interactive is important! Ask Questions! February 25 th,

More information

Operating Systems. Lecture 4 - Concurrency and Synchronization. Master of Computer Science PUF - Hồ Chí Minh 2016/2017

Operating Systems. Lecture 4 - Concurrency and Synchronization. Master of Computer Science PUF - Hồ Chí Minh 2016/2017 Operating Systems Lecture 4 - Concurrency and Synchronization Adrien Krähenbühl Master of Computer Science PUF - Hồ Chí Minh 2016/2017 Mutual exclusion Hardware solutions Semaphores IPC: Message passing

More information

CS510 Advanced Topics in Concurrency. Jonathan Walpole

CS510 Advanced Topics in Concurrency. Jonathan Walpole CS510 Advanced Topics in Concurrency Jonathan Walpole Threads Cannot Be Implemented as a Library Reasoning About Programs What are the valid outcomes for this program? Is it valid for both r1 and r2 to

More information

CS 31: Introduction to Computer Systems : Threads & Synchronization April 16-18, 2019

CS 31: Introduction to Computer Systems : Threads & Synchronization April 16-18, 2019 CS 31: Introduction to Computer Systems 22-23: Threads & Synchronization April 16-18, 2019 Making Programs Run Faster We all like how fast computers are In the old days (1980 s - 2005): Algorithm too slow?

More information