Spin Locks and Contention. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit
- Ruth Stevens
- 5 years ago
Transcription
1 Spin Locks and Contention Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit
2 Focus so far: Correctness and Progress Models Accurate (we never lied to you) But idealized (so we forgot to mention a few things) Protocols Elegant Important But naïve Art of Multiprocessor Programming 2
3 New Focus: Performance Models More complicated (not the same as complex!) Still focus on principles (not soon obsolete) Protocols Elegant (in their fashion) Important (why else would we pay attention) And realistic (your mileage may vary) Art of Multiprocessor Programming 3
4 Kinds of Architectures SISD (Uniprocessor) Single instruction stream Single data stream SIMD (Vector) Single instruction Multiple data MIMD (Multiprocessors) Multiple instruction Multiple data. Art of Multiprocessor Programming 4
5 Kinds of Architectures SISD (Uniprocessor) Single instruction stream Single data stream SIMD (Vector) Single instruction Multiple data MIMD (Multiprocessors) Multiple instruction Multiple data (our space) Art of Multiprocessor Programming 5
6 MIMD Architectures Two organizations: shared bus and distributed Issues: memory contention, communication contention, communication latency Art of Multiprocessor Programming 6
7 Today: Revisit Mutual Exclusion Performance, not just correctness Proper use of multiprocessor architectures A collection of locking algorithms Art of Multiprocessor Programming 7
8 What Should You Do If You Can't Get a Lock? Keep trying: spin or busy-wait Good if delays are short Give up the processor Good if delays are long Always good on uniprocessor Art of Multiprocessor Programming 8
9 What Should You Do If You Can't Get a Lock? Keep trying (our focus): spin or busy-wait Good if delays are short Give up the processor Good if delays are long Always good on uniprocessor Art of Multiprocessor Programming 9
10 Basic Spin-Lock CS. spin lock critical section Resets lock upon exit Art of Multiprocessor Programming 10
11 Basic Spin-Lock lock introduces sequential bottleneck CS. spin lock critical section Resets lock upon exit Art of Multiprocessor Programming 11
12 Basic Spin-Lock lock suffers from contention CS. spin lock critical section Resets lock upon exit Art of Multiprocessor Programming 12
13 Basic Spin-Lock lock suffers from contention CS. spin lock critical section Resets lock upon exit Notice: these are distinct phenomena Art of Multiprocessor Programming 13
14 Basic Spin-Lock lock suffers from contention CS. spin lock critical section Resets lock upon exit Seq Bottleneck no parallelism Art of Multiprocessor Programming 14
15 Basic Spin-Lock lock suffers from contention CS. spin lock critical section Resets lock upon exit Contention??? Art of Multiprocessor Programming 15
16 Review: Test-and-Set Boolean value Test-and-set (TAS) Swap true with current value Return value tells if prior value was true or false Can reset just by writing false TAS aka getAndSet Art of Multiprocessor Programming 16
17 Review: Test-and-Set public class AtomicBoolean { boolean value; public synchronized boolean getAndSet(boolean newValue) { boolean prior = value; value = newValue; return prior; } } Art of Multiprocessor Programming 17
18 Review: Test-and-Set public class AtomicBoolean { boolean value; public synchronized boolean getAndSet(boolean newValue) { boolean prior = value; value = newValue; return prior; } } Package java.util.concurrent.atomic Art of Multiprocessor Programming 18
19 Review: Test-and-Set public class AtomicBoolean { boolean value; public synchronized boolean getAndSet(boolean newValue) { boolean prior = value; value = newValue; return prior; } } Swap old and new values Art of Multiprocessor Programming 19
20 Review: Test-and-Set AtomicBoolean lock = new AtomicBoolean(false); boolean prior = lock.getAndSet(true); Art of Multiprocessor Programming 20
21 Review: Test-and-Set AtomicBoolean lock = new AtomicBoolean(false); boolean prior = lock.getAndSet(true); Swapping in true is called test-and-set or TAS Art of Multiprocessor Programming 21
22 Test-and-Set Locks Locking Lock is free: value is false Lock is taken: value is true Acquire lock by calling TAS If result is false, you win If result is true, you lose Release lock by writing false Art of Multiprocessor Programming 22
23 Test-and-set Lock class TASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (state.getAndSet(true)) {} } void unlock() { state.set(false); } } Art of Multiprocessor Programming 23
24 Test-and-set Lock class TASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (state.getAndSet(true)) {} } void unlock() { state.set(false); } } Lock state is AtomicBoolean Art of Multiprocessor Programming 24
25 Test-and-set Lock class TASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (state.getAndSet(true)) {} } void unlock() { state.set(false); } } Keep trying until lock acquired Art of Multiprocessor Programming 25
26 Test-and-set Lock class TASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (state.getAndSet(true)) {} } void unlock() { state.set(false); } } Release lock by resetting state to false Art of Multiprocessor Programming 26
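The TAS lock on these slides can be tried out with the real java.util.concurrent.atomic.AtomicBoolean. The following is a minimal sketch, not the deck's own code: the class name TASLockSketch and the counter driver TASCounterDemo are hypothetical names introduced here for illustration.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Test-and-set spin lock: getAndSet(true) returns the prior value. */
class TASLockSketch {
    private final AtomicBoolean state = new AtomicBoolean(false);

    void lock() {
        // Keep swapping in true until the prior value was false (lock was free).
        while (state.getAndSet(true)) { }
    }

    void unlock() {
        state.set(false); // release by writing false
    }
}

/** Hypothetical driver: threads increment a shared counter under the lock. */
class TASCounterDemo {
    static int counter;

    static int run(int threads, int incrementsPerThread) {
        counter = 0;
        TASLockSketch lock = new TASLockSketch();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < incrementsPerThread; j++) {
                    lock.lock();
                    try { counter++; } finally { lock.unlock(); }
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try { t.join(); } catch (InterruptedException e) { throw new RuntimeException(e); }
        }
        return counter;
    }
}
```

If mutual exclusion holds, TASCounterDemo.run(4, 10000) returns exactly 40000, because no increment is lost.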
27 Space Complexity TAS spin-lock has small footprint n-thread spin-lock uses O(1) space As opposed to O(n) Peterson/Bakery How did we overcome the Ω(n) lower bound? We used an RMW operation Art of Multiprocessor Programming 27
28 Performance Experiment n threads Increment shared counter 1 million times How long should it take? How long does it take? Art of Multiprocessor Programming 28
29 Graph (time vs. threads) Ideal: no speedup because of sequential bottleneck Art of Multiprocessor Programming 29
30 Mystery #1 Graph (time vs. threads): TAS lock versus ideal What is going on? Art of Multiprocessor Programming 30
31 Test-and-Test-and-Set Locks Lurking stage Wait until lock looks free Spin while read returns true (lock taken) Pouncing state As soon as lock looks available Read returns false (lock free) Call TAS to acquire lock If TAS loses, back to lurking Art of Multiprocessor Programming 31
32 Test-and-test-and-set Lock class TTASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (true) { while (state.get()) {} if (!state.getAndSet(true)) return; } } } Art of Multiprocessor Programming 32
33 Test-and-test-and-set Lock class TTASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (true) { while (state.get()) {} if (!state.getAndSet(true)) return; } } } Wait until lock looks free Art of Multiprocessor Programming 33
34 Test-and-test-and-set Lock class TTASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (true) { while (state.get()) {} if (!state.getAndSet(true)) return; } } } Then try to acquire it Art of Multiprocessor Programming 34
35 Mystery #2 Graph (time vs. threads): TAS lock, TTAS lock, ideal Art of Multiprocessor Programming 35
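The lurking and pouncing phases can be sketched as a complete class. This is an illustrative version under assumptions: the slide shows only lock(), so the unlock() method, the class name TTASLockSketch, and the driver TTASCounterDemo are additions made here.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Test-and-test-and-set lock: read until the lock looks free, then try TAS. */
class TTASLockSketch {
    private final AtomicBoolean state = new AtomicBoolean(false);

    void lock() {
        while (true) {
            while (state.get()) { }             // lurk: plain reads only
            if (!state.getAndSet(true)) return; // pounce: TAS, return on success
            // TAS lost the race: go back to lurking
        }
    }

    void unlock() {
        state.set(false); // assumed to match the TAS lock's release
    }
}

/** Hypothetical driver: threads increment a shared counter under the lock. */
class TTASCounterDemo {
    static int counter;

    static int run(int threads, int incrementsPerThread) {
        counter = 0;
        TTASLockSketch lock = new TTASLockSketch();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < incrementsPerThread; j++) {
                    lock.lock();
                    try { counter++; } finally { lock.unlock(); }
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try { t.join(); } catch (InterruptedException e) { throw new RuntimeException(e); }
        }
        return counter;
    }
}
```

Functionally this behaves like the TAS lock; the difference the slides care about is only in how much bus traffic the waiting threads generate.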
36 Mystery Both TAS and TTAS Do the same thing (in our model) Except that TTAS performs much better than TAS Neither approaches ideal Art of Multiprocessor Programming 36
37 Opinion Our memory abstraction is broken TAS & TTAS methods Are provably the same (in our model) Except they aren't (in field tests) Need a more detailed model Art of Multiprocessor Programming 37
38 Bus-Based Architectures [diagram: caches and memory attached to a shared bus] Art of Multiprocessor Programming 38
39 Bus-Based Architectures Random access memory (10s of cycles) [diagram: caches and memory attached to a shared bus] Art of Multiprocessor Programming 39
40 Bus-Based Architectures Shared Bus Broadcast medium One broadcaster at a time Processors and memory all snoop [diagram: caches and memory attached to a shared bus] Art of Multiprocessor Programming 40
41 Bus-Based Architectures Per-Processor Caches Small Fast: 1 or 2 cycles Address & state information [diagram: caches and memory attached to a shared bus] Art of Multiprocessor Programming 41
42 Granularity Caches operate at a larger granularity than a word Cache line: fixed-size block containing the address (today 64 or 128 bytes) Art of Multiprocessor Programming 42
43 Locality If you use an address now, you will probably use it again soon Fetch from cache, not memory If you use an address now, you will probably use a nearby address soon In the same cache line Art of Multiprocessor Programming 43
44 L1 and L2 Caches [diagram: L1 and L2] Art of Multiprocessor Programming 44
45 L1 and L2 Caches L1: small & fast, 1 or 2 cycles Art of Multiprocessor Programming 45
46 L1 and L2 Caches L2: larger and slower, 10s of cycles, ~128 byte line Art of Multiprocessor Programming 46
47 Jargon Watch Cache hit I found what I wanted in my cache Good Thing Art of Multiprocessor Programming 47
48 Jargon Watch Cache hit I found what I wanted in my cache Good Thing Cache miss I had to shlep all the way to memory for that data Bad Thing Art of Multiprocessor Programming 48
49 Cave Canem This model is still a simplification But not in any essential way Illustrates basic principles Will discuss complexities later Art of Multiprocessor Programming 49
50 When a Cache Becomes Full Need to make room for new entry By evicting an existing entry Need a replacement policy Usually some kind of least recently used heuristic Art of Multiprocessor Programming 50
51 Fully Associative Cache Any line can be anywhere in the cache Advantage: can replace any line Disadvantage: hard to find lines Art of Multiprocessor Programming 51
52 Direct Mapped Cache Every address has exactly 1 slot Advantage: easy to find a line Disadvantage: must replace fixed line Art of Multiprocessor Programming 52
53 K-way Set Associative Cache Each slot holds k lines Advantage: pretty easy to find a line Advantage: some choice in replacing line Art of Multiprocessor Programming 53
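The three cache organizations differ only in how many lines share a slot. The arithmetic can be sketched as follows; the class CacheGeometry and the example sizes (32 KiB cache, 64-byte lines) are assumptions chosen for illustration, not figures from the slides.

```java
/** Hypothetical helper: maps a byte address to a cache set (slot). */
class CacheGeometry {
    final int lineBytes; // size of one cache line
    final int ways;      // lines per set: 1 = direct mapped, k = k-way
    final int numSets;   // number of sets; 1 = fully associative

    CacheGeometry(int totalBytes, int lineBytes, int ways) {
        this.lineBytes = lineBytes;
        this.ways = ways;
        this.numSets = totalBytes / (lineBytes * ways);
    }

    /** Which cache line the address falls in. */
    long lineNumber(long address) {
        return address / lineBytes;
    }

    /** Which set that line must be placed in. */
    int setIndex(long address) {
        return (int) (lineNumber(address) % numSets);
    }
}
```

With these assumed sizes, a direct-mapped cache has 512 sets, so addresses 0 and 32768 compete for the same single line. An 8-way cache has 64 sets of 8 lines each, so up to 8 such conflicting lines can coexist.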
54 Multicore Set Associativity k is 8 or even 16 and growing Why? Because cores share sets Threads cut effective size if accessing different data Art of Multiprocessor Programming 54
55 Cache Coherence A and B both cache address x A writes to x Updates cache How does B find out? Many cache coherence protocols in literature Art of Multiprocessor Programming 55
56 MESI Modified Have modified cached data, must write back to memory Art of Multiprocessor Programming 56
57 MESI Modified Have modified cached data, must write back to memory Exclusive Not modified, I have only copy Art of Multiprocessor Programming 57
58 MESI Modified Have modified cached data, must write back to memory Exclusive Not modified, I have only copy Shared Not modified, may be cached elsewhere Art of Multiprocessor Programming 58
59 MESI Modified Have modified cached data, must write back to memory Exclusive Not modified, I have only copy Shared Not modified, may be cached elsewhere Invalid Cache contents not meaningful Art of Multiprocessor Programming 59
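The four states can be read as a small state machine per cache line. The sketch below is a simplified textbook rendering, assuming an invalidation-based protocol; the method names are hypothetical and real hardware protocols have more cases.

```java
/** Simplified per-line MESI state machine (illustrative, not a hardware model). */
enum MesiState {
    MODIFIED, EXCLUSIVE, SHARED, INVALID;

    /** This processor reads the line. */
    MesiState onLocalRead(boolean otherCachesHoldLine) {
        if (this != INVALID) return this;                // cache hit, no change
        return otherCachesHoldLine ? SHARED : EXCLUSIVE; // miss: fetch over the bus
    }

    /** This processor writes the line. */
    MesiState onLocalWrite() {
        return MODIFIED; // from SHARED or INVALID this first needs a bus invalidation
    }

    /** Another processor reads the line. */
    MesiState onRemoteRead() {
        return this == INVALID ? INVALID : SHARED; // a MODIFIED line is supplied first
    }

    /** Another processor writes the line. */
    MesiState onRemoteWrite() {
        return INVALID; // our copy is no longer meaningful
    }

    /** True if a write must first go to the bus to invalidate other copies. */
    boolean writeNeedsBus() {
        return this == SHARED || this == INVALID;
    }
}
```

The last method hints at why spinning matters later in the deck: a write to a line in state SHARED always costs a bus transaction.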
60 Processor Issues Load Request A processor broadcasts load x on the bus; memory holds the data Art of Multiprocessor Programming 60
61 Memory Responds Memory answers "Got it!"; the requesting cache now holds the data in state E Art of Multiprocessor Programming 61
62 Processor Issues Load Request A second processor broadcasts Load x; the first cache holds the data in state E Art of Multiprocessor Programming 62
63 Other Processor Responds The first cache answers "Got it"; its line goes from E to S and the requester's copy is S Art of Multiprocessor Programming 63
64 Modify Cached Data Two caches hold the data in state S Art of Multiprocessor Programming 64
65 Write-Through Cache The writer broadcasts "Write x!"; two caches hold the data in state S Art of Multiprocessor Programming 65
66 Write-Through Caches Immediately broadcast changes Good Memory, caches always agree More read hits, maybe Bad Bus traffic on all writes Most writes to unshared data For example, loop indexes Art of Multiprocessor Programming 66
67 Write-Through Caches Immediately broadcast changes Good Memory, caches always agree More read hits, maybe Bad Bus traffic on all writes Most writes to unshared data For example, loop indexes show stoppers Art of Multiprocessor Programming 67
68 Write-Back Caches Accumulate changes in cache Write back when line evicted Need the cache for something else Another processor wants it Art of Multiprocessor Programming 68
69 Invalidate The writer broadcasts "Invalidate x"; the other cache's copy goes from S to I and the writer's copy from S to M Art of Multiprocessor Programming 69
70 Invalidate This cache acquires write permission Art of Multiprocessor Programming 70
71 Invalidate Other caches lose read permission This cache acquires write permission Art of Multiprocessor Programming 71
72 Invalidate Memory provides data only if not present in any cache, so no need to change it now (expensive) Art of Multiprocessor Programming 72
73 Mutual Exclusion What do we want to optimize? Bus bandwidth used by spinning threads Release/Acquire latency Acquire latency for idle lock Art of Multiprocessor Programming 73
74 Simple TASLock TAS invalidates cache lines Spinners Miss in cache Go to bus Thread wants to release lock delayed behind spinners Art of Multiprocessor Programming 74
75 Test-and-test-and-set Wait until lock looks free Spin on local cache No bus use while lock busy Problem: when lock is released Invalidation storm Art of Multiprocessor Programming 75
76 Local Spinning while Lock is Busy [diagram: every cache and memory hold busy] Art of Multiprocessor Programming 76
77 On Release [diagram: spinners' cached copies are invalid; releaser's cache and memory hold free] Art of Multiprocessor Programming 77
78 On Release Everyone misses, rereads [diagram: spinners' cached copies are invalid; memory holds free] Art of Multiprocessor Programming 78
79 On Release Everyone tries TAS [diagram: spinners' cached copies are invalid; memory holds free] Art of Multiprocessor Programming 79
80 Problems Everyone misses Reads satisfied sequentially Everyone does TAS Invalidates others' caches Eventually quiesces after lock acquired How long does this take? Art of Multiprocessor Programming 80
81 Measuring Quiescence Time Threads P1, P2, ..., Pn each: Acquire lock Pause without using bus Use bus heavily If pause > quiescence time, critical section duration independent of number of threads If pause < quiescence time, critical section duration slower with more threads Art of Multiprocessor Programming 81
82 Quiescence Time Increases linearly with the number of processors for bus architecture (graph: time vs. threads) Art of Multiprocessor Programming 82
83 Mystery Explained TTAS lock: better than TAS but still not as good as ideal (graph: time vs. threads for TAS lock, TTAS lock, ideal) Art of Multiprocessor Programming 83
84 Solution: Introduce Delay If the lock looks free But I fail to get it There must be contention Better to back off than to collide again (timeline: spin on lock, then delays d, r1 d, r2 d) Art of Multiprocessor Programming 84
85 Dynamic Example: Exponential Backoff If I fail to get lock Wait random duration before retry Each subsequent failure doubles expected wait (timeline: spin on lock, then delays d, 2d, 4d) Art of Multiprocessor Programming 85
86 Exponential Backoff Lock public class Backoff implements Lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!state.getAndSet(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; } } } Art of Multiprocessor Programming 86
87 Exponential Backoff Lock public class Backoff implements Lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!state.getAndSet(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; } } } Fix minimum delay Art of Multiprocessor Programming 87
88 Exponential Backoff Lock public class Backoff implements Lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!state.getAndSet(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; } } } Wait until lock looks free Art of Multiprocessor Programming 88
89 Exponential Backoff Lock public class Backoff implements Lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!state.getAndSet(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; } } } If we win, return Art of Multiprocessor Programming 89
90 Exponential Backoff Lock public class Backoff implements Lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!state.getAndSet(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; } } } Back off for random duration Art of Multiprocessor Programming 90
91 Exponential Backoff Lock public class Backoff implements Lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!state.getAndSet(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; } } } Double max delay, within reason Art of Multiprocessor Programming 91
92 Spin-Waiting Overhead Graph (time vs. threads): TTAS lock versus backoff lock Art of Multiprocessor Programming 92
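The slide code leaves sleep, random, MIN_DELAY and MAX_DELAY undefined. One way to fill them in is sketched below, assuming LockSupport.parkNanos for the pause and ThreadLocalRandom for the random duration. The delay bounds (1 microsecond to about 1 millisecond) are arbitrary placeholders, and the class and driver names are hypothetical.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.LockSupport;

/** TTAS lock with randomized exponential backoff after a lost TAS. */
class BackoffLockSketch {
    private static final int MIN_DELAY_NANOS = 1_000;     // assumed value
    private static final int MAX_DELAY_NANOS = 1_000_000; // assumed value
    private final AtomicBoolean state = new AtomicBoolean(false);

    void lock() {
        int limit = MIN_DELAY_NANOS;
        while (true) {
            while (state.get()) { }             // wait until lock looks free
            if (!state.getAndSet(true)) return; // if we win, return
            // Lost the race: back off for a random duration in [0, limit).
            LockSupport.parkNanos(ThreadLocalRandom.current().nextInt(limit));
            if (limit < MAX_DELAY_NANOS) limit *= 2; // double the bound, within reason
        }
    }

    void unlock() {
        state.set(false);
    }
}

/** Hypothetical driver: threads increment a shared counter under the lock. */
class BackoffCounterDemo {
    static int counter;

    static int run(int threads, int incrementsPerThread) {
        counter = 0;
        BackoffLockSketch lock = new BackoffLockSketch();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < incrementsPerThread; j++) {
                    lock.lock();
                    try { counter++; } finally { lock.unlock(); }
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try { t.join(); } catch (InterruptedException e) { throw new RuntimeException(e); }
        }
        return counter;
    }
}
```

The two delay constants are exactly the parameters that have to be tuned per platform.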
93 Backoff: Other Issues Good Easy to implement Beats TTAS lock Bad Must choose parameters carefully Not portable across platforms Art of Multiprocessor Programming 93
94 Idea Avoid useless invalidations By keeping a queue of threads Each thread Notifies next in line Without bothering the others Art of Multiprocessor Programming 95
95 Anderson Queue Lock next idle flags T F F F F F F F Art of Multiprocessor Programming 96
96 Anderson Queue Lock next acquiring getAndIncrement flags T F F F F F F F Art of Multiprocessor Programming 97
97 Anderson Queue Lock next acquiring getAndIncrement flags T F F F F F F F Art of Multiprocessor Programming 98
98 Anderson Queue Lock next acquired Mine! flags T F F F F F F F Art of Multiprocessor Programming 99
99 Anderson Queue Lock next acquired acquiring flags T F F F F F F F Art of Multiprocessor Programming 100
100 Anderson Queue Lock next acquired acquiring flags getAndIncrement T F F F F F F F Art of Multiprocessor Programming 101
101 Anderson Queue Lock next acquired acquiring flags getAndIncrement T F F F F F F F Art of Multiprocessor Programming 102
102 Anderson Queue Lock next acquired acquiring flags T F F F F F F F Art of Multiprocessor Programming 103
103 Anderson Queue Lock next released acquired flags T T F F F F F F Art of Multiprocessor Programming 104
104 Anderson Queue Lock next released acquired flags Yow! T T F F F F F F Art of Multiprocessor Programming 105
105 Anderson Queue Lock class ALock implements Lock { boolean[] flags = {true, false, ..., false}; AtomicInteger next = new AtomicInteger(0); ThreadLocal<Integer> mySlot; Art of Multiprocessor Programming 106
106 Anderson Queue Lock class ALock implements Lock { boolean[] flags = {true, false, ..., false}; AtomicInteger next = new AtomicInteger(0); ThreadLocal<Integer> mySlot; One flag per thread Art of Multiprocessor Programming 107
107 Anderson Queue Lock class ALock implements Lock { boolean[] flags = {true, false, ..., false}; AtomicInteger next = new AtomicInteger(0); ThreadLocal<Integer> mySlot; Next flag to use Art of Multiprocessor Programming 108
108 Anderson Queue Lock class ALock implements Lock { boolean[] flags = {true, false, ..., false}; AtomicInteger next = new AtomicInteger(0); ThreadLocal<Integer> mySlot; Thread-local variable Art of Multiprocessor Programming 109
109 Anderson Queue Lock public void lock() { mySlot.set(next.getAndIncrement()); while (!flags[mySlot.get() % n]) {} flags[mySlot.get() % n] = false; } public void unlock() { flags[(mySlot.get() + 1) % n] = true; } Art of Multiprocessor Programming 110
110 Anderson Queue Lock public void lock() { mySlot.set(next.getAndIncrement()); while (!flags[mySlot.get() % n]) {} flags[mySlot.get() % n] = false; } public void unlock() { flags[(mySlot.get() + 1) % n] = true; } Take next slot Art of Multiprocessor Programming 111
111 Anderson Queue Lock public void lock() { mySlot.set(next.getAndIncrement()); while (!flags[mySlot.get() % n]) {} flags[mySlot.get() % n] = false; } public void unlock() { flags[(mySlot.get() + 1) % n] = true; } Spin until told to go Art of Multiprocessor Programming 112
112 Anderson Queue Lock public void lock() { mySlot.set(next.getAndIncrement()); while (!flags[mySlot.get() % n]) {} flags[mySlot.get() % n] = false; } public void unlock() { flags[(mySlot.get() + 1) % n] = true; } Prepare slot for re-use Art of Multiprocessor Programming 113
113 Anderson Queue Lock public void lock() { mySlot.set(next.getAndIncrement()); while (!flags[mySlot.get() % n]) {} flags[mySlot.get() % n] = false; } public void unlock() { flags[(mySlot.get() + 1) % n] = true; } Tell next thread to go Art of Multiprocessor Programming 114
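A plain Java boolean[] gives no visibility guarantee between threads, so a runnable version needs something stronger. The sketch below assumes an AtomicIntegerArray (1 = go, 0 = wait) in place of the slide's boolean array. The Thread.yield() in the spin loop is also an addition, so that the demo finishes promptly when threads outnumber cores. Class and driver names are hypothetical.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicIntegerArray;

/** Array-based queue lock in the style of the slides; capacity bounds the thread count. */
class ALockSketch {
    private final int capacity;
    private final AtomicIntegerArray flags;                 // 1 = go, 0 = wait
    private final AtomicInteger next = new AtomicInteger(0); // next slot to hand out
    private final ThreadLocal<Integer> mySlot = new ThreadLocal<>();

    ALockSketch(int capacity) {
        this.capacity = capacity;
        this.flags = new AtomicIntegerArray(capacity);
        flags.set(0, 1); // the first slot may go immediately
    }

    void lock() {
        // Take next slot; floorMod keeps the index non-negative.
        int slot = Math.floorMod(next.getAndIncrement(), capacity);
        mySlot.set(slot);
        while (flags.get(slot) == 0) { Thread.yield(); } // spin until told to go
        flags.set(slot, 0);                              // prepare slot for re-use
    }

    void unlock() {
        flags.set((mySlot.get() + 1) % capacity, 1);     // tell next thread to go
    }
}

/** Hypothetical driver: threads increment a shared counter under the lock. */
class ALockCounterDemo {
    static int counter;

    static int run(int threads, int incrementsPerThread) {
        counter = 0;
        ALockSketch lock = new ALockSketch(threads);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < incrementsPerThread; j++) {
                    lock.lock();
                    try { counter++; } finally { lock.unlock(); }
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try { t.join(); } catch (InterruptedException e) { throw new RuntimeException(e); }
        }
        return counter;
    }
}
```

The lock is only correct while at most capacity threads use it at once, which is the space cost the deck raises a few slides later.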
114 Local Spinning Spin on my bit Unfortunately many bits share cache line [diagram: next, flags T F F F F F F F, one thread released, one acquired] Art of Multiprocessor Programming 115
115 False Sharing Result: contention Spinning thread gets cache invalidation on account of store by threads it is not waiting for [diagram: flags T F F F F F F F spread over cache lines 1 and 2] Art of Multiprocessor Programming 116
116 The Solution: Padding Spin on my line [diagram: flags T / / / F / / / padded over cache lines 1 and 2] Art of Multiprocessor Programming 117
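Padding can be approximated in Java by spacing the flags a fixed stride apart. The sketch below assumes 64-byte cache lines (a stride of 16 four-byte ints) and is best effort only, since the JVM does not promise how an array is aligned in memory. All names are hypothetical, and Thread.yield() is again a demo convenience.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicIntegerArray;

/** Array queue lock whose flags sit STRIDE ints apart to reduce false sharing. */
class PaddedALockSketch {
    static final int STRIDE = 16; // 16 ints * 4 bytes = 64 bytes, an assumed line size

    private final int capacity;
    private final AtomicIntegerArray flags; // only every STRIDE-th entry is used
    private final AtomicInteger next = new AtomicInteger(0);
    private final ThreadLocal<Integer> mySlot = new ThreadLocal<>();

    PaddedALockSketch(int capacity) {
        this.capacity = capacity;
        this.flags = new AtomicIntegerArray(capacity * STRIDE);
        flags.set(indexOf(0), 1);
    }

    /** Array index of a slot's flag; the entries in between are padding. */
    static int indexOf(int slot) {
        return slot * STRIDE;
    }

    void lock() {
        int slot = Math.floorMod(next.getAndIncrement(), capacity);
        mySlot.set(slot);
        while (flags.get(indexOf(slot)) == 0) { Thread.yield(); } // spin on my line
        flags.set(indexOf(slot), 0);
    }

    void unlock() {
        flags.set(indexOf((mySlot.get() + 1) % capacity), 1);
    }
}

/** Hypothetical driver: threads increment a shared counter under the lock. */
class PaddedCounterDemo {
    static int counter;

    static int run(int threads, int incrementsPerThread) {
        counter = 0;
        PaddedALockSketch lock = new PaddedALockSketch(threads);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < incrementsPerThread; j++) {
                    lock.lock();
                    try { counter++; } finally { lock.unlock(); }
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try { t.join(); } catch (InterruptedException e) { throw new RuntimeException(e); }
        }
        return counter;
    }
}
```

The trade is explicit here: each one-bit flag now occupies a whole assumed cache line.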
117 Performance TTAS queue Shorter handover than backoff Curve is practically flat Scalable performance Art of Multiprocessor Programming 118
118 Anderson Queue Lock Good First truly scalable lock Simple, easy to implement Back to FCFS order (like Bakery) Art of Multiprocessor Programming 119
119 Anderson Queue Lock Bad Space hog One bit per thread becomes one cache line per thread What if unknown number of threads? What if small number of actual contenders? Art of Multiprocessor Programming 120
120 CLH Lock FCFS order Small, constant-size overhead per thread Art of Multiprocessor Programming 121
121 Initially idle tail false Art of Multiprocessor Programming 122
122 Initially idle tail false Queue tail Art of Multiprocessor Programming 123
123 Initially idle tail false Lock is free Art of Multiprocessor Programming 124
124 Initially idle tail false Art of Multiprocessor Programming 125
125 Purple Wants the Lock acquiring tail false Art of Multiprocessor Programming 126
126 Purple Wants the Lock acquiring tail false true Art of Multiprocessor Programming 127
127 Purple Wants the Lock acquiring Swap tail false true Art of Multiprocessor Programming 128
128 Purple Has the Lock acquired tail false true Art of Multiprocessor Programming 129
129 Red Wants the Lock acquired acquiring tail false true true Art of Multiprocessor Programming 130
130 Red Wants the Lock acquired acquiring Swap tail false true true Art of Multiprocessor Programming 131
131 Red Wants the Lock acquired acquiring tail false true true Art of Multiprocessor Programming 132
132 Red Wants the Lock acquired acquiring tail false true true Art of Multiprocessor Programming 133
133 Red Wants the Lock acquired acquiring Implicit Linked list tail false true true Art of Multiprocessor Programming 134
134 Red Wants the Lock acquired acquiring tail false true true Art of Multiprocessor Programming 135
135 Red Wants the Lock acquired acquiring tail false true true true Actually, it spins on cached copy Art of Multiprocessor Programming 136
136 Purple Releases release acquiring false Bingo! tail false false true Art of Multiprocessor Programming 137
137 Purple Releases released acquired tail true Art of Multiprocessor Programming 138
138 Space Usage Let L = number of locks N = number of threads ALock O(LN) CLH lock O(L+N) Art of Multiprocessor Programming 139
139 CLH Queue Lock class QNode { AtomicBoolean locked = new AtomicBoolean(true); } Art of Multiprocessor Programming 140
140 CLH Queue Lock class QNode { AtomicBoolean locked = new AtomicBoolean(true); } Not released yet Art of Multiprocessor Programming 141
141 CLH Queue Lock class CLHLock implements Lock { AtomicReference<QNode> tail; ThreadLocal<QNode> myNode = ThreadLocal.withInitial(QNode::new); public void lock() { QNode pred = tail.getAndSet(myNode.get()); while (pred.locked.get()) {} } } Art of Multiprocessor Programming 142
142 CLH Queue Lock class CLHLock implements Lock { AtomicReference<QNode> tail; ThreadLocal<QNode> myNode = ThreadLocal.withInitial(QNode::new); public void lock() { QNode pred = tail.getAndSet(myNode.get()); while (pred.locked.get()) {} } } Queue tail Art of Multiprocessor Programming 143
143 CLH Queue Lock class CLHLock implements Lock { AtomicReference<QNode> tail; ThreadLocal<QNode> myNode = ThreadLocal.withInitial(QNode::new); public void lock() { QNode pred = tail.getAndSet(myNode.get()); while (pred.locked.get()) {} } } Thread-local QNode Art of Multiprocessor Programming 144
144 CLH Queue Lock class CLHLock implements Lock { AtomicReference<QNode> tail; ThreadLocal<QNode> myNode = ThreadLocal.withInitial(QNode::new); public void lock() { QNode pred = tail.getAndSet(myNode.get()); while (pred.locked.get()) {} } } Swap in my node Art of Multiprocessor Programming 145
145 CLH Queue Lock class CLHLock implements Lock { AtomicReference<QNode> tail; ThreadLocal<QNode> myNode = ThreadLocal.withInitial(QNode::new); public void lock() { QNode pred = tail.getAndSet(myNode.get()); while (pred.locked.get()) {} } } Spin until predecessor releases lock Art of Multiprocessor Programming 146
146 CLH Queue Lock class CLHLock implements Lock { public void unlock() { myNode.get().locked.set(false); myNode.set(pred); } } Art of Multiprocessor Programming 147
147 CLH Queue Lock class CLHLock implements Lock { public void unlock() { myNode.get().locked.set(false); myNode.set(pred); } } Notify successor Art of Multiprocessor Programming 148
148 CLH Queue Lock class CLHLock implements Lock { public void unlock() { myNode.get().locked.set(false); myNode.set(pred); } } Recycle predecessor's node Art of Multiprocessor Programming 149
149 CLH Queue Lock class CLHLock implements Lock { public void unlock() { myNode.get().locked.set(false); myNode.set(pred); } } (we don't actually reuse myNode. Code in book shows how it's done.) Art of Multiprocessor Programming 150
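The slide's unlock() refers to pred, which is local to lock(). A common way to close that gap is a second thread-local that remembers the predecessor, sketched below. The names myPred, CLHLockSketch and CLHCounterDemo are assumptions, QNode uses a volatile boolean instead of the slide's AtomicBoolean, and Thread.yield() is a demo convenience.

```java
import java.util.concurrent.atomic.AtomicReference;

/** CLH queue lock: each thread spins on its predecessor's node. */
class CLHLockSketch {
    static final class QNode {
        volatile boolean locked = false; // true = owner holds or wants the lock
    }

    // The tail starts as an unlocked dummy node, so the lock begins free.
    private final AtomicReference<QNode> tail = new AtomicReference<>(new QNode());
    private final ThreadLocal<QNode> myNode = ThreadLocal.withInitial(QNode::new);
    private final ThreadLocal<QNode> myPred = new ThreadLocal<>();

    void lock() {
        QNode node = myNode.get();
        node.locked = true;                     // not released yet
        QNode pred = tail.getAndSet(node);      // swap in my node
        myPred.set(pred);                       // remember it for unlock()
        while (pred.locked) { Thread.yield(); } // spin until predecessor releases
    }

    void unlock() {
        QNode node = myNode.get();
        node.locked = false;      // notify successor
        myNode.set(myPred.get()); // recycle predecessor's node for my next lock()
    }
}

/** Hypothetical driver: threads increment a shared counter under the lock. */
class CLHCounterDemo {
    static int counter;

    static int run(int threads, int incrementsPerThread) {
        counter = 0;
        CLHLockSketch lock = new CLHLockSketch();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < incrementsPerThread; j++) {
                    lock.lock();
                    try { counter++; } finally { lock.unlock(); }
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try { t.join(); } catch (InterruptedException e) { throw new RuntimeException(e); }
        }
        return counter;
    }
}
```

A thread cannot reuse its own node right after unlock(), because its successor may still be reading it. Taking over the predecessor's node avoids allocating a fresh one.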
150 CLH Lock Good Lock release affects successor only Small, constant-sized space Bad Doesn't work for uncached NUMA architectures Art of Multiprocessor Programming 151
151 NUMA and cc-NUMA Architectures Acronym: Non-Uniform Memory Architecture cc-NUMA = cache-coherent NUMA Illusion: Flat shared memory Truth: No caches (sometimes) Some memory regions faster than others Art of Multiprocessor Programming 152
152 NUMA Machines Spinning on local memory is fast Art of Multiprocessor Programming 153
153 NUMA Machines Spinning on remote memory is slow Art of Multiprocessor Programming 154
154 CLH Lock Each thread spins on predecessor's memory Could be far away Art of Multiprocessor Programming 155
155 MCS Lock FCFS order Spin on local memory only Small, Constant-size overhead Art of Multiprocessor Programming 156
156 Initially idle tail false Art of Multiprocessor Programming 157
157 Acquiring acquiring (allocate QNode) tail false true Art of Multiprocessor Programming 158
158 Acquiring acquired tail swap false true Art of Multiprocessor Programming 159
159 Acquiring acquired tail false true Art of Multiprocessor Programming 160
160 Acquired acquired tail false true Art of Multiprocessor Programming 161
161 Acquiring acquired acquiring tail swap false true Art of Multiprocessor Programming 162
162 Acquiring acquired acquiring tail false true Art of Multiprocessor Programming 163
163 Acquiring acquired acquiring tail false true Art of Multiprocessor Programming 164
164 Acquiring acquired acquiring tail false true Art of Multiprocessor Programming 165
165 Acquiring acquired acquiring tail true false true Art of Multiprocessor Programming 166
166 Acquiring acquired acquiring tail true Yes! false true Art of Multiprocessor Programming 167
167 MCS Queue Lock class QNode { volatile boolean locked = false; volatile QNode next = null; } Art of Multiprocessor Programming 168
168 MCS Queue Lock class MCSLock implements Lock { AtomicReference<QNode> tail; public void lock() { QNode qnode = new QNode(); QNode pred = tail.getAndSet(qnode); if (pred != null) { qnode.locked = true; pred.next = qnode; while (qnode.locked) {} } } } Art of Multiprocessor Programming 169
169 MCS Queue Lock class MCSLock implements Lock { AtomicReference<QNode> tail; public void lock() { QNode qnode = new QNode(); QNode pred = tail.getAndSet(qnode); if (pred != null) { qnode.locked = true; pred.next = qnode; while (qnode.locked) {} } } } Make a QNode Art of Multiprocessor Programming 170
170 MCS Queue Lock class MCSLock implements Lock { AtomicReference<QNode> tail; public void lock() { QNode qnode = new QNode(); QNode pred = tail.getAndSet(qnode); if (pred != null) { qnode.locked = true; pred.next = qnode; while (qnode.locked) {} } } } Add my node to the tail of queue Art of Multiprocessor Programming 171
171 MCS Queue Lock class MCSLock implements Lock { AtomicReference<QNode> tail; public void lock() { QNode qnode = new QNode(); QNode pred = tail.getAndSet(qnode); if (pred != null) { qnode.locked = true; pred.next = qnode; while (qnode.locked) {} } } } Fix if queue was non-empty Art of Multiprocessor Programming 172
172 MCS Queue Lock class MCSLock implements Lock { AtomicReference<QNode> tail; public void lock() { QNode qnode = new QNode(); QNode pred = tail.getAndSet(qnode); if (pred != null) { qnode.locked = true; pred.next = qnode; while (qnode.locked) {} } } } Wait until unlocked Art of Multiprocessor Programming 173
173 Purple Release releasing swap false false Art of Multiprocessor Programming 174
174 Purple Release releasing I don't see a successor. But by looking at the queue, I see another thread is active swap false false Art of Multiprocessor Programming 175
175 Purple Release releasing I don't see a successor. But by looking at the queue, I see another thread is active swap false false I have to release that thread so must wait for it to identify its node Art of Multiprocessor Programming 176
176 Purple Release releasing prepare to spin true false Art of Multiprocessor Programming 177
177 Purple Release releasing spinning true false Art of Multiprocessor Programming 178
178 Purple Release releasing spinning false true false Art of Multiprocessor Programming 179
179 Purple Release releasing Acquired lock false true false Art of Multiprocessor Programming 180
180 MCS Queue Unlock class MCSLock implements Lock { AtomicReference<QNode> tail; public void unlock() { if (qnode.next == null) { if (tail.compareAndSet(qnode, null)) return; while (qnode.next == null) {} } qnode.next.locked = false; } } Art of Multiprocessor Programming 181
181 MCS Queue Lock class MCSLock implements Lock { AtomicReference<QNode> tail; public void unlock() { if (qnode.next == null) { if (tail.compareAndSet(qnode, null)) return; while (qnode.next == null) {} } qnode.next.locked = false; } } Missing successor? Art of Multiprocessor Programming 182
182 MCS Queue Lock class MCSLock implements Lock { AtomicReference<QNode> tail; public void unlock() { if (qnode.next == null) { if (tail.compareAndSet(qnode, null)) return; while (qnode.next == null) {} } qnode.next.locked = false; } } If really no successor, return Art of Multiprocessor Programming 183
183 MCS Queue Lock class MCSLock implements Lock { AtomicReference<QNode> tail; public void unlock() { if (qnode.next == null) { if (tail.compareAndSet(qnode, null)) return; while (qnode.next == null) {} } qnode.next.locked = false; } } Otherwise wait for successor to catch up Art of Multiprocessor Programming 184
184 MCS Queue Lock class MCSLock implements Lock { AtomicReference<QNode> tail; public void unlock() { if (qnode.next == null) { if (tail.compareAndSet(qnode, null)) return; while (qnode.next == null) {} } qnode.next.locked = false; } } Pass lock to successor Art of Multiprocessor Programming 185
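Put together, lock() and unlock() need a shared way to reach the thread's own node. The sketch below assumes a thread-local node that is reset and reused, rather than the slide's per-call new QNode(). Names are hypothetical and Thread.yield() is a demo convenience.

```java
import java.util.concurrent.atomic.AtomicReference;

/** MCS queue lock: each thread spins on a flag in its own node. */
class MCSLockSketch {
    static final class QNode {
        volatile boolean locked = false;
        volatile QNode next = null;
    }

    private final AtomicReference<QNode> tail = new AtomicReference<>(null);
    private final ThreadLocal<QNode> myNode = ThreadLocal.withInitial(QNode::new);

    void lock() {
        QNode node = myNode.get();
        QNode pred = tail.getAndSet(node);   // add my node to the tail of queue
        if (pred != null) {                  // queue was non-empty
            node.locked = true;
            pred.next = node;                // let predecessor find me
            while (node.locked) { Thread.yield(); } // wait until unlocked
        }
    }

    void unlock() {
        QNode node = myNode.get();
        if (node.next == null) {                         // missing successor?
            if (tail.compareAndSet(node, null)) return;  // really no successor
            while (node.next == null) { Thread.yield(); } // wait for it to link in
        }
        node.next.locked = false; // pass lock to successor
        node.next = null;         // reset my node for reuse
    }
}

/** Hypothetical driver: threads increment a shared counter under the lock. */
class MCSCounterDemo {
    static int counter;

    static int run(int threads, int incrementsPerThread) {
        counter = 0;
        MCSLockSketch lock = new MCSLockSketch();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < incrementsPerThread; j++) {
                    lock.lock();
                    try { counter++; } finally { lock.unlock(); }
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try { t.join(); } catch (InterruptedException e) { throw new RuntimeException(e); }
        }
        return counter;
    }
}
```

The second spin in unlock() covers the window in which a successor has already swapped itself into tail but has not yet written pred.next.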
185 Abortable Locks What if you want to give up waiting for a lock? For example Timeout Database transaction aborted by user Art of Multiprocessor Programming 186
186 Back-off Lock Aborting is trivial Just return from lock() call Extra benefit: No cleaning up Wait-free Immediate return Art of Multiprocessor Programming 187
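Why aborting is trivial for a back-off lock can be seen in a timed acquire: a waiter holds no queue position, so giving up is just a return. The sketch below is an assumption-laden illustration. The tryLock signature mirrors java.util.concurrent.locks.Lock, the delay bounds are placeholders, and the class name is hypothetical.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.LockSupport;

/** Back-off lock with a timed acquire; aborting needs no cleanup. */
class TimedBackoffLockSketch {
    private static final int MIN_DELAY_NANOS = 1_000;     // assumed value
    private static final int MAX_DELAY_NANOS = 1_000_000; // assumed value
    private final AtomicBoolean state = new AtomicBoolean(false);

    /** Returns true if the lock was acquired before the timeout expired. */
    boolean tryLock(long timeout, TimeUnit unit) {
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        int limit = MIN_DELAY_NANOS;
        while (true) {
            while (state.get()) {
                if (System.nanoTime() - deadline >= 0) return false; // give up: just return
            }
            if (!state.getAndSet(true)) return true;
            if (System.nanoTime() - deadline >= 0) return false;
            LockSupport.parkNanos(ThreadLocalRandom.current().nextInt(limit));
            if (limit < MAX_DELAY_NANOS) limit *= 2;
        }
    }

    void unlock() {
        state.set(false);
    }
}
```

A caller that gets false back has left no trace in the lock, which is the "no cleaning up" property the slide lists.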
187 Queue Locks Can't just quit Thread in line behind will starve Need a graceful way out Art of Multiprocessor Programming 188
188-191 Queue Locks (figure sequence, normal operation): three queued nodes hold true while their threads spin; as each holder releases, its node becomes false, the next thread in line is locked, and the queue shrinks until a single locked node with value false remains.
192-196 Queue Locks (figure sequence, a thread quits): starting from the same queue, one spinning thread walks away while its node still reads true; the head is later released (false), but the thread behind the abandoned node is left spinning on true.
197 Abortable CLH Lock: When a thread gives up, removing its node in a wait-free way is hard. Idea: let the successor deal with it.
198 Initially (figure): the lock is idle; tail points to node A. Each node holds a pointer to its predecessor (or null).
199 Initially (figure): the distinguished AVAILABLE node means the lock is free.
200 Acquiring (figure): a thread arrives with a new node.
201 Acquiring (figure): a null predecessor field means the lock is not released or aborted.
202-203 Acquiring (figure): the thread swaps its node into tail.
204 Acquired (figure): a reference to AVAILABLE means the lock is free; the thread is now locked.
205 Normal Case (figure): one thread locked, two spinning. Null means the lock is not free and the request is not aborted.
206 One Thread Aborts (figure): locked, timed out, spinning.
207 Successor Notices (figure): locked, timed out, spinning. Non-null means the predecessor aborted.
208 Recycle Predecessor's Node (figure): locked, spinning.
209 Spin on Earlier Node (figure): locked, spinning.
210 Spin on Earlier Node (figure): released, spinning. "The lock is now mine."
211 Time-out Lock public class TOLock implements Lock { static QNode AVAILABLE = new QNode(); AtomicReference<QNode> tail; ThreadLocal<QNode> mynode;
212 Time-out Lock: the AVAILABLE node signifies a free lock.
213 Time-out Lock: tail is the tail of the queue.
214 Time-out Lock: mynode remembers my node.
215 Time-out Lock public boolean lock(long timeout) { QNode qnode = new QNode(); mynode.set(qnode); qnode.prev = null; QNode mypred = tail.getAndSet(qnode); if (mypred == null || mypred.prev == AVAILABLE) { return true; }
216 Time-out Lock: QNode qnode = new QNode(); mynode.set(qnode); qnode.prev = null; creates and initializes the node.
217 Time-out Lock: tail.getAndSet(qnode) swaps the node with tail.
218 Time-out Lock: if the predecessor is absent or has released the lock, we are done.
219 Time-out Lock long start = now(); while (now() - start < timeout) { QNode predpred = mypred.prev; if (predpred == AVAILABLE) { return true; } else if (predpred != null) { mypred = predpred; } }
220 Time-out Lock: the while loop keeps trying for a while.
221 Time-out Lock: QNode predpred = mypred.prev; spins on the predecessor's prev field.
222 Time-out Lock: predpred == AVAILABLE means the predecessor released the lock.
223 Time-out Lock: predpred != null means the predecessor aborted, so advance one node.
224 Time-out Lock if (!tail.compareAndSet(qnode, mypred)) qnode.prev = mypred; return false; } } What do I do when I time out?
225 Time-out Lock: Do I have a successor? If the CAS fails, I do; tell it about mypred.
226 Time-out Lock: If the CAS succeeds, there is no successor; simply return false.
227 Time-out Unlock public void unlock() { QNode qnode = mynode.get(); if (!tail.compareAndSet(qnode, null)) qnode.prev = AVAILABLE; }
228 Time-out Unlock: If the CAS failed, a successor exists; notify it that it can enter.
229 Time-out Unlock: If the CAS succeeded, tail is set to null; no clean-up is needed since no successor is waiting.
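The time-out lock fragments above can be assembled into one class. The following is a minimal sketch under a few assumptions: the class name TOLockSketch, a timeout measured in nanoseconds via System.nanoTime() in place of the slides' now(), a Thread.yield() in the spin loop, and a hypothetical lockFromOtherThread() helper used only to exercise the time-out path.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch assembling the time-out (abortable CLH) lock fragments.
class TOLockSketch {
    static final class QNode { volatile QNode prev = null; }
    static final QNode AVAILABLE = new QNode();   // signifies a free lock

    private final AtomicReference<QNode> tail = new AtomicReference<>(null);
    private final ThreadLocal<QNode> mynode = new ThreadLocal<>();

    public boolean lock(long timeoutNanos) {
        QNode qnode = new QNode();                // prev == null: not released, not aborted
        mynode.set(qnode);
        QNode mypred = tail.getAndSet(qnode);     // swap with tail
        if (mypred == null || mypred.prev == AVAILABLE) return true;
        long start = System.nanoTime();
        while (System.nanoTime() - start < timeoutNanos) {
            QNode predpred = mypred.prev;         // spin on the predecessor's prev field
            if (predpred == AVAILABLE) return true;           // predecessor released
            else if (predpred != null) mypred = predpred;     // predecessor aborted
            Thread.yield();
        }
        // Timed out: if the CAS fails we have a successor, so tell it about mypred.
        if (!tail.compareAndSet(qnode, mypred)) qnode.prev = mypred;
        return false;
    }

    public void unlock() {
        QNode qnode = mynode.get();
        // If the CAS fails a successor exists; notify it that it can enter.
        if (!tail.compareAndSet(qnode, null)) qnode.prev = AVAILABLE;
    }

    // Test helper (assumption): attempt the lock from a fresh thread.
    static boolean lockFromOtherThread(TOLockSketch lock, long timeoutNanos) {
        boolean[] result = {false};
        Thread t = new Thread(() -> {
            result[0] = lock.lock(timeoutNanos);
            if (result[0]) lock.unlock();
        });
        t.start();
        try { t.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        return result[0];
    }
}
```

Note that lock() overwrites mynode on every call, so in this sketch a thread that already holds the lock must not call lock() again before unlock().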
230 Fairness and NUMA Locks: MCS lock mechanics are aware of NUMA. Lock fairness is FCFS. Is this a good fit with NUMA and cache-coherent NUMA machines?
231 Lock Data Access in NUMA Machine (figure): Node 1 and Node 2; the critical section and the MCS lock touch various memory locations.
232 Who's the Unfairest of Them All? Locality is crucial to NUMA performance. Big gains if threads from the same node/cluster obtain the lock consecutively. Unfairness pays.
233 Hierarchical Backoff Lock (HBO): a global T&T&S lock. Back off less for a thread from the same node (figure: per-node backoff delays d, 2d, 4d over time around the critical section). Unfairness is key to performance.
234 Hierarchical Backoff Lock (HBO) Advantages: simple; improves locality. Disadvantages: requires platform-specific tuning; unstable; unfair; continuous invalidations on the shared global lock word.
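A rough sketch of the hierarchical backoff idea follows: the global lock word records the cluster of the current holder, and a waiting thread backs off less when the holder is in its own cluster. Java has no portable way to ask which NUMA node a thread runs on, so the cluster id is passed in by the caller; that parameter, the class name HBOLockSketch, and all delay constants are assumptions of this example.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.LockSupport;

// Hypothetical sketch of a hierarchical backoff (HBO) lock.
class HBOLockSketch {
    private static final int FREE = -1;
    // Assumed tuning values in nanoseconds; real systems need platform-specific tuning.
    private static final long LOCAL_MIN = 1_000, LOCAL_MAX = 50_000;
    private static final long REMOTE_MIN = 4_000, REMOTE_MAX = 400_000;

    private final AtomicInteger state = new AtomicInteger(FREE);  // holder's cluster or FREE

    public void lock(int myCluster) {
        long localLimit = LOCAL_MIN, remoteLimit = REMOTE_MIN;
        while (true) {
            if (state.get() == FREE && state.compareAndSet(FREE, myCluster)) return;
            int holder = state.get();
            if (holder == FREE) continue;              // released meanwhile: retry at once
            if (holder == myCluster) {                 // same node: back off less
                pause(localLimit);
                localLimit = Math.min(LOCAL_MAX, 2 * localLimit);
            } else {                                   // remote node: back off more
                pause(remoteLimit);
                remoteLimit = Math.min(REMOTE_MAX, 2 * remoteLimit);
            }
        }
    }

    public void unlock() {
        state.set(FREE);
    }

    private static void pause(long limitNanos) {
        LockSupport.parkNanos(1 + ThreadLocalRandom.current().nextLong(limitNanos));
    }

    // Demo helper (assumption): threads alternate between two pretend clusters.
    static int demo(int threads, int iters) {
        HBOLockSketch lock = new HBOLockSketch();
        int[] counter = {0};
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int cluster = t % 2;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < iters; i++) {
                    lock.lock(cluster);
                    try { counter[0]++; } finally { lock.unlock(); }
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
        return counter[0];
    }
}
```

Every waiter still polls the single shared lock word, which is the source of the continuous invalidations listed among the disadvantages.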
235 Hierarchical CLH Lock (HCLH): Each thread spins on a cached copy of its predecessor's node. The thread at the local head splices the local queue into the global queue (figure: local CLH queues with local tails, a global tail, CAS() operations, and the critical section).
236 Hierarchical CLH Lock (HCLH) (plot: HCLH versus HBO; threads access 4 cache lines in the CS).
237 Hierarchical CLH Lock (HCLH) Advantages: improved locality; local spinning; fair. Disadvantages: complex code implies a long common path; splicing into both local and global queues requires CAS; hard to get long local sequences.
239 Lock Cohorting: A general technique for converting almost any lock into a NUMA lock. Allows combining different lock types, but these locks need to have certain properties (will discuss shortly).
240 Lock Cohorting (figure: local locks, global lock, critical section): Acquire the local lock and proceed to the critical section. A thread that acquired the local lock can now acquire the global lock. On release with a non-empty cohort of waiting threads: release only the local lock and leave a mark. On release with an empty cohort: must release the global lock to avoid deadlock.
241 Thread Obliviousness: A lock is thread-oblivious if, after being acquired by one thread, it can be released by another.
242 Cohort Detection: A lock x provides cohort detection if it can tell whether any thread is trying to acquire it.
243 Lock Cohorting: Two levels of locking. Global lock: thread-oblivious; the thread acquiring the lock can be different from the one releasing it. Local lock: cohort detection; the releasing thread can detect whether some thread is waiting to acquire it.
244 Lock Cohorting: C-BO-MCS Lock (figure: two local MCS locks with tails and CAS(), a global backoff lock with delays d, 2d, 4d, and the critical section). Two new states: acquire local and acquire global. Do we own the global lock? In the MCS lock, cohort detection is done by checking the successor pointer. Bound the number of consecutive lock acquires to control unfairness. The BO lock is thread-oblivious by definition.
245 Lock Cohorting: C-BO-BO Lock (figure: local backoff locks, global backoff lock, critical section). How to add the cohort detection property to a BO lock? As noted, the BO lock is thread-oblivious.
246 Lock Cohorting: C-BO-BO Lock. Add a successorExists field, set before attempting to acquire the local lock. successorExists is reset on lock release. Release might overwrite another successor's write, but we don't care. Why?
247 C-BO-BO Time-Out NUMA Lock: BO locks are trivially abortable. An aborting thread resets the successorExists field before leaving the local lock. Spinning threads set it to true. If the releasing thread finds successorExists false, it releases the global lock.
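The cohorting rules above can be made concrete with a small, non-abortable C-BO-BO sketch. Several things here are assumptions of the example rather than statements about the original design: the class and field names, the three local states, the caller-supplied cluster id, the maxHandoffs bound, the delay constants, and the choice to reset successorExists right after a thread acquires the local lock, so that a holder's own earlier write is never mistaken for a waiter.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.LockSupport;

// Hypothetical sketch of a cohort lock built from two levels of backoff locks.
class CohortBoBoSketch {
    private static final int GLOBAL_RELEASE = 0, LOCAL_RELEASE = 1, BUSY = 2;

    static final class LocalLock {
        final AtomicInteger state = new AtomicInteger(GLOBAL_RELEASE);
        volatile boolean successorExists = false;   // cohort detection flag
        int handoffs = 0;                           // touched only by the local lock holder
    }

    private final AtomicBoolean global = new AtomicBoolean(false);  // thread-oblivious
    private final LocalLock[] locals;
    private final int maxHandoffs;                  // bounds unfairness

    CohortBoBoSketch(int clusters, int maxHandoffs) {
        locals = new LocalLock[clusters];
        for (int i = 0; i < clusters; i++) locals[i] = new LocalLock();
        this.maxHandoffs = maxHandoffs;
    }

    public void lock(int cluster) {
        LocalLock l = locals[cluster];
        long limit = 1_000;
        boolean inheritedGlobal;
        while (true) {
            int s = l.state.get();
            if (s != BUSY && l.state.compareAndSet(s, BUSY)) {
                inheritedGlobal = (s == LOCAL_RELEASE);   // cohort already owns global lock
                break;
            }
            l.successorExists = true;               // spinning threads keep setting it
            limit = pause(limit);
        }
        l.successorExists = false;                  // waiters will set it again
        if (!inheritedGlobal) {
            limit = 1_000;
            while (global.get() || global.getAndSet(true)) limit = pause(limit);
        }
    }

    public void unlock(int cluster) {
        LocalLock l = locals[cluster];
        if (l.successorExists && l.handoffs < maxHandoffs) {
            l.handoffs++;
            l.state.set(LOCAL_RELEASE);             // keep the global lock in the cohort
        } else {
            l.handoffs = 0;
            global.set(false);                      // may be a different thread than the acquirer
            l.state.set(GLOBAL_RELEASE);
        }
    }

    private static long pause(long limitNanos) {
        LockSupport.parkNanos(1 + ThreadLocalRandom.current().nextLong(limitNanos));
        return Math.min(200_000, 2 * limitNanos);   // exponential backoff, assumed cap
    }

    // Demo helper (assumption): threads alternate between two pretend clusters.
    static int demo(int threads, int iters) {
        CohortBoBoSketch lock = new CohortBoBoSketch(2, 8);
        int[] counter = {0};
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int cluster = t % 2;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < iters; i++) {
                    lock.lock(cluster);
                    try { counter[0]++; } finally { lock.unlock(cluster); }
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
        return counter[0];
    }
}
```

Because waiters in this sketch never abort, a true flag at release time means some thread is still spinning on the local lock, so handing over the global lock locally cannot strand it. An abortable variant needs the extra reset described on the time-out slide.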
248 Lock Cohorting Advantages: great locality; low contention on the shared lock; practically no tuning; has whatever properties you want (can be more or less fair, abortable: just choose the appropriate type of locks). Disadvantages: must tune fairness parameters.
249 Lock Cohorting (plot comparing C-BO-MCS, C-BO-BO, HCLH, and HBO).
250 Time-Out (Abortable) Lock Cohorting (plot: throughput in CR-nCRs per sec versus number of threads, for a-clh, a-hbo, a-bo-bo, a-bo-clh (CAS), and a-bo-clh). A-BO-BO and A-BO-CLH (time-out lock + BO) are shown against abortable CLH (our time-out lock) and HBO.
251 One Lock To Rule Them All? TTAS+Backoff, CLH, MCS, ToLock: each is better than the others in some way. There is no one solution. The lock we pick really depends on: the application, the hardware, and which properties are important.
252 This work is licensed under a Creative Commons Attribution-ShareAlike 2.5 License. You are free: to Share (to copy, distribute and transmit the work) and to Remix (to adapt the work), under the following conditions. Attribution: You must attribute the work to The Art of Multiprocessor Programming (but not in any way that suggests that the authors endorse you or your use of the work). Share Alike: If you alter, transform, or build upon this work, you may distribute the resulting work only under the same, similar or a compatible license. For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to Any of the above conditions can be waived if you get permission from the copyright holder. Nothing in this license impairs or restricts the author's moral rights.
Spin Locks and Contention. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit
Spin Locks and Contention Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Focus so far: Correctness and Progress Models Accurate (we never lied to you) But idealized
More informationSpin Locks and Contention. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit
Spin Locks and Contention Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Real world concurrency Understanding hardware architecture What is contention How to
More informationSpin Locks and Contention
Spin Locks and Contention Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Modified for Software1 students by Lior Wolf and Mati Shomrat Kinds of Architectures
More informationSpin Locks and Contention. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit
Spin Locks and Contention Companion slides for The Programming by Maurice Herlihy & Nir Shavit Focus so far: Correctness Models Accurate (we never lied to you) But idealized (so we forgot to mention a
More informationSpin Locks and Contention. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit
Spin Locks and Contention Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit MIMD Architectures memory Shared Bus Memory Contention Communication Contention Communication
More informationSpin Locks and Contention. Companion slides for Chapter 7 The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit
Spin Locks and Contention Companion slides for Chapter 7 The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Focus so far: Correctness and Progress Models Accurate (we never lied to you)
More informationModern High-Performance Locking
Modern High-Performance Locking Nir Shavit Slides based in part on The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Locks (Mutual Exclusion) public interface Lock { public void lock();
More informationLocks Chapter 4. Overview. Introduction Spin Locks. Queue locks. Roger Wattenhofer. Test-and-Set & Test-and-Test-and-Set Backoff lock
Locks Chapter 4 Roger Wattenhofer ETH Zurich Distributed Computing www.disco.ethz.ch Overview Introduction Spin Locks Test-and-Set & Test-and-Test-and-Set Backoff lock Queue locks 4/2 1 Introduction: From
More informationLocking. Part 2, Chapter 11. Roger Wattenhofer. ETH Zurich Distributed Computing
Locking Part 2, Chapter 11 Roger Wattenhofer ETH Zurich Distributed Computing www.disco.ethz.ch Overview Introduction Spin Locks Test-and-Set & Test-and-Test-and-Set Backoff lock Queue locks 11/2 Introduction:
More informationAgenda. Lecture. Next discussion papers. Bottom-up motivation Shared memory primitives Shared memory synchronization Barriers and locks
Agenda Lecture Bottom-up motivation Shared memory primitives Shared memory synchronization Barriers and locks Next discussion papers Selecting Locking Primitives for Parallel Programming Selecting Locking
More informationSpin Locks and Contention Management
Chapter 7 Spin Locks and Contention Management 7.1 Introduction We now turn our attention to the performance of mutual exclusion protocols on realistic architectures. Any mutual exclusion protocol poses
More informationIntroduction to Multiprocessor Synchronization
Introduction to Multiprocessor Synchronization Maurice Herlihy http://cs.brown.edu/courses/cs176/lectures.shtml Moore's Law Transistor count still rising Clock speed flattening sharply Art of Multiprocessor
More informationComputer Engineering II Solution to Exercise Sheet Chapter 10
Distributed Computing FS 2017 Prof. R. Wattenhofer Computer Engineering II Solution to Exercise Sheet Chapter 10 Quiz 1 Quiz a) The AtomicBoolean utilizes an atomic version of getandset() implemented in
More information9/28/2014. CS341: Operating System. Synchronization. High Level Construct: Monitor Classical Problems of Synchronizations FAQ: Mid Semester
CS341: Operating System Lect22: 18 th Sept 2014 Dr. A. Sahu Dept of Comp. Sc. & Engg. Indian Institute of Technology Guwahati Synchronization Hardware Support TAS, CAS, TTAS, LL LC, XCHG, Compare, FAI
More informationPractice: Small Systems Part 2, Chapter 3
Practice: Small Systems Part 2, Chapter 3 Roger Wattenhofer Overview Introduction Spin Locks Test and Set & Test and Test and Set Backoff lock Queue locks Concurrent Linked List Fine grained synchronization
More informationLocking. Part 2, Chapter 7. Roger Wattenhofer. ETH Zurich Distributed Computing
Locking Part 2, Chapter 7 Roger Wattenhofer ETH Zurich Distributed Computing www.disco.ethz.ch Overview Introduction Spin Locks Test-and-Set & Test-and-Test-and-Set Backoff lock Queue locks Concurrent
More informationLocking Granularity. CS 475, Spring 2019 Concurrent & Distributed Systems. With material from Herlihy & Shavit, Art of Multiprocessor Programming
Locking Granularity CS 475, Spring 2019 Concurrent & Distributed Systems With material from Herlihy & Shavit, Art of Multiprocessor Programming Discussion: HW1 Part 4 addtolist(key1, newvalue) Thread 1
More informationDistributed Computing Group
Distributed Computing Group HS 2009 Prof. Dr. Roger Wattenhofer, Thomas Locher, Remo Meier, Benjamin Sigg Assigned: December 11, 2009 Discussion: none Distributed Systems Theory exercise 6 1 ALock2 Have
More information9/23/2014. Concurrent Programming. Book. Process. Exchange of data between threads/processes. Between Process. Thread.
Dr A Sahu Dept of Computer Science & Engineering IIT Guwahati Course Structure & Book Basic of thread and process Coordination and synchronization Example of Parallel Programming Shared memory : C/C++
More informationDistributed Computing
HELLENIC REPUBLIC UNIVERSITY OF CRETE Distributed Computing Graduate Course Section 3: Spin Locks and Contention Panagiota Fatourou Department of Computer Science Spin Locks and Contention In contrast
More informationLecture 9: Multiprocessor OSs & Synchronization. CSC 469H1F Fall 2006 Angela Demke Brown
Lecture 9: Multiprocessor OSs & Synchronization CSC 469H1F Fall 2006 Angela Demke Brown The Problem Coordinated management of shared resources Resources may be accessed by multiple threads Need to control
More informationLinked Lists: Locking, Lock-Free, and Beyond. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit
Linked Lists: Locking, Lock-Free, and Beyond Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Concurrent Objects Adding threads should not lower throughput Contention
More informationLinked Lists: Locking, Lock- Free, and Beyond. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit
Linked Lists: Locking, Lock- Free, and Beyond Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Coarse-Grained Synchronization Each method locks the object Avoid
More informationModule 7: Synchronization Lecture 13: Introduction to Atomic Primitives. The Lecture Contains: Synchronization. Waiting Algorithms.
The Lecture Contains: Synchronization Waiting Algorithms Implementation Hardwired Locks Software Locks Hardware Support Atomic Exchange Test & Set Fetch & op Compare & Swap Traffic of Test & Set Backoff
More informationProgramming Paradigms for Concurrency Lecture 3 Concurrent Objects
Programming Paradigms for Concurrency Lecture 3 Concurrent Objects Based on companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Modified by Thomas Wies New York University
More informationConcurrent Skip Lists. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit
Concurrent Skip Lists Companion slides for The by Maurice Herlihy & Nir Shavit Set Object Interface Collection of elements No duplicates Methods add() a new element remove() an element contains() if element
More informationSolution: a lock (a/k/a mutex) public: virtual void unlock() =0;
1 Solution: a lock (a/k/a mutex) class BasicLock { public: virtual void lock() =0; virtual void unlock() =0; ; 2 Using a lock class Counter { public: int get_and_inc() { lock_.lock(); int old = count_;
More informationChapter 5. Multiprocessors and Thread-Level Parallelism
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model
More informationMultiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types
Chapter 5 Multiprocessor Cache Coherence Thread-Level Parallelism 1: read 2: read 3: write??? 1 4 From ILP to TLP Memory System is Coherent If... ILP became inefficient in terms of Power consumption Silicon
More informationScalable Locking. Adam Belay
Scalable Locking Adam Belay Problem: Locks can ruin performance 12 finds/sec 9 6 Locking overhead dominates 3 0 0 6 12 18 24 30 36 42 48 Cores Problem: Locks can ruin performance the locks
More information6.852: Distributed Algorithms Fall, Class 15
6.852: Distributed Algorithms Fall, 2009 Class 15 Today s plan z z z z z Pragmatic issues for shared-memory multiprocessors Practical mutual exclusion algorithms Test-and-set locks Ticket locks Queue locks
More informationCoarse-grained and fine-grained locking Niklas Fors
Coarse-grained and fine-grained locking Niklas Fors 2013-12-05 Slides borrowed from: http://cs.brown.edu/courses/cs176course_information.shtml Art of Multiprocessor Programming 1 Topics discussed Coarse-grained
More informationAdvanced Multiprocessor Programming: Locks
Advanced Multiprocessor Programming: Locks Martin Wimmer, Jesper Larsson Träff TU Wien 20th April, 2015 Wimmer, Träff AMP SS15 1 / 31 Locks we have seen so far Peterson Lock ilter Lock Bakery Lock These
More informationLock cohorting: A general technique for designing NUMA locks
Lock cohorting: A general technique for designing NUMA locks The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation As Published
More informationChapter 5. Multiprocessors and Thread-Level Parallelism
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model
More informationAdaptive Lock. Madhav Iyengar < >, Nathaniel Jeffries < >
Adaptive Lock Madhav Iyengar < miyengar@andrew.cmu.edu >, Nathaniel Jeffries < njeffrie@andrew.cmu.edu > ABSTRACT Busy wait synchronization, the spinlock, is the primitive at the core of all other synchronization
More informationLecture 7: Mutual Exclusion 2/16/12. slides adapted from The Art of Multiprocessor Programming, Herlihy and Shavit
Principles of Concurrency and Parallelism Lecture 7: Mutual Exclusion 2/16/12 slides adapted from The Art of Multiprocessor Programming, Herlihy and Shavit Time Absolute, true and mathematical time, of
More informationIntroduction. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit
Introduction Companion slides for The by Maurice Herlihy & Nir Shavit Moore s Law Transistor count still rising Clock speed flattening sharply 2 Moore s Law (in practice) 3 Nearly Extinct: the Uniprocesor
More informationModels of concurrency & synchronization algorithms
Models of concurrency & synchronization algorithms Lecture 3 of TDA383/DIT390 (Concurrent Programming) Carlo A. Furia Chalmers University of Technology University of Gothenburg SP3 2016/2017 Today s menu
More information250P: Computer Systems Architecture. Lecture 14: Synchronization. Anton Burtsev March, 2019
250P: Computer Systems Architecture Lecture 14: Synchronization Anton Burtsev March, 2019 Coherence and Synchronization Topics: synchronization primitives (Sections 5.4-5.5) 2 Constructing Locks Applications
More informationMultiprocessors II: CC-NUMA DSM. CC-NUMA for Large Systems
Multiprocessors II: CC-NUMA DSM DSM cache coherence the hardware stuff Today s topics: what happens when we lose snooping new issues: global vs. local cache line state enter the directory issues of increasing
More informationNON-BLOCKING DATA STRUCTURES AND TRANSACTIONAL MEMORY. Tim Harris, 3 Nov 2017
NON-BLOCKING DATA STRUCTURES AND TRANSACTIONAL MEMORY Tim Harris, 3 Nov 2017 Lecture 1/3 Introduction Basic spin-locks Queue-based locks Hierarchical locks Reader-writer locks Reading without locking Flat
More informationMultiprocessor Cache Coherency. What is Cache Coherence?
Multiprocessor Cache Coherency CS448 1 What is Cache Coherence? Two processors can have two different values for the same memory location 2 1 Terminology Coherence Defines what values can be returned by
More informationConcurrent Preliminaries
Concurrent Preliminaries Sagi Katorza Tel Aviv University 09/12/2014 1 Outline Hardware infrastructure Hardware primitives Mutual exclusion Work sharing and termination detection Concurrent data structures
More informationAdvance Operating Systems (CS202) Locks Discussion
Advance Operating Systems (CS202) Locks Discussion Threads Locks Spin Locks Array-based Locks MCS Locks Sequential Locks Road Map Threads Global variables and static objects are shared Stored in the static
More informationCache Coherence and Atomic Operations in Hardware
Cache Coherence and Atomic Operations in Hardware Previously, we introduced multi-core parallelism. Today we ll look at 2 things: 1. Cache coherence 2. Instruction support for synchronization. And some
More informationNON-BLOCKING DATA STRUCTURES AND TRANSACTIONAL MEMORY. Tim Harris, 14 November 2014
NON-BLOCKING DATA STRUCTURES AND TRANSACTIONAL MEMORY Tim Harris, 14 November 2014 Lecture 6 Introduction Amdahl s law Basic spin-locks Queue-based locks Hierarchical locks Reader-writer locks Reading
More informationMultiprocessors and Locking
Types of Multiprocessors (MPs) Uniform memory-access (UMA) MP Access to all memory occurs at the same speed for all processors. Multiprocessors and Locking COMP9242 2008/S2 Week 12 Part 1 Non-uniform memory-access
More informationMultiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering
Multiprocessors and Thread-Level Parallelism Multithreading Increasing performance by ILP has the great advantage that it is reasonable transparent to the programmer, ILP can be quite limited or hard to
More informationConcurrent Queues, Monitors, and the ABA problem. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit
Concurrent Queues, Monitors, and the ABA problem Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit Queues Often used as buffers between producers and consumers
More informationLecture 24: Virtual Memory, Multiprocessors
Lecture 24: Virtual Memory, Multiprocessors Today s topics: Virtual memory Multiprocessors, cache coherence 1 Virtual Memory Processes deal with virtual memory they have the illusion that a very large
More informationChapter 8. Multiprocessors. In-Cheol Park Dept. of EE, KAIST
Chapter 8. Multiprocessors In-Cheol Park Dept. of EE, KAIST Can the rapid rate of uniprocessor performance growth be sustained indefinitely? If the pace does slow down, multiprocessor architectures will
More informationMultiprocessor Synchronization
Multiprocessor Synchronization Material in this lecture in Henessey and Patterson, Chapter 8 pgs. 694-708 Some material from David Patterson s slides for CS 252 at Berkeley 1 Multiprogramming and Multiprocessing
More informationMULTIPROCESSORS AND THREAD LEVEL PARALLELISM
UNIT III MULTIPROCESSORS AND THREAD LEVEL PARALLELISM 1. Symmetric Shared Memory Architectures: The Symmetric Shared Memory Architecture consists of several processors with a single physical memory shared
More informationOther consistency models
Last time: Symmetric multiprocessing (SMP) Lecture 25: Synchronization primitives Computer Architecture and Systems Programming (252-0061-00) CPU 0 CPU 1 CPU 2 CPU 3 Timothy Roscoe Herbstsemester 2012
More informationComputer Architecture. A Quantitative Approach, Fifth Edition. Chapter 5. Multiprocessors and Thread-Level Parallelism
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model
More informationGoldibear and the 3 Locks. Programming With Locks Is Tricky. More Lock Madness. And To Make It Worse. Transactional Memory: The Big Idea
Programming With Locks s Tricky Multicore processors are the way of the foreseeable future thread-level parallelism anointed as parallelism model of choice Just one problem Writing lock-based multi-threaded
More informationAdvanced Topic: Efficient Synchronization
Advanced Topic: Efficient Synchronization Multi-Object Programs What happens when we try to synchronize across multiple objects in a large program? Each object with its own lock, condition variables Is
More informationCS377P Programming for Performance Multicore Performance Cache Coherence
CS377P Programming for Performance Multicore Performance Cache Coherence Sreepathi Pai UTCS October 26, 2015 Outline 1 Cache Coherence 2 Cache Coherence Awareness 3 Scalable Lock Design 4 Transactional
More informationMutual Exclusion Algorithms with Constant RMR Complexity and Wait-Free Exit Code
Mutual Exclusion Algorithms with Constant RMR Complexity and Wait-Free Exit Code Rotem Dvir 1 and Gadi Taubenfeld 2 1 The Interdisciplinary Center, P.O.Box 167, Herzliya 46150, Israel rotem.dvir@gmail.com
More informationMultiprocessor Systems. Chapter 8, 8.1
Multiprocessor Systems Chapter 8, 8.1 1 Learning Outcomes An understanding of the structure and limits of multiprocessor hardware. An appreciation of approaches to operating system support for multiprocessor
More informationMULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming
MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance
More informationMULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming