Spin Locks and Contention Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit
Real world concurrency Understanding hardware architecture What is contention How to reduce it Art of Multiprocessor Programming 4
Kinds of Architectures SISD (Uniprocessor) Single instruction stream Single data stream SIMD (Vector) Single instruction Multiple data MIMD (Multiprocessors) Multiple instruction Multiple data. Art of Multiprocessor Programming 5
Kinds of Architectures SISD (Uniprocessor) Single instruction stream Single data stream SIMD (Vector) Single instruction Multiple data MIMD (Multiprocessors) Multiple instruction Multiple data. Our space (1) Art of Multiprocessor Programming 6
MIMD Architectures memory Shared Bus Memory Contention Communication Contention Communication Latency Distributed Art of Multiprocessor Programming 7
Today: Revisit Mutual Exclusion Think of performance, not just correctness and progress Begin to understand how performance depends on our software properly utilizing the multiprocessor machine s hardware And get to know a collection of locking algorithms Art of Multiprocessor Programming 8 (1)
What Should you do if you can t get a lock? Keep trying spin or busy-wait Good if delays are short Give up the processor Good if delays are long Always good on uniprocessor Art of Multiprocessor Programming 9 (1)
What Should you do if you can t get a lock? Keep trying spin or busy-wait Good if delays are short Give up the processor Good if delays are long Always good on uniprocessor our focus Art of Multiprocessor Programming 10
Basic Spin-Lock CS. spin lock critical section Resets lock upon exit Art of Multiprocessor Programming 11
Basic Spin-Lock lock introduces sequential bottleneck CS. spin lock critical section Resets lock upon exit Art of Multiprocessor Programming 12
Basic Spin-Lock lock suffers from contention CS. spin lock critical section Resets lock upon exit Art of Multiprocessor Programming 13
Basic Spin-Lock lock suffers from contention CS. spin lock critical section Resets lock upon exit Notice: these are distinct phenomena Art of Multiprocessor Programming 14
Basic Spin-Lock lock suffers from contention. spin lock CS critical section Resets lock upon exit Seq Bottleneck no parallelism Art of Multiprocessor Programming 15
Basic Spin-Lock lock suffers from contention CS. spin lock critical section Resets lock upon exit Contention??? Art of Multiprocessor Programming 16
Review: Test-and-Set Boolean value Test-and-set (TAS) Swap true with current value Return value tells if prior value was true or false Can reset just by writing false TAS aka getandset Art of Multiprocessor Programming 17
Review: Test-and-Set public class AtomicBoolean { boolean value; public synchronized boolean getandset(boolean newvalue) { boolean prior = value; value = newvalue; return prior; } } Art of Multiprocessor Programming 18(5)
Review: Test-and-Set public class AtomicBoolean { boolean value; public synchronized boolean getandset(boolean newvalue) { boolean prior = value; value = newvalue; return prior; } } Package java.util.concurrent.atomic Art of Multiprocessor Programming 19
Review: Test-and-Set public class AtomicBoolean { boolean value; public synchronized boolean getandset(boolean newvalue) { boolean prior = value; value = newvalue; return prior; } } Swap old and new values Art of Multiprocessor Programming 20
Review: Test-and-Set AtomicBoolean lock = new AtomicBoolean(false) boolean prior = lock.getandset(true) Art of Multiprocessor Programming 21
Review: Test-and-Set AtomicBoolean lock = new AtomicBoolean(false) boolean prior = lock.getandset(true) Swapping in true is called test-and-set or TAS Art of Multiprocessor Programming 22(5)
Test-and-Set Locks Locking Lock is free: value is false Lock is taken: value is true Acquire lock by calling TAS If result is false, you win If result is true, you lose Release lock by writing false Art of Multiprocessor Programming 23
Test-and-set Lock class TASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (state.getandset(true)) {} } void unlock() { state.set(false); }} Art of Multiprocessor Programming 24
Test-and-set Lock class TASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (state.getandset(true)) {} } void unlock() { state.set(false); }} Lock state is AtomicBoolean Art of Multiprocessor Programming 25
Test-and-set Lock class TASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (state.getandset(true)) {} } void unlock() { state.set(false); }} Keep trying until lock acquired Art of Multiprocessor Programming 26
Test-and-set Lock class TASlock { AtomicBoolean state = new AtomicBoolean(false); Release lock by resetting state to false void lock() { while (state.getandset(true)) {} } void unlock() { state.set(false); }} Art of Multiprocessor Programming 27
Space Complexity TAS spin-lock has small footprint N thread spin-lock uses O(1) space Art of Multiprocessor Programming 29
Performance Experiment n threads Increment shared counter 1 million times How long should it take? How long does it take? Art of Multiprocessor Programming 30
time Graph no speedup because of sequential bottleneck ideal threads Art of Multiprocessor Programming 31
Mystery #1 TAS lock time threads Ideal What is going on? Art of Multiprocessor Programming 32 (1)
Test-and-Test-and-Set Locks Lurking stage Wait until lock looks free Spin while read returns true (lock taken) Pouncing state As soon as lock looks available Read returns false (lock free) Call TAS to acquire lock If TAS loses, back to lurking Art of Multiprocessor Programming 33
Test-and-test-and-set Lock class TTASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (true) { while (state.get()) {} if (!state.getandset(true)) return; } } Art of Multiprocessor Programming 34
Test-and-test-and-set Lock class TTASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (true) { while (state.get()) {} if (!state.getandset(true)) return; } } Wait until lock looks free Art of Multiprocessor Programming 35
Test-and-test-and-set Lock class TTASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (true) { while (state.get()) {} if (!state.getandset(true)) return; } } Then try to acquire it Art of Multiprocessor Programming 36
Mystery #2 TAS lock time TTAS lock Ideal threads Art of Multiprocessor Programming 37
Mystery Both TAS and TTAS Do the same thing (in our model) Except that TTAS performs much better than TAS Neither approaches ideal Art of Multiprocessor Programming 38
Opinion Our memory abstraction is broken TAS & TTAS methods Are provably the same (in our model) Except they aren t (in field tests) Need a more detailed model Art of Multiprocessor Programming 39
Bus-Based Architectures cache cache Bus cache memory Art of Multiprocessor Programming 40
Bus-Based Architectures Random access memory (10s of cycles) cache cache Bus cache memory Art of Multiprocessor Programming 41
Bus-Based Architectures Shared Bus Broadcast medium One broadcaster at a time Processors and memory all snoop cache cache Bus cache memory Art of Multiprocessor Programming 42
Per-Processor Caches Small Fast: 1 or 2 cycles Address & state information Bus-Based Architectures cache cache Bus cache memory Art of Multiprocessor Programming 43
Jargon Watch Cache hit I found what I wanted in my cache Good Thing Art of Multiprocessor Programming 44
Jargon Watch Cache hit I found what I wanted in my cache Good Thing Cache miss I had to shlep all the way to memory for that data Bad Thing Art of Multiprocessor Programming 45
Cave Canem This model is still a simplification But not in any essential way Illustrates basic principles Will discuss complexities later Art of Multiprocessor Programming 46
Processor Issues Load Request cache cache Bus cache memory data Art of Multiprocessor Programming 47
Processor Issues Load Request Gimme data cache cache Bus cache memory data Art of Multiprocessor Programming 48
Memory Responds cache cache cache Got your data right here Bus memory data Bus Art of Multiprocessor Programming 49
Processor Issues Load Request Gimme data data cache Bus cache memory data Art of Multiprocessor Programming 50
Processor Issues Load Request Gimme data data cache Bus cache memory data Art of Multiprocessor Programming 51
Processor Issues Load Request I got data data cache Bus cache memory data Art of Multiprocessor Programming 52
Other Processor Responds I got data data cache cache Bus Bus memory data Art of Multiprocessor Programming 53
Other Processor Responds data cache cache Bus Bus memory data Art of Multiprocessor Programming 54
Modify Cached Data data data Bus cache memory data Art of Multiprocessor Programming 55 (1)
Modify Cached Data data data Bus cache memory data Art of Multiprocessor Programming 56 (1)
Modify Cached Data data data Bus cache memory data Art of Multiprocessor Programming 57
Modify Cached Data data data Bus cache What s up with the other copies? memory data Art of Multiprocessor Programming 58
Cache Coherence We have lots of copies of data Original copy in memory Cached copies at processors Some processor modifies its own copy What do we do with the others? How to avoid confusion? Art of Multiprocessor Programming 59
Write-Back Caches Accumulate changes in cache Write back when needed Need the cache for something else Another processor wants it On first modification Invalidate other entries Requires non-trivial protocol Art of Multiprocessor Programming 60
Write-Back Caches Cache entry has three states Invalid: contains raw seething bits Valid: I can read but I can t write Dirty: Data has been modified Intercept other load requests Write back to memory before using cache Art of Multiprocessor Programming 61
Invalidate data data Bus cache memory data Art of Multiprocessor Programming 62
Invalidate Mine, all mine! data data Bus cache memory data Art of Multiprocessor Programming 63
Invalidate Uh,oh cache data data Bus cache memory data Art of Multiprocessor Programming 64
Invalidate Other caches lose read permission cache data Bus cache memory data Art of Multiprocessor Programming 65
Invalidate Other caches lose read permission cache data Bus cache This cache acquires write permission memory data Art of Multiprocessor Programming 66
Invalidate Memory provides data only if not present in any cache, so no need to change it now (expensive) cache data Bus cache memory data Art of Multiprocessor Programming 67(2)
Another Processor Asks for Data cache data Bus cache memory data Art of Multiprocessor Programming 68(2)
Owner Responds Here it is! cache data Bus cache memory data Art of Multiprocessor Programming 69(2)
End of the Day data data Bus cache memory data Reading OK, no writing Art of Multiprocessor Programming 70 (1)
Mutual Exclusion What do we want to optimize? Bus bandwidth used by spinning threads Release/Acquire latency Acquire latency for idle lock Art of Multiprocessor Programming 71
Simple TASLock TAS invalidates cache lines Spinners Miss in cache Go to bus Thread wants to release lock delayed behind spinners Art of Multiprocessor Programming 72
Test-and-test-and-set Wait until lock looks free Spin on local cache No bus use while lock busy Problem: when lock is released Invalidation storm Art of Multiprocessor Programming 73
Local Spinning while Lock is Busy busy busy Bus busy memory busy Art of Multiprocessor Programming 74
On Release invalid invalid Bus free memory free Art of Multiprocessor Programming 75
Everyone misses, rereads On Release miss invalid miss invalid Bus free memory free Art of Multiprocessor Programming 76 (1)
Everyone tries TAS On Release TAS( ) invalid TAS( ) invalid Bus free memory free Art of Multiprocessor Programming 77 (1)
Problems Everyone misses Reads satisfied sequentially Everyone does TAS Invalidates others caches Eventually quiesces after lock acquired How long does this take? Art of Multiprocessor Programming 78
Quiescence Time time Increases linearly with the number of processors for bus architecture threads Art of Multiprocessor Programming 80
Mystery Explained TAS lock time TTAS lock Ideal threads Better than TAS but still not as good as ideal Art of Multiprocessor Programming 81
Solution: Introduce Delay If the lock looks free But I fail to get it There must be contention Better to back off than to collide again time r 2 d r 1 d d spin lock Art of Multiprocessor Programming 82
Dynamic Example: Exponential Backoff time 4d 2d d spin lock If I fail to get lock wait random duration before retry Each subsequent failure doubles expected wait Art of Multiprocessor Programming 83
Exponential Backoff Lock public class Backoff implements lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!lock.getandset(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; }}} Art of Multiprocessor Programming 84
Exponential Backoff Lock public class Backoff implements lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!lock.getandset(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; }}} Fix minimum delay Art of Multiprocessor Programming 85
Exponential Backoff Lock public class Backoff implements lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!lock.getandset(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; }}} Wait until lock looks free Art of Multiprocessor Programming 86
Exponential Backoff Lock public class Backoff implements lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!lock.getandset(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; }}} If we win, return Art of Multiprocessor Programming 87
Exponential Backoff Lock public class Backoff implements lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!lock.getandset(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; }}} Back off for random duration Art of Multiprocessor Programming 88
Exponential Backoff Lock public class Backoff implements lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!lock.getandset(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; }}} Double max delay, within reason Art of Multiprocessor Programming 89
Spin-Waiting Overhead TTAS Lock time Backoff lock threads Art of Multiprocessor Programming 90
Backoff: Other Issues Good Easy to implement Beats TTAS lock Bad Must choose parameters carefully Not portable across platforms Art of Multiprocessor Programming 91
Idea Avoid useless invalidations By keeping a queue of threads Each thread Notifies next in line Without bothering the others Art of Multiprocessor Programming 92
Anderson Queue Lock next idle flags T F F F F F F F Art of Multiprocessor Programming 93
Anderson Queue Lock next acquiring getandincrement flags T F F F F F F F Art of Multiprocessor Programming 94
Anderson Queue Lock next acquiring getandincrement flags T F F F F F F F Art of Multiprocessor Programming 95
Anderson Queue Lock next acquired Mine! flags T F F F F F F F Art of Multiprocessor Programming 96
Anderson Queue Lock next acquired acquiring flags T F F F F F F F Art of Multiprocessor Programming 97
Anderson Queue Lock next acquired acquiring flags getandincrement T F F F F F F F Art of Multiprocessor Programming 98
Anderson Queue Lock next acquired acquiring flags getandincrement T F F F F F F F Art of Multiprocessor Programming 99
Anderson Queue Lock next acquired acquiring flags T F F F F F F F Art of Multiprocessor Programming 100
Anderson Queue Lock next released acquired flags T T F F F F F F Art of Multiprocessor Programming 101
Anderson Queue Lock next released acquired flags Yow! T T F F F F F F Art of Multiprocessor Programming 102
Anderson Queue Lock class ALock implements Lock { boolean[] flags={true,false,,false}; AtomicInteger next = new AtomicInteger(0); ThreadLocal<Integer> myslot; Art of Multiprocessor Programming 103
Anderson Queue Lock class ALock implements Lock { boolean[] flags={true,false,,false}; AtomicInteger next = new AtomicInteger(0); ThreadLocal<Integer> myslot; One flag per thread Art of Multiprocessor Programming 104
Anderson Queue Lock class ALock implements Lock { boolean[] flags={true,false,,false}; AtomicInteger next = new AtomicInteger(0); ThreadLocal<Integer> myslot; Next flag to use Art of Multiprocessor Programming 105
Anderson Queue Lock class ALock implements Lock { boolean[] flags={true,false,,false}; AtomicInteger next = new AtomicInteger(0); ThreadLocal<Integer> myslot; Thread-local variable Art of Multiprocessor Programming 106
Anderson Queue Lock public lock() { myslot = next.getandincrement(); while (!flags[myslot % n]) {}; flags[myslot % n] = false; } public unlock() { flags[(myslot+1) % n] = true; } Art of Multiprocessor Programming 107
Anderson Queue Lock public lock() { myslot = next.getandincrement(); while (!flags[myslot % n]) {}; flags[myslot % n] = false; } public unlock() { flags[(myslot+1) % n] = true; } Take next slot Art of Multiprocessor Programming 108
Anderson Queue Lock public lock() { myslot = next.getandincrement(); while (!flags[myslot % n]) {}; flags[myslot % n] = false; } public unlock() { flags[(myslot+1) % n] = true; } Spin until told to go Art of Multiprocessor Programming 109
Anderson Queue Lock public lock() { myslot = next.getandincrement(); while (!flags[myslot % n]) {}; flags[myslot % n] = false; } public unlock() { flags[(myslot+1) % n] = true; } Prepare slot for re-use Art of Multiprocessor Programming 110
Anderson Queue Lock public lock() { myslot = next.getandincrement(); Tell next thread to go while (!flags[myslot % n]) {}; flags[myslot % n] = false; } public unlock() { flags[(myslot+1) % n] = true; } Art of Multiprocessor Programming 111
Local Spinning next flags released acquired Spin on my bit T F F F F F F F Unfortunately many bits share cache line Art of Multiprocessor Programming 112
next Result: contention flags False Sharing released acquired T F F F F F F F Line Art 1 of Multiprocessor Programming Line 2 Spin on my Spinning thread bit gets cache invalidation on account of store by threads it is not waiting for 113
The Solution: Padding next flags released acquired Spin on my line T / / / F / / / Line Art 1 of Multiprocessor Programming Line 2 114
Performance TTAS queue Shorter handover than backoff Curve is practically flat Scalable performance Art of Multiprocessor Programming 115
Anderson Queue Lock Good First truly scalable lock Simple, easy to implement Back to FIFO order (like Bakery) Art of Multiprocessor Programming 116
Anderson Queue Lock Bad Space hog One bit per thread è one cache line per thread What if unknown number of threads? What if small number of actual contenders? Art of Multiprocessor Programming 117
CLH Lock FIFO order Small, constant-size overhead per thread Art of Multiprocessor Programming 118
Initially idle tail false Art of Multiprocessor Programming 119
Initially idle tail false Queue tail Art of Multiprocessor Programming 120
Initially idle tail false Lock is free Art of Multiprocessor Programming 121
Initially idle tail false Art of Multiprocessor Programming 122
Purple Wants the Lock acquiring tail false Art of Multiprocessor Programming 123
Purple Wants the Lock acquiring tail false true Art of Multiprocessor Programming 124
Purple Wants the Lock acquiring Swap tail false true Art of Multiprocessor Programming 125
Purple Has the Lock acquired tail false true Art of Multiprocessor Programming 126
Red Wants the Lock acquired acquiring tail false true true Art of Multiprocessor Programming 127
Red Wants the Lock acquired acquiring tail false Swap true true Art of Multiprocessor Programming 128
Red Wants the Lock acquired acquiring tail false true true Art of Multiprocessor Programming 129
Red Wants the Lock acquired acquiring tail false true true Art of Multiprocessor Programming 130
Red Wants the Lock acquired acquiring Implicit Linked list tail false true true Art of Multiprocessor Programming 131
Red Wants the Lock acquired acquiring tail false true true Art of Multiprocessor Programming 132
Red Wants the Lock acquired acquiring tail false true true true Actually, it spins on cached copy Art of Multiprocessor Programming 133
Purple Releases release acquiring false Bingo! tail false false true Art of Multiprocessor Programming 134
Purple Releases released acquired tail true Art of Multiprocessor Programming 135
Space Usage Let L = number of locks N = number of threads ALock O(LN) CLH lock O(L+N) Art of Multiprocessor Programming 136
One Lock To Rule Them All? TTAS+Backoff, CLH, MCS, ToLock Each better than others in some way There is no one solution Lock we pick really depends on: the application the hardware which properties are important Art of Multiprocessor Programming 228
This work is licensed under a Creative Commons Attribution-ShareAlike 2.5 License. You are free: to Share to copy, distribute and transmit the work to Remix to adapt the work Under the following conditions: Attribution. You must attribute the work to The Art of Multiprocessor Programming (but not in any way that suggests that the authors endorse you or your use of the work). Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under the same, similar or a compatible license. For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to http://creativecommons.org/licenses/by-sa/3.0/. Any of the above conditions can be waived if you get permission from the copyright holder. Nothing in this license impairs or restricts the author's moral rights. Art of Multiprocessor Programming 229