Synchronization for Concurrent Tasks Minsoo Ryu Department of Computer Science and Engineering
2 Contents
1 Race Condition and Critical Section
2 Synchronization in the Linux Kernel
3 Other Synchronization Techniques
4 Nonblocking Synchronization
5 Q & A
3 Race Condition
The situation where several processes access and manipulate shared data concurrently; the final value of the shared data depends on which process finishes last.

Thread A (producer):

    item nextproduced;
    if (user_wants_to_write == 1) {
        while (counter == BUFFER_SIZE)
            ; /* do nothing */
        buffer[in] = nextproduced;
        in = (in + 1) % BUFFER_SIZE;
        counter++;
    }

Thread B (consumer):

    item nextconsumed;
    if (user_wants_to_read == 1) {
        while (counter == 0)
            ; /* do nothing */
        nextconsumed = buffer[out];
        out = (out + 1) % BUFFER_SIZE;
        counter--;
    }

The unsynchronized counter++ and counter-- are the race: each is a read-modify-write whose steps can interleave.
4 Three Approaches to Critical Sections
Algorithmic approaches: solve the critical section problem without using any special HW or OS support
Hardware support: use special hardware support to achieve atomicity (interrupt disabling, Test-and-Set instruction, Swap instruction)
OS primitives: use OS primitives (semaphore, mutex, spin lock, reader-writer lock, ...)
5 OS Primitives for Critical Sections
The OS provides several primitives for critical sections; mutual exclusion, progress, and bounded waiting are guaranteed by the OS. The most popular primitives are semaphores and mutexes.
6 Spin Locks
Spin locks are essentially mutex locks: tasks waiting for a mutex lock can either sleep or spin, and a task using a spin lock keeps trying to acquire the lock without sleeping.
Advantage of spin locks: the task acquires the lock as soon as it is released. With a mutex lock that sleeps, the task must first be woken by the operating system before it can take the lock.
Disadvantage of spin locks: a spinning task monopolizes the CPU while it waits.
7 POSIX Spinlocks 7
Synchronization in the Linux Kernel
9 Race Condition Scenario #1 in the Kernel System call and interrupt 9
10 Race Condition Scenario #2 in the Kernel System call and preemption 10
11 Synchronization Approaches in the Kernel
For single-core hardware: disabling preemption or interrupts prevents other tasks or ISRs from running.
For multicore hardware: atomic operations perform a read-modify-write sequence as one indivisible step; locking primitives (spinlocks, semaphores, and mutexes) prevent other tasks from entering a critical section.
12 Disabling Preemption
Allows a task to complete its critical section without interference from other tasks.
13 Disabling Preemption in the Kernel
Three functions:
preempt_disable(): disables kernel preemption by incrementing the preemption counter
preempt_enable(): decrements the preemption counter, then checks for and services any pending reschedule if the count is now zero
preempt_count(): returns the preemption count
Preemption counter:
preempt_count == 0: preemptable
preempt_count > 0: not preemptable
14 Limitations of Disabling Preemption Race condition in multicore hardware Race condition between a task and an ISR 14
15 Disabling Interrupts
Allows a task to complete its critical section without interference from interrupts. Disabling interrupts also disables kernel preemption.
16 Disabling Interrupts in the Kernel
Simply disable and enable interrupts for the current processor, clearing and setting the processor's interrupt flag (local_irq_disable() / local_irq_enable()).
Alternatively, disable and enable interrupts while saving and restoring the state of the interrupt system (local_irq_save(flags) / local_irq_restore(flags)).
17 Atomic Operations Atomic operations provide instructions that execute atomically without interruption 17
18 Atomic Integer Operations
A special data type, atomic_t, defined in <linux/types.h>. The basic use:
19 Atomic Add in ARM ARM LDREX and STREX are available in ARMv6 and above 19
20 Atomic Integer Operations
ATOMIC_INIT(int i): at declaration, initialize to i
int atomic_read(atomic_t *v): atomically read the integer value of v
void atomic_set(atomic_t *v, int i): atomically set v equal to i
void atomic_add(int i, atomic_t *v): atomically add i to v
void atomic_sub(int i, atomic_t *v): atomically subtract i from v
void atomic_inc(atomic_t *v): atomically add one to v
void atomic_dec(atomic_t *v): atomically subtract one from v
int atomic_sub_and_test(int i, atomic_t *v): atomically subtract i from v and return true if the result is zero; otherwise false
int atomic_add_negative(int i, atomic_t *v): atomically add i to v and return true if the result is negative; otherwise false
int atomic_add_return(int i, atomic_t *v): atomically add i to v and return the result
int atomic_sub_return(int i, atomic_t *v): atomically subtract i from v and return the result
int atomic_inc_return(atomic_t *v): atomically increment v by one and return the result
int atomic_dec_return(atomic_t *v): atomically decrement v by one and return the result
int atomic_dec_and_test(atomic_t *v): atomically decrement v by one and return true if the result is zero; otherwise false
int atomic_inc_and_test(atomic_t *v): atomically increment v by one and return true if the result is zero; otherwise false
21 Spin Locks
A spin lock is a mutual exclusion mechanism where a process spins (busy-waits) until the lock becomes available. Spin locks are architecture-dependent and implemented in assembly: the architecture-dependent code is defined in <asm/spinlock.h>, while the actual usable interfaces are defined in <linux/spinlock.h>.
22 The Use of a Spin Lock
Initializing and using a spin lock.
Unlike spin lock implementations in other operating systems and threading libraries, the Linux kernel's spin locks are not recursive: if the kernel attempts to acquire a lock it already holds, it will spin, waiting for itself to release the lock. But because it is busy spinning, it will never release the lock, and it deadlocks.
23 Acquiring a Spin Lock in an ISR Spin locks can be used in interrupt handlers Semaphores cannot be used because they sleep If a lock is used in an interrupt handler, you must also disable local interrupts Otherwise, it is possible for an interrupt handler to attempt to reacquire the lock (double-acquire deadlock) 23
Other Synchronization Techniques
25 Ticket Spinlocks
Ticket spin lock (Linux version >= 2.6.25): lock contenders acquire the lock in FIFO order.
spin_lock(): a task atomically reads and increments the queue ticket. It then compares the value it read (its own ticket) with the dequeue ticket. If they are equal, the lock is acquired and the task enters the critical section; if not, the lock is held by another task and the task spins.
spin_unlock(): the holder atomically increments the dequeue ticket.
26 Reader-Writer Spin Locks
In the reader code path: read_lock() / read_unlock()
In the writer code path: write_lock() / write_unlock()
Notes:
It is safe for multiple readers to hold the same lock
It is safe for the same thread to recursively obtain the same read lock
The Linux reader-writer spin locks favor readers over writers
27 The First Readers-Writers Problem
The first readers-writers problem gives preferential treatment to readers, so writers may suffer unbounded waiting.
Shared data: semaphore s, wrt; initially s = 1, wrt = 1, readcount = 0
28 The First Readers-Writers Problem

/* Writer */

    wait(wrt);
    /* writing is performed */
    glob_x++;
    glob_y++;
    signal(wrt);

/* Reader */

    wait(s);
    readcount++;
    if (readcount == 1)
        wait(wrt);      /* first reader locks out writers */
    signal(s);
    /* reading is performed */
    temp1 = glob_x;
    temp2 = glob_y;
    wait(s);
    readcount--;
    if (readcount == 0)
        signal(wrt);    /* last reader lets writers in */
    signal(s);
29 The First Readers-Writers Problem
(Timeline: while readers are active, incoming read requests are served immediately and pending write requests keep waiting, illustrating how writers can starve.)
30 Sequential Locks
A newer type of lock introduced in the 2.6 kernel, used for reading and writing shared data; it avoids the problem of writer starvation.
Writers: after acquiring the seq lock, the writer increments the sequence number, both after acquiring the lock and before releasing it (0 -> 1 -> 2).
Readers: read the sequence number before and after reading the shared data. If the number is odd, a writer held the lock during the read; if the two numbers differ, a writer changed the data in between. In either case, the reader simply retries.
31 Sequential Locks
Initially seq = 0.
Writer code: increment seq (making it odd), write the data, increment seq again (making it even).
Reader code: read seq, read the data, re-read seq; retry if seq was odd or changed.
Seq locks are more efficient than reader-writer locks when there are many readers and few writers: readers never block and writers do not wait for readers. If there is too much write activity or a reader is too slow, readers might livelock.
32 Read-Copy-Update
An alternative to reader-writer locks: readers can access the shared data without taking a lock, even while the data is being updated. A writer copies the data, updates the copy, and then switches the pointer. Readers always see valid data, though possibly different versions: pre-existing readers see the old data, new readers see the new one.
(Diagram: readers R1 and R2 follow the pointer to data X; writer W3 (1) creates and (2) copies it, (3) modifies the copy into data Y, (4) updates the pointer, then (5) waits for pre-existing readers and reclaims X; new readers R4 and R5 see Y.)
33 RCU Interfaces
rcu_read_lock(): marks the beginning of an RCU read-side critical section; RCU-protected data will not be reclaimed for the full duration of that section
rcu_read_unlock(): informs the reclaimer that the reader is exiting an RCU read-side critical section
rcu_dereference(): the reader uses this to fetch an RCU-protected pointer; the returned value is valid only within the enclosing RCU read-side critical section
rcu_assign_pointer(): the updater uses this to atomically assign a new value to an RCU-protected pointer, safely communicating the change from the updater to the readers
synchronize_rcu(): waits until all pre-existing RCU read-side critical sections on all CPUs have completed
34 Example 34
35 Example 35
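A minimal sketch of the usual kernel RCU pattern, a reader plus an updater that replaces a struct behind an RCU-protected pointer; this compiles only in-kernel, do_something() is a hypothetical placeholder, and a single updater is assumed:

```c
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct foo {
    int a;
};

static struct foo __rcu *gp;    /* RCU-protected global pointer */

/* Reader side */
void reader(void)
{
    struct foo *p;

    rcu_read_lock();
    p = rcu_dereference(gp);      /* fetch the protected pointer */
    if (p)
        do_something(p->a);       /* p is valid only inside this section */
    rcu_read_unlock();
}

/* Updater side: create, copy/modify, switch the pointer, wait, reclaim */
void update(int new_a)
{
    struct foo *new_fp = kmalloc(sizeof(*new_fp), GFP_KERNEL);
    struct foo *old_fp = rcu_dereference_protected(gp, 1); /* single updater */

    new_fp->a = new_a;
    rcu_assign_pointer(gp, new_fp);  /* new readers now see new_fp */
    synchronize_rcu();               /* wait for pre-existing readers */
    kfree(old_fp);                   /* safe to reclaim the old version */
}
```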
36 synchronize_rcu()
Two ways of waiting:
Simply block until all pre-existing readers are done
Register a callback to be invoked after all ongoing RCU read-side critical sections have completed (call_rcu())
Deciding when to reclaim:
Nonpreemptive RCU: wait for at least one grace period to elapse
Preemptive RCU: rcu_read_lock() increments the counter associated with RCU, rcu_read_unlock() decrements it, and synchronize_rcu() checks whether the counter is zero
37 Quiescent State and Grace Period
Quiescent state: any point of execution that is not within an RCU read-side critical section.
Grace period: any time period during which every thread passes through at least one quiescent state.
38 Grace Period with Nonpreemptive RCU
Detecting a grace period: readers cannot block, sleep, or be preempted inside a critical section, so a context switch implies that the task has passed through a quiescent state. synchronize_rcu() waits until every other CPU has passed through a context switch.
39 Notes on RCUs RCUs allow readers to use much lighter-weight synchronization RCU works best in situations with mostly reads and few updates To use RCUs, data structure must be dynamically allocated and referenced by a single pointer 39
Nonblocking Synchronization
41 Problems with Blocking Synchronization Problems with locking Deadlock Priority inversion Lock convoy Async-signal-safety Kill-tolerant availability Preemption tolerance These problems are not equally relevant to all locking situations 41
42 Problems with Blocking Synchronization
Deadlock: processes cannot proceed because they are waiting for resources held by processes that are themselves waiting, forming a cycle.
Priority inversion: a low-priority process holds a lock required by a higher-priority process. Priority inheritance is a possible solution.
43 Problems with Blocking Synchronization Lock convoy A lock convoy occurs when multiple threads of equal priority contend repeatedly for the same lock Unlike deadlock and livelock situations, the threads in a lock convoy do progress; however, each time a thread attempts to acquire the lock and fails, it relinquishes the remainder of its scheduling quantum and forces a context switch The overhead of repeated context switches and underutilization of scheduling quanta degrade overall performance 43
44 Problems with Blocking Synchronization
Async-signal safety: signal handlers can't safely use lock-based primitives, especially malloc and free. Why? Suppose a thread receives a signal while holding a user-level lock in the memory allocator; the signal handler executes, calls malloc, and wants the same lock: deadlock.
Kill-tolerance: if threads are killed or crash while holding locks, what happens?
Preemption tolerance: what happens if you are preempted while holding a lock?
45 Lock-free Synchronization
A thread never gets stuck because another thread was suspended inside the critical section. A well-designed lock-free approach works well for progress-related problems such as deadlock, priority inversion, and async-signal safety, but it might not help performance-related problems like the convoy problem.
Lock-free synchronization is possible but not practical without hardware support: atomic operations are usually required. Designing generalized lock-free algorithms is hard; instead, design lock-free data structures such as buffers, lists, stacks, queues, maps, deques, and snapshots.
46 Non-Blocking Synchronization Goal of non-blocking synchronization Provide a progress guarantee Guarantee deadlock-free and/or starvation-free properties Even if some threads are delayed for arbitrarily long Three liveness criteria Lock-freedom Some thread is guaranteed to complete its operation No deadlock/livelock, but starvation is possible Wait-freedom (strongest) Every thread is guaranteed to complete its operation No deadlock/livelock, no starvation Obstruction-freedom (weakest) A thread can complete its operation if it is executed in isolation (with all obstructing threads suspended) 46
47 Example of Blocking Synchronization
The deadlock example (using semaphores or mutex locks): neither thread can proceed.
48 Is This Lock-Free?
Process P0:

    item nextproduced;
    if (user_wants_to_write == 1) {
        while (counter == BUFFER_SIZE)
            ; /* do nothing */
        buffer[in] = nextproduced;
        in = (in + 1) % BUFFER_SIZE;
        while (turn != 0)
            ;
        counter++;
        turn = 1;
    }

Process P1:

    item nextconsumed;
    if (user_wants_to_read == 1) {
        while (counter == 0)
            ; /* do nothing */
        nextconsumed = buffer[out];
        out = (out + 1) % BUFFER_SIZE;
        while (turn != 1)
            ;
        counter--;
        turn = 0;
    }

No! P1 cannot enter the critical section again if P0 never enters the critical section: strict alternation on turn blocks it.
49 Is This Lock-Free?
Producer:

    item nextproduced;
    if (user_wants_to_write == 1) {
        while (counter == BUFFER_SIZE)
            ; /* do nothing */
        buffer[in] = nextproduced;
        in = (in + 1) % BUFFER_SIZE;
        while (TestAndSet(lock))
            ;
        counter++;
        lock = false;
    }

Consumer:

    item nextconsumed;
    if (user_wants_to_read == 1) {
        while (counter == 0)
            ; /* do nothing */
        nextconsumed = buffer[out];
        out = (out + 1) % BUFFER_SIZE;
        while (TestAndSet(lock))
            ;
        counter--;
        lock = false;
    }

No! A thread may get stuck if another thread is suspended inside the critical section while holding lock.
50 Is This Lock-Free?

    do {
        choosing[i] = true;
        number[i] = max(number[0], number[1], ..., number[n-1]) + 1;
        choosing[i] = false;
        for (j = 0; j < n; j++) {
            while (choosing[j])
                ;
            while ((number[j] != 0) && ((number[j], j) < (number[i], i)))
                ;
        }
        /* critical section */
        number[i] = 0;
        /* remainder section */
    } while (1);

where (a,b) < (c,d) if a < c, or if a == c and b < d.

No. This is the Bakery Algorithm by Lamport; a thread can get stuck at the while statements if another thread is suspended while choosing or while holding a smaller ticket number.
51 Implementation Issues Non-blocking algorithms can often be implemented by using special hardware instructions Compare and Swap (CAS) instruction Load-Link/Store-Conditional (LL/SC) instruction The performance of non-blocking algorithms does not in general match even naïve blocking designs 51
52 Compare and Swap (CAS)
Atomically compares the contents of a memory location to a given value and, if they are the same, modifies the contents of that memory location to a given new value.

    CAS(int *target, int old_val, int new_val) {
        if (*target == old_val) {   /* compare */
            *target = new_val;      /* swap */
            return 1;
        }
        return 0;
    }
53 Lock-Free Stack (push)

    class Node {
    public:
        Node* next;
        int   data;
        Node(int t) : next(nullptr), data(t) {}
    };
    Node* head;

    void push(int t) {
        Node* node = new Node(t);
        do {
            node->next = head;                   /* (1) read the current head */
        } while (!cas(&head, node->next, node)); /* (2)(3) swing head to node,
                                                    retrying if head changed  */
    }
54 Lock-Free Stack (pop)

    bool pop(int& t) {
        Node* current = head;
        while (current) {
            if (cas(&head, current, current->next)) {
                t = current->data;
                return true;
            }
            current = head;   /* head changed; retry with the new head */
        }
        return false;
    }
55 The ABA Problem
It is possible that between the time the old value is read and the time CAS is attempted, other processors or threads change the memory location two or more times so that it reacquires a bit pattern matching the old value.
Example:
Thread 1 reads some shared variable and finds that it is A.
Thread 1 computes something interesting based on the variable being A.
Thread 2 runs and changes the variable to B. (If Thread 1 woke up now and tried to compare-and-swap, all would be well: the CAS fails and Thread 1 retries.)
Instead, Thread 2 changes the variable back to A!
This is OK if the variable is just a value, but dangerous if it is a pointer into a data structure.
56 The ABA Problem
A general solution is to use a double-length CAS (e.g., a 64-bit CAS on a 32-bit system). The second half holds a counter. The compare part of the operation compares the previously read pointer and counter against the current pointer and counter.
57 Load-Link and Store-Conditional
LL/SC pair (for multiprocessor synchronization): load-link returns the current value of a memory location; a subsequent store-conditional to the same memory location stores a new value only if no updates have occurred to that location since the load-link. Alpha, PowerPC, MIPS, and ARM all provide LL/SC instructions.
Why LL/SC? LL/SC has no ABA problem. It is hard to combine a read and a write in one instruction (there are potential pipeline difficulties from needing two memory operations), so two instructions are used instead.
58 LL/SC and Spin Lock Implementation LL/SC pseudo code in C Spin lock using LL/SC 58