Fast and Scalable Queue-Based Resource Allocation Lock on Shared-Memory Multiprocessors
Deli Zhang, Brendan Lynch, and Damian Dechev
University of Central Florida, Orlando, USA
December 18, 2013
Mutual Exclusion on Multicore Systems

Mutual exclusion locks are not composable
Using multiple mutual exclusion locks poses a scalability challenge for data-intensive applications
BerkeleyDB spends over 80% of its execution time in its Test-and-Test-and-Set lock on a 32-core machine [1]

[1] Johnson et al., Shore-MT: a scalable storage manager for the multicore era. 2009
Resource Allocation Problem Definition

Given a pool of k resources that require exclusive access, each thread may request 1 <= h <= k resources, and a thread remains blocked until all required resources are available.

The resource allocation problem is a generalized mutual exclusion problem (k-mutual exclusion, h-out-of-k mutual exclusion)
It is also an extension of the Dining Philosophers Problem (relaxing the static resource configuration)
Locking Protocols

Assign a mutual exclusion lock to every resource
Follow a protocol to acquire the locks one by one:
  Two-phase locking
  Resource hierarchy
  Time-stamp locking
Prone to conflicts and retries
Batch Locking

Centralized manager to distribute resources
It handles all resource requests from one thread in one batch
  Extended TATAS
  Multi-resource lock
Extended TATAS

```cpp
typedef uint64_t bitset;

void lock(bitset* l, bitset r) {
    for (;;) {
        bitset b = *l;
        if (b & r)                    // detect conflict: spin and retry
            continue;
        if (CAS(l, b, b | r) == b)    // claim all requested bits at once
            return;
    }
}
```

A bitset is an array of bits
Represent each resource by one bit
Detect conflicts by bitwise AND
Handle acquisition in batch

Drawbacks
  No fairness guarantee
  Heavy contention
  Limited number of resources
Queue-based Multi-resource Lock: Overview

Each cell of the queue holds one thread's request bitset, e.g.:

HEAD → 1: 01000010
       2: 00011010
       3: 10100100   (cleared to 00000000 once its holder releases)
       4: 00111000
       5: 00000001
TAIL → 6: 10100000

Ring-buffer-based concurrent queue
Resolves conflicts in FIFO order
Unbounded number of resources
Data Structure

```cpp
struct cell {
    atomic<uint32> seq;
    bitset bits;
};

struct mrlock {
    cell* buffer;
    uint32 siz;
    atomic<uint32> head;
    atomic<uint32> tail;
};

void init(mrlock& l, uint32 siz) {
    l.buffer = new cell[siz];
    l.siz = siz;
    l.head.store(0);
    l.tail.store(0);
    for (uint32 i = 0; i < siz; i++) {
        l.buffer[i].bits.set();      // bits start as all 1s
        l.buffer[i].seq.store(i);    // seq starts as the cell index
    }
}
```

Declaration
  Sequence number as sentinel
  bits as resource flags
  Atomic queue head and tail
Initialization
  Allocate adjacent buffer cells
  bits are initialized to all 1s
  seq is initialized to the cell index
Lock Acquire

function Acquire(mrlock* l, bitset r)
    loop                                        ▷ enqueue this request
        pos ← l.tail
        c ← ReadCell(l, pos)
        seq ← c.seq
        if seq - pos == 0 then
            if CAS(&l.tail, pos, pos + 1) succeeds then
                break
    c.bits ← r
    c.seq ← pos + 1
    spin_pos ← l.head                            ▷ spin on conflicting predecessors
    while spin_pos != pos do
        if IsDequeued(spin_pos) or NoConflict(spin_pos, r) then
            spin_pos ← spin_pos + 1
    return pos                                   ▷ handle used later by Release
Lock Release

function Release(mrlock* l, handle pos)
    ReadCell(l, pos).bits ← 0                    ▷ mark own cell as released
    pos ← l.head
    while ReadCell(l, pos).bits == 0 do          ▷ dequeue released cells at the head
        c ← ReadCell(l, pos)
        seq ← c.seq
        if seq - pos - 1 == 0 then
            if CAS(&l.head, pos, pos + 1) succeeds then
                c.bits ← all 1s                  ▷ restore the initial all-1s state for reuse
                c.seq ← pos + l.siz
        pos ← l.head
Sequence Numbers

Figure: updating flow of sequence numbers in a buffer of size 4. An enqueue at position pos advances the cell's seq from pos to pos + 1; a dequeue advances it from pos + 1 to pos + 4, marking the cell free for the next lap (e.g. seq values 0 1 2 3 become 4 5 6 7 after a full enqueue/dequeue round).
Sketch of Correctness Proofs

Concurrent update of the ring buffer is safe
Theorem: The head always precedes the tail; the tail exceeds the head by at most N, where N is the size of the buffer.

Non-atomic update of the bitset is safe
Bitsets are initialized to all 1s, then written to a specific request value by a single thread
Theorem: In the presence of a single writer, every intermediate value of the bitset during the write operation is a superset of the requested resources.
Alternatives

Two-phase locking
  std::lock function with std::mutex (STDLock)
  boost::lock function with boost::mutex (BSTLock)
Resource hierarchy
  with std::mutex (RHSTD)
  with tbb::queuing_mutex (RHQueue)
Extended TATAS (ETATAS)
Testing Configurations

64-core NUMA system: 4 AMD Opteron CPUs with 16 cores per chip @ 2.1 GHz
Micro-benchmark
  Tight loop to acquire/release locks
  Resource requests randomized prior to the loop
Configuration
  Resources: 2 to 1024
  Threads: 2 to 64
  Resource contention: 0% to 100% (number of resources requested per thread divided by the total number of resources)
16 Threads

Figure: execution time (seconds, log scale) versus resource contention (0% to 100%) with 16 threads; left panel: 64 resources (MRLock, BSTLock, RHQueue, STDLock, RHLock, ETATAS), right panel: 1024 resources (MRLock, RHLock, BSTLock, RHQueue).
ETATAS and STDLock are excluded on the right because they do not support more than 64 resources.
Up to 64 Resources

Figure: execution time (seconds, log scale) versus resource contention (4 to 64) for 2 threads (left) and 64 threads (right), comparing MRLock, STDLock, BSTLock, RHLock, RHQueue, and ETATAS.
Up to 1024 Resources

Figure: execution time (seconds, log scale) versus resource contention (128 to 1024) for 8 threads (left) and 32 threads (right), comparing MRLock, BSTLock, RHLock, and RHQueue.
Thread Scale

Figure: execution time (seconds, log scale) versus thread count (2 to 64) at contention 32/64 (50%, left; MRLock, STDLock, BSTLock, RHLock, RHQueue, ETATAS) and 128/1024 (12.5%, right; MRLock, BSTLock, RHLock, RHQueue).
Conclusion and Future Work

Advantages
  FIFO ordering guarantees fair acquisition of locks
  Supports a large number of resources through the use of bitsets
  Performance advantage under mid-to-high levels of contention
Future Work
  Adopting a wait-free ring buffer to achieve starvation-freedom
  NUMA-awareness
  Adapting the algorithm to improve performance under low levels of contention
Questions? Thank you!