Fast and Scalable Queue-Based Resource Allocation Lock on Shared-Memory Multiprocessors

Fast and Scalable Queue-Based Resource Allocation Lock on Shared-Memory Multiprocessors

Deli Zhang, Brendan Lynch, and Damian Dechev
University of Central Florida, Orlando, USA
December 18, 2013

Motivation: Mutual Exclusion on Multicore Systems

Mutual exclusion locks are not composable
Using multiple mutual exclusion locks poses a scalability challenge for data-intensive applications
BerkeleyDB spends over 80% of its execution time in its Test-and-Test-and-Set lock on a 32-core machine [1]

[1] Johnson et al., Shore-MT: a scalable storage manager for the multicore era, 2009

Resource Allocation: Problem Definition

Given a pool of k resources that require exclusive access, each thread may request 1 ≤ h ≤ k resources, and a thread remains blocked until all required resources are available.

The resource allocation problem is a generalized mutual exclusion problem (k-mutual exclusion, h-out-of-k mutual exclusion)
It is also an extension of the Dining Philosophers Problem (relaxing the static resource configuration)

Locking Protocols

Assign a mutual exclusion lock to every resource
Follow a protocol to acquire the locks one by one:
  Two-phase locking
  Resource hierarchy
  Time-stamp locking
Prone to conflicts and retries

Batch Locking

A centralized manager distributes the resources
It handles the resource requests from one thread in one batch:
  Extended TATAS
  Multi-resource lock

Extended TATAS

typedef uint64 bitset;

void lock(bitset* l, bitset r) {
    for (;;) {
        bitset b = *l;
        if (b & r)                   // detect conflict
            continue;
        if (CAS(l, b, b | r) == b)   // acquire all requested bits at once
            return;
    }
}

A bitset is an array of bits
Each resource is represented by one bit
Conflicts are detected by bitwise AND
Acquisition is handled in one batch

Drawbacks:
  No fairness guarantee
  Heavy contention
  Limited number of resources

Queue-Based Multi-Resource Lock: Overview

[Figure: a ring buffer of six cells between HEAD and TAIL, holding the bitsets 1: 01000010, 2: 00011010, 3: 10100100, 4: 00111000, 5: 00000001, 6: 10100000]

Ring-buffer-based concurrent queue
Conflicts are resolved in FIFO order
Unbounded number of resources

Queue-Based Multi-Resource Lock: Overview

[Figure: the same ring buffer after a release; cell 3's bitset is cleared to 00000000]

Ring-buffer-based concurrent queue
Conflicts are resolved in FIFO order
Unbounded number of resources

Data Structures

struct cell {
    atomic<uint32> seq;
    bitset bits;
};

struct mrlock {
    cell* buffer;
    uint32 siz;
    atomic<uint32> head;
    atomic<uint32> tail;
};

void init(mrlock& l, uint32 siz) {
    l.buffer = new cell[siz];
    l.siz = siz;
    l.head.store(0);
    l.tail.store(0);
    for (uint32 i = 0; i < siz; i++) {
        l.buffer[i].bits.set();    // all bits to 1
        l.buffer[i].seq.store(i);
    }
}

Declaration:
  The sequence number serves as a sentinel
  bits holds the resource flags
  Atomic queue head and tail
Initialization:
  Allocate adjacent buffer cells
  bits are initialized to all 1s
  seq is initialized to the cell index

Lock Acquire

function Acquire(mrlock* l, bitset r)
    loop
        pos ← l.tail
        c ← ReadCell(l, pos)
        seq ← c.seq
        if seq − pos == 0 then
            if CAS(&l.tail, pos, pos + 1) succeeds then
                break
    c.bits ← r
    c.seq ← pos + 1
    spin_pos ← l.head
    while spin_pos != pos do
        if IsDequeued(spin_pos) or NoConflict(spin_pos, r) then
            spin_pos ← spin_pos + 1
    return pos

Lock Release

function Release(mrlock* l, handle pos)
    ReadCell(l, pos).bits ← 0
    pos ← l.head
    while ReadCell(l, pos).bits == 0 do
        c ← ReadCell(l, pos)
        seq ← c.seq
        if seq − pos − 1 == 0 then
            if CAS(&l.head, pos, pos + 1) succeeds then
                c.bits ← all 1s          ▷ restore the all-ones sentinel
                c.seq ← pos + l.siz
        pos ← l.head

Sequence Numbers

[Figure: updating flow of the sequence numbers in a four-cell buffer; on enqueue a cell's seq advances from pos to pos + 1, and on dequeue to pos + siz]

Sketch of Correctness Proofs

Concurrent updates of the ring buffer are safe:
Theorem. The head always precedes the tail; the tail is larger than the head by at most N, where N equals the size of the buffer.

Non-atomic updates of the bitset are safe:
  Bitsets are initialized to all 1s
  They are then written to a specific request value by a single thread
Theorem. In the presence of a single writer, the intermediate values of the bitset during the write operation represent supersets of the requested resources.

Alternatives

Two-phase locking:
  std::lock function with std::mutex (STDLock)
  boost::lock function with boost::mutex (BSTLock)
Resource hierarchy:
  with std::mutex (RHSTD)
  with tbb::queuing_mutex (RHQueue)
Extended TATAS (ETATAS)

Testing Configurations

64-core NUMA system: 4 AMD Opteron CPUs with 16 cores per chip @ 2.1 GHz
Micro-benchmark: a tight loop that acquires and releases the locks; resource requests are randomized prior to the loop
Configuration:
  Resources: 2 to 1024
  Threads: 2 to 64
  Resource contention: 0% to 100% (the number of resources requested per thread divided by the total number of resources)

16 Threads

[Figure: execution time vs. resource contention (0% to 100%) with 64 resources (left: MRLock, BSTLock, RHQueue, STDLock, RHLock, ETATAS) and 1024 resources (right: MRLock, RHLock, BSTLock, RHQueue)]

ETATAS and STDLock are excluded on the right because they do not support more than 64 resources.

Up to 64 Resources

[Figure: execution time for 2 threads (left) and 64 threads (right) with 4 to 64 resources; MRLock, STDLock, BSTLock, RHLock, RHQueue, ETATAS]

Up to 1024 Resources

[Figure: execution time for 8 threads (left) and 32 threads (right) with 128 to 1024 resources; MRLock, BSTLock, RHLock, RHQueue]

Thread Scale

[Figure: execution time from 2 to 64 threads at contention 32/64 (50%, left; MRLock, STDLock, BSTLock, RHLock, RHQueue, ETATAS) and contention 128/1024 (12.5%, right; MRLock, BSTLock, RHLock, RHQueue)]

Conclusion and Future Work

Algorithmic advantages:
  FIFO ordering guarantees fair acquisition of locks
  Supports a large number of resources through the use of bitsets
  Performance advantage under mid-to-high levels of contention
Future work:
  Adopting a wait-free ring buffer to achieve starvation freedom
  NUMA awareness
  Adapting the algorithm to improve performance under low levels of contention

Questions? Thank you!