CS510 Concurrent Systems. Jonathan Walpole


Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms (M. M. Michael and M. L. Scott, Technical Report 600, December 1995)

Outline
- Background
- Non-Blocking Queue Algorithm
- Two-Lock Concurrent Queue Algorithm
- Performance
- Conclusion

Background
- Locking vs lock-free
- Blocking vs non-blocking
- Delay sensitivity
- Deadlock
- Starvation
- Priority inversion
- Livelock

Linearizability
- A concurrent data structure gives an external observer the illusion that its operations take effect instantaneously
- There must be a specific point during each operation at which it takes effect
- All operations on the data structure can be ordered such that every viewer agrees on the order
- Linearizability is about equivalence to a sequential execution; it is not, by itself, a guarantee of correctness!

Queue Correctness Properties
1. The linked list is always connected
2. Nodes are only inserted after the last node in the linked list
3. Nodes are only deleted from the beginning of the linked list
4. Head always points to the first node in the list
5. Tail always points to a node in the list

Tricky Problems With Queues
- Tail lag
- Freeing dequeued nodes

Problem: Tail Lags Behind Head
- Separate head and tail pointers
- Dequeueing threads can move the head pointer past the tail pointer if enqueueing threads are slow to update it

Problem: Tail Lags Behind Head
- The tail can end up pointing to freed memory!
- Can a dequeue help a slow enqueue by swinging the tail pointer before dequeuing the node?
- Will helping cause contention?
- When is it safe to free memory?

Problem With Reference Counting
- A node is freed only when its reference count drops to zero (if cnt == 0, free(node))
- A slow process that never releases its reference can cause the system to run out of memory

Non-Blocking Queue Algorithm
structure pointer_t {ptr: pointer to node_t, count: unsigned integer} (packed into one word or double word)
structure node_t {value: data type, next: pointer_t}
structure queue_t {Head: pointer_t, Tail: pointer_t}

Initialization
initialize(Q: pointer to queue_t)
  node = new_node()                 // Allocate a free node: the dummy node
  node->next.ptr = NULL             // It is the only node in the list
  Q->Head.ptr = Q->Tail.ptr = node  // Both Head and Tail point to it

Enqueue
enqueue(Q: pointer to queue_t, value: data type)
  node = new_node()                          // Allocate a new node from the free list
  node->value = value
  node->next.ptr = NULL
  loop                                       // Keep trying until enqueue is done
    tail = Q->Tail                           // Copy: read Tail (ptr and count together)
    next = tail.ptr->next                    // Read next pointer of the apparent last node
    if tail == Q->Tail                       // Check: is our copy of Tail still consistent?
      if next.ptr == NULL                    // Was tail really pointing to the last node?
        if CAS(&tail.ptr->next, next, <node, next.count+1>)  // Try to commit: link node at end
          break                              // Succeed: the enqueue is done
      else                                   // Another process won: Tail was lagging
        CAS(&Q->Tail, tail, <next.ptr, tail.count+1>)  // Catch up: swing Tail forward, then retry
  endloop
  CAS(&Q->Tail, tail, <node, tail.count+1>)  // Swing Tail to the inserted node

The slides step through this code one case at a time: Copy, Check, Try Commit, Succeed, Swing Tail, Other Process Won, Catch Up, and Retry Commit.

Enqueue: When does the enqueue take effect? Any thoughts?

Dequeue
dequeue(Q: pointer to queue_t, pvalue: pointer to data type): boolean
  loop                                       // Keep trying until dequeue is done
    head = Q->Head                           // Copy: read Head (ptr and count together)
    tail = Q->Tail                           // Read Tail
    next = head.ptr->next                    // Read next pointer of the dummy node
    if head == Q->Head                       // Check: is our copy of Head still consistent?
      if head.ptr == tail.ptr                // Queue is empty, or Tail is lagging behind
        if next.ptr == NULL                  // Empty queue: nothing to dequeue
          return FALSE
        CAS(&Q->Tail, tail, <next.ptr, tail.count+1>)   // Help Tail catch up
      else
        *pvalue = next.ptr->value            // Grab the node's value before committing
        if CAS(&Q->Head, head, <next.ptr, head.count+1>)  // Commit: swing Head
          break
  endloop
  free(head.ptr)                             // Free the old dummy node
  return TRUE

The slides step through this code one case at a time: Copy, Check, Empty Queue, Help Tail Catch Up, Grab Node Value, and Commit and Free Node.

Dequeue: When does the dequeue take effect? Any thoughts?

Two-Lock Queue Algorithm
structure node_t {value: data type, next: pointer to node_t}
structure queue_t {Head: pointer to node_t, Tail: pointer to node_t, H_lock: lock type, T_lock: lock type}

Blocking Using Two Locks
- One lock for tail pointer updates and a second for head pointer updates
- Enqueue process locks the tail pointer
- Dequeue process locks the head pointer
- A dummy node prevents enqueue from needing to access the head pointer and dequeue from needing to access the tail pointer, which allows concurrent enqueues and dequeues

Algorithm: Two-Lock Queue
initialize(Q: pointer to queue_t)
  node = new_node()              // Allocate the dummy node
  node->next = NULL
  Q->Head = Q->Tail = node       // Both Head and Tail point to the dummy node
  Q->H_lock = Q->T_lock = FREE   // Both locks start out free

Enqueue
enqueue(Q: pointer to queue_t, value: data type)
  node = new_node()          // Allocate a new node from the free list
  node->value = value        // Write the value to be enqueued into the node
  node->next = NULL          // Set the node's next pointer to NULL
  lock(&Q->T_lock)           // Acquire T_lock in order to access Tail
  Q->Tail->next = node       // Link the node at the end of the linked list
  Q->Tail = node             // Swing Tail to point to the node
  unlock(&Q->T_lock)         // Release T_lock

Dequeue
dequeue(Q: pointer to queue_t, pvalue: pointer to data type): boolean
  lock(&Q->H_lock)           // Acquire H_lock in order to access Head
  node = Q->Head             // Read Head (the dummy node)
  new_head = node->next      // Read the next pointer
  if new_head == NULL        // Is the queue empty?
    unlock(&Q->H_lock)       // Release H_lock before returning
    return FALSE             // Queue was empty
  *pvalue = new_head->value  // Queue not empty: read the value before release
  Q->Head = new_head         // Swing Head to the next node (the new dummy)
  unlock(&Q->H_lock)         // Release H_lock
  free(node)                 // Free the old dummy node
  return TRUE

Two-Lock Algorithm: Discussion
There is one case where two threads can access the same node at the same time.
- When is it?
- Why is it not a problem?
- Any other thoughts?

Performance Measurements
- The code was heavily optimized
- The test threads alternately enqueue and dequeue items, with other work (busy waiting) in between
- Multiple threads run on each CPU to simulate multiprogramming

Dedicated Processor

Two Processes Per Processor

Three Processes Per Processor

Conclusion
- Non-blocking synchronization is the clear winner for multiprogrammed multiprocessor systems
- Above 5 processors, the new non-blocking queue is the best
- For hardware that only supports test-and-set, use the two-lock queue
- For two or fewer processors, use a single-lock algorithm for queues, because the overhead of more complicated algorithms outweighs the gains