Non-blocking Array-based Algorithms for Stacks and Queues
Niloufar Shafiei


Outline
- Introduction
- Concurrent stacks and queues
- Contributions
- New algorithms
- New algorithms using bounded counter values
- Correctness
- Time analysis
- Model checking
- Implementations and comparisons
- Conclusion and future work

Asynchronous distributed shared memory
- Processes communicate through shared memory.
- Each process has its own independent clock.
Mutual exclusion (using locks) has disadvantages:
- Not reliable or fault tolerant
- Priority inversions
- Deadlock

Non-blocking and wait-free algorithms
- Non-blocking (lock-free): some pending operation completes in a finite number of steps.
- Wait-free: every operation completes in a finite number of steps.
Advantages: immune to deadlock; robust performance.
Disadvantage: complex and subtle to design.

Linearizability
- Correctness condition for shared objects: each operation appears to take effect atomically at some point (its linearization point) between its invocation and its response.

Linearizability
[Figure: a timeline with overlapping push(v1), push(v2) and pop operations on a shared stack; different choices of linearization points give different, but equally valid, results for the pop (e.g. v2 or empty).]

Compare and Swap
- It is impossible to construct some shared objects using only atomic read/write registers.
- Instead, use a universal synchronization primitive: Compare and Swap (C&S).

C&S(X, old, new):
    if X = old then
        X := new
        return true
    else
        return false
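
For concreteness, this is how the C&S primitive is exposed in Java's java.util.concurrent.atomic package (the package the implementations later in the talk are based on). The snippet is only illustrative; it is not code from the thesis.

    import java.util.concurrent.atomic.AtomicInteger;

    public class CasDemo {
        public static void main(String[] args) {
            AtomicInteger x = new AtomicInteger(5);    // X = 5

            // C&S(X, 5, 7) succeeds because X still equals 5.
            boolean ok = x.compareAndSet(5, 7);
            System.out.println(ok + " " + x.get());    // prints: true 7

            // C&S(X, 5, 9) fails because X is now 7, so X is left unchanged.
            ok = x.compareAndSet(5, 9);
            System.out.println(ok + " " + x.get());    // prints: false 7
        }
    }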

ABA problem
- A process reads X = A and remembers old := X.
- Before its C&S(X, old, new), other processes change X to B and then back to A.
- The C&S then succeeds as if X had not been changed, even though it has.
- Solution: attach counter values to X so that every modification is distinguishable.
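
A minimal sketch of the counter-value idea, expressed with Java's AtomicStampedReference, which pairs a value with an integer stamp. The class and variable names here are illustrative; this is not the representation used in the thesis, where the counters share a word with the other fields.

    import java.util.concurrent.atomic.AtomicStampedReference;

    public class AbaDemo {
        public static void main(String[] args) {
            // Pair the value with a counter (stamp) so an A -> B -> A change is visible.
            AtomicStampedReference<String> x = new AtomicStampedReference<>("A", 0);

            int[] stampHolder = new int[1];
            String old = x.get(stampHolder);           // old = "A"
            int oldStamp = stampHolder[0];             // counter = 0

            // Other processes change X to B and back to A, bumping the counter each time.
            x.compareAndSet("A", "B", 0, 1);
            x.compareAndSet("B", "A", 1, 2);

            // A plain value comparison would succeed here, but the counter differs,
            // so the C&S fails and the ABA change is detected.
            boolean ok = x.compareAndSet(old, "C", oldStamp, oldStamp + 1);
            System.out.println(ok);                    // prints: false
        }
    }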

Concurrent stacks and queues
- Fundamental data structures in distributed systems.
- Applications: parallel applications such as garbage collection and operating systems.
- Two main categories: link-based and array-based.

Link-based versus array-based
Link-based:
- Extra space required for pointers
- Potential memory fragmentation and memory-management overhead
Array-based:
- Compact data structure
- Leaves enough space in a word for counter values
- Fixed size
- Good locality of reference

Contributions
- Two non-blocking array-based algorithms for stacks and two for queues.
- A shared array implements the shared stack or queue.
- Shared variables (Top for the stack; Rear and Front for the queue) store the index of the end elements of the data structure.
- Built from the C&S primitive and counter values.

New algorithms
- Linearization point of a successful operation: its successful C&S on Top/Rear/Front.
- Linearization point of an operation that returns Empty or Full: its last read of Top/Rear/Front.

Structure of an operation (push/pop/enqueue/dequeue):
    loop:
        read Top/Rear/Front
        ...
        update the array (helping the previous operation complete)
        ...
        C&S on Top/Rear/Front
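
To make the loop shape concrete, here is a deliberately stripped-down, push-only sketch in Java. The class ArrayStackSketch and its Top record are hypothetical names introduced for illustration; the sketch omits pop, the counter values, and most of the helping logic of the real algorithms, so it only shows the read / update-array / C&S structure, not Shafiei's actual algorithm.

    import java.util.concurrent.atomic.AtomicReference;
    import java.util.concurrent.atomic.AtomicReferenceArray;

    // Push-only sketch of the read / update-array / C&S loop shape.
    public class ArrayStackSketch {
        // Top records the index of the current top element together with the value
        // that belongs in that slot; the next operation writes that value into the
        // array (helping) before trying to move Top with a C&S.
        private static final class Top {
            final int index;       // index of the top element (-1 means empty)
            final Integer value;   // value that should end up in array[index]
            Top(int index, Integer value) { this.index = index; this.value = value; }
        }

        private final AtomicReferenceArray<Integer> items;
        private final AtomicReference<Top> top = new AtomicReference<>(new Top(-1, null));

        ArrayStackSketch(int capacity) { items = new AtomicReferenceArray<>(capacity); }

        boolean push(int v) {
            while (true) {
                Top t = top.get();                                   // read Top
                if (t.index + 1 >= items.length()) return false;     // stack is Full
                if (t.index >= 0) items.set(t.index, t.value);       // update array: help the previous push
                if (top.compareAndSet(t, new Top(t.index + 1, v)))   // C&S on Top (linearization point)
                    return true;
                // C&S failed: another operation changed Top first, so retry the loop.
            }
        }
    }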

New algorithms
- An execution is an interleaving of the steps of the processes.
[Figure: a timeline in which successful changes to Top/Rear alternate with array updates; each array update is performed using the information stored in Top/Rear, namely an index, a value, and counter value(s).]

New algorithms
1. Non-blocking stack and queue algorithms using unbounded counter values.
2. Non-blocking stack and queue algorithms using bounded counter values:
   - reuse counter values
   - employ a collect object

Collect object
- Store: stores a process-value pair.
- Collect: returns the set of process-value pairs of all processes that have stored one.
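
A minimal sketch of a collect object in Java, assuming one single-writer slot per process; the class name SimpleCollect and this particular construction are illustrative only and are not the collect object used in the thesis.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.atomic.AtomicReferenceArray;

    public class SimpleCollect<V> {
        // One slot per process; only process pid writes slots[pid].
        private final AtomicReferenceArray<V> slots;

        public SimpleCollect(int numProcesses) {
            slots = new AtomicReferenceArray<>(numProcesses);
        }

        // Store: record a value for process pid (null clears the slot, i.e. stores "no value").
        public void store(int pid, V value) {
            slots.set(pid, value);
        }

        // Collect: gather the process-value pairs of all processes with a stored value.
        public Map<Integer, V> collect() {
            Map<Integer, V> result = new HashMap<>();
            for (int pid = 0; pid < slots.length(); pid++) {
                V v = slots.get(pid);
                if (v != null) result.put(pid, v);
            }
            return result;
        }
    }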

New algorithms using bounded counter values
Structure of an operation (push/pop/enqueue/dequeue):
    loop:
        inner loop:
            read Top/Rear/Front
            ...
            store counter values into the collect object
            ...
            if Top/Rear/Front has not changed, exit the inner loop
        ...
        update the array
        ...
        collect the collect object to know which counter values are in use
        choose a new counter value
        ...
        C&S on Top/Rear/Front
        store Ø into the collect object
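
The "choose a new counter value" step just needs some value not reported in use by the collect. A sketch, assuming the bounded range [0, max) is large enough that a free value always exists; the method name chooseFreeCounter is hypothetical.

    import java.util.Set;

    public final class CounterChoice {
        // Return the smallest counter value in [0, max) that no process reported as in use.
        static int chooseFreeCounter(Set<Integer> inUse, int max) {
            for (int c = 0; c < max; c++) {
                if (!inUse.contains(c)) {
                    return c;
                }
            }
            // Unreachable if max exceeds the number of counter values that can be in use at once.
            throw new IllegalStateException("counter range too small");
        }
    }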

New algorithms using bounded counter values
A loop iteration of an operation:
[Figure: timeline of one loop iteration: store the counters into the collect object; last read of Top/Rear/Front; try to update the array; collect the collect object and choose a new counter value; change Top/Rear/Front with a C&S.]

Correctness
- The shared variables (Top, Rear, Front) are not changed between an operation's last read of them and its successful C&S on them.
- Characterize exactly how the shared array is changed.
- What happens in the data structure exactly matches the abstract stack/queue.
- Operations return results consistent with their linearization order.

Time analysis
- An operation can take arbitrarily many steps as long as some other operation is making progress.
- So we use amortized analysis to evaluate the system as a whole.
- Each unsuccessful loop iteration assigns blame to the operation that did successfully change the shared variable.
- The worst-case amortized cost of our algorithms depends only on point contention.
- Point contention: the maximum number of processes running concurrently at a given point in time.

Time analysis
[Figure: a timeline of four overlapping operations op1, ..., op4. At times T1, T2, T3 and T4 a C&S succeeds (performed by op4, op2, op3 and op1, respectively); every operation whose loop iteration fails at that time blames the operation that succeeded, e.g. op1, op2 and op3 all blame op4 at T1.]
Number of unsuccessful loop iterations: the sum over the successful C&S times T_i of (point contention at T_i - 1).
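
Writing the count out (the notation Ċ(T_i) for the point contention at time T_i is introduced here only for readability): at a time T_i when some C&S succeeds, at most Ċ(T_i) - 1 other operations can have a loop iteration fail and blame the winner, so

    \[
      \#\{\text{unsuccessful loop iterations in an execution}\}
        \;\le\; \sum_{i} \bigl(\dot C(T_i) - 1\bigr),
    \]

where the sum ranges over the times T_i of successful C&S steps. Charging the term for T_i to the operation whose C&S succeeded there bounds the amortized number of loop iterations per operation by its point contention.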

Spin model checker
- Define abstract stack/queue variables.
- Atomically update the abstract stack/queue at the linearization points of successful operations.
- At linearization points, assert that the contents of the shared data structure match the state of the abstract stack/queue.
- Define end-state labels where operations return, to check that all operations terminate.

Model checking
- Verified our algorithms for four operations and an array of size three.
- Used exhaustive search with partial order reduction.

Implementations
- Compared our stack algorithms using unbounded counter values with Treiber's stack algorithm.
- Compared our queue algorithms using unbounded counter values with the queue algorithm of Michael and Scott and the array-based queue algorithm of Colvin and Groves.
- Implemented in Java (java.util.concurrent.atomic).
- Ran on a system with two quad-core processors.

Comparison
- Compared under both low and high contention.
- The total number of operations is constant (1441440) in each execution.
- Increased the number of threads; 50 runs per configuration.
- With 2 threads (possible point contention 2), each thread performs 720720 operations; with 4 threads, 360360 each; ...; with 32 threads, 45045 each.

Comparison of concurrent stack algorithms
[Figure: results chart.]

Comparison of concurrent queue algorithms
[Figure: results chart.]

Conclusions
- Proposed new array-based algorithms for stacks and queues.
- Proved their correctness and verified them with the Spin model checker.
- The amortized time complexity of an operation depends only on point contention.
- Implemented them and compared them with existing algorithms:
  - Compared to Treiber's stack implementation, our stack algorithm is more scalable.
  - Our queue implementation outperforms the Michael and Scott queue algorithm.
- Our stack implementation is the first practical array-based stack implementation.
- This is the first time that bounded counter values have been used to implement a shared stack and queue.

Future work
- Improve the memory reclamation techniques of link-based algorithms so that they:
  - are optimal in general
  - do not increase the contention of the algorithms

Thank you!