Linked Lists: The Role of Locking. Erez Petrank Technion


Why Data Structures? Concurrent data structures are building blocks, used as libraries; their construction principles apply broadly.

This Lecture: designing our first concurrent data structure, the linked list. How do we use locks? How do we achieve progress guarantees?

Proper Credit: several drawings are taken from the book or from its accompanying slides.

Locking vs. Progress Guarantees

Counter Example
T1: local := counter; local++; counter := local
T2: local := counter; local++; counter := local
Concurrent execution is not safe!

Use a Lock
Lock(L); local := counter; local++; counter := local; Unlock(L)
Synchronization: only one thread can acquire the lock L.

Compare and Swap (CAS)
CAS(addr, expected, new) atomically does:
if (MEM[addr] == expected) { MEM[addr] = new; return TRUE } else return FALSE

Use a Lock or a CAS
With a lock: Lock(L); local := counter; local++; counter := local; Unlock(L)
With a CAS: START: old := counter; new := old + 1; if (!CAS(&counter, old, new)) goto START
Synchronization: only one thread can acquire the lock L. Synchronization: only one of the concurrent CASes succeeds in changing the value from old.

Use a Lock or a CAS
With a lock: Lock(L); local := counter; local++; counter := local; Unlock(L)
With a CAS: START: old := counter; new := old + 1; if (!CAS(&counter, old, new)) goto START
Issues: efficiency, scalability, fairness, progress guarantee, design complexity.
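The CAS retry loop can be sketched in Python. Python exposes no user-level CAS, so the `Word` class below (a name chosen here for illustration) models one atomic memory word, with an internal lock standing in only for the atomicity the hardware provides:

```python
import threading

class Word:
    """One memory word with an atomic CAS, as on real hardware.
    Python has no user-level CAS, so a private lock stands in for atomicity."""
    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        return self._value

    def cas(self, expected, new):
        # Atomically: install new only if the word still equals expected.
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

def increment(counter):
    # Lock-free pattern: read, compute, CAS; retry if someone raced with us.
    while True:
        old = counter.load()
        if counter.cas(old, old + 1):
            return

counter = Word(0)
threads = [threading.Thread(target=lambda: [increment(counter) for _ in range(1000)])
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter.load())  # 4000: no increment is lost
```

Note that the retry loop never blocks: a failed CAS means some other thread's increment succeeded, which is exactly the lock-freedom argument made below.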

What Does Lock-Free Mean? First try: never using a lock. Well, is this using a lock? (word is initially zero)
while (!CAS(&word, 0, 1)) {}
local := counter; local++; counter := local
word := 0
It is not easy to say whether something is a lock.

What Does Lock-Free Mean? Better: no matter which interleaving is scheduled, my program will make progress. A.k.a. non-blocking. Our second example is lock-free:
START: old := counter; new := old + 1; if (!CAS(&counter, old, new)) goto START

Lock-Freedom: if you schedule enough steps across all threads, one of them will make progress. Realistic (though difficult): various lock-free data structures exist in the literature (stacks, queues, hashing, skiplists, trees, etc.). Advantages: worst-case responsiveness, scalability, no deadlocks, no livelocks, added robustness to thread fail-stops.

Linked List: Our First Example Fine-grained locking and lock-freedom

The Linked List. Supports insert, delete, contains. Implements a set: no duplicates, order maintained. (Figure: sorted list 4 → 6 → 9.)

Sequential Implementation. Delete 6: unlink it (4 → 6 → 9 becomes 4 → 9). Insert 7: link a new node in (4 → 6 → 9 becomes 4 → 6 → 7 → 9).
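A minimal sequential version might look as follows (the class name, sentinel head, and helper are this sketch's choices, not the slides'):

```python
class Node:
    def __init__(self, key, next=None):
        self.key = key
        self.next = next

class SequentialList:
    """Sorted linked list implementing a set; a -inf sentinel head
    removes the empty-list and insert-at-front special cases."""
    def __init__(self):
        self.head = Node(float('-inf'))

    def _find(self, key):
        # Return (pred, curr) where curr is the first node with curr.key >= key.
        pred = self.head
        curr = pred.next
        while curr is not None and curr.key < key:
            pred, curr = curr, curr.next
        return pred, curr

    def insert(self, key):
        pred, curr = self._find(key)
        if curr is not None and curr.key == key:
            return False                 # already present: a set has no duplicates
        pred.next = Node(key, curr)      # a single pointer update
        return True

    def delete(self, key):
        pred, curr = self._find(key)
        if curr is None or curr.key != key:
            return False
        pred.next = curr.next            # unlink: a single pointer update
        return True

    def contains(self, key):
        _, curr = self._find(key)
        return curr is not None and curr.key == key
```

Both insert and delete boil down to one pointer update, which is exactly why the concurrent interleavings below can lose nodes.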

But don't try this concurrently! If delete(6) and insert(7) run in parallel, the new node 7 can be linked after 6 just as 6 is being unlinked, so 7 is lost. A similar problem occurs with concurrent deletes.

Solutions: Coarse-grained locking (sequential with overhead). Fine-grained locking (hand-over-hand, optimistic, lazy synchronization). Lock-free (or wait-free) implementation. (Chart: from coarse-grained to fine-grained to lock-free, simplicity decreases while scalability/performance increases.)

Fine-Grained Locking #1: Hand-Over-Hand

Hand-over-Hand Locking [Bayer-Schkolnik 1977]. (Figure: list a → b → c.)

Operation 1: Remove a Node. To remove(b) from a → b → c → d: lock a, lock b, then redirect a's pointer to c, unlinking b. Why do we need to always hold 2 locks?

Concurrent Removes. Now run remove(b) and remove(c) on a → b → c → d, each holding only one lock at a time. remove(b) redirects a past b while remove(c) redirects b past c. Uh, oh: bad news, c is not removed — a still points to c.

With Two Locks. Run remove(b) and remove(c) again, each locking the predecessor and the victim. remove(c) must acquire the lock of b; it cannot while that lock is held by the other operation, so it waits. The two removes serialize, and after both proceed the list is a → d: both nodes are correctly removed.

Adding Nodes. To add a node e: go hand-over-hand, lock the predecessor and the successor; neither can then be deleted. Actually, it is enough to lock the predecessor (for an insert), but one must still go hand-over-hand.
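The hand-over-hand discipline can be sketched with one lock per node (class and method names are hypothetical; real implementations handle contention and memory reclamation more carefully). At every moment the thread holds at most two locks, and it never releases the earlier lock before acquiring the later one:

```python
import threading

class Node:
    def __init__(self, key, next=None):
        self.key = key
        self.next = next
        self.lock = threading.Lock()

class HandOverHandList:
    """Sorted list with per-node locks; -inf/+inf sentinels avoid edge cases."""
    def __init__(self):
        self.head = Node(float('-inf'), Node(float('inf')))

    def insert(self, key):
        pred = self.head
        pred.lock.acquire()
        curr = pred.next
        curr.lock.acquire()
        try:
            while curr.key < key:
                # Hand over hand: take the next lock BEFORE releasing the previous.
                pred.lock.release()
                pred = curr
                curr = curr.next
                curr.lock.acquire()
            if curr.key == key:
                return False
            pred.next = Node(key, curr)      # predecessor locked: link is safe
            return True
        finally:
            pred.lock.release()
            curr.lock.release()

    def remove(self, key):
        pred = self.head
        pred.lock.acquire()
        curr = pred.next
        curr.lock.acquire()
        try:
            while curr.key < key:
                pred.lock.release()
                pred = curr
                curr = curr.next
                curr.lock.acquire()
            if curr.key == key:
                pred.next = curr.next        # both nodes locked: unlink is safe
                return True
            return False
        finally:
            pred.lock.release()
            curr.lock.release()

    def keys(self):
        # Unsynchronized traversal, for inspection only.
        out, curr = [], self.head.next
        while curr.key != float('inf'):
            out.append(curr.key)
            curr = curr.next
        return out
```

Because every thread acquires locks in list order, the no-deadlock argument below applies directly.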

No Deadlock In general, no deadlock if locks are always acquired in the same order.

Why Is It Correct? The idea: a snapshot. Each thread sees all operations executed earlier, and no operation that started afterwards. Start time: taking the head's lock. Implication: sequentialization of operations. Good only for hierarchical data structures.

Properties Scalability better than coarse-grained locking. But long chains of threads waiting for the first thread to advance. Limited parallelism. Excessive locking harms performance. Can we obtain more parallelism and better performance?

Second List: Optimistic (First was hand-over-hand.)

Optimistic Synchronization [Herlihy-Shavit 2008]. Find the nodes without locking. Lock the nodes. Check that everything is OK.

Optimistic: Traverse without Locking. To add(c) in a → b → d → e, traverse without locks until the window (b, d) is found, then lock b and d. What could go wrong? A concurrent remove(b) may have already unlinked b.

The first node must be in the list! While holding the lock, check that the first node is reachable from the head. While we hold the lock, this node cannot be removed.

Validate Part 1 (while holding locks): yes, b is still reachable from the head (checked after the locks were acquired!).

What Else Can Go Wrong? A concurrent add(b') may have inserted a new node b' between b and d, so b no longer points to d.

The first node must still point to the second! Validation 1: the first node is still reachable; while we hold the lock, the node cannot be removed. Validation 2: the first node points to the second; while we hold the lock, the pointer out of the first node cannot be modified (no adding and no removing).

Validate Part 2 (while holding locks): yes, b still points to d (after the locks were acquired!).

Validation Failure? Upon failure to validate, start from scratch. Assumed to happen infrequently.

Insert (After Validation): link c in between b and d.

Optimistic Synchronization. More parallelism, better scalability: only the nodes actually being modified are locked. Scalability depends on the actual workload. Excessive work on validation (double traversal) makes it less efficient. There is another fine-grained locking methodology, but let's jump to lock-freedom first.
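The two validation checks can be sketched as one re-traversal from the head, performed only after both locks are held (names and list shape here are hypothetical illustrations):

```python
class Node:
    def __init__(self, key, next=None):
        self.key, self.next = key, next

def validate(head, pred, curr):
    """Optimistic validation, run only AFTER pred and curr are locked:
    (1) pred is still reachable from the head, and
    (2) pred still points to curr."""
    node = head
    while node.key <= pred.key:
        if node is pred:
            return pred.next is curr
        node = node.next
    return False

# Hypothetical list a(1) -> b(2) -> d(4); an add(3) found the window (b, d).
head = Node(float('-inf'))
d = Node(4)
b = Node(2, d)
a = Node(1, b)
head.next = a
```

If a concurrent remove(b) unlinks b before the locks are taken (say, `a.next = d`), the re-traversal no longer reaches b and validation fails, forcing a restart.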

Third List: Lock-Free We did hand-over-hand and optimistic

Lock-Freedom. Don't use locks, and more importantly: guarantee progress! Complete robustness against worst-case scheduling; no problems when threads are swapped out; even when a thread dies, the other threads will continue to make progress. Design by [Harris 2001], improvement by [Michael 2002].

Lock-Free Linked Lists [Harris-Michael 2001-2] First attempt: insert/delete using CAS instead of a regular read/write operation.

First Attempt: Use CASes Instead of Locks. Delete 6 by CASing the pointer of 4 from 6 to 9. Insert 7 by CASing the pointer of 6 from 9 to 7.

The original problem still exists. Outcome of deleting 6 and inserting 7 in parallel: 7 is linked after the already-unlinked 6 and is lost.

We Keep the Simple Insert. A single CAS inserts 7, after locally allocating and initializing it. But delete will be more complicated.

Recall the Problem. Outcome of deleting 6 and inserting 7 in parallel: the new node 7 is lost.

The Crux of the Problem. When deleting 6, we want to block changes both to the pointer that points at 6 and to the pointer that points out of 6. Harris's idea: 1. mark the pointer out of 6, and then 2. modify the pointer out of 4.

Solution: Mark & Delete. Logically delete 6 by marking the outgoing pointer of 6 (4 → 6 → 9 with 6's pointer marked). Physically delete 6 by unlinking it from the list (4 → 9).

Implementing a Marked (Red) Pointer. Use the least significant bit: it is essentially unused, since pointers to word-aligned nodes are multiples of 4 or 8 bytes. Unmarked: 000101010010101110000100 (low bit 0). Marked: 000101010010101110000101 (low bit 1).
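On integers, the bit trick looks like this (a Python sketch; real code applies it to raw pointer words, and Java offers AtomicMarkableReference instead — the address below is a made-up aligned value):

```python
# A word-aligned node address always has its low bits zero (a multiple of
# 4 or 8), so the least significant bit is free to carry the "deleted" mark.
def mark(ptr):
    return ptr | 1          # set the mark bit

def unmark(ptr):
    return ptr & ~1         # clear it to recover the real address

def is_marked(ptr):
    return (ptr & 1) == 1

addr = 0x7FFF1C40           # hypothetical 8-byte-aligned address
m = mark(addr)
assert not is_marked(addr)
assert is_marked(m)
assert unmark(m) == addr    # marking loses no address information
```

Because the mark lives inside the same word as the pointer, a single CAS can test and change both together, which the next slide relies on.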

Logical Deletion. Logical removal = set the mark bit; the mark bit and the pointer are CASed together. Physical removal = CAS the predecessor's pointer. An attempted insert after a logical removal will fail its CAS.

Concurrent Removal. remove(b) and remove(c) each mark their victim's next pointer; a physical-unlink CAS may fail when the other operation has already changed the predecessor, but the marked nodes are eventually trimmed and the list ends as a → d.

Traversing the List. When you find a logically deleted node in your path, finish the job: CAS the predecessor's next field, then proceed (repeat as needed).

Lock-Free Traversal (with only add and remove). (Figure: the pred/curr window advances over a → b → c → d; a CAS trims a marked node — uh-oh if the CAS fails.)

CAS Failures. Node removal: if the logical remove (marking) fails, start from scratch; if the physical remove fails, ignore it. Why? Node insert: if the CAS fails, start from scratch.

Why is it Lock-Free? Node removal: if the logical remove fails, either someone else succeeded in removing this node, or someone else inserted a node. If the logical remove succeeds, I succeeded in deleting a node (and will finish the operation after trying the physical remove once). Node insert: if the CAS fails, someone else succeeded in inserting or deleting a node; if the CAS succeeds, I succeeded in inserting a node.
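A sketch of mark-then-unlink removal in Python. The pointer-plus-mark word is modeled by a small class playing the role of Java's AtomicMarkableReference (the names are hypothetical, the internal lock only stands in for a hardware CAS, and the search loop omits trimming marked nodes for brevity):

```python
import threading

class AtomicMarkableRef:
    """A reference and a mark bit CASed together as one unit; the private
    lock merely models the atomicity a hardware CAS would provide."""
    def __init__(self, ref=None, mark=False):
        self._ref, self._mark = ref, mark
        self._lock = threading.Lock()

    def get(self):
        return self._ref, self._mark

    def cas(self, exp_ref, exp_mark, new_ref, new_mark):
        with self._lock:
            if self._ref is exp_ref and self._mark == exp_mark:
                self._ref, self._mark = new_ref, new_mark
                return True
            return False

class Node:
    def __init__(self, key, succ=None):
        self.key = key
        self.next = AtomicMarkableRef(succ)

def remove(head, key):
    """Mark the victim's next pointer (logical delete), then try the
    physical unlink once; if that CAS fails, some other thread trims it."""
    while True:
        pred, curr = head, head.next.get()[0]
        while curr.key < key:                     # sketch: no trimming here
            pred, curr = curr, curr.next.get()[0]
        if curr.key != key:
            return False
        succ, marked = curr.next.get()
        if marked:
            return False                          # already logically removed
        if not curr.next.cas(succ, False, succ, True):
            continue                              # lost the race on the mark: retry
        pred.next.cas(curr, False, succ, False)   # physical delete; OK to fail
        return True

def keys(head):
    # Traverse, skipping logically deleted (marked) nodes.
    out, curr = [], head.next.get()[0]
    while curr.key != float('inf'):
        succ, marked = curr.next.get()
        if not marked:
            out.append(curr.key)
        curr = succ
    return out

tail = Node(float('inf'))
head = Node(float('-inf'), Node(4, Node(6, Node(9, tail))))
remove(head, 6)
print(keys(head))  # [4, 9]
```

Note that a failed physical-delete CAS is simply ignored: the node is already marked, so any later traversal can finish the unlinking, which is the "anyone can unlock" intuition on the next slide.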

The Main Intuition. The logical mark locks the next pointer against modification. But this lock can be unlocked by anyone (by trimming the node from the list), so no one is stuck — different from normal locking, where only the owner can unlock. This is the methodology in all lock-free algorithms: make a change in the data structure, leaving it unstable; anyone can stabilize the data structure and continue to work on it.

Progress guarantees are good for real-time systems, OSs, interactive systems, service-level agreements, etc. But they are always good to have: they avoid deadlock, livelock, convoying, priority inversion, etc., and they help scalability.

Progress Guarantees. Wait-Freedom: if you schedule enough steps of any thread, it will make progress. A great guarantee! Until recently considered difficult to achieve and inefficient. Lock-Freedom: if you schedule enough steps across all threads, one of them will make progress.

Contains is Wait-Free.
Contains(key):
  curr = head
  while (curr.key < key)
    curr = removemark(curr.next)
  return (curr.key == key && !marked(curr.next))
Contains is important!

Fourth List: Third (and Best) Fine-Grained Locking: Lazy List We discussed hand-over-hand, optimistic, and lock-free.

Lazy Synchronization [Heller et al. 2005]. Lock only the relevant nodes. Do not validate reachability; instead, leave a mark on deleted nodes, as in the lock-free algorithm. To remove a node: logically remove it by marking it, then physically remove it by unlinking it.

Remove or Add. Scan through the list; lock the predecessor and the current node; validate that both are not marked and that the predecessor points to the current node; then perform the add or the remove. Search can simply traverse the list.
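Unlike the optimistic list, lazy validation is a constant-time check with no re-traversal from the head. A sketch, with a hypothetical Node carrying the mark flag (the check is meaningful only while both locks are held):

```python
class Node:
    def __init__(self, key, next=None):
        self.key, self.next = key, next
        self.marked = False          # logical-deletion flag

def validate(pred, curr):
    """Lazy-list validation, run with both nodes locked:
    neither node is marked, and pred still points to curr."""
    return (not pred.marked) and (not curr.marked) and pred.next is curr

# Hypothetical window (a, b) found by an unlocked scan of a -> b -> c.
c = Node(3)
b = Node(2, c)
a = Node(1, b)
```

If a concurrent remove has logically deleted b (set `b.marked = True`) between the scan and the locking, validation fails and the operation restarts from scratch.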

Lazy List walkthrough for remove(b) on a → b → c: lock a and b; validate (a not marked, b not marked, a still points to b); logical delete (mark b); physical delete (unlink b).

Invariant: if a node is not marked, then its key is in the set and the node is reachable from the head.

The contains Method. Simply traverses the list and reports its finding. Very efficient; progress guaranteed: wait-free. It is the most popular method.

Properties of the Lazy List. Good performance: no rescanning, a small number of locks, hopefully not too many validation failures. Good scalability: only the relevant nodes are locked. Still, the standard locking problems remain: no progress guarantee — a thread holding a lock may suffer a cache miss, a page fault, being swapped out, etc. Worst-case scalability issues; the scheduler is critical here.

Summary. Starting data structures: locking and lock-freedom on the linked list (ordered, for sets). Parallelization problems. Fine-grained locking: hand-over-hand, optimistic, lazy. A lock-free version.

Which List Should You Use? If little contention: coarse-grained locking. To handle contention pretty well: lazy list. To handle high contention and provide a progress guarantee: lock-free list.

The End