NON-BLOCKING DATA STRUCTURES AND TRANSACTIONAL MEMORY. Tim Harris, 17 November 2017


Lecture 7
Linearizability
Lock-free progress properties
Hashtables and skip-lists
Queues
Reducing contention
Explicit memory management

Linearizability

More generally: suppose we build a shared-memory data structure directly from read/write/CAS, rather than using locking as an intermediate layer. (Diagram: a data structure built over locks over the H/W primitives read, write, CAS, ..., versus a data structure built directly over the H/W primitives.) Why might we want to do this? What does it mean for the data structure to be correct?

What we're building: a set of integers, represented by a sorted linked list. Operations: find(int) -> bool, insert(int) -> bool, delete(int) -> bool.
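A minimal sketch of the representation these slides assume (field and sentinel names are illustrative): a sorted singly linked list running between a head sentinel H and a tail sentinel T.

typedef struct node {
    int key;
    struct node *next;   // a spare low-order bit will later mark logical deletion
} node;

node *H, *T;             // head and tail sentinels

bool find(int key);      // true iff key is in the set
bool insert(int key);    // false if key was already present
bool delete(int key);    // false if key was not present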

Searching a sorted list. find(20): walk H -> 10 -> 30 -> T looking for 20; it is not there, so find(20) -> false.

Inserting an item with CAS. insert(20): create a node for 20 whose next pointer is the 30 node, then CAS the 10 node's next pointer from 30 to the new node; insert(20) -> true.
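A hedged sketch of the pictured insert in the slides' pseudocode style (locate and new_node are illustrative helpers; deletion is not handled yet):

bool insert(int key) {
    while (true) {
        node *prev, *curr;
        locate(&prev, &curr, key);            // prev->key < key <= curr->key (curr may be T)
        if (curr != T && curr->key == key) return false;
        node *n = new_node(key);
        n->next = curr;                       // e.g. the new 20 node points at 30
        if (CAS(&prev->next, curr, n))        // e.g. swing 10's next from 30 to 20
            return true;                      // this CAS is the linearization point
        // CAS failed: the list changed here concurrently, so retry
    }
}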

Inserting an item with CAS. Concurrent insert(20) and insert(25): each creates a node whose next pointer is the 30 node, and each attempts to CAS the 10 node's next pointer from 30 to its own node. Only one CAS can succeed; the other thread must retry.

Searching and finding together: find(20) -> false runs concurrently with insert(20) -> true. One thread saw that 20 was not in the set... but the other thread succeeded in putting it in! Is this a correct implementation of a set? Should the programmer be surprised if this happens? What about more complicated mixes of operations?

Correctness criteria. Informally: look at the behaviour of the data structure (what operations are called on it, and what their results are). If this behaviour is indistinguishable from atomic calls to a sequential implementation, then the concurrent implementation is correct.

Sequential history: no overlapping invocations. T1: insert(10) -> true, then T2: insert(20) -> true, then T1: find(15) -> false; the set contents over time are {10}, then {10, 20}, then {10, 20}.

Concurrent history: allow overlapping invocations. Thread 1 performs insert(10) -> true and then insert(20) -> true; Thread 2's find(20) -> false overlaps them in time.

Linearizability: is there a correct sequential history that gives the same results as the concurrent one and is consistent with the timing of the invocations/responses?

Example: linearizable. Thread 1: insert(10) -> true, then insert(20) -> true; Thread 2: find(20) -> false overlapping them. A valid sequential history exists, so this concurrent execution is OK.

Example: linearizable. Thread 1: insert(10) -> true, then delete(10) -> true; Thread 2: find(10) -> false overlapping them. A valid sequential history exists, so this concurrent execution is OK.

Example: not linearizable. Thread 1: insert(10) -> true, then insert(10) -> false; Thread 2: delete(10) -> true. With this timing, the delete must be ordered after the first insert (it found 10 present) and before the second insert, but then the second insert should have returned true; no valid sequential history exists.

Returning to our example: find(20) -> false overlaps with insert(20) -> true. A valid sequential history exists (Thread 2: find(20) -> false, then Thread 1: insert(20) -> true), so this concurrent execution is OK.

Recurring technique. For updates: perform an essential step of an operation by a single atomic instruction, e.g. a CAS to insert an item into a list; this forms a linearization point. For reads: identify a point during the operation's execution when the result is valid; this is not always a specific instruction.

Adding delete. First attempt: just use CAS. delete(10): CAS the head's next pointer from the 10 node to the 30 node, unlinking 10.

Delete and insert together. delete(10) CASes H's next pointer from 10 to 30, while a concurrent insert(20) CASes the 10 node's next pointer from 30 to the new 20 node. Both CASes succeed, but the new 20 node is reachable only from the unlinked 10 node: the insert is lost.

Logical vs physical deletion. Use a spare bit to indicate logically deleted nodes: delete first CASes the victim's next pointer to set the mark bit (logical deletion), and only then CASes the predecessor's next pointer to unlink it (physical deletion). A concurrent insert's CAS on a marked next pointer fails, so the lost-insert problem above is avoided.
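A hedged sketch of the two-phase delete in the slides' pseudocode style (is_marked and with_mark are illustrative helpers that read and set the spare bit; locate is as in the insert sketch above):

bool delete(int key) {
    while (true) {
        node *prev, *curr;
        locate(&prev, &curr, key);
        if (curr == T || curr->key != key) return false;
        node *succ = curr->next;
        if (is_marked(succ)) continue;                 // another delete is in progress
        if (!CAS(&curr->next, succ, with_mark(succ)))
            continue;                                  // logical deletion: the linearization point
        CAS(&prev->next, curr, succ);                  // physical unlink; another thread may help
        return true;
    }
}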

Delete-greater-than-or-equal. DeleteGE(int x) -> int: remove x, or the next element above x. E.g. on the list H -> 10 -> 30 -> T, DeleteGE(20) -> 30, leaving H -> 10 -> T.

Does this work? DeleteGE(20): 1. Walk down the list, as in a normal delete, finding 30 as the next element after 20. 2. Do the deletion as normal: set the mark bit in 30, then physically unlink it.

Delete-greater-than-or-equal is not linearizable when implemented that way. Thread 1 executes A: insert(25) -> true and then B: insert(30) -> false; Thread 2 executes C: deleteGE(20) -> 30. B must be after A (thread order). A must be after C (otherwise C should have returned 25, the smallest element at or above 20). C must be after B (otherwise B should have succeeded, since C removed 30). These constraints form a cycle, so no valid sequential ordering exists.

Lock-free progress properties

Progress: is this a good lock-free list?

static volatile int MY_LIST = 0;

bool find(int key) {
  // Wait until list available
  while (CAS(&MY_LIST, 0, 1) == 1) { }
  ...
  // Release list
  MY_LIST = 0;
}

OK, we're not calling pthread_mutex_lock... but we're essentially doing the same thing.

Lock-free: a specific kind of non-blocking progress guarantee. It precludes the use of typical locks, whether from libraries or hand-rolled. The term is often mis-used informally as a synonym for 'free from calls to a locking function', 'fast', or 'scalable'.

Lock-free is a technical property, not a style: the version-number mechanism, for example, is often effective in practice and does not use locks, but it is not lock-free in this technical sense.
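To illustrate, here is a hedged, single-writer sketch of a version-number (seqlock-style) read path in the slides' pseudocode style; memory fences are omitted. It uses no locks, yet a writer that stalls between its two increments keeps every reader spinning, so it is not lock-free:

int version = 0;     // even = stable, odd = update in progress
int data;

void write_data(int v) {          // single writer assumed
    version++;                    // now odd: readers will retry
    data = v;
    version++;                    // even again: readers may complete
}

int read_data(void) {
    while (true) {
        int v1 = version;
        if (v1 % 2 == 1) continue;      // update in progress
        int d = data;
        if (version == v1) return d;    // nothing changed while we read
    }
}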

System model. Each high-level operation (e.g. Lookup(20), Insert(15)) is carried out as a sequence of primitive steps (read/write/CAS) spread out over time: reads of H, H->10, and so on, a CAS to link in a new node, and finally the operation's result (true).

Wait-free: a thread finishes its own operation if it continues executing steps. (Timeline: every operation that starts also finishes.)

Implementing wait-free algorithms. Important in some significant niches, e.g. where worst-case execution time guarantees matter. General construction techniques exist ('universal constructions'). Queuing and helping strategies, in which everyone ensures the oldest operation makes progress, often carry a high sequential overhead and limited scalability. Fast-path / slow-path constructions start out with a faster lock-free algorithm and switch over to a wait-free algorithm if there is no progress; if done carefully, this obtains wait-free progress overall. In practice, progress guarantees can vary between operations on a shared object, e.g. a wait-free find combined with a lock-free delete.

Lock-free: some thread finishes its operation if threads continue taking steps. (Timeline: four operations start but only three finish; an individual thread can starve.)

A (poor) lock-free counter:

int getnext(int *counter) {
  while (true) {
    int result = *counter;
    if (CAS(counter, result, result+1)) {
      return result;
    }
  }
}

Not wait-free: there is no guarantee that any particular thread will succeed.

Implementing lock-free algorithms. Ensure that one thread (A) only has to repeat work if some other thread (B) has made real progress, e.g. insert(x) starts again if it finds that a conflicting update has occurred. Use helping to let one thread finish another's work, e.g. physically deleting a node on its behalf.

Obstruction-free: a thread finishes its own operation if it runs in isolation; interference from other threads can prevent an operation from ever finishing.

A (poor) obstruction-free counter:

int getnext(int *counter) {
  while (true) {
    int result = LL(counter);
    if (SC(counter, result+1)) {
      return result;
    }
  }
}

This assumes a very weak load-linked (LL) / store-conditional (SC): an LL on one thread will prevent an SC on another thread from succeeding.

Building obstruction-free algorithms. Ensure that none of the low-level steps leaves the data structure broken. On detecting a conflict: help the other party finish, or get the other party out of the way. Use contention management to reduce the likelihood of livelock.

Hashtables and skiplists

Hash tables. A bucket array (8 entries in this example); each bucket holds a list of the items whose hash value modulo 8 equals that bucket's index, e.g. bucket 0 holds 0, 16, 24, bucket 3 holds 3, 11, and bucket 5 holds 5.

Hash tables: Contains(16). 1. Hash 16; use bucket 0. 2. Use normal list operations on that bucket's list.

Hash tables: Delete(11). 1. Hash 11; use bucket 3. 2. Use normal list operations on that bucket's list.
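A hedged sketch, in the slides' style, of how the table operations delegate to per-bucket list operations (the list_* helper names are illustrative; 8 buckets as in the example):

list buckets[8];

bool contains(int key) { return list_find(&buckets[key % 8], key); }    // e.g. 16 -> bucket 0
bool delete(int key)   { return list_delete(&buckets[key % 8], key); }  // e.g. 11 -> bucket 3
bool insert(int key)   { return list_insert(&buckets[key % 8], key); }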

Lessons from this hashtable. Informal correctness argument: operations on different buckets don't conflict, so no extra concurrency control is needed; operations appear to occur atomically at the point where the underlying list operation occurs. (This is not specific to lock-free lists: one could use a whole-table lock, or per-list locks, etc.)

Practical difficulties: key-value mapping, population count, iteration, resizing the bucket array. Options to consider when implementing a difficult operation: relax the semantics (e.g. a non-exact or non-linearizable count); fall back to a simple implementation if permitted (e.g. lock the whole table for resize); design a clever implementation (e.g. split-ordered lists); or use a different data structure (e.g. skip lists).

Skip lists. Each node is a tower of random height; high levels skip over lower levels. All items (0, 3, 5, 11, 16, 24 in the example) appear in a single lowest-level list: this defines the set's contents.

Skip lists: Delete(11). Principle: the lowest list is the truth. 1. Find the 11 node and mark it logically deleted. 2. Link by link, remove 11 from the towers. 3. Finally, remove 11 from the lowest list.

Queues

Work stealing queues. Operations: PushBottom(Item), PopBottom() -> Item, PopTop() -> Item. 1. The semantics are relaxed for PopTop: it tries to steal an item, and may sometimes return nothing spuriously. 2. Restriction: only one thread (the owner) ever calls PushBottom/PopBottom; these add and remove items, and PopBottom must return an item if the queue is not empty. 3. Implementation costs are skewed: the owner's operations are kept cheap, while PopTop is the complex one.

Bounded deque (Arora, Blumofe, Plaxton). Items between the Top and Bottom indices are present in the queue. Top has a version number, updated atomically with it. Bottom is a normal integer, updated only by the local (owner) end of the queue.

Bounded deque: pushBottom.

void pushbottom(item i) {
  tasks[bottom] = i;
  bottom++;
}

Bounded deque: popBottom (common case; the empty/last-item case is filled in below).

Item popbottom() {
  if (bottom == 0) return null;
  bottom--;
  result = tasks[bottom];
  <tmp_top, tmp_v> = <top, version>;
  if (bottom > tmp_top) return result;
  ...
  return null;
}

Bounded deque: popTop, and popBottom completed to handle the race on the last item.

Item poptop() {
  if (bottom <= top) return null;
  <tmp_top, tmp_v> = <top, version>;
  result = tasks[tmp_top];
  if (CAS(&<top, version>, <tmp_top, tmp_v>, <tmp_top+1, tmp_v+1>)) {
    return result;
  }
  return null;
}

Item popbottom() {
  if (bottom == 0) return null;
  bottom--;
  result = tasks[bottom];
  <tmp_top, tmp_v> = <top, version>;
  if (bottom > tmp_top) return result;
  // the deque may now be empty: race with stealers for the last item
  if (bottom == tmp_top) {
    bottom = 0;
    if (CAS(&<top, version>, <tmp_top, tmp_v>, <0, tmp_v+1>)) {
      return result;
    }
  }
  // lost the race (or the deque was already empty): reset to empty
  bottom = 0;
  <top, version> = <0, tmp_v+1>;
  return null;
}


ABA problems. Without the version number, popTop would be:

Item poptop() {
  if (bottom <= top) return null;
  tmp_top = top;
  result = tasks[tmp_top];
  if (CAS(&top, top, top+1)) {
    return result;
  }
  return null;
}

Example: a stealer reads result = CCC from the top slot, then stalls; meanwhile the owner pops and pushes items so that the array contents change (e.g. DDD, CCC, BBB, ... become EEE, AAA, FFF, ...) and top returns to the same index. The stealer's CAS on top still succeeds, so it returns CCC even though that item is no longer in the deque. The version number on top prevents this.

General techniques. Local operations are designed to avoid CAS: traditionally slower, less so now, and the costs of memory fences can be important ('Idempotent work stealing', Michael et al., and the 'Laws of Order' paper). Local operations just use read and write: there is only one accessor, which checks for interference. Use CAS to resolve conflicts between stealers and to resolve local/stealer conflicts, with the version number ensuring conflicts are seen.

Reducing contention

Reducing contention. Suppose you're implementing a shared counter with the following sequential spec:

void increment(int *counter) { atomic { (*counter)++; } }
void decrement(int *counter) { atomic { (*counter)--; } }
bool iszero(int *counter) { atomic { return (*counter) == 0; } }

How well can this scale?
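For concreteness, a hedged sketch of the obvious CAS-based implementation of this spec: it is correct, but every operation targets the same word, so the counter's cache line bounces between cores and throughput does not scale with the number of threads.

void increment(int *counter) {
    int old;
    do { old = *counter; } while (!CAS(counter, old, old + 1));
}

void decrement(int *counter) {
    int old;
    do { old = *counter; } while (!CAS(counter, old, old - 1));
}

bool iszero(int *counter) {
    return *counter == 0;
}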

SNZI trees ('SNZI: Scalable NonZero Indicators', Ellen et al.). Each node holds a value and a version number, updated together with CAS. Threads are spread across the leaf SNZI nodes; a child SNZI forwards an inc/dec to its parent only when the child's value changes to or from zero.

SNZI trees: linearizability on a 0 -> 1 change. 1. T1 calls increment. 2. T1 increments the child to 1. 3. T2 calls increment. 4. T2 increments the child to 2. 5. T2 completes. 6. Tx calls iszero. 7. Tx sees 0 at the parent. 8. T1 calls increment on the parent. 9. T1 completes.

SNZI trees: increment.

void increment(snzi *s) {
  bool done = false;
  int undo = 0;
  while (!done) {
    <val, ver> = read(s->state);
    if (val >= 1 && CAS(s->state, <val, ver>, <val+1, ver>)) {
      done = true;
    }
    if (val == 0 && CAS(s->state, <val, ver>, <½, ver+1>)) {
      done = true; val = ½; ver = ver+1;
    }
    if (val == ½) {
      increment(s->parent);
      if (!CAS(s->state, <val, ver>, <1, ver>)) {
        undo++;
      }
    }
  }
  while (undo > 0) {
    undo--;
    decrement(s->parent);
  }
}

Reducing contention: stack. Many threads push and pop on a single stack. An existing lock-free stack (e.g. Treiber's) gives good performance under low contention, but poor scalability. ('A scalable lock-free stack algorithm', Hendler et al.)

Pairing up operations. A concurrent push and pop can cancel each other out directly, with the push handing its item straight to the pop, without either operation touching the stack itself.

Back-off elimination array. Contention on the stack? Try the elimination array. Don't get eliminated there? Try the stack again. Each attempt publishes an operation record: thread, Push/Pop, ...
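A much-simplified, hedged sketch of the push side of elimination in the slides' pseudocode style (ELIM_SLOTS, spin_briefly and the record layout are illustrative, not taken from the Hendler et al. paper):

typedef struct { int op; int value; } oprec;   // op is PUSH or POP
oprec *elim[ELIM_SLOTS];                       // shared elimination array

bool try_eliminate_push(oprec *rec) {
    int slot = random() % ELIM_SLOTS;
    if (!CAS(&elim[slot], null, rec)) return false;   // slot busy: go back to the stack
    spin_briefly();                                   // give a pop time to collide with us
    if (CAS(&elim[slot], rec, null)) return false;    // nobody came: withdraw, retry the stack
    return true;                                      // a pop claimed our record: eliminated
}

// The pop side scans the array for a published PUSH record, reads rec->value,
// and claims the collision with CAS(&elim[slot], rec, null) instead of popping the stack.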

Explicit memory management

Deletion revisited: Delete(10). The list H -> 10 -> 30 -> T is shown in three stages as the 10 node is logically deleted and then physically unlinked. What can we do with the unlinked node's memory?

De-allocate to the OS? A concurrent Search(20) may still be traversing the unlinked 10 node; returning its memory to the OS could make that access fault.

Re-use as something else? If the memory is immediately reused for an unrelated object (say, holding 100 and 200), a concurrent Search(20) may read garbage through its stale pointer.

Re-use as a list node? Even if the memory is reused as a node in the same list, a concurrent Search(20) can follow stale pointers and see an inconsistent view of the list (the ABA problem again).

Reference counting. Each node carries a reference count. 1. Decide what to access. 2. Increment its reference count. 3. Check that the access is still OK. 4. Defer deallocation until the count reaches 0. (In the diagrams, the counts on H, 10, 30, T rise and fall hand-over-hand as a thread traverses the list.)
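A hedged sketch of the counted traversal step in the slides' style (atomic_add is assumed to return the value it replaced; other helper names are illustrative):

node *acquire_next(node *prev) {
    while (true) {
        node *next = prev->next;              // 1. decide what to access
        if (next == null) return null;
        atomic_add(&next->refcount, 1);       // 2. increment reference count
        if (prev->next == next) return next;  // 3. check access still OK
        release(next);                        //    raced with an unlink: undo and retry
    }
}

void release(node *n) {
    if (atomic_add(&n->refcount, -1) == 1)    // 4. defer deallocation until the count hits 0
        deallocate(n);
}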

Epoch mechanisms. 1. Record the global epoch at the start of each operation (e.g. Thread 1 and Thread 2 both record epoch 1000). 2. Keep per-epoch deferred deallocation lists (e.g. 'deallocate @ 1000'). 3. Increment the global epoch at the end of an operation, or periodically (1000 -> 1001 -> 1002). 4. Free the deferred nodes once every thread has moved past their epoch.
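A hedged sketch of that scheme in the slides' style (NTHREADS and the defer-list helpers are illustrative):

int global_epoch = 1000;
int thread_epoch[NTHREADS];          // 0 = not currently in an operation

void op_start(int tid) { thread_epoch[tid] = global_epoch; }            // 1.
void retire(node *n)   { defer_list_add(global_epoch, n); }             // 2.
void op_end(int tid)   { thread_epoch[tid] = 0; global_epoch++; }       // 3.

void try_free(void) {                                                   // 4.
    int min = global_epoch;
    for (int t = 0; t < NTHREADS; t++)
        if (thread_epoch[t] != 0 && thread_epoch[t] < min)
            min = thread_epoch[t];
    free_defer_lists_before(min);    // every thread has moved past these epochs
}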

The repeat offender problem. A node is in one of three states: Free (ready for allocation); Allocated (linked into a data structure); or Escaping (unlinked, but possibly temporarily in use).

Re-use via ROP. 1. Decide what to access. 2. Set a guard on it. 3. Check that the access is still OK. (In the diagrams, Thread 1's guards move hand-over-hand along H -> 10 -> 30 -> T as it traverses the list.)

Re-use via ROP. 1. Decide what to access. 2. Set guard. 3. Check access still OK. 4. Batch deallocations, and defer freeing an object while guards on it are present. See also: safe memory reclamation and hazard pointers, Maged Michael.
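A hedged sketch of the guarded access pattern in the slides' style (each thread publishes its guards where reclaimers can see them; names are illustrative):

node *guarded_read(node *prev, node **my_guard) {
    while (true) {
        node *next = prev->next;     // 1. decide what to access
        if (next == null) return null;
        *my_guard = next;            // 2. set guard (visible to reclaiming threads)
        if (prev->next == next)      // 3. check access still OK
            return next;
        // prev->next changed: the guard may cover a stale node, so retry
    }
}

// 4. A reclaimer batches unlinked nodes and frees only those not covered by
//    any thread's guard, deferring the rest.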