Concurrent Data Structures Concurrent Algorithms 2016

Similar documents
Kernel Scalability. Adam Belay

RCU in the Linux Kernel: One Decade Later

Linked Lists: The Role of Locking. Erez Petrank Technion

Read-Copy Update (RCU) Don Porter CSE 506

Linked Lists: Locking, Lock-Free, and Beyond. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit

Introduction to RCU Concepts

Read-Copy Update in a Garbage Collected Environment. By Harshal Sheth, Aashish Welling, and Nihar Sheth

Read-Copy Update (RCU) Don Porter CSE 506

Hazard Pointers. Number of threads unbounded time to check hazard pointers also unbounded! difficult dynamic bookkeeping! thread B - hp1 - hp2

LOCKLESS ALGORITHMS. Lockless Algorithms. CAS based algorithms stack order linked list

15 418/618 Project Final Report Concurrent Lock free BST

Lock-Free and Practical Doubly Linked List-Based Deques using Single-Word Compare-And-Swap

Linked Lists: Locking, Lock- Free, and Beyond. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit

Exploring mtcp based Single-Threaded and Multi-Threaded Web Server Design

Lecture 10: Multi-Object Synchronization

Technical Report. Designing ASCY-compliant Concurrent Search Data Structures

Abstraction, Reality Checks, and RCU

What Is RCU? Indian Institute of Science, Bangalore. Paul E. McKenney, IBM Distinguished Engineer, Linux Technology Center 3 June 2013

PERFORMANCE ANALYSIS AND OPTIMIZATION OF SKIP LISTS FOR MODERN MULTI-CORE ARCHITECTURES

What Is RCU? Guest Lecture for University of Cambridge

A Skiplist-based Concurrent Priority Queue with Minimal Memory Contention

MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing

Advanced Topic: Efficient Synchronization

RCU and C++ Paul E. McKenney, IBM Distinguished Engineer, Linux Technology Center Member, IBM Academy of Technology CPPCON, September 23, 2016

High Performance Transactions in Deuteronomy

Scalable Concurrent Hash Tables via Relativistic Programming

Lecture 21: Transactional Memory. Topics: consistency model recap, introduction to transactional memory

Lecture 9: Multiprocessor OSs & Synchronization. CSC 469H1F Fall 2006 Angela Demke Brown

A Comparison of Relativistic and Reader-Writer Locking Approaches to Shared Data Access

Proposed Wording for Concurrent Data Structures: Hazard Pointer and Read Copy Update (RCU)

Linux multi-core scalability

Cache-Aware Lock-Free Queues for Multiple Producers/Consumers and Weak Memory Consistency

The Art and Science of Memory Allocation

Lecture 10: Avoiding Locks

Concurrent programming: From theory to practice. Concurrent Algorithms 2015 Vasileios Trigonakis

NON-BLOCKING DATA STRUCTURES AND TRANSACTIONAL MEMORY. Tim Harris, 31 October 2012

Scalable Locking. Adam Belay

Introduction to RCU Concepts

Cost of Concurrency in Hybrid Transactional Memory. Trevor Brown (University of Toronto) Srivatsan Ravi (Purdue University)

Overview. This Lecture. Interrupts and exceptions Source: ULK ch 4, ELDD ch1, ch2 & ch4. COSC440 Lecture 3: Interrupts 1

Design of Concurrent and Distributed Data Structures

ffwd: delegation is (much) faster than you think Sepideh Roghanchi, Jakob Eriksson, Nilanjana Basu

Tom Hart, University of Toronto Paul E. McKenney, IBM Beaverton Angela Demke Brown, University of Toronto

Locking Granularity. CS 475, Spring 2019 Concurrent & Distributed Systems. With material from Herlihy & Shavit, Art of Multiprocessor Programming

Fine-grained synchronization & lock-free programming

FOEDUS: OLTP Engine for a Thousand Cores and NVRAM

CMSC 330: Organization of Programming Languages

NON-BLOCKING DATA STRUCTURES AND TRANSACTIONAL MEMORY. Tim Harris, 17 November 2017

Kernel Range Reader/Writer Locking

The Google File System

Rsyslog: going up from 40K messages per second to 250K. Rainer Gerhards

THREADS & CONCURRENCY

Agenda. Designing Transactional Memory Systems. Why not obstruction-free? Why lock-based?

Ext3/4 file systems. Don Porter CSE 506

Kernel Synchronization I. Changwoo Min

Concurrent Preliminaries

Heckaton. SQL Server's Memory Optimized OLTP Engine

An Analysis of Linux Scalability to Many Cores

Bw-Tree. Josef Schmeißer. January 9, Josef Schmeißer Bw-Tree January 9, / 25

P0461R0: Proposed RCU C++ API

Whatever can go wrong will go wrong. attributed to Edward A. Murphy. Murphy was an optimist. authors of lock-free programs LOCK FREE KERNEL

Linked Lists: Locking, Lock- Free, and Beyond. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit

davidklee.net gplus.to/kleegeek linked.com/a/davidaklee

Fine-grained synchronization & lock-free data structures

Distributed Computing Group

arxiv: v1 [cs.dc] 4 Dec 2017

This project must be done in groups of 2 3 people. Your group must partner with one other group (or two if we have add odd number of groups).

Non-blocking Array-based Algorithms for Stacks and Queues. Niloufar Shafiei

Locking: A necessary evil? Lecture 10: Avoiding Locks. Priority Inversion. Deadlock

Topics. File Buffer Cache for Performance. What to Cache? COS 318: Operating Systems. File Performance and Reliability

Programming Paradigms for Concurrency Lecture 6 Synchronization of Concurrent Objects

Marwan Burelle. Parallel and Concurrent Programming

CMSC 330: Organization of Programming Languages

But What About Updates?

CSE 120 Principles of Operating Systems

A new Mono GC. Paolo Molaro October 25, 2006

Log-Free Concurrent Data Structures

Advance Operating Systems (CS202) Locks Discussion

Distributed Systems. Lec 9: Distributed File Systems NFS, AFS. Slide acks: Dave Andersen

Experience with Model Checking Linearizability

HydraVM: Mohamed M. Saad Mohamed Mohamedin, and Binoy Ravindran. Hot Topics in Parallelism (HotPar '12), Berkeley, CA

Fast and Lock-Free Concurrent Priority Queues for Multi-Thread Systems

Order Is A Lie. Are you sure you know how your code runs?

RCU. ò Walk through two system calls in some detail. ò Open and read. ò Too much code to cover all FS system calls. ò 3 Cases for a dentry:

VFS, Continued. Don Porter CSE 506

A Scalable Ordering Primitive for Multicore Machines. Sanidhya Kashyap Changwoo Min Kangnyeon Kim Taesoo Kim

Concurrent & Distributed Systems Supervision Exercises

Concurrent Skip Lists. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit

Synchronization for Concurrent Tasks

CMSC 330: Organization of Programming Languages. Memory Management and Garbage Collection

How to hold onto things in a multiprocessor world

Hierarchical PLABs, CLABs, TLABs in Hotspot

What is RCU? Overview

Last Lecture: Spin-Locks

Final Exam. 12 December 2018, 120 minutes, 26 questions, 100 points

Hazard Pointers. Safe Resource Reclamation for Optimistic Concurrency

The RCU-Reader Preemption Problem in VMs

What is RCU? Overview

The benefits and costs of writing a POSIX kernel in a high-level language

CS4021/4521 INTRODUCTION

Transcription:

Concurrent Data Structures Concurrent Algorithms 2016 Tudor David (based on slides by Vasileios Trigonakis) Tudor David 11.2016 1

Data Structures (DSs) Constructs for efficiently storing and retrieving data Different types: lists, hash tables, trees, queues, Accessed through the DS interface Depends on the DS type, but always includes Store an element Retrieve an element Element Set: just one value Map: key/value pair CA Tudor David 11.2016 2

Concurrent Data Structures (CDSs) Concurrently accessed by multiple threads Through the CDS interface linearizable operations! Really important on multi-cores Used in most software systems ASCY Tudor David 11.2016 3

What do we care about in practice? Progress of individual operations - sometimes More often: Number of operations per second (throughput) The evolution of throughput as we increase the number of threads (scalability) Throughput (Mop/s) CA 15 10 5 0 1 10 20 30 40 # Threads Tudor David 11.2016 4

DS Example: Linked List delete(6) 1 2 3 5 6 8 insert(4) 4 A sequence of elements (nodes) Interface search (aka contains) insert remove (aka delete) CA struct node { value_t value; struct node* next; }; Tudor David 11.2016 5

Search Data Structures Interface 1. search 2. insert 3. remove Semantics 1. read-only 2. read-only 3. read-only 4. read-write updates search(k) update(k) parse(k) k modify(k) CA Tudor David 11.2016 6

Optimistic vs. Pessimistic Concurrency 12 20-core Xeon 1024 elements Throughput (Mop/s) 10 8 6 4 2 0 traverse traverse pessimistic 1 10 20 30 40 # Cores "bad" linked list "good" linked list (Lesson 1 ) Optimistic concurrency is the only way to get scalability Tudor David 11.2016 7

The Two Problems in Optimistic Concurrency Concurrency Control How threads synchronize their writes to the shared memory (e.g., nodes) Locks CAS Transactional memory Memory Reclamation How and when threads free and reuse the shared memory (e.g., nodes) Garbage collectors Hazard pointers RCU Quiescent states Tudor David 11.2016 8

Tools for Optimistic Concurrency Control (OCC) RCU: slow in the presence of updates (also a memory reclamation scheme) STM: slow in general HTM: not ubiquitous, not very fast (yet) Wait-free algorithms: slow in general (Optimistic) Lock-free algorithms: Optimistic lock-based algorithms: We either need a lock-free or an optimistic lock-based algorithm Tudor David 11.2016 9

Parenthesis: Target platform 2-socket Intel Xeon E5-2680 v2 Ivy Bridge 20 cores @ 2.8 GHz, 40 hyper-threads 25 MB LLC (per socket) 256GB RAM c c c c c c c c c c c c c c c c c c c c CA Tudor David 11.2016 10

Concurrent Linked Lists 5% Updates 12 Blocking Lock-free Wait-free 1024 elements 5% updates Throughput (Mops/s) 10 8 6 4 2 0 1 5 9 13 17 21 25 29 33 37 Number of threads Wait-free algorithm is slow Tudor David 11.2016 11

Optimistic Concurrency in Data Structures operation Pattern optimistic prepare (non-synchronized) optimistic prepare validate (synchronized) failed perform (synchronized) perform detect conflicting concurrent operations Example linked list insert find insertion spot insert validate Validation plays a key role in concurrent data structures Tudor David 11.2016 12

Validation in Concurrent Data Structures Lock-free: atomic operations optimistic prepare marking pointers, flags, helping, Lock-based: lock validate validate & perform (atomic ops) failed optimistic prepare lock validate perform unlock unlock failed flags, pointer reversal, parsing twice, Validation is what differentiates algorithms Tudor David 11.2016 13

Let s design two concurrent linked lists: A lock-free and a lock-based Tudor David 11.2016 14

Lock-free Sorted Linked List: Naïve Search find spot return Insert find modification spot CAS Delete find modification spot CAS Is this a correct (linearizable) linked list? Tudor David 11.2016 15

Lock-free Sorted Linked List: Naïve Incorrect P0: Insert(x) P1: Delete(y) P1: find modification spot P1:CAS P0: find modification spot P0:CAS y x Lost update! What is the problem? Insert involves one existing node; Delete involves two existing nodes How can we fix the problem? Tudor David 11.2016 16

Lock-free Sorted Linked List: Fix Idea! To delete a node, make it unusable first Mark it for deletion so that 1. You fail marking if someone changes next pointer; 2. An insertion fails if the predecessor node is marked. In other words: delete in two steps 1. Mark for deletion; and then 2. Physical deletion 2. CAS(remove) Delete(y) find modification spot 1. CAS(mark) y Tudor David 11.2016 17

1. Failing Deletion (Marking) P1: find modification spot P1:CAS(mark) false P0: Insert(x) P1: Delete(y) P0: find modification spot P0:CAS y x Upon failure restart the operation Restarting is part of all state-of-the-art-data structures Tudor David 11.2016 18

1. Failing Insertion due to Marked Node P1:CAS(remove) P1: find modification spot P1:CAS(mark) P0: Insert(x) P1: Delete(y) P0: find modification spot P0:CAS false y Upon failure restart the operation Restarting is part of all state-of-the-art-data structures How can we implement marking? Tudor David 11.2016 19

Implementing Marking (C Style) Pointers in 64 bit architectures Word aligned - 8 bit aligned! next pointer 0 0 0 boolean mark(node_t* n) uintptr_t unmarked = n->next & ~0x1L; uintptr_t marked = n->next 0x1L; return CAS(&n->next, unmarked, marked) == unmarked; Tudor David 11.2016 20

Lock-free List: Putting Everything Together Traversal: traverse (requires unmarking nodes) Search: traverse Insert: traverse CAS to insert Delete: traverse CAS to mark CAS to remove Garbage (marked) nodes Cleanup while traversing (helping in this course s terms) What happers if this CAS fails?? A pragmatic implementation of lock-free linked lists Tudor David 11.2016 21

What is not Perfect with the Lock-free List? 1. Garbage nodes Increase path length; and Increase complexity if (is_marked_node(n)) 2. Unmarking every single pointer Increase complexity curr = get_unmark_ref(curr->next) Can we simplify the design with locks? Tudor David 11.2016 22

Lock-based Sorted Linked List: Naïve Search find spot return Insert find modification spot lock Delete find modification spot lock(target) lock(predecessor) Is this a correct (linearizable) linked list? Tudor David 11.2016 23

Lock-based List: Validate After Locking Search find spot return Insert validate!pred->marked && pred->next did not change find modification spot lock Delete find modification spot mark(curr) lock(curr) lock(predecessor)!pred->marked &&!curr->marked && pred->next did not change Tudor David 11.2016 24

Concurrent Linked Lists 0% updates Throughput (Mop/s) 50 45 40 35 30 25 20 15 10 5 0 Just because the lockbased is not unmarking! 1 10 20 30 40 lock-free # Cores lock-based (Lesson 2 ) Sequential complexity matters Simplicity 1024 elements 0% updates Tudor David 11.2016 25

Optimistic Concurrency Control: Summary Lock-free: atomic operations optimistic prepare marking pointers, flags, helping, Lock-based: lock validate validate & perform (atomic ops) failed optimistic prepare lock validate perform unlock unlock failed flags, pointer reversal, parsing twice, Tudor David 11.2016 26

Word of caution: lock-based algorithms Search data structures Queues, stacks, counters,... 3 Queue, 40 threads Throughput (Mop/s) 2.5 2 1.5 1 0.5 0 Lock-based Non-blocking CA Tudor David 11.2016 27

Memory Reclamation: OCC s Side Effect Delete a node free and reuse this memory Subset of the garbage collection problem Who is accessing that memory? Can we just directly do free(node)? P0: pointer on x P0: search P1: delete(x) x P1: free(x) We cannot directly free the memory! Need memory reclamation Tudor David 11.2016 28

Memory Reclamation Schemes 1. Reference counting Count how many references exist on a node 2. Hazard pointers Tell to others what exactly you are reading 3. Quiescent states Wait until it is certain than no one holds references 4. Read-Copy Update (RCU) Quiescent states The extreme approach Tudor David 11.2016 29

1. Reference Counting Pointer + Counter Dereference: rc_dereference(rc_pointer* rcp) atomic_increment(&rcp->counter); return *pointer; Release : rc_release(rc_pointer* rcp) atomic_decrement(&rcp->counter); Free: iff counter = 0 rc_pointer counter pointer (Lesson 3 ) Readers cannot write on the shared nodes Bad bad bad idea: Readers write on shared nodes! Tudor David 11.2016 30

2. Hazard pointers (1/2) Reference counter property of the node Hazard pointer property of the thread A Multi-Reader Single-Writer (MRSW) register Protect: hp_protect(node* n) hazard_pointer* hp = hp_get(n); hp->address = n; Release: hp_release(hazard_pointer* hp) hp->address = NULL; hazard_pointer address Depends on the data structure type Tudor David 11.2016 31

2. Hazard pointers (2/2) Free memory x 1. Collect all hazard pointers 2. Check if x is accessed by any thread 1. If yes, buffer the free for later 2. If not, free the memory Buffering the free is implementation specific hazard_pointer address + lock-free - not scalable O(data structure size) hazard pointers hp_protect Tudor David 11.2016 32

3. Quiescent States Keep the memory until it is certain it is not accessed Can be implemented in various ways Example implementation search / insert / delete qs_unsafe(); I m accessing shared data qs_safe(); I m not accessing shared data return The data written in qs_[un]safe must be local-mostly Tudor David 11.2016 33

3. Quiescent States: qs_[un]safe Implementation List of thread-local (mostly) counters (id = 0) qs_state (id = x) qs_state (id = y) qs_state qs_state (initialized to 0) even : in safe mode (not accessing shared data) odd : in unsafe mode qs_safe / qs_unsafe qs_state++; How do we free memory? Tudor David 11.2016 34

3. Quiescent States: Freeing memory List of thread-local (mostly) counters (id = 0) qs_state (id = x) qs_state (id = y) qs_state Upon qs_free: Timestamp memory (vector_ts) Can do this for batches of frees Safe to reuse the memory vector_ts now >> vector_ts mem for t in thread_ids if (vts_mem[t] is odd && vts_now[t] = vts_mem[t]) return false; return true; How do the schemes we have seen perform? Tudor David 11.2016 35

Hazard Pointers vs. Quiescent States 1024 elements 0% updates 12 Throughput (Mop/s) 10 8 6 4 2 0 0 5 10 15 20 25 30 35 #Threads None QSBR HP Quiescent-state reclamation is as fast as it gets ASCY Tudor David 11.2016 36

4. Read-Copy Update (RCU) Quiescent states at their extreme Deletions wait all readers to reach a safe state Introduced in the Linux kernel in ~2002 More than 10000 uses in the kernel! (Example) Interface rcu_read_lock (= qs_unsafe) rcu_read_unlock (= qs_safe) synchronize_rcu wait all readers Tudor David 11.2016 37

4. Using RCU Search / Traverse rcu_read_lock() rcu_read_unlock() Delete physical deletion of x synchronize_rcu() free(x) + simple + read-only workloads - bad for writes Tudor David 11.2016 38

Memory Reclamation: Summary How and when to reuse freed memory Many techniques, no silver bullet 1. Reference counting 2. Hazard pointers 3. Quiescent states 4. Read-Copy Update (RCU) Tudor David 11.2016 39

Summary Concurrent data structures are very important Optimistic concurrency necessary for scalability Only recently a lot of active work for CDSs Memory reclamation is Inherent to optimistic concurrency; A difficult problem; A potential performance/scalability bottleneck Tudor David 11.2016 40