Making Lockless Synchronization Fast: Performance Implications of Memory Reclamation Tom Hart, University of Toronto Paul E. McKenney, IBM Beaverton Angela Demke Brown, University of Toronto

Outline Motivation Memory Reclamation Schemes Results Conclusions

My Laptop Has Two Cores Multiprocessing is becoming mainstream, so synchronization must be fast. Locks create problems: overhead, serialization bottleneck, deadlock, priority inversion.

Lockless Synchronization Using a shared object without locking. Non-blocking synchronization. Read-copy update. Can drastically improve performance. Can lead to read-reclaim races.

Read-Reclaim Races Thread 1: remove(a); Thread 2: read(a->next); When can Thread 1 call free(a) without interfering with Thread 2?
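
To make the race concrete, here is a minimal C sketch; the node type and function names are illustrative, not taken from the paper:

    #include <stdlib.h>

    struct node { int key; struct node *next; };

    /* Thread 1: unlink a, then free it immediately -- the bug. */
    void remove_and_free(struct node *prev, struct node *a)
    {
        prev->next = a->next;  /* new readers can no longer reach a... */
        free(a);               /* ...but a reader already holding a can still dereference it */
    }

    /* Thread 2: lockless traversal; reading a->next here may touch freed memory. */
    struct node *find(struct node *head, int key)
    {
        for (struct node *cur = head; cur != NULL; cur = cur->next)
            if (cur->key == key)
                return cur;
        return NULL;
    }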

Contribution Much prior work solves read-reclaim races, but... How do these solutions perform? What factors determine performance? Is the performance impact significant? Investigate with a microbenchmark: Vary factors independently.

Outline Motivation Memory Reclamation Schemes Results Conclusions

Memory Reclamation Schemes Mediate read-reclaim races. Many have been proposed: Quiescent-State-Based Reclamation [M&S] Used with Read-Copy Update (RCU) Epoch-Based Reclamation [Fraser] Hazard Pointers [Michael] Lock-Free Reference Counting [Valois, D. et al.]

Assumption! For the purposes of this presentation: A thread accesses the elements of a shared data structure only through a well-defined set of operations. Operations: find() insert() enqueue() dequeue() etc.

Quiescent-State-Based Reclamation Introduce application-dependent quiescent states: points at which the thread holds no references to any element in list l.

    for (i = 0; i < 100; i++)
        if (list_find(l, i))
            break;
    quiescent_state();
    /* Do other work... */

Quiescent-State-Based Reclamation Grace period: any interval in which each thread passes through a quiescent state. [Animation: execution timelines for Threads 1-3 with quiescent states (QS) marked; once every thread has passed through a QS, a grace period has elapsed.] Any element removed before the grace period begins can be safely reclaimed after it ends.
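
A minimal sketch of how a reclaimer can exploit this rule, assuming one counter per reader thread; real QSBR implementations are considerably more refined:

    #include <stdlib.h>
    #include <stdatomic.h>

    #define NTHREADS 3

    static atomic_ulong qs_count[NTHREADS];  /* bumped at each quiescent state */

    void quiescent_state(int tid)
    {
        atomic_fetch_add(&qs_count[tid], 1);
    }

    /* Wait for a grace period: every reader thread must pass through
       at least one quiescent state after this function starts. */
    static void wait_for_grace_period(void)
    {
        unsigned long snap[NTHREADS];
        for (int t = 0; t < NTHREADS; t++)
            snap[t] = atomic_load(&qs_count[t]);
        for (int t = 0; t < NTHREADS; t++)
            while (atomic_load(&qs_count[t]) == snap[t])
                ;  /* spin until thread t reaches a quiescent state */
    }

    /* Called by a dedicated reclaimer thread (not one of the readers)
       after an element has been unlinked from the data structure. */
    void defer_free(void *removed)
    {
        wait_for_grace_period();
        free(removed);
    }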

Epoch-Based Reclamation Similar to the quiescent-state-based scheme, but instead of quiescent_state() it uses lockless_begin() and lockless_end() within the body of each operation. Application-independent.
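
A simplified sketch of what such a scheme can look like, loosely following Fraser-style epochs; the names and the epoch-advance rule below are assumptions, not the paper's exact implementation:

    #include <stdatomic.h>
    #include <stdbool.h>

    #define NTHREADS 4

    static atomic_uint global_epoch;             /* advances 0, 1, 2, ... */
    static atomic_uint local_epoch[NTHREADS];
    static atomic_bool active[NTHREADS];

    void lockless_begin(int tid)
    {
        atomic_store(&active[tid], true);        /* seq_cst store acts as the fence */
        atomic_store(&local_epoch[tid], atomic_load(&global_epoch));
    }

    void lockless_end(int tid)
    {
        atomic_store(&active[tid], false);
    }

    /* The epoch may advance only once every active thread has observed
       the current one; memory retired in epoch e is safe to free after
       two further advances. */
    bool try_advance_epoch(void)
    {
        unsigned e = atomic_load(&global_epoch);
        for (int t = 0; t < NTHREADS; t++)
            if (atomic_load(&active[t]) && atomic_load(&local_epoch[t]) != e)
                return false;
        atomic_fetch_add(&global_epoch, 1);
        return true;
    }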

Hazard Pointers Global hazard pointer array: HP[0], HP[1], HP[2], HP[3]. Thread 1: remove(a); free_later(a); ... if (!find(hp, a)) free(a); Thread 2: HP[2] = a; if (a has been removed) goto RETRY; read(a->next);
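
The reader side of this protocol, sketched with C11 atomics (C11 is an assumption on our part; hp_protect and the fixed-size hp[] array are illustrative):

    #include <stdatomic.h>
    #include <stddef.h>

    #define NHP 4

    struct node { int key; struct node *next; };

    static struct node * _Atomic hp[NHP];  /* global hazard pointer array */

    /* Publish, fence, validate: on success the returned node is safe to
       dereference until hp[slot] is overwritten or cleared. */
    struct node *hp_protect(struct node * _Atomic *src, int slot)
    {
        struct node *p;
    retry:
        p = atomic_load(src);
        if (p == NULL)
            return NULL;
        atomic_store(&hp[slot], p);                 /* publish the hazard pointer */
        atomic_thread_fence(memory_order_seq_cst);  /* make it visible before the re-check */
        if (atomic_load(src) != p)                  /* a has been removed meanwhile? */
            goto retry;
        return p;                                   /* reading p->next is now safe */
    }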

Outline Motivation Memory Reclamation Schemes Results Conclusions

Performance Factors In our paper, we consider: number of CPUs, number of threads, choice of data structure, workload (read-to-update ratio), length of chains of elements, and memory constraints. We look at a few in this presentation.

Sequential Consistency Parallel schedule of instructions is equivalent to a legal serial schedule; i.e., machine instructions are not reordered and memory references are globally ordered.

Sequential Consistency Don't bet on it! Hardware is not sequentially consistent: CPUs reorder instructions for performance. To force sequential consistency, use a memory fence. Fences affect relative performance: they are expensive (orders of magnitude slower than ordinary instructions), and the reclamation schemes need different numbers of fences!
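
For illustration, a minimal C11 message-passing example in which fences enforce the ordering (C11 atomics are an assumption on our part; the paper's benchmark predates them):

    #include <stdatomic.h>

    static _Atomic int data, ready;

    void producer(void)
    {
        atomic_store_explicit(&data, 42, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);  /* keep the two stores ordered */
        atomic_store_explicit(&ready, 1, memory_order_relaxed);
    }

    void consumer(void)
    {
        while (atomic_load_explicit(&ready, memory_order_relaxed) == 0)
            ;                                       /* spin until published */
        atomic_thread_fence(memory_order_seq_cst);  /* keep the two loads ordered */
        int v = atomic_load_explicit(&data, memory_order_relaxed);
        (void)v;                                    /* v is guaranteed to be 42 */
    }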

Data Structures and List Length Can affect the number of fences needed. Linked lists have long chains of elements; well-designed hash tables have short chains.

Example Hazard Pointers One fence per element visited: O(n) fences needed!

    for (cur = list->head; cur != NULL; cur = cur->next) {
        *hazard_ptr = cur;
        memory_fence();
        /* continue... */
    }

Example Epochs O(1) fences needed: one in lockless_begin(), one in lockless_end().

    lockless_begin();  /* calls memory_fence() */
    for (cur = list->head; cur != NULL; cur = cur->next) {
        /* continue... */
    }
    lockless_end();    /* calls memory_fence() */

Example Quiescent States One fence per several operations: the traversal itself needs none.

    for (cur = list->head; cur != NULL; cur = cur->next) {
        /* continue... */
    }
    /* ...more operations, then eventually: quiescent_state(); */

Traversal Length [Chart: average CPU time (ns, 0-5000) vs. number of elements (0-100) for the Quiescent, Hazard, and Epoch schemes. Hazard pointers' O(n) fences add up: their cost grows steeply with chain length.]

Traversal Length - LFRC [Chart: the same experiment with lock-free reference counting (Ref Cnt) added; average CPU time now reaches 20000 ns. O(n) atomic increments are much worse than O(n) fences!] So, we should always use quiescent states or epochs, right?

Threads and Memory Hazard pointers bound unfreed memory (provided the number of hazard pointers is finite): everything with no hazard pointer pointing to it can be freed. Other schemes are more memory-hungry. What happens when there are many threads? More threads than CPUs means preemption.
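
A sketch of the reclaimer side that provides this bound, reusing the slide's free_later() name; the per-thread limbo list is our assumption, and Michael's actual scan is more elaborate:

    #include <stdlib.h>
    #include <stdatomic.h>

    #define NHP 4

    struct node { struct node *next; };

    extern struct node * _Atomic hp[NHP];  /* the array readers publish into */

    static struct node *limbo;             /* this thread's retired nodes */

    static int is_hazardous(struct node *p)
    {
        for (int i = 0; i < NHP; i++)
            if (atomic_load(&hp[i]) == p)
                return 1;
        return 0;
    }

    /* Retire a removed node, then free every retired node no reader is
       protecting.  After the scan at most NHP nodes remain unfreed,
       which is what bounds memory usage. */
    void free_later(struct node *a)
    {
        a->next = limbo;                   /* reuse next to link retired nodes */
        limbo = a;
        for (struct node **pp = &limbo; *pp; ) {
            struct node *p = *pp;
            if (is_hazardous(p))
                pp = &p->next;
            else {
                *pp = p->next;
                free(p);
            }
        }
    }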

Preemption [Chart: average execution time (ns, up to 105000) vs. number of threads (0-35) for the Quiescent, Hazard, and Epoch schemes. Once threads outnumber CPUs and get preempted, grace periods become infrequent, so performance suffers!]

Preemption With Yielding [Chart: the same experiment when threads yield on memory allocation failure; grace periods become more frequent, and execution times fall back to the 350-900 ns range.]

Outline Motivation Memory Reclamation Schemes Results Conclusions

Summary of Results Schemes have very different overheads; the difference can make lockless synchronization faster or slower than locking. No scheme is always best. Quiescent-state-based reclamation has the lowest best-case overhead: no per-operation fences. Hazard pointers are good when there is preemption and many updates.

Significance Understanding performance factors lets us: Choose the right scheme for a program. Design new, faster schemes.

Future Work Macrobenchmark: use lockless synchronization with SPLASH-2. Quiescent-State-Based Reclamation with real-time workloads.

Questions? Tom Hart http://www.cs.toronto.edu/~tomhart Angela Demke Brown http://www.cs.toronto.edu/~demke Paul McKenney http://www.rdrop.com/users/paulmck/rcu