How to hold onto things in a multiprocessor world


How to hold onto things in a multiprocessor world
Taylor "Riastradh" Campbell
campbell@mumble.net / riastradh@netbsd.org
AsiaBSDcon 2017, Tokyo, Japan, March 12, 2017

Slides 'n' code

Full of code! Please browse at your own pace.

Slides: https://tinyurl.com/ho2cdhq
  (https://www.netbsd.org/gallery/presentations/riastradh/asiabsdcon2017/mp-refs-slides.pdf)
Paper: https://tinyurl.com/h9kqccf
  (https://www.netbsd.org/gallery/presentations/riastradh/asiabsdcon2017/mp-refs-paper.pdf)

Resources

Network routes:
- May be tens of thousands in the system.
- Acquired and released by the packet-processing path.
- The same route may be used simultaneously by many flows.
- Large legacy code base to update for parallelism; the update must be incremental!

Device drivers:
- Only a few dozen in the system.
- An even wider range of legacy code to safely parallelize.

File system objects ("vnodes"), user credential sets, ...

The life and times of a resource

Birth:
- Create: allocate memory, initialize it.
- Publish: reveal it to all threads.

Life:
- Acquire: a thread begins to use the resource.
- Release: a thread is done using the resource.
- ... rinse, repeat, concurrently by many threads at a time.

Death:
- Delete: prevent threads from acquiring it.
- Destroy: free the memory... after all threads have released it.

Problems for an implementer

If you are building an API for some class of resources...
- You MUST ensure nobody frees memory that is still in use!
- You MUST satisfy other API contracts, e.g. mutex rules.
- You MAY want to allow concurrent users of resources.
- You MAY care about performance.

Serialize all resources: layout

    struct foo {
            int key;
            ...;
            struct foo *next;
    };

    struct {
            kmutex_t lock;
            struct foo *first;
    } footab;
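The slides use alloc_foo and free_foo throughout without defining them. A minimal sketch of what they might look like with NetBSD's kmem(9); these helpers belong only to this running example, not to any public API:

    #include <sys/kmem.h>

    /* Hypothetical allocator for the running example. */
    static struct foo *
    alloc_foo(int key)
    {
            struct foo *f = kmem_zalloc(sizeof(*f), KM_SLEEP);

            f->key = key;
            return f;
    }

    /* Hypothetical deallocator: caller guarantees no users remain. */
    static void
    free_foo(struct foo *f)
    {
            kmem_free(f, sizeof(*f));
    }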

Serialize all resources: create/publish

    struct foo *f = alloc_foo(key);

    mutex_enter(&footab.lock);
    f->next = footab.first;
    footab.first = f;
    mutex_exit(&footab.lock);

Serialize all resources: lookup/use

    struct foo *f;

    mutex_enter(&footab.lock);
    for (f = footab.first; f != NULL; f = f->next) {
            if (f->key == key) {
                    ...use f...
                    break;
            }
    }
    mutex_exit(&footab.lock);

Serialize all resources: delete/destroy

    struct foo **fp, *f;

    mutex_enter(&footab.lock);
    for (fp = &footab.first; (f = *fp) != NULL; fp = &f->next) {
            if (f->key == key) {
                    *fp = f->next;
                    break;
            }
    }
    mutex_exit(&footab.lock);

    if (f != NULL)
            free_foo(f);

Serialize all resources: slow and broken!

No parallelism. Not allowed to wait for I/O or do long computation while holding a mutex. (This is a NetBSD rule to put bounds on progress for mutex_enter, which is not interruptible.)

Mutex and reference counts: layout

(a) Add a reference count to each object.
(b) Add a condition variable for notifying when f->refcnt == 0.

    struct foo {
            int key;
            ...;
            unsigned refcnt;        /* (a) */
            struct foo *next;
    };

    struct {
            kmutex_t lock;
            kcondvar_t cv;          /* (b) */
            struct foo *first;
    } footab;

Mutex and reference counts: layout

[Diagram: footab, with its lock unowned and no waiters, points to a list of two struct foo nodes: key 1 with refcnt 3, then key 2 with refcnt 0, then NULL.]

Mutex and reference counts: create/publish

    struct foo *f = alloc_foo(key);

    f->refcnt = 0;
    mutex_enter(&footab.lock);
    f->next = footab.first;
    footab.first = f;
    mutex_exit(&footab.lock);

Mutex and reference counts: lookup/acquire

    struct foo *f;

    mutex_enter(&footab.lock);
    for (f = footab.first; f != NULL; f = f->next) {
            if (f->key == key) {
                    f->refcnt++;
                    break;
            }
    }
    mutex_exit(&footab.lock);

    if (f != NULL)
            ...use f...

Mutex and reference counts: release

    mutex_enter(&footab.lock);
    if (--f->refcnt == 0)
            cv_broadcast(&footab.cv);
    mutex_exit(&footab.lock);

Mutex and reference counts: delete/destroy

    struct foo **fp, *f;

    mutex_enter(&footab.lock);
    for (fp = &footab.first; (f = *fp) != NULL; fp = &f->next) {
            if (f->key == key) {
                    *fp = f->next;
                    while (f->refcnt != 0)
                            cv_wait(&footab.cv, &footab.lock);
                    break;
            }
    }
    mutex_exit(&footab.lock);

    if (f != NULL)
            free_foo(f);

Mutex lock and reference counts: summary

If this works for you, stop here! Easy to prove correct. Just go to another talk.

... but it does have problems:
- Only one lookup at a time.
- Contention over the lock for every object.
- Hence not scalable to many CPUs.

Hashed locks

Randomly partition resources into buckets. If the distribution of resource use is uniform, lower contention for lookups!

Hashed locks: layout

    struct {
            struct foobucket {
                    kmutex_t lock;
                    kcondvar_t cv;
                    struct foo *first;
            } b;
            char pad[roundup(sizeof(struct foobucket),
                CACHELINE_SIZE)];
    } footab[nbucket];

Hashed locks: acquire

    struct foo *f;
    size_t h = hash(key);

    mutex_enter(&footab[h].b.lock);
    for (f = footab[h].b.first; f != NULL; f = f->next) {
            if (f->key == key) {
                    f->refcnt++;
                    break;
            }
    }
    mutex_exit(&footab[h].b.lock);
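The release path is not shown on the slides; it mirrors the single-lock release, only under the bucket's own lock. A minimal sketch, reusing the refcnt and cv fields from before; the hash function here is a hypothetical stand-in for whatever hash(key) the lookup uses:

    /* Hypothetical hash: anything mapping keys uniformly to buckets. */
    static size_t
    hash(int key)
    {
            return (size_t)key % nbucket;
    }

    /* Release: drop a reference under the bucket lock only. */
    size_t h = hash(key);

    mutex_enter(&footab[h].b.lock);
    if (--f->refcnt == 0)
            cv_broadcast(&footab[h].b.cv);
    mutex_exit(&footab[h].b.lock);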

Hashed locks

Randomly partition resources into buckets. If the distribution of resource use is uniform, lower contention for lookups!

What if many threads want to look up the same object?
- Still only one lookup at a time for that object.
- Still contention for releasing resources after use.

Mutex lock and atomic reference counts

Use atomic operations to manage most uses of a resource. No need to acquire the global table lock to release a resource if it's not the last reference.

Mutex lock and atomic reference counts: acquire

    struct foo *f;

    mutex_enter(&footab.lock);
    for (f = footab.first; f != NULL; f = f->next) {
            if (f->key == key) {
                    atomic_inc_uint(&f->refcnt);
                    break;
            }
    }
    mutex_exit(&footab.lock);

    if (f != NULL)
            ...use f...

Mutex lock and atomic reference counts: release

    unsigned old, new;

    do {
            old = f->refcnt;
            if (old == 1) {
                    mutex_enter(&footab.lock);
                    if (f->refcnt == 1) {
                            f->refcnt = 0;
                            cv_broadcast(&footab.cv);
                    } else {
                            atomic_dec_uint(&f->refcnt);
                    }
                    mutex_exit(&footab.lock);
                    break;
            }
            new = old - 1;
    } while (atomic_cas_uint(&f->refcnt, old, new) != old);

Atomics: still not scalable

We avoid contention over the global table lock to release. But if many threads want to use the same foo... Atomic operations are not a magic bullet! A single atomic is slightly faster and uses less memory than a mutex enter/exit pair. But contended atomics are just as bad as contended locks!

Atomics: interprocessor synchronization

[Diagram: eight CPUs, each with its own cache, connected through a hierarchy of interconnects to the memory system; every atomic operation must be coordinated across this hierarchy.]

Diagram copyright (c) 2005-2010 Paul E. McKenney. From Paul E. McKenney, "Is Parallel Programming Hard, And, If So, What Can You Do About It?", 2011. https://www.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.2011.01.02a.pdf

Reader/writer locks for lookup

Instead of a mutex lock for the table, use an rwlock: at any time, either one writer or many readers. This allows concurrent lookups, not just concurrent resource use. If lookups are slow, great! If lookups are fast, the reader count is just another reference count managed with atomics: contention!
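A minimal sketch of the reader side with NetBSD's rwlock(9), assuming footab's kmutex_t is replaced by a krwlock_t field; the field name rwlock is an assumption of this sketch, not part of the talk's code:

    /* footab.rwlock is a krwlock_t, initialized once with rw_init(). */
    struct foo *f;

    rw_enter(&footab.rwlock, RW_READER);
    for (f = footab.first; f != NULL; f = f->next) {
            if (f->key == key) {
                    /*
                     * Readers now run concurrently, so the
                     * reference count itself must be atomic.
                     */
                    atomic_inc_uint(&f->refcnt);
                    break;
            }
    }
    rw_exit(&footab.rwlock);

Writers (create/publish, delete) would take the same lock with RW_WRITER.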

Basic problem: to read, we must write!

All approaches so far require readers to coordinate by writing:
- Acquire the table lock: write who owns it now.
- Acquire a read lock: write how many readers there are.
- Acquire a reference count: write how many users there are.

Can we avoid writing in order to read? Are there more reads than creations or destructions? Can we make reads cheaper, perhaps at the cost of making creation or destruction more expensive?

No-contention references in NetBSD

Passive serialization: like read-copy update, RCU, in Linux... but the US patent expired sooner!
Passive references: similar to hazard pointers; similar to OpenBSD's SRP.
Local counts: per-CPU reference counts; similar to sleepable RCU, SRCU, but simpler.

Coordinate publish and read

Linked-list insert and read can coordinate with no atomics... as long as they write and read in the correct order. One writer, any number of readers! The same principle applies to hash tables ("hashed lists") and radix trees.

Publish

Write the data first. Then write the pointer to it.

    struct foo *f = alloc_foo(key);

    mutex_enter(&footab.lock);
    f->next = footab.first;
    membar_producer();
    footab.first = f;
    mutex_exit(&footab.lock);

Publish 1: after writing data
[Diagram: the new node's data and next pointer are initialized, but the list head does not yet reach it.]

Publish 2: after write barrier
[Diagram: membar_producer ensures the node's contents are visible before the head pointer is updated.]

Publish 3: after writing pointer
[Diagram: the head now points to the fully initialized new node; readers see either the old list or the complete new node.]

Read

Read the pointer first. Then read the data from it. ... Yes, in principle stale data could be cached. Fortunately, membar_datadep_consumer is a no-op on all CPUs other than DEC Alpha.

    for (f = footab.first; f != NULL; f = f->next) {
            membar_datadep_consumer();
            if (f->key == key) {
                    use(f);
                    break;
            }
    }

Read 1: after reading pointer
[Diagram: the reader holds the pointer, but without a barrier the node's contents may still be stale garbage in its cache.]

Read 2: after read barrier
[Diagram: membar_datadep_consumer ensures the contents read afterward are at least as new as the pointer.]

Read 3: after reading data
[Diagram: the reader sees consistent data and next pointers all the way down the list.]

Delete

Deletion is even easier!

    *fp = f->next;

... but there is a catch.

Delete 1: before delete
[Diagram: the list is intact; readers may be traversing any node, including the victim.]

Delete 2: after delete
[Diagram: the victim is unlinked from the list, but readers that already hold a pointer to it may still be using it.]

The catch

All well and good for publish and use! All well and good for delete! But when can we destroy (free memory, etc.)? There is no signal for when all users are done with a resource. How do we signal release without contention?

Passive serialization: pserialize(9)

Lookup/use:
1. Acquire: block interrupts on the CPU.
2. Look up the resource.
3. Use it.
4. Release: restore and process queued interrupts on the CPU.
5. (Cannot use the resource any more after this point!)

Delete/destroy:
1. Remove the resource from the list: *fp = f->next.
2. Send an interprocessor interrupt (IPI) to all CPUs.
3. Wait for it to return on all CPUs.
4. All users that could have seen this resource have now exited.

Passive serialization

[Timeline diagram: CPU B unlinks the resource and sends an IPI to all CPUs. CPU A, inside a read section, defers the IPI and answers when the section ends; CPU C answers between read sections. Once every CPU has answered, it is safe to destroy.]

Passive serialization: lookup/use

1. Acquire: block interrupts with pserialize_read_enter.
2. Lookup: read the pointer.
3. Memory barrier!
4. Use: read the data.
5. Release: restore and process queued interrupts with pserialize_read_exit.

    s = pserialize_read_enter();
    for (f = footab.first; f != NULL; f = f->next) {
            membar_datadep_consumer();
            if (f->key == key) {
                    use(f);
                    break;
            }
    }
    pserialize_read_exit(s);

Passive serialization: delete/destroy

(a) Delete from the list to prevent new users.
(b) Send an IPI to wait for existing users to drain.
(c) Free the memory.

    struct foo **fp, *f;

    mutex_enter(&footab.lock);
    for (fp = &footab.first; (f = *fp) != NULL; fp = &f->next) {
            if (f->key == key) {
                    /* (a) Prevent new users. */
                    *fp = f->next;
                    /* (b) Wait for old users. */
                    pserialize_perform(footab.psz);
                    break;
            }
    }
    mutex_exit(&footab.lock);

    if (f != NULL)
            /* (c) Destroy. */
            free_foo(f);
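The footab.psz handle used by pserialize_perform needs one-time setup with the pserialize(9) constructor. A minimal sketch; the field name psz is just this running example's convention:

    /* One-time setup, e.g. in the subsystem's init routine. */
    footab.psz = pserialize_create();

    /* And at teardown, once no readers or writers remain. */
    pserialize_destroy(footab.psz);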

Passive serialization: lists

The sys/queue.h macros do not have the correct memory barriers. So we provide PSLIST(9), like LIST in sys/queue.h: a linked list with constant-time insert and delete... and the correct memory barriers for insert and read.

Passive serialization: PSLIST(9)

    struct foo {
            ...
            struct pslist_entry f_entry;
            ...
    };

    struct {
            ...
            struct pslist_head head;
            ...
    } footab;

    mutex_enter(&footab.lock);
    PSLIST_WRITER_INSERT_HEAD(&footab.head, f, f_entry);
    mutex_exit(&footab.lock);

    s = pserialize_read_enter();
    PSLIST_READER_FOREACH(f, &footab.head, struct foo, f_entry) {
            if (f->key == key) {
                    ...use f...;
                    break;
            }
    }
    pserialize_read_exit(s);

Passive serialization: pros

Zero contention! Serially fast readers! We use software interrupts, so they are cheap to block and restore: no hardware interrupt controller reconfiguration! Constant memory overhead: no memory per resource or per CPU!

Passive serialization: cons

Interrupts must be blocked during a read, so a thread cannot sleep during a read. What if we want to pserialize the network stack? The code was written in the 1980s, before parallelism mattered... and it does memory allocation in the packet path (e.g., to prepend a header in a tunnel)... and re-engineering the whole network stack at once is hard! Can we do it incrementally, with different tradeoffs?

Passive references: psref(9)

Record a per-CPU list of all resources in use.
- Lookup: use pserialize for the table lookup.
- To acquire a resource: put it on the list. Can now do anything on the CPU: sleep, eat, watch television...
- To release a resource: remove it from the list.
- To wait for users: send an IPI to check for the resource on each CPU's list.

Note: reader threads must not switch CPUs!

Passive references: create/publish

    struct foo {
            ...
            struct psref_target target;
            ...
    };

    struct {
            ...
            struct psref_class *psr;
            ...
    } footab;

    struct foo *f = alloc_foo(key);

    psref_target_init(&f->target, footab.psr);
    mutex_enter(&footab.lock);
    PSLIST_WRITER_INSERT_HEAD(&footab.head, f, f_entry);
    mutex_exit(&footab.lock);
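Like footab.psz, the footab.psr class needs one-time setup with psref(9)'s constructor. A minimal sketch; the "foo" name and the IPL_SOFTNET interrupt level are illustrative assumptions:

    /* One-time setup: a psref class shared by all foo targets. */
    footab.psr = psref_class_create("foo", IPL_SOFTNET);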

Passive references: lookup/acquire

psref_acquire inserts an entry into a CPU-local list: no atomics!

    struct psref fref;
    int bound, s;

    /* Bind to the current CPU and look up. */
    bound = curlwp_bind();
    s = pserialize_read_enter();
    PSLIST_READER_FOREACH(f, &footab.head, struct foo, f_entry) {
            if (f->key == key) {
                    psref_acquire(&fref, &f->target, footab.psr);
                    break;
            }
    }
    pserialize_read_exit(s);

Passive references: release

psref_release removes the entry from the CPU-local list, and notifies the destroyer if there is one. No atomics unless another thread is waiting to destroy the resource.

    /* Release the psref and unbind from the CPU. */
    psref_release(&fref, &f->target, footab.psr);
    curlwp_bindx(bound);

Passive references: delete/destroy

psref_target_destroy marks the resource as being destroyed, so that any future psref_release will wake it. Then psref_target_destroy repeatedly checks for references on all CPUs and sleeps until none are left.

Passive references: delete/destroy

    /* (a) Prevent new users. */
    mutex_enter(&footab.lock);
    PSLIST_WRITER_FOREACH(f, &footab.head, struct foo, f_entry) {
            if (f->key == key) {
                    PSLIST_WRITER_REMOVE(f, f_entry);
                    pserialize_perform(footab.psz);
                    break;
            }
    }
    mutex_exit(&footab.lock);

    if (f != NULL) {
            /* (b) Wait for old users. */
            psref_target_destroy(&f->target, footab.psr);
            /* (c) Destroy. */
            free_foo(f);
    }

Passive references: notes

Threads can sleep while holding passive references. Binding to a CPU is not usually a problem: much of the network stack already runs bound to a CPU anyway! Bonus: you can write precise assertions for diagnostics:

    KASSERT(psref_held(&f->target, footab.psr));

Modest memory cost: O(#CPUs) + O(#resources) + O(#references). Network routes: tens of thousands in the system, but only a handful in use per CPU at any time.

Local counts: localcount(9)

A global reference count per resource = contention. What about a per-CPU reference count per resource? High memory cost: O(#CPUs × #resources). So use it only for small numbers of resources, like device drivers. Device drivers: dozens in the system, but maybe thousands of uses at any time during heavy I/O loads.

Local counts: create/publish

    struct foo {
            ...
            struct localcount lc;
            ...
    };

    struct foo *f = alloc_foo(key);

    localcount_init(&f->lc);
    mutex_enter(&footab.lock);
    PSLIST_WRITER_INSERT_HEAD(&footab.head, f, f_entry);
    mutex_exit(&footab.lock);

Local counts: lookup/acquire

localcount_acquire increments a CPU-local counter: no atomics!

    s = pserialize_read_enter();
    PSLIST_READER_FOREACH(f, &footab.head, struct foo, f_entry) {
            if (f->key == key) {
                    localcount_acquire(&f->lc);
                    break;
            }
    }
    pserialize_read_exit(s);

Local counts: release

localcount_release decrements a CPU-local counter; if there is a destroyer, it updates the destroyer's global reference count. No atomics unless another thread is waiting to destroy the resource.

    localcount_release(&f->lc);

Local counts: delete/destroy

localcount_destroy marks the resource as being destroyed. It sends an IPI to compute the global reference count by adding up each CPU's local count. (Fun fact: local reference counts can be negative, if threads have migrated!) It then waits for all IPIs to return and for the reference count to become zero.

Local counts: delete/destroy

    /* (a) Prevent new users. */
    mutex_enter(&footab.lock);
    PSLIST_WRITER_FOREACH(f, &footab.head, struct foo, f_entry) {
            if (f->key == key) {
                    PSLIST_WRITER_REMOVE(f, f_entry);
                    pserialize_perform(footab.psz);
                    break;
            }
    }
    mutex_exit(&footab.lock);

    if (f != NULL) {
            /* (b) Wait for old users. */
            localcount_destroy(&f->lc);
            /* (c) Destroy. */
            free_foo(f);
    }

Local counts: notes

Not yet integrated in NetBSD: still on an experimental branch! To be used for MP-safely unloading device driver modules. Other applications? Probably yes!

Summary

Avoid locks: locks don't scale. Avoid atomics: atomics don't scale.

- pserialize: short uninterruptible reads; fast but limited.
- psref: sleepable readers; modest time/memory cost; flexible.
- localcount: migratable readers; fast but memory-intensive.

Questions? riastradh@netbsd.org