Kernel Scalability Adam Belay <abelay@mit.edu>

Motivation Modern CPUs are predominantly multicore. Applications rely heavily on the kernel for networking, filesystems, etc. If the kernel can't scale across many cores, applications that rely on it won't scale either. We have to be able to execute system calls in parallel.

The problem is sharing The OS maintains many data structures: the process table, file descriptor tables, the buffer cache, scheduler queues, etc. They depend on locks to maintain invariants. Applications may contend on those locks, limiting scalability.

OS evolution Early kernels depended on a single big lock to protect kernel data. Later, kernels transitioned to fine-grained locking. Now, many lock-free approaches are used too. Extreme case: Some research kernels attempted to share nothing, e.g. FOS and Barrelfish, and could potentially run without cache coherence. Downside: poor load balancing.

Agenda for today 1. Read-copy-update (today's reading assignment) 2. Per-CPU reference counters 3. Scalable commutativity rule

Read-heavy data structures Kernels often have data that is read much more often than it is modified: network tables (routing, ARP), file descriptor arrays, and most types of system call state. RCU optimizes for these use cases; there are over 10,000 uses of the RCU API in the Linux kernel! 1. Goal: Concurrent reads, even during updates 2. Goal: Low space overhead 3. Goal: Low execution overhead

Plan #1: spin locks Problem: Serializes all critical sections. Read-only critical sections would have to wait for other read-only sections to finish. Idea: Could we allow parallel readers but still serialize writers with respect to both readers and other writers?

Plan #2: Read-write locks A modification to spin locks that allows parallel reads. How would we change the spin lock implementation to support this feature?

Read-write lock implementation

typedef struct {
    volatile int cnt;
} rwlock_t;

void read_lock(rwlock_t *l) {
    int x;
    while (true) {
        x = l->cnt;
        if (x < 0) // is a write lock held?
            continue;
        if (CMPXCHG(&l->cnt, x, x + 1))
            break;
    }
}

void read_unlock(rwlock_t *l) {
    ATOMIC_DEC(&l->cnt);
}

Read-write lock implementation

void write_lock(rwlock_t *l) {
    int x;
    while (true) {
        x = l->cnt;
        if (x != 0) // is the lock held?
            continue;
        if (CMPXCHG(&l->cnt, 0, -1))
            break;
    }
}

void write_unlock(rwlock_t *l) {
    ATOMIC_INC(&l->cnt);
}

Q: Why check the lock state before attempting CMPXCHG? A failed CMPXCHG still requires exclusive (M state) ownership of the cache line. Spinning on the plain read first keeps the line in shared state until the lock looks free, avoiding needless cache-line bouncing among waiters.

Q: What's the execution overhead?

Q: What's the execution overhead? Every reader uses the CMPXCHG instruction, forcing an S -> M cache coherence state transition: find + invalidate messages on a contended read_lock(), and on read_unlock() too! And if a writer holds the lock, readers must spin and wait, which violates the goal of concurrent reads even during updates.

Plan #3: Read-copy-update (RCU) Readers just access objects directly (no locks). Writers make a copy of the object, change it, then update the pointer to the new copy: 1. Reader accesses the object directly 2. Writer makes a copy and changes the data 3. Then the writer updates the pointer

When to free old objects? At any given moment, readers could be accessing the latest copy or older copies of an object. Idea: We can safely free an object once it is no longer reachable. Usually there is only one pointer to an RCU object; it can't be copied, stored on the stack, or kept in registers (except inside critical sections), it can only be dereferenced inside a read critical section, and read critical sections disable preemption. So we need to define a quiescent period, after which it's safe to free: wait until all cores have passed through a context switch.

Q: Why disable preemption during RCU read critical sections?

Q: Why disable preemption during RCU read critical sections? If we didn't, waiting for all cores to context switch wouldn't be an effective quiescent period: a task could still hold a pointer to an RCU object while it is preempted, making it hard to determine when it's safe to free (unless we waited until all current tasks were killed). We need to define a read critical section such that references to RCU objects can't persist outside the section.

RCU API (simplified)

void rcu_read_lock() {
    preempt_disable[cpu_id()]++;
}

void rcu_read_unlock() {
    preempt_disable[cpu_id()]--;
}

/* Running once on every core guarantees each core has context-switched,
 * so no pre-existing read critical section can still be running. */
void synchronize_rcu(void) {
    for_each_cpu(int cpu)
        run_on(cpu);
}

Real RCU API
rcu_read_lock(): Begin an RCU critical section
rcu_read_unlock(): End an RCU critical section
synchronize_rcu(): Wait for existing RCU critical sections to complete
call_rcu(callback, argument): Invoke the callback once existing RCU critical sections complete
rcu_dereference(pointer): Signal the intent to dereference a pointer inside an RCU critical section
rcu_dereference_protected(pointer, check): Signal the intent to dereference a pointer outside an RCU critical section (e.g. while holding the update-side lock)
rcu_assign_pointer(pointer_addr, pointer): Assign a value to a pointer that is read in RCU critical sections

How to synchronize writes? Against other writers: allow only one writer, or just use normal synchronization like locks! Against readers, memory order matters: writers must fully finish writing the new object before updating the pointer, and readers must not reorder reads such that the contents of an object are read before its pointer (NOTE: the DEC Alpha can actually do this!). rcu_dereference() and rcu_assign_pointer() automatically insert the appropriate compiler and memory barriers.
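To make the barrier requirement concrete, here is a minimal sketch of how these two primitives could be expressed with C11 atomics. This is only an illustration of the ordering they must provide, not the Linux kernel's actual definitions (which use compiler-specific barriers), and it assumes the protected pointer is declared _Atomic:

#include <stdatomic.h>

/* Publish: release ordering ensures all initialization of the new
 * object happens-before the pointer store becomes visible. */
#define rcu_assign_pointer(p, v) \
    atomic_store_explicit(&(p), (v), memory_order_release)

/* Read: consume ordering prevents reads through the returned pointer
 * from being reordered before the pointer load (the DEC Alpha case). */
#define rcu_dereference(p) \
    atomic_load_explicit(&(p), memory_order_consume)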

Example RCU usage Imagine a simple online store. We need an object to represent the price of each item.

typedef struct {
    const char *name;
    float price;
    float discount;
} item_t;

rcu item_t *item;   // RCU-protected pointer
lock_t item_lock;   // serializes writers

NOTE: total_cost = price - discount

Example RCU usage (reader)

float get_cost(void) {
    item_t *p;
    float cost;

    rcu_read_lock();
    p = rcu_dereference(item); // read
    cost = p->price - p->discount;
    rcu_read_unlock();
    return cost;
}

Example RCU usage (writer)

void set_cost(float price, float discount) {
    item_t *oldp, *newp;

    spin_lock(&item_lock);
    oldp = rcu_dereference_protected(item, spin_locked(&item_lock));
    newp = kmalloc(sizeof(*newp));
    *newp = *oldp; // copy
    newp->price = price;
    newp->discount = discount;
    rcu_assign_pointer(item, newp); // update
    spin_unlock(&item_lock);
    synchronize_rcu(); // wait for readers of oldp
    kfree(oldp); // free
}
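If the writer can't afford to block in synchronize_rcu(), the free can be deferred with call_rcu() instead. A hedged sketch, assuming item_t is extended with a struct rcu_head member named rcu (the slide's item_t has no such field) and free_item is a hypothetical callback:

/* Assumes item_t gains a member: struct rcu_head rcu; */
static void free_item(struct rcu_head *head) {
    item_t *p = container_of(head, item_t, rcu);
    kfree(p);
}

/* In set_cost(), replace synchronize_rcu(); kfree(oldp); with: */
call_rcu(&oldp->rcu, free_item);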

RCU is a very powerful tool 1. Works with more complex data structures like linked lists and hash tables (a list sketch follows below) 2. Most common use case is as an alternative to read-write locks 3. Can be used to wait for parallel work to complete 4. Can be used to elide reference counting See the paper for many more examples.
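As an example of the linked-list case, inserting at the head of an RCU-protected singly linked list needs only rcu_assign_pointer(), and readers traverse with rcu_dereference(). A minimal sketch, where node_t, list_head, list_push(), and list_contains() are illustrative names rather than anything from the reading:

typedef struct node {
    int key;
    struct node *next;
} node_t;

node_t *list_head; /* RCU-protected; writers serialized by a lock */

/* Writer: publish a new head node (caller holds the update-side lock). */
void list_push(node_t *n) {
    n->next = rcu_dereference_protected(list_head, true);
    rcu_assign_pointer(list_head, n);
}

/* Reader: safe to run concurrently with list_push(). */
bool list_contains(int key) {
    node_t *p;
    bool found = false;

    rcu_read_lock();
    for (p = rcu_dereference(list_head); p != NULL; p = rcu_dereference(p->next)) {
        if (p->key == key) {
            found = true;
            break;
        }
    }
    rcu_read_unlock();
    return found;
}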

Does RCU achieve its goals? 1. Goal: Concurrent reads, even during updates? Yes! Reads are never stalled by updates. 2. Goal: Low space overhead? Yes! An RCU pointer is the same size as an ordinary pointer, and no extra synchronization data is required. However, objects can't be freed until a quiescent period has passed; forcing this to happen immediately incurs overhead. 3. Goal: Low execution overhead? For readers, RCU has practically no execution overhead. For writers, RCU adds a slight amount of overhead due to allocation, freeing, and copying; in practice, this overhead is modest. Fine-grained locking can help make updates concurrent.

Reference counters A reference counter counts the number of pointer references to an object; when the count reaches zero, it is safe to free the object. Challenge: reference counting involves true sharing. Many resources in the kernel are reference counted, and refcounts are often a scaling bottleneck (once other bottlenecks are removed).

Standard approach

typedef struct {
    int cnt;
} kref_t;

void kref_get(kref_t *r) {
    WARN_ON(r->cnt == 0);
    ATOMIC_INC(&r->cnt);
}

void kref_put(kref_t *r, void (*release)(kref_t *r)) {
    if (ATOMIC_DEC_AND_TEST(&r->cnt))
        release(r);
}
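Typical usage embeds the kref in the object it protects; the release callback recovers the enclosing object with container_of() and frees it. A brief sketch, where buffer_t and buffer_free() are hypothetical names:

typedef struct {
    kref_t ref;      /* embedded refcount */
    char data[512];
} buffer_t;

void buffer_free(kref_t *r) {
    buffer_t *b = container_of(r, buffer_t, ref);
    kfree(b);
}

/* Taking a reference:   kref_get(&b->ref);              */
/* Dropping a reference: kref_put(&b->ref, buffer_free); */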

What's the execution overhead?

What's the execution overhead? kref_get() and kref_put() both require exclusive ownership of the cache line (i.e. they place it in the M state). That means tons of cache-line bouncing if the object is referenced frequently.

Idea: Per-CPU reference counters Maintain an array of counters, one per core. percpu_ref_get() and percpu_ref_put() operate on the local core's array entry, so each entry is written by only one core and there is no cache-line bouncing. percpu_ref_kill() reverts to normal, shared reference counting. Performance is improved only if most references are added and removed while the killer's reference remains held. Often this is true! (A sketch of the fast path follows.)
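Here is a minimal sketch of the fast path in the same pseudo-C as the earlier slides. NCPU, cpu_id(), the padding, and the killed flag are illustrative assumptions (Linux's percpu_ref differs in detail); the rcu_read_lock()/rcu_read_unlock() pair is what makes the mode switch during kill safe, as the next slide explains:

#define NCPU 64

typedef struct {
    struct {
        long cnt;
        char pad[64 - sizeof(long)]; /* one cache line per core: no bouncing */
    } percpu[NCPU];
    long shared;                     /* used once the ref has been killed */
    bool killed;
} percpu_ref_t;

void percpu_ref_get(percpu_ref_t *r) {
    rcu_read_lock();
    if (!r->killed)
        r->percpu[cpu_id()].cnt++;   /* local, uncontended */
    else
        ATOMIC_INC(&r->shared);
    rcu_read_unlock();
}

void percpu_ref_put(percpu_ref_t *r, void (*release)(percpu_ref_t *)) {
    rcu_read_lock();
    if (!r->killed)
        r->percpu[cpu_id()].cnt--;   /* can't hit zero: killer still holds a ref */
    else if (ATOMIC_DEC_AND_TEST(&r->shared))
        release(r);
    rcu_read_unlock();
}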

How to implement kill? 1. Set the shared refcount to a bias value 2. Atomically set a kill flag in the shared refcount 3. Wait a quiescent period, after which the flag is visible to all reference counters. How? synchronize_rcu() is perfect for this! 4. Sum all refcounts in the per-core array and atomically add the total to the refcount in the shared structure 5. Finally, atomically subtract the bias value 6. When the shared refcount reaches zero, free the object. (A sketch follows below.)
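Putting those steps together in the same pseudo-C, a hedged sketch (BIAS and ATOMIC_ADD are assumed primitives, a separate killed flag stands in for a flag bit inside the shared count, and the killer is presumed to still hold its own reference):

#define BIAS (1L << 30) /* keeps the shared count nonzero during the transition */

void percpu_ref_kill(percpu_ref_t *r) {
    long sum = 0;

    ATOMIC_ADD(&r->shared, BIAS);  /* 1. install the bias value */
    r->killed = true;              /* 2. set the kill flag */
    synchronize_rcu();             /* 3. quiescent period: every core now sees killed */
    for_each_cpu(int cpu)          /* 4. fold per-core counts into the shared count */
        sum += r->percpu[cpu].cnt;
    ATOMIC_ADD(&r->shared, sum);
    ATOMIC_ADD(&r->shared, -BIAS); /* 5. remove the bias */
    /* 6. from here on, the final percpu_ref_put() frees the object */
}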

Can any program be made scalable? Today we saw two examples of highly scalable algorithms, but they are only applicable in certain situations. In general, when is it possible to make code scalable?

Scalable commutativity rule Rule: If two operations commute, then there exists a scalable (conflict-free) implementation. Intuition: If operations commute, their order doesn't matter, so communication between them must be unnecessary. Caveat: Scalable implementations may still be possible for operations that fail the rule; however, if operations pass the rule, scalability is definitely possible. See http://pdos.csail.mit.edu/papers/commutativity:sosp13.pdf

Insight: The key to scalability is good interface design Example: POSIX requires the open() system call to return the smallest available fd number. Do two open() calls commute? No: whichever runs first receives the lower fd, so the result depends on their order, and any implementation must communicate to agree on it. How would you change open() to make it more scalable? Relaxing the specification to return any unused fd would make concurrent open() calls commute, permitting a conflict-free implementation.

Conclusion RCU enables zero-cost read-only access at the expense of slightly more expensive updates; it is very useful for read-mostly data, which is extremely common in kernels. Reference counting can be almost free in cases where objects are long-lived and one task can be designated as the killer; per-CPU refcounting sacrifices space for speed. The scalable commutativity rule provides a guideline for designing scalable interfaces. Note that the operations in both RCU and per-CPU refcounting commute!