Native POSIX Thread Library (NPTL) CSE 506 Don Porter

Logical Diagram
[Course stack figure: today's lecture (threads) shown in context of binary formats, memory allocators, scheduling, system calls, RCU, file systems, networking, synchronization, memory management, device drivers, the CPU scheduler, interrupts, and hardware, spanning user and kernel space]

Today's reading
- Design challenges and trade-offs in a threading library
- Nice practical tricks and system details
- And some historical perspective on Linux evolution

Threading review
- What is threading? Multiple threads of execution in one address space
- x86 hardware: one cr3 register and set of page tables shared by 2+ register contexts that are otherwise distinct (rip, rsp/stack, etc.)
- Linux: one mm_struct shared by several task_structs
- Does JOS support threading?

Ok, but what is a thread library?
- Kernel provides basic functionality: e.g., create a new task with a shared address space, set my gs register
- In Linux, libpthread provides several abstractions for programmer convenience. Examples?
  - Thread management (join, cleanup, etc.)
  - Synchronization (mutexes, condition variables, etc.)
  - Thread-local storage
- Part of the design is a division of labor between the kernel and libraries!

User vs. Kernel Threading
- Kernel threading: every application-level thread is implemented by a kernel-visible thread (task_struct)
  - Called 1:1 in the paper
- User threading: multiple application-level threads (m) multiplexed on n kernel-visible threads (m >= n)
  - Called m:n in the paper
- Insight: context switching involves saving/restoring registers (including the stack). This can be done in user space too!

Intuition
- 2 user threads on 1 kernel thread; start with explicit yield
- 2 stacks
- On each yield(): save registers and switch stacks, just like the kernel does
- The OS schedules the one kernel thread
- The programmer controls how much time each user thread gets

Extensions
- Can map m user threads onto n kernel threads (m >= n)
  - Bookkeeping gets much more complicated (synchronization)
- Can do crude preemption using:
  - Certain functions (locks)
  - Timer signals from the OS

Why bother?
- Context switching overheads
- Finer-grained scheduling control
- Blocking I/O

Context Switching Overheads
- Recall: forking a thread halves your time slice
  - It takes a few hundred cycles to get in/out of the kernel, plus the cost of switching a thread
  - Time in the scheduler counts against your timeslice
- 2 threads, 1 CPU: if I can run the context-switching code locally (avoiding trap overheads, etc.), my threads get to run slightly longer!
- Stack-switching code works in userspace with few changes

Finer-Grained Scheduling Control
- Example: Thread 1 holds a lock; Thread 2 is waiting for the lock
  - Thread 1's quantum expired
  - Thread 2 just spins until its own quantum expires
  - Wouldn't it be nice to donate Thread 2's quantum to Thread 1? Both threads would make faster progress!
- Similar problems arise with producer/consumer, barriers, etc.
- Deeper problem: the application's data-flow and synchronization patterns are hard for the kernel to infer

Blocking I/O
- I have 2 threads, and they each get half of the application's quantum
- If A blocks on I/O and B is using the CPU:
  - B gets half the CPU time
  - A's quantum is lost (at least in some schedulers)
  - Modern Linux scheduler: A gets a priority boost
- Maybe the application cares more about B's CPU time

Blocking I/O and Events
- Events are an abstraction for dealing with blocking I/O
  - Layered over a user-level scheduler
- Lots of literature on this topic if you are interested

Scheduler Activations
- Observations:
  - Kernel context switching is substantially more expensive than user context switching
  - The kernel can't infer application goals as well as the programmer can; nice() helps, but it is clumsy
- Thesis: highly tuned multithreading should be done in the application; better kernel interfaces are needed

What is a scheduler activation?
- Like a kernel thread: a kernel stack and a user-mode stack
  - Represents the allocation of a CPU time slice
- Not like a kernel thread: it does not automatically resume a user thread
  - Instead it goes to one of a few well-defined upcalls: new timeslice, timeslice expired, blocked SA, unblocked SA
  - Upcalls must be reentrant (they can be called on many CPUs at the same time)
- The user scheduler decides what to run

User-level threading
- Independent of SAs, the user scheduler creates:
  - An analog of the task_struct for each thread, which stores register state when the thread is preempted
  - A stack for each thread
  - Some sort of run queue
    - A simple list in the (optional) paper
    - The application is free to use O(1), CFS, round-robin, etc.
- The user scheduler keeps the kernel notified of how many runnable tasks it has (via a system call)

Downsides of scheduler activations
- A random user thread gets preempted on every scheduling-related event
  - Not free! User scheduling must beat the kernel by a big enough margin to offset these overheads
  - Moreover, the most important thread may be the one that gets preempted, slowing down the critical path
- Potential optimization: communicate to the kernel a preference for which activation gets preempted to notify of an event

Back to NPTL
- Ultimately, a 1:1 model was adopted by Linux. Why?
  - m:n has higher context-switching overhead (lots of register copying and upcalls)
  - A difference of opinion between the research and kernel communities about how inefficient kernel-level schedulers are (claims about O(1) scheduling)
  - The m:n model's code is way more complicated to maintain
- Much to be said for encapsulating the kernel from the thread library!

Meta-observation
- Much of 90s OS research focused on giving programmers more control over performance
  - E.g., microkernels, extensible OSes, etc.
  - The argument: clumsy heuristics or awkward abstractions are keeping me from getting the full performance of my hardware
- Some won the day, some didn't
  - High-performance databases generally get direct control over the disk(s) rather than going through the file system

User-threading in practice
- Has come in and out of vogue
  - Correlated with how efficiently the OS creates and context-switches threads
- Linux 2.4: threading was really slow, and user-level thread packages were hot
- Linux 2.6: substantial effort went into tuning threads
  - E.g., most JVMs abandoned user threads

Other issues to cover
- Signaling
  - Correctness
  - Performance (synchronization)
- Manager thread
- List of all threads
- Other miscellaneous optimizations

Brief digression: Signals
- Signals are like a user-level interrupt
  - Specify a signal handler (like a trap handler); different signal numbers have different meanings
  - There are default actions for different signals (kill the process, ignore, etc.)
- Delivered when returning from the kernel
  - E.g., after returning from a system call
- Can be sent by hand using the kill command:
  kill -HUP 10293    # send SIGHUP to process 10293

Signal masking
- Like interrupts, signals can be masked
  - See the sigprocmask system call on Linux
- Why? User code may need to synchronize access to a data structure shared with a signal handler, or multiple signal handlers may need to synchronize
  - See the optional reading on signal races for an example

What was all the fuss about signals? 2 issues:
1) The behavior of sending a signal to a multi-threaded process was not correct, and could never be implemented correctly with pre-2.6 kernel-level tools
   - Correctness: cannot implement the POSIX standard
2) Signals were also used to implement blocking synchronization; e.g., releasing a mutex meant sending a signal to the next blocked task to wake it up
   - Performance: ridiculously complicated and inefficient

Issue 1: Signal correctness w/ threads
- Mostly solved by the kernel assigning the same PID to each thread
  - 2.4 assigned a different PID to each thread
  - A different TID distinguishes them
- The problem with different PIDs? POSIX says I should be able to send a signal to a multi-threaded program and any unmasked thread will get the signal, even if the first thread has exited
  - To deliver a signal, the kernel had to search every task in the process for an unmasked thread

Issue 2: Performance
- Solved by the adoption of futexes
  - Essentially just a shared wait queue in the kernel
- Idea: use an atomic instruction in user space to implement the fast path for a lock (more in later lectures)
  - If a task needs to block, it asks the kernel to put it on a given futex wait queue
  - The task that releases the lock wakes up the next task on the futex wait queue
- See the optional reading on futexes for more details

Manager Thread
- A lot of coordination (using signals) had to go through a manager thread
  - E.g., cleaning up the stacks of dead threads
- This was a scalability bottleneck
- Mostly eliminated with tweaks to the kernel that facilitate decentralization:
  - The kernel handled several termination edge cases for threads
  - The kernel would write to a given memory location to allow lazy cleanup of per-thread data

List of all threads
- A pain to maintain
- Mostly eliminated, but still needed to plug some leaks in fork
- A generation counter is a useful trick for lazy deletion, used in many systems
  - Idea: transparently replace key Foo with Foo:0. Upon deletion, require the next creation to rename Foo to Foo:1. This eliminates accidental use of stale data.

Other misc. optimizations
- On supercomputers, NPTL was hitting the 8k limit on segment descriptors
- Where does the 8k limit come from? The number of bits in the segment descriptor: it is a hardware-level limit
- How was it solved? Essentially, the kernel scheduler swaps descriptors out if needed
- Is this the common case? No; expect 8k to be enough

Optimizations
- Exit performance for 100k threads optimized from 15 minutes to 2 seconds!
- PID space increased to 2 billion threads
- The /proc file system was made able to handle more than 64k processes

Results
- Big speedups! Yay!

Summary
- A nice paper on the practical concerns and trade-offs in building a threading library
  - I enjoyed this reading very much
- Understand the 1:1 vs. m:n model
  - User- vs. kernel-level threading
- Understand the other key implementation issues discussed in the paper