Netconf: RCU and Breakage. Paul E. McKenney, IBM Distinguished Engineer & CTO Linux, Linux Technology Center, IBM Corporation


Transcription:

RCU and Breakage. Paul E. McKenney, IBM Distinguished Engineer & CTO Linux, Linux Technology Center. Copyright 2009 IBM Corporation

Overview:
- What the #$I#@(&!!! is RCU-bh for???
- RCU status in mainline
- Breakage for performance and scalability

What the #$I#@(&!!! is RCU-bh For???

What the #$I#@(&!!! is RCU-bh For??? It is all Robert Olsson's fault!!!
- He ran a DDoS workload that hung the system: ICMP redirects forced routing-table updates
- The routing cache is protected by RCU, and each update waits for a grace period before freeing
- The load was so heavy that the system never left irq!!! No context switches, no quiescent states, no grace periods; eventually, OOM!!!
- Dipankar created RCU-bh: an additional quiescent state in softirq execution
- The routing cache was converted to RCU-bh, and then withstood the DDoS

RCU Status in Mainline

RCU Status in Mainline
- synchronize_sched_expedited() is in mainline: it completes a grace period in a few tens of microseconds, by hammering all the CPUs with IPIs
- Therefore, it should be used sparingly: boot-time and other infrequent updates
- CLASSIC_RCU and PREEMPT_RCU are gone; TREE_RCU and TREE_PREEMPT_RCU instead
- TINY_RCU is under test, not yet in mainline, reports to the contrary notwithstanding

Breakage for Performance and Scalability

Performance of Synchronization Mechanisms (4-CPU 1.8GHz AMD Opteron 844 system)

Operation         | Cost (ns) | Ratio
------------------|-----------|------
Clock period      | 0.6       | 1
Best-case CAS     | 37.9      | 63.2
Best-case lock    | 65.6      | 109.3
Single cache miss | 139.5     | 232.5
CAS cache miss    | 306.0     | 510.0

Typical synchronization mechanisms incur the cache-miss costs a lot. A heavily optimized reader-writer lock might reach the single-cache-miss cost for its readers (but too bad about those poor writers...), whereas partitioning and RCU need to be down near the clock period. But this is an old system... and why all these low-level details???

Why All These Low-Level Details??? Would you trust a bridge designed by someone who did not understand the strength of materials? Or a ship designed by someone who did not understand steel-alloy transition temperatures? Or a house designed by someone who did not understand that unfinished wood rots when wet? Or a car designed by someone who did not understand the corrosion properties of the metals used in the exhaust system? Or a space shuttle designed by someone who did not understand the temperature limitations of O-rings? So why trust algorithms from someone ignorant of the properties of the underlying hardware???

Performance of Synchronization Mechanisms (16-CPU 2.8GHz Intel X5550 (Nehalem) system)

Operation                      | Cost (ns) | Ratio
-------------------------------|-----------|------
Clock period                   | 0.4       | 1
Best-case CAS                  | 12.2      | 33.8
Best-case lock                 | 25.6      | 71.2
Single cache miss              | 12.9      | 35.8
CAS cache miss                 | 7.0       | 19.4
Single cache miss (off-core)   | 31.2      | 86.6
CAS cache miss (off-core)      | 31.2      | 86.5
Single cache miss (off-socket) | 92.4      | 256.7
CAS cache miss (off-socket)    | 95.9      | 266.4

What a difference a few years can make!!! On-core, still a 6x improvement!!! Off-core is not quite so good, and off-socket... maybe not such a big difference after all. And these are best-case values!!! (Why?)

Performance of Synchronization Mechanisms. If you thought a single atomic operation was slow, try lots of them!!! (Parallel atomic increment of a single variable on a 1.9GHz Power 5 system)

Performance of Synchronization Mechanisms. Same effect on a 16-CPU 2.8GHz Intel X5550 (Nehalem) system.

System Hardware Structure (diagram: eight CPUs, each with its own store buffer and cache, connected through a hierarchy of interconnects to memory; speed-of-light round trip at 5GHz is about 3 centimeters). Electrons move at 0.03C to 0.3C in transistors and on wires, so lots of waiting. 3D???

Visual Demonstration of Instruction Overhead: The Bogroll Demonstration

CPU Performance: The Marketing Pitch

CPU Performance: Memory References

CPU Performance: Pipeline Flushes

CPU Performance: Atomic Instructions

CPU Performance: Memory Barriers

CPU Performance: Cache Misses

CPU Performance: I/O

So We Need to Break Things Up...

Exercise: Dining Philosophers Problem. Each philosopher requires two forks to eat. Need to avoid starvation.

Exercise: Dining Philosophers Solution #1 (diagram: forks numbered 1-5 around the table). Locking hierarchy: pick up the low-numbered fork first, preventing deadlock. Is this a good solution???

Exercise: Dining Philosophers Solution #2 (diagram: forks renumbered around the table). Locking hierarchy: pick up the low-numbered fork first, preventing deadlock. If all philosophers want to eat, at least two will be able to do so.

Exercise: Dining Philosophers Solution #3. Zero contention: all five can eat concurrently. Excellent disease control.

Exercise: Dining Philosophers Solutions. Objections to solutions #2 and #3: "You can't just change the rules like that!!!" But there is no rule against moving or adding forks!!! "The Dining Philosophers Problem is a valuable lock-hierarchy teaching tool, and #3 just destroyed it!!!" Lock hierarchy is indeed very valuable and widely used, so the restriction "there can only be five forks, positioned as shown" does have its place, even if it didn't appear in this instance of the Dining Philosophers Problem. But the lesson of transforming the problem into perfectly partitionable form is also very valuable, and given the wide availability of cheap multiprocessors, most desperately needed. But what if each fork cost a million dollars? Then we make the philosophers eat with their fingers...

But What To Do... if you have a problem that does not partition nicely????

Embarrassingly Parallel (diagram: CPU 0 through CPU 3, each working independently on its own data):
- Per-CPU variables
- Per-task variables
- Per-device structures
- ...

If You Cannot Fully Partition
- Use per-CPU/per-task caching: memory allocation, limit-aware counting
- Reduce the frequency of global interaction: use periodic update (e.g., load balancing)
- Reduce the frequency of global interaction: give up some accuracy or responsiveness; perhaps random() is your friend when coordination is more expensive than it is worth

Overview (recap):
- What the #$I#@(&!!! is RCU-bh for???
- RCU status in mainline
- Breakage for performance and scalability

Questions?

Legal Statement. This work represents the view of the author and does not necessarily represent the view of IBM. IBM and IBM (logo) are trademarks or registered trademarks of International Business Machines Corporation in the United States and/or other countries. Linux is a registered trademark of Linus Torvalds. Other company, product, and service names may be trademarks or service marks of others. This material is based upon work supported by the National Science Foundation under Grant No. CNS0719851. Joint work with Manish Gupta, Maged Michael, Phil Howard, Joshua Triplett, and Jonathan Walpole.