RCU and Breakage
Paul E. McKenney, IBM Distinguished Engineer & CTO Linux, Linux Technology Center
Copyright © 2009 IBM Corporation
Overview What the #$I#@(&!!! is RCU-bh for??? RCU status in mainline Breakage for performance and scalability 2
What the #$I#@(&!!! is RCU-bh For??? 3
What the #$I#@(&!!! is RCU-bh For???
It is all Robert Olsson's fault!!!
  Ran a DDoS workload that hung the system
    ICMP redirects forced routing-table updates
    Routing cache protected by RCU
    Each update waits for a grace period before freeing
  Load was so heavy that system never left irq!!!
    No context switches, no quiescent states, no grace periods
    Eventually, OOM!!!
Dipankar created RCU-bh
  Additional quiescent state in softirq execution
  Routing cache converted to RCU-bh, then withstood DDoS 4
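For reference, a minimal sketch of the read and update paths that the RCU-bh conversion implies; struct route_ent and the function names are illustrative, but rcu_read_lock_bh(), rcu_read_unlock_bh(), and call_rcu_bh() are the RCU-bh primitives the slide refers to.

#include <linux/types.h>
#include <linux/rcupdate.h>
#include <linux/rculist.h>
#include <linux/slab.h>

/* Illustrative entry type; only the rcu_head is essential to the pattern. */
struct route_ent {
	struct list_head list;
	u32 dst;
	struct rcu_head rcu;
};

/* Reader (e.g., softirq-time lookup): caller brackets this with
 * rcu_read_lock_bh() / rcu_read_unlock_bh(), whose softirq quiescent
 * state is what keeps grace periods completing under heavy load. */
struct route_ent *route_lookup(struct list_head *head, u32 dst)
{
	struct route_ent *p;

	list_for_each_entry_rcu(p, head, list)
		if (p->dst == dst)
			return p;
	return NULL;
}

static void route_free_rcu(struct rcu_head *rhp)
{
	kfree(container_of(rhp, struct route_ent, rcu));
}

/* Updater: unlink, then defer the free until after an RCU-bh grace period. */
void route_remove(struct route_ent *p)	/* caller holds the update-side lock */
{
	list_del_rcu(&p->list);
	call_rcu_bh(&p->rcu, route_free_rcu);
}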
RCU Status in Mainline 5
RCU Status in Mainline
synchronize_sched_expedited() in mainline
  Completes a grace period in a few tens of microseconds
  By hammering all the CPUs with IPIs
  Therefore, should be used sparingly: boot-time and other infrequent updates
CLASSIC_RCU and PREEMPT_RCU are gone
  TREE_RCU and TREE_PREEMPT_RCU instead
TINY_RCU under test, not yet in mainline
  Reports to the contrary notwithstanding 6
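A minimal sketch of the kind of infrequent update the expedited primitive is meant for; struct boot_cfg and the function name are illustrative, and it assumes readers access the pointer inside sched-RCU (preemption-disabled) read-side critical sections.

/* Illustrative readers:
 *	rcu_read_lock_sched();
 *	cfg = rcu_dereference(global_cfg);
 *	...
 *	rcu_read_unlock_sched();
 */
struct boot_cfg *global_cfg;

void boot_cfg_replace(struct boot_cfg *new_cfg)
{
	struct boot_cfg *old_cfg = global_cfg;	/* update-side lock held by caller */

	rcu_assign_pointer(global_cfg, new_cfg);
	synchronize_sched_expedited();	/* tens of microseconds, but IPIs every CPU */
	kfree(old_cfg);
}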
Breakage for Performance and Scalability 7
Performance of Synchronization Mechanisms
4-CPU 1.8GHz AMD Opteron 844 system

  Operation            Cost (ns)    Ratio
  Clock period               0.6        1
  Best-case CAS             37.9       63.2
  Best-case lock            65.6      109.3
  Single cache miss        139.5      232.5
  CAS cache miss           306.0      510.0

Need to be here! (Partitioning/RCU)
Heavily optimized reader-writer lock might get here for readers (but too bad about those poor writers...)
Typical synchronization mechanisms do this a lot 8

Performance of Synchronization Mechanisms (same table and callouts as above)
But this is an old system... 9

Performance of Synchronization Mechanisms (same table and callouts as above)
But this is an old system... And why low-level details??? 10
Why All These Low-Level Details???
Would you trust a bridge designed by someone who did not understand the strength of materials?
Or a ship designed by someone who did not understand the steel-alloy transition temperatures?
Or a house designed by someone who did not understand that unfinished wood rots when wet?
Or a car designed by someone who did not understand the corrosion properties of the metals used in the exhaust system?
Or a space shuttle designed by someone who did not understand the temperature limitations of O-rings?
So why trust algorithms from someone ignorant of the properties of the underlying hardware??? 11
Performance of Synchronization Mechanisms
16-CPU 2.8GHz Intel X5550 (Nehalem) System

  Operation                        Cost (ns)    Ratio
  Clock period                           0.4        1
  Best-case CAS                         12.2       33.8
  Best-case lock                        25.6       71.2
  Single cache miss                     12.9       35.8
  CAS cache miss                         7.0       19.4

What a difference a few years can make!!! 12

Performance of Synchronization Mechanisms (same system, adding off-core results)

  Single cache miss (off-core)          31.2       86.6
  CAS cache miss (off-core)             31.2       86.5

Not quite so good... But still a 6x improvement!!! 13

Performance of Synchronization Mechanisms (same system, adding off-socket results)

  Single cache miss (off-socket)        92.4      256.7
  CAS cache miss (off-socket)           95.9      266.4

Maybe not such a big difference after all... And these are best-case values!!! (Why?) 14
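Not the harness behind these tables, just a minimal user-space sketch of what a best-case (single-threaded, cache-hot) CAS measurement looks like; the iteration count and the use of CLOCK_MONOTONIC are illustrative choices.

#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
	enum { N = 100 * 1000 * 1000 };	/* illustrative iteration count */
	atomic_long v = 0;
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (long i = 0; i < N; i++) {
		long old = atomic_load_explicit(&v, memory_order_relaxed);
		/* Always succeeds: one thread, cache-hot line == best case. */
		atomic_compare_exchange_strong(&v, &old, old + 1);
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
	printf("best-case CAS: %.1f ns\n", ns / N);
	return 0;
}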
Performance of Synchronization Mechanisms If you thought a single atomic operation was slow, try lots of them!!! (Parallel atomic increment of a single variable on 1.9GHz Power 5 system) 15
Performance of Synchronization Mechanisms Same effect on a 16-CPU 2.8GHz Intel X5550 (Nehalem) system 16
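The scalability plots themselves are not reproduced here; below is a minimal user-space sketch of the experiment they depict, with every thread atomically incrementing the same variable so the cache line ping-pongs among the CPUs. It assumes POSIX threads; the thread cap and per-thread iteration count are illustrative.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static atomic_long counter;			/* single shared cache line */
enum { INCS_PER_THREAD = 10 * 1000 * 1000 };	/* illustrative */

static void *worker(void *arg)
{
	(void)arg;
	for (long i = 0; i < INCS_PER_THREAD; i++)
		atomic_fetch_add(&counter, 1);
	return NULL;
}

int main(int argc, char **argv)
{
	int nthreads = argc > 1 ? atoi(argv[1]) : 4;
	pthread_t tid[64];
	struct timespec t0, t1;

	if (nthreads < 1 || nthreads > 64)
		nthreads = 4;			/* keep within the tid[] array */

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < nthreads; i++)
		pthread_create(&tid[i], NULL, worker, NULL);
	for (int i = 0; i < nthreads; i++)
		pthread_join(tid[i], NULL);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
	printf("%d threads: %.1f ns per increment\n",
	       nthreads, ns / ((double)nthreads * INCS_PER_THREAD));
	return 0;
}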
System Hardware Structure
[Diagram: two rows of eight CPUs, each with a store buffer and cache ($), connected through interconnects to memory]
3 centimeters: speed-of-light round trip @ 5GHz
Electrons move at 0.03c to 0.3c in transistors, so lots of waiting. 3D??? 17
Visual Demonstration of Instruction Overhead The Bogroll Demonstration 18
CPU Performance: The Marketing Pitch 19
CPU Performance: Memory References 20
CPU Performance: Pipeline Flushes 21
CPU Performance: Atomic Instructions 22
CPU Performance: Memory Barriers 23
CPU Performance: Cache Misses 24
CPU Performance: I/O 25
So We Need to Break Things Up... 26
Exercise: Dining Philosophers Problem Each philosopher requires two forks to eat. Need to avoid starvation. 27
Exercise: Dining Philosophers Solution #1
[Figure: the five forks numbered 1-5 around the table]
Locking hierarchy. Pick up low-numbered fork first, preventing deadlock.
Is this a good solution??? 28
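A minimal user-space sketch of this locking-hierarchy solution, assuming POSIX threads; the fork-numbering scheme and function names are illustrative.

#include <pthread.h>

#define NR_PHIL 5
static pthread_mutex_t fork_lock[NR_PHIL];	/* init each with pthread_mutex_init() */

/* Philosopher i uses forks i and (i + 1) % NR_PHIL; always take the
 * lower-numbered one first so that no cycle of waiters can form. */
static void pick_up_forks(int i)
{
	int a = i, b = (i + 1) % NR_PHIL;

	if (a > b) { int tmp = a; a = b; b = tmp; }
	pthread_mutex_lock(&fork_lock[a]);
	pthread_mutex_lock(&fork_lock[b]);
}

static void put_down_forks(int i)
{
	pthread_mutex_unlock(&fork_lock[i]);
	pthread_mutex_unlock(&fork_lock[(i + 1) % NR_PHIL]);
}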
Exercise: Dining Philosophers Solution #2
[Figure: the five numbered forks, repositioned]
Locking hierarchy. Pick up low-numbered fork first, preventing deadlock.
If all want to eat, at least two will be able to do so. 29
Exercise: Dining Philosophers Solution #3 Zero contention. All 5 can eat concurrently. Excellent disease control. 30
Exercise: Dining Philosophers Solutions
Objections to solutions #2 and #3:
  "You can't just change the rules like that!!!"
    No rule against moving or adding forks!!!
  "The Dining Philosophers Problem is a valuable lock-hierarchy teaching tool, and #3 just destroyed it!!!"
    Lock hierarchy is indeed very valuable and widely used, so the restriction that there can only be five forks positioned as shown does indeed have its place, even if it didn't appear in this instance of the Dining Philosophers Problem.
    But the lesson of transforming the problem into perfectly partitionable form is also very valuable, and, given the wide availability of cheap multiprocessors, most desperately needed.
  "But what if each fork cost a million dollars?"
    Then we make the philosophers eat with their fingers... 31
But What To Do... If you have a problem that does not partition nicely???? 32
Embarrassingly Parallel
[Figure: CPU 0, CPU 1, CPU 2, and CPU 3, each operating on its own private data]
Per-CPU variables
Per-task variables
Per-device structures... 33
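A minimal sketch of the per-CPU-variable pattern from this slide; the counter name is illustrative, but DEFINE_PER_CPU(), get_cpu_var(), put_cpu_var(), and per_cpu() are the kernel primitives involved.

#include <linux/percpu.h>

/* A statistic kept strictly per-CPU, so the fast path never contends. */
DEFINE_PER_CPU(unsigned long, pkt_count);	/* illustrative name */

/* Fast path: bump this CPU's copy; preemption is disabled across the access. */
static void count_packet(void)
{
	get_cpu_var(pkt_count)++;
	put_cpu_var(pkt_count);
}

/* Slow path: sum all CPUs' copies when an (approximate) total is needed. */
static unsigned long total_packets(void)
{
	unsigned long sum = 0;
	int cpu;

	for_each_possible_cpu(cpu)
		sum += per_cpu(pkt_count, cpu);
	return sum;
}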
If You Cannot Fully Partition
Use per-CPU/per-task caching (see the sketch below)
  Memory allocation, limit-aware counting
  Reduces frequency of global interaction
Use periodic updates (e.g., load balancing)
  Reduces frequency of global interaction
Give up some accuracy or responsiveness
  Perhaps random() is your friend
  Coordination more expensive than it is worth 34
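A minimal user-space sketch of the per-task caching idea applied to limit-aware counting; the names and the batch constant are illustrative. Each thread pays for a shared atomic operation only once per BATCH local increments, trading exactness of the global count for far fewer cache-line transfers.

#include <stdatomic.h>

#define BATCH 64				/* illustrative: local cache size */

static atomic_long global_count;		/* touched only on batch overflow */
static _Thread_local long local_count;		/* per-thread cache, contention-free */

/* Fast path: pure thread-local increment. */
void count_event(void)
{
	if (++local_count >= BATCH) {
		atomic_fetch_add(&global_count, local_count);	/* rare global update */
		local_count = 0;
	}
}

/* Approximate total: may be low by up to nthreads * (BATCH - 1). */
long read_count_approx(void)
{
	return atomic_load(&global_count);
}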
Overview What the #$I#@(&!!! is RCU-bh for??? RCU status in mainline Breakage for performance and scalability 35
Questions? 36
Legal Statement
This work represents the view of the author and does not necessarily represent the view of IBM.
IBM and IBM (logo) are trademarks or registered trademarks of International Business Machines Corporation in the United States and/or other countries.
Linux is a registered trademark of Linus Torvalds.
Other company, product, and service names may be trademarks or service marks of others.
This material is based upon work supported by the National Science Foundation under Grant No. CNS-0719851.
Joint work with Manish Gupta, Maged Michael, Phil Howard, Joshua Triplett, and Jonathan Walpole 37