Fast on Average, Predictable in the Worst Case

Size: px

Start display at page:

Download "Fast on Average, Predictable in the Worst Case"

Lester Weaver
5 years ago
Views:

1 Fast on Average, Predictable in the Worst Case Exploring Real-Time Futexes in LITMUSRT Roy Spliet, Manohar Vanga, Björn Brandenburg, Sven Dziadek IEEE Real-Time Systems Symposium 2014 December 2-5, 2014 Rome, Italy 1

2 Real-Time Locking API Mutex API kernel_do_lock(); // critical section kernel_do_unlock(); 2

3 Real-Time Locking API Mutex API kernel_do_lock(); // critical section kernel_do_unlock(); No other task is currently holding the lock No other task is waiting for the lock Operations are typically uncontended at runtime 3

4 Futexes: Fast Userspace Mutexes A mechanism (API) in the Linux kernel!! For optimizing the uncontended case of locking protocol implementations Avoid kernel invocation during uncontended operations 4

5 Futexes: Fast Userspace Mutexes A mechanism (API) in the Linux kernel!! For optimizing the uncontended case of locking protocol implementations Avoid kernel invocation during uncontended operations Why optimize the uncontended case?!! Improved throughput for soft real-time workloads Increasing available slack time in the system 5

6 Real-Time Locking Implementations PRIO_INHERIT! (Priority Inheritance Protocol) PRIO_PROTECT (Immediate Priority Ceiling Protocol) Classic Priority Ceiling Protocol! (PCP) Multiprocessor PCP (MPCP) FMLP Futex Implementation Linux 6

7 Real-Time Locking Implementations PRIO_INHERIT! (Priority Inheritance Protocol) PRIO_PROTECT (Immediate Priority Ceiling Protocol) Classic Priority Ceiling Protocol! (PCP) Multiprocessor PCP (MPCP) FMLP Futex Implementation Linux LITMUS RT LITMUS RT

8 Real-Time Locking Implementations PRIO_INHERIT! (Priority Inheritance Protocol) PRIO_PROTECT (Immediate Priority Ceiling Protocol) Classic Priority Ceiling Protocol! (PCP) Multiprocessor PCP (MPCP) FMLP Futex Implementation Linux LITMUS RT 8

9 Real-Time Locking Implementations PRIO_INHERIT! (Priority Inheritance Protocol) PRIO_PROTECT (Immediate Priority Ceiling Protocol) Classic Priority Ceiling Protocol! (PCP) Multiprocessor PCP (MPCP) FMLP Futex Implementation Linux LITMUS RT Choice between futex implementation and better analytical properties 9

10 Real-Time Locking Implementations PRIO_INHERIT! (Priority Inheritance Protocol) PRIO_PROTECT (Immediate Priority Ceiling Protocol) Classic Priority Ceiling Protocol! (PCP) Multiprocessor PCP (MPCP) FMLP Futex Implementation Linux LITMUS RT Choice between futex implementation and better analytical properties Why is it challenging to have both? 10

11 Real-Time Locking Implementations PRIO_INHERIT! (Priority Inheritance Protocol) PRIO_PROTECT (Immediate Priority Ceiling Protocol) Classic Priority Ceiling Protocol! (PCP) Multiprocessor PCP (MPCP) FMLP Futex Implementation Reactive Anticipatory 11

12 Real-Time Locking Protocol Dichotomy Reactive Locking Protocols Anticipatory Locking Protocols React to contention on a lock only when it occurs. Anticipate problem scenarios and take measure to minimize priority-inversion (pi) blocking. 12

13 Real-Time Locking Protocol Dichotomy Reactive Locking Protocols Anticipatory Locking Protocols React to contention on a lock only when it occurs. Anticipate problem scenarios and take measure to minimize priority-inversion (pi) blocking. Simple futex implementation Tricky to implement futexes without violating semantics 13

14 Contributions PRIO_INHERIT! (Priority Inheritance Protocol) PRIO_PROTECT (Immediate Priority Ceiling Protocol) Classic Priority Ceiling Protocol! (PCP) Multiprocessor PCP (MPCP) FMLP Futex Implementation 14

15 Contributions PRIO_INHERIT! (Priority Inheritance Protocol) PRIO_PROTECT (Immediate Priority Ceiling Protocol) Classic Priority Ceiling Protocol! (PCP) Multiprocessor PCP (MPCP) FMLP Futex Implementation Design and implementation of futexes for three anticipatory real-time locking protocols 15

16 Contributions PRIO_INHERIT! (Priority Inheritance Protocol) PRIO_PROTECT (Immediate Priority Ceiling Protocol) Classic Priority Ceiling Protocol! (PCP) Multiprocessor PCP (MPCP) FMLP Futex Implementation Observed overhead reduction: 92% 85% 84% Design and implementation of futexes for three anticipatory real-time locking protocols 16

17 Contributions PRIO_INHERIT! (Priority Inheritance Protocol) PRIO_PROTECT (Immediate Priority Ceiling Protocol) Classic Priority Ceiling Protocol! (PCP) Multiprocessor PCP (MPCP) FMLP Futex Implementation Observed overhead reduction: 92% 85% 84% Design and implementation of futexes for three anticipatory real-time locking protocols Method of using page faults to implement futexes 17

18 Overview of Talk 18

19 Overview of Talk Challenge Implementation Evaluation 19

20 Overview of Talk Challenge Implementation Evaluation 20

21 Futexes Overview and Intuition The problem with the vanilla approach Mutex API kernel_do_lock(); // critical section kernel_do_unlock(); 21

22 Futexes Overview and Intuition The problem with the vanilla approach Kernel invoked on every lock and unlock operation Mutex API kernel_do_lock(); // critical section kernel_do_unlock(); 22

23 Futexes Overview and Intuition The problem with the vanilla approach Kernel invoked on every lock and unlock operation Task Task Task Mutex API kernel_do_lock(); // critical section kernel_do_unlock(); Lock State Kernel 23

24 Futexes Overview and Intuition Exporting Lock State to Userspace Processes Task Task Task Lock State Kernel Shared lock state between userspace and kernel 24

25 Futexes Overview and Intuition Exporting Lock State to Userspace Processes Futex Pseudocode try_to_acquire_lock(lock); if (lock is contended) kernel_do_lock(lock); // critical section release_lock(lock); if (there are blocked tasks) kernel_do_unlock(); Task Task Task Lock State Kernel 25

26 Futexes Overview and Intuition Fast-Path for Uncontended Operations Futex Pseudocode try_to_acquire_lock(lock); if (lock is contended) kernel_do_lock(lock); // critical section release_lock(lock); if (there are blocked tasks) kernel_do_unlock(); Task Task Task Lock State Kernel Fast-path when operation is uncontended (kernel not invoked) 26

27 Futexes Overview and Intuition Slow-Path for Contended Operations Futex Pseudocode try_to_acquire_lock(lock); if (lock is contended) kernel_do_lock(lock); // critical section release_lock(lock); if (there are blocked tasks) kernel_do_unlock(); Task Task Task Lock State Kernel Kernel invoked only when lock or unlock operation is contended 27

28 Priority Inheritance Protocol Futexes for reactive locking protocols HI Lock 2 LO Lock 1 Lock 2 Time 28 Fixed-Priority Scheduling

29 Priority Inheritance Protocol Futexes for reactive locking protocols Uncontended No kernel intervention HI Lock 2 LO Lock 1 Lock 2 Time 29 Fixed-Priority Scheduling

30 Priority Inheritance Protocol Futexes for reactive locking protocols Uncontended No kernel intervention Contended Kernel intervention needed HI Lock 2 LO Lock 1 Lock 2 Time 30 Fixed-Priority Scheduling

31 Priority Ceiling Protocol (PCP) Each lock has a priority ceiling the highest priority of all tasks that can acquire that lock. System ceiling Currently held lock with highest ceiling A lock may be acquired by a task only if: Task priority > System Ceiling Priority! Or the task is holding the system ceiling! When under pi-blocking, priority inheritance raises priority of lower priority task to that of higher priority task. 31

32 Priority Ceiling Protocol (PCP) Each lock has a priority ceiling the highest priority of all tasks that can acquire that lock. System ceiling Currently held lock with highest ceiling A lock may be acquired by a task only if: Task priority > System Ceiling Priority! Or the task is holding the system ceiling! When under pi-blocking, priority inheritance raises priority of lower priority task to that of higher priority task. 32

33 Priority Ceiling Protocol (PCP) Each lock has a priority ceiling the highest priority of all tasks that can acquire that lock. System ceiling Currently held lock with highest ceiling A lock may be acquired by a task only if: Task priority > System Ceiling Priority! Or the task is holding the system ceiling! When under pi-blocking, priority inheritance raises priority of lower priority task to that of higher priority task. 33

34 Priority Ceiling Protocol (PCP) Each lock has a priority ceiling the highest priority of all tasks that can acquire that lock. System ceiling Currently held lock with highest ceiling A lock may be acquired by a task only if: Task priority > System Ceiling Priority! Or the task is holding the system ceiling! When under contention, priority inheritance (raises priority of lower priority task to that of higher priority task). 34

35 PCP Futexes Challenge HI L1 MED L2 LO L1 Time 35 Fixed-Priority Scheduling

36 PCP Futexes Challenge HI L1 MED L2 LO L1 Raises system ceiling prio. to HI Time Lowers system ceiling prio. to None 36 Fixed-Priority Scheduling

37 PCP Futexes Challenge HI L1 Valid acquisition MED > System Ceiling Prio. MED L2 LO L1 Raises system ceiling prio. to HI Time Lowers system ceiling prio. to None 37 Fixed-Priority Scheduling

38 PCP Futexes Challenge HI L1 HI L1 MED L2 MED L2 LO L1 LO L1 Time Time 38 Fixed-Priority Scheduling

39 PCP Futexes Challenge HI L1 HI L1 MED L2 MED L2 LO L1 LO L1 Time Raises system ceiling prio. to HI Time Preemption Sys. ceiling prio. = HI 39 Fixed-Priority Scheduling

40 PCP Futexes Challenge HI L1 HI L1 L2 free But invalid acquisition Sys. Ceiling Prio. > MED MED L2 MED L2 LO L1 LO L1 Time Raises system ceiling prio. to HI Time Preemption Sys. ceiling prio. = HI 40 Fixed-Priority Scheduling

41 PCP Futexes Challenge HI L1 HI L1 L2 free But invalid acquisition Sys. Ceiling Prio. > MED MED L2 MED L2 LO L1 LO L1 Problem Simply relying on lock state (as in reactive case) may violate PCP semantics. Must consider system ceiling. Time Time Raises system ceiling prio. to HI Preemption Sys. ceiling prio. = HI 41 Fixed-Priority Scheduling

42 PCP Futexes Challenge HI L1 HI L1 L2 free But invalid acquisition Sys. Ceiling Prio. > MED MED L2 MED L2 LO L1 LO L1 Problem Simply relying on lock state (as in reactive case) may violate PCP semantics. Must consider system ceiling. Time Time Raises system ceiling prio. to HI Problem system ceiling potentially updated at every Preemption lock acquisition and release must always invoke Sys. kernel ceiling to avoid prio. = HI protocol violations 42 Fixed-Priority Scheduling

43 Overview of Talk Challenge Implementation Evaluation 43

44 Overview of Talk Challenge Implementation Evaluation 44

45 PCP Futexes Our Approach PCP Futex Pseudocode if (lock is contended) kernel_do_lock(lock); else acquire_lock_locally(lock) // critical section release_lock(lock); if (there are blocked tasks) kernel_do_unlock(); 45

46 PCP Futexes Our Approach PCP Futex Pseudocode if (lock is contended) kernel_do_lock(lock); else acquire_lock_locally(lock) // critical section release_lock(lock); if (there are blocked tasks) kernel_do_unlock(); Whether lock acquisition is contended is determined in PCP by the current system ceiling 46

47 PCP Futexes Our Approach PCP Futex Pseudocode if (lock is contended) kernel_do_lock(lock); else acquire_lock_locally(lock) // critical section release_lock(lock); if (there are blocked tasks) kernel_do_unlock(); Whether lock acquisition is contended is determined in PCP by the current system ceiling The presence of blocked tasks is determined by the size of the waitqueue corresponding to the lock. 47

48 PCP Futexes Our Approach PCP Futex Pseudocode if (lock is contended) kernel_do_lock(lock); else acquire_lock_locally(lock) // critical section release_lock(lock); if (there are blocked tasks) kernel_do_unlock(); Whether lock acquisition is contended is determined in PCP by the current system ceiling The presence of blocked tasks is determined by the size of the waitqueue corresponding to the lock. Cannot change while a task is scheduled on a uniprocessor! 48

49 PCP Futexes Deferred Updates HI Scheduling Interval LO Time Invoke kernel 49

50 PCP Futexes Deferred Updates HI Scheduling Interval LO Time Predetermine contention conditions and communicate them to task Make changes to local copy of lock states (if uncontended) Update kernel state based on copy at context switch 50

51 PCP Futexes Deferred Updates HI Scheduling Interval LO Time Predetermine contention conditions and communicate them to task Make changes to local copy of lock states (if uncontended) Update kernel state based on copy at context switch Requires a bidirectional communication channel between tasks and the kernel 51

52 Bidirectional Communication Lock Page bitmap: lock_states bool: tasks_blocked Lock Page Per-process page of memory writable by both the process and kernel. 52

53 Bidirectional Communication Lock Page Lock-state bitmap used to communicate acquisitions and releases of locks to kernel bitmap: lock_states bool: tasks_blocked Lock Page Per-process page of memory writable by both the process and kernel. 53

54 Bidirectional Communication Lock Page Lock-state bitmap used to communicate acquisitions and releases of locks to kernel bitmap: lock_states bool: tasks_blocked Lock Page Per-process page of memory writable by both the process and kernel. Boolean set by kernel to indicate the presence of blocked tasks (indicates that release operation is contended) 54

55 Bidirectional Communication Lock Page Lock-state bitmap used to communicate acquisitions and releases of locks to kernel bitmap: lock_states bool: tasks_blocked Lock Page Per-process page of memory writable by both the process and kernel. Boolean set by kernel to indicate the presence of blocked tasks (indicates that release operation is contended) A permission bit set by the kernel to indicate whether lock acquisition is allowed (system ceiling check) 55

56 Communicating the Permission Bit PCP Futex Pseudocode if (permission bit is set) set_bit_in_bitmap(lock) else kernel_do_lock(lock); // critical section release_lock(lock); if (tasks_blocked is set) kernel_do_unlock(); 56

57 Communicating the Permission Bit PCP Futex Pseudocode if (permission bit is set) set_bit_in_bitmap(lock) else kernel_do_lock(lock); // critical section release_lock(lock); if (tasks_blocked is set) kernel_do_unlock(); Challenge: Checking for permission and setting lock bit must be atomic. Why? Preemption between them may change the permission. 57

58 Communicating the Permission Bit We proposed two approaches for the Classic PCP Page Faults (PCP-DU-PF) CMPXCHG (PCP-DU-BOOL) Uses page faults to invoke kernel when locking is prohibited Boolean for permission bit. Compare-and-exchange operation to check and acquire locks atomically 58

59 Communicating the Permission Bit We proposed two approaches for the Classic PCP Page Faults (PCP-DU-PF) CMPXCHG (PCP-DU-BOOL) Uses page faults to invoke kernel when locking is prohibited Boolean See paper for! details! Compare-and-exchange operation to check and acquire locks atomically

60 PCP Futexes Page Fault Approach HI LO Time If (locking is allowed) set lock page writable else set lock page unwritable Attempting to write lock bit will invoke kernel automatically if locking is not permitted 60

61 PCP Futexes Page Fault Approach Can be implemented using segmentation faults as well HI Lock code-path is entirely branch-free LO Time If (locking is allowed) set lock page writable else set lock page unwritable Attempting to write lock bit will invoke kernel automatically if locking is not permitted 61

62 PCP Futexes Page Fault Approach Can be implemented using segmentation faults as well HI Lock code-path is entirely branch-free LO Requires an optimized page-fault handler Time If (locking is allowed) set lock page writable else set lock page unwritable Attempting to write lock bit will invoke kernel automatically if locking is not permitted 62

63 Communicating the Permission Bit We proposed two approaches for the Classic PCP Page Faults (PCP-DU-PF) CMPXCHG (PCP-DU-BOOL) Uses page faults to invoke kernel when locking is prohibited Boolean for permission bit. Compare-and-exchange operation to check and acquire locks atomically Lock code-path is entirely branch-free Avoid page faults at cost of atomic op + branch 63

64 Overview of Talk Challenge Implementation Evaluation 64

65 Overview of Talk Challenge Implementation Evaluation 65

66 Evaluation Platform Boundary Devices Sabre Lite Board Freescale I.MX6Q Quad Core SoC ARM Cortex A9 (1 GHz) Implemented in LITMUS RT (based on Linux ) 66

67 Evaluation Benchmarking Methodology Test program locks and unlocks once every period Uniprocessor tests: 5 threads Randomly assigned parameters for threads Critical section length 25-45µs Execution time 25-65µs Period µs Measured using processor timestamp counters 67

Evaluation Benchmarking Methodology Test program locks and unlocks once every period Uniprocessor tests: 5 threads Randomly assigned parameters for threads

68 Evaluation Benchmarking Methodology Test program locks and unlocks once every period Uniprocessor tests: 5 threads Randomly assigned parameters for threads Critical section length 25-45µs Execution time 25-65µs Period µs Demanding workload thousands of operations per second Measured using processor timestamp counters 68

69 PCP Futex Evaluation Baseline We compare all futex implementations against the non-futex PCP implementation in LITMUS RT PRIO_INHERIT! (Priority Inheritance Protocol) PRIO_PROTECT (Immediate Priority Ceiling Protocol) Classic Priority Ceiling Protocol! (PCP) Multiprocessor PCP (MPCP) FMLP Futex Implementation Linux LITMUS RT 69

70 PCP Futex Evaluation Baseline PRIO_PROTECT (Linux PCP) PCP (LITMUS^RT) 33,271 Overhead in Cycles 1, ,850 1,221 1,211 9,027 6,655 Lock (avg) Lock (max) Unlock (avg) Unlock (max) 70

71 PCP Futex Evaluation Baseline PRIO_PROTECT (Linux PCP) PCP (LITMUS^RT) Overhead in Cycles 1, ,271 Our baseline PCP implementation already incurs lower overheads than Linux s. 3,850 1,221 1,211 9,027 6,655 Lock (avg) Lock (max) Unlock (avg) Unlock (max) 71

72 PCP Futex Evaluation Uncontended Case How much reduction do we see in average uncontended-case overheads? 72

73 PCP Futex Evaluation Uncontended Case Lock (average) Unlock (average) 100% 100% Overhead (compared to baseline) 25% 27% 41% 25% 7% 8% PCP (LITMUS^RT) ) Futex PCP (Page-Faults) Futex Futex PCP (CMPXCHG) Futex PRIO_INHERIT Linux Futex (Linux) Baseline (Page Faults) (CMPXCHG) (PRIO_INHERIT) Based on 6.6 million samples per protocol 73

74 PCP Futex Evaluation Uncontended Case Lock (average) Unlock (average) 100% 100% Overhead (compared to baseline) Our PCP futex implementation overheads are significantly lower than Linux s PIP futex implementation. 25% 7% 27% 8% 41% 25% PCP (LITMUS^RT) ) Futex PCP (Page-Faults) Futex Futex PCP (CMPXCHG) Futex PRIO_INHERIT Linux Futex (Linux) Baseline (Page Faults) (CMPXCHG) (PRIO_INHERIT) Based on 6.6 million samples per protocol 74

75 PCP Futex Evaluation Contended Case How much additional overhead do we incur in the maximum contended-case overheads? 75

76 PCP Futex Evaluation Contended Case Lock (Max) Unlock (Max) 186% 167% 137% 100% 100% 107% 106% 106% PCP PCP (LITMUS^RT) ) PCP-DU-PF Futex PCP-DU-BOOL Futex PRIO_INHERIT Linux Futex (Linux) Baseline (Page Faults) (CMPXCHG) (PRIO_INHERIT) 76 Based on 900K samples per protocol

77 PCP Futex Evaluation Contended Case Lock (Max) Unlock (Max) Linux page-fault handler not optimized for real-time locking protocol implementation. 186% 137% 167% 100% 100% 107% 106% 106% PCP PCP (LITMUS^RT) ) PCP-DU-PF Futex PCP-DU-BOOL Futex PRIO_INHERIT Linux Futex (Linux) Baseline (Page Faults) (CMPXCHG) (PRIO_INHERIT) 77 Based on 900K samples per protocol

78 PCP Futex Evaluation Contended Case Lock (Max) Unlock (Max) 186% 167% 137% 107% 100% 100% 106% 106% PCP-DU-PF PCP (LITMUS^RT) ) PCP-DU-BOOL Futex PRIO_INHERIT Linux Futex (Linux) Baseline (CMPXCHG) (PRIO_INHERIT) 78 Based on 900K samples per protocol

79 PCP Futex Evaluation Contended Case Lock (Max) Unlock (Max) Up to 37% 186% increase in worst-case contended overheads as a result of the futex approach. 137% 167% 107% 100% 100% 106% 106% PCP-DU-PF PCP (LITMUS^RT) ) PCP-DU-BOOL Futex PRIO_INHERIT Linux Futex (Linux) Baseline (CMPXCHG) (PRIO_INHERIT) 79 Based on 900K samples per protocol

80 PCP Futex Evaluation Contended Case Lock (Max) Unlock (Max) Up to 37% 186% increase in worst-case contended overheads as a result of the futex approach. But no worse than Linux s futex implementation 107% 100% 100% 137% 106% 167% 106% PCP-DU-PF PCP (LITMUS^RT) ) PCP-DU-BOOL Futex PRIO_INHERIT Linux Futex (Linux) Baseline (CMPXCHG) (PRIO_INHERIT) 80 Based on 900K samples per protocol

81 PCP Futex Evaluation Summary Up to 92% reduction in average uncontended overheads at a cost of 37% increase in maximum contended overheads (no worse than Linux). Page faults can be used to implement futexes for anticipatory real-time locking protocols. 81

82 Summary PRIO_INHERIT! (Priority Inheritance Protocol) PRIO_PROTECT (Immediate Priority Ceiling Protocol) Classic Priority Ceiling Protocol! (PCP) Multiprocessor PCP (MPCP) FMLP Futex Implementation Observed overhead reduction: 92% 85% 84% Design and implementation of futexes for three anticipatory real-time locking protocols Method of using page faults to implement futexes 82

83 Thanks! Linux Testbed for Multiprocessor Scheduling in Real-Time Systems All source code available at: 83

Fast on Average, Predictable in the Worst Case: Exploring Real-Time Futexes in LITMUS RT

Fast on Average, Predictable in the Worst Case: Exploring Real-Time Futexes in LITMUS RT Roy Spliet MPI-SWS Manohar Vanga MPI-SWS Björn B. Brandenburg MPI-SWS Sven Dziadek TU Dresden Abstract This paper