19/09/2008. Microkernel Construction. IPC - Implementation. IPC Importance. General IPC Algorithm. IPC Implementation. Short IPC

Size: px

Start display at page:

Download "19/09/2008. Microkernel Construction. IPC - Implementation. IPC Importance. General IPC Algorithm. IPC Implementation. Short IPC"

Constance Phelps
5 years ago
Views:

1 IPC Importance Microkernel Construction IPC Implementation General IPC Algorithm Validate parameters Locate target thread if unavailable, deal with it Transfer message untyped - short IPC typed message - long IPC Schedule target thread switch address space as necessary Wait for IPC IPC - Implementation Short IPC Short IPC (uniprocessor) Short IPC (uniprocessor) call system-call preamble (disable intr) identify dest thread and check same chief / no ipc redirection? -to-receive? analyze msg and transfer short: no action required switch to dest thread & address space system-call postamble running wait to receive system-call pre (disable intr) identify dest thread and check same chief / no ipc redirection? -to-receive? analyze msg and transfer short: no action required switch to dest thread & address space system-call post wait to receive running The critical path 1

2 Short IPC (uniprocessor) send (eagerly) Short IPC (uniprocessor) send (lazily) system-call pre (disable intr) system-call pre (disable intr) identify dest thread and check identify dest thread and check same chief / no ipc redirection? same chief / no ipc redirection? -to-receive? -to-receive? running analyze msg and transfer wait to receive running analyze msg and transfer wait to receive short: no action required short: no action required switch to dest thread & address space switch to dest thread & address space running system-call post running running system-call post running IPC IPC EAX EAX ECX ECX EDX EDX EBX EBX ESI ESI EDI EDI EBP EBP ESP ESP EFLAGS EFLAGS EIP EIP CS ES CS ES SS FS SS FS DS GS DS GS IPC IPC EAX EAX ECX ECX EDX EDX EBX EBX ESI ESI EDI EDI EBP EBP ESP ESP EFLAGS EFLAGS EIP EIP CS ES CS ES SS FS SS FS DS GS DS GS 2

3 IPC IPC EAX ECX EDX EBX ESI EDI EBP ESP EAX ECX EDX EBX ESI EDI EBP ESP EFLAGS EIP EFLAGS EIP CS ES CS ES SS FS SS FS DS GS DS GS IPC Implementation Goal Note payload from green thread EAX ECX EDX EBX ESI EDI EBP ESP EFLAGS EIP CS ES SS FS DS GS Most frequent kernel op: short IPC thousands of invocations per second Performance is critical: structure IPC for speed structure entire kernel to support fast IPC What affects performance? cache line misses TLB misses memory references pipe stalls and flushes instruction scheduling Fast Path IPC Attributes for Fast Path Optimize for common cases write in assembler non-critical paths written in C++ but still fast as possible Avoid high-level language overhead: function call state preservation poor code optimizations We want every cycle possible! untyped message single runnable thread after IPC must be valid IPC call switch threads, originator blocks send phase: the target is waiting receive phase: the sender is not to couple, causing us to block no receive timeout 3

4 Avoid Memory References!!! Memory references are slow avoid in IPC: ex: use lazy scheduling avoid in common case: ex: timeouts Optimized Memory stack Also: hard-wire TLB entries for kernel code and data. Microkernel should minimize indirect costs cache pollution Single TLB entry. TLB pollution memory bus thread state UTCB cpu ID thread ID TCB state, grouped by cache lines. TLB Problem Avoid Table Lookups stack stack stack stack Walking a linked has a TLB footprint. thread ID thread no version virtual TCB area virtual TCB area TCB = TCB_area + (thread_no & TCB_size_mask) virtual addresses Validate Thread ID Branch Elimination Common case: -1 thread ID thread no version virtual TCB area slow = ~receiver->thread_state + (timeouts & 0xffff) + sender->resources + receiver->resources; if( slow ) enter_slow_path() Are the thread IDs equal? Reduces branch prediction foot print. Avoids mispredicts & stalls & flushes. Increases ncy for slow path Common case: 0 4

5 TCB Resources Message Transfer Resources bitfield 1 1 One bit per resource Fast path checks entire word if not 0, jump to resource handlers IBM PowerPC 750, 500 MHz, 32 registers Copy area Debug registers up to 10 physical registers virtual register copy loop Many cycles wasted on pipe flushes for privileged instructions. Slow Path vs. Fast Path Inter vs. Intra Address Space L4Ka::Pistachio IPC performance Pentium 3 L4Ka::Pistachio IPC performance Pentium cycles 300 Inter C-Path cycles Inter FastPath 200 Intra FastPath Inter FastPath number message registers number message registers IPC - Implementation Long IPC (uniprocessor) system-call preamble (disable intr) identify dest thread and check same chief -to-receive? analyze msg and transfer long/map: Preemptions possible! (end of timeslice, device interrupt ) Pagefaults possible! (in source and dest address space) Long IPC transfer message switch to dest thread & address space system-call postamble 5

6 Long IPC (uniprocessor) system-call pre (disable intr) identify dest thread and check same chief -to-receive? analyze msg and transfer long/map: lock both partners transfer message unlock both partners switch to dest thread & address space system-call post Preemptions possible! (end of timeslice, device interrupt ) Pagefaults possible! (in source and dest address space) Long IPC (uniprocessor) system-call pre (disable intr) identify dest thread and check same chief -to-receive? analyze msg and transfer long/map: lock both partners enable intr transfer message disable intr unlock both partners switch to dest thread & address space system-call post Preemptions possible! (end of timeslice, device interrupt ) Pagefaults possible! (in source and dest address space) Long IPC (uniprocessor) IPC - mem copy running system-call pre (disable intr) identify dest thread and check same chief -to-receive? wait Why is it needed? Why not share? Security analyze msg and transfer Need own copy long/map: Granularity locked running lock both partners enable intr locked wait Object small than a page or not aligned transfer message disable intr wait to receive unlock both partners switch to dest thread & address space running system-call post copy in - copy out copy in - copy out copy into kernel buffer copy into kernel buffer switch spaces 6

7 copy in - copy out copy into kernel buffer switch spaces copy out of kernel buffer costs for n words 2 2n r/w operations 3 n/8 cache lines 1 n/8 overhead cache misses (small n) 4 n/8 cache misses (large n) select dest area (4+4 M) select dest area (4+4 M) map into source AS (kernel) select dest area (4+4 M) map into source AS (kernel) copy data select dest area (4+4 M) map into source AS (kernel) copy data switch to dest space 7

8 problems multiple threads per AS mappings might change while message is copied How long to keep PTE? What about TLB? current AS invalidate PTE invalidate PTE flush TLB when leaving curr thread during ipc? flush TLB when leaving curr thread during ipc: current AS current AS Reestablishing temp mapping requires to store partner id and dest area address in the sender s tcb. Note: receiver s page mappings might have changed! when returning to thread during ipc: when returning to thread during ipc: current AS current AS 8

9 Start temp mapping: mytcb.partner := partner ; mytcb.waddr := dest 8M area base ; mypde.tmarea := destpde.destarea. optimization only: avoids second TLB flush if subsequent thread switch would flush TLB anyhow current AS Leave thread: if mytcb.waddr nil then mypde.tmarea := nil ; if dest AS = my AS then flush TLB fi fi. Close temp mapping: mytcb.waddr := nil. why? mypde.tmarea := nil?? Alternative method: Requires separation of TLB flush and load PT root! Does therefore not work reasonably on x86. Load PT root implicitly includes TLB flush on x86. current AS Leave thread: if mytcb.waddr nil then mypde.tmarea := nil ; flush TLB ; TLB flushed := true fi. Thread switch : if TLB just flushed then TLB flushed := false else flush TLB fi ; PT root :=... Page Fault Resolution: Page Fault Resolution: current AS current AS Page Fault Resolution: Page Fault Resolution: TM area PF: if mypde.tmarea = destpde.destarea then tunnel to (partner) ; access dest area ; tunnel to (my) fi ; mypde.tmarea := destpde.destarea. current AS current AS 9

10 Cost estimates 486 IPC costs [µs] Mach Copy in - copy out Temporary mapping Mach: copy in/out 400 R/W operations 2 2n 2n L4: temp mapping 300 Cache lines 3 n/8 2 n/8 Small n overhead cache misses n/ L4 + cache flush Large n cache misses Overhead TLB misses 5 n/8 3 n/8 0 n / words per page 100 L4 raw copy Startup instructions msg len Dispatching Dispatching topics: thread switch priorities preemption (to a specific thread) to next thread to be scheduled (to nil) implicitly, when ipc blocks time slices wakeups, interruptions timeouts and wake-ups time Switch to (): Thread A Dispatcher Thread switch to (dispatcher) select next thread, assume B switch to (B) Smaller stack per thread Dispatcher is preemptable Improved interrupt ncy if dispatching is time consuming Thread A Switch to (): tcb[a].sp := SP; SP := disp thread bottom. switch to (dispatcher) switch to (B) Dispatcher Thread select next thread, assume B Optimizations : disp thread is special no user mode, no own AS required Can avoid AS switch no id required Freedom from tcb layout conventions almost stateless (see priorities) No need to preserve internal state between invocations External state must be consistent costs (A B) costs (A disp B) costs (select next) costs( A disp A) are low Thread B Thread B SP := tcb[a].sp ; if B A then switch from A to B else return fi. Why?? 10

11 Switch to (): tcb[a].sp := SP; SP := disp thread bottom. Example: Simple Dispatch Thread A Dispatcher Thread switch to (dispatcher) select next thread, assume B Issue: If preempted, thread A is not in a good state whenever disp thread is left, stack has to be discarded! even if with intr or timer switch to (B) Thread B SP := tcb[a].sp ; if B A then switch from A to B else return fi. Why does this always work? Example: Simple Dispatch Example: Dispatch with Tick sp tcb B edi eax Local State x eip cs flg esp ss tcb A edi eax Local State x eip cs flg esp ss Dispatcher stack Local Variables Example: Dispatch with Tick Example: Dispatch with Tick sp sp tcb B edi eax Local State x eip cs flg esp ss tcb B edi eax Local State x eip cs flg esp ss tcb A Local State x eip cs flg esp ss tcb A edi eax Local State x eip cs flg esp ss Dispatcher Local stack State x eip cs flg esp ss Local Variables Dispatcher stack Local Variables 11

12 Example: Dispatch with Interrupt Example: Dispatch with Interrupt sp Int Thrd edi eax Local State x eip cs flg esp ss tcb A Local State x eip cs flg esp ss Dispatcher Local stack State x eip cs flg esp ss Local Variables Example: Dispatch with Interrupt Switch to (): dispatcher thread is also idle thread Thread A sp Int Thrd tcb A edi eax edi eax Local State Local State x x eip cs flg esp ss eip cs flg esp ss Dispatcher Thread switch to (dispatcher) select next thread, assume B switch to (B) Thread B B := A ; do B:= next (B) ; if B nil then return fi ; idle od. Dispatcher stack Priorities 0 (lowest) 255 hard priorities round robin per prio dynamically changeable tcb per prio current tcb per Optimization Priorities keep highest active prio do p := 255; do if current [p] nil then B := current [p] ; return fi ; p -= 1 until p < 0 od ; idle od. disp table Prio 50 do do p := 255; if current do [highest active p] nil then B := current if current [p] nil [highest active p] ; return then B := current [p] ; elif highest active p > 0 return then highest active p -= 1 fi ; else p -= 1 idle until p < 0 od ; fi idle od. od disp table Prio 50 12

13 Priorities, Preemption highest active p := max (new p, highest active p). Priorities, Preemption What happens when a prio falls empty? do do p := 255; if current do [highest active p] nil then B := current if current [p] nil [highest active p] ; return then B := current [p] ; elif highest active p > 0 return then highest active p -= 1 fi ; else p -= 1 idle until p < 0 od ; fi idle od. od disp table Prio 110 Prio 50 p= 110 intr/wakeup do if current [highest active p] nil then round robin if necessary; B := current [highest active p]; return doelif highest active p > 0 if then current highest active p -= [highest active p] 1 nil else then B := current [highest active p] ; idle fi return od elif. highest active p > 0 then highest active p -= 1 round else robin if necessary: if curr idle [hi act p].rem ts = 0 then curr [hi act p] := next ; fi current [hi act p].rem ts := new ts od fi. disp table Prio 110 Prio 50? Remaining time slice > 0? Priorities, Preemption Preemption What happens when a prio falls empty? Preemption, time slice exhausted do if current [highest active p] nil then round robin if necessary; B := current [highest active p]; return elif highest active p > 0 then highest active p -= 1 else idle fi od. disp table Prio 110 Prio 50 Remaining time slice > 0 do if current [highest active p] nil then round robin if necessary; B := current [highest active p]; return elif highest active p > 0 then highest active p -= 1 else idle fi od. disp table Prio 50 Remaining time slice = 0 round robin if necessary: if curr [hi act p].rem ts = 0 then curr [hi act p] := next ; current [hi act p].rem ts := new ts fi. round robin if necessary: if curr [hi act p].rem ts = 0 then curr [hi act p] := next ; current [hi act p].rem ts := new ts fi. Preemption Lazy Dispatching do if current [highest active p] nil then round robin if necessary; B := current [highest active p]; return elif highest active p > 0 then highest active p -= 1 else idle fi od. Preemption, time slice exhausted disp table Remaining time slice := new ts Prio 50 Thread state toggles frequently (per ipc) waiting delete/insert is expensive therefore: delete lazily from disp table Prio 50 round robin if necessary: if current [hi act p].rem ts = 0 then current [hi act p].rem ts := new ts ; current [hi act p] := next fi. 13

14 Lazy Dispatching Thread state toggles frequently (per ipc) waiting delete/insert is expensive therefore: delete lazily from Lazy Dispatching Thread state toggles frequently (per ipc) waiting delete/insert is expensive therefore: delete lazily from disp table disp table waiting waiting Prio 50 Prio 50 Lazy Dispatching Thread state toggles frequently (per ipc) waiting delete/insert is expensive therefore: delete lazily from disp table Lazy Dispatching Thread state toggles frequently (per ipc) waiting delete/insert is expensive therefore: delete lazily from Whenever reaching a non- thread, delete it from proceed with next disp table waiting Prio 50 Prio 50 Lazy Dispatching Thread state toggles frequently (per ipc) do waiting round robin if necessary; if current [highest delete/insert active p] nil is expensive then B := current therefore: [highest delete active p]; return elif highest active p > 0 lazily from then highest active p -= 1 else idle fi od. disp table Timeouts & Wakeups Operations: insert timeout raise timeout find next timeout delete timeout round robin if necessary: while curr [hi act p] nil do if curr [hi act p].state then delete from (curr [hi act p]) elif curr Prio 50 [hi act p].rem ts = 0 then curr [hi act p].rem ts := new ts else leave round robin if necessary fi ; curr [hi act p] := next ; od. timeout set completion, no timeout timeout expired raised-timeout costs are uncritical (occurr only after timeout exp time) most timeouts are never raised! t 14

15 Timeouts & Wakeups Idea 1: unsorted insert timeout costs: search + insert entry find next timeout costs: parse entire raise timeout costs: delete found entry delete timeout costs: delete entry too expensive cycles n cycles cycles cycles Timeouts & Wakeups too expensive Idea 2: sorted insert timeout costs: search + insert entry n/ cycles find next timeout costs: find head cycles raise timeout costs: delete head cycles delete timeout costs: delete entry cycles Timeouts & Wakeups Idea 3: sorted tree insert timeout costs: search + insert entry find next timeout costs: find head raise timeout costs: delete head delete timeout costs: delete entry too expensive too complicated log n cycles cycles cycles cycles Wakeup Classes Wakeup Classes now insert timeout (now + ) now soon soon now soon t now soon t 15

16 Wakeup Classes now Wakeup Classes now soon soon soon now t soon now t contains soon entries correction phase required contains entries correction phase required Wakeup Classes Timeouts & Wakeups now τ soon soon now soon t max s? (length of soon ) s timeouts to be raised in τ soon + new timeouts in τ soon s is small if τ soon is short enough Idea 4: unsorted wakeup classes insert timeout costs: select class + add entry cycles find next timeout costs: search soon class raise timeout costs: delete head delete timeout costs: delete entry raised-timeout costs are uncritical (occurr only after timeout exp time) BUT most timeouts are never raised! cycles cycles still too expensive s..n cycles Lazy Timeouts Lazy Timeouts insert (t 1 ) insert (t 1 ) delete timeout soon soon t 1 now soon t t 1 now soon t Late Late 16

17 Lazy Timeouts Lazy Sorting insert (t 1 ) delete timeout insert (t 2 ) t 2 soon now soon t Keep a sorted for fast lookup Don t sort on insert insert is common but timeouts are uncommon Sort lazily: sort when walking wakeup thus we sort only when necessary Late Incremental Sorting Combine the cost of sorting with cost of finding first thread to wake Problem: every addition to resets the sorted flag, and thus we must perform entire walk. But we want to avoid this. Alternative: maintain sorted, and unsorted. Merge the two s when necessary. merge can be incremental bubble sort iow: we keep a of new additions, so that we can remove the additions, without requiring a resort Issue How common is insertion compared to wake up searching/sorting? Very IPC more frequent than ticks Wakeup queues always unsorted Approach seems dubious Security Is your system secure? Security defined by policy Examples All users have access to all objects Physical access to servers is forbidden Users only have access to their own files Users have access to their own files, group access files, and public files (UNIX) 17

18 Security policy Specifies who has what type of access to which resources Authentication Authorization All access is via IPC What microkernel mechanisms are needed for security? How do we authenticate? How do we perform authorization? How do we implement arbitrary security policies? How do we enforce arbitrary security policies? Authentication Unforgeable thread identifiers Thread identifiers can be mapped to Tasks Users Groups Machines Domains Authentication is outside the microkernel, any policy can be implemented. Authorization Servers implement objects; clients access objects via IPC. Servers receive unforgeable client identities from the IPC mechanism Servers can implement arbitrary access control policy No special mechanisms needed in the microkernel Is this really true??? Example Policy: Mandatory Access Control Secure System Objects assigned security levels Top Secret, Secret, Classified, Unclassified TS > S > C > UC Subjects (users) assigned security levels Top Secret, Secret, Classified, Unclassified A subject (S) can read an object (O) iff level(s) >= level(o) A subject (S) can write an object (O) iff level(s) <= level(o) Client (UC) S C Server UC TS Client (TS) Client (C) Client (S) 18

19 Problem Conclusion S Server TS To control information flow we must control communication Client (UC) C UC Client (TS) We need mechanisms to not only implement a policy we must also be able to enforce a policy!!! Mechanism should be flexible enough to implement and enforce all relevant security policies. Client (C) Client (S) Clans & Chiefs Clans & Chiefs Within all system based on direct message transfer, protection is essentially a matter of message control. Using access control s can be done at the server level, but maintenance of large distributed access control s becomes hard when access rights change rapidly. The clan concept permits to complement the mentioned passive entity protection by active protection based on intercepting all communication of suspicious subjects. Aclan is a set of tasks headed by achief task. Inside the clan all messages are transferred freely and the kernel guarantees message integrity. But whenever a message tries to cross a clan s borderline, regardless of whether it is outgoing or incoming, it is redirected to the clan s chief. This chief may inspect the message (including the sender and receiver ids as well as the contents) and decide whether or not it should be passed to the destination to which it was addressed. Obviously subject restriction and local reference monitors can be implemented outside the kernel by means of clans. Since chief are tasks at user level, the clan concept allows more sophisticated and user definable checks as well as active control. Clans & Chiefs clan chief A clan is a set of tasks headed by a chief task Intra-Clan IPC clan chief tasks tasks Direct IPC by microkernel 19

20 Inter-Clan IPC clan chief Direction-Preserving Deceiving C 1 C 2 from T 2 T 1 T 2 T 3 T 4 C 3 T5 tasks Microkernel redirects IPC to next chief Chief (user task) can forward IPC or modify or Direction-Preserving Deceiving Direction-Preserving Deceiving C 1 from T 2 C 1 from T 2 C 2 from T 2 from T 2 C 2 from T 2 T 1 T 2 C 3 T 1 T 2 C 3 T 4 T 4 T 3 T5 T 3 T5 Direction-Preserving Deceiving Direction-Preserving Deceiving C 1 from T 2 C 1 from T 2 C 2 from T 2 C 2 Can I trust C from T 2 2? Yes! T 1 T 2 C 3 T 1 T 2 C 3 T 4 T 4 T 3 T5 T 3 T5 20

21 Direction-Preserving Deceiving Direction-Preserving Deceiving C 1 from T 2 C 1 from T 2 C 2 Can I trust C from T 2 2? Yes! C 2 Can I trust C from T 2 1? Yes! T 1 T 2 C 3 T 1 T 2 C 3 T 4 T 4 T 3 T5 T 3 T5 Direction-Preserving Deceiving Example C 1 T 1 T 2 from T 2 T 4 C 2 Can I trust C from T 2 1? Yes! C 3 C 1 T 1 T 2 C 2 Direct-Preserving-Deceiving (DPD) is a simple mechanism to realize security. Imagine the blue task is a tool you have from the Internet. Without DPD there is no relevant security. The blue thread T 3 wants to get some private information from T 1. T 3 T5 T 3 Example Example C 1 T 1 T 2 From T 2 C 2 Direct-Preserving-Deceiving (DPD) is a simple mechanism to realize security. Imagine the blue task is a tool you have from the Internet. Without DPD there is no relevant security. The blue thread T 3 want to get some private information from T 1. The chief C 2 can send an IPC to T 1 so it appears that it came from T 2. C 1 T 1 T 2 From T 2 C 2 Direct-Preserving-Deceiving (DPD) is a simple mechanism to realize security. Imagine the blue task is a tool you have from the Internet. Without DPD there is no relevant security. The blue thread T 3 want to get some private information from T 1. The chief C 2 can send an IPC to T 1 so it appears that it came from T 2. T 3 T 3 The important fact is that with DPD when T 1 gets an IPC from C 2 then he definitely knows that the message came from inside the clan C 2. Vice versa is the same. 21

22 Remote IPC Node A Node B Clans & Chiefs Remote IPC Multi-level security Debugging Heterogeneity Secure System using Clans & Chiefs Problems with Clans & Chiefs S C Server UC TS Chief Client (TS) Client (TS) Static A chief is assigned when task is started If we might want to control IPC, we must always assign a chief General case requires many more IPCs Chief Every task has its own chief Chief Client (UC) Chief Client (C) Client (C) Client (S) Client (S) Client (S) The most general system configuration If a pair could communicate freely we still require 3 IPCs where one would suffice Chief Client Chief IPC Redirection Chief Client Chief Client Client 22

23 IPC Redirection IPC Redirection For each source and destination we actually deliver to X, where X is one of: Destination Intermediary Invalid Intermediary If X is Destination We have a fast path when source and destination can communication freely Source Destination Source Destination IPC fails IPC Redirection If X is Invalid We have a barrier that prevents communication completely IPC Redirection If X is Intermediary Enforce security policy Monitor, analyze, reject, modify each IPC Audit communication Debug Intermediary Source Destination Source Destination IPC fails Deception To be able to transparently insert an intermediary, intermediaries must be able to deceive the destination into believing the intermediary is the source. An intermediary (I) can impersonate a source (S) in IPC to a destination (D) I [S]=> D Iff R(S,D) = I or R(S,D) = x and I[x]=>D Case 1 I[S]=>D if R(S,D) = I Intermediary Source From S Destination 23

24 Case 2 Secure System using IPC Redirection I[S]=>D if R(S,D) = x, and I[x]=>D Redirection Controller S Server TS C UC Intermediary Source X Destination From S Client (UC) Client (C) Client (TS) Client (TS) Client (S) Client (S) Client (S) Client (C) IPC Redirection can implement Clans & Chiefs Disadvantages and Issues Redirection Controller Chief Client (UC) S TS Server C UC Chief Client (C) Client (C) Client (TS) Chief Client (TS) Chief Client (S) Client (S) Client (S) The check for if impersonation is permitted is defined recursively Could be expensive to validate Dynamic insertion of transparent intermediaries is easy, removal is hard. There might be state along a path of intermediaries, redirection controller cannot know unless it has detailed knowledge and/or coordination with intermediaries. Cannot determine IPC path of an impersonated message as path may not exist after message arrives Centralized redirection controller Summary In microkernel based systems information flow is via communication Communication control is necessary to enforce security policy. Any mechanism for communication control must be flexible enough to implement arbitrary security policies. We examined two policy-free mechanisms to provide communication control Clans & Chiefs Redirection Neither is perfect Current research: Virtual Threads, Capabilities 24

µ-kernel Construction (12)

µ-kernel Construction (12) Review 1 Threading Thread state must be saved/restored on thread switch We need a Thread Control Block (TCB) per thread TCBs must be kernel objects TCBs implement threads We