FlexSC. Flexible System Call Scheduling with Exception-Less System Calls. Livio Soares and Michael Stumm. University of Toronto


1 FlexSC Flexible System Call Scheduling with Exception-Less System Calls Livio Soares and Michael Stumm University of Toronto

2 Motivation: The synchronous system call interface is a legacy of the single-core era. It is expensive! Costs are direct (mode switch) and indirect (processor structure pollution). FlexSC implements efficient and flexible system calls for the multicore era.

3 FlexSC overview: Two contributions: FlexSC and FlexSC-Threads. Results in: 1) MySQL throughput increase of up to 40% and latency reduction of 30%; 2) Apache throughput increase of up to 115% and latency reduction of 50%.

4 Performance impact of synchronous syscalls: Benchmark: Xalan from SPEC CPU 2006, which spends virtually no time in the OS. Platform: Linux on Intel Core i7 (Nehalem). Exceptions were injected at varying frequencies. Direct: emulates a null system call. Indirect: emulates a write() system call. Only user-mode time was measured; kernel time was ignored. Ideally, user-mode performance is unaltered.

5 Degradation due to sync. syscalls: Xalan (SPEC CPU 2006). [Plot: degradation (lower is faster), 0% to 70%, versus user-mode instructions between exceptions (1K to 100K, log scale); series: Direct and Indirect; Apache and MySQL are marked on the plot.] System calls can halve processor efficiency; the indirect cause is the major contributor.

6 Processor state pollution: Key source of performance impact. On a Linux write() call, up to 2/3 of the L1 data cache and data TLB are evicted. Kernel performance is equally affected: processor efficiency for OS code is also cut in half.

7 Synchronous system calls are expensive. [Diagram: User and Kernel domains.] Traditional system calls are synchronous and use exceptions to cross domains.

8 Alternative: side-step the boundary. [Diagram: User and Kernel domains.] Exception-less syscalls remove synchronicity by decoupling invocation from execution.

9 Benefits of exception-less system calls: Significantly reduce direct costs: fewer mode switches. Allow for batching: reduce indirect costs. Allow for dynamic multicore specialization: further reduce direct and indirect costs. [Diagram: User and Kernel domains.]

10 Exception-less interface: syscall page. [Diagram: syscall page entry with status SUBMIT, later DONE.]

    write(fd, buf, 4096);

    entry = free_syscall_entry();

    /* write syscall */
    entry->syscall = 1;
    entry->num_args = 3;
    entry->args[0] = fd;
    entry->args[1] = buf;
    entry->args[2] = 4096;
    entry->status = SUBMIT;

    while (entry->status != DONE)
        do_something_else();

    return entry->return_code;

11-12 Exception-less interface: syscall page (same code as slide 10). [Diagram: the entry in the syscall page is first marked SUBMIT, then DONE.]
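The syscall-page protocol on slides 10 to 12 can be sketched as a small user-space simulation. This is only an illustration under stated assumptions, not the FlexSC implementation: the states FREE, FILLING and BUSY, the page capacity, and the helpers post_write(), run_submitted_entries() and collect_result() are hypothetical names introduced here. run_submitted_entries() stands in for a kernel-side syscall thread and simply calls the ordinary POSIX write() function.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* SUBMIT and DONE appear on the slides; FREE, FILLING and BUSY are
 * assumptions added so that the sketch is complete. */
enum entry_status { FREE = 0, FILLING, SUBMIT, BUSY, DONE };

struct syscall_entry {
    int syscall;            /* system call number (1 is write on Linux x86-64) */
    int num_args;
    long args[6];
    _Atomic int status;
    long return_code;
};

#define PAGE_ENTRIES 64     /* hypothetical page capacity */
static struct syscall_entry syscall_page[PAGE_ENTRIES];

/* Claim a free entry, or return NULL if the page is full. */
struct syscall_entry *free_syscall_entry(void)
{
    for (size_t i = 0; i < PAGE_ENTRIES; i++) {
        int expected = FREE;
        if (atomic_compare_exchange_strong(&syscall_page[i].status,
                                           &expected, FILLING))
            return &syscall_page[i];
    }
    return NULL;
}

/* User side: post a write request without entering the kernel. */
struct syscall_entry *post_write(int fd, const void *buf, size_t len)
{
    struct syscall_entry *entry = free_syscall_entry();
    if (entry == NULL)
        return NULL;
    entry->syscall = 1;
    entry->num_args = 3;
    entry->args[0] = fd;
    entry->args[1] = (long)(intptr_t)buf;
    entry->args[2] = (long)len;
    atomic_store(&entry->status, SUBMIT);   /* publish after the arguments */
    return entry;
}

/* Stand-in for a syscall thread: execute every submitted entry and
 * return how many were executed. */
int run_submitted_entries(void)
{
    int executed = 0;
    for (size_t i = 0; i < PAGE_ENTRIES; i++) {
        struct syscall_entry *e = &syscall_page[i];
        int expected = SUBMIT;
        if (!atomic_compare_exchange_strong(&e->status, &expected, BUSY))
            continue;
        if (e->syscall == 1)
            e->return_code = write((int)e->args[0],
                                   (const void *)(intptr_t)e->args[1],
                                   (size_t)e->args[2]);
        else
            e->return_code = -1;    /* only write is handled in this sketch */
        atomic_store(&e->status, DONE);
        executed++;
    }
    return executed;
}

/* User side: read the result of a completed entry and recycle it. */
long collect_result(struct syscall_entry *entry)
{
    assert(atomic_load(&entry->status) == DONE);
    long rc = entry->return_code;
    atomic_store(&entry->status, FREE);
    return rc;
}
```

A caller would post with post_write(), keep doing other work while the status is not DONE, and then call collect_result(). In this single-threaded sketch the caller has to invoke run_submitted_entries() itself; on the slides that role belongs to the kernel.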

13 Syscall threads: Kernel-only threads that are part of the application process. They execute requests from the syscall page and are schedulable on a per-core basis.

14 System call batching: Request as many system calls as possible, switch to kernel mode, then start executing all posted system calls. This avoids direct and indirect costs, even on a single core.
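The effect of batching on mode switches can be illustrated with a toy model. Everything here is an assumption made for illustration: mode_switches is a simulated counter, execute() is a hypothetical stand-in for the kernel-side work of one call, and post_call() and enter_kernel_and_run_batch() are invented names. The sketch only counts user/kernel crossings; it does not measure real costs.

```c
#include <assert.h>
#include <stddef.h>

enum { FREE = 0, SUBMIT, DONE };

struct entry {
    int status;
    long arg;
    long ret;
};

#define PAGE_ENTRIES 16          /* hypothetical page capacity */
struct entry page[PAGE_ENTRIES];
int mode_switches;               /* simulated user/kernel crossings */

/* Hypothetical stand-in for the kernel-side work of one system call. */
static long execute(long arg)
{
    return arg * 2;
}

/* Synchronous path: every call pays for its own mode switch. */
long sync_call(long arg)
{
    mode_switches++;
    return execute(arg);
}

/* Exception-less path: post a request without any mode switch.
 * Returns the entry index, or -1 if the page is full. */
int post_call(long arg)
{
    for (int i = 0; i < PAGE_ENTRIES; i++) {
        if (page[i].status == FREE) {
            page[i].arg = arg;
            page[i].status = SUBMIT;
            return i;
        }
    }
    return -1;
}

/* One mode switch, after which every posted entry is executed.
 * Returns the number of entries executed. */
int enter_kernel_and_run_batch(void)
{
    int executed = 0;
    mode_switches++;
    for (int i = 0; i < PAGE_ENTRIES; i++) {
        if (page[i].status == SUBMIT) {
            page[i].ret = execute(page[i].arg);
            page[i].status = DONE;
            executed++;
        }
    }
    return executed;
}
```

In this model, four calls through sync_call() raise the counter by four, while four post_call() requests followed by one enter_kernel_and_run_batch() raise it by one.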

15 Dynamic multicore specialization: FlexSC makes specializing cores simple and dynamically adapts to workload needs.

16 What programs can benefit from FlexSC? Event-driven servers (e.g., memcached, nginx webserver): use asynchronous calls, similar to FlexSC; can use FlexSC directly; can mix sync and exception-less system calls. Multi-threaded servers: FlexSC-Threads, a thread library compatible with Pthreads; no changes to app. code or recompilation required; transparently converts legacy syscalls into exception-less ones.

17 FlexSC-Threads library: Hybrid (M-on-N) threading model: one kernel-visible thread per core, many user threads per kernel-visible thread. Redirects system calls (libc wrappers): posts the exception-less syscall to the syscall page, switches to another user-level thread, and resumes the thread upon syscall completion. Provides the benefits of exception-less syscalls while maintaining the sequential syscall interface.

18-20 FlexSC-Threads in action. [Diagram: User and Kernel domains.] On a syscall: post request to system call page, block user-level thread, switch to next ready thread.

21 FlexSC-Threads in action. [Diagram: User and Kernel domains.] If all user-level threads become blocked: 1) enter kernel; 2) wait for completion of at least 1 syscall.
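The scheduling rule on the "FlexSC-Threads in action" slides can be sketched as a simulation. It is a simplification under stated assumptions, not the library's code: user-level threads are modeled as counters of remaining system calls rather than real contexts, struct uthread, struct run_stats and run_user_threads() are hypothetical names, and one kernel entry is assumed to complete every pending call (the slide only requires at least one).

```c
#include <assert.h>

enum uthread_state { READY, BLOCKED, FINISHED };

/* A user-level thread, reduced to the number of system calls it still
 * wants to issue. */
struct uthread {
    enum uthread_state state;
    int remaining_syscalls;
};

struct run_stats {
    int syscalls_completed;
    int kernel_entries;
};

/* Run n user-level threads on one kernel-visible thread. */
struct run_stats run_user_threads(struct uthread *threads, int n)
{
    struct run_stats stats = {0, 0};

    for (;;) {
        int blocked = 0;

        /* Run each ready thread until its next system call. */
        for (int i = 0; i < n; i++) {
            struct uthread *t = &threads[i];
            if (t->state != READY)
                continue;
            if (t->remaining_syscalls == 0) {
                t->state = FINISHED;
                continue;
            }
            /* On a syscall: post the request, block this thread, and
             * let the loop switch to the next ready thread. */
            t->remaining_syscalls--;
            t->state = BLOCKED;
        }

        for (int i = 0; i < n; i++)
            if (threads[i].state == BLOCKED)
                blocked++;
        if (blocked == 0)
            break;              /* every thread has finished */

        /* All runnable threads are blocked: enter the kernel once and
         * wait. Assumption: all pending calls complete on this entry. */
        stats.kernel_entries++;
        for (int i = 0; i < n; i++) {
            if (threads[i].state == BLOCKED) {
                threads[i].state = READY;
                stats.syscalls_completed++;
            }
        }
    }
    return stats;
}
```

With three threads that issue 2, 2 and 1 calls, this model completes five system calls with two kernel entries, whereas a synchronous interface would enter the kernel five times.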

22 Evaluation: Linux on a Nehalem (Core i7) server, 2.3GHz, 4 cores on a chip. Clients connected on a 1 Gbps network. Workloads: Sysbench on MySQL (80% user, 20% kernel); ApacheBench on Apache (50% user, 50% kernel). Default Linux NPTL ("sync") vs. FlexSC-Threads ("flexsc").

23 Sysbench: OLTP on MySQL (1 core). [Plot: throughput (requests/sec.) versus request concurrency; series: flexsc and sync.]

24 Sysbench: OLTP on MySQL (4 cores). [Plot: throughput (requests/sec.) versus request concurrency; series: flexsc and sync.]

25 MySQL latency per client request. [Plot: latency (ms), 95th percentile and average, for sync and flexsc on 1, 2, and 4 cores.] Up to 30% reduction of average request latencies.

26 MySQL processor metrics: SysBench (4 cores). [Plot: relative performance (flexsc/sync) for user and kernel mode; metrics: IPC, L3, L2, d-cache, i-cache, TLB, Branch.] Performance improvements are a consequence of more efficient processor execution.

27 ApacheBench throughput (1 core). [Plot: throughput (requests/sec.), axis up to 45,000, versus request concurrency; series: flexsc and sync.] 80-90% improvement.

28 ApacheBench throughput (4 cores). [Plot: throughput (requests/sec.), axis up to 45,000, versus request concurrency; series: flexsc and sync.]

29 Apache latency per client request. [Plot: latency (ms), 99th percentile and average, for sync and flexsc on 1, 2, and 4 cores.] Up to 50% reduction of average request latencies.

30 Apache processor metrics: Apache (1 core). [Plot: relative performance (flexsc/sync) for user and kernel mode; metrics: IPC, L3, L2, d-cache, i-cache, TLB, Branch.] Processor efficiency doubles for kernel and user-mode execution.

31 Discussion: New OS architecture not necessary: exception-less syscalls can coexist with legacy ones. Foundation for non-blocking system calls: select() / poll() in user-space; interesting case of non-blocking free(). Multicore ultra-specialization: TCP servers (Rutgers; Iftode et al.), FS servers. Single-ISA asymmetric cores: OS-friendly cores (HP Labs; Mogul et al.).

32 Concluding Remarks: System calls degrade server performance: processor pollution is inherent to synchronous system calls. Exception-less syscalls: flexible and efficient system call execution. FlexSC-Threads: leverages exception-less syscalls; no modifications to multi-threaded applications. Throughput & latency gains: 2x throughput improvement for Apache and BIND; 1.4x throughput improvement for MySQL.



Exception-Less System Calls for Event-Driven Servers Livio Soares and Michael Stumm University of Toronto Talk overview At OSDI'10: exception-less system calls Technique targeted at highly threaded servers