Inter-Syscall Parallelism

Jie Liao, Aaron Carroll, Lin Zhong
Rice University

Abstract

The thread model used by commodity OSes like Linux allows a thread to issue syscalls only sequentially. The resulting sequential syscall execution can become a significant performance bottleneck. In this paper, we study the potential of executing independent syscalls issued by a thread in parallel, or inter-syscall parallelism. We build a suite of analysis tools to accurately trace syscall execution times and, using a model of the system API dependencies, predict the potential speedup from inter-syscall parallelism in common workloads. Our analysis shows 69%-109% potential speedup on two event-driven servers, and 40% speedup on several CLI utilities. We present libasyncos, a user-space library that helps developers realize inter-syscall parallelism while retaining a sequential programming model. Our benchmarks achieve 9%-94% speedup on an epoll-based network server, and up to 34% speedup on the conventional cp program. We also present an early study showing that it is feasible to use compiler techniques to automate the exploitation of inter-syscall parallelism. Our results further suggest that inter-syscall parallelism should be exploited by kernel-space mechanisms.

1 Introduction

As single-core performance plateaus, modern processors get more computational power from having more cores. This creates a strong incentive to understand and exploit parallelism in software. While many have sought to parallelize user-space workloads, kernel-space work remains sequential within a single thread. In the thread model used by commodity OSes like Linux, a thread issues syscalls sequentially. The resulting sequential kernel execution can become a significant performance bottleneck if that thread issues many syscalls. In this paper, we study a new form of kernel-space parallelism for a single user thread, inter-syscall parallelism, which executes independent syscalls from that thread in parallel. Inter-syscall parallelism is orthogonal to and more fine-grained than thread-level parallelism: it can improve the performance of a single thread, which also benefits multithreaded programs by improving the performance of each of their threads.

We make two contributions in this paper. First, we present the first reality check of inter-syscall parallelism (§2). We not only show that inter-syscall parallelism is abundant in popular programs but also reveal common patterns of inter-syscall parallelism. In doing so, we report a suite of tools, tracelite and sc-analyzer, that estimate possible performance gains from inter-syscall parallelism based on syscall execution traces of common OS workloads. Second, we present our early results in exploiting inter-syscall parallelism to improve the performance of a user thread. In §3, we present libasyncos, a user-space library that allows developers to exploit inter-syscall parallelism while retaining the sequential programming model. We show that libasyncos can speed up an epoll-based network server by 9%-94% and the conventional cp program by up to 34% on a four-core machine. The limitations of our user-space solution motivate our ongoing realization of a kernel-space design, described in §5.

2 Reality Check

In this section, we answer three questions regarding the reality of inter-syscall parallelism: How abundant is inter-syscall parallelism in common OS workloads? (§2.2) What code patterns of inter-syscall parallelism are there? (§2.3)
What factors limit the benefit of inter-syscall parallelism? (§2.4)

To answer these questions, we build a suite of analysis tools to collect the syscall execution traces of a workload and to estimate the potential performance speedup from inter-syscall parallelism based on those traces. We show that inter-syscall parallelism is abundant in the event-driven servers memcached and nginx, and in some CLI utilities such as dd and cp. We also present two code patterns that help developers achieve inter-syscall parallelism.

2.1 Analysis tools

Our analysis tools, tracelite and sc-analyzer, trace syscall execution times and reveal potential inter-syscall parallelism opportunities.

tracelite is a kernel-based mechanism for tracing syscall execution times. It inserts hooks into the architecture-specific syscall entry and exit points, and records the syscall number, arguments, return/error values, and entry/exit timestamps of a user thread. The traced data is stored in a Feather-Trace [3] buffer. tracelite exposes an interface to user space via debugfs through which users can enable/disable tracing, set options such as the target thread ID, and read out the traced data.
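To make the traced data concrete, the sketch below shows one plausible shape of a per-syscall record and a user-space reader. The record layout and the debugfs path are our assumptions for illustration only; they are not tracelite's actual interface.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical layout of one tracelite record; field names are assumptions. */
    struct sc_record {
        uint64_t entry_ns;   /* timestamp at syscall entry */
        uint64_t exit_ns;    /* timestamp at syscall exit  */
        int64_t  ret;        /* return value or -errno     */
        uint64_t args[6];    /* raw syscall arguments      */
        uint32_t nr;         /* syscall number             */
        uint32_t tid;        /* traced thread ID           */
    };

    int main(void)
    {
        /* Assumed debugfs path; the real interface may differ. */
        FILE *f = fopen("/sys/kernel/debug/tracelite/trace", "rb");
        if (!f) { perror("fopen"); return 1; }

        struct sc_record r;
        while (fread(&r, sizeof r, 1, f) == 1)
            printf("tid %u nr %u ret %lld dur %llu ns\n",
                   r.tid, r.nr, (long long)r.ret,
                   (unsigned long long)(r.exit_ns - r.entry_ns));
        fclose(f);
        return 0;
    }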

sc-analyzer is a static Python tool for predicting the potential speedup from inter-syscall parallelism. It generates a syscall dependency graph from the syscall execution traces collected by tracelite, and predicts the performance speedup that could be achieved by overlapping syscall executions. In doing so, sc-analyzer removes user-level data dependencies between syscalls, such as those between the read() and write() of copy programs, by assuming unnecessary buffer reuse is avoided, and removes false file descriptor dependencies by assigning unique file descriptors to all opened files. sc-analyzer calculates the speedup of a program in terms of speed, the reciprocal of the program's execution time; speedup is the percentage improvement of the speed with inter-syscall parallelism over the original speed.

While tracelite traces accurate syscall times, sc-analyzer can either overestimate or underestimate the performance speedup. The speedup can be overestimated because sc-analyzer assumes unlimited hardware resources and no sharing between independent syscalls; on a real platform the number of cores is limited, and independent syscalls can share a cache line or wait on the same lock. Moreover, avoiding buffer reuse and assigning unique file descriptors are not always correct, because sc-analyzer only considers dependencies between syscalls, not those between syscalls and the computation code. On the other hand, because sc-analyzer is a static tool, it can also underestimate the speedup, as program execution can change dynamically with a specific optimization and hardware context, e.g., with CPU utilization.
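At its core, sc-analyzer's prediction is a critical-path computation over the syscall dependency graph: the sequential time is the sum of all syscall durations, while the parallel time is the length of the longest dependency chain. Below is a minimal C sketch of that estimate over a toy trace; the toy trace, the dependency matrix, and all names are ours for illustration, not sc-analyzer's code (which is written in Python).

    #include <stdio.h>

    #define N 4

    int main(void)
    {
        /* Toy trace: per-syscall execution times in microseconds. */
        double dur[N] = { 5.0, 7.0, 6.0, 4.0 };
        /* dep[i][j] != 0 means syscall j depends on syscall i (trace order, i < j). */
        int dep[N][N] = {
            { 0, 1, 0, 0 },   /* 1 depends on 0, e.g., read() after open() */
            { 0, 0, 0, 0 },
            { 0, 0, 0, 1 },   /* 3 depends on 2 */
            { 0, 0, 0, 0 },
        };

        double seq = 0.0, par = 0.0, finish[N];
        for (int j = 0; j < N; j++) {
            double start = 0.0;
            for (int i = 0; i < j; i++)
                if (dep[i][j] && finish[i] > start)
                    start = finish[i];          /* wait for all predecessors */
            finish[j] = start + dur[j];
            seq += dur[j];
            if (finish[j] > par)
                par = finish[j];                /* critical-path length */
        }
        printf("sequential %.1f us, parallel %.1f us, speedup %.0f%%\n",
               seq, par, (seq / par - 1.0) * 100.0);
        return 0;
    }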
2.2 Inter-syscall parallelism opportunities

To study how abundant inter-syscall parallelism is, we use tracelite and sc-analyzer to analyze the potential speedup from inter-syscall parallelism for a number of desktop and server workloads. Our benchmarks are run on a quad-core Core-i7 CPU with a maximum frequency of 3.4 GHz. Some workloads see promising speedup, showing abundant opportunities for inter-syscall parallelism. The best of them, Memcached and Nginx, see speedup between 69% and 109%, and the CLI utilities grep, dd, and cp see speedup around 40%. Table 1 briefly describes all the benchmarks we have tried.

Table 1: Benchmarks used in our reality check of inter-syscall parallelism. We cover a wide range of desktop and server workloads to seek inter-syscall parallelism opportunities. The disk cache is warmed up for all I/O operations.

  Network servers:
    Memcached    - Event-driven key-value store server, v1.4.20, running memaslap v1.0 and memtier as clients to generate load.
    Nginx        - Event-driven HTTP server and reverse proxy, v1.6.0, running ApacheBench v2.3 and httperf v0.8 as clients to generate load.
    Apache       - Multithreaded HTTP web server, v2.4.9, running ApacheBench v2.3 as client to generate load.
  Database:
    PostgreSQL   - Relational database system, v9.3.4, running pgbench as client to create tables and perform queries.
  JavaScript engines:
    SpiderMonkey - Mozilla's JavaScript engine, running the scripts string-validate-input and 3-raytrace.
    V8           - Chrome's JavaScript engine, running the scripts v8-splay and v8-deltablue.
  Compiler:
    GCC          - Compiling a simple "Hello World" program with the compile phase (cc1).
  Media processing:
    ffmpeg       - Encoding a WAV file to MP3.
  CLI utilities:
    grep, dd, cp, tar, gzip, ls - Running self-defined test cases.

As discussed in §1, inter-syscall parallelism targets single-thread performance, so our analysis focuses on the syscall execution of a single thread, and all the benchmarks are configured to be single-threaded. For all I/O operations, the disk caches are warmed up before the traces are collected. Disk access latency has a significant impact on the performance of I/O operations; in studying the opportunities of inter-syscall parallelism we defer that discussion to §3.3. Figure 1 shows the speedup predicted by sc-analyzer and the percentage of total time spent executing syscalls for all the benchmarks.

Figure 1: Potential percent speedup predicted by sc-analyzer and percentage of syscall execution time for all benchmarks on Core-i7, sorted by speedup. Memcached and Nginx see the highest potential speedup, and CLI utilities grep, dd, cp, and tar also have promising speedup.

As shown in Figure 1, promising speedup is seen in the two event-driven servers Memcached and Nginx, and in four CLI utilities: grep, dd, cp, and tar. Most benchmarks show a strong positive correlation between speedup and the percentage of time spent in syscalls. However, PostgreSQL has high syscall execution time but very low speedup from exploiting inter-syscall parallelism, because the recvfrom() syscall in a PostgreSQL thread takes most of the syscall execution time yet cannot be parallelized since it blocks. Similarly, an Apache thread spends more than half of its execution time on syscall execution, but it does not see high speedup because the thread spends much of its time waiting for and acquiring a lock before issuing syscalls to serve client requests. Some workloads have very low speedup because they spend most of their execution time in user space, e.g., the JavaScript engines, ffmpeg, and gzip.

2.3 Code patterns of inter-syscall parallelism

Based on the results of our syscall analysis in §2.2 and a study of the POSIX API, we identify two code patterns of inter-syscall parallelism: parallel iteration and loop pipelining. These two patterns are the common cases that contribute most of the speedup in §2.2. They also provide developers with useful optimization hints on how to exploit inter-syscall parallelism in their applications. There are also cases where syscalls in straight-line sequential code have no dependencies and can be executed concurrently, but these cases do not form a common pattern and should be dealt with on a case-by-case basis.

Parallel iteration: Programs often loop through a set of objects, such as files, directory entries, and sockets. While the syscalls within each iteration are often dependent, syscalls across different iterations sometimes are not. These syscalls can be parallelized by executing several iterations concurrently on multiple CPU cores, i.e., using multiple threads. For example, event-driven programs like Memcached loop through the sockets with incoming requests after an epoll_wait() call in order to serve each available request, as shown in Listing 1. Different iterations over different sockets can indeed run in parallel, since the requests are independent, as shown in Listing 2.

    n = epoll_wait();
    for (i = 0; i < n; i++) {
        read(fd[i]);      /* thread 0 */
        sendmsg(fd[i]);   /* thread 0 */
    }

Listing 1: Syscall trace snippet from Memcached; iterations of the epoll loop are executed sequentially within one thread.

    n = epoll_wait();
    read(fd[0]);          /* thread 0   */
    sendmsg(fd[0]);       /* thread 0   */
    read(fd[1]);          /* thread 1   */
    sendmsg(fd[1]);       /* thread 1   */
    ...
    read(fd[n-1]);        /* thread n-1 */
    sendmsg(fd[n-1]);     /* thread n-1 */

Listing 2: Restructured code of Listing 1; different iterations of the epoll loop can be executed in parallel.

Loop pipelining: In cases where dependencies do exist across iterations, the early syscalls of one iteration often do not depend on the later syscalls of the previous iteration. For example, in CLI utilities that copy data, such as cp, dd, and tar, the copying process consists of a loop that reads data from the source file into a buffer and then writes from the same buffer to the destination file, causing a data dependency between read() and write(), as shown in Listing 3.

    while (!finish) {
        read(fd1, buf);   /* thread 0 */
        write(fd2, buf);  /* thread 0 */
    }

Listing 3: Syscall trace snippet from cp; all read() and write() calls of the copy loop are executed sequentially.

    read(fd1, buf1);
    while (!finish) {
        write(fd2, buf1); /* thread 0 */
        read(fd1, buf2);  /* thread 1 */
        exchange buf1 and buf2;
    }
    write(fd2, buf1);

Listing 4: Restructured code of Listing 3; the read() and write() within one iteration of the new copy loop operate on different buffers and file descriptors, and thus can be executed in parallel.
In fact, the data dependency can be removed by using two buffers, each of which serves read() or write() independently. If we restructure the copy loop shown in Listing 3 into the code shown in Listing 4, the read() and write() within each iteration become independent and thus can be executed in parallel.
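To make the coordination in Listing 4 concrete, here is a minimal double-buffered copy written with two POSIX threads (compile with -pthread). It is our own illustration of the pattern, not the modified cp evaluated in §3; error handling and short-write handling are omitted.

    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    #define BLK 65536

    static char buf[2][BLK];
    static ssize_t nread;           /* block size produced by the reader; the handoff
                                       is synchronized by the barrier below */
    static int in, out;
    static pthread_barrier_t bar;

    /* Reader: in round r it fills buf[(r + 1) % 2] while the writer drains buf[r % 2]. */
    static void *reader(void *arg)
    {
        (void)arg;
        for (int r = 0; ; r++) {
            nread = read(in, buf[(r + 1) % 2], BLK);
            pthread_barrier_wait(&bar);            /* hand the block to the writer */
            if (nread <= 0)
                return NULL;
        }
    }

    int main(int argc, char **argv)
    {
        if (argc != 3) { fprintf(stderr, "usage: %s src dst\n", argv[0]); return 1; }
        in  = open(argv[1], O_RDONLY);
        out = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (in < 0 || out < 0) { perror("open"); return 1; }

        ssize_t n = read(in, buf[0], BLK);         /* prime the pipeline (Listing 4, line 1) */
        if (n > 0) {
            pthread_barrier_init(&bar, NULL, 2);
            pthread_t t;
            pthread_create(&t, NULL, reader, NULL);
            for (int r = 0; ; r++) {
                write(out, buf[r % 2], (size_t)n); /* overlaps with the reader's read() */
                pthread_barrier_wait(&bar);        /* wait for the next block */
                n = nread;
                if (n <= 0)
                    break;
            }
            pthread_join(t, NULL);
        }
        close(in);
        close(out);
        return 0;
    }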

2.4 Inter-syscall parallelism is not a panacea

As shown in Figure 1, many workloads have very low speedup; inter-syscall parallelism does not help all workloads. The following cases describe when inter-syscall parallelism does not help.

First, programs that spend a significantly large portion of their execution time in user space will not benefit much from inter-syscall parallelism, because it only reduces kernel time. As shown in Figure 1, the JavaScript engines, gzip, and ffmpeg have very low potential speedup because they spend more than 95% of their execution time in user space.

Second, even with a high portion of kernel execution time, programs that block waiting on I/O can have limited speedup. In our database benchmark, the blocking recvfrom() syscall dominates the overall execution time but also limits the speedup because it cannot be parallelized.

Third, programs that present a high degree of inter-syscall parallelism can still have limited speedup when one of a set of concurrently executed syscalls takes significantly longer than the others. For example, we can pipeline the read()/write() copy loops in programs like cp and dd to make them run faster, but if write() takes much longer than read() (or vice versa), the speedup will be low because write() determines the overall execution time. Similarly, cold disk caches can hurt the speedup, because they make some disk I/O syscalls take significantly longer than the others.

Last, executing syscalls on different cores can cause cache pollution and limit the speedup. For example, executing two instances of the same but independent syscall sequentially on a single core can improve cache locality, making the second one run much faster than the first; executing them concurrently on different cores can make both run slowly, potentially reducing overall performance.

3 User-space Design: libasyncos

The results of the previous analysis show that reasonable speedup can be achieved by parallelizing syscall execution for some workloads. To exploit inter-syscall parallelism for these workloads, we developed libasyncos, a user-space library that allows developers to realize inter-syscall parallelism while retaining the sequential programming model. Our early results show 9% to 94% speedup on an epoll-based network server, and up to 34% speedup on the cp program from GNU Coreutils v8.23. We briefly describe the design of libasyncos in §3.1, present our early results in §3.2, and discuss the limitations of libasyncos in §3.3.

Figure 2: Working flow of libasyncos. issue() requests execution of a syscall sc, passes the request to the shared syscall job queue (POSTED), and returns a handle to that syscall instance. When the result is required, complete() is called, which blocks until the specified syscall has completed (COMPLETED) and returns the result of the original call sc. The two example issue()s and complete()s in the figure are not necessarily related.

3.1 libasyncos Design

Two basic API designs exist for asynchronous syscall execution: event-driven programming and issue/complete semantics. In event-driven programming, a callback function is registered with each syscall invocation.
When the syscall completes, the program is preempted and execution resumes in the callback handler. Event-driven programming requires significant engineering effort to restructure sequential programs, and it suffers from concurrency complexity since callback handlers can be invoked at any point. Therefore we implement the issue/complete semantics, which requires little modification to the original source code. Specifically, libasyncos breaks one syscall invocation sc(arg0, ..., argn) into two basic APIs:

    issue(sc, arg0, ..., argn)    and    complete(handle)

libasyncos decouples the syscall request (issue()) from result collection (complete()), so that multiple syscalls can be outstanding at the same time and thus executed in parallel. With libasyncos, syscalls are executed in the context of per-core worker threads that share a single syscall job queue in the same address space as the user thread. Figure 2 shows the working flow of libasyncos: issue() requests execution of a syscall sc, passes the request to the job queue, and returns a handle to that syscall instance. When the result is required, complete() is called, which blocks until the specified syscall has completed and returns the result of the original call sc. Thus, complete(issue(sc, ...)) is equivalent to sc(...). With minimal effort to rewrite the code, developers retain the sequential programming model.
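As an illustration, the Memcached-style epoll loop of Listing 1 could be rewritten with this API roughly as follows. This is a hedged sketch: the issue() and complete() prototypes (modeled here after syscall(2)) and the handle type are our assumptions, since the full libasyncos signatures are not given in this paper.

    #include <sys/epoll.h>
    #include <sys/syscall.h>   /* SYS_read, SYS_sendmsg */

    /* Assumed libasyncos interface; the real prototypes may differ. */
    typedef long handle_t;
    handle_t issue(long nr, ...);
    long complete(handle_t h);

    void serve_ready(int epfd, struct epoll_event *ev, int max, char bufs[][512])
    {
        int n = epoll_wait(epfd, ev, max, -1);
        handle_t rd[n > 0 ? n : 1];

        /* Post every ready connection's read() before completing any of them, so
           independent requests execute in parallel on the worker threads. */
        for (int i = 0; i < n; i++)
            rd[i] = issue(SYS_read, (long)ev[i].data.fd,
                          (long)bufs[i], (long)sizeof bufs[i]);

        for (int i = 0; i < n; i++) {
            long got = complete(rd[i]);   /* block only when the result is needed */
            if (got <= 0)
                continue;
            /* ... parse bufs[i], build a reply, then e.g.:
               complete(issue(SYS_sendmsg, (long)ev[i].data.fd, (long)&reply, 0L)); */
        }
    }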

Notably, the existing POSIX asynchronous I/O (AIO) interface offers both event-driven and issue/complete semantics: a read or write is enqueued by calling aio_read() or aio_write(), and I/O completion can be signaled by a blocking call to aio_suspend(), by polling aio_error(), or via a callback function. But AIO only supports reads and writes, whereas we target asynchronous execution of generic syscalls.

3.2 Early results

We exploit inter-syscall parallelism with libasyncos in two benchmarks: an epoll-based network server and the cp program. We choose these two benchmarks because they exhibit the code patterns described in §2.3, which contribute promising speedup and which developers can realize by restructuring their code with minimal effort. Our results show that libasyncos can achieve reasonable speedup with inter-syscall parallelism.

3.2.1 Experiment setup

The network server uses epoll_wait() to wait for incoming requests, loops through the available sockets, read()s from each socket, and send()s back a small message. The client, connected to the server over a 10 Gb Ethernet network, spawns a large number of threads, each of which issues a fixed number of requests to the server using regular synchronous syscalls. We run the server and client on two separate machines, each with a quad-core Core-i7 CPU at a maximum frequency of 3.4 GHz. We benchmark three versions of the network server: the original single-threaded version, a fully multithreaded version with incoming connections evenly distributed across per-core worker threads, and a version with libasyncos in which we exploit inter-syscall parallelism using the parallel iteration pattern (§2.3). Figure 3 shows the throughput of the three servers with various numbers of client threads.

Figure 3: Throughput of the epoll-based network server with libasyncos and the multithreaded version against the original single-threaded version. libasyncos effectively exploits inter-syscall parallelism, but performs worse than the multithreaded server.

The cp program is taken from GNU Coreutils v8.23. We modify cp using libasyncos to exploit the loop pipelining pattern (§2.3), and compare its execution speed against that of the original cp. We run both versions on the same machine with a quad-core Core-i7 CPU at a maximum frequency of 3.4 GHz. Before collecting results, the disk caches are warmed up by performing the copy several times, and the target files are removed from the file system (but not synchronized to disk). Figure 4 shows the speedup of cp with libasyncos against the original cp with various block sizes and file sizes.

Figure 4: Speedup of cp with libasyncos against the original cp, tested with different block sizes and file sizes. libasyncos improves the performance of cp when copying large files, but can also degrade the performance when copying small files.

3.2.2 libasyncos can effectively exploit inter-syscall parallelism

Our results show that libasyncos achieves reasonable speedup with inter-syscall parallelism. As shown in Figure 3, libasyncos achieves 94% speedup on the network server with light workloads, and 9% speedup with heavy workloads that saturate the server. libasyncos is not able to accelerate the processing of each single request to the server, because the syscalls within a single request are dependent; but it still improves the overall throughput. Also, as shown in Figure 4, libasyncos can make cp up to 34% faster than the original cp. When the file size is larger than 2 MB, cp with libasyncos always outperforms the original cp.
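For reference, the loop-pipelining rewrite of the cp copy loop looks roughly as follows with the issue()/complete() API. As before, the prototypes and the handle type are our assumptions, not the exact code we benchmarked; error handling is omitted.

    #include <sys/syscall.h>   /* SYS_read, SYS_write */

    /* Assumed libasyncos interface, as in the earlier sketch. */
    typedef long handle_t;
    handle_t issue(long nr, ...);
    long complete(handle_t h);

    #define BLK 65536

    /* Pipelined copy: the write() of one buffer overlaps with the read() into the
       other buffer, mirroring Listing 4. */
    static void copy_pipelined(int in, int out)
    {
        static char buf[2][BLK];
        long n = complete(issue(SYS_read, (long)in, (long)buf[0], (long)BLK));

        for (int i = 0; n > 0; i ^= 1) {
            handle_t w = issue(SYS_write, (long)out, (long)buf[i], n);
            handle_t r = issue(SYS_read,  (long)in,  (long)buf[i ^ 1], (long)BLK);
            complete(w);        /* buf[i] is only reused two iterations later */
            n = complete(r);    /* size of the block now sitting in buf[i ^ 1] */
        }
    }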
3.2.3 libasyncos is not always beneficial

However, libasyncos is not always beneficial. Figure 3 shows a relatively high throughput gain (over 90%) when the workloads are light, but with heavy workloads libasyncos only achieves around 10% speedup. This is fairly low compared with the speedup of the multithreaded server (50%), which represents the performance upper bound for this server. The reason is that libasyncos pays extra cost to issue syscalls through an additional library layer and to synchronize the syscalls belonging to the same request. Also, Figure 4 shows that when the file size is smaller than 2 MB, the libasyncos version of cp performs worse than the original cp even though it consumes more CPU and memory resources. Copying smaller files needs fewer read()/write() iterations, so the benefit from pipelining read() and write() is not enough to amortize the extra cost of libasyncos discussed above. Moreover, there are also more L1 d-cache misses when using libasyncos to copy small files.

3.3 Limitations

There are several limitations in our current design and implementation of libasyncos. First, it uses threads that cooperate purely through shared memory and spin-loop synchronization, which can waste many CPU cycles and make the per-syscall cost higher than that of the original synchronous syscalls. This can limit the performance improvement for CPU-intensive programs; for example, as indicated in Figure 3, the throughput of the server using libasyncos plateaus sooner than that of the multithreaded server as the number of client threads increases. Second, we have not addressed the scheduling problem of deciding when to use libasyncos: libasyncos sometimes consumes more resources yet performs worse (as seen in Figure 4), in which case it should not be used. Moreover, we have not addressed scalability, as our reported evaluation uses 2-4 tightly-coupled cores. As synchronization across sockets on many cores is much more expensive than on-die synchronization [5], new performance and scheduling issues will emerge.

Finally, one can further optimize the design of libasyncos. For example, we can build a separate syscall job queue per worker thread to improve cache locality, and we can interleave computation with syscall execution in the worker threads to reduce wasted CPU cycles.

Besides the above limitations, there are practical issues with the two benchmarks that we use to demonstrate the effectiveness of libasyncos. For a network server, inter-syscall parallelism can improve throughput by utilizing more CPU cores only when the CPU is the bottleneck. However, it is usually not hard to saturate a 10 Gb Ethernet network, especially with small request sizes [7, 8]. If the NIC becomes the bottleneck, simply parallelizing syscalls does not help; and when the CPU is not the bottleneck, inter-syscall parallelism does not need multiple cores in the first place. Also, copying always involves disk accesses, in which case the speedup from inter-syscall parallelism depends on the time spent on individual reads and writes, as discussed in §2.4. We run separate benchmarks to test the effect of disk I/O: when reading two files in lock step from different physical disks with the buffer cache cleared before each run, libasyncos achieves 50%-65% speedup over synchronous read(), which is very close to the speedup of using aio_read() to read the same two files.
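The aio_read() baseline in that comparison can be sketched as follows: both reads are posted, and the program waits for both before moving to the next pair of blocks. This is our own minimal reconstruction (placeholder file names, fixed-size blocks, no error handling; link with -lrt), not the benchmark code itself.

    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>

    #define BLK 65536

    /* Read two files in lock step: both blocks are in flight at once. */
    int main(void)
    {
        int fd1 = open("file1", O_RDONLY), fd2 = open("file2", O_RDONLY);
        if (fd1 < 0 || fd2 < 0) { perror("open"); return 1; }

        static char b1[BLK], b2[BLK];
        for (off_t off = 0; ; off += BLK) {
            struct aiocb a1, a2;
            memset(&a1, 0, sizeof a1);
            memset(&a2, 0, sizeof a2);
            a1.aio_fildes = fd1; a1.aio_buf = b1; a1.aio_nbytes = BLK; a1.aio_offset = off;
            a2.aio_fildes = fd2; a2.aio_buf = b2; a2.aio_nbytes = BLK; a2.aio_offset = off;

            aio_read(&a1);                        /* post both reads ... */
            aio_read(&a2);

            const struct aiocb *pending[2] = { &a1, &a2 };
            while (aio_error(&a1) == EINPROGRESS || aio_error(&a2) == EINPROGRESS)
                aio_suspend(pending, 2, NULL);    /* ... then wait for both */

            ssize_t n1 = aio_return(&a1), n2 = aio_return(&a2);
            if (n1 <= 0 && n2 <= 0)
                break;
            /* ... consume b1[0..n1) and b2[0..n2) ... */
        }
        return 0;
    }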
4 Compiler-directed optimization

Our sc-analyzer program is useful as a developer tool to identify high-level speedup opportunities for libasyncos. However, we claim that many local optimizations can be performed automatically by the compiler. As a proof of concept, we have implemented an LLVM transform pass that can optimize read(), write(), open(), and close(). It works as follows:

1. A call to asyncos_init() is placed at the beginning of main().
2. Each callsite of the target functions is converted into an issue().
3. A list of pointer dependencies of the syscall, either inputs or outputs, is generated.
4. Each subsequent instruction is examined until one of the following conditions occurs: the syscall return value is used; a pointer is used that may alias with one in the dependency set; another syscall using the same file descriptor is called; or the end of the current basic block is reached.
5. The corresponding complete() is placed immediately before the dependent instruction.

We run the optimizer on the following piece of test code:

    fd1 = open("file1")
    fd2 = open("file2")
    read(fd1, buf)
    write(fd2, buf)
    close(fd1)
    close(fd2)

which produces the following code:

    s1 = issue(open, "file1")
    s2 = issue(open, "file2")
    fd1 = complete(s1)
    s3 = issue(read, fd1, buf)
    fd2 = complete(s2)
    complete(s3)
    s4 = issue(write, fd2, buf)
    s5 = issue(close, fd1)
    complete(s4)
    s6 = issue(close, fd2)
    complete(s5)
    complete(s6)

Results. To measure the performance of the compiler-optimized code, we run many iterations of the original and optimized code on a Core2-Duo. We achieve a speedup over synchronous execution of ± 0.05%.

We have shown that it is feasible to implement syscall optimization in the compiler. However, our simple case study ignores many practical problems; for example, we have not dealt with error handling, which always involves a conditional branch. A typical syscall usage looks as follows:

    r = syscall(...);
    if (r < 0) {
        /* handle error */
        return r;
    }

Thus, the syscall return value is often used immediately after the syscall executes, so our optimization algorithm would have no opportunity to overlap syscalls. We avoid this issue in our implementation by removing error-handling code: in the compiler we replace comparisons that check for error conditions with conditions that always evaluate to 0, and running dead-code elimination on the resulting code removes the error-handling branches. However, a complete implementation must use more intelligent code motion techniques to relocate error-handling code in a way that allows parallelism while preserving program semantics. We defer this to future work.

5 Kernel-space design needed

The limitations of libasyncos suggest that inter-syscall parallelism should be exploited within the kernel, for two reasons. First, a kernel implementation can make syscall execution itself faster, which compounds with the speedup from inter-syscall parallelism. Our issue()/complete() mechanism splits one syscall into two calls, so a syscall would cross the user-kernel boundary twice if the mechanism were simply moved into the kernel. Therefore, to implement kernel support for inter-syscall parallelism, besides porting the issue()/complete() mechanism into the kernel, we should apply techniques such as syscall batching and exception-less syscall invocation [15] to reduce the average per-syscall cost.

Second, libasyncos must synchronize dependent syscalls explicitly in user space. Such synchronization not only incurs high overhead, but also sometimes cannot be realized in user space. For example, in single-threaded event-driven servers, inter-syscall parallelism exists because syscalls across event handlers (for different requests) are independent, whereas syscalls issued by the same event handler are likely to be dependent. In a single-threaded event-driven server, a specific event is triggered only once by a unique incoming request, and all events are handled sequentially. Thus, with the user-space libasyncos, it is hard to issue syscalls from different event handlers in parallel while still synchronizing the syscalls from the same handler. With kernel support, syscalls from multiple event handlers can be executed in parallel by kernel worker threads without the user's involvement, and syscall completion notifications inside the kernel can be used to synchronize dependent syscalls within each event handler implicitly.

Similar existing solutions include libflexsc [16] for event-driven servers and POSIX AIO, but inter-syscall parallelism is not limited to event-driven servers or to read/write syscalls.

Ongoing implementation. We are in the process of implementing a kernel-space design for exploiting inter-syscall parallelism. Since we have identified inter-syscall parallelism patterns in common OS workloads, we are also interested in applying compiler-directed optimization to automate the transformation from user code into code that exploits inter-syscall parallelism. We can further overlap computation with syscall execution based on these compiler techniques.
We have assumed the use of conventional POSIX-like syscall APIs, but examining which properties of system APIs are conducive to inter-syscall parallelism is an interesting future direction. It is well known that parallelism and commutativity are intrinsically linked [1]; a design of scalable software with commutative interface operations [4] would provide inter-syscall parallelism with more opportunities.

6 Related Work

Our work is inspired by many previous efforts that improve program performance by understanding and optimizing syscall execution. Exploiting inter-syscall parallelism requires asynchronous syscall execution, for which many solutions are available, for example POSIX asynchronous I/O, LAIO [6], and Linux syslets [9]. These are designed principally to overlap blocking I/O with other work. Our current implementation of libasyncos is not aware of blocking I/O, but it can be extended to take advantage of blocking syscalls for further performance improvement. FlexSC [15] is an exception-less syscall mechanism that inherently executes syscalls asynchronously, but FlexSC works well only with a massive number of independent user threads, while inter-syscall parallelism targets a single thread.

Our inter-syscall parallelism analysis shows large improvement potential for parallelizing event-driven workloads. Previous works, including libflexsc [16] and libasync-smp [20], have pursued similar goals. libflexsc is a syscall notification library built on FlexSC. It realizes the parallel iteration pattern we observe in event-driven servers and achieves impressive speedup, but it demands an event-driven program design, whereas inter-syscall parallelism is not limited to event-driven mechanisms. libasync-smp handles independent events concurrently on multiprocessors. Inter-syscall parallelism is complementary to libasync-smp: while libasync-smp favors user-intensive workloads at the event level, inter-syscall parallelism provides parallelism for kernel-intensive workloads at the syscall level.

Inter-syscall parallelism is also related to efforts to reduce syscall overhead. Syscall batching is a well-known technique to reduce the boundary-crossing overhead of syscalls, e.g., in multi-calls [13, 12], netmap [14], MegaPipe [7], and FlexSC [15, 16]. Similarly, Cosy [11] moves syscall-intensive code regions into the kernel to reduce user-kernel boundary crossings, and the vector OS [17, 18] compounds OS-intensive operations into a vector so that they can be efficiently parallelized using vector interfaces. A future inter-syscall parallelism mechanism can benefit from these techniques to achieve further performance improvement. Lastly, distributed operating systems for multicore processors, such as Helios [10], fos [19], and NIX [2], provide distributed kernel services, which naturally support inter-syscall parallelism.

7 Conclusion

In this paper, we studied and exploited inter-syscall parallelism, which executes independent syscalls from a user thread in parallel. In studying inter-syscall parallelism, we built tracelite and sc-analyzer to trace syscall execution times and reveal the potential speedup from inter-syscall parallelism in common OS workloads. We found promising potential speedup in two event-driven servers and several CLI utilities, and identified two code patterns for inter-syscall parallelism. In exploiting inter-syscall parallelism, we developed libasyncos, a user-space library that allows developers to realize inter-syscall parallelism while retaining the sequential programming model. Our experiments showed that libasyncos can effectively speed up an epoll-based network server and the conventional cp program. We also gave an early study of the feasibility of using compiler techniques to automate the code transformation for inter-syscall parallelism. Finally, we briefly discussed our ongoing realization of inter-syscall parallelism in kernel space.

References

[1] Farhana Aleen and Nathan Clark. Commutativity analysis for software parallelization: Letting program transformations see the big picture. ACM SIGPLAN Notices, 44(3).

[2] Francisco J. Ballesteros, Noah Evans, Charles Forsyth, Gorka Guardiola, Jim McKie, Ron Minnich, and Enrique Soriano-Salvador. NIX: A case for a manycore system for cloud computing. Bell Labs Technical Journal, 17(2):41-54.

[3] B. Brandenburg and J. Anderson. Feather-Trace: A lightweight event tracing toolkit. In Proceedings of the Third International Workshop on Operating Systems Platforms for Embedded Real-Time Applications, pages 19-28.

[4] Austin T. Clements, M. Frans Kaashoek, Nickolai Zeldovich, Robert T. Morris, and Eddie Kohler. The scalable commutativity rule: Designing scalable software for multicore processors. ACM Transactions on Computer Systems (TOCS), 32(4):10.

[5] Tudor David, Rachid Guerraoui, and Vasileios Trigonakis. Everything you always wanted to know about synchronization but were afraid to ask. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles. ACM.

[6] Khaled Elmeleegy, Anupam Chanda, Alan L. Cox, and Willy Zwaenepoel. Lazy asynchronous I/O for event-driven servers. In USENIX Annual Technical Conference, General Track.

[7] Sangjin Han, Scott Marshall, Byung-Gon Chun, and Sylvia Ratnasamy. MegaPipe: A new programming interface for scalable network I/O. In OSDI.
[8] E. Jeong, Shinae Woo, Muhammad Jamshed, Haewon Jeong, Sunghwan Ihm, Dongsu Han, and K. Park. mTCP: A highly scalable user-level TCP stack for multicore systems. In Proc. 11th USENIX NSDI.

[9] Ingo Molnar. Announce: Syslets, generic asynchronous system call support. lkml/2007/2/13/142.

[10] Edmund B. Nightingale, Orion Hodson, Ross McIlroy, Chris Hawblitzel, and Galen Hunt. Helios: Heterogeneous multiprocessing with satellite kernels. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles. ACM.

[11] Amit Purohit, Charles P. Wright, Joseph Spadavecchia, and Erez Zadok. Cosy: Develop in user-land, run in kernel-mode. In HotOS.

[12] Mohan Rajagopalan, Saumya K. Debray, Matti A. Hiltunen, and Richard D. Schlichting. System call clustering: A profile-directed optimization technique. Technical report, The University of Arizona.

[13] Mohan Rajagopalan, Saumya K. Debray, Matti A. Hiltunen, and Richard D. Schlichting. Cassyopia: Compiler assisted system optimization. In HotOS.

[14] Luigi Rizzo. netmap: A novel framework for fast packet I/O. In USENIX Annual Technical Conference.

[15] Livio Soares and Michael Stumm. FlexSC: Flexible system call scheduling with exception-less system calls. In Proc. 9th OSDI.

[16] Livio Soares and Michael Stumm. Exception-less system calls for event-driven servers. In USENIX Annual Technical Conference.

[17] Vijay Vasudevan, David G. Andersen, and Michael Kaminsky. The case for VOS: The vector operating system. In Proc. HotOS XIII, page 101.

[18] Vijay Vasudevan, Michael Kaminsky, and David G. Andersen. Using vector interfaces to deliver millions of IOPS from a networked key-value storage server. In Proceedings of the Third ACM Symposium on Cloud Computing, page 8. ACM.

[19] David Wentzlaff and Anant Agarwal. Factored operating systems (fos): The case for a scalable operating system for multicores. ACM SIGOPS Operating Systems Review, 43(2):76-85.

[20] Nickolai Zeldovich, Alexander Yip, Frank Dabek, Robert Morris, David Mazieres, and M. Frans Kaashoek. Multiprocessor support for event-driven programs. In USENIX Annual Technical Conference, General Track.


More information

Utilizing Linux Kernel Components in K42 K42 Team modified October 2001

Utilizing Linux Kernel Components in K42 K42 Team modified October 2001 K42 Team modified October 2001 This paper discusses how K42 uses Linux-kernel components to support a wide range of hardware, a full-featured TCP/IP stack and Linux file-systems. An examination of the

More information

NAME: STUDENT ID: MIDTERM 235 POINTS. Closed book, closed notes 70 minutes

NAME: STUDENT ID: MIDTERM 235 POINTS. Closed book, closed notes 70 minutes NAME: STUDENT ID: MIDTERM 235 POINTS Closed book, closed notes 70 minutes 1. Name three types of failures in a distributed system. (15 points) 5 points for each correctly names failure type Valid answers

More information

CSCE 313 Introduction to Computer Systems. Instructor: Dezhen Song

CSCE 313 Introduction to Computer Systems. Instructor: Dezhen Song CSCE 313 Introduction to Computer Systems Instructor: Dezhen Song Programs, Processes, and Threads Programs and Processes Threads Programs, Processes, and Threads Programs and Processes Threads Processes

More information

Chapter 4: Multithreaded Programming

Chapter 4: Multithreaded Programming Chapter 4: Multithreaded Programming Silberschatz, Galvin and Gagne 2013! Chapter 4: Multithreaded Programming Overview Multicore Programming Multithreading Models Threading Issues Operating System Examples

More information

Problem Set: Processes

Problem Set: Processes Lecture Notes on Operating Systems Problem Set: Processes 1. Answer yes/no, and provide a brief explanation. (a) Can two processes be concurrently executing the same program executable? (b) Can two running

More information

CSCE 313: Intro to Computer Systems

CSCE 313: Intro to Computer Systems CSCE 313 Introduction to Computer Systems Instructor: Dr. Guofei Gu http://courses.cse.tamu.edu/guofei/csce313/ Programs, Processes, and Threads Programs and Processes Threads 1 Programs, Processes, and

More information

Threads SPL/2010 SPL/20 1

Threads SPL/2010 SPL/20 1 Threads 1 Today Processes and Scheduling Threads Abstract Object Models Computation Models Java Support for Threads 2 Process vs. Program processes as the basic unit of execution managed by OS OS as any

More information

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING UNIT-1

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING UNIT-1 DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING Year & Semester : III/VI Section : CSE-1 & CSE-2 Subject Code : CS2354 Subject Name : Advanced Computer Architecture Degree & Branch : B.E C.S.E. UNIT-1 1.

More information

Distributed Deadlock Detection for. Distributed Process Networks

Distributed Deadlock Detection for. Distributed Process Networks 0 Distributed Deadlock Detection for Distributed Process Networks Alex Olson Embedded Software Systems Abstract The distributed process network (DPN) model allows for greater scalability and performance

More information

Problem Set: Processes

Problem Set: Processes Lecture Notes on Operating Systems Problem Set: Processes 1. Answer yes/no, and provide a brief explanation. (a) Can two processes be concurrently executing the same program executable? (b) Can two running

More information

SMD149 - Operating Systems

SMD149 - Operating Systems SMD149 - Operating Systems Roland Parviainen November 3, 2005 1 / 45 Outline Overview 2 / 45 Process (tasks) are necessary for concurrency Instance of a program in execution Next invocation of the program

More information

IsoStack Highly Efficient Network Processing on Dedicated Cores

IsoStack Highly Efficient Network Processing on Dedicated Cores IsoStack Highly Efficient Network Processing on Dedicated Cores Leah Shalev Eran Borovik, Julian Satran, Muli Ben-Yehuda Outline Motivation IsoStack architecture Prototype TCP/IP over 10GE on a single

More information

Lightweight Remote Procedure Call

Lightweight Remote Procedure Call Lightweight Remote Procedure Call Brian N. Bershad, Thomas E. Anderson, Edward D. Lazowska, Henry M. Levy ACM Transactions Vol. 8, No. 1, February 1990, pp. 37-55 presented by Ian Dees for PSU CS533, Jonathan

More information

The benefits and costs of writing a POSIX kernel in a high-level language

The benefits and costs of writing a POSIX kernel in a high-level language 1 / 38 The benefits and costs of writing a POSIX kernel in a high-level language Cody Cutler, M. Frans Kaashoek, Robert T. Morris MIT CSAIL Should we use high-level languages to build OS kernels? 2 / 38

More information

Comparing the Performance of Web Server Architectures

Comparing the Performance of Web Server Architectures Comparing the Performance of Web Server Architectures David Pariag, Tim Brecht, Ashif Harji, Peter Buhr, and Amol Shukla David R. Cheriton School of Computer Science University of Waterloo, Waterloo, Ontario,

More information

Efficient Implementation of IPCP and DFP

Efficient Implementation of IPCP and DFP Efficient Implementation of IPCP and DFP N.C. Audsley and A. Burns Department of Computer Science, University of York, York, UK. email: {neil.audsley, alan.burns}@york.ac.uk Abstract Most resource control

More information

Comparing and Evaluating epoll, select, and poll Event Mechanisms

Comparing and Evaluating epoll, select, and poll Event Mechanisms Appears in the Proceedings of the Ottawa Linux Symposium, Ottawa, Canada, July, 24 Comparing and Evaluating epoll,, and poll Event Mechanisms Louay Gammo, Tim Brecht, Amol Shukla, and David Pariag University

More information

Using Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology

Using Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology Using Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology September 19, 2007 Markus Levy, EEMBC and Multicore Association Enabling the Multicore Ecosystem Multicore

More information

SCALABILITY AND HETEROGENEITY MICHAEL ROITZSCH

SCALABILITY AND HETEROGENEITY MICHAEL ROITZSCH Faculty of Computer Science Institute of Systems Architecture, Operating Systems Group SCALABILITY AND HETEROGENEITY MICHAEL ROITZSCH LAYER CAKE Application Runtime OS Kernel ISA Physical RAM 2 COMMODITY

More information

Speeding up Linux TCP/IP with a Fast Packet I/O Framework

Speeding up Linux TCP/IP with a Fast Packet I/O Framework Speeding up Linux TCP/IP with a Fast Packet I/O Framework Michio Honda Advanced Technology Group, NetApp michio@netapp.com With acknowledge to Kenichi Yasukata, Douglas Santry and Lars Eggert 1 Motivation

More information

Research on the Implementation of MPI on Multicore Architectures

Research on the Implementation of MPI on Multicore Architectures Research on the Implementation of MPI on Multicore Architectures Pengqi Cheng Department of Computer Science & Technology, Tshinghua University, Beijing, China chengpq@gmail.com Yan Gu Department of Computer

More information

kguard++: Improving the Performance of kguard with Low-latency Code Inflation

kguard++: Improving the Performance of kguard with Low-latency Code Inflation kguard++: Improving the Performance of kguard with Low-latency Code Inflation Jordan P. Hendricks Brown University Abstract In this paper, we introduce low-latency code inflation for kguard, a GCC plugin

More information

A Cool Scheduler for Multi-Core Systems Exploiting Program Phases

A Cool Scheduler for Multi-Core Systems Exploiting Program Phases IEEE TRANSACTIONS ON COMPUTERS, VOL. 63, NO. 5, MAY 2014 1061 A Cool Scheduler for Multi-Core Systems Exploiting Program Phases Zhiming Zhang and J. Morris Chang, Senior Member, IEEE Abstract Rapid growth

More information

AN 831: Intel FPGA SDK for OpenCL

AN 831: Intel FPGA SDK for OpenCL AN 831: Intel FPGA SDK for OpenCL Host Pipelined Multithread Subscribe Send Feedback Latest document on the web: PDF HTML Contents Contents 1 Intel FPGA SDK for OpenCL Host Pipelined Multithread...3 1.1

More information

Measurement-based Analysis of TCP/IP Processing Requirements

Measurement-based Analysis of TCP/IP Processing Requirements Measurement-based Analysis of TCP/IP Processing Requirements Srihari Makineni Ravi Iyer Communications Technology Lab Intel Corporation {srihari.makineni, ravishankar.iyer}@intel.com Abstract With the

More information

LINUX OPERATING SYSTEM Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science

LINUX OPERATING SYSTEM Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science A Seminar report On LINUX OPERATING SYSTEM Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science SUBMITTED TO: www.studymafia.org SUBMITTED

More information

TincVPN Optimization. Derek Chiang, Jasdeep Hundal, Jisun Jung

TincVPN Optimization. Derek Chiang, Jasdeep Hundal, Jisun Jung TincVPN Optimization Derek Chiang, Jasdeep Hundal, Jisun Jung Abstract We explored ways to improve the performance for tincvpn, a virtual private network (VPN) implementation. VPN s are typically used

More information

CPSC/ECE 3220 Fall 2017 Exam Give the definition (note: not the roles) for an operating system as stated in the textbook. (2 pts.

CPSC/ECE 3220 Fall 2017 Exam Give the definition (note: not the roles) for an operating system as stated in the textbook. (2 pts. CPSC/ECE 3220 Fall 2017 Exam 1 Name: 1. Give the definition (note: not the roles) for an operating system as stated in the textbook. (2 pts.) Referee / Illusionist / Glue. Circle only one of R, I, or G.

More information

ParalleX. A Cure for Scaling Impaired Parallel Applications. Hartmut Kaiser

ParalleX. A Cure for Scaling Impaired Parallel Applications. Hartmut Kaiser ParalleX A Cure for Scaling Impaired Parallel Applications Hartmut Kaiser (hkaiser@cct.lsu.edu) 2 Tianhe-1A 2.566 Petaflops Rmax Heterogeneous Architecture: 14,336 Intel Xeon CPUs 7,168 Nvidia Tesla M2050

More information

Brushing the Locks out of the Fur: A Lock-Free Work Stealing Library Based on Wool

Brushing the Locks out of the Fur: A Lock-Free Work Stealing Library Based on Wool Brushing the Locks out of the Fur: A Lock-Free Work Stealing Library Based on Wool Håkan Sundell School of Business and Informatics University of Borås, 50 90 Borås E-mail: Hakan.Sundell@hb.se Philippas

More information

Design Tradeoffs for User-level I/O Architectures

Design Tradeoffs for User-level I/O Architectures IEEE TRANSACTIONS OF COMPUTERS 1 Design Tradeoffs for User-level I/O Architectures Lambert Schalicke, Member, IEEE, Alan L. Davis, Member, IEEE Abstract To address the growing I/O bottleneck, nextgeneration

More information

IO-Lite: A Unified I/O Buffering and Caching System

IO-Lite: A Unified I/O Buffering and Caching System IO-Lite: A Unified I/O Buffering and Caching System Vivek S. Pai, Peter Druschel and Willy Zwaenepoel Rice University (Presented by Chuanpeng Li) 2005-4-25 CS458 Presentation 1 IO-Lite Motivation Network

More information

A Comparison of Scheduling Latency in Linux, PREEMPT_RT, and LITMUS RT. Felipe Cerqueira and Björn Brandenburg

A Comparison of Scheduling Latency in Linux, PREEMPT_RT, and LITMUS RT. Felipe Cerqueira and Björn Brandenburg A Comparison of Scheduling Latency in Linux, PREEMPT_RT, and LITMUS RT Felipe Cerqueira and Björn Brandenburg July 9th, 2013 1 Linux as a Real-Time OS 2 Linux as a Real-Time OS Optimizing system responsiveness

More information

Everything You Always Wanted to Know About Synchronization but Were Afraid to Ask

Everything You Always Wanted to Know About Synchronization but Were Afraid to Ask Everything You Always Wanted to Know About Synchronization but Were Afraid to Ask Tudor David, Rachid Guerraoui and Vasileios Trigonakis Ecole Polytechnique Federale de Lausanne(EPFL) Haksu Lim, Luis,

More information

The Kernel Abstraction

The Kernel Abstraction The Kernel Abstraction Debugging as Engineering Much of your time in this course will be spent debugging In industry, 50% of software dev is debugging Even more for kernel development How do you reduce

More information

Martin Kruliš, v

Martin Kruliš, v Martin Kruliš 1 Optimizations in General Code And Compilation Memory Considerations Parallelism Profiling And Optimization Examples 2 Premature optimization is the root of all evil. -- D. Knuth Our goal

More information

Chapter 5: Threads. Outline

Chapter 5: Threads. Outline Department of Electr rical Eng ineering, Chapter 5: Threads 王振傑 (Chen-Chieh Wang) ccwang@mail.ee.ncku.edu.tw ncku edu Feng-Chia Unive ersity Outline Overview Multithreading Models Threading Issues 2 Depar

More information

Per-Thread Batch Queues For Multithreaded Programs

Per-Thread Batch Queues For Multithreaded Programs Per-Thread Batch Queues For Multithreaded Programs Tri Nguyen, M.S. Robert Chun, Ph.D. Computer Science Department San Jose State University San Jose, California 95192 Abstract Sharing resources leads

More information

HPX. High Performance ParalleX CCT Tech Talk Series. Hartmut Kaiser

HPX. High Performance ParalleX CCT Tech Talk Series. Hartmut Kaiser HPX High Performance CCT Tech Talk Hartmut Kaiser (hkaiser@cct.lsu.edu) 2 What s HPX? Exemplar runtime system implementation Targeting conventional architectures (Linux based SMPs and clusters) Currently,

More information

Lab 2: Threads and Processes

Lab 2: Threads and Processes CS333: Operating Systems Lab Lab 2: Threads and Processes Goal The goal of this lab is to get you comfortable with writing basic multi-process / multi-threaded applications, and understanding their performance.

More information