Inter-Syscall Parallelism

Jie Liao, Aaron Carroll, Lin Zhong
Rice University

Abstract

The thread model used by commodity OSes like Linux allows a thread to issue syscalls only sequentially. The resulting sequential syscall execution can become a significant performance bottleneck. In this paper, we study the potential of executing independent syscalls issued by a thread in parallel, or inter-syscall parallelism. We build a suite of analysis tools to accurately trace syscall execution times and, using a model of the system API dependencies, predict the potential speedup from inter-syscall parallelism in common workloads. Our analysis shows 69%-109% potential speedup on two event-driven servers, and 40% speedup on several CLI utilities. We present libasyncos, a user-space library that helps developers realize inter-syscall parallelism while retaining a sequential programming model. Our benchmarks achieve 9%-94% speedup on an epoll-based network server, and up to 34% speedup on the conventional cp program. We also present an early study showing that it is feasible to use compiler techniques to automate the exploitation of inter-syscall parallelism. Our results further suggest that inter-syscall parallelism should be exploited by kernel-space mechanisms.

1 Introduction

As single-core performance plateaus, modern processors get more computational power from having more cores. This creates a strong incentive to understand and exploit parallelism in software. While many have sought to parallelize user-space workloads, kernel-space work remains sequential within a single thread. In the thread model used by commodity OSes like Linux, a thread issues syscalls sequentially. The resulting sequential kernel execution can become a significant performance bottleneck if that thread issues many syscalls. In this paper, we study a new form of kernel-space parallelism for a single user thread, inter-syscall parallelism, which executes independent syscalls from that thread in parallel. Inter-syscall parallelism is orthogonal to and more fine-grained than thread-level parallelism: it can improve the performance of a single thread, which also benefits multithreaded programs by improving the performance of each of their threads.

We make two contributions in this paper. First, we present the first reality check of inter-syscall parallelism (§2). We not only show that inter-syscall parallelism is abundant in popular programs but also reveal common patterns of inter-syscall parallelism. In doing so, we report a suite of tools, tracelite and sc-analyzer, that estimate possible performance gains from inter-syscall parallelism based on syscall execution traces of common OS workloads. Second, we present our early results in exploiting inter-syscall parallelism to improve the performance of a user thread. In §3, we present libasyncos, a user-space library that allows developers to exploit inter-syscall parallelism while retaining the sequential programming model. We show that libasyncos can speed up an epoll-based network server by 9%-94% and the conventional cp program by up to 34% on a four-core machine. The limitations of our user-space solution motivate our ongoing realization of a kernel-space design, described in §5.

2 Reality Check

In this section, we answer three questions regarding the reality of inter-syscall parallelism: How abundant is inter-syscall parallelism in common OS workloads? (§2.2) What code patterns of inter-syscall parallelism are there? (§2.3)
What factors limit the benefit of inter-syscall parallelism? (§2.4)

To answer these questions, we build a suite of analysis tools to collect the syscall execution traces of a workload and to estimate the potential performance speedup from inter-syscall parallelism based on those traces. We show that inter-syscall parallelism is abundant in the event-driven servers memcached and nginx, and in some CLI utilities such as dd and cp. We also present two code patterns that help developers achieve inter-syscall parallelism.

2.1 Analysis tools

Our analysis tools, tracelite and sc-analyzer, trace syscall execution times and reveal potential inter-syscall parallelism opportunities.

tracelite is a kernel-based mechanism for tracing syscall execution times. It inserts hooks into the architecture-specific syscall entry and exit points, and records the syscall number, arguments, return/error values, and entry/exit timestamps of a user thread. The traced data is stored in a Feather-Trace [3] buffer. tracelite exposes an interface to user space via debugfs through which users can enable/disable tracing, set options such as the target thread ID, and read out the traced data.
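To make the traced data concrete, the sketch below shows one plausible shape of a per-syscall record and a user-space reader. The record layout and the debugfs path are our assumptions for illustration only; they are not tracelite's actual interface.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical layout of one tracelite record; field names are assumptions. */
    struct sc_record {
        uint64_t entry_ns;   /* timestamp at syscall entry */
        uint64_t exit_ns;    /* timestamp at syscall exit  */
        int64_t  ret;        /* return value or -errno     */
        uint64_t args[6];    /* raw syscall arguments      */
        uint32_t nr;         /* syscall number             */
        uint32_t tid;        /* traced thread ID           */
    };

    int main(void)
    {
        /* Assumed debugfs path; the real interface may differ. */
        FILE *f = fopen("/sys/kernel/debug/tracelite/trace", "rb");
        if (!f) { perror("fopen"); return 1; }

        struct sc_record r;
        while (fread(&r, sizeof r, 1, f) == 1)
            printf("tid %u nr %u ret %lld dur %llu ns\n",
                   r.tid, r.nr, (long long)r.ret,
                   (unsigned long long)(r.exit_ns - r.entry_ns));
        fclose(f);
        return 0;
    }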

sc-analyzer is a static Python tool for predicting the potential speedup from inter-syscall parallelism. It generates a syscall dependency graph from the syscall execution traces collected by tracelite, and predicts the performance speedup that could be achieved by overlapping syscall executions. In doing so, sc-analyzer removes user-level data dependencies between syscalls, such as those between the read() and write() of copy programs, by assuming unnecessary buffer reuse is avoided, and removes false file descriptor dependencies by assigning unique file descriptors to all opened files. sc-analyzer calculates the speedup of a program in terms of speed, the reciprocal of the program's execution time; speedup is the percentage improvement of the speed with inter-syscall parallelism over the original speed.

While tracelite traces accurate syscall times, sc-analyzer can either overestimate or underestimate the performance speedup. The speedup can be overestimated because sc-analyzer assumes unlimited hardware resources and no sharing between independent syscalls; on a real platform the number of cores is limited, and independent syscalls can share a cache line or wait on the same lock. Moreover, avoiding buffer reuse and assigning unique file descriptors are not always correct, because sc-analyzer only considers dependencies between syscalls, not those between syscalls and the computation code. On the other hand, because sc-analyzer is a static tool, it can also underestimate the speedup, as program execution can change dynamically with a specific optimization and hardware context, e.g., with CPU utilization.
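At its core, sc-analyzer's prediction is a critical-path computation over the syscall dependency graph: the sequential time is the sum of all syscall durations, while the parallel time is the length of the longest dependency chain. Below is a minimal C sketch of that estimate over a toy trace; the toy trace, the dependency matrix, and all names are ours for illustration, not sc-analyzer's code (which is written in Python).

    #include <stdio.h>

    #define N 4

    int main(void)
    {
        /* Toy trace: per-syscall execution times in microseconds. */
        double dur[N] = { 5.0, 7.0, 6.0, 4.0 };
        /* dep[i][j] != 0 means syscall j depends on syscall i (trace order, i < j). */
        int dep[N][N] = {
            { 0, 1, 0, 0 },   /* 1 depends on 0, e.g., read() after open() */
            { 0, 0, 0, 0 },
            { 0, 0, 0, 1 },   /* 3 depends on 2 */
            { 0, 0, 0, 0 },
        };

        double seq = 0.0, par = 0.0, finish[N];
        for (int j = 0; j < N; j++) {
            double start = 0.0;
            for (int i = 0; i < j; i++)
                if (dep[i][j] && finish[i] > start)
                    start = finish[i];          /* wait for all predecessors */
            finish[j] = start + dur[j];
            seq += dur[j];
            if (finish[j] > par)
                par = finish[j];                /* critical-path length */
        }
        printf("sequential %.1f us, parallel %.1f us, speedup %.0f%%\n",
               seq, par, (seq / par - 1.0) * 100.0);
        return 0;
    }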
2.2 Inter-syscall parallelism opportunities

To study how abundant inter-syscall parallelism is, we use tracelite and sc-analyzer to analyze the potential speedup from inter-syscall parallelism for a number of desktop and server workloads. Our benchmarks are run on a quad-core Core-i7 CPU with a maximum frequency of 3.4 GHz. Some workloads see promising speedup, showing abundant opportunities for inter-syscall parallelism. The best of them, Memcached and Nginx, see speedup between 69% and 109%, and the CLI utilities grep, dd, and cp see speedup around 40%. Table 1 briefly describes all the benchmarks we have tried.

Table 1: Benchmarks used in our reality check of inter-syscall parallelism. We cover a wide range of desktop and server workloads to seek inter-syscall parallelism opportunities. The disk cache is warmed up for all I/O operations.

  Network servers:
    Memcached    - Event-driven key-value store server, v1.4.20, running memaslap v1.0 and memtier as clients to generate load.
    Nginx        - Event-driven HTTP server and reverse proxy, v1.6.0, running ApacheBench v2.3 and httperf v0.8 as clients to generate load.
    Apache       - Multithreaded HTTP web server, v2.4.9, running ApacheBench v2.3 as client to generate load.
  Database:
    PostgreSQL   - Relational database system, v9.3.4, running pgbench as client to create tables and perform queries.
  JavaScript engines:
    SpiderMonkey - Mozilla's JavaScript engine, running the scripts string-validate-input and 3-raytrace.
    V8           - Chrome's JavaScript engine, running the scripts v8-splay and v8-deltablue.
  Compiler:
    GCC          - Compiling a simple "Hello World" program with the compile phase (cc1).
  Media processing:
    ffmpeg       - Encoding a WAV file to MP3.
  CLI utilities:
    grep, dd, cp, tar, gzip, ls - Running self-defined test cases.

As discussed in §1, inter-syscall parallelism targets single-thread performance, so our analysis focuses on the syscall execution of a single thread, and all the benchmarks are configured to be single-threaded. For all I/O operations, the disk caches are warmed up before the traces are collected. Disk access latency has a significant impact on the performance of I/O operations; in studying the opportunities of inter-syscall parallelism we defer that discussion to §3.3. Figure 1 shows the speedup predicted by sc-analyzer and the percentage of total time spent executing syscalls for all the benchmarks.

Figure 1: Potential percent speedup predicted by sc-analyzer and percentage of syscall execution time for all benchmarks on Core-i7, sorted by speedup. Memcached and Nginx see the highest potential speedup, and CLI utilities grep, dd, cp, and tar also have promising speedup.

As shown in Figure 1, promising speedup is seen in the two event-driven servers Memcached and Nginx, and in four CLI utilities: grep, dd, cp, and tar. Most benchmarks show a strong positive correlation between speedup and the percentage of time spent in syscalls. However, PostgreSQL has high syscall execution time but very low speedup from exploiting inter-syscall parallelism, because the recvfrom() syscall in a PostgreSQL thread takes most of the syscall execution time yet cannot be parallelized since it blocks. Similarly, an Apache thread spends more than half of its execution time on syscall execution, but it does not see high speedup because the thread spends much of its time waiting for and acquiring a lock before issuing syscalls to serve client requests. Some workloads have very low speedup because they spend most of their execution time in user space, e.g., the JavaScript engines, ffmpeg, and gzip.

2.3 Code patterns of inter-syscall parallelism

Based on the results of our syscall analysis in §2.2 and a study of the POSIX API, we identify two code patterns of inter-syscall parallelism: parallel iteration and loop pipelining. These two patterns are the common cases that contribute most of the speedup in §2.2. They also provide developers with useful optimization hints on how to exploit inter-syscall parallelism in their applications. There are also cases where syscalls in straight-line sequential code have no dependencies and can be executed concurrently, but these cases do not form a common pattern and should be dealt with on a case-by-case basis.

Parallel iteration: Programs often loop through a set of objects, such as files, directory entries, and sockets. While the syscalls within each iteration are often dependent, syscalls across different iterations sometimes are not. These syscalls can be parallelized by executing several iterations concurrently on multiple CPU cores, i.e., using multiple threads. For example, event-driven programs like Memcached loop through the sockets with incoming requests after an epoll_wait() call in order to serve each available request, as shown in Listing 1. Different iterations over different sockets can indeed run in parallel, since the requests are independent, as shown in Listing 2.

    n = epoll_wait();
    for (i = 0; i < n; i++) {
        read(fd[i]);      /* thread 0 */
        sendmsg(fd[i]);   /* thread 0 */
    }

Listing 1: Syscall trace snippet from Memcached; iterations of the epoll loop are executed sequentially within one thread.

    n = epoll_wait();
    read(fd[0]);          /* thread 0   */
    sendmsg(fd[0]);       /* thread 0   */
    read(fd[1]);          /* thread 1   */
    sendmsg(fd[1]);       /* thread 1   */
    ...
    read(fd[n-1]);        /* thread n-1 */
    sendmsg(fd[n-1]);     /* thread n-1 */

Listing 2: Restructured code of Listing 1; different iterations of the epoll loop can be executed in parallel.

Loop pipelining: In cases where dependencies do exist across iterations, the early syscalls of one iteration often do not depend on the later syscalls of the previous iteration. For example, in CLI utilities that copy data, such as cp, dd, and tar, the copying process consists of a loop that reads data from the source file into a buffer and then writes from the same buffer to the destination file, causing a data dependency between read() and write(), as shown in Listing 3.

    while (!finish) {
        read(fd1, buf);   /* thread 0 */
        write(fd2, buf);  /* thread 0 */
    }

Listing 3: Syscall trace snippet from cp; all read() and write() calls of the copy loop are executed sequentially.

    read(fd1, buf1);
    while (!finish) {
        write(fd2, buf1); /* thread 0 */
        read(fd1, buf2);  /* thread 1 */
        exchange buf1 and buf2;
    }
    write(fd2, buf1);

Listing 4: Restructured code of Listing 3; the read() and write() within one iteration of the new copy loop operate on different buffers and file descriptors, and thus can be executed in parallel.
In fact, the data dependency can be removed by using two buffers, each of which serves read() or write() independently. If we restructure the copy loop shown in Listing 3 into the code shown in Listing 4, the read() and write() within each iteration become independent and thus can be executed in parallel.
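To make the coordination in Listing 4 concrete, here is a minimal double-buffered copy written with two POSIX threads (compile with -pthread). It is our own illustration of the pattern, not the modified cp evaluated in §3; error handling and short-write handling are omitted.

    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    #define BLK 65536

    static char buf[2][BLK];
    static ssize_t nread;           /* block size produced by the reader; the handoff
                                       is synchronized by the barrier below */
    static int in, out;
    static pthread_barrier_t bar;

    /* Reader: in round r it fills buf[(r + 1) % 2] while the writer drains buf[r % 2]. */
    static void *reader(void *arg)
    {
        (void)arg;
        for (int r = 0; ; r++) {
            nread = read(in, buf[(r + 1) % 2], BLK);
            pthread_barrier_wait(&bar);            /* hand the block to the writer */
            if (nread <= 0)
                return NULL;
        }
    }

    int main(int argc, char **argv)
    {
        if (argc != 3) { fprintf(stderr, "usage: %s src dst\n", argv[0]); return 1; }
        in  = open(argv[1], O_RDONLY);
        out = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (in < 0 || out < 0) { perror("open"); return 1; }

        ssize_t n = read(in, buf[0], BLK);         /* prime the pipeline (Listing 4, line 1) */
        if (n > 0) {
            pthread_barrier_init(&bar, NULL, 2);
            pthread_t t;
            pthread_create(&t, NULL, reader, NULL);
            for (int r = 0; ; r++) {
                write(out, buf[r % 2], (size_t)n); /* overlaps with the reader's read() */
                pthread_barrier_wait(&bar);        /* wait for the next block */
                n = nread;
                if (n <= 0)
                    break;
            }
            pthread_join(t, NULL);
        }
        close(in);
        close(out);
        return 0;
    }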

2.4 Inter-syscall parallelism is not a panacea

As shown in Figure 1, many workloads have very low speedup; inter-syscall parallelism does not help all workloads. The following cases describe when inter-syscall parallelism does not help.

First, programs that spend a significantly large portion of their execution time in user space will not benefit much from inter-syscall parallelism, because it only reduces kernel time. As shown in Figure 1, the JavaScript engines, gzip, and ffmpeg have very low potential speedup because they spend more than 95% of their execution time in user space.

Second, even with a high portion of kernel execution time, programs that block waiting on I/O can have limited speedup. In our database benchmark, the blocking recvfrom() syscall dominates the overall execution time but also limits the speedup because it cannot be parallelized.

Third, programs that present a high degree of inter-syscall parallelism can still have limited speedup when one of a set of concurrently executed syscalls takes significantly longer than the others. For example, we can pipeline the read()/write() copy loops in programs like cp and dd to make them run faster, but if write() takes much longer than read() (or vice versa), the speedup will be low because write() determines the overall execution time. Similarly, cold disk caches can hurt the speedup, because they make some disk I/O syscalls take significantly longer than the others.

Last, executing syscalls on different cores can cause cache pollution and limit the speedup. For example, executing two instances of the same but independent syscall sequentially on a single core can improve cache locality, making the second one run much faster than the first; executing them concurrently on different cores can make both run slowly, potentially reducing overall performance.

3 User-space Design: libasyncos

The results of the previous analysis show that reasonable speedup can be achieved by parallelizing syscall execution for some workloads. To exploit inter-syscall parallelism for these workloads, we developed libasyncos, a user-space library that allows developers to realize inter-syscall parallelism while retaining the sequential programming model. Our early results show 9% to 94% speedup on an epoll-based network server, and up to 34% speedup on the cp program from GNU Coreutils v8.23. We briefly describe the design of libasyncos in §3.1, present our early results in §3.2, and discuss the limitations of libasyncos in §3.3.

Figure 2: Working flow of libasyncos. issue() requests execution of a syscall sc, passes the request to the shared syscall job queue (POSTED), and returns a handle to that syscall instance. When the result is required, complete() is called, which blocks until the specified syscall has completed (COMPLETED) and returns the result of the original call sc. The two example issue()s and complete()s in the figure are not necessarily related.

3.1 libasyncos Design

Two basic API designs exist for asynchronous syscall execution: event-driven programming and issue/complete semantics. In event-driven programming, a callback function is registered with each syscall invocation.
When the syscall completes, the program is preempted and execution resumes in the callback handler. Event-driven programming requires significant engineering effort to restructure sequential programs, and it suffers from concurrency complexity since callback handlers can be invoked at any point. Therefore we implement the issue/complete semantics, which requires little modification to the original source code. Specifically, libasyncos breaks one syscall invocation sc(arg0, ..., argn) into two basic APIs:

    issue(sc, arg0, ..., argn)    and    complete(handle)

libasyncos decouples the syscall request (issue()) from result collection (complete()), so that multiple syscalls can be outstanding at the same time and thus executed in parallel. With libasyncos, syscalls are executed in the context of per-core worker threads that share a single syscall job queue in the same address space as the user thread. Figure 2 shows the working flow of libasyncos: issue() requests execution of a syscall sc, passes the request to the job queue, and returns a handle to that syscall instance. When the result is required, complete() is called, which blocks until the specified syscall has completed and returns the result of the original call sc. Thus, complete(issue(sc, ...)) is equivalent to sc(...). With minimal effort to rewrite the code, developers retain the sequential programming model.
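As an illustration, the Memcached-style epoll loop of Listing 1 could be rewritten with this API roughly as follows. This is a hedged sketch: the issue() and complete() prototypes (modeled here after syscall(2)) and the handle type are our assumptions, since the full libasyncos signatures are not given in this paper.

    #include <sys/epoll.h>
    #include <sys/syscall.h>   /* SYS_read, SYS_sendmsg */

    /* Assumed libasyncos interface; the real prototypes may differ. */
    typedef long handle_t;
    handle_t issue(long nr, ...);
    long complete(handle_t h);

    void serve_ready(int epfd, struct epoll_event *ev, int max, char bufs[][512])
    {
        int n = epoll_wait(epfd, ev, max, -1);
        handle_t rd[n > 0 ? n : 1];

        /* Post every ready connection's read() before completing any of them, so
           independent requests execute in parallel on the worker threads. */
        for (int i = 0; i < n; i++)
            rd[i] = issue(SYS_read, (long)ev[i].data.fd,
                          (long)bufs[i], (long)sizeof bufs[i]);

        for (int i = 0; i < n; i++) {
            long got = complete(rd[i]);   /* block only when the result is needed */
            if (got <= 0)
                continue;
            /* ... parse bufs[i], build a reply, then e.g.:
               complete(issue(SYS_sendmsg, (long)ev[i].data.fd, (long)&reply, 0L)); */
        }
    }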

Notably, the existing POSIX asynchronous I/O (AIO) interface offers both event-driven and issue/complete semantics: a read or write is enqueued by calling aio_read() or aio_write(), and I/O completion can be signaled by a blocking call to aio_suspend(), by polling aio_error(), or via a callback function. But AIO only supports reads and writes, whereas we target asynchronous execution of generic syscalls.

3.2 Early results

We exploit inter-syscall parallelism with libasyncos in two benchmarks: an epoll-based network server and the cp program. We choose these two benchmarks because they exhibit the code patterns described in §2.3, which contribute promising speedup and which developers can realize by restructuring their code with minimal effort. Our results show that libasyncos can achieve reasonable speedup with inter-syscall parallelism.

3.2.1 Experiment setup

The network server uses epoll_wait() to wait for incoming requests, loops through the available sockets, read()s from each socket, and send()s back a small message. The client, connected to the server over a 10 Gb Ethernet network, spawns a large number of threads, each of which issues a fixed number of requests to the server using regular synchronous syscalls. We run the server and client on two separate machines, each with a quad-core Core-i7 CPU at a maximum frequency of 3.4 GHz. We benchmark three versions of the network server: the original single-threaded version, a fully multithreaded version with incoming connections evenly distributed across per-core worker threads, and a version with libasyncos in which we exploit inter-syscall parallelism using the parallel iteration pattern (§2.3). Figure 3 shows the throughput of the three servers with various numbers of client threads.

Figure 3: Throughput of the epoll-based network server with libasyncos and the multithreaded version against the original single-threaded version. libasyncos effectively exploits inter-syscall parallelism, but performs worse than the multithreaded server.

The cp program is taken from GNU Coreutils v8.23. We modify cp using libasyncos to exploit the loop pipelining pattern (§2.3), and compare its execution speed against that of the original cp. We run both versions on the same machine with a quad-core Core-i7 CPU at a maximum frequency of 3.4 GHz. Before collecting results, the disk caches are warmed up by performing the copy several times, and the target files are removed from the file system (but not synchronized to disk). Figure 4 shows the speedup of cp with libasyncos against the original cp with various block sizes and file sizes.

Figure 4: Speedup of cp with libasyncos against the original cp, tested with different block sizes and file sizes. libasyncos improves the performance of cp when copying large files, but can also degrade the performance when copying small files.

3.2.2 libasyncos can effectively exploit inter-syscall parallelism

Our results show that libasyncos achieves reasonable speedup with inter-syscall parallelism. As shown in Figure 3, libasyncos achieves 94% speedup on the network server with light workloads, and 9% speedup with heavy workloads that saturate the server. libasyncos is not able to accelerate the processing of each single request to the server, because the syscalls within a single request are dependent; but it still improves the overall throughput. Also, as shown in Figure 4, libasyncos can make cp up to 34% faster than the original cp. When the file size is larger than 2 MB, cp with libasyncos always outperforms the original cp.
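For reference, the loop-pipelining rewrite of the cp copy loop looks roughly as follows with the issue()/complete() API. As before, the prototypes and the handle type are our assumptions, not the exact code we benchmarked; error handling is omitted.

    #include <sys/syscall.h>   /* SYS_read, SYS_write */

    /* Assumed libasyncos interface, as in the earlier sketch. */
    typedef long handle_t;
    handle_t issue(long nr, ...);
    long complete(handle_t h);

    #define BLK 65536

    /* Pipelined copy: the write() of one buffer overlaps with the read() into the
       other buffer, mirroring Listing 4. */
    static void copy_pipelined(int in, int out)
    {
        static char buf[2][BLK];
        long n = complete(issue(SYS_read, (long)in, (long)buf[0], (long)BLK));

        for (int i = 0; n > 0; i ^= 1) {
            handle_t w = issue(SYS_write, (long)out, (long)buf[i], n);
            handle_t r = issue(SYS_read,  (long)in,  (long)buf[i ^ 1], (long)BLK);
            complete(w);        /* buf[i] is only reused two iterations later */
            n = complete(r);    /* size of the block now sitting in buf[i ^ 1] */
        }
    }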
3.2.3 libasyncos is not always beneficial

However, libasyncos is not always beneficial. Figure 3 shows a relatively high throughput gain (over 90%) when the workloads are light, but with heavy workloads libasyncos only achieves around 10% speedup. This is fairly low compared with the speedup of the multithreaded server (50%), which represents the performance upper bound for this server. The reason is that libasyncos pays extra cost to issue syscalls through an additional library layer and to synchronize the syscalls belonging to the same request. Also, Figure 4 shows that when the file size is smaller than 2 MB, the libasyncos version of cp performs worse than the original cp even though it consumes more CPU and memory resources. Copying smaller files needs fewer read()/write() iterations, so the benefit from pipelining read() and write() is not enough to amortize the extra cost of libasyncos discussed above. Moreover, there are also more L1 d-cache misses when using libasyncos to copy small files.

3.3 Limitations

There are several limitations in our current design and implementation of libasyncos. First, it uses threads that cooperate purely through shared memory and spin-loop synchronization, which can waste many CPU cycles and make the per-syscall cost higher than that of the original synchronous syscalls. This can limit the performance improvement for CPU-intensive programs; for example, as indicated in Figure 3, the throughput of the server using libasyncos plateaus sooner than that of the multithreaded server as the number of client threads increases. Second, we have not addressed the scheduling problem of deciding when to use libasyncos: libasyncos sometimes consumes more resources yet performs worse (as seen in Figure 4), in which case it should not be used. Moreover, we have not addressed scalability, as our reported evaluation uses 2-4 tightly-coupled cores. As synchronization across sockets on many cores is much more expensive than on-die synchronization [5], new performance and scheduling issues will emerge.

Finally, one can further optimize the design of libasyncos. For example, we can build a separate syscall job queue per worker thread to improve cache locality, and we can interleave computation with syscall execution in the worker threads to reduce wasted CPU cycles.

Besides the above limitations, there are practical issues with the two benchmarks that we use to demonstrate the effectiveness of libasyncos. For a network server, inter-syscall parallelism can improve throughput by utilizing more CPU cores only when the CPU is the bottleneck. However, it is usually not hard to saturate a 10 Gb Ethernet network, especially with small request sizes [7, 8]. If the NIC becomes the bottleneck, simply parallelizing syscalls does not help; and when the CPU is not the bottleneck, inter-syscall parallelism does not need multiple cores in the first place. Also, copying always involves disk accesses, in which case the speedup from inter-syscall parallelism depends on the time spent on individual reads and writes, as discussed in §2.4. We run separate benchmarks to test the effect of disk I/O: when reading two files in lock step from different physical disks with the buffer cache cleared before each run, libasyncos achieves 50%-65% speedup over synchronous read(), which is very close to the speedup of using aio_read() to read the same two files.
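The aio_read() baseline in that comparison can be sketched as follows: both reads are posted, and the program waits for both before moving to the next pair of blocks. This is our own minimal reconstruction (placeholder file names, fixed-size blocks, no error handling; link with -lrt), not the benchmark code itself.

    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>

    #define BLK 65536

    /* Read two files in lock step: both blocks are in flight at once. */
    int main(void)
    {
        int fd1 = open("file1", O_RDONLY), fd2 = open("file2", O_RDONLY);
        if (fd1 < 0 || fd2 < 0) { perror("open"); return 1; }

        static char b1[BLK], b2[BLK];
        for (off_t off = 0; ; off += BLK) {
            struct aiocb a1, a2;
            memset(&a1, 0, sizeof a1);
            memset(&a2, 0, sizeof a2);
            a1.aio_fildes = fd1; a1.aio_buf = b1; a1.aio_nbytes = BLK; a1.aio_offset = off;
            a2.aio_fildes = fd2; a2.aio_buf = b2; a2.aio_nbytes = BLK; a2.aio_offset = off;

            aio_read(&a1);                        /* post both reads ... */
            aio_read(&a2);

            const struct aiocb *pending[2] = { &a1, &a2 };
            while (aio_error(&a1) == EINPROGRESS || aio_error(&a2) == EINPROGRESS)
                aio_suspend(pending, 2, NULL);    /* ... then wait for both */

            ssize_t n1 = aio_return(&a1), n2 = aio_return(&a2);
            if (n1 <= 0 && n2 <= 0)
                break;
            /* ... consume b1[0..n1) and b2[0..n2) ... */
        }
        return 0;
    }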
4 Compiler-directed optimization

Our sc-analyzer program is useful as a developer tool to identify high-level speedup opportunities for libasyncos. However, we claim that many local optimizations can be performed automatically by the compiler. As a proof of concept, we have implemented an LLVM transform pass that can optimize read(), write(), open(), and close(). It works as follows:

1. A call to asyncos_init() is placed at the beginning of main().
2. Each callsite of the target functions is converted into an issue().
3. A list of pointer dependencies of the syscall, either inputs or outputs, is generated.
4. Each subsequent instruction is examined until one of the following conditions occurs: the syscall return value is used; a pointer is used that may alias with one in the dependency set; another syscall using the same file descriptor is called; or the end of the current basic block is reached.
5. The corresponding complete() is placed immediately before the dependent instruction.

We run the optimizer on the following piece of test code:

    fd1 = open("file1")
    fd2 = open("file2")
    read(fd1, buf)
    write(fd2, buf)
    close(fd1)
    close(fd2)

which produces the following code:

    s1 = issue(open, "file1")
    s2 = issue(open, "file2")
    fd1 = complete(s1)
    s3 = issue(read, fd1, buf)
    fd2 = complete(s2)
    complete(s3)
    s4 = issue(write, fd2, buf)
    s5 = issue(close, fd1)
    complete(s4)
    s6 = issue(close, fd2)
    complete(s5)
    complete(s6)

Results. To measure the performance of the compiler-optimized code, we run many iterations of the original and optimized code on a Core2-Duo. We achieve a speedup over synchronous execution of ± 0.05%.

We have shown that it is feasible to implement syscall optimization in the compiler. However, our simple case study ignores many practical problems; for example, we have not dealt with error handling, which always involves a conditional branch. A typical syscall usage looks as follows:

    r = syscall(...);
    if (r < 0) {
        /* handle error */
        return r;
    }

Thus, the syscall return value is often used immediately after the syscall executes, so our optimization algorithm would have no opportunity to overlap syscalls. We avoid this issue in our implementation by removing error-handling code: in the compiler we replace comparisons that check for error conditions with conditions that always evaluate to 0, and running dead-code elimination on the resulting code removes the error-handling branches. However, a complete implementation must use more intelligent code motion techniques to relocate error-handling code in a way that allows parallelism while preserving program semantics. We defer this to future work.

5 Kernel-space design needed

The limitations of libasyncos suggest that inter-syscall parallelism should be exploited within the kernel, for two reasons. First, a kernel implementation can make syscall execution itself faster, which compounds with the speedup from inter-syscall parallelism. Our issue()/complete() mechanism splits one syscall into two calls, so a syscall would cross the user-kernel boundary twice if the mechanism were simply moved into the kernel. Therefore, to implement kernel support for inter-syscall parallelism, besides porting the issue()/complete() mechanism into the kernel, we should apply techniques such as syscall batching and exception-less syscall invocation [15] to reduce the average per-syscall cost.

Second, libasyncos must synchronize dependent syscalls explicitly in user space. Such synchronization not only incurs high overhead, but also sometimes cannot be realized in user space. For example, in single-threaded event-driven servers, inter-syscall parallelism exists because syscalls across event handlers (for different requests) are independent, whereas syscalls issued by the same event handler are likely to be dependent. In a single-threaded event-driven server, a specific event is triggered only once by a unique incoming request, and all events are handled sequentially. Thus, with the user-space libasyncos, it is hard to issue syscalls from different event handlers in parallel while still synchronizing the syscalls from the same handler. With kernel support, syscalls from multiple event handlers can be executed in parallel by kernel worker threads without the user's involvement, and syscall completion notifications inside the kernel can be used to synchronize dependent syscalls within each event handler implicitly.

Similar existing solutions include libflexsc [16] for event-driven servers and POSIX AIO, but inter-syscall parallelism is not limited to event-driven servers or to read/write syscalls.

Ongoing implementation. We are in the process of implementing a kernel-space design for exploiting inter-syscall parallelism. Since we have identified inter-syscall parallelism patterns in common OS workloads, we are also interested in applying compiler-directed optimization to automate the transformation from user code into code that exploits inter-syscall parallelism. We can further overlap computation with syscall execution based on these compiler techniques.
We have assumed the use of conventional POSIX-like syscall APIs, but examining which properties of system APIs are conducive to inter-syscall parallelism is an interesting future direction. It is well known that parallelism and commutativity are intrinsically linked [1]; a design of scalable software with commutative interface operations [4] would provide inter-syscall parallelism with more opportunities.

6 Related Work

Our work is inspired by many previous efforts that improve program performance by understanding and optimizing syscall execution. Exploiting inter-syscall parallelism requires asynchronous syscall execution, for which many solutions are available, for example POSIX asynchronous I/O, LAIO [6], and Linux syslets [9]. These are designed principally to overlap blocking I/O with other work. Our current implementation of libasyncos is not aware of blocking I/O, but it can be extended to take advantage of blocking syscalls for further performance improvement. FlexSC [15] is an exception-less syscall mechanism that inherently executes syscalls asynchronously, but FlexSC works well only with a massive number of independent user threads, while inter-syscall parallelism targets a single thread.

Our inter-syscall parallelism analysis shows large improvement potential for parallelizing event-driven workloads. Previous works, including libflexsc [16] and libasync-smp [20], have pursued similar goals. libflexsc is a syscall notification library built on FlexSC. It realizes the parallel iteration pattern we observe in event-driven servers and achieves impressive speedup, but it demands an event-driven program design, whereas inter-syscall parallelism is not limited to event-driven mechanisms. libasync-smp handles independent events concurrently on multiprocessors. Inter-syscall parallelism is complementary to libasync-smp: while libasync-smp favors user-intensive workloads at the event level, inter-syscall parallelism provides parallelism for kernel-intensive workloads at the syscall level.

Inter-syscall parallelism is also related to efforts to reduce syscall overhead. Syscall batching is a well-known technique to reduce the boundary-crossing overhead of syscalls, e.g., in multi-calls [13, 12], netmap [14], MegaPipe [7], and FlexSC [15, 16]. Similarly, Cosy [11] moves syscall-intensive code regions into the kernel to reduce user-kernel boundary crossings, and the vector OS [17, 18] compounds OS-intensive operations into a vector so that they can be efficiently parallelized using vector interfaces. A future inter-syscall parallelism mechanism can benefit from these techniques to achieve further performance improvement. Lastly, distributed operating systems for multicore processors, such as Helios [10], fos [19], and NIX [2], provide distributed kernel services, which naturally support inter-syscall parallelism.

7 Conclusion

In this paper, we studied and exploited inter-syscall parallelism, which executes independent syscalls from a user thread in parallel. In studying inter-syscall parallelism, we built tracelite and sc-analyzer to trace syscall execution times and reveal the potential speedup from inter-syscall parallelism in common OS workloads. We found promising potential speedup in two event-driven servers and several CLI utilities, and identified two code patterns for inter-syscall parallelism. In exploiting inter-syscall parallelism, we developed libasyncos, a user-space library that allows developers to realize inter-syscall parallelism while retaining the sequential programming model. Our experiments showed that libasyncos can effectively speed up an epoll-based network server and the conventional cp program. We also gave an early study of the feasibility of using compiler techniques to automate the code transformation for inter-syscall parallelism. Finally, we briefly discussed our ongoing realization of inter-syscall parallelism in kernel space.

References

[1] Farhana Aleen and Nathan Clark. Commutativity analysis for software parallelization: Letting program transformations see the big picture. ACM SIGPLAN Notices, 44(3).

[2] Francisco J. Ballesteros, Noah Evans, Charles Forsyth, Gorka Guardiola, Jim McKie, Ron Minnich, and Enrique Soriano-Salvador. NIX: A case for a manycore system for cloud computing. Bell Labs Technical Journal, 17(2):41-54.

[3] B. Brandenburg and J. Anderson. Feather-Trace: A lightweight event tracing toolkit. In Proceedings of the Third International Workshop on Operating Systems Platforms for Embedded Real-Time Applications, pages 19-28.

[4] Austin T. Clements, M. Frans Kaashoek, Nickolai Zeldovich, Robert T. Morris, and Eddie Kohler. The scalable commutativity rule: Designing scalable software for multicore processors. ACM Transactions on Computer Systems (TOCS), 32(4):10.

[5] Tudor David, Rachid Guerraoui, and Vasileios Trigonakis. Everything you always wanted to know about synchronization but were afraid to ask. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles. ACM.

[6] Khaled Elmeleegy, Anupam Chanda, Alan L. Cox, and Willy Zwaenepoel. Lazy asynchronous I/O for event-driven servers. In USENIX Annual Technical Conference, General Track.

[7] Sangjin Han, Scott Marshall, Byung-Gon Chun, and Sylvia Ratnasamy. MegaPipe: A new programming interface for scalable network I/O. In OSDI.
[8] E. Jeong, Shinae Woo, Muhammad Jamshed, Haewon Jeong, Sunghwan Ihm, Dongsu Han, and K. Park. mTCP: A highly scalable user-level TCP stack for multicore systems. In Proc. 11th USENIX NSDI.

[9] Ingo Molnar. Announce: Syslets, generic asynchronous system call support. lkml/2007/2/13/142.

[10] Edmund B. Nightingale, Orion Hodson, Ross McIlroy, Chris Hawblitzel, and Galen Hunt. Helios: Heterogeneous multiprocessing with satellite kernels. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles. ACM.

[11] Amit Purohit, Charles P. Wright, Joseph Spadavecchia, and Erez Zadok. Cosy: Develop in user-land, run in kernel-mode. In HotOS.

[12] Mohan Rajagopalan, Saumya K. Debray, Matti A. Hiltunen, and Richard D. Schlichting. System call clustering: A profile-directed optimization technique. Technical report, The University of Arizona.

[13] Mohan Rajagopalan, Saumya K. Debray, Matti A. Hiltunen, and Richard D. Schlichting. Cassyopia: Compiler assisted system optimization. In HotOS.

[14] Luigi Rizzo. netmap: A novel framework for fast packet I/O. In USENIX Annual Technical Conference.

[15] Livio Soares and Michael Stumm. FlexSC: Flexible system call scheduling with exception-less system calls. In Proc. 9th OSDI.

[16] Livio Soares and Michael Stumm. Exception-less system calls for event-driven servers. In USENIX Annual Technical Conference.

[17] Vijay Vasudevan, David G. Andersen, and Michael Kaminsky. The case for VOS: The vector operating system. In Proc. HotOS XIII, page 101.

[18] Vijay Vasudevan, Michael Kaminsky, and David G. Andersen. Using vector interfaces to deliver millions of IOPS from a networked key-value storage server. In Proceedings of the Third ACM Symposium on Cloud Computing, page 8. ACM.

[19] David Wentzlaff and Anant Agarwal. Factored operating systems (fos): The case for a scalable operating system for multicores. ACM SIGOPS Operating Systems Review, 43(2):76-85.

[20] Nickolai Zeldovich, Alexander Yip, Frank Dabek, Robert Morris, David Mazieres, and M. Frans Kaashoek. Multiprocessor support for event-driven programs. In USENIX Annual Technical Conference, General Track.


More information

Utilizing Linux Kernel Components in K42 K42 Team modified October 2001

Utilizing Linux Kernel Components in K42 K42 Team modified October 2001 K42 Team modified October 2001 This paper discusses how K42 uses Linux-kernel components to support a wide range of hardware, a full-featured TCP/IP stack and Linux file-systems. An examination of the

More information

NAME: STUDENT ID: MIDTERM 235 POINTS. Closed book, closed notes 70 minutes

NAME: STUDENT ID: MIDTERM 235 POINTS. Closed book, closed notes 70 minutes NAME: STUDENT ID: MIDTERM 235 POINTS Closed book, closed notes 70 minutes 1. Name three types of failures in a distributed system. (15 points) 5 points for each correctly names failure type Valid answers

More information

CSCE 313 Introduction to Computer Systems. Instructor: Dezhen Song

CSCE 313 Introduction to Computer Systems. Instructor: Dezhen Song CSCE 313 Introduction to Computer Systems Instructor: Dezhen Song Programs, Processes, and Threads Programs and Processes Threads Programs, Processes, and Threads Programs and Processes Threads Processes

More information

Chapter 4: Multithreaded Programming

Chapter 4: Multithreaded Programming Chapter 4: Multithreaded Programming Silberschatz, Galvin and Gagne 2013! Chapter 4: Multithreaded Programming Overview Multicore Programming Multithreading Models Threading Issues Operating System Examples

More information

Problem Set: Processes

Problem Set: Processes Lecture Notes on Operating Systems Problem Set: Processes 1. Answer yes/no, and provide a brief explanation. (a) Can two processes be concurrently executing the same program executable? (b) Can two running

More information

CSCE 313: Intro to Computer Systems

CSCE 313: Intro to Computer Systems CSCE 313 Introduction to Computer Systems Instructor: Dr. Guofei Gu http://courses.cse.tamu.edu/guofei/csce313/ Programs, Processes, and Threads Programs and Processes Threads 1 Programs, Processes, and

More information

Threads SPL/2010 SPL/20 1

Threads SPL/2010 SPL/20 1 Threads 1 Today Processes and Scheduling Threads Abstract Object Models Computation Models Java Support for Threads 2 Process vs. Program processes as the basic unit of execution managed by OS OS as any

More information

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING UNIT-1

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING UNIT-1 DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING Year & Semester : III/VI Section : CSE-1 & CSE-2 Subject Code : CS2354 Subject Name : Advanced Computer Architecture Degree & Branch : B.E C.S.E. UNIT-1 1.

More information

Distributed Deadlock Detection for. Distributed Process Networks

Distributed Deadlock Detection for. Distributed Process Networks 0 Distributed Deadlock Detection for Distributed Process Networks Alex Olson Embedded Software Systems Abstract The distributed process network (DPN) model allows for greater scalability and performance

More information

Problem Set: Processes

Problem Set: Processes Lecture Notes on Operating Systems Problem Set: Processes 1. Answer yes/no, and provide a brief explanation. (a) Can two processes be concurrently executing the same program executable? (b) Can two running

More information

SMD149 - Operating Systems

SMD149 - Operating Systems SMD149 - Operating Systems Roland Parviainen November 3, 2005 1 / 45 Outline Overview 2 / 45 Process (tasks) are necessary for concurrency Instance of a program in execution Next invocation of the program

More information

IsoStack Highly Efficient Network Processing on Dedicated Cores

IsoStack Highly Efficient Network Processing on Dedicated Cores IsoStack Highly Efficient Network Processing on Dedicated Cores Leah Shalev Eran Borovik, Julian Satran, Muli Ben-Yehuda Outline Motivation IsoStack architecture Prototype TCP/IP over 10GE on a single

More information

Lightweight Remote Procedure Call

Lightweight Remote Procedure Call Lightweight Remote Procedure Call Brian N. Bershad, Thomas E. Anderson, Edward D. Lazowska, Henry M. Levy ACM Transactions Vol. 8, No. 1, February 1990, pp. 37-55 presented by Ian Dees for PSU CS533, Jonathan

More information

The benefits and costs of writing a POSIX kernel in a high-level language

The benefits and costs of writing a POSIX kernel in a high-level language 1 / 38 The benefits and costs of writing a POSIX kernel in a high-level language Cody Cutler, M. Frans Kaashoek, Robert T. Morris MIT CSAIL Should we use high-level languages to build OS kernels? 2 / 38

More information

Comparing the Performance of Web Server Architectures

Comparing the Performance of Web Server Architectures Comparing the Performance of Web Server Architectures David Pariag, Tim Brecht, Ashif Harji, Peter Buhr, and Amol Shukla David R. Cheriton School of Computer Science University of Waterloo, Waterloo, Ontario,

More information

Efficient Implementation of IPCP and DFP

Efficient Implementation of IPCP and DFP Efficient Implementation of IPCP and DFP N.C. Audsley and A. Burns Department of Computer Science, University of York, York, UK. email: {neil.audsley, alan.burns}@york.ac.uk Abstract Most resource control

More information

Comparing and Evaluating epoll, select, and poll Event Mechanisms

Comparing and Evaluating epoll, select, and poll Event Mechanisms Appears in the Proceedings of the Ottawa Linux Symposium, Ottawa, Canada, July, 24 Comparing and Evaluating epoll,, and poll Event Mechanisms Louay Gammo, Tim Brecht, Amol Shukla, and David Pariag University

More information

Using Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology

Using Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology Using Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology September 19, 2007 Markus Levy, EEMBC and Multicore Association Enabling the Multicore Ecosystem Multicore

More information

SCALABILITY AND HETEROGENEITY MICHAEL ROITZSCH

SCALABILITY AND HETEROGENEITY MICHAEL ROITZSCH Faculty of Computer Science Institute of Systems Architecture, Operating Systems Group SCALABILITY AND HETEROGENEITY MICHAEL ROITZSCH LAYER CAKE Application Runtime OS Kernel ISA Physical RAM 2 COMMODITY

More information

Speeding up Linux TCP/IP with a Fast Packet I/O Framework

Speeding up Linux TCP/IP with a Fast Packet I/O Framework Speeding up Linux TCP/IP with a Fast Packet I/O Framework Michio Honda Advanced Technology Group, NetApp michio@netapp.com With acknowledge to Kenichi Yasukata, Douglas Santry and Lars Eggert 1 Motivation

More information

Research on the Implementation of MPI on Multicore Architectures

Research on the Implementation of MPI on Multicore Architectures Research on the Implementation of MPI on Multicore Architectures Pengqi Cheng Department of Computer Science & Technology, Tshinghua University, Beijing, China chengpq@gmail.com Yan Gu Department of Computer

More information

kguard++: Improving the Performance of kguard with Low-latency Code Inflation

kguard++: Improving the Performance of kguard with Low-latency Code Inflation kguard++: Improving the Performance of kguard with Low-latency Code Inflation Jordan P. Hendricks Brown University Abstract In this paper, we introduce low-latency code inflation for kguard, a GCC plugin

More information

A Cool Scheduler for Multi-Core Systems Exploiting Program Phases

A Cool Scheduler for Multi-Core Systems Exploiting Program Phases IEEE TRANSACTIONS ON COMPUTERS, VOL. 63, NO. 5, MAY 2014 1061 A Cool Scheduler for Multi-Core Systems Exploiting Program Phases Zhiming Zhang and J. Morris Chang, Senior Member, IEEE Abstract Rapid growth

More information

AN 831: Intel FPGA SDK for OpenCL

AN 831: Intel FPGA SDK for OpenCL AN 831: Intel FPGA SDK for OpenCL Host Pipelined Multithread Subscribe Send Feedback Latest document on the web: PDF HTML Contents Contents 1 Intel FPGA SDK for OpenCL Host Pipelined Multithread...3 1.1

More information

Measurement-based Analysis of TCP/IP Processing Requirements

Measurement-based Analysis of TCP/IP Processing Requirements Measurement-based Analysis of TCP/IP Processing Requirements Srihari Makineni Ravi Iyer Communications Technology Lab Intel Corporation {srihari.makineni, ravishankar.iyer}@intel.com Abstract With the

More information

LINUX OPERATING SYSTEM Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science

LINUX OPERATING SYSTEM Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science A Seminar report On LINUX OPERATING SYSTEM Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science SUBMITTED TO: www.studymafia.org SUBMITTED

More information

TincVPN Optimization. Derek Chiang, Jasdeep Hundal, Jisun Jung

TincVPN Optimization. Derek Chiang, Jasdeep Hundal, Jisun Jung TincVPN Optimization Derek Chiang, Jasdeep Hundal, Jisun Jung Abstract We explored ways to improve the performance for tincvpn, a virtual private network (VPN) implementation. VPN s are typically used

More information

CPSC/ECE 3220 Fall 2017 Exam Give the definition (note: not the roles) for an operating system as stated in the textbook. (2 pts.

CPSC/ECE 3220 Fall 2017 Exam Give the definition (note: not the roles) for an operating system as stated in the textbook. (2 pts. CPSC/ECE 3220 Fall 2017 Exam 1 Name: 1. Give the definition (note: not the roles) for an operating system as stated in the textbook. (2 pts.) Referee / Illusionist / Glue. Circle only one of R, I, or G.

More information

ParalleX. A Cure for Scaling Impaired Parallel Applications. Hartmut Kaiser

ParalleX. A Cure for Scaling Impaired Parallel Applications. Hartmut Kaiser ParalleX A Cure for Scaling Impaired Parallel Applications Hartmut Kaiser (hkaiser@cct.lsu.edu) 2 Tianhe-1A 2.566 Petaflops Rmax Heterogeneous Architecture: 14,336 Intel Xeon CPUs 7,168 Nvidia Tesla M2050

More information

Brushing the Locks out of the Fur: A Lock-Free Work Stealing Library Based on Wool

Brushing the Locks out of the Fur: A Lock-Free Work Stealing Library Based on Wool Brushing the Locks out of the Fur: A Lock-Free Work Stealing Library Based on Wool Håkan Sundell School of Business and Informatics University of Borås, 50 90 Borås E-mail: Hakan.Sundell@hb.se Philippas

More information

Design Tradeoffs for User-level I/O Architectures

Design Tradeoffs for User-level I/O Architectures IEEE TRANSACTIONS OF COMPUTERS 1 Design Tradeoffs for User-level I/O Architectures Lambert Schalicke, Member, IEEE, Alan L. Davis, Member, IEEE Abstract To address the growing I/O bottleneck, nextgeneration

More information

IO-Lite: A Unified I/O Buffering and Caching System

IO-Lite: A Unified I/O Buffering and Caching System IO-Lite: A Unified I/O Buffering and Caching System Vivek S. Pai, Peter Druschel and Willy Zwaenepoel Rice University (Presented by Chuanpeng Li) 2005-4-25 CS458 Presentation 1 IO-Lite Motivation Network

More information

A Comparison of Scheduling Latency in Linux, PREEMPT_RT, and LITMUS RT. Felipe Cerqueira and Björn Brandenburg

A Comparison of Scheduling Latency in Linux, PREEMPT_RT, and LITMUS RT. Felipe Cerqueira and Björn Brandenburg A Comparison of Scheduling Latency in Linux, PREEMPT_RT, and LITMUS RT Felipe Cerqueira and Björn Brandenburg July 9th, 2013 1 Linux as a Real-Time OS 2 Linux as a Real-Time OS Optimizing system responsiveness

More information

Everything You Always Wanted to Know About Synchronization but Were Afraid to Ask

Everything You Always Wanted to Know About Synchronization but Were Afraid to Ask Everything You Always Wanted to Know About Synchronization but Were Afraid to Ask Tudor David, Rachid Guerraoui and Vasileios Trigonakis Ecole Polytechnique Federale de Lausanne(EPFL) Haksu Lim, Luis,

More information

The Kernel Abstraction

The Kernel Abstraction The Kernel Abstraction Debugging as Engineering Much of your time in this course will be spent debugging In industry, 50% of software dev is debugging Even more for kernel development How do you reduce

More information

Martin Kruliš, v

Martin Kruliš, v Martin Kruliš 1 Optimizations in General Code And Compilation Memory Considerations Parallelism Profiling And Optimization Examples 2 Premature optimization is the root of all evil. -- D. Knuth Our goal

More information

Chapter 5: Threads. Outline

Chapter 5: Threads. Outline Department of Electr rical Eng ineering, Chapter 5: Threads 王振傑 (Chen-Chieh Wang) ccwang@mail.ee.ncku.edu.tw ncku edu Feng-Chia Unive ersity Outline Overview Multithreading Models Threading Issues 2 Depar

More information

Per-Thread Batch Queues For Multithreaded Programs

Per-Thread Batch Queues For Multithreaded Programs Per-Thread Batch Queues For Multithreaded Programs Tri Nguyen, M.S. Robert Chun, Ph.D. Computer Science Department San Jose State University San Jose, California 95192 Abstract Sharing resources leads

More information

HPX. High Performance ParalleX CCT Tech Talk Series. Hartmut Kaiser

HPX. High Performance ParalleX CCT Tech Talk Series. Hartmut Kaiser HPX High Performance CCT Tech Talk Hartmut Kaiser (hkaiser@cct.lsu.edu) 2 What s HPX? Exemplar runtime system implementation Targeting conventional architectures (Linux based SMPs and clusters) Currently,

More information

Lab 2: Threads and Processes

Lab 2: Threads and Processes CS333: Operating Systems Lab Lab 2: Threads and Processes Goal The goal of this lab is to get you comfortable with writing basic multi-process / multi-threaded applications, and understanding their performance.

More information