Asynchronous/Multithreaded I/O on Commodity Systems with Multiple Disks: a performance study

Nick Cook
Department of Computing Science, University of Newcastle

October 2000


Abstract

The purpose of this dissertation is to test the hypothesis that an asynchronous/multithreaded I/O service can be used to exploit the potential for parallelism of commodity systems with access to multiple disks. Three working systems are contributed: a simulation of asynchronous I/O; a POSIX-compliant asynchronous I/O library; and an extensible C++ class library for comparative tests of I/O services. The results of a simulation study and tests on real systems are used to demonstrate that, under specified circumstances, a significant increase in average application throughput is achieved through the use of asynchronous I/O. The types of workload most likely to benefit from this performance gain are characterised. However, for multithreaded implementations of asynchronous I/O, the improvements are shown to be achieved at the expense of average response times. These costs, sometimes as much as an order of magnitude increase in response time, are quantified.

Declaration

I declare that this dissertation represents my own work except where otherwise stated.

Acknowledgements

Thanks are due to my supervisors, Paul Watson and Jim Smith, for their technical input, guidance and support. Isi Mitrani offered advice on the simulation and modelling aspects of the project. I should also acknowledge Ulrich Drepper, author of much of GNU glibc and of the asynchronous I/O implementation in particular. It would have been difficult to proceed with significant parts of this project without free access to the GNU source code. Lesley, Joseph and Eve are owed a great deal for their support and forbearance during two years of absenteeism on my part. It is no compensation but this dissertation is dedicated to them.


Contents

1 Introduction
    1.1 Target system and project scope
    1.2 I/O service definitions
    1.3 Approach
    1.4 Structure of the dissertation

2 Background
    2.1 Related work
        2.1.1 File-systems and file-system modelling
        2.1.2 Disk characteristics and disk drive modelling
        2.1.3 Specific impact of related work
    2.2 Initial investigation
        2.2.1 The test applications
        2.2.2 Results of the initial investigation
    2.3 Programming languages and other tools used

3 aiosim: simulation of asynchronous (and synchronous) I/O services
    Why simulation?
    Simulation methodology
    Overview of simulation model
        Disk drive model
        File-system cache model
        Simulation configuration
    Description of simulation components
        FileSys, the file-system class library: buffer class; disk class; fsblock, cacheline and fscache classes; filedes class; request, reord_req and fill_req classes; requestgen class
        aiosim classes and application: arrival generation; request setup service; disk service; results service
        The simulation application
    Performance measures and analysis of results
    Model validation

4 libaiomt: a multithreaded implementation of POSIX.4 asynchronous I/O
    Design and implementation of libaiomt
        Interface of POSIX-compliant asynchronous I/O
        Implementation of libaiomt
        Key internal functions
        Thread management
        Data structures
        Request allocation
        Access to shared resources and synchronisation points
    Comparison with GNU glibc librt implementation

5 iot: a C++ class library for comparative tests of I/O services
    Composition of test applications
    Test configuration
    Description of main iot library components: iot::iotest class; iot::testfile class; iot::request class; iot::requestgen class
    Portability

6 Analysis of selected results
    Test and simulation configuration: system characteristics; common workload specification
    Sample variance and confidence in results: variance of tests on real systems; confidence in simulation results
    Average application throughput: tests on real systems; simulation results
    Average response time and system power estimates
    Priority queueing
    Request reordering optimisations

7 Conclusion and further work
    Lessons learned and further work

A aiosim code listing and configuration
    A.1 simulation code listing
        A.1.1 aiosim.sim
        A.1.2 filesys.sim
        A.1.3 simutil.sim
    A.2 example simulation configuration files
        A.2.1 simulation configuration file
        A.2.2 disk configuration file
    A.3 simulation parameter tables

B libaiomt code listing
    B.1 aiomt.h
    B.2 aio_cancel.c
    B.3 aio_error.c
    B.4 aio_fsync.c
    B.5 aio_read.c
    B.6 aio_return.c
    B.7 aio_suspend.c
    B.8 aio_write.c
    B.9 lio_listio.c
    B.10 aio_misc.h
    B.11 aio_misc.c

C iot code listing and configuration
    C.1 iot test application and library code listing
        C.1.1 test applications
        C.1.2 common definitions and constants
        C.1.3 iot::iotest class
        C.1.4 iot::ioterror class
        C.1.5 iot::reqlist class
        C.1.6 iot::request class
        C.1.7 iot::aiorequest class
        C.1.8 iot::siorequest class
        C.1.9 iot::requestgen class
        C.1.10 iot::runtime class
        C.1.11 iot::testfile class
        C.1.12 iot::tfconfig class
    C.2 Example I/O test configuration file


List of Figures

1.1 commodity system with multiple disks
2.1 random read 2 100MB and 2 400MB files on 2 disks (50, KB requests)
2.2 threads created (1000s) in librt async. I/O library for tests in Figure 2.1
2.3 repeat of tests in Figure 2.1 with modified async. I/O library
2.4 repeat of tests in Figure 2.1 with modified async. I/O library
model of asynchronous (and synchronous) I/O service
system with arrivals and departures
convergence of L and TW as simulation batches increase
decreasing confidence intervals as simulation batches increase
decline in throughput with service rate at results service
progress of requests through the libaiomt library
comparison of request allocation time
the impact of contention in libaiomt (20,000 random read req., uniprocessor PC)
the impact of contention in libaiomt (20,000 random read req., multiprocessor PC)
progress of requests through the GNU librt library
throughput of sequential writes of 2 files on 2 disks
seek-time-distance curve used for model of 500MB disk
throughput and request queue size (random read 8MB and 16MB files on 2 disks)
throughput and request size (random read 16MB files on 2 disks)
throughput and request size (random read 32MB and 64MB files on 2 disks)
throughput and file size (random read 2 files on 2 disks on multiprocessor PC)
comparison of asynchronous I/O simulation and real system
comparison of synchronous I/O simulation and real system
simulation of the impact on throughput of additional disks (2-8 disks)
simulated throughput for a scientific computing workload
simulated throughput for a transaction processing workload
simulated throughput for a database-like workload
simulated throughput for a database-like workload with additional disks
simulated response times for a database-like workload with additional disks
simulated requests present for a database-like workload with additional disks
simulated system power for a database-like workload with additional disks
simulated system power for a scientific computing workload
simulated system power for a transaction processing workload
simulated response times for randomly assigned request priorities
6.20 simulated throughput with request reordering and request injection at the library-level

List of Tables

2.1 application throughput for sequential read of 2 files on 2 disks
performance estimate for 32 batches of 0.01s simulation time (1.23s cpu-time)
performance estimates for 8 batches of 0.1s simulation time (11.49s cpu-time)
access to shared resources in libaiomt
user-parameterisation of iot test applications
valid method entry and exit states for iot::iotest
valid method entry and exit states for iot::request
system calls used in iot test library
average performance characteristics of disk drive model
throughput of random reads of 22 files on 2 disks
average response time estimates for experiment in Table
detailed performance estimates for simulated application workloads
A.1 global simulation parameters
A.2 per run simulation parameters (1)
A.3 per run simulation parameters (2): request generation
A.4 per run simulation parameters (3): arrival process
A.5 per run simulation parameters (4): services
A.6 per disk simulation parameters


Chapter 1

Introduction

Recent years have seen dramatic increases in processor speeds (doubling year on year). Memory densities have doubled every two years, with access times to memory decreasing by 30% to 80% per year [15][9]. Similarly, disk capacity has increased significantly as new technologies enable smaller, higher density and cheaper disks [14]. Advances have been made in I/O bus transfer rates, disk seek times and rotational speeds. However, access to disk, involving as it does the movement of mechanical parts, remains a potentially significant bottleneck in modern computer systems. Indeed, the rapid advances in other areas, and the continuing decline in price/performance ratio, serve to highlight this bottleneck. The advent of new applications that take advantage of gains in other areas also increases the demand on the I/O subsystem.

In recognition of the disk I/O bottleneck, operating system designers and disk manufacturers deploy a variety of mechanisms to minimise the impact of disk access. For example, caching is used either to amortise the costs of access to physical media by anticipating future requests, or to defer, and where possible avoid, such access. The strategy adopted is to optimise for the common cases of sequential access in general and of short-lived writes in particular (where any given write request is overwritten in the near future by a subsequent write request). However, there are applications that generate I/O workloads that defeat such optimisation: workloads that, for example, exhibit worst-case characteristics such as random access to files and/or a requirement for guaranteed persistent writes (where every write must be committed to disk even if subsequently overwritten). For some applications that would present workloads of this type to a conventional file-system, such as database systems, it is common to develop customised storage subsystems. These subsystems by-pass the file-system provided by the operating system and perform I/O directly to disk. It is then possible to optimise access for the special case by controlling layout on disk according to predicted access patterns and/or application-specific semantics.

The purpose of this dissertation is to test the hypothesis that multithreading can be used to exploit the potential for parallelism of commodity systems with access to multiple disks. In particular, it addresses the question of whether applications that produce worst-case workloads, from the file-system and disk caching viewpoints, can be supported efficiently by using an asynchronous/multithreaded I/O service. Any demonstrable improvement in performance would be of interest when it is either impractical or too costly to customise disk access for specific applications. If the gains are very significant, then it may be possible to avoid development of customised disk access for a large number of applications.

Section 1.1 describes the target system and project scope. Section 1.2 defines the types of I/O service studied. Section 1.3 gives an overview of the approach adopted. The structure of the remainder

of the dissertation is outlined in Section 1.4.

1.1 Target system and project scope

The target system is presented in Figure 1.1. Applications run on a host system that provides the abstraction of a file-system for access to data on disk.

Figure 1.1: commodity system with multiple disks

At the application level, requests are made to read or write data from/to files that logically reside in the file-system. Access to disk is mediated through the file-system, which determines whether a request can be serviced from/to its own cache or whether access to disk is required. Requests for disk service are submitted through a device interface and may, in turn, be serviced in whole or in part from/to the disk's cache. Any data that cannot be transferred from/to the file-system or the disk cache incurs a physical disk media transfer. A disk controller manages access to the disk cache and initiates transfers from/to disk media. Communication between the host system and disk is via an I/O bus. The bus is shared by multiple disks. In general terms, an application request to read or write data will incur a combination of one or more of the following transfers:

- transfer to/from file-system cache
- transfer across the I/O bus to/from disk cache
- transfer to/from disk media from/to disk cache under control of the disk controller

It is possible for the different types of transfer to overlap. For example, having initiated a transfer from its cache across the bus, the disk controller may fetch additional data from physical media while the transfer progresses. Further discussion of the technical details of the above transfers is deferred to later chapters. However, it can be seen that only the third type of transfer does not involve any access to the shared resources of the file-system cache or the I/O bus. In the context of the target system, it is during data transfer to/from physical disk media that the greatest opportunity for parallelism arises.

It is assumed that the target system supports kernel-level multithreading (as distinct from user-level threads). Threads provide separate execution contexts, within a process, that share some process state such as file descriptors and memory address space [6]. Kernel-level multithreading implies that

the separate threads of control map to kernel entities that may be scheduled separately. A kernel-level thread blocked on a system call (such as a file read()) will allow the scheduling of other threads in the client process. User-level threads within a process are all mapped to a single kernel entity and a blocking system call will block the whole process.

The project is, in part, motivated by the development of the Polar parallel object database server [25]. One of the aims of the Polar project is to determine whether commodity systems, composed of a number of high-performance PCs, can compete with current commercial parallel database systems that use custom-designed components. The platform envisaged for the Polar project is the interconnection of systems such as the target system depicted in Figure 1.1 via a high-speed network. It is a requirement of the project that efficient access to multiple disks be available. The scope of the project has been generalised to investigate the possible performance gains of using an asynchronous I/O service to access multiple disks and to determine the likely I/O workloads that would gain from such a service.

1.2 I/O service definitions

This dissertation presents a comparative study of the performance of two types of I/O service: traditional synchronous I/O and asynchronous I/O. A third category, multithreaded I/O, is essentially a variant of asynchronous I/O. All three categories are defined below (a minimal sketch of asynchronous I/O usage follows the list).

- Synchronous I/O is POSIX standard blocking I/O. Requests are made for data to/from open files (referenced by a system-maintained file descriptor) using read() and write() system calls. The calling application blocks pending completion of a request. Synchronous I/O should not be confused with synchronised I/O, which refers to the guarantee that a write request has been committed to disk. Rather, synchronous I/O indicates that the client application must wait for completion of its requested I/O or, more accurately, must wait until the file-system has completed processing of a request. As implied by the distinction, the return of a write() system call does not guarantee that data has been written to disk. A successful write() operation means that data has been written to file-system cache and scheduled for writing to disk.

- Asynchronous I/O allows the calling application to initiate an I/O request and immediately regain control without waiting for completion of the request. In the context of this project, asynchronous I/O refers to an implementation of the POSIX 1003.1b-1993 standard (also known as POSIX.4) [8]. POSIX.4 defines an interface to asynchronous I/O and associated requirements on the implementation of the service. As with synchronous I/O, operations are performed on file descriptors, opened by the standard open() system call. However, an asynchronous I/O request (for example, aio_read() or aio_write()) does not block. The request is queued within the system and control is immediately returned to the calling application. When the I/O completes, the application is notified via some user-specified mechanism. For example, a signal may be queued on completion or the application may periodically check the status of its outstanding requests. Thus, I/O is potentially performed in parallel with other operations. The POSIX.4 standard does not specify implementation details and an implementation may be threads-based.

- Multithreaded I/O is the use of synchronous I/O in separate threads within an application. For example, I/O requests to different disks may be handled by different threads using blocking read() and write() calls. Given the assumptions of our target system, there is the potential for parallel I/O because a thread that is blocked pending I/O completion will not block the whole process.
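The sketch below illustrates the asynchronous I/O interface defined by POSIX.4: it submits a single read, overlaps it with other work, then polls for completion. The file name is a placeholder and error handling is minimal; this is an illustration of the standard interface, not code from the systems developed for this dissertation.

    /* aio_sketch.c: submit an asynchronous read, then poll for completion.
       Compile with: cc aio_sketch.c -lrt */
    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        static char buf[4096];
        struct aiocb cb;

        int fd = open("datafile", O_RDONLY);   /* placeholder file name */
        if (fd < 0) { perror("open"); return 1; }

        memset(&cb, 0, sizeof(cb));
        cb.aio_fildes = fd;
        cb.aio_buf    = buf;
        cb.aio_nbytes = sizeof(buf);
        cb.aio_offset = 0;        /* an explicit offset is always supplied */

        if (aio_read(&cb) != 0) { perror("aio_read"); return 1; }

        /* control has returned: computation can be overlapped with the I/O */

        while (aio_error(&cb) == EINPROGRESS)
            ;                     /* busy poll; aio_suspend() or a queued
                                     signal would be used in practice */

        ssize_t n = aio_return(&cb);   /* collect the completion status */
        printf("read %ld bytes\n", (long)n);
        close(fd);
        return 0;
    }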

It is therefore possible for another thread to be scheduled to service a request for its disk.

Both multithreaded and asynchronous I/O incur some synchronisation overhead. They may contend for shared resources and, at some point, an application must handle the result of the I/O requested. They also incur the overhead of managing additional resources, such as request queues and threads. Offset against these overheads is the potential to overlap computation and I/O and, where I/O involves multiple devices, to overlap access to those devices. In the general sense of operations being performed asynchronously, multithreaded I/O is a variant of asynchronous I/O. One distinction between the two derives from the assumption that asynchronous I/O is an implementation of the POSIX.4 standard. The use of multithreaded I/O implies that the application developer is responsible for all thread management issues, whereas asynchronous I/O presents an implementation-independent interface that relieves the developer of such responsibilities. Another significant difference between multithreaded I/O and asynchronous I/O is that the use of threads is just one approach to the implementation of the latter. The other main approach to the implementation of asynchronous I/O is to modify the file-system to provide direct support for asynchronous operation. Some of the performance implications of this approach compared to a multithreaded implementation are addressed in Chapters 6 and 7.

1.3 Approach

The approach adopted was to develop three simple test applications to conduct an initial investigation into the relative performance of synchronous, asynchronous and multithreaded I/O. This investigation is described in Chapter 2. The results of the investigation suggested that, under certain circumstances, there would be a performance gain from the use of asynchronous I/O. However, further work would be required to confidently characterise the types of application workload most likely to benefit and to explore the trade-offs between different aspects of performance. In addition, the investigation highlighted shortcomings in the implementation of asynchronous I/O used. Given the above, it was decided to conduct a more detailed investigation as follows:

1. develop a simulation of asynchronous I/O capable of assessing performance under a wider variety of workloads than would be possible to test on the real systems available;

2. implement an optimised version of asynchronous I/O that addresses the shortcomings highlighted during the initial investigation;

3. produce a more flexible test application suite for comparison of asynchronous and synchronous I/O on real systems.

1.4 Structure of the dissertation

The remainder of the dissertation is structured as follows:

Chapter 2 provides background technical information to the project. It is divided into three sections. The first provides an overview of the main components of the I/O subsystem under consideration at the file-system and the disk level. Related work on both file-system and disk drive modelling and performance analysis is also introduced. Section 2.2 describes the initial investigation undertaken

and the results obtained. Section 2.3 briefly discusses the choice of programming languages and other tools used for the project.

Chapters 3, 4 and 5 describe the design, implementation and use of: 1. the simulation program (aiosim); 2. the asynchronous I/O library (libaiomt); and 3. the I/O test library (iot) and associated application.

Chapter 6 provides an analysis of selected results of the I/O tests and the simulation study. Chapter 7 concludes the dissertation and suggests further work.

There are three appendices to the dissertation:

- Appendix A provides the aiosim source code listing, example configuration files and detailed parameter tables.

- Appendix B provides the source code listing of libaiomt.

- Appendix C provides source code for the iot test library and applications, and an example test configuration file.


Chapter 2

Background

In this chapter: Section 2.1 introduces technical background to the project; Section 2.2 describes an initial investigation, including results obtained and lessons learned from the investigation; and Section 2.3 discusses the choice of programming languages and other tools used for the project.

2.1 Related work

When considering how best to exploit the potential for I/O parallelism offered by a system that accesses multiple disks, it is important to understand how both file-systems and disk drives operate. Of particular importance is the impact on performance of both file-system and disk drive caching under different I/O workloads. There follows an overview of file-system and disk drive technologies, including an introduction to related work on file-system and disk drive modelling and performance analysis.

First, the processing of application requests for file I/O should be explained. An application may make requests to read or write arbitrary amounts of data within a file. The file-system divides these requests into file-system block-sized (and block-aligned) requests and services them either from its own cache or by initiating a request for service from disk. In either case, an application request of s bytes will be translated by the file-system into a request for n blocks of data (where s <= n * b, for file-system block size b). As discussed below, for read requests, s is almost always less than n * b because the file-system will often arrange for the pre-fetching of subsequent blocks in anticipation of future requests [21]. A read or write request of arbitrary size, s bytes, from/to arbitrary offset, o, within a file is, then, translated into access to a sequence of one or more file-system blocks b_i, ..., b_(i+n-1), where b_i is the block within which offset o resides. The application request may reside wholly within block b_i. (A small sketch of this translation appears at the end of this section.) In the following, access to a file is considered sequential if the previous access to the file was to b_i (the block within which the current request starts) or to b_(i-1) (the preceding block). This ensures that a series of requests that all reside within a single file-system block are considered sequential.

It should be emphasised that a file-system block is the logical unit used by a file-system to organise data. There is no guarantee that data (logically) held in contiguous file-system blocks will be contiguous on disk. File-systems attempt to organise data on disk so that the majority, if not all, of a file's blocks are close together and that placement on disk corresponds to logical placement within a file. The smaller the file size, the greater the likelihood that the correspondence is maintained. The larger the file, and the greater the proportion of disk that is in use, the more likely it is that its data will spread across larger and less contiguous areas of the disk.
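The following sketch makes the block translation concrete. The block size, offset and request size are illustrative values, not taken from the dissertation; the notation follows the paragraph above (request of s bytes at offset o, block size b).

    /* blocks.c: translate a byte-range request into file-system block numbers */
    #include <stdio.h>

    int main(void)
    {
        const unsigned long b = 4096;            /* assumed block size */
        unsigned long o = 10000, s = 6000;       /* offset and request size */

        unsigned long first = o / b;             /* block i: contains offset o */
        unsigned long last  = (o + s - 1) / b;   /* block i + n - 1 */
        unsigned long n     = last - first + 1;

        /* s <= n * b always holds; pre-fetching may fetch further blocks */
        printf("request covers blocks %lu..%lu (%lu blocks, %lu bytes)\n",
               first, last, n, n * b);
        return 0;
    }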

2.1.1 File-systems and file-system modelling

It is widely accepted that access patterns at the file-system level are often sequential and that writes very often overwrite recently written data [18][4]. File-systems therefore deploy two types of caching to optimise for these common cases:

- A Least Recently Used (LRU) cache stores recently accessed blocks in anticipation that the same blocks will be accessed again. As the name suggests, the least recently used blocks are expelled first. So blocks are organised according to access times and frequently accessed blocks will remain in the cache.

- A write-back cache is used to buffer writes in memory for later transfer to disk. In this way, many writes do not survive for transfer to disk because they are overwritten by a subsequent request before the transfer takes place.

A further common optimisation is read pre-fetching, where blocks that logically follow those of the current request are pre-fetched into the file-system cache. The benefits of read pre-fetching [21] include:

- the cost associated with performing a disk I/O operation is amortised over a larger amount of data;

- assuming sequential file access translates reasonably well to sequential layout on disk, pre-fetching will better utilise the disk's own readahead cache (see below);

- presenting a larger list of requests to the disk controller provides greater opportunity for request re-ordering at the disk level (exploiting the actual layout of data on disk).

Given a request, read pre-fetching adjusts to the workload as follows (a small sketch of the policy follows):

1. the cache is checked for the blocks that the request resides in and for some number of pre-fetch blocks. If necessary, a disk request is initiated for the request blocks and/or for the pre-fetch blocks. For application requests that are sufficiently small, both the request blocks and the pre-fetch blocks may already reside in cache. As long as the workload appears sequential, pre-fetching is triggered and the amount of data pre-fetched doubles up to some limit.

2. if, according to the LRU cache policy, a pre-fetched block is evicted from the cache before it is used, then the file-system assumes it is pre-fetching too aggressively and halves the number of blocks to be pre-fetched at the next request. In this way, the file-system will reduce pre-fetching to no more than the next block should the presented workload not be sequential.
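The sketch below captures the doubling/halving policy just described. The window limits and function names are illustrative assumptions, not drawn from any particular file-system implementation.

    /* prefetch.c: adaptive read pre-fetch window (illustrative policy only) */
    #include <stdio.h>

    enum { WINDOW_MIN = 1, WINDOW_MAX = 32 };   /* window limits, in blocks */

    static unsigned window = WINDOW_MIN;

    /* workload looks sequential: double the pre-fetch window up to a limit */
    static void on_sequential_access(void)
    {
        if (window * 2 <= WINDOW_MAX)
            window *= 2;
    }

    /* a pre-fetched block was evicted before use: halve the window */
    static void on_unused_eviction(void)
    {
        if (window > WINDOW_MIN)
            window /= 2;
    }

    int main(void)
    {
        for (int i = 0; i < 4; i++)
            on_sequential_access();
        printf("window after sequential run: %u blocks\n", window);
        on_unused_eviction();
        printf("window after early eviction: %u blocks\n", window);
        return 0;
    }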

Examples of workloads that should benefit from the file-system optimisations described above are:

- sequential streams of requests (whether all reads, all writes or a mixture of the two). For reads from, or writes to, data that has been recently accessed or that logically follows recently accessed data, the likelihood of cache hits is improved and, therefore, application throughput improves.

- access that frequently overwrites recently written data, which will often avoid disk access altogether.

- streams of requests to small files, whether sequential or not. Such requests will often be mostly served from file-system cache. For small file sizes, most if not all of the file may have been fetched to cache before (or even if) pre-fetching is reduced to a minimal level. Given an initial threshold for pre-fetching of k blocks, and assuming the first, possibly random, request is not for data near the end of a file, any file close to k blocks in size will be mostly in cache after the first request. If subsequent requests turn out not to be sequential, they will still mostly be served from the cache.

Examples of workloads not likely to benefit are:

- writes that require a guarantee of commitment to disk regardless of any subsequent activity.

- long streams of requests large enough to overflow caches before any re-use of cached data. Access to large, continuous media files is likely to have these characteristics [4]. A continuous media file, such as a video file, is likely to be read in a long sequential stream of large requests. Even if the same video is replayed, the data from the start of the stream will have been evicted from file-system cache before being requested again.

- streams of random requests over larger files. Random requests to larger files are unlikely to gain much benefit from pre-fetching or caching, since requests ranging across a large address space will result in fewer cache hits. Streams of small, random requests to large files may suffer more, since they involve access to small areas of a relatively large address space.

A recent study [17] suggests that the characteristics of workloads presented to file-systems may be changing and that the common case is becoming less common. It is reported that, in comparison to earlier studies, traces of file-system traffic show that the size of files being accessed is increasing and that a larger proportion of access to files is random. Further, access to large files (greater than 2MB) exhibited a much greater tendency to be random. Mail and WWW browser applications, in particular, were identified as producing random workloads. This is presumably because user navigation through large mailboxes or WWW documents will not necessarily be sequential. There are other known application workloads that will defeat the file-system optimisations described. For example, amongst the workloads used to validate their disk model [20], Bell Labs describe a "database-like workload" that had "very little spatial locality". It is an aim of this project to determine whether asynchronous I/O can be used to deliver a performance gain to applications that generate workloads for which file-system optimisations are ill-suited.

2.1.2 Disk characteristics and disk drive modelling

Disk drives contain a mechanism and a controller [19]. The mechanism comprises the recording components (rotating disks and the heads to access them) and positional components (disk arm assembly etc.). The controller includes a microprocessor, a cache and an interface to the I/O bus. The controller mediates transfers to/from the host system to which a disk is attached. A request for data from disk always involves communication across the bus with the controller, and an associated command overhead. Depending on the state of the disk cache and the mapping of the requested logical blocks to physical locations on disk (performed by the disk controller), the request will incur a transfer of data from disk buffer or from disk media, or from both.
From the point of view of this project, transfers to/from disk media, with the associated costs of physical seek and rotational delay, are of particular interest. It is during these transfers that the main opportunity for parallel access to disks arises.

22 Disks typically employ the following types of caching: $ a speed matching buffer ensures that data to/from disk is transferred when the host interface is ready. When servicing read requests, the buffer is partially filled (up to a fence value) before a bus data transfer is initiated. Writes are buffered to overlap with head positioning by the drive mechanism. $ a readahead cache actively retrieves and caches data that the controller expects the host to request in the near future. This is commonly implemented as continuing to read from where the last read left off. The readahead cache allows reads to be satisfied in the time it takes the controller to detect a cache hit and then transfer data at bus rate (as opposed to the much slower media transfer rate). A single readahead cache can only support a single sequential stream of requests. The cache is, therefore, often segmented to support interleaved sequential streams. $ a write cache provides immediate reporting of writes as soon as they are in the cache. From the host viewpoint, writes are serviced in the time taken to transfer data to the disk cache. The host experiences slower media rates for data that it explicitly requests is written through to disk. The write cache reduces the volume of writes to disk because overwrites can be made to in-cache data before it goes to disk. Command re-ordering is also supported where writes are schedule for near-optimality. In combination with readahead caching, writes and reads of adjacent blocks can proceed at bus transfer rates. Caching is also used to support command queueing at the controller. The controller is able to impose an ordering on incoming requests to minimise seek times and disk head movements typically ordering requests by Shortest Positioning Time First. The disk device driver will also normally hold a request queue that is ordered to improve response times Specific impact of related work The recent work at Bell Labs on analysing and modelling the performance of both multiple disks on a SCSI bus and of file-system pre-fetching [3][20][21], and that of Ruemmler and Wilkes [19] that preceded it, forms the basis of the modelling work described in Chapter 3. The work on multiple disks on a SCSI bus analyses and models similar workloads to the random read experiments discussed in Section 2.2 below, except that I/O requests are direct to disk (bypassing the file-system). For larger request sizes, they report convoy behaviour in disk I/O (termed rounds ) where, under heavy workloads, each disk services one request before any disk services its next request. This behaviour results in sub-optimal performance. They developed and validated an analytical model that accurately predicts the performance impairment and identifies the terms that characterise it. They suggest an optimisation that deploys an asynchronous read request to trigger disk readahead and thereby achieve greater overlap of bus transfers with disk seeks. This led to consideration of a possible optimisation of multithreaded, asynchronous I/O where incoming requests to a file are ordered by file offset and additional requests are inserted into a random request stream to make it appear more sequential when presented to the file-system. This optimisation is explored in Chapter 6. 
2.1.3 Specific impact of related work

The recent work at Bell Labs on analysing and modelling the performance of both multiple disks on a SCSI bus and of file-system pre-fetching [3][20][21], and that of Ruemmler and Wilkes [19] that preceded it, forms the basis of the modelling work described in Chapter 3. The work on multiple disks on a SCSI bus analyses and models similar workloads to the random read experiments discussed in Section 2.2 below, except that I/O requests are direct to disk (bypassing the file-system). For larger request sizes, they report convoy behaviour in disk I/O (termed "rounds") where, under heavy workloads, each disk services one request before any disk services its next request. This behaviour results in sub-optimal performance. They developed and validated an analytical model that accurately predicts the performance impairment and identifies the terms that characterise it. They suggest an optimisation that deploys an asynchronous read request to trigger disk readahead and thereby achieve greater overlap of bus transfers with disk seeks. This led to consideration of a possible optimisation of multithreaded, asynchronous I/O where incoming requests to a file are ordered by file offset and additional requests are inserted into a random request stream to make it appear more sequential when presented to the file-system. This optimisation is explored in Chapter 6.

The work cited above, that of Peter Bosch on mixed media file-systems [4], and the UC Berkeley work on file-system traffic [17] all provided useful input for the parameterisation of the simulation model and for the characterisation of workloads for the experiments presented in Chapter 6.

In addition to the work cited, there is a considerable body of existing work on both file-system and disk drive performance modelling and evaluation. The papers by Ruemmler and Wilkes [19][18]

provide useful overviews of much of this work. Bosch's PhD thesis provides a more recent and detailed survey, with an emphasis on file-system support for mixed media systems. However, very little detailed work was found on the performance analysis of asynchronous I/O. There is work on specific implementations of asynchronous I/O [5]. One paper was found on the comparison of a file-system-level implementation and a multithreaded implementation [26]. This work indicated that, as is to be expected, file-system support for asynchronous I/O is more efficient than a library-level, multithreaded implementation. No comparison of either implementation with synchronous I/O was provided. The only such study found was an earlier paper by the same author that compared a file-system implementation of asynchronous I/O with synchronous I/O in the specific context of an On-Line Transaction Processing application [27]. This work indicated a performance gain from the use of asynchronous I/O.

In conclusion, no detailed work on the comparative performance of synchronous I/O and a multithreaded implementation of asynchronous I/O has been found. Specifically, no study has been found that addresses the possibility of using asynchronous I/O to parallelise access to multiple disks or that identifies the performance trade-offs between application throughput and average request response times that such use entails. Apart from the systems developed, it is the contribution of this dissertation to provide this performance analysis.

2.2 Initial investigation

This section presents an initial investigation conducted to determine whether there was any likelihood of achieving an improvement in throughput by using asynchronous or multithreaded I/O to access multiple disks.

2.2.1 The test applications

Three test applications were written [1]:

1. siotest: used standard blocking read() and write() calls to service requests.

2. mtiotest: a custom-built POSIX threads application that used a priori knowledge of file-to-disk mappings to assign threads to service requests to a file on a given disk. I/O operations were performed using standard (blocking) read() and write() calls within a thread. (A sketch of this per-disk thread structure appears at the end of this subsection.)

3. aiotest: used the GNU glibc [7] library implementation of asynchronous I/O (librt) to service requests via calls to aio_read() and aio_write().

[1] These applications preceded development of both the simulation and the iot test library described in Chapters 3 and 5.

For all applications, an initialisation phase opened the files and set up any data structures required: for example, the request queues for aiotest and mtiotest, and the area to be written from for write tests. The applications could be configured to produce a single run of one of the following types of request stream: sequential reads; sequential writes; reads from random offsets within a file; or writes to random offsets within a file. Files could optionally be opened to request that writes be written through (or "synched") to disk.

The siotest application blocked pending completion of each requested I/O operation. A request queue was not used and each request was dealt with as it was generated. For streams of random

requests, an lseek() operation was performed to move the file position pointer to the requested offset. The system-maintained file position pointer was relied on for sequential requests.

The mtiotest and aiotest applications used bounded queues to control the number of outstanding requests. The mtiotest application started separate threads for each of the disks identified at configuration time and a request thread generated requests to be queued for the relevant disk thread. The mapping between open files and disks was specified at configuration. Each disk thread would wait for completion of each of its I/O operations. The application would only block when all disk queues were full and all disk threads were blocked pending I/O completion. A results queue was used by the disk threads to pass results back to the initial thread for handling. Ordering of requests between disks was non-deterministic and dependent on the scheduling of threads. Requests for each disk, and for each file on a disk, were serviced in FIFO order. As with siotest, lseek() was only called for requests to random offsets within a file.

The aiotest application submitted requests using the aio_read() or aio_write() asynchronous I/O calls. Control returned to the application immediately after submission of a request. The application was then free to handle completion of earlier requests (notified by a signal) and to submit further requests (up to the configured queue size). Application blocking would occur when the request queue was full, that is, when the maximum number of requests had been submitted and none had yet completed. Asynchronous I/O provides no guarantee on the ordering of the service of requests and may reorder requests submitted. A request priority attribute may be used to lower the priority of an I/O request with respect to the calling application or in order to implement some priority scheme between requests. This priority scheduling scheme is similar to UNIX nice() priority scheduling and it is not possible to increase a request's priority above that of its calling application. Request priorities were not used in the aiotest application. Asynchronous I/O requests must provide a file offset at which to perform the requested I/O (except when a file is opened for appending using the O_APPEND flag). The file position pointer is not relied on and an implied lseek() to the requested offset is always performed. Please see Chapter 4 for further details of the asynchronous I/O interface.

For each test, a minimum sample of four runs was used and the average time taken to complete a fixed number of requests recorded. This figure was used to estimate the application throughput in MB/s. The applications were I/O-intensive: I/O requests were generated repeatedly, with only the minimal computation necessary to check data read or written and to generate the next request performed between requests. All tests were conducted on Pentium II 233MHz PC systems running the Linux operating system and accessing two SCSI disks.
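The sketch below illustrates the thread-per-disk structure used by mtiotest: one worker thread per disk consumes that disk's request list using blocking calls, so a thread blocked on I/O for one disk leaves threads for other disks schedulable. All names, sizes and file paths are illustrative placeholders, and the original application's bounded queues and results queue are elided for brevity.

    /* mtio_sketch.c: one worker thread per disk servicing blocking reads.
       Compile with: cc mtio_sketch.c -lpthread */
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define NDISKS 2
    #define NREQS  4

    struct req  { off_t off; size_t len; };     /* one read request */
    struct disk {                               /* per-disk work list */
        int fd;
        struct req reqs[NREQS];
        int next;                               /* guarded by lock */
        pthread_mutex_t lock;
    };

    static void *disk_worker(void *arg)
    {
        struct disk *d = arg;
        char buf[512];
        for (;;) {
            pthread_mutex_lock(&d->lock);
            if (d->next == NREQS) {             /* all requests serviced */
                pthread_mutex_unlock(&d->lock);
                return NULL;
            }
            struct req r = d->reqs[d->next++];
            pthread_mutex_unlock(&d->lock);

            /* blocking I/O: only this thread blocks; threads serving
               other disks remain schedulable */
            lseek(d->fd, r.off, SEEK_SET);      /* random-offset stream */
            if (read(d->fd, buf, r.len) < 0)
                perror("read");
        }
    }

    int main(void)
    {
        const char *paths[NDISKS] = { "/disk0/file", "/disk1/file" };
        struct disk disks[NDISKS];
        pthread_t tids[NDISKS];

        for (int i = 0; i < NDISKS; i++) {
            disks[i].fd = open(paths[i], O_RDONLY);
            if (disks[i].fd < 0) { perror("open"); return 1; }
            disks[i].next = 0;
            pthread_mutex_init(&disks[i].lock, NULL);
            for (int j = 0; j < NREQS; j++)
                disks[i].reqs[j] = (struct req){ .off = (off_t)j * 4096,
                                                 .len = 512 };
            pthread_create(&tids[i], NULL, disk_worker, &disks[i]);
        }
        for (int i = 0; i < NDISKS; i++)
            pthread_join(tids[i], NULL);
        return 0;
    }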

2.2.2 Results of the initial investigation

For a workload of small (0.5KB) sequential reads of 2 2.5MB files on separate disks, the results shown in Table 2.1 were recorded.

    test                          siotest   mtiotest   aiotest
    ave. application throughput   20 MB/s   5 MB/s     <1 MB/s

Table 2.1: application throughput for sequential read of 2 files on 2 disks.

As a result of pre-fetching, from the application viewpoint, small requests in a sequential stream are served mostly from the file-system cache. In effect, the read() call does not block and it is not therefore possible to overlap requests to different disks (the application reads data direct from memory). For such workloads there is no benefit to multiple threads contending for access to file-system cache. Similar relative performance was found for a workload of sequential writes. Write-back caching means that writes to disk media are performed asynchronously and the application throughput experienced is that of writes to memory.

It was found that, for sequential read access, performance appeared to converge as request size increased. For example, changing the request size from 0.5KB to 8KB resulted in sequential reads from 2.5MB files being served at approximately 5MB/s for all three applications, except for the aiotest application at low queue sizes. When the request size is increased, and request time increases, the benefits of multithreading become apparent. Larger request sizes lead to more I/O blocking. Also, the overhead of multithreading is amortised over longer service times for a request.

A series of tests was carried out with streams of random reads to large files and with varying queue sizes for multithreaded and asynchronous I/O.

Figure 2.1: random read 2 100MB and 2 400MB files on 2 disks (50, KB requests); average application throughput (MB/s) against request queue size.

As can be seen from Figure 2.1, mtiotest consistently out-performs siotest (in terms of throughput). For the larger request queue sizes, aiotest matches mtiotest. The same pattern of relative performance is apparent for both 100MB and 400MB files. A series of small requests at random offsets within a large file will tend to be for data that is distributed widely across both file and disk (particularly when the file represents a significant percentage of disk capacity: 20% for a 100MB file and 80% for the 400MB files in this case). Seek distances between requests are likely to increase. File-system pre-fetching and disk drive readahead

will be less effective. As illustrated, performance is likely to degrade as file size increases.

Another set of tests was carried out with files opened to write requests through to disk (using the O_FSYNC open flag) and therefore avoid write-back caching. These tests revealed a similar relative gain in throughput for both the aiotest and mtiotest applications over the siotest application.

A significant feature of the initial results was the poor performance of the asynchronous I/O library at low request queue sizes. An examination of the source code of the library revealed that it was a multithreaded implementation that used a separate thread to service requests to each file accessed. These threads remained active as long as requests were queued in the library for the given file, but no longer. File thread management was implemented as follows.

Thread creation:

    if (there is no active file thread for a request)
        create a new thread to service the request
    else
        queue the request for the file thread

Thread actions:

    while (there are requests for this file)
        take the first request for this file from the request queue
        service the request
        if (notification by thread)
            create new thread to notify result
        else
            notify result
    exit

Given that librt is a threads-based implementation of asynchronous I/O, it was assumed that the performance of aiotest would be similar to that of mtiotest. A simple modification of the library to count threads created demonstrated that the significant degradation at low queue sizes resulted directly from the thread management algorithm adopted. As shown in Figure 2.2, the library creates excessive numbers of threads when the file threads are not kept active (even though there may be requests for the relevant file in the future). This thread creation problem would have been even worse had result notification by thread been specified. In this case, an additional thread would have been created to notify each result.

The performance of the librt implementation is dependent on the number of threads that are created compared to the number, and frequency, of requests. For I/O-intensive workloads, unless the time to create a thread is insignificant compared to the time to service a request, file threads must be kept active if their creation time is not to dominate performance. A file thread is kept active if the time taken to service a request is greater than the time taken for a subsequent request to be queued for the thread on the request list (to ensure that there is an outstanding request for the thread to service). Sequential streams of small requests, which are mostly memory accesses, do not satisfy this criterion. Nor do low request queue sizes, which throttle the library. In the worst case, at low queue sizes, a thread is created after every other request. The application tends to become stable with a queue size of 20-24, when fewer than 2,000 threads were created over a test run.

To address the thread creation problem, the library was modified to make threads wait for 1 second before exiting (a sketch of this timed wait follows). If requests arrived before the timeout, or were found to be present after the timeout, then a thread would not exit. As shown in Figures 2.3 and 2.4, this brought an immediate improvement in performance.
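The sketch below shows one way to implement the 1-second idle wait described above, using a condition variable with a timeout. It is a simplified illustration under assumed names (the real library guards a per-file request list and more state); it is not the actual librt or libaiomt code.

    /* idle_wait.c: keep a worker thread alive for up to 1 second of idleness.
       Compile with: cc idle_wait.c -lpthread */
    #include <errno.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    static pthread_mutex_t q_lock     = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  q_nonempty = PTHREAD_COND_INITIALIZER;
    static int queued_requests = 0;   /* stand-in for the file's request list */

    /* returns 1 if the thread should service another request, 0 if it may exit */
    static int wait_for_work(void)
    {
        struct timespec deadline;
        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_sec += 1;                     /* wait at most 1 second */

        pthread_mutex_lock(&q_lock);
        while (queued_requests == 0) {
            if (pthread_cond_timedwait(&q_nonempty, &q_lock,
                                       &deadline) == ETIMEDOUT)
                break;                            /* timed out: final re-check */
        }
        int keep_running = (queued_requests > 0); /* request arrived in time? */
        pthread_mutex_unlock(&q_lock);
        return keep_running;
    }

    int main(void)
    {
        /* with no queued requests, the wait times out after about 1 second */
        printf("keep running: %d\n", wait_for_work());
        return 0;
    }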


More information

Reducing Disk Latency through Replication

Reducing Disk Latency through Replication Gordon B. Bell Morris Marden Abstract Today s disks are inexpensive and have a large amount of capacity. As a result, most disks have a significant amount of excess capacity. At the same time, the performance

More information

Lecture 2: Memory Systems

Lecture 2: Memory Systems Lecture 2: Memory Systems Basic components Memory hierarchy Cache memory Virtual Memory Zebo Peng, IDA, LiTH Many Different Technologies Zebo Peng, IDA, LiTH 2 Internal and External Memories CPU Date transfer

More information

Virtual Memory. Chapter 8

Virtual Memory. Chapter 8 Virtual Memory 1 Chapter 8 Characteristics of Paging and Segmentation Memory references are dynamically translated into physical addresses at run time E.g., process may be swapped in and out of main memory

More information

MEMORY. Objectives. L10 Memory

MEMORY. Objectives. L10 Memory MEMORY Reading: Chapter 6, except cache implementation details (6.4.1-6.4.6) and segmentation (6.5.5) https://en.wikipedia.org/wiki/probability 2 Objectives Understand the concepts and terminology of hierarchical

More information

Input Output (IO) Management

Input Output (IO) Management Input Output (IO) Management Prof. P.C.P. Bhatt P.C.P Bhatt OS/M5/V1/2004 1 Introduction Humans interact with machines by providing information through IO devices. Manyon-line services are availed through

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

Virtual Memory. Reading. Sections 5.4, 5.5, 5.6, 5.8, 5.10 (2) Lecture notes from MKP and S. Yalamanchili

Virtual Memory. Reading. Sections 5.4, 5.5, 5.6, 5.8, 5.10 (2) Lecture notes from MKP and S. Yalamanchili Virtual Memory Lecture notes from MKP and S. Yalamanchili Sections 5.4, 5.5, 5.6, 5.8, 5.10 Reading (2) 1 The Memory Hierarchy ALU registers Cache Memory Memory Memory Managed by the compiler Memory Managed

More information

ECE519 Advanced Operating Systems

ECE519 Advanced Operating Systems IT 540 Operating Systems ECE519 Advanced Operating Systems Prof. Dr. Hasan Hüseyin BALIK (8 th Week) (Advanced) Operating Systems 8. Virtual Memory 8. Outline Hardware and Control Structures Operating

More information

Memory management. Last modified: Adaptation of Silberschatz, Galvin, Gagne slides for the textbook Applied Operating Systems Concepts

Memory management. Last modified: Adaptation of Silberschatz, Galvin, Gagne slides for the textbook Applied Operating Systems Concepts Memory management Last modified: 26.04.2016 1 Contents Background Logical and physical address spaces; address binding Overlaying, swapping Contiguous Memory Allocation Segmentation Paging Structure of

More information

CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES

CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES OBJECTIVES Detailed description of various ways of organizing memory hardware Various memory-management techniques, including paging and segmentation To provide

More information

Data Storage and Query Answering. Data Storage and Disk Structure (2)

Data Storage and Query Answering. Data Storage and Disk Structure (2) Data Storage and Query Answering Data Storage and Disk Structure (2) Review: The Memory Hierarchy Swapping, Main-memory DBMS s Tertiary Storage: Tape, Network Backup 3,200 MB/s (DDR-SDRAM @200MHz) 6,400

More information

Kernel Korner AEM: A Scalable and Native Event Mechanism for Linux

Kernel Korner AEM: A Scalable and Native Event Mechanism for Linux Kernel Korner AEM: A Scalable and Native Event Mechanism for Linux Give your application the ability to register callbacks with the kernel. by Frédéric Rossi In a previous article [ An Event Mechanism

More information

Device-Functionality Progression

Device-Functionality Progression Chapter 12: I/O Systems I/O Hardware I/O Hardware Application I/O Interface Kernel I/O Subsystem Transforming I/O Requests to Hardware Operations Incredible variety of I/O devices Common concepts Port

More information

Chapter 12: I/O Systems. I/O Hardware

Chapter 12: I/O Systems. I/O Hardware Chapter 12: I/O Systems I/O Hardware Application I/O Interface Kernel I/O Subsystem Transforming I/O Requests to Hardware Operations I/O Hardware Incredible variety of I/O devices Common concepts Port

More information

Memory. Objectives. Introduction. 6.2 Types of Memory

Memory. Objectives. Introduction. 6.2 Types of Memory Memory Objectives Master the concepts of hierarchical memory organization. Understand how each level of memory contributes to system performance, and how the performance is measured. Master the concepts

More information

Design Issues 1 / 36. Local versus Global Allocation. Choosing

Design Issues 1 / 36. Local versus Global Allocation. Choosing Design Issues 1 / 36 Local versus Global Allocation When process A has a page fault, where does the new page frame come from? More precisely, is one of A s pages reclaimed, or can a page frame be taken

More information

FFS: The Fast File System -and- The Magical World of SSDs

FFS: The Fast File System -and- The Magical World of SSDs FFS: The Fast File System -and- The Magical World of SSDs The Original, Not-Fast Unix Filesystem Disk Superblock Inodes Data Directory Name i-number Inode Metadata Direct ptr......... Indirect ptr 2-indirect

More information

An Approach to Task Attribute Assignment for Uniprocessor Systems

An Approach to Task Attribute Assignment for Uniprocessor Systems An Approach to ttribute Assignment for Uniprocessor Systems I. Bate and A. Burns Real-Time Systems Research Group Department of Computer Science University of York York, United Kingdom e-mail: fijb,burnsg@cs.york.ac.uk

More information

Questions answered in this lecture: CS 537 Lecture 19 Threads and Cooperation. What s in a process? Organizing a Process

Questions answered in this lecture: CS 537 Lecture 19 Threads and Cooperation. What s in a process? Organizing a Process Questions answered in this lecture: CS 537 Lecture 19 Threads and Cooperation Why are threads useful? How does one use POSIX pthreads? Michael Swift 1 2 What s in a process? Organizing a Process A process

More information

Chapter 8: Memory-Management Strategies

Chapter 8: Memory-Management Strategies Chapter 8: Memory-Management Strategies Chapter 8: Memory Management Strategies Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel 32 and

More information

Course Outline. Processes CPU Scheduling Synchronization & Deadlock Memory Management File Systems & I/O Distributed Systems

Course Outline. Processes CPU Scheduling Synchronization & Deadlock Memory Management File Systems & I/O Distributed Systems Course Outline Processes CPU Scheduling Synchronization & Deadlock Memory Management File Systems & I/O Distributed Systems 1 Today: Memory Management Terminology Uniprogramming Multiprogramming Contiguous

More information

An AIO Implementation and its Behaviour

An AIO Implementation and its Behaviour An AIO Implementation and its Behaviour Benjamin C. R. LaHaise Red Hat, Inc. bcrl@redhat.com Abstract Many existing userland network daemons suffer from a performance curve that severely degrades under

More information

Chapter 7: Main Memory. Operating System Concepts Essentials 8 th Edition

Chapter 7: Main Memory. Operating System Concepts Essentials 8 th Edition Chapter 7: Main Memory Operating System Concepts Essentials 8 th Edition Silberschatz, Galvin and Gagne 2011 Chapter 7: Memory Management Background Swapping Contiguous Memory Allocation Paging Structure

More information

The Journey of an I/O request through the Block Layer

The Journey of an I/O request through the Block Layer The Journey of an I/O request through the Block Layer Suresh Jayaraman Linux Kernel Engineer SUSE Labs sjayaraman@suse.com Introduction Motivation Scope Common cases More emphasis on the Block layer Why

More information

Lecture 2 Process Management

Lecture 2 Process Management Lecture 2 Process Management Process Concept An operating system executes a variety of programs: Batch system jobs Time-shared systems user programs or tasks The terms job and process may be interchangeable

More information

CS356: Discussion #9 Memory Hierarchy and Caches. Marco Paolieri Illustrations from CS:APP3e textbook

CS356: Discussion #9 Memory Hierarchy and Caches. Marco Paolieri Illustrations from CS:APP3e textbook CS356: Discussion #9 Memory Hierarchy and Caches Marco Paolieri (paolieri@usc.edu) Illustrations from CS:APP3e textbook The Memory Hierarchy So far... We modeled the memory system as an abstract array

More information

Block Device Driver. Pradipta De

Block Device Driver. Pradipta De Block Device Driver Pradipta De pradipta.de@sunykorea.ac.kr Today s Topic Block Devices Structure of devices Kernel components I/O Scheduling USB Device Driver Basics CSE506: Block Devices & IO Scheduling

More information

CS 318 Principles of Operating Systems

CS 318 Principles of Operating Systems CS 318 Principles of Operating Systems Fall 2018 Lecture 16: Advanced File Systems Ryan Huang Slides adapted from Andrea Arpaci-Dusseau s lecture 11/6/18 CS 318 Lecture 16 Advanced File Systems 2 11/6/18

More information

Chapter 8 Virtual Memory

Chapter 8 Virtual Memory Chapter 8 Virtual Memory Contents Hardware and control structures Operating system software Unix and Solaris memory management Linux memory management Windows 2000 memory management Characteristics of

More information

GUJARAT TECHNOLOGICAL UNIVERSITY MASTER OF COMPUTER APPLICATION SEMESTER: III

GUJARAT TECHNOLOGICAL UNIVERSITY MASTER OF COMPUTER APPLICATION SEMESTER: III GUJARAT TECHNOLOGICAL UNIVERSITY MASTER OF COMPUTER APPLICATION SEMESTER: III Subject Name: Operating System (OS) Subject Code: 630004 Unit-1: Computer System Overview, Operating System Overview, Processes

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Principle of Locality Programs access a small proportion of their address space at any time Temporal locality Items accessed recently are likely to

More information

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective. Part I: Operating system overview: Memory Management

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective. Part I: Operating system overview: Memory Management ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part I: Operating system overview: Memory Management 1 Hardware background The role of primary memory Program

More information

Introduction Disks RAID Tertiary storage. Mass Storage. CMSC 420, York College. November 21, 2006

Introduction Disks RAID Tertiary storage. Mass Storage. CMSC 420, York College. November 21, 2006 November 21, 2006 The memory hierarchy Red = Level Access time Capacity Features Registers nanoseconds 100s of bytes fixed Cache nanoseconds 1-2 MB fixed RAM nanoseconds MBs to GBs expandable Disk milliseconds

More information

CS 318 Principles of Operating Systems

CS 318 Principles of Operating Systems CS 318 Principles of Operating Systems Fall 2017 Lecture 16: File Systems Examples Ryan Huang File Systems Examples BSD Fast File System (FFS) - What were the problems with the original Unix FS? - How

More information

Cache Management for Shared Sequential Data Access

Cache Management for Shared Sequential Data Access in: Proc. ACM SIGMETRICS Conf., June 1992 Cache Management for Shared Sequential Data Access Erhard Rahm University of Kaiserslautern Dept. of Computer Science 6750 Kaiserslautern, Germany Donald Ferguson

More information

Introduction to OpenMP. Lecture 10: Caches

Introduction to OpenMP. Lecture 10: Caches Introduction to OpenMP Lecture 10: Caches Overview Why caches are needed How caches work Cache design and performance. The memory speed gap Moore s Law: processors speed doubles every 18 months. True for

More information

Eastern Mediterranean University School of Computing and Technology CACHE MEMORY. Computer memory is organized into a hierarchy.

Eastern Mediterranean University School of Computing and Technology CACHE MEMORY. Computer memory is organized into a hierarchy. Eastern Mediterranean University School of Computing and Technology ITEC255 Computer Organization & Architecture CACHE MEMORY Introduction Computer memory is organized into a hierarchy. At the highest

More information

Final Exam Preparation Questions

Final Exam Preparation Questions EECS 678 Spring 2013 Final Exam Preparation Questions 1 Chapter 6 1. What is a critical section? What are the three conditions to be ensured by any solution to the critical section problem? 2. The following

More information

On the Relationship of Server Disk Workloads and Client File Requests

On the Relationship of Server Disk Workloads and Client File Requests On the Relationship of Server Workloads and Client File Requests John R. Heath Department of Computer Science University of Southern Maine Portland, Maine 43 Stephen A.R. Houser University Computing Technologies

More information

SMD149 - Operating Systems

SMD149 - Operating Systems SMD149 - Operating Systems Roland Parviainen November 3, 2005 1 / 45 Outline Overview 2 / 45 Process (tasks) are necessary for concurrency Instance of a program in execution Next invocation of the program

More information

Role of OS in virtual memory management

Role of OS in virtual memory management Role of OS in virtual memory management Role of OS memory management Design of memory-management portion of OS depends on 3 fundamental areas of choice Whether to use virtual memory or not Whether to use

More information

Chapter 13: I/O Systems

Chapter 13: I/O Systems Chapter 13: I/O Systems Chapter 13: I/O Systems I/O Hardware Application I/O Interface Kernel I/O Subsystem Transforming I/O Requests to Hardware Operations Streams Performance 13.2 Silberschatz, Galvin

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 5 Large and Fast: Exploiting Memory Hierarchy Principle of Locality Programs access a small proportion of their address

More information

Two hours. Question ONE is COMPULSORY UNIVERSITY OF MANCHESTER SCHOOL OF COMPUTER SCIENCE. Date: Friday 25th January 2013 Time: 14:00-16:00

Two hours. Question ONE is COMPULSORY UNIVERSITY OF MANCHESTER SCHOOL OF COMPUTER SCIENCE. Date: Friday 25th January 2013 Time: 14:00-16:00 Two hours Question ONE is COMPULSORY UNIVERSITY OF MANCHESTER SCHOOL OF COMPUTER SCIENCE Operating Systems Date: Friday 25th January 2013 Time: 14:00-16:00 Please answer Question ONE and any TWO other

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

L7: Performance. Frans Kaashoek Spring 2013

L7: Performance. Frans Kaashoek Spring 2013 L7: Performance Frans Kaashoek kaashoek@mit.edu 6.033 Spring 2013 Overview Technology fixes some performance problems Ride the technology curves if you can Some performance requirements require thinking

More information

Chapter 3: Important Concepts (3/29/2015)

Chapter 3: Important Concepts (3/29/2015) CISC 3595 Operating System Spring, 2015 Chapter 3: Important Concepts (3/29/2015) 1 Memory from programmer s perspective: you already know these: Code (functions) and data are loaded into memory when the

More information

Operating Systems Design Exam 2 Review: Spring 2011

Operating Systems Design Exam 2 Review: Spring 2011 Operating Systems Design Exam 2 Review: Spring 2011 Paul Krzyzanowski pxk@cs.rutgers.edu 1 Question 1 CPU utilization tends to be lower when: a. There are more processes in memory. b. There are fewer processes

More information

The Impact of Parallel Loop Scheduling Strategies on Prefetching in a Shared-Memory Multiprocessor

The Impact of Parallel Loop Scheduling Strategies on Prefetching in a Shared-Memory Multiprocessor IEEE Transactions on Parallel and Distributed Systems, Vol. 5, No. 6, June 1994, pp. 573-584.. The Impact of Parallel Loop Scheduling Strategies on Prefetching in a Shared-Memory Multiprocessor David J.

More information

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2014 Lecture 14

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2014 Lecture 14 CS24: INTRODUCTION TO COMPUTING SYSTEMS Spring 2014 Lecture 14 LAST TIME! Examined several memory technologies: SRAM volatile memory cells built from transistors! Fast to use, larger memory cells (6+ transistors

More information

CS 416: Opera-ng Systems Design March 23, 2012

CS 416: Opera-ng Systems Design March 23, 2012 Question 1 Operating Systems Design Exam 2 Review: Spring 2011 Paul Krzyzanowski pxk@cs.rutgers.edu CPU utilization tends to be lower when: a. There are more processes in memory. b. There are fewer processes

More information

Memory Management! How the hardware and OS give application pgms:" The illusion of a large contiguous address space" Protection against each other"

Memory Management! How the hardware and OS give application pgms: The illusion of a large contiguous address space Protection against each other Memory Management! Goals of this Lecture! Help you learn about:" The memory hierarchy" Spatial and temporal locality of reference" Caching, at multiple levels" Virtual memory" and thereby " How the hardware

More information

Chapter 8: Main Memory. Operating System Concepts 9 th Edition

Chapter 8: Main Memory. Operating System Concepts 9 th Edition Chapter 8: Main Memory Silberschatz, Galvin and Gagne 2013 Chapter 8: Memory Management Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel

More information

CHAPTER 6 Memory. CMPS375 Class Notes Page 1/ 16 by Kuo-pao Yang

CHAPTER 6 Memory. CMPS375 Class Notes Page 1/ 16 by Kuo-pao Yang CHAPTER 6 Memory 6.1 Memory 233 6.2 Types of Memory 233 6.3 The Memory Hierarchy 235 6.3.1 Locality of Reference 237 6.4 Cache Memory 237 6.4.1 Cache Mapping Schemes 239 6.4.2 Replacement Policies 247

More information

OSEK/VDX. Communication. Version January 29, 2003

OSEK/VDX. Communication. Version January 29, 2003 Open Systems and the Corresponding Interfaces for Automotive Electronics OSEK/VDX Communication Version 3.0.1 January 29, 2003 This document is an official release and replaces all previously distributed

More information

August 1994 / Features / Cache Advantage. Cache design and implementation can make or break the performance of your high-powered computer system.

August 1994 / Features / Cache Advantage. Cache design and implementation can make or break the performance of your high-powered computer system. Cache Advantage August 1994 / Features / Cache Advantage Cache design and implementation can make or break the performance of your high-powered computer system. David F. Bacon Modern CPUs have one overriding

More information

MEMORY MANAGEMENT/1 CS 409, FALL 2013

MEMORY MANAGEMENT/1 CS 409, FALL 2013 MEMORY MANAGEMENT Requirements: Relocation (to different memory areas) Protection (run time, usually implemented together with relocation) Sharing (and also protection) Logical organization Physical organization

More information

Chapter 8: Memory- Management Strategies. Operating System Concepts 9 th Edition

Chapter 8: Memory- Management Strategies. Operating System Concepts 9 th Edition Chapter 8: Memory- Management Strategies Operating System Concepts 9 th Edition Silberschatz, Galvin and Gagne 2013 Chapter 8: Memory Management Strategies Background Swapping Contiguous Memory Allocation

More information

COMP 3361: Operating Systems 1 Final Exam Winter 2009

COMP 3361: Operating Systems 1 Final Exam Winter 2009 COMP 3361: Operating Systems 1 Final Exam Winter 2009 Name: Instructions This is an open book exam. The exam is worth 100 points, and each question indicates how many points it is worth. Read the exam

More information

Memory Management. Goals of this Lecture. Motivation for Memory Hierarchy

Memory Management. Goals of this Lecture. Motivation for Memory Hierarchy Memory Management Goals of this Lecture Help you learn about: The memory hierarchy Spatial and temporal locality of reference Caching, at multiple levels Virtual memory and thereby How the hardware and

More information

Memory Technology. Chapter 5. Principle of Locality. Chapter 5 Large and Fast: Exploiting Memory Hierarchy 1

Memory Technology. Chapter 5. Principle of Locality. Chapter 5 Large and Fast: Exploiting Memory Hierarchy 1 COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface Chapter 5 Large and Fast: Exploiting Memory Hierarchy 5 th Edition Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic

More information