D5.5 Overall system performance, scaling, and outlook on large multicores


IOLanes: Advancing the Scalability and Performance of I/O Subsystems in Multicore Platforms

Contract number: FP
Contract start date: Jan
Duration: months
Project coordinator: FORTH (Greece)
Partners: UPM (Spain), BSC (Spain), IBM (Israel), Intel (Ireland), Neurocom (Greece)

D5.5 Overall system performance, scaling, and outlook on large multicores

Draft date: 9-May-13, 11:50
Delivery date: M39
Due date: M39
Workpackage: WP2
Dissemination level: Public
Authors: Stelios Mavridis, Yannis Sfakianakis, Spyridon Papageorgiou, Manolis Marazakis, and Angelos Bilas (FORTH)
Status: 5. Released (scale: 1. Internal draft preparation, 2. Under internal project review, 3. Release draft preparation, 4. Passed internal project review, 5. Released)

Project funded by the European Commission under the Embedded Systems Unit G3, Directorate General Information Society, 7th Framework Programme.

History of Changes

Date          By                 Description
Apr 10, 2013  Manolis Marazakis  Initial version: discussion of testbed configurations, scaling results (up to 64 cores), input from all subsystems.
Apr 22, 2013  Manolis Marazakis  Text on experimental evaluation and related work.
Apr 29, 2013  Angelos Bilas      Editing.
May 2, 2013   Manolis Marazakis  Figures.
May 7, 2013   Manolis Marazakis  Editing.
May 8, 2013   Angelos Bilas      Editing.

Table of Contents

Table of Contents
List of Tables
List of Figures
Executive Summary
1. Overview of Deliverable
   1.1 Main contributions and achievements
   1.2 Progress beyond the state-of-the-art
   1.3 Alignment with project goals and time-plan for this reporting period and plan for the next reporting period
   1.4 Structure of deliverable
2. Description of experimental systems
3. Description of alternative data structures for the pcache layer
4. Experimental results
   4.1 I/O throughput results
   4.2 IOPS results
   4.3 Filesystem metadata-intensive results
5. Summary and Discussion

List of Tables

Table 1: Key properties of the two experimental NUMA server platforms.
Table 2: Comparison of alternative data structures for the pcache layer.

List of Figures

Figure 1: Block diagram of the current-generation 8-core server platform (Tyan S7025 motherboard, with two processor sockets and a symmetric interconnect) [cf. citation 6].
Figure 2: Block diagram of the next-generation 64-core server platform (Tyan S8812 motherboard, with four processor sockets and an asymmetric interconnect) [cf. citation 7].
Figure 3: I/O throughput achieved using the IOR benchmark (reads, writes), with 4KB requests, on a 64-core NUMA server. Scalability limitations of the unmodified Linux kernel become evident with more than 24 cores.
Figure 4: I/O throughput achieved using the IOR benchmark (reads, writes), with 4KB requests, on an 8-core NUMA server. The write throughput of the baseline system is disturbingly low. Affinity-conscious alternatives consistently outperform affinity-unaware ones.
Figure 5: IOPS performance achieved using the fio benchmark (reads, writes), with 4KB random requests, on a 64-core NUMA server. Scalability limitations of the unmodified Linux kernel become evident with more than 24 cores.
Figure 6: Variation of IOPS performance on the 8-core NUMA server. Even with relatively low core counts and a symmetric system interconnect, the placement of I/O-issuing threads affects IOPS performance (by up to 21%).
Figure 7: File creation overhead, with up to 16 threads on the 64-core NUMA server. The baseline system does not scale, with overhead increasing by 65% between consecutive data points.

Executive Summary

Building on the work reported in the previous reporting periods of the IOLanes project, this report presents a focused evaluation of our custom I/O software stack on a large-scale NUMA multicore server. The main contribution of the work presented in this deliverable is an evaluation of NUMA effects on our partitioned I/O stack. We measure I/O throughput and IOPS performance, comparing the unmodified Linux kernel with two variations of our partitioned I/O stack. We observe severe scalability limitations in the baseline system (representative of today's commonly deployed data-centre servers) when the core count exceeds 20. Our custom I/O stack not only scales much better with up to 64 cores, but also outperforms the baseline system by a significant margin, especially with small requests. With 64 cores, our I/O stack improves read I/O throughput by 166% over the baseline system, and write I/O throughput by 122%. Regarding IOPS performance, our I/O stack improves performance by 88% over the baseline system for reads, and by 118% for writes.

Furthermore, we present results from a metadata-intensive operation, specifically the creation of a large file-set by independent threads. The baseline system suffers from scalability limitations for such metadata-intensive operations due to global ordering in the journal. We find that our partitioned I/O stack, which supports multiple, independent journal instances, achieves better scalability and reduces completion time by 88% compared to the baseline.

In our scalability evaluation, we compare two alternative data-structure options for our implementation: hash tables and radix trees. Both implementations achieve good scalability with up to 64 cores. However, we find that the implementation based on radix trees consistently achieves better absolute performance, and that the two designs exhibit different trade-offs in the handling of reads and writes.

Overall, we expect our results to hold for upcoming larger-scale NUMA server platforms, with even more pronounced non-uniformity in remote memory access times and cross-core synchronization overheads. In such a setting, the benefits of scalable data structures capable of low-overhead operation under high concurrency will become even more critical for efficiency and high performance.

1. Overview of Deliverable

1.1 Main contributions and achievements

The I/O stack discussed in this deliverable is a fundamental redesign of how I/O is performed in modern servers. Today, all functions in the kernel I/O path treat cores and memory symmetrically, which is at odds with current technology trends, NUMA architectures, and complex workloads. This approach leads to significant contention and interference. For example, data buffers of the buffer cache are placed in any available memory (module) and kernel code executes on any available core, so accesses to a specific buffer can originate from any core, resulting in contention. In our approach, we use partitioning to contain all these effects as the number of cores, devices, and other physical resources grows with current technology trends.

We have evaluated our work on a much more challenging server platform than the previously used test machines: more cores (64 vs. 8 or 12 in the servers used during the previous periods), and much more pronounced NUMA effects (due to a much more asymmetric system interconnect). Important overheads and effects, in particular the impact of thread/data affinity and of undesired context switches, are much more pronounced on this server platform, and allow us to further validate our partitioning approach. Moreover, we present a quantitative comparison between two variants of our partitioned RAM cache design, providing essential insight into the impact of data-structure properties in the presence of concurrent I/O requests. We compare two implementations, one using hash tables and one using radix trees.

Overall, our approach is a significant departure from the current I/O path design, and we believe it will allow I/O performance to scale with the physical resources in future servers. Although these benefits have already been demonstrated in previous evaluations, we believe they will become even more important in larger-scale NUMA server platforms. In this report, we present a comprehensive evaluation of the partitioned I/O stack with up to 64 cores.

1.2 Progress beyond the state-of-the-art

Overall, we examine three main aspects of scalability: NUMA effects, ordering, and locking.

(1) Our work is the first to discuss NUMA effects on I/O performance and scalability at larger scales. The Linux kernel is not yet ready to handle large numbers of cores and their (inevitable) non-uniform memory access characteristics. The main contribution of the work presented in this deliverable is an evaluation of NUMA effects on our partitioned I/O stack. We have focused our evaluation on workload file sets that fit in main memory, thus avoiding the additional complications of storage device access. Since storage device access is itself subject to NUMA effects, we believe that our results remain relevant for workloads that do not fit in memory. Related to NUMA effects, we evaluate two alternative data structures for the per-partition block caches: hash tables and radix trees (cf. citation [5] for an overview). We believe that the scalability properties of data structures used in the I/O stack will become increasingly important for achieving scalability on server platforms with higher core counts than what is currently the norm.

(2) Our work is also among the first to present scaling results on the impact of recovery (journaling) with increasing core counts.
This case is representative of workloads that involve the creation of large numbers of files, and brings out severe scalability limitations of current filesystem implementations due to contention. Our partitioned I/O stack design supports multiple, mostly independent journals, rather than a single centralized point of coordination for updates to the persistent data structures of the filesystem.

(3) Previous work (cf. citations [1,2,3,4]) has shown substantial potential for improvements in the Linux kernel by reducing lock contention. However, the Linux kernel has been steadily improving in this respect, incorporating more efficient locking primitives and more scalable data structures. Our evaluation shows that locking in the I/O path is not a major concern.
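To make the partitioning approach concrete, the following minimal C sketch illustrates the general idea of a block cache split into per-lane partitions, each with its own lock and hash table, so that threads operating on different lanes do not contend on any shared structure. This is an illustrative sketch only, not the pcache implementation; names such as pcache_partition, NR_LANES, and lane_of are assumptions introduced for this example.

```c
#include <pthread.h>
#include <stdint.h>
#include <string.h>

#define NR_LANES 16     /* one cache partition ("lane") per group of cores; illustrative value */
#define BUCKETS  4096   /* hash buckets per partition */

struct cache_block {
    uint64_t            blkno;   /* device block address */
    void               *data;    /* cached block contents */
    struct cache_block *next;    /* collision chain */
};

struct pcache_partition {
    pthread_mutex_t     lock;             /* protects only this partition */
    struct cache_block *bucket[BUCKETS];  /* per-partition hash table */
};

static struct pcache_partition lanes[NR_LANES];

static inline unsigned lane_of(uint64_t blkno)   { return blkno % NR_LANES; }
static inline unsigned bucket_of(uint64_t blkno) { return (blkno / NR_LANES) % BUCKETS; }

void pcache_init(void)
{
    for (unsigned i = 0; i < NR_LANES; i++) {
        pthread_mutex_init(&lanes[i].lock, NULL);
        memset(lanes[i].bucket, 0, sizeof(lanes[i].bucket));
    }
}

/* Lookup takes only the lane-local lock: threads working on blocks that map
 * to different lanes never touch the same lock or the same memory. */
struct cache_block *pcache_lookup(uint64_t blkno)
{
    struct pcache_partition *p = &lanes[lane_of(blkno)];
    struct cache_block *b;

    pthread_mutex_lock(&p->lock);
    for (b = p->bucket[bucket_of(blkno)]; b != NULL; b = b->next)
        if (b->blkno == blkno)
            break;
    pthread_mutex_unlock(&p->lock);
    return b;
}
```

In a real deployment, the lane-selection function would also take into account the NUMA node on which the calling thread runs, so that a lane's blocks stay in memory local to the cores that use them.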

1.3 Alignment with project goals and time-plan for this reporting period and plan for the next reporting period

This deliverable discusses the implementation of the base (non-virtualized) partitioned stack for improving scaling, which is the main task in WP2. The implementation of the stack was available for initial runs during Period 2, and evolved substantially during Period 3, in accordance with the original plan. The code is divided into three main components (filesystem, cache, recovery), and includes glue components and modifications to other parts of the system, especially for passing important information across existing components and layers. These components are used in specific configurations with the rest of the project components to build the denser and leaner I/O stack we are targeting.

Our work in this reporting period is in line with the plan for following up on the work of the three regular reporting periods. We had a first-draft implementation at the end of Period 2. During Period 2 we managed to get parts of the stack working somewhat earlier than originally planned, and also managed to integrate them with other components successfully, which resulted in an early working version of the integrated system. Thus, Period 3 was devoted to evaluation and optimization in the context of the integrated system, as was the original plan. Furthermore, we evaluated our work on a larger-scale NUMA server, aiming to verify the scalability potential of our design. We discovered scalability limitations for the baseline system when the core count exceeds 20; our design maintains good performance up to the full core count (64) of the new testbed. Finally, it is also important to note that this evaluation provides a much-needed set of reference results for better understanding I/O issues in modern servers, identifying the main issues for scaling I/O performance, and proposing further enhancements at the system and architectural levels.

1.4 Structure of deliverable

Section 2 describes the experimental platforms used in this evaluation. Section 3 describes and compares the core data structures used in our scalable caching layer implementations. Section 4 presents our results, comparing the unmodified Linux kernel with our custom partitioned I/O stack. Section 5 summarizes our work and results.

2. Description of experimental systems

We present results from two NUMA multicore server systems: an 8-core server with relatively uniform memory access costs, representative of current-generation server nodes in data centres, and a 64-core server with much more pronounced non-uniformity in memory access costs, representative of upcoming server architectures. The current-generation 8-core server is comparable with the shared testbed used throughout the duration of the IOLanes project. The 64-core server is the new, more challenging platform used for the scalability evaluation reported in this deliverable.

We focus our experimental evaluation on workloads that fit entirely in the server's memory, aiming to stress the I/O stack layers as much as possible and to obtain low-contention, low-latency responses. This arrangement is a commonly accepted best practice for online data processing workloads, and is also in line with the architectural trend of introducing non-volatile memories in servers. Table 1 summarizes the key properties of the two experimental server platforms. Block diagrams of the two server platforms are shown in Figures 1 and 2, respectively.

Table 1: Key properties of the two experimental NUMA server platforms.

                              Current-generation 8-core server                  Next-generation 64-core server
Processor socket count        2                                                 4
Cores per processor socket    4 (8 with hyperthreading)                         16
Motherboard                   Tyan S7025                                        Tyan S8812
Processor type                Intel Xeon E5620 (2.4GHz)                         AMD Opteron 6272 (2.1GHz)
Processor core caches         L1: 128KB (code), 128KB (data);                   L1: 2x32KB (code, shared by 2 cores), 16KB (data);
                              L2: 1MB; L3: 12MB                                 L2: 4x2MB (shared by 2 cores); L3: 2x8MB (shared by 4 cores)
DRAM (DDR3, # DIMMs)          Up to 8 (up to 64GB)                              Up to 16 (up to 512GB)
Interconnect type             QPI (5.86 GT/sec)                                 HyperTransport
Interconnect topology         Dual-ring, symmetric                              Point-to-point, asymmetric

In terms of systems software, we use in this evaluation study the partitioned I/O stack developed over the three regular reporting periods of the IOLanes project. We have developed various scalability enhancements for this I/O stack, aimed at eliminating points of contention on its data structures. The most significant enhancement in the design is the explicit manipulation of thread/data affinity in the pfs filesystem. The most prominent change in the implementation is the switch of the main data structure used by the pcache caching layer from a hash table (one per partition, or "slice") to a radix tree. In the following sections, we present results from our experiments, comparing the performance of our two alternative I/O stack implementations (marked IOslices/HT and IOslices/RT, for the hash-table and radix-tree variants, respectively) with that of the unmodified Linux kernel (marked Native (xfs)).

3. Description of alternative data structures for the pcache layer

We implemented two alternative data structures for keeping track of filesystem blocks in memory. The hash-table implementation was our original choice, and has been described in detail in previous reports. Towards the end of the 3rd reporting period, we developed a radix-tree data structure, and in this report we present a quantitative comparison of the two pcache implementations. A qualitative comparison of the two data structures is shown in Table 2.

In both implementations, there is a distinct instance of the data structure for each of the lanes that we have provided in the server's I/O path. Our first implementation (hash table) was used in all evaluation results presented up to this point in the IOLanes project, and has delivered strong performance in a variety of configurations. The alternative implementation is more complex, especially in terms of concurrency control; however, we feel that this additional complexity is justified by the performance results presented in this report.

Overall, we expect our results to hold for upcoming larger-scale NUMA server platforms, with even more pronounced non-uniformity in remote memory access times and cross-core synchronization overheads. In such a setting, the benefits of scalable data structures capable of low-overhead operation under high concurrency will become even more critical for efficiency and high performance.

Figure 1: Block diagram of the current-generation 8-core server platform (Tyan S7025 motherboard, with two processor sockets and a symmetric interconnect) [cf. citation 6].

Figure 2: Block diagram of the next-generation 64-core server platform (Tyan S8812 motherboard, with four processor sockets and an asymmetric interconnect) [cf. citation 7].

Table 2: Comparison of alternative data structures for the pcache layer.

Hash Table:
- O(1) overhead for lookup/insert/remove operations in the common case (i.e. when hash-function conflicts are rare).
- Allows simultaneous reads and writes from multiple threads, as long as there is no hash-function conflict between them.
- Fine-grain locking (per collision list), with very little contention as long as hash-function collisions are rare.
- Consecutive blocks (as identified by their device block address) are highly unlikely to be close together in the hash table, i.e. we need to issue a synchronized group of lookup operations for requests involving multiple blocks.
  - This spread of consecutive elements is detrimental to processor cache locality, and therefore to system performance, for I/O accesses that involve consecutive blocks. The effect is particularly acute when evictions are needed to release space in the cache.
  - The replacement-policy implementation needs to scan the entire cache to obtain a set of blocks to be evicted. The eviction context attempts to amortize this overhead by collecting several blocks for eviction (up to 1MB), thus generating a non-sequential device access pattern.
- Lock-based concurrency control scheme.
- Relatively simple implementation.

Radix Tree:
- log(K) worst-case lookup overhead, where K is the number of bits in a block address.
- Optimized for high concurrency, fast reads, and low-concurrency "background" writes:
  - Reads are lock-free (reading threads never block, even while writes are ongoing).
  - Reader threads always see a consistent version of the radix tree.
  - Reader threads do not block writer threads.
  - Writer threads only block each other, i.e. they never block reader threads.
- Consecutive blocks are placed close together in the radix tree (because their block addresses share a common prefix), allowing fast lookup for requests involving multiple blocks. Moreover, the radix tree allows a faster implementation of the replacement policy.
- RCU (read-copy-update) concurrency control scheme.
- More complex implementation, especially regarding concurrency control.
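To make the contrast in Table 2 concrete, the sketch below shows a fixed-fanout (64-way) radix tree keyed by the device block address, walking six address bits per level. It is a simplified illustration, not the pcache radix tree: names such as radix_node and RADIX_LEVELS are assumptions, and the RCU-based lock-free read path of the actual implementation is only hinted at in the comments.

```c
#include <stdint.h>
#include <stdlib.h>

#define RADIX_BITS   6                       /* 64-way fanout per level */
#define RADIX_FANOUT (1u << RADIX_BITS)
#define RADIX_MASK   (RADIX_FANOUT - 1)
#define RADIX_LEVELS 8                       /* covers 48-bit block addresses */

struct radix_node {
    void *slot[RADIX_FANOUT];  /* child node, or cached block at the last level */
};

/* Lookup: O(RADIX_LEVELS) pointer dereferences, independent of occupancy.
 * Blocks whose addresses share a prefix share the upper levels of the path,
 * which is why consecutive blocks end up adjacent in the same leaf node. */
void *radix_lookup(struct radix_node *root, uint64_t blkno)
{
    struct radix_node *node = root;
    for (int level = RADIX_LEVELS - 1; level > 0 && node; level--)
        node = node->slot[(blkno >> (level * RADIX_BITS)) & RADIX_MASK];
    return node ? node->slot[blkno & RADIX_MASK] : NULL;
}

/* Insert: allocates intermediate nodes on demand. A writer would serialize
 * against other writers (e.g. with a per-lane lock), while readers could
 * proceed concurrently if updates are published with RCU-style pointer
 * assignment, as in the lock-free read path described in Table 2. */
int radix_insert(struct radix_node *root, uint64_t blkno, void *block)
{
    struct radix_node *node = root;
    for (int level = RADIX_LEVELS - 1; level > 0; level--) {
        unsigned idx = (blkno >> (level * RADIX_BITS)) & RADIX_MASK;
        if (!node->slot[idx]) {
            node->slot[idx] = calloc(1, sizeof(struct radix_node));
            if (!node->slot[idx])
                return -1;
        }
        node = node->slot[idx];
    }
    node->slot[blkno & RADIX_MASK] = block;
    return 0;
}
```

A caller would allocate the root node once (e.g. with calloc) and then use radix_insert and radix_lookup with raw device block numbers; consecutive block numbers differ only in the low-order bits, so they land in the same leaf node.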

4. Experimental results

Sections 4.1 and 4.2 present I/O throughput and I/O operation rate (IOPS) results that illustrate the scalability limitations of the baseline system and the scaling behaviour of our alternative I/O stack design. Section 4.3 presents results from a metadata-intensive workload. NUMA effects are a particular focus of our evaluation, as we have quantified a disturbingly wide range of performance variation for the baseline system. The IOLanes partitioned I/O stack explicitly addresses NUMA effects by modifying, at run time, the core affinity mask of I/O-issuing threads. This design decision has a tremendous impact on performance and scalability, as seen in the following results.

We use the MPI-based IOR benchmark to emulate various checkpointing patterns, through the POSIX I/O interface. In our experiments, we vary the number of processes (instances), each generating an aggregate I/O volume of 1.25 GBytes in five iterations of a write-then-read pattern. Overall, we generate an interleaving of streaming read and write accesses. We set the I/O request size to 4 KBytes. The metric reported by IOR is data throughput in MBytes/s, separately for the write and read phases. We also use the fio I/O workload generator to emulate concurrent streams of I/O requests. Finally, we use the make-many-files microbenchmark as a stress test for filesystem metadata performance.

4.1 I/O Throughput results

We start with experimental results from running the IOR benchmark: sequential reads and writes, with 4KB requests. These I/O patterns aim to stress the scaling capabilities of the I/O stack layers, as they involve concurrent access streams with relatively small I/O requests that highlight any processing overheads in the common I/O path. The IOR results, shown in Figure 3, are from the 64-core server platform.

[Figure 3 panels: (a) IOR read throughput and (b) IOR write throughput, in MB/sec vs. number of cores, for Native (xfs), IOslices/HT, and IOslices/RT.]

Figure 3: I/O throughput achieved using the IOR benchmark (reads, writes), with 4KB requests, on a 64-core NUMA server. Scalability limitations of the unmodified Linux kernel become evident with more than 24 cores.

We observe severe scalability limitations for the baseline system (unmodified Linux kernel) with 24 or more cores; our configurations scale linearly up to 64 cores, and outperform the baseline system. The radix-tree implementation consistently achieves 10-15% better throughput than the hash-table implementation. Both of our implementations explicitly set the processor affinity mask of the I/O-issuing threads, to avoid remote memory references as much as possible. The baseline system does not have this facility and, under high concurrency, suffers both the penalty of remote memory references and the penalty of undesired thread migrations upon context switches. With 64 cores, our I/O stack improves read I/O throughput by 166% over the baseline system, and write I/O throughput by 122%.

[Figure 4 panels: (a) IOR read throughput and (b) IOR write throughput on the 8-core server.]

Figure 4: I/O throughput achieved using the IOR benchmark (reads, writes), with 4KB requests, on an 8-core NUMA server. The write throughput of the baseline system is disturbingly low. Affinity-conscious alternatives consistently outperform affinity-unaware ones.

In Figure 4, we present results from running the IOR read and write throughput tests on the current-generation 8-core NUMA server platform as well. For our two implementations of the RAM cache, we present results with and without the thread/data affinity control mechanism. With the lower core count, the scalability limitations of the baseline system are not visible for reads, as shown in Figure 4(a). However, the baseline write results are quite disturbing; Figure 4(b) summarizes these results, showing that our affinity-conscious implementations outperform the baseline system. Moreover, Figure 4 illustrates the wide range of performance that an application can experience without thread/data affinity control: our affinity-conscious implementations consistently outperform the affinity-unaware ones, particularly with more than 4 cores.

4.2 IOPS results

Figure 5 presents results from running fio with 64 concurrent I/O-issuing threads, with each thread issuing 4KB requests with a random distribution. This experiment measures the maximum achievable IOPS, for reads and writes. We observe (as in Figure 3) that the baseline system does not scale beyond 24 cores, and in fact its IOPS performance drops with more than 40 cores. Our two implementations both scale linearly, and outperform the baseline system with more than 28 cores for reads and 20 cores for writes. The radix-tree implementation consistently outperforms the hash-table implementation, for both reads and writes.

The hash-table implementation matches and then exceeds the IOPS performance of the baseline system at 32 cores, and continues to scale beyond that point. Overall, our I/O stack improves read IOPS performance by 88% over the baseline system, and write IOPS performance by 118%.

[Figure 5 panels: (a) maximum read IOPS and (b) maximum write IOPS (fio, 4KB random requests), in millions of IOPS vs. number of cores, for Native (xfs), IOslices/HT, and IOslices/RT.]

Figure 5: IOPS performance achieved using the fio benchmark (reads, writes), with 4KB random requests, on a 64-core NUMA server. Scalability limitations of the unmodified Linux kernel become evident with more than 24 cores.

The main advantage of our I/O stack over the baseline system derives from explicitly considering thread/data affinity. This aspect of I/O performance will become increasingly critical with higher core counts, and it is already important in current-generation servers. Unmanaged thread/data affinity results in large variance of application-level performance, even in configurations where we can have a one-to-one mapping between application threads and cores. To illustrate this point, Figure 6 shows the 4KB random-read IOPS scores (again with fio and 64 concurrent threads) achieved by two configurations of the baseline system on the 8-core NUMA server platform: (a) no placement constraints for the I/O-issuing threads (marked in the graphs as "default"), and (b) placement constraints enforced statically using the taskset utility (marked in the graphs as "fixed"). The default placement is what a typical application would experience on a current-generation server system. The fixed placement is what could be achieved if certain simplifying assumptions hold for the application: (a) a fixed number of threads (i.e. no dynamically spawned threads), and (b) a private working set per thread, allowing a simple, static setting of the thread-core affinity mask. We observe that even with the relatively low core count and symmetric memory-reference costs of this particular server platform, there is a significant margin of variation in IOPS performance, from 3.5% up to 21%. Our partitioned I/O stack dynamically adjusts thread affinity, even with dynamically spawned threads and overlapping working sets.
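The "fixed" placement above was enforced with the taskset utility (e.g. taskset -c 0-7 confines a process to cores 0 through 7); the same effect can be achieved programmatically. The sketch below is an illustrative example of pinning an I/O-issuing thread to the cores of one socket using the standard Linux pthread affinity call; the IOLanes stack adjusts affinity dynamically inside the kernel rather than through this user-level interface, and the helper name pin_to_cores is an assumption for this example.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to a contiguous range of cores, e.g. the cores of one
 * processor socket / NUMA node. Assumes the caller knows the core numbering of
 * the target node (e.g. from /sys/devices/system/node/node0/cpulist). */
int pin_to_cores(int first_core, int ncores)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    for (int c = first_core; c < first_core + ncores; c++)
        CPU_SET(c, &set);

    /* Returns 0 on success; afterwards the thread is scheduled only on the
     * selected cores, avoiding undesired migrations across sockets. */
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```

Combined with Linux's default first-touch memory allocation policy, such pinning also tends to keep a thread's buffers on the memory node local to its cores.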

[Figure 6 panels: (a) fio read and write IOPS with default thread placement, and (b) variation in IOPS performance due to thread placement ("default" vs. "fixed"), in millions of IOPS vs. number of cores.]

Figure 6: Variation of IOPS performance on the 8-core NUMA server. Even with relatively low core counts and a symmetric system interconnect, the placement of I/O-issuing threads affects IOPS performance (by up to 21%).

4.3 Filesystem metadata-intensive results

For completeness, we also present evaluation results for a metadata-intensive microbenchmark, with up to 16 concurrent threads, each creating a file tree. Each thread creates a 3-level directory hierarchy, and initializes 300 1KB files in each of the 3rd-level directories. Thus, each thread creates 8,100 files, for a total of 129,600 files in the 16-thread case. The threads are independent, so the expectation would be a roughly constant time for creating the file tree as the thread count grows. This is far from being the case with the baseline system.
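For reference, the per-thread workload of the make-many-files microbenchmark has roughly the structure sketched below, assuming a 3-way fanout at each of the three directory levels (27 leaf directories x 300 files = 8,100 files per thread, matching the totals reported above). The actual microbenchmark code may differ; create_tree, FANOUT, and NFILES are names introduced for this illustration.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

#define FANOUT   3    /* assumed 3-way fanout at each of the 3 directory levels */
#define NFILES 300    /* 1KB files created in every 3rd-level directory */

/* One thread's share of the workload: a 3-level directory tree with 300 small
 * files per leaf directory (3 * 3 * 3 * 300 = 8,100 files per thread). */
static void create_tree(const char *root)
{
    char d1[256], d2[320], d3[384], file[448];
    char buf[1024] = {0};   /* 1KB of file data */

    for (int a = 0; a < FANOUT; a++) {
        snprintf(d1, sizeof(d1), "%s/d%d", root, a);
        mkdir(d1, 0755);
        for (int b = 0; b < FANOUT; b++) {
            snprintf(d2, sizeof(d2), "%s/d%d", d1, b);
            mkdir(d2, 0755);
            for (int c = 0; c < FANOUT; c++) {
                snprintf(d3, sizeof(d3), "%s/d%d", d2, c);
                mkdir(d3, 0755);
                for (int f = 0; f < NFILES; f++) {
                    snprintf(file, sizeof(file), "%s/f%d", d3, f);
                    int fd = open(file, O_CREAT | O_WRONLY | O_TRUNC, 0644);
                    if (fd >= 0) {
                        (void)write(fd, buf, sizeof(buf));  /* write 1KB */
                        close(fd);
                    }
                }
            }
        }
    }
}
```

Every file and directory creation in this loop is a metadata update that must be journalled, which is what exposes the global-ordering bottleneck of a single shared journal.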

Figure 7: File creation overhead, with up to 16 threads on the 64-core NUMA server. The baseline system does not scale, with overhead increasing by 65% between consecutive data points.

Figure 7 shows a comparison of the execution time for this microbenchmark, for the baseline system and our I/O stack. Both configurations use a RAM disk as the backing store for the filesystem (xfs for the baseline system, pfs for our I/O stack). We observe severe performance degradation with the baseline system, with execution times increasing by about 65% between consecutive data points. With our I/O stack, the increase in overhead is much less pronounced (8-16%). With 16 threads, our I/O stack reduces the time for creating the requested file-set by 88%. These results highlight the benefits of partitioning the I/O stack, and in particular the benefit of having multiple mostly-independent journal instances rather than a single one. The pjournal layer ensures the consistency of the filesystem's persistent data structures in the presence of failures, as in existing filesystems; however, there is no contention between independent threads that do not share access to files. The pjournal layer implements a coordination protocol between journal instances, but does not unnecessarily serialize requests from independent threads.
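To illustrate why multiple journal instances avoid the global ordering bottleneck, the following minimal sketch gives each lane its own journal, with a per-lane lock and transaction counter, so that only threads committing to the same lane ever serialize. This is an illustrative model of the idea, not the pjournal implementation; names such as lane_journal and journal_commit are assumptions, and the cross-journal coordination protocol mentioned above is omitted.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define NR_LANES 16   /* illustrative value */

struct lane_journal {
    pthread_mutex_t lock;      /* serializes commits within one lane only */
    uint64_t        next_tid;  /* per-lane transaction id: no global ordering */
};

static struct lane_journal journals[NR_LANES];

void pjournal_init(void)
{
    for (unsigned i = 0; i < NR_LANES; i++) {
        pthread_mutex_init(&journals[i].lock, NULL);
        journals[i].next_tid = 1;
    }
}

/* With a single shared journal, every commit would take one global lock and
 * receive a globally ordered transaction id. Here, threads that create files
 * under different lanes commit concurrently, without ordering against each other. */
uint64_t journal_commit(unsigned lane, const void *records, size_t nbytes)
{
    struct lane_journal *j = &journals[lane % NR_LANES];
    uint64_t tid;

    (void)records; (void)nbytes;   /* log append and flush elided in this sketch */

    pthread_mutex_lock(&j->lock);
    tid = j->next_tid++;
    /* ... append 'records' to this lane's on-disk log and flush it ... */
    pthread_mutex_unlock(&j->lock);
    return tid;
}
```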

5. Summary and Discussion

We have identified scalability limitations of the Linux kernel, using targeted, intensive tests of the common I/O path, for both data and metadata. The main contribution of the work presented in this deliverable is an evaluation of NUMA effects on our partitioned I/O stack. We demonstrate significantly improved scalability, for both I/O throughput-intensive and IOPS-intensive tests. Most of these limitations are not observed at relatively low core counts (8-12), but become severe with more than 20 cores. With 64 cores, our I/O stack improves read I/O throughput by 166% over the baseline system, and write I/O throughput by 122%. Regarding IOPS performance, our I/O stack improves performance by 88% over the baseline system for reads, and by 118% for writes.

Furthermore, we provide a quantitative comparison between two options for the data structure used in the partitioned RAM cache (hash table vs. radix tree). Although both options provide good scalability, the implementation using radix trees achieves consistently better performance. Finally, we compare our I/O stack with the baseline for a metadata-intensive workload with up to 16 independent threads, and again find severe scalability limitations in the baseline system. For 16 threads, we achieve an 88% reduction in the time required to create a large file-set.

Overall, we expect our results to hold for upcoming larger-scale NUMA server platforms, with even more pronounced non-uniformity in remote memory access times and cross-core synchronization overheads. In such a setting, the benefits of scalable data structures capable of low-overhead operation under high concurrency will become even more critical for efficiency and high performance.

References

1. Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich. An Analysis of Linux Scalability to Many Cores. In Proceedings of the 9th USENIX Symposium on Operating Systems Design and Implementation (OSDI '10), Vancouver, Canada, October 2010.
2. Silas Boyd-Wickizer, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich. Non-scalable Locks are Dangerous. In Proceedings of the Linux Symposium, Ottawa, Canada.
3. Yan Cui, Yingxin Wang, Yu Chen, and Yuanchun Shi. Lock-contention-aware Scheduler: A Scalable and Energy-efficient Method for Addressing Scalability Collapse on Multicore Systems. ACM Transactions on Architecture and Code Optimization (TACO), Special Issue on High-Performance Embedded Architectures and Compilers, vol. 9, no. 4, p. 44.
4. Da Zheng, Randal Burns, and Alexander S. Szalay. Parallel Page Cache: IOPS and Caching for Multicore Systems. In Proceedings of the 4th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage).
5. Alfred V. Aho, John E. Hopcroft, and Jeffrey D. Ullman. Data Structures and Algorithms. Addison-Wesley.
6. User Guide for the Tyan S7025 motherboard. URL:
7. User Guide for the Tyan S8812 motherboard. URL:


More information

Oracle Database 12c: JMS Sharded Queues

Oracle Database 12c: JMS Sharded Queues Oracle Database 12c: JMS Sharded Queues For high performance, scalable Advanced Queuing ORACLE WHITE PAPER MARCH 2015 Table of Contents Introduction 2 Architecture 3 PERFORMANCE OF AQ-JMS QUEUES 4 PERFORMANCE

More information

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Outline Key issues to design multiprocessors Interconnection network Centralized shared-memory architectures Distributed

More information

Agilio CX 2x40GbE with OVS-TC

Agilio CX 2x40GbE with OVS-TC PERFORMANCE REPORT Agilio CX 2x4GbE with OVS-TC OVS-TC WITH AN AGILIO CX SMARTNIC CAN IMPROVE A SIMPLE L2 FORWARDING USE CASE AT LEAST 2X. WHEN SCALED TO REAL LIFE USE CASES WITH COMPLEX RULES TUNNELING

More information

Performance Modeling and Analysis of Flash based Storage Devices

Performance Modeling and Analysis of Flash based Storage Devices Performance Modeling and Analysis of Flash based Storage Devices H. Howie Huang, Shan Li George Washington University Alex Szalay, Andreas Terzis Johns Hopkins University MSST 11 May 26, 2011 NAND Flash

More information

Computer Systems Research in the Post-Dennard Scaling Era. Emilio G. Cota Candidacy Exam April 30, 2013

Computer Systems Research in the Post-Dennard Scaling Era. Emilio G. Cota Candidacy Exam April 30, 2013 Computer Systems Research in the Post-Dennard Scaling Era Emilio G. Cota Candidacy Exam April 30, 2013 Intel 4004, 1971 1 core, no cache 23K 10um transistors Intel Nehalem EX, 2009 8c, 24MB cache 2.3B

More information

Crossing the Chasm: Sneaking a parallel file system into Hadoop

Crossing the Chasm: Sneaking a parallel file system into Hadoop Crossing the Chasm: Sneaking a parallel file system into Hadoop Wittawat Tantisiriroj Swapnil Patil, Garth Gibson PARALLEL DATA LABORATORY Carnegie Mellon University In this work Compare and contrast large

More information

Agenda. System Performance Scaling of IBM POWER6 TM Based Servers

Agenda. System Performance Scaling of IBM POWER6 TM Based Servers System Performance Scaling of IBM POWER6 TM Based Servers Jeff Stuecheli Hot Chips 19 August 2007 Agenda Historical background POWER6 TM chip components Interconnect topology Cache Coherence strategies

More information

Optimizing TCP Receive Performance

Optimizing TCP Receive Performance Optimizing TCP Receive Performance Aravind Menon and Willy Zwaenepoel School of Computer and Communication Sciences EPFL Abstract The performance of receive side TCP processing has traditionally been dominated

More information

Falcon: Scaling IO Performance in Multi-SSD Volumes. The George Washington University

Falcon: Scaling IO Performance in Multi-SSD Volumes. The George Washington University Falcon: Scaling IO Performance in Multi-SSD Volumes Pradeep Kumar H Howie Huang The George Washington University SSDs in Big Data Applications Recent trends advocate using many SSDs for higher throughput

More information

SAY-Go: Towards Transparent and Seamless Storage-As-You-Go with Persistent Memory

SAY-Go: Towards Transparent and Seamless Storage-As-You-Go with Persistent Memory SAY-Go: Towards Transparent and Seamless Storage-As-You-Go with Persistent Memory Hyeonho Song, Sam H. Noh UNIST HotStorage 2018 Contents Persistent Memory Motivation SAY-Go Design Implementation Evaluation

More information

White Paper. File System Throughput Performance on RedHawk Linux

White Paper. File System Throughput Performance on RedHawk Linux White Paper File System Throughput Performance on RedHawk Linux By: Nikhil Nanal Concurrent Computer Corporation August Introduction This paper reports the throughput performance of the,, and file systems

More information

COS 318: Operating Systems. Journaling, NFS and WAFL

COS 318: Operating Systems. Journaling, NFS and WAFL COS 318: Operating Systems Journaling, NFS and WAFL Jaswinder Pal Singh Computer Science Department Princeton University (http://www.cs.princeton.edu/courses/cos318/) Topics Journaling and LFS Network

More information

Prefetch Threads for Database Operations on a Simultaneous Multi-threaded Processor

Prefetch Threads for Database Operations on a Simultaneous Multi-threaded Processor Prefetch Threads for Database Operations on a Simultaneous Multi-threaded Processor Kostas Papadopoulos December 11, 2005 Abstract Simultaneous Multi-threading (SMT) has been developed to increase instruction

More information

The Spider Center-Wide File System

The Spider Center-Wide File System The Spider Center-Wide File System Presented by Feiyi Wang (Ph.D.) Technology Integration Group National Center of Computational Sciences Galen Shipman (Group Lead) Dave Dillow, Sarp Oral, James Simmons,

More information

QLogic TrueScale InfiniBand and Teraflop Simulations

QLogic TrueScale InfiniBand and Teraflop Simulations WHITE Paper QLogic TrueScale InfiniBand and Teraflop Simulations For ANSYS Mechanical v12 High Performance Interconnect for ANSYS Computer Aided Engineering Solutions Executive Summary Today s challenging

More information

The MOSIX Scalable Cluster Computing for Linux. mosix.org

The MOSIX Scalable Cluster Computing for Linux.  mosix.org The MOSIX Scalable Cluster Computing for Linux Prof. Amnon Barak Computer Science Hebrew University http://www. mosix.org 1 Presentation overview Part I : Why computing clusters (slide 3-7) Part II : What

More information

Dynamic Translator-Based Virtualization

Dynamic Translator-Based Virtualization Dynamic Translator-Based Virtualization Yuki Kinebuchi 1,HidenariKoshimae 1,ShuichiOikawa 2, and Tatsuo Nakajima 1 1 Department of Computer Science, Waseda University {yukikine, hide, tatsuo}@dcl.info.waseda.ac.jp

More information

Modification and Evaluation of Linux I/O Schedulers

Modification and Evaluation of Linux I/O Schedulers Modification and Evaluation of Linux I/O Schedulers 1 Asad Naweed, Joe Di Natale, and Sarah J Andrabi University of North Carolina at Chapel Hill Abstract In this paper we present three different Linux

More information

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering Multiprocessors and Thread-Level Parallelism Multithreading Increasing performance by ILP has the great advantage that it is reasonable transparent to the programmer, ILP can be quite limited or hard to

More information

Exploiting the benefits of native programming access to NVM devices

Exploiting the benefits of native programming access to NVM devices Exploiting the benefits of native programming access to NVM devices Ashish Batwara Principal Storage Architect Fusion-io Traditional Storage Stack User space Application Kernel space Filesystem LBA Block

More information

A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures

A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures W.M. Roshan Weerasuriya and D.N. Ranasinghe University of Colombo School of Computing A Comparative

More information

The Google File System

The Google File System The Google File System By Ghemawat, Gobioff and Leung Outline Overview Assumption Design of GFS System Interactions Master Operations Fault Tolerance Measurements Overview GFS: Scalable distributed file

More information

IBM B2B INTEGRATOR BENCHMARKING IN THE SOFTLAYER ENVIRONMENT

IBM B2B INTEGRATOR BENCHMARKING IN THE SOFTLAYER ENVIRONMENT IBM B2B INTEGRATOR BENCHMARKING IN THE SOFTLAYER ENVIRONMENT 215-4-14 Authors: Deep Chatterji (dchatter@us.ibm.com) Steve McDuff (mcduffs@ca.ibm.com) CONTENTS Disclaimer...3 Pushing the limits of B2B Integrator...4

More information

Distributed Computing: PVM, MPI, and MOSIX. Multiple Processor Systems. Dr. Shaaban. Judd E.N. Jenne

Distributed Computing: PVM, MPI, and MOSIX. Multiple Processor Systems. Dr. Shaaban. Judd E.N. Jenne Distributed Computing: PVM, MPI, and MOSIX Multiple Processor Systems Dr. Shaaban Judd E.N. Jenne May 21, 1999 Abstract: Distributed computing is emerging as the preferred means of supporting parallel

More information