D5.5 Overall system performance, scaling, and outlook on large multicores


IOLanes: Advancing the Scalability and Performance of I/O Subsystems in Multicore Platforms

Contract number: FP
Contract start date: Jan
Duration: months
Project coordinator: FORTH (Greece)
Partners: UPM (Spain), BSC (Spain), IBM (Israel), Intel (Ireland), Neurocom (Greece)

D5.5 Overall system performance, scaling, and outlook on large multicores

Draft date: 9-May-13, 11:50
Delivery date: M39
Due date: M39
Workpackage: WP2
Dissemination level: Public
Authors: Stelios Mavridis, Yannis Sfakianakis, Spyridon Papageorgiou, Manolis Marazakis, and Angelos Bilas (FORTH)
Status: 5. Released (scale: 1. Internal draft preparation, 2. Under internal project review, 3. Release draft preparation, 4. Passed internal project review, 5. Released)

Project funded by the European Commission under the Embedded Systems Unit G3, Directorate General Information Society, 7th Framework Programme.

History of Changes

Date          By                 Description
Apr 10, 2013  Manolis Marazakis  Initial version: discussion of testbed configurations, scaling results (up to 64 cores), input from all subsystems.
Apr 22, 2013  Manolis Marazakis  Text on experimental evaluation and related work.
Apr 29, 2013  Angelos Bilas      Editing.
May 2, 2013   Manolis Marazakis  Figures.
May 7, 2013   Manolis Marazakis  Editing.
May 8, 2013   Angelos Bilas      Editing.

Table of Contents

Table of Contents
List of Tables
List of Figures
Executive Summary
1. Overview of Deliverable
   1.1 Main contributions and achievements
   1.2 Progress beyond the state-of-the-art
   1.3 Alignment with project goals and time-plan for this reporting period and plan for the next reporting period
   1.4 Structure of deliverable
2. Description of experimental systems
3. Description of alternative data structures for the pcache layer
4. Experimental results
   4.1 I/O throughput results
   4.2 IOPS results
   4.3 Filesystem metadata-intensive results
5. Summary and Discussion

List of Tables

Table 1: Key properties of the two experimental NUMA server platforms.
Table 2: Comparison of alternative data structures for the pcache layer.

List of Figures

Figure 1: Block diagram of the current-generation 8-core server platform (Tyan S7025 motherboard, with two processor sockets and a symmetric interconnect) [cf. citation 6].
Figure 2: Block diagram of the next-generation 64-core server platform (Tyan S8812 motherboard, with four processor sockets and an asymmetric interconnect) [cf. citation 7].
Figure 3: I/O throughput achieved using the IOR benchmark (reads, writes), with 4KB requests, on a 64-core NUMA server. Scalability limitations of the unmodified Linux kernel become evident with more than 24 cores.
Figure 4: I/O throughput achieved using the IOR benchmark (reads, writes), with 4KB requests, on an 8-core NUMA server. The write throughput of the baseline system is disturbingly low. Affinity-conscious alternatives consistently outperform affinity-unaware ones.
Figure 5: IOPS performance achieved using the fio benchmark (reads, writes), with 4KB random requests, on a 64-core NUMA server. Scalability limitations of the unmodified Linux kernel become evident with more than 24 cores.
Figure 6: Variation of IOPS performance on the 8-core NUMA server. Even with relatively low core counts and a symmetric system interconnect, the placement of I/O-issuing threads affects IOPS performance (by up to 21%).
Figure 7: File creation overhead, with up to 16 threads on the 64-core NUMA server. The baseline system does not scale, with overhead increasing by 65% between consecutive data points.

Executive Summary

Building on the work reported in the previous reporting periods of the IOLanes project, this report presents a focused evaluation of our custom I/O software stack on a large-scale NUMA multicore server. The main contribution of the work presented in this deliverable is an evaluation of NUMA effects on our partitioned I/O stack. We measure I/O throughput and IOPS performance, comparing the unmodified Linux kernel with two variations of our partitioned I/O stack. We observe severe scalability limitations in the baseline system (representative of today's commonly deployed data-centre servers) when the core count exceeds 20. Our custom I/O stack not only scales much better with up to 64 cores, but also outperforms the baseline system by a significant margin, especially with small requests. With 64 cores, our I/O stack improves read I/O throughput by 166% over the baseline system, and write I/O throughput by 122%. Regarding IOPS performance, our I/O stack improves performance by 88% over the baseline system for reads, and by 118% for writes.

Furthermore, we present results from a metadata-intensive operation, specifically the creation of a large file-set by independent threads. The baseline system suffers from scalability limitations for such metadata-intensive operations due to global ordering in the journal. We find that our partitioned I/O stack, which supports multiple, independent journal instances, achieves better scalability and reduces completion time by 88% compared to the baseline.

In our scalability evaluation, we compare two alternative data-structure options for our implementation: hash tables and radix trees. Both implementations achieve good scalability with up to 64 cores. However, we find that the implementation based on radix trees consistently achieves better absolute performance, and that the two designs exhibit different trade-offs in the handling of reads and writes.

Overall, we expect our results to hold for upcoming larger-scale NUMA server platforms, with even more pronounced non-uniformity in remote memory access times and cross-core synchronization overheads. In such a setting, the benefits of scalable data structures capable of low-overhead operation under high concurrency will become even more critical for efficiency and high performance.

1. Overview of Deliverable

1.1 Main contributions and achievements

The I/O stack discussed in this deliverable is a fundamental redesign of how I/O is performed in modern servers. Today, all functions in the kernel I/O path treat cores and memory symmetrically, which is at odds with current technology trends, NUMA architectures, and complex workloads. This approach leads to significant contention and interference. For example, data buffers of the buffer cache are placed in any available memory (module) and kernel code executes on any available core, so accesses to a specific buffer can originate from any core, resulting in contention. In our approach, we use partitioning to contain all these effects as the number of cores, devices, and other physical resources grows with current technology trends.

We have evaluated our work on a much more challenging server platform than the previously used test machines: more cores (64 vs. 8 or 12 in the servers used during the previous periods), and much more pronounced NUMA effects (due to a much more asymmetric system interconnect). Important overheads and effects, in particular the impact of thread/data affinity and of undesired context switches, are much more pronounced on this server platform, and allow us to further validate our partitioning approach. Moreover, we present a quantitative comparison between two variants of our partitioned RAM cache design, providing essential insight into the impact of data-structure properties in the presence of concurrent I/O requests. We compare two implementations, one using hash tables and one using radix trees.

Overall, our approach is a significant departure from the current I/O path design, and we believe it will allow I/O performance to scale with the physical resources in future servers. Although these benefits have already been demonstrated in previous evaluations, we believe they will become even more important in larger-scale NUMA server platforms. In this report, we present a comprehensive evaluation of the partitioned I/O stack with up to 64 cores.

1.2 Progress beyond the state-of-the-art

Overall, we examine three main aspects of scalability: NUMA effects, ordering, and locking.

(1) Our work is the first to discuss NUMA effects on I/O performance and scalability at larger scales. The Linux kernel is not yet ready to handle large numbers of cores and their (inevitable) non-uniform memory access characteristics. The main contribution of the work presented in this deliverable is an evaluation of NUMA effects on our partitioned I/O stack. We have focused our evaluation on workload file sets that fit in main memory, thus avoiding the additional complications of storage device access. Since storage device access is itself subject to NUMA effects, we believe that our results remain relevant for workloads that do not fit in memory. Related to NUMA effects, we evaluate two alternative data structures for the per-partition block caches: hash tables and radix trees (cf. citation [5] for an overview). We believe that the scalability properties of data structures used in the I/O stack will become increasingly important for achieving scalability on server platforms with higher core counts than what is currently the norm.

(2) Our work is also among the first to present scaling results on the impact of recovery (journaling) with increasing core counts.
This case is representative of workloads that involve the creation of large numbers of files, and brings out severe scalability limitations of current filesystem implementations due to contention. Our partitioned I/O stack design supports multiple, mostly independent journals, rather than a single centralized point of coordination for updates to the persistent data structures of the filesystem.

(3) Previous work (cf. citations [1,2,3,4]) has shown substantial potential for improvements in the Linux kernel by reducing lock contention. However, the Linux kernel has been steadily improving in this respect, incorporating more efficient locking primitives and more scalable data structures. Our evaluation shows that locking in the I/O path is not a major concern.
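To make the partitioning approach concrete, the following minimal C sketch illustrates the general idea of a block cache split into per-lane partitions, each with its own lock and hash table, so that threads operating on different lanes do not contend on any shared structure. This is an illustrative sketch only, not the pcache implementation; names such as pcache_partition, NR_LANES, and lane_of are assumptions introduced for this example.

```c
#include <pthread.h>
#include <stdint.h>
#include <string.h>

#define NR_LANES 16     /* one cache partition ("lane") per group of cores; illustrative value */
#define BUCKETS  4096   /* hash buckets per partition */

struct cache_block {
    uint64_t            blkno;   /* device block address */
    void               *data;    /* cached block contents */
    struct cache_block *next;    /* collision chain */
};

struct pcache_partition {
    pthread_mutex_t     lock;             /* protects only this partition */
    struct cache_block *bucket[BUCKETS];  /* per-partition hash table */
};

static struct pcache_partition lanes[NR_LANES];

static inline unsigned lane_of(uint64_t blkno)   { return blkno % NR_LANES; }
static inline unsigned bucket_of(uint64_t blkno) { return (blkno / NR_LANES) % BUCKETS; }

void pcache_init(void)
{
    for (unsigned i = 0; i < NR_LANES; i++) {
        pthread_mutex_init(&lanes[i].lock, NULL);
        memset(lanes[i].bucket, 0, sizeof(lanes[i].bucket));
    }
}

/* Lookup takes only the lane-local lock: threads working on blocks that map
 * to different lanes never touch the same lock or the same memory. */
struct cache_block *pcache_lookup(uint64_t blkno)
{
    struct pcache_partition *p = &lanes[lane_of(blkno)];
    struct cache_block *b;

    pthread_mutex_lock(&p->lock);
    for (b = p->bucket[bucket_of(blkno)]; b != NULL; b = b->next)
        if (b->blkno == blkno)
            break;
    pthread_mutex_unlock(&p->lock);
    return b;
}
```

In a real deployment, the lane-selection function would also take into account the NUMA node on which the calling thread runs, so that a lane's blocks stay in memory local to the cores that use them.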

1.3 Alignment with project goals and time-plan for this reporting period and plan for the next reporting period

This deliverable discusses the implementation of the base (non-virtualized) partitioned stack for improving scaling, which is the main task in WP2. The implementation of the stack was available for initial runs during Period 2, and evolved substantially during Period 3, in accordance with the original plan. The code is divided into three main components (filesystem, cache, recovery), and includes glue components and modifications to other parts of the system, especially for passing important information across existing components and layers. These components are used in specific configurations with the rest of the project components to build the denser and leaner I/O stack we are targeting.

Our work in this reporting period is in line with the plan for following up on the work of the three regular reporting periods. We had a first-draft implementation at the end of Period 2. During Period 2 we managed to get parts of the stack working somewhat earlier than originally planned, and also managed to integrate them with other components successfully, which resulted in an early working version of the integrated system. Thus, Period 3 was devoted to evaluation and optimization in the context of the integrated system, as was the original plan. Furthermore, we evaluated our work on a larger-scale NUMA server, aiming to verify the scalability potential of our design. We discovered scalability limitations for the baseline system when the core count exceeds 20; our design maintains good performance up to the full core count (64) of the new testbed. Finally, it is also important to note that this evaluation provides a much-needed set of reference results for better understanding I/O issues in modern servers, identifying the main issues for scaling I/O performance, and proposing further enhancements at the system and architectural levels.

1.4 Structure of deliverable

Section 2 describes the experimental platforms used in this evaluation. Section 3 describes and compares the core data structures used in our scalable caching layer implementations. Section 4 presents our results, comparing the unmodified Linux kernel with our custom partitioned I/O stack. Section 5 summarizes our work and results.

2. Description of experimental systems

We present results from two NUMA multicore server systems: an 8-core server with relatively uniform memory access costs, representative of current-generation server nodes in data centres, and a 64-core server with much more pronounced non-uniformity in memory access costs, representative of upcoming server architectures. The current-generation 8-core server is comparable with the shared testbed used throughout the duration of the IOLanes project. The 64-core server is the new, more challenging platform used for the scalability evaluation reported in this deliverable.

We focus our experimental evaluation on workloads that fit entirely in the server's memory, aiming to stress the I/O stack layers as much as possible and to obtain low-contention, low-latency responses. This arrangement is a commonly accepted best practice for online data processing workloads, and is also in line with the architectural trend of introducing non-volatile memories in servers. Table 1 summarizes the key properties of the two experimental server platforms. Block diagrams of the two server platforms are shown in Figures 1 and 2, respectively.

Table 1: Key properties of the two experimental NUMA server platforms.

                              Current-generation 8-core server                  Next-generation 64-core server
Processor socket count        2                                                 4
Cores per processor socket    4 (8 with hyperthreading)                         16
Motherboard                   Tyan S7025                                        Tyan S8812
Processor type                Intel Xeon E5620 (2.4GHz)                         AMD Opteron 6272 (2.1GHz)
Processor core caches         L1: 128KB (code), 128KB (data);                   L1: 2x32KB (code, shared by 2 cores), 16KB (data);
                              L2: 1MB; L3: 12MB                                 L2: 4x2MB (shared by 2 cores); L3: 2x8MB (shared by 4 cores)
DRAM (DDR3, # DIMMs)          Up to 8 (up to 64GB)                              Up to 16 (up to 512GB)
Interconnect type             QPI (5.86 GT/sec)                                 HyperTransport
Interconnect topology         Dual-ring, symmetric                              Point-to-point, asymmetric

In terms of systems software, we use in this evaluation study the partitioned I/O stack developed over the three regular reporting periods of the IOLanes project. We have developed various scalability enhancements for this I/O stack, aimed at eliminating points of contention on its data structures. The most significant enhancement in the design is the explicit manipulation of thread/data affinity in the pfs filesystem. The most prominent change in the implementation is the switch of the main data structure used by the pcache caching layer from a hash table (one per partition, or "slice") to a radix tree. In the following sections, we present results from our experiments, comparing the performance of our two alternative I/O stack implementations (marked IOslices/HT and IOslices/RT, for the hash-table and radix-tree variants, respectively) with that of the unmodified Linux kernel (marked Native (xfs)).

3. Description of alternative data structures for the pcache layer

We implemented two alternative data structures for keeping track of filesystem blocks in memory. The hash-table implementation was our original choice, and has been described in detail in previous reports. Towards the end of the 3rd reporting period, we developed a radix-tree data structure, and in this report we present a quantitative comparison of the two pcache implementations. A qualitative comparison of the two data structures is shown in Table 2.

In both implementations, there is a distinct instance of the data structure for each of the lanes that we have provided in the server's I/O path. Our first implementation (hash table) was used in all evaluation results presented up to this point in the IOLanes project, and has delivered strong performance in a variety of configurations. The alternative implementation is more complex, especially in terms of concurrency control; however, we feel that this additional complexity is justified by the performance results presented in this report.

Overall, we expect our results to hold for upcoming larger-scale NUMA server platforms, with even more pronounced non-uniformity in remote memory access times and cross-core synchronization overheads. In such a setting, the benefits of scalable data structures capable of low-overhead operation under high concurrency will become even more critical for efficiency and high performance.

Figure 1: Block diagram of the current-generation 8-core server platform (Tyan S7025 motherboard, with two processor sockets and a symmetric interconnect) [cf. citation 6].

Figure 2: Block diagram of the next-generation 64-core server platform (Tyan S8812 motherboard, with four processor sockets and an asymmetric interconnect) [cf. citation 7].

Table 2: Comparison of alternative data structures for the pcache layer.

Hash Table:
- O(1) overhead for lookup/insert/remove operations in the common case (i.e. when hash-function conflicts are rare).
- Allows simultaneous reads and writes from multiple threads, as long as there is no hash-function conflict between them.
- Fine-grain locking (per collision list), with very little contention as long as hash-function collisions are rare.
- Consecutive blocks (as identified by their device block address) are highly unlikely to be close together in the hash table, i.e. we need to issue a synchronized group of lookup operations for requests involving multiple blocks.
  - This spread of consecutive elements is detrimental to processor cache locality, and therefore to system performance, for I/O accesses that involve consecutive blocks. The effect is particularly acute when evictions are needed to release space in the cache.
  - The replacement-policy implementation needs to scan the entire cache to obtain a set of blocks to be evicted. The eviction context attempts to amortize this overhead by collecting several blocks for eviction (up to 1MB), thus generating a non-sequential device access pattern.
- Lock-based concurrency control scheme.
- Relatively simple implementation.

Radix Tree:
- log(K) worst-case lookup overhead, where K is the number of bits in a block address.
- Optimized for high concurrency, fast reads, and low-concurrency "background" writes:
  - Reads are lock-free (reading threads never block, even while writes are ongoing).
  - Reader threads always see a consistent version of the radix tree.
  - Reader threads do not block writer threads.
  - Writer threads only block each other, i.e. they never block reader threads.
- Consecutive blocks are placed close together in the radix tree (because their block addresses share a common prefix), allowing fast lookup for requests involving multiple blocks. Moreover, the radix tree allows a faster implementation of the replacement policy.
- RCU (read-copy-update) concurrency control scheme.
- More complex implementation, especially regarding concurrency control.
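To make the contrast in Table 2 concrete, the sketch below shows a fixed-fanout (64-way) radix tree keyed by the device block address, walking six address bits per level. It is a simplified illustration, not the pcache radix tree: names such as radix_node and RADIX_LEVELS are assumptions, and the RCU-based lock-free read path of the actual implementation is only hinted at in the comments.

```c
#include <stdint.h>
#include <stdlib.h>

#define RADIX_BITS   6                       /* 64-way fanout per level */
#define RADIX_FANOUT (1u << RADIX_BITS)
#define RADIX_MASK   (RADIX_FANOUT - 1)
#define RADIX_LEVELS 8                       /* covers 48-bit block addresses */

struct radix_node {
    void *slot[RADIX_FANOUT];  /* child node, or cached block at the last level */
};

/* Lookup: O(RADIX_LEVELS) pointer dereferences, independent of occupancy.
 * Blocks whose addresses share a prefix share the upper levels of the path,
 * which is why consecutive blocks end up adjacent in the same leaf node. */
void *radix_lookup(struct radix_node *root, uint64_t blkno)
{
    struct radix_node *node = root;
    for (int level = RADIX_LEVELS - 1; level > 0 && node; level--)
        node = node->slot[(blkno >> (level * RADIX_BITS)) & RADIX_MASK];
    return node ? node->slot[blkno & RADIX_MASK] : NULL;
}

/* Insert: allocates intermediate nodes on demand. A writer would serialize
 * against other writers (e.g. with a per-lane lock), while readers could
 * proceed concurrently if updates are published with RCU-style pointer
 * assignment, as in the lock-free read path described in Table 2. */
int radix_insert(struct radix_node *root, uint64_t blkno, void *block)
{
    struct radix_node *node = root;
    for (int level = RADIX_LEVELS - 1; level > 0; level--) {
        unsigned idx = (blkno >> (level * RADIX_BITS)) & RADIX_MASK;
        if (!node->slot[idx]) {
            node->slot[idx] = calloc(1, sizeof(struct radix_node));
            if (!node->slot[idx])
                return -1;
        }
        node = node->slot[idx];
    }
    node->slot[blkno & RADIX_MASK] = block;
    return 0;
}
```

A caller would allocate the root node once (e.g. with calloc) and then use radix_insert and radix_lookup with raw device block numbers; consecutive block numbers differ only in the low-order bits, so they land in the same leaf node.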

4. Experimental results

Sections 4.1 and 4.2 present I/O throughput and I/O operation rate (IOPS) results that illustrate the scalability limitations of the baseline system and the scaling behaviour of our alternative I/O stack design. Section 4.3 presents results from a metadata-intensive workload. NUMA effects are a particular focus of our evaluation, as we have quantified a disturbingly wide range of performance variation for the baseline system. The IOLanes partitioned I/O stack explicitly addresses NUMA effects by modifying, at run time, the core affinity mask of I/O-issuing threads. This design decision has a tremendous impact on performance and scalability, as seen in the following results.

We use the MPI-based IOR benchmark to emulate various checkpointing patterns, through the POSIX I/O interface. In our experiments, we vary the number of processes (instances), each generating an aggregate I/O volume of 1.25 GBytes in five iterations of a write-then-read pattern. Overall, we generate an interleaving of streaming read and write accesses. We set the I/O request size to 4 KBytes. The metric reported by IOR is data throughput in MBytes/s, separately for the write and read phases. We also use the fio I/O workload generator to emulate concurrent streams of I/O requests. Finally, we use the make-many-files microbenchmark as a stress test for filesystem metadata performance.

4.1 I/O Throughput results

We start with experimental results from running the IOR benchmark: sequential reads and writes, with 4KB requests. These I/O patterns aim to stress the scaling capabilities of the I/O stack layers, as they involve concurrent access streams with relatively small I/O requests that highlight any processing overheads in the common I/O path. The IOR results, shown in Figure 3, are from the 64-core server platform.

[Figure 3 panels: (a) IOR read throughput and (b) IOR write throughput, in MB/sec vs. number of cores, for Native (xfs), IOslices/HT, and IOslices/RT.]

Figure 3: I/O throughput achieved using the IOR benchmark (reads, writes), with 4KB requests, on a 64-core NUMA server. Scalability limitations of the unmodified Linux kernel become evident with more than 24 cores.

We observe severe scalability limitations for the baseline system (unmodified Linux kernel) with 24 or more cores; our configurations scale linearly up to 64 cores, and outperform the baseline system. The radix-tree implementation consistently achieves 10-15% better throughput than the hash-table implementation. Both of our implementations explicitly set the processor affinity mask of the I/O-issuing threads, to avoid remote memory references as much as possible. The baseline system does not have this facility and, under high concurrency, suffers both the penalty of remote memory references and the penalty of undesired thread migrations upon context switches. With 64 cores, our I/O stack improves read I/O throughput by 166% over the baseline system, and write I/O throughput by 122%.

[Figure 4 panels: (a) IOR read throughput and (b) IOR write throughput on the 8-core server.]

Figure 4: I/O throughput achieved using the IOR benchmark (reads, writes), with 4KB requests, on an 8-core NUMA server. The write throughput of the baseline system is disturbingly low. Affinity-conscious alternatives consistently outperform affinity-unaware ones.

In Figure 4, we present results from running the IOR read and write throughput tests on the current-generation 8-core NUMA server platform as well. For our two implementations of the RAM cache, we present results with and without the thread/data affinity control mechanism. With the lower core count, the scalability limitations of the baseline system are not visible for reads, as shown in Figure 4(a). However, the baseline write results are quite disturbing; Figure 4(b) summarizes these results, showing that our affinity-conscious implementations outperform the baseline system. Moreover, Figure 4 illustrates the wide range of performance that an application can experience without thread/data affinity control: our affinity-conscious implementations consistently outperform the affinity-unaware ones, particularly with more than 4 cores.

4.2 IOPS results

Figure 5 presents results from running fio with 64 concurrent I/O-issuing threads, with each thread issuing 4KB requests with a random distribution. This experiment measures the maximum achievable IOPS, for reads and writes. We observe (as in Figure 3) that the baseline system does not scale beyond 24 cores, and in fact its IOPS performance drops with more than 40 cores. Our two implementations both scale linearly, and outperform the baseline system with more than 28 cores for reads and 20 cores for writes. The radix-tree implementation consistently outperforms the hash-table implementation, for both reads and writes.

The hash-table implementation matches and then exceeds the IOPS performance of the baseline system at 32 cores, and continues to scale beyond that point. Overall, our I/O stack improves read IOPS performance by 88% over the baseline system, and write IOPS performance by 118%.

[Figure 5 panels: (a) maximum read IOPS and (b) maximum write IOPS (fio, 4KB random requests), in millions of IOPS vs. number of cores, for Native (xfs), IOslices/HT, and IOslices/RT.]

Figure 5: IOPS performance achieved using the fio benchmark (reads, writes), with 4KB random requests, on a 64-core NUMA server. Scalability limitations of the unmodified Linux kernel become evident with more than 24 cores.

The main advantage of our I/O stack over the baseline system derives from explicitly considering thread/data affinity. This aspect of I/O performance will become increasingly critical with higher core counts, and it is already important in current-generation servers. Unmanaged thread/data affinity results in large variance of application-level performance, even in configurations where we can have a one-to-one mapping between application threads and cores. To illustrate this point, Figure 6 shows the 4KB random-read IOPS scores (again with fio and 64 concurrent threads) achieved by two configurations of the baseline system on the 8-core NUMA server platform: (a) no placement constraints for the I/O-issuing threads (marked in the graphs as "default"), and (b) placement constraints enforced statically using the taskset utility (marked in the graphs as "fixed"). The default placement is what a typical application would experience on a current-generation server system. The fixed placement is what could be achieved if certain simplifying assumptions hold for the application: (a) a fixed number of threads (i.e. no dynamically spawned threads), and (b) a private working set per thread, allowing a simple, static setting of the thread-core affinity mask. We observe that even with the relatively low core count and symmetric memory-reference costs of this particular server platform, there is a significant margin of variation in IOPS performance, from 3.5% up to 21%. Our partitioned I/O stack dynamically adjusts thread affinity, even with dynamically spawned threads and overlapping working sets.
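The "fixed" placement above was enforced with the taskset utility (e.g. taskset -c 0-7 confines a process to cores 0 through 7); the same effect can be achieved programmatically. The sketch below is an illustrative example of pinning an I/O-issuing thread to the cores of one socket using the standard Linux pthread affinity call; the IOLanes stack adjusts affinity dynamically inside the kernel rather than through this user-level interface, and the helper name pin_to_cores is an assumption for this example.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to a contiguous range of cores, e.g. the cores of one
 * processor socket / NUMA node. Assumes the caller knows the core numbering of
 * the target node (e.g. from /sys/devices/system/node/node0/cpulist). */
int pin_to_cores(int first_core, int ncores)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    for (int c = first_core; c < first_core + ncores; c++)
        CPU_SET(c, &set);

    /* Returns 0 on success; afterwards the thread is scheduled only on the
     * selected cores, avoiding undesired migrations across sockets. */
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```

Combined with Linux's default first-touch memory allocation policy, such pinning also tends to keep a thread's buffers on the memory node local to its cores.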

[Figure 6 panels: (a) fio read and write IOPS with default thread placement, and (b) variation in IOPS performance due to thread placement ("default" vs. "fixed"), in millions of IOPS vs. number of cores.]

Figure 6: Variation of IOPS performance on the 8-core NUMA server. Even with relatively low core counts and a symmetric system interconnect, the placement of I/O-issuing threads affects IOPS performance (by up to 21%).

4.3 Filesystem metadata-intensive results

For completeness, we also present evaluation results for a metadata-intensive microbenchmark, with up to 16 concurrent threads, each creating a file tree. Each thread creates a 3-level directory hierarchy, and initializes 300 1KB files in each of the 3rd-level directories. Thus, each thread creates 8,100 files, for a total of 129,600 files in the 16-thread case. The threads are independent, so the expectation would be a roughly constant time for creating the file tree as the thread count grows. This is far from being the case with the baseline system.
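For reference, the per-thread workload of the make-many-files microbenchmark has roughly the structure sketched below, assuming a 3-way fanout at each of the three directory levels (27 leaf directories x 300 files = 8,100 files per thread, matching the totals reported above). The actual microbenchmark code may differ; create_tree, FANOUT, and NFILES are names introduced for this illustration.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

#define FANOUT   3    /* assumed 3-way fanout at each of the 3 directory levels */
#define NFILES 300    /* 1KB files created in every 3rd-level directory */

/* One thread's share of the workload: a 3-level directory tree with 300 small
 * files per leaf directory (3 * 3 * 3 * 300 = 8,100 files per thread). */
static void create_tree(const char *root)
{
    char d1[256], d2[320], d3[384], file[448];
    char buf[1024] = {0};   /* 1KB of file data */

    for (int a = 0; a < FANOUT; a++) {
        snprintf(d1, sizeof(d1), "%s/d%d", root, a);
        mkdir(d1, 0755);
        for (int b = 0; b < FANOUT; b++) {
            snprintf(d2, sizeof(d2), "%s/d%d", d1, b);
            mkdir(d2, 0755);
            for (int c = 0; c < FANOUT; c++) {
                snprintf(d3, sizeof(d3), "%s/d%d", d2, c);
                mkdir(d3, 0755);
                for (int f = 0; f < NFILES; f++) {
                    snprintf(file, sizeof(file), "%s/f%d", d3, f);
                    int fd = open(file, O_CREAT | O_WRONLY | O_TRUNC, 0644);
                    if (fd >= 0) {
                        (void)write(fd, buf, sizeof(buf));  /* write 1KB */
                        close(fd);
                    }
                }
            }
        }
    }
}
```

Every file and directory creation in this loop is a metadata update that must be journalled, which is what exposes the global-ordering bottleneck of a single shared journal.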

Figure 7: File creation overhead, with up to 16 threads on the 64-core NUMA server. The baseline system does not scale, with overhead increasing by 65% between consecutive data points.

Figure 7 shows a comparison of the execution time for this microbenchmark, for the baseline system and our I/O stack. Both configurations use a RAM disk as the backing store for the filesystem (xfs for the baseline system, pfs for our I/O stack). We observe severe performance degradation with the baseline system, with execution times increasing by about 65% between consecutive data points. With our I/O stack, the increase in overhead is much less pronounced (8-16%). With 16 threads, our I/O stack reduces the time for creating the requested file-set by 88%. These results highlight the benefits of partitioning the I/O stack, and in particular the benefit of having multiple mostly-independent journal instances rather than a single one. The pjournal layer ensures the consistency of the filesystem's persistent data structures in the presence of failures, as in existing filesystems; however, there is no contention between independent threads that do not share access to files. The pjournal layer implements a coordination protocol between journal instances, but does not unnecessarily serialize requests from independent threads.
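To illustrate why multiple journal instances avoid the global ordering bottleneck, the following minimal sketch gives each lane its own journal, with a per-lane lock and transaction counter, so that only threads committing to the same lane ever serialize. This is an illustrative model of the idea, not the pjournal implementation; names such as lane_journal and journal_commit are assumptions, and the cross-journal coordination protocol mentioned above is omitted.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define NR_LANES 16   /* illustrative value */

struct lane_journal {
    pthread_mutex_t lock;      /* serializes commits within one lane only */
    uint64_t        next_tid;  /* per-lane transaction id: no global ordering */
};

static struct lane_journal journals[NR_LANES];

void pjournal_init(void)
{
    for (unsigned i = 0; i < NR_LANES; i++) {
        pthread_mutex_init(&journals[i].lock, NULL);
        journals[i].next_tid = 1;
    }
}

/* With a single shared journal, every commit would take one global lock and
 * receive a globally ordered transaction id. Here, threads that create files
 * under different lanes commit concurrently, without ordering against each other. */
uint64_t journal_commit(unsigned lane, const void *records, size_t nbytes)
{
    struct lane_journal *j = &journals[lane % NR_LANES];
    uint64_t tid;

    (void)records; (void)nbytes;   /* log append and flush elided in this sketch */

    pthread_mutex_lock(&j->lock);
    tid = j->next_tid++;
    /* ... append 'records' to this lane's on-disk log and flush it ... */
    pthread_mutex_unlock(&j->lock);
    return tid;
}
```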

5. Summary and Discussion

We have identified scalability limitations of the Linux kernel, using targeted, intensive tests of the common I/O path, for both data and metadata. The main contribution of the work presented in this deliverable is an evaluation of NUMA effects on our partitioned I/O stack. We demonstrate significantly improved scalability, for both I/O throughput-intensive and IOPS-intensive tests. Most of these limitations are not observed at relatively low core counts (8-12), but become severe with more than 20 cores. With 64 cores, our I/O stack improves read I/O throughput by 166% over the baseline system, and write I/O throughput by 122%. Regarding IOPS performance, our I/O stack improves performance by 88% over the baseline system for reads, and by 118% for writes.

Furthermore, we provide a quantitative comparison between two options for the data structure used in the partitioned RAM cache (hash table vs. radix tree). Although both options provide good scalability, the implementation using radix trees achieves consistently better performance. Finally, we compare our I/O stack with the baseline for a metadata-intensive workload with up to 16 independent threads, and again find severe scalability limitations in the baseline system. For 16 threads, we achieve an 88% reduction in the time required to create a large file-set.

Overall, we expect our results to hold for upcoming larger-scale NUMA server platforms, with even more pronounced non-uniformity in remote memory access times and cross-core synchronization overheads. In such a setting, the benefits of scalable data structures capable of low-overhead operation under high concurrency will become even more critical for efficiency and high performance.

References

1. Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich. An Analysis of Linux Scalability to Many Cores. In Proceedings of the 9th USENIX Symposium on Operating Systems Design and Implementation (OSDI '10), Vancouver, Canada, October 2010.
2. Silas Boyd-Wickizer, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich. Non-scalable Locks are Dangerous. In Proceedings of the Linux Symposium, Ottawa, Canada.
3. Yan Cui, Yingxin Wang, Yu Chen, and Yuanchun Shi. Lock-contention-aware Scheduler: A Scalable and Energy-efficient Method for Addressing Scalability Collapse on Multicore Systems. ACM Transactions on Architecture and Code Optimization (TACO), Special Issue on High-Performance Embedded Architectures and Compilers, vol. 9, no. 4, p. 44.
4. Da Zheng, Randal Burns, and Alexander S. Szalay. Parallel Page Cache: IOPS and Caching for Multicore Systems. In Proceedings of the 4th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage).
5. Alfred V. Aho, John E. Hopcroft, and Jeffrey D. Ullman. Data Structures and Algorithms. Addison-Wesley.
6. User Guide for the Tyan S7025 motherboard. URL:
7. User Guide for the Tyan S8812 motherboard. URL:


More information

Oracle Database 12c: JMS Sharded Queues

Oracle Database 12c: JMS Sharded Queues Oracle Database 12c: JMS Sharded Queues For high performance, scalable Advanced Queuing ORACLE WHITE PAPER MARCH 2015 Table of Contents Introduction 2 Architecture 3 PERFORMANCE OF AQ-JMS QUEUES 4 PERFORMANCE

More information

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Outline Key issues to design multiprocessors Interconnection network Centralized shared-memory architectures Distributed

More information

Agilio CX 2x40GbE with OVS-TC

Agilio CX 2x40GbE with OVS-TC PERFORMANCE REPORT Agilio CX 2x4GbE with OVS-TC OVS-TC WITH AN AGILIO CX SMARTNIC CAN IMPROVE A SIMPLE L2 FORWARDING USE CASE AT LEAST 2X. WHEN SCALED TO REAL LIFE USE CASES WITH COMPLEX RULES TUNNELING

More information

Performance Modeling and Analysis of Flash based Storage Devices

Performance Modeling and Analysis of Flash based Storage Devices Performance Modeling and Analysis of Flash based Storage Devices H. Howie Huang, Shan Li George Washington University Alex Szalay, Andreas Terzis Johns Hopkins University MSST 11 May 26, 2011 NAND Flash

More information

Computer Systems Research in the Post-Dennard Scaling Era. Emilio G. Cota Candidacy Exam April 30, 2013

Computer Systems Research in the Post-Dennard Scaling Era. Emilio G. Cota Candidacy Exam April 30, 2013 Computer Systems Research in the Post-Dennard Scaling Era Emilio G. Cota Candidacy Exam April 30, 2013 Intel 4004, 1971 1 core, no cache 23K 10um transistors Intel Nehalem EX, 2009 8c, 24MB cache 2.3B

More information

Crossing the Chasm: Sneaking a parallel file system into Hadoop

Crossing the Chasm: Sneaking a parallel file system into Hadoop Crossing the Chasm: Sneaking a parallel file system into Hadoop Wittawat Tantisiriroj Swapnil Patil, Garth Gibson PARALLEL DATA LABORATORY Carnegie Mellon University In this work Compare and contrast large

More information

Agenda. System Performance Scaling of IBM POWER6 TM Based Servers

Agenda. System Performance Scaling of IBM POWER6 TM Based Servers System Performance Scaling of IBM POWER6 TM Based Servers Jeff Stuecheli Hot Chips 19 August 2007 Agenda Historical background POWER6 TM chip components Interconnect topology Cache Coherence strategies

More information

Optimizing TCP Receive Performance

Optimizing TCP Receive Performance Optimizing TCP Receive Performance Aravind Menon and Willy Zwaenepoel School of Computer and Communication Sciences EPFL Abstract The performance of receive side TCP processing has traditionally been dominated

More information

Falcon: Scaling IO Performance in Multi-SSD Volumes. The George Washington University

Falcon: Scaling IO Performance in Multi-SSD Volumes. The George Washington University Falcon: Scaling IO Performance in Multi-SSD Volumes Pradeep Kumar H Howie Huang The George Washington University SSDs in Big Data Applications Recent trends advocate using many SSDs for higher throughput

More information

SAY-Go: Towards Transparent and Seamless Storage-As-You-Go with Persistent Memory

SAY-Go: Towards Transparent and Seamless Storage-As-You-Go with Persistent Memory SAY-Go: Towards Transparent and Seamless Storage-As-You-Go with Persistent Memory Hyeonho Song, Sam H. Noh UNIST HotStorage 2018 Contents Persistent Memory Motivation SAY-Go Design Implementation Evaluation

More information

White Paper. File System Throughput Performance on RedHawk Linux

White Paper. File System Throughput Performance on RedHawk Linux White Paper File System Throughput Performance on RedHawk Linux By: Nikhil Nanal Concurrent Computer Corporation August Introduction This paper reports the throughput performance of the,, and file systems

More information

COS 318: Operating Systems. Journaling, NFS and WAFL

COS 318: Operating Systems. Journaling, NFS and WAFL COS 318: Operating Systems Journaling, NFS and WAFL Jaswinder Pal Singh Computer Science Department Princeton University (http://www.cs.princeton.edu/courses/cos318/) Topics Journaling and LFS Network

More information

Prefetch Threads for Database Operations on a Simultaneous Multi-threaded Processor

Prefetch Threads for Database Operations on a Simultaneous Multi-threaded Processor Prefetch Threads for Database Operations on a Simultaneous Multi-threaded Processor Kostas Papadopoulos December 11, 2005 Abstract Simultaneous Multi-threading (SMT) has been developed to increase instruction

More information

The Spider Center-Wide File System

The Spider Center-Wide File System The Spider Center-Wide File System Presented by Feiyi Wang (Ph.D.) Technology Integration Group National Center of Computational Sciences Galen Shipman (Group Lead) Dave Dillow, Sarp Oral, James Simmons,

More information

QLogic TrueScale InfiniBand and Teraflop Simulations

QLogic TrueScale InfiniBand and Teraflop Simulations WHITE Paper QLogic TrueScale InfiniBand and Teraflop Simulations For ANSYS Mechanical v12 High Performance Interconnect for ANSYS Computer Aided Engineering Solutions Executive Summary Today s challenging

More information

The MOSIX Scalable Cluster Computing for Linux. mosix.org

The MOSIX Scalable Cluster Computing for Linux.  mosix.org The MOSIX Scalable Cluster Computing for Linux Prof. Amnon Barak Computer Science Hebrew University http://www. mosix.org 1 Presentation overview Part I : Why computing clusters (slide 3-7) Part II : What

More information

Dynamic Translator-Based Virtualization

Dynamic Translator-Based Virtualization Dynamic Translator-Based Virtualization Yuki Kinebuchi 1,HidenariKoshimae 1,ShuichiOikawa 2, and Tatsuo Nakajima 1 1 Department of Computer Science, Waseda University {yukikine, hide, tatsuo}@dcl.info.waseda.ac.jp

More information

Modification and Evaluation of Linux I/O Schedulers

Modification and Evaluation of Linux I/O Schedulers Modification and Evaluation of Linux I/O Schedulers 1 Asad Naweed, Joe Di Natale, and Sarah J Andrabi University of North Carolina at Chapel Hill Abstract In this paper we present three different Linux

More information

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering Multiprocessors and Thread-Level Parallelism Multithreading Increasing performance by ILP has the great advantage that it is reasonable transparent to the programmer, ILP can be quite limited or hard to

More information

Exploiting the benefits of native programming access to NVM devices

Exploiting the benefits of native programming access to NVM devices Exploiting the benefits of native programming access to NVM devices Ashish Batwara Principal Storage Architect Fusion-io Traditional Storage Stack User space Application Kernel space Filesystem LBA Block

More information

A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures

A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures W.M. Roshan Weerasuriya and D.N. Ranasinghe University of Colombo School of Computing A Comparative

More information

The Google File System

The Google File System The Google File System By Ghemawat, Gobioff and Leung Outline Overview Assumption Design of GFS System Interactions Master Operations Fault Tolerance Measurements Overview GFS: Scalable distributed file

More information

IBM B2B INTEGRATOR BENCHMARKING IN THE SOFTLAYER ENVIRONMENT

IBM B2B INTEGRATOR BENCHMARKING IN THE SOFTLAYER ENVIRONMENT IBM B2B INTEGRATOR BENCHMARKING IN THE SOFTLAYER ENVIRONMENT 215-4-14 Authors: Deep Chatterji (dchatter@us.ibm.com) Steve McDuff (mcduffs@ca.ibm.com) CONTENTS Disclaimer...3 Pushing the limits of B2B Integrator...4

More information

Distributed Computing: PVM, MPI, and MOSIX. Multiple Processor Systems. Dr. Shaaban. Judd E.N. Jenne

Distributed Computing: PVM, MPI, and MOSIX. Multiple Processor Systems. Dr. Shaaban. Judd E.N. Jenne Distributed Computing: PVM, MPI, and MOSIX Multiple Processor Systems Dr. Shaaban Judd E.N. Jenne May 21, 1999 Abstract: Distributed computing is emerging as the preferred means of supporting parallel

More information