MS2 Lustre Streaming Performance Improvements


MS2 Lustre Streaming Performance Improvements
Solution Architecture
Revision 1.2
September 19, 2015
Intel Federal, LLC Proprietary

Generated under Argonne Contract number: B

DISTRIBUTION STATEMENT: None Required

Disclosure Notice: This presentation is bound by Non-Disclosure Agreements between Intel Corporation, the Department of Energy, and DOE National Labs, and is therefore for Internal Use Only and not for distribution outside these organizations or publication outside this Subcontract.

USG Disclaimer: This report was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor any agency thereof, nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.

Export: This document contains information that is subject to export control under the Export Administration Regulations.

Intel Disclaimer: Intel makes available this document and the information contained herein in furtherance of CORAL. None of the information contained herein is, or should be construed as, advice. While Intel makes every effort to present accurate and reliable information, Intel does not guarantee the accuracy, completeness, efficacy, or timeliness of such information. Use of such information is voluntary, and reliance on it should only be undertaken after an independent review by qualified experts. Access to this document is with the understanding that Intel is not engaged in rendering advice or other professional services. Information in this document may be changed or updated without notice by Intel. This document contains copyright information, the terms of which must be observed and followed. Reference herein to any specific commercial product, process or service does not constitute or imply endorsement, recommendation, or favoring by Intel or the US Government. Intel makes no representations whatsoever about this document or the information contained herein. IN NO EVENT SHALL INTEL BE LIABLE TO ANY PARTY FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES FOR ANY USE OF THIS DOCUMENT, INCLUDING, WITHOUT LIMITATION, ANY LOST PROFITS, BUSINESS INTERRUPTION, OR OTHERWISE, EVEN IF INTEL IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

Company Name: Intel Federal, LLC
Company Address: 4100 Monument Corner Drive, Suite 540, Fairfax, VA
Copyright 2015, Intel Corporation.

Technical Lead: Al Gara
Contract Administrator: Aaron Matzkin
Program Manager: Jacob Wood, Intel Federal, LLC

Contents

MS2 Solution Architecture
Contents
Revision History
1 Milestone Overview - Solution Architecture
  1.1 MS2 Description
  1.2 Terms
  1.3 Introduction
2 Background
  2.1 16MB HDD I/O
  2.2 Allocation High Level Overview
  2.3 Dynamic Striping Impact
  2.4 Interleaved Metadata Impact
  2.5 RAIDZ Skip Sector Impact
  2.6 Block Allocation Impact
  2.7 Runtime Memory Impact
3 Solution Requirements
  3.1 Native 16MB Blocks
  3.2 Two Megabyte Disk I/O
  3.3 File Data Buffers
  3.4 File Block Size Selection
  3.5 Non-streaming I/O Performance
  3.6 Aged File System Performance
  3.7 Declustered RAIDZ Integration
4 Use Cases
  4.1 Checkpoint Dump and Drain
  4.2 Checkpoint Restarts
  4.3 Mixed Workloads
  4.4 Aged File Systems
5 Solution Proposal
  5.1 Lustre
  5.2 ZFS
  5.3 Relevant ZFS Open Source Community Work
    5.3.1 Large Block Work
    5.3.2 Scatter/Gather Data Buffer Work
6 Solution Test Plans
  6.1 Unit Testing and Automated Code Verification
  6.2 Integration Testing
  6.3 System and Black Box Testing with Other CORAL Software
7 Acceptance Criteria

Figures

Figure 2-1: Raw HDD Performance
Figure 2-2: Realtime Allocations
Figure 2-3: Ideal RAIDZ large allocation sizes (sans skip sectors)

Revision History

Revision  Description                               Date  Author
0.1       First draft                                     Don Brady
1.1       Revisions following Argonne review              Don Brady
1.2       Final revisions following Argonne review        Don Brady

Author: Don Brady
Contributors: Eric Barton, John Salinas, Isaac Huang, Andreas Dilger, John Carrier

1 Milestone Overview - Lustre Streaming Performance Improvements Solution Architecture

1.1 MS2 Description

This document covers the Solution Architecture for the Lustre and ZFS I/O related enhancements necessary to meet MS2 of the Argonne NRE Contract, as stated below.

MS2 Description: The Subcontractor shall modify the existing ZFS on Linux memory management to use scatter/gather lists to avoid dynamic allocation of large contiguous buffers and to increase the maximum block size to 16MB. The Subcontractor shall modify the ZFS allocator to cluster small and large allocation units to avoid large block fragmentation. The Subcontractor shall modify Lustre to increase the maximum block size supported to 16MB, to provide application control over the block size used, and to provide per-directory and per-file system block size defaults, including support for lfs ladvise iosize. The Subcontractor shall modify Lustre's client and server side caching policies to optimize both streaming and random I/O performance for files of varying block size. The Subcontractor shall demonstrate that the application streaming I/O efficiency has been improved by increasing the size of the Lustre/ZFS allocation unit end-to-end so that the size of I/Os delivered to drives via RAID subsystems matches the characteristics of modern drives. The Subcontractor shall demonstrate that the I/O performance of Lustre for a range of file block sizes and I/O patterns shows that files optimized for streaming I/O achieve bandwidth requirements without undue performance penalties for random I/O performance or for streaming performance on files optimized for random I/O. The Subcontractor shall use a storage system the same as, or similar to, the Argonne Helium system for all demonstrations. The Subcontractor shall provide all source changes to Lustre and ZFS to upstream source repositories.

1.2 Terms

RAIDZ - a data/parity distribution scheme with a dynamic stripe width

Dynamic Striping - dynamically adjusting the load across virtual devices to distribute bandwidth and re-balance space usage

Virtual Device (VDEV) - an allocation unit, typically a RAIDZ group of disks

Dynamic Fit Placement - an adaptive allocation policy that uses first fit up until a threshold and then switches to best fit

Metaslab - an allocation region in a virtual device

Metaslab Group - the set of metaslabs belonging to a virtual device

vmalloc - kernel interface that allocates a contiguous memory region in the kernel virtual address space

RAIDZ Layout Conventions:
N_d = number of data drives
N_p = number of parity columns
N_t = total number of drives
For the CORAL configuration, N_t = 10, N_d = 8 and N_p = 2.

1.3 Introduction

Concurrent streams of I/O coming from multiple Lustre clients are interleaved at the servers according to arrival time. Since the mapping from file offset to disk offset is controlled by the disk block allocator, allocating writes (all writes with ZFS) may result in streaming disk I/O for a relatively empty file system. However, writes to an aged, relatively full, fragmented file system, and the re-interleaving that occurs on subsequent reads, result in effectively random disk I/O. Since these streams are interleaved at file block granularity, larger file block sizes can ensure that, although disk I/O patterns are random, each disk I/O is for a larger chunk of data. The relative impact of seek overhead is, therefore, reduced and delivered I/O bandwidth improves correspondingly.

This solution architecture document covers the Lustre and ZFS streaming I/O related enhancements for CORAL. The Improvements for Streaming I/O solution will boost overall throughput with larger I/O sizes. These improvements must maintain a consistent level of performance over time, avoiding any performance cliffs due to system memory fragmentation or severe storage free space fragmentation.

For a single-threaded Lustre client, the expected performance depends on the storage configuration (OST count, file stripe width, etc.). Single-threaded Lustre clients typically achieve a significant proportion of the underlying single port network fabric bandwidth.

2 Background

ZFS does not have a single fixed file block size. Different files will use file block sizes appropriate to the I/O size. Under the proposed solution, the backend file block sizes will range from 4KB for very small files up to 16MB for larger files. The ideal file block size for this solution is 16MB, which results in 2MB disk writes for the prescribed RAIDZ2 configuration (8 data + 2 parity). Smaller files will not use such large block sizes. For example, a 1MB file would have a maximum file block size of 1MB, but the file block size could be even smaller if so specified by the client. A workload of primarily small files, however, will not take advantage of larger blocks and their higher throughput into the HDD layer. The proposed solution targets large sequential I/Os delivered as 16MB file blocks to ZFS in order to maximize HDD throughput.

2.1 16MB HDD I/O

The project goal is to improve the Lustre and ZFS delivered bandwidth from the HDD backed storage. The CORAL file system will provide an aggregate of over 1TB/sec from the back-end file system. With the target configuration, each HDD must deliver over 100 MB/sec to meet this goal. A disk I/O size of 2MB can supply a raw bandwidth of 100 MB/sec (see Figure 2-1). Bigger file blocks are needed so that streaming I/O ends up with a 2MB I/O unit per disk.

RAIDZ will break up each file block by the number of data disks in the redundancy group. With the traditional 128KB block maximum and a RAIDZ with N_d = 8, this amounts to only a 16KB chunk submitted to each disk per block. While some amount of I/O merging can happen with each 16KB chunk, this adds more overhead per I/O and does not guarantee contiguous disk writes. However, the same N_d of 8 will divide a 16MB block into 2MB columns and thereby deliver 2MB I/O to the underlying disks. Using 16MB file blocks will satisfy the goal of achieving 2MB sized I/O for streaming files.
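To make the division explicit, a short worked example using the layout conventions from section 1.2 (N_d = 8, N_p = 2):

    column size = block size / N_d
        128KB / 8 = 16KB per disk (traditional maximum block size)
        16MB  / 8 = 2MB per disk (proposed maximum block size)

    allocation size = block size x (N_d + N_p) / N_d
        16MB x 10 / 8 = 20MB on disk (data plus parity, before skip sectors; see section 2.5)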

Figure 2-1: Raw HDD Performance

Though 16MB file blocks may seem aggressive, there should not be any need to make specific large block improvements to either the burst buffer or the Lustre client. Lustre already includes support for large 16MB blocks, though this feature has not yet been used in production. And running the client on the burst buffer should make aggregating large I/Os easier to achieve. The effort will be in tuning the file block size from the client down through the server to the block devices to ensure HDDs deliver optimal performance for the workload.

The ZFS on-disk format now supports 16MB blocks (more details in section 5.3). However, the current ZFS run-time requires some additional work to effectively accommodate large blocks. Since the adoption of larger blocks impacts the disk space allocation mechanisms in ZFS, the following high-level overview provides the necessary context to discuss the affected areas.

2.2 Allocation High Level Overview

A ZFS storage pool is typically comprised of a set of RAIDZ groups known as virtual devices (VDEVs). ZFS further divides the space in each virtual device into approximately 200 regions known as metaslabs. The block allocation policy in ZFS has three basic steps: virtual device selection, metaslab selection and free space selection. When a block of data needs to be stored on disk, ZFS will first select a VDEV. Then, within that VDEV, a target metaslab will be chosen based on a weighting scheme. Finally, within the chosen metaslab, a range of free space will be selected using a dynamic fit approach (first fit initially, transitioning to best fit later).
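For readers unfamiliar with ZFS internals, the following highly simplified C sketch illustrates the three-step flow just described. The structure names, the stub fit functions, and the pure per-allocation VDEV rotation are illustrative assumptions only; the real logic lives in ZFS's metaslab code and is considerably more involved (for example, VDEV rotation actually happens after an allocation quota is consumed, as described in the next paragraph).

/*
 * Simplified sketch of the three-step allocation flow (VDEV selection,
 * metaslab selection, free range selection). Illustrative only; not the
 * actual ZFS metaslab.c implementation.
 */
#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>

struct metaslab {
    uint64_t weight;      /* derived from recent use, free space, fragmentation */
    bool     near_full;   /* dynamic fit: switch from first fit to best fit */
};

struct vdev {
    struct metaslab *ms;  /* roughly 200 metaslabs per virtual device */
    size_t ms_count;
};

/* Stubs standing in for the real free-range searches. */
static uint64_t first_fit(struct metaslab *m, uint64_t size) { (void)m; (void)size; return 0; }
static uint64_t best_fit(struct metaslab *m, uint64_t size)  { (void)m; (void)size; return 0; }

/* Step 1: rotate across VDEVs (dynamic striping, simplified to per-allocation). */
static struct vdev *select_vdev(struct vdev *vdevs, size_t nvdevs, size_t *cursor)
{
    struct vdev *vd = &vdevs[*cursor];
    *cursor = (*cursor + 1) % nvdevs;
    return vd;
}

/* Step 2: choose the metaslab with the highest weight. */
static struct metaslab *select_metaslab(struct vdev *vd)
{
    struct metaslab *best = &vd->ms[0];
    for (size_t i = 1; i < vd->ms_count; i++)
        if (vd->ms[i].weight > best->weight)
            best = &vd->ms[i];
    return best;
}

/* Step 3: dynamic fit placement within the chosen metaslab. */
static uint64_t select_free_range(struct metaslab *m, uint64_t size)
{
    return m->near_full ? best_fit(m, size) : first_fit(m, size);
}

uint64_t alloc_block(struct vdev *vdevs, size_t nvdevs, size_t *cursor, uint64_t size)
{
    struct vdev *vd = select_vdev(vdevs, nvdevs, cursor);
    struct metaslab *m = select_metaslab(vd);
    return select_free_range(m, size);
}

int main(void)
{
    struct metaslab ms[2] = { { .weight = 10 }, { .weight = 20 } };
    struct vdev vd = { .ms = ms, .ms_count = 2 };
    size_t cursor = 0;
    (void)alloc_block(&vd, 1, &cursor, 16 << 20);   /* allocate one 16MB block */
    return 0;
}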

To fully utilize all available disk bandwidth within a storage pool, ZFS will round-robin I/O writes across the set of VDEVs, moving to a new VDEV after a fixed amount has been allocated. In addition, the allocated amount can be adjusted when new VDEVs are added or when the free space needs to be rebalanced across the VDEVs. This adaptive spreading of the load across all VDEVs is also referred to as Dynamic Striping.

After filling a metaslab, ZFS will move to the next best metaslab based on a weighting scheme. The metaslab with the highest weight wins. The weighting criteria consider, among other things, recent use, available space, and fragmentation state. It is worth noting here that the incoming metaslab allocations will contain a mixture of file data (large blocks) and metadata (very small blocks). Each metadata block is also replicated onto a different virtual device for additional redundancy. These extra metadata copies are internally referred to as metadata ditto blocks.

The introduction of large blocks has some unintended consequences on the aforementioned metaslab allocation process. The more significant impacts of dynamic striping, interleaved metadata, skip sectors, and allocation policies are described in more detail in the following sections.

2.3 Dynamic Striping Impact

The target amount to allocate (i.e. write) before moving to the next virtual device defaults to a fixed quota value multiplied by the number of drives in the group. The current allocation quota value (metaslab_aliquot = 512K) was chosen over ten years ago based on the HDD characteristics at the time. However, with 4-16MB sized blocks the original threshold value is inappropriately sized. Furthermore, the space re-balancing mechanism assumes that the overall quota threshold contains multiple physical blocks that can be adjusted up/down with a simple bias calculation. With large blocks the biasing mechanism will cease to function and no re-balancing will occur.

With a RAIDZ N_t = 10, the target threshold would be 512K * 10 = 5MB, which is only 1/4 of the 20MB required to store a single 16MB block (data plus parity). The solution proposal needs to offer a remedy for effective dynamic striping with large blocks.

2.4 Interleaved Metadata Impact

It is easy to demonstrate that the metadata blocks being allocated are small; in most cases they compress to fit within a single 4K block. With RAIDZ2 and using 4K sectors, this amounts to a 12K allocation (4K data + 2x4K for parity). The astute reader will note that this is a de facto triple mirror, but with the overhead of calculating parity.

A workload that pushes ZFS to the point where all the metaslab regions have been activated and have undergone space reclamation creates an on-disk layout representative of an aged file system.

While running such a file system aging workload, it was easy to observe that over half of the overall allocation requests made were for metadata. And, significantly, that metadata was mostly 12K in size (see Figure 2-2).

Figure 2-2: Realtime Allocations

Due to copy-on-write (COW), many metadata blocks will continuously need to find a new home. The incoming file data will at the same time force a transition to other metaslabs to locate space for the larger file blocks (1280KB each in the evaluation FS aging runs). These file system aging workloads left an excess of free 12K islands (thousands) in every metaslab, even though by size the metadata is a tiny fraction of the overall allocation space. The trend appears to artificially provision the pool to always have residual metadata-only space in each metaslab. The solution proposal needs to mitigate any large block fragmentation problems.

2.5 RAIDZ Skip Sector Impact

In RAIDZ, disks are divided up into sectors (4K) and the allocation space is always in units of sectors. The allocation space can be visualized as a grid, N_t columns wide, where each column represents a specific disk and each row represents a set of different sectors (one from each disk). The smallest allocated sector amount would be N_p + 1. To avoid the creation of a free range that is less than this minimum allocation, RAIDZ will round up each allocation to the nearest multiple of (N_p + 1) and skip these unused sectors. Skipped sectors ensure there is never a free range of sectors that is too small to allocate. For the CORAL configuration, N_p + 1 is three sectors. With an even count of data disks (like N_d = 8), every RAIDZ allocation for a power-of-two block (like 16MB) will require skip sectors to make the allocation count a multiple of three.

Ideally the RAIDZ allocations can guarantee that larger blocks (i.e. 1MB - 16MB) end up with a large common multiple (e.g. 256KB), to improve coalescing after freeing.
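The round-up is easy to reproduce. The program below is a simplified model of the RAIDZ allocation-size calculation (the real logic lives in ZFS's vdev_raidz code); the constants assume the CORAL 8+2 layout with 4KB sectors, and it prints the actual allocated sizes, skip sectors included, for a few block sizes.

/*
 * Simplified model of the RAIDZ allocation-size round-up described above
 * (the real calculation lives in ZFS's vdev_raidz code). Constants assume
 * the CORAL 8+2 layout with 4KB sectors.
 */
#include <stdio.h>
#include <stdint.h>

#define SECTOR_SHIFT 12   /* 4KB sectors */
#define NDATA        8    /* N_d */
#define NPARITY      2    /* N_p */

static uint64_t raidz_asize(uint64_t psize)
{
    /* data sectors, rounded up to whole sectors */
    uint64_t sectors = (psize + (1ULL << SECTOR_SHIFT) - 1) >> SECTOR_SHIFT;
    /* one parity sector per parity column for each (partial) data row */
    uint64_t rows = (sectors + NDATA - 1) / NDATA;
    uint64_t asize = sectors + rows * NPARITY;
    /* round up to a multiple of (N_p + 1); the padding is the skip sectors */
    uint64_t rem = asize % (NPARITY + 1);
    if (rem != 0)
        asize += (NPARITY + 1) - rem;
    return asize << SECTOR_SHIFT;
}

int main(void)
{
    uint64_t sizes[] = { 512ULL << 10, 1ULL << 20, 16ULL << 20 };
    for (int i = 0; i < 3; i++)
        printf("%6llu KB block -> %llu KB allocated\n",
               (unsigned long long)(sizes[i] >> 10),
               (unsigned long long)(raidz_asize(sizes[i]) >> 10));
    return 0;
}

Under this model a 16MB block consumes 5121 sectors (20484KB) rather than an even 20MB, i.e. an allocation that is only a 4KB multiple, matching the odd-sized allocations discussed next.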

For the CORAL RAIDZ configuration (8 + 2, 4KB sectors), Figure 2-3 shows the ideal allocation size values. However, the legacy skip sector requirement inhibits this goal, resulting in odd-sized allocations (i.e. only a 4KB multiple, not 256KB or higher). Elimination of skip sectors is desirable to achieve the best possible allocation unit multiples, as another means to minimize free space fragmentation.

BLOCK SIZE   COLUMN SIZE   PARITY SIZE   ALLOC SIZE
512 KB       64 KB         128 KB        640 KB
1 MB         128 KB        256 KB        1.25 MB
2 MB         256 KB        512 KB        2.50 MB
4 MB         512 KB        1 MB          5 MB
8 MB         1 MB          2 MB          10 MB
16 MB        2 MB          4 MB          20 MB

Figure 2-3: Ideal RAIDZ large allocation sizes (sans skip sectors)

2.6 Block Allocation Impact

The current allocation policies were designed for a maximum block size of 128KB, at a time when drives were on the order of 200GB. Some of the threshold parameters used by the lowest allocation layer no longer make sense when large blocks are in the mix. For example, the Dynamic Fit policy waits until it cannot satisfy a max-block-size request (16MB) before moving to best fit. But for a 16MB block on RAIDZ-2 with N_t = 10, the resulting allocation would be 20MB. The other trigger for moving to best fit is when the metaslab free space percentage drops to 4%. With 12TB drives and N_t = 10, we end up with 512GB per metaslab, and 4% of that is 20 GB.

The allocator further segments requests into areas with a prescribed alignment. That alignment is roughly half the requested size, so a 16MB block is assigned into an 8MB alignment area. For larger blocks, a maximum alignment of 256KB (a common multiple) makes more sense. These and other adjustments are needed to ensure the allocation policies perform as intended.

We expect large block allocations to improve RAIDZ rebuild performance. Though larger blocks will require reserving some memory for the rebuild read-ahead buffers, the size of the file's metadata (parent block pointers) will be significantly smaller. And, instead of the potentially hundreds of random block reads needed when the max block size was 128KB, we will have many fewer 16MB reads.

Otherwise, the impacts of large block allocation on the ZFS caching layers will need to be measured as the project proceeds.

2.7 Runtime Memory Impact

ZFS on Linux has traditionally used the kernel virtual memory allocator, vmalloc(), to satisfy the memory for its file buffers (inflight I/O and cached I/O) because the ZFS code was originally written for Solaris and accesses the buffer contents directly. However, there is a limitation on how much memory can be mapped with vmalloc(), and it should be used sparingly to avoid performance issues (like increased TLB thrashing). The adoption of large blocks will significantly increase the demand for larger memory allocations. Correspondingly, the ZFS run-time must use an alternative to vmalloc() when allocating memory for its file data buffers. A scatter/gather buffer mechanism is the proposed alternative for allocating file buffer memory in ZFS.
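The following userspace-only C sketch illustrates the scatter/gather idea: a 16MB file block held as a list of page-sized chunks rather than as one large contiguous (vmalloc-style) mapping. All names here are illustrative assumptions; the actual kernel-side implementation (the ABD work discussed in section 5.3.2) differs in detail.

/*
 * Conceptual, userspace-only sketch of a scatter/gather buffer: a 16MB
 * "block" held as a list of page-sized chunks instead of one contiguous
 * region. Illustrative only; error handling and freeing omitted.
 */
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

#define CHUNK_SIZE 4096

struct sg_buf {
    size_t size;       /* logical buffer size in bytes */
    size_t nchunks;    /* number of CHUNK_SIZE pieces */
    void **chunks;     /* individually allocated pieces */
};

static struct sg_buf *sg_alloc(size_t size)
{
    struct sg_buf *b = calloc(1, sizeof(*b));
    b->size = size;
    b->nchunks = (size + CHUNK_SIZE - 1) / CHUNK_SIZE;
    b->chunks = calloc(b->nchunks, sizeof(void *));
    for (size_t i = 0; i < b->nchunks; i++)
        b->chunks[i] = malloc(CHUNK_SIZE);   /* no large contiguous region needed */
    return b;
}

/* Copy data into the buffer at a byte offset, spanning chunk boundaries. */
static void sg_write(struct sg_buf *b, size_t off, const void *src, size_t len)
{
    const char *p = src;
    while (len > 0) {
        size_t idx  = off / CHUNK_SIZE;
        size_t coff = off % CHUNK_SIZE;
        size_t n    = CHUNK_SIZE - coff;
        if (n > len)
            n = len;
        memcpy((char *)b->chunks[idx] + coff, p, n);
        off += n; p += n; len -= n;
    }
}

int main(void)
{
    struct sg_buf *b = sg_alloc(16 << 20);       /* one 16MB file block */
    char data[8192] = "streaming payload";
    sg_write(b, 4090, data, sizeof(data));       /* straddles a chunk boundary */
    printf("16MB block held in %zu discontiguous chunks\n", b->nchunks);
    return 0;
}

The point of the structure is that no single allocation is larger than one page, so the buffer can be built even when kernel memory is fragmented; the cost is that readers and writers must walk the chunk list instead of using plain pointer arithmetic.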

3 Solution Requirements

The requirement for 2MB HDD I/O is at the heart of the solution and introduces the requirement that the file block size increase to 16MB. This in turn adds additional requirements so that block sizes can be controlled and that disk space allocation policies adapt to larger blocks. Metadata isolation will keep the large data block allocations separated from the much smaller metadata allocations.

3.1 Native 16MB Blocks

The end-to-end storage stack must be capable of using native 16MB blocks (from the Lustre client down to the ZFS storage pool). Client I/O aggregation should be limited to larger writes and perhaps BRW traffic. It is not expected that smaller I/O will be held and aggregated to a full 16MB on the client, but some aggregation may still be done to avoid too-small RPCs. Remember that the CORAL design has Lustre clients only on the burst buffer nodes. If the application reads from or writes to the burst buffer, then I/O is best configured to make use of large block I/Os.

3.2 Two Megabyte Disk I/O

When end-to-end 16MB blocks are in play, the back-end disk I/O will be no less than 2MB per drive when configured with the prescribed 8 data drives in a RAIDZ configuration. This 2MB size requirement applies to both writing and reading. This minimum size is not achieved from I/O aggregation but from actual allocated units.

3.3 File Data Buffers

As mentioned in section 2.7 above, the ZFS run-time must use an alternative to vmalloc() when allocating memory for its file data buffers. A scatter/gather buffer mechanism is the proposed alternative for allocating file buffer memory in ZFS.

3.4 File Block Size Selection

Large blocks are great for streaming, but they may not be appropriate for every workload. Controls are required to make sure an appropriate block size is used for each workload. The resulting performance must improve on the status quo for ZFS, where the block size was chosen solely based on file size and the block size range was much smaller. 16M blocks will only be auto-selected when the I/O write sizes are that large.

The solution shall provide a separate API, ladvise(), for users and applications to set the back-end block size of a Lustre file across a range of sizes.

The ladvise() API is modeled after the Linux fadvise() API and is intended to have similar hints (RANDOM/SEQUENTIAL/WILL_NEED/DONT_NEED). The supported range is power-of-two sizes between 4K and 16M. The block size cannot be changed once the file is written (set once).

In the absence of an application-supplied block size, the solution shall provide a default size. The default can be statically or dynamically configured. The static default value will be configured per file system or per directory. The dynamic default value will be based on context (such as initial write locations and sizes) and will be used for dynamic workloads that did not specify a size.

3.5 Non-streaming I/O Performance

Performance for varying workloads and block sizes (e.g. random or smaller-sized I/O) should not degrade compared to the status quo. Non-streaming I/O patterns will have acceptable performance when the correct block size has been assigned (see section 3.4 above).

3.6 Aged File System Performance

The performance of an aged file system must continue to maintain a baseline efficiency. Use of larger blocks must not accelerate allocation space fragmentation. Smaller block allocations will be segregated from larger block allocations. This baseline efficiency must be measured on a file system that has reached as high as 80% full and has undergone a realistic history of file creation, writes and deletions.

3.7 Declustered RAIDZ Integration

The solution must function in tandem with the RAIDZ Declustering solution. Further investigations are required to determine where isolated metadata is stored. Any metadata isolation method must not impact the ability of RAIDZ Declustering to spread I/O across the set of pool drives.

4 Use Cases

4.1 Checkpoint Dump and Drain

1. Application checkpoint data is written into burst buffers.
2. Data is then drained asynchronously using large I/O writes through Lustre and ultimately into ZFS to free space in the burst buffers for the next checkpoint.
3. The rate of data draining directly determines availability for the next checkpoint.
4. The desired drain bandwidth must be maintained while other application file system load is present.
5. This cycle of application checkpoints repeats at regular intervals.

4.2 Checkpoint Restarts

1. Previous checkpoint data is staged into burst buffers using large read I/O requests from Lustre/ZFS.
2. The rate of data reads will determine the job restart delay.
3. The desired read performance for staging must be maintained while other application load is present.
4. Application job restarts using the checkpoint data.
5. In the case where the burst buffer is unavailable, restart data can be read directly from Lustre/ZFS. As in the staging case, maximum read efficiency is desired to minimize restart delays.

4.3 Mixed Workloads

1. Simultaneous jobs with various I/O workloads are serviced by the Lustre file system:
   1. Multiple streams with large data blocks (streaming I/O)
   2. Small block random I/O to multiple files (parallel I/O)
   3. Checkpoint drainage I/O streams
2. Each application workload obtains an appropriate file block size:
   1. As requested by the application, or
   2. As configured by the file system, or
   3. As determined by the initial file I/O pattern
3. The chosen block size persists for the life of the file.
4. The file system adapts across a range of active block sizes:
   1. caching policies
   2. I/O scheduler
   3. block allocation scheme

4.4 Aged File Systems

1. The file system (Lustre with back-end ZFS) undergoes a realistic history of file creation, writes and deletions.
2. The file system fills up over time and older job data and checkpoint data is periodically removed to make room for new data.
3. The file system remains online while continuing to provide a stable I/O bandwidth efficiency (the provided bandwidth does not significantly degrade over time).
4. A separate de-fragmentation process is not required.

5 Solution Proposal

5.1 Lustre

The Lustre client can aggregate I/O requests in order to generate optimally-sized I/O for the Lustre server, so that data can later be read in the same size chunks as it was written. Lustre servers also have the opportunity to aggregate I/O in order to generate optimally-sized I/O for the disk file system. Our preference is to have heuristics in Lustre detect the I/O pattern and generate the correct I/O size for the application's I/O workload. Clients should aggregate small I/O if the I/O is contiguous, or to mitigate worst-case random writes and reduce RPC counts to a manageable level. Servers should determine the block size by default. Using the layout stripe size to also reflect the block size, for example, would allow specifying the default block size for a file or directory without the need to implement a separate mechanism from the Lustre layout.

Nonetheless, there will always be I/O patterns that do not match the heuristics provided or are detected incorrectly. In this case, there will be an API provided to allow the application or library to specify behavior that it knows to be correct (e.g., random writes with a specific block size, prefetching of data from disk, or dropping data from cache that is no longer needed). This API will be provided through the Lustre ladvise interface. (Note that the kernel fadvise() interface is not suitable because it only interfaces with the memory subsystem of a single client, and does not interface at all with the client-side file system or the server.) Our expectation is that the CORAL burst buffer will be tuned for the Aurora file system implementation and, hence, will be able to use the API to set the block size to maximize I/O performance for its workflow. A minimal sketch of how such a block size selection might behave follows the list below.

1. Increase the maximum supported block size to 16MB:
   1. The Lustre OSD will determine the largest supported block size by querying ZFS at mount time and communicate this to Lustre clients
   2. Introduction of a client interface, lfs ladvise iosize, to set the file block size
   3. Introduction of a block size default per file system and per directory
   4. Allow adaptive block size selection based on initial I/O context
   5. Communicate the desired block size to the ZFS layer
2. Lustre's client and server side caching policies will be modified to optimize both streaming and random I/O performance for files of varying block size. This must ensure caching is sufficient for large block streaming I/O and that no regression is introduced for other workloads.
3. Adopt the new scatter/gather buffer interface from ZFS into the Lustre OSD module.
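The following userspace C sketch is a hypothetical illustration (not the Lustre or ZFS implementation) of the selection behavior described above: an explicit hint takes precedence, then a per-directory or per-file-system default, and otherwise a size is inferred from the initial I/O pattern; the chosen value is rounded up to a power of two and clamped to the 4K-16M range.

/*
 * Hypothetical illustration of block size selection: hint, then default,
 * then inference from the initial write size; the result is rounded up to
 * a power of two and clamped to [4KB, 16MB]. Not actual Lustre/ZFS code.
 */
#include <stdint.h>
#include <stdio.h>

#define MIN_BLOCK_SIZE (4ULL << 10)    /* 4KB  */
#define MAX_BLOCK_SIZE (16ULL << 20)   /* 16MB */

static uint64_t clamp_block_size(uint64_t requested)
{
    uint64_t size = MIN_BLOCK_SIZE;

    while (size < requested && size < MAX_BLOCK_SIZE)
        size <<= 1;                    /* round up to the next power of two */
    return size;
}

/* hint: explicit application request (0 if none); dir_default: per-directory
 * or per-file-system default (0 if none); first_write: size of the initial
 * write, used to infer a block size for dynamic workloads. */
static uint64_t select_block_size(uint64_t hint, uint64_t dir_default,
                                  uint64_t first_write)
{
    if (hint != 0)
        return clamp_block_size(hint);
    if (dir_default != 0)
        return clamp_block_size(dir_default);
    return clamp_block_size(first_write);
}

int main(void)
{
    /* inferred from a 16MB initial write, and from an explicit 1MB hint */
    printf("%llu\n", (unsigned long long)select_block_size(0, 0, 16ULL << 20));
    printf("%llu\n", (unsigned long long)select_block_size(1ULL << 20, 0, 0));
    return 0;
}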

5.2 ZFS

1. Increase the default 1MB maximum supported block size to 16MB. Address any new issues/regressions that may arise from this increase. Section 5.3 outlines some additional large block related work.
2. To minimize long term free space fragmentation, the solution will isolate metadata allocations from the much larger file data allocations. Isolation will either come from dedicated metaslabs or from a separate metadata allocation class (a sketch of the size-based routing idea follows this list).
3. Adjustments to allocation policies are needed to accommodate larger allocation mixes and address the policy issues raised in section 2. The existing metaslab fragmentation metrics will be leveraged/improved to steer the policies and to monitor the free space fragmentation over time.
4. Additional Performance Improvements (as needed). Following the increase of the file data I/O size to 16MB, it is expected that additional efficiency bottlenecks and memory allocation issues will be discovered. The exact set of improvements is not known, but the following areas will be explored by Intel:
   1. Improve physical locality for file data (sort the dirty data list at transaction group (txg) sync).
   2. Consume lfs ladvise hints from Lustre (such as Random, Sequential, Will-Need and Don't-Need).
   3. Address I/O path lock contention/lack of scaling as CPU core counts increase.
   4. Overlap inline compute and I/O during transaction group syncing (e.g. overlap RAIDZ parity generation with async data column writes).
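As a purely illustrative sketch of the isolation idea in item 2 (the 128KB cutoff, the names, and the two-class split are assumptions, not the selected design), allocations could be routed by type and size so that small metadata blocks never land in the metaslabs reserved for large file blocks:

/*
 * Illustrative only: routing allocations to a metadata class or a data
 * class by type and size. The 128KB cutoff and the two-class split are
 * assumptions for the sketch, not the proposed ZFS design.
 */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

enum alloc_class { CLASS_METADATA, CLASS_DATA };

#define SMALL_ALLOC_THRESHOLD (128ULL << 10)

static enum alloc_class classify_allocation(uint64_t psize, bool is_metadata)
{
    /* Keep 12K metadata (and other small) allocations out of the metaslabs
     * used for 1MB-16MB file blocks, so freed small "islands" do not
     * fragment the large-block free space. */
    if (is_metadata || psize <= SMALL_ALLOC_THRESHOLD)
        return CLASS_METADATA;
    return CLASS_DATA;
}

int main(void)
{
    printf("%d %d\n", classify_allocation(12 << 10, true),
           classify_allocation(16 << 20, false));   /* metadata vs. 16MB data */
    return 0;
}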

5.3 Relevant ZFS Open Source Community Work

The proposed solution architecture leverages existing ongoing work in the ZFS community. Leveraged work includes the support for large blocks, the addition of scatter/gather buffers for ZFS file data, and the metaslab fragmentation tracking mechanisms (as mentioned in section 5.2).

5.3.1 Large Block Work

The on-disk support for using large blocks is now included in the ZFS on Linux master repository. This change allows for using up to 1MB blocks on pools that (a) opt in to the pool feature, feature@large_blocks, and (b) set the dataset recordsize property to a value larger than 128K. The current implementation limits the maximum size to 1MB due to some of the issues discussed in sections 2.3, 2.4, and 2.5, even though the on-disk format can accommodate values up to 16MB. To get to 16MB blocks, the run-time setting, zfs_max_recordsize, needs to be increased beyond the current 1MB default.

To our knowledge, there has been no production use of larger than 1MB blocks with ZFS. Since the use of large blocks is entirely opt-in, it is assumed that there has been minimal testing and limited exposure to large blocks in practice. In addition to the work stated in section 5.2, additional run-time work is expected to include the following:

1. Add full stack test coverage of 16MB blocks to the ztest and ZFS test suite tools.
2. Specifically address large block induced free space fragmentation.
3. Increase the gang block header size on pools with 4KB sector drives to minimize the impact when using ganged blocks.
4. ZFS I/O scheduler tunings (optional):
   1. Cut-in-line for smaller reads
   2. Allow larger than 128KB aggregations
   3. Size-adjusted queue depths

5.3.2 Scatter/Gather Data Buffer Work

Intel had originally scoped scatter/gather buffer development for CORAL. Recently a relevant solution was proposed in the ZFS on Linux community. While this initial proposal has a narrower focus than our original plans, it will be a sufficient foundation to build upon. Our current plan is to leverage this community work (aka ARC Buffer Data, or ABD) once it is officially accepted into the ZFS on Linux Git repository (i.e. in the master branch). The scatter/gather solution needs to be validated with Lustre and 16MB blocks. Intel can augment this with additional follow-on work as needed. Some additional I/O areas could benefit from using scatter/gather buffers.

Recommended follow-on work for scatter/gather buffers:

1. Comprehensive testing:
   1. Lustre workloads with 16MB blocks using our RAIDZ configurations
   2. Long running workloads (days)
   3. Memory pressure stress testing
2. Add run-time memory usage stats (kstats) for scatter/gather buffers (similar to the ZIO buffer stats available at /proc/spl/kmem/slab)
3. Extend adoption beyond just the ARC data buffers as needed:
   1. Supply a scatter/gather buffer interface to the Lustre OSD
   2. Use for RAID-Z reconstruction
   3. Use for RAID-Z parity buffers
   4. Use for other data transforms (like LZ4 data compression)

6 Solution Test Plans

The focus of the Solution Test Plan is to describe the "what" as opposed to the "how" of the test plan, which is being created alongside the High Level Design (HLD). This plan will include descriptions of the following test workloads:

1) For file system aging, we are converting and updating an internal script into a more powerful Python based tool for file system aging. We also use iotest and its bigbench.sh script and are currently evaluating another tool.
2) For core benchmarks, we will use our SOAK test framework, which includes mdtest, IOR, simul, racer, Kcompile, blobench and pct.
3) Our realistic tests include Lustre server failover, CPU/memory hog, and, in the future, Lustre fault injection.
4) For mixed workloads, the current plan is to use various iozone options to generate different traffic types (streaming, random, strided, backwards, burst), to use filebench to simulate different workloads, and to use simul and mdtest periodically to generate large amounts of metadata and other operations.

Failure injection at the lower layers, such as injection of disk failures, path-to-JBOD failures, ZFS failures, failover, etc., is covered in the de-clustered RAIDZ SA, as that is a more natural place to consider such things. Once the prototype for de-clustered RAID stabilizes, all of these tests will be run with Lustre and big blocks.

6.1 Unit Testing and Automated Code Verification

1. Verify that 16MB I/O units are active across the stack (Lustre clients into the ZFS backend).
2. Verify that Lustre client interfaces can directly set block sizes between 128K and 16MB (per file, per directory or per file system).
3. Validate the logic and functionality by which the assigned block size adapts with a reasonable default.
4. Verify that the code mitigates fragmentation with the metadata isolation schemes.

6.2 Integration Testing

1. Test Lustre streaming performance with both large block streaming and burst workloads. Record important statistics.
2. Develop simulated, mixed workload tests that have multiple I/O streams with large blocks to mimic streaming I/O and random small blocks to multiple files to mimic parallel I/O.

3. Generate realistic workloads on aged file systems in terms of file creation, blocks written and read, and files and directories removed. Tests must verify the I/O bandwidth efficiency as the file system is in various states of aging. File system aging scripts are used to induce fragmentation pressure. Fragmentation statistics are recorded to correlate the impact on performance over time.
4. All integration tests shall be run in various declustered RAIDZ configurations.

6.3 System and Black Box Testing with Other CORAL Software

1. Use SCR to initiate a checkpoint to generate various types of checkpoint files. Use the CPPR data mover to send checkpoints to back-end storage through the function shipper. Tests must cover direct paths to Lustre and indirect paths through the burst buffers. The performance characteristics of both paths will be recorded. Initiate checkpoint restarts that demonstrate data flowing back into the application nodes.
2. Run HACC IO, IOR, WRF (NetCDF) and FlashIO (HDF5) from clients through the function shipper to send I/O through various I/O libraries, and record ZFS and block device statistics to measure performance through the CORAL I/O software stack.

7 Acceptance Criteria

Intel will validate its progress toward acceptance of this NRE program through the following criteria:

1. Test plans described in section 6, Solution Test Plans, are fully executed and all tests pass.
2. All requirements listed in section 3, Solution Requirements, are validated.
3. An agreed set of use cases are demonstrated.


More information

HP AutoRAID (Lecture 5, cs262a)

HP AutoRAID (Lecture 5, cs262a) HP AutoRAID (Lecture 5, cs262a) Ion Stoica, UC Berkeley September 13, 2016 (based on presentation from John Kubiatowicz, UC Berkeley) Array Reliability Reliability of N disks = Reliability of 1 Disk N

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff and Shun Tak Leung Google* Shivesh Kumar Sharma fl4164@wayne.edu Fall 2015 004395771 Overview Google file system is a scalable distributed file system

More information

EMC CLARiiON Backup Storage Solutions

EMC CLARiiON Backup Storage Solutions Engineering White Paper Backup-to-Disk Guide with Computer Associates BrightStor ARCserve Backup Abstract This white paper describes how to configure EMC CLARiiON CX series storage systems with Computer

More information

Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System

Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System Donald S. Miller Department of Computer Science and Engineering Arizona State University Tempe, AZ, USA Alan C.

More information

Implementing a Statically Adaptive Software RAID System

Implementing a Statically Adaptive Software RAID System Implementing a Statically Adaptive Software RAID System Matt McCormick mattmcc@cs.wisc.edu Master s Project Report Computer Sciences Department University of Wisconsin Madison Abstract Current RAID systems

More information

IBM MQ Appliance HA and DR Performance Report Model: M2001 Version 3.0 September 2018

IBM MQ Appliance HA and DR Performance Report Model: M2001 Version 3.0 September 2018 IBM MQ Appliance HA and DR Performance Report Model: M2001 Version 3.0 September 2018 Sam Massey IBM MQ Performance IBM UK Laboratories Hursley Park Winchester Hampshire 1 Notices Please take Note! Before

More information

LATEST INTEL TECHNOLOGIES POWER NEW PERFORMANCE LEVELS ON VMWARE VSAN

LATEST INTEL TECHNOLOGIES POWER NEW PERFORMANCE LEVELS ON VMWARE VSAN LATEST INTEL TECHNOLOGIES POWER NEW PERFORMANCE LEVELS ON VMWARE VSAN Russ Fellows Enabling you to make the best technology decisions November 2017 EXECUTIVE OVERVIEW* The new Intel Xeon Scalable platform

More information

JD Edwards World Electronic Burst and Bind Guide. Version A9.1

JD Edwards World Electronic Burst and Bind Guide. Version A9.1 JD Edwards World Electronic Burst and Bind Guide Version A9.1 Revised - December 15, 2007 JD Edwards World Electronic Burst and Bind Guide Copyright 2006, Oracle. All rights reserved. The Programs (which

More information

CA ERwin Data Profiler

CA ERwin Data Profiler PRODUCT BRIEF: CA ERWIN DATA PROFILER CA ERwin Data Profiler CA ERWIN DATA PROFILER HELPS ORGANIZATIONS LOWER THE COSTS AND RISK ASSOCIATED WITH DATA INTEGRATION BY PROVIDING REUSABLE, AUTOMATED, CROSS-DATA-SOURCE

More information

Operating Systems Design Exam 2 Review: Spring 2012

Operating Systems Design Exam 2 Review: Spring 2012 Operating Systems Design Exam 2 Review: Spring 2012 Paul Krzyzanowski pxk@cs.rutgers.edu 1 Question 1 Under what conditions will you reach a point of diminishing returns where adding more memory may improve

More information

Messaging Overview. Introduction. Gen-Z Messaging

Messaging Overview. Introduction. Gen-Z Messaging Page 1 of 6 Messaging Overview Introduction Gen-Z is a new data access technology that not only enhances memory and data storage solutions, but also provides a framework for both optimized and traditional

More information

ZEST Snapshot Service. A Highly Parallel Production File System by the PSC Advanced Systems Group Pittsburgh Supercomputing Center 1

ZEST Snapshot Service. A Highly Parallel Production File System by the PSC Advanced Systems Group Pittsburgh Supercomputing Center 1 ZEST Snapshot Service A Highly Parallel Production File System by the PSC Advanced Systems Group Pittsburgh Supercomputing Center 1 Design Motivation To optimize science utilization of the machine Maximize

More information

Foster B-Trees. Lucas Lersch. M. Sc. Caetano Sauer Advisor

Foster B-Trees. Lucas Lersch. M. Sc. Caetano Sauer Advisor Foster B-Trees Lucas Lersch M. Sc. Caetano Sauer Advisor 14.07.2014 Motivation Foster B-Trees Blink-Trees: multicore concurrency Write-Optimized B-Trees: flash memory large-writes wear leveling defragmentation

More information

Moneta: A High-performance Storage Array Architecture for Nextgeneration, Micro 2010

Moneta: A High-performance Storage Array Architecture for Nextgeneration, Micro 2010 Moneta: A High-performance Storage Array Architecture for Nextgeneration, Non-volatile Memories Micro 2010 NVM-based SSD NVMs are replacing spinning-disks Performance of disks has lagged NAND flash showed

More information

SoftNAS Cloud Performance Evaluation on AWS

SoftNAS Cloud Performance Evaluation on AWS SoftNAS Cloud Performance Evaluation on AWS October 25, 2016 Contents SoftNAS Cloud Overview... 3 Introduction... 3 Executive Summary... 4 Key Findings for AWS:... 5 Test Methodology... 6 Performance Summary

More information

Basic Memory Management. Basic Memory Management. Address Binding. Running a user program. Operating Systems 10/14/2018 CSC 256/456 1

Basic Memory Management. Basic Memory Management. Address Binding. Running a user program. Operating Systems 10/14/2018 CSC 256/456 1 Basic Memory Management Program must be brought into memory and placed within a process for it to be run Basic Memory Management CS 256/456 Dept. of Computer Science, University of Rochester Mono-programming

More information

Basic Memory Management

Basic Memory Management Basic Memory Management CS 256/456 Dept. of Computer Science, University of Rochester 10/15/14 CSC 2/456 1 Basic Memory Management Program must be brought into memory and placed within a process for it

More information

Chapter 8: Memory-Management Strategies

Chapter 8: Memory-Management Strategies Chapter 8: Memory-Management Strategies Chapter 8: Memory Management Strategies Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel 32 and

More information

PRESERVE DATABASE PERFORMANCE WHEN RUNNING MIXED WORKLOADS

PRESERVE DATABASE PERFORMANCE WHEN RUNNING MIXED WORKLOADS PRESERVE DATABASE PERFORMANCE WHEN RUNNING MIXED WORKLOADS Testing shows that a Pure Storage FlashArray//m storage array used for Microsoft SQL Server 2016 helps eliminate latency and preserve productivity.

More information

Recommendations for Aligning VMFS Partitions

Recommendations for Aligning VMFS Partitions VMWARE PERFORMANCE STUDY VMware ESX Server 3.0 Recommendations for Aligning VMFS Partitions Partition alignment is a known issue in physical file systems, and its remedy is well-documented. The goal of

More information

SolidFire and Ceph Architectural Comparison

SolidFire and Ceph Architectural Comparison The All-Flash Array Built for the Next Generation Data Center SolidFire and Ceph Architectural Comparison July 2014 Overview When comparing the architecture for Ceph and SolidFire, it is clear that both

More information

InfiniBand Networked Flash Storage

InfiniBand Networked Flash Storage InfiniBand Networked Flash Storage Superior Performance, Efficiency and Scalability Motti Beck Director Enterprise Market Development, Mellanox Technologies Flash Memory Summit 2016 Santa Clara, CA 1 17PB

More information

Enterprise print management in VMware Horizon

Enterprise print management in VMware Horizon Enterprise print management in VMware Horizon Introduction: Embracing and Extending VMware Horizon Tricerat Simplify Printing enhances the capabilities of VMware Horizon environments by enabling reliable

More information

IBM Tivoli Storage Manager for HP-UX Version Installation Guide IBM

IBM Tivoli Storage Manager for HP-UX Version Installation Guide IBM IBM Tivoli Storage Manager for HP-UX Version 7.1.4 Installation Guide IBM IBM Tivoli Storage Manager for HP-UX Version 7.1.4 Installation Guide IBM Note: Before you use this information and the product

More information

Chapter 8: Main Memory

Chapter 8: Main Memory Chapter 8: Main Memory Chapter 8: Memory Management Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel 32 and 64-bit Architectures Example:

More information

File. File System Implementation. File Metadata. File System Implementation. Direct Memory Access Cont. Hardware background: Direct Memory Access

File. File System Implementation. File Metadata. File System Implementation. Direct Memory Access Cont. Hardware background: Direct Memory Access File File System Implementation Operating Systems Hebrew University Spring 2009 Sequence of bytes, with no structure as far as the operating system is concerned. The only operations are to read and write

More information

CS307: Operating Systems

CS307: Operating Systems CS307: Operating Systems Chentao Wu 吴晨涛 Associate Professor Dept. of Computer Science and Engineering Shanghai Jiao Tong University SEIEE Building 3-513 wuct@cs.sjtu.edu.cn Download Lectures ftp://public.sjtu.edu.cn

More information

Operating Systems. Week 9 Recitation: Exam 2 Preview Review of Exam 2, Spring Paul Krzyzanowski. Rutgers University.

Operating Systems. Week 9 Recitation: Exam 2 Preview Review of Exam 2, Spring Paul Krzyzanowski. Rutgers University. Operating Systems Week 9 Recitation: Exam 2 Preview Review of Exam 2, Spring 2014 Paul Krzyzanowski Rutgers University Spring 2015 March 27, 2015 2015 Paul Krzyzanowski 1 Exam 2 2012 Question 2a One of

More information

Red Hat Gluster Storage performance. Manoj Pillai and Ben England Performance Engineering June 25, 2015

Red Hat Gluster Storage performance. Manoj Pillai and Ben England Performance Engineering June 25, 2015 Red Hat Gluster Storage performance Manoj Pillai and Ben England Performance Engineering June 25, 2015 RDMA Erasure Coding NFS-Ganesha New or improved features (in last year) Snapshots SSD support Erasure

More information

File. File System Implementation. Operations. Permissions and Data Layout. Storing and Accessing File Data. Opening a File

File. File System Implementation. Operations. Permissions and Data Layout. Storing and Accessing File Data. Opening a File File File System Implementation Operating Systems Hebrew University Spring 2007 Sequence of bytes, with no structure as far as the operating system is concerned. The only operations are to read and write

More information

Andreas Dilger, Intel High Performance Data Division Lustre User Group 2017

Andreas Dilger, Intel High Performance Data Division Lustre User Group 2017 Andreas Dilger, Intel High Performance Data Division Lustre User Group 2017 Statements regarding future functionality are estimates only and are subject to change without notice Performance and Feature

More information

The Google File System

The Google File System The Google File System By Ghemawat, Gobioff and Leung Outline Overview Assumption Design of GFS System Interactions Master Operations Fault Tolerance Measurements Overview GFS: Scalable distributed file

More information

IBM InfoSphere Streams v4.0 Performance Best Practices

IBM InfoSphere Streams v4.0 Performance Best Practices Henry May IBM InfoSphere Streams v4.0 Performance Best Practices Abstract Streams v4.0 introduces powerful high availability features. Leveraging these requires careful consideration of performance related

More information

Process size is independent of the main memory present in the system.

Process size is independent of the main memory present in the system. Hardware control structure Two characteristics are key to paging and segmentation: 1. All memory references are logical addresses within a process which are dynamically converted into physical at run time.

More information

CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES

CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES OBJECTIVES Detailed description of various ways of organizing memory hardware Various memory-management techniques, including paging and segmentation To provide

More information

SolidFire and Pure Storage Architectural Comparison

SolidFire and Pure Storage Architectural Comparison The All-Flash Array Built for the Next Generation Data Center SolidFire and Pure Storage Architectural Comparison June 2014 This document includes general information about Pure Storage architecture as

More information