MS2 Lustre Streaming Performance Improvements


MS2 Lustre Streaming Performance Improvements
Solution Architecture
Revision 1.2
September 19, 2015
Intel Federal, LLC Proprietary

Generated under Argonne Contract number: B

DISTRIBUTION STATEMENT: None Required

Disclosure Notice: This presentation is bound by Non-Disclosure Agreements between Intel Corporation, the Department of Energy, and DOE National Labs, and is therefore for Internal Use Only and not for distribution outside these organizations or publication outside this Subcontract.

USG Disclaimer: This report was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor any agency thereof, nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.

Export: This document contains information that is subject to export control under the Export Administration Regulations.

Intel Disclaimer: Intel makes available this document and the information contained herein in furtherance of CORAL. None of the information contained herein is, or should be construed as, advice. While Intel makes every effort to present accurate and reliable information, Intel does not guarantee the accuracy, completeness, efficacy, or timeliness of such information. Use of such information is voluntary, and reliance on it should only be undertaken after an independent review by qualified experts. Access to this document is with the understanding that Intel is not engaged in rendering advice or other professional services. Information in this document may be changed or updated without notice by Intel. This document contains copyright information, the terms of which must be observed and followed. Reference herein to any specific commercial product, process or service does not constitute or imply endorsement, recommendation, or favoring by Intel or the US Government. Intel makes no representations whatsoever about this document or the information contained herein. IN NO EVENT SHALL INTEL BE LIABLE TO ANY PARTY FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES FOR ANY USE OF THIS DOCUMENT, INCLUDING, WITHOUT LIMITATION, ANY LOST PROFITS, BUSINESS INTERRUPTION, OR OTHERWISE, EVEN IF INTEL IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

Company Name: Intel Federal, LLC
Company Address: 4100 Monument Corner Drive, Suite 540, Fairfax, VA
Copyright 2015, Intel Corporation.

Technical Lead: Al Gara
Contract Administrator: Aaron Matzkin
Program Manager: Jacob Wood, Intel Federal, LLC

Contents

MS2 Solution Architecture
Contents
Revision History
1 Milestone Overview - Solution Architecture
  1.1 MS2 Description
  1.2 Terms
  1.3 Introduction
2 Background
  2.1 16MB HDD I/O
  2.2 Allocation High Level Overview
  2.3 Dynamic Striping Impact
  2.4 Interleaved Metadata Impact
  2.5 RAIDZ Skip Sector Impact
  2.6 Block Allocation Impact
  2.7 Runtime Memory Impact
3 Solution Requirements
  3.1 Native 16MB Blocks
  3.2 Two Megabyte Disk I/O
  3.3 File Data Buffers
  3.4 File Block Size Selection
  3.5 Non-streaming I/O Performance
  3.6 Aged File System Performance
  3.7 Declustered RAIDZ Integration
4 Use Cases
  4.1 Checkpoint Dump and Drain
  4.2 Checkpoint Restarts
  4.3 Mixed Workloads
  4.4 Aged File Systems
5 Solution Proposal
  5.1 Lustre
  5.2 ZFS
  5.3 Relevant ZFS Open Source Community Work
    5.3.1 Large Block Work
    5.3.2 Scatter/Gather Data Buffer Work
6 Solution Test Plans
  6.1 Unit Testing and Automated Code Verification
  6.2 Integration Testing
  6.3 System and Black Box Testing with Other CORAL Software
7 Acceptance Criteria

Figures

Figure 2-1: Raw HDD Performance
Figure 2-2: Realtime Allocations
Figure 2-3: Ideal RAIDZ large allocation sizes (sans skip sectors)

Revision History

Revision  Description                               Date  Author
0.1       First draft                                     Don Brady
1.1       Revisions following Argonne review              Don Brady
1.2       Final revisions following Argonne review        Don Brady

Author: Don Brady
Contributors: Eric Barton, John Salinas, Isaac Huang, Andreas Dilger, John Carrier

1 Milestone Overview - Lustre Streaming Performance Improvements Solution Architecture

1.1 MS2 Description

This document covers the Solution Architecture for the Lustre and ZFS I/O related enhancements necessary to meet MS2 of the Argonne NRE Contract, as stated below.

MS2 Description: The Subcontractor shall modify the existing ZFS on Linux memory management to use scatter/gather lists to avoid dynamic allocation of large contiguous buffers and to increase the maximum block size to 16MB. The Subcontractor shall modify the ZFS allocator to cluster small and large allocation units to avoid large block fragmentation. The Subcontractor shall modify Lustre to increase the maximum block size supported to 16MB, to provide application control over the block size used, and to provide per-directory and per-file system block size defaults, including support for lfs ladvise iosize. The Subcontractor shall modify Lustre's client and server side caching policies to optimize both streaming and random I/O performance for files of varying block size. The Subcontractor shall demonstrate that the application streaming I/O efficiency has been improved by increasing the size of the Lustre/ZFS allocation unit end-to-end so that the size of I/Os delivered to drives via RAID subsystems matches the characteristics of modern drives. The Subcontractor shall demonstrate that the I/O performance of Lustre for a range of file block sizes and I/O patterns shows that files optimized for streaming I/O achieve bandwidth requirements without undue performance penalties for random I/O performance or for streaming performance on files optimized for random I/O. The Subcontractor shall use a storage system the same as, or similar to, the Argonne Helium system for all demonstrations. The Subcontractor shall provide all source changes to Lustre and ZFS to upstream source repositories.

1.2 Terms

RAIDZ - a data/parity distribution scheme with a dynamic stripe width

Dynamic Striping - dynamically adjusting the load across virtual devices to distribute bandwidth and re-balance space usage

Virtual Device (VDEV) - an allocation unit, typically a RAIDZ group of disks

Dynamic Fit Placement - an adaptive allocation policy that uses first fit up until a threshold and then switches to best fit

Metaslab - an allocation region in a virtual device

Metaslab Group - the set of metaslabs belonging to a virtual device

vmalloc - kernel interface that allocates a contiguous memory region in the kernel virtual address space

RAIDZ Layout Conventions:
N_d = number of data drives
N_p = number of parity columns
N_t = total number of drives
For the CORAL configuration, N_t = 10, N_d = 8 and N_p = 2.

1.3 Introduction

Concurrent streams of I/O coming from multiple Lustre clients are interleaved at the servers according to arrival time. Since the mapping from file offset to disk offset is controlled by the disk block allocator, allocating writes (all writes with ZFS) may result in streaming disk I/O for a relatively empty file system. However, writes to an aged, relatively full, fragmented file system, and the re-interleaving that occurs on subsequent reads, result in effectively random disk I/O. Since these streams are interleaved at file block granularity, larger file block sizes can ensure that, although disk I/O patterns are random, each disk I/O is for a larger chunk of data. The relative impact of seek overhead is, therefore, reduced and delivered I/O bandwidth improves correspondingly.

This solution architecture document covers the Lustre and ZFS streaming I/O related enhancements for CORAL. The Improvements for Streaming I/O solution will boost overall throughput with larger I/O sizes. These improvements must maintain a consistent level of performance over time, avoiding any performance cliffs due to system memory fragmentation or severe storage free space fragmentation.

For a single-threaded Lustre client, the expected performance depends on the storage configuration (OST count, file stripe width, etc.). Single-threaded Lustre clients typically achieve a significant proportion of the underlying single port network fabric bandwidth.

2 Background

ZFS does not have a single fixed file block size. Different files will use file block sizes appropriate to the I/O size. Under the proposed solution, the backend file block sizes will range from 4KB for very small files up to 16MB for larger files. The ideal file block size for this solution is 16MB, which results in 2MB disk writes for the prescribed RAIDZ2 configuration (8 data + 2 parity). Smaller files will not use such large block sizes. For example, a 1MB file would have a maximum file block size of 1MB, but the file block size could be even smaller if so specified by the client. A workload of primarily small files, however, will not take advantage of larger blocks and their higher throughput into the HDD layer. The proposed solution targets large sequential I/Os delivered as 16MB file blocks to ZFS in order to maximize HDD throughput.

2.1 16MB HDD I/O

The project goal is to improve the Lustre and ZFS delivered bandwidth from the HDD backed storage. The CORAL file system will provide an aggregate of over 1TB/sec from the back-end file system. With the target configuration, each HDD must deliver over 100 MB/sec to meet this goal. A disk I/O size of 2MB can supply a raw bandwidth of 100 MB/sec (see Figure 2-1). Bigger file blocks are needed so that streaming I/O ends up with a 2MB I/O unit per disk.

RAIDZ will break up each file block by the number of data disks in the redundancy group. With the traditional 128KB block maximum and a RAIDZ with N_d = 8, this amounts to only a 16KB chunk submitted to each disk per block. While some amount of I/O merging can happen with each 16KB chunk, this adds more overhead per I/O and does not guarantee contiguous disk writes. However, the same N_d of 8 will divide a 16MB block into 2MB columns and thereby deliver 2MB I/O to the underlying disks. Using 16MB file blocks will satisfy the goal of achieving 2MB sized I/O for streaming files.
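To make the division explicit, a short worked example using the layout conventions from section 1.2 (N_d = 8, N_p = 2):

    column size = block size / N_d
        128KB / 8 = 16KB per disk (traditional maximum block size)
        16MB  / 8 = 2MB per disk (proposed maximum block size)

    allocation size = block size x (N_d + N_p) / N_d
        16MB x 10 / 8 = 20MB on disk (data plus parity, before skip sectors; see section 2.5)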

Figure 2-1: Raw HDD Performance

Though 16MB file blocks may seem aggressive, there should not be any need to make specific large block improvements to either the burst buffer or the Lustre client. Lustre already includes support for large 16MB blocks, though this feature has not yet been used in production. And running the client on the burst buffer should make aggregating large I/Os easier to achieve. The effort will be in tuning the file block size from the client down through the server to the block devices to ensure HDDs deliver optimal performance for the workload.

The ZFS on-disk format now supports 16MB blocks (more details in section 5.3). However, the current ZFS run-time requires some additional work to effectively accommodate large blocks. Since the adoption of larger blocks impacts the disk space allocation mechanisms in ZFS, the following high-level overview provides the necessary context to discuss the affected areas.

2.2 Allocation High Level Overview

A ZFS storage pool is typically comprised of a set of RAIDZ groups known as virtual devices (VDEVs). ZFS further divides the space in each virtual device into approximately 200 regions known as metaslabs. The block allocation policy in ZFS has three basic steps: virtual device selection, metaslab selection and free space selection. When a block of data needs to be stored on disk, ZFS will first select a VDEV. Then, within that VDEV, a target metaslab will be chosen based on a weighting scheme. Finally, within the chosen metaslab, a range of free space will be selected using a dynamic fit approach (first fit initially, transitioning to best fit later).
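For readers unfamiliar with ZFS internals, the following highly simplified C sketch illustrates the three-step flow just described. The structure names, the stub fit functions, and the pure per-allocation VDEV rotation are illustrative assumptions only; the real logic lives in ZFS's metaslab code and is considerably more involved (for example, VDEV rotation actually happens after an allocation quota is consumed, as described in the next paragraph).

/*
 * Simplified sketch of the three-step allocation flow (VDEV selection,
 * metaslab selection, free range selection). Illustrative only; not the
 * actual ZFS metaslab.c implementation.
 */
#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>

struct metaslab {
    uint64_t weight;      /* derived from recent use, free space, fragmentation */
    bool     near_full;   /* dynamic fit: switch from first fit to best fit */
};

struct vdev {
    struct metaslab *ms;  /* roughly 200 metaslabs per virtual device */
    size_t ms_count;
};

/* Stubs standing in for the real free-range searches. */
static uint64_t first_fit(struct metaslab *m, uint64_t size) { (void)m; (void)size; return 0; }
static uint64_t best_fit(struct metaslab *m, uint64_t size)  { (void)m; (void)size; return 0; }

/* Step 1: rotate across VDEVs (dynamic striping, simplified to per-allocation). */
static struct vdev *select_vdev(struct vdev *vdevs, size_t nvdevs, size_t *cursor)
{
    struct vdev *vd = &vdevs[*cursor];
    *cursor = (*cursor + 1) % nvdevs;
    return vd;
}

/* Step 2: choose the metaslab with the highest weight. */
static struct metaslab *select_metaslab(struct vdev *vd)
{
    struct metaslab *best = &vd->ms[0];
    for (size_t i = 1; i < vd->ms_count; i++)
        if (vd->ms[i].weight > best->weight)
            best = &vd->ms[i];
    return best;
}

/* Step 3: dynamic fit placement within the chosen metaslab. */
static uint64_t select_free_range(struct metaslab *m, uint64_t size)
{
    return m->near_full ? best_fit(m, size) : first_fit(m, size);
}

uint64_t alloc_block(struct vdev *vdevs, size_t nvdevs, size_t *cursor, uint64_t size)
{
    struct vdev *vd = select_vdev(vdevs, nvdevs, cursor);
    struct metaslab *m = select_metaslab(vd);
    return select_free_range(m, size);
}

int main(void)
{
    struct metaslab ms[2] = { { .weight = 10 }, { .weight = 20 } };
    struct vdev vd = { .ms = ms, .ms_count = 2 };
    size_t cursor = 0;
    (void)alloc_block(&vd, 1, &cursor, 16 << 20);   /* allocate one 16MB block */
    return 0;
}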

To fully utilize all available disk bandwidth within a storage pool, ZFS will round-robin I/O writes across the set of VDEVs, moving to a new VDEV after a fixed amount has been allocated. In addition, the allocated amount can be adjusted when new VDEVs are added or when the free space needs to be rebalanced across the VDEVs. This adaptive spreading of the load across all VDEVs is also referred to as Dynamic Striping.

After filling a metaslab, ZFS will move to the next best metaslab based on a weighting scheme. The metaslab with the highest weight wins. The weighting criteria consider, among other things, recent use, available space, and fragmentation state. It is worth noting here that the incoming metaslab allocations will contain a mixture of file data (large blocks) and metadata (very small blocks). Each metadata block is also replicated onto a different virtual device for additional redundancy. These extra metadata copies are internally referred to as metadata ditto blocks.

The introduction of large blocks has some unintended consequences on the aforementioned metaslab allocation process. The more significant impacts of dynamic striping, interleaved metadata, skip sectors, and allocation policies are described in more detail in the following sections.

2.3 Dynamic Striping Impact

The target amount to allocate (i.e. write) before moving to the next virtual device defaults to a fixed quota value multiplied by the number of drives in the group. The current allocation quota value (metaslab_aliquot = 512K) was chosen over ten years ago based on the HDD characteristics at the time. However, with 4-16MB sized blocks the original threshold value is inappropriately sized. Furthermore, the space re-balancing mechanism assumes that the overall quota threshold contains multiple physical blocks that can be adjusted up/down with a simple bias calculation. With large blocks the biasing mechanism will cease to function and no re-balancing will occur.

With a RAIDZ N_t = 10, the target threshold would be 512K * 10 = 5MB, which is only 1/4 of the 20MB required to store a single 16MB block (data plus parity). The solution proposal needs to offer a remedy for effective dynamic striping with large blocks.

2.4 Interleaved Metadata Impact

It is easy to demonstrate that the metadata blocks being allocated are small; in most cases they compress to fit within a single 4K block. With RAIDZ2 and using 4K sectors, this amounts to a 12K allocation (4K data + 2x4K for parity). The astute reader will note that this is a de facto triple mirror, but with the overhead of calculating parity.

A workload that pushes ZFS to the point where all the metaslab regions have been activated and have undergone space reclamation creates an on-disk layout representative of an aged file system.

While running such a file system aging workload, it was easy to observe that over half of the overall allocation requests made were for metadata. And, significantly, that metadata was mostly 12K in size (see Figure 2-2).

Figure 2-2: Realtime Allocations

Due to copy-on-write (COW), many metadata blocks will continuously need to find a new home. The incoming file data will at the same time force a transition to other metaslabs to locate space for the larger file blocks (1280KB each in the evaluation FS aging runs). These file system aging workloads left an excess of free 12K islands (thousands) in every metaslab, even though by size the metadata is a tiny fraction of the overall allocation space. The trend appears to artificially provision the pool to always have residual metadata-only space in each metaslab. The solution proposal needs to mitigate any large block fragmentation problems.

2.5 RAIDZ Skip Sector Impact

In RAIDZ, disks are divided up into sectors (4K) and the allocation space is always in units of sectors. The allocation space can be visualized as a grid, N_t columns wide, where each column represents a specific disk and each row represents a set of different sectors (one from each disk). The smallest allocated sector amount would be N_p + 1. To avoid the creation of a free range that is less than this minimum allocation, RAIDZ will round up each allocation to the nearest multiple of (N_p + 1) and skip these unused sectors. Skipped sectors ensure there is never a free range of sectors that is too small to allocate. For the CORAL configuration, N_p + 1 is three sectors. With an even count of data disks (like N_d = 8), every RAIDZ allocation for a power-of-two block (like 16MB) will require skip sectors to make the allocation count a multiple of three.

Ideally the RAIDZ allocations can guarantee that larger blocks (i.e. 1MB - 16MB) end up with a large common multiple (e.g. 256KB), to improve coalescing after freeing.
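The round-up is easy to reproduce. The program below is a simplified model of the RAIDZ allocation-size calculation (the real logic lives in ZFS's vdev_raidz code); the constants assume the CORAL 8+2 layout with 4KB sectors, and it prints the actual allocated sizes, skip sectors included, for a few block sizes.

/*
 * Simplified model of the RAIDZ allocation-size round-up described above
 * (the real calculation lives in ZFS's vdev_raidz code). Constants assume
 * the CORAL 8+2 layout with 4KB sectors.
 */
#include <stdio.h>
#include <stdint.h>

#define SECTOR_SHIFT 12   /* 4KB sectors */
#define NDATA        8    /* N_d */
#define NPARITY      2    /* N_p */

static uint64_t raidz_asize(uint64_t psize)
{
    /* data sectors, rounded up to whole sectors */
    uint64_t sectors = (psize + (1ULL << SECTOR_SHIFT) - 1) >> SECTOR_SHIFT;
    /* one parity sector per parity column for each (partial) data row */
    uint64_t rows = (sectors + NDATA - 1) / NDATA;
    uint64_t asize = sectors + rows * NPARITY;
    /* round up to a multiple of (N_p + 1); the padding is the skip sectors */
    uint64_t rem = asize % (NPARITY + 1);
    if (rem != 0)
        asize += (NPARITY + 1) - rem;
    return asize << SECTOR_SHIFT;
}

int main(void)
{
    uint64_t sizes[] = { 512ULL << 10, 1ULL << 20, 16ULL << 20 };
    for (int i = 0; i < 3; i++)
        printf("%6llu KB block -> %llu KB allocated\n",
               (unsigned long long)(sizes[i] >> 10),
               (unsigned long long)(raidz_asize(sizes[i]) >> 10));
    return 0;
}

Under this model a 16MB block consumes 5121 sectors (20484KB) rather than an even 20MB, i.e. an allocation that is only a 4KB multiple, matching the odd-sized allocations discussed next.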

For the CORAL RAIDZ configuration (8 + 2, 4KB sectors), Figure 2-3 shows the ideal allocation size values. However, the legacy skip sector requirement inhibits this goal, resulting in odd-sized allocations (i.e. only a 4KB multiple, not 256KB or higher). Elimination of skip sectors is desirable to achieve the best possible allocation unit multiples, as another means to minimize free space fragmentation.

BLOCK SIZE   COLUMN SIZE   PARITY SIZE   ALLOC SIZE
512 KB       64 KB         128 KB        640 KB
1 MB         128 KB        256 KB        1.25 MB
2 MB         256 KB        512 KB        2.50 MB
4 MB         512 KB        1 MB          5 MB
8 MB         1 MB          2 MB          10 MB
16 MB        2 MB          4 MB          20 MB

Figure 2-3: Ideal RAIDZ large allocation sizes (sans skip sectors)

2.6 Block Allocation Impact

The current allocation policies were designed for a maximum block size of 128KB, at a time when drives were on the order of 200GB. Some of the threshold parameters used by the lowest allocation layer no longer make sense when large blocks are in the mix. For example, the Dynamic Fit policy waits until it cannot satisfy a max-block-size request (16MB) before moving to best fit. But for a 16MB block on RAIDZ-2 with N_t = 10, the resulting allocation would be 20MB. The other trigger for moving to best fit is when the metaslab free space percentage drops to 4%. With 12TB drives and N_t = 10, we end up with 512GB per metaslab, and 4% of that is 20 GB.

The allocator further segments requests into areas with a prescribed alignment. That alignment is roughly half the requested size, so a 16MB block is assigned into an 8MB alignment area. For larger blocks, a maximum alignment of 256KB (a common multiple) makes more sense. These and other adjustments are needed to ensure the allocation policies perform as intended.

We expect large block allocations to improve RAIDZ rebuild performance. Though larger blocks will require reserving some memory for the rebuild read-ahead buffers, the size of the file's metadata (parent block pointers) will be significantly smaller. And, instead of the potentially hundreds of random block reads needed when the max block size was 128KB, we will have many fewer 16MB reads.

Otherwise, the impacts of large block allocation on the ZFS caching layers will need to be measured as the project proceeds.

2.7 Runtime Memory Impact

ZFS on Linux has traditionally used the kernel virtual memory allocator, vmalloc(), to satisfy the memory for its file buffers (inflight I/O and cached I/O) because the ZFS code was originally written for Solaris and accesses the buffer contents directly. However, there is a limitation on how much memory can be mapped with vmalloc(), and it should be used sparingly to avoid performance issues (like increased TLB thrashing). The adoption of large blocks will significantly increase the demand for larger memory allocations. Correspondingly, the ZFS run-time must use an alternative to vmalloc() when allocating memory for its file data buffers. A scatter/gather buffer mechanism is the proposed alternative for allocating file buffer memory in ZFS.
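The following userspace-only C sketch illustrates the scatter/gather idea: a 16MB file block held as a list of page-sized chunks rather than as one large contiguous (vmalloc-style) mapping. All names here are illustrative assumptions; the actual kernel-side implementation (the ABD work discussed in section 5.3.2) differs in detail.

/*
 * Conceptual, userspace-only sketch of a scatter/gather buffer: a 16MB
 * "block" held as a list of page-sized chunks instead of one contiguous
 * region. Illustrative only; error handling and freeing omitted.
 */
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

#define CHUNK_SIZE 4096

struct sg_buf {
    size_t size;       /* logical buffer size in bytes */
    size_t nchunks;    /* number of CHUNK_SIZE pieces */
    void **chunks;     /* individually allocated pieces */
};

static struct sg_buf *sg_alloc(size_t size)
{
    struct sg_buf *b = calloc(1, sizeof(*b));
    b->size = size;
    b->nchunks = (size + CHUNK_SIZE - 1) / CHUNK_SIZE;
    b->chunks = calloc(b->nchunks, sizeof(void *));
    for (size_t i = 0; i < b->nchunks; i++)
        b->chunks[i] = malloc(CHUNK_SIZE);   /* no large contiguous region needed */
    return b;
}

/* Copy data into the buffer at a byte offset, spanning chunk boundaries. */
static void sg_write(struct sg_buf *b, size_t off, const void *src, size_t len)
{
    const char *p = src;
    while (len > 0) {
        size_t idx  = off / CHUNK_SIZE;
        size_t coff = off % CHUNK_SIZE;
        size_t n    = CHUNK_SIZE - coff;
        if (n > len)
            n = len;
        memcpy((char *)b->chunks[idx] + coff, p, n);
        off += n; p += n; len -= n;
    }
}

int main(void)
{
    struct sg_buf *b = sg_alloc(16 << 20);       /* one 16MB file block */
    char data[8192] = "streaming payload";
    sg_write(b, 4090, data, sizeof(data));       /* straddles a chunk boundary */
    printf("16MB block held in %zu discontiguous chunks\n", b->nchunks);
    return 0;
}

The point of the structure is that no single allocation is larger than one page, so the buffer can be built even when kernel memory is fragmented; the cost is that readers and writers must walk the chunk list instead of using plain pointer arithmetic.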

3 Solution Requirements

The requirement for 2MB HDD I/O is at the heart of the solution and introduces the requirement that the file block size increase to 16MB. This in turn adds additional requirements so that block sizes can be controlled and that disk space allocation policies adapt to larger blocks. Metadata isolation will keep the large data block allocations separated from the much smaller metadata allocations.

3.1 Native 16MB Blocks

The end-to-end storage stack must be capable of using native 16MB blocks (from the Lustre client down to the ZFS storage pool). Client I/O aggregation should be limited to larger writes and perhaps BRW traffic. It is not expected that smaller I/O will be held and aggregated to a full 16MB on the client, but some aggregation may still be done to avoid too-small RPCs. Remember that the CORAL design has Lustre clients only on the burst buffer nodes. If the application reads from or writes to the burst buffer, then I/O is best configured to make use of large block I/Os.

3.2 Two Megabyte Disk I/O

When end-to-end 16MB blocks are in play, the back-end disk I/O will be no less than 2MB per drive when configured with the prescribed 8 data drives in a RAIDZ configuration. This 2MB size requirement applies to both writing and reading. This minimum size is not achieved from I/O aggregation but from actual allocated units.

3.3 File Data Buffers

As mentioned in section 2.7 above, the ZFS run-time must use an alternative to vmalloc() when allocating memory for its file data buffers. A scatter/gather buffer mechanism is the proposed alternative for allocating file buffer memory in ZFS.

3.4 File Block Size Selection

Large blocks are great for streaming, but they may not be appropriate for every workload. Controls are required to make sure an appropriate block size is used for each workload. The resulting performance must improve on the status quo for ZFS, where the block size was chosen solely based on file size and the block size range was much smaller. 16M blocks will only be auto-selected when the I/O write sizes are that large.

The solution shall provide a separate API, ladvise(), for users and applications to set the back-end block size of a Lustre file across a range of sizes.

The ladvise() API is modeled after the Linux fadvise() API and is intended to have similar hints (RANDOM/SEQUENTIAL/WILL_NEED/DONT_NEED). The supported range is power-of-two sizes between 4K and 16M. The block size cannot be changed once the file is written (set once).

In the absence of an application-supplied block size, the solution shall provide a default size. The default can be statically or dynamically configured. The static default value will be configured per file system or per directory. The dynamic default value will be based on context (such as initial write locations and sizes) and will be used for dynamic workloads that did not specify a size.

3.5 Non-streaming I/O Performance

Performance for varying workloads and block sizes (e.g. random or smaller-sized I/O) should not degrade compared to the status quo. Non-streaming I/O patterns will have acceptable performance when the correct block size has been assigned (see section 3.4 above).

3.6 Aged File System Performance

The performance of an aged file system must continue to maintain a baseline efficiency. Use of larger blocks must not accelerate allocation space fragmentation. Smaller block allocations will be segregated from larger block allocations. This baseline efficiency must be measured on a file system that has reached as high as 80% full and has undergone a realistic history of file creation, writes and deletions.

3.7 Declustered RAIDZ Integration

The solution must function in tandem with the RAIDZ Declustering solution. Further investigations are required to determine where isolated metadata is stored. Any metadata isolation method must not impact the ability of RAIDZ Declustering to spread I/O across the set of pool drives.

4 Use Cases

4.1 Checkpoint Dump and Drain

1. Application checkpoint data is written into burst buffers.
2. Data is then drained asynchronously using large I/O writes through Lustre and ultimately into ZFS to free space in the burst buffers for the next checkpoint.
3. The rate of data draining directly determines availability for the next checkpoint.
4. The desired drain bandwidth must be maintained while other application file system load is present.
5. This cycle of application checkpoints repeats at regular intervals.

4.2 Checkpoint Restarts

1. Previous checkpoint data is staged into burst buffers using large read I/O requests from Lustre/ZFS.
2. The rate of data reads will determine the job restart delay.
3. The desired read performance for staging must be maintained while other application load is present.
4. Application job restarts using the checkpoint data.
5. In the case where the burst buffer is unavailable, restart data can be read directly from Lustre/ZFS. As in the staging case, maximum read efficiency is desired to minimize restart delays.

4.3 Mixed Workloads

1. Simultaneous jobs with various I/O workloads are serviced by the Lustre file system:
   1. Multiple streams with large data blocks (streaming I/O)
   2. Small block random I/O to multiple files (parallel I/O)
   3. Checkpoint drainage I/O streams
2. Each application workload obtains an appropriate file block size:
   1. As requested by the application, or
   2. As configured by the file system, or
   3. As determined by the initial file I/O pattern
3. The chosen block size persists for the life of the file.
4. The file system adapts across a range of active block sizes:
   1. caching policies
   2. I/O scheduler
   3. block allocation scheme

4.4 Aged File Systems

1. The file system (Lustre with back-end ZFS) undergoes a realistic history of file creation, writes and deletions.
2. The file system fills up over time and older job data and checkpoint data is periodically removed to make room for new data.
3. The file system remains online while continuing to provide a stable I/O bandwidth efficiency (the provided bandwidth does not significantly degrade over time).
4. A separate de-fragmentation process is not required.

5 Solution Proposal

5.1 Lustre

The Lustre client can aggregate I/O requests in order to generate optimally-sized I/O for the Lustre server, so that data can later be read in the same size chunks as it was written. Lustre servers also have the opportunity to aggregate I/O in order to generate optimally-sized I/O for the disk file system. Our preference is to have heuristics in Lustre detect the I/O pattern and generate the correct I/O size for the application's I/O workload. Clients should aggregate small I/O if the I/O is contiguous, or to mitigate worst-case random writes and reduce RPC counts to a manageable level. Servers should determine the block size by default. Using the layout stripe size to also reflect the block size, for example, would allow specifying the default block size for a file or directory without the need to implement a separate mechanism from the Lustre layout.

Nonetheless, there will always be I/O patterns that do not match the heuristics provided or are detected incorrectly. In this case, there will be an API provided to allow the application or library to specify behavior that it knows to be correct (e.g., random writes with a specific block size, prefetching of data from disk, or dropping data from cache that is no longer needed). This API will be provided through the Lustre ladvise interface. (Note that the kernel fadvise() interface is not suitable because it only interfaces with the memory subsystem of a single client, and does not interface at all with the client-side file system or the server.) Our expectation is that the CORAL burst buffer will be tuned for the Aurora file system implementation and, hence, will be able to use the API to set the block size to maximize I/O performance for its workflow. A minimal sketch of how such a block size selection might behave follows the list below.

1. Increase the maximum supported block size to 16MB:
   1. The Lustre OSD will determine the largest supported block size by querying ZFS at mount time and communicate this to Lustre clients
   2. Introduction of a client interface, lfs ladvise iosize, to set the file block size
   3. Introduction of a block size default per file system and per directory
   4. Allow adaptive block size selection based on initial I/O context
   5. Communicate the desired block size to the ZFS layer
2. Lustre's client and server side caching policies will be modified to optimize both streaming and random I/O performance for files of varying block size. This must ensure caching is sufficient for large block streaming I/O and that no regression is introduced for other workloads.
3. Adopt the new scatter/gather buffer interface from ZFS into the Lustre OSD module.
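The following userspace C sketch is a hypothetical illustration (not the Lustre or ZFS implementation) of the selection behavior described above: an explicit hint takes precedence, then a per-directory or per-file-system default, and otherwise a size is inferred from the initial I/O pattern; the chosen value is rounded up to a power of two and clamped to the 4K-16M range.

/*
 * Hypothetical illustration of block size selection: hint, then default,
 * then inference from the initial write size; the result is rounded up to
 * a power of two and clamped to [4KB, 16MB]. Not actual Lustre/ZFS code.
 */
#include <stdint.h>
#include <stdio.h>

#define MIN_BLOCK_SIZE (4ULL << 10)    /* 4KB  */
#define MAX_BLOCK_SIZE (16ULL << 20)   /* 16MB */

static uint64_t clamp_block_size(uint64_t requested)
{
    uint64_t size = MIN_BLOCK_SIZE;

    while (size < requested && size < MAX_BLOCK_SIZE)
        size <<= 1;                    /* round up to the next power of two */
    return size;
}

/* hint: explicit application request (0 if none); dir_default: per-directory
 * or per-file-system default (0 if none); first_write: size of the initial
 * write, used to infer a block size for dynamic workloads. */
static uint64_t select_block_size(uint64_t hint, uint64_t dir_default,
                                  uint64_t first_write)
{
    if (hint != 0)
        return clamp_block_size(hint);
    if (dir_default != 0)
        return clamp_block_size(dir_default);
    return clamp_block_size(first_write);
}

int main(void)
{
    /* inferred from a 16MB initial write, and from an explicit 1MB hint */
    printf("%llu\n", (unsigned long long)select_block_size(0, 0, 16ULL << 20));
    printf("%llu\n", (unsigned long long)select_block_size(1ULL << 20, 0, 0));
    return 0;
}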

5.2 ZFS

1. Increase the default 1MB maximum supported block size to 16MB. Address any new issues/regressions that may arise from this increase. Section 5.3 outlines some additional large block related work.
2. To minimize long term free space fragmentation, the solution will isolate metadata allocations from the much larger file data allocations. Isolation will either come from dedicated metaslabs or from a separate metadata allocation class (a sketch of the size-based routing idea follows this list).
3. Adjustments to allocation policies are needed to accommodate larger allocation mixes and address the policy issues raised in section 2. The existing metaslab fragmentation metrics will be leveraged/improved to steer the policies and to monitor the free space fragmentation over time.
4. Additional Performance Improvements (as needed). Following the increase of the file data I/O size to 16MB, it is expected that additional efficiency bottlenecks and memory allocation issues will be discovered. The exact set of improvements is not known, but the following areas will be explored by Intel:
   1. Improve physical locality for file data (sort the dirty data list at transaction group (txg) sync).
   2. Consume lfs ladvise hints from Lustre (such as Random, Sequential, Will-Need and Don't-Need).
   3. Address I/O path lock contention/lack of scaling as CPU core counts increase.
   4. Overlap inline compute and I/O during transaction group syncing (e.g. overlap RAIDZ parity generation with async data column writes).
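As a purely illustrative sketch of the isolation idea in item 2 (the 128KB cutoff, the names, and the two-class split are assumptions, not the selected design), allocations could be routed by type and size so that small metadata blocks never land in the metaslabs reserved for large file blocks:

/*
 * Illustrative only: routing allocations to a metadata class or a data
 * class by type and size. The 128KB cutoff and the two-class split are
 * assumptions for the sketch, not the proposed ZFS design.
 */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

enum alloc_class { CLASS_METADATA, CLASS_DATA };

#define SMALL_ALLOC_THRESHOLD (128ULL << 10)

static enum alloc_class classify_allocation(uint64_t psize, bool is_metadata)
{
    /* Keep 12K metadata (and other small) allocations out of the metaslabs
     * used for 1MB-16MB file blocks, so freed small "islands" do not
     * fragment the large-block free space. */
    if (is_metadata || psize <= SMALL_ALLOC_THRESHOLD)
        return CLASS_METADATA;
    return CLASS_DATA;
}

int main(void)
{
    printf("%d %d\n", classify_allocation(12 << 10, true),
           classify_allocation(16 << 20, false));   /* metadata vs. 16MB data */
    return 0;
}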

5.3 Relevant ZFS Open Source Community Work

The proposed solution architecture leverages existing ongoing work in the ZFS community. Leveraged work includes the support for large blocks, the addition of scatter/gather buffers for ZFS file data, and the metaslab fragmentation tracking mechanisms (as mentioned in section 5.2).

5.3.1 Large Block Work

The on-disk support for using large blocks is now included in the ZFS on Linux master repository. This change allows for using up to 1MB blocks on pools that (a) opt in to the pool feature, feature@large_blocks, and (b) set the dataset recordsize property to a value larger than 128K. The current implementation limits the maximum size to 1MB due to some of the issues discussed in sections 2.3, 2.4, and 2.5, even though the on-disk format can accommodate values up to 16MB. To get to 16MB blocks, the run-time setting, zfs_max_recordsize, needs to be increased beyond the current 1MB default.

To our knowledge, there has been no production use of larger than 1MB blocks with ZFS. Since the use of large blocks is entirely opt-in, it is assumed that there has been minimal testing and limited exposure to large blocks in practice. In addition to the work stated in section 5.2, additional run-time work is expected to include the following:

1. Add full stack test coverage of 16MB blocks to the ztest and ZFS test suite tools.
2. Specifically address large block induced free space fragmentation.
3. Increase the gang block header size on pools with 4KB sector drives to minimize the impact when using ganged blocks.
4. ZFS I/O scheduler tunings (optional):
   1. Cut-in-line for smaller reads
   2. Allow larger than 128KB aggregations
   3. Size-adjusted queue depths

5.3.2 Scatter/Gather Data Buffer Work

Intel had originally scoped scatter/gather buffer development for CORAL. Recently a relevant solution was proposed in the ZFS on Linux community. While this initial proposal has a narrower focus than our original plans, it will be a sufficient foundation to build upon. Our current plan is to leverage this community work (aka ARC Buffer Data, or ABD) once it is officially accepted into the ZFS on Linux Git repository (i.e. in the master branch). The scatter/gather solution needs to be validated with Lustre and 16MB blocks. Intel can augment this with additional follow-on work as needed. Some additional I/O areas could benefit from using scatter/gather buffers.

Recommended follow-on work for scatter/gather buffers:

1. Comprehensive testing:
   1. Lustre workloads with 16MB blocks using our RAIDZ configurations
   2. Long running workloads (days)
   3. Memory pressure stress testing
2. Add run-time memory usage stats (kstats) for scatter/gather buffers (similar to the ZIO buffer stats available at /proc/spl/kmem/slab)
3. Extend adoption beyond just the ARC data buffers as needed:
   1. Supply a scatter/gather buffer interface to the Lustre OSD
   2. Use for RAID-Z reconstruction
   3. Use for RAID-Z parity buffers
   4. Use for other data transforms (like LZ4 data compression)

6 Solution Test Plans

The focus of the Solution Test Plan is to describe the "what" as opposed to the "how" of the test plan, which is being created alongside the High Level Design (HLD). This plan will include descriptions of the following test workloads:

1) For file system aging, we are converting and updating an internal script into a more powerful Python based tool for file system aging. We also use iotest and its bigbench.sh script and are currently evaluating another tool.
2) For core benchmarks, we will use our SOAK test framework, which includes mdtest, IOR, simul, racer, Kcompile, blobench and pct.
3) Our realistic tests include Lustre server failover, CPU/memory hog, and, in the future, Lustre fault injection.
4) For mixed workloads, the current plan is to use various iozone options to generate different traffic types (streaming, random, strided, backwards, burst), to use filebench to simulate different workloads, and to use simul and mdtest periodically to generate large amounts of metadata and other operations.

Failure injection at the lower layers, such as injection of disk failures, path-to-JBOD failures, ZFS failures, failover, etc., is covered in the de-clustered RAIDZ SA, as that is a more natural place to consider such things. Once the prototype for de-clustered RAID stabilizes, all of these tests will be run with Lustre and big blocks.

6.1 Unit Testing and Automated Code Verification

1. Verify that 16MB I/O units are active across the stack (Lustre clients into the ZFS backend).
2. Verify that Lustre client interfaces can directly set block sizes between 128K and 16MB (per file, per directory or per file system).
3. Validate the logic and functionality by which the assigned block size adapts with a reasonable default.
4. Verify that the code mitigates fragmentation with the metadata isolation schemes.

6.2 Integration Testing

1. Test Lustre streaming performance with both large block streaming and burst workloads. Record important statistics.
2. Develop simulated, mixed workload tests that have multiple I/O streams with large blocks to mimic streaming I/O and random small blocks to multiple files to mimic parallel I/O.

3. Generate realistic workloads on aged file systems in terms of file creation, blocks written and read, and files and directories removed. Tests must verify the I/O bandwidth efficiency as the file system is in various states of aging. File system aging scripts are used to induce fragmentation pressure. Fragmentation statistics are recorded to correlate the impact on performance over time.
4. All integration tests shall be run in various declustered RAIDZ configurations.

6.3 System and Black Box Testing with Other CORAL Software

1. Use SCR to initiate a checkpoint to generate various types of checkpoint files. Use the CPPR data mover to send checkpoints to back-end storage through the function shipper. Tests must cover direct paths to Lustre and indirect paths through the burst buffers. The performance characteristics of both paths will be recorded. Initiate checkpoint restarts that demonstrate data flowing back into the application nodes.
2. Run HACC IO, IOR, WRF (NetCDF) and FlashIO (HDF5) from clients through the function shipper to send I/O through various I/O libraries, and record ZFS and block device statistics to measure performance through the CORAL I/O software stack.

7 Acceptance Criteria

Intel will validate its progress toward acceptance of this NRE program through the following criteria:

1. Test plans described in section 6, Solution Test Plans, are fully executed and all tests pass.
2. All requirements listed in section 3, Solution Requirements, are validated.
3. An agreed set of use cases are demonstrated.


More information

HP AutoRAID (Lecture 5, cs262a)

HP AutoRAID (Lecture 5, cs262a) HP AutoRAID (Lecture 5, cs262a) Ion Stoica, UC Berkeley September 13, 2016 (based on presentation from John Kubiatowicz, UC Berkeley) Array Reliability Reliability of N disks = Reliability of 1 Disk N

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff and Shun Tak Leung Google* Shivesh Kumar Sharma fl4164@wayne.edu Fall 2015 004395771 Overview Google file system is a scalable distributed file system

More information

EMC CLARiiON Backup Storage Solutions

EMC CLARiiON Backup Storage Solutions Engineering White Paper Backup-to-Disk Guide with Computer Associates BrightStor ARCserve Backup Abstract This white paper describes how to configure EMC CLARiiON CX series storage systems with Computer

More information

Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System

Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System Donald S. Miller Department of Computer Science and Engineering Arizona State University Tempe, AZ, USA Alan C.

More information

Implementing a Statically Adaptive Software RAID System

Implementing a Statically Adaptive Software RAID System Implementing a Statically Adaptive Software RAID System Matt McCormick mattmcc@cs.wisc.edu Master s Project Report Computer Sciences Department University of Wisconsin Madison Abstract Current RAID systems

More information

IBM MQ Appliance HA and DR Performance Report Model: M2001 Version 3.0 September 2018

IBM MQ Appliance HA and DR Performance Report Model: M2001 Version 3.0 September 2018 IBM MQ Appliance HA and DR Performance Report Model: M2001 Version 3.0 September 2018 Sam Massey IBM MQ Performance IBM UK Laboratories Hursley Park Winchester Hampshire 1 Notices Please take Note! Before

More information

LATEST INTEL TECHNOLOGIES POWER NEW PERFORMANCE LEVELS ON VMWARE VSAN

LATEST INTEL TECHNOLOGIES POWER NEW PERFORMANCE LEVELS ON VMWARE VSAN LATEST INTEL TECHNOLOGIES POWER NEW PERFORMANCE LEVELS ON VMWARE VSAN Russ Fellows Enabling you to make the best technology decisions November 2017 EXECUTIVE OVERVIEW* The new Intel Xeon Scalable platform

More information

JD Edwards World Electronic Burst and Bind Guide. Version A9.1

JD Edwards World Electronic Burst and Bind Guide. Version A9.1 JD Edwards World Electronic Burst and Bind Guide Version A9.1 Revised - December 15, 2007 JD Edwards World Electronic Burst and Bind Guide Copyright 2006, Oracle. All rights reserved. The Programs (which

More information

CA ERwin Data Profiler

CA ERwin Data Profiler PRODUCT BRIEF: CA ERWIN DATA PROFILER CA ERwin Data Profiler CA ERWIN DATA PROFILER HELPS ORGANIZATIONS LOWER THE COSTS AND RISK ASSOCIATED WITH DATA INTEGRATION BY PROVIDING REUSABLE, AUTOMATED, CROSS-DATA-SOURCE

More information

Operating Systems Design Exam 2 Review: Spring 2012

Operating Systems Design Exam 2 Review: Spring 2012 Operating Systems Design Exam 2 Review: Spring 2012 Paul Krzyzanowski pxk@cs.rutgers.edu 1 Question 1 Under what conditions will you reach a point of diminishing returns where adding more memory may improve

More information

Messaging Overview. Introduction. Gen-Z Messaging

Messaging Overview. Introduction. Gen-Z Messaging Page 1 of 6 Messaging Overview Introduction Gen-Z is a new data access technology that not only enhances memory and data storage solutions, but also provides a framework for both optimized and traditional

More information

ZEST Snapshot Service. A Highly Parallel Production File System by the PSC Advanced Systems Group Pittsburgh Supercomputing Center 1

ZEST Snapshot Service. A Highly Parallel Production File System by the PSC Advanced Systems Group Pittsburgh Supercomputing Center 1 ZEST Snapshot Service A Highly Parallel Production File System by the PSC Advanced Systems Group Pittsburgh Supercomputing Center 1 Design Motivation To optimize science utilization of the machine Maximize

More information

Foster B-Trees. Lucas Lersch. M. Sc. Caetano Sauer Advisor

Foster B-Trees. Lucas Lersch. M. Sc. Caetano Sauer Advisor Foster B-Trees Lucas Lersch M. Sc. Caetano Sauer Advisor 14.07.2014 Motivation Foster B-Trees Blink-Trees: multicore concurrency Write-Optimized B-Trees: flash memory large-writes wear leveling defragmentation

More information

Moneta: A High-performance Storage Array Architecture for Nextgeneration, Micro 2010

Moneta: A High-performance Storage Array Architecture for Nextgeneration, Micro 2010 Moneta: A High-performance Storage Array Architecture for Nextgeneration, Non-volatile Memories Micro 2010 NVM-based SSD NVMs are replacing spinning-disks Performance of disks has lagged NAND flash showed

More information

SoftNAS Cloud Performance Evaluation on AWS

SoftNAS Cloud Performance Evaluation on AWS SoftNAS Cloud Performance Evaluation on AWS October 25, 2016 Contents SoftNAS Cloud Overview... 3 Introduction... 3 Executive Summary... 4 Key Findings for AWS:... 5 Test Methodology... 6 Performance Summary

More information

Basic Memory Management. Basic Memory Management. Address Binding. Running a user program. Operating Systems 10/14/2018 CSC 256/456 1

Basic Memory Management. Basic Memory Management. Address Binding. Running a user program. Operating Systems 10/14/2018 CSC 256/456 1 Basic Memory Management Program must be brought into memory and placed within a process for it to be run Basic Memory Management CS 256/456 Dept. of Computer Science, University of Rochester Mono-programming

More information

Basic Memory Management

Basic Memory Management Basic Memory Management CS 256/456 Dept. of Computer Science, University of Rochester 10/15/14 CSC 2/456 1 Basic Memory Management Program must be brought into memory and placed within a process for it

More information

Chapter 8: Memory-Management Strategies

Chapter 8: Memory-Management Strategies Chapter 8: Memory-Management Strategies Chapter 8: Memory Management Strategies Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel 32 and

More information

PRESERVE DATABASE PERFORMANCE WHEN RUNNING MIXED WORKLOADS

PRESERVE DATABASE PERFORMANCE WHEN RUNNING MIXED WORKLOADS PRESERVE DATABASE PERFORMANCE WHEN RUNNING MIXED WORKLOADS Testing shows that a Pure Storage FlashArray//m storage array used for Microsoft SQL Server 2016 helps eliminate latency and preserve productivity.

More information

Recommendations for Aligning VMFS Partitions

Recommendations for Aligning VMFS Partitions VMWARE PERFORMANCE STUDY VMware ESX Server 3.0 Recommendations for Aligning VMFS Partitions Partition alignment is a known issue in physical file systems, and its remedy is well-documented. The goal of

More information

SolidFire and Ceph Architectural Comparison

SolidFire and Ceph Architectural Comparison The All-Flash Array Built for the Next Generation Data Center SolidFire and Ceph Architectural Comparison July 2014 Overview When comparing the architecture for Ceph and SolidFire, it is clear that both

More information

InfiniBand Networked Flash Storage

InfiniBand Networked Flash Storage InfiniBand Networked Flash Storage Superior Performance, Efficiency and Scalability Motti Beck Director Enterprise Market Development, Mellanox Technologies Flash Memory Summit 2016 Santa Clara, CA 1 17PB

More information

Enterprise print management in VMware Horizon

Enterprise print management in VMware Horizon Enterprise print management in VMware Horizon Introduction: Embracing and Extending VMware Horizon Tricerat Simplify Printing enhances the capabilities of VMware Horizon environments by enabling reliable

More information

IBM Tivoli Storage Manager for HP-UX Version Installation Guide IBM

IBM Tivoli Storage Manager for HP-UX Version Installation Guide IBM IBM Tivoli Storage Manager for HP-UX Version 7.1.4 Installation Guide IBM IBM Tivoli Storage Manager for HP-UX Version 7.1.4 Installation Guide IBM Note: Before you use this information and the product

More information

Chapter 8: Main Memory

Chapter 8: Main Memory Chapter 8: Main Memory Chapter 8: Memory Management Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel 32 and 64-bit Architectures Example:

More information

File. File System Implementation. File Metadata. File System Implementation. Direct Memory Access Cont. Hardware background: Direct Memory Access

File. File System Implementation. File Metadata. File System Implementation. Direct Memory Access Cont. Hardware background: Direct Memory Access File File System Implementation Operating Systems Hebrew University Spring 2009 Sequence of bytes, with no structure as far as the operating system is concerned. The only operations are to read and write

More information

CS307: Operating Systems

CS307: Operating Systems CS307: Operating Systems Chentao Wu 吴晨涛 Associate Professor Dept. of Computer Science and Engineering Shanghai Jiao Tong University SEIEE Building 3-513 wuct@cs.sjtu.edu.cn Download Lectures ftp://public.sjtu.edu.cn

More information

Operating Systems. Week 9 Recitation: Exam 2 Preview Review of Exam 2, Spring Paul Krzyzanowski. Rutgers University.

Operating Systems. Week 9 Recitation: Exam 2 Preview Review of Exam 2, Spring Paul Krzyzanowski. Rutgers University. Operating Systems Week 9 Recitation: Exam 2 Preview Review of Exam 2, Spring 2014 Paul Krzyzanowski Rutgers University Spring 2015 March 27, 2015 2015 Paul Krzyzanowski 1 Exam 2 2012 Question 2a One of

More information

Red Hat Gluster Storage performance. Manoj Pillai and Ben England Performance Engineering June 25, 2015

Red Hat Gluster Storage performance. Manoj Pillai and Ben England Performance Engineering June 25, 2015 Red Hat Gluster Storage performance Manoj Pillai and Ben England Performance Engineering June 25, 2015 RDMA Erasure Coding NFS-Ganesha New or improved features (in last year) Snapshots SSD support Erasure

More information

File. File System Implementation. Operations. Permissions and Data Layout. Storing and Accessing File Data. Opening a File

File. File System Implementation. Operations. Permissions and Data Layout. Storing and Accessing File Data. Opening a File File File System Implementation Operating Systems Hebrew University Spring 2007 Sequence of bytes, with no structure as far as the operating system is concerned. The only operations are to read and write

More information

Andreas Dilger, Intel High Performance Data Division Lustre User Group 2017

Andreas Dilger, Intel High Performance Data Division Lustre User Group 2017 Andreas Dilger, Intel High Performance Data Division Lustre User Group 2017 Statements regarding future functionality are estimates only and are subject to change without notice Performance and Feature

More information

The Google File System

The Google File System The Google File System By Ghemawat, Gobioff and Leung Outline Overview Assumption Design of GFS System Interactions Master Operations Fault Tolerance Measurements Overview GFS: Scalable distributed file

More information

IBM InfoSphere Streams v4.0 Performance Best Practices

IBM InfoSphere Streams v4.0 Performance Best Practices Henry May IBM InfoSphere Streams v4.0 Performance Best Practices Abstract Streams v4.0 introduces powerful high availability features. Leveraging these requires careful consideration of performance related

More information

Process size is independent of the main memory present in the system.

Process size is independent of the main memory present in the system. Hardware control structure Two characteristics are key to paging and segmentation: 1. All memory references are logical addresses within a process which are dynamically converted into physical at run time.

More information

CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES

CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES OBJECTIVES Detailed description of various ways of organizing memory hardware Various memory-management techniques, including paging and segmentation To provide

More information

SolidFire and Pure Storage Architectural Comparison

SolidFire and Pure Storage Architectural Comparison The All-Flash Array Built for the Next Generation Data Center SolidFire and Pure Storage Architectural Comparison June 2014 This document includes general information about Pure Storage architecture as

More information