MS 52 Distributed Persistent Memory Class Storage Model (DAOS-M) Solution Architecture Revision 1.2 September 28, 2015


Generated under Argonne Contract number: B

DISTRIBUTION STATEMENT: None Required

Disclosure Notice: This presentation is bound by Non-Disclosure Agreements between Intel Corporation, the Department of Energy, and DOE National Labs, and is therefore for Internal Use Only and not for distribution outside these organizations or publication outside this Subcontract.

USG Disclaimer: This report was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor any agency thereof, nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.

Export: This document contains information that is subject to export control under the Export Administration Regulations.

Intel Disclaimer: Intel makes available this document and the information contained herein in furtherance of CORAL. None of the information contained therein is, or should be construed, as advice. While Intel makes every effort to present accurate and reliable information, Intel does not guarantee the accuracy, completeness, efficacy, or timeliness of such information. Use of such information is voluntary, and reliance on it should only be undertaken after an independent review by qualified experts. Access to this document is with the understanding that Intel is not engaged in rendering advice or other professional services. Information in this document may be changed or updated without notice by Intel. This document contains copyright information, the terms of which must be observed and followed. Reference herein to any specific commercial product, process or service does not constitute or imply endorsement, recommendation, or favoring by Intel or the US Government. Intel makes no representations whatsoever about this document or the information contained herein. IN NO EVENT SHALL INTEL BE LIABLE TO ANY PARTY FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES FOR ANY USE OF THIS DOCUMENT, INCLUDING, WITHOUT LIMITATION, ANY LOST PROFITS, BUSINESS INTERRUPTION, OR OTHERWISE, EVEN IF INTEL IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

Company Name: Intel Federal, LLC
Company Address: 4100 Monument Corner Drive, Suite 540, Fairfax, VA

Copyright 2015, Intel Corporation.

Technical Lead: (Name, , Phone) Al Gara
Contract Administrator: (Name, , Phone) Aaron Matzkin
Program Manager: (Name, , Phone) Jacob Wood

Contents

1 Milestone Overview
  1.1 Milestone Description
  1.2 Milestone Acceptance Criteria
2 Terminology
3 Introduction
4 Solution Requirement
  4.1 Containers
  4.2 Epochs
  4.3 Sharding and Resilience
    4.3.1 Global uniform object address space
    4.3.2 Storage characteristics awareness
    4.3.3 Object distribution schemas
    4.3.4 Container shard exclusion
    4.3.5 Container shard addition
    4.3.6 Data integrity
  4.4 Distributed Persistent Memory (DPM)
  4.5 DAOS POSIX Namespaces
    4.5.1 Private POSIX Namespaces
    4.5.2 System Namespace
5 Use Cases
  5.1 Simulation
    5.1.1 Normal Operation
    5.1.2 Transient Node Failure
    5.1.3 Permanent Node Failure
  5.2 Degraded Mode & Asynchronous Recovery
  5.3 Simulation/Analysis in Different Containers
    5.3.1 Normal Operation
    5.3.2 Failure
  5.4 Simulation/Analysis in Single Container
  5.5 POSIX Emulation
  5.6 Checkpoint/Restart with Segmented Arrays
6 Solution Proposal
  6.1 The DAOS-M Layer
    6.1.1 Containers
    6.1.2 Epochs
    6.1.3 Process Model & Transport
    6.1.4 Versioning
  6.2 Object Store over Persistent Memory
    6.2.1 NVM Libraries (NVML)
    6.2.2 Versioning Object Store
  6.3 DAOS Sharding and Resilience
    6.3.1 Object ID allocator
    6.3.2 Cluster map
    6.3.3 Placement map
    6.3.4 Algorithmic object placement
    6.3.5 Object Lookup Table
    6.3.6 Per-object placement schema
      Single object
      Fixed striped object
      Dynamically striped object
      Dynamically chunked object
      Segmented array object
    6.3.7 Fault Handling & Degraded Mode
    6.3.8 Recovery and Rebuild
    6.3.9 Rebalancing
  6.4 Distributed Persistent Memory (DPM)
  6.5 POSIX Emulation
    6.5.1 Superblock, File & Directory Representation
    6.5.2 Transaction Model & Multi-version Concurrency Control
    6.5.3 POSIX Compliance
    6.5.4 Private Namespace
    6.5.5 System Namespace
7 Q & A
8 References

Figures

Figure 1: DAOS POSIX Namespaces
Figure 2: New DAOS Stack
Figure 3: A DAOS-M container (three consensus shards, with the first one magnified)
Figure 4: Example Extent.Version index tree for the Versioning Object Store
Figure 5: Cluster map and Placement maps
Figure 6: Algorithmic Object Placement
Figure 7: 2-Way replicated lookup table

Revision History

Revision | Description | Author
1.0 | Initial version | Johann Lombardi, Vishwanath Venkatesan, Liang Zhen, Li Wei
1.1 | Integrated feedback from ANL reviewer(s). Answers to questions. | Johann Lombardi, Vishwanath Venkatesan, Liang Zhen, Li Wei
1.2 | Incorporated ANL comments into the document. | John Carrier, Johann Lombardi

Author: Johann Lombardi, Vishwanath Venkatesan, Li Wei, Liang Zhen
Contributors: Eric Barton

1 Milestone Overview

1.1 Milestone Description

This document covers the work necessary to meet the scope statement deliverable of MS52 of the Argonne NRE Contract, as stated below.

The Subcontractor shall design a distributed persistent memory class storage model, which leverages the Intel Crystal Ridge non-volatile memory technology that is configured in a DDR4-compatible DIMM form factor with processor load/store access semantics on CORAL point design compute nodes. This software design will allow applications running on any CORAL point design compute node to have a global view of and global access to Crystal Ridge that is on other compute nodes.

MS52 Scope and Solution Architecture for the Distributed Persistent Memory Class Storage Model

The advent of a large capacity of affordable NVRAM-based DIMMs in the Intel Crystal Ridge technology, with direct processor load/store access on every compute node of a pre-Exascale class system, provides the Subcontractor a unique opportunity to redefine the storage paradigm and break through the I/O wall for the Exascale era. This distributed memory class storage shall be remotely accessible at the full cross-sectional bandwidth of the fabric. These two breakthrough technological advances, when enabled with an equally revolutionary new storage software stack, shall provide over two orders of magnitude faster delivered bandwidth than the CORAL Burst Buffer and over three orders of magnitude more than the CORAL CFS, at latencies measured in processor cycles rather than tens of milliseconds. This inflection in the pre-Exascale system hardware technology elements of memory class storage tightly coupled with the compute interconnect will bring a revolution in capability that can be unlocked for application usage only with a corresponding revolution in programming, application, system services and usage methodologies spanning Persistent Memory, Partitioned Global Address Spaces and fault tolerance.

Software that can exploit fully distributed Crystal Ridge shall be greatly simplified if the systems supporting these resources support a uniform global namespace that provides consistency, availability and resiliency guarantees while preserving direct load/store local access and low-latency, high-bandwidth get/put remote access over the compute interconnect for data-intensive applications. As datasets become ever larger and storage systems become more widely distributed, these guarantees are required not only on the system metadata, but also on the persistent application data and metadata stored in it, so that the availability and integrity of entire O(10s) PB scientific datasets can be assured in the face of application and hardware failures without requiring the data to be migrated to a parallel file system that is three orders of magnitude slower.

The Subcontractor shall design a new storage paradigm for a distributed persistent memory class programming model that leverages on-compute-node non-volatile memory (Intel Crystal Ridge technology). This new architecture shall leverage technology developed under the Fast Forward Storage and I/O project. The Subcontractor shall create a new superset of DAOS, DAOS-M, which will provide a distributed persistent memory class storage programming model, featuring OS bypass end-to-end and scaling to 100,000s of CNs to deliver the full performance advantages of ubiquitous Crystal Ridge, and which is closely integrated into the communication fabric. The Subcontractor shall provide a design based on the following architectural descriptions.

DAOS Containers

A foundational component of the Fast Forward I/O stack is the DAOS API, which replaces the POSIX file with the DAOS container, a Partitioned Global Object Address Space within which applications and middleware can implement their own data and metadata schemas. An extension to DAOS, called DAOS-M, shall enable new application, system services and usage models. The DAOS-M server will access memory class storage using a Persistent Memory programming model that directly uses load/store access to NVRAM DIMMs (as opposed to Linux I/O APIs) to enable byte-granular version control and data integrity checking. DAOS-M shall extend the current DAOS API to support key-value objects natively. It shall also use a new distributed client/server process and communications model that leverages the Storm Lake (STL) Scalable Fabric Interface (SFI), the same subsystem underlying Intel MPI, to scale across all the CORAL compute nodes. This will support both co-located and disjoint clients and servers and feature end-to-end OS bypass to take all overhead, including authorization and authentication, off the critical I/O latency path.

DAOS Sharding and Resilience

Distribution schemas provide two key classes of benefit. They allow storage system performance to scale with system size and can be used to guarantee data availability, durability and integrity in the presence of failure. The DAOS Sharding and Resilience subsystem (DAOS-SR) shall leverage the underlying DAOS-M consistency and integrity model to provide a range of different distribution schemas for both array and key-value objects that trade off contention, granularity, locality, resilience and space efficiency for different access patterns. These schemas will include n-way replication, erasure coding and checksumming; associated maintenance and repair tools will be provided.

Distributed Persistent Memory (DPM)

The DAOS API shall be extended to include schemas that exploit affinity between DAOS container shards and MPI ranks to map DAOS objects and key values into process virtual address space. This shall enable all ranks of a parallel application to instantiate one or more Persistent Memory (PM) regions in their address space, backed by corresponding DAOS objects, and to use the Intel PM programming library for heap management, Intel-optimized PM algorithms, etc. within them. Local DAOS-M objects and local fragments of DAOS-SR objects shall be mapped directly to enable zero-copy load access. Direct store access to local NVRAM shall also be supported if the corresponding mapped PM region has been marked volatile and the application can tolerate inconsistency and recover for itself on failure. Otherwise, PM regions shall be mapped copy-on-write so that updates only become persistent on DAOS commit and global DPM consistency can be assured.

Private POSIX Namespaces

In order to facilitate the smooth migration of applications, tools, system services and system administration of CORAL to the new storage paradigm, private POSIX namespaces and a shared, system-wide scalable POSIX namespace shall be provided. A library implementing an agreed POSIX subset, built over DAOS-SR, shall be developed to allow a POSIX namespace to be encapsulated in a DAOS container. Applications shall be able to use these POSIX namespaces both to organize their other DAOS containers in a conventional namespace and to run high-performance, scalable shared-file and file-per-process I/O using conventional I/O stacks. The library shall assume use of the POSIX namespace by a single parallel application, which will access the namespace by mounting the DAOS container encapsulating it. The library shall support explicit specification of sequential dependencies for lockless operation. The implementation shall base directories on DAOS-SR key-value objects and files on simple DAOS-SR array objects. The full range of HA schemas shall be available so that, for example, directories are replicated 3 ways and files are erasure coded to tolerate the loss of 2 container shards. Directories and files shall also be fully distributed so that throughput scales with the lower of the number of application and storage nodes. Symlinks from the encapsulated namespace to other DAOS containers shall also be supported so that the POSIX namespace can be used to access and index DAOS containers.

Shared POSIX Namespace

A parallel system daemon shall be provided to enable multiple users to share access to a single encapsulated POSIX namespace. This daemon shall use Mercury function shipping to export the namespace to its clients. Mercury shall be extended to include authentication plug-ins to validate client credentials. The system daemon shall track space utilization by user and group within the shared POSIX namespace container and enforce access permissions on all file types, including links to external DAOS containers. A FUSE client shall be provided so that the shared POSIX namespace can be mounted on front-end login nodes and I/O nodes and is accessible from standard POSIX shells and utilities, including find, rm, mv, etc. Applications running on compute nodes shall also have access to this shared global namespace by function shipping to the I/O nodes.

1.2 Milestone Acceptance Criteria

Deliverables

The Subcontractor shall deliver a Scope Statement that documents the goals to be satisfied and specifies in-scope and out-of-scope elements of work, assumptions and constraints, and the key deliverables and development milestones. Following completion of the Scope Statement, the Subcontractor shall deliver a Solution Architecture that documents the detailed solution requirements and outlines the solution broken down by subsystem.

This milestone will be considered complete when: (1) the Subcontractor delivers the Scope Statement and Solution Architecture to Argonne; (2) the Subcontractor presents the reports at the quarterly meeting; and (3) the reports are considered complete to the reasonable satisfaction of Argonne.

2 Terminology

DAOS - Distributed Application Object Storage
HCE - Highest Committed Epoch
KV - Key Value
MTBF - Mean Time Between Failures
NVML - Non-Volatile Memory Libraries
OFI - Open Fabrics Interface
OST - Object Storage Target
PM - Persistent Memory
RAS - Reliability, Availability & Serviceability
RDMA/RMA - Remote (Direct) Memory Access
DAOS-M - DAOS Persistent Memory storage layer
DAOS-SR - DAOS Sharding and Resilience layer

3 Introduction

The emergence of affordable large-capacity non-volatile memory offers a unique opportunity to redefine the storage paradigm for the Exascale era and beyond. With persistent memory on every compute node, applications will have direct access to byte-addressable storage at the full cross-sectional bandwidth of the fabric, with an incredibly low latency compared to traditional storage systems. This revolution requires a new storage stack capable of unleashing the full potential of this new technology. The new stack must be able to support massively distributed storage, for which failure will be the norm, while preserving low-latency and high-bandwidth access over the compute interconnect.

The purpose of this project is to design such a storage stack by aggregating the persistent memory distributed over all the compute nodes into globally accessible object address spaces providing consistency, availability and resiliency guarantees without compromising performance. The proposed architecture leverages the Distributed Application Object Storage (DAOS) API developed under the DOE Fast Forward Storage & I/O project. A DAOS container provides a transactional object store distributed across compute nodes. It supports multi-version concurrency at byte granularity to eliminate unintended serialization through false-sharing conflicts. Objects in a container are distributed and replicated across the cluster to achieve horizontal scalability and resilience while guaranteeing optimal recovery time.

4 Solution Requirement

The detailed requirements below are targeted at an HPC cluster with the following characteristics:

- hundreds of thousands of compute nodes;
- every compute node (or at least a vast majority of them) has direct access to local byte-addressable non-volatile memory with a capacity several times larger than the amount of volatile memory;
- compute nodes share some resources (die, motherboard, power supply, rack, interconnect switch, cooling system, etc.) which are used to identify fault domains;
- compute nodes can communicate through a scalable, high-speed, low-latency interconnect capable of direct RMA to/from persistent memory.

4.1 Containers

DAOS is a byte-granular, multi-version concurrent transactional object store. Its container abstraction provides applications with distributed and resilient objects accessible through a global flat address space. Within each DAOS container is an object address space distributed over multiple persistent memory nodes, where each object is either a simple byte array or a key-value store. Byte-array objects support read, write and punch operations (punch will eventually free object extents and reclaim the underlying storage space). Similarly, key-value objects support lookup, update and delete operations.

A DAOS container uses multi-version concurrency control at byte granularity to allow multiple versions of the container to be accessed concurrently and to eliminate unintended serialization. DAOS concurrency control is based on epochs, which are arranged in a total order such that epochs less than or equal to the highest committed epoch (HCE) correspond to immutable container versions. Readers that access immutable container epochs don't conflict with writers updating future uncommitted epochs. In addition, reading uncommitted epochs is also allowed to facilitate shared access from collaborating processes, where synchronization has to be handled by the application. This also allows operations to be idempotent, so individual operations can potentially be repeated until successful or abandoned. Complete details on the requirements for creating and maintaining epochs are given in Section 4.2 (Epochs).

A container shard is a portion of persistent memory attached to a node over which DAOS objects are distributed. Redundant distribution of objects across container shards on different nodes achieves both horizontal scalability and fault tolerance. A lightweight stack that accesses storage directly from userspace using a persistent memory programming methodology removes any need for block/page alignment constraints and thereby eliminates the need for read-modify-write style operations. All DAOS operations are asynchronous, with an event-based mechanism to signal completion. This allows multiple concurrent I/O operations, even from a single process or thread, to be grouped together for vertical scalability and also mitigates jitter by decoupling application execution from storage latency. Remote access to container shards on different persistent memory nodes is achieved via a lightweight RPC or function shipping transport that abstracts the native fabric interfaces.

Containers may comprise hundreds of thousands of shards, and applications can create containers that range from a small subset of compute nodes to the whole cluster. Each DAOS container is identified by its own UUID, and each container shard within a given container is further identified by its unique shard index. The container metadata also includes information such as UID/GID, permissions, the complete list of shards and metadata describing available versions and commit state. This metadata is replicated across a subset of the container shards themselves to provide fault tolerance, using a consensus algorithm to guarantee consistency. Containers may be read and written in the presence of inaccessible or failing container shards as long as a quorum of the shards replicating the container metadata remains accessible. Further detail on fault tolerance is given in Section 4.3 (Sharding and Resilience).

Access to a container is controlled in a similar way to that of a POSIX file. To read or write a container, an application must open it first. If the application's user and group IDs and open mode (i.e., read-only or read-write) are compatible with the container's owner and group IDs and permissions, a container handle is returned. This handle includes capabilities that authorize any process in the application to access the container and its contents. The opening process may then share this handle with any or all of its peers. These capabilities are revoked either on explicit container close or on request from the system resource manager. A set of processes sharing the same container handle is called a process group. One process may belong to multiple process groups corresponding to one or more containers. A container can be opened by multiple process groups at the same time, regardless of their open mode; multiple concurrent read-write handles shall be supported.

4.2 Epochs

The word epoch is used to denote both a set of distributed updates that should be applied atomically to a given container and the container state effectively generated by applying them. Process groups shall perform consistent distributed updates on a single container by specifying the same epoch for all write operations (e.g. byte-array punches, key-value updates, etc.) in the set. If all writes are applied, the epoch is said to be committed. Otherwise, all writes are discarded and the epoch is said to be aborted. An aborted epoch is therefore identical to an empty committed epoch. That said, it is possible to query the status of a given epoch to find out whether it was successfully committed or aborted.

A newly created container is in an initial empty state, which is considered to be generated by an initial empty epoch. Though this initial epoch does not contain valid data to read, it is considered to be the first committed epoch. Subsequent epochs are committed in ascending contiguous order, and attempts to write to committed epochs are discarded and return an error. When an epoch commits, it effectively creates a new immutable version of the state of the entire container by applying all its writes to the version created by its predecessor.

Process groups shall perform consistent distributed reads on a single container by specifying the same committed epoch for all read operations (e.g. byte-array reads, key-value lookups, etc.) in the set. Uncommitted epochs may also be specified in read operations; however, consistency cannot be guaranteed in this case unless process group members explicitly order the sequence of reads and writes. Other than the lack of a consistency guarantee, there is no difference between reading an uncommitted and a committed epoch: the storage server handles reads from committed and uncommitted epochs in exactly the same manner.

Process group members shall also be able to enumerate all container shards within a container, all objects distributed on a container shard, and, within any range of epochs, all written extents and holes (i.e. unwritten extents) of array objects and all updated keys of key-value objects.

DAOS may from time to time aggregate a range of committed epochs in a container into the highest epoch within the range in order to reclaim the space used by overlapping writes and to increase metadata efficiency. After such an aggregation, the state corresponding to the highest epoch within the range stays effectively the same, while the states corresponding to the other epochs in the range become unreadable. Each process group shall thus be able to reference a range of committed epochs through its container handle, so that the epochs remain readable until the references are released. It shall also be able to turn a committed epoch into a snapshot, which essentially places a persistent reference on the epoch so that it remains readable even after all container handles have released their references on it. A snapshot is not associated with any container open handle and will be accessible to subsequent executions within either the same or a different job allocation. Since epochs are totally ordered and all committed epochs are immutable, snapshots are immutable too. Process groups shall also be able to query all snapshots of a container.

Any process within a process group shall be able to submit write operations in any uncommitted epoch. It is the process group's responsibility to serialize any conflicting write operations in the epoch and to make sure the write operations in an epoch processed by the servers represent a consistent state change on the container. Although epochs are logically applied in a total order, writes in an epoch shall be stored without having to wait for conflicting writes from prior uncommitted epochs.

Any process within a process group shall be able to request that the epoch be committed or aborted. Committing an epoch signals that all its write operations have been stored successfully. The process requesting commit shall be able to specify a set of other uncommitted epochs this epoch depends on. Typically, the set includes other epochs this epoch reads from. The epoch is committed only if all other epochs in the set are committed. If any epoch in the set is aborted, this epoch is aborted too. Aborting an epoch discards all write operations in it and aborts all its dependents. When a container handle is closed, either explicitly by the process group or by DAOS if the process group is terminated, all its uncommitted epochs are aborted. The commit operation waits for all other epochs in the set to become committed or aborted. Therefore, the committed (or aborted) state of an epoch is final. Commits may very well be requested through a non-blocking API, though. When multiple process groups have opened the same container, each of them shall be able to commit or abort independent epochs with minimum unintended dependencies.
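To make the byte-granular, epoch-based MVCC model above concrete, the following minimal, single-node sketch models an object as a log of extent writes tagged with epochs: a reader pinned to the HCE sees an immutable version while a writer populates HCE+1, and commit simply advances the HCE. All names and structures are invented for illustration; this is not the DAOS API.

/* mvcc_demo.c - illustrative only: a tiny, single-node, in-memory model of
 * the multi-version concurrency control described in Section 4.2. */
#include <stdio.h>
#include <string.h>

#define OBJ_SIZE   16
#define MAX_WRITES 32

struct write_rec {            /* one logged extent write */
    unsigned epoch;
    size_t   off, len;
    char     data[OBJ_SIZE];
};

static struct write_rec wlog[MAX_WRITES];
static int      nrec;
static unsigned hce;          /* highest committed epoch */

static void obj_write(unsigned epoch, size_t off, const char *buf, size_t len)
{
    struct write_rec *r = &wlog[nrec++];
    r->epoch = epoch; r->off = off; r->len = len;
    memcpy(r->data, buf, len);
}

/* Read the container version generated by 'epoch': for every byte, take the
 * most recent write whose epoch is <= the requested epoch. */
static void obj_read(unsigned epoch, char *buf)
{
    memset(buf, '.', OBJ_SIZE);
    for (int i = 0; i < nrec; i++) {
        if (wlog[i].epoch > epoch)
            continue;
        memcpy(buf + wlog[i].off, wlog[i].data, wlog[i].len);
    }
    buf[OBJ_SIZE - 1] = '\0';
}

int main(void)
{
    char buf[OBJ_SIZE];

    obj_write(1, 0, "AAAAAAAA", 8);
    hce = 1;                          /* epoch 1 committed: immutable from now on */

    obj_write(2, 2, "BBBB", 4);       /* writer targets uncommitted epoch HCE+1 */

    obj_read(hce, buf);               /* reader pinned to HCE is unaffected */
    printf("read @HCE=%u   : %s\n", hce, buf);

    obj_read(2, buf);                 /* reading the uncommitted epoch is allowed */
    printf("read @epoch 2 : %s\n", buf);

    hce = 2;                          /* commit: epoch 2 becomes the new HCE */
    obj_read(hce, buf);
    printf("read @HCE=%u   : %s\n", hce, buf);
    return 0;
}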

4.3 Sharding and Resilience

The DAOS container shall provide a global object address space and will be responsible for the distribution of objects in this address space over container shards to provide both horizontal scalability and fault tolerance.

4.3.1 Global uniform object address space

As described in the Containers section, DAOS objects shall be addressed by an ID in a global uniform address space which is per container. This address space shall be sufficiently sparse to enable efficient and scalable object ID allocation and to ensure that all levels of the I/O stack may reserve a sufficient subset of the address space for level-specific or private metadata.

Multiple distribution schemas shall be supported for objects in this address space so that, for example, a large object can be sharded over many pseudo-randomly selected container shards, whereas small objects can simply be stored locally on a compute node and replicated to specified neighbors for I/O efficiency. Efficient, scalable and resilient mechanisms shall be provided to determine the distribution schema for any given object. These may utilize multiple levels of distribution metadata, bootstrapped from the simplest, where distribution is determined from the object ID and metadata stored with the global container metadata, to multi-level schemas that store distribution metadata in the global object address space. The simplest schema, used for non-redundant, non-replicated objects, maps the object ID directly to a container shard. A multi-level schema could employ a fully distributed, 3-way replicated key-value table storing placement metadata looked up by object ID. The system should support both predefined schemas and custom schemas, for which the user will have to specify the striping information (number of stripes, or dynamically striped), the HA degree (a hint to DAOS for choosing the fault domain) and the data protection method (erasure coding or replication).

4.3.2 Storage characteristics awareness

To make the best decisions for object placement and achieve both performance and resilience, the storage system needs to understand the storage cluster characteristics. This information must be provided by an external source (e.g. a configuration database) and will be used as follows:

- A redundancy group is a set of container shards that are organized to store object shards protected by a redundancy schema, for example replicas of an object shard, or data chunks that share the same parity chunk(s). To ensure data availability for applications that must survive the co-failure of nodes, object shards in the same redundancy group shall be placed in different fault domains so that objects remain accessible in degraded mode when multiple nodes are affected by the same underlying failure. Fault domains are hierarchical and cannot overlap.
- To provide scalable performance for parallel I/O, the placement algorithm will, for now, guarantee that object shards of the same object are in different redundancy groups and do not share underlying targets, in order to avoid storage or network hotspots and to balance load. In the future, the notion of a performance domain (i.e. targets with relatively short network distance) might be supported in addition to fault domains.

The placement algorithm must be designed in a way that it can be easily extended over time and allows new allocation strategies to be added to an existing container.

4.3.3 Object distribution schemas

Failure in clusters with many thousands of nodes is unavoidable and causes the PM on the failed node to become inaccessible. DAOS shall therefore provide the following classes of shared-nothing redundancy schemas to protect against data loss in this event:

- N-way replication. This is the simplest redundancy schema and may be used for both array and KV objects. It simplifies concurrent update at the finest granularity and improves read bandwidth at the expense of space efficiency and write bandwidth.
- Erasure coding. This retains space efficiency but may only be used for array objects and complicates the way redundancy is computed on objects shared for write (see the parity sketch after this list).

Both classes of redundancy schema may support the loss of multiple object shards within the same redundancy group. N-way replication supports synchronous update of all replicas to allow consistent reads of uncommitted data (assuming the application orders readers and writers) on any replica. Both classes of schema shall support semi-synchronous and asynchronous update of redundancy information:

- Semi-synchronous update generates redundancy on commit and therefore assures durability and availability of all committed data. It should complete at the same time as the commit.
- Asynchronous update generates redundancy after commit, similarly to recovery, but leaves data vulnerable to failure until the redundancy has been persisted.
- This project will not support synchronous update of redundancy information, because object data can change over time even within the same epoch, and different extents can be updated by different writing processes, so DAOS might have to compute parities again and again, which is complex and inefficient.
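As a concrete illustration of the trade-off above, the sketch below builds one 2+1 redundancy group: two data chunks and an XOR parity chunk generated once at commit time, as in the semi-synchronous model, and shows how a lost data chunk is reconstructed from the surviving chunk and the parity. This is plain illustrative C, not DAOS code; a production schema would use a real erasure code such as Reed-Solomon rather than simple XOR parity.

/* parity_demo.c - illustrative only. A 2+1 redundancy group: two data chunks
 * plus one XOR parity chunk, computed once at commit time (semi-synchronous
 * model) and used to rebuild a lost chunk in degraded mode. */
#include <stdio.h>
#include <string.h>

#define CHUNK 8

static void xor_into(unsigned char *dst, const unsigned char *src)
{
    for (int i = 0; i < CHUNK; i++)
        dst[i] ^= src[i];
}

int main(void)
{
    unsigned char d0[CHUNK] = "chunk-0";        /* data chunk on shard 0 */
    unsigned char d1[CHUNK] = "chunk-1";        /* data chunk on shard 1 */
    unsigned char parity[CHUNK] = {0};          /* parity chunk on shard 2 */

    /* "Commit": generate redundancy once all writes of the epoch are flushed. */
    xor_into(parity, d0);
    xor_into(parity, d1);

    /* Shard 1 fails: rebuild its chunk from the surviving chunk and parity. */
    unsigned char rebuilt[CHUNK];
    memcpy(rebuilt, parity, CHUNK);
    xor_into(rebuilt, d0);

    printf("rebuilt chunk 1: %s (match=%s)\n", rebuilt,
           memcmp(rebuilt, d1, CHUNK) == 0 ? "yes" : "no");
    return 0;
}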

The following classes of distribution schemas shall be provided to support different ways of scaling performance over storage nodes. In all cases, KV objects may only use replication for resilience:

- Single. Objects with this class of schema are replicated for redundancy but otherwise not distributed. Placement shall either be determined by locality or be distributed pseudo-randomly over all container shards.
- Statically striped/hashed. Objects with this class of schema are distributed over 2 or more redundancy groups. Placement of objects with few redundancy groups may be determined by locality. Large objects may be distributed over all container shards.
- Dynamically striped/hashed. Objects with this class of schema progressively add more data stripes in different redundancy groups as the object grows, to improve concurrent access. Data redistribution on object growth must be minimized. Stripe extension must be transparent to the application.
- Dynamically chunked. This class of schema applies only to array objects. It chooses a specified number of container shards over which to distribute the object data; each contiguous chunk of the object can be placed on any of those container shards, and the location of each chunk must therefore be stored as object metadata.
- Segmented array. This class of schema provides a 2-dimensional array in which the column index iterates over nodes while the row index iterates over bytes within an array object local to each node. Redundancy shall be provided either by replicating individual rows or by erasure coding over neighboring rows. In both cases, all object shards in the same redundancy group must land in different fault domains.

4.3.4 Container shard exclusion

When one or more nodes hosting shards of a container fail, I/O can either return degraded success if the failure is still tolerable, or return an error if it is intolerable for the current data protection method. Either way, DAOS should notify the higher stack layers about the shard failures, so that they can disable the failed shards and exclude them from the container.

When a DAOS container shard is disabled, future writes that previously targeted the disabled container shard shall target alternative container shards to retain full redundancy. The metadata updates required to achieve this must scale at worst O(#container shards), not O(#objects). Reselection must preserve the property that all redundancy groups are distributed over different fault domains. Distribution schemas shall support declustering to ensure that this selection spreads redistribution over multiple container shards, as determined by the container's declustering factor. Full recovery of redundancy of committed epochs may be initiated after the failing container shard has been disabled and shall proceed concurrently with application I/O. Recovery shall scalably enumerate all committed contents of resilient objects spanning the disabled container shard, recovering them to the alternative container shards selected by their updated distribution metadata. Recovery shall be idempotent and provide progress guarantees to ensure that it may be abandoned and restarted arbitrarily and still complete in finite time.
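The reselection properties required here, and the rebalancing properties required for shard addition in the next subsection (only objects touching a failed or newly added shard move, and migration stays roughly proportional to the change in capacity), are characteristic of pseudo-random, weight-based placement. The sketch below is an illustration only, not the DAOS placement algorithm: it uses rendezvous (highest-random-weight) hashing to map object IDs to container shards and shows that excluding one shard remaps only the objects that were placed on it.

/* placement_demo.c - illustrative only, not the DAOS placement algorithm.
 * Rendezvous (highest-random-weight) hashing of object IDs onto container
 * shards: excluding a shard only remaps the objects that lived on it. */
#include <stdio.h>
#include <stdint.h>

#define NSHARDS  8
#define NOBJECTS 1000

/* 64-bit mix (splitmix64 finalizer) used as a cheap pseudo-random weight. */
static uint64_t mix64(uint64_t x)
{
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

/* Place an object on the enabled shard with the highest weight(oid, shard). */
static int place(uint64_t oid, const int *enabled)
{
    int best = -1;
    uint64_t best_w = 0;
    for (int s = 0; s < NSHARDS; s++) {
        if (!enabled[s])
            continue;
        uint64_t w = mix64(oid * 0x9e3779b97f4a7c15ULL + s);
        if (best < 0 || w > best_w) {
            best = s;
            best_w = w;
        }
    }
    return best;
}

int main(void)
{
    int all[NSHARDS], degraded[NSHARDS];
    for (int s = 0; s < NSHARDS; s++)
        all[s] = degraded[s] = 1;
    degraded[3] = 0;                      /* shard 3 is excluded */

    int moved = 0, was_on_3 = 0;
    for (uint64_t oid = 0; oid < NOBJECTS; oid++) {
        int before = place(oid, all);
        int after  = place(oid, degraded);
        if (before == 3)
            was_on_3++;
        if (before != after)
            moved++;
    }
    /* Only objects previously on shard 3 are remapped (~1/8 of them). */
    printf("objects on excluded shard: %d, objects remapped: %d\n",
           was_on_3, moved);
    return 0;
}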

4.3.5 Container shard addition

Object redistribution performed by DAOS in response to container shard addition shall have the following properties:

- All object shards that would otherwise be placed on a container shard that was previously excluded shall migrate to the new shard if it is flagged as its replacement.
- Data migration on shard addition shall be minimized.
- Doubling the total number of container shards shall migrate approximately half of all object shards from existing container shards to the new container shards and result in approximately balanced space utilization.

4.3.6 Data integrity

DAOS shall be able to compute and store a checksum on write, and to verify data integrity by comparing the stored and newly computed checksums on read. If the I/O service detects an integrity violation in object data on read, it should notify the client to run in degraded mode for this object and rebuild its replica or data chunk.

4.4 Distributed Persistent Memory (DPM)

DAOS shall enable local rows of segmented array objects to be mapped directly into application process memory. Rows of the segmented array may be fully or partially mapped with strict page alignment. Any aggregation required to convert local object shards to a mappable form shall be automatic. APIs shall be provided to determine the mapping of segmented array column indexes to application process IDs (e.g. MPI ranks). The following mapping modes shall be supported:

- Read-only: any committed version of the object can be mapped directly into the process address space; only load access is supported.
- Update: the HCE version of the object is mapped directly into the process address space and load/store access is supported. Modified pages are copied on write to ensure that the version of the object as it appeared when mapped remains immutable. Modified pages are applied in the next commit, when redundancy, either replicated or erasure coded, is computed and stored.

I/O operations through the DAOS API (e.g. daos_obj_write()) to a segmented array which is memory-mapped by another process may result in undefined behavior. On the other hand, concurrent I/O operations from multiple threads to the same mmap region will be supported.

4.5 DAOS POSIX Namespaces

DAOS aims at replacing the legacy POSIX I/O interface over which applications and domain-specific I/O middleware have been developed for decades. Until all applications, tools, system services and I/O middleware (e.g. HDF5, ADIOS) natively support the DAOS API, POSIX emulation will be necessary to facilitate a smooth transition.

Figure 1: DAOS POSIX Namespaces

4.5.1 Private POSIX Namespaces

A POSIX namespace shall be encapsulated into a container and built on top of DAOS objects. This application-private namespace is accessible to any task of the application that has successfully opened the container. Upon application termination, another application can open the container and access the POSIX files and directories.

POSIX support shall be provided via a library, which can be used by the application to access the POSIX files and directories stored inside a container once the latter has been successfully opened. The POSIX namespace will be private to the opener and must feature the following (a sketch of directory lookup over key-value objects follows this list):

- scalable directory operations
- scalable shared-file I/O
- scalable file-per-process I/O
- the full range of distribution schemas (fixed/dynamically striped, replicated/erasure-coded objects) available for both files and directories
- links in the namespace to external containers, by pointing at the container UUID
- self-healing to recover from failed or corrupted storage
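Since the milestone description in Section 1.1 bases directories on DAOS-SR key-value objects and files on array objects, path resolution reduces to a chain of key lookups: each path component is looked up in the KV object of its parent directory, yielding the object ID of the next directory or file. The sketch below is a self-contained, in-memory illustration of that idea only; the names and structures are invented and do not represent the emulation library's design.

/* namei_demo.c - illustrative only. Directories modelled as key-value objects
 * mapping entry name -> object ID; path resolution is a chain of KV lookups. */
#include <stdio.h>
#include <string.h>

struct dirent_kv { const char *name; unsigned oid; int is_dir; };

/* Three tiny "directory objects": oid 0 = /, oid 1 = /climate, oid 2 = /climate/run1 */
static struct dirent_kv dir0[] = { { "climate",  1, 1 }, { NULL, 0, 0 } };
static struct dirent_kv dir1[] = { { "run1",     2, 1 }, { NULL, 0, 0 } };
static struct dirent_kv dir2[] = { { "temp.dat", 7, 0 }, { NULL, 0, 0 } };
static struct dirent_kv *dirs[] = { dir0, dir1, dir2 };

/* KV lookup in one directory object: returns the entry or NULL. */
static struct dirent_kv *kv_lookup(unsigned dir_oid, const char *name)
{
    for (struct dirent_kv *e = dirs[dir_oid]; e->name; e++)
        if (strcmp(e->name, name) == 0)
            return e;
    return NULL;
}

/* Resolve an absolute path to an object ID, one component at a time. */
static int resolve(const char *path, unsigned *oid_out)
{
    char buf[256];
    unsigned cur = 0;                      /* start at the root directory object */

    strncpy(buf, path, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';
    for (char *comp = strtok(buf, "/"); comp; comp = strtok(NULL, "/")) {
        struct dirent_kv *e = kv_lookup(cur, comp);
        if (!e)
            return -1;                     /* ENOENT */
        cur = e->oid;
    }
    *oid_out = cur;
    return 0;
}

int main(void)
{
    unsigned oid;
    if (resolve("/climate/run1/temp.dat", &oid) == 0)
        printf("resolved to object ID %u\n", oid);
    return 0;
}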

Both data and metadata of the encapsulated POSIX file system will be fully distributed across all the available storage, for both performance and resilience. The private namespace is designed for well-behaved applications that generate conflict-free operations, for which a very high level of concurrency will be supported. The emulation library will not be fully POSIX compliant and must implement a minimal subset of POSIX rules, agreed with Argonne, sufficient to support their HPC applications and middleware.

4.5.2 System Namespace

A shared global namespace accessible from all compute nodes is still very convenient and useful to store application binaries, libraries and links to other containers. This service must be provided by a system daemon which exports to all compute nodes a private POSIX namespace encapsulated into a container. This system daemon is effectively a parallel application in charge of:

- processing requests sent from compute nodes through the function shipper;
- managing contention between those un-coordinated clients;
- enforcing POSIX permissions;
- tracking space used by each group and user.

This service is not intended to store any simulation data and will not be as scalable as the private namespace, which, by contrast, is designed to be used directly by applications and to support scalable, concurrent, non-conflicting data and metadata operations from many nodes. On the compute node, the system namespace will be accessible via a FUSE mount point or a library directly linked with the application. In both cases, the function shipper will be used to communicate with the system daemon. Similarly to the private namespace, the system namespace will not be fully POSIX compliant.

5 Use Cases

The purpose of this section is to provide a non-exhaustive list of use cases that presents how the DAOS stack could be used on exascale clusters.

5.1 Simulation

5.1.1 Normal Operation

The cluster resource manager allocates a session of N compute nodes to a scientific workflow composed of a single simulation job. The session script first starts the DAOS servers on each compute node and then the simulation job. The simulation job then connects through the function shipper to the DAOS servers. This connection to the DAOS process group is done internally and transparently to the actual application.

Rank 0 of the simulation job looks up the container UUID by name in the system namespace and opens it through the DAOS API. If the container does not exist, a new one is created over all the available DAOS servers. Once the container is opened, rank 0 accesses the root object, identified by the reserved object ID 0. If this object does not exist, the access fails and rank 0 then initiates the creation of this object as a replicated key-value store (non-striped, one replica in each fault domain). This object stores the simulation metadata table describing the simulation data in the container.

Rank 0 then looks up key 0 in the root KV object, which stores the latest committed timestep (0 is returned if no value has ever been inserted for this key); this is the timestep to restart from. Rank 0 then broadcasts the container handle and the last successfully committed timestep to all other tasks.

Upon receiving the container handle and latest timestep, each task allocates several striped, replicated array objects to store simulation output such as temperature or pressure and records the object IDs in the root KV store under the key timestep.rank. It then proceeds with the simulation of this timestep, writes the simulation output data into the respective arrays at HCE+1, waits for the I/O operations to complete, initiates a flush of the epoch and proceeds similarly with each timestep, incrementing the epoch number each time. Each rank regularly checks for the flush completion notification and signals timestep completion to rank 0 by participating in non-blocking collective communications [1] over all ranks. Once all ranks are done with the next to-be-committed timestep, rank 0 increases the timestep value associated with key 0 in the root KV store and triggers an epoch commit.

[1] Non-blocking collectives are slowly coming into existence in MPI implementations after their addition to the MPI 3.0 specification. In theory, the same functionality can be achieved using non-blocking point-to-point operations.
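The control flow above can be summarized in a short sketch. Everything below is a stub-based illustration of the sequence only (restart-timestep lookup, writes at HCE+1, flush, then a single commit per timestep); the function names are invented stand-ins and are not the DAOS API, and only a single rank is shown for brevity.

/* simulation_loop.c - illustrative pseudo-workflow only. All functions below
 * are local stubs with invented names echoing the steps of Section 5.1.1. */
#include <stdio.h>

typedef unsigned long epoch_t;

static epoch_t hce = 0;                      /* highest committed epoch */

static epoch_t root_kv_get_last_timestep(void) { return hce; }
static void    write_output_arrays(int rank, epoch_t e)
{ printf("rank %d: write temperature/pressure arrays at epoch %lu\n", rank, e); }
static void    epoch_flush(int rank, epoch_t e)
{ printf("rank %d: flush epoch %lu\n", rank, e); }
static void    root_kv_put_last_timestep(epoch_t ts)
{ printf("rank 0: root KV key 0 <- timestep %lu\n", ts); }
static void    epoch_commit(epoch_t e)
{ hce = e; printf("rank 0: commit epoch %lu (new HCE)\n", e); }

int main(void)
{
    const int rank = 0;                      /* single rank shown for brevity */
    epoch_t restart_ts = root_kv_get_last_timestep();

    for (epoch_t ts = restart_ts + 1; ts <= restart_ts + 3; ts++) {
        epoch_t e = hce + 1;                 /* all writes target HCE+1 */
        write_output_arrays(rank, e);
        epoch_flush(rank, e);
        /* ...non-blocking collective over all ranks to detect completion... */
        if (rank == 0) {
            root_kv_put_last_timestep(ts);   /* record latest committed timestep */
            epoch_commit(e);
        }
    }
    return 0;
}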

5.1.2 Transient Node Failure

When one compute node suddenly becomes unreachable, I/O operations to the simulation data arrays time out. The DAOS stack retries the failed I/O operations multiple times as long as the node is still reported as alive or only temporarily inaccessible by the external health monitoring service. The node eventually becomes responsive again and the I/O operation retries succeed.

5.1.3 Permanent Node Failure

This is the same scenario as above, except that the node does not come back and is reported dead by the health monitoring service (a separate service external to DAOS). The simulation job is terminated by the resource manager and rescheduled with a spare node added to replace the failed one (degraded mode is addressed in the next section). In the meantime, the DAOS servers are notified of the job termination and close the container handle.

When rank 0 of the restarted job opens the container, DAOS detects that a container shard is missing, disables it in the container layout, creates a new container shard on the new compute node and enters a synchronous recovery process which consists of:

- restoring redundancy of the container metadata, if required;
- identifying all the objects impacted by the failure and rebuilding redundancy on the newly added container shard; this is done for all committed epochs;
- discarding updates from uncommitted epochs.

This process is done synchronously and is expected to complete promptly with the help of the very high bandwidth and low latency offered by persistent memory and the compute interconnect. Once recovery is finished, the container open request returns a handle to rank 0 and the application can continue normal operations. In the future, recovery might be triggered even for idle containers via a dedicated service in charge of rebuilding quiescent containers when a failure occurs. A scrub service is also being considered.

5.2 Degraded Mode & Asynchronous Recovery

A simulation job is running and one compute node fails permanently. If the communication middleware is fault tolerant, the job recovers from this failure and continues running with the same open handle. Otherwise, the job is terminated and restarted without the failed node.

DAOS eventually detects that one container shard is missing, disables it in the container layout and enters recovery, which holds the container open request (if any) as well as any new I/O operations. However, unlike the previous use case, only the minimal recovery actions are done synchronously. Once uncommitted updates have been discarded and the objects impacted by the failure have been marked as degraded, the DAOS stack allows the simulation job to proceed further with the container in degraded mode.

Objects with enough redundancy will remain accessible, with potentially extra overhead (e.g. data restoration from parity), whereas the others have lost data and will return a failure on access. If any object is in the latter case and the container is reopened, a special flag is returned along with the container handle to notify the application that some objects have inaccessible data.

The simulation job then resumes regular operations with reduced redundancy while the rest of the recovery takes place asynchronously in the background. This asynchronous recovery consists of further replicating the container metadata if required and restoring the redundancy of the impacted objects on the existing shards, for both committed and uncommitted epochs. The use of a declustering technique guarantees that all DAOS servers participate in the rebuild. The recovery load is thus distributed across all the DAOS servers and should complete promptly, with minimal performance impact on the simulation's progress. The simulation program will eventually use the new layout for each impacted object and exit degraded mode.

5.3 Simulation/Analysis in Different Containers

5.3.1 Normal Operation

A session of M+N nodes is allocated to a scientific workflow composed of a simulation job to be run on M nodes and an analysis job on N nodes. The resource manager first starts the DAOS servers on all the nodes (i.e. M+N), then the simulation job on the first M nodes and finally the analysis job on the remaining N nodes. Both the simulation and analysis jobs connect to the DAOS servers through the function shipper and open (create if non-existent) their own containers to store their respective output data.

The analysis job checks in its private container for the latest analyzed timestep, opens the simulation container for read and checks whether the next timestep to be analyzed is ready. If not, it waits for a new epoch commit event to be raised (i.e. a completion event returned on an event queue) and checks again. Once the data is available, the analysis job processes the simulation data for this new timestep, dumps the results into its own container and updates the latest analyzed timestep in its metadata before committing the updates to its private container. It then waits again for a new epoch to be committed and repeats the same process.

Another approach is for the simulation job to create explicit snapshots for epochs of interest and have the analysis job wait on and process snapshots instead of every single committed epoch.

5.3.2 Failure

Failure cases associated with the simulation job have already been covered in the previous use cases; the focus in this section is thus on the impact on the analysis job.

A permanent failure of a simulation node causes the analysis job to access the simulation container in degraded mode until recovery is completed. If the simulation job is terminated and restarted, the analysis job will keep its open handle on the simulation container and be woken up when the simulation has restarted and produced new simulation data to be analyzed.

A permanent failure of an analysis node is handled in a similar way as for the simulation job. Recovery of the container storing the analysis output data is triggered synchronously or asynchronously to restore redundancy.

5.4 Simulation/Analysis in Single Container

The simulation and analysis jobs use a single container to store their respective output data. As a consequence, the analysis job opens the shared container created by the simulation job for read & write and notifies DAOS that it does not intend to submit any updates in the short term. This allows the container HCE to move forward immediately when the simulation job commits new epochs, without waiting for more updates from the analysis job. As in the previous use case, the analysis job reads its metadata to find out the latest analyzed timestep, waits for the next timestep to be available and analyzes the simulation data once it is ready. At this point, the analysis job notifies DAOS that it wants to update the container, gets back a reference on the current HCE + 1 and writes its analysis output data and metadata to objects stored in the same container as the simulation data. Once all I/O operations are completed and flushed, the analysis job issues a commit for its updates and notifies again that no more updates are expected in the short term. It then waits again for a new timestep to be available and repeats the same process. Meanwhile, the simulation job moves on to the next timesteps regardless of the analysis job.

5.5 POSIX Emulation

A well-behaved POSIX-based application is linked with libsysio and uses the private namespace emulation inside a container. This application is known to use POSIX in a fairly standard way: it does not generate conflicting metadata operations and never overwrites data. For other applications that are known to produce conflicting operations, a FUSE mount point will be created on each compute node and contention will be managed by dedicated daemons, similarly to the system namespace.

5.6 Checkpoint/Restart with Segmented Arrays

A checkpoint/restart library (like SCR or FTI) wants to checkpoint individual tasks. To do so, it creates a segmented array object distributed over all the compute nodes. Each task maps the local segments of the array and writes its local checkpoint to the memory-mapped region. Once every task has completed, the object is unmapped everywhere and one of the tasks commits the changes to a future uncommitted epoch. The commit request computes the erasure code across all the segments and then stores it inside the distributed object.

The object can then be mapped again, updated with a new checkpoint and committed without altering the previous checkpoint. If a compute node fails, the checkpoint stored in persistent memory is still accessible for restart through reconstruction from the erasure code (e.g. Reed-Solomon or fountain codes; details to be determined in the HLD).
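The Update mapping mode used by this checkpoint workflow (Section 4.4) relies on copy-on-write semantics: stores go to private copies of the mapped pages, and the mapped version of the object stays immutable until the changes are applied at the next commit. The snippet below illustrates that behaviour with nothing but standard POSIX mmap on a temporary file (MAP_PRIVATE gives the copy-on-write effect); it is an analogy for the described semantics, not DAOS code.

/* cow_demo.c - illustrative analogy only (plain POSIX, not DAOS). A private
 * (copy-on-write) mapping lets a task store into a "checkpoint" region while
 * the underlying committed bytes remain unchanged until explicitly applied. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    char path[] = "/tmp/cow_demo_XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return 1;

    const char committed[] = "checkpoint v1";
    write(fd, committed, sizeof(committed));          /* the "committed" version */

    /* Update mode: map privately; stores hit copied pages, not the file. */
    char *map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED)
        return 1;

    memcpy(map, "checkpoint v2", 13);                 /* task writes a new checkpoint */

    char back[sizeof(committed)] = { 0 };
    pread(fd, back, sizeof(committed), 0);            /* re-read the backing store */
    printf("mapped view : %s\n", map);                /* "checkpoint v2" */
    printf("committed   : %s\n", back);               /* still "checkpoint v1" */

    /* "Commit" would be the point where the modified pages are applied and
     * redundancy (replication or erasure code) is generated. */
    munmap(map, 4096);
    close(fd);
    unlink(path);
    return 0;
}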

6 Solution Proposal

The purpose of this section is to provide a high-level overview of the new storage stack, which is primarily based on the concepts developed under the Exascale Fast Forward Storage and I/O program. That being said, the new DAOS-M stack is a complete rewrite and does not share much beyond the container concept, the transaction model (i.e. epochs) and the API with the DAOS Lustre implementation developed in Fast Forward. The new DAOS stack is composed of multiple layers, represented in the diagram below.

Figure 2: New DAOS Stack (components shown: Application; I/O Middleware; System Tools; System POSIX; Private POSIX; DAOS-SR; DAOS-M; Function Shipper; NVML; Fabric Driver; NVDIMMs; Fabric)

The DAOS-M layer provides the container abstraction, which aggregates the persistent memory distributed across compute nodes into a global transactional object store. Each DAOS-M server manages the persistent memory on a compute node with the help of the NVML library, which handles allocations and guarantees atomic updates through local transactions.
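As an illustration of the kind of local transactional update NVML provides, the sketch below uses the public libpmemobj interface (part of the NVM Libraries) to create a small pool and update a persistent counter atomically. The pool path, layout name and root structure are arbitrary choices for this sketch, and it is an independent example rather than code from the DAOS-M design.

/* nvml_tx_demo.c - a small standalone example of NVML (libpmemobj) local
 * transactions; build with -lpmemobj. */
#include <libpmemobj.h>
#include <stdint.h>
#include <stdio.h>

struct root {
    uint64_t commit_count;      /* toy persistent state */
};

int main(void)
{
    PMEMobjpool *pop = pmemobj_create("/tmp/daosm_demo.pool", "demo_layout",
                                      PMEMOBJ_MIN_POOL, 0600);
    if (pop == NULL)
        pop = pmemobj_open("/tmp/daosm_demo.pool", "demo_layout");
    if (pop == NULL) {
        perror("pmemobj_create/open");
        return 1;
    }

    PMEMoid root_oid = pmemobj_root(pop, sizeof(struct root));
    struct root *rp = pmemobj_direct(root_oid);

    /* The update below is applied atomically: either the whole transaction
     * takes effect in persistent memory, or none of it does. */
    TX_BEGIN(pop) {
        pmemobj_tx_add_range(root_oid, 0, sizeof(struct root));
        rp->commit_count += 1;
    } TX_END

    printf("commit_count = %lu\n", (unsigned long)rp->commit_count);
    pmemobj_close(pop);
    return 0;
}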

On the other side, the DAOS-M client is a library linked directly with the application. It provides container access through the DAOS API. Request transport is handled by the function shipper, which supports both collective and peer-to-peer communications.

The sharding and resilience (SR) layer is responsible for data distribution and resilience inside the DAOS-M container. It provides data protection at the object level and relies on replication and erasure coding for data safety. The SR layer takes non-overlapping fault domains into account to guarantee optimal placement for resilience. Support for performance domains might be added in the future. A broad panoply of object schemas is supported to meet various application needs.

The POSIX emulation is built on top of the SR layer and provides both application-private and system-wide namespaces. The former can be used directly by an application to store input and output data in POSIX form, whereas the latter will be directly accessible to system tools, users and administrators on all compute nodes through FUSE. To manage contention, the POSIX service could be provided by daemons running on dedicated nodes that can be reached by compute clients through the function shipper.

The next subsections describe in further detail the DAOS-M layer, the SR layer and finally the POSIX emulation.

6.1 The DAOS-M Layer

The DAOS-M layer provides containers with local objects, on top of which the DAOS-SR layer builds distributed objects with resilience and performance properties. Its design follows the client-server model. (A purely library-based approach would struggle to provide remote access when no process in a process group runs locally to some shards.) A client is any process that links with the DAOS-M client library. A server is any PM-equipped compute node that runs the DAOS-M server daemon. Clients and servers communicate with an RPC transport. The rest of this subsection explains, among other topics, how the DAOS-M server daemon builds a versioning object store over persistent memory and how the client library and the server daemon together provide the container abstraction.

6.1.1 Containers

DAOS-M containers provide all the features described in Section 4.1 (Containers), except that each object is stored in only one shard. In other words, DAOS-M objects are neither striped for performance, nor do they have any redundancy to survive shard failures.

Various metadata must be stored for each container. Some of it follows naturally from the container abstraction, like the UID, GID and permissions, while other metadata is considered necessary for the internal protocols implementing container handles, epochs, etc.

Some of this metadata follows naturally from the container abstraction, like the UID, GID and permissions, while other items are needed for the internal protocols implementing container handles, epochs, and so on. In addition, DAOS-SR may need to store its own container metadata. A container's metadata is therefore organized as a set of key-value-style attributes (not to be confused with key-value objects stored in the container), recording the following information:

- UID/GID: the container's user and group ID, each with its own attribute.
- Permission: the capabilities of the container's user, group and others, as one attribute.
- Layout: the complete list of the container's shards, as one attribute. Each shard's entry includes the shard's ID, its network address, and whether it has been disabled.
- Epoch state: the container's highest committed epoch, lowest readable epoch (excluding snapshots), etc., each with its own attribute.
- Snapshots: the list of the container's snapshots, as one attribute.
- DAOS-SR attributes: opaque to DAOS-M.

The set of attributes of a container is replicated with the Raft consensus algorithm [4] on a subset of the container's shards, called consensus shards. On some of the storage servers, the Raft service is thus collocated with the existing DAOS servers. Reading and updating the attributes must go through Raft. When a process group opens a container, a Raft instance is bootstrapped if one is not already running. The open protocol may then read from Raft the UID/GID, the permission and the layout to complete authorization, as well as the epoch state to recover the epoch state of each shard. As long as a majority of the consensus shards is available, Raft guarantees that the attributes remain available. The Raft leader, a role played by one consensus shard and re-elected if the current one fails, also serves as the proxy for container operations such as container open/close and epoch commit. DAOS-SR may request that the membership of the consensus shards be changed, in order to restore any lost replicas on new members.
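For illustration only, the attribute set above could be modeled roughly as follows. This is a minimal sketch with hypothetical C names and types; DAOS-M actually stores this information as key-value-style attributes replicated through Raft, not as a fixed in-memory structure.

/* Hypothetical sketch of the per-container metadata replicated on the
 * consensus shards. Names and types are illustrative only and do not
 * reflect the actual DAOS-M attribute encoding.                        */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t id;              /* shard identifier within the container  */
    char     address[64];     /* network address of the hosting server  */
    bool     disabled;        /* set once the shard has been disabled   */
} shard_entry_t;

typedef struct {
    uint32_t       uid, gid;          /* container user and group IDs    */
    uint32_t       permission;        /* user/group/other capabilities   */
    uint32_t       nr_shards;         /* number of entries in layout[]   */
    shard_entry_t *layout;            /* complete list of shards         */
    uint64_t       highest_committed; /* epoch state: HCE                */
    uint64_t       lowest_readable;   /* epoch state (snapshots excluded)*/
    uint64_t      *snapshots;         /* list of snapshotted epochs      */
    uint32_t       nr_snapshots;
    void          *sr_attrs;          /* DAOS-SR attributes, opaque blob */
    size_t         sr_attrs_size;
} container_metadata_t;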

[Figure 3 shows four shards (Shard 0 to Shard 3); three consensus shards hold Container Metadata Replicas 0 to 2 (UID/GID, Permission, Layout, Epoch state, Snapshots, DAOS-SR attributes), and every shard also holds its own shard metadata and objects.]

Figure 3: A DAOS-M container (three consensus shards, with the first one magnified)

6.1.2 Epochs

DAOS-M epochs directly address the requirements in the Epochs section of the Solution Requirements, minimizing epoch processing in DAOS-SR as a result. Overall, the epoch protocols rely on the fault-tolerant container metadata to store the definitive epoch state and layout of each container. Epochs can thus still be committed in the presence of a limited set of faulty shards, as long as a majority of the consensus shards remains available.

Before committing an epoch, a process group must make sure all write operations belonging to this epoch have been synchronized to persistent memory. A flush operation is provided for each process to flush all its write operations for an epoch. If any shards fail during the flush operations, it is up to DAOS-SR to determine whether the write operations to the failed shards are recoverable and whether the affected flush operations should fail. The SR layer will not report a fatal error, but will notify the upper layer of partial success (i.e. some redundancy data could not be updated) through a dedicated API. The upper layer can then either decide to continue, if the reduced redundancy is acceptable, or address the partial success by replacing the failed target. If the decision is to commit the epoch anyway, the failed shards must be disabled first and the container will then be accessed in degraded mode.

Once every process has dealt with its flush operations and any failed shards, one of them may request the epoch to be committed (or aborted) on behalf of the whole group. This is as simple as the client library sending a commit (or abort) request to the current Raft leader. As soon as Raft has made the epoch state persistent, the epoch is considered committed (or aborted).
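A minimal sketch of this flush-then-commit sequence is shown below. The dsm_* calls, the handle type and the helper functions are hypothetical stand-ins for the DAOS-M client API, which is not defined in this document; only the ordering of the steps reflects the protocol described above.

/* Hypothetical flush/commit sequence for one epoch. dsm_epoch_flush()
 * and dsm_epoch_commit() stand in for the real DAOS-M client calls;
 * handle_failed_shards() and i_am_group_representative() are
 * application-level helpers assumed for the example.                   */
#include <stdint.h>

typedef struct { void *impl; } dsm_handle_t;   /* container handle stub */

int dsm_epoch_flush(dsm_handle_t coh, uint64_t epoch);
int dsm_epoch_commit(dsm_handle_t coh, uint64_t epoch);
int handle_failed_shards(dsm_handle_t coh, uint64_t epoch);
int i_am_group_representative(void);

int commit_epoch(dsm_handle_t coh, uint64_t epoch)
{
    /* 1. Every process flushes its own writes for this epoch so that
     *    they are synchronized to persistent memory.                   */
    int rc = dsm_epoch_flush(coh, epoch);
    if (rc != 0) {
        /* Some shards failed during the flush: DAOS-SR reports partial
         * success and the upper layer decides whether to continue. If
         * the epoch is committed anyway, the failed shards must be
         * disabled first and the container runs in degraded mode.      */
        rc = handle_failed_shards(coh, epoch);
        if (rc != 0)
            return rc;
    }

    /* 2. Once every process is done, a single representative sends the
     *    commit request to the current Raft leader; the epoch is
     *    committed as soon as Raft has persisted the epoch state.      */
    if (i_am_group_representative())
        rc = dsm_epoch_commit(coh, epoch);
    return rc;
}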

Epoch recovery during container open is simplified as well. The Raft leader determines the highest committed epoch (HCE) from the container metadata and sends it in a collective communication request to all shards that are not disabled. Each shard then learns the definitive HCE and discards any higher epochs.

6.1.3 Process Model & Transport

The DAOS-M layer, as described earlier, is split into client and server parts: any application process that uses the DAOS API through the linked client library is a client process, and the server process is a daemon running with the same UID and GID as the application process. Since compute node availability is a point of contention in a cluster, DAOS server daemons will be launched on demand for each successful allocation of a set of compute nodes. This allocation is called a session for convenience. There can be multiple application jobs within each session, and the DAOS-M server daemon process stays alive for the lifetime of a session.

A container open performs a POSIX-like permission check by comparing the UID, GID and open mode of the calling process to those of the container, and grants a container handle on success. This container handle, which also encapsulates the capabilities (read-only/read-write) for accessing the container, can be used from all the processes within the application. Each container shard within the container is represented as a unique file stored in a persistent-memory-aware ext4 file system managed by the versioning object store. The kernel performs its inherent permission checks when the file associated with a shard is opened.

Both the server and the client in DAOS rely on a lightweight user-space communication stack. For this purpose Mercury [2], an asynchronous RPC interface, will be used. Mercury supports both small data transfers for metadata operations and bulk data transfers for actual data (such as memory buffers). Mercury interfaces are co-located on both client and server, and Mercury also offers a lightweight Network Abstraction Layer (NAL) that makes it portable. Mercury will be required to support scalable collective operations, with both request propagation and reply aggregation across clients and servers, and to provide fault-handling features for DAOS; this work will be addressed as part of a separate project called Function Shipping.

Asynchronous DAOS operations are handled using an event mechanism to signal completion. This enables upper levels to initiate many concurrent operations from a single thread. The event mechanism will also permit multiple child events to be aggregated into a single parent event to simplify the expected case of successful completion of many concurrent operations. In the case of unsuccessful completion, the child events may all be queried individually for detailed error handling.
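The parent/child event pattern could be used roughly as sketched below. All of the dsm_* names and types are hypothetical; the sketch only illustrates how one thread launches many asynchronous operations, waits on a single aggregated parent event, and falls back to querying the child events on failure.

/* Hypothetical event API, declared here only to keep the sketch
 * self-contained; the real DAOS-M interface is not specified here.     */
#include <stdio.h>

typedef struct dsm_event {
    int               status;
    struct dsm_event *parent;
} dsm_event_t;
typedef struct { void *impl; } dsm_handle_t;
struct update_desc;                         /* one pending object update */

int dsm_event_init(dsm_event_t *ev, dsm_event_t *parent);
int dsm_event_wait(dsm_event_t *ev);        /* waits for all children    */
int dsm_obj_update(dsm_handle_t oh, struct update_desc *op, dsm_event_t *ev);

#define NR_OPS 16

int update_many(dsm_handle_t oh, struct update_desc **ops)
{
    dsm_event_t parent, children[NR_OPS];
    int rc, i;

    dsm_event_init(&parent, NULL);
    for (i = 0; i < NR_OPS; i++) {
        /* Child events are aggregated into the single parent event.    */
        dsm_event_init(&children[i], &parent);
        dsm_obj_update(oh, ops[i], &children[i]);   /* asynchronous      */
    }

    /* Expected case: one wait covers all concurrent operations.        */
    rc = dsm_event_wait(&parent);
    if (rc != 0) {
        /* Unsuccessful completion: query each child individually.      */
        for (i = 0; i < NR_OPS; i++)
            if (children[i].status != 0)
                fprintf(stderr, "operation %d failed: %d\n",
                        i, children[i].status);
    }
    return rc;
}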

6.1.4 Versioning Object Store over Persistent Memory

6.1.4.1 NVM Libraries (NVML)

NVML [1] is a collection of libraries for using memory-mapped persistence on persistent memory. These libraries facilitate the use of direct-access storage, which supports load/store access without paging blocks from block storage devices, with the help of a persistent-memory-aware file system. Non-paged load/store access can be achieved simply by memory-mapping (mmap) a file from this type of file system.

Although a PM-aware file system allows direct loads and stores without going through the page cache, persistent memory stores can still pass through the processor cache, which must be explicitly flushed for the data to become durable on the persistent memory. In addition, cache-line evictions can leave dirty data in the memory subsystem, which also needs explicit flushing to persistent memory. For this reason, transactional interfaces are needed to guarantee that objects and data remain consistent in persistent memory in the case of failures or crashes. The libpmemobj NVML library provides such a transactional object store in persistent memory. It also allows locks to be embedded in PM-resident objects; these locks are reinitialized (unlocked) automatically every time the persistent object store pool is opened, which ensures that all locks are released when the object pool is opened.

6.1.4.2 Versioning Object Store

DAOS-M provides multi-version concurrency at byte granularity with end-to-end data integrity, and allows concurrent updates to multiple distributed objects to be grouped into transactions to guarantee consistency in the face of application and system failures. To satisfy the requirements for a versioning object store on persistent memory, the transaction-based NVM library will be leveraged to maintain the Extent.Version index (as shown in Figure 4) over the extent buffers. One file, or NVML memory pool, is memory-mapped for each DAOS-M shard added to a DAOS-M container. Loads and stores can be done directly to this memory-mapped file. With NVML, transactional updates to the index data structure can be done in place in persistent memory, without copying it to main memory. The actual data for the extents referenced by the index tree can be stored through RMA directly in persistent memory.

The index data structure can be designed as a self-balancing tree with a key encapsulating both the extent range (or key/value) and its version index. By localizing updates to the sub-trees for a given set of extents and versions, the necessary concurrency for updates can be achieved.
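The following sketch shows how such an index update might be wrapped in a local NVML transaction. The libpmemobj calls (TX_BEGIN/TX_ONABORT/TX_END, pmemobj_tx_zalloc, pmemobj_direct) are part of the real library; the vos_key/vos_node layout and tree_attach() are hypothetical and stand in for the actual Extent.Version index, which is not specified at this level of detail.

/* Minimal sketch, assuming libpmemobj (NVML): inserting an
 * Extent.Version index node under a local PM transaction so that a
 * crash leaves the index either fully updated or untouched.            */
#include <libpmemobj.h>
#include <stdint.h>

struct vos_key {                 /* hypothetical Extent.Version key      */
    uint64_t object_id;
    uint64_t offset;             /* start of the byte extent             */
    uint64_t length;             /* extent size in bytes                 */
    uint64_t epoch;              /* version of this extent               */
};

struct vos_node {                /* hypothetical tree node               */
    struct vos_key key;
    PMEMoid        left, right;  /* children in the self-balancing tree  */
    PMEMoid        data;         /* PM buffer filled directly via RMA    */
};

/* Hypothetical helper that links the node into the PM-resident tree and
 * rebalances it; all of its stores happen inside the transaction.       */
void tree_attach(PMEMobjpool *pop, PMEMoid node);

int vos_index_insert(PMEMobjpool *pop, const struct vos_key *key, PMEMoid data)
{
    int rc = 0;

    TX_BEGIN(pop) {
        PMEMoid oid = pmemobj_tx_zalloc(sizeof(struct vos_node), 1);
        struct vos_node *node = pmemobj_direct(oid);

        node->key  = *key;
        node->data = data;
        tree_attach(pop, oid);
    } TX_ONABORT {
        rc = -1;                 /* whole update rolled back on failure  */
    } TX_END

    return rc;
}

Because the extent data itself is written directly into persistent memory via RMA, only the index update needs to be transactional in this sketch.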

In addition, this design provides the capability to remove nodes from the tree while aggregating or discarding versions, and to fall back to the available versions when a requested version does not exist. With the auto-reinitializing embedded locks provided by NVML, the data structure can be kept consistent even in the event of a crash or failure. By keeping the tree balanced at all times, the latency of scanning the data structure to generate the read descriptor is reduced, even for complicated object reads.

Figure 4: Example Extent.Version index tree for the Versioning Object Store

End-to-end data integrity is achieved with checksums that detect data corruption anywhere along the path (for example in the network, the PCI bus or the PM hardware). Checksums must be computed on the client side on write/update and verified on the server side on read/lookup. Checksums are computed for both byte extents and KVs and are sent along with the I/O operation. In the case of a partial read/lookup, whether it hits an individual byte extent or spans multiple extents, the checksum of each individual extent is verified and then recomputed for the bytes actually read before they are returned to the client. Checksums are also recomputed and updated while aggregating or discarding versions. Since DAOS-M is byte-granular, there is no static block/chunk size for checksums, although one could be considered in the future if required for optimizations.
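A sketch of the client/server split of checksum work is shown below, assuming a CRC-32-style checksum. compute_crc32() and the extent descriptor are hypothetical; the document does not prescribe a particular checksum algorithm or wire format.

/* Hypothetical end-to-end checksum flow: the client checksums each
 * extent on write, the server verifies on read and recomputes the
 * checksum for the sub-range actually returned.                        */
#include <stddef.h>
#include <stdint.h>

uint32_t compute_crc32(const void *buf, size_t len);    /* assumed helper */

struct extent_desc {
    uint64_t offset;       /* byte offset of the extent in the object    */
    uint64_t length;       /* extent length in bytes                     */
    uint32_t csum;         /* checksum shipped with the I/O operation    */
};

/* Client side: computed on write/update, sent along with the request.  */
void client_checksum_extent(struct extent_desc *ext, const void *buf)
{
    ext->csum = compute_crc32(buf, ext->length);
}

/* Server side: verify the stored extent, then recompute over the bytes
 * actually read before returning them to the client (partial read).    */
int server_read_extent(const struct extent_desc *stored, const void *pm_buf,
                       uint64_t read_off, uint64_t read_len,
                       uint32_t *reply_csum)
{
    if (compute_crc32(pm_buf, stored->length) != stored->csum)
        return -1;                          /* corruption detected       */

    *reply_csum = compute_crc32((const char *)pm_buf +
                                (read_off - stored->offset), read_len);
    return 0;
}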

6.2 DAOS Sharding and Resilience

The DAOS Sharding and Resilience (DAOS-SR) layer is responsible for providing byte-array and KV object abstractions distributed for both horizontal performance and resilience. Objects exposed by DAOS-SR can spread across multiple container shards and use DAOS-M objects in these shards as their storage entities. Concurrent I/O against these DAOS-M objects is the foundation of performance scalability, while building data redundancy over multiple DAOS-M objects provides the required availability.

6.2.1 Object ID allocator

DAOS promises to generate unique object identifiers (IDs) for the upper stack layers. A Raft cluster would be too slow and insufficiently scalable to handle cluster-wide object ID distribution. To guarantee uniqueness there should thus be a central service acting as coordinator; however, frequently requesting IDs from a central coordinator can significantly impact performance in a cluster with a large number of nodes. Container shards shall therefore be able to allocate object IDs independently most of the time, avoiding this potential performance issue.

To achieve this goal, the coordinator allocates an ID range, or sequence, to the shard instead of one ID at a time. The shard can then use any ID within this range without interacting with other servers, because these IDs are guaranteed to be unique. To avoid a peak load generated by many container shards requesting ID ranges at the same moment, each allocated ID range can randomly cover a few thousand IDs, so it is unlikely that many container shards will exhaust their ID ranges at the same time. Moreover, to further reduce requests to the central coordinator, a container can generate a spanning tree over its shards, so that ID allocation requests can be aggregated and allocation replies propagated through this tree.

The ID-allocating coordinator should store the last allocated ID as container metadata. Whenever the current coordinator fails, it can be replaced by a service running on a different container shard; the new coordinator can discover the already-consumed ID space by fetching it from the container metadata. In order to provide a large enough object address space, the object ID should be 128 bits, although a few bits could be reserved by DAOS to flag internal object types.
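A minimal sketch of this range-based allocation is given below. All names are hypothetical, and a 64-bit counter is used for brevity even though the real object ID is 128 bits wide; the point is only that a shard fetches a whole range from the coordinator once and then hands out IDs locally.

/* Hypothetical range-based object ID allocator running on a container
 * shard. coordinator_request_range() and random_range_size() stand in
 * for the coordinator RPC and the randomized range sizing.              */
#include <stdint.h>

struct id_range {
    uint64_t next;      /* next unallocated ID within the current range  */
    uint64_t end;       /* exclusive upper bound of the current range    */
};

int coordinator_request_range(uint64_t size, uint64_t *start, uint64_t *end);
uint64_t random_range_size(void);    /* a few thousand IDs, randomized   */

int shard_alloc_object_id(struct id_range *r, uint64_t *id_out)
{
    if (r->next == r->end) {
        /* Range exhausted: ask the coordinator for a fresh one. The
         * randomized size makes it unlikely that many shards come back
         * to the coordinator at the same moment; the coordinator
         * persists the last allocated ID in the container metadata.     */
        if (coordinator_request_range(random_range_size(),
                                      &r->next, &r->end) != 0)
            return -1;
    }
    *id_out = r->next++;   /* purely local allocation in the common case */
    return 0;
}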

6.2.2 Cluster map

The cluster map is a compact data structure describing a storage cluster. It includes all targets, their status and their characteristics, such as bandwidth, failure domain and performance domain. It is one of the most basic inputs for DAOS-SR to generate the algorithmic object distribution.

To reduce the compute overhead of the object placement algorithm, the resource manager may create a per-container cluster map instead of a globally shared cluster map for the whole storage tier. Although the cluster map is not expected to change frequently, it can still be modified while jobs are running (e.g. when a new shard is added or an existing one is disabled). The cluster map needs to be stored within the container, so that container recovery can calculate the object distribution of past epochs and rebuild data redundancy for objects in those epochs. The old cluster map is replaced by the up-to-date version when recovery finishes. The cluster map will be a versioned data structure, and only the difference between the old and new versions will be propagated during recovery.

6.2.3 Placement map

A placement map is built from a cluster map. Although a placement map can include the same set of cluster components as the cluster map it is built from (e.g. targets, shelves, cabinets), it pseudo-randomly combines these components within their failure and performance domains. For example, the figure below shows a few placement maps created for a cluster map that has only 3 cabinets and 9 targets.

Figure 5: Cluster map and Placement maps

In other words, a placement map is a virtualized cluster map. A container can generate many placement maps based on its cluster map. A DAOS-SR object selects one placement map from its container by hashing its object ID, then uses the placement algorithm to compute its object distribution based on the chosen placement map.

6.2.4 Algorithmic object placement

An object placement algorithm shall be a scalable, pseudo-random data distribution function based on consistent hashing:

- Given the same version of a placement map, an object placement schema and an object ID, the placement algorithm will always output the same result: a set of container shards over which data chunks or replicas are distributed.
- If a shard is excluded from the placement map, the output of the placement algorithm should either not change, or differ by exactly one shard, which replaces the excluded shard.

An object placement schema is a predefined description of the object placement requirements, for example the number of replicas, or the number of data chunks and coding chunks for Reed-Solomon codes. The object placement schema is a mandatory input to the placement algorithm.

Figure 6: Algorithmic Object Placement

There are several advantages to consistent-hash-based algorithmic object placement (a simplified sketch of such a placement function is given after this list):

- Object placement is decentralized: any part of the cluster can independently compute the location of an object without querying a central server.
- It significantly reduces metadata such as object layouts, which is especially important in large clusters with tens of thousands of storage targets or more.
- If the container layout or cluster map changes, consistent-hash-based object placement redistributes only a relatively small amount of data, moving it to surviving or newly added shards; most object replicas or data chunks remain in place.
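The sketch referenced above deliberately simplifies the problem and is not the actual DAOS-SR algorithm. Given a placement map, an object ID and a schema-defined group size, it always produces the same shard set, and disabling a shard only replaces that shard in the placements that used it.

/* Simplified, deterministic pseudo-random placement sketch (not the
 * DAOS-SR algorithm). The placement_map layout is hypothetical.        */
#include <stdbool.h>
#include <stdint.h>

struct placement_map {
    uint32_t  version;           /* bumped when the map changes; not used
                                  * in the hash so surviving placements
                                  * stay stable across exclusions        */
    uint32_t  nr_shards;
    bool     *disabled;          /* true if the shard has been excluded  */
};

/* Stable 64-bit mixing function (splitmix64-style finalizer).           */
static uint64_t mix64(uint64_t x)
{
    x ^= x >> 33; x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33; x *= 0xc4ceb9fe1a85ec53ULL;
    return x ^ (x >> 33);
}

/* Fill out[] with group_size distinct, enabled shard indices; returns
 * the number of shards actually found (may fall short if too many
 * shards are disabled).                                                 */
uint32_t place_object(const struct placement_map *pm, uint64_t object_id,
                      uint32_t group_size, uint32_t *out)
{
    uint32_t found = 0;
    uint64_t max_probe = 8ULL * pm->nr_shards + group_size;

    for (uint64_t probe = 0; found < group_size && probe < max_probe; probe++) {
        uint32_t shard = (uint32_t)(mix64(object_id +
                                          probe * 0x9e3779b97f4a7c15ULL)
                                    % pm->nr_shards);
        bool skip = pm->disabled[shard];

        for (uint32_t i = 0; i < found && !skip; i++)
            if (out[i] == shard)
                skip = true;     /* already chosen for this object       */
        if (!skip)
            out[found++] = shard;
    }
    return found;
}

Because the probe sequence for a given object ID never changes, objects that never selected the excluded shard keep exactly the same shard set, while objects that did select it probe one step further and pick up a single replacement.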

Because both data redundancy and I/O parallelism are goals of object placement, the placement schema and algorithm should consider both aspects:

- Redundancy distribution. Shards storing object replicas or erasure-coded data chunks² can either reside in different failure domains, or be kept at a short network distance from one another if the application has high demands on latency but needs less fault tolerance. If an application requires read-write consistency for uncommitted changes, data replication of objects should be synchronous: a client gets the completion acknowledgement for its write only once all replicas of the modified object are visible. Persistence is not required at that point, because a later flush and commit, or a rollback on failure, can always restore consistency.
- Parallelism distribution. To achieve I/O parallelism, a large object can be striped into data chunks that are spread over container shards in different performance domains. Data redundancy of a striped object can be implemented in two different ways: each data stripe of the object can have its own redundancy group in the server stack, in which case the server is responsible for replication or erasure coding; or the client stack is fully responsible for storing the data redundancy based on the object distribution, in which case the redundancy schema of the object is transparent to the server stack.

² Group size of erasure codes can either be specified through the API (i.e. the schema descriptor) or chosen from pre-defined schemas.

In summary, the placement algorithm distributes object data widely so that:

- data in the same redundancy group is spread over targets in different fault domains;
- redundancy groups of the same object avoid overlapping targets, so that different portions (stripes) of an object do not share the bandwidth of the same set of targets.

With this placement strategy, DAOS can provide resilience and highly parallel I/O to achieve horizontal scalability. However, short network distance within a redundancy group cannot be supported for the time being without more extensive research.

6.2.5 Object Lookup Table

The DAOS-M layer has a two-dimensional object address space <shard-ID, object-ID>: a DAOS-M object is a single entity residing in one container shard. The DAOS-SR layer, in contrast, provides a flat object address space: a DAOS-SR object can stripe across multiple container shards and comprise multiple DAOS-M objects. The DAOS-SR layer therefore needs a mechanism to convert a flat object ID into a distribution of DAOS-M objects.

The object lookup table is a highly available and scalable KV store responsible for converting a DAOS-SR object ID into an object distribution. This table is required in order to support different object schemas with customized attributes such as the data protection method, the number of stripes or the redundancy group size; this extra metadata is needed to compute the full object distribution.

The table shall be constructed at container creation. To achieve scalability, its chunks, or tablets, can spread across all the shards and grow if more shards are added to the container. Each tablet shall be replicated to a few container shards to guarantee that copies remain available even if some shards fail; each copy of a tablet is a DAOS-M KV object. DAOS-SR object IDs are consistently hashed to the different tablets to avoid reshuffling KV pairs when the table layout changes.

The tablet distribution of a lookup table can be built from the placement maps of the container. In order to distribute the lookup table workload evenly over all container shards, the lookup table can span multiple placement maps. As shown in the figure below, a 2-way replicated lookup table is initialized in a container with 4 shards; it has 8 tablets and spans 4 placement maps. When another two shards are added to this container, they are pseudo-randomly inserted into all placement maps. To extend the lookup table to the new container shards and obtain full horizontal scalability for this container, new tablets are inserted into the lookup table as well. Because object IDs are consistently hashed to tablets, only a small portion of the KV pairs needs to migrate to the new tablets. Many container shards can contribute to this KV migration, because the new tablets have different neighboring shards in different placement maps.

Figure 7: 2-Way replicated lookup table
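As an illustration of the consistent hashing step, the sketch below maps a (hashed) DAOS-SR object ID to a tablet index using Jump Consistent Hash (Lamping and Veres). The document does not prescribe a particular consistent-hash function; any scheme with the same property, that growing the tablet count moves only a small fraction of keys, would do.

/* Map an object-ID hash to a tablet index in [0, nr_tablets). When the
 * tablet count grows from n to n+1, only about 1/(n+1) of the IDs move
 * to a new tablet, matching the migration behaviour described above.
 * (Jump Consistent Hash, used here purely as an example.)              */
#include <stdint.h>

uint32_t object_to_tablet(uint64_t oid_hash, uint32_t nr_tablets)
{
    int64_t b = -1, j = 0;

    while (j < (int64_t)nr_tablets) {
        b = j;
        oid_hash = oid_hash * 2862933555777941757ULL + 1;
        j = (int64_t)((double)(b + 1) *
                      ((double)(1LL << 31) / (double)((oid_hash >> 33) + 1)));
    }
    return (uint32_t)b;
}

Each tablet index then resolves, through the container's placement maps, to the few shards holding that tablet's DAOS-M KV replicas.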
