Fast Forward I/O & Storage

Fast Forward I/O & Storage
Eric Barton, Lead Architect

Department of Energy - Fast Forward Challenge
- The FastForward RFP provided US Government funding for exascale research and development
- Sponsored by 7 leading US national labs
- Aims to solve the currently intractable problems of Exascale to meet the 2020 goal of an exascale machine
- RFP elements were CPU, Memory and Filesystem; Whamcloud won the Filesystem component
  - HDF Group - HDF5 modifications and extensions
  - EMC - Burst Buffer manager and I/O Dispatcher
  - Cray - Test
- Contract renegotiated on Intel acquisition of Whamcloud
  - Intel - Arbitrary Connected Graph Computation
  - DDN - Versioning OSD

Exascale I/O technology drivers

                        2012          2020
  Nodes                 10-100K       100K-1M
  Threads/node          ~10           ~1000
  Total concurrency     100K-1M       100M-1B
  Object create         100K/s        100M/s
  Memory                1-4 PB        30-60 PB
  FS size               10-100 PB     600-3000 PB
  MTTI                  1-5 days      6 hours
  Memory dump           < 2000 s      < 300 s
  Peak I/O BW           1-2 TB/s      100-200 TB/s
  Sustained I/O BW      10-200 GB/s   20 TB/s

Exascale I/O technology drivers: (meta)data explosion
- Many billions of entities: mesh elements / graph nodes, complex relationships
- UQ ensemble runs
- Data provenance + quality
- OODB: read/write -> instantiate/persist
- Fast / ad-hoc search: "Where's the 100-year wave?"
  - Multiple indexes
  - Analysis shipping

Exascale I/O requirements
- Constant failures expected at exascale
- Filesystem must guarantee data and metadata consistency
  - Metadata at one level of abstraction is data to the level below
- Filesystem must guarantee data integrity, required end-to-end
- Filesystem must always be available
- Balanced recovery strategies
  - Transactional models: fast cleanup after failure
  - Scrubbing: repair / resource recovery that may take days to weeks

Exascale I/O Architecture
[Diagram: exascale machine compute nodes connect over the exascale network to I/O nodes with burst-buffer NVRAM; the I/O nodes connect over the site storage network to shared storage servers holding metadata and data on NVRAM and disk]

Project Goals
- Storage as a tool of the scientist
  - Manage the explosive growth and complexity of application data and metadata at Exascale
  - Support complex / flexible analysis to enable scientists to engage with their datasets
- Overcome today's filesystem scaling limits
  - Provide the storage performance and capacity Exascale science will require
- Provide unprecedented fault tolerance
  - Designed ground-up to handle failure as the norm rather than the exception
  - Guarantee data and application metadata consistency
  - Guarantee data and application metadata integrity

I/O stack
- Features & requirements
  - Non-blocking APIs: asynchronous programming models (see the sketch below)
  - Transactional == consistent through failure
  - End-to-end application data & metadata integrity
  - Low latency / OS bypass
  - Fragmented / irregular data
- Layered stack
  - Application I/O: multiple top-level APIs to support general purpose or application-specific I/O models
  - I/O Dispatcher: match conflicting application and storage object models; manage NVRAM burst buffer / cache
  - DAOS: scalable, transactional global shared object storage
[Stack diagram: Application / Query / Tools over Application I/O, I/O Dispatcher, DAOS and Storage, spanning userspace and kernel]
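
The non-blocking API style called out above can be illustrated with a minimal sketch. The names here (aio_request_t, io_submit_write, io_wait) are hypothetical stand-ins, not the Fast Forward interfaces, and completion is faked so the example stays self-contained; the point is only the shape: submission never blocks, so the caller can overlap computation with in-flight I/O.

    #include <stddef.h>
    #include <stdio.h>
    #include <string.h>

    typedef struct { int done; int rc; } aio_request_t;

    /* Submission returns immediately; completion is reported via the request. */
    static int io_submit_write(const char *obj, const void *buf, size_t len,
                               aio_request_t *req)
    {
        (void)obj; (void)buf; (void)len;
        req->done = 1;               /* a real stack would complete this later */
        req->rc = 0;
        return 0;
    }

    static int io_wait(aio_request_t *req)
    {
        while (!req->done)
            ;                        /* real code would block on an event queue */
        return req->rc;
    }

    int main(void)
    {
        char data[] = "checkpoint block";
        aio_request_t req;

        io_submit_write("obj.0", data, strlen(data), &req);
        /* ... overlap computation with the in-flight I/O here ... */
        printf("write completed, rc = %d\n", io_wait(&req));
        return 0;
    }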

Fast Forward I/O Architecture
[Diagram: compute nodes (application, HDF5 VOL, MPI-IO, POSIX, I/O forwarding client) connect over the HPC fabric (MPI / Portals) to I/O nodes (I/O forwarding server, I/O Dispatcher, burst-buffer NVRAM, Lustre client with DAOS+POSIX), which connect over the SAN fabric (OFED) to storage servers running the Lustre server]

Updates & Transactions
- Consistency and integrity
  - Guarantee required on any and all failures
  - Foundational component of system resilience
  - Required at all levels of the I/O stack: metadata at one level is data to the level below
- I/O Epochs
  - No blocking protocols: non-blocking on each OSD, non-blocking across OSDs
  - I/O epochs demark globally consistent snapshots
  - Guarantee all updates in one epoch are atomic
  - Recovery == roll back to last globally persistent epoch (see the sketch below)
  - Roll forward using client replay logs for transparent fault handling
  - Cull old epochs when the next epoch is persistent on all OSDs
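
A minimal sketch of the recovery rule described above, assuming (as an illustration only) that each OSD reports the highest epoch it has made durable: the last globally persistent epoch is the minimum across OSDs, everything after it is rolled back, and client replay logs are then re-applied forward.

    #include <stdint.h>
    #include <stdio.h>

    #define NUM_OSDS 4

    /* Globally persistent = durable on every OSD, i.e. the minimum over OSDs. */
    static uint64_t last_persistent_epoch(const uint64_t durable[NUM_OSDS])
    {
        uint64_t global = durable[0];
        for (int i = 1; i < NUM_OSDS; i++)
            if (durable[i] < global)
                global = durable[i];
        return global;
    }

    int main(void)
    {
        uint64_t durable[NUM_OSDS] = { 42, 41, 42, 40 };  /* per-OSD durable epoch */
        uint64_t recover_to = last_persistent_epoch(durable);

        printf("roll back to epoch %llu\n", (unsigned long long)recover_to);
        printf("roll forward client replay logs for epochs > %llu\n",
               (unsigned long long)recover_to);
        return 0;
    }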

I/O stack: applications and tools
- Query, search and analysis
- Index maintenance
- Data browsers, visualizers, editors
- Analysis shipping: move I/O-intensive operations to the data
- Application I/O
  - Non-blocking APIs
  - Function shipping CN/ION
  - End-to-end application data/metadata integrity
  - Domain-specific API styles: HDFS, POSIX, OODB, HDF5, complex data models

HDF5 Application I/O
- DAOS-native storage format
  - Built-for-HPC storage containers
  - Leverage I/O Dispatcher/DAOS capabilities
  - End-to-end metadata+data integrity
- New application capabilities
  - Asynchronous I/O: create/modify/delete objects, read/write dataset elements
  - Transactions: group many API operations into a single transaction (sketched below)
- Data model extensions
  - Pluggable indexing + query language
  - Pointer datatypes
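
A hypothetical sketch of what grouping several object operations into one transaction might look like from the application's point of view; tx_begin/tx_commit and the object calls are illustrative placeholders, not the Fast Forward HDF5 VOL API. The intent is that every operation issued inside the transaction lands atomically in one epoch, or not at all.

    #include <stdio.h>

    typedef struct { unsigned long epoch; int open; } tx_t;

    static void tx_begin(tx_t *tx, unsigned long epoch)
    { tx->epoch = epoch; tx->open = 1; }

    static void obj_create(tx_t *tx, const char *name)
    { printf("epoch %lu: create %s\n", tx->epoch, name); }

    static void obj_write(tx_t *tx, const char *name, int value)
    { printf("epoch %lu: write %s = %d\n", tx->epoch, name, value); }

    static void tx_commit(tx_t *tx)
    { printf("epoch %lu: all-or-nothing commit\n", tx->epoch); tx->open = 0; }

    int main(void)
    {
        tx_t tx;
        tx_begin(&tx, 7);           /* one epoch groups many API operations */
        obj_create(&tx, "/mesh/dataset");
        obj_write(&tx, "/mesh/dataset", 123);
        tx_commit(&tx);             /* either every update lands or none do */
        return 0;
    }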

I/O Dispatcher
- I/O rate/latency/bandwidth matching
  - Burst buffer / prefetch cache
  - Absorb peak application load
  - Sustain global storage performance (see the rate-matching sketch below)
- Layout optimization
  - Application object aggregation / sharding
  - Upper layers provide expected usage
- Higher-level resilience models
  - Exploit redundancy across storage objects
- Scheduler integration
  - Pre-staging / post-flushing
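
A back-of-envelope sketch of the rate matching the I/O Dispatcher performs, using round numbers from the technology-drivers table above (a ~30 PB memory dump, ~100 TB/s peak burst-buffer ingest, ~20 TB/s sustained global storage bandwidth): the application only sees the absorb time, while the drain to global storage proceeds in the background.

    #include <stdio.h>

    int main(void)
    {
        double burst_tb    = 30000.0;   /* ~30 PB memory dump */
        double absorb_tb_s = 100.0;     /* peak burst-buffer ingest, ~100 TB/s */
        double drain_tb_s  = 20.0;      /* sustained global storage BW, ~20 TB/s */

        double app_stall_s = burst_tb / absorb_tb_s;  /* time the app is blocked */
        double drain_s     = burst_tb / drain_tb_s;   /* background flush time */

        printf("application stalls %.0f s; burst buffer drains over %.0f s\n",
               app_stall_s, drain_s);
        return 0;
    }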

DAOS Containers: Distributed Application Object Storage
- Sharded transactional object storage
  - Virtualizes underlying object storage
  - Private object namespace / schema
  - Share-nothing create/destroy, read/write (see the sharding sketch below)
- 10s of billions of objects
  - Distributed over thousands of servers
  - Accessed by millions of application threads
- ACID transactions
  - Defined state on any/all combinations of failures
  - No scanning on recovery
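
A minimal sketch of share-nothing sharding under the assumption that an object ID is hashed to one of the container's shards, so creates and writes for different objects land on different servers with no coordination. The hash and layout here are illustrative, not the DAOS placement algorithm.

    #include <stdint.h>
    #include <stdio.h>

    static uint32_t shard_of(uint64_t oid, uint32_t nr_shards)
    {
        /* 64-bit mix (splitmix64 finalizer), then reduce to a shard index */
        oid ^= oid >> 30; oid *= 0xbf58476d1ce4e5b9ULL;
        oid ^= oid >> 27; oid *= 0x94d049bb133111ebULL;
        oid ^= oid >> 31;
        return (uint32_t)(oid % nr_shards);
    }

    int main(void)
    {
        uint32_t nr_shards = 1000;       /* container distributed over many servers */
        for (uint64_t oid = 1; oid <= 5; oid++)
            printf("object %llu -> shard %u\n",
                   (unsigned long long)oid, shard_of(oid, nr_shards));
        return 0;
    }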

DAOS Container
[Diagram: a container is addressed by its container FID; the container inode records UID, permissions, etc. and the FIDs of its shards; each shard records its parent FID, shard metadata (space, etc.) and an object index; each object records its parent FID, object metadata (size, etc.) and its data]
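
The diagram above maps naturally onto a few C structs; the field names below are assumptions chosen for clarity rather than any on-disk format.

    #include <stdint.h>
    #include <stdio.h>
    #include <sys/types.h>

    typedef uint64_t daos_fid_t;            /* file identifier */

    struct daos_object {
        daos_fid_t parent_fid;              /* owning shard */
        uint64_t   size;                    /* object metadata (size, etc.) */
        /* ... object data extents ... */
    };

    struct daos_shard {
        daos_fid_t          parent_fid;     /* owning container */
        uint64_t            space_used;     /* shard metadata (space, etc.) */
        struct daos_object *obj_idx;        /* object index */
        uint32_t            nr_objects;
    };

    struct daos_container_inode {
        daos_fid_t  container_fid;
        uid_t       uid;                    /* UID, permissions, etc. */
        mode_t      perms;
        daos_fid_t *shard_fids;             /* one FID per shard */
        uint32_t    nr_shards;
    };

    int main(void)
    {
        printf("inode %zu B, shard %zu B, object %zu B\n",
               sizeof(struct daos_container_inode),
               sizeof(struct daos_shard), sizeof(struct daos_object));
        return 0;
    }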

Versioning OSD
- DAOS container shards
  - Space accounting, quota
  - Shard objects
- Transactions
  - Container shard versioned by epoch
  - Implicit commit: epoch becomes durable when globally persistent
  - Explicit abort: rollback to a specific container version
  - Out-of-epoch-order updates
  - Version metadata aggregation
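
A toy sketch of the shard-version rules above: each epoch's updates create a new version of the container shard, an epoch commits implicitly once it is globally persistent, and an explicit abort rolls the shard back to an earlier version. The structures are illustrative only.

    #include <stdint.h>
    #include <stdio.h>

    #define MAX_VERSIONS 16

    struct shard_versions {
        uint64_t epochs[MAX_VERSIONS];   /* versions held by this shard */
        int      nr;
        uint64_t durable;                /* highest globally persistent epoch */
    };

    static void epoch_write(struct shard_versions *s, uint64_t epoch)
    {
        s->epochs[s->nr++] = epoch;      /* new version of the shard */
    }

    static void epoch_abort(struct shard_versions *s, uint64_t epoch)
    {
        while (s->nr > 0 && s->epochs[s->nr - 1] >= epoch)
            s->nr--;                     /* roll back to the prior version */
    }

    int main(void)
    {
        struct shard_versions s = { .nr = 0, .durable = 10 };
        epoch_write(&s, 11);
        epoch_write(&s, 12);
        epoch_abort(&s, 12);             /* explicit abort of epoch 12 */
        printf("durable up to epoch %llu; versions kept: %d (latest %llu)\n",
               (unsigned long long)s.durable, s.nr,
               (unsigned long long)s.epochs[s.nr - 1]);
        return 0;
    }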

Versioning with CoW
- New epoch directed to a clone
- Cloned extents freed when no longer referenced
- Requires epochs to be written in order
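
A minimal copy-on-write sketch, assuming a simple per-extent reference count: the new epoch's write lands in a fresh copy of the extent, and a shared extent is only freed once nothing references it (for example when the older epoch is culled). Purely illustrative.

    #include <stdio.h>
    #include <stdlib.h>

    struct extent { int refs; char data[16]; };

    static struct extent *share(struct extent *e) { e->refs++; return e; }

    static struct extent *cow_write(struct extent *e, char byte)
    {
        struct extent *copy = malloc(sizeof(*copy));
        copy->refs = 1;
        copy->data[0] = byte;          /* new epoch's data lands in the clone */
        if (--e->refs == 0)
            free(e);                   /* old extent no longer referenced */
        return copy;
    }

    int main(void)
    {
        struct extent *v1 = malloc(sizeof(*v1));   /* extent written in epoch 1 */
        v1->refs = 1; v1->data[0] = 'a';

        struct extent *v2 = share(v1);             /* epoch 2 clone shares it */
        v2 = cow_write(v2, 'b');                   /* overwrite in epoch 2 */

        printf("epoch 1 extent '%c' (refs %d), epoch 2 extent '%c'\n",
               v1->data[0], v1->refs, v2->data[0]);
        free(v2); free(v1);
        return 0;
    }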

Versioning with an intent log
- Out-of-order epoch writes logged
- Log flattened into CoW clone on epoch close
- Keeps storage system eager
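
A sketch of the intent-log idea under the assumption that each logged write records its epoch and target offset: writes are appended in arrival order and then sorted by epoch and applied ("flattened") when the epochs close, so later epochs win where they overlap. Names and layout are illustrative.

    #include <stdio.h>
    #include <stdlib.h>

    struct intent { unsigned long epoch; int offset; char byte; };

    static int by_epoch(const void *a, const void *b)
    {
        const struct intent *x = a, *y = b;
        return (x->epoch > y->epoch) - (x->epoch < y->epoch);
    }

    int main(void)
    {
        char object[9] = "........";
        struct intent log[] = {            /* writes arrive out of epoch order */
            { 9, 2, 'z' }, { 7, 2, 'x' }, { 8, 0, 'y' },
        };
        int n = sizeof(log) / sizeof(log[0]);

        qsort(log, n, sizeof(log[0]), by_epoch);  /* flatten on epoch close */
        for (int i = 0; i < n; i++)
            object[log[i].offset] = log[i].byte;  /* later epochs win */

        printf("flattened object: %s\n", object);
        return 0;
    }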

Server Collectives
- Collective client eviction: enables non-local/derived attribute caching (e.g. SOM)
- Collective client health monitoring: avoids ping storms
- Global epoch persistence: enables distributed transactions (SNS)
- Spanning tree
  - Scalable: O(log n) latency
  - Collectives and notifications
- Discovery & establishment
  - Gossip protocols
  - Accrual failure detection
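
A back-of-envelope sketch of why a spanning tree keeps collective latency at O(log n): each hop multiplies the set of reached servers by the fan-out. The numbers are illustrative.

    #include <stdio.h>

    int main(void)
    {
        unsigned long servers = 1000000;   /* target server count */
        unsigned long reached = 1;         /* the root */
        unsigned int  fanout  = 32;
        unsigned int  hops    = 0;

        while (reached < servers) {
            reached *= fanout;             /* every reached node forwards */
            hops++;
        }
        printf("%lu servers covered in %u hops (fan-out %u)\n",
               servers, hops, fanout);
        return 0;
    }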

Exascale filesystem: Integrated I/O Stack
- Epoch transaction model: non-blocking scalable object I/O
- HDF5: high-level application object I/O model; I/O forwarding
- I/O Dispatcher: burst buffer management; impedance-match application I/O performance to storage system capabilities
- DAOS: conventional namespace for administration, security & accounting; DAOS container files for transactional, scalable object I/O
[Diagram: a conventional namespace (/projects, /Legacy, /HPC, /BigData) holding a POSIX striped file, OODB simulation-data containers with OODB metadata and data objects, and MapReduce block-sequence data containers]

Thank You