Lustre* - Fast Forward to Exascale
High Performance Data Division
Eric Barton, 18th April 2013


DOE Fast Forward I/O and Storage
- Exascale R&D sponsored by 7 leading US national labs
- Solutions to currently intractable problems of Exascale, required to meet the 2020 goal of an Exascale system
- Whamcloud & partners won the I/O & Storage contract
  - Proposal to rethink the whole HPC I/O stack from top to bottom
  - Develop a working prototype
  - Demonstrate utility of the prototype in HPC and Big Data
- HDF Group: HDF5 modifications and extensions
- EMC: Burst Buffer manager & I/O Dispatcher
- Whamcloud: Distributed Application Object Storage (DAOS)
- Intel: Arbitrary Connected Graph computation
- DDN: Versioning Object Storage Device
- Cray: scale-out test
- Contract renegotiated on Intel acquisition of Whamcloud

Project Schedule
- 8 project quarters from July 2012 through June 2014
- Quarterly milestones demonstrate progress in overlapping phases: planning, architecture, design, implementation, test, benchmark
- [Schedule chart: project quarters Q1-Q8 (July 2012, Oct 2012, Jan 2013, April 2013, July 2013, Oct 2013, Jan 2014, April 2014) spanning Planning, Solution Architecture, Detailed Design, ACG Demonstrations, HDF Demonstrations, IOD Demonstrations, DAOS Demonstrations, and End-to-End Demo & Reporting]

Project Goals
- Make storage a tool of the scientist
  - Tractable management
  - Comprehensive interaction
  - Move compute to data or data to compute as appropriate
- Overcome today's I/O limits
  - Multi-petabyte datasets
  - Shared file & metadata performance
  - Horizontal scaling & jitter
- Support unprecedented fault tolerance
  - Deterministic interactions with failing hardware & software
  - Fast & scalable recovery
  - Enable multiple redundancy & integrity schemes

Non-blocking APIs
- Jitter: scheduling noise, power management, dynamic load imbalance
- Tight coupling: bulk synchronous programming simplifies application development but makes strong scaling harder
- Loose coupling: closes idle gaps; requires non-blocking IPC and I/O
- All I/O non-blocking: initiation procedure / completion event (see the sketch below)
- [Diagram: process timelines over time under tight vs. loose coupling]
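To make the initiation/completion split concrete, here is a minimal sketch using POSIX AIO. This is not the Fast Forward API; it only illustrates the same pattern the slide describes: an initiation call that returns immediately, computation overlapped with the outstanding I/O, and an explicit completion event that is harvested later.

```c
/* Initiation / completion-event pattern illustrated with POSIX AIO.
 * (Not the Fast Forward API; link with -lrt on older glibc.) */
#include <aio.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];
    int fd = open("/etc/hostname", O_RDONLY);
    if (fd < 0)
        return 1;

    struct aiocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof(buf);
    cb.aio_offset = 0;

    /* Initiation procedure: returns immediately, I/O proceeds in background */
    if (aio_read(&cb) != 0) {
        perror("aio_read");
        return 1;
    }

    /* ... overlap computation or further I/O initiations here ... */

    /* Completion event: wait for the operation, then harvest its result */
    const struct aiocb *list[1] = { &cb };
    aio_suspend(list, 1, NULL);
    ssize_t n = aio_return(&cb);
    printf("read %zd bytes without blocking the initiating call\n", n);

    close(fd);
    return 0;
}
```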

Collectives
- Scalable O(log n) group operations: open, commit
- Push communications up the stack; higher levels may
  - be able to piggyback on their own communications
  - have access to higher-performance communications
- local2global / global2local (see the sketch below)
  - A single process performs the I/O API call on behalf of a process group
  - local2global creates an opaque shareable buffer
  - global2local uses the shareable buffer to bind local resources
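A minimal sketch of the local2global / global2local pattern, assuming MPI for the group communication. The ff_* types and functions are invented stand-ins, stubbed so the example compiles; only the MPI calls are real. Rank 0 performs the open, packs an opaque blob, broadcasts it, and every other rank binds local resources from the blob instead of contacting the storage servers itself.

```c
/* local2global / global2local sharing pattern (illustrative only).
 * The ff_* names are hypothetical stand-ins; only the MPI calls are real.
 * Build with: mpicc l2g.c -o l2g */
#include <mpi.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct { uint64_t cookie; } ff_handle_t;   /* stand-in open handle */

/* --- stubs standing in for the storage API ----------------------------- */
static ff_handle_t ff_container_open(const char *path)
{
    (void)path;
    return (ff_handle_t){ .cookie = 42 };
}
static size_t ff_local2global(ff_handle_t h, void *buf, size_t len)
{
    if (buf && len >= sizeof(h))
        memcpy(buf, &h, sizeof(h));                /* pack opaque blob */
    return sizeof(h);                              /* blob size */
}
static ff_handle_t ff_global2local(const void *buf)
{
    ff_handle_t h;
    memcpy(&h, buf, sizeof(h));                    /* bind local resources */
    return h;
}
/* ------------------------------------------------------------------------ */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    ff_handle_t h = { 0 };
    size_t glen = 0;
    if (rank == 0) {                               /* one process opens */
        h = ff_container_open("/daos/container0");
        glen = ff_local2global(h, NULL, 0);        /* query blob size */
    }
    MPI_Bcast(&glen, (int)sizeof(glen), MPI_BYTE, 0, MPI_COMM_WORLD);

    void *blob = malloc(glen);
    if (rank == 0)
        ff_local2global(h, blob, glen);            /* pack shareable buffer */
    MPI_Bcast(blob, (int)glen, MPI_BYTE, 0, MPI_COMM_WORLD);

    ff_handle_t local = ff_global2local(blob);     /* all ranks bind locally */
    printf("rank %d bound handle, cookie=%llu\n",
           rank, (unsigned long long)local.cookie);

    free(blob);
    MPI_Finalize();
    return 0;
}
```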

Locking & Caching
- Serialization kills scalability
- It's not the storage system's responsibility
  - Storage is not message passing: tightly coupled processes should communicate directly
  - Low-level I/O should not predict high-level caching requirements: read-ahead / write-behind is different from the working set
- Avoid premature block alignment: it's a needless source of contention
- Don't let writers block readers, or vice versa

Atomicity
- Consistency & integrity guarantees
  - Required throughout the I/O stack
  - Required for data as well as metadata: metadata is data to the level below
- Cannot afford O(system size) recovery: don't FSCK with DAOS
- Transactions
  - Move the storage system between consistent states
  - Recovery == roll back uncommitted transactions
  - Prefer O(0) vs. O(transaction size) recovery time
  - Simplified interaction with failing subsystems for upper levels
  - Nestable transactions required in a layered stack
- Scrub still required to protect against bitrot

Transactions
- Transactions ordered by epoch #
  - Writes apply in epoch order
  - All writes in an epoch are committed atomically
  - All reads of an epoch see consistent data
- Epoch # applied within an epoch scope
  - Container granularity
  - Multi-process and multi-object writes
  - Single committer for each open scope
- Arbitrary transaction pipeline depth; the system may aggregate epochs
- Highest Committed Epoch (HCE) determined on epoch scope open; slip the scope to check/wait for updates
  (see the sketch of the epoch model below)
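A toy sketch of the epoch model, assuming one committer per open scope. All names here (scope, obj_write, epoch_commit) are invented for illustration and are not the DAOS/IOD API; the point is only the semantics: writes are tagged with an epoch, nothing becomes visible until that epoch commits, and a commit advances the HCE atomically.

```c
/* Toy model of epoch-ordered transactional writes (illustrative names only). */
#include <stdint.h>
#include <stdio.h>

typedef uint64_t epoch_t;

struct scope {
    epoch_t hce;                  /* highest committed epoch, learned at open */
};

/* Writers tag updates with their target epoch; nothing in epoch e is
 * visible to readers until epoch e commits. */
static void obj_write(struct scope *s, int obj, epoch_t e, uint64_t off)
{
    printf("buffered write: obj=%d epoch=%llu off=%llu (HCE=%llu, not visible yet)\n",
           obj, (unsigned long long)e, (unsigned long long)off,
           (unsigned long long)s->hce);
}

/* Exactly one committer per open scope; committing epoch HCE+1 makes all of
 * its writes, across processes and objects, visible atomically. */
static void epoch_commit(struct scope *s, epoch_t e)
{
    if (e != s->hce + 1) {
        printf("epoch %llu cannot commit before epoch %llu\n",
               (unsigned long long)e, (unsigned long long)(s->hce + 1));
        return;
    }
    s->hce = e;
    printf("epoch %llu committed atomically, HCE now %llu\n",
           (unsigned long long)e, (unsigned long long)s->hce);
}

int main(void)
{
    struct scope s = { .hce = 7 };        /* HCE returned when the scope opened */
    epoch_t e = s.hce + 1;                /* write into the next epoch */

    obj_write(&s, /*obj=*/1, e, 0);       /* multi-object, multi-process writes */
    obj_write(&s, /*obj=*/2, e, 4096);    /* may all target the same epoch      */

    epoch_commit(&s, e + 1);              /* out of order: refused */
    epoch_commit(&s, e);                  /* the single committer commits */
    return 0;
}
```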

Layered I/O stack
- Applications and tools: index, search, analysis, viz, browse, edit
- Analysis shipping: in-transit analysis & visualization
- Application I/O API: multiple domain-specific API styles & schemas
- I/O Dispatcher: impedance-matches application requirements to storage capability; Burst Buffer manager
- DAOS-HA: high-availability scalable object storage (follow-on project from Fast Forward)
- DAOS: transactional scalable object storage
- [Stack diagram: Application Query & Tools / Application I/O / I/O Dispatcher / DAOS-HA / DAOS / Lustre Client / Lustre Server]

Scalable server health & collectives
- Health
  - Gossip distributes "I'm alive"
  - Fault tolerant, O(log n) latency (see the simulation sketch below)
- Tree overlay networks
  - Fault tolerant: a collective completes with failure on group membership change
  - Scalable server communications: scalable commit, collective client eviction, distributed client health monitoring
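A single-process toy simulation, not the Fast Forward implementation, of why gossip gives O(log n) dissemination latency: each server that has heard a liveness report forwards it to one random peer per round, so the informed set roughly doubles per round until it saturates.

```c
/* Toy gossip simulation: each informed server pushes "server 0 is alive"
 * to one random peer per round. Illustrative only. */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1024                          /* number of servers */

int main(void)
{
    bool heard[N] = { false };          /* has server i heard the report? */
    heard[0] = true;
    srand(1);

    int rounds = 0, informed = 1;
    while (informed < N) {
        bool next[N];
        for (int i = 0; i < N; i++)
            next[i] = heard[i];
        for (int i = 0; i < N; i++) {   /* each informed server gossips once */
            if (heard[i])
                next[rand() % N] = true;
        }
        informed = 0;
        for (int i = 0; i < N; i++) {
            heard[i] = next[i];
            informed += next[i];
        }
        rounds++;
        printf("round %2d: %4d/%d servers informed\n", rounds, informed, N);
    }
    printf("full dissemination in %d rounds for N=%d: O(log n)\n", rounds, N);
    return 0;
}
```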

Versioned object storage
- COW & snapshot: transaction rollback
- Version intent log: applies writes in epoch order
- Writes persisted on arrival
  - No serialisation / backpressure
  - Full OSD blocks don't have to move
- Extent metadata inserted in epoch order, starting as soon as previous epochs complete
  - On arrival when possible
  - Otherwise via the version intent log
  (see the sketch below)
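An illustrative sketch, with invented names rather than the VOSD code, of the version-intent-log idea described above: data is persisted the moment it arrives, while the extent-metadata insert for an epoch is logged and replayed only once all earlier epochs have completed, keeping metadata strictly in epoch order.

```c
/* Version intent log sketch (illustrative names; not the VOSD implementation). */
#include <stdint.h>
#include <stdio.h>

#define MAX_LOG 16

struct intent {
    uint64_t epoch, offset, length;     /* deferred extent-metadata insert */
};

static struct intent vil[MAX_LOG];      /* version intent log */
static int nvil;
static uint64_t applied = 1;            /* all epochs <= this have metadata applied */

/* Data is persisted on arrival; only the metadata insert is deferred. */
static void write_arrived(uint64_t epoch, uint64_t off, uint64_t len)
{
    printf("epoch %llu: data [%llu,+%llu) persisted on arrival\n",
           (unsigned long long)epoch, (unsigned long long)off,
           (unsigned long long)len);
    vil[nvil++] = (struct intent){ epoch, off, len };
}

/* When the next epoch in order completes, replay its logged inserts.
 * (For brevity, out-of-order completions are simply retried by the caller.) */
static void epoch_completed(uint64_t epoch)
{
    if (epoch != applied + 1) {
        printf("epoch %llu complete, waiting for earlier epochs\n",
               (unsigned long long)epoch);
        return;
    }
    for (int i = 0; i < nvil; i++)
        if (vil[i].epoch == epoch)
            printf("epoch %llu: extent metadata [%llu,+%llu) inserted\n",
                   (unsigned long long)epoch,
                   (unsigned long long)vil[i].offset,
                   (unsigned long long)vil[i].length);
    applied = epoch;
}

int main(void)
{
    write_arrived(3, 8192, 4096);       /* epoch 3 data can land before epoch 2 */
    write_arrived(2, 0, 4096);
    epoch_completed(3);                 /* must wait: epoch 2 not yet complete */
    epoch_completed(2);                 /* epoch 2 metadata inserted */
    epoch_completed(3);                 /* now epoch 3 metadata inserted */
    return 0;
}
```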

DAOS Containers
- Virtualize Lustre's underlying object storage
  - Shared-nothing: 10s of billions of objects, thousands of servers
  - Private object namespace / schema; filesystem namespace unpolluted
- Transactional PGAS (see the sketch below)
  - Baseline: addr = <shard.object.offset>
  - HA: addr = <layout.object.offset>
- Read & write only
  - No create/destroy
  - Punch == store 0s efficiently
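A sketch of what this flat "transactional PGAS" address space looks like to a client. The struct and prototypes below are invented to show the shape of the interface (an address is just <shard.object.offset>, and the operations are epoch-tagged reads, writes, and punch); they are not the actual DAOS API.

```c
/* Shape of the baseline transactional-PGAS interface (hypothetical names). */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t epoch_t;

struct daos_addr {
    uint32_t shard;                 /* shard (server) the object lives on   */
    uint64_t object;                /* object id in the container namespace */
    uint64_t offset;                /* byte offset within the object        */
};

/* The whole baseline interface: epoch-tagged reads and writes plus punch.
 * No object create/destroy is needed; prototypes shown for shape only. */
int obj_write(struct daos_addr a, epoch_t e, const void *buf, size_t len);
int obj_read (struct daos_addr a, epoch_t e, void *buf, size_t len);
int obj_punch(struct daos_addr a, epoch_t e, size_t len);  /* store 0s cheaply */

int main(void)
{
    struct daos_addr a = { .shard = 3, .object = 0x1234, .offset = 1u << 20 };
    printf("addr = <%u.%llu.%llu>\n", a.shard,
           (unsigned long long)a.object, (unsigned long long)a.offset);
    return 0;
}
```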

I/O Dispatcher
- Abstracts Burst Buffer (NVRAM) and global (DAOS) storage
- I/O rate/latency/bandwidth matching
  - Absorb peak application load
  - Sustain global storage performance
- Layout optimization guided by upper layers
  - Application object aggregation / sharding
  - Stream transformation, semantic resharding
  - Multi-format semantic replication
- Buffers transactions
- Higher-level resilience models: exploit redundancy across storage objects
- Scheduler integration: pre-staging / post-flushing
- End-to-end data integrity

HDF5 Application I/O
- Built-for-HPC object database
- New application capabilities
  - Non-blocking I/O: create/modify/delete HDF5 objects, read/write HDF5 dataset elements
  - Atomic transactions: group multiple HDF5 operations (see the sketch below)
- HDF5 data model extensions
  - Pluggable indexing + query language
  - Support for dynamic data structures
- New storage format
  - Leverages I/O Dispatcher/DAOS capabilities
  - End-to-end metadata + data integrity
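To show what gets grouped, here is a sketch using only standard, blocking HDF5 calls that exist today; the comments mark where the Fast Forward extensions described above would bracket the updates in a transaction and return a completion event for non-blocking operation. The extension calls themselves are not shown, since their names and signatures are specific to the prototype.

```c
/* Grouping HDF5 updates: standard HDF5 calls, with comments marking where
 * the Fast Forward transactional / non-blocking extensions would apply. */
#include <hdf5.h>

int main(void)
{
    hid_t file = H5Fcreate("checkpoint.h5", H5F_ACC_TRUNC,
                           H5P_DEFAULT, H5P_DEFAULT);

    /* --- transaction begin: all updates below would share one epoch --- */

    hsize_t dims[1] = { 1024 };
    hid_t space = H5Screate_simple(1, dims, NULL);
    hid_t dset  = H5Dcreate2(file, "temperature", H5T_NATIVE_DOUBLE, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    double data[1024] = { 0 };
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

    /* --- transaction commit: the object create and the element writes
     *     become visible atomically; with the non-blocking extensions the
     *     commit would return immediately and signal a completion event --- */

    H5Dclose(dset);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
}
```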

Big Data: Arbitrary Connected Graphs
- HDF5 Adaptation Layer (HAL)
  - Storage API for ACG Ingress & Computation Kernel applications
  - Stores partitioned graphs and associated information using HDF5
- ACG Ingress
  - Extract meaningful information from raw data
  - Transform data-dependency information into a graph
  - Partition graphs to maximize efficiency in handling and computation
- Graph Computation Kernel
  - Machine learning: LDA, CoEM, ALS, etc.
  - Graph analytics: PageRank, triangle counting, community structure

Follow-on development
- Productization & system integration
  - Job scheduler integration
  - In-transit analysis runtime
  - Analysis shipping runtime
  - Monitoring and management
- Btrfs VOSD: in-kernel (GPL) storage target
- DAOS-HA: replication / erasure coding
- IOD/HDF5-HA: fault-tolerant Burst Buffer & I/O forwarding
- Additional top-level APIs
  - Application domain-specific, e.g. OODB, MapReduce, etc.
  - Layered over HDF5 or directly over IOD
  - Associated tools
