UK LUG 10th July 2012: Lustre at Exascale. Eric Barton, CTO, Whamcloud, Inc.

Transcription:

UK LUG, 10th July 2012
Lustre at Exascale
Eric Barton, CTO, Whamcloud, Inc. eeb@whamcloud.com

Agenda
- Exascale I/O requirements
- Exascale I/O model

Exascale I/O technology drivers

                      2012          2020
  Nodes               10-100K       100K-1M
  Threads/node        ~10           ~1000
  Total concurrency   100K-1M       100M-1B
  Object create       100K/s        100M/s
  Memory              1-4 PB        30-60 PB
  FS size             10-100 PB     600-3000 PB
  MTTI                1-5 days      6 hours
  Memory dump         < 2000 s      < 300 s
  Peak I/O BW         1-2 TB/s      100-200 TB/s
  Sustained I/O BW    10-200 GB/s   20 TB/s
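
The memory-dump rows follow directly from dividing memory size by peak I/O bandwidth. A back-of-envelope check of the table's figures (the arithmetic below is mine, written as Python; the numbers are taken from the table):

    PB = 10**15    # bytes
    TB_S = 10**12  # bytes per second

    def dump_time_s(memory_bytes, peak_bw_bytes_per_s):
        """Time to dump all of memory to storage at peak I/O bandwidth."""
        return memory_bytes / peak_bw_bytes_per_s

    # 2012-class system: 2 PB of memory at 2 TB/s -> 1000 s (within the < 2000 s target)
    print(dump_time_s(2 * PB, 2 * TB_S))
    # 2020-class system: 60 PB of memory at 200 TB/s -> 300 s (the < 300 s target)
    print(dump_time_s(60 * PB, 200 * TB_S))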

Exascale I/O technology drivers: (meta)data explosion
- Many billions of entities: mesh elements, graph nodes, timesteps
- Complex relationships: UQ ensemble runs, OODB
- Read/Write -> Instantiate/Persist
- Multiple indexes
- Ad-hoc search ("Where's the 100-year wave?")
- Data provenance + quality

Exascale I/O Architecture
[Architecture diagram] Exascale machine: compute nodes -> exascale network -> I/O nodes with burst-buffer NVRAM. Shared storage: site storage network -> storage servers with metadata NVRAM and disk.

Exascale I/O requirements
- (Meta)data consistency + integrity
  - Metadata at one level is data in the level below
  - Foundational component of system resilience
  - Required end-to-end
- Balanced recovery strategies
  - Transactional models: fast cleanup on failure; filesystem always available; filesystem always exists in a defined state
  - Scrubbing: repair / resource recovery that may take days to weeks

Software engineering
- Stability
  - Storage system underpins system resilience
  - Stabilization effort required is non-trivial
  - Expensive/scarce scale development and test resources
- Build on existing components when possible
  - LNET (network abstraction), OSD API (backend storage abstraction)
- Implement new subsystems when required
  - Distributed Application Object Storage (DAOS)
- Clean stack
  - Common base features in lower layers
  - Application-domain-specific features in higher layers
  - APIs that enable concurrent development

Exascale shared filesystem
- Conventional namespace (/projects)
  - Works at human scale
  - Administration, security & accounting
  - Legacy data and applications
- DAOS containers (Distributed Application Object Storage)
  - Sharded; application / middleware determined object namespace
  - Works at exascale
  - Embedded in the conventional namespace (see the sketch below)
- Storage pools
  - Quota
  - Streaming vs. IOPS
[Namespace diagram] /Legacy, /HPC and /BigData subtrees holding, respectively, a POSIX striped file (stripes a, b, c, ...), simulation data (OODB metadata plus sharded OODB data objects), and MapReduce block-sequence data.
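
To make the container idea concrete, the toy Python model below (my illustration, not DAOS code; the Container class and the path are invented) shows a sharded repository whose object keys are chosen by the application, reachable through a single entry in the conventional namespace:

    class Container:
        """Toy sharded object container; real shards would live on many servers."""
        def __init__(self, nshards):
            self.shards = [dict() for _ in range(nshards)]

        def _shard_for(self, obj_id):
            return self.shards[hash(obj_id) % len(self.shards)]

        def persist(self, obj_id, payload):
            self._shard_for(obj_id)[obj_id] = payload

        def instantiate(self, obj_id):
            return self._shard_for(obj_id)[obj_id]

    # The conventional namespace works at human scale; the container hangs off it.
    namespace = {"/projects/HPC/simulation_data": Container(nshards=4)}

    run = namespace["/projects/HPC/simulation_data"]
    run.persist(("timestep", 42, "mesh"), b"...element data...")
    print(run.instantiate(("timestep", 42, "mesh")))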

I/O stack
[Stack diagram] Application and query tools sit on Application I/O and the I/O Scheduler in userspace, over DAOS and storage in the kernel.
Stack features / requirements
- Userspace: easier development and debug; low latency / OS bypass; fragmented / irregular data
- Kernel: leverage Lustre infrastructure
- Transactional == consistent through failure
- End-to-end application data/metadata integrity
- Application I/O: API for general-purpose or application-specific I/O models
- I/O Scheduler: object aggregation / layout optimization / caching
- DAOS: global shared object storage (layering sketched below)
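
A minimal sketch of this layering, assuming invented class and method names rather than any real API: each layer calls only the layer directly below it, keeping application-domain features above the common DAOS base.

    class DAOS:
        """Global shared object storage (the kernel layer in the slide's stack)."""
        def __init__(self):
            self.objects = {}

        def write(self, obj_id, data):
            self.objects[obj_id] = data

    class IOScheduler:
        """Aggregates application objects before they reach DAOS."""
        def __init__(self, daos):
            self.daos = daos
            self.pending = []

        def submit(self, obj_id, data):
            self.pending.append((obj_id, data))

        def flush(self):
            # Layout optimisation / aggregation would happen here.
            for obj_id, data in self.pending:
                self.daos.write(obj_id, data)
            self.pending.clear()

    class ApplicationIO:
        """Domain-specific API (POSIX, HDF5, OODB, ...) built on the scheduler."""
        def __init__(self, scheduler):
            self.scheduler = scheduler

        def persist(self, key, value):
            self.scheduler.submit(key, value)

    daos = DAOS()
    app = ApplicationIO(IOScheduler(daos))
    app.persist("mesh/0", b"cells")
    app.scheduler.flush()
    print(daos.objects)   # {'mesh/0': b'cells'}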

Updates, transactions & I/O epochs
- (Meta)data consistency / integrity
  - Guarantees required on any/all failures
  - Foundational component of system resilience
  - Metadata at one level is data in the level below; required at all levels
- I/O epochs (see the sketch below)
  - Epochs demarcate globally consistent snapshots
  - Copy-on-write OSD: roll back to the last globally consistent snapshot; updates within one epoch are guaranteed atomic
- No blocking / barriers
  - Non-blocking on each OSD: the next epoch writes into cache while the previous epoch flushes to disk
  - Non-blocking across OSDs: epochs advance independently on each OSD; space is freed only when the next epoch's snapshot is stable on all OSDs
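
The epoch behaviour can be modelled in a few lines of toy Python (not Lustre/DAOS source; EpochOSD and its methods are hypothetical): writes carry an epoch tag, the OSD keeps its last globally stable snapshot copy-on-write, and a failure rolls back to that snapshot with no scanning.

    class EpochOSD:
        def __init__(self):
            self.stable = {}    # last globally consistent snapshot
            self.pending = {}   # epoch -> {key: value}, still in cache / in flight

        def write(self, epoch, key, value):
            # Non-blocking: epoch e+1 can be written while epoch e is flushing.
            self.pending.setdefault(epoch, {})[key] = value

        def commit(self, epoch):
            # Called once this epoch is stable on *all* OSDs; only then may the
            # previous snapshot's space be reclaimed.
            self.stable = {**self.stable, **self.pending.pop(epoch, {})}

        def rollback(self, epoch):
            # Failure during the epoch: discard its updates atomically.
            self.pending.pop(epoch, None)

    osd = EpochOSD()
    osd.write(7, "obj.a", b"v1")
    osd.commit(7)
    osd.write(8, "obj.a", b"v2")
    osd.rollback(8)             # simulate a failure before epoch 8 stabilises
    print(osd.stable["obj.a"])  # b'v1' -- back to the last consistent snapshot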

I/O stack: applications and tools / Application I/O
- Applications and tools
  - Query, search and analysis
  - Index maintenance
  - Data browsers, visualisers, editors
  - Analysis shipping: move bandwidth-intensive operations to the data
- Application I/O
  - Non-blocking APIs (see the sketch below)
  - Function shipping CN/ION
  - End-to-end application data/metadata integrity
  - Domain-specific API styles: HDFS, POSIX, OODB, HDF5, ...
  - Complex data models
  - Instantiate/persist (not read/write)
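
A hedged sketch of the non-blocking instantiate/persist style (function names are hypothetical, not an existing interface); a thread pool stands in for function shipping from compute nodes to I/O nodes:

    from concurrent.futures import ThreadPoolExecutor

    _store = {}
    _io = ThreadPoolExecutor(max_workers=4)   # stand-in for CN -> ION shipping

    def persist(obj_id, obj):
        """Start persisting an application object; returns a future immediately."""
        return _io.submit(_store.__setitem__, obj_id, obj)

    def instantiate(obj_id):
        """Start re-instantiating an object; compute overlaps with the fetch."""
        return _io.submit(_store.__getitem__, obj_id)

    f = persist(("ensemble", 3), {"mesh": [1, 2, 3]})
    # ... the application keeps computing here ...
    f.result()                                # wait only when the result is needed
    print(instantiate(("ensemble", 3)).result())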

I/O stack: I/O Scheduler
- Layout optimization
  - Application object aggregation
  - Object / shard placement
  - n -> m stream transformation
  - Properties for expected usage
- Scheduler integration
  - Pre-staging cache: minimize application read latency
  - Burst buffer: absorb peak application write load (see the sketch below)
  - Higher-level resilience models: exploit redundancy across storage objects
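
A toy illustration of the burst buffer and the n -> m stream transformation (names invented): many small per-thread write streams are absorbed at buffer speed and later drained to storage as a few large aggregated streams.

    from collections import defaultdict

    class BurstBufferScheduler:
        def __init__(self, n_storage_streams):
            self.m = n_storage_streams
            self.buffer = defaultdict(list)   # storage stream -> queued extents

        def absorb(self, app_stream_id, extent):
            # Peak application write load lands in the NVRAM-speed buffer first.
            self.buffer[app_stream_id % self.m].append(extent)

        def drain(self):
            # Later, each storage stream is written out as one large sequential I/O.
            return {s: b"".join(extents) for s, extents in self.buffer.items()}

    sched = BurstBufferScheduler(n_storage_streams=2)
    for thread in range(8):                   # n = 8 application streams
        sched.absorb(thread, f"data{thread};".encode())
    print(sched.drain())                      # m = 2 aggregated storage streams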

I/O stack: DAOS containers
- Sharded object repository
  - Object resilience: N-way mirrors / RAID6 (see the sketch below)
  - Data management: migration over pools, migration between containers
  - 10s of billions of objects distributed over thousands of OSSs
- Share-nothing create/destroy, read/write
  - Millions of application threads
- ACID transactions on objects and containers
  - Defined state on any/all combinations of failures
  - No scanning on recovery
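
N-way mirroring for object resilience, sketched as a toy (not DAOS code; class and method names are mine): every object is written to N distinct shards, so a read survives a failed shard and recovery needs no namespace scan.

    class MirroredContainer:
        def __init__(self, nshards, nmirrors):
            self.shards = [dict() for _ in range(nshards)]
            self.nmirrors = nmirrors

        def _replicas(self, obj_id):
            # Place N mirrors on N distinct shards.
            first = hash(obj_id) % len(self.shards)
            return [(first + i) % len(self.shards) for i in range(self.nmirrors)]

        def put(self, obj_id, data):
            for s in self._replicas(obj_id):
                self.shards[s][obj_id] = data

        def get(self, obj_id, failed=()):
            # Read from any surviving mirror; no scan of the container needed.
            for s in self._replicas(obj_id):
                if s not in failed:
                    return self.shards[s][obj_id]
            raise KeyError(obj_id)

    c = MirroredContainer(nshards=8, nmirrors=3)
    c.put("checkpoint/rank42", b"state")
    print(c.get("checkpoint/rank42", failed={hash("checkpoint/rank42") % 8}))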

Thank You
Eric Barton, CTO, Whamcloud, Inc. eeb@whamcloud.com