Lustre at Exascale
UK LUG, 10th July 2012

Eric Barton, CTO, Whamcloud, Inc.
eeb@whamcloud.com
Agenda
- Exascale I/O requirements
- Exascale I/O model
Exascale I/O technology drivers

                        2012          2020
  Nodes                 10-100K       100K-1M
  Threads/node          ~10           ~1000
  Total concurrency     100K-1M       100M-1B
  Object create         100K/s        100M/s
  Memory                1-4PB         30-60PB
  FS Size               10-100PB      600-3000PB
  MTTI                  1-5 Days      6 Hours
  Memory Dump           < 2000s       < 300s
  Peak I/O BW           1-2TB/s       100-200TB/s
  Sustained I/O BW      10-200GB/s    20TB/s
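These targets are internally consistent: the peak-bandwidth row follows from dumping memory within the memory-dump window. For 2020, dumping 30-60PB in under 300s requires 30,000TB / 300s = 100TB/s up to 60,000TB / 300s = 200TB/s; the 2012 column works out the same way, with 1-4PB in under 2000s giving roughly 0.5-2TB/s.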
Exascale I/O technology drivers
- (Meta)data explosion
  - Many billions of entities: mesh elements, graph nodes, timesteps
  - Complex relationships: UQ ensemble runs
- OODB
  - Read/Write -> Instantiate/Persist
  - Multiple indexes
- Ad-hoc search
  - "Where's the 100-year wave?"
- Data provenance + quality
Exascale I/O Architecture
- Exascale Machine
  - Compute Nodes
  - Exascale Network
  - I/O Nodes with burst-buffer NVRAM
- Shared Storage
  - Site Storage Network
  - Storage Servers: metadata NVRAM + disk
Exascale I/O requirements
- (Meta)data consistency + integrity
  - Metadata at one level is data in the level below
  - Foundational component of system resilience
  - Required end-to-end
- Balanced recovery strategies
  - Transactional models
    - Fast cleanup on failure
    - Filesystem always available, always in a defined state
  - Scrubbing
    - Repair / resource recovery that may take days to weeks
Software engineering
- Stability
  - Storage system underpins system resilience
  - Non-trivial stabilization effort required
  - At-scale development and test resources are expensive and scarce
- Build on existing components when possible
  - LNET (network abstraction), OSD API (backend storage abstraction)
- Implement new subsystems when required
  - Distributed Application Object Storage (DAOS)
- Clean stack
  - Common base features in lower layers
  - Application-domain-specific features in higher layers
  - APIs that enable concurrent development
Exascale shared filesystem
- Conventional namespace (/projects)
  - Works at human scale
  - Administration, security & accounting
  - Legacy data and applications
- DAOS Containers: Distributed Application Object Storage (see the sketch below)
  - Sharded: application / middleware determined object namespace
  - Works at exascale
  - Embedded in conventional namespace
  - Storage pools: quota, streaming v. IOPS
[Diagram: a /projects tree with /Legacy holding a POSIX striped file (stripes a, b, c), /HPC holding a simulation-data container (OODB metadata plus sharded OODB data objects), and /BigData holding a MapReduce block-sequence container]
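To make "embedded in the conventional namespace" concrete, here is a minimal sketch in C of how middleware might open a container found at a POSIX path and then address sharded objects directly. Everything here is hypothetical: the names (daos_container_open, daos_object_write), types and semantics are illustrative assumptions, not a published API.

    /* Hypothetical illustration only -- all names are assumptions. */
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t daos_obj_id_t;  /* application/middleware-chosen id */
    typedef int      daos_handle_t;  /* opaque container handle          */

    /* The conventional namespace locates the container, so existing
     * administration, security and accounting apply at human scale.    */
    static int daos_container_open(const char *path, daos_handle_t *coh)
    {
            printf("open container at %s\n", path);
            *coh = 1;
            return 0;
    }

    /* Inside the container, I/O bypasses POSIX: the application picks
     * the shard and object id itself, so a million writers need no
     * shared directory locks or server-side name lookups.              */
    static int daos_object_write(daos_handle_t coh, int shard,
                                 daos_obj_id_t oid, uint64_t off,
                                 const void *buf, size_t len)
    {
            (void)buf;
            printf("coh=%d shard=%d oid=%llu: write %zu bytes @%llu\n",
                   coh, shard, (unsigned long long)oid, len,
                   (unsigned long long)off);
            return 0;
    }

    int main(void)
    {
            daos_handle_t coh;

            daos_container_open("/projects/HPC/simulation-data", &coh);
            daos_object_write(coh, 0, 42, 0, "timestep", 8);
            return 0;
    }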
I/O stack
[Stack diagram: Application / Query Tools -> Application I/O -> I/O Scheduler (userspace) -> DAOS -> Storage (kernel)]
- Stack features / requirements
  - Userspace
    - Easier development and debug
    - Low latency / OS bypass
    - Fragmented / irregular data
  - Kernel
    - Leverage Lustre infrastructure
    - Transactional == consistent through failure
    - End-to-end application data/metadata integrity
- Application I/O: API for general-purpose or application-specific I/O model
- I/O Scheduler: object aggregation / layout optimization / caching
- DAOS: global shared object storage
Transactions: I/O Epochs
- (Meta)data consistency / integrity
  - Guarantee required on any/all failures
  - Foundational component of system resilience
  - Metadata at one level is data in the level below
  - Required at all levels
- I/O Epochs (modelled in the sketch below)
  - Epochs demarcate globally consistent snapshots
  - Copy-on-write OSD
    - Roll back to last globally consistent snapshot
    - Guarantee updates in one epoch are atomic
- No blocking / barriers
  - Non-blocking on each OSD
    - Next epoch writes into cache while the previous epoch flushes to disk
  - Non-blocking across OSDs
    - Epochs advance independently on each OSD
    - Free space only when the next epoch's snapshot is stable on all OSDs
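The non-blocking epoch mechanics can be shown with a tiny model. The following C sketch is illustrative, not Lustre code; it assumes a fixed set of OSDs and models only the key invariant: each OSD advances its stable epoch independently, the globally consistent recovery point is the minimum stable epoch across OSDs, and copy-on-write space for an epoch may be reclaimed only once the following epoch is stable everywhere.

    /* Illustrative model of I/O epochs -- not Lustre code. */
    #include <stdint.h>
    #include <stdio.h>

    #define NOSD 4

    /* Per-OSD state: the newest epoch this OSD has made stable. */
    static uint64_t osd_stable[NOSD];

    /* The globally consistent snapshot is the minimum across OSDs:
     * after any failure, every OSD can roll back to this epoch.    */
    static uint64_t global_consistent(void)
    {
            uint64_t min = osd_stable[0];

            for (int i = 1; i < NOSD; i++)
                    if (osd_stable[i] < min)
                            min = osd_stable[i];
            return min;
    }

    /* OSDs flush independently: no barrier across OSDs, and writers
     * fill the next epoch's cache while this flush is in progress. */
    static void osd_flush(int osd, uint64_t epoch)
    {
            osd_stable[osd] = epoch;
            /* COW blocks superseded in 'epoch' become reclaimable
             * only once global_consistent() >= epoch, i.e. the
             * snapshot is stable on *all* OSDs.                     */
    }

    int main(void)
    {
            osd_flush(0, 1);
            osd_flush(1, 1);
            osd_flush(2, 1);
            printf("recovery point: epoch %llu\n",        /* still 0 */
                   (unsigned long long)global_consistent());
            osd_flush(3, 1);
            printf("recovery point: epoch %llu\n",        /* now 1  */
                   (unsigned long long)global_consistent());
            return 0;
    }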
I/O stack
[Stack diagram: Application / Query Tools -> Application I/O -> I/O Scheduler -> DAOS -> Storage]
- Applications and tools
  - Query, search and analysis
  - Index maintenance
  - Data browsers, visualisers, editors
  - Analysis shipping: move bandwidth-intensive operations to the data
- Application I/O (see the sketch below)
  - Non-blocking APIs
  - Function shipping CN/ION
  - End-to-end application data/metadata integrity
  - Domain-specific API style: HDFS, POSIX, OODB, HDF5, ...
  - Complex data models: instantiate/persist (not read/write)
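As an illustration of the non-blocking, instantiate/persist style, this sketch submits a whole application object for persistence and overlaps computation until a completion event fires. The event structure and obj_persist() function are hypothetical stand-ins for whatever a domain-specific API layer would actually expose.

    /* Hypothetical non-blocking persist API -- names are illustrative. */
    #include <stddef.h>
    #include <stdio.h>

    typedef struct {
            int done;   /* completion flag, set by the I/O stack */
            int rc;     /* result, valid once done != 0          */
    } aio_event_t;

    /* Persist an application object (not a byte stream): returns
     * immediately; in a real stack the request would be function-
     * shipped CN->ION and checksummed end-to-end.  Stubbed here.   */
    static int obj_persist(int coh, unsigned long oid, const void *obj,
                           size_t len, aio_event_t *ev)
    {
            (void)coh; (void)oid; (void)obj; (void)len;
            ev->rc = 0;
            ev->done = 1;
            return 0;
    }

    int main(void)
    {
            struct { double t; double mesh[8]; } timestep = { 0 };
            aio_event_t ev = { 0 };

            obj_persist(0, 7, &timestep, sizeof(timestep), &ev);
            /* ...continue computing here while the I/O completes... */
            while (!ev.done)
                    ; /* real code would poll or wait on an event queue */
            printf("persist completed, rc=%d\n", ev.rc);
            return 0;
    }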
I/O Scheduler
[Stack diagram: Application / Query Tools -> Application I/O -> I/O Scheduler -> DAOS -> Storage]
- Layout optimization
  - Application object aggregation
  - Object / shard placement
  - n -> m stream transformation
  - Properties for expected usage
- Scheduler integration
  - Pre-staging cache: minimize application read latency
  - Burst buffer: absorb peak application write load (see the sketch below)
- Higher-level resilience models
  - Exploit redundancy across storage objects
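A rough model shows why the burst buffer matters. Using mid-range values from the technology-driver table (illustrative numbers, not measurements), the application waits only for the absorb phase at peak NVRAM rate, while the scheduler drains to shared storage at the sustained rate well before the next defensive checkpoint is due.

    /* Back-of-envelope burst-buffer model -- illustrative numbers. */
    #include <stdio.h>

    int main(void)
    {
            double checkpoint_pb  = 40.0;   /* memory dump size, PB     */
            double peak_tb_s      = 150.0;  /* CN -> ION burst, TB/s    */
            double sustained_tb_s = 20.0;   /* ION -> storage, TB/s     */

            /* Application blocks only while the IONs absorb the dump. */
            double absorb_s = checkpoint_pb * 1000.0 / peak_tb_s;
            /* The scheduler drains in the background afterwards.      */
            double drain_s  = checkpoint_pb * 1000.0 / sustained_tb_s;

            printf("absorb: %.0fs, drain: %.0fs (6h MTTI = 21600s)\n",
                   absorb_s, drain_s);
            return 0;
    }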
DAOS Containers
[Stack diagram: Application / Query Tools -> Application I/O -> I/O Scheduler -> DAOS -> Storage]
- Sharded object repository
  - 10s of billions of objects distributed over thousands of OSSs
  - Share-nothing create/destroy, read/write (see the sketch below)
  - Millions of application threads
- Object resilience
  - N-way mirrors / RAID6
- Data management
  - Migration over pools
  - Migration between containers
- ACID transactions on objects and containers
  - Defined state on any/all combinations of failures
  - No scanning on recovery
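One reading of "share-nothing create/destroy" is that object ids are derived locally, for example from the process rank, so creating tens of billions of objects involves no central allocator or namespace lock. The sketch below is a hypothetical illustration of that idea, not DAOS code.

    /* Hypothetical local object-id derivation -- illustrative only. */
    #include <stdint.h>
    #include <stdio.h>

    /* Each rank owns a disjoint id range, so creates never contend. */
    static uint64_t oid_for(uint32_t rank, uint32_t seq)
    {
            return ((uint64_t)rank << 32) | seq;
    }

    int main(void)
    {
            /* A million ranks could each mint ids at full speed; since
             * updates are tagged with an epoch, any failure reverts the
             * container to the last committed epoch -- no scanning.    */
            for (uint32_t rank = 0; rank < 4; rank++)
                    printf("rank %u -> first oid %#llx\n", rank,
                           (unsigned long long)oid_for(rank, 0));
            return 0;
    }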
Thank You
Eric Barton, CTO, Whamcloud, Inc.
eeb@whamcloud.com