LustreFS and its ongoing Evolution for High Performance Computing and Data Analysis Solutions

Roger Goff, Senior Product Manager, DataDirect Networks, Inc.

What is Lustre?
- Parallel/shared file system for clusters
- Aggregates many systems into one file system
- Hugely scalable: 100s of GB/s of throughput, 10s of thousands of native clients
- Scales I/O throughput and capacity
- Widely adopted in the HPC community
- Open source, community-driven development
- Supports RDMA over high-speed, low-latency interconnects as well as Ethernet
- Runs on commodity storage and purpose-built HPC storage platforms
- Abstracts the hardware view of storage from the software
- Highly available

Lustre Node Types
- Management Servers (MGS)
- Metadata Servers (MDS)
- Object Storage Servers (OSS)
- Clients
- LNET Routers

Lustre MGS (ManaGement Server)
- Server node that manages the cluster configuration database
- All clients, OSSs and MDSs need to know how to contact the MGS
- The MGS provides all other components with information about the Lustre file system(s) it manages
- The MGS can be separate from, or co-located with, the MDS

Lustre MDS (MetaData Server)
- Manages the names and directories in the Lustre file system
- Stores metadata (such as filenames, directories, permissions and file layout) on a MetaData Target (MDT)
- MDSs are usually configured as an active/passive failover pair
[Diagram: Lustre inode layered on an ext3 inode]

Lustre OSS (Object Storage Server)
- Provides file I/O service and network request handling for one or more Object Storage Targets (OSTs)
- Can be configured for high availability
- Storage agnostic

Lustre Clients
- POSIX-compliant file system access layer
- Includes a Logical Object Volume (LOV) layer that manages file striping across multiple OSTs
- Includes a client layer for each server type, e.g. the Object Storage Client (OSC) to interact with OSSs and the Metadata Client (MDC) to interact with the MDS
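
Because the client only needs the MGS NID and the file system name to retrieve the rest of the configuration (MDTs, OSTs) from the MGS, mounting is a single command. Below is a minimal sketch of a standard client mount; the MGS NID, file system name and mount point are hypothetical.

```python
# Minimal sketch, assuming a node with the Lustre client packages installed
# and root privileges. The mount spec names the MGS, from which the client
# learns the rest of the file system configuration.
import subprocess

MGS_NID = "10.0.0.1@o2ib"    # hypothetical management server NID
FSNAME = "scratch"           # hypothetical Lustre file system name
MOUNT_POINT = "/mnt/scratch"

subprocess.run(
    ["mount", "-t", "lustre", f"{MGS_NID}:/{FSNAME}", MOUNT_POINT],
    check=True,
)
```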

Lustre File Striping
- The ability to stripe data across multiple OSTs is what enables high performance
- Striping improves performance when the desired aggregate bandwidth to a single file exceeds the bandwidth of a single OST
- stripe_count and stripe_size can be set per file
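
As a concrete illustration, the per-file layout is usually set with the standard lfs client utility. The sketch below uses example values for the stripe count and stripe size, and a hypothetical path on the mount from the previous sketch.

```python
# Minimal sketch: create an empty file with an explicit stripe layout and
# read the layout back. Values and paths are examples only.
import subprocess

target = "/mnt/scratch/bigfile.dat"   # hypothetical file on a Lustre mount

# Stripe the file across 4 OSTs in 4 MiB stripes.
subprocess.run(["lfs", "setstripe", "-c", "4", "-S", "4M", target], check=True)

# Show the resulting stripe_count, stripe_size and OST objects.
layout = subprocess.run(["lfs", "getstripe", target],
                        check=True, capture_output=True, text=True)
print(layout.stdout)
```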

Lustre Networking (LNET)
- LNET routers provide simultaneous availability of multiple network types, with routing between them
- Support for many commonly used network types such as InfiniBand and TCP/IP
- RDMA when supported by the underlying network
- High availability and recovery features enabling transparent recovery in conjunction with failover servers
- LNET can use bonded networks
- Supported networks: InfiniBand, TCP, Cray SeaStar, Myrinet MX, RapidArray, Quadrics Elan
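
One common way to describe a node's LNET networks and routes is through lnet module options. The sketch below generates such a configuration for a hypothetical router bridging an InfiniBand fabric and a TCP network, plus the matching client-side route; the interface names and gateway NID are made up for illustration.

```python
# Minimal sketch of LNET configuration via lnet module options, assuming a
# router attached to o2ib0 (on ib0) and tcp0 (on eth0). NIDs and interface
# names are hypothetical.

# Router node: attached to both networks, with packet forwarding enabled.
router_conf = 'options lnet networks="o2ib0(ib0),tcp0(eth0)" forwarding="enabled"\n'

# TCP-only client: reaches the o2ib0 network through the router's TCP NID.
client_conf = 'options lnet networks="tcp0(eth0)" routes="o2ib0 192.168.1.10@tcp0"\n'

# Each line would normally live in a modprobe configuration file on the
# respective node, e.g. /etc/modprobe.d/lustre.conf.
with open("lustre.conf", "w") as f:
    f.write(router_conf)
print(router_conf + client_conf, end="")
```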

LNET Routers in Lustre Solutions

Lustre File Systems
Public sector Lustre systems:
- Lawrence Livermore National Lab: 64 PB, 1.5 TB/s, 96,000 clients
- Texas Advanced Computing Center: 20 PB, 250+ GB/s
- Oak Ridge "Spider": 10 PB, 240 GB/s, 26,000 clients
- Environmental Molecular Sciences Lab @ PNNL: 75 GB/s
Commercial Lustre systems:
- Energy company (unnamed): 10.5 PB, 200 GB/s
- Energy company (unnamed): 270 GB/s
- Wellcome Trust Sanger Institute (life sciences): 22.5 PB, 20 GB/s
- ANU (Bureau of Meteorology + climate science labs): 6 PB, 152 GB/s

Lustre Small File Performance Improvement
- Array cache and flash cache acceleration
- Lustre Distributed Namespace (DNE)
- Lustre Data on Metadata

Cache Optimization
DDN ReACT Intelligent Cache Management:
- Caches unaligned I/O
- Allows full-stripe writes to pass directly to disk
- Faster, safer data

Flash Cache Acceleration: Re-read
- Accelerates reads by caching frequently accessed data in flash
[Diagram: RAM / flash / hard disk tiers]

Flash Cache Acceleration: DDN Storage Fusion Xcelerator (SFX) Instant Commit
- Leverages flash for new writes to accelerate subsequent reads
[Diagram: RAM / flash / hard disk tiers]

Flash Cache Acceleration: DDN Storage Fusion Xcelerator (SFX) Context Commit
- Application control: in-line hints on a per-I/O basis to classify data type, access frequency and QoS
- Intelligently pre-populates critical data in flash to improve application performance
[Diagram: RAM / flash / hard disk tiers]

Lustre Distributed NamespacE (DNE)
- DNE allows the Lustre namespace to be divided across multiple metadata servers
- Enables the size of the namespace and metadata throughput to be scaled with the number of servers
- An administrator can allocate specific metadata resources to different sub-trees within the namespace
- Phase 1: one MDT per directory (available now)
- Phase 2: multiple MDTs per directory (available in Lustre 2.7)
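
On a DNE-enabled file system, directory placement is controlled from the client with lfs. The sketch below shows a phase 1 remote directory pinned to a second MDT and, on releases with phase 2 support, a directory striped across two MDTs; the paths and MDT indices are examples.

```python
# Minimal sketch, assuming a DNE-enabled file system mounted at /mnt/scratch.
import subprocess

# DNE phase 1: a remote directory whose metadata lives on MDT0001.
subprocess.run(["lfs", "mkdir", "-i", "1", "/mnt/scratch/project_a"], check=True)

# DNE phase 2: a single large directory striped across two MDTs
# (requires a release with phase 2 support).
subprocess.run(["lfs", "mkdir", "-c", "2", "/mnt/scratch/shared_dir"], check=True)
```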

Lustre Data on Metadata
- Aims to improve small file performance by allowing the data for small files to be placed only on the MDT
- Three-phase implementation:
  1. Basic DoM mechanism
  2. Automatic migration when a file becomes too large
  3. Performance optimizations: readahead for readdir and stat; first-write detection so files are only created on the MDT when appropriate; OST objects reserved in advance even for small files
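
For reference, in the Lustre releases where Data on MDT eventually shipped (2.11 and later), a DoM layout is requested as the first component of a composite file layout. The sketch below is illustrative only; the directory path and component sizes are examples, not part of this presentation.

```python
# Illustrative sketch of a DoM layout on a later Lustre release: the first
# 1 MiB of each file is stored on the MDT, anything larger spills to an OST.
import subprocess

subprocess.run(
    ["lfs", "setstripe",
     "-E", "1M", "-L", "mdt",    # first 1 MiB of each file stored on the MDT
     "-E", "-1", "-c", "1",      # remainder, if any, striped to one OST
     "/mnt/scratch/small_files"],
    check=True,
)
```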

Reduced RPCs with Data on Metadata

Estimated Performance Improvement with Data on Metadata

High Performance Data Analysis (HPDA)

(Two slides courtesy of Intel, from the LUG 2014 "Moving Lustre Forward" presentation.)

Extending Lustre beyond the data center
- HSM and policy engines
- Automated cloud tiering
- Research collaborations
- Data processing as a service
- Global business data sharing

HSM and Policy Managers
- Lustre 2.5 introduced HSM capabilities: an interface for data movement within the Lustre namespace; it does not include a policy engine
Policy engines:
- iRODS (Integrated Rule-Oriented Data System; http://irods.org/): open source data management software that functions independently of storage resources and abstracts data control away from storage devices and device location
- Robinhood: open source policy engine to monitor and schedule actions on file systems; developed at CEA (http://www-hpc.cea.fr/index-en.htm)
- Versity Storage Manager: an enterprise-class storage virtualization and archiving system based on open-source SAM-QFS
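
The HSM interface introduced in Lustre 2.5 is driven from the client with lfs subcommands, which a policy engine such as Robinhood typically issues in bulk according to its rules. A minimal sketch follows, assuming a copytool and archive backend are already configured; the file path is hypothetical.

```python
# Minimal sketch of the lfs HSM subcommands available since Lustre 2.5.
import subprocess

f = "/mnt/scratch/results/run42.h5"   # hypothetical file

subprocess.run(["lfs", "hsm_archive", f], check=True)   # copy to the archive tier
subprocess.run(["lfs", "hsm_release", f], check=True)   # free Lustre space, keep a stub
subprocess.run(["lfs", "hsm_restore", f], check=True)   # explicit stage-back (reads also trigger it)
state = subprocess.run(["lfs", "hsm_state", f],          # show archive/release flags
                       check=True, capture_output=True, text=True)
print(state.stdout)
```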

Bridging Lustre to a Global Object Store

DDN Web Object Scaler (WOS): The Ultimate in Flexibility and Scale
- Architecture: choose any number of sites
- Platform: choose integrated DDN hardware or software-only
- Data protection: choose from a wide range of protection schemes
- Integration: choose natively integrated applications, direct integration through native object APIs, or file gateways

WOS Feature Baseline
- Flexible data protection: choose between replication, erasure coding, or both
- Latency-aware access manager: WOS intelligently decides which geography to read from based on location, access load and latency
- Federated, global object storage namespace: supports clustering of up to 256 nodes across 64 geographies; replicate data with smart policies for performance and/or disaster recovery on a per-object basis
- Declustered data protection: no hard tie between physical disks and data; failed drives are recovered through dispersed data placement, rebuilds happen at read speed rather than write speed, and only the data is rebuilt
- Fully integrated object storage appliance: 60 drives in 4U (WOS7000)
- Object disk architecture: formats drives with a custom WOS disk file system; no Linux file I/O, no fragmentation, fully contiguous object read and write operations for maximum disk efficiency
- User-defined metadata and metadata search: applications can assign their own metadata via the object storage API; WOS now also supports simple search of WOS user metadata

Collaborate, Distribute, Federate

SSERCA: Cross-university Data Sharing

Gene Sequencing as a Service
[Diagram: a gene sequence analysis specialist's workflow: (1) ingest from a hospital or lab via NFS and CIFS; (2) alignment by university researchers/collaborators on active/active storage with parallel file systems; (3) final biomarkers/IP delivered to corporate pharmaceutical R&D collaborators via CIFS/NFS access]

Global Business Data Sharing
[Diagram: two HPC data centers with parallel file systems, shared between a management site (project manager access via CIFS/NFS) and a design site (designer access via CIFS/NFS)]

Summary
- Lustre is open source
- Lustre has a proven ability to scale
- Lustre is incorporating new features to better support small file I/O
- Lustre's reach extends from the data center to the globe
- Could Lustre be the foundation for your software-defined storage future?

Thank You! Keep in touch with us: sales@ddn.com | 2929 Patrick Henry Drive, Santa Clara, CA 95054 | @ddn_limitless | 1.800.837.2298 | 1.818.700.4000 | company/datadirect-networks

Lustre Scalability and Performance

Feature | Current Practical Range | Tested in Production
Client scalability | 100-100,000 clients | 50,000+ clients, many in the 10,000 to 20,000 range
Client performance | Single client: I/O at 90% of network bandwidth; aggregate: 2.5 TB/s I/O | Single client: 2 GB/s I/O, 1,000 metadata ops/s; aggregate: 240 GB/s I/O
OSS scalability | Single OSS: 1-32 OSTs per OSS, 128 TB per OST; OSS count: 500 OSSs with up to 4,000 OSTs | Single OSS: 8 OSTs per OSS, 16 TB per OST; OSS count: 450 OSSs with 1,000 4 TB OSTs, or 192 OSSs with 1,344 8 TB OSTs
OSS performance | Single OSS: 5 GB/s; aggregate: 2.5 TB/s | Single OSS: 2.0+ GB/s; aggregate: 240 GB/s
MDS scalability | Single MDT: 4 billion files (ldiskfs), 256 trillion files (ZFS); MDS count: 1 primary + 1 backup; introduced in Lustre 2.4: up to 4,096 MDTs and up to 4,096 MDSs | Single MDT: 1 billion files; MDS count: 1 primary + 1 backup
MDS performance | 35,000 create ops/s, 100,000 metadata stat ops/s | 15,000 create ops/s, 35,000 metadata stat ops/s
File system scalability | Single file: 2.5 PB max file size; aggregate: 512 PB of space, 4 billion files | Single file: multi-TB max file size; aggregate: 55 PB of space, 1 billion files