Lustre / ZFS at Indiana University

Lustre / ZFS at Indiana University
HPC-IODC Workshop, Frankfurt, June 28, 2018
Stephen Simms, Manager, High Performance File Systems, ssimms@iu.edu
Tom Crowe, Team Lead, High Performance File Systems, thcrowe@iu.edu

Greetings from Indiana University

Indiana University has 8 campuses; Fall 2017 total student count: 93,934
IU has a central IT organization, UITS: 1,200 full/part-time staff and an approximately $184M budget
HPC resources are managed by the Research Technologies division, with a $15.9M budget

Lustre at Indiana University
2005 Small HP SFS install, 1 Gb interconnects
2006 Data Capacitor (535 TB, DDN 9550s, 10 Gb)
2008 Data Capacitor WAN (339 TB, DDN 9550, 10 Gb)
2011 Data Capacitor 1.5 (1.1 PB, DDN SFA10K, 10 Gb) - PAINFUL MIGRATION using rsync
2013 Data Capacitor II (5.3 PB, DDN SFA12K-40s, FDR)
2015 Wrangler with TACC (10 PB, Dell, FDR)
2016 Data Capacitor RAM (35 TB SSD, HP, FDR-10)

2016 DC-WAN 2
We wanted to replace our DDN 9550 WAN file system
1.1 PB DC 1.5 (DDN SFA10K): 5 years of reliable production, now aging
Recycled our 4 x 40 Gb Ethernet OSS nodes
Upgraded to a Lustre release with the IU-developed nodemap UID mapping for mounts over distance
We wanted to give laboratories bounded space
The DDN-developed project quotas were not yet available
Used a combination of ZFS quotas and Lustre pools
We wanted some operational experience with Lustre / ZFS

Why ZFS?
Extreme scaling: maximum Lustre file system size is 512 PB with LDISKFS versus 8 EB with ZFS
Snapshots: supported in Lustre version 2.10
Online data scrubbing: provides error detection and handling with no downtime for fsck
Copy on write: improves random write performance
Compression: more storage for less hardware

Lessons Learned
At that point metadata performance using ZFS was abysmal, so we used LDISKFS for our MDT
Multi-mount protection is a good thing; it became available in ZFS 0.7
We would use manual failover; the DDN SFA10K presentation feature kept us safe
lz4 compression works without significant performance loss
Some users were flustered when the same file shows different sizes:
  ls -l shows the uncompressed size
  du shows the compressed size
  du --apparent-size shows the uncompressed size
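The three size views on an lz4-compressed dataset are easy to demonstrate (a minimal illustration; the path is hypothetical):

    ls -l /slate/project/results.dat                  # logical (uncompressed) size
    du -h /slate/project/results.dat                  # blocks actually consumed after compression
    du -h --apparent-size /slate/project/results.dat  # logical size, matching ls -l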

Migration with rsync was again painful
ZFS send and receive make for a brighter future
(Image credit: https://www.flickr.com/photos/davesag/18735941)
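For future migrations, per-OST ZFS snapshots can be streamed to new hardware instead of walking the namespace with rsync; a minimal sketch (pool, dataset, and host names are hypothetical):

    # on the old OSS: snapshot the OST dataset and stream the full copy
    zfs snapshot oldpool/ost0@migrate
    zfs send oldpool/ost0@migrate | ssh new-oss zfs receive newpool/ost0

    # after a short freeze, send only the changes made since the first snapshot
    zfs snapshot oldpool/ost0@final
    zfs send -i @migrate oldpool/ost0@final | ssh new-oss zfs receive newpool/ost0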

Need for Persistent High Performance Storage
IU Grand Challenges: $300M over 5 years for interdisciplinary work to address addiction, environmental change, and precision health
From DC-WAN 2, users liked having high performance storage without a purge policy
The file system for individual use would be called Slate; the project and chargeback space would be called Condo
IU issued an RFP for a Lustre / ZFS file system
After lengthy evaluation, IU chose DDN
As partners, we have worked to create DDN's EXAScaler ZFS

Slate and Condo Specs

Slate: Lustre 2.10.3, 4 PB raw
2 MDS (active-active), 3.2 GHz / 512 GB RAM
2 MDT, 10 x 1.92 TB SSD
8 OSS (configured as failover pairs), 16 OST, 2.6 GHz / 512 GB RAM
1 Mellanox ConnectX-4 card
Dual 12 Gb SAS / enclosure, 40 x 8 TB drives

Condo: Lustre 2.10.3, 8 PB raw
2 MDS (active-active), 3.2 GHz / 512 GB RAM
2 MDT, 10 x 1.92 TB SSD
16 OSS (configured as failover pairs), 32 OST, 2.6 GHz / 512 GB RAM
1 Mellanox ConnectX-4 card
Dual 12 Gb SAS / enclosure, 40 x 8 TB drives

Functional Units or Building Blocks: OSS and MDS

OSS Building Block Details

MDS Building Block Details

ZFS Configuration
ZFS version 0.7.5
OSTs are created from zpools, each consisting of 4 x (8 + 2) RAIDZ2 vdevs
With a ZFS record size of 1M, each data drive receives 128K on writes
For production we will have lz4 compression turned on
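A sketch of how one such OST pool could be laid out (the device names, pool and dataset names, ashift value, NID, and mkfs.lustre parameters are illustrative assumptions, not IU's actual build script):

    # one zpool per OST: four RAIDZ2 vdevs of 10 drives each (8 data + 2 parity)
    zpool create -o ashift=12 ost0pool \
      raidz2 sda sdb sdc sdd sde sdf sdg sdh sdi sdj \
      raidz2 sdk sdl sdm sdn sdo sdp sdq sdr sds sdt \
      raidz2 sdu sdv sdw sdx sdy sdz sdaa sdab sdac sdad \
      raidz2 sdae sdaf sdag sdah sdai sdaj sdak sdal sdam sdan

    # a 1M record striped over 8 data drives puts 128K on each drive;
    # properties set on the pool root are inherited by child datasets
    zfs set recordsize=1M ost0pool
    zfs set compression=lz4 ost0pool

    # format the Lustre OST on a dataset inside the pool
    mkfs.lustre --ost --backfstype=zfs --fsname=slate --index=0 \
      --mgsnode=10.0.0.1@o2ib ost0pool/ost0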

Slate Building Blocks Scale: Multiple-Node Performance and Scalability, EXAScaler ZFS
[Bar chart of aggregate read and write throughput (MB/s) for 1 to 4 building blocks (BB), with linear-scaling reference lines. Labeled values: 1 BB 11496 and 11411; 2 BB 20800 and 22234; 3 BB 29893 and 33433; 4 BB 38056 and 43466.]

Single Building Block Performance (blue) versus Dead-OSS Failover Performance (red)
[Bar chart comparing single-node and single-building-block performance, EXAScaler ZFS. Full building block: reads 11496 MB/s, writes 11411 MB/s; under dead-OSS failover: reads 9983 MB/s, writes 7512 MB/s.]

Failover During IOR Testing
[Cacti graph of InfiniBand throughput]
IOR stonewall write tests:
16:07:05, with OSS02 lost: 38.44 GB/sec
16:25:55, after OSS02 came back: 42.53 GB/sec
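The stonewalled writes were driven with IOR; an invocation along these lines produces a comparable fixed-duration write load (a sketch under assumed parameters: the rank count, transfer and block sizes, runtime, and output path are not from the slides):

    # write-only, file-per-process, 1 MiB transfers, stop all ranks after 300 s (stonewall)
    mpirun -np 128 ior -w -F -t 1m -b 64g -D 300 -o /slate/benchmarks/ior/testfile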

Lustre / ZFS Metadata
From Alexey Zhuravlev's 2016 LAD talk:
https://www.eofs.eu/_media/events/lad16/02_zfs_md_performance_improvements_zhuravlev.pdf

Metadata Apples and Oranges: the fastest ldiskfs MDS at IU versus Slate

DCRAM (experimental all-solid-state Lustre cluster)
MDS: 3.3 GHz 8-core (E5-2667 v2), Lustre 2.8
MDT (ldiskfs): SSDs behind a hardware RAID controller, RAID 0

Slate
MDS: 3.2 GHz 8-core (E5-2667 v4), Lustre 2.10.3
MDT (ZFS 0.7.5): SSDs in a JBOD, striped RAID mirrors (2+2+2+2+2)
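The per-operation results on the following slides appear to come from the mds-survey script in lustre-iokit; a rough invocation under that assumption (parameter names follow the Lustre manual, while the target name, file count, and directory count here are placeholders):

    # run on the MDS node; sweeps thread counts 4..256 over the listed tests
    thrlo=4 thrhi=256 file_count=200000 dir_count=4 \
      targets="slate-MDT0000" \
      tests_str="create lookup md_getattr setxattr destroy" \
      mds-survey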

MDS Survey: Creates
Slate (ZFS): 3.2 GHz 8-core (E5-2667 v4), Lustre 2.10.3; DCRAM (ldiskfs): 3.3 GHz 8-core (E5-2667 v2), Lustre 2.8
[Chart: create operations/sec, Slate (ZFS) vs DCRAM (ldiskfs), 4 to 256 threads, y-axis 0 to 200,000]

MDS Survey: Lookups
Slate (ZFS): 3.2 GHz 8-core (E5-2667 v4), Lustre 2.10.3; DCRAM (ldiskfs): 3.3 GHz 8-core (E5-2667 v2), Lustre 2.8
[Chart: lookup operations/sec, Slate (ZFS) vs DCRAM (ldiskfs), 4 to 256 threads, y-axis 0 to 2,500,000]

MDS Survey: Destroys
Slate (ZFS): 3.2 GHz 8-core (E5-2667 v4), Lustre 2.10.3; DCRAM (ldiskfs): 3.3 GHz 8-core (E5-2667 v2), Lustre 2.8
[Chart: destroy operations/sec, Slate (ZFS) vs DCRAM (ldiskfs), 4 to 256 threads, y-axis 0 to 180,000]

MDS Survey: md_getattr
Slate (ZFS): 3.2 GHz 8-core (E5-2667 v4), Lustre 2.10.3; DCRAM (ldiskfs): 3.3 GHz 8-core (E5-2667 v2), Lustre 2.8
[Chart: md_getattr operations/sec, Slate (ZFS) vs DCRAM (ldiskfs), 4 to 256 threads, y-axis 0 to 1,200,000]

MDS Survey: setxattr
Slate (ZFS): 3.2 GHz 8-core (E5-2667 v4), Lustre 2.10.3; DCRAM (ldiskfs): 3.3 GHz 8-core (E5-2667 v2), Lustre 2.8
[Chart: setxattr operations/sec, Slate (ZFS) vs DCRAM (ldiskfs), 4 to 256 threads, y-axis 0 to 200,000]

DDN Value Add
DDN's EXAScaler tools, like those for the SFA devices, ease management and administration of ZFS and the JBODs; I believe enhancements to these tools will continue.
DDN provided us with a set of Grafana-based monitoring tools that, at the present time, are meeting our needs.
Most importantly, DDN has been very responsive to our needs and desires.

Acknowledgments
DDN: Carlos Thomaz, Shuichi Ihara, Sebastien Buisson, Li Xi, Artur Novik, Frank Leers, Gregory Mason, Rich Arena, Bob Hawkins, James Coomer, and Robert Triendl
The rest of IU's HPFS team: Chris Hanna, Nathan Heald, Nathan Lavender, Ken Rawlings, Shawn Slavin
LLNL: Brian Behlendorf, Chris Morrone, Marc Stearman
Whamcloud (née Intel, née Whamcloud, née Oracle, née Sun, née Cluster File Systems)
Special thanks to Alexey Zhuravlev

Questions? Thank You for Your Time!