Lustre on ZFS at The University of Wisconsin Space Science and Engineering Center. Scott Nolin, September 17, 2013

Why use ZFS for Lustre?

The University of Wisconsin Space Science and Engineering Center (SSEC) is engaged in atmospheric research, with a focus on satellite remote sensing, which has large data needs. As storage has grown, improved data integrity has become increasingly critical. The data integrity features of ZFS make it an important candidate for many SSEC systems, especially an ongoing satellite data disk archive project. For background, see:

- Zhang, Rajimwale, A. Arpaci-Dusseau, and R. Arpaci-Dusseau, "End-to-end Data Integrity for File Systems: A ZFS Case Study", http://research.cs.wisc.edu/wind/publications/zfs-corruption-fast10.pdf
- Bernd Panzer-Steindel, CERN/IT, "Data Integrity", http://indico.cern.ch/getfile.py/access?contribid=3&sessionid=0&resid=1&materialid=paper&confid=13797

Additional ZFS features such as compression and snapshots are also very attractive. We have been following the ZFS on Linux project (http://zfsonlinux.org) at Lawrence Livermore National Laboratory (LLNL). This active project ported ZFS to Linux for use as the 55PB file system on the supercomputer Sequoia. We feel it is clear that Lustre on ZFS has matured, especially considering the production use on Sequoia.

Why a Test System?

- Proof of concept and design validation.
- ZFS as a backend for Lustre is fairly new, first released in Lustre 2.4.
- Documentation on setup and routine administration is not yet available in the Lustre manual or man pages.
- For the best data integrity we need systems supporting direct JBOD access to disks, and finding appropriate hardware is a challenge.

SSEC Test System: Cove

- Grant from Dell to provide testing equipment.
- No high-availability features included in the test. For many SSEC systems we have a 5x9, next-business-day support level; HA will be important for some future operational systems.
- SSEC built the system and tested it for performance, data integrity features, and routine administration.
- System design is largely based on the LUG 2012 Lustre for Sequoia presentation by Brian Behlendorf. This allows a reasonable performance comparison to a known system, and the Sequoia file system also provides some insight into massive scaling that we cannot test.

Sequoia / Cove Comparison

Sequoia file system*:
- 768 OSS, 68 PB raw
- High availability configuration
- 60 disks in 4U per OSS pair
- 3TB nearline SAS disks
- MDT: quantity 40, 1TB OCZ Talos 2 SSDs
- OST: IB host attached, dual RAID controllers, RAID-6 by controller

Cove file system:
- 2 OSS, 240 TB raw
- Not high availability
- 60 disks in 10U per OSS pair
- 4TB nearline SAS disks
- MDT: quantity 4, 400GB Toshiba MK4001GRZB SSDs
- OST: direct SAS attached, JBOD mode, RAID-Z2 by ZFS

* Based on LUG 2012 Lustre for Sequoia presentation, Brian Behlendorf

OST: RAID-6 is not the best answer

For the best data integrity, giving ZFS direct disk access is ideal, so Cove uses a SAS HBA attached to JBOD. We create a ZFS RAID-Z2, which has 2 parity disks like RAID-6. For our Cove design, this also allows us to stripe the RAID-Z2 across all enclosures, 2 disks per enclosure, so we can lose an entire enclosure and keep operating (a sketch of this layout follows).
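
As an illustration of this layout (not taken from the slides), a 10-disk RAID-Z2 pool drawn two disks from each of five enclosures could be created roughly as below; the device aliases (enc0disk0 and so on) are hypothetical names that would be defined in /etc/zfs/vdev_id.conf.

    # Minimal sketch, assuming vdev_id.conf aliases of the form encNdiskM:
    # one OST pool of 10 disks, two per enclosure, so the pool survives
    # the loss of any single enclosure (RAID-Z2 tolerates two missing disks).
    zpool create -o ashift=12 ost0pool raidz2 \
        enc0disk0 enc0disk1 \
        enc1disk0 enc1disk1 \
        enc2disk0 enc2disk1 \
        enc3disk0 enc3disk1 \
        enc4disk0 enc4disk1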

Sequoia: Why No RAID-Z2?

Brian Behlendorf at Lawrence Livermore National Laboratory (LLNL) explains:
- LLNL needed something that could be used for either ZFS or ldiskfs.
- It was a risk mitigation strategy, since this was going to be the first Lustre on ZFS system.
- Since then they have developed confidence in the ZFS implementation and expect future deployments to be JBOD based.

Cove Hardware Configuration

- 3 computers: 1 MDS/MDT and 2 OSS, all on FDR-10 InfiniBand
- OST storage: 5 MD1200 storage arrays in split mode
- In split mode, each MD1200 presents itself as 2 independent 6-disk SAS devices
- Each half MD1200 is direct SAS attached to an OSS via a SAS port on an LSI 9200-8e adapter
- 1 OST = 10-disk RAID-Z2, 2 disks from each MD1200
- This configuration can lose an entire MD1200 enclosure and keep operating
- Disk and OST-to-OSS ratios are the same as the Sequoia file system

[Diagram: MDS/MDT, OSS 1 (OST0-OST2), and OSS 2 (OST3-OST5) on FDR-10, with SAS cables from each OSS to the split-mode enclosures.]

Sequoia / Cove MDS/MDT in Detail

Sequoia MDS/MDT:
- 2x Supermicro X8DTH (HA pair)
- Xeon 5650, dual socket, 6 cores @ 2.47 GHz
- 192GB RAM
- QDR Mellanox ConnectX-3 IB
- Dual-port LSI SAS, 6Gbps, external JBOD
- MDT: quantity 40, 1TB OCZ Talos 2 SSDs, ZFS mirror

Cove MDS/MDT:
- 1x Dell R720
- Xeon E5-2643, dual socket, 8 cores @ 3.30GHz
- 128GB RAM
- FDR-10 Mellanox ConnectX-3 IB
- Dell H310 SAS, 6Gbps (internal LSI SAS2008 controller), internal JBOD (Dell H310 passthrough mode)
- MDT: quantity 4, 400GB Toshiba MK4001GRZB SSDs, ZFS mirror
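
For reference, the mirrored MDT pool over the four SSDs might be built roughly as follows; the device names are placeholders, and the layout shown (two mirrored pairs, striped) is an assumption rather than the documented Cove configuration, which is described only as a ZFS mirror.

    # Hypothetical sketch: MDT pool as striped mirrors over 4 SSDs
    zpool create mdtpool \
        mirror ssd0 ssd1 \
        mirror ssd2 ssd3
    zpool status mdtpool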

Sequoia / Cove OSS Unit in Detail

Sequoia OSS: Appro GreenBlade
- Quantity 2 (Sequoia has 378 of these units, for 756 total)
- Intel Xeon E5-2670, dual socket, 8 cores @ 2.60GHz
- 64GB RAM
- QDR Mellanox ConnectX-3 IB
- Storage enclosure connection: dual QDR ConnectX-2 IB
- HA pairs

Cove (SSEC) OSS: Dell R720
- Quantity 2
- Intel Xeon E5-2670, dual socket, 8 cores @ 2.60GHz
- 64GB RAM
- FDR-10 Mellanox ConnectX-3 IB
- Storage enclosure connection: LSI SAS 9200-8e (8 ports)
- No HA

Sequoia / Cove OST Unit in Detail

Sequoia OST unit: NetApp E5400
- 60 bays, 4U
- 3TB nearline SAS, IB host attached
- 180 TB raw
- Volume management: dual RAID controllers, RAID-6 by controller

Cove OST unit: Dell MD1200
- 12 bays, 2U; x 5 = 60 bays, 10U
- 4TB nearline SAS, SAS host attached
- 240 TB raw
- Volume management: direct SAS, JBOD mode, RAID-Z2 by ZFS

Cove Software Configuration

Servers:
- RHEL 6
- Lustre 2.4 for ZFS, with ZFS 0.6.1
- Lustre 2.3 for ldiskfs (2.4 was not available at test time)
- Mellanox OFED InfiniBand stack

Test clients:
- 4 nodes, 16 cores each
- RHEL 5
- Lustre 2.1.5
- Mellanox OFED InfiniBand stack
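
As a rough sketch of how ZFS-backed targets are formatted with Lustre 2.4, the commands below show the general pattern; the file system name, pool names, device aliases, and MGS NID are placeholders, not Cove's actual values.

    # Hypothetical sketch: mkfs.lustre can create the zpool itself when
    # given a pool/dataset name followed by a vdev layout.
    mkfs.lustre --mgs --mdt --index=0 --fsname=testfs \
        --backfstype=zfs mdtpool/mdt0 mirror ssd0 ssd1

    mkfs.lustre --ost --index=0 --fsname=testfs \
        --mgsnode=192.168.1.10@o2ib --backfstype=zfs \
        ost0pool/ost0 raidz2 enc0disk0 enc0disk1 enc1disk0 enc1disk1 \
        enc2disk0 enc2disk1 enc3disk0 enc3disk1 enc4disk0 enc4disk1

    # Mounting the target datasets starts the Lustre services
    mount -t lustre mdtpool/mdt0 /mnt/lustre-mdt0
    mount -t lustre ost0pool/ost0 /mnt/lustre-ost0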

Metadata Testing: MDTEST

Sequoia MDTEST*:
- Configurations: 1. ldiskfs, 40-disk mirror; 2. ZFS mirror
- 1,000,000 files in a single directory
- 52 clients

Cove MDTEST:
- Configurations: 1. ldiskfs, 4-disk RAID-5; 2. ZFS mirror; 3. ZFS RAID-Z
- 640,000 files in a single directory
- 4 clients, 64 threads total

* Based on LUG 2012 Lustre for Sequoia presentation, Brian Behlendorf
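
For context, a run matching the Cove parameters (64 tasks across 4 clients, 640,000 files in a shared directory) might be launched roughly as follows; the MPI launcher, host file, and mount point are assumptions.

    # Hypothetical sketch: 64 MPI tasks, 10,000 files per task, files only (-F),
    # all operating in one shared test directory on the Lustre mount.
    mpirun -np 64 -hostfile clients.txt \
        mdtest -F -n 10000 -d /mnt/lustre/mdtest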

[Chart: MDTEST file create and file removal rates in operations/second, comparing cove zfs (raid10), cove zfs (raidz), cove ldiskfs (raid5), sequoia ZFS OI+SA (raid10), and sequoia ldiskfs (raid10).]

[Chart: MDTEST file stat rates in operations/second, comparing cove zfs (raid10), cove zfs (raidz), cove ldiskfs (raid5), sequoia ZFS OI+SA (raid10), and sequoia ldiskfs (raid10).]

MDTEST Conclusion

Cove performance seems consistent with Sequoia performance for MDTEST. File stat is much lower than Sequoia's result; we think this is due to Sequoia's 40 SSDs versus Cove's 4. The metadata performance of Lustre with ZFS is acceptable for many SSEC needs.

File System I/O Performance: IOR

Sequoia IOR benchmark*:
- 2 OSS (1/384th scale)
- 6 OST: 8+2 RAID-6 SAS
- FPP (one file per process)
- 1, 2, 4, 8, 16 tasks per node
- 1 to 192 nodes
- Stonewalling
- Lustre 2.3
- Unknown file size

Cove IOR benchmark:
- 2 OSS
- 6 OST: 8+2 RAID-Z2 SAS
- FPP (one file per process)
- 2, 4, 8, 16 tasks per node
- 1 to 4 nodes
- Not stonewalling
- Lustre 2.4
- Showing 100MB file results (tested 1MB, 20MB, 100MB, 200MB, 500MB)

* From Andreas Dilger, LAD 2012 - Lustre on ZFS
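
For reference, a file-per-process IOR run at the 100 MB file size might look roughly like the following; the launcher, transfer size, and paths are assumptions.

    # Hypothetical sketch: POSIX API, file per process (-F), 100 MB per task
    # in 1 MB transfers, writing then reading back.
    mpirun -np 64 -hostfile clients.txt \
        IOR -a POSIX -F -b 100m -t 1m -w -r -o /mnt/lustre/ior/testfile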

[Chart: IOR reads, 100 MB FPP, 2 OSS (6x 8+2 RAID-Z2 SAS): MB/s versus node count (1 to 4) at 2, 4, 8, and 16 threads per node, Sequoia versus Cove.]

[Chart: IOR writes, 100 MB FPP, 2 OSS (6x 8+2 RAID-Z2 SAS): MB/s versus node count (1 to 4) at 2, 4, 8, and 16 threads per node, Sequoia versus Cove.]

IOR Conclusion

Cove performance compares well with the Sequoia file system, and we assume it would scale similarly. SSEC will soon have an opportunity to test at larger scale.

Data Integrity Tests: Routine Failures

We tested routine events like replacing a single disk, systems going up and down, and taking an OST offline and back, plus various unplanned tests as I tried things with hardware, moved things, recabled, and made mistakes. Because the RAID-Z2 is striped across JBODs, we could also test by powering off an entire JBOD enclosure and keep running. In all cases, no data was lost or corrupted.
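
The single-disk replacement case, for example, follows the standard ZFS workflow; the pool and device names below are placeholders.

    # Hypothetical sketch: replace one disk in an OST pool
    zpool status ost0pool                              # identify the faulted or pulled disk
    zpool offline ost0pool enc2disk1                   # take it out of service
    zpool replace ost0pool enc2disk1 enc2disk1-new     # resilver onto the replacement
    zpool status ost0pool                              # watch resilver progress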

Data Integrity Tests: Corrupting a Disk

For SSEC this is the critical feature of ZFS. The random dd simulates corruption that can happen for a wide variety of reasons, and with most file systems there is no protection.

The test cycle:
- Start with 22TB of VIIRS data, with MD5 sums
- Store the data on Cove
- Scott does damage: use dd to write random data to a disk
- ZFS corrects: run zpool scrub
- Verify the MD5 sums

No data was lost or corrupted.
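
A minimal sketch of that corruption test, with placeholder device, pool, and file names:

    # Write 1 GiB of random data over part of one member disk of a live pool
    dd if=/dev/urandom of=/dev/disk/by-vdev/enc3disk0 \
        bs=1M count=1024 seek=4096 oflag=direct

    # Scrub reads every block, detects the bad checksums, and repairs from parity
    zpool scrub ost1pool
    zpool status -v ost1pool      # shows checksum errors found and corrected

    # Finally, confirm the stored files still match their recorded MD5 sums
    md5sum -c viirs-checksums.md5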

System Administration

Giving ZFS direct access to drives is the best choice for ensuring data integrity, but it leads to system administration challenges. ZFS is responsible for volume management, so functions typically provided by the RAID adapter are not available. ZFS snapshots are an attractive feature for metadata backups. We tested various administration tasks and found reasonable solutions for most; firmware updates are a concern.

System Administration Summary

Each task is listed for the three configurations tested: SAS HBA / ZFS (OSS/OST), Dell H310 passthrough mode / ZFS (MDS/MDT), and Dell H810 RAID / ldiskfs (the traditional method).

Drive identification:
- SAS HBA / ZFS: vdev_id.conf aliases by path; sas2ircu (LSI command line tool) to enumerate drives
- H310 passthrough / ZFS: vdev_id.conf aliases by path; OpenManage to enumerate drives
- H810 RAID / ldiskfs: Dell OpenManage

Drive firmware update:
- SAS HBA / ZFS: install H810, replace cables, flash with Dell utilities
- H310 passthrough / ZFS: Dell utilities
- H810 RAID / ldiskfs: Dell utilities

Enclosure firmware update:
- SAS HBA / ZFS: assumed same as the drive method, not tested
- H310 passthrough / ZFS: Dell utilities
- H810 RAID / ldiskfs: Dell utilities

Drive failure notification:
- SAS HBA / ZFS: use zpool status results
- H310 passthrough / ZFS: Dell OpenManage
- H810 RAID / ldiskfs: Dell OpenManage

Predictive drive failure notification:
- SAS HBA / ZFS: smartctl
- H310 passthrough / ZFS: Dell OpenManage
- H810 RAID / ldiskfs: Dell OpenManage

Metadata backup and recovery:
- SAS HBA / ZFS: ZFS snapshot
- H310 passthrough / ZFS: ZFS snapshot
- H810 RAID / ldiskfs: dd or tar (some versions of tar and Lustre)
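
The vdev_id.conf aliases mentioned above map stable enclosure/slot names to SAS by-path links; a hypothetical fragment might look like this.

    # /etc/zfs/vdev_id.conf (hypothetical fragment)
    # alias <name> <device link>
    alias enc0disk0 /dev/disk/by-path/pci-0000:03:00.0-sas-phy0-lun-0
    alias enc0disk1 /dev/disk/by-path/pci-0000:03:00.0-sas-phy1-lun-0
    # ...one entry per drive; the aliases then appear under /dev/disk/by-vdev/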

System Administration: Firmware

Drive firmware:
- The Dell Red Hat utility identifies the drives and will attempt firmware updates, but fails due to a check of logical drive status, which it expects to be provided by a Dell H810.
- Workaround: bring the system down, add an H810, re-cable to the H810, flash the firmware, then re-cable to the original HBA.
- This works and does not do anything bad like write a UUID, but it is awkward and slow due to the re-cabling and does not scale well.

Enclosure firmware:
- We did not test this, but assume it is a similar situation.

System Administration: Metadata Backup

ZFS snapshots are object-type backups; no file-based backup is currently possible. Snapshots are very nice for MDT backup: analogous to the dd option required for some Lustre versions, but with incrementals and a lot more flexibility. This is a bonus of ZFS. Like many ZFS features, snapshots can be useful for other tasks, but metadata backup is the most critical one for SSEC.
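
A sketch of the snapshot-based MDT backup, with hypothetical pool, snapshot, and host names:

    # Snapshot the MDT dataset, then stream the snapshot off-host
    zfs snapshot mdtpool/mdt0@backup-2013-09-17
    zfs send mdtpool/mdt0@backup-2013-09-17 | \
        ssh backuphost 'cat > /backups/mdt0-2013-09-17.zsend'

    # Later backups can be sent incrementally against the previous snapshot
    zfs snapshot mdtpool/mdt0@backup-2013-09-24
    zfs send -i @backup-2013-09-17 mdtpool/mdt0@backup-2013-09-24 | \
        ssh backuphost 'cat > /backups/mdt0-2013-09-24.incr'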

System Administration Conclusion

While giving ZFS direct access to drives in a JBOD is ideal for data integrity, it introduces some system administration challenges. Vendor tools are focused on intelligent RAID controllers, not direct-access JBODs. Most missing pieces can be replaced today with standard Linux tools, but firmware updates and utilities need more work. ZFS snapshots are very convenient for metadata backup and restore.

Next Steps

- Production disk archive system.
- Testing at larger scale (12 OSS, more clients).
- HA metadata servers: will try ZFS on InfiniBand SRP-mirrored LUNs, based on Charles Taylor's "High Availability Lustre Using SRP-mirrored LUNs" (LUG 2012).
- System administration work: pursue firmware utility improvements with Dell.
- Setup, administration, and monitoring documentation, to be made public by end of 2013.

Thanks to: Brian Behlendorf at LLNL; Jesse Stroik and John Lalande at UW SSEC; Pat Meyers and Andrew Waxenberg at Dell; and Dell HPC.