DL-SNAP and Fujitsu's Lustre Contributions


Lustre Administrator and Developer Workshop 2016: DL-SNAP and Fujitsu's Lustre Contributions. Shinji Sumimoto, Fujitsu Ltd. (a member of OpenSFS).

Outline: DL-SNAP background (motivation, status, goal, and contribution plan); what DL-SNAP is; use cases and utility commands; implementation and evaluation; other Lustre contributions (Directory Quota, IB channel bonding).

Motivation, Status, Goal and Contribution Plan. Motivation: backing up files on a large-scale file system is still an open issue, and existing system-level backup requires large storage space and long backup times. Status: we have started developing a snapshot function and have built a prototype. Goal of this presentation: to present our snapshot specification and the prototype implementation, and to discuss its usability and gather users' requirements. Contribution plan: 1Q 2018 to the Lustre community.

Background of DL-SNAP. It is difficult to back up a large-scale file system: a PB-class file system backup takes a long time and requires correspondingly large backup space. To reduce backup storage usage and backup time, we use snapshots to avoid duplicating data, back up a selected area rather than all file system data, and distinguish two levels of backup: system level and user level.

Background: System-Level vs. User-Level Backup. System-level backup: the system guarantees that data is backed up and can be restored; therefore, double-sized storage space or a separate backup device is required, and file services must be stopped during backup. User-level backup: the user selects the data to back up, and file service does not need to be stopped.

Background: Selected User-Level Backup Scheme. Customer requirements: file system service must continue running, it is difficult to guarantee restorable backup data during system operation, and an effective backup service must be provided with limited storage space. Therefore, the user-level backup scheme was selected, and we started to develop DL-SNAP, a user- and directory-level snapshot.

What is DL-SNAP? DL-SNAP is designed for user- and directory-level file backups. Users can create a snapshot of a directory with the lfs command's snapshot and create options, much like making a directory copy. A user can create multiple snapshots of a directory and manage them, including merging snapshots. DL-SNAP also supports quotas to limit users' storage usage.

DL-SNAP Use-case 1: avoiding file deletion or corruption caused by file operations. Snapshots of the user's directory are taken day by day from cron, keeping GEN=7 generations; when user A needs the data from three days ago (for example a deleted File3), it is restored simply by copying the file out of that day's snapshot. The script from the slide:

#!/bin/bash
GEN=7
NEW_SNAP=`/bin/date '+%Y%m%d'`
OLD_SNAP=`/bin/date -d "${GEN} day ago" +%Y%m%d`
lfs snapshot --create /home/usra@${NEW_SNAP}
lfs snapshot --delete /home/usra@${OLD_SNAP}

(Figure: user A's directory with File1, File2 (modified), File3; snapshot data for the -1, -2, and -3 day generations; File3, later deleted, is restored by file copy from the -3 day snapshot.)
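
To complete the use case, the restore step from the slide (copying File3 back out of the snapshot taken three days earlier) might look like the following sketch. The path under .magic is an assumption borrowed from the implementation slides later in the talk; the actual location of snapshot data may differ.

# Hypothetical restore of File3 from the snapshot taken three days ago.
# The ".magic/<snapshot>" path layout is an assumption, not confirmed syntax.
SNAP=`/bin/date -d "3 day ago" +%Y%m%d`
cp /home/usra/.magic/${SNAP}/File3 /home/usra/File3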

DL-SNAP Use-case 2: maintaining a large database with partially different data. An application updates the database from daily incoming data (Input 1, Input 2, ...) while DL-SNAP takes snapshots of it (e.g. Dec. 1, Dec. 3, Dec. 10), so user A can analyze the differences between the database generations. (Figure: daily inputs feed a large database; snapshots taken on successive dates are compared to produce the outputs.)

DL-SNAP: Quota Support and Utility Commands. A quota function is also provided to manage users' storage usage; accounting becomes a little complicated when the owner of a snapshot differs between the original and some snapshot generations. Utility commands (lfs snapshot, lctl snapshot):
Enabling snapshots: lctl snapshot on <fsname>
Getting snapshot status: lctl snapshot status <fsname>
Creating a snapshot: lfs snapshot --create [-s <snapshot>] [-d <directory>]
Listing snapshots: lfs snapshot --list [-R] [-d <directory>]
Deleting a snapshot: lfs snapshot --delete [-f] -s <snapshot> [-d <directory>]
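
For illustration, a session using the commands above could look like the following; the file system name, directory, and snapshot name are invented for the example, while the option syntax follows the slide.

# enable snapshots on the file system (name is an example) and check status
lctl snapshot on fefs
lctl snapshot status fefs
# create a snapshot of a user directory, list it, and delete it later
lfs snapshot --create -s snap-20161001 -d /home/usra
lfs snapshot --list -d /home/usra
lfs snapshot --delete -s snap-20161001 -d /home/usra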

DL-SNAP Implementation. The implementation of DL-SNAP is copy-on-write based. It is built on top of the current Lustre ldiskfs, with changes limited to the OST level and without modifying the ext4 disk format; a special function for creating snapshots is added to the MDS. OST-level modifications (more detail on the next page): a function that creates extra references on OSTs, and copy-on-write capability in the backend file system. There are two methods to manage copy-on-write region blocks: the block bitmap method and the extent region method (our method).

Basic Mechanism of DL-SNAP by Extent Region (1). Initial state: the original file (/data/foo/abc) points to its data blocks on the OST. Taking a snapshot: another reference, the snapshot file (/data/foo/.magic/foo-1/abc), is added and points to the same blocks the original file points to, and a CoW region is created covering those data blocks.

Basic Mechanism of DL-SNAP by Extent Region (2). Append-writing the original file: a new data block is allocated on the OST and the data is written to it; an extent region recording the original file's modification is also created for that block. Over-writing the original file: a new data block is allocated on the OST, the old block is copied into it and updated, and the original file (/data/foo/abc) is re-pointed to the new block, while the snapshot file (/data/foo/.magic/foo-1/abc) keeps pointing to the old block in the CoW region.
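
The user-visible effect of this copy-on-write mechanism can be sketched with the snapshot commands introduced earlier; the .magic snapshot path is taken from the figures above, and the file contents are invented for illustration.

# write a file, snapshot its directory, then overwrite the file
echo "version 1" > /data/foo/abc
lfs snapshot --create -s foo-1 -d /data/foo
echo "version 2" > /data/foo/abc          # over-write triggers CoW on the OST
cat /data/foo/abc                         # shows "version 2" (new block)
cat /data/foo/.magic/foo-1/abc            # still shows "version 1" (old block)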

Evaluation of DL-SNAP. Creating a snapshot is faster than a normal copy: for a 1 KB file the snapshot takes 6% longer than the copy command, but for a 100 MB file it is over 5 times faster. (Figure: snapshot creation time vs. copy time by file size.)

DL-SNAP: Write Performance by IOR. DL-SNAP shows performance comparable to a regular write. (Figure: write throughput in MiB/s vs. file size, compared with regular write.)

DL-SNAP: Read Performance by IOR. DL-SNAP shows performance comparable to a regular read. (Figure: read throughput in MiB/s vs. file size, compared with regular read.)

DL-SNAP: Contribution Plan and Vendor Neutrality. Contribution plan: 1Q 2018 to the Lustre community, several months after shipping as a product. Vendor neutrality: the implementation of DL-SNAP is completely vendor-neutral, since no special hardware is required and it is based on standard Lustre code.

Fujitsu Lustre Contribution Policy (presented at LAD'14). Fujitsu will open its development plan and feed its enhancements back to the Lustre community. Fujitsu's basic contribution policy: opening the development plan and contributing production-level code, and feeding enhancements back to the Lustre community no later than a certain period after our product is shipped. (Figure: the contribution and production cycle between OpenSFS Lustre development, Fujitsu Lustre development, product shipment, and Fujitsu customers.)

DL-SNAP: Current Development Status. We are now developing DL-SNAP and have evaluated its performance. The results show that snapshot creation time is much better than using the copy command for larger files: for a 1 KB file snapshot creation takes 6% longer than the copy command, but for a 100 MB file it is over 5 times faster. Our contribution of DL-SNAP is planned for 1Q 2018.

Directory Quota. Implemented for the K computer in 2010, and has been used at several sites on both 1.8- and 2.x-based Lustre. It needs no project ID and is implemented on top of the original Lustre quota scheme. Planned to be contributed in 2018.

Directory Quota (DQ for short). Manages the maximum number of files and disk usage for each directory; all files and subdirectories under a DQ-enabled directory are under its control, and DQ cannot be set on subdirectories of a DQ-enabled directory. It is implemented on top of Lustre's quota framework, and UID/GID quotas can be used along with DQ. It keeps compatibility with current Lustre: the RPM can be upgraded without mkfs, and older client versions can access DQ-enabled directories.

Directory Quota: How to Use. Operations are the same as Lustre's UID/GID quota.
Set limits of inodes and blocks: # lfs setquota -d <target dir> -B <#blk> -I <#inode> <mountpoint>
Enable limiting by DQ: # lctl conf_param <fsname>.quota.<ost|mdt>=<u|g|d> or # lctl set_param -P <fsname>.quota.<ost|mdt>=<u|g|d>
Check status: # lctl get_param osd-*.*.quota_slave.info
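
As a concrete illustration, applying these commands to a hypothetical file system named fefs mounted at /mnt/fefs could look as follows; the directory name and limits are invented, and the quota type letters follow the slide.

# limit the directory to 1,000,000 blocks and 100,000 inodes
lfs setquota -d /mnt/fefs/proj1 -B 1000000 -I 100000 /mnt/fefs
# enable UID/GID/directory quota enforcement on MDTs and OSTs
lctl conf_param fefs.quota.mdt=ugd
lctl conf_param fefs.quota.ost=ugd
# check quota status on the servers
lctl get_param osd-*.*.quota_slave.info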

Directory Quota: Implementation. The existing UID/GID quota processes are used almost as-is, with some added data structures that store DQ information. The disk layout stays unchanged, so mkfs is not needed to upgrade the package. A new ID for DQ, the DID, is introduced: the DID is the inode number of the DQ-enabled directory and is stored in the ldiskfs inode of MDT/OST object files. Index and account files for DQ are added, managing the usage and limits of the number of inodes and blocks.

Directory Quota: Management Information. The DID is stored in an unused area of the ldiskfs inode (the i_ctime_extra and i_mtime_extra fields are used), DQ index/account files are created on the MDTs/OSTs (in the same format as UID/GID quota, holding the usage and limits of inodes/blocks), and some flags to identify DQ are added, e.g. LDISKFS_SETDID_FL in i_flags on the server side and OBD_MD_FLDID / OBD_MD_FLDIRQUOTA on the client. (Figure: fields added or changed for DQ on the MDS/OSS — ldiskfs_inode_info, vfs_inode i_dquot[], the ldiskfs inode of the object file, the account and index files — and on the client — mdt_body mbo_did/mbo_valid, ll_inode_info lli_did, client_obd cl_quota_hash[].)

Current Status of Directory Quota. Lustre 1.8- and 2.6-based DQ on FEFS has been provided as a product, and our customers use the DQ function in their system operation.

IB Multi-Rail. Implemented for the K computer in 2010, and we contributed our code to OpenSFS (LU-6531) in 2015, so you can use this feature now. The contributed patch is an independent driver, so users can easily apply it to current Lustre.

IB Multi-Rail. Improves LNET throughput and redundancy using multiple InfiniBand (IB) interfaces. Improving LNET throughput: multiple IB interfaces are used as a single Lustre NID, so LNET bandwidth improves in proportion to the number of IB interfaces on a single Lustre node. Improving redundancy: LNET can continue communicating unless all IB interfaces fail, and MDS/OSS failover is not necessary when a single-point IB failure occurs. (Figure: a client with HCA0 ib0 192.168.0.10 and HCA1 ib1 192.168.0.11, NID=192.168.0.10@o2ib0, and a server (MDS/OSS) with HCA0 ib0 192.168.0.12 and HCA1 ib1 192.168.0.13, NID=192.168.0.12@o2ib0, on a single LNET network o2ib0.)
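
As a small illustrative check (addresses taken from the figure above), both HCAs on the client sit behind the single NID, so listing the node's NIDs would still show one o2ib0 entry:

# on the client from the figure: two HCAs, one Lustre NID
lctl list_nids
# expected output (single NID despite ib0 and ib1 both being active):
# 192.168.0.10@o2ib0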

IB Multi-Rail: Related Work. OFED level: IPoIB bonding — OFED already has this function, but RDMA is not supported. RDMA bonding — ongoing work by Mellanox; OFED will support RDMA bonding (it is not clear when). IB partition method — presented by Mr. Ihara (DDN) at LUG 2013; multiple bond interfaces are enabled with IPoIB child interfaces, but multiple LNETs are required and configuration is complex. LNET level: SGI presented LNET-level multi-rail at the Lustre Developer Summit 2015. The advantage of our approach is that it is real code that has worked reliably for many years.

IB Multi-Rail: Implementation. Implemented in the LND (ko2iblnd); other Lustre modules are not changed, keeping compatibility with older versions of Lustre (socklnd). Multiple IB HCAs are handled as a single NID, which enables constructing a single LNET network. All IB HCAs are active: ko2iblnd selects the transmission path in round-robin order, and multiple LNET requests are transmitted over all IB paths in parallel.
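
The slides do not show the configuration syntax of the contributed patch, so the following module option is only a hypothetical sketch of the intent: both ib0 and ib1 serve the single NID on network o2ib0.

# /etc/modprobe.d/lustre.conf -- hypothetical example only; the option
# names accepted by the contributed ko2iblnd patch may differ.
options lnet networks="o2ib0(ib0,ib1)"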

IB Multi-Rail: Path Selection. The transmission path is selected in round-robin order: the source and destination interfaces are chosen cyclically each time an LNET function (LNetPut/LNetGet) is executed. (Figure: round-robin pairing of source interfaces on node A with destination interfaces on node B for different interface counts.)

IB Multi-Rail: Error Handling. Path error: ptlrpc resends the request that got an error, and ko2iblnd selects the next transmission path in round-robin order and sends it. Port down: ko2iblnd removes the transmission paths that use the failed port. (Figure: message flow through the FS layer, ptlrpc, LNET, and the LND (ko2iblnd) in the normal case and in the path-error case, where the error is detected, notified upward, and the request is retried over the next path.)

IB Multi-Rail: LNET Throughput. Setup: servers with two Xeon E5520 2.27 GHz CPUs and two QDR or two FDR IB interfaces, connected through an IB switch (concurrency = 32). Result: bandwidth scales almost linearly with the number of IB interfaces and achieves nearly the hardware performance. (Figure: LNET throughput for one and two IB rails.)

IB Multi-Rail: I/O Throughput of a Single OSS. Setup: OSS and clients with two Xeon E5520 2.27 GHz CPUs and two QDR IB interfaces; OSTs on 8 ramdisks (> 6 GB/s); IOR with 32 processes (8 clients x 4). Result: throughput scales almost linearly with the number of IB interfaces; measurement with FDR is planned. (Figure: 8 clients connected through an IB switch to the MDS and the OSS with 8 ramdisk OSTs on network o2ib0.)

IB Multi-Rail Code Status. Our IB multi-rail has been provided as a commercial product for over 5 years; on the K computer it has run on over 90 OSSes and on the order of 1000 clients since 2011, realizing highly available operation for over 5 years. The OFED-based implementation can be widely used with other devices such as RoCE, Omni-Path, and Tofu. We contributed our code to OpenSFS (LU-6531) in 2015, so you can use this feature now; the contributed patch is an independent driver, so users can easily apply it to current Lustre.
