Implementing a Hierarchical Storage Management system in a large-scale Lustre and HPSS environment

Implementing a Hierarchical Storage Management system in a large-scale Lustre and HPSS environment. Brett Bode, Michelle Butler, Sean Stevens, Jim Glasgow (National Center for Supercomputing Applications/University of Illinois); Nate Schumann, Frank Zago (Cray Inc.)

Background. What is Hierarchical Storage Management? An HSM system provides a single namespace view over multiple physical/logical storage systems, automated movement of data to the lower tiers (usually based on configurable policies), and return of data to the top tier on request or access.

Traditional HSM. HSM environments have been used for many years to front-end tape archives with a disk cache. Usually the environment is isolated: the only actions on data are transfers to and from the system, all data is expected to be written to the back end, and most data is accessed infrequently. (Diagram: data in the disk cache is automatically migrated to the tape archive by policy, held dual-resident, and released as needed; released data is retrieved when requested; the user accesses a single namespace via explicit data transfers to/from the environment.)

Blue Waters Computing System (system diagram). Aggregate memory 1.66 PB; Scuba subsystem (Storage Configuration for User Best Access); Sonexion online storage, 26 usable PB; Spectra Logic near-line storage, 200 usable PB; external servers; IB switch; 10/40/100 Gb Ethernet switch; quoted bandwidths of >1 TB/sec, 120+ GB/sec, 66 GB/sec, and a 400+ Gbps WAN.

Blue Waters system detail (diagram). Gemini fabric (HSN) connecting the Cray XE6/XK7, 288 cabinets. XE6 compute nodes: 5,659 blades, 22,636 nodes, 362,176 FP (Bulldozer) cores, 724,352 integer cores. XK7 GPU nodes: 1,057 blades, 4,228 nodes, 33,824 FP cores, 4,228 GPUs. Service nodes: DSL 48 nodes, resource manager (MOM) 64 nodes, boot 2 nodes, SDB 2 nodes, RSIP 12 nodes, network gateway 8 nodes, unassigned 74 nodes, LNET routers 582 nodes. Infrastructure: boot RAID, SMW, boot cabinet, 4 esLogin nodes, import/export nodes, InfiniBand fabric, 10/40/100 Gb Ethernet switch, HPSS data mover nodes, Sonexion online storage (25+ usable PB, 36 racks), near-line storage (200+ usable PB), cyber protection IDPS, management node, NCSAnet, esServers cabinets. NPCF supporting systems: LDAP, RSA, Portal, JIRA, Globus CA, Bro, test systems, Accounts/Allocations, Wiki.

Today's Data Management. The user manages file location between scratch and Nearline, moving data explicitly with Globus commands; scratch is subject to a 30-day purge.
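As a point of reference, such an explicit transfer between the two tiers can be expressed with the Globus CLI; the endpoint UUIDs and paths below are hypothetical placeholders, not the actual Blue Waters endpoints.

    # Hypothetical endpoint UUIDs; the user copies results from scratch to Nearline by hand.
    SRC=00000000-0000-0000-0000-000000000001   # placeholder UUID for the scratch endpoint
    DST=00000000-0000-0000-0000-000000000002   # placeholder UUID for the Nearline endpoint
    globus transfer "$SRC:/scratch/user/run42/output.dat" "$DST:/nearline/user/run42/output.dat" \
        --label "archive run42"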

Hierarchical Storage Management (HSM) Vision. Data is automatically migrated from scratch to Nearline by policy, held dual-resident, and released as needed; released data is retrieved when requested; the user accesses a single namespace. An alternate online target is possible. The scratch purge is no longer employed: policy parameters manage file system free space instead, and users are limited by the Lustre quota and the back-end quota (out of band).

HSM Design. Lustre 2.5 or later is required for HSM support; NCSA completed the upgrade for all file systems in August 2016 (Sonexion Neo 2.0). The Lustre copy tool is provided via co-design and development with Cray's Tiered Adaptive Storage Connector product: Cray provides the bulk of the copy tool with a plugin architecture for various back-ends, and NCSA develops a plugin to interface with HPSS tape systems. Specifications were created for resiliency, quotas, disaster recovery, etc.

Lustre HSM Support. Lustre (2.5 and later) tracks files regardless of the location of their data blocks: the file information remains on the MDS, with none required on the OSTs. Commands are provided to initiate data movement and to release the Lustre data blocks (lfs hsm_archive, lfs hsm_release, lfs hsm_restore, etc.). Lustre tracks the requests but does not provide any built-in data movement; a copy tool component is required to register with Lustre in order for any data movement to occur.
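For illustration, the per-file HSM commands look like the following; the file path is just an example.

    # Ask the coordinator to archive the file to back-end archive ID 1
    lfs hsm_archive --archive=1 /scratch/user/results.dat

    # Once archived, free the Lustre data blocks; the metadata stays on the MDS
    lfs hsm_release /scratch/user/results.dat

    # Bring the data blocks back from the archive
    lfs hsm_restore /scratch/user/results.dat

    # Show the HSM flags for the file (exists, archived, released, ...)
    lfs hsm_state /scratch/user/results.dat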

HSM Policies. The policy engine is provided by Robinhood; policies drive scripts that execute the lfs hsm_* commands. It is very early in the policy development process, so these are early thoughts: files will be copied to the back end after 7 days (estimated from a review of the churn in the scratch file system); files will be released when the file system reaches 80% full, based on age and size; files below 16 MB will never be released and will be copied to a secondary file system rather than HPSS.
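A rough sketch of how a policy run could drive those commands is shown below; the path, and the use of lfs find rather than the Robinhood database, are illustrative only, with the 7-day and 16 MB values taken from the slide.

    #!/bin/sh
    # Illustrative sketch: archive files not modified for 7+ days and at least 16 MB.
    # In production, Robinhood selects candidates from its database instead of scanning.
    lfs find /scratch -type f -mtime +7 -size +16M |
    while read -r f; do
        lfs hsm_archive --archive=1 "$f"
    done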

Cray's Connector. Cray's connector consists of two components. The Connector Migration Manager (CMM) registers with the Lustre file system as a copy tool; it is responsible for queuing and scheduling Lustre HSM requests received from the MDT across one or more CMAs. The Connector Migration Agents (CMA) perform all data movement within the Connector and are also responsible for removing archived files from back-end storage if requested.
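Cray's CMM/CMA code is not shown here, but the registration pattern is the same one used by the reference copy tool shipped with Lustre; a minimal sketch, with placeholder paths and an assumed file system name, looks like this.

    # Enable the HSM coordinator on the MDT (run on the MDS; file system name assumed)
    lctl set_param mdt.scratch-MDT0000.hsm_control=enabled

    # Reference POSIX copy tool, run on a data mover node: it registers for archive ID 1
    # and serves the Lustre mount at /mnt/lustre, copying to/from /nearline/hsm_root.
    lhsmtool_posix --daemon --archive=1 --hsm-root=/nearline/hsm_root /mnt/lustre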

CMA Plugins. The CMA plugin architecture allows multiple back-ends to interface with the CMM, and the CMA also allows threading transfers across multiple agents. Several sample CMAs are provided to copy data to a secondary file system.

HPSS Plugin. The NCSA HPSS CMA plugin is called HTAP and will be released as open source soon. It utilizes the HPSS API to provide authentication and data transfer to/from the HPSS environment. Transfers can be parallelized to match the HPSS class of service (COS).

Data I/O. Data I/O for all file transfers is done at the native stripe width for the selected HPSS class of service; this is generally smaller than the Lustre stripe width, but files are restored to their original Lustre stripe width. All file archive and retrieve operations are further verified by full checksums: the checksums are kept with the files in user extended attributes and HPSS UDAs, and the CMA provides parallel and inline checksum capabilities in a variety of standard methods.
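A minimal sketch of the extended-attribute side of that checksum scheme is below; the attribute name user.htap.sha256 is hypothetical, since the real attribute and UDA names are defined by the HTAP plugin.

    # Compute a checksum at archive time and keep it with the file as a user xattr.
    sum=$(sha256sum /scratch/user/results.dat | awk '{print $1}')
    setfattr -n user.htap.sha256 -v "$sum" /scratch/user/results.dat

    # At retrieve time, read the stored value back and compare it against a fresh checksum.
    getfattr -n user.htap.sha256 --only-values /scratch/user/results.dat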

Lustre HSM + Cray TAS + HPSS (workflow diagram). The user application writes files to Lustre (scratch); Robinhood is updated via the Lustre changelogs. The remaining components are Nearline, the data mover running the CMM/CMA, and the Robinhood DB.

Time passes. lfs hsm_state afile reports: afile: (0x00000000) <--- new file, not archived.

Policy triggers copy to back end. A Robinhood policy selects files based on size, age, and other file attributes, and the policy engine issues lfs hsm_archive commands.

File copy. Lustre issues the copy request to the copy tool (CMM).

File copy, continued. The CMM kicks off CMA(s), which copy the file to the back end and then store the back-end metadata as extended Lustre attributes. lfs hsm_state somefile now reports: somefile: (0x00000009) exists archived, archive_id:1 <--- file archived, not released.

File release. At a later time, Robinhood policies choose a list of files to release in order to free file system space. The data blocks are freed, but the metadata remains. lfs hsm_state somefile reports: somefile: (0x0000000d) released exists archived, archive_id:1 <--- file archived and released.

File restore. The user makes a request to stage files (or simply issues a file open).

File restore, continued. Once the restore completes, lfs hsm_state somefile again reports: somefile: (0x00000009) exists archived, archive_id:1 <--- file archived, no longer released.
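For example, a user or a staging script can request the restore explicitly and watch its progress; the path is illustrative.

    # Queue an explicit restore instead of waiting for a file open to trigger one.
    lfs hsm_restore /scratch/user/somefile

    # Report the action the coordinator is running for the file (NOOP once finished),
    # then confirm that the released flag has been cleared.
    lfs hsm_action /scratch/user/somefile
    lfs hsm_state /scratch/user/somefile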

Backup? HSM is NOT equivalent to a backup! One site uses HSM functions to dual-copy data - sort of a backup in the case of a fault in the primary storage. However, deleting a file in the file system quickly results in it being deleted from the back-end. Thus, HSM does not protect against user mistakes. One could create a backend that did file versioning and delayed deletes, but that goes well beyond the current work.

Initial Testing. 900 files/min: an OK rate, with an unclear bottleneck. 900 MB/s: a good rate, and a reasonable fraction of the resources for the single data mover node/Lustre client. The Lustre hsm.max_requests setting must be tuned; on the limited test system, increasing it from 3 to 6 gave good results.
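That tunable lives on the MDT coordinator; assuming an MDT named scratch-MDT0000, the adjustment would be made roughly as follows on the MDS.

    # Inspect, then raise, the number of HSM requests the coordinator dispatches at once.
    lctl get_param mdt.scratch-MDT0000.hsm.max_requests
    lctl set_param mdt.scratch-MDT0000.hsm.max_requests=6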

Future Work. Much more testing, particularly testing at scale; development and scaled testing of the HPSS plugin; creation of an HA Robinhood (policy engine) setup; and workload manager integration. The Cray, Blue Waters, and site teams are investigating a Lustre bug that requires an MDS failover to clear hung transfers.

Conclusions. The initial development and testing of the HPSS/Cray TAS HSM implementation is showing full functionality and good initial performance. Challenges remain in crafting effective policies, and effective production use will require user education and assistance: data must be staged before use in a batch job, and the lack of a unified quota system will confuse users.
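A hedged sketch of what that pre-staging could look like at the start of a job script follows; the paths and polling interval are illustrative, not a supported Blue Waters workflow.

    #!/bin/sh
    # Illustrative pre-staging: request restores for the input set, then wait until
    # none of the files still carries the released flag before the application starts.
    for f in /scratch/user/run42/*.dat; do
        lfs hsm_restore "$f"
    done
    for f in /scratch/user/run42/*.dat; do
        while lfs hsm_state "$f" | grep -q released; do
            sleep 30
        done
    done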

Questions/Acknowledgements. Supported by the National Science Foundation through awards OCI-0725070 and ACI-1238993, and by the State and University of Illinois.