Coordinating Parallel HSM in Object-based Cluster Filesystems

Coordinating Parallel HSM in Object-based Cluster Filesystems
Dingshan He, Xianbo Zhang, David Du (University of Minnesota); Gary Grider (Los Alamos National Laboratory)

Agenda
- Motivations
- Parallel archiving/retrieving architecture
- Coordinating parallel hierarchical storage management
- Experimental results
- Future work

Demand for a Scalable, Global and Secure (SGS) file system
- High Performance Computing (HPC) applications; Tri-Lab File System Path Forward RFQ
- Global name space
- Security
- Scalable infrastructure for clusters and the enterprise
- No single point of failure
- POSIX-like interface
- Works well with MPI-IO

Object-based Cluster File System (OCFS)
- Processing nodes obtain file metadata from the MetaData Server, e.g. /foo/bar maps to {(osd1, obj 2), (osd2, obj 3), (osd3, obj 2)} with a RAID 0 layout and a 64KB stripe unit size (a small mapping sketch follows below).
- Data object access then goes directly over the SAN to the OSDs through the OSD interface; each OSD runs a local file system on its own disks.
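To make the layout above concrete, here is a minimal sketch (in Python, not the actual OCFS/Lustre code) of how a byte offset maps to an (OSD, object, object offset) triple under a RAID 0 layout with a 64KB stripe unit. The layout values come from the slide; all names are illustrative.

```python
# Minimal sketch of RAID 0 striping across objects, assuming the layout shown
# on the slide. Not the prototype's actual data structures.

STRIPE_UNIT = 64 * 1024                              # 64KB stripe unit size
LAYOUT = [("osd1", 2), ("osd2", 3), ("osd3", 2)]     # (OSD, object id) for /foo/bar

def locate(offset):
    """Return (osd, object_id, offset_within_object) for a file byte offset."""
    stripe_index = offset // STRIPE_UNIT              # which 64KB stripe unit
    osd, obj = LAYOUT[stripe_index % len(LAYOUT)]     # round-robin across OSDs
    units_on_osd = stripe_index // len(LAYOUT)        # full units already on this OSD
    return osd, obj, units_on_osd * STRIPE_UNIT + offset % STRIPE_UNIT

# The fourth 64KB unit of the file wraps back to osd1, object 2:
print(locate(200 * 1024))   # ('osd1', 2, 73728)
```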

Recent OCFS Solutions
- Lustre File System from CFS, Inc.: LLNL runs Lustre on the Multiprogrammatic Capability Cluster (MCR) with 20 million files, 115.2TB of capacity and 22GBps aggregate I/O.
- ActiveScale File System from Panasas: LANL deploys three systems; the largest has 200TB of capacity and ~20GBps aggregate I/O.

Why a storage hierarchy in SGS systems?
- Fast data generation rate in HPC environments: simulations generate a new multi-gigabyte file every 30 minutes; checkpoints are taken for error recovery and experiment steering.
- Cost effectiveness: combine expensive high-performance storage with more affordable low-performance storage.
- Data lifecycle: shared computing facilities are used by many projects in turn.

Bottleneck in archive and retrieve bandwidth
- The gap between application parallel I/O and archival storage I/O keeps growing.
- Rule of thumb: every TFLOP needs 1GBps of parallel I/O bandwidth and 100MBps of archival I/O bandwidth (a worked example follows below).
- 2005: BlueGene/L delivers 280 TFLOPs.
- Backup and restore record in 2003, achieved by SGI: 2.8GBps file-level backup, 1.25GBps file-level restore.
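As a back-of-the-envelope check of the gap, the slide's rules of thumb can be applied to the 2005 BlueGene/L figure (the rates and the 280 TFLOP number are from the slide; the arithmetic itself is only illustrative):

```python
# Applying the slide's per-TFLOP bandwidth rules of thumb to BlueGene/L (2005).
tflops = 280
parallel_io_gbps = tflops * 1.0   # 1 GBps of parallel I/O per TFLOP
archive_io_gbps  = tflops * 0.1   # 100 MBps of archival I/O per TFLOP
print(parallel_io_gbps, archive_io_gbps)   # 280.0 GBps parallel, 28.0 GBps archival

# The 2003 record of 2.8 GBps file-level backup is an order of magnitude below
# the ~28 GBps of archival bandwidth this rule of thumb calls for.
```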

Objectives
- High aggregate data archive and retrieve bandwidth
- Scalable in archival bandwidth as well as capacity
- Automated and transparent management of data migration within the storage hierarchy

Agenda
- Motivations
- Parallel archiving/retrieving architecture
- Coordinating parallel hierarchical storage management
- Experimental results
- Future work

Parallel Archive Architecture
- The OCFS data path is unchanged: processing nodes obtain file metadata (e.g. /foo/bar striped RAID 0 with a 64KB stripe unit over {(osd1, obj 2), (osd2, obj 3), (osd3, obj 2)}) from the MetaData Server and access data objects over the SAN through the OSD interface.
- Each OSD additionally runs an HSM on top of its local file system, reached through an OSD/NAS/SCSI interface, which migrates objects between the OSD's disks and its attached tape (archival) storage.

Design Rationales
- Parallel archival storage: exploit aggregate parallel archival bandwidth.
- OSDs embed automated management of migrations: they are close to the storage device and have a better understanding of object access patterns.
- Direct data migration between OSDs and their associated archival storage: OSDs are smart and powerful enough to do this themselves.

Eliminating the dependency on the Data Management API (DMAPI)
- Not widely supported by popular file systems
- Not scalable, based on past experience
- Requires a single event listener
- Most of its functions are unused by HSM

Functions of DMAPI that need to be replaced
- Catching access events: when an object not in online storage is accessed, trigger the HSM to recall its data.
- Transparent namespace: files appear as if they are always there; a file stub is always kept in the file system managed by the HSM.

Replacing DMAPI
- The MetaData Server stores an archival attribute alongside the file metadata (e.g. /foo/bar, striped RAID 0 with a 64KB stripe unit over {(osd1, obj 2), (osd2, obj 3), (osd3, obj 2)}).
- Each OSD adds a Migration Agent between its OSD interface and the HSM/local file system; data still migrates between each OSD's disks and its tape storage.

Component Functions
- Archival attribute: records the file object's storage level and location.
- Migration Agent: intercepts file system requests arriving through the object interface and checks the archival attribute to either pass the request to the local file system or recall the object first (see the sketch below).
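A minimal sketch of the Migration Agent's dispatch logic described above, assuming a simplified in-memory archival-attribute table (in the real design the attribute lives with the file metadata). Class, method and attribute names (MigrationAgent, handle_read, recall, ...) are hypothetical, not the prototype's actual interfaces.

```python
# Sketch only: a Migration Agent that intercepts object reads and recalls
# archived objects from the HSM before serving them from the local FS.

ONLINE, ARCHIVED = "online", "archived"

class MigrationAgent:
    def __init__(self, local_fs, hsm):
        self.local_fs = local_fs       # local file system on this OSD's disks
        self.hsm = hsm                 # HSM driving the attached archival storage
        self.archival_attr = {}        # object id -> ONLINE / ARCHIVED (simplified)

    def handle_read(self, obj_id, offset, length):
        # Intercept the object-interface request and consult the archival attribute.
        if self.archival_attr.get(obj_id, ONLINE) == ARCHIVED:
            # Object is not in online storage: recall it from tape first, then
            # mark it online and fall through to the local file system.
            self.hsm.recall(obj_id)
            self.archival_attr[obj_id] = ONLINE
        return self.local_fs.read(obj_id, offset, length)
```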

Agenda
- Motivations
- Parallel archiving/retrieving architecture
- Coordinating parallel hierarchical storage management
- Experimental results
- Future work

Distributed HSMs Need to be Coordinated
- Single large files are striped across multiple OSDs; multi-gigabyte files are common in HPC.
- File sets of many related files, used by the same application, also span multiple OSDs.
- Migrations between multiple pairs of online OSD and archival storage must run synchronously to achieve truly high aggregate migration bandwidth for a single file or file set.
- A sequential access pattern cannot exploit the parallel migration data paths: a striped file would be retrieved object by object, in sequence.

Coordinating Parallel HSM
- A Migration Coordinator is added to the control path: it queries the file layout from the MetaData Server, updates the archival attribute, receives migration request notifications, and instructs the HSM on each OSD to migrate (a sketch of this flow follows below).
- The data path is unchanged: clients access data objects over the SAN, and each OSD's Migration Agent/HSM moves objects between its disks and its tape storage.
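A hedged sketch of the coordination flow: the coordinator resolves a file to its striped objects via the MDS layout and then drives the per-OSD migrations in parallel, so that all migration data paths are used at once. All names (query_layout, instruct_migration, update_archival_attribute) are illustrative assumptions, not the system's real API.

```python
# Sketch only: centralized coordination of parallel archive migrations.
from concurrent.futures import ThreadPoolExecutor

class MigrationCoordinator:
    def __init__(self, mds, agents):
        self.mds = mds          # metadata server: path -> [(osd, object id), ...]
        self.agents = agents    # one Migration Agent per OSD, keyed by OSD name

    def archive_file(self, path):
        layout = self.mds.query_layout(path)          # striped object layout
        with ThreadPoolExecutor(max_workers=len(layout)) as pool:
            # One archive migration per (OSD, object) pair, all in flight
            # together, so the per-OSD tape paths run concurrently.
            futures = [pool.submit(self.agents[osd].instruct_migration, obj)
                       for osd, obj in layout]
            for f in futures:
                f.result()                             # wait, surface any error
        self.mds.update_archival_attribute(path, "archived")
```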

Reasons for Centralized Coordination
- The migration control path is separated from the migration data path.
- OSDs are not involved in a distributed coordination task.
- A centralized coordinating authority makes intelligent decisions across requests possible.

Concurrency Control and Error Recovery
- Migration locking mechanism: concurrent client access, archive migration and retrieve migration are controlled by an Access Lock (ALock), a Backup Lock (BLock) and a Restore Lock (RLock); a possible compatibility table is sketched below.
- Error recovery: components can fail in the middle of a migration, so each migration operation is logged before it starts, large object migrations are checkpointed, and unfinished migration operations are restarted during error recovery.
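A sketch of a migration lock table for the three lock types named above. The slides do not spell out the compatibility rules, so the matrix below is an assumption: client access (ALock) is allowed to share with an in-progress archive (BLock), while a restore (RLock) is exclusive.

```python
# Assumed lock compatibility matrix for ALock/BLock/RLock (sketch only; the
# actual rules used by the prototype are not given on the slides).
COMPATIBLE = {
    ("ALock", "ALock"): True,   # concurrent client access
    ("ALock", "BLock"): True,   # clients may read while a copy goes to tape
    ("ALock", "RLock"): False,  # a restore rewrites the online object
    ("BLock", "BLock"): False,  # one archive migration per object at a time
    ("BLock", "RLock"): False,
    ("RLock", "RLock"): False,
}

def compatible(held, requested):
    # The table is symmetric, so normalize the pair's order before the lookup.
    a, b = sorted((held, requested))
    return COMPATIBLE[(a, b)]

print(compatible("BLock", "ALock"))   # True under the assumed rules
```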

Agenda
- Motivations
- Parallel archiving/retrieving architecture
- Coordinating parallel hierarchical storage management
- Experimental results
- Future work

Prototyping
- Lustre 1.4.0 on Red Hat 9 (Linux kernel 2.4.20)

Experiment setup (single-pair, dual-pair, triple-pair and single-backup configurations)

OSD host configuration:
- CPU: two Intel Xeon 2.0GHz
- Memory: 256MB DDR DIMM
- SCSI interface: Ultra160 SCSI (160MBps)
- HDD: 10,000RPM, 4.7ms avg. seek time
- NIC: Intel Pro/1000MF

Target/MDS host configuration:
- CPU: four Pentium III 500MHz
- Memory: 1GB EDO DIMM
- SCSI interface: Ultra2/LVD SCSI (80MBps)
- HDD: 10,000RPM, 5.2ms avg. seek time
- NIC: Intel Pro/1000MF

Aggregate Archive Bandwidth

Aggregate Retrieve Bandwidth

Future Work
- Intelligent migration decisions: how to use both the file metadata access pattern seen by the MDS and the file object access pattern seen by the OSDs to make better migration decisions.
- Object-archive tertiary storage devices: a device interface for archive and retrieve operations, a format for portable storage media, and support for the sequential-access nature of tapes.
- Extension to pNFS: handling heterogeneous storage device types.

Summary
- A hierarchical storage solution for object-based cluster file systems
- High aggregate archiving/retrieving bandwidth
- An architecture that scales in both archival capacity and bandwidth
- Transparent and automatic object migration

Acknowledgements
- Los Alamos National Laboratory
- DTC Intelligent Storage Consortium (DISC) members: ITRI, LSI Logic, Sun Microsystems, Symantec

Sequence of Accessing an Object Not in Online Storage
(Sequence diagram: Client, MDS, Migration Coordinator (MC), OSTs, Migration Agents (MA), and HSMs with primary and secondary storage; numbered steps 1-11 trace the recall path.)