Leveraging High Performance Computing Infrastructure for Trusted Digital Preservation

Similar documents
Mitigating Risk of Data Loss in Preservation Environments

Digital Curation and Preservation: Defining the Research Agenda for the Next Decade

DSpace Fedora. Eprints Greenstone. Handle System

Enabling Collaboration for Digital Preservation

5th International Digital Curation Conference December 2009

IRODS: the Integrated Rule- Oriented Data-Management System

Harnessing the Data Deluge

The International Journal of Digital Curation Issue 1, Volume

Audio/Video Collection Migration, Storage and Preservation

David Minor UC San Diego Library Chronopolis Preservation Network

New Zealand Government IbM Infrastructure as a service

A Simple Mass Storage System for the SRB Data Grid

Digital Preservation at NARA

Openness, Growth, Evolution, and Closure in Archival Information Systems

Simplifying Collaboration in the Cloud

Creating a Digital Preservation Network with Shared Stewardship and Cost

Scalable, Reliable Marshalling and Organization of Distributed Large Scale Data Onto Enterprise Storage Environments *

DRI: Dr Aileen O Carroll Policy Manager Digital Repository of Ireland Royal Irish Academy

Cyberinfrastructure!

irods at TACC: Secure Infrastructure for Open Science Chris Jordan

New Zealand Government IBM Infrastructure as a Service

The International Journal of Digital Curation Issue 1, Volume

CDL s Web Archiving System

Trusted Digital Repositories. A systems approach to determining trustworthiness using DRAMBORA

Implementing Trusted Digital Repositories

Accelerate Your Enterprise Private Cloud Initiative

Big Data 2015: Sponsor and Participants Research Event ""

CASE STUDY: RELOCATE THE DATA CENTER OF THE NATIONAL SCIENCE FOUNDATION. Alan Stuart, Managing Director System Infrastructure Innovators, LLC

Mainframe Storage Best Practices Utilizing Oracle s Virtual Tape Technology

Digital Preservation in eresearch. Sam Houston Storage Product Manager A/NZ

Introduction to Digital Preservation. Danielle Mericle University of Oregon

Richard Marciano Alexandra Chassanoff David Pcolar Bing Zhu Chien-Yi Hu. March 24, 2010

Knowledge-based Grids

Regional Resilience: Prerequisite for Defense Industry Base Resilience

Libraries and Disaster Recovery

Cloud Services. Infrastructure-as-a-Service

SNIA 100 Year Archive Survey 2017 Thomas Rivera, CISSP

University of British Columbia Library. Persistent Digital Collections Implementation Plan. Final project report Summary version

Storage Infrastructure Optimization (SIO)

Can a Consortium Build a Viable Preservation Repository?

AGENCY: National Weather Service, National Oceanic and Atmospheric Administration, U.S.

Scaling a Global File System to the Greatest Possible Extent, Performance, Capacity, and Number of Users

Copyright 2012 EMC Corporation. All rights reserved.

DATA MANAGEMENT SYSTEMS FOR SCIENTIFIC APPLICATIONS

Network Support for Data Intensive Science

Smart thinking, clever working

Overview of NIPP 2013: Partnering for Critical Infrastructure Security and Resilience October 2013

Distributed Data Management with Storage Resource Broker in the UK

Digital Preservation DMFUG 2017

Long Term Digital Preserva2on

ISO Self-Assessment at the British Library. Caylin Smith Repository

DCH-RP Trust-Building Report

Policy Based Distributed Data Management Systems

Digital Preservation Standards Using ISO for assessment

Infrastructure PA Stephen Lecce

Protecting Future Access Now Models for Preserving Locally Created Content

Advanced Solutions of Microsoft SharePoint Server 2013

Mitigating Preservation Threats: Standards and Practices in the National Digital Newspaper Program

Building a Reference Implementation for Long-Term Preservation

Striving for efficiency

RED HAT ENTERPRISE LINUX. STANDARDIZE & SAVE.

The Third Party Administrator Model: Right for Puerto Rico?

The intelligence of hyper-converged infrastructure. Your Right Mix Solution

Earthquake Early Warning Where we are and where we are going. Doug Given USGS National Earthquake Early Warning Coordinator

Managing Petabytes of data with irods. Jean-Yves Nief CC-IN2P3 France

Managed Hosting Services

Transcontinental Persistent Archive Prototype

Electronic Records Archives: Philadelphia Federal Executive Board

Preservation. Policy number: PP th March Table of Contents

Enterprise X-Architecture 5th Generation And VMware Virtualization Solutions

The Impact of Hyper- converged Infrastructure on the IT Landscape

Storage Challenges for LSST: When Science Is Bigger Than Your Hardware

Energy Assurance Energy Assurance and Interdependency Workshop Fairmont Hotel, Washington D.C. December 2 3, 2013

Managing Records in Electronic Formats. An Introduction

IBM smarter Business Resilience in the Cloud

The Multi Cloud Journey

December 10, Statement of the Securities Industry and Financial Markets Association. Senate Committee on Banking, Housing, and Urban Development

Managing Large Scale Data for Earthquake Simulations

Advanced Solutions of Microsoft SharePoint Server 2013 Course Contact Hours

Accessioning. Kevin Glick Yale University AIMS

Standard: Data Center Security

Hitachi Content Archive Platform

Advanced Solutions of Microsoft SharePoint 2013

Archiving and Preserving the Web. Kristine Hanna Internet Archive November 2006

Solution Brief. Bridging the Infrastructure Gap for Unstructured Data with Object Storage. 89 Fifth Avenue, 7th Floor. New York, NY 10003

Preservation at scale

Distributing BaBar Data using the Storage Resource Broker (SRB)

Cisco Director Class SAN Planning and Design Service

DRS Update. HL Digital Preservation Services & Library Technology Services Created 2/2017, Updated 4/2017

Managing and Accessing 500 Million Files (and More!)

MetaArchive Cooperative TRAC Audit Checklist

Response to Industry Canada Consultation Developing a Digital Research Infrastructure Strategy

CASE STUDY: USING THE HYBRID CLOUD TO INCREASE CORPORATE VALUE AND ADAPT TO COMPETITIVE WORLD TRENDS

DEVELOPING, ENABLING, AND SUPPORTING DATA AND REPOSITORY CERTIFICATION

Developing a social science data platform. Ron Dekker Director CESSDA

Edinburgh DataShare: Tackling research data in a DSpace institutional repository

High Performance Computing Data Management. Philippe Trautmann BDM High Performance Computing Global Research

I data set della ricerca ed il progetto EUDAT

A Ten Year ( ) Storage Landscape LTO Tape Media, HDD, NAND

Building a Dutch National Research Infrastructure IRODS UGM 2017

Transcription:

Leveraging High Performance Computing Infrastructure for Trusted Digital Preservation 12 December 2007 Digital Curation Conference Washington D.C. Richard Moore Director of Production Systems San Diego Supercomputer Center University of California, San Diego rlm@sdsc.edu, http://www.sdsc.edu

San Diego Supercomputer Center: One of original supercomputer centers established by National Science Foundation (ca 1985) A History of Data Supports high performance computing (HPC) systems with a focus on data-intensive computing Supports data applications for science, engineering, social sciences, cultural heritage institutions, etc. SDSC is leveraging the infrastructure and expertise from its HPC experience for digital preservation

SDSC is taking multiple paths to enable digital preservation Production-level storage support Increased levels of reliability/verification SRB & irods development (the other R Moore) Multiple preservation partnerships Demonstrating trust as digital repository At local, regional, national and international levels Bit preservation cost estimates/projections Developing sustainable business models

Production Storage Services SDSC has extensive experience in the provision & operation of large-scale storage systems Size and stability allows SDSC to drive economies of scale and maintain deep base of staff experience SDSC has developed a stable infrastructure which allows for growth and innovation Critical mass of staff & expertise Use of data management tools The storage services are relatively generic i.e. users can take advantage of the infrastructure without deep understanding of the implementation details

SDSC as High-Performance Data Center Serves Both HPC & Digital Preservation Archive 25 PB capacity Both HPSS & SAM-QFS Online disk ~3PB total HPC parallel file systems Collections Databases Access Tools

SDSC Archival Capacity/Stored Data 1997-2007 Model A (8-yr,15.2-mo 2X) TB Stored Planned Capacity 100000.0 10000.0 Archival Storage (TB) 1000.0 100.0 10.0 June-97 June-98 June-99 June-00 June-01 June-02 June-03 June-04 June-05 June-06 June-07 June-08 June-09 Date

Exemplary Digital Preservation Collaborations UC San Diego Libraries (local) Digital Asset Management System w/ SDSC Data Resources (SRB) California Digital Library (regional) CDL Digital Repository Mass Transit program, enabling Data Sharing among UC Libraries Library of Congress (national) Pilot Data Center Project (2006-2007) LC NDIIPP Partnership w/ ICPSR & CDL (2007-2008) Southern California Earthquake Center (national) Managed library of HPC simulations & analysis Many others

A Library of Congress-SDSC Pilot Project Building Trust in a Third-Party Data Repository demonstrate the feasibility and performance of current approaches for a production third-party digital Data Center to support the Library of Congress collections. - One-year pilot project - Ingest, store & serve 2 collections (~6 TB data total) w/ different usage models (e.g. static/dynamic, access patterns) - Focus on trust issues - verification, change detection, audit trails - Documentation and lessons learned Prokudin-Gorskii Photographs http://www.loc.gov/exhibits/empire/ Internet Archive Web Crawls http://lcweb2.loc.gov/cocoon/minerva/html/minervahome.html

LC NDIIPP Chronopolis Program

SDSC as Trusted Digital Repository Undertaking NARA/RLG TRAC audit 2007-2008 Undertaking DRAMBORA audit 2007-2008 Reagan Moore s group developing rules-based approach for NARA/RLG TRAC compliance in irods Developing best practices with NDIIPP partners for Federated Digital Preservation Management Establishing data reliability policies that fit both HPC Data Center and Trusted Repository Center needs Developing best practices for Data Collection packaging and transmission

A Three-Stage Model for A Digital Preservation Environment Ingest Store Use Bit Storage Capacity Online (disk) Archival (tape) Single-copy reliability Media/technology advances Data migration Replication Geographically distributed System diversity Verification & recovery Synchronization Master version Propagating to replicas Audit trails Mitigation of termination risk

Disk/Tape Bit Storage Cost Comparison: Relative Cost Elements

Future projections of bit storage costs If annual costs decline exponentially with a halving time Δt, the cost to store data in perpetuity is finite (1.44 * Δt * Current cost/yr) Expect that exponential declines in media costs and other IT equipment will continue for a while current technologies as well as new technologies MAID targeting disk archive : capital cost comparable to disk, but lower operations costs (utilities, floor space) and projections of extended lifetime Disruptive technologies on horizon e.g. holographic storage Integrated cost ($/TB/yr) will decline, but how much? Critical issue is which cost elements scale with declining media costs and which do not? Most costs scale w/ media, but labor costs may not scale well Cost elements that do not scale well w/ media will dominate future costs, even at the bit storage level, and undermine finite cost And we expect that for total preservation storage costs beyond bit storage - e.g. file management, curation, etc. - labor costs dominate!

Thank You!