Building a Reference Implementation for Long-Term Preservation

Similar documents
Mitigating Risk of Data Loss in Preservation Environments

IRODS: the Integrated Rule- Oriented Data-Management System

The International Journal of Digital Curation Issue 1, Volume

Transcontinental Persistent Archive Prototype

Leveraging High Performance Computing Infrastructure for Trusted Digital Preservation

Implementing Trusted Digital Repositories

DSpace Fedora. Eprints Greenstone. Handle System

Digital Curation and Preservation: Defining the Research Agenda for the Next Decade

Policy Based Distributed Data Management Systems

Cheshire 3 Framework White Paper: Implementing Support for Digital Repositories in a Data Grid Environment

The International Journal of Digital Curation Issue 1, Volume

Distributed Data Management with Storage Resource Broker in the UK

Knowledge-based Grids

Richard Marciano Alexandra Chassanoff David Pcolar Bing Zhu Chien-Yi Hu. March 24, 2010

An overview of the OAIS and Representation Information

The OAIS Reference Model: current implementations

Introduction to The Storage Resource Broker

CineGrid Exchange. Building A Global Networked Testbed for Distributed Media Management and Preservation

Can a Consortium Build a Viable Preservation Repository?

Conducting a Self-Assessment of a Long-Term Archive for Interdisciplinary Scientific Data as a Trustworthy Digital Repository

A Simple Mass Storage System for the SRB Data Grid

Data Replication: Automated move and copy of data. PRACE Advanced Training Course on Data Staging and Data Movement Helsinki, September 10 th 2013

DRI: Dr Aileen O Carroll Policy Manager Digital Repository of Ireland Royal Irish Academy

DATA MANAGEMENT SYSTEMS FOR SCIENTIFIC APPLICATIONS

White Paper: National Data Infrastructure for Earth System Science

Digital Preservation at NARA

Promoting Open Standards for Digital Repository. case study examples and challenges

Working with a Preservation Software Vendor - The Kentucky Experience Glen McAninch

Digital Preservation DMFUG 2017

Managing Petabytes of data with irods. Jean-Yves Nief CC-IN2P3 France

Trusted Digital Repositories. A systems approach to determining trustworthiness using DRAMBORA

RUtgers COmmunity REpository (RUcore)

Enabling Collaboration for Digital Preservation

University of British Columbia Library. Persistent Digital Collections Implementation Plan. Final project report Summary version

Building the Archives of the Future: Self-Describing Records

An Introduction to Digital Preservation

Preserving Electronic Mailing Lists as Scholarly Resources: The H-Net Archives

Policy-Driven Repository Interoperability: Enabling Integration Patterns for irods and Fedora

Constraint-based Knowledge Systems for Grids, Digital Libraries, and Persistent Archives: Final Report May 2007

Data Grid Services: The Storage Resource Broker. Andrew A. Chien CSE 225, Spring 2004 May 26, Administrivia

Simplifying Collaboration in the Cloud

The Virtual Observatory and the IVOA

Sustainable Governance for Long-Term Stewardship of Earth Science Data

Launching the. Data Curation Network NDS/MBDH 2018

Certification. F. Genova (thanks to I. Dillo and Hervé L Hours)

DCH-RP Trust-Building Report

Comparing Curricula for Digital Library. Digital Curation Education

Scientific Workflow Tools. Daniel Crawl and Ilkay Altintas San Diego Supercomputer Center UC San Diego

DRI: Preservation Planning Case Study Getting Started in Digital Preservation Digital Preservation Coalition November 2013 Dublin, Ireland

DEVELOPING, ENABLING, AND SUPPORTING DATA AND REPOSITORY CERTIFICATION

Agenda. Bibliography

Managing Large Distributed Data Sets using the Storage Resource Broker

Digital Preservation Network (DPN)

Applying Archival Science to Digital Curation: Advocacy for the Archivist s Role in Implementing and Managing Trusted Digital Repositories

irods User Group integrated Rule Oriented Data System Reagan Moore

LTR TWG & the Cloud PRESENTATION TITLE GOES HERE

DataONE: Open Persistent Access to Earth Observational Data

Cyberinfrastructure!

Digital repositories as research infrastructure: a UK perspective

5th International Digital Curation Conference December 2009

SRB Logical Structure

One Body, Many Heads for Repository-Powered Library Applications

Grid Computing. MCSN - N. Tonellotto - Distributed Enabling Platforms

EUDAT Data Services & Tools for Researchers and Communities. Dr. Per Öster Director, Research Infrastructures CSC IT Center for Science Ltd

Fundamental Issues for Electronic Records

State Government Digital Preservation Profiles

Developing a Research Data Policy

NUIT Tech Talk Topics in Research Computing: XSEDE and Northwestern University Campus Champions

EUDAT. Towards a pan-european Collaborative Data Infrastructure

irods - An Overview Jason Executive Director, irods Consortium CS Department of Computer Science, AGH Kraków, Poland

MAPR DATA GOVERNANCE WITHOUT COMPROMISE

The EGEE-III Project Towards Sustainable e-infrastructures

Data Sharing with Storage Resource Broker Enabling Collaboration in Complex Distributed Environments. White Paper

NSF Data Management Plan Template Duke University Libraries Data and GIS Services

Fedora, Islandora, & Samvera: Requirements and Gaps. David Wilcox,

Scaling a Global File System to the Greatest Possible Extent, Performance, Capacity, and Number of Users

Archiving and Preserving the Web. Kristine Hanna Internet Archive November 2006

Data Curation Handbook Steps

DSpace development from MIT's Digital Library Research Program

Robin Dale RLG

I data set della ricerca ed il progetto EUDAT

Requirements for data catalogues within facilities

DRS Update. HL Digital Preservation Services & Library Technology Services Created 2/2017, Updated 4/2017

The NASA/GSFC Advanced Data Grid: A Prototype for Future Earth Science Ground System Architectures

MetaData Management Control of Distributed Digital Objects using irods. Venkata Raviteja Vutukuri

e-infrastructures in FP7 INFO DAY - Paris

Data Grids, Digital Libraries, and Persistent Archives

The Storage Networking Industry Association (SNIA) Data Preservation and Metadata Projects. Bob Rogers, Application Matrix

Archiving the Web: What can Institutions learn from National and International Web Archiving Initiatives

AT&T Labs Research Bell Labs/Lucent Technologies Princeton University Rensselaer Polytechnic Institute Rutgers, the State University of New Jersey

Distributing BaBar Data using the Storage Resource Broker (SRB)

EarthCube and Cyberinfrastructure for the Earth Sciences: Lessons and Perspective from OpenTopography

Long-term preservation for INSPIRE: a metadata framework and geo-portal implementation

SNIA 100 Year Archive Survey 2017 Thomas Rivera, CISSP

TIER ROADMAP UPDATE WORKING TOGETHER TO DEVELOP THE PATH

Data Management for Distributed Scientific Collaborations Using a Rule Engine

Improving a Trustworthy Data Repository with ISO 16363

The Data Management Plan: Putting policy into practice Suzanne Clarke Director, Information Resources

Systems Interoperability and Collaborative Development for Web Archiving

Scientific data management

Transcription:

Building a Reference Implementation for Long-Term Preservation Richard Marciano Lead Scientist Sustainable Archives & Library Technologies (SALT) lab director Data Intensive Cyber Environment (DICE) group rmarciano@diceresearch.org http://www.diceresearch.org

GOAL: Building a Preservation Reference Implementation (1/2) Go beyond preservation and access & address the full lifecycle: Appraisal & Disposition Accessioning Arrangement Description Preservation policy enforcement Preservation trustworthiness assessments

Developing Skills for Preservation SAA 2007 Summer Camp: week-long handson training. Topics included: Electronic Records Components of an erecords program Policies / Mandate Technical Infrastructure Social Infrastructure Infrastructure Independence Appraisal & Disposition Accessioning Arrangement Description Preservation Access Scalability

Archives: CA, WA, NC Libraries: AZ State University of Illinois Urbana Champaign University of Madison-Wisconsin University College UCSD Libraries, Spec. Coll. UC Irvine Spec. Coll. Princeton U., S.G.M. Man. Lib. U. San Diego, Copley Library Harvard Business School, B. Lib. University of New Mexico, Pol. Arch. U. of Texas at Arlington, Spec. Coll. Occidental College, Periodicals Dept. US Navy Federal USER Non-profit National Fire Protection Association History Associates, Inc. Local Commercial City of Richmond Archives, Canada Sacramento Archives & Museum Collection Center International Marist Brothers of Canada Cigna Ford Motor Company Preservation Partners SAA e-records Summer Camp 2007

GOAL: Building a Preservation Reference Implementation (2/2) The reference implementation consists of: the record management environment the preservation management rules the management processes that implement preservation services, and the rules that verify compliance with assessment criteria. The resulting system can be shown to be provably correct

irods Collaborations Notre Dame University - porting of the Parrot interface on top of irods, unifies access to GridFTP, irods, file systems University of Texas, Austin - creation of a Common Teragrid Software Stack kit for irods, simplifies installation of irods on Teragrid Sites Vanderbilt - integration of irods with LStore and Logistical Networking, integrates a distributed metadata catalog for the TDLC data grid MIT - integration of DSpace with irods, funded by NARA Fedora Commons - integration of Fedora with irods in support of NSDL Stanford - proposal to use LOCKSS reliability technology for guarantees on distributed irods rule base

irods Collaborations SHAMAN - integration of Cheshire and Multivalent browsing into irods micro-services for parsing data objects CASPAR - representation information for data and Trusted Repository Audit check list assessment of irods rules James Cook University - porting of Python, Perl, PHP load libraries on irods UK ASPIS project - integration of Shibboleth authentication with irods ATOS - use of irods in Bibliotheque Nationale de France DEISA - use of irods in European supercomputer centers D-Grid - irods beta test site Archer - creation of preservation rules (TRAC)

irods Interested Parties Royal British Columbia Museum - irods rules for fixity Globus - integration of irods and GridFTP Aerospace Corp - data interoperability Merrill Lynch - rule-based data management University of York - DAME distributed analysis systems IBM - integration with object based storage devices SNIA - integration with XAM technology Mitre - support for real-time data streams JPL - Planetary Data System

irods Tutorials - 2008 January 31, SDSC April 8 - ISGC, Taipei May 13 - China, National Academy of Science May 27-30 - UK escience, Edinburgh June 5 - OGF23, Barcelona July 7-11 - SAA, SDSC August 4-8 - SAA, SDSC August 25 - SAA, San Francisco

irods: the Latest Generation of Data Grids Data Grids are middleware services Sitting between the applications and data providers Providing transparent and uniform access To diverse types of digital assets Files, databases, streams, web, programs, Documents, images, data, sensor packets, tables, From heterogeneous resources File Systems, tape archives, sensor streams, Distributed over a wide area network Multiple administrative and security domains With users unaware of physical attributes of the data access System addresses, paths, protocols,

Data Grids are Trust Relationships Data-level Trust Virtualization for integrity, authenticity, access provision, availability, data and metadata organization and management, community ownership and curation User-level Trust Virtualization of authentication, authorization, auditing and accounting Resource-level Trust Virtualization of administration and maintenance, appropriation (quota), availability and accesssibility These are Data Grid 1.0 level trusts

Data Grids are Trust Relationships Policy-level Trust Virtualization of Management, Organizational and Community Rules Service-level Trust Virtualization of Operations and Services Execution-level Trust Virtualization of distributed, parallel, asynchronous, delayed and/or remote execution These are Data Grid 2.0 level trusts

User Base & Diversity of Applications Collections at SDSC: 1+PetaBytes, 170+ Million files Multi-disciplinary Scientific Data Astronomy, Cosmology Neuro Science, Cell-Signalling & other Bio-medical Informatics Environmental & Ecological Data Educational (web) & Research Data (Chem, Phys, ) Archival & Library Collections Earthquake Data, Seismic Simulations Real-time Sensor Data Growing at 1TB a day Supporting large projects: TeraGrid, NVO, SCEC, SEEK/Kepler, GEON, ROADNet, JCSG, AfCS, SIO Explorer, SALK, PAT, UCSDLibrary,

What is irods? It is a data grid system data virtualization A distributed file system, based on a client-server architecture. Allows users to access files seamlessly across a distributed environment, based upon their attributes rather than just their names or physical locations. It replicates, syncs and archives data, connecting heterogeneous resources in a logical and abstracted manner. It is a distributed workflow system policy/service virtualization Policy can be coded as functions (micro-services) Remote micro-services can be chained The chains (workflows) are interpreted at run-time The chains can be triggered on an event and condition (rules) They can also be recursive. Similar to SRB Micro-services communicate through parameters, shared contexts, and out-of-band message queues.

Policy Virtualization with irods Micro-Services Functions with well-defined semantics Transactional - recovery Context of application Message Queues Rules Triggered by events Conditional execution of alternative rule declarations System constructs: loops, recursion, branching Workflows Distributed Execution Immediate, Deferred, Periodic Execution at SIO User Application Execution at MBARI Execution at WoodsHole

Rule-based Data Management Administrator-controlled rules to implement management policies Administrative - adding / deleting users, resources Data ingestion - pre-processing, post-processing Data transport / deletion - parallel I/O streams, disposition Data retention policies expiration, over-writes, versions Data Reliability Policies copies, formats, migration, checking,

Distributed Management System Distributed Management System Rule Rule Engine Engine Execution Execution Control Control Messaging Messaging System System Execution Execution Engine Engine Virtualization Virtualization Server Server Side Side Workflow Workflow Scheduling Scheduling Policy Policy Management Management Data Data Transport Transport Metadata Metadata Catalog Catalog Persistent Persistent State State information information

Management Virtualization Examples of management policies Integrity Validation of checksums Synchronization of replicas Data distribution Data retention Access controls Authenticity Chain of custody - audit trails Track required preservation metadata - templates Generation of Archival Information Packages

Rule-based Data Management Associate rules with combinations of name spaces Rule set for a particular collection Rule set for a particular user group Rule set for a particular user group when accessing a particular collection Rule set for a particular storage system Rule set for a particular micro-service Generic rules based on SRB operations

TPAP (1/2) National Archives and Records Administration Transcontinental Persistent Archive Prototype Federation of Seven Independent Data Grids NARA I NARA II Georgia Tech Rocket Center U NC U Md SDSC MCAT MCAT MCAT MCAT MCAT MCAT MCAT Extensible Environment, can federate with additional research and education sites. Each data grid uses different vendor products.

TPAP (2/2) Transcontinental Persistent Archive Prototype Distributed Data Management Concepts Data virtualization Storage system independence Trust virtualization Administration independence Risk mitigation Federation of multiple independent data grids Operation independence

PAT: Persistent Archives Testbed Local Storage Resources Kentucky Grid Brick Michigan Grid Brick Minnesota Grid Brick Ohio Grid Brick SLAC Storage SDSC Archive Metadata Catalog (Oracle) MCAT Shared Preservation Environment Archival Storage (HPSS, Sam-QFS)

PAT Project Participants

The Evolution of PAT: archives on rules archives on rules what the DCP Center project will try to do Automate curation processes e.g. design reusable curation workflows Enforce curation policies e.g. enforce retention/disposition schedules Verify assertions about curation results e.g. periodically verify checksums e.g. parse audit trails to verify accesses e.g. RLG/NARA Trusworthiness Assessment Criteria

DCP: Distributed Custodial Preservation Center Purpose: Build a distributed production preservation environment that meets the needs of archival repositories for trusted archival preservation services Distributed partnership of 35 participants across 11 institutions: * STATES: - California - Kansas - Michigan - Kentucky - North Carolina - New York * UNIVERSITIES: - Tufts University - West Virginia University * CULTURAL ENTITIES: - Getty Research Institute * INTERNATIONAL PARTNERS: - Carleton University (Geomatics and Cartographic Research Centre)

Building a Preservation Reference Implementation Go beyond preservation and access SAA 2007 Summer Camp: week-long handson training. Topics included: Electronic Records Components of an erecords program Policies / Mandate Technical Infrastructure Social Infrastructure Address the full lifecycle: Appraisal & Disposition Accessioning Arrangement Description Preservation policy enforcement Preservation trustworthiness assessments

Preservation Reference Implementation (cont. 2/3) NARA ERA capabilities list, and the assessment criteria are based on the Trustworthy Repositories Audit & Certification (TRAC): Criteria and Checklist. For each identified capability, the required operations are encapsulated in micro-services that are executed at the storage location, under the control of rules that implement the management policies needed to enforce TRAC criteria.

Preservation Reference Implementation (cont. 3/3) Rules are also defined that periodically query the system to verify compliance, and automate recovery procedures when problems are found. The reference implementation then consists of: the record management environment the preservation management rules the management processes that implement preservation services, and the rules that verify compliance with assessment criteria.

For More Information Richard Marciano DICE Group rmarciano@diceresearch.org http://www.diceresearch.org