Format Technologies for the Long Term Retention of Big Data

Similar documents
Combining SNIA Cloud, Tape and Container Format Technologies for the Long Term Retention of Big Data

Format Technologies for the Long Term Retention of Big Data

Format Technologies for the Long Term Retention of Big Data. Roger Cummings, Antesignanus Co-Author: Simona Rabinovici-Cohen, IBM Research Haifa

SNIA 100 Year Archive Survey 2017 Thomas Rivera, CISSP

Self-contained Information Retention Format For Future Semantic Interoperability

LTR TWG & the Cloud PRESENTATION TITLE GOES HERE

Self-contained Information Retention Format (SIRF) Specification

Interoperable Cloud Storage with the CDMI Standard. Mark Carlson, SNIA TC and Oracle Co-Chair, SNIA Cloud Storage TWG

The Storage Networking Industry Association (SNIA) Data Preservation and Metadata Projects. Bob Rogers, Application Matrix

Data Archiving Using Enhanced MAID

Cloud Archive and Long Term Preservation Challenges and Best Practices

LTFS Bulk Transfer Standard PRESENTATION TITLE GOES HERE

Data-intensive Storage Services on Clouds: The VISION Cloud Project Simona Rabinovici-Cohen, Hillel Kolodner IBM Research - Haifa

More than a Lifetime of

The Need for a Terminology Bridge. May 2009

Preservation DataStores: An architecture for Preservation Aware Storage

Interoperable Cloud Storage with the CDMI Standard. Mark Carlson, SNIA TC and Oracle Chair, SNIA Cloud Storage TWG

Active Archive and the State of the Industry

CAN MICROSOFT HELP MEET THE GDPR

The OAIS Reference Model: current implementations

Multi-Cloud Storage: Addressing the Need for Portability and Interoperability

Archive and Preservation for Media Collections

White paper Selecting the right method

GDPR compliance. GDPR preparedness with OpenText InfoArchive. White paper

A Vendor Agnostic Overview. Walt Hubis Hubis Technical Associates

SNIA Cloud Storage TWG

Data Management Checklist

Selecting the Right Method

extensible Access Method (XAM) - a new fixed content API Mark A Carlson, SNIA Technical Council, Sun Microsystems, Inc.

Proceedings of the 4 th International Workshop on Semantic Digital Archives (SDA 2014)

ARCHIVE ESSENTIALS

Information Lifecycle Management for Business Data. An Oracle White Paper September 2005

Secure Messaging is far more than traditional encryption.

ARCHIVE ESSENTIALS: Key Considerations When Moving to Office 365 DISCUSSION PAPER

Digital Preservation at NARA

Cost and security improvement of the Long Time Preservation by the use of open source software and new ISO standard in the National Library repository

Product Brief: XenData X2500-SAS LTO-6 Digital Video Archive System

Solution Brief. Bridging the Infrastructure Gap for Unstructured Data with Object Storage. 89 Fifth Avenue, 7th Floor. New York, NY 10003

PRESENTATION TITLE GOES HERE. Understanding Architectural Trade-offs in Object Storage Technologies

Introduction to Digital Archiving and IBM archive storage options

How to Participate in SNIA Standards & Software Development. Arnold Jones SNIA Technical Council Managing Director

Cipherpost Pro is far more than traditional encryption.

Trends in Data Protection and Restoration Technologies. Mike Fishman, EMC 2 Corporation

Administration and Data Retention. Best Practices for Systems Management

Sachin Goswami TATA Consultancy Services

Deploying Public, Private, and Hybrid. Storage Cloud Environments

MODERN ARCHIVING WITH ELASTIC CLOUD STORAGE (ECS)

Tom Sas HP. Author: SNIA - Data Protection & Capacity Optimization (DPCO) Committee

EU GDPR & NEW YORK CYBERSECURITY REQUIREMENTS 3 KEYS TO SUCCESS

General Data Protection Regulation: Knowing your data. Title. Prepared by: Paul Barks, Managing Consultant

The digital preservation technological context

WHITE PAPER. The General Data Protection Regulation: What Title It Means and How SAS Data Management Can Help

Tape Sucks for Long-Term Retention Time to Move to the Cloud. How Cloud is Transforming Legacy Data Strategies

ZIMBRA & THE IMPACT OF GDPR

Storage Industry Resource Domain Model

Susan Thomas, Project Manager. An overview of the project. Wellcome Library, 10 October

2 The IBM Data Governance Unified Process

The Evolving Role of Tape and Disk in the High Performance Data Center Bruce Master IBM LTO Program

Storage Networking Strategy for the Next Five Years

IBM tape libraries help Arkivum make the difference

McAfee Total Protection for Data Loss Prevention

Creating our Digital Cultural Heritage. Characteristics of Digital Information

XenData6 Workstation User Guide

Compliance of Panda Products with General Data Protection Regulation (GDPR) Panda Security

Application Recovery. Andreas Schwegmann / HP

Overview of Archiving. Cloud & IT Services for your Company. EagleMercury Archiving

HPE Converged Data Solutions

Tutorial. A New Standard for IP Based Drive Management. Mark Carlson SNIA Technical Council Co-Chair

Audio/Video Collection Migration, Storage and Preservation

Cloud Computing Standard 1.1 INTRODUCTION 2.1 PURPOSE. Effective Date: July 28, 2015

Oracle 11g Partitioning new features and ILM

THE TRIAL MASTER FILE

LTO and Magnetic Tapes Lively Roadmap With a roadmap like this, how can tape be dead?

Information Infrastructure Forum

BYOD Success Kit. Table of Contents. Current state of BYOD in enterprise Checklist for BYOD Success Helpful Pilot Tips

Copyright 2013, Oracle and/or its affiliates. All rights reserved. Insert Information Protection Policy Classification from Slide 12

DIGITAL STEWARDSHIP SUPPLEMENTARY INFORMATION FORM

NSF Data Management Plan Template Duke University Libraries Data and GIS Services

Archive II. The archive. 26/May/15

Operational Network Security

Backup and Archiving for Office 365. White Paper

Restoration Technologies

The Root Cause of Unstructured Data Problems is Not What You Think

Maintain Data Control and Work Productivity

Storage Clouds. Marty Stogsdill Oracle

Brendan Lelieveld-Amiro, Director of Product Development StorageQuest Inc. December 2012

SQL Compliance Whitepaper HOW COMPLIANCE IMPACTS BACKUP STRATEGY

XenData MX64 Edition. Product Brief:

Internet, , Social Networking, Mobile Device, and Electronic Communication Policy

Privacy Policy of

Data Privacy in Your Own Backyard

Data Governance: Data Usage Labeling and Enforcement in Adobe Cloud Platform

-archiving. project roadmap CHAPTER 1. archiving Planning, policies and product selection

Fujitsu World Tour 2018

Best Practices in Designing Cloud Storage based Archival solution Sreenidhi Iyangar & Jim Rice EMC Corporation

Datasheet. Only Workspaces delivers the features users want and the control that IT needs.

We offer background check and identity verification services to employers, businesses, and individuals. For example, we provide:

XML information Packaging Standards for Archives

General Data Protection Regulation (GDPR) and the Implications for IT Service Management

in Transition to the Cloud David A. Chapa, CTE EVault, a Seagate Company

Transcription:

Combining PRESENTATION SNIA Cloud, TITLE Tape GOES HERE and Container Format Technologies for the Long Term Retention of Big Data Presenter: Sam Fineberg, HP Co-Authors: Simona Rabinovici-Cohen, IBM Research Haifa Roger Cummings, Antesignanus

Outline Introduction SNIA Long Term Retention technology Self-contained Information Retention Format (SIRF) Other SNIA technologies in development Cloud Data Management Interface (CDMI) Linear Tape File System (LTFS) Combining SNIA technologies SIRF Serialization for CDMI SIRF Serialization for LTFS EU Project ForgetIT and SIRF Summary 2

Big Data.. Really is BIG.. 2.5 quintillion (10 18 ) bytes of new data created per day in 2012 (source IBM) And the move to the Internet of Things is only going to increase this volume 19.8 Billion connected devices by 2020 (source McKinsey) Only 4.2 billion smartphones and tablets, 3.4 billion PCs Data analytics is improving all the time Therefore historical information has significant value Apply new techniques and algorithms to gain new insights Need to ensure ALL necessary information is captured to extract full value Therefore Big Data has similarities to (long term) preservation 3

The Need for Digital Preservation of Big Data Regulatory compliance and legal issues Sarbanes-Oxley, HIPAA, FRCP, intellectual property litigation Emerging web services and applications Email, photo sharing, web site archives, social networks, blogs Many other fixed-content repositories Scientific data, intelligence, libraries, movies, music Domains that have Big Data require preservation Healthcare M&E Satellite data is kept for ever We would like to keep digital art for ever Scientific and Cultural X-rays are often stored for periods of 75 years Records of minors are needed until 20 to 43 years of age Film Masters, Out takes. Related artifacts (e.g., games). 100 Years or more 4

SNIA Survey from 2007 Preservation of business history Other Protection of customer privacy Security Risk Top External Factors Driving Long-Term Retention Requirements: Legal Risk, Compliance Regulations, Business Risk, Security Risk Protection of business or intellectual assets Retaining history for competitiveness or protection Protection from compliance or legal fines Meeting regulatory requirements Business Risk Business Risk Compliance Requirements Compliance Requirements Security Risk Meeting regulatory requirements Legal Risk Concern with ligitation protection Legal Risk 0% 10% 20% 30% 40% 50% 60% Percent of Respondents Source: SNIA-100 Year Archive Requirements Survey, January 2007. >100 Years >50-100 Years >21-50 Years >11-20 Years >7-10 Years 18.3% 38.8% 13.1% 15.7% 12.3% What does Long-Term Mean? Retention of 20 years or more is required by 70% of responses. >3-6 Years 1.9% 0.0% Combining 5.0% 10.0% SNIA 15.0% Cloud, 20.0% Tape and 25.0% Container 30.0% Format 35.0% Technologies 40.0% for the Long Term Retention of Big Data 5

Goals of Digital Preservation Digital assets stored now should remain Accessible Undamaged Usable For as long as desired beyond the lifetime of Any particular storage system Any particular storage technology And at an affordable cost 6

Real Life Example Problem 2003 2007 To: roger.cummings@veritas.com From: fred@nowhere.com Subject: Something or other To: roger_cummings@symantec.com From: sue@somewhere.com Subject: Something else Same people?? Could you PROVE it 20 years on? To: gary.phillips@veritas.com From: fred@nowhere.com Subject: Something or other To: gary_phillips@symantec.com From: sue@somewhere.com Subject: Something else 7

Outline Introduction SNIA Long Term Retention technology Self-contained Information Retention Format (SIRF) Other SNIA technologies in development Cloud Data Management Interface (CDMI) Linear Tape File System (LTFS) Combining SNIA technologies SIRF Serialization for CDMI SIRF Serialization for LTFS EU Project ForgetIT and SIRF Summary 8

SIRF: Self-contained Information Retention Format Being developed by SNIA Long Term Retention (LTR) TWG Photo courtesy Oregon State Archives An Analogy Standard physical archival box Archivists gather together a group of related items and place them in a physical box container The box is labeled with information about its content e.g., name and reference number, date, contents description, destroy date SIRF is the digital equivalent Logical container for a set of (digital) preservation objects and a catalog The SIRF catalog contains metadata related to the entire contents of the container as well as to the individual objects SIRF standardizes the information in the catalog 9

SIRF Properties SIRF is a logical data format of a storage container appropriate for long term storage of digital information A storage container may comprise a logical or physical storage area considered as a unit. Examples: a file system, a tape, a block device, a stream device, an object store, a data bucket in a cloud storage Required Properties Self-describing can be interpreted by different systems Self-contained all data needed for the interpretation is in the container Extensible so it can meet future needs 10

SIRF Components A SIRF container includes: Magic object: identifies SIRF container and its version Preservation objects that are immutable Catalog that is Updatable Contains metadata to make container and preservation objects portable into the future without external functions 11

Outline Introduction SNIA Long Term Retention technology Self-contained Information Retention Format (SIRF) Other SNIA technologies in development Cloud Data Management Interface (CDMI) Linear Tape File System (LTFS) Combining SNIA technologies SIRF Serialization for CDMI SIRF Serialization for LTFS EU Project ForgetIT and SIRF Summary 12

Cloud Data Management Interface (CDMI) An ISO/IEC 17826:2012 Information technology standard being developed by SNIA CDMI TWG The CDMI standard defines an interoperable format for moving data and associated metadata between cloud providers CDMI data objects can be accessed by standard browsers and internet tools (subject to owner s access control lists) CDMI data objects may order data services from the cloud Secure Erasure, Encryption, Replication, Retention, Backup/Restore, Tiering, Hashing, Preservation, etc. (extensible) Done through Data System Metadata (key/value) on the Containers or Objects Has several implementations including in OpenStack 13

Model for the CDMI Interface Resources accessed through RESTful interface: 14

Linear Tape File System (LTFS) A file system implemented on dual-partition linear tape: Index Partition and Data Partition Index Partition is small (2 wraps, 37.5 GB out of 1.5 TB on LTO5) Data Partition is remainder of the tape File System module that implements a set of standard file system interfaces Implemented using FUSE On Linux and Mac OS X Windows implementation uses FUSE-like framework Includes an on-tape structure used to track tape contents XML Index Schema Format becoming the standard for linear tape Formal standardization through SNIA LTFS TWG 15

Logical View of LTFS Volume LTFS XML Index Index Partition Guard Wraps File File File B O T File File Data Partition E O T Check out SNIA Tutorial: Big Data Storage Options for Hadoop Check out SNIA Tutorial: Protecting Data in the "Big Data" World 16

Outline Introduction SNIA Long Term Retention technology Self-contained Information Retention Format (SIRF) Other SNIA technologies in development Cloud Data Management Interface (CDMI) Linear Tape File System (LTFS) Combining SNIA technologies SIRF Serialization for CDMI SIRF Serialization for LTFS EU Project ForgetIT and SIRF Summary 17

Goals of SIRF Serialization for CDMI/LTFS SIRF serialization for CDMI/LTFS specifies how a CDMI container or LTFS Tape also becomes SIRF-compliant A SIRF-compliant CDMI container or LTFS Tape enables a future CDMI/LTFS client to understand containers created by today s CDMI/LTFS client The properties of the future client is unknown to us today understand means identify the preservation objects in the container, the packaging format of each object, its fixities values, etc. (as defined in the SIRF catalog) 18

SIRF Serialization for CDMI: Interface CDMI API can be used to access the various preservation objects and the catalog object in a SIRF-compliant CDMI container Example Assume we have a cloud container named "PatientContainer" that is SIRFcompliant the container has a catalog object each encounter is a preservation object each image is a preservation object cat Patient PO PO Container PO We can read the various preservation objects and the catalog object via CDMI REST API as follows: GET <root URI>/ PatientContainer>/sirfCatalog GET <root URI>/<PatientContainer>/encounterJan2001 GET <root URI>/<PatientContainer>/chestImage 19

SIRF Serialization for CDMI PatientContainer SIRF magic object: specification=1111 SIRF level = 1 Catalog object=sirfcatalog sirfcatalog { "encounterjan2001":[ "IDs": [{...},] "Fixity": [{...},] ] "chestimage":[ "IDs": [{...},] "Fixity": [{...},] ] } Simple Simple PO Simple PO PO Encounter Jan2001 Simple PO cestimage manifest cestimage manifest cestimage manifest chestimage cestimage manifest cestimage dicom1 cestimage dicom1 cestimage dicom1 cestimage dicom1 cestimage dicom1 chestimage dicom1 chestimage dicom1 dicom2 Composite PO 20

SIRF Serialization for CDMI: General A CDMI Container can be qualified also as a SIRF Container when: The SIRF magic object is mapped to the CDMI container metadata and includes, for example, specification ID and version, SIRF level, SIRF catalog object ID. The SIRF catalog is an object in the CDMI container formatted in JSON A SIRF preservation object (PO) that is a simple object (contains one element) is mapped to a CDMI data object The simple object can be a tar/zip A SIRF PO that is a composite object (contains several elements) is mapped to: a set of data objects (one for each element) and a manifest data object that its content includes the IDs and fixities of the element data objects 21

SIRF Serialization for LTFS File Mark IP Label Construct. SIRF catalog LTFS index SIRF magic object: specification=1111 SIRF level = 1 Catalog object=sirfcatalog DP Label Construct. Preservation Object. Preservation Object The index partition of the tape is 2 wraps which is 37.5 GB in LTO-5 and probably larger in LTO-6. The tape index partition is large enough to hold the LTFS index, the SIRF catalog, and even additional information e.g. thumbnails of images 22

SIRF Serialization for LTFS: General A LTFS Tape can also be a SIRF Container when: The SIRF magic object is mapped to extended attributes of the LTFS index root directory The magic object includes, for example, specification ID and version, SIRF level, reference to SIRF catalog The SIRF catalog resides in the index partition and formatted in XML A SIRF preservation object (PO) that is a simple object (contains one element) is mapped to a LTFS file A SIRF PO that is a composite object (contains several elements) is mapped to: a set of LTFS files (one for each element) and a manifest file that its content includes the IDs and fixities of the element data objects 23

Outline Introduction SNIA Long Term Retention technology Self-contained Information Retention Format (SIRF) Other SNIA technologies in development Cloud Data Management Interface (CDMI) Linear Tape File System (LTFS) Combining SNIA technologies SIRF Serialization for CDMI SIRF Serialization for LTFS EU Project ForgetIT and SIRF Summary 24

ForgetIT is an FP7 EU Project in the area of preservation Three year Integrated Project (IP) started Feb. 1, 2013 Consortium of 11 partners (industry and academic) ForgetIT combines new concepts for easing the adoption of preservation in personal and organizational contexts

ForgetIT - Concise Preservation by combining Managed Forgetting and Contextualized Remembering Contextualized Remembering bringing back information into active use in a meaningful way Synergetic Preservation couples information management and preservation management Managed Forgetting as opposed to the current forgetting by accident inspired by human forgetting

PDS and SIRF in ForgetIT ingestaip, accessaip, transformaip, manageaggregation PDS Preservation Engine AIP Service Migration Service Fixity Service SIRF Handler Multi-Cloud Service REST container container OpenStack Swift Storlet Engine Preservation DataStores (PDS) provides preservation-aware storage services for ForgetIT that is based on the OAIS model The SIRF Handler can create SIRF-complaint containers in OpenStack Swift Cloud 27

Summary Need to retain not only information of interest but ALL other information to make it fully usable in future Put it all in the SIRF digital box, preserve that as a unit SIRF includes metadata about the storage container, to help understand the contents of the container in the future No single technology will be usable over the timespans mandated by current digital preservation needs SNIA CDMI and LTFS technologies are among best current choices Are good for perhaps 10-20 years SIRF provides a vehicle for collecting all of the information that will be needed to transition to new technologies in the future SIRF can be serialized for the future technologies as they come 28

Attribution & Feedback The SNIA Education Committee thanks the following individuals for their contributions to this Tutorial. Authorship History Authors (Fall 2013) Mary Baker Simona Rabinovici-Cohen Roger Cummings Sam Fineberg Additional Contributors Mark Carlson (& the Cloud TWG) David Pease Joseph White Alan Yoder (incorporating materials from earlier tutorials dating back to 2008, and with particular thanks to the 100 Year Archive Task Force (2007)) Please send any questions or comments regarding this SNIA Tutorial to tracktutorials@snia.org 29

For further information SIRF use cases and requirements document is released for public review http://www.snia.org/tech_activities/publicreview More information on SIRF (& other SNIA LTR activities) is available at http://www.snia.org/ltr More information on ForgetIT is available @: http://www.forgetit-project.eu/ 30