Data Centres in the Virtual Observatory Age

Data Centres in the Virtual Observatory Age. David Schade, Canadian Astronomy Data Centre.

A few things I've learned in the past two days: there exist serious efforts at Long-Term Data Preservation (the Alliance for Permanent Access; CASPAR, an interesting top-down approach to the broad study of digital data preservation). Older data degrades in value as understanding gets fuzzy. I don't know much about Long-Term Preservation.

CADC History: Data Collections. Formed in 1986 to distribute HST data to Canadian astronomers. (Collections: HST)

CADC History: Data Collections. The Canada-France-Hawaii Telescope was added. (Collections: HST, CFHT)

CADC History: Data Collections. The Gemini Observatories in Hawaii and Chile were added. (Collections: HST, CFHT, Gemini)

CADC History: Data Collections. The James Clerk Maxwell Telescope (sub-mm) was added. (Collections: HST, CFHT, Gemini, JCMT)

CADC 2009: Data Collections. Collections: CFHT, HST, Gemini, CGPS, JCMT, MOST, FUSE, BLAST, MACHO. We deliver 2 terabytes per week to 2,500 distinct users in 87 countries, serve all Canadian astronomy research universities, and operate a large astronomy data centre (400 TB) with 25 staff.

CADC Data Flows (last year)

The key to our success: an intimate and intense working relationship between CADC software developers, scientists, and operations staff.

Added value is our business: query capability; web interfaces; on-the-fly calibration for HST; data processing (WFPC2, ACS, the Hubble Legacy Archive, CFHT mosaic cameras); the Virtual Observatory; integration of services locally and globally.

Old model for data management: CFHT → Principal Investigator.

Old model for data management: CFHT → Principal Investigator, with a Data Archive serving Archive Users as a side branch.

Current model for data management: CFHT → Processing Centre → Data Centre → Principal Investigator, Survey Teams, Archival Researchers, and the Virtual Observatory. CADC is in the main stream of the data flow: all data and metadata flow through the data centre.

Long Term Data Preservation. We are not in the LTDP business: we do not have the right staff for LTDP, and we don't have funding for LTDP. We pursue funding based on support of leading-edge science, especially surveys. Our work lays the foundation for LTDP. We need a proper model for LTDP for these major collections.

Data Security. We maintain 2 copies of each file on spinning disk (GEN 5 now) and 2 copies on tape (offsite backup). Our recovery-from-backup system is not well tested. This is not a very high level of security, and we are not satisfied with this situation.
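
The slide states the replication layout but not how the copies are verified. Below is a minimal sketch of the kind of fixity audit such a layout calls for, assuming Python and only the standard library; the manifest format and function names are illustrative, not CADC's actual tooling.

```python
import hashlib
from pathlib import Path

def sha256(path: Path, chunk: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so multi-GB FITS files never sit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while block := f.read(chunk):
            digest.update(block)
    return digest.hexdigest()

def audit_replica(manifest: dict, replica_root: Path) -> list:
    """Check one replica against a manifest of {filename: expected digest}.

    Returns the files that are missing or whose bytes no longer match,
    i.e. the files that would have to be restored from another copy.
    """
    damaged = []
    for name, expected in manifest.items():
        copy = replica_root / name
        if not copy.is_file() or sha256(copy) != expected:
            damaged.append(name)
    return damaged
```

Running such an audit routinely against each copy (and exercising restores of anything it flags) is what turns "not well tested" recovery into a tested one.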

Important roles for CADC: CFHT → Data Centre → Principal Investigator, Survey Teams, Archival Researchers, Virtual Observatory.

Important roles for CADC: HST, JCMT, Gemini, and CFHT all feed the Data Centre. We play an important role in working with observatories to improve their data and metadata collection methods and quality (INTENSE interaction). This results in great improvements to archived data quality. The PI and archive data streams are the same, which is good for data quality.

Integration of data and services. We have taken a few steps beyond the archive-specific data management paradigm. (CFHT, Gemini, JCMT, HST → USER)

Common Archive Observation Model. We have taken a few steps beyond the archive-specific data management paradigm. (CFHT, Gemini, JCMT, HST → Integration → USER)

Common Archive Observation Model. Transform archive-specific metadata [fitstocaom] into a common model (CAOM) that represents the various data products, provenance, etc. for multi-facility data. All downstream software is then a single stack (query, data access, VO services). CAOM supports a science user's view of the observations; it does not necessarily support LTDP. (CFHT, Gemini, JCMT, HST → Integration → USER)
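
fitstocaom itself is not shown in the talk, so the sketch below only illustrates the idea, assuming astropy is available; the keyword maps and output field names are hypothetical stand-ins for the real CAOM schema.

```python
from astropy.io import fits

# Each archive spells the same concept with different header keywords;
# a per-archive map confines that knowledge to one place.
KEYWORD_MAPS = {
    "CFHT": {"target": "OBJECT", "instrument": "INSTRUME", "exposure": "EXPTIME"},
    "HST": {"target": "TARGNAME", "instrument": "INSTRUME", "exposure": "EXPTIME"},
}

def fits_to_common_model(path: str, archive: str) -> dict:
    """Map one archive-specific FITS header onto a common observation model.
    Downstream services (query, data access, VO) see only this shape."""
    keys = KEYWORD_MAPS[archive]
    header = fits.getheader(path)
    return {
        "collection": archive,
        "target": header.get(keys["target"]),
        "instrument": header.get(keys["instrument"]),
        "exposure_s": header.get(keys["exposure"]),
    }
```

The design point is the single downstream stack: adding an archive means adding a keyword map, not building another query or data-access service.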

Common Archive Observation Model. To be implemented in each archive and mirrored in the data warehouse. Purpose: standardize the core of every archive; serve as the only metadata interface between archives and the data warehouse; provide a general-purpose infrastructure to respond to evolving VO standards. Model: inspired by VO work (Observation, Characterization, SIA, SSA, Authentication, etc.), using our archive, data-modelling, and VO experience; characterisation is based on FITS WCS Papers I, II, and III.
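
As one concrete instance of characterisation based on the FITS WCS papers, here is a minimal astropy sketch, a stand-in rather than the CAOM implementation, that derives an image's sky footprint from its WCS keywords.

```python
from astropy.io import fits
from astropy.wcs import WCS

def sky_footprint(path: str):
    """Spatial characterisation from FITS WCS keywords (Papers I and II):
    the world coordinates of the image's four pixel corners."""
    header = fits.getheader(path)
    return WCS(header).calc_footprint(header)  # 4x2 array of (RA, Dec) in degrees
```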

Common Archive Observation Model

CAOM Data Types

Common Archive Observation Model

Access Control

Common Archive Observation Model

Processing and Provenance

Common Archive Observation Model. Transform archive-specific metadata [fitstocaom] into a common model (CAOM) that represents the various data products, provenance, etc. for multi-facility data. All downstream software is then a single stack (query, data access, VO services). CAOM supports a science user's view of the observations; it does not necessarily support LTDP. (CFHT, Gemini, JCMT, HST → Integration → USER)

Virtual Observatory. The requirement for integration is driven by science practice: multi-facility, multi-wavelength datasets used by multi-national teams. (CFHT, JCMT, Gemini, HST → CAOM → Integration → USER)
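
In practice, the integrated multi-facility view is what a VO query interface exposes. A hedged sketch using pyvo follows; the service URL is a placeholder, and the table and column names follow the CAOM tables CADC later published through TAP, so treat them as illustrative.

```python
import pyvo

# Placeholder endpoint: look up the live CADC TAP service in the VO registry.
service = pyvo.dal.TAPService("https://example.org/cadc/tap")

# One ADQL query reaches CFHT, JCMT, Gemini, and HST observations at once,
# because every archive was first mapped into the common model.
results = service.search(
    "SELECT collection, target_name, instrument_name "
    "FROM caom2.Observation "
    "WHERE target_name = 'M31'"
)
for row in results:
    print(row["collection"], row["instrument_name"])
```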

Science Data Centres. Future steps will take us beyond the science-specific data management paradigm. (CFHT, JCMT, Gemini, HST plus physics, ocean sciences, chemistry, and biology data → Integration → USER)

VO Challenges. A fundamental problem in the VO is that we have been focusing on INTEROPERABILITY when many (most?) data collections and services are not of high enough quality to support VO-level services. Substantial data engineering is needed to bring these services up to the required standard. The concept of the VO as a lightweight layer on top of archives is incorrect. Competing pressures pull between "make it easy to implement" and "make it powerful".

Data Centres. Future steps will take us beyond the domain-specific data management paradigm. (Science, medical, economic, and security data → Integration → USER.) *Are we really looking at data at this level?

Long Term Data Preservation. Because instrumental data volumes have grown with time, each old collection is small next to the incoming stream, so migrating old collections to new media has been negligible in cost. Soon the JCMT will close; later, CFHT will close. For the first time we will face significant data-preservation costs for inactive data collections. Where will the funding come from? We need a proper model for LTDP for these major collections.

Canadian Investments in Computational Infrastructure: Compute Canada and CANARIE. Neither the grid computing facilities of Compute Canada nor the network facilities of CANARIE are delivering the performance needed by observational astronomers. Configuration issues are the problem; CANFAR is attempting to address them.

Proposed model for data management: CFHT → Processing Centre → Data Centre → Principal Investigator, Survey Teams, Archival Researchers, Virtual Observatory, with the Data Centre connected over CANARIE to cloud storage and processing on the Compute Canada grid (Infrastructure as a Service).
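
From a user script, storage as Infrastructure as a Service ends up looking roughly like the sketch below, which uses the vos VOSpace client that CADC/CANFAR later distributed; the vos: project URI is hypothetical.

```python
from vos import Client  # CADC's VOSpace client: pip install vos

client = Client()

# Push a reduced image into grid-backed project storage, then fetch it back;
# the data centre, not the user's machine, holds the durable copy.
client.copy("reduced_image.fits", "vos:myproject/reduced_image.fits")
client.copy("vos:myproject/reduced_image.fits", "local_copy.fits")
```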

Cloud Computing: Implications for LTDP. What are the implications? What commitment exists from Compute Canada? What will happen 10 years from now? These are new uncertainties for LTDP. (CFHT → Data Centre → Compute Canada; Archival Researchers, Virtual Observatory)

Summary. Key to success: an intimate working relationship between CADC software developers, scientists, and operations staff. We add value to data (we are in the mainstream of the data flow). Long Term Data Preservation is not a primary driver of our activities, yet our work lays the foundation for LTDP. We invest heavily in data integration (CAOM) and the VO. Data collections need to be upgraded as well as made interoperable; we help to improve data quality at source. The future of our data collections is uncertain to a disturbing degree.