Data-Intensive Workflows A journey to a Holistic Framework for Data-Intensive Workflows

Data-Intensive Workflows: A journey to a Holistic Framework for Data-Intensive Workflows
Ian Corner, Design and Implementation Lead
May 2016
INFORMATION MANAGEMENT AND TECHNOLOGY (IMT)

CSIRO Who we are Commonwealth Scientific and Industrial Research Organisation

CSIRO Our Mission Strategy 2020: Australia's Innovation Catalyst

CSIRO What we do

CSIRO Our Collections
Australian National Insect Collection: 12,000,000 specimens (+100,000 per year)
Australian National Fish Collection: 5,000 species
Australian National Algae Culture Collection: 1,000 strains of more than 300 micro-algae species
Australian National Herbarium: 1,000,000 herbarium specimens (Captain Cook's 1770 expedition to Australia)
Australian National Wildlife Collection: 200,000 irreplaceable specimens of wildlife
http://www.csiro.au/en/research/collections

CSIRO Yesterday's Collections Physical collections, captured and preserved http://www.csiro.au/en/research/collections/anic

CSIRO Today's Collections We need collections digitised, discoverable, consumable http://data.csiro.au/

CSIRO Today's Collections RV Investigator is our state-of-the-art marine research vessel, supporting Australia's atmospheric, oceanographic, biological and geosciences research from the tropical north to the Antarctic ice-edge. http://www.csiro.au/en/research/facilities/marine-national-facility/rv-investigator

Data-Intensive Workflows Where we started

Data-Intensive Workflows As data growth and proliferation continued to outpace research-grade infrastructure, we considered a new approach. CSIRO started by asking what good is our data if it:
is unable to be found?
cannot speak?
only ever repeats the same story?
cannot repeat the same story twice?
speaks so slowly the message is lost?

Data-Intensive Workflows Let's revisit the monolithic approach

Data-Intensive Workflows We split the monolithic file systems into named and discoverable 'datasets.'

Data-Intensive Workflows The 'dataset' approach delineated the 'responsibility' between infrastructure owners and dataset managers.

Data-Intensive Workflows Within the dataset we developed 'categories' as a tool for data management.

Data-Intensive Workflows Categories enabled mapping of the workflow to technology of best fit.

Data-Intensive Workflows Categories kick-started the discussion about workflows.

Data-Intensive Workflows We established the relationships between owners, domain specialists, users, consumers, and infrastructure.

Data-Intensive Workflows As workflows matured, science apps evolved, enabling domain-specific datasets to be usable by non-domain consumers.

Data-Intensive Workflows Science Applications The Pyrotron - CSIRO National Bushfire Research Facility http://www.csiro.au/en/do-business/services/testing-and-technical-services/enviro/pyrotron

Data-Intensive Workflows Science Applications CSIRO Workspace - Intuitive Workflow Development Tool https://research.csiro.au/workspace/

Data-Intensive Workflows Science Applications CSIRO SPARK: a wildfire simulation framework for researchers and experts in the disaster resilience field. https://research.csiro.au/spark/

Data-Intensive Workflows Our leading-edge researchers combined domain-specific workflows to produce higher-value layered products.

Data-Intensive Workflows How we matured

Data-Intensive Workflows Below the line 'technology' is a consumable, replaceable, discardable commodity.

Data-Intensive Workflows Below the line: the fit-for-purpose pool of generic infrastructure

Data-Intensive Workflows CSIRO's value proposition is the Workflow.

Data-Intensive Workflows Crossing the line, we deliver to the 'current' profile of the researcher's workflow.

Data-Intensive Workflows Layers of abstraction enabled us to scale up.

Data-Intensive Workflows Layers of abstraction enabled us to scale out.

Data-Intensive Workflows Summary

Where we started We came from a position where data, code and compute were isolated by the approach to HPC infrastructure.

What we did Brought Data to Life We engineered a solution where data, code and compute are all now directly connected.

High Value Information: Discoverable, Assured, and Consumable.

Data-Intensive Workflows CSIRO's data-intensive workflows are a valuable source of information. How do we discover them, trust them and consume them? METADATA, SEMANTIC WEB, PROVENANCE

Discoverable Metadata Metadata is a pathway to making data and workflows discoverable. Let's look at Wikipedia: https://en.wikipedia.org/wiki/metadata Metadata is "data that provides information about other data. Two types of metadata exist: structural metadata and descriptive metadata. Structural metadata is data about the containers of data. Descriptive metadata uses individual instances of application data or the data content."

Assured Provenance Dr Victoria Stodden at the CSIRO Computational and Simulation Sciences and eResearch Annual Conference in Melbourne, 2014

Consumable Semantic Web Pragmatic use of the web. Let's look at Wikipedia: https://en.wikipedia.org/wiki/semantic_web The Semantic Web is an extension of the Web through standards by the World Wide Web Consortium (W3C). The standards promote common data formats and exchange protocols on the Web, most fundamentally the Resource Description Framework (RDF).
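To make the RDF idea concrete, here is a minimal sketch of describing a dataset as subject-predicate-object triples and writing them out in N-Triples syntax. It is an illustration only, not CSIRO's implementation: the example.org identifiers and the title are hypothetical placeholders, and the only real vocabulary used is Dublin Core Terms.

```python
# Hypothetical identifiers: the example.org URIs and the dataset title are
# placeholders for illustration, not real CSIRO records.
DCTERMS = "http://purl.org/dc/terms/"  # Dublin Core Terms namespace


def iri(value):
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(value):
    """Quote a plain string literal, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_ntriples(triples):
    """Serialise (subject, predicate, object) tuples, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


dataset = "http://example.org/dataset/plant-scan-001"
triples = [
    (iri(dataset), iri(DCTERMS + "title"), literal("Example plant scan")),
    (iri(dataset), iri(DCTERMS + "creator"),
     iri("http://example.org/person/researcher-1")),
]

print(to_ntriples(triples))
```

Because every statement uses shared, web-addressable identifiers, a consumer outside the originating domain can merge these statements with others without a custom parser.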

Linking Metadata, Provenance and Semantics We need to link metadata, provenance and semantics in an automatic and extensible manner to increase our value.

Context Capture The Future Preserving Metadata Establishing Provenance Presenting via Semantic Web

Stripping Context The Past What are we currently losing?

Context Capture The Future Preserving the context of discrete events

Context Capture The Future Create a blank dataset Preserving metadata, establishing provenance,...

Context Capture The Future Ingesting sensor data Preserving metadata, establishing provenance,...
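The two steps above (create a blank dataset, then ingest sensor data while preserving metadata and establishing provenance) can be sketched roughly as follows. This is an assumption-laden illustration: the `Dataset` class, its field names, and the sensor and file names are all hypothetical and do not describe CSIRO's actual system.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


def now():
    """Current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


@dataclass
class Dataset:
    """Hypothetical dataset that records context from the moment it is created."""
    name: str
    owner: str
    events: list = field(default_factory=list)

    def __post_init__(self):
        # The blank dataset gets its first record at creation,
        # before any data arrives.
        self.events.append({"event": "created", "at": now(), "by": self.owner})

    def ingest(self, item_name, payload, sensor_id, **context):
        """Record what arrived, from which sensor, and a checksum of the bytes."""
        record = {
            "event": "ingested",
            "at": now(),
            "item": item_name,
            "sensor": sensor_id,
            "sha256": hashlib.sha256(payload).hexdigest(),
            "bytes": len(payload),
            "context": context,
        }
        self.events.append(record)
        return record


# Hypothetical usage: names and values are placeholders.
ds = Dataset(name="example-dataset", owner="dataset-manager")
rec = ds.ingest("reading-0001.bin", b"\x01\x02\x03",
                sensor_id="camera-1", units="raw")
print(len(ds.events), rec["sha256"][:12])
```

The point of the sketch is ordering: context is captured at the discrete event itself, not reconstructed afterwards.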

Context Capture The Future Let's consider a real-world application: PlantScan
PlantScan provides non-invasive analyses of plant structure (topology, surface orientation, number of leaves), morphology (leaf size, shape, colour, area, volume) and function by utilising cutting-edge information technology, including high-resolution cameras and three-dimensional (3D) reconstruction software.
Plant surface mesh reconstruction
Morphological mesh segmentation
Accurate phenotypic data extraction
Longitudinal matching
http://www.plantphenomics.org.au/services/plantscan/

Context Capture The Future PlantScan: Workflow with integrated metadata (EXAMPLE ONLY)

Context Capture The Future Benefit 1: Repeatable Analysis. There is no simple fix to this issue, but if we issue the birth certificate before the baby leaves the hospital, then we have at least improved our position. We reduce the size of the problem.

Context Capture The Future Benefit 2: Quality Assurance. File_1 became File_2 using version 56 of Fred_1. Now suppose File_1 had a calibration issue, or Fred_1 had an analysis bug. Guess what? We know which outputs are affected, so we reduce our problem further.
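A minimal sketch of how such a record supports quality assurance: if each derivation is logged, the files affected by a bad input or a buggy tool version can be found by walking the log. Only the first log entry (File_1, File_2, Fred_1 version 56) comes from the slide; the other entries, the `Derivation` type, and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Derivation:
    """One provenance record: `output` was made from `source` by a tool version."""
    source: str
    output: str
    tool: str
    tool_version: int


LOG = [
    Derivation("File_1", "File_2", "Fred_1", 56),   # from the slide
    Derivation("File_2", "File_3", "Fred_2", 3),    # hypothetical
    Derivation("File_9", "File_10", "Fred_1", 57),  # hypothetical
]


def downstream_of(bad_file, log):
    """All files derived, directly or indirectly, from `bad_file`."""
    affected, frontier = set(), [bad_file]
    while frontier:
        current = frontier.pop()
        for d in log:
            if d.source == current and d.output not in affected:
                affected.add(d.output)
                frontier.append(d.output)
    return affected


def produced_by(tool, version, log):
    """All files made by a given tool version, plus everything derived from them."""
    direct = {d.output for d in log
              if d.tool == tool and d.tool_version == version}
    result = set(direct)
    for name in direct:
        result |= downstream_of(name, log)
    return result


# Calibration issue in File_1: which files need to be regenerated?
print(sorted(downstream_of("File_1", LOG)))
# Analysis bug in Fred_1 version 56: which files are suspect?
print(sorted(produced_by("Fred_1", 56, LOG)))
```

Without the log, every derived file would be suspect; with it, the rework is limited to the files the query returns.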

Context Capture The Future Other benefits
Benefit 3: Immediate Consumption
Benefit 4: Benchmarks
Benefit 5: Infrastructure Management
Benefit 6: Failures

Context Capture The Future Summary
An example from the past: in Unix, everything is a file (pretty much). There is a set of well-written, simple tools which you can tie together in an ad hoc way to produce high-value outcomes in a dynamic yet robust manner.
Moving to the future: everything is a dataset. There is a set of published, well-proven and tested research workflows which you can tie together in an ad hoc way to produce high-value outcomes in a dynamic yet robust manner.
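The Unix analogy can be sketched as function composition: small, tested steps chained in whatever order a question needs. The step names and the sample readings below are hypothetical, chosen only to show the shape of the idea.

```python
from functools import reduce


def pipeline(*steps):
    """Chain workflow steps left to right, like a Unix pipe."""
    def run(dataset):
        return reduce(lambda data, step: step(data), steps, dataset)
    return run


# Hypothetical, independently testable workflow steps.
def drop_missing(readings):
    return [r for r in readings if r is not None]


def kelvin_to_celsius(readings):
    return [round(r - 273.15, 2) for r in readings]


def mean(readings):
    return sum(readings) / len(readings)


# Tie the steps together ad hoc, as one would with `a | b | c` in a shell.
mean_celsius = pipeline(drop_missing, kelvin_to_celsius, mean)
result = mean_celsius([293.15, None, 303.15])
print(result)
```

Each step knows nothing about the others, which is what lets published workflows be recombined without rewriting them.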

Summary

Context Capture The Future Summary We made it to a place where data, code and compute are now tightly coupled. Researchers focus on the workflow. As workflows proliferate, we want to make sure they exist in an ecosystem where they can be discovered, assessed and consumed.

Thank You INFORMATION MANAGEMENT AND TECHNOLOGY (IMT)