Carelyn Campbell, Ben Blaiszik, Laura Bartolo. November 1, 2016

Similar documents
Overview. Scientific workflows and Grids. Kepler revisited Data Grids. Taxonomy Example systems. Chimera GridDB

The Problem of Grid Scheduling

Educating a New Breed of Data Scientists for Scientific Data Management

The Materials Data Facility

Integrating Existing Scientific Workflow Systems: The Kepler/Pegasus Example

Bio-Workflows with BizTalk: Using a Commercial Workflow Engine for escience

End-to-End Online Performance Data Capture and Analysis of Scientific Workflows

Introduction to Grid Computing

Techniques for Efficient Execution of Large-Scale Scientific Workflows in Distributed Environments

Update on Dataverse Dryad-Dataverse Community Meeting. Mercè Crosas, Elizabeth Quigley & Eleni Castro. Data Science > IQSS > Harvard University

A High-Level Distributed Execution Framework for Scientific Workflows

QoS-based Scheduling of Workflows on Global Grids

Grid Computing Systems: A Survey and Taxonomy

NOMAD Metadata for all

Data Discovery - Introduction

Data publication and discovery with Globus

Data Management at NIST

Pegasus Workflow Management System. Gideon Juve. USC Informa3on Sciences Ins3tute

Knowledge Discovery Services and Tools on Grids

Scibox: Online Sharing of Scientific Data via the Cloud

Grid-Flow: a Grid-enabled scientific workflow system with a Petri-net-based interface

Description of the European Big Data Hackathon 2019

[ PARADIGM SCIENTIFIC SEARCH ] A POWERFUL SOLUTION for Enterprise-Wide Scientific Information Access

Making Sense of Data: What You Need to know about Persistent Identifiers, Best Practices, and Funder Requirements

re3data.org - Making research data repositories visible and discoverable

Grid Programming: Concepts and Challenges. Michael Rokitka CSE510B 10/2007

The Virtual Observatory and the IVOA

An Approach to Enhancing Workflows Provenance by Leveraging Web 2.0 to Increase Information Sharing, Collaboration and Reuse

BEST BIG DATA CERTIFICATIONS

Reproducibility and FAIR Data in the Earth and Space Sciences

A Cloud Framework for Big Data Analytics Workflows on Azure

Introduction to FREE National Resources for Scientific Computing. Dana Brunson. Jeff Pummill

Town Hall Meeting on the National and International Data Infrastructure. Jim Warren, NIST Lisa Friedersdorf, NNCO

Introducing AppsLab Library

OUR VISION To be a global leader of computing research in identified areas that will bring positive impact to the lives of citizens and society.

Data Entry, and Manipulation. DataONE Community Engagement & Outreach Working Group

Conducting a Self-Assessment of a Long-Term Archive for Interdisciplinary Scientific Data as a Trustworthy Digital Repository

Indiana University Research Technology and the Research Data Alliance

Response to Industry Canada Consultation Developing a Digital Research Infrastructure Strategy

THE NATIONAL DATA SERVICE(S) & NDS CONSORTIUM A Call to Action for Accelerating Discovery Through Data Services we can Build Ed Seidel

GEOSS Data Management Principles: Importance and Implementation

Workflow, Planning and Performance Information, information, information Dr Andrew Stephen M c Gough

Open Research Online The Open University s repository of research publications and other research outputs

Making data publication a first class research output

Upgrading your End User Skills to SharePoint 2013 Course 55026A; 3 Days, Instructor-led

A Cloud-based Dynamic Workflow for Mass Spectrometry Data Analysis

Digital repositories as research infrastructure: a UK perspective

Kepler and Grid Systems -- Early Efforts --

January 16, Re: Request for Comment: Data Access and Data Sharing Policy. Dear Dr. Selby:

Preservation and Access of Digital Audiovisual Assets at the Guggenheim

The Ohio State University s Institutional Repository: The Knowledge Bank. Knowledge Bank Users Group. 2 nd Annual Meeting May 11, Welcome!

USC Viterbi School of Engineering

A Grid Workflow Environment for Brain Imaging Analysis on Distributed Systems

Storage Virtualization. Eric Yen Academia Sinica Grid Computing Centre (ASGC) Taiwan

Applying Auto-Data Classification Techniques for Large Data Sets

Scientific data processing at global scale The LHC Computing Grid. fabio hernandez

Some Big Data Challenges

Mapping Abstract Complex Workflows onto Grid Environments

Integration of New and Big Administrative Data Sources into Turkish Statistical System

ICME: Status & Perspectives

Scientific Workflows

EUROPEAN ICT PROFESSIONAL ROLE PROFILES VERSION 2 CWA 16458:2018 LOGFILE

Rok: Data Management for Data Science

ICGI Recommendations for Federal Public Websites

Embedded Technosolutions

The Data Curation Profiles Toolkit: Interview Worksheet


Best Practice Guidelines for the Development and Evaluation of Digital Humanities Projects

A High-Level Distributed Execution Framework for Scientific Workflows

Microsoft SharePoint Server

ARC VIEW. Critical Industries Need Active Defense and Intelligence-driven Cybersecurity. Keywords. Summary. By Sid Snitkin

OpenAIRE. Fostering the social and technical links that enable Open Science in Europe and beyond

The Role of Repositories and Journals in the Astronomy Research Lifecycle

A COMPARATIVE STUDY IN DYNAMIC JOB SCHEDULING APPROACHES IN GRID COMPUTING ENVIRONMENT

Implementing Web-Scale Discovery in an Academic Research Library

Eight units must be completed and passed to be awarded the Diploma.

Gergely Sipos MTA SZTAKI

Peer-to-Peer Based Grid Workflow Runtime Environment of SwinDeW-G

Astronomy Dataverse: enabling astronomer data publishing.

Big Data Insights Using Analytics

An innovative workflow mapping mechanism for Grids in the frame of Quality of Service

EGI Operations and Best Practices

{Scientific, Distributed, Cloud} Computing

GEOSS Data Sharing Task Force First MEETING May 2009 GEO Secretariat, Geneva. Meeting Record

EFFICIENT SCHEDULING TECHNIQUES AND SYSTEMS FOR GRID COMPUTING

CLOUD WORKFLOW SCHEDULING BASED ON STANDARD DEVIATION OF PREDICTIVE RESOURCE AVAILABILITY

Advanced School in High Performance and GRID Computing November Introduction to Grid computing.

Invenio: A Modern Digital Library for Grey Literature

Organizing and Managing Grassroots Enterprise Mashup Environments. Doctorial Thesis, 24 th June, Volker Hoyer

Discovery Net : A UK e-science Pilot Project for Grid-based Knowledge Discovery Services. Patrick Wendel Imperial College, London

Open Data Standards for Administrative Data Processing

Grid Middleware and Globus Toolkit Architecture

NDSA Web Archiving Survey

Big Data infrastructure and tools in libraries

Introducing the Springer Nature Data Support Services

Informatica Enterprise Information Catalog

Groovy in Jenkins. Ioannis K. Moutsatsos. Repurposing Jenkins for Life Sciences Data Pipelining

Building on to the Digital Preservation Foundation at Harvard Library. Andrea Goethals ABCD-Library Meeting June 27, 2016

EarthCube and Cyberinfrastructure for the Earth Sciences: Lessons and Perspective from OpenTopography

What s Out There and Where Do I find it: Enterprise Metacard Builder Resource Portal

Transcription:

Carelyn Campbell, Ben Blaiszik, Laura Bartolo November 1, 2016

Data Landscape Collaboration Tools (e.g. Google Drive, DropBox, Sharepoint, Github, MatIN) Data Sharing Communities (e.g. Dryad, FigShare, NanoHub, Kaggle, NDS) Data Repositories (e.g. Aflow, MaterialsProject, OQMD, NIMS MaterialNavi, NoMaD, Materials Universe) Data Analysis Tools Data Curation Software

Scott Adams, October 30, 2011 Data curation is the active and ongoing management of data through its lifecycle of interest and usefulness to scholarship, science, and education. http://ischool.illinois.edu/academics/degrees/specializations/data_curation

Representative list of tools focused on materials data, but not comprehensive 4

Data Model Definition Defines the structure of metadata and data Measurement Data Model Metadata e.g. Sample owner Date of measurment Kα1 Sample stage position Apparatus temperature Data e.g. As XML Raw data (text, ASCII, binary) Imported table * Link to image or raw data 5

Example APS Data (Large Data Example) NanoHub Generate Data at APS Analysis Tools Analyzed Data Share Data And Metadata 6

Workflows Large Data sets: Single Point Source (e.g. APS) Experimental data (small to medium size), multiple source generation Computational Data Infrastructure Selection Tool

Stress-Strain Measurement

Experimental Workflow: Stress-strain Measurement Load Frame Sample Geometry Stress-Strain Measurement DIC Sample Metadata LVDT Analysis (Tools e.g. Github, Matlab) R e s u l t s MDF Granta Other Load Frame Metadata DIC Metadata LVDT Metadata Load Frame Data DIC data LVDT data Curate with Materials Commons or ICE

Big Data Workflow

Computational Data Workflow Lots of different techniques PhaseField modeling: no standards. Community standards needed Codes changes quickly Social change needed. FEM - more benchmark. -- more standardize

How do I select a Materials Data Infrastructure Tool?

Example: Workflow Tool Selection Journal of Grid Computing (2006) 3: 171 200 # Springer 2006 DOI: 10.1007/s10723-005-9010-8 185 A Taxonomy of Workflow Management Systems for Grid Computing j Jia Yu and Rajkumar Buyya* Grid Computing and Distributed Systems (GRIDS) Laboratory, Department of Computer Science and Software Engineering, The University of Melbourne, Melbourne, Australia E-mail: raj@cs.mu.oz.au Received 28 May 2005; accepted in revised form 6 December 2005 Key words: Grid computing, resource management, scheduling, taxonomy, workflow management Abstract With the advent of Grid and application technologies, scientists and engineers are building more and more complex applications to manage and process large data sets, and execute scientific experiments on distributed resources. Such application scenarios require means for composing and executing complex workflows. Therefore, many efforts have been made towards the development of workflow management systems for Grid computing. In this paper, we propose a taxonomy that characterizes and classifies various approaches for building and executing workflows on Grids. We also survey several representative Grid workflow systems developed by various projects world-wide to demonstrate the comprehensiveness of the taxonomy. The taxonomy not only highlights the design and engineering similarities and differences of state-of-the-art in Grid workflow systems, but also identifies the areas that need further research. 1. Introduction Grids [51] have emerged as a global cyber-infrastructure for the next-generation of e-science applications by integrating large-scale, distributed and heterogeneous resources. Scientific communities, such as high-energy physics, gravitational-wave physics, geophysics, astronomy and bioinformatics, are utilizing Grids to share, manage and process large data sets. In order to support complex scientific experiments, distributed resources such as computational devices, data, applications, and scientific instruments need to be orchestrated while managing the application workflow operations within Grid environments [92]. Workflow is concerned with the automation of procedures whereby files and data are passed between participants according to a defined set of rules j Corresponding author. Table 2. Workflow design taxonomy mapping. Table 2. Workflow design taxonomy mapping. Project name Structure Model Composition systems QoS constraints DAGMan DAG Abstract User-directed User specified rank expression for desired resources Pegasus DAG Abstract User-directed N/A Automatic Triana Non-DAG Abstract User-directed N/A & Graph-based ICENI Non-DAG Abstract User-directed Metrics specified by users & Graph-based Taverna DAG Abstract/concrete User-directed N/A & Graph-based GridAnt Non-DAG Concrete User-directed N/A GrADS DAG Abstract User-directed Estimated application execution time GridFlow DAG Abstract User-directed Application execution time & Graph-based Unicore Non-DAG Concrete User-directed N/A & Graph-based Gridbus workflow DAG Abstract/concrete User-directed Deadline, cost minimisation Askalon Non-DAG Abstract User-directed Constrains and properties specified & Graph-based by users or pre-defined Karajan Non-DAG Abstract User-directed N/A & Graph-based Kepler Non-DAG Abstract/concrete User-directed N/A & Graph-based Project name Structure Model Composition systems DAGMan to achieve an overalldag goal [35]. A workflow management system [5] defines, manages and executes Abstract User-directed workflows on computing resources. Imposing the workflow paradigm for application composition on their dependencies in the form of Directed Acyclic Grids offers several advantages [117] such as: Graph (DAG). Before mapping, Pegasus reduces the Pegasus DAG Abstract abstract workflow by reusing a materialized dataset User-directed Ability to build dynamic applications which orchestrate distributed resources. which is produced by other users within a VO. Reduction optimization assumes that it is more costly Utilization of resources that are located in a particular domain to increase throughput or reduce to produce a dataset than access the processing results. The reduction algorithm removes any antecedents of the redundant jobs that do not haveautomatic any execution costs. Execution spanning multiple administrative domains to obtain specific processing capabilities. unmaterialized descendents in order to reduce the Triana Integration of multiple Non-DAG teams involved in managing of different parts of the experiment workflow Abstract User-directed complexity of the executable workflow. Pegasus consults various Grid information services to find the resources, software, and data that are Y thus promoting inter-organizational collaborations. used in the workflow. A Replica Location Service & Graph-based (RLS) [30] and Transformation Catalog (TC) [39] are ICENI Figure 1 shows thenon-dag architecture and functionalities Abstract used to locate the replicas of the required data, and User-directed to supported by various components of the Grid workflow system based on the workflow reference model nents respectively. Pegasus also queries Globus find the location of the logical application compo- & Graph-based Monitoring and Discovery Service (MDS) [34] to find available resources and their characteristics. There are two methods used in Pegasus for resource selection, one is through random allocation, the other is through a performance prediction approach. In the latter approach, Pegasus interacts with Prophesy [68, 140], which serves as an infrastructure for performance analysis and modeling of parallel and distributed applications. Prophesy is used to predict the best site to execute an application component by using performance historical data. Prophesy gathers and stores the performance data of every application. The performance information can provide insight into the performance relationship between the application and hardware and between the application, compilers, and run-time systems. An analytical model is produced based on the performance data

Example: Hardware Store Website Credit: homedepot.com Any mention of commercial products is for information only; it does not imply recommendation or endorsement by NIST.

Example: Used Car Website Credit: carmax.com Any mention of commercial products is for information only; it does not imply recommendation or endorsement by NIST.

Registry: Materials Data Infrastructure Tools

First Rough Draft Taxonomy

First Rough Draft Taxonomy

First Rough Draft Taxonomy

First Rough Draft Taxonomy

Notes from Summit Wrap-up Session Integrate tools into undergraduate education Tools need to more user friendly Embed data experts into experimental groups Alternate: floating data experts available for experimental groups. Need to define skills needed for these data experts Encourage more conference exchanges at Data Analytics and Materials communities Define data curation guidelines/code Benefit to users Data Challenge (Student) Prize for data set Best paper/doi/pid Develop implementation path Improve peer recognition Develop data cite profile Interest in following up with small working groups on specific issues.