Master's Thesis Defense

Similar documents
Development of a Data Management Architecture for the Support of Collaborative Computational Biology Lance W. Feagan

Bioinformatics Computational Journal: User Guide

Concurrency and Recovery

Object-Oriented Design

Microsoft SharePoint Server 2013 Plan, Configure & Manage

Object-Oriented Design

The Submission Data File System Automating the Creation of CDISC SDTM and ADaM Datasets

Oracle Application Development Framework Overview

Oracle Database. Installation and Configuration of Real Application Security Administration (RASADM) Prerequisites

A Simple Mass Storage System for the SRB Data Grid

irods for Data Management and Archiving UGM 2018 Masilamani Subramanyam

Fast Track Model Based Design and Development with Oracle9i Designer. An Oracle White Paper August 2002

Real Application Security Administration

IBM Best Practices Working With Multiple CCM Applications Draft

SharePoint 2013 End User Level II

Microsoft SharePoint End User level 1 course content (3-day)

Panorama Sharing Skyline Documents

CMPT 354 Database Systems I

JMP and SAS : One Completes The Other! Philip Brown, Predictum Inc, Potomac, MD! Wayne Levin, Predictum Inc, Toronto, ON!

Tools to Develop New Linux Applications

Managing Learning Objects in Large Scale Courseware Authoring Studio 1

About the Edinburgh Pathway Editor:

Product Documentation. ER/Studio Portal. User Guide. Version Published February 21, 2012

MOC 6232A: Implementing a Microsoft SQL Server 2008 Database

Grid Computing with Voyager

Introduction to Geodatabase and Spatial Management in ArcGIS. Craig Gillgrass Esri

Business Process Testing

High-End Training, Research and Mentoring Space The KnowledgeStation

Voldemort. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation

Common approaches to management. Presented at the annual conference of the Archives Association of British Columbia, Victoria, B.C.

Oracle. Engagement Cloud Using Service Request Management. Release 12

Oracle Hyperion Financial Management Instructor-led Live Online Training Program

Medici for Digital Cultural Heritage Libraries. George Tsouloupas, PhD The LinkSCEEM Project

High Availability Distributed (Micro-)services. Clemens Vasters Microsoft

SharePoint 2013 End User Level II

CSE 344 Final Review. August 16 th

Review -Chapter 4. Review -Chapter 5

Oracle Endeca Information Discovery

20331B: Core Solutions of Microsoft SharePoint Server 2013

Advanced Databases: Parallel Databases A.Poulovassilis

Content Management for the Defense Intelligence Enterprise

SQL SERVER DBA TRAINING IN BANGALORE

The Analysis and Design of the Object-oriented System Li Xin 1, a

Oracle API Platform Cloud Service

UNIT 4 DATABASE SYSTEM CATALOGUE

Survey of Oracle Database

CONTENT PLANNING JOB AID

Advanced Solutions of Microsoft SharePoint Server 2013

T-sql Grant View Definition Example

Policy Manager in Compliance 360 Version 2018

This course is designed for anyone who needs to learn how to write programs in Python.

Export out report results in multiple formats like PDF, Excel, Print, , etc.

Fundamentals of Windows Server 2008 Active Directory

Copyright 2013, Oracle and/or its affiliates. All rights reserved.

SharePoint SP380: SharePoint Training for Power Users (Site Owners and Site Collection Administrators)

Mastering SOA Challenges more cost-effectively. Bodo Bergmann Senior Software Engineer Ingres Corp.

StrongLink: Data and Storage Management Simplified

BPS Suite and the OCEG Capability Model. Mapping the OCEG Capability Model to the BPS Suite s product capability.

Tags, Categories and Keywords

Alfresco Guide. By IT Services

Best Practice Guide for a Successful Migration

DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S. TANENBAUM MAARTEN VAN STEEN. Chapter 1. Introduction

Appendix A - Glossary(of OO software term s)

Lessons Learned While Building Infrastructure Software at Google

IBM Workplace Web Content Management and Why Every Company Needs It. Sunny Wan Technical Sales Specialist

The InfoLibrarian Metadata Appliance Automated Cataloging System for your IT infrastructure.

Microsoft Core Solutions of Microsoft SharePoint Server 2013

Collage: A Declarative Programming Model for Compositional Development and Evolution of Cross-Organizational Applications

Talend Component tgoogledrive

Oracle NoSQL Database. Creating Index Views. 12c Release 1

Microsoft SharePoint Server 2013 for the Site Owner/Power User

ControlPoint. Managing ControlPoint Users, Permissions, and Menus. February 05,

SECURE PRIVATE VAGRANT BOXES AND MORE WITH A BINARY REPOSITORY MANAGER. White Paper

Informatica Enterprise Information Catalog

Chapter 20: Database System Architectures

Web Application Development Using JEE, Enterprise JavaBeans and JPA

Distributed File Systems II

Business Processes for Managing Engineering Documents & Related Data

QUARK AUTHOR THE SMART CONTENT TOOL. INFO SHEET Quark Author

Metamodeling. Janos Sztipanovits ISIS, Vanderbilt University

IT ADMINISTRATOR TRAINING COURSE

UNIT V *********************************************************************************************

COURSE OUTLINE MOC : PLANNING AND ADMINISTERING SHAREPOINT 2016

Advanced Solutions of Microsoft SharePoint Server 2013 Course Contact Hours

W H IT E P A P E R. Salesforce Security for the IT Executive

Data Immersion : Providing Integrated Data to Infinity Scientists. Kevin Gilpin Principal Engineer Infinity Pharmaceuticals October 19, 2004

Bigtable. A Distributed Storage System for Structured Data. Presenter: Yunming Zhang Conglong Li. Saturday, September 21, 13

Advanced Solutions of Microsoft SharePoint 2013

Connector for OpenText Content Server Setup and Reference Guide

AADL Graphical Editor Design

Applied Data Governance - Part 3

Liferay Fundamentals Course Overview

BRANCH:IT FINAL YEAR SEVENTH SEM SUBJECT: MOBILE COMPUTING UNIT-IV: MOBILE DATA MANAGEMENT

A GUIDE FOR ADMINISTRATORS

Science-as-a-Service

Taxonomy Tools: Collaboration, Creation & Integration. Dow Jones & Company

Oracle Database 10g The Self-Managing Database

COMMUNITIES USER MANUAL. Satori Team

DEPARTMENT OF COMPUTER SCIENCE

Rajiv GandhiCollegeof Engineering& Technology, Kirumampakkam.Page 1 of 10

Transcription:

Master's Thesis Defense Development of a Data Management Architecture for the Support of Collaborative Computational Biology Lance Feagan 1

Acknowledgements Ed Komp My mentor. Terry Clark My advisor. Victor Frost My other advisor. Heather, Jesse, Justin, Keith, Andrew Project team members. 2

Presentation Overview Computational biology background Analysis of similar projects Design Objectives Implementation Conclusions 3

Computational Biology Small & Large data sets Input: protein structure, genetic sequence Output: MD simulation, BLAST Exponential data growth New data becoming available at impressive rate Curated vs. non-curated data Experiments can be represented as pipelines, possibly with feedback to earlier stages Runs frequently re-done Parameter variation and re-analysis of results Use of recently available data with previously used experimental configuration 4

Computational Biology Maintenance of provenance records required when publishing results Challenging in fast-paced, high-volume environment Collaboration important Controlled information access levels Wet-lab techniques do not scale well Tools and formats used in analysis do not inherently provide integrated, end-to-end provenance trail 5

Why is provenance important? Provides context and specifications work was done in CS pioneer Jim Gray on provenance: 6

Provenance Taxonomy Y.L. Simmhan, B. Plale, and D. Gannon A Survey of Data Provenance in e-science 7

Related Work NoteCards Xerox D, LISP machine Semantic network of related notes Tree-like graphical browse/manipulation tool Semantic network entirely user-maintained Flexible, but very tedious and time-consuming Search was limited to the title and content of a NoteFile Virtual Notebook Environment (ViNE) BioCoRE mygrid/taverna Scientific Annotation Middleware (SAM) 8

Related Work: Shortcomings Limited or no provenance or other meta-data management Concept of workflow that other users can view, import, and alter Collaborative features don't include managed re-use that maintains provenance trail Weak search capabilities result in overly cluttered user interfaces Search can be used as a filter to select visible data plane in navigation tools Integration of secure computational facility while keeping user's rights and restrictions on usage intact 9

Design Objectives Derived from analysis of related projects Identified five key areas as vital in aiding computational biologists 1)Type Hierarchy & Data Abstraction 2)Data Storage 3)Provenance Management 4)Data Security 5)Data & Provenance Search 10

Type Hierarchy & Data Abstraction 11

Type Hierarchy & Data Abstraction Type Hierarchy: Design Goals Provide core type definitions that could be used pervasively as basis Extensible by users as well as administrators Allow grouping of types for namespaces Data Abstraction: Design Goals Allow interaction in generic fashion Enable extensive search and computational usage capability Maintain or improve performance 12

Data Model: Entry Electronic journal concept as basis Page <==> Entry Visible journal entry comprised of a single toplevel entry (node) with arbitrary number of related nodes Entry (node) is an atomic unit of information associated with a single type that may be reused Entry data structure composed of meta-data + content 13

Data Model: Entry 1)EntryId (INTEGER): PK for Entry table. Positive integers for all non-core entries. 0 reserved to prevent infinite recursion of type relationships. 2)UserId (INTEGER): Entry creator/owner. Joined with other tables to define permissions and search for entries. 3)ContentTypeId (INTEGER): References by EntryId the type of the information stored in this entry. 4)Title (VARCHAR2): Used to assign a title to an entry. Titles are not unique. 5)JournalId (INTEGER): References by EntryId another entry, which must be of content type journal. All journals are attached to a single root journal. 14

Data Model: Journal Collection of entries Entry may only be a member of a single journal Journals may be nested Multiple writers to a single journal Furthers collaboration 15

Data Abstraction Achieved through: Extensible type hierarchy Plug-in architecture Dissociate content from meta-data High-performance Pervasive, comprehensive search capability Attribution provenance Content type & content storage type 16

Type Hierarchy Extensibility: Addition of new types By both administrators and users Aids flexibility and re-use Flexibility: Changes to hierarchy should not break existing infrastructure Robustness: Operations that compromise stability are prevented Data-driven, plug-in architecture reduces coupling No nested types: simplifies processing of an entry Type hierarchy stored as a collection of entries, with relationship to special master type entry 17

Type Hierarchy: Design 18

Storage Architecture 19

Storage Architecture Objectives: Rapid textual & meta-data search High-performance (latency and B/W) cluster-based computing Split atomic entry into meta-data and contentdata Hybrid content-data storage: in DBMS or in cluster file system 20

Provenance 21

Provenance 1)Attribution 1)Owner-only write model + link creation on copy + security model ensures attribution is maintained 2)Audit Trail 3)Data Quality 4)Informational 5)Replication 22

Provenance: Entry Fields Five Fields UserId CreateDate ModifyDate CommitDate Committed 23

Provenance: Semantic Relationships RDF 3-tuple: <subject,predicate,object> Predicate defines an IS-A relationship <'Junior','son','Senior'> Junior IS A son OF Senior Choice to require IS-A eliminates need for inverse relationship searches on HAS-A FromEntry IS A LinkType OF ToEntry 24

Provenance: Semantic Relationship Integrity Constraint Second stage in ensuring durability of provenance (first is secured commit) Extension of Resource Description Framework (RDF) 3-tuple of <subject,predicate,object> Depending on predicate, removal of subject or object may nullify ability to maintain definitive provenance trail Form 5-tuple with addition of two values describing subject's dependence on object and vice-a-versa Assurances of non-removal of content protected by SRIC encourages collaborative sharing and re-use of experiments & data from other users 25

Provenance: SRIC Flags: ToRequiresFrom FromRequiresTo Commit vs. SRIC Describe how this plays out. 26

Provenance: Collection & Usage Two collection techniques: Automatic: Done via DB triggers, Ex: create, update, commit of an entry API: Done via BCJ plug-ins, Ex: input to and output from experiment, textual annotations Used by navigators and tools Provide relevant information while reducing clutter Filtering by author, create/modify/commit date, deprecated flag, rating, hidden, categorization by content types Related entry view in navigator Open interface via views rather than purely functional interface Allows complex join queries to be performed 27

Provenance: Conclusions Automated provenance management framework vital Eliminates burdensome folder-file management scheme to maintain association between input, experiment, and output Ensures attribution information is maintained from inception through multiple iterations and ending in the successful conclusion of work Non-repudiation assured Extended RDF 3-tuple encourages sharing and re-use among collaborators 28

Search 29

Search Pervasive: Fundamental activity Data and provenance Anything that user can access should be query-able Content Meta-Data Combined content + meta-data query Used by navigators to define and refine presentation Used by workflow and experiment editors to find available and appropriate blocks and resources Sorting 30

Search: Helper Fields Four additional meta-data fields to assist search ContentLastAccessDate Deprecated Hidden Rating 31

Security 32

Security Coarse- vs. fine-grained Permission inheritance Goals Create secure, private workspace for individuals Encourage collaboration through the ability to selectively share information via find-grained access controls Maintain the integrity of all provenance information collected 33

Security Actors: Users and Groups Group-based access control Each user a member of one or more groups Each group contains zero or more users Each entry associated with one or more groups granted read-only access Read/Write Permission Levels Owner-only write permission 34

Security Commit concept Prior to commit, owner free to alter content Committing entry locks content to prevent further writing Meta-data fields affecting provenance also locked First stage in ensuring durability of provenance (second step is SRIC) 35

Security: Metadata Two categories of metadata field access levels: System-controlled, and User-controlled Triggers used on entry metadata fields rather than functional interface Maximize available information for search hasessentialdependents Delete compromise of strict rules 36

Security: BFILE Separate cluster FS created Can only be accessed by 'oracle' and special user to submit PBS jobs Mode bits altered when an entry is (un)committed so that FS reflects meta-data Uses triggers to Java Stored Procedures (JSP) UNIX permissions of 660 and 440 Immediate deletion 37

Summary & Conclusions 38

Summary BCJ represents a comprehensive solution Areas addressed 1)Type Hierarchy & Data Abstraction 2)Data Storage 3)Provenance Management 4)Data Security 5)Data & Provenance Search 39

Summary Type Hieararchy Four core entries, three master types Data decomposition: atomic unit of information: the entry Each entry associated with a single type Not a hierarchy, single master type, eases addition of new types Journal as organizational structure Provenance All five types of provenance can be collected Collection through automated means (triggers or API) along with user-friendly tooling eliminates manual processes 40

Summary Storage Global, shared workspace provides area for collaboration Security Flexible, efficient, tools simplify use, enables powerful search and cluster-based computing while ensuring integrity of provenance SRIC extension of RDF 5-tuple 41

Conclusions Crafting a comprehensive solution is not only time consuming but also challenging End-product is vastly superior at lowering level of effort necessary to produce relevant biological research while maintaining vital provenance information Pervasive use of search & filters simplifies access to basic information while retaining complex search capabilities 42

Future Work SRIC Link deletion strategy Split ownership of link end-points Owner of (very large) content linked-to by another user wants to delete the content Deployment needs to be simplified SRIC concept could be made extensible through the LinkType Entry Integrate SRIC into DBMS 43

Questions Wow! We made it to the end. Contact: lfeagan@us.ibm.com 44