Integrating large, fast-moving, and heterogeneous data sets in biology.


Integrating large, fast-moving, and heterogeneous data sets in biology. C. Titus Brown, Asst Prof, CSE and Microbiology; BEACON NSF STC, Michigan State University. ctb@msu.edu

Introduction
Background: modeling & data analysis (undergrad) => open source software development + software engineering + developmental biology + genomics (PhD) => bio + computer science (faculty) => data-driven biology.
Currently working with next-gen sequencing data (mRNAseq, metagenomics, difficult genomes). Thinking hard about how to do data-driven modeling & model-driven data analysis.

Goal & outline
Address challenges and opportunities of heterogeneous data integration: the 1000 ft view.
Outline:
- What types of analysis and discovery do we want to enable?
- What are the technical challenges, common solutions, and common failure points?
- Where might we look for success stories, and what lessons can we port to biology?
- My conclusions.

Specific types of questions
- I have a known chemical/gene interaction; do I see it in this other data set?
- I have a known chemical/gene interaction; what other gene expression is affected?
- What does chemical X do to overall phenotype, effect on gene expression, altered protein localization, and patterns of histone modification?
- More complex/combinatorial interactions: What does this chemical do in this genetic background? What kind of additional gene expression changes are generated by the combination of these two chemicals? What are common effects of this class of chemicals?
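
To make the first question concrete, here is a minimal sketch in Python/pandas of checking whether a known chemical/gene interaction shows up in another data set. The file and column names are invented for illustration, and the sketch quietly assumes both data sets share gene identifiers -- exactly the assumption the technical-obstacles slide below shows to be fragile.

    import pandas as pd

    # Known (chemical, gene) interaction pairs -- hypothetical file.
    known = pd.read_csv("known_interactions.csv")   # columns: chemical, gene

    # Differential expression results from an unrelated experiment.
    expr = pd.read_csv("diff_expression.csv")       # columns: gene, log2fc, padj

    # Join on gene ID and keep significant expression changes.
    hits = known.merge(expr, on="gene")
    hits = hits[hits["padj"] < 0.05]
    print(hits[["chemical", "gene", "log2fc"]])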

What general behavior do we want to enable?
- Reuse of data by groups that did not/could not produce it.
- Publication of reusable/forkable data analysis pipelines and models.
- Integration of data and models.
- Serendipitous uses and cross-referencing of data sets ("mashups").
- Rapid scientific exploration and hypothesis generation in data space.

(Executable papers & data reuse)
ENCODE: all data is available; all processing scripts for papers are available on a virtual machine.
QIIME (microbial ecology): Amazon virtual machine containing software and data for "Collaborative cloud-enabled tools allow rapid, reproducible biological insights" (PMID 23096404).
Digital normalization paper: Amazon virtual machine, again: http://arxiv.org/abs/1203.4802

Executable papers can support easy replication & reuse of code, data. (IPython Notebook; also see RStudio) http://ged.msu.edu/papers/2012-diginorm/notebook/

What general behavior do we want to enable?
- Reuse of data by groups that did not/could not produce it.
- Publication of reusable/forkable data analysis pipelines and models.
- Integration of data and models.
- Serendipitous uses and cross-referencing of data sets ("mashups").
- Rapid scientific exploration and hypothesis generation in data space.

An entertaining digression -- a mashup of Facebook top 10 books by college and per-college SAT rankings: http://booksthatmakeyoudumb.virgil.gr/

Technical obstacles
- Syntactic incompatibility. The first 90% of bioinformatics: your IDs are different from my IDs.
- Semantic incompatibility. The second 90% of bioinformatics: what does "gene" mean in your database?
- Impedance mismatch. SQL is notoriously bad at representing intervals and hierarchies; genomes consist of intervals, and ontologies consist of hierarchies! Yet SQL databases dominate (vs graph or object DBs).
- Data volume & velocity. Large & expanding data sets just make everything harder.
- Unstructured data, aka publications: most scientific knowledge is locked up there.
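
The impedance mismatch point can be made concrete. In SQL, "which features overlap this interval?" becomes WHERE start < :qend AND end > :qstart, a range predicate that ordinary B-tree indexes handle poorly; in code, intervals are a natural structure. A toy sketch with invented features:

    # Hypothetical features on one chromosome: (start, end, name).
    features = [(100, 500, "geneA"), (450, 900, "geneB"), (1200, 1300, "geneC")]

    def overlapping(features, qstart, qend):
        """Return features overlapping the half-open query interval [qstart, qend)."""
        # A linear scan; at genome scale you'd use an interval tree instead.
        return [f for f in features if f[0] < qend and f[1] > qstart]

    print(overlapping(features, 400, 600))   # geneA and geneB overlap the query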

Typical solutions
- Entity resolution: accession numbers or other common identifiers; requires a global naming system OR translators.
- Top-down imposition of structure: centralized DB; "here is the schema you will all use"; limits flexibility, prevents use of unstructured data, heavyweight.
- Ontologies to enable correct communication: centrally coordinated vocabulary; slow, hard to get right, doesn't solve the unstructured data problem. Balancing theoretical rigor with practical applicability is particularly hard.
- Ad hoc entity resolution ("winging it"): a common solution; doesn't work that well.
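
A minimal sketch of the "translators" flavor of entity resolution: map each database's local identifiers onto a shared accession, so two records can be recognized as the same entity. The IDs and mapping below are invented for illustration.

    # Translation table from database-local IDs to a shared accession.
    id_map = {
        "dbA:gene00042": "ENSG00000141510",
        "dbB:tp53-like": "ENSG00000141510",
    }

    def resolve(local_id):
        """Translate a local ID to its shared accession; None if unresolved."""
        return id_map.get(local_id)

    assert resolve("dbA:gene00042") == resolve("dbB:tp53-like")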

Are better standards the solution? http://xkcd.com/927/

Rephrasing technical goals
How can we best provide a platform or platforms to support flexible data integration and data investigation across a wide range of data sets and data types in biology?
My interests:
- Avoid "master data manager" and centralization.
- Support federated roll-out of new data and functionality.
- Provide flexible extensibility of ontologies and hierarchies.
- Support a diverse ecology of databases, interfaces, and analysis software.

Success stories outside of biology? Look for domains:
- with really large amounts of heterogeneous data,
- that are continually increasing in size,
- that are being effectively mined on an ongoing basis,
- that have widely used programmatic interfaces that support mashups and other cross-database work, and
- that are intentional, with principles that we can steal or adapt.
Amazon.

Amazon: > 50 million users, > 1 million product partners, billions of reviews, dozens of compute services. Continually changing/updating data sets. Explicitly adopted a service-oriented architecture that enables both internal and external use of this data. For example, the amazon.com Web site is itself built from over 150 independent services. Amazon routinely deploys new services and functionality.

Sources:
- The Platform Rant (Steve Yegge), in which he compares the Google and Amazon approaches: https://plus.google.com/112678702228711889851/posts/eveouesvavx
- A summary at HighScalability.com: http://highscalability.com/amazon-architecture
(They are both long and tech-y, but the first is especially entertaining.)

A brief summary of core principles
Mandates from the CEO:
1. All teams must expose data and functionality solely through a service interface.
2. All communication between teams happens through that service interface.
3. All service interfaces must be designed so that they can be exposed to the outside world.
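
As a hedged illustration of mandate #1, here is what "expose data solely through a service interface" might look like as a tiny HTTP service (using Flask; the endpoint, data, and names are invented):

    from flask import Flask, jsonify

    app = Flask(__name__)

    # Stand-in for the team's real data store.
    EXPRESSION = {"geneA": 12.4, "geneB": 0.7}

    @app.route("/expression/<gene_id>")
    def expression(gene_id):
        """Expose expression levels only via this HTTP interface."""
        if gene_id not in EXPRESSION:
            return jsonify(error="unknown gene"), 404
        return jsonify(gene=gene_id, level=EXPRESSION[gene_id])

    if __name__ == "__main__":
        app.run()   # other teams -- and potentially outsiders -- call this API

Because the interface is designed to be externalizable from day one (mandate #3), opening it to the outside world later is a policy decision, not a rewrite.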

More colloquially: you should eat your own dogfood. Design and implement the database and database functionality to meet your own needs, and only use the functionality you've explicitly made available to everyone. To adapt this to research: database functionality should be designed in tight integration with the researchers who are using it, both at a user interface level and programmatically. (Genome databases have done a really good job of this, albeit generally in a centralized model.)

If the customers aren't integrated into the development loop:

A platform view? [Diagram: a platform of composable services -- metabolic model, differential gene expression query, data exploration, WWW, gene ID translator, chemical relationships, expression normalization, isoform resolution/comparison -- layered over expression data sets (tiling, microarray, mRNAseq, and a second mRNAseq set).]
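
The point of such a diagram is that services compose: a "mashup" client can chain the gene ID translator and an expression service without either one knowing about the other. A sketch with invented URLs and response shapes:

    import requests

    def expression_for(accession):
        """Look up expression for a shared accession via two composed services."""
        # 1. Translate the shared accession to the expression DB's local ID.
        r = requests.get(f"https://ids.example.org/translate/{accession}")
        local_id = r.json()["local_id"]
        # 2. Query the expression service with the translated ID.
        r = requests.get(f"https://expr.example.org/expression/{local_id}")
        return r.json()

    print(expression_for("ENSG00000141510"))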

A few points
- Open source and agile software development approaches can be surprisingly effective and inexpensive.
- Developing services in small groups that include customer-facing developers helps ensure utility.
- Implementing services in the cloud (e.g. virtual machines, or on top of infrastructure-as-a-service offerings) gives developers flexibility in tools, approaches, and implementation; it also enables scaling and reusability.

Combining modeling with data
Data-driven modeling: connections and parameters can be, to some extent, determined from data.
Model-driven data investigation: data that doesn't fit the known model is particularly interesting.
The second approach is essentially how particle physicists work with accelerator data: build a model & then interpret the data using the model. (In biology, models are less constraining, though; more unknowns.)
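
A sketch of what model-driven data investigation might look like in practice: score observations against the model's predictions and surface the worst-fitting points, since those are the interesting leads. The model and data below are invented placeholders.

    import numpy as np

    def model(x):
        """Stand-in for a mechanistic model's predictions."""
        return 2.0 * x + 1.0

    x = np.array([0.0, 1.0, 2.0, 3.0])
    observed = np.array([1.1, 2.9, 5.2, 12.0])   # the last point breaks the model

    residuals = np.abs(observed - model(x))
    outliers = x[residuals > 3 * np.median(residuals)]
    print(outliers)   # the data that doesn't fit the model is the lead to chase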

Using developmental models Davidson et al., http://sugp.caltech.edu/endomes/

Using developmental models
Models can contain useful abstractions of specific processes; here, the direct effects of blocking nuclearization of β-catenin can be predicted by following the connections.
Models provide a common language for (dis)agreement in a community.

Social obstacles
- Training of biologically aware software developers is lacking.
- Molecular biologists are still very much of a computationally naïve mindset: "give me the answer so I can do the real work."
- Incentives for data sharing, much less useful data sharing, are not yet very strong (pubs, grants, respect...).
- Patterns for useful data sharing are still not well understood, in general.

Other places to look
- NEON and other NSF centers (e.g. NCEAS) are collecting vast heterogeneous data sets, and are explicitly tackling the data management/use/integration/reuse problem.
- SBML ("Systems Biology Markup Language") is a model description language that enables interoperability of modeling software.
- Software Carpentry runs free workshops on the effective use of computation for science.

My conclusions
- We need a platform mentality to make the most use of our data, even if we don't completely embrace loose coupling and distribution.
- Agile and end-user focused software development methodologies have worked well in other areas; much of the hard technical space has already been explored in Internet companies (and probably social networking companies, too).
- Data is most useful in the context of an explicit model; models can be generated from data, and models can feed back into data gathering.

Things I didn't discuss
- Database maintenance and active curation is incredibly important.
- Most data only makes sense in the context of other data (think: controls; wild type vs knockout; other backgrounds; etc.), so we will need lots more data to interpret the data we already have.
- Deep learning is a promising field for extracting correlations from multiple large data sets.
- All of these technical problems are easier to solve than the social problems (incentives; training).

Thanks -- This talk and ancillary notes will be available on my blog ~soon: http://ivory.idyll.org/blog/ Please do contact me at ctb@msu.edu if you have questions or comments.