Integrating large, fast-moving, and heterogeneous data sets in biology.

C. Titus Brown, Asst Prof, CSE and Microbiology; BEACON NSF STC, Michigan State University. ctb@msu.edu

Introduction
Background: Modeling & data analysis undergrad => Open source software development + software engineering + developmental biology + genomics PhD => Bio + computer science faculty => Data-driven biology.
Currently working with next-gen sequencing data (mrnaseq, metagenomics, difficult genomes).
Thinking hard about how to do data-driven modeling & model-driven data analysis.

Goal & outline
Address challenges and opportunities of heterogeneous data integration: 1000 ft view.
Outline:
  What types of analysis and discovery do we want to enable?
  What are the technical challenges, common solutions, and common failure points?
  Where might we look for success stories, and what lessons can we port to biology?
  My conclusions.

Specific types of questions
I have a known chemical/gene interaction; do I see it in this other data set?
I have a known chemical/gene interaction; what other gene expression is affected?
What does chemical X do to overall phenotype, gene expression, protein localization, and patterns of histone modification?
More complex/combinatorial interactions:
  What does this chemical do in this genetic background?
  What kind of additional gene expression changes are generated by the combination of these two chemicals?
  What are common effects of this class of chemicals?

What general behavior do we want to enable?
Reuse of data by groups that did not/could not produce it.
Publication of reusable/forkable data analysis pipelines and models.
Integration of data and models.
Serendipitous uses and cross-referencing of data sets ("mashups").
Rapid scientific exploration and hypothesis generation in data space.

(Executable papers & data reuse)
ENCODE: All data is available; all processing scripts for papers are available on a virtual machine.
QIIME (microbial ecology): Amazon virtual machine containing software and data for "Collaborative cloud-enabled tools allow rapid, reproducible biological insights." (pmid 23096404)
Digital normalization paper: Amazon virtual machine, again: http://arxiv.org/abs/1203.4802

Executable papers can support easy replication & reuse of code, data. (IPython Notebook; also see RStudio) http://ged.msu.edu/papers/2012-diginorm/notebook/

What general behavior do we want to enable?
Reuse of data by groups that did not/could not produce it.
Publication of reusable/forkable data analysis pipelines and models.
Integration of data and models.
Serendipitous uses and cross-referencing of data sets ("mashups").
Rapid scientific exploration and hypothesis generation in data space.

An entertaining digression -- A mashup of Facebook top 10 books by college and per-college SAT rankings http://booksthatmakeyoudumb.virgil.gr/

Technical obstacles
Syntactic incompatibility
  The first 90% of bioinformatics: your IDs are different from my IDs.
Semantic incompatibility
  The second 90% of bioinformatics: what does "gene" mean in your database?
Impedance mismatch
  SQL is notoriously bad at representing intervals and hierarchies.
  Genomes consist of intervals; ontologies consist of hierarchies!
  SQL databases dominate (vs graph or object DBs).
Data volume & velocity
  Large & expanding data sets just make everything harder.
Unstructured data
  aka publications; most scientific knowledge is locked up.
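To make the interval point concrete, here is a minimal Python sketch of the basic genomic query: "which features overlap this region?" The Interval type, the function names, and the feature data are all invented for illustration. The overlap test is a pair of inequalities across two columns, which ordinary single-column database indexes tend to serve poorly; this sketch sidesteps the issue with a linear scan.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """A half-open genomic interval [start, end) on one chromosome."""
    chrom: str
    start: int
    end: int
    name: str


def overlaps(a: Interval, b: Interval) -> bool:
    """True if a and b share at least one position on the same chromosome."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def find_overlapping(query: Interval, features: list[Interval]) -> list[str]:
    """Linear scan; a larger system would likely use an interval tree or binning."""
    return [f.name for f in features if overlaps(query, f)]


# Hypothetical features, invented for illustration only.
features = [
    Interval("chr1", 100, 200, "feature_a"),
    Interval("chr1", 180, 300, "feature_b"),
    Interval("chr2", 100, 200, "feature_c"),
]
hits = find_overlapping(Interval("chr1", 190, 195, "query"), features)
```

Here hits contains feature_a and feature_b; feature_c is on a different chromosome and is skipped.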

Typical solutions
Entity resolution
  Accession numbers or other common identifiers: requires a global naming system OR translators.
Top-down imposition of structure
  Centralized DB; "Here is the schema you will all use"; limits flexibility, prevents use of unstructured data, heavyweight.
Ontologies to enable correct communication
  Centrally coordinated vocabulary: slow, hard to get right, doesn't solve the unstructured data problem.
  Balancing theoretical rigor with practical applicability is particularly hard.
Ad hoc entity resolution ("winging it")
  Common solution; doesn't work that well.
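As a rough illustration of the "translators" option, the sketch below maps local IDs from two naming systems through a shared accession. The system names, IDs, accessions, and function names are all hypothetical; a real translator would also have to cope with one-to-many mappings and retired identifiers, which this sketch ignores.

```python
from collections import defaultdict

# Hypothetical mapping rows: (naming system, local ID, shared accession).
MAPPINGS = [
    ("lab_a", "A-001", "ACC0001"),
    ("lab_b", "geneX", "ACC0001"),
    ("lab_a", "A-002", "ACC0002"),
]


def build_translator(rows):
    """Index rows so a local ID can be mapped to its accession and back out."""
    to_accession = {}
    from_accession = defaultdict(dict)
    for system, local_id, accession in rows:
        to_accession[(system, local_id)] = accession
        from_accession[accession][system] = local_id
    return to_accession, from_accession


def translate(local_id, source, target, to_accession, from_accession):
    """Return the target system's ID for local_id, or None if unresolved."""
    accession = to_accession.get((source, local_id))
    if accession is None:
        return None
    return from_accession[accession].get(target)


to_acc, from_acc = build_translator(MAPPINGS)
same_gene = translate("A-001", "lab_a", "lab_b", to_acc, from_acc)
```

Returning None for unresolved IDs keeps the failure visible instead of silently dropping records, which is where ad hoc resolution tends to go wrong.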

Are better standards the solution? http://xkcd.com/927/

Rephrasing technical goals
How can we best provide a platform or platforms to support flexible data integration and data investigation across a wide range of data sets and data types in biology?
My interests:
  Avoid master data manager and centralization.
  Support federated roll-out of new data and functionality.
  Provide flexible extensibility of ontologies and hierarchies.
  Support diverse ecology of databases, interfaces, and analysis software.

Success stories outside of biology?
Look for domains:
  with really large amounts of heterogeneous data,
  that are continually increasing in size,
  that are being effectively mined on an ongoing basis,
  that have widely used programmatic interfaces that support "mashups" and other cross-database stuff,
  and that are intentional, with principles that we can steal or adapt.

Amazon:
> 50 million users, > 1 million product partners, billions of reviews, dozens of compute services.
Continually changing/updating data sets.
Explicitly adopted a service-oriented architecture that enables both internal and external use of this data.
For example, the amazon.com Web site is itself built from over 150 independent services.
Amazon routinely deploys new services and functionality.

Sources:
The Platform Rant (Steve Yegge) -- in which he compares the Google and Amazon approaches: https://plus.google.com/112678702228711889851/posts/eveouesvavx
A summary at HighScalability.com: http://highscalability.com/amazon-architecture
(Note that they are both long and tech-y, but the first is especially entertaining.)

A brief summary of core principles
Mandates from the CEO:
1. All teams must expose data and functionality solely through a service interface.
2. All communication between teams happens through that service interface.
3. All service interfaces must be designed so that they can be exposed to the outside world.
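A small Python sketch of what these mandates might look like for a research database follows. The class, operation name, and data are all hypothetical, and this is not Amazon's actual design. The data store is private, every request and response is plain JSON (so the same interface could be exposed externally), and the "internal" consumer goes through that same interface.

```python
import json


class ExpressionService:
    """Hypothetical service: the only route to the expression data it owns."""

    def __init__(self, records):
        # Private store; callers are expected never to touch this directly.
        self._records = dict(records)

    def handle(self, request_json: str) -> str:
        """Accept a JSON request and return a JSON response."""
        request = json.loads(request_json)
        if request.get("op") == "get_expression":
            gene = request.get("gene")
            if gene in self._records:
                return json.dumps(
                    {"ok": True, "gene": gene, "value": self._records[gene]}
                )
            return json.dumps({"ok": False, "error": "unknown gene"})
        return json.dumps({"ok": False, "error": "unknown op"})


def internal_report(service: ExpressionService, gene: str) -> str:
    """An 'internal' consumer that uses the same interface as outsiders."""
    request = json.dumps({"op": "get_expression", "gene": gene})
    reply = json.loads(service.handle(request))
    if reply["ok"]:
        return f"{gene}: {reply['value']}"
    return f"{gene}: {reply['error']}"


# Hypothetical data, invented for illustration only.
service = ExpressionService({"gene_a": 2.5})
```

Because internal_report has no privileged access, any gap in the public interface shows up for the service's own team first.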

More colloquially: "You should eat your own dogfood."
Design and implement the database and database functionality to meet your own needs, and only use the functionality you've explicitly made available to everyone.
To adapt to research: database functionality should be designed in tight integration with the researchers who are using it, both at a user interface level and programmatically.
(Genome databases have done a really good job of this, albeit generally in a centralized model.)

If the customers aren't integrated into the development loop:

A platform view?
[Diagram; box labels: Metabolic model; Diff'n gene expression query; Data exploration WWW; Gene ID translator; Chemical relationships; Expression normalization; Isoform resolution/comparison; Expression data (tiling); Expression data (microarray); Expression data (mrnaseq); Expression data II (mrnaseq)]

A few points
Open source and agile software development approaches can be surprisingly effective and inexpensive.
Developing services in small groups that include customer-facing developers helps ensure utility.
Implementing services in the cloud (e.g. virtual machines, or on top of "infrastructure as a service" services) gives developers flexibility in tools, approaches, and implementation; it also enables scaling and reusability.

Combining modelling with data
Data-driven modeling: connections and parameters can be, to some extent, determined from data.
Model-driven data investigation: data that doesn't fit the known model is particularly interesting.
The second approach is essentially how particle physicists work with accelerator data: build a model & then interpret the data using the model.
(In biology, models are less constraining, though; more unknowns.)
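One very simple reading of model-driven data investigation is sketched below: compare each observation with the model's prediction and flag the ones that disagree by more than some tolerance. The function name, gene names, values, and threshold are invented for illustration; a real analysis would use a proper noise model instead of a fixed cutoff.

```python
def flag_unexpected(observed, predicted, tolerance):
    """Return keys whose observation differs from the prediction by more than tolerance."""
    return sorted(
        key
        for key, value in observed.items()
        if key in predicted and abs(value - predicted[key]) > tolerance
    )


# Hypothetical numbers, invented for illustration only.
predicted = {"gene_a": 1.0, "gene_b": 4.0, "gene_c": 2.0}
observed = {"gene_a": 1.1, "gene_b": 0.5, "gene_c": 2.2}
surprises = flag_unexpected(observed, predicted, tolerance=1.0)
```

Only gene_b is flagged here; it is the observation the model fails to explain, and so the one worth a closer look.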

Using developmental models Davidson et al., http://sugp.caltech.edu/endomes/

Using developmental models
Models can contain useful abstractions of specific processes; here, the direct effects of blocking nuclearization of β-catenin can be predicted by following the connections.
Models provide a common language for (dis)agreement in a community.
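"Following the connections" can be sketched as a walk over a directed graph of regulator-to-target edges. The network below is entirely hypothetical and is not taken from the Davidson model. It also ignores whether each edge activates or represses, so it only answers "what could be affected", not "in which direction".

```python
from collections import deque

# Hypothetical regulatory edges (regulator -> targets), for illustration only.
NETWORK = {
    "input_x": ["gene_a", "gene_b"],
    "gene_a": ["gene_c"],
    "gene_b": [],
    "gene_c": [],
    "gene_d": ["gene_c"],
}


def downstream(network, blocked):
    """Breadth-first walk returning every node reachable from the blocked node."""
    seen = set()
    queue = deque([blocked])
    while queue:
        node = queue.popleft()
        for target in network.get(node, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


affected = downstream(NETWORK, "input_x")
```

In this toy network, blocking input_x reaches gene_a, gene_b, and gene_c, while gene_d is untouched because nothing upstream of it was blocked.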

Using developmental models Davidson et al., http://sugp.caltech.edu/endomes/

Social obstacles
Training of biologically aware software developers is lacking.
Molecular biologists still very much have a computationally naïve mindset: "give me the answer so I can do the real work."
Incentives for data sharing, much less useful data sharing, are not yet very strong.
  Pubs, grants, respect...
Patterns for useful data sharing are still not well understood, in general.

Other places to look
NEON and other NSF centers (e.g. NCEAS) are collecting vast heterogeneous data sets, and are explicitly tackling the data management/use/integration/reuse problem.
SBML ("Systems Biology Markup Language") is a model description language that enables interoperability of modeling software.
Software Carpentry runs free workshops on effective use of computation for science.

My conclusions
We need a platform mentality to make the most use of our data, even if we don't completely embrace loose coupling and distribution.
Agile and end-user focused software development methodologies have worked well in other areas; much of the hard technical space has already been explored in Internet companies (and probably social networking companies, too).
Data is most useful in the context of an explicit model; models can be generated from data, and models can feed back into data gathering.

Things I didn't discuss
Database maintenance and active curation are incredibly important.
Most data only makes sense in the context of other data (think: controls; wild type vs knockout; other backgrounds; etc.), so we will need lots more data to interpret the data we already have.
Deep learning is a promising field for extracting correlations from multiple large data sets.
All of these technical problems are easier to solve than the social problems (incentives; training).

Thanks --
This talk and ancillary notes will be available on my blog ~soon: http://ivory.idyll.org/blog/
Please do contact me at ctb@msu.edu if you have questions or comments.