Graph-Based Genomic Integration Using Spark


1 Graph-Based Genomic Integration Using Spark
  David Tester, Novartis Institutes for Biomedical Research

2 Scope
  Sources shown:
    Reference Genome Database
    External Transcript Database
    Internal Expression Measurements
    Public Variation Database
    Biological Sample Database
    Variant Effect Prediction Algorithms
    Ontology of Sequence Alterations
    Ontology of Health Conditions
    Ontology of Anatomical Structures
  Example (diagram labels): Transcript; Cancerous; 34 copies; Hippocampus Tissue Sample
    Sequence: ACTAGATAGCAGCTAGCTAGCGCGCGATATCGCGATAGCGCGATAGCGCTAGAGCTCGTCGCAAAGCTGAGCTCG
    Insertion: TCAAT
    Produces Frameshift Event
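The insertion example on this slide can be checked with simple arithmetic: inside a protein-coding region, an insertion or deletion whose length is not a multiple of three shifts the reading frame. The sketch below is only that length check in Python. The function name `is_frameshift` is hypothetical, and this is not the variant effect prediction algorithm referred to in the talk.

```python
def is_frameshift(inserted: str = "", deleted: str = "") -> bool:
    """Return True if the net length change is not a multiple of three.

    Assumption: the variant lies inside a protein-coding region, where
    bases are read in codons of three.
    """
    net_change = len(inserted) - len(deleted)
    return net_change % 3 != 0


# The slide's example: inserting the five bases TCAAT.
print(is_frameshift(inserted="TCAAT"))  # 5 is not a multiple of 3
print(is_frameshift(inserted="TCA"))    # 3 is a multiple of 3
```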

3 Scope (more)
  Much, much more over time
    New types of information (compound interactions, protein interactions, ...)
    Confidence and error propagation
  Need to be extremely flexible
    Model flexibility (our scientists disagree with each other!)
    Analytic flexibility (technologies change!)

4 Data Characteristics
  Moderately sized
    As graph: currently low trillions of edges
    As tables: currently 100s of billions of rows
  But growing extremely fast
    Driven by logarithmic decrease in sequencing costs
  Also extremely messy
    10s-100s of internal and external sources
    Incentive structure at odds with standardization?

5 More a Framework than a Database
  Overall Architecture (diagram)
    Layers: Input API; Integration Layer; Output API; Analytics API
    We Input: Raw data and metadata
    We Add: Markup (Scala DSL)
    We Archive: Graph representations of all datasets + markup
    We Output: Vertica, SciDB, R API, Dataframes, Others
    Validation: Local Validation of Markup, Datasets, and Ontologies; Global Validation; Analysis- or Endpoint-Specific Validation
    Other diagram labels: CSVs; VCFs; Etc; Data and Metadata with User Markup; Ontologies; Graph Representation; Table Markup; Production Databases; Ad-hoc Datamarts; Analysis Scripts (Spark); Non-Database Analytic Tools; Various; Spark

6 Integration Mechanics
  1. Parse
  2. Group into semantic units
  3. Create integration instructions
  4. Join to other datasets
  5. Convert to edges
  6. Enrich edges (reasoning)
  7. Assign IDs
  8. Publish
  (Shown over the same architecture diagram as slide 5.)
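A few of the integration steps can be sketched in plain Python to make the flow concrete. This is a minimal illustration under stated assumptions, not the pipeline from the talk: the CSV layout, the column names, the choice of one sample as a semantic unit, and the content-hash edge IDs are all hypothetical. Steps 3, 4, 6, and 8 are left out.

```python
import csv
import hashlib
import io
from collections import defaultdict

# Hypothetical input file; the real sources and schemas are not given.
RAW_CSV = """sample_id,attribute,value
S1,HasDiagnosis,Adenoma
S1,HasTissue,Hippocampus
S2,HasTissue,Hippocampus
"""


def parse(text):
    """Step 1: parse raw input into row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def group_into_units(rows):
    """Step 2: group rows describing the same thing (assumed: one sample)."""
    units = defaultdict(list)
    for row in rows:
        units[row["sample_id"]].append(row)
    return dict(units)


def to_edges(units):
    """Step 5: turn each row into a (label, source, target) edge."""
    return [
        (row["attribute"], unit_key, row["value"])
        for unit_key, rows in units.items()
        for row in rows
    ]


def assign_ids(edges):
    """Step 7: derive a deterministic ID from each edge's content.

    Content hashing is an assumption made for this sketch only.
    """
    return {
        hashlib.sha1("|".join(edge).encode("utf-8")).hexdigest()[:12]: edge
        for edge in edges
    }


edges_by_id = assign_ids(to_edges(group_into_units(parse(RAW_CSV))))
for edge_id, edge in edges_by_id.items():
    print(edge_id, edge)
```

Deriving IDs from content keeps them stable across reruns of the same input, which is one plausible way to satisfy step 7; the talk does not say how IDs are actually assigned.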

7 Why Graphs?
  We really just want binary relationships
    Example: HasDiagnosis(ThisTissue, Adenoma)
    Flexible representation (more expressive than key/value, less rigid than tables)
    Well-understood reasoning mechanics (transitive closures, type inference, validation, ...)
    Useful for integration
  Logically equivalent to a graph with labeled, directed edges
    For topological and/or network analyses (e.g., shortest paths)
    GraphX
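To illustrate how a binary relationship maps onto a labeled, directed edge, and what "reasoning mechanics" such as transitive closure can look like, here is a small Python sketch. Only `HasDiagnosis(ThisTissue, Adenoma)` comes from the slide. The `IsA` edges, the `Edge` type, and both functions are hypothetical, and the naive closure loop stands in for whatever reasoning the actual system performs.

```python
from typing import NamedTuple, Set


class Edge(NamedTuple):
    label: str    # relationship name, e.g. "HasDiagnosis"
    source: str   # subject vertex
    target: str   # object vertex


def transitive_closure(edges: Set[Edge], label: str) -> Set[Edge]:
    """Add label(a, c) whenever label(a, b) and label(b, c) are present."""
    closed = set(edges)
    changed = True
    while changed:
        changed = False
        same_label = [e for e in closed if e.label == label]
        for first in same_label:
            for second in same_label:
                if first.target == second.source:
                    inferred = Edge(label, first.source, second.target)
                    if inferred not in closed:
                        closed.add(inferred)
                        changed = True
    return closed


def propagate(edges: Set[Edge], fact_label: str, hierarchy_label: str) -> Set[Edge]:
    """Infer fact(x, broad) from fact(x, narrow) and hierarchy(narrow, broad)."""
    inferred = set(edges)
    facts = [e for e in edges if e.label == fact_label]
    links = [e for e in edges if e.label == hierarchy_label]
    for fact in facts:
        for link in links:
            if fact.target == link.source:
                inferred.add(Edge(fact_label, fact.source, link.target))
    return inferred


edges = {
    Edge("HasDiagnosis", "ThisTissue", "Adenoma"),  # from the slide
    Edge("IsA", "Adenoma", "BenignNeoplasm"),       # hypothetical ontology edge
    Edge("IsA", "BenignNeoplasm", "Neoplasm"),      # hypothetical ontology edge
}

closed = transitive_closure(edges, "IsA")
enriched = propagate(closed, "HasDiagnosis", "IsA")
print(Edge("HasDiagnosis", "ThisTissue", "Neoplasm") in enriched)
```

Because every fact is the same kind of triple, ontology edges and data edges can be reasoned over together, which is the integration benefit the slide points at.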

8 Why not a Graph toolkit?
  Distributed graph toolkits require an efficient partitioning strategy
    Typically a function from edge or vertex to its partition
  We partition over semantic units (encoded as subgraphs)
    Each subgraph contains the edges which describe a semantic unit
    Has natural mapping back/forth to source files
    Naturally expressed in Spark, awkward for GraphX
  But GraphX is just a flatMap away when we need it!
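The contrast between partitioning by semantic unit and partitioning by edge or vertex can be sketched without Spark. In the Python below, each semantic unit is a keyed subgraph, the partition is chosen from the unit key, and flattening the subgraphs yields the plain edge list a graph toolkit expects. The unit keys, edge labels, and the `partition_of` helper are hypothetical; in Spark the flattening step would correspond to a `flatMap` over a collection of subgraphs.

```python
import zlib
from itertools import chain
from typing import Dict, List, Tuple

EdgeTriple = Tuple[str, str, str]  # (label, source, target)

# Hypothetical semantic units: one subgraph of edges per unit key.
semantic_units: Dict[str, List[EdgeTriple]] = {
    "sample-001": [
        ("HasDiagnosis", "sample-001", "Adenoma"),
        ("HasTissue", "sample-001", "Hippocampus"),
    ],
    "sample-002": [
        ("HasTissue", "sample-002", "Hippocampus"),
    ],
}


def partition_of(unit_key: str, num_partitions: int) -> int:
    """Choose a partition from the unit key, so a unit's edges stay together.

    crc32 is used instead of hash() because Python randomizes string
    hashes between processes.
    """
    return zlib.crc32(unit_key.encode("utf-8")) % num_partitions


# Partition whole subgraphs, never individual edges.
partitions: Dict[int, List[str]] = {}
for key in semantic_units:
    partitions.setdefault(partition_of(key, 4), []).append(key)

# The "flatMap" step: drop the unit boundaries to get a plain edge list.
flat_edges: List[EdgeTriple] = list(chain.from_iterable(semantic_units.values()))

print(partitions)
print(flat_edges)
```

Keeping the unit key as the partitioning handle also preserves the mapping back to source files, since a unit can carry its provenance with it; the flat edge list loses that grouping, which is why it is produced only when a network analysis needs it.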

9 Future
  Heavy interest in not-just-genomic data
  More reliance on Spark-based analytics
    Parquet / Spark DataFrames / Spark SQL / ADAM
  Trillion-edge graph follows exponential trajectory in data size over next few years

10 Acknowledgements
  Rob Anderson, Hans Bitter, Mark Borowsky, Sophie Brachat, Victor Bucor, Rose Brannon, Jason Calvert, Dennis Cunningham, John Damask, Timothy Danford, Anthony Dibiase, Sean Duane, Chris Farnham, Nick Flower, Laurent Gauthier, Ajay Gourneni, Nabil Hachem, Victor Hong, Mike Jones, Jason Kondracki, Andrew Knueven, Igor Mendelev, Steve Marshall, Gregg McAllister, Brant Peterson, Martin Petracchi, Brian Repko, Erik Sassaman, Mark Schreiber, Ruth Seltzer, Dave Sexton, Anita Stout, David Treff, Quan Yang, Dongmei Zuo, and many others

11 We're Hiring Spark Gurus!
  Novartis Institutes for Biomedical Research - Cambridge, Massachusetts
  View our open positions at:
