The Human Variant Database


1 The Human Variant Database. Mya Warren, Michael Smith Genome Sciences Centre, Vancouver, BC

2 Bioinformatics is Big Data. Human genome has: 3 billion nucleotide bases; 60 thousand genes; thousand proteins. Bioinformatics takes advantage of: high performance computing; sophisticated algorithms; math/statistics; machine learning.

3 Our mission. Two parallel goals. Personalized Oncogenomics Program: use patient genomics to diagnose and identify therapies for each patient's unique disease. Cancer research: find new patterns in the genomics data to identify novel targets for therapy, learn fundamental truths about cancer.

4 Our mission. Two parallel goals. Personalized Oncogenomics Program: use patient genomics to diagnose and identify therapies for each patient's unique disease. Cancer research: find new patterns in the genomics data to identify novel targets for therapy, learn fundamental truths about cancer. The database supports these goals through: fast querying and exploration of patient genomics, clinical covariates; data mining and analysis of patient cohorts.

5 HAWQ (HAdoop With Queries): a massively parallel processing (MPP) SQL engine in Hadoop.

6 HAWQ (HAdoop With Queries): a massively parallel processing (MPP) SQL engine in Hadoop. Interface with the data using PostgreSQL.

7 HAWQ (HAdoop With Queries): a massively parallel processing (MPP) SQL engine in Hadoop. Interface with the data using PostgreSQL. Parallel, fault tolerant architecture for storing and processing big data.

8 Our system: 13 slave nodes; 32-thread CPUs; total memory: 1.5 TB; total storage: 250 TB; current disk usage: 1.5 TB; largest table: ~10 billion rows.

9 HAWQ Architecture. Hadoop distributed file system (HDFS): data is chunked, replicated, distributed. Data locality: move the computation to the data; data is not shared. HAWQ is very fast, with linear scalability. Can interface with the rest of the Hadoop ecosystem.
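The "move the computation to the data" idea on slide 9 can be illustrated with a toy sketch in plain Python. This is not HAWQ or HDFS code: the `chunk` and `local_count` helpers and the sample rows are hypothetical, and replication is left out. It only shows that each chunk is processed where it sits and that only small partial results are combined.

```python
from collections import Counter

def chunk(rows, n_chunks):
    """Split rows into n_chunks slices (a stand-in for distributed data blocks)."""
    return [rows[i::n_chunks] for i in range(n_chunks)]

def local_count(rows):
    """Work done next to one chunk: count variants per chromosome."""
    return Counter(chrom for chrom, _pos in rows)

# Hypothetical (chromosome, position) rows.
rows = [("chr1", 10), ("chr1", 20), ("chr2", 30),
        ("chr3", 40), ("chr1", 50), ("chr2", 60)]

chunks = chunk(rows, 3)
partials = [local_count(c) for c in chunks]  # each "node" sees only its own chunk
total = sum(partials, Counter())             # only the small partial counts are merged
print(dict(total))
```

Because the per-chunk work is independent, adding chunks (and nodes to hold them) spreads the same work more thinly, which is the intuition behind the scalability claim on the slide.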

10 HAWQ vs. Relational Databases. Append-only tables. No primary keys. No foreign keys. Joins are more expensive. Extract-transform-load (ETL) optimized for large data files: import raw data, transform data in database.
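The "import raw data, transform data in database" pattern from slide 10 might look roughly like the sketch below. It uses Python's built-in `sqlite3` module purely as a stand-in for HAWQ's PostgreSQL interface; the table names (`raw_snv`, `snv`), the columns, and the quality threshold are all assumptions made for illustration.

```python
import sqlite3

# SQLite stands in for HAWQ here; all table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw_snv "
    "(library TEXT, chrom TEXT, pos TEXT, ref TEXT, alt TEXT, qual TEXT)"
)

# Step 1: import the raw data unchanged (appends only, no keys, no updates).
raw_rows = [
    ("LIB001", "chr1", "12345", "A", "G", "48.0"),
    ("LIB001", "chr2", "67890", "C", "T", "12.5"),
    ("LIB002", "chr1", "12345", "A", "G", "55.1"),
]
conn.executemany("INSERT INTO raw_snv VALUES (?, ?, ?, ?, ?, ?)", raw_rows)

# Step 2: transform inside the database, writing a new table
# instead of updating rows in place.
conn.execute("""
    CREATE TABLE snv AS
    SELECT library, chrom,
           CAST(pos AS INTEGER) AS pos,
           ref, alt,
           CAST(qual AS REAL) AS qual
    FROM raw_snv
    WHERE CAST(qual AS REAL) >= 20.0
""")

snv_rows = conn.execute(
    "SELECT library, chrom, pos, qual FROM snv ORDER BY library"
).fetchall()
print(snv_rows)
```

Writing the cleaned rows to a new table rather than updating the raw one fits the append-only constraint listed on the slide.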

11 The Data. Internally generated data + public cancer datasets (TCGA): 11,519 patients; 21,591 libraries; 31,067 analyses; > 10 billion rows.

12 Variants. Raw data for: unpaired/somatic SNVs and indels; germline/somatic CNVs; somatic loss of heterozygosity; gene expression; homozygous deletions. Post-processed and filtered variant data.

13 Metadata. Library construction and sequencing. Analysis pipeline. Patient data: demographics; biopsy diagnoses; drug treatment; radiation treatment.

14 Annotations: dbSNP; COSMIC; ClinVar; SnpEff; gene models.

15 Coming soon. Other internal projects. More external data sets! Structural variants, miRNA... Disease/drug ontologies. Knowledgebase. More data = better analysis!

16 Accessing the data. Custom queries and pipelines.

17 Accessing the data. Custom queries and pipelines. General purpose REST APIs: Python SQLAlchemy object-relational model; Pyramid REST framework. Web interface: query, filter, analyze.

18 Query selector

19 Results

20 The Future. Let the database do the work!

21 The Future. Let the database do the work! Why give up your pipeline? Speed, flexibility.

22 Tasks that could be done on the variant database: annotations; filtering; statistical analysis and analytics; correlations; machine learning.
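As a rough illustration of "let the database do the work", the sketch below runs annotation, filtering, and a simple recurrence count in a single SQL statement instead of pulling rows into a pipeline. Python's built-in `sqlite3` again stands in for HAWQ, and the schema (`variant`, `gene_annotation`), gene names, and data are hypothetical.

```python
import sqlite3

# SQLite stands in for HAWQ; the schema and rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE variant (patient TEXT, chrom TEXT, pos INTEGER, alt TEXT);
    CREATE TABLE gene_annotation (chrom TEXT, pos INTEGER, gene TEXT, effect TEXT);
""")
conn.executemany("INSERT INTO variant VALUES (?, ?, ?, ?)", [
    ("P1", "chr17", 100, "T"),
    ("P2", "chr17", 100, "T"),
    ("P3", "chr17", 100, "T"),
    ("P1", "chr12", 200, "A"),
    ("P2", "chr7", 300, "G"),
])
conn.executemany("INSERT INTO gene_annotation VALUES (?, ?, ?, ?)", [
    ("chr17", 100, "GENE_A", "missense"),
    ("chr12", 200, "GENE_B", "synonymous"),
    ("chr7", 300, "GENE_C", "missense"),
])

# Annotate (JOIN), filter (WHERE), and summarize (GROUP BY / HAVING) in one query.
recurrent = conn.execute("""
    SELECT a.gene, COUNT(DISTINCT v.patient) AS n_patients
    FROM variant v
    JOIN gene_annotation a ON a.chrom = v.chrom AND a.pos = v.pos
    WHERE a.effect = 'missense'
    GROUP BY a.gene
    HAVING COUNT(DISTINCT v.patient) >= 2
    ORDER BY n_patients DESC
""").fetchall()
print(recurrent)
```

Only the short summary leaves the database, so the client never handles the full variant table.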

23 Scalable, in-database analytics

24 Thanks! Variant DB Developers: Marcel Bernard, Joshua Davies, Darryl D'Souza, Navjashan Singh, James Zhou, Simon Chan. PIPE/BioApps/LIMS: Morgan Bye, Karen Eddy, Patrick Plettner. Systems: Hansen Wong, Rudy Zhou, Lance Bailey, Brandon Pierce, Richard Corbett, Eric Chuah, Yussanne Ma.
