Financed by the European Commission 7th Framework Programme. biobankcloud.eu. Jim Dowling, PhD, Assoc. Prof., KTH, Project Coordinator

1 Financed by the European Commission 7th Framework Programme. biobankcloud.eu. Jim Dowling, PhD, Assoc. Prof., KTH, Project Coordinator

2 The Biobank Bottleneck: We will soon be generating massive amounts of genomic (NGS) data. We can't securely store and analyze this data with current systems. There is an urgent need for better systems to store and process genomic data.

3 Population-Scale WGS: $1325 per Genome. HiSeq X Ten^ => ~18,000 genomes/year. Volume => ~5.2 PB/year*. Velocity => ~45 MB/sec*. ^Cost ~$10 million. *5.2 PB assumes a replication factor of 3. See:
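
The volume and velocity figures on this slide are linked by simple arithmetic. The sketch below derives a per-genome footprint and a sustained ingest rate from the slide's yearly numbers. It assumes decimal units (1 PB = 10^15 bytes) and a sequencer running continuously all year; the slide does not state its own assumptions, so the computed rate lands in the same tens-of-MB/sec range as the slide's figure without necessarily matching it exactly.

```python
# Back-of-envelope check of the slide's volume/velocity numbers.
# Assumptions (not from the slide): decimal units, 365 days of continuous operation.
SECONDS_PER_YEAR = 365 * 24 * 3600

genomes_per_year = 18_000            # from the slide
replicated_bytes_per_year = 5.2e15   # ~5.2 PB/year, from the slide
replication_factor = 3               # from the slide's footnote

# Data actually produced by the sequencers, before replication.
raw_bytes_per_year = replicated_bytes_per_year / replication_factor
bytes_per_genome = raw_bytes_per_year / genomes_per_year

# Sustained write rates, in MB/sec, before and after replication.
raw_rate_mb_s = raw_bytes_per_year / SECONDS_PER_YEAR / 1e6
replicated_rate_mb_s = replicated_bytes_per_year / SECONDS_PER_YEAR / 1e6

print(f"per genome (unreplicated): {bytes_per_genome / 1e9:.1f} GB")
print(f"ingest rate (unreplicated): {raw_rate_mb_s:.1f} MB/sec")
print(f"write rate (3x replicated): {replicated_rate_mb_s:.1f} MB/sec")
```

The point of the exercise is that the ingest rate itself is modest; the challenge lies in the accumulated volume and in processing it afterwards.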

4 Genomics needs Big Data [Image source: Patterson, Fighting the Big C with the Big D, 2014]

5 Genomics needs Big Data [Image source: Patterson, Fighting the Big C with the Big D, 2014]

6 Network effects: Biggest Dataset(s) win! [Figure: "#diseases insights" plotted against "log(#samples)", with a "rare diseases" label]

7 Big Data is affordable!

8 180TB for $9,305

9 20PB for $279,150

10 Administration Costs: Facebook operations staffers manage 20,000-26,000 servers each

11 Genomics needs Tools for Big Data: In a wide array of academic fields, the ability to effectively process data is superseding other more classical modes of research. "More data trumps better algorithms"* *The Unreasonable Effectiveness of Data [Halevy et al. 09]

12 Forget about serial analysis pipelines. Read genome on 1 machine: ~1000 secs [Harddisk Image courtesy of sorapop /

13 Pipeline Issues in popular NGS toolkits: Time taken to get answers from reads is too long. Population-level statistical analysis requires petabytes of data. Standard analysis of genomes does not even scale to thousands of genomes.

14 Single-Machine Genome Analysis using GATK. Stage times (GATK 2.7/NA12878): Mark Duplicates 13 hours; BQSR 9 hours; Realignment 32 hours; Call Variants 8 hours; Total 62 hours*. [Diagram, Read Pre-Processing: Raw Reads, Mapping, Sorted Mapping, Local Alignment, Mark Duplicates, Base Quality Score Recalibration, Calling-Ready Reads] *

15 Big Data means Parallelization. Read genome on 100 machines: ~10 seconds
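
The ~1000 second and ~10 second figures follow from dividing file size by aggregate disk bandwidth. A minimal sketch, assuming a genome file of about 100 GB and a sequential disk read rate of about 100 MB/sec per machine (both values are illustrative assumptions, not stated on the slides), and assuming the file is split evenly so that every machine reads only its own share from a local disk:

```python
def read_time_seconds(file_bytes, disk_bytes_per_sec, machines=1):
    """Time to scan a file split evenly over `machines` local disks.

    Ignores coordination overhead, stragglers, and network transfer.
    """
    if machines < 1:
        raise ValueError("machines must be at least 1")
    return file_bytes / machines / disk_bytes_per_sec


GENOME_BYTES = 100e9   # assumed: ~100 GB per genome file
DISK_BW = 100e6        # assumed: ~100 MB/sec sequential read per disk

serial = read_time_seconds(GENOME_BYTES, DISK_BW, machines=1)
parallel = read_time_seconds(GENOME_BYTES, DISK_BW, machines=100)
print(f"1 machine: {serial:.0f} s, 100 machines: {parallel:.0f} s")
```

The linear speedup only holds if each machine reads from its own disk; if all machines pull the data over a shared network link, that link becomes the limit instead.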

16 Speedup using ADAM/Spark/HDFS*. Sort (250 GB): Picard on 1 hs1.8xlarge, 17h 44m; ADAM on 100 m2.4xlarge, 21m; 43x Speedup. Mark Duplicates (250 GB): Picard on 1 hs1.8xlarge, 20h 22m; ADAM on 100 m2.4xlarge, 29m; 54x Speedup. *

17 HiWay/Cuneiform Variant Call Workflow: Terabytes of input data (1000 Genomes Project). Scalability experiments on up to 576 containers in a 28-node cluster. [Workflow diagram labels: "read and reference parallel", "reference parallel"]

18 HiWay/Cuneiform Scalability* [Figure: Workflow Runtime with Increasing Number of Containers; axes: Runtime in Minutes vs. Number of Nodes] [Figure: Container time spent at different stages of execution: idle, scheduling, startup, stage-in, execution, stage-out, shutdown] [*Unpublished Results]

19 Storage and Processing of Big Data

20 What is Apache Hadoop? Petabyte data sets; 1000s of nodes on cheap commodity hardware; Highly Available and Fault Tolerant; Data Location-Aware Computation Frameworks

21 Big Data Processing with No Data Locality [Diagram: Job("/genomes/jim.bam") is submitted to a Workflow Manager, which runs the Job on a Compute Grid Node] This doesn't scale. Bandwidth is the bottleneck.

22 Big Data Processing with Data Locality [Diagram: Job("/genomes/jim.bam") is submitted to the Job Tracker, which hands Tasks to six Task Trackers; each Task Tracker runs its Job next to a DataNode (DN) and writes result files. R = resultfile(s)]
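
The contrast between the two processing slides comes down to where tasks are placed relative to the data. The following is a minimal sketch of locality-aware scheduling, not Hadoop's actual scheduler: every name in it (`schedule_tasks`, `local_fraction`, the block and node identifiers) is hypothetical, and three replicas per block is an assumption. One task is created per block and assigned to the least-loaded node that already stores a replica, falling back to any node only when no replica holder is available.

```python
def schedule_tasks(block_locations, nodes):
    """Assign one task per block, preferring nodes that already hold the block.

    block_locations: dict mapping block id -> list of nodes storing a replica
    nodes: list of nodes that can run tasks
    """
    load = {node: 0 for node in nodes}
    assignment = {}
    for block, replicas in block_locations.items():
        # Prefer replica holders; fall back to all nodes (a remote read).
        candidates = [n for n in replicas if n in load] or list(nodes)
        # Pick the least-loaded candidate; the name breaks ties deterministically.
        chosen = min(candidates, key=lambda n: (load[n], n))
        load[chosen] += 1
        assignment[block] = chosen
    return assignment


def local_fraction(assignment, block_locations):
    """Fraction of tasks that read their block from the local disk."""
    local = sum(1 for blk, node in assignment.items()
                if node in block_locations[blk])
    return local / len(assignment)


# Hypothetical layout: four blocks of one file, three replicas each, six DataNodes.
nodes = ["DN1", "DN2", "DN3", "DN4", "DN5", "DN6"]
block_locations = {
    "blk_0": ["DN1", "DN2", "DN3"],
    "blk_1": ["DN2", "DN4", "DN5"],
    "blk_2": ["DN1", "DN5", "DN6"],
    "blk_3": ["DN3", "DN4", "DN6"],
}
assignment = schedule_tasks(block_locations, nodes)
print(assignment, local_fraction(assignment, block_locations))
```

When every task runs on a node that holds its block, only the small result files cross the network, which is why the bandwidth bottleneck of the grid-style setup disappears.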

23 We store Genomes in our Hadoop Filesystem. Stripe a file's blocks over data nodes using a block placement policy that minimizes re-identification risk. Donor data is not compromised unless intruders also compromise the NameNode state. [Diagram: Name node holds /genomes/jim.bam -> {2,4,3,6,9}; Data nodes hold the blocks; scenario: genetic data stolen from Data nodes] Hops-HDFS: Scalable (PBs), High Availability, Erasure-Coding Replication
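
The idea of striping blocks so that no single DataNode holds a usable portion of a genome can be illustrated with a small sketch. This is not the Hops-HDFS placement policy, which the slide does not describe in detail; `place_blocks`, the `max_per_node` limit, and the ten-node cluster are hypothetical choices for illustration, and replication and erasure coding are left out. The file-to-block mapping lives only in the NameNode-side dictionary, mirroring the slide's `/genomes/jim.bam -> {2,4,3,6,9}` example.

```python
import random


def place_blocks(num_blocks, datanodes, max_per_node=1, seed=None):
    """Pick a DataNode for each block of one file.

    No DataNode receives more than `max_per_node` blocks of the file,
    so a single stolen DataNode exposes only a small, unordered fragment.
    """
    rng = random.Random(seed)
    capacity = {dn: max_per_node for dn in datanodes}
    placement = []
    for _ in range(num_blocks):
        available = [dn for dn, left in capacity.items() if left > 0]
        if not available:
            raise ValueError("not enough DataNodes for the requested spread")
        dn = rng.choice(available)
        capacity[dn] -= 1
        placement.append(dn)
    return placement


# NameNode state: the only place where file name and block order are recorded.
namenode = {"/genomes/jim.bam": place_blocks(5, range(1, 11), seed=0)}
print(namenode)
```

Under this scheme an intruder who copies a DataNode's disk sees anonymous blocks without the file name, the donor, or the block order, which is why the slide ties a full compromise to the NameNode state.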

24 The Bigger Picture.

25 BiobankCloud Ecosystem [Diagram labels: Big Data; Cloud Computing; Commodity Hardware; Security Technologies; Regulations, Ethics, Security Reqs; Genomic Data; Bioinformatics standards, tools, pipelines]

26 Users of BiobankCloud [Diagram labels: Bioinformaticians; IT Administrators; Genomic Data; Biobankers]

27 Genomic data management solutions: Outsource to SaaS providers - Illumina's BaseSpace - DNAnexus, Spiral Genetics, SVBio; Sequencing Centers; High Performance Computing Centers; In-House

28 Demo LIMS

29 Conclusions: Open-source commodity data management solutions for genomic data are feasible and economical - BiobankCloud is building such a system. BiobankCloud status - Under heavy development - First version of the platform in testing - Ongoing work ensuring compliance with our regulatory and ethical framework

30 The Team. KTH: Salman Niazi, Mahmoud Ismail, Kamal Hakimzadeh, Hamid Afzali, Ali Gholami, Alberto Lorente. Humboldt University: Ulf Leser, Jörgen Brandt, Marc Bux. University of Lisbon: Alysson Bessani, Vinicius Cogo. Karolinska Institute: Jan-Eric Litton, Roxana Martinez, Jane Reichel, Mats Hansson. Charité University Hospital: Michael Hummel, Lora Dimitrova, Karen Zimmermann.

31 Backup Studies and Audit Trail

32 Backup Study (joined)

33 Backup Upload Study Samples

34 Backup Study Samples

35 Backup Study (owner)
