Improved VariantSpark breaks the curse of dimensionality for machine learning on genomic data

1 Improved VariantSpark breaks the curse of dimensionality for machine learning on genomic data. Oscar J. Luo, Health Data Analytics, 12th October 2016. HEALTH & BIOSECURITY. (Title image: Shiratani Unsui forest by Σ64.)

2 Transformational Bioinformatics: Denis Bauer, Oscar Luo, Laurence Wilson, Aidan O'Brien, Rob Dunne, Florian Heyl, Piotr Szul.

3 Genomic sequencing can lead to a successful diagnosis in up to 50% of cases where traditional genetic testing failed, and is on average 96% cheaper. Bauer et al., Trends Mol Med, PMID:

4 Genomics projects are getting bigger: ASPREE, 4000 healthy 70+ year olds; Project MinE, 15,000 people with ALS; 100,000 Genomes project, 70,000 individuals by 2017; The Cancer Genome Atlas, 11,000 samples (2015); Human genome, ~1 sample; The HapMap Project, 270 samples; Genome Project, 1097 samples.

5 The problem: Genomic data is more prevalent than ever. Datasets may have hundreds or thousands of samples, and a single sample can be hundreds of gigabytes. This makes the data more difficult (or impossible) for current tools to process, since they are often limited to a single machine. Methods and tools for large-scale data processing usually have limited support for genomic data analysis.

6 Large scale compute: VariantSpark. A scalable tool for performing tertiary analysis on genomic data. It introduces an interface between Variant Call Format (VCF) files and machine learning algorithms. Workflow: Input (VCF), then VariantSpark Toolkit (Spark ML*), then Result (e.g. clustering or disease gene detection).

7 Apache Spark: Lightning-fast Cluster Computing. Scalable: can scale to 1000s of compute nodes, but will also run on commodity hardware. Fault tolerant: no need to add checkpoints; active nodes will take over jobs from failed nodes. Fast: Resilient Distributed Datasets (RDDs) bring the data to the compute, and in-memory caching reduces disk access.

8 VariantSpark is fast. Task: k-means clustering on variants from one chromosome (chr 22). With Spark, we can process the VCF files more efficiently than the non-Spark approaches. [Figure: time in seconds for each method (Python, R, Hadoop, Adam, ADMIXTURE, VariantSpark), broken down by task (binary conversion, clustering, pre-processing).] Setup: chromosome 22; VM on Microsoft Azure with an A7 Linux instance, 8 cores and 56GB memory, running Ubuntu.

9 VariantSpark is accurate. We use Spark ML, Spark's collection of machine learning algorithms. Clustering samples from the 1000 genome project: using VariantSpark with Spark ML, we can cluster samples based on their genetic variants. ~1000 samples with 20M variants each takes about 1 day. ARI (adjusted Rand index) = 0.84, where 1 is a perfect match and values near 0 indicate agreement no better than chance.
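The ARI quoted on this slide compares a clustering against known labels by counting pairs of samples that are grouped together. The following is a minimal sketch of the standard pair-counting formula in plain Python; the function name `adjusted_rand_index` and the toy labels are illustrative assumptions, not part of VariantSpark or Spark ML.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index from the contingency table of two labellings."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labellings must cover the same samples")
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two samples: trivially identical

    # Pairs that fall in the same cell, the same true group, the same predicted group.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_rows * sum_cols / total_pairs  # agreement expected by chance
    maximum = 0.5 * (sum_rows + sum_cols)         # agreement of a perfect match
    if maximum == expected:
        return 1.0  # degenerate case, e.g. everything in one cluster in both
    return (sum_cells - expected) / (maximum - expected)


# Toy example (hypothetical labels): one sample of the second group is split off.
print(round(adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 2]), 2))
```

Cluster names do not matter, only which samples end up together, so relabelling the clusters leaves the score unchanged.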

10 What if we have more than 1000 samples? ~2500 samples with 80M variants from the current phase of the 1000 genome project.

SIZE | RANDOM FOREST (SPARK ML) | CURSED FOREST
Chromosome 1 | 1hr 22mins, 8GB/executor |
Chromosome 1-2 | 5hr 22mins, 16GB/executor |
Chromosome 1-22 | Fail |

The standard implementation of random forest in Spark ML cannot deal with feature vectors of 80M!

11 Curse of Dimensionality in computing: Algorithms do not scale well to high-dimensional data, typically because time or memory grows with the number of dimensions. Spark ML was designed for big but low-dimensional data. Usual Big Data (e.g. customer info): large numbers of samples with few features; each RDD record can be handled by a dedicated executor. Cursed Big Data (e.g. genomics): a moderate number of samples with many features; the feature set is too large to be handled by a single executor.
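A rough back-of-envelope calculation shows why wide feature vectors strain a single executor. The sketch below uses the ~2500 samples and 80M variants quoted in this talk and assumes, purely for illustration, dense vectors of 8-byte values; how Spark ML actually lays out its vectors is not stated on the slides.

```python
# Assumption: every feature is stored densely as an 8-byte floating point value.
BYTES_PER_VALUE = 8
N_SAMPLES = 2_500
N_VARIANTS = 80_000_000


def dense_vector_bytes(n_features, bytes_per_value=BYTES_PER_VALUE):
    """Memory needed for one dense feature vector, ignoring object overhead."""
    return n_features * bytes_per_value


per_sample_gb = dense_vector_bytes(N_VARIANTS) / 1e9
total_tb = N_SAMPLES * dense_vector_bytes(N_VARIANTS) / 1e12

print(f"one sample:  {per_sample_gb:.2f} GB")
print(f"all samples: {total_tb:.1f} TB")
```

Under these assumptions a single sample already occupies a sizeable share of an 8GB executor before any tree-building state is allocated.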

12 CursedForest: a supervised learning tool for big and wide data. An implementation of the random forest algorithm for robustly dealing with wide/high-dimensional data. It doesn't store data in vectors like Spark ML. It builds the same model as Spark ML, just in a more efficient way.

13 CursedForest: More efficient variant storage. Instead of using Vectors as the intermediary format for VCF, we store each individual variant as an item in the RDD. For example, rather than an RDD of 2500 vectors of 80,000,000 dimensions, we have an RDD of 2500 x 80,000,000 (200,000,000,000) tiny vectors:

/**
 * RDD[FlatVariant(subjectId: String, variantindex: Int, allele: Double)]
 */

200,000,000,000 items in an RDD may seem like a lot more than 2,500, but this means we can now separate the data much more granularly, i.e. by specific features rather than entire samples.
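The flat layout described on this slide can be illustrated without Spark. The sketch below mirrors the three fields of `FlatVariant` in a plain Python tuple and regroups the records by variant; the helper names `flatten` and `group_by_variant` and the two toy samples are assumptions made for illustration, not VariantSpark code.

```python
from collections import defaultdict
from typing import NamedTuple


class FlatVariant(NamedTuple):
    """One record per (sample, variant) pair, mirroring the fields on the slide."""
    subject_id: str
    variant_index: int
    allele: float


def flatten(samples):
    """Turn {subject_id: [allele, ...]} into a flat list of FlatVariant records."""
    return [
        FlatVariant(subject_id, index, float(allele))
        for subject_id, alleles in samples.items()
        for index, allele in enumerate(alleles)
    ]


def group_by_variant(records):
    """Regroup flat records by feature, so each variant can be handled separately."""
    by_variant = defaultdict(dict)
    for record in records:
        by_variant[record.variant_index][record.subject_id] = record.allele
    return dict(by_variant)


# Two hypothetical samples with three variants each.
samples = {"S1": [0, 1, 2], "S2": [2, 0, 1]}
records = flatten(samples)
columns = group_by_variant(records)

print(len(records))            # 2 samples x 3 variants
print(columns[1])              # all alleles observed for variant 1
print(2_500 * 80_000_000)      # record count for the dataset size on the slide
```

Because every record carries its own variant index, records can be partitioned by feature instead of by sample, which is what makes per-feature work distribution possible.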

14 CursedForest: More efficient task management. Spark ML builds each decision tree on a single compute node, which is only a problem with wide data. By not using traditional vectors, we can distribute the growth of each tree across multiple nodes. Each node calculates the information gain of candidate splits for a subset of features. These results are then collected and the optimal split is selected.
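The split search described on this slide can be sketched in a few lines. This is a single-process simulation under stated assumptions: entropy-based information gain, threshold splits on allele values, and a round-robin assignment of features to hypothetical workers. It is not the CursedForest implementation, and all function names are illustrative.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting samples into value <= threshold and the rest."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - children


def best_split_for_subset(feature_columns, labels):
    """Work of one hypothetical worker: scan only the features assigned to it."""
    best = (-1.0, None, None)  # (gain, feature index, threshold)
    for index, values in feature_columns.items():
        for threshold in sorted(set(values))[:-1]:  # largest value cannot split
            gain = information_gain(values, labels, threshold)
            if gain > best[0]:
                best = (gain, index, threshold)
    return best


def distributed_best_split(feature_columns, labels, n_workers):
    """Assign features round-robin, collect each local best, keep the global best."""
    indices = sorted(feature_columns)
    partitions = [indices[w::n_workers] for w in range(n_workers)]
    local_bests = [
        best_split_for_subset({i: feature_columns[i] for i in part}, labels)
        for part in partitions
    ]
    return max(local_bests, key=lambda candidate: candidate[0])


# Toy data: four samples, four variants keyed by index, binary phenotype labels.
labels = [0, 0, 1, 1]
features = {0: [0, 1, 0, 1], 1: [0, 0, 2, 2], 2: [1, 1, 1, 1], 3: [0, 0, 0, 1]}
print(distributed_best_split(features, labels, n_workers=2))
```

Each worker only needs the columns for its own features plus the labels, so only one small tuple per worker has to be collected to choose the split.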

15 Can we do better with supervised learning?

SIZE | RANDOM FOREST (SPARK ML) | CURSED FOREST
Chromosome 1 | 1hr 22mins, 8GB per executor | 2 mins, 2GB per executor
Chromosome 1-2 | 5hr 22mins, 16GB per executor | 3 mins, 4GB per executor
Chromosome 1-22 | Fail | 13 mins, 8GB per executor

Constantly optimizing CursedForest for efficiency!

16 Supervised learning = higher accuracy ARI=~

17 Conclusions: Our improved VariantSpark tool is an interface bringing big learning tasks to genomics applications. It can build a Random Forest model on ~2500 individuals and 80 million variants in under 15 minutes using 8GB per executor. CursedForest solves the curse of dimensionality for machine learning on genomic data.
