SparkBurst: An Efficient and Faster Sequence Mapping Tool on Apache Spark Platform

MTP End-Sem Report submitted to Indian Institute of Technology, Mandi for partial fulfillment of the degree of B. Tech.

by Shivam Satija (B12020)
under the guidance of Dr. Arti Kashyap, Associate Professor

SCHOOL OF COMPUTING AND ELECTRICAL ENGINEERING
INDIAN INSTITUTE OF TECHNOLOGY MANDI
28th May 2016

CERTIFICATE OF APPROVAL

Certified that the End-Sem Report entitled SparkBurst: An Efficient and Faster Sequence Mapping Tool on Apache Spark Platform, submitted by Shivam Satija (B12020) to the Indian Institute of Technology Mandi for the partial fulfilment of the degree of B. Tech., has been accepted after examination held today.

Date: 28th May 2016
Place: KAMAND, H. P., INDIA
Dr. Varun Dutt, Faculty Advisor

CERTIFICATE

This is to certify that the End-Sem Report titled SparkBurst: An Efficient and Faster Sequence Mapping Tool on Apache Spark Platform, submitted by Shivam Satija (B12020) to the Indian Institute of Technology, Mandi, is a record of bonafide work under my (our) supervision and is worthy of consideration for partial fulfilment of the degree of B. Tech. of the Institute.

Date: 28th May 2016
Place: KAMAND, H. P., INDIA
Dr. Arti Kashyap (Guide), Faculty Supervisor(s)

DECLARATION BY THE STUDENT

This is to certify that the End-Sem Report titled SparkBurst: An Efficient and Faster Sequence Mapping Tool on Apache Spark Platform, submitted by me to the Indian Institute of Technology Mandi for partial fulfillment of the degree of B. Tech., is a bonafide record of work carried out by me under the supervision of Dr. Arti Kashyap. The contents of this MTP, in full or in parts, have not been submitted to any other Institute or University for partial fulfillment of any degree or diploma.

Date: 28th May 2016
Place: KAMAND, H. P., INDIA
Shivam Satija, B12020

Acknowledgments

I would like to express my special thanks of gratitude to my guide, Dr. Arti Kashyap, who gave me the golden opportunity to do this project on the topic (SparkBurst: An Efficient and Faster Sequence Mapping Tool on Apache Spark Platform) under her supervision. I would also like to thank Mr. Sanjay Rathee (Ph.D. Scholar, IIT Mandi) for helping me in this project.

Shivam Satija

Abstract

Next generation sequencing (NGS) technologies generate such large volumes of genetic data that conventional single-processor sequence alignment tools can no longer keep pace with them. Cloud computing and MapReduce frameworks, which use thousands of commodity machines to store and process huge datasets, have therefore emerged as a leading solution to this problem of growing data. In this project, we propose SparkBurst, a MapReduce based sequence alignment technique implemented on Apache Spark. It is a reference genome indexing tool: it generates an index (such as a suffix array or a BWT-FM index) for the reference genome by running the computation on a number of commodity machines in parallel. Reads are then distributed over the machines, which use the index to locate each read in the reference genome in parallel.

Keywords: DNA Sequence, BWT, Spark, String Matching, MapReduce

Table of Contents

Acknowledgement
Abstract
1. Introduction
2. Objectives
3. Background and Related Work
4. Methodology
5. Results
6. Discussion and conclusions of results
7. Deliverables
8. Timeline
References

Introduction

Genome sequencing is the process of determining the precise order of nucleotides within a DNA molecule. It includes any method or technology that is used to determine the order of the four bases adenine, guanine, cytosine, and thymine in a strand of DNA. An important goal of genomics is to determine the index (position) of a particular sequence in the reference genome. This is a string matching problem, for which various algorithms can be used.

Computer technology is increasingly used to manage biological information (bioinformatics). Computers are used to collect, store, analyze and integrate genetic and biological information for genotyping, metagenomics, SNP (single nucleotide polymorphism) discovery and personal genomics. The rapid development of next generation sequencing technologies has dramatically reduced the time and cost of DNA sequencing and, at the same time, dramatically increased the size of the genetic data these machines produce.

There are existing single machine sequence alignment tools such as BLAST, SOAP [6], RMAP [1] and MAQ [2]. However, next generation machines produce billions of short DNA sequences (reads) in a few days, and the size of sequence data is projected to keep increasing dramatically. Single machine alignment tools will not be able to handle such huge datasets, so highly distributed computing will be required.

In recent years, a parallel computing framework called MapReduce, which can use thousands of commodity machines for distributed computing, has emerged. Many MapReduce based platforms such as Hadoop [14], Apache Spark [15] and Apache Flink [16] are available. These platforms provide a highly parallel distributed computing environment that uses thousands of commodity machines to store and analyze large datasets quickly and efficiently.

Data generated by next-generation sequencing machines can be analyzed efficiently on these platforms. Some initiatives that use platforms like Hadoop for sequence alignment have already been taken, such as CloudBurst [5], CloudAligner [13] and BlastReduce, and their results were effective and promising. Most of these Hadoop based sequence alignment algorithms are based on the RMAP and BLAST sequence alignment techniques. These techniques rely on hashing, which consumes a large chunk of memory: a hash table is built for the reads or for the reference genome and is then matched against the genome or the reads, respectively, to find the locations of reads in the reference genome.

In recent years, however, sequence alignment tools such as Bowtie [12] and BWA [11], which are based on the Burrows-Wheeler Transform (BWT), have become highly popular because of their higher memory efficiency and their support for flexible read lengths. Creating the index of the reference genome is the most time consuming part of BWT based alignment tools. Therefore, we propose SparkBurst, a MapReduce based alignment tool implemented on Apache Spark, which uses parallel distributed computing both to generate the index and to find reads in the reference genome using that index. Spark can be up to x [15] faster than Hadoop for many large scale data analysis problems by exploiting its in-memory computing capabilities.

SparkBurst is a reference genome indexing tool. An index such as a suffix array or FM-index is generated for the reference genome by running hundreds of commodity machines in parallel. Reads are then distributed over the machines, which use the index to locate each read in the reference genome in parallel.

Objectives

The main objective of the project is to design and implement the most widely used BWT based sequence alignment technique efficiently, by parallelizing it through the MapReduce framework, so that reads of various lengths can be aligned to a reference genome, and to gain further performance by exploiting the capabilities of the Spark platform.

Background and Related Work

Next generation sequencing machines typically produce million bp reads on a single run of the machine. For existing sequence alignment techniques, mapping this large volume of data to a genome such as the human genome is a great challenge. Many new sequence alignment programs have been developed in the last decade to tackle the problem of accurate and efficient read mapping for such huge datasets.

Techniques like Eland (Cox, 2007, unpublished material), SHRiMP (cs.toronto.edu/shrimp), RMAP [1], MAQ [2], ZOOM [3], SeqMap [4], and CloudBurst [5] hash the read sequences and scan through the reference genome. Their drawback is the overhead of scanning the whole reference genome even when only a few reads are aligned. A second category of techniques, such as SOAPv1 [6], PASS [7], MOM [8] and ProbeMatch [9], hash the genome instead. These techniques can be parallelized easily, but they require a large volume of storage for the genome index.

Recently, string matching algorithms based on the BWT [10] have drawn the attention of many research groups. Tools like Bowtie [12] and BWA [11], which are based on the Burrows-Wheeler Transform, have become very popular because of their superior memory efficiency and their support for flexible seed lengths. These BWT-based sequence alignment tools map short reads of DNA sequences against a reference genome sequence quickly and with a small memory footprint, using a data structure such as the FM-index built on top of the BWT. Tools like Bowtie and BWA are very efficient as long as the reads and the reference genome are small.
But with the evolution of next-generation sequencing (NGS) machines, the size of the data has grown beyond the capabilities of single machine alignment techniques. We therefore need alignment techniques that run in parallel on a cluster of many machines to produce results faster and more efficiently. Many cluster based alignment techniques have been proposed in recent years. Michael C. Schatz proposed CloudBurst [5], a highly sensitive read mapping algorithm that uses the Hadoop implementation of MapReduce to parallelize the task over multiple machines. More recently, Tung Nguyen et al. proposed a cloud based sequence alignment algorithm called CloudAligner [13]. CloudAligner gains performance over CloudBurst through better partitioning and parallel processing of both the reference genome and the reads. It also has a web-based interface, which makes it more user friendly.

Methodology

Suffix Index Binary Search (Approach 1)

SparkBurst is a MapReduce based sequence alignment tool implemented on Apache Spark. There are a few existing MapReduce based alignment tools, such as CloudBurst and CloudAligner. Both have shown a great performance gain over single machine alignment tools, and CloudAligner outperforms CloudBurst in running time in most cases. CloudAligner takes the reads and the reference genome as input and splits the reads over different mappers, with a read as the key and the whole reference genome as the value. Every mapper finds the location of its key (read) in the value (reference genome) and produces a <key, value> pair (<read, GenomicRegion>) as the result. These alignment tools concentrate on splitting the reads over mappers while keeping the reference genome the same, so every mapper has to process the whole genome to find a read.

The Suffix Index Binary Search approach, in contrast, creates an index (a suffix array) for the reference genome so that every mapper needs far fewer computations to search for a read in the reference genome. It has two MapReduce phases.

In the first phase, mappers take the reference genome as input from HDFS and generate suffixes of length k as keys and their locations in the reference genome as values. A shuffle task then sorts these keys, and the reducers combine the outputs of the shuffle and generate partitions of values (suffix locations in the reference genome) sorted according to their keys (suffixes). The output of the first phase is therefore a set of partitions of the suffix array, where each partition holds the sorted suffixes that start with a particular character. For example, for the human reference genome, 5 partitions can be generated, holding all sorted suffixes starting with the characters A, C, G, T and N respectively, so that during the second phase every read has to be searched in only about 1/5th of the reference genome index.

In the second phase, every mapper takes some reads as input, searches every read in the reference genome index partition selected by the read's starting character, and generates a <key, value> pair where the key is the read and the value is the location of the read in the reference genome.

[Diagram: MapReduce Architecture for Phase I]

    Foreach Line T in ReferenceGenome
        flatmap(line offset, T)
            Foreach char C in T
                Yield(C, Index)
            End Foreach
        End flatmap
        Store at RDD1
    End Foreach
    RDD2 = RDD1.sortByValue
    Foreach entry in RDD1
        flatmap(C, Index)
            Find string S starting at location Index using RDD2
            Yield(S, Index)
        End flatmap
        Store in RDD3
    End Foreach
    Final_Index = RDD3.sortByKey.Values()

Figure 1. 1st Algorithm for Phase I.

    Foreach Read R in ReadData
        flatmap(line offset, R)
            Find R in RG using Final_Index with Binary Search
            Yield(R, Index)
        End flatmap
    End Foreach
    CombineAllResults

Figure 2. 1st Algorithm for Phase II.
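The two phases of Approach 1 can be illustrated with a small single-machine sketch. It is not the report's Spark implementation: it is plain Python without RDDs, it sorts whole suffixes rather than suffixes of length k, and the function names build_partitioned_suffix_index and find_read are hypothetical names chosen for this illustration. It only shows the idea of partitioning the sorted suffix positions by first character and then binary searching a read inside a single partition.

```python
from collections import defaultdict


def build_partitioned_suffix_index(genome):
    """Phase I (sketch): group suffix start positions by their first
    character, then sort each group by the suffix it points to."""
    partitions = defaultdict(list)
    for pos, ch in enumerate(genome):
        partitions[ch].append(pos)
    for positions in partitions.values():
        # Comparing full suffixes is only reasonable for a toy genome.
        positions.sort(key=lambda p: genome[p:])
    return dict(partitions)


def find_read(genome, index, read):
    """Phase II (sketch): binary search for one read inside the partition
    selected by the read's first character. Returns sorted positions."""
    if not read:
        return []
    positions = index.get(read[0], [])
    lo, hi = 0, len(positions)
    # Lower bound: first suffix whose prefix of len(read) is >= read.
    while lo < hi:
        mid = (lo + hi) // 2
        start = positions[mid]
        if genome[start:start + len(read)] < read:
            lo = mid + 1
        else:
            hi = mid
    # All matching suffixes are adjacent in the sorted partition.
    hits = []
    while lo < len(positions) and genome.startswith(read, positions[lo]):
        hits.append(positions[lo])
        lo += 1
    return sorted(hits)


genome = "GATTACAGATTACA"
index = build_partitioned_suffix_index(genome)
print(find_read(genome, index, "ATTA"))
```

In this toy example the read ATTA is only compared against suffixes in the A partition, which mirrors the report's point that each read is searched in roughly one fifth of the index.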

[Diagram: MapReduce Architecture for Phase II]

SparkBurst BWT-FM (Approach 2)

After completing the first, binary search based algorithm, we moved on to a BWT-based algorithm. This BWT-FM based algorithm is built on top of the previous one. It builds the BWT from the last characters of the sorted suffixes, that is, from the character that precedes each suffix in the reference genome. After this transformation, the counts of all characters are calculated and stored in an array, Count(C), which records for each character how many characters in the genome are lexicographically smaller than it. Then, using the BWT, the number of occurrences of every character up to each index is counted and stored in Occurrence(O), for example the count of G in the BWT up to the 30th index.

    Foreach Line T in ReferenceGenome
        flatmap(line offset, T)
            Foreach char C in T
                Yield(C, Index)
            End Foreach
        End flatmap
        Store at RDD1
    End Foreach
    RDD2 = RDD1.sortByValue
    Foreach entry in RDD1
        flatmap(C, Index)
            Find string S starting at location Index using RDD2
            Yield(S, Index)
        End flatmap
        Store in RDD3
    End Foreach
    Final_Index = RDD3.sortByKey.Values()
    Make Occurrence(O) array
    Calculate Count(C) array

Figure 3. 2nd Algorithm for Phase I.

A query is searched in the BWT-FM based algorithm as follows. The search starts from the last character of the query and calculates a range (low, high) in the index. The range is then recalculated for each preceding character of the query from the previously calculated low and high, and the process continues while low is less than high and the first character of the query has not yet been processed.

    Foreach Read R in ReadData
        flatmap(line offset, R)
            Find R in RG using Final_Index with ExactMatch Algorithm
            Yield(R, Index)
        End flatmap
    End Foreach
    CombineAllResults

Figure 4. 2nd Algorithm for Phase II.
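The index construction and the backward search described above can be sketched on a single machine as follows. This is an illustrative sketch in plain Python, not the report's Spark code: the names build_fm_index and exact_match are hypothetical, the sentinel character "$" is an assumption (it must not occur in the genome and must sort before every base), and the Occurrence table is stored in full rather than sampled, which would be too large for a real genome.

```python
def build_fm_index(genome):
    """Build suffix array, Count(C) and Occurrence(O) for a toy genome."""
    text = genome + "$"  # assumed sentinel, smaller than A, C, G, T, N
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps to "$").
    bwt = "".join(text[i - 1] for i in suffix_array)
    alphabet = sorted(set(text))
    # count[c] = number of characters in text strictly smaller than c.
    count, total = {}, 0
    for c in alphabet:
        count[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return suffix_array, count, occ


def exact_match(read, suffix_array, count, occ):
    """Backward search: process the read from its last character to its
    first, narrowing the half-open range [low, high) of sorted suffixes."""
    if not read:
        return []
    low, high = 0, len(suffix_array)
    for ch in reversed(read):
        if ch not in count:
            return []
        low = count[ch] + occ[ch][low]
        high = count[ch] + occ[ch][high]
        if low >= high:
            return []  # range is empty, so the read does not occur
    return sorted(suffix_array[low:high])


sa, count, occ = build_fm_index("GATTACAGATTACA")
print(exact_match("ATTA", sa, count, occ))
```

Each step of the loop performs only two table lookups and never compares the read against the genome text, which is the property the report relies on when it contrasts this approach with suffix array binary search.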

Results

The following datasets were used for evaluation. The 100k.fa dataset was taken from …, AML and BRL were provided by Agilent Technologies, and g_cat_set was taken from bioinformatics.bc.edu.

[Table: Name of Dataset, BP length, No. of Queries, for the datasets 100k.fa, AML, BRL and g_cat_set]

The results in the table below are from Suffix Index Binary Search (Approach 1).

    Reference Genome    Query Genome    Time Taken (in secs)
    s_suis              100k.fa         100
    CHR-21              AML Text        420
    CHR-21              BRL Text        480
    CHR-22              AML Text        433
    CHR-22              BRL Text        527

The results of the SparkBurst algorithm, which uses the BWT based FM-index, are presented below.

[Graph 1: Running Time for SparkBurst Algorithm. Y-axis: Time Taken. X-axis: Query Genome, Reference Genome (100k.fa, s_suis; AML Text, CH21; BRL Text, CH21; g_cat_set, CH21)]

As we can see from the graph, the time taken by SparkBurst increases as the number of queries increases. We saw a dip for the last dataset because that query dataset is from a cat and we compared it with a human reference, so fewer reads matched.

A comparison of SparkBurst with Suffix Index Binary Search is given below.

[Graph 2: Comparison of SparkBurst (SP) vs Suffix Index Binary Search (SIBS) for 100k.fa, s_suis; AML Text, CH21; BRL Text, CH21]

A comparison of SparkBurst with CloudBurst is given below.

[Graph 3: Running Time Comparison, SparkBurst (SP) v/s CloudBurst (CB). Y-axis: Time Taken. X-axis: Query Genome, Reference Genome, Algorithm (reference genomes s_suis and CH21)]

Discussion and Conclusions

Several results can be discussed:

The BWT-FM based algorithm had previously been implemented on a single machine; SparkBurst parallelizes this approach. As a result, the time needed to locate the index of a query in the reference genome decreases significantly.

As the number of nodes increases, computation time decreases, but as the base pair length increases, computation time increases. As shown in Graph 1, even a length of 400 bp does not take much time. The algorithm also works for variable lengths, even for a 400 bp long query.

The second approach performs better than the first (Graph 2) because suffix array binary search compares the suffix with the query at every step, whereas SparkBurst (BWT-FM) only performs lookups. SparkBurst is built on top of the first algorithm, and this saves a lot of time.

From the comparison graph (Graph 3), SparkBurst appears to perform much better than CloudBurst. SparkBurst uses a BWT-FM based method that does not compare the query again and again but only performs some lookups, which reduces the time significantly, whereas CloudBurst compares the seeds and then extends them, which takes a lot of time. Spark also has the added advantage of in-memory computation.

Apart from this, we also looked into an inexact search based implementation. Insertion and deletion cases are difficult to handle: if we consider all possible cases (i.e. Match, Mismatch, Insertion, Deletion), in total we would have 9 cases, which would enlarge the search space and be difficult to handle memory-efficiently. To address this problem we explored different techniques, which are explained below. There can be multiple approaches. One of them is to find all the exact matches first and separate out the inexact matches.
Then work on the inexact matches, discard those that show a high degree of inexactness, and process the remaining ones. These remaining inexact sequences can be searched using a seed and extend algorithm and the Smith-Waterman dynamic programming algorithm.

Deliverables

An algorithm which can find the index of the query in the reference genome. SparkBurst will also have the following advantages:

Variable Length Support
SparkBurst can find the index of a query of any length rather than of one fixed length, which gives it an added advantage.

Improved Load Balancing
SparkBurst has an edge over recent alignment tools in load balancing. During the first phase of reference genome indexing, the reference genome is split into lines and distributed to the mappers line by line. In earlier genome indexing techniques on MapReduce, the reference data is split into equal partitions depending on the number of machines, so the total time to index the reference is the time taken by the slowest machine in the cluster. In SparkBurst, we distribute the data line by line: after giving n lines to n machines, the (n+1)th line is given to whichever machine is free. Faster machines therefore get more lines and slower machines get fewer, which optimizes the execution time.

Fault-tolerant
Earlier reference genome indexing techniques split the data into as many equal partitions as there are machines available, so every machine holds 1/n th of the reference genome, where n is the number of machines available. Since tools like Hadoop and Spark run on cheap commodity hardware, the chance of a machine crashing is high. When such a crash happens, the whole partition allocated to the crashed machine has to be re-computed. SparkBurst tackles this problem efficiently: it distributes data to machines line by line instead of in partitions, so the crash of a machine causes the re-computation of a single line only, which does not cost much computation time. Thus, SparkBurst will be more fault-tolerant than existing genome indexing techniques.

Timeline

    Task                                            Start date    Duration (days)
    Literature reading                              24/12/15      40
    Applied the simple sorting technique            15/02/16      30
    Applying the BWT for Exact Matching             20/03/16      30
    Explore the techniques for In-Exact Matching    24/04/16      15
    Analyse the results                             15/05/16      8

[Chart: Timeline Before and During 8th Semester, with a date axis from 15/12/15 to 13/05/16, covering the five tasks listed above]

We planned to work on the ExactMatch technique and came up with the BWT-FM based algorithm to solve the problem. After that, we compared its results with the CloudBurst algorithm (already implemented on a MapReduce platform). Apart from that, we had also planned to explore inexact techniques for the inexact matching problem and to present different ways to solve it.

References

[1] Andrew D. Smith, Zhenyu Xuan, and Michael Q. Zhang. Using quality scores and longer reads improves accuracy of Solexa read mapping. BMC Bioinformatics, 2008.
[2] Li, H. et al. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res., 2008.
[3] Lin, H. et al. ZOOM! Zillions of oligos mapped. Bioinformatics, 2008.
[4] Jiang, H., Wong, W. H. SeqMap: mapping massive amount of oligonucleotides to the genome. Bioinformatics, 2008.
[5] Michael C. Schatz. CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics, 2009.
[6] Li, R. et al. SOAP: short oligonucleotide alignment program. Bioinformatics, 2008.
[7] Campagna et al. PASS: a program to align short sequences. Bioinformatics, 2009.
[8] Eaves, H. L., Gao, Y. MOM: maximum oligonucleotide mapping. Bioinformatics, 2009.
[9] Jung Kim. ProbeMatch: rapid alignment of oligonucleotides to genome allowing both gaps and mismatches. Bioinformatics, 2009.
[10] Burrows, M., Wheeler, D. J. A block-sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, Palo Alto, CA, 1994.
[11] Heng Li, Richard Durbin. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 2009.
[12] Langmead, B. et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 2009.
[13] Tung Nguyen. CloudAligner: A fast and full-featured MapReduce based tool for sequence mapping. BMC Research Notes, 2011.
[14] Hadoop Map/Reduce tutorial. common/docs/r0.20.0/mapred tutorial.html.
[15] Apache Spark Map/Reduce tutorial.
[16] Apache Flink.
