Efficient Alignment of Next Generation Sequencing Data Using MapReduce on the Cloud


2012 Cairo International Biomedical Engineering Conference (CIBEC), Cairo, Egypt, December 20-21, 2012. Efficient Alignment of Next Generation Sequencing Data Using MapReduce on the Cloud. Rawan AlSaad and Qutaibah Malluhi, Department of Computer Science and Engineering, Qatar University, Doha, Qatar; Mohamed Abouelhoda, Cairo University, Giza, Egypt and Nile University, Giza, Egypt. Abstract This paper presents a methodology for running NGS read mapping tools in the cloud environment based on the MapReduce programming paradigm. As a demonstration, the recently developed and robust sequence alignment tool BFAST is used within our methodology to handle massive datasets. The results of our experiments show that transforming existing read mapping tools to run within the MapReduce framework dramatically reduces the total execution time and enables the user to utilize the resources provided by the cloud. Index Terms Cloud computing, MapReduce, bioinformatics, sequence alignment. I. INTRODUCTION Rapidly evolving Next Generation Sequencing (NGS) technologies produce data on an unparalleled scale. A central challenge in the analysis of these data is sequence alignment, whereby sequence reads must be mapped (aligned) to a reference genome. The data sizes produced by next-generation sequencing machines imply that parallelism is essential to process the DNA sequences in a timely fashion. The problem of mapping NGS reads to a reference genome is naturally data parallel, as the reads can be processed independently. This problem structure suggests the use of the MapReduce paradigm together with its open-source implementation Hadoop, which is well suited to such data-intensive jobs. 
The introduction of Hadoop implementations on cloud computing platforms added another interesting advantage: users can scale their infrastructure (in terms of the number and type of machines) according to their needs and can use the services on a pay-as-you-go basis. Currently, Amazon AWS directly supports the creation of Hadoop-based clusters at special prices, but users of private cloud platforms and other commercial ones, like Microsoft's, can readily install Hadoop and use it. A handful of Hadoop-based projects for read mapping in the cloud have been launched in this area: these include CloudBurst [1], Crossbow [2], CloudAligner [3], and the work of [4]. Yet most of these projects rebuild the alignment tools from scratch to fit the cloud paradigm, which is in general a very difficult problem to tackle. Also, it is not possible to replace the alignment (read mapping) module in these tools with more advanced ones without recoding the alignment tools. In this paper, we present a methodology for easy and efficient transformation of existing NGS mapping tools into the cloud environment based on the MapReduce programming model [5], [6], without the need to recode these tools. The essence of this methodology is to use the MapReduce paradigm to partition the sequence alignment problem into a large number of sub-problems which can run independently in parallel, with minimal inter-task communication. Critical to the function and performance of our methodology is a scheme that allows the cloudified NGS alignment tool to run as a black box within the MapReduce model, without the need to build new parallel algorithms or recode the tools from scratch. By separating specific sequence analysis calculations from common data management infrastructure, tools can benefit from ongoing improvements to the cloud computing paradigm, the MapReduce programming model, and DNA sequence alignment tools. 
In this context, the paper makes the following contributions:
- Develop a methodology with techniques and mechanisms to ease porting existing NGS alignment tools into the cloud environment.
- Implement the MapReduce model in a way that enables a parallelized run of the cloudified NGS alignment tools as a black box within the MapReduce framework, with little or no change to the original code of these tools.
- Demonstrate the utility of the proposed methodology by transforming two recently developed and commonly used NGS alignment tools, BFAST [7] and SHRiMP [8], to the MapReduce model on the cloud.
- Demonstrate the efficiency and scalability of the proposed methodology on large problem sizes.
The rest of the paper is organized as follows: Section 2 provides basic background on the read mapping problem, MapReduce, and Hadoop. Section 3 describes the approach used to conduct this study, demonstrated with the BFAST tool. Section 4 presents the experiments and results used to evaluate the efficiency and scalability of the proposed methodology, and Section 5 concludes the paper.

II. BACKGROUND Sequence Alignment Software Tools: NGS data is composed of short segments of sequenced DNA. Each segment is called a read, and its length varies according to the sequencing technology used, ranging from tens to hundreds of base pairs (characters). With recent NGS technologies, one genome sequencing project produces millions of reads, with dataset sizes in the range of billions of base pairs. The NGS read mapping problem is to align (map) a set of NGS reads to a reference genomic sequence. The goal of any software tool solving this problem is to find the location of each read in the reference sequence (if it exists), where the alignment of the read and the respective subsequence at this location has the minimum number of edit operations (replacements, insertions, and deletions) or an alignment score exceeding a certain threshold. A wide variety of alignment software tools have been developed over the past few years to solve the NGS read mapping problem. These include, among others, BFAST [7], SHRiMP [8], Bowtie [9], BWA [10], and SOAP2 [11]. In this research, we have selected the tool BFAST to demonstrate the proposed methodology. This tool is open-source software designed for aligning giga-scale short-read sets with comparable or better speed than existing methods, while maintaining higher sensitivity and accuracy for deletions/insertions. The underlying method of BFAST is based on creating flexible and efficient whole-genome indexes to rapidly map reads to candidate alignment locations, with arbitrarily many independent indexes allowed to achieve robustness against read errors and sequence variants. The final local alignment uses the Smith-Waterman algorithm [12] with gaps to support the detection of regions of similarity and dissimilarity. More details about BFAST are given later in this paper. 
MapReduce: MapReduce is a parallel programming model over a computer cluster composed of a number of computing nodes for processing large datasets. The power of the MapReduce computational paradigm is that it can intelligently distribute computations across a cluster with hundreds or thousands of computers, each analyzing a portion of the dataset stored locally on the compute node. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This programming model allows the application developer to focus on what computations need to be performed, as opposed to how those computations are actually carried out or how to get the data to the processes that depend on it. The core concept of the MapReduce framework is that the input is split into logical chunks, and each chunk may be initially processed independently by a map task. During the map process, the input file(s) are split into many smaller pieces, and each piece is sent to a node in the cluster to be processed. The map process creates <key, value> pairs and passes them to a mapper task. The mapper task is a piece of code defined by the software developer based on the application logic, and specifies what should be done with the <key, value> pairs. The reduce task, also developed by the software developer, collects the solutions from each of the nodes in the cluster and combines them into one file based on the defined reduction criteria. The power of MapReduce is that the map and reduce functions are executed in parallel over potentially hundreds or thousands of processors with minimal effort by the application developer. 
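The <key, value> flow described above can be illustrated with a toy, single-process simulation. This is purely illustrative and not part of the paper's implementation; a real Hadoop job distributes these same steps across cluster nodes:

```python
from collections import defaultdict

def map_task(chunk):
    # Emit <key, value> pairs for one input chunk (here: word counting).
    return [(word, 1) for word in chunk.split()]

def shuffle(pairs):
    # Group all values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_task(key, values):
    # Combine the values for one key into the final result.
    return key, sum(values)

chunks = ["ACGT ACGT TTAG", "TTAG ACGT"]   # toy stand-ins for input splits
pairs = [p for c in chunks for p in map_task(c)]
result = dict(reduce_task(k, v) for k, v in shuffle(pairs).items())
print(result)   # {'ACGT': 3, 'TTAG': 2}
```

In a real deployment the map tasks run on the nodes holding each split, and the shuffle is performed by the framework over the network.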
Hadoop: Hadoop is a popular open-source implementation of the MapReduce model written in Java for cross-platform portability [13]. A key component of Hadoop is the Hadoop Distributed File System (HDFS), which enables efficient management of data files and their sharing among the nodes of the computer cluster. Amazon provides the Elastic MapReduce (EMR) product, which is essentially a computer cluster with Hadoop installed on it. III. METHODOLOGY The methodology is designed to ease the cloudification of existing NGS alignment tools to run in the cloud environment using the MapReduce programming paradigm. Before describing the MapReduce solution, we briefly introduce the main steps of the BFAST tool, which are likely to be very similar to the steps used by most sequence alignment software tools in the field. The sequence alignment process performed by BFAST is divided into four main steps. The first step is to create the indexes for the reference genomes; the number and layout of these indexes are determined by the user's speed and accuracy requirements. The second step is to find Candidate Alignment Locations (CALs) for each read. The expected number of CALs returned is a function of the number of indexes and layouts chosen in the first step, as well as the number of offsets. The third step is to fully align each CAL for each read. The fourth and final step is to filter and prioritize the final alignments. The user specifies criteria to select the correct alignment for each read; the criteria can be based on many factors, including uniqueness, score, or others. MapReduce divides the computations into two separate steps: map and reduce. In the map step, the larger sequence alignment problem is divided into many smaller independent sub-problems, which are fed to the map function where the sequence alignment tools are plugged in. 
The output of the map level, which is the partial alignment results, is then passed to the reduce function to merge the partial results and produce the final output. Calculations like finding the CALs, extending the CALs using local alignment algorithms, and prioritizing the alignments naturally operate at the map level of MapReduce, as they perform calculations on each chunk of the short-reads files independently. The setup used in our methodology to achieve the parallelized execution of the sequential BFAST program in the MapReduce framework is that the execution of the entire sequence of BFAST alignment stages for a subset of the short reads is assigned to each mapper. The reduce stage in our case simply passes its input to its output with no changes. However, in other bioinformatics applications, the reducer
might be further utilized to perform more complicated analysis on the data. The parallelization approach consists of segmenting the short-reads input into subsets of the same size and running multiple instances of the BFAST black box on each subset. The reference genome input is replicated on all the allocated cloud nodes, so that it is available to all instances of the BFAST program as part of the execution environment. Figure 1 illustrates this parallelization approach. The input short-reads sequence S is divided into subsets of equal size, S = {S1, S2, ..., Sn}, each of which is passed to one of the worker nodes. On each worker node, the indexes of the reference genome R = {R1, R2, ..., Rm} are resident. For the human genome, for example, we have 22 files representing the 22 chromosomes, and we have created five indexes for each of the chromosome files. In our implementation, the read files are split into blocks of 64 MB, which is equal to the HDFS default block size. The algorithm listed in Fig. 2 lays out the mapper function used in our methodology. The reference and read files are assigned automatically to each Mapper function by the Hadoop system. The MappingTool is a function that invokes the mapping program (like BFAST) with the respective arguments, input sequences, and indexes of the reference genome(s). There are two main steps which are executed before and after the mapper function is invoked: the pre-processing and post-processing steps. The pre-processing step includes two main phases: 1) a setup phase for preparing the reference genome, and 2) loading the short reads into the HDFS. In the setup phase, the reference genome files are uploaded to all the cluster nodes, where each node constructs its copy of the index, if it has not already been constructed.

Fig. 1. Parallelization of BFAST: node i runs BFAST(Si vs R1), BFAST(Si vs R2), ..., BFAST(Si vs Rm) for each subset Si of the input reads.

Fig. 2. Mapper function:
  Mapper() {
    for each reference file Ri in R {
      for each sequence read subset Si in S {
        MappingTool(Si, Ri, ARG)
      }
    }
  }

In general, all read mapping programs include a step for creating an index of the reference genome to speed up the computation. (The tools do not use the same indexing data structure, but different ones.) It is also worth mentioning that the indexes are created only once and used thereafter for all subsequent processing. In the second phase, the short-reads files are uploaded into the cloud through a connection to the HDFS. The pre-processing phase is handled by the user according to our tool manual. The post-processing step is used to prepare the final output of the mapping process. Typical post-processing work includes removing duplicates, prioritizing the alignment scores, and selecting the best alignments. There are two options for the placement of the post-processing step: it can exist either as an integrated part of the MappingTool black box, or as a separate module, independent of the MappingTool black box, at either the map or reduce phase. In our implementation, the post-processing step exists as part of the BFAST black box. The criteria for selecting the correct alignment for each read are specified by the user and can be based on many factors such as uniqueness or scoring functions. The mapping output is collected after all the computations are completed and stored into a location on HDFS. These are the main characteristics of our proposed methodology: A. Pushing Environment to Data One of the core principles in the MapReduce model is to push code to data. Therefore, map functions move to nodes that hold the data on which the map will work. 
Our methodology calls for mechanisms to push not only the code, but the entire execution environment, close to the data. The execution environment includes the code, the MappingTool black box, the reference genome data, and the temporary directories created to handle the MappingTool side-effect files. This is an important procedure in the era of massive datasets and open-source software. This approach works very well when the execution environment is smaller than the data one wishes to analyze, but it still has some limitations and incurs some overhead when the execution environment grows to be larger than the data. In our case, the approach works very well because the size of the reads data is much larger than the size of the execution environment. B. Cloudification without Recoding By using MapReduce, Hadoop, and HDFS, we were able to capitalize on the technical advantages conferred by MapReduce/Hadoop without having to recode our own sequence analysis algorithms and workflows, and without having to design our own solutions for job queuing, tracking, and maintenance. Although the BFAST tool is not implemented in Java, and the MapReduce and Hadoop frameworks are primarily designed for Java, we were able to run the C code of BFAST within the MapReduce/Hadoop Java framework using the Java Native Interface (JNI).
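The black-box scheme above — each mapper handing its input split to an unmodified external aligner and an identity reducer passing the partial results through — can be sketched as follows. This is a simplified illustration in Python using a subprocess call rather than the JNI mechanism the paper actually uses, and the `-f`/`-r` flags are hypothetical, not BFAST's real command-line interface:

```python
import subprocess
import tempfile

def run_mapping_tool(reads_chunk, ref_index_dir, tool_cmd):
    # Write this mapper's input split to a local temporary file, then
    # invoke the external aligner as a black box and capture its output.
    with tempfile.NamedTemporaryFile("w", suffix=".fastq", delete=False) as f:
        f.write(reads_chunk)
        reads_path = f.name
    cmd = [tool_cmd, "-f", ref_index_dir, "-r", reads_path]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout  # partial alignments, handed to the reducer

def reduce_task(partial_results):
    # Identity reducer, as in the paper: pass the partial results through.
    return "".join(partial_results)

# Demo with `echo` standing in for the aligner binary.
out = run_mapping_tool("@r1\nACGT\n+\nIIII\n", "ref_idx", "echo")
print(reduce_task([out]), end="")
```

The key point is that the aligner binary needs no modification: the wrapper only stages its input locally and collects its standard output.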

C. Data Pipelining across Methodology Modules Pipelining the input, intermediate, and output data between the different modules was a challenging task due to the abstraction level used in the MapReduce and Hadoop frameworks. The data pipelining in our methodology is managed using named pipes, which allow different processes to communicate with each other, and temporary directories created on each of the cluster nodes to locally store the intermediate MappingTool side-effect files. Only the partial alignment results are promoted to the reducer; other intermediate MappingTool side-effect files are discarded. D. Multiple Sources for the Mapper Input In typical MapReduce implementations, the input of the mapper is expected to reside within an input directory on the HDFS. However, given the complex stream of operations needed by the sequence alignment tools, the standard storage and flow of the mapper input is modified in our methodology to allow for more flexibility. This is done by allowing two different sources for the mapper input: the first is the standard HDFS input directory, and the second is an input directory integrated as part of the execution environment, which stores the pre-processed reference genome files together with their indexes. IV. EXPERIMENTS AND RESULTS We conducted a number of experiments to demonstrate the efficiency and scalability of our methodology using BFAST (version 0.6.4f) and SHRiMP (version 2.2.2). We used sets of publicly available Illumina sequencing reads from the 1000 Genomes Project (accession: SRX6833, Illumina paired-end sequencing of an African male individual) and the human genome publicly available at UCSC [14]. The experiments were performed on a 16-node Hadoop cluster testbed using the resources provided by the IBM Blue Cloud 1.6 installed at Qatar University. Each node includes one 2.8 GHz Intel processor, 4 GB RAM, and 8 GB local disk space. 
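The named-pipe pipelining of Section C can be sketched as follows. This is a POSIX-only illustration with made-up pipe names and alignment records; in the real system the writer is the MappingTool process and the reader is the Hadoop mapper wrapper:

```python
import os
import tempfile
import threading

# Create a named pipe (FIFO) through which one local process streams
# intermediate alignment records to another, without touching HDFS.
fifo_path = os.path.join(tempfile.mkdtemp(), "partial_alignments.fifo")
os.mkfifo(fifo_path)

def producer():
    # Stands in for the MappingTool writing its partial results.
    with open(fifo_path, "w") as pipe:
        pipe.write("read1\tchr22\t1500\n")
        pipe.write("read2\tchr22\t8200\n")

t = threading.Thread(target=producer)
t.start()

# Stands in for the wrapper that forwards the results to the reducer.
with open(fifo_path) as pipe:
    records = pipe.read().splitlines()
t.join()
print(records)   # ['read1\tchr22\t1500', 'read2\tchr22\t8200']
```

Because a FIFO lives in the local filesystem but holds no data on disk, it avoids writing the large intermediate side-effect files twice.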
The compute nodes were running Hadoop. In all the experiments, the time to build the reference genome indexes and load them into HDFS is excluded, since this task is done once and the indexes can be reused in subsequent experiments. The first experiment explores how well the MapReduced BFAST scales as the number of reads increases. In this experiment, all 16 nodes were used to align subsets of the reads to human chromosome 22. Fig. 3 shows the runtime of these tasks; each runtime is averaged over 3 runs. The results show that the execution time of the MapReduced BFAST scales linearly as the number of reads increases, as expected, since the reads are processed independently. The second experiment evaluates how well the MapReduced BFAST scales as the number of nodes increases for a fixed problem size. Fig. 4 shows the execution time of the MapReduced BFAST for mapping 5 Mbp of reads on cluster sizes ranging between 2 and 16 nodes. Each runtime value is averaged over the readings of 3 runs. The results show that as the number of nodes doubles, the execution time drops by almost 50%. The graph also shows that the overhead for Hadoop to schedule tasks to all of the compute nodes, process the read files, and map the first reads to the compute nodes is constant regardless of the total number of available nodes. The third experiment compares the run time of the parallel execution of the MapReduced versions of BFAST and SHRiMP using our methodology with the serial execution of the original implementations of these two tools on the same computing environment, for each tool separately. The first part of this experiment compares the run time of the parallel execution of the MapReduced BFAST on 8 nodes with the serial execution of the original BFAST. The run time curves in Fig. 5 show the results of this test, in which both implementations of BFAST are executed for mapping 5-30 Mbp of reads to human chromosome 22 (each runtime value is averaged over 3 runs). 
The second part of this experiment compares the run time of the parallel execution of the MapReduced SHRiMP on 8 nodes with the serial execution of the original SHRiMP. The run time curves in Fig. 6 show the results of this test, in which the two implementations of SHRiMP are executed for mapping 2-32 Kbp of reads to human chromosome 22 (each runtime value is averaged over 3 runs). The results of these two experiments show that the MapReduce implementations of both tools, BFAST and SHRiMP, outperform the original implementations with the same speedup of about 7x. The performance gain of the MapReduce implementation of these tools comes mainly from the partitioning and parallel processing of the huge reference genome as well as the reads.

Fig. 3. BFAST running time vs. number of bases (Mbp).
Fig. 4. BFAST running time vs. number of nodes.
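The reported scaling behaviour — runtime roughly halving as the node count doubles, plus a constant Hadoop scheduling overhead — is consistent with a simple model in which the parallel work divides across nodes while the startup cost does not. The numbers below are illustrative, not the paper's measurements:

```python
def modeled_runtime(total_work_min, nodes, overhead_min):
    # Parallel work divides across nodes; the scheduling/startup
    # overhead stays constant regardless of cluster size.
    return overhead_min + total_work_min / nodes

work, overhead = 320.0, 5.0   # illustrative values only
times = {n: modeled_runtime(work, n, overhead) for n in (2, 4, 8, 16)}
print(times)   # {2: 165.0, 4: 85.0, 8: 45.0, 16: 25.0}
```

As the constant overhead grows relative to the per-node work, the observed speedup falls short of ideal, which is why doubling the nodes cuts runtime by "almost" rather than exactly 50%.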

Fig. 5. MapReduced BFAST vs. original BFAST: running time vs. number of bases (Mbp).
Fig. 6. MapReduced SHRiMP vs. original SHRiMP: running time vs. number of bases (Kbp).

V. CONCLUSION We presented an approach for running read mapping tools within the MapReduce framework to efficiently process large datasets in parallel on a computer cluster. Our methodology is generic and can be used with any read mapping tool without recoding it. Furthermore, our approach can be ported to any cloud environment that accommodates standard Linux distributions running Hadoop. Thus, we expect that by utilizing an approach similar to ours, more and more applications will take advantage of cloud environments in the near future. We have evaluated this by applying the same approach to a different sequence alignment tool, SHRiMP. The parallelization of SHRiMP using our approach was a very easy task, as expected. Moreover, the experimental results show that this approach yields good scalability and performance enhancement.

REFERENCES
[1] M. Schatz, "CloudBurst: highly sensitive read mapping with MapReduce," Bioinformatics, vol. 25, no. 11, June 2009.
[2] H. Li and N. Homer, "A Survey of Sequence Alignment Algorithms for Next Generation Sequencing," Briefings in Bioinformatics, 2010.
[3] T. Nguyen, W. Shi, and D. Ruden, "CloudAligner: A fast and full-featured MapReduce based tool for sequence mapping," BMC Res Notes, 2011.
[4] A. Bateman and M. Wood, "Cloud computing," Bioinformatics, 2009.
[5] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," in Proceedings of the 6th Symposium on Operating System Design and Implementation (OSDI 2004), San Francisco, 2004.
[6] J. Ekanayake, S. Pallickara, and G. Fox, "MapReduce for Data Intensive Scientific Analyses," in Fourth IEEE International Conference on eScience, 2008.
[7] N. Homer, B. Merriman, S. Nelson, and C. Creighton, "BFAST: An Alignment Tool for Large Scale Genome Resequencing," PLoS ONE, vol. 4, no. 11, 2009.
[8] S. Rumble et al., "SHRiMP: accurate mapping of short color-space reads," PLoS Computational Biology, May 2009.
[9] B. Langmead, C. Trapnell, M. Pop, et al., "Ultrafast and memory-efficient alignment of short DNA sequences to the human genome," Genome Biol, 2009;10:R25.
[10] H. Li and R. Durbin, "Fast and accurate short read alignment with Burrows-Wheeler transform," Bioinformatics, 2009;25.
[11] R. Li, C. Yu, Y. Li, et al., "SOAP2: an improved ultrafast tool for short read alignment," Bioinformatics, 2009;25.
[12] T. Smith and M. Waterman, "Identification of Common Molecular Subsequences," Journal of Molecular Biology, 1981.
[13] J. Venner, Pro Hadoop. Apress, 2009.
[14] Human Reference Genome, FASTA sequence of each chromosome, UCSC (February 2009). [Online].


More information

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros Data Clustering on the Parallel Hadoop MapReduce Model Dimitrios Verraros Overview The purpose of this thesis is to implement and benchmark the performance of a parallel K- means clustering algorithm on

More information

Hadoop File System S L I D E S M O D I F I E D F R O M P R E S E N T A T I O N B Y B. R A M A M U R T H Y 11/15/2017

Hadoop File System S L I D E S M O D I F I E D F R O M P R E S E N T A T I O N B Y B. R A M A M U R T H Y 11/15/2017 Hadoop File System 1 S L I D E S M O D I F I E D F R O M P R E S E N T A T I O N B Y B. R A M A M U R T H Y Moving Computation is Cheaper than Moving Data Motivation: Big Data! What is BigData? - Google

More information

Mitigating Data Skew Using Map Reduce Application

Mitigating Data Skew Using Map Reduce Application Ms. Archana P.M Mitigating Data Skew Using Map Reduce Application Mr. Malathesh S.H 4 th sem, M.Tech (C.S.E) Associate Professor C.S.E Dept. M.S.E.C, V.T.U Bangalore, India archanaanil062@gmail.com M.S.E.C,

More information

Hadoop/MapReduce Computing Paradigm

Hadoop/MapReduce Computing Paradigm Hadoop/Reduce Computing Paradigm 1 Large-Scale Data Analytics Reduce computing paradigm (E.g., Hadoop) vs. Traditional database systems vs. Database Many enterprises are turning to Hadoop Especially applications

More information

Survey on MapReduce Scheduling Algorithms

Survey on MapReduce Scheduling Algorithms Survey on MapReduce Scheduling Algorithms Liya Thomas, Mtech Student, Department of CSE, SCTCE,TVM Syama R, Assistant Professor Department of CSE, SCTCE,TVM ABSTRACT MapReduce is a programming model used

More information

Research and Realization of AP Clustering Algorithm Based on Cloud Computing Yue Qiang1, a *, Hu Zhongyu2, b, Lei Xinhua1, c, Li Xiaoming3, d

Research and Realization of AP Clustering Algorithm Based on Cloud Computing Yue Qiang1, a *, Hu Zhongyu2, b, Lei Xinhua1, c, Li Xiaoming3, d 4th International Conference on Machinery, Materials and Computing Technology (ICMMCT 2016) Research and Realization of AP Clustering Algorithm Based on Cloud Computing Yue Qiang1, a *, Hu Zhongyu2, b,

More information

High Performance Computing on MapReduce Programming Framework

High Performance Computing on MapReduce Programming Framework International Journal of Private Cloud Computing Environment and Management Vol. 2, No. 1, (2015), pp. 27-32 http://dx.doi.org/10.21742/ijpccem.2015.2.1.04 High Performance Computing on MapReduce Programming

More information

called Hadoop Distribution file System (HDFS). HDFS is designed to run on clusters of commodity hardware and is capable of handling large files. A fil

called Hadoop Distribution file System (HDFS). HDFS is designed to run on clusters of commodity hardware and is capable of handling large files. A fil Parallel Genome-Wide Analysis With Central And Graphic Processing Units Muhamad Fitra Kacamarga mkacamarga@binus.edu James W. Baurley baurley@binus.edu Bens Pardamean bpardamean@binus.edu Abstract The

More information

A Robust Cloud-based Service Architecture for Multimedia Streaming Using Hadoop

A Robust Cloud-based Service Architecture for Multimedia Streaming Using Hadoop A Robust Cloud-based Service Architecture for Multimedia Streaming Using Hadoop Myoungjin Kim 1, Seungho Han 1, Jongjin Jung 3, Hanku Lee 1,2,*, Okkyung Choi 2 1 Department of Internet and Multimedia Engineering,

More information

Fast Efficient Clustering Algorithm for Balanced Data

Fast Efficient Clustering Algorithm for Balanced Data Vol. 5, No. 6, 214 Fast Efficient Clustering Algorithm for Balanced Data Adel A. Sewisy Faculty of Computer and Information, Assiut University M. H. Marghny Faculty of Computer and Information, Assiut

More information

Parallel data processing with MapReduce

Parallel data processing with MapReduce Parallel data processing with MapReduce Tomi Aarnio Helsinki University of Technology tomi.aarnio@hut.fi Abstract MapReduce is a parallel programming model and an associated implementation introduced by

More information

Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures*

Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures* Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures* Tharso Ferreira 1, Antonio Espinosa 1, Juan Carlos Moure 2 and Porfidio Hernández 2 Computer Architecture and Operating

More information

Harp-DAAL for High Performance Big Data Computing

Harp-DAAL for High Performance Big Data Computing Harp-DAAL for High Performance Big Data Computing Large-scale data analytics is revolutionizing many business and scientific domains. Easy-touse scalable parallel techniques are necessary to process big

More information

Introduction to Hadoop and MapReduce

Introduction to Hadoop and MapReduce Introduction to Hadoop and MapReduce Antonino Virgillito THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Large-scale Computation Traditional solutions for computing large

More information

A Fast and High Throughput SQL Query System for Big Data

A Fast and High Throughput SQL Query System for Big Data A Fast and High Throughput SQL Query System for Big Data Feng Zhu, Jie Liu, and Lijie Xu Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences, Beijing, China 100190

More information

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018 Cloud Computing and Hadoop Distributed File System UCSB CS70, Spring 08 Cluster Computing Motivations Large-scale data processing on clusters Scan 000 TB on node @ 00 MB/s = days Scan on 000-node cluster

More information

Research Article Cloud Computing for Protein-Ligand Binding Site Comparison

Research Article Cloud Computing for Protein-Ligand Binding Site Comparison Hindawi Publishing Corporation BioMed Research International Volume 213, Article ID 17356, 7 pages http://dx.doi.org/1.1155/213/17356 Research Article Cloud Computing for Protein-Ligand Binding Site Comparison

More information

SlopMap: a software application tool for quick and flexible identification of similar sequences using exact k-mer matching

SlopMap: a software application tool for quick and flexible identification of similar sequences using exact k-mer matching SlopMap: a software application tool for quick and flexible identification of similar sequences using exact k-mer matching Ilya Y. Zhbannikov 1, Samuel S. Hunter 1,2, Matthew L. Settles 1,2, and James

More information

Scalable RNA Sequencing on Clusters of Multicore Processors

Scalable RNA Sequencing on Clusters of Multicore Processors JOAQUÍN DOPAZO JOAQUÍN TARRAGA SERGIO BARRACHINA MARÍA ISABEL CASTILLO HÉCTOR MARTÍNEZ ENRIQUE S. QUINTANA ORTÍ IGNACIO MEDINA INTRODUCTION DNA Exon 0 Exon 1 Exon 2 Intron 0 Intron 1 Reads Sequencing RNA

More information

IBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics

IBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics IBM Data Science Experience White paper R Transforming R into a tool for big data analytics 2 R Executive summary This white paper introduces R, a package for the R statistical programming language that

More information

Parallel HITS Algorithm Implemented Using HADOOP GIRAPH Framework to resolve Big Data Problem

Parallel HITS Algorithm Implemented Using HADOOP GIRAPH Framework to resolve Big Data Problem I J C T A, 9(41) 2016, pp. 1235-1239 International Science Press Parallel HITS Algorithm Implemented Using HADOOP GIRAPH Framework to resolve Big Data Problem Hema Dubey *, Nilay Khare *, Alind Khare **

More information

Processing Technology of Massive Human Health Data Based on Hadoop

Processing Technology of Massive Human Health Data Based on Hadoop 6th International Conference on Machinery, Materials, Environment, Biotechnology and Computer (MMEBC 2016) Processing Technology of Massive Human Health Data Based on Hadoop Miao Liu1, a, Junsheng Yu1,

More information

Distributed Face Recognition Using Hadoop

Distributed Face Recognition Using Hadoop Distributed Face Recognition Using Hadoop A. Thorat, V. Malhotra, S. Narvekar and A. Joshi Dept. of Computer Engineering and IT College of Engineering, Pune {abhishekthorat02@gmail.com, vinayak.malhotra20@gmail.com,

More information

Distributed Filesystem

Distributed Filesystem Distributed Filesystem 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributing Code! Don t move data to workers move workers to the data! - Store data on the local disks of nodes in the

More information

CS / Cloud Computing. Recitation 3 September 9 th & 11 th, 2014

CS / Cloud Computing. Recitation 3 September 9 th & 11 th, 2014 CS15-319 / 15-619 Cloud Computing Recitation 3 September 9 th & 11 th, 2014 Overview Last Week s Reflection --Project 1.1, Quiz 1, Unit 1 This Week s Schedule --Unit2 (module 3 & 4), Project 1.2 Questions

More information

CLOUD-SCALE FILE SYSTEMS

CLOUD-SCALE FILE SYSTEMS Data Management in the Cloud CLOUD-SCALE FILE SYSTEMS 92 Google File System (GFS) Designing a file system for the Cloud design assumptions design choices Architecture GFS Master GFS Chunkservers GFS Clients

More information

AN EFFECTIVE DETECTION OF SATELLITE IMAGES VIA K-MEANS CLUSTERING ON HADOOP SYSTEM. Mengzhao Yang, Haibin Mei and Dongmei Huang

AN EFFECTIVE DETECTION OF SATELLITE IMAGES VIA K-MEANS CLUSTERING ON HADOOP SYSTEM. Mengzhao Yang, Haibin Mei and Dongmei Huang International Journal of Innovative Computing, Information and Control ICIC International c 2017 ISSN 1349-4198 Volume 13, Number 3, June 2017 pp. 1037 1046 AN EFFECTIVE DETECTION OF SATELLITE IMAGES VIA

More information

High-performance short sequence alignment with GPU acceleration

High-performance short sequence alignment with GPU acceleration Distrib Parallel Databases (2012) 30:385 399 DOI 10.1007/s10619-012-7099-x High-performance short sequence alignment with GPU acceleration Mian Lu Yuwei Tan Ge Bai Qiong Luo Published online: 10 August

More information

Programming model and implementation for processing and. Programs can be automatically parallelized and executed on a large cluster of machines

Programming model and implementation for processing and. Programs can be automatically parallelized and executed on a large cluster of machines A programming model in Cloud: MapReduce Programming model and implementation for processing and generating large data sets Users specify a map function to generate a set of intermediate key/value pairs

More information

MapReduce. U of Toronto, 2014

MapReduce. U of Toronto, 2014 MapReduce U of Toronto, 2014 http://www.google.org/flutrends/ca/ (2012) Average Searches Per Day: 5,134,000,000 2 Motivation Process lots of data Google processed about 24 petabytes of data per day in

More information

Cloud Computing & Visualization

Cloud Computing & Visualization Cloud Computing & Visualization Workflows Distributed Computation with Spark Data Warehousing with Redshift Visualization with Tableau #FIUSCIS School of Computing & Information Sciences, Florida International

More information

Indexing Strategies of MapReduce for Information Retrieval in Big Data

Indexing Strategies of MapReduce for Information Retrieval in Big Data International Journal of Advances in Computer Science and Technology (IJACST), Vol.5, No.3, Pages : 01-06 (2016) Indexing Strategies of MapReduce for Information Retrieval in Big Data Mazen Farid, Rohaya

More information

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP TITLE: Implement sort algorithm and run it using HADOOP PRE-REQUISITE Preliminary knowledge of clusters and overview of Hadoop and its basic functionality. THEORY 1. Introduction to Hadoop The Apache Hadoop

More information

Dept. Of Computer Science, Colorado State University

Dept. Of Computer Science, Colorado State University CS 455: INTRODUCTION TO DISTRIBUTED SYSTEMS [HADOOP/HDFS] Trying to have your cake and eat it too Each phase pines for tasks with locality and their numbers on a tether Alas within a phase, you get one,

More information

CLIENT DATA NODE NAME NODE

CLIENT DATA NODE NAME NODE Volume 6, Issue 12, December 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Efficiency

More information

The Fusion Distributed File System

The Fusion Distributed File System Slide 1 / 44 The Fusion Distributed File System Dongfang Zhao February 2015 Slide 2 / 44 Outline Introduction FusionFS System Architecture Metadata Management Data Movement Implementation Details Unique

More information

INDEX-BASED JOIN IN MAPREDUCE USING HADOOP MAPFILES

INDEX-BASED JOIN IN MAPREDUCE USING HADOOP MAPFILES Al-Badarneh et al. Special Issue Volume 2 Issue 1, pp. 200-213 Date of Publication: 19 th December, 2016 DOI-https://dx.doi.org/10.20319/mijst.2016.s21.200213 INDEX-BASED JOIN IN MAPREDUCE USING HADOOP

More information

Programming Models MapReduce

Programming Models MapReduce Programming Models MapReduce Majd Sakr, Garth Gibson, Greg Ganger, Raja Sambasivan 15-719/18-847b Advanced Cloud Computing Fall 2013 Sep 23, 2013 1 MapReduce In a Nutshell MapReduce incorporates two phases

More information

Data Analysis Using MapReduce in Hadoop Environment

Data Analysis Using MapReduce in Hadoop Environment Data Analysis Using MapReduce in Hadoop Environment Muhammad Khairul Rijal Muhammad*, Saiful Adli Ismail, Mohd Nazri Kama, Othman Mohd Yusop, Azri Azmi Advanced Informatics School (UTM AIS), Universiti

More information

Introduction to MapReduce

Introduction to MapReduce Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed

More information

Computational Architecture of Cloud Environments Michael Schatz. April 1, 2010 NHGRI Cloud Computing Workshop

Computational Architecture of Cloud Environments Michael Schatz. April 1, 2010 NHGRI Cloud Computing Workshop Computational Architecture of Cloud Environments Michael Schatz April 1, 2010 NHGRI Cloud Computing Workshop Cloud Architecture Computation Input Output Nebulous question: Cloud computing = Utility computing

More information

Part 1: Indexes for Big Data

Part 1: Indexes for Big Data JethroData Making Interactive BI for Big Data a Reality Technical White Paper This white paper explains how JethroData can help you achieve a truly interactive interactive response time for BI on big data,

More information

Map Reduce Group Meeting

Map Reduce Group Meeting Map Reduce Group Meeting Yasmine Badr 10/07/2014 A lot of material in this presenta0on has been adopted from the original MapReduce paper in OSDI 2004 What is Map Reduce? Programming paradigm/model for

More information

Embedded Technosolutions

Embedded Technosolutions Hadoop Big Data An Important technology in IT Sector Hadoop - Big Data Oerie 90% of the worlds data was generated in the last few years. Due to the advent of new technologies, devices, and communication

More information

Azure MapReduce. Thilina Gunarathne Salsa group, Indiana University

Azure MapReduce. Thilina Gunarathne Salsa group, Indiana University Azure MapReduce Thilina Gunarathne Salsa group, Indiana University Agenda Recap of Azure Cloud Services Recap of MapReduce Azure MapReduce Architecture Application development using AzureMR Pairwise distance

More information

Performance analysis of parallel de novo genome assembly in shared memory system

Performance analysis of parallel de novo genome assembly in shared memory system IOP Conference Series: Earth and Environmental Science PAPER OPEN ACCESS Performance analysis of parallel de novo genome assembly in shared memory system To cite this article: Syam Budi Iryanto et al 2018

More information

An Introduction to Big Data Formats

An Introduction to Big Data Formats Introduction to Big Data Formats 1 An Introduction to Big Data Formats Understanding Avro, Parquet, and ORC WHITE PAPER Introduction to Big Data Formats 2 TABLE OF TABLE OF CONTENTS CONTENTS INTRODUCTION

More information

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung HDFS: Hadoop Distributed File System CIS 612 Sunnie Chung What is Big Data?? Bulk Amount Unstructured Introduction Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per

More information

DryadLINQ for Scientific Analyses

DryadLINQ for Scientific Analyses DryadLINQ for Scientific Analyses Jaliya Ekanayake 1,a, Atilla Soner Balkir c, Thilina Gunarathne a, Geoffrey Fox a,b, Christophe Poulain d, Nelson Araujo d, Roger Barga d a School of Informatics and Computing,

More information

CPSC 426/526. Cloud Computing. Ennan Zhai. Computer Science Department Yale University

CPSC 426/526. Cloud Computing. Ennan Zhai. Computer Science Department Yale University CPSC 426/526 Cloud Computing Ennan Zhai Computer Science Department Yale University Recall: Lec-7 In the lec-7, I talked about: - P2P vs Enterprise control - Firewall - NATs - Software defined network

More information

Large Scale Computing Infrastructures

Large Scale Computing Infrastructures GC3: Grid Computing Competence Center Large Scale Computing Infrastructures Lecture 2: Cloud technologies Sergio Maffioletti GC3: Grid Computing Competence Center, University

More information

Nowadays data-intensive applications play a

Nowadays data-intensive applications play a Journal of Advances in Computer Engineering and Technology, 3(2) 2017 Data Replication-Based Scheduling in Cloud Computing Environment Bahareh Rahmati 1, Amir Masoud Rahmani 2 Received (2016-02-02) Accepted

More information

Evaluating Private Information Retrieval on the Cloud

Evaluating Private Information Retrieval on the Cloud Evaluating Private Information Retrieval on the Cloud Casey Devet University ofwaterloo cjdevet@cs.uwaterloo.ca Abstract The goal of Private Information Retrieval (PIR) is for a client to query a database

More information

Apache Spark and Hadoop Based Big Data Processing System for Clinical Research

Apache Spark and Hadoop Based Big Data Processing System for Clinical Research Apache Spark and Hadoop Based Big Data Processing System for Clinical Research Sreekanth Rallapalli 1,*, Gondkar R R 2 1 Research Scholar, R&D Centre, Bharathiyar University, Coimbatore, Tamilnadu, India.

More information

Clustering Lecture 8: MapReduce

Clustering Lecture 8: MapReduce Clustering Lecture 8: MapReduce Jing Gao SUNY Buffalo 1 Divide and Conquer Work Partition w 1 w 2 w 3 worker worker worker r 1 r 2 r 3 Result Combine 4 Distributed Grep Very big data Split data Split data

More information

Department of Computer Science San Marcos, TX Report Number TXSTATE-CS-TR Clustering in the Cloud. Xuan Wang

Department of Computer Science San Marcos, TX Report Number TXSTATE-CS-TR Clustering in the Cloud. Xuan Wang Department of Computer Science San Marcos, TX 78666 Report Number TXSTATE-CS-TR-2010-24 Clustering in the Cloud Xuan Wang 2010-05-05 !"#$%&'()*+()+%,&+!"-#. + /+!"#$%&'()*+0"*-'(%,1$+0.23%(-)+%-+42.--3+52367&.#8&+9'21&:-';

More information

Accelerating Parallel Analysis of Scientific Simulation Data via Zazen

Accelerating Parallel Analysis of Scientific Simulation Data via Zazen Accelerating Parallel Analysis of Scientific Simulation Data via Zazen Tiankai Tu, Charles A. Rendleman, Patrick J. Miller, Federico Sacerdoti, Ron O. Dror, and David E. Shaw D. E. Shaw Research Motivation

More information

/ Cloud Computing. Recitation 3 Sep 13 & 15, 2016

/ Cloud Computing. Recitation 3 Sep 13 & 15, 2016 15-319 / 15-619 Cloud Computing Recitation 3 Sep 13 & 15, 2016 1 Overview Administrative Issues Last Week s Reflection Project 1.1, OLI Unit 1, Quiz 1 This Week s Schedule Project1.2, OLI Unit 2, Module

More information

Mining Distributed Frequent Itemset with Hadoop

Mining Distributed Frequent Itemset with Hadoop Mining Distributed Frequent Itemset with Hadoop Ms. Poonam Modgi, PG student, Parul Institute of Technology, GTU. Prof. Dinesh Vaghela, Parul Institute of Technology, GTU. Abstract: In the current scenario

More information

A GPU Algorithm for Comparing Nucleotide Histograms

A GPU Algorithm for Comparing Nucleotide Histograms A GPU Algorithm for Comparing Nucleotide Histograms Adrienne Breland Harpreet Singh Omid Tutakhil Mike Needham Dickson Luong Grant Hennig Roger Hoang Torborn Loken Sergiu M. Dascalu Frederick C. Harris,

More information

Implementation of Aggregation of Map and Reduce Function for Performance Improvisation

Implementation of Aggregation of Map and Reduce Function for Performance Improvisation 2016 IJSRSET Volume 2 Issue 5 Print ISSN: 2395-1990 Online ISSN : 2394-4099 Themed Section: Engineering and Technology Implementation of Aggregation of Map and Reduce Function for Performance Improvisation

More information

Keywords Hadoop, Map Reduce, K-Means, Data Analysis, Storage, Clusters.

Keywords Hadoop, Map Reduce, K-Means, Data Analysis, Storage, Clusters. Volume 6, Issue 3, March 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue

More information

Integrating GPU-Accelerated Sequence Alignment and SNP Detection for Genome Resequencing Analysis

Integrating GPU-Accelerated Sequence Alignment and SNP Detection for Genome Resequencing Analysis Integrating GPU-Accelerated Sequence Alignment and SNP Detection for Genome Resequencing Analysis Mian Lu, Yuwei Tan, Jiuxin Zhao, Ge Bai, and Qiong Luo Hong Kong University of Science and Technology {lumian,ytan,zhaojx,gbai,luo}@cse.ust.hk

More information

Chapter 5. The MapReduce Programming Model and Implementation

Chapter 5. The MapReduce Programming Model and Implementation Chapter 5. The MapReduce Programming Model and Implementation - Traditional computing: data-to-computing (send data to computing) * Data stored in separate repository * Data brought into system for computing

More information

Google File System (GFS) and Hadoop Distributed File System (HDFS)

Google File System (GFS) and Hadoop Distributed File System (HDFS) Google File System (GFS) and Hadoop Distributed File System (HDFS) 1 Hadoop: Architectural Design Principles Linear scalability More nodes can do more work within the same time Linear on data size, linear

More information

Efficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488)

Efficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488) Efficiency Efficiency: Indexing (COSC 488) Nazli Goharian nazli@cs.georgetown.edu Difficult to analyze sequential IR algorithms: data and query dependency (query selectivity). O(q(cf max )) -- high estimate-

More information

Accelerating Genomic Sequence Alignment Workload with Scalable Vector Architecture

Accelerating Genomic Sequence Alignment Workload with Scalable Vector Architecture Accelerating Genomic Sequence Alignment Workload with Scalable Vector Architecture Dong-hyeon Park, Jon Beaumont, Trevor Mudge University of Michigan, Ann Arbor Genomics Past Weeks ~$3 billion Human Genome

More information

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015 Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL May 2015 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved. Notices This document

More information

MAIN DIFFERENCES BETWEEN MAP/REDUCE AND COLLECT/REPORT PARADIGMS. Krassimira Ivanova

MAIN DIFFERENCES BETWEEN MAP/REDUCE AND COLLECT/REPORT PARADIGMS. Krassimira Ivanova International Journal "Information Technologies & Knowledge" Volume 9, Number 4, 2015 303 MAIN DIFFERENCES BETWEEN MAP/REDUCE AND COLLECT/REPORT PARADIGMS Krassimira Ivanova Abstract: This article presents

More information

MOHA: Many-Task Computing Framework on Hadoop

MOHA: Many-Task Computing Framework on Hadoop Apache: Big Data North America 2017 @ Miami MOHA: Many-Task Computing Framework on Hadoop Soonwook Hwang Korea Institute of Science and Technology Information May 18, 2017 Table of Contents Introduction

More information

HADOOP FRAMEWORK FOR BIG DATA

HADOOP FRAMEWORK FOR BIG DATA HADOOP FRAMEWORK FOR BIG DATA Mr K. Srinivas Babu 1,Dr K. Rameshwaraiah 2 1 Research Scholar S V University, Tirupathi 2 Professor and Head NNRESGI, Hyderabad Abstract - Data has to be stored for further

More information

Data Mining of Genomic Data using Classification Algorithms on MapReduce Framework: A Survey

Data Mining of Genomic Data using Classification Algorithms on MapReduce Framework: A Survey Data Mining of Genomic Data using Classification Algorithms on MapReduce Framework: A Survey Rajarshi Banerjee 1, Ravi Kumar Jha 1, Aditya Neel 1, Rituparna Samaddar (Sinha) 1 and Anindya Jyoti Pal 1 1

More information

CS 61C: Great Ideas in Computer Architecture. MapReduce

CS 61C: Great Ideas in Computer Architecture. MapReduce CS 61C: Great Ideas in Computer Architecture MapReduce Guest Lecturer: Justin Hsia 3/06/2013 Spring 2013 Lecture #18 1 Review of Last Lecture Performance latency and throughput Warehouse Scale Computing

More information