Bioinformatics. Anatomy of a Hash-based Long Read Sequence Mapping Algorithm for Next Generation DNA Sequencing

Size: px

Start display at page:

Download "Bioinformatics. Anatomy of a Hash-based Long Read Sequence Mapping Algorithm for Next Generation DNA Sequencing"

Emory Heath
5 years ago
Views:

1 Bioinformatics Anatomy of a Hash-based Long Read Sequence Mapping Algorithm for Next Generation DNA Sequencing Journal: Bioinformatics Manuscript ID: BIOINF-0-0 Category: Original Paper Date Submitted by the Author: -Jul-0 Complete List of Authors: Misra, Sanchit; Northwestern University, Electrical Engineering and Computer Science Agrawal, Ankit; Northwestern University, Electrical Engineering and Computer Science Liao, Wei-keng; Northwestern University, Electrical Engineering and Computer Science Choudhary, Alok; Northwestern University, Electrical Engineering and Computer Science Keywords: Sequence alignment, Next-generation sequencing, Genome analysis, Computational biology, Bioinformatics, Algorithms

2 Page of Bioinformatics BIOINFORMATICS Anatomy of a Hash-based Long Read Sequence Mapping Algorithm for Next Generation DNA Sequencing Sanchit Misra, Ankit Agrawal, Wei-keng Liao and Alok Choudhary Department of Electrical Engineering and Computer Science, Northwestern University, Sheridan Rd, Evanston IL 00 Received on XXXXX; revised on XXXXX; accepted on XXXXX Associate Editor: XXXXXXX ABSTRACT Motivation: Recently a number of programs have been proposed for mapping short reads to a reference genome. Many of them are heavily optimized for short-read mapping and hence are very efficient for shorter queries, but that makes them inefficient or not applicable for reads longer than 00bp. However, many sequencers are already generating longer reads and more are expected to follow. For long read sequence mapping, there are limited options; BLAT, SSAHA, FANGS and BWA-SW are among the popular ones. However, resequencing and personalized medicine need much faster software to map these long sequencing reads to a reference genome to identify SNPs or rare transcripts. Results: We present AGILE (AliGnIng Long reads), a hash table based high throughput sequence mapping algorithm for longer reads that uses diagonal multiple seed-match criteria, customized q- gram filtering and a dynamic incremental search approach among other heuristics to optimize every step of the mapping process. In our experiments, we observe that AGILE is more accurate than BLAT, and comparable to BWA-SW and SSAHA. For practical error rates (< %) and read lengths (00 00bp), AGILE is significantly faster than BLAT, SSAHA, and BWA-SW. Even for the other cases, AGILE is comparable to BWA-SW and several times faster than BLAT and SSAHA. Availability: smi/agile.html Contact: smi@eecs.northwestern.edu INTRODUCTION Recent advances in Next Generation Sequencing (NGS) technology have led to affordable desktop-sized sequencers with low running costs and high throughput. These sequencers produce small fragments of the genome being sequenced as a result of the sequencing process. By mapping these small fragments (reads) to a reference genome, we can sequence the DNA of a new individual. The NGSs are making it possible for these studies to be conducted at a mass scale. These advances will usher an era of personal genomics when each individual can have his/her DNA sequenced and studied to develop more personalized ways of anticipating, diagnosing and treating diseases []. Studies of this nature have already begun []. James Lupski, a physician-scientist who suffers from a neurological disorder called to whom correspondence should be addressed Vol. 00 no Pages Charcot-Marie-Tooth has found the genetic cause of his disease by sequencing his entire genome (late 00). Another study, the first to describe the genomes of an entire family of four, confirmed the genetic root of a rare disease, called Miller syndrome, afflicting both children (March 0). These studies have been made possible by plunging costs and increasing speeds of high throughput sequencing. Next generation sequencers (NGSs) sequence the DNA by generating small substrings of the DNA called reads. With rapid improvements in sequencing technologies, the lengths of the reads is constantly increasing. For example, the lengths of reads increased from about 0bp in 00 to about 00bp in 00. The length of illumina reads increased from about 0-0bp in 00 to about 0bp in 00. Pacific Biosciences also announced 00bp long reads in 00. The rate of throughput as well as read lengths of these NGSs are increasing at a pace that puts even the Moore s law to shame. Hence there is a growing need for tools that can work for longer reads and can still match the pace of the NGSs. A number of tools have been developed for shorter illumina queries. These include MAQ [], ELAND, SOAP [], BowTie [], Mosaik [], SHRiMP []. However most of these tools work only for read lengths < 00. Also, they allow very few number of mismatches (usually < ) and many of them do not allow any gaps. However, as the lengths of reads are rapidly increasing, we need tools that can work for longer read lengths. Moreover, these new tools for longer reads should be able to handle a larger number of gaps and mismatches. To the best of our knowledge, the only other tools specifically designed to work for longer reads are BWA-SW [] and FANGS []. BWA-SW is a package based on Burrows- Wheeler Transform (BWT). It supports gapped global alignment with respect to queries and is one of the fastest long read alignment algorithms while also finding suboptimal matches. Hash tables have been used extensively for short read mapping and many other related problems [,,,,, ]. Hence, it may appear that we have exhausted all possible uses of hash tables for sequence mapping. However, we will find that with proper heuristics, hash tables can give excellent speedups for sequence mapping of longer reads as well. The general structure of any hash table based sequence mapping algorithm is as follows: i) Create a hash table index of the genome, ii) Use the index to find regions in the genome that can potentially be homologous, iii) Examine each region in more detail and output regions that are indeed homologous. The execution time of existing hash table-based algorithms is dominated by stage (iii). The processing time of this stage is directly proportional c Oxford University Press 00.

3 Bioinformatics Page of Misra et al Table. The value of threshold T for different values of error rate ɛ and read length (q = ) Read length Error rate ɛ % % % 0 0 % % to the number of regions found. FANGS is also a hash table based tool. It uses q-gram filtering and pigeon hole principle to filter out many of the regions to rapidly map reads with nearly 0% sensitivity []. However, FANGS is inefficient for error rates greater than %. Using techniques in addition to the ones used by FANGS to quickly filter out regions that are not homologous will greatly reduce the time taken. We follow this path. In this paper, we will discuss five different techniques to filter out non-homologous regions and their adaptations to the high throughput long read sequence mapping problem. Subsequently, we present AGILE - yet another hash table based tool for sequence mapping - in which we have used these filtering techniques to optimize every step of the sequence mapping process. HIGH THROUGHPUT SEQUENCE MAPPING In the context of Next Generation Sequencing, sequence mapping problem involves searching for a small DNA sequence (read) in the reference genome allowing a small number of differences. The reference genome is obtained from an organism of the same species as the reads, implying a high level of similarity between the read and the reference genome. The small number of differences are allowed to account for differences between individual organisms and sequencing errors. Given a string S over a finite alphabet Σ, we use S to refer to the length of S, S[i] to denote the i th character of S and S[i : j] to denote the substring of S which starts at position i and ends at position j. A q-gram of S is defined as a substring of S of length q > 0. A q-hit between two strings S and S is defined as the tuple (x, y) such that S [x : x + q ] = S [y : y + q ]. The unit cost edit distance (Levenshtein distance) [] between two strings S and S is defined as the minimum number of substitutions, insertions and deletions required to convert S to S. We will use E(S, S ) to refer to the unit cost edit distance between S and S. It can be calculated by using dynamic programming in O( S S ) time [, 0]. For a string S, we will refer to the natural decimal representation of S over Σ as D(S, Σ). For example, for Σ = {A, C, G, T }, the nucleotides A, C, G, T can be mapped to the numbers 0,,, respectively. Therefore: S X D(S, {A, C, G, T }) = i f(s[i]), where f(a) = 0, f(c) =, f(g) =, f(t ) =. This brings us to the formal definition of the sequence mapping problem. We can represent every genomic sequence as a string over the alphabet Σ = {A, C, G, T }. Given a genomic database G of subject sequences {S, S,, S l }, a query sequence (read) Q of length Q, the genome sequencing problem is to find the substring α of G that has the minimum value of E(α, Q) for all α. This problem can be reduced to finding the best match i=0 α for a query such that E(α, Q) is less than a certain bound. We can keep on increasing the bound till we find a match. Moreover, for a given sequencing error rate, larger reads tend to have more differences in the alignment as compared to shorter reads. Hence, having an absolute bound on the number of differences in an alignment is inappropriate as the length of the reads can vary. The bound should be a fraction of the read length. Hence, we define the ɛ-match sequence mapping problem: Given a genomic database G of subject sequences {S, S,, S l }, a query sequence (read) Q of length Q and an error rate ɛ find the substring α of G, such that E(α, Q) is minimum and E(α, Q) ɛ Q. ALGORITHM Most sequence mapping algorithms divide the problem into two stages - search stage and alignment stage. The search stage finds regions in the genome that can potentially be homologous to the read. The alignment stage verifies these regions to check if they are indeed homologous. The alignment stage is usually more computationally intensive than the search stage and the time taken is directly proportional to the number of regions found in the search stage. Hence, the best strategy to a fast sequence mapping algorithm is to filter out as many candidate regions as possible before the alignment stage. Various programs try to achieve this with the use of well designed filters for the specific problem ranges. In this section, we describe the AGILE algorithm that can achieve such filtering for a wide range of read lengths and error rates through a number of carefully designed filters. Some of these filters are generic and work for the entire range of read lengths and error rates, while others cater to a specific subset. However, combining these filters can achieve faster speeds over a wide range. We start by dividing the genome into non-overlapping q-grams and storing the locations of the q-grams in a q-gram index (hash table). Using the q-gram index, we find the list of q-hits between the read and the genome. Each q-hit can be extended on either side to create a candidate region. Regions with a large overlap with each other represent the same alignment and hence can be merged together. This gives us a very large number of regions. In the following subsections, we describe the filtering techniques used in AGILE that we apply to these regions.. Contiguous perfect matches Two strings of length m with n differences share a common (exactly matching) substring of length m []. FANGS [] adapted the above formula n+ to note that - given a read Q and genome G, if a substring α of G is such that E(Q, α) ɛ Q, then Q and α have a perfectly matching substring of length T q-grams, where T is given by: T = Q (q ) ɛ Q + q where q is the length of the q-grams. Therefore, we only consider regions in the genome which have a common substring of length T q-grams with the read. This filtering criteria significantly reduces the search space for finding homologous regions. However, as Table demonstrates, the value of T quickly becomes zero beyond a certain percentage of errors. Hence this filtering scheme does not help much if we consider a larger error rate.. Multiple perfect matches In [], James Kent had discussed the possibility of using multiple perfect matches as a filter. Each q-hit is a perfect match. The matches need not be contiguous. Let q =. Consider, for example, two q-hits - one starts at position 0 in the read and position 00 in the genome and second starting at position in the read and position 0 in the genome. Since there is equal gap between the q-hits in the read as well as the genome, they can easily be part of one alignment. Notice that genome-position readposition = 000 for both the reads. This value is called the diagonal. We

4 Page of Bioinformatics Table. The highest value of minimum number of perfect matches to get a particular sensitivity for different values of read lengths and sequence similarity. The first column lists the sequence similarity M. In each subsequent column, we report the maximum value of N for the given read length such that P N > sensitivity. Read length Similarity Sensitivity % % 0 0 % 0 can filter out a large number of regions by keeping a minimum cutoff on the number of q-hits with equal or slightly different diagonal. Let M be the sequence similarity between the read and the corresponding homologous region in the genome. Assuming that each letter is independent of the previous letter, the probability that a specific q-gram in the read matches a q-gram in a homologous region in the genome is given by: p = M q The match of the read in the genome will be of the same length as the read. Number of non-overlapping q-grams in the homologous region is given by: K = Q /q The probability that there are exactly n matches in the homologous region is: K P n = p n ( p ) K n n and the probability that there are N or more matches is the sum: P N = P N + P N P K James Kent discussed the idea of using two perfect matches as a filtering criteria as opposed to here we use a larger number of perfect matches for the same. Table displays the values of N that can be used for different values of sequence similarity and read lengths.. Ignore q-grams with high frequency In both the optimizations above, we are applying theoretical constraints to the q-hits between reads and genome in order to filter out unwanted regions. However, some q-grams appear much more frequently than many others. As a result, we get a very large number of q-hits for some reads. Moreover, if a particular q-gram appears very frequently in the genome, then it will produce a lot of matches at undesired places wasting time in processing them. A less frequent q-gram is much more useful in pin-pointing a match. Hence, our heuristic ignores all q-grams with frequency more than a certain cutoff frequency F to save time. To ensure that the contiguous perfect match criteria still works, we give a wildcard to all q-grams with frequency greater than F as done in FANGS []. In principle, we need to reduce the values of N so that we are still able to find all the homologous regions. A theoretical model may not be accurate unless we take into account the exact probability of each q-gram, which makes the model very complex. Hence we decided to do empirical analysis using synthetic queries for which the correct answers are already known. Table shows results of such experiments for read length 00. Clearly, for a fixed value of N, higher value of F should result in more regions to process in the alignment stage and more correctly mapped regions. For example, for N =, even F = passes 00 regions to the alignment stage. While with N = and F =, AGILE correctly maps more reads and filters out more regions. For N =, if F is reduced, that will further reduce correctly mapped regions. On the other hand, if F is increased, that will increase the time taken. Comparing N = case with Table. Experiments to find the most appropriate values of F and N so as to maximize the number of reads correctly mapped and minimize the number of regions sent to the alignment stage (hence minimize the time taken). The genome used is human genome. F N Correctly mapped # Regions Time taken (CPU sec) reads of length 00 were synthetically generated by sampling the human genome. We introduced errors in the reads using an error rate of %. For a read length of 00 and error rate of %, N = and F = work the best. N = case, even with F =, N = case correctly maps less number of reads but costs more time. Hence for a read length of 00 and error rate of %, N = works the best. With N = and F =, AGILE correctly maps.% of the queries. As compared to F =, F = takes extra seconds to increase the number of correctly mapped reads by just. Hence, there is a tradeoff between the time taken and the number of correctly mapped queries. In this particular case, N = and F = seems a relatively better choice. Using similar analysis, for a read length of 000 and error rate of %, N = and F = turns out to be a good choice. This filtering criteria works very well for large read lengths.. Customized q-gram filtering Another filter that we have applied in AGILE is keeping a minimum cutoff on the number of q-hits between the read and the region. This is called q-gram filtering. To estimate the effect of q-gram filtering, we conducted experiments with synthetic reads. For each read, we calculated the number of q-hits in each region and also whether the region is actually homologous or not. As shown in Figures and, homologous regions tend to have larger number of q-hits while non-homologous regions tend to have smaller number of q-hits. Hence, with a carefully chosen cutoff on the number of q-hits, we can filter out non-homologous regions. However, some queries can have more frequently appearing q-grams than others. In that case, those queries with more frequently appearing q-grams can have many hits in each region. On the other hand, queries with less frequent q-grams can have only a few

5 Bioinformatics Page of Misra et al Fig.. Histogram of the number of regions that are homologous and number of q-hits they share with the reads. This demonstrates that the homologous regions have larger number of q-hits Fig.. Histogram of the number of regions that are not homologous and number of q-hits they share with the reads. This demonstrates that the nonhomologous regions have smaller number of q-hits hits in even a homologous region. Hence, a generic cutoff is not appropriate. In AGILE, we dynamically choose the cutoff for each read. For each read, we find the maximum of the number of q-hits in all regions, say C. We keep the cutoff as a fraction f of C. Hence, if a region has fc q-hits, only then it is processed further. The best value of f can filter out the maximum number of non-homologous regions while keeping the number of correctly mapped reads the same. As an example, Table demonstrates the effect of using various values of f for read length 00. In this case, the best value of f is 0... Selecting the error rate ɛ All the above optimizations assume that we already know ɛ, to filter out the regions. However, the percentage error of a particular read cannot be known in advance. Reads from different sources can have varying error percentages. Even reads from the same source can have variations in error. To account for this, we apply the following heuristics. For each query, we start with a small value of ɛ, say %. We keep increasing the value until we find a match. There are various ways in which the value can be increased. We can increase by percent each time (e.g.,,,etc.), or by a larger constant interval (e.g.,,,, etc.), we can increase the value of the increment by one each time (e.g.,,,,, etc.), we can double the value of increment each time (e.g.,,,,, etc.). Through our experiments, we found that doubling the value of increment each time (exponential increment) works faster than the other strategies mentioned above. Table. Effect of the value of f. # Regions is the number of regions passed as a result of the previous filters. # Regions processed is the number of regions with number of q-hits fc for that read. Hence these are the number of regions processed by the alignment stage. The genome used is human genome. 000 reads of length 00 were synthetically generated by sampling the human genome. We introduced errors in the reads using an error rate of %. Clearly, f = 0. works best for this case. f # Regions # Regions processed # Correctly mapped Time taken s s s s s s s s s s 00.s Table. Effect of pruning Without Pruning With Pruning query % similarity time correctly time correctly length taken mapped taken mapped Choosing an appropriate starting value of ɛ is extremely crucial for the above step and depends on the error distribution of the reads. If the errors in the reads are distributed in a very small band on the lower side of the spectrum, choosing a higher initial value of ɛ, will lead to a lot of extra time wasted in unnecessary processing. On the other hand, if the errors in the reads are distributed over a very small band on the higher side of the spectrum, choosing a smaller initial value of ɛ will mean that we will not find any match for the initial few values of ɛ. Hence we will waste time in unnecessary and unfruitful processing. In order to guess a good initial value of n to start with, we dynamically adjust the ɛ by learning from the previous queries. We take the average of the edit distance per unit length of all the previous queries as the initial value of ɛ for the next query. AGILE IMPLEMENTATION Our implementation takes as input the genome FASTA file and a query FASTA file. The query file typically contains a batch of many queries. In the output we report locations in the genome where the full length of the read matches along with mapping quality and score of the match.

6 Page of Bioinformatics Fig.. AGILE workflow The high level workflow of AGILE is depicted in Figure. AGILE uses the fact that sequencers have a very small error rate [, ]. Hence we divide the problem of finding the best match α of a read Q that minimizes E(α, Q), into multiple problems allowing different values of edit distance. Each such problem is now of the form - Find a substring α of G such that E(α, Q) < ɛ Q. We start with allowing a small value of ɛ. If we find a match, we output that match and move to the next read. If we do not find a match, we increase the value of ɛ and try again. AGILE uses a q-gram index of the genome to map queries. We process the input reads one by one. Since the read can be from any strand of the DNA, AGILE processes both the read and reverse complement of the read to search for matches. For any sequence, we start by identifying the regions in the reference genome that can potentially be homologous to the read. In the next step, we filter out many of the regions using the heuristics discussed in the previous section. Each homologous region in the filtered list is further processed by using dynamic programming by creating the edit distance matrix to check if the region actually has an edit distance ɛ Q. Since the edit distance is bounded by ɛ Q, we only need to calculate a diagonal band of width ɛ Q + of the matrix. If at any stage, all paths in the edit distance matrix have an edit distance of more than ɛ Q, we stop further processing and discard the region. This pruning policy greatly reduces the time taken by the algorithm as many regions are discarded after the first few rows are processed. Table demonstrates the speed benefit obtained by this optimization. We have run extensive experiments in order to find the most efficient parameter values for different read lengths and ɛ values. Based on these experiments, for each query we automatically set the optimal values of these parameters. Also, as explained in section., we dynamically select the most appropriate starting values of ɛ and increment ɛ until we find a match. Automatic selection of parameter values makes it easier for users to use the program as they do not need to worry about deciding the appropriate parameter settings. In addition, having tailored values of parameter settings for different scenarios make it possible to run real queries that can be of varied lengths and error rates. RESULTS. Mapping Quality Calculation The concept of mapping quality was coined by Li et al. [] to estimate the probability that the read sequence has been mapped at the correct place or not. When a mapping algorithm is guaranteed to find all the matches of an algorithm, the mapping quality can be calculated using these alignments. However, since we use heuristics, we may not be able to find all the matches of a read and hence cannot calculate the mapping quality accurately. Li et. al [] approximated the mapping quality as 0(S S )/S, where S is the score of the best alignment and S is the score of the second best alignment. We adopt the same approach in our experiments for calculating the mapping quality.. Results on synthetic data To create synthetic queries, we used WGSIM script provided in the SAMTOOLS package. The script was modified to adapt to data. We created reads of a total of million bp of different read lengths. For each read length we introduced, and % errors. 0% of the errors were indels. We aligned these simulated reads to the human genome hg using AGILE, BWA-SW, BLAT (option -fastmap) and SSAHA (option -). Since these are synthetic reads, we know their coordinates in the genome. Hence, we compared the aligned coordinates to the known coordinates to calculate the alignment error. SSAHA, BWA-SW and AGILE report mapping quality. For BLAT, we calculated the mapping quality in the similar manner as described in section., i.e., 0(S S )/S, where S is the score of the best alignment and S is the score of the second best alignment. For our evaluation, we ran all the experiments on Intel Xeon quad core E0. GHz processor with xmb cache and GB RAM running a Linux based operating system. Each tool was run in single-threaded mode. Table shows the CPU time, percentage of confidently (mapping quality > 0) aligned reads and percentage of reads incorrectly aligned for AGILE (version 0..0), BWA-SW (version 0..), BLAT (version ) and SSAHA (version..) for different values of read length and sequencing error rates. We used the default command line options for each program unless necessary otherwise. Carefully selected command line options might yield better results... AGILE vs BWA-SW It is observed from Table that AGILE is more accurate than BWA-SW especially for reads with shorter lengths and more error rates. However, AGILE is also slower than BWA-SW for the same case - reads with shorter lengths and more error rates. For smaller error rates, AGILE is significantly faster than BWA-SW (upto times faster). AGILE is most efficient for read lengths of about 00. With the rate at which the read lengths are increasing, 00bp reads will soon be be the norm... AGILE vs BLAT From Table, it is clear that AGILE is significantly faster (upto 0 times faster) than BLAT while still being much more accurate in most cases; most importantly the cases with larger error rates. Even in cases where AGILE is less accurate than BLAT, it is not much worse. BLAT s lack of accuracy is also due to the -fastmap option. However, with default parameters BLAT is more than an order of magnitude slower than with -fastmap option... AGILE vs SSAHA AGILE is several times faster than SSAHA on all inputs. SSAHA is more accurate than AGILE, although it needs to spend significantly more time to achieve this accuracy. SSAHA does not work well for kbp reads as it is not designed for such read lengths and thus could not be tested on them... Memory Comparison As far as memory usage is concerned, for mapping reads against human genome, BLAT and BWA-SW use about GB memory. The peak memory requirement of SSAHA is about.gb. For AGILE, the q-gram index requires about.gb memory and the genome itself requires about

7 Bioinformatics Page of Misra et al Table. Results on synthetic reads. We created reads of a total of million bp of different read lengths. Q0% = percentage of reads with mapping quality > 0. erraln% = percentage of reads incorrectly aligned Program AGILE BWA-SW BLAT SSAHA Read Length Error rate CPU sec Q0% erraln% CPU sec Q0% erraln% CPU sec Q0% erraln% CPU sec Q0% erraln% 0bp % % % bp % % % bp % % % bp % % % bp % % % Table. Breakup of alignments that are mapped inconsistently between BWA-SW and AGILE BWA-SW AGILE Condition Count Avg read length Avg. mapping qual Avg. diff Avg. mapping qual Avg diff BWA-SW 0; AGILE unmapped...% - - BWA-SW 0 plausible; AGILE< %..% BWA-SW 0 questionable...%.% AGILE 0; BWA-SW unmapped % AGILE 0 plausible; BWA-SW< 0...%.0.% AGILE 0 questionable. 0..% 0.% We mapped 0, 000 reads uniformly selected from SRR000 (pre-filtered to remove any reads smaller than 0bp) against the human genome hg with BWA-SW and AGILE respectively. We call a read inconsistently mapped if the left most position of the alignments found by BWA-SW and AGILE differ by more than the length of the read. For each alignment, we calculated the score (number of matches three times the number of differences (edit distance)). We call a AGILE alignment plausible if 0(AGILE score BWA-SW score)/(agile score) 0 (using BWA alignment as the next best alignment, AGILE mapping quality > 0). Essentially, this means that the AGILE alignment is sufficiently better. Otherwise, we call an AGILE alignment questionable. We use similar definitions of plausible and questionable for BWA-SW alignments. In adddition, AGILE 0 is defined as AGILE alignments with mapping quality 0. GB memory. Memory required by each query is insignificant with respect to them.. Results on real data It is difficult to evaluate on real data because of the lack of ground truth. However, if two algorithms output the same alignment for a read, it is most likely correct. If two aligners X and Y output different alignments for a read, if X and Y both report low mapping quality, then the alignment is ambiguous and it does not matter which one is wrong. If X reports high mapping quality for a read and X alignment score is worse than or slightly better than Y, X mapping quality is wrong and X is not aware of an equally probable alignment. This analysis is similar to the one reported in []. We ran BWA-SW and AGILE on real queries obtained from the NCBI short read archive SRR000. We filtered the read set to remove queries smaller than 0bp long and uniformly selected 0k read with an average length of bp. AGILE took CPU seconds and BWA-SW took CPU seconds. Both tools found common alignments. reads were not mapped by any of the tools. Table shows breakup of reads that are aligned by only one aligner or mapped to different places, and are assigned mapping quality greater than equal to 0 for either BWA-SW or AGILE.

8 Page of Bioinformatics Overall, BWA-SW misses 0+ = 0 alignments that AGILE maps well. AGILE misses ( + ) alignments that are aligned well by BWA-SW. Note that, even though the average read length of the entire read set is, the inconsistencies come mostly in the case of smaller read lengths (average length or smaller). This is in accordance with the results on synthetic reads as both aligners tend to make more mistakes in aligning reads of shorter lengths. BWA-SW tends to miss more alignments in even shorter (-) read length range. While AGILE misses more alignments for a bit longer (0-) reads. ANALYSIS AGILE is similar to BLAT, SSAHA, FANGS and BWA-SW in sharing the seed-and-extend strategy to get candidate regions. However, the major difference lies in the way heuristics are employed to reduce the number of regions to be processed. BLAT and SSAHA consider short (-bp) exact matches as seeds. BLAT also provides functionality to use short inexact match (one mismatch allowed) or two exact matches slight differing in diagonal values. For longer reads, both these techniques result in a very large number of candidate regions. AGILE filters the candidates by allowing longer seeds (upto bp). It also uses a cutoff on the minimum number of contiguous exact seed matches and larger cutoff on the minimum number of exact matches with equal or slightly different diagonal values. To the best of our knowledge, no other tool uses multiple perfect match filtering criteria with a cutoff of more than matches. This is similar to using a long gapped seed. BWA-SW also uses long gapped seeds for the similar reason. The main difference between BWA-SW and AGILE is the way long gapped seed is implemented. While BWA- SW uses Prefix Trie and Prefix DAWG, AGILE relies on a much simpler data structure q-gram index (hash table) and diagonal coordinates. Both BWA-SW and AGILE further use different heuristics in order to reduce the search space. The heuristics used by AGILE are similar to the ones used in string matching or sequence mapping algorithms. For each heuristic, AGILE adapts it to the problem of long read mapping. Customized q-gram filtering is an example. Q-gram filtering is a well known technique for filtering out unwanted regions. However to the best of our knowledge no algorithm has used q-gram filtering with the cutoff customized for each read. Another major contribution of AGILE is the gradual increase of the allowed error rate till we find a match. Most tools use the allowed error rate to decide thresholds for pruning the search space. However, many reads have a small number of errors. Since these tools use a fixed value of the allowed error rate, they end up using a larger value of the allowed error rate for all reads in order to also map the reads with larger errors, resulting in a loss of efficiency. Hence gradually increasing the allowed error rate in mapping can have tremendous effects on the efficiency of these tools. CONCLUSION Advances in sequencing techniques necessitate the development of high performance, scalable algorithms to extract biologically relevant information from these datasets. Research on developing sequence mapping algorithms has been largely focused on mapping short reads, and little work has been done for longer reads. AGILE is a hash based sequence mapping algorithm that rapidly maps long reads using efficient heuristics to optimize different steps of the mapping process. AGILE can handle very large genome sizes and read lengths. It is flexible in that it allows a large number of mismatches and insertions/deletions in mapping and provides command line parameters to control every step of the mapping process. The best sensitivity and specificity of AGILE is achieved when the reads are longer and the error rate is small. Considering that with the improvement in sequencing technology, the read lengths will increase further and the error rates will decrease, AGILE should be even more useful in the future. ACKNOWLEDGEMENT Funding: This work was supported in part by NSF award numbers: CCF-0, SDCI OCI-0, CNS-0, IIS-0, and HECURA This work was also partially supported by DOE grants DE-FC0-0ER0 and DE-FG0-0ER. REFERENCES []Hunting Disease Origins with Whole-Genome Sequencing []Mosaik: [] Sequencers System Features - []S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic local alignment search tool. J Mol Biol, ():0, October 0. []W. J. Kent. Blat the blast-like alignment tool. Genome Res, ():, April 00. []B. Langmead, C. Trapnell, M. Pop, and S. Salzberg. Ultrafast and memoryefficient alignment of short dna sequences to the human genome. Genome Biology, (), 00. []H. Li and R. Durbin. Fast and accurate long read alignment with burrows-wheeler transform. Bioinformatics, pages, January 0. []H. Li, J. Ruan, and R. Durbin. Mapping short dna sequencing reads and calling variants using mapping quality scores. Genome research, pages, August 00. []R. Li, Y. Li, K. Kristiansen, and J. Wang. Soap: short oligonucleotide alignment program. Bioinformatics, pages, January 00. []S. Misra, R. Narayanan, S. Lin, and A. Choudhary. Fangs: High speed sequence mapping for next generation sequencers. In Proceedings of ACM Symposium of Applied Computing (ACM SAC), Sierre, Switzerland, 00. []S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, ():, March 0. []Z. Ning, A. J. Cox, and J. C. Mullikin. Ssaha: A fast search method for large dna databases. Genome Res., ():, 00. []K. L. Patrick. life sciences: Illuminating the future of genome sequencing and personalized medicine. Yale J Biol Med., 0():, Dec 00. []W. R. Pearson and D. J. Lipman. Improved tools for biological sequence comparison. Proc Natl Acad Sci U S A, ():, April. []P. Pevzner and M. Waterman. Multiple filtration and approximate pattern matching. Algorithmica, :,. []K. R. Rasmussen, J. Stoye, and E. W. Myers. Efficient q-gram filters for finding all epsilon-matches over a given length. Journal of Computational Biology, (): 0, 00. []J. M. Rothberg and J. H. Leamon. The development and impact of sequencing. Nature Biotechnology, ():, 00. []S. M. Rumble, P. Lacroute, A. V. Dalca, M. Fiume, A. Sidow, and M. Brudno. Shrimp: Accurate mapping of short color-space reads. PLoS Comput Biol, ():e00, []A. D. Smith, Z. Xuan, and M. Q. Zhang. Using quality scores and longer reads improves accuracy of solexa read mapping. BMC Bioinformatics, :, February 00. [0]T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, :, March.

24 Grundlagen der Bioinformatik, SS 10, D. Huson, April 26, This lecture is based on the following papers, which are all recommended reading:

24 Grundlagen der Bioinformatik, SS 10, D. Huson, April 26, 2010 3 BLAST and FASTA This lecture is based on the following papers, which are all recommended reading: D.J. Lipman and W.R. Pearson, Rapid