Bioinformatics. Anatomy of a Hash-based Long Read Sequence Mapping Algorithm for Next Generation DNA Sequencing

Size: px
Start display at page:

Download "Bioinformatics. Anatomy of a Hash-based Long Read Sequence Mapping Algorithm for Next Generation DNA Sequencing"

Transcription

1 Bioinformatics Anatomy of a Hash-based Long Read Sequence Mapping Algorithm for Next Generation DNA Sequencing Journal: Bioinformatics Manuscript ID: BIOINF-0-0 Category: Original Paper Date Submitted by the Author: -Jul-0 Complete List of Authors: Misra, Sanchit; Northwestern University, Electrical Engineering and Computer Science Agrawal, Ankit; Northwestern University, Electrical Engineering and Computer Science Liao, Wei-keng; Northwestern University, Electrical Engineering and Computer Science Choudhary, Alok; Northwestern University, Electrical Engineering and Computer Science Keywords: Sequence alignment, Next-generation sequencing, Genome analysis, Computational biology, Bioinformatics, Algorithms

2 Page of Bioinformatics BIOINFORMATICS Anatomy of a Hash-based Long Read Sequence Mapping Algorithm for Next Generation DNA Sequencing Sanchit Misra, Ankit Agrawal, Wei-keng Liao and Alok Choudhary Department of Electrical Engineering and Computer Science, Northwestern University, Sheridan Rd, Evanston IL 00 Received on XXXXX; revised on XXXXX; accepted on XXXXX Associate Editor: XXXXXXX ABSTRACT Motivation: Recently a number of programs have been proposed for mapping short reads to a reference genome. Many of them are heavily optimized for short-read mapping and hence are very efficient for shorter queries, but that makes them inefficient or not applicable for reads longer than 00bp. However, many sequencers are already generating longer reads and more are expected to follow. For long read sequence mapping, there are limited options; BLAT, SSAHA, FANGS and BWA-SW are among the popular ones. However, resequencing and personalized medicine need much faster software to map these long sequencing reads to a reference genome to identify SNPs or rare transcripts. Results: We present AGILE (AliGnIng Long reads), a hash table based high throughput sequence mapping algorithm for longer reads that uses diagonal multiple seed-match criteria, customized q- gram filtering and a dynamic incremental search approach among other heuristics to optimize every step of the mapping process. In our experiments, we observe that AGILE is more accurate than BLAT, and comparable to BWA-SW and SSAHA. For practical error rates (< %) and read lengths (00 00bp), AGILE is significantly faster than BLAT, SSAHA, and BWA-SW. Even for the other cases, AGILE is comparable to BWA-SW and several times faster than BLAT and SSAHA. Availability: smi/agile.html Contact: smi@eecs.northwestern.edu INTRODUCTION Recent advances in Next Generation Sequencing (NGS) technology have led to affordable desktop-sized sequencers with low running costs and high throughput. These sequencers produce small fragments of the genome being sequenced as a result of the sequencing process. By mapping these small fragments (reads) to a reference genome, we can sequence the DNA of a new individual. The NGSs are making it possible for these studies to be conducted at a mass scale. These advances will usher an era of personal genomics when each individual can have his/her DNA sequenced and studied to develop more personalized ways of anticipating, diagnosing and treating diseases []. Studies of this nature have already begun []. James Lupski, a physician-scientist who suffers from a neurological disorder called to whom correspondence should be addressed Vol. 00 no Pages Charcot-Marie-Tooth has found the genetic cause of his disease by sequencing his entire genome (late 00). Another study, the first to describe the genomes of an entire family of four, confirmed the genetic root of a rare disease, called Miller syndrome, afflicting both children (March 0). These studies have been made possible by plunging costs and increasing speeds of high throughput sequencing. Next generation sequencers (NGSs) sequence the DNA by generating small substrings of the DNA called reads. With rapid improvements in sequencing technologies, the lengths of the reads is constantly increasing. For example, the lengths of reads increased from about 0bp in 00 to about 00bp in 00. The length of illumina reads increased from about 0-0bp in 00 to about 0bp in 00. Pacific Biosciences also announced 00bp long reads in 00. The rate of throughput as well as read lengths of these NGSs are increasing at a pace that puts even the Moore s law to shame. Hence there is a growing need for tools that can work for longer reads and can still match the pace of the NGSs. A number of tools have been developed for shorter illumina queries. These include MAQ [], ELAND, SOAP [], BowTie [], Mosaik [], SHRiMP []. However most of these tools work only for read lengths < 00. Also, they allow very few number of mismatches (usually < ) and many of them do not allow any gaps. However, as the lengths of reads are rapidly increasing, we need tools that can work for longer read lengths. Moreover, these new tools for longer reads should be able to handle a larger number of gaps and mismatches. To the best of our knowledge, the only other tools specifically designed to work for longer reads are BWA-SW [] and FANGS []. BWA-SW is a package based on Burrows- Wheeler Transform (BWT). It supports gapped global alignment with respect to queries and is one of the fastest long read alignment algorithms while also finding suboptimal matches. Hash tables have been used extensively for short read mapping and many other related problems [,,,,, ]. Hence, it may appear that we have exhausted all possible uses of hash tables for sequence mapping. However, we will find that with proper heuristics, hash tables can give excellent speedups for sequence mapping of longer reads as well. The general structure of any hash table based sequence mapping algorithm is as follows: i) Create a hash table index of the genome, ii) Use the index to find regions in the genome that can potentially be homologous, iii) Examine each region in more detail and output regions that are indeed homologous. The execution time of existing hash table-based algorithms is dominated by stage (iii). The processing time of this stage is directly proportional c Oxford University Press 00.

3 Bioinformatics Page of Misra et al Table. The value of threshold T for different values of error rate ɛ and read length (q = ) Read length Error rate ɛ % % % 0 0 % % to the number of regions found. FANGS is also a hash table based tool. It uses q-gram filtering and pigeon hole principle to filter out many of the regions to rapidly map reads with nearly 0% sensitivity []. However, FANGS is inefficient for error rates greater than %. Using techniques in addition to the ones used by FANGS to quickly filter out regions that are not homologous will greatly reduce the time taken. We follow this path. In this paper, we will discuss five different techniques to filter out non-homologous regions and their adaptations to the high throughput long read sequence mapping problem. Subsequently, we present AGILE - yet another hash table based tool for sequence mapping - in which we have used these filtering techniques to optimize every step of the sequence mapping process. HIGH THROUGHPUT SEQUENCE MAPPING In the context of Next Generation Sequencing, sequence mapping problem involves searching for a small DNA sequence (read) in the reference genome allowing a small number of differences. The reference genome is obtained from an organism of the same species as the reads, implying a high level of similarity between the read and the reference genome. The small number of differences are allowed to account for differences between individual organisms and sequencing errors. Given a string S over a finite alphabet Σ, we use S to refer to the length of S, S[i] to denote the i th character of S and S[i : j] to denote the substring of S which starts at position i and ends at position j. A q-gram of S is defined as a substring of S of length q > 0. A q-hit between two strings S and S is defined as the tuple (x, y) such that S [x : x + q ] = S [y : y + q ]. The unit cost edit distance (Levenshtein distance) [] between two strings S and S is defined as the minimum number of substitutions, insertions and deletions required to convert S to S. We will use E(S, S ) to refer to the unit cost edit distance between S and S. It can be calculated by using dynamic programming in O( S S ) time [, 0]. For a string S, we will refer to the natural decimal representation of S over Σ as D(S, Σ). For example, for Σ = {A, C, G, T }, the nucleotides A, C, G, T can be mapped to the numbers 0,,, respectively. Therefore: S X D(S, {A, C, G, T }) = i f(s[i]), where f(a) = 0, f(c) =, f(g) =, f(t ) =. This brings us to the formal definition of the sequence mapping problem. We can represent every genomic sequence as a string over the alphabet Σ = {A, C, G, T }. Given a genomic database G of subject sequences {S, S,, S l }, a query sequence (read) Q of length Q, the genome sequencing problem is to find the substring α of G that has the minimum value of E(α, Q) for all α. This problem can be reduced to finding the best match i=0 α for a query such that E(α, Q) is less than a certain bound. We can keep on increasing the bound till we find a match. Moreover, for a given sequencing error rate, larger reads tend to have more differences in the alignment as compared to shorter reads. Hence, having an absolute bound on the number of differences in an alignment is inappropriate as the length of the reads can vary. The bound should be a fraction of the read length. Hence, we define the ɛ-match sequence mapping problem: Given a genomic database G of subject sequences {S, S,, S l }, a query sequence (read) Q of length Q and an error rate ɛ find the substring α of G, such that E(α, Q) is minimum and E(α, Q) ɛ Q. ALGORITHM Most sequence mapping algorithms divide the problem into two stages - search stage and alignment stage. The search stage finds regions in the genome that can potentially be homologous to the read. The alignment stage verifies these regions to check if they are indeed homologous. The alignment stage is usually more computationally intensive than the search stage and the time taken is directly proportional to the number of regions found in the search stage. Hence, the best strategy to a fast sequence mapping algorithm is to filter out as many candidate regions as possible before the alignment stage. Various programs try to achieve this with the use of well designed filters for the specific problem ranges. In this section, we describe the AGILE algorithm that can achieve such filtering for a wide range of read lengths and error rates through a number of carefully designed filters. Some of these filters are generic and work for the entire range of read lengths and error rates, while others cater to a specific subset. However, combining these filters can achieve faster speeds over a wide range. We start by dividing the genome into non-overlapping q-grams and storing the locations of the q-grams in a q-gram index (hash table). Using the q-gram index, we find the list of q-hits between the read and the genome. Each q-hit can be extended on either side to create a candidate region. Regions with a large overlap with each other represent the same alignment and hence can be merged together. This gives us a very large number of regions. In the following subsections, we describe the filtering techniques used in AGILE that we apply to these regions.. Contiguous perfect matches Two strings of length m with n differences share a common (exactly matching) substring of length m []. FANGS [] adapted the above formula n+ to note that - given a read Q and genome G, if a substring α of G is such that E(Q, α) ɛ Q, then Q and α have a perfectly matching substring of length T q-grams, where T is given by: T = Q (q ) ɛ Q + q where q is the length of the q-grams. Therefore, we only consider regions in the genome which have a common substring of length T q-grams with the read. This filtering criteria significantly reduces the search space for finding homologous regions. However, as Table demonstrates, the value of T quickly becomes zero beyond a certain percentage of errors. Hence this filtering scheme does not help much if we consider a larger error rate.. Multiple perfect matches In [], James Kent had discussed the possibility of using multiple perfect matches as a filter. Each q-hit is a perfect match. The matches need not be contiguous. Let q =. Consider, for example, two q-hits - one starts at position 0 in the read and position 00 in the genome and second starting at position in the read and position 0 in the genome. Since there is equal gap between the q-hits in the read as well as the genome, they can easily be part of one alignment. Notice that genome-position readposition = 000 for both the reads. This value is called the diagonal. We

4 Page of Bioinformatics Table. The highest value of minimum number of perfect matches to get a particular sensitivity for different values of read lengths and sequence similarity. The first column lists the sequence similarity M. In each subsequent column, we report the maximum value of N for the given read length such that P N > sensitivity. Read length Similarity Sensitivity % % 0 0 % 0 can filter out a large number of regions by keeping a minimum cutoff on the number of q-hits with equal or slightly different diagonal. Let M be the sequence similarity between the read and the corresponding homologous region in the genome. Assuming that each letter is independent of the previous letter, the probability that a specific q-gram in the read matches a q-gram in a homologous region in the genome is given by: p = M q The match of the read in the genome will be of the same length as the read. Number of non-overlapping q-grams in the homologous region is given by: K = Q /q The probability that there are exactly n matches in the homologous region is: K P n = p n ( p ) K n n and the probability that there are N or more matches is the sum: P N = P N + P N P K James Kent discussed the idea of using two perfect matches as a filtering criteria as opposed to here we use a larger number of perfect matches for the same. Table displays the values of N that can be used for different values of sequence similarity and read lengths.. Ignore q-grams with high frequency In both the optimizations above, we are applying theoretical constraints to the q-hits between reads and genome in order to filter out unwanted regions. However, some q-grams appear much more frequently than many others. As a result, we get a very large number of q-hits for some reads. Moreover, if a particular q-gram appears very frequently in the genome, then it will produce a lot of matches at undesired places wasting time in processing them. A less frequent q-gram is much more useful in pin-pointing a match. Hence, our heuristic ignores all q-grams with frequency more than a certain cutoff frequency F to save time. To ensure that the contiguous perfect match criteria still works, we give a wildcard to all q-grams with frequency greater than F as done in FANGS []. In principle, we need to reduce the values of N so that we are still able to find all the homologous regions. A theoretical model may not be accurate unless we take into account the exact probability of each q-gram, which makes the model very complex. Hence we decided to do empirical analysis using synthetic queries for which the correct answers are already known. Table shows results of such experiments for read length 00. Clearly, for a fixed value of N, higher value of F should result in more regions to process in the alignment stage and more correctly mapped regions. For example, for N =, even F = passes 00 regions to the alignment stage. While with N = and F =, AGILE correctly maps more reads and filters out more regions. For N =, if F is reduced, that will further reduce correctly mapped regions. On the other hand, if F is increased, that will increase the time taken. Comparing N = case with Table. Experiments to find the most appropriate values of F and N so as to maximize the number of reads correctly mapped and minimize the number of regions sent to the alignment stage (hence minimize the time taken). The genome used is human genome. F N Correctly mapped # Regions Time taken (CPU sec) reads of length 00 were synthetically generated by sampling the human genome. We introduced errors in the reads using an error rate of %. For a read length of 00 and error rate of %, N = and F = work the best. N = case, even with F =, N = case correctly maps less number of reads but costs more time. Hence for a read length of 00 and error rate of %, N = works the best. With N = and F =, AGILE correctly maps.% of the queries. As compared to F =, F = takes extra seconds to increase the number of correctly mapped reads by just. Hence, there is a tradeoff between the time taken and the number of correctly mapped queries. In this particular case, N = and F = seems a relatively better choice. Using similar analysis, for a read length of 000 and error rate of %, N = and F = turns out to be a good choice. This filtering criteria works very well for large read lengths.. Customized q-gram filtering Another filter that we have applied in AGILE is keeping a minimum cutoff on the number of q-hits between the read and the region. This is called q-gram filtering. To estimate the effect of q-gram filtering, we conducted experiments with synthetic reads. For each read, we calculated the number of q-hits in each region and also whether the region is actually homologous or not. As shown in Figures and, homologous regions tend to have larger number of q-hits while non-homologous regions tend to have smaller number of q-hits. Hence, with a carefully chosen cutoff on the number of q-hits, we can filter out non-homologous regions. However, some queries can have more frequently appearing q-grams than others. In that case, those queries with more frequently appearing q-grams can have many hits in each region. On the other hand, queries with less frequent q-grams can have only a few

5 Bioinformatics Page of Misra et al Fig.. Histogram of the number of regions that are homologous and number of q-hits they share with the reads. This demonstrates that the homologous regions have larger number of q-hits Fig.. Histogram of the number of regions that are not homologous and number of q-hits they share with the reads. This demonstrates that the nonhomologous regions have smaller number of q-hits hits in even a homologous region. Hence, a generic cutoff is not appropriate. In AGILE, we dynamically choose the cutoff for each read. For each read, we find the maximum of the number of q-hits in all regions, say C. We keep the cutoff as a fraction f of C. Hence, if a region has fc q-hits, only then it is processed further. The best value of f can filter out the maximum number of non-homologous regions while keeping the number of correctly mapped reads the same. As an example, Table demonstrates the effect of using various values of f for read length 00. In this case, the best value of f is 0... Selecting the error rate ɛ All the above optimizations assume that we already know ɛ, to filter out the regions. However, the percentage error of a particular read cannot be known in advance. Reads from different sources can have varying error percentages. Even reads from the same source can have variations in error. To account for this, we apply the following heuristics. For each query, we start with a small value of ɛ, say %. We keep increasing the value until we find a match. There are various ways in which the value can be increased. We can increase by percent each time (e.g.,,,etc.), or by a larger constant interval (e.g.,,,, etc.), we can increase the value of the increment by one each time (e.g.,,,,, etc.), we can double the value of increment each time (e.g.,,,,, etc.). Through our experiments, we found that doubling the value of increment each time (exponential increment) works faster than the other strategies mentioned above. Table. Effect of the value of f. # Regions is the number of regions passed as a result of the previous filters. # Regions processed is the number of regions with number of q-hits fc for that read. Hence these are the number of regions processed by the alignment stage. The genome used is human genome. 000 reads of length 00 were synthetically generated by sampling the human genome. We introduced errors in the reads using an error rate of %. Clearly, f = 0. works best for this case. f # Regions # Regions processed # Correctly mapped Time taken s s s s s s s s s s 00.s Table. Effect of pruning Without Pruning With Pruning query % similarity time correctly time correctly length taken mapped taken mapped Choosing an appropriate starting value of ɛ is extremely crucial for the above step and depends on the error distribution of the reads. If the errors in the reads are distributed in a very small band on the lower side of the spectrum, choosing a higher initial value of ɛ, will lead to a lot of extra time wasted in unnecessary processing. On the other hand, if the errors in the reads are distributed over a very small band on the higher side of the spectrum, choosing a smaller initial value of ɛ will mean that we will not find any match for the initial few values of ɛ. Hence we will waste time in unnecessary and unfruitful processing. In order to guess a good initial value of n to start with, we dynamically adjust the ɛ by learning from the previous queries. We take the average of the edit distance per unit length of all the previous queries as the initial value of ɛ for the next query. AGILE IMPLEMENTATION Our implementation takes as input the genome FASTA file and a query FASTA file. The query file typically contains a batch of many queries. In the output we report locations in the genome where the full length of the read matches along with mapping quality and score of the match.

6 Page of Bioinformatics Fig.. AGILE workflow The high level workflow of AGILE is depicted in Figure. AGILE uses the fact that sequencers have a very small error rate [, ]. Hence we divide the problem of finding the best match α of a read Q that minimizes E(α, Q), into multiple problems allowing different values of edit distance. Each such problem is now of the form - Find a substring α of G such that E(α, Q) < ɛ Q. We start with allowing a small value of ɛ. If we find a match, we output that match and move to the next read. If we do not find a match, we increase the value of ɛ and try again. AGILE uses a q-gram index of the genome to map queries. We process the input reads one by one. Since the read can be from any strand of the DNA, AGILE processes both the read and reverse complement of the read to search for matches. For any sequence, we start by identifying the regions in the reference genome that can potentially be homologous to the read. In the next step, we filter out many of the regions using the heuristics discussed in the previous section. Each homologous region in the filtered list is further processed by using dynamic programming by creating the edit distance matrix to check if the region actually has an edit distance ɛ Q. Since the edit distance is bounded by ɛ Q, we only need to calculate a diagonal band of width ɛ Q + of the matrix. If at any stage, all paths in the edit distance matrix have an edit distance of more than ɛ Q, we stop further processing and discard the region. This pruning policy greatly reduces the time taken by the algorithm as many regions are discarded after the first few rows are processed. Table demonstrates the speed benefit obtained by this optimization. We have run extensive experiments in order to find the most efficient parameter values for different read lengths and ɛ values. Based on these experiments, for each query we automatically set the optimal values of these parameters. Also, as explained in section., we dynamically select the most appropriate starting values of ɛ and increment ɛ until we find a match. Automatic selection of parameter values makes it easier for users to use the program as they do not need to worry about deciding the appropriate parameter settings. In addition, having tailored values of parameter settings for different scenarios make it possible to run real queries that can be of varied lengths and error rates. RESULTS. Mapping Quality Calculation The concept of mapping quality was coined by Li et al. [] to estimate the probability that the read sequence has been mapped at the correct place or not. When a mapping algorithm is guaranteed to find all the matches of an algorithm, the mapping quality can be calculated using these alignments. However, since we use heuristics, we may not be able to find all the matches of a read and hence cannot calculate the mapping quality accurately. Li et. al [] approximated the mapping quality as 0(S S )/S, where S is the score of the best alignment and S is the score of the second best alignment. We adopt the same approach in our experiments for calculating the mapping quality.. Results on synthetic data To create synthetic queries, we used WGSIM script provided in the SAMTOOLS package. The script was modified to adapt to data. We created reads of a total of million bp of different read lengths. For each read length we introduced, and % errors. 0% of the errors were indels. We aligned these simulated reads to the human genome hg using AGILE, BWA-SW, BLAT (option -fastmap) and SSAHA (option -). Since these are synthetic reads, we know their coordinates in the genome. Hence, we compared the aligned coordinates to the known coordinates to calculate the alignment error. SSAHA, BWA-SW and AGILE report mapping quality. For BLAT, we calculated the mapping quality in the similar manner as described in section., i.e., 0(S S )/S, where S is the score of the best alignment and S is the score of the second best alignment. For our evaluation, we ran all the experiments on Intel Xeon quad core E0. GHz processor with xmb cache and GB RAM running a Linux based operating system. Each tool was run in single-threaded mode. Table shows the CPU time, percentage of confidently (mapping quality > 0) aligned reads and percentage of reads incorrectly aligned for AGILE (version 0..0), BWA-SW (version 0..), BLAT (version ) and SSAHA (version..) for different values of read length and sequencing error rates. We used the default command line options for each program unless necessary otherwise. Carefully selected command line options might yield better results... AGILE vs BWA-SW It is observed from Table that AGILE is more accurate than BWA-SW especially for reads with shorter lengths and more error rates. However, AGILE is also slower than BWA-SW for the same case - reads with shorter lengths and more error rates. For smaller error rates, AGILE is significantly faster than BWA-SW (upto times faster). AGILE is most efficient for read lengths of about 00. With the rate at which the read lengths are increasing, 00bp reads will soon be be the norm... AGILE vs BLAT From Table, it is clear that AGILE is significantly faster (upto 0 times faster) than BLAT while still being much more accurate in most cases; most importantly the cases with larger error rates. Even in cases where AGILE is less accurate than BLAT, it is not much worse. BLAT s lack of accuracy is also due to the -fastmap option. However, with default parameters BLAT is more than an order of magnitude slower than with -fastmap option... AGILE vs SSAHA AGILE is several times faster than SSAHA on all inputs. SSAHA is more accurate than AGILE, although it needs to spend significantly more time to achieve this accuracy. SSAHA does not work well for kbp reads as it is not designed for such read lengths and thus could not be tested on them... Memory Comparison As far as memory usage is concerned, for mapping reads against human genome, BLAT and BWA-SW use about GB memory. The peak memory requirement of SSAHA is about.gb. For AGILE, the q-gram index requires about.gb memory and the genome itself requires about

7 Bioinformatics Page of Misra et al Table. Results on synthetic reads. We created reads of a total of million bp of different read lengths. Q0% = percentage of reads with mapping quality > 0. erraln% = percentage of reads incorrectly aligned Program AGILE BWA-SW BLAT SSAHA Read Length Error rate CPU sec Q0% erraln% CPU sec Q0% erraln% CPU sec Q0% erraln% CPU sec Q0% erraln% 0bp % % % bp % % % bp % % % bp % % % bp % % % Table. Breakup of alignments that are mapped inconsistently between BWA-SW and AGILE BWA-SW AGILE Condition Count Avg read length Avg. mapping qual Avg. diff Avg. mapping qual Avg diff BWA-SW 0; AGILE unmapped...% - - BWA-SW 0 plausible; AGILE< %..% BWA-SW 0 questionable...%.% AGILE 0; BWA-SW unmapped % AGILE 0 plausible; BWA-SW< 0...%.0.% AGILE 0 questionable. 0..% 0.% We mapped 0, 000 reads uniformly selected from SRR000 (pre-filtered to remove any reads smaller than 0bp) against the human genome hg with BWA-SW and AGILE respectively. We call a read inconsistently mapped if the left most position of the alignments found by BWA-SW and AGILE differ by more than the length of the read. For each alignment, we calculated the score (number of matches three times the number of differences (edit distance)). We call a AGILE alignment plausible if 0(AGILE score BWA-SW score)/(agile score) 0 (using BWA alignment as the next best alignment, AGILE mapping quality > 0). Essentially, this means that the AGILE alignment is sufficiently better. Otherwise, we call an AGILE alignment questionable. We use similar definitions of plausible and questionable for BWA-SW alignments. In adddition, AGILE 0 is defined as AGILE alignments with mapping quality 0. GB memory. Memory required by each query is insignificant with respect to them.. Results on real data It is difficult to evaluate on real data because of the lack of ground truth. However, if two algorithms output the same alignment for a read, it is most likely correct. If two aligners X and Y output different alignments for a read, if X and Y both report low mapping quality, then the alignment is ambiguous and it does not matter which one is wrong. If X reports high mapping quality for a read and X alignment score is worse than or slightly better than Y, X mapping quality is wrong and X is not aware of an equally probable alignment. This analysis is similar to the one reported in []. We ran BWA-SW and AGILE on real queries obtained from the NCBI short read archive SRR000. We filtered the read set to remove queries smaller than 0bp long and uniformly selected 0k read with an average length of bp. AGILE took CPU seconds and BWA-SW took CPU seconds. Both tools found common alignments. reads were not mapped by any of the tools. Table shows breakup of reads that are aligned by only one aligner or mapped to different places, and are assigned mapping quality greater than equal to 0 for either BWA-SW or AGILE.

8 Page of Bioinformatics Overall, BWA-SW misses 0+ = 0 alignments that AGILE maps well. AGILE misses ( + ) alignments that are aligned well by BWA-SW. Note that, even though the average read length of the entire read set is, the inconsistencies come mostly in the case of smaller read lengths (average length or smaller). This is in accordance with the results on synthetic reads as both aligners tend to make more mistakes in aligning reads of shorter lengths. BWA-SW tends to miss more alignments in even shorter (-) read length range. While AGILE misses more alignments for a bit longer (0-) reads. ANALYSIS AGILE is similar to BLAT, SSAHA, FANGS and BWA-SW in sharing the seed-and-extend strategy to get candidate regions. However, the major difference lies in the way heuristics are employed to reduce the number of regions to be processed. BLAT and SSAHA consider short (-bp) exact matches as seeds. BLAT also provides functionality to use short inexact match (one mismatch allowed) or two exact matches slight differing in diagonal values. For longer reads, both these techniques result in a very large number of candidate regions. AGILE filters the candidates by allowing longer seeds (upto bp). It also uses a cutoff on the minimum number of contiguous exact seed matches and larger cutoff on the minimum number of exact matches with equal or slightly different diagonal values. To the best of our knowledge, no other tool uses multiple perfect match filtering criteria with a cutoff of more than matches. This is similar to using a long gapped seed. BWA-SW also uses long gapped seeds for the similar reason. The main difference between BWA-SW and AGILE is the way long gapped seed is implemented. While BWA- SW uses Prefix Trie and Prefix DAWG, AGILE relies on a much simpler data structure q-gram index (hash table) and diagonal coordinates. Both BWA-SW and AGILE further use different heuristics in order to reduce the search space. The heuristics used by AGILE are similar to the ones used in string matching or sequence mapping algorithms. For each heuristic, AGILE adapts it to the problem of long read mapping. Customized q-gram filtering is an example. Q-gram filtering is a well known technique for filtering out unwanted regions. However to the best of our knowledge no algorithm has used q-gram filtering with the cutoff customized for each read. Another major contribution of AGILE is the gradual increase of the allowed error rate till we find a match. Most tools use the allowed error rate to decide thresholds for pruning the search space. However, many reads have a small number of errors. Since these tools use a fixed value of the allowed error rate, they end up using a larger value of the allowed error rate for all reads in order to also map the reads with larger errors, resulting in a loss of efficiency. Hence gradually increasing the allowed error rate in mapping can have tremendous effects on the efficiency of these tools. CONCLUSION Advances in sequencing techniques necessitate the development of high performance, scalable algorithms to extract biologically relevant information from these datasets. Research on developing sequence mapping algorithms has been largely focused on mapping short reads, and little work has been done for longer reads. AGILE is a hash based sequence mapping algorithm that rapidly maps long reads using efficient heuristics to optimize different steps of the mapping process. AGILE can handle very large genome sizes and read lengths. It is flexible in that it allows a large number of mismatches and insertions/deletions in mapping and provides command line parameters to control every step of the mapping process. The best sensitivity and specificity of AGILE is achieved when the reads are longer and the error rate is small. Considering that with the improvement in sequencing technology, the read lengths will increase further and the error rates will decrease, AGILE should be even more useful in the future. ACKNOWLEDGEMENT Funding: This work was supported in part by NSF award numbers: CCF-0, SDCI OCI-0, CNS-0, IIS-0, and HECURA This work was also partially supported by DOE grants DE-FC0-0ER0 and DE-FG0-0ER. REFERENCES []Hunting Disease Origins with Whole-Genome Sequencing []Mosaik: [] Sequencers System Features - []S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic local alignment search tool. J Mol Biol, ():0, October 0. []W. J. Kent. Blat the blast-like alignment tool. Genome Res, ():, April 00. []B. Langmead, C. Trapnell, M. Pop, and S. Salzberg. Ultrafast and memoryefficient alignment of short dna sequences to the human genome. Genome Biology, (), 00. []H. Li and R. Durbin. Fast and accurate long read alignment with burrows-wheeler transform. Bioinformatics, pages, January 0. []H. Li, J. Ruan, and R. Durbin. Mapping short dna sequencing reads and calling variants using mapping quality scores. Genome research, pages, August 00. []R. Li, Y. Li, K. Kristiansen, and J. Wang. Soap: short oligonucleotide alignment program. Bioinformatics, pages, January 00. []S. Misra, R. Narayanan, S. Lin, and A. Choudhary. Fangs: High speed sequence mapping for next generation sequencers. In Proceedings of ACM Symposium of Applied Computing (ACM SAC), Sierre, Switzerland, 00. []S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, ():, March 0. []Z. Ning, A. J. Cox, and J. C. Mullikin. Ssaha: A fast search method for large dna databases. Genome Res., ():, 00. []K. L. Patrick. life sciences: Illuminating the future of genome sequencing and personalized medicine. Yale J Biol Med., 0():, Dec 00. []W. R. Pearson and D. J. Lipman. Improved tools for biological sequence comparison. Proc Natl Acad Sci U S A, ():, April. []P. Pevzner and M. Waterman. Multiple filtration and approximate pattern matching. Algorithmica, :,. []K. R. Rasmussen, J. Stoye, and E. W. Myers. Efficient q-gram filters for finding all epsilon-matches over a given length. Journal of Computational Biology, (): 0, 00. []J. M. Rothberg and J. H. Leamon. The development and impact of sequencing. Nature Biotechnology, ():, 00. []S. M. Rumble, P. Lacroute, A. V. Dalca, M. Fiume, A. Sidow, and M. Brudno. Shrimp: Accurate mapping of short color-space reads. PLoS Comput Biol, ():e00, []A. D. Smith, Z. Xuan, and M. Q. Zhang. Using quality scores and longer reads improves accuracy of solexa read mapping. BMC Bioinformatics, :, February 00. [0]T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, :, March.

24 Grundlagen der Bioinformatik, SS 10, D. Huson, April 26, This lecture is based on the following papers, which are all recommended reading:

24 Grundlagen der Bioinformatik, SS 10, D. Huson, April 26, This lecture is based on the following papers, which are all recommended reading: 24 Grundlagen der Bioinformatik, SS 10, D. Huson, April 26, 2010 3 BLAST and FASTA This lecture is based on the following papers, which are all recommended reading: D.J. Lipman and W.R. Pearson, Rapid

More information

An Analysis of Pairwise Sequence Alignment Algorithm Complexities: Needleman-Wunsch, Smith-Waterman, FASTA, BLAST and Gapped BLAST

An Analysis of Pairwise Sequence Alignment Algorithm Complexities: Needleman-Wunsch, Smith-Waterman, FASTA, BLAST and Gapped BLAST An Analysis of Pairwise Sequence Alignment Algorithm Complexities: Needleman-Wunsch, Smith-Waterman, FASTA, BLAST and Gapped BLAST Alexander Chan 5075504 Biochemistry 218 Final Project An Analysis of Pairwise

More information

As of August 15, 2008, GenBank contained bases from reported sequences. The search procedure should be

As of August 15, 2008, GenBank contained bases from reported sequences. The search procedure should be 48 Bioinformatics I, WS 09-10, S. Henz (script by D. Huson) November 26, 2009 4 BLAST and BLAT Outline of the chapter: 1. Heuristics for the pairwise local alignment of two sequences 2. BLAST: search and

More information

OPEN MP-BASED PARALLEL AND SCALABLE GENETIC SEQUENCE ALIGNMENT

OPEN MP-BASED PARALLEL AND SCALABLE GENETIC SEQUENCE ALIGNMENT OPEN MP-BASED PARALLEL AND SCALABLE GENETIC SEQUENCE ALIGNMENT Asif Ali Khan*, Laiq Hassan*, Salim Ullah* ABSTRACT: In bioinformatics, sequence alignment is a common and insistent task. Biologists align

More information

Computational Molecular Biology

Computational Molecular Biology Computational Molecular Biology Erwin M. Bakker Lecture 3, mainly from material by R. Shamir [2] and H.J. Hoogeboom [4]. 1 Pairwise Sequence Alignment Biological Motivation Algorithmic Aspect Recursive

More information

Performance analysis of parallel de novo genome assembly in shared memory system

Performance analysis of parallel de novo genome assembly in shared memory system IOP Conference Series: Earth and Environmental Science PAPER OPEN ACCESS Performance analysis of parallel de novo genome assembly in shared memory system To cite this article: Syam Budi Iryanto et al 2018

More information

Accurate Long-Read Alignment using Similarity Based Multiple Pattern Alignment and Prefix Tree Indexing

Accurate Long-Read Alignment using Similarity Based Multiple Pattern Alignment and Prefix Tree Indexing Proposal for diploma thesis Accurate Long-Read Alignment using Similarity Based Multiple Pattern Alignment and Prefix Tree Indexing Astrid Rheinländer 01-09-2010 Supervisor: Prof. Dr. Ulf Leser Motivation

More information

A Nearest Neighbors Algorithm for Strings. R. Lederman Technical Report YALEU/DCS/TR-1453 April 5, 2012

A Nearest Neighbors Algorithm for Strings. R. Lederman Technical Report YALEU/DCS/TR-1453 April 5, 2012 A randomized algorithm is presented for fast nearest neighbors search in libraries of strings. The algorithm is discussed in the context of one of the practical applications: aligning DNA reads to a reference

More information

SlopMap: a software application tool for quick and flexible identification of similar sequences using exact k-mer matching

SlopMap: a software application tool for quick and flexible identification of similar sequences using exact k-mer matching SlopMap: a software application tool for quick and flexible identification of similar sequences using exact k-mer matching Ilya Y. Zhbannikov 1, Samuel S. Hunter 1,2, Matthew L. Settles 1,2, and James

More information

Running SNAP. The SNAP Team February 2012

Running SNAP. The SNAP Team February 2012 Running SNAP The SNAP Team February 2012 1 Introduction SNAP is a tool that is intended to serve as the read aligner in a gene sequencing pipeline. Its theory of operation is described in Faster and More

More information

BLAST & Genome assembly

BLAST & Genome assembly BLAST & Genome assembly Solon P. Pissis Tomáš Flouri Heidelberg Institute for Theoretical Studies May 15, 2014 1 BLAST What is BLAST? The algorithm 2 Genome assembly De novo assembly Mapping assembly 3

More information

COS 551: Introduction to Computational Molecular Biology Lecture: Oct 17, 2000 Lecturer: Mona Singh Scribe: Jacob Brenner 1. Database Searching

COS 551: Introduction to Computational Molecular Biology Lecture: Oct 17, 2000 Lecturer: Mona Singh Scribe: Jacob Brenner 1. Database Searching COS 551: Introduction to Computational Molecular Biology Lecture: Oct 17, 2000 Lecturer: Mona Singh Scribe: Jacob Brenner 1 Database Searching In database search, we typically have a large sequence database

More information

pfangs: Parallel High Speed Sequence Mapping for Next Generation 454-Roche Sequencing Reads

pfangs: Parallel High Speed Sequence Mapping for Next Generation 454-Roche Sequencing Reads pfangs: Parallel High Speed Sequence Mapping for Next Generation 454-Roche Sequencing Reads Sanchit Misra 1, Ramanathan Narayanan 2, Wei-keng Liao 3, Alok Choudhary 4 Electrical Engineering and Computer

More information

BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha

BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS 466 Saurabh Sinha Motivation Sequence homology to a known protein suggest function of newly sequenced protein Bioinformatics

More information

BLAST & Genome assembly

BLAST & Genome assembly BLAST & Genome assembly Solon P. Pissis Tomáš Flouri Heidelberg Institute for Theoretical Studies November 17, 2012 1 Introduction Introduction 2 BLAST What is BLAST? The algorithm 3 Genome assembly De

More information

On the Efficacy of Haskell for High Performance Computational Biology

On the Efficacy of Haskell for High Performance Computational Biology On the Efficacy of Haskell for High Performance Computational Biology Jacqueline Addesa Academic Advisors: Jeremy Archuleta, Wu chun Feng 1. Problem and Motivation Biologists can leverage the power of

More information

Comparative Analysis of Protein Alignment Algorithms in Parallel environment using CUDA

Comparative Analysis of Protein Alignment Algorithms in Parallel environment using CUDA Comparative Analysis of Protein Alignment Algorithms in Parallel environment using BLAST versus Smith-Waterman Shadman Fahim shadmanbracu09@gmail.com Shehabul Hossain rudrozzal@gmail.com Gulshan Jubaed

More information

Kart: a divide-and-conquer algorithm for NGS read alignment

Kart: a divide-and-conquer algorithm for NGS read alignment Bioinformatics, 33(15), 2017, 2281 2287 doi: 10.1093/bioinformatics/btx189 Advance Access Publication Date: 4 April 2017 Original Paper Sequence analysis Kart: a divide-and-conquer algorithm for NGS read

More information

Running SNAP. The SNAP Team October 2012

Running SNAP. The SNAP Team October 2012 Running SNAP The SNAP Team October 2012 1 Introduction SNAP is a tool that is intended to serve as the read aligner in a gene sequencing pipeline. Its theory of operation is described in Faster and More

More information

FastA & the chaining problem

FastA & the chaining problem FastA & the chaining problem We will discuss: Heuristics used by the FastA program for sequence alignment Chaining problem 1 Sources for this lecture: Lectures by Volker Heun, Daniel Huson and Knut Reinert,

More information

Review of Recent NGS Short Reads Alignment Tools BMI-231 final project, Chenxi Chen Spring 2014

Review of Recent NGS Short Reads Alignment Tools BMI-231 final project, Chenxi Chen Spring 2014 Review of Recent NGS Short Reads Alignment Tools BMI-231 final project, Chenxi Chen Spring 2014 Deciphering the information contained in DNA sequences began decades ago since the time of Sanger sequencing.

More information

FastA and the chaining problem, Gunnar Klau, December 1, 2005, 10:

FastA and the chaining problem, Gunnar Klau, December 1, 2005, 10: FastA and the chaining problem, Gunnar Klau, December 1, 2005, 10:56 4001 4 FastA and the chaining problem We will discuss: Heuristics used by the FastA program for sequence alignment Chaining problem

More information

The Effect of Inverse Document Frequency Weights on Indexed Sequence Retrieval. Kevin C. O'Kane. Department of Computer Science

The Effect of Inverse Document Frequency Weights on Indexed Sequence Retrieval. Kevin C. O'Kane. Department of Computer Science The Effect of Inverse Document Frequency Weights on Indexed Sequence Retrieval Kevin C. O'Kane Department of Computer Science The University of Northern Iowa Cedar Falls, Iowa okane@cs.uni.edu http://www.cs.uni.edu/~okane

More information

Lectures by Volker Heun, Daniel Huson and Knut Reinert, in particular last years lectures

Lectures by Volker Heun, Daniel Huson and Knut Reinert, in particular last years lectures 4 FastA and the chaining problem We will discuss: Heuristics used by the FastA program for sequence alignment Chaining problem 4.1 Sources for this lecture Lectures by Volker Heun, Daniel Huson and Knut

More information

From Smith-Waterman to BLAST

From Smith-Waterman to BLAST From Smith-Waterman to BLAST Jeremy Buhler July 23, 2015 Smith-Waterman is the fundamental tool that we use to decide how similar two sequences are. Isn t that all that BLAST does? In principle, it is

More information

Short Read Alignment. Mapping Reads to a Reference

Short Read Alignment. Mapping Reads to a Reference Short Read Alignment Mapping Reads to a Reference Brandi Cantarel, Ph.D. & Daehwan Kim, Ph.D. BICF 05/2018 Introduction to Mapping Short Read Aligners DNA vs RNA Alignment Quality Pitfalls and Improvements

More information

Jyoti Lakhani 1, Ajay Khunteta 2, Dharmesh Harwani *3 1 Poornima University, Jaipur & Maharaja Ganga Singh University, Bikaner, Rajasthan, India

Jyoti Lakhani 1, Ajay Khunteta 2, Dharmesh Harwani *3 1 Poornima University, Jaipur & Maharaja Ganga Singh University, Bikaner, Rajasthan, India International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2017 IJSRCSEIT Volume 2 Issue 6 ISSN : 2456-3307 Improvisation of Global Pairwise Sequence Alignment

More information

SEASHORE / SARUMAN. Short Read Matching using GPU Programming. Tobias Jakobi

SEASHORE / SARUMAN. Short Read Matching using GPU Programming. Tobias Jakobi SEASHORE SARUMAN Summary 1 / 24 SEASHORE / SARUMAN Short Read Matching using GPU Programming Tobias Jakobi Center for Biotechnology (CeBiTec) Bioinformatics Resource Facility (BRF) Bielefeld University

More information

ON HEURISTIC METHODS IN NEXT-GENERATION SEQUENCING DATA ANALYSIS

ON HEURISTIC METHODS IN NEXT-GENERATION SEQUENCING DATA ANALYSIS ON HEURISTIC METHODS IN NEXT-GENERATION SEQUENCING DATA ANALYSIS Ivan Vogel Doctoral Degree Programme (1), FIT BUT E-mail: xvogel01@stud.fit.vutbr.cz Supervised by: Jaroslav Zendulka E-mail: zendulka@fit.vutbr.cz

More information

Bioinformatics explained: BLAST. March 8, 2007

Bioinformatics explained: BLAST. March 8, 2007 Bioinformatics Explained Bioinformatics explained: BLAST March 8, 2007 CLC bio Gustav Wieds Vej 10 8000 Aarhus C Denmark Telephone: +45 70 22 55 09 Fax: +45 70 22 55 19 www.clcbio.com info@clcbio.com Bioinformatics

More information

Masher: Mapping Long(er) Reads with Hash-based Genome Indexing on GPUs

Masher: Mapping Long(er) Reads with Hash-based Genome Indexing on GPUs Masher: Mapping Long(er) Reads with Hash-based Genome Indexing on GPUs Anas Abu-Doleh 1,2, Erik Saule 1, Kamer Kaya 1 and Ümit V. Çatalyürek 1,2 1 Department of Biomedical Informatics 2 Department of Electrical

More information

Bioinformatics for Biologists

Bioinformatics for Biologists Bioinformatics for Biologists Sequence Analysis: Part I. Pairwise alignment and database searching Fran Lewitter, Ph.D. Director Bioinformatics & Research Computing Whitehead Institute Topics to Cover

More information

GPUBwa -Parallelization of Burrows Wheeler Aligner using Graphical Processing Units

GPUBwa -Parallelization of Burrows Wheeler Aligner using Graphical Processing Units GPUBwa -Parallelization of Burrows Wheeler Aligner using Graphical Processing Units Abstract A very popular discipline in bioinformatics is Next-Generation Sequencing (NGS) or DNA sequencing. It specifies

More information

FASTA. Besides that, FASTA package provides SSEARCH, an implementation of the optimal Smith- Waterman algorithm.

FASTA. Besides that, FASTA package provides SSEARCH, an implementation of the optimal Smith- Waterman algorithm. FASTA INTRODUCTION Definition (by David J. Lipman and William R. Pearson in 1985) - Compares a sequence of protein to another sequence or database of a protein, or a sequence of DNA to another sequence

More information

Highly Scalable and Accurate Seeds for Subsequence Alignment

Highly Scalable and Accurate Seeds for Subsequence Alignment Highly Scalable and Accurate Seeds for Subsequence Alignment Abhijit Pol Tamer Kahveci Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA, 32611

More information

Introduction to Computational Molecular Biology

Introduction to Computational Molecular Biology 18.417 Introduction to Computational Molecular Biology Lecture 13: October 21, 2004 Scribe: Eitan Reich Lecturer: Ross Lippert Editor: Peter Lee 13.1 Introduction We have been looking at algorithms to

More information

Read Mapping. Slides by Carl Kingsford

Read Mapping. Slides by Carl Kingsford Read Mapping Slides by Carl Kingsford Bowtie Ultrafast and memory-efficient alignment of short DNA sequences to the human genome Ben Langmead, Cole Trapnell, Mihai Pop and Steven L Salzberg, Genome Biology

More information

BIOINFORMATICS. Multiple spaced seeds for homology search

BIOINFORMATICS. Multiple spaced seeds for homology search BIOINFORMATICS Vol. 00 no. 00 2007 pages 1-9 Sequence Analysis Multiple spaced seeds for homology search Lucian Ilie 1, and Silvana Ilie 2 1 Department of Computer Science, University of Western Ontario,

More information

Under the Hood of Alignment Algorithms for NGS Researchers

Under the Hood of Alignment Algorithms for NGS Researchers Under the Hood of Alignment Algorithms for NGS Researchers April 16, 2014 Gabe Rudy VP of Product Development Golden Helix Questions during the presentation Use the Questions pane in your GoToWebinar window

More information

Long Read RNA-seq Mapper

Long Read RNA-seq Mapper UNIVERSITY OF ZAGREB FACULTY OF ELECTRICAL ENGENEERING AND COMPUTING MASTER THESIS no. 1005 Long Read RNA-seq Mapper Josip Marić Zagreb, February 2015. Table of Contents 1. Introduction... 1 2. RNA Sequencing...

More information

Accelrys Pipeline Pilot and HP ProLiant servers

Accelrys Pipeline Pilot and HP ProLiant servers Accelrys Pipeline Pilot and HP ProLiant servers A performance overview Technical white paper Table of contents Introduction... 2 Accelrys Pipeline Pilot benchmarks on HP ProLiant servers... 2 NGS Collection

More information

USING AN EXTENDED SUFFIX TREE TO SPEED-UP SEQUENCE ALIGNMENT

USING AN EXTENDED SUFFIX TREE TO SPEED-UP SEQUENCE ALIGNMENT IADIS International Conference Applied Computing 2006 USING AN EXTENDED SUFFIX TREE TO SPEED-UP SEQUENCE ALIGNMENT Divya R. Singh Software Engineer Microsoft Corporation, Redmond, WA 98052, USA Abdullah

More information

Masher: Mapping Long(er) Reads with Hash-based Genome Indexing on GPUs

Masher: Mapping Long(er) Reads with Hash-based Genome Indexing on GPUs Masher: Mapping Long(er) Reads with Hash-based Genome Indexing on GPUs Anas Abu-Doleh Erik Saule Kamer Kaya Ümit V. Çatalyürek Dept. of Biomedical Informatics Dept. of Electrical and Computer Engineering

More information

Bioinformatics explained: Smith-Waterman

Bioinformatics explained: Smith-Waterman Bioinformatics Explained Bioinformatics explained: Smith-Waterman May 1, 2007 CLC bio Gustav Wieds Vej 10 8000 Aarhus C Denmark Telephone: +45 70 22 55 09 Fax: +45 70 22 55 19 www.clcbio.com info@clcbio.com

More information

Biology 644: Bioinformatics

Biology 644: Bioinformatics Find the best alignment between 2 sequences with lengths n and m, respectively Best alignment is very dependent upon the substitution matrix and gap penalties The Global Alignment Problem tries to find

More information

Algorithms in Bioinformatics: A Practical Introduction. Database Search

Algorithms in Bioinformatics: A Practical Introduction. Database Search Algorithms in Bioinformatics: A Practical Introduction Database Search Biological databases Biological data is double in size every 15 or 16 months Increasing in number of queries: 40,000 queries per day

More information

A Fast Read Alignment Method based on Seed-and-Vote For Next GenerationSequencing

A Fast Read Alignment Method based on Seed-and-Vote For Next GenerationSequencing A Fast Read Alignment Method based on Seed-and-Vote For Next GenerationSequencing Song Liu 1,2, Yi Wang 3, Fei Wang 1,2 * 1 Shanghai Key Lab of Intelligent Information Processing, Shanghai, China. 2 School

More information

GSNAP: Fast and SNP-tolerant detection of complex variants and splicing in short reads by Thomas D. Wu and Serban Nacu

GSNAP: Fast and SNP-tolerant detection of complex variants and splicing in short reads by Thomas D. Wu and Serban Nacu GSNAP: Fast and SNP-tolerant detection of complex variants and splicing in short reads by Thomas D. Wu and Serban Nacu Matt Huska Freie Universität Berlin Computational Methods for High-Throughput Omics

More information

Alignment of Long Sequences

Alignment of Long Sequences Alignment of Long Sequences BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 2009 Mark Craven craven@biostat.wisc.edu Pairwise Whole Genome Alignment: Task Definition Given a pair of genomes (or other large-scale

More information

GPU Accelerated API for Alignment of Genomics Sequencing Data

GPU Accelerated API for Alignment of Genomics Sequencing Data GPU Accelerated API for Alignment of Genomics Sequencing Data Nauman Ahmed, Hamid Mushtaq, Koen Bertels and Zaid Al-Ars Computer Engineering Laboratory, Delft University of Technology, Delft, The Netherlands

More information

NGS Data and Sequence Alignment

NGS Data and Sequence Alignment Applications and Servers SERVER/REMOTE Compute DB WEB Data files NGS Data and Sequence Alignment SSH WEB SCP Manpreet S. Katari App Aug 11, 2016 Service Terminal IGV Data files Window Personal Computer/Local

More information

ASAP - Allele-specific alignment pipeline

ASAP - Allele-specific alignment pipeline ASAP - Allele-specific alignment pipeline Jan 09, 2012 (1) ASAP - Quick Reference ASAP needs a working version of Perl and is run from the command line. Furthermore, Bowtie needs to be installed on your

More information

IDBA - A Practical Iterative de Bruijn Graph De Novo Assembler

IDBA - A Practical Iterative de Bruijn Graph De Novo Assembler IDBA - A Practical Iterative de Bruijn Graph De Novo Assembler Yu Peng, Henry Leung, S.M. Yiu, Francis Y.L. Chin Department of Computer Science, The University of Hong Kong Pokfulam Road, Hong Kong {ypeng,

More information

Compares a sequence of protein to another sequence or database of a protein, or a sequence of DNA to another sequence or library of DNA.

Compares a sequence of protein to another sequence or database of a protein, or a sequence of DNA to another sequence or library of DNA. Compares a sequence of protein to another sequence or database of a protein, or a sequence of DNA to another sequence or library of DNA. Fasta is used to compare a protein or DNA sequence to all of the

More information

Achieving High Throughput Sequencing with Graphics Processing Units

Achieving High Throughput Sequencing with Graphics Processing Units Achieving High Throughput Sequencing with Graphics Processing Units Su Chen 1, Chaochao Zhang 1, Feng Shen 1, Ling Bai 1, Hai Jiang 1, and Damir Herman 2 1 Department of Computer Science, Arkansas State

More information

A CAM(Content Addressable Memory)-based architecture for molecular sequence matching

A CAM(Content Addressable Memory)-based architecture for molecular sequence matching A CAM(Content Addressable Memory)-based architecture for molecular sequence matching P.K. Lala 1 and J.P. Parkerson 2 1 Department Electrical Engineering, Texas A&M University, Texarkana, Texas, USA 2

More information

Lecture Overview. Sequence search & alignment. Searching sequence databases. Sequence Alignment & Search. Goals: Motivations:

Lecture Overview. Sequence search & alignment. Searching sequence databases. Sequence Alignment & Search. Goals: Motivations: Lecture Overview Sequence Alignment & Search Karin Verspoor, Ph.D. Faculty, Computational Bioscience Program University of Colorado School of Medicine With credit and thanks to Larry Hunter for creating

More information

A Design of a Hybrid System for DNA Sequence Alignment

A Design of a Hybrid System for DNA Sequence Alignment IMECS 2008, 9-2 March, 2008, Hong Kong A Design of a Hybrid System for DNA Sequence Alignment Heba Khaled, Hossam M. Faheem, Tayseer Hasan, Saeed Ghoneimy Abstract This paper describes a parallel algorithm

More information

BLAST, Profile, and PSI-BLAST

BLAST, Profile, and PSI-BLAST BLAST, Profile, and PSI-BLAST Jianlin Cheng, PhD School of Electrical Engineering and Computer Science University of Central Florida 26 Free for academic use Copyright @ Jianlin Cheng & original sources

More information

Lam, TW; Li, R; Tam, A; Wong, S; Wu, E; Yiu, SM.

Lam, TW; Li, R; Tam, A; Wong, S; Wu, E; Yiu, SM. Title High throughput short read alignment via bi-directional BWT Author(s) Lam, TW; Li, R; Tam, A; Wong, S; Wu, E; Yiu, SM Citation The IEEE International Conference on Bioinformatics and Biomedicine

More information

Notes on Dynamic-Programming Sequence Alignment

Notes on Dynamic-Programming Sequence Alignment Notes on Dynamic-Programming Sequence Alignment Introduction. Following its introduction by Needleman and Wunsch (1970), dynamic programming has become the method of choice for rigorous alignment of DNA

More information

Distributed Protein Sequence Alignment

Distributed Protein Sequence Alignment Distributed Protein Sequence Alignment ABSTRACT J. Michael Meehan meehan@wwu.edu James Hearne hearne@wwu.edu Given the explosive growth of biological sequence databases and the computational complexity

More information

Omega: an Overlap-graph de novo Assembler for Metagenomics

Omega: an Overlap-graph de novo Assembler for Metagenomics Omega: an Overlap-graph de novo Assembler for Metagenomics B a h l e l H a i d e r, Ta e - H y u k A h n, B r i a n B u s h n e l l, J u a n j u a n C h a i, A l e x C o p e l a n d, C h o n g l e Pa n

More information

High-performance short sequence alignment with GPU acceleration

High-performance short sequence alignment with GPU acceleration Distrib Parallel Databases (2012) 30:385 399 DOI 10.1007/s10619-012-7099-x High-performance short sequence alignment with GPU acceleration Mian Lu Yuwei Tan Ge Bai Qiong Luo Published online: 10 August

More information

Sequence Alignment & Search

Sequence Alignment & Search Sequence Alignment & Search Karin Verspoor, Ph.D. Faculty, Computational Bioscience Program University of Colorado School of Medicine With credit and thanks to Larry Hunter for creating the first version

More information

Welcome to MAPHiTS (Mapping Analysis Pipeline for High-Throughput Sequences) tutorial page.

Welcome to MAPHiTS (Mapping Analysis Pipeline for High-Throughput Sequences) tutorial page. Welcome to MAPHiTS (Mapping Analysis Pipeline for High-Throughput Sequences) tutorial page. In this page you will learn to use the tools of the MAPHiTS suite. A little advice before starting : rename your

More information

AMAS: optimizing the partition and filtration of adaptive seeds to speed up read mapping

AMAS: optimizing the partition and filtration of adaptive seeds to speed up read mapping AMAS: optimizing the partition and filtration of adaptive seeds to speed up read mapping Ngoc Hieu Tran 1, * Email: nhtran@ntu.edu.sg Xin Chen 1 Email: chenxin@ntu.edu.sg 1 School of Physical and Mathematical

More information

CISC 636 Computational Biology & Bioinformatics (Fall 2016)

CISC 636 Computational Biology & Bioinformatics (Fall 2016) CISC 636 Computational Biology & Bioinformatics (Fall 2016) Sequence pairwise alignment Score statistics: E-value and p-value Heuristic algorithms: BLAST and FASTA Database search: gene finding and annotations

More information

Efficient Alignment of Next Generation Sequencing Data Using MapReduce on the Cloud

Efficient Alignment of Next Generation Sequencing Data Using MapReduce on the Cloud 212 Cairo International Biomedical Engineering Conference (CIBEC) Cairo, Egypt, December 2-21, 212 Efficient Alignment of Next Generation Sequencing Data Using MapReduce on the Cloud Rawan AlSaad and Qutaibah

More information

Pacific Symposium on Biocomputing 13: (2008) PASH 2.0: SCALEABLE SEQUENCE ANCHORING FOR NEXT-GENERATION SEQUENCING TECHNOLOGIES

Pacific Symposium on Biocomputing 13: (2008) PASH 2.0: SCALEABLE SEQUENCE ANCHORING FOR NEXT-GENERATION SEQUENCING TECHNOLOGIES PASH 2.0: SCALEABLE SEQUENCE ANCHORING FOR NEXT-GENERATION SEQUENCING TECHNOLOGIES CRISTIAN COARFA Human Genome Sequencing Center, Department of Molecular and Human Genetics, Baylor College of Medicine,

More information

Acceleration of Algorithm of Smith-Waterman Using Recursive Variable Expansion.

Acceleration of Algorithm of Smith-Waterman Using Recursive Variable Expansion. www.ijarcet.org 54 Acceleration of Algorithm of Smith-Waterman Using Recursive Variable Expansion. Hassan Kehinde Bello and Kazeem Alagbe Gbolagade Abstract Biological sequence alignment is becoming popular

More information

IDBA A Practical Iterative de Bruijn Graph De Novo Assembler

IDBA A Practical Iterative de Bruijn Graph De Novo Assembler IDBA A Practical Iterative de Bruijn Graph De Novo Assembler Yu Peng, Henry C.M. Leung, S.M. Yiu, and Francis Y.L. Chin Department of Computer Science, The University of Hong Kong Pokfulam Road, Hong Kong

More information

Proceedings of the 11 th International Conference for Informatics and Information Technology

Proceedings of the 11 th International Conference for Informatics and Information Technology Proceedings of the 11 th International Conference for Informatics and Information Technology Held at Hotel Molika, Bitola, Macedonia 11-13th April, 2014 Editors: Vangel V. Ajanovski Gjorgji Madjarov ISBN

More information

ICB Fall G4120: Introduction to Computational Biology. Oliver Jovanovic, Ph.D. Columbia University Department of Microbiology

ICB Fall G4120: Introduction to Computational Biology. Oliver Jovanovic, Ph.D. Columbia University Department of Microbiology ICB Fall 2008 G4120: Computational Biology Oliver Jovanovic, Ph.D. Columbia University Department of Microbiology Copyright 2008 Oliver Jovanovic, All Rights Reserved. The Digital Language of Computers

More information

Darwin: A Genomic Co-processor gives up to 15,000X speedup on long read assembly (To appear in ASPLOS 2018)

Darwin: A Genomic Co-processor gives up to 15,000X speedup on long read assembly (To appear in ASPLOS 2018) Darwin: A Genomic Co-processor gives up to 15,000X speedup on long read assembly (To appear in ASPLOS 2018) Yatish Turakhia EE PhD candidate Stanford University Prof. Bill Dally (Electrical Engineering

More information

Heterogeneous Hardware/Software Acceleration of the BWA-MEM DNA Alignment Algorithm

Heterogeneous Hardware/Software Acceleration of the BWA-MEM DNA Alignment Algorithm Heterogeneous Hardware/Software Acceleration of the BWA-MEM DNA Alignment Algorithm Nauman Ahmed, Vlad-Mihai Sima, Ernst Houtgast, Koen Bertels and Zaid Al-Ars Computer Engineering Lab, Delft University

More information

Database Searching Using BLAST

Database Searching Using BLAST Mahidol University Objectives SCMI512 Molecular Sequence Analysis Database Searching Using BLAST Lecture 2B After class, students should be able to: explain the FASTA algorithm for database searching explain

More information

A Coprocessor Architecture for Fast Protein Structure Prediction

A Coprocessor Architecture for Fast Protein Structure Prediction A Coprocessor Architecture for Fast Protein Structure Prediction M. Marolia, R. Khoja, T. Acharya, C. Chakrabarti Department of Electrical Engineering Arizona State University, Tempe, USA. Abstract Predicting

More information

High-throughput sequencing: Alignment and related topic. Simon Anders EMBL Heidelberg

High-throughput sequencing: Alignment and related topic. Simon Anders EMBL Heidelberg High-throughput sequencing: Alignment and related topic Simon Anders EMBL Heidelberg Established platforms HTS Platforms Illumina HiSeq, ABI SOLiD, Roche 454 Newcomers: Benchtop machines 454 GS Junior,

More information

Dynamic Programming User Manual v1.0 Anton E. Weisstein, Truman State University Aug. 19, 2014

Dynamic Programming User Manual v1.0 Anton E. Weisstein, Truman State University Aug. 19, 2014 Dynamic Programming User Manual v1.0 Anton E. Weisstein, Truman State University Aug. 19, 2014 Dynamic programming is a group of mathematical methods used to sequentially split a complicated problem into

More information

Parallel Mapping Approaches for GNUMAP

Parallel Mapping Approaches for GNUMAP 2011 IEEE International Parallel & Distributed Processing Symposium Parallel Mapping Approaches for GNUMAP Nathan L. Clement, Mark J. Clement, Quinn Snell and W. Evan Johnson Department of Computer Science

More information

Building approximate overlap graphs for DNA assembly using random-permutations-based search.

Building approximate overlap graphs for DNA assembly using random-permutations-based search. An algorithm is presented for fast construction of graphs of reads, where an edge between two reads indicates an approximate overlap between the reads. Since the algorithm finds approximate overlaps directly,

More information

BIOL591: Introduction to Bioinformatics Alignment of pairs of sequences

BIOL591: Introduction to Bioinformatics Alignment of pairs of sequences BIOL591: Introduction to Bioinformatics Alignment of pairs of sequences Reading in text (Mount Bioinformatics): I must confess that the treatment in Mount of sequence alignment does not seem to me a model

More information

Research Article International Journals of Advanced Research in Computer Science and Software Engineering ISSN: X (Volume-7, Issue-6)

Research Article International Journals of Advanced Research in Computer Science and Software Engineering ISSN: X (Volume-7, Issue-6) International Journals of Advanced Research in Computer Science and Software Engineering ISSN: 77-18X (Volume-7, Issue-6) Research Article June 017 DDGARM: Dotlet Driven Global Alignment with Reduced Matrix

More information

EECS730: Introduction to Bioinformatics

EECS730: Introduction to Bioinformatics EECS730: Introduction to Bioinformatics Lecture 04: Variations of sequence alignments http://www.pitt.edu/~mcs2/teaching/biocomp/tutorials/global.html Slides adapted from Dr. Shaojie Zhang (University

More information

Next generation sequencing: assembly by mapping reads. Laurent Falquet, Vital-IT Helsinki, June 3, 2010

Next generation sequencing: assembly by mapping reads. Laurent Falquet, Vital-IT Helsinki, June 3, 2010 Next generation sequencing: assembly by mapping reads Laurent Falquet, Vital-IT Helsinki, June 3, 2010 Overview What is assembly by mapping? Methods BWT File formats Tools Issues Visualization Discussion

More information

S 4 : International Competition on Scalable String Similarity Search & Join. Ulf Leser, Humboldt-Universität zu Berlin

S 4 : International Competition on Scalable String Similarity Search & Join. Ulf Leser, Humboldt-Universität zu Berlin S 4 : International Competition on Scalable String Similarity Search & Join Ulf Leser, Humboldt-Universität zu Berlin S 4 Competition, not a workshop in the usual sense Idea: Try to find the best existing

More information

INTRODUCING NVBIO: HIGH PERFORMANCE PRIMITIVES FOR COMPUTATIONAL GENOMICS. Jonathan Cohen, NVIDIA Nuno Subtil, NVIDIA Jacopo Pantaleoni, NVIDIA

INTRODUCING NVBIO: HIGH PERFORMANCE PRIMITIVES FOR COMPUTATIONAL GENOMICS. Jonathan Cohen, NVIDIA Nuno Subtil, NVIDIA Jacopo Pantaleoni, NVIDIA INTRODUCING NVBIO: HIGH PERFORMANCE PRIMITIVES FOR COMPUTATIONAL GENOMICS Jonathan Cohen, NVIDIA Nuno Subtil, NVIDIA Jacopo Pantaleoni, NVIDIA SEQUENCING AND MOORE S LAW Slide courtesy Illumina DRAM I/F

More information

Heuristic methods for pairwise alignment:

Heuristic methods for pairwise alignment: Bi03c_1 Unit 03c: Heuristic methods for pairwise alignment: k-tuple-methods k-tuple-methods for alignment of pairs of sequences Bi03c_2 dynamic programming is too slow for large databases Use heuristic

More information

Mapping Reads to Reference Genome

Mapping Reads to Reference Genome Mapping Reads to Reference Genome DNA carries genetic information DNA is a double helix of two complementary strands formed by four nucleotides (bases): Adenine, Cytosine, Guanine and Thymine 2 of 31 Gene

More information

) I R L Press Limited, Oxford, England. The protein identification resource (PIR)

) I R L Press Limited, Oxford, England. The protein identification resource (PIR) Volume 14 Number 1 Volume 1986 Nucleic Acids Research 14 Number 1986 Nucleic Acids Research The protein identification resource (PIR) David G.George, Winona C.Barker and Lois T.Hunt National Biomedical

More information

Dynamic Programming & Smith-Waterman algorithm

Dynamic Programming & Smith-Waterman algorithm m m Seminar: Classical Papers in Bioinformatics May 3rd, 2010 m m 1 2 3 m m Introduction m Definition is a method of solving problems by breaking them down into simpler steps problem need to contain overlapping

More information

FastCluster: a graph theory based algorithm for removing redundant sequences

FastCluster: a graph theory based algorithm for removing redundant sequences J. Biomedical Science and Engineering, 2009, 2, 621-625 doi: 10.4236/jbise.2009.28090 Published Online December 2009 (http://www.scirp.org/journal/jbise/). FastCluster: a graph theory based algorithm for

More information

QIAseq DNA V3 Panel Analysis Plugin USER MANUAL

QIAseq DNA V3 Panel Analysis Plugin USER MANUAL QIAseq DNA V3 Panel Analysis Plugin USER MANUAL User manual for QIAseq DNA V3 Panel Analysis 1.0.1 Windows, Mac OS X and Linux January 25, 2018 This software is for research purposes only. QIAGEN Aarhus

More information

Fast Sequence Alignment Method Using CUDA-enabled GPU

Fast Sequence Alignment Method Using CUDA-enabled GPU Fast Sequence Alignment Method Using CUDA-enabled GPU Yeim-Kuan Chang Department of Computer Science and Information Engineering National Cheng Kung University Tainan, Taiwan ykchang@mail.ncku.edu.tw De-Yu

More information

PROTEIN MULTIPLE ALIGNMENT MOTIVATION: BACKGROUND: Marina Sirota

PROTEIN MULTIPLE ALIGNMENT MOTIVATION: BACKGROUND: Marina Sirota Marina Sirota MOTIVATION: PROTEIN MULTIPLE ALIGNMENT To study evolution on the genetic level across a wide range of organisms, biologists need accurate tools for multiple sequence alignment of protein

More information

Research on Pairwise Sequence Alignment Needleman-Wunsch Algorithm

Research on Pairwise Sequence Alignment Needleman-Wunsch Algorithm 5th International Conference on Mechatronics, Materials, Chemistry and Computer Engineering (ICMMCCE 2017) Research on Pairwise Sequence Alignment Needleman-Wunsch Algorithm Xiantao Jiang1, a,*,xueliang

More information

USING BRAT-BW Table 1. Feature comparison of BRAT-bw, BRAT-large, Bismark and BS Seeker (as of on March, 2012)

USING BRAT-BW Table 1. Feature comparison of BRAT-bw, BRAT-large, Bismark and BS Seeker (as of on March, 2012) USING BRAT-BW-2.0.1 BRAT-bw is a tool for BS-seq reads mapping, i.e. mapping of bisulfite-treated sequenced reads. BRAT-bw is a part of BRAT s suit. Therefore, input and output formats for BRAT-bw are

More information

Short Read Alignment Algorithms

Short Read Alignment Algorithms Short Read Alignment Algorithms Raluca Gordân Department of Biostatistics and Bioinformatics Department of Computer Science Department of Molecular Genetics and Microbiology Center for Genomic and Computational

More information

Principles of Bioinformatics. BIO540/STA569/CSI660 Fall 2010

Principles of Bioinformatics. BIO540/STA569/CSI660 Fall 2010 Principles of Bioinformatics BIO540/STA569/CSI660 Fall 2010 Lecture 11 Multiple Sequence Alignment I Administrivia Administrivia The midterm examination will be Monday, October 18 th, in class. Closed

More information