Remote Homolog Detection Using Local Sequence Structure Correlations

Size: px
Start display at page:

Download "Remote Homolog Detection Using Local Sequence Structure Correlations"

Transcription

1 PROTEINS: Structure, Function, and Bioinformatics 57: (2004) Remote Homolog Detection Using Local Sequence Structure Correlations Yuna Hou, 1 * Wynne Hsu, 1 Mong Li Lee, 1 and Christopher Bystroff 2 1 School of Computing, National University of Singapore, Singapore 2 Department of Biology, Rensselaer Polytechnic Institute, Troy, New York ABSTRACT Remote homology detection refers to the detection of structural homology in proteins when there is little or no sequence similarity. In this article, we present a remote homolog detection method called SVM-HMMSTR that overcomes the reliance on detectable sequence similarity by transforming the sequences into strings of hidden Markov states that represent local folding motif patterns. These state strings are transformed into fixeddimension feature vectors for input to a support vector machine. Two sets of features are defined: an order-independent feature set that captures the amino acid and local structure composition; and an order-dependent feature set that captures the sequential ordering of the local structures. Tests using the Structural Classification of Proteins (SCOP) 1.53 data set show that the SVM-HMMSTR gives a significant improvement over several current methods. Proteins 2004;57: Wiley-Liss, Inc. Key words: remote homology; local structure; support vector machines; hidden Markov model; protein folding; I-sites; HMMSTR INTRODUCTION Breakthroughs in large-scale sequencing and the Human Genome Project have led to a surge in biological sequence information. Researchers are increasingly relying on computational techniques to cope with the massive amount of information generated. Homology detection is one such computational approach to interpret the protein sequences through the detection of homologous proteins. Early methods in homology detection are based on pairwise comparisons of protein sequences using dynamic programming algorithms such as the Needleman Wunsch 1 and Smith Waterman 2 algorithms. Popular search tools such as BLAST 3 and FASTA 4 are fast approximations of these dynamic programming algorithms. However, these pairwise comparison methods do not work well for remote homologies. In order to identify remote homologs, methods such as profiles for protein families, 5 hidden Markov models (HMMs), 6,7 and iterative methods such as PSI-BLAST 8 and SAM 9 have been introduced. The basic idea behind these methods is to generate a representative model for each protein family. Instead of comparing an unknown sequence to a specific protein sequence, we compare it to the generated model of the appropriate family. While these approaches provide better detection of the remote homologies as compared to the pairwise comparison methods, additional accuracy can be obtained by modeling the difference between positive and negative examples positive if they are in the family and negative otherwise. If a fixed number of descriptive parameters, or features, can be defined for each of the sequence families, then a general technique called a support vector machine (SVM) may be applied to find the boundaries between families in the space defined by the features. This method has been successfully applied to the homolog detection problem An SVM algorithm chooses a hyperplane through the feature space that has a maximum margin between positive and negative examples. Each hyperplane divides one class of data (all of the sequences families of one fold type) from all of the other classes. A new data point (an unknown sequence) can then be classified by finding its relationship with each of the hyperplanes. SVM classifiers are better generalizers than neural networks for small data sets. 15 The success of a SVM classification method depends on the choice of the feature set to describe each protein family. Methods that use only sequence information fail when the sequence similarity is very low, even if the two structures are very similar (see Fig. 1). In our earlier work, we noted that similar structures contain similar local structure motifs. Our first SVM method, SVM-I-sites, 16 demonstrated that the local structure content of a protein improves remote homolog detection, even without knowing the locations of the local features within the sequence. Local structure motifs were found using the I-sites library of sequence structure correlations. 17 The I-sites library contains 262 short-sequence patterns that each has a strong correlation with the three-dimensional (3D) structure, locally. Our tests showed that SVM-I-sites was comparable in detection accuracy and more efficient than the state-of-the-art method SVM-pairwise. 11 Further study revealed that I-sites motifs can occur in different orders along the sequence in different fold topologies. Thus, encoding structure features using only the Correspondence to: Yuna Hou, School of Computing, National University of Singapore, Singapore houyuna@comp.nus.edu.sg Received 18 January 2004; Accepted 28 April 2004 Published online 6 August 2004 in Wiley InterScience ( DOI: /prot WILEY-LISS, INC.

2 REMOTE HOMOLOG DETECTION 519 Fig. 1. Example of two proteins from SCOP superfamily The structures are remarkably similar (top figures), but the alignment below shows only poor sequence similarity (10% identity). composition of I-sites motifs may not uniquely define the global fold of the protein. The ordering of these motifs along the sequence also must play a role in dictating how the self-organization takes place during folding. Moreover, many of the I-sites motifs tend to overlap, resulting in redundant information. Bystroff et al. 18 described a hidden Markov model called HMMSTR (Hidden Markov Model for protein STRucture) that extends and generalizes the I-sites library. HMMSTR models I-sites motifs as words of a higher order grammatical structure. Probabilistic transitions between I-sites motifs along the chain are thought to capture the propagation of structure during the folding process. A typical transition would be from helix to helix-cap, or -strand to -turn. Thus HMMSTR encodes some of the sequence ordering of local motifs. HMMSTR also removes the redundancy inherent in the I-sites model and reduces the number of free parameters. For example, several I-sites motifs are used to represent the different register shifts of the amphipathic heptad repeat helix motif, while in HMMSTR these motifs are merged into a single, cyclic pathway of 7 Markov states. For each query protein sequence, the output of the HMMSTR algorithm is a matrix, 19 where each element is the probability that a residue at a given position is associated with one of the Markov states. The states are conserved in remote homologs, because the local structure elements are conserved, and the relative positions of the states along the sequence are approximately conserved, because remote homologs conserve topology. Both the state and position information are crucial for defining the similarity of two proteins. In this article, we investigate ways that the positionspecific local structure information in HMMSTR can be efficiently encoded as fixed-length feature vectors for an SVM. There are two issues to address here: 1. The length of matrix is variable, while the input to the SVM must be a fixed-length vector. We encode the sequence order of the local motifs by performing alignments with the database, using a novel HMMSTRbased similarity score. Each alignment score is one feature. 2. The matrices are high dimensional. This incurs high computational costs when encoding the state and position information. We overcome this problem by first performing a dimension reduction before attempting to align the matrices of two proteins. These feature vectors are subsequently used to train an SVM, SVM-HMMSTR. In short, for each protein, we obtain 2 sets of features. The first set aims to capture the amino acid and local structure composition, while the second set encodes the alignment scores. The new SVM classifier was tested on the Structural Classification of Proteins (SCOP) 1.53 data set and significantly outperformed existing methods. MATERIALS AND METHODS HMMSTR Model HMMSTR is a better model for local structure prediction than the I-sites library, improving 8-residue fragment prediction accuracy from 43% to 59%. 18 This is because HMMSTR models the possible ways of arranging sequence

3 520 Y. HOU ET AL. Fig. 2. HMMSTR model built from I-sites Library. Symbols represent hidden states: circles, predominantly helix; squares, strand; diamonds, loop or turn; yellow, glycine; magenta, proline; blue, nonpolar; green, polar; white, no predominant amino acid. Only high probability connections are shown (Reprinted by permission of the authors 18 ). structure motifs along the sequence, and because overlapping motifs have been merged, reducing redundancy and increasing the statistical significance of the amino acid profiles. The topology of HMMSTR is a highly branched and multicyclic network. Each of the 262 I-sites motifs is represented as a chain of Markov states, which contains information about the sequence and structure attributes of a single position in the motif. Adjacent positions are modeled by directionally linked states. A hierarchical merging of these chains of states, based on sequence and structure overlaps, resulted in a graph that contains almost all the motifs. The merged graph of I-sites motifs comprises a network of states connected by probabilistic transitions, or a HMM, as shown in Figure 2. In contrast to the more familiar family-specific HMMs, 20 HMMSTR captures simultaneously the recurrent local features of sequences and structures across all families of globular proteins. Each state in HMMSTR can produce, or emit, amino acids and structure symbols according to a probability distribution specific to that state. There are 4 probability distributions defined for the states in HMMSTR b, d, r, and c that describe the probability of observing a particular amino acid, secondary structure, backbone angle region, or structural context descriptor, respectively. This set of emission probabilities for a given state q i is collectively called B qi. The values b qi ( )(1 20) are associated with probabilities for the emission of amino acids. The values d qi ( ) (1 3) are the probabilities of emitting helix, strand, or loop, respectively. The values r qi ( ) (1 11) are the probabilities of emitting one of the 11 dihedral angle symbols. Finally, c qi ( ) (1 10) are probabilities of emitting 1 of 10 structural context symbols. The database used for training, evaluation, and testing of the HMMSTR is encoded as a linear sequence of amino acids and structural observables. The amino acid sequence data consist of a parent ( parent means the sequence

4 REMOTE HOMOLOG DETECTION 521 upon which the multiple alignment is based) amino acid sequence of known 3D structure and an amino acid profile obtained by alignment to the parent sequence. The amino acid of the parent sequence is denoted by O t, and the profile by O t (1 20). For each position t, there are 3 structural identifiers: 3-state secondary structure D t, discrete backbone angle region R t, and the context symbol C t (context symbols C t were assigned to strands and loops in the training sequences based on the nonlocal context of position t; for example, loops were hairpins, diverging turns, corners, one of two types of helix cap, or just coil ; also, strands could be middle of the sheet or end ). Therefore, any sequence s of length T is given by the values of the attributes at all positions s t {O t, O t, D t, R t, C t } (1 t T). HMMSTR models database sequences based on the notion of a path. A path is a sequence of states through the HMMs, denoted Q q 1 q 2 q T. The probability of a sequence s given the model, P(s ), is obtained by summing the relevant contributions from all possible paths Q: P s q1 B q1 s 1 a q1q 2 B q2 s 2 a qt 1q T B qt s T, (1) allq where a qiqj (1 i, j T) represents the probability of a transition from state q i to state q j, qi is the probability of initiating a sequence at state q 1 and B qi (s t ) is the probability of observing s t at state q i, which for observation of a single sequence is given by B qi s t d q i D t r qi R t c qi C t b qi O t. (2) Usually, only one of the structural emission symbols d, r, or c is included in B qi in any given training run. However, in principle, any combination could be used. For remote homology detection, we used a model trained on r for the prediction, because this is the most closely tied to the I-sites library representation. We found significant improvements in performance when we used amino acid profiles instead of single amino acid sequences for training and for subsequent predictions. For the probability of observing a given profile O t at position t in a sequence, we use the multinomial distribution, and the expression for B becomes 20 B qi s t r qi R t b qi Ncount O t. (3) 1 In this equation, Ncount is the depth of the multiple sequence alignment, which is roughly the number of homologous sequences in the alignment. However, this number is an artifact of uneven sampling of sequence families in the database. To give equal weight to all sequence families in the training set, Ncount was taken to be a fixed value. To use the HMMSTR model, we input a single sequence to predict its Markov states, as follows: We run PSI- BLAST against the Swissprot database to generate a multiple sequence alignment. Then, the multiple sequence alignment is converted to a sequence profile. Next, the profile is aligned to HMMSTR to get a probability distribution over all states at each position, that is, the matrix. This matrix is computed by the Forward Backward Algorithm 19 and describes the probability of each HMMSTR state at each position; that is, pq P q s,t, (4) for all the 281 HMMSTR states (1 q 281) and for all residues s t (1 t T, where T is the length of the protein). The matrix may be viewed as the translation of a given sequence to a language of I-sites motif descriptors. HMMSTR is a grammatical model for that motif language. Feature Extraction and Representation There are two schools of thought in defining the similarity of proteins. 21 The local view considers only 10 20% of residues to be critical, while the global view advocates that similarities should occur along the entire sequence. In order to accommodate both views, we define two sets of features: One feature set captures the composition of a protein in terms of the local structure motifs and amino acids regardless of their sequential ordering, and the other feature set only considers the longest conserved regions between two proteins by encoding both the composition and the arrangement order of the regions. These two feature sets are called order-independent and orderdependent, respectively. This section presents the details of our feature extraction and representation procedure. The objective is to capture both the state information and the positional information in terms of fixed length vectors to be used for training an SVM. Order-independent feature set The order-independent feature set aims to capture the composition of amino acids and local structure. Each feature is indexed by (x,s), where x represents the 20 amino acids and s is the 281 Markov states in HMMSTR. (x,s) is the total number of x in state s summed over all positions of a protein. Therefore, there are a total 20 * 281 features. The feature value (x,s) can be computed from the gamma matrix: T x, s t s, (5) t 1 O t x which is the sum of all the probabilities of being in state s and observing amino acid x across all the positions of the target protein. The feature set given by Eq. (5) is a variation of the Fisher score defined by Jaakkola et al. 10 Fisher score vector U x can be obtained by the following formula: U x x,s log P X H 1 (6)

5 522 Y. HOU ET AL. Each component of U x is a derivative of the log-likelihood score for the query sequence X with respect to a particular emission probability parameter x,s of the HMM H 1. The magnitude of the components of U x specifies the extent to which each parameter x,s contributes to generating the query sequence. The derivatives of log P(X H 1 ) with respect to the x,s as the components of the Fisher score vector U x can be computed by the following formula 10 : x,s x, s log P X H 1 s, (7) x,s where x,s is one emission probability parameter that specifies the probability of emitting amino acid x in state s, and (s) is the expected number of times we visit state s as we traverse all paths through the model. In this work, we use a sequence profile instead of a single amino acid sequence. Therefore, for the order-independent feature set, we define a variation of the Fisher score to capture the expected sum, over all the positions of a protein, of the position-specific frequency of visiting state s and emitting symbol x. Order-dependent feature set The relative position of local structure is lost in the order-independent feature set. Therefore, a strong similarity between two such feature vectors tells us only that the two sequences have the same local structure motifs. It does not tell us that the motifs are arranged in the same order along the sequence. For example, two proteins that are predominantly helical would have similar local features, even if they are globally different. Although to some extent, the existence of conserved helix-caps or other, less common motifs makes this feature space more selective than amino acid or secondary structure composition measures. In this section, we define an order-dependent feature set by aligning the matrices to include both the states posterior probability and their positions in the protein. The order-dependent feature set for a protein X is defined as F x (f x1,f x2,,f x ), where is the total number of proteins in the training set and f xi is the E-value of the Smith Waterman score between protein X and the ith training set protein. This feature set has a similar format as the features in SVM-pairwise method. The difference is that, in this work, f xi is the E-value of the alignment score of 2 gamma matrices, rather than the E-value of the Smith Waterman score between 2 sequences in SVMpairwise method. In the following sections, the procedure of deriving the E-values of the alignment of 2 matrices is described. This process comprises 3 steps, namely, dimension reduction, similarity computation of 2 matrices, and transformation of similarities scores into E-values. Dimension reduction. Recall that the matrix is 281 rows by T columns, where T is the length of the protein sequence. We observe that only a few of the 281 states have probability values that are significantly greater than zero in most of the columns in the matrices. Most of the states have probability values that are very close to zero. A TABLE I. Correlation Between C and Probability value Cutoff Number of States C state probability value is defined as significant if it occurs by chance with a probability less than or equal to a predefined probability value. Given a significance value, we use the following steps to determine the cutoff value and the maximum number of states C needed to store the significant states for each column of a matrix. This procedure is based on a sample set of matrices for proteins in the SCOP database 22 version 1.53: 1. Sort all the states in a descending order by their probabilities. The cutoff is the probability value of the nth state, where n N *, and N is the number of the sampled states. 2. Count the number of states in each column with a probability value greater than this cutoff, and C is the maximum of these numbers. Table I lists the cutoff values corresponding to the different probability values. The variable C represents the maximum number of states in one column in these samples with values greater than the cutoff. It is clear that for an of 0.01, it is sufficient to just maintain the top 8 states. This motivates us to reduce the dimensionality of the matrix by storing only the top C states for each column, and the remaining states are assumed to have probability 0. In our experiments, we set C at 10 to ensure that all the significant values for 0.01 are included. With this, we transform each matrix into a record with the format: T (S1 V1 SC VC), where T is the length of the matrix, the set (S1 V1 SC VC) denotes the concatenation of C entries of each column in the matrix, S i and V i (1 i C) represents the state number from 1 to 281, and the probability value corresponds to the state. Subsequent similarity comparisons are carried out on this transformed data set. Similarity computation. We align the extracted matrices using Smith Waterman algorithm to get a similarity score to measure the overall conserved folding similarity of two proteins. In order to use the Smith Waterman algorithm for alignment, a set of parameters (similarity score, gap penalties) called a scoring scheme must be defined. Existing scoring schemes such as BLO- SUM or PAM are designed for sequence alignment, and are not applicable for the matrices alignment problem. Hence, we redefine the scoring scheme for matrices comparison. Yona and Levitt 23 examined the principles for defining a new scoring scheme for the Smith Waterman algorithm for a new alignment problem. They concluded that in order to be consistent with the BLOSUM and PAM scoring schemes, the similarity score of two positions (a,b) must satisfy the conditions:

6 REMOTE HOMOLOG DETECTION mean{score(a,b)} 0 2. max{score(a,b)} 0. The first condition guarantees that the average score of a random match is negative, while from the second condition, a match with a positive score is possible. Based on these two principles, we define the scoring scheme for our matrix alignment problem. The following subsection presents the details of the procedure for redefining the new similarity score of two positions. Similarity score. To encode the position information, we need to compute the similarity of 2 matrices. This involves performing a Smith Waterman alignment on the 2 matrices. To do this, we must first reduce the 2 matrices to a pairwise similarity matrix. In other words, one pair of columns is first reduced to a single similarity value. Given 2 matrix columns p and q, where the top C states are stored, the similarity score of two columns is defined as the cosine similarity score of the two columns. The main reason we use a cosine similarity score in place of a symmetrized relative entropy, which is used by Yona and Levitt, 23 is that multiplication is a much faster operation compared to the log operation; especially the time needed for such operations is huge. The cosine similarity score is given by the formula p q p, q (8) p p q q where denotes dot product, and the dot products (i.e., p q, p p, and q q) in Eq. (8) are all computed with the stored top C states under the assumption that the remaining states are insignificant. The computation of the self-dot products p p and q q is very straightforward, which is again over just the top C states. The dot product of p q is equal to the sum of the products over just the states that overlap between p and q. Once we have computed the similarity scores of all the pairwise columns of the 2 matrices, the next step is to map the computed similarity scores into a new range that guarantees the average score of a random match is negative, and at the same time, a match with a positive score is possible. A simple method that can map the similarity score defined in Eq. (8) into a new range that satisfies the two conditions is to subtract the similarity scores with a shift value. To determine the shift value, we first compute the average similarity score calculated for a large set of matrix column pairs, which is To ensure that the average score of a random match is negative, the shifted value much be larger than To ensure that a match with a positive score is possible, the maximum shifted value is less than 1. Shift values ranging from 0.1 to 0.9 are investigated in our experiments. Gap penalties. In addition to the shift value, gap penalties also play an important role in deriving a sensitive scoring scheme. In order to determine the optimal shift value and gap penalty, a total of 9 3 sets of parameters are tested, with shift value between 0.1 and 0.9 (values of 0.1, 0.2, 0.3,, 0.9), a gap opening penalty between 1 and Fig. 3. Performance with different parameter sets. The x axis is the shift value; the y axis is the ROC score computed for the sampled protein pairs. For each shift value, there are three ROC scores that correspond to the gap opening penalty 1,2,3 and gap extension penalty 0.1,0.2,0.3, respectively. 3 (values of 1, 2, 3), and gap extension penalty between 0.1 and 0.3 (values of 0.1, 0.2, 0.3). In keeping with the BLOSUM62 matrix, we set the gap extension penalties to be one order of magnitude smaller than gap open penalty. We sample a subset of protein pairs that are either related or unrelated from SCOP database 22 version 1.53 to determine the best scoring scheme. We refer to two proteins within the same superfamily in SCOP as related, and otherwise, as unrelated. Given a specific set of parameters, we first run Smith Waterman algorithm using these parameters as the scoring scheme to obtain the similarity score for all the sampled protein pairs. The receiver operating characteristic (ROC) score 24 is computed with all the alignment scores of the protein pairs and their relatedness (1 for related and 0 for unrelated) as the input to measure the goodness of this set of parameters. Figure 3 shows the results of the experiments. From the graph, we set the shift value at 0.3, with the gap opening penalty of 3 and gap extension penalty of 0.3. After redefining the scoring scheme, we run the standard Smith Waterman algorithm to get the similarity score of the 2 matrices. Transformation of similarity score into E-value. Empirical studies 25 have shown that the distribution of local gapped similarity scores can be well approximated by the extreme value distribution. 26 We transform the scores into a statistical significance value called E-value to distinguish true similarities from random matches. The E-value can be computed using the formula E Kmne S, where S denotes the similarity score, m and n are the lengths of the compared proteins, and and K are two empirically derived parameters. We use the direct estimation method 25 to estimate the parameters and K. We collect the scores of 5000 optimal alignments of matrices. These alignments are produced from random protein sequences of length n m 900

7 524 Y. HOU ET AL. using the scoring scheme described in the previous subsection. The number of alignments that score above a given threshold are counted. It is suggested in Waterman and Vingron 25 that the probability of an alignment scoring less than or equal to is given by exp( mnk ). After an appropriate transformation [log{ log(data)}], the empirical distribution function is expected to form a straight line, which facilitates the estimation of the parameters and K. However, this estimation does not take into consideration the length of the real data that will have an effect on the value of parameter. 25 To eliminate this effect, is corrected by using the estimated and K to search the protein database using the maximum likelihood estimation (MLE) method. Data Sets We assess the recognition performance of each algorithm by testing its ability to classify protein domains into superfamilies in the SCOP 22 version The same experiment setup as the SVM-pairwise 11 method is adopted here to allow for a direct comparison. Remote homology is simulated by holding out all members of a target SCOP family from a given superfamily as follows: 1. Close sequences are removed using an E-value threshold of 10 25, and this resulted in 4352 distinct sequences, grouped into families and superfamilies. 2. Positive training examples are chosen from the remaining families in the same superfamily, and negative test and training examples are chosen from outside the target family s fold. The held out family members serve as positive test examples. A total of 54 families containing at least 5 family members (positive test) and 10 superfamily members outside of the family (positive train) are produced. For each family, negative examples are taken from outside of the positive sequences fold, and are randomly split into training and testing sets in the same ratio as the positive examples. Metrics To assess the performance of a remote homology detection method, we consider two metrics: the Receiver Operating Characteristic (ROC50) score 24 and median Rate of False Positives (RFP). 10 The ROC score combines measures of a search s sensitivity and specificity. The ROC score is the area under a curve that plots true positives versus true negatives for varying score thresholds, and it measures the probability of correct classification. Therefore, a score of 1 indicates perfect separation of positives from negatives, whereas a score of 0 denotes that none of the sequences selected by the algorithm is positive. The ROC50 score is the area under the ROC curve, up to the first 50 false positives. ROC50 has the advantage of providing a more efficient and sensitive way to evaluate different methods over the ROC score. The median RFP score is the fraction of negative test sequences that score as high or better than the median-scoring positive test sequence. Median RFP is used to measure the error rate of the prediction under the score threshold where half of the true positives can be detected. These measures were used for evaluation in Jaakkola et al. 10 and Liao and Noble. 11 Support Vector Machines SVMs 15,27 have strong theoretical foundations and excellent empirical successes. The SVM is a supervised machine learning method that developed rapidly and has been widely used in many kinds of pattern recognition problems. The basic method of SVM is to transform the samples into a high-dimension Hilbert space and to seek a separating hyperplane in this space. The separating hyperplane, which is called the optimal separating hyperplane, is chosen in such a way as to maximize its distance from the closest training samples. The SVM usually outperforms other machine learning technologies, including Neural Networks and K-Nearest Neighbor classifiers. 15 We use the Gist publicly available SVM software implementation ( index.html), which implements the soft margin optimization algorithm described in Jaakkola et al. 10 The base kernel is normalized so that each vector has length 1 in the feature space, that is, K(X,Y) X Y/ (X X)(Y Y). This kernel K(.,.) is then transformed into a radial basis kernel. RESULTS AND DISCUSSION Setup of Competing Methods We compare SVM-HMMSTR with 5 methods: the generative models PSI-BLAST and SAM, and the SVM-based discriminative models SVM-I-sites, SVM-pairwise, and SVM-Fisher. We first briefly describe the setup procedure of these methods. It is not straightforward to compare PSI-BLAST and SAM, which requires as input a single sequence, with methods such as SVM-Fisher and SVM-HMMSTR, which take multiple input sequences. We address this problem by randomly selecting a positive training set sequence to serve as the initial query. PSI-BLAST is run for two iterations on the Swissprot database with an E-value threshold 0.001, and the resulting profile is then used to search against the test set sequence. The resulting E-values are used to rank the test set sequences. The PSI-BLAST settings used here are exactly the same as the ones we used to build the profiles for our SVM-HMMSTR method. For the SAM method, HMMs were trained using the Sequence Alignment and Models toolkit ( research/compbio/sam.html). 7 The generative models were trained from an existing library of SAM-T99 HMMs. The SAM-T99 algorithm, described more fully in Karplus et al., 9 builds an HMM for a SCOP domain sequence by searching the nonredundant protein database Swissprot for a set of potential homologs of the sequence and then iteratively selecting positive training sequences from among these potential homologs and refining a model. The resulting model is stored as an alignment of the domain sequence and final set of homologs. Once a model is obtained, it is straightforward to compare the test sequences to the model and the resulting reverse scores are used to rank the test set sequences.

8 REMOTE HOMOLOG DETECTION 525 Fig. 4. Family-by-family comparison of SVM-HMMSTR-Independent and SVM-HMMSTR-Dependent. The coordinates of each point in the plot are the ROC50 scores for one SCOP family, obtained using SVM-HMMSTR-Independent and SVM-HMMSTR-Dependent. The dotted line is y x. For the SVM based methods, SVM-Fisher, SVM-pairwise, and SVM-I-sites, the key steps are feature representation and extraction. The SVM-Fisher method uses feature vectors calculated from the parameters of a profile HMM. For each positive training set, a HMM is first built with the SAM package as described above, and subsequently used to compute a vector representation for any sequence using the accompanying program get_fisher_ scores in the SAM package. SVM-pairwise uses the pairwise Smith Waterman sequence similarity algorithm in place of the gradient vector in the SVM-Fisher method that we described earlier. In contrast, SVM-I-sites encodes the local structure composition of a protein as the sum of I-sites motif confidence scores, 16,17 where each motif defines one feature. After the vectorization step, all of the SVM-based methods will define a similarity score for 2 proteins based on the feature vectors and use that similarity as the kernel of the classifier. Results We first investigated the effect of varying the composition of the feature sets on the detection accuracy. We call the method that uses only the order-independent feature set as SVM-HMMSTR-Independent, and the method using only the order-dependent feature set as SVM- HMMSTR-Dependent. Figure 4 shows a family-by-family comparison of the performances of the two methods. From Figure 4, we realize that SVM-HMMSTR- Independent and SVM-HMMSTR-Dependent are complementary. This suggests that the local structure composition and the sequential order of the local motifs are equally important in determining the similarity of 2 proteins and a combination of the 2 feature sets should achieve a better performance than either feature set alone. There are 2 basic approaches for combining the 2 feature sets. The most direct method is to concatenate the 2 feature sets into a longer feature set. We call this method SVM-HMMSTR-Hybrid. Another approach is to first obtain 2 classifiers with the 2 feature sets individually and then combine their results, either taking the average or maximum for each query protein. We refer to these methods as SVM-HMMSTR-Ave and SVM-HMMSTR- Max, respectively. Table II summarizes the performance of the various methods we tested in terms of ROC50 and median RFP scores averaged over all 54 families tested. Since SVM- HMMSTR-Ave gave the best performance, we will use SVM-HMMSTR-Ave for the rest of the comparison tests. The distribution of ROC50 and median RFP scores are shown in Figures 5, 6, and 7. In each case, a higher curve corresponds to more accurate remote homology detection performance. Using either performance measure, the SVM- HMMSTR method performs significantly better than the other 5 methods. We also assess the statistical significance of differences among methods using Wilcoxon signed rank test. 13 The resulting p values are adjusted using a Bonferroni correction for multiple comparisons. As shown in Table III,

9 526 Y. HOU ET AL. TABLE II. ROC50 and Median RFP Averaged Over 54 Families for Different Methods Methods Mean ROC50 Mean median RFP SVM-HMMSTR-Ave SVM-HMMSTR-Max SVM-HMMSTR-Hybrid SVM-HMMSTR-Independent SVM-HMMSTR-Dependent SVM-I-sites SVM-pairwise SVM-Fisher SAM PSI-BLAST SVM-HMMSTR significantly outperforms all of the other methods with a p value Many of these results agree with previous assessments. For example, the relative performance of SVM-Fisher and SAM agrees with the results given in Jaakkola et al., 10 as does the relative performance of SAM and PSI-BLAST with the results given in Park et al. 28 and the relative performance of SVM-I-sites and SVM-pairwise given in Hou et al. 16 One surprise is the magnitude of the difference between SVM-pairwise and the 2 methods, SVM-Fisher and SAM, that directly or indirectly utilize SAM-T99 model. It is reported in Liao and Noble 11 that SVM-pairwise significantly outperforms these 2 methods, which is not the case in this work. This difference can be explained as follows: Our experiments allowed the iterative algorithm to access to all of Swissprot database during training of the model, while in Liao and Noble, 11 they are only given a small training set (containing only a handful of positive samples), without the benefits of the potential homologs in the Swissprot database. The most significant result from our experiments is the top-ranking performance of the SVM-HMMSTR method. This result is further illustrated in Figure 8, which shows a family-by-family comparison of the 54 ROC50 scores computed for SVM-HMMSTR and SVM-pairwise method. Our order-independent feature set uses a variation of the Fisher score, and it plays essentially the same role as the Fisher score features in that both capture the similarity of 2 proteins in terms of the composition of HMM states and amino acids. At the same time, the order-dependent feature set is defined by the E-value of an alignment score between 2 HMMSTR-derived profiles, which differentiates our order-dependent feature set from that of the SVMpairwise algorithm. SVM-pairwise uses the E-value of a Smith Waterman alignment score. Thus, we can surmise from the results shown in Table II that local structure based HMMSTR information provides better features than the sequence-based profile HMMs, and alignment of HMMSTR-derived profiles provide better features than alignment of sequences. Figure 9 is the ROC50 distribu- Fig. 5. Relative performance of homology detection methods. The graph plots the total number of families for which a given method exceeds a ROC50 score threshold. Each series corresponds to one of the homology detection methods described in the text.

10 REMOTE HOMOLOG DETECTION 527 Fig. 6. Relative performance of homology detection methods. The graph plots the total number of families for which a given method exceeds a median RFP score threshold. Each series corresponds to one of the homology detection methods described in the text. Fig. 7. Detail plot of the low median RFP region of Figure 6.

11 528 Y. HOU ET AL. TABLE III. Statistical Significance of Differences Between Pairs of Homology Detection Methods SVM-HMMSTR SVM-I-sites SVM-Fisher SVM-pairwise SAM PSI-BLAST SVM-HMMSTR 1.35e e e SVM-I-sites 9.0e-03 SVM-Fisher 3.15e-02 SVM-pairwise 1.35e-04 SAM PSI-BLAST Each entry in the table is the p value given by a Wilcoxon signed rank test comparing paired ROC50 scores from two methods for each of the 54 families. The p values have been adjusted for multiple comparisons using a Bonferonni adjustment. An entry in the table indicates that the method listed in the current row performs significantly better than the method listed in the current column at a threshold of Fig. 8. Family-by-family comparison of SVMHMMSTR and SVM-pairwise. The coordinates of each point in the plot are the ROC50 scores for one SCOP family, obtained using SVM-HMMSTR and SVM-pairwise. The dotted line is y x. tion of these features, and it shows that HMMSTR does provide better features than the sequence-based models. One significant characteristic of any homology detection algorithm is its computational efficiency. In this aspect, SVM-HMMSTR has the same order of time complexity as SVM-pairwise in theory. Both algorithms include an SVM optimization, which dominates SVM training time and is roughly O(n 2 ), 11 where n is the number of training set examples. The vectorization step of SVM-pairwise involves computing n 2 pairwise scores. Using Smith Waterman, each computation takes O(m 2 ), yielding a total running time of O(n 2 m 2 ), where m is the length of the longest training set sequence. In contrast, SVM-HMMSTR first runs PSI-BLAST to obtain a profile before it aligns the obtained profile against HMMSTR to get the matrix. The time complexity of running PSI-BLAST on the Swissprot database is O(N) when the length of the query sequence k is much less than N, where N is the size of the Swissprot database. Hence, the total running time of getting the profile is O(nN). The size of Swissprot database N we used in the experiment is about quadratic in m* n, where m is a second-order number; we conclude that the time complexity of the step of obtaining a profile is O(n 2 m 2 ). The step of aligning the obtained profiles for the n examples against HMMSTR is O(nmp), where p is the number of HMM parameters. Thus, assuming that m p, the time complexity of aligning a profile against HMMSTR is O(nm 2 ). For feature representation and extraction step, the time complexity of obtaining order-independent feature set is an order of O(nmp). The time complexity of obtaining order dependent feature set is O(n 2 m 2 ), because the time of computing the similarity score of 2 columns is a constant time for a small fixed number C. So the overall time complexity of SVM-HMMSTR is O(n 2 m 2 ). Therefore, we

12 REMOTE HOMOLOG DETECTION 529 Fig. 9. Relative performance of homology detection methods. The graph plots the total number of families for which a given method exceeds a ROC50 score threshold. Each series corresponds to one of the homology detection methods described in the text. conclude that the time complexity of SVM-HMMSTR is the same as that of SVM-pairwise. CONCLUSIONS To the best of our knowledge, this is the first attempt to encode both structure prediction and database alignment information in the training of an SVM. We have presented a new method to represent the sequence in terms of HMM state composition, and also in terms of alignment scores for sequences represented as HMM states. This creates a feature space that captures both the local structural composition and the overall conserved sequence similarity. The improved performance of SVM-HMMSTR in remote homology detection can be understood by considering the following factors: 1. HMMSTR is a better model for local structure prediction than I-sites. HMMSTR states represent local sequence patterns, allowing distant similarities to be seen. 2. Our approach considers both the overall composition and the sequential ordering of local structures in separate feature sets. The compositional feature sets are sensitive but relatively unselective, while the sequential features are more selective but less sensitive, since they require alignments that may be error prone. 3. The use of SVMs enables efficient and effective learning to take place in high-dimensional feature space. The SVM owes a great part of its success to its ability to use kernels, allowing the classifier to exploit a very highdimensional (possibly even infinite-dimensional) feature space. In addition to their empirical success, SVMs are also appealing due to the existence of strong generalization guarantees, derived from the margin-maximizing properties of the learning algorithm. ACKNOWLEDGMENTS We thank William Stafford Noble and Li Liao for their helpful discussions and for providing experimental results from their prior work, and Mark Diekhans for providing information about the Fisher kernel. We also thank the anonymous reviewers for their helpful comments and suggestions. REFERENCES 1. Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 1970;48: Smith T, Waterman M. Identification of common molecular subsequences. J Mol Biol 1981;147: Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. A basic local alignment search tool. J Mol Biol 1990;215: Pearson WR. Rapid and sensitive sequence comparisons with FASTP and FASTA. Methods Enzymol 1985;183: Gribskov M, McLachlan AD, Eisenberg D. Profile analysis: detection of distantly related proteins. Proc.Natl Acad Sci USA 1987;84: Krogh A, Brown M, Mian IS, et al. Hidden Markov models in computational biology: applications to protein modeling. J Mol Biol 1994;235:

13 530 Y. HOU ET AL. 7. Baldi P, Chauvin Y, Hunkapiller T, McClure MA. Hidden Markov models of biological primary sequence information. Proc Natl Acad Sci USA 1994;91: Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman OJ. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res 1997;25: Karplus K, Barrett C, Hughey R. Hidden Markov models for detecting remote protein homologies. Bioinformatics 1998;14: Jaakkola T, Diekhans M, Haussler D. A discriminative framework for detecting remote protein homologies. J Comp Biol 2000;7: Liao L, Noble WS. Combining pairwise sequence similarity and support vector machines for detecting remote protein evolutionary and structural relationships. J Comp Biol 2003;10: Tsuda K, Kin T, Asai K. Marginalized kernels for biological sequences. The 10th International Conference on Intelligent Systems for Molecular Biology, Ben-Hur A, Brutlag D. Remote homology detection: a motif based approach. The 11th International Conference on Intelligent Systems for Molecular Biology, Leslie CS, Eskin E, Cohen A, Weston J, Noble WS. Mismatch string kernels for discriminative protein classification. Bioinformatics 2004;20: Cristiani N, Shawe-Taylor J. An introduction to support vector machines. Cambridge, UK: Cambridge University Press; Hou Y, Hsu W, Lee ML, Bystroff C. Efficient remote homology detection using local structure. Bioinformatics 2003;19: Bystroff C, Baker D. Prediction of local structure in proteins using a library of sequence structure motifs. J Mol Biol 1998;281: Bystroff C, Thorsson V, Baker D. HMMSTR: a hidden Markov model for local sequence structure correlations in proteins. J Mol Biol 2000;301: Rabiner L. A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE 1989;77: Eddy S. Profile hidden Markov models. Bioinformatics 1998;14: Logan B, Moreno P, Suzek B, Weng Z, Kasif S. A study of remote homology detection. Compaq and DEC Technical Reports, CRL , Murzin AG, Brenner SE, Hubbard T, Chothia C. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995;247: Yona G, Levitt M. Within the twilight zone: a sensitive profile profile comparison tool based on information theory. J Mol Biol 2002;315: Gribskov M, Robinson NL. Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching. Comput Chem 1996;20: Waterman MS, Vingron M. Sequence comparison significance and possion approximation. Stat Sci 1994;9: Gumbel EJ. Statistics of extremes. New York: Columbia Univeristy Press; Vapnik V. Statistical learning theory. New York: Wiley; Park J, Karplus K, Barrett C, et al. Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. J Mol Biol 1998;284:

THE SPECTRUM KERNEL: A STRING KERNEL FOR SVM PROTEIN CLASSIFICATION

THE SPECTRUM KERNEL: A STRING KERNEL FOR SVM PROTEIN CLASSIFICATION THE SPECTRUM KERNEL: A STRING KERNEL FOR SVM PROTEIN CLASSIFICATION CHRISTINA LESLIE, ELEAZAR ESKIN, WILLIAM STAFFORD NOBLE a {cleslie,eeskin,noble}@cs.columbia.edu Department of Computer Science, Columbia

More information

Mismatch String Kernels for SVM Protein Classification

Mismatch String Kernels for SVM Protein Classification Mismatch String Kernels for SVM Protein Classification Christina Leslie Department of Computer Science Columbia University cleslie@cs.columbia.edu Jason Weston Max-Planck Institute Tuebingen, Germany weston@tuebingen.mpg.de

More information

Semi-supervised protein classification using cluster kernels

Semi-supervised protein classification using cluster kernels Semi-supervised protein classification using cluster kernels Jason Weston Max Planck Institute for Biological Cybernetics, 72076 Tübingen, Germany weston@tuebingen.mpg.de Dengyong Zhou, Andre Elisseeff

More information

An Analysis of Pairwise Sequence Alignment Algorithm Complexities: Needleman-Wunsch, Smith-Waterman, FASTA, BLAST and Gapped BLAST

An Analysis of Pairwise Sequence Alignment Algorithm Complexities: Needleman-Wunsch, Smith-Waterman, FASTA, BLAST and Gapped BLAST An Analysis of Pairwise Sequence Alignment Algorithm Complexities: Needleman-Wunsch, Smith-Waterman, FASTA, BLAST and Gapped BLAST Alexander Chan 5075504 Biochemistry 218 Final Project An Analysis of Pairwise

More information

Mismatch String Kernels for SVM Protein Classification

Mismatch String Kernels for SVM Protein Classification Mismatch String Kernels for SVM Protein Classification by C. Leslie, E. Eskin, J. Weston, W.S. Noble Athina Spiliopoulou Morfoula Fragopoulou Ioannis Konstas Outline Definitions & Background Proteins Remote

More information

BIOINFORMATICS. Mismatch string kernels for discriminative protein classification

BIOINFORMATICS. Mismatch string kernels for discriminative protein classification BIOINFORMATICS Vol. 1 no. 1 2003 Pages 1 10 Mismatch string kernels for discriminative protein classification Christina Leslie 1, Eleazar Eskin 1, Adiel Cohen 1, Jason Weston 2 and William Stafford Noble

More information

C E N T R. Introduction to bioinformatics 2007 E B I O I N F O R M A T I C S V U F O R I N T. Lecture 13 G R A T I V. Iterative homology searching,

C E N T R. Introduction to bioinformatics 2007 E B I O I N F O R M A T I C S V U F O R I N T. Lecture 13 G R A T I V. Iterative homology searching, C E N T R E F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U Introduction to bioinformatics 2007 Lecture 13 Iterative homology searching, PSI (Position Specific Iterated) BLAST basic idea use

More information

Jyoti Lakhani 1, Ajay Khunteta 2, Dharmesh Harwani *3 1 Poornima University, Jaipur & Maharaja Ganga Singh University, Bikaner, Rajasthan, India

Jyoti Lakhani 1, Ajay Khunteta 2, Dharmesh Harwani *3 1 Poornima University, Jaipur & Maharaja Ganga Singh University, Bikaner, Rajasthan, India International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2017 IJSRCSEIT Volume 2 Issue 6 ISSN : 2456-3307 Improvisation of Global Pairwise Sequence Alignment

More information

New String Kernels for Biosequence Data

New String Kernels for Biosequence Data Workshop on Kernel Methods in Bioinformatics New String Kernels for Biosequence Data Christina Leslie Department of Computer Science Columbia University Biological Sequence Classification Problems Protein

More information

Profiles and Multiple Alignments. COMP 571 Luay Nakhleh, Rice University

Profiles and Multiple Alignments. COMP 571 Luay Nakhleh, Rice University Profiles and Multiple Alignments COMP 571 Luay Nakhleh, Rice University Outline Profiles and sequence logos Profile hidden Markov models Aligning profiles Multiple sequence alignment by gradual sequence

More information

USING AN EXTENDED SUFFIX TREE TO SPEED-UP SEQUENCE ALIGNMENT

USING AN EXTENDED SUFFIX TREE TO SPEED-UP SEQUENCE ALIGNMENT IADIS International Conference Applied Computing 2006 USING AN EXTENDED SUFFIX TREE TO SPEED-UP SEQUENCE ALIGNMENT Divya R. Singh Software Engineer Microsoft Corporation, Redmond, WA 98052, USA Abdullah

More information

BLAST, Profile, and PSI-BLAST

BLAST, Profile, and PSI-BLAST BLAST, Profile, and PSI-BLAST Jianlin Cheng, PhD School of Electrical Engineering and Computer Science University of Central Florida 26 Free for academic use Copyright @ Jianlin Cheng & original sources

More information

Profile-based String Kernels for Remote Homology Detection and Motif Extraction

Profile-based String Kernels for Remote Homology Detection and Motif Extraction Profile-based String Kernels for Remote Homology Detection and Motif Extraction Ray Kuang, Eugene Ie, Ke Wang, Kai Wang, Mahira Siddiqi, Yoav Freund and Christina Leslie. Department of Computer Science

More information

As of August 15, 2008, GenBank contained bases from reported sequences. The search procedure should be

As of August 15, 2008, GenBank contained bases from reported sequences. The search procedure should be 48 Bioinformatics I, WS 09-10, S. Henz (script by D. Huson) November 26, 2009 4 BLAST and BLAT Outline of the chapter: 1. Heuristics for the pairwise local alignment of two sequences 2. BLAST: search and

More information

Modifying Kernels Using Label Information Improves SVM Classification Performance

Modifying Kernels Using Label Information Improves SVM Classification Performance Modifying Kernels Using Label Information Improves SVM Classification Performance Renqiang Min and Anthony Bonner Department of Computer Science University of Toronto Toronto, ON M5S3G4, Canada minrq@cs.toronto.edu

More information

Biologically significant sequence alignments using Boltzmann probabilities

Biologically significant sequence alignments using Boltzmann probabilities Biologically significant sequence alignments using Boltzmann probabilities P Clote Department of Biology, Boston College Gasson Hall 16, Chestnut Hill MA 0267 clote@bcedu Abstract In this paper, we give

More information

24 Grundlagen der Bioinformatik, SS 10, D. Huson, April 26, This lecture is based on the following papers, which are all recommended reading:

24 Grundlagen der Bioinformatik, SS 10, D. Huson, April 26, This lecture is based on the following papers, which are all recommended reading: 24 Grundlagen der Bioinformatik, SS 10, D. Huson, April 26, 2010 3 BLAST and FASTA This lecture is based on the following papers, which are all recommended reading: D.J. Lipman and W.R. Pearson, Rapid

More information

PROTEIN MULTIPLE ALIGNMENT MOTIVATION: BACKGROUND: Marina Sirota

PROTEIN MULTIPLE ALIGNMENT MOTIVATION: BACKGROUND: Marina Sirota Marina Sirota MOTIVATION: PROTEIN MULTIPLE ALIGNMENT To study evolution on the genetic level across a wide range of organisms, biologists need accurate tools for multiple sequence alignment of protein

More information

Bioinformatics explained: BLAST. March 8, 2007

Bioinformatics explained: BLAST. March 8, 2007 Bioinformatics Explained Bioinformatics explained: BLAST March 8, 2007 CLC bio Gustav Wieds Vej 10 8000 Aarhus C Denmark Telephone: +45 70 22 55 09 Fax: +45 70 22 55 19 www.clcbio.com info@clcbio.com Bioinformatics

More information

COS 551: Introduction to Computational Molecular Biology Lecture: Oct 17, 2000 Lecturer: Mona Singh Scribe: Jacob Brenner 1. Database Searching

COS 551: Introduction to Computational Molecular Biology Lecture: Oct 17, 2000 Lecturer: Mona Singh Scribe: Jacob Brenner 1. Database Searching COS 551: Introduction to Computational Molecular Biology Lecture: Oct 17, 2000 Lecturer: Mona Singh Scribe: Jacob Brenner 1 Database Searching In database search, we typically have a large sequence database

More information

Acceleration of Algorithm of Smith-Waterman Using Recursive Variable Expansion.

Acceleration of Algorithm of Smith-Waterman Using Recursive Variable Expansion. www.ijarcet.org 54 Acceleration of Algorithm of Smith-Waterman Using Recursive Variable Expansion. Hassan Kehinde Bello and Kazeem Alagbe Gbolagade Abstract Biological sequence alignment is becoming popular

More information

Using Hybrid Alignment for Iterative Sequence Database Searches

Using Hybrid Alignment for Iterative Sequence Database Searches Using Hybrid Alignment for Iterative Sequence Database Searches Yuheng Li, Mario Lauria, Department of Computer and Information Science The Ohio State University 2015 Neil Avenue #395 Columbus, OH 43210-1106

More information

Phase4 automatic evaluation of database search methods

Phase4 automatic evaluation of database search methods Phase4 automatic evaluation of database search methods Marc Rehmsmeier Deutsches Krebsforschungszentrum (DKFZ) Theoretische Bioinformatik Heidelberg, Germany revised version, 11th September 2002 Abstract

More information

A NEW GENERATION OF HOMOLOGY SEARCH TOOLS BASED ON PROBABILISTIC INFERENCE

A NEW GENERATION OF HOMOLOGY SEARCH TOOLS BASED ON PROBABILISTIC INFERENCE 205 A NEW GENERATION OF HOMOLOGY SEARCH TOOLS BASED ON PROBABILISTIC INFERENCE SEAN R. EDDY 1 eddys@janelia.hhmi.org 1 Janelia Farm Research Campus, Howard Hughes Medical Institute, 19700 Helix Drive,

More information

Principles of Bioinformatics. BIO540/STA569/CSI660 Fall 2010

Principles of Bioinformatics. BIO540/STA569/CSI660 Fall 2010 Principles of Bioinformatics BIO540/STA569/CSI660 Fall 2010 Lecture 11 Multiple Sequence Alignment I Administrivia Administrivia The midterm examination will be Monday, October 18 th, in class. Closed

More information

The Effect of Inverse Document Frequency Weights on Indexed Sequence Retrieval. Kevin C. O'Kane. Department of Computer Science

The Effect of Inverse Document Frequency Weights on Indexed Sequence Retrieval. Kevin C. O'Kane. Department of Computer Science The Effect of Inverse Document Frequency Weights on Indexed Sequence Retrieval Kevin C. O'Kane Department of Computer Science The University of Northern Iowa Cedar Falls, Iowa okane@cs.uni.edu http://www.cs.uni.edu/~okane

More information

Chapter 8 Multiple sequence alignment. Chaochun Wei Spring 2018

Chapter 8 Multiple sequence alignment. Chaochun Wei Spring 2018 1896 1920 1987 2006 Chapter 8 Multiple sequence alignment Chaochun Wei Spring 2018 Contents 1. Reading materials 2. Multiple sequence alignment basic algorithms and tools how to improve multiple alignment

More information

BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha

BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS 466 Saurabh Sinha Motivation Sequence homology to a known protein suggest function of newly sequenced protein Bioinformatics

More information

Computational Molecular Biology

Computational Molecular Biology Computational Molecular Biology Erwin M. Bakker Lecture 3, mainly from material by R. Shamir [2] and H.J. Hoogeboom [4]. 1 Pairwise Sequence Alignment Biological Motivation Algorithmic Aspect Recursive

More information

Chapter 6. Multiple sequence alignment (week 10)

Chapter 6. Multiple sequence alignment (week 10) Course organization Introduction ( Week 1,2) Part I: Algorithms for Sequence Analysis (Week 1-11) Chapter 1-3, Models and theories» Probability theory and Statistics (Week 3)» Algorithm complexity analysis

More information

FastA & the chaining problem

FastA & the chaining problem FastA & the chaining problem We will discuss: Heuristics used by the FastA program for sequence alignment Chaining problem 1 Sources for this lecture: Lectures by Volker Heun, Daniel Huson and Knut Reinert,

More information

Dynamic Programming User Manual v1.0 Anton E. Weisstein, Truman State University Aug. 19, 2014

Dynamic Programming User Manual v1.0 Anton E. Weisstein, Truman State University Aug. 19, 2014 Dynamic Programming User Manual v1.0 Anton E. Weisstein, Truman State University Aug. 19, 2014 Dynamic programming is a group of mathematical methods used to sequentially split a complicated problem into

More information

FastA and the chaining problem, Gunnar Klau, December 1, 2005, 10:

FastA and the chaining problem, Gunnar Klau, December 1, 2005, 10: FastA and the chaining problem, Gunnar Klau, December 1, 2005, 10:56 4001 4 FastA and the chaining problem We will discuss: Heuristics used by the FastA program for sequence alignment Chaining problem

More information

Lectures by Volker Heun, Daniel Huson and Knut Reinert, in particular last years lectures

Lectures by Volker Heun, Daniel Huson and Knut Reinert, in particular last years lectures 4 FastA and the chaining problem We will discuss: Heuristics used by the FastA program for sequence alignment Chaining problem 4.1 Sources for this lecture Lectures by Volker Heun, Daniel Huson and Knut

More information

Computational Genomics and Molecular Biology, Fall

Computational Genomics and Molecular Biology, Fall Computational Genomics and Molecular Biology, Fall 2015 1 Sequence Alignment Dannie Durand Pairwise Sequence Alignment The goal of pairwise sequence alignment is to establish a correspondence between the

More information

CISC 889 Bioinformatics (Spring 2003) Multiple Sequence Alignment

CISC 889 Bioinformatics (Spring 2003) Multiple Sequence Alignment CISC 889 Bioinformatics (Spring 2003) Multiple Sequence Alignment Courtesy of jalview 1 Motivations Collective statistic Protein families Identification and representation of conserved sequence features

More information

Bioinformatics explained: Smith-Waterman

Bioinformatics explained: Smith-Waterman Bioinformatics Explained Bioinformatics explained: Smith-Waterman May 1, 2007 CLC bio Gustav Wieds Vej 10 8000 Aarhus C Denmark Telephone: +45 70 22 55 09 Fax: +45 70 22 55 19 www.clcbio.com info@clcbio.com

More information

Bioinformatics for Biologists

Bioinformatics for Biologists Bioinformatics for Biologists Sequence Analysis: Part I. Pairwise alignment and database searching Fran Lewitter, Ph.D. Director Bioinformatics & Research Computing Whitehead Institute Topics to Cover

More information

Fast Kernels for Inexact String Matching

Fast Kernels for Inexact String Matching Fast Kernels for Inexact String Matching Christina Leslie and Rui Kuang Columbia University, New York NY 10027, USA {rkuang,cleslie}@cs.columbia.edu Abstract. We introduce several new families of string

More information

Heuristic methods for pairwise alignment:

Heuristic methods for pairwise alignment: Bi03c_1 Unit 03c: Heuristic methods for pairwise alignment: k-tuple-methods k-tuple-methods for alignment of pairs of sequences Bi03c_2 dynamic programming is too slow for large databases Use heuristic

More information

Lecture Overview. Sequence search & alignment. Searching sequence databases. Sequence Alignment & Search. Goals: Motivations:

Lecture Overview. Sequence search & alignment. Searching sequence databases. Sequence Alignment & Search. Goals: Motivations: Lecture Overview Sequence Alignment & Search Karin Verspoor, Ph.D. Faculty, Computational Bioscience Program University of Colorado School of Medicine With credit and thanks to Larry Hunter for creating

More information

Introduction to Kernels (part II)Application to sequences p.1

Introduction to Kernels (part II)Application to sequences p.1 Introduction to Kernels (part II) Application to sequences Liva Ralaivola liva@ics.uci.edu School of Information and Computer Science Institute for Genomics and Bioinformatics Introduction to Kernels (part

More information

Sequence Alignment & Search

Sequence Alignment & Search Sequence Alignment & Search Karin Verspoor, Ph.D. Faculty, Computational Bioscience Program University of Colorado School of Medicine With credit and thanks to Larry Hunter for creating the first version

More information

Multiple Sequence Alignment Based on Profile Alignment of Intermediate Sequences

Multiple Sequence Alignment Based on Profile Alignment of Intermediate Sequences Multiple Sequence Alignment Based on Profile Alignment of Intermediate Sequences Yue Lu and Sing-Hoi Sze RECOMB 2007 Presented by: Wanxing Xu March 6, 2008 Content Biology Motivation Computation Problem

More information

3.4 Multiple sequence alignment

3.4 Multiple sequence alignment 3.4 Multiple sequence alignment Why produce a multiple sequence alignment? Using more than two sequences results in a more convincing alignment by revealing conserved regions in ALL of the sequences Aligned

More information

Data Mining Technologies for Bioinformatics Sequences

Data Mining Technologies for Bioinformatics Sequences Data Mining Technologies for Bioinformatics Sequences Deepak Garg Computer Science and Engineering Department Thapar Institute of Engineering & Tecnology, Patiala Abstract Main tool used for sequence alignment

More information

.. Fall 2011 CSC 570: Bioinformatics Alexander Dekhtyar..

.. Fall 2011 CSC 570: Bioinformatics Alexander Dekhtyar.. .. Fall 2011 CSC 570: Bioinformatics Alexander Dekhtyar.. PAM and BLOSUM Matrices Prepared by: Jason Banich and Chris Hoover Background As DNA sequences change and evolve, certain amino acids are more

More information

Comparative Analysis of Protein Alignment Algorithms in Parallel environment using CUDA

Comparative Analysis of Protein Alignment Algorithms in Parallel environment using CUDA Comparative Analysis of Protein Alignment Algorithms in Parallel environment using BLAST versus Smith-Waterman Shadman Fahim shadmanbracu09@gmail.com Shehabul Hossain rudrozzal@gmail.com Gulshan Jubaed

More information

A Coprocessor Architecture for Fast Protein Structure Prediction

A Coprocessor Architecture for Fast Protein Structure Prediction A Coprocessor Architecture for Fast Protein Structure Prediction M. Marolia, R. Khoja, T. Acharya, C. Chakrabarti Department of Electrical Engineering Arizona State University, Tempe, USA. Abstract Predicting

More information

Multiple Sequence Alignment. Mark Whitsitt - NCSA

Multiple Sequence Alignment. Mark Whitsitt - NCSA Multiple Sequence Alignment Mark Whitsitt - NCSA What is a Multiple Sequence Alignment (MA)? GMHGTVYANYAVDSSDLLLAFGVRFDDRVTGKLEAFASRAKIVHIDIDSAEIGKNKQPHV GMHGTVYANYAVEHSDLLLAFGVRFDDRVTGKLEAFASRAKIVHIDIDSAEIGKNKTPHV

More information

FASTA. Besides that, FASTA package provides SSEARCH, an implementation of the optimal Smith- Waterman algorithm.

FASTA. Besides that, FASTA package provides SSEARCH, an implementation of the optimal Smith- Waterman algorithm. FASTA INTRODUCTION Definition (by David J. Lipman and William R. Pearson in 1985) - Compares a sequence of protein to another sequence or database of a protein, or a sequence of DNA to another sequence

More information

Comparison of Phylogenetic Trees of Multiple Protein Sequence Alignment Methods

Comparison of Phylogenetic Trees of Multiple Protein Sequence Alignment Methods Comparison of Phylogenetic Trees of Multiple Protein Sequence Alignment Methods Khaddouja Boujenfa, Nadia Essoussi, and Mohamed Limam International Science Index, Computer and Information Engineering waset.org/publication/482

More information

Sequence alignment theory and applications Session 3: BLAST algorithm

Sequence alignment theory and applications Session 3: BLAST algorithm Sequence alignment theory and applications Session 3: BLAST algorithm Introduction to Bioinformatics online course : IBT Sonal Henson Learning Objectives Understand the principles of the BLAST algorithm

More information

GLOBEX Bioinformatics (Summer 2015) Multiple Sequence Alignment

GLOBEX Bioinformatics (Summer 2015) Multiple Sequence Alignment GLOBEX Bioinformatics (Summer 2015) Multiple Sequence Alignment Scoring Dynamic Programming algorithms Heuristic algorithms CLUSTAL W Courtesy of jalview Motivations Collective (or aggregate) statistic

More information

Compares a sequence of protein to another sequence or database of a protein, or a sequence of DNA to another sequence or library of DNA.

Compares a sequence of protein to another sequence or database of a protein, or a sequence of DNA to another sequence or library of DNA. Compares a sequence of protein to another sequence or database of a protein, or a sequence of DNA to another sequence or library of DNA. Fasta is used to compare a protein or DNA sequence to all of the

More information

FastCluster: a graph theory based algorithm for removing redundant sequences

FastCluster: a graph theory based algorithm for removing redundant sequences J. Biomedical Science and Engineering, 2009, 2, 621-625 doi: 10.4236/jbise.2009.28090 Published Online December 2009 (http://www.scirp.org/journal/jbise/). FastCluster: a graph theory based algorithm for

More information

OPEN MP-BASED PARALLEL AND SCALABLE GENETIC SEQUENCE ALIGNMENT

OPEN MP-BASED PARALLEL AND SCALABLE GENETIC SEQUENCE ALIGNMENT OPEN MP-BASED PARALLEL AND SCALABLE GENETIC SEQUENCE ALIGNMENT Asif Ali Khan*, Laiq Hassan*, Salim Ullah* ABSTRACT: In bioinformatics, sequence alignment is a common and insistent task. Biologists align

More information

Encoding Words into String Vectors for Word Categorization

Encoding Words into String Vectors for Word Categorization Int'l Conf. Artificial Intelligence ICAI'16 271 Encoding Words into String Vectors for Word Categorization Taeho Jo Department of Computer and Information Communication Engineering, Hongik University,

More information

Clustering Sequences with Hidden. Markov Models. Padhraic Smyth CA Abstract

Clustering Sequences with Hidden. Markov Models. Padhraic Smyth CA Abstract Clustering Sequences with Hidden Markov Models Padhraic Smyth Information and Computer Science University of California, Irvine CA 92697-3425 smyth@ics.uci.edu Abstract This paper discusses a probabilistic

More information

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review Gautam Kunapuli Machine Learning Data is identically and independently distributed Goal is to learn a function that maps to Data is generated using an unknown function Learn a hypothesis that minimizes

More information

HIDDEN MARKOV MODELS AND SEQUENCE ALIGNMENT

HIDDEN MARKOV MODELS AND SEQUENCE ALIGNMENT HIDDEN MARKOV MODELS AND SEQUENCE ALIGNMENT - Swarbhanu Chatterjee. Hidden Markov models are a sophisticated and flexible statistical tool for the study of protein models. Using HMMs to analyze proteins

More information

Representation is Everything: Towards Efficient and Adaptable Similarity Measures for Biological Data

Representation is Everything: Towards Efficient and Adaptable Similarity Measures for Biological Data Representation is Everything: Towards Efficient and Adaptable Similarity Measures for Biological Data Charu C. Aggarwal IBM T. J. Watson Research Center charu@us.ibm.com Abstract Distance function computation

More information

A fast, large-scale learning method for protein sequence classification

A fast, large-scale learning method for protein sequence classification A fast, large-scale learning method for protein sequence classification Pavel Kuksa, Pai-Hsi Huang, Vladimir Pavlovic Department of Computer Science Rutgers University Piscataway, NJ 8854 {pkuksa;paihuang;vladimir}@cs.rutgers.edu

More information

Research Article International Journals of Advanced Research in Computer Science and Software Engineering ISSN: X (Volume-7, Issue-6)

Research Article International Journals of Advanced Research in Computer Science and Software Engineering ISSN: X (Volume-7, Issue-6) International Journals of Advanced Research in Computer Science and Software Engineering ISSN: 77-18X (Volume-7, Issue-6) Research Article June 017 DDGARM: Dotlet Driven Global Alignment with Reduced Matrix

More information

Bioinformatics. Sequence alignment BLAST Significance. Next time Protein Structure

Bioinformatics. Sequence alignment BLAST Significance. Next time Protein Structure Bioinformatics Sequence alignment BLAST Significance Next time Protein Structure 1 Experimental origins of sequence data The Sanger dideoxynucleotide method F Each color is one lane of an electrophoresis

More information

Keyword Extraction by KNN considering Similarity among Features

Keyword Extraction by KNN considering Similarity among Features 64 Int'l Conf. on Advances in Big Data Analytics ABDA'15 Keyword Extraction by KNN considering Similarity among Features Taeho Jo Department of Computer and Information Engineering, Inha University, Incheon,

More information

Leave-One-Out Support Vector Machines

Leave-One-Out Support Vector Machines Leave-One-Out Support Vector Machines Jason Weston Department of Computer Science Royal Holloway, University of London, Egham Hill, Egham, Surrey, TW20 OEX, UK. Abstract We present a new learning algorithm

More information

) I R L Press Limited, Oxford, England. The protein identification resource (PIR)

) I R L Press Limited, Oxford, England. The protein identification resource (PIR) Volume 14 Number 1 Volume 1986 Nucleic Acids Research 14 Number 1986 Nucleic Acids Research The protein identification resource (PIR) David G.George, Winona C.Barker and Lois T.Hunt National Biomedical

More information

Database Searching Using BLAST

Database Searching Using BLAST Mahidol University Objectives SCMI512 Molecular Sequence Analysis Database Searching Using BLAST Lecture 2B After class, students should be able to: explain the FASTA algorithm for database searching explain

More information

Sequence alignment is an essential concept for bioinformatics, as most of our data analysis and interpretation techniques make use of it.

Sequence alignment is an essential concept for bioinformatics, as most of our data analysis and interpretation techniques make use of it. Sequence Alignments Overview Sequence alignment is an essential concept for bioinformatics, as most of our data analysis and interpretation techniques make use of it. Sequence alignment means arranging

More information

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of

More information

Biology 644: Bioinformatics

Biology 644: Bioinformatics Find the best alignment between 2 sequences with lengths n and m, respectively Best alignment is very dependent upon the substitution matrix and gap penalties The Global Alignment Problem tries to find

More information

JET 2 User Manual 1 INSTALLATION 2 EXECUTION AND FUNCTIONALITIES. 1.1 Download. 1.2 System requirements. 1.3 How to install JET 2

JET 2 User Manual 1 INSTALLATION 2 EXECUTION AND FUNCTIONALITIES. 1.1 Download. 1.2 System requirements. 1.3 How to install JET 2 JET 2 User Manual 1 INSTALLATION 1.1 Download The JET 2 package is available at www.lcqb.upmc.fr/jet2. 1.2 System requirements JET 2 runs on Linux or Mac OS X. The program requires some external tools

More information

Notes on Dynamic-Programming Sequence Alignment

Notes on Dynamic-Programming Sequence Alignment Notes on Dynamic-Programming Sequence Alignment Introduction. Following its introduction by Needleman and Wunsch (1970), dynamic programming has become the method of choice for rigorous alignment of DNA

More information

Lecture 5: Markov models

Lecture 5: Markov models Master s course Bioinformatics Data Analysis and Tools Lecture 5: Markov models Centre for Integrative Bioinformatics Problem in biology Data and patterns are often not clear cut When we want to make a

More information

Bioinformatics III Structural Bioinformatics and Genome Analysis

Bioinformatics III Structural Bioinformatics and Genome Analysis Bioinformatics III Structural Bioinformatics and Genome Analysis Chapter 3 Structural Comparison and Alignment 3.1 Introduction 1. Basic algorithms review Dynamic programming Distance matrix 2. SARF2,

More information

Comparison of Sequence Similarity Measures for Distant Evolutionary Relationships

Comparison of Sequence Similarity Measures for Distant Evolutionary Relationships Comparison of Sequence Similarity Measures for Distant Evolutionary Relationships Abhishek Majumdar, Peter Z. Revesz Department of Computer Science and Engineering, University of Nebraska-Lincoln, Lincoln,

More information

A CAM(Content Addressable Memory)-based architecture for molecular sequence matching

A CAM(Content Addressable Memory)-based architecture for molecular sequence matching A CAM(Content Addressable Memory)-based architecture for molecular sequence matching P.K. Lala 1 and J.P. Parkerson 2 1 Department Electrical Engineering, Texas A&M University, Texarkana, Texas, USA 2

More information

CS313 Exercise 4 Cover Page Fall 2017

CS313 Exercise 4 Cover Page Fall 2017 CS313 Exercise 4 Cover Page Fall 2017 Due by the start of class on Thursday, October 12, 2017. Name(s): In the TIME column, please estimate the time you spent on the parts of this exercise. Please try

More information

Scalable Algorithms for String Kernels with Inexact Matching

Scalable Algorithms for String Kernels with Inexact Matching Scalable Algorithms for String Kernels with Inexact Matching Pavel P. Kuksa, Pai-Hsi Huang, Vladimir Pavlovic Department of Computer Science, Rutgers University, Piscataway, NJ 08854 {pkuksa,paihuang,vladimir}@cs.rutgers.edu

More information

Data mining with Support Vector Machine

Data mining with Support Vector Machine Data mining with Support Vector Machine Ms. Arti Patle IES, IPS Academy Indore (M.P.) artipatle@gmail.com Mr. Deepak Singh Chouhan IES, IPS Academy Indore (M.P.) deepak.schouhan@yahoo.com Abstract: Machine

More information

A Visualization Tool to Improve the Performance of a Classifier Based on Hidden Markov Models

A Visualization Tool to Improve the Performance of a Classifier Based on Hidden Markov Models A Visualization Tool to Improve the Performance of a Classifier Based on Hidden Markov Models Gleidson Pegoretti da Silva, Masaki Nakagawa Department of Computer and Information Sciences Tokyo University

More information

Basic Local Alignment Search Tool (BLAST)

Basic Local Alignment Search Tool (BLAST) BLAST 26.04.2018 Basic Local Alignment Search Tool (BLAST) BLAST (Altshul-1990) is an heuristic Pairwise Alignment composed by six-steps that search for local similarities. The most used access point to

More information

False Discovery Rate for Homology Searches

False Discovery Rate for Homology Searches False Discovery Rate for Homology Searches Hyrum D. Carroll 1,AlexC.Williams 1, Anthony G. Davis 1,andJohnL.Spouge 2 1 Middle Tennessee State University Department of Computer Science Murfreesboro, TN

More information

CISC 636 Computational Biology & Bioinformatics (Fall 2016)

CISC 636 Computational Biology & Bioinformatics (Fall 2016) CISC 636 Computational Biology & Bioinformatics (Fall 2016) Sequence pairwise alignment Score statistics: E-value and p-value Heuristic algorithms: BLAST and FASTA Database search: gene finding and annotations

More information

B L A S T! BLAST: Basic local alignment search tool. Copyright notice. February 6, Pairwise alignment: key points. Outline of tonight s lecture

B L A S T! BLAST: Basic local alignment search tool. Copyright notice. February 6, Pairwise alignment: key points. Outline of tonight s lecture February 6, 2008 BLAST: Basic local alignment search tool B L A S T! Jonathan Pevsner, Ph.D. Introduction to Bioinformatics pevsner@jhmi.edu 4.633.0 Copyright notice Many of the images in this powerpoint

More information

Distributed Protein Sequence Alignment

Distributed Protein Sequence Alignment Distributed Protein Sequence Alignment ABSTRACT J. Michael Meehan meehan@wwu.edu James Hearne hearne@wwu.edu Given the explosive growth of biological sequence databases and the computational complexity

More information

An Empirical Study of Lazy Multilabel Classification Algorithms

An Empirical Study of Lazy Multilabel Classification Algorithms An Empirical Study of Lazy Multilabel Classification Algorithms E. Spyromitros and G. Tsoumakas and I. Vlahavas Department of Informatics, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece

More information

Salvador Capella-Gutiérrez, Jose M. Silla-Martínez and Toni Gabaldón

Salvador Capella-Gutiérrez, Jose M. Silla-Martínez and Toni Gabaldón trimal: a tool for automated alignment trimming in large-scale phylogenetics analyses Salvador Capella-Gutiérrez, Jose M. Silla-Martínez and Toni Gabaldón Version 1.2b Index of contents 1. General features

More information

CAP BLAST. BIOINFORMATICS Su-Shing Chen CISE. 8/20/2005 Su-Shing Chen, CISE 1

CAP BLAST. BIOINFORMATICS Su-Shing Chen CISE. 8/20/2005 Su-Shing Chen, CISE 1 CAP 5510-6 BLAST BIOINFORMATICS Su-Shing Chen CISE 8/20/2005 Su-Shing Chen, CISE 1 BLAST Basic Local Alignment Prof Search Su-Shing Chen Tool A Fast Pair-wise Alignment and Database Searching Tool 8/20/2005

More information

Sequence analysis Pairwise sequence alignment

Sequence analysis Pairwise sequence alignment UMF11 Introduction to bioinformatics, 25 Sequence analysis Pairwise sequence alignment 1. Sequence alignment Lecturer: Marina lexandersson 12 September, 25 here are two types of sequence alignments, global

More information

Protein homology detection using string alignment kernels

Protein homology detection using string alignment kernels BIOINFORMATICS Vol. 20 no. 11 2004, pages 1682 1689 doi:10.1093/bioinformatics/bth141 Protein homology detection using string alignment kernels Hiroto Saigo 1, Jean-Philippe Vert 2,, Nobuhisa Ueda 1 and

More information

A New Method for Database Searching and Clustering

A New Method for Database Searching and Clustering 90 \ A New Method for Database Searching and Clustering Antje Krause Martin Vingron a.krause@dkfz-heidelberg.de m.vingron@dkfz-heidelberg.de Deutsches Krebsforschungszentrum (DKFZ), Abt. Theoretische Bioinformatik

More information

Classification of biological sequences with kernel methods

Classification of biological sequences with kernel methods Classification of biological sequences with kernel methods Jean-Philippe Vert Jean-Philippe.Vert@ensmp.fr Centre for Computational Biology Ecole des Mines de Paris, ParisTech International Conference on

More information

Supervised vs unsupervised clustering

Supervised vs unsupervised clustering Classification Supervised vs unsupervised clustering Cluster analysis: Classes are not known a- priori. Classification: Classes are defined a-priori Sometimes called supervised clustering Extract useful

More information

BLAST - Basic Local Alignment Search Tool

BLAST - Basic Local Alignment Search Tool Lecture for ic Bioinformatics (DD2450) April 11, 2013 Searching 1. Input: Query Sequence 2. Database of sequences 3. Subject Sequence(s) 4. Output: High Segment Pairs (HSPs) Sequence Similarity Measures:

More information

Structural and Syntactic Pattern Recognition

Structural and Syntactic Pattern Recognition Structural and Syntactic Pattern Recognition Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2017 CS 551, Fall 2017 c 2017, Selim Aksoy (Bilkent

More information

String Vector based KNN for Text Categorization

String Vector based KNN for Text Categorization 458 String Vector based KNN for Text Categorization Taeho Jo Department of Computer and Information Communication Engineering Hongik University Sejong, South Korea tjo018@hongik.ac.kr Abstract This research

More information

BLAST & Genome assembly

BLAST & Genome assembly BLAST & Genome assembly Solon P. Pissis Tomáš Flouri Heidelberg Institute for Theoretical Studies May 15, 2014 1 BLAST What is BLAST? The algorithm 2 Genome assembly De novo assembly Mapping assembly 3

More information

AlignMe Manual. Version 1.1. Rene Staritzbichler, Marcus Stamm, Kamil Khafizov and Lucy R. Forrest

AlignMe Manual. Version 1.1. Rene Staritzbichler, Marcus Stamm, Kamil Khafizov and Lucy R. Forrest AlignMe Manual Version 1.1 Rene Staritzbichler, Marcus Stamm, Kamil Khafizov and Lucy R. Forrest Max Planck Institute of Biophysics Frankfurt am Main 60438 Germany 1) Introduction...3 2) Using AlignMe

More information