Mismatch String Kernels for SVM Protein Classification

Mismatch String Kernels for SVM Protein Classification
by C. Leslie, E. Eskin, J. Weston, W.S. Noble
Athina Spiliopoulou, Morfoula Fragopoulou, Ioannis Konstas

Outline
Definitions & Background
  Proteins
  Remote Homology Detection
  SVMs
Insides of the algorithm
  Feature mapping
  Mismatch tree data structure
  Mismatch tree traversal
Computational Efficiency
Experiments
Discussion

Proteins
Primary structure: amino acid sequence
Secondary structure
3D structure
The same amino-acid sequence almost always folds into the same 3D structure

Homologues, Remote Homologues
Amino acid sequences are subject to mutation
Structures serving an important biological function are highly conserved
Homologues: share the same ancestor + sequence similarity > 30%
Remote homologues: share the same ancestor + sequence similarity < 30%

Protein Classification
[Figure labels: Superfamily, Family, Homologues, Remote Homologues, Non-homologues]
Homology Detection: classify sequences into families
Remote Homology Detection: classify sequences into superfamilies

Remote Homology Detection
Data available: amino-acid sequences
Remote homology detection: a great challenge due to low sequence similarity
Previous methods (generative models):
  pairwise sequence alignment
  profiles for protein families
  consensus patterns using motifs
  profile Hidden Markov Models
SVM-Fisher: breakthrough for remote homology detection

SVMs in Remote Homology Detection
Discriminative classifiers that learn linear decision boundaries
Explicitly model the difference between positive and negative examples
Behave and generalise well with sparse data
Input data can be mapped to a feature space
Kernel trick: explicit calculation of feature vectors can be avoided

Feature Mapping
Amino acid alphabet A: $|A| = l = 20$ symbols
k-mer: a k-length subsequence in a protein sequence
Feature space: the $l^k$-dimensional vector space indexed by the set of all possible k-mers from A

Feature Mapping (cont.)
Alphabet A = {A, V, L}, k = 3
Sequence: AALAAV
3-mers in the sequence: AAL, ALA, LAA, AAV
Feature vector (excerpt):
  AAA  AAL  AAV  ALA  AVA  LAA  VAA
   0    1    1    1    0    1    0
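The exact-match feature map on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `kmers` and `spectrum_features` are made up for this example, and the full feature vector is enumerated explicitly, which is only practical for a toy alphabet.

```python
from collections import Counter
from itertools import product


def kmers(seq, k):
    """Return every contiguous k-length substring of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def spectrum_features(seq, k, alphabet):
    """Map seq to a dict {k-mer: count} over all len(alphabet)**k k-mers."""
    counts = Counter(kmers(seq, k))
    index = ["".join(p) for p in product(alphabet, repeat=k)]
    return {kmer: counts.get(kmer, 0) for kmer in index}


# Toy example from the slide: alphabet {A, V, L}, k = 3, sequence AALAAV.
features = spectrum_features("AALAAV", 3, "AVL")
excerpt = [features[b] for b in ("AAA", "AAL", "AAV", "ALA", "AVA", "LAA", "VAA")]
```

Here `excerpt` reproduces the row of counts shown in the table, and the dictionary has 3^3 = 27 entries, one per possible 3-mer.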

Mismatch String Kernel
Allows for mutations
Example: k = 3, m = 1, sequence AALAAV
Mismatch neighbourhood $N_{(3,1)}(\alpha)$ of the 3-mer $\alpha$ = AAL: AAV, AAA, LAL, VAL, AVL, ALL (together with AAL itself)
The feature mapping of a k-mer $\alpha$ is given by:
  $\Phi_{(k,m)}(\alpha) = (\phi_\beta(\alpha))_{\beta \in A^k}$
where $\phi_\beta(\alpha) = 1$ if $\beta$ belongs to the neighbourhood $N_{(k,m)}(\alpha)$ and 0 otherwise
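As a rough sketch of the neighbourhood definition, the following Python enumerates every k-mer within m mismatches of a given k-mer. The helper name `mismatch_neighbourhood` is hypothetical, and brute-force enumeration is used purely for clarity.

```python
from itertools import combinations, product


def mismatch_neighbourhood(kmer, m, alphabet):
    """Return the set of k-mers differing from kmer in at most m positions."""
    neighbours = set()
    for n_mis in range(m + 1):
        # Choose which positions to mutate ...
        for positions in combinations(range(len(kmer)), n_mis):
            # ... and, for each, every symbol other than the original one.
            choices = [[c for c in alphabet if c != kmer[p]] for p in positions]
            for substitution in product(*choices):
                chars = list(kmer)
                for p, c in zip(positions, substitution):
                    chars[p] = c
                neighbours.add("".join(chars))
    return neighbours


# Slide example: alphabet {A, V, L}, alpha = AAL, m = 1.
neighbourhood = mismatch_neighbourhood("AAL", 1, "AVL")
```

With zero mutated positions the loop adds the k-mer itself, so the result contains AAL plus the six one-mismatch variants listed on the slide.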

Mismatch String Kernel (cont.)
The feature mapping of a sequence x is given by:
  $\Phi_{(k,m)}(x) = \sum_{k\text{-mers } \alpha \text{ in } x} \Phi_{(k,m)}(\alpha)$
The (k,m)-mismatch kernel is given by:
  $K_{(k,m)}(x, y) = \langle \Phi_{(k,m)}(x), \Phi_{(k,m)}(y) \rangle$
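The two definitions above translate directly into a naive reference implementation. The sketch below is not the efficient algorithm from the talk: it enumerates all possible k-mers and compares by Hamming distance, so it is only usable on toy inputs. The function names are assumptions made for this example.

```python
from collections import Counter
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def mismatch_features(seq, k, m, alphabet):
    """Phi_(k,m)(seq): each k-mer in seq adds 1 to every beta in its neighbourhood."""
    all_kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    feats = Counter()
    for i in range(len(seq) - k + 1):
        alpha = seq[i:i + k]
        for beta in all_kmers:
            if hamming(alpha, beta) <= m:
                feats[beta] += 1
    return feats


def mismatch_kernel(x, y, k, m, alphabet):
    """K_(k,m)(x, y): inner product of the two mismatch feature vectors."""
    fx = mismatch_features(x, k, m, alphabet)
    fy = mismatch_features(y, k, m, alphabet)
    return sum(fx[beta] * fy[beta] for beta in fx)
```

With m = 0 this reduces to counting shared exact k-mers: AALAAV and AALA share AAL and ALA, so the kernel value is 2.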

Mismatch Tree - An efficient data Structure
Representation of the feature space as a tree
Depth of tree: k
Number of branches of each internal node: $|A| = l$
Label of each branch: a symbol from A

Mismatch Tree - An efficient data Structure (cont.)
Alphabet A = {A, V, L}, k = 3
[Figure: mismatch tree with branches labelled A, V, L at every level]
Internal nodes: prefix of a k-mer (e.g. AA, AV, AL)
Leaf nodes: fixed k-mers (e.g. AAA, AAV, AAL)

Mismatch Tree Traversal (DFS)
Sequence: AALA, k = 3, m = 1
[Figure: depth-first traversal of the mismatch tree, with k-mer instances annotated by mismatch counts]
Kernel update: $K(x, y) \leftarrow K(x, y) + \mathrm{count}(x) \cdot \mathrm{count}(y)$
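One way to picture the traversal is the recursive sketch below. It is an illustrative reading of the slide, not the authors' code: the tree is never built, each recursive call stands for one node, and the tuple layout `(sequence index, offset, mismatches so far)` is an assumption chosen for this example.

```python
def mismatch_kernel_matrix(seqs, k, m, alphabet):
    """Fill the (k,m)-mismatch kernel matrix by a depth-first tree traversal."""
    n_seqs = len(seqs)
    K = [[0] * n_seqs for _ in range(n_seqs)]
    # At the root every k-mer instance is alive with zero mismatches.
    root = [(s, i, 0) for s, seq in enumerate(seqs)
            for i in range(len(seq) - k + 1)]

    def visit(depth, instances):
        if not instances:
            return  # prune: no k-mer instance survives along this path
        if depth == k:
            # Leaf: count surviving instances per sequence, then update K.
            counts = [0] * n_seqs
            for s, _, _ in instances:
                counts[s] += 1
            for a in range(n_seqs):
                for b in range(n_seqs):
                    K[a][b] += counts[a] * counts[b]
            return
        for symbol in alphabet:
            survivors = []
            for s, i, mis in instances:
                # Following this branch costs a mismatch if the symbol differs.
                new_mis = mis + (seqs[s][i + depth] != symbol)
                if new_mis <= m:
                    survivors.append((s, i, new_mis))
            visit(depth + 1, survivors)

    visit(0, root)
    return K


K = mismatch_kernel_matrix(["AALAAV", "AALA"], 3, 1, "AVL")
```

Only the instances along the current path are held in memory, which is the point of traversing the tree rather than storing it.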

Efficiency: Space Complexity
No need to store the entire tree
  For k = 7: 1.28 billion nodes!
No need to store all feature vectors
  Kernel trick!
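The 1.28 billion figure can be checked with a short calculation, assuming the 20-letter amino acid alphabet. Note that 20^7 is the number of depth-7 leaves (one per 7-mer); counting internal nodes as well gives a slightly larger total.

```python
l, k = 20, 7  # alphabet size and k-mer length

leaves = l ** k                                  # one leaf per possible k-mer
all_nodes = sum(l ** d for d in range(k + 1))    # root, internal nodes and leaves

print(f"{leaves:,} leaves, {all_nodes:,} nodes in total")
```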

Efficiency: Time Complexity
A fixed k-mer $\alpha$ has $O(k^m l^m)$ k-mers in its neighbourhood
$N = Mn$, where M: number of sequences, n: the length of each sequence, N: total length of the dataset
Whole dataset: $O(N k^m l^m)$ k-mers
Worst case: perform $O(M^2)$ updates to the kernel matrix
Overall running complexity: $O(M^2 n k^m l^m)$
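The neighbourhood bound can be made concrete by counting exactly: choose which i of the k positions differ, then one of the other l - 1 symbols for each. The identity below is a standard counting argument added for illustration, with m treated as a small constant.

```latex
|N_{(k,m)}(\alpha)| = \sum_{i=0}^{m} \binom{k}{i} (l-1)^{i} = O(k^{m} l^{m})
% Toy example from the earlier slide: k = 3, m = 1, l = 3
% gives 1 + 3 \cdot 2 = 7 k-mers in the neighbourhood of AAL.
```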

System Pipeline
Training phase
  Compute the kernel matrix for all the training sequences
  Normalize (divide by the length of the vectors)
  Train the SVM classifier
  Compute and store the k-mer scores of the support vectors
Testing phase
  Compute the feature vector for each test datum and predict its class in linear time:
  $f(x) = \sum_{i=1}^{r} y_i a_i \langle \Phi_{(k,m)}(x_i), \Phi_{(k,m)}(x) \rangle + b$
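The normalisation and prediction steps might look roughly like the sketch below. It assumes feature vectors are available as `{k-mer: value}` dictionaries that have already been normalised, and that the labels `y_i`, weights `a_i` and bias `b` come from a trained SVM; all function names are hypothetical.

```python
import math


def normalise(K):
    """Divide each kernel entry by the lengths of both feature vectors."""
    n = len(K)
    return [[K[i][j] / math.sqrt(K[i][i] * K[j][j]) for j in range(n)]
            for i in range(n)]


def kmer_scores(support_features, labels, alphas):
    """Precompute w[beta] = sum_i y_i * a_i * Phi(x_i)[beta] over support vectors."""
    w = {}
    for feats, y, a in zip(support_features, labels, alphas):
        for beta, value in feats.items():
            w[beta] = w.get(beta, 0.0) + y * a * value
    return w


def decision(test_features, w, b):
    """f(x) = <w, Phi(x)> + b, using only the k-mers present in the test datum."""
    return sum(w.get(beta, 0.0) * v for beta, v in test_features.items()) + b
```

Because the sum over support vectors is folded into `w` once, scoring a test sequence only needs one lookup per feature it contains, which is what makes the testing phase cheap.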

Experiments
Benchmark dataset designed by Jaakkola et al. from the SCOP database
33 families
[Figure labels: Superfamily, Family, Pos. Train, Pos. Test, Negative Train]

Experiments (cont.)
Comparison to other methods:
  PSI-BLAST (mainly used for homology detection)
  SAM-T98
  Fisher-SVM (the state of the art)

ROC Curve - ROC Scores
[Figure: two example ROC curves, TP against FP on axes from 0 to 1, labelled 0.8 and 0.7]
ROC score is the area under the curve
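For illustration, the area under the ROC curve can be computed without drawing the curve, using the equivalent pairwise form: the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as half. The function name `roc_score` and the convention that label 1 marks positives are assumptions of this sketch.

```python
def roc_score(scores, labels):
    """Area under the ROC curve for classifier scores and labels (1 = positive)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Each correctly ordered pair counts 1, each tied pair counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A perfect ranking gives 1.0 and a random one about 0.5. This quadratic version is fine for small test sets; a sort-based version would be preferable for large ones.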

Comparison of all methods

Family-by-family Comparison

Discussion
Mismatch-SVM performs as well as the Fisher-SVM method
Mismatch-SVM is much more efficient
Efficiency is an important issue:
  Large real-world datasets
  Multi-class prediction
Accuracy increased by incorporating biological knowledge

Questions?