Mismatch String Kernels for SVM Protein Classification by C. Leslie, E. Eskin, J. Weston, W.S. Noble Presented by: Athina Spiliopoulou, Morfoula Fragopoulou, Ioannis Konstas
Outline Definitions & Background Proteins Remote Homology Detection SVMs Insides of the algorithm Feature mapping Mismatch tree data structure Mismatch tree traversal Computational Efficiency Experiments Discussion
Proteins Primary structure: amino acid sequence Secondary Structure 3D Structure The same amino-acid sequence almost always folds into the same 3D structure
Homologues, Remote Homologues Amino-acid sequences are subject to mutation Structures serving important biological functions are highly conserved Homologues: share a common ancestor + sequence similarity > 30% Remote Homologues: share a common ancestor + sequence similarity < 30%
Protein Classification Superfamily Family Homologues Remote Homologues Non-homologues Homology Detection: Classify sequences into families Remote Homology Detection: Classify sequences into superfamilies
Remote Homology Detection Data available: amino-acid sequences Remote Homology Detection: great challenge due to low sequence similarity Previous Methods (generative models): pairwise sequence alignment profiles for protein families consensus patterns using motifs profile Hidden Markov Models SVM-Fisher: breakthrough for remote homology detection
SVMs in Remote Homology Detection Discriminative classifiers that learn linear decision boundaries Explicitly model difference between positive and negative examples Behave and generalise well with sparse data Input data can be mapped to a feature space Kernel Trick Explicit calculation of feature vectors can be avoided
Outline Definitions & Background Proteins Remote Homology Detection SVMs Insides of the algorithm Feature mapping Mismatch tree data structure Mismatch tree traversal Computational Efficiency Experiments Discussion
Feature Mapping Amino-acid alphabet A: l = 20 symbols k-mer: a length-k subsequence of a protein sequence Feature space: the l^k-dimensional vector space indexed by the set of all possible k-mers from A
Feature Mapping (cont.) Alphabet A = {A, V, L}, k = 3 Sequence: A A L A A V → 3-mers: AAL, ALA, LAA, AAV Feature vector (showing the coordinates for AAA, AAL, AAV, ALA, AVA, LAA, VAA): (0, 1, 1, 1, 0, 1, 0)
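The exact-match (m = 0) feature mapping above can be sketched in a few lines of Python. This is an illustration on the toy alphabet {A, V, L}, not the authors' implementation; `spectrum_features` is a name chosen here for the plain k-mer count vector.

```python
from itertools import product

def spectrum_features(seq, alphabet, k):
    """Count occurrences of every possible k-mer in seq (exact match, m = 0)."""
    counts = {"".join(p): 0 for p in product(alphabet, repeat=k)}
    for i in range(len(seq) - k + 1):
        counts[seq[i:i + k]] += 1
    return counts

# The slide's example: sequence AALAAV over A = {A, V, L}, k = 3
phi = spectrum_features("AALAAV", "AVL", 3)
# 3-mers present: AAL, ALA, LAA, AAV -> each coordinate is 1, the other 23 are 0
```

The full vector has l^k = 3^3 = 27 coordinates; the slide shows only seven of them.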
Mismatch String Kernel Allows for mutations. k = 3, m = 1 Mismatch neighbourhood N_(3,1)(α) of the 3-mer α = AAL: AAL, AAV, AAA, LAL, VAL, AVL, ALL The feature mapping of a k-mer α is given by: Φ_(k,m)(α) = (φ_β(α))_{β ∈ A^k}, where φ_β(α) = 1 if β belongs to the neighbourhood N_(k,m)(α), and 0 otherwise
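The mismatch neighbourhood N_(k,m)(α) is just the set of k-mers within Hamming distance m of α. A minimal sketch (function name is ours, not from the paper):

```python
from itertools import product

def mismatch_neighbourhood(kmer, alphabet, m):
    """All strings over `alphabet` within Hamming distance m of `kmer`."""
    neigh = {kmer}
    for _ in range(m):                      # grow the ball one mismatch at a time
        extra = set()
        for b in neigh:
            for i, a in product(range(len(b)), alphabet):
                extra.add(b[:i] + a + b[i + 1:])
        neigh |= extra
    return neigh
```

On the slide's example, `mismatch_neighbourhood("AAL", "AVL", 1)` yields the seven 3-mers listed above (AAL itself plus its six single-substitution variants).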
Mismatch String Kernel (cont.) The feature mapping of a sequence x is given by: Φ_(k,m)(x) = Σ_{k-mers α in x} Φ_(k,m)(α) The (k,m)-mismatch kernel is given by: K_(k,m)(x, y) = ⟨Φ_(k,m)(x), Φ_(k,m)(y)⟩
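Before the efficient tree-based computation, the definition can be implemented directly: build the (exponentially large) feature vector and take an inner product. This naive sketch is only feasible for toy alphabets; it is shown to pin down the definitions, not as a practical method.

```python
from itertools import product

def mismatch_feature_map(seq, alphabet, k, m):
    """Phi_(k,m)(x): each k-mer of seq adds 1 to every beta within m mismatches."""
    phi = {}
    for i in range(len(seq) - k + 1):
        a = seq[i:i + k]
        for b in ("".join(p) for p in product(alphabet, repeat=k)):
            if sum(x != y for x, y in zip(a, b)) <= m:   # Hamming distance <= m
                phi[b] = phi.get(b, 0) + 1
    return phi

def mismatch_kernel(x, y, alphabet, k, m):
    """K_(k,m)(x, y) = <Phi(x), Phi(y)>."""
    px = mismatch_feature_map(x, alphabet, k, m)
    py = mismatch_feature_map(y, alphabet, k, m)
    return sum(px[b] * py.get(b, 0) for b in px)
```

This enumerates all l^k candidate k-mers per position, which is exactly the cost the mismatch tree avoids.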
Mismatch Tree - An efficient data structure Representation of the feature space as a tree Depth of tree: k Number of branches at each internal node: |A| = l Label of each branch: a symbol from A
Mismatch Tree - An efficient data structure (cont.) Alphabet A = {A, V, L}, k = 3 Internal nodes: prefixes of k-mers (A → AA, AV, AL; AA → AAA, AAV, AAL; ...) Leaf nodes: fixed k-mers
Mismatch Tree Traversal (DFS) Sequence: AALA, k = 3, m = 1 [Figure: DFS down the tree; each k-mer instance of the sequence carries a running mismatch count, and instances exceeding m = 1 mismatches are pruned] At each leaf: K(x, y) ← K(x, y) + count(x) · count(y)
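The traversal above can be sketched recursively: the tree is never materialised, each node just passes down the list of surviving (sequence, offset, mismatch-count) instances, and every leaf contributes count(x)·count(y) to each kernel entry. A Python sketch of this idea (names are ours; the paper does not specify an implementation language):

```python
def mismatch_tree_kernel(seqs, alphabet, k, m):
    """(k,m)-mismatch kernel matrix via DFS of the implicit mismatch tree."""
    n = len(seqs)
    K = [[0.0] * n for _ in range(n)]
    # An 'instance' is (sequence index, start offset, mismatches so far).
    root = [(s, i, 0) for s, x in enumerate(seqs) for i in range(len(x) - k + 1)]

    def dfs(depth, instances):
        if not instances:
            return                      # prune: nothing survives this prefix
        if depth == k:                  # leaf: a fixed k-mer
            counts = [0] * n
            for s, _, _ in instances:
                counts[s] += 1
            for i in range(n):
                for j in range(n):
                    K[i][j] += counts[i] * counts[j]
            return
        for a in alphabet:              # branch on the next symbol
            child = [(s, i, mm + (seqs[s][i + depth] != a))
                     for s, i, mm in instances
                     if mm + (seqs[s][i + depth] != a) <= m]
            dfs(depth + 1, child)

    dfs(0, root)
    return K
```

Only the current root-to-node path and its instance lists are ever in memory, which is the space saving claimed on the next slides.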
Outline Definitions & Background Proteins Remote Homology Detection SVMs Insides of the algorithm Feature mapping Mismatch tree data structure Mismatch tree traversal Computational Efficiency Experiments Discussion
Efficiency: Space Complexity No need to store the entire tree For k = 7: l^k = 20^7 ≈ 1.28 billion leaf nodes! No need to store all feature vectors Kernel trick!
Efficiency: Time Complexity A fixed k-mer α has O(k^m l^m) k-mers in its neighbourhood N = Mn, where M: number of sequences, n: the length of each sequence, N: total length of the dataset Whole dataset: O(N k^m l^m) k-mer instances traversed Worst case: perform O(M²) updates to the kernel matrix at each leaf Overall running complexity: O(M² n k^m l^m)
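The O(k^m l^m) bound comes from the exact neighbourhood size, Σ_{i≤m} C(k,i)(l−1)^i. A quick numeric check (our own helper, just to make the scale concrete):

```python
from math import comb

def neighbourhood_size(k, m, l):
    """Exact size of the (k,m)-mismatch neighbourhood over an l-letter alphabet."""
    return sum(comb(k, i) * (l - 1) ** i for i in range(m + 1))

# Toy alphabet from earlier slides: k = 3, m = 1, l = 3 -> 1 + 3*2 = 7
# Protein alphabet: k = 7, m = 1, l = 20 -> 1 + 7*19 = 134 k-mers,
# versus l^k = 20^7 = 1.28 billion leaves in the full feature space.
```

So each k-mer instance visits only a tiny fraction of the tree, which is why the traversal is tractable.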
System Pipeline Training Phase Compute the kernel matrix for all the training sequences Normalize (divide by the lengths of the feature vectors) Train the SVM classifier Compute and store the k-mer scores of the support vectors Testing Phase Compute the feature vector for each test datum and predict its class in linear time: f(x) = Σ_{i=1}^{r} y_i α_i ⟨Φ_(k,m)(x_i), Φ_(k,m)(x)⟩ + b
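Two steps of the pipeline are simple enough to sketch directly: normalization (dividing by feature-vector lengths is equivalent to K'(x,y) = K(x,y)/√(K(x,x)K(y,y))) and the SVM decision function over the r support vectors. These helpers are illustrative, not the authors' code:

```python
import math

def normalize_kernel(K):
    """K'(x,y) = K(x,y) / sqrt(K(x,x) * K(y,y)) -- unit-length feature vectors."""
    n = len(K)
    return [[K[i][j] / math.sqrt(K[i][i] * K[j][j]) for j in range(n)]
            for i in range(n)]

def decision_function(k_test, y, alpha, b):
    """f(x) = sum_i y_i * alpha_i * K(x_i, x) + b over the support vectors."""
    return sum(yi * ai * kx for yi, ai, kx in zip(y, alpha, k_test)) + b
```

After normalization every diagonal entry is 1, so kernel values become cosine similarities between feature vectors.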
Experiments Benchmark dataset designed by Jaakkola et al. from the SCOP database: 33 families For each family: positive test examples come from the family itself, positive training examples from the remaining families of its superfamily, and negative examples from outside the superfamily
Experiments (cont.) Comparison to other methods: PSI-BLAST (mainly used for homology detection) SAM-T98 Fisher-SVM (the state-of-the-art)
ROC Curve - ROC Scores [Figure: two ROC curves plotting the true-positive rate (TP) against the false-positive rate (FP), with areas 0.8 and 0.7] The ROC score is the area under the curve
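The ROC score (area under the curve) equals the fraction of (positive, negative) pairs that the classifier ranks correctly, with ties counting one half. A minimal sketch of that pairwise formulation (our own helper):

```python
def roc_score(scores_pos, scores_neg):
    """Area under the ROC curve: fraction of (pos, neg) pairs ranked correctly."""
    pairs = [(p, q) for p in scores_pos for q in scores_neg]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p, q in pairs)
    return wins / len(pairs)
```

A score of 1 means perfect separation of homologues from non-homologues; 0.5 is random ranking.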
Comparison of all methods
Family-by-family Comparison
Discussion Mismatch-SVM performs as well as the Fisher-SVM method Mismatch-SVM is much more efficient Efficiency is an important issue for: large real-world datasets, multi-class prediction Accuracy can be increased by incorporating biological knowledge
Questions?