1 Support vector machine prediction of signal peptide cleavage site using a new class of kernels for strings Jean-Philippe Vert Bioinformatics Center, Kyoto University, Japan
2 Outline 1. SVM and kernel methods 2. New kernels for bioinformatics 3. Example: signal peptide cleavage site prediction
3 Part 1 SVM and kernel methods
4 Support vector machines
Objects x to be classified are mapped by φ to a feature space.
The SVM finds the largest-margin separating hyperplane in that feature space.
5 The kernel trick
Implicit definition of Φ through the kernel: K(x, y) := ⟨Φ(x), Φ(y)⟩
Simple kernels can represent complex maps Φ
For a given kernel, not only SVM but also clustering, PCA, ICA... become possible in the feature space = kernel methods
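An illustration of the trick (not from the slides): for the homogeneous polynomial kernel K(x, y) = ⟨x, y⟩², the feature map Φ onto degree-2 monomials is explicit, and the kernel computes the feature-space inner product without ever building Φ(x). A minimal numpy check:

```python
import numpy as np

def poly2_kernel(x, y):
    """Homogeneous degree-2 polynomial kernel: K(x, y) = <x, y>^2."""
    return np.dot(x, y) ** 2

def phi(x):
    """Explicit feature map for poly2_kernel on R^2:
    (x1^2, x2^2, sqrt(2)*x1*x2)."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])

# The two computations agree: the kernel evaluates the inner product
# in the feature space without constructing Phi explicitly.
assert np.isclose(poly2_kernel(x, y), np.dot(phi(x), phi(y)))
```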
6 Kernel examples
Classical kernels: polynomial, Gaussian, sigmoid... but the objects x must be vectors
Exotic kernels for strings:
- Fisher kernel (Jaakkola and Haussler 98)
- Convolution kernels (Haussler 99, Watkins 99)
- Kernel for translation initiation site (Zien et al. 00)
- String kernel (Lodhi et al. 00)
- Spectrum kernel (Leslie et al., PSB 2002)
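As an aside (not from the slides): the simplest of these string kernels, the k-spectrum kernel of Leslie et al., compares two strings through the k-mers they share. A minimal sketch, ignoring normalization and mismatch tolerance:

```python
from collections import Counter

def spectrum_kernel(s, t, k=3):
    """k-spectrum kernel: inner product of k-mer count vectors.
    Phi(s) indexes every length-k substring of s by its count."""
    cs = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    ct = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    # Sum over k-mers present in both strings.
    return sum(cs[kmer] * ct[kmer] for kmer in cs if kmer in ct)

print(spectrum_kernel("MKANAKTIIAGM", "MKQSTIALALLP", k=2))
```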
7 Kernel engineering Use prior knowledge to build the geometry of the feature space through K(.,.)
8 Part 2 New kernels for bioinformatics
9 The problem
X: a set of (structured) objects
p(x): a probability distribution on X
How to build K(x, y) from p(x)?
10 Product kernel
K_prod(x, y) = p(x) p(y)
(Figure: a one-dimensional feature space; each object x is mapped to the point Φ(x) = p(x), so x and y sit at p(x) and p(y) on a single axis.)
SVM with K_prod = Bayesian classifier
11 Diagonal kernel
K_diag(x, y) = p(x) δ(x, y)
(Figure: distinct objects x, y, z map to mutually orthogonal directions of squared lengths p(x), p(y), p(z).)
No learning: every object is orthogonal to every other, so nothing transfers between examples
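Both base kernels in code, for a toy distribution stored as a dict (the objects and probabilities here are hypothetical; the slides define the kernels abstractly):

```python
def k_prod(x, y, p):
    """Product kernel: K(x, y) = p(x) * p(y).
    One-dimensional feature space, Phi(x) = p(x)."""
    return p[x] * p[y]

def k_diag(x, y, p):
    """Diagonal kernel: K(x, y) = p(x) if x == y else 0.
    Every object sits on its own orthogonal axis: no generalization."""
    return p[x] if x == y else 0.0

p = {"AA": 0.4, "AB": 0.3, "BA": 0.2, "BB": 0.1}  # toy distribution
print(k_prod("AA", "AB", p))   # 0.12
print(k_diag("AA", "AA", p))   # 0.4
print(k_diag("AA", "AB", p))   # 0.0
```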
12 Interpolated kernel
If objects are composite, x = (x_1, x_2):
K(x, y) = K_diag(x_1, y_1) K_prod(x_2, y_2) = p(x_1) δ(x_1, y_1) p(x_2 | x_1) p(y_2 | y_1)
(Figure: the two-letter strings AA, AB, BA, BB in the feature space, grouped into two orthogonal clusters A* and B* by their first letter.)
13 General interpolated kernel
Composite objects x = (x_1, ..., x_n)
A list of index subsets: V = {I_1, ..., I_v}, where each I_i ⊆ {1, ..., n}
Interpolated kernel:
K_V(x, y) = (1/|V|) Σ_{I ∈ V} K_diag(x_I, y_I) K_prod(x_{I^c}, y_{I^c})
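A direct, exponential-time transcription of this definition, assuming (as slide 12 suggests) that K_prod on the complement uses the distribution conditioned on the indexed subpart, which reduces to a plain product for the product distributions used here. A sketch, not the paper's code:

```python
from itertools import chain, combinations

def all_subsets(n):
    """V = P({1,...,n}): every subset of the n index positions."""
    return list(chain.from_iterable(combinations(range(n), r)
                                    for r in range(n + 1)))

def k_interpolated_naive(x, y, p, V):
    """Direct sum over V: K_V(x,y) = (1/|V|) * sum_{I in V}
    K_diag(x_I, y_I) * K_prod(x_{I^c}, y_{I^c}).
    p is a product distribution: p[j][a] = p_j(a)."""
    total = 0.0
    for I in V:
        term = 1.0
        for j in range(len(x)):
            if j in I:
                # diagonal part on I: p_j(x_j) if letters match, else 0
                term *= p[j][x[j]] if x[j] == y[j] else 0.0
            else:
                # product part on the complement I^c
                term *= p[j][x[j]] * p[j][y[j]]
        total += term
    return total / len(V)

p = [{"A": 0.6, "B": 0.4}] * 3          # toy product distribution
print(k_interpolated_naive("AAB", "AAA", p, all_subsets(3)))
```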
14 Rare common subparts
For a given p(x) and p(y), we have:
K_V(x, y) = K_prod(x, y) · (1/|V|) Σ_{I ∈ V} δ(x_I, y_I) / p(x_I)
x and y get closer in the feature space when they share rare common subparts
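A one-term check of this identity, under the same conditional convention as slide 12: for a single I ∈ V,

```latex
K_{\mathrm{diag}}(x_I, y_I)\, K_{\mathrm{prod}}(x_{I^c}, y_{I^c})
  = p(x_I)\,\delta(x_I, y_I)\; p(x_{I^c} \mid x_I)\, p(y_{I^c} \mid y_I)
  = \delta(x_I, y_I)\, \frac{p(x)\, p(y)}{p(x_I)}
  = K_{\mathrm{prod}}(x, y)\, \frac{\delta(x_I, y_I)}{p(x_I)},
```

since the δ term forces y_I = x_I; averaging over I ∈ V gives the displayed formula.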
15 Implementation
Factorization for particular choices of p(·) and V
Example: V = P({1, ..., n}), the set of all subsets, so |V| = 2^n, with a product distribution p(x) = Π_{j=1}^n p_j(x_j)
Then the 2^n-term sum factorizes, giving an implementation in O(n):
K_V(x, y) = Π_{j=1}^n (1/2) [ p_j(x_j) p_j(y_j) + p_j(x_j) δ(x_j, y_j) ]
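In code, a sketch of this factorization (same product-distribution assumption as the naive version above); each per-position factor merges the "j ∈ I" (diagonal) and "j ∉ I" (product) cases:

```python
def k_interpolated_fast(x, y, p):
    """O(n) form of K_V for V = all subsets and product distribution p:
    K_V(x,y) = prod_j (1/2) * [ p_j(x_j)*p_j(y_j) + p_j(x_j)*delta(x_j,y_j) ]."""
    k = 1.0
    for j in range(len(x)):
        match = p[j][x[j]] if x[j] == y[j] else 0.0
        k *= 0.5 * (p[j][x[j]] * p[j][y[j]] + match)
    return k

p = [{"A": 0.6, "B": 0.4}] * 3
print(k_interpolated_fast("AAB", "AAA", p))
```

On this toy input the value (0.027648) matches the 2^n-term sum from the earlier sketch.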
16 Part 3 Application: SVM prediction of signal peptide cleavage site
17 Secretory pathway
(Figure: schematic of the secretory pathway; from mRNA, the nascent protein's signal peptide directs it into the ER, then through the Golgi to destinations including the cell surface (secreted), lysosome, and plasma membrane; other labeled compartments: nucleus, chloroplast, mitochondrion, peroxisome, cytosol.)
18 Signal peptides
Three example proteins, with the cleavage site (between positions -1 and +1) marked by the space:
(1) MKANAKTIIAGMIALAISHTAMA EE...
(2) MKQSTIALALLPLLFTPVTKA RT...
(3) MKATKLVLGAVILGSTLLAG CS...
(1): Leucine-binding protein, (2): Pre-alkaline phosphatase, (3): Pre-lipoprotein
Typical features: 6-12 hydrophobic residues (highlighted in yellow on the original slide); small uncharged residues at positions (-3, -1)
19 Experiment
Challenge: classification of amino-acid windows, positive if cleavage occurs between positions -1 and +1: [x_{-8}, x_{-7}, ..., x_{-1}, x_{+1}, x_{+2}]
1,418 positive examples, 65,216 negative examples
Computation of a weight matrix: SVM + K_prod (naive Bayes) vs. SVM + interpolated kernel
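A hypothetical end-to-end sketch with scikit-learn (the slides do not name any software; the loader and details below are assumptions): estimate per-position residue frequencies p_j from training windows — the weight matrix — precompute the Gram matrix with the O(n) interpolated kernel above, and train an SVM with a precomputed kernel:

```python
import numpy as np
from sklearn.svm import SVC

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def fit_position_probs(windows, pseudo=1.0):
    """Per-position residue frequencies p_j (a position weight matrix),
    with pseudocounts so no probability is exactly zero."""
    n = len(windows[0])
    probs = []
    for j in range(n):
        counts = {a: pseudo for a in AMINO_ACIDS}
        for w in windows:
            counts[w[j]] += 1.0
        total = sum(counts.values())
        probs.append({a: c / total for a, c in counts.items()})
    return probs

def gram_matrix(A, B, p):
    """Gram matrix K[i, j] = K_V(A[i], B[j]), reusing the O(n)
    k_interpolated_fast from the sketch above."""
    return np.array([[k_interpolated_fast(a, b, p) for b in B] for a in A])

# Hypothetical usage -- load_cleavage_windows() is not a real function:
# windows, labels = load_cleavage_windows()  # 10-mers [x_-8..x_-1, x_+1, x_+2], labels +/-1
# p = fit_position_probs(windows)
# clf = SVC(kernel="precomputed").fit(gram_matrix(windows, windows, p), labels)
# scores = clf.decision_function(gram_matrix(test_windows, windows, p))
```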
20 Result: ROC curves
(Figure: ROC curves plotting false negatives (%) against false positives (%, 0-24) for SVM + interpolated kernel and SVM + product kernel (Bayes); the interpolated kernel achieves the better trade-off.)
21 Conclusion
22 Conclusion
Another way to derive a kernel from a probability distribution
Useful when objects can be compared by comparing their subparts
Encouraging results on a real-world application: how to improve a weight-matrix-based classifier
Future work: more application-specific kernels
23 Acknowledgement Minoru Kanehisa Applied Biosystems for the travel grant