Designing and combining kernels: some lessons learned from bioinformatics


1 Designing and combining kernels: some lessons learned from bioinformatics Jean-Philippe Vert Mines ParisTech & Institut Curie NIPS MKL workshop, Dec 12, 2009 Jean-Philippe Vert (ParisTech) Designing and combining kernels NIPS MKL workshop 09 1 / 26

2 Kernels are very popular in bioinformatics Why? Many problems can be approached by kernel methods (classification, regression, feature construction, ...) Many data with particular structures: kernel design Need to integrate heterogeneous data: kernel combination

3 Outline 1 Kernel design 2 Kernel combination 3 Conclusion


5 What is a GOOD kernel? Leads to good performance Mathematically valid Fast to compute Interpretable model (?) [Figure: strings in X (maskat..., msises..., marssl..., malhtv..., mappsv..., mahtlg...) mapped by φ to a feature space F]

6 How to MAKE a good kernel? 3 main ideas 1 Define good features: K(x, x') = Φ(x)ᵀΦ(x') 2 Define a good metric: d(x, x') = ( K(x, x) + K(x', x') − 2K(x, x') )^{1/2} 3 Define a good functional penalty: min_{f ∈ H} { R(f) + λ ||f||²_H }

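As a quick numerical check of idea 2, the sketch below computes the Hilbert metric induced by a kernel. The Gaussian RBF kernel and the sample points are illustrative choices, not from the talk:

```python
import numpy as np

def rbf(x, y, gamma=0.5):
    """Gaussian RBF kernel on real vectors (illustrative choice)."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def kernel_distance(k, x, y):
    """Hilbert metric induced by a kernel:
    d(x, x')^2 = K(x, x) + K(x', x') - 2 K(x, x')."""
    d2 = k(x, x) + k(y, y) - 2.0 * k(x, y)
    return np.sqrt(max(d2, 0.0))  # guard against tiny negative round-off

x = np.array([0.0, 1.0])
y = np.array([2.0, -1.0])
d = kernel_distance(rbf, x, y)
# The RBF feature map has unit norm, so d^2 can never exceed 2
print(d)
```

Any valid kernel can be plugged in for `rbf`; the same identity is what makes distance-substitution constructions tempting, and what fails when a "good" distance is not Hilbertian.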

9 Idea 1: define good features Motivation Estimate a function f(x) = wᵀΦ(x) A good feature is more important than a good algorithm! Examples Explicit feature computation: substring or subgraph indexation, Fisher kernel Φ(x) = ∇_θ log P_θ(x) Implicit feature construction + kernel trick: walk-based graph kernels, mutual information kernels K(x, x') = ∫ P_θ(x) P_θ(x') dθ Caveats One good feature among too many irrelevant ones may not be enough with L2 regularization


12 Example: string kernel with substring indexation Index the feature space by fixed-length strings, i.e., Φ(x) = (Φ_u(x))_{u ∈ A^k}, where Φ_u(x) can be: the number of occurrences of u in x (without gaps): spectrum kernel (Leslie et al., 2002) the number of occurrences of u in x up to m mismatches (without gaps): mismatch kernel (Leslie et al., 2004) the number of occurrences of u in x allowing gaps, with a weight decaying exponentially with the number of gaps: substring kernel (Lodhi et al., 2002)
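The spectrum kernel is simple enough to sketch in a few lines; this minimal version counts exact k-mers and takes the inner product of the count vectors (the test sequences are the two proteins used later in the local-alignment example):

```python
from collections import Counter

def spectrum_features(x, k):
    """Count occurrences of every length-k substring of x (no gaps)."""
    return Counter(x[i:i + k] for i in range(len(x) - k + 1))

def spectrum_kernel(x, y, k=3):
    """Inner product of k-mer count vectors (spectrum kernel, Leslie et al., 2002)."""
    fx, fy = spectrum_features(x, k), spectrum_features(y, k)
    return sum(c * fy[u] for u, c in fx.items())

print(spectrum_kernel("CGGSLIAMMWFGV", "CLIVMMNRLMWFGV", k=3))  # shared 3-mers: MWF, WFG, FGV -> 3
```

The mismatch and gapped substring kernels replace the exact-match `Counter` with more permissive occurrence counts; efficient implementations use tries or suffix automata rather than this explicit dictionary.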

13 Idea 2: define a good metric Motivation A kernel defines a Hilbert metric d(x, x') = ( K(x, x) + K(x', x') − 2K(x, x') )^{1/2} The functions we can learn are smooth w.r.t. this metric: |f(x) − f(x')| ≤ ||f||_H d(x, x') Examples Edit distances for strings or graphs, local alignment of biological sequences, graph matching distances MAMMOTH distance between protein 3D structures Caveats Most "good" distances are not Hilbertian

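The smoothness bound |f(x) − f(x')| ≤ ||f||_H d(x, x') follows from Cauchy-Schwarz in the RKHS, and can be checked numerically. The sketch below (my own toy setup: a random RKHS function built from an RBF kernel, not an example from the talk) verifies it on random point pairs:

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf(x, y, gamma=1.0):
    return np.exp(-gamma * np.sum((x - y) ** 2))

# A random RKHS function f = sum_i alpha_i K(x_i, .)
anchors = rng.normal(size=(5, 2))
alpha = rng.normal(size=5)
gram = np.array([[rbf(a, b) for b in anchors] for a in anchors])
f_norm = np.sqrt(alpha @ gram @ alpha)  # ||f||_H = sqrt(alpha^T K alpha)

def f(x):
    return sum(a * rbf(xi, x) for a, xi in zip(alpha, anchors))

def d(x, y):
    """Kernel-induced Hilbert metric."""
    return np.sqrt(max(rbf(x, x) + rbf(y, y) - 2 * rbf(x, y), 0.0))

# Check |f(x) - f(x')| <= ||f||_H d(x, x') on random pairs
for _ in range(100):
    x, y = rng.normal(size=2), rng.normal(size=2)
    assert abs(f(x) - f(y)) <= f_norm * d(x, y) + 1e-9
print("smoothness bound holds")
```

This is why the choice of metric matters: the kernel decides which pairs of inputs the learned function is forced to treat similarly.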

16 Example: local alignment kernel How to compare 2 protein sequences? x1 = CGGSLIAMMWFGV x2 = CLIVMMNRLMWFGV Find a good alignment π: CGGSLIAMM----WFGV C---LIVMMNRLMWFGV Two non-Hilbertian metrics: SW(x, y) := max_{π ∈ Π(x,y)} s(x, y, π) and (1/β) log K_LA^{(β)}(x, y) = (1/β) log Σ_{π ∈ Π(x,y)} exp( β s(x, y, π) )

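Rather than implementing the full alignment dynamic program, a toy list of alignment scores is enough to see how the log-sum-exp in K_LA relates to the Smith-Waterman score: it is a soft maximum that converges to max_π s(x, y, π) as β grows. The scores below are hypothetical, standing in for s(x, y, π) over a small alignment set Π(x, y):

```python
import math

def soft_max(scores, beta):
    """(1/beta) * log sum_pi exp(beta * s(pi)): a smooth surrogate for
    max(scores) that converges to it as beta -> infinity."""
    m = max(scores)  # shift by the max to stabilise the log-sum-exp
    return m + math.log(sum(math.exp(beta * (s - m)) for s in scores)) / beta

# Hypothetical alignment scores s(x, y, pi) for a handful of alignments pi
scores = [3.0, 7.5, 7.4, 1.2]
for beta in (0.5, 2.0, 50.0):
    print(beta, soft_max(scores, beta))
# As beta grows, the value approaches the Smith-Waterman score max(scores) = 7.5
```

The soft version sums over all alignments, which is what makes K_LA positive definite for suitable β while the hard SW score is not.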

18 Idea 3: define a good penalty function Motivation The kernel constrains the set of functions over which we optimize (balls in the RKHS). We may first define a penalty we like, then find the associated kernel. Examples graph Laplacian over gene networks cluster kernel for protein remote homology detection Caveats Some penalties may not be RKHS norms (e.g., total variation to estimate piecewise constant functions)


21 Example: kernel on a graph Laplacian-based kernel The set H = { f ∈ R^m : Σ_{i=1}^m f_i = 0 }, endowed with the norm Ω(f) = Σ_{i∼j} ( f(x_i) − f(x_j) )², is an RKHS whose reproducing kernel is the pseudo-inverse of the graph Laplacian.

22 The choice of kernel makes a difference [Figure: number of families with given ROC50 performance for SVM-LA, SVM-pairwise, SVM-Mismatch and SVM-Fisher] Performance on the SCOP superfamily recognition benchmark.

23 Outline 1 Kernel design 2 Kernel combination 3 Conclusion

24 Motivation We can imagine plenty of kernels for a given application: different kernels for the same data (e.g., different string kernels) kernels for different types of data (e.g., integrating strings and 3D structures for protein classification) Which one to use? Perhaps we can combine them to do better than each one individually?

25 Sum kernels Consider p kernels K_1, ..., K_p Form the sum (e.g., Pavlidis et al., 2002): K = Σ_{i=1}^p K_i Equivalently, concatenate the features of the different kernels Equivalently, work in the RKHS H = H_1 ⊕ ... ⊕ H_p with ||f||²_H = inf_{f = f_1 + ... + f_p} Σ_{i=1}^p ||f_i||²_{H_i}
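The equivalence between summing kernels and concatenating features is a one-line identity; the sketch below checks it on two made-up feature maps (random matrices, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two feature maps evaluated on the same 4 inputs (hypothetical data)
Phi1 = rng.normal(size=(4, 3))   # features of kernel K1
Phi2 = rng.normal(size=(4, 5))   # features of kernel K2

K1 = Phi1 @ Phi1.T
K2 = Phi2 @ Phi2.T

# Concatenating the feature vectors yields exactly the Gram matrix of K1 + K2
Phi = np.hstack([Phi1, Phi2])
assert np.allclose(Phi @ Phi.T, K1 + K2)
print("sum kernel = concatenated features")
```

In practice one simply adds the p precomputed Gram matrices and trains a single SVM on the result, which is why the sum kernel is such a cheap data-integration baseline.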

26 Some early work

27 Huge improvements can be observed [Figure: two ROC curves (ratio of true positives vs. ratio of false positives) for the exp, loc, phy, y2h and int kernels]

28 Multiple kernel learning (MKL) Form the convex combination K = Σ_{i=1}^p η_i K_i, where the weights are chosen to minimize the following convex function under the constraint tr(K) = 1 (Lanckriet et al., 2003): h(K) = inf_{f ∈ H_K} { R(f) + λ ||f||_{H_K} } Equivalently, work in the space H = H_1 ⊕ ... ⊕ H_p with the non-Hilbertian group L1 norm (Bach et al., 2004): ||f||_H = inf_{f = f_1 + ... + f_p} Σ_{i=1}^p ||f_i||_{H_i}
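To make h(K) concrete, here is a deliberately tiny stand-in for MKL (not Lanckriet et al.'s SDP or the group-lasso solver of Bach et al.): for kernel ridge regression h(K) has the closed form (λ/2) yᵀ(K + λI)⁻¹y, so on two toy trace-normalized kernels we can just grid-search the mixing weight η. The data and kernels are contrived so that K1 is informative and K2 is not:

```python
import numpy as np

rng = np.random.default_rng(2)
lam = 0.1

# Toy target and two rank-one, unit-trace kernels: K1 aligned with y, K2 not
y = rng.normal(size=20)
u = rng.normal(size=20)
u -= (u @ y) / (y @ y) * y            # make the noise direction orthogonal to y
K1 = np.outer(y, y) / (y @ y)         # informative kernel
K2 = np.outer(u, u) / (u @ u)         # irrelevant kernel

def h(eta):
    """Ridge version of h(K) at K = eta*K1 + (1-eta)*K2:
    inf_f (1/2)||y - f||^2 + (lam/2)||f||^2_HK = (lam/2) y^T (K + lam I)^{-1} y."""
    K = eta * K1 + (1 - eta) * K2
    return 0.5 * lam * y @ np.linalg.solve(K + lam * np.eye(len(y)), y)

grid = np.linspace(0.0, 1.0, 101)
eta_best = grid[np.argmin([h(e) for e in grid])]
print(eta_best)  # all the weight goes to the informative kernel
```

Real MKL solvers optimize η and the classifier jointly and scale to many kernels; this grid search only illustrates what the objective h rewards, namely kernels whose leading eigenvectors align with the labels.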

29 Example: Lanckriet et al. (2004)

30 MKL or sum kernel for protein network inference? [Figure: precision-recall curves for K1 (expression), K2 (localization), K3 (phylogenetic), their sum, and MKL]

33 Why MKL does not estimate a good kernel combination


37 Sometimes MKL works Subcellular protein classification from 69 kernels

38 MKL or sum kernel? Sum is simpler and works better to combine well-engineered kernels (e.g., for data integration). In spite of its misleading name, MKL is better suited for kernel selection than for weight optimization (l2 vs l1 regularization). Useful to select among large sets of kernels. We would love to be able to select the "optimal" linear combination of a few kernels

39 Outline 1 Kernel design 2 Kernel combination 3 Conclusion

40 Conclusion Are kernels popular and useful in bioinformatics? YES Is kernel design useful? YES, and we have many tricks for that Is kernel combination useful for performance? YES, and the sum kernel does a good job Is MKL useful? Hardly yet, but it offers the promising possibility to work with MANY kernels and emphasize INTERPRETABILITY Do we want to learn good linear combinations? YES


More information

Basics of Multiple Sequence Alignment

Basics of Multiple Sequence Alignment Basics of Multiple Sequence Alignment Tandy Warnow February 10, 2018 Basics of Multiple Sequence Alignment Tandy Warnow Basic issues What is a multiple sequence alignment? Evolutionary processes operating

More information

Support Vector Machines

Support Vector Machines Support Vector Machines About the Name... A Support Vector A training sample used to define classification boundaries in SVMs located near class boundaries Support Vector Machines Binary classifiers whose

More information

Protein Function Prediction by Integrating Multiple Kernels

Protein Function Prediction by Integrating Multiple Kernels Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence Protein Function Prediction by Integrating Multiple Kernels Guoxian Yu,4, Huzefa Rangwala 2, Carlotta Domeniconi

More information

Thorsten Joachims Then: Universität Dortmund, Germany Now: Cornell University, USA

Thorsten Joachims Then: Universität Dortmund, Germany Now: Cornell University, USA Retrospective ICML99 Transductive Inference for Text Classification using Support Vector Machines Thorsten Joachims Then: Universität Dortmund, Germany Now: Cornell University, USA Outline The paper in

More information

Generalized Similarity Kernels for Efficient Sequence Classification

Generalized Similarity Kernels for Efficient Sequence Classification Generalized Similarity Kernels for Efficient Sequence Classification Pavel P. Kuksa NEC Laboratories America, Inc. Princeton, NJ 08540 pkuksa@nec-labs.com Imdadullah Khan Department of Computer Science

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu [Kumar et al. 99] 2/13/2013 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu

More information

Learning Hierarchies at Two-class Complexity

Learning Hierarchies at Two-class Complexity Learning Hierarchies at Two-class Complexity Sandor Szedmak ss03v@ecs.soton.ac.uk Craig Saunders cjs@ecs.soton.ac.uk John Shawe-Taylor jst@ecs.soton.ac.uk ISIS Group, Electronics and Computer Science University

More information

Application of Support Vector Machine In Bioinformatics

Application of Support Vector Machine In Bioinformatics Application of Support Vector Machine In Bioinformatics V. K. Jayaraman Scientific and Engineering Computing Group CDAC, Pune jayaramanv@cdac.in Arun Gupta Computational Biology Group AbhyudayaTech, Indore

More information

Learning Algorithms for Medical Image Analysis. Matteo Santoro slipguru

Learning Algorithms for Medical Image Analysis. Matteo Santoro slipguru Learning Algorithms for Medical Image Analysis Matteo Santoro slipguru santoro@disi.unige.it June 8, 2010 Outline 1. learning-based strategies for quantitative image analysis 2. automatic annotation of

More information

Bipartite Edge Prediction via Transductive Learning over Product Graphs

Bipartite Edge Prediction via Transductive Learning over Product Graphs Bipartite Edge Prediction via Transductive Learning over Product Graphs Hanxiao Liu, Yiming Yang School of Computer Science, Carnegie Mellon University July 8, 2015 ICML 2015 Bipartite Edge Prediction

More information

Kernel Methods & Support Vector Machines

Kernel Methods & Support Vector Machines & Support Vector Machines & Support Vector Machines Arvind Visvanathan CSCE 970 Pattern Recognition 1 & Support Vector Machines Question? Draw a single line to separate two classes? 2 & Support Vector

More information

Machine Learning: Think Big and Parallel

Machine Learning: Think Big and Parallel Day 1 Inderjit S. Dhillon Dept of Computer Science UT Austin CS395T: Topics in Multicore Programming Oct 1, 2013 Outline Scikit-learn: Machine Learning in Python Supervised Learning day1 Regression: Least

More information

Choosing the kernel parameters for SVMs by the inter-cluster distance in the feature space Authors: Kuo-Ping Wu, Sheng-De Wang Published 2008

Choosing the kernel parameters for SVMs by the inter-cluster distance in the feature space Authors: Kuo-Ping Wu, Sheng-De Wang Published 2008 Choosing the kernel parameters for SVMs by the inter-cluster distance in the feature space Authors: Kuo-Ping Wu, Sheng-De Wang Published 2008 Presented by: Nandini Deka UH Mathematics Spring 2014 Workshop

More information

1 Case study of SVM (Rob)

1 Case study of SVM (Rob) DRAFT a final version will be posted shortly COS 424: Interacting with Data Lecturer: Rob Schapire and David Blei Lecture # 8 Scribe: Indraneel Mukherjee March 1, 2007 In the previous lecture we saw how

More information

Manifolds Pt. 1. PHYS Southern Illinois University. August 24, 2016

Manifolds Pt. 1. PHYS Southern Illinois University. August 24, 2016 Manifolds Pt. 1 PHYS 500 - Southern Illinois University August 24, 2016 PHYS 500 - Southern Illinois University Manifolds Pt. 1 August 24, 2016 1 / 20 Motivating Example: Pendulum Example Consider a pendulum

More information

A popular method for moving beyond linearity. 2. Basis expansion and regularization 1. Examples of transformations. Piecewise-polynomials and splines

A popular method for moving beyond linearity. 2. Basis expansion and regularization 1. Examples of transformations. Piecewise-polynomials and splines A popular method for moving beyond linearity 2. Basis expansion and regularization 1 Idea: Augment the vector inputs x with additional variables which are transformation of x use linear models in this

More information

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review Gautam Kunapuli Machine Learning Data is identically and independently distributed Goal is to learn a function that maps to Data is generated using an unknown function Learn a hypothesis that minimizes

More information

Stephen Scott.

Stephen Scott. 1 / 33 sscott@cse.unl.edu 2 / 33 Start with a set of sequences In each column, residues are homolgous Residues occupy similar positions in 3D structure Residues diverge from a common ancestral residue

More information

Spectral Clustering of Biological Sequence Data

Spectral Clustering of Biological Sequence Data Spectral Clustering of Biological Sequence Data William Pentney Department of Computer Science and Engineering University of Washington bill@cs.washington.edu Marina Meila Department of Statistics University

More information

BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha

BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS 466 Saurabh Sinha Motivation Sequence homology to a known protein suggest function of newly sequenced protein Bioinformatics

More information

12 Classification using Support Vector Machines

12 Classification using Support Vector Machines 160 Bioinformatics I, WS 14/15, D. Huson, January 28, 2015 12 Classification using Support Vector Machines This lecture is based on the following sources, which are all recommended reading: F. Markowetz.

More information

Expectation Maximization!

Expectation Maximization! Expectation Maximization! adapted from: Doug Downey and Bryan Pardo, Northwestern University and http://www.stanford.edu/class/cs276/handouts/lecture17-clustering.ppt Steps in Clustering Select Features

More information

Support Vector Machines (SVM)

Support Vector Machines (SVM) Support Vector Machines a new classifier Attractive because (SVM) Has sound mathematical foundations Performs very well in diverse and difficult applications See paper placed on the class website Review

More information

All lecture slides will be available at CSC2515_Winter15.html

All lecture slides will be available at  CSC2515_Winter15.html CSC2515 Fall 2015 Introduc3on to Machine Learning Lecture 9: Support Vector Machines All lecture slides will be available at http://www.cs.toronto.edu/~urtasun/courses/csc2515/ CSC2515_Winter15.html Many

More information

Machine Learning Classifiers and Boosting

Machine Learning Classifiers and Boosting Machine Learning Classifiers and Boosting Reading Ch 18.6-18.12, 20.1-20.3.2 Outline Different types of learning problems Different types of learning algorithms Supervised learning Decision trees Naïve

More information

SpicyMKL Efficient multiple kernel learning method using dual augmented Lagrangian

SpicyMKL Efficient multiple kernel learning method using dual augmented Lagrangian SpicyMKL Efficient multiple kernel learning method using dual augmented Lagrangian Taiji Suzuki Ryota Tomioka The University of Tokyo Graduate School of Information Science and Technology Department of

More information

Energy Minimization for Segmentation in Computer Vision

Energy Minimization for Segmentation in Computer Vision S * = arg S min E(S) Energy Minimization for Segmentation in Computer Vision Meng Tang, Dmitrii Marin, Ismail Ben Ayed, Yuri Boykov Outline Clustering/segmentation methods K-means, GrabCut, Normalized

More information

Information Management course

Information Management course Università degli Studi di Milano Master Degree in Computer Science Information Management course Teacher: Alberto Ceselli Lecture 20: 10/12/2015 Data Mining: Concepts and Techniques (3 rd ed.) Chapter

More information

9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology

9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology 9/9/ I9 Introduction to Bioinformatics, Clustering algorithms Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Outline Data mining tasks Predictive tasks vs descriptive tasks Example

More information

Ranking on Graph Data

Ranking on Graph Data Shivani Agarwal SHIVANI@CSAIL.MIT.EDU Computer Science and AI Laboratory, Massachusetts Institute of Technology, Cambridge, MA 0139, USA Abstract In ranking, one is given examples of order relationships

More information

Introduction to object recognition. Slides adapted from Fei-Fei Li, Rob Fergus, Antonio Torralba, and others

Introduction to object recognition. Slides adapted from Fei-Fei Li, Rob Fergus, Antonio Torralba, and others Introduction to object recognition Slides adapted from Fei-Fei Li, Rob Fergus, Antonio Torralba, and others Overview Basic recognition tasks A statistical learning approach Traditional or shallow recognition

More information

Content-based image and video analysis. Machine learning

Content-based image and video analysis. Machine learning Content-based image and video analysis Machine learning for multimedia retrieval 04.05.2009 What is machine learning? Some problems are very hard to solve by writing a computer program by hand Almost all

More information

k-means Clustering Todd W. Neller Gettysburg College Laura E. Brown Michigan Technological University

k-means Clustering Todd W. Neller Gettysburg College Laura E. Brown Michigan Technological University k-means Clustering Todd W. Neller Gettysburg College Laura E. Brown Michigan Technological University Outline Unsupervised versus Supervised Learning Clustering Problem k-means Clustering Algorithm Visual

More information

Support vector machine learning from heterogeneous data: an empirical analysis using protein sequence and structure

Support vector machine learning from heterogeneous data: an empirical analysis using protein sequence and structure BIOINFORMATICS ORIGINAL PAPER Vol. 22 no. 22 2006, pages 2753 2760 doi:10.1093/bioinformatics/btl475 Structural bioinformatics Support vector machine learning from heterogeneous data: an empirical analysis

More information

Iteratively Re-weighted Least Squares for Sums of Convex Functions

Iteratively Re-weighted Least Squares for Sums of Convex Functions Iteratively Re-weighted Least Squares for Sums of Convex Functions James Burke University of Washington Jiashan Wang LinkedIn Frank Curtis Lehigh University Hao Wang Shanghai Tech University Daiwei He

More information

Feature Selection in Learning Using Privileged Information

Feature Selection in Learning Using Privileged Information November 18, 2017 ICDM 2017 New Orleans Feature Selection in Learning Using Privileged Information Rauf Izmailov, Blerta Lindqvist, Peter Lin rizmailov@vencorelabs.com Phone: 908-748-2891 Agenda Learning

More information

APPROXIMATING PDE s IN L 1

APPROXIMATING PDE s IN L 1 APPROXIMATING PDE s IN L 1 Veselin Dobrev Jean-Luc Guermond Bojan Popov Department of Mathematics Texas A&M University NONLINEAR APPROXIMATION TECHNIQUES USING L 1 Texas A&M May 16-18, 2008 Outline 1 Outline

More information

Warped Mixture Models

Warped Mixture Models Warped Mixture Models Tomoharu Iwata, David Duvenaud, Zoubin Ghahramani Cambridge University Computational and Biological Learning Lab March 11, 2013 OUTLINE Motivation Gaussian Process Latent Variable

More information