Sparse and large-scale learning with heterogeneous data


1 Sparse and large-scale learning with heterogeneous data February 15, 2007 Gert Lanckriet IEEE-SDCIS

2 In this talk: statistical machine learning. Techniques: roots in classical statistics. Machine learning: blends statistics, computer science, signal processing, and optimization to deal with specific challenges, to address real-world problems with statistical techniques. Challenges: massive scale of data sets; on-line issues; diversity of information sources describing data; interpretability / cost: sparsity (e.g., feature selection).

3 Example: web-related applications. Data point = web page. Sources of information about the web page: content (text, images, structure, sounds); relation to other web pages (links, network); users (log data: click behavior, origin).

4 Example: web-related applications. Data point = web page. Sources of information about the web page: content (text, images, structure, sounds); relation to other web pages (links, network); users (log data: click behavior, origin). Information in diverse (heterogeneous) formats.

5 Example: web-related applications. Data point = web page. Sources of information about the web page: content (text, images, structure, sounds); relation to other web pages (links, network); users (log data: click behavior, origin). Massive! Information in diverse (heterogeneous) formats.

6 Example: music annotation. Audio features and a review: heterogeneous descriptions of the same song.

7 Example: music annotation. Audio features and a review: joint model?

8 Example: music annotation. Joint model: annotate new songs; retrieve songs from a database given a description (e.g., to automatically generate playlists). Sparsity: k most relevant words per song.

9 Example: bioinformatics. mRNA expression data; hydrophobicity data; protein-protein interaction data; sequence data (gene, protein); upstream region data (TF binding sites).

10 Example: bioinformatics. mRNA expression data; hydrophobicity data; protein-protein interaction data; sequence data (gene, protein); upstream region data (TF binding sites). Sparsity: interpretation.

11 This talk Classification problems with heterogeneous information sources Sparse principal component analysis

12 Overview Kernel methods Classification problems Kernel methods with heterogeneous information Classification with heterogeneous information (SDP) Application in computational biology An efficient algorithm


14 Kernel-based learning. Pipeline: data x_1, ..., x_n -> embed data -> linear algorithm (SVM, MPM, PCA, CCA, FDA). If the data are described by numerical vectors, the embedding is a (non-linear) transformation, which yields non-linear versions of linear algorithms.

15 Kernel-based learning. Pipeline: data x_1, ..., x_n -> embed data -> linear algorithm (SVM, MPM, PCA, CCA, FDA). The embedding can also be defined for non-vector data.

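16 Kernel-based learning. Embed data IMPLICITLY: the inner product measures similarity, and entry (i, j) of the kernel matrix K is the inner product between the embedded points i and j. Property: any symmetric positive semidefinite matrix specifies a kernel matrix, and every kernel matrix is symmetric positive semidefinite.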
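The kernel-matrix property on this slide can be illustrated with a small sketch. It assumes a Gaussian kernel and four toy 2-D points, neither of which comes from the talk; the function names are hypothetical. It builds K and checks symmetry and non-negativity of the quadratic form v^T K v on random vectors, which is a spot check and not a proof of positive semidefiniteness.

```python
import math
import random

def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) similarity between two numeric vectors (assumed kernel choice)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def kernel_matrix(points, kernel):
    """Entry (i, j) holds the kernel value between points i and j."""
    return [[kernel(xi, xj) for xj in points] for xi in points]

def quadratic_form(K, v):
    """Compute v^T K v, which is non-negative for every v when K is PSD."""
    n = len(K)
    return sum(v[i] * K[i][j] * v[j] for i in range(n) for j in range(n))

# Toy data, purely illustrative.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 1.0)]
K = kernel_matrix(points, gaussian_kernel)

n = len(points)
is_symmetric = all(math.isclose(K[i][j], K[j][i]) for i in range(n) for j in range(n))

# Spot check of positive semidefiniteness on random directions.
rng = random.Random(0)
forms = [quadratic_form(K, [rng.uniform(-1, 1) for _ in range(n)]) for _ in range(100)]
```

Any linear algorithm that only needs inner products can then work from K alone, without ever forming the embedding explicitly.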

17 Kernel-based learning. Pipeline: data x_1, ..., x_n -> embed data.

18 Kernel-based learning. Pipeline: data x_1, ..., x_n -> embed data (kernel design), summarized by K -> linear algorithm (kernel algorithm: SVM, MPM, PCA, CCA, FDA).

19 Kernel methods. Unifying learning framework: connections to statistics, convex optimization, functional analysis; different data analysis problems can be formulated within this framework (classification, clustering, regression, dimensionality reduction). Many successful applications.

20 Kernel methods. Unifying learning framework: connections to statistics, convex optimization, functional analysis; different data analysis problems can be formulated within this framework. Many successful applications: handwriting recognition, text classification, analysis of micro-array data, face detection, time series prediction.

21 Binary classification. Training data: {(x_i, y_i)}, i = 1, ..., n, where x_i is the description of the i-th object and y_i in {-1, +1} is its label. Example: objects x_1 and x_2 are each described by HEART, URINE, DNA, BLOOD, and SCAN measurements, with labels y_1 = -1 and y_2 = +1. Problem: design a classification rule such that, given a new x, it predicts y with minimal probability of error.

22 Binary classification. Find a hyperplane that separates the two classes (figure: objects x_1 and x_2, each described by HEART, URINE, DNA, BLOOD, SCAN). Classification rule: the side of the hyperplane on which a new point falls.

23 Maximal margin classification. Maximize the margin: position the hyperplane between the two classes such that the 2-norm distance to the closest point from each class is maximized.

24 Maximal margin classification. If not linearly separable: allow some errors; try to maximize the margin for data points with no error.

25 Maximal margin classification: training algorithm. Objective: maximize the margin, minimize the error. Constraints: points are correctly classified, up to an error slack.

26 Maximal margin classification Training: convex optimization problem (QP) Dual problem:

27 Maximal margin classification Training: convex optimization problem (QP) Dual problem: Optimality condition:

28 Maximal margin classification Training: Classification rule: classify new data point x:
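The equations on slides 25 to 28 did not survive extraction. For reference, the standard textbook soft-margin formulation is sketched below; the symbols w, b, \xi_i, C, and \alpha_i are assumptions and may differ from the notation used on the original slides.

```latex
% Primal (QP): maximize the margin, minimize the total slack
\min_{w,\,b,\,\xi} \quad \tfrac{1}{2}\,\|w\|_2^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad y_i\,(w^\top x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1,\dots,n

% Dual problem
\max_{\alpha} \quad \sum_{i=1}^{n} \alpha_i
  - \tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j\, y_i y_j\, x_i^\top x_j
\quad \text{s.t.} \quad 0 \le \alpha_i \le C, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0

% Optimality condition linking primal and dual
w = \sum_{i=1}^{n} \alpha_i\, y_i\, x_i

% Classification rule for a new data point x
f(x) = \operatorname{sign}\!\Big( \sum_{i=1}^{n} \alpha_i\, y_i\, x_i^\top x + b \Big)
```

The data enter the dual and the classification rule only through inner products x_i^\top x_j, so replacing them by kernel values K_{ij} gives the kernel-based classifier discussed on the following slides.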


30 Kernel-based classification. Pipeline: data x_1, ..., x_n -> embed data (kernel design), summarized by K -> linear classification algorithm (kernel algorithm: support vector machine, SVM).

31 Overview Kernel methods Classification problems Kernel methods with heterogeneous information Classification with heterogeneous information (SDP) Applications in computational biology An efficient algorithm

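32 Kernel methods with heterogeneous info. Data points: proteins. Information sources: (figure: a source summarized as a kernel matrix K with entries indexed by i and j).

33 Kernel methods with heterogeneous info. Data points: proteins. Information sources: (figure: sources summarized as a kernel matrix K).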

34 Kernel methods with heterogeneous data. Proposed approach: first focus on every single source j of information individually; extract relevant information from source j into K_j; design an algorithm to learn the optimal K, by mixing any number of kernel matrices K_j, for a given learning problem.

35 Kernel methods with heterogeneous data. (Figure: steps 1 and 2, leading to K.)

36 Kernel methods with heterogeneous data. Proposed approach: (1) First focus on every single source j of information individually: extract relevant information from source j into K_j; focus on kernel design for specific types of information. (2) Design an algorithm that learns the optimal K, by mixing any number of kernel matrices K_j, for a given learning problem. Benefits: homogeneous, standardized input; flexibility; can ignore information irrelevant for the learning task.

37 Kernel design: strings Data points: proteins Described by variable-length, discrete strings (amino acid sequences) protein 1 protein 2 >ICYA_MANSE GDIFYPGYCPDVKPVNDFDLSAFAGAWHEIAKLPLENENQGKCTIAEYKY DGKKASVYNSFVSNGVKEYMEGDLEIAPDAKYTKQGKYVMTFKFGQRVVN LVPWVLATDYKNYAINYMENSHPDKKAHSIHAWILSKSKVLEGNTKEVVD NVLKTFSHLIDASKFISNDFSEAACQYSTTYSLTGPDRH >LACB_BOVIN MKCLLLALALTCGAQALIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDA QSAPLRVYVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTKIPAVFKI DALNENKVLVLDTDYKKYLLFCMENSAEPEQSLACQCLVRTPEVDDEALE KFDKALKALPMHIRLSFNPTQLEEQCHI Kernel design: derive valid similarity measure, based on non-vector information

38 Kernel design: strings >ICYA_MANSE GDIFYPGYCPDVKPVNDFDLSAFAGAWHEIAKLPLENENQGKCTIAEYKY DGKKALVLDTDVSNGVKEYMENSLEIAPDAKYTKQGKYVMTFKFGQRVVN LVPWVLATDYKNYAINYNCDYHPDKKAHSIHAWILSKSKVLEGNTKEVVD NVLKTFSHLIDASKFISNDFSEAACQYSTTYSLTGPDRH >LACB_BOVIN MKCLLLALALTCGAQALIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDA QSAPLRVYVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTKIPAVFKI DALNENKVLVLDTDYKKYLLFCMENSAEPEQSLACQCLVKYVNTFKEALE KFDKALKALPMHIRLSFNPTQLEEQCHI more similar String kernels >ICYA_JAKSE GDIFYPGYCPDVKPVNDFDLSAFAGAWHEIAKLPLENENQGKCTIAEYKY DGKKASVYNSFVSNGVKEYMEGDLEIAPDAKYTKQGKYVMTFKFGQRVVN LVPWVLATDYKNYAINYNCDYHPDKKAHSIHAWILSKSKVLEGNTKEVVD NVLKTFSHLIDASKFISNDFSEAACQYSTTYSLTGPDRH >LACB_BOVIN MKCLLLALALTCGAQALIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDA QSAPLRVYVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTKIPAVFKI DALNENKVLVLDTDYKKYLDYCMENSAEPEQSLACQCLVRTPEVDDEALE KFDKALKALPMHIRLSFNPTQLEEQCHI less similar
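The talk does not say which string kernel it uses. As one simple illustration of deriving a valid similarity measure from variable-length sequences, the sketch below implements a k-mer spectrum kernel, which is an assumed choice, with hypothetical function names. The first two test strings are the opening 20 residues of the two sequences shown on the slide; the third is a made-up single-substitution variant.

```python
from collections import Counter

def kmer_counts(sequence, k=3):
    """Count every length-k substring (k-mer) of the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

def spectrum_kernel(s, t, k=3):
    """Inner product of the two k-mer count vectors."""
    counts_s, counts_t = kmer_counts(s, k), kmer_counts(t, k)
    return sum(count * counts_t[kmer] for kmer, count in counts_s.items())

def normalized_spectrum_kernel(s, t, k=3):
    """Cosine-style normalization so that a sequence has similarity 1 with itself."""
    return spectrum_kernel(s, t, k) / (
        spectrum_kernel(s, s, k) * spectrum_kernel(t, t, k)
    ) ** 0.5

icya_prefix = "GDIFYPGYCPDVKPVNDFDL"     # start of ICYA_MANSE on the slide
lacb_prefix = "MKCLLLALALTCGAQALIVT"     # start of LACB_BOVIN on the slide
icya_variant = "GDIFYPGYCPEVKPVNDFDL"    # hypothetical one-residue substitution

sim_close = normalized_spectrum_kernel(icya_prefix, icya_variant)
sim_far = normalized_spectrum_kernel(icya_prefix, lacb_prefix)
```

Because the kernel is an explicit inner product of count vectors, the resulting kernel matrix is positive semidefinite by construction, so it can be passed to any kernel algorithm.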


39 Kernel design: graph. Data points: vertices. Information: connectivity described by a graph. Diffusion kernel: establishes similarities between the vertices of a graph based on the connectivity information. It is based upon a random walk and efficiently accounts for all paths connecting two vertices, weighted by path lengths.
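A minimal sketch of a diffusion kernel follows, assuming the common definition K = exp(-beta * L) with L the graph Laplacian (degree matrix minus adjacency matrix). The value of beta, the 4-vertex path graph, and the truncated Taylor series used for the matrix exponential are illustrative assumptions, not details from the talk.

```python
def mat_mul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(M, terms=40):
    """Truncated Taylor series for exp(M); adequate for small, well-scaled matrices."""
    n = len(M)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        # term now holds M^k / k!
        term = [[v / k for v in row] for row in mat_mul(term, M)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

def diffusion_kernel(adjacency, beta=0.5):
    """K = exp(-beta * L) with L = D - A the graph Laplacian."""
    n = len(adjacency)
    degree = [sum(row) for row in adjacency]
    neg_beta_laplacian = [
        [-beta * ((degree[i] if i == j else 0) - adjacency[i][j]) for j in range(n)]
        for i in range(n)
    ]
    return mat_exp(neg_beta_laplacian)

# Path graph 0 - 1 - 2 - 3 (toy example).
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
K = diffusion_kernel(adjacency)
```

On this path graph the similarity of vertex 0 to the other vertices decreases with graph distance, which matches the description of paths being weighted by their length.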

40 Kernel methods with heterogeneous data. (Figure: steps 1 and 2; how to obtain K in step 2?)

41 Learning the kernel matrix K. Any symmetric positive semidefinite matrix specifies a kernel matrix. Positive semidefinite matrices form a convex cone. Define a cost function to assess the quality of a kernel matrix; restrict to convex cost functions. Learn K from the convex cone of positive semidefinite matrices according to a convex quality measure.

42 Learning the kernel matrix K. Learn K from the convex cone of positive semidefinite matrices according to a convex quality measure. Semidefinite programming (SDP) deals with optimizing convex cost functions over the convex cone of positive semidefinite matrices (or a convex subset of it).

43 Classification with multiple kernels. Learn K from the convex cone of positive semidefinite matrices (or a convex subset) according to a convex quality measure. Integrate constructed kernels. Large margin classifier (SVM).

44 Classification with multiple kernels. Learn K from the convex cone of positive semidefinite matrices (or a convex subset) according to a convex quality measure. Integrate constructed kernels: learn a linear combination. Large margin classifier (SVM).

45 Classification with multiple kernels. Learn K from the convex cone of positive semidefinite matrices (or a convex subset) according to a convex quality measure. Integrate constructed kernels: learn a linear combination. Large margin classifier (SVM): maximize the margin.

46 Classification with multiple kernels. Integrate constructed kernels: learn a linear mix. Large margin classifier (SVM): maximize the margin. Result: an SDP (standard form).
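The "linear mix" step can be sketched as follows. In the talk the weights are learned by solving an SDP; in this sketch they are fixed by hand, and the two 3x3 kernel matrices, the weight values, and the function name are all hypothetical. The sketch only shows how per-source kernels K_j are combined into a single K, restricted here to non-negative weights, which keep the result positive semidefinite.

```python
def combine_kernels(kernels, weights):
    """Weighted sum K = sum_j mu_j * K_j of equally sized kernel matrices."""
    if len(kernels) != len(weights):
        raise ValueError("need one weight per kernel matrix")
    if any(w < 0 for w in weights):
        # This sketch restricts itself to non-negative weights, which
        # guarantee that a sum of PSD matrices stays PSD.
        raise ValueError("weights must be non-negative in this sketch")
    n = len(kernels[0])
    return [
        [sum(w * K[i][j] for w, K in zip(weights, kernels)) for j in range(n)]
        for i in range(n)
    ]

# Toy kernel matrices for three proteins from two hypothetical sources.
K_sequence = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
K_interaction = [
    [1.0, 0.0, 0.5],
    [0.0, 1.0, 0.0],
    [0.5, 0.0, 1.0],
]

mu = [0.7, 0.3]  # hand-picked; the talk learns these by maximizing the margin
K = combine_kernels([K_sequence, K_interaction], mu)
```

A weight of zero drops a source entirely, which is how the learned combination can ignore information that is irrelevant for the task.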

47 Yeast Membrane Protein Prediction. Membrane proteins: anchor in various cellular membranes; serve important communicative functions across the membrane; are important drug targets. About 30% of the proteins are membrane proteins.

48 Yeast Membrane Protein Prediction. Information sources: protein sequences (SW scores); protein sequences (BLAST scores); E-values of Pfam domains; protein-protein interactions (diffusion kernel); mRNA expression profiles (Gaussian kernel); hydropathy profile.

49 Yeast Membrane Protein Prediction. Information sources combined into K: protein sequences (SW scores); protein sequences (BLAST scores); E-values of Pfam domains; protein-protein interactions; mRNA expression profiles; hydropathy profile.

50 Yeast Membrane Protein Prediction

51 Yeast Protein Function Prediction. Five different types of data: Pfam domains; genetic interactions (CYGD); physical interactions (CYGD); protein-protein interaction (TAP); mRNA expression profiles. Compare our approach to an approach using Markov random fields (Deng et al.), which uses the same five types of data and also reports improved accuracy compared to using any single data type.

52 Yeast Protein Function Prediction. (Figure: comparison of MRF, SDP/SVM (binary), and SDP/SVM (enriched).)

53 Overview Kernel methods Classification problems Kernel methods with heterogeneous information Classification with heterogeneous information (SDP) Applications in computational biology An efficient algorithm

54 Efficient algorithm. Convex (SDP) formulation for classification with heterogeneous sources of information. Empirical scaling with respect to m and n: general-purpose interior point method (Mosek): m^1.6 n^4.1; dedicated algorithm: m^1.1 n^1.4. Convex formulation: not the end of the story. A dedicated, efficient algorithm is needed to solve most real-life problems.
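The slide does not say how the scaling exponents were measured. One common way to estimate such an exponent is a least-squares fit of log(time) against log(problem size); the sketch below assumes that approach and runs it on synthetic timings generated from a known power law, not on measurements from the talk.

```python
import math

def fit_power_law_exponent(sizes, times):
    """Least-squares slope of log(time) against log(size)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator

sizes = [100, 200, 400, 800, 1600]
# Synthetic timings following time = c * n^1.4 (illustrative only, not measured).
synthetic_times = [1e-4 * n ** 1.4 for n in sizes]
exponent = fit_power_law_exponent(sizes, synthetic_times)
```

With real timings the fitted slope is only an empirical summary over the tested range of sizes, not an asymptotic complexity bound.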

55 This talk Classification problems with heterogeneous information sources Sparse principal component analysis

56 Principal Component Analysis (PCA). Classic tool in multivariate data analysis. Goal: find a low-dimensional model that explains most of the variance (information) in the data. Example: 1-dimensional subspace in 2D. Algorithm: eigenvalue decomposition of the covariance matrix A.
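The 2D example can be sketched in a few lines. The code below assumes five made-up data points and uses power iteration to obtain the leading eigenpair of the covariance matrix A; power iteration is a stand-in chosen to keep the sketch dependency-free, whereas a full eigenvalue decomposition would normally come from a linear algebra library.

```python
import math

def covariance_matrix(data):
    """Sample covariance matrix of rows of observations."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    return [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
            for a in range(d)]

def leading_eigenpair(A, iterations=500):
    """Power iteration: largest eigenvalue and unit eigenvector of a PSD matrix."""
    d = len(A)
    x = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        y = [sum(A[i][j] * x[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    eigenvalue = sum(x[i] * A[i][j] * x[j] for i in range(d) for j in range(d))
    return eigenvalue, x

# Toy 2D data lying close to the diagonal (illustrative only).
data = [(2.0, 1.9), (1.0, 1.1), (0.0, 0.1), (-1.0, -0.9), (-2.0, -2.2)]
A = covariance_matrix(data)
variance, direction = leading_eigenpair(A)

# Fraction of the total variance captured by the 1-dimensional subspace.
explained = variance / (A[0][0] + A[1][1])
```

Both entries of `direction` are non-zero here, which is the lack of sparsity that the later slides on sparse PCA address.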

57 PCA. Eigenvectors define the subspace; eigenvalues ~ variance. Applications: finance; image and text processing; computational biology.

58 PCA. Advantages: explains maximal variance for the lowest-dimensional subspace (optimal); easy and efficient to compute. Disadvantage: eigenvectors are usually not sparse: each is a linear combination of all variables.

59 PCA. Eigenvectors are usually not sparse: linear combination of all variables. Lack of interpretation (no feature selection). Can be related to cost.

60 PCA. Eigenvectors are usually not sparse: linear combination of all variables. Lack of interpretation (no feature selection). Can be related to cost. Examples: Finance: sparse factors often mean fewer assets in the portfolio, hence lower fixed transaction costs. Gene expression data: each variable ~ gene (biologically meaningful unit); sparse factors ~ a small subset of genes explains the variance (feature selection). Image processing: sparse factors ~ specific objects.

61 Sparse Principal Component Analysis First principal component (eigenvector):

62 Sparse Principal Component Analysis First principal component (eigenvector): Enforce sparsity: more constrained, less optimal

63 Sparse Principal Component Analysis non-convex: hard!


65 Sparse Principal Component Analysis non-convex: hard! SDP Relaxation

66 Sparse Principal Component Analysis non-convex: hard! SDP Relaxation 1-norm Relaxation

67 Sparse Principal Component Analysis non-convex: hard! SDP Relaxation 1-norm Relaxation SDP!
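The formulas on slides 61 to 67 were lost in extraction. The sequence of steps named on the slides (non-convex problem, SDP relaxation, 1-norm relaxation, SDP) is consistent with the following standard derivation, given here as a sketch. The symbols x, X, k, and \mathbf{1} are assumptions about notation; A is the covariance matrix from the PCA slides.

```latex
% First principal component
\max_{x} \; x^\top A x \quad \text{s.t.} \quad \|x\|_2 = 1

% Enforce sparsity (at most k non-zero entries): non-convex, hard
\max_{x} \; x^\top A x \quad \text{s.t.} \quad \|x\|_2 = 1, \quad \mathbf{Card}(x) \le k

% Lift to X = x x^\top: still non-convex (cardinality and rank constraints)
\max_{X} \; \mathbf{Tr}(A X) \quad \text{s.t.} \quad \mathbf{Tr}(X) = 1, \quad
\mathbf{Card}(X) \le k^2, \quad X \succeq 0, \quad \mathbf{rank}(X) = 1

% SDP relaxation (drop the rank constraint) and 1-norm relaxation
% (replace the cardinality constraint by a bound on the sum of absolute values)
\max_{X} \; \mathbf{Tr}(A X) \quad \text{s.t.} \quad \mathbf{Tr}(X) = 1, \quad
\mathbf{1}^\top |X| \mathbf{1} \le k, \quad X \succeq 0
```

The last problem has a linear objective, a convex constraint on the entries of X, and a positive semidefiniteness constraint, so it is a semidefinite program.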

68 Example

69 Example: sparse second factor. Much sparser; explained variance decreases a bit.

70 Example: sparse PCA for gene expression data - Axes on LEFT depend on MANY genes - Axes on RIGHT depend on FEW genes (interpretable!) - Clustering is still clearly present

71 Efficient algorithm. General-purpose SDP toolboxes: SeDuMi, SDPT3. Special-purpose algorithm needed for larger problems: first-order method for non-smooth optimization.

72 Conclusions. Classification with SVMs (kernel methods): computational and statistical framework to integrate data from heterogeneous information sources. Sparse formulation for PCA: feature selection. Semidefinite (convex) programming. Applications: bioinformatics (and others). Efficient, dedicated algorithms for large-scale problems.
