A (somewhat) Unified Approach to Semisupervised and Unsupervised Learning

1 A (somewhat) Unified Approach to Semisupervised and Unsupervised Learning. Ben Recht, Center for the Mathematics of Information, Caltech. April 11, 2007. Joint work with Ali Rahimi (Intel Research).

2 Overview: By abusing the standard Tikhonov regularization functional, we can derive most kernel methods and many novel techniques: KPCA; semi-supervised classification and clustering; transforming time series with few examples. Other applications (not today, sorry): kernel learning; robust SVMs and learning with missing data; constraints and conservation laws.

3 Priors and Semi-Supervision: [Figure: labeled and unlabeled points, axes x and y.] How can we leverage the unlabeled data?

4 Video

5 Representation: A big mess of numbers for each frame. Raw pixels, no image processing.

6 Representation: We want to extract the positions of the limbs: left hand, left elbow, right hand, right elbow.

7 Annotations from user or detection algorithms

8 Assume that output time series is smooth.

9 Approach: Look for a smooth mapping from images to positions. Annotate a subset of the frames. Assume the output obeys physical laws. [Video]

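10 Nonlinear Regression: Let $H$ be an RKHS with kernel $k$, and consider the Tikhonov regularization functional $\min_{f \in H} \sum_{i=1}^{L} V(f(x_i), y_i) + \lambda \|f\|_H^2$, where $V$ is a loss. Solution: $f(x) = \sum_{i=1}^{L} c_i k(x, x_i)$.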
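A minimal sketch of the least-squares case of this functional, assuming a Gaussian kernel and the normalization in which the coefficients solve (K + λI)c = y. The helper names `rbf_kernel`, `fit_krr`, and `predict_krr` and the toy data are hypothetical and only illustrate the representer form f(x) = Σ c_i k(x, x_i).

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix with entries exp(-gamma * ||a_i - b_j||^2)."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def fit_krr(X, y, lam=1e-3, gamma=1.0):
    """Solve (K + lam I) c = y for the representer coefficients c."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict_krr(X_train, c, X_new, gamma=1.0):
    """Evaluate f(x) = sum_i c_i k(x, x_i) at the rows of X_new."""
    return rbf_kernel(X_new, X_train, gamma) @ c

# Toy data (an assumption for illustration): noiseless samples of a sine wave.
X = np.linspace(0.0, 1.0, 20)[:, None]
y = np.sin(2 * np.pi * X[:, 0])
lam = 1e-3
c = fit_krr(X, y, lam=lam, gamma=10.0)
y_fit = predict_krr(X, c, X, gamma=10.0)
```

At the training points the residual equals -λc, so a smaller λ trades a closer fit for larger coefficients.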

11 Augmented Nonlinear Regression: Suppose we add a penalty term constraining the outputs and kernel. Search over f and y. Additional costs/constraints on y. Solution:

12 A variety of learning algorithms
Constraints                                 Algorithm
None                                        Regression/Classification
Outputs are binary                          Clustering/Transduction
Local geometry of the outputs               Manifold Learning/KPCA
Output obeys linear dynamical relations     Manifolds from Video

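13 Least-Squares Cost: $\min_{f \in H} \sum_{i} (f(x_i) - y_i)^2 + \lambda \|f\|_H^2$. We can eliminate the function for practical purposes, recovering it from the computed $y_i$. By the representer theorem, we may rewrite this as $\min_{c} \|Kc - y\|^2 + \lambda c^\top K c$, where $K$ is the kernel matrix with entries $K_{ij} = k(x_i, x_j)$.

14 Least-Squares Cost: Solving for $c$ gives $c = (K + \lambda I)^{-1} y$. Plugging in this solution gives the minimum cost $\lambda y^\top (K + \lambda I)^{-1} y$. Here $y$ is the vector of all of the $y_i$.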
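The plug-in step can be checked by hand. The following is a short derivation, assuming the least-squares functional with regularizer $\lambda \|f\|_H^2$ and kernel matrix $K$, which is the normalization consistent with $c = (K + \lambda I)^{-1} y$ on slide 18:

```latex
\begin{align*}
J(c) &= \|Kc - y\|^2 + \lambda\, c^\top K c, \\
\nabla J(c) &= 2K\bigl((K + \lambda I)c - y\bigr) = 0
  \;\Longrightarrow\; c^\star = (K + \lambda I)^{-1} y, \\
Kc^\star - y &= -\lambda (K + \lambda I)^{-1} y, \\
J(c^\star) &= \lambda^2\, y^\top (K + \lambda I)^{-2} y
  + \lambda\, y^\top (K + \lambda I)^{-1} K (K + \lambda I)^{-1} y \\
 &= \lambda\, y^\top (K + \lambda I)^{-1} (\lambda I + K)(K + \lambda I)^{-1} y
  = \lambda\, y^\top (K + \lambda I)^{-1} y.
\end{align*}
```

The function has disappeared from the cost: what remains is a quadratic form in the outputs alone, which is what makes it possible to search over y.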

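15 Multiple dimensions: Suppose we want a vector-valued function $f: \mathbb{R}^D \to \mathbb{R}^d$. We penalize each component individually. Solving for $f$, the minimum cost is $\lambda \sum_{j=1}^{d} (y^{j})^\top (K + \lambda I)^{-1} y^{j}$, where $y^{j}$ is the vector of the $j$-th output component over all points.

16 Multiple dimensions: Let $Y = [y_1, \ldots, y_L]$. This is a $d \times L$ matrix. Then our optimal cost can be written succinctly as $\lambda\, \mathrm{tr}\bigl(Y (K + \lambda I)^{-1} Y^\top\bigr)$.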
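A quick numerical check of the compact form, under the assumption that the cost is the per-component least-squares cost with regularizer λ‖f‖² and that Y stacks the outputs as a d x L matrix. The sizes, the Gaussian kernel, and the random data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
L, d, lam = 30, 2, 0.5          # hypothetical sizes and regularization weight

X = rng.normal(size=(L, 3))     # L inputs in R^3
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-0.5 * sq)           # Gaussian kernel matrix (an assumed choice)
Y = rng.normal(size=(d, L))     # d x L matrix of outputs, one column per point

M = np.linalg.inv(K + lam * np.eye(L))

# Direct evaluation: fit each output component separately and add the costs.
direct = 0.0
for j in range(d):
    c = M @ Y[j]                                   # representer coefficients
    direct += np.sum((K @ c - Y[j]) ** 2) + lam * c @ K @ c

# Compact form: lam * tr(Y (K + lam I)^{-1} Y^T).
compact = lam * np.trace(Y @ M @ Y.T)
```

The two numbers agree up to rounding, so the trace form can stand in for the full fit when only the cost is needed.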

17 Kernel PCA: Let $f: \mathbb{R}^D \to \mathbb{R}^d$ with $D > d$. Assume that the set of outputs is white and zero mean. This can be solved as an eigenvalue problem (Schölkopf et al., 1998).

18 Kernel Principal Components: Solutions are the eigenvectors of K projected onto the zero-mean subspace of the RKHS. Since $c = (K + \lambda I)^{-1} y$, the resulting coefficients are also eigenvectors of K when the lifted data is zero-mean. Centering the data in feature space is often useful in unsupervised learning. The regularization parameter only controls the scale of each component.

19 Centered Kernels: Constraining Y to have zero column sum results in a hard eigenvalue problem. If we instead insist that $\sum_i f(x_i) = 0$, we get an ordinary eigenvalue problem. The components are now just the eigenvectors of the centered kernel matrix. You don't have to invert anything.
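A sketch of the centered-kernel computation, assuming the standard kernel-PCA centering HKH with H = I - 11ᵀ/N (the slide's own formula did not survive extraction, so this form is an assumption). The function name `kpca_components` and the toy data are hypothetical.

```python
import numpy as np

def kpca_components(K, d):
    """Return the top-d eigenvalues and eigenvectors (columns) of H K H."""
    N = K.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N      # centering projector
    Kc = H @ K @ H                           # centered kernel matrix
    evals, evecs = np.linalg.eigh(Kc)        # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:d]      # indices of the d largest
    return evals[order], evecs[:, order]

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-0.1 * sq)                        # Gaussian kernel (an assumed choice)
evals, Y = kpca_components(K, d=2)
```

Only a symmetric eigendecomposition is needed here; no linear system is solved and no matrix is inverted.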

20 Clustering and Segmentation

21 Classification on RKHS: Tikhonov regularization with labels set to 1 or -1. Just choose a loss. [Figure: labeled points, axes x and y.]

22 Example costs: Classification

23 Transduction: Sparsely labeled data. [Figure: axes x and y.]

24 Taxonomy
Classification: function fitting with ±1 labels.
Transduction: function fitting with ±1 labels, some of the labels withheld.
Segmentation/Clustering: function fitting with ±1 labels, all of the labels withheld.
Conceptually related/algorithmically related.

25 Alternative Approaches
Density estimation: local minima, not well conditioned for large dimension.
Local search for binary labels: can't guarantee performance.
Graph cuts: a special case of what follows.

26 Transduction and Segmentation: Start with zero-meaned Tikhonov regularization. Force labels to be 1 or -1. NP-hard.

27 Approximation 1: Eigenvalue. Sum constraints: pick $\alpha_i \ge 0$. Solve as a generalized eigenvalue problem. Surprisingly good in practice, reasonably efficient. Of course, how you pick the $\alpha$ is ad hoc. The best $\alpha$ can be computed by semidefinite programming.

28 Approximation 2: Duality. The dual is a semidefinite program. The randomized algorithm of Goemans and Williamson gives you clusters. Compare against the dual program for bounds. Algorithms can be slow for large N.
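The rounding step of Goemans and Williamson can be sketched on its own. This assumes a positive semidefinite matrix G with unit diagonal has already been produced by a semidefinite program solved elsewhere; the solver is not shown, and the function name `hyperplane_rounding` and the rank-one test matrix are hypothetical.

```python
import numpy as np

def hyperplane_rounding(G, rng):
    """Round a PSD matrix G with unit diagonal to labels in {-1, +1}.

    G is assumed to come from an SDP relaxation solved elsewhere.
    """
    evals, evecs = np.linalg.eigh(G)
    V = evecs * np.sqrt(np.clip(evals, 0.0, None))   # rows v_i with G ~ V V^T
    g = rng.normal(size=V.shape[1])                  # random hyperplane normal
    return np.where(V @ g >= 0, 1, -1)               # side of the hyperplane

y_true = np.array([1, 1, -1, 1, -1, -1])
G = np.outer(y_true, y_true).astype(float)           # rank-one, already integral
labels = hyperplane_rounding(G, np.random.default_rng(0))
```

When G is already rank one, as in this toy case, the rounding returns the underlying labels up to a global sign flip; for a general G one would typically repeat the draw and keep the best labeling.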

29 Spectral Clustering: Perona and Freeman: eigenvectors of the adjacency matrix K. Shi and Malik: graph partitioning/Normalized Cuts. Other variants. All are approximations of the binary label prior!

30 Normalized Cuts Pick α i = Solution is second largest eigenvector of where

31 Spectral Clustering sensitivity Weightings cause particular sensitivities

32 Solution 2: Average Gap. Pick $\alpha_i = 1/N$. Perona-Freeman with a modified kernel. Just an eigenvalue problem: the first KPCA component.
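A minimal sketch of a two-way split along the first KPCA component, assuming the "modified kernel" is the centered kernel HKH with H = I - 11ᵀ/N and that labels are read off from the sign of the leading eigenvector. Both choices, the function name `average_gap_labels`, and the synthetic blobs are assumptions for illustration.

```python
import numpy as np

def average_gap_labels(K):
    """Two-way split from the sign of the leading eigenvector of H K H."""
    N = K.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N
    evals, evecs = np.linalg.eigh(H @ K @ H)   # eigenvalues in ascending order
    v = evecs[:, -1]                           # eigenvector of the largest one
    return np.where(v >= 0, 1, -1)

# Two well-separated synthetic blobs (hypothetical data).
rng = np.random.default_rng(0)
A = rng.normal(loc=(-3.0, 0.0), scale=0.3, size=(25, 2))
B = rng.normal(loc=(3.0, 0.0), scale=0.3, size=(25, 2))
X = np.vstack([A, B])
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-0.5 * sq)                          # Gaussian kernel (assumed)
labels = average_gap_labels(K)
```

The overall sign of an eigenvector is arbitrary, so only the grouping is meaningful, not which blob receives +1.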

33 Average Gap Algorithm solve No degree weighting

34 Leveraging Dynamics

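35 Dynamics: [Diagram: graphical model over three time steps. Inputs $x_t, x_{t+1}, x_{t+2}$ map to outputs $y_t, y_{t+1}, y_{t+2}$ through $f$; states $s_t, s_{t+1}, s_{t+2}$ are linked by $A$ and connected to the outputs by $C$.]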

36 Dynamics Assume data is generated by an LTIG system For the experiments, this model can be very dumb!
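A sketch of simulating such a system, assuming "LTIG" means linear time-invariant with Gaussian noise, that is, s_{t+1} = A s_t + w_t and y_t = C s_t + v_t. The constant-velocity choice of (A, C), the noise levels, and the function name `simulate_lti` are hypothetical, picked to show how simple the model can be.

```python
import numpy as np

def simulate_lti(A, C, T, state_noise, obs_noise, rng):
    """Simulate s_{t+1} = A s_t + w_t and y_t = C s_t + v_t with Gaussian noise."""
    n, m = A.shape[0], C.shape[0]
    s = np.zeros(n)
    states, outputs = [], []
    for _ in range(T):
        states.append(s)
        outputs.append(C @ s + obs_noise * rng.normal(size=m))
        s = A @ s + state_noise * rng.normal(size=n)
    return np.array(states), np.array(outputs)

# A deliberately simple constant-velocity model: state = (position, velocity).
A = np.array([[1.0, 1.0],
              [0.0, 1.0]])
C = np.array([[1.0, 0.0]])        # only the position is observed
S, Y = simulate_lti(A, C, T=100, state_noise=0.01, obs_noise=0.0,
                    rng=np.random.default_rng(0))
```

With small state noise the observed position drifts smoothly, which is the kind of prior on the outputs that the dynamics are meant to encode.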

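37 Dynamics: Search over functions and missing data. Assume a priori: we know $(A, C)$; $f \in$ RKHS is vector valued; some of the $y_t$ are given. [Diagram: states $s_t, s_{t+1}, s_{t+2}$ linked by $A$; outputs $y_t, y_{t+1}, y_{t+2}$ connected to the states by $C$ and to the inputs $x_t, x_{t+1}, x_{t+2}$ by $f$.]

38 Dynamics: Search over functions and missing data. [Diagram: states $s_t, s_{t+1}, s_{t+2}$ linked by $A$; outputs $y_t, y_{t+1}, y_{t+2}$ connected to the states by $C$ and to the inputs $x_t, x_{t+1}, x_{t+2}$ by $f$.]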

39 Optimization Problem: Prefers outputs that evolve smoothly. Terms: smoothness, dynamics, fidelity to training data.

40 Optimization Problem: Prefers outputs that evolve smoothly. Tikhonov regularization.

41 Optimization Problem: Prefers outputs that evolve smoothly. RTS smoother (non-causal Kalman filter).

42 Optimization Problem: Semi-supervised algorithm. Terms: smoothness, dynamics, fidelity to training data.

43 Optimization Problem: Eliminating the state sequence by differentiation yields the following problem, which may be solved by least squares. Ω is a Toeplitz matrix that can be computed efficiently from the linear dynamics model.
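A simplified sketch of the resulting least-squares problem for a scalar output. It is not the talk's formulation: the matrix Ω derived from (A, C) is replaced here by a second-difference penalty DᵀD as a stand-in, and the weights `lam`, `mu`, `nu`, the function name `semi_supervised_outputs`, and the synthetic "frames" are all hypothetical.

```python
import numpy as np

def semi_supervised_outputs(K, labeled_idx, z, lam=0.1, mu=1.0, nu=1e4):
    """Recover all outputs y from a few labels z.

    Minimizes lam * y^T (K + lam I)^{-1} y + mu * ||D y||^2
              + nu * sum over labeled i of (y_i - z_i)^2,
    where D takes second differences in time (a stand-in for Omega).
    """
    T = K.shape[0]
    M = lam * np.linalg.inv(K + lam * np.eye(T))       # function-fitting cost
    D = np.diff(np.eye(T), n=2, axis=0)                # (T-2) x T second differences
    S = np.zeros((len(labeled_idx), T))
    S[np.arange(len(labeled_idx)), labeled_idx] = 1.0  # selects labeled frames
    lhs = M + mu * D.T @ D + nu * S.T @ S              # normal equations
    return np.linalg.solve(lhs, nu * S.T @ z)

T = 50
t = np.arange(T)
y_true = np.sin(2 * np.pi * t / T)                     # hidden smooth trajectory
X = np.stack([y_true, y_true ** 2], axis=1)            # synthetic "frames"
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-2.0 * sq)                                  # Gaussian kernel (assumed)
labeled_idx = np.array([0, 12, 25, 37, 49])
y_hat = semi_supervised_outputs(K, labeled_idx, y_true[labeled_idx])
```

All three terms are quadratic in y, so the minimizer comes from a single linear solve; the five labeled frames pin down the scale and offset that the unsupervised terms leave free.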

44 Synthetic Results: [Figure: 3-D plot with axes x, y, z.] Recovered mappings: Tenenbaum et al., Belkin/Niyogi, Rahimi/Recht.

45 Video

46 Representation: A big mess of numbers for each frame. Raw pixels, no image processing.

47 Representation: We want to extract the positions of the limbs: left hand, left elbow, right hand, right elbow.

48 Annotations from user or detection algorithms

49 Assume that output time series is smooth.

50 Approach: Look for a smooth mapping from images to positions. Annotate a subset of the frames. Assume the output obeys physical laws. [Video]

51 References
Learning to Transform Time Series with a Few Examples, A. Rahimi, B. Recht, in IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) (2007).
Clustering with Normalized Cuts is Clustering with a Hyperplane, A. Rahimi, B. Recht, in Statistical Learning in Computer Vision (2004).
Convex Modeling with Priors, B. Recht, PhD dissertation, MIT Media Lab (2006).
