Kernel Methods in Machine Learning

1 Kernel Methods in Machine Learning
James Kwok
Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong
Joint work with Ivor Tsang, Pak-Ming Cheung, Andras Kocsor, Jacek Zurada, Kimo Lai
November 2006, Nanjing

2 Outline
1. Kernel Methods: An Introduction
2. Kernel Methods on Very Large Data Sets: Core Vector Machines
3. Multi-Instance Learning: Constrained Concave-Convex Procedure, Loss Function, Optimization Problem, Experiments
4. Conclusion

3 Popularity of Kernel Methods
Supervised learning: classification (support vector machines, SVM); regression (support vector regression)
Unsupervised learning: novelty detection (one-class SVM / support vector domain description); clustering (kernel clustering); principal component analysis (kernel PCA)
Other learning scenarios: semi-supervised learning, transductive learning, etc.
Applications: text classification, speaker adaptation, image fusion, texture classification, ...

4 Basic Idea in Kernel Methods
Map the data from the input space to a feature space F using a map ϕ
Apply a linear procedure in F: hyperplane classifier, linear regression, PCA, etc.
Only inner products in F are needed
Kernel trick: $\varphi(x)^\top\varphi(y) = k(x, y)$
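As a small illustration of the kernel trick (an added example, not from the slides), the sketch below checks numerically that the degree-2 polynomial kernel $k(x, y) = (x^\top y)^2$ equals the inner product of an explicit feature map ϕ; the feature map used here is the standard one for this kernel.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 polynomial feature map for a 2-d input x = (x1, x2)."""
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2, np.sqrt(2) * x1 * x2])

def k_poly(x, y):
    """Degree-2 polynomial kernel, evaluated directly in input space."""
    return float(np.dot(x, y)) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

# The two computations agree: phi(x).phi(y) == k(x, y); no explicit map is needed.
print(np.dot(phi(x), phi(y)), k_poly(x, y))   # both print 1.0
```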

5 Support Vector Machines
Classification problem: $\{(x_i, y_i)\}_{i=1}^N$, $x_i \in \mathbb{R}^m$, $y_i \in \{\pm 1\}$
Primal: $\min_{w, b}\ \frac{1}{2}\|w\|^2$ s.t. $y_i(w^\top\varphi(x_i) + b) \ge 1$
Dual: $\max_\alpha\ \sum_{i=1}^N \alpha_i - \frac{1}{2}\sum_{i,j=1}^N \alpha_i\alpha_j y_i y_j\, \varphi(x_i)^\top\varphi(x_j)$ s.t. $\sum_{i=1}^N \alpha_i y_i = 0$, $\alpha_i \ge 0$
A quadratic programming (QP) problem
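For context (an added example, not part of the talk), this dual QP is what off-the-shelf SVM solvers handle internally; a minimal sketch with scikit-learn, assuming an RBF kernel and toy data:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)   # a simple nonlinear labeling

# SVC solves the dual QP; the kernel trick replaces phi(x_i).phi(x_j) by k(x_i, x_j).
clf = SVC(kernel="rbf", C=10.0, gamma=1.0).fit(X, y)
print("number of support vectors:", clf.support_.size)
print("training accuracy:", clf.score(X, y))
```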

6 Problem 1: need O(m²) memory just to write down the kernel matrix (m training examples). If m = 20,000 and it takes 4 bytes to represent a kernel entry, we would need 1.6 GB to store the kernel matrix.
Problem 2: involves inverting the kernel matrix $K_{m\times m} = [k(x_i, x_j)]_{i,j=1}^m$, which requires O(m³) time.
Existing methods: sampling, low-rank approximations, decomposition methods. In practice, their time complexities range from O(m) to O(m^{2.3}), but these are empirical observations and not theoretical guarantees.
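A quick back-of-the-envelope check of the memory figure quoted above (my arithmetic, not from the slides):

```python
m = 20_000
bytes_per_entry = 4                    # single-precision float per kernel entry
kernel_matrix_bytes = m * m * bytes_per_entry
print(kernel_matrix_bytes / 1e9, "GB")  # 1.6 GB
```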

7 Observation
SVM implementations only approximate the optimal solution by an iterative strategy:
1. Pick a subset of the Lagrange multipliers
2. Optimize the reduced optimization problem
3. Repeat until all the Lagrange multipliers are accurate enough (a loose KKT condition)
These near-optimal solutions are often good enough in practical applications

8 Approximation Algorithms
Approximation algorithms have been extensively used in theoretical computer science, e.g., for NP-complete problems such as the vertex-cover, traveling-salesman and set-covering problems.
Denote C*: cost of the optimal solution; C: cost of the solution returned by the approximation algorithm.
Performance guarantee: an approximation ratio ρ(n) for input size n satisfies $\max\big(\frac{C}{C^*}, \frac{C^*}{C}\big) \le \rho(n)$.
Large ρ(n): the solution may be much worse than the optimal solution; small ρ(n): the solution is more or less optimal.
If the ratio does not depend on n, we may just write ρ and call the algorithm a ρ-approximation algorithm.

9 The Minimum Enclosing Ball Problem
A problem in computational geometry
Given: $S = \{x_1, \ldots, x_m\}$, where each $x_i \in \mathbb{R}^d$
Minimum enclosing ball of S, MEB(S): the smallest ball that contains all the points in S
Finding exact MEBs is inefficient for large d

10 (1 + ε)-Approximation
Given an ε > 0, a ball B(c, (1 + ε)r) is a (1 + ε)-approximation of MEB(S) if $r \le r_{MEB(S)}$ and $S \subset B(c, (1 + ε)r)$

11 Approximate MEB Algorithm
Proposed by Bădoiu and Clarkson (2002): a simple iterative scheme in which, at the t-th iteration, the current estimate B(c_t, r_t) is expanded incrementally by including the furthest point outside the (1 + ε)-ball B(c_t, (1 + ε)r_t); repeat until all the points in S are covered by B(c_t, (1 + ε)r_t)
Surprising property: the number of iterations (and hence the size of the final core-set) depends only on ε, but not on d or m
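A rough numeric sketch (mine, not the talk's implementation) of the simple center-update variant of the Bădoiu-Clarkson scheme for points in R^d; the CVM described later instead keeps an explicit core-set and re-solves the MEB of that core-set at every iteration:

```python
import numpy as np

def approx_meb(points, eps):
    """Simple Badoiu-Clarkson-style iteration: after roughly 1/eps**2 center
    updates, the ball around c covering all points is close to a
    (1 + eps)-approximate minimum enclosing ball.  Illustrative sketch only."""
    points = np.asarray(points, dtype=float)
    c = points[0].copy()
    num_iter = int(np.ceil(1.0 / eps ** 2))        # iteration bound depends only on eps
    for t in range(1, num_iter + 1):
        dists = np.linalg.norm(points - c, axis=1)
        furthest = points[np.argmax(dists)]        # the next "core" point
        c = c + (furthest - c) / (t + 1.0)         # move the center towards it
    radius = np.linalg.norm(points - c, axis=1).max()
    return c, radius

rng = np.random.default_rng(0)
center, radius = approx_meb(rng.normal(size=(5000, 3)), eps=0.05)
print(center, radius)
```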

12 MEB Problems and Kernel Methods
What is obvious:
- MEB is equivalent to the hard-margin support vector data description (SVDD)
- the MEB problem can also be used to find the radius component of the radius-margin bound (SVM parameter tuning)
What is not so obvious:
- other kernel-related problems can also be viewed as MEB problems: soft-margin one-class SVM, multi-class SVM, ranking SVM, SVR, Laplacian SVM, etc.

13 Hard-Margin SVDD
Denote: kernel k; feature map ϕ; MEB in the feature space: B(c, R)
Primal: $\min_{R, c}\ R^2$ s.t. $\|c - \varphi(x_i)\|^2 \le R^2$, $i = 1, \ldots, m$
Dual: $\max_\alpha\ \alpha^\top\mathrm{diag}(K) - \alpha^\top K\alpha$ s.t. $\alpha \ge 0$, $\alpha^\top 1 = 1$
$\alpha = [\alpha_1, \ldots, \alpha_m]^\top$: Lagrange multipliers; $K_{m\times m} = [k(x_i, x_j)]$: kernel matrix; $0 = [0, \ldots, 0]^\top$, $1 = [1, \ldots, 1]^\top$
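As an added illustration (assuming the cvxpy package is available; this is not the authors' solver), the SVDD/MEB dual above can be solved directly as a small QP:

```python
import numpy as np
import cvxpy as cp

def svdd_dual(K):
    """Solve the hard-margin SVDD / MEB dual
         max  alpha' diag(K) - alpha' K alpha   s.t.  alpha >= 0,  1' alpha = 1.
    Sketch only; a tiny ridge keeps the quadratic form numerically PSD."""
    m = K.shape[0]
    alpha = cp.Variable(m)
    Kr = K + 1e-8 * np.eye(m)
    objective = cp.Maximize(np.diag(K) @ alpha - cp.quad_form(alpha, Kr))
    cp.Problem(objective, [alpha >= 0, cp.sum(alpha) == 1]).solve()
    return alpha.value

# Gaussian kernel on toy 2-d data; the nonzero alpha_i mark points on the ball surface.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
alpha = svdd_dual(np.exp(-sq_dists))
print("support vectors:", np.where(alpha > 1e-6)[0])
```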

14 Kernel Methods as MEB Problems
Assume $k(x, x) = \kappa$, a constant. (1)
This holds for:
1. isotropic kernels $k(x, y) = K(\|x - y\|)$ (e.g., Gaussian)
2. dot-product kernels $k(x, y) = K(x^\top y)$ (e.g., polynomial) with normalized inputs
3. any normalized kernel $k(x, y) = \frac{K(x, y)}{\sqrt{K(x, x)\, K(y, y)}}$
Combined with $\alpha^\top 1 = 1$, this gives $\alpha^\top\mathrm{diag}(K) = \kappa$, so the SVDD dual reduces to
$\max_\alpha\ -\alpha^\top K\alpha$ s.t. $\alpha \ge 0$, $\alpha^\top 1 = 1$. (2)
Conversely, whenever the kernel k satisfies (1), any QP of the form (2) can be regarded as a MEB problem.

15 Two-Class SVM
$\{z_i = (x_i, y_i)\}_{i=1}^m$ with $y_i \in \{-1, 1\}$
Primal: $\min_{w, b, \rho, \xi_i}\ \|w\|^2 + b^2 - 2\rho + C\sum_{i=1}^m \xi_i^2$ s.t. $y_i(w^\top\varphi(x_i) + b) \ge \rho - \xi_i$
Dual: $\max_\alpha\ -\alpha^\top\tilde{K}\alpha$ s.t. $\alpha \ge 0$, $\alpha^\top 1 = 1$, where
$\tilde{K} = \big[y_i y_j k(x_i, x_j) + y_i y_j + \frac{\delta_{ij}}{C}\big] = K \odot yy^\top + yy^\top + \frac{1}{C}I$, with $\tilde{k}(z, z) = \kappa + 1 + \frac{1}{C}$ (constant)
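A tiny sketch (mine) of the transformed kernel matrix used above, which makes the diagonal constant so that the two-class L2-SVM fits the MEB framework:

```python
import numpy as np

def cvm_kernel(K, y, C):
    """Transformed kernel for the two-class L2-SVM viewed as a MEB problem:
       K_tilde = (K * yy') + yy' + I/C,
    so every diagonal entry equals kappa + 1 + 1/C (a constant).
    K: base kernel matrix; y: +/-1 labels; C: regularization parameter."""
    yy = np.outer(y, y)
    return K * yy + yy + np.eye(len(y)) / C
```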

16 Core Vector Machine (CVM)
At the t-th iteration, denote $S_t$: core-set; $c_t$: the ball's center; $R_t$: the ball's radius. Given an ε > 0:
1. Initialize $S_0$, $c_0$ and $R_0$
2. Terminate if there is no training point z such that ϕ(z) falls outside the (1 + ε)-ball $B(c_t, (1 + ε)R_t)$
3. Find the (core vector) z such that ϕ(z) is furthest away from $c_t$; set $S_{t+1} = S_t \cup \{z\}$
4. Find the new MEB($S_{t+1}$) and set $c_{t+1} = c_{MEB(S_{t+1})}$ and $R_{t+1} = r_{MEB(S_{t+1})}$
5. Increment t by 1 and go back to Step 2
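The following is an illustrative, self-contained sketch (mine, not the authors' code) of the CVM loop for the plain MEB/SVDD case with a Gaussian kernel, where the core-set MEB dual is solved with a generic SLSQP solver since the core-set stays small:

```python
import numpy as np
from scipy.optimize import minimize

def gauss_kernel(A, B, gamma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def meb_dual(K):
    """MEB/SVDD dual over the (small) core-set kernel matrix K:
       max  alpha' diag(K) - alpha' K alpha,  alpha >= 0, sum(alpha) = 1."""
    n = K.shape[0]
    obj = lambda a: float(a @ K @ a - a @ np.diag(K))     # minimize the negative
    cons = ({"type": "eq", "fun": lambda a: a.sum() - 1.0},)
    res = minimize(obj, np.full(n, 1.0 / n), bounds=[(0.0, None)] * n, constraints=cons)
    return res.x

def cvm_meb(X, eps, gamma=1.0):
    """Illustrative CVM loop for the MEB/SVDD case with a Gaussian kernel
    (so k(x, x) = 1, as the method requires)."""
    core = [0]                                             # arbitrary starting core vector
    while True:
        Kc = gauss_kernel(X[core], X[core], gamma)
        alpha = meb_dual(Kc)
        # squared feature-space distance of every point to c = sum_i alpha_i phi(x_i)
        d2 = 1.0 - 2.0 * gauss_kernel(X, X[core], gamma) @ alpha + alpha @ Kc @ alpha
        R2 = d2[core].max()                                # squared radius of MEB(core set)
        far = int(np.argmax(d2))
        if d2[far] <= (1.0 + eps) ** 2 * R2:               # all points inside the (1+eps)-ball
            return core, alpha, np.sqrt(R2)
        core.append(far)                                   # add the furthest point as a core vector

rng = np.random.default_rng(0)
core, alpha, R = cvm_meb(rng.normal(size=(500, 5)), eps=0.05)
print("core-set size:", len(core), "radius:", round(R, 3))
```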

17 Convergence to (Approximate) Optimality
When ε = 0: CVM outputs the exact solution of the kernel problem
When ε > 0: CVM is a (1 + ε)²-approximation algorithm

18 Time Complexity
CVM converges in at most 2/ε iterations [Bădoiu and Clarkson, 2002]
Without probabilistic speedup: overall time for τ = O(1/ε) iterations is $O\big(\frac{m}{\varepsilon^2} + \frac{1}{\varepsilon^4}\big)$, which is linear in m for a fixed ε
With probabilistic speedup: overall time is $O\big(\frac{1}{\varepsilon^4}\big)$, independent of m for a fixed ε

19 Space Complexity
Space complexity for the whole procedure: O(1/ε²), independent of m for a fixed ε

20 Forest Cover Type Data (522,911 patterns)
[Figures: CPU time (in seconds), number of support vectors / core-set size, and error rate (in %) versus training set size, comparing L2 SVM (CVM), L2 SVM (LIBSVM) and L1 SVM (LIBSVM).]

21 Extended MIT Face Data

training set        # faces    # nonfaces   total
original            2,429      4,548        6,977
set A               2,429      481,914      484,343
set B (blur+flip)   19,432     481,914      501,346
set C (rotate)      408,072    481,914      889,986

[Figures: CPU time (in seconds), number of support vectors / core-set size, AUC and balanced loss (in %) on the original, A, B and C training sets, comparing L2 SVM (CVM), L2 SVM (LIBSVM) and L1 SVM (LIBSVM).]

22 KDDCUP-99 Intrusion Detection (4,898,431 patterns)
Used in KDD-99's Knowledge Discovery and Data Mining Tools Competition: separate normal connections from attacks.
[Table: number of training patterns input to the SVM, test errors, SVM training time and other processing time (in seconds), AUC, balanced loss, and number of core vectors / support vectors, comparing random sampling (from 0.001% of the data, i.e., 47 patterns, up to 245,364 patterns), active learning, CB-SVM (KDD'03, 4,090 patterns) and CVM trained on all 4,898,431 patterns.]

23 Limitations
1. k(x, x) = constant for any pattern x
2. The QP problem is of the form $\max_\alpha\ -\alpha^\top K\alpha$ s.t. $\alpha^\top 1 = 1$, $\alpha \ge 0$
Condition 1 holds for kernels including isotropic kernels (e.g., the Gaussian kernel), dot-product kernels (e.g., the polynomial kernel) with normalized input, and any normalized kernel.
Condition 2 holds for kernel methods including the one-class and two-class SVMs.
However, there are still some popular kernel methods that violate these conditions and so cannot be used.

24 Motivating Example
Example (L2 support vector regression (SVR))
Training set: $\{z_i = (x_i, y_i)\}_{i=1}^m$ with $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$
Find $f(x) = w^\top\varphi(x) + b$ in F that minimizes the ε-insensitive loss
Primal: $\min_{w, b, \varepsilon, \xi_i, \xi_i^*}\ \|w\|^2 + b^2 + 2C\varepsilon + \frac{C}{\mu m}\sum_{i=1}^m(\xi_i^2 + \xi_i^{*2})$
s.t. $y_i - (w^\top\varphi(x_i) + b) \le \varepsilon + \xi_i$, $(w^\top\varphi(x_i) + b) - y_i \le \varepsilon + \xi_i^*$
Dual: $\max_{\lambda, \lambda^*}\ \frac{2}{C}[\lambda^\top\ \lambda^{*\top}]\begin{bmatrix} y \\ -y \end{bmatrix} - [\lambda^\top\ \lambda^{*\top}]\,\tilde{K}\begin{bmatrix} \lambda \\ \lambda^* \end{bmatrix}$ s.t. $[\lambda^\top\ \lambda^{*\top}]\,1 = 1$, $\lambda, \lambda^* \ge 0$, where
$\tilde{K} = [\tilde{k}(z_i, z_j)] = \begin{bmatrix} K + 11^\top + \frac{\mu m}{C}I & -(K + 11^\top) \\ -(K + 11^\top) & K + 11^\top + \frac{\mu m}{C}I \end{bmatrix}$

25 Center-Constrained MEB Problem
Modifications to the original MEB problem:
1. Augment an extra $\delta_i \in \mathbb{R}$ to each $\varphi(x_i)$, giving $\begin{bmatrix}\varphi(x_i)\\ \delta_i\end{bmatrix}$
2. Constrain the last coordinate of the ball's center to zero: $\begin{bmatrix} c \\ 0 \end{bmatrix}$
Finding the center-constrained MEB
Primal: $\min_{R, c}\ R^2$ s.t. $\left\|\begin{bmatrix} c \\ 0 \end{bmatrix} - \begin{bmatrix}\varphi(x_i)\\ \delta_i\end{bmatrix}\right\|^2 \le R^2$, where $\Delta = [\delta_1^2, \ldots, \delta_m^2]^\top \ge 0$
Dual: $\max_\alpha\ \alpha^\top(\mathrm{diag}(K) + \Delta) - \alpha^\top K\alpha$ s.t. $\alpha^\top 1 = 1$, $\alpha \ge 0$
Goal: transform the dual of SVR to this form

26 SVR as a Center-Constrained MEB Problem
SVR's dual (writing $\alpha = [\lambda^\top\ \lambda^{*\top}]^\top$):
$\max_\alpha\ \frac{2}{C}\alpha^\top\begin{bmatrix} y \\ -y \end{bmatrix} - \alpha^\top\tilde{K}\alpha$ s.t. $\alpha^\top 1 = 1$, $\alpha \ge 0$
Define $\Delta = -\mathrm{diag}(\tilde{K}) + \eta 1 + \frac{2}{C}\begin{bmatrix} y \\ -y \end{bmatrix}$ for η large enough that $\Delta \ge 0$
Using the constraint $\alpha^\top 1 = 1$ (so that $\eta\,\alpha^\top 1 = \eta$ is a constant), the dual is equivalent to
$\max_\alpha\ \alpha^\top(\mathrm{diag}(\tilde{K}) + \Delta) - \alpha^\top\tilde{K}\alpha$ s.t. $\alpha^\top 1 = 1$, $\alpha \ge 0$
which is thus of the required (center-constrained MEB) form!

27 Advantages
1. Allows a more general QP formulation
2. Can be used with any linear/nonlinear kernel; the kernel is no longer required to satisfy k(x, x) = constant

28 Friedman (200,000 Patterns)
[Figures: CPU time (in seconds), number of support vectors, RMSE and MRE versus training set size, comparing L2 SVR (CVR), L1 SVR (LIBSVM) and L1 SVR (SVM Light).]

29 Semi-Supervised Learning
Labeled patterns are rare, and expensive and time-consuming to collect; supervised learning can have poor performance when only very few labeled patterns are available
Unlabeled data are abundant and readily available at no cost (e.g., unlabeled webpages on the internet), and often have a manifold structure

30 Laplacian SVM
Incorporate a manifold regularizer [Belkin et al. 2005]:
$\min\ \frac{1}{l}\sum_{i=1}^l \xi_i + \frac{\lambda}{2}\|f\|^2_{\mathcal{H}_k} + \frac{\lambda_G}{2}\|f\|^2_G$ s.t. $y_i f(x_i) \ge 1 - \xi_i$, $\xi_i \ge 0$
Sparse Laplacian SVM:
$\min\ \|w\|^2 + b^2 + \underbrace{\frac{C}{l\mu}\sum_{i=1}^l \xi_i^2}_{\text{error}} + \underbrace{2C\epsilon}_{\text{margin}} + \underbrace{\frac{C\theta}{u\mu}\sum_{e\in\mathcal{E}}(\zeta_e^2 + \zeta_e^{*2})}_{\text{manifold regularizer}}$
s.t. $y_i(w^\top\varphi(x_i) + b) \ge 1 - \epsilon - \xi_i$, $w^\top\psi_e \le \epsilon + \zeta_e$, $-w^\top\psi_e \le \epsilon + \zeta_e^*$, $e \in \mathcal{E}$
Dual: a center-constrained MEB problem
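For concreteness (an added sketch, not the sparsified construction used in the talk), the manifold regularizer is typically the quadratic form $f^\top L f$, with L the Laplacian of a neighbourhood graph over labeled and unlabeled points; a minimal version with an assumed k-nearest-neighbour graph:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse import csgraph

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))            # labeled + unlabeled inputs together

# Symmetrized k-NN adjacency and its graph Laplacian L = D - W.
W = kneighbors_graph(X, n_neighbors=10, mode="connectivity")
W = 0.5 * (W + W.T)
L = csgraph.laplacian(W)

# Manifold regularizer for a vector f of predictions on all points:
f = rng.normal(size=X.shape[0])
manifold_penalty = float(f @ (L @ f))    # equals 0.5 * sum_ij W_ij * (f_i - f_j)**2
print(manifold_penalty)
```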

31 Two Moons (l = 2; u = 1,000,000)

32 Extended USPS: 0-vs-1 (l = 2; u = 266,077)

33 Extended MIT Face (l = 10; u = 100,000)

34 Multi-Instance Learning: Motivating Example
Content-based image retrieval: classify/retrieve images based on content
- each image is a bag, and each local image patch an instance
- an image is labeled positive when at least one of its segments is positive
Weak label information of the training data: only the bags (but not the individual instances) have known labels

35 Kernel-Based MI Learning Methods
Design MI kernels that operate on bags: the underlying quadratic programming (QP) problem only involves variables corresponding to the bags, but not the instances
(More direct approach) Associate the variables with instances, but not with bags: the bag label information is still used implicitly; for bag $B_i$ with instances $\{x_{ij}\}_{j=1}^{n_i}$, $f(B_i) = \max_{j=1,\ldots,n_i} f(x_{ij})$

36 Problems
Mixed integer problem: MI-SVM uses a simple optimization heuristic, and its convergence properties are unclear
Only the sign is important in classification: $\mathrm{sign}(f(B_i)) = \mathrm{sign}(\max_{j=1,\ldots,n_i} f(x_{ij}))$, so requiring $f(B_i) = \max_{j=1,\ldots,n_i} f(x_{ij})$ may be too restrictive
Cannot utilize both the bag and the instance information simultaneously: MI kernels (variables correspond only to the bags, not the instances); MI-SVM (variables correspond only to the instances, not the bags)

37 Proposed Approach
Introduce a loss function between $f(B_i)$ and the associated $f(x_{ij})$'s:
- allows both the bags and the instances to directly participate in the optimization process
- the learned function is smooth over both bags and instances
Optimization technique: MI-SVM uses an optimization heuristic; we use the constrained concave-convex procedure

38 Constrained Concave-Convex Procedure
An optimization tool for nonlinear optimization problems whose objective function can be expressed as a difference of convex functions

39 Constrained Concave-Convex Procedure (CCCP)
$\min_x\ f_0(x) - g_0(x)$ s.t. $f_i(x) - g_i(x) \le c_i$, $i = 1, \ldots, m$
$f_i, g_i$ ($i = 0, \ldots, m$) are real-valued, convex and differentiable functions on $\mathbb{R}^n$; $c_i \in \mathbb{R}$
Procedure:
1. start with an initial $x^{(0)}$
2. replace $g_i(x)$ with its first-order Taylor expansion at $x^{(t)}$
3. set $x^{(t+1)}$ to the solution of the relaxed optimization problem
$\min_x\ f_0(x) - \big[g_0(x^{(t)}) + \nabla g_0(x^{(t)})^\top(x - x^{(t)})\big]$ s.t. $f_i(x) - \big[g_i(x^{(t)}) + \nabla g_i(x^{(t)})^\top(x - x^{(t)})\big] \le c_i$
Converges to a local minimum solution
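A minimal unconstrained illustration (mine) of the concave-convex iteration on a toy difference-of-convex function; the constrained version above additionally linearizes the $g_i$ appearing in the constraints:

```python
import numpy as np

# Toy CCCP iteration on f0(x) - g0(x) with f0(x) = x**4 and g0(x) = 2*x**2
# (both convex).  Each step replaces g0 by its tangent at x_t and minimizes
# the convex surrogate  x**4 - 4*x_t*x + const,  whose minimizer is cbrt(x_t).
x = 3.0
for _ in range(30):
    x = np.cbrt(x)
print(x, x**4 - 2 * x**2)   # approaches the local minimum x = 1 with value -1
```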

40 Regularization Framework
A set of training bags: $\{(B_1, y_1), \ldots, (B_m, y_m)\}$, where $B_i = \{x_{i1}, x_{i2}, \ldots, x_{i n_i}\}$ is the i-th bag containing instances $x_{ij}$, and $y_i \in \{\pm 1\}$
Define a loss function that depends on both the training bags and the training instances: $V\big(\{B_i, y_i, f(B_i)\}_{i=1}^m, \{\{f(x_{ij})\}_{j=1}^{n_i}\}_{i=1}^m\big)$
Split the loss function V into two parts:
1. between each bag label and its bag prediction
2. between the prediction of each bag and the predictions of its constituent instances

41 Loss Function V: 1st Part
Between each bag label $y_i$ and its corresponding prediction $f(B_i)$: the hinge loss $(1 - y_i f(B_i))_+$, where $(z)_+ = \max(0, z)$

42 Loss Function V: 2nd Part
Between the prediction of each bag $f(B_i)$ and the predictions of its constituent instances $\{f(x_{ij}) \mid j = 1, \ldots, n_i\}$: $\ell(f(B_i), \max_j f(x_{ij}))$
Possible choices for $\ell$:
0-1 loss: $\ell(v_1, v_2) = 0$ if $v_1 = v_2$, and $\infty$ otherwise
L1 loss: $\ell(v_1, v_2) = |v_1 - v_2|$
L2 loss: $\ell(v_1, v_2) = (v_1 - v_2)^2$

43 Combining
$V = \frac{1}{m}\sum_{i=1}^m (1 - y_i f(B_i))_+ + \frac{\lambda}{m}\sum_{i=1}^m \ell\big(f(B_i), \max_j f(x_{ij})\big)$
λ trades off the two components
Special cases:
1. Using only the first part, $\frac{1}{m}\sum_{i=1}^m (1 - y_i f(B_i))_+$: the same as using the MI kernel
2. Using $\ell(v_1, v_2) = 0$ if $v_1 = v_2$ and $\infty$ otherwise: the same as MI-SVM
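To make the combined loss concrete, here is a small sketch (mine) that evaluates V with the hinge term and the L1 choice for ℓ, given hypothetical per-bag scores f(B_i) and per-instance scores f(x_ij):

```python
import numpy as np

def combined_mi_loss(bag_scores, inst_scores, y, lam):
    """V = (1/m) sum hinge(1 - y_i f(B_i)) + (lam/m) sum |f(B_i) - max_j f(x_ij)|.
    bag_scores: length-m array of f(B_i); inst_scores: list of arrays of f(x_ij);
    y: +/-1 bag labels.  Uses the L1 choice for the bag/instance link loss."""
    m = len(bag_scores)
    hinge = np.maximum(0.0, 1.0 - y * bag_scores).sum()
    link = sum(abs(b - inst.max()) for b, inst in zip(bag_scores, inst_scores))
    return hinge / m + lam * link / m

bag_scores = np.array([0.8, -1.2])
inst_scores = [np.array([0.1, 0.9, -0.3]), np.array([-1.5, -0.7])]
print(combined_mi_loss(bag_scores, inst_scores, np.array([1, -1]), lam=0.5))
```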

44 Optimization Problem
Introduce $\xi = [\xi_1, \ldots, \xi_m]^\top$: slack variables for the errors on the bags; γ, λ: tradeoff parameters
$\min_{f\in\mathcal{H}, \xi}\ \frac{1}{2}\|f\|^2_{\mathcal{H}} + \frac{\gamma}{m}\|\xi\|_1 + \frac{\gamma\lambda}{m}\sum_{i=1}^m \ell\big(f(B_i), \max_{j=1,\ldots,n_i} f(x_{ij})\big)$ s.t. $y_i f(B_i) \ge 1 - \xi_i$, $\xi \ge 0$
Representer theorem: $f(x) = \sum_{i=1}^m \alpha_{i0}\, k(x, B_i) + \sum_{i=1}^m\sum_{j=1}^{n_i} \alpha_{ij}\, k(x, x_{ij})$, with $\alpha_{i0}, \alpha_{ij} \in \mathbb{R}$
α: vector collecting all the $\alpha_{i0}$'s and $\alpha_{ij}$'s

45 Using the L1 Loss for $\ell(\cdot, \cdot)$
K: kernel matrix; $k_i$: i-th column of K
$\min_{\alpha, \xi, \delta, b}\ \frac{1}{2}\alpha^\top K\alpha + \frac{\gamma}{m}\|\xi\|_1 + \frac{\gamma\lambda}{m}\|\delta\|_1$
s.t. $y_i(k_{I(B_i)}^\top\alpha + b) \ge 1 - \xi_i$, $\xi \ge 0$,
$k_{I(x_{ij})}^\top\alpha \le \delta_i + k_{I(B_i)}^\top\alpha$,
$k_{I(B_i)}^\top\alpha - \max_{j=1,\ldots,n_i}\big(k_{I(x_{ij})}^\top\alpha\big) \le \delta_i$
Objective: quadratic. First three constraints: linear. Last constraint: nonlinear, but a difference of two convex functions.

46 Optimization using CCCP
$\alpha^{(t)}$: estimate of α at the t-th iteration; $\beta_{ij}^{(t)}$: estimate of $\beta_{ij}$, the subgradient coefficients of $\max_{j}\big(k_{I(x_{ij})}^\top\alpha\big)$ at $\alpha^{(t)}$
At each CCCP iteration, solve the QP
$\min_{\alpha, \xi, \delta, b}\ \frac{1}{2}\alpha^\top K\alpha + \frac{\gamma}{m}\|\xi\|_1 + \frac{\gamma\lambda}{m}\|\delta\|_1$
s.t. $y_i(k_{I(B_i)}^\top\alpha + b) \ge 1 - \xi_i$, $\xi \ge 0$,
$k_{I(x_{ij})}^\top\alpha \le \delta_i + k_{I(B_i)}^\top\alpha$,
$k_{I(B_i)}^\top\alpha - \sum_{j=1}^{n_i}\beta_{ij}^{(t)}\, k_{I(x_{ij})}^\top\alpha \le \delta_i$
Iterative procedure:
1. obtain α from this QP
2. use this as $\alpha^{(t+1)}$ and iterate

47 Using the Loss Function in MI-SVM
With a particular choice of the subgradient, this is identical to the optimization heuristic in MI-SVM
MI-SVM: no convergence proof; CCCP: guaranteed convergence

48 Classification: Image Categorization on Corel Images
Data set: used in Chen and Wang (JMLR 2004); 10 classes (beach, flowers, horses, etc.), each containing 100 images; each image is a bag and each image segment an instance
Procedure: same as in (Chen and Wang); the data are randomly divided into a training and a test set, each containing 50 images of each category; this is repeated 5 times and the average accuracy reported; model parameters are selected by a validation set

49 Results
Accuracy (%):
DD-SVM (Chen and Wang 2004): 81.5 ± 3.0
Hist-SVM (Chapelle et al. 1999): 66.7 ± 2.2
MI-SVM (Andrews et al. 2003): 74.7 ± 0.6
SVM (MI kernel) (Gärtner et al. 2002): 84.1 ± 0.90
Our method: 84.4 ± 1.38
Results on DD-SVM, Hist-SVM and MI-SVM are from (Chen and Wang 2004)
MI kernel used: the normalized set kernel $k(B_1, B_2) = \frac{k_{set}(B_1, B_2)}{\sqrt{k_{set}(B_1, B_1)\, k_{set}(B_2, B_2)}}$, where $k_{set}(B_1, B_2) = \sum_{x\in B_1, z\in B_2} k(x, z)$ and k is the Gaussian kernel
Our method uses the L1 loss; its improvement is significant at the 0.01 level of significance
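A short sketch (mine) of the normalized set kernel quoted above, with a Gaussian base kernel between instances:

```python
import numpy as np

def gauss(X, Z, gamma=1.0):
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def k_set(B1, B2, gamma=1.0):
    """Unnormalized set kernel: sum of base-kernel values over all instance pairs."""
    return gauss(B1, B2, gamma).sum()

def normalized_set_kernel(B1, B2, gamma=1.0):
    """k(B1, B2) = k_set(B1, B2) / sqrt(k_set(B1, B1) * k_set(B2, B2))."""
    return k_set(B1, B2, gamma) / np.sqrt(k_set(B1, B1, gamma) * k_set(B2, B2, gamma))

rng = np.random.default_rng(0)
bag1, bag2 = rng.normal(size=(4, 3)), rng.normal(size=(6, 3))   # two bags of instances
print(normalized_set_kernel(bag1, bag2))
```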

50 Regression: Synthetic Musk Molecules
Predict the real-valued binding energies of musk molecules
Synthetic LJ data sets generated by Dooly et al. (JMLR 2002), based on an affinity model between the musk molecules and receptors; one of them has 16 relevant features (30 features in total) and 2 scale factors
To make the task more challenging, three more LJ data sets were created by adding irrelevant features; e.g., one is generated by adding 20 more irrelevant features while keeping the real-valued outputs intact

51 Results
[Table: %err and MSE of DD, citation-kNN, SVM (MI kernel) and our method on the six LJ data sets; results of DD and citation-kNN on the first three data sets are from (Dooly et al. 2002), with one citation-kNN entry not available.]
DD does not perform well
On the easier data sets, our method has comparable or better performance
On the more challenging data sets, the nearest neighbor-based and DD algorithms degrade with more irrelevant features, while our SVM-based approach is consistently the best

52 Kernel methods can now be used on massive data sets:
- novelty detection (unsupervised learning)
- classification/regression (supervised learning)
- manifold regularization (semi-supervised learning)
- maximum margin discriminant analysis (feature extraction)
Kernel methods can also be used for multi-instance learning in a disciplined manner:
- allows a loss function between the outputs of a bag and its associated instances
- both bags and instances can now directly participate in the optimization process
- by using CCCP, there is no need for optimization heuristics
Open question: how to design MI kernels? (e.g., the marginalized kernel)

53 Recent Research
[Diagram: map of recent research around kernel methods (ensemble learning, multi-instance learning, feature extraction, large data sets, semi-supervised learning, kernel learning, and applications in speech recognition, image/vision and pervasive computing), annotated with publications in ICML, NIPS, IJCAI, KDD, JMLR, MLJ, TNN, TKDE, TSAP, TASLP and GLOBECOM, 2003-2007.]

54 jamesk Google: james kwok
