Support Vector Machines and their Applications


Purushottam Kar, Department of Computer Science and Engineering, Indian Institute of Technology Kanpur. Summer School on Expert Systems and Their Applications, Indian Institute of Information Technology Allahabad. June 14.

Support Vector Machines
What is being supported?
How can vectors support anything?
Wait!! Machines?? Is this a Mechanical Engineering lecture?

The Learning Methodology
Is it possible to write an algorithm to distinguish between...
a well-formed and an ill-formed C++ program?
a palindrome and a non-palindrome?
a graph with and without cliques of size bigger than 1000?

The Learning Methodology
Is it possible to write an algorithm to distinguish between...
a handwritten 4 and a handwritten 9?
a spam and a non-spam email?
a positive movie review and a negative movie review?

Statistical Machine Learning
Synthesize a program based on training data.
Assume training data that is randomly generated from some unknown but fixed distribution and a target function.
Give probabilistic error bounds - in other words, be probably-approximately-correct.
The motto - let the data decide the algorithm.

Expert Systems
"... a computing system capable of representing and reasoning about some knowledge-rich domain, such as internal medicine or geology..."
Introduction to Expert Systems, Peter Jackson, Addison Wesley Publishing Company.

Linear Machines
Arguably the simplest of classifiers acting on vectorial data.
Numerous learning algorithms - Perceptron, SVM.
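
As a point of reference for the linear machines above, here is a minimal sketch of the classical Perceptron update rule on a toy dataset; the data and variable names are illustrative, not from the lecture.

```python
import numpy as np

# Toy 2-D linearly separable data; labels in {-1, +1}. Illustrative only.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])

w = np.zeros(X.shape[1])  # weight vector
b = 0.0                   # bias term

# Perceptron: repeatedly pick a misclassified point and move the
# hyperplane towards it; terminates if the data is linearly separable.
for epoch in range(100):
    mistakes = 0
    for x_i, y_i in zip(X, y):
        if y_i * (np.dot(w, x_i) + b) <= 0:  # misclassified (or on the boundary)
            w += y_i * x_i
            b += y_i
            mistakes += 1
    if mistakes == 0:
        break

print("learned hyperplane:", w, b)
```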

Support Vector Machines
A special hyperplane - the one with the maximum margin.
The margin of a point measures how far it is from the hyperplane.

Learning the Maximum Margin Classifier

$$\min_{w,b} \; \|w\|^2 \quad \text{subject to} \quad y_i(\langle w, x_i \rangle + b) \ge 1, \quad i = 1, \dots, l.$$

A linearly-constrained quadratic program.
Solvable in polynomial time - several algorithms known.
Does not give us much insight into the nature of the hyperplane.
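
To make the quadratic program concrete, the sketch below solves the hard-margin primal above with a generic convex solver (cvxpy). The toy data is illustrative and assumes the two classes are in fact linearly separable.

```python
import numpy as np
import cvxpy as cp

# Toy linearly separable data; labels y_i in {-1, +1}. Illustrative only.
X = np.array([[2.0, 2.0], [2.5, 1.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

l, d = X.shape
w = cp.Variable(d)
b = cp.Variable()

# minimize ||w||^2  subject to  y_i (<w, x_i> + b) >= 1
constraints = [cp.multiply(y, X @ w + b) >= 1]
problem = cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints)
problem.solve()

print("w =", w.value, " b =", b.value)
print("margin =", 1.0 / np.linalg.norm(w.value))
```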

Non-linearly Separable Data
Use slack variables to allow points to lie on the wrong side of the hyperplane.
Can still be solved using a QCQP.

Learning the Soft Margin Classifier

$$\min_{w,b,\xi} \; \|w\|^2 + C \sum_{i=1}^{l} \xi_i^2 \quad \text{subject to} \quad y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i, \quad i = 1, \dots, l.$$

Again a linearly-constrained quadratic program.
More insight is gained by looking at the dual program.
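
In practice the soft-margin classifier is rarely coded by hand; a library such as scikit-learn's SVC can be used. Note that, unlike the squared-slack penalty on the slide, SVC penalizes the slacks linearly (the standard hinge-loss soft margin), so this is a closely related sketch rather than the exact program above; the data is again a small illustrative toy set.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data that is NOT linearly separable; labels in {-1, +1}. Illustrative only.
X = np.array([[2.0, 2.0], [2.5, 1.0], [1.0, 1.2],
              [-1.0, -1.5], [-2.0, -1.0], [1.5, 1.8]])
y = np.array([1, 1, -1, -1, -1, 1])

# C trades off margin width against slack: a large C allows fewer margin violations.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print("w =", clf.coef_, " b =", clf.intercept_)
print("predictions:", clf.predict(X))
```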

The Dual Program for the Hard Margin SVM

$$\max_{\alpha} \; \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle \quad \text{subject to} \quad \sum_{i=1}^{l} y_i \alpha_i = 0, \quad \alpha_i \ge 0, \; i = 1, \dots, l.$$

Some properties of the optimum $(w^*, b^*, \alpha^*)$:
$\alpha_i^* \left[ y_i(\langle w^*, x_i \rangle + b^*) - 1 \right] = 0$, $i = 1, \dots, l$
$w^* = \sum_{i=1}^{l} y_i \alpha_i^* x_i$
$f(x, \alpha^*, b^*) = \sum_{i=1}^{l} y_i \alpha_i^* \langle x_i, x \rangle + b^*$
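
The optimality conditions above can be checked numerically. In scikit-learn's SVC the attribute dual_coef_ stores the products y_i * alpha_i for the support vectors, so the expansion w* = sum_i y_i alpha_i* x_i can be reconstructed directly. This is a sketch on illustrative toy data; a very large C is used to approximate the hard-margin case.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data; labels in {-1, +1}. Illustrative only.
X = np.array([[2.0, 2.0], [2.5, 1.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6)   # huge C ~ hard margin
clf.fit(X, y)

# dual_coef_[0, k] holds y_k * alpha_k for the k-th support vector
w_from_dual = clf.dual_coef_[0] @ clf.support_vectors_
print("w via dual expansion :", w_from_dual)
print("w reported by solver :", clf.coef_[0])

# Complementary slackness: support vectors lie on the margin,
# so y_i (<w, x_i> + b) should be (numerically) 1 for each of them.
margins = y[clf.support_] * (clf.support_vectors_ @ clf.coef_[0] + clf.intercept_[0])
print("margins of support vectors:", margins)
```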

The Support
Consider each vector applying a force of $\alpha_i^*$ on the hyperplane in the direction $y_i \, w^*/\|w^*\|_2$.
The optimality conditions exactly correspond to the force and the torque on the hyperplane being zero.
Hence the vectors lying on the margin, for which $\alpha_i^* \neq 0$, support the hyperplane.

Luckiness and Generalization
Essentially, the idea is to show that the probability of the training sample misleading us into learning an erroneous classifier is small.
To do this, model selection has to be done carefully.
The classifier family should not be too powerful, so as to prevent overfitting.

Bounds for Linear Classifiers
Theorem (Vapnik and Chervonenkis). For any probability distribution on the input domain $\mathbb{R}^d$, and any target function $g$, with probability no less than $1 - \delta$, any linear hyperplane $f$ classifying a randomly chosen training set of size $l$ perfectly cannot disagree with the target function on more than an $\epsilon$ fraction of the input domain (with respect to the underlying distribution), where
$$\epsilon = \frac{2}{l}\left((d+1)\log\frac{2el}{d} + \log\frac{2}{\delta}\right).$$

Bounds for Large Margin Classifiers
Theorem (Vapnik). For any probability distribution on the input domain (a ball of radius $R$), and any target function $g$, with probability no less than $1 - \delta$, any linear hyperplane $f$ classifying a randomly chosen training set of size $l$ perfectly with margin $\gamma$ cannot disagree with the target function on more than an $\epsilon$ fraction of the input domain (with respect to the underlying distribution), where
$$\epsilon = \tilde{O}\left(\frac{1}{l}\left(\frac{R^2}{\gamma^2} + \log\frac{1}{\delta}\right)\right).$$
NOTE: the error bound is independent of the dimension!

The XOR Problem
The feature map $\Phi : (x, y) \mapsto (x^2, y^2, \sqrt{2}\,xy)$ makes the problem linearly separable.
But do we need to perform the actual feature map?
No! All that the SVM training algorithm requires are the values $\langle \Phi(x_i), \Phi(x_j) \rangle$.
And $\langle \Phi(x_i), \Phi(x_j) \rangle = \langle x_i, x_j \rangle^2$.
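
A small numerical check of the identity above (illustrative code, not part of the original slides):

```python
import numpy as np

def phi(v):
    """Explicit feature map (x, y) -> (x^2, y^2, sqrt(2)*x*y)."""
    x, y = v
    return np.array([x * x, y * y, np.sqrt(2) * x * y])

a, b = np.array([1.0, -2.0]), np.array([3.0, 0.5])

lhs = phi(a) @ phi(b)   # inner product in feature space
rhs = (a @ b) ** 2      # kernel evaluated directly in input space
print(lhs, rhs)         # both print 4.0 -> no explicit feature map is needed
```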

The Kernel Trick
Many algorithms admit the kernel trick - SVM, SVM regression, kernel PCA, kernel clustering, the Perceptron.
However, the kernel trick is not used for the Perceptron algorithm - why?
Note that not all kernels correspond to feature maps.
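
One way to see the last point: a function corresponds to some feature map only if every Gram matrix it produces is positive semi-definite (Mercer's condition). The sketch below checks this numerically for two candidate kernels on random points; the choice of candidates is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))   # 20 random points in R^3, illustrative

def gram(kernel, X):
    return np.array([[kernel(a, b) for b in X] for a in X])

# A valid kernel: the squared inner product (the XOR feature map above).
k_valid = lambda a, b: (a @ b) ** 2
# An invalid candidate: minus the squared Euclidean distance is not PSD in general.
k_invalid = lambda a, b: -np.sum((a - b) ** 2)

for name, k in [("(<a,b>)^2", k_valid), ("-||a-b||^2", k_invalid)]:
    eigvals = np.linalg.eigvalsh(gram(k, X))
    print(name, "min eigenvalue:", eigvals.min())
# The first Gram matrix has (numerically) non-negative eigenvalues;
# the second has strictly negative ones, so it cannot come from a feature map.
```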

Implementations
Many techniques are known - the ellipsoid method, gradient descent, Sequential Minimal Optimization - the latter being tuned to this particular QP problem.
LIBSVM - supports multi-class classification and regression.
SVMlight - good scalability.
General convex optimization solvers - CVX, SeDuMi - compatible with MATLAB.
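
For reference, scikit-learn's SVC is a thin wrapper around LIBSVM, so a multi-class, kernelized SVM can be trained in a few lines; the iris dataset is used purely as a convenient stand-in.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Three-class toy problem; SVC (backed by LIBSVM) handles multi-class
# classification via a one-vs-one decomposition internally.
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```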

Handwritten Digit Recognition
Boser-Guyon-Vapnik - the first real-world application of SVMs.
Two datasets, USPS and NIST, were used for training and testing.
Performance was stable even across a range of kernel choices.
Even simple polynomial kernels gave improvements (3.2% error) over MSE techniques like backpropagation or ridge regression (12.7% error) on the USPS dataset.
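
A small reproduction in spirit: scikit-learn ships an 8x8 digits dataset (much smaller than USPS/NIST, so the numbers are not comparable to those above), and a polynomial-kernel SVM can be trained on it as a sketch.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# 8x8 grayscale digit images flattened to 64 features; a stand-in for USPS/NIST.
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Low-degree polynomial kernel, in the spirit of the early digit-recognition experiments.
clf = SVC(kernel="poly", degree=3, C=1.0, gamma="scale")
clf.fit(X_tr, y_tr)
print("test error: {:.1%}".format(1.0 - clf.score(X_te, y_te)))
```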

Text Categorization
Joachims - used insight from information retrieval research.
The Reuters collection of labeled news stories was used.
A native kernel was used.
Performance was measured using the precision-recall break-even point.
The aggregate performance of Bayes, k-NN and C4.5 was around or below 80%, whereas even the native kernel achieved more than 84%.
Using RBF kernels improved performance to more than 86%.
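
A sketch of the same pipeline with modern tooling: TF-IDF features (the standard IR representation Joachims built on) fed to a linear SVM. The tiny corpus below is made up for illustration; the Reuters collection itself is not bundled with scikit-learn.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up miniature corpus standing in for the Reuters news stories.
docs = ["stocks rallied as earnings beat expectations",
        "the central bank raised interest rates again",
        "the team won the championship after extra time",
        "the striker scored twice in the final match"]
labels = ["finance", "finance", "sports", "sports"]

# TF-IDF weighting followed by a linear-kernel SVM.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(docs, labels)

print(model.predict(["rates and earnings moved the markets",
                     "a late goal decided the match"]))
```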

Image-based Gender Identification
Varma-Babu - an application of multiple kernel learning.
Learn the kernel as a linear combination of base kernels.
Performance gains of 5-10% were observed over other kernel-based learning techniques.
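
The sketch below combines two base kernels with fixed weights and passes the resulting Gram matrix to SVC(kernel="precomputed"). Learning the weights themselves, as in multiple kernel learning, requires an outer optimization that is not shown; the data and weights are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Fixed illustrative weights; MKL would learn these from data.
mu = [0.4, 0.6]
def combined(A, B):
    # A non-negative combination of kernels is again a valid kernel.
    return mu[0] * linear_kernel(A, B) + mu[1] * rbf_kernel(A, B, gamma=0.5)

clf = SVC(kernel="precomputed", C=1.0)
clf.fit(combined(X_tr, X_tr), y_tr)                          # train-vs-train Gram matrix
print("accuracy:", clf.score(combined(X_te, X_tr), y_te))    # test-vs-train kernel values
```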

Topic Drift in Page-ranking Algorithms
Karnick-Saradhi - an application of multi-class support vector data description (SVDD).
Use SVDD to obtain representative pages for a topic and prune irrelevant pages.
Use a kernel based on both link and content information.
Performance gains in terms of precision and recall were observed over other existing topic distillation techniques.
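
scikit-learn does not ship SVDD itself, but the closely related one-class SVM (equivalent to SVDD for kernels with constant diagonal, such as the RBF kernel) gives the flavour of describing a class by a small region and flagging everything outside it. The data here is synthetic and illustrative.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Synthetic "representative pages": a tight cluster, plus a few far-away outliers.
inliers = rng.normal(loc=0.0, scale=0.5, size=(100, 2))
outliers = rng.uniform(low=-6, high=6, size=(10, 2))

# nu bounds the fraction of training points treated as outliers.
model = OneClassSVM(kernel="rbf", nu=0.05, gamma=0.5)
model.fit(inliers)

print("inliers kept   :", (model.predict(inliers) == 1).mean())
print("outliers pruned:", (model.predict(outliers) == -1).mean())
```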

Bibliography
An Introduction to Support Vector Machines, Nello Cristianini and John Shawe-Taylor, Cambridge University Press.
Learning with Kernels, Bernhard Schölkopf and Alexander J. Smola, The MIT Press.
