Recognition Tools: Support Vector Machines


1 CS 2770: Computer Vision Recognition Tools: Support Vector Machines Prof. Adriana Kovashka University of Pittsburgh January 12, 2017

2 Announcement TA office hours: Tuesday 4pm-6pm Wednesday 10am-12pm

3 Matlab Tutorial Please cover whatever we don't finish at home.

4 Tutorials and Exercises 750/Tutorial/ tlab_probs2.pdf lp211/basiceercises.html Do Problems 1-8, 12. Most also have solutions. Ask the TA if you have any problems.

5 Plan for today What is classification/recognition? Support vector machines Separable case / non-separable case Linear / non-linear (kernels) The importance of generalization The bias-variance trade-off (applies to all classifiers)

6 Classification Given a feature representation for images, how do we learn a model for distinguishing features from different classes? Decision boundary Zebra Non-zebra Slide credit: L. Lazebnik

7 Classification Assign input vector to one of two or more classes Any decision rule divides the input space into decision regions separated by decision boundaries Slide credit: L. Lazebnik

8 Example: Spam filter Slide credit: L. Lazebnik

9 Examples of Categorization in Vision Part or object detection: e.g., for each window: face or non-face? Scene categorization: indoor vs. outdoor, urban, forest, kitchen, etc. Action recognition: picking up vs. sitting down vs. standing. Emotion recognition: happy vs. scared vs. surprised. Region classification: label pixels into different object/surface categories. Boundary classification: boundary vs. non-boundary. Etc, etc. Adapted from D. Hoiem

10 Image categorization Two-class (binary): Cat vs Dog Adapted from D. Hoiem

11 Image categorization Multi-class (often): Object recognition Caltech 101 Average Object Images Adapted from D. Hoiem

12 Image categorization Fine-grained recognition Visipedia Project Slide credit: D. Hoiem

13 Image categorization Place recognition Places Database [Zhou et al. NIPS 2014] Slide credit: D. Hoiem

14 Image categorization Dating historical photos [Palermo et al. ECCV 2012] Slide credit: D. Hoiem

15 Image categorization Image style recognition [Karayev et al. BMVC 2014] Slide credit: D. Hoiem

16 Region categorization Material recognition [Bell et al. CVPR 2015] Slide credit: D. Hoiem

17 Why recognition? Recognition a fundamental part of perception e.g., robots, autonomous agents Organize and give access to visual content Connect to information Detect trends and themes Slide credit: K. Grauman

18 Recognition: A machine learning approach

19 The machine learning framework Apply a prediction function to a feature representation of the image to get the desired output: f([image]) = apple, f([image]) = tomato, f([image]) = cow. Slide credit: L. Lazebnik

20 The machine learning framework $y = f(\mathbf{x})$, where $y$ is the output, $f$ the prediction function, and $\mathbf{x}$ the image feature. Training: given a training set of labeled examples $\{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)\}$, estimate the prediction function $f$ by minimizing the prediction error on the training set. Testing: apply $f$ to a never-before-seen test example $\mathbf{x}$ and output the predicted value $y = f(\mathbf{x})$. Slide credit: L. Lazebnik

21 Steps Training: training images -> image features; image features + training labels -> training -> learned model. Testing: test image -> image features -> learned model -> prediction. Slide credit: D. Hoiem and L. Lazebnik

22 The simplest classifier [Figure: training examples from class 1, training examples from class 2, and a test example.] $f(\mathbf{x})$ = label of the training example nearest to $\mathbf{x}$. All we need is a distance function for our inputs. No training required! Slide credit: L. Lazebnik

23 K-Nearest Neighbors classification For a new point, find the k closest points from training data Labels of the k points vote to classify Black = negative Red = positive k = 5 If query lands here, the 5 NN consist of 3 negatives and 2 positives, so we classify it as negative. Slide credit: D. Lowe
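The nearest-neighbor rules on the last two slides can be sketched in a few lines. This is a minimal illustration, not the course's reference code: the function name `knn_classify`, the Euclidean distance, and the toy points are all assumptions chosen so that the 5 nearest neighbors are 3 negatives and 2 positives, as in the slide's example.

```python
import math
from collections import Counter

def knn_classify(query, points, labels, k=1):
    """Return the majority label among the k training points nearest to query."""
    # Rank training indices by Euclidean distance to the query point.
    order = sorted(range(len(points)), key=lambda i: math.dist(query, points[i]))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data (not from the slides).
points = [(0.0, 1.0), (1.0, 0.0), (-1.0, 0.0), (0.5, 0.5), (0.0, -0.4), (5.0, 5.0)]
labels = ["negative", "negative", "negative", "positive", "positive", "positive"]
query = (0.0, 0.0)

pred_1nn = knn_classify(query, points, labels, k=1)  # nearest point is (0.0, -0.4)
pred_5nn = knn_classify(query, points, labels, k=5)  # 3 negatives outvote 2 positives
```

With k = 1 the single closest point decides the label; with k = 5 the vote flips to negative, which shows how k changes the decision.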

24 Where in the World? Slides: James Hays

25 im2gps: Estimating Geographic Information from a Single Image James Hays and Alexei Efros CVPR 2008 Nearest Neighbors according to GIST + bag of SIFT + color histogram + a few others Slide credit: James Hays

26 The Importance of Data Slides: James Hays

27 Linear classifier Find a linear function to separate the classes: $f(\mathbf{x}) = \operatorname{sgn}(w_1 x_1 + w_2 x_2 + \ldots + w_D x_D) = \operatorname{sgn}(\mathbf{w} \cdot \mathbf{x})$ Slide credit: L. Lazebnik

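28 Linear classifier Decision $= \operatorname{sign}(\mathbf{w}^\top \mathbf{x}) = \operatorname{sign}(w_1 x_1 + w_2 x_2)$ [Figure: axes $x_1$ and $x_2$ with origin (0, 0).] What should the weights be?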
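A minimal sketch of the linear decision rule, assuming the decision is the sign of the weighted sum of the two feature values. The weights are hypothetical, and treating a score of exactly 0 as positive is a convention chosen here, not something the slide specifies.

```python
def linear_classify(w, x):
    """Return +1 if the weighted sum w . x is non-negative, else -1."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if score >= 0 else -1

w = (1.0, -1.0)                        # hypothetical weights
pos = linear_classify(w, (2.0, 0.5))   # score = 1.5
neg = linear_classify(w, (0.5, 2.0))   # score = -1.5
```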

29-33 Lines in $\mathbb{R}^2$ Let $\mathbf{w} = \begin{bmatrix} a \\ c \end{bmatrix}$ and $\mathbf{x} = \begin{bmatrix} x \\ y \end{bmatrix}$. The line $ax + cy + b = 0$ can then be written as $\mathbf{w} \cdot \mathbf{x} + b = 0$. The distance from a point $(x_0, y_0)$ to the line is $D = \frac{|a x_0 + c y_0 + b|}{\sqrt{a^2 + c^2}} = \frac{|\mathbf{w}^\top \mathbf{x}_0 + b|}{\|\mathbf{w}\|}$, where $\mathbf{x}_0 = (x_0, y_0)$. Kristen Grauman
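The distance from a point to a line can be checked numerically. The sketch below assumes the line is written as a*x + c*y + b = 0 and uses the standard formula |a*x0 + c*y0 + b| / sqrt(a^2 + c^2); the function name and the numbers are illustrative only.

```python
import math

def point_line_distance(a, c, b, x0, y0):
    """Distance from (x0, y0) to the line a*x + c*y + b = 0."""
    return abs(a * x0 + c * y0 + b) / math.hypot(a, c)

# Line 3x + 4y - 5 = 0 and point (3, 4): |9 + 16 - 5| / 5 = 4
d = point_line_distance(3.0, 4.0, -5.0, 3.0, 4.0)
```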

34 Linear classifiers Find linear function to separate positive and negative examples: $\mathbf{x}_i$ positive: $\mathbf{x}_i \cdot \mathbf{w} + b \geq 0$; $\mathbf{x}_i$ negative: $\mathbf{x}_i \cdot \mathbf{w} + b < 0$. Which line is best? C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998

35 Support vector machines Discriminative classifier based on optimal separating line (for 2d case). Maximize the margin between the positive and negative training examples. C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998

36-38 Support vector machines Want line that maximizes the margin. $\mathbf{x}_i$ positive ($y_i = 1$): $\mathbf{x}_i \cdot \mathbf{w} + b \geq 1$. $\mathbf{x}_i$ negative ($y_i = -1$): $\mathbf{x}_i \cdot \mathbf{w} + b \leq -1$. For support vectors, $\mathbf{x}_i \cdot \mathbf{w} + b = \pm 1$. Distance between point $\mathbf{x}_i$ and line: $\frac{|\mathbf{x}_i \cdot \mathbf{w} + b|}{\|\mathbf{w}\|}$. For support vectors: $\frac{\mathbf{w}^\top \mathbf{x} + b}{\|\mathbf{w}\|} = \frac{\pm 1}{\|\mathbf{w}\|}$, so $M = \left| \frac{1}{\|\mathbf{w}\|} - \frac{-1}{\|\mathbf{w}\|} \right| = \frac{2}{\|\mathbf{w}\|}$. Therefore, the margin is $2 / \|\mathbf{w}\|$. [Figure: support vectors and margin.] C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998

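39 Finding the maximum margin line 1. Maximize margin $2 / \|\mathbf{w}\|$. 2. Correctly classify all training data points: $\mathbf{x}_i$ positive ($y_i = 1$): $\mathbf{x}_i \cdot \mathbf{w} + b \geq 1$; $\mathbf{x}_i$ negative ($y_i = -1$): $\mathbf{x}_i \cdot \mathbf{w} + b \leq -1$. Quadratic optimization problem: Minimize $\frac{1}{2} \mathbf{w}^\top \mathbf{w}$ subject to $y_i (\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1$. One constraint for each training point. Note sign trick: multiplying by $y_i$ folds the positive and negative constraints into one. C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998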
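To make the constraint and the margin concrete, here is a small sketch that checks y_i (w . x_i + b) >= 1 for every training point and evaluates the margin 2 / ||w||. It does not solve the quadratic program; the candidate w, b and the toy points are hypothetical values picked by hand.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def satisfies_constraints(w, b, X, y):
    """True if y_i * (w . x_i + b) >= 1 holds for every training point."""
    return all(yi * (dot(w, xi) + b) >= 1 for xi, yi in zip(X, y))

def margin(w):
    """Margin width 2 / ||w|| of the separating line."""
    return 2.0 / math.sqrt(dot(w, w))

# Hypothetical toy data and candidate solution.
X = [(1.0, 1.0), (2.0, 2.0), (-1.0, -1.0), (-2.0, 0.0)]
y = [1, 1, -1, -1]
w, b = (0.5, 0.5), 0.0

feasible = satisfies_constraints(w, b, X, y)
m = margin(w)
# Shrinking w would widen the margin but violates the constraints.
too_small = satisfies_constraints((0.25, 0.25), 0.0, X, y)
```

The last line illustrates why the constraints matter: a smaller ||w|| gives a larger nominal margin, but the training points then fall inside it.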

40 Finding the maximum margin line Solution: $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$, where $\alpha_i$ is the learned weight and $\mathbf{x}_i$ a support vector. C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998

41 Finding the maximum margin line Solution: $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$, $b = y_i - \mathbf{w} \cdot \mathbf{x}_i$ (for any support vector). Classification function: $f(\mathbf{x}) = \operatorname{sign}(\mathbf{w} \cdot \mathbf{x} + b) = \operatorname{sign}\left(\sum_i \alpha_i y_i \, \mathbf{x}_i \cdot \mathbf{x} + b\right)$. Notice that it relies on an inner product between the test point $\mathbf{x}$ and the support vectors $\mathbf{x}_i$. (Solving the optimization problem also involves computing the inner products $\mathbf{x}_i \cdot \mathbf{x}_j$ between all pairs of training points.) If $f(\mathbf{x}) < 0$, classify as negative, otherwise classify as positive. C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998
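A small sketch of the solution form, assuming w is the weighted sum of support vectors (weights alpha_i, labels y_i), b = y_i - w . x_i for any support vector, and the classifier is the sign of the summed inner products plus b. The two support vectors and their weights of 0.25 are a hand-worked toy case, not output from a solver.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical two-point problem: one support vector per class.
support_vectors = [(1.0, 1.0), (-1.0, -1.0)]
sv_labels = [1, -1]
alphas = [0.25, 0.25]  # hand-derived for this toy case

# w = sum_i alpha_i * y_i * x_i
w = [sum(a * y * x[d] for a, y, x in zip(alphas, sv_labels, support_vectors))
     for d in range(2)]
# b = y_i - w . x_i, using the first support vector
b = sv_labels[0] - dot(w, support_vectors[0])

def classify(x):
    """Sign of sum_i alpha_i * y_i * (x_i . x) + b; uses only inner products."""
    score = sum(a * y * dot(sv, x)
                for a, y, sv in zip(alphas, sv_labels, support_vectors)) + b
    return 1 if score >= 0 else -1

label_a = classify((2.0, 0.0))
label_b = classify((-3.0, 1.0))
```

Note that `classify` never forms w explicitly: it touches the test point only through inner products with the support vectors, which is the property the kernel trick later exploits.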

42 Nonlinear SVMs Datasets that are linearly separable work out great. But what if the dataset is just too hard? We can map it to a higher-dimensional space. [Figures: 1-D points along the $x$ axis; the same points plotted as $x^2$ against $x$.] Andrew Moore
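The lifting idea can be tried on a toy 1-D dataset. This sketch assumes the lift x -> (x, x^2); the points, labels, and the separating line in the lifted space are hypothetical values chosen for illustration.

```python
# Hypothetical 1-D data: outer points positive, inner points negative.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [1, 1, -1, -1, -1, 1, 1]

# Along the sorted 1-D axis a single threshold allows at most one label change.
label_changes = sum(1 for u, v in zip(ys, ys[1:]) if u != v)

def lift(x):
    """Map a 1-D point to the 2-D feature space (x, x^2)."""
    return (x, x * x)

# In the lifted space the horizontal line x^2 = 2.5 separates the classes.
w, b = (0.0, 1.0), -2.5
preds = [1 if w[0] * p[0] + w[1] * p[1] + b >= 0 else -1 for p in map(lift, xs)]
```

Two label changes along the axis mean no single threshold works in 1-D, while a linear classifier in the lifted space labels every point correctly.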

43 Nonlinear SVMs General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable: $\Phi: \mathbf{x} \rightarrow \varphi(\mathbf{x})$ Andrew Moore

44 Nonlinear kernel: Example Consider the mapping $\varphi(x) = (x, x^2)$. Then $\varphi(x) \cdot \varphi(y) = (x, x^2) \cdot (y, y^2) = xy + x^2 y^2$, so $K(x, y) = xy + x^2 y^2$. Svetlana Lazebnik
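A quick numerical check of this example, assuming the mapping is phi(x) = (x, x^2) and the corresponding kernel is K(x, y) = x*y + x^2 * y^2. The test values 2 and 3 are arbitrary.

```python
def phi(x):
    """Explicit lifting of a scalar into the feature space (x, x^2)."""
    return (x, x * x)

def K(x, y):
    """Kernel computed directly in the input space."""
    return x * y + (x * x) * (y * y)

x, y = 2.0, 3.0
explicit = phi(x)[0] * phi(y)[0] + phi(x)[1] * phi(y)[1]  # 6 + 36
via_kernel = K(x, y)
```

Both routes give the same number, which is the point of a kernel: the inner product in the lifted space is available without constructing the lifted vectors.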

45 The Kernel Trick The linear classifier relies on dot product between vectors: $K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i \cdot \mathbf{x}_j$. If every data point is mapped into high-dimensional space via some transformation $\Phi: \mathbf{x}_i \rightarrow \varphi(\mathbf{x}_i)$, the dot product becomes $K(\mathbf{x}_i, \mathbf{x}_j) = \varphi(\mathbf{x}_i) \cdot \varphi(\mathbf{x}_j)$. A kernel function is a similarity function that corresponds to an inner product in some expanded feature space. The kernel trick: instead of explicitly computing the lifting transformation $\varphi(\mathbf{x})$, define a kernel function $K$ such that $K(\mathbf{x}_i, \mathbf{x}_j) = \varphi(\mathbf{x}_i) \cdot \varphi(\mathbf{x}_j)$. Andrew Moore

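46 Examples of kernel functions Linear: $K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^\top \mathbf{x}_j$. Polynomials of degree up to $d$: $K(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i^\top \mathbf{x}_j + 1)^d$. Gaussian RBF: $K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2}\right)$. Histogram intersection: $K(\mathbf{x}_i, \mathbf{x}_j) = \sum_k \min(\mathbf{x}_i(k), \mathbf{x}_j(k))$. Andrew Moore / Carlos Guestrin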
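The four kernels listed on this slide can be written directly as functions of two feature vectors. This is a plain-Python sketch; the function names, default parameters (d = 2, sigma = 1), and sample vectors are assumptions for illustration.

```python
import math

def linear_kernel(u, v):
    return sum(a * b for a, b in zip(u, v))

def polynomial_kernel(u, v, d=2):
    """Polynomial kernel (u . v + 1)^d."""
    return (linear_kernel(u, v) + 1) ** d

def rbf_kernel(u, v, sigma=1.0):
    """Gaussian RBF kernel exp(-||u - v||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2 * sigma ** 2))

def histogram_intersection_kernel(u, v):
    """Sum of bin-wise minima of two histograms."""
    return sum(min(a, b) for a, b in zip(u, v))

u, v = (1.0, 2.0), (3.0, 0.5)
k_lin = linear_kernel(u, v)                   # 3 + 1
k_poly = polynomial_kernel(u, v, d=2)         # (4 + 1)^2
k_rbf_self = rbf_kernel(u, u)                 # identical points
k_hist = histogram_intersection_kernel(u, v)  # 1 + 0.5
```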

47 Allowing misclassifications: Before Find the $\mathbf{w}$ that minimizes $\frac{1}{2} \|\mathbf{w}\|^2$ (maximize margin).

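48 Allowing misclassifications: After Find the $\mathbf{w}$ that minimizes $\frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^{N} \xi_i$. The first term maximizes the margin and the second minimizes misclassification; $C$ is the misclassification cost, $N$ the number of data samples, and $\xi_i$ the slack variable for sample $i$.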
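The slack-variable objective can be evaluated for a fixed candidate line. This sketch assumes the standard soft-margin form, 0.5 * ||w||^2 + C * sum of slacks, and takes each slack as max(0, 1 - y_i (w . x_i + b)), which is the smallest slack satisfying the relaxed constraint; neither detail is spelled out on the slide. The data, w, b, and C are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def soft_margin_objective(w, b, X, y, C):
    """Return (objective value, per-sample slacks) for a candidate (w, b)."""
    # Slack is zero for points outside the margin, positive otherwise.
    slacks = [max(0.0, 1.0 - yi * (dot(w, xi) + b)) for xi, yi in zip(X, y)]
    return 0.5 * dot(w, w) + C * sum(slacks), slacks

# Hypothetical data: the third point sits on the wrong side of the line.
X = [(1.0, 1.0), (-1.0, -1.0), (0.5, 0.0)]
y = [1, -1, -1]
value, slacks = soft_margin_objective((0.5, 0.5), 0.0, X, y, C=10.0)
```

A larger C makes each unit of slack more expensive, pushing the solution toward fewer training errors at the cost of a narrower margin.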

49 What about multi-class SVMs? Unfortunately, there is no definitive multi-class SVM formulation. In practice, we have to obtain a multi-class SVM by combining multiple two-class SVMs. One vs. others: Training: learn an SVM for each class vs. the others. Testing: apply each SVM to the test example, and assign it to the class of the SVM that returns the highest decision value. One vs. one: Training: learn an SVM for each pair of classes. Testing: each learned SVM votes for a class to assign to the test example. Svetlana Lazebnik

50 Multi-class problems One-vs-all (a.k.a. one-vs-others): Train K classifiers. In each, pos = data from class i, neg = data from classes other than i. The class with the most confident prediction wins. Example: You have 4 classes, train 4 classifiers. 1 vs others: score [missing]; 2 vs others: score [missing]; 3 vs others: score [missing]; 4 vs others: score 5.5. Final prediction: class 2
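One-vs-all prediction reduces to picking the class whose classifier reports the highest decision value. In the sketch below only the score 5.5 for class 4 and the final answer (class 2) come from the slide; the other three scores are hypothetical stand-ins, and `one_vs_all_predict` is an illustrative name.

```python
def one_vs_all_predict(scores):
    """Return the class whose one-vs-others classifier scored highest."""
    return max(scores, key=scores.get)

# Class 4's score is from the slide; scores for classes 1-3 are hypothetical.
scores = {1: 3.5, 2: 6.2, 3: 1.4, 4: 5.5}
prediction = one_vs_all_predict(scores)
```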

51 Multi-class problems One-vs-one (a.k.a. all-vs-all): Train K(K-1)/2 binary classifiers (all pairs of classes). They all vote for the label. Example: You have 4 classes, then train 6 classifiers: 1 vs 2, 1 vs 3, 1 vs 4, 2 vs 3, 2 vs 4, 3 vs 4. Votes: 1, 1, 4, 2, 4, 4. Final prediction is class 4
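The one-vs-one example can be reproduced with a simple vote count. The votes are the ones listed on the slide, in the same order as the class pairs; the code itself is an illustrative sketch, and how ties would be broken is not addressed here.

```python
from collections import Counter
from itertools import combinations

classes = [1, 2, 3, 4]
pairs = list(combinations(classes, 2))  # K(K-1)/2 pairwise classifiers

# One vote per pairwise classifier, in the same order as `pairs`.
votes = [1, 1, 4, 2, 4, 4]

# Each classifier can only vote for one of its own two classes.
votes_valid = all(v in p for v, p in zip(votes, pairs))
winner = Counter(votes).most_common(1)[0][0]
```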

52 SVMs for recognition 1. Define your representation for each example. 2. Select a kernel function. 3. Compute pairwise kernel values between labeled examples. 4. Use this kernel matrix to solve for SVM support vectors & weights. 5. To classify a new example: compute kernel values between new input and support vectors, apply weights, check sign of output. Kristen Grauman
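Steps 3 and 5 of this recipe can be sketched without a solver. The code below builds the pairwise kernel matrix with a Gaussian RBF kernel and then classifies new inputs from kernel values against the support vectors. Step 4 (solving for the weights) is skipped: the weights `alphas` and offset `b` are hypothetical placeholders, not the output of an actual optimization, and in practice a package such as LIBSVM would supply them.

```python
import math

def rbf(u, v, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2 * sigma ** 2))

def gram_matrix(X, kernel):
    """Step 3: kernel values between every pair of labeled examples."""
    return [[kernel(a, b) for b in X] for a in X]

def decision_value(x, svs, sv_labels, alphas, b, kernel):
    """Step 5: weighted kernel values between x and the support vectors."""
    return sum(a * y * kernel(sv, x) for a, y, sv in zip(alphas, sv_labels, svs)) + b

# Steps 1-2: hypothetical 2-D features, RBF kernel.
X = [(0.0, 0.0), (0.0, 1.0), (3.0, 3.0), (3.0, 4.0)]
y = [-1, -1, 1, 1]
G = gram_matrix(X, rbf)

# Step 4 placeholder: every example treated as a support vector with weight 1.
alphas, b = [1.0, 1.0, 1.0, 1.0], 0.0

near_negatives = decision_value((0.2, 0.3), X, y, alphas, b, rbf)
near_positives = decision_value((3.1, 3.4), X, y, alphas, b, rbf)
```

The sign of each decision value gives the predicted class: negative for the point near the first cluster, positive for the point near the second.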

53 Example: learning gender with SVMs Moghaddam and Yang, Learning Gender with Support Faces, TPAMI 2002. Moghaddam and Yang, Face & Gesture 2000. Kristen Grauman

54 Learning gender with SVMs Training examples: 1044 males, 713 females. Experiment with various kernels, select Gaussian RBF: $K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2}\right)$ Kristen Grauman

55 Support Faces Moghaddam and Yang, Learning Gender with Support Faces, TPAMI 2002.

56 Moghaddam and Yang, Learning Gender with Support Faces, TPAMI 2002.

57 Gender perception experiment: How well can humans do? Subjects: 30 people (22 male, 8 female), ages mid-20s to mid-40s. Test data: 254 face images (6 males, 4 females), low res and high res versions. Task: classify as male or female, forced choice, no time limit. Moghaddam and Yang, Face & Gesture 2000.

58 Gender perception experiment: How well can humans do? [Figure: human error rates.] Moghaddam and Yang, Face & Gesture 2000.

59 Human vs. Machine SVMs performed better than any single human test subject, at either resolution Kristen Grauman

60 SVMs: Pros and cons Pros: Many publicly available SVM packages: LIBSVM, LIBLINEAR, SVM Light; or use built-in Matlab version (but slower). Kernel-based framework is very powerful, flexible. Often a sparse set of support vectors, so compact at test time. Work very well in practice, even with little training data. Cons: No direct multi-class SVM, must combine two-class SVMs. Computation, memory: during training time, must compute matrix of kernel values for every pair of examples; learning can take a very long time for large-scale problems. Adapted from Lana Lazebnik

61 Linear classifiers vs nearest neighbors Linear pros: + Low-dimensional parametric representation + Very fast at test time Linear cons: - Works for two classes - What if data is not linearly separable? NN pros: + Works for any number of classes + Decision boundaries not necessarily linear + Nonparametric method + Simple to implement NN cons: - Slow at test time (large search problem to find neighbors) - Storage of data - Need good distance function Adapted from L. Lazebnik

62 Training vs Testing What do we want? High accuracy on training data? No, high accuracy on unseen/new/test data! Why is this tricky? Training data: features (x) and labels (y) used to learn mapping f. Test data: features (x) used to make a prediction; labels (y) only used to see how well we've learned f!!! Validation data: held-out set of the training data; can use both features (x) and labels (y) to tune parameters of the model we're learning.
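One way to set up the three roles described here is a single shuffled split. The sketch below is an assumption about how such a split might be done, not a procedure from the slides; the function name, the 60/20/20 proportions, and the fixed seed are illustrative choices.

```python
import random

def split_dataset(examples, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle once, then carve out disjoint test, validation, and training sets."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]                 # labels used only for final evaluation
    val = shuffled[n_test:n_test + n_val]    # used to tune model parameters
    train = shuffled[n_test + n_val:]        # used to learn the mapping f
    return train, val, test

# Hypothetical (feature, label) pairs.
examples = [(i, i % 2) for i in range(10)]
train, val, test = split_dataset(examples)
```

Keeping the test set out of every tuning decision is what makes its error an honest estimate of performance on new data.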

63 Generalization Training set (labels known) Test set (labels unknown) How well does a learned model generalize from the data it was trained on to a new test set? Slide credit: L. Lazebnik

64 Generalization Components of generalization error. Bias: how much the average model over all training sets differs from the true model; error due to inaccurate assumptions/simplifications made by the model. Variance: how much models estimated from different training sets differ from each other. Underfitting: model is too simple to represent all the relevant class characteristics; high bias and low variance; high training error and high test error. Overfitting: model is too complex and fits irrelevant characteristics (noise) in the data; low bias and high variance; low training error and high test error. Slide credit: L. Lazebnik

65 Bias-Variance Trade-off Models with too few parameters are inaccurate because of a large bias (not enough flexibility). Models with too many parameters are inaccurate because of a large variance (too much sensitivity to the sample). Slide credit: D. Hoiem

66 Fitting a model Is this a good fit? Figures from Bishop

67 With more training data Figures from Bishop

68 Regularization No regularization Huge regularization Figures from Bishop

69 Training vs test error [Figure: error vs. complexity, with training error and test error curves; underfitting (high bias, low variance) at low complexity, overfitting (low bias, high variance) at high complexity.] Slide credit: D. Hoiem

70 The effect of training set size [Figure: test error vs. complexity, with one curve for few training examples and one for many training examples; high bias / low variance at low complexity, low bias / high variance at high complexity.] Slide credit: D. Hoiem

71 The effect of training set size [Figure: error vs. number of training examples for a fixed prediction model; labels: training, testing, generalization error.] Adapted from D. Hoiem

72 Choosing the trade-off between bias and variance Need validation set (separate from the test set). [Figure: error vs. complexity, with training error and validation error curves; high bias / low variance at low complexity, low bias / high variance at high complexity.] Slide credit: D. Hoiem
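Choosing the complexity with a validation set amounts to picking the setting with the lowest validation error rather than the lowest training error. The error values below are hypothetical numbers shaped like the curves on this slide, not measurements.

```python
# Hypothetical errors: complexity level -> (training error, validation error).
errors = {
    1: (0.30, 0.32),
    2: (0.20, 0.24),
    4: (0.10, 0.18),
    8: (0.03, 0.25),
    16: (0.00, 0.35),
}

# Training error keeps falling with complexity, so it would pick the largest model.
best_by_training = min(errors, key=lambda c: errors[c][0])
# Validation error is U-shaped; its minimum marks the chosen trade-off.
best_by_validation = min(errors, key=lambda c: errors[c][1])
```

The gap between the two choices is the overfitting the slide warns about: the most complex model fits the training data perfectly but does worst on held-out data.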

73 How to reduce variance? Choose a simpler classifier Get more training data Regularize the parameters Slide credit: D. Hoiem

74 What to remember about classifiers No free lunch: machine learning algorithms are tools Try simple classifiers first Better to have smart features and simple classifiers than simple features and smart classifiers Use increasingly powerful classifiers with more training data (bias-variance tradeoff) Slide credit: D. Hoiem
