Bias-Variance Trade-off + Other Models and Problems

1 CS 1699: Intro to Computer Vision Bias-Variance Trade-off + Other Models and Problems Prof. Adriana Kovashka University of Pittsburgh November 3, 2015

2 Outline: Support Vector Machines (review + other uses); Bias-variance trade-off; Scene recognition: spatial pyramid matching; Other classifiers (decision trees, hidden Markov models); Other problems (clustering: agglomerative clustering, dimensionality reduction)

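3 Lines in R^2. Let $\mathbf{w} = \begin{bmatrix} a \\ c \end{bmatrix}$ and $\mathbf{x} = \begin{bmatrix} x \\ y \end{bmatrix}$. The line $ax + cy + b = 0$ can then be written $\mathbf{w} \cdot \mathbf{x} + b = 0$. The distance from a point $(x_0, y_0)$ to the line is $D = \frac{|a x_0 + c y_0 + b|}{\sqrt{a^2 + c^2}} = \frac{|\mathbf{w}^\top \mathbf{x} + b|}{\|\mathbf{w}\|}$. Kristen Grauman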
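
The point-to-line distance on this slide can be checked numerically. The following is a minimal sketch; the function name `distance_to_line` and the example line and point are illustrative assumptions, not part of the slides.

```python
import numpy as np

def distance_to_line(w, b, point):
    """Distance from `point` to the line w . x + b = 0, i.e. |w . x + b| / ||w||."""
    w = np.asarray(w, dtype=float)
    point = np.asarray(point, dtype=float)
    return abs(w @ point + b) / np.linalg.norm(w)

# Hypothetical example: the line 3x + 4y - 5 = 0, so w = (a, c) = (3, 4), b = -5.
d = distance_to_line([3.0, 4.0], -5.0, [2.0, 1.0])
print(d)  # |3*2 + 4*1 - 5| / sqrt(3^2 + 4^2) = 5 / 5 = 1.0
```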

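4 Support vector machines. Want the line that maximizes the margin. $\mathbf{x}_i$ positive ($y_i = 1$): $\mathbf{x}_i \cdot \mathbf{w} + b \ge 1$. $\mathbf{x}_i$ negative ($y_i = -1$): $\mathbf{x}_i \cdot \mathbf{w} + b \le -1$. For support vectors, $\mathbf{x}_i \cdot \mathbf{w} + b = \pm 1$. Distance between point and line: $\frac{|\mathbf{x}_i \cdot \mathbf{w} + b|}{\|\mathbf{w}\|}$. For support vectors: $\frac{|\mathbf{w}^\top \mathbf{x} + b|}{\|\mathbf{w}\|} = \frac{1}{\|\mathbf{w}\|}$, so the margin is $M = \left| \frac{1}{\|\mathbf{w}\|} - \frac{-1}{\|\mathbf{w}\|} \right| = \frac{2}{\|\mathbf{w}\|}$. C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998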

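5 Finding the maximum margin line. 1. Maximize the margin $2 / \|\mathbf{w}\|$. 2. Correctly classify all training data points: $\mathbf{x}_i$ positive ($y_i = 1$): $\mathbf{x}_i \cdot \mathbf{w} + b \ge 1$; $\mathbf{x}_i$ negative ($y_i = -1$): $\mathbf{x}_i \cdot \mathbf{w} + b \le -1$. Quadratic optimization problem: minimize $\frac{1}{2} \mathbf{w}^\top \mathbf{w}$ subject to $y_i (\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1$. One constraint for each training point. Note the sign trick: multiplying by $y_i$ folds both cases into one inequality. C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998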

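6 Finding the maximum margin line. Solution: $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$, where each $\alpha_i$ is a learned weight and the $\mathbf{x}_i$ with nonzero weight are the support vectors; $b = y_i - \mathbf{w} \cdot \mathbf{x}_i$ (for any support vector). Classification function: $f(\mathbf{x}) = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x} + b) = \mathrm{sign}\left(\sum_i \alpha_i y_i \, \mathbf{x}_i \cdot \mathbf{x} + b\right)$. Notice that it relies on an inner product between the test point $\mathbf{x}$ and the support vectors $\mathbf{x}_i$. (Solving the optimization problem also involves computing the inner products $\mathbf{x}_i \cdot \mathbf{x}_j$ between all pairs of training points.) If $f(\mathbf{x}) < 0$, classify as negative, otherwise classify as positive. C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998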
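
To make the solution concrete, here is a small sketch that evaluates the classification function from a set of support vectors. The two-point toy problem and its weights (`alphas`) are hypothetical values chosen by hand so that the support-vector conditions hold; no optimizer is run.

```python
import numpy as np

# Hypothetical toy problem: one support vector per class.
support_vectors = np.array([[1.0, 1.0], [-1.0, -1.0]])
labels = np.array([1.0, -1.0])
alphas = np.array([0.25, 0.25])  # assumed learned weights for this toy problem

# w = sum_i alpha_i * y_i * x_i
w = (alphas * labels) @ support_vectors
# b = y_i - w . x_i, using the first support vector
b = labels[0] - w @ support_vectors[0]

def classify(x):
    """Uses x only through inner products with the support vectors."""
    score = np.sum(alphas * labels * (support_vectors @ x)) + b
    return -1 if score < 0 else 1

margin = 2.0 / np.linalg.norm(w)
print(w, b, margin)
print(classify(np.array([2.0, 0.0])), classify(np.array([-3.0, 1.0])))
```

With these values both support vectors satisfy y_i (w . x_i + b) = 1, and the margin equals the distance between the two points.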

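7 Nonlinear SVMs. Datasets that are linearly separable work out great. [Figure: separable 1-D data along the x axis.] But what if the dataset is just too hard? [Figure: non-separable 1-D data along the x axis.] We can map it to a higher-dimensional space. [Figure: the same data plotted against x and x^2.] Andrew Moore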

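8 Examples of kernel functions. Linear: $K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^\top \mathbf{x}_j$. Gaussian RBF: $K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2}\right)$. Histogram intersection: $K(\mathbf{x}_i, \mathbf{x}_j) = \sum_k \min(\mathbf{x}_i(k), \mathbf{x}_j(k))$. Andrew Moore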
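
The three kernels listed on this slide are short enough to write out directly. This is a sketch with NumPy; the function names and the example histograms `h1` and `h2` are illustrative assumptions.

```python
import numpy as np

def linear_kernel(xi, xj):
    """K(xi, xj) = xi^T xj."""
    return float(np.dot(xi, xj))

def gaussian_rbf_kernel(xi, xj, sigma=1.0):
    """K(xi, xj) = exp(-||xi - xj||^2 / (2 sigma^2))."""
    diff = np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))

def histogram_intersection_kernel(xi, xj):
    """K(xi, xj) = sum_k min(xi(k), xj(k))."""
    return float(np.sum(np.minimum(xi, xj)))

# Hypothetical normalized histograms.
h1 = np.array([0.5, 0.3, 0.2])
h2 = np.array([0.2, 0.3, 0.5])
print(linear_kernel(h1, h2))
print(gaussian_rbf_kernel(h1, h2, sigma=0.5))
print(histogram_intersection_kernel(h1, h2))
```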

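9 Allowing misclassifications. Find the $\mathbf{w}$ that minimizes $\frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^{N} \xi_i$, where $C$ is the misclassification cost, $N$ is the number of data samples, and $\xi_i$ is the slack variable for sample $i$. The first term maximizes the margin; the second term minimizes misclassification.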

10 What about multi-class SVMs? Unfortunately, there is no definitive multi-class SVM formulation. In practice, we have to obtain a multi-class SVM by combining multiple two-class SVMs. One vs. others: Training: learn an SVM for each class vs. the others. Testing: apply each SVM to the test example, and assign it to the class of the SVM that returns the highest decision value. One vs. one: Training: learn an SVM for each pair of classes. Testing: each learned SVM votes for a class to assign to the test example. Svetlana Lazebnik
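
The two test-time rules can be sketched as follows. The class names, the linear models in `ovr_models`, and the way pairwise classifiers are derived from them are all hypothetical stand-ins for trained SVMs; only the combination logic (highest decision value, or majority vote) follows the slide.

```python
import numpy as np
from itertools import combinations

# Hypothetical, already-trained one-vs-others linear SVMs: one (w, b) per class.
ovr_models = {
    "beach": (np.array([1.0, 0.0]), 0.0),
    "forest": (np.array([0.0, 1.0]), 0.0),
    "street": (np.array([-1.0, -1.0]), 0.0),
}

def predict_one_vs_others(x):
    """Assign x to the class whose SVM returns the highest decision value."""
    scores = {c: w @ x + b for c, (w, b) in ovr_models.items()}
    return max(scores, key=scores.get)

def predict_one_vs_one(x):
    """Each pairwise SVM votes for one of its two classes; most votes wins."""
    votes = {c: 0 for c in ovr_models}
    for a, c in combinations(ovr_models, 2):
        (wa, ba), (wc, bc) = ovr_models[a], ovr_models[c]
        # Stand-in for a trained "a vs. c" SVM (an assumption for this sketch).
        score = (wa - wc) @ x + (ba - bc)
        votes[a if score >= 0 else c] += 1
    return max(votes, key=votes.get)

x = np.array([2.0, 0.5])
print(predict_one_vs_others(x), predict_one_vs_one(x))
```

One vs. others needs one SVM per class, while one vs. one needs one per pair of classes, so the second scheme grows quadratically with the number of classes.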

11 Evaluating Classifiers. Accuracy = # correctly classified / # all test examples. Precision/recall: Precision = # predicted true pos / # predicted pos; Recall = # predicted true pos / # true pos. F-measure = 2PR / (P + R). Want the evaluation metric to be in some range, e.g. [0, 1], where 0 = worst possible classifier and 1 = best possible classifier.

12 Precision / Recall / F-measure. [Figure legend: true positives (images that contain people), true negatives (images that do not contain people), predicted positives (images predicted to contain people), predicted negatives (images predicted not to contain people).] Precision = 2 / 5 = 0.4. Recall = 2 / 4 = 0.5. F-measure = 2 * 0.4 * 0.5 / (0.4 + 0.5) = 0.44. Accuracy = 5 / 10 = 0.5.
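
The numbers on this slide can be reproduced with a few lines of code. The label lists below are a hypothetical arrangement of 10 images chosen to match the slide's counts (4 images contain people, 5 are predicted to, 2 of those predictions are correct); the function name is also an assumption.

```python
def precision_recall_f(true_labels, predicted_labels):
    """Return (precision, recall, F-measure, accuracy) for binary 0/1 labels."""
    pairs = list(zip(true_labels, predicted_labels))
    correct_pos = sum(1 for t, p in pairs if t and p)   # predicted pos that are right
    predicted_pos = sum(1 for _, p in pairs if p)
    actual_pos = sum(1 for t, _ in pairs if t)
    correct = sum(1 for t, p in pairs if t == p)
    precision = correct_pos / predicted_pos
    recall = correct_pos / actual_pos
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = correct / len(pairs)
    return precision, recall, f_measure, accuracy

# Hypothetical labels matching the slide's counts (1 = contains people).
true_labels      = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted_labels = [1, 1, 0, 0, 1, 1, 1, 0, 0, 0]
precision, recall, f_measure, accuracy = precision_recall_f(true_labels, predicted_labels)
print(precision, recall, round(f_measure, 2), accuracy)
```

This sketch does not guard against division by zero, which occurs when nothing is predicted positive or no positives exist.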

13 Support Vector Regression

14 Regression. Regression is like classification except that the labels are real-valued. Example applications: stock value prediction, income prediction, CPU power consumption. Subhransu Maji

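15 Regularized Error Function for Regression. In linear regression, we minimize the error function: $\frac{1}{2} \sum_{n=1}^{N} \{y_n - t_n\}^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2$. Use the ε-insensitive error function instead: $C \sum_{n=1}^{N} E_\epsilon(y(\mathbf{x}_n) - t_n) + \frac{1}{2} \|\mathbf{w}\|^2$. An example of an ε-insensitive error function: $E_\epsilon(f(\mathbf{x}) - y) = 0$ for $|f(\mathbf{x}) - y| < \epsilon$, and $|f(\mathbf{x}) - y| - \epsilon$ otherwise, where $f(\mathbf{x})$ is the predicted value and $y$ is the true value. Adapted from Huan Liu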
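
The ε-insensitive error is zero inside a tube of half-width ε around the true value and grows linearly outside it. A minimal sketch, with the function name and example values as assumptions:

```python
import numpy as np

def eps_insensitive_error(predicted, true, eps=0.1):
    """0 if |predicted - true| < eps, otherwise |predicted - true| - eps."""
    residual = np.abs(np.asarray(predicted, dtype=float) - np.asarray(true, dtype=float))
    return np.maximum(0.0, residual - eps)

# Hypothetical predictions against a constant true value of 1.0.
errors = eps_insensitive_error([1.0, 1.05, 1.5], [1.0, 1.0, 1.0], eps=0.1)
print(errors)  # the first two predictions fall inside the tube
```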

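16 Slack Variables for Regression. For a target point to lie inside the tube: $y_n - \epsilon \le t_n \le y_n + \epsilon$. Introduce slack variables to allow points to lie outside the tube. Minimize $C \sum_{n=1}^{N} (\xi_n + \hat{\xi}_n) + \frac{1}{2} \|\mathbf{w}\|^2$ subject to $t_n \le y(\mathbf{x}_n) + \epsilon + \xi_n$, $t_n \ge y(\mathbf{x}_n) - \epsilon - \hat{\xi}_n$, $\xi_n \ge 0$, $\hat{\xi}_n \ge 0$. Adapted from Huan Liu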

17 Next time: Support Vector Ranking

18 Outline: Support Vector Machines (review + other uses); Bias-variance trade-off; Scene recognition: spatial pyramid matching; Other classifiers (decision trees, hidden Markov models); Other problems (clustering: agglomerative clustering, dimensionality reduction)

19 Generalization. [Figure: training set (labels known) and test set (labels unknown).] How well does a learned model generalize from the data it was trained on to a new test set? Slide credit: L. Lazebnik

20 Generalization. Components of generalization error: Bias: how much the average model over all training sets differs from the true model; this is error due to inaccurate assumptions or simplifications made by the model. Variance: how much models estimated from different training sets differ from each other. Underfitting: the model is too simple to represent all the relevant class characteristics (high bias and low variance; high training error and high test error). Overfitting: the model is too complex and fits irrelevant characteristics (noise) in the data (low bias and high variance; low training error and high test error). Slide credit: L. Lazebnik

21 Bias-Variance Trade-off Models with too few parameters are inaccurate because of a large bias (not enough flexibility). Models with too many parameters are inaccurate because of a large variance (too much sensitivity to the sample). Slide credit: D. Hoiem
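
The trade-off can be observed empirically by fitting the same model class to many noisy training sets and measuring how far the average fit is from the truth (bias) and how much the fits scatter (variance). The sketch below uses polynomial regression on a sine curve; the target function, noise level, polynomial degrees, and sample sizes are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 15)
x_test = np.linspace(0.0, 1.0, 50)

def true_fn(x):
    # Assumed "true model" for this simulation.
    return np.sin(2 * np.pi * x)

def bias_variance(degree, n_trials=200, noise=0.3):
    """Estimate squared bias and variance of degree-`degree` polynomial fits."""
    preds = np.empty((n_trials, x_test.size))
    for t in range(n_trials):
        # A fresh noisy training set each trial.
        y = true_fn(x_train) + rng.normal(0.0, noise, x_train.size)
        coeffs = np.polyfit(x_train, y, degree)
        preds[t] = np.polyval(coeffs, x_test)
    mean_pred = preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - true_fn(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

bias_simple, var_simple = bias_variance(degree=1)    # few parameters
bias_complex, var_complex = bias_variance(degree=7)  # many parameters
print("degree 1: bias^2=%.4f variance=%.4f" % (bias_simple, var_simple))
print("degree 7: bias^2=%.4f variance=%.4f" % (bias_complex, var_complex))
```

Under these assumptions the straight line shows the larger bias and the degree-7 polynomial shows the larger variance.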

22 Fitting a model Is this a good fit? Figures from Bishop

23 With more training data Figures from Bishop

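24 Bias-variance tradeoff. [Figure: error vs. model complexity, with training error and test error curves. The low-complexity end (underfitting) has high bias and low variance; the high-complexity end (overfitting) has low bias and high variance.] Slide credit: D. Hoiem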

25 Bias-variance tradeoff. [Figure: test error vs. model complexity, with one curve for few training examples and one for many training examples. Low complexity: high bias, low variance; high complexity: low bias, high variance.] Slide credit: D. Hoiem

26 Choosing the trade-off. Need a validation set. The validation set is separate from the test set. [Figure: error vs. model complexity, with training error and test error curves. Low complexity: high bias, low variance; high complexity: low bias, high variance.] Slide credit: D. Hoiem

27 Effect of Training Size. [Figure: error vs. number of training examples for a fixed prediction model, with training and testing error curves and the generalization error marked.] Adapted from D. Hoiem

28 How to reduce variance? Choose a simpler classifier. Use fewer features. Get more training data. Regularize the parameters. Slide credit: D. Hoiem
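
As one illustration of regularizing the parameters, the sketch below uses ridge regression (an L2 penalty on the weights), which is one common choice and not necessarily the one the slides have in mind. The data, the penalty values, and the name `ridge_fit` are assumptions.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: w = (X^T X + lam * I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Hypothetical data: only the first of 10 features actually matters.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 10))
y = X[:, 0] + 0.1 * rng.normal(size=20)

lams = (0.0, 1.0, 10.0, 100.0)
norms = [np.linalg.norm(ridge_fit(X, y, lam)) for lam in lams]
for lam, norm in zip(lams, norms):
    print("lambda=%6.1f  ||w||=%.4f" % (lam, norm))
```

A larger penalty shrinks the weight vector, which makes the fitted model less sensitive to the particular training sample at the cost of some added bias.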

29 Regularization Figures from Bishop

30 Characteristics of vision learning problems. Lots of continuous features: a spatial pyramid may have ~15,000 features. Imbalanced classes: often limited positive examples, practically infinite negative examples. Difficult prediction tasks. Recently, massive training sets became available. If we have a massive training set, we want classifiers with low bias (high variance is OK) and reasonably efficient training. Adapted from D. Hoiem

31 Remember. No free lunch: machine learning algorithms are tools. Three kinds of error: Inherent: unavoidable. Bias: due to over-simplifications. Variance: due to inability to perfectly estimate parameters from limited data. Try simple classifiers first. Better to have smart features and simple classifiers than simple features and smart classifiers. Use increasingly powerful classifiers with more training data (bias-variance tradeoff). Adapted from D. Hoiem

32 Outline: Support Vector Machines (review + other uses); Bias-variance trade-off; Scene recognition: spatial pyramid matching; Other classifiers (decision trees, hidden Markov models); Other problems (clustering: agglomerative clustering, dimensionality reduction)

33 Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories CVPR 2006 Svetlana Lazebnik Beckman Institute, University of Illinois at Urbana-Champaign Cordelia Schmid INRIA Rhône-Alpes, France Jean Ponce Ecole Normale Supérieure, France

34 Bags of words Slide credit: L. Lazebnik

35 Bag-of-features steps 1. Extract local features 2. Learn visual vocabulary using clustering 3. Quantize local features using visual vocabulary 4. Represent images by frequencies of visual words Slide credit: L. Lazebnik
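
Steps 3 and 4 can be sketched in a few lines once a vocabulary exists. Here the vocabulary is a hypothetical set of three 2-D cluster centers, standing in for the output of step 2, and the descriptors are made-up 2-D points rather than real local features.

```python
import numpy as np

def quantize(descriptors, vocabulary):
    """Step 3: map each descriptor to the index of its nearest visual word."""
    # Squared Euclidean distances, shape (n_descriptors, n_words).
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def bag_of_words(descriptors, vocabulary):
    """Step 4: normalized histogram of visual word frequencies."""
    words = quantize(descriptors, vocabulary)
    counts = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return counts / counts.sum()

# Hypothetical vocabulary (cluster centers) and local descriptors.
vocabulary = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
descriptors = np.array([[0.1, -0.1], [0.9, 1.2], [1.1, 0.8], [4.0, 6.0]])
hist = bag_of_words(descriptors, vocabulary)
print(hist)
```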

36 Local feature extraction Slide credit: Josef Sivic

37 Learning the visual vocabulary Slide credit: Josef Sivic

38 Learning the visual vocabulary Clustering Slide credit: Josef Sivic

39 Learning the visual vocabulary Visual vocabulary Clustering Slide credit: Josef Sivic

40 Image categorization with bag of words Training 1. Extract bag-of-words representation 2. Train classifier on labeled examples using histogram values as features Testing 1. Extract keypoints/descriptors 2. Quantize into visual words using the clusters computed at training time 3. Compute visual word histogram 4. Compute label using classifier Slide credit: D. Hoiem

41 What about spatial layout? All of these images have the same color histogram Slide credit: D. Hoiem

42 Spatial pyramid Compute histogram in each spatial bin Slide credit: D. Hoiem
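
A sketch of computing a histogram in each spatial bin follows. It assumes feature positions are normalized to [0, 1) and that each feature has already been quantized to a visual word index; the grid sizes (1x1, 2x2, 4x4, ...) and the function name are assumptions made for this example.

```python
import numpy as np

def spatial_pyramid(positions, words, vocab_size, levels=2):
    """Concatenate per-cell visual word histograms for levels 0..levels.

    positions: (n, 2) array of (x, y) in [0, 1); words: (n,) integer word ids.
    Level l splits the image into 2^l x 2^l cells.
    """
    features = []
    for level in range(levels + 1):
        cells = 2 ** level
        cx = np.minimum((positions[:, 0] * cells).astype(int), cells - 1)
        cy = np.minimum((positions[:, 1] * cells).astype(int), cells - 1)
        cell_index = cy * cells + cx
        hist = np.zeros((cells * cells, vocab_size))
        np.add.at(hist, (cell_index, words), 1.0)  # count word occurrences per cell
        features.append(hist.ravel())
    return np.concatenate(features)

# Hypothetical features: three keypoints, vocabulary of two words.
positions = np.array([[0.1, 0.1], [0.9, 0.9], [0.6, 0.2]])
words = np.array([0, 1, 0])
feature = spatial_pyramid(positions, words, vocab_size=2, levels=1)
print(feature)
```

With a vocabulary of K words and levels 0 through 2, the descriptor has K * (1 + 4 + 16) = 21K entries, which is how feature counts in the thousands arise.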

43 Spatial pyramid [Lazebnik et al. CVPR 2006] Slide credit: D. Hoiem

44 Pyramid matching. Indyk & Thaper (2003), Grauman & Darrell (2005). Matching using pyramid and histogram intersection for some particular visual word. [Figure: original images x_i and x_j with their feature histograms at Level 0 through Level 3; the total weight is the value of the pyramid match kernel K(x_i, x_j).] Adapted from L. Lazebnik
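
The slide's formula did not survive extraction. As a sketch, the code below uses the level weighting commonly given for spatial pyramid matching, where the histogram intersection at the coarsest level 0 is weighted by 1/2^L and the intersection at level l >= 1 by 1/2^(L-l+1). Treat that weighting, the function name, and the example histograms as assumptions rather than a transcription of the slide.

```python
import numpy as np

def pyramid_match(hists_i, hists_j):
    """Pyramid match value for one visual word.

    hists_*[l] is the level-l histogram over spatial cells (level 0 = whole image).
    """
    L = len(hists_i) - 1
    # Histogram intersection at each level.
    inter = [np.minimum(hi, hj).sum() for hi, hj in zip(hists_i, hists_j)]
    k = inter[0] / 2 ** L
    for l in range(1, L + 1):
        k += inter[l] / 2 ** (L - l + 1)
    return k

# Hypothetical two-level example: each image has 3 occurrences of the word.
hists_i = [np.array([3.0]), np.array([1.0, 2.0, 0.0, 0.0])]
hists_j = [np.array([3.0]), np.array([1.0, 0.0, 2.0, 0.0])]
print(pyramid_match(hists_i, hists_j))  # 3/2 + 1/2 = 2.0
print(pyramid_match(hists_i, hists_i))  # 3/2 + 3/2 = 3.0
```

Matches found in finer cells count for more, so two images with the same word counts but different spatial layouts score lower than an image matched against itself.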

45 Scene category dataset Fei-Fei & Perona (2005), Oliva & Torralba (2001) Multi-class classification results (100 training images per class) Fei-Fei & Perona: 65.2% Slide credit: L. Lazebnik

46 Scene category confusions Difficult indoor images kitchen living room bedroom Slide credit: L. Lazebnik

47 Caltech101 dataset Fei-Fei et al. (2004) Multi-class classification results (30 training images per class) Slide credit: L. Lazebnik

48 Outline: Support Vector Machines (review + other uses); Bias-variance trade-off; Scene recognition: spatial pyramid matching; Other classifiers (decision trees, hidden Markov models); Other problems (clustering: agglomerative clustering, dimensionality reduction)

49 Decision tree classifier Example problem: decide whether to wait for a table at a restaurant, based on the following attributes: 1. Alternate: is there an alternative restaurant nearby? 2. Bar: is there a comfortable bar area to wait in? 3. Fri/Sat: is today Friday or Saturday? 4. Hungry: are we hungry? 5. Patrons: number of people in the restaurant (None, Some, Full) 6. Price: price range ($, $$, $$$) 7. Raining: is it raining outside? 8. Reservation: have we made a reservation? 9. Type: kind of restaurant (French, Italian, Thai, Burger) 10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60) Slide credit: L. Lazebnik

50 Decision tree classifier Slide credit: L. Lazebnik

51 Decision tree classifier Slide credit: L. Lazebnik

52 Sequence Labeling Problem. Unlike most computer vision problems, many NLP problems can be viewed as sequence labeling. Each token in a sequence is assigned a label. Labels of tokens are dependent on the labels of other tokens in the sequence, particularly their neighbors (not i.i.d.). [Example token sequence: foo bar blam zonk zonk bar blam] Adapted from Ray Mooney

53 Markov Model / Markov Chain. A finite state machine with probabilistic state transitions. Makes the Markov assumption that the next state depends only on the current state and is independent of previous history. Ray Mooney

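54 Sample Markov Model for POS. [Figure: state-transition diagram with states start, Det, Noun, PropNoun, Verb, and stop; the only edge label that survives extraction is 0.95.] Ray Mooney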

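55 Sample Markov Model for POS. Example sentence: "Mary ate the cake." P(PropNoun Verb Det Noun) = 0.4 * 0.8 * 0.25 * 0.95 * 0.1 = 0.0076. [Figure: state-transition diagram with states start, Det, Noun, PropNoun, Verb, and stop; surviving edge labels include 0.95, 0.9, and 0.5.] Adapted from Ray Mooney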
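
Under the Markov assumption, the probability of a tag sequence is the product of the transition probabilities along its path from start to stop. The sketch below reuses the five factors from the slide's product; which factor belongs to which transition is inferred from the order of the tag sequence, and the rest of the transition table is not recoverable from the slide, so unlisted transitions are treated as impossible here.

```python
import math

# Factors taken from the slide's product; their assignment to transitions is an
# assumption based on the order of the tags in "PropNoun Verb Det Noun".
transitions = {
    ("start", "PropNoun"): 0.4,
    ("PropNoun", "Verb"): 0.8,
    ("Verb", "Det"): 0.25,
    ("Det", "Noun"): 0.95,
    ("Noun", "stop"): 0.1,
}

def sequence_probability(tags):
    """Product of transition probabilities along start -> tags -> stop."""
    path = ["start", *tags, "stop"]
    # Transitions missing from the table get probability 0 in this sketch.
    return math.prod(transitions.get((a, b), 0.0) for a, b in zip(path, path[1:]))

p = sequence_probability(["PropNoun", "Verb", "Det", "Noun"])
print(round(p, 4))  # 0.0076
```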

56 Hidden Markov Model Probabilistic generative model for sequences. Assume an underlying set of hidden (unobserved) states in which the model can be (e.g. parts of speech). Assume probabilistic transitions between states over time (e.g. transition from POS to another POS as sequence is generated). Assume a probabilistic generation of tokens from states (e.g. words generated for each POS). Ray Mooney

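57 Sample HMM for POS. [Figure: the POS state diagram with the words each state can generate. Det: the, a, that (with "the" and "a" shown several times); PropNoun: Tom, John, Mary, Alice, Jerry; Noun: cat, dog, car, pen, bed, apple; Verb: bit, ate, played, saw, hit, gave. A surviving edge label is 0.5, near stop.] Ray Mooney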

58 Outline: Support Vector Machines (review + other uses); Bias-variance trade-off; Scene recognition: spatial pyramid matching; Other classifiers (decision trees, hidden Markov models); Other problems (clustering: agglomerative clustering, dimensionality reduction)

59 Slide credit: D. Hoiem

60 Clustering Strategies. K-means: iteratively re-assign points to the nearest cluster center. Mean-shift clustering: estimate modes. Graph cuts: split the nodes in a graph based on assigned links with similarity weights. Agglomerative clustering: start with each point as its own cluster and iteratively merge the closest clusters.
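
As an example of the first strategy, here is a minimal k-means sketch: assign each point to its nearest center, then move each center to the mean of its points, and repeat. The initial centers are passed in explicitly (a simplifying assumption; practical implementations usually pick them randomly), and the data are made up.

```python
import numpy as np

def kmeans(points, centers, n_iters=10):
    """Plain k-means from given initial centers; returns (centers, assignments)."""
    centers = centers.astype(float).copy()
    for _ in range(n_iters):
        # Re-assign every point to the nearest cluster center.
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        # Move each center to the mean of the points assigned to it.
        for k in range(len(centers)):
            if np.any(assign == k):
                centers[k] = points[assign == k].mean(axis=0)
    return centers, assign

# Hypothetical data: two well-separated pairs of points.
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centers, assign = kmeans(points, np.array([[1.0, 1.0], [9.0, 1.0]]))
print(centers, assign)
```

The result depends on the initial centers, so it is common to run k-means several times and keep the best clustering.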

61 Agglomerative clustering

62 Agglomerative clustering

63 Agglomerative clustering

64 Agglomerative clustering

65 Agglomerative clustering

66 Agglomerative clustering. How to define cluster similarity? Average distance between points, maximum distance, or minimum distance. How many clusters? Clustering creates a dendrogram (a tree); threshold based on the max number of clusters or on the distance between merges. [Figure: dendrogram with a distance axis.] Adapted from J. Hays
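
The merge loop and the three cluster-distance choices can be sketched directly. This is a naive implementation meant only to show the idea; it stops at a maximum number of clusters, and the function name, the `linkage` option names, and the example points are assumptions.

```python
import numpy as np

def agglomerative(points, n_clusters, linkage="average"):
    """Start with one cluster per point and repeatedly merge the closest pair."""
    clusters = [[i] for i in range(len(points))]
    dists = np.linalg.norm(points[:, None] - points[None, :], axis=2)
    # Cluster distance: average, maximum, or minimum pairwise point distance.
    reduce_fn = {"average": np.mean, "maximum": np.max, "minimum": np.min}[linkage]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = reduce_fn(dists[np.ix_(clusters[a], clusters[b])])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
    return clusters

# Hypothetical 2-D points forming two tight pairs and one isolated point.
points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [10.0, 0.0]])
result = agglomerative(points, n_clusters=3)
print(result)  # [[0, 1], [2, 3], [4]]
```

Recording the distance at each merge would give the dendrogram, and stopping when that distance exceeds a threshold is the other stopping rule mentioned on the slide.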

67 Why do we cluster? Summarizing data: look at large amounts of data; represent a large continuous vector with the cluster number. Counting: histograms of texture, color, SIFT vectors. Segmentation: separate the image into different regions. Prediction: images in the same cluster may have the same labels. Slide credit: J. Hays, D. Hoiem

68 Slide credit: D. Hoiem

69 Dimensionality Reduction. Figure from Genevieve Patterson, IJCV 2014
