CS6716 Pattern Recognition. Ensembles and Boosting (1)

1 CS6716 Pattern Recognition Ensembles and Boosting (1) Aaron Bobick School of Interactive Computing

2 Administrivia Chapter 10 of the Hastie book. Slides brought to you by Aarti Singh, Peter Orbanz, and friends. Slides posted, sorry that took so long. Final project discussion.

3 ENSEMBLES A randomly chosen hyperplane classifier has an expected error of 0.5 (i.e. 50%).

4 ENSEMBLES A randomly chosen hyperplane classifier has an expected error of 0.5 (i.e. 50%).

5 Ensembles Many random hyperplanes combined by majority vote: still 0.5. A single classifier slightly better than random: error 0.5 - ε. What if we use m such classifiers and take a majority vote?

6 Voting Decision by majority vote: m individuals (or classifiers) take a vote; m is an odd number. They decide between two choices; one is correct, one is wrong. After everyone has voted, a decision is made by simple majority. Note: for two-class classifiers f_1, ..., f_m (with output ±1), majority vote = sgn( Σ_{j=1}^{m} f_j ).

7 Voting likelihoods We make some simplifying assumptions: each individual makes the right choice with probability p ∈ (0, 1), and the votes are independent, i.e. stochastically independent when regarded as random outcomes. Given m voters, the probability that the majority makes the right choice is Pr(majority correct) = Σ_{j=(m+1)/2}^{m} [ m! / (j! (m - j)!) ] p^j (1 - p)^{m-j}. This formula is known as Condorcet's jury theorem.

8 Power of weak classifiers Pr(majority correct) = Σ_{j=(m+1)/2}^{m} [ m! / (j! (m - j)!) ] p^j (1 - p)^{m-j}
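
To make the jury theorem concrete, here is a minimal Python sketch (the function name is mine) that evaluates Pr(majority correct) for odd m; running it with p only slightly above 0.5 shows the majority becoming nearly perfect as m grows.

    from math import comb

    def p_majority_correct(m, p):
        """Probability that a majority of m independent voters,
        each correct with probability p, makes the right choice."""
        # sum_{j=(m+1)/2}^{m} C(m, j) p^j (1-p)^(m-j)
        return sum(comb(m, j) * p**j * (1 - p)**(m - j)
                   for j in range((m + 1) // 2, m + 1))

    for m in (1, 11, 101, 1001):
        print(m, round(p_majority_correct(m, 0.55), 4))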

9 ENSEMBLE METHODS An ensemble method makes a prediction by combining the predictions of many classifiers into a single vote. The individual classifiers are usually required to perform only slightly better than random. For two classes, this means slightly more than 50% of the data are classified correctly. Such a classifier is called a weak learner.

10 ENSEMBLE METHODS From before: if the weak learners are random and independent, the prediction accuracy of the majority vote will increase with the number of weak learners. But, since the weak learners are all typically trained on the same training data, producing random, independent weak learners is difficult. (See later for random forests.) Different ensemble methods (e.g. boosting, bagging, etc.) use different strategies to train and combine weak learners that behave relatively independently.

11 Making ensembles work Boosting (today): after training each weak learner, the data is modified using weights. Deterministic algorithm. Bagging (bootstrap aggregation, from earlier): each weak learner is trained on a random subset of the data. Random forests (later): bagging with tree classifiers as weak learners; uses an additional step to remove dimensions that carry little information.

12 Why boost weak learners? Goal: automatically categorize the type of call requested (collect, calling card, person-to-person, etc.). Easy to find rules of thumb that are often correct, e.g. if "card" occurs in the utterance, then predict "calling card". Hard to find a single highly accurate prediction rule.

13 Fighting the bias-variance tradeoff Simple (a.k.a. weak) learners, e.g. naïve Bayes, logistic regression, decision stumps (or shallow decision trees). Are good - low variance, don't usually overfit. Are bad - high bias, can't solve hard learning problems. Can we make weak learners always good??? No!!! But often yes.

14 Voting (Ensemble Methods) Instead of learning a single classifier, learn many weak classifiers that are good at different parts of the input space. Output class: (weighted) vote of each classifier. Classifiers that are most sure will vote with more conviction, and classifiers will be most sure about a particular part of the space. On average, do better than a single classifier! With H: X → {-1, +1}, e.g. H(x) = sign( h_1(x) + h_2(x) ), or more generally H(x) = sign( Σ_t α_t h_t(x) ) with weights α_t.

15 Voting (Ensemble Methods) Instead of learning a single classifier, learn many weak classifiers that are good at different parts of the input space. Output class: (weighted) vote of each classifier. Classifiers that are most sure will vote with more conviction, and classifiers will be most sure about a particular part of the space. On average, do better than a single classifier! But how do you force classifiers h_t to learn about different parts of the input space? How do you weight the votes of the different classifiers (the α_t)?

16 Boosting (Schapire, 1989) Idea: given a weak learner, run it multiple times on (reweighted) training data, then let the learned classifiers vote. On each iteration t: weight each training example by how incorrectly it was classified, learn a weak hypothesis h_t, and a strength for this hypothesis α_t. Final classifier: H(x) = sign( Σ_t α_t h_t(x) ). Practically useful AND theoretically interesting.

17 Learning from weighted data Consider a weighted dataset: D(i) is the weight of the i-th training example (x_i, y_i). Interpretations: the i-th training example counts as D(i) examples; if you were to resample the data, you would get more samples of heavier data points. Now, in all calculations, whenever used, the i-th training example counts as D(i) examples, e.g. in MLE redefine Count(Y=y) to be a weighted count. Unweighted data: Count(Y=y) = Σ_{i=1}^{m} 1(Y_i = y). With weights D(i): Count(Y=y) = Σ_{i=1}^{m} D(i) 1(Y_i = y).
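
A tiny Python illustration of the weighted count (names are mine, not from the slides): example i simply contributes D(i) instead of 1.

    import numpy as np

    def weighted_count(y, D, label):
        """Count(Y = label) where the i-th example counts as D(i) examples."""
        return float(np.sum(D * (y == label)))

    y = np.array([+1, +1, -1, +1, -1])
    D = np.array([0.1, 0.1, 0.4, 0.1, 0.3])   # weights from a boosting round
    print(weighted_count(y, D, +1), weighted_count(y, D, -1))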

18 Boosting weak learners

19 AdaBoost.M1 (1)

20 AdaBoost.M1 (2)
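
The two AdaBoost.M1 slides above carried the algorithm box itself; as a stand-in, here is a minimal Python/scikit-learn sketch of the standard AdaBoost.M1 recipe from Hastie Ch. 10 (depth-1 trees as weak learners, labels in {-1, +1}; variable names are mine, not from the slides).

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost_m1(X, y, M=50):
        """AdaBoost.M1 (Hastie Ch. 10, Algorithm 10.1) with decision stumps.
        Labels y must be in {-1, +1}."""
        N = len(y)
        w = np.full(N, 1.0 / N)                    # uniform example weights to start
        stumps, alphas = [], []
        for _ in range(M):
            stump = DecisionTreeClassifier(max_depth=1)
            stump.fit(X, y, sample_weight=w)
            miss = stump.predict(X) != y
            err = np.dot(w, miss) / w.sum()        # weighted training error
            if err <= 0 or err >= 0.5:             # perfect, or no better than chance
                break
            alpha = np.log((1.0 - err) / err)
            w = w * np.exp(alpha * miss)           # up-weight the misclassified examples
            w = w / w.sum()
            stumps.append(stump)
            alphas.append(alpha)
        return stumps, alphas

    def adaboost_predict(stumps, alphas, X):
        score = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
        return np.sign(score)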

21 Example sequence

22 Boosting Iteratively reweight training samples: higher weights go to previously misclassified samples. [Figure: boosting rounds, from the ICCV09 tutorial by Tae-Kyun Kim, University of Cambridge]

23 Not mysterious????

24 Minimizing a loss function

25 Forward stage additive model

26 Exponential loss (1)

27 Exponential loss (2) At each node/level:

28

29 Analyzing training error Training error of the final classifier is bounded by the exponential loss: (1/N) Σ_i 1(y_i ≠ H(x_i)) ≤ (1/N) Σ_i exp(-y_i f(x_i)), where H(x) = sign(f(x)). The exp loss is a convex upper bound on the 0/1 loss. If boosting can make this upper bound 0, then the training error goes to 0 as well.
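
The bound comes from a pointwise observation (a worked step, not on the slide): whenever H(x_i) = sign(f(x_i)) is wrong, the margin y_i f(x_i) is at most 0, so the exponential term is at least 1. In LaTeX:

    \mathbf{1}\{\, y_i \neq H(x_i) \,\} \;\le\; e^{-y_i f(x_i)} ,

and averaging this inequality over the N training points gives the bound above.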

30 Why zero training error isn't the end?!?!

31 Boosting results Digit recognition [Schapire, 1989]. [Figure: test error and training error vs. boosting rounds.] Boosting is often, but not always, robust to overfitting: the test set error keeps decreasing even after the training error is zero.

32 Boosting and Logistic Regression Logistic regression is equivalent to minimizing log loss; boosting minimizes a similar loss function over a weighted average of weak learners! [Figure: log loss, 0/1 loss, and exp loss as functions of the margin.] Both are smooth approximations of the 0/1 loss!

33 Boosting and Logistic Regression Logistic regression: minimize log loss, with f(x) built from predefined features x_j (a linear classifier); jointly optimize over all weights w_0, w_1, w_2, ... Boosting: minimize exp loss, with f(x) = Σ_t α_t h_t(x), where the h_t(x) are defined dynamically to fit the data (not a linear classifier); the weights α_t are learned incrementally, one per iteration t.
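
A quick numerical illustration of the three losses as functions of the margin y·f(x) (my own sketch; the log loss here is scaled by 1/log 2 so that, like the exp loss, it passes through 1 at margin 0):

    import numpy as np

    margins = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
    zero_one = (margins <= 0).astype(float)
    exp_loss = np.exp(-margins)
    log_loss = np.log(1 + np.exp(-margins)) / np.log(2)   # scaled logistic (log) loss

    for name, vals in [("0/1", zero_one), ("exp", exp_loss), ("log", log_loss)]:
        print(name, np.round(vals, 3))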

34 Effect of Outliers Good: can identify outliers, since boosting focuses on examples that are hard to categorize. Bad: too many outliers can degrade classification performance dramatically and increase the time to convergence.

35 Bagging [Breiman, 1996] Related approach to combining classifiers: 1. Run independent weak learners on bootstrap replicates (sample with replacement) of the training set. 2. Average/vote over the weak hypotheses. Bagging: resamples data points; the weight of each classifier is the same; only variance reduction. Boosting: reweights data points (modifies their distribution); each classifier's weight depends on its accuracy; both bias and variance are reduced, and the learning rule becomes more complex with each iteration.
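
A minimal Python/scikit-learn sketch of bagging as described above (bootstrap replicates plus a plain, equally-weighted vote; names are mine):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def bagging_fit(X, y, n_learners=25, seed=0):
        """Train each learner on a bootstrap replicate (sample with replacement)."""
        rng = np.random.default_rng(seed)
        learners = []
        for _ in range(n_learners):
            idx = rng.integers(0, len(y), size=len(y))   # bootstrap sample indices
            tree = DecisionTreeClassifier()
            tree.fit(X[idx], y[idx])
            learners.append(tree)
        return learners

    def bagging_predict(learners, X):
        votes = sum(t.predict(X) for t in learners)      # assumes labels in {-1, +1}
        return np.sign(votes)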

36 Example AdaBoost test error (simulated data). Weak learners used are decision stumps. Combining many trees of depth 1 yields much better results than a single large tree.

37 SPAM DATA Tree classifier: 9.3% overall error rate Boosting with decision stumps: 4.5% Figure shows feature selection results of Boosting.

38 Best known boosting application Face detection, Viola/Jones. But it's easy to forget that two things make this work: boosting (they used real AdaBoost) and the cascade architecture. P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001. P. Viola and M. Jones. Robust real-time face detection. IJCV 57(2), 2004.

39 Face detection Searching for faces in images. Two problems: Face detection - find the locations of all faces in an image; two classes. Face recognition - identify a person depicted in an image by recognizing the face; one class per person to be identified + a background class (all other people). "Face detection can be regarded as a solved problem." "Face recognition is not solved." Face detection as a classification problem: divide the image into patches and classify each patch as "face" or "not face".

40 Face detection Basic idea: slide a window across image and evaluate a face model at every location

41 Viola Jones Technique Overview Three major contributions/phases of the algorithm: feature extraction; learning using cascaded boosting and decision stumps; a multi-scale detection algorithm. Feature extraction and feature evaluation: rectangular features are used, and with a new image representation (the integral image) their calculation is very fast. The (first) classifier was actual AdaBoost. Maybe the first demonstration to computer vision that a combination of simple classifiers is very effective.

42 Feature Extraction Features are extracted from sub-windows of a sample image. The base size for a sub-window is 24 by 24 pixels. Basic features are differences of sums of rectangles (white minus black below). Each of the four feature types is scaled and shifted across all possible combinations.

43 Example [Figure: source image and resulting feature response]

44 Key to feature computation: Integral images The integral image is a new image S(x,y) computed from I(x,y) such that S(x, y) = Σ_{i=1}^{x} Σ_{j=1}^{y} I(i, j).

45 Fast Computation of Pixel Sums MATLAB: ii = cumsum(cumsum(double(i)), 2);
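
The same computation in Python/NumPy, plus the four-lookup rectangle sum that makes the Haar-like features cheap to evaluate (a sketch; the indexing convention and names are mine):

    import numpy as np

    def integral_image(I):
        """S(x, y) = sum of I over rows 1..x and columns 1..y."""
        return I.cumsum(axis=0).cumsum(axis=1)

    def rect_sum(S_padded, top, left, bottom, right):
        """Sum of I[top:bottom+1, left:right+1] using 4 lookups into a
        zero-padded integral image (one extra leading row and column of zeros)."""
        return (S_padded[bottom + 1, right + 1] - S_padded[top, right + 1]
                - S_padded[bottom + 1, left] + S_padded[top, left])

    I = np.arange(16, dtype=float).reshape(4, 4)
    S = np.pad(integral_image(I), ((1, 0), (1, 0)))      # leading zero row/column
    print(rect_sum(S, 1, 1, 2, 2), I[1:3, 1:3].sum())    # the two numbers agree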

46 Feature selection For a 24x24 detection region, the number of possible rectangle features is ~160,000!

47 Feature selection For a 24x24 detection region, the number of possible rectangle features is ~160,000! At test time, it is impractical to evaluate the entire feature set Can we create a good classifier using just a small subset of all possible features? How to select such a subset? No surprise: Boosting!

48 Paul's slide: Boosting Boosting is a classification scheme that works by combining weak learners into a more accurate ensemble classifier. A weak learner need only do better than chance. Training consists of multiple boosting rounds. During each boosting round, we select a weak learner that does well on examples that were hard for the previous weak learners. Hardness is captured by weights attached to the training examples. Y. Freund and R. Schapire, A short introduction to boosting, Journal of Japanese Society for Artificial Intelligence, 14(5), September 1999.

49 Paul's Slide: Boosting vs. SVM Advantages of boosting: integrates classification with feature selection; complexity of training is linear instead of quadratic in the number of training examples; flexibility in the choice of weak learners and boosting scheme; testing is fast; easy to implement. Disadvantages: needs many training examples; often doesn't work as well as SVMs (especially for many-class problems).

50 Boosting for face detection Define weak learners based on rectangle features For each round of boosting: Evaluate each rectangle filter on each example Select best threshold for each filter Select best filter/threshold combination Reweight examples Computational complexity of learning: O(MNK) M rounds, N examples, K features
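
"Select best threshold for each filter" is just a weighted 1-D decision-stump search over that filter's responses; a rough Python sketch (names are mine; real implementations sort the responses once and sweep the threshold rather than using this brute-force double loop):

    import numpy as np

    def best_stump_for_feature(values, y, w):
        """Best threshold/polarity for one rectangle feature under weights w.
        values: feature responses, y: labels in {-1, +1}, w: example weights."""
        best_err, best_thr, best_pol = np.inf, None, None
        for thr in np.unique(values):
            for pol in (+1, -1):
                pred = np.where(pol * values >= pol * thr, 1, -1)
                err = np.dot(w, pred != y)            # weighted error of this stump
                if err < best_err:
                    best_err, best_thr, best_pol = err, thr, pol
        return best_err, best_thr, best_pol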

51 Boosting for face detection First two features selected by boosting: This feature combination can yield 100% detection rate and 50% false positive rate

52 Boosting for face detection A 200-feature classifier can yield a 95% detection rate and a false positive rate of 1 in 14084. Not good enough! Receiver operating characteristic (ROC) curve.

53 Challenges of face detection A sliding window detector must evaluate tens of thousands of location/scale combinations. Faces are rare: 0-10 per image. For computational efficiency, we should try to spend as little time as possible on the non-face windows. A megapixel image has ~10^6 pixels and a comparable number of candidate face locations; to avoid having a false positive in every image, our false positive rate has to be less than about 10^-6. This is an unbalanced problem with a very small positive class: a standard training algorithm can achieve a good error rate by classifying all data as negative, and the error rate will be precisely the proportion of points in the positive class.

54 Attentional cascade We start with simple classifiers which reject many of the negative sub-windows while detecting almost all positive sub-windows. A positive response from the first classifier triggers the evaluation of a second (more complex) classifier, and so on. A negative outcome at any point leads to the immediate rejection of the sub-window. [Diagram: IMAGE SUB-WINDOW → Classifier 1 (T) → Classifier 2 (T) → Classifier 3 (T) → FACE; an F at any classifier → NON-FACE]

55 Why does a cascade work? We have to consider two rates at each stage j: the false positive rate FPR(f_j) = (# negative points classified as "+1") / (# negative training points at stage j), and the detection rate DR(f_j) = (# correctly classified positive points) / (# positive training points at stage j). We want a low value of FPR(f) and a high value of DR(f). Class imbalance: the number of faces classified as background is (size of face class) × (1 - DR(f)); we would like a decently high detection rate, say 90%. The number of background patches classified as faces is (size of background class) × FPR(f); since the background class is huge, FPR(f) has to be very small to yield roughly the same amount of errors in both classes. How small?

56 Why does a cascade work? Cascade detection rate The rates of the overall cascade classifier f are products over the stages: DR(f) = Π_{j=1}^{k} DR(f_j) and FPR(f) = Π_{j=1}^{k} FPR(f_j). Suppose we use a 10-stage cascade (k = 10), each DR(f_j) is 99%, and we permit an FPR(f_j) of 30%. We obtain DR(f) = 0.99^10 ≈ 0.90 and FPR(f) = 0.3^10 ≈ 5.9 × 10^-6. Since the exponent k works strongly in our favor on false positives, we can set each stage to have a very high DR and live with a fairly high per-stage FPR.

57 Training the cascade Set target detection and false positive rates for each stage Keep adding features to the current stage until its target rates have been met Need to lower AdaBoost threshold to maximize detection (as opposed to minimizing total classification error) Test on a validation set If the overall false positive rate is not low enough, then add another stage Use false positives from current stage as the negative training examples for the next stage

58 Training the cascade Training procedure 1. The user selects acceptable rates (FPR and DR) for each level of the cascade. 2. At each level of the cascade: train a boosting classifier with the final AdaBoost threshold lowered to maximize detection (as opposed to minimizing total classification error); gradually increase the number of selected features until the target rates are achieved; test on a validation set. 3. If the overall false positive rate is not low enough, then add another stage, using false positives from the current stage as the negative training examples for the next stage. Use of training data: each training step uses all positive examples (= faces) and the negative examples (= non-faces) that are misclassified at the previous cascade layer (plus more?).

59 Classifier cascades Training a cascade: use an imbalanced loss (very low false negative rate for each f_j). 1. Train classifier f_1 on the entire training data set. 2. Remove from the training set all x_i in the negative class which f_1 classifies correctly ("get some more negatives"). 3. On the smaller training set, train f_2. 4. Continue. 5. On the remaining data at the final stage, train f_k. Rapid classification with a cascade: if any f_j classifies x as negative, f(x) = -1. Only if all f_j classify x as positive is f(x) = +1.
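
The test-time logic of the cascade is just early rejection; a minimal Python sketch (names are mine), together with the rate arithmetic from the previous slide:

    def cascade_classify(x, stages):
        """stages: list of stage classifiers f_j, each returning +1 or -1.
        Any negative response rejects the sub-window immediately."""
        for f_j in stages:
            if f_j(x) < 0:
                return -1            # rejected early: most non-faces stop here
        return +1                    # survived every stage: report a face

    # Overall rates are products over stages, e.g. 10 stages with
    # per-stage DR = 0.99 and FPR = 0.30:
    print(0.99 ** 10, 0.30 ** 10)    # ~0.904 detection, ~5.9e-6 false positives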

60 The implemented system Training data: 5000 faces, all frontal, rescaled to 24x24 pixels; 300 million non-face sub-windows drawn from 9500 non-face images. Faces are normalized for scale and translation. Many variations: across individuals, illumination, pose.

61 System performance Training time: weeks on a 466 MHz Sun workstation. 38 layers, total of 6061 features. Average of 10 features evaluated per window on the test set. On a 700 MHz Pentium III processor, the face detector can process a 384 by 288 pixel image in about 0.067 seconds (roughly 15 Hz), 15 times faster than the previous detector of comparable accuracy (Rowley et al., 1998).

62 Output of Face Detector on Test Images

63 Other detection tasks Facial Feature Localization Profile Detection Male vs. female

64 Profile Detection

65 Profile Features

66 Beyond AdaBoost

67 Why AdaBoost Works: (1) Minimizing a loss function

68 (2) Forward stage additive model

69 (3) Exponential Loss

70 (3) Exponential loss an upper bound on classification error Training error of the final classifier is bounded by the exponential loss: (1/N) Σ_i 1(y_i ≠ H(x_i)) ≤ (1/N) Σ_i exp(-y_i f(x_i)). The exp loss is a convex upper bound on the 0/1 loss. If boosting can make this upper bound 0, then the training error goes to 0 as well.

71 Why Boosting Works (Cont'd) [Source: CS7616 Pattern Recognition, A. Bobick]

72 Loss Function In Hastie it is easy to show that the population minimizer of the exponential loss is f*(x) = arg min_f E[ e^{-Y f(x)} | x ] = (1/2) log( Pr(Y = 1 | x) / Pr(Y = -1 | x) ). So AdaBoost is estimating one-half the log-odds of Pr(Y = 1 | x), and classifying by whether f is greater than zero makes sense. The above also implies Pr(Y = 1 | x) = 1 / (1 + e^{-2 f*(x)}).
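
A short derivation of that claim (a worked step following Hastie Ch. 10, not shown on the slide): minimize the conditional expected exponential loss pointwise in f(x),

    E\!\left[ e^{-Y f(x)} \mid x \right] = \Pr(Y{=}1 \mid x)\, e^{-f(x)} + \Pr(Y{=}{-}1 \mid x)\, e^{f(x)} ,

and setting the derivative with respect to f(x) to zero gives

    f^*(x) = \tfrac{1}{2} \log \frac{\Pr(Y{=}1 \mid x)}{\Pr(Y{=}{-}1 \mid x)},
    \qquad
    \Pr(Y{=}1 \mid x) = \frac{1}{1 + e^{-2 f^*(x)}} .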

73 A better loss function A different loss function can be derived from a different assumed form of the probability (a logit in f): the binomial deviance, -l(y, f(x)) = log(1 + e^{-2 y f(x)}). It is minimized by the same f(x), but it is not the same function of the margin.

74 More on Loss Functions and Classification The quantity y·f(x) is called the margin. The classification rule implies that observations with positive margin, y_i f(x_i) > 0, were classified correctly, while those with negative margin were not. The decision boundary is given by f(x) = 0. The loss criterion should penalize negative margins more heavily than positive ones. But how much more?

75 Loss Functions for Two-Class Classification Exponential loss grows very rapidly as the margin becomes more negative. This makes AdaBoost less robust to mislabeled data or too many outliers.

76 How to fix boosting? AdaBoost analytically minimizes exponential loss: clean equations, good performance in good cases. But exponential loss is sensitive to outliers and misclassified points; the binomial deviance loss function is better behaved. We should be able to boost any weak learner, like trees. Can we boost trees for binomial deviance? Not analytically, but we can numerically: gradient boosting. And it can be improved by other tricks: stochastic sampling of training points at each stage; regularization, or diminishing the effect of each stage.

77 Loss Function (Cont'd) [Source: CS7616 Pattern Recognition, A. Bobick]

78 Trees Reviewed (in Hastie notation) Trees partition the space of feature vectors (joint predictor values) into disjoint regions R_j, j = 1, ..., J, represented by the terminal nodes. A constant γ_j is assigned to each region, whether regression or classification. The predictive/classification rule: x ∈ R_j ⇒ f(x) = γ_j. The tree is T(x; Θ) = Σ_j γ_j I(x ∈ R_j), where Θ = {R_j, γ_j} are the parameters: the splits defining the R_j and the values γ_j. We want to minimize the loss: Θ̂ = arg min_Θ Σ_{j=1}^{J} Σ_{x_i ∈ R_j} L(y_i, γ_j).

79 Boosting Trees Finding γ_j given R_j: this is easy. Finding R_j: this is difficult, so we typically approximate; we described the greedy top-down recursive partitioning algorithm. A boosted tree is a sum of such trees, f_M(x) = Σ_{m=1}^{M} T(x; Θ_m), where at each stage m we minimize Θ̂_m = arg min_{Θ_m} Σ_{i=1}^{N} L(y_i, f_{m-1}(x_i) + T(x_i; Θ_m)).

80 Boosting Trees (cont'd) Given the regions R_{jm}, the correct γ_{jm} is whatever minimizes the loss function: γ̂_{jm} = arg min_γ Σ_{x_i ∈ R_{jm}} L(y_i, f_{m-1}(x_i) + γ). For exponential loss, if we restrict our trees to be weak learners outputting {-1, +1}, then this is exactly AdaBoost and we train the same way. But if we want a different loss function, we need a numerical method.

81 Numerical Optimization The loss in using prediction f(x) for y, summed over the training data, is L(f) = Σ_{i=1}^{N} L(y_i, f(x_i)). The goal is to minimize L(f) with respect to f, where f is the sum of the trees; but ignore the sum-of-trees constraint for now and just think about the minimization. The parameters are the values of the approximating function at each of the N data points: f = ( f(x_1), ..., f(x_N) ). Numerical optimization successively approximates the minimizer as a sum of component vectors, f_M = Σ_{m=0}^{M} h_m with h_m ∈ R^N, where f_0 = h_0 is an initial guess and each successive f_m is induced based on the current parameter vector f_{m-1}.

82 Steepest Descent 1. Choose h_m = -ρ_m g_m, where ρ_m is a scalar and g_m is the gradient of L(f) evaluated at f = f_{m-1}. 2. The step length ρ_m is the (line search) solution to ρ_m = arg min_ρ L(f_{m-1} - ρ g_m). 3. The current solution is then updated: f_m = f_{m-1} - ρ_m g_m.

83 Gradient Boosting Forward stagewise boosting is also a very greedy algorithm. The tree predictions can be thought of as being like negative gradients. The only difficulty is that the tree components are not arbitrary: they are constrained to be the predictions of a J_m-terminal-node decision tree, whereas the negative gradient is an unconstrained steepest-descent direction. Unfortunately, the gradient is only defined at the training data points, so it does not by itself generalize f_m(x) to new data.

84 Gradient boosting (cont'd) For classification, the negative gradient is computed from the chosen loss (e.g. the binomial deviance). We approximate the gradient step by finding a decision tree whose predictions are as close as possible to the negative gradient, i.e. fit a J-terminal-node tree to the negative gradient values by least squares.

85 Gradient boosting
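
Slide 85 carried the algorithm itself; as a stand-in, here is a minimal Python/scikit-learn sketch of least-squares gradient boosting: each small regression tree is fit to the negative gradient of the loss, which for squared loss is simply the current residual (for binomial deviance the same recipe applies with the deviance gradient and a per-leaf line search). The shrinkage factor ν from the "more tweaks" slide below appears as nu; names are mine.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def gradient_boost_ls(X, y, M=100, nu=0.1, max_depth=3):
        """Least-squares gradient boosting with small regression trees."""
        f0 = y.mean()                              # f_0: best constant prediction
        f = np.full(len(y), f0)
        trees = []
        for _ in range(M):
            residual = y - f                       # negative gradient of 1/2 (y - f)^2
            tree = DecisionTreeRegressor(max_depth=max_depth)
            tree.fit(X, residual)                  # tree approximates the gradient step
            f = f + nu * tree.predict(X)           # shrunken forward stagewise update
            trees.append(tree)
        return f0, trees

    def gradient_boost_predict(f0, trees, X, nu=0.1):
        return f0 + nu * sum(t.predict(X) for t in trees)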

86 Gradient boosting results

87 More tweaks If we fit the best tree possible, it will yield a tree that is deep, as if it were the last tree of the boosting ensemble. Large trees permit interaction between elements. Prevent this by either penalizing based upon tree size or, believe it or not, just letting J = 6. There is the question of how many stages: if we go too far we can overfit (worse than AdaBoost?). We need shrinkage: f_m(x) = f_{m-1}(x) + ν Σ_j γ_{jm} I(x ∈ R_{jm}), where 0 < ν < 1. The text actually describes ν = 0.01.

88

89 Interpretation Single decision trees are often very interpretable; a linear combination of trees loses this important feature. We often want to learn the relative importance or contribution of each input variable in predicting the response. Define a measure of relevance for each predictor X_l and sum over the J-1 internal nodes of the tree.

90 Interpretation (Cont'd) Relevance of a predictor X_l in a single tree T (where v(t) is the feature selected at internal node t): I_l^2(T) = Σ_{t=1}^{J-1} î_t^2 1(v(t) = l), where î_t^2 is the measure of improvement when the split is applied to node t. In a boosted tree ensemble, the squared relevance is averaged over the trees: I_l^2 = (1/M) Σ_{m=1}^{M} I_l^2(T_m). Since these are relative measures, scale them so the maximum is 100.

91 Relevance

92 Illustration (California Housing)

93 Boosting Summary Combine weak classifiers to obtain a very strong classifier: a weak classifier is slightly better than random on the training data, and the resulting strong classifier can eventually provide zero training error (the AdaBoost algorithm). Boosting vs. logistic regression: similar loss functions; LR is a single joint optimization, vs. incrementally improving the classification in boosting. Boosting is very popular for applications: boosted decision stumps make it easy to build a training system, and are very simple to implement and efficient.
