Computer Vision Group Prof. Daniel Cremers. 6. Boosting

Size: px

Start display at page:

Download "Computer Vision Group Prof. Daniel Cremers. 6. Boosting"

Marilyn Green
6 years ago
Views:

1 Prof. Daniel Cremers 6. Boosting

2 Repetition: Regression We start with a set of basis functions (x) =( 0 (x), 1(x),..., M 1(x)) x 2 í d The goal is to fit a model into the data y(x, w) =w T (x) To do this, we need to find an error function, e.g.: E(w) = 1 2 NX (w T (x i ) t i ) 2 i=1 To find the optimal parameters, we derived E with respect to w and set the derivative to zero. 2

3 Some Questions 1.Can we do the same for classification? As a special case we consider two classes: t i 2 { 1, 1} 8i =1,...,N 2.Can we use a different (better?) error function? 3.Can we learn the basis functions together with the model parameters? 4.Can we do the learning sequentially, i.e. one basis function after another? Answer to all questions: Yes, using Boosting! 3

4 The Loss Function Definition: a real-valued function L(t, y(x)), where t is a target value and y is a model, is called a loss function. Examples: 01-loss: L 01 (t, y(x)) = 0 if t = y(x) 1 else squared error loss: L sqe (t, y(x)) = (t y(x)) 2 exponential loss: L exp (t, y(x)) = exp( ty(x)) 4

5 Loss Functions 01-loss is not differentiable squared error loss has only one optimum 5

6 Sequential Fitting of Basis Functions Idea: We start with a basis function : 0(x) y 0 (x,w 0 )=w 0 0 (x) w 0 =1 Then, at iteration m, we add a new basis function m(x) to the model: y m (x,w 0,...,w m )=y m 1 (x,w 0,...,w m 1 )+w m m (x) Two questions need to be answered: 1.How do we find a good new basis function? 2.How can we determine a good value for w m? Idea: Minimize the exponential loss function 6

7 Minimizing the Exponential Loss Aim: find w m and m so that (w m, m) = arg min w, NX L(t i,y m 1 (x i )+w (x i )) i=1 where L(t, y) =exp( ty) 7

8 Minimizing the Exponential Loss Aim: find w m and m so that (w m, m) = arg min w, NX i=1 L(t i,y m 1 (x i )+w (x i )) where L(t, y) =exp( ty) Solution: m = arg min NX v i,m I(t i 6= (x i )) i=1 8

9 Minimizing the Exponential Loss Aim: find w m and m so that (w m, m) = arg min w, NX i=1 L(t i,y m 1 (x i )+w (x i )) where L(t, y) =exp( ty) Solution: w m = 1 2 log 1 m = arg min err m err m NX i=1 v i,m I(t i 6= (x i )) 9

10 Minimizing the Exponential Loss Aim: find w m and m so that (w m, m) = arg min w, NX i=1 L(t i,y m 1 (x i )+w (x i )) where L(t, y) =exp( ty) NX Solution: m = arg min i=1 v i,m I(t i 6= (x i )) w m = 1 2 log 1 err m err m v i,m+1 = v i,m exp(2w m I(t i 6= m (x i )) 10

11 Minimizing the Exponential Loss Aim: find w m and m so that (w m, m) = arg min w, NX i=1 L(t i,y m 1 (x i )+w (x i )) where L(t, y) =exp( ty) Condition: Must have training error less than 0.5! NX Solution: m = arg min i=1 v i,m I(t i 6= (x i )) w m = 1 2 log 1 err m err m v i,m+1 = v i,m exp(2w m I(t i 6= m (x i )) 11

12 1.For : The AdaBoost Algorithm i =1,...,N v i 1/N 2.For m =1,...,M Fit a classifier ( basis function ) Compute NX i=1 err m = v i I(t i 6= m (x i )) m P N i=1 v ii(t i 6= m (x i )) P N i=1 v i that minimizes and m := 2w m m = log 1 err m err m Update the weights: v i v i exp( m I(t i 6= m (x i ))) 3.Use the resulting classifier: y(x) = sgn MX m=1 12 m m (x)

13 The Basis Functions Can be any classifier that can deal with weighted data Most importantly: if these base classifiers provide a training error that is at most as bad as a random classifier would give (i.e. it is a weak classifier), then AdaBoost can return an arbitrarily small training error (i.e. AdaBoost is a strong classifier) Many possibilities for weak classifiers exist, e.g.: Decision stumps Decision trees 13

14 Decision Stumps Decision Stumps are a kind of very simple weak classifiers. Goal: Find an axis-aligned hyperplane that minimizes the class. error This can be done for each feature (i.e. for each dimension in feature space) It can be shown that the classif. error is always better than 0.5 (random guessing) Idea: apply many weak classifiers, where each is trained on the misclassified examples of the previous. 14

15 Classification Example 15

16 Classification Example 16

17 Classification Example 17

18 Classification Example 18

19 Classification Example 19

20 Classification Example 20

21 Decision Trees A more general version of decision stumps are decision trees: 10 X 1 t 1 9 X 2 t 2 R 1 X 1 t 3 X 2 t 4 R 2 R R 4 R 5 At every node, a decision is made Can be used for classification and for regression (Classification And Regression Trees CART) 21

22 Decision Trees for Classification blue color red 4,0 shape ellipse other other size < 10 yes no 1,1 0,2 4,0 0,5 Stores the distribution over class labels in each leaf (number of positives and negatives) With these, we can class label probabilities, e.g. p(y =1 x) =1/2 if we have a red ellipse 22

23 Growing a Decision Tree Finding the optimal partition of the data is an NP-complete problem! Instead: use a greedy strategy: function fittree(node, D, depth): 1. node.prediction = class label distribution 2. (j,t, D L, D R )=split(d) 3. if not worth splitting then return node 4. node.test x j <t D L 5. node.left = fittree(node,, depth +1) 6. node.right = fittree(node, D R, depth +1) 23

24 Growing a Decision Tree The Split-function finds an optimal feature and an optimal value for that feature For classification, it finds a split that minimizes some cost function, e.g. misclassification A decision stump is a decision tree with depth 1 Stopping criteria for growing the tree are: reduction of cost too small? maximum depth reached? is the distribution in the sub-trees homogenous? is the number of samples in the sub-trees too small? 24

25 Tree Pruning versicolor setosa SL < 5.45 W < 2.8 SW >= 2.8 SL < 6.15 SL >= 6.15 SW < 3.45 SW >= 3.45 SL < 7.05 SL >= 7.05 SL < 5.75 SL >= 5.75 SW < 2.4 SW >= 2.4 setosa SW < 3.1 SW >= 3.1 SL < 6.95 SL >= 6.95 versicolor versicolor SW < 2.95 SW >= 2.95 SW < 3.15 SW >= 3.15 versicolor versicolor versicolor virginica SL >= 5.45 virginica SL < 6.45 SL >= 6.45 SW < 2.65 SW >= 2.65 virginica versicolor SW < 2.85 SW >= 2.85 SW < 2.9 SW >= 2.9 versicolor virginica virginica versicolor SL < 6.55 SL >= 6.55 SW < 2.95 SW SL >= < SL >= 6.65 versicolor virginica virginica Cost (misclassification error) Cross validation Training set Min + 1 std. err. Best choice Number of terminal nodes If the tree grows too large, the algorithm overfits Simply stopping to grow can lead to situations where the tree is not expressive enough Idea: Build first full tree and then prune it Pruning can be done using cross-validation 25

26 Random Forests To reduce the variance of the classification estimate, we can train several trees on randomly sampled subsets of the data ( bagging ) However, this can result in correlated classifiers, limiting the reduction in variance Idea: chose data subset and variable (feature) subset randomly The resulting algorithm is known as Random Forests Random Forests have very good accuracy and are widely used, e.g. body pose recognition 26

27 Back to Boosting AdaBoost has been shown to perform very well, especially when using decision trees as weak classifiers However: the exponential loss weighs misclassified examples very high! 27

28 Using the Log-Loss The log-loss is defined as: L(t, y(x)) = log 2 (1 + exp( 2ty(x)) It penalizes misclassifications only linearly 28

29 The LogitBoost Algorithm 1.For : i =1,...,N v i 1/N i 1/2 2.For Compute the working response Compute the weights Find Update m =1,...,M m that minimizes NX i=1 3.Use the resulting classifier: z i = t i i i (1 i ) v i = i (1 i ) v i (z i (x i )) 2 y(x) y(x)+ 1 2 m (x) i y(x) = sgn 29 MX m=1 and m(x) exp( 2y(x i ))

30 Weighted Least-Squares Regression Instead of a weak classifier, LogitBoost uses weighted least-squares regression This is very similar to standard least-squares regression: E(w) = 1 2 NX i=1 v i (w T (x i ) t i ) 2 This results in a matrix ˆ = V 1/2 where The solution is V 1/2 = diag( p v 1,..., p v N ) w =(ˆ T ˆ) 1 ˆ T t 30

31 GentleBoost Numerically more stable than LogitBoost Tends to perform better than AdaBoost and LogitBoost 31

32 Generalization: Gradient Boost Initialize for m =1,...,M Compute the gradient residual Use the weak learner to compute that minimizes Update NX f 0 (x) = arg min L(t i, (x i, )) r im = i=1 i i ) f(x i )=f m 1 (x i ) NX (r im (x i ; m)) 2 i=1 f m (x) =f m 1 (x)+ (x, ) Return f(x) =f M (x) 32

33 Application of AdaBoost: Face Detection The biggest impact of AdaBoost was made in face detection Idea: extract features ( Haar-like features ) and train AdaBoost, use a cascade of classifiers Features can be computed very efficiently Weak classifiers can be decision stumps or decision trees As inference in AdaBoost is fast, the face detector can run in real-time! 33

34 Haar-like Features Defined as difference of rectangular integral area: The sum of the pixels which lie within the white rectangles are subtracted from the sum of pixels in the grey rectangles. One feature defined as: Feature type: A,B,C or D Feature position and size 34

35 The integral image Defined as : Integral on rectangle D can be computed in 4 access to I int : Very efficient way to compute features 35

36 Weak Classifiers Used A weak classifier has 3 attributes: A feature f j (type, size and position) A threshold θ j A comparison operator op j = < or > The resulting weak classifier is: x is a 24x24 pixels window in the image 36

37 Two First Classifiers Selected by AdaBoost A classifier with only this two features can be trained to recognise 100% of the faces, with 40% of false positives 37

38 The Inference Algorithm scale = 24x24 Do { For each position in the image { Try classifying the part of the image starting at this position, with the current scale, using the classifier selected by AdaBoost } Scale = Scale x 1.5 } until maximum scale 38

39 Another Improvement: the Cascade Basic idea: It is easy to detect that something is not a face Tune(boost) classifier to be very reliable at saying NO (i.e. very low false negative) Stop evaluating the cascade of classifier if one classifier says NO 39

40 Advantage of the Cascade Faster processing Quick elimination of useless windows Each individual classifier is trained to deal only with the example that the previous ones could not process Very specialised The deeper in the cascade, the more complex (the more features) in the classifiers. 40

41 Results (1) 41

42 Results (2) 42 PD Dr. Rudolph Triebel

43 Summary Boosting is a method to use a weak classifier and turn it into a strong one (arbitrarily small training error!) AdaBoost minimizes the exponential loss To be more robust against outliers, we can use LogitBoost or GentleBoost Weak learners can be decision stumps or decision trees Face detection can be solved with Boosting 43

Computer Vision Group Prof. Daniel Cremers. 8. Boosting and Bagging

Computer Vision Group Prof. Daniel Cremers. 8. Boosting and Bagging Prof. Daniel Cremers 8. Boosting and Bagging Repetition: Regression We start with a set of basis functions (x) =( 0 (x), 1(x),..., M 1(x)) x 2 í d The goal is to fit a model into the data y(x, w) =w T