Visual Recognition: Examples of Graphical Models

1 Visual Recognition: Examples of Graphical Models
Raquel Urtasun, TTI Chicago, March 6, 2012

2 Graphical models
- Applications
- Representation
- Inference: message passing (LP relaxations), graph cuts
- Learning

3 Learning in graphical models

4 Parameter learning
The MAP problem was defined as
    $\max_{y_1,\dots,y_n} \sum_i \theta_i(y_i) + \sum_\alpha \theta_\alpha(y_\alpha)$
Learn parameters w for more accurate prediction:
    $\max_{y_1,\dots,y_n} \sum_i w^\top \phi_i(y_i) + \sum_\alpha w^\top \phi_\alpha(y_\alpha)$

5 Loss functions
Regularized loss minimization: given input pairs $(x, y) \in S$, minimize
    $\sum_{(x,y) \in S} \hat{\ell}(w, x, y) + \frac{C}{p} \|w\|_p^p$
Different learning frameworks depending on the surrogate loss $\hat{\ell}(w, x, y)$:
- Hinge loss for structural SVMs [Tsochantaridis et al. 05, Taskar et al. 04]
- Log-loss for conditional random fields [Lafferty et al. 01]
- Unified by [Hazan and Urtasun, 10]

7 Recall SVM
In SVMs we minimize the following program
    $\min_w \frac{1}{2}\|w\|^2 + \sum_i \xi_i$
    s.t. $y_i(b + w^\top x_i) - 1 + \xi_i \ge 0, \quad i = 1, \dots, N$
with $y_i \in \{-1, 1\}$ binary. We need to extend this to reason about more complex structures, not just binary variables.
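
The recap can be made concrete with a minimal sketch (mine, not from the slides): stochastic subgradient descent on the unconstrained hinge-loss form of the binary SVM objective above, with made-up toy data and hyperparameters.

    import numpy as np

    def train_binary_svm(X, y, lam=0.1, epochs=100, lr=0.01):
        """Minimize lam/2 ||w||^2 + mean_i max(0, 1 - y_i (w.x_i + b))."""
        n, d = X.shape
        w, b = np.zeros(d), 0.0
        for _ in range(epochs):
            for i in np.random.permutation(n):
                margin = y[i] * (X[i] @ w + b)
                # Subgradient of the hinge term is -y_i x_i when the margin is violated.
                gw = lam * w - (y[i] * X[i] if margin < 1 else 0.0)
                gb = -y[i] if margin < 1 else 0.0
                w -= lr * gw
                b -= lr * gb
        return w, b

    # Toy linearly separable data: two Gaussian blobs with labels in {-1, +1}.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
    y = np.array([-1] * 50 + [1] * 50)
    w, b = train_binary_svm(X, y)
    print("train accuracy:", np.mean(np.sign(X @ w + b) == y))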

8 Structural SVM [Tsochantaridis et al., 05]
We want to construct a function
    $f(x, w) = \arg\max_{y \in \mathcal{Y}} w^\top \phi(x, y)$
which is parameterized in terms of w, the parameters to learn.
We would like to minimize the empirical risk
    $R_S(f, w) = \frac{1}{n} \sum_{i=1}^n \Delta(y_i, f(x_i, w))$
This is the expected loss under the induced empirical distribution.
$\Delta(y_i, f(x_i, w))$ is the task loss, which depends on the application:
- Segmentation: per-pixel segmentation error
- Detection: intersection over union
Typically $\Delta(y, y') = 0$ if $y = y'$.
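
As a concrete illustration of the two task losses named on the slide, here is a small sketch (names and the (x1, y1, x2, y2) box format are my own choices):

    import numpy as np

    def per_pixel_loss(y_true, y_pred):
        """Fraction of pixels labeled differently (segmentation task loss)."""
        return np.mean(y_true != y_pred)

    def iou_loss(box_a, box_b):
        """1 - intersection-over-union for boxes given as (x1, y1, x2, y2)."""
        ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
        iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
        inter = ix * iy
        area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
        union = area(box_a) + area(box_b) - inter
        return 1.0 - inter / union

    print(iou_loss((0, 0, 10, 10), (0, 0, 10, 10)))  # 0.0: Delta(y, y) = 0
    print(iou_loss((0, 0, 10, 10), (5, 5, 15, 15)))  # > 0 for a mismatch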

13 Separable case
We would like to minimize the empirical risk
    $R_S(f, w) = \frac{1}{n} \sum_{i=1}^n \Delta(y_i, f(x_i, w))$
We will have 0 training error if we satisfy
    $\max_{y \in \mathcal{Y} \setminus y_i} \{ w^\top \phi(x_i, y) \} \le w^\top \phi(x_i, y_i)$
since $\Delta(y_i, y_i) = 0$ and $\Delta(y_i, y) > 0$ for all $y \in \mathcal{Y} \setminus y_i$.
This can be replaced by $|\mathcal{Y}| - 1$ inequalities per example:
    $\forall i \in \{1, \dots, n\},\ \forall y \in \mathcal{Y} \setminus y_i: \quad w^\top \phi(x_i, y_i) - w^\top \phi(x_i, y) \ge 0$
What's the problem with this?

17 Separable case
The inequalities may be satisfied by more than one solution w. Select the w with the maximum margin.
We can thus form the following optimization problem:
    $\min_w \frac{1}{2}\|w\|^2$
    s.t. $w^\top \phi(x_i, y_i) - w^\top \phi(x_i, y) \ge 1 \quad \forall i \in \{1, \dots, n\},\ y \in \mathcal{Y} \setminus y_i$
This is a quadratic program, so it is convex. But it involves exponentially many constraints!

22 Non-separable case
Multiple formulations:
- Multi-class classification [Crammer & Singer, 03]
- Slack re-scaling [Tsochantaridis et al. 05]
- Margin re-scaling [Taskar et al. 04]
Let's look at them in more detail.

23 Multi-class classification [Crammer & Singer, 03]
Enforce a large margin and do a batch convex optimization. The minimization program is then
    $\min_w \frac{1}{2}\|w\|^2 + \frac{C}{n} \sum_{i=1}^n \xi_i$
    s.t. $w^\top \phi(x_i, y_i) - w^\top \phi(x_i, y) \ge 1 - \xi_i \quad \forall i \in \{1, \dots, n\},\ y \ne y_i$
Can also be written in terms of kernels.
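
A minimal sketch of this objective for a toy multi-class case; the joint feature map φ(x, y) is taken here as the usual class-indicator stacking, which is my assumption rather than something stated on the slide.

    import numpy as np

    def joint_feature(x, y, num_classes):
        """phi(x, y): copy of x in the block of class y, zeros elsewhere."""
        phi = np.zeros(num_classes * x.size)
        phi[y * x.size:(y + 1) * x.size] = x
        return phi

    def multiclass_objective(w, X, Y, num_classes, C=1.0):
        """(1/2)||w||^2 + (C/n) sum_i xi_i, with xi_i the smallest feasible slack."""
        slacks = []
        for x, yi in zip(X, Y):
            score_true = w @ joint_feature(x, yi, num_classes)
            worst = max(w @ joint_feature(x, y, num_classes)
                        for y in range(num_classes) if y != yi)
            slacks.append(max(0.0, 1.0 - (score_true - worst)))
        return 0.5 * w @ w + C / len(X) * sum(slacks)

    # Toy check: 3 classes, 2D inputs, random w (a hypothetical setup).
    rng = np.random.default_rng(0)
    X = [rng.normal(size=2) for _ in range(4)]
    Y = [0, 1, 2, 1]
    w = rng.normal(size=3 * 2)
    print(multiclass_objective(w, X, Y, num_classes=3))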

24 Structured Output SVMs
Frame structured prediction as a multiclass problem: predict a single element of $\mathcal{Y}$ and pay a penalty for mistakes.
Not all errors are created equal: in an HMM, making only one mistake in a sequence should be penalized less than making 50 mistakes.
Pay a loss proportional to the difference between the true and predicted outputs (task dependent): $\Delta(y_i, y)$. [Source: M. Blaschko]

27 Slack re-scaling
Re-scale the slack variables according to the loss incurred in each of the linear constraints: violating a margin constraint involving a $y \ne y_i$ with high loss $\Delta(y_i, y)$ should be penalized more than a violation involving an output value with smaller loss.
The minimization program is then
    $\min_w \frac{1}{2}\|w\|^2 + \frac{C}{n} \sum_{i=1}^n \xi_i$
    s.t. $w^\top \phi(x_i, y_i) - w^\top \phi(x_i, y) \ge 1 - \frac{\xi_i}{\Delta(y_i, y)} \quad \forall i \in \{1, \dots, n\},\ y \in \mathcal{Y} \setminus y_i$
The justification is that $\frac{1}{n} \sum_{i=1}^n \xi_i$ is an upper bound on the empirical risk. Easy to prove.

31 Margin re-scaling
In this case the minimization problem is formulated as
    $\min_w \frac{1}{2}\|w\|^2 + \frac{C}{n} \sum_{i=1}^n \xi_i$
    s.t. $w^\top \phi(x_i, y_i) - w^\top \phi(x_i, y) \ge \Delta(y_i, y) - \xi_i \quad \forall i \in \{1, \dots, n\},\ y \in \mathcal{Y} \setminus y_i$
The justification is that $\frac{1}{n} \sum_{i=1}^n \xi_i$ is an upper bound on the empirical risk. Also easy to prove.
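
To see how the two re-scalings differ, this sketch computes the minimal feasible slack $\xi_i$ for one example under each formulation by enumerating a small label set; the scores and losses are made up for illustration.

    import numpy as np

    def min_slacks(scores, loss, y_true):
        """scores[y] = w.phi(x_i, y); loss[y] = Delta(y_i, y)."""
        margin_xi, slack_xi = 0.0, 0.0
        for y in range(len(scores)):
            if y == y_true:
                continue
            violation = scores[y] - scores[y_true]  # how much y beats the truth
            # Margin rescaling: need scores[y_true] - scores[y] >= loss[y] - xi.
            margin_xi = max(margin_xi, loss[y] + violation)
            # Slack rescaling: need scores[y_true] - scores[y] >= 1 - xi / loss[y].
            slack_xi = max(slack_xi, loss[y] * (1.0 + violation))
        return margin_xi, slack_xi

    scores = np.array([2.0, 1.5, -1.0])   # label 0 is the ground truth
    loss = np.array([0.0, 5.0, 0.2])      # label 1 is a "bad" mistake
    print(min_slacks(scores, loss, y_true=0))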

33 Constraint Generation
To find the most violated constraint, we need to maximize w.r.t. y. For margin rescaling:
    $\max_y\ w^\top \phi(x_i, y) + \Delta(y_i, y)$
and for slack rescaling:
    $\max_y\ \Delta(y_i, y) \{ w^\top \phi(x_i, y) + 1 - w^\top \phi(x_i, y_i) \}$
For arbitrary output spaces, we would need to iterate over all elements in $\mathcal{Y}$. Use graph cuts or message passing instead.
When the MAP cannot be computed exactly, but only approximately, this algorithm does not behave well [Finley et al., 08].
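
For chain-structured models with a Hamming loss, the margin-rescaled maximization decomposes over nodes, so ordinary Viterbi message passing finds the most violated constraint. A minimal sketch under exactly those assumptions (the score matrices are hypothetical):

    import numpy as np

    def loss_augmented_viterbi(unary, pairwise, y_true):
        """argmax_y sum_t unary[t, y_t] + sum_t pairwise[y_t, y_{t+1}] + Hamming(y, y_true).

        unary: (T, K) node scores; pairwise: (K, K) transition scores."""
        T, K = unary.shape
        # Hamming loss adds 1 to every label that disagrees with the ground truth.
        aug = unary + (np.arange(K)[None, :] != np.array(y_true)[:, None])
        best = aug[0].copy()
        back = np.zeros((T, K), dtype=int)
        for t in range(1, T):
            cand = best[:, None] + pairwise          # (prev, cur)
            back[t] = cand.argmax(axis=0)
            best = cand.max(axis=0) + aug[t]
        y = [int(best.argmax())]
        for t in range(T - 1, 0, -1):
            y.append(int(back[t][y[-1]]))
        return y[::-1]

    unary = np.random.default_rng(1).normal(size=(5, 3))
    pairwise = np.eye(3)                              # favor smooth sequences
    print(loss_augmented_viterbi(unary, pairwise, y_true=[0, 0, 1, 1, 2]))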

37 One Slack Formulation
Margin rescaling:
    $\min_w \frac{1}{2}\|w\|^2 + \frac{C}{n} \xi$
    s.t. $w^\top \phi(x_i, y_i) - w^\top \phi(x_i, y) \ge \Delta(y_i, y) - \xi \quad \forall i \in \{1, \dots, n\},\ y \in \mathcal{Y} \setminus y_i$
Slack rescaling:
    $\min_w \frac{1}{2}\|w\|^2 + \frac{C}{n} \xi$
    s.t. $w^\top \phi(x_i, y_i) - w^\top \phi(x_i, y) \ge 1 - \frac{\xi}{\Delta(y_i, y)} \quad \forall i \in \{1, \dots, n\},\ y \in \mathcal{Y} \setminus y_i$
Same optima as the previous formulation [Joachims et al, 09].

40 Example: Handwritten Recognition
Predict text from an image of handwritten characters. [Source: B. Taskar]
Training iterates between two steps (see the cutting-plane sketch below):
- Estimate model parameters w using the active constraint set
- Generate the next constraint
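
The two alternating steps can be sketched as a cutting-plane loop. To keep this self-contained, the QP over the active set is replaced by a few subgradient steps; that simplification, the toy 3-class data, and the 0/1 loss are mine, not the slides'.

    import numpy as np

    def cutting_plane(X, Y, phi, loss, labels, C=1.0, rounds=20, lr=0.05):
        """Margin-rescaling training: alternate (1) fit w on active constraints,
        (2) add the currently most violated constraint per example."""
        w = np.zeros(phi(X[0], Y[0]).size)
        active = [set() for _ in X]                   # constraint set per example
        for _ in range(rounds):
            # (2) generate the next constraint: loss-augmented argmax.
            for i, (x, yi) in enumerate(zip(X, Y)):
                ybar = max(labels, key=lambda y: w @ phi(x, y) + loss(yi, y))
                active[i].add(ybar)
            # (1) re-estimate w on the active set by subgradient steps.
            for _ in range(50):
                g = w.copy()
                for i, (x, yi) in enumerate(zip(X, Y)):
                    xi, worst = 0.0, None
                    for y in active[i]:
                        v = loss(yi, y) - (w @ phi(x, yi) - w @ phi(x, y))
                        if v > xi:
                            xi, worst = v, y
                    if worst is not None:
                        g += C / len(X) * (phi(x, worst) - phi(x, yi))
                w -= lr * g
        return w

    # Tiny demo: 3 classes in 2D, 0/1 task loss (made-up data).
    rng = np.random.default_rng(0)
    means = np.array([[0.0, 3.0], [3.0, 0.0], [-3.0, -3.0]])
    Y = [0, 1, 2] * 5
    X = [means[y] + rng.normal(size=2) for y in Y]
    phi = lambda x, y: np.concatenate([x if c == y else np.zeros(2) for c in range(3)])
    loss = lambda yt, y: float(yt != y)
    w = cutting_plane(X, Y, phi, loss, labels=[0, 1, 2])
    preds = [max([0, 1, 2], key=lambda y: w @ phi(x, y)) for x in X]
    print("train accuracy:", np.mean([p == t for p, t in zip(preds, Y)]))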

41 Conditional Random Fields
Regularized loss minimization: given input pairs $(x, y) \in S$, minimize
    $\sum_{(x,y) \in S} \hat{\ell}(w, x, y) + \frac{C}{p} \|w\|_p^p$
CRF loss: the conditional distribution is
    $p_{x,y}(\hat{y}; w) = \frac{1}{Z(x, y)} \exp(\ell(y, \hat{y}) + w^\top \Phi(x, \hat{y}))$
    $Z(x, y) = \sum_{\hat{y} \in \mathcal{Y}} \exp(\ell(y, \hat{y}) + w^\top \Phi(x, \hat{y}))$
where $\ell(y, \hat{y})$ acts as a prior distribution and $Z(x, y)$ is the partition function, and
    $\ell_{\log}(w, x, y) = \ln \frac{1}{p_{x,y}(y; w)}$

43 CRF learning
In CRFs one aims to minimize the regularized negative log-likelihood of the conditional distribution:
    (CRF) $\min_w \sum_{(x,y) \in S} \ln Z(x, y) - d^\top w + \frac{C}{p} \|w\|_p^p$
where $(x, y) \in S$ ranges over the training pairs and $d = \sum_{(x,y) \in S} \Phi(x, y)$ is the vector of empirical means.
In coordinate descent methods, each coordinate $w_r$ is iteratively updated in the direction of the negative gradient, for some step size $\eta$. The gradient of the log-partition function corresponds to the probability distribution $p(\hat{y} \mid x, y; w)$, and the direction of descent takes the form
    $\sum_{(x,y) \in S} \sum_{\hat{y}} p(\hat{y} \mid x, y; w) \Phi_r(x, \hat{y}) - d_r + |w_r|^{p-1} \operatorname{sign}(w_r)$
Problem: this requires computing the partition function!
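
For intuition, $\ln Z$ and the expectation term of the gradient can be computed by brute-force enumeration on a tiny output space; this is only viable when $|\mathcal{Y}|$ is small, and the feature map and Hamming loss in the demo are made up.

    import numpy as np
    from itertools import product

    def crf_lnZ_and_marginal_features(w, x, y_true, labels, T, phi, loss):
        """Enumerate all yhat in labels^T to get ln Z(x, y) and
        E_{p(yhat|x,y;w)}[Phi(x, yhat)], the first term of the gradient."""
        scores, feats = [], []
        for yhat in product(labels, repeat=T):
            feats.append(phi(x, yhat))
            scores.append(loss(y_true, yhat) + w @ feats[-1])
        scores = np.array(scores)
        lnZ = np.logaddexp.reduce(scores)
        p = np.exp(scores - lnZ)                      # p(yhat | x, y; w)
        return lnZ, p @ np.array(feats)

    # The gradient of ln Z(x, y) - d.w is then `expected - Phi(x, y_true)` per example.
    lnZ, expected = crf_lnZ_and_marginal_features(
        w=np.array([0.2, -0.1]), x=None, y_true=(0, 1, 0), labels=[0, 1], T=3,
        phi=lambda x, yh: np.bincount(np.array(yh), minlength=2).astype(float),
        loss=lambda yt, yh: float(sum(a != b for a, b in zip(yt, yh))))
    print(lnZ, expected)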

47 Loss functions
Regularized loss minimization: given input pairs $(x, y) \in S$, minimize
    $\sum_{(x,y) \in S} \hat{\ell}(w, x, y) + \frac{C}{p} \|w\|_p^p$
In structured SVMs:
    $\ell_{\text{hinge}}(w, x, y) = \max_{\hat{y} \in \mathcal{Y}} \{ \ell(y, \hat{y}) + w^\top \Phi(x, \hat{y}) - w^\top \Phi(x, y) \}$
CRF loss: the conditional distribution is
    $p_{x,y}(\hat{y}; w) = \frac{1}{Z(x, y)} \exp(\ell(y, \hat{y}) + w^\top \Phi(x, \hat{y})), \quad Z(x, y) = \sum_{\hat{y} \in \mathcal{Y}} \exp(\ell(y, \hat{y}) + w^\top \Phi(x, \hat{y}))$
where $\ell(y, \hat{y})$ acts as a prior distribution and $Z(x, y)$ is the partition function, and $\ell_{\log}(w, x, y) = \ln \frac{1}{p_{x,y}(y; w)}$.

50 Relation between loss functions
The CRF program is
    (CRF) $\min_w \sum_{(x,y) \in S} \ln Z(x, y) - d^\top w + \frac{C}{p} \|w\|_p^p$
where $(x, y) \in S$ ranges over training pairs, $d = \sum_{(x,y) \in S} \Phi(x, y)$ is the vector of empirical means, and
    $Z(x, y) = \sum_{\hat{y} \in \mathcal{Y}} \exp(\ell(y, \hat{y}) + w^\top \Phi(x, \hat{y}))$
In structured SVMs:
    (structured SVM) $\min_w \sum_{(x,y) \in S} \max_{\hat{y} \in \mathcal{Y}} \{ \ell(y, \hat{y}) + w^\top \Phi(x, \hat{y}) \} - d^\top w + \frac{C}{p} \|w\|_p^p$

52 A family of structured prediction problems [T. Hazan and R. Urtasun, NIPS 2010]
One-parameter extension of CRFs and structured SVMs:
    $\min_w \sum_{(x,y) \in S} \ln Z_\epsilon(x, y) - d^\top w + \frac{C}{p} \|w\|_p^p$
where d is the vector of empirical means, and
    $\ln Z_\epsilon(x, y) = \epsilon \ln \sum_{\hat{y} \in \mathcal{Y}} \exp\left( \frac{\ell(y, \hat{y}) + w^\top \Phi(x, \hat{y})}{\epsilon} \right)$
CRF if $\epsilon = 1$, structured SVM if $\epsilon = 0$, respectively. Introduces the notion of loss in CRFs.
The dual takes the form
    $\max\ \sum_{(x,y) \in S} \left[ \epsilon H(p_{x,y}) + \sum_{\hat{y} \in \mathcal{Y}} p_{x,y}(\hat{y}) \ell(y, \hat{y}) \right] - \frac{C^{1-q}}{q} \left\| \sum_{(x,y) \in S} \sum_{\hat{y}} p_{x,y}(\hat{y}) \Phi(x, \hat{y}) - d \right\|_q^q$
over distributions $p_{x,y}$ on the probability simplex over $\mathcal{Y}$.
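
The interpolation is easy to verify numerically: $\ln Z_\epsilon$ is an $\epsilon$-scaled log-sum-exp, which equals the CRF log-partition at $\epsilon = 1$ and tends to the structured hinge max as $\epsilon \to 0$. A small check with made-up loss-augmented scores:

    import numpy as np

    def lnZ_eps(scores, eps):
        """eps * ln sum_y exp(scores_y / eps), scores_y = loss(y, yhat) + w.Phi(x, yhat)."""
        if eps == 0.0:
            return scores.max()                       # structured hinge (max)
        return eps * np.logaddexp.reduce(scores / eps)

    scores = np.array([1.0, 2.5, -0.3])
    for eps in [1.0, 0.5, 0.1, 0.01, 0.0]:
        print(eps, lnZ_eps(scores, eps))              # converges to max = 2.5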

56 Primal-Dual approximated learning algorithm [T. Hazan and R. Urtasun, NIPS 2010]
In many applications the features decompose:
    $\phi_r(x, \hat{y}_1, \dots, \hat{y}_n) = \sum_{v \in V_{r,x}} \phi_{r,v}(x, \hat{y}_v) + \sum_{\alpha \in E_{r,x}} \phi_{r,\alpha}(x, \hat{y}_\alpha)$
Using this we can write the approximate program as
    $\min_{\lambda_{x,y,v \to \alpha},\, w}\ \sum_{(x,y) \in S,\, v} \epsilon c_v \ln \sum_{\hat{y}_v} \exp\left( \frac{\ell_v(y_v, \hat{y}_v) + \sum_{r: v \in V_{r,x}} w_r \phi_{r,v}(x, \hat{y}_v) - \sum_{\alpha \in N(v)} \lambda_{x,y,v \to \alpha}(\hat{y}_v)}{\epsilon c_v} \right)$
    $+ \sum_{(x,y) \in S,\, \alpha} \epsilon c_\alpha \ln \sum_{\hat{y}_\alpha} \exp\left( \frac{\sum_{r: \alpha \in E_{r,x}} w_r \phi_{r,\alpha}(x, \hat{y}_\alpha) + \sum_{v \in N(\alpha)} \lambda_{x,y,v \to \alpha}(\hat{y}_v)}{\epsilon c_\alpha} \right) - d^\top w + \frac{C}{p} \|w\|_p^p$
Coordinate descent algorithm that alternates between sending messages and updating parameters.
Advantage: does not need the MAP or the marginals at each gradient step. Can learn a large set of parameters.
Code will be available soon, including a parallel implementation.

62 Learning algorithm

63 Examples in computer vision

64 Examples
- Depth estimation
- Multi-label prediction
- Object detection
- Non-maxima suppression
- Segmentation
- Sentence generation
- Holistic scene understanding
- 2D pose estimation
- Non-rigid shape estimation
- 3D scene understanding

65 For each application, what do we need to decide?
- Random variables
- Graphical model
- Potentials
- Loss for learning
- Learning algorithm
- Inference algorithm
Let's look at some examples.

66 Depth Estimation [Source: P. Kohli]

67 Stereo matching pairwise [Source: P. Kohli]

68 Stereo matching: energy [Source: P. Kohli]

70 More on pairwise [O. Veksler] [Source: P. Kohli]

71 Graph Structure

72 Learning and inference
There is only one parameter to learn: the importance of the pairwise term relative to the unary term!
Two ways to score results:
- Sum of squared differences: outliers count more
- % of pixels whose disparity error is bigger than $\epsilon$
The latter is how stereo algorithms are typically scored. Which inference method would you choose? And for learning?
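
The two scoring criteria differ in how they treat outliers; a small sketch of both (the threshold $\epsilon$ and the disparity maps are made up):

    import numpy as np

    def ssd(disp, disp_gt):
        """Sum of squared differences: large errors dominate."""
        return float(np.sum((disp - disp_gt) ** 2))

    def bad_pixel_rate(disp, disp_gt, eps=1.0):
        """% of pixels whose disparity error exceeds eps: robust to outliers."""
        return 100.0 * np.mean(np.abs(disp - disp_gt) > eps)

    rng = np.random.default_rng(0)
    gt = rng.uniform(0, 64, (10, 10))
    est = gt + rng.normal(0, 0.3, gt.shape)
    est[0, 0] += 50                                   # one gross outlier
    print(ssd(est, gt), bad_pixel_rate(est, gt))      # SSD explodes; bad-pixel rate barely moves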

73 Example: Object Detection
We can formulate object localization as a regression from an image to a bounding box
    $g : \mathcal{X} \to \mathcal{Y}$
where $\mathcal{X}$ is the space of all images and $\mathcal{Y}$ is the space of all bounding boxes.

74 Joint Kernel Between bboxes [Source: M. Blaschko]

75 Restriction Kernel example [Source: M. Blaschko]

76 Margin Rescaling [Source: M. Blaschko]

77 Constraint Generation with Branch and Bound [Source: M. Blaschko]

78 Sets of Rectangles [Source: M. Blaschko]

79 Optimization [Source: M. Blaschko]

80 Results: PASCAL VOC2006 [Source: M. Blaschko]

81 Results: PASCAL VOC2006 cats [Source: M. Blaschko]

82 Problem: the restriction kernel is like having tunnel vision [Source: M. Blaschko]

84 Global and Local Context Kernels [Source: M. Blaschko]

85 Local Context Kernel [Source: M. Blaschko]

86 Results
Context is a very busy area of research in vision! [Source: M. Blaschko]

87 Example: 3D Indoor Scene Understanding
Task: given an image, predict the 3D parametric cuboid that best describes the layout.

88 Prediction
Variables are not independent of each other, i.e. structured prediction:
- x: input image
- y: room layout
- φ(x, y): multidimensional feature vector
- w: predictor
Estimate the room layout by solving the inference task
    $\hat{y} = \arg\max_y w^\top \phi(x, y)$
Learning w via structured SVMs or CRFs.

89 Single Variable Parameterization
Approaches of [Hedau et al. 09] and [Lee et al. 10]: one random variable y for the entire layout, where every state denotes a different candidate layout. This limits the number of candidate layouts, and it is not really a structured prediction task: the n states / 3D layouts have to be evaluated exhaustively (see the sketch below).
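
Under this parameterization, inference is just exhaustive scoring of the n candidates; a sketch under the assumption that each candidate layout comes with a precomputed feature vector (all names and sizes here are hypothetical):

    import numpy as np

    def predict_layout(w, candidate_features):
        """Score every candidate layout and return the index of the best one: O(n)."""
        scores = candidate_features @ w               # (n_candidates,)
        return int(np.argmax(scores))

    rng = np.random.default_rng(0)
    feats = rng.normal(size=(1000, 20))               # n = 1000 candidate layouts
    w = rng.normal(size=20)
    print(predict_layout(w, feats))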

90 Four Variable Parameterization
Approach of [Wang et al. 10]: four variables $y_i \in \mathcal{Y}$, $i \in \{1, \dots, 4\}$, corresponding to the four degrees of freedom of the problem. One state of $y_i$ denotes the angle of ray $r_i$. This yields high-order potentials, e.g., $50^4$ states for a fourth-order potential. [Figure: rays $r_1, \dots, r_4$ through vanishing points $vp_0$, $vp_1$, $vp_2$ defining angles $y_1, \dots, y_4$]
For both parameterizations, the situation gets even worse when reasoning about objects.

91 Integral Geometry for Features
We follow [Wang et al. 10] and parameterize with four random variables. We follow [Lee et al. 10] and employ the orientation map [Lee et al. 09] and geometric context [Hoiem et al. 07] as image cues. [Figures: orientation map, geometric context]

92 Integral Geometry for Features
Faces $F$ = {left-wall, right-wall, ceiling, floor, front-wall}. Faces are defined by four angles (front-wall) or three angles (otherwise).
    $w^\top \phi(x, y) = \sum_{\alpha \in F} w_{o,\alpha}^\top \phi_{o,\alpha}(x, y_\alpha) + \sum_{\alpha \in F} w_{g,\alpha}^\top \phi_{g,\alpha}(x, y_\alpha)$
Features count frequencies of image cues. [Figure: orientation map and proposed left wall]

93 Integral Geometry for Features
Taking inspiration from integral images, we decompose
    $\phi_{\cdot,\alpha}(x, y_\alpha) = \phi_{\cdot,\{i,j,k\}}(x, y_i, y_j, y_k) = H_{\cdot,\{i,j\}}(x, y_i, y_j) - H_{\cdot,\{j,k\}}(x, y_j, y_k)$
[Figure: integral geometry terms $H_{\cdot,\{i,j\}}(x, y_i, y_j)$ and $H_{\cdot,\{j,k\}}(x, y_j, y_k)$]

94 Integral Geometry for Features
The decomposition corresponds to a factor graph over variables $y_i, y_j, y_k, y_l$ with pairwise factors $\{i, j\}$ and $\{j, k\}$ carrying $H_{\cdot,\{i,j\}}(x, y_i, y_j)$ and $H_{\cdot,\{j,k\}}(x, y_j, y_k)$.
The front-wall feature is obtained from the others:
    $\phi_{\cdot,\text{front-wall}} = \phi(x) - \phi_{\cdot,\text{left-wall}} - \phi_{\cdot,\text{right-wall}} - \phi_{\cdot,\text{ceiling}} - \phi_{\cdot,\text{floor}}$
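
The H terms can be accumulated once and reused, exactly as with integral images. This simplified sketch uses vertical face boundaries instead of vanishing-point rays, a simplification of the paper's construction made for illustration:

    import numpy as np

    def cue_prefix(cue):
        """H(j) = number of 'on' cue pixels strictly left of column j."""
        return np.concatenate([[0.0], np.cumsum(cue.sum(axis=0))])

    def face_feature(H, i, j):
        """Cue count inside the face spanning columns [i, j) = H(j) - H(i)."""
        return H[j] - H[i]

    cue = (np.random.default_rng(0).random((48, 64)) > 0.5).astype(float)
    H = cue_prefix(cue)
    # Any of the O(width^2) candidate faces now costs O(1) instead of O(pixels).
    print(face_feature(H, 10, 30), cue[:, 10:30].sum())  # identical values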

95 Integral Geometry
Same concept as integral images, but in accordance with the vanishing points. [Figure: concept of integral geometry]

96 Learning and Inference
Learning:
- Family of structured prediction problems including CRFs and structured SVMs as special cases [T. Hazan and R. Urtasun, NIPS 2010]
- Primal-dual algorithm based on local updates; fast and works well with a large number of parameters
- Code coming soon!
Inference:
- Parallel convex belief propagation [A. Schwing, T. Hazan, M. Pollefeys and R. Urtasun, CVPR 2011]
- Convergence and other theoretical guarantees
- Code available online: general potentials, cross-platform, Amazon EC2!

97 Time vs Accuracy
Learning is very fast: state-of-the-art after less than a minute! [Figure: error [%] and relative duality gap vs. time [s], for 150 and 200 train/test images]
Inference takes as little as 10 ms per image!

98 Results [A. Schwing, T. Hazan, M. Pollefeys and R. Urtasun, CVPR12]
Table: pixel classification error on the layout dataset of [Hedau et al. 09]. Columns: OM, GC, OM + GC. Rows: [Hoiem07], [Hedau09] (a), [Hedau09] (b), [Wang10], [Lee10], Ours (SVMstruct), Ours (struct-pred). [Numeric entries lost in transcription]
Table: pixel classification error on the bedroom dataset of [Hedau et al. 10]. Rows: [Luca11], [Hoiem07], [Hedau09] (a), Ours w/o box. [Numeric entries lost in transcription]

99 Simple object reasoning
Compatibility of 3D object candidates and layout.

100 Results [A. Schwing, T. Hazan, M. Pollefeys and R. Urtasun, CVPR12]
Table: pixel classification error on the layout dataset of [Hedau et al. 09]. Columns: OM, GC, OM + GC. Rows: [Hoiem07], [Hedau09] (a), [Hedau09] (b), [Wang10], [Lee10], Ours (SVMstruct), Ours (struct-pred). [Numeric entries lost in transcription]
Table: same dataset WITH object reasoning. Columns: OM, GC, OM + GC. Rows: [Wang10], [Lee10], Ours (SVMstruct), Ours (struct-pred). [Numeric entries lost in transcription]

101 Results [A. Schwing, T. Hazan, M. Pollefeys and R. Urtasun, CVPR12]
Table: pixel classification error on the layout dataset of [Hedau et al. 09] with object reasoning. Columns: OM, GC, OM + GC. Rows: [Wang10], [Lee10], Ours (SVMstruct), Ours (struct-pred). [Numeric entries lost in transcription]
Table: pixel classification error on the bedroom dataset of [Hedau et al. 10]. Rows: [Luca11], [Hoiem07], [Hedau09] (a), Ours w/o box, Ours w/ box. [Numeric entries lost in transcription]

102 Qualitative Results

103 Conclusions and Future Work
Conclusion: efficient learning and inference tools for structured prediction based on primal-dual methods.
- Inference: no need for application-specific moves
- Learning: can learn a large number of parameters using local updates
- State-of-the-art results
Future work: more features, better object reasoning, the weakly labeled setting, better inference?


More information

Alternating Direction Method of Multipliers

Alternating Direction Method of Multipliers Alternating Direction Method of Multipliers CS 584: Big Data Analytics Material adapted from Stephen Boyd (https://web.stanford.edu/~boyd/papers/pdf/admm_slides.pdf) & Ryan Tibshirani (http://stat.cmu.edu/~ryantibs/convexopt/lectures/21-dual-meth.pdf)

More information

CS 1674: Intro to Computer Vision. Neural Networks. Prof. Adriana Kovashka University of Pittsburgh November 16, 2016

CS 1674: Intro to Computer Vision. Neural Networks. Prof. Adriana Kovashka University of Pittsburgh November 16, 2016 CS 1674: Intro to Computer Vision Neural Networks Prof. Adriana Kovashka University of Pittsburgh November 16, 2016 Announcements Please watch the videos I sent you, if you haven t yet (that s your reading)

More information

732A54/TDDE31 Big Data Analytics

732A54/TDDE31 Big Data Analytics 732A54/TDDE31 Big Data Analytics Lecture 10: Machine Learning with MapReduce Jose M. Peña IDA, Linköping University, Sweden 1/27 Contents MapReduce Framework Machine Learning with MapReduce Neural Networks

More information

Learning and Inferring Depth from Monocular Images. Jiyan Pan April 1, 2009

Learning and Inferring Depth from Monocular Images. Jiyan Pan April 1, 2009 Learning and Inferring Depth from Monocular Images Jiyan Pan April 1, 2009 Traditional ways of inferring depth Binocular disparity Structure from motion Defocus Given a single monocular image, how to infer

More information

SELF-ADAPTIVE SUPPORT VECTOR MACHINES

SELF-ADAPTIVE SUPPORT VECTOR MACHINES SELF-ADAPTIVE SUPPORT VECTOR MACHINES SELF ADAPTIVE SUPPORT VECTOR MACHINES AND AUTOMATIC FEATURE SELECTION By PENG DU, M.Sc., B.Sc. A thesis submitted to the School of Graduate Studies in Partial Fulfillment

More information

Advanced Operations Research Techniques IE316. Quiz 2 Review. Dr. Ted Ralphs

Advanced Operations Research Techniques IE316. Quiz 2 Review. Dr. Ted Ralphs Advanced Operations Research Techniques IE316 Quiz 2 Review Dr. Ted Ralphs IE316 Quiz 2 Review 1 Reading for The Quiz Material covered in detail in lecture Bertsimas 4.1-4.5, 4.8, 5.1-5.5, 6.1-6.3 Material

More information

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2016

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2016 CPSC 340: Machine Learning and Data Mining Principal Component Analysis Fall 2016 A2/Midterm: Admin Grades/solutions will be posted after class. Assignment 4: Posted, due November 14. Extra office hours:

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Raquel Urtasun and Tamir Hazan TTI Chicago April 22, 2011 Raquel Urtasun and Tamir Hazan (TTI-C) Graphical Models April 22, 2011 1 / 22 If the graph is non-chordal, then

More information

MVE165/MMG630, Applied Optimization Lecture 8 Integer linear programming algorithms. Ann-Brith Strömberg

MVE165/MMG630, Applied Optimization Lecture 8 Integer linear programming algorithms. Ann-Brith Strömberg MVE165/MMG630, Integer linear programming algorithms Ann-Brith Strömberg 2009 04 15 Methods for ILP: Overview (Ch. 14.1) Enumeration Implicit enumeration: Branch and bound Relaxations Decomposition methods:

More information

Structured Output Learning with High Order Loss Functions

Structured Output Learning with High Order Loss Functions Daniel Tarlow Department of Computer Science University of Toronto Richard S. Zemel Department of Computer Science University of Toronto Abstract Often when modeling structured domains, it is desirable

More information

Segmentation. Bottom up Segmentation Semantic Segmentation

Segmentation. Bottom up Segmentation Semantic Segmentation Segmentation Bottom up Segmentation Semantic Segmentation Semantic Labeling of Street Scenes Ground Truth Labels 11 classes, almost all occur simultaneously, large changes in viewpoint, scale sky, road,

More information

Learning and Recognizing Visual Object Categories Without First Detecting Features

Learning and Recognizing Visual Object Categories Without First Detecting Features Learning and Recognizing Visual Object Categories Without First Detecting Features Daniel Huttenlocher 2007 Joint work with D. Crandall and P. Felzenszwalb Object Category Recognition Generic classes rather

More information

Self Lane Assignment Using Smart Mobile Camera For Intelligent GPS Navigation and Traffic Interpretation

Self Lane Assignment Using Smart Mobile Camera For Intelligent GPS Navigation and Traffic Interpretation For Intelligent GPS Navigation and Traffic Interpretation Tianshi Gao Stanford University tianshig@stanford.edu 1. Introduction Imagine that you are driving on the highway at 70 mph and trying to figure

More information

A Short SVM (Support Vector Machine) Tutorial

A Short SVM (Support Vector Machine) Tutorial A Short SVM (Support Vector Machine) Tutorial j.p.lewis CGIT Lab / IMSC U. Southern California version 0.zz dec 004 This tutorial assumes you are familiar with linear algebra and equality-constrained optimization/lagrange

More information

Structured Models in. Dan Huttenlocher. June 2010

Structured Models in. Dan Huttenlocher. June 2010 Structured Models in Computer Vision i Dan Huttenlocher June 2010 Structured Models Problems where output variables are mutually dependent or constrained E.g., spatial or temporal relations Such dependencies

More information

Multiple cosegmentation

Multiple cosegmentation Armand Joulin, Francis Bach and Jean Ponce. INRIA -Ecole Normale Supérieure April 25, 2012 Segmentation Introduction Segmentation Supervised and weakly-supervised segmentation Cosegmentation Segmentation

More information

Support Vector Machines

Support Vector Machines Support Vector Machines RBF-networks Support Vector Machines Good Decision Boundary Optimization Problem Soft margin Hyperplane Non-linear Decision Boundary Kernel-Trick Approximation Accurancy Overtraining

More information

Convex Optimization and Machine Learning

Convex Optimization and Machine Learning Convex Optimization and Machine Learning Mengliu Zhao Machine Learning Reading Group School of Computing Science Simon Fraser University March 12, 2014 Mengliu Zhao SFU-MLRG March 12, 2014 1 / 25 Introduction

More information

The More the Merrier: Parameter Learning for Graphical Models with Multiple MAPs

The More the Merrier: Parameter Learning for Graphical Models with Multiple MAPs The More the Merrier: Parameter Learning for Graphical Models with Multiple MAPs Franziska Meier fmeier@usc.edu U. of Southern California, 94 Bloom Walk, Los Angeles, CA 989 USA Amir Globerson amir.globerson@mail.huji.ac.il

More information

LOGISTIC REGRESSION FOR MULTIPLE CLASSES

LOGISTIC REGRESSION FOR MULTIPLE CLASSES Peter Orbanz Applied Data Mining Not examinable. 111 LOGISTIC REGRESSION FOR MULTIPLE CLASSES Bernoulli and multinomial distributions The mulitnomial distribution of N draws from K categories with parameter

More information

Three-Dimensional Object Detection and Layout Prediction using Clouds of Oriented Gradients

Three-Dimensional Object Detection and Layout Prediction using Clouds of Oriented Gradients ThreeDimensional Object Detection and Layout Prediction using Clouds of Oriented Gradients Authors: Zhile Ren, Erik B. Sudderth Presented by: Shannon Kao, Max Wang October 19, 2016 Introduction Given an

More information

Local cues and global constraints in image understanding

Local cues and global constraints in image understanding Local cues and global constraints in image understanding Olga Barinova Lomonosov Moscow State University *Many slides adopted from the courses of Anton Konushin Image understanding «To see means to know

More information

Analysis: TextonBoost and Semantic Texton Forests. Daniel Munoz Februrary 9, 2009

Analysis: TextonBoost and Semantic Texton Forests. Daniel Munoz Februrary 9, 2009 Analysis: TextonBoost and Semantic Texton Forests Daniel Munoz 16-721 Februrary 9, 2009 Papers [shotton-eccv-06] J. Shotton, J. Winn, C. Rother, A. Criminisi, TextonBoost: Joint Appearance, Shape and Context

More information

Supplemental Material: Discovering Groups of People in Images

Supplemental Material: Discovering Groups of People in Images Supplemental Material: Discovering Groups of People in Images Wongun Choi 1, Yu-Wei Chao 2, Caroline Pantofaru 3 and Silvio Savarese 4 1. NEC Laboratories 2. University of Michigan, Ann Arbor 3. Google,

More information