Visual Recognition: Examples of Graphical Models

1 Visual Recognition: Examples of Graphical Models
Raquel Urtasun, TTI Chicago, March 6, 2012

2 Graphical models
- Applications
- Representation
- Inference: message passing (LP relaxations), graph cuts
- Learning

3 Learning in graphical models

4 Parameter learning
The MAP problem was defined as
    $\max_{y_1,\dots,y_n} \sum_i \theta_i(y_i) + \sum_\alpha \theta_\alpha(y_\alpha)$
Learn parameters w for more accurate prediction:
    $\max_{y_1,\dots,y_n} \sum_i w^\top \phi_i(y_i) + \sum_\alpha w^\top \phi_\alpha(y_\alpha)$

5 Loss functions
Regularized loss minimization: given input pairs $(x, y) \in S$, minimize
    $\sum_{(x,y) \in S} \hat{\ell}(w, x, y) + \frac{C}{p} \|w\|_p^p$
Different learning frameworks depending on the surrogate loss $\hat{\ell}(w, x, y)$:
- Hinge loss for structural SVMs [Tsochantaridis et al. 05, Taskar et al. 04]
- Log-loss for conditional random fields [Lafferty et al. 01]
- Unified by [Hazan and Urtasun, 10]

7 Recall SVM
In SVMs we minimize the following program
    $\min_w \frac{1}{2}\|w\|^2 + \sum_i \xi_i$
    s.t. $y_i(b + w^\top x_i) - 1 + \xi_i \ge 0, \quad i = 1, \dots, N$
with $y_i \in \{-1, 1\}$ binary. We need to extend this to reason about more complex structures, not just binary variables.
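
The recap can be made concrete with a minimal sketch (mine, not from the slides): stochastic subgradient descent on the unconstrained hinge-loss form of the binary SVM objective above, with made-up toy data and hyperparameters.

    import numpy as np

    def train_binary_svm(X, y, lam=0.1, epochs=100, lr=0.01):
        """Minimize lam/2 ||w||^2 + mean_i max(0, 1 - y_i (w.x_i + b))."""
        n, d = X.shape
        w, b = np.zeros(d), 0.0
        for _ in range(epochs):
            for i in np.random.permutation(n):
                margin = y[i] * (X[i] @ w + b)
                # Subgradient of the hinge term is -y_i x_i when the margin is violated.
                gw = lam * w - (y[i] * X[i] if margin < 1 else 0.0)
                gb = -y[i] if margin < 1 else 0.0
                w -= lr * gw
                b -= lr * gb
        return w, b

    # Toy linearly separable data: two Gaussian blobs with labels in {-1, +1}.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
    y = np.array([-1] * 50 + [1] * 50)
    w, b = train_binary_svm(X, y)
    print("train accuracy:", np.mean(np.sign(X @ w + b) == y))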

8 Structural SVM [Tsochantaridis et al., 05]
We want to construct a function
    $f(x, w) = \arg\max_{y \in \mathcal{Y}} w^\top \phi(x, y)$
which is parameterized in terms of w, the parameters to learn.
We would like to minimize the empirical risk
    $R_S(f, w) = \frac{1}{n} \sum_{i=1}^n \Delta(y_i, f(x_i, w))$
This is the expected loss under the induced empirical distribution.
$\Delta(y_i, f(x_i, w))$ is the task loss, which depends on the application:
- Segmentation: per-pixel segmentation error
- Detection: intersection over union
Typically $\Delta(y, y') = 0$ if $y = y'$.
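
As a concrete illustration of the two task losses named on the slide, here is a small sketch (names and the (x1, y1, x2, y2) box format are my own choices):

    import numpy as np

    def per_pixel_loss(y_true, y_pred):
        """Fraction of pixels labeled differently (segmentation task loss)."""
        return np.mean(y_true != y_pred)

    def iou_loss(box_a, box_b):
        """1 - intersection-over-union for boxes given as (x1, y1, x2, y2)."""
        ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
        iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
        inter = ix * iy
        area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
        union = area(box_a) + area(box_b) - inter
        return 1.0 - inter / union

    print(iou_loss((0, 0, 10, 10), (0, 0, 10, 10)))  # 0.0: Delta(y, y) = 0
    print(iou_loss((0, 0, 10, 10), (5, 5, 15, 15)))  # > 0 for a mismatch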

13 Separable case
We would like to minimize the empirical risk
    $R_S(f, w) = \frac{1}{n} \sum_{i=1}^n \Delta(y_i, f(x_i, w))$
We will have 0 training error if we satisfy
    $\max_{y \in \mathcal{Y} \setminus y_i} \{ w^\top \phi(x_i, y) \} \le w^\top \phi(x_i, y_i)$
since $\Delta(y_i, y_i) = 0$ and $\Delta(y_i, y) > 0$ for all $y \in \mathcal{Y} \setminus y_i$.
This can be replaced by $|\mathcal{Y}| - 1$ inequalities per example:
    $\forall i \in \{1, \dots, n\},\ \forall y \in \mathcal{Y} \setminus y_i: \quad w^\top \phi(x_i, y_i) - w^\top \phi(x_i, y) \ge 0$
What's the problem with this?

17 Separable case
The inequalities may be satisfied by more than one solution w. Select the w with the maximum margin.
We can thus form the following optimization problem:
    $\min_w \frac{1}{2}\|w\|^2$
    s.t. $w^\top \phi(x_i, y_i) - w^\top \phi(x_i, y) \ge 1 \quad \forall i \in \{1, \dots, n\},\ y \in \mathcal{Y} \setminus y_i$
This is a quadratic program, so it is convex. But it involves exponentially many constraints!

22 Non-separable case
Multiple formulations:
- Multi-class classification [Crammer & Singer, 03]
- Slack re-scaling [Tsochantaridis et al. 05]
- Margin re-scaling [Taskar et al. 04]
Let's look at them in more detail.

23 Multi-class classification [Crammer & Singer, 03]
Enforce a large margin and do a batch convex optimization. The minimization program is then
    $\min_w \frac{1}{2}\|w\|^2 + \frac{C}{n} \sum_{i=1}^n \xi_i$
    s.t. $w^\top \phi(x_i, y_i) - w^\top \phi(x_i, y) \ge 1 - \xi_i \quad \forall i \in \{1, \dots, n\},\ y \ne y_i$
Can also be written in terms of kernels.
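
A minimal sketch of this objective for a toy multi-class case; the joint feature map φ(x, y) is taken here as the usual class-indicator stacking, which is my assumption rather than something stated on the slide.

    import numpy as np

    def joint_feature(x, y, num_classes):
        """phi(x, y): copy of x in the block of class y, zeros elsewhere."""
        phi = np.zeros(num_classes * x.size)
        phi[y * x.size:(y + 1) * x.size] = x
        return phi

    def multiclass_objective(w, X, Y, num_classes, C=1.0):
        """(1/2)||w||^2 + (C/n) sum_i xi_i, with xi_i the smallest feasible slack."""
        slacks = []
        for x, yi in zip(X, Y):
            score_true = w @ joint_feature(x, yi, num_classes)
            worst = max(w @ joint_feature(x, y, num_classes)
                        for y in range(num_classes) if y != yi)
            slacks.append(max(0.0, 1.0 - (score_true - worst)))
        return 0.5 * w @ w + C / len(X) * sum(slacks)

    # Toy check: 3 classes, 2D inputs, random w (a hypothetical setup).
    rng = np.random.default_rng(0)
    X = [rng.normal(size=2) for _ in range(4)]
    Y = [0, 1, 2, 1]
    w = rng.normal(size=3 * 2)
    print(multiclass_objective(w, X, Y, num_classes=3))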

24 Structured Output SVMs
Frame structured prediction as a multiclass problem: predict a single element of $\mathcal{Y}$ and pay a penalty for mistakes.
Not all errors are created equal: in an HMM, making only one mistake in a sequence should be penalized less than making 50 mistakes.
Pay a loss proportional to the difference between the true and predicted outputs (task dependent): $\Delta(y_i, y)$. [Source: M. Blaschko]

27 Slack re-scaling
Re-scale the slack variables according to the loss incurred in each of the linear constraints: violating a margin constraint involving a $y \ne y_i$ with high loss $\Delta(y_i, y)$ should be penalized more than a violation involving an output value with smaller loss.
The minimization program is then
    $\min_w \frac{1}{2}\|w\|^2 + \frac{C}{n} \sum_{i=1}^n \xi_i$
    s.t. $w^\top \phi(x_i, y_i) - w^\top \phi(x_i, y) \ge 1 - \frac{\xi_i}{\Delta(y_i, y)} \quad \forall i \in \{1, \dots, n\},\ y \in \mathcal{Y} \setminus y_i$
The justification is that $\frac{1}{n} \sum_{i=1}^n \xi_i$ is an upper bound on the empirical risk. Easy to prove.

31 Margin re-scaling
In this case the minimization problem is formulated as
    $\min_w \frac{1}{2}\|w\|^2 + \frac{C}{n} \sum_{i=1}^n \xi_i$
    s.t. $w^\top \phi(x_i, y_i) - w^\top \phi(x_i, y) \ge \Delta(y_i, y) - \xi_i \quad \forall i \in \{1, \dots, n\},\ y \in \mathcal{Y} \setminus y_i$
The justification is that $\frac{1}{n} \sum_{i=1}^n \xi_i$ is an upper bound on the empirical risk. Also easy to prove.
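
To see how the two re-scalings differ, this sketch computes the minimal feasible slack $\xi_i$ for one example under each formulation by enumerating a small label set; the scores and losses are made up for illustration.

    import numpy as np

    def min_slacks(scores, loss, y_true):
        """scores[y] = w.phi(x_i, y); loss[y] = Delta(y_i, y)."""
        margin_xi, slack_xi = 0.0, 0.0
        for y in range(len(scores)):
            if y == y_true:
                continue
            violation = scores[y] - scores[y_true]  # how much y beats the truth
            # Margin rescaling: need scores[y_true] - scores[y] >= loss[y] - xi.
            margin_xi = max(margin_xi, loss[y] + violation)
            # Slack rescaling: need scores[y_true] - scores[y] >= 1 - xi / loss[y].
            slack_xi = max(slack_xi, loss[y] * (1.0 + violation))
        return margin_xi, slack_xi

    scores = np.array([2.0, 1.5, -1.0])   # label 0 is the ground truth
    loss = np.array([0.0, 5.0, 0.2])      # label 1 is a "bad" mistake
    print(min_slacks(scores, loss, y_true=0))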

33 Constraint Generation
To find the most violated constraint, we need to maximize w.r.t. y. For margin rescaling:
    $\max_y\ w^\top \phi(x_i, y) + \Delta(y_i, y)$
and for slack rescaling:
    $\max_y\ \Delta(y_i, y) \{ w^\top \phi(x_i, y) + 1 - w^\top \phi(x_i, y_i) \}$
For arbitrary output spaces, we would need to iterate over all elements in $\mathcal{Y}$. Use graph cuts or message passing instead.
When the MAP cannot be computed exactly, but only approximately, this algorithm does not behave well [Finley et al., 08].
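
For chain-structured models with a Hamming loss, the margin-rescaled maximization decomposes over nodes, so ordinary Viterbi message passing finds the most violated constraint. A minimal sketch under exactly those assumptions (the score matrices are hypothetical):

    import numpy as np

    def loss_augmented_viterbi(unary, pairwise, y_true):
        """argmax_y sum_t unary[t, y_t] + sum_t pairwise[y_t, y_{t+1}] + Hamming(y, y_true).

        unary: (T, K) node scores; pairwise: (K, K) transition scores."""
        T, K = unary.shape
        # Hamming loss adds 1 to every label that disagrees with the ground truth.
        aug = unary + (np.arange(K)[None, :] != np.array(y_true)[:, None])
        best = aug[0].copy()
        back = np.zeros((T, K), dtype=int)
        for t in range(1, T):
            cand = best[:, None] + pairwise          # (prev, cur)
            back[t] = cand.argmax(axis=0)
            best = cand.max(axis=0) + aug[t]
        y = [int(best.argmax())]
        for t in range(T - 1, 0, -1):
            y.append(int(back[t][y[-1]]))
        return y[::-1]

    unary = np.random.default_rng(1).normal(size=(5, 3))
    pairwise = np.eye(3)                              # favor smooth sequences
    print(loss_augmented_viterbi(unary, pairwise, y_true=[0, 0, 1, 1, 2]))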

37 One Slack Formulation
Margin rescaling:
    $\min_w \frac{1}{2}\|w\|^2 + \frac{C}{n} \xi$
    s.t. $w^\top \phi(x_i, y_i) - w^\top \phi(x_i, y) \ge \Delta(y_i, y) - \xi \quad \forall i \in \{1, \dots, n\},\ y \in \mathcal{Y} \setminus y_i$
Slack rescaling:
    $\min_w \frac{1}{2}\|w\|^2 + \frac{C}{n} \xi$
    s.t. $w^\top \phi(x_i, y_i) - w^\top \phi(x_i, y) \ge 1 - \frac{\xi}{\Delta(y_i, y)} \quad \forall i \in \{1, \dots, n\},\ y \in \mathcal{Y} \setminus y_i$
Same optima as the previous formulation [Joachims et al, 09].

40 Example: Handwritten Recognition
Predict text from an image of handwritten characters. [Source: B. Taskar]
Training iterates between two steps (see the cutting-plane sketch below):
- Estimate model parameters w using the active constraint set
- Generate the next constraint
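
The two alternating steps can be sketched as a cutting-plane loop. To keep this self-contained, the QP over the active set is replaced by a few subgradient steps; that simplification, the toy 3-class data, and the 0/1 loss are mine, not the slides'.

    import numpy as np

    def cutting_plane(X, Y, phi, loss, labels, C=1.0, rounds=20, lr=0.05):
        """Margin-rescaling training: alternate (1) fit w on active constraints,
        (2) add the currently most violated constraint per example."""
        w = np.zeros(phi(X[0], Y[0]).size)
        active = [set() for _ in X]                   # constraint set per example
        for _ in range(rounds):
            # (2) generate the next constraint: loss-augmented argmax.
            for i, (x, yi) in enumerate(zip(X, Y)):
                ybar = max(labels, key=lambda y: w @ phi(x, y) + loss(yi, y))
                active[i].add(ybar)
            # (1) re-estimate w on the active set by subgradient steps.
            for _ in range(50):
                g = w.copy()
                for i, (x, yi) in enumerate(zip(X, Y)):
                    xi, worst = 0.0, None
                    for y in active[i]:
                        v = loss(yi, y) - (w @ phi(x, yi) - w @ phi(x, y))
                        if v > xi:
                            xi, worst = v, y
                    if worst is not None:
                        g += C / len(X) * (phi(x, worst) - phi(x, yi))
                w -= lr * g
        return w

    # Tiny demo: 3 classes in 2D, 0/1 task loss (made-up data).
    rng = np.random.default_rng(0)
    means = np.array([[0.0, 3.0], [3.0, 0.0], [-3.0, -3.0]])
    Y = [0, 1, 2] * 5
    X = [means[y] + rng.normal(size=2) for y in Y]
    phi = lambda x, y: np.concatenate([x if c == y else np.zeros(2) for c in range(3)])
    loss = lambda yt, y: float(yt != y)
    w = cutting_plane(X, Y, phi, loss, labels=[0, 1, 2])
    preds = [max([0, 1, 2], key=lambda y: w @ phi(x, y)) for x in X]
    print("train accuracy:", np.mean([p == t for p, t in zip(preds, Y)]))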

41 Conditional Random Fields
Regularized loss minimization: given input pairs $(x, y) \in S$, minimize
    $\sum_{(x,y) \in S} \hat{\ell}(w, x, y) + \frac{C}{p} \|w\|_p^p$
CRF loss: the conditional distribution is
    $p_{x,y}(\hat{y}; w) = \frac{1}{Z(x, y)} \exp(\ell(y, \hat{y}) + w^\top \Phi(x, \hat{y}))$
    $Z(x, y) = \sum_{\hat{y} \in \mathcal{Y}} \exp(\ell(y, \hat{y}) + w^\top \Phi(x, \hat{y}))$
where $\ell(y, \hat{y})$ acts as a prior distribution and $Z(x, y)$ is the partition function, and
    $\ell_{\log}(w, x, y) = \ln \frac{1}{p_{x,y}(y; w)}$

43 CRF learning
In CRFs one aims to minimize the regularized negative log-likelihood of the conditional distribution:
    (CRF) $\min_w \sum_{(x,y) \in S} \ln Z(x, y) - d^\top w + \frac{C}{p} \|w\|_p^p$
where $(x, y) \in S$ ranges over the training pairs and $d = \sum_{(x,y) \in S} \Phi(x, y)$ is the vector of empirical means.
In coordinate descent methods, each coordinate $w_r$ is iteratively updated in the direction of the negative gradient, for some step size $\eta$. The gradient of the log-partition function corresponds to the probability distribution $p(\hat{y} \mid x, y; w)$, and the direction of descent takes the form
    $\sum_{(x,y) \in S} \sum_{\hat{y}} p(\hat{y} \mid x, y; w) \Phi_r(x, \hat{y}) - d_r + |w_r|^{p-1} \operatorname{sign}(w_r)$
Problem: this requires computing the partition function!
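
For intuition, $\ln Z$ and the expectation term of the gradient can be computed by brute-force enumeration on a tiny output space; this is only viable when $|\mathcal{Y}|$ is small, and the feature map and Hamming loss in the demo are made up.

    import numpy as np
    from itertools import product

    def crf_lnZ_and_marginal_features(w, x, y_true, labels, T, phi, loss):
        """Enumerate all yhat in labels^T to get ln Z(x, y) and
        E_{p(yhat|x,y;w)}[Phi(x, yhat)], the first term of the gradient."""
        scores, feats = [], []
        for yhat in product(labels, repeat=T):
            feats.append(phi(x, yhat))
            scores.append(loss(y_true, yhat) + w @ feats[-1])
        scores = np.array(scores)
        lnZ = np.logaddexp.reduce(scores)
        p = np.exp(scores - lnZ)                      # p(yhat | x, y; w)
        return lnZ, p @ np.array(feats)

    # The gradient of ln Z(x, y) - d.w is then `expected - Phi(x, y_true)` per example.
    lnZ, expected = crf_lnZ_and_marginal_features(
        w=np.array([0.2, -0.1]), x=None, y_true=(0, 1, 0), labels=[0, 1], T=3,
        phi=lambda x, yh: np.bincount(np.array(yh), minlength=2).astype(float),
        loss=lambda yt, yh: float(sum(a != b for a, b in zip(yt, yh))))
    print(lnZ, expected)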

47 Loss functions
Regularized loss minimization: given input pairs $(x, y) \in S$, minimize
    $\sum_{(x,y) \in S} \hat{\ell}(w, x, y) + \frac{C}{p} \|w\|_p^p$
In structured SVMs:
    $\ell_{\text{hinge}}(w, x, y) = \max_{\hat{y} \in \mathcal{Y}} \{ \ell(y, \hat{y}) + w^\top \Phi(x, \hat{y}) - w^\top \Phi(x, y) \}$
CRF loss: the conditional distribution is
    $p_{x,y}(\hat{y}; w) = \frac{1}{Z(x, y)} \exp(\ell(y, \hat{y}) + w^\top \Phi(x, \hat{y})), \quad Z(x, y) = \sum_{\hat{y} \in \mathcal{Y}} \exp(\ell(y, \hat{y}) + w^\top \Phi(x, \hat{y}))$
where $\ell(y, \hat{y})$ acts as a prior distribution and $Z(x, y)$ is the partition function, and $\ell_{\log}(w, x, y) = \ln \frac{1}{p_{x,y}(y; w)}$.

50 Relation between loss functions
The CRF program is
    (CRF) $\min_w \sum_{(x,y) \in S} \ln Z(x, y) - d^\top w + \frac{C}{p} \|w\|_p^p$
where $(x, y) \in S$ ranges over training pairs, $d = \sum_{(x,y) \in S} \Phi(x, y)$ is the vector of empirical means, and
    $Z(x, y) = \sum_{\hat{y} \in \mathcal{Y}} \exp(\ell(y, \hat{y}) + w^\top \Phi(x, \hat{y}))$
In structured SVMs:
    (structured SVM) $\min_w \sum_{(x,y) \in S} \max_{\hat{y} \in \mathcal{Y}} \{ \ell(y, \hat{y}) + w^\top \Phi(x, \hat{y}) \} - d^\top w + \frac{C}{p} \|w\|_p^p$

52 A family of structured prediction problems [T. Hazan and R. Urtasun, NIPS 2010]
One-parameter extension of CRFs and structured SVMs:
    $\min_w \sum_{(x,y) \in S} \ln Z_\epsilon(x, y) - d^\top w + \frac{C}{p} \|w\|_p^p$
where d is the vector of empirical means, and
    $\ln Z_\epsilon(x, y) = \epsilon \ln \sum_{\hat{y} \in \mathcal{Y}} \exp\left( \frac{\ell(y, \hat{y}) + w^\top \Phi(x, \hat{y})}{\epsilon} \right)$
CRF if $\epsilon = 1$, structured SVM if $\epsilon = 0$, respectively. Introduces the notion of loss in CRFs.
The dual takes the form
    $\max\ \sum_{(x,y) \in S} \left[ \epsilon H(p_{x,y}) + \sum_{\hat{y} \in \mathcal{Y}} p_{x,y}(\hat{y}) \ell(y, \hat{y}) \right] - \frac{C^{1-q}}{q} \left\| \sum_{(x,y) \in S} \sum_{\hat{y}} p_{x,y}(\hat{y}) \Phi(x, \hat{y}) - d \right\|_q^q$
over distributions $p_{x,y}$ on the probability simplex over $\mathcal{Y}$.
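
The interpolation is easy to verify numerically: $\ln Z_\epsilon$ is an $\epsilon$-scaled log-sum-exp, which equals the CRF log-partition at $\epsilon = 1$ and tends to the structured hinge max as $\epsilon \to 0$. A small check with made-up loss-augmented scores:

    import numpy as np

    def lnZ_eps(scores, eps):
        """eps * ln sum_y exp(scores_y / eps), scores_y = loss(y, yhat) + w.Phi(x, yhat)."""
        if eps == 0.0:
            return scores.max()                       # structured hinge (max)
        return eps * np.logaddexp.reduce(scores / eps)

    scores = np.array([1.0, 2.5, -0.3])
    for eps in [1.0, 0.5, 0.1, 0.01, 0.0]:
        print(eps, lnZ_eps(scores, eps))              # converges to max = 2.5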

56 Primal-Dual approximated learning algorithm [T. Hazan and R. Urtasun, NIPS 2010]
In many applications the features decompose:
    $\phi_r(x, \hat{y}_1, \dots, \hat{y}_n) = \sum_{v \in V_{r,x}} \phi_{r,v}(x, \hat{y}_v) + \sum_{\alpha \in E_{r,x}} \phi_{r,\alpha}(x, \hat{y}_\alpha)$
Using this we can write the approximate program as
    $\min_{\lambda_{x,y,v \to \alpha},\, w}\ \sum_{(x,y) \in S,\, v} \epsilon c_v \ln \sum_{\hat{y}_v} \exp\left( \frac{\ell_v(y_v, \hat{y}_v) + \sum_{r: v \in V_{r,x}} w_r \phi_{r,v}(x, \hat{y}_v) - \sum_{\alpha \in N(v)} \lambda_{x,y,v \to \alpha}(\hat{y}_v)}{\epsilon c_v} \right)$
    $+ \sum_{(x,y) \in S,\, \alpha} \epsilon c_\alpha \ln \sum_{\hat{y}_\alpha} \exp\left( \frac{\sum_{r: \alpha \in E_{r,x}} w_r \phi_{r,\alpha}(x, \hat{y}_\alpha) + \sum_{v \in N(\alpha)} \lambda_{x,y,v \to \alpha}(\hat{y}_v)}{\epsilon c_\alpha} \right) - d^\top w + \frac{C}{p} \|w\|_p^p$
Coordinate descent algorithm that alternates between sending messages and updating parameters.
Advantage: does not need the MAP or the marginals at each gradient step. Can learn a large set of parameters.
Code will be available soon, including a parallel implementation.

62 Learning algorithm

63 Examples in computer vision

64 Examples
- Depth estimation
- Multi-label prediction
- Object detection
- Non-maxima suppression
- Segmentation
- Sentence generation
- Holistic scene understanding
- 2D pose estimation
- Non-rigid shape estimation
- 3D scene understanding

65 For each application, what do we need to decide?
- Random variables
- Graphical model
- Potentials
- Loss for learning
- Learning algorithm
- Inference algorithm
Let's look at some examples.

66 Depth Estimation [Source: P. Kohli]

67 Stereo matching pairwise [Source: P. Kohli]

68 Stereo matching: energy [Source: P. Kohli]

70 More on pairwise [O. Veksler] [Source: P. Kohli]

71 Graph Structure

72 Learning and inference
There is only one parameter to learn: the importance of the pairwise term relative to the unary term!
Two ways to score results:
- Sum of squared differences: outliers count more
- % of pixels whose disparity error is bigger than $\epsilon$
The latter is how stereo algorithms are typically scored. Which inference method would you choose? And for learning?
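
The two scoring criteria differ in how they treat outliers; a small sketch of both (the threshold $\epsilon$ and the disparity maps are made up):

    import numpy as np

    def ssd(disp, disp_gt):
        """Sum of squared differences: large errors dominate."""
        return float(np.sum((disp - disp_gt) ** 2))

    def bad_pixel_rate(disp, disp_gt, eps=1.0):
        """% of pixels whose disparity error exceeds eps: robust to outliers."""
        return 100.0 * np.mean(np.abs(disp - disp_gt) > eps)

    rng = np.random.default_rng(0)
    gt = rng.uniform(0, 64, (10, 10))
    est = gt + rng.normal(0, 0.3, gt.shape)
    est[0, 0] += 50                                   # one gross outlier
    print(ssd(est, gt), bad_pixel_rate(est, gt))      # SSD explodes; bad-pixel rate barely moves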

73 Example: Object Detection
We can formulate object localization as a regression from an image to a bounding box
    $g : \mathcal{X} \to \mathcal{Y}$
where $\mathcal{X}$ is the space of all images and $\mathcal{Y}$ is the space of all bounding boxes.

74 Joint Kernel Between bboxes [Source: M. Blaschko]

75 Restriction Kernel example [Source: M. Blaschko]

76 Margin Rescaling [Source: M. Blaschko]

77 Constraint Generation with Branch and Bound [Source: M. Blaschko]

78 Sets of Rectangles [Source: M. Blaschko]

79 Optimization [Source: M. Blaschko]

80 Results: PASCAL VOC2006 [Source: M. Blaschko]

81 Results: PASCAL VOC2006 cats [Source: M. Blaschko]

82 Problem: the restriction kernel is like having tunnel vision [Source: M. Blaschko]

84 Global and Local Context Kernels [Source: M. Blaschko]

85 Local Context Kernel [Source: M. Blaschko]

86 Results
Context is a very busy area of research in vision! [Source: M. Blaschko]

87 Example: 3D Indoor Scene Understanding
Task: given an image, predict the 3D parametric cuboid that best describes the layout.

88 Prediction
Variables are not independent of each other, i.e. structured prediction:
- x: input image
- y: room layout
- φ(x, y): multidimensional feature vector
- w: predictor
Estimate the room layout by solving the inference task
    $\hat{y} = \arg\max_y w^\top \phi(x, y)$
Learning w via structured SVMs or CRFs.

89 Single Variable Parameterization
Approaches of [Hedau et al. 09] and [Lee et al. 10]: one random variable y for the entire layout, where every state denotes a different candidate layout. This limits the number of candidate layouts, and it is not really a structured prediction task: the n states / 3D layouts have to be evaluated exhaustively (see the sketch below).
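
Under this parameterization, inference is just exhaustive scoring of the n candidates; a sketch under the assumption that each candidate layout comes with a precomputed feature vector (all names and sizes here are hypothetical):

    import numpy as np

    def predict_layout(w, candidate_features):
        """Score every candidate layout and return the index of the best one: O(n)."""
        scores = candidate_features @ w               # (n_candidates,)
        return int(np.argmax(scores))

    rng = np.random.default_rng(0)
    feats = rng.normal(size=(1000, 20))               # n = 1000 candidate layouts
    w = rng.normal(size=20)
    print(predict_layout(w, feats))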

90 Four Variable Parameterization
Approach of [Wang et al. 10]: four variables $y_i \in \mathcal{Y}$, $i \in \{1, \dots, 4\}$, corresponding to the four degrees of freedom of the problem. One state of $y_i$ denotes the angle of ray $r_i$. This yields high-order potentials, e.g., $50^4$ states for a fourth-order potential. [Figure: rays $r_1, \dots, r_4$ through vanishing points $vp_0$, $vp_1$, $vp_2$ defining angles $y_1, \dots, y_4$]
For both parameterizations, the situation gets even worse when reasoning about objects.

91 Integral Geometry for Features
We follow [Wang et al. 10] and parameterize with four random variables. We follow [Lee et al. 10] and employ the orientation map [Lee et al. 09] and geometric context [Hoiem et al. 07] as image cues. [Figures: orientation map, geometric context]

92 Integral Geometry for Features
Faces $F$ = {left-wall, right-wall, ceiling, floor, front-wall}. Faces are defined by four angles (front-wall) or three angles (otherwise).
    $w^\top \phi(x, y) = \sum_{\alpha \in F} w_{o,\alpha}^\top \phi_{o,\alpha}(x, y_\alpha) + \sum_{\alpha \in F} w_{g,\alpha}^\top \phi_{g,\alpha}(x, y_\alpha)$
Features count frequencies of image cues. [Figure: orientation map and proposed left wall]

93 Integral Geometry for Features
Taking inspiration from integral images, we decompose
    $\phi_{\cdot,\alpha}(x, y_\alpha) = \phi_{\cdot,\{i,j,k\}}(x, y_i, y_j, y_k) = H_{\cdot,\{i,j\}}(x, y_i, y_j) - H_{\cdot,\{j,k\}}(x, y_j, y_k)$
[Figure: integral geometry terms $H_{\cdot,\{i,j\}}(x, y_i, y_j)$ and $H_{\cdot,\{j,k\}}(x, y_j, y_k)$]

94 Integral Geometry for Features
The decomposition corresponds to a factor graph over variables $y_i, y_j, y_k, y_l$ with pairwise factors $\{i, j\}$ and $\{j, k\}$ carrying $H_{\cdot,\{i,j\}}(x, y_i, y_j)$ and $H_{\cdot,\{j,k\}}(x, y_j, y_k)$.
The front-wall feature is obtained from the others:
    $\phi_{\cdot,\text{front-wall}} = \phi(x) - \phi_{\cdot,\text{left-wall}} - \phi_{\cdot,\text{right-wall}} - \phi_{\cdot,\text{ceiling}} - \phi_{\cdot,\text{floor}}$
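
The H terms can be accumulated once and reused, exactly as with integral images. This simplified sketch uses vertical face boundaries instead of vanishing-point rays, a simplification of the paper's construction made for illustration:

    import numpy as np

    def cue_prefix(cue):
        """H(j) = number of 'on' cue pixels strictly left of column j."""
        return np.concatenate([[0.0], np.cumsum(cue.sum(axis=0))])

    def face_feature(H, i, j):
        """Cue count inside the face spanning columns [i, j) = H(j) - H(i)."""
        return H[j] - H[i]

    cue = (np.random.default_rng(0).random((48, 64)) > 0.5).astype(float)
    H = cue_prefix(cue)
    # Any of the O(width^2) candidate faces now costs O(1) instead of O(pixels).
    print(face_feature(H, 10, 30), cue[:, 10:30].sum())  # identical values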

95 Integral Geometry
Same concept as integral images, but in accordance with the vanishing points. [Figure: concept of integral geometry]

96 Learning and Inference
Learning:
- Family of structured prediction problems including CRFs and structured SVMs as special cases [T. Hazan and R. Urtasun, NIPS 2010]
- Primal-dual algorithm based on local updates; fast and works well with a large number of parameters
- Code coming soon!
Inference:
- Parallel convex belief propagation [A. Schwing, T. Hazan, M. Pollefeys and R. Urtasun, CVPR 2011]
- Convergence and other theoretical guarantees
- Code available online: general potentials, cross-platform, Amazon EC2!

97 Time vs Accuracy
Learning is very fast: state-of-the-art after less than a minute! [Figure: error [%] and relative duality gap vs. time [s], for 150 and 200 train/test images]
Inference takes as little as 10 ms per image!

98 Results [A. Schwing, T. Hazan, M. Pollefeys and R. Urtasun, CVPR12]
Table: pixel classification error on the layout dataset of [Hedau et al. 09]. Columns: OM, GC, OM + GC. Rows: [Hoiem07], [Hedau09] (a), [Hedau09] (b), [Wang10], [Lee10], Ours (SVMstruct), Ours (struct-pred). [Numeric entries lost in transcription]
Table: pixel classification error on the bedroom dataset of [Hedau et al. 10]. Rows: [Luca11], [Hoiem07], [Hedau09] (a), Ours w/o box. [Numeric entries lost in transcription]

99 Simple object reasoning
Compatibility of 3D object candidates and layout.

100 Results [A. Schwing, T. Hazan, M. Pollefeys and R. Urtasun, CVPR12]
Table: pixel classification error on the layout dataset of [Hedau et al. 09]. Columns: OM, GC, OM + GC. Rows: [Hoiem07], [Hedau09] (a), [Hedau09] (b), [Wang10], [Lee10], Ours (SVMstruct), Ours (struct-pred). [Numeric entries lost in transcription]
Table: same dataset WITH object reasoning. Columns: OM, GC, OM + GC. Rows: [Wang10], [Lee10], Ours (SVMstruct), Ours (struct-pred). [Numeric entries lost in transcription]

101 Results [A. Schwing, T. Hazan, M. Pollefeys and R. Urtasun, CVPR12]
Table: pixel classification error on the layout dataset of [Hedau et al. 09] with object reasoning. Columns: OM, GC, OM + GC. Rows: [Wang10], [Lee10], Ours (SVMstruct), Ours (struct-pred). [Numeric entries lost in transcription]
Table: pixel classification error on the bedroom dataset of [Hedau et al. 10]. Rows: [Luca11], [Hoiem07], [Hedau09] (a), Ours w/o box, Ours w/ box. [Numeric entries lost in transcription]

102 Qualitative Results

103 Conclusions and Future Work
Conclusion: efficient learning and inference tools for structured prediction based on primal-dual methods.
- Inference: no need for application-specific moves
- Learning: can learn a large number of parameters using local updates
- State-of-the-art results
Future work: more features, better object reasoning, the weakly labeled setting, better inference?


More information

Alternating Direction Method of Multipliers

Alternating Direction Method of Multipliers Alternating Direction Method of Multipliers CS 584: Big Data Analytics Material adapted from Stephen Boyd (https://web.stanford.edu/~boyd/papers/pdf/admm_slides.pdf) & Ryan Tibshirani (http://stat.cmu.edu/~ryantibs/convexopt/lectures/21-dual-meth.pdf)

More information

CS 1674: Intro to Computer Vision. Neural Networks. Prof. Adriana Kovashka University of Pittsburgh November 16, 2016

CS 1674: Intro to Computer Vision. Neural Networks. Prof. Adriana Kovashka University of Pittsburgh November 16, 2016 CS 1674: Intro to Computer Vision Neural Networks Prof. Adriana Kovashka University of Pittsburgh November 16, 2016 Announcements Please watch the videos I sent you, if you haven t yet (that s your reading)

More information

732A54/TDDE31 Big Data Analytics

732A54/TDDE31 Big Data Analytics 732A54/TDDE31 Big Data Analytics Lecture 10: Machine Learning with MapReduce Jose M. Peña IDA, Linköping University, Sweden 1/27 Contents MapReduce Framework Machine Learning with MapReduce Neural Networks

More information

Learning and Inferring Depth from Monocular Images. Jiyan Pan April 1, 2009

Learning and Inferring Depth from Monocular Images. Jiyan Pan April 1, 2009 Learning and Inferring Depth from Monocular Images Jiyan Pan April 1, 2009 Traditional ways of inferring depth Binocular disparity Structure from motion Defocus Given a single monocular image, how to infer

More information

SELF-ADAPTIVE SUPPORT VECTOR MACHINES

SELF-ADAPTIVE SUPPORT VECTOR MACHINES SELF-ADAPTIVE SUPPORT VECTOR MACHINES SELF ADAPTIVE SUPPORT VECTOR MACHINES AND AUTOMATIC FEATURE SELECTION By PENG DU, M.Sc., B.Sc. A thesis submitted to the School of Graduate Studies in Partial Fulfillment

More information

Advanced Operations Research Techniques IE316. Quiz 2 Review. Dr. Ted Ralphs

Advanced Operations Research Techniques IE316. Quiz 2 Review. Dr. Ted Ralphs Advanced Operations Research Techniques IE316 Quiz 2 Review Dr. Ted Ralphs IE316 Quiz 2 Review 1 Reading for The Quiz Material covered in detail in lecture Bertsimas 4.1-4.5, 4.8, 5.1-5.5, 6.1-6.3 Material

More information

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2016

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2016 CPSC 340: Machine Learning and Data Mining Principal Component Analysis Fall 2016 A2/Midterm: Admin Grades/solutions will be posted after class. Assignment 4: Posted, due November 14. Extra office hours:

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Raquel Urtasun and Tamir Hazan TTI Chicago April 22, 2011 Raquel Urtasun and Tamir Hazan (TTI-C) Graphical Models April 22, 2011 1 / 22 If the graph is non-chordal, then

More information

MVE165/MMG630, Applied Optimization Lecture 8 Integer linear programming algorithms. Ann-Brith Strömberg

MVE165/MMG630, Applied Optimization Lecture 8 Integer linear programming algorithms. Ann-Brith Strömberg MVE165/MMG630, Integer linear programming algorithms Ann-Brith Strömberg 2009 04 15 Methods for ILP: Overview (Ch. 14.1) Enumeration Implicit enumeration: Branch and bound Relaxations Decomposition methods:

More information

Structured Output Learning with High Order Loss Functions

Structured Output Learning with High Order Loss Functions Daniel Tarlow Department of Computer Science University of Toronto Richard S. Zemel Department of Computer Science University of Toronto Abstract Often when modeling structured domains, it is desirable

More information

Segmentation. Bottom up Segmentation Semantic Segmentation

Segmentation. Bottom up Segmentation Semantic Segmentation Segmentation Bottom up Segmentation Semantic Segmentation Semantic Labeling of Street Scenes Ground Truth Labels 11 classes, almost all occur simultaneously, large changes in viewpoint, scale sky, road,

More information

Learning and Recognizing Visual Object Categories Without First Detecting Features

Learning and Recognizing Visual Object Categories Without First Detecting Features Learning and Recognizing Visual Object Categories Without First Detecting Features Daniel Huttenlocher 2007 Joint work with D. Crandall and P. Felzenszwalb Object Category Recognition Generic classes rather

More information

Self Lane Assignment Using Smart Mobile Camera For Intelligent GPS Navigation and Traffic Interpretation

Self Lane Assignment Using Smart Mobile Camera For Intelligent GPS Navigation and Traffic Interpretation For Intelligent GPS Navigation and Traffic Interpretation Tianshi Gao Stanford University tianshig@stanford.edu 1. Introduction Imagine that you are driving on the highway at 70 mph and trying to figure

More information

A Short SVM (Support Vector Machine) Tutorial

A Short SVM (Support Vector Machine) Tutorial A Short SVM (Support Vector Machine) Tutorial j.p.lewis CGIT Lab / IMSC U. Southern California version 0.zz dec 004 This tutorial assumes you are familiar with linear algebra and equality-constrained optimization/lagrange

More information

Structured Models in. Dan Huttenlocher. June 2010

Structured Models in. Dan Huttenlocher. June 2010 Structured Models in Computer Vision i Dan Huttenlocher June 2010 Structured Models Problems where output variables are mutually dependent or constrained E.g., spatial or temporal relations Such dependencies

More information

Multiple cosegmentation

Multiple cosegmentation Armand Joulin, Francis Bach and Jean Ponce. INRIA -Ecole Normale Supérieure April 25, 2012 Segmentation Introduction Segmentation Supervised and weakly-supervised segmentation Cosegmentation Segmentation

More information

Support Vector Machines

Support Vector Machines Support Vector Machines RBF-networks Support Vector Machines Good Decision Boundary Optimization Problem Soft margin Hyperplane Non-linear Decision Boundary Kernel-Trick Approximation Accurancy Overtraining

More information

Convex Optimization and Machine Learning

Convex Optimization and Machine Learning Convex Optimization and Machine Learning Mengliu Zhao Machine Learning Reading Group School of Computing Science Simon Fraser University March 12, 2014 Mengliu Zhao SFU-MLRG March 12, 2014 1 / 25 Introduction

More information

The More the Merrier: Parameter Learning for Graphical Models with Multiple MAPs

The More the Merrier: Parameter Learning for Graphical Models with Multiple MAPs The More the Merrier: Parameter Learning for Graphical Models with Multiple MAPs Franziska Meier fmeier@usc.edu U. of Southern California, 94 Bloom Walk, Los Angeles, CA 989 USA Amir Globerson amir.globerson@mail.huji.ac.il

More information

LOGISTIC REGRESSION FOR MULTIPLE CLASSES

LOGISTIC REGRESSION FOR MULTIPLE CLASSES Peter Orbanz Applied Data Mining Not examinable. 111 LOGISTIC REGRESSION FOR MULTIPLE CLASSES Bernoulli and multinomial distributions The mulitnomial distribution of N draws from K categories with parameter

More information

Three-Dimensional Object Detection and Layout Prediction using Clouds of Oriented Gradients

Three-Dimensional Object Detection and Layout Prediction using Clouds of Oriented Gradients ThreeDimensional Object Detection and Layout Prediction using Clouds of Oriented Gradients Authors: Zhile Ren, Erik B. Sudderth Presented by: Shannon Kao, Max Wang October 19, 2016 Introduction Given an

More information

Local cues and global constraints in image understanding

Local cues and global constraints in image understanding Local cues and global constraints in image understanding Olga Barinova Lomonosov Moscow State University *Many slides adopted from the courses of Anton Konushin Image understanding «To see means to know

More information

Analysis: TextonBoost and Semantic Texton Forests. Daniel Munoz Februrary 9, 2009

Analysis: TextonBoost and Semantic Texton Forests. Daniel Munoz Februrary 9, 2009 Analysis: TextonBoost and Semantic Texton Forests Daniel Munoz 16-721 Februrary 9, 2009 Papers [shotton-eccv-06] J. Shotton, J. Winn, C. Rother, A. Criminisi, TextonBoost: Joint Appearance, Shape and Context

More information

Supplemental Material: Discovering Groups of People in Images

Supplemental Material: Discovering Groups of People in Images Supplemental Material: Discovering Groups of People in Images Wongun Choi 1, Yu-Wei Chao 2, Caroline Pantofaru 3 and Silvio Savarese 4 1. NEC Laboratories 2. University of Michigan, Ann Arbor 3. Google,

More information