Part 5: Structured Support Vector Machines

Size: px

Start display at page:

Download "Part 5: Structured Support Vector Machines"

Ethelbert Shields
6 years ago
Views:

1 Part 5: Structured Support Vector Machines Sebastian Noozin and Christoph H. Lampert Colorado Springs, 25th June / 56

2 Problem (Loss-Minimizing Parameter Learning) Let d(x, y) be the (unknon) true data distribution. Let D = {(x 1, y 1 ),..., (x N, y N )} be i.i.d. samples from d(x, y). Let φ : X Y R D be a feature function. Let : Y Y R be a loss function. Find a eight vector that leads to imal expected loss E (x,y) d(x,y) { (y, f(x))} for f(x) = argmax y Y, φ(x, y). 2 / 56

3 Problem (Loss-Minimizing Parameter Learning) Let d(x, y) be the (unknon) true data distribution. Let D = {(x 1, y 1 ),..., (x N, y N )} be i.i.d. samples from d(x, y). Let φ : X Y R D be a feature function. Let : Y Y R be a loss function. Find a eight vector that leads to imal expected loss E (x,y) d(x,y) { (y, f(x))} for f(x) = argmax y Y, φ(x, y). Pro: We directly optimize for the quantity of interest: expected loss. No expensive-to-compute partition function Z ill sho up. Con: We need to kno the loss function already at training time. We can t use probabilistic reasoning to find. 3 / 56

4 Reder: learning by regularized risk imization For compatibility function g(x, y; ) :=, φ(x, y) find that imizes To major problems: d(x, y) is unknon E (x,y) d(x,y) ( y, argmax y g(x, y; ) ). argmax y g(x, y; ) maps into a discrete space ( y, argmax y g(x, y; )) is discontinuous, pieceise constant 4 / 56

5 Task: E (x,y) d(x,y) ( y, argmax y g(x, y; ) ). Problem 1: d(x, y) is unknon Solution: ( ) Replace E (x,y) d(x,y) ith empirical estimate 1 ( ) N (x n,y n ) To avoid overfitting: add a regularizer, e.g. λ 2. Ne task: λ N N ( y n, argmax y g(x n, y; ) ). 5 / 56

6 Task: λ N N ( y n, argmax y g(x n, y; ) ). Problem: ( y, argmax y g(x, y; ) ) discontinuous.r.t.. Solution: Replace (y, y ) ith ell behaved l(x, y, ) Typically: l upper bound to, continuous and convex.r.t.. Ne task: λ N N l(x n, y n, )) 6 / 56

7 Regularized Risk Minimization λ N N l(x n, y n, )) Regularization + Loss on training data 7 / 56

8 Regularized Risk Minimization λ N N l(x n, y n, )) Regularization + Loss on training data Hinge loss: maximum margin training l(x n, y n [, ) := max (y n, y) +, φ(x n, y), φ(x n, y n ) ] y Y 8 / 56

9 Regularized Risk Minimization λ N N l(x n, y n, )) Regularization + Loss on training data Hinge loss: maximum margin training l(x n, y n [, ) := max (y n, y) +, φ(x n, y), φ(x n, y n ) ] y Y l is maximum over linear functions continuous, convex. l bounds from above. Proof: Let ȳ = argmax y g(x n, y, ) (y n, ȳ) (y n, ȳ) + g(x n, ȳ, ) g(x n, y n, ) [ max (y n, y) + g(x n, y, ) g(x n, y n, ) ] y Y 9 / 56

10 Regularized Risk Minimization λ N N l(x n, y n, )) Regularization + Loss on training data Hinge loss: maximum margin training l(x n, y n [, ) := max (y n, y) +, φ(x n, y), φ(x n, y n ) ] y Y Alternative: Logistic loss: probabilistic training l(x n, y n, ) := log y Y exp (, φ(x n, y), φ(x n, y n ) ) 10 / 56

11 Structured Output Support Vector Machine C N N [ ] max y Y (yn, y) +, φ(x n, y), φ(x n, y n ) Conditional Random Field 2 2σ 2 + N [ log exp (, φ(x n, y), φ(x n, y n ) )] y Y CRFs and SSVMs have more in common than usually assumed. both do regularized risk imization log y exp( ) can be interpreted as a soft-max 11 / 56

12 Solving the Training Optimization Problem Numerically Structured Output Support Vector Machine: C N N [ max y Y (yn, y) +, φ(x n, y), φ(x n, y n ) )] Unconstrained optimization, convex, non-differentiable objective. 12 / 56

13 Structured Output SVM (equivalent formulation):,ξ C N N ξ n subject to, for n = 1,..., N, max y Y [ ] (y n, y) +, φ(x n, y), φ(x n, y n ) ξ n N non-linear contraints, convex, differentiable objective. 13 / 56

14 Structured Output SVM (also equivalent formulation):,ξ C N N ξ n subject to, for n = 1,..., N, (y n, y) +, φ(x n, y), φ(x n, y n ) ξ n, for all y Y N Y linear constraints, convex, differentiable objective. 14 / 56

15 Example: Multiclass SVM { Y = {1, 2,..., K}, (y, y 1 for y y ) = 0 otherise. ( ) φ(x, y) = y = 1 φ(x), y = 2 φ(x),..., y = K φ(x) Solve:,ξ C N subject to, for i = 1,..., n, N ξ n, φ(x n, y n ), φ(x n, y) 1 ξ n for all y Y \ {y n }. Classification: f(x) = argmax y Y, φ(x, y). Crammer-Singer Multiclass SVM [K. Crammer, Y. Singer: On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines, JMLR, 2001] 15 / 56

16 Example: Hierarchical SVM Hierarchical Multiclass Loss: (y, y ) := 1 (distance in tree) 2 (cat, cat) = 0, (cat, dog) = 1, (cat, bus) = 2, etc. Solve:,ξ C N N ξ n subject to, for i = 1,..., n,, φ(x n, y n ), φ(x n, y) (y n, y) ξ n for all y Y. [L. Cai, T. Hofmann: Hierarchical Document Categorization ith Support Vector Machines, ACM CIKM, 2004] [A. Binder, K.-R. Müller, M. Kaanabe: On taxonomies for multi-class image categorization, IJCV, 2011] 16 / 56

17 Solving the Training Optimization Problem Numerically We can solve SSVM training like CRF training: C N N [ ] max y Y (yn, y) +, φ(x n, y), φ(x n, y n ) continuous unconstrained convex non-differentiable e can t use gradient descent directly. e ll have to use subgradients 17 / 56

18 Definition Let f : R D R be a convex, not necessarily differentiable, function. A vector v R D is called a subgradient of f at 0, if f() f( 0 ) + v, 0 for all. f() f(0) f(0)+ v, / 56

19 Definition Let f : R D R be a convex, not necessarily differentiable, function. A vector v R D is called a subgradient of f at 0, if f() f( 0 ) + v, 0 for all. f() f(0)+ v,-0 f(0) 0 19 / 56

20 Definition Let f : R D R be a convex, not necessarily differentiable, function. A vector v R D is called a subgradient of f at 0, if f() f( 0 ) + v, 0 for all. f() f(0) f(0)+ v, / 56

21 Definition Let f : R D R be a convex, not necessarily differentiable, function. A vector v R D is called a subgradient of f at 0, if f() f( 0 ) + v, 0 for all. f() f(0) f(0)+ v,-0 0 For differentiable f, the gradient v = f( 0 ) is the only subgradient. f() f(0) f(0)+ v, / 56

22 Subgradient descent orks basically like gradient descent: Subgradient Descent Minimization imize F () require: tolerance ɛ > 0, stepsizes η t cur 0 repeat v sub F ( cur ) cur cur η t v until F changed less than ɛ return cur Converges to global imum, but rather inefficient if F non-differentiable. [Shor, Minimization methods for non-differentiable functions, Springer, 1985.] 22 / 56

23 Computing a subgradient: C N N l n () ith l n () = max y l n y (), and l n y () := (y n, y) +, φ(x n, y), φ(x n, y n ) 23 / 56

24 Computing a subgradient: C N N l n () ith l n () = max y l n y (), and l n y () := (y n, y) +, φ(x n, y), φ(x n, y n ) l() y For each y Y, l y () is a linear function. 24 / 56

25 Computing a subgradient: C N N l n () ith l n () = max y l n y (), and l n y () := (y n, y) +, φ(x n, y), φ(x n, y n ) l() y' For each y Y, l y () is a linear function. 25 / 56

26 Computing a subgradient: C N N l n () ith l n () = max y l n y (), and l n y () := (y n, y) +, φ(x n, y), φ(x n, y n ) l() For each y Y, l y () is a linear function. 26 / 56

27 Computing a subgradient: C N N l n () ith l n () = max y l n y (), and l n y () := (y n, y) +, φ(x n, y), φ(x n, y n ) l() l() = max y l y (): maximum over all y Y. 27 / 56

28 Computing a subgradient: C N N l n () ith l n () = max y l n y (), and l n y () := (y n, y) +, φ(x n, y), φ(x n, y n ) l() Subgradient of l n at 0 : 0 28 / 56

29 Computing a subgradient: C N N l n () ith l n () = max y l n y (), and l n y () := (y n, y) +, φ(x n, y), φ(x n, y n ) l() 0 Subgradient of l n at 0 : find maximal (active) y. 29 / 56

30 Computing a subgradient: C N N l n () ith l n () = max y l n y (), and l n y () := (y n, y) +, φ(x n, y), φ(x n, y n ) l() Subgradient of l n at 0 : find maximal (active) y, use v = l n y ( 0 ) / 56

31 Subgradient Descent S-SVM Training input training pairs {(x 1, y 1 ),..., (x n, y n )} X Y, input feature map φ(x, y), loss function (y, y ), regularizer C, input number of iterations T, stepsizes η t for t = 1,..., T 1: 0 2: for t=1,...,t do 3: for i=1,...,n do 4: ŷ argmax y Y (y n, y) +, φ(x n, y), φ(x n, y n ) 5: v n φ(x n, ŷ) φ(x n, y n ) 6: end for 7: η t ( C N n vn ) 8: end for output prediction function f(x) = argmax y Y, φ(x, y). Observation: each update of needs 1 argmax-prediction per example. 31 / 56

32 We can use the same tricks as for CRFs, e.g. stochastic updates: Stochastic Subgradient Descent S-SVM Training input training pairs {(x 1, y 1 ),..., (x n, y n )} X Y, input feature map φ(x, y), loss function (y, y ), regularizer C, input number of iterations T, stepsizes η t for t = 1,..., T 1: 0 2: for t=1,...,t do 3: (x n, y n ) randomly chosen training example pair 4: ŷ argmax y Y (y n, y) +, φ(x n, y), φ(x n, y n ) 5: η t ( C N [φ(xn, ŷ) φ(x n, y n )]) 6: end for output prediction function f(x) = argmax y Y, φ(x, y). Observation: each update of needs only 1 argmax-prediction (but e ll need many iterations until convergence) 32 / 56

33 Solving the Training Optimization Problem Numerically We can solve an S-SVM like a linear SVM: One of the equivalent formulations as: subject to, for i = 1,... n, R D,ξ R n C N N ξ n, φ(x n, y n ), φ(x n, y) (y n, y) ξ n, for all y Y. Introduce feature vectors δφ(x n, y n, y) := φ(x n, y n ) φ(x n, y). 33 / 56

34 Solve R D,ξ R n C N subject to, for i = 1,... n, for all y Y, N ξ n, δφ(x n, y n, y) (y n, y) ξ n. This has the same structure as an ordinary SVM! quadratic objective linear constraints 34 / 56

35 Solve R D,ξ R n C N subject to, for i = 1,... n, for all y Y, N ξ n, δφ(x n, y n, y) (y n, y) ξ n. This has the same structure as an ordinary SVM! quadratic objective linear constraints Question: Can t e use a ordinary SVM/QP solver? 35 / 56

36 Solve R D,ξ R n C N subject to, for i = 1,... n, for all y Y, N ξ n, δφ(x n, y n, y) (y n, y) ξ n. This has the same structure as an ordinary SVM! quadratic objective linear constraints Question: Can t e use a ordinary SVM/QP solver? Anser: Almost! We could, if there eren t N Y constraints. E.g. 100 binary images: constraints 36 / 56

37 Solution: orking set training It s enough if e enforce the active constraints. The others ill be fulfilled automatically. We don t kno hich ones are active for the optimal solution. But it s likely to be only a small number can of course be formalized. Keep a set of potentially active constraints and update it iteratively: 37 / 56

38 Solution: orking set training It s enough if e enforce the active constraints. The others ill be fulfilled automatically. We don t kno hich ones are active for the optimal solution. But it s likely to be only a small number can of course be formalized. Keep a set of potentially active constraints and update it iteratively: Working Set Training Start ith orking set S = (no contraints) Repeat until convergence: Solve S-SVM training problem ith constraints from S Check, if solution violates any of the full constraint set if no: e found the optimal solution, terate. if yes: add most violated constraints to S, iterate. 38 / 56

39 Solution: orking set training It s enough if e enforce the active constraints. The others ill be fulfilled automatically. We don t kno hich ones are active for the optimal solution. But it s likely to be only a small number can of course be formalized. Keep a set of potentially active constraints and update it iteratively: Working Set Training Start ith orking set S = (no contraints) Repeat until convergence: Solve S-SVM training problem ith constraints from S Check, if solution violates any of the full constraint set if no: e found the optimal solution, terate. if yes: add most violated constraints to S, iterate. Good practical performance and theoretic guarantees: polynomial time convergence ɛ-close to the global optimum [Tsochantaridis et al. Large Margin Methods for Structured and Interdependent Output Variables, JMLR, 2005.] 39 / 56

40 Working Set S-SVM Training input training pairs {(x 1, y 1 ),..., (x n, y n )} X Y, input feature map φ(x, y), loss function (y, y ), regularizer C 1: S 2: repeat 3: (, ξ) solution to QP only ith constraints from S 4: for i=1,...,n do 5: ŷ argmax y Y (y n, y) +, φ(x n, y) 6: if ŷ y n then 7: S S {(x n, ŷ)} 8: end if 9: end for 10: until S doesn t change anymore. output prediction function f(x) = argmax y Y, φ(x, y). Observation: each update of needs 1 argmax-prediction per example. (but e solve globally for next, not by local steps) 40 / 56

41 One-Slack Formulation of S-SVM: (equivalent to ordinary S-SVM formulation by ξ = 1 N n ξn ) 1 R D,ξ R Cξ subject to, for all (ŷ 1,..., ŷ N ) Y Y, N [ (y n, ŷ N ) +, φ(x n, ŷ n ), φ(x n, y n ) ] Nξ, 41 / 56

42 One-Slack Formulation of S-SVM: (equivalent to ordinary S-SVM formulation by ξ = 1 N n ξn ) 1 R D,ξ R Cξ subject to, for all (ŷ 1,..., ŷ N ) Y Y, N [ (y n, ŷ N ) +, φ(x n, ŷ n ), φ(x n, y n ) ] Nξ, Y N linear constraints, convex, differentiable objective. We ble up the constraint set even further: 100 binary images: constraints (instead of ). 42 / 56

43 Working Set One-Slack S-SVM Training input training pairs {(x 1, y 1 ),..., (x n, y n )} X Y, input feature map φ(x, y), loss function (y, y ), regularizer C 1: S 2: repeat 3: (, ξ) solution to QP only ith constraints from S 4: for i=1,...,n do 5: ŷ n argmax y Y (y n, y) +, φ(x n, y) 6: end for 7: S S { ( (x 1,..., x n ), (ŷ 1,..., ŷ n ) ) } 8: until S doesn t change anymore. output prediction function f(x) = argmax y Y, φ(x, y). Often faster convergence: We add one strong constraint per iteration instead of n eak ones. 43 / 56

44 We can solve an S-SVM like a non-linear SVM: compute Lagrangian dual becomes max, original (primal) variables, ξ disappear, ne (dual) variables α iy : one per constraint of the original problem. Dual S-SVM problem max α R n Y +,...,n y Y α ny (y n, y) 1 2 subject to, for n = 1,..., N, α ny C N. y Y y,ȳ Y n,,...,n α ny α nȳ δφ(x n, y n, y), δφ(x n, y n, ȳ) N linear contraints, convex, differentiable objective, N Y variables. 44 / 56

45 We can kernelize: Define joint kernel function k : (X Y) (X Y) R k( (x, y), ( x, ȳ) ) = φ(x, y), φ( x, ȳ). k measure similarity beteen to (input,output)-pairs. We can express the optimization in terms of k: δφ(x n, y n, y), δφ(x n, y n, ȳ) = φ(x n, y n ) φ(x n, y), φ(x n, y n ) φ(x n, ȳ) = φ(x n, y n ), φ(x n, y n ) φ(x n, y n ), φ(x n, ȳ) φ(x n, y), φ(x n, y n ) + φ(x n, y), φ(x n, ȳ) = k( (x n, y n ), (x n, y n ) ) k( (x n, y n ), φ(x n, ȳ) ) k( (x n, y), (x n, y n ) ) + k( (x n, y), φ(x n, ȳ) ) =: K iīyȳ 45 / 56

46 Kernelized S-SVM problem: max α R n Y + α iy (y n, y) 1 α iy αīȳ K iīyȳ 2 i=1,...,n y Y subject to, for i = 1,..., n, α iy C N. y Y y,ȳ Y i,ī=1,...,n too many variables: train ith orking set of α iy. Kernelized prediction function: f(x) = argmax y Y iy α iy k( (x i, y i ), (x, y) ) 46 / 56

47 What do joint kernel functions look like? k( (x, y), ( x, ȳ) ) = φ(x, y), φ( x, ȳ). As in graphical model: easier if φ decomposes.r.t. factors: φ(x, y) = ( φ F (x, y F ) ) F F Then the kernel k decomposes into sum over factors: (φf k( (x, y), ( x, ȳ) ) = (x, y F ) ) F F, ( φ F (x, y F ) ) F F = φ F (x, y F ), φ F (x F F, y F ) = F F k F ( (x, y F ), (x, y F ) ) We can define kernels for each factor (e.g. nonlinear). 47 / 56

48 Example: figure-ground segmentation ith grid structure (x, y) ˆ=(, ) Typical kernels: arbirary in x, linear (or at least simple).r.t. y: Unary factors: k p ((x p, y p ), (x p, y p) = k(x p, x p) y p = y p ith k(x p, x p) local image kernel, e.g. χ 2 or histogram intersection Pairise factors: k pq ((y p, y q ), (y p, y p) = y q = y q y q = y q More poerful than all-linear, and argmax-prediction still possible. 48 / 56

49 Example: object localization left top (x, y) ˆ=(, ) image right bottom Only one factor that includes all x and y: k( (x, y), (x, y ) ) = k image (x y, x y ) ith k image image kernel and x y is image region ithin box y. argmax-prediction as difficult as object localization ith k image -SVM. 49 / 56

50 Summary S-SVM Learning Given: training set {(x 1, y 1 ),..., (x n, y n )} X Y loss function : Y Y R. Task: learn parameter for f(x) := argmax y, φ(x, y) that imizes expected loss on future data. 50 / 56

51 Summary S-SVM Learning Given: training set {(x 1, y 1 ),..., (x n, y n )} X Y loss function : Y Y R. Task: learn parameter for f(x) := argmax y, φ(x, y) that imizes expected loss on future data. S-SVM solution derived by maximum margin frameork: enforce correct output to be better than others by a margin :, φ(x n, y n ) (y n, y) +, φ(x n, y) for all y Y. convex optimization problem, but non-differentiable many equivalent formulations different training algorithms training needs repeated argmax prediction, no probabilistic inference 51 / 56

52 Sebastian Noozin and Christoph Lampert Structured Models in Computer Vision Part 5. Structured SVMs Extra I: Beyond Fully Supervised Learning So far, training as fully supervised, all variables ere observed. In real life, some variables are unobserved even during training. missing labels in training data latent variables, e.g. part location latent variables, e.g. part occlusion latent variables, e.g. viepoint 52 / 56

53 Three types of variables: x X alays observed, y Y observed only in training, z Z never observed (latent). Decision function: f(x) = argmax y Y max z Z, φ(x, y, z) 53 / 56

54 Three types of variables: x X alays observed, y Y observed only in training, z Z never observed (latent). Decision function: f(x) = argmax y Y max z Z, φ(x, y, z) Maximum Margin Training ith Maximization over Latent Variables Solve:,ξ C N N ξ n subject to, for n = 1,..., N, for all y Y (y n, y) + max z Z, φ(xn, y, z) max z Z Problem: not a convex problem can have local ima, φ(xn, y n, z) [C. Yu, T. Joachims, Learning Structural SVMs ith Latent Variables, ICML, 2009] similar idea: [Felzenszalb, McAllester, Ramaman. A Discriatively Trained, Multiscale, Deformable Part Model, CVPR 08] 54 / 56

55 Structured Learning is full of Open Research Questions Ho to train faster? CRFs need many runs of probablistic inference, SSVMs need many runs of argmax-predictions. Ho to reduce the necessary amount of training data? semi-supervised learning? transfer learning? Ho can e better understand different loss function? hen to use probabilistic training, hen maximum margin? CRFs are consistent, SSVMs are not. Is this relevant? Can e understand structured learning ith approximate inference? often computing L() or argmaxy, φ(x, y) exactly is infeasible. can e guarantee good results even ith approximate inference? More and ne applications! 55 / 56

56 Lunch-Break Continuing at 13:30 Slides available at cvpr2011tutorial/ 56 / 56

Part 5: Structured Support Vector Machines

Part 5: Structured Support Vector Machines Sebastian Nowozin and Christoph H. Lampert Providence, 21st June 2012 1 / 34 Problem (Loss-Minimizing Parameter Learning) Let d(x, y) be the (unknown) true data