SVMs for Structured Output. Andrea Vedaldi


SVMstruct (I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun, Support Vector Machine Learning for Interdependent and Structured Output Spaces, ICML 2004)

Extending SVMs

- SVM = parametric function: arbitrary input, binary output
- Regression by convex optimization
- Today: how to handle arbitrary output (while keeping the computation efficient)

Arbitrary input and output

- SVM = sign of the projection: f(x; w) = sign ⟨w, x⟩
- Equivalently: the output that matches best, f(x; w) = argmax over y in {−1, +1} of y ⟨w, x⟩
- Extend to arbitrary input and output: score F(x, y; w) = ⟨w, ψ(x, y)⟩ and prediction f(x; w) = argmax over y in Y of F(x, y; w)
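A quick numerical check of this equivalence (a sketch: the feature map ψ(x, y) = y·x/2 for y in {−1, +1} is the textbook reduction, not spelled out on the slide):

```python
import numpy as np

def psi(x, y):
    # Joint feature map reducing binary classification to the
    # structured form: <w, psi(x, y)> = y * <w, x> / 2.
    return y * x / 2.0

def f_argmax(w, x, labels=(-1, +1)):
    # Structured prediction: pick the output that matches best.
    return max(labels, key=lambda y: np.dot(w, psi(x, y)))

rng = np.random.default_rng(0)
w, x = rng.normal(size=5), rng.normal(size=5)
assert f_argmax(w, x) == np.sign(np.dot(w, x))  # argmax agrees with sign
```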

Example: real function. [Figure: encoding a real-valued output y; the score F(x, y; w) plotted over the (x, y) plane.]

Regression

Regression problem

- SVM = function y = f(x; w) parametrized by w.
- Goal: fit the function to data (x_1, y_1), ..., (x_N, y_N).
- Formulated as an optimization problem:
  - Loss: Δ(y, y′), the cost of predicting y′ when the truth is y.
  - Risk (empirical): R(w) = (1/N) Σ_i Δ(y_i, f(x_i; w)).
- Problem: find w that minimizes R(w).
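These definitions translate almost literally into code. A minimal sketch (the finite candidate set Y, the feature map psi, and the loss are placeholders, not fixed by the slides):

```python
import numpy as np

def F(w, x, y, psi):
    # Score of an input/output pair: F(x, y; w) = <w, psi(x, y)>.
    return np.dot(w, psi(x, y))

def predict(w, x, Y, psi):
    # Inference: the best-matching output.
    return max(Y, key=lambda y: F(w, x, y, psi))

def empirical_risk(w, data, Y, psi, loss):
    # R(w) = average loss of the predictor over the training set.
    return np.mean([loss(y, predict(w, x, Y, psi)) for x, y in data])
```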

Separable case

- Separable case: exact fit, R(w) = 0.
- Necessary and sufficient condition: for each point x_i the maximum of F(x_i, ·; w) is reached at y_i, i.e. ⟨w, ψ(x_i, y_i)⟩ ≥ ⟨w, ψ(x_i, y)⟩ for all y.

Margin

- Max-margin: among the separating w, pick the one maximizing the smallest score gap min over i and y ≠ y_i of ⟨w, ψ(x_i, y_i) − ψ(x_i, y)⟩ (with ‖w‖ = 1).
- Equivalently, after rescaling w: minimize ½‖w‖² subject to ⟨w, ψ(x_i, y_i) − ψ(x_i, y)⟩ ≥ 1 for all i and all y ≠ y_i.

Non-separable case

- Introduce one slack variable ξ_i ≥ 0 per data point and soften the constraints: minimize ½‖w‖² + C Σ_i ξ_i subject to ⟨w, ψ(x_i, y_i) − ψ(x_i, y)⟩ ≥ 1 − ξ_i for all i and all y ≠ y_i.
- The slack is an upper bound (UB) on the loss: ξ_i ≥ 1 whenever the prediction on x_i is wrong, so Σ_i ξ_i upper-bounds the empirical error. [Figure: the hinge function, with values 1 and 0 marked, upper-bounding the 0/1 loss; the slack ξ_i indicated.]

Recap: SVMstruct regression

- Max-margin: regularized (and sparse) solution.
- Soft margin: handles the non-separable case.
- One constraint for each data point x_i and output value y.
- One slack variable for each data point x_i.
- Minimizes an upper bound on the empirical error (generalization results exist).

Kernelization

- Compute a kernel k((x, y), (x′, y′)) = ⟨ψ(x, y), ψ(x′, y′)⟩ instead of the explicit feature map ψ(x, y).
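With a kernel, the score is evaluated against support vectors instead of an explicit w. A minimal sketch (the list of (coefficient, x, y) triples is whatever training produces; the names are illustrative):

```python
def score(x, y, k, svs):
    # F(x, y) = sum_j c_j * k((x_j, y_j), (x, y)), where svs is a list
    # of (c_j, x_j, y_j) triples found at training time.
    return sum(c_j * k((x_j, y_j), (x, y)) for c_j, x_j, y_j in svs)

def predict(x, Y, k, svs):
    # Inference still maximizes the (now kernelized) score over outputs.
    return max(Y, key=lambda y: score(x, y, k, svs))
```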

Example: real function

- Fit a real-valued function to data (x_1, y_1), ..., (x_N, y_N).
- Ingredients: a loss, the empirical risk, a kernel, and the expansion coefficients.

Dual

Primal in matrix notation

- Assume a finite range Y = {1, ..., M}.
- Use a lexicographic ordering of the pairs (i, y) to stack the margin constraints into a single constraint matrix, giving a standard quadratic program in (w, ξ).
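A numerical sketch of this finite-Y primal, written with cvxpy (not mentioned on the slides; the random features and the margin-1 constraints are illustrative assumptions):

```python
import cvxpy as cp
import numpy as np

N, M, D, C = 4, 3, 5, 10.0            # examples, outputs, feature dim, trade-off
rng = np.random.default_rng(0)
psi = rng.normal(size=(N, M, D))      # psi[i, y] = joint feature vector psi(x_i, y)
y_true = rng.integers(0, M, size=N)

w, xi = cp.Variable(D), cp.Variable(N)
cons = [xi >= 0]
for i in range(N):                    # lexicographic order over the pairs (i, y)
    for y in range(M):
        if y != y_true[i]:
            delta = psi[i, y_true[i]] - psi[i, y]
            cons.append(delta @ w >= 1 - xi[i])

prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi)), cons)
prob.solve()
print("primal optimum:", prob.value)
```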

Primal → dual

- Introduce one dual variable α_iy per margin constraint.
- Optimize out the primal variables w, ξ: setting the partial derivatives of the Lagrangian to zero yields a bounded minimum, with w expressed as a combination of the differences δψ_i(y) = ψ(x_i, y_i) − ψ(x_i, y), and leaves a dual problem in α alone.
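Spelled out (a standard derivation for the soft-margin primal above; the slide's own equations are images, so the notation δψ_i(y) = ψ(x_i, y_i) − ψ(x_i, y) is assumed):

```latex
\begin{align*}
L(w,\xi,\alpha) &= \tfrac{1}{2}\|w\|^2 + C\sum_i \xi_i
  - \sum_{i,\,y\neq y_i} \alpha_{iy}\bigl(\langle w,\delta\psi_i(y)\rangle - 1 + \xi_i\bigr),\\
\frac{\partial L}{\partial w} = 0 &\;\Rightarrow\;
  w = \sum_{i,\,y\neq y_i} \alpha_{iy}\,\delta\psi_i(y), \qquad
\frac{\partial L}{\partial \xi_i} = 0 \;\Rightarrow\; \sum_{y\neq y_i} \alpha_{iy} \le C.
\end{align*}
```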

Solution and support vectors

- The solution expands as w = Σ_{i,y} c_iy ψ(x_i, y).
- Support vector (SV): a pair (x_i, y) with coefficient c_iy ≠ 0.
- Each dual variable α_iy > 0 contributes the difference of two SVs, with coefficients c_iy and c_iy_i.

Optimization

Dealing with many constraints

- The number of constraints, N·|Y| (one per pair (i, y)), can be very large or even infinite; what about the number of dual variables?
- Exploit the sparsity of the solution.
- Add one constraint / dual variable at a time, always the most violated constraint, until no constraint is violated by more than ε.
- Nice facts: the algorithm stops with an ε-optimal solution in polynomial time.

ε-satisfaction ⇒ ε-optimality

- Full primal: all constraints (i, y). Partially constrained primal: only the constraints in an active set S.
- Let (w, ξ) be the optimum of the partially constrained primal, and consider the smallest ε such that (w, ξ + ε), with every slack raised by ε, is feasible for the full primal. Then: the partial optimum lower-bounds the full optimum (fewer constraints), while (w, ξ + ε) upper-bounds it, so the two optima differ by at most C·N·ε. The partial solution is ε-optimal.

Hunting ε-violations

- Start with S = {} and a target precision ε̄.
- Solve the partial dual problem (one variable for each active constraint).
- By strong duality, this also gives the solution of the partial primal problem.
- Remark: this solution is guaranteed to satisfy only the active constraints.
- Find the smallest ε such that it satisfies all constraints (i.e. find the most violated constraint).
- Stop if ε < ε̄; otherwise add the violated constraint to S and repeat.
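A compact sketch of this working-set loop (a sketch only: solve_partial stands in for the partial QP solver, the margin-1 constraints match the soft-margin primal above, and the brute-force search over Y replaces problem-specific inference):

```python
def score(w, feat):
    # F(x, y; w) = <w, psi(x, y)>
    return sum(wi * fi for wi, fi in zip(w, feat))

def cutting_plane(data, Y, psi, solve_partial, eps_bar):
    S = []                                       # active constraints, as (i, y) pairs
    while True:
        w, xi = solve_partial(S, data, Y, psi)   # partial primal, via its dual
        worst, new = 0.0, None
        for i, (x, y_i) in enumerate(data):
            # Most violated constraint for example i: the wrong output
            # with the highest score.
            y_bad = max((y for y in Y if y != y_i),
                        key=lambda y: score(w, psi(x, y)))
            v = 1.0 - (score(w, psi(x, y_i)) - score(w, psi(x, y_bad))) - xi[i]
            if v > worst:
                worst, new = v, (i, y_bad)
        if worst < eps_bar:                      # every constraint eps-satisfied
            return w, xi
        S.append(new)
```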

Finiteness

- When the algorithm stops, it returns an ε-optimal solution.
- But... does it stop?
- Theorem (convergence). The algorithm stops after a finite number of iterations; the bound depends on the precision ε, the regularization constant, and the scale of the features.

Demo: fitting a real function

- Hunting constraints to fit a real function.

Demo: effect of C

- [Figure: panels of train/test fits with support vectors (sv) marked, for several values of C including C = 1 and C = 100; axes x vs y.]

An Application: Localization (M. Blaschko and C. Lampert, Learning to Localize Objects with Structured Output Regression, ECCV 2008)

The problem

- Goal: learn a function from an image to a bounding box and label.
- Input x = image; output y = (label, bounding box). [Figure: an image with a bounding box labeled cow.]

Training data

- Pairs (x_1, y_1), ..., (x_N, y_N): (image, label + bounding box). [Figure: example images with boxes, labeled cow / NOT cow.]

Loss

- [Figure: examples of the loss between pairs of outputs, comparing cow / NOT cow labels and box placements.]

Kernel

- Intuition: if x ≈ x′ and y ≈ y′, then k((x, y), (x′, y′)) should be large.
- k((x, y), (x′, y′)) =
  - 0 if lab(y) = NOT or lab(y′) = NOT;
  - k(crop(x, y), crop(x′, y′)), an image-patch similarity, if lab(y) = lab(y′) = COW.
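As code (a sketch: patch_kernel and crop stand in for an actual image-patch similarity and for cutting a box out of an image):

```python
def joint_kernel(x, y, x2, y2, patch_kernel, crop):
    # y and y2 are (label, box) pairs, as in "output y = (label, bounding box)".
    label, box = y
    label2, box2 = y2
    if label == "NOT" or label2 == "NOT":
        return 0.0
    # Restriction kernel: compare only the image content inside the two boxes.
    return patch_kernel(crop(x, box), crop(x2, box2))
```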

SVM evaluation

- Hypothesis: a candidate output y, either NOT cow or (cow, box).
- Scoring:
  - if lab(y) = NOT cow: F(x, y; w) = 0, do nothing;
  - if lab(y) = cow: match the cropped patch against the support vectors.
- Return the best-scoring hypothesis.

Most violated constraints

- The most violated constraint maximizes a Δ(y, y′)-weighted violation over candidate outputs, producing a new constraint. [Figure: a cow image with the ground-truth box and the newly found violating box.]
- Finding the most violated constraint is similar to evaluating the SVM: the scoring is weighted by Δ(y, y′) and repeated many times, so it needs a fast search method (e.g. branch and bound).
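A sketch of this loss-augmented search, in the margin-rescaling convention (one common choice; the slide's exact Δ-weighting is an image, so the objective below is an assumption), using brute force over a finite candidate set instead of branch and bound:

```python
def most_violated(x, y_true, candidates, score, delta):
    # Margin-rescaling convention: the constraint for output y is most
    # violated when delta(y_true, y) + F(x, y) - F(x, y_true) is largest.
    return max((y for y in candidates if y != y_true),
               key=lambda y: delta(y_true, y) + score(x, y) - score(x, y_true))
```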

M³ networks (B. Taskar, C. Guestrin, and D. Koller, Max-Margin Markov Networks, 2003)

M³ network

- The structured SVM entails a large number of constraints; so far, handled by adding one constraint at a time.
- An M³ network is a graphical model: a Markov network (encoding independence relations) used for inference in the structured-output framework, with a reduction from an exponential to a polynomial number of constraints.
- M³ network = Max-Margin Markov network.

M³ model

- Graph (tree): outputs are configurations over the nodes of a tree-structured graph; node n carries a state y_n.
- Discriminative model with score F(x, y; w); inference maximizes F over y; risk as before.
- (1) SVM property: the score is linear in w.
- (2) Markov property: the potentials decompose along the edges of the graph.

M³ as structured SVM

- M³ is a structured SVM, with the usual dual problem.
- How many dual variables α_iy?
  - i ranges over the number of examples (N);
  - y ranges over the number of joint settings (2^M, for M binary nodes);
  - total: N·2^M.

Marginalizing dual variables

- Idea: instead of the exponentially many dual variables β_i(y), one per full joint setting, work only with their one-node and two-node marginals.
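In formulas (standard for max-margin Markov networks; the slide's own equations are images, so the notation μ is an assumption): for example i, node n, and edge (n, m),

```latex
\mu_i(y_n) \;=\; \sum_{y'\,:\,y'_n = y_n} \beta_i(y'), \qquad
\mu_i(y_n, y_m) \;=\; \sum_{y'\,:\,y'_n = y_n,\; y'_m = y_m} \beta_i(y').
```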

β → marginals

- Given β, compute the one- and two-node marginal variables.
- Compute the dual objective as a function of the marginals only (possible because the potentials and the risk decompose over edges and nodes).

Marginals → β

- Given one- and two-node marginal variables, is there a corresponding β?
- We also know consistency conditions that valid marginals must satisfy (normalization and agreement between node and edge marginals).
- A sufficient condition: on a tree, locally consistent node and edge marginals always extend to a valid β.

Recap

- Simplification results:
  - potentials decompose on edges;
  - risk decomposes on nodes;
  - variables: from N·2^M down to N(M² + M);
  - constraints: from N·2^M down to N·M².
- If the graph is not a tree, the simplified dual is a relaxation.

Example application

- x: handwritten word image; y: [e m b r a c e s].
- [x]_n: pixel data for one character; [y]_n: one of {a, b, c, ..., z}.
- The per-edge feature ψ_34(x, y) stacks a block of per-character (unary) features, indexed a, b, c, ..., z, and a block of character co-occurrence (transition) features, indexed aa, ab, ac, ..., zz; ψ(x, y) sums these blocks along the whole chain. [Figure: the sparse feature vectors, non-zero only in the blocks of the active character and bigram.]
- Comparing words: ⟨ψ(x_1, y_1), ψ(x_2, y_2)⟩ = correlation of homogeneous characters + correlation of the character co-occurrence matrices.
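A sketch of this chain feature map, with a unary block per letter plus bigram transition counts (char_features and the dimension d are illustrative assumptions):

```python
import numpy as np
import string

ALPHABET = string.ascii_lowercase
A = len(ALPHABET)                      # 26 states per node

def chain_psi(x_chars, y_word, char_features, d):
    # x_chars: list of per-character pixel arrays; y_word: the labeling.
    # Unary block: d features per letter; pairwise block: bigram counts.
    unary = np.zeros((A, d))
    pairwise = np.zeros((A, A))
    for x_n, y_n in zip(x_chars, y_word):
        unary[ALPHABET.index(y_n)] += char_features(x_n)
    for y_n, y_m in zip(y_word, y_word[1:]):
        pairwise[ALPHABET.index(y_n), ALPHABET.index(y_m)] += 1.0
    return np.concatenate([unary.ravel(), pairwise.ravel()])
```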

Conclusions

- SVMs can be used for arbitrary outputs: very general.
- Function evaluation entails a maximization (inference) step.
- The hinge loss is adapted to bound an arbitrary loss.
- Risk minimization is a large convex program, handled by:
  - iteratively adding one constraint at a time (this also requires an inference step);
  - M³ marginalization.
