Part 5: Structured Support Vector Machines
|
|
- Ethelbert Shields
- 6 years ago
- Views:
Transcription
1 Part 5: Structured Support Vector Machines Sebastian Noozin and Christoph H. Lampert Colorado Springs, 25th June / 56
2 Problem (Loss-Minimizing Parameter Learning) Let d(x, y) be the (unknon) true data distribution. Let D = {(x 1, y 1 ),..., (x N, y N )} be i.i.d. samples from d(x, y). Let φ : X Y R D be a feature function. Let : Y Y R be a loss function. Find a eight vector that leads to imal expected loss E (x,y) d(x,y) { (y, f(x))} for f(x) = argmax y Y, φ(x, y). 2 / 56
3 Problem (Loss-Minimizing Parameter Learning) Let d(x, y) be the (unknon) true data distribution. Let D = {(x 1, y 1 ),..., (x N, y N )} be i.i.d. samples from d(x, y). Let φ : X Y R D be a feature function. Let : Y Y R be a loss function. Find a eight vector that leads to imal expected loss E (x,y) d(x,y) { (y, f(x))} for f(x) = argmax y Y, φ(x, y). Pro: We directly optimize for the quantity of interest: expected loss. No expensive-to-compute partition function Z ill sho up. Con: We need to kno the loss function already at training time. We can t use probabilistic reasoning to find. 3 / 56
4 Reder: learning by regularized risk imization For compatibility function g(x, y; ) :=, φ(x, y) find that imizes To major problems: d(x, y) is unknon E (x,y) d(x,y) ( y, argmax y g(x, y; ) ). argmax y g(x, y; ) maps into a discrete space ( y, argmax y g(x, y; )) is discontinuous, pieceise constant 4 / 56
5 Task: E (x,y) d(x,y) ( y, argmax y g(x, y; ) ). Problem 1: d(x, y) is unknon Solution: ( ) Replace E (x,y) d(x,y) ith empirical estimate 1 ( ) N (x n,y n ) To avoid overfitting: add a regularizer, e.g. λ 2. Ne task: λ N N ( y n, argmax y g(x n, y; ) ). 5 / 56
6 Task: λ N N ( y n, argmax y g(x n, y; ) ). Problem: ( y, argmax y g(x, y; ) ) discontinuous.r.t.. Solution: Replace (y, y ) ith ell behaved l(x, y, ) Typically: l upper bound to, continuous and convex.r.t.. Ne task: λ N N l(x n, y n, )) 6 / 56
7 Regularized Risk Minimization λ N N l(x n, y n, )) Regularization + Loss on training data 7 / 56
8 Regularized Risk Minimization λ N N l(x n, y n, )) Regularization + Loss on training data Hinge loss: maximum margin training l(x n, y n [, ) := max (y n, y) +, φ(x n, y), φ(x n, y n ) ] y Y 8 / 56
9 Regularized Risk Minimization λ N N l(x n, y n, )) Regularization + Loss on training data Hinge loss: maximum margin training l(x n, y n [, ) := max (y n, y) +, φ(x n, y), φ(x n, y n ) ] y Y l is maximum over linear functions continuous, convex. l bounds from above. Proof: Let ȳ = argmax y g(x n, y, ) (y n, ȳ) (y n, ȳ) + g(x n, ȳ, ) g(x n, y n, ) [ max (y n, y) + g(x n, y, ) g(x n, y n, ) ] y Y 9 / 56
10 Regularized Risk Minimization λ N N l(x n, y n, )) Regularization + Loss on training data Hinge loss: maximum margin training l(x n, y n [, ) := max (y n, y) +, φ(x n, y), φ(x n, y n ) ] y Y Alternative: Logistic loss: probabilistic training l(x n, y n, ) := log y Y exp (, φ(x n, y), φ(x n, y n ) ) 10 / 56
11 Structured Output Support Vector Machine C N N [ ] max y Y (yn, y) +, φ(x n, y), φ(x n, y n ) Conditional Random Field 2 2σ 2 + N [ log exp (, φ(x n, y), φ(x n, y n ) )] y Y CRFs and SSVMs have more in common than usually assumed. both do regularized risk imization log y exp( ) can be interpreted as a soft-max 11 / 56
12 Solving the Training Optimization Problem Numerically Structured Output Support Vector Machine: C N N [ max y Y (yn, y) +, φ(x n, y), φ(x n, y n ) )] Unconstrained optimization, convex, non-differentiable objective. 12 / 56
13 Structured Output SVM (equivalent formulation):,ξ C N N ξ n subject to, for n = 1,..., N, max y Y [ ] (y n, y) +, φ(x n, y), φ(x n, y n ) ξ n N non-linear contraints, convex, differentiable objective. 13 / 56
14 Structured Output SVM (also equivalent formulation):,ξ C N N ξ n subject to, for n = 1,..., N, (y n, y) +, φ(x n, y), φ(x n, y n ) ξ n, for all y Y N Y linear constraints, convex, differentiable objective. 14 / 56
15 Example: Multiclass SVM { Y = {1, 2,..., K}, (y, y 1 for y y ) = 0 otherise. ( ) φ(x, y) = y = 1 φ(x), y = 2 φ(x),..., y = K φ(x) Solve:,ξ C N subject to, for i = 1,..., n, N ξ n, φ(x n, y n ), φ(x n, y) 1 ξ n for all y Y \ {y n }. Classification: f(x) = argmax y Y, φ(x, y). Crammer-Singer Multiclass SVM [K. Crammer, Y. Singer: On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines, JMLR, 2001] 15 / 56
16 Example: Hierarchical SVM Hierarchical Multiclass Loss: (y, y ) := 1 (distance in tree) 2 (cat, cat) = 0, (cat, dog) = 1, (cat, bus) = 2, etc. Solve:,ξ C N N ξ n subject to, for i = 1,..., n,, φ(x n, y n ), φ(x n, y) (y n, y) ξ n for all y Y. [L. Cai, T. Hofmann: Hierarchical Document Categorization ith Support Vector Machines, ACM CIKM, 2004] [A. Binder, K.-R. Müller, M. Kaanabe: On taxonomies for multi-class image categorization, IJCV, 2011] 16 / 56
17 Solving the Training Optimization Problem Numerically We can solve SSVM training like CRF training: C N N [ ] max y Y (yn, y) +, φ(x n, y), φ(x n, y n ) continuous unconstrained convex non-differentiable e can t use gradient descent directly. e ll have to use subgradients 17 / 56
18 Definition Let f : R D R be a convex, not necessarily differentiable, function. A vector v R D is called a subgradient of f at 0, if f() f( 0 ) + v, 0 for all. f() f(0) f(0)+ v, / 56
19 Definition Let f : R D R be a convex, not necessarily differentiable, function. A vector v R D is called a subgradient of f at 0, if f() f( 0 ) + v, 0 for all. f() f(0)+ v,-0 f(0) 0 19 / 56
20 Definition Let f : R D R be a convex, not necessarily differentiable, function. A vector v R D is called a subgradient of f at 0, if f() f( 0 ) + v, 0 for all. f() f(0) f(0)+ v, / 56
21 Definition Let f : R D R be a convex, not necessarily differentiable, function. A vector v R D is called a subgradient of f at 0, if f() f( 0 ) + v, 0 for all. f() f(0) f(0)+ v,-0 0 For differentiable f, the gradient v = f( 0 ) is the only subgradient. f() f(0) f(0)+ v, / 56
22 Subgradient descent orks basically like gradient descent: Subgradient Descent Minimization imize F () require: tolerance ɛ > 0, stepsizes η t cur 0 repeat v sub F ( cur ) cur cur η t v until F changed less than ɛ return cur Converges to global imum, but rather inefficient if F non-differentiable. [Shor, Minimization methods for non-differentiable functions, Springer, 1985.] 22 / 56
23 Computing a subgradient: C N N l n () ith l n () = max y l n y (), and l n y () := (y n, y) +, φ(x n, y), φ(x n, y n ) 23 / 56
24 Computing a subgradient: C N N l n () ith l n () = max y l n y (), and l n y () := (y n, y) +, φ(x n, y), φ(x n, y n ) l() y For each y Y, l y () is a linear function. 24 / 56
25 Computing a subgradient: C N N l n () ith l n () = max y l n y (), and l n y () := (y n, y) +, φ(x n, y), φ(x n, y n ) l() y' For each y Y, l y () is a linear function. 25 / 56
26 Computing a subgradient: C N N l n () ith l n () = max y l n y (), and l n y () := (y n, y) +, φ(x n, y), φ(x n, y n ) l() For each y Y, l y () is a linear function. 26 / 56
27 Computing a subgradient: C N N l n () ith l n () = max y l n y (), and l n y () := (y n, y) +, φ(x n, y), φ(x n, y n ) l() l() = max y l y (): maximum over all y Y. 27 / 56
28 Computing a subgradient: C N N l n () ith l n () = max y l n y (), and l n y () := (y n, y) +, φ(x n, y), φ(x n, y n ) l() Subgradient of l n at 0 : 0 28 / 56
29 Computing a subgradient: C N N l n () ith l n () = max y l n y (), and l n y () := (y n, y) +, φ(x n, y), φ(x n, y n ) l() 0 Subgradient of l n at 0 : find maximal (active) y. 29 / 56
30 Computing a subgradient: C N N l n () ith l n () = max y l n y (), and l n y () := (y n, y) +, φ(x n, y), φ(x n, y n ) l() Subgradient of l n at 0 : find maximal (active) y, use v = l n y ( 0 ) / 56
31 Subgradient Descent S-SVM Training input training pairs {(x 1, y 1 ),..., (x n, y n )} X Y, input feature map φ(x, y), loss function (y, y ), regularizer C, input number of iterations T, stepsizes η t for t = 1,..., T 1: 0 2: for t=1,...,t do 3: for i=1,...,n do 4: ŷ argmax y Y (y n, y) +, φ(x n, y), φ(x n, y n ) 5: v n φ(x n, ŷ) φ(x n, y n ) 6: end for 7: η t ( C N n vn ) 8: end for output prediction function f(x) = argmax y Y, φ(x, y). Observation: each update of needs 1 argmax-prediction per example. 31 / 56
32 We can use the same tricks as for CRFs, e.g. stochastic updates: Stochastic Subgradient Descent S-SVM Training input training pairs {(x 1, y 1 ),..., (x n, y n )} X Y, input feature map φ(x, y), loss function (y, y ), regularizer C, input number of iterations T, stepsizes η t for t = 1,..., T 1: 0 2: for t=1,...,t do 3: (x n, y n ) randomly chosen training example pair 4: ŷ argmax y Y (y n, y) +, φ(x n, y), φ(x n, y n ) 5: η t ( C N [φ(xn, ŷ) φ(x n, y n )]) 6: end for output prediction function f(x) = argmax y Y, φ(x, y). Observation: each update of needs only 1 argmax-prediction (but e ll need many iterations until convergence) 32 / 56
33 Solving the Training Optimization Problem Numerically We can solve an S-SVM like a linear SVM: One of the equivalent formulations as: subject to, for i = 1,... n, R D,ξ R n C N N ξ n, φ(x n, y n ), φ(x n, y) (y n, y) ξ n, for all y Y. Introduce feature vectors δφ(x n, y n, y) := φ(x n, y n ) φ(x n, y). 33 / 56
34 Solve R D,ξ R n C N subject to, for i = 1,... n, for all y Y, N ξ n, δφ(x n, y n, y) (y n, y) ξ n. This has the same structure as an ordinary SVM! quadratic objective linear constraints 34 / 56
35 Solve R D,ξ R n C N subject to, for i = 1,... n, for all y Y, N ξ n, δφ(x n, y n, y) (y n, y) ξ n. This has the same structure as an ordinary SVM! quadratic objective linear constraints Question: Can t e use a ordinary SVM/QP solver? 35 / 56
36 Solve R D,ξ R n C N subject to, for i = 1,... n, for all y Y, N ξ n, δφ(x n, y n, y) (y n, y) ξ n. This has the same structure as an ordinary SVM! quadratic objective linear constraints Question: Can t e use a ordinary SVM/QP solver? Anser: Almost! We could, if there eren t N Y constraints. E.g. 100 binary images: constraints 36 / 56
37 Solution: orking set training It s enough if e enforce the active constraints. The others ill be fulfilled automatically. We don t kno hich ones are active for the optimal solution. But it s likely to be only a small number can of course be formalized. Keep a set of potentially active constraints and update it iteratively: 37 / 56
38 Solution: orking set training It s enough if e enforce the active constraints. The others ill be fulfilled automatically. We don t kno hich ones are active for the optimal solution. But it s likely to be only a small number can of course be formalized. Keep a set of potentially active constraints and update it iteratively: Working Set Training Start ith orking set S = (no contraints) Repeat until convergence: Solve S-SVM training problem ith constraints from S Check, if solution violates any of the full constraint set if no: e found the optimal solution, terate. if yes: add most violated constraints to S, iterate. 38 / 56
39 Solution: orking set training It s enough if e enforce the active constraints. The others ill be fulfilled automatically. We don t kno hich ones are active for the optimal solution. But it s likely to be only a small number can of course be formalized. Keep a set of potentially active constraints and update it iteratively: Working Set Training Start ith orking set S = (no contraints) Repeat until convergence: Solve S-SVM training problem ith constraints from S Check, if solution violates any of the full constraint set if no: e found the optimal solution, terate. if yes: add most violated constraints to S, iterate. Good practical performance and theoretic guarantees: polynomial time convergence ɛ-close to the global optimum [Tsochantaridis et al. Large Margin Methods for Structured and Interdependent Output Variables, JMLR, 2005.] 39 / 56
40 Working Set S-SVM Training input training pairs {(x 1, y 1 ),..., (x n, y n )} X Y, input feature map φ(x, y), loss function (y, y ), regularizer C 1: S 2: repeat 3: (, ξ) solution to QP only ith constraints from S 4: for i=1,...,n do 5: ŷ argmax y Y (y n, y) +, φ(x n, y) 6: if ŷ y n then 7: S S {(x n, ŷ)} 8: end if 9: end for 10: until S doesn t change anymore. output prediction function f(x) = argmax y Y, φ(x, y). Observation: each update of needs 1 argmax-prediction per example. (but e solve globally for next, not by local steps) 40 / 56
41 One-Slack Formulation of S-SVM: (equivalent to ordinary S-SVM formulation by ξ = 1 N n ξn ) 1 R D,ξ R Cξ subject to, for all (ŷ 1,..., ŷ N ) Y Y, N [ (y n, ŷ N ) +, φ(x n, ŷ n ), φ(x n, y n ) ] Nξ, 41 / 56
42 One-Slack Formulation of S-SVM: (equivalent to ordinary S-SVM formulation by ξ = 1 N n ξn ) 1 R D,ξ R Cξ subject to, for all (ŷ 1,..., ŷ N ) Y Y, N [ (y n, ŷ N ) +, φ(x n, ŷ n ), φ(x n, y n ) ] Nξ, Y N linear constraints, convex, differentiable objective. We ble up the constraint set even further: 100 binary images: constraints (instead of ). 42 / 56
43 Working Set One-Slack S-SVM Training input training pairs {(x 1, y 1 ),..., (x n, y n )} X Y, input feature map φ(x, y), loss function (y, y ), regularizer C 1: S 2: repeat 3: (, ξ) solution to QP only ith constraints from S 4: for i=1,...,n do 5: ŷ n argmax y Y (y n, y) +, φ(x n, y) 6: end for 7: S S { ( (x 1,..., x n ), (ŷ 1,..., ŷ n ) ) } 8: until S doesn t change anymore. output prediction function f(x) = argmax y Y, φ(x, y). Often faster convergence: We add one strong constraint per iteration instead of n eak ones. 43 / 56
44 We can solve an S-SVM like a non-linear SVM: compute Lagrangian dual becomes max, original (primal) variables, ξ disappear, ne (dual) variables α iy : one per constraint of the original problem. Dual S-SVM problem max α R n Y +,...,n y Y α ny (y n, y) 1 2 subject to, for n = 1,..., N, α ny C N. y Y y,ȳ Y n,,...,n α ny α nȳ δφ(x n, y n, y), δφ(x n, y n, ȳ) N linear contraints, convex, differentiable objective, N Y variables. 44 / 56
45 We can kernelize: Define joint kernel function k : (X Y) (X Y) R k( (x, y), ( x, ȳ) ) = φ(x, y), φ( x, ȳ). k measure similarity beteen to (input,output)-pairs. We can express the optimization in terms of k: δφ(x n, y n, y), δφ(x n, y n, ȳ) = φ(x n, y n ) φ(x n, y), φ(x n, y n ) φ(x n, ȳ) = φ(x n, y n ), φ(x n, y n ) φ(x n, y n ), φ(x n, ȳ) φ(x n, y), φ(x n, y n ) + φ(x n, y), φ(x n, ȳ) = k( (x n, y n ), (x n, y n ) ) k( (x n, y n ), φ(x n, ȳ) ) k( (x n, y), (x n, y n ) ) + k( (x n, y), φ(x n, ȳ) ) =: K iīyȳ 45 / 56
46 Kernelized S-SVM problem: max α R n Y + α iy (y n, y) 1 α iy αīȳ K iīyȳ 2 i=1,...,n y Y subject to, for i = 1,..., n, α iy C N. y Y y,ȳ Y i,ī=1,...,n too many variables: train ith orking set of α iy. Kernelized prediction function: f(x) = argmax y Y iy α iy k( (x i, y i ), (x, y) ) 46 / 56
47 What do joint kernel functions look like? k( (x, y), ( x, ȳ) ) = φ(x, y), φ( x, ȳ). As in graphical model: easier if φ decomposes.r.t. factors: φ(x, y) = ( φ F (x, y F ) ) F F Then the kernel k decomposes into sum over factors: (φf k( (x, y), ( x, ȳ) ) = (x, y F ) ) F F, ( φ F (x, y F ) ) F F = φ F (x, y F ), φ F (x F F, y F ) = F F k F ( (x, y F ), (x, y F ) ) We can define kernels for each factor (e.g. nonlinear). 47 / 56
48 Example: figure-ground segmentation ith grid structure (x, y) ˆ=(, ) Typical kernels: arbirary in x, linear (or at least simple).r.t. y: Unary factors: k p ((x p, y p ), (x p, y p) = k(x p, x p) y p = y p ith k(x p, x p) local image kernel, e.g. χ 2 or histogram intersection Pairise factors: k pq ((y p, y q ), (y p, y p) = y q = y q y q = y q More poerful than all-linear, and argmax-prediction still possible. 48 / 56
49 Example: object localization left top (x, y) ˆ=(, ) image right bottom Only one factor that includes all x and y: k( (x, y), (x, y ) ) = k image (x y, x y ) ith k image image kernel and x y is image region ithin box y. argmax-prediction as difficult as object localization ith k image -SVM. 49 / 56
50 Summary S-SVM Learning Given: training set {(x 1, y 1 ),..., (x n, y n )} X Y loss function : Y Y R. Task: learn parameter for f(x) := argmax y, φ(x, y) that imizes expected loss on future data. 50 / 56
51 Summary S-SVM Learning Given: training set {(x 1, y 1 ),..., (x n, y n )} X Y loss function : Y Y R. Task: learn parameter for f(x) := argmax y, φ(x, y) that imizes expected loss on future data. S-SVM solution derived by maximum margin frameork: enforce correct output to be better than others by a margin :, φ(x n, y n ) (y n, y) +, φ(x n, y) for all y Y. convex optimization problem, but non-differentiable many equivalent formulations different training algorithms training needs repeated argmax prediction, no probabilistic inference 51 / 56
52 Sebastian Noozin and Christoph Lampert Structured Models in Computer Vision Part 5. Structured SVMs Extra I: Beyond Fully Supervised Learning So far, training as fully supervised, all variables ere observed. In real life, some variables are unobserved even during training. missing labels in training data latent variables, e.g. part location latent variables, e.g. part occlusion latent variables, e.g. viepoint 52 / 56
53 Three types of variables: x X alays observed, y Y observed only in training, z Z never observed (latent). Decision function: f(x) = argmax y Y max z Z, φ(x, y, z) 53 / 56
54 Three types of variables: x X alays observed, y Y observed only in training, z Z never observed (latent). Decision function: f(x) = argmax y Y max z Z, φ(x, y, z) Maximum Margin Training ith Maximization over Latent Variables Solve:,ξ C N N ξ n subject to, for n = 1,..., N, for all y Y (y n, y) + max z Z, φ(xn, y, z) max z Z Problem: not a convex problem can have local ima, φ(xn, y n, z) [C. Yu, T. Joachims, Learning Structural SVMs ith Latent Variables, ICML, 2009] similar idea: [Felzenszalb, McAllester, Ramaman. A Discriatively Trained, Multiscale, Deformable Part Model, CVPR 08] 54 / 56
55 Structured Learning is full of Open Research Questions Ho to train faster? CRFs need many runs of probablistic inference, SSVMs need many runs of argmax-predictions. Ho to reduce the necessary amount of training data? semi-supervised learning? transfer learning? Ho can e better understand different loss function? hen to use probabilistic training, hen maximum margin? CRFs are consistent, SSVMs are not. Is this relevant? Can e understand structured learning ith approximate inference? often computing L() or argmaxy, φ(x, y) exactly is infeasible. can e guarantee good results even ith approximate inference? More and ne applications! 55 / 56
56 Lunch-Break Continuing at 13:30 Slides available at cvpr2011tutorial/ 56 / 56
Part 5: Structured Support Vector Machines
Part 5: Structured Support Vector Machines Sebastian Nowozin and Christoph H. Lampert Providence, 21st June 2012 1 / 34 Problem (Loss-Minimizing Parameter Learning) Let d(x, y) be the (unknown) true data
More informationVisual Recognition: Examples of Graphical Models
Visual Recognition: Examples of Graphical Models Raquel Urtasun TTI Chicago March 6, 2012 Raquel Urtasun (TTI-C) Visual Recognition March 6, 2012 1 / 64 Graphical models Applications Representation Inference
More informationSupport Vector Machine Learning for Interdependent and Structured Output Spaces
Support Vector Machine Learning for Interdependent and Structured Output Spaces I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun, ICML, 2004. And also I. Tsochantaridis, T. Joachims, T. Hofmann,
More informationConvex Optimization. Erick Delage, and Ashutosh Saxena. October 20, (a) (b) (c)
Convex Optimization (for CS229) Erick Delage, and Ashutosh Saxena October 20, 2006 1 Convex Sets Definition: A set G R n is convex if every pair of point (x, y) G, the segment beteen x and y is in A. More
More informationMachine Learning: Think Big and Parallel
Day 1 Inderjit S. Dhillon Dept of Computer Science UT Austin CS395T: Topics in Multicore Programming Oct 1, 2013 Outline Scikit-learn: Machine Learning in Python Supervised Learning day1 Regression: Least
More informationDevelopment in Object Detection. Junyuan Lin May 4th
Development in Object Detection Junyuan Lin May 4th Line of Research [1] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection, CVPR 2005. HOG Feature template [2] P. Felzenszwalb,
More informationComplex Prediction Problems
Problems A novel approach to multiple Structured Output Prediction Max-Planck Institute ECML HLIE08 Information Extraction Extract structured information from unstructured data Typical subtasks Named Entity
More informationSupport vector machines
Support vector machines When the data is linearly separable, which of the many possible solutions should we prefer? SVM criterion: maximize the margin, or distance between the hyperplane and the closest
More informationCOMS 4771 Support Vector Machines. Nakul Verma
COMS 4771 Support Vector Machines Nakul Verma Last time Decision boundaries for classification Linear decision boundary (linear classification) The Perceptron algorithm Mistake bound for the perceptron
More informationSVMs for Structured Output. Andrea Vedaldi
SVMs for Structured Output Andrea Vedaldi SVM struct Tsochantaridis Hofmann Joachims Altun 04 Extending SVMs 3 Extending SVMs SVM = parametric function arbitrary input binary output 3 Extending SVMs SVM
More informationConstrained optimization
Constrained optimization A general constrained optimization problem has the form where The Lagrangian function is given by Primal and dual optimization problems Primal: Dual: Weak duality: Strong duality:
More informationSupport Vector Machines
Support Vector Machines RBF-networks Support Vector Machines Good Decision Boundary Optimization Problem Soft margin Hyperplane Non-linear Decision Boundary Kernel-Trick Approximation Accurancy Overtraining
More informationGenerative and discriminative classification techniques
Generative and discriminative classification techniques Machine Learning and Category Representation 013-014 Jakob Verbeek, December 13+0, 013 Course website: http://lear.inrialpes.fr/~verbeek/mlcr.13.14
More informationLecture 9: Support Vector Machines
Lecture 9: Support Vector Machines William Webber (william@williamwebber.com) COMP90042, 2014, Semester 1, Lecture 8 What we ll learn in this lecture Support Vector Machines (SVMs) a highly robust and
More informationIntroduction to Machine Learning
Introduction to Machine Learning Maximum Margin Methods Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB CSE 474/574
More informationStanford University. A Distributed Solver for Kernalized SVM
Stanford University CME 323 Final Project A Distributed Solver for Kernalized SVM Haoming Li Bangzheng He haoming@stanford.edu bzhe@stanford.edu GitHub Repository https://github.com/cme323project/spark_kernel_svm.git
More informationKernels + K-Means Introduction to Machine Learning. Matt Gormley Lecture 29 April 25, 2018
10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Kernels + K-Means Matt Gormley Lecture 29 April 25, 2018 1 Reminders Homework 8:
More informationSupport Vector Machines. James McInerney Adapted from slides by Nakul Verma
Support Vector Machines James McInerney Adapted from slides by Nakul Verma Last time Decision boundaries for classification Linear decision boundary (linear classification) The Perceptron algorithm Mistake
More informationCPSC 340: Machine Learning and Data Mining. More Linear Classifiers Fall 2017
CPSC 340: Machine Learning and Data Mining More Linear Classifiers Fall 2017 Admin Assignment 3: Due Friday of next week. Midterm: Can view your exam during instructor office hours next week, or after
More informationComparisons of Sequence Labeling Algorithms and Extensions
Nam Nguyen Yunsong Guo Department of Computer Science, Cornell University, Ithaca, NY 14853, USA NHNGUYEN@CS.CORNELL.EDU GUOYS@CS.CORNELL.EDU Abstract In this paper, we survey the current state-ofart models
More informationData Mining: Concepts and Techniques. Chapter 9 Classification: Support Vector Machines. Support Vector Machines (SVMs)
Data Mining: Concepts and Techniques Chapter 9 Classification: Support Vector Machines 1 Support Vector Machines (SVMs) SVMs are a set of related supervised learning methods used for classification Based
More informationPart-based Visual Tracking with Online Latent Structural Learning: Supplementary Material
Part-based Visual Tracking with Online Latent Structural Learning: Supplementary Material Rui Yao, Qinfeng Shi 2, Chunhua Shen 2, Yanning Zhang, Anton van den Hengel 2 School of Computer Science, Northwestern
More informationLinear models. Subhransu Maji. CMPSCI 689: Machine Learning. 24 February February 2015
Linear models Subhransu Maji CMPSCI 689: Machine Learning 24 February 2015 26 February 2015 Overvie Linear models Perceptron: model and learning algorithm combined as one Is there a better ay to learn
More informationSUPPORT VECTOR MACHINES
SUPPORT VECTOR MACHINES Today Reading AIMA 8.9 (SVMs) Goals Finish Backpropagation Support vector machines Backpropagation. Begin with randomly initialized weights 2. Apply the neural network to each training
More informationHierarchical Image-Region Labeling via Structured Learning
Hierarchical Image-Region Labeling via Structured Learning Julian McAuley, Teo de Campos, Gabriela Csurka, Florent Perronin XRCE September 14, 2009 McAuley et al (XRCE) Hierarchical Image-Region Labeling
More informationProgramming, numerics and optimization
Programming, numerics and optimization Lecture C-4: Constrained optimization Łukasz Jankowski ljank@ippt.pan.pl Institute of Fundamental Technological Research Room 4.32, Phone +22.8261281 ext. 428 June
More informationAll lecture slides will be available at CSC2515_Winter15.html
CSC2515 Fall 2015 Introduc3on to Machine Learning Lecture 9: Support Vector Machines All lecture slides will be available at http://www.cs.toronto.edu/~urtasun/courses/csc2515/ CSC2515_Winter15.html Many
More informationSupport Vector Machines and their Applications
Purushottam Kar Department of Computer Science and Engineering, Indian Institute of Technology Kanpur. Summer School on Expert Systems And Their Applications, Indian Institute of Information Technology
More informationLearning CRFs for Image Parsing with Adaptive Subgradient Descent
Learning CRFs for Image Parsing with Adaptive Subgradient Descent Honghui Zhang Jingdong Wang Ping Tan Jinglu Wang Long Quan The Hong Kong University of Science and Technology Microsoft Research National
More informationStructured Learning. Jun Zhu
Structured Learning Jun Zhu Supervised learning Given a set of I.I.D. training samples Learn a prediction function b r a c e Supervised learning (cont d) Many different choices Logistic Regression Maximum
More informationMachine Learning Lecture 9
Course Outline Machine Learning Lecture 9 Fundamentals ( weeks) Bayes Decision Theory Probability Density Estimation Nonlinear SVMs 30.05.016 Discriminative Approaches (5 weeks) Linear Discriminant Functions
More informationData Analysis 3. Support Vector Machines. Jan Platoš October 30, 2017
Data Analysis 3 Support Vector Machines Jan Platoš October 30, 2017 Department of Computer Science Faculty of Electrical Engineering and Computer Science VŠB - Technical University of Ostrava Table of
More informationPedestrian Detection Using Structured SVM
Pedestrian Detection Using Structured SVM Wonhui Kim Stanford University Department of Electrical Engineering wonhui@stanford.edu Seungmin Lee Stanford University Department of Electrical Engineering smlee729@stanford.edu.
More informationNon-Bayesian Classifiers Part II: Linear Discriminants and Support Vector Machines
Non-Bayesian Classifiers Part II: Linear Discriminants and Support Vector Machines Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2007 c 2007,
More informationMachine Learning Lecture 9
Course Outline Machine Learning Lecture 9 Fundamentals ( weeks) Bayes Decision Theory Probability Density Estimation Nonlinear SVMs 19.05.013 Discriminative Approaches (5 weeks) Linear Discriminant Functions
More informationSupport Vector Machines
Support Vector Machines RBF-networks Support Vector Machines Good Decision Boundary Optimization Problem Soft margin Hyperplane Non-linear Decision Boundary Kernel-Trick Approximation Accurancy Overtraining
More informationRandom Projection Features and Generalized Additive Models
Random Projection Features and Generalized Additive Models Subhransu Maji Computer Science Department, University of California, Berkeley Berkeley, CA 9479 8798 Homepage: http://www.cs.berkeley.edu/ smaji
More informationCSE 417T: Introduction to Machine Learning. Lecture 22: The Kernel Trick. Henry Chai 11/15/18
CSE 417T: Introduction to Machine Learning Lecture 22: The Kernel Trick Henry Chai 11/15/18 Linearly Inseparable Data What can we do if the data is not linearly separable? Accept some non-zero in-sample
More informationIncremental and Decremental Training for Linear. classification
Incremental and Decremental Training for Linear Classification Cheng-Hao Tsai Dept. of Computer Science National Taian Univ., Taian r95@csie.ntu.edu.t Chieh-Yen Lin Dept. of Computer Science National Taian
More informationStructure and Support Vector Machines. SPFLODD October 31, 2013
Structure and Support Vector Machines SPFLODD October 31, 2013 Outline SVMs for structured outputs Declara?ve view Procedural view Warning: Math Ahead Nota?on for Linear Models Training data: {(x 1, y
More informationDM6 Support Vector Machines
DM6 Support Vector Machines Outline Large margin linear classifier Linear separable Nonlinear separable Creating nonlinear classifiers: kernel trick Discussion on SVM Conclusion SVM: LARGE MARGIN LINEAR
More informationarxiv: v1 [cs.lg] 28 Aug 2014
A Multi-Plane Block-Coordinate Frank-Wolfe Algorithm for Structural SVMs with a Costly max-oracle Neel Shah, Vladimir Kolmogorov, Christoph H. Lampert IST Austria, 3400 Klosterneuburg, Austria arxiv:1408.6804v1
More informationA generalized quadratic loss for Support Vector Machines
A generalized quadratic loss for Support Vector Machines Filippo Portera and Alessandro Sperduti Abstract. The standard SVM formulation for binary classification is based on the Hinge loss function, where
More informationClassification with Partial Labels
Classification with Partial Labels Nam Nguyen, Rich Caruana Cornell University Department of Computer Science Ithaca, New York 14853 {nhnguyen, caruana}@cs.cornell.edu ABSTRACT In this paper, we address
More informationSupport Vector Machines.
Support Vector Machines srihari@buffalo.edu SVM Discussion Overview 1. Overview of SVMs 2. Margin Geometry 3. SVM Optimization 4. Overlapping Distributions 5. Relationship to Logistic Regression 6. Dealing
More informationSyllabus. 1. Visual classification Intro 2. SVM 3. Datasets and evaluation 4. Shallow / Deep architectures
Syllabus 1. Visual classification Intro 2. SVM 3. Datasets and evaluation 4. Shallow / Deep architectures Image classification How to define a category? Bicycle Paintings with women Portraits Concepts,
More informationLearning to Localize Objects with Structured Output Regression
Learning to Localize Objects with Structured Output Regression Matthew B. Blaschko and Christoph H. Lampert Max Planck Institute for Biological Cybernetics 72076 Tübingen, Germany {blaschko,chl}@tuebingen.mpg.de
More informationNatural Language Processing
Natural Language Processing Classification III Dan Klein UC Berkeley 1 Classification 2 Linear Models: Perceptron The perceptron algorithm Iteratively processes the training set, reacting to training errors
More informationClass 6 Large-Scale Image Classification
Class 6 Large-Scale Image Classification Liangliang Cao, March 7, 2013 EECS 6890 Topics in Information Processing Spring 2013, Columbia University http://rogerioferis.com/visualrecognitionandsearch Visual
More informationSUPPORT VECTOR MACHINES
SUPPORT VECTOR MACHINES Today Reading AIMA 18.9 Goals (Naïve Bayes classifiers) Support vector machines 1 Support Vector Machines (SVMs) SVMs are probably the most popular off-the-shelf classifier! Software
More informationA Short SVM (Support Vector Machine) Tutorial
A Short SVM (Support Vector Machine) Tutorial j.p.lewis CGIT Lab / IMSC U. Southern California version 0.zz dec 004 This tutorial assumes you are familiar with linear algebra and equality-constrained optimization/lagrange
More informationExponentiated Gradient Algorithms for Large-margin Structured Classification
Exponentiated Gradient Algorithms for Large-margin Structured Classification Peter L. Bartlett U.C.Berkeley bartlett@stat.berkeley.edu Ben Taskar Stanford University btaskar@cs.stanford.edu Michael Collins
More information12 Classification using Support Vector Machines
160 Bioinformatics I, WS 14/15, D. Huson, January 28, 2015 12 Classification using Support Vector Machines This lecture is based on the following sources, which are all recommended reading: F. Markowetz.
More informationContent-based image and video analysis. Machine learning
Content-based image and video analysis Machine learning for multimedia retrieval 04.05.2009 What is machine learning? Some problems are very hard to solve by writing a computer program by hand Almost all
More informationKernel Methods & Support Vector Machines
& Support Vector Machines & Support Vector Machines Arvind Visvanathan CSCE 970 Pattern Recognition 1 & Support Vector Machines Question? Draw a single line to separate two classes? 2 & Support Vector
More informationMachine Learning. Support Vector Machines. Fabio Vandin November 20, 2017
Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training
More informationLinear methods for supervised learning
Linear methods for supervised learning LDA Logistic regression Naïve Bayes PLA Maximum margin hyperplanes Soft-margin hyperplanes Least squares resgression Ridge regression Nonlinear feature maps Sometimes
More informationLECTURE 5: DUAL PROBLEMS AND KERNELS. * Most of the slides in this lecture are from
LECTURE 5: DUAL PROBLEMS AND KERNELS * Most of the slides in this lecture are from http://www.robots.ox.ac.uk/~az/lectures/ml Optimization Loss function Loss functions SVM review PRIMAL-DUAL PROBLEM Max-min
More informationCharacterizing Improving Directions Unconstrained Optimization
Final Review IE417 In the Beginning... In the beginning, Weierstrass's theorem said that a continuous function achieves a minimum on a compact set. Using this, we showed that for a convex set S and y not
More informationComputational Methods. Constrained Optimization
Computational Methods Constrained Optimization Manfred Huber 2010 1 Constrained Optimization Unconstrained Optimization finds a minimum of a function under the assumption that the parameters can take on
More informationLecture 18: March 23
0-725/36-725: Convex Optimization Spring 205 Lecturer: Ryan Tibshirani Lecture 8: March 23 Scribes: James Duyck Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes have not
More informationSupport Vector Machines
Support Vector Machines . Importance of SVM SVM is a discriminative method that brings together:. computational learning theory. previously known methods in linear discriminant functions 3. optimization
More information9. Support Vector Machines. The linearly separable case: hard-margin SVMs. The linearly separable case: hard-margin SVMs. Learning objectives
Foundations of Machine Learning École Centrale Paris Fall 25 9. Support Vector Machines Chloé-Agathe Azencot Centre for Computational Biology, Mines ParisTech Learning objectives chloe agathe.azencott@mines
More informationIntroduction to object recognition. Slides adapted from Fei-Fei Li, Rob Fergus, Antonio Torralba, and others
Introduction to object recognition Slides adapted from Fei-Fei Li, Rob Fergus, Antonio Torralba, and others Overview Basic recognition tasks A statistical learning approach Traditional or shallow recognition
More informationBig and Tall: Large Margin Learning with High Order Losses
Big and Tall: Large Margin Learning with High Order Losses Daniel Tarlow University of Toronto dtarlow@cs.toronto.edu Richard Zemel University of Toronto zemel@cs.toronto.edu Abstract Graphical models
More informationarxiv: v2 [cs.lg] 13 Jun 2014
Dual coordinate solvers for large-scale structural SVMs Deva Ramanan UC Irvine arxiv:32.743v2 [cs.lg] 3 Jun 204 This manuscript describes a method for training linear SVMs (including binary SVMs, SVM regression,
More informationCPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2016
CPSC 340: Machine Learning and Data Mining Principal Component Analysis Fall 2016 A2/Midterm: Admin Grades/solutions will be posted after class. Assignment 4: Posted, due November 14. Extra office hours:
More informationKernel SVM. Course: Machine Learning MAHDI YAZDIAN-DEHKORDI FALL 2017
Kernel SVM Course: MAHDI YAZDIAN-DEHKORDI FALL 2017 1 Outlines SVM Lagrangian Primal & Dual Problem Non-linear SVM & Kernel SVM SVM Advantages Toolboxes 2 SVM Lagrangian Primal/DualProblem 3 SVM LagrangianPrimalProblem
More informationLOGISTIC REGRESSION FOR MULTIPLE CLASSES
Peter Orbanz Applied Data Mining Not examinable. 111 LOGISTIC REGRESSION FOR MULTIPLE CLASSES Bernoulli and multinomial distributions The mulitnomial distribution of N draws from K categories with parameter
More informationCSE 573: Artificial Intelligence Autumn 2010
CSE 573: Artificial Intelligence Autumn 2010 Lecture 16: Machine Learning Topics 12/7/2010 Luke Zettlemoyer Most slides over the course adapted from Dan Klein. 1 Announcements Syllabus revised Machine
More informationLecture 10: SVM Lecture Overview Support Vector Machines The binary classification problem
Computational Learning Theory Fall Semester, 2012/13 Lecture 10: SVM Lecturer: Yishay Mansour Scribe: Gitit Kehat, Yogev Vaknin and Ezra Levin 1 10.1 Lecture Overview In this lecture we present in detail
More informationCPSC 340: Machine Learning and Data Mining. Multi-Class Classification Fall 2017
CPSC 340: Machine Learning and Data Mining Multi-Class Classification Fall 2017 Assignment 3: Admin Check update thread on Piazza for correct definition of trainndx. This could make your cross-validation
More informationKernel Methods. Chapter 9 of A Course in Machine Learning by Hal Daumé III. Conversion to beamer by Fabrizio Riguzzi
Kernel Methods Chapter 9 of A Course in Machine Learning by Hal Daumé III http://ciml.info Conversion to beamer by Fabrizio Riguzzi Kernel Methods 1 / 66 Kernel Methods Linear models are great because
More informationSupport Vector Machines.
Support Vector Machines srihari@buffalo.edu SVM Discussion Overview. Importance of SVMs. Overview of Mathematical Techniques Employed 3. Margin Geometry 4. SVM Training Methodology 5. Overlapping Distributions
More information5 Machine Learning Abstractions and Numerical Optimization
Machine Learning Abstractions and Numerical Optimization 25 5 Machine Learning Abstractions and Numerical Optimization ML ABSTRACTIONS [some meta comments on machine learning] [When you write a large computer
More informationInstance-based Learning
Instance-based Learning Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University October 15 th, 2007 2005-2007 Carlos Guestrin 1 1-Nearest Neighbor Four things make a memory based learner:
More informationSupport Vector Machines
Support Vector Machines SVM Discussion Overview. Importance of SVMs. Overview of Mathematical Techniques Employed 3. Margin Geometry 4. SVM Training Methodology 5. Overlapping Distributions 6. Dealing
More informationCombine the PA Algorithm with a Proximal Classifier
Combine the Passive and Aggressive Algorithm with a Proximal Classifier Yuh-Jye Lee Joint work with Y.-C. Tseng Dept. of Computer Science & Information Engineering TaiwanTech. Dept. of Statistics@NCKU
More informationRegularization and Markov Random Fields (MRF) CS 664 Spring 2008
Regularization and Markov Random Fields (MRF) CS 664 Spring 2008 Regularization in Low Level Vision Low level vision problems concerned with estimating some quantity at each pixel Visual motion (u(x,y),v(x,y))
More informationIntroduction to Machine Learning CMU-10701
Introduction to Machine Learning CMU-10701 Clustering and EM Barnabás Póczos & Aarti Singh Contents Clustering K-means Mixture of Gaussians Expectation Maximization Variational Methods 2 Clustering 3 K-
More informationConvexization in Markov Chain Monte Carlo
in Markov Chain Monte Carlo 1 IBM T. J. Watson Yorktown Heights, NY 2 Department of Aerospace Engineering Technion, Israel August 23, 2011 Problem Statement MCMC processes in general are governed by non
More informationGenerative and discriminative classification
Generative and discriminative classification Machine Learning and Object Recognition 2017-2018 Jakob Verbeek Classification in its simplest form Given training data labeled for two or more classes Classification
More informationMachine Learning (CSE 446): Concepts & the i.i.d. Supervised Learning Paradigm
Machine Learning (CSE 446): Concepts & the i.i.d. Supervised Learning Paradigm Sham M Kakade c 2018 University of Washington cse446-staff@cs.washington.edu 1 / 17 Review 1 / 17 Decision Tree: Making a
More informationMultiple cosegmentation
Armand Joulin, Francis Bach and Jean Ponce. INRIA -Ecole Normale Supérieure April 25, 2012 Segmentation Introduction Segmentation Supervised and weakly-supervised segmentation Cosegmentation Segmentation
More informationGenerative and discriminative classification techniques
Generative and discriminative classification techniques Machine Learning and Object Recognition 2015-2016 Jakob Verbeek, December 11, 2015 Course website: http://lear.inrialpes.fr/~verbeek/mlor.15.16 Classification
More information17 STRUCTURED PREDICTION
17 STRUCTURED PREDICTION It is often the case that instead of predicting a single output, you need to predict multiple, correlated outputs simultaneously. In natural language processing, you might want
More informationSupport Vector Machines
Support Vector Machines About the Name... A Support Vector A training sample used to define classification boundaries in SVMs located near class boundaries Support Vector Machines Binary classifiers whose
More informationSupport Vector Machines + Classification for IR
Support Vector Machines + Classification for IR Pierre Lison University of Oslo, Dep. of Informatics INF3800: Søketeknologi April 30, 2014 Outline of the lecture Recap of last week Support Vector Machines
More informationAn Introduction to Machine Learning
TRIPODS Summer Boorcamp: Topology and Machine Learning August 6, 2018 General Set-up Introduction Set-up and Goal Suppose we have X 1,X 2,...,X n data samples. Can we predict properites about any given
More informationSVM: Multiclass and Structured Prediction. Bin Zhao
SVM: Multiclass and Structured Prediction Bin Zhao Part I: Multi-Class SVM 2-Class SVM Primal form Dual form http://www.glue.umd.edu/~zhelin/recog.html Real world classification problems Digit recognition
More informationOnline Learning. Lorenzo Rosasco MIT, L. Rosasco Online Learning
Online Learning Lorenzo Rosasco MIT, 9.520 About this class Goal To introduce theory and algorithms for online learning. Plan Different views on online learning From batch to online least squares Other
More information732A54/TDDE31 Big Data Analytics
732A54/TDDE31 Big Data Analytics Lecture 10: Machine Learning with MapReduce Jose M. Peña IDA, Linköping University, Sweden 1/27 Contents MapReduce Framework Machine Learning with MapReduce Neural Networks
More informationMarkov Networks in Computer Vision. Sargur Srihari
Markov Networks in Computer Vision Sargur srihari@cedar.buffalo.edu 1 Markov Networks for Computer Vision Important application area for MNs 1. Image segmentation 2. Removal of blur/noise 3. Stereo reconstruction
More informationLearning Hierarchies at Two-class Complexity
Learning Hierarchies at Two-class Complexity Sandor Szedmak ss03v@ecs.soton.ac.uk Craig Saunders cjs@ecs.soton.ac.uk John Shawe-Taylor jst@ecs.soton.ac.uk ISIS Group, Electronics and Computer Science University
More informationSVM in Analysis of Cross-Sectional Epidemiological Data Dmitriy Fradkin. April 4, 2005 Dmitriy Fradkin, Rutgers University Page 1
SVM in Analysis of Cross-Sectional Epidemiological Data Dmitriy Fradkin April 4, 2005 Dmitriy Fradkin, Rutgers University Page 1 Overview The goals of analyzing cross-sectional data Standard methods used
More informationMarkov Random Fields and Segmentation with Graph Cuts
Markov Random Fields and Segmentation with Graph Cuts Computer Vision Jia-Bin Huang, Virginia Tech Many slides from D. Hoiem Administrative stuffs Final project Proposal due Oct 27 (Thursday) HW 4 is out
More informationMixture models and clustering
1 Lecture topics: Miture models and clustering, k-means Distance and clustering Miture models and clustering We have so far used miture models as fleible ays of constructing probability models for prediction
More informationAdapting SVM Classifiers to Data with Shifted Distributions
Adapting SVM Classifiers to Data with Shifted Distributions Jun Yang School of Computer Science Carnegie Mellon University Pittsburgh, PA 523 juny@cs.cmu.edu Rong Yan IBM T.J.Watson Research Center 9 Skyline
More informationAlternating Direction Method of Multipliers
Alternating Direction Method of Multipliers CS 584: Big Data Analytics Material adapted from Stephen Boyd (https://web.stanford.edu/~boyd/papers/pdf/admm_slides.pdf) & Ryan Tibshirani (http://stat.cmu.edu/~ryantibs/convexopt/lectures/21-dual-meth.pdf)
More informationIntroduction to Machine Learning Spring 2018 Note Sparsity and LASSO. 1.1 Sparsity for SVMs
CS 189 Introduction to Machine Learning Spring 2018 Note 21 1 Sparsity and LASSO 1.1 Sparsity for SVMs Recall the oective function of the soft-margin SVM prolem: w,ξ 1 2 w 2 + C Note that if a point x
More information