Part 5: Structured Support Vector Machines

Sebastian Nowozin and Christoph H. Lampert
Providence, 21st June 2012

Problem (Loss-Minimizing Parameter Learning)

Let $d(x, y)$ be the (unknown) true data distribution. Let $D = \{(x^1, y^1), \dots, (x^N, y^N)\}$ be i.i.d. samples from $d(x, y)$. Let $\phi : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^D$ be a feature function. Let $\Delta : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ be a loss function.

Find a weight vector $w$ that leads to minimal expected loss
$$\mathbb{E}_{(x,y) \sim d(x,y)}\{\Delta(y, f(x))\} \quad \text{for} \quad f(x) = \operatorname{argmax}_{y \in \mathcal{Y}} \langle w, \phi(x, y) \rangle.$$

Pro:
- We directly optimize for the quantity of interest: the expected loss.
- No expensive-to-compute partition function $Z$ will show up.

Con:
- We need to know the loss function already at training time.
- We can't use probabilistic reasoning to find $w$.

Reminder: learning by regularized risk minimization

For the compatibility function $g(x, y; w) := \langle w, \phi(x, y) \rangle$, find $w$ that minimizes
$$\mathbb{E}_{(x,y) \sim d(x,y)}\, \Delta\big(y, \operatorname{argmax}_{y'} g(x, y'; w)\big).$$

Two major problems:
- $d(x, y)$ is unknown
- $\operatorname{argmax}_{y} g(x, y; w)$ maps into a discrete space, so $\Delta(y, \operatorname{argmax}_{y'} g(x, y'; w))$ is discontinuous and piecewise constant as a function of $w$

Task:
$$\min_w \; \mathbb{E}_{(x,y) \sim d(x,y)}\, \Delta\big(y, \operatorname{argmax}_{y'} g(x, y'; w)\big).$$

Problem 1: $d(x, y)$ is unknown.

Solution: Replace the expectation $\mathbb{E}_{(x,y) \sim d(x,y)}(\,\cdot\,)$ with the empirical estimate $\frac{1}{N} \sum_{(x^n, y^n)}(\,\cdot\,)$. To avoid overfitting, add a regularizer, e.g. $\lambda \|w\|^2$.

New task:
$$\min_w \; \lambda \|w\|^2 + \frac{1}{N} \sum_{n=1}^N \Delta\big(y^n, \operatorname{argmax}_{y} g(x^n, y; w)\big).$$

Task:
$$\min_w \; \lambda \|w\|^2 + \frac{1}{N} \sum_{n=1}^N \Delta\big(y^n, \operatorname{argmax}_{y} g(x^n, y; w)\big).$$

Problem 2: $\Delta\big(y, \operatorname{argmax}_{y'} g(x, y'; w)\big)$ is discontinuous with respect to $w$.

Solution: Replace $\Delta(y, y')$ with a well-behaved surrogate $\ell(x, y, w)$. Typically, $\ell$ is an upper bound to $\Delta$, continuous and convex with respect to $w$.

New task:
$$\min_w \; \lambda \|w\|^2 + \frac{1}{N} \sum_{n=1}^N \ell(x^n, y^n, w).$$

Regularized Risk Minimization
$$\min_w \; \lambda \|w\|^2 + \frac{1}{N} \sum_{n=1}^N \ell(x^n, y^n, w)$$
Regularization + loss on training data.

Hinge loss: maximum margin training
$$\ell(x^n, y^n, w) := \max_{y \in \mathcal{Y}} \big[ \Delta(y^n, y) + \langle w, \phi(x^n, y) \rangle - \langle w, \phi(x^n, y^n) \rangle \big]$$
- $\ell$ is a maximum over linear functions: continuous, convex.
- $\ell$ bounds $\Delta$ from above.
  Proof: Let $\bar{y} = \operatorname{argmax}_y g(x^n, y, w)$. Then
  $$\Delta(y^n, \bar{y}) \le \Delta(y^n, \bar{y}) + g(x^n, \bar{y}, w) - g(x^n, y^n, w) \le \max_{y \in \mathcal{Y}} \big[ \Delta(y^n, y) + g(x^n, y, w) - g(x^n, y^n, w) \big],$$
  where the first inequality holds because $\bar{y}$ maximizes $g(x^n, \cdot, w)$, so $g(x^n, \bar{y}, w) - g(x^n, y^n, w) \ge 0$.
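
To make the hinge-loss definition concrete, here is a minimal Python sketch that evaluates $\ell(x^n, y^n, w)$ by brute-force enumeration over the label set. The interfaces `phi`, `delta`, and `Y` are assumptions for illustration (not defined on the slides); for large structured output spaces the maximum would instead be found by loss-augmented inference.

```python
import numpy as np

def structured_hinge_loss(w, phi, delta, Y, x_n, y_n):
    """Structured hinge loss for a single training pair (x_n, y_n).

    Assumed interfaces: phi(x, y) -> np.ndarray, delta(y, y2) -> float,
    Y an enumerable label set (sketch only; real problems use
    loss-augmented inference instead of enumeration).
    """
    score_true = w @ phi(x_n, y_n)
    # max over y of Delta(y_n, y) + <w, phi(x_n, y)> - <w, phi(x_n, y_n)>
    return max(delta(y_n, y) + w @ phi(x_n, y) - score_true for y in Y)
```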

Structured Output Support Vector Machine:
$$\min_w \; \frac{1}{2} \|w\|^2 + \frac{C}{N} \sum_{n=1}^N \max_{y \in \mathcal{Y}} \big[ \Delta(y^n, y) + \langle w, \phi(x^n, y) \rangle - \langle w, \phi(x^n, y^n) \rangle \big]$$

Conditional Random Field:
$$\min_w \; \frac{\|w\|^2}{2\sigma^2} + \sum_{n=1}^N \log \sum_{y \in \mathcal{Y}} \exp\big( \langle w, \phi(x^n, y) \rangle - \langle w, \phi(x^n, y^n) \rangle \big)$$

CRFs and SSVMs have more in common than usually assumed:
- both do regularized risk minimization
- $\log \sum_y \exp(\cdot)$ can be interpreted as a soft-max

Solving the Training Optimization Problem Numerically

Structured Output Support Vector Machine:
$$\min_w \; \frac{1}{2} \|w\|^2 + \frac{C}{N} \sum_{n=1}^N \max_{y \in \mathcal{Y}} \big[ \Delta(y^n, y) + \langle w, \phi(x^n, y) \rangle - \langle w, \phi(x^n, y^n) \rangle \big]$$

Unconstrained optimization, convex, non-differentiable objective.

Structured Output SVM (equivalent formulation):
$$\min_{w, \xi} \; \frac{1}{2} \|w\|^2 + \frac{C}{N} \sum_{n=1}^N \xi^n$$
subject to, for $n = 1, \dots, N$,
$$\max_{y \in \mathcal{Y}} \big[ \Delta(y^n, y) + \langle w, \phi(x^n, y) \rangle - \langle w, \phi(x^n, y^n) \rangle \big] \le \xi^n.$$

$N$ non-linear constraints, convex, differentiable objective.

Structured Output SVM (also equivalent formulation):
$$\min_{w, \xi} \; \frac{1}{2} \|w\|^2 + \frac{C}{N} \sum_{n=1}^N \xi^n$$
subject to, for $n = 1, \dots, N$,
$$\Delta(y^n, y) + \langle w, \phi(x^n, y) \rangle - \langle w, \phi(x^n, y^n) \rangle \le \xi^n, \quad \text{for all } y \in \mathcal{Y}.$$

$N |\mathcal{Y}|$ linear constraints, convex, differentiable objective.

Example: Multiclass SVM

$$\mathcal{Y} = \{1, 2, \dots, K\}, \qquad \Delta(y, y') = \begin{cases} 1 & \text{for } y \neq y', \\ 0 & \text{otherwise.} \end{cases}$$
$$\phi(x, y) = \big( [\![y = 1]\!]\, \phi(x),\; [\![y = 2]\!]\, \phi(x),\; \dots,\; [\![y = K]\!]\, \phi(x) \big)$$

Solve:
$$\min_{w, \xi} \; \frac{1}{2} \|w\|^2 + \frac{C}{N} \sum_{n=1}^N \xi^n$$
subject to, for $n = 1, \dots, N$,
$$\langle w, \phi(x^n, y^n) \rangle - \langle w, \phi(x^n, y) \rangle \ge 1 - \xi^n \quad \text{for all } y \in \mathcal{Y} \setminus \{y^n\}.$$

Classification: $f(x) = \operatorname{argmax}_{y \in \mathcal{Y}} \langle w, \phi(x, y) \rangle$.

Crammer-Singer Multiclass SVM
[K. Crammer, Y. Singer: On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines, JMLR, 2001]
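
The block-structured joint feature map of this example is easy to write out. The sketch below is an assumption-laden illustration (classes numbered $0, \dots, K-1$, dense NumPy vectors), not code from the slides.

```python
import numpy as np

def multiclass_joint_feature(x, y, K):
    """Joint feature map of the multiclass example: phi(x, y) places the
    input feature vector x into the block belonging to class y.
    Classes are assumed to be numbered 0, ..., K-1 in this sketch."""
    D = x.shape[0]
    out = np.zeros(K * D)
    out[y * D:(y + 1) * D] = x
    return out

def multiclass_predict(w, x, K):
    """f(x) = argmax_y <w, phi(x, y)>."""
    return max(range(K), key=lambda y: w @ multiclass_joint_feature(x, y, K))
```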

Example: Hierarchical SVM

Hierarchical multiclass loss: $\Delta(y, y') := \frac{1}{2}\, (\text{distance in tree})$, e.g. $\Delta(\text{cat}, \text{cat}) = 0$, $\Delta(\text{cat}, \text{dog}) = 1$, $\Delta(\text{cat}, \text{bus}) = 2$, etc.

Solve:
$$\min_{w, \xi} \; \frac{1}{2} \|w\|^2 + \frac{C}{N} \sum_{n=1}^N \xi^n$$
subject to, for $n = 1, \dots, N$,
$$\langle w, \phi(x^n, y^n) \rangle - \langle w, \phi(x^n, y) \rangle \ge \Delta(y^n, y) - \xi^n \quad \text{for all } y \in \mathcal{Y}.$$

[L. Cai, T. Hofmann: Hierarchical Document Categorization with Support Vector Machines, ACM CIKM, 2004]
[A. Binder, K.-R. Müller, M. Kawanabe: On taxonomies for multi-class image categorization, IJCV, 2011]

Solving the Training Optimization Problem Numerically

We can solve S-SVM training like CRF training:
$$\min_w \; \frac{1}{2} \|w\|^2 + \frac{C}{N} \sum_{n=1}^N \max_{y \in \mathcal{Y}} \big[ \Delta(y^n, y) + \langle w, \phi(x^n, y) \rangle - \langle w, \phi(x^n, y^n) \rangle \big]$$
- continuous
- unconstrained
- convex
- non-differentiable

We can't use gradient descent directly; we'll have to use subgradients.

Definition

Let $f : \mathbb{R}^D \to \mathbb{R}$ be a convex, not necessarily differentiable, function. A vector $v \in \mathbb{R}^D$ is called a sub-gradient of $f$ at $w_0$ if
$$f(w) \ge f(w_0) + \langle v, w - w_0 \rangle \quad \text{for all } w.$$

[Figure: the linear function $f(w_0) + \langle v, w - w_0 \rangle$ lies below $f$ everywhere and touches it at $w_0$.]

For differentiable $f$, the gradient $v = \nabla f(w_0)$ is the only subgradient.

[Figure: at a point where $f$ is differentiable, the tangent is the unique such linear lower bound.]
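
A small worked example (not on the slide) that applies the definition directly: for the absolute value function, the set of subgradients can be read off from the inequality.

```latex
f(w) = |w|, \qquad
\partial f(w_0) =
\begin{cases}
\{\operatorname{sign}(w_0)\}, & w_0 \neq 0,\\
{[-1,\, 1]}, & w_0 = 0,
\end{cases}
\qquad \text{since } |w| \ge v\,w \text{ for all } w \;\Longleftrightarrow\; |v| \le 1.
```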

Sub-gradient descent works basically like gradient descent:

Sub-gradient Descent Minimization: minimize $F(w)$
  require: tolerance $\epsilon > 0$
  $w_{\mathrm{cur}} \leftarrow 0$
  repeat
    $v \leftarrow \operatorname{sub}\nabla_w F(w_{\mathrm{cur}})$
    $\eta \leftarrow \operatorname{argmin}_{\eta \in \mathbb{R}} F(w_{\mathrm{cur}} - \eta v)$
    $w_{\mathrm{cur}} \leftarrow w_{\mathrm{cur}} - \eta v$
  until $\|v\| < \epsilon$
  return $w_{\mathrm{cur}}$

Converges to the global minimum, but rather inefficient if $F$ is non-differentiable.
[Shor, Minimization Methods for Non-Differentiable Functions, Springer, 1985.]
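
A minimal Python sketch of this loop. Note that it replaces the exact line search of the pseudocode with the more common decreasing step size $\eta_t = \eta_0 / t$, so it is an illustrative variant rather than a literal transcription; `subgrad` is an assumed user-supplied oracle.

```python
import numpy as np

def subgradient_descent(subgrad, w0, eps=1e-6, max_iter=1000, eta0=1.0):
    """Generic sub-gradient descent sketch: subgrad(w) returns any
    subgradient of the objective at w. A decreasing step size eta0/t
    stands in for the slide's exact line search."""
    w = np.asarray(w0, dtype=float).copy()
    for t in range(1, max_iter + 1):
        v = subgrad(w)
        if np.linalg.norm(v) < eps:   # (near-)zero subgradient: stop
            break
        w = w - (eta0 / t) * v
    return w
```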

Computing a subgradient:
$$\min_w \; \frac{1}{2} \|w\|^2 + \frac{C}{N} \sum_{n=1}^N \ell^n(w)$$
with $\ell^n(w) = \max_y \ell^n_y(w)$ and
$$\ell^n_y(w) := \Delta(y^n, y) + \langle w, \phi(x^n, y) \rangle - \langle w, \phi(x^n, y^n) \rangle.$$

[Figure: $\ell(w)$ is the upper envelope of the linear functions $\ell^n_y(w)$; a subgradient $v$ at $w_0$ is the slope of an active linear piece.]

Subgradient of $\ell^n$ at $w_0$: find the maximal (active) $y$, use $v = \nabla \ell^n_y(w_0)$.

Subgradient Descent S-SVM Training

input: training pairs $\{(x^1, y^1), \dots, (x^N, y^N)\} \subset \mathcal{X} \times \mathcal{Y}$,
input: feature map $\phi(x, y)$, loss function $\Delta(y, y')$, regularizer $C$,
input: number of iterations $T$, stepsizes $\eta_t$ for $t = 1, \dots, T$

 1: $w \leftarrow 0$
 2: for $t = 1, \dots, T$ do
 3:   for $n = 1, \dots, N$ do
 4:     $\hat{y} \leftarrow \operatorname{argmax}_{y \in \mathcal{Y}} \Delta(y^n, y) + \langle w, \phi(x^n, y) \rangle - \langle w, \phi(x^n, y^n) \rangle$
 5:     $v^n \leftarrow \phi(x^n, \hat{y}) - \phi(x^n, y^n)$
 6:   end for
 7:   $w \leftarrow w - \eta_t \big( w + \tfrac{C}{N} \sum_n v^n \big)$
 8: end for
output: prediction function $f(x) = \operatorname{argmax}_{y \in \mathcal{Y}} \langle w, \phi(x, y) \rangle$.

Observation: each update of $w$ needs one argmax-prediction per example.
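
The pseudocode translates almost line-by-line into Python. In the sketch below, `phi`, the loss-augmented inference routine `loss_aug_argmax`, and the step-size schedule `eta` are assumed user-supplied callables; nothing beyond the update rule itself comes from the slides.

```python
import numpy as np

def ssvm_subgradient_train(X, Y, phi, loss_aug_argmax, C, T, eta, dim):
    """Batch subgradient S-SVM training (sketch).

    Assumed interfaces:
      phi(x, y) -> np.ndarray of length dim,
      loss_aug_argmax(w, x, y) -> argmax_y' Delta(y, y') + <w, phi(x, y')>,
      eta(t) -> step size for iteration t.
    """
    N = len(X)
    w = np.zeros(dim)
    for t in range(1, T + 1):
        V = np.zeros(dim)
        for x_n, y_n in zip(X, Y):
            y_hat = loss_aug_argmax(w, x_n, y_n)      # loss-augmented prediction
            V += phi(x_n, y_hat) - phi(x_n, y_n)      # subgradient of the hinge term
        w -= eta(t) * (w + (C / N) * V)               # step along the subgradient
    return w
```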

We can use the same tricks as for CRFs, e.g. stochastic updates:

Stochastic Subgradient Descent S-SVM Training

input: training pairs $\{(x^1, y^1), \dots, (x^N, y^N)\} \subset \mathcal{X} \times \mathcal{Y}$,
input: feature map $\phi(x, y)$, loss function $\Delta(y, y')$, regularizer $C$,
input: number of iterations $T$, stepsizes $\eta_t$ for $t = 1, \dots, T$

 1: $w \leftarrow 0$
 2: for $t = 1, \dots, T$ do
 3:   $(x^n, y^n) \leftarrow$ randomly chosen training example pair
 4:   $\hat{y} \leftarrow \operatorname{argmax}_{y \in \mathcal{Y}} \Delta(y^n, y) + \langle w, \phi(x^n, y) \rangle - \langle w, \phi(x^n, y^n) \rangle$
 5:   $w \leftarrow w - \eta_t \big( w + \tfrac{C}{N} [\phi(x^n, \hat{y}) - \phi(x^n, y^n)] \big)$
 6: end for
output: prediction function $f(x) = \operatorname{argmax}_{y \in \mathcal{Y}} \langle w, \phi(x, y) \rangle$.

Observation: each update of $w$ needs only one argmax-prediction (but we'll need many iterations until convergence).
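
The same loop with a single randomly drawn example per update, again as a sketch under the interfaces assumed above.

```python
import random
import numpy as np

def ssvm_sgd_train(X, Y, phi, loss_aug_argmax, C, T, eta, dim):
    """Stochastic subgradient S-SVM training (sketch): one example per step."""
    N = len(X)
    w = np.zeros(dim)
    for t in range(1, T + 1):
        n = random.randrange(N)                         # random training pair
        y_hat = loss_aug_argmax(w, X[n], Y[n])          # loss-augmented argmax
        v = phi(X[n], y_hat) - phi(X[n], Y[n])
        w -= eta(t) * (w + (C / N) * v)                 # cheap single-example update
    return w
```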

Solving the Training Optimization Problem Numerically

We can solve an S-SVM like a linear SVM. One of the equivalent formulations was:
$$\min_{w \in \mathbb{R}^D,\, \xi \in \mathbb{R}^N_+} \; \frac{1}{2}\|w\|^2 + \frac{C}{N} \sum_{n=1}^N \xi^n$$
subject to, for $n = 1, \dots, N$,
$$\langle w, \phi(x^n, y^n) \rangle - \langle w, \phi(x^n, y) \rangle \ge \Delta(y^n, y) - \xi^n, \quad \text{for all } y \in \mathcal{Y}.$$

Introduce feature vectors $\delta\phi(x^n, y^n, y) := \phi(x^n, y^n) - \phi(x^n, y)$.

Solve
$$\min_{w \in \mathbb{R}^D,\, \xi \in \mathbb{R}^N_+} \; \frac{1}{2}\|w\|^2 + \frac{C}{N} \sum_{n=1}^N \xi^n$$
subject to, for $n = 1, \dots, N$ and for all $y \in \mathcal{Y}$,
$$\langle w, \delta\phi(x^n, y^n, y) \rangle \ge \Delta(y^n, y) - \xi^n.$$

This has the same structure as an ordinary SVM!
- quadratic objective
- linear constraints

Question: Can't we use an ordinary SVM/QP solver?
Answer: Almost! We could, if there weren't $N |\mathcal{Y}|$ constraints. E.g. for 100 binary $16 \times 16$ images, $|\mathcal{Y}| = 2^{256} \approx 10^{77}$, giving roughly $10^{79}$ constraints.

Solution: working set training

- It's enough if we enforce the active constraints. The others will be fulfilled automatically.
- We don't know which ones are active for the optimal solution.
- But it's likely to be only a small number (this can of course be formalized).

Keep a set of potentially active constraints and update it iteratively:

Working Set Training
- Start with working set $S = \emptyset$ (no constraints)
- Repeat until convergence:
  - Solve the S-SVM training problem with the constraints from $S$ only
  - Check whether the solution violates any constraint of the full set
    - if no: we found the optimal solution, terminate.
    - if yes: add the most violated constraints to $S$, iterate.

Good practical performance and theoretical guarantees: polynomial-time convergence to a solution that is $\epsilon$-close to the global optimum.
[Tsochantaridis et al., Large Margin Methods for Structured and Interdependent Output Variables, JMLR, 2005.]

Working Set S-SVM Training

input: training pairs $\{(x^1, y^1), \dots, (x^N, y^N)\} \subset \mathcal{X} \times \mathcal{Y}$,
input: feature map $\phi(x, y)$, loss function $\Delta(y, y')$, regularizer $C$

 1: $S \leftarrow \emptyset$
 2: repeat
 3:   $(w, \xi) \leftarrow$ solution to the QP with only the constraints from $S$
 4:   for $n = 1, \dots, N$ do
 5:     $\hat{y} \leftarrow \operatorname{argmax}_{y \in \mathcal{Y}} \Delta(y^n, y) + \langle w, \phi(x^n, y) \rangle$
 6:     if $\hat{y} \neq y^n$ then
 7:       $S \leftarrow S \cup \{(x^n, \hat{y})\}$
 8:     end if
 9:   end for
10: until $S$ doesn't change anymore.
output: prediction function $f(x) = \operatorname{argmax}_{y \in \mathcal{Y}} \langle w, \phi(x, y) \rangle$.

Observation: each update of $w$ needs one argmax-prediction per example (but we solve globally for the next $w$, not by local steps).
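
The loop structure is easy to mirror in code. In this sketch, `solve_qp` is an assumed callable that solves the restricted QP over the current working set and returns $(w, \xi)$, and `loss_aug_argmax` performs loss-augmented inference; labels are assumed hashable (e.g. ints or tuples). This mirrors the slide's loop, not any particular library implementation.

```python
def working_set_ssvm_train(X, Y, loss_aug_argmax, solve_qp, max_rounds=100):
    """Working-set S-SVM training (sketch with assumed helpers)."""
    S = set()                                   # working set: (example index, label)
    w = None
    for _ in range(max_rounds):
        w, _xi = solve_qp(S)                    # restricted QP over constraints in S
        changed = False
        for n, (x_n, y_n) in enumerate(zip(X, Y)):
            y_hat = loss_aug_argmax(w, x_n, y_n)
            if y_hat != y_n and (n, y_hat) not in S:
                S.add((n, y_hat))               # add the violated constraint
                changed = True
        if not changed:                         # working set stable: done
            break
    return w
```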

One-Slack Formulation of the S-SVM (equivalent to the ordinary S-SVM formulation by $\xi = \frac{1}{N} \sum_n \xi^n$):
$$\min_{w \in \mathbb{R}^D,\, \xi \in \mathbb{R}_+} \; \frac{1}{2}\|w\|^2 + C\xi$$
subject to, for all $(\hat{y}^1, \dots, \hat{y}^N) \in \mathcal{Y} \times \cdots \times \mathcal{Y}$,
$$\sum_{n=1}^N \big[ \Delta(y^n, \hat{y}^n) + \langle w, \phi(x^n, \hat{y}^n) \rangle - \langle w, \phi(x^n, y^n) \rangle \big] \le N\xi.$$

$|\mathcal{Y}|^N$ linear constraints, convex, differentiable objective.

We blew up the constraint set even further: 100 binary $16 \times 16$ images: $10^{177}$ constraints (instead of $10^{79}$).

Working Set One-Slack S-SVM Training

input: training pairs $\{(x^1, y^1), \dots, (x^N, y^N)\} \subset \mathcal{X} \times \mathcal{Y}$,
input: feature map $\phi(x, y)$, loss function $\Delta(y, y')$, regularizer $C$

 1: $S \leftarrow \emptyset$
 2: repeat
 3:   $(w, \xi) \leftarrow$ solution to the QP with only the constraints from $S$
 4:   for $n = 1, \dots, N$ do
 5:     $\hat{y}^n \leftarrow \operatorname{argmax}_{y \in \mathcal{Y}} \Delta(y^n, y) + \langle w, \phi(x^n, y) \rangle$
 6:   end for
 7:   $S \leftarrow S \cup \big\{ \big( (x^1, \dots, x^N), (\hat{y}^1, \dots, \hat{y}^N) \big) \big\}$
 8: until $S$ doesn't change anymore.
output: prediction function $f(x) = \operatorname{argmax}_{y \in \mathcal{Y}} \langle w, \phi(x, y) \rangle$.

Often faster convergence: we add one strong constraint per iteration instead of $N$ weak ones.

We can solve an S-SVM like a non-linear SVM: compute the Lagrangian dual
- min becomes max,
- the original (primal) variables $w, \xi$ disappear,
- new (dual) variables $\alpha_{ny}$: one per constraint of the original problem.

Dual S-SVM problem
$$\max_{\alpha \in \mathbb{R}^{N|\mathcal{Y}|}_+} \; \sum_{n=1,\dots,N} \sum_{y \in \mathcal{Y}} \alpha_{ny}\, \Delta(y^n, y) \;-\; \frac{1}{2} \sum_{\substack{y, \bar{y} \in \mathcal{Y} \\ n, \bar{n} = 1,\dots,N}} \alpha_{ny}\, \alpha_{\bar{n}\bar{y}}\, \big\langle \delta\phi(x^n, y^n, y),\, \delta\phi(x^{\bar{n}}, y^{\bar{n}}, \bar{y}) \big\rangle$$
subject to, for $n = 1, \dots, N$,
$$\sum_{y \in \mathcal{Y}} \alpha_{ny} \le \frac{C}{N}.$$

$N$ linear constraints, convex, differentiable objective, $N|\mathcal{Y}|$ variables.

We can kernelize: define a joint kernel function $k : (\mathcal{X} \times \mathcal{Y}) \times (\mathcal{X} \times \mathcal{Y}) \to \mathbb{R}$,
$$k\big( (x, y), (\bar{x}, \bar{y}) \big) = \langle \phi(x, y), \phi(\bar{x}, \bar{y}) \rangle.$$
$k$ measures the similarity between two (input, output)-pairs.

We can express the optimization in terms of $k$:
$$\begin{aligned}
\big\langle \delta\phi(x^n, y^n, y),\, \delta\phi(x^{\bar{n}}, y^{\bar{n}}, \bar{y}) \big\rangle
&= \big\langle \phi(x^n, y^n) - \phi(x^n, y),\; \phi(x^{\bar{n}}, y^{\bar{n}}) - \phi(x^{\bar{n}}, \bar{y}) \big\rangle \\
&= \langle \phi(x^n, y^n), \phi(x^{\bar{n}}, y^{\bar{n}}) \rangle - \langle \phi(x^n, y^n), \phi(x^{\bar{n}}, \bar{y}) \rangle - \langle \phi(x^n, y), \phi(x^{\bar{n}}, y^{\bar{n}}) \rangle + \langle \phi(x^n, y), \phi(x^{\bar{n}}, \bar{y}) \rangle \\
&= k\big( (x^n, y^n), (x^{\bar{n}}, y^{\bar{n}}) \big) - k\big( (x^n, y^n), (x^{\bar{n}}, \bar{y}) \big) - k\big( (x^n, y), (x^{\bar{n}}, y^{\bar{n}}) \big) + k\big( (x^n, y), (x^{\bar{n}}, \bar{y}) \big) \\
&=: K_{n\bar{n}y\bar{y}}
\end{aligned}$$

Kernelized S-SVM problem:
$$\max_{\alpha \in \mathbb{R}^{N|\mathcal{Y}|}_+} \; \sum_{n=1,\dots,N} \sum_{y \in \mathcal{Y}} \alpha_{ny}\, \Delta(y^n, y) \;-\; \frac{1}{2} \sum_{\substack{y, \bar{y} \in \mathcal{Y} \\ n, \bar{n} = 1,\dots,N}} \alpha_{ny}\, \alpha_{\bar{n}\bar{y}}\, K_{n\bar{n}y\bar{y}}$$
subject to, for $n = 1, \dots, N$,
$$\sum_{y \in \mathcal{Y}} \alpha_{ny} \le \frac{C}{N}.$$

Too many variables: train with a working set of the $\alpha_{ny}$.

Kernelized prediction function:
$$f(x) = \operatorname{argmax}_{y \in \mathcal{Y}} \sum_{n, y'} \alpha_{ny'}\, k\big( (x^n, y^n), (x, y) \big)$$
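
A small sketch of the kernelized prediction step, assuming the dual variables are stored sparsely in a dict and the label set is small enough to enumerate; the joint kernel `k` and the list of training pairs are user-supplied.

```python
def kernelized_predict(alpha, k, train_pairs, x, Y):
    """Kernelized prediction (sketch), following the slide's formula:
    f(x) = argmax_y sum_{n, y'} alpha[n, y'] * k((x^n, y^n), (x, y)).

    alpha: dict {(n, y'): value} with the (sparse) dual variables,
    train_pairs: list of (x^n, y^n), Y: enumerable label set,
    k: joint kernel on (input, output) pairs.
    """
    def score(y):
        return sum(a * k(train_pairs[n], (x, y)) for (n, _y), a in alpha.items())
    return max(Y, key=score)
```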

What do joint kernel functions look like?
$$k\big( (x, y), (\bar{x}, \bar{y}) \big) = \langle \phi(x, y), \phi(\bar{x}, \bar{y}) \rangle.$$

As in graphical models, it is easier if $\phi$ decomposes with respect to factors:
$$\phi(x, y) = \big( \phi_F(x, y_F) \big)_{F \in \mathcal{F}}$$
Then the kernel $k$ decomposes into a sum over factors:
$$\begin{aligned}
k\big( (x, y), (\bar{x}, \bar{y}) \big)
&= \Big\langle \big( \phi_F(x, y_F) \big)_{F \in \mathcal{F}},\; \big( \phi_F(\bar{x}, \bar{y}_F) \big)_{F \in \mathcal{F}} \Big\rangle \\
&= \sum_{F \in \mathcal{F}} \big\langle \phi_F(x, y_F),\, \phi_F(\bar{x}, \bar{y}_F) \big\rangle \\
&= \sum_{F \in \mathcal{F}} k_F\big( (x, y_F), (\bar{x}, \bar{y}_F) \big)
\end{aligned}$$
We can define kernels for each factor (e.g. nonlinear).

Example: figure-ground segmentation with grid structure

$(x, y)\;\hat{=}\;$ (image, binary segmentation mask)

Typical kernels: arbitrary in $x$, linear (or at least simple) with respect to $y$:
- Unary factors: $k_p\big( (x_p, y_p), (x'_p, y'_p) \big) = k(x_p, x'_p)\, [\![ y_p = y'_p ]\!]$, with $k(x_p, x'_p)$ a local image kernel, e.g. $\chi^2$ or histogram intersection
- Pairwise factors: $k_{pq}\big( (y_p, y_q), (y'_p, y'_q) \big) = [\![ y_p = y'_p ]\!]\, [\![ y_q = y'_q ]\!]$

More powerful than all-linear, and argmax-prediction is still possible.
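
For the grid example, the unary and pairwise factor kernels can be written out directly. The histogram-intersection choice and the per-pixel histogram inputs below are illustrative assumptions, not something the slide prescribes.

```python
import numpy as np

def histogram_intersection(h1, h2):
    """Histogram intersection kernel between two local feature histograms."""
    return np.minimum(h1, h2).sum()

def unary_kernel(xp, yp, xq, yq):
    """Unary-factor joint kernel: local image kernel times an indicator
    that the two pixel labels agree (xp, xq: per-pixel histograms;
    yp, yq: binary labels). Sketch only."""
    return histogram_intersection(xp, xq) * float(yp == yq)

def pairwise_kernel(yp, yq, yp2, yq2):
    """Pairwise-factor kernel: both label pairs must agree."""
    return float(yp == yp2) * float(yq == yq2)
```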

Example: object localization

$(x, y)\;\hat{=}\;$ (image, bounding box given by left, top, right, bottom coordinates)

Only one factor that includes all of $x$ and $y$:
$$k\big( (x, y), (x', y') \big) = k_{\text{image}}\big( x|_y,\; x'|_{y'} \big)$$
with $k_{\text{image}}$ an image kernel and $x|_y$ the image region within box $y$.

argmax-prediction is as difficult as object localization with a $k_{\text{image}}$-SVM.

Summary: S-SVM Learning

Given:
- a training set $\{(x^1, y^1), \dots, (x^N, y^N)\} \subset \mathcal{X} \times \mathcal{Y}$
- a loss function $\Delta : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$.

Task: learn a parameter $w$ for $f(x) := \operatorname{argmax}_y \langle w, \phi(x, y) \rangle$ that minimizes the expected loss on future data.

The S-SVM solution is derived from the maximum margin framework: enforce the correct output to be better than all others by a margin,
$$\langle w, \phi(x^n, y^n) \rangle \ge \Delta(y^n, y) + \langle w, \phi(x^n, y) \rangle \quad \text{for all } y \in \mathcal{Y}.$$
- convex optimization problem, but non-differentiable
- many equivalent formulations, different training algorithms
- training needs repeated argmax prediction, no probabilistic inference

Extra I: Beyond Fully Supervised Learning

So far, training was fully supervised: all variables were observed. In real life, some variables are unobserved even during training.
- missing labels in training data
- latent variables, e.g. part location
- latent variables, e.g. part occlusion
- latent variables, e.g. viewpoint

Three types of variables:
- $x \in \mathcal{X}$: always observed,
- $y \in \mathcal{Y}$: observed only in training,
- $z \in \mathcal{Z}$: never observed (latent).

Decision function: $f(x) = \operatorname{argmax}_{y \in \mathcal{Y}} \max_{z \in \mathcal{Z}} \langle w, \phi(x, y, z) \rangle$

Maximum Margin Training with Maximization over Latent Variables

Solve:
$$\min_{w, \xi} \; \frac{1}{2}\|w\|^2 + \frac{C}{N} \sum_{n=1}^N \xi^n$$
subject to, for $n = 1, \dots, N$ and for all $y \in \mathcal{Y}$,
$$\Delta(y^n, y) + \max_{z \in \mathcal{Z}} \langle w, \phi(x^n, y, z) \rangle - \max_{z \in \mathcal{Z}} \langle w, \phi(x^n, y^n, z) \rangle \le \xi^n.$$

Problem: this is not a convex problem and can have local minima.

[C. Yu, T. Joachims: Learning Structural SVMs with Latent Variables, ICML, 2009]
similar idea: [Felzenszwalb, McAllester, Ramanan: A Discriminatively Trained, Multiscale, Deformable Part Model, CVPR 2008]

Structured Learning is full of Open Research Questions

- How to train faster? CRFs need many runs of probabilistic inference, SSVMs need many runs of argmax-prediction.
- How to reduce the necessary amount of training data? Semi-supervised learning? Transfer learning?
- How can we better understand different loss functions? When to use probabilistic training, when maximum margin? CRFs are consistent, SSVMs are not. Is this relevant?
- Can we understand structured learning with approximate inference? Often, computing $\mathcal{L}(w)$ or $\operatorname{argmax}_y \langle w, \phi(x, y) \rangle$ exactly is infeasible. Can we guarantee good results even with approximate inference?
- More and new applications!