SVM: Multiclass and Structured Prediction. Bin Zhao


Part I: Multi-Class SVM

2-Class SVM Primal form Dual form
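The primal and dual objectives were figures in the original slides and were lost in transcription; for reference, the standard soft-margin two-class SVM forms they refer to are:

```latex
% Primal (soft margin):
\min_{w,\,b,\,\xi}\; \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{s.t.}\quad y_i\big(\langle w, x_i\rangle + b\big) \ge 1 - \xi_i,\;\; \xi_i \ge 0

% Dual:
\max_{\alpha}\; \sum_{i=1}^{n} \alpha_i \;-\; \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j\rangle
\quad \text{s.t.}\quad 0 \le \alpha_i \le C,\;\; \sum_i \alpha_i y_i = 0
```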

Real-world classification problems (http://www.glue.umd.edu/~zhelin/recog.html): Digit recognition: 10 classes Automated protein classification: 300-600 classes Object recognition: 100 classes Phoneme recognition: 50 classes [Waibel, Hanzawa, Hinton, Shikano, Lang 1989] The number of classes is sometimes big, and the multi-class algorithm can be heavy

How can we solve the multi-class problem? One-against-one One-against-rest Crammer & Singer's formulation Error-correcting output coding Empirical comparisons

One-against-one

One-against-rest
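A minimal sketch of both reductions, using scikit-learn's meta-estimators; the dataset choice is illustrative, not from the slides:

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)  # 3 classes, 4 features

# One-against-one: trains K(K-1)/2 binary SVMs, one per class pair;
# prediction is by voting among the pairwise classifiers
ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)

# One-against-rest: trains K binary SVMs, each separating one class
# from all the others; prediction takes the highest-scoring class
ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)

print(len(ovo.estimators_))  # 3*2/2 = 3 pairwise classifiers
print(len(ovr.estimators_))  # 3 one-vs-rest classifiers
```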

Problems One-against-one One-against-rest

Crammer & Singer's formulation A naïve approach

Crammer & Singer's formulation A naïve approach C & S's formulation
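scikit-learn also exposes the Crammer & Singer single-machine formulation; a minimal sketch (again with an illustrative dataset):

```python
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# One joint optimization over all K class weight vectors, rather than
# K (or K(K-1)/2) independent binary problems
clf = LinearSVC(multi_class="crammer_singer", max_iter=10000).fit(X, y)
print(clf.coef_.shape)  # one weight vector per class: (3, 4)
```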

Error-Correcting Output Code (ECOC) Each class is assigned a binary codeword (e.g., 0 1 0 0 0 0 0 0 0 0); a meta-classifier predicts each bit and decodes to the nearest codeword Source: Dietterich and Bakiri (1995)
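Dietterich and Bakiri's ECOC scheme is available as a scikit-learn meta-estimator; a minimal sketch (dataset and code_size are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OutputCodeClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# Each class gets a random binary codeword; code_size sets the number
# of binary classifiers as a multiple of the number of classes
ecoc = OutputCodeClassifier(LinearSVC(), code_size=5,
                            random_state=0).fit(X, y)
print(ecoc.code_book_.shape)  # (3 classes, 15 code bits)
```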

Discussions

Special cases of ECOC One-against-one One-against-rest

Empirical Study: "In Defense of One-Vs-All Classification" (JMLR 2004): The most important step in good multiclass classification is to use the best binary classifier available. Once this is done, it seems to make little difference what multiclass scheme is applied, and therefore a simple scheme such as OVA (or AVA) is preferable to a more complex error-correcting coding scheme or single-machine scheme.

Part II: Structured SVM Slides Courtesy: Ioannis Tsochantaridis, Thomas Hofmann, Thorsten Joachims, Yasemin Altun

Local Classification: predicts "b r a r e" — classify each letter using local information only; ignores correlations! [thanks to Ben Taskar for slide!]

Structured Classification: predicts "b r a c e" — use local information and exploit correlations between letters [thanks to Ben Taskar for slide!]

Case Study: Max Margin Learning on Domain-Independent Web Information Extraction

Motivation Understand the web page Assign semantics to each component of a page Understand the functionality of each section Distinguish main content from side content Page layout reveals important cues

Motivation (Cont.) A human can understand a page written in a foreign language, using: Position Size Font Color Boldness Relative position to other sections Can a computer fulfill a similar task? Domain-independent web information extraction

The Proposed Approach Page segmentation: vision tree Built from the DOM tree, with visual information Corresponds to how each node is displayed Structured segmentation Assign a label for each node in the vision tree Information extraction based on node classification

Structured Classification Label space for leaf nodes: attribute name, attribute value, image, nav bar, main title, page tail, image caption, non-attribute (anything else) Label space for non-leaf nodes: structure block, data record, nav bar block, non-attribute block, value block, page tail block, name block, image block, image caption block, main title block

Max Margin Learning Structured output space (tree) If treated as conventional classification: exponential number of classes, unable to train Augmented loss: misclassifying a tree by one node should receive much less penalty than misclassifying the entire tree Input-output feature mapping: Ψ(x, y) Linear discriminant function: F(x, y; w) = ⟨w, Ψ(x, y)⟩ Response to input x: ŷ = argmax_y F(x, y; w)

Max Margin Learning (Cont.) Augmented loss: # of disagreements between y and y′ Learning: use SVM to find the optimal w Inference: dynamic programming
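The slides give no code; as an illustration of the dynamic-programming inference, here is a minimal NumPy Viterbi sketch for a chain of labels (the tree case generalizes via max-sum message passing; all scores below are made up):

```python
import numpy as np

def viterbi(node_scores, edge_scores):
    """Max-scoring label sequence for a chain model.

    node_scores: (T, K) score of assigning label k at position t
    edge_scores: (K, K) score of label pair (prev, cur) on an edge
    """
    T, K = node_scores.shape
    best = node_scores[0].copy()        # best score ending in each label
    back = np.zeros((T, K), dtype=int)  # backpointers
    for t in range(1, T):
        cand = best[:, None] + edge_scores   # rows: prev label, cols: cur
        back[t] = cand.argmax(axis=0)
        best = cand.max(axis=0) + node_scores[t]
    # Trace back the best path from the final position
    path = [int(best.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1], float(best.max())

# Toy example: 3 positions, 2 labels; edges reward keeping the same label
nodes = np.array([[2.0, 0.0], [0.0, 1.0], [2.0, 0.0]])
edges = np.array([[1.0, -1.0], [-1.0, 1.0]])
path, score = viterbi(nodes, edges)
print(path, score)  # → [0, 0, 0] 6.0
```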

Input-Output Feature Mapping Two types of cliques in the hierarchical model: Cliques covering observation-state node pairs Cliques covering state-state node pairs Correspondingly, two types of features: Features describing observation-state cliques Features describing state-state cliques

Input-Output Feature Mapping: Type I Spatial features Position of block center, block height, block width, block area Features for all nodes Link number, number of various HTML tags, number of child nodes Features for leaf nodes only Text length, font, bold, italic, word count, number of images, image size, link text length

Input-Output Feature Mapping: Type II Parent-child relationship Label co-occurrence pattern Connect spatially adjacent blocks Link a node with its k nearest neighbors Define label co-occurrence pattern for connected node pair (i,j)

Learning and Inference Learning Quadratic programming with an exponential number of constraints cutting plane algorithm (a.k.a. constraint generation, bundle method) Inference Without edges between spatially adjacent blocks dynamic programming (a.k.a. Viterbi decoding) With edges between spatially adjacent blocks loopy belief propagation
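A full cutting-plane solver is out of scope for a slide transcript, but its core step, loss-augmented inference over the most violated label, also drives the simpler subgradient sketch below of the same structured hinge objective. The joint feature map and toy data are hypothetical, and subgradient descent here substitutes for the cutting-plane QP:

```python
import numpy as np

def loss(y, y_hat):
    # 0/1 loss; the slides' tree case would use Hamming loss over nodes
    return 0.0 if y == y_hat else 1.0

def psi(x, y, n_labels):
    # Joint feature map: x placed in the block belonging to label y
    f = np.zeros(n_labels * x.size)
    f[y * x.size:(y + 1) * x.size] = x
    return f

def train(X, Y, n_labels, epochs=50, lr=0.1, reg=1e-3):
    w = np.zeros(n_labels * X.shape[1])
    for _ in range(epochs):
        for x, y in zip(X, Y):
            # Loss-augmented inference: the most violated label
            scores = [w @ psi(x, yh, n_labels) + loss(y, yh)
                      for yh in range(n_labels)]
            y_hat = int(np.argmax(scores))
            # Subgradient step on the structured hinge loss
            if y_hat != y:
                w += lr * (psi(x, y, n_labels) - psi(x, y_hat, n_labels))
            w *= (1 - lr * reg)  # regularization shrinkage
    return w

def predict(w, x, n_labels):
    return int(np.argmax([w @ psi(x, yh, n_labels)
                          for yh in range(n_labels)]))

# Toy 3-class problem with orthogonal inputs
X = np.eye(3)
Y = [0, 1, 2]
w = train(X, Y, n_labels=3)
print([predict(w, x, 3) for x in X])  # → [0, 1, 2]
```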

Empirical Study Block-level prediction results Attribute name-value pair extraction on 1000 web pages Precision: 56.91% Recall: 59.37%

Thank you