Support Vector Machines + Classification for IR

Support Vector Machines + Classification for IR
Pierre Lison, University of Oslo, Dep. of Informatics
INF3800: Søketeknologi, April 30, 2014

Outline of the lecture:
- Recap of last week
- Support Vector Machines
- Classification for IR: practical aspects, relevance ranking
- Conclusion

Rocchio classification
Rocchio finds the centre of mass (centroid) of each class. A new point x will be classified in class c if it is closest to the centroid of c.
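Below is a minimal sketch of this centroid rule using scikit-learn's NearestCentroid, which implements the same nearest-centroid decision; the tiny 2-D dataset is invented purely for illustration.

```python
# Sketch: Rocchio-style (nearest-centroid) classification with scikit-learn.
# The tiny 2-D dataset below is purely illustrative.
import numpy as np
from sklearn.neighbors import NearestCentroid

X_train = np.array([[1.0, 1.0], [1.5, 0.8], [4.0, 4.2], [4.5, 3.9]])
y_train = np.array([0, 0, 1, 1])

clf = NearestCentroid()          # computes one centroid per class
clf.fit(X_train, y_train)

x_new = np.array([[1.2, 1.1]])
print(clf.centroids_)            # centre of mass of each class
print(clf.predict(x_new))        # assigns x_new to the closest centroid -> class 0
```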

k-nearest neighbour (k-NN)
k-NN adopts a different approach: k is the number of neighbours to consider, and the classifier relies on local decisions based on the closest neighbours.

Linear vs. non-linear classification
Linear classifier:
- Function: linear combination of features, $y = f(w^T x)$
- Decision boundary: hyperplane
- Examples: Naive Bayes, Rocchio, logistic regression, linear SVMs
- Pros: often robust, fast
- Cons: can fail if the problem is not linearly separable
Non-linear classifier:
- Function: arbitrary non-linear function
- Decision boundary: non-linear, possibly discontinuous
- Examples: k-NN, multilayer neural networks, non-linear SVMs
- Pros: can express complex dependencies
- Cons: prone to overfitting
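To make the linear vs. non-linear contrast concrete, here is a small sketch comparing a linear SVM with the non-linear k-NN classifier; the "two moons" toy data and the parameter choices are assumptions made for the example.

```python
# Sketch: comparing a linear and a non-linear classifier on data that is not
# linearly separable (the "two moons" toy set, used here only for illustration).
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

linear = LinearSVC(C=1.0).fit(X_tr, y_tr)                   # hyperplane boundary
knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)   # local, non-linear boundary

print("linear SVM accuracy:", linear.score(X_te, y_te))     # limited by the linear boundary
print("5-NN accuracy:      ", knn.score(X_te, y_te))        # can follow the curved classes
```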

Bias-variance trade-off
The learning error of a classification method $\Gamma$ is the expectation (average) over the possible training sets $D$:

$$\text{learning-error}(\Gamma) = E_D E_x \big[\Gamma_D(x) - P(c \mid x)\big]^2 = E_x \big[\text{bias}(\Gamma, x) + \text{variance}(\Gamma, x)\big]$$

The bias term measures how often the classifier's prediction deviates from the true class; the variance term measures the amount of variation in the classifier's prediction depending on the training data.
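As a rough illustration of the variance term, the sketch below retrains the same classifier on bootstrap resamples of the training set and measures how much its predictions vary; the dataset and the two classifiers are assumptions made for the example, not part of the lecture.

```python
# Sketch: empirically illustrating the variance term of the decomposition by
# retraining a classifier on bootstrap resamples of the training data and
# measuring how much its predictions vary. Data and classifiers are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, y_train, X_test = X[:500], y[:500], X[500:]

def prediction_variance(make_clf, n_resamples=30):
    """Average per-point variance of predictions across resampled training sets."""
    rng = np.random.default_rng(0)
    preds = []
    for _ in range(n_resamples):
        idx = rng.integers(0, len(X_train), len(X_train))   # bootstrap resample
        clf = make_clf().fit(X_train[idx], y_train[idx])
        preds.append(clf.predict(X_test))
    return np.mean(np.var(np.array(preds), axis=0))

# Typically the unpruned tree (lower bias) shows higher variance than the linear SVM.
print("linear SVM:            ", prediction_variance(lambda: LinearSVC(C=1.0)))
print("unpruned decision tree:", prediction_variance(lambda: DecisionTreeClassifier()))
```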

Support Vector Machines
SVMs are a state-of-the-art classification method, with both linear and non-linear variants. They are non-parametric and discriminative, have good generalisation properties (quite resistant to overfitting), and have extensions for multiclass classification, structured prediction and regression.

SVMs: linear case
We have seen that linear classifiers (such as Rocchio, Naive Bayes, etc.) create a hyperplane separating the classes. But there is an infinite number of possible separating hyperplanes.

SVMs: linear case
Are some hyperplanes better than others? A good boundary should be as far as possible from the training points. The SVM idea: find the hyperplane with a maximum margin of separation, defined by the input vectors that support the margin (cf. next slides).

Why maximise the margin? Points close to the boundary are very uncertain, and small margins between the boundary and the training points make the classifier very sensitive to variations in the training data (= high variance, cf. last week). We want to keep the variance as low as possible to ensure the classifier generalises well to new data (= is resistant to overfitting).

SVMs: linear case
Classification function: $f(x) = \mathrm{sign}(w^T x + b)$, with classification output in $\{+1, -1\}$, where

$$\mathrm{sign}(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x = 0 \\ -1 & \text{if } x < 0 \end{cases}$$

Here $w$ is the weight vector, $b$ the bias term, and $x$ the input vector. The training data is composed of $n$ examples $\{(x_i, y_i) : 1 \leq i \leq n\}$, where $y_i \in \{+1, -1\}$. Note: the math conventions for SVMs are slightly different than for other classification techniques.

The goal is to find the values of the weight vector $w$ and bias $b$ that maximise the margin. It can be shown (cf. textbook) that this goal is equivalent to the following quadratic optimisation problem:

Find $w$ and $b$ such that $\tfrac{1}{2} w^T w$ is minimised, subject to $y_i(w^T x_i + b) \geq 1$ for all $\{(x_i, y_i)\}$.
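As a concrete companion to this optimisation problem, here is a minimal sketch using scikit-learn's SVC with a linear kernel; a very large C is used to approximate the hard margin, and the four training points are assumed linearly separable.

```python
# Sketch: solving the (approximately hard-margin) linear SVM problem above with
# scikit-learn and reading off the learned w and b.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 0.5], [4.0, 4.0], [5.0, 3.5]])
y = np.array([-1, -1, 1, 1])                     # labels in {+1, -1}, as in the slides

clf = SVC(kernel="linear", C=1e6)                # solves the quadratic problem internally
clf.fit(X, y)

w = clf.coef_[0]                                 # weight vector w
b = clf.intercept_[0]                            # bias term b
print("w =", w, " b =", b)
print("prediction f(x) for [3, 2]:", np.sign(w @ [3.0, 2.0] + b))
```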

SVMs: linear case
Many algorithms are available for solving this class of optimisation problems. The resulting weight vector can be expressed as a combination of the training points:

$$w = \sum_i \alpha_i y_i x_i$$

where the $\alpha_i$ are Lagrange multipliers determined during the optimisation. Most $\alpha_i$ will be zero; the points $(x_i, y_i)$ with a non-zero $\alpha_i$ are precisely the support vectors of the margin. [The bias term $b$ can be inferred from the weight vector, cf. textbook.]

The classification function for the SVM can therefore be rewritten as:

$$f(x) = \mathrm{sign}\Big(\sum_i \alpha_i y_i\, x_i^T x + b\Big)$$

i.e. a combination of dot products between the vector $x$ to classify and the vectors $x_i$ from the training data, weighted by the multipliers $\alpha_i$ calculated by solving the optimisation problem. Only the vectors with $\alpha_i \neq 0$ (the support vectors) need to be considered in the classification.
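Continuing the same toy example, this sketch reads out the support vectors and the products $\alpha_i y_i$ that scikit-learn exposes after fitting, and checks that $w = \sum_i \alpha_i y_i x_i$ indeed reproduces the weight vector.

```python
# Sketch: inspecting the support vectors and the alpha_i * y_i coefficients
# exposed after fitting, continuing the toy example above.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 0.5], [4.0, 4.0], [5.0, 3.5]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

print("support vectors:", clf.support_vectors_)   # the x_i with alpha_i != 0
print("alpha_i * y_i  :", clf.dual_coef_)         # signed multipliers
# Recovering w = sum_i alpha_i y_i x_i from the dual solution:
w = clf.dual_coef_ @ clf.support_vectors_
print("w from dual    :", w, " (matches clf.coef_:", clf.coef_, ")")
```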

SVMs with soft margin
Real-world classification problems are not always 100% linearly separable, due to noise in the data set, outliers, etc. SVMs are therefore extended with a soft margin: a few training points are allowed to be misclassified, but the misclassification of each point has a cost.

Idea: introduce slack variables $\xi_i$, where $\xi_i$ measures the degree of misclassification of the point $x_i$. The corresponding optimisation problem becomes:

Find $w$ and $b$ such that $\tfrac{1}{2} w^T w + C \sum_i \xi_i$ is minimised, subject to $y_i(w^T x_i + b) \geq 1 - \xi_i$ for all $\{(x_i, y_i)\}$.

$C$ is a parameter that controls the softness of the margin.
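To see the role of C, the sketch below fits soft-margin SVMs with different C values on a noisy toy dataset; the data generation and the specific C values are assumptions for illustration.

```python
# Sketch: the effect of the soft-margin parameter C on a noisy, overlapping
# dataset (toy data generated here purely for illustration).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=5, flip_y=0.1, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X_tr, y_tr)
    # Small C = soft margin (many slack violations tolerated, lower variance);
    # large C = hard-ish margin (violations costly, risk of overfitting the noise).
    print(f"C={C:>6}: {clf.n_support_.sum()} support vectors, "
          f"test accuracy {clf.score(X_te, y_te):.3f}")
```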

Non-linear SVMs
The algorithms presented so far are purely linear, but SVMs can also solve non-linear problems. Key idea: map each point from the initial input space into a higher-dimensional space in which the training data is linearly separable, and do linear classification in this hyperspace.

How is this mapping performed? A first idea is to create a mapping function $\Phi(x)$ from the original space to the hyperspace, and rewrite the classifier as:

$$f(x) = \mathrm{sign}\Big(\sum_i \alpha_i y_i\, \Phi(x_i)^T \Phi(x) + b\Big)$$

Problem: this is not very efficient, since we need to perform the mapping for every point.
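A minimal sketch of such an explicit mapping, under the assumption of a toy 1-D problem where the class depends on |x|: no linear threshold on x separates the classes, but the mapping Φ(x) = (x, x²) makes them linearly separable.

```python
# Sketch: an explicit mapping Phi that makes a non-linearly-separable problem
# linearly separable. In 1-D the class depends on |x|, so no threshold on x
# works; mapping x -> (x, x^2) makes it separable by a hyperplane. Toy data only.
import numpy as np
from sklearn.svm import LinearSVC

x = np.linspace(-3, 3, 61).reshape(-1, 1)
y = (np.abs(x[:, 0]) > 1.5).astype(int)          # class depends on |x|

phi = np.hstack([x, x ** 2])                     # explicit mapping Phi(x) = (x, x^2)

print("1-D accuracy   :", LinearSVC(C=10.0).fit(x, y).score(x, y))       # cannot reach 1.0
print("mapped accuracy:", LinearSVC(C=10.0).fit(phi, y).score(phi, y))   # separable -> 1.0
```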

Non-linear SVMs
Kernel trick: replace the dot product $\Phi(x_i)^T \Phi(x)$ by a kernel function $K(x_i, x)$. There is then no need to use (or even specify) a mapping function $\Phi$, and numerous kernels can be employed. The classifier becomes:

$$f(x) = \mathrm{sign}\Big(\sum_i \alpha_i y_i\, K(x_i, x) + b\Big)$$

The high-dimensional space embedded by the kernel may even be infinite-dimensional.

The kernel function must satisfy some properties (be continuous, symmetric, and positive definite). Popular kernels:
- Polynomial kernels: $K(x_i, x_j) = (x_i^T x_j + 1)^d$
- Gaussian kernels: $K(x_i, x_j) = \exp\!\big(-\|x_i - x_j\|^2 / 2\sigma^2\big)$
- String and tree kernels for NLP tasks
One needs to find the most appropriate kernel for a given classification task.
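The sketch below applies kernelised SVMs to the earlier "two moons" toy data, using a polynomial and a Gaussian (RBF) kernel; the kernel parameters are arbitrary choices for illustration (note that scikit-learn parameterises the Gaussian kernel with gamma = 1/(2σ²)).

```python
# Sketch: kernelised SVMs on the same "two moons" toy data used earlier,
# comparing a polynomial and a Gaussian (RBF) kernel.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

poly = SVC(kernel="poly", degree=3, coef0=1.0, C=1.0).fit(X_tr, y_tr)  # (x_i^T x_j + 1)^3
rbf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X_tr, y_tr)              # exp(-gamma * ||x_i - x_j||^2)

print("polynomial kernel accuracy    :", poly.score(X_te, y_te))
print("Gaussian (RBF) kernel accuracy:", rbf.score(X_te, y_te))
```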

Classification: Practical aspects
How to practically choose a classifier? The key question is: where can I get data, and how much?
- No data: design hand-crafted rules
- Fairly little data: use a high-bias (e.g. linear) classifier
- Tons of (good) data: logistic regression or SVMs are often a good choice
Computational complexity is also an important factor, and classifiers can be combined ("ensemble learning").

Classification: Practical aspects
Assembling data resources is often the real bottleneck in classification: one must collect, store, organise and quality-check the data, and deal with financial and legal aspects (ownership, privacy). ML-based classifiers must sometimes be overlaid with hand-crafted rules, to enforce particular business rules or to allow the user to control the classification.

Which features to use? Designing the right features is often key to success: too few features are not informative enough, while too many features lead to data sparseness. In text classification, the most basic features are the document terms, but preprocessing is important to filter/modify some tokens. Other features include document length, zones, links, etc.
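As an illustration of term-based features with simple preprocessing, here is a sketch of a text-classification pipeline; the tiny spam/ham corpus is invented, and tf-idf weighting plus English stop-word removal are assumptions, not choices made in the lecture.

```python
# Sketch: a minimal text-classification pipeline where the features are the
# document terms (tf-idf weighted). The tiny corpus and labels are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

docs = ["cheap pills buy now", "meeting agenda attached",
        "win money fast", "lunch with the project team"]
labels = ["spam", "ham", "spam", "ham"]

clf = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),  # term features + preprocessing
    LinearSVC(C=1.0),                                        # linear SVM classifier
)
clf.fit(docs, labels)
print(clf.predict(["buy cheap pills", "agenda for the team meeting"]))
```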

Relevance ranking
Many tasks in information retrieval are classification problems: document preprocessing (segmentation etc.), determining whether a document is relevant or not, and so on. A simple way to calculate the relevance of a document d to a query q: extract features from the pair (d, q), such as the cosine score, the proximity window ω, static quality, document age, etc., and classify into two categories: relevant or non-relevant.

But classification as relevant/non-relevant is a crude way to solve the problem: what we really want is to rank documents by relevance. Ranking is an ordinal regression problem: the exact score of each document is not important, what counts is the relative ordering. It sits midway between classification and regression.

Relevance ranking
SVMs can be applied to ranking problems. We first collect training data where each query q is mapped to a list of documents ordered by relevance. To build the classifier, we construct a feature vector $\psi(d_i, q)$ for each document/query pair, and then create a vector $\Phi(d_i, d_j, q)$ of feature differences:

$$\Phi(d_i, d_j, q) = \psi(d_i, q) - \psi(d_j, q)$$

Finally, we build a classifier on this difference vector:

$$w^T \Phi(d_i, d_j, q) > 0 \;\Leftrightarrow\; d_i \text{ precedes } d_j$$
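A minimal sketch of this pairwise construction, with invented feature vectors ψ(d_i, q) for four documents of a single query (two features per pair, e.g. cosine score and static quality); the ranking SVM is trained on the difference vectors Φ(d_i, d_j, q).

```python
# Sketch: a pairwise ("ranking SVM") setup following the construction above.
# The psi(d, q) feature vectors and the relevance ordering are made up.
import numpy as np
from sklearn.svm import LinearSVC

# psi(d_i, q) for four documents of one query, ordered from most to least relevant.
psi = np.array([[0.9, 0.7], [0.8, 0.3], [0.4, 0.6], [0.2, 0.1]])

pairs, labels = [], []
for i in range(len(psi)):
    for j in range(len(psi)):
        if i != j:
            pairs.append(psi[i] - psi[j])        # Phi(d_i, d_j, q) = psi(d_i,q) - psi(d_j,q)
            labels.append(1 if i < j else -1)    # +1 if d_i should precede d_j

ranker = LinearSVC(C=1.0, fit_intercept=False).fit(pairs, labels)
w = ranker.coef_[0]
scores = psi @ w                                 # w^T psi(d, q) gives a ranking score
print("ranking order:", np.argsort(-scores))     # should recover [0, 1, 2, 3]
```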

Conclusion
Support Vector Machines constitute a powerful classification method: maximum-margin classifiers solved by quadratic programming, with slack variables to allow for softer margins and kernel functions to capture non-linear problems. Data collection and feature engineering are crucial issues when building practical classifiers. Ranking classifiers can be employed to order documents by relevance to a query.