Text classification II CE-324: Modern Information Retrieval Sharif University of Technology


Text classification II CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2015 Some slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)

Outline Vector space classification Rocchio knn Linear classifiers 2

Sec.14.1 Recall: vector space representation Each doc is a vector. One component for each term (= word). Terms are axes. Usually normalize vectors to unit length. High-dimensional vector space: 10,000+ dimensions, or even 100,000+. Docs are vectors in this space. How can we do classification in this space? 3

Sec.14.1 Classification using vector spaces Training set: a set of docs, each labeled with its class (e.g., topic) This set corresponds to a labeled set of points (or, equivalently, vectors) in the vector space Premise 1: Docs in the same class form a contiguous region of space Premise 2: Docs from different classes don't overlap (much) We define surfaces to delineate classes in the space 4

Sec.14.1 Documents in a vector space (figure: docs from the Government, Science, and Arts classes) 5

Sec.14.1 Test document of what class? (figure: a test doc among the Government, Science, and Arts classes) 6

Sec.14.1 Test document of what class? Government. Is this similarity hypothesis true in general? Our main topic today is how to find good separators. 7

Sec.14.2 Rocchio for text classification Relevance feedback methods can be adapted for text categorization. Relevance feedback can be viewed as 2-class classification. Use standard tf-idf weighted vectors to represent text docs. For training docs in each category, compute a prototype as the centroid of the vectors of the training docs in that category (prototype = centroid of members of the class). Assign test docs to the category with the closest prototype vector, based on cosine similarity. 9

Sec.14.2 Definition of centroid $\vec{\mu}_c = \frac{1}{|D_c|} \sum_{d \in D_c} \vec{d}$, where $D_c$ is the set of docs that belong to class c and $\vec{d}$ is the vector space representation of d. The centroid will in general not be a unit vector even when the inputs are unit vectors. 10

Rocchio algorithm 11
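The training and classification steps above fit in a few lines of code. The following is a minimal sketch, not the exact pseudocode on the slide, assuming documents are given as tf-idf vectors in a NumPy matrix; function names such as train_rocchio and classify_rocchio are illustrative.

```python
import numpy as np

def train_rocchio(X, y):
    """X: (n_docs, n_terms) tf-idf matrix; y: list of class labels, one per doc.
    Returns one prototype (centroid) vector per class."""
    centroids = {}
    for c in set(y):
        members = X[[i for i, label in enumerate(y) if label == c]]
        centroids[c] = members.mean(axis=0)   # centroid of the class members
    return centroids

def classify_rocchio(centroids, d):
    """Assign doc vector d to the class whose prototype is closest by cosine similarity."""
    def cos(u, v):
        return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)
    return max(centroids, key=lambda c: cos(centroids[c], d))
```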

Rocchio: example We will see that Rocchio finds linear boundaries between classes (figure: Government, Science, Arts) 12

Sec.14.2 Illustration of Rocchio: text classification 13

Sec.14.2 Rocchio properties Forms a simple generalization of the examples in each class (a prototype). Prototype vector does not need to be normalized when we use cosine similarity. Classification is based on similarity to class prototypes. Does not guarantee classifications are consistent with the given training data. Does not guarantee low error on training data 14

Sec.14.2 Rocchio anomaly Prototype models have problems with polymorphic (disjunctive) categories. 15

Sec.14.2 Rocchio classification: summary Rocchio forms a simple representation for each class: Centroid/prototype Classification is based on similarity to the prototype It does not guarantee that classifications are consistent with the given training data It is little used outside text classification It has been used quite effectively for text classification But in general worse than many other classifiers Rocchio does not handle nonconvex, multimodal classes correctly. 16

Sec.14.3 Nearest-Neighbor (1NN) classifier Learning phase: Just storing the representations of the training examples in D. Does not explicitly compute category prototypes. Testing instance x (under 1NN): Compute similarity between x and all examples in D. Assign x the category of the most similar example in D. Rationale of knn: contiguity hypothesis We expect a test doc d to have the same label as the training docs located in the local region surrounding d. 17

Sec.14.3 k Nearest Neighbor (knn) classifier 1NN is subject to errors due to: a single atypical example; noise (i.e., an error) in the category label of a single training example. More robust alternative: find the k most-similar examples and return the majority category of these k examples. 18

Sec.14.3 knn example: k=6 What is the class of the test document? (figure: test point among Government, Science, and Arts docs) 19

Sec.14.3 knn decision boundaries Boundaries are in principle arbitrary surfaces (polyhedral). knn gives locally defined decision boundaries between classes; far-away points do not influence each classification decision (unlike Rocchio, etc.) 20

1NN: Voronoi tessellation The decision boundaries between classes are piecewise linear. 21

knn algorithm 22
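A minimal sketch of the knn decision rule described above, assuming tf-idf document vectors in a NumPy matrix and cosine similarity; the function name knn_classify is illustrative.

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, d, k=3):
    """Return the majority class among the k training docs most similar to d."""
    norms = np.linalg.norm(X_train, axis=1) * np.linalg.norm(d) + 1e-12
    sims = (X_train @ d) / norms          # cosine similarity to every training doc
    top_k = np.argsort(-sims)[:k]         # indices of the k most similar docs
    votes = Counter(y_train[i] for i in top_k)
    return votes.most_common(1)[0][0]
```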

Time complexity of knn knn test time is proportional to the size of the training set! knn is inefficient for very large training sets. 23

Sec.14.3 Similarity metrics Nearest neighbor method depends on a similarity (or distance) metric. Euclidean distance: Simplest for continuous vector space. Hamming distance: Simplest for binary instance space. number of feature values that differ For text, cosine similarity of tf.idf weighted vectors is typically most effective. 24
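The three metrics above can be written directly; a small sketch for NumPy vectors (Hamming assumes binary 0/1 feature vectors):

```python
import numpy as np

def euclidean(u, v):
    return float(np.linalg.norm(u - v))

def hamming(u, v):
    return int(np.sum(u != v))            # number of feature values that differ

def cosine_sim(u, v):
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)
```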

Sec.14.3 Illustration of knn (k=3) for text vector space 25

3-NN vs. Rocchio Nearest Neighbor tends to handle polymorphic categories better than Rocchio/NB. 26

Sec.14.3 Nearest neighbor with inverted index Naively, finding nearest neighbors requires a linear search through all |D| docs in the collection. This is similar to determining the k best retrievals using the test doc as a query to a database of training docs. Use standard vector space inverted index methods to find the k nearest neighbors. 27

Sec.14.3 knn: summary No training phase necessary Actually: we always preprocess the training set, so in reality training time of knn is linear. May be expensive at test time knn is very accurate if the training set is large. In most cases it's more accurate than linear classifiers Optimality result: asymptotically zero error if the Bayes rate is zero. But knn can be very inaccurate if the training set is small. Scales well with a large number of classes Don't need to train C classifiers for C classes Classes can influence each other Small changes to one class can have a ripple effect 28

Linear classifiers Assumption: The classes are linearly separable. Classification decision: $\sum_{i=1}^{m} w_i x_i + w_0 > 0$? First, we only consider binary classifiers. Geometrically, this corresponds to a line (2D), a plane (3D), or a hyperplane (higher dimensionalities) as the decision boundary. Find the parameters $w_0, w_1, \dots, w_m$ based on the training set. Methods for finding these parameters: Perceptron, Rocchio, ... 29

Sec.14.4 Separation by hyperplanes A simplifying assumption is linear separability: in 2 dimensions, can separate classes by a line in higher dimensions, need hyperplanes 30

Sec.14.4 Linear programming / Perceptron Find a,b,c, such that ax + by > c for red points ax + by < c for blue points 31
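A minimal perceptron sketch for finding such a separating hyperplane $w \cdot x + w_0 > 0$ vs. $< 0$, assuming labels in {+1, -1} and linearly separable NumPy data; parameter names (epochs, lr) are illustrative defaults, not from the slides.

```python
import numpy as np

def perceptron(X, y, epochs=100, lr=1.0):
    """Find (w, w0) such that y_i * (w @ x_i + w0) > 0 for all training points."""
    w, w0 = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        errors = 0
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i + w0) <= 0:   # point on the wrong side (or on the boundary)
                w += lr * y_i * x_i         # move the hyperplane toward correcting it
                w0 += lr * y_i
                errors += 1
        if errors == 0:                     # all points separated: stop
            break
    return w, w0
```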

Sec.14.4 Which hyperplane? In general, lots of possible solutions for a, b, c. 32

Sec.14.4 Which hyperplane? Lots of possible solutions for a, b, c. Some methods find a separating hyperplane, but not the optimal one [according to some criterion of expected goodness]. Which points should influence optimality? All points: e.g., Rocchio. Only difficult points close to the decision boundary: e.g., Support Vector Machine (SVM). 33

Sec.14.2 Two-class Rocchio as a linear classifier Line or hyperplane defined by: $w_0 + \sum_{i=1}^{M} w_i d_i = w_0 + \vec{w}^T \vec{d} = 0$. For Rocchio, set: $\vec{w} = \vec{\mu}_{c_1} - \vec{\mu}_{c_2}$ and $w_0 = -\frac{1}{2}\left(\|\vec{\mu}_{c_1}\|^2 - \|\vec{\mu}_{c_2}\|^2\right)$. 34
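The formulas above translate directly into code; a small sketch assuming the two class centroids are already available as NumPy arrays (function name rocchio_linear is illustrative):

```python
import numpy as np

def rocchio_linear(mu1, mu2):
    """Return (w, w0) of the Rocchio decision boundary between classes c1 and c2."""
    w = mu1 - mu2
    w0 = -0.5 * (np.dot(mu1, mu1) - np.dot(mu2, mu2))
    return w, w0

# Assign d to c1 iff w0 + w @ d > 0, i.e. d is closer to mu1 than to mu2.
```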

Sec.14.4 Linear classifier: example Class: interest (as in interest rate) Example features of a linear classifier. Positive weights: prime 0.70, rate 0.67, interest 0.63, rates 0.60, discount 0.46, bundesbank 0.43. Negative weights: dlrs -0.71, world -0.35, sees -0.33, year -0.25, group -0.24, dlr -0.24. To classify, find the dot product of the feature vector and the weights. 35

Linear classifier: example $w_0 = 0$ Class interest in Reuters-21578 d1: "rate discount dlrs world"; d2: "prime dlrs". $w^T d_1 = 0.07 > 0$, so d1 is assigned to the interest class. $w^T d_2 = -0.01 < 0$, so d2 is not assigned to this class. 36
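To make the arithmetic explicit, a short sketch reproducing this example with the weights from the table above (binary term features, w0 = 0):

```python
# Weights of the "interest" classifier from the table above; w0 = 0.
weights = {"prime": 0.70, "rate": 0.67, "interest": 0.63, "rates": 0.60,
           "discount": 0.46, "bundesbank": 0.43,
           "dlrs": -0.71, "world": -0.35, "sees": -0.33, "year": -0.25,
           "group": -0.24, "dlr": -0.24}

def score(doc_terms):
    # dot product of a binary term vector with the weight vector
    return sum(weights.get(t, 0.0) for t in doc_terms)

print(score(["rate", "discount", "dlrs", "world"]))  # ~ 0.07  -> assigned to "interest"
print(score(["prime", "dlrs"]))                      # ~ -0.01 -> not assigned
```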

Sec.14.4 Linear classifiers Many common text classifiers are linear classifiers. Classifiers more powerful than linear often don't perform better on text problems. Why? Despite the similarity of linear classifiers, there are noticeable performance differences between them. For separable problems, there is an infinite number of separating hyperplanes. Different training methods pick different hyperplanes. There are also different strategies for non-separable problems. 37

Sec.14.4 A nonlinear problem Linear classifiers do badly on this task knn will do very well (assuming enough training data) 38

Overfitting example 39

Sec.14.4 Linear classifiers: binary and multiclass classification Consider 2-class problems: deciding between two classes, perhaps government and non-government. Multi-class: How do we define (and find) the separating surface? How do we decide which region a test doc is in? 40

Sec.14.5 More than two classes Any-of classification Classes are not mutually exclusive. A doc can belong to 0, 1, or >1 classes. For simplicity, decompose into K binary problems. Quite common for docs. One-of classification Classes are mutually exclusive. Each doc belongs to exactly one class. E.g., digit recognition: digits are mutually exclusive. 41

Sec.14.5 Set of binary classifiers: any of Build a separator between each class and its complementary set (docs from all other classes). Given a test doc, evaluate it for membership in each class. Apply the decision criterion of the classifiers independently. It works, although considering dependencies between categories may be more accurate. 42

Sec.14.5 Set of binary classifiers: one of Build a separator between each class and its complementary set (docs from all other classes). Given test doc, evaluate it for membership in each class. Assign doc to class with: maximum score maximum confidence maximum probability??? 43
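A sketch contrasting the any-of and one-of decision rules, assuming one binary linear classifier (w, w0) per class stored as NumPy vectors; function names are illustrative.

```python
import numpy as np

def any_of(classifiers, d):
    """classifiers: {class: (w, w0)}. Return every class whose classifier accepts d."""
    return [c for c, (w, w0) in classifiers.items() if w0 + w @ d > 0]

def one_of(classifiers, d):
    """Return the single class with the maximum score w0 + w @ d."""
    return max(classifiers, key=lambda c: classifiers[c][1] + classifiers[c][0] @ d)
```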

Which classifier do I use for a given text classification problem? Is there a learning method that is optimal for all text classification problems? No, because there is a tradeoff between the complexity of the classifier and its performance on new data points. Factors to take into account: How much training data is available? How simple/complex is the problem? How noisy is the data? How stable is the problem over time? For an unstable problem, it's better to use a simple and robust classifier. 44

Ch. 14 Resources IIR, Chapter 14 Weka: A data mining software package that includes an implementation of many ML algorithms 45