VECTOR SPACE CLASSIFICATION
Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, Chapter 14.
Wei Wei, wwei@idi.ntnu.no
Lecture series
RECALL: NAÏVE BAYES CLASSIFIERS
Classify based on the prior weight of the class and the conditional parameters for what each word says:
$$c_{NB} = \operatorname*{argmax}_{c_j \in C} \Big[ \log P(c_j) + \sum_{i \in \text{positions}} \log P(x_i \mid c_j) \Big]$$
Training is done by counting and dividing:
$$P(c_j) = \frac{N_{c_j}}{N}$$
Don't forget to smooth:
$$P(x_k \mid c_j) = \frac{T_{c_j x_k} + 1}{\sum_{x_i \in V} \big( T_{c_j x_i} + 1 \big)}$$
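As a concrete illustration of "counting and dividing", here is a minimal Python sketch of multinomial Naïve Bayes training and classification with add-one smoothing. All function and variable names are illustrative, not from the chapter.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, label) pairs."""
    n = len(docs)
    n_c = Counter(label for _, label in docs)      # N_c: docs per class
    t_c = defaultdict(Counter)                     # T_{c,x}: term counts per class
    vocab = set()
    for tokens, label in docs:
        t_c[label].update(tokens)
        vocab.update(tokens)
    prior = {c: n_c[c] / n for c in n_c}           # P(c) = N_c / N
    cond = {}                                      # P(x|c), add-one smoothed
    for c in n_c:
        total = sum(t_c[c].values())
        cond[c] = {x: (t_c[c][x] + 1) / (total + len(vocab)) for x in vocab}
    return prior, cond, vocab

def classify_nb(tokens, prior, cond, vocab):
    # c_NB = argmax_c [ log P(c) + sum_i log P(x_i | c) ]
    def score(c):
        return math.log(prior[c]) + sum(
            math.log(cond[c][x]) for x in tokens if x in vocab)
    return max(prior, key=score)
```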
VECTOR SPACE TEXT CLASSIFICATION
Today: vector space methods for text classification
- Rocchio classification
- k Nearest Neighbors
- Linear and non-linear classifiers
- Classification with more than two classes
VECTOR SPACE TEXT CLASSIFICATION
Vector space methods for text classification
VECTOR SPACE CLASSIFICATION
Vector space representation:
- Each document is a vector, one component for each term (= word).
- Normally normalize vectors to unit length.
- High-dimensional vector space: terms are axes; 10,000+ dimensions, or even 100,000+.
- Docs are vectors in this space.
How can we do classification in this space?
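To make the vector space representation concrete, here is a minimal sketch using scikit-learn (an assumption; the lecture does not prescribe a library). Each document becomes a tf-idf weighted row vector, normalized to unit length as the slide suggests; the toy corpus is illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the government passed a new law",        # toy corpus
        "the experiment confirmed the theory",
        "the gallery opened a new exhibition"]

vectorizer = TfidfVectorizer()       # one axis per term, tf-idf weights
X = vectorizer.fit_transform(docs)   # sparse matrix; each row is a document vector

# norm='l2' is the default, so rows have unit length and
# dot products between rows equal cosine similarities.
print(X.shape)                       # (3, number_of_terms)
```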
VECTOR SPACE CLASSIFICATION
- As before, the training set is a set of documents, each labeled with its class (e.g., topic).
- In vector space classification, this set corresponds to a labeled set of points (or, equivalently, vectors) in the vector space.
- Hypothesis 1: documents in the same class form a contiguous region of space.
- Hypothesis 2: documents from different classes don't overlap.
- We define surfaces to delineate classes in the space.
DOCUMENTS IN A VECTOR SPACE
[Figure: training documents as points in the vector space, grouped into the classes Government, Science and Arts]
TEST DOCUMENT: WHICH CLASS?
[Figure: a test document added to the same space; which class does it belong to?]
TEST DOCUMENT = GOVERNMENT
[Figure: the test document lies in the Government region and is labeled Government]
Our main topic today is how to find good separators.
VECTOR SPACE TEXT CLASSIFICATION
Rocchio text classification
ROCCHIO TEXT CLASSIFICATION
- Use standard tf-idf weighted vectors to represent text documents.
- For the training documents in each category, compute a prototype vector by summing the vectors of the training documents in that category. Prototype = centroid of the members of the class.
- Assign test documents to the category with the closest prototype vector, based on cosine similarity.
DEFINITION OF CENTROID
$$\vec{\mu}(c) = \frac{1}{|D_c|} \sum_{d \in D_c} \vec{v}(d)$$
where $D_c$ is the set of all documents that belong to class $c$ and $\vec{v}(d)$ is the vector space representation of $d$. Note that the centroid will in general not be a unit vector, even when the inputs are unit vectors.
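A minimal numpy sketch of Rocchio classification built directly on this definition; the toy 2D vectors and class labels are illustrative stand-ins for tf-idf document vectors.

```python
import numpy as np

def train_rocchio(X, y):
    # mu(c) = (1/|D_c|) * sum of v(d) over d in D_c
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def classify_rocchio(x, centroids):
    # Assign to the class whose prototype has the highest cosine similarity.
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(centroids, key=lambda c: cos(x, centroids[c]))

X = np.array([[0.9, 0.1], [0.8, 0.3],   # two "gov" documents
              [0.1, 0.9], [0.2, 0.8]])  # two "sci" documents
y = np.array(["gov", "gov", "sci", "sci"])
centroids = train_rocchio(X, y)
print(classify_rocchio(np.array([0.7, 0.2]), centroids))  # -> gov
```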
ROCCHIO TEXT CLASSIFICATION
[Figure: red training points r1, r2, r3 and blue training points b1, b2 with their class prototypes; the test point t is assigned to class b]
ROCCHIO PROPERTIES
- Forms a simple generalization of the examples in each class (a prototype).
- The prototype vector does not need to be averaged or otherwise normalized for length, since cosine similarity is insensitive to vector length.
- Classification is based on similarity to class prototypes.
- Does not guarantee that classifications are consistent with the given training data.
ROCCHIO ANOMALY
Prototype models have problems with polymorphic (disjunctive) categories.
[Figure: the red class forms two separate clusters (r1, r2 and r3, r4) with the blue points b1, b2 between them; the red prototype falls between the clusters, so the test point t, whose correct class is r, is misclassified]
VECTOR SPACE TEXT CLASSIFICATION
k Nearest Neighbor classification
K NEAREST NEIGHBOR CLASSIFICATION
kNN = k Nearest Neighbor. To classify document d into class c:
1. Define the k-neighborhood N as the k nearest neighbors of d.
2. Count the number of documents i in N that belong to c.
3. Estimate P(c|d) as i/k.
4. Choose as class argmax_c P(c|d) [= majority class].
K NEAREST NEIGHBOR CLASSIFICATION
Unlike Rocchio, kNN classification determines the decision boundary locally:
- For 1NN (k = 1), we assign each document to the class of its closest neighbor.
- For kNN, we assign each document to the majority class of its k closest neighbors; k is a parameter.
The rationale of kNN is the contiguity hypothesis: we expect a test document d to have the same label as the training documents located nearby.
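A minimal numpy sketch of the kNN decision rule; it assumes document vectors have unit length, so cosine similarity reduces to a dot product (all names are illustrative).

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=3):
    sims = X_train @ x                  # cosine similarities (unit vectors)
    nearest = np.argsort(sims)[-k:]     # indices of the k nearest neighbors
    votes = Counter(y_train[i] for i in nearest)
    # P(c|d) is estimated as votes[c] / k; return the majority class
    return votes.most_common(1)[0][0]
```

Since learning is just storing the training set, all the work happens at classification time.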
KNN: k = 1
[Figure: 1NN decision boundaries between the classes]
KNN: k = 1, 5, 10
[Figure: how the decision boundary changes for k = 1, 5 and 10]
KNN: WEIGHTED-SUM VOTING
[Figure: kNN with weighted-sum voting instead of simple majority voting]
K NEAREST NEIGHBOR CLASSIFICATION
[Figure: a test document among Government, Science and Arts training documents, classified by its nearest neighbors]
K NEAREST NEIGHBOR CLASSIFICATION
- Learning is just storing the representations of the training examples in D.
- Testing instance x (under 1NN): compute the similarity between x and all examples in D; assign x the category of the most similar example in D.
- Does not explicitly compute a generalization or category prototypes.
- Also called: case-based learning, memory-based learning, lazy learning.
- Rationale of kNN: the contiguity hypothesis.
SIMILARITY METRICS
- The nearest neighbor method depends on a similarity (or distance) metric.
- Simplest for a continuous m-dimensional instance space: Euclidean distance.
- Simplest for an m-dimensional binary instance space: Hamming distance (number of feature values that differ).
- For text, cosine similarity of tf-idf weighted vectors is typically most effective.
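Illustrative one-liners for the three metrics just listed (numpy assumed):

```python
import numpy as np

def euclidean(a, b):            # continuous m-dimensional instance spaces
    return np.linalg.norm(a - b)

def hamming(a, b):              # binary instance spaces:
    return int(np.sum(a != b))  # number of feature values that differ

def cosine(a, b):               # tf-idf weighted text vectors
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
```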
AN EXAMPLE: COSINE SIMILARITY
[Figure: with cosine similarity, the test point t is assigned to class b (its direction is closer to b1, b2 than to r1, r2, r3)]
KNN DISCUSSION
- Functional definition of similarity: e.g. cosine, Euclidean, kernel functions, ...
- How many neighbors do we consider? The value of k is determined empirically (normally 3 or 5).
- Does each neighbor get the same weight? Weighted-sum voting or not.
KNN DISCUSSION (cont.)
- No feature selection necessary.
- Scales well with a large number of classes: don't need to train n classifiers for n classes.
- Classes can influence each other: small changes to one class can have a ripple effect.
- Scores can be hard to convert to probabilities.
- No training necessary. (Actually, perhaps not true: data editing, etc.)
- May be more expensive at test time.
TEXT CLASSIFICATION
Linear and non-linear classifiers
LINEAR CLASSIFIERS
Many common text classifiers are linear classifiers:
- Naïve Bayes
- Perceptron
- Rocchio
- Logistic regression
- Support vector machines (with linear kernel)
- Linear regression
LINEAR CLASSIFIER: 2D
In two dimensions, a linear classifier is a line, with functional form
$$w_1 x_1 + w_2 x_2 = b$$
The classification rule:
- if $w_1 x_1 + w_2 x_2 \geq b$, assign class c
- if $w_1 x_1 + w_2 x_2 < b$, assign class not-c
Here $(x_1, x_2)^T$ is the 2D vector representation of the document, and $(w_1, w_2)^T$ is the parameter vector that, together with $b$, defines the boundary.
LINEAR CLASSIFIER
We can generalize this 2D linear classifier to higher dimensions by defining a hyperplane:
$$\vec{w}^{\,T} \vec{x} = b$$
The classification rule is then: assign class c if $\vec{w}^{\,T} \vec{x} \geq b$, and not-c otherwise. This is why Rocchio and Naïve Bayes classifiers are linear classifiers: each can be expressed as a decision rule of this form.
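A minimal sketch of this decision rule in m dimensions; the weights, bias and input vector below are illustrative values, not learned parameters.

```python
import numpy as np

def linear_classify(x, w, b):
    # The hyperplane w^T x = b separates c from not-c.
    return "c" if w @ x >= b else "not-c"

w = np.array([0.6, -0.2, 0.4])   # parameter vector (illustrative)
print(linear_classify(np.array([1.0, 0.5, 0.2]), w, b=0.3))  # -> c
```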
NON-LINEAR CLASSIFIER
kNN is a non-linear classifier. On a task whose true class boundary is non-linear, a linear classifier such as Naïve Bayes does badly, while kNN will do very well (assuming enough training data).
TEXT CLASSIFICATION
Classification with more than two classes
CLASSIFICATION WITH MORE THAN TWO CLASSES
- Classification for classes that are not mutually exclusive is called the any-of classification problem.
- Classification for classes that are mutually exclusive is called the one-of classification problem.
- So far we have seen two-class linear classifiers: a linear classifier can classify d as c or not-c.
- How can we extend two-class linear classifiers to J > 2 classes, i.e. classify a document d into one of, or any of, the classes c1, c2, c3, ...?
CLASSIFICATION WITH MORE THAN TWO CLASSES
For one-of classification tasks:
1. Build a classifier for each class, where the training set consists of the set of documents in the class and its complement.
2. Given the test document, apply each classifier separately.
3. Assign the document to the class with the maximum score, the maximum confidence value, or the maximum probability.
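A minimal sketch of the one-of procedure, using per-class linear classifiers as the scorers; the (w, b) pairs and class names are illustrative.

```python
import numpy as np

def one_of_classify(x, scorers):
    # scorers: {class: (w, b)}; apply each classifier, take the argmax score
    return max(scorers, key=lambda c: scorers[c][0] @ x - scorers[c][1])

scorers = {"UK":     (np.array([0.9, 0.1]), 0.2),
           "sports": (np.array([0.2, 0.8]), 0.1)}
print(one_of_classify(np.array([0.7, 0.6]), scorers))  # -> sports
```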
CLASSIFICATION WITH MORE THAN TWO CLASSES
For any-of classification tasks:
1. Build a classifier for each class, where the training set consists of the set of documents in the class and its complement.
2. Given the test document, apply each classifier separately. The decision of one classifier has no influence on the decisions of the other classifiers.
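The corresponding any-of sketch: each classifier decides independently, so a document can receive zero, one, or several labels. A single shared threshold is assumed here for simplicity; per-class thresholds like t1 and t2 on the next slide work the same way.

```python
import numpy as np

def any_of_classify(x, scorers, threshold=0.0):
    # scorers: {class: (w, b)}; keep every class whose classifier says yes
    return [c for c, (w, b) in scorers.items() if w @ x - b > threshold]
```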
THE TEXT CLASSIFICATION PROBLEM
An example: document d consists of only one sentence: "London is planning to organize the 2012 Olympics."
We have six classes: <UK>, <China>, <car>, <coffee>, <elections>, <sports>.
Classes assigned: <UK> and <sports>, since
$$P(\text{UK})\,P(d \mid \text{UK}) > t_1 \quad\text{and}\quad P(\text{sports})\,P(d \mid \text{sports}) > t_2$$
(Naïve Bayes text classification with a threshold per class.)
SUMMARY
Vector space methods for text classification:
- Rocchio classification
- k Nearest Neighbors
- Linear and non-linear classifiers
- Classification with more than two classes