Introduction to Information Retrieval SCCS414: Information Storage and Retrieval Christopher Manning and Prabhakar Raghavan Lecture 10: Text Classification; Vector Space Classification (Rocchio)
Relevance feedback revisited In relevance feedback, the user marks a few documents as relevant/nonrelevant The choices can be viewed as classes or categories For several documents, the user decides which of these two classes is correct The IR system then uses these judgments to build a better model of the information need So, relevance feedback can be viewed as a form of text classification (deciding between two classes, relevant and nonrelevant) The notion of classification is very general and has many applications within and beyond IR
Ch. 13 Standing queries The path from IR to text classification: You have an information need to monitor, say: Earthquake in Haiti You want to rerun an appropriate query periodically to find new news items on this topic You will be sent new documents that are found I.e., it's text classification, not ranking Such queries are called standing queries Long used by information professionals A modern mass instantiation is Google Alerts Standing queries are (hand-written) text classifiers
Ch. 13 Spam filtering: Another text classification task From: "" <takworlld@hotmail.com> Subject: real estate is the only way... gem oalvgkay Anyone can buy real estate with no money down Stop paying rent TODAY! There is no need to spend hundreds or even thousands for similar courses I am 22 years old and I have already purchased 6 properties using the methods outlined in this truly INCREDIBLE ebook. Change your life NOW! ================================================= Click Below to order: http://www.wholesaledaily.com/sales/nmd.htm =================================================
Ch. 13 Text classification Today: Introduction to Text Classification (Chapter 13.0-13.1) Also widely known as text categorization. Same thing. Vector Space Classification (Chapter 14)
Ch. 13 More Text Classification Examples Many search engine functionalities use classification Assigning labels to documents or web-pages: Labels are most often topics such as Yahoo-categories: "finance," "sports," "news>world>asia>business" Labels may be genres: "editorials," "movie-reviews," "news" Labels may be opinion on a person/product: like, hate, neutral Labels may be domain-specific: "interesting-to-me" : "not-interesting-to-me" "contains adult language" : "doesn't" language identification: English, French, Chinese search vertical: about Linux versus not link spam : not link spam
Ch. 13 Classification Methods (1) Manual classification Used by the original Yahoo! Directory Looksmart, about.com, ODP, PubMed Very accurate when job is done by experts Consistent when the problem size and team are small Difficult and expensive to scale Means we need automatic classification methods for big problems
Ch. 13 Classification Methods (2) Automatic document classification Hand-coded rule-based systems One technique used by CS dept's spam filter, Reuters, CIA, etc. It's what Google Alerts is doing Widely deployed in government and enterprise Companies provide IDE for writing such rules E.g., assign category if document contains a given boolean combination of words Standing queries: Commercial systems have complex query languages (everything in IR query languages + score accumulators) Accuracy is often very high if a rule has been carefully refined over time by a subject expert Building and maintaining these rules is expensive
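As a sketch of the idea, a hand-coded rule classifier can be as simple as a boolean combination of words, like the "Earthquake in Haiti" standing query. The rule and keywords below are made up for illustration and are not any real system's rule syntax:

```python
def standing_query(doc_text):
    """A hand-written rule classifier (sketch): fire the standing-query
    category when the document matches a boolean combination of words.
    Rule here: "earthquake" AND ("haiti" OR "port-au-prince")."""
    tokens = set(doc_text.lower().split())
    return "earthquake" in tokens and bool({"haiti", "port-au-prince"} & tokens)

standing_query("Major earthquake strikes Haiti capital")  # matches the rule
standing_query("Earthquake reported off Japan coast")     # does not match
```

Commercial rule languages add weights and score accumulators on top of this boolean core, but the classification decision has the same shape.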
A Verity topic A complex classification rule Ch. 13 Note: maintenance issues (author, etc.) Hand-weighting of terms [Verity was bought by Autonomy.]
Ch. 13 Classification Methods (3) Supervised learning of a document-label assignment function Many systems partly rely on machine learning (Autonomy, Microsoft, Enkata, Yahoo!, ...) k-nearest Neighbors (simple, powerful) Naive Bayes (simple, common method) Support-vector machines (newer, more powerful) plus many other methods No free lunch: requires hand-classified training data But data can be built up (and refined) by amateurs Many commercial systems use a mixture of methods
Sec. 13.1 Categorization/Classification Given: A description of an instance, d ∈ X, where X is the instance language or instance space. Issue: how to represent text documents. Usually some type of high-dimensional space A fixed set of classes: C = {c1, c2, ..., cJ} Determine: The category of d: γ(d) ∈ C, where γ is a classification function whose domain is X and whose range is C. We want to know how to build classification functions ("classifiers").
Sec. 13.1 Supervised Classification Given: A description of an instance, d ∈ X, where X is the instance language or instance space. A fixed set of classes: C = {c1, c2, ..., cJ} A training set D of labeled documents, with each labeled document ⟨d, c⟩ ∈ X × C Determine: A learning method or algorithm which will enable us to learn a classifier γ: X → C For a test document d, we assign it the class γ(d) ∈ C
Sec. 13.1 Document Classification Test document terms: planning, language, proof, intelligence. Classes (grouped by parent area): AI → ML, Planning; Programming → Semantics, Garb.Coll.; HCI → Multimedia, GUI. Training data per class: ML: learning, intelligence, algorithm, reinforcement, network, ... Planning: planning, temporal, reasoning, plan, language, ... Semantics: programming, semantics, language, proof, ... Garb.Coll.: garbage, collection, memory, optimization, region, ... (Note: in real life there is often a hierarchy, not present in the above problem statement; and also, you get papers on ML approaches to Garb. Coll.)
Sec.13.5 Feature Selection Text collections have a large number of features 10,000 - 1,000,000 unique words and more May make using a particular classifier feasible Some classifiers can't deal with 100,000s of features Reduces training time Training time for some methods is quadratic or worse in the number of features Can improve generalization (performance) Eliminates noise features Avoids overfitting
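One common feature-utility measure from Chapter 13.5 is the mutual information between a term and a class, computed from a 2x2 contingency table of document counts; a minimal sketch (the example counts below are hypothetical):

```python
import math

def mutual_information(n11, n10, n01, n00):
    """Expected mutual information of a term and a class from document
    counts: n11 = docs containing the term and in the class, n10 = term
    present / class absent, n01 = term absent / class present, n00 =
    neither. Sketch of the utility measure in IIR Sec. 13.5."""
    n = n11 + n10 + n01 + n00
    mi = 0.0
    for n_tc, n_t, n_c in (
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ):
        if n_tc:  # terms with zero count contribute 0 by convention
            mi += (n_tc / n) * math.log2(n * n_tc / (n_t * n_c))
    return mi

# A term strongly correlated with the class scores high; a term
# independent of the class scores ~0. Keep the top-k terms per class.
mutual_information(9, 1, 1, 9)     # correlated: positive score
mutual_information(10, 10, 10, 10) # independent: score 0
```

Frequency-based selection (just keeping the most common terms per class) is a cheaper alternative that often works surprisingly well.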
Sec.13.5 Example for a noise feature Let's say we're doing text classification for the class China. Suppose a rare term, say arachnocentric, has no information about China...... but all instances of arachnocentric happen to occur in China documents in our training set. Then the learning method can produce a classifier that misassigns test documents containing arachnocentric to China. Such an incorrect generalization from an accidental property of the training set is called overfitting. Feature selection reduces overfitting and improves the accuracy of the classifier.
Sec.14.1 Recall: Vector Space Representation Each document is a vector, one component for each term (= word). Normally normalize vectors to unit length. High-dimensional vector space: Terms are axes 10,000+ dimensions, or even 100,000+ Docs are vectors in this space How can we do classification in this space?
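A minimal sketch of the representation, using raw term counts only (a real system would use tf-idf weights over a shared vocabulary):

```python
import math
from collections import Counter

def unit_tf_vector(tokens):
    """Represent a document as a sparse term-count vector,
    normalized to unit Euclidean length."""
    counts = Counter(tokens)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {term: c / norm for term, c in counts.items()}

v = unit_tf_vector("chinese beijing chinese".split())
# The squared components sum to 1, so v lies on the unit sphere;
# "chinese" (count 2) gets a larger component than "beijing" (count 1).
```

With unit vectors, cosine similarity reduces to a dot product, which is what makes the geometric picture on the next slides work.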
Sec.14.1 Classification Using Vector Spaces As before, the training set is a set of documents, each labeled with its class (e.g., topic) In vector space classification, this set corresponds to a labeled set of points (or, equivalently, vectors) in the vector space Premise 1: Documents in the same class form a contiguous region of space Premise 2: Documents from different classes don't overlap (much) We define surfaces to delineate classes in the space
Sec.14.1 Documents in a Vector Space [Figure: documents from three classes - Government, Science, Arts - plotted as points in the vector space]
Sec.14.1 Test Document of what class? [Figure: an unlabeled test document plotted among the Government, Science, and Arts points]
Sec.14.1 Test Document = Government Is this similarity hypothesis true in general? [Figure: the test document falls in the Government region] Our main topic today is how to find good separators
Sec.14.1 Aside: 2D/3D graphs can be misleading
Sec.14.2 Using Rocchio for text classification Relevance feedback methods can be adapted for text categorization As noted before, relevance feedback can be viewed as 2-class classification Relevant vs. nonrelevant documents Use standard tf-idf weighted vectors to represent text documents For training documents in each category, compute a prototype vector by summing the vectors of the training documents in the category. Prototype = centroid of members of class Assign test documents to the category with the closest prototype vector based on cosine similarity.
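The whole procedure fits in a few lines. This sketch uses unit-normalized raw term counts instead of tf-idf, and toy documents and class names, purely for illustration:

```python
import math
from collections import Counter, defaultdict

def unit_vec(text):
    """Unit-length term-count vector for a document (tf only, no idf)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()}

def centroid(vectors):
    """Component-wise average of a class's document vectors."""
    acc = defaultdict(float)
    for v in vectors:
        for t, w in v.items():
            acc[t] += w / len(vectors)
    return dict(acc)

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def train_rocchio(labeled_docs):
    """One prototype (centroid) per class."""
    by_class = defaultdict(list)
    for text, cls in labeled_docs:
        by_class[cls].append(unit_vec(text))
    return {cls: centroid(vs) for cls, vs in by_class.items()}

def classify(prototypes, text):
    """Assign the class whose prototype is most cosine-similar."""
    v = unit_vec(text)
    return max(prototypes, key=lambda cls: cosine(prototypes[cls], v))

prototypes = train_rocchio([
    ("chinese beijing chinese", "China"),
    ("chinese shanghai", "China"),
    ("tokyo japan", "not-China"),
])
classify(prototypes, "chinese chinese tokyo")  # -> "China"
```

Note that training reduces to one pass over the data and classification to one similarity computation per class, which is why the closing slide calls Rocchio cheap to train and test.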
Sec.14.2 Illustration of Rocchio Text Categorization
Sec.14.2 Definition of centroid μ(c) = (1/|Dc|) Σ_{d ∈ Dc} v(d), where Dc is the set of all documents that belong to class c and v(d) is the vector space representation of d. Note that the centroid will in general not be a unit vector even when the inputs are unit vectors.
Rocchio illustrated
Rocchio example TF scheme: wf x idf Given: w_{t,d} = (1 + log10 tf_{t,d}) * log10(4 / df_t) Task: Classify a test document - docid5!!
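Under that scheme, a term's weight can be computed as below, assuming N = 4 training documents as the log10(4/df) factor suggests; the counts in the calls are hypothetical, not taken from the slide's (unshown) document table:

```python
import math

def wf_idf(tf, df, n_docs=4):
    """wf x idf weight: (1 + log10 tf) * log10(N / df),
    and 0 when the term does not occur in the document (tf = 0)."""
    if tf == 0:
        return 0.0
    return (1.0 + math.log10(tf)) * math.log10(n_docs / df)

wf_idf(1, 1)  # occurs once, in 1 of 4 training docs: log10(4) ~ 0.602
wf_idf(0, 2)  # term absent from the document: weight 0
```

A term appearing in all 4 training documents gets idf = log10(4/4) = 0, so it contributes nothing to any similarity score.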
Rocchio example The vector of docid5 is farther from the China centroid (distance ≈ 1.15) than from the not-China centroid Assigns docid5 to not-china class
Sec.14.2 Rocchio Anomaly Prototype models have problems with polymorphic (disjunctive) categories.
Rocchio cannot handle multimodal classes A is centroid of the a's, B is centroid of the b's. The point o is closer to A than to B. But it is a better fit for the b class. The a class is multimodal and would need two prototypes. But in Rocchio we only have one.
Rocchio illustrated (again)
Sec.14.2 Rocchio classification Rocchio forms a simple representation for each class: the centroid/prototype Classification is based on similarity to / distance from the prototype/centroid It does not guarantee that classifications are consistent with the given training data It is little used outside text classification, but it has been used quite effectively for text classification itself In general, though, it is worse than Naive Bayes Again, cheap to train and to apply to test documents
References Chapters 13 and 14 in IIR. Fabrizio Sebastiani. Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1):1-47, 2002. Tom Mitchell. Machine Learning. McGraw-Hill, 1997. Yiming Yang & Xin Liu. A re-examination of text categorization methods. Proceedings of SIGIR, 1999. David Lewis. Evaluating and Optimizing Autonomous Text Classification Systems. Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1995.