Introduction to Information Retrieval
SCCS414: Information Storage and Retrieval
Christopher Manning and Prabhakar Raghavan
Lecture 10: Text Classification; Vector Space Classification (Rocchio)

Relevance feedback revisited. In relevance feedback, the user marks a few documents as relevant or nonrelevant. These choices can be viewed as classes or categories: for each judged document, the user decides which of the two classes is correct. The IR system then uses these judgments to build a better model of the information need. So relevance feedback can be viewed as a form of text classification (deciding between two classes). The notion of classification is very general and has many applications within and beyond IR.

Standing queries (Ch. 13). The path from IR to text classification: you have an information need to monitor, say "Earthquake in Haiti". You want to rerun an appropriate query periodically to find new news items on this topic, and you will be sent the new documents that are found. That is, it's text classification, not ranking. Such queries are called standing queries. They have long been used by information professionals; a modern mass instantiation is Google Alerts. Standing queries are (hand-written) text classifiers.
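The idea that a standing query is a hand-written text classifier can be sketched in a few lines. The term list and threshold below are illustrative choices, not from the lecture:

```python
def standing_query(query_terms, threshold=2):
    """Return a hand-written text classifier: alert on any new document
    that mentions at least `threshold` of the monitored query terms."""
    terms = {t.lower() for t in query_terms}

    def classify(doc_text):
        words = set(doc_text.lower().split())
        return len(terms & words) >= threshold

    return classify

# Monitor the slide's example information need: "Earthquake in Haiti"
alert = standing_query(["earthquake", "Haiti"])
print(alert("Major earthquake strikes Haiti"))       # True: alert fires
print(alert("Football results from the weekend"))    # False: ignored
```

Rerunning `alert` over each batch of newly crawled documents is the "rerun the query periodically" loop from the slide, recast as classification of each incoming document.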

Spam filtering: another text classification task (Ch. 13). From: "" <takworlld@hotmail.com> Subject: real estate is the only way... gem oalvgkay Anyone can buy real estate with no money down Stop paying rent TODAY! There is no need to spend hundreds or even thousands for similar courses I am 22 years old and I have already purchased 6 properties using the methods outlined in this truly INCREDIBLE ebook. Change your life NOW! ================================================= Click Below to order: http://www.wholesaledaily.com/sales/nmd.htm =================================================

Text classification (Ch. 13). Today: Introduction to Text Classification (Chapter 13.0-13.1), also widely known as text categorization (same thing), and Vector Space Classification (Chapter 14).

More text classification examples (Ch. 13). Many search-engine functionalities use classification, assigning labels to documents or web pages. Labels are most often topics, such as Yahoo categories: "finance," "sports," "news > world > asia > business". Labels may be genres: "editorials", "movie-reviews", "news". Labels may be opinion on a person or product: like, hate, neutral. Labels may be domain-specific: "interesting-to-me" vs. "not-interesting-to-me"; "contains adult language" vs. "doesn't contain adult language"; language identification (English, French, Chinese, ...); search vertical (about Linux vs. not); "link spam" vs. "not link spam".

Classification methods (1): manual classification (Ch. 13). Used by the original Yahoo! Directory, Looksmart, about.com, ODP, PubMed. Very accurate when the job is done by experts, and consistent when the problem size and team are small, but difficult and expensive to scale. This means we need automatic classification methods for big problems.

Classification methods (2): automatic document classification with hand-coded rule-based systems (Ch. 13). One technique used by CS departments' spam filters, Reuters, the CIA, etc.; it's also what Google Alerts does, and such systems are widely deployed in government and enterprise. Companies provide IDEs for writing such rules, e.g., assign a category if the document contains a given boolean combination of words. For standing queries, commercial systems have complex query languages (everything in IR query languages, plus score accumulators). Accuracy is often very high if a rule has been carefully refined over time by a subject expert, but building and maintaining these rules is expensive.
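A minimal sketch of such a rule, in the spirit of a boolean combination plus a score accumulator. The specific terms, weights, and threshold here are invented for illustration, not from any real product:

```python
# Hypothetical hand-coded rule: terms, weights, and threshold are
# illustrative only, standing in for what a subject expert would write.
RULE = {
    # boolean combination of words that must hold for the rule to apply
    "trigger": lambda text: "earthquake" in text and "haiti" in text,
    # hand-weighted terms feeding a score accumulator
    "weights": {"earthquake": 2.0, "haiti": 2.0, "aftershock": 1.5, "magnitude": 1.0},
    "threshold": 3.0,
}

def rule_matches(rule, text):
    """Apply the boolean trigger, then the accumulated-score test."""
    text = text.lower()
    if not rule["trigger"](text):
        return False
    score = sum(w for term, w in rule["weights"].items() if term in text)
    return score >= rule["threshold"]

print(rule_matches(RULE, "Earthquake aftershock hits Haiti"))  # True
print(rule_matches(RULE, "Stock markets rally"))               # False
```

The mechanism is simple; the expense the slide mentions lies in an expert refining the trigger and weights over time.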

A Verity topic: a complex classification rule (Ch. 13). [Figure: a hand-written Verity rule with hand-weighting of terms. Note the maintenance issues (author, etc.). Verity was bought by Autonomy.]

Classification methods (3): supervised learning of a document-label assignment function (Ch. 13). Many systems partly rely on machine learning (Autonomy, Microsoft, Enkata, Yahoo!, ...): k-nearest neighbors (simple, powerful), Naive Bayes (simple, common), support vector machines (newer, more powerful), plus many other methods. No free lunch: these require hand-classified training data, but the data can be built up (and refined) by amateurs. Many commercial systems use a mixture of methods.

Categorization/classification (Sec. 13.1). Given: a description of an instance, d ∈ X, where X is the instance language or instance space (the issue is how to represent text documents, usually in some type of high-dimensional space), and a fixed set of classes C = {c_1, c_2, ..., c_J}. Determine: the category of d, γ(d) ∈ C, where γ is a classification function whose domain is X and whose range is C. We want to know how to build classification functions ("classifiers").

Supervised classification (Sec. 13.1). Given: a description of an instance d ∈ X (X the instance space), a fixed set of classes C = {c_1, c_2, ..., c_J}, and a training set D of labeled documents, each ⟨d, c⟩ ∈ X × C. Determine: a learning method or algorithm which will enable us to learn a classifier γ: X → C. For a test document d, we assign it the class γ(d) ∈ C.
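The definition above can be read as a function signature: a learning method consumes D ⊆ X × C and returns γ: X → C. A sketch with a deliberately trivial learner (a majority-class baseline of my own, not a method from the lecture) shows the interface:

```python
from collections import Counter
from typing import Callable, Iterable, Tuple

Doc = str  # stand-in for the instance space X
Cls = str  # stand-in for the class set C

def learn_majority(train: Iterable[Tuple[Doc, Cls]]) -> Callable[[Doc], Cls]:
    """A well-typed but useless learning method: gamma(d) = the most
    frequent class in the training set D, ignoring d entirely."""
    counts = Counter(c for _, c in train)
    majority = counts.most_common(1)[0][0]
    return lambda d: majority

gamma = learn_majority([("doc a", "china"), ("doc b", "china"), ("doc c", "uk")])
print(gamma("any test document"))  # china
```

Every classifier in the coming lectures (Rocchio, kNN, Naive Bayes) fits this same shape; only the body of the learning method changes.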

Document classification (Sec. 13.1). [Figure: a test document containing "planning, language, proof, intelligence" must be assigned to one of the classes ML, (AI), Planning, (Programming), Semantics, Garb. Coll., (HCI), Multimedia, GUI. The training data for each class is a set of characteristic words, e.g., "learning, intelligence, algorithm, reinforcement, network, ..." for ML; "planning, temporal, reasoning, plan, language, ..." for Planning; "programming, semantics, language, proof, ..." for Semantics; "garbage, collection, memory, optimization, region, ..." for Garb. Coll.] (Note: in real life there is often a hierarchy, not present in the above problem statement; also, you get papers on ML approaches to Garb. Coll.)

Feature selection (Sec. 13.5). Text collections have a large number of features: 10,000 to 1,000,000 unique words, and more. Feature selection may make using a particular classifier feasible at all (some classifiers can't deal with hundreds of thousands of features). It reduces training time (training time for some methods is quadratic or worse in the number of features). And it can improve generalization performance, by eliminating noise features and avoiding overfitting.
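Sec. 13.5 of IIR scores features by the expected mutual information between a term-occurrence indicator and the class. A minimal sketch from document counts, with a tiny invented corpus for illustration:

```python
import math
from collections import Counter

def mutual_information(docs, labels, term, cls):
    """Expected mutual information between the indicator 'term occurs in
    the document' and the indicator 'document belongs to cls', estimated
    from document counts (cells with zero count contribute zero)."""
    N = len(docs)
    n = Counter()
    for d, y in zip(docs, labels):
        n[(term in d, y == cls)] += 1
    mi = 0.0
    for et in (True, False):          # term present / absent
        for ec in (True, False):      # in class / not in class
            if n[(et, ec)] == 0:
                continue
            p_joint = n[(et, ec)] / N
            p_term = (n[(et, True)] + n[(et, False)]) / N
            p_cls = (n[(True, ec)] + n[(False, ec)]) / N
            mi += p_joint * math.log2(p_joint / (p_term * p_cls))
    return mi

# Invented toy corpus: "apple" perfectly predicts class "food",
# while the stopword "the" carries no class information.
docs = [{"apple", "the"}, {"apple", "a"}, {"brake", "the"}, {"brake", "a"}]
labels = ["food", "food", "cars", "cars"]
print(mutual_information(docs, labels, "apple", "food"))  # 1.0
print(mutual_information(docs, labels, "the", "food"))    # 0.0
```

To select k features for a class, rank the vocabulary by this score and keep the top k; noise features like "arachnocentric" in the next slide score near zero when their co-occurrence with the class is accidental and sparse.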

Example of a noise feature (Sec. 13.5). Let's say we're doing text classification for the class China. Suppose a rare term, say "arachnocentric", has no information about China, but all instances of "arachnocentric" happen to occur in China documents in our training set. Then the learning method can produce a classifier that misassigns test documents containing "arachnocentric" to China. Such an incorrect generalization from an accidental property of the training set is called overfitting. Feature selection reduces overfitting and improves the accuracy of the classifier.

Recall: vector space representation (Sec. 14.1). Each document is a vector, with one component for each term (= word); normally we normalize vectors to unit length. This is a high-dimensional vector space: terms are axes (10,000+ dimensions, or even 100,000+) and docs are vectors in this space. How can we do classification in this space?
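The representation above, sketched with raw term frequencies as the component weights (the lecture uses tf-idf in practice; raw tf keeps the sketch short):

```python
import math
from collections import Counter

def doc_to_unit_vector(text):
    """One vector component per term. The weight here is raw term
    frequency (tf-idf in practice); the vector is scaled to unit length."""
    tf = Counter(text.lower().split())
    norm = math.sqrt(sum(x * x for x in tf.values()))
    return {t: x / norm for t, x in tf.items()}

v = doc_to_unit_vector("to be or not to be")
print(sum(x * x for x in v.values()))  # ~1.0: unit length
```

Storing only the nonzero components, as the dict does, is what makes the 100,000-dimensional space practical: each document touches only a tiny fraction of the axes.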

Classification using vector spaces (Sec. 14.1). As before, the training set is a set of documents, each labeled with its class (e.g., topic). In vector space classification, this set corresponds to a labeled set of points (or, equivalently, vectors) in the vector space. Premise 1: documents in the same class form a contiguous region of space. Premise 2: documents from different classes don't overlap (much). We define surfaces to delineate classes in the space.

Documents in a vector space (Sec. 14.1). [Figure: points from three classes, Government, Science, and Arts, forming separate regions of the space.]

Test document of what class? (Sec. 14.1) [Figure: the same space with an unlabeled test point among the Government, Science, and Arts regions.]

Test document = Government (Sec. 14.1). [Figure: the test point falls inside the Government region.] Is this similarity hypothesis true in general? Our main topic today is how to find good separators.

Aside: 2D/3D graphs can be misleading (Sec. 14.1).

Using Rocchio for text classification (Sec. 14.2). Relevance feedback methods can be adapted for text categorization; as noted before, relevance feedback can be viewed as 2-class classification (relevant vs. nonrelevant documents). Use standard tf-idf weighted vectors to represent text documents. For the training documents in each category, compute a prototype vector by summing the vectors of the training documents in that category: the prototype is the centroid of the members of the class. Assign test documents to the category with the closest prototype vector, based on cosine similarity.

[Figure: Illustration of Rocchio text categorization (Sec. 14.2).]

Definition of centroid (Sec. 14.2): μ(c) = (1/|D_c|) Σ_{d ∈ D_c} v(d), where D_c is the set of all documents that belong to class c and v(d) is the vector space representation of d. Note that the centroid will in general not be a unit vector, even when the inputs are unit vectors.
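Training and classification as just described, in a short sketch. Documents are assumed already converted to sparse tf-idf term-weight dicts; the function names are my own:

```python
import math
from collections import defaultdict

def unit(v):
    """Scale a sparse term-weight vector to unit length."""
    n = math.sqrt(sum(x * x for x in v.values())) or 1.0
    return {t: x / n for t, x in v.items()}

def train_rocchio(labeled_vectors):
    """labeled_vectors: iterable of (term-weight dict, class).
    Returns class -> centroid of that class's unit document vectors."""
    by_class = defaultdict(list)
    for v, c in labeled_vectors:
        by_class[c].append(unit(v))
    centroids = {}
    for c, vs in by_class.items():
        mu = defaultdict(float)
        for v in vs:
            for t, x in v.items():
                mu[t] += x / len(vs)
        centroids[c] = dict(mu)  # note: generally not unit length
    return centroids

def classify(centroids, v):
    """Assign v to the class whose centroid has highest cosine similarity."""
    v = unit(v)
    def cos(mu):
        n = math.sqrt(sum(x * x for x in mu.values())) or 1.0
        return sum(v.get(t, 0.0) * x for t, x in mu.items()) / n
    return max(centroids, key=lambda c: cos(centroids[c]))

centroids = train_rocchio([({"ball": 1.0, "goal": 2.0}, "sports"),
                           ({"vote": 1.0, "poll": 1.0}, "politics")])
print(classify(centroids, {"goal": 1.0}))  # sports
```

Because the centroid is not a unit vector, the cosine computation renormalizes it at classification time; alternatively, one can normalize each centroid once after training.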

[Figure: Rocchio illustrated.]

Rocchio example. Term weighting scheme: wf × idf, i.e., w_{t,d} = (1 + log10 tf_{t,d}) · log10(4/df_t), over a training set of four documents. Task: classify a test document, docID 5.

Rocchio example, continued. The distances from the test document to the two centroids are |μ(c) − v(d_5)| ≈ 1.15 for the China class and |μ(c̄) − v(d_5)| = 0 for the not-China class, so Rocchio assigns docID 5 to the not-China class.
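The numbers 1.15 and 0 match Example 14.1 in IIR; the training and test documents below are taken from that example, on the assumption that the slide's table (not preserved in this transcription) used the same data:

```python
import math

# Data assumed from IIR Example 14.1; class "c" = China, "not-c" = not China.
train = [
    ("Chinese Beijing Chinese".split(), "c"),
    ("Chinese Chinese Shanghai".split(), "c"),
    ("Chinese Macao".split(), "c"),
    ("Tokyo Japan Chinese".split(), "not-c"),
]
d5 = "Chinese Chinese Chinese Tokyo Japan".split()  # docID 5, the test document

docs = [d for d, _ in train]
N = len(docs)
df = {t: sum(t in d for d in docs) for t in {t for d in docs for t in d}}

def vec(doc):
    """wf-idf weights (1 + log10 tf) * log10(N/df), normalized to unit length."""
    w = {}
    for t in set(doc):
        if t in df:
            w[t] = (1 + math.log10(doc.count(t))) * math.log10(N / df[t])
    norm = math.sqrt(sum(x * x for x in w.values())) or 1.0
    return {t: x / norm for t, x in w.items()}

def centroid(vectors):
    mu = {}
    for v in vectors:
        for t, x in v.items():
            mu[t] = mu.get(t, 0.0) + x / len(vectors)
    return mu

def dist(u, v):
    return math.sqrt(sum((u.get(t, 0.0) - v.get(t, 0.0)) ** 2
                         for t in set(u) | set(v)))

mu = {c: centroid([vec(d) for d, y in train if y == c]) for c in ("c", "not-c")}
v5 = vec(d5)
print(round(dist(mu["c"], v5), 2), round(dist(mu["not-c"], v5), 2))  # 1.15 0.0
```

Note why the distance to the not-China centroid is exactly zero: "Chinese" occurs in every training document, so its idf (and hence its weight) is 0, and after normalization d5 coincides with the single not-China training vector.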

Rocchio anomaly (Sec. 14.2). Prototype models have problems with polymorphic (disjunctive) categories.

Rocchio cannot handle multimodal classes. [Figure: A is the centroid of the a's, B is the centroid of the b's. The point o is closer to A than to B, but it is a better fit for the b class.] Here b is a multimodal class that needs two prototypes, but in Rocchio we only have one.

[Figure: Rocchio illustrated (again).]

Rocchio classification (Sec. 14.2). Rocchio forms a simple representation for each class: the centroid/prototype. Classification is based on similarity to (distance from) the prototype/centroid. It does not guarantee that classifications are consistent with the given training data. It is little used outside text classification, where it has been used quite effectively, though in general it is worse than Naive Bayes. Again, it is cheap to train and cheap to apply to test documents.

References
Chapters 13 and 14 in IIR.
Fabrizio Sebastiani. Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1):1-47, 2002.
Tom Mitchell. Machine Learning. McGraw-Hill, 1997.
Yiming Yang and Xin Liu. A Re-examination of Text Categorization Methods. Proceedings of SIGIR, 1999.
David Lewis. Evaluating and Optimizing Autonomous Text Classification Systems. Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1995.