
Information Retrieval (M&S Ch 15)

Retrieval Models. A retrieval model specifies the details of: document representation, query representation, and the retrieval function. Together these determine a notion of relevance, which can be binary or continuous (i.e., ranked retrieval).

Classes of Retrieval Models. Boolean models (set theoretic), including Extended Boolean; vector space models (statistical/algebraic), including Generalized VS and Latent Semantic Indexing; probabilistic models.

Other Model Dimensions. Logical view of documents: index terms, full text, or full text + structure (e.g., hypertext). User task: retrieval or browsing.

Retrieval Tasks. Ad hoc retrieval: fixed document corpus, varied queries. Filtering: fixed query, continuous document stream; a user profile (a model of relatively static preferences) drives a binary relevant/not-relevant decision. Routing: same as filtering, but continuously supplies ranked lists rather than binary decisions.

Common Preprocessing Steps. Strip unwanted characters/markup (e.g., HTML tags, punctuation, numbers). Break the text into tokens (keywords) on whitespace. Stem tokens to root words (e.g., computational → comput). Remove common stopwords (e.g., a, the, it). Detect common phrases (possibly using a domain-specific dictionary). Build an inverted index (keyword → list of documents containing it), as sketched below.
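A minimal Python sketch of this pipeline, under toy assumptions: the stopword list and suffix rules below are placeholders (a real system would use a full stopword list and a proper stemmer such as Porter's), but the shape of each step, including the keyword → document-set mapping of the inverted index, is as described above.

```python
import re

# Toy stopword list and suffix rules, for illustration only; a real system
# would use a full stopword list and a proper stemmer (e.g., Porter's).
STOPWORDS = {"a", "an", "the", "it", "of", "and", "or", "to", "in"}
SUFFIXES = ["ational", "ations", "ation", "ing", "ed", "s"]

def preprocess(text):
    """Strip markup and punctuation, tokenize, remove stopwords, and stem."""
    text = re.sub(r"<[^>]+>", " ", text)           # strip HTML tags
    tokens = re.findall(r"[a-z]+", text.lower())   # keep alphabetic tokens only
    tokens = [t for t in tokens if t not in STOPWORDS]
    stemmed = []
    for t in tokens:
        for suffix in SUFFIXES:                    # crude suffix stripping,
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]              # e.g. computational -> comput
                break
        stemmed.append(t)
    return stemmed

def build_inverted_index(docs):
    """Map each keyword to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in preprocess(text):
            index.setdefault(term, set()).add(doc_id)
    return index
```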

Boolean Model. A document is represented as a set of keywords. Queries are Boolean expressions of keywords, connected by AND, OR, and NOT, including the use of brackets to indicate scope, e.g. [[Rio & Brazil] | [Hilo & Hawaii]] & hotel & !Hilton. Output: a document is either relevant or not; there are no partial matches and no ranking.
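As an illustration, the bracketed query above can be evaluated directly against documents represented as keyword sets; the three toy documents below are invented for the example.

```python
# Each document is just a set of keywords under the Boolean model.
docs = {
    "d1": {"rio", "brazil", "hotel", "hilton"},
    "d2": {"hilo", "hawaii", "hotel"},
    "d3": {"rio", "brazil", "beach"},
}

def matches(doc):
    # ((rio AND brazil) OR (hilo AND hawaii)) AND hotel AND NOT hilton
    return ((("rio" in doc and "brazil" in doc) or
             ("hilo" in doc and "hawaii" in doc))
            and "hotel" in doc and "hilton" not in doc)

print([d for d, kws in docs.items() if matches(kws)])
# ['d2']: d1 fails the NOT clause, d3 lacks 'hotel'; no partial credit.
```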

Boolean Retrieval Model. A popular retrieval model because: it is easy to understand for simple queries; it is a clean formalism; Boolean models can be extended to include ranking; and reasonably efficient implementations are possible for normal queries.

Problems with Boolean Models. Very rigid: AND means all, OR means any. Difficult to express complex user requests. Difficult to control the number of documents retrieved: all matched documents will be returned. Difficult to rank output: all matched documents logically satisfy the query equally well. Difficult to perform relevance feedback: if a document is identified by the user as relevant or irrelevant, how should the query be modified?

Statistical Models. A document is typically represented by a bag of words (unordered words with their frequencies); a bag is a set that allows multiple occurrences of the same element. The user specifies a set of desired terms with optional weights. Weighted query terms: Q = < database 0.5; text 0.8; information 0.2 >. Unweighted query terms: Q = < database; text; information >. No Boolean conditions are specified in the query.
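In Python, a bag of words is exactly what collections.Counter provides; a minimal sketch:

```python
from collections import Counter

# A bag is a multiset: term -> number of occurrences.
bag = Counter("to be or not to be".split())
print(bag)   # Counter({'to': 2, 'be': 2, 'or': 1, 'not': 1})

# The slide's weighted query, as a term -> weight mapping:
q_weighted   = {"database": 0.5, "text": 0.8, "information": 0.2}
q_unweighted = {"database": 1, "text": 1, "information": 1}
```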

Statistical Retrieval. Retrieval is based on the similarity between the query and documents; output documents are ranked according to their similarity to the query, computed from the occurrence frequencies of keywords in the query and document. Automatic relevance feedback can be supported: relevant documents are added to the query, and irrelevant documents are subtracted from it.

Issues for the Vector Space Model. How do we determine the important words in a document? Word senses? Word n-grams (and phrases, idioms, ...) as terms? How do we determine the degree of importance of a term within a document and within the entire collection? How do we determine the degree of similarity between a document and the query? In the case of the web, what is a collection, and what are the effects of links, formatting information, etc.?

The Vector-Space Model. Assume t distinct terms remain after preprocessing; call them index terms or the vocabulary. These "orthogonal" terms form a vector space of dimension t = |vocabulary|. Each term i in a document or query j is given a real-valued weight $w_{ij}$. Both documents and queries are expressed as t-dimensional vectors: $d_j = (w_{1j}, w_{2j}, \ldots, w_{tj})$.

Graphic Representation. Example: D1 = 2T1 + 3T2 + 5T3, D2 = 3T1 + 7T2 + 1T3, Q = 0T1 + 0T2 + 2T3. [Figure: the three vectors plotted in the three-dimensional term space with axes T1, T2, T3.] Is D1 or D2 more similar to Q? How should we measure the degree of similarity? Distance? Angle? Projection?

Document Collection. A collection of n documents can be represented in the vector space model by a term-document matrix. An entry in the matrix corresponds to the weight of a term in a document; zero means the term has no significance in the document, or simply doesn't occur in it.

      T1    T2   ...   Tt
D1    w11   w21  ...   wt1
D2    w12   w22  ...   wt2
:     :     :          :
Dn    w1n   w2n  ...   wtn

Term Weights: Term Frequency. More frequent terms in a document are more important, i.e., more indicative of the topic. Let $f_{ij}$ = frequency of term i in document j. We may want to normalize term frequency (tf) by dividing by the frequency of the most frequent term in the document: $tf_{ij} = f_{ij} / \max_i\{f_{ij}\}$.

Term Weights: Inverse Document Frequency. Terms that appear in many different documents are less indicative of the overall topic. Let $df_i$ = document frequency of term i = number of documents containing term i, and $idf_i$ = inverse document frequency of term i = $\log_2(N / df_i)$, where N is the total number of documents. idf is an indication of a term's discrimination power; the log is used to dampen the effect relative to tf.

TF-IDF Weighting. A typical combined term importance indicator is tf-idf weighting: $w_{ij} = tf_{ij} \cdot idf_i = tf_{ij} \cdot \log_2(N / df_i)$. A term occurring frequently in the document but rarely in the rest of the collection is given high weight. Many other ways of determining term weights have been proposed; experimentally, tf-idf has been found to work well.
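A short sketch of this weighting in Python, following the slide's definitions (tf normalized by the most frequent term in the document, idf = log2(N/df)); the two toy documents in the usage lines are invented:

```python
import math

def tfidf_vectors(doc_term_freqs, n_docs):
    """w_ij = tf_ij * log2(N / df_i), with tf normalized by the max frequency."""
    df = {}                                   # df_i: number of docs containing term i
    for freqs in doc_term_freqs.values():
        for term in freqs:
            df[term] = df.get(term, 0) + 1
    vectors = {}
    for doc_id, freqs in doc_term_freqs.items():
        max_f = max(freqs.values())           # most frequent term in this document
        vectors[doc_id] = {t: (f / max_f) * math.log2(n_docs / df[t])
                           for t, f in freqs.items()}
    return vectors

# Toy usage; a term occurring in every document gets idf = log2(1) = 0.
docs = {"d1": {"database": 3, "text": 1}, "d2": {"text": 2, "index": 1}}
print(tfidf_vectors(docs, n_docs=len(docs)))
```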

Computing TF-IDF: An Example. Given a document containing terms with frequencies A(3), B(2), C(1), and a collection of 10,000 documents in which the document frequencies of these terms are A(50), B(1300), C(250), then (note this example uses the natural log rather than the log2 of the definition):
A: tf = 3/3; idf = log(10000/50) = 5.3; tf-idf = 5.3
B: tf = 2/3; idf = log(10000/1300) = 2.0; tf-idf = 1.3
C: tf = 1/3; idf = log(10000/250) = 3.7; tf-idf = 1.2
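These numbers can be checked directly in Python (the slide's idf values come from the natural log):

```python
import math

N = 10_000                              # documents in the collection
freqs = {"A": 3, "B": 2, "C": 1}        # term frequencies in the document
df    = {"A": 50, "B": 1300, "C": 250}  # document frequencies
max_f = max(freqs.values())

for term, f in freqs.items():
    tf  = f / max_f
    idf = math.log(N / df[term])        # natural log reproduces the slide
    print(term, round(tf, 2), round(idf, 1), round(tf * idf, 1))
# A 1.0  5.3 5.3
# B 0.67 2.0 1.4   (the slide's 1.3 multiplies the already-rounded idf 2.0)
# C 0.33 3.7 1.2
```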

Query Vector. The query vector is typically treated as a document and is also tf-idf weighted. An alternative is for the user to supply weights for the given query terms.

Similarity Measure. A similarity measure is a function that computes the degree of similarity between two vectors. Using a similarity measure between the query and each document, it is possible to rank the retrieved documents in order of presumed relevance, and to enforce a threshold so that the size of the retrieved set can be controlled.

Similarity Measure: Inner Product. Similarity between the vectors for document $d_j$ and query $q$ can be computed as the vector inner product: $\mathrm{sim}(d_j, q) = \vec{d}_j \cdot \vec{q} = \sum_{i=1}^{t} w_{ij} \, w_{iq}$, where $w_{ij}$ is the weight of term i in document j and $w_{iq}$ is the weight of term i in the query. For binary vectors, the inner product is the number of matched query terms in the document (the size of the intersection). For weighted term vectors, it is the sum of the products of the weights of the matched terms.
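A minimal Python version of this inner product over sparse term → weight dictionaries, using the D1, D2, Q example vectors from these slides:

```python
def inner_product(doc, query):
    """sim(d, q): sum of weight products over terms present in both vectors."""
    return sum(w * query[t] for t, w in doc.items() if t in query)

d1 = {"T1": 2, "T2": 3, "T3": 5}
d2 = {"T1": 3, "T2": 7, "T3": 1}
q  = {"T3": 2}
print(inner_product(d1, q), inner_product(d2, q))   # 10 2
```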

Properties of the Inner Product. The inner product is unbounded. It favors long documents with a large number of unique terms. It measures how many terms matched, but not how many terms did not match.

Inner Product: Examples. Binary: D = (1, 1, 1, 0, 1, 1, 0), Q = (1, 0, 1, 0, 0, 1, 1); sim(D, Q) = 3. Here the size of the vector = size of the vocabulary = 7, and 0 means the corresponding term is not found in the document or query. Weighted: D1 = 2T1 + 3T2 + 5T3, D2 = 3T1 + 7T2 + 1T3, Q = 0T1 + 0T2 + 2T3; sim(D1, Q) = 2·0 + 3·0 + 5·2 = 10; sim(D2, Q) = 3·0 + 7·0 + 1·2 = 2.

Cosine Similarity Measure. Cosine similarity measures the cosine of the angle between two vectors: the inner product normalized by the vector lengths. $\mathrm{CosSim}(d_j, q) = \frac{\vec{d}_j \cdot \vec{q}}{|\vec{d}_j|\,|\vec{q}|} = \frac{\sum_{i=1}^{t} w_{ij} w_{iq}}{\sqrt{\sum_{i=1}^{t} w_{ij}^2} \sqrt{\sum_{i=1}^{t} w_{iq}^2}}$. [Figure: D1, D2, and Q as vectors in the term space; similarity corresponds to the angle each document makes with Q.] With D1 = 2T1 + 3T2 + 5T3, D2 = 3T1 + 7T2 + 1T3, and Q = 0T1 + 0T2 + 2T3: CosSim(D1, Q) = 10 / √((4+9+25)(0+0+4)) = 0.81 and CosSim(D2, Q) = 2 / √((9+49+1)(0+0+4)) = 0.13. D1 is 6 times better than D2 using cosine similarity, but only 5 times better using the inner product.
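The same computation in code, following the formula above:

```python
import math

def cosine_sim(doc, query):
    """Inner product of the two vectors, normalized by their lengths."""
    dot = sum(w * query[t] for t, w in doc.items() if t in query)
    norm_d = math.sqrt(sum(w * w for w in doc.values()))
    norm_q = math.sqrt(sum(w * w for w in query.values()))
    return dot / (norm_d * norm_q)

d1 = {"T1": 2, "T2": 3, "T3": 5}
d2 = {"T1": 3, "T2": 7, "T3": 1}
q  = {"T3": 2}
print(round(cosine_sim(d1, q), 2), round(cosine_sim(d2, q), 2))   # 0.81 0.13
```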

Naïve Implementation. Convert all documents in collection D to tf-idf weighted vectors $d_j$ over the keyword vocabulary V. Convert the query to a tf-idf-weighted vector q. For each $d_j$ in D, compute the score $s_j = \mathrm{CosSim}(d_j, q)$. Sort the documents by decreasing score and present the top-ranked documents to the user. Time complexity: O(|V|·|D|), which is bad for large V and D! With |V| = 10,000 and |D| = 100,000, |V|·|D| = 1,000,000,000.
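A sketch of this naive loop, reusing the cosine_sim function from the previous block; doc_vectors is assumed to map document ids to sparse term → weight dictionaries. Scoring every document is exactly the O(|V|·|D|) cost the slide warns about.

```python
def naive_retrieve(doc_vectors, query_vector, k=10):
    """Score every document against the query, then sort: O(|V| * |D|)."""
    scored = [(cosine_sim(vec, query_vector), doc_id)
              for doc_id, vec in doc_vectors.items()]
    scored.sort(reverse=True)            # decreasing score
    return scored[:k]                    # top-ranked documents for the user
```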

Practical Implementation: the Inverted Index. Each index term points to its document frequency (df) and a postings list of (D_j, tf_j) pairs:

Index term   df   postings (D_j, tf_j)
computer      3   (D7, 4)
database      2   (D1, 3)
science       4   (D2, 4)
system        1   (D5, 2)

The index file holds the terms; the postings lists hold the per-document entries.
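A sketch of retrieval with such an index: the postings map each term to its (document, weight) pairs, so a query touches only the lists for its own terms instead of every document vector. The scores here are unnormalized inner products; dividing by precomputed document lengths would turn them into cosine scores.

```python
from collections import defaultdict

def build_postings(doc_vectors):
    """term -> postings list of (doc_id, weight) pairs."""
    postings = defaultdict(list)
    for doc_id, vec in doc_vectors.items():
        for term, w in vec.items():
            postings[term].append((doc_id, w))
    return postings

def retrieve(postings, query_vector, k=10):
    """Accumulate scores by walking only the postings of the query's terms."""
    scores = defaultdict(float)
    for term, w_q in query_vector.items():
        for doc_id, w_d in postings.get(term, []):
            scores[doc_id] += w_d * w_q
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
```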

Comments on Vector Space Models. A simple, mathematically based approach. Considers both local (tf) and global (idf) word occurrence frequencies. Provides partial matching and ranked results. Tends to work quite well in practice despite obvious weaknesses. Allows efficient implementation for large document collections.

Problems with the Vector Space Model. Missing semantic information (e.g., word sense). Missing syntactic information (e.g., phrase structure, word order, proximity). Assumption of term independence (e.g., ignores synonymy). Lacks the control of a Boolean model (e.g., requiring a term to appear in a document): given a two-term query "A B", it may prefer a document containing A frequently but not B over a document that contains both A and B, each less frequently.

Evaluation. Test collections: TREC, CLEF. Relevance judgements are produced by human judges. Common measures: precision (P), recall (R), and F-measure; precision at 10 documents; R-precision; interpolated precision; MAP (mean average precision).

Trade-off between Recall and Precision. [Figure: precision (y-axis, 0 to 1) versus recall (x-axis, 0 to 1). The high-precision, low-recall region returns relevant documents but misses many useful ones too; the high-recall, low-precision region returns most relevant documents but includes lots of junk; the ideal is the upper-right corner.]

Computing Recall/Precision Points. For a given query, produce the ranked list of retrievals. Adjusting a threshold on this ranked list produces different sets of retrieved documents, and therefore different recall/precision measures. Mark each document in the ranked list that is relevant according to the gold standard, then compute a recall/precision pair for each position in the ranked list that contains a relevant document.

Computing Recall/Precision Points: An Example. Let the total number of relevant documents be 6; relevant retrievals are marked with x:

n    doc #   relevant
1     588      x
2     589      x
3     576
4     590      x
5     986
6     592      x
7     984
8     988
9     578
10    985
11    103
12    591
13    772      x
14    990

Check each new recall point:
R = 1/6 = 0.167; P = 1/1 = 1
R = 2/6 = 0.333; P = 2/2 = 1
R = 3/6 = 0.5;   P = 3/4 = 0.75
R = 4/6 = 0.667; P = 4/6 = 0.667
R = 5/6 = 0.833; P = 5/13 = 0.38
One relevant document is never retrieved, so 100% recall is never reached.
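The same computation in Python; the flags list encodes the x marks at ranks 1, 2, 4, 6, and 13:

```python
def recall_precision_points(ranked_relevance, total_relevant):
    """One (recall, precision) pair per relevant document in the ranking."""
    points, hits = [], 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            points.append((hits / total_relevant, hits / rank))
    return points

# Relevance flags for the slide's ranked list (x at ranks 1, 2, 4, 6, 13):
flags = [True, True, False, True, False, True, False,
         False, False, False, False, False, True, False]
for r, p in recall_precision_points(flags, total_relevant=6):
    print(f"R={r:.3f} P={p:.3f}")
# R=0.167 P=1.000 ... R=0.833 P=0.385, matching the slide.
```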

Interpolating a Recall/Precision Curve. Interpolate a precision value for each standard recall level $r_j \in \{0.0, 0.1, 0.2, \ldots, 1.0\}$ ($r_0 = 0.0$, $r_1 = 0.1$, ..., $r_{10} = 1.0$). The interpolated precision at the j-th standard recall level is the maximum known precision at any recall level between the j-th and (j+1)-th levels: $P(r_j) = \max_{r_j \le r \le r_{j+1}} P(r)$.
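A small sketch of this interpolation in Python. Note that it implements the common TREC-style rule (the maximum precision at any recall level at or above r_j), which differs slightly from the slide's definition restricting the maximum to the interval between adjacent standard levels:

```python
def interpolated_precision(points):
    """Interpolated precision at the 11 standard recall levels 0.0, 0.1, ..., 1.0,
    taking the maximum precision at any recall level at or above r_j."""
    levels = [j / 10 for j in range(11)]
    return {r_j: max((p for r, p in points if r >= r_j), default=0.0)
            for r_j in levels}

# Recall/precision points from the example above:
points = [(1/6, 1.0), (2/6, 1.0), (3/6, 0.75), (4/6, 0.667), (5/6, 0.385)]
print(interpolated_precision(points)[0.5])   # 0.75
```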

Interpolating a Recall/Precision Curve: An Example. [Figure: the interpolated precision/recall curve for the example above, with precision (0 to 1.0) on the y-axis and recall (0 to 1.0) on the x-axis.]

Average Recall/Precision Curve. Performance is typically averaged over a large set of queries: compute the average precision at each standard recall level across all queries, then plot the average precision/recall curve to evaluate overall system performance on a document/query corpus.

Compare Two or More Systems. The curve closest to the upper right-hand corner of the graph indicates the best performance. [Figure: average precision/recall curves for two systems, labeled NoStem and Stem, with precision (0 to 1) on the y-axis and recall (0.1 to 1) on the x-axis.]