Information Retrieval. Lecture 5 - The vector space model. Wintersemester 2007


Information Retrieval
Lecture 5 - The vector space model
Seminar für Sprachwissenschaft
International Studies in Computational Linguistics
Wintersemester 2007

Introduction

- Boolean model: all documents matching the query are retrieved
- The matching is binary: yes or no
- Extreme cases: the list of retrieved documents can be empty, or huge
- A ranking of the documents matching a query is therefore needed
- A score is computed for each (query, document) pair

Term frequency

- Evaluates how important a term is with respect to a document
- First idea: the more important a term is, the more often it appears
- Term frequency:

    tf(t,d) = sum over x in d of f_t(x),  where f_t(x) = 1 if x = t, and 0 otherwise

- NB1: the order of terms within a document is ignored
- NB2: are all words equally important? What about stop-lists?

Term frequency (continued)

- Terms occurring very often in the collection are not relevant for distinguishing among the documents
- A relevance measure cannot take only term frequency into account
- Idea: reduce the relevance (weight) of a term by a factor growing with its collection frequency
- Collection frequency versus document frequency:

    Term t      cf_t     df_t
    try         10422    8760
    insurance   10440    3997

Inverse document frequency

- Inverse document frequency of a term t: idf_t = log(N / df_t), with N = collection size
- NB: rare terms have a high idf, contrary to frequent terms
- Example (Reuters collection, from Manning et al.):

    Term t      df_t     idf_t
    car         18165    1.65
    auto        6723     2.08
    insurance   19241    1.62
    best        25235    1.5

tf-idf weighting

- The weight of a term is computed using both tf and idf:

    w(t,d) = tf(t,d) * idf_t,  called tf-idf(t,d)

- w(t,d) is:
  1. high when t occurs many times in a small set of documents
  2. low when t occurs fewer times in a document, or when it occurs in many documents
  3. very low when t occurs in almost every document
- Score of a document with respect to a query:

    score(q,d) = sum over t in q of w(t,d)
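The tf-idf weighting and the query score above can be sketched in a few lines of Python. This is a minimal illustration on tokenized toy documents; the function names and the use of log base 10 (as in the Reuters example) are choices made here, not prescribed by the lecture.

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    """Compute w(t, d) = tf(t, d) * idf_t for a list of tokenized documents."""
    n_docs = len(docs)
    # document frequency: number of documents in which each term appears
    df = Counter(t for doc in docs for t in set(doc))
    idf = {t: math.log10(n_docs / df_t) for t, df_t in df.items()}
    # per-document weights: raw term frequency times idf
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

def score(query_terms, weights):
    """score(q, d) = sum over query terms of w(t, d), for every document."""
    return [sum(w.get(t, 0.0) for t in query_terms) for w in weights]
```

For instance, with three tiny documents, a query term occurring in every document contributes nothing less than one occurring in a single document, exactly as points 1-3 above describe.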

The vector space model

- Each term t of the dictionary is considered as a dimension
- A document d can be represented by the weight of each dictionary term:

    V(d) = (w(t_1,d), w(t_2,d), ..., w(t_n,d))

- Question: does this representation allow us to compute the similarity between documents?
- Similarity between vectors: inner product V(d_1) . V(d_2)
- What about the length of a vector? Longer documents are represented by longer vectors, but that does not mean they are more important

Vector normalization and similarity

- Euclidean normalization (vector length normalization):

    v(d) = V(d) / |V(d)|,  where |V(d)| = sqrt(sum over i of x_i^2)

- Similarity is given by the cosine measure between normalized vectors:

    sim(d_1,d_2) = v(d_1) . v(d_2)

- This similarity measure can be applied to an M x N term-document matrix, where M is the size of the dictionary and N that of the collection: m[t,d] = v(d)_t, the component of v(d) along term t

Example (Manning et al., 07)

    Dictionary   v(d_1)   v(d_2)   v(d_3)
    affection    0.996    0.993    0.847
    jealous      0.087    0.120    0.466
    gossip       0.017    0        0.254

    sim(d_1,d_2) = 0.999
    sim(d_1,d_3) = 0.888
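Normalization and the cosine measure can be sketched directly, representing each vector as a Python dict from terms to weights (a sparse representation chosen here for readability):

```python
import math

def normalize(vec):
    """Euclidean length normalization: v(d) = V(d) / |V(d)|."""
    length = math.sqrt(sum(x * x for x in vec.values()))
    return {t: x / length for t, x in vec.items()}

def cosine(v1, v2):
    """sim(d1, d2) = v(d1) . v(d2), on already-normalized vectors."""
    return sum(w * v2.get(t, 0.0) for t, w in v1.items())
```

Plugging in the normalized vectors of the example above reproduces sim(d_1, d_3) = 0.888.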

Matching queries against documents

- Queries are represented as vectors in the same way as documents
- In this context: score(q,d) = v(q) . v(d)
- In the previous example, with q := "jealous gossip", we obtain:

    v(q) . v(d_1) = 0.074
    v(q) . v(d_2) = 0.085
    v(q) . v(d_3) = 0.509

Retrieving documents

- Basic idea: compute the cosine similarity between the query vector and each document vector, then select the top K scores
- Provided we use tf-idf(t,d) as a weight, which information do we store in the index?
  - N/df_t (the collection size divided by the document frequency), stored with the pointer to the postings list
  - the term frequency tf(t,d), stored in each posting

Cosine scoring algorithm, from (Manning et al., 07); document lengths are assumed precomputed:

    cosinescore(query)
      init(scores[N])   // score of each doc
      init(length[N])   // length of each doc
      for each t in query do
        weight <- w(t,q)
        post <- postings(t)
        for each (d, tf(t,d)) in post do
          scores[d] <- scores[d] + weight * w(t,d)
        endfor
      endfor
      for each d in keys(length) do
        scores[d] <- scores[d] / length[d]
      endfor
      res[K] <- getbest(scores)
      return res
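The cosine scoring pseudocode above translates almost line for line into Python. In this sketch the inverted index is simulated by a plain dict mapping each term to a list of (document id, w(t,d)) postings, and document lengths are assumed precomputed; these data structures are illustrative, not the lecture's index format.

```python
import heapq

def cosine_score(query_weights, postings, length, k):
    """Accumulate w(t,q) * w(t,d) per document over the query terms'
    postings lists, normalize by document length, return the top-K
    (score, doc) pairs."""
    scores = {}
    for t, wq in query_weights.items():
        for d, wd in postings.get(t, []):   # posting: (doc id, w(t, d))
            scores[d] = scores.get(d, 0.0) + wq * wd
    for d in scores:
        scores[d] /= length[d]
    return heapq.nlargest(k, ((s, d) for d, s in scores.items()))
```

Using a heap for the final selection avoids sorting all N scores when only the top K are needed.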

Speeding up document scoring

- The scoring algorithm can be time consuming; heuristics can help save time
- Exact vs approximate top-score retrieval: we can lower the cost of scoring by searching for K documents that are merely likely to be among the top scores
- General optimization scheme:
  1. find a set of documents A such that K < |A| << N, which is likely to contain many documents close to the top scores
  2. return the K top-scoring documents included in A

Index elimination

- Idea: skip postings that are not likely to be relevant
- (a) While processing the query, only consider terms whose idf_t exceeds a predefined threshold
  NB: we thus avoid traversing the postings lists of low-idf_t terms, which are generally long
- (b) Only consider documents in which all query terms appear

Champion lists

- Idea: we know in advance which documents are the most relevant for a given term
- For each term t, we pre-compute the list r(t) of the r most relevant documents in the collection (with respect to w(t,d))
- Given a query q, we compute A = union over t in q of r(t)
- NB: r can depend on the document frequency of the term
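Champion lists can be sketched in a few lines, reusing the (document id, w(t,d)) posting representation from the scoring sketch; the function names are illustrative.

```python
def build_champion_lists(postings, r):
    """For each term, keep only the r postings with the highest w(t, d)."""
    return {t: sorted(pl, key=lambda p: p[1], reverse=True)[:r]
            for t, pl in postings.items()}

def candidate_set(query_terms, champions):
    """A = union over the query terms t of r(t)."""
    return {d for t in query_terms for d, _ in champions.get(t, [])}
```

Scoring is then restricted to the candidate set A instead of the whole collection, at the price of possibly missing some true top-K documents.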

Static quality score

- Idea: only consider documents regarded as high-quality documents
- Given a quality measure g(d), the postings lists are ordered by decreasing value of g(d)
- Can be combined with champion lists, i.e. build the list of the r most relevant documents with respect to g(d)
- Quality can be computed from the logs of user queries

Impact ordering

- Idea: some sublists of the postings lists are of no interest
- To reduce the time complexity:
  - query terms are processed by decreasing idf_t
  - postings are sorted by decreasing term frequency tf(t,d)
  - once idf_t gets low, we can consider only a few postings
  - once tf(t,d) gets smaller than a predefined threshold, the remaining postings in the list are skipped

Cluster pruning

- Idea: the document vectors are grouped by proximity
- We pick sqrt(N) documents at random: the leaders
- For each non-leader, we compute its nearest leader: these are the followers
- At query time, we only compute similarities between the query and the leaders
- The set A is the closest document cluster
- NB: the document clustering should reflect the distribution of the vector space

Tiered indexes

- This technique can be seen as a generalization of champion lists
- Instead of one champion list, we manage layers of champion lists, ordered by increasing size:
  - index 1: the l most relevant documents
  - index 2: the next m most relevant documents
  - index 3: the next n most relevant documents
- Indexes are defined according to thresholds
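Cluster pruning can be sketched as a two-phase procedure: preprocessing picks sqrt(N) random leaders and attaches every other document to its nearest leader, and query time compares the query only against the leaders and then against the winning leader's cluster. The dict-based vectors, the `sim` callback, and the fixed random seed are assumptions of this sketch.

```python
import math
import random

def cluster_prune_index(doc_vectors, sim, seed=0):
    """Pick sqrt(N) random leaders; attach each remaining document to its
    nearest leader (its followers)."""
    docs = list(doc_vectors)
    rng = random.Random(seed)
    leaders = rng.sample(docs, max(1, int(math.sqrt(len(docs)))))
    followers = {l: [] for l in leaders}
    for d in docs:
        if d not in followers:  # non-leader: find its nearest leader
            best = max(leaders, key=lambda l: sim(doc_vectors[d], doc_vectors[l]))
            followers[best].append(d)
    return followers

def cluster_prune_search(query_vec, doc_vectors, followers, sim):
    """Compare the query only to the leaders, then search the nearest
    leader's cluster (the candidate set A) for the best document."""
    leader = max(followers, key=lambda l: sim(query_vec, doc_vectors[l]))
    candidates = [leader] + followers[leader]
    return max(candidates, key=lambda d: sim(query_vec, doc_vectors[d]))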

Query-term proximity

- Priority is given to documents containing many query terms within a small window
- Requires pre-computing n-grams
- And defining a proximity weighting that depends on the window size n (either by hand or using learning algorithms)

Scoring optimizations: summary

1. Index elimination
2. Champion lists
3. Static quality score
4. Impact ordering
5. Cluster pruning
6. Tiered indexes
7. Query-term proximity

Putting it all together

- Many techniques to retrieve documents (using logical operators, proximity operators, or scoring functions)
- The appropriate technique can be selected dynamically, by parsing the query:
  first process the query as a phrase query; if there are fewer than K results, translate it into phrase queries on bi-grams; if there are still too few results, finally process each term independently (a real free-text query)

Conclusion

- What we have seen today: ranking documents using tf-idf(t,d) and cosine similarity; optimizations for document ranking
- Next lecture: other weighting schemes

References

- C. Manning, P. Raghavan and H. Schütze, Introduction to Information Retrieval (sections 6.2 and 6.3, chapter 7)
  http://nlp.stanford.edu/ir-book/pdf/chapter06-tfidf.pdf
  http://nlp.stanford.edu/ir-book/pdf/chapter07-vectorspace.pdf
- S. Garcia, H. Williams and A. Cannane, Access-ordered indexes (2004)
  http://citeseer.ist.psu.edu/675266.html