Lecture 5: Information Retrieval using the Vector Space Model


Lecture 5: Information Retrieval using the Vector Space Model
Trevor Cohn (tcohn@unimelb.edu.au)
Slide credits: William Webber
COMP90042, 2015, Semester 1

What we'll learn today
- How to take a user query and return a ranked list of results
- How to implement this operation in a reasonably efficient way
- How to normalise for document length

Reviewing: document similarity in VSM
- A document is a bag of words (BOW)
- Project it into term space as a vector, with dimension lengths given by TF*IDF
- Calculate document similarity as the cosine of the angle between the vectors
- Implement this as a dot product on unit-length vectors
- The same process can be used to rank documents by decreasing similarity to a given document

Query processing in VSM
- Treat the query as a (short) pseudo-document
- Calculate the (VSM cosine) similarity between the query pseudo-document and each document in the collection
- Rank documents by decreasing similarity with the query
- Return them to the user in rank order (generally only the top results initially)

Example

Corpus (raw term counts):

         two  tea  me  you
  doc1    2    2    0   0
  doc2    0    2    1   1
  doc3    0    0    2   2

Query: "tea me"

         two  tea  me  you
  query   0    1    1   0

Example (unit-length vectors)

Corpus:

         two    tea    me     you
  doc1   0.707  0.707  0      0
  doc2   0      0.816  0.408  0.408
  doc3   0      0      0.707  0.707

Query: "tea me"

         two    tea    me     you
  query  0      0.707  0.707  0

Example (cosine scores)

cos(doc1, q) = (0.707, 0.707, 0, 0) · (0, 0.707, 0.707, 0) = 0.5
cos(doc2, q) = (0, 0.816, 0.408, 0.408) · (0, 0.707, 0.707, 0) = 0.866
cos(doc3, q) = (0, 0, 0.707, 0.707) · (0, 0.707, 0.707, 0) = 0.5

So doc2 is the best match, followed by doc1 and doc3, which are tied.
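The worked example above can be reproduced in a few lines of Python. The count vectors and query come from the slides; the helper names are illustrative. Values are shown to three decimal places.

```python
import math

def unit(vec):
    """Scale a vector to unit (L2) length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(u, v):
    """Dot product; equals cosine similarity on unit-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Term order: two, tea, me, you (raw counts from the example corpus)
docs = {
    "doc1": [2, 2, 0, 0],
    "doc2": [0, 2, 1, 1],
    "doc3": [0, 0, 2, 2],
}
query = unit([0, 1, 1, 0])  # "tea me"

scores = {d: dot(unit(v), query) for d, v in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking[0])                    # doc2 is the best match
print({d: round(s, 3) for d, s in scores.items()})
```

Note that doc1 and doc3 tie at 0.5: each contains exactly one query term with the same relative weight after length normalisation.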

Cosine similarity
- Many elements of the example vectors were 0, and thus did not contribute to the cosine
- This is also true in real settings with large vocabularies
- Enumerating all the documents is highly inefficient
- Can we devise a way to find the most similar documents efficiently... using the inverted index?

Index
- Imagine that for every similarity computation, we recalculate the unit-length TF*IDF vectors for all documents
- Recall these are simple expressions involving the raw term frequency f_{d,t} and the document frequency df_t
- Since these do not change from query to query, save processing by precalculating them and storing the results in an index
- But we still need to iterate through all documents to rank by similarity: this is an O(|D|) operation

Term-wise processing
- In document similarity, only terms occurring in both documents contribute to the cosine score (remember the dot product)
- In query processing by the pseudo-document model, therefore, only documents that contain query terms need to be considered (which makes intuitive sense)
- Complexity is reduced to O(Σ_{t ∈ q} df_t)
- Note that the most frequent term dominates: a very good reason to drop stop-words!
- We need an index that supports quickly finding which documents a term occurs in

Recap: Inverted index

An index designed to support query processing:
- Keys are terms
- Values are lists of ⟨d, w_{t,d}⟩ pairs
- Each ⟨d, w_{t,d}⟩ pair is called a posting
- A list of these is called a postings list

  Term   Postings list
  tea    1:1.4 ; 3:1.0 ; 6:1.7 ; ...
  two    2:2.3 ; 3:1.0 ; 4:1.7 ; ...
  me     1:1.0 ; 2:1.4 ; ...

Note the inclusion of the weightings w_{t,d} here, no longer binary as in earlier lectures.
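A minimal sketch of this structure in Python: a dictionary mapping each term to its postings list. Here the postings store raw term frequencies rather than weights (a choice motivated on the next slide); the toy corpus and function names are illustrative, not from the lecture.

```python
from collections import defaultdict

def build_index(corpus):
    """Build an inverted index: term -> list of (doc_id, term_frequency) postings.

    corpus: dict mapping doc_id -> list of tokens.
    """
    index = defaultdict(list)
    for doc_id, tokens in sorted(corpus.items()):
        counts = {}
        for t in tokens:
            counts[t] = counts.get(t, 0) + 1
        for term, tf in counts.items():
            index[term].append((doc_id, tf))  # one posting per (term, document)
    return dict(index)

corpus = {
    1: ["two", "two", "tea", "tea"],
    2: ["tea", "tea", "me", "you"],
    3: ["me", "me", "you", "you"],
}
index = build_index(corpus)
print(index["tea"])  # postings list for "tea": [(1, 2), (2, 2)]
```

Because documents are processed in increasing doc-id order, each postings list comes out sorted by document id, which later supports merging and compression.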

Efficient encoding
- Storing the inverted index is costly, so we seek to minimise its footprint
  - so it can be stored in memory (much faster than disk)
  - or on disk, to minimise the number of disk accesses
- Must consider the size implications of data structure choices
  - real-valued numbers (floats) are much larger than integer counts, and much harder to compress
- Although it is convenient to store precomputed, document-normalised weightings (e.g., TF*IDF), it is often more practical to store the components:
  - raw term frequencies in the postings lists
  - IDF values stored for each term in the dictionary
  - document length normalisation values stored in a separate list

A more efficient index for the example

Inverted index storing mostly integer counts:

  Term   IDF   Postings list
  tea    1.9   1:3 ; 3:1 ; 6:2 ; ...
  two    0.3   2:4 ; 3:1 ; 4:2 ; ...
  me     0.8   1:1 ; 2:2 ; ...

Real-valued document lengths:

  DocId   w_{.,d}
  1       2.3
  2       3.4
  3       1.7
  4       42.8
  ...

This incurs a cost at query time, but that is often outweighed by the better use of faster media.

Query processing on the inverted index

Assuming normalised TF*IDF weights:

  Set accumulator a_d ← 0 for all documents d
  for all terms t in query do
      Load postings list for t
      for all postings ⟨d, w_{t,d}⟩ in list do
          a_d ← a_d + w_{t,d}
      end for
  end for
  Sort documents by decreasing a_d
  return sorted results to user

Inefficient use of space!

Query processing on the inverted index

Storing integer term frequencies:

  Set accumulator a_d ← 0 for all documents d
  for all terms t in query do
      Load postings list for t and idf_t = log(N / df_t)
      for all postings ⟨d, f_{t,d}⟩ in list do
          a_d ← a_d + f_{t,d} × idf_t
      end for
  end for
  Load document length array, L
  Normalise by document lengths: a_d ← a_d / L_d
  Sort documents by decreasing a_d
  return sorted results to user

A little extra computation in the inner loop, but this supports more compact storage.
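The accumulator algorithm above can be sketched in Python. This is an illustrative implementation, not the lecture's code: the toy index stores raw term frequencies, IDF is derived from the postings-list length, and the document lengths L_d are the vector lengths from the earlier example (√8 ≈ 2.83, √6 ≈ 2.45).

```python
import math

N = 3  # number of documents in the collection

# term -> postings list of (doc_id, raw term frequency f_{t,d})
index = {
    "two": [(1, 2)],
    "tea": [(1, 2), (2, 2)],
    "me":  [(2, 1), (3, 2)],
    "you": [(2, 1), (3, 2)],
}
doc_length = {1: 2.83, 2: 2.45, 3: 2.83}  # precomputed L_d values

def rank(query_terms):
    """Term-at-a-time query processing with accumulators."""
    acc = {}  # accumulator a_d, created lazily: only docs containing
              # at least one query term are ever touched
    for t in query_terms:
        postings = index.get(t, [])
        if not postings:
            continue
        idf = math.log(N / len(postings))  # idf_t = log(N / df_t)
        for doc_id, tf in postings:
            acc[doc_id] = acc.get(doc_id, 0.0) + tf * idf
    for doc_id in acc:                     # a_d <- a_d / L_d
        acc[doc_id] /= doc_length[doc_id]
    return sorted(acc.items(), key=lambda p: p[1], reverse=True)

results = rank(["tea", "me"])
print(results[0][0])  # doc 2 scores highest, as in the worked example
```

Note how the lazy accumulator dictionary realises the term-wise processing idea: complexity depends on the postings lists of the query terms, not on the whole collection.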

Is this cosine?

The quantity computed for each document is

  a_d = ( Σ_{t ∈ q} w_{t,d} ) / √( Σ_{t ∈ d} w_{t,d}² )

However, cosine is defined as

  cos(d, q) = ( Σ_t w_{t,q} · w_{t,d} ) / ( √( Σ_{t ∈ q} w_{t,q}² ) · √( Σ_{t ∈ d} w_{t,d}² ) )

What happened to the other normalisation term and to w_{t,q}?
- Each term is assumed to occur once in the query, i.e., w_{t,q} = 1 for t ∈ q
- Query length normalisation doesn't matter when comparing a fixed query with several documents (why?)

Tweaking the formula

Several different choices must be made about the weighting formula:
- whether to include IDF, and how to define it
- raw term frequency f_{d,t} versus log(1 + f_{d,t})
- whether to normalise document vectors
- unit-length normalisation of the query doesn't matter

Many of the component choices made here are heuristic. Zobel and Moffat, "Exploring the Similarity Space" (1998), identify 8 × 9 × 2 × 6 × 14 = 12,096 possible different combinations of choices. One can try different variants to improve effectiveness. (We'll talk in a later lecture about how to test success.)

Alternative document length normalisation
- To date, we have normalised document vectors to unit length. But is this correct?
  - Very short documents will get high scores for term occurrences
  - Long documents may cover many topics, and so satisfy many queries
- Empirical adjustment. Assume that we have:
  - a large number of queries
  - judgments of which documents are relevant to which queries
- Then we can compare:
  - the probability of a document being retrieved given its length
  - the probability of a document being relevant given its length
  and adjust if these two probabilities are out of line

Probability retrieved vs. relevant given length
- Look at the mean empirical relation between the two probabilities
- Simplify and identify a pivot point: lengths greater than the pivot should be boosted; lengths less than it, decreased
- Linearly approximate the adjustment by a slope
Amit Singhal, Chris Buckley, and Mandar Mitra, "Pivoted Document Length Normalization", SIGIR 1996

Pivoted document length normalisation

- w: weight of a term in a document (e.g. TF*IDF)
- n_u: original normaliser (e.g. unit-length normalisation by the length of the document vector)
- p: pivot point for pivoted normalisation
- s: slope of pivoted normalisation

Pivot-normalised weight of a term:

  w_p = w / ((1.0 − s) × p + s × n_u)

- Various approximations and factors apply (see Singhal et al.)
- We are no longer calculating cosine distance, but a pseudo-cosine distance!
- Requires a dataset to tune on, and will be specific to that dataset
- Gives a significant improvement in effectiveness
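A small sketch of the formula above: the original normaliser n_u is replaced by the pivoted one, (1 − s) × p + s × n_u. The pivot and slope values here are purely illustrative; in practice they are tuned on a dataset with relevance judgments.

```python
def pivoted_weight(w, n_u, p, s):
    """w: term weight (e.g. TF*IDF); n_u: original normaliser;
    p: pivot; s: slope. Returns the pivot-normalised weight."""
    return w / ((1.0 - s) * p + s * n_u)

p, s = 10.0, 0.7  # illustrative values only

# At the pivot, pivoted and original normalisation (w / n_u) agree:
print(round(pivoted_weight(2.0, 10.0, p, s), 6))     # 0.2
# Below the pivot the weight is decreased relative to plain w / n_u;
# above the pivot it is boosted:
print(pivoted_weight(2.0, 5.0, p, s) < 2.0 / 5.0)    # True
print(pivoted_weight(2.0, 20.0, p, s) > 2.0 / 20.0)  # True
```

This reproduces the behaviour described on the previous slide: documents shorter than the pivot have their scores decreased, and longer ones boosted.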

Vector space vs. Boolean querying

How does the vector space model compare to Boolean IR?
- Scope: matching scores are assigned to a larger set of documents, and ranked lists are returned
- Type: neither a conjunctive nor a disjunctive query; documents may omit some query terms, but more matches lead to higher scores
- Index construction is similar, but records TF*IDF values rather than Boolean indicators
- Querying is quite different, based on accumulators, with different computational complexity

Looking back and forward

Back:
- Queries can be processed in the VSM by treating the query as a (pseudo-)document
- The inverted index supports efficient query processing
- Various tweaks to the VSM formulae are possible, of which pivoted document length normalisation is empirically the most effective

Looking back and forward

Forward:
- How to store the dictionary, postings and documents most efficiently?
- Compression techniques for the best use of space and time
- In a later lecture, we will look at the evaluation of IR methods, for selecting methods and tuning parameters
- Then we will look at probabilistic methods, which are often more theoretically grounded and require fewer heuristics

Further reading
- Chapter 6 (6.3 onwards) and Section 7.1 of Manning, Raghavan, and Schutze, Introduction to Information Retrieval
- Justin Zobel and Alistair Moffat, "Inverted files for text search engines", ACM Computing Surveys, 2006 (authoritative survey paper on inverted indexes): http://www.cs.mu.oz.au/~jz/fulltext/compsurv06.pdf
- Singhal, Buckley, and Mitra, "Pivoted document length normalization", SIGIR, 1996: http://singhal.info/pivoted-dln.pdf