Introduction to Information Retrieval. CS3245 Information Retrieval. Lecture 7: Scoring, Term Weighting and the Vector Space Model.

Last Time: Index Compression. Collection and vocabulary statistics: Heaps' and Zipf's laws. Dictionary compression for Boolean indexes: dictionary string, blocks, front coding. Postings compression: gap encoding. Sizes for the running example collection, in MB:

collection (text, xml markup etc.)         3,600.0
collection (text)                            960.0
term-doc incidence matrix                 40,000.0
postings, uncompressed (32-bit words)        400.0
postings, uncompressed (20 bits)             250.0
postings, variable byte encoded              116.0
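
As a refresher on the postings compression recapped above, here is a minimal sketch of variable byte encoding (my own illustration, not from the slides); the classic example stores postings 824, 829, 215406 as the gaps 824, 5, 214577:

```python
def vb_encode(gaps):
    """Variable-byte encode a list of positive integer gaps.

    Each gap is split into 7-bit chunks; the high bit is set on the
    last byte of each number to mark where it ends.
    """
    out = bytearray()
    for n in gaps:
        chunks = []
        while True:
            chunks.insert(0, n % 128)
            if n < 128:
                break
            n //= 128
        chunks[-1] += 128  # mark final byte of this number
        out.extend(chunks)
    return bytes(out)

def vb_decode(data):
    """Decode a variable-byte stream back into the gap list."""
    gaps, n = [], 0
    for b in data:
        if b < 128:
            n = 128 * n + b
        else:
            gaps.append(128 * n + (b - 128))
            n = 0
    return gaps

encoded = vb_encode([824, 5, 214577])
assert vb_decode(encoded) == [824, 5, 214577]
```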

Today: Ranked Retrieval. Skipping over Section 6.1; we will return to it later. Scoring documents, term frequency, collection statistics, weighting schemes, vector space scoring.

Ch. 6. Ranked retrieval. Thus far, our queries have all been Boolean: documents either match or don't. This is good for expert users with a precise understanding of their needs and of the collection, and also good for applications, which can easily consume thousands of results. It is not good for the majority of users: most users are incapable of writing Boolean queries (or they are capable, but think it's too much work), and most users don't want to wade through thousands of results. This is particularly true of web search.

Ch. 6. Problem with Boolean search: feast or famine. Boolean queries often result in either too few (= 0) or too many (1000s) results. Query 1: "standard user dlink 650" gives 200,000 hits (also called information overload). Query 2: "standard user dlink 650 no card found" gives 0 hits. It takes a lot of skill to come up with a query that produces a manageable number of hits: AND gives too few, OR gives too many.

Ranked retrieval models. Rather than a set of documents satisfying a query expression, in ranked retrieval models the system returns an ordering over the (top) documents in the collection with respect to a query. Free text queries: rather than a query language of operators and expressions, the user's query is just one or more words in a human language. These are two separate choices, but in practice ranked retrieval models are associated with free text queries.

Ch. 6. Ranked retrieval. When a system produces a ranked result set, large result sets are not an issue: we just show the top k (≈ 10) results and don't overwhelm the user. Premise: the ranking algorithm works.

Ch. 6. Scoring as the basis of ranked retrieval. We wish to return, in order, the documents most likely to be useful to the searcher. How can we rank the documents in the collection with respect to a query? Assign a score, say in [0, 1], to each document. This score measures how well document and query match.

Ch. 6. Query-document matching scores. We need a way of assigning a score to a query/document pair. Let's start with a one-term query. If the query term does not occur in the document, the score should be 0. The more frequent the query term in the document, the higher the score should be.

Ch. 6. Take 1: Jaccard coefficient. Recall the Jaccard coefficient from Chapter 3 (spelling correction): a measure of the overlap of two sets A and B. Jaccard(A, B) = |A ∩ B| / |A ∪ B|. Jaccard(A, A) = 1, and Jaccard(A, B) = 0 if A ∩ B = ∅. Pros: A and B don't have to be the same size, and it always assigns a number between 0 and 1.

Ch. 6. Jaccard coefficient: scoring example. What is the query-document match score that the Jaccard coefficient computes for each of the two documents below? Query: "ides of march". Document 1: "caesar died in march". Document 2: "the long march".
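
Working the example through (the answers were left blank on the original slide): treating the query and each document as term sets, a minimal sketch:

```python
def jaccard(a, b):
    """Jaccard coefficient of two term sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

query = "ides of march".split()
doc1 = "caesar died in march".split()
doc2 = "the long march".split()

print(jaccard(query, doc1))  # 1/6 ≈ 0.167: overlap {march}, union of 6 terms
print(jaccard(query, doc2))  # 1/5 = 0.2:   overlap {march}, union of 5 terms
```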

Ch. 6. Issues with Jaccard for scoring. 1. It doesn't consider term frequency (how many times a term occurs in a document); rare terms in a collection are more informative than frequent terms, and Jaccard doesn't consider this information. 2. We need a more sophisticated way of normalizing for length. Later in this lecture, we'll use ... (blank on the slide) instead of |A ∩ B| / |A ∪ B| (Jaccard) for length normalization.

Sec. 6.2. Recap: binary term-document incidence matrix. Each document is represented by a binary vector ∈ {0,1}^|V|. (The incidence matrix itself is a figure on the original slide.)

Sec. 6.2. Term-document count matrices. Store the number of occurrences of a term in a document: each document is a count vector in ℕ^|V|, a column of the count matrix shown on the original slide.

Bag of words model. Con: the vector representation doesn't consider the ordering of words in a document. "John is quicker than Mary" and "Mary is quicker than John" have the same vectors. In a sense this is a step back: the positional index was able to distinguish these two documents. We will look at recovering positional information later in this course. For now: the bag of words model.

Term frequency tf. The term frequency tf_{t,d} of term t in document d is defined as the number of times that t occurs in d. We want to use tf when computing query-document match scores. But how? Raw term frequency is not what we want: relevance does not increase proportionally with term frequency. A document with 10 occurrences of the term is more relevant than a document with 1 occurrence, but not 10 times more relevant. Note: frequency = count in IR.

Sec. 6.2. Log-frequency weighting. The log-frequency weight of term t in d is

    w_{t,d} = 1 + log10(tf_{t,d})  if tf_{t,d} > 0,  and 0 otherwise.

E.g.: tf = 0 gives 0, 1 gives 1, 2 gives 1.3, 10 gives 2, 1000 gives 4, etc. Score for a document-query pair: sum over terms t in both q and d:

    score(q, d) = Σ_{t ∈ q ∩ d} (1 + log10 tf_{t,d}).

The score is 0 if none of the query terms is present in the document.
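
A minimal sketch of this weighting and the resulting score (function names are mine, for illustration):

```python
import math

def log_tf_weight(tf):
    """Log-frequency weight: 1 + log10(tf) for tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0

def score(query_terms, doc_tf):
    """Sum of log-frequency weights over query terms present in the doc.

    doc_tf maps each term of the document to its raw count.
    """
    return sum(log_tf_weight(doc_tf.get(t, 0)) for t in query_terms)

# Reproduces the slide's examples: 0->0, 1->1, 2->1.3, 10->2, 1000->4.
for tf in (0, 1, 2, 10, 1000):
    print(tf, round(log_tf_weight(tf), 1))
```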

Sec. 6.2.1. Document frequency. Rare terms are more informative than frequent terms (recall stop words). Consider a term in the query that is rare in the collection, e.g. arachnocentric. A document containing this term is very likely to be relevant to the query arachnocentric. We want a high weight for rare terms like arachnocentric.

Sec. 6.2.1. Document frequency, continued. Frequent terms are less informative than rare terms. Consider a query term that is frequent in the collection, e.g. high, increase, line. A document containing such a term is more likely to be relevant than a document that doesn't, but it's not a sure indicator of relevance. For frequent terms we want high positive weights for words like high, increase, and line, but lower than the weights for rare terms. We will use document frequency (df) to capture this.

Sec. 6.2.1. idf weight. df_t is the document frequency of t: the number of documents that contain t. df_t is an inverse measure of the informativeness of t, and df_t ≤ N. We define the idf (inverse document frequency) of t by

    idf_t = log10(N / df_t).

We use log(N/df_t) instead of N/df_t to dampen the effect of idf.

Sec. 6.2.1. Example: suppose N = 1 million.

term        df_t        idf_t
calpurnia   1           6
animal      100         4
sunday      1,000       3
fly         10,000      2
under       100,000     1
the         1,000,000   0

There is one idf value for each term t in a collection.
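
The table can be checked directly; a small sketch assuming base-10 logs, as above:

```python
import math

def idf(N, df):
    """Inverse document frequency: log10(N / df_t)."""
    return math.log10(N / df)

N = 1_000_000
for term, df in [("calpurnia", 1), ("animal", 100), ("sunday", 1_000),
                 ("fly", 10_000), ("under", 100_000), ("the", 1_000_000)]:
    print(f"{term:10s} {idf(N, df):.0f}")  # 6, 4, 3, 2, 1, 0
```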

Effect of idf on ranking. Does idf have an effect on ranking for one-term queries, like "iphone"? idf has no effect on the ranking of one-term queries; idf affects the ranking of documents only for queries with at least two terms. For the query capricious person, idf weighting makes occurrences of capricious count for much more in the final document ranking than occurrences of person.

Sec. 6.2.1. Collection vs. document frequency. The collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences. Example:

word        collection frequency   document frequency
insurance   10,440                 3,997
try         10,422                 8,760

Which word is a better search term (and should get a higher weight)?

Sec. 6.2.2. tf-idf weighting. The tf-idf weight of a term is the product of its tf weight and its idf weight:

    w_{t,d} = (1 + log10 tf_{t,d}) × log10(N / df_t).

It is the best-known weighting scheme in IR. Note: the "-" in tf-idf is a hyphen, not a minus sign! Alternative names: tf.idf, tf × idf. The weight increases with the number of occurrences within a document, and increases with the rarity of the term in the collection.
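
A minimal sketch combining the two components (illustrative names only):

```python
import math

def tf_idf(tf, df, N):
    """tf-idf weight: (1 + log10(tf)) * log10(N / df), 0 if tf == 0."""
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)

# A term occurring 3 times in a doc, appearing in 1,000 of 1,000,000 docs:
print(tf_idf(tf=3, df=1_000, N=1_000_000))  # (1 + 0.477) * 3 ≈ 4.43
```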

Sec. 6.2.2. Final ranking of documents for a query:

    Score(q, d) = Σ_{t ∈ q ∩ d} tf-idf_{t,d}.

Sec. 6.3. Binary → count → weight matrix. Each document is now represented by a real-valued vector of tf-idf weights ∈ ℝ^|V|.

Sec. 6.3. Documents as vectors. So we have a |V|-dimensional vector space. Terms are the axes of the space; documents are points or vectors in this space. It is high-dimensional: tens of thousands of dimensions; each dictionary term is a dimension. These are very sparse vectors: most entries are zero.

Sec. 6.3. Queries as vectors. Key idea 1: do the same for queries, i.e. represent them as vectors in the space; they are mini-documents. Key idea 2: rank documents according to their proximity to the query in this space. Proximity = similarity of vectors; proximity ≈ inverse of distance. Motivation: we want to get away from the you're-either-in-or-out Boolean model and instead rank more relevant documents higher than less relevant documents.

Sec. 6.3. Formalizing vector space proximity. First cut: distance between two points (= distance between the end points of the two vectors). Euclidean distance? Euclidean distance is a bad idea, because Euclidean distance is large for vectors of different lengths.

Sec. 6.3. Why distance is a bad idea. The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar. (Illustrated with a figure on the original slide.)

Sec. 6.3. Use angle instead of distance. Distance counterexample: take a document d and append it to itself; call this document d'. Semantically, d and d' have the same content. The Euclidean distance between the two documents can be quite large, but the angle between the two documents is 0, corresponding to maximal similarity. Key idea: rank documents according to angle with the query.
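
A small numeric illustration of the thought experiment, with a made-up count vector for d:

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

d = [3, 1, 0, 2]            # toy term-count vector for document d
d_prime = [2 * x for x in d]  # d appended to itself: every count doubles

print(euclidean(d, d_prime))  # ≈ 3.74: large, despite identical content
print(cosine(d, d_prime))     # 1.0: angle 0, maximal similarity
```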

Sec. 6.3. From angles to cosines. The following two notions are equivalent: rank documents in decreasing order of the angle between query and document; rank documents in increasing order of cosine(query, document). Cosine is a monotonically decreasing function on the interval [0°, 180°].

Sec. 6.3. Length normalization. A vector can be (length-)normalized by dividing each of its components by its length; for this we use the L2 norm:

    ||x||_2 = sqrt( Σ_i x_i² ).

Dividing a vector by its L2 norm makes it a unit (length) vector (on the surface of the unit hypersphere). Effect on the two documents d and d' (d appended to itself) from the earlier slide: they have identical vectors after length normalization. Long and short documents now have comparable weights.

Sec. 6.3. cosine(query, document):

    cos(q, d) = (q · d) / (||q|| ||d||) = Σ_i q_i d_i / ( sqrt(Σ_i q_i²) · sqrt(Σ_i d_i²) ),

i.e. the dot product of the unit vectors q/||q|| and d/||d||. Here q_i is the tf-idf weight of term i in the query and d_i is the tf-idf weight of term i in the document. cos(q, d) is the cosine similarity of q and d or, equivalently, the cosine of the angle between q and d.

Cosine for length-normalized vectors. For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):

    cos(q, d) = q · d = Σ_i q_i d_i,  for q, d length-normalized.
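
A minimal sketch: normalize once, then score with a plain dot product (helper names are mine):

```python
import math

def normalize(v):
    """Divide each component by the vector's L2 norm (unit vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_normalized(q, d):
    """Cosine similarity of two length-normalized vectors: dot product."""
    return sum(qi * di for qi, di in zip(q, d))

q = normalize([1.0, 0.0, 1.0])
d = normalize([2.0, 1.0, 2.0])
print(cosine_normalized(q, d))  # ≈ 0.943
```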

Cosine similarity illustrated. (Figure on the original slide; not reproduced in this transcription.)

Sec. 6.3. Cosine similarity among three documents. How similar are the novels SaS (Sense and Sensibility), PaP (Pride and Prejudice), and WH (Wuthering Heights)?

Term frequencies (counts):

term        SaS   PaP   WH
affection   115   58    20
jealous     10    7     11
gossip      2     0     6
wuthering   0     0     38

Note: to simplify this example, we don't do idf weighting.

Sec. 6.3. Three documents example, continued.

After log-frequency weighting:

term        SaS    PaP    WH
affection   3.06   2.76   2.30
jealous     2.00   1.85   2.04
gossip      1.30   0      1.78
wuthering   0      0      2.58

After length normalization:

term        SaS     PaP     WH
affection   0.789   0.832   0.524
jealous     0.515   0.555   0.465
gossip      0.335   0       0.405
wuthering   0       0       0.588

cos(SaS, PaP) ≈ 0.789×0.832 + 0.515×0.555 + 0.335×0 + 0×0 ≈ 0.94
cos(SaS, WH) ≈ 0.79
cos(PaP, WH) ≈ 0.69
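
The numbers above can be reproduced directly; a small sketch with the term counts copied from the slide:

```python
import math

counts = {
    "SaS": {"affection": 115, "jealous": 10, "gossip": 2, "wuthering": 0},
    "PaP": {"affection": 58,  "jealous": 7,  "gossip": 0, "wuthering": 0},
    "WH":  {"affection": 20,  "jealous": 11, "gossip": 6, "wuthering": 38},
}
terms = ["affection", "jealous", "gossip", "wuthering"]

def log_weight(tf):
    return 1 + math.log10(tf) if tf > 0 else 0.0

def doc_vector(name):
    """Log-frequency weights, length-normalized."""
    v = [log_weight(counts[name][t]) for t in terms]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cos(a, b):
    return sum(x * y for x, y in zip(doc_vector(a), doc_vector(b)))

print(round(cos("SaS", "PaP"), 2))  # 0.94
print(round(cos("SaS", "WH"), 2))   # 0.79
print(round(cos("PaP", "WH"), 2))   # 0.69
```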

Sec. 6.3. Computing cosine scores. (The original slide shows the CosineScore algorithm as a figure; it did not survive transcription.)
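
Below is a minimal term-at-a-time sketch in the spirit of IIR's CosineScore, assuming an inverted index mapping each term to (docID, tf) postings and a precomputed doc_lengths table (both hypothetical structures, not from the slides):

```python
import heapq
import math

def cosine_score(query_terms, index, doc_lengths, N, k=10):
    """Term-at-a-time cosine scoring sketch (lnc documents, ltc query).

    index: term -> list of (doc_id, tf) postings (assumed structure).
    doc_lengths: doc_id -> L2 norm of the document's weight vector.
    Returns the top-k (score, doc_id) pairs.
    """
    scores = {}
    for term in query_terms:
        postings = index.get(term, [])
        if not postings:
            continue
        w_tq = math.log10(N / len(postings))  # query weight: idf (tf = 1, so 1 + log10(1) = 1)
        for doc_id, tf in postings:
            w_td = 1 + math.log10(tf)         # document weight: log tf, no idf
            scores[doc_id] = scores.get(doc_id, 0.0) + w_td * w_tq
    for doc_id in scores:
        scores[doc_id] /= doc_lengths[doc_id]  # length-normalize by document norm
    # Query norm is omitted: it is constant across documents and does not affect ranking.
    return heapq.nlargest(k, ((s, d) for d, s in scores.items()))
```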

Sec. 6.4. tf-idf weighting has many variants. (The SMART weighting table is a figure on the original slide; columns headed "n" are acronyms for weighting schemes.) Quick question: why is the base of the log in idf immaterial? (Changing the base rescales all idf values by the same constant factor, which leaves the ranking unchanged.)

Sec. 6.4. Weighting may differ in queries vs. documents. Many search engines allow for different weightings for queries vs. documents. SMART notation: denote the combination used with the notation ddd.qqq, using acronyms from the previous table. A very standard weighting scheme is lnc.ltc. Document: logarithmic tf (l as first character), no idf (a bad idea?), and cosine normalization. Query: logarithmic tf (l in leftmost column), idf (t in second column), and cosine normalization.

Sec. 6.4. tf-idf example: lnc.ltc. Document: "car insurance auto insurance". Query: "best car insurance".

Term       | Query: tf-raw  tf-wt  df      idf  wt   n'lize | Doc: tf-raw  tf-wt  wt   n'lize | Prod
auto       |        0       0      5,000   2.3  0    0      |      1       1      1    0.52   | 0
best       |        1       1      50,000  1.3  1.3  0.34   |      0       0      0    0      | 0
car        |        1       1      10,000  2.0  2.0  0.52   |      1       1      1    0.52   | 0.27
insurance  |        1       1      1,000   3.0  3.0  0.78   |      2       1.3    1.3  0.68   | 0.53

Quick question: what is N, the number of docs? Doc length = sqrt(1² + 0² + 1² + 1.3²) ≈ 1.92. Score = 0 + 0 + 0.27 + 0.53 = 0.8.
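
The lnc.ltc computation can be reproduced as follows. N = 1,000,000 is inferred from the idf column (e.g. log10(N/1,000) = 3.0); treat it as an assumption:

```python
import math

N = 1_000_000  # inferred from the idf column above
df = {"auto": 5_000, "best": 50_000, "car": 10_000, "insurance": 1_000}

query_tf = {"best": 1, "car": 1, "insurance": 1}
doc_tf = {"auto": 1, "car": 1, "insurance": 2}

def ltc(tf, term):
    """Query weighting: log tf times idf (cosine-normalized below)."""
    w = (1 + math.log10(tf)) if tf > 0 else 0.0
    return w * math.log10(N / df[term])

def lnc(tf):
    """Document weighting: log tf, no idf (cosine-normalized below)."""
    return (1 + math.log10(tf)) if tf > 0 else 0.0

terms = ["auto", "best", "car", "insurance"]
q = [ltc(query_tf.get(t, 0), t) for t in terms]
d = [lnc(doc_tf.get(t, 0)) for t in terms]

qn = math.sqrt(sum(x * x for x in q))
dn = math.sqrt(sum(x * x for x in d))  # doc length ≈ 1.92
score = sum((x / qn) * (y / dn) for x, y in zip(q, d))
print(round(score, 1))  # 0.8
```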

Summary and algorithm: vector space ranking. 1. Represent the query as a weighted tf-idf vector. 2. Represent each document as a weighted tf-idf vector. 3. Compute the cosine similarity score for the query vector and each document vector. 4. Rank documents with respect to the query by score. 5. Return the top K (e.g., K = 10) to the user.

Ch. 6. Resources for today's lecture: IIR 6.2-6.4.3, and http://www.miislita.com/information-retrieval-tutorial/cosine-similarity-tutorial.html (a term weighting and cosine similarity tutorial for SEO folk!).