Introduction to Information Retrieval. Lecture 6: Scoring, Term Weighting

Outline: Why ranked retrieval? Term frequency. tf-idf weighting.

Ranked retrieval. Thus far, our queries have all been Boolean: documents either match or don't. This is good for expert users with a precise understanding of their needs and of the collection, but not good for the majority of users. Most users are not capable of writing Boolean queries... or they are, but they think it's too much work. Most users don't want to wade through 1000s of results. This is particularly true of web search.

Problem with Boolean search: feast or famine. Boolean queries often result in either too few (= 0) or too many (1000s) results. Query 1 (Boolean conjunction): [standard user dlink 650] → 200,000 hits: feast. Query 2 (Boolean conjunction): [standard user dlink 650 no card found] → 0 hits: famine. In Boolean retrieval, it takes a lot of skill to come up with a query that produces a manageable number of hits.

Feast or famine: no problem in ranked retrieval. With ranking, large result sets are not an issue: just show the top 10 results. Premise: the ranking algorithm works, i.e., more relevant results are ranked higher than less relevant results.

Scoring as the basis of ranked retrieval. We wish to rank documents that are more relevant higher than documents that are less relevant. How can we accomplish such a ranking of the documents in the collection with respect to a query? Assign a score to each query-document pair, say in [0, 1]. This score measures how well document and query match.

Query-document matching scores. How do we compute the score of a query-document pair? Let's start with a one-term query. If the query term does not occur in the document, the score should be 0. The more frequent the query term in the document, the higher the score. We will look at a number of alternatives for doing this.


Binary incidence matrix:

            Anthony &  Julius  The      Hamlet  Othello  Macbeth
            Cleopatra  Caesar  Tempest
ANTHONY         1         1       0        0       0        1
BRUTUS          1         1       0        1       0        0
CAESAR          1         1       0        1       1        1
CALPURNIA       0         1       0        0       0        0
CLEOPATRA       1         0       0        0       0        0
MERCY           1         0       1        1       1        1
WORSER          1         0       1        1       1        1

Each document is represented as a binary vector in {0, 1}^|V|.

Count matrix:

            Anthony &  Julius  The      Hamlet  Othello  Macbeth
            Cleopatra  Caesar  Tempest
ANTHONY       157        73      0        0       0        1
BRUTUS          4       157      0        2       0        0
CAESAR        232       227      0        2       1        1
CALPURNIA       0        10      0        0       0        0
CLEOPATRA      57         0      0        0       0        0
MERCY           2         0      3        8       5        8
WORSER          2         0      1        1       1        5

Each document is now represented as a count vector in N^|V|.
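A count vector like the ones in the matrix above can be built with a few lines of Python. This is a minimal sketch: the helper name `count_vectors` and the toy documents are my own illustration, not part of the original slides.

```python
from collections import Counter

def count_vectors(docs, vocab):
    """Represent each document as a vector of raw term counts over a fixed vocabulary."""
    vectors = []
    for doc in docs:
        counts = Counter(doc.lower().split())          # raw term frequencies in this document
        vectors.append([counts[term] for term in vocab])
    return vectors

# Hypothetical toy collection (not the Shakespeare data from the slide).
docs = ["Brutus killed Caesar", "Caesar praised Caesar and Calpurnia"]
vocab = ["brutus", "caesar", "calpurnia"]
print(count_vectors(docs, vocab))  # [[1, 1, 0], [0, 2, 1]]
```

Thresholding each count at 1 would recover the binary incidence vector.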

Bag of words model. We do not consider the order of words in a document. "John is quicker than Mary" and "Mary is quicker than John" are represented the same way. This is called a bag of words model. In a sense, this is a step back: the positional index was able to distinguish these two documents. We will look at recovering positional information later in this course.
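The order-insensitivity of the bag of words model can be seen directly: a `Counter` over the tokens maps both example sentences to the same object. A minimal sketch (the function name `bag_of_words` is my own):

```python
from collections import Counter

def bag_of_words(text):
    """Map a text to its multiset of (lowercased) tokens, discarding word order."""
    return Counter(text.lower().split())

a = bag_of_words("John is quicker than Mary")
b = bag_of_words("Mary is quicker than John")
print(a == b)  # True: word order is discarded
```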

Term frequency tf. The term frequency tf_t,d of term t in document d is defined as the number of times that t occurs in d. We want to use tf when computing query-document match scores. However, raw term frequency is not what we want, because a document with tf = 10 occurrences of the term is more relevant than a document with tf = 1 occurrence of the term... but not 10 times more relevant. Relevance does not increase proportionally with term frequency.

Instead of raw frequency: log frequency weighting. The log frequency weight of term t in d is defined as: w_t,d = 1 + log10(tf_t,d) if tf_t,d > 0, and 0 otherwise. So tf values of 0, 1, 2, 10, 1000 map to weights 0, 1, 1.3, 2, 4, etc. Score for a document-query pair: sum over terms t in both q and d: tf-matching-score(q, d) = Σ_{t ∈ q ∩ d} (1 + log tf_t,d). The score is 0 if none of the query terms is present in the document.
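The log frequency weight and the tf matching score above translate directly into code. A minimal sketch, assuming document term counts are given as a dict (the function names are my own):

```python
import math

def log_tf(tf):
    """w_{t,d} = 1 + log10(tf) for tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def tf_matching_score(query_terms, doc_counts):
    """Sum log-tf weights over terms occurring in both query and document."""
    return sum(log_tf(doc_counts.get(t, 0)) for t in query_terms)

# tf values 0, 1, 2, 10, 1000 map to weights 0, 1, ~1.3, 2, 4:
print([round(log_tf(v), 1) for v in (0, 1, 2, 10, 1000)])  # [0.0, 1.0, 1.3, 2.0, 4.0]
```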


Frequency in document vs. frequency in collection. In addition to term frequency (the frequency of the term in the document), we also want to use the frequency of the term in the collection for weighting and ranking.

Desired weight for rare terms. Rare terms are more informative than frequent terms. Consider a term in the query that is rare in the collection (e.g., ARACHNOCENTRIC). A document containing this term is very likely to be relevant. We want high weights for rare terms like ARACHNOCENTRIC.

Desired weight for frequent terms. Frequent terms are less informative than rare terms. Consider a term in the query that is frequent in the collection (e.g., GOOD, INCREASE, LINE). A document containing this term is more likely to be relevant than a document that doesn't... but words like GOOD, INCREASE and LINE are not sure indicators of relevance. For frequent terms like GOOD, INCREASE and LINE, we want positive weights... but lower weights than for rare terms.

Document frequency. We want high weights for rare terms like ARACHNOCENTRIC. We want low (positive) weights for frequent words like GOOD, INCREASE and LINE. We will use document frequency to factor this into computing the matching score. The document frequency is the number of documents in the collection that the term occurs in.

idf weight. df_t is the document frequency, the number of documents that t occurs in. df_t is an inverse measure of the informativeness of term t. We define the idf weight of term t as: idf_t = log10(N / df_t), where N is the number of documents in the collection. idf_t is a measure of the informativeness of the term. We use log(N/df_t) instead of N/df_t to dampen the effect of idf. Note that we use the log transformation for both term frequency and document frequency.

Examples for idf. Compute idf_t using the formula idf_t = log10(1,000,000 / df_t):

term            df_t   idf_t
calpurnia          1     6
animal           100     4
sunday         1,000     3
fly           10,000     2
under        100,000     1
the        1,000,000     0
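The idf column above follows mechanically from the formula. A minimal sketch that reproduces the table (the function name `idf` is my own):

```python
import math

def idf(N, df):
    """idf_t = log10(N / df_t): inverse measure of how common term t is."""
    return math.log10(N / df)

N = 1_000_000  # collection size from the example
for term, df in [("calpurnia", 1), ("animal", 100), ("sunday", 1_000),
                 ("fly", 10_000), ("under", 100_000), ("the", 1_000_000)]:
    print(f"{term:10s} df={df:9d} idf={idf(N, df):.0f}")
```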

Effect of idf on ranking. idf affects the ranking of documents for queries with at least two terms. For example, in the query [arachnocentric line], idf weighting increases the relative weight of ARACHNOCENTRIC and decreases the relative weight of LINE. idf has little effect on ranking for one-term queries.

tf-idf weighting. The tf-idf weight of a term is the product of its tf weight and its idf weight: w_t,d = (1 + log tf_t,d) × log(N / df_t). Best known weighting scheme in information retrieval. Note: the "-" in tf-idf is a hyphen, not a minus sign! Alternative names: tf.idf, tf x idf.

Summary: tf-idf. Assign a tf-idf weight w_t,d = (1 + log tf_t,d) × log(N / df_t) to each term t in each document d: tf-idf-matching-score(q, d) = Σ_{t ∈ q ∩ d} w_t,d. The tf-idf weight... increases with the number of occurrences within a document (term frequency)... increases with the rarity of the term in the collection (inverse document frequency).
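Putting the two components together, the tf-idf matching score can be sketched end to end. This is an illustration, not the deck's reference implementation: the function name `tfidf_score` and the toy document-frequency table are my own, and df/N would normally come from an inverted index.

```python
import math
from collections import Counter

def tfidf_score(query, doc, df, N):
    """tf-idf matching score: sum over shared terms of (1 + log10 tf) * log10(N / df)."""
    counts = Counter(doc.lower().split())
    score = 0.0
    for t in set(query.lower().split()):
        tf = counts[t]
        if tf > 0 and df.get(t, 0) > 0:          # only terms in both query and document contribute
            score += (1 + math.log10(tf)) * math.log10(N / df[t])
    return score

# Hypothetical collection statistics: "caesar" is rare (df=10), "the" is in every doc.
df = {"caesar": 10, "the": 1000}
print(tfidf_score("caesar the", "caesar spoke caesar", df, N=1000))
```

Note that "the" contributes nothing twice over: it is absent from the document, and its idf is log10(1000/1000) = 0 anyway.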