Boolean Model
Hongning Wang
CS@UVa, CS 6501: Information Retrieval

Abstraction of search engine architecture
[Architecture diagram: a crawler builds the indexed corpus; the doc analyzer produces doc representations, which the indexer turns into the index; the user's query becomes a query representation; the ranker scores indexed documents against it and returns results; evaluation and feedback loop back into the query.]

Search with Boolean query
- Boolean query, e.g., obama AND healthcare NOT news
- Procedure:
  - Look up each query term in the dictionary
  - Retrieve its posting list
  - Combine the posting lists: AND intersects them, OR unions them, NOT takes the difference

Search with Boolean query
Example: the AND operation scans the two posting lists in parallel

    Term1: 2 4 8 16 32 64 128
    Term2: 1 2 3 5 8 13 21 34

- Time complexity: O(|L1| + |L2|)
- Trick for speed-up: in a multi-way join, start from the lowest-frequency term and proceed toward the highest-frequency one
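The linear merge above can be sketched in a few lines of Python (an illustrative sketch, not code from the lecture; the doc-ID lists are the slide's example postings):

```python
def intersect(p1, p2):
    """Linear merge of two sorted posting lists: O(|L1| + |L2|)."""
    i, j, result = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:       # doc appears in both lists -> keep it
            result.append(p1[i])
            i, j = i + 1, j + 1
        elif p1[i] < p2[j]:      # advance whichever pointer is behind
            i += 1
        else:
            j += 1
    return result

def intersect_many(posting_lists):
    """Multi-way AND: start from the lowest-frequency (shortest) list,
    so intermediate results stay as small as possible."""
    ordered = sorted(posting_lists, key=len)
    result = ordered[0]
    for p in ordered[1:]:
        result = intersect(result, p)
    return result

# The slide's postings: docs 2 and 8 appear in both lists.
print(intersect([2, 4, 8, 16, 32, 64, 128], [1, 2, 3, 5, 8, 13, 21, 34]))  # [2, 8]
```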

Deficiency of the Boolean model
- The query is unlikely to be precise:
  - Over-constrained query (terms too specific): no relevant documents found
  - Under-constrained query (terms too general): over-delivery
  - It is hard to find the right position between these two extremes (hard for users to specify the constraints)
- Even if the query is accurate:
  - Not all users are willing to write such queries
  - Not all relevant documents are equally relevant
  - No one would go through all the matched results
- Relevance is a matter of degree!

Document Selection vs. Ranking
- Doc selection: f(d,q) ∈ {0,1} — a binary classifier decides whether each document belongs to Rel(q), so the system imposes the boundary
- Doc ranking: rel(d,q) — documents are ordered by a relevance score (e.g., 0.98 d1, 0.95 d2, 0.83 d3, 0.80 d4, ...), and the user decides where to stop

Ranking is often preferred
- Relevance is a matter of degree
- Easier for users to find appropriate queries
- A user can stop browsing anywhere, so the boundary is controlled by the user:
  - Users who prefer coverage view more items
  - Users who prefer precision view only a few
- Theoretical justification: the Probability Ranking Principle

Retrieval procedure in modern IR
- The Boolean model provides the ranking candidates: locate documents satisfying the Boolean condition, e.g., obama healthcare -> obama OR healthcare
- Rank the candidates by relevance
  - Important: the notion of relevance
- Efficiency consideration: top-k retrieval (Google)

Notion of relevance
- Relevance ≈ similarity of (Rep(q), Rep(d)) — different representations & similarity measures:
  - Vector space model (Salton et al., 75) — today's lecture
  - Regression model (Fox, 83)
  - Prob. distr. model (Wong & Yao, 89)
- Relevance as probability of relevance: P(r=1|q,d), r ∈ {0,1}, estimated via P(d|q) or P(q|d):
  - Doc generation (generative model): classical prob. model (Robertson & Sparck Jones, 76)
  - Query generation: LM approach (Ponte & Croft, 98; Lafferty & Zhai, 01a)
- Relevance via probabilistic inference — different inference systems:
  - Prob. concept space model (Wong & Yao, 95)
  - Inference network model (Turtle & Croft, 91)

Intuitive understanding of relevance
Fill in "magic numbers" to describe the relation between documents and words — e.g., 0/1 for Boolean models, probabilities for probabilistic models:

          information  retrieval  retrieved  is  helpful  for  you  everyone
    Doc1       1           1          0       1     1      1    0      1
    Doc2       1           0          1       1     1      1    1      0

Some notations
- Vocabulary of the language: V = {w_1, w_2, ..., w_N}
- Query: q = t_1, ..., t_m, where t_i ∈ V
- Document: d_i = t_{i1}, ..., t_{in}, where t_{ij} ∈ V
- Collection: C = {d_1, ..., d_k}
- Rel(q,d): relevance of doc d to query q
- Rep(d): representation of document d
- Rep(q): representation of query q

Vector Space Model
Hongning Wang
CS@UVa

Relevance = Similarity
- Assumptions:
  - Query and documents are represented in the same form (a query can be regarded as a document)
  - Relevance(d,q) ∝ similarity(d,q)
- R(q) = {d ∈ C | rel(d,q) > θ}, where rel(q,d) = sim(Rep(q), Rep(d))
- Key issues:
  - How to represent the query/document?
  - How to define the similarity measure sim(x,y)?

Vector space model
- Represent both doc and query by concept vectors:
  - Each concept defines one dimension; K concepts define a high-dimensional space
  - Each vector element is a concept weight, e.g., d = (x_1, ..., x_K), where x_i is the importance of concept i
- Measure relevance by the distance between the query vector and the document vector in this concept space

VS model: an illustration
Which document is closer to the query?
[Figure: documents D1-D5 and the query plotted in a 3-D concept space with axes Finance, Sports, and Education]

What the VS model doesn't say
- How to define/select the basic concepts (concepts are assumed to be orthogonal)
- How to assign weights:
  - Weight in the query indicates the importance of the concept
  - Weight in the doc indicates how well the concept characterizes the doc
- How to define the similarity/distance measure

What is a good basic concept?
- Orthogonal: linearly independent basis vectors, non-overlapping in meaning, no ambiguity
- Weights can be assigned automatically and accurately
- Existing solutions:
  - Terms or N-grams, i.e., bag-of-words
  - Topics, i.e., topic models
- We will come back to this later

How to assign weights?
- Important! Why?
  - Query side: not all terms are equally important
  - Doc side: some terms carry more information about the content
- How? Two basic heuristics:
  - TF (Term Frequency) = within-doc frequency
  - IDF (Inverse Document Frequency)

TF weighting
- Idea: a term is more important if it occurs more frequently in a document
- TF formulas: let f(t,d) be the frequency count of term t in doc d
  - Raw TF: tf(t,d) = f(t,d)

TF normalization
- Two views of document length:
  - A doc is long because it is verbose
  - A doc is long because it has more content
- Raw TF is inaccurate:
  - Document length varies
  - Repeated occurrences are less informative than the first occurrence
  - Relevance does not increase proportionally with the number of term occurrences
- Generally penalize long docs, but avoid over-penalizing them: pivoted length normalization
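Pivoted length normalization can be sketched as follows (a hedged illustration of the idea from Singhal et al.; the slope default of 0.2 is a conventional choice, not a value given on the slide):

```python
def pivoted_norm(raw_weight, doc_len, avg_doc_len, slope=0.2):
    """Divide the raw weight by (1 - slope) + slope * (doc_len / avg_doc_len):
    docs of average length are unchanged, longer docs are penalized,
    shorter docs are mildly boosted -- penalizing without over-penalizing.
    The slope controls how strong the length penalty is."""
    return raw_weight / ((1.0 - slope) + slope * (doc_len / avg_doc_len))

print(pivoted_norm(2.0, 100, 100))  # average-length doc: weight unchanged
print(pivoted_norm(2.0, 300, 100))  # three-times-average doc: weight reduced
```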

TF normalization: sublinear TF scaling

    tf(t,d) = 1 + log f(t,d),  if f(t,d) > 0
              0,               otherwise

[Plot: normalized TF grows logarithmically with raw TF]

TF normalization: maximum TF scaling

    tf(t,d) = α + (1 - α) · f(t,d) / max_t' f(t',d)

Normalize by the most frequent word in the doc.
[Plot: normalized TF rises from α at raw TF near zero to 1 for the doc's most frequent term]
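Both scaling schemes are straightforward in code (a minimal sketch; α = 0.4 is a common textbook default, not a value specified on the slide):

```python
import math

def sublinear_tf(f):
    """Sublinear scaling: tf = 1 + log f(t,d) if f > 0, else 0."""
    return 1.0 + math.log(f) if f > 0 else 0.0

def max_tf(f, max_f, alpha=0.4):
    """Maximum TF scaling: tf = alpha + (1 - alpha) * f / max_f,
    where max_f is the count of the most frequent term in the doc."""
    return alpha + (1.0 - alpha) * f / max_f

print(sublinear_tf(1))   # 1.0: a single occurrence keeps weight 1
print(max_tf(10, 10))    # the doc's most frequent term scores 1.0
```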

Document frequency
Idea: a term is more discriminative if it occurs in fewer documents.

IDF weighting
- Solution: assign higher weights to the rare terms
- Formula (non-linear scaling):

    IDF(t) = 1 + log(N / df(t))

  where N is the total number of docs in the collection and df(t) is the number of docs containing term t
- A corpus-specific property, independent of any single document

Why document frequency?
How about total term frequency, ttf(t) = Σ_d f(t,d)?

Table 1. Total term frequency vs. document frequency in the Reuters-RCV1 collection.

    Word       ttf    df
    try        10422  8760
    insurance  10440  3997

Total term frequency cannot recognize words that occur frequently in only a small subset of documents.

TF-IDF weighting
- Combining TF and IDF:
  - Common in a doc → high tf → high weight
  - Rare in the collection → high idf → high weight

    w(t,d) = TF(t,d) × IDF(t)

- The most well-known document representation scheme in IR! (G. Salton et al., 1983)
  - "Salton was perhaps the leading computer scientist working in the field of information retrieval during his time." - Wikipedia
  - The Gerard Salton Award is the highest achievement award in IR.
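Putting raw TF together with the IDF formula gives a complete weighting sketch (an illustration over a hypothetical three-document toy corpus, not the lecture's code):

```python
import math
from collections import Counter

def idf(term, docs):
    """IDF(t) = 1 + log(N / df(t)), where df(t) = # docs containing t."""
    df = sum(1 for d in docs if term in d)
    return 1.0 + math.log(len(docs) / df)

def tfidf(doc, docs):
    """w(t,d) = TF(t,d) * IDF(t), using raw TF as the TF component."""
    tf = Counter(doc)
    return {t: tf[t] * idf(t, docs) for t in tf}

docs = [
    "information retrieval is helpful".split(),
    "retrieval of information".split(),
    "everyone is helpful".split(),
]
weights = tfidf(docs[0], docs)
# "retrieval" (in 2 of 3 docs) gets a lower weight than the rarer "of" would.
```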

How to define a good similarity measure? Euclidean distance?
[Figure: documents D1-D5 and the query in the TF-IDF space with axes Finance, Sports, and Education]

How to define a good similarity measure?
Euclidean distance:

    dist(q,d) = sqrt( Σ_{t∈V} [tf(t,q)·idf(t) − tf(t,d)·idf(t)]² )

- Longer documents will be penalized by their extra words
- We care more about how much the two vectors overlap

From distance to angle
- Angle: how much the vectors overlap
- Cosine similarity: the projection of one vector onto another
[Figure: in the TF-IDF space, Euclidean distance separates the query from documents D1 and D2 even when they point in similar directions; the angle between the vectors captures their overlap]

Cosine similarity
Angle between two TF-IDF vectors:

    cosine(V_q, V_d) = (V_q · V_d) / (||V_q||_2 ||V_d||_2) = (V_q / ||V_q||_2) · (V_d / ||V_d||_2)

Document-length normalized: each vector is reduced to a unit vector.
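Cosine similarity over sparse TF-IDF vectors, represented as term→weight dicts, can be sketched as follows (an illustrative sketch, not the lecture's code):

```python
import math

def cosine(vq, vd):
    """cos(Vq, Vd) = (Vq . Vd) / (||Vq||_2 * ||Vd||_2) for sparse vectors."""
    dot = sum(w * vd.get(t, 0.0) for t, w in vq.items())
    nq = math.sqrt(sum(w * w for w in vq.values()))
    nd = math.sqrt(sum(w * w for w in vd.values()))
    return dot / (nq * nd) if nq > 0 and nd > 0 else 0.0

# Scaling a vector does not change its angle: a doc twice as long, with the
# same term mix, still scores 1.0 (up to floating-point rounding).
print(cosine({"finance": 1.0, "sports": 2.0},
             {"finance": 2.0, "sports": 4.0}))
```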

Fast computation of cosine in retrieval

    cosine(V_q, V_d) ∝ (V_q · V_d) / ||V_d||_2

- ||V_q||_2 is the same for all candidate docs, so it can be dropped
- Normalization of V_d can be done at index time
- Only terms t ∈ q ∩ d contribute: use score accumulators for the query terms when intersecting postings from the inverted index

Fast computation of cosine in retrieval
Maintain a score accumulator for each doc while scanning the postings.

Query = "info security"; S(d,q) = g(t_1) + ... + g(t_n) [sum of TFs of matched terms]

    Postings:
      info:     (d1,3) (d2,4) (d3,1) (d4,5)
      security: (d2,3) (d4,1) (d5,3)

    Accumulators:  d1  d2  d3  d4  d5
      (d1,3) =>     3   0   0   0   0
      (d2,4) =>     3   4   0   0   0
      (d3,1) =>     3   4   1   0   0
      (d4,5) =>     3   4   1   5   0
      (d2,3) =>     3   7   1   5   0
      (d4,1) =>     3   7   1   6   0
      (d5,3) =>     3   7   1   6   3

This can easily be applied to TF-IDF weighting. For top-K retrieval, keep only the most promising accumulators.
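The accumulator trace above can be reproduced with a term-at-a-time scan (a minimal sketch using the slide's postings; here the contribution g(t) is simply the raw TF):

```python
def term_at_a_time(postings_per_term):
    """One accumulator per doc: add each matched term's contribution
    while scanning that term's posting list."""
    acc = {}
    for postings in postings_per_term:
        for doc, tf in postings:
            acc[doc] = acc.get(doc, 0) + tf
    return acc

# The slide's postings for the query "info security".
info = [("d1", 3), ("d2", 4), ("d3", 1), ("d4", 5)]
security = [("d2", 3), ("d4", 1), ("d5", 3)]
scores = term_at_a_time([info, security])
# Final accumulators match the slide: d1=3, d2=7, d3=1, d4=6, d5=3.
top2 = sorted(scores, key=scores.get, reverse=True)[:2]  # top-K retrieval
```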

Advantages of the VS model
- Empirically effective! (top TREC performance)
- Intuitive
- Easy to implement
- Well studied and extensively evaluated
  - The SMART system, developed at Cornell (1960-1999), is still widely used
- Warning: many variants of TF-IDF exist!

Disadvantages of the VS model
- Assumes term independence
- Assumes query and document are represented in the same way
- Lacks predictive adequacy
- Arbitrary term weighting
- Arbitrary similarity measure
- Lots of parameter tuning!

What you should know
- Document ranking vs. selection
- The basic idea of the vector space model
- Two important heuristics in the VS model: TF and IDF
- The similarity measure for the VS model