Introduction to Information Retrieval (Supplementary Material)
Zhou Shuigeng
March 23, 2007

Text Databases and IR
Text databases (document databases)
- Large collections of documents from various sources: news articles, research papers, books, digital libraries, e-mail messages, Web pages, library databases, etc.
- The stored data is usually semi-structured.
Information retrieval
- A field that developed in parallel with database systems.
- Information is organized into (a large number of) documents.
- The information retrieval problem: locating relevant documents based on user input, such as keywords or example documents.

Information Retrieval
Typical IR systems
- Online library catalogs
- Online document management systems
Information retrieval vs. database systems
- Some DB problems are not present in IR, e.g., updates, transaction management, complex objects.
- Some IR problems are not addressed well in a DBMS, e.g., unstructured documents, approximate search using keywords and relevance.

Basic Measures for Text Retrieval
Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., correct responses):
precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved:
recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|

Precision vs. Recall (1)
[Figure: a Venn diagram over all documents, partitioning them into retrieved vs. not retrieved and relevant vs. irrelevant. Precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}| is the fraction of retrieved documents that are relevant; recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}| is the fraction of relevant documents that are retrieved.]
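To make the set-based definitions concrete, here is a minimal Python sketch; the document IDs and relevance judgments are invented for illustration:

```python
# Minimal illustration of precision and recall as set ratios.
retrieved = {"d1", "d2", "d3", "d4", "d5"}   # documents returned by the system (hypothetical)
relevant  = {"d2", "d4", "d6", "d7"}         # documents judged relevant (hypothetical)

hits = retrieved & relevant                  # relevant AND retrieved
precision = len(hits) / len(retrieved)       # 2 / 5 = 0.4
recall    = len(hits) / len(relevant)        # 2 / 4 = 0.5

print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```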

Recall vs. Precision
[Figure: precision (y-axis, 0 to 1) plotted against recall (x-axis, 0 to 1). The ideal system sits at the top-right corner. High precision with low recall returns relevant documents but misses many useful ones; high recall with low precision returns most relevant documents but also includes a lot of junk.]

IR Techniques (1)
Basic concepts
- A document can be described by a set of representative keywords called index terms.
- Different index terms have varying relevance when used to describe document contents. This effect is captured by assigning a numerical weight to each index term of a document (e.g., frequency, tf-idf).
DBMS analogy
- Index terms correspond to attributes.
- Weights correspond to attribute values.

IR Techniques (2)
Index term (attribute) selection:
- Stop list
- Word stemming
- Index term weighting methods
- Term-document frequency matrices
Information retrieval models:
- Boolean model
- Vector model
- Probabilistic model

Stop Words
From a given stop word list [a, about, again, are, the, to, of, ...], remove those words from the documents.
Or, determine stop words empirically:
- Given a large enough corpus of common English, sort the words in decreasing order of their occurrence frequency in the corpus.
- Zipf's law: frequency × rank ≈ constant.
- The most frequent words tend to be short.
- The most frequent 20% of words account for about 60% of usage.

Zipf's Law -- An Illustration

Rank (R)  Term   Frequency (F)   R*F (×10^6)
1         the    69,971          0.070
2         of     36,411          0.073
3         and    28,852          0.086
4         to     26,149          0.104
5         a      23,237          0.116
6         in     21,341          0.128
7         that   10,595          0.074
8         is     10,009          0.081
9         was     9,816          0.088
10        he      9,543          0.095

Resolving Power of Words
[Figure: words arranged in decreasing frequency order; non-significant high-frequency terms at one end, non-significant low-frequency terms at the other, with the presumed resolving power of significant words peaking in between.]

Simple Indexing Scheme Based on Zipf's Law
Use term frequency information only:
- Compute the frequency of term k in document i, Freq_ik.
- Determine the total collection frequency: TotalFreq_k = Σ_i Freq_ik for i = 1, 2, ..., n.
- Arrange terms in order of collection frequency.
- Set thresholds and eliminate high- and low-frequency terms.
- Use the remaining terms as index terms.
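A minimal Python sketch of this thresholding procedure, assuming the collection is given as a list of token lists; the cut-off values are arbitrary and only for illustration:

```python
from collections import Counter

def select_index_terms(docs, low_cut=2, high_cut=50):
    """Pick index terms by total collection frequency, dropping very
    rare and very frequent terms (thresholds are illustrative only)."""
    total_freq = Counter()
    for tokens in docs:              # docs: list of token lists
        total_freq.update(tokens)    # Freq_ik summed over documents i
    # Keep terms whose collection frequency lies between the two thresholds.
    return {t for t, f in total_freq.items() if low_cut <= f <= high_cut}

# Example usage with a toy collection:
docs = [["the", "computer", "science", "student"],
        ["the", "computer", "uses", "computers"]]
print(select_index_terms(docs, low_cut=2, high_cut=2))  # {'the', 'computer'} with these toy thresholds
```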

Stemming
Stemming: transforming words to their root form, e.g., Computing, Computer, Computation → comput.
Suffix-based methods:
- Remove "ability" from "computability".
- Remove suffixes such as +ness, +ive, ...
- Use a suffix list plus context rules.
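A toy suffix-stripping stemmer in the spirit of the suffix-based methods above; the suffix list is invented for illustration, and a real stemmer (e.g., Porter's) would add context rules:

```python
# Illustrative suffix list, ordered longest-first; real stemmers add context rules.
SUFFIXES = ["ability", "ation", "ness", "ive", "ing", "ers", "er", "s"]

def stem(word):
    """Strip the first matching suffix, keeping at least a 3-letter stem."""
    w = word.lower()
    for suf in SUFFIXES:
        if w.endswith(suf) and len(w) - len(suf) >= 3:
            return w[: -len(suf)]
    return w

print([stem(w) for w in ["Computing", "Computer", "Computation", "computability"]])
# ['comput', 'comput', 'comput', 'comput']
```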

Thesaurus Rules
A thesaurus aims at classifying the words in a language. For a given word, it lists related terms that are broader than, narrower than, the same as (synonyms), or opposed to (antonyms) the given word (other kinds of relationships may exist, e.g., composed-of).
Static thesaurus tables:
- [anneal, strain], [antenna, receiver], ...
- Roget's thesaurus
- WordNet at Princeton

Thesaurus Rules Can Also Be Learned
From a search engine query log (users type queries, then browse the results):
- If query1 and query2 lead to the same document, then Similar(query1, query2).
- If query1 leads to a document whose title contains keyword K, then Similar(query1, K).
- Apply transitivity to derive further similarities.
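A rough Python sketch of the first rule, assuming a hypothetical click log of (query, clicked document) pairs:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical click log: (query, clicked document id) pairs.
click_log = [
    ("cheap flights", "doc42"),
    ("low cost airfare", "doc42"),
    ("car repair", "doc7"),
    ("auto maintenance", "doc7"),
    ("cheap flights", "doc99"),
]

queries_per_doc = defaultdict(set)
for query, doc in click_log:
    queries_per_doc[doc].add(query)

# Rule: if two queries lead to the same clicked document, call them similar.
similar_pairs = set()
for doc, queries in queries_per_doc.items():
    for q1, q2 in combinations(sorted(queries), 2):
        similar_pairs.add((q1, q2))

print(similar_pairs)
# {('cheap flights', 'low cost airfare'), ('auto maintenance', 'car repair')}
```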

Indexing Techniques
Inverted index
- Maintains two hash- or B+-tree-indexed tables:
  - document_table: a set of document records <doc_id, postings_list>
  - term_table: a set of term records <term, postings_list>
- Answering a query: find all documents associated with one term or a set of terms.
- Pros: easy to implement.
- Cons: does not handle synonymy and polysemy well, and posting lists can become very long (storage can be very large).
Signature file
- Associate a signature with each document.
- A signature is a representation of an ordered list of terms that describe the document.
- The order is obtained by frequency analysis, stemming, and stop lists.

Boolean Model
- Index terms are considered either present or absent in a document; as a result, all index term weights are assumed to be binary.
- A query is composed of index terms linked by three connectives: not, and, or (e.g., "car and repair", "plane or airplane").
- The Boolean model predicts that each document is either relevant or non-relevant, based on whether the document matches the query.
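A minimal Python sketch of Boolean retrieval over a binary term-document incidence structure; the documents, terms, and the helper name are invented for illustration:

```python
# Binary incidence: which terms are present in which documents (hypothetical).
docs = {
    "d1": {"car", "repair", "shop"},
    "d2": {"plane", "ticket"},
    "d3": {"airplane", "repair"},
}

def matches(doc_terms, required=(), any_of=(), forbidden=()):
    """Boolean match: all 'required' terms AND at least one of 'any_of'
    (if given) AND none of 'forbidden'."""
    return (all(t in doc_terms for t in required)
            and (not any_of or any(t in doc_terms for t in any_of))
            and not any(t in doc_terms for t in forbidden))

# "car and repair"
print([d for d, t in docs.items() if matches(t, required=["car", "repair"])])   # ['d1']
# "plane or airplane"
print([d for d, t in docs.items() if matches(t, any_of=["plane", "airplane"])])  # ['d2', 'd3']
```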

Boolean Model: Keyword-Based Retrieval
- A document is represented by a string, which can be identified by a set of keywords.
- Queries may use expressions over keywords, e.g., "car and repair shop", "tea or coffee", "DBMS but not Oracle".
- Queries and retrieval should consider synonyms, e.g., "repair" and "maintenance".
Major difficulties of the model:
- Synonymy: a keyword T may not appear anywhere in a document even though the document is closely related to T, e.g., "data mining".
- Polysemy: the same keyword may mean different things in different contexts, e.g., "mining".

The Vector-Space Model
- The distinct terms of the collection are available; call them index terms or the vocabulary.
- The index terms represent the important terms for an application.
- A vector represents each document: <T1, T2, T3, T4, T5> or <W(T1), W(T2), W(T3), W(T4), W(T5)>.
- Example: for a computer science collection, the index terms (vocabulary of the collection) might be T1 = architecture, T2 = bus, T3 = computer, T4 = database, T5 = xml.

The Vector-Space Model
Assumption: words are uncorrelated.
Given:
1. N documents and a query (the query is treated as a document too).
2. Each is represented by t terms.
3. Each term j in document i has weight d_ij.
4. How to compute the weights is dealt with later.

Term-document matrix:
        T_1    T_2    ...   T_t
D_1     d_11   d_12   ...   d_1t
D_2     d_21   d_22   ...   d_2t
...
D_n     d_n1   d_n2   ...   d_nt
Q       q_1    q_2    ...   q_t

Graphic Representation
Example:
D_1 = 2T_1 + 3T_2 + 5T_3
D_2 = 3T_1 + 7T_2 + 1T_3
Q   = 0T_1 + 0T_2 + 2T_3
[Figure: the three vectors plotted in the three-dimensional term space spanned by T_1, T_2, and T_3.]
Is D_1 or D_2 more similar to Q? How do we measure the degree of similarity? Distance? Angle? Projection?

Similarity Measure: Inner Product
The similarity between document D_i and query Q can be computed as the inner (dot) product:
sim(D_i, Q) = D_i · Q = Σ_{j=1}^{t} d_ij * q_j
- Binary weights: weight = 1 if the word is present, 0 otherwise.
- Non-binary weights: the weight represents the degree of similarity (e.g., TF-IDF, explained later).

Inner Product: Examples
Binary (the vocabulary consists of the seven terms retrieval, database, architecture, computer, text, management, information; size of vector = size of vocabulary = 7):
D = (1, 1, 1, 0, 1, 1, 0)
Q = (1, 0, 1, 0, 0, 1, 1)
sim(D, Q) = 3
Weighted:
D_1 = 2T_1 + 3T_2 + 5T_3
Q   = 0T_1 + 0T_2 + 2T_3
sim(D_1, Q) = 2*0 + 3*0 + 5*2 = 10
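Both examples can be reproduced with a few lines of Python:

```python
def inner_product(d, q):
    """Inner-product similarity: sum of pairwise products of weights."""
    return sum(dj * qj for dj, qj in zip(d, q))

# Binary example (7-term vocabulary).
D = [1, 1, 1, 0, 1, 1, 0]
Q = [1, 0, 1, 0, 0, 1, 1]
print(inner_product(D, Q))                    # 3

# Weighted example: D1 = 2T1 + 3T2 + 5T3, Q = 0T1 + 0T2 + 2T3.
print(inner_product([2, 3, 5], [0, 0, 2]))    # 10
```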

Properties of the Inner Product
- The inner product similarity is unbounded.
- It favors long documents: a long document has a large number of unique terms, each of which may occur many times.
- It measures how many terms matched, but not how many terms did not match.

Cosine Similarity Measure
Cosine similarity measures the cosine of the angle between two vectors: the inner product normalized by the vector lengths.
CosSim(D_i, Q) = (Σ_{k=1}^{t} d_ik * q_k) / (sqrt(Σ_{k=1}^{t} d_ik^2) * sqrt(Σ_{k=1}^{t} q_k^2))
[Figure: documents D_1 and D_2 and query Q in term space, with angles θ_1 and θ_2 between each document vector and the query vector.]

Cosine Similarity: An Example
D_1 = 2T_1 + 3T_2 + 5T_3    CosSim(D_1, Q) = 10 / (2 * sqrt(38)) = 5 / sqrt(38) ≈ 0.81
D_2 = 3T_1 + 7T_2 + 1T_3    CosSim(D_2, Q) = 2 / (2 * sqrt(59)) = 1 / sqrt(59) ≈ 0.13
Q   = 0T_1 + 0T_2 + 2T_3
D_1 is about 6 times better than D_2 using cosine similarity, but only 5 times better using the inner product.
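The same numbers fall out of a direct Python implementation of the formula:

```python
import math

def cos_sim(d, q):
    """Cosine similarity: inner product normalized by the vector lengths."""
    dot = sum(dj * qj for dj, qj in zip(d, q))
    return dot / (math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(x * x for x in q)))

D1, D2, Q = [2, 3, 5], [3, 7, 1], [0, 0, 2]
print(round(cos_sim(D1, Q), 2))   # 0.81
print(round(cos_sim(D2, Q), 2))   # 0.13
```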

Document and Term Weights
Document term weights are calculated using frequencies in documents (tf) and in the collection (idf):
- tf_ij = frequency of term j in document i
- df_j = document frequency of term j = number of documents containing term j
- idf_j = inverse document frequency of term j = log2(N / df_j), where N is the number of documents in the collection
Inverse document frequency is an indication of a term's value as a document discriminator.

Term Weight Calculations
Weight of the jth term in the ith document:
d_ij = tf_ij * idf_j = tf_ij * log2(N / df_j)
TF (term frequency):
- A term that occurs frequently in the document but rarely in the rest of the collection gets a high weight.
- Let max_l{tf_il} be the frequency of the most frequent term in document i.
- Normalized term frequency: tf_ij / max_l{tf_il}

An Example of TF
Document = (A Computer Science Student Uses Computers)
Vector model based on the keywords (Computer, Engineering, Student):
- tf(Computer) = 2, tf(Engineering) = 0, tf(Student) = 1
- max(tf) = 2
TF weights: Computer = 2/2 = 1, Engineering = 0/2 = 0, Student = 1/2 = 0.5
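The computation on this slide, as a short Python sketch; the tokenized document assumes lowercasing and that "Computers" has been stemmed to "computer":

```python
def normalized_tf(doc_tokens, keywords):
    """Max-normalized term frequency for a fixed keyword vocabulary."""
    counts = {k: doc_tokens.count(k) for k in keywords}
    max_tf = max(counts.values()) or 1           # avoid division by zero
    return {k: c / max_tf for k, c in counts.items()}

# "A Computer Science Student Uses Computers", lowercased and stemmed.
doc = ["a", "computer", "science", "student", "uses", "computer"]
print(normalized_tf(doc, ["computer", "engineering", "student"]))
# {'computer': 1.0, 'engineering': 0.0, 'student': 0.5}
```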

Inverse Document Frequency
- df_j gives the number of documents (out of N) in which term j appears.
- IDF is the inverse of DF; typically idf_j = log2(N / df_j) is used.
- Example: given 1000 documents, if "computer" appears in 200 of them, IDF = log2(1000 / 200) = log2(5) ≈ 2.32.

TF-IDF
d_ij = (tf_ij / max_l{tf_il}) * idf_j = (tf_ij / max_l{tf_il}) * log2(N / df_j)
- Can be used to obtain non-binary weights.
- Used to tremendous success in the SMART Information Retrieval System by the late Gerard Salton and M. J. McGill, Cornell University (1983).
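A compact Python sketch of this weighting scheme, assuming the collection is a list of token lists; the toy documents are invented for illustration:

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {term: weight} dict per document, using max-normalized tf
    and log2(N / df) as idf."""
    N = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))       # count each term once per document
    weights = []
    for tokens in docs:
        tf = Counter(tokens)
        max_tf = max(tf.values())
        weights.append({t: (c / max_tf) * math.log2(N / df[t]) for t, c in tf.items()})
    return weights

docs = [["computer", "science", "computer"],
        ["database", "science"],
        ["computer", "database"]]
for w in tfidf_weights(docs):
    print(w)
```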

Implementation Based on Inverted Files
In practice, document vectors are not stored directly; an inverted organization provides much better access speed. The index file can be implemented as a hash file, a sorted list, or a B-tree.

Index term   df   Postings (D_j, tf_j)
computer     3    D7, 4
database     2    D1, 3
science      4    D2, 4
system       1    D5, 2
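A minimal in-memory version of such an inverted file, assuming a simple dict-based index rather than a hash file or B-tree on disk:

```python
from collections import Counter, defaultdict

def build_inverted_index(docs):
    """Map each term to its postings list of (doc_id, tf) pairs."""
    index = defaultdict(list)
    for doc_id, tokens in docs.items():
        for term, tf in Counter(tokens).items():
            index[term].append((doc_id, tf))
    return index

# Toy collection (hypothetical contents).
docs = {
    "D1": ["database", "system", "database"],
    "D2": ["computer", "science"],
    "D7": ["computer", "architecture"],
}
index = build_inverted_index(docs)
print(index["computer"])        # [('D2', 1), ('D7', 1)]
print(len(index["computer"]))   # df of "computer" = 2
```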

Latent Semantic Indexing (1)
Basic idea:
- The term frequency matrix is very large.
- Use singular value decomposition (SVD) to reduce the size of the frequency matrix.
- Retain only the K most significant singular values and the corresponding rows/columns.
Method:
- Create a term x document weighted frequency matrix A.
- SVD construction: A = U * S * V^T.
- Choose K and obtain U_k, S_k, and V_k.
- Create a query vector q.
- Project q into the reduced term-document space: Dq = q * U_k * S_k^{-1}.
- Calculate similarities: cos α = (Dq · D) / (|Dq| * |D|).
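A rough numpy sketch of this recipe; the tiny matrix and query are invented, and a real system would use a sparse, weighted term-document matrix:

```python
import numpy as np

# Toy term x document matrix A (rows = terms, columns = documents).
A = np.array([[1., 0., 1.],
              [0., 1., 1.],
              [1., 1., 0.],
              [0., 0., 1.]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U * S * V^T
K = 2
Uk, Sk, Vtk = U[:, :K], np.diag(s[:K]), Vt[:K, :]

docs_k = Vtk.T                                     # documents in the K-dim latent space
q = np.array([1., 0., 0., 1.])                     # query vector over the 4 terms
q_k = q @ Uk @ np.linalg.inv(Sk)                   # project query: q * U_k * S_k^{-1}

# Cosine similarity between the projected query and each document.
sims = docs_k @ q_k / (np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_k))
print(sims)
```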

Latent Semantic Indexing (2)
[Figure: a worked example showing a weighted term-document frequency matrix, with the query terms "insulation" and "joint".]

Probabilistic Model
- Basic assumption: given a user query, there is a set of documents that contains exactly the relevant documents and no others (the ideal answer set).
- Querying is viewed as the process of specifying the properties of this ideal answer set. Since these properties are not known at query time, an initial guess is made.
- The initial guess allows the generation of a preliminary probabilistic description of the ideal answer set, which is used to retrieve a first set of documents.
- An interaction with the user is then initiated with the purpose of improving the probabilistic description of the answer set.

Reference
Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval, Addison Wesley (reprint edition by China Machine Press, 2004).