Introduction to Information Retrieval (Supplementary Material)
Zhou Shuigeng
March 23, 2007
Advanced Distributed Computing
Text Databases and IR
Text databases (document databases) are large collections of documents from various sources: news articles, research papers, books, digital libraries, e-mail messages, Web pages, library databases, etc. The data stored are usually semi-structured. Information retrieval (IR) is a field that developed in parallel with database systems; information is organized into (a large number of) documents. The information retrieval problem: locating relevant documents based on user input, such as keywords or example documents.
Information Retrieval
Typical IR systems: online library catalogs, online document management systems. Information retrieval vs. database systems: some DB problems are not present in IR, e.g., updates, transaction management, complex objects; some IR problems are not addressed well in a DBMS, e.g., unstructured documents, approximate search using keywords and relevance.
Basic Measures for Text Retrieval
Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., correct responses):
  precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved:
  recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
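The two measures above can be sketched in a few lines of Python; this is a minimal illustration (the function name and the example document-id sets are made up for the example):

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) from sets of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant AND retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# 4 docs retrieved, 2 of them relevant; 6 relevant docs exist in total
p, r = precision_recall(retrieved={1, 2, 3, 4}, relevant={2, 4, 5, 6, 7, 8})
print(p)  # 0.5
print(r)  # 2/6 ≈ 0.333
```

Note the trade-off made explicit here: retrieving more documents can only raise recall, while precision typically falls.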
Precision vs. Recall (1)
[Venn diagram over all documents: the Relevant set and the Retrieved set overlap in the region "retrieved & relevant"; the remaining regions are "retrieved & irrelevant", "not retrieved but relevant", and "not retrieved & irrelevant".]
  precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
  recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
Recall vs. Precision
[Plot: precision (0 to 1) against recall (0 to 1). The ideal system sits at the top-right corner (precision = 1, recall = 1). A system that returns only clearly relevant documents but misses many useful ones has high precision and low recall; a system that returns most of the relevant documents but includes much junk has high recall and low precision.]
IR Techniques (1)
Basic concepts: a document can be described by a set of representative keywords called index terms. Different index terms have varying relevance when used to describe document contents. This effect is captured by assigning a numerical weight to each index term of a document (e.g., frequency, tf-idf).
DBMS analogy: index terms correspond to attributes; weights correspond to attribute values.
IR Techniques (2)
Index term (attribute) selection: stop lists, word stemming, index-term weighting methods, term-document frequency matrices.
Information retrieval models: Boolean model, vector model, probabilistic model.
Stop Words
From a given stop word list [a, about, again, are, the, to, of, ...], remove the listed words from the documents. Alternatively, determine stop words automatically: given a large enough corpus of common English, sort the words in decreasing order of their occurrence frequency in the corpus. Zipf's law: frequency × rank ≈ constant. The most frequent words tend to be short, and the most frequent 20% of words account for about 60% of usage.
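The list-based approach can be sketched as follows; the stop word set is a small made-up sample, not a complete list, and the tokenizer is deliberately naive:

```python
from collections import Counter

STOP_WORDS = {"a", "about", "again", "are", "the", "to", "of", "and", "in", "is"}

def index_terms(text):
    """Tokenize, lowercase, strip trailing punctuation, drop stop words."""
    tokens = [w.strip(".,") for w in text.lower().split()]
    return [w for w in tokens if w and w not in STOP_WORDS]

doc = "The index of the index is the size of the index."
print(index_terms(doc))                           # ['index', 'index', 'size', 'index']
print(Counter(index_terms(doc)).most_common(1))   # [('index', 3)]
```

Sorting the full vocabulary with `Counter.most_common()` gives exactly the frequency-ranked list that the corpus-based (Zipf) approach starts from.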
Zipf's Law: An Illustration
  Rank (R)  Term  Frequency (F)  R*F (10^6)
  1         the   69,971         0.070
  2         of    36,411         0.073
  3         and   28,852         0.086
  4         to    26,149         0.104
  5         a     23,237         0.116
  6         in    21,341         0.128
  7         that  10,595         0.074
  8         is    10,009         0.081
  9         was    9,816         0.088
  10        he     9,543         0.095
Resolving Power of Words
[Plot: words arranged in decreasing frequency order. High-frequency terms and low-frequency terms are non-significant; the presumed resolving power of significant words peaks at the mid-frequency range.]
Simple Indexing Scheme Based on Zipf's Law
Use term frequency information only:
- Compute the frequency of term k in document i, Freq_ik
- Determine the total collection frequency: TotalFreq_k = Σ_i Freq_ik, for i = 1, 2, ..., n
- Arrange terms in order of collection frequency
- Set thresholds: eliminate high- and low-frequency terms
- Use the remaining terms as index terms
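The steps above can be sketched as a short function; the threshold values and the toy collection are made up for the example:

```python
from collections import Counter

def select_index_terms(docs, low, high):
    """Sum term frequencies across the whole collection, then keep only
    terms whose total frequency lies within the [low, high] band."""
    total = Counter()
    for doc in docs:
        total.update(doc.lower().split())
    return {term for term, freq in total.items() if low <= freq <= high}

docs = ["the cat sat", "the dog sat", "the cat ran"]
# 'the' occurs 3 times (too common); 'dog' and 'ran' occur once (too rare)
print(select_index_terms(docs, low=2, high=2))  # {'cat', 'sat'}
```

In practice the thresholds would be tuned to the collection rather than hard-coded.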
Stemming
Stemming transforms words to a root form: computing, computer, computation → comput. Suffix-based methods: remove "ability" from "computability"; remove suffixes such as +ness, +ive. Implemented with a suffix list plus context rules.
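A toy suffix-list stemmer in this spirit might look like the following; the suffix list and minimum-stem rule are invented for illustration, and a real system would use something like the Porter algorithm with proper context rules:

```python
SUFFIXES = ["ability", "ation", "ness", "ive", "ing", "er"]

def stem(word):
    """Strip the first matching suffix (longest first), keeping at least
    a 3-character stem. A stand-in for a real suffix-plus-rules stemmer."""
    for suf in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

print(stem("computability"))  # comput
print(stem("computing"))      # comput
print(stem("computer"))       # comput
```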
Thesaurus Rules
A thesaurus aims at a classification of the words in a language: for a word, it gives related terms that are broader than, narrower than, the same as (synonyms), and opposed to (antonyms) the given word (other kinds of relationships may exist, e.g., composed-of). Static thesauri: tables such as [anneal, strain], [antenna, receiver], ...; Roget's thesaurus; WordNet at Princeton.
Thesaurus Rules Can Also Be Learned
From a search engine query log: after typing queries, users browse. If query1 and query2 lead to the same document, then Similar(query1, query2). If query1 leads to a document with title keyword K, then Similar(query1, K). Similarity then extends by transitivity.
Indexing Techniques
Inverted index: maintains two hash- or B+-tree-indexed tables:
- document_table: a set of document records <doc_id, postings_list>
- term_table: a set of term records <term, postings_list>
Answering a query: find all docs associated with one term or a set of terms. Pros: easy to implement. Cons: does not handle synonymy and polysemy well, and posting lists can become very long (storage can be very large).
Signature file: associates a signature with each document. A signature is a representation of an ordered list of terms that describe the document; the order is obtained by frequency analysis, stemming, and stop lists.
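A minimal in-memory version of the term_table side of the inverted index can be sketched like this (the doc ids and texts are made up; a real system stores postings on disk):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "database system design", 2: "computer architecture", 3: "database mining"}
index = build_inverted_index(docs)
print(index["database"])  # [1, 3]
# multi-term query: intersect the posting lists
print(sorted(set(index["database"]) & set(index["mining"])))  # [3]
```

The posting-list intersection is exactly the "find all docs associated with a set of terms" step from the slide.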
Boolean Model
Index terms are considered either present or absent in a document, so the index term weights are assumed to be all binary. A query is composed of index terms linked by three connectives: not, and, and or (e.g., car and repair, plane or airplane). The Boolean model predicts that each document is either relevant or non-relevant based on the match of the document to the query.
Boolean Model: Keyword-Based Retrieval
A document is represented by a string, which can be identified by a set of keywords. Queries may use expressions of keywords, e.g., car and repair shop, tea or coffee, DBMS but not Oracle. Queries and retrieval should consider synonyms, e.g., repair and maintenance. Major difficulties of the model:
- Synonymy: a keyword T may not appear anywhere in a document even though the document is closely related to T, e.g., data mining
- Polysemy: the same keyword may mean different things in different contexts, e.g., mining
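Given posting lists, the and/or connectives reduce to set intersection and union; a minimal sketch with a made-up index:

```python
def boolean_and(index, terms):
    """AND-query: documents containing every term."""
    postings = [set(index.get(t, ())) for t in terms]
    return set.intersection(*postings) if postings else set()

def boolean_or(index, terms):
    """OR-query: documents containing at least one of the terms."""
    return set().union(*(set(index.get(t, ())) for t in terms))

index = {"car": {1, 2}, "repair": {2, 3}, "plane": {4}, "airplane": {5}}
print(boolean_and(index, ["car", "repair"]))     # {2}
print(boolean_or(index, ["plane", "airplane"]))  # {4, 5}
```

"but not" would similarly be set difference against the postings of the excluded term. Note that every returned document counts as equally relevant; the model provides no ranking.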
The Vector-Space Model
The distinct terms of the collection are available; call them index terms or the vocabulary. The index terms represent the important terms for an application. A document is represented by a vector over the index terms, <T1, T2, T3, T4, T5>, or by its weights <W(T1), W(T2), W(T3), W(T4), W(T5)>. Example: in a computer science collection, the vocabulary might be T1 = architecture, T2 = bus, T3 = computer, T4 = database, T5 = xml.
The Vector-Space Model
Assumption: words are uncorrelated. Given:
1. N documents and a query (the query is considered a document too)
2. Each represented by t terms
3. Each term j in document i has weight d_ij
4. How to compute the weights is dealt with later

        T_1   T_2   ...  T_t
  D_1   d_11  d_12  ...  d_1t
  D_2   d_21  d_22  ...  d_2t
   :     :     :          :
  D_n   d_n1  d_n2  ...  d_nt
  Q     q_1   q_2   ...  q_t
Graphic Representation
Example (t = 3):
  D_1 = 2T_1 + 3T_2 + 5T_3
  D_2 = 3T_1 + 7T_2 + T_3
  Q   = 0T_1 + 0T_2 + 2T_3
[3-D plot of D_1, D_2, and Q along axes T_1, T_2, T_3.]
Is D_1 or D_2 more similar to Q? How should the degree of similarity be measured: distance, angle, projection?
Similarity Measure: Inner Product
Similarity between document D_i and query Q can be computed as the inner (dot) product:
  sim(D_i, Q) = D_i · Q = Σ_{j=1}^{t} d_ij * q_j
Binary weights: weight = 1 if the word is present, 0 otherwise. Non-binary weights: the weight represents a degree of similarity (e.g., TF-IDF, explained later).
Inner Product: Examples
Binary (size of vector = size of vocabulary = 7; vocabulary: information, retrieval, database, architecture, computer, text, management):
  D = (1, 1, 1, 0, 1, 1, 0)
  Q = (1, 0, 1, 0, 0, 1, 1)
  sim(D, Q) = 3
Weighted:
  D_1 = 2T_1 + 3T_2 + 5T_3, Q = 0T_1 + 0T_2 + 2T_3
  sim(D_1, Q) = 2*0 + 3*0 + 5*2 = 10
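Both examples reduce to the same one-line computation:

```python
def inner_product(d, q):
    """Dot product of two equal-length weight vectors."""
    return sum(di * qi for di, qi in zip(d, q))

# binary example: 7-term vocabulary
D = [1, 1, 1, 0, 1, 1, 0]
Q = [1, 0, 1, 0, 0, 1, 1]
print(inner_product(D, Q))  # 3

# weighted example: D_1 = 2T_1 + 3T_2 + 5T_3, Q = 0T_1 + 0T_2 + 2T_3
print(inner_product([2, 3, 5], [0, 0, 2]))  # 10
```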
Properties of the Inner Product
The inner product similarity is unbounded. It favors long documents: a long document has a large number of unique terms, each of which may occur many times. It measures how many terms matched, but not how many terms did not match.
Cosine Similarity Measure
Cosine similarity measures the cosine of the angle between two vectors; it is the inner product normalized by the vector lengths:
  CosSim(D_i, Q) = (Σ_{k=1}^{t} d_ik q_k) / (sqrt(Σ_{k=1}^{t} d_ik^2) * sqrt(Σ_{k=1}^{t} q_k^2))
[Plot: vectors D_1, D_2, and Q in term space; θ_1 and θ_2 are the angles between Q and D_1, D_2.]
Cosine Similarity: An Example
  D_1 = 2T_1 + 3T_2 + 5T_3,  CosSim(D_1, Q) = 10 / (sqrt(38) * 2) = 5 / sqrt(38) ≈ 0.81
  D_2 = 3T_1 + 7T_2 + T_3,   CosSim(D_2, Q) = 2 / (sqrt(59) * 2) = 1 / sqrt(59) ≈ 0.13
  Q = 0T_1 + 0T_2 + 2T_3
D_1 is about 6 times better than D_2 using cosine similarity, but only 5 times better using the inner product.
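The slide's numbers can be reproduced directly from the formula:

```python
import math

def cos_sim(d, q):
    """Inner product normalized by the two vector lengths."""
    dot = sum(di * qi for di, qi in zip(d, q))
    norm = math.sqrt(sum(di * di for di in d)) * math.sqrt(sum(qi * qi for qi in q))
    return dot / norm if norm else 0.0

Q = [0, 0, 2]
print(round(cos_sim([2, 3, 5], Q), 2))  # 0.81  (D_1)
print(round(cos_sim([3, 7, 1], Q), 2))  # 0.13  (D_2)
```

Unlike the raw inner product, the result is bounded (in [0, 1] for non-negative weights), so long documents are no longer automatically favored.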
Document and Term Weights
Document term weights are calculated using frequencies in documents (tf) and in the collection (idf):
  tf_ij = frequency of term j in document i
  df_j = document frequency of term j = number of documents containing term j
  idf_j = inverse document frequency of term j = log2(N / df_j), where N is the number of documents in the collection
The inverse document frequency is an indication of a term's value as a document discriminator.
Term Weight Calculations
Weight of the jth term in the ith document:
  d_ij = tf_ij * idf_j = tf_ij * log2(N / df_j)
TF (term frequency): a term that occurs frequently in the document but rarely in the rest of the collection receives a high weight. Let max_l{tf_il} be the term frequency of the most frequent term in document i. Normalized term frequency: tf_ij / max_l{tf_il}.
An Example of TF
Document = (A Computer Science Student Uses Computers). Vector model based on keywords (Computer, Engineering, Student):
  tf(Computer) = 2, tf(Engineering) = 0, tf(Student) = 1, max(tf) = 2
TF weights: Computer = 2/2 = 1, Engineering = 0/2 = 0, Student = 1/2 = 0.5
Inverse Document Frequency
df_j gives the number of documents, out of the N in the collection, in which term j appears. Roughly, IDF = 1/DF; in practice, idf_j = log2(N / df_j) is typically used. Example: given 1000 documents, if "computer" appears in 200 of them, IDF = log2(1000/200) = log2(5).
TF-IDF
  d_ij = (tf_ij / max_l{tf_il}) * idf_j = (tf_ij / max_l{tf_il}) * log2(N / df_j)
This yields non-binary weights. TF-IDF was used, to tremendous success, in the SMART Information Retrieval System by the late Gerard Salton and M. J. McGill at Cornell University (1983).
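Putting the normalized TF and the IDF together, a minimal sketch (the document-frequency counts and collection size are hypothetical, echoing the earlier "computer in 200 of 1000 documents" example):

```python
import math

def tfidf_weights(doc_tfs, df, n_docs):
    """d_ij = (tf_ij / max_tf_i) * log2(N / df_j) for one document.

    doc_tfs: term -> raw frequency in this document
    df:      term -> number of documents containing the term
    """
    max_tf = max(doc_tfs.values())
    return {t: (tf / max_tf) * math.log2(n_docs / df[t])
            for t, tf in doc_tfs.items()}

# hypothetical counts: 1000 docs; 'computer' in 200 of them, 'xml' in 50
df = {"computer": 200, "xml": 50}
w = tfidf_weights({"computer": 2, "xml": 1}, df, n_docs=1000)
print(round(w["computer"], 3))  # 1.0 * log2(5)  ≈ 2.322
print(round(w["xml"], 3))       # 0.5 * log2(20) ≈ 2.161
```

Note how the rarer term ("xml") nearly catches up with the twice-as-frequent term, which is precisely the discriminator effect idf is meant to provide.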
Implementation Based on Inverted Files
In practice, document vectors are not stored directly; an inverted organization provides much better access speed. The index file can be implemented as a hash file, a sorted list, or a B-tree.
  Index term  df  postings <D_j, tf_j>
  computer    3   D_7, 4
  database    2   D_1, 3
  science     4   D_2, 4
  system      1   D_5, 2
Latent Semantic Indexing (1)
Basic idea: the term-document frequency matrix is very large; use a singular value decomposition (SVD) technique to reduce its size, retaining only the K most significant dimensions.
Method:
- Create a term-by-document weighted frequency matrix A
- SVD construction: A = U * S * V^T
- Choose K and obtain U_k, S_k, and V_k
- Create the query vector q
- Project q into the reduced term-document space: Dq = q * U_k * S_k^-1
- Calculate similarities: cos α = (Dq · D) / (|Dq| * |D|)
Latent Semantic Indexing (2)
[Figure: a weighted term-document frequency matrix, with the query terms "insulation" and "joint" highlighted.]
Probabilistic Model
Basic assumption: given a user query, there is a set of documents that contains exactly the relevant documents and no others (the ideal answer set). Querying is then a process of specifying the properties of this ideal answer set. Since these properties are not known at query time, an initial guess is made. This initial guess allows the generation of a preliminary probabilistic description of the ideal answer set, which is used to retrieve a first set of documents. An interaction with the user is then initiated to improve the probabilistic description of the answer set.
Reference
Ricardo Baeza-Yates, Berthier Ribeiro-Neto, Modern Information Retrieval, Addison Wesley (photocopy reprint by China Machine Press, 2004).