modern database systems, lecture 4: information retrieval
Aristides Gionis, Michael Mathioudakis
spring 2016
in perspective
- structured data: relational data, RDBMS (e.g., MySQL)
- semi-structured data: data-graph representation, XML, JSON (e.g., MongoDB)
- unstructured data: flat text collections, information retrieval, web search (e.g., Lucene)
related Aalto course (this semester)
sources Introduction to Information Retrieval Christopher Manning, Prabhakar Raghavan, and Hinrich Schütze book and slides available online
sources
Modern Information Retrieval, Ricardo Baeza-Yates and Berthier Ribeiro-Neto; slides available online (material from those slides is used in this lecture)
text mining: typical problems
- keyword search in a collection of text documents (information retrieval; this lecture)
- text classification and text clustering
- finding near-duplicate or similar documents
- document summarization
- information extraction
- web data extraction
- web search
- query-log analysis
- mining social media and microblogs
- sentiment analysis
the IR problem
users have information needs of varying complexity. users typically translate an information need into a query: a set of keywords, or index terms, which summarizes the information need. given the user query, the key goal of an IR system is to retrieve all the items that are relevant to the query, while retrieving as few non-relevant items as possible. the notion of relevance is of central importance in IR
two aspects of IR relevance : modeling + evaluation performance : indexing and searching + parallel and distributed IR
modeling
IR models
modeling in IR is a process aimed at producing a ranking function: a function that assigns scores to documents with regard to a given query. the process consists of two main tasks:
- the conception of a logical framework for representing documents and queries
- the definition of a ranking function that quantifies the similarity between documents and queries
depiction of an IR model
[figure: documents and the user's information need are both mapped to index-term representations; matching query terms against document terms produces a ranking]
IR models
an IR model is a quadruple [D, Q, F, R(q_i, d_j)] where
1. D is a set of logical views for the documents in the collection
2. Q is a set of logical views for the user queries
3. F is a framework for modeling documents and queries
4. R(q_i, d_j) is a ranking function
basic concepts
each document is represented by a set of representative keywords or index terms. an index term is a word or group of consecutive words in a document. a pre-selected set of index terms can be used to summarize the document contents; we usually assume that all words are index terms (full-text representation).
vocabulary V = {k_1, ..., k_t}: the set of all distinct index terms in the collection. the occurrence of a term k_i in a document d_j establishes a relation between k_i and d_j.
term-document matrix: the term-document relation in matrix form,

        d_1    d_2
  k_1   f_1,1  f_1,2
  k_2   f_2,1  f_2,2
  k_3   f_3,1  f_3,2

where each f_i,j represents the frequency of term k_i in document d_j
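as a minimal sketch (not from the lecture; names are illustrative), the term-document matrix can be built as a nested dictionary, since most entries are zero:

```python
from collections import Counter

def term_document_matrix(docs):
    """docs: list of token lists. Returns {term: {doc_index: frequency}},
    storing only the nonzero f_i,j entries."""
    matrix = {}
    for j, tokens in enumerate(docs):
        for term, freq in Counter(tokens).items():
            matrix.setdefault(term, {})[j] = freq
    return matrix

docs = [["to", "do", "is", "to", "be"],
        ["to", "be", "or", "not", "to", "be"]]
m = term_document_matrix(docs)
# m["to"] == {0: 2, 1: 2}
```

the sparse representation foreshadows the inverted index discussed later: each term maps to the documents it occurs in.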
from full text to a set of index terms
- structure recognition (e.g., remove HTML formatting)
- correct for accents, spacing, etc.
- remove stop words
- detect noun groups
- stemming
- filter for known vocabulary (if needed)
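a toy version of this pipeline, assuming a hypothetical stop-word list and a deliberately crude stemmer (a real system would use e.g. the Porter stemmer):

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and"}  # tiny illustrative list

def suffix_stem(word):
    # crude suffix stripping, for illustration only
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def to_index_terms(text):
    text = re.sub(r"<[^>]+>", " ", text)           # structure recognition: drop HTML tags
    tokens = re.findall(r"[a-z]+", text.lower())   # normalize case, strip punctuation
    return [suffix_stem(t) for t in tokens if t not in STOP_WORDS]

# to_index_terms("The <b>databases</b> are indexing!") -> ['database', 'are', 'index']
```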
boolean model
simple model based on set theory and boolean algebra. queries are specified as boolean expressions, with quite intuitive and precise semantics; a neat formalism.
example query: q = k_a ∧ (k_b ∨ ¬k_c)
term-document frequencies in the term-document matrix are all binary
boolean model a query q can be rewritten as a disjunction of conjunctive components such a query is in disjunctive normal form (DNF) a boolean ranking function is binary a document satisfies a query if it satisfies at least one of its conjunctive components that is, the Boolean model predicts that each document is either relevant or non-relevant
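a minimal sketch of boolean retrieval over DNF queries (not from the lecture; the query encoding is an assumption of this sketch):

```python
def eval_dnf(query_dnf, index, all_docs):
    """query_dnf: list of conjunctive components, each a list of
    (term, negated) pairs. index: term -> set of doc ids.
    A doc matches if it satisfies at least one component."""
    result = set()
    for component in query_dnf:
        docs = set(all_docs)
        for term, negated in component:
            postings = index.get(term, set())
            docs &= (all_docs - postings) if negated else postings
        result |= docs
    return result

index = {"ka": {1, 2}, "kb": {2, 3}, "kc": {3}}
all_docs = {1, 2, 3, 4}
# ka AND (kb OR NOT kc) rewritten in DNF: (ka AND kb) OR (ka AND NOT kc)
q = [[("ka", False), ("kb", False)], [("ka", False), ("kc", True)]]
# eval_dnf(q, index, all_docs) -> {1, 2}
```

note the output is an unranked set: each document either satisfies the expression or it does not, which is exactly the binary behavior criticized next.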
drawbacks of the boolean model
- retrieval is based on a binary decision criterion, with no notion of partial matching
- no ranking of the documents is provided (no grading scale)
- the information need has to be translated into a boolean expression, which most users find awkward
- the boolean queries formulated by users are most often too simplistic
- as a result, the model frequently returns either too few or too many documents in response to a user query
term weighting
the terms of a document are not equally useful for describing the document contents; a word that appears in all documents of a collection is completely useless for retrieval tasks. to characterize term importance, we associate a weight w_i,j with each term k_i and document d_j:
- if k_i appears in document d_j, then w_i,j > 0
- if k_i does not appear in document d_j, then w_i,j = 0
the weight w_i,j quantifies the importance of the index term k_i for describing the contents of document d_j
tf.idf term weighting
the tf.idf term weighting scheme has two components.
term frequency (tf): the value of w_i,j is proportional to the term frequency f_i,j. the simplest choice is tf_i,j = f_i,j; a popular variant dampens the raw frequency:

  tf_i,j = 1 + log f_i,j  if f_i,j > 0,  and 0 otherwise

inverse document frequency (idf): let N be the number of documents in the collection and n_i the number of documents that contain term k_i. then the inverse document frequency of term k_i is

  idf_i = log (N / n_i)
tf.idf term weighting
the best known term weighting schemes use weights that combine idf factors with term frequencies, for instance

  w_i,j = (1 + log f_i,j) · log (N / n_i)  if f_i,j > 0,  and 0 otherwise

and many variants
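the weight formula above, as a sketch; the slides leave the log base open, so base 2 (common in IR texts) is an assumption here:

```python
import math

def tfidf_weight(f_ij, N, n_i):
    """w_ij = (1 + log2 f_ij) * log2(N / n_i); 0 if the term is absent.
    f_ij: frequency of term i in doc j; N: collection size;
    n_i: number of docs containing term i."""
    if f_ij == 0:
        return 0.0
    return (1.0 + math.log2(f_ij)) * math.log2(N / n_i)

# term occurring twice in one of 4 docs:
# tfidf_weight(2, 4, 1) == (1 + 1) * 2 == 4.0
# term occurring in every doc (n_i == N) gets weight 0, as desired
```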
document normalization
document sizes might vary widely. this is a problem because longer documents are more likely to be retrieved by a given query, so we resort to normalization: documents are represented as vectors of weighted terms and normalized by vector length

  d_j = (w_1,j, w_2,j, ..., w_t,j)
  |d_j| = sqrt( Σ_{i=1..t} w_i,j² )
vector model
boolean matching with binary weights is too limiting; we need a framework in which partial matching is possible. this is accomplished by representing queries and documents as vectors

  d_j = (w_1,j, w_2,j, ..., w_t,j)
  q = (w_1,q, w_2,q, ..., w_t,q)

the similarity between a document d_j and a query q is the cosine of the angle between the two vectors:

  sim(d_j, q) = cos(θ) = (d_j · q) / (|d_j| |q|)
              = Σ_{i=1..t} w_i,j w_i,q / ( sqrt(Σ_{i=1..t} w_i,j²) · sqrt(Σ_{i=1..t} w_i,q²) )
vector model
advantages:
- term weighting improves the quality of the answer set
- partial matching allows retrieval of documents that approximate the query conditions
- the cosine ranking formula sorts documents according to their degree of similarity to the query
- document length normalization is naturally built into the ranking
disadvantages:
- it assumes independence of index terms
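the cosine formula, sketched over sparse vectors (dicts of term weights; names are illustrative):

```python
import math

def cosine_sim(d, q):
    """d, q: dicts mapping term -> weight. Returns cos(theta);
    0.0 if either vector is empty."""
    dot = sum(w * q.get(t, 0.0) for t, w in d.items())
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)

# cosine_sim({"a": 1.0, "b": 1.0}, {"a": 1.0}) -> 1/sqrt(2), about 0.707:
# the doc matches the query partially and gets a graded score
```

the score is in [0, 1] for non-negative weights, so documents can be ranked by decreasing similarity rather than accepted or rejected outright.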
many other IR models probabilistic model set-based model extended boolean model fuzzy set model generalized vector model latent semantic indexing (LSI) (= PCA for text) quantum IR model see a proper IR course
evaluation
evaluation of an IR system
to evaluate an IR system is to measure how well the system meets the information needs of its users. this is troublesome, given that the same result set might be interpreted differently by different users. instead, objective metrics are defined that, on average, correlate with the preferences of a group of users.
evaluation is a critical and integral component of any modern IR system. systematic evaluation of the IR system allows answering questions such as:
- a modification to the ranking function is proposed; should we go ahead and launch it?
- for which types of queries, such as business, product, and geographic queries, does a given ranking modification work best?
reference collections
reference collections constitute the most used evaluation method in IR. a reference collection is composed of:
- a set D of pre-selected documents
- a set I of information need descriptions (queries)
- a set of relevance judgements, one for each pair [i_m, d_j], with i_m ∈ I and d_j ∈ D
the relevance judgement has value 0 if document d_j is non-relevant to i_m, and value 1 otherwise. judgements are produced by human specialists
evaluation metrics: precision and recall
precision: the fraction of the retrieved documents (set A) that is relevant
recall: the fraction of the relevant documents (set R) that has been retrieved

  Precision = |R ∩ A| / |A|
  Recall = |R ∩ A| / |R|
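a direct translation of the two definitions (a sketch; names are illustrative):

```python
def precision_recall(retrieved, relevant):
    """retrieved: the answer set A; relevant: the judged-relevant set R.
    Returns (precision, recall), with 0.0 for empty denominators."""
    A, R = set(retrieved), set(relevant)
    hits = len(A & R)
    precision = hits / len(A) if A else 0.0
    recall = hits / len(R) if R else 0.0
    return precision, recall

# 4 docs retrieved, 2 of them relevant, out of 3 relevant overall:
# precision_recall([1, 2, 3, 4], [2, 4, 5]) -> (0.5, 0.666...)
```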
precision vs. recall typical behavior of precision vs. recall
other metrics
P@5 and P@10: in the case of web search engines, the majority of searches does not require high recall. precision at 5 (P@5) and at 10 (P@10) measure the precision over the top 5 or 10 documents, and assess whether users are getting relevant documents at the top of the ranking or not.
mean average precision (MAP): the average precision obtained after each new relevant document is observed, averaged over many queries.
mean reciprocal rank (MRR): the mean reciprocal rank of the first relevant result (averaged over many queries); a good metric for those cases in which we are interested in the first correct answer.
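per-query versions of these metrics, as a sketch (MAP and MRR are then the means over all queries in the test set):

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked docs that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def average_precision(ranked, relevant):
    """Average of P@k over the ranks k at which a relevant doc appears."""
    hits, total = 0, 0.0
    for k, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0

def reciprocal_rank(ranked, relevant):
    """1/rank of the first relevant doc; 0.0 if none is retrieved."""
    for k, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / k
    return 0.0

ranked, relevant = [3, 1, 4, 2], {1, 2}
# relevant docs sit at ranks 2 and 4:
# precision_at_k(ranked, relevant, 2) -> 0.5
# average_precision(ranked, relevant) -> (1/2 + 2/4) / 2 = 0.5
# reciprocal_rank(ranked, relevant) -> 0.5
```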
evaluation recap
many reference collections are available in IR: TREC, INEX, CLEF, Reuters, ...
user-based evaluation: large companies have their own panels; A/B testing; crowdsourcing (e.g., Amazon Mechanical Turk)
two aspects of IR relevance : modeling + evaluation performance : indexing and searching + parallel and distributed IR
indexing and searching
efficiency in IR systems
efficiency in IR systems: the ability to process user queries with minimal requirements of computational resources. as we move to larger-scale applications, efficiency becomes more and more important. for example, web search engines index terabytes of data and serve hundreds or thousands of queries per second
indexing
index: a data structure built from the text to speed up searches. in the context of an IR system that uses an index, the efficiency of the system can be measured by:
- indexing time: time needed to build the index
- indexing space: space used during the generation of the index
- index storage: space required to store the index
- query latency: time interval between the arrival of a query and the generation of its answer
- query throughput: average number of queries processed per second
inverted index inverted index : a word-oriented mechanism for indexing a text collection to speed up the searching task the inverted index structure is composed of two elements : the vocabulary and the occurrences the vocabulary is the set of all different words in the text for each word in the vocabulary the index stores the documents which contain that word (inverted list)
example
a collection and the term-document matrix

  d_1: To do is to be. To be is to do.
  d_2: To be or not to be. I am what I am.
  d_3: I think therefore I am. Do be do be do.
  d_4: Do do do, da da da. Let it be, let it be.

  Vocabulary  n_i   d_1  d_2  d_3  d_4
  to           2     4    2    -    -
  do           3     2    -    3    3
  is           1     2    -    -    -
  be           4     2    2    2    2
  or           1     -    1    -    -
  not          1     -    1    -    -
  I            2     -    2    2    -
  am           2     -    2    1    -
  what         1     -    1    -    -
  think        1     -    -    1    -
  therefore    1     -    -    1    -
  da           1     -    -    -    3
  let          1     -    -    -    2
  it           1     -    -    -    2
example
the inverted index for the same collection (each occurrence is a pair [document, frequency])

  d_1: To do is to be. To be is to do.
  d_2: To be or not to be. I am what I am.
  d_3: I think therefore I am. Do be do be do.
  d_4: Do do do, da da da. Let it be, let it be.

  Vocabulary  n_i   occurrences as inverted lists
  to           2    [1,4], [2,2]
  do           3    [1,2], [3,3], [4,3]
  is           1    [1,2]
  be           4    [1,2], [2,2], [3,2], [4,2]
  or           1    [2,1]
  not          1    [2,1]
  I            2    [2,2], [3,2]
  am           2    [2,2], [3,1]
  what         1    [2,1]
  think        1    [3,1]
  therefore    1    [3,1]
  da           1    [4,3]
  let          1    [4,2]
  it           1    [4,2]
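building such an index is a single pass over the collection; a minimal sketch (vocabulary and inverted lists in one dict):

```python
from collections import Counter

def build_inverted_index(docs):
    """docs: {doc_id: token list}. Returns {term: [(doc_id, freq), ...]},
    with each inverted list ordered by doc_id."""
    index = {}
    for doc_id in sorted(docs):
        for term, freq in sorted(Counter(docs[doc_id]).items()):
            index.setdefault(term, []).append((doc_id, freq))
    return index

docs = {1: "to do is to be to be is to do".split(),
        2: "to be or not to be i am what i am".split()}
idx = build_inverted_index(docs)
# idx["to"] == [(1, 4), (2, 2)]; idx["or"] == [(2, 1)]
```

the document frequency n_i of a term is simply `len(idx[term])`.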
single-word queries the simplest type of search is that for the occurrences of a single word the vocabulary search can be carried out using any suitable data structure e.g., hashing, tries, or B-trees in most cases the vocabulary is sufficiently small so as to stay in main memory the occurrence lists, on the other hand, are usually fetched from disk
multiple-word queries
consider two cases: conjunctive (AND operator) queries and disjunctive (OR operator) queries.
conjunctive queries require searching for all the words in the query, obtaining one inverted list for each word; then we intersect all the inverted lists to obtain the documents that contain all these words. this is the typical case on the web, due to the size of the document collection.
for disjunctive queries, the lists must instead be merged
list intersection
the most time-demanding operation on inverted indexes is the intersection of the lists of occurrences, so it is important to optimize it. consider one pair of lists of sizes m and n respectively, stored in consecutive memory, which need to be intersected:
- if m is much smaller than n, it is better to do m binary searches in the larger list
- if m and n are comparable, a double binary search is possible: it takes O(log n) time if the intersection is trivially empty and requires fewer than m + n comparisons on average
list intersection
when there are more than two lists, there are several possible heuristics depending on the list sizes. if intersecting the two shortest lists gives a very small answer, it might be better to intersect that result with the next shortest list, and so on. the algorithms are more complicated if the lists are stored non-contiguously and/or compressed
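two of the strategies above, sketched for sorted doc-id lists (the merge and the m-binary-searches variants; the double binary search is omitted):

```python
import bisect

def intersect_merge(a, b):
    """Linear merge of two sorted doc-id lists: O(m + n) comparisons."""
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def intersect_search(small, large):
    """Binary-search each element of the small list in the large one:
    O(m log n), preferable when m is much smaller than n."""
    out = []
    for x in small:
        k = bisect.bisect_left(large, x)
        if k < len(large) and large[k] == x:
            out.append(x)
    return out

# both return [4, 9] for [1, 4, 9] and [2, 4, 7, 9, 11]
```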
more complex queries phrase and proximity queries the lists must be traversed to find places where all the words appear in sequence (for a phrase), or appear close enough (for proximity) prefix queries regular expressions
ranking
how to find the top-k documents? assume weight-sorted inverted lists. for a single-word query the answer is trivial: the list is already sorted by the desired ranking. for other queries, we need to merge the lists
compressed inverted indexes
in many cases we want a compressed inverted index. why? less space, and a smaller number of disk accesses to bring the inverted lists into memory.
how to compress? in each list, place document identifiers in ascending order; they can then be represented as sequences of gaps between consecutive numbers. gaps are small for frequent words and large for infrequent words, so compression can be obtained by encoding small values with shorter codes
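the gap transformation itself is a one-liner each way; a sketch assuming a non-empty, strictly ascending doc-id list:

```python
def to_gaps(doc_ids):
    """[3, 7, 11, 12] -> [3, 4, 4, 1]: the first id, then the differences."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def from_gaps(gaps):
    """Inverse of to_gaps: cumulative sums recover the doc ids."""
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

# a frequent word hits many docs, so its gaps are mostly small:
# to_gaps([3, 7, 11, 12]) -> [3, 4, 4, 1]
```

the small gap values are what the variable-length codes on the next slides exploit.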
compressed inverted indexes
consider as a coding scheme the unary code: each integer x > 0 is coded as (x − 1) 1-bits followed by a 0-bit.
a better scheme is the Elias-γ code, which represents a number x by a concatenation of two parts:
1. a unary code for 1 + ⌊log₂ x⌋
2. a code of ⌊log₂ x⌋ bits that represents the number x − 2^⌊log₂ x⌋ in binary
another coding scheme is the Elias-δ code: it concatenates parts (1) and (2) as above, yet part (1) is represented not in unary but using Elias-γ instead
compressed inverted indexes
in general, Elias-γ requires 1 + 2⌊log₂ x⌋ bits and Elias-δ requires 1 + 2⌊log₂ log₂ 2x⌋ + ⌊log₂ x⌋ bits. for small values of x, Elias-γ codes are shorter than Elias-δ codes, and the situation is reversed as x grows; thus the choice depends on which values we expect to encode
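the Elias-γ code, sketched over bit strings for readability (a real implementation would pack bits into bytes):

```python
def elias_gamma_encode(x):
    """x > 0 -> bit string: unary code for 1 + floor(log2 x), i.e. that many
    bits ending in a 0, then floor(log2 x) binary bits for x - 2^floor(log2 x)."""
    assert x > 0
    b = x.bit_length() - 1                  # floor(log2 x)
    unary = "1" * b + "0"                   # unary for b + 1: b ones, then a 0
    tail = format(x - (1 << b), "0{}b".format(b)) if b else ""
    return unary + tail

def elias_gamma_decode(bits):
    b = bits.index("0")                     # recover floor(log2 x) from the prefix
    tail = bits[b + 1: b + 1 + b]
    return (1 << b) + (int(tail, 2) if tail else 0)

# elias_gamma_encode(1) -> "0" (1 bit); elias_gamma_encode(9) -> "1110001"
# (7 bits, matching 1 + 2*floor(log2 9) = 7)
```

the unary convention here writes ones terminated by a zero, i.e. value v becomes (v − 1) 1-bits and a 0-bit, matching the previous slide.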
parallel and distributed IR
motivation
the volume of online content today is staggering, and it has been growing at an exponential rate. at a slightly smaller scale, the largest corporate intranets now contain several million web pages. as document collections grow larger, they become more expensive to manage, and it is necessary to consider alternative IR architectures and algorithms. the application of parallelism and distributed computing can greatly enhance the ability to scale IR algorithms
data partitioning IR tasks are typically characterized by a small amount of processing applied to a large amount of data how to partition the document collection and the index?
data partitioning
consider the matrix of indexing items against documents:

            k_1    k_2    ...  k_i    ...  k_t
  d_1       w_1,1  w_2,1  ...  w_i,1  ...  w_t,1
  d_2       w_1,2  w_2,2  ...  w_i,2  ...  w_t,2
  ...
  d_j       w_1,j  w_2,j  ...  w_i,j  ...  w_t,j
  ...
  d_N       w_1,N  w_2,N  ...  w_i,N  ...  w_t,N
document partitioning document partitioning slices the matrix horizontally, dividing the documents among the subtasks the N documents in the collection are distributed across the P processors in the system during query processing, each parallel process evaluates the query on N/P documents the results from each of the sub-collections are combined into a final result list
term partitioning in term partitioning, the matrix is sliced vertically it divides the indexing items among the P processors in this way, the evaluation procedure for each document is spread over different processors other possible partition strategies include divisions by language or other intrinsic characteristics of the data it may be that each independent search server is focused on a particular subject area
collection partitioning
when the distributed system is centrally administered, more options are available. one option is simply the replication of the collection across all search servers. a broker routes queries to the search servers and balances the load on the servers.
[figure: users send queries to a broker, which forwards them to several replicated search engines and collects the results]
inverted index partitioning
its implementation depends on whether it is combined with document partitioning or term partitioning.
when combined with document partitioning, the documents are partitioned into separate sub-collections; each sub-collection has its own inverted index, and the processors share nothing during query evaluation. when a query is submitted to the system, the broker distributes it to all of the processors; each processor evaluates the query on its portion of the document collection, producing an intermediate hit-list. the broker then collects the intermediate hit-lists from all processors and merges them into a final hit-list. the P intermediate hit-lists can be merged efficiently using a binary heap-based priority queue
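the broker's heap-based merge, sketched for P score-sorted intermediate hit-lists (names are illustrative):

```python
import heapq

def merge_hit_lists(hit_lists, k):
    """hit_lists: P lists of (score, doc_id) pairs, each sorted by descending
    score. Returns the global top-k by popping a heap over the list heads."""
    heap = [(-lst[0][0], lst[0][1], i, 0)           # negate: heapq is a min-heap
            for i, lst in enumerate(hit_lists) if lst]
    heapq.heapify(heap)
    out = []
    while heap and len(out) < k:
        neg_score, doc, i, pos = heapq.heappop(heap)
        out.append((-neg_score, doc))
        if pos + 1 < len(hit_lists[i]):             # advance within list i
            score, nxt = hit_lists[i][pos + 1]
            heapq.heappush(heap, (-score, nxt, i, pos + 1))
    return out

lists = [[(0.9, "a"), (0.4, "c")], [(0.7, "b"), (0.1, "d")]]
# merge_hit_lists(lists, 3) -> [(0.9, 'a'), (0.7, 'b'), (0.4, 'c')]
```

each pop/push costs O(log P), so producing the top-k takes O(k log P) instead of scanning all intermediate results.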
inverted index partitioning
when combined with term partitioning, the inverted lists are spread across the processors. each query is decomposed into items, and each item is sent to the corresponding processor. the processors create hit-lists with partial document scores and return them to the broker; the broker then combines the hit-lists into the final result
summary
we considered the problem of finding relevant documents for a given information need.
- modeling: how to represent documents, queries, and similarity?
- evaluation: how to assess an IR system?
- indexing and searching: how to search efficiently?
- parallel and distributed IR: how to exploit large computer clusters and distributed systems?