Index-based Snippet Generation

Saarland University
Faculty of Natural Sciences and Technology I
Department of Computer Science
Master's Program in Computer Science

Master's Thesis

Index-based Snippet Generation

submitted by Gabriel Manolache
on June 9, 2008

Supervisor: Priv.-Doz. Dr. Holger Bast
Advisor: Priv.-Doz. Dr. Holger Bast
Reviewers: Priv.-Doz. Dr. Holger Bast, Prof. Dr. Gerhard Weikum


Statement

Hereby I confirm that this thesis is my own work and that I have documented all sources used.

Gabriel Manolache
Saarbrücken, June 9, 2008

Declaration of Consent

Herewith I agree that my thesis will be made available through the library of the Computer Science Department.

Gabriel Manolache
Saarbrücken, June 9, 2008

Acknowledgments

I would like to express my most profound gratitude to Dr. Holger Bast for giving me the chance to work in his research group on this fascinating project. I am indebted to him for his continuous advice, patience, and willingness to discuss any questions that I have had. His professional enthusiasm and the expertise he shared with me will certainly be a tremendous resource for my later work. I am grateful to both Dr. Holger Bast and Prof. Dr. Gerhard Weikum for reviewing my thesis. I would also like to thank Marjan Celikik for our fruitful collaboration. Last, but not least, I would like to thank my family for their invaluable and continuous support throughout my studies.

Abstract

Ranked result lists with query-dependent snippets have become state of the art in text search. They are typically implemented by searching, at query time, for occurrences of the query words in the top-ranked documents. This document-based approach has three inherent problems: (i) when a document is indexed by terms which it does not contain literally (e.g., related words or spelling variants), localization of the corresponding snippets becomes problematic; (ii) each query operator (e.g., phrase or proximity search) has to be implemented twice: on the index side, in order to compute the correct result set, and on the snippet generation side, to generate the appropriate snippets; and (iii) in the worst case, the whole document needs to be scanned for occurrences of the query words, which is problematic for very long documents. This thesis presents an alternative index-based approach that localizes snippets using information computed solely from the index, and that overcomes all three problems. We show how to achieve this at essentially no extra cost in query processing time, by a technique we call query rotation. We also show how the index-based approach allows the caching of individual segments instead of complete documents, which enables a significantly larger cache hit ratio compared to the document-based approach. We have fully integrated our implementation with the CompleteSearch engine.


Contents

1 Introduction
    1.1 Problem Statement
    1.2 Previous work
    1.3 Motivation of our work
2 Related work
    2.1 Snippets in research: the improvements proposed by Turpin et al.
        The CTS Snippet Engine
        Caching and sentence reordering
    2.2 Snippets in practice: Lucene
3 General information retrieval notions
    3.1 Positional inverted index
    3.2 Advanced search
        Query operators
        Non-literal matches
4 Index-based snippet generation
    Computing all matching positions
        Extended lists
        Query rotation
        Experimental comparison of extended lists and query rotation
    Computing snippet positions
    From snippet positions to snippet text
        Block representation
        Compression of the blocks
    Caching
    Integration with the CompleteSearch engine
5 Index-based versus Document-based Snippet Generation
    Non-literal matches
    Code duplication
    Large documents
    Caching

6 Experiments
    Datasets and Queries
    Space consumption
    Snippet generation time
    Caching
7 Conclusions and Future Work



Chapter 1

Introduction

1.1 Problem Statement

The usual way of interacting with an IR system is to enter a specific information need expressed as a query. As a result, the system provides a ranked list of retrieved documents. For each of these documents, the user can see the title and a few sentences from the document. These few sentences are called a snippet or excerpt, and their role is to help the user decide which of the retrieved documents are most likely to satisfy his information need. Ideally, it should be possible to make this decision without having to refer to the full document text.

Ranked result document lists with query-dependent document snippets, as shown in Figure 1.1, are now state of the art in text search. Some of the earlier web search engines started out with statically precomputed (hence query-independent) document summaries, but by now all major engines have converged to showing snippets centered around the keywords typed by the user. User studies have shown quite a while ago already that properly selected query-dependent snippets are superior to query-independent summaries concerning the speed, precision, and recall with which users can judge the relevance of a hit without actually having to follow the link to the full document [TS98, WJR03].

1.2 Previous work

A variety of methods for query-dependent snippet generation have been described in the literature and implemented in (open source) search engines [1]. They will be discussed in Chapter 2. The methods differ in which of the potentially many snippets are extracted, and in how exactly documents are represented so that snippets can be extracted quickly. On a high level, however, they all follow the same principle two-step approach, which we call document-based:

(D0) For a given query, compute the ids of the top-ranking documents using a suitable precomputed index data structure.
(D1) For each of the top-ranked documents, fetch the (possibly compressed) document text, and extract a selection of segments best matching the given query.

[1] We can only guess how snippet generation is implemented in commercial products or in the big web search engines.

Figure 1.1: Three examples of query-dependent snippets. The first example shows a snippet (with two parts) for an ordinary keyword query; this can easily be produced by the document-based approach. The second example shows a snippet for a combined proximity / or query; in particular, the document-based approach needs to take special care here that only those segments from the document are displayed where the query words indeed occur close to each other. The third example shows a snippet for a semantic query with several non-literal matches of the query words, e.g., Tony Blair matching politician. Processing of this query involves a join of information from several documents, in which case the document-based approach is not able to identify matching segments based on the document text and the query alone. The index-based approach described in this thesis can deal with all three cases without additional effort. For ordinary queries it is at least as efficient as the document-based approach.

Note that we call the first step (D0) (instead of D1) because it is not really part of the snippet generation but rather a prerequisite. The same remark goes for (I0), the first step of our index-based approach, described below.

1.3 Motivation of our work

In this thesis, we investigate the following index-based approach to snippet generation, which to our knowledge has not been studied in the literature so far. The goal of this approach is to overcome the problems that affect the document-based approach. We will show that the index-based method is at least as efficient as the document-based approach for ordinary keyword queries, that it is superior when it comes to non-literal matches, advanced query operators, and large documents, and that it is without alternative for complex query languages involving joins. A more detailed comparison of the document-based and index-based approaches will be presented in Chapter 5.
The index-based approach goes in four steps:

(I0) For a given query, compute the ids of the top-ranking documents. This is just like (D0).

(I1) In addition to the information from (I0), also compute, for each query word, the matching positions of that word in each of the top-ranking documents. We will show how this step can be realized with negligible extra time, compared to (I0) and (I3), and for any ordinary positional index, without a need to change the index's internal data structures. The latter can be a major issue from a systems engineering point of view, depending on the design of the system.

(I2) For each of the top-ranked documents, given the list of matching positions computed in (I1) and given a precomputed segmentation of the document, compute a list of positions to be output. We will show that this step is a simple matter of a k-way merge / intersection, and that it takes negligible time compared to (I0) and (I3).

(I3) Given the list of positions computed in (I2), produce the actual text to be displayed to the user. We will show how to preprocess (in particular: compress) the documents, so that this step can be done efficiently, accessing only those portions of the (possibly very large) document which actually contain the snippets to be output.

The index-based approach presented in this thesis is joint work with Dr. Holger Bast and Marjan Celikik. It led to a submission to the ACM 17th Conference on Information and Knowledge Management. My work concentrated mainly on steps (I1) and (I2); they will be presented in more detail than the remaining parts of the method (step (I3) and snippet caching).
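To make the four steps concrete, here is a minimal Python sketch of (I0)-(I2) on a toy positional index. The index contents, the purely conjunctive query semantics, and the segment representation are all illustrative assumptions, not the actual CompleteSearch data structures.

```python
# Toy positional index: word -> list of (doc_id, position) postings.
# All data below is hypothetical, for illustration only.
INDEX = {
    "roman":   [(7, 3), (23, 1), (47, 2), (47, 9)],
    "emperor": [(7, 4), (23, 5), (47, 10)],
}

def top_documents(query):
    # (I0): ids of documents containing all query words (conjunctive query).
    doc_sets = [{d for d, _ in INDEX[w]} for w in query]
    return sorted(set.intersection(*doc_sets))

def matching_positions(query, doc_id):
    # (I1): for each query word, its matching positions in the document.
    return {w: [p for d, p in INDEX[w] if d == doc_id] for w in query}

def snippet_positions(positions, segments):
    # (I2): merge all matching positions and keep the segments
    # (given as (start, end) position pairs) containing at least one match.
    hits = sorted(p for ps in positions.values() for p in ps)
    return [seg for seg in segments if any(seg[0] <= p < seg[1] for p in hits)]

query = ["roman", "emperor"]
docs = top_documents(query)                       # [7, 23, 47]
pos = matching_positions(query, 47)               # {'roman': [2, 9], 'emperor': [10]}
segs = snippet_positions(pos, [(0, 8), (8, 16)])  # both segments contain a match
```

Step (I3) would then fetch only the text of the segments in `segs` from the compressed document representation.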


Chapter 2

Related work

The literature abounds with work on the general topic of document summarization. However, most of this work is concerned with query-independent summarization, which is not our topic here. For a recent list of references, see [VH06]. Note that all the big search engines have query-dependent snippets by now, in particular Google, Yahoo Search, MSN Search, and AltaVista (which started with query-independent summaries). Tombros and Sanderson [TS98] presented the first in-depth study showing that properly selected query-dependent snippets are superior to query-independent summaries with respect to the speed, precision, and recall with which users can judge the relevance of a hit without actually having to follow the link to the full document. In this work, we take the usefulness of query-dependent result snippets for granted.

Usually more segments match the query than can (and should) be displayed to the user, so that a selection has to be made. Questions of a proper such selection / ranking have been studied by various authors. For example, in a recent work, [VH06] have proposed an algorithm for computing segments that are as semantically related to each other as possible. Here, in this thesis, the focus is on efficiency and feasibility aspects. For the scoring and ranking of segments, we adopt the simple yet effective scheme from [TTHW07], described below in Figure 2.1. Only recently, Turpin et al. [TTHW07] have presented the first in-depth study of the document-based approach with respect to efficiency (in time and space).

2.1 Snippets in research: the improvements proposed by Turpin et al.

The work presented in [TTHW07] follows the document-based approach. The authors propose and analyze a document compression method that reduces snippet generation time by 58% over a baseline using the zlib compression library.
The experiments reveal that finding documents on secondary storage dominates the total cost of generating snippets, so caching is used to speed up the snippet generation process. A solution for avoiding scanning whole documents is also proposed and analyzed: document reordering and compaction. This method increases the number of document cache hits but has an impact on snippet quality; it doubles the number of documents that can fit in a fixed-size cache.

Input: A document broken into one sentence per line, and a sequence of query terms.
Output: Remove the number of sentences required from the heap to form the summary.
1. For each line of text, L = [w1, w2, ..., wm]:
2.     Let h be 1 if L is a heading, 0 otherwise.
3.     Let l be 2 if L is the first line of a document, 1 if it is the second line, 0 otherwise.
4.     Let c be the number of wi that are query terms, counting repetitions.
5.     Let d be the number of distinct query terms that match some wi.
6.     Identify the longest contiguous run of query terms in L, say wj ... wj+k.
7.     Use a weighted combination of c, d, k, h and l to derive a score s.
8.     Insert L into a max-heap using s as a key.

Figure 2.1: Simple sentence ranker that operates on raw text with one sentence per line.

The CTS Snippet Engine

As we already explained, step (D1) must provide, for each of the top-ranked documents, a snippet: text that attempts to summarize the document and contains (at least some of) the query words. Similarly to previous work on summarization [Luh58], Turpin et al. identify the sentence as the minimal unit for extraction and presentation to the user. In order to construct a snippet, all sentences in a document are ranked against the query and the top two or three are returned as a snippet. The scoring of sentences against queries has been explored in several papers [GKMC99, Luh58, SSJ01, TS98, WRJ02], with different features of the sentences deemed important. Based on these studies, Figure 2.1 presents the general algorithm for scoring sentences in relevant documents. The top-ranked sentences form the snippet. The final score of a sentence, assigned in step 7, can be computed in various ways. In order to avoid bias towards any particular scoring mechanism, Turpin et al. compare sentence quality using the individual components of the score, rather than an arbitrary combination of the components.
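As an illustration, the ranker of Figure 2.1 can be sketched in Python as follows. The heading test and the weight vector are placeholder assumptions, since the paper deliberately leaves the weighted combination in step 7 open.

```python
import heapq

def rank_sentences(lines, query_terms, weights=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """Score each line as in Figure 2.1 and return a max-heap of sentences.
    The weight vector and the all-caps heading test are assumptions."""
    qset = {t.lower() for t in query_terms}
    heap = []
    for i, line in enumerate(lines):
        words = [w.lower().strip(".,!?") for w in line.split()]
        h = 1 if line.isupper() else 0                 # crude heading test
        l = 2 if i == 0 else (1 if i == 1 else 0)      # position in document
        c = sum(1 for w in words if w in qset)         # query terms, with repeats
        d = len(set(words) & qset)                     # distinct query terms
        k = run = 0                                    # longest contiguous run
        for w in words:
            run = run + 1 if w in qset else 0
            k = max(k, run)
        wc, wd, wk, wh, wl = weights
        s = wc * c + wd * d + wk * k + wh * h + wl * l
        heapq.heappush(heap, (-s, line))               # negate: heapq is a min-heap
    return heap

heap = rank_sentences(
    ["The roman emperor ruled.", "Cats sleep a lot.", "Emperor penguins swim."],
    ["roman", "emperor"])
best = heapq.heappop(heap)[1]   # highest-scoring sentence
```

Popping the heap repeatedly yields the sentences in decreasing score order, as required by the Output line of Figure 2.1.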
Since Web data is often poorly structured, poorly punctuated, and contains a lot of data that does not form part of valid sentences that would be candidates for parts of snippets, it is assumed that the documents passed to the Snippet Engine have all HTML tags and JavaScript removed, and that each document is reduced to a series of word tokens separated by non-word tokens. A word token is defined as a sequence of alphanumeric characters, while a non-word is a sequence of non-alphanumeric characters such as whitespace and the other punctuation symbols. Both are limited to a maximum of 50 characters. Adjacent, repeating characters are removed from the punctuation. Included in the punctuation set is a special end-of-sentence marker which replaces the usual three sentence terminators "?", "!", and ".". Often these explicit punctuation characters are missing, and so HTML tags such as <br> and <p> are assumed to terminate sentences. In addition, a sentence must contain at least five words and no more than twenty words, with longer or shorter sentences being broken and joined as required to meet these criteria [KPC95]. Unterminated HTML tags, that is, tags with an open brace but no close brace, cause all text from the open brace to the next open brace to be discarded.

Having defined the format of documents that are presented to the Snippet Engine, the next important characteristic of this method is the Compressed Token System (CTS) document storage scheme, and the baseline system used for comparison. For the baseline, an obvious document representation scheme is utilized: each document is simply compressed with a well-known adaptive compressor, and then decompressed as required [BP98]. Such a system has been implemented

by the authors using zlib [GA07] with default parameters to compress every document. Each document is stored in a single file. While manageable for small test collections or small enterprises with millions of documents, a full Web search engine may require multiple documents to inhabit single files, or a special-purpose file system [HL03]. For snippet generation, the required documents are decompressed one at a time, and a linear search for the provided query terms is employed. The search is optimized for matching whole words and the sentence-terminating token, rather than general pattern matching.

The CTS Snippet Engine makes a series of optimizations over the baseline presented in the previous paragraph. The first is to employ a semi-static compression method over the entire document collection, which allows faster decompression with minimal compression loss. Using a semi-static approach involves mapping words and non-words produced by the parser to single integer tokens, with frequent symbols receiving small integers, and then choosing a coding scheme that assigns small numbers a small number of bits. Words and non-words strictly alternate in the compressed file, which always begins with a word. Each symbol is assigned its ordinal number in a list of symbols sorted by frequency. The vbyte coding scheme is used to code the word tokens [IHWB99]. The set of non-words is limited to the 64 most common punctuation sequences in the collection itself, and these are encoded with a flat 6-bit binary code. The remaining 2 bits of each punctuation symbol are used to store capitalization information. The process of computing the semi-static model is complicated by the fact that the number of words and non-words appearing in large web collections is high. Moffat et al.
[AMS97] have examined schemes for pruning models during compression using large alphabets, and conclude that rarely occurring terms need not reside in the model. Rather, rare terms are spelt out in the final compressed file, using a special word token (escape symbol) to signal their occurrence. Using these results, Turpin et al. employ two move-to-front queues: one for words and one for non-words. During the first pass of encoding, whenever the available memory is consumed and a new symbol is discovered, an existing symbol is discarded from the last half of the queue, provided that it has frequency one. If there is no such symbol, then a symbol with frequency two is evicted, and so on. The second pass of encoding replaces each word with its vbyte-encoded number, or the escape symbol and an ASCII representation of the word if it is not in the model. Similarly, each non-word sequence is replaced with its codeword, or the codeword for a single space character if it is not in the model. This lossy compression of non-words is acceptable when the documents are used for snippet generation, but may not be acceptable for a document database.

Besides the fact that it allows faster decompression, the semi-static scheme also has the advantage that it readily allows direct matching of query terms as compressed integers in the compressed file. This way, sentences can be scored without having to decompress a document, and only the sentences returned as part of a snippet need to be decoded. The CTS system stores all documents contiguously in one file, with an auxiliary table of 64-bit integers indicating the start offset of each document in the file. Further, it must have access to the reverse mapping of term numbers, allowing those words not spelt out in the document to be recovered and returned to the Query Engine as strings.
The first of these data structures can be readily partitioned and distributed if the Snippet Engine occupies multiple machines; the second, however, is not so easily partitioned, as any document on a remote machine might require access to the whole integer-to-string mapping. This is the second reason for employing the model pruning step during construction of the semi-static code: it limits the size of the reverse mapping table that should be present on every machine implementing the Snippet Engine.
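For illustration, one common variant of the vbyte coding mentioned above can be sketched as follows: 7 data bits per byte, with the high bit set on the last byte of each number. The exact byte layout used by CTS may differ; this is only a sketch of the general technique.

```python
def vbyte_encode(n):
    """Encode a non-negative integer: low 7 bits first, high bit marks
    the terminating byte (one common vbyte convention, an assumption here)."""
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)   # 7 data bits, continuation implied by clear high bit
        n >>= 7
    out.append(n | 0x80)       # final byte carries the high-bit terminator
    return bytes(out)

def vbyte_decode(data):
    """Decode a concatenation of vbyte-encoded integers."""
    n, shift, values = 0, 0, []
    for b in data:
        if b & 0x80:                               # terminator byte
            values.append(n | ((b & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= b << shift
            shift += 7
    return values

enc = vbyte_encode(300) + vbyte_encode(5)
vbyte_decode(enc)   # [300, 5]
```

Small (frequent) token numbers thus occupy a single byte, which is what makes the frequency-sorted symbol numbering of the semi-static model pay off.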

Caching and sentence reordering

In order to speed up the snippet generation process, Turpin et al. make use of a Snippet Engine cache whose size is proportional to the size of the document collection. Using simulation, the authors have compared two caching policies: a static cache, where the cache is loaded with as many documents as it can hold before the system begins answering queries, and then never changes; and a least-recently-used (LRU) cache, which starts out like the static cache, but whenever a document is accessed it moves to the front of a queue, and if a document has to be fetched from disk, the last item in the queue is evicted. It is assumed that a query cache exists for the top Q most frequent queries, and that these queries are never processed by the Snippet Engine. The results of the experiments reveal that the static cache performs well, but it is outperformed by the LRU cache: if as little as 1% of the documents are cached, then around 75% of the disk seeks can be avoided.

Besides compression, another approach for reducing the size of the documents in the cache is explored: instead of caching whole documents, only sentences that are likely to be used in snippets are stored. If, during snippet generation on a cached document, the sentence scores do not reach a certain threshold, then the whole document is retrieved from disk. This means that it is very important which sentences from a document are stored in the cache and which are left on disk. A sentence reordering can be performed for each document such that sentences that are very likely to appear in snippets are at the front of the document. This way, they are processed first at query time, and once enough sentences are found to generate a snippet, the rest of the document can be ignored. Further, to improve caching, only the head of each document can be stored in the cache, with the tail residing on disk.
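The LRU policy described above can be sketched in a few lines of Python; the capacity, the disk-loading callback, and the hit/miss counters are illustrative assumptions, not part of the original system.

```python
from collections import OrderedDict

class LRUDocumentCache:
    """Minimal LRU document cache in the spirit of the policy above:
    a hit moves the document to the front, a miss evicts the
    least-recently-used entry once capacity is reached (sketch)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.docs = OrderedDict()           # doc_id -> document text
        self.hits = self.misses = 0

    def fetch(self, doc_id, load_from_disk):
        if doc_id in self.docs:
            self.hits += 1
            self.docs.move_to_end(doc_id)        # mark as most recently used
        else:
            self.misses += 1
            if len(self.docs) >= self.capacity:
                self.docs.popitem(last=False)    # evict least recently used
            self.docs[doc_id] = load_from_disk(doc_id)
        return self.docs[doc_id]

cache = LRUDocumentCache(capacity=2)
load = lambda d: f"text of document {d}"
for d in [1, 2, 1, 3, 1]:
    cache.fetch(d, load)
# the two accesses to document 1 after its first load are hits;
# document 2 is evicted when document 3 arrives
```

A static cache would simply omit `move_to_end` and the eviction, refusing to load anything new once full.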
One problem of this approach is that the search engine must now be able to provide copies of the documents itself. The snippet generation engine can no longer do this, since it does not have access to the exact text of the document as it was indexed. Four sentence reordering schemes are considered:

Natural order. The first few sentences of a well-authored document usually best describe the document content [Luh58]. Thus simply processing a document in order should yield a quality snippet. Unfortunately, web documents are often not well authored, with little editorial or professional writing skill brought to bear on the creation of a work of literary merit. Also, taking into account that query-biased snippets are being produced, there is no guarantee that query terms will appear in sentences toward the front of a document.

Significant terms (ST). Luhn introduced the concept of a significant sentence as containing a cluster of significant terms [Luh58], a concept found to work well by Tombros and Sanderson [TS98]. Let f_{d,t} be the frequency of term t in document d, and let s_d be the number of sentences in d. Then term t is determined to be significant if

    f_{d,t} >= 7 - 0.1 * (25 - s_d),  if s_d < 25
    f_{d,t} >= 7,                     if 25 <= s_d <= 40
    f_{d,t} >= 7 + 0.1 * (s_d - 40),  otherwise.

A bracketed section is defined as a group of terms where the leftmost and rightmost terms are significant terms, and no significant terms in the bracketed section are divided by more than four non-significant terms. The score of a bracketed section is the square of the number of significant words

falling in the section, divided by the total number of words in the entire sentence. The a priori score for a sentence is computed as the maximum of all scores for the bracketed sections of the sentence. The sentences are then sorted by this score.

Query log based (QLt). Many Web queries repeat, and a small number of queries make up a large volume of total searches [BJP05]. In order to take advantage of this bias, sentences that contain many past query terms are promoted to the front of a document, while sentences that contain few query terms are demoted. In this scheme, the sentences are sorted by the number of sentence terms that occur in the query log. To ensure that long sentences do not dominate shorter, qualitative sentences, the score assigned to each sentence is divided by the number of terms in that sentence, giving each sentence a score between 0 and 1.

Query log based (QLu). This scheme is as for QLt, but repeated terms in the sentence are only counted once.

The purpose of using the ST, QLt or QLu schemes is to terminate snippet generation earlier than if Natural Order is used, but still produce sentences with the same number of unique query terms (d in Figure 2.1), the same total number of query terms (c), the same positional score (h + l) and the same maximum span (k). Experiments revealed that sorting sentences using the Significant Terms (ST) method leads to the smallest change in the sentence scoring components. The greatest change over all methods is in the sentence position (h + l) component of the score, which is to be expected, as there is no guarantee that leading and heading sentences are processed at all after sentences are re-ordered. The second most affected component is the number of distinct query terms in a returned sentence, but if only the first 50% of the document is processed with the ST method, there is a drop of only 8% in the number of distinct query terms found in snippets.
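The significant-terms test of the ST scheme can be sketched as a small function. Note that the piecewise thresholds here follow the formula as reconstructed above from the cited works, so the exact coefficients should be treated as an assumption.

```python
def significance_threshold(s_d):
    """Threshold on the term frequency f_{d,t}, with s_d the number of
    sentences in the document (coefficients as reconstructed above)."""
    if s_d < 25:
        return 7 - (25 - s_d) / 10
    elif s_d <= 40:
        return 7.0
    else:
        return 7 + (s_d - 40) / 10

def significant_terms(term_freqs, s_d):
    """term_freqs maps each term t to its frequency f_{d,t} in the document;
    return the set of terms meeting the significance threshold."""
    threshold = significance_threshold(s_d)
    return {t for t, f in term_freqs.items() if f >= threshold}

significance_threshold(10)                    # 5.5
significant_terms({"rome": 6, "cat": 2}, 10)  # {'rome'}
```

Intuitively, short documents get a lower bar for significance, while very long documents require proportionally higher term frequencies.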
The authors argue that there is little overall effect on scores when processing only half the document using the ST method, depending on how the various components are weighted to compute the overall snippet score. In conclusion, the Turpin et al. method follows the document-based approach, providing a series of mechanisms for improving the speed of access to the document text: caching, sentence reordering, and a semi-static compression method for faster decompression.

2.2 Snippets in practice: Lucene

All open-source engines we know of that provide query-dependent snippet generation follow the document-based approach. In particular, this holds true for Lucene [Cut]. Lucene is one of the most popular free / open source information retrieval libraries, providing support for indexing and search. It is supported by the Apache Software Foundation and is released under the Apache Software License. The early versions of Lucene use a straightforward implementation of the document-based approach: at query time, each of the top-ranked documents is re-parsed, and segments containing the query words are extracted. The highlighting of the query words in these segments is done by simply comparing the tokens of the document with the tokens in the query. The tokens of the document can be built from the index or by analyzing the document. If there is a match

between one token in the document and one token in the query, then the original text can be reconstructed, with the query words highlighted, by using stored information about the original offsets in the document for each of the tokens. The Lucene highlighter can handle stemmed words, since both the query terms and the document content are processed by the same parser. This means that words such as algorithmic, algorithm, algorithms are stemmed to the same root form in both query and document content. The tokens produced by the parser include the byte offsets of the original full word, not just the stemmed form, so the highlighter knows the full extent of what to highlight in the text.

The Lucene highlighter manages fuzzy queries by expanding them to all similar terms. An example of such a fuzzy query is the erroneous word belies: it might be expanded to the disjunctive query belief believe. Such an approach has the obvious disadvantage that the number of query words will increase depending on the size of the set of similar terms. This represents an important scalability problem. To address the efficiency problems of the approach presented above, recent versions of Lucene provide support for storing the sequence of term ids output by the parser (much in the vein of [TTHW07]), and even precompute and store for each document a small index for fast location of the query words. However, all of these enhancements remain in the realm of the document-based approach, and as such do not address the important problems of non-literal matches, code duplication, and coarse caching granularity pointed out in Chapter 5.

Chapter 3

General information retrieval notions

This chapter describes the basic IR concepts and notions required by our snippet generation method. The first part introduces the data structure used for query processing and the operations performed on this data structure. Here is also where we describe the first (common) step of both the index-based and the document-based snippet generation methods. The second part provides details about the advanced search options that our method supports in an efficient manner.

3.1 Positional inverted index

The inverted index, sometimes known as inverted file or postings file, is one of the major concepts in information retrieval. It is the data structure of choice for most search applications: it is very efficient for short queries, easy to implement, and can be compressed well. The retrieval is performed on a group of documents known as a (document) collection or corpus (a body of texts). Based on the document collection, a lexicon or vocabulary is built: a list of the terms that appear in the corpus. An inverted index contains, for each term in the lexicon, an inverted list that stores a list of pointers to all occurrences of that term in the main text, where each pointer is, in effect, the number of a document in which the term appears [IHWB99]. Here is an example of such an inverted list:

    doc ids    D100  D129  D1401  D2722  D3000

The meaning is that the term to which this inverted list belongs appears in the documents with the ids D100, D129, D1401, D2722 and D3000. Our method requires a positional inverted index. This simply means that each inverted list must store, for each document, the positions where the corresponding term appears. For such an index, an inverted list conceptually looks as follows:

    doc ids    D7   D23  D47  D47  D63
    positions  6    ...  ...  ...  ...
    word ids   W12  W12  W12  W12  W12
    scores     2    ...  ...  ...  ...

For example, the first inverted list entry (also called a posting) means that the word/term with id W12 occurs in the document with id D7 at position 6 with a score of 2. It is important to note that the doc ids are sorted in increasing order. Some doc ids are repeated if the term appears more than once in the corresponding document. For each occurrence of the term in a document, a score that reflects the importance of that particular occurrence is assigned. Individual scores are aggregated to per-document scores, according to which the documents are eventually ranked. Based on the aggregated scores, only the most relevant documents are presented to the user.

In the inverted list presented above, all postings have the same value for the word ids entry, namely W12. This means that the inverted list belongs to a single word, for example roman. If we want to search for all words that start with rom, then our query word will be rom*. The asterisk denotes that any sequence of characters may follow after rom. Assuming that our collection contains only roman and rome as words that contain rom as a prefix, the result of our query should return the doc ids, positions and scores for these two words. The result posting list might look as follows:

    doc ids    D7   D23  D31  D47  D47  D63  D87  D99
    positions  ...
    word ids   W12  W12  W67  W12  W12  W12  W67  W67
    scores     ...

It can be noticed that this inverted list, belonging to the query word rom*, contains the postings corresponding to the query word roman (the ones for which the value of word ids is W12), as well as the postings for the query word rome (having W67 as their word ids value). Prefix queries of this kind are the reason for using a word ids entry for each of the postings.
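The expansion of a prefix query like rom* can be sketched as follows, using a hypothetical lexicon and posting lists; the postings here carry only (doc id, word id) pairs, with positions and scores omitted for brevity.

```python
# Hypothetical lexicon (word -> word id) and per-word posting lists
# of (doc_id, word_id) pairs; all data is illustrative.
LEXICON = {"roman": "W12", "rome": "W67", "saxon": "W90"}
POSTINGS = {
    "W12": [(7, "W12"), (23, "W12"), (47, "W12")],
    "W67": [(31, "W67"), (87, "W67"), (99, "W67")],
    "W90": [(12, "W90")],
}

def prefix_query(prefix):
    """Union of the posting lists of all lexicon words with the given
    prefix, merged by doc id; each posting keeps its word id, so the
    matching word can still be identified afterwards."""
    word_ids = [wid for word, wid in LEXICON.items() if word.startswith(prefix)]
    return sorted(p for wid in word_ids for p in POSTINGS.get(wid, []))

prefix_query("rom")
# [(7, 'W12'), (23, 'W12'), (31, 'W67'), (47, 'W12'), (87, 'W67'), (99, 'W67')]
```

In a real index, the words sharing a prefix occupy a contiguous range of the sorted lexicon, so this expansion does not require scanning the whole vocabulary.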
Given a query, the basic operation is to either intersect the lists of the query words (to obtain the ids of all documents that contain all query words; such queries are called conjunctive) or to compute their union (to obtain the ids of all documents that contain at least one of the query words; such queries are called disjunctive). Let us consider an example for the conjunctive query roman emperor julius, where the query words have the following inverted lists:

roman

    doc ids    D7   D23  D47  D47  D63
    positions
    word ids   W12  W12  W12  W12  W12
    scores

emperor

    doc ids    D3   D7   D47  D63
    positions
    word ids   W34  W34  W34  W34
    scores

julius

    doc ids    D7   D7   D39  D47  D63
    positions
    word ids   W27  W27  W27  W27  W27
    scores

In order to compute the inverted list for the entire query, the lists of the individual query words must be intersected according to the doc ids. For our example, the resulting inverted list is the following:

    doc ids     D7  D7  D7  D7  D47  D47  D47  D47  D63  D63  D63
    positions
    word ids    W9  W9  W9  W9  W9   W9   W9   W9   W9   W9   W9
    scores
    query word

Observe that the postings of the result inverted list have one more entry, named query word. After intersecting the posting lists of the query words, we need to know, for each of these query words, all its positions in the common documents (returned by the intersection). The query word entry tells from which query word a posting stems; without it, it would not be possible to say which position refers to which query word. More details about this topic will be presented in Chapter 4.

At this point it is important to notice that this straightforward approach has two major drawbacks: the result inverted list has been changed by adding one more entry to each posting, and it is also very long. We have only three common document ids, yet eleven postings. In practice, the resulting inverted lists might contain millions of different doc ids, so ranking of the postings is essential in order to consider only the top K most relevant documents. The ranking is based on the scores from the result list, which are obtained through score aggregation during the intersection procedure. For our example, the aggregation function is the sum of the scores corresponding to the same doc id. If we consider the document with the id D7, the scores associated with the occurrences of the three query words are 2, 10, 6 and 33. In the result list, the score associated with D7 is the sum of all these scores: 51.
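The intersection with sum aggregation and top-K ranking described above can be sketched as follows. This is a minimal sketch, not the thesis code; the D7 scores (2, 10, 6, 33) and the positions of the three words in D7 and D47 are taken from the text, while the remaining positions and scores are made up:

```python
# Sketch of conjunctive query processing with sum score aggregation and
# top-k ranking. Values not stated in the text are hypothetical.
postings = {  # word -> doc id -> list of (position, score)
    "roman":   {7: [(6, 2)], 23: [(14, 1)], 47: [(26, 3), (38, 1)], 63: [(9, 2)]},
    "emperor": {3: [(5, 4)], 7: [(31, 10)], 47: [(27, 5)], 63: [(40, 3)]},
    "julius":  {7: [(2, 6), (39, 33)], 39: [(8, 2)], 47: [(28, 4)], 63: [(1, 1)]},
}

def conjunctive_top_k(postings, query, k):
    """Doc ids containing all query words, ranked by aggregated score."""
    # doc ids that appear in every query word's list
    common = set.intersection(*(set(postings[w]) for w in query))
    # aggregate: sum the scores of all occurrences of all query words
    aggregated = {doc: sum(score for w in query for _, score in postings[w][doc])
                  for doc in common}
    return sorted(aggregated.items(), key=lambda item: -item[1])[:k]

top = conjunctive_top_k(postings, ["roman", "emperor", "julius"], k=2)
# D7 aggregates 2 + 10 + 6 + 33 = 51, as computed in the text
```

The three common documents are D7, D47 and D63, and with k = 2 only the two highest-scoring ones survive.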
The scores for the rest of the doc ids are computed using the same aggregation function. Let us assume that we are only interested in the two most relevant documents returned for the query roman emperor julius. In this case, only the postings corresponding to the two highest-scoring documents are kept in the result list:

    doc ids     D7  D7  D7  D7  D47  D47  D47  D47
    positions
    word ids    W9  W9  W9  W9  W9   W9   W9   W9
    scores
    query word

The example presented in this section illustrates the main steps of how a query is processed by a search engine. Computing the ids of the top-ranked documents is the first step of the index-based method and also the first step of the document-based approach. We call this step (I0) (index-based) or (D0) (document-based), depending on the snippet generation method. We use the (D0) / (I0) notation for this step (instead of (D1) / (I1)) because it is not really part of the snippet generation but rather a prerequisite.

3.2 Advanced search

Besides the usual conjunctive or disjunctive queries, our method supports, without additional effort, queries with operators and advanced queries (synonym search, error-tolerant search, semantic search, etc.). These two categories of queries are problematic when handled by the document-based method in its basic form. The following two subsections present descriptions and examples for these two categories of queries.

3.2.1 Query operators

A query operator consists of one or more characters that act not as a query word, but as an instruction on how a query is to be processed. An operator can work at word level, where it applies to a single query term, or at query level, where its presence affects the processing of the entire query. Our method is completely unaware of the query words and operators that are part of the query: it uses only the positions provided by the positional inverted index to identify the words for which snippets must be generated. This means that snippet generation will work with any query operator. We have experimented with the phrase operator, the proximity operator and the disjunction operator.
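Of these, the phrase operator is the most restrictive: it keeps a document only if the query words occur at consecutive positions, which can be checked directly during the intersection. A minimal sketch, assuming toy positional lists for roman and emperor (only positions 6, 31, 26 and 27 are values stated in this chapter; the rest are made up):

```python
# Sketch of phrase intersection for two words: keep a document only if
# some occurrence of the first word is immediately followed by an
# occurrence of the second. Positions not stated in the text are made up.
roman   = {7: [6], 23: [14], 47: [26, 38], 63: [9]}    # doc id -> positions
emperor = {3: [5], 7: [31], 47: [27], 63: [40]}

def phrase_match(first, second):
    """Doc ids where `first` and `second` occur at consecutive positions."""
    return [doc for doc in first.keys() & second.keys()
            if any(pos + 1 in second[doc] for pos in first[doc])]

result = phrase_match(roman, emperor)
# only D47 qualifies: roman at 26 is immediately followed by emperor at 27
```

A proximity operator would only change the adjacency test (e.g. to a distance of at most 5), and a disjunction would replace the intersection by a union.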
The phrase operator requires that two query words be adjacent. The proximity operator requires that two query words be within a distance of at most 5 words of each other. The disjunction operator allows a search for any one of a group of two or more query words.

Let us consider the query roman.emperor, which contains the phrase operator, and the following text:

The Roman Emperor was the ruler of the Roman State during the imperial period (starting at about 27 BC). The Romans had no single term for the office: Latin titles such as imperator (from which English emperor ultimately derives), augustus, caesar and princeps were all associated with it.

As results for this query, only the occurrences of roman followed immediately by emperor are considered, i.e., the phrase Roman Emperor at the beginning of the text. The other occurrences of roman and emperor (such as Roman State or the standalone emperor) are not considered, since our query contains the phrase operator between the two words. Such operator queries do not require additional effort because everything is done during the intersection procedure. Consider again the inverted lists of the two query words:

roman

    doc ids    D7   D23  D47  D47  D63
    positions  6         26   38
    word ids   W12  W12  W12  W12  W12
    scores

emperor

    doc ids    D3   D7   D47  D63
    positions       31   27
    word ids   W34  W34  W34  W34
    scores

The intersection procedure simply checks whether there are any consecutive positions for the common document ids in the two lists. In our example, only for document id D47 do we have the consecutive positions 26 and 27, so the result inverted list for our query is:

    doc ids    D47
    scores     32

The other two types of operators can be handled in a similar fashion. Without a positional inverted index, processing such operator queries would take more time: first the documents that simply contain the query words would have to be identified, and then all these documents would have to be scanned in order to compute the distances between the query words.

3.2.2 Non-literal matches

Index-based snippet generation is also able to deal with the vocabulary mismatch problem, encountered when a relevant document does not literally contain (one of) the query words. A nice list of commented examples from the TREC benchmarks is given in [Buc04]. Let us consider the semantic query concept:city, which means that we are looking for all the names of cities in the collection, and not for occurrences of the word city. We have marked this fact by prefixing the query with the concept keyword. Given this scenario, it would be desirable that the following excerpt is taken into consideration:

Rome achieved great glory under Octavian/Augustus. He restored peace after 100 years of civil war; maintained an honest government and a sound currency system;

The reason for considering this snippet is that it contains the word Rome, which can be thought of as a query word (it is the name of a city). A common way to provide such behavior and overcome the vocabulary mismatch problem is to index documents under terms that are related to words in the document, but do not literally occur themselves.
Let us take as an example a collection that contains only Rome and Alexandria as names of cities, with the following inverted lists:

Alexandria

    doc ids    D64  D88
    positions
    word ids   W42  W42
    scores

Rome

    doc ids    D31  D87  D99
    positions
    word ids   W67  W67  W67
    scores

Then the inverted list for the term concept:city is composed of all the postings belonging to the previous two terms:

concept:city

    doc ids    D31  D64  D87  D88  D99
    positions
    word ids   W67  W42  W67  W42  W67
    scores

There is a variety of advanced search features which call for non-literal matches: related-words search (as in our example), prefix search (for a query containing alg, find algorithm), error-correcting search (for a query containing algorithm, find the misspelling algoritm), semantic search (for the query musician, find the entity reference John Lennon), etc.
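Building the concept:city list above amounts to merging the doc-id-sorted lists of its member words. A minimal sketch, with postings reduced to (doc id, word id) pairs and positions/scores omitted:

```python
import heapq

# Sketch: the inverted list of a concept term is the doc-id-ordered merge
# of the posting lists of its member words (toy postings from the example).
member_lists = {
    "alexandria": [(64, "W42"), (88, "W42")],
    "rome":       [(31, "W67"), (87, "W67"), (99, "W67")],
}

# heapq.merge combines the already-sorted member lists in a single pass
concept_city = list(heapq.merge(*member_lists.values()))
```

The word ids entry of each merged posting still tells which city name the posting stems from, which is exactly what the index-based snippet generator needs later.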

Chapter 4

Index-based snippet generation

On a high level, our snippet generation method follows a four-step approach. We have already discussed the first step, (I0), in Chapter 3. This chapter presents in detail the remaining three steps of our method. In the first part we describe and analyze two methods for computing the positions of the query words in the top-ranked documents (I1), using a positional inverted index. We continue by showing how to use the positions of the query words to determine the positions of the snippets (I2). The third part is dedicated to the last step of our method: using the snippet positions to get the actual snippet text, which is finally presented to the user (I3). We finish this chapter with some details about the snippet caching mechanism and the integration with the CompleteSearch engine.

4.1 Computing all matching positions

Throughout this work, we assume that we are using the positional index presented in Chapter 3, with posting lists that conceptually look as follows:

    doc ids    D401  D1701  D1701  D1701  D1807
    word ids   W173  W173   W173   W173   W173
    positions         12
    scores            0.4

For example, the third posting from the right says that the word with id W173 occurs in the document with id D1701 at position 12, and was assigned a score of 0.4. In an actual implementation these lists would be stored in compressed format, but we need not consider this level of detail in what follows. A standard inverted index has one such list precomputed for each word (that is, all word ids are the same within each such list). Pruning techniques are often used to avoid a full scan of the index lists involved, especially in the case of disjunctive queries; for example, see [AM06] [BMS+06]. The results presented in Subsection pertain to such a pruning technique for disjunctive queries, following the ideas of [BMS+06].
The second step of our method, named (I1), uses such a positional inverted index: it takes as input the doc ids of the top-ranked documents and the posting lists of the query words (that make up the user's query). In order to compute the doc ids of the top-ranked documents, the

(I0) step must load from disk the posting lists of the query words. This means that (I0) can provide these posting lists to (I1) without additional effort. If we consider again the example from Chapter 3, the conjunctive query roman emperor julius, then the input of step (I1) consists of the doc ids of the top-ranked documents and the inverted lists of the query words:

top-ranked documents

    doc ids    D7   D47
    scores

roman

    doc ids    D7   D23  D47  D47  D63
    positions
    word ids   W12  W12  W12  W12  W12
    scores

emperor

    doc ids    D3   D7   D47  D63
    positions
    word ids   W34  W34  W34  W34
    scores

julius

    doc ids    D7   D7   D39  D47  D63
    positions
    word ids   W27  W27  W27  W27  W27
    scores

Step (I1) provides as output the positions of each of the query words in each of the top-ranked documents. For this example, these are the positions of each query word in the documents with the ids D7 and D47:

    D7:   roman: 6       emperor: 31   julius: 2, 39
    D47:  roman: 26, 38  emperor: 27   julius: 28

The positional inverted index we have described so far obviously contains all the information required to compute this output. The difficulty is how to incorporate this information in the result. The following two subsections present two approaches for computing the output of step (I1), followed by a subsection dedicated to an experimental comparison of these approaches.
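Concretely, the (I1) output above can be represented as a nested map from doc id to query word to match positions (a sketch of the data shape, with all values taken from the example):

```python
# Output of step (I1) for the example query, as a nested map:
# doc id -> query word -> positions of that word in the document.
i1_output = {
    7:  {"roman": [6],      "emperor": [31], "julius": [2, 39]},
    47: {"roman": [26, 38], "emperor": [27], "julius": [28]},
}

# e.g. all matching positions in document D47, in document order,
# as needed later for locating the snippets
d47_positions = sorted(p for positions in i1_output[47].values() for p in positions)
```

Both approaches described next produce exactly this information; they differ only in how it is computed from the posting lists.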

4.1.1 Extended lists

A first and obvious approach for solving this problem is to use extended lists. An extended list is obtained in a straightforward way, by enhancing the postings of an inverted list with an additional entry telling from which query word each posting stems. Conceptually:

    doc ids     D1701  D1701  D1701  D1701  D1701
    word ids    W173   W173   W173   W173   W173
    positions
    scores
    query word

For example, the third posting from the right now knows that it stems from the first query word. (When reading a list from disk, the query word entry would be set to some special value.) It is not hard to see that with this enhancement, the information required for step (I1) can be computed for all of the operations described above. Coming back to our running example, the posting lists of the three query words would be enhanced in the following way:

roman

    doc ids     D7   D23  D47  D47  D63
    positions
    word ids    W12  W12  W12  W12  W12
    scores
    query word  1    1    1    1    1

emperor

    doc ids     D3   D7   D47  D63
    positions
    word ids    W34  W34  W34  W34
    scores
    query word  2    2    2    2

julius

    doc ids     D7   D7   D39  D47  D63
    positions
    word ids    W27  W27  W27  W27  W27
    scores
    query word  3    3    3    3    3

All the postings belonging to the inverted list of the query word roman have an extra entry (denoted as query word in our example) whose value is always 1. This simply marks the fact that these postings contain information referring to the first query word. Similarly, the postings of the second query word (emperor) and the third query word (julius) contain an additional entry with values 2 and 3, respectively. The result extended list is computed by intersecting the extended lists belonging to the three query words, and looks as follows:

    doc ids     D7  D7  D7  D7  D47  D47  D47  D47  D63  D63  D63
    positions
    word ids    W9  W9  W9  W9  W9   W9   W9   W9   W9   W9   W9
    scores
    query word

Given such a list, it is straightforward to extract the positions of each of the query words in each of the top-ranked documents: it can be done in linear time by examining each posting once.

There are two major disadvantages with this approach, however. The first is that we have modified the central data structure, the sanctuary of every search engine. Unless the search engine was written with such modifications in mind, this is usually not tolerable. The second major problem is efficiency. Consider an intersection of two query words. The extended result list now contains postings from both input lists, that is, it at least doubles in size compared to the corresponding simple result list. This effect is aggravated as the number of query words grows. For our previous example, the simple result list has only two postings, while the extended result list has eight postings. The experimental results presented in Subsection show that this indeed affects the processing time. The problem is that per-query-word positions are computed for all matching documents, before the ranking is done. Note that this also happens when pruning techniques are involved, because there, too, large numbers of postings (namely, candidates for one of the top-ranked doc ids) are involved in the intermediate calculations.

4.1.2 Query rotation

We propose to implement (I1) by what we call query rotation. We first explain query rotation using an example: the three-word phrase query roman.emperor.julius. This means that the three query words should appear in the text one after another, in the specified order. We assume that the posting lists for the query words are the ones we have used so far.
They are provided by step (I0), along with the set D of ids of the top-ranked documents:

    doc ids    D7  D47
    scores

First we compute the posting list L1 as the intersection of D and the list of all postings for roman:

    doc ids    D7   D23  D47  D47  D63
    positions  6         26   38
    word ids   W12  W12  W12  W12  W12
    scores

For each doc id in the intersection we store all the positions from the second list, obtaining L1:

    doc ids    D7   D47  D47
    positions  6    26   38
    word ids   W12  W12  W12
    scores
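This first rotation step can be sketched as follows. It is a minimal sketch under the same example: positions 6, 26 and 38 are the ones stated in the text, while the positions for D23 and D63 are made up:

```python
# Sketch of the first step of query rotation: intersect the set D of
# top-ranked doc ids with roman's positional list, keeping one posting
# per matching occurrence. Positions for D23 and D63 are hypothetical.
D = {7, 47}                     # doc ids of the top-ranked documents

roman = [(7, 6), (23, 14), (47, 26), (47, 38), (63, 9)]   # (doc id, position)

L1 = [(doc, pos) for doc, pos in roman if doc in D]
# L1 keeps the three postings for D7 and D47, with all their positions
```

Because the query words' lists are intersected only with the short list of top-ranked doc ids, no per-query-word positions are computed for documents that will never be shown to the user.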


More information

Chapter 8 & Chapter 9 Main Memory & Virtual Memory

Chapter 8 & Chapter 9 Main Memory & Virtual Memory Chapter 8 & Chapter 9 Main Memory & Virtual Memory 1. Various ways of organizing memory hardware. 2. Memory-management techniques: 1. Paging 2. Segmentation. Introduction Memory consists of a large array

More information

Logistics. CSE Case Studies. Indexing & Retrieval in Google. Review: AltaVista. BigTable. Index Stream Readers (ISRs) Advanced Search

Logistics. CSE Case Studies. Indexing & Retrieval in Google. Review: AltaVista. BigTable. Index Stream Readers (ISRs) Advanced Search CSE 454 - Case Studies Indexing & Retrieval in Google Some slides from http://www.cs.huji.ac.il/~sdbi/2000/google/index.htm Logistics For next class Read: How to implement PageRank Efficiently Projects

More information

ResPubliQA 2010

ResPubliQA 2010 SZTAKI @ ResPubliQA 2010 David Mark Nemeskey Computer and Automation Research Institute, Hungarian Academy of Sciences, Budapest, Hungary (SZTAKI) Abstract. This paper summarizes the results of our first

More information

Department of Electronic Engineering FINAL YEAR PROJECT REPORT

Department of Electronic Engineering FINAL YEAR PROJECT REPORT Department of Electronic Engineering FINAL YEAR PROJECT REPORT BEngCE-2007/08-HCS-HCS-03-BECE Natural Language Understanding for Query in Web Search 1 Student Name: Sit Wing Sum Student ID: Supervisor:

More information

Excerpt from: Stephen H. Unger, The Essence of Logic Circuits, Second Ed., Wiley, 1997

Excerpt from: Stephen H. Unger, The Essence of Logic Circuits, Second Ed., Wiley, 1997 Excerpt from: Stephen H. Unger, The Essence of Logic Circuits, Second Ed., Wiley, 1997 APPENDIX A.1 Number systems and codes Since ten-fingered humans are addicted to the decimal system, and since computers

More information

A Novel PAT-Tree Approach to Chinese Document Clustering

A Novel PAT-Tree Approach to Chinese Document Clustering A Novel PAT-Tree Approach to Chinese Document Clustering Kenny Kwok, Michael R. Lyu, Irwin King Department of Computer Science and Engineering The Chinese University of Hong Kong Shatin, N.T., Hong Kong

More information

Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information

Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information Satoshi Sekine Computer Science Department New York University sekine@cs.nyu.edu Kapil Dalwani Computer Science Department

More information

Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman

Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman Abstract We intend to show that leveraging semantic features can improve precision and recall of query results in information

More information

Corso di Biblioteche Digitali

Corso di Biblioteche Digitali Corso di Biblioteche Digitali Vittore Casarosa casarosa@isti.cnr.it tel. 050-315 3115 cell. 348-397 2168 Ricevimento dopo la lezione o per appuntamento Valutazione finale 70-75% esame orale 25-30% progetto

More information

International Journal of Scientific & Engineering Research Volume 2, Issue 12, December ISSN Web Search Engine

International Journal of Scientific & Engineering Research Volume 2, Issue 12, December ISSN Web Search Engine International Journal of Scientific & Engineering Research Volume 2, Issue 12, December-2011 1 Web Search Engine G.Hanumantha Rao*, G.NarenderΨ, B.Srinivasa Rao+, M.Srilatha* Abstract This paper explains

More information

Query Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016

Query Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016 Query Processing Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016 Slides re-used with some modification from www.db-book.com Reference: Database System Concepts, 6 th Ed. By Silberschatz,

More information

Query Evaluation Strategies

Query Evaluation Strategies Introduction to Search Engine Technology Term-at-a-Time and Document-at-a-Time Evaluation Ronny Lempel Yahoo! Research (Many of the following slides are courtesy of Aya Soffer and David Carmel, IBM Haifa

More information

Developing MapReduce Programs

Developing MapReduce Programs Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2017/18 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes

More information

CSCI 5417 Information Retrieval Systems Jim Martin!

CSCI 5417 Information Retrieval Systems Jim Martin! CSCI 5417 Information Retrieval Systems Jim Martin! Lecture 4 9/1/2011 Today Finish up spelling correction Realistic indexing Block merge Single-pass in memory Distributed indexing Next HW details 1 Query

More information

Search Engine Architecture II

Search Engine Architecture II Search Engine Architecture II Primary Goals of Search Engines Effectiveness (quality): to retrieve the most relevant set of documents for a query Process text and store text statistics to improve relevance

More information

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser Text Analytics Index-Structures for Information Retrieval Ulf Leser Content of this Lecture Inverted files Storage structures Phrase and proximity search Building and updating the index Using a RDBMS Ulf

More information

THE FACT-SHEET: A NEW LOOK FOR SLEUTH S SEARCH ENGINE. Colleen DeJong CS851--Information Retrieval December 13, 1996

THE FACT-SHEET: A NEW LOOK FOR SLEUTH S SEARCH ENGINE. Colleen DeJong CS851--Information Retrieval December 13, 1996 THE FACT-SHEET: A NEW LOOK FOR SLEUTH S SEARCH ENGINE Colleen DeJong CS851--Information Retrieval December 13, 1996 Table of Contents 1 Introduction.........................................................

More information

Improving Suffix Tree Clustering Algorithm for Web Documents

Improving Suffix Tree Clustering Algorithm for Web Documents International Conference on Logistics Engineering, Management and Computer Science (LEMCS 2015) Improving Suffix Tree Clustering Algorithm for Web Documents Yan Zhuang Computer Center East China Normal

More information

High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture - 23 Hierarchical Memory Organization (Contd.) Hello

More information

Functional Programming in Haskell Prof. Madhavan Mukund and S. P. Suresh Chennai Mathematical Institute

Functional Programming in Haskell Prof. Madhavan Mukund and S. P. Suresh Chennai Mathematical Institute Functional Programming in Haskell Prof. Madhavan Mukund and S. P. Suresh Chennai Mathematical Institute Module # 02 Lecture - 03 Characters and Strings So, let us turn our attention to a data type we have

More information

Using Graphics Processors for High Performance IR Query Processing

Using Graphics Processors for High Performance IR Query Processing Using Graphics Processors for High Performance IR Query Processing Shuai Ding Jinru He Hao Yan Torsten Suel Polytechnic Inst. of NYU Polytechnic Inst. of NYU Polytechnic Inst. of NYU Yahoo! Research Brooklyn,

More information

Indexing and Query Processing. What will we cover?

Indexing and Query Processing. What will we cover? Indexing and Query Processing CS 510 Winter 2007 1 What will we cover? Key concepts and terminology Inverted index structures Organization, creation, maintenance Compression Distribution Answering queries

More information

Applying the KISS Principle for the CLEF- IP 2010 Prior Art Candidate Patent Search Task

Applying the KISS Principle for the CLEF- IP 2010 Prior Art Candidate Patent Search Task Applying the KISS Principle for the CLEF- IP 2010 Prior Art Candidate Patent Search Task Walid Magdy, Gareth J.F. Jones Centre for Next Generation Localisation School of Computing Dublin City University,

More information

Indexing Web pages. Web Search: Indexing Web Pages. Indexing the link structure. checkpoint URL s. Connectivity Server: Node table

Indexing Web pages. Web Search: Indexing Web Pages. Indexing the link structure. checkpoint URL s. Connectivity Server: Node table Indexing Web pages Web Search: Indexing Web Pages CPS 296.1 Topics in Database Systems Indexing the link structure AltaVista Connectivity Server case study Bharat et al., The Fast Access to Linkage Information

More information

International ejournals

International ejournals Available online at www.internationalejournals.com International ejournals ISSN 0976 1411 International ejournal of Mathematics and Engineering 112 (2011) 1023-1029 ANALYZING THE REQUIREMENTS FOR TEXT

More information

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Thaer Samar 1, Alejandro Bellogín 2, and Arjen P. de Vries 1 1 Centrum Wiskunde & Informatica, {samar,arjen}@cwi.nl

More information

Information Retrieval

Information Retrieval Information Retrieval WS 2016 / 2017 Lecture 2, Tuesday October 25 th, 2016 (Ranking, Evaluation) Prof. Dr. Hannah Bast Chair of Algorithms and Data Structures Department of Computer Science University

More information

Searching the Web for Information

Searching the Web for Information Search Xin Liu Searching the Web for Information How a Search Engine Works Basic parts: 1. Crawler: Visits sites on the Internet, discovering Web pages 2. Indexer: building an index to the Web's content

More information

The Gray Code. Script

The Gray Code. Script Course: B.Sc. Applied Physical Science (Computer Science) Year & Sem.: IInd Year, Sem - IIIrd Subject: Computer Science Paper No.: IX Paper Title: Computer System Architecture Lecture No.: 9 Lecture Title:

More information

Information Retrieval and Organisation

Information Retrieval and Organisation Information Retrieval and Organisation Dell Zhang Birkbeck, University of London 2015/16 IR Chapter 04 Index Construction Hardware In this chapter we will look at how to construct an inverted index Many

More information

Rank Preserving Clustering Algorithms for Paths in Social Graphs

Rank Preserving Clustering Algorithms for Paths in Social Graphs University of Waterloo Faculty of Engineering Rank Preserving Clustering Algorithms for Paths in Social Graphs LinkedIn Corporation Mountain View, CA 94043 Prepared by Ziyad Mir ID 20333385 2B Department

More information

IO-Top-k at TREC 2006: Terabyte Track

IO-Top-k at TREC 2006: Terabyte Track IO-Top-k at TREC 2006: Terabyte Track Holger Bast Debapriyo Majumdar Ralf Schenkel Martin Theobald Gerhard Weikum Max-Planck-Institut für Informatik, Saarbrücken, Germany {bast,deb,schenkel,mtb,weikum}@mpi-inf.mpg.de

More information

NTUBROWS System for NTCIR-7. Information Retrieval for Question Answering

NTUBROWS System for NTCIR-7. Information Retrieval for Question Answering NTUBROWS System for NTCIR-7 Information Retrieval for Question Answering I-Chien Liu, Lun-Wei Ku, *Kuang-hua Chen, and Hsin-Hsi Chen Department of Computer Science and Information Engineering, *Department

More information

Chapter 2. Architecture of a Search Engine

Chapter 2. Architecture of a Search Engine Chapter 2 Architecture of a Search Engine Search Engine Architecture A software architecture consists of software components, the interfaces provided by those components and the relationships between them

More information

CS301 - Data Structures Glossary By

CS301 - Data Structures Glossary By CS301 - Data Structures Glossary By Abstract Data Type : A set of data values and associated operations that are precisely specified independent of any particular implementation. Also known as ADT Algorithm

More information

CS347. Lecture 2 April 9, Prabhakar Raghavan

CS347. Lecture 2 April 9, Prabhakar Raghavan CS347 Lecture 2 April 9, 2001 Prabhakar Raghavan Today s topics Inverted index storage Compressing dictionaries into memory Processing Boolean queries Optimizing term processing Skip list encoding Wild-card

More information

Search Engines. Informa1on Retrieval in Prac1ce. Annotations by Michael L. Nelson

Search Engines. Informa1on Retrieval in Prac1ce. Annotations by Michael L. Nelson Search Engines Informa1on Retrieval in Prac1ce Annotations by Michael L. Nelson All slides Addison Wesley, 2008 Indexes Indexes are data structures designed to make search faster Text search has unique

More information

Operating Systems Design Exam 2 Review: Spring 2011

Operating Systems Design Exam 2 Review: Spring 2011 Operating Systems Design Exam 2 Review: Spring 2011 Paul Krzyzanowski pxk@cs.rutgers.edu 1 Question 1 CPU utilization tends to be lower when: a. There are more processes in memory. b. There are fewer processes

More information

Today s topics CS347. Inverted index storage. Inverted index storage. Processing Boolean queries. Lecture 2 April 9, 2001 Prabhakar Raghavan

Today s topics CS347. Inverted index storage. Inverted index storage. Processing Boolean queries. Lecture 2 April 9, 2001 Prabhakar Raghavan Today s topics CS347 Lecture 2 April 9, 2001 Prabhakar Raghavan Inverted index storage Compressing dictionaries into memory Processing Boolean queries Optimizing term processing Skip list encoding Wild-card

More information

CS 416: Opera-ng Systems Design March 23, 2012

CS 416: Opera-ng Systems Design March 23, 2012 Question 1 Operating Systems Design Exam 2 Review: Spring 2011 Paul Krzyzanowski pxk@cs.rutgers.edu CPU utilization tends to be lower when: a. There are more processes in memory. b. There are fewer processes

More information

Information Retrieval

Information Retrieval Introduction to Information Retrieval Boolean retrieval Basic assumptions of Information Retrieval Collection: Fixed set of documents Goal: Retrieve documents with information that is relevant to the user

More information

CPS352 Lecture - Indexing

CPS352 Lecture - Indexing Objectives: CPS352 Lecture - Indexing Last revised 2/25/2019 1. To explain motivations and conflicting goals for indexing 2. To explain different types of indexes (ordered versus hashed; clustering versus

More information

Data Representation. Types of data: Numbers Text Audio Images & Graphics Video

Data Representation. Types of data: Numbers Text Audio Images & Graphics Video Data Representation Data Representation Types of data: Numbers Text Audio Images & Graphics Video Analog vs Digital data How is data represented? What is a signal? Transmission of data Analog vs Digital

More information

RAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE

RAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE RAID SEMINAR REPORT 2004 Submitted on: Submitted by: 24/09/2004 Asha.P.M NO: 612 S7 ECE CONTENTS 1. Introduction 1 2. The array and RAID controller concept 2 2.1. Mirroring 3 2.2. Parity 5 2.3. Error correcting

More information

3-2. Index construction. Most slides were adapted from Stanford CS 276 course and University of Munich IR course.

3-2. Index construction. Most slides were adapted from Stanford CS 276 course and University of Munich IR course. 3-2. Index construction Most slides were adapted from Stanford CS 276 course and University of Munich IR course. 1 Ch. 4 Index construction How do we construct an index? What strategies can we use with

More information

Digital Communication Prof. Bikash Kumar Dey Department of Electrical Engineering Indian Institute of Technology, Bombay

Digital Communication Prof. Bikash Kumar Dey Department of Electrical Engineering Indian Institute of Technology, Bombay Digital Communication Prof. Bikash Kumar Dey Department of Electrical Engineering Indian Institute of Technology, Bombay Lecture - 26 Source Coding (Part 1) Hello everyone, we will start a new module today

More information

Databases 2 Lecture IV. Alessandro Artale

Databases 2 Lecture IV. Alessandro Artale Free University of Bolzano Database 2. Lecture IV, 2003/2004 A.Artale (1) Databases 2 Lecture IV Alessandro Artale Faculty of Computer Science Free University of Bolzano Room: 221 artale@inf.unibz.it http://www.inf.unibz.it/

More information

A Frequent Max Substring Technique for. Thai Text Indexing. School of Information Technology. Todsanai Chumwatana

A Frequent Max Substring Technique for. Thai Text Indexing. School of Information Technology. Todsanai Chumwatana School of Information Technology A Frequent Max Substring Technique for Thai Text Indexing Todsanai Chumwatana This thesis is presented for the Degree of Doctor of Philosophy of Murdoch University May

More information