Index-based Snippet Generation

Saarland University
Faculty of Natural Sciences and Technology I
Department of Computer Science
Master's Program in Computer Science

Master's Thesis

Index-based Snippet Generation

submitted by Gabriel Manolache
on June 9, 2008

Supervisor: Priv.-Doz. Dr. Holger Bast
Advisor: Priv.-Doz. Dr. Holger Bast
Reviewers: Priv.-Doz. Dr. Holger Bast, Prof. Dr. Gerhard Weikum


Statement

Hereby I confirm that this thesis is my own work and that I have documented all sources used.

Gabriel Manolache
Saarbrücken, June 9, 2008

Declaration of Consent

Herewith I agree that my thesis will be made available through the library of the Computer Science Department.

Gabriel Manolache
Saarbrücken, June 9, 2008

Acknowledgments

I would like to express my most profound gratitude to Dr. Holger Bast for giving me the chance to work in his research group on this fascinating project. I am indebted to him for his continuous advice, patience, and willingness to discuss any questions that I have had. His professional enthusiasm and the expertise he shared with me will certainly be a tremendous resource for my later work. I am grateful to both Dr. Holger Bast and Prof. Dr. Gerhard Weikum for reviewing my thesis. I would also like to thank Marjan Celikik for our fruitful collaboration. Last, but not least, I would like to thank my family for their invaluable and continuous support throughout my studies.

Abstract

Ranked result lists with query-dependent snippets have become state of the art in text search. They are typically implemented by searching, at query time, for occurrences of the query words in the top-ranked documents. This document-based approach has three inherent problems: (i) when a document is indexed by terms which it does not contain literally (e.g., related words or spelling variants), localization of the corresponding snippets becomes problematic; (ii) each query operator (e.g., phrase or proximity search) has to be implemented twice: on the index side, in order to compute the correct result set, and on the snippet generation side, to generate the appropriate snippets; and (iii) in the worst case, the whole document needs to be scanned for occurrences of the query words, which is problematic for very long documents. This thesis presents an alternative index-based approach that localizes snippets using information computed solely from the index, and that overcomes all three problems. We show how to achieve this at essentially no extra cost in query processing time, by a technique we call query rotation. We also show how the index-based approach allows the caching of individual segments instead of complete documents, which enables a significantly larger cache hit ratio compared to the document-based approach. We have fully integrated our implementation with the CompleteSearch engine.


Contents

1 Introduction
    1.1 Problem Statement
    1.2 Previous work
    1.3 Motivation of our work
2 Related work
    2.1 Snippets in research: the improvements proposed by Turpin et al.
        The CTS Snippet Engine
        Caching and sentence reordering
    2.2 Snippets in practice: Lucene
3 General information retrieval notions
    3.1 Positional inverted index
    3.2 Advanced search
        Query operators
        Non-literal matches
4 Index-based snippet generation
    Computing all matching positions
        Extended lists
        Query rotation
        Experimental comparison of extended lists and query rotation
    Computing snippet positions
    From snippet positions to snippet text
        Block representation
        Compression of the blocks
    Caching
    Integration with the CompleteSearch engine
5 Index-based versus Document-based Snippet Generation
    Non-literal matches
    Code duplication
    Large documents
    Caching

6 Experiments
    Datasets and Queries
    Space consumption
    Snippet generation time
    Caching
7 Conclusions and Future Work



Chapter 1

Introduction

1.1 Problem Statement

The usual way of interacting with an IR system is to enter a specific information need expressed as a query. As a result, the system provides a ranked list of retrieved documents. For each of these documents, the user can see the title and a few sentences from the document. These few sentences are called a snippet or excerpt, and their role is to help the user decide which of the retrieved documents are most likely to satisfy his information need. Ideally, it should be possible to make this decision without having to refer to the full document text.

Ranked result document lists with query-dependent document snippets, as shown in Figure 1.1, are now state of the art in text search. Some of the earlier web search engines started out with statically precomputed (hence query-independent) document summaries, but by now all major engines have converged to showing snippets centered around the keywords typed by the user. User studies have shown quite a while ago already that properly selected query-dependent snippets are superior to query-independent summaries concerning the speed, precision, and recall with which users can judge the relevance of a hit without actually having to follow the link to the full document [TS98, WJR03].

1.2 Previous work

A variety of methods for query-dependent snippet generation have been described in the literature and implemented in (open source) search engines [1]. They will be discussed in Chapter 2. The methods differ in which of the potentially many snippets are extracted, and in how exactly documents are represented so that snippets can be extracted quickly. On a high level, however, they all follow the same principle two-step approach, which we call document-based:

(D0) For a given query, compute the ids of the top-ranking documents using a suitable precomputed index data structure.
(D1) For each of the top-ranked documents, fetch the (possibly compressed) document text, and extract a selection of segments best matching the given query.

[1] We can only guess how snippet generation is implemented in commercial products or in the big web search engines.

Figure 1.1: Three examples of query-dependent snippets. The first example shows a snippet (with two parts) for an ordinary keyword query; this can easily be produced by the document-based approach. The second example shows a snippet for a combined proximity / or query; in particular, the document-based approach needs to take special care here that only those segments from the document are displayed where the query words indeed occur close to each other. The third example shows a snippet for a semantic query with several non-literal matches of the query words, e.g., Tony Blair matching politician. Processing of this query involves a join of information from several documents, in which case the document-based approach is not able to identify matching segments based on the document text and the query alone. The index-based approach described in this thesis can deal with all three cases without additional effort. For ordinary queries it is at least as efficient as the document-based approach.

Note that we call the first step (D0) (instead of D1) because it is not really part of the snippet generation but rather a prerequisite. The same remark goes for (I0), the first step of our index-based approach, described below.

1.3 Motivation of our work

In this thesis, we investigate the following index-based approach to snippet generation, which to our knowledge has not been studied in the literature so far. The goal of this approach is to overcome the problems that affect the document-based approach. We will show that the index-based method is at least as efficient as the document-based approach for ordinary keyword queries, that it is superior when it comes to non-literal matches, advanced query operators, and large documents, and that it is without alternative for complex query languages involving joins. A more detailed comparison of the document-based and index-based approaches will be presented in Chapter 5.
The index-based approach goes in four steps:

(I0) For a given query, compute the ids of the top-ranking documents. This is just like (D0).

(I1) In addition to the information from (I0), also compute, for each query word, the matching positions of that word in each of the top-ranking documents. We will show how this step can be realized with negligible extra time, compared to (I0) and (I3), and for any ordinary positional index, without a need to change the index's internal data structures. The latter can be a major issue from a systems engineering point of view, depending on the design of the system.

(I2) For each of the top-ranked documents, given the list of matching positions computed in (I1) and given a precomputed segmentation of the document, compute a list of positions to be output. We will show that this step is a simple matter of a k-way merge / intersection, and that it takes negligible time compared to (I0) and (I3).

(I3) Given the list of positions computed in (I2), produce the actual text to be displayed to the user. We will show how to preprocess (in particular: compress) the documents, so that this step can be done efficiently, accessing only those portions of the (possibly very large) document which actually contain the snippets to be output.

The index-based approach presented in this thesis is joint work with Dr. Holger Bast and Marjan Celikik. It led to a submission to the ACM 17th Conference on Information and Knowledge Management. My work concentrated mainly on steps (I1) and (I2); they will be presented in more detail than the remaining parts of the method (step (I3) and snippet caching).
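To make the four steps concrete, here is a minimal Python sketch of (I0)-(I2) on a toy positional index. The index contents, the purely conjunctive query semantics, and the segment representation are all illustrative assumptions, not the actual CompleteSearch data structures.

```python
# Toy positional index: word -> list of (doc_id, position) postings.
# All data below is hypothetical, for illustration only.
INDEX = {
    "roman":   [(7, 3), (23, 1), (47, 2), (47, 9)],
    "emperor": [(7, 4), (23, 5), (47, 10)],
}

def top_documents(query):
    # (I0): ids of documents containing all query words (conjunctive query).
    doc_sets = [{d for d, _ in INDEX[w]} for w in query]
    return sorted(set.intersection(*doc_sets))

def matching_positions(query, doc_id):
    # (I1): for each query word, its matching positions in the document.
    return {w: [p for d, p in INDEX[w] if d == doc_id] for w in query}

def snippet_positions(positions, segments):
    # (I2): merge all matching positions and keep the segments
    # (given as (start, end) position pairs) containing at least one match.
    hits = sorted(p for ps in positions.values() for p in ps)
    return [seg for seg in segments if any(seg[0] <= p < seg[1] for p in hits)]

query = ["roman", "emperor"]
docs = top_documents(query)                       # [7, 23, 47]
pos = matching_positions(query, 47)               # {'roman': [2, 9], 'emperor': [10]}
segs = snippet_positions(pos, [(0, 8), (8, 16)])  # both segments contain a match
```

Step (I3) would then fetch only the text of the segments in `segs` from the compressed document representation.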


Chapter 2

Related work

The literature abounds with work on the general topic of document summarization. However, most of this work is concerned with query-independent summarization, which is not our topic here. For a recent list of references, see [VH06]. Note that all the big search engines have query-dependent snippets by now, in particular Google, Yahoo Search, MSN Search, and AltaVista (which started with query-independent summaries). Tombros and Sanderson [TS98] presented the first in-depth study showing that properly selected query-dependent snippets are superior to query-independent summaries with respect to the speed, precision, and recall with which users can judge the relevance of a hit without actually having to follow the link to the full document. In this work, we take the usefulness of query-dependent result snippets for granted.

Usually more segments match the query than can (and should) be displayed to the user, so that a selection has to be made. Questions of a proper such selection / ranking have been studied by various authors. For example, in a recent work, [VH06] have proposed an algorithm for computing segments that are as semantically related to each other as possible. Here, in this thesis, the focus is on efficiency and feasibility aspects. For the scoring and ranking of segments, we adopt the simple yet effective scheme from [TTHW07], described below in Figure 2.1. Only recently, Turpin et al. [TTHW07] have presented the first in-depth study of the document-based approach with respect to efficiency (in time and space).

2.1 Snippets in research: the improvements proposed by Turpin et al.

The work presented in [TTHW07] follows the document-based approach. The authors propose and analyze a document compression method that reduces snippet generation time by 58% over a baseline using the zlib compression library.
The experiments reveal that finding documents on secondary storage dominates the total cost of generating snippets, so caching is used to speed up the snippet generation process. A solution for avoiding scanning whole documents is also proposed and analyzed: document reordering and compaction. This method increases the number of document cache hits but has an impact on snippet quality; it doubles the number of documents that can fit in a fixed-size cache.

Input: A document broken into one sentence per line, and a sequence of query terms.
Output: Remove the number of sentences required from the heap to form the summary.
1. For each line of text, L = [w1, w2, ..., wm]:
2.     Let h be 1 if L is a heading, 0 otherwise.
3.     Let l be 2 if L is the first line of a document, 1 if it is the second line, 0 otherwise.
4.     Let c be the number of wi that are query terms, counting repetitions.
5.     Let d be the number of distinct query terms that match some wi.
6.     Identify the longest contiguous run of query terms in L, say wj ... wj+k.
7.     Use a weighted combination of c, d, k, h and l to derive a score s.
8.     Insert L into a max-heap using s as a key.

Figure 2.1: Simple sentence ranker that operates on raw text with one sentence per line.

The CTS Snippet Engine

As we already explained, step (D1) must provide, for each of the top-ranked documents, a snippet: text that attempts to summarize the document and contains (at least some of) the query words. Similarly to previous work on summarization [Luh58], Turpin et al. identify the sentence as the minimal unit for extraction and presentation to the user. In order to construct a snippet, all sentences in a document are ranked against the query and the top two or three are returned as a snippet. The scoring of sentences against queries has been explored in several papers [GKMC99, Luh58, SSJ01, TS98, WRJ02], with different features of the sentences deemed important. Based on these studies, Figure 2.1 presents the general algorithm for scoring sentences in relevant documents. The top-ranked sentences form the snippet. The final score of a sentence, assigned in step 7, can be computed in various ways. In order to avoid bias towards any particular scoring mechanism, Turpin et al. compare sentence quality using the individual components of the score, rather than an arbitrary combination of the components.
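As an illustration, the ranker of Figure 2.1 can be sketched in Python as follows. The heading test and the weight vector are placeholder assumptions, since the paper deliberately leaves the weighted combination in step 7 open.

```python
import heapq

def rank_sentences(lines, query_terms, weights=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """Score each line as in Figure 2.1 and return a max-heap of sentences.
    The weight vector and the all-caps heading test are assumptions."""
    qset = {t.lower() for t in query_terms}
    heap = []
    for i, line in enumerate(lines):
        words = [w.lower().strip(".,!?") for w in line.split()]
        h = 1 if line.isupper() else 0                 # crude heading test
        l = 2 if i == 0 else (1 if i == 1 else 0)      # position in document
        c = sum(1 for w in words if w in qset)         # query terms, with repeats
        d = len(set(words) & qset)                     # distinct query terms
        k = run = 0                                    # longest contiguous run
        for w in words:
            run = run + 1 if w in qset else 0
            k = max(k, run)
        wc, wd, wk, wh, wl = weights
        s = wc * c + wd * d + wk * k + wh * h + wl * l
        heapq.heappush(heap, (-s, line))               # negate: heapq is a min-heap
    return heap

heap = rank_sentences(
    ["The roman emperor ruled.", "Cats sleep a lot.", "Emperor penguins swim."],
    ["roman", "emperor"])
best = heapq.heappop(heap)[1]   # highest-scoring sentence
```

Popping the heap repeatedly yields the sentences in decreasing score order, as required by the Output line of Figure 2.1.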
Since Web data is often poorly structured, poorly punctuated, and contains a lot of data that does not form part of valid sentences that would be candidates for parts of snippets, it is assumed that the documents passed to the Snippet Engine have all HTML tags and JavaScript removed, and that each document is reduced to a series of word tokens separated by non-word tokens. A word token is defined as a sequence of alphanumeric characters, while a non-word is a sequence of non-alphanumeric characters such as whitespace and the other punctuation symbols. Both are limited to a maximum of 50 characters. Adjacent, repeating characters are removed from the punctuation. Included in the punctuation set is a special end-of-sentence marker which replaces the usual three sentence terminators "?", "!", and ".". Often these explicit punctuation characters are missing, and so HTML tags such as <br> and <p> are assumed to terminate sentences. In addition, a sentence must contain at least five words and no more than twenty words, with longer or shorter sentences being broken and joined as required to meet these criteria [KPC95]. Unterminated HTML tags, that is, tags with an open brace but no close brace, cause all text from the open brace to the next open brace to be discarded.

Having defined the format of documents that are presented to the Snippet Engine, the next important characteristic of this method is the Compressed Token System (CTS) document storage scheme, and the baseline system used for comparison. For the baseline, an obvious document representation scheme is utilized: each document is simply compressed with a well-known adaptive compressor, and then decompressed as required [BP98]. Such a system has been implemented

by the authors using zlib [GA07] with default parameters to compress every document. Each document is stored in a single file. While manageable for small test collections or small enterprises with millions of documents, a full Web search engine may require multiple documents to inhabit single files, or a special-purpose file system [HL03]. For snippet generation, the required documents are decompressed one at a time, and a linear search for the provided query terms is employed. The search is optimized for matching whole words and the sentence-terminating token, rather than general pattern matching.

The CTS Snippet Engine makes a series of optimizations over the baseline presented in the previous paragraph. The first is to employ a semi-static compression method over the entire document collection, which allows faster decompression with minimal compression loss. Using a semi-static approach involves mapping words and non-words produced by the parser to single integer tokens, with frequent symbols receiving small integers, and then choosing a coding scheme that assigns small numbers a small number of bits. Words and non-words strictly alternate in the compressed file, which always begins with a word. Each symbol is assigned its ordinal number in a list of symbols sorted by frequency. The vbyte coding scheme is used to code the word tokens [IHWB99]. The set of non-words is limited to the 64 most common punctuation sequences in the collection itself, and these are encoded with a flat 6-bit binary code. The remaining 2 bits of each punctuation symbol are used to store capitalization information. The process of computing the semi-static model is complicated by the fact that the number of words and non-words appearing in large web collections is high. Moffat et al.
[AMS97] have examined schemes for pruning models during compression using large alphabets, and conclude that rarely occurring terms need not reside in the model. Rather, rare terms are spelt out in the final compressed file, using a special word token (escape symbol) to signal their occurrence. Using these results, Turpin et al. employ two move-to-front queues: one for words and one for non-words. During the first pass of encoding, whenever the available memory is consumed and a new symbol is discovered, an existing symbol is discarded from the last half of the queue, provided that it has frequency one. If there is no such symbol, then a symbol with frequency two is evicted, and so on. The second pass of encoding replaces each word with its vbyte-encoded number, or the escape symbol and an ASCII representation of the word if it is not in the model. Similarly, each non-word sequence is replaced with its codeword, or the codeword for a single space character if it is not in the model. This lossy compression of non-words is acceptable when the documents are used for snippet generation, but may not be acceptable for a document database.

Besides the fact that it allows faster decompression, the semi-static scheme also has the advantage that it readily allows direct matching of query terms as compressed integers in the compressed file. This way, sentences can be scored without having to decompress a document, and only the sentences returned as part of a snippet need to be decoded. The CTS system stores all documents contiguously in one file, with an auxiliary table of 64-bit integers indicating the start offset of each document in the file. Further, it must have access to the reverse mapping of term numbers, allowing those words not spelt out in the document to be recovered and returned to the Query Engine as strings.
The first of these data structures can be readily partitioned and distributed if the Snippet Engine occupies multiple machines; the second, however, is not so easily partitioned, as any document on a remote machine might require access to the whole integer-to-string mapping. This is the second reason for employing the model pruning step during construction of the semi-static code: it limits the size of the reverse mapping table that should be present on every machine implementing the Snippet Engine.
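For illustration, one common variant of the vbyte coding mentioned above can be sketched as follows: 7 data bits per byte, with the high bit set on the last byte of each number. The exact byte layout used by CTS may differ; this is only a sketch of the general technique.

```python
def vbyte_encode(n):
    """Encode a non-negative integer: low 7 bits first, high bit marks
    the terminating byte (one common vbyte convention, an assumption here)."""
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)   # 7 data bits, continuation implied by clear high bit
        n >>= 7
    out.append(n | 0x80)       # final byte carries the high-bit terminator
    return bytes(out)

def vbyte_decode(data):
    """Decode a concatenation of vbyte-encoded integers."""
    n, shift, values = 0, 0, []
    for b in data:
        if b & 0x80:                               # terminator byte
            values.append(n | ((b & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= b << shift
            shift += 7
    return values

enc = vbyte_encode(300) + vbyte_encode(5)
vbyte_decode(enc)   # [300, 5]
```

Small (frequent) token numbers thus occupy a single byte, which is what makes the frequency-sorted symbol numbering of the semi-static model pay off.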

Caching and sentence reordering

In order to speed up the snippet generation process, Turpin et al. make use of a Snippet Engine cache whose size is proportional to the size of the document collection. Using simulation, the authors have compared two caching policies: a static cache, where the cache is loaded with as many documents as it can hold before the system begins answering queries, and then never changes; and a least-recently-used (LRU) cache, which starts out like the static cache, but whenever a document is accessed it moves to the front of a queue, and if a document has to be fetched from disk, the last item in the queue is evicted. It is assumed that a query cache exists for the top Q most frequent queries, and that these queries are never processed by the Snippet Engine. The results of the experiments reveal that the static cache performs well, but it is outperformed by the LRU cache: if as little as 1% of the documents are cached, then around 75% of the disk seeks can be avoided.

Besides compression, another approach for reducing the size of the documents in the cache is explored: instead of caching whole documents, only sentences that are likely to be used in snippets are stored. If, during snippet generation on a cached document, the sentence scores do not reach a certain threshold, then the whole document is retrieved from disk. This means that it is very important which sentences from a document are stored in the cache and which are left on disk. A sentence reordering can be performed for each document such that sentences that are very likely to appear in snippets are at the front of the document. This way, they are processed first at query time, and once enough sentences are found to generate a snippet, the rest of the document can be ignored. Further, to improve caching, only the head of each document can be stored in the cache, with the tail residing on disk.
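The LRU policy described above can be sketched in a few lines of Python; the capacity, the disk-loading callback, and the hit/miss counters are illustrative assumptions, not part of the original system.

```python
from collections import OrderedDict

class LRUDocumentCache:
    """Minimal LRU document cache in the spirit of the policy above:
    a hit moves the document to the front, a miss evicts the
    least-recently-used entry once capacity is reached (sketch)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.docs = OrderedDict()           # doc_id -> document text
        self.hits = self.misses = 0

    def fetch(self, doc_id, load_from_disk):
        if doc_id in self.docs:
            self.hits += 1
            self.docs.move_to_end(doc_id)        # mark as most recently used
        else:
            self.misses += 1
            if len(self.docs) >= self.capacity:
                self.docs.popitem(last=False)    # evict least recently used
            self.docs[doc_id] = load_from_disk(doc_id)
        return self.docs[doc_id]

cache = LRUDocumentCache(capacity=2)
load = lambda d: f"text of document {d}"
for d in [1, 2, 1, 3, 1]:
    cache.fetch(d, load)
# the two accesses to document 1 after its first load are hits;
# document 2 is evicted when document 3 arrives
```

A static cache would simply omit `move_to_end` and the eviction, refusing to load anything new once full.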
One problem of this approach is that the search engine must now be able to provide copies of the documents itself. The snippet generation engine can no longer do this, since it does not have access to the exact text of the document as it was indexed. Four sentence reordering schemes are considered:

Natural order. The first few sentences of a well-authored document usually best describe the document content [Luh58]. Thus simply processing a document in order should yield a quality snippet. Unfortunately, web documents are often not well authored, with little editorial or professional writing skill brought to bear on the creation of a work of literary merit. Also, taking into account that query-biased snippets are being produced, there is no guarantee that query terms will appear in sentences toward the front of a document.

Significant terms (ST). Luhn introduced the concept of a significant sentence as containing a cluster of significant terms [Luh58], a concept found to work well by Tombros and Sanderson [TS98]. Let f_{d,t} be the frequency of term t in document d, and let s_d be the number of sentences in d. Then term t is determined to be significant if

    f_{d,t} >= 7 - 0.1 * (25 - s_d),  if s_d < 25
    f_{d,t} >= 7,                     if 25 <= s_d <= 40
    f_{d,t} >= 7 + 0.1 * (s_d - 40),  otherwise.

A bracketed section is defined as a group of terms where the leftmost and rightmost terms are significant terms, and no significant terms in the bracketed section are divided by more than four non-significant terms. The score of a bracketed section is the square of the number of significant words

falling in the section, divided by the total number of words in the entire sentence. The a priori score for a sentence is computed as the maximum of all scores for the bracketed sections of the sentence. The sentences are then sorted by this score.

Query log based (QLt). Many Web queries repeat, and a small number of queries make up a large volume of total searches [BJP05]. In order to take advantage of this bias, sentences that contain many past query terms are promoted to the front of a document, while sentences that contain few query terms are demoted. In this scheme, the sentences are sorted by the number of sentence terms that occur in the query log. To ensure that long sentences do not dominate shorter, qualitative sentences, the score assigned to each sentence is divided by the number of terms in that sentence, giving each sentence a score between 0 and 1.

Query log based (QLu). This scheme is as for QLt, but repeated terms in the sentence are only counted once.

The purpose of using the ST, QLt or QLu schemes is to terminate snippet generation earlier than if Natural Order is used, but still produce sentences with the same number of unique query terms (d in Figure 2.1), the same total number of query terms (c), the same positional score (h + l) and the same maximum span (k). Experiments revealed that sorting sentences using the Significant Terms (ST) method leads to the smallest change in the sentence scoring components. The greatest change over all methods is in the sentence position (h + l) component of the score, which is to be expected, as there is no guarantee that leading and heading sentences are processed at all after sentences are re-ordered. The second most affected component is the number of distinct query terms in a returned sentence, but if only the first 50% of the document is processed with the ST method, there is a drop of only 8% in the number of distinct query terms found in snippets.
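The significant-terms test of the ST scheme can be sketched as a small function. Note that the piecewise thresholds here follow the formula as reconstructed above from the cited works, so the exact coefficients should be treated as an assumption.

```python
def significance_threshold(s_d):
    """Threshold on the term frequency f_{d,t}, with s_d the number of
    sentences in the document (coefficients as reconstructed above)."""
    if s_d < 25:
        return 7 - (25 - s_d) / 10
    elif s_d <= 40:
        return 7.0
    else:
        return 7 + (s_d - 40) / 10

def significant_terms(term_freqs, s_d):
    """term_freqs maps each term t to its frequency f_{d,t} in the document;
    return the set of terms meeting the significance threshold."""
    threshold = significance_threshold(s_d)
    return {t for t, f in term_freqs.items() if f >= threshold}

significance_threshold(10)                    # 5.5
significant_terms({"rome": 6, "cat": 2}, 10)  # {'rome'}
```

Intuitively, short documents get a lower bar for significance, while very long documents require proportionally higher term frequencies.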
The authors argue that there is little overall effect on scores when processing only half the document using the ST method, depending on how the various components are weighted to compute the overall snippet score. In conclusion, the Turpin et al. method follows the document-based approach, providing a series of mechanisms for improving the speed of access to the document text: caching, sentence reordering, and a semi-static compression method for faster decompression.

2.2 Snippets in practice: Lucene

All open-source engines we know of that provide query-dependent snippet generation follow the document-based approach. In particular, this holds true for Lucene [Cut]. Lucene is one of the most popular free / open source information retrieval libraries, providing support for indexing and search. It is supported by the Apache Software Foundation and is released under the Apache Software License. The early versions of Lucene use a straightforward implementation of the document-based approach: at query time, each of the top-ranked documents is re-parsed, and segments containing the query words are extracted. The highlighting of the query words in these segments is done by simply comparing the tokens of the document with the tokens in the query. The tokens of the document can be built from the index or by analyzing the document. If there is a match

between one token in the document and one token in the query, then the original text can be reconstructed, with the query words highlighted, by using stored information about the original offsets in the document for each of the tokens. The Lucene highlighter can handle stemmed words, since both the query terms and the document content are processed by the same parser. This means that words such as algorithmic, algorithm, algorithms are stemmed to the same root form in both query and document content. The tokens produced by the parser include the byte offsets of the original full word, not just the stemmed form, so the highlighter knows the full extent of what to highlight in the text.

The Lucene highlighter manages fuzzy queries by expanding them to all similar terms. An example of such a fuzzy query is the erroneous word belies: it might be expanded to the disjunctive query belief believe. Such an approach has the obvious disadvantage that the number of query words will increase depending on the size of the set of similar terms. This represents an important scalability problem. To address the efficiency problems of the approach presented above, recent versions of Lucene provide support for storing the sequence of term ids output by the parser (much in the vein of [TTHW07]), and even precompute and store for each document a small index for fast location of the query words. However, all of these enhancements remain in the realm of the document-based approach, and as such do not address the important problems of non-literal matches, code duplication, and coarse caching granularity pointed out in Chapter 5.

Chapter 3

General information retrieval notions

This chapter describes the basic IR concepts and notions required by our snippet generation method. The first part introduces the data structure used for query processing and the operations performed on this data structure. Here is also where we describe the first (common) step of both the index-based and the document-based snippet generation methods. The second part provides details about the advanced search options that our method supports in an efficient manner.

3.1 Positional inverted index

The inverted index, sometimes known as inverted file or postings file, is one of the major concepts in information retrieval. It is the data structure of choice for most search applications: it is very efficient for short queries, easy to implement, and can be compressed well. The retrieval is performed on a group of documents known as a (document) collection or corpus (a body of texts). Based on the document collection, a lexicon or vocabulary is built: a list of the terms that appear in the corpus. An inverted index contains, for each term in the lexicon, an inverted list that stores a list of pointers to all occurrences of that term in the main text, where each pointer is, in effect, the number of a document in which the term appears [IHWB99]. Here is an example of such an inverted list:

    doc ids    D100  D129  D1401  D2722  D3000

The meaning is that the term to which this inverted list belongs appears in the documents with the ids D100, D129, D1401, D2722 and D3000. Our method requires a positional inverted index. This simply means that each inverted list must store, for each document, the positions where the corresponding term appears. For such an index, an inverted list conceptually looks as follows:

    doc ids    D7   D23  D47  D47  D63
    positions  6    ...  ...  ...  ...
    word ids   W12  W12  W12  W12  W12
    scores     2    ...  ...  ...  ...

For example, the first inverted list entry (also called a posting) means that the word/term with id W12 occurs in the document with id D7 at position 6 with a score of 2. It is important to note that the doc ids are sorted in increasing order. Some doc ids are repeated if the term appears more than once in the corresponding document. For each occurrence of the term in a document, a score that reflects the importance of that particular occurrence is assigned. Individual scores are aggregated to per-document scores, according to which the documents are eventually ranked. Based on the aggregated scores, only the most relevant documents are presented to the user.

In the inverted list presented above, all postings have the same value for the word ids entry, namely W12. This means that the inverted list belongs to a single word, for example roman. If we want to search for all words that start with rom, then our query word will be rom*. The asterisk denotes that any sequence of characters may follow after rom. Assuming that our collection contains only roman and rome as words that contain rom as a prefix, the result of our query should return the doc ids, positions and scores for these two words. The result posting list might look as follows:

    doc ids    D7   D23  D31  D47  D47  D63  D87  D99
    positions  ...
    word ids   W12  W12  W67  W12  W12  W12  W67  W67
    scores     ...

It can be noticed that this inverted list, belonging to the query word rom*, contains the postings corresponding to the query word roman (the ones for which the value of word ids is W12), as well as the postings for the query word rome (having W67 as their word ids value). Prefix queries of this kind are the reason for using a word ids entry for each of the postings.
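The expansion of a prefix query like rom* can be sketched as follows, using a hypothetical lexicon and posting lists; the postings here carry only (doc id, word id) pairs, with positions and scores omitted for brevity.

```python
# Hypothetical lexicon (word -> word id) and per-word posting lists
# of (doc_id, word_id) pairs; all data is illustrative.
LEXICON = {"roman": "W12", "rome": "W67", "saxon": "W90"}
POSTINGS = {
    "W12": [(7, "W12"), (23, "W12"), (47, "W12")],
    "W67": [(31, "W67"), (87, "W67"), (99, "W67")],
    "W90": [(12, "W90")],
}

def prefix_query(prefix):
    """Union of the posting lists of all lexicon words with the given
    prefix, merged by doc id; each posting keeps its word id, so the
    matching word can still be identified afterwards."""
    word_ids = [wid for word, wid in LEXICON.items() if word.startswith(prefix)]
    return sorted(p for wid in word_ids for p in POSTINGS.get(wid, []))

prefix_query("rom")
# [(7, 'W12'), (23, 'W12'), (31, 'W67'), (47, 'W12'), (87, 'W67'), (99, 'W67')]
```

In a real index, the words sharing a prefix occupy a contiguous range of the sorted lexicon, so this expansion does not require scanning the whole vocabulary.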
Given a query, the basic operation is to either intersect the lists of the query words (to obtain the ids of all documents that contain all query words; such queries are called conjunctive) or to compute their union (to obtain the ids of all documents that contain at least one of the query words; such queries are called disjunctive). Let us consider an example for the conjunctive query roman emperor julius, where the query words have the following inverted lists:

roman

    doc ids    D7   D23  D47  D47  D63
    positions
    word ids   W12  W12  W12  W12  W12
    scores

emperor

    doc ids    D3   D7   D47  D63
    positions
    word ids   W34  W34  W34  W34
    scores

julius

    doc ids    D7   D7   D39  D47  D63
    positions
    word ids   W27  W27  W27  W27  W27
    scores

In order to compute the inverted list for the entire query, the lists of the individual query words must be intersected according to the doc ids. For our example, the resulting inverted list is the following:

    doc ids     D7  D7  D7  D7  D47  D47  D47  D47  D63  D63  D63
    positions
    word ids    W9  W9  W9  W9  W9   W9   W9   W9   W9   W9   W9
    scores
    query word

Observe that the postings of the result inverted list have one more entry, named query word. After intersecting the posting lists of the query words, we need to know, for each of these query words, all its positions in the common documents (returned by the intersection). The query word entry tells from which query word a posting stems; without it, it would not be possible to say which position refers to which query word. More details about this topic will be presented in Chapter 4.

At this point it is important to notice that this straightforward approach has two major drawbacks: the result inverted list has been changed by adding one more entry to each posting, and it is also very long. We have only three common document ids, yet eleven postings. In practice, the resulting inverted lists might contain millions of different doc ids, so ranking of the postings is essential in order to consider only the top K most relevant documents. The ranking is based on the scores from the result list, which are obtained through score aggregation during the intersection procedure. For our example, the aggregation function is the sum of the scores corresponding to the same doc id. If we consider the document with the id D7, the scores associated with the occurrences of the three query words are 2, 10, 6 and 33. In the result list, the score associated with D7 is the sum of all these scores: 51.
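The intersection with sum aggregation and top-K ranking described above can be sketched as follows. This is a minimal sketch, not the thesis code; the D7 scores (2, 10, 6, 33) and the positions of the three words in D7 and D47 are taken from the text, while the remaining positions and scores are made up:

```python
# Sketch of conjunctive query processing with sum score aggregation and
# top-k ranking. Values not stated in the text are hypothetical.
postings = {  # word -> doc id -> list of (position, score)
    "roman":   {7: [(6, 2)], 23: [(14, 1)], 47: [(26, 3), (38, 1)], 63: [(9, 2)]},
    "emperor": {3: [(5, 4)], 7: [(31, 10)], 47: [(27, 5)], 63: [(40, 3)]},
    "julius":  {7: [(2, 6), (39, 33)], 39: [(8, 2)], 47: [(28, 4)], 63: [(1, 1)]},
}

def conjunctive_top_k(postings, query, k):
    """Doc ids containing all query words, ranked by aggregated score."""
    # doc ids that appear in every query word's list
    common = set.intersection(*(set(postings[w]) for w in query))
    # aggregate: sum the scores of all occurrences of all query words
    aggregated = {doc: sum(score for w in query for _, score in postings[w][doc])
                  for doc in common}
    return sorted(aggregated.items(), key=lambda item: -item[1])[:k]

top = conjunctive_top_k(postings, ["roman", "emperor", "julius"], k=2)
# D7 aggregates 2 + 10 + 6 + 33 = 51, as computed in the text
```

The three common documents are D7, D47 and D63, and with k = 2 only the two highest-scoring ones survive.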
The scores for the rest of the doc ids are computed using the same aggregation function. Let us assume that we are only interested in the two most relevant documents returned for the query roman emperor julius. In this case, only the postings corresponding to the two highest-scoring documents are kept in the result list:

    doc ids     D7  D7  D7  D7  D47  D47  D47  D47
    positions
    word ids    W9  W9  W9  W9  W9   W9   W9   W9
    scores
    query word

The example presented in this section illustrates the main steps of how a query is processed by a search engine. Computing the ids of the top-ranked documents is the first step of the index-based method and also the first step of the document-based approach. We call this step (I0) (index-based) or (D0) (document-based), depending on the snippet generation method. We use the (D0) / (I0) notation for this step (instead of (D1) / (I1)) because it is not really part of the snippet generation but rather a prerequisite.

3.2 Advanced search

Besides the usual conjunctive or disjunctive queries, our method supports, without additional effort, queries with operators and advanced queries (synonym search, error-tolerant search, semantic search, etc.). These two categories of queries are problematic when handled by the document-based method in its basic form. The following two subsections present descriptions and examples for these two categories of queries.

3.2.1 Query operators

A query operator consists of one or more characters that act not as a query word, but as an instruction on how a query is to be processed. An operator can work at word level, where it applies to a single query term, or at query level, where its presence affects the processing of the entire query. Our method is completely unaware of the query words and operators that are part of the query: it uses only the positions provided by the positional inverted index to identify the words for which snippets must be generated. This means that snippet generation will work with any query operator. We have experimented with the phrase operator, the proximity operator and the disjunction operator.
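Of these, the phrase operator is the most restrictive: it keeps a document only if the query words occur at consecutive positions, which can be checked directly during the intersection. A minimal sketch, assuming toy positional lists for roman and emperor (only positions 6, 31, 26 and 27 are values stated in this chapter; the rest are made up):

```python
# Sketch of phrase intersection for two words: keep a document only if
# some occurrence of the first word is immediately followed by an
# occurrence of the second. Positions not stated in the text are made up.
roman   = {7: [6], 23: [14], 47: [26, 38], 63: [9]}    # doc id -> positions
emperor = {3: [5], 7: [31], 47: [27], 63: [40]}

def phrase_match(first, second):
    """Doc ids where `first` and `second` occur at consecutive positions."""
    return [doc for doc in first.keys() & second.keys()
            if any(pos + 1 in second[doc] for pos in first[doc])]

result = phrase_match(roman, emperor)
# only D47 qualifies: roman at 26 is immediately followed by emperor at 27
```

A proximity operator would only change the adjacency test (e.g. to a distance of at most 5), and a disjunction would replace the intersection by a union.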
The phrase operator requires that two query words be adjacent. The proximity operator requires that two query words be within a distance of at most 5 words of each other. The disjunction operator allows a search for any one of a group of two or more query words.

Let us consider the query roman.emperor, which contains the phrase operator, and the following text:

The Roman Emperor was the ruler of the Roman State during the imperial period (starting at about 27 BC). The Romans had no single term for the office: Latin titles such as imperator (from which English emperor ultimately derives), augustus, caesar and princeps were all associated with it.

As results for this query, only the occurrences of roman followed immediately by emperor are considered, i.e., the phrase Roman Emperor at the beginning of the text. The other occurrences of roman and emperor (such as Roman State or the standalone emperor) are not considered, since our query contains the phrase operator between the two words. Such operator queries do not require additional effort because everything is done during the intersection procedure. Consider again the inverted lists of the two query words:

roman

    doc ids    D7   D23  D47  D47  D63
    positions  6         26   38
    word ids   W12  W12  W12  W12  W12
    scores

emperor

    doc ids    D3   D7   D47  D63
    positions       31   27
    word ids   W34  W34  W34  W34
    scores

The intersection procedure simply checks whether there are any consecutive positions for the common document ids in the two lists. In our example, only for document id D47 do we have the consecutive positions 26 and 27, so the result inverted list for our query is:

    doc ids    D47
    scores     32

The other two types of operators can be handled in a similar fashion. Without a positional inverted index, processing such operator queries would take more time: first the documents that simply contain the query words would have to be identified, and then all these documents would have to be scanned in order to compute the distances between the query words.

3.2.2 Non-literal matches

Index-based snippet generation is also able to deal with the vocabulary mismatch problem, encountered when a relevant document does not literally contain (one of) the query words. A nice list of commented examples from the TREC benchmarks is given in [Buc04]. Let us consider the semantic query concept:city, which means that we are looking for all the names of cities in the collection, and not for occurrences of the word city. We have marked this fact by prefixing the query with the concept keyword. Given this scenario, it would be desirable that the following excerpt is taken into consideration:

Rome achieved great glory under Octavian/Augustus. He restored peace after 100 years of civil war; maintained an honest government and a sound currency system;

The reason for considering this snippet is that it contains the word Rome, which can be thought of as a query word (it is the name of a city). A common way to provide such behavior and overcome the vocabulary mismatch problem is to index documents under terms that are related to words in the document, but do not literally occur themselves.
Let us take as an example a collection that contains only Rome and Alexandria as names of cities, with the following inverted lists:

Alexandria

    doc ids    D64  D88
    positions
    word ids   W42  W42
    scores

Rome

    doc ids    D31  D87  D99
    positions
    word ids   W67  W67  W67
    scores

Then the inverted list for the term concept:city is composed of all the postings belonging to the previous two terms:

concept:city

    doc ids    D31  D64  D87  D88  D99
    positions
    word ids   W67  W42  W67  W42  W67
    scores

There is a variety of advanced search features which call for non-literal matches: related-words search (as in our example), prefix search (for a query containing alg, find algorithm), error-correcting search (for a query containing algorithm, find the misspelling algoritm), semantic search (for the query musician, find the entity reference John Lennon), etc.
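Building the concept:city list above amounts to merging the doc-id-sorted lists of its member words. A minimal sketch, with postings reduced to (doc id, word id) pairs and positions/scores omitted:

```python
import heapq

# Sketch: the inverted list of a concept term is the doc-id-ordered merge
# of the posting lists of its member words (toy postings from the example).
member_lists = {
    "alexandria": [(64, "W42"), (88, "W42")],
    "rome":       [(31, "W67"), (87, "W67"), (99, "W67")],
}

# heapq.merge combines the already-sorted member lists in a single pass
concept_city = list(heapq.merge(*member_lists.values()))
```

The word ids entry of each merged posting still tells which city name the posting stems from, which is exactly what the index-based snippet generator needs later.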

Chapter 4

Index-based snippet generation

On a high level, our snippet generation method follows a four-step approach. We have already discussed the first step, (I0), in Chapter 3. This chapter presents in detail the remaining three steps of our method. In the first part we describe and analyze two methods for computing the positions of the query words in the top-ranked documents (I1), using a positional inverted index. We continue by showing how to use the positions of the query words to determine the positions of the snippets (I2). The third part is dedicated to the last step of our method: using the snippet positions to get the actual snippet text, which is finally presented to the user (I3). We finish this chapter with some details about the snippet caching mechanism and the integration with the CompleteSearch engine.

4.1 Computing all matching positions

Throughout this work, we assume that we are using the positional index presented in Chapter 3, with posting lists that conceptually look as follows:

    doc ids    D401  D1701  D1701  D1701  D1807
    word ids   W173  W173   W173   W173   W173
    positions         12
    scores            0.4

For example, the third posting from the right says that the word with id W173 occurs in the document with id D1701 at position 12, and was assigned a score of 0.4. In an actual implementation these lists would be stored in compressed format, but we need not consider this level of detail in what follows. A standard inverted index has one such list precomputed for each word (that is, all word ids are the same within each such list). Pruning techniques are often used to avoid a full scan of the index lists involved, especially in the case of disjunctive queries; for example, see [AM06] [BMS+06]. The results presented in Subsection pertain to such a pruning technique for disjunctive queries, following the ideas of [BMS+06].
The second step of our method, named (I1), uses such a positional inverted index: it takes as input the doc ids of the top-ranked documents and the posting lists of the query words (that make up the user's query). In order to compute the doc ids of the top-ranked documents, the

(I0) step must load from disk the posting lists of the query words. This means that (I0) can provide these posting lists to (I1) without additional effort. If we consider again the example from Chapter 3, the conjunctive query roman emperor julius, then the input of step (I1) consists of the doc ids of the top-ranked documents and the inverted lists of the query words:

top-ranked documents

    doc ids    D7   D47
    scores

roman

    doc ids    D7   D23  D47  D47  D63
    positions
    word ids   W12  W12  W12  W12  W12
    scores

emperor

    doc ids    D3   D7   D47  D63
    positions
    word ids   W34  W34  W34  W34
    scores

julius

    doc ids    D7   D7   D39  D47  D63
    positions
    word ids   W27  W27  W27  W27  W27
    scores

Step (I1) provides as output the positions of each of the query words in each of the top-ranked documents. For this example, these are the positions of each query word in the documents with the ids D7 and D47:

    D7:   roman: 6       emperor: 31   julius: 2, 39
    D47:  roman: 26, 38  emperor: 27   julius: 28

The positional inverted index we have described so far obviously contains all the information required to compute this output. The difficulty is how to incorporate this information in the result. The following two subsections present two approaches for computing the output of step (I1), followed by a subsection dedicated to an experimental comparison of these approaches.
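Concretely, the (I1) output above can be represented as a nested map from doc id to query word to match positions (a sketch of the data shape, with all values taken from the example):

```python
# Output of step (I1) for the example query, as a nested map:
# doc id -> query word -> positions of that word in the document.
i1_output = {
    7:  {"roman": [6],      "emperor": [31], "julius": [2, 39]},
    47: {"roman": [26, 38], "emperor": [27], "julius": [28]},
}

# e.g. all matching positions in document D47, in document order,
# as needed later for locating the snippets
d47_positions = sorted(p for positions in i1_output[47].values() for p in positions)
```

Both approaches described next produce exactly this information; they differ only in how it is computed from the posting lists.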

4.1.1 Extended lists

A first and obvious approach for solving this problem is to use extended lists. An extended list is obtained in a straightforward way, by enhancing the postings of an inverted list with an additional entry telling from which query word each posting stems. Conceptually:

    doc ids     D1701  D1701  D1701  D1701  D1701
    word ids    W173   W173   W173   W173   W173
    positions
    scores
    query word

For example, the third posting from the right now knows that it stems from the first query word. (When reading a list from disk, the query word entry would be set to some special value.) It is not hard to see that with this enhancement, the information required for step (I1) can be computed for all of the operations described above. Coming back to our running example, the posting lists of the three query words would be enhanced in the following way:

roman

    doc ids     D7   D23  D47  D47  D63
    positions
    word ids    W12  W12  W12  W12  W12
    scores
    query word  1    1    1    1    1

emperor

    doc ids     D3   D7   D47  D63
    positions
    word ids    W34  W34  W34  W34
    scores
    query word  2    2    2    2

julius

    doc ids     D7   D7   D39  D47  D63
    positions
    word ids    W27  W27  W27  W27  W27
    scores
    query word  3    3    3    3    3

All the postings belonging to the inverted list of the query word roman have an extra entry (denoted as query word in our example) whose value is always 1. This simply marks the fact that these postings contain information referring to the first query word. Similarly, the postings of the second query word (emperor) and the third query word (julius) contain an additional entry with values 2 and 3, respectively. The result extended list is computed by intersecting the extended lists belonging to the three query words, and looks as follows:

    doc ids     D7  D7  D7  D7  D47  D47  D47  D47  D63  D63  D63
    positions
    word ids    W9  W9  W9  W9  W9   W9   W9   W9   W9   W9   W9
    scores
    query word

Given such a list, it is straightforward to extract the positions of each of the query words in each of the top-ranked documents: it can be done in linear time by examining each posting once.

There are two major disadvantages with this approach, however. The first is that we have modified the central data structure, the sanctuary of every search engine. Unless the search engine was written with such modifications in mind, this is usually not tolerable. The second major problem is efficiency. Consider an intersection of two query words. The extended result list now contains postings from both input lists, that is, it at least doubles in size compared to the corresponding simple result list. This effect is aggravated as the number of query words grows. For our previous example, the simple result list has only two postings, while the extended result list has eight postings. The experimental results presented in Subsection show that this indeed affects the processing time. The problem is that per-query-word positions are computed for all matching documents, before the ranking is done. Note that this also happens when pruning techniques are involved, because there, too, large numbers of postings (namely, candidates for one of the top-ranked doc ids) are involved in the intermediate calculations.

4.1.2 Query rotation

We propose to implement (I1) by what we call query rotation. We first explain query rotation using an example: the three-word phrase query roman.emperor.julius. This means that the three query words should appear in the text one after another, in the specified order. We assume that the posting lists for the query words are the ones we have used so far.
They are provided by step (I0), along with the set D of ids of the top-ranked documents:

    doc ids    D7  D47
    scores

First we compute the posting list L1 as the intersection of D and the list of all postings for roman:

    doc ids    D7   D23  D47  D47  D63
    positions  6         26   38
    word ids   W12  W12  W12  W12  W12
    scores

For each doc id in the intersection we store all the positions from the second list, obtaining L1:

    doc ids    D7   D47  D47
    positions  6    26   38
    word ids   W12  W12  W12
    scores
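This first rotation step can be sketched as follows. It is a minimal sketch under the same example: positions 6, 26 and 38 are the ones stated in the text, while the positions for D23 and D63 are made up:

```python
# Sketch of the first step of query rotation: intersect the set D of
# top-ranked doc ids with roman's positional list, keeping one posting
# per matching occurrence. Positions for D23 and D63 are hypothetical.
D = {7, 47}                     # doc ids of the top-ranked documents

roman = [(7, 6), (23, 14), (47, 26), (47, 38), (63, 9)]   # (doc id, position)

L1 = [(doc, pos) for doc, pos in roman if doc in D]
# L1 keeps the three postings for D7 and D47, with all their positions
```

Because the query words' lists are intersected only with the short list of top-ranked doc ids, no per-query-word positions are computed for documents that will never be shown to the user.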


More information

Chapter 8 & Chapter 9 Main Memory & Virtual Memory

Chapter 8 & Chapter 9 Main Memory & Virtual Memory Chapter 8 & Chapter 9 Main Memory & Virtual Memory 1. Various ways of organizing memory hardware. 2. Memory-management techniques: 1. Paging 2. Segmentation. Introduction Memory consists of a large array

More information

Logistics. CSE Case Studies. Indexing & Retrieval in Google. Review: AltaVista. BigTable. Index Stream Readers (ISRs) Advanced Search

Logistics. CSE Case Studies. Indexing & Retrieval in Google. Review: AltaVista. BigTable. Index Stream Readers (ISRs) Advanced Search CSE 454 - Case Studies Indexing & Retrieval in Google Some slides from http://www.cs.huji.ac.il/~sdbi/2000/google/index.htm Logistics For next class Read: How to implement PageRank Efficiently Projects

More information

ResPubliQA 2010

ResPubliQA 2010 SZTAKI @ ResPubliQA 2010 David Mark Nemeskey Computer and Automation Research Institute, Hungarian Academy of Sciences, Budapest, Hungary (SZTAKI) Abstract. This paper summarizes the results of our first

More information

Department of Electronic Engineering FINAL YEAR PROJECT REPORT

Department of Electronic Engineering FINAL YEAR PROJECT REPORT Department of Electronic Engineering FINAL YEAR PROJECT REPORT BEngCE-2007/08-HCS-HCS-03-BECE Natural Language Understanding for Query in Web Search 1 Student Name: Sit Wing Sum Student ID: Supervisor:

More information

Excerpt from: Stephen H. Unger, The Essence of Logic Circuits, Second Ed., Wiley, 1997

Excerpt from: Stephen H. Unger, The Essence of Logic Circuits, Second Ed., Wiley, 1997 Excerpt from: Stephen H. Unger, The Essence of Logic Circuits, Second Ed., Wiley, 1997 APPENDIX A.1 Number systems and codes Since ten-fingered humans are addicted to the decimal system, and since computers

More information

A Novel PAT-Tree Approach to Chinese Document Clustering

A Novel PAT-Tree Approach to Chinese Document Clustering A Novel PAT-Tree Approach to Chinese Document Clustering Kenny Kwok, Michael R. Lyu, Irwin King Department of Computer Science and Engineering The Chinese University of Hong Kong Shatin, N.T., Hong Kong

More information

Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information

Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information Satoshi Sekine Computer Science Department New York University sekine@cs.nyu.edu Kapil Dalwani Computer Science Department

More information

Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman

Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman Abstract We intend to show that leveraging semantic features can improve precision and recall of query results in information

More information

Corso di Biblioteche Digitali

Corso di Biblioteche Digitali Corso di Biblioteche Digitali Vittore Casarosa casarosa@isti.cnr.it tel. 050-315 3115 cell. 348-397 2168 Ricevimento dopo la lezione o per appuntamento Valutazione finale 70-75% esame orale 25-30% progetto

More information

International Journal of Scientific & Engineering Research Volume 2, Issue 12, December ISSN Web Search Engine

International Journal of Scientific & Engineering Research Volume 2, Issue 12, December ISSN Web Search Engine International Journal of Scientific & Engineering Research Volume 2, Issue 12, December-2011 1 Web Search Engine G.Hanumantha Rao*, G.NarenderΨ, B.Srinivasa Rao+, M.Srilatha* Abstract This paper explains

More information

Query Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016

Query Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016 Query Processing Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016 Slides re-used with some modification from www.db-book.com Reference: Database System Concepts, 6 th Ed. By Silberschatz,

More information

Query Evaluation Strategies

Query Evaluation Strategies Introduction to Search Engine Technology Term-at-a-Time and Document-at-a-Time Evaluation Ronny Lempel Yahoo! Research (Many of the following slides are courtesy of Aya Soffer and David Carmel, IBM Haifa

More information

Developing MapReduce Programs

Developing MapReduce Programs Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2017/18 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes

More information

CSCI 5417 Information Retrieval Systems Jim Martin!

CSCI 5417 Information Retrieval Systems Jim Martin! CSCI 5417 Information Retrieval Systems Jim Martin! Lecture 4 9/1/2011 Today Finish up spelling correction Realistic indexing Block merge Single-pass in memory Distributed indexing Next HW details 1 Query

More information

Search Engine Architecture II

Search Engine Architecture II Search Engine Architecture II Primary Goals of Search Engines Effectiveness (quality): to retrieve the most relevant set of documents for a query Process text and store text statistics to improve relevance

More information

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser Text Analytics Index-Structures for Information Retrieval Ulf Leser Content of this Lecture Inverted files Storage structures Phrase and proximity search Building and updating the index Using a RDBMS Ulf

More information

THE FACT-SHEET: A NEW LOOK FOR SLEUTH S SEARCH ENGINE. Colleen DeJong CS851--Information Retrieval December 13, 1996

THE FACT-SHEET: A NEW LOOK FOR SLEUTH S SEARCH ENGINE. Colleen DeJong CS851--Information Retrieval December 13, 1996 THE FACT-SHEET: A NEW LOOK FOR SLEUTH S SEARCH ENGINE Colleen DeJong CS851--Information Retrieval December 13, 1996 Table of Contents 1 Introduction.........................................................

More information

Improving Suffix Tree Clustering Algorithm for Web Documents

Improving Suffix Tree Clustering Algorithm for Web Documents International Conference on Logistics Engineering, Management and Computer Science (LEMCS 2015) Improving Suffix Tree Clustering Algorithm for Web Documents Yan Zhuang Computer Center East China Normal

More information

High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture - 23 Hierarchical Memory Organization (Contd.) Hello

More information

Functional Programming in Haskell Prof. Madhavan Mukund and S. P. Suresh Chennai Mathematical Institute

Functional Programming in Haskell Prof. Madhavan Mukund and S. P. Suresh Chennai Mathematical Institute Functional Programming in Haskell Prof. Madhavan Mukund and S. P. Suresh Chennai Mathematical Institute Module # 02 Lecture - 03 Characters and Strings So, let us turn our attention to a data type we have

More information

Using Graphics Processors for High Performance IR Query Processing

Using Graphics Processors for High Performance IR Query Processing Using Graphics Processors for High Performance IR Query Processing Shuai Ding Jinru He Hao Yan Torsten Suel Polytechnic Inst. of NYU Polytechnic Inst. of NYU Polytechnic Inst. of NYU Yahoo! Research Brooklyn,

More information

Indexing and Query Processing. What will we cover?

Indexing and Query Processing. What will we cover? Indexing and Query Processing CS 510 Winter 2007 1 What will we cover? Key concepts and terminology Inverted index structures Organization, creation, maintenance Compression Distribution Answering queries

More information

Applying the KISS Principle for the CLEF- IP 2010 Prior Art Candidate Patent Search Task

Applying the KISS Principle for the CLEF- IP 2010 Prior Art Candidate Patent Search Task Applying the KISS Principle for the CLEF- IP 2010 Prior Art Candidate Patent Search Task Walid Magdy, Gareth J.F. Jones Centre for Next Generation Localisation School of Computing Dublin City University,

More information

Indexing Web pages. Web Search: Indexing Web Pages. Indexing the link structure. checkpoint URL s. Connectivity Server: Node table

Indexing Web pages. Web Search: Indexing Web Pages. Indexing the link structure. checkpoint URL s. Connectivity Server: Node table Indexing Web pages Web Search: Indexing Web Pages CPS 296.1 Topics in Database Systems Indexing the link structure AltaVista Connectivity Server case study Bharat et al., The Fast Access to Linkage Information

More information

International ejournals

International ejournals Available online at www.internationalejournals.com International ejournals ISSN 0976 1411 International ejournal of Mathematics and Engineering 112 (2011) 1023-1029 ANALYZING THE REQUIREMENTS FOR TEXT

More information

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Thaer Samar 1, Alejandro Bellogín 2, and Arjen P. de Vries 1 1 Centrum Wiskunde & Informatica, {samar,arjen}@cwi.nl

More information

Information Retrieval

Information Retrieval Information Retrieval WS 2016 / 2017 Lecture 2, Tuesday October 25 th, 2016 (Ranking, Evaluation) Prof. Dr. Hannah Bast Chair of Algorithms and Data Structures Department of Computer Science University

More information

Searching the Web for Information

Searching the Web for Information Search Xin Liu Searching the Web for Information How a Search Engine Works Basic parts: 1. Crawler: Visits sites on the Internet, discovering Web pages 2. Indexer: building an index to the Web's content

More information

The Gray Code. Script

The Gray Code. Script Course: B.Sc. Applied Physical Science (Computer Science) Year & Sem.: IInd Year, Sem - IIIrd Subject: Computer Science Paper No.: IX Paper Title: Computer System Architecture Lecture No.: 9 Lecture Title:

More information

Information Retrieval and Organisation

Information Retrieval and Organisation Information Retrieval and Organisation Dell Zhang Birkbeck, University of London 2015/16 IR Chapter 04 Index Construction Hardware In this chapter we will look at how to construct an inverted index Many

More information

Rank Preserving Clustering Algorithms for Paths in Social Graphs

Rank Preserving Clustering Algorithms for Paths in Social Graphs University of Waterloo Faculty of Engineering Rank Preserving Clustering Algorithms for Paths in Social Graphs LinkedIn Corporation Mountain View, CA 94043 Prepared by Ziyad Mir ID 20333385 2B Department

More information

IO-Top-k at TREC 2006: Terabyte Track

IO-Top-k at TREC 2006: Terabyte Track IO-Top-k at TREC 2006: Terabyte Track Holger Bast Debapriyo Majumdar Ralf Schenkel Martin Theobald Gerhard Weikum Max-Planck-Institut für Informatik, Saarbrücken, Germany {bast,deb,schenkel,mtb,weikum}@mpi-inf.mpg.de

More information

NTUBROWS System for NTCIR-7. Information Retrieval for Question Answering

NTUBROWS System for NTCIR-7. Information Retrieval for Question Answering NTUBROWS System for NTCIR-7 Information Retrieval for Question Answering I-Chien Liu, Lun-Wei Ku, *Kuang-hua Chen, and Hsin-Hsi Chen Department of Computer Science and Information Engineering, *Department

More information

Chapter 2. Architecture of a Search Engine

Chapter 2. Architecture of a Search Engine Chapter 2 Architecture of a Search Engine Search Engine Architecture A software architecture consists of software components, the interfaces provided by those components and the relationships between them

More information

CS301 - Data Structures Glossary By

CS301 - Data Structures Glossary By CS301 - Data Structures Glossary By Abstract Data Type : A set of data values and associated operations that are precisely specified independent of any particular implementation. Also known as ADT Algorithm

More information

CS347. Lecture 2 April 9, Prabhakar Raghavan

CS347. Lecture 2 April 9, Prabhakar Raghavan CS347 Lecture 2 April 9, 2001 Prabhakar Raghavan Today s topics Inverted index storage Compressing dictionaries into memory Processing Boolean queries Optimizing term processing Skip list encoding Wild-card

More information

Search Engines. Informa1on Retrieval in Prac1ce. Annotations by Michael L. Nelson

Search Engines. Informa1on Retrieval in Prac1ce. Annotations by Michael L. Nelson Search Engines Informa1on Retrieval in Prac1ce Annotations by Michael L. Nelson All slides Addison Wesley, 2008 Indexes Indexes are data structures designed to make search faster Text search has unique

More information

Operating Systems Design Exam 2 Review: Spring 2011

Operating Systems Design Exam 2 Review: Spring 2011 Operating Systems Design Exam 2 Review: Spring 2011 Paul Krzyzanowski pxk@cs.rutgers.edu 1 Question 1 CPU utilization tends to be lower when: a. There are more processes in memory. b. There are fewer processes

More information

Today s topics CS347. Inverted index storage. Inverted index storage. Processing Boolean queries. Lecture 2 April 9, 2001 Prabhakar Raghavan

Today s topics CS347. Inverted index storage. Inverted index storage. Processing Boolean queries. Lecture 2 April 9, 2001 Prabhakar Raghavan Today s topics CS347 Lecture 2 April 9, 2001 Prabhakar Raghavan Inverted index storage Compressing dictionaries into memory Processing Boolean queries Optimizing term processing Skip list encoding Wild-card

More information

CS 416: Opera-ng Systems Design March 23, 2012

CS 416: Opera-ng Systems Design March 23, 2012 Question 1 Operating Systems Design Exam 2 Review: Spring 2011 Paul Krzyzanowski pxk@cs.rutgers.edu CPU utilization tends to be lower when: a. There are more processes in memory. b. There are fewer processes

More information

Information Retrieval

Information Retrieval Introduction to Information Retrieval Boolean retrieval Basic assumptions of Information Retrieval Collection: Fixed set of documents Goal: Retrieve documents with information that is relevant to the user

More information

CPS352 Lecture - Indexing

CPS352 Lecture - Indexing Objectives: CPS352 Lecture - Indexing Last revised 2/25/2019 1. To explain motivations and conflicting goals for indexing 2. To explain different types of indexes (ordered versus hashed; clustering versus

More information

Data Representation. Types of data: Numbers Text Audio Images & Graphics Video

Data Representation. Types of data: Numbers Text Audio Images & Graphics Video Data Representation Data Representation Types of data: Numbers Text Audio Images & Graphics Video Analog vs Digital data How is data represented? What is a signal? Transmission of data Analog vs Digital

More information

RAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE

RAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE RAID SEMINAR REPORT 2004 Submitted on: Submitted by: 24/09/2004 Asha.P.M NO: 612 S7 ECE CONTENTS 1. Introduction 1 2. The array and RAID controller concept 2 2.1. Mirroring 3 2.2. Parity 5 2.3. Error correcting

More information

3-2. Index construction. Most slides were adapted from Stanford CS 276 course and University of Munich IR course.

3-2. Index construction. Most slides were adapted from Stanford CS 276 course and University of Munich IR course. 3-2. Index construction Most slides were adapted from Stanford CS 276 course and University of Munich IR course. 1 Ch. 4 Index construction How do we construct an index? What strategies can we use with

More information

Digital Communication Prof. Bikash Kumar Dey Department of Electrical Engineering Indian Institute of Technology, Bombay

Digital Communication Prof. Bikash Kumar Dey Department of Electrical Engineering Indian Institute of Technology, Bombay Digital Communication Prof. Bikash Kumar Dey Department of Electrical Engineering Indian Institute of Technology, Bombay Lecture - 26 Source Coding (Part 1) Hello everyone, we will start a new module today

More information

Databases 2 Lecture IV. Alessandro Artale

Databases 2 Lecture IV. Alessandro Artale Free University of Bolzano Database 2. Lecture IV, 2003/2004 A.Artale (1) Databases 2 Lecture IV Alessandro Artale Faculty of Computer Science Free University of Bolzano Room: 221 artale@inf.unibz.it http://www.inf.unibz.it/

More information

A Frequent Max Substring Technique for. Thai Text Indexing. School of Information Technology. Todsanai Chumwatana

A Frequent Max Substring Technique for. Thai Text Indexing. School of Information Technology. Todsanai Chumwatana School of Information Technology A Frequent Max Substring Technique for Thai Text Indexing Todsanai Chumwatana This thesis is presented for the Degree of Doctor of Philosophy of Murdoch University May

More information