Index extensions. Manning, Raghavan and Schütze, Chapter 2. Daniël de Kok


Today
We will discuss some extensions to the inverted index to accommodate boolean processing:
Skip pointers: improve performance for intersection of postings lists.
Biword indexes: add support for phrase queries.
Positional indexes: a different approach to phrase queries.
Suffix arrays: add support for character n-gram and/or phrase queries.

Skip pointers

Introduction
Basic sorted list intersection: O(n + m).
However, sub-linear time is possible using skip pointers.

Skip pointers
[Figure: two postings lists with skip pointers. Brutus: 2 4 8 16 19 23 28 43 (skip pointer targets shown: 16, 28, 28). Caesar: 1 2 3 5 8 41 51 60 (skip pointer targets shown: 5, 51, 98).]
Add a pointer every n items. This permits skipping through the postings list more quickly.
Suppose that we are computing Brutus AND Caesar and we have found 8 in both lists. In the lists we step to 16 and 41 respectively. Since 16 < 41, we have to advance in the Brutus list, and 16 has a skip pointer. Since the document pointed at by the skip pointer (28) is also less than 41, we can follow the skip pointer and make three steps at once.

Skip pointers Skip pointers are especially useful for linked lists. No explicit skip pointers needed for array representation.

Number of pointers
The number of skip pointers is a trade-off:
More skips: shorter skip spans, so a skip is more likely to be taken. More comparisons. More space for storing skip pointers.
Fewer skips: longer skip spans, so fewer opportunities to skip. Fewer comparisons. Less space for storing skip pointers.

Number of pointers
Advice from the book: for a postings list of length P, use √P evenly-placed skip pointers.
Improvements are obviously possible, depending on the distribution of query terms.
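The intersection with skip pointers can be sketched as follows. This is a minimal sketch, not the book's pseudocode: it assumes postings are stored in sorted slices, so the skip pointers are implicit (every `skip`-th position can jump `skip` positions ahead), and it assumes a skip span of floor(√P). The names `advance` and `intersect_with_skips` are made up for this example.

```rust
/// Advance `pos` in `list`, following an (implicit) skip pointer when
/// there is one at `pos` and its target does not overshoot `target`.
fn advance(list: &[u32], pos: usize, skip: usize, target: u32) -> usize {
    let has_skip = pos % skip == 0 && pos + skip < list.len();
    if has_skip && list[pos + skip] <= target {
        pos + skip
    } else {
        pos + 1
    }
}

/// Intersect two sorted postings lists.
/// Assumption: a skip span of floor(sqrt(P)) for a list of length P.
fn intersect_with_skips(a: &[u32], b: &[u32]) -> Vec<u32> {
    let skip_a = ((a.len() as f64).sqrt() as usize).max(1);
    let skip_b = ((b.len() as f64).sqrt() as usize).max(1);
    let (mut i, mut j) = (0, 0);
    let mut result = Vec::new();

    while i < a.len() && j < b.len() {
        if a[i] == b[j] {
            // Document occurs in both lists.
            result.push(a[i]);
            i += 1;
            j += 1;
        } else if a[i] < b[j] {
            i = advance(a, i, skip_a, b[j]);
        } else {
            j = advance(b, j, skip_b, a[i]);
        }
    }

    result
}
```

Called with the Brutus and Caesar lists from the slides, `intersect_with_skips` returns the documents 2 and 8. With array storage no pointers are stored at all, which is why explicit skip pointers matter mostly for linked lists.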

Skip lists
Note: the idea of skip pointers relates strongly to the skip list data structure (Pugh, 1990), which provides O(log n) average-case search, insertion and deletion.
[Figure: a skip list over the keys 1 to 10, with a head node and NIL-terminated levels.]

Biword indexes

Introduction
One undiscussed aspect of boolean queries: phrase queries. E.g.:
information retrieval
Better call Saul
Supported by many search engines by using double quotes.
Biword indexes provide a simple method to perform such queries.

Basic idea
Construct an inverted index with word bigrams as terms.
Allows lookup of two-word terms, such as information retrieval.
Lookup of longer phrases is also possible if false positives are acceptable. For example, expand the query Better call Saul to: "Better call" AND "call Saul"
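A biword index and the query expansion could look roughly like this. This is a sketch under simplifying assumptions: tokens are split on whitespace and lowercased, documents are identified by their position in the input slice, and the names `BiwordIndex`, `build_biword_index` and `phrase_query` are invented for the example.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical biword index: word bigram -> documents containing it.
type BiwordIndex = HashMap<(String, String), BTreeSet<usize>>;

/// Assumption: whitespace tokenization and lowercasing.
fn tokenize(text: &str) -> Vec<String> {
    text.split_whitespace().map(str::to_lowercase).collect()
}

fn build_biword_index(docs: &[&str]) -> BiwordIndex {
    let mut index = BiwordIndex::new();
    for (doc_id, doc) in docs.iter().enumerate() {
        // Every pair of adjacent tokens becomes a term.
        for pair in tokenize(doc).windows(2) {
            index
                .entry((pair[0].clone(), pair[1].clone()))
                .or_default()
                .insert(doc_id);
        }
    }
    index
}

/// Expand a phrase into its biwords and AND their postings.
/// Phrases longer than two words can yield false positives.
/// A one-word phrase has no biwords and returns the empty set.
fn phrase_query(index: &BiwordIndex, phrase: &str) -> BTreeSet<usize> {
    let mut result: Option<BTreeSet<usize>> = None;
    for pair in tokenize(phrase).windows(2) {
        let key = (pair[0].clone(), pair[1].clone());
        let postings = index.get(&key).cloned().unwrap_or_default();
        result = Some(match result {
            None => postings,
            Some(acc) => acc.intersection(&postings).copied().collect(),
        });
    }
    result.unwrap_or_default()
}
```

For the documents "you better call Saul" and "call Saul now or better call later", the query Better call Saul matches both, although only the first contains the phrase. The second document is exactly the kind of false positive the slide warns about.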

Function words
Nouns often express the main information being searched. In a phrase query such as renegotiation of the constitution, the function words of and the are largely irrelevant, yet they do increase query processing time.
Typical solution:
1. Use a part-of-speech tagger to pick out the nouns: renegotiation/n of/p the/d constitution/n
2. Perform the query renegotiation constitution on the biword index.

Problems with biword indexes
Can return false positives.
The idea can be extended to longer phrases, but this is expensive, and at what length should we stop?

Positional indexes

Introduction
Due to their downsides, biword indexes are not the standard solution for phrase queries. Instead, positional indexes are commonly used.
A positional index is an index where a posting contains:
The document identifier.
A list of positions where the term occurs within the document.

Example
Dictionary     Postings
information    10: [20, 54]   34: [2]     93: [183]
retrieval      3: [65]        10: [55]    93: [44]
Query: information retrieval
Perform intersection on the postings lists of information and retrieval. If a document is in D(information) and D(retrieval), use the positions to check that the terms occur consecutively. Here documents 10 and 93 contain both terms, but only document 10 has them at consecutive positions (54 and 55).
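The positional check can be sketched as below. The representation is an assumption made for the example: each term's postings are a map from document identifier to a sorted list of positions, and the names `Postings` and `adjacent` are hypothetical.

```rust
use std::collections::BTreeMap;

/// Postings of one term: document identifier -> sorted positions.
type Postings = BTreeMap<u32, Vec<u32>>;

/// Documents in which the second term occurs directly after the first.
fn adjacent(first: &Postings, second: &Postings) -> Vec<u32> {
    let mut hits = Vec::new();
    for (doc, positions) in first {
        // Intersection step: the document must contain both terms.
        if let Some(next_positions) = second.get(doc) {
            // Positional step: some occurrence at p must be followed
            // by an occurrence of the second term at p + 1.
            let consecutive = positions
                .iter()
                .any(|&p| next_positions.binary_search(&(p + 1)).is_ok());
            if consecutive {
                hits.push(*doc);
            }
        }
    }
    hits
}
```

With the postings from the example slide, `adjacent` reports only document 10. Longer phrases could be handled by chaining such checks, at the cost of carrying the matched positions along.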

Suffix arrays


Introduction Suffix arrays are a data structure for finding/counting n-grams of an arbitrary length in O(log n) time. Working example: bananabread

Suffixes
0   bananabread
1   ananabread
2   nanabread
3   anabread
4   nabread
5   abread
6   bread
7   read
8   ead
9   ad
10  d
Enumeration of the suffixes in bananabread. The index in the first column is the corpus position of the suffix.

Sorting suffixes
5   abread
9   ad
3   anabread
1   ananabread
0   bananabread
6   bread
10  d
8   ead
4   nabread
2   nanabread
7   read
Lexicographical sorting of the suffixes. The first column is the suffix array and contains the corpus positions sorted by suffix order.
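The table above can be reproduced literally by materializing every suffix, sorting, and keeping only the positions. This is a deliberately naive sketch: it copies all suffixes into memory, and it assumes ASCII text so that byte offsets equal character positions. The name `suffix_array_explicit` is made up for the example.

```rust
/// Naive suffix array construction that stores every suffix.
/// Assumption: `text` is ASCII, so byte offsets are positions.
fn suffix_array_explicit(text: &str) -> Vec<usize> {
    // Pair each suffix (copied into its own String) with the
    // corpus position at which it starts.
    let mut suffixes: Vec<(String, usize)> = (0..text.len())
        .map(|i| (text[i..].to_string(), i))
        .collect();

    // Tuples sort by the suffix first; positions only break ties,
    // which cannot occur because all suffixes differ in length.
    suffixes.sort();

    // The suffix array is the column of positions.
    suffixes.into_iter().map(|(_, i)| i).collect()
}
```

For bananabread this yields 5 9 3 1 0 6 10 8 4 2 7, the first column of the sorted table. The copies take quadratic space in the length of the text, so this approach does not scale.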

In-class assignment
Create the suffix array for the string Mississippi.
Assume that uppercase letters are sorted before lowercase letters.

Construction
Do we need to explicitly store all suffixes in memory to construct the suffix array?
"All problems in computer science can be solved by another level of indirection" - David Wheeler

Example
Position:            0 1 2 3 4 5 6 7 8 9 10
Data:                b a n a n a b r e a d
Sorted suffix array: 5 9 3 1 0 6 10 8 4 2 7

Construction code
    // Construct an array with the same length as the data array
    // with indices 0..n-1.
    let mut sarr: Vec<_> = (0..data.len()).collect();

    // Sort the array, thereby constructing the suffix array. Note
    // that we compare the array slices starting at the given indices.
    sarr.sort_by(|&i1, &i2| data[i1..].cmp(&data[i2..]));

Notes on construction
Constructing a suffix array using an off-the-shelf sorting algorithm is notoriously inefficient:
Good sorting algorithm: O(n log n) time
Suffix comparisons: O(n) time
Consequently: O(n² log n)
O(n) algorithms exist (e.g. SA-IS, Nong, et al., 2008).
Algorithms optimized for on-disk construction are also available.

Contains
Does the data d contain the sequence s? It does when there is a suffix that has s as its prefix. Use the suffix array sa to find such a suffix in O(log n) time.
Procedure:
1. Let m be the index of the middle element of the suffix array and d_i = sa[m] its value.
2. Compare s with the slice d[d_i..d_i + |s|].
3. When:
s is equal: the data contains the sequence.
s is smaller: repeat the search in sa[0..m].
s is larger: repeat the search in sa[m + 1..|sa|].
Note: here a[i..j] is a slice of a from index i inclusive to index j exclusive.
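The procedure can be written out as an explicit binary search loop. This is a minimal sketch: the free function `contains` and its parameter names are chosen for the example, `sa` is assumed to be a valid suffix array of `data`, and the slice end is clamped so that short suffixes near the end of the data do not cause out-of-bounds slicing.

```rust
use std::cmp::Ordering;

/// Does `data` contain `needle`?
/// Assumption: `sa` is the suffix array of `data`.
fn contains<T: Ord>(data: &[T], sa: &[usize], needle: &[T]) -> bool {
    // The search range is sa[lo..hi].
    let (mut lo, mut hi) = (0, sa.len());

    while lo < hi {
        // Step 1: middle element of the range and its value.
        let m = lo + (hi - lo) / 2;
        let start = sa[m];

        // Step 2: compare the needle with a prefix of the suffix,
        // without slicing past the end of the data.
        let end = data.len().min(start + needle.len());

        // Step 3: stop on a match, otherwise halve the range.
        match needle.cmp(&data[start..end]) {
            Ordering::Equal => return true,
            Ordering::Less => hi = m,
            Ordering::Greater => lo = m + 1,
        }
    }

    false
}
```

Each iteration halves the range, so there are O(log n) comparisons. Each comparison inspects at most as many elements as the needle is long.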

Contains
Does the data array contain the string ba?
Data:         b a n a n a b r e a d
Suffix array: 5 9 3 1 0 6 10 8 4 2 7
Perform a binary search on the suffix array, performing comparisons on the data array.
1. String of length 2 at position 6: br, larger than ba. Continue binary search in the subarray left of this element.
2. String of length 2 at position 3: an, smaller than ba. Continue binary search in the subarray right of this element.
3. String of length 2 at position 1: an, smaller than ba. Continue binary search in the subarray right of this element.
4. String of length 2 at position 0: ba. The data array contains the string ba.

Implementation
    use std::cmp::min;

    pub fn contains(&self, needle: &[T]) -> bool {
        self.sarr.binary_search_by(|&idx| {
            // We don't want to compare 'needle' with the complete suffix
            // starting at 'idx', but only the prefix of the length of
            // 'needle'. However, 'needle' can be larger than the suffix
            // at 'idx'. So, we have to avoid slicing beyond the end.
            let upper = min(self.data.len(), idx + needle.len());

            // Compare the prefix of the suffix and the needle.
            self.data[idx..upper].cmp(needle)
        }).is_ok()
    }

Sequence positions At what corpus positions does the sequence s occur? Return the slice of sa that contains indices of suffixes starting with s. (Since sa contains the suffixes in sorted order, the indices of the occurrences of s will be adjacent.)


Bounds
Where does the string aba occur in the array?
Data:         a b a b a b a b
Suffix array: 6 4 2 0 7 5 3 1
Lower bound: the first position where an additional instance of the sequence could be inserted.
Upper bound: the last position where an additional instance of the sequence could be inserted.

Finding the upper/lower bound Both the lower and upper bounds can be found using binary search. We will: First look at the procedures for a normal sorted array. Apply the procedure to the suffix array, with its extra level of indirection.

Finding the lower bound
Consider a sorted array a. Procedure for finding the lower bound of v:
1. Determine the index of the middle element of a, i.
2. Compare a[i] and v.
3. When:
v is equal: repeat the search in a[0..i].
v is smaller: repeat the search in a[0..i].
v is larger: repeat the search in a[i + 1..|a|].

Finding the upper bound
Consider a sorted array a. Procedure for finding the upper bound of v:
1. Determine the index of the middle element of a, i.
2. Compare a[i] and v.
3. When:
v is equal: repeat the search in a[i + 1..|a|].
v is smaller: repeat the search in a[0..i].
v is larger: repeat the search in a[i + 1..|a|].
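For a plain sorted array, the two procedures differ only in how the equal case is handled. The sketch below assumes a sorted slice and uses an iterative search over the range a[lo..hi] instead of recursion. The function names mirror the slide titles but are otherwise hypothetical.

```rust
/// First index at which `v` could be inserted into sorted `a`.
fn lower_bound<T: Ord>(a: &[T], v: &T) -> usize {
    let (mut lo, mut hi) = (0, a.len());
    while lo < hi {
        let i = lo + (hi - lo) / 2;
        if a[i] < *v {
            // v is larger: continue in a[i + 1..hi].
            lo = i + 1;
        } else {
            // v is equal or smaller: continue in a[lo..i].
            hi = i;
        }
    }
    lo
}

/// Last index at which `v` could be inserted into sorted `a`.
fn upper_bound<T: Ord>(a: &[T], v: &T) -> usize {
    let (mut lo, mut hi) = (0, a.len());
    while lo < hi {
        let i = lo + (hi - lo) / 2;
        if a[i] <= *v {
            // v is equal or larger: continue in a[i + 1..hi].
            lo = i + 1;
        } else {
            // v is smaller: continue in a[lo..i].
            hi = i;
        }
    }
    lo
}
```

For the array 1 2 2 2 3 5 and v = 2, the lower bound is 1 and the upper bound is 4, so a[1..4] holds exactly the occurrences of v. When v is absent, both bounds coincide and the slice is empty.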

Lower bound
What is the lower bound for aba?
Data:         a b a b a b a b
Suffix array: 6 4 2 0 7 5 3 1
1. String of length 3 at position 0: aba. Equal, search left half.
2. String of length 3 at position 4: aba. Equal, search left half.
3. String of length 3 at position 6: ab$. Smaller than aba, search right half.
4. Since the suffix was smaller than aba, increment by one to get the lower bound.

Implementation
    pub fn find(&self, needle: &[T]) -> &[usize] {
        let lower = self.lower_bound(needle);
        let upper = self.upper_bound(needle);

        // All suffixes that start with 'needle' are adjacent.
        &self.sarr[lower..upper]
    }

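Implementation (lower bound)
    use std::cmp::{min, Ordering};

    pub fn lower_bound(&self, needle: &[T]) -> usize {
        let result = self.sarr.binary_search_by(|&idx| {
            let upper = min(self.data.len(), idx + needle.len());

            // Treat a match as 'greater', so that the search
            // continues to the left of every match.
            match self.data[idx..upper].cmp(needle) {
                Ordering::Equal => Ordering::Greater,
                ordering => ordering,
            }
        });

        // The comparator never returns Equal, so the search always
        // ends with the insertion point.
        match result {
            Ok(_) => unreachable!(),
            Err(lower) => lower,
        }
    }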

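Implementation (upper bound)
    use std::cmp::{min, Ordering};

    pub fn upper_bound(&self, needle: &[T]) -> usize {
        let result = self.sarr.binary_search_by(|&idx| {
            let upper = min(self.data.len(), idx + needle.len());

            // Treat a match as 'less', so that the search
            // continues to the right of every match.
            match self.data[idx..upper].cmp(needle) {
                Ordering::Equal => Ordering::Less,
                ordering => ordering,
            }
        });

        // The comparator never returns Equal, so the search always
        // ends with the insertion point.
        match result {
            Ok(_) => unreachable!(),
            Err(upper) => upper,
        }
    }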
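Put together, the fragments above could form a small self-contained type. This is a sketch and not necessarily how the course code is organized: the struct layout, the constructor `new` and the shared helper `bound` are assumptions, and construction uses the simple sort-based method.

```rust
use std::cmp::{min, Ordering};

/// Hypothetical container pairing the data with its suffix array.
struct SuffixArray<T> {
    data: Vec<T>,
    sarr: Vec<usize>,
}

impl<T: Ord> SuffixArray<T> {
    /// Sort-based construction, as on the construction slide.
    fn new(data: Vec<T>) -> Self {
        let mut sarr: Vec<usize> = (0..data.len()).collect();
        sarr.sort_by(|&i1, &i2| data[i1..].cmp(&data[i2..]));
        SuffixArray { data, sarr }
    }

    /// Binary search that reports a match as `on_equal`:
    /// `Greater` gives the lower bound, `Less` the upper bound.
    fn bound(&self, needle: &[T], on_equal: Ordering) -> usize {
        let result = self.sarr.binary_search_by(|&idx| {
            let upper = min(self.data.len(), idx + needle.len());
            match self.data[idx..upper].cmp(needle) {
                Ordering::Equal => on_equal,
                ordering => ordering,
            }
        });
        // Either way the payload is the position we are after.
        match result {
            Ok(pos) | Err(pos) => pos,
        }
    }

    /// Corpus positions at which `needle` occurs, in suffix order.
    fn find(&self, needle: &[T]) -> &[usize] {
        let lower = self.bound(needle, Ordering::Greater);
        let upper = self.bound(needle, Ordering::Less);
        &self.sarr[lower..upper]
    }
}
```

For the data abababab, `find` with the needle aba returns the slice 4 2 0: the three occurrences, ordered by suffix rather than by position. Sort the slice if corpus order is needed.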

Suffix arrays in IR
We can now search arbitrary-length strings: Bundeskanzler Merkel
We can search partial strings: anzler
How do we map results back to documents?

Mapping to documents
Document offsets: 0 10
Data:             P i n e a p p l e _ P e n (13 positions; position 9 separates the two documents)
Suffix array:     9 10 0 4 8 3 11 1 7 12 2 6 5
Assumption: documents have identifiers [0..n).
Create an array with n elements and store for each document its offset in the data array.
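Given such an offset array, a corpus position found through the suffix array can be mapped to its document with one more binary search. A minimal sketch, assuming the offsets are stored in increasing order and the first document starts at offset 0. The function name `document_of` is made up for the example.

```rust
/// Map a corpus position to the identifier of its document.
/// Assumption: `offsets[d]` is the start of document `d` in the
/// data array, and the offsets are strictly increasing.
fn document_of(offsets: &[usize], position: usize) -> Option<usize> {
    // Count the documents that start at or before `position`.
    let started = offsets.partition_point(|&start| start <= position);

    // The last of those documents contains the position. If no
    // document has started yet, there is no answer.
    started.checked_sub(1)
}
```

With the offsets 0 and 10 from the example, position 4 maps to document 0 and position 11 to document 1. The sketch does not check that a position lies within the data array, and it attributes a separator position to the preceding document.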

Final notes
Suffix arrays can be created for any array of a type that can be ordered. So, we could use:
Array of bytes
Array of characters
Array of tokens
The data and suffix arrays can be used from disk using techniques that we will discuss later.