Index extensions. Manning, Raghavan and Schütze, Chapter 2. Daniël de Kok


Today
We will discuss some extensions to the inverted index to accommodate boolean processing:
- Skip pointers: improve performance for intersection of postings lists.
- Biword indexes: add support for phrase queries.
- Positional indexes: a different approach to phrase queries.
- Suffix arrays: add support for character n-gram and/or phrase queries.

Skip pointers

Introduction
Basic sorted-list intersection takes O(n + m) time for lists of lengths n and m. However, sub-linear time is possible using skip pointers.

Skip pointers
[Figure: two postings lists with skip pointers. Brutus: 2 4 8 16 19 23 28 43. Caesar: 1 2 3 5 8 41 51 60. Skip-pointer labels extracted from the slide: 16 28 28 and 5 51 98.]
Add a pointer every n items. This permits skipping through the postings list more quickly.
Suppose that we are computing Brutus AND Caesar and we have found 8 in both lists. In the lists we step to 16 and 41 respectively. 16 < 41 and 16 has a skip pointer. Since the document pointed at by the skip pointer (28) is also less than 41, we can follow the skip pointer and make three steps at once.

Skip pointers
Skip pointers are especially useful for linked lists. An array representation needs no explicit skip pointers, because any position can be reached directly by its index.

Number of pointers
The number of skip pointers is a trade-off.
More skips:
- Shorter skip spans: more likely to skip.
- More comparisons.
- More space for storing skip pointers.
Fewer skips:
- Longer skip spans: fewer opportunities to skip.
- Fewer comparisons.
- Less space for storing skip pointers.

Number of pointers
Advice from the book: for a postings list of length P, use √P evenly-placed skip pointers. Improvements are obviously possible: this heuristic ignores the distribution of query terms.
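A minimal sketch of intersection with skip pointers, assuming array-backed postings lists where a skip is simply an index jump. The stride of about √P follows the heuristic from Manning et al.; the function names `skip_stride` and `intersect_with_skips` are hypothetical and chosen for illustration only.

```rust
/// Hypothetical helper: skip stride of roughly sqrt(len), at least 1.
fn skip_stride(len: usize) -> usize {
    ((len as f64).sqrt() as usize).max(1)
}

/// Intersect two sorted postings lists. Every `stride`-th position acts as
/// the origin of an implicit skip pointer to the position `stride` further on.
fn intersect_with_skips(p1: &[u32], p2: &[u32]) -> Vec<u32> {
    let (s1, s2) = (skip_stride(p1.len()), skip_stride(p2.len()));
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < p1.len() && j < p2.len() {
        if p1[i] == p2[j] {
            out.push(p1[i]);
            i += 1;
            j += 1;
        } else if p1[i] < p2[j] {
            // Follow the skip only if its target does not overshoot p2[j].
            if i % s1 == 0 && i + s1 < p1.len() && p1[i + s1] <= p2[j] {
                i += s1;
            } else {
                i += 1;
            }
        } else if j % s2 == 0 && j + s2 < p2.len() && p2[j + s2] <= p1[i] {
            j += s2;
        } else {
            j += 1;
        }
    }
    out
}
```

With the Brutus and Caesar lists from the skip pointer slides, this returns the two shared documents, 2 and 8.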

Skip lists
Note: the idea of skip pointers relates strongly to the skip list data structure (Pugh, 1990), which provides O(log n) average-case search/insertion/deletion.
[Figure: a skip list with a head node, elements 1 to 10, and NIL-terminated levels.]

Biword indexes

Introduction
One undiscussed aspect of boolean queries: phrase queries. E.g.:
- "information retrieval"
- "Better call Saul"
Phrase queries are supported by many search engines through double quotes. Biword indexes provide a simple method to perform such queries.

Basic idea
Construct an inverted index with word bigrams as terms. This allows lookup of two-word terms, such as "information retrieval".
Lookup of longer phrases is also possible if false positives are acceptable. For example, expand the query "Better call Saul" to: "Better call" AND "call Saul"
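The idea can be sketched as follows. This is only an illustration under simple assumptions: whitespace tokenization, lowercasing, and document identifiers equal to the position in the input slice. `BiwordIndex`, `build_biword_index` and `phrase_query` are hypothetical names.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical biword index: "w1 w2" -> documents where w1 is followed by w2.
type BiwordIndex = HashMap<String, BTreeSet<usize>>;

fn tokenize(text: &str) -> Vec<String> {
    text.split_whitespace().map(|t| t.to_lowercase()).collect()
}

fn build_biword_index(docs: &[&str]) -> BiwordIndex {
    let mut index = BiwordIndex::new();
    for (doc_id, doc) in docs.iter().enumerate() {
        for pair in tokenize(doc).windows(2) {
            let biword = format!("{} {}", pair[0], pair[1]);
            index.entry(biword).or_default().insert(doc_id);
        }
    }
    index
}

/// Expand the phrase into biwords and intersect their postings.
/// For phrases longer than two words the result may contain false positives.
fn phrase_query(index: &BiwordIndex, phrase: &str) -> BTreeSet<usize> {
    let mut result: Option<BTreeSet<usize>> = None;
    for pair in tokenize(phrase).windows(2) {
        let biword = format!("{} {}", pair[0], pair[1]);
        let postings = index.get(&biword).cloned().unwrap_or_default();
        result = Some(match result {
            None => postings,
            Some(acc) => acc.intersection(&postings).cloned().collect(),
        });
    }
    result.unwrap_or_default()
}
```

A document such as "better call me and then call Saul" contains both biwords without containing the phrase, so it is returned as a false positive.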

Function words
Nouns often express the main information being searched. In a phrase query such as "renegotiation of the constitution", the function words "of" and "the" are largely irrelevant, but they do increase query processing time.
Typical solution:
1. Use a part-of-speech tagger to pick out the nouns: renegotiation/n of/p the/d constitution/n
2. Perform the query "renegotiation constitution" on the biword index.
Obviously, we need to remove function words when constructing the index as well.

Problems with biword indexes
- Can return false positives.
- The idea is expandable to larger phrases, but this is expensive, and at what length should we stop?

Positional indexes

Introduction
Due to their downsides, biword indexes are not the standard solution for phrase queries. Instead, positional indexes are commonly used. A positional index is an index where a posting contains:
- The document identifier.
- A list of positions where the term occurs within the document.

Example
Dictionary     Postings
information    10: [20, 54]   34: [2]    93: [183]
retrieval      3: [65]        10: [55]   93: [44]
Query: "information retrieval"
1. Perform intersection on the postings lists of information and retrieval.
2. If a document is in D(information) and D(retrieval), use the positions to check that the terms occur consecutively.
Here documents 10 and 93 contain both terms. In document 10, information occurs at position 54 and retrieval at position 55, so the phrase matches. In document 93 the positions (183 and 44) are not consecutive, so it does not match.
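The two steps can be sketched for a two-word phrase as below. This is an illustrative sketch, not a full positional index: the `Postings` alias and the `phrase_match` function are hypothetical, and positions are assumed to be sorted within each document.

```rust
use std::collections::BTreeMap;

/// Postings for one term: document identifier -> sorted term positions.
type Postings = BTreeMap<u32, Vec<u32>>;

/// Return the documents in which a position of `second` directly follows
/// a position of `first`.
fn phrase_match(first: &Postings, second: &Postings) -> Vec<u32> {
    let mut hits = Vec::new();
    for (doc, positions1) in first {
        // Step 1: the document must occur in both postings lists.
        if let Some(positions2) = second.get(doc) {
            // Step 2: some position p of `first` must have p + 1 in `second`.
            let consecutive = positions1
                .iter()
                .any(|p| positions2.binary_search(&(p + 1)).is_ok());
            if consecutive {
                hits.push(*doc);
            }
        }
    }
    hits
}
```

On the postings from the example, only document 10 qualifies.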

Suffix arrays

Introduction
Suffix arrays are a data structure for finding/counting n-grams of arbitrary length with O(log n) suffix comparisons.
Working example: bananabread

Suffixes
 0  b a n a n a b r e a d
 1  a n a n a b r e a d
 2  n a n a b r e a d
 3  a n a b r e a d
 4  n a b r e a d
 5  a b r e a d
 6  b r e a d
 7  r e a d
 8  e a d
 9  a d
10  d
Enumeration of the suffixes in bananabread. The index in the first column is the corpus position of the suffix.

Sorting suffixes
 5  a b r e a d
 9  a d
 3  a n a b r e a d
 1  a n a n a b r e a d
 0  b a n a n a b r e a d
 6  b r e a d
10  d
 8  e a d
 4  n a b r e a d
 2  n a n a b r e a d
 7  r e a d
Lexicographical sorting of the suffixes. The first column is the suffix array and contains the corpus positions sorted by suffix order.

In-class assignment
Create the suffix array for the string Mississippi. Assume that uppercase letters are sorted before lowercase letters.

Construction
Do we need to explicitly store all suffixes in memory to construct the suffix array?
"All problems in computer science can be solved by another level of indirection" - David Wheeler

Example
Index:        0 1 2 3 4 5 6 7 8 9 10
Data array:   b a n a n a b r e a d
Suffix array (indices sorted by suffix): 5 9 3 1 0 6 10 8 4 2 7

Construction code
    // Construct an array with the same length as the data array,
    // with indices 0..n-1.
    let mut sarr: Vec<usize> = (0..data.len()).collect();
    // Sort the array, thereby constructing the suffix array. Note
    // that we compare the array slices starting at the given indices.
    sarr.sort_by(|&i1, &i2| data[i1..].cmp(&data[i2..]));

Notes on construction
Constructing a suffix array using an off-the-shelf sorting algorithm is notoriously inefficient:
- Good sorting algorithm: O(n log n) comparisons
- Each suffix comparison: O(n) time
- Consequently: O(n² log n) time
O(n) algorithms exist (e.g. SA-IS, Nong et al., 2008). Algorithms optimized for on-disk construction are also available.
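For small inputs the sort-based construction is still convenient. As a sketch, it can be wrapped in a function (the name `suffix_array` is hypothetical) and checked against the bananabread example; this is the simple O(n² log n) approach, not SA-IS.

```rust
/// Build a suffix array by sorting suffix start positions with the standard
/// library sort. Each comparison looks at the suffixes themselves, so the
/// worst case is O(n^2 log n).
fn suffix_array<T: Ord>(data: &[T]) -> Vec<usize> {
    // Start with the indices 0..n-1 in corpus order.
    let mut sarr: Vec<usize> = (0..data.len()).collect();
    // Sort the indices by the suffix that starts at each of them.
    sarr.sort_by(|&i1, &i2| data[i1..].cmp(&data[i2..]));
    sarr
}
```

Only the indices are moved during sorting; the suffixes are never copied, which is the "level of indirection" that keeps memory use at one integer per corpus position.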

Contains
Does the data d contain the sequence s? It does when there is a suffix that has s as its prefix. Use the suffix array sa to find such a suffix in O(log n) time.
Procedure:
1. Get the value d_i of the middle element of the suffix array, sa[m].
2. Compare the slice d[d_i..d_i + |s|] (cut off at the end of d) and s.
3. When:
   - s is equal: the data contains the sequence.
   - s is smaller: repeat the search in sa[0..m].
   - s is larger: repeat the search in sa[m + 1..|sa|].
If the remaining search range becomes empty, the data does not contain s.
Note: here a[i..j] is the slice of a from index i inclusive to index j exclusive.

Contains
Does the data array contain the string ba?
Data array:   b a n a n a b r e a d
Suffix array: 5 9 3 1 0 6 10 8 4 2 7
Perform a binary search on the suffix array, performing comparisons on the data array:
1. String of length 2 at position 6: br, larger than ba. Continue the binary search in the subarray left of this element.
2. String of length 2 at position 3: an, smaller than ba. Continue the binary search in the subarray right of this element.
3. String of length 2 at position 1: an, smaller than ba. Continue the binary search in the subarray right of this element.
4. String of length 2 at position 0: ba. The data array contains the string ba.

Implementation
    use std::cmp::min;

    pub fn contains(&self, needle: &[T]) -> bool {
        self.sarr.binary_search_by(|&idx| {
            // We don't want to compare 'needle' with the complete suffix
            // starting at 'idx', but only the prefix of the length of
            // 'needle'. However, 'needle' can be longer than the suffix
            // at 'idx'. So, we have to avoid slicing beyond the end.
            let upper = min(self.data.len(), idx + needle.len());
            // Compare the prefix of the suffix and the needle.
            self.data[idx..upper].cmp(needle)
        }).is_ok()
    }
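To try the lookup outside the surrounding struct, the same logic can be written as a free function. This is a sketch that assumes the suffix array is passed in alongside the data; the three-argument signature is an assumption made for illustration, not the original API.

```rust
use std::cmp::min;

/// Does `data` contain `needle`? `sarr` must be the suffix array of `data`.
fn contains<T: Ord>(data: &[T], sarr: &[usize], needle: &[T]) -> bool {
    sarr.binary_search_by(|&idx| {
        // Compare only the first `needle.len()` elements of the suffix,
        // without slicing beyond the end of the data.
        let upper = min(data.len(), idx + needle.len());
        data[idx..upper].cmp(needle)
    })
    .is_ok()
}
```

With the bananabread data and its suffix array 5 9 3 1 0 6 10 8 4 2 7, looking up ba succeeds, while a sequence such as nad that never occurs is rejected.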

Sequence positions At what corpus positions does the sequence s occur? Return the slice of sa that contains indices of suffixes starting with s. (Since sa contains the suffixes in sorted order, the indices of the occurrences of s will be adjacent.)

Example
Where does the string aba occur in the array?
Data array:   a b a b a b a b
Suffix array: 6 4 2 0 7 5 3 1
Bounds
- Lower bound: the first position in the suffix array where an additional instance of the sequence could be inserted.
- Upper bound: the last position in the suffix array where an additional instance of the sequence could be inserted.
For aba the lower bound is 1 and the upper bound is 4, so the occurrences are sa[1..4] = 4 2 0.

Finding the upper/lower bound
Both the lower and upper bounds can be found using binary search. We will:
1. First look at the procedures for a normal sorted array.
2. Apply the procedures to the suffix array, with its extra level of indirection.

Finding the lower bound
Consider a sorted array a. Procedure for finding the lower bound of v:
1. Determine the index of the middle element of a, i.
2. Compare a[i] and v.
3. When:
   - v is equal: repeat the search in a[0..i]
   - v is smaller: repeat the search in a[0..i]
   - v is larger: repeat the search in a[i + 1..|a|]
4. If the final a[i] is smaller than v, then i ← i + 1.
5. i is the lower bound.

Finding the upper bound
Consider a sorted array a. Procedure for finding the upper bound of v:
1. Determine the index of the middle element of a, i.
2. Compare a[i] and v.
3. When:
   - v is equal: repeat the search in a[i + 1..|a|]
   - v is smaller: repeat the search in a[0..i]
   - v is larger: repeat the search in a[i + 1..|a|]
4. If the final a[i] ≤ v, then i ← i + 1.
5. i is the upper bound.
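For a plain sorted array, both bounds can be sketched with a half-open binary search. Note that this is a common variant rather than a literal transcription of the steps above: it keeps the candidate range as [lo, hi), which makes the final "i ← i + 1" adjustment unnecessary. The function names are hypothetical.

```rust
/// First index at which `v` could be inserted while keeping `a` sorted.
fn lower_bound<T: Ord>(a: &[T], v: &T) -> usize {
    let (mut lo, mut hi) = (0, a.len());
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        if a[mid] < *v {
            lo = mid + 1; // v is larger: continue in the right half
        } else {
            hi = mid; // v is equal or smaller: continue in the left half
        }
    }
    lo
}

/// Last index at which `v` could be inserted while keeping `a` sorted.
fn upper_bound<T: Ord>(a: &[T], v: &T) -> usize {
    let (mut lo, mut hi) = (0, a.len());
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        if a[mid] <= *v {
            lo = mid + 1; // v is equal or larger: continue in the right half
        } else {
            hi = mid; // v is smaller: continue in the left half
        }
    }
    lo
}
```

The only difference between the two functions is how equality is treated: the lower bound moves left on a match, the upper bound moves right.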

Lower bound
What is the lower bound for aba?
Data array:   a b a b a b a b
Suffix array: 6 4 2 0 7 5 3 1
1. String of length 3 at position 0: aba. Equal, search the left half.
2. String of length 3 at position 4: aba. Equal, search the left half.
3. String of length 3 at position 6: ab$. Smaller than aba, search the right half.
4. Since the suffix was smaller than aba, increment by one to get the lower bound, which is index 1 of the suffix array.

Implementation
    pub fn find(&self, needle: &[T]) -> &[usize] {
        let lower = self.lower_bound(needle);
        let upper = self.upper_bound(needle);
        &self.sarr[lower..upper]
    }

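Implementation (lower bound)
    use std::cmp::{min, Ordering};

    pub fn lower_bound(&self, needle: &[T]) -> usize {
        let result = self.sarr.binary_search_by(|&idx| {
            let upper = min(self.data.len(), idx + needle.len());
            // Treat a match as "greater", so that the search keeps moving
            // left and ends just before the first match.
            match self.data[idx..upper].cmp(needle) {
                Ordering::Equal => Ordering::Greater,
                ordering => ordering,
            }
        });
        match result {
            // The comparator never returns Equal.
            Ok(_) => unreachable!(),
            Err(lower) => lower,
        }
    }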

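Implementation (upper bound)
    use std::cmp::{min, Ordering};

    pub fn upper_bound(&self, needle: &[T]) -> usize {
        let result = self.sarr.binary_search_by(|&idx| {
            let upper = min(self.data.len(), idx + needle.len());
            // Treat a match as "less", so that the search keeps moving
            // right and ends just after the last match.
            match self.data[idx..upper].cmp(needle) {
                Ordering::Equal => Ordering::Less,
                ordering => ordering,
            }
        });
        match result {
            // The comparator never returns Equal.
            Ok(_) => unreachable!(),
            Err(upper) => upper,
        }
    }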
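Put together, the pieces can be tried on the abababab example. The struct layout below is an assumption (the slides only show the methods), and `new` and `cmp_prefix` are hypothetical helpers; construction uses the simple sort-based approach.

```rust
use std::cmp::{min, Ordering};

/// Assumed layout: the data array plus its suffix array.
struct SuffixArray<T> {
    data: Vec<T>,
    sarr: Vec<usize>,
}

impl<T: Ord> SuffixArray<T> {
    fn new(data: Vec<T>) -> Self {
        let mut sarr: Vec<usize> = (0..data.len()).collect();
        sarr.sort_by(|&i1, &i2| data[i1..].cmp(&data[i2..]));
        SuffixArray { data, sarr }
    }

    /// Compare the prefix of the suffix starting at `idx` with `needle`.
    fn cmp_prefix(&self, idx: usize, needle: &[T]) -> Ordering {
        let upper = min(self.data.len(), idx + needle.len());
        self.data[idx..upper].cmp(needle)
    }

    fn lower_bound(&self, needle: &[T]) -> usize {
        self.sarr
            .binary_search_by(|&idx| match self.cmp_prefix(idx, needle) {
                Ordering::Equal => Ordering::Greater,
                ordering => ordering,
            })
            // The comparator never reports Equal, so this is always Err(pos).
            .unwrap_or_else(|pos| pos)
    }

    fn upper_bound(&self, needle: &[T]) -> usize {
        self.sarr
            .binary_search_by(|&idx| match self.cmp_prefix(idx, needle) {
                Ordering::Equal => Ordering::Less,
                ordering => ordering,
            })
            .unwrap_or_else(|pos| pos)
    }

    /// Corpus positions of all occurrences of `needle`, in suffix order.
    fn find(&self, needle: &[T]) -> &[usize] {
        &self.sarr[self.lower_bound(needle)..self.upper_bound(needle)]
    }
}
```

For the data abababab and the needle aba, the bounds are 1 and 4, and `find` returns the corpus positions 4, 2 and 0.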

Suffix arrays in IR
- We can now search arbitrary-length strings: Bundeskansler Merkel
- We can search partial strings: ansler
- How do we map results back to documents?

Mapping to documents
Document offsets: 0 10
Data array:       P i n e a p p l e P e n
Suffix array:     9 10 0 4 8 3 11 1 7 12 2 6 5
Assumption: documents have identifiers [0..n).
Create an array with n elements and store for each document the offset in the data array.
Results of a suffix array search can be mapped to documents using binary search :^)
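A minimal sketch of that mapping, assuming the offsets array is sorted and that a document covers every position from its own offset up to the next document's offset. The function name `document_of` is hypothetical.

```rust
/// `offsets[d]` is the position in the data array where document `d` starts.
/// Returns the identifier of the document that contains `position`.
fn document_of(offsets: &[usize], position: usize) -> Option<usize> {
    match offsets.binary_search(&position) {
        // The position is exactly the start of a document.
        Ok(doc) => Some(doc),
        // The position lies before the first document (or there are none).
        Err(0) => None,
        // Otherwise the position belongs to the document that starts before it.
        Err(insert_at) => Some(insert_at - 1),
    }
}
```

With the offsets 0 and 10 from the example, position 4 maps to document 0 and position 11 maps to document 1.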

Final notes
Suffix arrays can be created for any array of a type that can be ordered. So, we could use:
- Array of bytes
- Array of characters
- Array of tokens
The data and suffix arrays can be used from disk using techniques that we will discuss later.