Index extensions. Manning, Raghavan and Schütze, Chapter 2. Daniël de Kok
2 Today We will discuss some extensions to the inverted index to accommodate boolean processing: Skip pointers: improve performance for intersection of postings lists. Biword indexes: adding support for phrase queries. Positional indexes: a different approach to phrase queries. Suffix arrays: add support for character n-gram and/or phrase queries.
3 Skip pointers
4 Introduction Basic intersection of two sorted lists of lengths n and m takes O(n + m) time. However, sub-linear time is possible using skip pointers.
6 Skip pointers [Figure: postings lists for Brutus and Caesar, with skip pointers.] Add a pointer every n items. This permits skipping through the postings list more quickly.
10 Skip pointers [Figure: postings lists for Brutus and Caesar, with skip pointers.] Suppose that we are computing Brutus AND Caesar and we have found 8 in both lists. In the lists we step to 16 and 41 respectively. Since 16 < 41, we have to advance in the Brutus list, and 16 has a skip pointer. Since the document pointed at by the skip pointer (28) is also less than 41, we can follow the skip pointer and make three steps at once.
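The walk-through above can be sketched in Rust. This is a minimal sketch rather than the implementation from the lecture: it assumes postings lists stored as sorted slices, and it models skip pointers implicitly, so that every position that is a multiple of the hypothetical parameter `skip` points `skip` positions ahead. The names `intersect_with_skips` and `advance` are made up for this illustration.

```rust
/// Intersect two sorted postings lists.
/// `skip` is an assumed skip span: position i carries a skip pointer to
/// position i + skip whenever i is a multiple of `skip`.
fn intersect_with_skips(p1: &[u32], p2: &[u32], skip: usize) -> Vec<u32> {
    assert!(skip > 0, "skip span must be positive");
    let mut answer = Vec::new();
    let (mut i, mut j) = (0, 0);
    while i < p1.len() && j < p2.len() {
        if p1[i] == p2[j] {
            // Document occurs in both lists.
            answer.push(p1[i]);
            i += 1;
            j += 1;
        } else if p1[i] < p2[j] {
            i = advance(p1, i, p2[j], skip);
        } else {
            j = advance(p2, j, p1[i], skip);
        }
    }
    answer
}

/// Move forward from `pos` in `list`. The skip pointer is followed only
/// when there is one and its target does not overshoot `bound`;
/// otherwise we take a single step.
fn advance(list: &[u32], pos: usize, bound: u32, skip: usize) -> usize {
    let target = pos + skip;
    if pos % skip == 0 && target < list.len() && list[target] <= bound {
        target
    } else {
        pos + 1
    }
}
```

With a skip span of 3 and a Brutus list in which 16 sits at index 3 and 28 at index 6, the step from 16 jumps straight to 28 when the other list is at 41, as in the slide.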
11 Skip pointers Skip pointers are especially useful for linked lists. No explicit skip pointers are needed for an array representation, since an array can be indexed directly.
14 Number of pointers The number of skip pointers is a trade-off. More skips: shorter skip spans, so we are more likely to skip, but more comparisons and more space for storing skip pointers. Fewer skips: longer skip spans, so fewer opportunities to skip, but fewer comparisons and less space for storing skip pointers.
15 Number of pointers Advice from the book: for a postings list of length P, use √P evenly placed skip pointers. Improvements are possible, since this heuristic ignores the distribution of query terms.
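As a small illustration of the book's heuristic (roughly √P evenly spaced skip pointers for a list of length P), the sketch below computes where the pointers would go. The function name `skip_pointers` and the choice to drop a pointer whose target would fall past the end of the list are assumptions made for this example.

```rust
/// Returns (from, to) index pairs for skip pointers spaced floor(sqrt(len))
/// positions apart. Lists too short to benefit get no pointers.
fn skip_pointers(len: usize) -> Vec<(usize, usize)> {
    let span = (len as f64).sqrt().floor() as usize;
    if span < 2 {
        return Vec::new();
    }
    (0..len)
        .step_by(span)
        // Keep only pointers whose target is still inside the list.
        .filter(|&from| from + span < len)
        .map(|from| (from, from + span))
        .collect()
}
```

For a list of 16 postings this yields pointers from index 0 to 4, from 4 to 8 and from 8 to 12.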
17 Skip lists Note: the idea of skip pointers relates strongly to the skip list data structure (Pugh, 1990), which provides O(log n) average-case search/insertion/deletion. [Figure: a skip list with a head node and NIL-terminated levels.]
18 Biword indexes
19 Introduction One undiscussed aspect of boolean queries: phrase queries. E.g.: "information retrieval", "Better call Saul". Many search engines support them through double quotes. Biword indexes provide a simple method to perform such queries.
22 Basic idea Construct an inverted index with word bigrams as terms. This allows lookup of two-word terms, such as "information retrieval". Lookup of longer phrases is also possible if false positives are acceptable. For example, expand the query "Better call Saul" to: "Better call" AND "call Saul".
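The idea can be sketched as follows. This is only an illustration under simple assumptions: documents are tokenized by whitespace and lowercased, and the names `build_biword_index` and `phrase_query` are hypothetical.

```rust
use std::collections::{BTreeSet, HashMap};

type DocId = u32;
type BiwordIndex = HashMap<(String, String), BTreeSet<DocId>>;

/// Split on whitespace and lowercase (an assumed, very naive tokenizer).
fn tokens(text: &str) -> Vec<String> {
    text.split_whitespace().map(|w| w.to_lowercase()).collect()
}

/// Build a biword index: every pair of adjacent words becomes a term.
fn build_biword_index(docs: &[(DocId, &str)]) -> BiwordIndex {
    let mut index = BiwordIndex::new();
    for &(doc_id, text) in docs {
        for pair in tokens(text).windows(2) {
            index
                .entry((pair[0].clone(), pair[1].clone()))
                .or_default()
                .insert(doc_id);
        }
    }
    index
}

/// Answer a phrase query by ANDing the postings of all its biwords.
/// A query of fewer than two words has no biwords and returns nothing.
fn phrase_query(index: &BiwordIndex, phrase: &str) -> BTreeSet<DocId> {
    let mut result: Option<BTreeSet<DocId>> = None;
    for pair in tokens(phrase).windows(2) {
        let key = (pair[0].clone(), pair[1].clone());
        let postings = index.get(&key).cloned().unwrap_or_default();
        result = Some(match result {
            None => postings,
            Some(acc) => acc.intersection(&postings).copied().collect(),
        });
    }
    result.unwrap_or_default()
}
```

A document such as "call Saul and better call me" contains both biwords of "Better call Saul" without containing the phrase, so it is returned as a false positive.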
25 Function words Nouns often express the main information being searched. In a phrase query such as "renegotiation of the constitution", the function words of and the are largely irrelevant, but they do increase query processing time. Typical solution: 1. Use a part-of-speech tagger to pick out the nouns: renegotiation/n of/p the/d constitution/n 2. Perform the query "renegotiation constitution" on the biword index. Obviously, we need to remove function words when constructing the index as well.
26 Problems with biword indexes They can return false positives. The idea can be extended to longer phrases, but this is expensive, and at what length should we stop?
27 Positional indexes
28 Introduction Due to their downsides, biword indexes are not the standard solution for phrase queries. Instead, positional indexes are commonly used. A positional index is an index where a posting contains: the document identifier, and a list of positions where the term occurs within the document.
32 Example
    Dictionary     Postings
    information    10: [20, 54]    34: [2]     93: [183]
    retrieval       3: [65]        10: [55]    93: [44]
Query: "information retrieval". Perform an intersection on the postings lists of information and retrieval. If a document is in both D(information) and D(retrieval), use the positions to check that the terms occur consecutively. Here documents 10 and 93 occur in both lists, but only document 10 has information (position 54) directly followed by retrieval (position 55).
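The position check can be sketched like this. It is a minimal sketch under assumptions: the postings of a term are held in a `BTreeMap` from document identifier to sorted positions (the alias `Postings` and the function `phrase_docs` are hypothetical), and the document intersection is done by map lookup rather than by the merge used for plain postings lists.

```rust
use std::collections::BTreeMap;

type DocId = u32;
/// Postings of one term: document identifier -> sorted positions.
type Postings = BTreeMap<DocId, Vec<u32>>;

/// Documents in which `second` occurs directly after `first`.
fn phrase_docs(first: &Postings, second: &Postings) -> Vec<DocId> {
    let mut hits = Vec::new();
    for (doc, pos1) in first {
        // Only documents that contain both terms can match.
        if let Some(pos2) = second.get(doc) {
            // The terms are consecutive if some position p of `first`
            // has p + 1 among the positions of `second`.
            if pos1.iter().any(|p| pos2.binary_search(&(p + 1)).is_ok()) {
                hits.push(*doc);
            }
        }
    }
    hits
}
```

Applied to the postings in the example, only document 10 is returned.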
33 Suffix arrays
35 Introduction Suffix arrays are a data structure for finding/counting n-grams of an arbitrary length in O(log n) time. Working example: bananabread
37 Suffixes
     0  b a n a n a b r e a d
     1  a n a n a b r e a d
     2  n a n a b r e a d
     3  a n a b r e a d
     4  n a b r e a d
     5  a b r e a d
     6  b r e a d
     7  r e a d
     8  e a d
     9  a d
    10  d
Enumeration of the suffixes in bananabread. The index in the first column is the corpus position of the suffix.
38 Sorting suffixes
     5  a b r e a d
     9  a d
     3  a n a b r e a d
     1  a n a n a b r e a d
     0  b a n a n a b r e a d
     6  b r e a d
    10  d
     8  e a d
     4  n a b r e a d
     2  n a n a b r e a d
     7  r e a d
Lexicographical sorting of the suffixes. The first column is the suffix array and contains the corpus positions sorted by suffix order.
39 In-class assignment Create the suffix array for the string Mississippi. Assume that uppercase letters are sorted before lowercase letters.
41 Construction Do we need to explicitly store all suffixes in memory to construct the suffix array? "All problems in computer science can be solved by another level of indirection" - David Wheeler
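42 Example [Figure: the data array b a n a n a b r e a d with an array of indices into it.]
43 Sorted [Figure: the same data array b a n a n a b r e a d; only the index array is sorted.]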
44 Construction code
    // Construct an array with the same length as the data array
    // with indices 0..n-1.
    let mut sarr: Vec<_> = (0..data.len()).collect();

    // Sort the array, thereby constructing the suffix array. Note
    // that we compare the array slices starting at the given indices.
    sarr.sort_by(|&i1, &i2| data[i1..].cmp(&data[i2..]));
45 Notes on construction Constructing a suffix array using an off-the-shelf sorting algorithm is notoriously inefficient. A good sorting algorithm needs O(n log n) comparisons, and each suffix comparison takes O(n) time. Consequently: O(n² log n) time. O(n) algorithms exist (e.g. SA-IS, Nong, et al., 2008). Algorithms optimized for on-disk construction are also available.
49 Contains Does the data d contain the sequence s? It does when there is a suffix that has s as its prefix. Use the suffix array sa to find such a suffix in O(log n) time. Procedure: 1. Let m be the index of the middle element of the suffix array and d_i = sa[m] its value. 2. Compare the slice d[d_i..d_i + |s|] (cut off at the end of d) and s. 3. When s is equal: the data contains the sequence. When s is smaller: repeat the search in sa[0..m]. When s is larger: repeat the search in sa[m + 1..|sa|]. Note: here a[i..j] is the slice of a from index i inclusive to index j exclusive.
50 Contains Does the data array contain the string ba? Data array: b a n a n a b r e a d. Perform a binary search on the suffix array, performing comparisons on the data array.
51 Contains String of length 2 at position 6: br, larger than ba. Continue the binary search in the subarray left of this element.
52 Contains String of length 2 at position 3: an, smaller than ba. Continue the binary search in the subarray right of this element.
53 Contains String of length 2 at position 1: an, smaller than ba. Continue the binary search in the subarray right of this element.
54 Contains String of length 2 at position 0: ba, equal to ba. The data array contains the string ba.
55 Implementation
    // Assumes `use std::cmp::min;` at the top of the module.
    pub fn contains(&self, needle: &[T]) -> bool {
        self.sarr.binary_search_by(|&idx| {
            // We don't want to compare 'needle' with the complete suffix
            // starting at 'idx', but only the prefix of the length of
            // 'needle'. However, 'needle' can be longer than the suffix
            // at 'idx'. So, we have to avoid slicing beyond the end.
            let upper = min(self.data.len(), idx + needle.len());

            // Compare the prefix of the suffix and the needle.
            self.data[idx..upper].cmp(needle)
        }).is_ok()
    }
56 Sequence positions At what corpus positions does the sequence s occur? Return the slice of sa that contains indices of suffixes starting with s. (Since sa contains the suffixes in sorted order, the indices of the occurrences of s will be adjacent.)
57 Example Where does the string aba occur in the array? a b a b a b a b
58 Bounds Where does the string aba occur in the array a b a b a b a b? Lower bound: the first position in the sorted array where an additional instance of the sequence could be inserted without breaking the sort order. Upper bound: the last position where an additional instance of the sequence could be inserted without breaking the sort order.
59 Finding the upper/lower bound Both the lower and upper bounds can be found using binary search. We will: First look at the procedures for a normal sorted array. Apply the procedure to the suffix array, with its extra level of indirection.
64 Finding the lower bound Consider a sorted array a. Procedure for finding the lower bound of v: 1. Determine the index of the middle element of a, i. 2. Compare a[i] and v. 3. When v is equal: repeat the search in a[0..i]. When v is smaller: repeat the search in a[0..i]. When v is larger: repeat the search in a[i + 1..|a|]. 4. If the final a[i] is smaller than v, then i ← i + 1. 5. i is the lower bound.
69 Finding the upper bound Consider a sorted array a. Procedure for finding the upper bound of v: 1. Determine the index of the middle element of a, i. 2. Compare a[i] and v. 3. When v is equal: repeat the search in a[i + 1..|a|]. When v is smaller: repeat the search in a[0..i]. When v is larger: repeat the search in a[i + 1..|a|]. 4. If the final a[i] is smaller than or equal to v, then i ← i + 1. 5. i is the upper bound.
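For a plain sorted array, the two procedures can be sketched as below. This is one possible formulation, not necessarily the one intended on the slides: it keeps a half-open search range `lo..hi`, which folds the final "increment by one" step into the loop. The function names are chosen for this example.

```rust
/// First index at which `v` could be inserted while keeping `a` sorted.
fn lower_bound<T: Ord>(a: &[T], v: &T) -> usize {
    let (mut lo, mut hi) = (0, a.len());
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        if a[mid] < *v {
            lo = mid + 1; // v is larger: continue in the right half
        } else {
            hi = mid; // v is equal or smaller: continue in the left half
        }
    }
    lo
}

/// Last index at which `v` could be inserted while keeping `a` sorted.
fn upper_bound<T: Ord>(a: &[T], v: &T) -> usize {
    let (mut lo, mut hi) = (0, a.len());
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        if a[mid] <= *v {
            lo = mid + 1; // v is equal or larger: continue in the right half
        } else {
            hi = mid; // v is smaller: continue in the left half
        }
    }
    lo
}
```

The occurrences of `v` are then exactly `a[lower_bound(a, v)..upper_bound(a, v)]`, and the slice is empty when `v` is absent.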
70 Lower bound What is the lower bound for aba? Data array: a b a b a b a b. String of length 3 at position 0: aba. Equal, search the left half.
71 Lower bound String of length 3 at position 4: aba. Equal, search the left half.
72 Lower bound String of length 3 at position 6: ab$. Smaller than aba, search the right half.
73 Lower bound Since the suffix was smaller than aba, increment by one to get the lower bound.
74 Implementation
    pub fn find(&self, needle: &[T]) -> &[usize] {
        let lower = self.lower_bound(needle);
        let upper = self.upper_bound(needle);

        &self.sarr[lower..upper]
    }
75 Implementation (lower bound)
    // Assumes `use std::cmp::{min, Ordering};` at the top of the module.
    pub fn lower_bound(&self, needle: &[T]) -> usize {
        let result = self.sarr.binary_search_by(|&idx| {
            let upper = min(self.data.len(), idx + needle.len());

            // Treat a match as "too large", so the search keeps moving left.
            match self.data[idx..upper].cmp(needle) {
                Ordering::Equal => Ordering::Greater,
                ordering => ordering,
            }
        });

        match result {
            Ok(_) => unreachable!(),
            Err(lower) => lower,
        }
    }
76 Implementation (upper bound)
    // Assumes `use std::cmp::{min, Ordering};` at the top of the module.
    pub fn upper_bound(&self, needle: &[T]) -> usize {
        let result = self.sarr.binary_search_by(|&idx| {
            let upper = min(self.data.len(), idx + needle.len());

            // Treat a match as "too small", so the search keeps moving right.
            match self.data[idx..upper].cmp(needle) {
                Ordering::Equal => Ordering::Less,
                ordering => ordering,
            }
        });

        match result {
            Ok(_) => unreachable!(),
            Err(upper) => upper,
        }
    }
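To try the fragments above end to end, they can be assembled into one small type. The sketch below is an assumption about how the pieces fit together: the struct name `SuffixArray`, its fields, the constructor `new` and the shared helper `bound` are hypothetical and not taken from the slides.

```rust
use std::cmp::{min, Ordering};

/// Hypothetical container for the data and its suffix array.
pub struct SuffixArray<T> {
    data: Vec<T>,
    sarr: Vec<usize>,
}

impl<T: Ord> SuffixArray<T> {
    /// Build the suffix array by sorting indices by the suffix they start.
    pub fn new(data: Vec<T>) -> Self {
        let mut sarr: Vec<usize> = (0..data.len()).collect();
        sarr.sort_by(|&i1, &i2| data[i1..].cmp(&data[i2..]));
        SuffixArray { data, sarr }
    }

    /// Binary search in which `on_equal` (Less or Greater) decides where
    /// to continue when the prefix of a suffix equals the needle.
    fn bound(&self, needle: &[T], on_equal: Ordering) -> usize {
        let result = self.sarr.binary_search_by(|&idx| {
            let upper = min(self.data.len(), idx + needle.len());
            match self.data[idx..upper].cmp(needle) {
                Ordering::Equal => on_equal,
                ordering => ordering,
            }
        });
        // The closure never reports a match, so the search ends at the
        // insertion point; both arms are handled to keep this total.
        match result {
            Ok(idx) | Err(idx) => idx,
        }
    }

    /// Corpus positions at which `needle` occurs, in suffix order.
    pub fn find(&self, needle: &[T]) -> &[usize] {
        let lower = self.bound(needle, Ordering::Greater);
        let upper = self.bound(needle, Ordering::Less);
        &self.sarr[lower..upper]
    }
}
```

For the data `abababab` and the needle `aba`, `find` returns the positions 4, 2 and 0, in that order, because the result follows suffix order rather than corpus order.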
77 Suffix arrays in IR We can now search arbitrary-length strings: "Bundeskansler Merkel". We can search partial strings: "ansler". How do we map results back to documents?
82 Mapping to documents [Figure: the data array P i n e a p p l e P e n, annotated with 0 and 10.] Assumption: documents have identifiers [0..n). Create an array with n elements and store for each document the offset in the data array. Results of a suffix array search can be mapped to documents using binary search :^)
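A minimal sketch of that mapping, assuming the offsets are strictly increasing and the first document starts at offset 0. The function name `document_of` and the offsets in the usage note are made up for illustration.

```rust
/// `offsets[d]` is the position in the data array where document d starts.
/// Returns the identifier of the document that contains `position`.
fn document_of(offsets: &[usize], position: usize) -> usize {
    match offsets.binary_search(&position) {
        // The position is exactly the start of a document.
        Ok(doc) => doc,
        // Otherwise it lies in the document that starts just before the
        // insertion point. This is at least 1 because offsets[0] == 0.
        Err(insertion_point) => insertion_point - 1,
    }
}
```

With offsets `[0, 10, 25]`, position 9 maps to document 0, position 10 to document 1 and position 25 to document 2.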
83 Final notes Suffix arrays can be created for any array of a type that can be ordered. So, we could use: an array of bytes, an array of characters, or an array of tokens. The data and suffix arrays can be used from disk using techniques that we will discuss later.
More informationInformation Retrieval
Introduction to Information Retrieval CS3245 Information Retrieval Lecture 6: Index Compression 6 Last Time: index construction Sort- based indexing Blocked Sort- Based Indexing Merge sort is effective
More informationModern Information Retrieval
Modern Information Retrieval Chapter 9 Indexing and Searching with Gonzalo Navarro Introduction Inverted Indexes Signature Files Suffix Trees and Suffix Arrays Sequential Searching Multi-dimensional Indexing
More informationIndexing and Searching
Indexing and Searching Introduction How to retrieval information? A simple alternative is to search the whole text sequentially Another option is to build data structures over the text (called indices)
More informationQuerying Introduction to Information Retrieval INF 141 Donald J. Patterson. Content adapted from Hinrich Schütze
Introduction to Information Retrieval INF 141 Donald J. Patterson Content adapted from Hinrich Schütze http://www.informationretrieval.org Overview Boolean Retrieval Weighted Boolean Retrieval Zone Indices
More informationLecture Notes on Tries
Lecture Notes on Tries 15-122: Principles of Imperative Computation Thomas Cortina Notes by Frank Pfenning Lecture 24 April 19, 2011 1 Introduction In the data structures implementing associative arrays
More informationSolution to Problem 1 of HW 2. Finding the L1 and L2 edges of the graph used in the UD problem, using a suffix array instead of a suffix tree.
Solution to Problem 1 of HW 2. Finding the L1 and L2 edges of the graph used in the UD problem, using a suffix array instead of a suffix tree. The basic approach is the same as when using a suffix tree,
More informationIndex construction CE-324: Modern Information Retrieval Sharif University of Technology
Index construction CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2016 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford) Ch.
More informationRecap of last time CS276A Information Retrieval
Recap of last time CS276A Information Retrieval Index compression Space estimation Lecture 4 This lecture Tolerant retrieval Wild-card queries Spelling correction Soundex Wild-card queries Wild-card queries:
More informationRecap: lecture 2 CS276A Information Retrieval
Recap: lecture 2 CS276A Information Retrieval Stemming, tokenization etc. Faster postings merges Phrase queries Lecture 3 This lecture Index compression Space estimation Corpus size for estimates Consider
More informationIndex Construction 1
Index Construction 1 October, 2009 1 Vorlage: Folien von M. Schütze 1 von 43 Index Construction Hardware basics Many design decisions in information retrieval are based on hardware constraints. We begin
More informationSIGNAL COMPRESSION Lecture Lempel-Ziv Coding
SIGNAL COMPRESSION Lecture 5 11.9.2007 Lempel-Ziv Coding Dictionary methods Ziv-Lempel 77 The gzip variant of Ziv-Lempel 77 Ziv-Lempel 78 The LZW variant of Ziv-Lempel 78 Asymptotic optimality of Ziv-Lempel
More informationBoolean Queries. Keywords combined with Boolean operators:
Query Languages 1 Boolean Queries Keywords combined with Boolean operators: OR: (e 1 OR e 2 ) AND: (e 1 AND e 2 ) BUT: (e 1 BUT e 2 ) Satisfy e 1 but not e 2 Negation only allowed using BUT to allow efficient
More informationAdvanced Retrieval Information Analysis Boolean Retrieval
Advanced Retrieval Information Analysis Boolean Retrieval Irwan Ary Dharmawan 1,2,3 iad@unpad.ac.id Hana Rizmadewi Agustina 2,4 hagustina@unpad.ac.id 1) Development Center of Information System and Technology
More informationQuery Processing and Alternative Search Structures. Indexing common words
Query Processing and Alternative Search Structures CS 510 Winter 2007 1 Indexing common words What is the indexing overhead for a common term? I.e., does leaving out stopwords help? Consider a word such
More informationCSE 530A. B+ Trees. Washington University Fall 2013
CSE 530A B+ Trees Washington University Fall 2013 B Trees A B tree is an ordered (non-binary) tree where the internal nodes can have a varying number of child nodes (within some range) B Trees When a key
More informationDictionaries and Tolerant retrieval
Dictionaries and Tolerant retrieval Slides adapted from Stanford CS297:Introduction to Information Retrieval A skipped lecture The type/token distinction Terms are normalized types put in the dictionary
More informationIntroduction to Information Retrieval
Introduction Inverted index Processing Boolean queries Course overview Introduction to Information Retrieval http://informationretrieval.org IIR 1: Boolean Retrieval Hinrich Schütze Institute for Natural
More informationDefinition: A data structure is a way of organizing data in a computer so that it can be used efficiently.
The Science of Computing I Lesson 4: Introduction to Data Structures Living with Cyber Pillar: Data Structures The need for data structures The algorithms we design to solve problems rarely do so without
More information3-2. Index construction. Most slides were adapted from Stanford CS 276 course and University of Munich IR course.
3-2. Index construction Most slides were adapted from Stanford CS 276 course and University of Munich IR course. 1 Ch. 4 Index construction How do we construct an index? What strategies can we use with
More informationGUJARAT TECHNOLOGICAL UNIVERSITY
GUJARAT TECHNOLOGICAL UNIVERSITY INFORMATION TECHNOLOGY DATA COMPRESSION AND DATA RETRIVAL SUBJECT CODE: 2161603 B.E. 6 th SEMESTER Type of course: Core Prerequisite: None Rationale: Data compression refers
More informationIndexing. Lecture Objectives. Text Technologies for Data Science INFR Learn about and implement Boolean search Inverted index Positional index
Text Technologies for Data Science INFR11145 Indexing Instructor: Walid Magdy 03-Oct-2017 Lecture Objectives Learn about and implement Boolean search Inverted index Positional index 2 1 Indexing Process
More informationIndex construction CE-324: Modern Information Retrieval Sharif University of Technology
Index construction CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2017 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford) Ch.
More informationIntroduction to Information Retrieval
Introduction to Information Retrieval http://informationretrieval.org IIR 5: Index Compression Hinrich Schütze Center for Information and Language Processing, University of Munich 2014-04-17 1/59 Overview
More informationIndexing and Searching
Indexing and Searching Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University References: 1. Modern Information Retrieval, chapter 8 2. Information Retrieval:
More informationParallel and Sequential Data Structures and Algorithms Lecture (Spring 2012) Lecture 25 Suffix Arrays
Lecture 25 Suffix Arrays Parallel and Sequential Data Structures and Algorithms, 15-210 (Spring 2012) Lectured by Kanat Tangwongsan April 17, 2012 Material in this lecture: The main theme of this lecture
More informationIndex construction CE-324: Modern Information Retrieval Sharif University of Technology
Index construction CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2014 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford) Ch.
More informationInformation Retrieval
Information Retrieval Suan Lee - Information Retrieval - 05 Index Compression 1 05 Index Compression - Information Retrieval - 05 Index Compression 2 Last lecture index construction Sort-based indexing
More informationDatabase Systems. File Organization-2. A.R. Hurson 323 CS Building
File Organization-2 A.R. Hurson 323 CS Building Indexing schemes for Files The indexing is a technique in an attempt to reduce the number of accesses to the secondary storage in an information retrieval
More informationIntroduction to Information Retrieval
Introduction to Information Retrieval http://informationretrieval.org IIR 1: Boolean Retrieval Hinrich Schütze Center for Information and Language Processing, University of Munich 2014-04-09 Schütze: Boolean
More informationCS 320 Midterm Exam. Fall 2018
Name: BU ID: CS 320 Midterm Exam Fall 2018 Write here the number of the problem you are skipping: You must complete 5 of the 6 problems on this exam for full credit. Each problem is of equal weight. Please
More informationCMSC 341 Lecture 16/17 Hashing, Parts 1 & 2
CMSC 341 Lecture 16/17 Hashing, Parts 1 & 2 Prof. John Park Based on slides from previous iterations of this course Today s Topics Overview Uses and motivations of hash tables Major concerns with hash
More informationInformation Retrieval
Introduction to Information Retrieval CS276 Information Retrieval and Web Search Pandu Nayak and Prabhakar Raghavan Lecture 1: Boolean retrieval Information Retrieval Information Retrieval (IR) is finding
More informationStanford University Computer Science Department Solved CS347 Spring 2001 Mid-term.
Stanford University Computer Science Department Solved CS347 Spring 2001 Mid-term. Question 1: (4 points) Shown below is a portion of the positional index in the format term: doc1: position1,position2
More informationSome Practice Problems on Hardware, File Organization and Indexing
Some Practice Problems on Hardware, File Organization and Indexing Multiple Choice State if the following statements are true or false. 1. On average, repeated random IO s are as efficient as repeated
More informationSAMPLE OF THE STUDY MATERIAL PART OF CHAPTER 6. Sorting Algorithms
SAMPLE OF THE STUDY MATERIAL PART OF CHAPTER 6 6.0 Introduction Sorting algorithms used in computer science are often classified by: Computational complexity (worst, average and best behavior) of element
More informationLCP Array Construction
LCP Array Construction The LCP array is easy to compute in linear time using the suffix array SA and its inverse SA 1. The idea is to compute the lcp values by comparing the suffixes, but skip a prefix
More informationMore B-trees, Hash Tables, etc. CS157B Chris Pollett Feb 21, 2005.
More B-trees, Hash Tables, etc. CS157B Chris Pollett Feb 21, 2005. Outline B-tree Domain of Application B-tree Operations Hash Tables on Disk Hash Table Operations Extensible Hash Tables Multidimensional
More informationDictionaries and tolerant retrieval CE-324 : Modern Information Retrieval Sharif University of Technology
Dictionaries and tolerant retrieval CE-324 : Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2016 Most slides have been adapted from: Profs. Nayak & Raghavan (CS- 276, Stanford)
More informationCS105 Introduction to Information Retrieval
CS105 Introduction to Information Retrieval Lecture: Yang Mu UMass Boston Slides are modified from: http://www.stanford.edu/class/cs276/ Information Retrieval Information Retrieval (IR) is finding material
More informationRepresenting Data Elements
Representing Data Elements Week 10 and 14, Spring 2005 Edited by M. Naci Akkøk, 5.3.2004, 3.3.2005 Contains slides from 18.3.2002 by Hector Garcia-Molina, Vera Goebel INF3100/INF4100 Database Systems Page
More informationIN4325 Indexing and query processing. Claudia Hauff (WIS, TU Delft)
IN4325 Indexing and query processing Claudia Hauff (WIS, TU Delft) The big picture Information need Topic the user wants to know more about The essence of IR Query Translation of need into an input for
More informationExam Principles of Imperative Computation, Summer 2011 William Lovas. June 24, 2011
Exam 3 15-122 Principles of Imperative Computation, Summer 2011 William Lovas June 24, 2011 Name: Sample Solution Andrew ID: wlovas Instructions This exam is closed-book with one double-sided sheet of
More information