Information Retrieval
Information Retrieval: Dictionaries & Tolerant Retrieval
Gintarė Grigonytė, Department of Linguistics and Philology, Uppsala University
Slides based on a previous IR course given by Jörg Tiedemann and on the Introduction to Information Retrieval slides
Data Structures for Dictionaries
- For each term t, we store a list of all documents that contain t.
- [Figure: the dictionary terms BRUTUS, CAESAR, and CALPURNIA, each pointing to its postings list]
Naive Dictionary: array of fixed-width entries
- Entry layout: term (20 bytes), document frequency (4 bytes), pointer to postings list (4 bytes)
- Example entries: a (document frequency 656,265), aachen, ..., zulu (document frequency 221)
- How do we store a dictionary in memory efficiently?
- How do we look up an element in this array at query time?
Dictionary Data Structures
- Two main classes of data structures: hashes and trees
- Criteria for when to use hashes vs. trees:
  - Is there a fixed number of terms or will it keep growing?
  - What are the relative frequencies with which various keys will be accessed?
  - How many terms are we likely to have?
Hashes
- Each vocabulary term is hashed into an integer. Try to avoid collisions.
- At query time: hash the query term, resolve collisions, locate the entry in the fixed-width array.
- Pros:
  - Lookup is fast (faster than in a search tree)
  - Lookup time is constant
- Cons:
  - No way to find minor variants (resume vs. résumé)
  - No prefix search (all terms starting with automat)
  - Need to rehash everything periodically if the vocabulary keeps growing
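The trade-off can be sketched in Python, whose built-in dict is a hash table. This is only an illustration: the vocabulary, document frequencies, and postings below are invented, and the function names are hypothetical.

```python
# Toy dictionary: term -> (document frequency, postings list).
# All terms and numbers here are made up for illustration.
dictionary = {
    "automat": (2, [1, 7]),
    "automatic": (3, [2, 7, 9]),
    "automation": (1, [4]),
    "resume": (2, [3, 5]),
}

def lookup(term):
    """Exact-match lookup: hash the term and return its entry, or None."""
    return dictionary.get(term)

def prefix_scan(prefix):
    """A hash table keeps no key order, so a prefix query must visit every key."""
    return sorted(t for t in dictionary if t.startswith(prefix))
```

Exact lookup takes constant time on average, while `lookup("résumé")` finds nothing even though `resume` is stored, and `prefix_scan` costs time linear in the vocabulary size.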
Trees: Binary tree
- Simplest tree structure
- Efficient for searching
- Pros: solves the prefix problem (terms starting with automat)
- Cons:
  - slower: O(log M) in balanced trees (M is the size of the vocabulary)
  - re-balancing binary trees is expensive
- Alternative: B-tree
Trees: B-tree
- B-tree definition: every internal node has a number of children in the interval [a, b], where a and b are appropriate positive integers, e.g., [2, 4].
Trees: B-tree
- What's the difference?
  - slightly more complex
  - still efficient for searching
  - same features as binary trees (prefix search!)
  - need for re-balancing is less frequent!
Tolerant Retrieval
- Wildcard queries
- Spelling correction
- Soundex
Tolerant Retrieval: Wildcard queries
- Prefix queries: mon*: find all docs containing any term beginning with mon
  - Easy with a B-tree dictionary: retrieve all terms t in the range mon ≤ t < moo
- Suffix queries: *mon: find all docs containing any term ending with mon
  - Maintain an additional tree for terms written backwards
  - Then retrieve all terms t in the range nom ≤ t < non
- How can we find all terms matching pro*cent?
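The two range lookups can be sketched as follows. This is a minimal sketch that assumes a sorted Python list searched with `bisect` as a stand-in for the B-tree; the vocabulary and function names are invented for illustration, and the prefix is assumed to be non-empty.

```python
from bisect import bisect_left

def prefix_range(sorted_terms, prefix):
    """Return all terms t with prefix <= t < upper, e.g. mon <= t < moo."""
    # Upper bound: the prefix with its last character incremented ("mon" -> "moo").
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    lo = bisect_left(sorted_terms, prefix)
    hi = bisect_left(sorted_terms, upper)
    return sorted_terms[lo:hi]

# Hypothetical vocabulary, plus the same terms written backwards.
terms = sorted(["demon", "lemon", "money", "monkey", "month", "moon", "salmon"])
reversed_terms = sorted(t[::-1] for t in terms)

def suffix_query(suffix):
    """*mon becomes the prefix query nom* on the reversed terms."""
    hits = prefix_range(reversed_terms, suffix[::-1])
    return sorted(t[::-1] for t in hits)
```

With this vocabulary, `prefix_range(terms, "mon")` returns the terms between `mon` and `moo`, and `suffix_query("mon")` returns the terms ending in `mon`.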
How to handle * in the middle of a term
- Example: pro*cent
- We could look up pro* and *cent in the B-tree and intersect the two term sets.
- Expensive!
- Alternative: permuterm index!
Permuterm index
- Rotate the wildcard query so that the * occurs at the end.
- Introduce a special symbol $ to indicate the end of the string.
- pro*cent becomes cent$pro*
- How does this help when matching queries with the index?
Permuterm index
- Add all rotations to the B-tree:
- HELLO → hello$, ello$h, llo$he, lo$hel, o$hell, $hello
Permuterm index & term mapping
- All rotated entries map to the same string (the original term).
- For X, look up X$
- For X*, look up $X*
- For *X, look up X$*
- For *X*, look up X*
- For X*Y, look up Y$X*
- Example: for hel*o, look up o$hel*
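The rotation scheme can be sketched as below. This is a minimal sketch, not a definitive implementation: a sorted list of (rotation, term) pairs stands in for the B-tree, the vocabulary is invented, the function names are hypothetical, and the query is assumed to contain exactly one *.

```python
from bisect import bisect_left

def rotations(term):
    """All rotations of term + '$', where $ marks the end of the string."""
    s = term + "$"
    return [s[i:] + s[:i] for i in range(len(s))]

def build_permuterm(vocabulary):
    """Sorted (rotation, term) pairs; every rotation maps back to its term."""
    return sorted((rot, term) for term in vocabulary for rot in rotations(term))

def rotate_query(query):
    """X*Y -> Y$X, so the * ends up at the end and becomes a prefix lookup.

    Assumption: the query contains exactly one *.
    """
    head, tail = query.split("*")
    return tail + "$" + head

def wildcard_lookup(index, query):
    """Collect the terms of all rotations that start with the rotated query."""
    prefix = rotate_query(query)
    i = bisect_left(index, (prefix,))
    matches = set()
    while i < len(index) and index[i][0].startswith(prefix):
        matches.add(index[i][1])
        i += 1
    return sorted(matches)

# Hypothetical vocabulary for illustration.
index = build_permuterm(["halo", "hello", "help", "hero", "jello"])
```

For example, `wildcard_lookup(index, "hel*o")` searches for rotations beginning with `o$hel`, which matches only `hello` in this vocabulary.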
Processing a lookup in the permuterm index
- To sum up:
  - Rotate the query so that the wildcard is at the right
  - Use B-tree lookup as before
- What is the problem with this structure?
  - Permuterm more than quadruples the size of the dictionary compared to a regular B-tree (empirical observation for English).
Alternative: Bigram (k-gram) indexes
- Enumerate all k-grams (sequences of k characters) occurring in a term
- More space-efficient than the permuterm index
- Example: from "April is the cruelest month" we get the 2-grams (bigrams):
  $a ap pr ri il l$ $i is s$ $t th he e$ $c cr ru ue el le es st t$ $m mo on nt h$
- $ is a special word boundary symbol, as before.
- Maintain a second inverted index from bigrams to the dictionary terms that contain the bigram
Postings list in a 3-gram inverted index
- etr → BEETROOT → METRIC → PETRIFY → RETRIEVAL ...
- Retrieve all postings of matching k-grams
- Intersect all lists as usual
Processing wildcarded terms in a bigram index
- Query mon* can now be run as: $m AND mo AND on
- Gets us all terms with the prefix mon, but also many false positives like MOON.
- Must post-filter these terms against the query.
- Surviving terms are then looked up in the term-document inverted index.
- k-gram indexes are still fast and more space-efficient than permuterm indexes.
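The bigram lookup and the post-filtering step can be sketched together. This is an illustrative sketch under stated assumptions: the vocabulary is invented, the function names are hypothetical, and the post-filter is done with a regular expression built from the wildcard query.

```python
import re
from collections import defaultdict

def kgrams(text, k=2):
    """All k-grams of a string, as a set (empty if the string is shorter than k)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def build_index(vocabulary, k=2):
    """Map each k-gram to the set of terms containing it; $ marks word boundaries."""
    index = defaultdict(set)
    for term in vocabulary:
        for gram in kgrams("$" + term + "$", k):
            index[gram].add(term)
    return index

def candidates(index, query, k=2):
    """Intersect the postings of every k-gram in the query, e.g. $m AND mo AND on."""
    grams = set()
    for piece in ("$" + query + "$").split("*"):
        grams |= kgrams(piece, k)
    postings = [index.get(g, set()) for g in grams]
    return set.intersection(*postings) if postings else set()

def wildcard_query(index, query, k=2):
    """Post-filter the candidates against the original wildcard pattern."""
    pattern = re.compile(".*".join(re.escape(p) for p in query.split("*")))
    return sorted(t for t in candidates(index, query, k) if pattern.fullmatch(t))

# Hypothetical vocabulary for illustration.
index = build_index(["lemon", "monday", "month", "moon"])
```

Here `candidates(index, "mon*")` still contains the false positive `moon`, because `moon` has all three bigrams `$m`, `mo`, and `on`; `wildcard_query` removes it.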
Processing wildcard queries
- As before, we must potentially execute a large number of Boolean queries
- Most straightforward semantics: conjunction of disjunctions
- Recall the query: gen* AND universit*
  (geneva AND university) OR (geneva AND université) OR (genève AND university) OR (genève AND université) OR (general AND universities) OR ...
- Very expensive! Requires query optimization!
- Do we need to support wildcard queries?
  - Users are lazy!
  - If wildcards are allowed, users will love it (?)
  - Does Google allow wildcard queries? (And other engines?)
Tolerant Retrieval: Spälling Korrection
- Two general uses of spelling correction:
  - Correcting documents being indexed
  - Correcting user queries (more common)
- Two different methods:
  - Isolated word spelling correction
    - Check each word on its own for misspelling
    - Will not catch typos resulting in correctly spelled words, e.g., "an asteroid that fell form the sky"
  - Context-sensitive spelling correction
    - Look at surrounding words
    - Can correct the form/from error above
Correcting documents
- Primarily for OCR'ed documents
  - tuned for OCR mistakes (trained classifiers)
  - may use domain-specific knowledge
  - confusion between O and D, etc.
- But also: correct typos in web pages and low-quality documents
  - fewer misspellings in our dictionary and better matching
- General IR philosophy: don't change the documents
Correcting queries
- More common in IR than document correction
- Typos in queries are common
  - people are in a hurry
  - users often look for things they don't know much about
  - example: al quajda
- Strategies:
  - (also) retrieve documents with the correct spelling
  - return alternative query suggestions ("Did you mean ...?")
Isolated word correction
- Premise 1: There is a list of correct words from which the correct spellings come.
- Premise 2: We have a way of computing the distance between a misspelled word and a correct word.
- Simple spelling correction algorithm: return the correct word that has the smallest distance to the misspelled word.
- Example: informaton → information
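The simple algorithm can be sketched in a few lines. This sketch assumes Levenshtein distance (unit-cost insert, delete, replace) as the distance measure and uses an invented three-word lexicon; a brute-force scan over the lexicon is only practical for small word lists.

```python
def edit_distance(a, b):
    """Levenshtein distance, keeping only one previous row of the table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # copy or replace
        prev = cur
    return prev[-1]

def correct(word, lexicon):
    """Return the lexicon entry with the smallest distance to the word."""
    return min(lexicon, key=lambda w: edit_distance(word, w))

# Hypothetical lexicon for illustration.
lexicon = ["formation", "informal", "information"]
```

With this lexicon, `correct("informaton", lexicon)` picks `information`, which is one insertion away.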
Isolated word correction
- Two choices of lexicon:
  - A standard lexicon
    - Webster's, OED, etc.
    - industry-specific dictionary (for domain-specific IR)
    - advantage: correct entries only!
  - The vocabulary of the inverted index
    - all words in the collection
    - better coverage
    - but: includes all misspellings
    - compute weights for all terms (based on frequencies)
Isolated word correction
- Task: return the lexicon entry that is closest to a given character sequence Q
- What is "closest"? Several alternatives:
  1. Edit distance and Levenshtein distance
  2. Weighted edit distance
  3. k-gram overlap
Edit distance
- The edit distance between string s1 and string s2 is the minimum number of basic operations that convert s1 to s2.
- Levenshtein distance: basic operations = insert, delete, and replace
  - Levenshtein distance dog-do: 1
  - Levenshtein distance cat-cart: 1
  - Levenshtein distance cat-cut: 1
  - Levenshtein distance cat-act: 2
- Damerau-Levenshtein: additional operation = transpose
  - Damerau-Levenshtein distance cat-act: 1
Levenshtein distance: Computation
- Recursive definition & dynamic programming:
  - start with the upper-left table cell
  - fill the table with edit costs
- [Table: edit-cost matrix for the strings "fast" and "cats"]
Levenshtein distance: algorithm
LEVENSHTEINDISTANCE(s1, s2)
  for i ← 0 to |s1|
  do m[i, 0] = i
  for j ← 0 to |s2|
  do m[0, j] = j
  for i ← 1 to |s1|
  do for j ← 1 to |s2|
     do if s1[i] = s2[j]
        then m[i, j] = min{m[i-1, j] + 1, m[i, j-1] + 1, m[i-1, j-1]}
        else m[i, j] = min{m[i-1, j] + 1, m[i, j-1] + 1, m[i-1, j-1] + 1}
  return m[|s1|, |s2|]
Operations: insert, delete, replace, copy
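The pseudocode translates almost line by line into Python. The sketch below keeps the full table m, as the pseudocode does; the only change is that Python strings are 0-indexed, so character i of the pseudocode is `s1[i - 1]` here.

```python
def levenshtein(s1, s2):
    """Levenshtein distance via dynamic programming over a full table."""
    rows, cols = len(s1) + 1, len(s2) + 1
    m = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        m[i][0] = i          # delete all i characters of s1
    for j in range(cols):
        m[0][j] = j          # insert all j characters of s2
    for i in range(1, rows):
        for j in range(1, cols):
            # Diagonal move: copy (cost 0) if the characters match, else replace (cost 1).
            diagonal = m[i - 1][j - 1] + (0 if s1[i - 1] == s2[j - 1] else 1)
            m[i][j] = min(m[i - 1][j] + 1,   # delete
                          m[i][j - 1] + 1,   # insert
                          diagonal)
    return m[len(s1)][len(s2)]
```

The function reproduces the distances listed earlier, for example 1 for dog-do and 2 for cat-act. Running time and memory are both proportional to |s1| · |s2|.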
Each cell of the Levenshtein matrix
- cost of getting here from my upper-left neighbor (copy or replace)
- cost of getting here from my left neighbor (insert)
- cost of getting here from my upper neighbor (delete)
- the minimum of the three possible moves: the cheapest way of getting here
Weighted edit distance
- As above, but the weight of an operation depends on the characters involved.
- Meant to capture keyboard errors, e.g., m is more likely to be mistyped as n than as q.
- Therefore, replacing m by n is a smaller edit distance than replacing m by q.
- We now require a weight matrix as input.
- Modify the dynamic programming to handle weights.
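One way to modify the dynamic programming is to pass the replacement weights in as a function. This is a sketch under stated assumptions: the weights below (0.5 for a pair of neighboring keys, 1.0 otherwise) and the tiny adjacency set are invented for illustration, and insertions and deletions keep a uniform cost.

```python
def weighted_edit_distance(s1, s2, sub_cost, indel_cost=1.0):
    """Edit distance where the replacement cost depends on the two characters."""
    rows, cols = len(s1) + 1, len(s2) + 1
    m = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        m[i][0] = i * indel_cost
    for j in range(1, cols):
        m[0][j] = j * indel_cost
    for i in range(1, rows):
        for j in range(1, cols):
            a, b = s1[i - 1], s2[j - 1]
            replace = 0.0 if a == b else sub_cost(a, b)
            m[i][j] = min(m[i - 1][j] + indel_cost,    # delete
                          m[i][j - 1] + indel_cost,    # insert
                          m[i - 1][j - 1] + replace)   # copy or weighted replace
    return m[-1][-1]

# Hypothetical weight function: a few neighboring-key pairs are cheap to confuse.
ADJACENT_KEYS = {frozenset("mn"), frozenset("nb")}

def keyboard_cost(a, b):
    return 0.5 if frozenset((a, b)) in ADJACENT_KEYS else 1.0
```

With these made-up weights, turning `mat` into `nat` costs 0.5, while turning `mat` into `qat` costs 1.0.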
Using edit distance
- Given a query:
  - get all sequences within a fixed edit distance
  - intersect this list with the list of correct words
  - return spelling suggestions
- Alternatively:
  - use all corrections to retrieve docs: slow! (and accepted by the user?)
  - use the single best correction for retrieval
Using edit distance
- Problems:
  - lots of possible strings even with few edit operations
  - intersection with the dictionary is slow
- Possible solution: use n-gram overlap
  - can replace edit distance for spelling correction
k-gram indexes for spelling correction
- Get all k-grams in the query term
- Use the k-gram index to retrieve correct words that match query term k-grams (recall wildcard search)
- Threshold by number of matching k-grams (e.g., only terms that differ by at most 3 k-grams)
- or use Jaccard coefficient > threshold: Jaccard(A, B) = |A ∩ B| / |A ∪ B|
- Example: bigram index, misspelled word bordroom
- Bigrams: bo, or, rd, dr, ro, oo, om
k-gram indexes for spelling correction: bordroom
- bo → aboard → about → boardroom → border
- or → border → lord → morbid → sordid
- rd → aboard → ardent → boardroom → border
- ...
- boardroom occurs in 6 of the 7 lists
- Jaccard = 6 / (7 + 8 - 6) = 6/9, since bordroom has 7 bigrams and boardroom has 8
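The Jaccard computation for this example can be checked with a short sketch. Following the example above, the bigrams are taken without the $ boundary symbol; the function names are hypothetical.

```python
def bigrams(term):
    """Set of character bigrams of a term (no $ boundary symbols)."""
    return {term[i:i + 2] for i in range(len(term) - 1)}

def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two sets."""
    return len(a & b) / len(a | b)

query_grams = bigrams("bordroom")       # bo, or, rd, dr, ro, oo, om
candidate_grams = bigrams("boardroom")  # bo, oa, ar, rd, dr, ro, oo, om
score = jaccard(query_grams, candidate_grams)
```

The two sets share 6 bigrams and their union has 9, so `score` is 6/9. A candidate is kept if its score exceeds a chosen threshold.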
Context-sensitive spelling correction
- One approach:
  - break the phrase query into a conjunction of biwords
  - look for biwords that need only one term corrected
  - get phrase matches and rank them
  - ...
Context-sensitive spelling correction
- Another approach: hit-based spelling correction
- Example: flew form munich
- Try all phrases with possible corrections:
  - Try query "flea form munich"
  - Try query "flew from munich"
  - Try query "flew form munch"
- The correct query "flew from munich" has the most hits.
- Many alternatives! Not very efficient!
  - try to correct only if few hits are returned
  - tweak using query logs and expected hits
Tolerant Search: Soundex
- Find phonetic alternatives. Example: chebyshev / tchebyscheff
- Soundex = class of heuristics (invented in 1918)
  1. Retain the first letter of the term.
  2. Replace the letters A, E, I, O, U, H, W, Y with 0.
  3. Replace the remaining letters with digits:
     - B, F, P, V → 1
     - C, G, J, K, Q, S, X, Z → 2
     - D, T → 3
     - L → 4
     - M, N → 5
     - R → 6
  4. Remove consecutive identical digits.
  5. Remove all zeros and return the first 4 characters.
- Example: HERMAN → H655
- Will HERMANN generate the same code?
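The five steps can be sketched as follows. Two details are assumptions that the steps above leave open: only the letters after the first are turned into digits, and codes shorter than four characters are padded with trailing zeros. Other Soundex variants treat H and W differently, so this sketch follows the steps as listed here rather than any particular standard.

```python
# Letter-to-digit table from steps 2 and 3.
CODES = {}
for letters, digit in [("AEIOUHWY", "0"), ("BFPV", "1"), ("CGJKQSXZ", "2"),
                       ("DT", "3"), ("L", "4"), ("MN", "5"), ("R", "6")]:
    for ch in letters:
        CODES[ch] = digit

def soundex(term):
    """Soundex code following the five steps listed above."""
    term = term.upper()
    first, rest = term[0], term[1:]           # step 1: retain the first letter
    digits = [CODES[ch] for ch in rest if ch in CODES]  # steps 2 and 3
    collapsed = []
    for d in digits:                          # step 4: drop consecutive repeats
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)
    code = first + "".join(d for d in collapsed if d != "0")  # step 5: drop zeros
    return (code + "000")[:4]                 # assumed padding to four characters
```

Both `soundex("HERMAN")` and `soundex("HERMANN")` give `H655`, because the doubled N collapses to a single 5 in step 4.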
To sum up
- Using:
  - a positional inverted index with skip pointers
  - efficient dictionary storage
  - a wildcard index
  - spelling correction
  - Soundex
- We can quickly run a query like (SPELL(moriset) /3 tor*to) OR SOUNDEX(chaikofski)
Resources
- Chapter 3 of IIR
- Soundex demo: Javascript/SoundexCalculate/
- Levenshtein distance demo
- Levenshtein distance slides
- https://rosettacode.org/wiki/levenshtein_distance
- Peter Norvig's spelling corrector
Next time
- Ranked retrieval: efficient scoring & retrieval
- Putting it all together in a search system
- Lab on Boolean and ranked retrieval
More informationDigital Libraries: Language Technologies
Digital Libraries: Language Technologies RAFFAELLA BERNARDI UNIVERSITÀ DEGLI STUDI DI TRENTO P.ZZA VENEZIA, ROOM: 2.05, E-MAIL: BERNARDI@DISI.UNITN.IT Contents 1 Recall: Inverted Index..........................................
More informationIndexing. UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze
Indexing UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze All slides Addison Wesley, 2008 Table of Content Inverted index with positional information
More information信息检索与搜索引擎 Introduction to Information Retrieval GESC1007
信息检索与搜索引擎 Introduction to Information Retrieval GESC1007 Philippe Fournier-Viger Full professor School of Natural Sciences and Humanities philfv8@yahoo.com Spring 2019 1 Last week We have discussed in
More informationInformation Retrieval
Information Retrieval Test Collections Gintarė Grigonytė gintare@ling.su.se Department of Linguistics and Philology Uppsala University Slides based on previous IR course given by K.F. Heppin 2013-15 and
More informationSearch Engines. Gertjan van Noord. September 17, 2018
Search Engines Gertjan van Noord September 17, 2018 About the course Information about the course is available from: http://www.let.rug.nl/vannoord/college/zoekmachines/ Last week Normalization (case,
More informationIndex Compression. David Kauchak cs160 Fall 2009 adapted from:
Index Compression David Kauchak cs160 Fall 2009 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture5-indexcompression.ppt Administrative Homework 2 Assignment 1 Assignment 2 Pair programming?
More informationInverted Indexes. Indexing and Searching, Modern Information Retrieval, Addison Wesley, 2010 p. 5
Inverted Indexes Indexing and Searching, Modern Information Retrieval, Addison Wesley, 2010 p. 5 Basic Concepts Inverted index: a word-oriented mechanism for indexing a text collection to speed up the
More informationBoolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology
Boolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2016 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan lectures
More informationIntroduction to Information Retrieval IIR 1: Boolean Retrieval
.. Introduction to Information Retrieval IIR 1: Boolean Retrieval Mihai Surdeanu (Based on slides by Hinrich Schütze at informationretrieval.org) Fall 2014 Boolean Retrieval 1 / 77 Take-away Why you should
More informationEfficient Implementation of Postings Lists
Efficient Implementation of Postings Lists Inverted Indices Query Brutus AND Calpurnia J. Pei: Information Retrieval and Web Search -- Efficient Implementation of Postings Lists 2 Skip Pointers J. Pei:
More informationMore on indexing CE-324: Modern Information Retrieval Sharif University of Technology
More on indexing CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2014 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford) Plan
More informationInformation Retrieval
Information Retrieval Suan Lee - Information Retrieval - 06 Scoring, Term Weighting and the Vector Space Model 1 Recap of lecture 5 Collection and vocabulary statistics: Heaps and Zipf s laws Dictionary
More informationInformation Retrieval
Introduction to CS3245 Lecture 5: Index Construction 5 Last Time Dictionary data structures Tolerant retrieval Wildcards Spelling correction Soundex a-hu hy-m n-z $m mace madden mo among amortize on abandon
More informationJames Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence!
James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence! (301) 219-4649 james.mayfield@jhuapl.edu What is Information Retrieval? Evaluation
More informationInformation Retrieval
Introduction to Information Retrieval CS3245 Information Retrieval Lecture 6: Index Compression 6 Last Time: index construction Sort- based indexing Blocked Sort- Based Indexing Merge sort is effective
More informationInformation Retrieval
Introduction to CS3245 Lecture 5: Index Construction 5 CS3245 Last Time Dictionary data structures Tolerant retrieval Wildcards Spelling correction Soundex a-hu hy-m n-z $m mace madden mo among amortize
More informationIndex construction CE-324: Modern Information Retrieval Sharif University of Technology
Index construction CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2014 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford) Ch.
More informationCS377: Database Systems Text data and information. Li Xiong Department of Mathematics and Computer Science Emory University
CS377: Database Systems Text data and information retrieval Li Xiong Department of Mathematics and Computer Science Emory University Outline Information Retrieval (IR) Concepts Text Preprocessing Inverted
More informationEECS 395/495 Lecture 3 Scalable Indexing, Searching, and Crawling
EECS 395/495 Lecture 3 Scalable Indexing, Searching, and Crawling Doug Downey Based partially on slides by Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze Announcements Project progress report
More informationIntroduction to. CS276: Information Retrieval and Web Search Christopher Manning and Prabhakar Raghavan. Lecture 4: Index Construction
Introduction to Information Retrieval CS276: Information Retrieval and Web Search Christopher Manning and Prabhakar Raghavan Lecture 4: Index Construction 1 Plan Last lecture: Dictionary data structures
More informationIntroduction to Information Retrieval
Introduction to Information Retrieval http://informationretrieval.org IIR 1: Boolean Retrieval Hinrich Schütze Institute for Natural Language Processing, Universität Stuttgart 2008.04.22 Schütze: Boolean
More informationIntroduction to Information Retrieval and Boolean model. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H.
Introduction to Information Retrieval and Boolean model Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze 1 Unstructured (text) vs. structured (database) data in late
More informationCS105 Introduction to Information Retrieval
CS105 Introduction to Information Retrieval Lecture: Yang Mu UMass Boston Slides are modified from: http://www.stanford.edu/class/cs276/ Information Retrieval Information Retrieval (IR) is finding material
More informationInformation Retrieval CS-E credits
Information Retrieval CS-E4420 5 credits Tokenization, further indexing issues Antti Ukkonen antti.ukkonen@aalto.fi Slides are based on materials by Tuukka Ruotsalo, Hinrich Schütze and Christina Lioma
More informationCSE 7/5337: Information Retrieval and Web Search Introduction and Boolean Retrieval (IIR 1)
CSE 7/5337: Information Retrieval and Web Search Introduction and Boolean Retrieval (IIR 1) Michael Hahsler Southern Methodist University These slides are largely based on the slides by Hinrich Schütze
More informationIntroducing Information Retrieval and Web Search. borrowing from: Pandu Nayak
Introducing Information Retrieval and Web Search borrowing from: Pandu Nayak Information Retrieval Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually
More informationINDEX CONSTRUCTION 1
1 INDEX CONSTRUCTION PLAN Last lecture: Dictionary data structures Tolerant retrieval Wildcards Spell correction Soundex a-hu hy-m n-z $m mace madden This time: mo among amortize Index construction on
More informationIndexing and Searching
Indexing and Searching Introduction How to retrieval information? A simple alternative is to search the whole text sequentially Another option is to build data structures over the text (called indices)
More informationA Modern spell(1) Abhinav Upadhyay EuroBSDCon 2017, Paris
A Modern spell(1) Abhinav Upadhyay EuroBSDCon 2017, Paris Outline Shortcomings in the old spell(1) Feature Requirements of a modern spell(1) Implementation Details of new spell(1)
More informationIntroduction to Information Retrieval
Introduction to Information Retrieval CS276: Information Retrieval and Web Search Pandu Nayak and Prabhakar Raghavan Hamid Rastegari Lecture 4: Index Construction Plan Last lecture: Dictionary data structures
More informationINFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from
INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze s, linked from http://informationretrieval.org/ IR 6: Index Compression Paul Ginsparg Cornell University, Ithaca, NY 15 Sep
More informationIntroduction to Information Retrieval
Introduction to Information Retrieval http://informationretrieval.org IIR 1: Boolean Retrieval Hinrich Schütze Institute for Natural Language Processing, University of Stuttgart 2011-05-03 1/ 36 Take-away
More informationLogistics. CSE Case Studies. Indexing & Retrieval in Google. Review: AltaVista. BigTable. Index Stream Readers (ISRs) Advanced Search
CSE 454 - Case Studies Indexing & Retrieval in Google Some slides from http://www.cs.huji.ac.il/~sdbi/2000/google/index.htm Logistics For next class Read: How to implement PageRank Efficiently Projects
More informationEfficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488)
Efficiency Efficiency: Indexing (COSC 488) Nazli Goharian nazli@cs.georgetown.edu Difficult to analyze sequential IR algorithms: data and query dependency (query selectivity). O(q(cf max )) -- high estimate-
More informationΕΠΛ660. Ανάκτηση µε το µοντέλο διανυσµατικού χώρου
Ανάκτηση µε το µοντέλο διανυσµατικού χώρου Σηµερινό ερώτηµα Typically we want to retrieve the top K docs (in the cosine ranking for the query) not totally order all docs in the corpus can we pick off docs
More informationText Analytics. Index-Structures for Information Retrieval. Ulf Leser
Text Analytics Index-Structures for Information Retrieval Ulf Leser Content of this Lecture Inverted files Storage structures Phrase and proximity search Building and updating the index Using a RDBMS Ulf
More informationBoolean Queries. Keywords combined with Boolean operators:
Query Languages 1 Boolean Queries Keywords combined with Boolean operators: OR: (e 1 OR e 2 ) AND: (e 1 AND e 2 ) BUT: (e 1 BUT e 2 ) Satisfy e 1 but not e 2 Negation only allowed using BUT to allow efficient
More informationInformation Retrieval
Information Retrieval Suan Lee - Information Retrieval - 01 Boolean Retrieval 1 01 Boolean Retrieval - Information Retrieval - 01 Boolean Retrieval 2 Introducing Information Retrieval and Web Search -
More informationInformation Retrieval. Chap 7. Text Operations
Information Retrieval Chap 7. Text Operations The Retrieval Process user need User Interface 4, 10 Text Text logical view Text Operations logical view 6, 7 user feedback Query Operations query Indexing
More informationQuery Evaluation Strategies
Introduction to Search Engine Technology Term-at-a-Time and Document-at-a-Time Evaluation Ronny Lempel Yahoo! Labs (Many of the following slides are courtesy of Aya Soffer and David Carmel, IBM Haifa Research
More informationOutline of the course
Outline of the course Introduction to Digital Libraries (15%) Description of Information (30%) Access to Information (30%) User Services (10%) Additional topics (15%) Buliding of a (small) digital library
More informationINFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from
INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze s, linked from http://informationretrieval.org/ IR 7: Scores in a Complete Search System Paul Ginsparg Cornell University, Ithaca,
More informationPart 3: The term vocabulary, postings lists and tolerant retrieval Francesco Ricci
Part 3: The term vocabulary, postings lists and tolerant retrieval Francesco Ricci Most of these slides comes from the course: Information Retrieval and Web Search, Christopher Manning and Prabhakar Raghavan
More informationQuery Evaluation Strategies
Introduction to Search Engine Technology Term-at-a-Time and Document-at-a-Time Evaluation Ronny Lempel Yahoo! Research (Many of the following slides are courtesy of Aya Soffer and David Carmel, IBM Haifa
More informationMultimedia Information Extraction and Retrieval Term Frequency Inverse Document Frequency
Multimedia Information Extraction and Retrieval Term Frequency Inverse Document Frequency Ralf Moeller Hamburg Univ. of Technology Acknowledgement Slides taken from presentation material for the following
More informationInforma(on Retrieval
Introduc)on to Informa)on Retrieval CS3245 Informa(on Retrieval Lecture 7: Scoring, Term Weigh9ng and the Vector Space Model 7 Last Time: Index Compression Collec9on and vocabulary sta9s9cs: Heaps and
More informationLecture 1: Introduction and the Boolean Model
Lecture 1: Introduction and the Boolean Model Information Retrieval Computer Science Tripos Part II Helen Yannakoudakis 1 Natural Language and Information Processing (NLIP) Group helen.yannakoudakis@cl.cam.ac.uk
More informationQuery Processing and Alternative Search Structures. Indexing common words
Query Processing and Alternative Search Structures CS 510 Winter 2007 1 Indexing common words What is the indexing overhead for a common term? I.e., does leaving out stopwords help? Consider a word such
More informationFRONT CODING. Front-coding: 8automat*a1 e2 ic3 ion. Extra length beyond automat. Encodes automat. Begins to resemble general string compression.
Sec. 5.2 FRONT CODING Front-coding: Sorted words commonly have long common prefix store differences only (for last k-1 in a block of k) 8automata8automate9automatic10automation 8automat*a1 e2 ic3 ion Encodes
More information