Information Retrieval Dictionaries & Tolerant Retrieval Gintarė Grigonytė gintare@ling.su.se Department of Linguistics and Philology Uppsala University Slides based on previous IR course given by Jörg Tiedemann 2013-15 and Introduction to Information Retrieval slides https://nlp.stanford.edu/ir-book/ppt/ Gintarė Grigonytė 1/45
Data Structures for Dictionaries For each term t, we store a list of all documents that contain t.
BRUTUS → 1, 2, 4, 11, 31, 45, 173, 174
CAESAR → 1, 2, 4, 5, 6, 16, 57, 132, ...
CALPURNIA → 2, 31, 54, 101
(the terms on the left form the dictionary; the document lists on the right are the postings) Gintarė Grigonytė 2/45
Naive Dictionary: array of fixed-width entries
term (20 bytes) | document frequency (4 bytes) | pointer to postings list (4 bytes)
a         | 656,265 | →
aachen    |      65 | →
...       |     ... | ...
zulu      |     221 | →
How do we store a dictionary in memory efficiently? How do we look up an element in this array at query time? Gintarė Grigonytė 3/45
Dictionary Data structures Two main classes of data structures: hashes and trees Criteria for when to use hashes vs. trees: Is there a fixed number of terms or will it keep growing? What are the relative frequencies with which various keys will be accessed? How many terms are we likely to have? Gintarė Grigonytė 4/45
Hashes Each vocabulary term is hashed into an integer. Try to avoid collisions. At query time, do the following: hash the query term, resolve collisions, locate the entry in the fixed-width array. Pros: Lookup is fast (faster than in a search tree) Lookup time is constant Cons: no way to find minor variants (resume vs. résumé) no prefix search (all terms starting with automat) need to rehash everything periodically if the vocabulary keeps growing Gintarė Grigonytė 5/45
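The trade-off above can be sketched with Python's built-in dict, which is itself a hash table; the toy postings below are purely illustrative. Exact-match lookup is constant time, but a prefix query degenerates to a scan over all keys:

```python
# A hash-based dictionary: term -> postings list (toy data for illustration).
postings = {"brutus": [1, 2, 4], "caesar": [1, 2, 5], "calpurnia": [2, 31]}

# Exact-match lookup is O(1) on average.
print(postings.get("caesar"))        # [1, 2, 5]

# No ordered traversal: a prefix query like ca* must scan every key.
print(sorted(t for t in postings if t.startswith("ca")))   # ['caesar', 'calpurnia']
```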
Trees: Binary tree Gintarė Grigonytė 6/45
Trees: Binary tree simplest tree structure efficient for searching Pros: solves the prefix problem (terms starting with automat) Cons: slower: O(log M) in balanced trees (M is the size of the vocabulary) re-balancing binary trees is expensive Alternative: B-tree Gintarė Grigonytė 7/45
Trees: B-tree B-tree definition: every internal node has a number of children in the interval [a, b] where a, b are appropriate positive integers, e.g., [2, 4]. Gintarė Grigonytė 8/45
Trees: B-tree What's the difference? slightly more complex still efficient for searching same features as binary trees (prefix search!) need for re-balancing is less frequent! Gintarė Grigonytė 9/45
Tolerant Retrieval Wildcard queries Spelling correction Soundex Gintarė Grigonytė 10/45
Tolerant Retrieval: Wildcard queries Prefix queries: mon*: find all docs containing any term beginning with mon. Easy with a B-tree dictionary: retrieve all terms t in the range mon ≤ t < moo. Suffix queries: *mon: find all docs containing any term ending with mon. Maintain an additional tree for terms written backwards; then retrieve all terms t in the range nom ≤ t < non. How can we find all terms matching pro*cent? Gintarė Grigonytė 11/45
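The two range scans can be sketched in Python, using a sorted list plus `bisect` as a stand-in for the B-tree's ordered keys; the function names and the toy term list are illustrative, and incrementing the last character to form the upper bound assumes it is not the last character of the alphabet:

```python
import bisect

# A sorted term list stands in for the B-tree's ordered keys (toy data).
terms = sorted(["moat", "mon", "money", "monkey", "month", "moon", "salmon", "sermon"])

def prefix_range(terms, prefix):
    """Prefix query X*: all terms t with prefix <= t < upper (a B-tree range scan)."""
    lo = bisect.bisect_left(terms, prefix)
    # e.g. "mon" -> "moo": increment the last character to get the exclusive bound
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    hi = bisect.bisect_left(terms, upper)
    return terms[lo:hi]

def suffix_range(terms, suffix):
    """Suffix query *X: the same range scan over a tree of reversed terms."""
    rev = sorted(t[::-1] for t in terms)
    return [t[::-1] for t in prefix_range(rev, suffix[::-1])]

print(prefix_range(terms, "mon"))   # ['mon', 'money', 'monkey', 'month']
print(suffix_range(terms, "mon"))   # ['mon', 'salmon', 'sermon']
```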
How to handle * in the middle of a term Example: pro*cent We could look up pro* and *cent in the B-tree and intersect the two term sets. Expensive! Alternative: permuterm index! Gintarė Grigonytė 12/45
Permuterm index Rotate the wildcard query so that the * occurs at the end. Introduce a special symbol $ to indicate the end of the string. pro*cent becomes cent$pro* How does this help when matching queries with the index? Gintarė Grigonytė 13/45
Permuterm index Add all rotations to the B-tree: HELLO → hello$, ello$h, llo$he, lo$hel, o$hell, $hello Gintarė Grigonytė 14/45
Permuterm index & term mapping all rotated entries map to the same string... For X, look up X$ For X*, look up $X* For *X, look up X$* For *X*, look up X* For X*Y, look up Y$X* Example: For hel*o, look up o$hel* Gintarė Grigonytė 15/45
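The rotation trick can be sketched in a few lines of Python; the function names and the three sample terms are illustrative, the dict stands in for the B-tree over permuterm keys, and the lookup handles queries with exactly one *:

```python
def rotations(term):
    """All rotations of term + '$'; each one becomes a permuterm key."""
    s = term + "$"
    return [s[i:] + s[:i] for i in range(len(s))]

# permuterm key -> original term (in practice a B-tree over the keys)
index = {}
for term in ["hello", "help", "hollow"]:
    for rot in rotations(term):
        index[rot] = term

def wildcard_lookup(query):
    """Handle a single-* query X*Y: rotate to Y$X* and prefix-match the keys."""
    x, y = query.split("*")
    key_prefix = y + "$" + x
    return sorted({t for k, t in index.items() if k.startswith(key_prefix)})

print(wildcard_lookup("hel*o"))   # ['hello']
print(wildcard_lookup("*low"))    # ['hollow']
```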
Processing a lookup in the permuterm index To sum up: Rotate query wildcard to the right Use B-tree lookup as before What is the problem with this structure? Permuterm more than quadruples the size of the dictionary compared to a regular B-tree. (empirical observation for English) Gintarė Grigonytė 16/45
Alternative: Bigram (k-gram) indexes Enumerate all k-grams (sequences of k characters) occurring in a term more space-efficient than permuterm index Example: from April is the cruelest month we get the 2-grams (bigrams): $a ap pr ri il l$ $i is s$ $t th he e$ $c cr ru ue el le es st t$ $m mo on nt th h$ $ is a special word boundary symbol, as before. Maintain a second inverted index from bigrams to the dictionary terms that contain the bigram Gintarė Grigonytė 17/45
Postings list in a 3-gram inverted index etr → BEETROOT, METRIC, PETRIFY, RETRIEVAL, ... retrieve all postings of matching k-grams; intersect all lists as usual Gintarė Grigonytė 18/45
Processing wildcarded terms in a bigram index Query mon* can now be run as: $m AND mo AND on Gets us all terms with the prefix mon, but...... also many false positives like MOON. Must post-filter these terms against the query. Surviving terms are then looked up in the term-document inverted index. k-gram indexes are still fast and more space-efficient than permuterm indexes. Gintarė Grigonytė 19/45
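The whole pipeline (bigram extraction, AND over the bigram postings, post-filter) can be sketched as follows; `kgrams` and `prefix_query` and the five sample terms are illustrative names, not the book's API:

```python
from collections import defaultdict

def kgrams(term, k=2):
    """Set of k-grams of '$term$', with $ as the word boundary symbol."""
    s = "$" + term + "$"
    return {s[i:i+k] for i in range(len(s) - k + 1)}

terms = ["mon", "money", "month", "moon", "morning"]
index = defaultdict(set)              # k-gram -> terms containing it
for t in terms:
    for g in kgrams(t):
        index[g].add(t)

def prefix_query(prefix):
    """mon* becomes $m AND mo AND on, then a post-filter removes false positives."""
    grams = kgrams(prefix) - {prefix[-1] + "$"}   # drop the end-of-word gram
    candidates = set.intersection(*(index[g] for g in grams))
    # k-gram match is necessary but not sufficient: 'moon' contains $m, mo, on too
    return sorted(t for t in candidates if t.startswith(prefix))

print(prefix_query("mon"))   # ['mon', 'money', 'month']
```

Note that without the final `startswith` filter, MOON would survive the intersection, exactly as the slide warns.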
Processing wildcard queries As before, we must potentially execute a large number of Boolean queries Most straightforward semantics: conjunction of disjunctions Recall the query: gen* AND universit* (geneva AND university) OR (geneva AND université) OR (genève AND university) OR (genève AND université) OR (general AND universities) OR ... Very expensive! Requires query optimization! Do we need to support wildcard queries? Users are lazy! If wildcards are allowed, users will love them (?) Does Google allow wildcard queries? (And other engines?) Gintarė Grigonytė 20/45
Tolerant Retrieval: Spälling Korrection Two general uses of spelling correction: Correcting documents being indexed Correcting user queries (more common) Two different methods: Isolated word spelling correction Check each word on its own for misspelling Will not catch typos resulting in correctly spelled words, e.g., an asteroid that fell form the sky Context-sensitive spelling correction Look at surrounding words Can correct form/from error above Gintarė Grigonytė 21/45
Correcting documents primarily for OCR'ed documents tuned for OCR mistakes (trained classifiers) may use domain-specific knowledge (confusion between O and D etc.) but also: correct typos in web pages and low-quality documents fewer misspellings in our dictionary and better matching general IR philosophy: don't change the documents Gintarė Grigonytė 22/45
Correcting queries more common in IR than document correction typos in queries are common people are in a hurry users often look for things they don't know much about example: al quajda strategies: (also) retrieve documents with the correct spelling return alternative query suggestions ("Did you mean ...?") Gintarė Grigonytė 23/45
Isolated word correction Premise 1: There is a list of correct words from which the correct spellings come. Premise 2: We have a way of computing the distance between a misspelled word and a correct word. Simple spelling correction algorithm: return the correct word that has the smallest distance to the misspelled word. Example: informaton → information Gintarė Grigonytė 24/45
Isolated word correction Two choices: (1) use a standard lexicon (Webster's, OED, etc.) or an industry-specific dictionary (for domain-specific IR); advantage: correct entries only! (2) use the vocabulary of the inverted index, i.e., all words in the collection; better coverage, but it includes all the misspellings; compute weights for all terms (based on frequencies) Gintarė Grigonytė 25/45
Isolated word correction Task: Return lexicon entry that is closest to a given character sequence Q What is closest? Several alternatives: 1. Edit distance and Levenshtein distance 2. Weighted edit distance 3. k-gram overlap Gintarė Grigonytė 26/45
Edit distance The edit distance between string s1 and string s2 is the minimum number of basic operations that convert s1 to s2. Levenshtein distance: basic operations = insert, delete, and replace Levenshtein distance dog-do: 1 Levenshtein distance cat-cart: 1 Levenshtein distance cat-cut: 1 Levenshtein distance cat-act: 2 Damerau-Levenshtein: additional operation = transpose Damerau-Levenshtein distance cat-act: 1 Gintarė Grigonytė 27/45
Levenshtein distance: Computation Recursive definition & dynamic programming: start with the upper-left table cell, fill the table with edit costs (here for cats vs. fast):
      f  a  s  t
   0  1  2  3  4
c  1  1  2  3  4
a  2  2  1  2  3
t  3  3  2  2  2
s  4  4  3  2  3
Gintarė Grigonytė 28/45
Levenshtein distance: algorithm
LEVENSHTEINDISTANCE(s1, s2)
 1 for i ← 0 to |s1|
 2   do m[i, 0] = i
 3 for j ← 0 to |s2|
 4   do m[0, j] = j
 5 for i ← 1 to |s1|
 6   do for j ← 1 to |s2|
 7     do if s1[i] = s2[j]
 8       then m[i, j] = min{m[i-1, j] + 1, m[i, j-1] + 1, m[i-1, j-1]}
 9       else m[i, j] = min{m[i-1, j] + 1, m[i, j-1] + 1, m[i-1, j-1] + 1}
10 return m[|s1|, |s2|]
Operations: insert, delete, replace, copy Gintarė Grigonytė 29/45
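A direct Python translation of the pseudocode (1-indexed s1[i] becomes 0-indexed s1[i-1]):

```python
def levenshtein(s1, s2):
    """DP over a (len(s1)+1) x (len(s2)+1) cost matrix m, as in the pseudocode."""
    m = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    for i in range(len(s1) + 1):
        m[i][0] = i                                 # delete all of s1[:i]
    for j in range(len(s2) + 1):
        m[0][j] = j                                 # insert all of s2[:j]
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            cost = 0 if s1[i-1] == s2[j-1] else 1   # copy is free, replace costs 1
            m[i][j] = min(m[i-1][j] + 1,            # delete
                          m[i][j-1] + 1,            # insert
                          m[i-1][j-1] + cost)       # copy or replace
    return m[len(s1)][len(s2)]

assert levenshtein("dog", "do") == 1
assert levenshtein("cat", "cart") == 1
assert levenshtein("cat", "act") == 2
assert levenshtein("cats", "fast") == 3    # the matrix from the earlier slide
```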
Each cell of the Levenshtein matrix holds: the cost of getting here from the upper-left neighbor (copy or replace), the cost of getting here from the left neighbor (insert), the cost of getting here from the upper neighbor (delete), and the minimum of the three possible moves, i.e., the cheapest way of getting here Gintarė Grigonytė 34/45
Weighted edit distance As above, but weight of an operation depends on the characters involved. Meant to capture keyboard errors, e.g., m more likely to be mistyped as n than as q. Therefore, replacing m by n is a smaller edit distance than by q. We now require a weight matrix as input. Modify dynamic programming to handle weights. Gintarė Grigonytė 35/45
Using edit distance given a query: get all sequences within a fixed edit distance, intersect this list with the list of correct words, return spelling suggestions. Alternatively: use all corrections to retrieve documents (slow! and accepted by the user?) or use the single best correction for retrieval Gintarė Grigonytė 36/45
Using edit distance Problems: lots of possible strings even with few edit operations; intersection with the dictionary is slow. Possible solution: use n-gram overlap, which can replace edit distance for spelling correction Gintarė Grigonytė 37/45
k-gram indexes for spelling correction Get all k-grams in the query term Use the k-gram index to retrieve correct words that match query term k-grams (recall wildcard search) Threshold by number of matching k-grams (e.g., only terms that differ by at most 3 k-grams) or use Jaccard coefficient > threshold: Jaccard(A, B) = |A ∩ B| / |A ∪ B| Example: bigram index, misspelled word bordroom Bigrams: bo, or, rd, dr, ro, oo, om Gintarė Grigonytė 38/45
k-gram indexes for spelling correction: bordroom
BO → aboard, about, boardroom, border
OR → border, lord, morbid, sordid
RD → aboard, ardent, boardroom, border
...
boardroom occurs in 6 out of 7 bigram lists: Jaccard = 6/(8 + 7 - 6) = 6/9 ≈ 0.67 Gintarė Grigonytė 39/45
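The bordroom example can be verified with a few lines of Python; the small lexicon and function names are illustrative, and bigrams are taken without the $ boundary symbol, matching the bigram list on the slide:

```python
def bigrams(term):
    """Set of character bigrams, without the $ boundary symbol (as on the slide)."""
    return {term[i:i+2] for i in range(len(term) - 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

lexicon = ["boardroom", "border", "aboard", "about", "lord", "morbid", "ardent"]
query = bigrams("bordroom")          # {bo, or, rd, dr, ro, oo, om}

# Rank candidates by Jaccard overlap with the misspelled word's bigrams.
ranked = sorted(lexicon, key=lambda t: jaccard(bigrams(t), query), reverse=True)
best = ranked[0]
print(best, round(jaccard(bigrams(best), query), 2))   # boardroom 0.67
```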
Context-sensitive spelling correction One approach: break phrase query into conjunction of biwords look for biwords that need only one term corrected get phrase matches and rank them... Gintarė Grigonytė 40/45
Context-sensitive spelling correction Another approach: Hit-based spelling correction Example: flew form munich Try all phrases with possible corrections: Try query flea form munich Try query flew from munich Try query flew form munch The correct query flew from munich has most hits. Many alternatives! Not very efficient! try to correct only if few hits returned tweaking with query logs and expected hits Gintarė Grigonytė 41/45
Tolerant Search: Soundex Find phonetic alternatives. Example: chebyshev / tchebyscheff Soundex = class of heuristics (invented in 1918) 1. Retain the first letter of the term. 2. Replace the letters A, E, I, O, U, H, W, Y by 0. 3. Replace the remaining letters by digits: B, F, P, V → 1; C, G, J, K, Q, S, X, Z → 2; D, T → 3; L → 4; M, N → 5; R → 6. 4. Remove consecutive identical digits. 5. Remove all zeros and return the first 4 characters. HERMAN → H655 Will HERMANN generate the same code? Gintarė Grigonytė 42/45
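A Python sketch of the five steps above; padding short codes with trailing zeros to length 4 is an assumption (the slide only says "return the first 4 characters"):

```python
# Letter -> digit table from the slide's rules.
SOUNDEX_CODES = {}
for letters, digit in [("AEIOUHWY", "0"), ("BFPV", "1"), ("CGJKQSXZ", "2"),
                       ("DT", "3"), ("L", "4"), ("MN", "5"), ("R", "6")]:
    for ch in letters:
        SOUNDEX_CODES[ch] = digit

def soundex(term):
    term = term.upper()
    digits = [SOUNDEX_CODES[c] for c in term if c in SOUNDEX_CODES]
    deduped = []                          # step 4: collapse runs of identical digits
    for d in digits:
        if not deduped or d != deduped[-1]:
            deduped.append(d)
    # steps 1 and 5: keep the first letter, drop its own digit and all zeros
    code = term[0] + "".join(d for d in deduped[1:] if d != "0")
    return (code + "000")[:4]             # zero-padding to 4 is an assumption

print(soundex("HERMAN"), soundex("HERMANN"))   # H655 H655 -> the same code
```

HERMANN yields the same code because the doubled N collapses in step 4, which answers the slide's question.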
To sum up Using a positional inverted index with skip pointers, efficient dictionary storage, a wild-card index, spelling correction and Soundex, we can quickly run a query like (SPELL(moriset) /3 tor*to) OR SOUNDEX(chaikofski) Gintarė Grigonytė 43/45
Resources Chapter 3 of IIR Soundex demo: http://www.jamesburt.com/demos/Javascript/SoundexCalculate/ Levenshtein distance demo: http://www.let.rug.nl/~kleiweg/lev/ Levenshtein distance slides: https://rosettacode.org/wiki/levenshtein_distance Peter Norvig's spelling corrector: http://norvig.com/spell-correct.html Gintarė Grigonytė 44/45
Next time Ranked Retrieval: efficient scoring & retrieval Putting it all together in a search system lab on boolean and ranked retrieval Gintarė Grigonytė 45/45