Information Retrieval

Similar documents
Overview. Lecture 3: Index Representation and Tolerant Retrieval. Type/token distinction. IR System components

Dictionaries and tolerant retrieval CE-324 : Modern Information Retrieval Sharif University of Technology

Dictionaries and tolerant retrieval CE-324 : Modern Information Retrieval Sharif University of Technology

Tolerant Retrieval. Searching the Dictionary Tolerant Retrieval. Information Retrieval & Extraction Misbhauddin 1

3-1. Dictionaries and Tolerant Retrieval. Most slides were adapted from Stanford CS 276 course and University of Munich IR course.

Dictionaries and Tolerant retrieval

Dictionaries and tolerant retrieval. Slides by Manning, Raghavan, Schutze

Recap of the previous lecture. This lecture. A naïve dictionary. Introduction to Information Retrieval. Dictionary data structures Tolerant retrieval

Information Retrieval

Recap of last time CS276A Information Retrieval

Information Retrieval

OUTLINE. Documents Terms. General + Non-English English. Skip pointers. Phrase queries

Information Retrieval

Text Technologies for Data Science INFR Indexing (2) Instructor: Walid Magdy

Text Technologies for Data Science INFR Indexing (2) Instructor: Walid Magdy

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from

Natural Language Processing and Information Retrieval

Introduction to Information Retrieval (Manning, Raghavan, Schutze)

Data-analysis and Retrieval Boolean retrieval, posting lists and dictionaries

Introduction to Information Retrieval

Web Information Retrieval. Lecture 4 Dictionaries, Index Compression

Introduction to Information Retrieval

PV211: Introduction to Information Retrieval

Boolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology

Introduction to Information Retrieval

CSCI 5417 Information Retrieval Systems Jim Martin!

Preliminary draft (c)2008 Cambridge UP

CSE 7/5337: Information Retrieval and Web Search Index construction (IIR 4)

Boolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology

CS347. Lecture 2 April 9, Prabhakar Raghavan

Today s topics CS347. Inverted index storage. Inverted index storage. Processing Boolean queries. Lecture 2 April 9, 2001 Prabhakar Raghavan


Part 2: Boolean Retrieval Francesco Ricci

Information Retrieval

Introduction to Information Retrieval

Course structure & admin. CS276A Text Information Retrieval, Mining, and Exploitation. Dictionary and postings files: a fast, compact inverted index

Introduction to Information Retrieval

Boolean Retrieval. Manning, Raghavan and Schütze, Chapter 1. Daniël de Kok

Information Retrieval

Administrative. Distributed indexing. Index Compression! What I did last summer lunch talks today. Master. Tasks

Information Retrieval

Recap: lecture 2 CS276A Information Retrieval

Indexing and Searching

Lecture 3: Phrasal queries and wildcards

Information Retrieval

Information Retrieval and Text Mining

Indexing and Searching

Lecture 2: Encoding models for text. Many thanks to Prabhakar Raghavan for sharing most content from the following slides

3-2. Index construction. Most slides were adapted from Stanford CS 276 course and University of Munich IR course.

Index extensions. Manning, Raghavan and Schütze, Chapter 2. Daniël de Kok

Multimedia Information Extraction and Retrieval Indexing and Query Answering

Digital Libraries: Language Technologies

Indexing. UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze

信息检索与搜索引擎 Introduction to Information Retrieval GESC1007

Information Retrieval

Search Engines. Gertjan van Noord. September 17, 2018

Index Compression. David Kauchak cs160 Fall 2009 adapted from:

Inverted Indexes. Indexing and Searching, Modern Information Retrieval, Addison Wesley, 2010 p. 5

Boolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology

Introduction to Information Retrieval IIR 1: Boolean Retrieval

Efficient Implementation of Postings Lists

More on indexing CE-324: Modern Information Retrieval Sharif University of Technology

Information Retrieval

Information Retrieval

James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence!

Information Retrieval

Information Retrieval

Index construction CE-324: Modern Information Retrieval Sharif University of Technology

CS377: Database Systems Text data and information. Li Xiong Department of Mathematics and Computer Science Emory University

EECS 395/495 Lecture 3 Scalable Indexing, Searching, and Crawling

Introduction to. CS276: Information Retrieval and Web Search Christopher Manning and Prabhakar Raghavan. Lecture 4: Index Construction

Introduction to Information Retrieval

Introduction to Information Retrieval and Boolean model. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H.

CS105 Introduction to Information Retrieval

Information Retrieval CS-E credits

CSE 7/5337: Information Retrieval and Web Search Introduction and Boolean Retrieval (IIR 1)

Introducing Information Retrieval and Web Search. borrowing from: Pandu Nayak

INDEX CONSTRUCTION 1

Indexing and Searching

A Modern spell(1) Abhinav Upadhyay EuroBSDCon 2017, Paris

Introduction to Information Retrieval

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from

Introduction to Information Retrieval

Logistics. CSE Case Studies. Indexing & Retrieval in Google. Review: AltaVista. BigTable. Index Stream Readers (ISRs) Advanced Search

Efficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488)

ΕΠΛ660. Ανάκτηση µε το µοντέλο διανυσµατικού χώρου

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser

Boolean Queries. Keywords combined with Boolean operators:

Information Retrieval

Information Retrieval. Chap 7. Text Operations

Query Evaluation Strategies

Outline of the course

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from

Part 3: The term vocabulary, postings lists and tolerant retrieval Francesco Ricci

Query Evaluation Strategies

Multimedia Information Extraction and Retrieval Term Frequency Inverse Document Frequency

Informa(on Retrieval

Lecture 1: Introduction and the Boolean Model

Query Processing and Alternative Search Structures. Indexing common words

FRONT CODING. Front-coding: 8automat*a1 e2 ic3 ion. Extra length beyond automat. Encodes automat. Begins to resemble general string compression.

Transcription:

Information Retrieval Dictionaries & Tolerant Retrieval Gintarė Grigonytė gintare@ling.su.se Department of Linguistics and Philology Uppsala University Slides based on previous IR course given by Jörg Tiedemann 2013-15 and Introduction to Information Retrieval slides https://nlp.stanford.edu/ir-book/ppt/ Gintarė Grigonytė 1/45

Data Structures for Dictionaries For each term t, we store a list of all documents that contain t. BRUTUS 1 2 4 11 31 45 173 174 CAESAR 1 2 4 5 6 16 57 132... CALPURNIA 2 31 54 101. }{{}}{{} dictionary postings Gintarė Grigonytė 2/45

Naive Dictionary: array of fixed-width entries term document pointer to frequency postings list a 656,265 aachen 65......... zulu 221 space needed: 20 bytes 4 bytes 4 bytes Gintarė Grigonytė 3/45

Naive Dictionary: array of fixed-width entries term document pointer to frequency postings list a 656,265 aachen 65......... zulu 221 space needed: 20 bytes 4 bytes 4 bytes How do we store a dictionary in memory efficiently? How do we look up an element in this array at query time? Gintarė Grigonytė 3/45

Dictionary Data structures Two main classes of data structures: hashes and trees Criteria for when to use hashes vs. trees: Is there a fixed number of terms or will it keep growing? What are the relative frequencies with which various keys will be accessed? How many terms are we likely to have? Gintarė Grigonytė 4/45

Hashes Each vocabulary term is hashed into an integer. Try to avoid collisions At query time, do the following: hash query term, resolve collisions, locate entry in fixed-width array Gintarė Grigonytė 5/45

Hashes Each vocabulary term is hashed into an integer. Try to avoid collisions At query time, do the following: hash query term, resolve collisions, locate entry in fixed-width array Pros: Lookup is fast (faster than in a search tree) Lookup time is constant Gintarė Grigonytė 5/45

Hashes Each vocabulary term is hashed into an integer. Try to avoid collisions At query time, do the following: hash query term, resolve collisions, locate entry in fixed-width array Pros: Lookup is fast (faster than in a search tree) Lookup time is constant Cons no way to find minor variants (resume vs. résumé) no prefix search (all terms starting with automat) need to rehash everything periodically if vocabulary keeps growing Gintarė Grigonytė 5/45

Trees: Binary tree Gintarė Grigonytė 6/45

Trees: Binary tree simplest tree structure efficient for searching Pros: solves the prefix problem (terms starting with automat) Cons: slower: O(log M) in balanced trees (M is the size of the vocabulary) re-balancing binary trees is expensive Alternative: B-tree Gintarė Grigonytė 7/45

Trees: B-tree B-tree definition: every internal node has a number of children in the interval [a, b] where a, b are appropriate positive integers, e.g., [2, 4]. Gintarė Grigonytė 8/45

Trees: B-tree What s the difference? slightly more complex still efficient for searching same features as binary trees (prefix search!) need for re-balancing is less frequent! Gintarė Grigonytė 9/45

Tolerant Retrieval Wildcard queries Spelling correction Soundex Gintarė Grigonytė 10/45

Tolerant Retrieval: Wildcard queries Prefix queries: mon*: find all docs containing any term beginning with mon Easy with B-tree dictionary: retrieve all terms t in the range: mon t < moo Gintarė Grigonytė 11/45

Tolerant Retrieval: Wildcard queries Prefix queries: mon*: find all docs containing any term beginning with mon Easy with B-tree dictionary: retrieve all terms t in the range: mon t < moo Suffix Queries: *mon: find all docs containing any term ending with mon Gintarė Grigonytė 11/45

Tolerant Retrieval: Wildcard queries Prefix queries: mon*: find all docs containing any term beginning with mon Easy with B-tree dictionary: retrieve all terms t in the range: mon t < moo Suffix Queries: *mon: find all docs containing any term ending with mon Maintain an additional tree for terms backwards Then retrieve all terms t in the range: nom t < non Gintarė Grigonytė 11/45

Tolerant Retrieval: Wildcard queries Prefix queries: mon*: find all docs containing any term beginning with mon Easy with B-tree dictionary: retrieve all terms t in the range: mon t < moo Suffix Queries: *mon: find all docs containing any term ending with mon Maintain an additional tree for terms backwards Then retrieve all terms t in the range: nom t < non How can we find all terms matching pro*cent? Gintarė Grigonytė 11/45

How to handle * in the middle of a term Example: pro*cent Gintarė Grigonytė 12/45

How to handle * in the middle of a term Example: pro*cent We could look up pro* and *cent in the B-tree and intersect the two term sets. Gintarė Grigonytė 12/45

How to handle * in the middle of a term Example: pro*cent We could look up pro* and *cent in the B-tree and intersect the two term sets. Expensive! Alternative: permuterm index! Gintarė Grigonytė 12/45

Permuterm index Rotate wildcard query, so that the * occurs at the end. introduce special symbol $ to indicate end of string Gintarė Grigonytė 13/45

Permuterm index Rotate wildcard query, so that the * occurs at the end. introduce special symbol $ to indicate end of string pro*cent becomes cent$pro* How does this help when matching queries with the index? Gintarė Grigonytė 13/45

Permuterm index Add all rotations to B-tree: HELLO hello$, ello$h, llo$he, lo$hel o$hell Gintarė Grigonytė 14/45

Permuterm index & term mapping all rotated entries map to the same string... For X, look up X$ For X*, look up X* For *X, look up X$* For *X*, look up X* For X*Y, look up Y$X* Example: For hel*o, look up o$hel* Gintarė Grigonytė 15/45

Processing a lookup in the permuterm index To sum up: Rotate query wildcard to the right Use B-tree lookup as before What is the problem with this structure? Gintarė Grigonytė 16/45

Processing a lookup in the permuterm index To sum up: Rotate query wildcard to the right Use B-tree lookup as before What is the problem with this structure? Permuterm more than quadruples the size of the dictionary compared to a regular B-tree. (empirical observation for English) Gintarė Grigonytė 16/45

Alternative: Bigram (k-gram) indexes Enumerate all k-grams (sequence of k characters) occurring in a term more space-efficient than permuterm index Gintarė Grigonytė 17/45

Alternative: Bigram (k-gram) indexes Enumerate all k-grams (sequence of k characters) occurring in a term more space-efficient than permuterm index Example: from April is the cruelest month we get the 2-grams (bigrams): $a ap pr ri il l$ $i is s$ $t th he e$ $c cr ru ue el le es st t$ $m mo on nt h$ $ is a special word boundary symbol, as before. Gintarė Grigonytė 17/45

Alternative: Bigram (k-gram) indexes Enumerate all k-grams (sequence of k characters) occurring in a term more space-efficient than permuterm index Example: from April is the cruelest month we get the 2-grams (bigrams): $a ap pr ri il l$ $i is s$ $t th he e$ $c cr ru ue el le es st t$ $m mo on nt h$ $ is a special word boundary symbol, as before. Maintain a second inverted index from bigrams to the dictionary terms that contain the bigram Gintarė Grigonytė 17/45

Postings list in a 3-gram inverted index etr BEETROOT METRIC PETRIFY RETRIEVAL... retrieve all postings of matching k-grams intersect all lists as usual Gintarė Grigonytė 18/45

Processing wildcarded terms in a bigram index Query mon* can now be run as: Gintarė Grigonytė 19/45

Processing wildcarded terms in a bigram index Query mon* can now be run as: $m AND mo AND on Gets us all terms with the prefix mon, but... Gintarė Grigonytė 19/45

Processing wildcarded terms in a bigram index Query mon* can now be run as: $m AND mo AND on Gets us all terms with the prefix mon, but...... also many false positives like MOON. Must post-filter these terms against query. Gintarė Grigonytė 19/45

Processing wildcarded terms in a bigram index Query mon* can now be run as: $m AND mo AND on Gets us all terms with the prefix mon, but...... also many false positives like MOON. Must post-filter these terms against query. Surviving terms are then looked up in the term-document inverted index. k-gram are still fast and more space efficient than permuterm indexes. Gintarė Grigonytė 19/45

Processing wildcard queries As before, we must potentially execute a large number of Boolean queries Most straightforward semantics: Conjunction of disjunctions Recall the query: gen* AND universit* Gintarė Grigonytė 20/45

Processing wildcard queries As before, we must potentially execute a large number of Boolean queries Most straightforward semantics: Conjunction of disjunctions Recall the query: gen* AND universit* (geneva AND university) OR (geneva AND université) OR (genève AND university) OR (genève AND université) OR (general AND universities) OR...... Very expensive! Requires query optimization! Do we need to support wildcard queries? Gintarė Grigonytė 20/45

Processing wildcard queries As before, we must potentially execute a large number of Boolean queries Most straightforward semantics: Conjunction of disjunctions Recall the query: gen* AND universit* (geneva AND university) OR (geneva AND université) OR (genève AND university) OR (genève AND université) OR (general AND universities) OR...... Very expensive! Requires query optimization! Do we need to support wildcard queries? Users are lazy! If wildcards are allowed Users will love it (?) Does Google allow wildcard queries? (And other engines?) Gintarė Grigonytė 20/45

Tolerant Retrieval: Spälling Korrection Two general uses of spelling correction: Correcting documents being indexed Correcting user queries (more common) Gintarė Grigonytė 21/45

Tolerant Retrieval: Spälling Korrection Two general uses of spelling correction: Correcting documents being indexed Correcting user queries (more common) Two different methods: Isolated word spelling correction Check each word on its own for misspelling Will not catch typos resulting in correctly spelled words, e.g., an asteroid that fell form the sky Context-sensitive spelling correction Look at surrounding words Can correct form/from error above Gintarė Grigonytė 21/45

Correcting documents primarily for OCR ed documents tuned for OCR mistakes (trained classifiers) may use domain-specific knowledge confusion between O and D etc... but also: correct typos in web pages and low quality documents fewer misspellings in our dictionary and better matching general IR philosophy: don t change the documents Gintarė Grigonytė 22/45

Correcting queries more common in IR than document correction typos in queries are common people are in a hurry users often look for things they don t know much about example: al quajda strategies: (also) retrieve documents with the correct spelling return alternative query suggestions ( Did you mean...? ) Gintarė Grigonytė 23/45

Isolated word correction Premise 1: There is a list of correct words from which the correct spellings come. Premise 2: We have a way of computing the distance between a misspelled word and a correct word. Simple spelling correction algorithm: return the correct word that has the smallest distance to the misspelled word. Example: informaton information Gintarė Grigonytė 24/45

Isolated word correction Two choices: use a standard lexicon Webster s, OED etc.... industry-specific dictionary (for domain-specific IR) advantage: correct entries only! vocabulary of the inverted index all words in the collection better coverage but: include all misspellings compute weights for all terms (based on frequencies) Gintarė Grigonytė 25/45

Isolated word correction Task: Return lexicon entry that is closest to a given character sequence Q What is closest? Several alternatives: 1. Edit distance and Levenshtein distance 2. Weighted edit distance 3. k-gram overlap Gintarė Grigonytė 26/45

Edit distance The edit distance between string s 1 and string s 2 is the minimum number of basic operations that convert s 1 to s 2. Levenshtein distance: basic operations = insert, delete, and replace Levenshtein distance dog-do: 1 Levenshtein distance cat-cart: 1 Levenshtein distance cat-cut: 1 Levenshtein distance cat-act: 2 Gintarė Grigonytė 27/45

Edit distance The edit distance between string s 1 and string s 2 is the minimum number of basic operations that convert s 1 to s 2. Levenshtein distance: basic operations = insert, delete, and replace Levenshtein distance dog-do: 1 Levenshtein distance cat-cart: 1 Levenshtein distance cat-cut: 1 Levenshtein distance cat-act: 2 Damerau-Levenshtein: additional operation = transpose Damerau-Levenshtein distance cat-act: 1 Gintarė Grigonytė 27/45

Levenshtein distance: Computation Recursive definition & dynamic programming: start with upper-left table cell fill table with edit costs f a s t 0 1 2 3 4 c 1 1 2 3 4 a 2 2 1 2 3 t 3 3 2 2 2 s 4 4 3 2 3 Gintarė Grigonytė 28/45

Levenshtein distance: algorithm LEVENSHTEINDISTANCE(s 1, s 2 ) 1 for i 0 to s 1 2 do m[i, 0] = i 3 for j 0 to s 2 4 do m[0, j] = j 5 for i 1 to s 1 6 do for j 1 to s 2 7 do if s 1 [i] = s 2 [j] 8 then m[i, j] = min{m[i 1, j] + 1, m[i, j 1] + 1, m[i 1, j 1]} 9 else m[i, j] = min{m[i 1, j] + 1, m[i, j 1] + 1, m[i 1, j 1] + 1} 10 return m[ s 1, s 2 ] Operations: insert, delete, replace, copy Gintarė Grigonytė 29/45

Levenshtein distance: algorithm LEVENSHTEINDISTANCE(s 1, s 2 ) 1 for i 0 to s 1 2 do m[i, 0] = i 3 for j 0 to s 2 4 do m[0, j] = j 5 for i 1 to s 1 6 do for j 1 to s 2 7 do if s 1 [i] = s 2 [j] 8 then m[i, j] = min{m[i 1, j] + 1, m[i, j 1] + 1, m[i 1, j 1]} 9 else m[i, j] = min{m[i 1, j] + 1, m[i, j 1] + 1, m[i 1, j 1] + 1} 10 return m[ s 1, s 2 ] Operations: insert, delete, replace, copy Gintarė Grigonytė 30/45

Levenshtein distance: algorithm LEVENSHTEINDISTANCE(s 1, s 2 ) 1 for i 0 to s 1 2 do m[i, 0] = i 3 for j 0 to s 2 4 do m[0, j] = j 5 for i 1 to s 1 6 do for j 1 to s 2 7 do if s 1 [i] = s 2 [j] 8 then m[i, j] = min{m[i 1, j] + 1, m[i, j 1] + 1, m[i 1, j 1]} 9 else m[i, j] = min{m[i 1, j] + 1, m[i, j 1] + 1, m[i 1, j 1] + 1} 10 return m[ s 1, s 2 ] Operations: insert, delete, replace, copy Gintarė Grigonytė 31/45

Levenshtein distance: algorithm LEVENSHTEINDISTANCE(s 1, s 2 ) 1 for i 0 to s 1 2 do m[i, 0] = i 3 for j 0 to s 2 4 do m[0, j] = j 5 for i 1 to s 1 6 do for j 1 to s 2 7 do if s 1 [i] = s 2 [j] 8 then m[i, j] = min{m[i 1, j] + 1, m[i, j 1] + 1, m[i 1, j 1]} 9 else m[i, j] = min{m[i 1, j] + 1, m[i, j 1] + 1, m[i 1, j 1] + 1} 10 return m[ s 1, s 2 ] Operations: insert, delete, replace, copy Gintarė Grigonytė 32/45

Levenshtein distance: algorithm LEVENSHTEINDISTANCE(s 1, s 2 ) 1 for i 0 to s 1 2 do m[i, 0] = i 3 for j 0 to s 2 4 do m[0, j] = j 5 for i 1 to s 1 6 do for j 1 to s 2 7 do if s 1 [i] = s 2 [j] 8 then m[i, j] = min{m[i 1, j] + 1, m[i, j 1] + 1, m[i 1, j 1]} 9 else m[i, j] = min{m[i 1, j] + 1, m[i, j 1] + 1, m[i 1, j 1] + 1} 10 return m[ s 1, s 2 ] Operations: insert, delete, replace, copy Gintarė Grigonytė 33/45

Each cell of Levenshtein matrix cost of getting here from my upper left neighbor (copy or replace) cost of getting here from my left neighbor (insert) cost of getting here from my upper neighbor (delete) the minimum of the three possible moves ; the cheapest way of getting here Gintarė Grigonytė 34/45

Weighted edit distance As above, but weight of an operation depends on the characters involved. Meant to capture keyboard errors, e.g., m more likely to be mistyped as n than as q. Therefore, replacing m by n is a smaller edit distance than by q. We now require a weight matrix as input. Modify dynamic programming to handle weights. Gintarė Grigonytė 35/45

Using edit distance given a query: get all sequences within a fixed edit distance intersect this list with the list of correct words return spelling suggestions Alternatively: use all corrections to retrieve doc s slow! (and accepted by user?) use single best correction for retrieval Gintarė Grigonytė 36/45

Using edit distance Problems: lot s of possible strings even with few edit operations intersection with dictionary is slow Possible solution: use N-gram overlap can replace edit distance for spelling correction Gintarė Grigonytė 37/45

k-gram indexes for spelling correction Get all k-grams in the query term Use the k-gram index to retrieve correct words that match query term k-grams (recall wildcard search) Threshold by number of matching k-grams (e.g., only terms that differ by at most 3 k-grams) or use Jaccard coefficient > threshold Jaccard(A, B) = A B A B Gintarė Grigonytė 38/45

k-gram indexes for spelling correction Get all k-grams in the query term Use the k-gram index to retrieve correct words that match query term k-grams (recall wildcard search) Threshold by number of matching k-grams (e.g., only terms that differ by at most 3 k-grams) or use Jaccard coefficient > threshold Jaccard(A, B) = A B A B Example: Bigram index, misspelled word bordroom Bigrams: bo, or, rd, dr, ro, oo, om Gintarė Grigonytė 38/45

k-gram indexes for spelling correction: bordroom BO aboard about boardroom border OR border lord morbid sordid RD aboard ardent boardroom border... boardroom exists in 6 out of 7 lists Jaccard = 6/(8 + 7 6) = 6/9 0.67 Gintarė Grigonytė 39/45

Context-sensitive spelling correction One approach: break phrase query into conjunction of biwords look for biwords that need only one term corrected get phrase matches and rank them... Gintarė Grigonytė 40/45

Context-sensitive spelling correction Another approach: Hit-based spelling correction Example: flew form munich Try all phrases with possible corrections: Try query flea form munich Try query flew from munich Try query flew form munch The correct query flew from munich has most hits. Many alternatives! Not very efficient! try to correct only if few hits returned tweaking with query logs and expected hits Gintarė Grigonytė 41/45

Tolerant Search: Soundex Find phonetic alternatives. Example: chebyshev / tchebyscheff Soundex = class of heuristics (invented in 1918) 1. Retain the first letter of the term. 2. Replace all letters [AEIOUHWY] to 0 3. Replace letters to digits: B, F, P, V to 1 C, G, J, K, Q, S, X, Z to 2 D,T to 3 L to 4 M, N to 5 R to 6 4. remove consecutive identical digits 5. remove all zeros and return first 4 characters Gintarė Grigonytė 42/45

Tolerant Search: Soundex Find phonetic alternatives. Example: chebyshev / tchebyscheff Soundex = class of heuristics (invented in 1918) 1. Retain the first letter of the term. 2. Replace all letters [AEIOUHWY] to 0 3. Replace letters to digits: B, F, P, V to 1 C, G, J, K, Q, S, X, Z to 2 D,T to 3 L to 4 M, N to 5 R to 6 4. remove consecutive identical digits 5. remove all zeros and return first 4 characters HERMAN H655 Will HERMANN generate the same code? Gintarė Grigonytė 42/45

To sum up Using a positional inverted index with skip pointers efficient dictionary storage wild-card index spelling correction Soundex We can quickly run a query like (SPELL(moriset) /3 tor*to) OR SOUNDEX(chaikofski) Gintarė Grigonytė 43/45

Resources Chapter 3 of IIR Resources Soundex demo http://www.jamesburt.com/demos/ Javascript/SoundexCalculate/ Levenshtein distance demo http://www.let.rug.nl/~kleiweg/lev/ Levenshtein distance slides https: //rosettacode.org/wiki/levenshtein_distance Peter Norvig s spelling corrector http://norvig.com/spell-correct.html Gintarė Grigonytė 44/45

Next time Ranked Retrieval: efficient scoring & retrieval Putting it all together in a search system lab on boolean and ranked retrieval Gintarė Grigonytė 45/45