Information Retrieval

Size: px
Start display at page:

Download "Information Retrieval"

Transcription

1 Information Retrieval Dictionaries & Tolerant Retrieval Gintarė Grigonytė Department of Linguistics and Philology Uppsala University Slides based on previous IR course given by Jörg Tiedemann and Introduction to Information Retrieval slides Gintarė Grigonytė 1/45

2 Data Structures for Dictionaries For each term t, we store a list of all documents that contain t. BRUTUS CAESAR CALPURNIA }{{}}{{} dictionary postings Gintarė Grigonytė 2/45

3 Naive Dictionary: array of fixed-width entries term document pointer to frequency postings list a 656,265 aachen zulu 221 space needed: 20 bytes 4 bytes 4 bytes Gintarė Grigonytė 3/45

4 Naive Dictionary: array of fixed-width entries term document pointer to frequency postings list a 656,265 aachen zulu 221 space needed: 20 bytes 4 bytes 4 bytes How do we store a dictionary in memory efficiently? How do we look up an element in this array at query time? Gintarė Grigonytė 3/45

5 Dictionary Data structures Two main classes of data structures: hashes and trees Criteria for when to use hashes vs. trees: Is there a fixed number of terms or will it keep growing? What are the relative frequencies with which various keys will be accessed? How many terms are we likely to have? Gintarė Grigonytė 4/45

6 Hashes Each vocabulary term is hashed into an integer. Try to avoid collisions At query time, do the following: hash query term, resolve collisions, locate entry in fixed-width array Gintarė Grigonytė 5/45

7 Hashes Each vocabulary term is hashed into an integer. Try to avoid collisions At query time, do the following: hash query term, resolve collisions, locate entry in fixed-width array Pros: Lookup is fast (faster than in a search tree) Lookup time is constant Gintarė Grigonytė 5/45

8 Hashes Each vocabulary term is hashed into an integer. Try to avoid collisions At query time, do the following: hash query term, resolve collisions, locate entry in fixed-width array Pros: Lookup is fast (faster than in a search tree) Lookup time is constant Cons no way to find minor variants (resume vs. résumé) no prefix search (all terms starting with automat) need to rehash everything periodically if vocabulary keeps growing Gintarė Grigonytė 5/45

9 Trees: Binary tree Gintarė Grigonytė 6/45

10 Trees: Binary tree simplest tree structure efficient for searching Pros: solves the prefix problem (terms starting with automat) Cons: slower: O(log M) in balanced trees (M is the size of the vocabulary) re-balancing binary trees is expensive Alternative: B-tree Gintarė Grigonytė 7/45

11 Trees: B-tree B-tree definition: every internal node has a number of children in the interval [a, b] where a, b are appropriate positive integers, e.g., [2, 4]. Gintarė Grigonytė 8/45

12 Trees: B-tree What s the difference? slightly more complex still efficient for searching same features as binary trees (prefix search!) need for re-balancing is less frequent! Gintarė Grigonytė 9/45

13 Tolerant Retrieval Wildcard queries Spelling correction Soundex Gintarė Grigonytė 10/45

14 Tolerant Retrieval: Wildcard queries Prefix queries: mon*: find all docs containing any term beginning with mon Easy with B-tree dictionary: retrieve all terms t in the range: mon t < moo Gintarė Grigonytė 11/45

15 Tolerant Retrieval: Wildcard queries Prefix queries: mon*: find all docs containing any term beginning with mon Easy with B-tree dictionary: retrieve all terms t in the range: mon t < moo Suffix Queries: *mon: find all docs containing any term ending with mon Gintarė Grigonytė 11/45

16 Tolerant Retrieval: Wildcard queries Prefix queries: mon*: find all docs containing any term beginning with mon Easy with B-tree dictionary: retrieve all terms t in the range: mon t < moo Suffix Queries: *mon: find all docs containing any term ending with mon Maintain an additional tree for terms backwards Then retrieve all terms t in the range: nom t < non Gintarė Grigonytė 11/45

17 Tolerant Retrieval: Wildcard queries Prefix queries: mon*: find all docs containing any term beginning with mon Easy with B-tree dictionary: retrieve all terms t in the range: mon t < moo Suffix Queries: *mon: find all docs containing any term ending with mon Maintain an additional tree for terms backwards Then retrieve all terms t in the range: nom t < non How can we find all terms matching pro*cent? Gintarė Grigonytė 11/45

18 How to handle * in the middle of a term Example: pro*cent Gintarė Grigonytė 12/45

19 How to handle * in the middle of a term Example: pro*cent We could look up pro* and *cent in the B-tree and intersect the two term sets. Gintarė Grigonytė 12/45

20 How to handle * in the middle of a term Example: pro*cent We could look up pro* and *cent in the B-tree and intersect the two term sets. Expensive! Alternative: permuterm index! Gintarė Grigonytė 12/45

21 Permuterm index Rotate wildcard query, so that the * occurs at the end. introduce special symbol $ to indicate end of string Gintarė Grigonytė 13/45

22 Permuterm index Rotate wildcard query, so that the * occurs at the end. introduce special symbol $ to indicate end of string pro*cent becomes cent$pro* How does this help when matching queries with the index? Gintarė Grigonytė 13/45

23 Permuterm index Add all rotations to B-tree: HELLO hello$, ello$h, llo$he, lo$hel o$hell Gintarė Grigonytė 14/45

24 Permuterm index & term mapping all rotated entries map to the same string... For X, look up X$ For X*, look up X* For *X, look up X$* For *X*, look up X* For X*Y, look up Y$X* Example: For hel*o, look up o$hel* Gintarė Grigonytė 15/45

25 Processing a lookup in the permuterm index To sum up: Rotate query wildcard to the right Use B-tree lookup as before What is the problem with this structure? Gintarė Grigonytė 16/45

26 Processing a lookup in the permuterm index To sum up: Rotate query wildcard to the right Use B-tree lookup as before What is the problem with this structure? Permuterm more than quadruples the size of the dictionary compared to a regular B-tree. (empirical observation for English) Gintarė Grigonytė 16/45

27 Alternative: Bigram (k-gram) indexes Enumerate all k-grams (sequence of k characters) occurring in a term more space-efficient than permuterm index Gintarė Grigonytė 17/45

28 Alternative: Bigram (k-gram) indexes Enumerate all k-grams (sequence of k characters) occurring in a term more space-efficient than permuterm index Example: from April is the cruelest month we get the 2-grams (bigrams): $a ap pr ri il l$ $i is s$ $t th he e$ $c cr ru ue el le es st t$ $m mo on nt h$ $ is a special word boundary symbol, as before. Gintarė Grigonytė 17/45

29 Alternative: Bigram (k-gram) indexes Enumerate all k-grams (sequence of k characters) occurring in a term more space-efficient than permuterm index Example: from April is the cruelest month we get the 2-grams (bigrams): $a ap pr ri il l$ $i is s$ $t th he e$ $c cr ru ue el le es st t$ $m mo on nt h$ $ is a special word boundary symbol, as before. Maintain a second inverted index from bigrams to the dictionary terms that contain the bigram Gintarė Grigonytė 17/45

30 Postings list in a 3-gram inverted index etr BEETROOT METRIC PETRIFY RETRIEVAL... retrieve all postings of matching k-grams intersect all lists as usual Gintarė Grigonytė 18/45

31 Processing wildcarded terms in a bigram index Query mon* can now be run as: Gintarė Grigonytė 19/45

32 Processing wildcarded terms in a bigram index Query mon* can now be run as: $m AND mo AND on Gets us all terms with the prefix mon, but... Gintarė Grigonytė 19/45

33 Processing wildcarded terms in a bigram index Query mon* can now be run as: $m AND mo AND on Gets us all terms with the prefix mon, but also many false positives like MOON. Must post-filter these terms against query. Gintarė Grigonytė 19/45

34 Processing wildcarded terms in a bigram index Query mon* can now be run as: $m AND mo AND on Gets us all terms with the prefix mon, but also many false positives like MOON. Must post-filter these terms against query. Surviving terms are then looked up in the term-document inverted index. k-gram are still fast and more space efficient than permuterm indexes. Gintarė Grigonytė 19/45

35 Processing wildcard queries As before, we must potentially execute a large number of Boolean queries Most straightforward semantics: Conjunction of disjunctions Recall the query: gen* AND universit* Gintarė Grigonytė 20/45

36 Processing wildcard queries As before, we must potentially execute a large number of Boolean queries Most straightforward semantics: Conjunction of disjunctions Recall the query: gen* AND universit* (geneva AND university) OR (geneva AND université) OR (genève AND university) OR (genève AND université) OR (general AND universities) OR Very expensive! Requires query optimization! Do we need to support wildcard queries? Gintarė Grigonytė 20/45

37 Processing wildcard queries As before, we must potentially execute a large number of Boolean queries Most straightforward semantics: Conjunction of disjunctions Recall the query: gen* AND universit* (geneva AND university) OR (geneva AND université) OR (genève AND university) OR (genève AND université) OR (general AND universities) OR Very expensive! Requires query optimization! Do we need to support wildcard queries? Users are lazy! If wildcards are allowed Users will love it (?) Does Google allow wildcard queries? (And other engines?) Gintarė Grigonytė 20/45

38 Tolerant Retrieval: Spälling Korrection Two general uses of spelling correction: Correcting documents being indexed Correcting user queries (more common) Gintarė Grigonytė 21/45

39 Tolerant Retrieval: Spälling Korrection Two general uses of spelling correction: Correcting documents being indexed Correcting user queries (more common) Two different methods: Isolated word spelling correction Check each word on its own for misspelling Will not catch typos resulting in correctly spelled words, e.g., an asteroid that fell form the sky Context-sensitive spelling correction Look at surrounding words Can correct form/from error above Gintarė Grigonytė 21/45

40 Correcting documents primarily for OCR ed documents tuned for OCR mistakes (trained classifiers) may use domain-specific knowledge confusion between O and D etc... but also: correct typos in web pages and low quality documents fewer misspellings in our dictionary and better matching general IR philosophy: don t change the documents Gintarė Grigonytė 22/45

41 Correcting queries more common in IR than document correction typos in queries are common people are in a hurry users often look for things they don t know much about example: al quajda strategies: (also) retrieve documents with the correct spelling return alternative query suggestions ( Did you mean...? ) Gintarė Grigonytė 23/45

42 Isolated word correction Premise 1: There is a list of correct words from which the correct spellings come. Premise 2: We have a way of computing the distance between a misspelled word and a correct word. Simple spelling correction algorithm: return the correct word that has the smallest distance to the misspelled word. Example: informaton information Gintarė Grigonytė 24/45

43 Isolated word correction Two choices: use a standard lexicon Webster s, OED etc.... industry-specific dictionary (for domain-specific IR) advantage: correct entries only! vocabulary of the inverted index all words in the collection better coverage but: include all misspellings compute weights for all terms (based on frequencies) Gintarė Grigonytė 25/45

44 Isolated word correction Task: Return lexicon entry that is closest to a given character sequence Q What is closest? Several alternatives: 1. Edit distance and Levenshtein distance 2. Weighted edit distance 3. k-gram overlap Gintarė Grigonytė 26/45

45 Edit distance The edit distance between string s 1 and string s 2 is the minimum number of basic operations that convert s 1 to s 2. Levenshtein distance: basic operations = insert, delete, and replace Levenshtein distance dog-do: 1 Levenshtein distance cat-cart: 1 Levenshtein distance cat-cut: 1 Levenshtein distance cat-act: 2 Gintarė Grigonytė 27/45

46 Edit distance The edit distance between string s 1 and string s 2 is the minimum number of basic operations that convert s 1 to s 2. Levenshtein distance: basic operations = insert, delete, and replace Levenshtein distance dog-do: 1 Levenshtein distance cat-cart: 1 Levenshtein distance cat-cut: 1 Levenshtein distance cat-act: 2 Damerau-Levenshtein: additional operation = transpose Damerau-Levenshtein distance cat-act: 1 Gintarė Grigonytė 27/45

47 Levenshtein distance: Computation Recursive definition & dynamic programming: start with upper-left table cell fill table with edit costs f a s t c a t s Gintarė Grigonytė 28/45

48 Levenshtein distance: algorithm LEVENSHTEINDISTANCE(s 1, s 2 ) 1 for i 0 to s 1 2 do m[i, 0] = i 3 for j 0 to s 2 4 do m[0, j] = j 5 for i 1 to s 1 6 do for j 1 to s 2 7 do if s 1 [i] = s 2 [j] 8 then m[i, j] = min{m[i 1, j] + 1, m[i, j 1] + 1, m[i 1, j 1]} 9 else m[i, j] = min{m[i 1, j] + 1, m[i, j 1] + 1, m[i 1, j 1] + 1} 10 return m[ s 1, s 2 ] Operations: insert, delete, replace, copy Gintarė Grigonytė 29/45

49 Levenshtein distance: algorithm LEVENSHTEINDISTANCE(s 1, s 2 ) 1 for i 0 to s 1 2 do m[i, 0] = i 3 for j 0 to s 2 4 do m[0, j] = j 5 for i 1 to s 1 6 do for j 1 to s 2 7 do if s 1 [i] = s 2 [j] 8 then m[i, j] = min{m[i 1, j] + 1, m[i, j 1] + 1, m[i 1, j 1]} 9 else m[i, j] = min{m[i 1, j] + 1, m[i, j 1] + 1, m[i 1, j 1] + 1} 10 return m[ s 1, s 2 ] Operations: insert, delete, replace, copy Gintarė Grigonytė 30/45

50 Levenshtein distance: algorithm LEVENSHTEINDISTANCE(s 1, s 2 ) 1 for i 0 to s 1 2 do m[i, 0] = i 3 for j 0 to s 2 4 do m[0, j] = j 5 for i 1 to s 1 6 do for j 1 to s 2 7 do if s 1 [i] = s 2 [j] 8 then m[i, j] = min{m[i 1, j] + 1, m[i, j 1] + 1, m[i 1, j 1]} 9 else m[i, j] = min{m[i 1, j] + 1, m[i, j 1] + 1, m[i 1, j 1] + 1} 10 return m[ s 1, s 2 ] Operations: insert, delete, replace, copy Gintarė Grigonytė 31/45

51 Levenshtein distance: algorithm LEVENSHTEINDISTANCE(s 1, s 2 ) 1 for i 0 to s 1 2 do m[i, 0] = i 3 for j 0 to s 2 4 do m[0, j] = j 5 for i 1 to s 1 6 do for j 1 to s 2 7 do if s 1 [i] = s 2 [j] 8 then m[i, j] = min{m[i 1, j] + 1, m[i, j 1] + 1, m[i 1, j 1]} 9 else m[i, j] = min{m[i 1, j] + 1, m[i, j 1] + 1, m[i 1, j 1] + 1} 10 return m[ s 1, s 2 ] Operations: insert, delete, replace, copy Gintarė Grigonytė 32/45

52 Levenshtein distance: algorithm LEVENSHTEINDISTANCE(s 1, s 2 ) 1 for i 0 to s 1 2 do m[i, 0] = i 3 for j 0 to s 2 4 do m[0, j] = j 5 for i 1 to s 1 6 do for j 1 to s 2 7 do if s 1 [i] = s 2 [j] 8 then m[i, j] = min{m[i 1, j] + 1, m[i, j 1] + 1, m[i 1, j 1]} 9 else m[i, j] = min{m[i 1, j] + 1, m[i, j 1] + 1, m[i 1, j 1] + 1} 10 return m[ s 1, s 2 ] Operations: insert, delete, replace, copy Gintarė Grigonytė 33/45

53 Each cell of Levenshtein matrix cost of getting here from my upper left neighbor (copy or replace) cost of getting here from my left neighbor (insert) cost of getting here from my upper neighbor (delete) the minimum of the three possible moves ; the cheapest way of getting here Gintarė Grigonytė 34/45

54 Weighted edit distance As above, but weight of an operation depends on the characters involved. Meant to capture keyboard errors, e.g., m more likely to be mistyped as n than as q. Therefore, replacing m by n is a smaller edit distance than by q. We now require a weight matrix as input. Modify dynamic programming to handle weights. Gintarė Grigonytė 35/45

55 Using edit distance given a query: get all sequences within a fixed edit distance intersect this list with the list of correct words return spelling suggestions Alternatively: use all corrections to retrieve doc s slow! (and accepted by user?) use single best correction for retrieval Gintarė Grigonytė 36/45

56 Using edit distance Problems: lot s of possible strings even with few edit operations intersection with dictionary is slow Possible solution: use N-gram overlap can replace edit distance for spelling correction Gintarė Grigonytė 37/45

57 k-gram indexes for spelling correction Get all k-grams in the query term Use the k-gram index to retrieve correct words that match query term k-grams (recall wildcard search) Threshold by number of matching k-grams (e.g., only terms that differ by at most 3 k-grams) or use Jaccard coefficient > threshold Jaccard(A, B) = A B A B Gintarė Grigonytė 38/45

58 k-gram indexes for spelling correction Get all k-grams in the query term Use the k-gram index to retrieve correct words that match query term k-grams (recall wildcard search) Threshold by number of matching k-grams (e.g., only terms that differ by at most 3 k-grams) or use Jaccard coefficient > threshold Jaccard(A, B) = A B A B Example: Bigram index, misspelled word bordroom Bigrams: bo, or, rd, dr, ro, oo, om Gintarė Grigonytė 38/45

59 k-gram indexes for spelling correction: bordroom BO aboard about boardroom border OR border lord morbid sordid RD aboard ardent boardroom border... boardroom exists in 6 out of 7 lists Jaccard = 6/( ) = 6/ Gintarė Grigonytė 39/45

60 Context-sensitive spelling correction One approach: break phrase query into conjunction of biwords look for biwords that need only one term corrected get phrase matches and rank them... Gintarė Grigonytė 40/45

61 Context-sensitive spelling correction Another approach: Hit-based spelling correction Example: flew form munich Try all phrases with possible corrections: Try query flea form munich Try query flew from munich Try query flew form munch The correct query flew from munich has most hits. Many alternatives! Not very efficient! try to correct only if few hits returned tweaking with query logs and expected hits Gintarė Grigonytė 41/45

62 Tolerant Search: Soundex Find phonetic alternatives. Example: chebyshev / tchebyscheff Soundex = class of heuristics (invented in 1918) 1. Retain the first letter of the term. 2. Replace all letters [AEIOUHWY] to 0 3. Replace letters to digits: B, F, P, V to 1 C, G, J, K, Q, S, X, Z to 2 D,T to 3 L to 4 M, N to 5 R to 6 4. remove consecutive identical digits 5. remove all zeros and return first 4 characters Gintarė Grigonytė 42/45

63 Tolerant Search: Soundex Find phonetic alternatives. Example: chebyshev / tchebyscheff Soundex = class of heuristics (invented in 1918) 1. Retain the first letter of the term. 2. Replace all letters [AEIOUHWY] to 0 3. Replace letters to digits: B, F, P, V to 1 C, G, J, K, Q, S, X, Z to 2 D,T to 3 L to 4 M, N to 5 R to 6 4. remove consecutive identical digits 5. remove all zeros and return first 4 characters HERMAN H655 Will HERMANN generate the same code? Gintarė Grigonytė 42/45

64 To sum up Using a positional inverted index with skip pointers efficient dictionary storage wild-card index spelling correction Soundex We can quickly run a query like (SPELL(moriset) /3 tor*to) OR SOUNDEX(chaikofski) Gintarė Grigonytė 43/45

65 Resources Chapter 3 of IIR Resources Soundex demo Javascript/SoundexCalculate/ Levenshtein distance demo Levenshtein distance slides https: //rosettacode.org/wiki/levenshtein_distance Peter Norvig s spelling corrector Gintarė Grigonytė 44/45

66 Next time Ranked Retrieval: efficient scoring & retrieval Putting it all together in a search system lab on boolean and ranked retrieval Gintarė Grigonytė 45/45

Overview. Lecture 3: Index Representation and Tolerant Retrieval. Type/token distinction. IR System components

Overview. Lecture 3: Index Representation and Tolerant Retrieval. Type/token distinction. IR System components Overview Lecture 3: Index Representation and Tolerant Retrieval Information Retrieval Computer Science Tripos Part II Ronan Cummins 1 Natural Language and Information Processing (NLIP) Group 1 Recap 2

More information

Dictionaries and tolerant retrieval CE-324 : Modern Information Retrieval Sharif University of Technology

Dictionaries and tolerant retrieval CE-324 : Modern Information Retrieval Sharif University of Technology Dictionaries and tolerant retrieval CE-324 : Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2014 Most slides have been adapted from: Profs. Nayak & Raghavan (CS- 276, Stanford)

More information

Dictionaries and tolerant retrieval CE-324 : Modern Information Retrieval Sharif University of Technology

Dictionaries and tolerant retrieval CE-324 : Modern Information Retrieval Sharif University of Technology Dictionaries and tolerant retrieval CE-324 : Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2016 Most slides have been adapted from: Profs. Nayak & Raghavan (CS- 276, Stanford)

More information

Tolerant Retrieval. Searching the Dictionary Tolerant Retrieval. Information Retrieval & Extraction Misbhauddin 1

Tolerant Retrieval. Searching the Dictionary Tolerant Retrieval. Information Retrieval & Extraction Misbhauddin 1 Tolerant Retrieval Searching the Dictionary Tolerant Retrieval Information Retrieval & Extraction Misbhauddin 1 Query Retrieval Dictionary data structures Tolerant retrieval Wild-card queries Soundex Spelling

More information

3-1. Dictionaries and Tolerant Retrieval. Most slides were adapted from Stanford CS 276 course and University of Munich IR course.

3-1. Dictionaries and Tolerant Retrieval. Most slides were adapted from Stanford CS 276 course and University of Munich IR course. 3-1. Dictionaries and Tolerant Retrieval Most slides were adapted from Stanford CS 276 course and University of Munich IR course. 1 Dictionary data structures for inverted indexes Sec. 3.1 The dictionary

More information

Dictionaries and Tolerant retrieval

Dictionaries and Tolerant retrieval Dictionaries and Tolerant retrieval Slides adapted from Stanford CS297:Introduction to Information Retrieval A skipped lecture The type/token distinction Terms are normalized types put in the dictionary

More information

Dictionaries and tolerant retrieval. Slides by Manning, Raghavan, Schutze

Dictionaries and tolerant retrieval. Slides by Manning, Raghavan, Schutze Dictionaries and tolerant retrieval 1 Ch. 2 Recap of the previous lecture The type/token distinction Terms are normalized types put in the dictionary Tokenization problems: Hyphens, apostrophes, compounds,

More information

Recap of the previous lecture. This lecture. A naïve dictionary. Introduction to Information Retrieval. Dictionary data structures Tolerant retrieval

Recap of the previous lecture. This lecture. A naïve dictionary. Introduction to Information Retrieval. Dictionary data structures Tolerant retrieval Ch. 2 Recap of the previous lecture Introduction to Information Retrieval Lecture 3: Dictionaries and tolerant retrieval The type/token distinction Terms are normalized types put in the dictionary Tokenization

More information

Information Retrieval

Information Retrieval Introduction to Information Retrieval Lecture 3: Dictionaries and tolerant retrieval 1 Outline Dictionaries Wildcard queries skip Edit distance skip Spelling correction skip Soundex 2 Inverted index Our

More information

Recap of last time CS276A Information Retrieval

Recap of last time CS276A Information Retrieval Recap of last time CS276A Information Retrieval Index compression Space estimation Lecture 4 This lecture Tolerant retrieval Wild-card queries Spelling correction Soundex Wild-card queries Wild-card queries:

More information

Information Retrieval

Information Retrieval Information Retrieval Suan Lee - Information Retrieval - 03 Dictionaries and Tolerant Retrieval 1 03 Dictionaries and Tolerant Retrieval - Information Retrieval - 03 Dictionaries and Tolerant Retrieval

More information

OUTLINE. Documents Terms. General + Non-English English. Skip pointers. Phrase queries

OUTLINE. Documents Terms. General + Non-English English. Skip pointers. Phrase queries WECHAT GROUP 1 OUTLINE Documents Terms General + Non-English English Skip pointers Phrase queries 2 Phrase queries We want to answer a query such as [stanford university] as a phrase. Thus The inventor

More information

Information Retrieval

Information Retrieval Introduction to Information Retrieval CS3245 Information Retrieval Lecture 4: Dictionaries and Tolerant Retrieval4 Last Time: Terms and Postings Details Ch. 2 Skip pointers Encoding a tree-like structure

More information

Text Technologies for Data Science INFR Indexing (2) Instructor: Walid Magdy

Text Technologies for Data Science INFR Indexing (2) Instructor: Walid Magdy Text Technologies for Data Science INFR11145 Indexing (2) Instructor: Walid Magdy 10-Oct-2017 Lecture Objectives Learn more about indexing: Structured documents Extent index Index compression Data structure

More information

Text Technologies for Data Science INFR Indexing (2) Instructor: Walid Magdy

Text Technologies for Data Science INFR Indexing (2) Instructor: Walid Magdy Text Technologies for Data Science INFR11145 Indexing (2) Instructor: Walid Magdy 03-Oct-2018 Lecture Objectives Learn more about indexing: Structured documents Extent index Index compression Data structure

More information

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze s, linked from http://informationretrieval.org/ IR 3: Dictionaries and tolerant retrieval Paul Ginsparg Cornell University,

More information

Natural Language Processing and Information Retrieval

Natural Language Processing and Information Retrieval Natural Language Processing and Information Retrieval Indexing and Vector Space Models Alessandro Moschitti Department of Computer Science and Information Engineering University of Trento Email: moschitti@disi.unitn.it

More information

Introduction to Information Retrieval (Manning, Raghavan, Schutze)

Introduction to Information Retrieval (Manning, Raghavan, Schutze) Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 3 Dictionaries and Tolerant retrieval Chapter 4 Index construction Chapter 5 Index compression Content Dictionary data structures

More information

Data-analysis and Retrieval Boolean retrieval, posting lists and dictionaries

Data-analysis and Retrieval Boolean retrieval, posting lists and dictionaries Data-analysis and Retrieval Boolean retrieval, posting lists and dictionaries Hans Philippi (based on the slides from the Stanford course on IR) April 25, 2018 Boolean retrieval, posting lists & dictionaries

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval http://informationretrieval.org IIR 4: Index Construction Hinrich Schütze Center for Information and Language Processing, University of Munich 2014-04-16 1/54 Overview

More information

Web Information Retrieval. Lecture 4 Dictionaries, Index Compression

Web Information Retrieval. Lecture 4 Dictionaries, Index Compression Web Information Retrieval Lecture 4 Dictionaries, Index Compression Recap: lecture 2,3 Stemming, tokenization etc. Faster postings merges Phrase queries Index construction This lecture Dictionary data

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval http://informationretrieval.org IIR 4: Index Construction Hinrich Schütze Center for Information and Language Processing, University of Munich 2014-04-16 Schütze:

More information

PV211: Introduction to Information Retrieval

PV211: Introduction to Information Retrieval PV211: Introduction to Information Retrieval http://www.fi.muni.cz/~sojka/pv211 IIR 4: Index construction Handout version Petr Sojka, Hinrich Schütze et al. Faculty of Informatics, Masaryk University,

More information

Boolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology

Boolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology Boolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2013 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276,

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval http://informationretrieval.org IIR 4: Index Construction Hinrich Schütze, Christina Lioma Institute for Natural Language Processing, University of Stuttgart 2010-05-04

More information

CSCI 5417 Information Retrieval Systems Jim Martin!

CSCI 5417 Information Retrieval Systems Jim Martin! CSCI 5417 Information Retrieval Systems Jim Martin! Lecture 4 9/1/2011 Today Finish up spelling correction Realistic indexing Block merge Single-pass in memory Distributed indexing Next HW details 1 Query

More information

Preliminary draft (c)2008 Cambridge UP

Preliminary draft (c)2008 Cambridge UP DRAFT! February 16, 2008 Cambridge University Press. Feedback welcome. 49 3 Dictionariesandtolerant retrieval WILDCARD QUERY In Chapters 1 and 2 we developed the ideas underlying inverted indexes for handling

More information

CSE 7/5337: Information Retrieval and Web Search Index construction (IIR 4)

CSE 7/5337: Information Retrieval and Web Search Index construction (IIR 4) CSE 7/5337: Information Retrieval and Web Search Index construction (IIR 4) Michael Hahsler Southern Methodist University These slides are largely based on the slides by Hinrich Schütze Institute for Natural

More information

Boolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology

Boolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology Boolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2015 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan lectures

More information

CS347. Lecture 2 April 9, Prabhakar Raghavan

CS347. Lecture 2 April 9, Prabhakar Raghavan CS347 Lecture 2 April 9, 2001 Prabhakar Raghavan Today s topics Inverted index storage Compressing dictionaries into memory Processing Boolean queries Optimizing term processing Skip list encoding Wild-card

More information

Today s topics CS347. Inverted index storage. Inverted index storage. Processing Boolean queries. Lecture 2 April 9, 2001 Prabhakar Raghavan

Today s topics CS347. Inverted index storage. Inverted index storage. Processing Boolean queries. Lecture 2 April 9, 2001 Prabhakar Raghavan Today s topics CS347 Lecture 2 April 9, 2001 Prabhakar Raghavan Inverted index storage Compressing dictionaries into memory Processing Boolean queries Optimizing term processing Skip list encoding Wild-card

More information

http://www.xkcd.com/628/ Results Summaries Spelling Correction David Kauchak cs160 Fall 2009 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture3-tolerantretrieval.ppt http://www.stanford.edu/class/cs276/handouts/lecture8-evaluation.ppt

More information

Part 2: Boolean Retrieval Francesco Ricci

Part 2: Boolean Retrieval Francesco Ricci Part 2: Boolean Retrieval Francesco Ricci Most of these slides comes from the course: Information Retrieval and Web Search, Christopher Manning and Prabhakar Raghavan Content p Term document matrix p Information

More information

Information Retrieval

Information Retrieval Introduction to Information Retrieval Lecture 4: Index Construction 1 Plan Last lecture: Dictionary data structures Tolerant retrieval Wildcards Spell correction Soundex a-hu hy-m n-z $m mace madden mo

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval http://informationretrieval.org IIR 5: Index Compression Hinrich Schütze Center for Information and Language Processing, University of Munich 2014-04-17 1/59 Overview

More information

Course structure & admin. CS276A Text Information Retrieval, Mining, and Exploitation. Dictionary and postings files: a fast, compact inverted index

Course structure & admin. CS276A Text Information Retrieval, Mining, and Exploitation. Dictionary and postings files: a fast, compact inverted index CS76A Text Information Retrieval, Mining, and Exploitation Lecture 1 Oct 00 Course structure & admin CS76: two quarters this year: CS76A: IR, web (link alg.), (infovis, XML, PP) Website: http://cs76a.stanford.edu/

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction Inverted index Processing Boolean queries Course overview Introduction to Information Retrieval http://informationretrieval.org IIR 1: Boolean Retrieval Hinrich Schütze Institute for Natural

More information

Boolean Retrieval. Manning, Raghavan and Schütze, Chapter 1. Daniël de Kok

Boolean Retrieval. Manning, Raghavan and Schütze, Chapter 1. Daniël de Kok Boolean Retrieval Manning, Raghavan and Schütze, Chapter 1 Daniël de Kok Boolean query model Pose a query as a boolean query: Terms Operations: AND, OR, NOT Example: Brutus AND Caesar AND NOT Calpuria

More information

Information Retrieval

Information Retrieval Information Retrieval Suan Lee - Information Retrieval - 04 Index Construction 1 04 Index Construction - Information Retrieval - 04 Index Construction 2 Plan Last lecture: Dictionary data structures Tolerant

More information

Administrative. Distributed indexing. Index Compression! What I did last summer lunch talks today. Master. Tasks

Administrative. Distributed indexing. Index Compression! What I did last summer lunch talks today. Master. Tasks Administrative Index Compression! n Assignment 1? n Homework 2 out n What I did last summer lunch talks today David Kauchak cs458 Fall 2012 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture5-indexcompression.ppt

More information

Information Retrieval

Information Retrieval Information Retrieval Suan Lee - Information Retrieval - 05 Index Compression 1 05 Index Compression - Information Retrieval - 05 Index Compression 2 Last lecture index construction Sort-based indexing

More information

Recap: lecture 2 CS276A Information Retrieval

Recap: lecture 2 CS276A Information Retrieval Recap: lecture 2 CS276A Information Retrieval Stemming, tokenization etc. Faster postings merges Phrase queries Lecture 3 This lecture Index compression Space estimation Corpus size for estimates Consider

More information

Indexing and Searching

Indexing and Searching Indexing and Searching Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University References: 1. Modern Information Retrieval, chapter 8 2. Information Retrieval:

More information

Lecture 3: Phrasal queries and wildcards

Lecture 3: Phrasal queries and wildcards Lecture 3: Phrasal queries and wildcards Trevor Cohn (tcohn@unimelb.edu.au) COMP90042, 2015, Semester 1 What we ll learn today Building on the boolean index and query mechanism to support multi-word queries

More information

Information Retrieval

Information Retrieval Information Retrieval Natural Language Processing: Lecture 12 30.11.2017 Kairit Sirts Homework 4 things that seemed to work Bidirectional LSTM instead of unidirectional Change LSTM activation to sigmoid

More information

Information Retrieval and Text Mining

Information Retrieval and Text Mining Information Retrieval and Text Mining http://informationretrieval.org IIR 1: Boolean Retrieval Hinrich Schütze & Wiltrud Kessler Institute for Natural Language Processing, University of Stuttgart 2012-10-16

More information

Indexing and Searching

Indexing and Searching Indexing and Searching Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University References: 1. Modern Information Retrieval, chapter 9 2. Information Retrieval:

More information

Lecture 2: Encoding models for text. Many thanks to Prabhakar Raghavan for sharing most content from the following slides

Lecture 2: Encoding models for text. Many thanks to Prabhakar Raghavan for sharing most content from the following slides Lecture 2: Encoding models for text Many thanks to Prabhakar Raghavan for sharing most content from the following slides Recap of the previous lecture Overview of course topics Basic inverted indexes:

More information

3-2. Index construction. Most slides were adapted from Stanford CS 276 course and University of Munich IR course.

3-2. Index construction. Most slides were adapted from Stanford CS 276 course and University of Munich IR course. 3-2. Index construction Most slides were adapted from Stanford CS 276 course and University of Munich IR course. 1 Ch. 4 Index construction How do we construct an index? What strategies can we use with

More information

Index extensions. Manning, Raghavan and Schütze, Chapter 2. Daniël de Kok

Index extensions. Manning, Raghavan and Schütze, Chapter 2. Daniël de Kok Index extensions Manning, Raghavan and Schütze, Chapter 2 Daniël de Kok Today We will discuss some extensions to the inverted index to accommodate boolean processing: Skip pointers: improve performance

More information

Multimedia Information Extraction and Retrieval Indexing and Query Answering

Multimedia Information Extraction and Retrieval Indexing and Query Answering Multimedia Information Extraction and Retrieval Indexing and Query Answering Ralf Moeller Hamburg Univ. of Technology Recall basic indexing pipeline Documents to be indexed. Friends, Romans, countrymen.

More information

Digital Libraries: Language Technologies

Digital Libraries: Language Technologies Digital Libraries: Language Technologies RAFFAELLA BERNARDI UNIVERSITÀ DEGLI STUDI DI TRENTO P.ZZA VENEZIA, ROOM: 2.05, E-MAIL: BERNARDI@DISI.UNITN.IT Contents 1 Recall: Inverted Index..........................................

More information

Indexing. UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze

Indexing. UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze Indexing UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze All slides Addison Wesley, 2008 Table of Content Inverted index with positional information

More information

信息检索与搜索引擎 Introduction to Information Retrieval GESC1007

信息检索与搜索引擎 Introduction to Information Retrieval GESC1007 信息检索与搜索引擎 Introduction to Information Retrieval GESC1007 Philippe Fournier-Viger Full professor School of Natural Sciences and Humanities philfv8@yahoo.com Spring 2019 1 Last week We have discussed in

More information

Information Retrieval

Information Retrieval Information Retrieval Test Collections Gintarė Grigonytė gintare@ling.su.se Department of Linguistics and Philology Uppsala University Slides based on previous IR course given by K.F. Heppin 2013-15 and

More information

Search Engines. Gertjan van Noord. September 17, 2018

Search Engines. Gertjan van Noord. September 17, 2018 Search Engines Gertjan van Noord September 17, 2018 About the course Information about the course is available from: http://www.let.rug.nl/vannoord/college/zoekmachines/ Last week Normalization (case,

More information

Index Compression. David Kauchak cs160 Fall 2009 adapted from:

Index Compression. David Kauchak cs160 Fall 2009 adapted from: Index Compression David Kauchak cs160 Fall 2009 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture5-indexcompression.ppt Administrative Homework 2 Assignment 1 Assignment 2 Pair programming?

More information

Inverted Indexes. Indexing and Searching, Modern Information Retrieval, Addison Wesley, 2010 p. 5

Inverted Indexes. Indexing and Searching, Modern Information Retrieval, Addison Wesley, 2010 p. 5 Inverted Indexes Indexing and Searching, Modern Information Retrieval, Addison Wesley, 2010 p. 5 Basic Concepts Inverted index: a word-oriented mechanism for indexing a text collection to speed up the

More information

Boolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology

Boolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology Boolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2016 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan lectures

More information

Introduction to Information Retrieval IIR 1: Boolean Retrieval

Introduction to Information Retrieval IIR 1: Boolean Retrieval .. Introduction to Information Retrieval IIR 1: Boolean Retrieval Mihai Surdeanu (Based on slides by Hinrich Schütze at informationretrieval.org) Fall 2014 Boolean Retrieval 1 / 77 Take-away Why you should

More information

Efficient Implementation of Postings Lists

Efficient Implementation of Postings Lists Efficient Implementation of Postings Lists Inverted Indices Query Brutus AND Calpurnia J. Pei: Information Retrieval and Web Search -- Efficient Implementation of Postings Lists 2 Skip Pointers J. Pei:

More information

More on indexing CE-324: Modern Information Retrieval Sharif University of Technology

More on indexing CE-324: Modern Information Retrieval Sharif University of Technology More on indexing CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2014 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford) Plan

More information

Information Retrieval

Information Retrieval Information Retrieval Suan Lee - Information Retrieval - 06 Scoring, Term Weighting and the Vector Space Model 1 Recap of lecture 5 Collection and vocabulary statistics: Heaps and Zipf s laws Dictionary

More information

Information Retrieval

Information Retrieval Introduction to CS3245 Lecture 5: Index Construction 5 Last Time Dictionary data structures Tolerant retrieval Wildcards Spelling correction Soundex a-hu hy-m n-z $m mace madden mo among amortize on abandon

More information

James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence!

James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence! James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence! (301) 219-4649 james.mayfield@jhuapl.edu What is Information Retrieval? Evaluation

More information

Information Retrieval

Information Retrieval Introduction to Information Retrieval CS3245 Information Retrieval Lecture 6: Index Compression 6 Last Time: index construction Sort- based indexing Blocked Sort- Based Indexing Merge sort is effective

More information

Information Retrieval

Information Retrieval Introduction to CS3245 Lecture 5: Index Construction 5 CS3245 Last Time Dictionary data structures Tolerant retrieval Wildcards Spelling correction Soundex a-hu hy-m n-z $m mace madden mo among amortize

More information

Index construction CE-324: Modern Information Retrieval Sharif University of Technology

Index construction CE-324: Modern Information Retrieval Sharif University of Technology Index construction CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2014 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford) Ch.

More information

CS377: Database Systems Text data and information. Li Xiong Department of Mathematics and Computer Science Emory University

CS377: Database Systems Text data and information. Li Xiong Department of Mathematics and Computer Science Emory University CS377: Database Systems Text data and information retrieval Li Xiong Department of Mathematics and Computer Science Emory University Outline Information Retrieval (IR) Concepts Text Preprocessing Inverted

More information

EECS 395/495 Lecture 3 Scalable Indexing, Searching, and Crawling

EECS 395/495 Lecture 3 Scalable Indexing, Searching, and Crawling EECS 395/495 Lecture 3 Scalable Indexing, Searching, and Crawling Doug Downey Based partially on slides by Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze Announcements Project progress report

More information

Introduction to. CS276: Information Retrieval and Web Search Christopher Manning and Prabhakar Raghavan. Lecture 4: Index Construction

Introduction to. CS276: Information Retrieval and Web Search Christopher Manning and Prabhakar Raghavan. Lecture 4: Index Construction Introduction to Information Retrieval CS276: Information Retrieval and Web Search Christopher Manning and Prabhakar Raghavan Lecture 4: Index Construction 1 Plan Last lecture: Dictionary data structures

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval http://informationretrieval.org IIR 1: Boolean Retrieval Hinrich Schütze Institute for Natural Language Processing, Universität Stuttgart 2008.04.22 Schütze: Boolean

More information

Introduction to Information Retrieval and Boolean model. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H.

Introduction to Information Retrieval and Boolean model. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Introduction to Information Retrieval and Boolean model Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze 1 Unstructured (text) vs. structured (database) data in late

More information

CS105 Introduction to Information Retrieval

CS105 Introduction to Information Retrieval CS105 Introduction to Information Retrieval Lecture: Yang Mu UMass Boston Slides are modified from: http://www.stanford.edu/class/cs276/ Information Retrieval Information Retrieval (IR) is finding material

More information

Information Retrieval CS-E credits

Information Retrieval CS-E credits Information Retrieval CS-E4420 5 credits Tokenization, further indexing issues Antti Ukkonen antti.ukkonen@aalto.fi Slides are based on materials by Tuukka Ruotsalo, Hinrich Schütze and Christina Lioma

More information

CSE 7/5337: Information Retrieval and Web Search Introduction and Boolean Retrieval (IIR 1)

CSE 7/5337: Information Retrieval and Web Search Introduction and Boolean Retrieval (IIR 1) CSE 7/5337: Information Retrieval and Web Search Introduction and Boolean Retrieval (IIR 1) Michael Hahsler Southern Methodist University These slides are largely based on the slides by Hinrich Schütze

More information

Introducing Information Retrieval and Web Search. borrowing from: Pandu Nayak

Introducing Information Retrieval and Web Search. borrowing from: Pandu Nayak Introducing Information Retrieval and Web Search borrowing from: Pandu Nayak Information Retrieval Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually

More information

INDEX CONSTRUCTION 1

INDEX CONSTRUCTION 1 1 INDEX CONSTRUCTION PLAN Last lecture: Dictionary data structures Tolerant retrieval Wildcards Spell correction Soundex a-hu hy-m n-z $m mace madden This time: mo among amortize Index construction on

More information

Indexing and Searching

Indexing and Searching Indexing and Searching Introduction How to retrieval information? A simple alternative is to search the whole text sequentially Another option is to build data structures over the text (called indices)

More information

A Modern spell(1) Abhinav Upadhyay EuroBSDCon 2017, Paris

A Modern spell(1) Abhinav Upadhyay EuroBSDCon 2017, Paris A Modern spell(1) Abhinav Upadhyay EuroBSDCon 2017, Paris Outline Shortcomings in the old spell(1) Feature Requirements of a modern spell(1) Implementation Details of new spell(1)

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval CS276: Information Retrieval and Web Search Pandu Nayak and Prabhakar Raghavan Hamid Rastegari Lecture 4: Index Construction Plan Last lecture: Dictionary data structures

More information

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze s, linked from http://informationretrieval.org/ IR 6: Index Compression Paul Ginsparg Cornell University, Ithaca, NY 15 Sep

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval http://informationretrieval.org IIR 1: Boolean Retrieval Hinrich Schütze Institute for Natural Language Processing, University of Stuttgart 2011-05-03 1/ 36 Take-away

More information

Logistics. CSE Case Studies. Indexing & Retrieval in Google. Review: AltaVista. BigTable. Index Stream Readers (ISRs) Advanced Search

Logistics. CSE Case Studies. Indexing & Retrieval in Google. Review: AltaVista. BigTable. Index Stream Readers (ISRs) Advanced Search CSE 454 - Case Studies Indexing & Retrieval in Google Some slides from http://www.cs.huji.ac.il/~sdbi/2000/google/index.htm Logistics For next class Read: How to implement PageRank Efficiently Projects

More information

Efficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488)

Efficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488) Efficiency Efficiency: Indexing (COSC 488) Nazli Goharian nazli@cs.georgetown.edu Difficult to analyze sequential IR algorithms: data and query dependency (query selectivity). O(q(cf max )) -- high estimate-

More information

ΕΠΛ660. Ανάκτηση µε το µοντέλο διανυσµατικού χώρου

ΕΠΛ660. Ανάκτηση µε το µοντέλο διανυσµατικού χώρου Ανάκτηση µε το µοντέλο διανυσµατικού χώρου Σηµερινό ερώτηµα Typically we want to retrieve the top K docs (in the cosine ranking for the query) not totally order all docs in the corpus can we pick off docs

More information

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser Text Analytics Index-Structures for Information Retrieval Ulf Leser Content of this Lecture Inverted files Storage structures Phrase and proximity search Building and updating the index Using a RDBMS Ulf

More information

Boolean Queries. Keywords combined with Boolean operators:

Boolean Queries. Keywords combined with Boolean operators: Query Languages 1 Boolean Queries Keywords combined with Boolean operators: OR: (e 1 OR e 2 ) AND: (e 1 AND e 2 ) BUT: (e 1 BUT e 2 ) Satisfy e 1 but not e 2 Negation only allowed using BUT to allow efficient

More information

Information Retrieval

Information Retrieval Information Retrieval Suan Lee - Information Retrieval - 01 Boolean Retrieval 1 01 Boolean Retrieval - Information Retrieval - 01 Boolean Retrieval 2 Introducing Information Retrieval and Web Search -

More information

Information Retrieval. Chap 7. Text Operations

Information Retrieval. Chap 7. Text Operations Information Retrieval Chap 7. Text Operations The Retrieval Process user need User Interface 4, 10 Text Text logical view Text Operations logical view 6, 7 user feedback Query Operations query Indexing

More information

Query Evaluation Strategies

Query Evaluation Strategies Introduction to Search Engine Technology Term-at-a-Time and Document-at-a-Time Evaluation Ronny Lempel Yahoo! Labs (Many of the following slides are courtesy of Aya Soffer and David Carmel, IBM Haifa Research

More information

Outline of the course

Outline of the course Outline of the course Introduction to Digital Libraries (15%) Description of Information (30%) Access to Information (30%) User Services (10%) Additional topics (15%) Buliding of a (small) digital library

More information

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze s, linked from http://informationretrieval.org/ IR 7: Scores in a Complete Search System Paul Ginsparg Cornell University, Ithaca,

More information

Part 3: The term vocabulary, postings lists and tolerant retrieval Francesco Ricci

Part 3: The term vocabulary, postings lists and tolerant retrieval Francesco Ricci Part 3: The term vocabulary, postings lists and tolerant retrieval Francesco Ricci Most of these slides comes from the course: Information Retrieval and Web Search, Christopher Manning and Prabhakar Raghavan

More information

Query Evaluation Strategies

Query Evaluation Strategies Introduction to Search Engine Technology Term-at-a-Time and Document-at-a-Time Evaluation Ronny Lempel Yahoo! Research (Many of the following slides are courtesy of Aya Soffer and David Carmel, IBM Haifa

More information

Multimedia Information Extraction and Retrieval Term Frequency Inverse Document Frequency

Multimedia Information Extraction and Retrieval Term Frequency Inverse Document Frequency Multimedia Information Extraction and Retrieval Term Frequency Inverse Document Frequency Ralf Moeller Hamburg Univ. of Technology Acknowledgement Slides taken from presentation material for the following

More information

Informa(on Retrieval

Informa(on Retrieval Introduc)on to Informa)on Retrieval CS3245 Informa(on Retrieval Lecture 7: Scoring, Term Weigh9ng and the Vector Space Model 7 Last Time: Index Compression Collec9on and vocabulary sta9s9cs: Heaps and

More information

Lecture 1: Introduction and the Boolean Model

Lecture 1: Introduction and the Boolean Model Lecture 1: Introduction and the Boolean Model Information Retrieval Computer Science Tripos Part II Helen Yannakoudakis 1 Natural Language and Information Processing (NLIP) Group helen.yannakoudakis@cl.cam.ac.uk

More information

Query Processing and Alternative Search Structures. Indexing common words

Query Processing and Alternative Search Structures. Indexing common words Query Processing and Alternative Search Structures CS 510 Winter 2007 1 Indexing common words What is the indexing overhead for a common term? I.e., does leaving out stopwords help? Consider a word such

More information

FRONT CODING. Front-coding: 8automat*a1 e2 ic3 ion. Extra length beyond automat. Encodes automat. Begins to resemble general string compression.

FRONT CODING. Front-coding: 8automat*a1 e2 ic3 ion. Extra length beyond automat. Encodes automat. Begins to resemble general string compression. Sec. 5.2 FRONT CODING Front-coding: Sorted words commonly have long common prefix store differences only (for last k-1 in a block of k) 8automata8automate9automatic10automation 8automat*a1 e2 ic3 ion Encodes

More information