INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/ IR 3: Dictionaries and tolerant retrieval Paul Ginsparg Cornell University, Ithaca, NY 3 Sep 2009 1/ 54

Administrativa Course Webpage: http://www.infosci.cornell.edu/courses/info4300/2009fa/ Lectures: Tuesday and Thursday 11:40-12:55, Hollister B14 Instructor: Paul Ginsparg, ginsparg@cornell.edu, 255-7371, Cornell Information Science, 301 College Avenue Instructor's Assistant: Corinne Russell, crussell@cs.cornell.edu, 255-5925, Cornell Information Science, 301 College Avenue Instructor's Office Hours: Wed 1-2pm (9 and 16 Sep), then Wed 1-3, or e-mail the instructor to schedule an appointment Teaching Assistant: Nam Nguyen, nhnguyen@cs.cornell.edu. The Teaching Assistant does not have scheduled office hours but is available to help you by email. 2/ 54

Overview 1 Recap 2 Skip pointers 3 Phrase queries 4 Dictionaries 5 Wildcard queries 6 Discussion 3/ 54

Outline 1 Recap 2 Skip pointers 3 Phrase queries 4 Dictionaries 5 Wildcard queries 6 Discussion 4/ 54

Type/token distinction Token An instance of a word or term occurring in a document. Type An equivalence class of tokens. In June, the dog likes to chase the cat in the barn. 12 word tokens, 9 word types 5/ 54
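
A quick illustration of the distinction, as a minimal Python sketch (not from the slides), assuming case folding and splitting on non-letters as the tokenizer:

import re

sentence = "In June, the dog likes to chase the cat in the barn."

# Tokenize: case-fold, then keep maximal runs of letters.
tokens = re.findall(r"[a-z]+", sentence.lower())
types = set(tokens)

print(len(tokens), "word tokens:", tokens)        # 12 word tokens
print(len(types), "word types:", sorted(types))   # 9 word types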

Problems in tokenization What are the delimiters? Space? Apostrophe? Hyphen? For each of these: sometimes they delimit, sometimes they don't. No whitespace in many languages! (e.g., Chinese) No whitespace in Dutch, German, Swedish compounds (Lebensversicherungsgesellschaftsangestellter) 6/ 54

Problems in equivalence classing A term is an equivalence class of tokens. How do we define equivalence classes? Numbers (3/20/91 vs. 20/3/91) Case folding Stemming, Porter stemmer Morphological analysis: inflectional vs. derivational Equivalence classing problems in other languages More complex morphology than in English Finnish: a single verb may have 12,000 different forms Accents, umlauts 7/ 54

Outline 1 Recap 2 Skip pointers 3 Phrase queries 4 Dictionaries 5 Wildcard queries 6 Discussion 8/ 54

Recall basic intersection algorithm Brutus 1 2 4 11 31 45 173 174 Calpurnia 2 31 54 101 Intersection = 2 31 Linear in the length of the postings lists. Can we do better? 9/ 54
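
For reference, a minimal Python sketch of the linear merge just recalled (variable names are illustrative, not from the slides):

def intersect(p1, p2):
    """Intersect two sorted docID lists in time linear in their combined length."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i, j = i + 1, j + 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

print(intersect([1, 2, 4, 11, 31, 45, 173, 174], [2, 31, 54, 101]))  # [2, 31]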

Skip pointers Skip pointers allow us to skip postings that will not figure in the search results. This makes intersecting postings lists more efficient. Some postings lists contain several million entries, so efficiency can be an issue even if basic intersection is linear. Where do we put skip pointers? How do we make sure intersection results are correct? 10/ 54

Basic idea (skip-list figure): postings list for Brutus: 2 → 4 → 8 → 16 → 32 → 64 → 128, with skip pointers to 16 and 128; postings list for Caesar: 1 → 2 → 3 → 5 → 8 → 17 → 21 → 31 → 75 → 81 → 84 → 89 → 92, with skip pointers to 8 and 31. 11/ 54

Skip lists: Larger example 12/ 54

Intersecting with skip pointers

IntersectWithSkips(p1, p2)
  answer ← ⟨⟩
  while p1 ≠ nil and p2 ≠ nil
    do if docID(p1) = docID(p2)
         then Add(answer, docID(p1))
              p1 ← next(p1)
              p2 ← next(p2)
         else if docID(p1) < docID(p2)
           then if hasSkip(p1) and (docID(skip(p1)) ≤ docID(p2))
                  then while hasSkip(p1) and (docID(skip(p1)) ≤ docID(p2))
                         do p1 ← skip(p1)
                  else p1 ← next(p1)
           else if hasSkip(p2) and (docID(skip(p2)) ≤ docID(p1))
                  then while hasSkip(p2) and (docID(skip(p2)) ≤ docID(p1))
                         do p2 ← skip(p2)
                  else p2 ← next(p2)
  return answer

13/ 54
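
A runnable Python rendering of the same algorithm, offered as a sketch rather than the authors' code: it assumes each postings list is a plain sorted list and that skip pointers jump a fixed stride of roughly sqrt(len(p)) entries (the evenly spaced placement discussed on the next slide):

import math

def skip_interval(p):
    # Assumed placement: evenly spaced skips roughly sqrt(len(p)) apart.
    return max(1, int(math.sqrt(len(p))))

def intersect_with_skips(p1, p2):
    """Intersect two sorted docID lists, following skip pointers where safe."""
    answer, i, j = [], 0, 0
    s1, s2 = skip_interval(p1), skip_interval(p2)
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i, j = i + 1, j + 1
        elif p1[i] < p2[j]:
            if i + s1 < len(p1) and p1[i + s1] <= p2[j]:
                while i + s1 < len(p1) and p1[i + s1] <= p2[j]:
                    i += s1                    # take the skip pointer
            else:
                i += 1                         # ordinary advance
        else:
            if j + s2 < len(p2) and p2[j + s2] <= p1[i]:
                while j + s2 < len(p2) and p2[j + s2] <= p1[i]:
                    j += s2
            else:
                j += 1
    return answer

print(intersect_with_skips([2, 4, 8, 16, 32, 64, 128],
                           [1, 2, 3, 5, 8, 17, 21, 31]))  # [2, 8]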

Where do we place skips? Tradeoff: number of items skipped vs. frequency with which the skip can be taken. More skips: each skip pointer skips only a few items, but we can use it frequently. Fewer skips: each skip pointer skips many items, but we cannot use it very often. 14/ 54

Where do we place skips? (cont) Simple heuristic: for a postings list of length P, use √P evenly spaced skip pointers. This ignores the distribution of query terms. Easy if the index is static; harder in a dynamic environment because of updates. How much do skip pointers help? They used to help a lot. With today's fast CPUs, they don't help that much anymore. 15/ 54

Outline 1 Recap 2 Skip pointers 3 Phrase queries 4 Dictionaries 5 Wildcard queries 6 Discussion 16/ 54

Phrase queries We want to answer a query such as "stanford university" as a phrase. Thus "The inventor Stanford Ovshinsky never went to university" shouldn't be a match. The concept of a phrase query has proven easily understood by users. About 10% of web queries are phrase queries. Consequence for the inverted index: it no longer suffices to store docIDs in postings lists. Any ideas? 17/ 54

Biword indexes Index every consecutive pair of terms in the text as a phrase. For example, "Friends, Romans, Countrymen" would generate two biwords: "friends romans" and "romans countrymen". Each of these biwords is now a vocabulary term. Two-word phrases can now easily be answered. 18/ 54
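
A small Python sketch of a biword index (the toy documents are made up, not from the slides):

from collections import defaultdict

def biwords(tokens):
    """Every consecutive pair of tokens, joined into a single vocabulary term."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

docs = {1: "friends romans countrymen".split(),
        2: "romans countrymen lend me your ears".split()}

index = defaultdict(set)                 # biword term -> set of docIDs
for doc_id, tokens in docs.items():
    for bw in biwords(tokens):
        index[bw].add(doc_id)

print(sorted(index["friends romans"]))      # [1]
print(sorted(index["romans countrymen"]))   # [1, 2]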

Longer phrase queries A long phrase like "stanford university palo alto" can be represented as the Boolean query "stanford university" AND "university palo" AND "palo alto". We need to do post-filtering of hits to identify the subset that actually contains the 4-word phrase. 19/ 54

Extended biwords Parse each document and perform part-of-speech tagging. Bucket the terms into (say) nouns (N) and articles/prepositions (X). Now deem any string of terms of the form N X* N to be an extended biword. Examples: "catcher in the rye" (N X X N), "king of Denmark" (N X N). Include extended biwords in the term vocabulary. Queries are processed accordingly. 20/ 54
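
A sketch of the N X* N extraction, assuming the input has already been part-of-speech tagged into the two buckets (the tagging itself and the input pairs are assumptions for the example):

def extended_biwords(tagged):
    """tagged: list of (term, tag), tag 'N' for nouns, 'X' for articles/prepositions.
    Returns each N X* N span as one extended-biword term."""
    result, i = [], 0
    while i < len(tagged):
        if tagged[i][1] == "N":
            j = i + 1
            while j < len(tagged) and tagged[j][1] == "X":
                j += 1                                   # absorb the X* in the middle
            if j < len(tagged) and tagged[j][1] == "N":
                result.append(" ".join(t for t, _ in tagged[i:j + 1]))
            i = j if j > i + 1 else i + 1
        else:
            i += 1
    return result

print(extended_biwords([("catcher", "N"), ("in", "X"), ("the", "X"), ("rye", "N")]))
# ['catcher in the rye']
print(extended_biwords([("king", "N"), ("of", "X"), ("denmark", "N")]))
# ['king of denmark']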

Issues with biword indexes Why are biword indexes rarely used? False positives, as noted above Index blowup due to very large term vocabulary 21/ 54

Positional indexes Positional indexes are a more efficient alternative to biword indexes. Postings lists in a nonpositional index: each posting is just a docID. Postings lists in a positional index: each posting is a docID and a list of positions. Example query: to(1) be(2) or(3) not(4) to(5) be(6) (the numbers give the word positions in the query). Postings:
to, 993427: ⟨1: ⟨7, 18, 33, 72, 86, 231⟩; 2: ⟨1, 17, 74, 222, 255⟩; 4: ⟨8, 16, 190, 429, 433⟩; 5: ⟨363, 367⟩; 7: ⟨13, 23, 191⟩; ...⟩
be, 178239: ⟨1: ⟨17, 25⟩; 4: ⟨17, 191, 291, 430, 434⟩; 5: ⟨14, 19, 101⟩; ...⟩
Document 4 is a match! (For instance, to at position 429 is immediately followed by be at position 430.) 22/ 54
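
A minimal sketch of such an index in Python, as a nested mapping term -> docID -> positions (the data structure is an illustrative choice, not prescribed by the slides):

from collections import defaultdict

def build_positional_index(docs):
    """docs: {docID: list of tokens}. Returns term -> {docID: [positions]}, 1-based."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in docs.items():
        for pos, term in enumerate(tokens, start=1):
            index[term][doc_id].append(pos)
    return index

index = build_positional_index({4: "to be or not to be".split()})
print(dict(index["to"]))   # {4: [1, 5]}
print(dict(index["be"]))   # {4: [2, 6]}

# Phrase check for "to be": some position p of "to" is followed by p+1 in "be".
hits = [d for d in index["to"] if d in index["be"]
        and any(p + 1 in index["be"][d] for p in index["to"][d])]
print(hits)                # [4]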

Proximity search We just saw how to use a positional index for phrase searches. We can also use it for proximity search. For example: employment /4 place Find all documents that contain employment and place within 4 words of each other. "Employment agencies that place healthcare workers are seeing growth" is a hit. "Employment agencies that have learned to adapt now place healthcare workers" is not a hit. 23/ 54

Proximity search Use the positional index Simplest algorithm: look at cross-product of positions of (i) employment in document and (ii) place in document Very inefficient for frequent words, especially stop words Note that we want to return the actual matching positions, not just a list of documents. This is important for dynamic summaries etc. 24/ 54

Proximity intersection

PositionalIntersect(p1, p2, k)
  answer ← ⟨⟩
  while p1 ≠ nil and p2 ≠ nil
    do if docID(p1) = docID(p2)
         then l ← ⟨⟩
              pp1 ← positions(p1)
              pp2 ← positions(p2)
              while pp1 ≠ nil
                do while pp2 ≠ nil
                     do if |pos(pp1) − pos(pp2)| ≤ k
                          then Add(l, pos(pp2))
                          else if pos(pp2) > pos(pp1)
                                 then break
                        pp2 ← next(pp2)
                   while l ≠ ⟨⟩ and |l[0] − pos(pp1)| > k
                     do Delete(l[0])
                   for each ps ∈ l
                     do Add(answer, ⟨docID(p1), pos(pp1), ps⟩)
                   pp1 ← next(pp1)
              p1 ← next(p1)
              p2 ← next(p2)
         else if docID(p1) < docID(p2)
                then p1 ← next(p1)
                else p2 ← next(p2)
  return answer

25/ 54
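
A runnable Python counterpart, simplified for clarity: it assumes each argument is a dict docID -> sorted positions, and it scans all position pairs instead of doing the sliding-window bookkeeping in the pseudocode above, so it returns the same (docID, pos1, pos2) triples but less efficiently:

def positional_intersect(post1, post2, k):
    """post1, post2: {docID: sorted position list}. Return (docID, pos1, pos2)
    for every pair of positions at most k apart."""
    answer = []
    for doc_id in sorted(set(post1) & set(post2)):
        for pos1 in post1[doc_id]:
            for pos2 in post2[doc_id]:
                if abs(pos1 - pos2) <= k:
                    answer.append((doc_id, pos1, pos2))
    return answer

# Made-up postings for employment /4 place:
print(positional_intersect({7: [3, 40]}, {7: [6, 20]}, 4))  # [(7, 3, 6)]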

Combination scheme Biword indexes and positional indexes can be profitably combined. Many biwords are extremely frequent: Michael Jackson, Britney Spears, etc. For these biwords, the increased speed compared to positional postings intersection is substantial. Combination scheme: include frequent biwords as vocabulary terms in the index; do all other phrases by positional intersection. Williams et al. (2004) evaluate a more sophisticated mixed indexing scheme: faster than a positional index, at a cost of 26% more space for the index. 26/ 54

Positional queries on Google For web search engines, positional queries are much more expensive than regular Boolean queries. Let's look at the example of phrase queries. Why are they more expensive than regular Boolean queries? Can you demonstrate on Google that phrase queries are more expensive than Boolean queries? 27/ 54

Outline 1 Recap 2 Skip pointers 3 Phrase queries 4 Dictionaries 5 Wildcard queries 6 Discussion 28/ 54

Inverted index For each term t, we store a list of all documents that contain t. Brutus → 1 2 4 11 31 45 173 174 Caesar → 1 2 4 5 6 16 57 132 ... Calpurnia → 2 31 54 101 The terms on the left make up the dictionary; the docID lists on the right are the postings. 29/ 54

Dictionaries Term vocabulary: the data Dictionary: the data structure for storing the term vocabulary 30/ 54

Dictionary as array of fixed-width entries For each term, we need to store a couple of items: document frequency pointer to postings list... Assume for the time being that we can store this information in a fixed-length entry. Assume that we store these entries in an array. 31/ 54

Dictionary as array of fixed-width entries

  term      document frequency   pointer to postings list
  a         656,265              →
  aachen    65                   →
  ...       ...                  ...
  zulu      221                  →

  space needed: 20 bytes (term), 4 bytes (document frequency), 4 bytes (pointer)

How do we look up an element in this array at query time? 32/ 54

Data structures for looking up term Two main classes of data structures: hashes and trees Some IR systems use hashes, some use trees. Criteria for when to use hashes vs. trees: Is there a fixed number of terms or will it keep growing? What are the relative frequencies with which various keys will be accessed? How many terms are we likely to have? 33/ 54

Hashes Each vocabulary term is hashed into an integer. Try to avoid collisions. At query time, do the following: hash the query term, resolve collisions, locate the entry in the fixed-width array. Pros: lookup in a hash is faster than lookup in a tree; lookup time is constant. Cons: no way to find minor variants (resume vs. résumé); no prefix search (all terms starting with automat); need to rehash everything periodically if the vocabulary keeps growing. 34/ 54

Trees Trees solve the prefix problem (find all terms starting with automat). Simplest tree: binary tree Search is slightly slower than in hashes: O(log M), where M is the size of the vocabulary. O(log M) only holds for balanced trees. Rebalancing binary trees is expensive. B-trees mitigate the rebalancing problem. B-tree definition: every internal node has a number of children in the interval [a,b] where a,b are appropriate positive integers, e.g., [2,4]. 35/ 54

Binary tree 36/ 54

B-tree 37/ 54

Outline 1 Recap 2 Skip pointers 3 Phrase queries 4 Dictionaries 5 Wildcard queries 6 Discussion 38/ 54

Wildcard queries mon*: find all docs containing any term beginning with mon. Easy with a B-tree dictionary: retrieve all terms t in the range mon ≤ t < moo. *mon: find all docs containing any term ending with mon. Maintain an additional tree for terms written backwards; then retrieve all terms t in the range nom ≤ t < non. 39/ 54
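
A sketch of the range lookup using Python's bisect over a sorted vocabulary standing in for the B-tree leaves (the toy vocabulary is made up):

import bisect

vocab = sorted(["llama", "mona", "monday", "money", "month", "moon", "morning"])

def prefix_range(prefix):
    """All terms t with prefix <= t < successor(prefix), e.g. mon <= t < moo."""
    successor = prefix[:-1] + chr(ord(prefix[-1]) + 1)   # "mon" -> "moo"
    lo = bisect.bisect_left(vocab, prefix)
    hi = bisect.bisect_left(vocab, successor)
    return vocab[lo:hi]

print(prefix_range("mon"))   # ['mona', 'monday', 'money', 'month']

For *mon, the same lookup would run over a second sorted list of reversed terms, using the prefix nom.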

Query processing At this point, we have an enumeration of all terms in the dictionary that match the wildcard query. We still have to look up the postings for each enumerated term. E.g., consider the query: gen* AND universit* This may result in the execution of many Boolean AND queries. If there are 100 terms matching gen* and 50 terms matching universit*, then we have to run 5000 Boolean queries! 40/ 54

How to handle * in the middle of a term Example: m*nchen We could look up m* and *nchen in the B-tree and intersect the two term sets. Expensive. Alternative: permuterm index. Basic idea: rotate every wildcard query so that the * occurs at the end. 41/ 54

Permuterm index For term hello: add hello$, ello$h, llo$he, lo$hel, and o$hell to the B-tree where $ is a special symbol 42/ 54

Permuterm term mapping 43/ 54

Permuterm index For hello, we've stored: hello$, ello$h, llo$he, lo$hel, and o$hell Queries For X, look up X$ For X*, look up X*$ For *X, look up X$* For *X*, look up X* For X*Y, look up Y$X* Example: For hel*o, look up o$hel* It's really a tree and should be called a permuterm tree. But permuterm index is the more common name. 44/ 54
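
A sketch of the rotation machinery in Python (helper names are illustrative; it generates all six rotations of hello$, one more than the five listed on the slide):

def permuterm_keys(term):
    """All rotations of term + '$'."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

def rotate_query(wildcard):
    """Rotate a single-* wildcard query so the * ends up on the right; everything
    before the * is then used as a prefix lookup on the permuterm keys."""
    q = wildcard + "$"
    star = q.index("*")
    return q[star + 1:] + q[:star] + "*"

print(permuterm_keys("hello"))
# ['hello$', 'ello$h', 'llo$he', 'lo$hel', 'o$hell', '$hello']
print(rotate_query("hel*o"))     # o$hel*
print(rotate_query("m*nchen"))   # nchen$m*
print(rotate_query("*mon"))      # mon$*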

Processing a lookup in the permuterm index Rotate the query wildcard to the right. Use B-tree lookup as before. Problem: Permuterm more than quadruples the size of the dictionary compared to a regular B-tree (an empirical number). 45/ 54

k-gram indexes More space-efficient than the permuterm index. Enumerate all character k-grams (sequences of k characters) occurring in a term. 2-grams are called bigrams. Example: from "April is the cruelest month" we get the bigrams: $a ap pr ri il l$ $i is s$ $t th he e$ $c cr ru ue el le es st t$ $m mo on nt th h$ ($ is a special word boundary symbol, as before.) Maintain an inverted index from bigrams to the terms that contain the bigram. 46/ 54
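
A sketch of the bigram extraction and the bigram-to-term index (toy vocabulary, not from the slides):

from collections import defaultdict

def kgrams(term, k=2):
    """Character k-grams of $term$, e.g. month -> $m mo on nt th h$."""
    padded = f"${term}$"
    return [padded[i:i + k] for i in range(len(padded) - k + 1)]

print(kgrams("april"))   # ['$a', 'ap', 'pr', 'ri', 'il', 'l$']

vocab = ["april", "is", "the", "cruelest", "month", "moon"]
bigram_index = defaultdict(set)          # bigram -> terms containing it
for term in vocab:
    for g in kgrams(term):
        bigram_index[g].add(term)

print(sorted(bigram_index["on"]))        # ['month', 'moon']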

Postings list in a 3-gram inverted index etr → beetroot → metric → petrify → retrieval 47/ 54

Bigram indexes Note that we now have two different types of inverted indexes The term-document inverted index for finding documents based on a query consisting of terms The k-gram index for finding terms based on a query consisting of k-grams 48/ 54

Processing wildcarded terms in a bigram index The query mon* can now be run as: $m AND mo AND on This gets us all terms with the prefix mon... but also many false positives like moon. We must postfilter these terms against the query. Surviving terms are then looked up in the term-document inverted index. k-gram indexes are more space efficient than permuterm indexes. 49/ 54
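
A self-contained sketch of this pipeline (gram intersection plus postfilter); the toy vocabulary and helper names are made up for illustration:

import re
from collections import defaultdict

vocab = ["month", "monday", "moon", "salmon"]
bigram_index = defaultdict(set)                    # bigram -> terms containing it
for term in vocab:
    padded = f"${term}$"
    for i in range(len(padded) - 1):
        bigram_index[padded[i:i + 2]].add(term)

def wildcard_terms(query):
    """Look up a wildcard query in the bigram index, then postfilter the survivors."""
    padded = "$" + query + "$"                     # mon* -> $mon*$
    pieces = [p for p in padded.split("*") if len(p) > 1]
    grams = [p[i:i + 2] for p in pieces for i in range(len(p) - 1)]   # $m, mo, on
    candidates = set.intersection(*(bigram_index[g] for g in grams))
    # Postfilter: mon* becomes the regex ^mon.*$, which removes the false positive moon.
    pattern = re.compile("^" + re.escape(query).replace(r"\*", ".*") + "$")
    return sorted(t for t in candidates if pattern.match(t))

print(wildcard_terms("mon*"))   # ['monday', 'month']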

Processing wildcard queries in the term-document index As before, we must potentially execute a large number of Boolean queries. Most straightforward semantics: conjunction of disjunctions. Recall the query: gen* AND universit* (geneva AND university) OR (geneva AND université) OR (genève AND university) OR (genève AND université) OR (general AND universities) OR ... Very expensive. Does Google allow wildcard queries? Users hate to type. If abbreviated queries like [pyth* theo*] for pythagoras theorem are legal, users will use them... a lot. According to Google search basics (2009.04.27): "Note that the * operator works only on whole words, not parts of words." But that does not seem to be true: try [pythag*] and [m*nchen]. 50/ 54

Outline 1 Recap 2 Skip pointers 3 Phrase queries 4 Dictionaries 5 Wildcard queries 6 Discussion 51/ 54

Discussion 1 Objective: explore three information retrieval systems (Bing, LOC, PubMed), and use each for the discovery task: What is the medical evidence that a low calorie diet will extend your lifetime? Some general questions and observations: How to authenticate the information? Is the information up to date? (How to find updated info?) In what order are items returned? (By relevance, but how is relevance determined: link analysis? tf.idf?) Use results of Bing search to refine the vocabulary: e.g., low calorie diet → calorie restriction, lifetime → longevity. Assignment: everyone upload, as a test of CMS, the best reference found and an outline of the strategy used to find it. 52/ 54

Discussion 1: epilogue Challenge: What principled search strategy might turn up http://www.nytimes.com/2009/08/18/science/18aging.html (without knowing the answer in advance)? Two experts on aging... argued recently in Nature that the whole phenomenon of caloric restriction may be a misleading result unwittingly produced in laboratory mice. The mice are selected for quick breeding and fed on rich diets. A low-calorie diet could be much closer to the diet that mice are adapted to in the wild, and therefore it could extend life simply because it is much healthier for them. "Life extension in model organisms may be an artifact to some extent," they wrote. To the extent caloric restriction works at all, it may have a bigger impact in short-lived organisms that do not have to worry about cancer than in humans. Thus the hope of mimicking caloric restriction with drugs may be an illusion... And what is the final word? 53/ 54

Discussion 2 (Thurs 10 Sep) Read and be prepared to discuss: G. Salton, A. Wong and C. S. Yang, A vector space model for automatic indexing. Communications of the ACM Volume 18, Issue 11 (November 1975) pages: 613-620. http://doi.acm.org/10.1145/361219.361220 This paper describes many of the concepts behind the vector space model and the SMART system. (Note that to access this paper from the ACM Digital Library, you need to use a computer with a Cornell IP address [or use /readings/p613-salton.pdf ]) 54/ 54