Lecture 3: Phrasal queries and wildcards

Similar documents
More on indexing CE-324: Modern Information Retrieval Sharif University of Technology

Data-analysis and Retrieval Boolean retrieval, posting lists and dictionaries

Introduction to Information Retrieval and Boolean model. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H.

Lecture 5: Information Retrieval using the Vector Space Model

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser

Querying Introduction to Information Retrieval INF 141 Donald J. Patterson. Content adapted from Hinrich Schütze

Search: the beginning. Nisheeth

Information Retrieval

Overview. Lecture 3: Index Representation and Tolerant Retrieval. Type/token distinction. IR System components

Part 2: Boolean Retrieval Francesco Ricci

Information Retrieval

Introducing Information Retrieval and Web Search. borrowing from: Pandu Nayak

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser

ΕΠΛ660. Ανάκτηση µε το µοντέλο διανυσµατικού χώρου

Information Retrieval

Index extensions. Manning, Raghavan and Schütze, Chapter 2. Daniël de Kok

Information Retrieval

Efficient Implementation of Postings Lists

Boolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology

Introduction to Information Retrieval (Manning, Raghavan, Schutze)

Information Retrieval

Boolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology

CS6200 Information Retrieval. David Smith College of Computer and Information Science Northeastern University

Information Retrieval

3-1. Dictionaries and Tolerant Retrieval. Most slides were adapted from Stanford CS 276 course and University of Munich IR course.

CS60092: Informa0on Retrieval. Sourangshu Bha<acharya

Recap of the previous lecture. This lecture. A naïve dictionary. Introduction to Information Retrieval. Dictionary data structures Tolerant retrieval

Information Retrieval and Organisation

Information Retrieval

Boolean Retrieval. Manning, Raghavan and Schütze, Chapter 1. Daniël de Kok

Information Retrieval

Efficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488)

Introduction to. CS276: Information Retrieval and Web Search Christopher Manning and Prabhakar Raghavan. Lecture 4: Index Construction

Search Engines. Informa1on Retrieval in Prac1ce. Annotations by Michael L. Nelson

Indexing. UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze

Index Construction. Dictionary, postings, scalable indexing, dynamic indexing. Web Search

Information Retrieval

Information Retrieval

IN4325 Indexing and query processing. Claudia Hauff (WIS, TU Delft)

Information Retrieval

Dictionaries and tolerant retrieval CE-324 : Modern Information Retrieval Sharif University of Technology

Dictionaries and tolerant retrieval. Slides by Manning, Raghavan, Schutze

3-2. Index construction. Most slides were adapted from Stanford CS 276 course and University of Munich IR course.

Query Processing and Alternative Search Structures. Indexing common words

Indexing and Searching

Introduction to Information Retrieval

60-538: Information Retrieval

Index Construction. Slides by Manning, Raghavan, Schutze

Behrang Mohit : txt proc! Review. Bag of word view. Document Named

Index construction CE-324: Modern Information Retrieval Sharif University of Technology

Web Information Retrieval. Lecture 4 Dictionaries, Index Compression

Introduction to Information Retrieval

Indexing and Searching

Indexing. CS6200: Information Retrieval. Index Construction. Slides by: Jesse Anderton

CS 572: Information Retrieval. Lecture 2: Hello World! (of Text Search)

Indexing and Query Processing

Information Retrieval CS-E credits

Information Retrieval. CS630 Representing and Accessing Digital Information. What is a Retrieval Model? Basic IR Processes

Recap of last time CS276A Information Retrieval

EECS 395/495 Lecture 3 Scalable Indexing, Searching, and Crawling

Course structure & admin. CS276A Text Information Retrieval, Mining, and Exploitation. Dictionary and postings files: a fast, compact inverted index

Boolean Queries. Keywords combined with Boolean operators:

OUTLINE. Documents Terms. General + Non-English English. Skip pointers. Phrase queries

Introduction to Information Retrieval

Information Retrieval

Information Retrieval

Information Retrieval

Web Information Retrieval. Lecture 2 Tokenization, Normalization, Speedup, Phrase Queries

Classic IR Models 5/6/2012 1

Indexing. Week 14, Spring Edited by M. Naci Akkøk, , Contains slides from 8-9. April 2002 by Hector Garcia-Molina, Vera Goebel

Dictionaries and Tolerant retrieval

Information Retrieval

Information Retrieval. Lecture 5 - The vector space model. Introduction. Overview. Term weighting. Wintersemester 2007

Introduction to Information Retrieval

CSCI 5417 Information Retrieval Systems Jim Martin!

Information Retrieval. (M&S Ch 15)

CS419: Computer Networks. Lecture 6: March 7, 2005 Fast Address Lookup:

Introduction to Information Retrieval

Query Evaluation Strategies

Query Evaluation Strategies

CS6200 Informa.on Retrieval. David Smith College of Computer and Informa.on Science Northeastern University

Informa(on Retrieval

Tolerant Retrieval. Searching the Dictionary Tolerant Retrieval. Information Retrieval & Extraction Misbhauddin 1

Information Retrieval 6. Index compression

Information Retrieval CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Introduction to Information Retrieval

Dictionaries and tolerant retrieval CE-324 : Modern Information Retrieval Sharif University of Technology

String Matching. Pedro Ribeiro 2016/2017 DCC/FCUP. Pedro Ribeiro (DCC/FCUP) String Matching 2016/ / 42

Index Compression. David Kauchak cs160 Fall 2009 adapted from:

CSE 7/5337: Information Retrieval and Web Search Introduction and Boolean Retrieval (IIR 1)

modern database systems lecture 4 : information retrieval

CS122 Lecture 15 Winter Term,

Horn Formulae. CS124 Course Notes 8 Spring 2018

Information Retrieval and Organisation

CS54701: Information Retrieval

Recap: lecture 2 CS276A Information Retrieval

Information Retrieval

GUJARAT TECHNOLOGICAL UNIVERSITY

CS122 Lecture 11 Winter Term,

Information Retrieval

Transcription:

Lecture 3: Phrasal queries and wildcards Trevor Cohn (tcohn@unimelb.edu.au) COMP90042, 2015, Semester 1

What we ll learn today Building on the boolean index and query mechanism to support multi-word queries Support for adjacent and near to operators Concept of a positional index Handling wildcards in queries

Phrase queries Some search needs are tricky to express as a simple boolean query E.g., the Geelong Cats football team Searching for Geelong AND Cats is likely to match some documents about pets, not football Inverted index storing (term, documents) from last lecture won t suffice Need special support for phrase queries

Biword indices Simplest solution: treat every pair of words in the document as a term I have a dream becomes I-have have-a a-dream biwords now form our dictionary items can easily query for two word phrases

Biword indices How to query for longer phrases? express as conjunction over biwords North Melbourne Kangaroos North Melbourne AND Melbourne Kangaroos can process this query in the usual way finally need to filter results to remove spurious matches Generally an inefficient and inflexible solution index gets large, and doesn t scale beyond bi-words problem of filtering out false positives typically biword indices are not used on their own

Positional indices Don t just index the document for each term, but also the position in the document. 1: Imagine no possessions 2: I wonder if you can 3: No need for greed or hunger 4: A brotherhood of man 5: Imagine all the people 6: Sharing all the world imagine df=2 1 [1] 5 [1] all df=2 5 [2] 6 [2] no df=2 1 [2] 3 [1] Each entry includes term and document frequency and linked list with nodes containing document id and a list of offsets into the document

Querying a positional index Take the phrase query imagine all. Matching documents must satisfy the condition there exists a location l, such that w l = imagine w l+1 = all i.e., imagine preceeds all by two words, somewhere in the document Using the index, we can find candidate documents as before followed by checking for correctly offset location pairs imagine df=45; <1:3>, <5:18>, <22:3,18,53>, <34:1,5>,... all df=61; <3:1>, <22:1,13,54>, <34:10>,...

Merging positional posting lists Last step requires checking for correctly offset location pairs for each candidate document containing imagine AND all traverse the sorted list of locations, l 1, l 2 maintain two pointers p 1, p 2 into the two lists increment p 1 when l 1 [p 1 ] < l 2 [p 2 ] k, where k is the offset (k = 1 here) increment p 2 when l 1 [p 1 ] > l 2 [p 2 ] k otherwise if both equal, record a match and increment p 1, p 2 imagine all example 22: 3, 18, 53 22: 1, 13, 54

Merging positional posting lists Last step requires checking for correctly offset location pairs for each candidate document containing imagine AND all traverse the sorted list of locations, l 1, l 2 maintain two pointers p 1, p 2 into the two lists increment p 1 when l 1 [p 1 ] < l 2 [p 2 ] k, where k is the offset (k = 1 here) increment p 2 when l 1 [p 1 ] > l 2 [p 2 ] k otherwise if both equal, record a match and increment p 1, p 2 imagine all example 22: 3, 18, 53 22: 1, 13, 54

Merging positional posting lists Last step requires checking for correctly offset location pairs for each candidate document containing imagine AND all traverse the sorted list of locations, l 1, l 2 maintain two pointers p 1, p 2 into the two lists increment p 1 when l 1 [p 1 ] < l 2 [p 2 ] k, where k is the offset (k = 1 here) increment p 2 when l 1 [p 1 ] > l 2 [p 2 ] k otherwise if both equal, record a match and increment p 1, p 2 imagine all example 22: 3, 18, 53 22: 1, 13, 54

Merging positional posting lists Last step requires checking for correctly offset location pairs for each candidate document containing imagine AND all traverse the sorted list of locations, l 1, l 2 maintain two pointers p 1, p 2 into the two lists increment p 1 when l 1 [p 1 ] < l 2 [p 2 ] k, where k is the offset (k = 1 here) increment p 2 when l 1 [p 1 ] > l 2 [p 2 ] k otherwise if both equal, record a match and increment p 1, p 2 imagine all example 22: 3, 18, 53 22: 1, 13, 54

Merging positional posting lists Last step requires checking for correctly offset location pairs for each candidate document containing imagine AND all traverse the sorted list of locations, l 1, l 2 maintain two pointers p 1, p 2 into the two lists increment p 1 when l 1 [p 1 ] < l 2 [p 2 ] k, where k is the offset (k = 1 here) increment p 2 when l 1 [p 1 ] > l 2 [p 2 ] k otherwise if both equal, record a match and increment p 1, p 2 imagine all example 22: 3, 18, 53 22: 1, 13, 54 53 + 1 = 54 record a match for doc 22

Longer phrases How to cope with phrases of length 3 or greater? will have posting lists for each term in the query and can easily compute offsets between any pair of terms as before, work from smallest df to largest (why?) when combining posting lists, consider appropriate offset based on order in query Can also easily support NEAR-TO queries, i.e., terms in the query might need to appear k tokens apart in a document.

Complexity Positional indices store a record of every token rather than every type occuring in a document long documents may see many tokens repeated time complexity O( W ), word tokens, not O(D), documents consider looking up The who Mixed biword and positional indicies Index the pathological cases using explicit biword index, and easier cases in positional index. Get the benefits of both techniques, without the nasty time and space complexity problems.

Proximal queries Legal world full of carefully designed complex queries, e.g., What is the statute of limitations in cases involving the federal tort claims act? LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM Westlaw example, IIR Chapter 1. support for range operators straightforward with positional indices but not possible with biword indexes storage cost still high, but can be compressed well (gaps are small between repeats of a word in a document)

Wildcard queries What if we don t know how to spell a crucial search term? Some words have many accepted spellings (particularly foreign names) Gaddafi, Kadafi, Qaddafi, Gathafi, Kadafi or Gadafy Or just are a nightmare to spell, e.g., Eyjafjallajökull, the volcano which erupted in Iceland distrubing air travel Might want to query on eyja*, e*kull or E*yall*kull To support these kinds of queries, we need to be able to search over sub-word units

Wildcard queries Generally useful tool to allow incomplete query terms leading wildcard: *addafi trailing wildcard: Gadd* internal wildcard: Gad*fi several wildcards: *add*fi Must map these patterns to term/s in our vocabulary, then we can query the index using the vocabulary terms.

Tree based approach Store tree, e.g., B+ tree, over the vocabulary using a sort order over the vocabulary entries can find all vocabulary items matching trailing wildcards all child nodes will share the common prefix e.g., Gadd* Gadd what about leading wildcards? easy, just use a tree in reverse order, to allow suffix matching e.g., *addafi ifadda General queries Not quite so easy match prefix and suffix and intersect to obtain candidates filter these to find the entries which match the pattern

Permuterm index More elegant method is to store an index over the vocabulary items append a sentinel symbol, $, to each term rotate the term by a character at a time store in a sorted tree the mappings between the rotated versions and the term Example: rotate rotate$ otate$r tate$ro ate$rot te$rota e$rotat $rotate rotate Consider searching for ro*te: add the sentinel, ro*te$ rotate the wildcard to the end, te$ro* now lookup in permutum index, te$ro

Permuterm index lookup Matches for te$ro prefix in permuterm index may uncover te$rota rotate te$ro rote te$rou route te$roulet roulette Permuterm index supports single leading, trailing or internal wildcard but to use several wildcards still requires final filtering step once we have found the terms, using inverted index to find matching documents and combine the results

Complexity Permuterm is a very convenient datastructure but has significant space requirement especially for vocabularies with long terms expands dictionary storage by around 10 simple forward and reverse trees are more compact

K-gram index Break terms into character sequences index each sequence of k characters rote $ro, rot, ote, te$ (where k = 3) store a mapping between each k-gram and the term Querying follows similar pattern lookup partial fragments of the query term ro*te $ro, te$ merge the list of results both each fragment filter the candidates to remove false matches

Looking back and forward Back Phrases as query terms Indexing and querying with biword and positional indicies Dealing with wildcard query terms: using trees, a permuterm index or a k-gram index

Looking back and forward Forward In the next lecture, we will look at the vector space model of information retrieval Introduce real valued weighting of terms, and ranked result sets

Further reading Section 2.4, Positional postings and phrase queries of Manning, Raghavan, and Schutze, Introduction to Information Retrieval Chapter 3 ( 3.1, 3.2), Dictionaries and tolerant retrieval of Manning, Raghavan, and Schutze, Introduction to Information Retrieval