Data-analysis and Retrieval Boolean retrieval, posting lists and dictionaries


Data-analysis and Retrieval Boolean retrieval, posting lists and dictionaries Hans Philippi (based on the slides from the Stanford course on IR) April 25, 2018 Boolean retrieval, posting lists & dictionaries 1 / 38

Basics
1. Collection: a fixed set of documents
2. Goal: retrieve documents that are relevant to the user's information need
3. Practice: the user's information need is expressed by one or more search terms
4. Example: you want to book a room in a Hilton hotel for a trip to Paris

Information need?

Basics: quality measures for retrieval
1. Precision: fraction of retrieved docs that are relevant to the user's information need (also called selectivity)
2. Recall: fraction of relevant docs in the collection that are retrieved (also called sensitivity)
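The two measures can be made concrete with a small Python sketch. The function name `precision_recall` and the docIDs in the example are hypothetical; the definitions follow the slide.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of docIDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant docs that were actually retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 docs retrieved, 3 of them relevant; 6 relevant docs exist.
p, r = precision_recall([1, 2, 3, 4], [2, 3, 4, 7, 8, 9])
print(p, r)  # 0.75 0.5
```

Returning 0.0 for an empty retrieved or relevant set is a convention chosen for this sketch, not something the slides prescribe.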

Examples of collections
WestLaw (http://en.wikipedia.org/wiki/westlaw)
1. Largest commercial legal search service (started 1975; ranking added 1992)
2. Tens of terabytes of data; 700,000 users
3. Majority of users still use boolean queries
4. Example query: What is the statute of limitations in cases involving the federal tort claims act?
   LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM
   (! = trailing wildcard, /3 = within 3 words, /S = in same sentence)

Collections for research purposes
RCV1, RCV2 (Reuters Corpus Volume 1, 2)
1. In 2000 Reuters released a corpus of Reuters news stories for use in research and development of natural language processing, information retrieval or machine learning
2. RCV1 covers 800,000 news articles in English (2.5 GB)
3. RCV2 covers 487,000 articles in thirteen languages
4. Also widely used: Reuters-21578, an older and smaller collection for text categorization

Boolean retrieval
1. Basic model for IR
2. Matching of keywords, using logical connectives: AND, OR, NOT and brackets
3. Still used, e.g. in library catalogs

Boolean retrieval
1. Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
2. One could grep all of Shakespeare's plays for Brutus and Caesar, then strip out plays containing Calpurnia ...
3. ... but smarter approaches may be ahead.

Boolean retrieval: term-document incidence matrix

Boolean retrieval: term-document incidence matrix
1. We have a 0/1 vector for each term
2. To answer the query: apply a bitwise AND to the vectors for Brutus, Caesar and Calpurnia (complemented)
3. 110100 AND 110111 AND 101111 = 100100
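The bitwise computation on this slide can be reproduced with Python integers. This is only a sketch: the variable names are illustrative, and the uncomplemented Calpurnia vector 010000 is inferred from the complemented vector 101111 given above.

```python
# Incidence vectors over six plays, one bit per play (leftmost bit = first play).
brutus = 0b110100
caesar = 0b110111
calpurnia = 0b010000  # inferred: the complement of 101111

N_DOCS = 6
mask = (1 << N_DOCS) - 1  # keep only the six document bits after complementing

not_calpurnia = ~calpurnia & mask
answer = brutus & caesar & not_calpurnia
print(format(answer, "06b"))  # 100100
```

The mask is needed because Python integers are unbounded, so `~` alone would yield a negative number rather than a six-bit complement.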

Indexing: term-document incidence matrix?
Can we use the term-document incidence matrix for indexing purposes?
Some typical parameters:
1. number of documents: thousands (libraries) to billions (www)
2. number of terms per document: possibly several thousands
3. number of terms in a language (English, Dutch): tens of thousands (note that the web is multilingual)
4. on average 6 bytes/word
For the web, the order of magnitude of the matrix is $10^{10} \times 10^{6} = 10^{16}$

Indexing: dictionary and postings lists
- sparse matrix approach
- documents are identified by a unique number: the docid
- terms are organized in a dictionary, supporting quick searching
- each term has a postings list: an ordered list of docs containing this term

Dictionary     Postings lists
Calpurnia  =   2 31 45 101 112 154 181 ...
Brutus     =   1 2 4 11 31 45 173 ...
Caesar     =   1 2 4 5 6 16 45 ...
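A minimal Python sketch of the dictionary-plus-postings idea follows. It assumes whitespace tokenization and lowercasing, which the slides do not specify, and the three tiny documents are made up for illustration. A Python dict plays the role of the dictionary; sorted lists play the role of the postings lists.

```python
from collections import defaultdict


def build_index(docs):
    """docs: dict mapping docid -> text. Returns dict mapping term -> sorted docids."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():  # naive tokenization (assumption)
            postings[token].add(doc_id)
    # Store each postings list ordered by docid, as the slide requires.
    return {term: sorted(ids) for term, ids in postings.items()}


# Hypothetical mini-collection.
docs = {1: "Brutus killed Caesar", 2: "Caesar and Calpurnia", 4: "Brutus and Caesar"}
index = build_index(docs)
print(index["caesar"])  # [1, 2, 4]
```

Only terms that actually occur get an entry, which is what makes this a sparse representation of the incidence matrix.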

Implementation of dictionary and postings lists
As always: optimality depends on the read/update ratio.
Internal memory, static situation:
- hash table or tree-like structure for the dictionary
- arrays for postings lists: good cache behaviour [1]
Internal memory, dynamic situation:
- hash table or tree-like structure for the dictionary
- linked lists for postings lists
External memory:
- tree-like structure or hash table for the dictionary
- linked lists (block structure) for postings lists
General observation: a hash table does not support range queries
[1] MSc thesis Matthijs Meulenbrug (Mininova)

Tree-like structures: B-tree and Trie (prefix tree)

Indexing process

Boolean query processing
Query = term1 AND term2
1. locate postings list p1 for term1
2. locate postings list p2 for term2
3. calculate the intersection of p1 and p2 by list merging

term1 = 1 3 7 11 37 44 58 112 ...
term2 = 2 4 11 25 44 54 55 58 ...

Boolean query processing: list merging
INPUT: postings lists p1 and p2
OUTPUT: a sorted list representing the intersection of p1 and p2
METHOD:
    result = empty list;
    while not (IsEmpty(p1) or IsEmpty(p2)) {
        if (docid(p1) == docid(p2)) then {
            append(result, docid(p1));
            p1 = next(p1);
            p2 = next(p2);
        }
        else if (docid(p1) < docid(p2)) then
            p1 = next(p1);
        else
            p2 = next(p2);
    }
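The merge procedure above can be written as runnable Python. In this sketch, Python lists and integer indices stand in for the postings lists and the next() pointers; that substitution is an assumption made for brevity. The example lists are the visible prefixes of the term 1 and term 2 postings shown earlier.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by docid."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])  # docid occurs in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the list with the smaller docid
        else:
            j += 1
    return result


p1 = [1, 3, 7, 11, 37, 44, 58, 112]
p2 = [2, 4, 11, 25, 44, 54, 55, 58]
print(intersect(p1, p2))  # [11, 44, 58]
```

Each step advances at least one index, so the running time is linear in the combined length of the two lists.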

INTERMEZZO: Boolean query processing
Query = term1 AND NOT term2
1. locate postings list p1 for term1
2. locate postings list p2 for term2
3. ?

p1 = 1 3 7 11 37 44 58 112 ...
p2 = 2 4 11 25 44 54 55 58 ...
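One possible answer to the open step, offered as a sketch rather than as the intended solution: walk through p1 and keep every docid that does not occur in p2, advancing through p2 in the same merge-like fashion. The function name `and_not` is hypothetical.

```python
def and_not(p1, p2):
    """Return the docids in p1 that do not occur in p2 (both sorted by docid)."""
    i = j = 0
    result = []
    while i < len(p1):
        # Move p2 forward past all docids smaller than the current p1 docid.
        while j < len(p2) and p2[j] < p1[i]:
            j += 1
        # Keep the p1 docid unless p2 contains exactly the same docid.
        if j == len(p2) or p2[j] != p1[i]:
            result.append(p1[i])
        i += 1
    return result


p1 = [1, 3, 7, 11, 37, 44, 58, 112]
p2 = [2, 4, 11, 25, 44, 54, 55, 58]
print(and_not(p1, p2))  # [1, 3, 7, 37, 112]
```

Unlike plain intersection, the loop must run to the end of p1 even after p2 is exhausted, because every remaining p1 docid then belongs to the result.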

INTERMEZZO: Boolean query optimization
Query = term1 AND term2 AND ... AND termn
- How do we process this query?
- How many possibilities do we have? (more than ...)
- Heuristic?
- Analogy with the join order problem in database query processing
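A commonly used heuristic, described in chapter 1 of Manning et al., is to process the terms in order of increasing postings-list length, so that intermediate results stay as small as possible. The sketch below is one possible rendering; the function names are hypothetical, and the example lists are the visible prefixes of the Brutus, Caesar and Calpurnia postings shown earlier.

```python
def intersect(p1, p2):
    """Linear merge of two postings lists sorted by docid."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result


def intersect_many(postings_lists):
    """AND together several postings lists, shortest list first."""
    if not postings_lists:
        return []
    ordered = sorted(postings_lists, key=len)  # the heuristic: rarest term first
    result = ordered[0]
    for postings in ordered[1:]:
        if not result:
            break  # an empty intermediate result cannot grow again
        result = intersect(result, postings)
    return result


brutus = [1, 2, 4, 11, 31, 45, 173]
caesar = [1, 2, 4, 5, 6, 16, 45]
calpurnia = [2, 31, 45, 101]
print(intersect_many([brutus, caesar, calpurnia]))  # [2, 45]
```

Starting with the shortest list bounds the size of every intermediate result by the length of that list, which is the same intuition as choosing a selective relation first in a join order.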

Boolean query processing: skip pointers
Skip pointers may speed up the merge process

Boolean query processing: skip pointers
... but what are suitable skip spans?
- many skip pointers: more comparisons, more frequent skips, higher memory cost
- fewer skip pointers: fewer comparisons, less frequent skips, longer jumps, lower memory cost
- rule of thumb: √n skip pointers, where n is the length of the postings list
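The following sketch shows how skip pointers could be used during intersection. It assumes array-based postings lists with evenly spaced skips of span about √n (the spacing suggested in Manning et al.), and it simulates a skip pointer by index arithmetic instead of storing real pointers. The function name and the example lists are illustrative.

```python
import math


def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists, using simulated skip pointers."""
    skip1 = max(1, math.isqrt(len(p1)))  # skip span for p1
    skip2 = max(1, math.isqrt(len(p2)))  # skip span for p2
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # A skip pointer exists only at positions that are multiples of the span.
            # Follow it only if its target does not overshoot the other docid.
            if i % skip1 == 0 and i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1
            else:
                i += 1
        else:
            if j % skip2 == 0 and j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return result


a = [2, 4, 8, 16, 19, 23, 28, 43]
b = [1, 2, 3, 5, 8, 41, 51, 60, 71]
print(intersect_with_skips(a, b))  # [2, 8]
```

A skip is safe because all docids between the current position and the skip target are smaller than the target, and the target itself is not larger than the docid being searched for, so no match can be jumped over.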

INTERMEZZO: Boolean query optimization
Query = term1 AND term2 AND term3
Options:
- merge p1 with p2, and merge the result with p3
  - two alternatives by permutation
- do a three-way merge of p1, p2 and p3
Question: which approach takes best advantage of skip pointers?

Phrase queries
Make a distinction between:
Q1 = fight AND club
Q2 = "fight club"
How do we support juxtaposition of terms?

Phrase queries
How do we support juxtaposition of terms?
Solution 1: biword index
Disadvantages:
- index size quadratic
- how do we support juxtaposition of three or more terms?
Solution 2: positional index
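The biword idea can be sketched in a few lines of Python: every pair of adjacent tokens becomes a dictionary entry of its own. Tokenization by whitespace, the function name and the three sample documents are assumptions made for this illustration.

```python
from collections import defaultdict


def build_biword_index(docs):
    """docs: dict docid -> text. Returns dict (word1, word2) -> sorted docids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        # zip pairs each token with its right-hand neighbour.
        for first, second in zip(tokens, tokens[1:]):
            index[(first, second)].add(doc_id)
    return {pair: sorted(ids) for pair, ids in index.items()}


# Hypothetical mini-collection.
docs = {1: "fight club rules", 2: "club members fight", 3: "the fight club"}
index = build_biword_index(docs)
print(index[("fight", "club")])  # [1, 3]
```

Document 2 contains both words but not next to each other, so it matches fight AND club without matching the phrase. The number of distinct pairs can grow roughly with the square of the vocabulary size, which is the quadratic-size disadvantage named above.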

Positional index
For each term, we also register the position(s) of the term in each document, where a document is regarded as an array of tokens. So, for each term myterm, we have the following entry in the index:
<myterm: nr of docs containing myterm;
 doc1: position1, position2, ... ;
 doc2: position1, position2, ... ;
 ... >

Positional index
Example:
<be: 993427;
 1: 7, 18, 33, 72, 86, 231;
 2: 3, 149;
 4: 17, 191, 291, 430, 434;
 5: 363, 367;
 ... >
Which of the docs could contain: "to be or not to be"?
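A two-word phrase lookup on a positional index can be sketched as follows. The postings for "be" are the visible entries from the example above; the postings for "to" are hypothetical, invented only to make the sketch runnable, and the function name is likewise an assumption.

```python
def phrase_docs(pos_index, first, second):
    """Return docids in which `second` occurs directly after `first`."""
    first_postings = pos_index.get(first, {})
    second_postings = pos_index.get(second, {})
    result = []
    # Only documents containing both terms can contain the phrase.
    for doc_id in sorted(first_postings.keys() & second_postings.keys()):
        second_positions = set(second_postings[doc_id])
        if any(pos + 1 in second_positions for pos in first_postings[doc_id]):
            result.append(doc_id)
    return result


pos_index = {
    # From the slide (visible entries only).
    "be": {1: [7, 18, 33, 72, 86, 231], 2: [3, 149],
           4: [17, 191, 291, 430, 434], 5: [363, 367]},
    # Hypothetical postings for "to".
    "to": {2: [1, 17, 74], 4: [8, 16, 190, 429, 433], 5: [14, 19, 101]},
}
print(phrase_docs(pos_index, "to", "be"))  # [4]
```

With these assumed postings, only document 4 has "to" at some position k and "be" at position k + 1 (for example 16 and 17). Longer phrases can be handled by chaining the same adjacency check.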

Wild-card queries
Query: w*rd matches word, weird and wild-card
Wild-card queries may put a heavy load on query processing

Wild-card query processing using B-tree
Case 1: prefix known
Query = pre*
- find all terms between pre and prf
- B-tree supports range queries very well
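The range lookup can be illustrated with a sorted Python list and binary search standing in for the B-tree; that substitution, the function name and the sample terms are assumptions of this sketch. The upper bound is formed by incrementing the last character of the prefix, which turns pre into prf.

```python
from bisect import bisect_left


def prefix_range(sorted_terms, prefix):
    """Return all terms t with prefix <= t < upper bound (prefix must be non-empty)."""
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)  # "pre" -> "prf"
    lo = bisect_left(sorted_terms, prefix)
    hi = bisect_left(sorted_terms, upper)
    return sorted_terms[lo:hi]


# Hypothetical dictionary terms.
terms = sorted(["post", "pray", "pre", "prefix", "prepare", "pretty", "prf", "price"])
print(prefix_range(terms, "pre"))  # ['pre', 'prefix', 'prepare', 'pretty']
```

The upper bound itself is excluded, so the term prf is not returned for the query pre*.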

Wild-card query processing using B-tree
Case 2: suffix known
Query = *post
- maintain a second B-tree with inverted (reversed) terms
- find all terms between tsop and tsoq
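The suffix case reduces to the prefix case once every term is stored backwards. In the sketch below a sorted list of reversed terms stands in for the second B-tree; the function name and the sample terms are hypothetical.

```python
from bisect import bisect_left


def suffix_matches(terms, suffix):
    """Return all terms ending in `suffix` (non-empty), via reversed terms."""
    reversed_terms = sorted(t[::-1] for t in terms)  # stand-in for the second B-tree
    key = suffix[::-1]                               # "post" -> "tsop"
    upper = key[:-1] + chr(ord(key[-1]) + 1)         # "tsop" -> "tsoq"
    lo = bisect_left(reversed_terms, key)
    hi = bisect_left(reversed_terms, upper)
    # Reverse the hits again to recover the original terms.
    return sorted(t[::-1] for t in reversed_terms[lo:hi])


print(suffix_matches(["post", "compost", "signpost", "poster", "most"], "post"))
# ['compost', 'post', 'signpost']
```

In a real index the reversed-term structure would be built once at indexing time rather than on every query, as is done here for brevity.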

Wild-card query processing
Case 3: general form
Query = pre*post
Option 1: intersection of results from pre* and *post
Option 2: permuterm index

Wild-card query processing: permuterm index
For a term hello, add $ to the end of the term, and create entries for each rotation of the term. All these entries are connected to the postings list of the term hello.
hello$
ello$h
llo$he
lo$hel
o$hell
For a query = he*o, we add $ and rotate the term until the * is at the end of the query string: query = o$he*. Finally, notice that o$he* has a prefix match with o$hell.
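A small Python sketch of the permuterm idea follows. It generates every rotation of term + $ (six for hello, including $hello), and it replaces the B-tree prefix search with a linear scan over the rotations to stay short. All function names and the sample vocabulary are hypothetical, and the query is assumed to contain exactly one *.

```python
def rotations(term):
    """All rotations of term + '$'."""
    s = term + "$"
    return [s[i:] + s[:i] for i in range(len(s))]


def build_permuterm(terms):
    """Map every rotation to the term it came from."""
    return {rot: term for term in terms for rot in rotations(term)}


def wildcard_lookup(permuterm, query):
    """Answer a query with exactly one '*', e.g. 'he*o'."""
    head, tail = query.split("*")
    prefix = tail + "$" + head  # rotate so that '*' ends up last: he*o -> o$he*
    # Linear scan for clarity; a B-tree would do a prefix range search instead.
    return sorted({term for rot, term in permuterm.items() if rot.startswith(prefix)})


permuterm = build_permuterm(["hello", "help", "hero", "halo"])
print(wildcard_lookup(permuterm, "he*o"))  # ['hello', 'hero']
```

The rotation o$hell matches the prefix o$he, as described above; hero matches through its rotation o$her, while help and halo have no rotation starting with o$he.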

Wild-card query processing: k-grams
Note that k-grams can also be used to deal with the wild-card problem. We will deal extensively with k-grams within the context of biological sequence alignment.

References
Manning:
- chapter 1
- chapter 2.3, 2.4; the chapters on language issues are recommended as background reading
- chapter 3-3.2 ("-" means: up to and including)