Efficient Implementation of Postings Lists

Similar documents
More on indexing CE-324: Modern Information Retrieval Sharif University of Technology

Indexing. UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze

Search: the beginning. Nisheeth

Index extensions. Manning, Raghavan and Schütze, Chapter 2. Daniël de Kok

Boolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology

Information Retrieval

Introducing Information Retrieval and Web Search. borrowing from: Pandu Nayak

Query Answering Using Inverted Indexes

Web Information Retrieval. Lecture 2 Tokenization, Normalization, Speedup, Phrase Queries

Information Retrieval

Part 2: Boolean Retrieval Francesco Ricci

CS6200 Information Retrieval. David Smith College of Computer and Information Science Northeastern University

Information Retrieval CS-E credits

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from

Information Retrieval

Introduction to Information Retrieval

CS60092: Informa0on Retrieval. Sourangshu Bha<acharya

Information Retrieval

Introduction to Information Retrieval

Information Retrieval

Chapter IR:IV. IV. Indexing. Abstract Model of Ranking Inverted Indexes Compression Auxiliary Structures Index Construction Query Processing

Introduction to Information Retrieval and Boolean model. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H.

OUTLINE. Documents Terms. General + Non-English English. Skip pointers. Phrase queries

CS105 Introduction to Information Retrieval

Data-analysis and Retrieval Boolean retrieval, posting lists and dictionaries

Information Retrieval

Information Retrieval and Organisation

Made by: Ali Ibrahim. Supervisor: MR. Ali Jnaide. Class: 12

Behrang Mohit : txt proc! Review. Bag of word view. Document Named

Information Retrieval

Query Evaluation Strategies

Boolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology

Boolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology

Query Processing and Alternative Search Structures. Indexing common words

Text Pre-processing and Faster Query Processing

Information Retrieval

CS6200 Informa.on Retrieval. David Smith College of Computer and Informa.on Science Northeastern University

Unstructured Data Management. Advanced Topics in Database Management (INFSCI 2711)

Search Engines. Informa1on Retrieval in Prac1ce. Annotations by Michael L. Nelson

n Tuesday office hours changed: n 2-3pm n Homework 1 due Tuesday n Assignment 1 n Due next Friday n Can work with a partner

More on indexing and text operations CE-324: Modern Information Retrieval Sharif University of Technology

Query Evaluation Strategies

Models for Document & Query Representation. Ziawasch Abedjan

Efficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488)

Information Retrieval and Organisation

Recap of the previous lecture. Recall the basic indexing pipeline. Plan for this lecture. Parsing a document. Introduction to Information Retrieval

Classic IR Models 5/6/2012 1

Search Engine Architecture II

Information Retrieval

Information Retrieval Tutorial 1: Boolean Retrieval

ΕΠΛ660. Ανάκτηση µε το µοντέλο διανυσµατικού χώρου

Ges$one Avanzata dell Informazione Part A Full- Text Informa$on Management. Full- Text Indexing

Web Information Retrieval Exercises Boolean query answering. Prof. Luca Becchetti

Inverted Indexes. Indexing and Searching, Modern Information Retrieval, Addison Wesley, 2010 p. 5

Indexing. Lecture Objectives. Text Technologies for Data Science INFR Learn about and implement Boolean search Inverted index Positional index

BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha

Information Retrieval

More on indexing and text operations CE-324: Modern Information Retrieval Sharif University of Technology

Lecture 3: Phrasal queries and wildcards

60-538: Information Retrieval

Information Retrieval and Text Mining

Information Retrieval. Danushka Bollegala

1Boolean retrieval. information retrieval. term search is quite ambiguous, but in context we use the two synonymously.

Information Retrieval and Text Mining

Information Retrieval

Overview. Lecture 3: Index Representation and Tolerant Retrieval. Type/token distinction. IR System components

Advanced Retrieval Information Analysis Boolean Retrieval

Recap: lecture 2 CS276A Information Retrieval

Boolean Retrieval. Manning, Raghavan and Schütze, Chapter 1. Daniël de Kok

Introduction to Information Retrieval

Preliminary draft (c)2006 Cambridge UP

Information Retrieval. (M&S Ch 15)

Digital Libraries: Language Technologies

Improved Skips for Faster Postings List Intersection

CSE 7/5337: Information Retrieval and Web Search Introduction and Boolean Retrieval (IIR 1)

CS 572: Information Retrieval. Lecture 2: Hello World! (of Text Search)

Querying Introduction to Information Retrieval INF 141 Donald J. Patterson. Content adapted from Hinrich Schütze

Indexing and Searching

Introduction to Information Retrieval

CS276 Information Retrieval and Web Search. Lecture 2: Dictionary and Postings

THE WEB SEARCH ENGINE

Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval

Administrative. Distributed indexing. Index Compression! What I did last summer lunch talks today. Master. Tasks

Indexing and Searching

3-1. Dictionaries and Tolerant Retrieval. Most slides were adapted from Stanford CS 276 course and University of Munich IR course.

Sherlock 7 Technical Resource. Search Geometric

DHANALAKSHMI COLLEGE OF ENGINEERING, CHENNAI

Multimedia Information Extraction and Retrieval Term Frequency Inverse Document Frequency

Logistics. CSE Case Studies. Indexing & Retrieval in Google. Review: AltaVista. BigTable. Index Stream Readers (ISRs) Advanced Search

DB-retrieval: I m sorry, I can only look up your order, if you give me your OrderId.

CSCI 5417 Information Retrieval Systems! What is Information Retrieval?

modern database systems lecture 4 : information retrieval

IR System Components. Lecture 2: Data structures and Algorithms for Indexing. IR System Components. IR System Components

Indexing and Searching

Optimized Top-K Processing with Global Page Scores on Block-Max Indexes

Introduction to Information Retrieval

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most

IN4325 Indexing and query processing. Claudia Hauff (WIS, TU Delft)

Information Retrieval

Information Retrieval. CS630 Representing and Accessing Digital Information. What is a Retrieval Model? Basic IR Processes

Transcription:

Efficient Implementation of Postings Lists

Inverted Indices Query Brutus AND Calpurnia J. Pei: Information Retrieval and Web Search -- Efficient Implementation of Postings Lists 2

Skip Pointers J. Pei: Information Retrieval and Web Search -- Efficient Implementation of Postings Lists 3

Using Skip Pointers J. Pei: Information Retrieval and Web Search -- Efficient Implementation of Postings Lists 4

Efficiency of Using Skip Pointers What is the best case? W1: 2(40) 5 8 40(76) 42 65 76 W2: 40(120) 85 100 120 What is the worst case? W1: 1(7) 3 5 7(13) 9 11 13 W2: 2(8) 4 6 8(14) 10 12 14 Can you propose an idea to make use of skip pointers more aggressively? Then, what is the cost associated with the aggressive usage? J. Pei: Information Retrieval and Web Search -- Efficient Implementation of Postings Lists 5

Where Do We Place Skip Pointers? More skip pointers shorter skip spans more likely to skip many comparisons to skip pointers much space storing skip pointers Fewer skip points fewer pointer comparison long skip spans fewer opportunities to skip A heuristic: for a postings list of length P, use SQRT(P) evenly spaced skip pointers Can you propose an idea to be aware of the distribution of the postings? J. Pei: Information Retrieval and Web Search -- Efficient Implementation of Postings Lists 6

Phrase Queries Query Stanford University Using singleton terms, The inventor Stanford Ovshinsky never went to university is an answer 10% of web queries are phrase queries, many more are implicit phrase queries (e.g., person names) J. Pei: Information Retrieval and Web Search -- Efficient Implementation of Postings Lists 7

Biword Indices Consider every pair of consecutive terms in a document as a phrase Friends, Romans, Countrymen friends romans, romans countrymen stanford university palo alto stanford unversity AND university palo AND palo alto Biwords using nouns renegotiation of the constitution matches rule NX*N May not always work well cost overruns on a power plant cost overruns AND overruns power AND power plan cost overruns AND power plan seem more appropriate J. Pei: Information Retrieval and Web Search -- Efficient Implementation of Postings Lists 8

Phrase Indices Biword indices can be extended to variable length word sequences The longer the phrase, the lower the chance of false-positive Including long phrases increases the size of vocabulary A tradeoff between space and query effectiveness as well as efficiency J. Pei: Information Retrieval and Web Search -- Efficient Implementation of Postings Lists 9

Positional Indices A positional index stores, for each term in the vocabulary, postings of the form docid: <position1, position2, > Search for to be and not to be How many times do we need to search for to be? J. Pei: Information Retrieval and Web Search -- Efficient Implementation of Postings Lists 10

Proximity Intersection J. Pei: Information Retrieval and Web Search -- Efficient Implementation of Postings Lists 11

Positional Index Size Record positions of all occurrences of a term Boolean query complexity O(T), where T is the number of tokens in the document collection Suppose a term has frequency 0.1% on average The term is expected to appear once in a document of 1,000 terms, 100 times in a document of 100,000 terms Document size affects the positional index size In practice, a positional index is about 2-4 times as large as a nonpositional index J. Pei: Information Retrieval and Web Search -- Efficient Implementation of Postings Lists 12

Combination Schemes Biword indices and positional indices can be combined For commonly queried phrases, use biword indicies For others, use positional indices Consider relative frequencies Use biword indices for phrases where the individual words are common but the desired phrase is comparatively rare Partial next word index: for each term, record terms that follow it in a document Save 75% query time, with 26% more space J. Pei: Information Retrieval and Web Search -- Efficient Implementation of Postings Lists 13

Example Text Collection J. Pei: Information Retrieval and Web Search -- Efficient Implementation of Postings Lists 14

Inverted Index with Word Counts Help to rank the most relevant documents J. Pei: Information Retrieval and Web Search -- Efficient Implementation of Postings Lists 15

Fields and Extents Document fields: sections of documents that carry some kind of semantic meaning Name, title, location, etc. Extent: a contiguous region of a document Author: (1, 3), (4, 5), (7, 8) J. Pei: Information Retrieval and Web Search -- Efficient Implementation of Postings Lists 16

Aligning Posting Lists and Field J. Pei: Information Retrieval and Web Search -- Efficient Implementation of Postings Lists 17

Posting Lists of Scores We can sore the relevance/similarity between a document and a query word A posting list of scores Fish: (1: 3.6), (3: 2.2), Increasing flexibility: computationally expensive scoring becomes possible Scoring is moved to the index construction process Losing flexibility: the scoring mechanism is fixed once the index is built Information about word proximity is lost J. Pei: Information Retrieval and Web Search -- Efficient Implementation of Postings Lists 18

Ordering Using Scores An inverted list can be ordered by score so that the highest scored documents come first The query processing engine can focus only on the top part of each inverted list Only need to read a small number of postings to find the top-k documents J. Pei: Information Retrieval and Web Search -- Efficient Implementation of Postings Lists 19

Threshold Adjustment Algorithm Query: A and B, find top-2 documents Scoring function: score(a) + score(b) Each posting list is sorted in score descending order A: (D1, 3.8), (D10, 3.7), (D5, 3.6), (D8, 2.9), B: (D10, 3.4), (D2, 2.5), (D5, 2.2), (D9, 1.9), (D1, 1.8) Step 1: read (D1, 3.8) from A and (D10, 3.4) from B Upper bounds: any document cannot be higher than 7.2 Lower bounds: D1 3.8, D10 3.4 Step 2: read (D10, 3.7) from A and (D2, 2.5) from B D10 = 7.1 Upper bounds: D1 6.3, all others 6.2 Lower bounds: D2 2.5 D10 is the highest J. Pei: Information Retrieval and Web Search -- Efficient Implementation of Postings Lists 20

Threshold Adjustment Algorithm Step 3: read (D5, 3.6) from A and (D5, 2.2) from B D5 = 5.8 Upper bounds: D1 6, D2 6.1, all others 5.8 Lower bounds: D5 is a candidate Step 4: read (D8, 2.9) from A and (D9, 1.9) from B Upper bounds: D1 5.7, D2 5.4 D5 is another document in top-2 The search terminates J. Pei: Information Retrieval and Web Search -- Efficient Implementation of Postings Lists 21

Summary Inverted indices can be improved by efficient implementation of postings lists Using skip pointers Phrase queries Biword indices Positional postings The two can be combined Inverted indexes can be extended in many ways The TA algorithm J. Pei: Information Retrieval and Web Search -- Efficient Implementation of Postings Lists 22

To-do List Give the pseudocode of the TA algorithm How can the lower bounds be used? This is not discussed in the class Can you generalize it for multiple lists? J. Pei: Information Retrieval and Web Search -- Efficient Implementation of Postings Lists 23