Query Answering Using Inverted Indexes

Similar documents
Efficient query processing

ΕΠΛ660. Ανάκτηση µε το µοντέλο διανυσµατικού χώρου

Information Retrieval. Lecture 5 - The vector space model. Introduction. Overview. Term weighting. Wintersemester 2007

Lecture 24: Image Retrieval: Part II. Visual Computing Systems CMU , Fall 2013

Informa(on Retrieval

Efficient Implementation of Postings Lists

Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval

Static Pruning of Terms In Inverted Files

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from

Indexing. UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze

Today s topic CS347. Results list clustering example. Why cluster documents. Clustering documents. Lecture 8 May 7, 2001 Prabhakar Raghavan

Models for Document & Query Representation. Ziawasch Abedjan

Boolean Retrieval. Manning, Raghavan and Schütze, Chapter 1. Daniël de Kok

Introduction to Information Retrieval

Lecture 8 May 7, Prabhakar Raghavan

Query Processing and Alternative Search Structures. Indexing common words

2. Give an example of algorithm instructions that would violate the following criteria: (a) precision: a =

Analyzing the performance of top-k retrieval algorithms. Marcus Fontoura Google, Inc

R16 SET - 1 '' ''' '' ''' Code No: R

Information Retrieval

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from

MG4J: Managing Gigabytes for Java. MG4J - intro 1

Predictive Indexing for Fast Search

Index Construction. Dictionary, postings, scalable indexing, dynamic indexing. Web Search

CS6200 Information Retrieval. David Smith College of Computer and Information Science Northeastern University

CISC689/ Information Retrieval Midterm Exam

Information Retrieval. (M&S Ch 15)

COSEQUENTIAL PROCESSING (SORTING LARGE FILES)

Information Retrieval

Inverted Index for Fast Nearest Neighbour

Index construction CE-324: Modern Information Retrieval Sharif University of Technology

Context. File Organizations and Indexing. Cost Model for Analysis. Alternative File Organizations. Some Assumptions in the Analysis.

Hashing IV and Course Overview

Midterm Exam Search Engines ( / ) October 20, 2015

A Systems View of Large- Scale 3D Reconstruction

Midterm 2, CS186 Prof. Hellerstein, Spring 2012

Information Retrieval

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser

Vector Space Scoring Introduction to Information Retrieval INF 141/ CS 121 Donald J. Patterson

Birkbeck (University of London)

Practice Midterm Exam Solutions

DATA STRUCTURES AND ALGORITHMS

Unit 3 Disk Scheduling, Records, Files, Metadata

Information Retrieval

Efficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Chapter IR:IV. IV. Indexes. Inverted Indexes Query Processing Index Construction Index Compression

University of Illinois at Urbana-Champaign Department of Computer Science. Final Examination

CS 347 Parallel and Distributed Data Processing

CS 347 Parallel and Distributed Data Processing

CMPSCI 646, Information Retrieval (Fall 2003)

CS105 Introduction to Information Retrieval

a) For the binary tree shown below, what are the preorder, inorder and post order traversals.

Data-Intensive Distributed Computing

Final Examination CSE 100 UCSD (Practice)

Instructor: Stefan Savev

Information Retrieval

This lecture: IIR Sections Ranked retrieval Scoring documents Term frequency Collection statistics Weighting schemes Vector space scoring

Introduction to Information Retrieval (Manning, Raghavan, Schutze)

Efficiency vs. Effectiveness in Terabyte-Scale IR

Amusing algorithms and data-structures that power Lucene and Elasticsearch. Adrien Grand

Q: Given a set of keywords how can we return relevant documents quickly?

Mining Web Data. Lijun Zhang

Information Retrieval and Organisation

Chapter IR:IV. IV. Indexing. Abstract Model of Ranking Inverted Indexes Compression Auxiliary Structures Index Construction Query Processing

Index construction CE-324: Modern Information Retrieval Sharif University of Technology

DATABASE MANAGEMENT SYSTEM ARCHITECTURE

Large-scale visual recognition The bag-of-words representation

Information Retrieval CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Introduction to Information Retrieval

Architecture and Implementation of Database Systems (Summer 2018)

Operating System Concepts

Lecture 5: Information Retrieval using the Vector Space Model

3-2. Index construction. Most slides were adapted from Stanford CS 276 course and University of Munich IR course.

Ranking Algorithms For Digital Forensic String Search Hits

Review: Memory, Disks, & Files. File Organizations and Indexing. Today: File Storage. Alternative File Organizations. Cost Model for Analysis

Query Evaluation Strategies

Thomas H. Cormen Charles E. Leiserson Ronald L. Rivest. Introduction to Algorithms

IMAGE MATCHING - ALOK TALEKAR - SAIRAM SUNDARESAN 11/23/2010 1

1 o Semestre 2007/2008

CSCI 5417 Information Retrieval Systems Jim Martin!

Disks, Memories & Buffer Management

FRONT CODING. Front-coding: 8automat*a1 e2 ic3 ion. Extra length beyond automat. Encodes automat. Begins to resemble general string compression.

THE WEB SEARCH ENGINE

Logistics. CSE Case Studies. Indexing & Retrieval in Google. Review: AltaVista. BigTable. Index Stream Readers (ISRs) Advanced Search

Department of Computer Science and Engineering B.E/B.Tech/M.E/M.Tech : B.E. Regulation: 2013 PG Specialisation : _

1. [1 pt] What is the solution to the recurrence T(n) = 2T(n-1) + 1, T(1) = 1

Priority Queues. T. M. Murali. January 29, 2009

CC PROCESAMIENTO MASIVO DE DATOS OTOÑO 2018

Chapter 6 Objectives

Index construction CE-324: Modern Information Retrieval Sharif University of Technology

CS301 - Data Structures Glossary By

RASIM: a rank-aware separate index method for answering top-k spatial keyword queries

Spatial Queries. Nearest Neighbor Queries

Introduction to Data Mining

Preview. Memory Management

Classic IR Models 5/6/2012 1

Spatial Data Structures for Computer Graphics

( ) n 5. Test 1 - Closed Book

Using Graphics Processors for High Performance IR Query Processing

Transcription:

Query Answering Using Inverted Indexes

Inverted Indexes Query Brutus AND Calpurnia J. Pei: Information Retrieval and Web Search -- Query Answering Using Inverted Indexes 2

Document-at-a-time Evaluation The conceptually simplest query answering method Query J. Pei: Information Retrieval and Web Search -- Query Answering Using Inverted Indexes 3

Algorithm Find posting lists Can be implemented efficiently by keeping the top-k list at anytime J. Pei: Information Retrieval and Web Search -- Query Answering Using Inverted Indexes 4

Term-at-a-time Evaluation J. Pei: Information Retrieval and Web Search -- Query Answering Using Inverted Indexes 5

Algorithm Compute scores on one term Can be implemented efficiently by keeping the top-k list at anytime J. Pei: Information Retrieval and Web Search -- Query Answering Using Inverted Indexes 6

Comparison Memory usage The document-at-a-time only needs to maintain a priority queue R of a limited number of results The term-at-a-time needs to store the current scores for all documents Disk access The document-at-a-time needs more disk seeking and buffers for seeking since multiple lists are read in a synchronized way The term-at-a-time reads through each inverted list from start to end requiring minimal disk seeking and buffer J. Pei: Information Retrieval and Web Search -- Query Answering Using Inverted Indexes 7

List Skipping Consider an inverted list of n bytes, if we add skip pointers after each c bytes, and the pointers are k bytes long each Reading the whole list: Θ(n) bytes Jumping through the list using the skip pointers: Θ(kn/c) = Θ(n) No asymptotic gain When c is large and k is small, it may gain in practice J. Pei: Information Retrieval and Web Search -- Query Answering Using Inverted Indexes 8

Big Skips If c gets too large, the average performance drops Consider finding p postings in a list of n bytes There are n/c total intervals in the list Need to read kn/c bytes in skip pointers Need to read data in p intervals on average, assume that the postings we want are about halfway between two skip pointers read additional pc/2 bytes The total number of bytes to read: kn/c + pc/2 When n/c p, skipping does not help Most disks require a skip of at least 100,000 postings to gain in speedup Skipping is useful in reducing the amount of time spent on decoding compressed data and processing cached data J. Pei: Information Retrieval and Web Search -- Query Answering Using Inverted Indexes 9

Computing Cosine Score J. Pei: Information Retrieval and Web Search -- Query Answering Using Inverted Indexes 10

Efficient Scoring For a query q = w1 w2 The unit vector v (q) has only two nonzero components If query terms are not weighted, the nonzero components are equal to 2 / 2 = 0.707 Generally, for any two documents d 1 and d 2 V ( q) v( d1) > V ( q) v( d2) if and only if v( q) v( d1) > v( q) v( d2) V ( q) v( d) is the weighted sum over all terms in query q, of the weights of those terms in d J. Pei: Information Retrieval and Web Search -- Query Answering Using Inverted Indexes 11

Efficient Scoring Algorithm Using a heap, selecting top k answers can be done with 2J comparisons where J is the number of answers of nonzero scores J. Pei: Information Retrieval and Web Search -- Query Answering Using Inverted Indexes 12

Approximate Top-K Retrieval Retrieve K documents that are likely to be among the K highest scoring documents Goal: lower down the query answering cost Cosine measure is also an approximation of information need Major cost: computing cosine similarities between the query and a large number of documents Approximation strategies Find a set A of documents that are contenders, where K < A «N, such that A is likely to have many documents with scores near those of the top K Return the top-k documents in A J. Pei: Information Retrieval and Web Search -- Query Answering Using Inverted Indexes 13

Index Elimination For a multi-term query q, we only need to consider documents containing at least one of the query terms Only consider documents containing terms whose IDF exceeds a preset threshold Only check those discriminative words Benefit: the postings lists of low-idf terms are generally long (many are stop words) Only consider documents that contain many of the query terms J. Pei: Information Retrieval and Web Search -- Query Answering Using Inverted Indexes 14

Champion Lists For each term t in the dictionary, precompute the top-r documents of the highest weights for t, where r is a preset parameter Set different r for different terms larger for rare terms and smaller for frequent terms Given a query q, let A be the union of the champion lists for each of the terms comprising q Compute cosine similarity only between q and those documents in A J. Pei: Information Retrieval and Web Search -- Query Answering Using Inverted Indexes 15

Static Quality Scores and Ordering Different documents have different importance Example: how good are reviews on a web page? Modeled by a quality measure g(d) [0, 1] V ( q) V ( d) TotalScore ( q, d) = g( d) + V ( q) V ( d) Sort documents in posting lists in g(d) descending order Suppose g(1) = 0.25, g(2) = 0.5, and g(3) = 1 J. Pei: Information Retrieval and Web Search -- Query Answering Using Inverted Indexes 16

Using Quality Score Ordering For a well-chosen value r, maintain for each term t a global champion list of the top-r documents with the highest value of g(d)+tfidf(t, d) At query time, only compute TotalScore for documents in the union of those global champion lists Maintain for each term t two posting lists High list: m documents with the highest TF values for t Low list: the other documents containing t Use high list only if at least K answers can be generated J. Pei: Information Retrieval and Web Search -- Query Answering Using Inverted Indexes 17

Tiered Indexes Generalization of champion lists J. Pei: Information Retrieval and Web Search -- Query Answering Using Inverted Indexes 18

Clustering and NN Search Clustering Pick N documents as leaders at random from the collection For each document that is not a leader (called a follower), compute its nearest leader Each cluster has N = N N followers Alternatively, a follower can be assigned to b 1 leaders Query answering as nearest neighbor search For a query q, find the leader L (or b 2 leaders) that is closest to q computing cosine similarities between q and N leaders The candidate set A contains the closest leader and the followers J. Pei: Information Retrieval and Web Search -- Query Answering Using Inverted Indexes 19

Example J. Pei: Information Retrieval and Web Search -- Query Answering Using Inverted Indexes 20

Putting All Together J. Pei: Information Retrieval and Web Search -- Query Answering Using Inverted Indexes 21

Summary and To-Do-List Query evaluation Document-at-a-time versus term-at-a-time List skipping Efficient scoring Approximate top-k retrieval Index elimination, champion lists, quality score and ranking, clustering and nearest neighbor search Read Section 5.7 J. Pei: Information Retrieval and Web Search -- Query Answering Using Inverted Indexes 22