Query Evaluation Strategies

Similar documents
Query Evaluation Strategies

Analyzing the performance of top-k retrieval algorithms. Marcus Fontoura Google, Inc

ΕΠΛ660. Ανάκτηση µε το µοντέλο διανυσµατικού χώρου

Efficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488)

Computer Science 210 Data Structures Siena College Fall Topic Notes: Priority Queues and Heaps

CS 234. Module 8. November 15, CS 234 Module 8 ADT Priority Queue 1 / 22

Advanced Database Systems

Priority Queues and Binary Heaps

Query Processing & Optimization

CS350: Data Structures Heaps and Priority Queues

Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi

Heapsort. Why study Heapsort?

Chapter 12: Query Processing. Chapter 12: Query Processing

Programming II (CS300)

Chapter 13: Query Processing

XML Query Processing. Announcements (March 31) Overview. CPS 216 Advanced Database Systems. Course project milestone 2 due today

Data Structures Lesson 9

COSC160: Data Structures Heaps. Jeremy Bolton, PhD Assistant Teaching Professor

! A relational algebra expression may have many equivalent. ! Cost is generally measured as total elapsed time for

Chapter 13: Query Processing Basic Steps in Query Processing

CS165: Priority Queues, Heaps

CSCI2100B Data Structures Heaps

Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval

CSCI 136 Data Structures & Advanced Programming. Lecture 22 Fall 2018 Instructor: Bills

Database System Concepts

CS6200 Information Retrieval. David Smith College of Computer and Information Science Northeastern University

Lecture 5: Information Retrieval using the Vector Space Model

Chapter 12: Query Processing

Announcements (March 31) XML Query Processing. Overview. Navigational processing in Lore. Navigational plans in Lore

9. Heap : Priority Queue

Priority Queues. 1 Introduction. 2 Naïve Implementations. CSci 335 Software Design and Analysis III Chapter 6 Priority Queues. Prof.

Heaps. Heaps. A heap is a complete binary tree.

Properties of a heap (represented by an array A)

Thus, it is reasonable to compare binary search trees and binary heaps as is shown in Table 1.

Sorting and Selection

Web Information Retrieval. Lecture 4 Dictionaries, Index Compression

CSE 241 Class 17. Jeremy Buhler. October 28, Ordered collections supported both, plus total ordering operations (pred and succ)

Binary heaps (chapters ) Leftist heaps

Heaps Outline and Required Reading: Heaps ( 7.3) COSC 2011, Fall 2003, Section A Instructor: N. Vlajic

Lecture Notes on Priority Queues

Chapter 6 Heapsort 1

CSC Design and Analysis of Algorithms. Lecture 8. Transform and Conquer II Algorithm Design Technique. Transform and Conquer

3. Priority Queues. ADT Stack : LIFO. ADT Queue : FIFO. ADT Priority Queue : pick the element with the lowest (or highest) priority.

Sorting and Searching

Balanced Binary Search Trees. Victor Gao

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser

CSC Design and Analysis of Algorithms. Lecture 8. Transform and Conquer II Algorithm Design Technique. Transform and Conquer

Heaps and Priority Queues

Query Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016

CS301 - Data Structures Glossary By

Sorting and Searching

Indexing. UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze

Chapter 6 Heaps. Introduction. Heap Model. Heap Implementation

EE 368. Weeks 5 (Notes)

Information Retrieval

1 Tree Sort LECTURE 4. OHSU/OGI (Winter 2009) ANALYSIS AND DESIGN OF ALGORITHMS

Chapter 16: Greedy Algorithm

The Heap Data Structure

Priority Queues and Heaps. Heaps of fun, for everyone!

Design and Analysis of Algorithms

Data Structures and Algorithms for Engineers

Lecture 7. Transform-and-Conquer

COMP6237 Data Mining Searching and Ranking

Programming II (CS300)

Trees & Tree-Based Data Structures. Part 4: Heaps. Definition. Example. Properties. Example Min-Heap. Definition

Binary Heaps in Dynamic Arrays

Information Retrieval

CS 241 Analysis of Algorithms

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Priority Queues / Heaps Date: 9/27/17

Lecture 5. Treaps Find, insert, delete, split, and join in treaps Randomized search trees Randomized search tree time costs

Priority Queues. Lecture15: Heaps. Priority Queue ADT. Sequence based Priority Queue

CS2223: Algorithms Sorting Algorithms, Heap Sort, Linear-time sort, Median and Order Statistics

COMP Data Structures

Design and Analysis of Algorithms Lecture- 9: B- Trees

Priority Queues. T. M. Murali. January 26, T. M. Murali January 26, 2016 Priority Queues

CMSC 341 Lecture 14: Priority Queues, Heaps

Heap Model. specialized queue required heap (priority queue) provides at least

Building an Inverted Index

Evaluation Strategies for Top-k Queries over Memory-Resident Inverted Indexes

Introduction to Indexing 2. Acknowledgements: Eamonn Keogh and Chotirat Ann Ratanamahatana

CSE100. Advanced Data Structures. Lecture 8. (Based on Paul Kube course materials)

403: Algorithms and Data Structures. Heaps. Fall 2016 UAlbany Computer Science. Some slides borrowed by David Luebke

Algorithms and Data Structures

modern database systems lecture 4 : information retrieval

CSE 214 Computer Science II Heaps and Priority Queues

Query Processing and Alternative Search Structures. Indexing common words

Efficient query processing

CS711008Z Algorithm Design and Analysis

Hash-Based Indexing 165

CS711008Z Algorithm Design and Analysis

The B-Tree. Yufei Tao. ITEE University of Queensland. INFS4205/7205, Uni of Queensland

Database Applications (15-415)

Priority Queues. T. M. Murali. January 29, 2009

Chapter 12: Indexing and Hashing. Basic Concepts

A Note on Scheduling Parallel Unit Jobs on Hypercubes

Red-Black-Trees and Heaps in Timestamp-Adjusting Sweepline Based Algorithms

Efficient Implementation of Postings Lists

9/29/2016. Chapter 4 Trees. Introduction. Terminology. Terminology. Terminology. Terminology

Chapter 11: Indexing and Hashing

Recall: Properties of B-Trees

Transcription:

Introduction to Search Engine Technology Term-at-a-Time and Document-at-a-Time Evaluation Ronny Lempel Yahoo! Research (Many of the following slides are courtesy of Aya Soffer and David Carmel, IBM Haifa Research Lab) Query Evaluation Strategies We ve got an inverted index a Lexicon and corresponding postings lists The posting elements in all postings lists are sorted by increasing location (docid, offset) Furthermore, each postings list is contiguous on disk Given a query, we need to do the following: 1. Parse and tokenize it turn into a list of search terms, taking into account operators. Lookup terms in Lexicon 3. Get postings iterator (cursor) per term from inverted index, calculate term weights. Calculate score per matching document. Return top scoring documents How is step done? 1 November 007 3660 Search Engine Technology 1

Query Evaluation Strategies How is step done? Term-at-a-Time Processing (TAAT): scan postings list one at a time, maintain a set of potential matching documents along with their partial scores Document-at-a-Time Processing (DAAT): scan postings lists in parallel, identifying at each point the next potential candidate document and scoring it Pros and cons will depend primarily on: Query semantics (conjunctive vs. disjunctive) Allowed operators (e.g. phrase support) Ranking logic (e.g. proximity considerations) Whether the index is distributed across multiple machines or not, and if distributed how? (next lecture) 1 November 007 3660 Search Engine Technology 3 TAAT Conjunctive Query Processing Boolean conjunctive query: For each query term t, Search lexicon; record frequency df(t) and grab the postings list L t of t Identify t* - term with smallest frequency (rarest term) Iterate through L t* (sequential disk read), and set C L t* C is the set of candidates, ordered by increasing docids For each remaining term t in increasing df(t) order: Merge candidate set C with current postings list L t For each docid d in C, If d is not in L t then set C C \ {d} If C=Ø return, there is no answer For each d in C, return d 1 November 007 3660 Search Engine Technology

TAAT Disjunctive Query Processing Boolean disjunctive query: For each query term t, Search lexicon; record frequency df(t) and grab the postings list L t of t Identify t* - term with highest frequency Iterate through L t* (sequential disk read), and set C L t* C is the set of candidates, ordered by increasing docids For each remaining term t (in arbitrary order): For each docid d in L t, If d is not in C then set C C U {d} For each d in C, return d 1 November 007 3660 Search Engine Technology TAAT Vector Space Evaluation for Top-r Retrieval TF/IDF scoring (cosine similarity measure): 1. Set A=Ø, an (empty) set of accumulators Denote by A d the score accumulator for document d. For each query term t in Q Record df(t) and grab postings list L t Set idf t log(n/df(t)) For each docid d in L t If A d is not in A: A d 0; A A U{A d } Update A d A d + idf t *freq d (t) 3. Normalization: for each A d in A, Set A d A d / d This sets A d to be proportional to cos(q, d). Select the r documents with the highest scores in A and return them in decreasing relevance order 1 November 007 3660 Search Engine Technology 6 3

Top-r Document Selection How can we efficiently select the r documents with the highest scores in A and return them in decreasing relevance order? Naïve method: sort the set of accumulators If A =M, the complexity is O(M logm) Better approach: since typically r<<m, selecting the r top scores can be done in O(M+r log M) using a heap: 1. Heapify the set of M scores (about M comparisons) so that the top score is at the root. Repeatedly extract the heap s root (r times), each time fixing the heap in O(logM) 1 November 007 3660 Search Engine Technology 7 The Heap Data Structure - Reminder A binary heap is a (mostly full) binary tree with values stored at all leaves and internal nodes, and an ordering rule that requires values to be nondecreasing (alternatively, non-increasing) along each path from a leaf to the root Largest/smallest value is at the root Heap implemented in an Array: Root at index 1 For node at index i, left child is at index i and right child at index i+1 Thus the parent of the node at index i is at index i/ 1 November 007 3660 Search Engine Technology

Binary Heap Stored in an Array 3 1 1 1 3 6 7 9 10 3 1 1 1 November 007 3660 Search Engine Technology 9 Extracting the Top Element Remove the largest item r times Each time: Remove the largest item, Replace it with the last element of the heap Sift the new root down until restoring order Example Remove item 3 from the root Last item in array (at location 10) replaces it Reinstate heap order - worst case will be sifted back down the tree - number of sifts is bounded by log(size of heap) 1 November 007 3660 Search Engine Technology 10

Heap Example (cont.) To restore order at the top level of tree, item, the larger of the children of root must be swapped with. 1 This limits the order violation to the left sub-tree. 1 The process is repeated until heap order is restored 1 November 007 3660 Search Engine Technology 11 Heap Example (cont.) 1 1 1 1 1 1 1 1 1 November 007 3660 Search Engine Technology 1 6

Top-r Selection Using a Min-Heap The selection problem can be solved by a heap that stores the smallest item at the root: min-heap A min-heap of r items is held instead of a max-heap of M lots of memory is saved, which is always good Process the M accumulator values, storing in the minheap the r largest values seen so far First r values are heapified in O(r) comparisons Replace the smallest value in the min-heap (the r th largest) whenever a larger value is found Sort the r highest values in descending order and return the corresponding documents O(r log r) 1 November 007 3660 Search Engine Technology Min-Heap Processing - Illustration Processed Unprocessed Min-heap of r largest items Discard smallest value 1 November 007 3660 Search Engine Technology 1 7

Top-r Selection Using a Min-Heap: Complexity Analysis Worst case: the scores are already in increasing order Each of the M-r last values is inserted into the heap Furthermore, it percolates to the bottom of the heap Complexity is O( (M-r)*log(r) ) Average case the scores arrive in a permutation of size M chosen uniformly at random The expected number of times one of the M-r last values is inserted into the heap is ~ r*ln(m/r) Each insertion costs O(log(r)) Complexity is O( r*log(r)*log(m/r) ) 1 November 007 3660 Search Engine Technology 1 TAAT: Buckley & Lewit Pruning Process (SIGIR ) For each query term t, compute its maximal score contribution to any document and denote by ms(t). Sort & scan the terms in descending order of ms(t) During accumulation, maintain a min-heap of size r+1 After accumulating the contribution of term t i : If A r > A r+1 + Σ k>i ms(t k ), stop query processing and return the top r docs Lemma: the pruning process returns the same r docs as the full process (not necessarily in the same order) This process is a form of early termination of the query evaluation process 1 November 007 3660 Search Engine Technology 16

TAAT and Web Search Queries on the Web are typically short (less than 3 words on average) Billions of documents Implications: Conjunctive queries still provide more than enough recall Proximity of query terms in documents is very important and improves scores over classis TF/IDF Web search engines allow exact-phrase queries How can proximity considerations and exact-phrase searches be accomplished in term-at-a-time evaluation? 1 November 007 3660 Search Engine Technology Document-at-a-Time Evaluation Will identify matching documents in increasing docid order The postings lists of all terms will be aligned on each matching document Terms within the document can be enumerated in increasing offset order, making it easy to identify terms appearing in proximity Main issue: with all postings lists being traversed in parallel rather than sequentially, how can disk I/O be optimized? 1 November 007 3660 Search Engine Technology 1 9

Postings API Each postings list will be traversed by a cursor The cursor for term t will be denoted C t A cursor supports the following operators: init() position at the beginning of the list next() return the position (docid, offset) of the next posting element in the list nextbeyond(position) find the first posting element in whose position is beyond the argument If the cursor is already beyond the given position, it doesn t move When no more positions are available, the above methods return In the next slide, we assume positions are just docids 1 November 007 3660 Search Engine Technology 19 Zig-Zag Join for Enumerating Candidates in Conjunctive Queries For each term t: c t.init() Repeat candidate t 0.next(), t align 1 // toss ahead first term While ( candidate< && t align < numterms): nextdoc = t align.nextbeyond(candidate-1) If (nextdoc == candidate): t align ++ else // nextdoc must be larger than candidate, toss first term candidate t 0.nextBeyond(nextDoc-1), t align 1 If (t align == numterms): // alignment found Score candidate, enter into min-heap Until candidate == Output the top-r documents in the min-heap in decreasing score order 1 November 007 3660 Search Engine Technology 0 10

Zig-Zag Join, Observations To increase the expected location skip per nextbeyond() operation, terms should be ordered from rarest to most frequent the rarest term drives the query! Phrase matches can be found similarly, by zig-zagging the phrase components to be found in the correct relative positions We thus build a virtual cursor for a phrase, that exposes the normal postings API to whoever is driving it For two terms, reduces to just a simple merge of their respective postings lists Finding common entries in two sorted lists of length L 1 and L can be done naively in O(L 1 +L ) What if L 1 >> L? Can we improve? What about I/O considerations (sequential is good, random is bad)? Consequence: need efficient forward skipping on postings lists! 1 November 007 3660 Search Engine Technology 1 Efficient Skipping in Postings Lists In order to efficiently support skipping (forward) in postings lists, lists are often implemented as B/B + Trees or Skip Lists (adapted to disk I/O) B + Tree A B-Tree whose values are only stored in the leaves (intermediate nodes only hold keys) furthermore, the leaves are laid out sequentially (or chained) to allow for easy iteration In a B-Tree implementation, all postings lists can be encoded in a single tree by having the sort key be (termid, location) With efficient skipping, the less reading done in DAAT processing as compared with TAAT processing can compensate for the I/O being non-sequential 1 November 007 3660 Search Engine Technology 11

Early Termination in DAAT Evaluation In certain scoring models, DAAT evaluation schemes support early termination. For example: Assume that document identifiers are assigned in decreasing order of some query-independent static score Suppose that the score of each document is a linear combination of its query-dependent text score and its static score: score(d) = α*textscore(d)+(1-α)*staticscore(d) Furthermore, assume that text scores are bounded by some maximum MTS. One can terminate evaluation after document k if the score of the r th best document in the min-heap is greater than: α*mts+(1-α)*staticscore(k) Unlike in TAAT, the r returned documents will have their correct and final scores and so their relative ordering will be correct Result counting, though, will not be correct 1 November 007 3660 Search Engine Technology 3 1