Introduction to Search Engine Technology Term-at-a-Time and Document-at-a-Time Evaluation Ronny Lempel Yahoo! Research (Many of the following slides are courtesy of Aya Soffer and David Carmel, IBM Haifa Research Lab) Query Evaluation Strategies We ve got an inverted index a Lexicon and corresponding postings lists The posting elements in all postings lists are sorted by increasing location (docid, offset) Furthermore, each postings list is contiguous on disk Given a query, we need to do the following: 1. Parse and tokenize it turn into a list of search terms, taking into account operators. Lookup terms in Lexicon 3. Get postings iterator (cursor) per term from inverted index, calculate term weights. Calculate score per matching document. Return top scoring documents How is step done? 1 November 007 3660 Search Engine Technology 1
Query Evaluation Strategies How is step done? Term-at-a-Time Processing (TAAT): scan postings list one at a time, maintain a set of potential matching documents along with their partial scores Document-at-a-Time Processing (DAAT): scan postings lists in parallel, identifying at each point the next potential candidate document and scoring it Pros and cons will depend primarily on: Query semantics (conjunctive vs. disjunctive) Allowed operators (e.g. phrase support) Ranking logic (e.g. proximity considerations) Whether the index is distributed across multiple machines or not, and if distributed how? (next lecture) 1 November 007 3660 Search Engine Technology 3 TAAT Conjunctive Query Processing Boolean conjunctive query: For each query term t, Search lexicon; record frequency df(t) and grab the postings list L t of t Identify t* - term with smallest frequency (rarest term) Iterate through L t* (sequential disk read), and set C L t* C is the set of candidates, ordered by increasing docids For each remaining term t in increasing df(t) order: Merge candidate set C with current postings list L t For each docid d in C, If d is not in L t then set C C \ {d} If C=Ø return, there is no answer For each d in C, return d 1 November 007 3660 Search Engine Technology
TAAT Disjunctive Query Processing Boolean disjunctive query: For each query term t, Search lexicon; record frequency df(t) and grab the postings list L t of t Identify t* - term with highest frequency Iterate through L t* (sequential disk read), and set C L t* C is the set of candidates, ordered by increasing docids For each remaining term t (in arbitrary order): For each docid d in L t, If d is not in C then set C C U {d} For each d in C, return d 1 November 007 3660 Search Engine Technology TAAT Vector Space Evaluation for Top-r Retrieval TF/IDF scoring (cosine similarity measure): 1. Set A=Ø, an (empty) set of accumulators Denote by A d the score accumulator for document d. For each query term t in Q Record df(t) and grab postings list L t Set idf t log(n/df(t)) For each docid d in L t If A d is not in A: A d 0; A A U{A d } Update A d A d + idf t *freq d (t) 3. Normalization: for each A d in A, Set A d A d / d This sets A d to be proportional to cos(q, d). Select the r documents with the highest scores in A and return them in decreasing relevance order 1 November 007 3660 Search Engine Technology 6 3
Top-r Document Selection How can we efficiently select the r documents with the highest scores in A and return them in decreasing relevance order? Naïve method: sort the set of accumulators If A =M, the complexity is O(M logm) Better approach: since typically r<<m, selecting the r top scores can be done in O(M+r log M) using a heap: 1. Heapify the set of M scores (about M comparisons) so that the top score is at the root. Repeatedly extract the heap s root (r times), each time fixing the heap in O(logM) 1 November 007 3660 Search Engine Technology 7 The Heap Data Structure - Reminder A binary heap is a (mostly full) binary tree with values stored at all leaves and internal nodes, and an ordering rule that requires values to be nondecreasing (alternatively, non-increasing) along each path from a leaf to the root Largest/smallest value is at the root Heap implemented in an Array: Root at index 1 For node at index i, left child is at index i and right child at index i+1 Thus the parent of the node at index i is at index i/ 1 November 007 3660 Search Engine Technology
Binary Heap Stored in an Array 3 1 1 1 3 6 7 9 10 3 1 1 1 November 007 3660 Search Engine Technology 9 Extracting the Top Element Remove the largest item r times Each time: Remove the largest item, Replace it with the last element of the heap Sift the new root down until restoring order Example Remove item 3 from the root Last item in array (at location 10) replaces it Reinstate heap order - worst case will be sifted back down the tree - number of sifts is bounded by log(size of heap) 1 November 007 3660 Search Engine Technology 10
Heap Example (cont.) To restore order at the top level of tree, item, the larger of the children of root must be swapped with. 1 This limits the order violation to the left sub-tree. 1 The process is repeated until heap order is restored 1 November 007 3660 Search Engine Technology 11 Heap Example (cont.) 1 1 1 1 1 1 1 1 1 November 007 3660 Search Engine Technology 1 6
Top-r Selection Using a Min-Heap The selection problem can be solved by a heap that stores the smallest item at the root: min-heap A min-heap of r items is held instead of a max-heap of M lots of memory is saved, which is always good Process the M accumulator values, storing in the minheap the r largest values seen so far First r values are heapified in O(r) comparisons Replace the smallest value in the min-heap (the r th largest) whenever a larger value is found Sort the r highest values in descending order and return the corresponding documents O(r log r) 1 November 007 3660 Search Engine Technology Min-Heap Processing - Illustration Processed Unprocessed Min-heap of r largest items Discard smallest value 1 November 007 3660 Search Engine Technology 1 7
Top-r Selection Using a Min-Heap: Complexity Analysis Worst case: the scores are already in increasing order Each of the M-r last values is inserted into the heap Furthermore, it percolates to the bottom of the heap Complexity is O( (M-r)*log(r) ) Average case the scores arrive in a permutation of size M chosen uniformly at random The expected number of times one of the M-r last values is inserted into the heap is ~ r*ln(m/r) Each insertion costs O(log(r)) Complexity is O( r*log(r)*log(m/r) ) 1 November 007 3660 Search Engine Technology 1 TAAT: Buckley & Lewit Pruning Process (SIGIR ) For each query term t, compute its maximal score contribution to any document and denote by ms(t). Sort & scan the terms in descending order of ms(t) During accumulation, maintain a min-heap of size r+1 After accumulating the contribution of term t i : If A r > A r+1 + Σ k>i ms(t k ), stop query processing and return the top r docs Lemma: the pruning process returns the same r docs as the full process (not necessarily in the same order) This process is a form of early termination of the query evaluation process 1 November 007 3660 Search Engine Technology 16
TAAT and Web Search Queries on the Web are typically short (less than 3 words on average) Billions of documents Implications: Conjunctive queries still provide more than enough recall Proximity of query terms in documents is very important and improves scores over classis TF/IDF Web search engines allow exact-phrase queries How can proximity considerations and exact-phrase searches be accomplished in term-at-a-time evaluation? 1 November 007 3660 Search Engine Technology Document-at-a-Time Evaluation Will identify matching documents in increasing docid order The postings lists of all terms will be aligned on each matching document Terms within the document can be enumerated in increasing offset order, making it easy to identify terms appearing in proximity Main issue: with all postings lists being traversed in parallel rather than sequentially, how can disk I/O be optimized? 1 November 007 3660 Search Engine Technology 1 9
Postings API Each postings list will be traversed by a cursor The cursor for term t will be denoted C t A cursor supports the following operators: init() position at the beginning of the list next() return the position (docid, offset) of the next posting element in the list nextbeyond(position) find the first posting element in whose position is beyond the argument If the cursor is already beyond the given position, it doesn t move When no more positions are available, the above methods return In the next slide, we assume positions are just docids 1 November 007 3660 Search Engine Technology 19 Zig-Zag Join for Enumerating Candidates in Conjunctive Queries For each term t: c t.init() Repeat candidate t 0.next(), t align 1 // toss ahead first term While ( candidate< && t align < numterms): nextdoc = t align.nextbeyond(candidate-1) If (nextdoc == candidate): t align ++ else // nextdoc must be larger than candidate, toss first term candidate t 0.nextBeyond(nextDoc-1), t align 1 If (t align == numterms): // alignment found Score candidate, enter into min-heap Until candidate == Output the top-r documents in the min-heap in decreasing score order 1 November 007 3660 Search Engine Technology 0 10
Zig-Zag Join, Observations To increase the expected location skip per nextbeyond() operation, terms should be ordered from rarest to most frequent the rarest term drives the query! Phrase matches can be found similarly, by zig-zagging the phrase components to be found in the correct relative positions We thus build a virtual cursor for a phrase, that exposes the normal postings API to whoever is driving it For two terms, reduces to just a simple merge of their respective postings lists Finding common entries in two sorted lists of length L 1 and L can be done naively in O(L 1 +L ) What if L 1 >> L? Can we improve? What about I/O considerations (sequential is good, random is bad)? Consequence: need efficient forward skipping on postings lists! 1 November 007 3660 Search Engine Technology 1 Efficient Skipping in Postings Lists In order to efficiently support skipping (forward) in postings lists, lists are often implemented as B/B + Trees or Skip Lists (adapted to disk I/O) B + Tree A B-Tree whose values are only stored in the leaves (intermediate nodes only hold keys) furthermore, the leaves are laid out sequentially (or chained) to allow for easy iteration In a B-Tree implementation, all postings lists can be encoded in a single tree by having the sort key be (termid, location) With efficient skipping, the less reading done in DAAT processing as compared with TAAT processing can compensate for the I/O being non-sequential 1 November 007 3660 Search Engine Technology 11
Early Termination in DAAT Evaluation In certain scoring models, DAAT evaluation schemes support early termination. For example: Assume that document identifiers are assigned in decreasing order of some query-independent static score Suppose that the score of each document is a linear combination of its query-dependent text score and its static score: score(d) = α*textscore(d)+(1-α)*staticscore(d) Furthermore, assume that text scores are bounded by some maximum MTS. One can terminate evaluation after document k if the score of the r th best document in the min-heap is greater than: α*mts+(1-α)*staticscore(k) Unlike in TAAT, the r returned documents will have their correct and final scores and so their relative ordering will be correct Result counting, though, will not be correct 1 November 007 3660 Search Engine Technology 3 1