Efficiency vs. Effectiveness in Terabyte-Scale IR


Efficiency vs. Effectiveness in Terabyte-Scale Information Retrieval Stefan Büttcher Charles L. A. Clarke University of Waterloo, Canada November 17, 2005


What is Wumpus? Multi-user file system search engine. Multi-purpose information retrieval system. What are the main features? Multi-user support based on the UNIX security model: Each user can only search files for which she has read permission. Fully dynamic: Document insertions and deletions are immediately reflected by the index. Publicly available under the terms of the GPL: http://www.wumpus-search.org/

Outline: Single-Pass Indexing with Dynamic Partitioning. Hash-Based In-Memory Inversion. Managing In-Memory Posting Lists. Indexing Performance.

Single-Pass Indexing with Dynamic Partitioning. Start indexing the text collection, accumulating postings in an in-memory index. When memory is exhausted, create an on-disk inverted file from the in-memory data, then clear the memory and continue indexing. After the whole collection has been processed, construct the final index by merging all on-disk indices created so far.
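As a rough Python sketch of the scheme above (not the actual Wumpus code; `flush_partition` and `merge_partitions` are hypothetical stand-ins for the on-disk I/O layer):

```python
from collections import defaultdict

def build_index(docs, memory_limit, flush_partition, merge_partitions):
    """Single-pass indexing with dynamic partitioning: accumulate
    postings in memory, flush a sorted on-disk partition whenever the
    memory budget is exhausted, and merge all partitions at the end."""
    in_memory = defaultdict(list)   # term -> list of (doc_id, position)
    used = 0                        # crude proxy for memory consumption
    partitions = []
    for doc_id, tokens in docs:
        for pos, term in enumerate(tokens):
            in_memory[term].append((doc_id, pos))
            used += 1
        if used >= memory_limit:    # memory exhausted: flush to disk
            partitions.append(flush_partition(sorted(in_memory.items())))
            in_memory.clear()
            used = 0
    if in_memory:                   # flush the final partial partition
        partitions.append(flush_partition(sorted(in_memory.items())))
    return merge_partitions(partitions)   # final multiway merge
```

Because each flushed partition is sorted by term, the final step can be a straightforward multiway merge over the partitions.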

Hash-Based In-Memory Inversion. Vocabulary terms in the in-memory index are organized in a hash table with linked lists and a move-to-front heuristic. The Zipf-like term distribution makes hash table reorganizations unnecessary. [Figure: a hash table whose chains map the terms "high", "performance", "indexing", "information", and "retrieval" to their posting lists.]
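The move-to-front chaining idea can be sketched as follows (illustrative class name and bucket count; not the Wumpus implementation). Under a Zipf-like term distribution, frequent terms stay near the front of their chains, so lookups remain fast without rehashing:

```python
class MTFHashTable:
    """Chained hash table with the move-to-front heuristic: on every
    lookup, the accessed entry is moved to the head of its bucket's
    chain, so frequently accessed terms are found quickly."""
    def __init__(self, n_buckets=1 << 16):
        self.buckets = [[] for _ in range(n_buckets)]

    def postings(self, term):
        """Return the posting list for `term`, creating it if needed."""
        chain = self.buckets[hash(term) % len(self.buckets)]
        for i, (t, plist) in enumerate(chain):
            if t == term:
                chain.insert(0, chain.pop(i))   # move to front
                return plist
        chain.insert(0, (term, []))             # new vocabulary term
        return chain[0][1]
```

During inversion, each token occurrence becomes one call like `table.postings(term).append(posting)`.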

Managing In-Memory Posting Lists. How are the individual posting lists organized? The list implementation has to be extensible, fast, and space-efficient. Solutions in the literature: a fixed-size array (requires two-pass indexing: slow!); an augmentable bit-vector (uses realloc: slow!); a linked list (next pointers: space overhead!). Our solution: a linked list, but linking groups of (compressed) postings instead of individual postings, with an adaptive group size that grows exponentially. This is very fast, has minimal pointer overhead, and yields a 10% performance gain over the realloc approach.
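A minimal sketch of the grouped posting-list idea, with doubling group sizes (the initial and maximum group sizes here are made-up illustrations, and compression is omitted):

```python
class PostingList:
    """Linked list of posting *groups* instead of individual postings:
    each group is a fixed-capacity array, and the capacity grows
    exponentially up to a cap, keeping pointer overhead minimal while
    the list stays cheaply extensible."""
    def __init__(self, first_size=4, max_size=256):
        self.groups = [[]]           # always append into the last group
        self.capacity = first_size   # capacity of the current group
        self.max_size = max_size

    def append(self, posting):
        if len(self.groups[-1]) >= self.capacity:
            self.capacity = min(self.capacity * 2, self.max_size)
            self.groups.append([])   # link a new, larger group
        self.groups[-1].append(posting)

    def __iter__(self):
        for group in self.groups:
            yield from group
```

Compared with one `next` pointer per posting, there is only one pointer per group, and compared with `realloc`, no large buffer is ever copied.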

Indexing Performance. [Figure: memory consumption vs. indexing performance; total indexing time in minutes, with and without the final merge, plotted against the memory available for the in-memory index (128 MB to 1024 MB).] Indexing throughput with 1024 MB for the in-memory index: 73.5 GB per hour.

Baseline Scoring Method: Okapi BM25. Query Processing: Document-at-a-Time. Posting lists are arranged in a priority queue and merged into a single stream of (term, position) tuples. [Figure: the posting lists of query terms T1, T2, and T3 are fed through a heap and combined into one merged posting stream ordered by position.] A list of the n top-scoring documents is computed following a document-at-a-time approach with the MaxScore heuristic and partial score evaluation (using upper bounds to prune the search).
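The heap-based multiway merge might be sketched like this (MaxScore pruning and document scoring are omitted; the function name and input format are assumptions):

```python
import heapq

def merged_stream(term_postings):
    """Merge the posting lists of all query terms into one stream of
    (term, position) tuples via a priority queue, as in the
    document-at-a-time evaluation described above. `term_postings`
    maps each query term to its sorted list of positions."""
    heap = []
    for term, postings in term_postings.items():
        it = iter(postings)
        pos = next(it, None)
        if pos is not None:
            heap.append((pos, term, it))   # one frontier entry per list
    heapq.heapify(heap)
    while heap:
        pos, term, it = heapq.heappop(heap)
        yield (term, pos)                  # next tuple in global order
        nxt = next(it, None)
        if nxt is not None:
            heapq.heappush(heap, (nxt, term, it))
```

A scorer consuming this stream can accumulate per-document term frequencies and emit a document's score whenever the stream crosses a document boundary.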

Query Processing: Okapi BM25. Actual document scoring is done with Okapi BM25:

S_BM25(D) = Σ_{T ∈ Q} w_T · f_{D,T} · (k_1 + 1) / (f_{D,T} + k_1 · (1 - b + b · dl/avgdl))

where w_T is an IDF-style term weight based on the relative number of documents containing T, f_{D,T} is the number of occurrences of T in D, dl is the length of document D (number of tokens), and avgdl is the average document length in the collection.
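Under that reading of the formula, a BM25 scorer could look like this in Python. The IDF form w_T = log(N / N_T) is one common choice and an assumption here, since the slide only says the weight is based on the relative number of documents containing T:

```python
import math

def bm25_score(query, doc_tf, dl, avgdl, df, n_docs, k1=1.2, b=0.5):
    """Score one document against a query with Okapi BM25.
    doc_tf: term -> f_{D,T}; df: term -> number of docs containing it.
    Uses w_T = log(N / N_T) as the IDF weight (an assumption)."""
    K = k1 * (1 - b + b * dl / avgdl)   # length-normalization factor
    score = 0.0
    for t in query:
        f = doc_tf.get(t, 0)
        if f == 0 or df.get(t, 0) == 0:
            continue
        w = math.log(n_docs / df[t])    # IDF-style term weight w_T
        score += w * f * (k1 + 1) / (f + K)
    return score
```

With dl = avgdl the normalization factor K reduces to k_1, which is a convenient sanity check.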

Baseline results on the 2005 ad-hoc topics:

Measure                  Value
Precision @ 5            0.6560
Precision @ 10           0.6040
Precision @ 20           0.5470
Average time per query   2.133 sec

Okapi BM25 parameters: k_1 = 1.2, k_3 = , b = 0.5.

Proximity Score Accumulators. Remember that during query processing the posting lists of all query terms are combined through a multiway merge process, giving a stream of (term, position) pairs. During this process, all pairs of neighboring postings are examined, and the proximity score accumulators associated with the corresponding query terms are increased:

acc(T_j) := acc(T_j) + w_{T_k} · 1 / dist(T_j, T_k)^2
acc(T_k) := acc(T_k) + w_{T_j} · 1 / dist(T_j, T_k)^2
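A sketch of this accumulator update while scanning one document's merged (term, position) stream; treating only immediately adjacent postings of *different* terms as neighbor pairs is an assumption here:

```python
def update_accumulators(acc, stream, weights):
    """Walk the merged (term, position) stream of one document and
    apply the symmetric proximity update: each adjacent pair of
    postings from different query terms adds the *other* term's
    weight, damped by the squared distance between the positions."""
    prev_term, prev_pos = None, None
    for term, pos in stream:
        if prev_term is not None and prev_term != term:
            d2 = float(pos - prev_pos) ** 2
            acc[term] = acc.get(term, 0.0) + weights[prev_term] / d2
            acc[prev_term] = acc.get(prev_term, 0.0) + weights[term] / d2
        prev_term, prev_pos = term, pos
    return acc
```

The 1/dist² damping means only genuinely close term occurrences contribute noticeably to the proximity score.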

Adjusting the Final Document Score. Original document score (BM25):

S_BM25(D) = Σ_{T ∈ Q} w_T · f_{D,T} · (k_1 + 1) / (f_{D,T} + K), with K = k_1 · (1 - b + b · dl/avgdl).

New document score (BM25 + term proximity):

S_BM25TP(D) = S_BM25(D) + Σ_{T ∈ Q} min{1.0, w_T} · acc(T) · (k_1 + 1) / (acc(T) + K)

Limiting the term weight to a maximum of 1.0 keeps the impact of term proximity low and ensures that the original BM25 score is still the dominant component.

Results on the 2004 ad-hoc topics:

Run description         P@5     P@10    P@20    MAP
Pure BM25               0.5184  0.5061  0.4857  0.2477
BM25 + term proximity   0.5551  0.5612  0.5347  0.2767

Results on the 2005 ad-hoc topics:

Run description         P@5     P@10    P@20    MAP
Pure BM25               0.6560  0.6040  0.5470  0.3220
BM25 + term proximity   0.6200  0.5900  0.5730  0.3371

Outline: Document-Level Indexing. Simple Top-k Index Pruning. Threshold-Based Index Pruning.

Document-Level Indexing. Basic idea: for pure BM25, we do not need full positional information in the index. During index construction, only keep one posting per (term, document) pair and encode the TF value in the lowest 6 bits of each posting.

Efficiency gains:

Run description    Indexing time   Index size   Query time   MAP
BM25, positional   355 minutes     59.7 GB      2.133 sec    0.3220
BM25, doc. level   245 minutes     19.3 GB      0.326 sec    0.3213
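Encoding the TF value in the lowest 6 bits of a document-level posting can be sketched as follows (capping TF at 63 is an assumption forced by the 6-bit field; the function names are illustrative):

```python
def pack_posting(doc_id, tf):
    """Pack a document-level posting: document ID shifted left by 6
    bits, with the term frequency stored in the lowest 6 bits.
    TF values above 63 are capped, since 6 bits cannot hold more."""
    return (doc_id << 6) | min(tf, 63)

def unpack_posting(p):
    """Recover (doc_id, tf) from a packed posting."""
    return p >> 6, p & 63
```

Since BM25's TF saturation makes large frequencies nearly indistinguishable anyway, capping the stored TF costs almost nothing in ranking quality, which is consistent with the nearly unchanged MAP in the table above.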

Simple Index Pruning (top-k). Once we have an index with document-level postings, we can prune the posting lists using the score associated with each posting. Pruning in general: only keep the top-k postings (documents) for each term. Pruning here: construct a pruned index containing the top-k postings (documents) for the n most frequent terms. The goal is a pruned index small enough to fit into main memory. Both indices are then used in parallel: posting lists for the n most frequent terms are fetched from the pruned in-memory index, everything else from the unpruned on-disk index.

Threshold-Based Index Pruning (top-k-ε). A little more sophisticated: for every term, pick the k-th best posting (impact: I_k), then throw away all postings whose impact is less than ε · I_k (for details see Carmel et al. at TREC 2001). For various values of n, we built indices containing pruned posting lists for the n most frequent terms. The pruning parameters k and ε were chosen in such a way that the size of the resulting index was 1.2 GB in all cases.
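A sketch of the top-k-ε rule described above, over (doc_id, impact) pairs (representing postings this way is an assumption):

```python
def prune_posting_list(postings, k, epsilon):
    """Top-k-epsilon pruning (after Carmel et al.): determine the
    impact I_k of the k-th best posting, then drop every posting
    whose impact is below epsilon * I_k. `postings` is a list of
    (doc_id, impact) pairs; original order is preserved."""
    if len(postings) <= k:
        return list(postings)       # nothing to prune
    kth_impact = sorted((imp for _, imp in postings), reverse=True)[k - 1]
    threshold = epsilon * kth_impact
    return [(d, imp) for d, imp in postings if imp >= threshold]
```

With ε = 1.0 this degenerates to plain top-k pruning (modulo ties), while smaller ε keeps extra postings whose impact is close to the top-k cutoff.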


[Figure: top-k and top-k-ε index pruning, efficiency vs. effectiveness; precision at X documents (0.1 to 0.7) plotted against average time per query (20 ms to 320 ms) for top-k pruning (P@10 and P@20) and top-20-ε pruning (P@20).]

Outline: System Configuration. Ad-Hoc Retrieval Task. Efficiency Task. Named Page Finding Task.

System Configuration for All Official Runs. System setup: two PCs running in parallel, each with an AMD Athlon64 3500+ (2.2 GHz), 2 GB of RAM, and a 7,200-rpm HDD. First PC: one process (search engine). Second PC: two processes (search engine and dispatcher process). No term weight propagation (the collection is big enough).

Runs submitted for the Ad-Hoc Retrieval Task:

Run            Avg. query time   P@10    P@20    MAP     bpref
uwmtewtaptdn   29.464 sec        0.6900  0.6160  0.3480  0.3568
uwmtewtapt     1.264 sec         0.6320  0.5760  0.3451  0.3526
uwmtewtad00t   0.194 sec         0.6040  0.5650  0.3173  0.3352
uwmtewtad02t   0.059 sec         0.5060  0.4490  0.2173  0.2555

A 228% performance increase at the cost of a 24% decrease in bpref.

Runs submitted for the Efficiency Retrieval Task:

Run           Avg. query time   P@5     P@10    P@20    S@10
uwmtewteptp   1.094 sec         0.6760  0.6380  0.5780  0.9400
uwmtewtad00   0.137 sec         0.6000  0.6040  0.5570  0.9600
uwmtewtad02   0.049 sec         0.4960  0.4980  0.4450  0.9400
uwmtewtad10   0.027 sec         0.4920  0.4380  0.3900  0.8400

A 22% decrease in P@10 buys a 22-fold increase in query processing performance.

Runs submitted for the Named Page Finding Task:

Run           Avg. query time   MRR    Success@10   Success@INF
uwmtewtnp     3.468 sec         0.366  0.508        0.894
uwmtewtnppr   3.717 sec         0.322  0.464        0.894
uwmtewtnd     0.141 sec         0.290  0.413        0.663
uwmtewtndpr   0.167 sec         0.238  0.381        0.663

The integration of term proximity increases MRR by 26%, Success@10 by 23%, and Success@INF by 35%.

The End Thank You!