The anatomy of a large-scale small search engine: Efficient index organization and query processing


Simon Jonassen
Department of Computer and Information Science, Norwegian University of Science and Technology
TDT4215 Web Intelligence, NTNU, 17 March 2011

State of the Art
1. Design a self-skipping index structure specifically for NewPForDelta compression.
2. Provide an efficient query processing method for disjunctive (OR) queries.

Outline
  Introduction to inverted indexes
  Main index components and organization
  Inverted index skipping
  Efficient query processing
  Experimental results
  Conclusions
  Related work and bibliography list

Search Engine Basics

Inverted Index Approach to IR

Query Processing Modes
  Document-At-A-Time (DAAT): a document has to be fully matched and scored for all of the query terms before any other document is considered.
  Term-At-A-Time (TAAT): a term's posting list has to be fully processed before any other term is considered.
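To make the two processing modes concrete, here is a minimal Python sketch. The toy index, the function names `taat` and `daat`, and the use of summed term frequencies as the "score" are all illustrative assumptions, not taken from the slides.

```python
from collections import defaultdict

# Hypothetical toy index: term -> postings (docid, freq), ordered by docid.
INDEX = {
    "apple": [(1, 2), (3, 1), (7, 4)],
    "pie": [(3, 2), (5, 1), (7, 1)],
}

def taat(index, terms):
    """Term-at-a-time: finish one posting list before touching the next."""
    accumulators = defaultdict(int)
    for term in terms:
        for docid, freq in index.get(term, []):
            accumulators[docid] += freq  # partial score kept per document
    return dict(accumulators)

def daat(index, terms):
    """Document-at-a-time: fully score one document before the next."""
    lists = [index.get(term, []) for term in terms]
    pos = [0] * len(lists)
    scores = {}
    while True:
        heads = [l[p][0] for l, p in zip(lists, pos) if p < len(l)]
        if not heads:
            return scores
        docid = min(heads)  # smallest unprocessed docid across all lists
        score = 0
        for i, postings in enumerate(lists):
            if pos[i] < len(postings) and postings[pos[i]][0] == docid:
                score += postings[pos[i]][1]
                pos[i] += 1
        scores[docid] = score  # final score; no accumulator table needed
```

Both functions return the same scores. The difference is that TAAT needs an accumulator per candidate document, while DAAT only needs one cursor per posting list.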

Query Matching Modes
  Conjunctive (AND) queries: a document has to match ALL of the query terms.
  Disjunctive (OR) queries: a document has to match ANY of the query terms. (OR queries are normally more time-consuming than AND queries.)

Statistical Similarity Scoring Models
  Cosine similarity, TFxIDF, Robertson TFxIDF, Okapi BM-25

Statistical Similarity Scoring Models
  Cosine, TFxIDF, Okapi BM25, etc.
  Term frequency: number of times a term occurs in a document
  Document frequency/posting list length: number of documents a term occurs in
  Collection frequency: total number of occurrences of a term within the document collection
  Key/query frequency: number of occurrences in the query
  Document length: number of tokens in the document
  Total number of documents
  Total number of unique terms
  Total number of postings (posting list entries)
  Total number of tokens (word occurrences)
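As an illustration of how these statistics feed a scoring model, the sketch below computes one common textbook variant of Okapi BM25. The slides do not give the exact formula used, so the IDF form, the parameter defaults `k1=1.2` and `b=0.75`, and the function name `bm25_term` are assumptions.

```python
import math

def bm25_term(tf, df, doc_len, n_docs, n_tokens, k1=1.2, b=0.75):
    """Score contribution of one query term for one document.

    tf       term frequency in the document
    df       document frequency (posting list length)
    doc_len  number of tokens in the document
    n_docs   total number of documents
    n_tokens total number of tokens in the collection
    """
    avg_len = n_tokens / n_docs  # average document length
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))
    norm = tf + k1 * (1 - b + b * doc_len / avg_len)
    return idf * tf * (k1 + 1) / norm
```

The document score is the sum of `bm25_term` over the query terms that occur in the document. Only per-posting data (tf), per-term data (df), per-document data (length) and collection totals are needed.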

Main Index Components and Organization
  Collection statistics: total numbers of documents, unique terms, tokens and postings.
  Document dictionary: for each document, the document ID, name, length, URL.
  Lexicon: for each term, the term ID, term string, document and collection frequencies, pointer to posting list.
  Inverted file/posting lists: for each term, the term ID, document IDs and term frequencies.

Index organization
  Collection statistics can be stored as a .property (or .txt) file and read at start-up.
  Both document dictionary and lexicon can be stored as two ordered sets (arrays) of constant-size records, or put into two different B-trees or similar.
  DocDict (28B entries): ID (int, 4B), DocNo (20B), NumTokens (int, 4B).
  Lexicon (40B entries): Term (20B), ID (int, 4B), DocFreq (int, 4B), TermFreq (int, 4B), EndPtr (long, 8B).
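The constant-size record layout above can be sketched with Python's `struct` module. The field sizes follow the slide; the big-endian byte order, the zero-padding of terms to 20 bytes, and the helper names `pack_lexicon` and `lookup` are assumptions made for illustration.

```python
import struct

# ">" = big-endian without padding (an assumption; the slide gives only sizes).
DOC_RECORD = struct.Struct(">i20si")    # ID, DocNo, NumTokens          -> 28 bytes
LEX_RECORD = struct.Struct(">20siiiq")  # Term, ID, DocFreq, TermFreq, EndPtr -> 40 bytes

def pack_lexicon(entries):
    """entries: (term, term_id, doc_freq, term_freq, end_ptr), sorted by term.

    Terms are assumed to be ASCII and at most 20 bytes long.
    """
    return b"".join(
        LEX_RECORD.pack(term.encode("ascii"), tid, df, tf, ptr)
        for term, tid, df, tf, ptr in entries
    )

def lookup(buf, term):
    """Binary search over the fixed-size records; returns (id, df, tf, ptr) or None."""
    key = term.encode("ascii").ljust(20, b"\0")
    lo, hi = 0, len(buf) // LEX_RECORD.size
    while lo < hi:
        mid = (lo + hi) // 2
        record = LEX_RECORD.unpack_from(buf, mid * LEX_RECORD.size)
        if record[0] < key:
            lo = mid + 1
        elif record[0] > key:
            hi = mid
        else:
            return record[1:]
    return None
```

Because every record has the same size, record number i starts at byte i * 40, which is what makes binary search over an ordered array possible without any extra index.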

Index organization: inverted file
  Basic posting list: <termid <docid freq>*>*
  Alternatively, if we want to store positions as well: <termid <docid freq <pos>*>*>*
  <docid freq> postings are normally ordered by increasing docid.
  Simplest way to reduce space: d-gaps or delta coding, which stores differences between docids rather than the docids themselves.
  Frequencies normally represent the number of occurrences in a document and are stored as integers.

Inverted file compression
  Many different methods, which can be classified as:
  parametric/non-parametric
  dictionary/arithmetic
  adaptive/non-adaptive
  bit/byte/word-level
  single-value/chunk-based
  etc.
  We discuss a few.
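As a small illustration of the d-gap (delta) coding mentioned above, the following sketch converts an increasing docid list to gaps and back. The function names are hypothetical.

```python
from itertools import accumulate

def to_dgaps(docids):
    """Replace each docid by its difference from the previous one."""
    return [d - prev for d, prev in zip(docids, [0] + docids[:-1])]

def from_dgaps(gaps):
    """Running sum restores the original docids."""
    return list(accumulate(gaps))
```

For example, the docids 3, 7, 8, 15 become the gaps 3, 4, 1, 7. The gaps are smaller numbers than the docids, which is what the compression methods below exploit.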

Inverted file compression
  Unary: store k as k-1 one-bits followed by a final 0.
  Elias Gamma and Delta codes
    Gamma: store k1 = 1 + floor(log_2(k)) in unary, and k - 2^(k1-1) in binary with k1-1 bits.
    Delta: store k1 in Gamma and the remainder in binary.
  Other methods: Golomb, Rice, Interpolative codes, etc.
  Space efficient!
  More complex.
  Often more time-consuming.
  To get to a particular value, one has to decompress all previous values, or store more information in order to skip.

Inverted file compression
  VByte: uses 7 bits of a byte to store data and 1 bit to define boundaries.
  Simplest byte-level coding, very time efficient, less space efficient.
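The codes above can be sketched in a few lines of Python. Bit strings are used for unary and Gamma purely for readability. For VByte, the slides do not say which bit value marks a boundary, so this sketch assumes the high bit means "more bytes follow" and stores the low-order 7 bits first; the function names are hypothetical.

```python
def unary(k):
    """k >= 1 as k-1 one-bits followed by a zero."""
    return "1" * (k - 1) + "0"

def gamma(k):
    """Elias Gamma: length k1 in unary, then k - 2^(k1-1) in k1-1 bits."""
    k1 = k.bit_length()  # equals 1 + floor(log2(k)) for k >= 1
    rest = k - (1 << (k1 - 1))
    return unary(k1) + (format(rest, "b").zfill(k1 - 1) if k1 > 1 else "")

def vbyte_encode(n):
    """7 data bits per byte; high bit set on every byte except the last."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

def vbyte_decode(data):
    """Decode a concatenation of VByte-coded integers."""
    values, n, shift = [], 0, 0
    for byte in data:
        n |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(n)
            n, shift = 0, 0
    return values
```

Gamma spends 2 * k1 - 1 bits on k, so small gaps cost very few bits, whereas VByte always spends at least one full byte but decodes with simple shifts and masks.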

Inverted file compression
  Simple9 (word-level coding): uses 4 bits to store a selector code, and the remaining 28 bits to store data.
  According to [AM05], as fast as VByte and has better compression.
  Other methods: Simple16, Carryover methods.

Inverted file compression
  PFor and PForDelta
  Byte-level.
  Stores chunks of 128 entries.
  Super-scalar, loop-unrolling, almost branch-free, CPU and cache-efficient.
  Must-have of high-performance search engines (Lucene, LinkedIn).
  Main drawback: compulsory (forced) exceptions.
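The Simple9 scheme described above packs as many values as fit into the 28 data bits of each 32-bit word. The sketch below uses the nine layouts usually given for Simple9 and a greedy choice of layout; the layout order, the bit placement within the word, and the function names are assumptions rather than details from the slides. Values are assumed to be non-negative and smaller than 2^28.

```python
# (count, bits) layouts for the 28 data bits; the selector is the list index.
LAYOUTS = [(28, 1), (14, 2), (9, 3), (7, 4), (5, 5), (4, 7), (3, 9), (2, 14), (1, 28)]

def simple9_encode(values):
    """Greedily pick the densest layout that fits the next values."""
    words, i = [], 0
    while i < len(values):
        for selector, (count, bits) in enumerate(LAYOUTS):
            chunk = values[i:i + count]
            if len(chunk) == count and all(v < (1 << bits) for v in chunk):
                word = selector << 28  # selector in the top 4 bits
                for j, v in enumerate(chunk):
                    word |= v << (j * bits)
                words.append(word)
                i += count
                break
        else:
            raise ValueError("value does not fit in 28 bits")
    return words

def simple9_decode(words):
    """Read the selector, then unpack `count` values of `bits` bits each."""
    out = []
    for word in words:
        count, bits = LAYOUTS[word >> 28]
        mask = (1 << bits) - 1
        out.extend((word >> (j * bits)) & mask for j in range(count))
    return out
```

With this greedy choice, 28 gaps of value 1 occupy a single 32-bit word, while one large value forces a word of its own.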

Inverted file compression
  NewPFor and NewPForDelta
  Instead of exception offsets, store just all the bits we can store.
  Store the overflow bits and exception offsets as two Simple9-coded arrays.
  Other methods: OptPFor, PDict, 64-bit versions of methods, etc.

Basic inverted file
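The NewPFor idea described above (keep the low bits of every value in the fixed-width slots, and move the overflow bits and exception offsets to side arrays) can be sketched as follows. This is a simplified illustration only: it leaves the three arrays as plain Python lists rather than bit-packing them or Simple9-coding the side arrays, and the function names and the parameter `b` are hypothetical.

```python
def newpfor_split(chunk, b):
    """Split a chunk into b-bit slots plus exception side arrays."""
    mask = (1 << b) - 1
    low = [v & mask for v in chunk]                       # fits in b bits each
    offsets = [i for i, v in enumerate(chunk) if v >> b]  # positions of exceptions
    overflow = [chunk[i] >> b for i in offsets]           # bits that did not fit
    return low, offsets, overflow

def newpfor_join(low, offsets, overflow, b):
    """Patch the overflow bits back into the exception positions."""
    out = list(low)
    for i, high in zip(offsets, overflow):
        out[i] |= high << b
    return out
```

Because every slot holds real data bits, no slot has to be turned into an exception just to keep a chain of offsets alive, which is the "forced exceptions" drawback of the original PFor layout.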

Basic inverted file - buffering

Skipping
  We may store skip pointers in order to jump over some data.
  Simple and efficient for AND queries. OR queries?

Skip-Lists
  Moffat et al. have evaluated optimal skip distances for single- and multiple-level skipping. This work does not take compression or buffering into account.
  Boldi and Vigna: skip-towers.

Inverted file design for efficient skipping

Skipping
  Our inverted file iterator supports the following operations:
  skipto(docid): skip to the first document having a docid equal to or larger than the specified one
  next(): go to the next element
  getdocid()
  getfreq()
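The iterator interface above can be sketched over an uncompressed in-memory posting list with a single level of skip entries. The class name, the `skip_every` parameter and the in-memory layout are assumptions for illustration; the index described in these slides stores compressed chunks with up to three skip levels.

```python
import bisect

class PostingIterator:
    def __init__(self, postings, skip_every=4):
        self.docids = [d for d, _ in postings]
        self.freqs = [f for _, f in postings]
        self.step = skip_every
        # One skip level: the last docid of each full block of postings.
        self.skips = self.docids[skip_every - 1::skip_every]
        self.pos = -1  # positioned before the first posting

    def next(self):
        """Advance by one posting; False when the list is exhausted."""
        self.pos += 1
        return self.pos < len(self.docids)

    def skipto(self, docid):
        """Move to the first posting with docid >= the target, never backwards."""
        # Blocks whose last docid is below the target can be jumped over whole.
        block = bisect.bisect_left(self.skips, docid)
        self.pos = max(self.pos, block * self.step)
        while self.pos < len(self.docids) and self.docids[self.pos] < docid:
            self.pos += 1
        return self.pos < len(self.docids)

    def getdocid(self):
        return self.docids[self.pos]

    def getfreq(self):
        return self.freqs[self.pos]
```

In a compressed index the payoff is that a skipped block never has to be decompressed; here the skipped blocks are simply not scanned.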

TAAT Processing


[Figure: TAAT processing. Labels: remaining pointers, processed pointers, accumulator count increase]

TAAT Processing: AND

DAAT Processing

DAAT Processing: OR + Skipping
  An example

DAAT Processing: OR + Skipping
  Requirement Set Property: only posting lists whose accumulated maximum score is greater than or equal to the score of the current lowest-scored result can initiate candidates.
  Partial Ranking Property: if the current partial score plus the remaining accumulated maximum score is less than the score of the current lowest-scored result, we can discard this candidate.
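The two properties above can be sketched as checks inside a small top-k loop. This is an illustrative reading of the slide, not the authors' implementation: posting lists are held in memory with precomputed per-posting scores, dictionary lookups stand in for skipto(), and the names `top_k_or`, `acc_max` and `threshold` are hypothetical.

```python
import heapq

def top_k_or(postings, max_scores, k):
    """postings[i]: list of (docid, score); max_scores[i]: upper bound for list i."""
    order = sorted(range(len(postings)), key=lambda i: max_scores[i])
    lists = [postings[i] for i in order]      # ascending maximum score
    lookup = [dict(l) for l in lists]         # stand-in for skipto()
    acc_max, total = [], 0.0
    for i in order:
        total += max_scores[i]
        acc_max.append(total)                 # bound for lists 0..j together

    containing = {}                           # docid -> lists that contain it
    for j, l in enumerate(lists):
        for docid, _ in l:
            containing.setdefault(docid, []).append(j)

    heap = []                                 # min-heap of (score, docid)
    for docid in sorted(containing):
        threshold = heap[0][0] if len(heap) == k else 0.0
        # Requirement set property: some list holding this document must be
        # able to initiate a candidate.
        if not any(acc_max[j] >= threshold for j in containing[docid]):
            continue
        score = 0.0
        for j in reversed(range(len(lists))): # highest maximum score first
            score += lookup[j].get(docid, 0.0)
            remaining = acc_max[j - 1] if j > 0 else 0.0
            # Partial ranking property: the candidate can no longer make it.
            if score + remaining < threshold:
                break
        else:
            if len(heap) < k:
                heapq.heappush(heap, (score, docid))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, docid))
    return sorted(heap, reverse=True)
```

As the threshold rises, more of the low-scoring lists drop out of the requirement set, so documents that occur only in those lists are never scored at all.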

DAAT Processing: OR + Skipping
[Figure: illustration of the Partial Ranking Property and the Requirement Set Property]

Experimental Results: Index
  426GB TREC GOV2 document corpus.
  Stemmed (Snowball) and stop-word processed.
  15.4 million unique terms, 25.2 million documents, 4.7 billion pointers, 16.3 billion tokens.
  Our final index without skipping is 5.977GB.
  Skipping adds 87.1MB (1.42% increase).
  15.1 million posting lists with zero skip levels, 279647 posting lists with one skip level, 15201 with two levels, and only 377 with three levels.
  The corresponding index built by the Terrier Search Engine (v2.1) is 8.6GB (bit-level compression, no skipping).

Experimental Results: Querying
  Queries: Terabyte Track 05 Efficiency Topics; the first 10000 queries with more than one matching term.
  Platform: Intel Core 2 Quad 2.66GHz processor, 8GB RAM, 1TB 7200RPM SATA2 disk, GNU/Linux, Java 6.
  Other: we use 16KB blocks for buffering.

Experimental Results: Querying
[Figure: values shown on the chart: 348, 260, 93.6, 51.8, 44.44]

Experimental Results: Querying
[Figure: values shown on the chart: 283, 120, 90.7, 47.8]

Conclusions
  We have designed an efficient skipping structure for a chunk-wise compressed index.
  We have designed, implemented and evaluated two efficient algorithms that apply index skipping to disjunctive queries.
  Both methods achieve more than a 3.5-fold speed-up compared to a full evaluation.

Related work and bibliography list

Inverted Indexes and Query Optimization
[AM05] Vo Ngoc Anh and Alistair Moffat. Inverted index compression using word-aligned binary codes. Inf. Retr., 8:151-166, January 2005.
[LMWZ05] N. Lester, A. Moffat, W. Webber, and J. Zobel. Space-limited ranked query evaluation using adaptive pruning. In Proc. WISE, pages 470-477, 2005.
[OAP+06] I. Ounis, G. Amati, V. Plachouras, B. He, C. Macdonald, and C. Lioma. Terrier: A High Performance and Scalable Information Retrieval Platform. In OSIR Workshop, SIGIR, 2006.
[STC05] T. Strohman, H. Turtle, and W. Croft. Optimization strategies for complex queries. In Proc. SIGIR, pages 219-225. ACM, 2005.
[TF95] H. Turtle and J. Flood. Query evaluation: strategies and optimizations. Inf. Process. Manage., 31(6):831-850, 1995.
[ZM06] J. Zobel and A. Moffat. Inverted files for text search engines. ACM Comput. Surv., 38(2), 2006.

Inverted File Compression
[AM10] Vo Ngoc Anh and Alistair Moffat. Index compression using 64-bit words. Softw. Pract. Exper., 40:131-147, February 2010.
[DHYS08] S. Ding, J. He, H. Yan, and T. Suel. Using graphics processors for high-performance IR query processing. In Proc. WWW, pages 1213-1214. ACM, 2008.
[WMB99] Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, San Francisco, CA, 2nd edition, 1999.
[YDS09] H. Yan, S. Ding, and T. Suel. Inverted index compression and query processing with optimized document ordering. In Proc. WWW, pages 401-410. ACM, 2009.
[ZHNB06] M. Zukowski, S. Heman, N. Nes, and P. Boncz. Super-scalar RAM-CPU cache compression. In Proc. ICDE, page 59. IEEE Computer Society, 2006.
[ZLS08] J. Zhang, X. Long, and T. Suel. Performance of compressed inverted list caching in search engines. In Proc. WWW, pages 387-396. ACM, 2008.

Skipping
[BC07] S. Büttcher and C. Clarke. Index compression is good, especially for random access. In Proc. CIKM, pages 761-770. ACM, 2007.
[BV05] P. Boldi and S. Vigna. Compressed perfect embedded skip lists for quick inverted-index lookups. In Proc. SPIRE, pages 25-28. Springer-Verlag, 2005.
[CLMP08] F. Chierichetti, S. Lattanzi, F. Mari, and A. Panconesi. On placing skips optimally in expectation. In Proc. WSDM, pages 15-24. ACM, 2008.
[MZ96] A. Moffat and J. Zobel. Self-indexing inverted files for fast text retrieval. ACM Trans. Inf. Syst., 14(4):349-379, 1996.

Thank you!

Efficient Compressed Inverted Index Skipping for Disjunctive Text-Queries. Simon Jonassen and Svein Erik Bratsberg. Proceedings of the 33rd European Conference on Information Retrieval (ECIR'11), Dublin, Ireland, 18-21 April 2011.