The anatomy of a large-scale small search engine: Efficient index organization and query processing

Simon Jonassen
Department of Computer and Information Science
Norwegian University of Science and Technology
TDT4215 Web Intelligence, NTNU
17 March, 2011

State of the Art
1. Design a self-skipping index structure specifically for NewPForDelta compression.
2. Provide an efficient query processing method for disjunctive (OR) queries.

Inverted Index Approach to IR
[Figure omitted; surviving label: apple.com]

Query Processing Modes
Document-At-A-Time (DAAT)
  A document has to be fully matched and scored for all of the query terms before any other document is considered.
Term-At-A-Time (TAAT)
  A term's posting list has to be fully processed before any other term is considered.

Query Matching Modes
Conjunctive (AND) queries
  A document has to match ALL of the query terms.
Disjunctive (OR) queries
  A document has to match ANY of the query terms.
  (Normally more time-consuming than AND queries.)

Statistical Similarity Scoring Models
  Cosine similarity
  TFxIDF
  Robertson TFxIDF
  Okapi BM-25

Statistical Similarity Scoring Models
Cosine, TFxIDF, Okapi BM25, etc.
  Term frequency: number of times a term occurs in a document
  Document frequency/posting list length: number of documents a term occurs in
  Collection frequency: total number of occurrences of a term within the document collection
  Key/query frequency: number of occurrences in the query
  Document length: number of tokens in the document
  Total number of documents
  Total number of unique terms
  Total number of postings (posting list entries)
  Total number of tokens (word occurrences)

Outline
  Introduction to inverted indexes
  Main index components and organization
  Inverted index skipping
  Efficient query processing
  Experimental results
  Conclusions
  Related work and bibliography list
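As a rough illustration of how the statistics listed under Statistical Similarity Scoring Models combine into a score, here is a minimal sketch of one common Okapi BM25 variant. The function name, the parameter names and the defaults k1 = 1.2 and b = 0.75 are assumptions made for this example and are not taken from the slides. The query-frequency component is left out.

```python
import math

def bm25_term_score(tf, df, doc_len, avg_doc_len, num_docs, k1=1.2, b=0.75):
    """Contribution of one query term to one document's score.

    tf: term frequency in the document
    df: document frequency (posting list length)
    doc_len: number of tokens in the document
    avg_doc_len: total number of tokens / total number of documents
    num_docs: total number of documents
    k1, b: tuning parameters (assumed common defaults)
    """
    # Inverse document frequency: rare terms weigh more.
    idf = math.log((num_docs - df + 0.5) / (df + 0.5))
    # Length normalization: long documents need more occurrences.
    norm = tf + k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / norm
```

A document's score for a query is then the sum of these contributions over the query terms it matches. The largest value a term's contribution can reach over its posting list is the kind of per-term maximum score that skipping-based pruning can use.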

Main Index Components and Organization
Collection statistics
  Total numbers of documents, unique terms, tokens and postings
Document dictionary
  For each document: document ID, name, length, URL.
Lexicon
  For each term: term ID, term string, document and collection frequencies, pointer to posting list
Inverted file/posting lists
  For each term: term ID, document IDs and term frequencies

Index organization
  Collection statistics can be stored as a .property (or .txt) file and read at start-up.
  Both the document dictionary and the lexicon can be stored as two ordered sets (arrays) of constant-size records, or put into two different B-trees or similar.
  DocDict (28B entries): ID (int, 4B), DocNo (20B), NumTokens (int, 4B).
  Lexicon (40B entries): Term (20B), ID (int, 4B), DocFreq (int, 4B), TermFreq (int, 4B), EndPtr (long, 8B).
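The constant-size record layout above can be sketched with Python's struct module. The little-endian byte order, the ASCII term encoding, the zero padding of the 20-byte fields and the helper names pack_lexicon and lookup are assumptions made for this illustration. The slides only give the field sizes.

```python
import struct

# "<" disables alignment padding, so the sizes match the slide exactly.
DOC_RECORD = struct.Struct("<i20si")    # ID, DocNo, NumTokens -> 28 bytes
LEX_RECORD = struct.Struct("<20siiiq")  # Term, ID, DocFreq, TermFreq, EndPtr -> 40 bytes

def pack_lexicon(entries):
    """entries: (term, term_id, doc_freq, term_freq, end_ptr) tuples, sorted by term."""
    return b"".join(
        LEX_RECORD.pack(term.encode("ascii"), term_id, df, tf, end_ptr)
        for term, term_id, df, tf, end_ptr in entries
    )

def lookup(lexicon_bytes, term):
    """Binary search over the ordered array of 40-byte records."""
    key = term.encode("ascii")
    lo, hi = 0, len(lexicon_bytes) // LEX_RECORD.size
    while lo < hi:
        mid = (lo + hi) // 2
        record = LEX_RECORD.unpack_from(lexicon_bytes, mid * LEX_RECORD.size)
        stored = record[0].rstrip(b"\0")  # drop the zero padding of the term field
        if stored == key:
            return record[1:]  # (ID, DocFreq, TermFreq, EndPtr)
        if stored < key:
            lo = mid + 1
        else:
            hi = mid
    return None
```

Because every record has the same size, the i-th entry starts at byte offset i times the record size, which is what makes binary search on the raw array possible without a separate offset table.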

Index organization: inverted file
  Basic posting list: <termid <docid freq>*>*
  Alternatively, if we want to store positions as well: <termid <docid freq <pos>*>*>*
  <docid freq> postings are normally ordered by increasing docid.
  Simplest way to reduce space: d-gaps or delta-coding, that is, store differences between docids rather than the docids themselves.
  Frequencies normally represent the number of occurrences in a document and are stored as integers.

Inverted file compression
Many different methods, which can be separated into:
  parametric/non-parametric
  dictionary/arithmetic
  adaptive/non-adaptive
  bit/byte/word-level
  single-value/chunk-based
  etc.
We discuss a few.
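The d-gap idea can be sketched in a few lines. The function names are made up for this example, and the first gap is taken relative to an assumed starting docid of 0.

```python
from itertools import accumulate

def to_dgaps(docids):
    """Replace each docid (ascending order) by its difference to the previous one."""
    return [d - prev for d, prev in zip(docids, [0] + docids[:-1])]

def from_dgaps(gaps):
    """Restore the docids by taking running sums of the gaps."""
    return list(accumulate(gaps))
```

For example, the docids 3, 7, 8, 15 become the gaps 3, 4, 1, 7. The gaps are smaller numbers than the docids, which is what the compression methods on the following slides exploit.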

Inverted file compression
Unary
  Store k as k-1 ones followed by a final 0.
Elias Gamma and Delta codes
  Gamma: store k1 = 1 + floor(log_2(k)) in unary, then k - 2^(k1-1) in binary with k1-1 bits.
  Delta: store k1 in Gamma, then the remainder in binary.
Other methods: Golomb, Rice, Interpolative codes, etc.
Space efficient!
More complex
Often more time-consuming
Has to decompress all previous values to get to a particular value, or store more information in order to skip.

Inverted file compression
VByte
  Uses 7 bits of a byte to store data and 1 bit to define boundaries.
  Simplest byte-level coding, very time efficient, less space efficient.
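The bit-level codes and VByte described above can be sketched as follows. Bit strings are returned as Python strings of "0" and "1" for readability rather than packed into bytes. For VByte, the choice to set the high bit on every byte except the last one of a value is an assumption of this sketch, since the slides do not say which convention is used. All function names are illustrative.

```python
def unary(k):
    """k >= 1 as k-1 ones and a final zero."""
    return "1" * (k - 1) + "0"

def elias_gamma(k):
    """k1 = 1 + floor(log2 k) in unary, then k - 2^(k1-1) in k1-1 binary bits."""
    k1 = k.bit_length()
    remainder = k - (1 << (k1 - 1))
    tail = format(remainder, "b").zfill(k1 - 1) if k1 > 1 else ""
    return unary(k1) + tail

def elias_delta(k):
    """Like gamma, but k1 itself is gamma-coded instead of unary-coded."""
    k1 = k.bit_length()
    remainder = k - (1 << (k1 - 1))
    tail = format(remainder, "b").zfill(k1 - 1) if k1 > 1 else ""
    return elias_gamma(k1) + tail

def vbyte_encode(n):
    """7 data bits per byte; high bit set means more bytes follow (assumed convention)."""
    out = bytearray()
    while n >= 128:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

def vbyte_decode(data):
    """Decode a concatenation of VByte-coded values."""
    values, n, shift = [], 0, 0
    for byte in data:
        n |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(n)
            n, shift = 0, 0
    return values
```

Note that the gamma and delta decoders have to read the bit stream sequentially, which is the reason given above for why reaching a particular value requires decompressing everything before it unless extra skip information is stored.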

Inverted file compression
Simple9 (word-level coding)
  Uses 4 bits to store a selector code, and the remaining 28 bits to store data.
  According to [AM05], as fast as VByte and has better compression.
  Other methods: Simple16, Carryover methods.

Inverted file compression
PFor and PForDelta
  Byte-level
  Stores chunks of 128 entries.
  Super-scalar, loop-unrolling, almost branch-free, CPU and cache-efficient.
  Must-have of high-performance search engines (Lucene, LinkedIn).
  Main drawback: compulsory (forced) exceptions.
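The Simple9 word layout mentioned above (a 4-bit selector plus 28 data bits) can be sketched as below. The nine selector configurations are the standard ones for Simple9. Placing the selector in the top four bits and filling the data bits from the low end is a layout choice made for this example, and the function names are illustrative.

```python
# (number of values, bits per value) for each of the nine selectors.
SELECTORS = [(28, 1), (14, 2), (9, 3), (7, 4), (5, 5), (4, 7), (3, 9), (2, 14), (1, 28)]

def simple9_encode(values):
    """Greedily pack non-negative integers below 2**28 into 32-bit words."""
    words, i = [], 0
    while i < len(values):
        for selector, (count, bits) in enumerate(SELECTORS):
            chunk = values[i:i + count]
            # Use the densest configuration that is completely filled and fits.
            if len(chunk) == count and all(v < (1 << bits) for v in chunk):
                word = selector << 28
                for j, v in enumerate(chunk):
                    word |= v << (j * bits)
                words.append(word)
                i += count
                break
        else:
            raise ValueError("value does not fit in 28 bits")
    return words

def simple9_decode(words):
    """Unpack every word according to its selector."""
    out = []
    for word in words:
        count, bits = SELECTORS[word >> 28]
        mask = (1 << bits) - 1
        out.extend((word >> (j * bits)) & mask for j in range(count))
    return out
```

Since each configuration is used only when it can be filled completely, no padding values are written and decoding returns exactly the input sequence.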

Inverted file compression
NewPFor and NewPForDelta
  In each slot, instead of an exception offset, store just as many bits of the value as fit.
  Store the overflow bits and the exception offsets as two Simple9-coded arrays.
  Other methods: OptPFor, PDict, 64-bit versions of methods, etc.

Basic inverted file
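The NewPFor idea of keeping the low bits in place and moving overflow bits and exception offsets to side arrays can be sketched as follows. This is a simplified illustration only: the values are kept in plain Python lists, so the actual bit packing of the slots and the Simple9 coding of the two side arrays are omitted, and the function names and the choice of bit width b are assumptions.

```python
def newpfor_split(chunk, b):
    """Split a chunk into b low bits per value plus side arrays for exceptions."""
    mask = (1 << b) - 1
    low = [v & mask for v in chunk]                     # what fits in b bits
    offsets = [i for i, v in enumerate(chunk) if v > mask]  # positions of exceptions
    overflow = [chunk[i] >> b for i in offsets]         # the bits that did not fit
    return low, offsets, overflow

def newpfor_merge(low, offsets, overflow, b):
    """Rebuild the original values by patching the overflow bits back in."""
    values = list(low)
    for i, high in zip(offsets, overflow):
        values[i] |= high << b
    return values
```

Because the slots never hold offsets, a value only becomes an exception when it really needs more than b bits, which avoids the forced exceptions named as the main drawback of PFor.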

Basic inverted file - buffering

Skipping
  We may store skip-pointers in order to jump over some data.
  Simple and efficient for AND-queries.
  OR-queries?

Skip-Lists
  Moffat et al. have evaluated optimal skip-distances for one- and multiple-level skipping.
  Does not take compression or buffering into account.
  Boldi and Vigna: skip-towers.

Inverted file design for efficient skipping

Skipping
Our inverted file iterator supports the following operations:
  skipto(docid): skip to the first posting whose docid is equal to or larger than the specified one
  next(): go to the next element
  getdocid()
  getfreq()
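The iterator operations above could look roughly like this over an uncompressed, in-memory posting list with a single level of skip pointers. The class name, the has_current helper, the skip distance of 4 and the skip table layout (last docid of every block) are assumptions made for this sketch. The index described in these slides works on compressed chunks and may use several skip levels.

```python
import bisect

class PostingIterator:
    def __init__(self, docids, freqs, skip_distance=4):
        self.docids, self.freqs = docids, freqs
        self.dist = skip_distance
        self.pos = 0
        # One skip entry per block: the last docid stored in that block.
        self.block_last = [
            docids[min(start + skip_distance, len(docids)) - 1]
            for start in range(0, len(docids), skip_distance)
        ]

    def has_current(self):
        return self.pos < len(self.docids)

    def getdocid(self):
        return self.docids[self.pos]

    def getfreq(self):
        return self.freqs[self.pos]

    def next(self):
        self.pos += 1
        return self.has_current()

    def skipto(self, docid):
        # Find the first block that can contain docid, jump to it, then scan.
        block = bisect.bisect_left(self.block_last, docid)
        self.pos = max(self.pos, block * self.dist)
        while self.has_current() and self.docids[self.pos] < docid:
            self.pos += 1
        return self.has_current()

def intersect(a, b):
    """Conjunctive (AND) matching of two iterators using skipto."""
    matches = []
    while a.has_current() and b.has_current():
        da, db = a.getdocid(), b.getdocid()
        if da == db:
            matches.append(da)
            a.next()
            b.next()
        elif da < db:
            a.skipto(db)
        else:
            b.skipto(da)
    return matches
```

In the AND case every skipto call is safe, because a document missing from one list can never match. The OR case needs an additional argument for why skipped postings cannot matter.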

TAAT Processing

TAAT Processing
[Figure omitted; surviving labels: Remaining pointers, Processed pointers, Accumulator count increase]

TAAT Processing: AND
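Term-at-a-time processing with accumulators can be sketched as below, first for OR and then for AND. Each posting list is given as a pair of parallel lists (docids, scores), where the score is an already computed contribution of that posting. This input format, the function names and the choice to process the shortest list first in the AND case are assumptions of the sketch. No pruning or skipping is applied.

```python
import heapq
from collections import defaultdict

def taat_or(postings, k):
    """Full disjunctive evaluation: every posting updates an accumulator."""
    acc = defaultdict(int)
    for docids, scores in postings:      # one term at a time
        for docid, score in zip(docids, scores):
            acc[docid] += score          # may create a new accumulator
    return heapq.nlargest(k, ((score, docid) for docid, score in acc.items()))

def taat_and(postings, k):
    """Conjunctive evaluation: only the first list creates accumulators."""
    ordered = sorted(postings, key=lambda p: len(p[0]))  # shortest list first
    docids, scores = ordered[0]
    acc = dict(zip(docids, scores))
    for docids, scores in ordered[1:]:
        matched = dict(zip(docids, scores))
        # Keep only accumulators that also occur in the current list.
        acc = {d: s + matched[d] for d, s in acc.items() if d in matched}
    return heapq.nlargest(k, ((score, docid) for docid, score in acc.items()))
```

In the OR case the number of accumulators can only grow from term to term, while in the AND case it can only shrink, which is one way to see why OR queries are normally the more expensive ones.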

DAAT Processing

DAAT Processing: OR + Skipping
An example

DAAT Processing: OR + Skipping
Requirement Set Property: only posting lists whose accumulated maximum score is greater than or equal to the score of the current lowest-scored result can initiate candidates.
Partial Ranking Property: if the current partial score plus the remaining accumulated maximum score is less than the score of the current lowest-scored result, we can discard the candidate.
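The two properties above can be illustrated with a simplified document-at-a-time loop. This is a sketch under stated assumptions, not the implementation evaluated in these slides: posting lists are uncompressed (docids, scores) pairs with precomputed contributions, every list is non-empty, k is at least 1, bisect on an in-memory list stands in for skipto on the compressed index, and all names are made up for the example.

```python
import bisect
import heapq

def daat_or_skipping(postings, k):
    """Return the top-k (score, docid) pairs, best first."""
    # Order the lists by decreasing maximum score.
    lists = sorted(postings, key=lambda p: max(p[1]), reverse=True)
    n = len(lists)
    # acc_max[i]: upper bound on what lists i..n-1 can still contribute.
    acc_max = [0] * (n + 1)
    for i in range(n - 1, -1, -1):
        acc_max[i] = acc_max[i + 1] + max(lists[i][1])
    pos = [0] * n
    heap = []  # min-heap, so heap[0] is the current lowest-scored result
    while True:
        threshold = heap[0][0] if len(heap) == k else float("-inf")
        # Requirement Set Property: only these lists may initiate candidates.
        required = [i for i in range(n) if acc_max[i] >= threshold]
        current = [lists[i][0][pos[i]] for i in required if pos[i] < len(lists[i][0])]
        if not current:
            break
        cand = min(current)
        score, pruned = 0, False
        for i in range(n):
            # Partial Ranking Property: give up when the bound is too low.
            if score + acc_max[i] < threshold:
                pruned = True
                break
            docids, scores = lists[i]
            pos[i] = bisect.bisect_left(docids, cand, pos[i])  # skip to cand
            if pos[i] < len(docids) and docids[pos[i]] == cand:
                score += scores[pos[i]]
        # Move every list that sits on the candidate past it.
        for i in range(n):
            if pos[i] < len(lists[i][0]) and lists[i][0][pos[i]] == cand:
                pos[i] += 1
        if not pruned:
            if len(heap) < k:
                heapq.heappush(heap, (score, cand))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, cand))
    return sorted(heap, reverse=True)
```

As the lowest score in the result heap rises, low-scoring lists drop out of the requirement set. Their postings are then only touched through skips to candidates proposed by the remaining lists, so documents that occur only in those lists are never scored.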

DAAT Processing: OR + Skipping
[Figure omitted; surviving labels: Partial Ranking Property, Requirement Set Property]

Experimental Results: Index
  426GB TREC GOV2 document corpus
  Stemming (Snowball) and stop-word processing applied
  15.4 million unique terms, 25.2 million documents, 4.7 billion pointers, 16.3 billion tokens.
  Our final index without skipping is 5.977GB.
  Skipping adds 87.1MB (1.42% increase).
  15.1 million posting lists with zero skip levels, 279647 posting lists with one skip level, 15201 with two levels, and only 377 with three levels.
  The corresponding index built by the Terrier Search Engine (v 2.1) is 8.6GB (bit-level compression, no skipping).

Experimental Results: Querying
  Terabyte Track 05 Efficiency Topics
    First 10000 queries with number of matching terms greater than one.
  Platform
    Intel Core 2 Quad 2.66GHz processor, 8GB RAM, 1TB 7200RPM SATA2
    GNU/Linux, Java 6.
  Other: we use 16KB blocks for buffering.

Experimental Results: Querying
[Chart omitted; surviving values: 348, 260, 93.6, 51.8, 44.44]

Experimental Results: Querying
[Chart omitted; surviving values: 283, 120, 90.7, 47.8]

Conclusions
  We have designed an efficient skipping structure for a chunk-wise compressed index.
  We have designed, implemented and evaluated two efficient algorithms that apply index skipping to disjunctive queries.
  Both methods achieve more than 3.5 times speed-up compared to a full evaluation.

Related work and bibliography list

Inverted Indexes and Query Optimization
[AM05] Vo Ngoc Anh and Alistair Moffat. Inverted index compression using word-aligned binary codes. Inf. Retr., 8:151-166, January 2005.
[LMWZ05] N. Lester, A. Moffat, W. Webber, and J. Zobel. Space-limited ranked query evaluation using adaptive pruning. In Proc. WISE, pages 470-477, 2005.
[OAP+06] I. Ounis, G. Amati, V. Plachouras, B. He, C. Macdonald, and C. Lioma. Terrier: A High Performance and Scalable Information Retrieval Platform. In OSIR Workshop, SIGIR, 2006.
[STC05] T. Strohman, H. Turtle, and W. Croft. Optimization strategies for complex queries. In Proc. SIGIR, pages 219-225. ACM, 2005.
[TF95] H. Turtle and J. Flood. Query evaluation: strategies and optimizations. Inf. Process. Manage., 31(6):831-850, 1995.
[ZM06] J. Zobel and A. Moffat. Inverted files for text search engines. ACM Comput. Surv., 38(2), 2006.

Inverted File Compression
[AM10] Vo Ngoc Anh and Alistair Moffat. Index compression using 64-bit words. Softw. Pract. Exper., 40:131-147, February 2010.
[DHYS08] S. Ding, J. He, H. Yan, and T. Suel. Using graphics processors for high-performance IR query processing. In Proc. WWW, pages 1213-1214. ACM, 2008.
[WMB99] Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, San Francisco, CA, 2nd edition, 1999.
[YDS09] H. Yan, S. Ding, and T. Suel. Inverted index compression and query processing with optimized document ordering. In Proc. WWW, pages 401-410. ACM, 2009.
[ZHNB06] M. Zukowski, S. Heman, N. Nes, and P. Boncz. Super-scalar RAM-CPU cache compression. In Proc. ICDE, page 59. IEEE Computer Society, 2006.
[ZLS08] J. Zhang, X. Long, and T. Suel. Performance of compressed inverted list caching in search engines. In Proc. WWW, pages 387-396. ACM, 2008.

Skipping
[BC07] S. Büttcher and C. Clarke. Index compression is good, especially for random access. In Proc. CIKM, pages 761-770. ACM, 2007.
[BV05] P. Boldi and S. Vigna. Compressed perfect embedded skip lists for quick inverted-index lookups. In Proc. SPIRE, pages 25-28. Springer-Verlag, 2005.
[CLMP08] F. Chierichetti, S. Lattanzi, F. Mari, and A. Panconesi. On placing skips optimally in expectation. In Proc. WSDM, pages 15-24. ACM, 2008.
[MZ96] A. Moffat and J. Zobel. Self-indexing inverted files for fast text retrieval. ACM Trans. Inf. Syst., 14(4):349-379, 1996.

Thank you!

Efficient Compressed Inverted Index Skipping for Disjunctive Text-Queries.
Simon Jonassen and Svein Erik Bratsberg.
Proceedings of the 33rd European Conference on Information Retrieval (ECIR '11), Dublin, Ireland, 18-21 April 2011.