The anatomy of a large-scale small search engine: Efficient index organization and query processing


1 The anatomy of a large-scale small search engine: Efficient index organization and query processing

Simon Jonassen
Department of Computer and Information Science
Norwegian University of Science and Technology
TDT4215 Web Intelligence, NTNU
17 March, 2011

State of the Art
1. Design a self-skipping index structure specifically for NewPForDelta compression.
2. Provide an efficient query processing method for disjunctive (OR) queries.

2 Outline
Introduction to inverted indexes
Main index components and organization
Inverted index skipping
Efficient query processing
Experimental results
Conclusions
Related work and bibliography list

Search Engine Basics

3 Inverted Index Approach to IR

Query Processing Modes
Document-At-A-Time (DAAT): a document has to be fully matched and scored for all of the query terms before any other document is considered.
Term-At-A-Time (TAAT): a term's posting list has to be fully processed before any other term is considered.
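The two processing modes can be contrasted with a minimal sketch for disjunctive (OR) queries. The toy index, the term names and the placeholder `score` function are assumptions made for illustration; they are not taken from the slides.

```python
import heapq
from collections import defaultdict

# Toy index (made up): term -> list of (doc_id, term_frequency), sorted by doc_id.
INDEX = {
    "apple": [(1, 2), (3, 1), (7, 4)],
    "store": [(3, 2), (5, 1), (7, 1)],
}


def score(tf):
    # Placeholder scoring function (assumption): the raw term frequency.
    return float(tf)


def taat_or(index, terms):
    """Term-at-a-time: finish one posting list before touching the next."""
    accumulators = defaultdict(float)
    for term in terms:
        for doc_id, tf in index.get(term, []):
            accumulators[doc_id] += score(tf)
    return dict(accumulators)


def daat_or(index, terms):
    """Document-at-a-time: fully score one document before moving on."""
    lists = [index.get(term, []) for term in terms]
    # Heap entries: (current doc_id, list number, position in that list).
    heap = [(pl[0][0], i, 0) for i, pl in enumerate(lists) if pl]
    heapq.heapify(heap)
    results = {}
    while heap:
        current = heap[0][0]
        total = 0.0
        # Pop every list that is positioned on the current document.
        while heap and heap[0][0] == current:
            _, i, pos = heapq.heappop(heap)
            total += score(lists[i][pos][1])
            if pos + 1 < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos + 1][0], i, pos + 1))
        results[current] = total
    return results
```

Both functions return the same scores. TAAT needs one accumulator per matching document, while DAAT only keeps one cursor per query term and can emit final scores in docid order.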

4 Query Matching Modes
Conjunctive (AND) queries: a document has to match ALL of the query terms.
Disjunctive (OR) queries: a document has to match ANY of the query terms. These are normally more time-consuming than AND queries.

Statistical Similarity Scoring Models
Cosine similarity, TFxIDF, Robertson TFxIDF, Okapi BM-25

5 Statistical Similarity Scoring Models
Cosine, TFxIDF, Okapi BM25, etc. These models draw on the following statistics:
Term frequency: number of times a term occurs in a document.
Document frequency (posting list length): number of documents a term occurs in.
Collection frequency: total number of occurrences of a term within the document collection.
Key/query frequency: number of occurrences of a term in the query.
Document length: number of tokens in the document.
Total number of documents.
Total number of unique terms.
Total number of postings (posting list entries).
Total number of tokens (word occurrences).
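As an illustration of how these statistics combine, here is a minimal sketch of a per-term Okapi BM25 score. The slides do not state which BM25 variant or parameters are used, so the smoothed IDF form and the defaults `k1=1.2` and `b=0.75` are assumptions; the query-frequency component is left out for brevity.

```python
import math


def bm25_term_score(tf, df, doc_len, num_docs, num_tokens, k1=1.2, b=0.75):
    """Score contribution of one query term for one document.

    tf         term frequency in the document
    df         document frequency (posting list length)
    doc_len    document length in tokens
    num_docs   total number of documents
    num_tokens total number of tokens in the collection
    k1, b      tuning parameters (assumed defaults, not from the slides)
    """
    avg_doc_len = num_tokens / num_docs
    # Smoothed IDF variant that stays positive even for very common terms.
    idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
    # Length normalisation: long documents need more occurrences to score well.
    norm = tf + k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / norm
```

A document's score for a query is the sum of this value over the query terms it contains. Everything the function needs is available from the statistics listed above.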

6 Main Index Components and Organization
Collection statistics: total numbers of documents, unique terms, tokens and postings.
Document dictionary: for each document, the document ID, name, length and URL.
Lexicon: for each term, the term ID, term string, document and collection frequencies, and a pointer to the posting list.
Inverted file (posting lists): for each term, the term ID, document IDs and term frequencies.

Index organization
Collection statistics can be stored as a .property (or .txt) file and read at start-up.
Both the document dictionary and the lexicon can be stored as two ordered sets (arrays) of constant-size records, or put into two different B-trees or similar.
DocDict (28B entries): ID (int, 4B), DocNo (20B), NumTokens (int, 4B).
Lexicon (40B entries): Term (20B), ID (int, 4B), DocFreq (int, 4B), TermFreq (int, 4B), EndPtr (long, 8B).
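The constant-size record layout can be sketched with Python's `struct` module. The field order follows the DocDict and Lexicon descriptions above; the big-endian byte order, the zero padding of strings and the helper names (`pack_doc`, `unpack_doc`, `build_lexicon`, `find_term`) are assumptions for illustration.

```python
import struct

# ">" means big-endian without alignment padding, so sizes match the slide.
DOC_RECORD = struct.Struct(">i20si")    # ID, DocNo, NumTokens -> 28 bytes
LEX_RECORD = struct.Struct(">20siiiq")  # Term, ID, DocFreq, TermFreq, EndPtr -> 40 bytes


def pack_doc(doc_id, doc_no, num_tokens):
    # struct pads the 20-byte string field with zero bytes.
    return DOC_RECORD.pack(doc_id, doc_no.encode("ascii"), num_tokens)


def unpack_doc(buf, index):
    """Random access to the index-th record of a packed document dictionary."""
    doc_id, doc_no, num_tokens = DOC_RECORD.unpack_from(buf, index * DOC_RECORD.size)
    return doc_id, doc_no.rstrip(b"\0").decode("ascii"), num_tokens


def _key(term):
    return term.encode("utf-8").ljust(20, b"\0")


def build_lexicon(entries):
    """entries: iterable of (term, term_id, doc_freq, term_freq, end_ptr)."""
    rows = sorted(entries, key=lambda e: _key(e[0]))
    return b"".join(LEX_RECORD.pack(_key(t), i, df, cf, ptr) for t, i, df, cf, ptr in rows)


def find_term(lexicon, term):
    """Binary search over the ordered array of constant-size records."""
    key = _key(term)
    lo, hi = 0, len(lexicon) // LEX_RECORD.size
    while lo < hi:
        mid = (lo + hi) // 2
        offset = mid * LEX_RECORD.size
        stored = lexicon[offset:offset + 20]
        if stored < key:
            lo = mid + 1
        elif stored > key:
            hi = mid
        else:
            return LEX_RECORD.unpack_from(lexicon, offset)[1:]
    return None
```

Because every record has the same size, the i-th entry sits at byte offset `i * size`, which is what makes both direct lookup by document ID and binary search by term possible without an extra index.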

7 Index organization: inverted file
Basic posting list: <termid <docid freq>*>*
Alternatively, if we want to store positions as well: <termid <docid freq <pos>*>*>*
<docid freq> postings are normally ordered by increasing docid.
Simplest way to reduce space: d-gaps and delta coding, i.e. store the differences between consecutive docids rather than the docids themselves.
Frequencies normally represent the number of occurrences in a document and are stored as integers.

Inverted file compression
Many different methods, which can be classified as:
parametric/non-parametric
dictionary/arithmetic
adaptive/non-adaptive
bit/byte/word-level
single-value/chunk-based
etc.
We discuss a few.
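A minimal sketch of d-gap coding, assuming docids are non-negative and strictly increasing. The function names are chosen for illustration.

```python
from itertools import accumulate


def to_dgaps(doc_ids):
    """Replace each docid by its difference to the previous one."""
    gaps, previous = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def from_dgaps(gaps):
    """Restore the docids with a running sum over the gaps."""
    return list(accumulate(gaps))


print(to_dgaps([3, 7, 8, 15, 40]))    # [3, 4, 1, 7, 25]
print(from_dgaps([3, 4, 1, 7, 25]))   # [3, 7, 8, 15, 40]
```

The gaps are small numbers for frequent terms, which is what lets the integer codes on the following slides spend few bits per posting.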

8 Inverted file compression
Unary: store k as k-1 one-bits followed by a final zero-bit.
Elias Gamma and Delta codes
Gamma: store k1 = 1 + floor(log_2(k)) in unary, then k - 2^(k1-1) in binary using k1-1 bits.
Delta: store k1 in Gamma, then the remainder in binary.
Other methods: Golomb, Rice, Interpolative codes, etc.
Space efficient!
More complex.
Often more time-consuming.
One has to decompress all previous values to get to a particular value, or store more information in order to skip.

Inverted file compression
VByte: uses 7 bits of each byte to store data and 1 bit to mark boundaries.
Simplest byte-level coding, very time efficient, less space efficient.
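The unary, Gamma and VByte codes can be sketched as follows. Bit strings are kept as Python `str` values for readability rather than speed. For VByte the slide only says that one bit marks boundaries; using the high bit as a "more bytes follow" flag with the low-order group first is one common convention and is an assumption here.

```python
def unary(k):
    """k >= 1 as k-1 one-bits and a final zero-bit."""
    return "1" * (k - 1) + "0"


def gamma_encode(k):
    """Elias Gamma: k1 in unary, then k - 2^(k1-1) in k1-1 binary digits."""
    k1 = k.bit_length()  # equals 1 + floor(log2(k)) for k >= 1
    rest = k - (1 << (k1 - 1))
    return unary(k1) + (format(rest, "b").zfill(k1 - 1) if k1 > 1 else "")


def gamma_decode(bits, pos=0):
    """Decode one value starting at pos; return (value, next position)."""
    k1 = 1
    while bits[pos] == "1":
        k1 += 1
        pos += 1
    pos += 1  # skip the terminating zero-bit of the unary part
    rest = int(bits[pos:pos + k1 - 1], 2) if k1 > 1 else 0
    return (1 << (k1 - 1)) + rest, pos + k1 - 1


def vbyte_encode(numbers):
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append((n & 0x7F) | 0x80)  # 7 data bits, high bit: more follows
            n >>= 7
        out.append(n)                       # last byte has the high bit clear
    return bytes(out)


def vbyte_decode(data):
    numbers, n, shift = [], 0, 0
    for byte in data:
        n |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            numbers.append(n)
            n, shift = 0, 0
    return numbers


print(gamma_encode(5))       # 11001
print(vbyte_encode([300]))   # b'\xac\x02'
```

Note that Gamma cannot represent 0, so d-gaps and frequencies (both at least 1 for strictly increasing docids) fit naturally, and that decoding either code requires reading the stream from the start of a run.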

9 Inverted file compression
Simple9 (word-level coding): uses 4 bits of each 32-bit word to store a selector code, and the remaining 28 bits to store data.
According to [AM05], it is as fast as VByte and compresses better.
Other methods: Simple16, Carryover methods.

Inverted file compression
PFor and PForDelta
Byte-level.
Store chunks of 128 entries.
Super-scalar, loop-unrolled, almost branch-free, CPU- and cache-efficient.
A must-have for high-performance search engines (Lucene, LinkedIn).
Main drawback: compulsory (forced) exceptions.
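A minimal sketch of Simple9, assuming the usual nine layouts for the 28 data bits and a selector stored in the top 4 bits of each word. The greedy choice of layout and the handling of the list tail (a layout is only used when enough values remain to fill it) are simplifications for illustration.

```python
# (values per word, bits per value) for selectors 0..8.
SELECTORS = [(28, 1), (14, 2), (9, 3), (7, 4), (5, 5), (4, 7), (3, 9), (2, 14), (1, 28)]


def simple9_encode(values):
    """Pack non-negative integers below 2^28 into 32-bit words."""
    words, pos = [], 0
    while pos < len(values):
        for selector, (count, bits) in enumerate(SELECTORS):
            chunk = values[pos:pos + count]
            # Take the densest layout that the next `count` values all fit into.
            if len(chunk) == count and all(0 <= v < (1 << bits) for v in chunk):
                word = selector << 28
                for i, v in enumerate(chunk):
                    word |= v << (i * bits)
                words.append(word)
                pos += count
                break
        else:
            raise ValueError("value does not fit into 28 bits")
    return words


def simple9_decode(words):
    values = []
    for word in words:
        count, bits = SELECTORS[word >> 28]
        mask = (1 << bits) - 1
        for i in range(count):
            values.append((word >> (i * bits)) & mask)
    return values
```

For example, 28 gaps of value 1 followed by the three values 300, 2 and 5 take two words: one with the 28 x 1-bit layout and one with the 3 x 9-bit layout.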

10 Inverted file compression
NewPFor and NewPForDelta
Instead of storing exception offsets in the slots, store as many bits of each value as fit.
Store the overflow bits and the exception offsets as two Simple9-coded arrays.
Other methods: OptPFor, PDict, 64-bit versions of the methods, etc.

Basic inverted file

11 Basic inverted file - buffering

12 Skipping
We may store skip pointers in order to jump over some data.
Simple and efficient for AND queries. What about OR queries?

Skip-Lists
Moffat et al. have evaluated optimal skip distances for single-level and multiple-level skipping. That work takes neither compression nor buffering into account.
Boldi and Vigna: skip towers.

13 Inverted file design for efficient skipping

14 Skipping
Our inverted file iterator supports the following operations:
skipto(docid): skip to the first posting whose docid is equal to or larger than the specified one
next(): go to the next element
getdocid()
getfreq()
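The iterator interface can be sketched over an in-memory posting list split into fixed-size chunks, with one skip entry (the last docid) per chunk. This is a single-level simplification without compression or buffering, not the multi-level structure of the talk; the class name, the `chunks_touched` counter and the Python-style method names (`skip_to` for skipto, `get_doc_id` for getdocid, `get_freq` for getfreq) are assumptions.

```python
from bisect import bisect_left


class PostingIterator:
    def __init__(self, postings, chunk_size=128):
        # postings: list of (doc_id, freq) sorted by doc_id.
        self.chunks = [postings[i:i + chunk_size]
                       for i in range(0, len(postings), chunk_size)]
        self.last_ids = [chunk[-1][0] for chunk in self.chunks]  # skip entries
        self.chunk = 0
        self.pos = -1                 # positioned before the first posting
        self.chunks_touched = set()   # stands in for "chunks decompressed"

    def _valid(self):
        return self.chunk < len(self.chunks)

    def next(self):
        """Advance by one posting; return False when the list is exhausted."""
        if not self._valid():
            return False
        self.pos += 1
        if self.pos >= len(self.chunks[self.chunk]):
            self.chunk += 1
            self.pos = 0
        if self._valid():
            self.chunks_touched.add(self.chunk)
            return True
        return False

    def skip_to(self, doc_id):
        """Move to the first posting with docid >= doc_id, never backwards."""
        if not self._valid():
            return False
        # First chunk, at or after the current one, whose last docid >= doc_id.
        target = bisect_left(self.last_ids, doc_id, self.chunk)
        if target >= len(self.chunks):
            self.chunk = len(self.chunks)
            return False
        self.pos = max(self.pos, 0) if target == self.chunk else 0
        self.chunk = target
        self.chunks_touched.add(target)
        chunk = self.chunks[target]
        while chunk[self.pos][0] < doc_id:  # terminates: last docid >= doc_id
            self.pos += 1
        return True

    def get_doc_id(self):
        return self.chunks[self.chunk][self.pos][0]

    def get_freq(self):
        return self.chunks[self.chunk][self.pos][1]
```

With 500 postings in chunks of 10, a call such as `skip_to(501)` inspects the skip entries and then touches a single chunk, where a plain scan would have read 25 chunks before reaching the same posting.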

15 TAAT Processing

16 TAAT Processing

17 TAAT Processing
[Figure: remaining pointers, processed pointers and accumulator count increase]

TAAT Processing: AND

18 DAAT Processing

19 DAAT Processing: OR+Skipping
An example

DAAT Processing: OR+Skipping
Requirement Set Property: only posting lists whose accumulated maximum score is greater than or equal to the score of the current lowest-scored result can initiate candidates.
Partial Ranking Property: if the current partial score plus the remaining accumulated maximum score is less than the score of the current lowest-scored result, we can discard the candidate.
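The two properties can be sketched as a top-k DAAT loop in the spirit of the max-score heuristic of [TF95]. This is an illustrative reconstruction under stated assumptions, not the authors' implementation: posting lists are in-memory lists of (docid, precomputed non-negative score) pairs, skipping is simulated with binary search, and the names `ListIterator` and `maxscore_or` are hypothetical.

```python
import heapq
from bisect import bisect_left


class ListIterator:
    """In-memory stand-in for a posting list iterator with skipping."""

    def __init__(self, postings):
        self.doc_ids = [d for d, _ in postings]
        self.scores = [s for _, s in postings]
        self.pos = 0

    def exhausted(self):
        return self.pos >= len(self.doc_ids)

    def skip_to(self, target):
        self.pos = bisect_left(self.doc_ids, target, self.pos)


def maxscore_or(posting_lists, k):
    """Return the k best (score, doc_id) pairs of an OR query, best first."""
    lists = sorted((pl for pl in posting_lists if pl),
                   key=lambda pl: max(s for _, s in pl), reverse=True)
    its = [ListIterator(pl) for pl in lists]
    # acc_max[i]: best possible score of a document found only in lists i, i+1, ...
    acc_max = [0.0] * (len(its) + 1)
    for i in range(len(its) - 1, -1, -1):
        acc_max[i] = acc_max[i + 1] + max(its[i].scores)
    heap = []             # min-heap holding the current top-k results
    required = len(its)   # lists [0, required) may initiate candidates
    while True:
        threshold = heap[0][0] if len(heap) == k else 0.0
        # Requirement set property: drop lists that cannot reach the threshold.
        while required > 0 and acc_max[required - 1] < threshold:
            required -= 1
        live = [it for it in its[:required] if not it.exhausted()]
        if not live:
            break
        candidate = min(it.doc_ids[it.pos] for it in live)
        score = 0.0
        for it in live:
            if it.doc_ids[it.pos] == candidate:
                score += it.scores[it.pos]
                it.pos += 1
        # Partial ranking property: stop as soon as the candidate cannot win.
        for i in range(required, len(its)):
            if score + acc_max[i] < threshold:
                break
            it = its[i]
            it.skip_to(candidate)
            if not it.exhausted() and it.doc_ids[it.pos] == candidate:
                score += it.scores[it.pos]
        if len(heap) < k:
            heapq.heappush(heap, (score, candidate))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, candidate))
    return sorted(heap, reverse=True)
```

As the heap fills and the threshold rises, low-scoring lists leave the requirement set and are only probed with `skip_to`, which is where a skipping index pays off for OR queries. The returned scores match a full evaluation; documents that merely tie with the k-th result may differ.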

20 DAAT Processing: OR+Skipping
[Figure: Partial Ranking Property and Requirement Set Property]

21 Experimental Results: Index
426GB TREC GOV2 document corpus.
Stemmed (Snowball) and stop-word processed.
15.4 million unique terms, million documents, 4.7 billion pointers, 16.3 billion tokens.
Our final index without skipping is 5.977GB.
Skipping adds 87.1MB (1.42% increase).
million posting lists with zero skip-levels, posting lists with one skip level, with two levels, only 377 with three levels.
The corresponding index built by the Terrier Search Engine (v 2.1) is 8.6GB (bit-level compression, no skipping).

Experimental Results: Querying
Terabyte Track 05 Efficiency Topics: first queries with number of matching terms greater than one.
Platform: Intel Core 2 Quad 2.66GHz processor, 8GB RAM, 1TB 7200RPM SATA2, GNU/Linux, Java 6.
Other: we use 16KB blocks for buffering.

22 Experimental Results: Querying

23 Experimental Results: Querying

24 Conclusions
We have designed an efficient skipping structure for a chunk-wise compressed index.
We have designed, implemented and evaluated two efficient algorithms that apply index skipping to disjunctive queries.
Both methods achieve more than a 3.5 times speed-up compared to a full evaluation.

25 Inverted Indexes and Query Optimization
[AM05] Vo Ngoc Anh and Alistair Moffat. Inverted index compression using word-aligned binary codes. Inf. Retr., 8, January 2005.
[LMWZ05] N. Lester, A. Moffat, W. Webber, and J. Zobel. Space-limited ranked query evaluation using adaptive pruning. In Proc. WISE, 2005.
[OAP+06] I. Ounis, G. Amati, V. Plachouras, B. He, C. Macdonald, and C. Lioma. Terrier: A High Performance and Scalable Information Retrieval Platform. In OSIR Workshop, SIGIR, 2006.
[STC05] T. Strohman, H. Turtle, and W. Croft. Optimization strategies for complex queries. In Proc. SIGIR. ACM, 2005.
[TF95] H. Turtle and J. Flood. Query evaluation: strategies and optimizations. Inf. Process. Manage., 31(6), 1995.
[ZM06] J. Zobel and A. Moffat. Inverted files for text search engines. ACM Comput. Surv., 38(2), 2006.

26 Inverted File Compression
[AM10] Vo Ngoc Anh and Alistair Moffat. Index compression using 64-bit words. Softw. Pract. Exper., 40, February 2010.
[DHYS08] S. Ding, J. He, H. Yan, and T. Suel. Using graphics processors for high-performance IR query processing. In Proc. WWW. ACM, 2008.
[WMB99] Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, San Francisco, CA, 2nd edition, 1999.
[YDS09] H. Yan, S. Ding, and T. Suel. Inverted index compression and query processing with optimized document ordering. In Proc. WWW. ACM, 2009.
[ZHNB06] M. Zukowski, S. Heman, N. Nes, and P. Boncz. Super-scalar RAM-CPU cache compression. In Proc. ICDE, page 59. IEEE Computer Society, 2006.
[ZLS08] J. Zhang, X. Long, and T. Suel. Performance of compressed inverted list caching in search engines. In Proc. WWW. ACM, 2008.

Skipping
[BC07] S. Büttcher and C. Clarke. Index compression is good, especially for random access. In Proc. CIKM. ACM, 2007.
[BV05] P. Boldi and S. Vigna. Compressed perfect embedded skip lists for quick inverted-index lookups. In Proc. SPIRE. Springer-Verlag, 2005.
[CLMP08] F. Chierichetti, S. Lattanzi, F. Mari, and A. Panconesi. On placing skips optimally in expectation. In Proc. WSDM. ACM, 2008.
[MZ96] A. Moffat and J. Zobel. Self-indexing inverted files for fast text retrieval. ACM Trans. Inf. Syst., 14(4), 1996.

27 Thank you!
Efficient Compressed Inverted Index Skipping for Disjunctive Text-Queries. Simon Jonassen and Svein Erik Bratsberg. Proceedings of the 33rd European Conference on Information Retrieval (ECIR'11), Dublin, Ireland, April 2011.
