V.2 Index Compression


Heaps' law (empirically observed and postulated): size of the vocabulary (distinct terms) in a corpus

$E[\text{distinct terms in corpus}] = \alpha \cdot n^{\beta}$

with total number of term occurrences $n$ and constants $\alpha$, $\beta$ ($\beta < 1$), classically $\alpha \approx 20$, $\beta \approx 0.5$ (~3 million terms for 20 billion docs).

Zipf's law (empirically observed and postulated): relative frequencies of terms in the corpus

$P[k\text{-th most popular term has rel. freq. } x] \sim (1/k)^{\theta}$

with parameter $\theta$, classically set to 1.

Both laws strongly suggest opportunities for compression!

IR&DM, WS'11/12, December 1, 2011
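As a small worked check of the two laws, the following sketch plugs in the classical constants quoted on the slide. The function names and the choice to normalize the Zipf weights over a finite vocabulary of `num_terms` terms are assumptions made for illustration only.

```python
def heaps_vocabulary_size(n, alpha=20.0, beta=0.5):
    """Expected number of distinct terms for n term occurrences (Heaps' law)."""
    return alpha * n ** beta


def zipf_relative_frequencies(num_terms, theta=1.0):
    """Relative frequency of the k-th most popular term, k = 1..num_terms.

    Assumption: weights (1/k)^theta are normalized so that they sum to 1.
    """
    weights = [(1.0 / k) ** theta for k in range(1, num_terms + 1)]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # 20 billion occurrences with alpha = 20, beta = 0.5: roughly 3 million terms.
    print(round(heaps_vocabulary_size(20e9)))
    # With theta = 1 the top term is twice as frequent as the second one.
    freqs = zipf_relative_frequencies(1000)
    print(freqs[0] / freqs[1])
```

The skew is what matters for compression: a few very frequent terms have long posting lists with small docid gaps, and small numbers are cheap to encode.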

Recap: Huffman Coding

Variable-length prefix code based on frequency analysis of the underlying distribution of symbols (e.g., words or tokens) in a text.
Key idea: choose the shortest codeword for the most frequent symbol.

| Symbol x | Frequency f(x) | Huffman Encoding |
| a        | 0.8            | 0                |
| peter    | 0.1            | 10               |
| picked   | 0.07           | 110              |
| peck     | 0.03           | 111              |

[Figure: Huffman tree with inner nodes 1 and 11 and leaves 0 (a), 10 (peter), 110 (picked), 111 (peck)]

Let f(x) be the probability (or relative frequency) of the x-th symbol in some text d. The entropy of the text (or of the underlying probability distribution f) is:

$H(d) = \sum_{x} f(x) \log_2 \frac{1}{f(x)}$

H(d) is a lower bound for the average (i.e., expected) number of bits per symbol needed with optimal compression. Huffman comes close to H(d).
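The relation between entropy and Huffman code length can be checked on the four-symbol example above. This is a minimal sketch: it only computes code lengths (not the actual bit patterns, which depend on how ties and left/right branches are assigned), and the function names are illustrative.

```python
import heapq
import math


def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code over freqs."""
    # Heap entries: (weight, tie-breaker, {symbol: depth so far}).
    heap = [(f, i, {sym: 0}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every contained symbol one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]


def entropy(freqs):
    """H(d) = sum over x of f(x) * log2(1 / f(x))."""
    return sum(f * math.log2(1.0 / f) for f in freqs.values() if f > 0)


FREQS = {"a": 0.8, "peter": 0.1, "picked": 0.07, "peck": 0.03}

if __name__ == "__main__":
    lengths = huffman_code_lengths(FREQS)
    average = sum(FREQS[s] * lengths[s] for s in FREQS)
    print(lengths)
    print("average bits/symbol:", average, "entropy:", entropy(FREQS))
```

For this distribution the code lengths are 1, 2, 3 and 3 bits, giving about 1.30 bits per symbol on average against an entropy of about 1.01 bits.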

Overview of Compression Techniques

Dictionary-based encoding schemes:
  Ziv-Lempel: LZ77 (basis of the Zip family of encodings, e.g., GZip)

Variable-length encoding schemes:
  Variable-Byte encoding (byte-aligned)
  Gamma, Golomb/Rice (bit-aligned)
  S16 (word-aligned: actually creates entire 32- or 64-bit words)
  P-FOR-Delta (bit-aligned, with extra space for "exceptions")
  Interpolative Coding (IPC) (bit-aligned, can actually plug in various schemes for binary code)

Ziv-Lempel Compression

LZ77 (Adaptive Dictionary) and further variants:
Scan the text and identify in a lookahead window the longest string that occurs repeatedly and is contained in a backward window.
Replace this string by a pointer to its previous occurrence.

Encode the text into a list of triples <back, count, new> where
  back is the backward distance to a prior occurrence of the string that starts at the current position,
  count is the length of this repeated string, and
  new is the next symbol that follows the repeated string.

Triples themselves can be further encoded (with variable length).
Better variants use an explicit dictionary with statistical analysis (need to scan the text twice).

Example: Ziv-Lempel Compression

peter_piper_picked_a_peck_of_pickled_peppers

<0, 0, p> for character 1: p
<0, 0, e> for character 2: e
<0, 0, t> for character 3: t
<-2, 1, r> for characters 4-5: er
<0, 0, _> for character 6: _
<-6, 1, i> for characters 7-8: pi
<-8, 2, r> for characters 9-11: per
<-6, 3, c> for characters 12-15: _pic
<0, 0, k> for character 16: k
<-7, 1, d> for characters 17-18: ed
...

Great for text, but not appropriate for index lists.
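The triples in this example can be reproduced with a small greedy encoder. This is a sketch under stated assumptions, not a production LZ77: the window size, the rule of preferring the nearest occurrence among equally long matches, and the function names are choices made here so that the output matches the example.

```python
def lz77_encode(text, window=32):
    """Encode text as a list of (back, count, new) triples."""
    triples = []
    i = 0
    while i < len(text):
        best_len, best_back = 0, 0
        for start in range(max(0, i - window), i):
            length = 0
            # Always leave one character to be emitted as `new`.
            while (i + length < len(text) - 1
                   and text[start + length] == text[i + length]):
                length += 1
            # `>=` prefers the nearest occurrence among equally long matches.
            if length > 0 and length >= best_len:
                best_len, best_back = length, start - i
        triples.append((best_back, best_len, text[i + best_len]))
        i += best_len + 1
    return triples


def lz77_decode(triples):
    """Rebuild the text by copying `count` symbols from `back`, then `new`."""
    out = []
    for back, count, new in triples:
        start = len(out) + back
        for k in range(count):
            out.append(out[start + k])  # symbol-wise copy allows overlaps
        out.append(new)
    return "".join(out)


if __name__ == "__main__":
    text = "peter_piper_picked_a_peck_of_pickled_peppers"
    triples = lz77_encode(text)
    print(triples[:10])
    print(lz77_decode(triples) == text)
```

The first ten triples cover `peter_piper_picked` and agree with the list above. Docid lists rarely contain repeated substrings of this kind, which is why the slide rules the scheme out for index lists.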

Variable-Byte Encoding

Encode a sequence of numbers into variable-length bytes, using one status bit per byte to indicate whether the current number expands into the next byte (1 status bit + 7 data bits per byte).

Example: to encode the decimal number 12038, write:

| 1st 8-bit word = 1 byte | 2nd 8-bit word = 1 byte |
| 1 1011110               | 0 0000110               |

Thus it needs 2 bytes instead of 4 bytes (regular 32-bit integer)!
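A minimal sketch of this scheme follows, assuming the convention of the example above: the high bit of a byte is the status bit (1 means the number continues in the next byte) and the most significant 7-bit group comes first. Other systems use the opposite bit or byte order; the function names are illustrative.

```python
def vbyte_encode(number):
    """Encode one non-negative integer into variable-length bytes."""
    groups = []
    while True:
        groups.append(number & 0x7F)  # lowest 7 data bits
        number >>= 7
        if number == 0:
            break
    groups.reverse()  # most significant group first
    # Set the status bit on every byte except the last one.
    return bytes([g | 0x80 for g in groups[:-1]] + [groups[-1]])


def vbyte_decode(data):
    """Decode a byte string holding a sequence of encoded integers."""
    numbers, current = [], 0
    for byte in data:
        current = (current << 7) | (byte & 0x7F)
        if not byte & 0x80:  # status bit 0: number is complete
            numbers.append(current)
            current = 0
    return numbers


if __name__ == "__main__":
    encoded = vbyte_encode(12038)
    print([format(b, "08b") for b in encoded])
    print(vbyte_decode(encoded))
```

Because every code is a whole number of bytes, decoding needs only shifts and masks, at the price of spending at least 8 bits even on a gap of 1.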

Gamma Encoding

Delta-encode gaps in inverted lists (successive doc ids):

Unary coding: a number k is encoded by k times 0 followed by one 1 (k + 1 bits); good for short gaps. Gamma coding below uses it for k = $\lfloor \log_2 x \rfloor$.

Binary coding: a gap of size x is encoded by the binary representation of the number x ($\lfloor \log_2 x \rfloor + 1$ bits); good for long gaps.

Gamma (γ) coding: length := $\lfloor \log_2 x \rfloor$ in unary, followed by offset := $x - 2^{\lfloor \log_2 x \rfloor}$ in binary.
Results in $1 + \lfloor \log_2 x \rfloor + \lfloor \log_2 x \rfloor$ bits per input number x.

Generalization: Golomb/Rice code (optimal for geometrically distributed x).
Still need to pack variable-length codes into bytes or words.

Example: Gamma Encoding

| Number x                       | Gamma Encoding |
| 1 = 2^0                        | 1              |
| 5 = 2^2 + 2^0                  | 00101          |
| 15 = 2^3 + 2^2 + 2^1 + 2^0     | 0001111        |
| 16 = 2^4                       | 000010000      |

Particularly useful when:
  the distribution of numbers (incl. the largest number) is not known ahead of time;
  small values (e.g., delta-encoded docids or low TF*IDF scores) are frequent.
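The table can be reproduced with a few lines. This sketch represents codes as strings of '0' and '1' for readability (a real index would pack them into machine words), and the helper names are illustrative. It relies on the observation that the unary length followed by the offset equals $\lfloor \log_2 x \rfloor$ zeros followed by the full binary representation of x.

```python
def gamma_encode(x):
    """Gamma code of a positive integer x as a bit string."""
    if x < 1:
        raise ValueError("gamma coding is defined for x >= 1")
    binary = bin(x)[2:]  # leading 1 followed by the offset bits
    return "0" * (len(binary) - 1) + binary


def gamma_decode(bits):
    """Decode a concatenation of gamma codes into a list of integers."""
    numbers, pos = [], 0
    while pos < len(bits):
        zeros = 0
        while bits[pos] == "0":  # unary part: count zeros up to the 1
            zeros += 1
            pos += 1
        numbers.append(int(bits[pos:pos + zeros + 1], 2))
        pos += zeros + 1
    return numbers


def to_gaps(docids):
    """Delta-encode a strictly increasing docid list (first id kept as is)."""
    return [docids[0]] + [b - a for a, b in zip(docids, docids[1:])]


if __name__ == "__main__":
    for x in (1, 5, 15, 16):
        print(x, gamma_encode(x))
    gaps = to_gaps([3, 4, 9, 24, 40])
    print(gaps, "".join(gamma_encode(g) for g in gaps))
```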

Example: Golomb/Rice Encoding

For a tunable parameter M, split the input number x into:
  Quotient part q := $\lfloor x / M \rfloor$, stored in unary code (using q × 1 + 1 × 0)
  Remainder part r := (x mod M), stored in binary code

If M is chosen as a power of 2, then r needs $\log_2 M$ bits (Rice encoding);
else set b := $\lceil \log_2 M \rceil$:
  if $r < 2^b - M$, then code r as a plain binary number using b-1 bits,
  else code the number $r + 2^b - M$ in plain binary representation using b bits.

Example with M = 10, b = 4:

| Number x | q | Output bits for q | r | Binary (b bits) | Output bits for r |
| 0        | 0 | 0                 | 0 | 0000            | 000               |
| 33       | 3 | 1110              | 3 | 0011            | 011               |
| 57       | 5 | 111110            | 7 | 1101            | 1101              |
| 99       | 9 | 1111111110        | 9 | 1111            | 1111              |
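The rules above translate directly into code. The sketch below returns the code as a bit string and reproduces the M = 10 rows of the example; the function name and the string representation are illustrative choices, not part of the original slides.

```python
def golomb_encode(x, m):
    """Golomb code of a non-negative integer x with parameter m (m >= 1)."""
    q, r = divmod(x, m)
    unary = "1" * q + "0"  # q ones followed by a terminating zero

    b = (m - 1).bit_length()  # b = ceil(log2(m))
    cutoff = (1 << b) - m  # 2^b - m; zero when m is a power of 2
    if r < cutoff:
        width, value = b - 1, r  # short remainder code
    else:
        width, value = b, r + cutoff  # long remainder code
    remainder = format(value, f"0{width}b") if width > 0 else ""
    return unary + remainder


if __name__ == "__main__":
    for x in (0, 33, 57, 99):
        print(x, golomb_encode(x, 10))
    # Rice coding is the special case where m is a power of 2.
    print(9, golomb_encode(9, 4))
```

With a power-of-2 parameter the cutoff is zero, so every remainder takes exactly $\log_2 M$ bits and encoding reduces to a shift and a mask.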

S9/S16 Compression [Zhang, Long, Suel: WWW 08]

32-bit word (integer) = 4 bytes: 1001 1011110000100001101100101111 (4 status bits, 28 data bits)

Word-aligned encoding (32-bit integer words of fixed length).
4 status bits encode 9 or 16 cases for partitioning the 28 data bits.

Example: if the above case 1001 denotes 4 x 7 bits for the data part, then the data part encodes the decimal numbers 94, 8, 54, 47.

Decompression is implemented by a case table or by hardcoding all cases.
High cache locality of decompression code/table.
Fast CPU support for bit shifting integers on 32-bit to 128-bit platforms.
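Decoding one such word by case table can be sketched as follows. The table here is hypothetical: it contains only the single case assumed in the example (status 1001 meaning 4 values of 7 bits each) plus one further made-up entry, and it assumes values are stored from the high-order end of the data part. The actual S9 and S16 case assignments are not given on the slide.

```python
# HYPOTHETICAL case table: status bits -> (number of values, bits per value).
# Only the 0b1001 entry comes from the slide's "if ... denotes 4 x 7 bit".
CASE_TABLE = {
    0b1001: (4, 7),
    0b0000: (28, 1),  # made-up entry for illustration
}

DATA_BITS = 28


def decode_word(word):
    """Unpack the values stored in one 32-bit word."""
    status = word >> DATA_BITS
    count, width = CASE_TABLE[status]
    mask = (1 << width) - 1
    values = []
    for i in range(count):
        # Assumption: values are laid out from the high-order data bits down.
        shift = DATA_BITS - width * (i + 1)
        values.append((word >> shift) & mask)
    return values


if __name__ == "__main__":
    word = int("1001" + "1011110000100001101100101111", 2)
    print(decode_word(word))
```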

P-FOR-Delta Compression [Zukowski, Heman, Nes, Boncz: ICDE 06]

P-FOR-Delta: Patched Frame-of-Reference with delta-encoded gaps.

Key idea: encode individual numbers such that "most" numbers fit into b bits.
Focuses on encoding an entire block at a time by choosing a value of b bits such that the coded range [high_coded, low_coded] is small.
Outliers ("exceptions") are stored in an extra exception section at the end of the block, in reverse order.

[Figure: encoding of the digit sequence 31415926535897932 using b = 3 bitwise coding blocks for the code section]
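The split into a fixed-width code section and an exception section can be illustrated with a deliberately simplified sketch. It is not the published layout: values stay in Python lists instead of packed bits, exceptions are stored as explicit (position, value) pairs, and all names are illustrative assumptions. It only shows which numbers fit into b bits and which are patched in afterwards.

```python
def pfor_encode(values, b):
    """Split a block into b-bit codes and a reversed list of exceptions."""
    limit = 1 << b  # values below this fit into b bits
    codes, exceptions = [], []
    for pos, value in enumerate(values):
        if value < limit:
            codes.append(value)
        else:
            codes.append(0)  # placeholder slot, patched during decoding
            exceptions.append((pos, value))
    exceptions.reverse()  # exception section is kept in reverse order
    return codes, exceptions


def pfor_decode(codes, exceptions):
    """Copy the code section, then patch in the exceptions."""
    values = list(codes)
    for pos, value in exceptions:
        values[pos] = value
    return values


if __name__ == "__main__":
    digits = [int(c) for c in "31415926535897932"]
    codes, exceptions = pfor_encode(digits, b=3)
    print(codes)
    print(exceptions)
    print(pfor_decode(codes, exceptions) == digits)
```

With b = 3 only the digits 8 and 9 become exceptions, so 13 of the 17 values take 3 bits each. Decoding the code section is a tight loop without branches, and the patch step touches only the few exceptions.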

Interpolative Coding (IPC) [Moffat, Stuiver: IR 00]

IPC directly encodes docids rather than gaps.
Specifically aims at bursty/clustered docids of similar range.
Recursively splits the input sequence into low-distance ranges:

<1; 3; 8; 9; 11; 12; 13; 17>
<1; 3; 8; 9> <11; 12; 13; 17>
<1; 3> <8; 9> <11; 12> <13; 17>

Requires $\lceil \log_2(high_i - low_i + 1) \rceil$ bits per number for bucket i in binary!
But: requires the decoder to know all high_i/low_i pairs.
Need to know large blocks of the input sequence in advance.
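One way to realize the recursive splitting is the usual binary interpolative recursion: code the middle docid of a range first, then recurse into the left and right halves, whose bounds the decoder can derive by itself. The sketch below follows that recursion under the assumption that the decoder knows the list length and the overall bounds `lo` and `hi`; bit strings and function names are illustrative.

```python
def _bits_needed(range_size):
    """ceil(log2(range_size)); 0 bits when only one value is possible."""
    return (range_size - 1).bit_length()


def ipc_encode(docids, lo, hi):
    """Encode a strictly increasing docid list known to lie in [lo, hi]."""
    if not docids:
        return ""
    n = len(docids)
    mid = n // 2
    x = docids[mid]
    low = lo + mid  # mid smaller docids must fit below x
    high = hi - (n - 1 - mid)  # remaining larger docids must fit above x
    width = _bits_needed(high - low + 1)
    code = format(x - low, f"0{width}b") if width else ""
    return (code
            + ipc_encode(docids[:mid], lo, x - 1)
            + ipc_encode(docids[mid + 1:], x + 1, hi))


def ipc_decode(bits, n, lo, hi):
    """Inverse of ipc_encode for a list of n docids in [lo, hi]."""
    pos = 0

    def decode_range(n, lo, hi):
        nonlocal pos
        if n == 0:
            return []
        mid = n // 2
        low = lo + mid
        high = hi - (n - 1 - mid)
        width = _bits_needed(high - low + 1)
        x = low + (int(bits[pos:pos + width], 2) if width else 0)
        pos += width
        left = decode_range(mid, lo, x - 1)
        right = decode_range(n - 1 - mid, x + 1, hi)
        return left + [x] + right

    return decode_range(n, lo, hi)


if __name__ == "__main__":
    docids = [1, 3, 8, 9, 11, 12, 13, 17]
    bits = ipc_encode(docids, 1, 17)
    print(bits, len(bits))
    print(ipc_decode(bits, len(docids), 1, 17) == docids)
```

A fully dense run such as 5, 6, 7 inside the range [5, 7] costs zero bits, which is why the method does well on clustered docids.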

Comparison of Compression Techniques [Yan, Ding, Suel: WWW 09]

[Figure: distribution of docid gaps on TREC GOV2 (~25 million docs), reporting averages over 1,000 queries]
[Figure: decompression speed (MB/query) for TREC GOV2, 1,000 queries]
[Figure: compressed docid sizes (MB/query) on TREC GOV2 (~25 million docs), reporting averages over 1,000 queries]

Variable-length encodings usually win by far in (de-)compression speed over dictionary- and entropy-based schemes, at comparable compression ratios!

Layout of Index Postings [J. Dean: WSDM 2009]

Per word: word, skip table, block 1, ..., block N

One block (with n postings):
  header: delta to last docid in block; #docs in block: n
  docid postings: n-1 docid deltas, Rice_M encoded
  payload (of postings):
    tf values: Gamma encoded
    term attributes: Huffman encoded
    term positions: Huffman encoded

The layout allows incremental decoding.

Additional Literature for Chapters V.1-2

Indexing with inverted files:
  J. Zobel, A. Moffat: Inverted Files for Text Search Engines. Comp. Surveys 2006
  S. Brin, L. Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine. WWW 1998
  L.A. Barroso, J. Dean, U. Hölzle: Web Search for a Planet: The Google Cluster Architecture. IEEE Micro 2003
  J. Dean, S. Ghemawat: MapReduce: Simplified Data Processing in Large Clusters. OSDI 2004
  X. Long, T. Suel: Three-Level Caching for Efficient Query Processing in Large Web Search Engines. WWW 2005
  H. Yan, S. Ding, T. Suel: Compressing Term Positions in Web Indexes. SIGIR 2009
  J. Dean: Challenges in Building Large-Scale Information Retrieval Systems. Keynote, WSDM 2009, http://videolectures.net/wsdm09_dean_cblirs/

Inverted index compression:
  Marcin Zukowski, Sándor Héman, Niels Nes, Peter A. Boncz: Super-Scalar RAM-CPU Cache Compression. ICDE 2006
  Jiangong Zhang, Xiaohui Long, Torsten Suel: Performance of Compressed Inverted List Caching in Search Engines. WWW 2008
  Alistair Moffat, Lang Stuiver: Binary Interpolative Coding for Effective Index Compression. Inf. Retr. 3(1): 25-47 (2000)
  Hao Yan, Shuai Ding, Torsten Suel: Inverted Index Compression and Query Processing with Optimized Document Ordering. WWW 2009