Information Retrieval. Lecture 3 - Index compression. Wintersemester 2007


Information Retrieval
Lecture 3 - Index compression
Seminar für Sprachwissenschaft
International Studies in Computational Linguistics
Wintersemester 2007 1/ 30

Introduction
Dictionary and inverted index: the core of IR systems. Techniques can be used to compress these data structures, with two objectives:
1. reducing the disk space needed
2. reducing processing time, by using a cache (keeping the postings of the most frequently used terms in main memory)
NB: decompression can be faster than reading from disk. 2/ 30

Overview
The string method
Conclusion 3/ 30

Example considered: the Reuters-RCV1 collection

                 dictionary (word types)    non-positional postings       positional postings (word tokens)
                 size         Δ     cumul.  size           Δ     cumul.   size           Δ
unfiltered       484,494                    109,971,179                   197,879,290
no numbers       473,723     -2%    -2%     100,680,242    -8%    -8%     179,158,204    -9%
case folding     391,523    -17%   -19%      96,969,056    -3%   -12%     179,158,204    -0%
30 stop words    391,493     -0%   -19%      83,390,443   -14%   -24%     121,857,825   -31%
150 stop words   391,373     -0%   -19%      67,001,847   -30%   -39%      94,516,599   -47%
stemming         322,383    -17%   -33%      63,812,300    -4%   -42%      94,516,599    -0% 4/ 30

Statistical properties of terms
The vocabulary grows with the corpus size. An empirical law determines the number of term types M in a collection (Heaps' law):
M = k T^b
where T is the number of tokens, and k and b are two parameters (k is the growth rate), typically:
b ≈ 0.5 and 30 ≤ k ≤ 100
On the REUTERS corpus (taking k = 33 and b = 0.5): M = 33 × 95,000,000^0.5 ≈ 321,644 5/ 30

Index format with fixed-width entries

term     tot. freq.   pointer to postings list   postings list
a        656,265      ->                         ...
aachen   65           ->                         ...
...      ...          ...                        ...
zulu     221          ->                         ...

Space needed: 40 bytes (term), 4 bytes (freq.), 4 bytes (pointer)
Total space: M × (2 × 20 + 4 + 4) = 400,000 × 48 = 19.2 MB
NB: why 40 bytes per term? (2-byte Unicode characters × a maximum term length of 20) 6/ 30

Remarks
The average length of a word type for REUTERS is 7.5 bytes. With fixed-length entries, a one-letter term is stored using 40 bytes! Some very long words (such as hydrochlorofluorocarbons) cannot be handled. How can we extend the dictionary representation to save bytes and allow for long words? 7/ 30

Dictionary-as-a-string
The string method

...systilesyzygeticsyzygialsyzygyszaibelyiteszeci...

freq.   postings ptr.   term ptr.
9       ->              ->
92      ->              ->
5       ->              ->
71      ->              ->
12      ->              ->

4 bytes (freq.), 4 bytes (postings ptr.), 3 bytes (term ptr.) 8/ 30
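The two calculations above (the Heaps' law vocabulary estimate and the fixed-width dictionary size) can be checked with a few lines of Python. This is a sketch using the slide's constants; the function name is mine:

```python
# Heaps' law: vocabulary size M = k * T^b for a collection of T tokens.
def heaps(T, k=33, b=0.5):
    """Estimate the number of distinct terms (k = 33, b = 0.5 as on the slide)."""
    return k * T ** b

M = round(heaps(95_000_000))        # Reuters-RCV1: ~95M tokens

# Fixed-width dictionary entry: 20 chars x 2 bytes + 4 (freq.) + 4 (ptr.)
bytes_per_entry = 2 * 20 + 4 + 4    # = 48 bytes
fixed_width_mb = 400_000 * bytes_per_entry / 1e6
```

Running this reproduces the slide's numbers: about 321,644 terms and 19.2 MB for the fixed-width dictionary.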

The string method
Space use for dictionary-as-a-string:
- 4 bytes per term for the frequency
- 4 bytes per term for the pointer to the postings list
- 3 bytes per pointer into the string (we need log2 400,000 ≈ 22 bits to resolve 400,000 positions)
- 8 chars (on average) for the term in the string, at 2 bytes each
Space: 400,000 × (4 + 4 + 3 + 2 × 8) = 10.8 MB (compared to 19.2 MB for fixed-width) 9/ 30

Block-storage

...7systile9syzygetic8syzygial6syzygy11szaibelyite...

freq.: 9, 92, 5, 71, 12 — one postings pointer per term, but only one term pointer per block 10/ 30

Space use for block-storage
Let us consider blocks of size k. We remove k − 1 term pointers, but add k bytes for the term lengths.
Example: k = 4: (k − 1) × 3 = 9 bytes saved (pointers), 4 bytes added (term lengths) → 5 bytes saved per block.
Space saved: 400,000 × (1/4) × 5 = 0.5 MB (dictionary reduced to 10.3 MB)
Why not take k > 4? 11/ 30

Search without blocking
(binary search tree over the terms: aid, job, den, pit, box, ex, ox, win)
Average search cost: (4 + 3 + 2 + 3 + 1 + 3 + 2 + 3)/8 ≈ 2.6 steps 12/ 30
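The dictionary-as-a-string layout with blocking can be sketched as follows. The helper names are mine, and the lookup scans blocks linearly for brevity; a real dictionary would binary-search the block leaders first, then scan only within the matching block:

```python
# Sketch of "dictionary-as-a-string" with blocking (k = 4): terms are
# concatenated into one string, each term prefixed by a length byte, and
# only the first term of each block keeps a pointer into the string.
K = 4  # block size

def build(terms):
    """Concatenate sorted terms; keep one string pointer per block."""
    parts, block_ptrs, pos = [], [], 0
    for i, term in enumerate(terms):
        if i % K == 0:
            block_ptrs.append(pos)
        parts.append(chr(len(term)) + term)  # 1 length byte + the term
        pos += 1 + len(term)
    return "".join(parts), block_ptrs

def lookup(string, block_ptrs, term):
    """Return True if term is stored (linear scan, for illustration)."""
    for ptr in block_ptrs:
        pos = ptr
        for _ in range(K):
            if pos >= len(string):
                break
            length = ord(string[pos])
            if string[pos + 1:pos + 1 + length] == term:
                return True
            pos += 1 + length
    return False
```

For k − 1 of every k terms this trades a 3-byte term pointer for a 1-byte length, which is exactly the 0.5 MB saving computed on the slide.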

Impact of blocking on search
(terms stored in blocks: aid, box, den, ex | job, ox, pit, win)
Average search cost: (2 + 3 + 4 + 5 + 1 + 2 + 3 + 4)/8 = 3 steps 13/ 30

Front coding
One block in blocked compression (k = 4):
8automata 8automate 9automatic 10automation
further compressed with front coding:
8automat*a 1◊e 2◊ic 3◊ion
End of prefix marked by *
Deletion of prefix marked by ◊ 14/ 30

Dictionary compression for Reuters

representation                      size in MB
dictionary, fixed-width             19.2
dictionary as a string              10.8
  , with blocking, k = 4            10.3
  , with blocking & front coding     7.9 15/ 30

Recall: the REUTERS collection has about 800,000 documents, each having 200 tokens. Since tokens are encoded using 6 bytes, the collection's size is 960 MB. A document identifier must cover the whole collection, i.e. must be log2 800,000 ≈ 20 bits long. If the collection includes about 100,000,000 postings, the size of the postings lists is 100,000,000 × 20/8 = 250 MB.
How to compress these postings? Idea: the most frequent terms occur close to each other, so we encode the gaps between occurrences of a given term. 16/ 30
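The front-coding step can be sketched in Python. The '*' and '◊' markers and the single-block scope follow the slide; the prefix-finding loop is my own illustration:

```python
def front_encode(block):
    """Front-code one block of sorted terms: write the first term in full
    with the end of the shared prefix marked by '*'; for later terms keep
    only the extra length and the suffix, with '◊' marking the omitted
    prefix (markers as on the slide)."""
    first = block[0]
    p = 0  # length of the prefix of block[0] shared by every term
    while p < len(first) and all(t[:p + 1] == first[:p + 1] for t in block):
        p += 1
    out = [f"{len(first)}{first[:p]}*{first[p:]}"]
    for t in block[1:]:
        out.append(f"{len(t) - p}◊{t[p:]}")
    return "".join(out)
```

Applied to the block automata, automate, automatic, automation it produces exactly the slide's encoding: 8automat*a1◊e2◊ic3◊ion.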

Gap encoding

term             encoding   postings list
the              docids     ... 283042  283043  283044  283045 ...
                 gaps               1       1       1
computer         docids     ... 283047  283154  283159  283202 ...
                 gaps             107       5      43
arachnocentric   docids     252000  500100
                 gaps       252000  248100

Furthermore, small gaps are represented with shorter codes than big gaps. 17/ 30

Variable-length byte encoding
Uses an integral number of bytes to encode a gap:
- first bit := continuation bit
- last 7 bits := part of the gap
The first bit is set to 1 for the last byte of the encoded gap, 0 otherwise.
Example: a gap of size 5 is encoded as 10000101 18/ 30

Variable-length byte code: example

docids    824                 829        215406
gaps                          5          214577
VB code   00000110 10111000   10000101   00001101 00001100 10110001

What is the code for a gap of size 1283?
Homework: write a Java class which encodes/decodes such gaps. 19/ 30

About variable-length byte compression
The postings lists for the REUTERS collection are compressed to 116 MB with this technique (original size: 250 MB). The idea of representing gaps with a variable integral number of bytes can also be applied with units other than 8 bits. Larger units can be decompressed more quickly than small ones, but are less effective in terms of compression rate. 20/ 30
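The homework asks for a Java class; the same variable-byte scheme can be sketched in Python (the function names are mine), and the logic translates to Java directly:

```python
def vb_encode(gap):
    """Variable-byte encode one gap: 7 payload bits per byte, with the
    high bit set to 1 only on the last (least significant) byte."""
    out = [gap % 128]
    gap //= 128
    while gap:
        out.insert(0, gap % 128)
        gap //= 128
    out[-1] += 128  # mark the final byte of this gap
    return out

def vb_decode(stream):
    """Decode a byte stream back into a list of gaps."""
    gaps, n = [], 0
    for b in stream:
        if b < 128:
            n = n * 128 + b          # continuation byte
        else:
            gaps.append(n * 128 + (b - 128))  # final byte: emit the gap
            n = 0
    return gaps
```

Encoding the slide's example stream (824, then gaps 5 and 214577) reproduces the bytes shown above, e.g. vb_encode(5) is the single byte 10000101.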

γ-codes
Idea: representing numbers with a variable-length bit code.
Example: unary code, where the number n is encoded as n ones followed by a zero:
11...1 0   (n times 1)
(NB: not efficient)
γ-code: variable-length encoding done by splitting the representation of a gap into:
length offset
- offset is the binary encoding of the gap, without the leading 1
- length is the unary code of the size of offset
Objective: a representation whose size (in bits) is as close as possible to log2 G for a gap G. 21/ 30

Unary and γ-codes

number   unary code    length        offset       γ-code
0        0             -             -            -
1        10            0                          0
2        110           10            0            10,0
3        1110          10            1            10,1
4        11110         110           00           110,00
9        1111111110    1110          001          1110,001
13                     1110          101          1110,101
24                     11110         1000         11110,1000
511                    111111110     11111111     111111110,11111111
1025                   11111111110   0000000001   11111111110,0000000001

γ-codes are always of odd length. More precisely, the γ-code of a gap s is 2 × floor(log2 s) + 1 bits long.
γ-code for 6? 22/ 30

Example
Given the following γ-coded gaps:
1110001110101011111101101111011
Decode these, extract the gaps, and recompute the postings list.
NB: γ-decoding:
1. first read the length (unary, terminated by 0),
2. then use this length to extract the offset,
3. and finally prepend the missing leading 1. 23/ 30

γ-codes (continued)
How to evaluate the properties of a code? It depends on the distribution of gaps. The coding properties of a probability distribution P are related to its entropy H(P):
H(P) = − Σ_{x ∈ X} P(x) log2 P(x)
The expected length E(C) of a code C is such that:
E(C)/H(P) ≤ 2 + 1/H(P) ≤ 3
This means that γ-codes always come close to optimality (within a factor of 3), whatever the distribution is.
How good is a γ-encoding in terms of compression? 24/ 30
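A sketch of γ-encoding and γ-decoding, following the three decoding steps listed above (function names are illustrative); gamma_decode can be used to check the decoding exercise:

```python
def gamma_encode(n):
    """γ-encode n >= 1: offset is n's binary form without its leading 1;
    length is the unary code of the offset's size."""
    offset = bin(n)[3:]                # strip '0b' and the leading 1
    return "1" * len(offset) + "0" + offset

def gamma_decode(bits):
    """Decode a concatenation of γ-codes into a list of gaps."""
    gaps, i = [], 0
    while i < len(bits):
        length = 0
        while bits[i] == "1":          # 1. read the unary length
            length += 1
            i += 1
        i += 1                         # skip the terminating 0
        offset = bits[i:i + length]    # 2. extract the offset
        i += length
        gaps.append(int("1" + offset, 2))  # 3. prepend the missing 1
    return gaps
```

For instance, gamma_encode(9) yields 1110001 and gamma_encode(13) yields 1110101, matching the table above.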

Compression rate of γ-codes
Zipf's law: the frequency f_i of the i-th most common term is proportional to 1/i:
f_i ≈ c/i
Rough analysis of the γ-encoding with this term distribution (H_M is the M-th harmonic number):
1 = Σ_{i=1}^{M} c/i = c Σ_{i=1}^{M} 1/i = c H_M
Thus:
c = 1/H_M ≈ 1/ln M = 1/ln 400,000 ≈ 1/13
The expected average number of occurrences of a term i in a document of length L is:
L × c/i = 200 × (1/13)/i ≈ 15/i 25/ 30

Stratification of terms for estimation of compression

the L·c most frequent terms:       N gaps of size 1 each
the L·c next most frequent terms:  N/2 gaps of size 2 each
the L·c next most frequent terms:  N/3 gaps of size 3 each
... 26/ 30

Compression rate of γ-codes (continued)
Let us consider blocks of L·c = 15 terms each. Recall that the γ-code for one gap of size j requires 2 × floor(log2 j) + 1 bits. To encode all gaps gathered in one block (N/j is the number of gaps of size j per term), we need:
L·c × (N/j) × (2 × floor(log2 j) + 1) ≈ 2 × N × L·c × log2 j / j bits
If we assume equal gap sizes within each block, encoding all gaps requires (with M/(L·c) = 400,000/15 ≈ 27,000 blocks):
Σ_{j=1}^{27,000} 2 × N × L·c × log2 j / j ≈ 224 MB 27/ 30

Compression of Reuters: summary

representation                          size in MB
dictionary, fixed-width                 19.2
dictionary, term pointers into string   10.8
  , with blocking, k = 4                10.3
  , with blocking & front coding         7.9
collection (text, xml markup etc)     3600.0
collection (text)                      960.0
postings, uncompressed (32-bit words)  400.0
postings, uncompressed (20 bits)       250.0
postings, variable byte encoded        116.0
postings, γ encoded                    101.0 28/ 30
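The rough estimate above can be reproduced numerically. All constants are the slide's assumed values (N = 800,000 documents, 15 terms per stratum, "MB" meaning 10^6 bytes); this is a back-of-the-envelope sketch, not a measurement:

```python
import math

# Back-of-the-envelope size of the γ-encoded Reuters postings, using
# the lecture's Zipf-based stratification and its assumed constants.
N = 800_000          # documents in the collection
M = 400_000          # distinct terms
LC = 15              # L * c = 200 / 13, terms per stratum (slide value)

c = 1 / math.log(M)  # Zipf normalization c = 1/H_M, approximated by 1/ln M

# A stratum whose gaps have size j costs about 2 * N * LC * log2(j) / j
# bits; sum over the M / LC ≈ 27,000 strata (j = 1 contributes 0 bits).
bits = sum(2 * N * LC * math.log2(j) / j for j in range(2, M // LC + 1))
megabytes = bits / 8 / 1e6
```

This lands near the slide's ≈ 224 MB; the actually achieved γ-compressed size (101 MB) is smaller, presumably because the uniform-gap model is deliberately crude.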

Conclusion
γ-codes achieve better compression ratios (about 15% better than variable-byte encoding), but are more complex and more expensive to decode. This cost is paid at query-processing time, so there is a trade-off to find. The objectives announced are met by both techniques; recall:
1. reducing the disk space needed
2. reducing processing time, by using a cache
The techniques we have seen are lossless compression (no information is lost). Lossy compression can also be useful, e.g. storing only the most relevant postings (more on this in the ranking lecture). 29/ 30

References
C. Manning, P. Raghavan and H. Schütze, Introduction to Information Retrieval (chapter 5). http://www-nlp.stanford.edu/ir-book/pdf/chapter05-compression.pdf
F. Scholer, H.E. Williams, J. Yiannis and J. Zobel, Compression of Inverted Indexes for Fast Query Evaluation (2002). http://citeseer.ist.psu.edu/scholer02compression.html 30/ 30