COSC431 IR. Compression. Richard A. O'Keefe

1 COSC431 IR Compression Richard A. O'Keefe

2 Shannon/Barnard Entropy $H = -\sum_c p(c)\,\log_2 p(c)$, taken over characters c. Measured in bits, it is a lower limit on how many bits per character an encoding would need. Shannon estimated entropy using tables of letters and words, and also using human predictors. We're going to do that today.
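As a rough illustration of the table-based estimate, the sketch below computes the zeroth-order entropy, -sum of p(c) log2 p(c), from the character counts of a sample text. The function name `char_entropy` is made up for this example, and this is only the single-character estimate, not Shannon's human-predictor experiment.

```python
import math
from collections import Counter

def char_entropy(text):
    """Zeroth-order entropy of text in bits per character."""
    counts = Counter(text)
    total = len(text)
    # -sum over distinct characters of p(c) * log2(p(c))
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

A text that uses four characters equally often needs 2 bits per character by this measure; a text that repeats one character needs none.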

3 Barnard's results [Table for English, French, German, and Spanish giving word length and letter entropy by words; the values are not preserved.]

4 The experiment I spy with my little eye Your task is to guess the letters. We'll notice something interesting.

5 Shannon/Barnard consequences Unicode takes 21 bits per character (an awkward number, so 32 bits may be used). That's about 5 times more bits than we need for English (or 7.8 times if we use 32 bits). Even as UTF-8, it's still twice as many. We should be able to squeeze text by 50%. There are lots of legal letter combinations, like "thurb" and "pring", that are not used. Taking that into account takes us to 2 bits/letter.

6 Web references Data_compression is an overview of data compression. Lossless_data_compression talks about text and pictures. Universal_code(data_compression) is about compressing integer sequences. Matt Mahoney's Data Compression Explained.

7 Fundamental idea Data = Model + Surprise. Model = Framework + Parameters. Framework is a scheme known to sender and receiver (e.g., Huffman encoding). Parameters is a summary of the message as a whole (e.g., a character frequency table). Surprise = what the receiver cannot predict.

8 Simple example Framework = Unicode, any version, means we'd need 21 bits per character. Parameters = version means 17 bits is enough. Parameters = ASCII subset means 7 is enough. Framework = adaptive Huffman code on bytes, parameters = none, takes a plain text Tempest to 5 bits per character. (Unix pack(1) command.)

9 Integer sequences A fairly basic inverted index is a mapping term → (docno, frequency)-set. term is a word or the stem of a word. docno is a natural number identifying a document in the collection. frequency is a positive integer saying how many times that term occurs in that document.
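A minimal sketch of such a mapping, assuming documents are plain strings, terms are lower-cased whitespace-separated words (no stemming), and document numbers start at 1. The name `build_index` is invented for this example.

```python
from collections import Counter, defaultdict

def build_index(docs):
    """Map each term to a list of (docno, frequency) pairs in docno order."""
    index = defaultdict(list)
    for docno, text in enumerate(docs, start=1):
        # Count how often each term occurs in this one document.
        for term, freq in Counter(text.lower().split()).items():
            index[term].append((docno, freq))
    return dict(index)
```

For the two documents "a b a" and "b c", the term "b" maps to [(1, 1), (2, 1)]. The docno and frequency columns are the integer sequences to be compressed.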

10 Segregate and compress When we know the structure of some information, we often find the parts have different characteristics. If we can segregate the parts, we can compress them differently. For <example with="XML">we have four parts</example>: generic identifiers, attribute names, attribute values, and text.

11 Segregate and compress 2 generic identifiers are few but frequent (e.g., WSJ has just 21). attribute names tend to be few, frequent, and different from generic identifiers. attribute values vary a lot, but links tend to have common prefixes. text is made of words, so compression based on words will pay off. The Xmill compressor for XML does this and lets you customise it further.

12 Segregation in IR Document numbers and frequencies have different distributions. Document numbers tend to be uniformly distributed. Frequencies are concentrated at low values. We can compress frequencies very nicely, but how do we make document numbers small?

13 Encode differences! We have a set of integers. We have to encode it as a sequence. We can choose the order to make this easy. SORT the numbers into increasing order and take differences: {31,41,59,26,53,58,97,93} → <26,31,41,53,58,59,93,97> → <26,5,10,12,5,1,34,4> (differences taken from a starting value of 0). Compress the differences using a method that expects small numbers.
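The sort-and-difference step can be sketched as follows. The names `to_gaps` and `from_gaps` are invented for this example, and the first gap is measured from 0, as in the slide's example.

```python
def to_gaps(numbers):
    """Sort a set of integers and return the successive differences."""
    gaps = []
    previous = 0  # the first gap is measured from 0
    for n in sorted(set(numbers)):
        gaps.append(n - previous)
        previous = n
    return gaps

def from_gaps(gaps):
    """Rebuild the sorted numbers by summing the gaps."""
    numbers = []
    total = 0
    for gap in gaps:
        total += gap
        numbers.append(total)
    return numbers
```

Applied to the set on the slide, `to_gaps` gives 26, 5, 10, 12, 5, 1, 34, 4, and `from_gaps` restores the sorted list.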

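14 Not just for numbers cha → cha, chaa → 3a, chab → 3b, chabasie → 4asie, chabazite → 5zite, chabot → 4ot, chabouk → 5uk. An extract from /usr/share/dict/words illustrating prefix encoding, also called front encoding: each word is written as the number of leading characters it shares with the previous word, followed by the remaining suffix. Same idea!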
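A small sketch of front encoding on a sorted word list. Here each entry is represented as a (shared prefix length, suffix) pair rather than the slide's concatenated form such as 4asie; the function names are made up for this example.

```python
def front_encode(words):
    """Encode sorted words as (shared_prefix_length, suffix) pairs."""
    pairs = []
    previous = ""
    for word in words:
        shared = 0
        limit = min(len(previous), len(word))
        while shared < limit and previous[shared] == word[shared]:
            shared += 1
        pairs.append((shared, word[shared:]))
        previous = word
    return pairs

def front_decode(pairs):
    """Rebuild the word list from (shared_prefix_length, suffix) pairs."""
    words = []
    previous = ""
    for shared, suffix in pairs:
        previous = previous[:shared] + suffix
        words.append(previous)
    return words
```

On the seven words in the extract this yields the pairs (0, cha), (3, a), (3, b), (4, asie), (5, zite), (4, ot), (5, uk).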

15 Compressing mostly small numbers Unary encoding: represent n by n 1 bits followed by a zero. 0, 1, 2, 3 : 0, 10, 110, 1110. Don't laugh: you use this every day! UTF-8: 0xxxxxxx (1 byte); 110xxxxx 10xxxxxx (2 bytes); 1110xxxx 10xxxxxx 10xxxxxx (3 bytes); 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (4 bytes). Leading bits of the first byte are the byte count in unary.
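A sketch of unary coding on strings of "0" and "1" characters, plus a helper that reads the byte count of a UTF-8 sequence from its first byte by counting leading 1 bits. All three function names are invented for this example, and bit strings are used for readability rather than speed.

```python
def unary_encode(n):
    """Represent n as n one-bits followed by a zero."""
    return "1" * n + "0"

def unary_decode(bits):
    """Return (n, remaining_bits) for a bit string starting with a unary code."""
    n = bits.index("0")  # number of leading ones
    return n, bits[n + 1:]

def utf8_length(first_byte):
    """Byte count of a UTF-8 sequence, given its first byte as an int."""
    ones = 0
    mask = 0x80
    while first_byte & mask:
        ones += 1
        mask >>= 1
    if ones == 0:
        return 1  # 0xxxxxxx: a single byte
    if ones == 1 or ones > 4:
        raise ValueError("not a valid UTF-8 lead byte")
    return ones
```

The one irregularity is the single-byte form: a lead byte with no leading 1 bits means one byte, and a byte starting with 10 is a continuation byte, never a lead byte.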

16 Small numbers 2 Elias gamma encoding. Let $2^N \le x < 2^{N+1}$. Write N zero bits, a 1, and the last N bits of x; for example, 9 is 1001 in binary, so N = 3 and the code is 0001001. The idea can be iterated, giving us Elias delta encoding (encode N using Elias gamma) and ultimately Elias omega encoding. We have to trade off space against speed.
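A sketch of Elias gamma coding for integers x >= 1, again on bit strings for readability. If x has N + 1 binary digits, the code is N zeros followed by those digits (whose first digit is the 1). The names `gamma_encode` and `gamma_decode` are made up for this example.

```python
def gamma_encode(x):
    """Elias gamma code of an integer x >= 1, as a string of bits."""
    if x < 1:
        raise ValueError("Elias gamma is defined for x >= 1")
    binary = bin(x)[2:]       # N + 1 digits, always starting with 1
    n = len(binary) - 1
    return "0" * n + binary   # N zeros, the 1, then the last N bits

def gamma_decode(bits):
    """Return (x, remaining_bits) for a bit string starting with a gamma code."""
    n = 0
    while bits[n] == "0":     # count the leading zeros to find N
        n += 1
    value = int(bits[n:2 * n + 1], 2)
    return value, bits[2 * n + 1:]
```

The codes for 1, 2, and 5 come out as 1, 010, and 00101, so small numbers get short codes.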

17 What's the point? Originally we compressed to save space. If you want to fit Shakespeare's plays (about 8 MB) onto a 1.44 MB floppy, you must be very clever. Nowadays we compress to save time. Reading data off a disc is so slow (and through a network even worse) that reading compressed data and decompressing it in memory can save a lot of time. Provided decompression isn't too complex.

18 Variable byte encoding Rather like unary, but for bytes, not bits. Break an integer into 7-bit chunks. Send only the chunks you actually need. Mark each chunk but the last as to be continued. Mark the last chunk as final. 7 bits of data + 1 to-be-continued bit. Simple, fast, not the very best compression.
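A sketch of variable byte coding for non-negative integers. The slide does not say which chunk goes first or which bit value means "to be continued"; this example assumes low-order chunk first with the high bit set on every byte except the last. The function names are invented for this example.

```python
def vbyte_encode(n):
    """Encode a non-negative integer as 7-bit chunks, low-order chunk first."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # 7 data bits + to-be-continued bit
        n >>= 7
    out.append(n)                      # final chunk: high bit clear
    return bytes(out)

def vbyte_decode(data, pos=0):
    """Return (value, next_position) for the code starting at data[pos]."""
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:            # final chunk reached
            return value, pos
```

Numbers below 128 take one byte; 300 takes two.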

19 Another scheme Compress numbers in groups of 4. Length code 0 = 1 byte, 1 = 2 bytes, 2 = 4 bytes, 3 = 8 bytes. Pack 4 two-bit length codes into one byte. Four numbers take 5 to 33 bytes. Compression not as good as variable byte but decompression is faster.
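A sketch of the grouped scheme, assuming two-bit length codes 0, 1, 2, 3 for 1, 2, 4, 8 bytes (which is what the 5-to-33-byte range implies), values in the range 0 to 2^64 - 1, little-endian bodies, and the first number's code in the low bits of the header. Those layout choices and the function names are assumptions of this example.

```python
SIZES = (1, 2, 4, 8)  # bytes used for length codes 0, 1, 2, 3

def group_encode(a, b, c, d):
    """Encode four integers as one header byte plus 4 to 32 body bytes."""
    header = 0
    body = bytearray()
    for i, n in enumerate((a, b, c, d)):
        # Smallest length code whose size can hold n.
        code = next(k for k, size in enumerate(SIZES) if n < 1 << (8 * size))
        header |= code << (2 * i)
        body += n.to_bytes(SIZES[code], "little")
    return bytes([header]) + bytes(body)

def group_decode(data, pos=0):
    """Return ([a, b, c, d], next_position) for the group at data[pos]."""
    header = data[pos]
    pos += 1
    numbers = []
    for i in range(4):
        size = SIZES[(header >> (2 * i)) & 3]
        numbers.append(int.from_bytes(data[pos:pos + size], "little"))
        pos += size
    return numbers, pos
```

The decoder never inspects individual data bytes for a continuation flag; it reads one header and then four fixed-size fields, which is where the speed comes from.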

20 Compressing text We can take text to be a sequence of character numbers and compress that as an integer sequence. That's not unlike what UTF-8 does. Unicode also has SCSU (Simple Compression Scheme for Unicode) so that a block of Greek, Russian, Hebrew, Arabic, Georgian, &c characters will take 1 byte each.

21 Huffman compression Count how often each character occurs. Make a 1-node tree for each character. Repeatedly merge the two least frequent trees. Label the edges of the final tree with 1 and 0. The code for a character is the bits on the path from the root to the leaf for that character. Gets close to character entropy, but variable bit length.
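The steps above can be sketched with a priority queue. This is a static (two-pass) Huffman code builder; the name `huffman_codes`, the tie-breaking counter, and the choice of "0" for the left edge are assumptions of this example.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Return a dict mapping each character of text to its bit string."""
    counts = Counter(text)
    # Heap entries are (frequency, tie_breaker, tree); a tree is either a
    # single character or a (left, right) pair. The unique tie_breaker
    # keeps the heap from ever comparing two trees.
    heap = [(freq, i, ch) for i, (ch, freq) in enumerate(sorted(counts.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)  # the two least frequent trees
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next_id, (t1, t2)))
        next_id += 1
    codes = {}

    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"  # a lone character still needs one bit

    if heap:
        walk(heap[0][2], "")
    return codes
```

For the text "aaaabbc" the most frequent character gets a 1-bit code and the other two get 2-bit codes, 10 bits in total.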

22 Dictionary compression Lempel-Ziv algorithm keeps a sliding window of text. The output is a sequence of (new,char) or (old,start,length) items. That is, when it sees a repeated sequence, it encodes the sequence as such, not its individual characters. Exploits repeated words and repeats of parts of the text.
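A deliberately naive sliding-window sketch in the LZ77 style. Here the "start" of an old item is stored as a distance back from the current position, matches shorter than 3 characters are sent as new characters, and the search is a brute-force scan; all of these are choices made for this example, as are the function names.

```python
def lz_encode(text, window=4096, min_match=3):
    """Encode text as ("new", char) and ("old", distance, length) items."""
    items = []
    i = 0
    while i < len(text):
        best_len = 0
        best_dist = 0
        for j in range(max(0, i - window), i):
            k = 0
            # A match may run past position i and overlap the text being coded.
            while i + k < len(text) and text[j + k] == text[i + k]:
                k += 1
            if k > best_len:
                best_len, best_dist = k, i - j
        if best_len >= min_match:
            items.append(("old", best_dist, best_len))
            i += best_len
        else:
            items.append(("new", text[i]))
            i += 1
    return items

def lz_decode(items):
    """Rebuild the text from the item sequence."""
    out = []
    for item in items:
        if item[0] == "new":
            out.append(item[1])
        else:
            _, dist, length = item
            for _ in range(length):
                out.append(out[-dist])  # copy one character at a time
    return "".join(out)
```

The text "abcabcabcabc" becomes three new characters followed by a single old item of distance 3 and length 9.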

23 Spaceless word coding Divide a text into a sequence of words (each deemed to be followed by a space) and punctuation marks (which cancel a preceding space). "This is an example." : This | is | an | example | . Treat these words as letters in a larger alphabet. Assign them numbers in decreasing frequency order. Encode the numbers using variable byte &c. We can search text compressed this way without having to decompress it!
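A sketch of the tokenising and numbering steps only, assuming a word is a run of letters, digits, or underscores and every other non-space character is its own punctuation token. The regular expression and the name `word_numbers` are choices made for this example; ties in frequency are broken by first appearance.

```python
import re
from collections import Counter

def word_numbers(text):
    """Return (rank, numbers): token ranks by frequency, and the coded text."""
    tokens = re.findall(r"\w+|[^\w\s]", text)
    counts = Counter(tokens)
    # most_common() lists tokens in decreasing frequency order.
    rank = {token: i for i, (token, _) in enumerate(counts.most_common())}
    return rank, [rank[token] for token in tokens]
```

The resulting numbers can then go through a byte-oriented integer code such as variable byte. Searching for a word means looking up its number once and scanning for that number's code.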
