Data Compression. Guest lecture, SGDS Fall 2011

Overview: Basics. Lossy/lossless. Alphabet compaction. Compression is impossible (undecidable, pigeon-holes). Compression is possible (patterns, randomness). RLE. Variable-length codes: Huffman, arithmetic coding. Using phrases: dynamic context, Ziv-Lempel, Burrows-Wheeler, suffix sorting.

Data compression is not a traditional algorithms-course topic, but it is interesting both in itself and as an application of algorithms and data structures. The book covers only fragments, not that well chosen from a compression expert's point of view. This lecture gives a fuller view, with connections to what you have learned in the course.

Basic model: bitstream B (0110110101...) → Compress → compressed version C(B) (1101011111...) → Expand → original bitstream B (0110110101...).

This is the basic model for data compression. The original message consists of characters, pixels, sound samples or whatever. In much of the lecture we assume that it consists of characters, but more generally we can view it as just a stream of bits, because all data representations can be broken down to bits. A compression method is two algorithms: compress and expand. It seems impossible that you could get the original back; surely you would have to throw away some data. And sometimes you do.

Lossy: Compress → compressed message → Expand. Used for images, video, sound, ...

If we accept loss, which we can do for some kinds of data, it's more believable that we can compress.

Lossless: Compress → compressed message → Expand. Works for anything, including text and machine code. This lecture (and the book) covers lossless compression only.

But there are also lossless methods, which reproduce the original exactly. Lossless techniques are useful inside lossy methods too: even when accepting loss, you want to represent the exact information as compactly as possible. One case where it is fairly easy to accept that lossless compression is possible is when there are unused bits in B, i.e., B does not store the data as compactly as it could.

Easy: alphabet compaction. A genome is a string over the alphabet { A, C, T, G }. Encode an N-character genome, e.g. ATAGATGCATAG.

ASCII bytes: A = 01000001, C = 01000011, T = 01010100, G = 01000111.
2-bit encoding: A = 00, C = 01, T = 10, G = 11.

ASCII (96 bits): 01000001010101000100000101000111 01000001010101000100011101000011 01000001010101000100000101000111
2-bit (24 bits): 001000110010110100100011

That's nice, but in general, there are no unused bits.
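
Not from the lecture: a minimal Python sketch of this kind of alphabet compaction, packing each base into 2 bits (the function names are my own).

```python
# Hypothetical sketch of 2-bit alphabet compaction for a { A, C, T, G } genome.
CODE = {"A": 0b00, "C": 0b01, "T": 0b10, "G": 0b11}
BASE = "ACTG"

def compact(genome: str) -> bytes:
    """Pack 4 bases per byte (2 bits each); pad the last byte with A (00)."""
    bits = 0
    out = bytearray()
    for i, ch in enumerate(genome):
        bits = (bits << 2) | CODE[ch]
        if i % 4 == 3:
            out.append(bits)
            bits = 0
    if len(genome) % 4:
        out.append(bits << (2 * (4 - len(genome) % 4)))
    return bytes(out)

def expand(packed: bytes, n: int) -> str:
    """Unpack n bases from the packed representation."""
    chars = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            chars.append(BASE[(byte >> shift) & 0b11])
    return "".join(chars[:n])

assert expand(compact("ATAGATGCATAG"), 12) == "ATAGATGCATAG"
```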

But, in general: any representable data may appear, and there are no superfluous bits to remove.

Computational formulation. Compress: input is an N-bit message B; output is the smallest possible program, C(B), that produces B as output (when given no input). Expand: run C(B), get B. The length of C(B) is the Kolmogorov complexity of B. UNDECIDABLE.

The most general kind of code is a programming language. Let's say that C(B) is a program that produces B, and let's find the smallest such program. Undecidable: there is no, and can be no, algorithm that computes it in general. One should not be too discouraged, though; sometimes a non-general algorithm is useful. But let's make this easier, by requiring not that C(B) is the smallest possible, but just that it is smaller than B.

New attempt: skip "smallest possible". Compress: input is an N-bit message B; output is an N′-bit message C(B), with N′ < N. Expand: input is the N′-bit message C(B); output is the N-bit message B. IMPOSSIBLE.

Why is this impossible? The pigeonhole principle applies.

B: 2^N possibilities. C(B): 2^N′ possibilities (fewer, since N′ < N). Compression means mapping each dot on the left (a possible B) to some dot on the right (a possible C(B)). Since there are fewer possibilities for C(B) than for B, there are some B1 and B2 for which C(B1) = C(B2). This is easy to see when N is 2 or 3 or so, but don't get fooled: it applies even if N is billions.

B: 2^N possibilities. C(B): 2^N′ possibilities. Expand? Expand cannot choose between B1 and B2.

So, we give up? Some of the 2^N messages may be illegal; no need to encode them. And even if they are all legal, some are more probable than others.

Modified goal. Compress: input is an N-bit message B; output is an N′-bit message C(B), where N′ < N for the most common instances of B, and for less common B it is OK if N′ > N. Expand: input is the N′-bit message C(B); output is the N-bit message B.

This is a little vague, not really a mathematical definition. We would need some more information theory to make it formal, which is beyond the scope of this lecture.

Example: "Mary had a little lamb." versus "hsy, iimlh kwvsadjh h.j" (both 23 characters of text: upper/lower case letters plus punctuation, 6 bits/char).

Left (English text): 23 × 6 bits = 138 bits uncompressed. Compressed, text can use fewer, say 2.5 bits/char, because text patterns are predictable: 23 × 2.5 bits = 57.5 bits.

Right (random-looking string): 23 × 6 bits = 138 bits uncompressed. Compressed, this data (with no predictable patterns) will use more, say 6.8 bits/char: 23 × 6.8 bits = 156.4 bits.

So, without compression, just alphabet compaction, we get 138 bits for both. Compressed, the English text might get down to, e.g., 57.5 bits, while the random string we must allow to grow a little. How, then? We have to find predictable patterns.

141592653589793238462643383279502884197169399375105820974 944592307816406286208998628034825342117067982148086513282 306647093844609550582231725359408128481117450284102701938 521105559644622948954930381964428810975665933446128475648 233786783165271201909145648566923460348610454326648213393 607260249141273724587006606315588174881520920962829254091 715364367892590360011330530548820466521384146951941511609 433057270365759591953092186117381932611793105118548074462 379962749567351885752724891227938183011949129833673362440 656643086021394946395224737190 70217986094370277053921717

The 570 first decimals of π. No normal compression method finds this pattern: compression models are all based on repetition and/or skewed distributions. If we don't have special knowledge (of π, in this case), the message looks random.

Randomness. A message that looks random will not be compressed. A sequence that is truly random cannot be compressed (pigeonholes again). Maximally compressed data looks random.

Looking random depends on the model used; every compression method has one, explicit or implicit. Now let's look at a message where we can easily see some pattern.

Run-length encoding (RLE). The bit string 0000000000000001111111000000011111111111 is 15 zeros, 7 ones, 7 zeros, 11 ones. Encode the run lengths as 4-bit numbers: 1111 0111 0111 1011, i.e. 1111011101111011. How to compress 111100011? How to compress 00000000000000000?

If you would just describe this bit sequence, how would you do it? "15 zeros, then 7 ones, ..." Let's use that as a compression format. To make it into a bit string, we need to encode the numbers in binary too (next slide). With runs alternating starting from zeros, 111100011 becomes 0000 0100 0011 0010, and 00000000000000000 becomes 1111 0000 0010. The main example compressed 40 bits into 16 bits. What compression do we generally get with this method? (A sketch of the scheme is given below.)
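
A minimal sketch of the 4-bit run-length scheme above (my own illustration, not the lecture's code); runs longer than 15 are split by emitting a zero-length run of the other bit value.

```python
# Hypothetical sketch: RLE over a bit string with 4-bit run lengths.
# Runs alternate starting with '0'; a run longer than 15 is split by
# emitting 15 followed by a zero-length run of the other bit value.

def rle_compress(bits: str) -> str:
    out = []
    expected = "0"                      # runs alternate, starting with zeros
    i = 0
    while i < len(bits):
        run = 0
        while i < len(bits) and bits[i] == expected and run < 15:
            run += 1
            i += 1
        out.append(format(run, "04b"))  # 4-bit run length
        expected = "1" if expected == "0" else "0"
    return "".join(out)

def rle_expand(code: str) -> str:
    out = []
    expected = "0"
    for j in range(0, len(code), 4):
        run = int(code[j:j + 4], 2)
        out.append(expected * run)
        expected = "1" if expected == "0" else "0"
    return "".join(out)

example = "0" * 15 + "1" * 7 + "0" * 7 + "1" * 11
assert rle_compress(example) == "1111011101111011"      # 40 bits -> 16 bits
assert rle_expand(rle_compress(example)) == example
assert rle_compress("111100011") == "0000010000110010"  # runs 0, 4, 3, 2
```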

Decimal to 4-bit binary: 0 = 0000, 1 = 0001, 2 = 0010, 3 = 0011, 4 = 0100, 5 = 0101, 6 = 0110, 7 = 0111, 8 = 1000, 9 = 1001, 10 = 1010, 11 = 1011, 12 = 1100, 13 = 1101, 14 = 1110, 15 = 1111.

RLE compression efficiency. What sequence gives the best compression? One with maximal runs: 15 zeros, 15 ones, 15 zeros, 15 ones, ... Each 15-symbol run costs 4 bits, so 4/15 ≈ 0.267 bits per input bit. Worst compression? 0101010101...: every run has length 1, so 4/1 = 4 bits per input bit. Using more (than 4) bits for the lengths gives a better best case but a worse worst case.

RLE is used as a component in some systems, but it is not a good general compression scheme. Let's look at a text example, with a more intricate pattern.

ABRACADABRA! First attempt: alphabet compaction. Codewords: A = 000, B = 001, C = 010, D = 011, R = 100, ! = 101. 12 × 3 bits = 36 bits.

The encoding is 000001100000010000011000001100000101. But do we have to use the same number of bits for all characters?

ABRACADABRA! Codewords: A = 0, B = 1, C = 01, D = 10, R = 00, ! = 11. Encoding 010000101001000. Won't work! (Why not? The code is ambiguous: a 0 could be an A or the start of a C or an R.) Can a variable-length code work? Yes, if it is prefix-free.

ABRACADABRA! Try variable lengths, with short codewords for common characters: A = 0, B = 1111, C = 110, D = 100, R = 1110, ! = 101. That is 30 bits total, less than 36!

So, we seem to have found a trick. Let's look at a more intuitive way to represent this code.

Tree representation. Codeword table: ! = 101, A = 0, B = 1111, C = 110, D = 100, R = 1110. [Figure: the same code drawn as a binary trie with the characters at the leaves; left edges are labelled 0, right edges 1.]

Compress: start at the character's leaf; follow the path up to the root; print the bits in reverse. Expand: start at the root; go left if the bit is 0, right if it is 1; at a leaf, print the character and return to the root. But: how do we find the best code? (Code in the book.) Now, how do we make the best use of this trick? (A decoding sketch is given below.)
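
Not the book's code: a small sketch of prefix-free decoding as a trie walk, using the codeword table above.

```python
# Hypothetical sketch: expand a prefix-free code by walking a binary trie.
CODE = {"!": "101", "A": "0", "B": "1111", "C": "110", "D": "100", "R": "1110"}

def build_trie(code: dict) -> dict:
    """Nested dicts: keys '0'/'1' for edges, key 'char' at a leaf."""
    root = {}
    for ch, word in code.items():
        node = root
        for bit in word:
            node = node.setdefault(bit, {})
        node["char"] = ch
    return root

def expand(bits: str, code: dict) -> str:
    root = build_trie(code)
    node, out = root, []
    for bit in bits:
        node = node[bit]
        if "char" in node:          # reached a leaf: emit and restart at the root
            out.append(node["char"])
            node = root
    return "".join(out)

# "ABRACADABRA!" encoded with the table above (30 bits).
encoded = "".join(CODE[c] for c in "ABRACADABRA!")
assert expand(encoded, CODE) == "ABRACADABRA!"
```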

Huffman code: count the frequencies of the characters; make a set with one node for each letter; extract the two nodes with smallest frequency; combine them, with a new node as root; add the new root node to the set; repeat until only one node remains. (Optimality proof: see the book.) (A heap-based sketch follows below.)
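
A heap-based sketch of the construction (my own illustration using Python's heapq, not the book's implementation), run on the ABRACADABRA! example:

```python
# Hypothetical sketch: build a Huffman code with a binary min-heap (heapq).
import heapq
from collections import Counter

def huffman_code(message: str) -> dict:
    freq = Counter(message)
    # Heap entries: (frequency, tiebreak, tree); a tree is a char or a (left, right) pair.
    heap = [(f, i, ch) for i, (ch, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)      # the two smallest-frequency nodes
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, tiebreak, (t1, t2)))
        tiebreak += 1
    _, _, tree = heap[0]
    code = {}
    def assign(node, prefix):
        if isinstance(node, str):
            code[node] = prefix or "0"       # single-character message edge case
        else:
            assign(node[0], prefix + "0")
            assign(node[1], prefix + "1")
    assign(tree, "")
    return code

code = huffman_code("ABRACADABRA!")
total = sum(len(code[c]) for c in "ABRACADABRA!")
print(code, total)   # an optimal code uses 28 bits for this message
```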

Huffman code construction for "ABRACADABRA!". Character frequencies: A 5, B 2, R 2, C 1, D 1, ! 1. [Figure: the resulting Huffman tree; the little red numbers are the node frequencies.]

Huffman code: compress N characters over an alphabet of size R. Data structure(s)? Time complexity?

Count frequencies: N. Build a binary min-heap keyed on frequency: R. Then (R − 1) steps, each extracting two nodes and inserting one: R lg R. Alternative: use two FIFO queues Q1 and Q2. Sort the leaves on frequency and insert them into Q1 in that order: sort-time(R, values 0..N). The minimum-frequency node is always next to be taken from the front of either Q1 or Q2; insert the newly combined nodes at the back of Q2. Is sort-time(R, values 0..N) = R lg R? No, key-indexed sorting can normally get it down to R. But how does expand know what the encoding is? (A sketch of the two-queue construction is given below.)
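
A sketch of the two-queue alternative (my own illustration, not from the lecture): once the leaves are sorted by frequency, merged nodes are produced in non-decreasing order, so the overall minimum is always at the front of one of the two queues and no heap is needed.

```python
# Hypothetical sketch: Huffman tree construction with two FIFO queues.
# Q1 holds the leaves in non-decreasing frequency order; Q2 holds merged nodes,
# which are produced in non-decreasing order, so the overall minimum is always
# at the front of Q1 or Q2.
from collections import deque, Counter

def huffman_tree_two_queues(message: str):
    leaves = sorted(Counter(message).items(), key=lambda kv: kv[1])
    q1 = deque((freq, ch) for ch, freq in leaves)   # (frequency, tree)
    q2 = deque()

    def pop_min():
        if not q2 or (q1 and q1[0][0] <= q2[0][0]):
            return q1.popleft()
        return q2.popleft()

    while len(q1) + len(q2) > 1:
        f1, t1 = pop_min()
        f2, t2 = pop_min()
        q2.append((f1 + f2, (t1, t2)))              # merged nodes go to the back of Q2
    return (q1 or q2)[0]

print(huffman_tree_two_queues("ABRACADABRA!"))
```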

The compressed message must include the code: the codeword of each character (as in the book); or the frequency of each character, so that expand builds the tree the same way as compress did; or the length of the codeword for each character, which is enough information to rebuild the tree. Note: Huffman can automatically compact the alphabet.

This is no problem if the alphabet is relatively small. If we don't include characters with zero frequency in the code, we get the alphabet compaction for free. Many descriptions stop here, as if we had found the optimal way to compress. But we are far from it.

The curse of whole-bit codewords: Huffman-encoding individual characters is not always the best we can do. Example: a 1000-character message with a highly skewed distribution. Frequencies and codewords: A 990/1000, codeword 0; B 7/1000, codeword 10; C 3/1000, codeword 11. Total: 990 × 1 + 7 × 2 + 3 × 2 = 1010 bits. RLE would do better!

How can we do better? One way is to use another alphabet.

Use double characters. Pair, frequency (computed), codeword:
AA: (990/1000)² ≈ 0.98, codeword 0
AB: (990/1000)(7/1000) ≈ 0.0069, codeword 10
AC: (990/1000)(3/1000) ≈ 0.0030, codeword 1110
BA: (7/1000)(990/1000) ≈ 0.0069, codeword 110
BB: (7/1000)² ≈ 0.000049, codeword 111110
BC: (7/1000)(3/1000) ≈ 0.000021, codeword 1111110
CA: (3/1000)(990/1000) ≈ 0.0030, codeword 11110
CB: (3/1000)(7/1000) ≈ 0.000021, codeword 11111110
CC: (3/1000)² ≈ 0.000009, codeword 11111111
Total: ca 600 bits.

Keep expanding the alphabet: combining three characters, for an alphabet of size 27, improves the precision further, and so on. Finally, combine all N characters, so that the whole message is one single character: arithmetic coding. An arithmetic encoder takes one frequency interval at a time and outputs bits as soon as they can be determined. We do not go into the details of how to do arithmetic coding in practice; just please accept that the problem has a solution. (A toy interval-narrowing sketch is given below.)
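
Purely as an illustration of the interval-narrowing idea (not a practical coder, and not from the lecture): a toy float-based sketch that maps a message from the skewed example onto a subinterval of [0, 1) and estimates how many bits are needed to identify it.

```python
# Hypothetical toy sketch of arithmetic coding using floats (fine for short
# messages only; real coders use integer arithmetic with incremental bit output).
from math import ceil, log2

FREQ = {"A": 0.99, "B": 0.007, "C": 0.003}          # model from the example above

def cumulative(freq):
    lo, table = 0.0, {}
    for ch, p in freq.items():
        table[ch] = (lo, lo + p)
        lo += p
    return table

def encode(message, freq):
    table = cumulative(freq)
    low, high = 0.0, 1.0
    for ch in message:
        span = high - low
        c_lo, c_hi = table[ch]
        low, high = low + span * c_lo, low + span * c_hi
    bits = ceil(-log2(high - low)) + 1               # enough bits to pin down the interval
    return low, high, bits

low, high, bits = encode("A" * 99 + "B", FREQ)
print(f"interval [{low}, {high}), about {bits} bits")  # far fewer than 100 bits
```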

Entropy coding: Huffman, Shannon-Fano, canonical codes, arithmetic coding. Techniques exist to output the right number of bits, with sufficient precision. For details, see e.g. Witten, Moffat & Bell, Managing Gigabytes.

But, wait a minute. (The same pair frequencies as in the table above.) These are clearly not the best frequency estimates: for instance, in English, "th" is more common than "ht". We can get more information out of the original message.

Idea 1: statistics with context. Example: in English, the letter u is not among the most common few, except after q, where it is by far the most common! Idea: use different frequency tables based on the previous character.

ABRACADABRA! Frequencies by context (previous character):
After A: A 0, B 2, C 1, D 1, R 0, ! 1
After B: A 0, B 0, C 0, D 0, R 2, ! 0
After C: A 1, B 0, C 0, D 0, R 0, ! 0
Build, e.g., different Huffman codes for each context. (A sketch of collecting such statistics is given below.)
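
A small sketch (my own, not the lecture's) of collecting these order-1 context statistics:

```python
# Hypothetical sketch: collect order-1 context statistics (frequency of each
# character conditioned on the previous character), as in the table above.
from collections import Counter, defaultdict

def context_stats(message: str) -> dict:
    stats = defaultdict(Counter)
    for prev, ch in zip(message, message[1:]):
        stats[prev][ch] += 1
    return stats

for context, counts in sorted(context_stats("ABRACADABRA!").items()):
    print(f"After {context}: {dict(counts)}")
# One could now build a separate Huffman (or arithmetic) code per context.
```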

More detailed contexts. Example: after "compres", the letter s is overrepresented. Use longer strings as contexts: those that are significant in the message. Problem: lots of codes! Do they all need to be included in the compressed message? Solution: dynamic contexts.

[Figure: context tree for the string "letlettertele"; for example, the context "et" has appeared 2 times.]

Dynamic context modeling. Start with just one (or R) contexts, with all entries in the frequency tables equal. Add contexts and update the statistics one character at a time. Expand builds the model in exactly the same way as compress, so no code needs to be included in the compressed message! Examples: prediction by partial matching (PPM), dynamic Markov compression (DMC). Good compression properties, but they take much computation in both compress and expand.

Idea 2: build dictionaries. Instead of individual characters, encode phrases. This is computationally simpler than statistical modeling, and less sensitive to lack of precision in whole-bit codes (since the alphabet is large). Dictionary methods are equivalent to (weird) special cases of statistical models.

LZ77. The compressed message consists of triples <pos, length, next>: pos is the position (counting backwards) where the phrase starts, length is the number of characters in the phrase, and next is the first character after the phrase.

Example: <0,0,a> <0,0,b> <2,1,a> <3,2,b> <5,3,b> <6,6,b> expands to abaababaabbabaabbb.

LZ77 was considered impractical for years, because scanning for the longest matching string during compression takes N² time. But does it? Design the compression algorithm! Data structures? Time complexity? (A naive sketch is given below.)
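
One possible naive answer, as a sketch (my own illustration; the quadratic match search is exactly what the slide questions):

```python
# Hypothetical naive LZ77 sketch (quadratic search; real implementations use
# hashing, tries or suffix structures to find the longest match).

def lz77_compress(text: str):
    triples, i = [], 0
    while i < len(text):
        best_len, best_pos = 0, 0
        for start in range(i):                      # candidate match start in the history
            length = 0
            # the match may run into the part being encoded (overlapping copy)
            while i + length < len(text) - 1 and text[start + length] == text[i + length]:
                length += 1
            if length > best_len:
                best_len, best_pos = length, i - start   # position counted backwards
        triples.append((best_pos, best_len, text[i + best_len]))
        i += best_len + 1
    return triples

def lz77_expand(triples):
    out = []
    for pos, length, nxt in triples:
        start = len(out) - pos
        for k in range(length):                     # char by char: handles overlaps
            out.append(out[start + k])
        out.append(nxt)
    return "".join(out)

msg = "abaababaabbabaabbb"
assert lz77_expand(lz77_compress(msg)) == msg
assert lz77_expand([(0,0,"a"), (0,0,"b"), (2,1,"a"), (3,2,"b"), (5,3,"b"), (6,6,"b")]) == msg
```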

Idea 3: block sorting. Group the characters in the output according to their contexts: the more similar the contexts, the closer together the characters end up. This generates repetitions, which are easier to compress.

Idea 3: block sorting. In a chunk of the message, sort all strings (contexts). Encode the characters in their sorted-context order: lots of repetition. Then compress with RLE and/or move-to-front. Remarkably, it's easy to get the original order back! This is the Burrows-Wheeler transform (BWT). Contexts are strings, so we can use string sorting for the grouping and ordering.

Note on backward contexts: the string after a character works as a context just as well as the string before it. After "compres", s is overrepresented; before "ompress", c is overrepresented.

abraca. Sort the rotations; encode the row number of the original message and the last character of each row:
row 0: aabrac
row 1: abraca
row 2: acaabr
row 3: bracaa
row 4: caabra
row 5: racaab
Transformed message: <1, caraab>. (A sketch of the forward transform is given below.)
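
A sketch of the forward transform by explicit rotation sorting (my own illustration; fine for small inputs, while real implementations use suffix sorting as described later):

```python
# Hypothetical sketch of the forward Burrows-Wheeler transform by rotation sorting.

def bwt(message: str):
    n = len(message)
    rotations = sorted(message[i:] + message[:i] for i in range(n))
    row = rotations.index(message)          # row of the original message
    last_column = "".join(rot[-1] for rot in rotations)
    return row, last_column

assert bwt("abraca") == (1, "caraab")
```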

Expand. We received <1, caraab>. Write the transformed message as the last column of the (unknown) sorted-rotation matrix: row 0: c, row 1: a, row 2: r, row 3: a, row 4: a, row 5: b.

Sorting these characters gives the first column, since the rows are the sorted rotations: row 0: a...c, row 1: a...a, row 2: a...r, row 3: b...a, row 4: c...a, row 5: r...b.

Rotating each row by one character gives the two-character strings ca, aa, ra, ab, ac, br; each of these must also start some row. Matching them up (sorting on the second character) gives the correspondence T = 4 0 5 1 2 3.

Starting from row 1 (the row of the original message) and following T, the message is rebuilt one character at a time, from the back: a, ca, aca, raca, braca, abraca.

Expand is quick, linear time. Compress is heavier, because of the rotation sorting. (A sketch of the inverse transform is given below.)
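
A sketch of the inverse transform following the walkthrough above (my own illustration):

```python
# Hypothetical sketch of the inverse Burrows-Wheeler transform using the
# last-column/first-column correspondence T from the walkthrough above.

def inverse_bwt(row: int, last_column: str) -> str:
    n = len(last_column)
    # order[k] = the row whose last-column character is the k-th smallest
    # (ties broken by row number, which keeps equal characters in their original order).
    order = sorted(range(n), key=lambda i: (last_column[i], i))
    # T[i] = the row to visit next: its last character precedes row i's in the message.
    T = [0] * n
    for first_row, last_row in enumerate(order):
        T[last_row] = first_row
    out, r = [], row
    for _ in range(n):                 # collect the message from back to front
        out.append(last_column[r])
        r = T[r]
    return "".join(reversed(out))

assert inverse_bwt(1, "caraab") == "abraca"
```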

Rotation sorting → suffix sorting. Add an implicit last character $, smallest in the alphabet. Sorting the rotations of abraca$ is then the same as sorting the suffixes of abraca$.

Suffix sorting example: the string babaaaabcbabaaaaa$ (positions 0-17). [Figure: all suffixes, first in position order 0-17, then in sorted order, starting at positions 17, 16, 15, 14, 13, 12, 3, 4, 5, 10, 1, 6, 11, 2, 9, 0, 7, 8.] Reading the character preceding each sorted suffix gives the BWT output aaaaabbaabbaaac$ab, i.e. the characters at positions 16 15 14 13 12 11 2 3 4 9 0 5 10 1 8 17 6 7.

Space is linear, but the sorting sees quadratic data: each comparison can take linear time, so a comparison-based algorithm has a worst-case order of growth of N² lg N.

Suffix sorting time complexity. Naive: at least N² in the worst case. Prefix doubling: N lg N. Suffix tree or recursive construction: N. Suffix sorting is the computationally heaviest part of BWT; specialized methods exist that improve on the worst case. (A naive sketch is given below.)
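
A sketch of the naive approach, tied back to the BWT example above (my own illustration; prefix doubling and suffix-tree constructions are not shown):

```python
# Hypothetical sketch: naive suffix sorting (each comparison can take linear
# time, hence the quadratic-ish worst case noted above) and the BWT derived from it.

def suffix_array_naive(s: str):
    return sorted(range(len(s)), key=lambda i: s[i:])

def bwt_from_suffixes(s: str) -> str:
    assert s.endswith("$")                      # implicit smallest end marker
    # s[i - 1] is the character preceding suffix i; for i == 0 it wraps to the final '$'.
    return "".join(s[i - 1] for i in suffix_array_naive(s))

text = "babaaaabcbabaaaaa$"
assert suffix_array_naive(text)[:6] == [17, 16, 15, 14, 13, 12]
assert bwt_from_suffixes(text) == "aaaaabbaabbaaac$ab"
```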