Data Compression. Guest lecture, SGDS Fall 2011
- Rebecca Price
1 Data Compression. Guest lecture, SGDS Fall 2011
2 Basics Lossy/lossless Alphabet compaction Compression is impossible Compression is possible RLE Variable-length codes Undecidable Pigeon-holes Patterns Randomness Huffman Arithmetic coding Using phrases Dynamic context Ziv-Lempel Burrows-Wheeler Suffix sorting Data compression is not a traditional algorithms-course topic, but it is interesting, both in itself and as an application of algorithms and data structures. Book: fragments, not that well chosen from a compression expert's view. This lecture: a fuller view, with connections to what you learned in the course.
3 Basic model bitstream B Compress compressed version C(B) Expand original bitstream B Basic model for data compression. The original message consists of characters, pixels, sound samples or whatever. In much of the lecture we assume that it consists of characters, but more generally we can view it as just a stream of bits, because all data representations can be broken down to bits. A compression method is two algorithms: compress and expand. It seems impossible that you could get the original back; surely you would have to throw away some data. And sometimes you do.
4 Lossy Compress Compressed message Expand Images, video, sound, ... If we accept loss, which we can do for some kinds of data, it's more believable that we can compress.
5 Lossless Compress Compressed message Expand Anything, including text, machine code, ... This lecture (and book): lossless only. But there are also lossless methods, which reproduce the original exactly. Lossless techniques are useful also inside lossy methods: even when accepting loss, you want to represent the exact information as compactly as possible. One case where compression is fairly easy is when there are unused bits in B, i.e., it does not store the data as compactly as it could.
6 Easy: alphabet compaction. Genome: a string over the alphabet { A, C, T, G }. Encode an N-character genome: ATAGATGCATAG... As ASCII bytes: 8 bits per character. With a 2-bit encoding: char encoding A 00, C 01, T 10, G 11. That's nice, but in general, there are no unused bits.
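As a minimal sketch of this idea, the following packs a { A, C, T, G } string four characters per byte using the 2-bit table above (the sentinel-bit trick for preserving leading zeros is my own addition, not from the slide):

```python
# 2-bit alphabet compaction for a genome string.
# Code table from the slide: A=00, C=01, T=10, G=11.
CODE = {"A": 0b00, "C": 0b01, "T": 0b10, "G": 0b11}
CHAR = {v: k for k, v in CODE.items()}

def compact(genome: str) -> bytes:
    """Pack 4 characters per byte instead of 1 character per ASCII byte."""
    bits = 0
    for ch in genome:
        bits = (bits << 2) | CODE[ch]
    # Prepend a sentinel 1-bit so leading A's (00) are not lost.
    bits |= 1 << (2 * len(genome))
    return bits.to_bytes((2 * len(genome)) // 8 + 1, "big")

def expand(packed: bytes) -> str:
    bits = int.from_bytes(packed, "big")
    chars = []
    while bits > 1:                 # stop when only the sentinel bit remains
        chars.append(CHAR[bits & 0b11])
        bits >>= 2
    return "".join(reversed(chars))
```

A 12-character genome then occupies 4 bytes (including the sentinel) instead of 12 ASCII bytes.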
7 But, in general: any representable data may appear. No superfluous bits to remove.
8 Computational formulation. Compress Input: N-bit message B. Output: the smallest possible program, C(B), that produces B as output (when given no input). Expand: run C(B), get B. The length of C(B) is the Kolmogorov complexity of B. UNDECIDABLE. The most general kind of code is a programming language. Let's say that C(B) is a program that produces B, and let's find the smallest such program. Undecidable: there is no, and can be no, algorithm that computes it in general. Generally, one should not be too discouraged; sometimes a non-general algorithm is useful. But let's make this easier, by requiring not that C(B) is the smallest possible, but just that it is smaller than B.
9 New attempt: skip smallest possible. Compress Input: N-bit message B. Output: N′-bit message C(B), N′ < N. Expand Input: N′-bit message C(B). Output: N-bit message B. IMPOSSIBLE. Why is this impossible? The pigeon-hole principle applies.
10 B: 2^N possibilities. C(B): 2^N′ possibilities. Compress: compression means mapping each dot on the left to some dot on the right. Since there are fewer possibilities for C(B) than for B, there are some B1 and B2 for which C(B1) = C(B2). This is easy to see when N is 2 or 3 or so, but don't get fooled: it applies even if N is billions.
11 B: 2^N possibilities. C(B): 2^N′ possibilities. Expand? Expand cannot choose between B1 and B2.
12 So, we give up? Some of the 2^N messages may be illegal; no need to encode them. Even if they are all legal, some are more probable than others.
13 Modified goal. Compress Input: N-bit message B. Output: N′-bit message C(B); N′ < N for the most common instances of B. For less common B: OK if N′ > N. Expand Input: N′-bit message C(B). Output: N-bit message B. A little vague, not really a mathematical definition. We would need some more information theory to make a formal definition, which is beyond the scope of this lecture.
14 Example. LEFT: Mary had a little lamb. RIGHT: hsy, iimlh kwvsadjh h.j. Text (upper/lower case letters + punctuation), 6 bits/char: 23 × 6 bits = 138 bits. Compressed, text can use fewer, say 2.5 bits/char, because text patterns are predictable: 23 × 2.5 bits = 57.5 bits. The random string uncompressed is the same length: 23 × 6 bits = 138 bits. Compressed, this data (with no predictable patterns) will use more, say 6.8 bits/char: 23 × 6.8 bits = 156.4 bits. So without compression, just alphabet compaction, we get 138 bits on the left; compressed, we might get, e.g., 57.5. On the right, we must allow a little more than uncompressed. So, how then? We have to find predictable patterns.
15 3.14159265358979... the first decimals of π. No normal compression method finds this pattern. Compression models are all based on repetition and/or skewed distribution. If we don't have special knowledge (of π in this case), the message looks random.
16 Randomness. A message that looks random will not be compressed. A sequence that is truly random cannot be compressed (pigeon-holes again). Maximum-compressed data looks random. Looking random depends on the model used; every compression method has one, explicit or implicit. Now let's look at a message where we can easily see some pattern.
17 Run-length encoding (RLE). How would you compress a bit sequence consisting of a few long runs of 0s and 1s? If you would just describe this bit sequence, you would say something like: so many 0s, then so many 1s, and so on. Let's use that as a compression format. To make it into a bit string, we need to encode the run lengths in binary too (next slide). With 4-bit counts, the slide's 40-bit example of four runs becomes 4 × 4 = 16 bits: this compressed 40 bits into 16 bits. What compression do we generally get with this method?
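The scheme above can be sketched as follows; a minimal RLE for bit strings with fixed-width counts. The conventions (runs start with 0s; an over-long run is split by a zero-length run of the opposite bit) are my assumptions, not stated on the slide:

```python
# Run-length encode a bit string using fixed-width binary run counts.
# Convention: runs alternate starting with 0s; a run longer than the
# maximum count is split by emitting a zero-length opposite run.
def rle_compress(bits: str, width: int = 4) -> str:
    out, run_char, run_len = [], "0", 0
    max_run = (1 << width) - 1
    for b in bits:
        if b == run_char and run_len < max_run:
            run_len += 1
        elif b == run_char:                 # run overflow: emit max, then 0
            out += [format(max_run, f"0{width}b"), format(0, f"0{width}b")]
            run_len = 1
        else:                               # run ends: emit its length
            out.append(format(run_len, f"0{width}b"))
            run_char, run_len = b, 1
    out.append(format(run_len, f"0{width}b"))
    return "".join(out)

def rle_expand(code: str, width: int = 4) -> str:
    out, bit = [], "0"
    for i in range(0, len(code), width):
        out.append(bit * int(code[i:i + width], 2))   # run of current bit
        bit = "1" if bit == "0" else "0"              # alternate
    return "".join(out)
```

For example, a 40-bit input with runs of 15, 7, 7 and 11 compresses to the 16 bits 1111 0111 0111 1011, while an alternating sequence like 10101 expands to 24 bits, illustrating the worst case of the next slides.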
18 Decimal Binary
19 RLE compression efficiency. What sequence gives the best compression? Maximal runs (length 15): 4/15 bits/bit. Worst compression? Runs of length 1 (alternating bits): 4/1 = 4 bits/bit. More (than 4) bits for lengths: better best case, worse worst case. Used as a component in some systems, but not a good general compression scheme. Let's look at a text example, with a more intricate pattern.
20 ABRACADABRA! First attempt: alphabet compaction. char encoding A 000, B 001, C 010, D 011, R 100, ! 101. Encoding: 12 × 3 bits = 36 bits. But do we have to use the same number of bits for all characters?
21 ABRACADABRA! char encoding A 0, B 1, C 01, D 10, R 00, ! 11. Won't work! (why not?) Can a variable-length code work? Yes! If it is prefix-free. Encoding.
22 ABRACADABRA! Try variable lengths, with short codewords for common characters. char encoding A 0, B 1111, C 110, D 100, R 1110, ! 101. 30 bits total, less than 36! So, we seem to have found a trick. Let's look at a more intuitive way to represent this code.
23 Tree representation. Codeword table: key (A B C D R !), value (the codeword). Compressed bitstring. Trie representation: 0 on left edges, 1 on right edges; the characters sit at the leaves (left to right: A; D, !; C; R, B). Compress: start at the leaf; follow the path up to the root; print the bits in reverse. Expand: start at the root; go left if the bit is 0, go right if 1; at a leaf node, print the char and return to the root. But: how do we find the best code? Code in the book. Now, how do we make the best use of this trick?
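The expand procedure just described can be sketched in a few lines. The tree below is my reading of the slide's trie (leaves A; D, !; C; R, B), so treat the exact shape as an assumption; the traversal logic is the point:

```python
# Decode a prefix-free code by walking a binary trie.
# Internal nodes are (left, right) pairs; leaves are characters.
# Tree assumed from the slide: A=0, D=100, !=101, C=110, R=1110, B=1111.
tree = ("A", (("D", "!"), ("C", ("R", "B"))))

def expand(bits: str, root):
    out, node = [], root
    for b in bits:
        node = node[int(b)]          # 0 = go left, 1 = go right
        if isinstance(node, str):    # leaf: emit the char, restart at root
            out.append(node)
            node = root
    return "".join(out)
```

For instance, the bit string 0 100 1110 decodes to ADR; because the code is prefix-free, no lookahead or separators are needed.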
24 Huffman code. Count frequencies of characters. Make a set with one node for each letter. Extract the two nodes with smallest frequency. Combine them, with a new node as root. Add the new root node to the set. Repeat, until only one node remains. (Optimality proof: see book.)
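The steps above can be sketched directly with a binary min-heap; this is a schematic implementation of the construction, not the book's code (the insertion counter is my tie-breaking device):

```python
import heapq
from collections import Counter

def huffman_code(message: str) -> dict:
    """Build a Huffman code for the characters of message."""
    freq = Counter(message)
    # Heap entries are (frequency, insertion-order tiebreak, subtree).
    heap = [(f, i, ch) for i, (ch, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    n = len(heap)
    if n == 1:                           # degenerate one-symbol alphabet
        return {heap[0][2]: "0"}
    while len(heap) > 1:                 # repeatedly combine two smallest
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, n, (t1, t2)))
        n += 1
    code = {}
    def walk(node, prefix):              # read codewords off the tree
        if isinstance(node, str):
            code[node] = prefix
        else:
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
    walk(heap[0][2], "")
    return code
```

On ABRACADABRA! any Huffman tree encodes the message in 28 bits total, regardless of how frequency ties are broken.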
25 Huffman code construction for A B R A C A D A B R A ! char freq: A 5, B 2, R 2, C 1, D 1, ! 1 (total 12). The little red numbers in the diagram are the frequencies.
26 Huffman code. Compress N characters, alphabet size R. Data structure(s)? Time complexity? Count frequencies: N. Build a binary min-heap on frequency: R. (R − 1) combining steps, each extracting two nodes and inserting one: R lg R. Alternatively, use two FIFO queues Q1, Q2: sort on frequency and insert into Q1 in frequency order: sort-time(R, values 0..N). The minimum-frequency node is always next to get from either Q1 or Q2; insert new nodes into Q2: sort-time(R, values 0..N). Is that R lg R? No, key-indexed sorting can normally get it down to R. But how does expand know what the encoding is?
27 The compressed message must include the code: the codeword of each character (book); or the frequency of each character (expand builds the tree in the same way as compress); or the length of the codeword for each character (enough info to rebuild the tree). Note: Huffman can automatically compact the alphabet. No problem if the alphabet is relatively small. If we don't include characters with zero frequency in the code, we get natural compaction. Many descriptions stop here: we found the optimal way to compress! But we are far from it.
28 The curse of whole-bit codewords. Huffman-encoding characters is not always the best we can do. Example: a 1000-char message with a highly skewed distribution. char freq encoding: A 990/1000 0; B 7/1000 10; C 3/1000 11. Total: 990 × 1 + 10 × 2 = 1010 bits. RLE would do better! How can we do better? One way is to use another alphabet.
29 Use double characters. char freq (computed): AA (990/1000)(990/1000); AB (990/1000)(7/1000); AC (990/1000)(3/1000); BA (7/1000)(990/1000); BB (7/1000)(7/1000); BC (7/1000)(3/1000); CA (3/1000)(990/1000); CB (3/1000)(7/1000); CC (3/1000)(3/1000). Total: ca 600 bits.
30 Keep expanding the alphabet. Combining three characters, to alphabet size 27, improves precision further. Etc. Finally: combine all N characters, so the message is one single character. Arithmetic coding: the arithmetic encoder takes one frequency interval at a time and outputs bits as soon as they can be determined. We do not go into the details of how to do arithmetic coding in practice; just please accept that the problem has a solution.
31 Entropy coding Huffman, Shannon-Fano, canonical code, arithmetic coding techniques exist to output right number of bits, with sufficient precision For details, see e.g. Witten, Moffat, & Bell, Managing Gigabytes 31
32 But, wait a minute. char freq (computed): AA (990/1000)² ≈ .98; AB (990/1000)(7/1000) ≈ .0069; AC (990/1000)(3/1000) ≈ .0030; BA (7/1000)(990/1000) ≈ .0069; BB (7/1000)²; BC (7/1000)(3/1000); CA (3/1000)(990/1000) ≈ .0030; CB (3/1000)(7/1000); CC (3/1000)². These are clearly not the best frequency estimates. For instance, in English, "th" is more common than "ht". We can get more data from the original message.
33 Idea I: Statistics with context Example: in English, the letter u is not among the most common few except after q, where it is by far the most common! Idea: use different frequency tables based on the previous character 33
34 ABRACADABRA! After A: char freq A 0, B 2, C 1, D 1, R 0, ! 1. After B: A 0, B 0, C 0, D 0, R 2, ! 0. After C: A 1, B 0, C 0, D 0, R 0, ! 0. Build, e.g., different Huffman codes for each context.
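Collecting the per-context tables above is a one-pass count over character pairs; a minimal sketch of the order-1 statistics (the function name is mine):

```python
from collections import Counter, defaultdict

def context_stats(message: str):
    """One frequency table per one-character context (the preceding char)."""
    tables = defaultdict(Counter)
    for prev, ch in zip(message, message[1:]):
        tables[prev][ch] += 1
    return tables
```

On ABRACADABRA! this reproduces the slide's tables: after A the counts are B 2, C 1, D 1, ! 1; after B only R 2; after C only A 1. Each table would then drive its own Huffman (or arithmetic) code.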
35 More detailed contexts. Example: after "compres", s is overrepresented. Use longer strings as context: those significant in the message. Problem: lots of codes! Do they need to be included in the compressed message? Solution: dynamic contexts.
36 Context tree for the string letlettertele. (For instance, the context "et" has appeared 2 times.)
37 Dynamic context modeling. Start with just one (or R) contexts, with equal entries in the frequency tables. Add contexts and update statistics one character at a time. Build them exactly the same way in expand as in compress: no code needs to be included in the compressed message! Prediction by partial matching (PPM), Dynamic Markov Chaining (DMC). Good compression properties, but they take much computation in both compress and expand.
38 Idea 2: Build dictionaries Instead of individual characters, encode phrases Computationally simpler than statistical modeling Less sensitive to lack of precision in bit codes (alphabet is large) Dictionary methods are equivalent to (weird) special cases of statistical models 38
39 LZ77. The compressed message consists of triples <pos, length, next>: pos is the position (counting backwards) of the phrase, length is the number of characters in the phrase, and next is the first character after the phrase.
40 <0,0,a> <0,0,b> <2,1,a> <3,2,b> <5,3,b> <6,6,b> Expand: abaababaabbabaabbb. Considered impractical for years, because scanning for the longest match during compression takes N² time. But does it? Design the compression algorithm! Data structures? Time complexity?
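A straightforward (deliberately naive, quadratic-scan) sketch of both directions, matching the triple format above; note that a phrase may overlap the current position, which expand handles by copying one character at a time:

```python
def lz77_compress(msg: str):
    """Greedy LZ77: longest backward match at each position (naive N^2 scan)."""
    out, i = [], 0
    while i < len(msg):
        best_pos = best_len = 0
        for pos in range(1, i + 1):          # try every backward offset
            length = 0
            # The phrase may overlap position i (self-referencing copy).
            while (i + length < len(msg) - 1
                   and msg[i + length] == msg[i + length - pos]):
                length += 1
            if length > best_len:
                best_pos, best_len = pos, length
        out.append((best_pos, best_len, msg[i + best_len]))
        i += best_len + 1
    return out

def lz77_expand(triples):
    out = []
    for pos, length, nxt in triples:
        for _ in range(length):              # copy, allowing overlap
            out.append(out[-pos])
        out.append(nxt)
    return "".join(out)
```

Expanding the slide's six triples yields abaababaabbabaabbb; the data-structure question on the slide is exactly how to replace the inner scan with something faster.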
41 Idea 3: Block sorting. Group characters in the output according to their contexts: the more similar the contexts, the closer together. This generates repetitions, which are easier to compress.
42 Idea 3: Block sorting. In a chunk of the message, sort all strings (contexts). Encode characters in their sorted-context order: lots of repetition. Then compress with RLE and/or move-to-front. Remarkably, it's easy to get the original order back! Burrows-Wheeler transform (BWT). Contexts are strings, so we can use string sorting for grouping/ordering.
43 Note on backward contexts. The string after a character works as context (just as well as the string before): after "compres", s is overrepresented; before "ompress", c is overrepresented.
44 abraca. Sort the rotations. Encode the row of the original message. Encode the last characters of the rows. row 0 aabrac; 1 abraca; 2 acaabr; 3 bracaa; 4 caabra; 5 racaab. Transformed message: <1, caraab> 44
45 Expand row 0 c 1 a 2 r 3 a 4 a 5 b 45
46 Expand row 0 a c 1 a a 2 a r 3 b a 4 c a 5 r b 46
47 Expand row 0 a c 1 a a 2 a r 3 b a 4 c a 5 r b rotated ca aa ra ab ac br sorted on second character 47
48 Expand row 0 a c 1 a a 2 a r 3 b a 4 c a 5 r b rotated ca aa ra ab ac br T sorted on second character 48
49 Expand row 0 a c 1 a a 2 a r 3 b a 4 c a 5 r b rotated ca aa ra ab ac br T a 49
50 Expand row 0 a c 1 a ca 2 a r 3 b a 4 c a 5 r b rotated ca aa ra ab ac br T ca 50
51 Expand row 0 a c 1 a aca 2 a r 3 b a 4 c a 5 r b rotated ca aa ra ab ac br T aca 51
52 Expand row 0 a c 1 a raca 2 a r 3 b a 4 c a 5 r b rotated ca aa ra ab ac br T raca 52
53 Expand row 0 a c 1 abraca 2 a r 3 b a 4 c a 5 r b rotated ca aa ra ab ac br T braca 53 Expand is quick, linear time. Compress is heavier, because of rotation sorting.
54 Expand row 0 a c 1 abraca 2 a r 3 b a 4 c a 5 r b rotated ca aa ra ab ac br T abraca 54
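The transform and its inversion, as walked through above, can be sketched compactly. The inverse uses the standard observation that stably sorting the positions of the last column gives a "next row" map; note that without a unique end marker the transform is ambiguous for periodic strings, which is why the next slide adds $:

```python
def bwt(msg: str):
    """Burrows-Wheeler transform by sorting all rotations of msg."""
    rows = sorted(msg[i:] + msg[:i] for i in range(len(msg)))
    return rows.index(msg), "".join(row[-1] for row in rows)

def inverse_bwt(row: int, last: str) -> str:
    # next_row[k]: the row that follows row k when walking the rotations.
    # Stable sort of the last column's indices recovers this correspondence.
    next_row = sorted(range(len(last)), key=lambda j: last[j])
    out = []
    for _ in range(len(last)):
        row = next_row[row]
        out.append(last[row])
    return "".join(out)
```

For the slide's example, bwt("abraca") gives row 1 and the string caraab, and inverse_bwt(1, "caraab") restores abraca; the inversion is linear time, while the forward direction pays for the rotation sorting.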
55 Rotation sorting = suffix sorting. Add an implicit last character $, smallest in the alphabet. Sorting the rotations of abraca$ = sorting the suffixes of abraca$.
56 Suffix sorting, for the string b a b a a a a b c b a b a a a a a $
Suffixes:
0 babaaaabcbabaaaaa$
1 abaaaabcbabaaaaa$
2 baaaabcbabaaaaa$
3 aaaabcbabaaaaa$
4 aaabcbabaaaaa$
5 aabcbabaaaaa$
6 abcbabaaaaa$
7 bcbabaaaaa$
8 cbabaaaaa$
9 babaaaaa$
10 abaaaaa$
11 baaaaa$
12 aaaaa$
13 aaaa$
14 aaa$
15 aa$
16 a$
17 $
Sorted:
17 $
16 a$
15 aa$
14 aaa$
13 aaaa$
12 aaaaa$
3 aaaabcbabaaaaa$
4 aaabcbabaaaaa$
5 aabcbabaaaaa$
10 abaaaaa$
1 abaaaabcbabaaaaa$
6 abcbabaaaaa$
11 baaaaa$
2 baaaabcbabaaaaa$
9 babaaaaa$
0 babaaaabcbabaaaaa$
7 bcbabaaaaa$
8 cbabaaaaa$
BWT output: a a a a a b b a a b b a a a c $ a b
Space is linear, but the sorting sees quadratic data. A single comparison can take linear time, so a comparison-based algorithm has worst-case order of growth N² lg N.
57 Suffix sorting time complexity. Naive: at least N² in the worst case. Prefix doubling: N lg N. Suffix tree, recursive: N. Suffix sorting is the computationally heaviest part of BWT. Specialized methods exist that improve on the worst case.
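Putting the last two slides together: with the $ sentinel, the BWT falls straight out of the suffix array. This sketch uses the naive sort (the quadratic-comparison baseline the slide mentions); the faster constructions replace only suffix_array:

```python
def suffix_array(s: str):
    """Naive suffix sorting: direct comparison of suffix slices."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def bwt_from_sa(s: str) -> str:
    """BWT via the suffix array: output the character preceding each suffix."""
    assert s.endswith("$")                 # unique smallest end marker
    return "".join(s[i - 1] for i in suffix_array(s))  # s[-1] is $ itself
```

On the slide's example string this reproduces the BWT output shown there (aaaaabbaabbaaac$ab).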
More informationAlgorithms Dr. Haim Levkowitz
91.503 Algorithms Dr. Haim Levkowitz Fall 2007 Lecture 4 Tuesday, 25 Sep 2007 Design Patterns for Optimization Problems Greedy Algorithms 1 Greedy Algorithms 2 What is Greedy Algorithm? Similar to dynamic
More informationInformation Theory and Communication
Information Theory and Communication Shannon-Fano-Elias Code and Arithmetic Codes Ritwik Banerjee rbanerjee@cs.stonybrook.edu c Ritwik Banerjee Information Theory and Communication 1/12 Roadmap Examples
More informationECE 417 Guest Lecture Video Compression in MPEG-1/2/4. Min-Hsuan Tsai Apr 02, 2013
ECE 417 Guest Lecture Video Compression in MPEG-1/2/4 Min-Hsuan Tsai Apr 2, 213 What is MPEG and its standards MPEG stands for Moving Picture Expert Group Develop standards for video/audio compression
More informationA Comprehensive Review of Data Compression Techniques
Volume-6, Issue-2, March-April 2016 International Journal of Engineering and Management Research Page Number: 684-688 A Comprehensive Review of Data Compression Techniques Palwinder Singh 1, Amarbir Singh
More informationSo, what is data compression, and why do we need it?
In the last decade we have been witnessing a revolution in the way we communicate 2 The major contributors in this revolution are: Internet; The explosive development of mobile communications; and The
More informationCOMPRESSION OF SMALL TEXT FILES
COMPRESSION OF SMALL TEXT FILES Jan Platoš, Václav Snášel Department of Computer Science VŠB Technical University of Ostrava, Czech Republic jan.platos.fei@vsb.cz, vaclav.snasel@vsb.cz Eyas El-Qawasmeh
More informationADVANCED LOSSLESS TEXT COMPRESSION ALGORITHM BASED ON SPLAY TREE ADAPTIVE METHODS
ADVANCED LOSSLESS TEXT COMPRESSION ALGORITHM BASED ON SPLAY TREE ADAPTIVE METHODS RADU RĂDESCU, ANDREEA HONCIUC *1 Key words: Data compression, Splay Tree, Prefix, ratio. This paper presents an original
More information14.4 Description of Huffman Coding
Mastering Algorithms with C By Kyle Loudon Slots : 1 Table of Contents Chapter 14. Data Compression Content 14.4 Description of Huffman Coding One of the oldest and most elegant forms of data compression
More informationRepetition 1st lecture
Repetition 1st lecture Human Senses in Relation to Technical Parameters Multimedia - what is it? Human senses (overview) Historical remarks Color models RGB Y, Cr, Cb Data rates Text, Graphic Picture,
More informationIMAGE PROCESSING (RRY025) LECTURE 13 IMAGE COMPRESSION - I
IMAGE PROCESSING (RRY025) LECTURE 13 IMAGE COMPRESSION - I 1 Need For Compression 2D data sets are much larger than 1D. TV and movie data sets are effectively 3D (2-space, 1-time). Need Compression for
More informationChapter 5 VARIABLE-LENGTH CODING Information Theory Results (II)
Chapter 5 VARIABLE-LENGTH CODING ---- Information Theory Results (II) 1 Some Fundamental Results Coding an Information Source Consider an information source, represented by a source alphabet S. S = { s,
More informationLCP Array Construction
LCP Array Construction The LCP array is easy to compute in linear time using the suffix array SA and its inverse SA 1. The idea is to compute the lcp values by comparing the suffixes, but skip a prefix
More informationAnalysis of Algorithms
Algorithm An algorithm is a procedure or formula for solving a problem, based on conducting a sequence of specified actions. A computer program can be viewed as an elaborate algorithm. In mathematics and
More informationQuad-Byte Transformation as a Pre-processing to Arithmetic Coding
Quad-Byte Transformation as a Pre-processing to Arithmetic Coding Jyotika Doshi GLS Inst.of Computer Technology Opp. Law Garden, Ellisbridge Ahmedabad-380006, INDIA Savita Gandhi Dept. of Computer Science;
More informationCS : Data Structures
CS 600.226: Data Structures Michael Schatz Nov 16, 2016 Lecture 32: Mike Week pt 2: BWT Assignment 9: Due Friday Nov 18 @ 10pm Remember: javac Xlint:all & checkstyle *.java & JUnit Solutions should be
More informationAn Overview 1 / 10. CS106B Winter Handout #21 March 3, 2017 Huffman Encoding and Data Compression
CS106B Winter 2017 Handout #21 March 3, 2017 Huffman Encoding and Data Compression Handout by Julie Zelenski with minor edits by Keith Schwarz In the early 1980s, personal computers had hard disks that
More informationData Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi.
Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi Lecture 18 Tries Today we are going to be talking about another data
More informationA Comparative Study Of Text Compression Algorithms
International Journal of Wisdom Based Computing, Vol. 1 (3), December 2011 68 A Comparative Study Of Text Compression Algorithms Senthil Shanmugasundaram Department of Computer Science, Vidyasagar College
More informationAlphabet Partitioning Techniques for Semi-Adaptive Huffman Coding of Large Alphabets
Alphabet Partitioning Techniques for Semi-Adaptive Huffman Coding of Large Alphabets Dan Chen Yi-Jen Chiang Nasir Memon Xiaolin Wu Department of Computer and Information Science Polytechnic University
More informationAbdullah-Al Mamun. CSE 5095 Yufeng Wu Spring 2013
Abdullah-Al Mamun CSE 5095 Yufeng Wu Spring 2013 Introduction Data compression is the art of reducing the number of bits needed to store or transmit data Compression is closely related to decompression
More informationCourse notes for Data Compression - 2 Kolmogorov complexity Fall 2005
Course notes for Data Compression - 2 Kolmogorov complexity Fall 2005 Peter Bro Miltersen September 29, 2005 Version 2.0 1 Kolmogorov Complexity In this section, we present the concept of Kolmogorov Complexity
More informationUniversity of Waterloo CS240 Spring 2018 Help Session Problems
University of Waterloo CS240 Spring 2018 Help Session Problems Reminder: Final on Wednesday, August 1 2018 Note: This is a sample of problems designed to help prepare for the final exam. These problems
More informationDigital Image Processing
Lecture 9+10 Image Compression Lecturer: Ha Dai Duong Faculty of Information Technology 1. Introduction Image compression To Solve the problem of reduncing the amount of data required to represent a digital
More informationTextual Data Compression Speedup by Parallelization
Textual Data Compression Speedup by Parallelization GORAN MARTINOVIC, CASLAV LIVADA, DRAGO ZAGAR Faculty of Electrical Engineering Josip Juraj Strossmayer University of Osijek Kneza Trpimira 2b, 31000
More informationIntro. To Multimedia Engineering Lossless Compression
Intro. To Multimedia Engineering Lossless Compression Kyoungro Yoon yoonk@konkuk.ac.kr 1/43 Contents Introduction Basics of Information Theory Run-Length Coding Variable-Length Coding (VLC) Dictionary-based
More informationCS 206 Introduction to Computer Science II
CS 206 Introduction to Computer Science II 04 / 25 / 2018 Instructor: Michael Eckmann Today s Topics Questions? Comments? Balanced Binary Search trees AVL trees / Compression Uses binary trees Balanced
More informationFigure-2.1. Information system with encoder/decoders.
2. Entropy Coding In the section on Information Theory, information system is modeled as the generationtransmission-user triplet, as depicted in fig-1.1, to emphasize the information aspect of the system.
More informationCS106B Handout 34 Autumn 2012 November 12 th, 2012 Data Compression and Huffman Encoding
CS6B Handout 34 Autumn 22 November 2 th, 22 Data Compression and Huffman Encoding Handout written by Julie Zelenski. In the early 98s, personal computers had hard disks that were no larger than MB; today,
More informationArithmetic Coding. Arithmetic Coding
Contents Image Compression Lecture 3 Arithmetic Code Introduction to & Decoding Algorithm Generating a Binary Code for Huffman codes have to be an integral number of bits long, while the entropy value
More informationText Compression through Huffman Coding. Terminology
Text Compression through Huffman Coding Huffman codes represent a very effective technique for compressing data; they usually produce savings between 20% 90% Preliminary example We are given a 100,000-character
More informationCompression; Error detection & correction
Compression; Error detection & correction compression: squeeze out redundancy to use less memory or use less network bandwidth encode the same information in fewer bits some bits carry no information some
More informationChapter 9. Greedy Technique. Copyright 2007 Pearson Addison-Wesley. All rights reserved.
Chapter 9 Greedy Technique Copyright 2007 Pearson Addison-Wesley. All rights reserved. Greedy Technique Constructs a solution to an optimization problem piece by piece through a sequence of choices that
More informationCSED233: Data Structures (2017F) Lecture12: Strings and Dynamic Programming
(2017F) Lecture12: Strings and Dynamic Programming Daijin Kim CSE, POSTECH dkim@postech.ac.kr Strings A string is a sequence of characters Examples of strings: Python program HTML document DNA sequence
More informationGreedy Algorithms. Alexandra Stefan
Greedy Algorithms Alexandra Stefan 1 Greedy Method for Optimization Problems Greedy: take the action that is best now (out of the current options) it may cause you to miss the optimal solution You build
More information