Compressing Data Konstantin Tretyakov (kt@ut.ee) MTAT.03.238 Advanced April 26, 2012
Claude Elwood Shannon (1916-2001)
C. E. Shannon. A mathematical theory of communication. 1948
C. E. Shannon. The mathematical theory of communication. 1949
Shannon-Fano coding. Nyquist-Shannon sampling theorem. Shannon-Hartley theorem. Shannon's noisy channel coding theorem. Shannon's source coding theorem. Rate-distortion theory. Ethernet, Wifi, GSM, CDMA, EDGE, CD, DVD, BD, ZIP, JPEG, MPEG, …
MTMS.02.040 Informatsiooniteooria (3-5 EAP) Jüri Lember http://ocw.mit.edu/ 6.441 Information Theory https://www.coursera.org/courses/
Basic terms: Information, Code What is information? What is coding? What is a code? Can you code the same information differently? Why would you? What properties can you require from a coding scheme? Are they contradictory? Show 5 ways of coding the concept "number 42". What is the shortest way of coding this concept? How many bits are needed? Aha! Now define the term "code" once again.
Basic terms: Coding Suppose we have a set of three concepts. Denote them as A, B and C. Propose a code for this set. Consider the following code: A → 0, B → 1, C → 01. What do you think about it? Define variable-length code. Define uniquely decodable code.
Basic terms: Prefix-free If we want to code series of messages, what would be a great property for a code to have? Define prefix-free code. For historical reasons those are more often referred to as prefix codes. Find a prefix-free code for {A, B, C}. Is it uniquely decodable? Is prefix-free uniquely decodable? Is uniquely decodable prefix-free?
Prefix-free code.. can always be represented as a tree with symbols at the leaves.
Compression Consider some previously derived code for {A, B, C}. Is it good for compression purposes? Define expected code length. Let event probabilities be as follows: A 0.50, B 0.25, C 0.25 Find the shortest possible prefix-free code.
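The expected code length for this distribution can be checked directly; a minimal Python sketch (the code assignment below is one of several equally optimal choices):

```python
# Probabilities from the slide and one optimal prefix-free code for them.
probs = {"A": 0.50, "B": 0.25, "C": 0.25}
code = {"A": "0", "B": "10", "C": "11"}

# Expected code length = sum over symbols of p(symbol) * codeword length.
expected_length = sum(p * len(code[s]) for s, p in probs.items())
print(expected_length)  # 1.5 bits per symbol
```

No prefix-free code can do better here: the 1.5-bit average equals the entropy of this distribution.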
Compression & Prefix coding Does the prefix-free property sacrifice code length? No! For each uniquely-decodable code there exists a prefix-code with the same codeword lengths.
Huffman code Consider the following event probabilities A 0.50, B 0.25, C 0.125, D 0.125 and some event sequence ADABAABACDABACBA Replace all events C and D with a new event Z. Construct the optimal code for {A, B, Z} Extend this code to a new code for {A, B, C, D}
Huffman coding algorithm Generalize the previous construction to construct an optimal prefix-free code. Use Huffman coding to encode YAYBANANABANANA Compare its efficiency to straightforward 2-bit encoding. D. Huffman. A Method for the Construction of Minimum-Redundancy Codes, 1952
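The repeated merge-the-two-rarest-events construction can be sketched in Python (a minimal illustration, not from the slides; `huffman_code` is a made-up helper name):

```python
import heapq
from collections import Counter

def huffman_code(freqs):
    """Build a Huffman code from a {symbol: weight} map; returns {symbol: bits}."""
    # Heap entries are (weight, tiebreak, tree); a tree is a symbol or a pair.
    heap = [(w, i, sym) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        # Repeatedly merge the two least probable subtrees, as in the slides.
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, counter, (t1, t2)))
        counter += 1
    code = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            code[tree] = prefix or "0"  # degenerate one-symbol alphabet
    walk(heap[0][2], "")
    return code

text = "YAYBANANABANANA"
code = huffman_code(Counter(text))
encoded = "".join(code[c] for c in text)
print(len(encoded), "bits vs", 2 * len(text), "bits for a fixed 2-bit code")
```

For this text the Huffman encoding takes 27 bits against 30 bits for the straightforward 2-bit encoding, since the frequent A gets a 1-bit codeword.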
Huffman coding in practice Is just saving the result of Huffman coding to file enough? What else should be done? How? Straightforward approach dump the tree using preorder traversal. Smarter approach save only code lengths Wikipedia: Canonical Huffman Code RFC1951: DEFLATE Compressed Data Format Specification version 1.3, Section 3.2.2
Huffman code optimality Consider an alphabet, sorted by event (letter) probability, e.g. x1 = 0.42, x2 = 0.25, …, x9 = 0.01, x10 = 0.01. Is there just a single optimal code for it, or several of them?
Huffman code optimality Show that each optimal code has l(x1) ≤ l(x2) ≤ … ≤ l(x10). Show that there is at least one optimal code where x9 and x10 are siblings in the prefix tree. Let L be the expected length of the optimal code. Merge x9 and x10, and let Ls be the expected length of the resulting smaller code. Express L in terms of Ls. Complete the proof.
Huffman code in real life Which of those use Huffman coding? DEFLATE (ZIP, GZIP) JPEG PNG GIF MP3 MPEG-2 All of them do, as a post-processing step.
Shannon-Fano code I randomly chose a letter from the following distribution: A → 0.45, B → 0.35, C → 0.125, D → 0.125. You need to guess it in the smallest expected number of yes/no questions. Devise an optimal strategy.
Shannon-Fano code Constructs a prefix code in a top-down manner: split the alphabet into two parts with as equal total probability as possible; construct a code for each part; prepend 0 to the codewords of the first part and 1 to the codewords of the second part. Is Shannon-Fano the same as Huffman?
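The top-down splitting can be sketched as follows (a rough Python illustration under the assumption that the input is already sorted by descending probability; `shannon_fano` is an illustrative name):

```python
def shannon_fano(symbols):
    """symbols: list of (symbol, probability) pairs sorted by descending
    probability. Returns {symbol: bitstring}."""
    if len(symbols) == 1:
        return {symbols[0][0]: ""}
    total = sum(p for _, p in symbols)
    # Choose the split point that makes the two halves' totals most equal.
    acc, best_diff, split = 0.0, float("inf"), 1
    for i in range(1, len(symbols)):
        acc += symbols[i - 1][1]
        diff = abs(2 * acc - total)
        if diff < best_diff:
            best_diff, split = diff, i
    # Recurse, then prepend 0 / 1 to the codes of the two halves.
    code = {s: "0" + bits for s, bits in shannon_fano(symbols[:split]).items()}
    code.update({s: "1" + bits for s, bits in shannon_fano(symbols[split:]).items()})
    return code

print(shannon_fano([("A", 0.5), ("B", 0.25), ("C", 0.125), ("D", 0.125)]))
# {'A': '0', 'B': '10', 'C': '110', 'D': '111'}
```

On this dyadic distribution the result coincides with the Huffman code; on other distributions Shannon-Fano may be strictly worse.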
Shannon-Fano & Huffman Shannon-Fano is not always optimal. Show that it is optimal, though, for letter probabilities of the form 1/2^k.
−log(p) as amount of information Let letter probabilities all be of the form p = 1/2^k. Show that for the optimal prefix code, the length of the codeword for a letter with probability p_i = 1/2^k is exactly k = log2(1/p_i) = −log2(p_i).
Why logarithms? Intuitively, we want a measure of information to be additive: receiving N equivalent events must correspond to N times the information in the single event. However, probabilities are multiplicative: the probability of N independent events is the product of the individual probabilities. Therefore, the most logical way to measure the information of an event is via the logarithm of its probability, which turns products into sums.
The thing to remember log2(1/p) is the information content of a single random event with probability p. For p of the form 1/2^k it is exactly the number of bits needed to code this event using an optimal binary prefix-free code.
The thing to remember For other values of p the information content is not an integer. Obviously you can't use something like 2.5 bits to encode a symbol. However, for longer texts you can code multiple symbols at once, and in this case you can achieve an average coding rate of this number (e.g. 2.5) of bits per occurrence of the corresponding event.
Expected codeword length Let letter probabilities all be of the form p = 1/2^k. What is the expected code length for the optimal binary prefix-free code?
The thing to remember For a given discrete probability distribution, the function H(p1, p2, …, pn) = p1·log2(1/p1) + … + pn·log2(1/pn) is called the entropy of this distribution.
Meaning of entropy The average codeword length L for both Huffman and Shannon-Fano codes satisfies: H(P) ≤ L < H(P) + 1.
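For the dyadic distribution used on the Huffman slide (A 0.5, B 0.25, C 0.125, D 0.125) the lower bound is attained exactly, which is easy to verify numerically (a quick sketch):

```python
import math

# Probabilities from the Huffman slide and the matching codeword lengths.
probs = [0.5, 0.25, 0.125, 0.125]
lengths = [1, 2, 3, 3]  # e.g. A -> 0, B -> 10, C -> 110, D -> 111

H = sum(p * math.log2(1 / p) for p in probs)
L = sum(p * l for p, l in zip(probs, lengths))
assert H <= L < H + 1
print(H, L)  # both equal 1.75: for dyadic probabilities the code is exact
```

For non-dyadic probabilities the Huffman code generally lands strictly between H(P) and H(P) + 1.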
Meaning of entropy Shannon's Source Coding Theorem: a sequence of N events from a probability distribution P can be losslessly represented as a sequence of approximately N·H(P) bits for sufficiently large N. Conversely, it is impossible to losslessly represent the sequence using fewer than N·H(P) bits.
The things to remember log2(1/p) is the information content of a single random event with probability p, measured in bits. H(P) is the expected information content for the distribution P, measured in bits.
The things to remember log2(1/p) is the information content of a single random event with probability p, i.e. the expected number of bits necessary to optimally encode an event with such probability. H(P) is the expected information content for the distribution P, i.e. the expected number of bits necessary to optimally encode a single random event from this distribution.
Demonstrate an N-element distribution with zero entropy. Demonstrate an N-element distribution with maximal entropy. Define entropy for a continuous distribution p(x).
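Both extreme cases can be checked with a few lines of Python (a small sketch; `entropy` is an illustrative helper using the usual convention that zero-probability terms contribute 0):

```python
import math

def entropy(probs):
    """H(P) = sum over p of p * log2(1/p); zero-probability terms contribute 0."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

n = 4
print(entropy([1.0] + [0.0] * (n - 1)))  # degenerate distribution: 0.0
print(entropy([1 / n] * n))              # uniform distribution: 2.0 = log2(n)
```

A distribution concentrated on one outcome carries no information; the uniform distribution over N outcomes maximizes entropy at log2(N) bits.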
Is Huffman code good for coding: images? Music? Text? None of them, because Huffman coding assumes an i.i.d. sequence, yet all of those have a lot of structure. What is it good for? It is good for coding random-like sequences.
Say we need to encode the text THREE SWITCHED WITCHES WATCH THREE SWISS SWATCH WATCH SWITCHES. WHICH SWITCHED WITCH WATCHES WHICH SWISS SWATCH WATCH SWITCH? Can we code this better than Huffman? Of course, if we use a dictionary. Can we build the dictionary adaptively from the data itself?
Lempel-Ziv-Welch algorithm Say we want to code the string AABABBCAB. Start with a dictionary containing only the empty string: {0 → ""}. Scan the string from the beginning. Find the longest prefix present in the dictionary (here the empty string, id 0). Read one more letter, A. Output the prefix id and this letter: (0, A). Append <current prefix><current letter> to the dictionary. New dictionary: {0 → "", 1 → "A"}. Finish the coding. Terry Welch, A Technique for High-Performance Data Compression, 1984.
LZW Algorithm Unpack the obtained code. Can we do smarter initialization? If we pack a long text, the dictionary may bloat. How do we handle it? In practice LZW coding is followed by Huffman (or a similar) coding.
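The dictionary-building scheme from the previous slide can be sketched in Python (an illustration of the idea exactly as presented, with explicit (prefix id, letter) output pairs; function names are made up):

```python
def lzw_encode(s):
    """Encode s as a list of (prefix_id, next_letter) pairs, as on the slide.
    The dictionary starts with only the empty string (id 0)."""
    dictionary = {"": 0}
    output = []
    i = 0
    while i < len(s):
        # Find the longest prefix of the remaining input in the dictionary.
        prefix = ""
        while i < len(s) and prefix + s[i] in dictionary:
            prefix += s[i]
            i += 1
        # Read one more letter (empty if the input ended on a known prefix).
        letter = s[i] if i < len(s) else ""
        i += len(letter)
        output.append((dictionary[prefix], letter))
        if letter:
            dictionary[prefix + letter] = len(dictionary)
    return output

def lzw_decode(pairs):
    """Invert lzw_encode by rebuilding the same dictionary on the fly."""
    dictionary = {0: ""}
    chunks = []
    for prefix_id, letter in pairs:
        entry = dictionary[prefix_id] + letter
        chunks.append(entry)
        dictionary[len(dictionary)] = entry
    return "".join(chunks)
```

Encoding AABABBCAB yields [(0, "A"), (1, "B"), (2, "B"), (0, "C"), (2, "")]; decoding replays the same dictionary growth and restores the string. Practical implementations instead initialize the dictionary with all single bytes and cap or reset it when it grows too large.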
Theorem LZW coding is asymptotically optimal. I.e. as the length of the string goes to infinity, the compression ratio approaches the best possible (given some conditions).
LZW and variations in real life Which of those use variations of LZW? DEFLATE (ZIP, GZIP) JPEG PNG GIF MP3 MPEG-2
LZW and variations in real life Which of those use variations of LZW? DEFLATE (ZIP, GZIP) JPEG PNG GIF MP3 MPEG-2 Remember, LZW is aimed at text-like data with many repeating substrings. It is used in GIF to code the stream of palette indices, which contains many such repeats. PNG relies on DEFLATE, which combines the related LZ77 algorithm with Huffman coding.
Ideal compression? Given a string of bytes, what would be the theoretically best way to encode it?
Kolmogorov complexity The Kolmogorov complexity of a byte string is the length of the shortest program which outputs this string.
Kolmogorov complexity Can we achieve Kolmogorov complexity at packing?
Kolmogorov complexity Theorem Kolmogorov complexity is not computable.
Summary Thou shalt study Information Theory! Huffman code is a length-wise optimal uniquely decodable code. log2(1/p) is the information content of an event. H(P) is the information content of a distribution. LZW is asymptotically optimal. Kolmogorov complexity is a fun (but practically useless) idea.