Data Compression 신찬수

Data compression
Reducing the size of the representation without affecting the information itself.
Lossless compression vs. lossy compression
[Diagram: text, image, and movie files -> encoder (compression) -> compressed files -> decoder]

Ex: Run-length encoding (RLE)
Lossless compression
Each run (a maximal sequence of consecutive occurrences of the same character) is coded as a pair (n, c), where n is the number of occurrences of character c in the run.
addddddcbbbbef -> 1a6d1c4b1e1f: 14 bytes -> 12 bytes (ratio = 12/14, about 0.86)
abcdefgh -> 1a1b1c1d1e1f1g1h: 8 bytes -> 16 bytes (ratio = 2)
bbddaacc -> 2b2d2a2c: 8 bytes -> 8 bytes (ratio = 1)
RLE avoids expanding the data only if every run has length >= 2, and it shrinks the data only when runs are longer than that.
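The (n, c) pair coding above can be sketched in a few lines of Python. This is a minimal illustration, not a production encoder: the name `rle_encode` is made up for this example, and it assumes every run is shorter than 10 characters so that the count fits in a single digit.

```python
from itertools import groupby

def rle_encode(text: str) -> str:
    """Code each run as the pair (n, c): run length n followed by character c.

    Assumption for this sketch: runs are shorter than 10, so n is one digit.
    """
    return "".join(f"{len(list(run))}{ch}" for ch, run in groupby(text))

for sample in ("addddddcbbbbef", "abcdefgh", "bbddaacc"):
    coded = rle_encode(sample)
    # ratio = compressed size / original size (smaller is better)
    print(sample, "->", coded, f"ratio = {len(coded) / len(sample):.2f}")
```

Running it on the three strings from the slide shows the three cases: a gain, a doubling, and a break-even.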

RLE
So the length of each run should be >= 2.
But what if the text itself contains digits? Then a count and a character can no longer be told apart.
11111111111544444
Code a run as a triple (marker, n, c). The marker should be chosen among characters that are used infrequently.
# 11 1    # 1 5    # 5 4
A run needs three bytes, so this is worthwhile only for runs of length >= 4.
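A possible sketch of the (marker, n, c) scheme, working on bytes so that the count n is stored as one byte rather than as digits. The function names, the `min_run` parameter, and the choice to copy short runs literally are assumptions of this example (the slide's example codes every run as a triple). A literal occurrence of the marker itself is always coded as a triple so that decoding stays unambiguous.

```python
def rle_marker_encode(data: bytes, marker: int = ord("#"), min_run: int = 4) -> bytes:
    """Code runs of length >= min_run as (marker, n, c); copy shorter runs literally."""
    out = bytearray()
    i = 0
    while i < len(data):
        j = i
        # Extend the run, capped at 255 so that n fits in one byte.
        while j < len(data) and data[j] == data[i] and j - i < 255:
            j += 1
        n = j - i
        if n >= min_run or data[i] == marker:
            out += bytes([marker, n, data[i]])
        else:
            out += data[i:j]
        i = j
    return bytes(out)

def rle_marker_decode(coded: bytes, marker: int = ord("#")) -> bytes:
    """Invert rle_marker_encode."""
    out = bytearray()
    i = 0
    while i < len(coded):
        if coded[i] == marker:
            n, c = coded[i + 1], coded[i + 2]
            out += bytes([c]) * n
            i += 3
        else:
            out.append(coded[i])
            i += 1
    return bytes(out)

data = b"1" * 11 + b"5" + b"4" * 5   # the digit string from the slide
coded = rle_marker_encode(data)
print(len(data), "bytes ->", len(coded), "bytes")
```

Because the count is a raw byte, digits in the input no longer clash with run lengths.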

ASCII codes

Example
ab22ccd99++33ffgii?**!
ABABABAB
AAAABBBB
Drawback: performance depends heavily on how often runs occur.

Applications
Image compression
    Black/white images with mainly one color: fax images, book page images, etc.
    Gray (8-bit) images: ex. 10000-byte images -> 5713 bytes, 10100 bytes, 200 bytes
    Color (RGB) images
        Encoding RGB together
        Encoding RGB separately (better)

Applications How to scan? Image comparison

Ziv-Lempel code
Universal coding scheme: it does not rely on knowing the frequencies of symbol occurrences in advance, but builds that knowledge during compression.
Many variants: LZ77, LZR, LZSS, LZB, LZH, LZ78, LZC, LZFG, LZW

LZW
Compression algorithm of the compress command in UNIX systems.
As it takes the input characters one by one, it outputs codes and builds the string table.
The receiver, given only the sequence of codes, decodes them by rebuilding the string table.
Note that the receiver does not need the whole table built in the compression stage.
Fast and simple compression with a better ratio. Compression ratio: 50% ~ 60%

Compression
Note: the last character of each new table entry is the first character of the string for the next output code.
put all characters to the table;
s = the first character from input;
while any input left
    read character c;
    if ( s+c is in the table )
        s = s + c;
    else
        output code index of s;
        put a string (s + c) to the table;
        s = c;
end-of-while
output code index of s;

a a b a b a b a a a
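The compression loop can be written almost line for line in Python. This is a sketch under a few assumptions: the function name `lzw_compress` and the `alphabet` parameter are made up for illustration, codes are numbered from 1, and the table is allowed to grow without limit.

```python
def lzw_compress(text: str, alphabet: str) -> list[int]:
    """LZW compression: emit table indices while growing the string table."""
    # Put all single characters into the table, numbered from 1.
    table = {ch: i for i, ch in enumerate(alphabet, start=1)}
    if not text:
        return []
    codes = []
    s = text[0]
    for c in text[1:]:
        if s + c in table:
            s += c                          # keep extending the current match
        else:
            codes.append(table[s])          # output the code index of s
            table[s + c] = len(table) + 1   # new entry gets the next free code
            s = c
    codes.append(table[s])                  # flush the last match
    return codes

print(lzw_compress("aabababaaa", "ab"))  # [1, 1, 2, 4, 6, 3]
```

For the input a a b a b a b a a a the table ends up with aa = 3, ab = 4, ba = 5, aba = 6, abaa = 7.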

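Decompression: Problem?
put all characters to the table;
read old_code and output its string;
while codes are still left
    read new_code;
    output string(new_code);
    c = first character of string(new_code);
    put string(old_code) + c to the table;
    old_code = new_code;
end-of-while
Problem: for input of the form cScSc (c a character, S a string), the encoder outputs a code that the decoder has not yet put into its table.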

a a b a b a b a a a

Decompression: correct version
put all characters to the table;
read old_code and output its string;
while codes are still left
    read new_code;
    if ( new_code is not in the table )
        output string(old_code) + first(old_code);
        put string(old_code) + first(old_code) to the table;
    else
        put string(old_code) + first(new_code) to the table;
        output string(new_code);
    old_code = new_code;
end-of-while
Here first(x) is the first character of string(x).
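A Python sketch of the corrected decompression, including the branch for a code that is not yet in the table. The function name `lzw_decompress` and the `alphabet` parameter are assumptions of this example, and codes are numbered from 1.

```python
def lzw_decompress(codes: list[int], alphabet: str) -> str:
    """LZW decompression: rebuild the string table while decoding."""
    table = {i: ch for i, ch in enumerate(alphabet, start=1)}
    if not codes:
        return ""
    old = codes[0]
    out = [table[old]]
    for new in codes[1:]:
        if new in table:
            entry = table[new]
        else:
            # The code is not defined yet: it must be string(old) + first(old).
            entry = table[old] + table[old][0]
        # In both branches the new table entry is string(old) + first(entry).
        table[len(table) + 1] = table[old] + entry[0]
        out.append(entry)
        old = new
    return "".join(out)

print(lzw_decompress([1, 1, 2, 4, 6, 3], "ab"))  # aabababaaa
```

In this code sequence, code 6 arrives before the decoder has defined it, so the special branch is exercised once.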

Table
The strings in the table can become long. Big problem. Use the reduced form!
string  code
a       1
b       2
aa      3
ab      4
ba      5
aba     6
abaa    7

Table
string  reduced string  code
a       a               1
b       b               2
aa      1a              3
ab      1b              4
ba      2a              5
aba     4a              6
abaa    6a              7
Each string can be represented in two bytes (the code of its prefix plus its last character)!
What is the string for code 7?
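One way to recover the full string from the reduced form is to follow the prefix codes back to a single character. In this sketch the table layout (a dict of (prefix code, last character) pairs) and the use of 0 for "no prefix" are assumptions made for illustration.

```python
# Reduced table: code -> (prefix code, last character); prefix 0 means "no prefix".
REDUCED = {
    1: (0, "a"), 2: (0, "b"),
    3: (1, "a"), 4: (1, "b"), 5: (2, "a"),
    6: (4, "a"), 7: (6, "a"),
}

def expand(code: int, table: dict[int, tuple[int, str]]) -> str:
    """Rebuild the full string for a code by walking the prefix chain."""
    chars = []
    while code != 0:
        code, ch = table[code]
        chars.append(ch)          # characters are collected last-to-first
    return "".join(reversed(chars))

print(expand(7, REDUCED))  # abaa
```

Code 7 is stored as 6a, 6 as 4a, 4 as 1b, and 1 as a, which unwinds to abaa.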

Reference
Explanation of LZW compression, including C code: http://dogma.net/markn/articles/lzw/lzw.htm

Conditions for code assignment
1. One-to-one condition: each code corresponds to exactly one character.
2. Code-length condition
3. Prefix condition
4. Optimality condition

Conditions for code assignment
[Code-length condition] The code length of a character A should not exceed the code length of a less probable character B.
Prob(A) >= Prob(B)  =>  length(A) <= length(B)
Three symbols: A, B, C with prob. 0.5, 0.25, 0.25
A = 12, B = 2, C = 1: violates the code-length condition.
A = 1, B = 2, C = 12: satisfies the code-length condition, but we cannot distinguish AB from C (both are coded as 12).

Conditions for code assignment
A = 1, B = 22, C = 12
1222: after reading the leading 1 we must look ahead to decide whether it is A or the start of C (here 12 22 = CB), so lookahead is needed to determine the unique string for a code sequence.
[Prefix condition] No code should be a prefix of another code, so that no lookahead is needed.
A = 11, B = 12, C = 21: satisfies the code-length condition and the prefix condition. No ambiguity.
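The prefix condition is easy to test mechanically. The helper below is a small sketch (the name `is_prefix_free` is made up for this example); it compares every ordered pair of code words.

```python
from itertools import permutations

def is_prefix_free(codes: dict[str, str]) -> bool:
    """True if no code word is a prefix of another symbol's code word."""
    return not any(b.startswith(a) for a, b in permutations(codes.values(), 2))

print(is_prefix_free({"A": "1", "B": "22", "C": "12"}))   # False: 1 is a prefix of 12
print(is_prefix_free({"A": "11", "B": "12", "C": "21"}))  # True
```

The pairwise check is quadratic in the number of symbols, which is fine for alphabets of this size.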

Conditions for code assignment
[Optimality condition] The average code length should be as close as possible to the optimal average length.
L_avg = Σ_i Prob(A_i) * L(A_i), where L(A_i) = -log2( Prob(A_i) )
Three symbols: A, B, C with prob. 0.5, 0.25, 0.25
L_avg = 0.5 * -log2(0.5) + 0.25 * -log2(0.25) + 0.25 * -log2(0.25) = 0.5 * 1 + 0.25 * 2 + 0.25 * 2 = 1.5
This L_avg is the best possible average length, as established by Claude E. Shannon.

Huffman coding
Construct (near) optimal binary codes for symbols.
symbol       A     B     C     D     E
probability  0.09  0.12  0.19  0.21  0.39

Huffman coding algorithm
HuffmanCode( P )
    let P be a list storing the probabilities.
    sort the characters in non-decreasing order of probability.
    while ( two or more probabilities are left in P )
        delete the two minimum probabilities p1, p2 from P.
        create a tree node with p1 and p2 as its children and add its probability (p1 + p2) into P.
    end-of-while
    generate codes from the resulting tree as follows:
        assign 0 to each left child and 1 to each right child.

Notes
There can be two or more Huffman codes with the same average code length.
A = 0.09, B = 0.12, C = 0.19, D = 0.21, E = 0.39
A = 000, B = 001, C = 01, D = 10, E = 11: L_huf = 2.21, L_avg = 2.14
A = 000, B = 001, C = 10, D = 11, E = 01: L_huf = 2.21
L_huf is very close to L_avg (only about 3% off).
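A quick numerical check of the two quantities compared on this slide. The sketch assumes code lengths of 3 bits for A and B and 2 bits for C, D, and E, which is what the Huffman merges give for these probabilities; the variable names are illustrative only.

```python
from math import log2

probs = {"A": 0.09, "B": 0.12, "C": 0.19, "D": 0.21, "E": 0.39}
# Assumed Huffman code lengths: the two rarest symbols get the longest codes.
code_lengths = {"A": 3, "B": 3, "C": 2, "D": 2, "E": 2}

# Average length of the Huffman code.
l_huf = sum(probs[s] * code_lengths[s] for s in probs)
# Optimal average length: sum of Prob * -log2(Prob).
l_avg = sum(-p * log2(p) for p in probs.values())

print(f"L_huf = {l_huf:.2f}, L_avg = {l_avg:.2f}, gap = {(l_huf - l_avg) / l_huf:.1%}")
```

Evaluated this way, L_huf is 2.21 and the optimal average length comes out to roughly 2.14 bits per symbol.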

Practice P = 0.1, Q = 0.1, R = 0.1, S = 0.2, T = 0.5 What are all the Huffman codes?

Compression?
A = 000, B = 001, C = 01, D = 10, E = 11
Sending ABAAD
    by ASCII codes: 40 bits
    by Huffman codes: 00000100000010 (14 bits)
Receiving 00000100000010
    The receiver must know the conversion table between characters and codes.
    1. Exchange the Huffman tree before sending ABAAD
    2. Send the Huffman tree together with ABAAD
    3. Build the Huffman tree while transmitting ABAAD
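Sending and receiving with a shared code table can be sketched as follows. The table used here (A = 000, B = 001, C = 01, D = 10, E = 11) is an assumption for illustration; any prefix-free table known to both sides works the same way. The function names are made up for this example.

```python
# Assumed conversion table shared by sender and receiver (prefix-free).
CODES = {"A": "000", "B": "001", "C": "01", "D": "10", "E": "11"}

def encode(message: str, codes: dict[str, str]) -> str:
    """Concatenate the code word of each character."""
    return "".join(codes[ch] for ch in message)

def decode(bits: str, codes: dict[str, str]) -> str:
    """Read bits left to right; a prefix-free table needs no lookahead."""
    inverse = {word: sym for sym, word in codes.items()}
    out, word = [], ""
    for bit in bits:
        word += bit
        if word in inverse:       # a complete code word has been read
            out.append(inverse[word])
            word = ""
    if word:
        raise ValueError("trailing bits do not form a complete code word")
    return "".join(out)

bits = encode("ABAAD", CODES)
print(bits, f"({len(bits)} bits vs {8 * len('ABAAD')} bits in 8-bit ASCII)")
print(decode(bits, CODES))
```

The decoder relies only on the table, which is why the table (or the tree it comes from) has to reach the receiver by one of the three options listed.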

Implementation
Use the HEAP!
    Construct a min-heap of the probabilities.
    Repeat: delete the two minimums and insert their sum, until the final node contains 1.0.
Assignment of codes
    Keep track of the heap operations and trace them back.
Ex. A = 0.09, B = 0.12, C = 0.19, D = 0.21, E = 0.39
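A heap-based construction might look like the sketch below, using Python's standard `heapq` module as the min-heap. Instead of logging heap operations and tracing them back, this version keeps the merged subtrees inside the heap entries, which is one possible way to remember the merges. The function name `huffman_codes` and the tie-breaking counter are assumptions of this example; with ties, other equally good codes are possible.

```python
import heapq
from itertools import count

def huffman_codes(probs: dict[str, float]) -> dict[str, str]:
    """Build a Huffman code with a min-heap of (probability, tie, subtree)."""
    tie = count()  # tie-breaker so the heap never has to compare subtrees
    heap = [(p, next(tie), sym) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)    # the two minimum probabilities
        p2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tie), (left, right)))

    codes: dict[str, str] = {}

    def walk(node, prefix: str) -> None:
        if isinstance(node, tuple):          # internal node: (left, right)
            walk(node[0], prefix + "0")      # 0 for the left child
            walk(node[1], prefix + "1")      # 1 for the right child
        else:
            codes[node] = prefix or "0"      # single-symbol alphabet gets "0"

    walk(heap[0][2], "")
    return codes

codes = huffman_codes({"A": 0.09, "B": 0.12, "C": 0.19, "D": 0.21, "E": 0.39})
for sym in sorted(codes):
    print(sym, codes[sym])
```

Each of the n - 1 merges costs O(log n) heap work, so the whole construction runs in O(n log n).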

Improvements
X = 0.1, Y = 0.1, Z = 0.8
L_huf = 2 * 0.1 + 2 * 0.1 + 1 * 0.8 = 1.2
L_avg = 0.922
23% gap!
Reduce the gap by coding every pair of characters (instead of single characters):
XX, XY, XZ, YX, YY, YZ, ZX, ZY, ZZ
L_huf = 1.92, L_avg = 1.844 (per pair)
Only a 3.96% gap!
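The single-character and pair figures can be reproduced with a short script. This sketch assumes X = 0.1, Y = 0.1, Z = 0.8 (so that the probabilities sum to 1) and that characters occur independently, so a pair's probability is the product of its two characters' probabilities. It uses the fact that the average length of a Huffman code equals the sum of the weights of all merged nodes; the function names are illustrative.

```python
import heapq
from itertools import product
from math import log2

def huffman_average_length(probs) -> float:
    """Average Huffman code length = sum of the weights of all merged nodes."""
    heap = list(probs)
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged
        heapq.heappush(heap, merged)
    return total

def optimal_average_length(probs) -> float:
    """Sum of Prob * -log2(Prob)."""
    return sum(-p * log2(p) for p in probs)

single = {"X": 0.1, "Y": 0.1, "Z": 0.8}
# Assumption: independent characters, so P(ab) = P(a) * P(b).
pairs = [single[a] * single[b] for a, b in product(single, repeat=2)]

for name, probs in (("single", list(single.values())), ("pairs", pairs)):
    l_huf = huffman_average_length(probs)
    l_avg = optimal_average_length(probs)
    print(f"{name}: L_huf = {l_huf:.3f}, L_avg = {l_avg:.3f}, gap = {(l_huf - l_avg) / l_huf:.2%}")
```

Note that the pair figures are bits per pair, so the cost per character is half of those values.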

Experiments
Coding method                    English text  PL/I  image
Huffman                          40%           60%   50%
Huffman + 100 freq. used groups  49%           73%   52%
Huffman + 512 freq. used groups  55%           71%   62%