introduction run-length coding Huffman compression Applications

Brooke Lang
5 years ago
Algorithms, Fourth Edition
Robert Sedgewick, Kevin Wayne

Data Compression
  introduction
  run-length coding
  Huffman compression
  LZW compression

Last updated on 4//6 :4 PM

Data compression

Compression reduces the size of a file:
  To save space when storing it.
  To save time when transmitting it.
  Most files have lots of redundancy.

"Every day, we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone." (IBM report on big data)

Applications

Generic file compression (always lossless).
  Files: GZIP, BZIP, 7z.
  Archivers: PKZIP.
  File systems: NTFS, ZFS, HFS+, ReFS, GFS.

Multimedia (usually lossy).
  Images: GIF, JPEG.
  Sound: MP3.
  Video: MPEG, DivX, HDTV.

Communication.
  ITU-T T4 Group 3 Fax.
  V.42bis modem.
  Skype, Google hangout.

Databases. Google, Facebook, NSA, ...

Lossless compression and expansion

Message. Bitstream B we want to compress.
Compress. Generates a "compressed" representation C(B).
Expand. Reconstructs original bitstream B.

Basic model for data compression: bitstream B -> Compress -> compressed version C(B) -> Expand -> original bitstream B. The compressed version uses fewer bits (you hope).

Compression ratio. Bits in C(B) / bits in B.
Ex. 50-75% or better compression ratio for natural language.

Compression before computers

Data compression has been omnipresent since antiquity:
  Number systems.
  Natural languages.
  Mathematical notation.

It played a central role in communications technology:
  Grade 2 Braille.
  Morse code.
  Telephone system.

Data representation: genomic code

Genome. String over the alphabet { A, T, C, G }.
Goal. Encode an N-character genome: ATAGATGCATAG...

Standard ASCII encoding. 8 bits per char. 8 N bits.

  char   hex   binary
  'A'    41    01000001
  'T'    54    01010100
  'C'    43    01000011
  'G'    47    01000111

Two-bit encoding. 2 bits per char. 2 N bits (25% compression ratio). Each of 'A', 'T', 'C', 'G' gets its own 2-bit codeword.

Fixed-length code. k-bit code supports alphabet of size 2^k.

Reading and writing binary data

Binary standard input. Read bits from standard input.

  public class BinaryStdIn
    boolean readBoolean()      read 1 bit of data and return as a boolean
    char    readChar()         read 8 bits of data and return as a char
    char    readChar(int r)    read r bits of data and return as a char
    [similar methods for byte (8 bits); short (16 bits); int (32 bits); long and double (64 bits)]
    boolean isEmpty()          is the bitstream empty?
    void    close()            close the bitstream

Binary standard output. Write bits to standard output.

  public class BinaryStdOut
    void write(boolean b)        write the specified bit
    void write(char c)           write the specified 8-bit char
    void write(char c, int r)    write the r least significant bits of the specified char
    [similar methods for byte (8 bits); short (16 bits); int (32 bits); long and double (64 bits)]
    void close()                 close the bitstream
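The two-bit genome encoding can be sketched in plain Java without the course's binary I/O classes. This is a minimal illustration, not the slides' implementation: the class name `TwoBitGenome`, the codeword order given by the string "ACGT", and the byte-array output are all assumptions made for the example.

```java
// Sketch: pack a genome string into 2 bits per character and unpack it again.
// The class name and the "ACGT" codeword order are assumptions for illustration.
class TwoBitGenome {
    // The index of a character in this string is its 2-bit codeword.
    private static final String ALPHABET = "ACGT";

    // Pack each character into 2 bits, four characters per byte.
    static byte[] compress(String genome) {
        byte[] packed = new byte[(genome.length() + 3) / 4];
        for (int i = 0; i < genome.length(); i++) {
            int code = ALPHABET.indexOf(genome.charAt(i));
            if (code < 0)
                throw new IllegalArgumentException("not a genome character: " + genome.charAt(i));
            packed[i / 4] |= code << (6 - 2 * (i % 4));
        }
        return packed;
    }

    // Unpack n characters; n is passed in because the last byte may be padded.
    static String expand(byte[] packed, int n) {
        StringBuilder sb = new StringBuilder(n);
        for (int i = 0; i < n; i++) {
            int code = (packed[i / 4] >> (6 - 2 * (i % 4))) & 0x3;
            sb.append(ALPHABET.charAt(code));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String genome = "ATAGATGCATAG";   // sample input chosen for this sketch
        byte[] packed = compress(genome);
        System.out.println(genome.length() * 8 + " bits as 8-bit chars, "
                + genome.length() * 2 + " bits with the 2-bit code");
        System.out.println(expand(packed, genome.length()));
    }
}
```

A real encoder also has to record the length N (here it is passed to `expand` separately), since the final byte can contain padding bits.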

Writing binary data

Date representation. Three different ways to represent 12/31/1999.

A character stream (StdOut): 80 bits.
  StdOut.print(month + "/" + day + "/" + year);

Three ints (BinaryStdOut): 96 bits.
  BinaryStdOut.write(month);
  BinaryStdOut.write(day);
  BinaryStdOut.write(year);

A 4-bit field, a 5-bit field, and a 12-bit field (BinaryStdOut): 21 bits (+ 3 bits for byte alignment at close).
  BinaryStdOut.write(month, 4);
  BinaryStdOut.write(day, 5);
  BinaryStdOut.write(year, 12);

Binary dumps

Q. How to examine the contents of a bitstream?

BinaryStdIn allows us to avoid system dependencies by writing our own programs to convert bitstreams such that we can see them with our standard tools. For example, the program BinaryDump is a BinaryStdIn client that prints out the bits from standard input, encoded with the characters 0 and 1. This program is useful for debugging when working with small inputs. We use a slightly more complicated version that just prints the count when the width argument is 0 (see Exercise 5.5.X). The similar client HexDump groups the data into 8-bit bytes and prints each as two hexadecimal digits that each represent 4 bits. The client PictureDump displays the bits in a Picture. You can download HexDump and PictureDump from the booksite. Typically, we use piping and redirection at the command-line level when working with binary files: we can pipe the output of an encoder to BinaryDump, HexDump, or PictureDump, or redirect it to a file.

Printing a bitstream on standard (character) output (excerpt from BinaryDump):
  if (cnt % width == 0) StdOut.println();
  if (BinaryStdIn.readBoolean()) StdOut.print("1");
  else StdOut.print("0");
  StdOut.println(cnt + " bits");

Four ways to look at a bitstream:
  Standard character stream:                      % more abra.txt
  Bitstream represented as 0 and 1 characters:    % java BinaryDump 16 < abra.txt       (96 bits)
  Bitstream represented with hex digits:          % java HexDump 4 < abra.txt           (12 bytes)
  Bitstream represented as pixels in a Picture:   % java PictureDump 16 6 < abra.txt    (96 bits; 16-by-6 pixel window, magnified)

[Table: hexadecimal to ASCII conversion table.]

Which of these formats are text-based, and which are binary? HTML, GIF, MPEG, PDF, SVG, Java source code, Java bytecode.

Universal data compression

ZeoSync. Announced 100:1 lossless compression of random data using Zero Space Tuner and BinaryAccelerator technology.

ZeoSync corporation folds after issuing $4 million in private stock.

Universal data compression

Quotes from this interview:

  Wired News: When did you start working on this technology?
  Peter St. George: I started developing the technology about a dozen years ago. I worked on this one problem for years consecutively. This is a project that I dedicated my life to a dozen years ago.

  WN: Let's go into the details. Tell me how it works. It can compress random data?
  PSG: If you say absolutely random, it's going to be very hard to agree what absolutely random is.

  WN: How do you get around the conventional wisdom that says simple mathematics says it's impossible?
  PSG: We plan to attack that issue head on. What hasn't been previously proven, we're proving. I have one quote I'd like to share with you: "The person who says it cannot be done should not interrupt the person doing it."

Proposition. No algorithm can compress every bitstring.

Pf 1. [by contradiction]
  Suppose you have a universal data compression algorithm U that can compress every bitstream.
  Given bitstring B0, compress it to get smaller bitstring B1.
  Compress B1 to get a smaller bitstring B2.
  Continue until reaching bitstring of size 0.
  Implication: all bitstrings can be compressed to 0 bits!

Pf 2. [by counting]
  Suppose your algorithm can compress all 1,000-bit strings.
  There are 2^1000 possible bitstrings with 1,000 bits.
  Only 1 + 2 + 4 + ... + 2^998 + 2^999 can be encoded with at most 999 bits.
  Similarly, only 1 in 2^499 bitstrings can be encoded with at most 500 bits!

[Figure: Universal data compression? Repeated application of U.]

Undecidability

Can you compress this string of decimal digits? It's the first digits of pi after the decimal point. (But how to compress?)

A difficult file to compress: one million (pseudo-) random bits.

  % java RandomBits | java PictureDump 2000 500
  1000000 bits

  public class RandomBits
  {
     public static void main(String[] args)
     {
        int x = 11111;
        for (int i = 0; i < 1000000; i++)
        {
           x = x * 314159 + 218281;
           BinaryStdOut.write(x > 0);
        }
        BinaryStdOut.close();
     }
  }
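The counting proof can be written out as one short calculation. This is a restatement under two stated assumptions: a lossless compressor must map distinct inputs to distinct outputs, and "compressing" an n-bit string means producing an output with fewer than n bits. The symbols n and m are introduced here for the derivation.

```latex
\[
\#\{\text{bitstrings with fewer than } n \text{ bits}\}
  \;=\; \sum_{k=0}^{n-1} 2^{k}
  \;=\; 2^{n} - 1
  \;<\; 2^{n}
  \;=\; \#\{\text{bitstrings with exactly } n \text{ bits}\}
\]
\[
\frac{\#\{\text{bitstrings with at most } m \text{ bits}\}}{2^{n}}
  \;=\; \frac{2^{m+1} - 1}{2^{n}}
  \;<\; \frac{1}{2^{\,n-m-1}}
\]
```

By the pigeonhole principle, at least one n-bit input has no shorter encoding. With n = 1000 and m = 500, the second bound says that fewer than 1 in 2^499 inputs can be encoded with at most 500 bits.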

Rdenudcany in Enlgsih lnagugae

Q. How much redundancy in the English language?
A. Quite a bit.

"... randomising letters in the middle of words [has] little or no effect on the ability of skilled readers to understand the text. This is easy to denmtrasote. In a pubiltacion of New Scnieitst you could ramdinose all the letetrs, keipeng the first two and last two the same, and reibadailty would hadrly be aftcfeed. My ansaylis did not come to much beucase the thoery at the time was for shape and senqeuce retigcionon. Saberi's work sugsegts we may have some pofrweul palrlael prsooscers at work. The resaon for this is suerly that idnetiyfing coentnt by paarllel prseocsing speeds up regnicoiton. We only need the first and last two letetrs to spot chganes in meniang." (Graham Rawlinson)

The gaol of data cmperisoson is to inetdify rdenudcany and epxloit it.

Aside. Design an algorithm to correct text with letters permuted.

Data compression: quiz 1

Rank these in the order of compressibility:
  A. An ASCII text file of Shakespeare's works.
  B. A bitmap image of this slide.
  C. An mp3 file of Justin Bieber's Baby.

Compression is still an active area of research; big improvements are possible.

Run-length encoding

Simple type of redundancy in a bitstream. Long runs of repeated bits.

  0000000000000001111111000000011111111111    (40 bits)

Representation. 4-bit counts to represent alternating runs of 0s and 1s: 15 0s, then 7 1s, then 7 0s, then 11 1s.

  1111 0111 0111 1011    (16 bits instead of 40)

Q. How many bits to store the counts?
A. We typically use 8 (but 4 in the example above for brevity).

Q. What to do when run length exceeds max count?
A. Intersperse runs of length 0.

Applications. JPEG, ITU-T T4 Group 3 Fax, ...

Data compression: quiz 2

What is the best compression ratio achievable from run-length coding when using 8-bit counts?

Variable-length codes

Use different number of bits to encode different chars. Assign shorter codes to more common chars.

Ex. Morse code.

Issue. Ambiguity. SOS? V7? IAMIE? EEWNI? (The codeword for S is a prefix of the codeword for V.)

In practice. Use a medium gap to separate codewords.

[Photo: David Huffman]
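The run-length scheme described above (alternating counts starting with a run of 0s, and a zero-length run inserted when a run exceeds the maximum count) can be sketched as follows. This is an illustrative version that works on strings of '0' and '1' characters rather than on a real bitstream; the class name `RunLengthSketch` and the `maxCount` parameter are assumptions for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of run-length coding over a string of '0'/'1' characters.
// Names and the string-based representation are assumptions for illustration.
class RunLengthSketch {
    // Returns alternating run lengths, starting with a (possibly empty) run of 0s.
    // maxCount is the largest value one count can hold (15 for 4-bit counts, 255 for 8-bit).
    static List<Integer> encode(String bits, int maxCount) {
        List<Integer> counts = new ArrayList<>();
        char current = '0';
        int run = 0;
        for (int i = 0; i < bits.length(); i++) {
            char b = bits.charAt(i);
            if (b != current) {             // run ended: emit it and switch bit
                counts.add(run);
                run = 0;
                current = b;
            } else if (run == maxCount) {   // run too long: emit it, then a run of length 0
                counts.add(run);
                counts.add(0);
                run = 0;
            }
            run++;
        }
        counts.add(run);
        return counts;
    }

    // Rebuilds the bit string from alternating run lengths.
    static String decode(List<Integer> counts) {
        StringBuilder sb = new StringBuilder();
        char bit = '0';
        for (int count : counts) {
            for (int i = 0; i < count; i++) sb.append(bit);
            bit = (bit == '0') ? '1' : '0';
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String bits = "000000000000000" + "1111111" + "0000000" + "11111111111";
        System.out.println(encode(bits, 15));     // prints [15, 7, 7, 11]
        System.out.println(encode("00000", 3));   // prints [3, 0, 2]
    }
}
```

With 4-bit counts the first example needs four counts, that is 16 bits, for a 40-bit input.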

Variable-length codes

Q. How do we avoid ambiguity?
A. Ensure that no codeword is a prefix of another.

Ex 1. Fixed-length code.
Ex 2. Append special stop character to each codeword.
Ex 3. General prefix-free code.

[Figure: two prefix-free codes for the same message, each with a codeword table and compressed bitstring (30 bits and 29 bits).]

Prefix-free codes: trie representation

Q. How to represent the prefix-free code?
A. A binary trie!
  Characters in leaves.
  Codeword is path from root to leaf.

Prefix-free codes: expansion

Expansion.
  Start at root.
  Go left if bit is 0; go right if 1.
  If leaf node, write character; return to root node; repeat.

Q. Why would this fail if the code isn't prefix-free?
A. Internal nodes also have chars, but decompressor will never output them.

Prefix-free codes: compression

Compression. Create ST of key-value pairs (character to codeword).
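The expansion procedure (start at the root, go left on 0 and right on 1, emit the character at a leaf, return to the root) can be sketched as below. The trie, its three-character code (A = 0, B = 10, C = 11), and the class name `PrefixFreeExpand` are made up for this example; the sketch also assumes the trie has at least two leaves, so the root is never a leaf.

```java
// Sketch of prefix-free expansion using a binary trie.
// The example code A=0, B=10, C=11 is hypothetical.
class PrefixFreeExpand {
    // Leaves hold a character; internal nodes have exactly two children.
    static final class Node {
        final char ch;
        final Node left, right;
        Node(char ch, Node left, Node right) {
            this.ch = ch;
            this.left = left;
            this.right = right;
        }
        boolean isLeaf() { return left == null && right == null; }
    }

    // Expands a string of '0'/'1' characters by walking the trie.
    static String expand(Node root, String bits) {
        StringBuilder out = new StringBuilder();
        Node x = root;
        for (int i = 0; i < bits.length(); i++) {
            x = (bits.charAt(i) == '0') ? x.left : x.right;
            if (x.isLeaf()) {       // reached a codeword: write char, restart at root
                out.append(x.ch);
                x = root;
            }
        }
        if (x != root)
            throw new IllegalArgumentException("bitstring ends in the middle of a codeword");
        return out.toString();
    }

    public static void main(String[] args) {
        Node root = new Node('\0',
                new Node('A', null, null),
                new Node('\0', new Node('B', null, null), new Node('C', null, null)));
        System.out.println(expand(root, "010110"));   // prints ABCA
    }
}
```

Because no codeword is a prefix of another, every codeword ends at a leaf, so the decoder never has to guess where one codeword stops and the next begins.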

Data compression: quiz 3

Consider the following trie representation of a prefix-free code. Expand the compressed bitstring.

[Figure: trie for the quiz.]

Huffman coding overview

Static model. Use the same prefix-free code for all messages.
Dynamic model. Use a custom prefix-free code for each message.

Compression.
  Read message.
  Build best prefix-free code for message. How? [ahead]
  Write prefix-free code (as a trie).
  Compress message using prefix-free code.

Expansion.
  Read prefix-free code (as a trie) from file.
  Read compressed message and expand using trie.

Prefix-free codes: how to transmit

Q. How to write the trie?
A. Write preorder traversal of trie; mark leaf and internal nodes with a bit.

[Figure: Using preorder traversal to encode a trie as a bitstream (leaves and internal nodes marked).]

Note. If message is long, overhead of transmitting trie is small.

Fill in the blanks:

  private static void writeTrie(Node x)
  {
     if (x.isLeaf())
     {
        BinaryStdOut.write(true);
        BinaryStdOut.write(???);
        return;
     }
     BinaryStdOut.write(false);
     writeTrie(???);
     writeTrie(???);
  }

  private static class Node implements Comparable<Node>
  {
     private final char ch;            // used only for leaf nodes
     private final int freq;           // used only by compress()
     private final Node left, right;
  }

Prefix-free codes: how to transmit

Q. How to write the trie?
A. Write preorder traversal of trie; mark leaf and internal nodes with a bit.

[Figure: Using preorder traversal to encode a trie as a bitstream (leaves and internal nodes marked).]

  private static void writeTrie(Node x)
  {
     if (x.isLeaf())
     {
        BinaryStdOut.write(true);
        BinaryStdOut.write(x.ch, 8);
        return;
     }
     BinaryStdOut.write(false);
     writeTrie(x.left);
     writeTrie(x.right);
  }

  private static class Node implements Comparable<Node>
  {
     private final char ch;            // used only for leaf nodes
     private final int freq;           // used only by compress()
     private final Node left, right;
  }

Q. How to read in the trie?
A. Reconstruct from preorder traversal of trie.

  private static Node readTrie()
  {
     if (BinaryStdIn.readBoolean())
     {
        char c = BinaryStdIn.readChar(8);
        return new Node(c, 0, null, null);
     }
     Node x = readTrie();
     Node y = readTrie();
     return new Node('\0', 0, x, y);   // arbitrary value (not used with internal nodes)
  }

Huffman codes

Q. How to find best prefix-free code?

Huffman algorithm:
  Count frequency freq[i] for each char i in input.
  Start with one node corresponding to each char i (with weight freq[i]).
  Repeat until single trie formed:
    select two tries with min weight freq[i] and freq[j]
    merge into single trie with weight freq[i] + freq[j]
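The preorder write and read can be tried out without the binary I/O classes by writing the bits into a string. In this sketch the class name `TrieCodec`, the `StringBuilder` output, and the one-element `int[]` read cursor are assumptions chosen to keep the example self-contained; the bit convention (1 marks a leaf followed by its 8-bit char, 0 marks an internal node) follows the code above.

```java
// Sketch: serialize a code trie in preorder and read it back.
// Bits are kept as '0'/'1' characters so the example needs no binary I/O library.
class TrieCodec {
    static final class Node {
        final char ch;            // used only for leaf nodes
        final Node left, right;
        Node(char ch, Node left, Node right) {
            this.ch = ch;
            this.left = left;
            this.right = right;
        }
        boolean isLeaf() { return left == null && right == null; }
    }

    // Preorder: '1' + 8-bit char for a leaf, '0' then both subtries for an internal node.
    static void writeTrie(Node x, StringBuilder bits) {
        if (x.isLeaf()) {
            bits.append('1');
            for (int i = 7; i >= 0; i--)
                bits.append(((x.ch >> i) & 1) == 1 ? '1' : '0');
            return;
        }
        bits.append('0');
        writeTrie(x.left, bits);
        writeTrie(x.right, bits);
    }

    // pos[0] is the index of the next unread bit; it advances as bits are consumed.
    static Node readTrie(String bits, int[] pos) {
        if (bits.charAt(pos[0]++) == '1') {
            char c = 0;
            for (int i = 0; i < 8; i++)
                c = (char) ((c << 1) | (bits.charAt(pos[0]++) - '0'));
            return new Node(c, null, null);
        }
        Node left = readTrie(bits, pos);
        Node right = readTrie(bits, pos);
        return new Node('\0', left, right);   // char value is unused for internal nodes
    }

    public static void main(String[] args) {
        Node root = new Node('\0',
                new Node('A', null, null),
                new Node('\0', new Node('B', null, null), new Node('C', null, null)));
        StringBuilder bits = new StringBuilder();
        writeTrie(root, bits);
        System.out.println(bits.length() + " bits: " + bits);

        Node copy = readTrie(bits.toString(), new int[] {0});
        StringBuilder again = new StringBuilder();
        writeTrie(copy, again);
        System.out.println(again.toString().equals(bits.toString()));   // prints true
    }
}
```

A trie with k leaves has k - 1 internal nodes, so it costs 9k + (k - 1) bits in this format: 29 bits for the three-leaf example.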

Huffman algorithm demo

Count frequency for each character in input.
Start with one node corresponding to each character with weight equal to frequency.
Repeat until a single trie remains:
  Select two tries with min weight.
  Merge into single trie with cumulative weight.

[Figure: successive steps of the demo, one merge per slide.]

Constructing a Huffman encoding trie: Java implementation

  private static Node buildTrie(int[] freq)
  {
     // initialize PQ with singleton tries
     MinPQ<Node> pq = new MinPQ<Node>();
     for (char i = 0; i < R; i++)
        if (freq[i] > 0)
           pq.insert(new Node(i, freq[i], null, null));

     // merge two smallest tries
     while (pq.size() > 1)
     {
        Node x = pq.delMin();
        Node y = pq.delMin();
        // char not used for internal nodes; weight is the total frequency; x and y are the two subtries
        Node parent = new Node('\0', x.freq + y.freq, x, y);
        pq.insert(parent);
     }
     return pq.delMin();
  }

Practice

Construct the Huffman code for the following strings:
  aababcabcdabcde
  abcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcd

For aababcabcdabcde, the character frequencies are a 5, b 4, c 3, d 2, e 1.

For abcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcd, the characters a, b, c, d are equally frequent. Each codeword uses 2 bits, so no compression (or expansion) of input. Small overhead due to need to store trie.

Huffman coding: overview

Compression: high-level steps.
  Build prefix-free code for message:
    Tabulate character frequencies.
    Recursively merge two min weight tries.
  Write prefix-free code (as a trie).
  Compress message using prefix-free code:
    Build symbol table from characters to codewords.
    Output codeword for each character in input.

Expansion: high-level steps.
  Read and decode prefix-free code (as a trie) from file.
  Expand compressed message using trie:
    Repeatedly find path from root to leaf in trie using bit sequence.
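The practice strings can be checked with a small stand-alone version of the construction. This sketch swaps the course's `MinPQ` for `java.util.PriorityQueue` and counts frequencies with a `HashMap`; the class name `HuffmanSketch` and the helper `encodedBits` are assumptions for the example. It assumes the input contains at least two distinct characters.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Sketch of Huffman trie construction using java.util.PriorityQueue
// in place of the course's MinPQ. Assumes at least two distinct characters.
class HuffmanSketch {
    static final class Node implements Comparable<Node> {
        final char ch;          // used only for leaf nodes
        final int freq;         // weight of the subtrie
        final Node left, right;
        Node(char ch, int freq, Node left, Node right) {
            this.ch = ch;
            this.freq = freq;
            this.left = left;
            this.right = right;
        }
        boolean isLeaf() { return left == null && right == null; }
        public int compareTo(Node that) { return Integer.compare(this.freq, that.freq); }
    }

    static Node buildTrie(String text) {
        // tabulate character frequencies
        Map<Character, Integer> freq = new HashMap<>();
        for (char c : text.toCharArray()) freq.merge(c, 1, Integer::sum);

        // initialize the priority queue with singleton tries
        PriorityQueue<Node> pq = new PriorityQueue<>();
        for (Map.Entry<Character, Integer> e : freq.entrySet())
            pq.add(new Node(e.getKey(), e.getValue(), null, null));

        // repeatedly merge the two tries of minimum weight
        while (pq.size() > 1) {
            Node x = pq.poll();
            Node y = pq.poll();
            pq.add(new Node('\0', x.freq + y.freq, x, y));
        }
        return pq.poll();
    }

    // Codeword of each leaf is the path from the root: '0' for left, '1' for right.
    static void buildCode(Node x, String prefix, Map<Character, String> code) {
        if (x.isLeaf()) { code.put(x.ch, prefix); return; }
        buildCode(x.left, prefix + '0', code);
        buildCode(x.right, prefix + '1', code);
    }

    // Number of bits in the encoded message (not counting the trie itself).
    static int encodedBits(String text) {
        Map<Character, String> code = new HashMap<>();
        buildCode(buildTrie(text), "", code);
        int bits = 0;
        for (char c : text.toCharArray()) bits += code.get(c).length();
        return bits;
    }

    public static void main(String[] args) {
        System.out.println(encodedBits("aababcabcdabcde"));   // prints 33
        System.out.println(encodedBits("abcdabcd"));          // prints 16
    }
}
```

The individual codewords depend on how ties are broken in the priority queue, but the total number of encoded bits does not: 33 bits for the first practice string (a, b, c get 2-bit codewords; d, e get 3-bit codewords), and 2 bits per character whenever four characters are equally frequent.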

Huffman compression summary

Proposition. Huffman's algorithm produces an optimal prefix-free code: no prefix-free code uses fewer bits.
Pf. See textbook.

Two-pass implementation (for compression).
  Pass 1: tabulate character frequencies; build trie.
  Pass 2: encode file by traversing trie (or symbol table).

Running time (for compression). Using a binary heap: N + R log R, where N is the input size and R is the alphabet size.
Running time (for expansion). Using a binary trie: N.

Q. Can we do better? [stay tuned]

Lossy vs. lossless compression

This lecture: lossless compression.
Images, music, videos, ...: lossy compression is dramatically more effective.

Statistical methods

Static model. Same model for all texts.
  Fast.
  Not optimal: different texts have different statistical properties.
  Ex: ASCII, Morse code.

Dynamic model. Generate model based on text.
  Preliminary pass needed to generate model.
  Must transmit the model.
  Ex: Huffman code.

Adaptive model. Progressively learn and update model as you read text.
  More accurate modeling produces better compression.
  Decoding must start from beginning.
  Ex: LZW.

[Photos: Abraham Lempel, Jacob Ziv]

Lempel-Ziv-Welch compression

[Figure: LZW compression demo (input, matches, codeword table).]

LZW compression.
  Create ST mapping string keys to W-bit codewords.
  Initialize ST with codewords for single-character keys.
  Find longest string s in ST that is a prefix of unscanned part of input.
  Write the W-bit codeword associated with s.
  Add s + c to ST, where c is next character in the input.

Q. How to represent LZW compression code table?
A. A trie to support longest prefix match.

LZW expansion

[Figure: LZW expansion demo (codewords, output, codeword table).]

LZW expansion.
  Create ST mapping W-bit keys to string values.
  Initialize ST to contain single-character values.
  Read a W-bit key.
  Find associated string value in ST and write it out.
  Update ST.

Q. How to represent LZW expansion code table?
A. An array of length 2^W.

Data compression: quiz 4

What is the LZW compression of the given input string?

LZW tricky case: compression

[Figure: input, matches, and codeword table for the tricky case.]

LZW tricky case: expansion

During expansion we may need the string for a codeword before it is in the codeword table.
  We can deduce that its string is the previous output string followed by some character x.
  That string is also the next output, so x is the first character of the previous output string.
  Now we have deduced x.

LZW implementation details

How big to make ST?
  How long is message?
  Whole message similar model?
  [many other variations]

What to do when ST fills up?
  Throw away and start over. [GIF]
  Throw away when not effective. [Unix compress]
  [many other variations]

Why not put longer substrings in ST?
  [many variations have been developed]
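The LZW steps and the tricky case can be sketched together. This is a simplified illustration rather than the course implementation: it emits a list of integer codewords instead of W-bit fields, never limits the table size, uses a `HashMap` instead of a trie for the longest-prefix match, and reserves codeword 256 as an end-of-input marker. The class name `LzwSketch`, the constant `R`, and the sample input are assumptions; input characters are assumed to be below 256.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of LZW with integer codewords and an unbounded table.
// Codeword R (256) is reserved here as an end-of-input marker (an assumption).
class LzwSketch {
    static final int R = 256;   // number of single-character codewords

    static List<Integer> compress(String input) {
        Map<String, Integer> st = new HashMap<>();
        for (int i = 0; i < R; i++) st.put(String.valueOf((char) i), i);
        int next = R + 1;       // next unused codeword
        List<Integer> out = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            // find the longest prefix of the unscanned input that is in the table
            int j = i + 1;
            while (j < input.length() && st.containsKey(input.substring(i, j + 1))) j++;
            out.add(st.get(input.substring(i, j)));
            // add s + c, where c is the next character after the match
            if (j < input.length()) st.put(input.substring(i, j + 1), next++);
            i = j;
        }
        out.add(R);
        return out;
    }

    static String expand(List<Integer> codes) {
        List<String> st = new ArrayList<>();
        for (int i = 0; i < R; i++) st.add(String.valueOf((char) i));
        st.add("");             // placeholder so indices line up with codeword R
        StringBuilder out = new StringBuilder();
        int idx = 0;
        int code = codes.get(idx++);
        if (code == R) return "";
        String val = st.get(code);
        while (true) {
            out.append(val);
            code = codes.get(idx++);
            if (code == R) break;
            String s;
            if (code == st.size()) s = val + val.charAt(0);   // tricky case: codeword not in table yet
            else s = st.get(code);
            st.add(val + s.charAt(0));                        // same entry the compressor added
            val = s;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Integer> codes = compress("ABABABA");
        System.out.println(codes);           // prints [65, 66, 257, 259, 256]
        System.out.println(expand(codes));   // prints ABABABA
    }
}
```

In the sample run, codeword 259 is received before the expander has added entry 259 to its table, which is exactly the tricky case: the missing string must be the previous string ("AB") plus its own first character ("A").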

LZW in the real world

Lempel-Ziv and friends.
  LZ77. (not patented; widely used in open source)
  LZ78.
  LZW. (previously under patent)
  Deflate / zlib = LZ77 variant + Huffman.

Unix compress, GIF, TIFF, V.42bis modem: LZW.
zip, 7zip, gzip, jar, png, pdf: deflate / zlib.
iPhone, Wii, Apache HTTP server: deflate / zlib.

Lossless data compression benchmarks

  year   scheme            bits / char
  1967   ASCII             7.00
  1950   Huffman           4.70
  1977   LZ77              3.94
  1984   LZMW              3.32
  1987   LZH               3.30
  1987   move-to-front     3.24
  1987   LZB               3.18
  1987   gzip              2.71
  1988   PPMC              2.48
  1994   SAKDC             2.47
  1994   PPM               2.34
  1995   Burrows-Wheeler   2.29   (next programming assignment)
  1997   BOA               1.99
  1999   RK                1.89

Data compression using Calgary corpus.

Data compression summary

Lossless compression.
  Represent fixed-length symbols with variable-length codes. [Huffman]
  Represent variable-length symbols with fixed-length codes. [LZW]

Lossy compression. [not covered in this course]
  JPEG, MPEG, MP3, ...
  FFT/DCT, wavelets, fractals, ...

Theoretical limits on compression. Shannon entropy: $H(X) = -\sum_{i}^{n} p(x_i) \lg p(x_i)$

Practical compression. Exploit extra knowledge whenever possible.
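The Shannon entropy formula can be evaluated on the character frequencies of a message to get a per-character lower bound for codes that encode one character at a time. The sketch below computes that empirical entropy in bits per character; the class name `EntropySketch` is an assumption, and treating each character as an independent draw from its observed frequency is a modeling assumption, not a claim about real text.

```java
// Sketch: empirical Shannon entropy (bits per character) of a string,
// treating characters as independent draws with their observed frequencies.
class EntropySketch {
    static double entropy(String text) {
        int[] count = new int[Character.MAX_VALUE + 1];
        for (int i = 0; i < text.length(); i++) count[text.charAt(i)]++;

        double h = 0.0;
        for (int c : count) {
            if (c == 0) continue;                    // p lg p tends to 0 as p tends to 0
            double p = (double) c / text.length();
            h -= p * (Math.log(p) / Math.log(2));    // lg p = ln p / ln 2
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(entropy("abcdabcd"));          // four equally likely chars: 2 bits per char
        System.out.println(entropy("aababcabcdabcde"));   // skewed frequencies: fewer than lg 5 bits per char
    }
}
```

For four equally frequent characters the entropy is exactly 2 bits per character, which is why a 2-bit codeword per character cannot be improved on by any character-by-character code.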
