Data compression with Huffman and LZW André R. Brodtkorb, Andre.Brodtkorb@sintef.no

Outline
Data storage and compression
Huffman: how it works and where it's used
LZW: how it works and where it's used
Summary and further reading

Data Storage and Compression

Data Storage
Oral tradition, written text, printed text, electronic storage.
Little Red Riding Hood, Wikipedia [Public domain, Gustave Doré]; Jean Miélot, Wikipedia [Public domain, Jean Le Tavernier]; Gutenberg Bible, Wikipedia [CC-BY-SA 2.0, user NYC Wanderer (Kevin Eng)]; Whirlwind's core memory, Wikipedia [CC-BY-SA 3.0, user Dpbsmith]

Why do we need data compression?
Data is massive: Slackware Linux consisted of over 70 floppy disks in 1994! [Slackware] 2.5 billion gigabytes of new data every day in 2012 [IBM]
Data is inefficiently stored: ASCII text to represent numbers; consecutive frames of a video are often close to identical.
Storage and bandwidth are limited: world average bandwidth is 3.9 Mbit/s [Akamai, 2014]. Time to download 1 GB: ~30 minutes!
Floppy disk, Wikipedia [public domain, George Chernilevsky]

Types of data compression
Data compression tries to remove redundant or superfluous information.
Lossless compression: remove redundant information. The original signal can be reproduced exactly.
Lossy compression: remove superfluous information. The original signal can only be reproduced with (minor) differences.

Lossless data compression example: color lookup tables
A 24-bit (RGB) color image can contain over 16 million different colors. The Chicago "Cloud Gate" image has 161627 unique colors and 5 million pixels (2592x1936), so we need 18 bits to index the unique colors.
Original: 5M pixels x 24 bits/pixel = 14.36 MB
Look-up table: 5M pixels x 18 bits/pixel + 161K colors x (18+24) bits/color = 11.58 MB
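
A quick back-of-the-envelope check of the arithmetic above, as a minimal Python sketch (assuming the slide's MB means binary megabytes, MiB, and counting the table as on the slide: an 18-bit index plus a 24-bit color per entry):

    # Check the color look-up-table arithmetic from the slide.
    pixels = 2592 * 1936           # ~5 million pixels
    colors = 161627                # unique colors in the image
    index_bits = 18                # 2**18 = 262144 >= 161627 unique colors
    rgb_bits = 24                  # 8 bits for each of R, G, B

    original = pixels * rgb_bits / 8 / 2**20
    lut = (pixels * index_bits + colors * (index_bits + rgb_bits)) / 8 / 2**20
    print(f"original: {original:.2f} MiB, with look-up table: {lut:.2f} MiB")
    # original: 14.36 MiB, with look-up table: 11.58 MiB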

Lossy data compression example
The human eye is not very sensitive when it comes to colors: simply reduce the number of colors to decrease the bits per pixel. Information is lost and cannot be recovered. Caveat: the following slides require a good projector.

[The same image at decreasing color counts:]
11.6 MB - 161627 colors
4.8 MB - 256 colors
3.6 MB - 64 colors
3.0 MB - 32 colors

Huffman coding

Huffman Code
A method for lossless compression of data, introduced by David A. Huffman (1925-1999) during his Ph.D. [1]. The basic idea is to replace the original alphabet with variable-length codes, similar to Morse code.
David A. Huffman [Don Harris]
[1] Huffman, D. (1952). "A Method for the Construction of Minimum-Redundancy Codes". Proceedings of the IRE 40 (9): 1098-1101.

Morse and Huffman 1/2
Morse code uses variable-length codes. Symbols are separated by a short pause, words by a long pause. Frequently used symbols have shorter codes: E is one dot, and takes one unit of time to transmit; 0 is five dashes, and takes 19 units of time to transmit.
Morse code, Wikipedia [Public domain, Rhey T. Snodgrass & Victor F. Camp, 1922]

Morse and Huffman 2/2
Morse code can be written as a binary tree. Start at the top: if the next tone is a dot, go left; if the next tone is a dash, go right. Stop when there is a pause, and you have found your letter.
Morse code tree, Wikipedia [CC-BY-SA 3.0, user Aris00]
Huffman coding similarly uses a binary tree, but is an algorithm for finding the optimal code for each symbol. It removes the need for pauses and minimizes the number of dots/dashes.

Huffman example 1/4
We have an alphabet of symbols (a, b, c, d, r) used to encode e.g. "abracadabra".
1. Create a binary tree with all symbols*
2. Assign a binary code to each connector (just like Morse): right child = 1, left child = 0
* We'll cover how to create the tree in a couple of slides
[Figure: binary tree with leaves a, b, c, d, r]

Huffman example 2/4
Read off the code for each symbol by traversing the tree:

Symbol:  a    b    c    d    r
Code:    0    111  100  101  110

Replace symbols by codes: a b r a c a d a b r a -> 0 111 110 0 100 0 101 0 111 110 0
Send message: 01111100100010101111100
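
Once the code table exists, the replacement step itself is a table lookup; a minimal sketch:

    # Encode "abracadabra" using the code table read off the tree above.
    codes = {'a': '0', 'b': '111', 'c': '100', 'd': '101', 'r': '110'}
    message = ''.join(codes[symbol] for symbol in "abracadabra")
    print(message)  # 01111100100010101111100 (23 bits vs. 11 x 8 = 88 bits of ASCII)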

Huffman example 3/4
To decode the message, we need the binary tree and the message itself. The tree can be static/predefined, or dynamic and transmitted with the message itself. Read the message bit by bit and traverse the tree to decode: 1 = follow right child, 0 = follow left child. When you reach a leaf node, you have found your symbol!
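
A minimal decoding sketch, with the example tree hard-coded as nested (left, right) tuples:

    # Bit-by-bit Huffman decoding: walk from the root, restart at each leaf.
    tree = ('a', (('c', 'd'), ('r', 'b')))  # index 0 = left child, 1 = right child

    def huffman_decode(bits, tree):
        out, node = [], tree
        for bit in bits:
            node = node[int(bit)]        # 0: follow left child, 1: follow right child
            if isinstance(node, str):    # reached a leaf: emit the symbol...
                out.append(node)
                node = tree              # ...and restart at the root
        return ''.join(out)

    print(huffman_decode("01111100100010101111100", tree))  # abracadabra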

Huffman example 4/4
Message: 01111100100010101111100
Following the tree bit by bit and emitting a symbol at each leaf gives, step by step:
a, b, r, a, c, a, d, a, b, r, a -> "abracadabra"

Creating the Huffman tree 1/4
A Huffman tree is generated from the frequency or probability of each symbol. Our text "abracadabra" gives rise to the following:

Symbol:       a     b     c     d     r
Frequency:    5     2     1     1     2
Probability:  0.45  0.18  0.09  0.09  0.18

The aim is to give the shortest code to the most frequent symbol.

Creating the Huffman tree 2/4
Algorithm:
1. Start by creating leaf nodes for each symbol
2. Add all nodes to a priority queue
3. Get the two least frequent nodes, and make a parent node for them
4. Add the newly created node (with the cumulative frequency) to the priority queue
5. Go to 3
When there is a single node left, the tree is complete. Traverse the tree and assign the Huffman codes.
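
A minimal sketch of this algorithm, using Python's heapq as the priority queue (tie-breaking between equal frequencies may give different, but equally short, codes than the slides):

    import heapq
    from itertools import count

    def build_huffman_codes(freqs):
        """Build Huffman codes from a {symbol: frequency} map."""
        tiebreak = count()  # keeps heap entries comparable when frequencies tie
        heap = [(f, next(tiebreak), sym) for sym, f in freqs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            f1, _, left = heapq.heappop(heap)    # the two least frequent nodes...
            f2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))  # ...get a parent
        codes = {}
        def assign(node, code):
            if isinstance(node, tuple):          # internal node: recurse
                assign(node[0], code + '0')      # left child = 0
                assign(node[1], code + '1')      # right child = 1
            else:
                codes[node] = code or '0'        # leaf (or single-symbol alphabet)
        assign(heap[0][2], '')
        return codes

    print(build_huffman_codes({'a': 5, 'b': 2, 'c': 1, 'd': 1, 'r': 2}))
    # e.g. {'a': '0', 'c': '100', 'd': '101', 'b': '110', 'r': '111'}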

Creating the Huffman tree 3/4
Start: a:5, b:2, c:1, d:1, r:2
Merge c:1 + d:1 into node 2 (queue: a:5, r:2, b:2, 2)
Merge r:2 + b:2 into node 4 (queue: a:5, 4, 2)
Merge 4 + 2 into node 6 (queue: a:5, 6)
Merge a:5 + 6 into the root node 11

Creating the Huffman tree 4/4
Label each edge with 0 (left child) or 1 (right child) and read off the codes:
Root 11: left child a:5, right child node 6
Node 6: left child node 2 (c:1, d:1), right child node 4 (r:2, b:2)
Resulting codes: a = 0, c = 100, d = 101, r = 110, b = 111

Implementing and testing 1/2
Huffman is "simple" in principle (once you get it), but can be challenging to get right: you need to fiddle with bits, think about byte order, etc., and it is difficult to debug since the output is bits. Around 290 lines of code.
Compression test:
Test dataset A: Macbeth by Shakespeare. HTML, 202 KB. Available from http://shakespeare.mit.edu/macbeth/full.html
Test dataset B: Bus video file. yuv, uncompressed video, 1.37 MB. Available from https://media.xiph.org/video/derf/

Implementing and testing 2/2
Test A: Input: 207318 bytes, output: 134909 bytes (including tree). Compressed size: 65% of original. Symbols: 85 ≈ 2^6.4 => 7 bits/symbol without Huffman. Achieved: 5.19387 bits/symbol. Optimal [Shannon]: 5.16525 bits/symbol.
Test B: Input: 1444608 bytes, output: 1245911 bytes (including tree). Compressed size: 86% of original. Symbols: 256 = 2^8 => 8 bits/symbol without Huffman. Achieved: 6.89453 bits/symbol. Optimal [Shannon]: 6.85687 bits/symbol.
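
The Shannon numbers above come from the byte histogram: H = -sum(p_i * log2(p_i)). A minimal sketch of the computation (the filename macbeth.html is a hypothetical local copy of test dataset A):

    import math
    from collections import Counter

    def shannon_entropy(data: bytes) -> float:
        """Shannon entropy in bits per symbol: H = -sum(p * log2(p))."""
        counts = Counter(data)
        n = len(data)
        return -sum(c / n * math.log2(c / n) for c in counts.values())

    with open("macbeth.html", "rb") as f:  # hypothetical local copy of dataset A
        data = f.read()
    print(f"{len(set(data))} symbols, entropy {shannon_entropy(data):.5f} bits/symbol")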

Uses of Huffman coding
Fax machines: combination of run-length encoding and Huffman
JPEG images: DCT, followed by quantization (information loss), and Huffman
MP3 files: information removal followed by Huffman coding
DEFLATE: an integral part of many tools and file formats; uses LZ77 and Huffman coding. Examples: zlib, PNG, gzip, SSH, HTTP, ...

Huffman speed 1/2
The Huffman tree can be generated in O(n log n) for n symbols using a priority queue. Typically, building the Huffman tree is negligible for long data sets; compression time is a function of data set size. The ubiquity of Huffman coding means that there are a lot of efficient implementations out there:
zlib: compression of text and data
libjpeg-turbo: JPEG encoder tailored for VNC
Intel IPP: Intel Integrated Performance Primitives
FFmpeg: video and audio
lodepng: self-contained PNG decoder
NVIDIA GPUs: hardware support (probably also AMD GPUs)
FPGAs: real-time streaming of Huffman

Huffman speed 2/2
A major problem with Huffman today is its serial nature: decoding of the stream can only be done bit by bit. Some JPEG formats use restart markers to enable parallel decoding (patented). DEFLATE can be decoded in parallel by splitting it into blocks. Today we have 4-12 threads in standard PCs, but plain Huffman decoding does not scale to more than one thread.

LZW

LZW [1]
A compression algorithm named after Abraham Lempel (1936-), Jacob Ziv (1931-), and Terry Welch (1939?-1988); a modification by Terry Welch of the earlier LZ77 and LZ78 algorithms. Patented in 1983 in the US; in 1984 in the UK, France, Germany, Italy, Japan, and Canada. The basic idea is to replace several symbols with a single code.
Abraham Lempel, Wikipedia [CC-BY-SA 3.0, user Staelin]; Jacob Ziv, Wikipedia [Public domain, user חישוביות]
[1] Welch, Terry (1984). "A Technique for High-Performance Data Compression". Computer 17 (6): 8-19.

Dictionaries for compression
The main idea of LZW is similar to logograms: each code refers to a word, part of a word, etc. Create a dictionary which translates symbols into codes and vice versa. The LZW algorithm creates the dictionary automatically based on the input data.
Hieroglyphs, Wikipedia [Public domain, user Vincnet]; The Story of Shi Shi Eating Lions, Wikipedia [CC-BY-SA 3.0]

LZW Example 1/5
We have data we want to compress, in this case "abracadabra". LZW dynamically* creates a dictionary of strings based on the data we want to compress. This dictionary is used to replace multiple symbols with a single dictionary entry. The dictionary is not stored to file.
* We'll cover how in a couple of slides
Final dictionary: 0 a, 1 b, 2 c, 3 d, 4 r, 5 ab, 6 br, 7 ra, 8 ac, 9 ca, 10 ad, 11 da, 12 abr

LZW Example 2/5
1. Initialize the dictionary with all single symbols
2. Set "w" equal to an empty string
3. Read the next letter into "k"
4. If the dictionary has the string "wk": set w equal to wk; go to 3
5. Else: add wk to the dictionary; write the code of w to the output stream; set w equal to k; go to 3
6. At the end of the input, write the code of w
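
A minimal sketch of this encoding loop (the dictionary is initialized from the symbols actually present in the input, giving a=0, b=1, c=2, d=3, r=4 for our example):

    def lzw_encode(data: str) -> list[int]:
        """Minimal LZW encoder over the symbols present in the input."""
        # 1. Initialize the dictionary with all single symbols.
        dictionary = {sym: i for i, sym in enumerate(sorted(set(data)))}
        w, output = "", []
        for k in data:                       # 3. read the next letter into k
            if w + k in dictionary:          # 4. known string: keep extending w
                w += k
            else:                            # 5. unknown: output code of w, learn wk
                output.append(dictionary[w])
                dictionary[w + k] = len(dictionary)
                w = k
        if w:
            output.append(dictionary[w])     # 6. flush the final string
        return output

    print(lzw_encode("abracadabra"))  # [0, 1, 4, 0, 2, 0, 3, 5, 7]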

LZW Example 3/5

w    k    wk    wk in dict?    Output code (dictionary entry added)
-    a    a     Yes            -
a    b    ab    No             0  (ab = 5)
b    r    br    No             1  (br = 6)
r    a    ra    No             4  (ra = 7)
a    c    ac    No             0  (ac = 8)
c    a    ca    No             2  (ca = 9)
a    d    ad    No             0  (ad = 10)
d    a    da    No             3  (da = 11)
a    b    ab    Yes            -
ab   r    abr   No             5  (abr = 12)
r    a    ra    Yes            -
ra   -    -     end of input   7

Output stream: 0 1 4 0 2 0 3 5 7

LZW Example 4/5
Decoding the message is done in the opposite order of encoding:
1. Decode the first code, store it in "w", and also write it to the output stream
2. Read the next code
3. If the dictionary has the next code: set "k" equal to the decoded code; write out k; set "wk" equal to w plus the first character of k; add wk to the dictionary; set w equal to k; go to 2

LZW Example 5/5

Input code   In dict?   w    k    wk added   Output
0            Yes        -    a    -          a
1            Yes        a    b    ab = 5     b
4            Yes        b    r    br = 6     r
0            Yes        r    a    ra = 7     a
2            Yes        a    c    ac = 8     c
0            Yes        c    a    ca = 9     a
3            Yes        a    d    ad = 10    d
5            Yes        d    ab   da = 11    ab
7            Yes        ab   ra   abr = 12   ra

Decoded output: abracadabra

LZW Dictionary 1/2
LZW uses 12-bit codes for the dictionary. All 256 single-byte values are added at initialization; the rest of the codes are used for combinations: 2^12 = 4096 => 3840 entries for combinations.
When the dictionary is full, compression becomes "static". It can also be reset: clear all entries, and re-initialize with the single-byte values. One can add a "reset" code which is used when the compression ratio drops.
An LZW variant uses variable bit-length codes: start with 9-bit codes; when the dictionary has 2^9 = 512 entries, continue with 10-bit codes; when the dictionary has 2^10 = 1024 entries, continue with 11-bit codes; and so on.
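
A small sketch of the code-width rule in the variable-width variant (the exact switch-over point varies between implementations; this is one common convention):

    def code_width(dictionary_size: int, start_bits: int = 9) -> int:
        """Bits per code for the current dictionary size in variable-width LZW."""
        bits = start_bits
        while dictionary_size > (1 << bits):  # dictionary outgrew the current width
            bits += 1
        return bits

    for size in (256, 512, 513, 1024, 1025, 4096):
        print(size, code_width(size))  # -> 9, 9, 10, 10, 11, 12 bits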

LZW Dictionary 2/2
Occasionally a code is not in the dictionary during decode. This happens when a code that was just added during encoding is used immediately afterwards.
Example: the encoder knows about "ab", and gets the string "ababa": "ab" is output and "aba" is added to the dictionary; then "aba" is output.
This is always the pattern, so we can deduce that any unknown code must represent the previous output string with its own first letter appended to the end.
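
A minimal decoder sketch that includes this special case; the second call demonstrates it on an "ababa"-style input, where the code for "aba" arrives before the decoder has added it:

    def lzw_decode(codes: list[int], alphabet: str) -> str:
        """Minimal LZW decoder, including the code-not-yet-in-dictionary case."""
        dictionary = {i: sym for i, sym in enumerate(sorted(alphabet))}
        w = dictionary[codes[0]]
        output = [w]
        for code in codes[1:]:
            if code in dictionary:
                k = dictionary[code]
            else:                     # code was added this very step by the encoder:
                k = w + w[0]          # it must be w plus the first letter of w
            output.append(k)
            dictionary[len(dictionary)] = w + k[0]  # learn the same entry as the encoder
            w = k
        return ''.join(output)

    print(lzw_decode([0, 1, 4, 0, 2, 0, 3, 5, 7], "abcdr"))  # abracadabra
    print(lzw_decode([0, 1, 2, 4], "ab"))  # abababa (code 4 hits the special case)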

Implementing and testing
LZW is "simple" in principle and in practice; the only complication is the need to fiddle with nibbles, since 12-bit codes do not align with byte boundaries. Around 190 lines of code.
Compression test:
Test dataset A: Macbeth by Shakespeare. HTML, 202 KB. Available from http://shakespeare.mit.edu/macbeth/full.html
Test dataset B: Bus video file. yuv, uncompressed video, 1.37 MB. Available from https://media.xiph.org/video/derf/

Implementing and testing
Test A: Input: 207318 bytes, output: 88707 bytes. Compressed size: 43% of original.
Test B: Input: 1444608 bytes, output: 1318365 bytes. Compressed size: 91% of original.

LZW Speed 1/2
LZW was designed for efficient hardware implementation: a fixed-size dictionary, trivial initialization, a finite state machine. It has the same problem as Huffman with respect to parallelism: it is inherently serial.
LZW used to power the standard Unix/Linux tool compress, but patents and more efficient algorithms limited its use. It is closely related to the LZ77 and LZ78 algorithms: LZ77 uses a sliding window and length-distance pairs, and is part of the much-used DEFLATE algorithm; LZ78 uses a dictionary like LZW.

LZW Speed 2/2
GIF files use LZW compression: they are limited to a small alphabet (max 256 colors) and often contain a lot of repeated patterns, which makes them highly suitable for LZW. LZW appears to have lost a lot of traction: 20 years of patents have taken their toll, and drove forward the creation of the free PNG format. GIF images are still actively used (e.g., senorgif.com).

Summary

Summary
Huffman is based on replacing fixed-width symbols with variable-length bit codes. It approaches the theoretical entropy given by Shannon, with a small overhead for storing the Huffman table itself. It works very well for data with a few highly used symbols, and poorly for data where all characters are used equally often. Very fast.
LZW is based on replacing multiple symbols with a single code. There is no overhead for storing the dictionary: it is created dynamically. It only starts "compressing" after the dictionary has a lot of combinations. It works very well with small alphabets (fewer string combinations), and poorly with random data (few repeated "words").

Further reading
Data compression is big bucks! There is a huge number of patents on compression algorithms, so check licensing requirements. The most efficient compression algorithms take knowledge of the underlying data structure into account, often combining lossless and lossy compression.
Open source implementations of LZW and Huffman: https://github.com/babrodtk/compression_demos (warning: not written for speed).

Thank you for your attention! André R. Brodtkorb Email: Andre.Brodtkorb@ifi.uio.no Homepage: http://babrodtk.at.ifi.uio.no/