Data compression with Huffman and LZW André R. Brodtkorb, Andre.Brodtkorb@sintef.no

Outline
Data storage and compression
Huffman: how it works and where it's used
LZW: how it works and where it's used
Summary and further reading

Data Storage and Compression

Data Storage
Oral tradition, written text, printed text, electronic storage.
Little Red Riding Hood, Wikipedia [Public domain, Gustave Doré]; Jean Miélot, Wikipedia [Public domain, Jean Le Tavernier]; Gutenberg Bible, Wikipedia [CC-BY-SA 2.0, user NYC Wanderer (Kevin Eng)]; Whirlwind's core memory, Wikipedia [CC-BY-SA 3.0, user Dpbsmith]

Why do we need data compression?
Data is massive: Slackware Linux consisted of over 70 floppy disks in 1994! [Slackware] 2.5 billion gigabytes of new data every day in 2012 [IBM]
Data is inefficiently stored: ASCII text to represent numbers; consecutive frames of a video are often close to identical.
Storage and bandwidth are limited: world average bandwidth is 3.9 Mbit/s [Akamai, 2014]. Time to download 1 GB: ~30 minutes!
Floppy disk, Wikipedia [public domain, George Chernilevsky]

Types of data compression
Data compression tries to remove redundant or superfluous information.
Lossless compression: remove redundant information. The original signal can be reproduced exactly.
Lossy compression: remove superfluous information. The original signal can only be reproduced with (minor) differences.

Lossless data compression example: color lookup tables
A 24-bit (RGB) color image can contain over 16 million different colors. The Chicago "Cloud Gate" image has 161627 unique colors and 5 million pixels (2592x1936), so we need 18 bits to index the unique colors.
Original: 5M pixels x 24 bits/pixel = 14.36 MB
Look-up table: 5M pixels x 18 bits/pixel + 161K colors x (18+24) bits/color = 11.58 MB
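
A quick back-of-the-envelope check of the arithmetic above, as a minimal Python sketch (assuming the slide's MB means binary megabytes, MiB, and counting the table as on the slide: an 18-bit index plus a 24-bit color per entry):

    # Check the color look-up-table arithmetic from the slide.
    pixels = 2592 * 1936           # ~5 million pixels
    colors = 161627                # unique colors in the image
    index_bits = 18                # 2**18 = 262144 >= 161627 unique colors
    rgb_bits = 24                  # 8 bits for each of R, G, B

    original = pixels * rgb_bits / 8 / 2**20
    lut = (pixels * index_bits + colors * (index_bits + rgb_bits)) / 8 / 2**20
    print(f"original: {original:.2f} MiB, with look-up table: {lut:.2f} MiB")
    # original: 14.36 MiB, with look-up table: 11.58 MiB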

Lossy data compression example
The human eye is not very sensitive when it comes to colors: simply reduce the number of colors to decrease the bits per pixel. Information is lost and cannot be recovered. Caveat: the following slides require a good projector.

[The same image at decreasing color counts:]
11.6 MB - 161627 colors
4.8 MB - 256 colors
3.6 MB - 64 colors
3.0 MB - 32 colors

Huffman coding

Huffman Code
A method for lossless compression of data, introduced by David A. Huffman (1925-1999) during his Ph.D. [1]. The basic idea is to replace the original alphabet with variable-length codes, similar to Morse code.
David A. Huffman [Don Harris]
[1] Huffman, D. (1952). "A Method for the Construction of Minimum-Redundancy Codes". Proceedings of the IRE 40 (9): 1098-1101.

Morse and Huffman 1/2
Morse code uses variable-length codes. Symbols are separated by a short pause, words by a long pause. Frequently used symbols have shorter codes: E is one dot, and takes one unit of time to transmit; 0 is five dashes, and takes 19 units of time to transmit.
Morse code, Wikipedia [Public domain, Rhey T. Snodgrass & Victor F. Camp, 1922]

Morse and Huffman 2/2
Morse code can be written as a binary tree. Start at the top: if the next tone is a dot, go left; if the next tone is a dash, go right. Stop when there is a pause, and you have found your letter.
Morse code tree, Wikipedia [CC-BY-SA 3.0, user Aris00]
Huffman coding similarly uses a binary tree, but is an algorithm for finding the optimal code for each symbol. It removes the need for pauses and minimizes the number of dots/dashes.

Huffman example 1/4
We have an alphabet of symbols (a, b, c, d, r) used to encode e.g. "abracadabra".
1. Create a binary tree with all symbols*
2. Assign a binary code to each connector (just like Morse): right child = 1, left child = 0
* We'll cover how to create the tree in a couple of slides
[Figure: binary tree with leaves a, b, c, d, r]

Huffman example 2/4
Read off the code for each symbol by traversing the tree:

Symbol:  a    b    c    d    r
Code:    0    111  100  101  110

Replace symbols by codes: a b r a c a d a b r a -> 0 111 110 0 100 0 101 0 111 110 0
Send message: 01111100100010101111100
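
Once the code table exists, the replacement step itself is a table lookup; a minimal sketch:

    # Encode "abracadabra" using the code table read off the tree above.
    codes = {'a': '0', 'b': '111', 'c': '100', 'd': '101', 'r': '110'}
    message = ''.join(codes[symbol] for symbol in "abracadabra")
    print(message)  # 01111100100010101111100 (23 bits vs. 11 x 8 = 88 bits of ASCII)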

Huffman example 3/4
To decode the message, we need the binary tree and the message itself. The tree can be static/predefined, or dynamic and transmitted with the message itself. Read the message bit by bit and traverse the tree to decode: 1 = follow right child, 0 = follow left child. When you reach a leaf node, you have found your symbol!
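
A minimal decoding sketch, with the example tree hard-coded as nested (left, right) tuples:

    # Bit-by-bit Huffman decoding: walk from the root, restart at each leaf.
    tree = ('a', (('c', 'd'), ('r', 'b')))  # index 0 = left child, 1 = right child

    def huffman_decode(bits, tree):
        out, node = [], tree
        for bit in bits:
            node = node[int(bit)]        # 0: follow left child, 1: follow right child
            if isinstance(node, str):    # reached a leaf: emit the symbol...
                out.append(node)
                node = tree              # ...and restart at the root
        return ''.join(out)

    print(huffman_decode("01111100100010101111100", tree))  # abracadabra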

Huffman example 4/4
Message: 01111100100010101111100
Following the tree bit by bit and emitting a symbol at each leaf gives, step by step:
a, b, r, a, c, a, d, a, b, r, a -> "abracadabra"

Creating the Huffman tree 1/4
A Huffman tree is generated from the frequency or probability of each symbol. Our text "abracadabra" gives rise to the following:

Symbol:       a     b     c     d     r
Frequency:    5     2     1     1     2
Probability:  0.45  0.18  0.09  0.09  0.18

The aim is to give the shortest code to the most frequent symbol.

Creating the Huffman tree 2/4
Algorithm:
1. Start by creating leaf nodes for each symbol
2. Add all nodes to a priority queue
3. Get the two least frequent nodes, and make a parent node for them
4. Add the newly created node (with the cumulative frequency) to the priority queue
5. Go to 3
When there is a single node left, the tree is complete. Traverse the tree and assign the Huffman codes.
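
A minimal sketch of this algorithm, using Python's heapq as the priority queue (tie-breaking between equal frequencies may give different, but equally short, codes than the slides):

    import heapq
    from itertools import count

    def build_huffman_codes(freqs):
        """Build Huffman codes from a {symbol: frequency} map."""
        tiebreak = count()  # keeps heap entries comparable when frequencies tie
        heap = [(f, next(tiebreak), sym) for sym, f in freqs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            f1, _, left = heapq.heappop(heap)    # the two least frequent nodes...
            f2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))  # ...get a parent
        codes = {}
        def assign(node, code):
            if isinstance(node, tuple):          # internal node: recurse
                assign(node[0], code + '0')      # left child = 0
                assign(node[1], code + '1')      # right child = 1
            else:
                codes[node] = code or '0'        # leaf (or single-symbol alphabet)
        assign(heap[0][2], '')
        return codes

    print(build_huffman_codes({'a': 5, 'b': 2, 'c': 1, 'd': 1, 'r': 2}))
    # e.g. {'a': '0', 'c': '100', 'd': '101', 'b': '110', 'r': '111'}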

Creating the Huffman tree 3/4
Start: a:5, b:2, c:1, d:1, r:2
Merge c:1 + d:1 into node 2 (queue: a:5, r:2, b:2, 2)
Merge r:2 + b:2 into node 4 (queue: a:5, 4, 2)
Merge 4 + 2 into node 6 (queue: a:5, 6)
Merge a:5 + 6 into the root node 11

Creating the Huffman tree 4/4
Label each edge with 0 (left child) or 1 (right child) and read off the codes:
Root 11: left child a:5, right child node 6
Node 6: left child node 2 (c:1, d:1), right child node 4 (r:2, b:2)
Resulting codes: a = 0, c = 100, d = 101, r = 110, b = 111

Implementing and testing 1/2
Huffman is "simple" in principle (once you get it), but can be challenging to get right: you need to fiddle with bits, think about byte order, etc., and it is difficult to debug since the output is bits. Around 290 lines of code.
Compression test:
Test dataset A: Macbeth by Shakespeare. HTML, 202 KB. Available from http://shakespeare.mit.edu/macbeth/full.html
Test dataset B: Bus video file. yuv, uncompressed video, 1.37 MB. Available from https://media.xiph.org/video/derf/

Implementing and testing 2/2
Test A: Input: 207318 bytes, output: 134909 bytes (including tree). Compressed size: 65% of original. Symbols: 85 ≈ 2^6.4 => 7 bits/symbol without Huffman. Achieved: 5.19387 bits/symbol. Optimal [Shannon]: 5.16525 bits/symbol.
Test B: Input: 1444608 bytes, output: 1245911 bytes (including tree). Compressed size: 86% of original. Symbols: 256 = 2^8 => 8 bits/symbol without Huffman. Achieved: 6.89453 bits/symbol. Optimal [Shannon]: 6.85687 bits/symbol.
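
The Shannon numbers above come from the byte histogram: H = -sum(p_i * log2(p_i)). A minimal sketch of the computation (the filename macbeth.html is a hypothetical local copy of test dataset A):

    import math
    from collections import Counter

    def shannon_entropy(data: bytes) -> float:
        """Shannon entropy in bits per symbol: H = -sum(p * log2(p))."""
        counts = Counter(data)
        n = len(data)
        return -sum(c / n * math.log2(c / n) for c in counts.values())

    with open("macbeth.html", "rb") as f:  # hypothetical local copy of dataset A
        data = f.read()
    print(f"{len(set(data))} symbols, entropy {shannon_entropy(data):.5f} bits/symbol")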

Uses of Huffman coding
Fax machines: combination of run-length encoding and Huffman
JPEG images: DCT, followed by quantization (information loss), and Huffman
MP3 files: information removal followed by Huffman coding
DEFLATE: an integral part of many tools and file formats; uses LZ77 and Huffman coding. Examples: zlib, PNG, gzip, SSH, HTTP, ...

Huffman speed 1/2
The Huffman tree can be generated in O(n log n) for n symbols using a priority queue. Typically, building the Huffman tree is negligible for long data sets; compression time is a function of data set size. The ubiquity of Huffman coding means that there are a lot of efficient implementations out there:
zlib: compression of text and data
libjpeg-turbo: JPEG encoder tailored for VNC
Intel IPP: Intel Integrated Performance Primitives
FFmpeg: video and audio
lodepng: self-contained PNG decoder
NVIDIA GPUs: hardware support (probably also AMD GPUs)
FPGAs: real-time streaming of Huffman

Huffman speed 2/2
A major problem with Huffman today is its serial nature: decoding of the stream can only be done bit by bit. Some JPEG formats use restart markers to enable parallel decoding (patented). DEFLATE can be decoded in parallel by splitting it into blocks. Today we have 4-12 threads in standard PCs, but plain Huffman decoding does not scale to more than one thread.

LZW

LZW [1]
A compression algorithm named after Abraham Lempel (1936-), Jacob Ziv (1931-), and Terry Welch (1939?-1988); a modification by Terry Welch of the earlier LZ77 and LZ78 algorithms. Patented in 1983 in the US; in 1984 in the UK, France, Germany, Italy, Japan, and Canada. The basic idea is to replace several symbols with a single code.
Abraham Lempel, Wikipedia [CC-BY-SA 3.0, user Staelin]; Jacob Ziv, Wikipedia [Public domain, user חישוביות]
[1] Welch, Terry (1984). "A Technique for High-Performance Data Compression". Computer 17 (6): 8-19.

Dictionaries for compression
The main idea of LZW is similar to logograms: each code refers to a word, part of a word, etc. Create a dictionary which translates symbols into codes and vice versa. The LZW algorithm creates the dictionary automatically based on the input data.
Hieroglyphs, Wikipedia [Public domain, user Vincnet]; The Story of Shi Shi Eating Lions, Wikipedia [CC-BY-SA 3.0]

LZW Example 1/5
We have data we want to compress, in this case "abracadabra". LZW dynamically* creates a dictionary of strings based on the data we want to compress. This dictionary is used to replace multiple symbols with a single dictionary entry. The dictionary is not stored to file.
* We'll cover how in a couple of slides
Final dictionary: 0 a, 1 b, 2 c, 3 d, 4 r, 5 ab, 6 br, 7 ra, 8 ac, 9 ca, 10 ad, 11 da, 12 abr

LZW Example 2/5
1. Initialize the dictionary with all single symbols
2. Set "w" equal to an empty string
3. Read the next letter into "k"
4. If the dictionary has the string "wk": set w equal to wk; go to 3
5. Else: add wk to the dictionary; write the code of w to the output stream; set w equal to k; go to 3
6. At the end of the input, write the code of w
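
A minimal sketch of this encoding loop (the dictionary is initialized from the symbols actually present in the input, giving a=0, b=1, c=2, d=3, r=4 for our example):

    def lzw_encode(data: str) -> list[int]:
        """Minimal LZW encoder over the symbols present in the input."""
        # 1. Initialize the dictionary with all single symbols.
        dictionary = {sym: i for i, sym in enumerate(sorted(set(data)))}
        w, output = "", []
        for k in data:                       # 3. read the next letter into k
            if w + k in dictionary:          # 4. known string: keep extending w
                w += k
            else:                            # 5. unknown: output code of w, learn wk
                output.append(dictionary[w])
                dictionary[w + k] = len(dictionary)
                w = k
        if w:
            output.append(dictionary[w])     # 6. flush the final string
        return output

    print(lzw_encode("abracadabra"))  # [0, 1, 4, 0, 2, 0, 3, 5, 7]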

LZW Example 3/5

w    k    wk    wk in dict?    Output code (dictionary entry added)
-    a    a     Yes            -
a    b    ab    No             0  (ab = 5)
b    r    br    No             1  (br = 6)
r    a    ra    No             4  (ra = 7)
a    c    ac    No             0  (ac = 8)
c    a    ca    No             2  (ca = 9)
a    d    ad    No             0  (ad = 10)
d    a    da    No             3  (da = 11)
a    b    ab    Yes            -
ab   r    abr   No             5  (abr = 12)
r    a    ra    Yes            -
ra   -    -     end of input   7

Output stream: 0 1 4 0 2 0 3 5 7

LZW Example 4/5
Decoding the message is done in the opposite order of encoding:
1. Decode the first code, store it in "w", and also write it to the output stream
2. Read the next code
3. If the dictionary has the next code: set "k" equal to the decoded code; write out k; set "wk" equal to w plus the first character of k; add wk to the dictionary; set w equal to k; go to 2

LZW Example 5/5

Input code   In dict?   w    k    wk added   Output
0            Yes        -    a    -          a
1            Yes        a    b    ab = 5     b
4            Yes        b    r    br = 6     r
0            Yes        r    a    ra = 7     a
2            Yes        a    c    ac = 8     c
0            Yes        c    a    ca = 9     a
3            Yes        a    d    ad = 10    d
5            Yes        d    ab   da = 11    ab
7            Yes        ab   ra   abr = 12   ra

Decoded output: abracadabra

LZW Dictionary 1/2
LZW uses 12-bit codes for the dictionary. All 256 single-byte values are added at initialization; the rest of the codes are used for combinations: 2^12 = 4096 => 3840 entries for combinations.
When the dictionary is full, compression becomes "static". It can also be reset: clear all entries, and re-initialize with the single-byte values. One can add a "reset" code which is used when the compression ratio drops.
An LZW variant uses variable bit-length codes: start with 9-bit codes; when the dictionary has 2^9 = 512 entries, continue with 10-bit codes; when the dictionary has 2^10 = 1024 entries, continue with 11-bit codes; and so on.
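
A small sketch of the code-width rule in the variable-width variant (the exact switch-over point varies between implementations; this is one common convention):

    def code_width(dictionary_size: int, start_bits: int = 9) -> int:
        """Bits per code for the current dictionary size in variable-width LZW."""
        bits = start_bits
        while dictionary_size > (1 << bits):  # dictionary outgrew the current width
            bits += 1
        return bits

    for size in (256, 512, 513, 1024, 1025, 4096):
        print(size, code_width(size))  # -> 9, 9, 10, 10, 11, 12 bits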

LZW Dictionary 2/2
Occasionally a code is not in the dictionary during decode. This happens when a code that was just added during encoding is used immediately afterwards.
Example: the encoder knows about "ab", and gets the string "ababa": "ab" is output and "aba" is added to the dictionary; then "aba" is output.
This is always the pattern, so we can deduce that any unknown code must represent the previous output string with its own first letter appended to the end.
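
A minimal decoder sketch that includes this special case; the second call demonstrates it on an "ababa"-style input, where the code for "aba" arrives before the decoder has added it:

    def lzw_decode(codes: list[int], alphabet: str) -> str:
        """Minimal LZW decoder, including the code-not-yet-in-dictionary case."""
        dictionary = {i: sym for i, sym in enumerate(sorted(alphabet))}
        w = dictionary[codes[0]]
        output = [w]
        for code in codes[1:]:
            if code in dictionary:
                k = dictionary[code]
            else:                     # code was added this very step by the encoder:
                k = w + w[0]          # it must be w plus the first letter of w
            output.append(k)
            dictionary[len(dictionary)] = w + k[0]  # learn the same entry as the encoder
            w = k
        return ''.join(output)

    print(lzw_decode([0, 1, 4, 0, 2, 0, 3, 5, 7], "abcdr"))  # abracadabra
    print(lzw_decode([0, 1, 2, 4], "ab"))  # abababa (code 4 hits the special case)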

Implementing and testing
LZW is "simple" in principle and in practice; the only complication is the need to fiddle with nibbles, since 12-bit codes do not align with byte boundaries. Around 190 lines of code.
Compression test:
Test dataset A: Macbeth by Shakespeare. HTML, 202 KB. Available from http://shakespeare.mit.edu/macbeth/full.html
Test dataset B: Bus video file. yuv, uncompressed video, 1.37 MB. Available from https://media.xiph.org/video/derf/

Implementing and testing
Test A: Input: 207318 bytes, output: 88707 bytes. Compressed size: 43% of original.
Test B: Input: 1444608 bytes, output: 1318365 bytes. Compressed size: 91% of original.

LZW Speed 1/2
LZW was designed for efficient hardware implementation: a fixed-size dictionary, trivial initialization, a finite state machine. It has the same problem as Huffman with respect to parallelism: it is inherently serial.
LZW used to power the standard Unix/Linux tool compress, but patents and more efficient algorithms limited its use. It is closely related to the LZ77 and LZ78 algorithms: LZ77 uses a sliding window and length-distance pairs, and is part of the much-used DEFLATE algorithm; LZ78 uses a dictionary like LZW.

LZW Speed 2/2
GIF files use LZW compression: they are limited to a small alphabet (max 256 colors) and often contain a lot of repeated patterns, which makes them highly suitable for LZW. LZW appears to have lost a lot of traction: 20 years of patents have taken their toll, and drove forward the creation of the free PNG format. GIF images are still actively used (e.g., senorgif.com).

Summary

Summary
Huffman is based on replacing fixed-width symbols with variable-length bit codes. It approaches the theoretical entropy given by Shannon, with a small overhead for storing the Huffman table itself. It works very well for data with a few highly used symbols, and poorly for data where all characters are used equally often. Very fast.
LZW is based on replacing multiple symbols with a single code. There is no overhead for storing the dictionary: it is created dynamically. It only starts "compressing" after the dictionary has a lot of combinations. It works very well with small alphabets (fewer string combinations), and poorly with random data (few repeated "words").

Further reading
Data compression is big bucks! There is a huge number of patents on compression algorithms, so check licensing requirements. The most efficient compression algorithms take knowledge of the underlying data structure into account, often combining lossless and lossy compression.
Open source implementations of LZW and Huffman: https://github.com/babrodtk/compression_demos (warning: not written for speed).

Thank you for your attention! André R. Brodtkorb Email: Andre.Brodtkorb@ifi.uio.no Homepage: http://babrodtk.at.ifi.uio.no/