CS/COE 1501 www.cs.pitt.edu/~lipschultz/cs1501/ Compression

What is compression? Represent the same data using less storage space. Can get more use out of a disk of a given size. Can get more use out of memory, e.g., free up memory by compressing inactive sections (faster than paging; built in to OS X Mavericks and later). Can reduce the amount of data transmitted: faster file transfers, cut power usage on mobile devices. Two main approaches to compression... 2

Lossy Compression: D → Compress → C → Expand → D′. Information is permanently lost in the compression process. Examples: MP3, H.264, JPEG. With audio/video files this typically isn't a huge problem, as human users might not be able to perceive the difference 3

Lossy examples: MP3 cuts out portions of audio that are considered beyond what most people are capable of hearing; JPEG (example images shown at 40K vs. 28K) 4

Lossless Compression: D → Compress → C → Expand → D. Input can be recovered from compressed data exactly. Examples: zip files, FLAC 5

Huffman Compression Works on arbitrary bit strings, but pretty easily explained using characters. Consider the ASCII character set: essentially blocks of codes. In general, to represent R different characters in a block, you need lg R bits of storage per block; consequently, n-bit storage blocks can represent 2^n characters. Each 8-bit code block represents one of 256 possible characters in ASCII. Easy to encode/decode 6

Considerations for compressing ASCII What if we used variable-length codewords instead of the constant 8? Could we store the same info in less space? Different characters are represented using codes of different bit lengths. If all characters in the alphabet have the same usage frequency, we can't beat block storage on a character-by-character basis. What about different usage frequencies between characters? In English, R, S, T, L, N, E are used much more than Q or X 7

Variable length encoding Decoding was easy for block codes: grab the next 8 bits in the bitstring. How can we decode a bitstring that is made up of variable-length codewords? BAD example of variable-length encoding: A=1, T=00, K=01, U=001, R=100, C=101, N=10101 8

Variable length encoding for lossless compression Codes must be prefix free No code can be a prefix of any other in the scheme Using this, we can achieve compression by: Using fewer bits to represent more common characters Using longer codes to represent less common characters 9

How can we create these prefix-free codes? Huffman encoding! 10

Generating Huffman codes Assume we have K characters that are used in the file to be compressed, and each has a weight (its frequency of use). Create a forest, F, of K single-node trees, one for each character, with the single node storing that char's weight. while |F| > 1: Select T1, T2 ∈ F that have the smallest weights in F; create a new tree node N whose weight is the sum of T1's and T2's weights; add T1 and T2 as children (subtrees) of N; remove T1 and T2 from F; add the new tree rooted by N to F. Build a tree for ABRACADABRA! 11
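A minimal Java sketch of the loop above, using java.util.PriorityQueue as the forest. The Node class and the buildTrie name are illustrative, not taken from the course code:

    import java.util.PriorityQueue;

    class Node implements Comparable<Node> {
        final char ch;      // character (meaningful only in leaves)
        final int weight;   // char's frequency, or sum of children's weights
        final Node left, right;

        Node(char ch, int weight, Node left, Node right) {
            this.ch = ch; this.weight = weight; this.left = left; this.right = right;
        }

        boolean isLeaf() { return left == null && right == null; }

        public int compareTo(Node other) { return this.weight - other.weight; }

        // Build the Huffman trie from an array of character frequencies
        static Node buildTrie(int[] freq) {
            PriorityQueue<Node> forest = new PriorityQueue<>();
            for (char c = 0; c < freq.length; c++)
                if (freq[c] > 0) forest.add(new Node(c, freq[c], null, null));

            while (forest.size() > 1) {
                Node t1 = forest.poll();                 // two smallest-weight trees
                Node t2 = forest.poll();
                forest.add(new Node('\0', t1.weight + t2.weight, t1, t2));
            }
            return forest.poll();                        // root of the one remaining tree
        }
    }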

Implementation concerns To encode/decode, we'll need to read in characters and output codes/read in codes and output characters Sounds like we'll need a symbol table! What implementation would be best? Same for encoding and decoding? Note that this means we need access to the trie to expand a compressed file! 12

Further implementation concerns Need to be able to efficiently select the lowest-weight trees to merge when constructing the trie; can accomplish this using a priority queue. Need to be able to read/write bitstrings! Unless we pick multiples of 8 bits for our codewords, we will need to read/write fractions of bytes for our codewords. We're not actually going to do I/O on fractions of bytes; we'll maintain a buffer of bytes and perform bit processing on this buffer. See BinaryStdIn.java and BinaryStdOut.java 13

Binary I/O buffer (diagram of the byte buffer and the bit count N) 14
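As a rough illustration of that buffering idea (a sketch, not the actual BinaryStdOut implementation), bits are packed into a one-byte buffer and a whole byte is written whenever 8 bits have accumulated:

    import java.io.IOException;
    import java.io.OutputStream;

    class BitWriter {
        private final OutputStream out;
        private int buffer;   // bits waiting to be written, packed into one byte
        private int n;        // number of bits currently in the buffer

        BitWriter(OutputStream out) { this.out = out; }

        // Append a single bit; write a full byte once 8 bits have accumulated
        void writeBit(boolean bit) throws IOException {
            buffer = (buffer << 1) | (bit ? 1 : 0);
            if (++n == 8) {
                out.write(buffer);
                buffer = 0;
                n = 0;
            }
        }

        // Pad the final partial byte with zero bits and flush it
        void close() throws IOException {
            while (n != 0) writeBit(false);
            out.flush();
        }
    }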

Representing tries as bitstrings 15

Binary I/O 16

Huffman pseudocode Encoding approach: Read input Compute frequencies Build trie/codeword table Write out trie as a bitstring to compressed file Write out character count of input Use table to write out the codeword for each input character Decoding approach: Read trie Read character count Use trie to decode bitstring of compressed file 17
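A minimal sketch of the decoding loop, assuming the trie root has already been rebuilt from the compressed file and the character count has been read; it uses the BinaryStdIn/BinaryStdOut classes named on the previous slide, and the Node type from the trie-building sketch above:

    import edu.princeton.cs.algs4.BinaryStdIn;
    import edu.princeton.cs.algs4.BinaryStdOut;

    // Expansion loop: one trie walk per output character
    static void expand(Node root, int count) {
        for (int i = 0; i < count; i++) {
            Node x = root;
            while (!x.isLeaf()) {                  // walk one codeword's worth of bits
                boolean bit = BinaryStdIn.readBoolean();
                x = bit ? x.right : x.left;        // 1 = right child, 0 = left child
            }
            BinaryStdOut.write(x.ch, 8);           // emit the decoded 8-bit character
        }
        BinaryStdOut.close();
    }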

How do we determine character frequencies? Option 1: Preprocess the file to be compressed. Upside: ensures that Huffman's algorithm will produce the best output for the given file. Downsides: Requires two passes over the input, one to analyze frequencies/build the trie/build the code lookup table, and another to compress the file. Trie must be stored with the compressed file, reducing the quality of the compression; this especially hurts small files. Generally, large files are more amenable to Huffman compression. Just because a file is large, however, does not mean that it will compress well! 18

How do we determine character frequencies? Option 2: Use a static trie Analyze multiple sample files, build a single tree that will be used for all compressions/expansions Saves on trie storage overhead But in general not a very good approach Different character frequency characteristics of different files means that a code set/trie that works well for one file could work very poorly for another Could even cause an increase in file size after compression! 19

How do we determine character frequencies? Option 3: Adaptive Huffman coding Single pass over the data to construct the codes and compress a file with no background knowledge of the source distribution Not going to really focus on adaptive Huffman in the class, just pointing out that it exists... 20

Ok, so how good is Huffman compression? ASCII requires 8*m bits to store m characters. For a file containing c different characters, given Huffman codes {h_0, h_1, h_2, ..., h_(c-1)} and frequencies {f_0, f_1, f_2, ..., f_(c-1)}, the compressed size is the sum from i = 0 to c-1 of |h_i| * f_i (codeword length times frequency). Total storage depends on the differences in frequencies: the bigger the differences, the better the potential for compression. Huffman is optimal for character-by-character prefix-free encodings. Proof in Propositions T and U of Section 5.5 of the text 21
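As a small hypothetical example (not from the slides): a 9-character file with frequencies A:5, B:2, C:1, D:1 could get Huffman codes A=0, B=10, C=110, D=111, for a total of 5*1 + 2*2 + 1*3 + 1*3 = 15 bits, versus 9*8 = 72 bits in 8-bit ASCII.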

That seems like a bit of a caveat... Where does Huffman fall short? What about repeated patterns of multiple characters? Consider a file containing: 1000 A's, 1000 B's, ..., 1000 of every ASCII character. Will this compress at all with Huffman encoding? Nope! But it seems like it should be compressible... 22

Run length encoding Could represent the previously mentioned string as: 1000A1000B1000C, etc. Assuming we use 10 bits to represent the number of repeats, and 8 bits to represent the character 4608 bits needed to store run length encoded file vs. 2048000 bits for input file Huge savings! Note that this incredible compression performance is based on a very specific scenario Run length encoding is not generally effective for most files, as they often lack long runs of repeated characters 23
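A sketch of a run-length encoder in that format (10-bit repeat count, 8-bit character), written against the BinaryStdIn/BinaryStdOut classes used elsewhere in the course; the class name is mine, and it assumes a non-empty input:

    import edu.princeton.cs.algs4.BinaryStdIn;
    import edu.princeton.cs.algs4.BinaryStdOut;

    // Run-length encode stdin as (10-bit run length, 8-bit character) pairs.
    // Runs longer than 1023 characters are split into multiple pairs.
    public class RunLength {
        public static void main(String[] args) {
            char prev = BinaryStdIn.readChar();   // assumes at least one input character
            int run = 1;
            while (!BinaryStdIn.isEmpty()) {
                char c = BinaryStdIn.readChar();
                if (c == prev && run < 1023) {
                    run++;
                } else {
                    BinaryStdOut.write(run, 10);  // 10-bit repeat count
                    BinaryStdOut.write(prev, 8);  // 8-bit character
                    prev = c;
                    run = 1;
                }
            }
            BinaryStdOut.write(run, 10);          // emit the final run
            BinaryStdOut.write(prev, 8);
            BinaryStdOut.close();
        }
    }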

What else can we do to compress files? 24

Patterns are compressible, need a general approach Huffman used variable-length codewords to represent fixed-length portions of the input. Let's try another approach that uses fixed-length codewords to represent variable-length portions of the input. Idea: the more characters that can be represented in a single codeword, the better the compression. Consider "the": 24 bits in ASCII. Representing "the" with a single 12-bit codeword cuts the used space in half. Similarly, representing longer strings with a 12-bit codeword would mean even better savings! 25

How do we know that "the" will be in our file? Need to avoid the same problems as the use of a static trie for Huffman encoding. So use an adaptive algorithm and build up our patterns and codewords as we go through the file 26

LZW compression Initialize codebook to all single characters (e.g., each character maps to its ASCII value). While !EOF: Match longest prefix in codebook; output codeword; take this longest prefix, add the next character in the file, and add the result to the dictionary with a new codeword 27
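A compact Java sketch of this loop with 12-bit codewords, using a plain HashMap as the codebook for clarity (a real implementation would match the longest prefix with a trie/TST rather than by repeated substring lookups); class and method names are illustrative:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class LZWCompress {
        static final int CODE_LIMIT = 4096;              // 12-bit codewords: 0..4095

        // Returns the list of codewords for the input string
        static List<Integer> compress(String input) {
            Map<String, Integer> codebook = new HashMap<>();
            for (int c = 0; c < 256; c++)                // single characters map to
                codebook.put("" + (char) c, c);          // their (extended) ASCII values
            int nextCode = 256;

            List<Integer> output = new ArrayList<>();
            int i = 0;
            while (i < input.length()) {
                // Find the longest prefix of the remaining input that is in the codebook
                String prefix = input.substring(i, i + 1);
                while (i + prefix.length() < input.length()
                       && codebook.containsKey(prefix + input.charAt(i + prefix.length())))
                    prefix = prefix + input.charAt(i + prefix.length());

                output.add(codebook.get(prefix));        // output the prefix's codeword
                // Add prefix + next character as a new codebook entry
                if (i + prefix.length() < input.length() && nextCode < CODE_LIMIT)
                    codebook.put(prefix + input.charAt(i + prefix.length()), nextCode++);
                i += prefix.length();
            }
            return output;
        }

        public static void main(String[] args) {
            System.out.println(compress("TOBEORNOTTOBEORTOBEORNOT"));
            // [84, 79, 66, 69, 79, 82, 78, 79, 84, 256, 258, 260, 265, 259, 261, 263]
        }
    }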

LZW compression example Compress, using 12 bit codewords: TOBEORNOTTOBEORTOBEORNOT
Cur   Output   Add
T     84       TO:256
O     79       OB:257
B     66       BE:258
E     69       EO:259
O     79       OR:260
R     82       RN:261
N     78       NO:262
O     79       OT:263
T     84       TT:264
TO    256      TOB:265
BE    258      BEO:266
OR    260      ORT:267
TOB   265      TOBE:268
EO    259      EOR:269
RN    261      RNO:270
OT    263      --

LZW expansion Initialize codebook to all single characters (e.g., each ASCII value maps to its character). While !EOF: Read next codeword from file; look up corresponding pattern in the codebook; output that pattern; add the previous pattern + the first character of the current pattern to the codebook. Note: Different from compression; no codebook addition after the first pattern output 29

LZW expansion example Won't be represented quite like this in the file... 84, 79, 66, 69, 79, 82, 78, 79, 84, 256, 258, 260, 265, 259, 261, 263
Cur   Output   Add
84    T        --
79    O        256:TO
66    B        257:OB
69    E        258:BE
79    O        259:EO
82    R        260:OR
78    N        261:RN
79    O        262:NO
84    T        263:OT
256   TO       264:TT
258   BE       265:TOB
260   OR       266:BEO
265   TOB      267:ORT
259   EO       268:TOBE
261   RN       269:EOR
263   OT       270:RNO

How does this work out? Both compression and expansion construct the same codebook! Compression stores character string → codeword; expansion stores codeword → character string. They contain the same pairs, added in the same order. Hence, the codebook doesn't need to be stored with the compressed file, saving space 31

Just one tiny little issue to sort out... Compression can sometimes get a step ahead of expansion: if, during compression, the (pattern, codeword) that was just added to the dictionary is immediately used in the next step, the decompression algorithm will not yet know the codeword. This is easily detected and dealt with, however 32

LZW corner case example Compress, using 12 bit codewords: AAAAAA
Cur   Output   Add
A     65       AA:256
AA    256      AAA:257
AAA   257      --
Expansion:
Cur   Output   Add
65    A        --
256   AA       AA:256
257   AAA      AAA:257
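A sketch of the expansion loop including this corner-case handling, paired with the hypothetical compression sketch above (again using a simple in-memory codebook rather than file I/O):

    import java.util.ArrayList;
    import java.util.List;

    public class LZWExpand {
        // Expand a list of codewords back into the original string,
        // handling the "codeword not yet in the codebook" corner case
        static String expand(List<Integer> codewords) {
            List<String> codebook = new ArrayList<>();
            for (int c = 0; c < 256; c++)
                codebook.add("" + (char) c);         // codewords 0..255 are single chars

            StringBuilder out = new StringBuilder();
            String prev = codebook.get(codewords.get(0));
            out.append(prev);                        // first pattern: no codebook addition

            for (int i = 1; i < codewords.size(); i++) {
                int code = codewords.get(i);
                String cur;
                if (code < codebook.size())
                    cur = codebook.get(code);
                else                                 // corner case: codeword not defined yet;
                    cur = prev + prev.charAt(0);     // it must be prev + first char of prev
                out.append(cur);
                if (codebook.size() < 4096)          // previous pattern + first char of current
                    codebook.add(prev + cur.charAt(0));
                prev = cur;
            }
            return out.toString();
        }

        public static void main(String[] args) {
            System.out.println(expand(List.of(65, 256, 257)));   // prints AAAAAA
        }
    }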

LZW implementation concerns: codebook How to represent/store during: Compression Expansion Considerations: What operations are needed? How many of these operations are going to be performed? Discuss 34

Further implementation issues: codeword size How long should codewords be? Use fewer bits: Gives better compression earlier on But, leaves fewer codewords available, which will hamper compression later on Use more bits: Delays actual compression until longer patterns are found due to large codeword size More codewords available means that greater compression gains can be made later on in the process 35

Variable width codewords This sounds eerily like variable-length codewords, exactly what we set out to avoid! Here, we're talking about a different technique. Example: Start out using 9-bit codewords. When codeword 512 is inserted into the codebook, switch to outputting/grabbing 10-bit codewords. When codeword 1024 is inserted into the codebook, switch to outputting/grabbing 11-bit codewords. Etc. 36
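An illustrative sketch of the width-switching rule (not the assignment's actual code): since the compressor and expander add codebook entries at the same points, applying the same rule on both sides keeps the codeword widths in sync with no extra signaling.

    class CodewordWidth {
        private int width = 9;          // start with 9-bit codewords
        private int nextCode = 256;     // next codeword to be assigned

        int currentWidth() { return width; }

        // Called each time a new pattern is added to the codebook
        void onCodebookAdd() {
            nextCode++;
            if (nextCode == (1 << width) && width < 16)
                width++;                // e.g., once code 512 exists, use 10-bit codewords
        }
    }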

Even further implementation issues: codebook size What happens when we run out of codewords? Only 2^n possible codewords for n-bit codes. Even using variable-width codewords, they can't grow arbitrarily large. Two primary options: Stop adding new codewords, use the codebook as it stands. Maintains long already-established patterns, but if the file changes, it will not be compressed as effectively. Throw out the codebook and start over from single characters. Allows new patterns to be compressed; until new patterns are built up, though, compression will be minimal 37

The showdown you've all been waiting for... HUFFMAN vs LZW In general, LZW will give better compression. Also better for compressing archived directories of files. Why? Very long patterns can be built up, leading to better compression, and different files don't hurt each other as they did in Huffman. Remember our thoughts on using static tries? 38

So lossless compression apps use LZW? Well, GIFs can use it, and PDFs. Most dedicated compression applications use other algorithms: DEFLATE (combination of LZ77 and Huffman), used by PKZIP and gzip. Burrows-Wheeler transforms, used by bzip2. LZMA, used by 7-zip. brotli, introduced by Google in Sept. 2015, based around a "combination of a modern variant of the LZ77 algorithm, Huffman coding[,] and 2nd order context modeling" 39

DEFLATE et al. achieve even better general compression? How much can they compress a file? Better question: How much can a file be compressed by any algorithm? No algorithm can compress every bitstream. Assume we have such an algorithm: we could use it to compress its own output! And we could keep compressing its output until our compressed file is 0 bits! Clearly this can't work. Proof in Proposition S of Section 5.5 of the text 40

Can we reason about how much a file can be compressed? Yes! Using Shannon Entropy 41

Information theory in a single slide... Founded by Claude Shannon in his paper "A Mathematical Theory of Communication". Entropy is a key measure in information theory (slightly different from thermodynamic entropy): a measure of the unpredictability of information content. By losslessly compressing data, we represent the same information in less space. Hence, 8 bits of uncompressed text has less entropy than 8 bits of compressed data 42

Entropy applied to language: Translating a language into binary, the entropy is the average number of bits required to store a letter of the language. (Entropy per character) * (length of message) = amount of information contained in that message. On average, a lossless compression scheme cannot compress a message to have more than 1 bit of information per bit of compressed message. Uncompressed, English has between 0.6 and 1.3 bits of entropy per character of the message 43
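As an illustration of the definition (my own example, not from the slides): for a source with symbol probabilities p_i, the Shannon entropy is H = -sum over i of p_i * log2(p_i). The sketch below estimates bits of entropy per character from observed character counts.

    class EntropyEstimate {
        // Estimate bits of entropy per character from observed character counts,
        // using H = -sum(p_i * log2(p_i)). Hypothetical helper, not course code.
        static double entropyPerChar(int[] counts) {
            long total = 0;
            for (int c : counts) total += c;

            double h = 0.0;
            for (int c : counts) {
                if (c == 0) continue;
                double p = (double) c / total;
                h -= p * (Math.log(p) / Math.log(2));   // log base 2
            }
            return h;   // lower bound on average bits/char under this simple character model
        }
    }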