CS15100 Lab 7: File compression
Fall 2006
November 14, 2006

Complete the first 3 sections (through the build-huffman-tree function) in lab, (optionally) with a partner. The rest you must do by yourself. Write both your name and your partner's name on the homework when you hand it in. If you are in the 9:30am MWF section, hand in your solution by emailing it to robby@cs.uchicago.edu. If you are in the TTh section, email your solution to bboven@gmail.com and mulmuley@cs.uchicago.edu. It must be in the appropriate mailbox before lab starts in week 8.

1 Introduction

At the lowest level, computers represent data as sequences of bits (0 or 1). The normal way to represent a message as a sequence of bits is to use a table that associates bit patterns with characters and then translate each letter of the message according to the table. The standard table that many computers use is called the ASCII table, which represents every character on the keyboard (and a few more besides) as a sequence of exactly 8 bits. Here is one portion of the ASCII table:

Character   ASCII encoding
I           01001001
M           01001101
P           01010000
S           01010011

In ASCII, the message MISSISSIPPI would be represented like this:

M        I        S        S        I        S        S        I        P        P        I
01001101 01001001 01010011 01010011 01001001 01010011 01010011 01001001 01010000 01010000 01001001

If you save the word MISSISSIPPI in DrScheme, that sequence of bits is how it will be written out in the saved file.

There are a number of advantages to representing messages with ASCII, but it is not particularly good for generating short encodings of particular messages. In situations where we really need messages to be short (maybe because we want to transmit a message quickly across a network, or save it on a disk that doesn't have much space left) we can often do dramatically better. The message MISSISSIPPI, for instance, doesn't use most of the letters of the alphabet at all, so an encoding scheme that didn't let us write those letters down at all would be fine. Furthermore, it uses I and S four times each, but P only twice and M only once: for that reason, it would be a good trade to use an encoding table that had short representations for I and S and longer representations for P and M. The following alternative encoding produces a much shorter encoding for the message MISSISSIPPI:

Character   Alternative encoding
I           11
S           0
M           100
P           101

While ASCII needs 88 bits, the alternative encoding needs just 21.

The goal of this lab is to implement an algorithm called Huffman coding that determines the best encoding table for a particular message, and then encodes or decodes messages according to that table. As a demonstration of the technique's practical application, you will use it to write a program that compresses and decompresses files.

For this lab, you will need to use the following teachpack:

http://www.cs.uchicago.edu/~jacobm/151-2006-fall/huffman-utils.ss

Huffman coding is named after its inventor, David Huffman (1925-1999). He invented it in 1951 as a final project for a class he was taking; his instructor had listed it as a possible paper topic without mentioning that it was a major unsolved problem at the time!

2 Gathering statistics

The first step of the algorithm is to determine the frequency of each letter in the input.

;; A statistics is a (listof frequency)
;; A frequency is:
;;   (make-frequency character number)
(define-struct frequency (token count))

Note. The frequency structure is provided by the teachpack. Do not define it yourself.

Characters are a built-in category of primitive values, each representing one letter (or numeral, or punctuation mark, et cetera). They can be written down directly with the syntax #\x (for the character corresponding to a lower-case x). Characters can be tested for equality using char=?. The main advantage of characters is that we can get them out of strings: for instance, given the string "MISSISSIPPI" we can use the built-in function string->list:

(string->list "MISSISSIPPI")

produces

(list #\M #\I #\S #\S #\I #\S #\S #\I #\P #\P #\I)

Write a function frequencies : (listof character) -> statistics, which takes a message represented as a list of characters and produces statistics containing the frequency with which each token appears in the message. For instance,

(frequencies (list #\M #\I #\S #\S #\I #\S #\S #\I #\P #\P #\I))

should be

(list (make-frequency #\M 1)
      (make-frequency #\I 4)
      (make-frequency #\S 4)
      (make-frequency #\P 2))
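One possible way to structure this function is sketched below. This is just one approach, not the required solution; the helpers count-token and remove-all are hypothetical names, not teachpack functions. Note that it produces the frequencies in order of first appearance, matching the example above.

;; count-token : character (listof character) -> number
;; (hypothetical helper) counts how many times tok occurs in chars
(define (count-token tok chars)
  (cond
    [(empty? chars) 0]
    [(char=? tok (first chars)) (+ 1 (count-token tok (rest chars)))]
    [else (count-token tok (rest chars))]))

;; remove-all : character (listof character) -> (listof character)
;; (hypothetical helper) removes every occurrence of tok from chars
(define (remove-all tok chars)
  (cond
    [(empty? chars) empty]
    [(char=? tok (first chars)) (remove-all tok (rest chars))]
    [else (cons (first chars) (remove-all tok (rest chars)))]))

;; frequencies : (listof character) -> statistics
;; one sketch: count the first character, then recur on the message
;; with all copies of that character removed
(define (frequencies chars)
  (cond
    [(empty? chars) empty]
    [else (cons (make-frequency (first chars)
                                (count-token (first chars) chars))
                (frequencies (remove-all (first chars) chars)))]))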

3 Building Huffman trees

;; A huffman-tree is either:
;; - (make-leaf character number)
;; - (make-branch huffman-tree huffman-tree (listof character) number)
(define-struct leaf (token count))
(define-struct branch (l r tokens count))

The key idea behind Huffman coding is the Huffman tree. Given a particular message, a Huffman tree for that message is a binary tree whose leaves are characters, one per distinct character in the message. Additionally, for every subtree, the total frequency of all the tokens on the left side is as nearly equal to the total frequency of all the tokens on the right side as possible.

Huffman's algorithm for building these trees is as follows. It takes as its input the statistics generated in the last section, for instance:

(list (make-frequency #\M 1)
      (make-frequency #\I 4)
      (make-frequency #\S 4)
      (make-frequency #\P 2))

It turns each of these frequencies into a trivial binary tree consisting of just the input character and its frequency, and sorts the trees by frequency (lowest to highest):

(list (make-leaf #\M 1)
      (make-leaf #\P 2)
      (make-leaf #\I 4)
      (make-leaf #\S 4))

From this point on the algorithm works on lists of trees sorted by frequency. It successively removes the first two trees from the list and combines them into a single branch whose character list is the combination of the two subtrees' character lists and whose frequency is the sum of the two subtrees' frequencies. It inserts this new branch into the list (making sure to maintain sorted order) and repeats the process until only one tree is left. That tree is the output. For instance, here are the successive stages the algorithm would go through on the example above, in code form (the original handout also depicts each stage as a tree diagram):

Stage 0:

(list (make-leaf #\M 1)
      (make-leaf #\P 2)
      (make-leaf #\I 4)
      (make-leaf #\S 4))

Stage 1:

(list (make-branch (make-leaf #\M 1)
                   (make-leaf #\P 2)
                   (list #\M #\P)
                   3)
      (make-leaf #\I 4)
      (make-leaf #\S 4))

Stage 2:

(list (make-leaf #\S 4)
      (make-branch (make-branch (make-leaf #\M 1)
                                (make-leaf #\P 2)
                                (list #\M #\P)
                                3)
                   (make-leaf #\I 4)
                   (list #\M #\P #\I)
                   7))

Stage 3:

(list (make-branch (make-leaf #\S 4)
                   (make-branch (make-branch (make-leaf #\M 1)
                                             (make-leaf #\P 2)
                                             (list #\M #\P)
                                             3)
                                (make-leaf #\I 4)
                                (list #\M #\P #\I)
                                7)
                   (list #\S #\M #\P #\I)
                   11))

Write the function build-huffman-tree : statistics -> huffman-tree, which builds the Huffman tree that corresponds to the given frequencies.
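Below is a sketch of one way to implement the algorithm just described, assuming the leaf and branch structures defined above; tree-count, tree-tokens, insert-tree, sort-trees, and combine are all hypothetical helper names. On the example statistics this sketch reproduces the stages shown above.

;; tree-count : huffman-tree -> number
;; the total frequency stored at the root of a tree
(define (tree-count t)
  (cond [(leaf? t) (leaf-count t)]
        [else (branch-count t)]))

;; tree-tokens : huffman-tree -> (listof character)
;; all the tokens covered by a tree
(define (tree-tokens t)
  (cond [(leaf? t) (list (leaf-token t))]
        [else (branch-tokens t)]))

;; insert-tree : huffman-tree (listof huffman-tree) -> (listof huffman-tree)
;; inserts t into a list of trees sorted by count, keeping it sorted
(define (insert-tree t trees)
  (cond [(empty? trees) (list t)]
        [(<= (tree-count t) (tree-count (first trees))) (cons t trees)]
        [else (cons (first trees) (insert-tree t (rest trees)))]))

;; sort-trees : (listof huffman-tree) -> (listof huffman-tree)
;; insertion sort, lowest count first
(define (sort-trees trees)
  (cond [(empty? trees) empty]
        [else (insert-tree (first trees) (sort-trees (rest trees)))]))

;; combine : non-empty (listof huffman-tree) -> huffman-tree
;; merges the two lowest-count trees until only one tree remains
(define (combine trees)
  (cond [(empty? (rest trees)) (first trees)]
        [else (combine
               (insert-tree
                (make-branch (first trees)
                             (second trees)
                             (append (tree-tokens (first trees))
                                     (tree-tokens (second trees)))
                             (+ (tree-count (first trees))
                                (tree-count (second trees))))
                (rest (rest trees))))]))

;; build-huffman-tree : statistics -> huffman-tree
(define (build-huffman-tree stats)
  (combine
   (sort-trees
    (map (lambda (f) (make-leaf (frequency-token f) (frequency-count f)))
         stats))))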

4 Encoding a message

The Huffman tree for a message is a representation of the optimal table for encoding that message: the code for each letter is just the path from the root of the tree to that letter, with 0 representing going down the left branch and 1 representing going down the right branch.

Write the function encode-message : (listof character) huffman-tree -> (listof bit), where a bit is either 0 or 1. For instance,

(define message (string->list "MISSISSIPPI"))
(define freqs (frequencies message))
(define tree (build-huffman-tree freqs))
(encode-message message tree)

should be

(list 1 0 0 1 1 0 0 1 1 0 0 1 1 1 0 1 1 0 1 1 1)

5 Decoding a message

To decode a message, one needs the encoded version of the message and the Huffman tree that was used to encode it. Write the function decode-message : (listof bit) huffman-tree -> (listof character), which decodes a message encoded with encode-message. For instance,

(list->string
 (decode-message (list 1 0 0 1 1 0 0 1 1 0 0 1 1 1 0 1 1 0 1 1 1)
                 tree))

should be

"MISSISSIPPI"
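Here are sketches of one possible implementation of both functions. They assume the structures from section 3 and reuse the hypothetical tree-tokens helper from the build-huffman-tree sketch; contains?, encode-token, and decode-from are likewise hypothetical helpers, not teachpack functions.

;; contains? : character (listof character) -> boolean
;; (hypothetical helper) does tok occur in toks?
(define (contains? tok toks)
  (cond [(empty? toks) false]
        [(char=? tok (first toks)) true]
        [else (contains? tok (rest toks))]))

;; encode-token : character huffman-tree -> (listof bit)
;; the path from the root of tree to tok: 0 = left, 1 = right
(define (encode-token tok tree)
  (cond [(leaf? tree) empty]
        [(contains? tok (tree-tokens (branch-l tree)))
         (cons 0 (encode-token tok (branch-l tree)))]
        [else (cons 1 (encode-token tok (branch-r tree)))]))

;; encode-message : (listof character) huffman-tree -> (listof bit)
;; encodes each character and appends the resulting paths
(define (encode-message msg tree)
  (cond [(empty? msg) empty]
        [else (append (encode-token (first msg) tree)
                      (encode-message (rest msg) tree))]))

;; decode-message : (listof bit) huffman-tree -> (listof character)
(define (decode-message bits tree)
  (decode-from bits tree tree))

;; decode-from : (listof bit) huffman-tree huffman-tree -> (listof character)
;; walks down subtree bit by bit; at a leaf, emits its token and
;; restarts from the root
(define (decode-from bits subtree root)
  (cond [(leaf? subtree)
         (cons (leaf-token subtree) (decode-from bits root root))]
        [(empty? bits) empty]
        [(= (first bits) 0)
         (decode-from (rest bits) (branch-l subtree) root)]
        [else (decode-from (rest bits) (branch-r subtree) root)]))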

6 An application: file compression

In the introduction we mentioned that computers store messages as sequences of bits. That is not quite the whole truth: the sequences must be exact multiples of 8 bits long, since computers arrange memory into 8-bit bytes. When using ASCII you never need to think about this, since every character in ASCII is represented as a whole byte, so you can't end up with a message that doesn't fill some exact number of bytes; but with the encodings that come from Huffman tables it is possible. The problem is this: when you're reading a compressed message off of a disk, you will always read it as a whole number of bytes, but somewhere between 0 and 7 of the last bits were not a part of the encoding of the original message.

The standard way to deal with this is to add a special end-of-message (eom) token to the end of every message when encoding it. With that token added, the encoding process can proceed almost exactly as normal: eom is counted just like a character when computing statistics, generating a Huffman tree, and encoding the message. The only difference is that the encoder must ensure that the length of its final encoding is a multiple of 8 bits by padding the ending (after the encoding of the eom token) with arbitrary bits. With this done, the decoder can take advantage of the fact that eom appears at the end of every message and stop decoding as soon as it decodes an end-of-message token, even if there are more bits available for decoding.

Change the definition of a frequency from section 2 as follows:

;; A frequency is:
;;   (make-frequency token number)
;; A token is either:
;; - a character
;; - eom

Then modify all parts of your program that need to change to make proper use of the eom token. Once you have done that, you are ready to write the final compression and decompression functions. To help with that, the huffman-utils.ss teachpack provides one new data definition and four functions:

;; A compressed-data is
;;   (make-compressed-data statistics (listof bit))
;; NOTE: the length of the list of bits must be a multiple of 8
(define-struct compressed-data (stats bits))

;; file->list : string -> (listof character)
;; produces a list of characters corresponding to the entire named file

;; write-compressed-data-to-file : compressed-data string -> boolean
;; writes the contents of the given compressed-data structure into a file.
;; Returns true on success, or false if something went wrong
;; (for instance the file couldn't be written)

;; read-compressed-data-from-file : string -> compressed-data
;; reads a compressed data file into a compressed-data structure.
;; Note: the length of the bits returned is always a multiple of 8

;; list->file : (listof character) string -> boolean
;; makes a file with the given string as its name and the given list
;; of characters as its contents. Returns true on success,
;; false if something went wrong.

Note. The compressed-data structure is provided by the teachpack. Do not define it yourself.

Use these helpers to define the following functions:

compress-file : string string -> boolean, which compresses the contents of the file named by the first string and places the compressed version in the file named by the second string.

uncompress-file : string string -> boolean, which expects the contents of the file named by the first string to be compressed data, uncompresses that data, and writes the result to the file named by the second string.

(The provided helpers do a small bit of magic for you: in addition to writing and reading your bit list, they write out the statistics at the beginning of the file and read them back in. Building this functionality yourself is not particularly difficult, but since it isn't particularly interesting we figured we'd save you the trouble.)
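To show how the pieces could fit together, here is a sketch of the two functions, not a definitive implementation. It assumes that frequencies, build-huffman-tree, encode-message, and decode-message have already been modified to handle the eom token (represented here, as one arbitrary choice, by the symbol 'eom), and that the teachpack functions behave as described above; pad-to-byte-boundary is a hypothetical helper.

;; pad-to-byte-boundary : (listof bit) -> (listof bit)
;; (hypothetical helper) pads bits with 0s until its length
;; is a multiple of 8
(define (pad-to-byte-boundary bits)
  (cond [(= (remainder (length bits) 8) 0) bits]
        [else (pad-to-byte-boundary (append bits (list 0)))]))

;; compress-file : string string -> boolean
(define (compress-file in-name out-name)
  (local [;; the eom token ('eom here is an assumed representation)
          ;; is appended before gathering statistics
          (define message (append (file->list in-name) (list 'eom)))
          (define stats (frequencies message))
          (define tree (build-huffman-tree stats))
          (define bits (encode-message message tree))]
    (write-compressed-data-to-file
     (make-compressed-data stats (pad-to-byte-boundary bits))
     out-name)))

;; uncompress-file : string string -> boolean
(define (uncompress-file in-name out-name)
  (local [(define data (read-compressed-data-from-file in-name))
          (define tree (build-huffman-tree (compressed-data-stats data)))]
    ;; an eom-aware decode-message stops at (and drops) the eom token,
    ;; so the padding bits after it are ignored
    (list->file (decode-message (compressed-data-bits data) tree)
                out-name)))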