CS231 Algorithms                                             Handout #31
Prof. Lyn Turbak                                        November 20, 2001
Wellesley College

Compression

The Big Picture

We want to be able to store and retrieve data, as well as communicate it with others.
In general, this requires encoding the data and decoding the encoded data:

    data --> [encode] --> storage medium / communications network --> [decode] --> data

For the purpose of this lecture, we observe the following constraints:

- We consider only digital storage and communication media, in which all information
  is expressed via a sequence of discrete values (in the simplest case, 0 and 1).
  This contrasts with analog media modeled by continuously varying waveforms.

- We consider only lossless storage/transmission, in which all information in the
  original data must be preserved; i.e., for all d, decode(encode(d)) = d. This
  contrasts with lossy approaches that may lose some information (common with images,
  such as the JPEG format, where the data is already an approximation of an analog
  waveform).

- We consider only noiseless storage/transmission, in which no errors are introduced
  between the encoding and decoding phases. In practice, error-correction strategies
  must be applied to handle errors introduced by real-world noise.

Uniform-Length Encoding of Textual Data

For textual data, it is common to encode each character as an 8-bit byte using a
uniform-length encoding known as ASCII. Each byte can be written as a decimal integer
in the range [0..255]. Below is a table showing ASCII values in the range [0..127]
and their associated characters:

      0:^@    1:^A    2:^B    3:^C    4:^D    5:^E    6:^F    7:^G
      8:^H    9:\t   10:\n   11:^K   12:^L   13:^M   14:^N   15:^O
     16:^P   17:^Q   18:^R   19:^S   20:^T   21:^U   22:^V   23:^W
     24:^X   25:^Y   26:^Z   27:^[   28:^\   29:^]   30:^^   31:^_
     32:     33:!    34:"    35:#    36:$    37:%    38:&    39:'
     40:(    41:)    42:*    43:+    44:,    45:-    46:.    47:/
     48:0    49:1    50:2    51:3    52:4    53:5    54:6    55:7
     56:8    57:9    58::    59:;    60:<    61:=    62:>    63:?
     64:@    65:A    66:B    67:C    68:D    69:E    70:F    71:G
     72:H    73:I    74:J    75:K    76:L    77:M    78:N    79:O
     80:P    81:Q    82:R    83:S    84:T    85:U    86:V    87:W
     88:X    89:Y    90:Z    91:[    92:\    93:]    94:^    95:_
     96:`    97:a    98:b    99:c   100:d   101:e   102:f   103:g
    104:h   105:i   106:j   107:k   108:l   109:m   110:n   111:o
    112:p   113:q   114:r   115:s   116:t   117:u   118:v   119:w
    120:x   121:y   122:z   123:{   124:|   125:}   126:~   127:^?

For a letter or symbol σ, the notation ^σ stands for the character specified by
pressing the Control key and the σ key at the same time. The notation \t stands for
the tab character, and \n for the newline character. Integers in the range [128..255]
correspond to other special characters.
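As a minimal sketch of uniform-length encoding (using Python for illustration): every
character costs exactly 8 bits, no matter how frequent it is.

    # Uniform-length (ASCII) encoding: one 8-bit byte per character.
    text = "miss mississippi"

    ascii_bytes = text.encode("ascii")       # one byte per character
    print(list(ascii_bytes))                 # decimal codes, e.g. 109 for 'm'
    print(len(ascii_bytes) * 8, "bits")      # 16 characters -> 128 bits

    decoded = ascii_bytes.decode("ascii")    # lossless: decode(encode(d)) = d
    assert decoded == text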

Variable-Length Encoding of Textual Data

Uniform-length encodings do not take advantage of the fact that different letters have
different frequencies. For instance, here is a table of letter frequencies in English
(per 1000 letters) [1]:

    E 130    I  74    D  44    F  28    Y  19    B   9    J   2
    T  93    O  74    H  35    P  27    G  16    X   5    Z   1
    N  78    A  73    L  35    U  27    W  16    K   3
    R  77    S  63    C  30    M  25    V  13    Q   3

This suggests a variable-length encoding in which frequent letters, like E and T, have
shorter representations than infrequent letters. Morse code is an example of a
variable-length encoding:

    A  .-      K  -.-     U  ..-       0  -----
    B  -...    L  .-..    V  ...-      1  .----
    C  -.-.    M  --      W  .--       2  ..---
    D  -..     N  -.      X  -..-      3  ...--
    E  .       O  ---     Y  -.--      4  ....-
    F  ..-.    P  .--.    Z  --..      5  .....
    G  --.     Q  --.-                 6  -....
    H  ....    R  .-.     Full Stop  .-.-.-    7  --...
    I  ..      S  ...     Comma      --..--    8  ---..
    J  .---    T  -       Query      ..--..    9  ----.

[1] as reported by http://library.thinkquest.org/28005/flashed/thelab/cryptograms/frequency.shtml
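A small sketch of the effect: with a variable-length code such as Morse, a word made of
frequent letters costs far fewer code symbols than a word of the same length made of
rare letters (only a handful of codewords are shown here).

    # Frequent letters get short codewords; rare letters get long ones.
    MORSE = {"E": ".", "T": "-", "N": "-.", "Q": "--.-", "U": "..-", "I": "..", "Z": "--.."}

    def morse_length(word):
        return sum(len(MORSE[c]) for c in word)

    print(morse_length("TENT"))   # 5 symbols: frequent letters, short codewords
    print(morse_length("QUIZ"))   # 13 symbols: rare letters, long codewords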

Information Theory

As implied by variable-length encoding, some strings contain more information than
others. This is the key idea underlying information theory, developed by Claude
Shannon in the late 1940s.

The information content of a symbol s is defined as

    I(s) = -lg(Pr[s]) bits

where Pr[s] is the predicted probability of the symbol. For example:

- I(H), where H is getting heads with a fair coin, is -lg(1/2) = lg(2) = 1 bit.
- I(H), where H is getting heads with a coin that has Pr(H) = 1/16, is
  -lg(1/16) = lg(16) = 4 bits.
- I(T), where T is getting tails with a coin that has Pr(T) = 15/16, is
  -lg(15/16) ≈ 0.093 bits.
- I(E), where E is the letter e in English-language text with Pr(E) = 0.13, is
  -lg(0.13) ≈ 2.94 bits.
- I(Z), where Z is the letter z in English-language text with Pr(Z) = 0.001, is
  -lg(0.001) ≈ 9.97 bits.

The information content of a symbol can depend on the context. Although U has
probability 0.027 in general, it has > 0.99 probability following a Q. Although E has
probability 0.13 in general, it has close to 0 probability after GH, while T has over
0.60 probability in this context.

The entropy of an alphabet A is the average information content per symbol over the
whole alphabet:

    H(A) = Σ_{s ∈ A} Pr[s] · I(s) = -Σ_{s ∈ A} Pr[s] · lg(Pr[s])

Information content and entropy can depend on the particular text. We will consider
the following two examples:

1. "miss mississippi"

2. The doodley poem (from Kurt Vonnegut's Cat's Cradle):

       we do, doodley do, doodley do, doodley do,
       what we must, muddily must, muddily must, muddily must,
       muddily do, muddily do, muddily do, muddily do,
       until we bust, bodily bust, bodily bust, bodily bust.
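A minimal sketch of these definitions: compute the empirical entropy of a text from its
symbol frequencies. For "miss mississippi" it comes out to about 2.05 bits per symbol.

    # Empirical entropy: H = sum over symbols of Pr[s] * I(s), with I(s) = -lg(Pr[s]).
    from collections import Counter
    from math import log2

    def entropy(text):
        counts = Counter(text)
        n = len(text)
        h = 0.0
        for sym, c in counts.items():
            p = c / n                      # empirical Pr[s]
            h += p * -log2(p)              # Pr[s] * I(s)
        return h

    text = "miss mississippi"
    print(entropy(text))                   # about 2.05 bits per symbol
    print(entropy(text) * len(text))       # about 33 bits for the whole string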

Huffman Coding

Huffman coding is a technique for choosing variable-length codes for symbols based on
their relative probabilities. It was invented by David Huffman in 1952.

Main Encoding Steps:

1. Determine the relative frequency of the symbols in the given text. Symbols may be
   characters, words, or other strings of characters.

2. Put all symbols into a priority queue weighted by their relative frequency.
   Dequeue the two lowest-frequency elements and combine them into a binary trie
   whose frequency is the sum of those of the two subtrees. Do this until there is a
   single trie.

3. Make a table mapping each symbol to the bit-string path that reaches it from the
   root of the trie.

4. Encode the text using the table.

Decoding Step: Use the bit strings to find each character of the text in the encoding
trie.

Example: Make the encoding trie for "miss mississippi".

Note: The trie itself must be encoded and transmitted in addition to the encoded text!
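A minimal sketch of the encoding steps above, assuming single-character symbols and
using a heap as the priority queue.

    import heapq
    from collections import Counter

    def huffman_codes(text):
        freq = Counter(text)                         # step 1: relative frequencies
        # Heap entries: (frequency, tie-breaker, trie). A leaf is a symbol;
        # an internal node is a pair (left, right).
        heap = [(f, i, sym) for i, (sym, f) in enumerate(freq.items())]
        heapq.heapify(heap)
        count = len(heap)
        while len(heap) > 1:                         # step 2: combine two lowest
            f1, _, t1 = heapq.heappop(heap)
            f2, _, t2 = heapq.heappop(heap)
            heapq.heappush(heap, (f1 + f2, count, (t1, t2)))
            count += 1
        _, _, trie = heap[0]
        codes = {}
        def walk(node, path):                        # step 3: paths from the root
            if isinstance(node, tuple):
                walk(node[0], path + "0")
                walk(node[1], path + "1")
            else:
                codes[node] = path or "0"            # single-symbol edge case
        walk(trie, "")
        return codes

    codes = huffman_codes("miss mississippi")
    encoded = "".join(codes[c] for c in "miss mississippi")   # step 4
    print(codes, len(encoded), "bits vs", 16 * 8, "bits of ASCII")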

Dictionary Methods

Key Idea: Replace substrings in the text by codewords, i.e., indices into a dictionary.

Simple version:
- An 8-bit char with high bit 0 stands for itself.
- An 8-bit char with high bit 1 is an index into a dictionary of 128 common
  multi-character entries (e.g., "st", "tr", "and", "the").

More complex version: Interleave codewords and character strings. E.g.,
("m", 17, 17, "ippi"), where 17 is a codeword for "iss".

Which dictionary?

- Static model: use the same dictionary, regardless of the text being coded.
  Advantage: the dictionary can be hardwired into the encoder/decoder.
  Disadvantage: doesn't model special features of the given text.

- Semi-static model: create the dictionary in a first pass over the text.
  Advantage: the dictionary is specialized to the text.
  Disadvantages: (1) requires an extra pass and (2) the encoder must send the
  dictionary to the decoder along with the compressed text.

- Adaptive model: use the already-processed text as a dictionary that evolves over
  time.
  Advantages: (1) the dictionary is specialized to the text; (2) coding can be done in
  a single pass; and (3) the decoder can reconstruct the dictionary from the
  compressed text, so it need not be sent separately.
  Disadvantage: more complex than the other methods.
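A minimal sketch of the "simple version" above, using a tiny made-up static dictionary
(a real one would hold up to 128 common entries): output bytes 0-127 stand for
themselves, while byte 128+k means entry k of the dictionary.

    DICT = ["iss", "the", "and", "in"]        # hypothetical common entries

    def encode(text):
        out = []
        i = 0
        while i < len(text):
            for k, entry in enumerate(DICT):
                if text.startswith(entry, i):         # dictionary entry matches here
                    out.append(128 + k)
                    i += len(entry)
                    break
            else:
                out.append(ord(text[i]))              # plain ASCII, high bit 0
                i += 1
        return bytes(out)

    def decode(data):
        return "".join(DICT[b - 128] if b >= 128 else chr(b) for b in data)

    enc = encode("miss mississippi")
    print(len(enc), "bytes")                  # 10 bytes vs. 16 bytes of plain ASCII
    assert decode(enc) == "miss mississippi"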

LZ78: An Adaptive Dictionary Method

LZ78 is an adaptive dictionary compression method invented by Jacob Ziv and Abraham
Lempel in 1978. The better-known LZW compression algorithm, used in the Unix compress
utility and in GIF images, is a variant of LZ78.

Idea: Encode the text via a sequence of pairs of the form (codeword, char), where
codeword is an index into a growing dictionary of processed text and char is the next
character after the string denoted by codeword.

Mississippi Example: Here is the LZ78 encoding of "i miss mississippi":

    (codeword, char) pair    dictionary entry
    (0, i)                    1: "i"
    (0, ' ')                  2: " "
    (0, m)                    3: "m"
    (1, s)                    4: "is"
    (0, s)                    5: "s"
    (2, m)                    6: " m"
    (4, s)                    7: "iss"
    (7, i)                    8: "issi"
    (0, p)                    9: "p"
    (9, i)                   10: "pi"

In the encoding phase, the inverse dictionary can be represented efficiently as a
trie: a multiway-branching tree with nodes bearing values (here, codewords) and edges
labeled by keys (here, characters). Below is the trie for the above example:

    0
    +- i   -> 1
    |         +- s -> 4
    |                 +- s -> 7
    |                         +- i -> 8
    +- ' ' -> 2
    |         +- m -> 6
    +- m   -> 3
    +- s   -> 5
    +- p   -> 9
              +- i -> 10

Doodley Example: Using LZ78, the doodley poem can be encoded with 80 pairs = 160 bytes
vs. the 200 uncompressed bytes:

    [(0, w), (0, e), (0, ), (0, d), (0, o), (0, ,), (3, d), (5, o), (4, l), (2, y),
     (7, o), (6, ), (4, o), (5, d), (0, l), (10, ), (13, ,), (11, o), (9, e), (0, y),
     (11, ,), (0, ), (1, h), (0, a), (0, t), (3, w), (2, ), (0, m), (0, u), (0, s),
     (25, ,), (3, m), (29, d), (4, i), (15, y), (32, u), (30, t), (12, m), (33, d), (0, i),
     (35, ), (28, u), (37, ,), (36, d), (34, l), (20, ), (42, s), (31, ), (42, d), (45, y),
     (21, ), (49, d), (40, l), (46, d), (5, ,), (44, d), (53, y), (51, m), (39, i), (41, d),
     (55, ), (29, n), (25, i), (15, ), (1, e), (3, b), (29, s), (31, ), (0, b), (14, i),
     (41, b), (67, t), (12, b), (70, l), (46, b), (72, ,), (66, o), (50, ), (69, u), (37, .)]
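A minimal sketch of the LZ78 encoding loop described above; on "i miss mississippi" it
produces exactly the pairs in the table.

    def lz78_encode(text):
        dictionary = {"": 0}                   # phrase -> codeword; 0 is the empty string
        pairs = []
        phrase = ""
        for ch in text:
            if phrase + ch in dictionary:
                phrase += ch                   # keep extending the match
            else:
                pairs.append((dictionary[phrase], ch))
                dictionary[phrase + ch] = len(dictionary)   # new dictionary entry
                phrase = ""
        if phrase:                             # flush a trailing match, if any
            pairs.append((dictionary[phrase[:-1]], phrase[-1]))
        return pairs

    print(lz78_encode("i miss mississippi"))
    # [(0, 'i'), (0, ' '), (0, 'm'), (1, 's'), (0, 's'), (2, 'm'), (4, 's'),
    #  (7, 'i'), (0, 'p'), (9, 'i')]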

LZ77: Another Adaptive Dictionary Method

LZ77 is another adaptive dictionary compression method, invented by Ziv and Lempel in
1977. The gzip utility is based on a variant of LZ77.

Idea: Encode the text via triples of the form (back, len, char), where:
- back is the number of characters back from the current pointer to the start of the
  dictionary entry;
- len is the length of the entry string;
- char is the next character in the text after the entry string.

Mississippi Example: Here is the LZ77 encoding of "i miss mississippi":

    [(0,0,i), (0,0, ), (0,0,m), (0,0,i), (0,0,s), (1,1, ), (5,4,i), (3,3,p), (1,1,i)]

Doodley Example: Using LZ77, the doodley poem can be encoded with 35 triples = 105
bytes vs. the 200 uncompressed bytes [2]:

    [(0,0,w), (0,0,e), (0,0, ), (0,0,d), (0,0,o), (0,0,,), (4,3,o), (3,1,l), (0,0,e),
     (0,0,y), (12,28, ), (43,1,h), (0,0,a), (0,0,t), (9,1,w), (13,1, ), (0,0,m),
     (0,0,u), (0,0,s), (8,1,,), (6,3,d), (21,1,i), (27,1,y), (14,34, ), (14,8,d),
     (80,3,m), (12,34, ), (11,1,n), (53,1,i), (11,1, ), (105,3,b), (77,5,b),
     (130,2,i), (26,3,b), (13,29,.)]

[2] In practice, the compression factor is better than this because the indices
themselves are compressed, say via Huffman coding.
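Decoding is the simpler direction: each (back, len, char) triple copies len characters
starting back positions behind the current end of the output, then appends char. A
minimal sketch, run on the mississippi triples above:

    def lz77_decode(triples):
        out = []
        for back, length, ch in triples:
            start = len(out) - back
            for i in range(length):          # one char at a time, so overlapping copies work
                out.append(out[start + i])
            out.append(ch)
        return "".join(out)

    triples = [(0, 0, 'i'), (0, 0, ' '), (0, 0, 'm'), (0, 0, 'i'), (0, 0, 's'),
               (1, 1, ' '), (5, 4, 'i'), (3, 3, 'p'), (1, 1, 'i')]
    print(lz77_decode(triples))              # "i miss mississippi"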

Lossless Image Compression

An image is a two-dimensional array of pixels.
- For black-and-white images, pixels can be considered 0 or 1.
- For grayscale images, pixels may have gradations between black and white.
- For color images, each pixel has red, green, and blue components, typically with 8
  bits of information per color.

Uniform Encoding: A simple bitmap encoding consists of the width and height of the
image followed by a sequence of pixel values (often in row-major order).

Some Image Compression Techniques:
- Use short codewords to index colors in a dictionary. GIF uses a dictionary of 256
  colors.
- Use text-based techniques to compress sequences of pixel values. GIF uses standard
  LZW on pixel bytes.
- Especially in black-and-white images, there can be long runs of a single pixel
  value. Use run-length encoding to shorten these (a small sketch appears at the end
  of this handout).
- For some images, especially those with noticeable patterns or those composed of
  geometric shapes (lines, polygons, etc.), it makes sense to encode the image as an
  image-generating program. The decoder draws the image by executing the program.
  This is the big idea behind the PostScript language.

Other Issues
- Tradeoffs between encoding/decoding time and compression factor.
- Most compression techniques do not support random access into the compressed text.
- What's the most effective way to compress a million numbers generated by a random
  number generator?
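To illustrate the run-length idea mentioned above, here is a minimal sketch that
encodes a row of black-and-white pixels as (value, run length) pairs instead of raw
pixels.

    def rle_encode(pixels):
        runs = []
        for p in pixels:
            if runs and runs[-1][0] == p:
                runs[-1][1] += 1                 # extend the current run
            else:
                runs.append([p, 1])              # start a new run
        return runs

    def rle_decode(runs):
        return [p for p, n in runs for _ in range(n)]

    row = [0]*12 + [1]*3 + [0]*9                 # 24 pixels, mostly white (0)
    encoded = rle_encode(row)
    print(encoded)                               # [[0, 12], [1, 3], [0, 9]]
    assert rle_decode(encoded) == row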