Data Compression 신찬수


Data compression Reducing the size of the representation without affecting the information itself. Lossless compression vs. lossy compression. A text, image, or movie file passes through an encoder to produce the compressed file; a decoder restores it.

Ex: Run-length encoding (RLE), a lossless compression.
    addddddcbbbbef -> 1a6d1c4b1e1f : 14 characters (bytes) -> 12 characters (bytes), ratio ~ 0.86
Each run (a maximal stretch of the same character) is coded as a pair (n, c): n is the number of occurrences of the character c in the run.
    abcdefgh -> 1a1b1c1d1e1f1g1h : 8 bytes -> 16 bytes (ratio = 2)
    bbddaacc -> 2b2d2a2c : 8 bytes -> 8 bytes (ratio = 1)
RLE is good only if every run has length >= 2.
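A minimal Python sketch of this pair-based RLE (the function name is ours):

```python
def rle_encode(s):
    """Code each maximal run of one character as a (count, character) pair."""
    out = []
    i = 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1                      # extend the run
        out.append(str(j - i) + s[i])   # the pair (n, c)
        i = j
    return "".join(out)

print(rle_encode("addddddcbbbbef"))  # -> 1a6d1c4b1e1f
```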

RLE So every run should have length >= 2. But what if the data itself contains digits? 11111111111544444. Code a run as a triple (marker, n, c); the marker should be chosen among characters used infrequently. (#, 11, 1) (#, 1, 5) (#, 5, 4). Since each triple costs three bytes, it is applicable only to runs of length >= 4.
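A sketch of the marker-based variant, with a decoder to check the round trip. We assume '#' never occurs in the data, store the run length as a single byte (chr/ord), and copy runs shorter than 4 literally, as the slide suggests:

```python
MARKER = "#"  # assumed absent from the input data

def rle3_encode(s):
    out = []
    i = 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1
        n = j - i
        if n >= 4:                            # a triple costs 3 bytes
            out.append(MARKER + chr(n) + s[i])
        else:
            out.append(s[i] * n)              # copy short runs as-is
        i = j
    return "".join(out)

def rle3_decode(s):
    out = []
    i = 0
    while i < len(s):
        if s[i] == MARKER:                    # triple: marker, length, char
            out.append(s[i + 2] * ord(s[i + 1]))
            i += 3
        else:
            out.append(s[i])
            i += 1
    return "".join(out)

data = "11111111111544444"
enc = rle3_encode(data)
print(len(data), len(enc))       # 17 bytes down to 7
print(rle3_decode(enc) == data)  # True
```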

ASCII codes

Examples ab22ccd99++33ffgii?**! ABABABAB AAAABBBB. Drawback: performance depends heavily on whether runs actually occur.

Applications Image compression. Black/white images dominated by one color: fax images, scanned book pages, etc. Gray (8-bit) image: e.g., a 10000-byte image may become 5713 bytes, 10100 bytes, or 200 bytes depending on its content. Color (RGB) image: encode RGB together, or encode the three channels separately (better).

Applications How to scan? Image comparison

Ziv-Lempel codes A universal coding scheme: it does not rely on symbol frequencies known in advance, but builds that knowledge during compression. Huge number of variants: LZ77, LZR, LZSS, LZB, LZH, LZ78, LZC, LZFG, LZW.

LZW The compression algorithm of the compress command in UNIX systems. As it takes the input characters one by one, it outputs codes and builds a string table. Given the sequence of codes, the receiver decodes them by rebuilding the string table; note that the receiver does not need the whole table built in the compression stage to be transmitted. Fast and simple compression with a good ratio, typically 50%-60%.

Compression
Note: the last character of each code's string in the table = the first character of the next code's string.

    put all single characters into the table
    s = the first character from input
    while any input is left:
        read character c
        if (s + c) is in the table:
            s = s + c
        else:
            output the code index of s
            put the string (s + c) into the table
            s = c
    output the code index of s
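A minimal Python sketch of this loop. Names are ours; codes start at 0 here, while the slides' table numbers strings from 1, and for brevity the table is seeded only with the characters of the input rather than a full 256-entry alphabet:

```python
def lzw_compress(text):
    """Output codes while growing the string table; assumes non-empty input."""
    table = {c: i for i, c in enumerate(sorted(set(text)))}  # single chars
    next_code = len(table)
    s = text[0]
    out = []
    for c in text[1:]:
        if s + c in table:
            s = s + c                  # keep extending the current match
        else:
            out.append(table[s])       # emit the code index of s
            table[s + c] = next_code   # learn the new string s + c
            next_code += 1
            s = c
    out.append(table[s])
    return out

print(lzw_compress("aabababaaa"))  # -> [0, 0, 1, 3, 5, 2]
```

For the example input a a b a b a b a a a, the learned strings aa, ab, ba, aba, abaa match the table shown on the later slide (shifted by one because codes start at 0 here).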

Example: compress the input a a b a b a b a a a.

Decompression: Problem? (the cScSc case)

    put all single characters into the table
    read old_code and output its string
    while codes are still left:
        read code new_code
        output string(new_code)
        c = the first character of string(new_code)
        put string(old_code) + c into the table
        old_code = new_code

Example: decompress the code sequence for a a b a b a b a a a.

Decompression: correct version

    put all single characters into the table
    read old_code and output its string
    while codes are still left:
        read code new_code
        if new_code is not in the table:
            output string(old_code) + first(old_code)
            put string(old_code) + first(old_code) into the table
        else:
            put string(old_code) + first(new_code) into the table
            output string(new_code)
        old_code = new_code

(first(x) denotes the first character of string(x).)
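The same logic as a Python sketch (the alphabet parameter and names are ours), including the special case where new_code is not yet in the table:

```python
def lzw_decompress(codes, alphabet):
    """Rebuild the string table while decoding a list of LZW codes."""
    table = {i: c for i, c in enumerate(alphabet)}
    next_code = len(table)
    old = codes[0]
    out = [table[old]]
    for new in codes[1:]:
        if new not in table:                       # the cScSc special case
            entry = table[old] + table[old][0]
        else:
            entry = table[new]
        out.append(entry)
        table[next_code] = table[old] + entry[0]   # learn old + first(new)
        next_code += 1
        old = new
    return "".join(out)

print(lzw_decompress([0, 0, 1, 3, 5, 2], "ab"))  # -> aabababaaa
```

Code 5 (aba) exercises the special case: the compressor emits it in the same step that creates it, so the decoder sees it before the table contains it.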

Table The strings in the table can become long. Big problem! Use the reduced form!

    string   code
    a        1
    b        2
    aa       3
    ab       4
    ba       5
    aba      6
    abaa     7

Table

    string   reduced   code
    a        a         1
    b        b         2
    aa       1a        3
    ab       1b        4
    ba       2a        5
    aba      4a        6
    abaa     6a        7

What is the string for code 7? Each string can be represented in two bytes: a previous code plus one character.

Reference Explanation of LZW compression including C code: http://dogma.net/markn/articles/lzw/lzw.htm

Conditions for code assignment
1. One-to-one condition: each code corresponds to exactly one character.
2. Code-length condition
3. Prefix condition
4. Optimality condition

Conditions for code assignment [Code-length condition] The code length of a character A should not exceed the code length of a less probable character B: Prob(A) >= Prob(B) => length(A) <= length(B). Three symbols A, B, C with probabilities 0.5, 0.25, 0.25. A = 12, B = 2, C = 1 violates the code-length condition. A = 1, B = 2, C = 12 satisfies the code-length condition, but we cannot distinguish AB from C.

Conditions for code assignment A = 1, B = 22, C = 12: for the string 1222 we need lookahead to determine a unique parsing. [Prefix condition] No code should be a prefix of another code; then no lookahead is needed. A = 11, B = 12, C = 21 satisfies both the code-length condition and the prefix condition: no ambiguity.

Conditions for code assignment [Optimality condition] The average code length should be as close as possible to the optimal average length. L_avg = sum over i of Prob(A_i) * L(A_i), where the optimal L(A_i) = -log2(Prob(A_i)). Three symbols A, B, C with probabilities 0.5, 0.25, 0.25: L_avg = 0.5 * -log2(0.5) + 0.25 * -log2(0.25) + 0.25 * -log2(0.25) = 0.5 * 1 + 0.25 * 2 + 0.25 * 2 = 1.5. This L_avg is the best possible average length, as established by Claude E. Shannon.
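This optimal average length is the Shannon entropy; a one-liner checks the figure above:

```python
import math

def entropy(probs):
    """Optimal average code length: sum of -p * log2(p), in bits per symbol."""
    return sum(-p * math.log2(p) for p in probs)

print(entropy([0.5, 0.25, 0.25]))  # -> 1.5
```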

Huffman coding Construct (near-)optimal binary codes for symbols.

    A      B      C      D      E
    0.09   0.12   0.19   0.21   0.39

Huffman coding algorithm

    HuffmanCode(P)
        let P be a collection storing the probabilities
        sort the characters in non-decreasing order of probability
        while two or more probabilities are left in P:
            delete the two minimum probabilities p1, p2 from P
            add the new probability (p1 + p2) into P
        generate codes from the resulting tree:
            assign 0 to each left child and 1 to each right child
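A compact Python sketch using a min-heap, as the implementation slide suggests. The counter only breaks ties so heap entries stay comparable; names are ours:

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Build a Huffman tree bottom-up; 0 = left child, 1 = right child."""
    tiebreak = count()
    heap = [(p, next(tiebreak), sym) for sym, p in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)    # two smallest probabilities
        p2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tiebreak), (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):          # internal node
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                # leaf symbol
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes

probs = {"A": 0.09, "B": 0.12, "C": 0.19, "D": 0.21, "E": 0.39}
codes = huffman_codes(probs)
avg = sum(p * len(codes[s]) for s, p in probs.items())
print(avg)  # ~ 2.21 for the slide's example
```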

Notes There can be two or more Huffman codes with the same average code length. A = 0.09, B = 0.12, C = 0.19, D = 0.21, E = 0.39. A = 100, B = 101, C = 00, D = 01, E = 11: L_huf = 2.21. A = 010, B = 011, C = 10, D = 11, E = 00: L_huf = 2.21. L_avg ~ 2.14, so L_huf is very close to L_avg (only about 3% off).

Practice P = 0.1, Q = 0.1, R = 0.1, S = 0.2, T = 0.5 What are all the Huffman codes?

Compression? A = 100, B = 101, C = 00, D = 01, E = 11. Sending ABAAD in ASCII codes: 40 bits. In Huffman codes: 10010110010001 (14 bits). The receiver of 10010110010001 must know the conversion table between characters and codes. Options: 1. Exchange the Huffman tree before sending ABAAD. 2. Send the Huffman tree together with ABAAD. 3. Build the Huffman tree adaptively while transmitting ABAAD.

Implementation Use the heap! Construct a min-heap from the probabilities; repeatedly delete the two minima and insert their sum until the final node contains 1.0. To assign codes, keep track of the heap operations and trace them back. Ex. A = 0.09, B = 0.12, C = 0.19, D = 0.21, E = 0.39.

Improvements X = 0.1, Y = 0.1, Z = 0.8. L_huf = 2 * 0.1 + 2 * 0.1 + 1 * 0.8 = 1.2, while L_avg ~ 0.922: a 23% gap! Reduce the gap by coding every pair of characters instead of single characters: XX, XY, XZ, YX, YY, YZ, ZX, ZY, ZZ. Then L_huf = 1.92 per pair and L_avg = 1.844, a gap of only 3.96%!
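A quick check of these figures. It uses the fact that the average Huffman code length equals the total merge cost: each merge of p1 and p2 adds p1 + p2, and the sum over all merges equals sum of p_i * depth_i. Names are ours:

```python
import heapq
from itertools import count, product

def huffman_avg_length(probs):
    """Average code length via the total merge cost of the Huffman tree."""
    tiebreak = count()  # keeps heap entries comparable on ties
    heap = [(p, next(tiebreak)) for p in probs]
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        p1, _ = heapq.heappop(heap)
        p2, _ = heapq.heappop(heap)
        total += p1 + p2                 # every symbol below gains one bit
        heapq.heappush(heap, (p1 + p2, next(tiebreak)))
    return total

single = [0.1, 0.1, 0.8]
pairs = [p * q for p, q in product(single, repeat=2)]
print(huffman_avg_length(single))  # ~ 1.2 bits per symbol
print(huffman_avg_length(pairs))   # ~ 1.92 bits per pair, i.e. 0.96 per symbol
```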

Experiments

    Coding method                     English text   PL/I   Image
    Huffman                           40%            60%    50%
    Huffman + 100 freq. used groups   49%            73%    52%
    Huffman + 512 freq. used groups   55%            71%    62%