Entropy Coding

Entropy Coding
} The different probabilities with which individual symbols appear are exploited
  - to shorten the average code length by assigning shorter codes to more probable symbols => Morse, Huffman, Arithmetic Code
  - to simplify encoding/decoding by assigning simpler codes to more probable symbols => e.g. Braille
} Entropy coding schemes = lossless compression schemes

Entropy Coding
} Entropy coding procedures rely on statistically independent information events to produce optimal results (maximum theoretical compression)
} Remark: a prefix code (or prefix-free code) is a code in which no codeword is a prefix of another codeword. This enables unique decodability with variable code lengths without any separator symbol
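
To illustrate the remark, here is a minimal decoding sketch (not part of the original slides), assuming the hypothetical prefix code a=0, b=10, c=110, d=111: because no codeword is a prefix of another, the bit stream can be decoded greedily without separators.

```python
# Minimal sketch: decoding a prefix-free code without separators.
# The code table below is a hypothetical example, not from the slides.
CODE = {"a": "0", "b": "10", "c": "110", "d": "111"}
DECODE = {v: k for k, v in CODE.items()}

def decode(bits: str) -> str:
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in DECODE:          # a codeword is complete -> emit its symbol
            out.append(DECODE[buf])
            buf = ""
    return "".join(out)

encoded = "".join(CODE[s] for s in "badcab")   # "100111110010"
print(decode(encoded))                          # -> badcab
```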

The Braille

Run-length Coding
} repetitive sequence suppression: run-length coding
} compresses runs of identical successive symbols
} Example:
  - "x" marks the beginning of a coded sequence
  - a special coding is required to encode "x" itself as a symbol; here "xx" is used
  - zero suppression is a special form of run-length coding
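
A minimal run-length coding sketch along the lines described above (not from the slides); the marker "x" and the escape "xx" follow the slide's convention, while the run-length threshold of 4 and the single-digit count are assumptions made for illustration.

```python
# Minimal run-length coding sketch. "x" marks a coded run, "xx" escapes a
# literal "x". Threshold 4 and single-digit counts are assumptions.
def rle_encode(data: str) -> str:
    out, i = [], 0
    while i < len(data):
        ch = data[i]
        run = 1
        while i + run < len(data) and data[i + run] == ch and run < 9:
            run += 1
        if ch == "x":
            out.append("xx" * run)          # literal "x" is always escaped here
        elif run >= 4:
            out.append(f"x{run}{ch}")       # coded run: marker, length, symbol
        else:
            out.append(ch * run)            # short runs stay uncompressed
        i += run
    return "".join(out)

print(rle_encode("AAAAAABCCCCCx"))  # -> "x6ABx5Cxx"
```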

Morse Code
Example: letter statistics compared to Morse code

Letter   Probability (German)   Probability (English)   Morse Code
e        16.65%                 12.41%                   .
n        10.36%                  6.41%                   _.
i         8.14%                  6.46%                   ..
t         5.53%                  8.90%                   _
a         5.15%                  8.09%                   ._
o         2.25%                  8.13%                   ___
x         0.03%                  0.20%                   _.._
y         0.03%                  2.14%                   _.__

Morse Code
} The idea of using fewer bits to represent symbols that occur more often is used in Morse code
  - codewords for letters occurring more frequently are shorter than for letters occurring less frequently
  - example: the codeword for "E" is ".", while the codeword for "Z" is "__.."
} variable code length & non-prefix code => a separator symbol is needed for unique decodability
  - Example: ".._..." => "eeteee", "ini", ...

Huffman Coding
} Idea: use short bit patterns for frequently used symbols
  1. sort all symbols by probability
  2. take the two symbols with the lowest probability
  3. remove them from the list and insert a "parent symbol" into the sorted list whose probability is the sum of both symbols
  4. if at least two elements are left, continue with step 2
  5. assign 0 and 1 to the branches of the resulting tree
} Example: 1380 bits are used for "regular coding" (460 symbols x 3 bits)

Symbol   Count   regular
E        135     000
A        120     001
D        110     010
B         50     011
C         45     100

Huffman Coding

Symbol   Count
E        135
A        120
D        110
P1        95

Huffman Coding

Symbol   Count
P2       205
E        135
A        120

Huffman Coding

Symbol   Count
P3       255
P2       205

Huffman Coding

Symbol   Count   Huffman   p_i
E        135     00        0.294
A        120     01        0.261
D        110     10        0.239
B         50     110       0.109
C         45     111       0.098

=> 1015 bits are used for Huffman coding
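
A compact construction sketch for the example above (not from the slides), using a binary heap to repeatedly merge the two least probable nodes; it reproduces the codeword lengths and the 1015-bit total, although the concrete 0/1 assignment may differ from the slide because ties can be broken either way.

```python
import heapq
from itertools import count

# Minimal Huffman construction sketch for the example above.
def huffman(freqs: dict) -> dict:
    tick = count()  # tie-breaker so heap tuples never compare the dicts
    heap = [(f, next(tick), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)   # the two least probable nodes ...
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(tick), merged))  # ... become a parent
    return heap[0][2]

counts = {"E": 135, "A": 120, "D": 110, "B": 50, "C": 45}
codes = huffman(counts)
total = sum(counts[s] * len(codes[s]) for s in counts)
print(codes)   # codeword lengths 2,2,2,3,3 (0/1 assignment may differ from the slide)
print(total)   # 1015
```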

Huffman Coding

Symbol   Count   Huffman   p_i
E        135     00        0.2935
A        120     01        0.2609
D        110     10        0.2391
B         50     110       0.1087
C         45     111       0.0978

} Entropy: H = -sum_i p_i * ld(p_i) ≈ 2.1944
} Average code length: l = 2*p_1 + 2*p_2 + 2*p_3 + 3*p_4 + 3*p_5 ≈ 2.2065
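
A quick numerical check of the two figures above (not part of the slides):

```python
import math

# Verify entropy and average code length for the example probabilities.
p = [0.2935, 0.2609, 0.2391, 0.1087, 0.0978]   # E, A, D, B, C
lengths = [2, 2, 2, 3, 3]                       # Huffman codeword lengths

H = -sum(pi * math.log2(pi) for pi in p)
l_avg = sum(pi * li for pi, li in zip(p, lengths))
print(round(H, 4), round(l_avg, 4))   # -> 2.1944 2.2065
```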

Adaptive Huffman Coding
} Problem
  - the previous algorithm requires statistical knowledge (the probability distribution), which is often not available (e.g. live audio, video)
  - even when statistical knowledge is available, transmitting it may cost too much overhead
} Solution
  - use adaptive algorithms
  - here: Adaptive Huffman Coding
  - keeps estimating the required probability distribution from previously encoded/decoded symbols via an update procedure

Adaptive Huffman Coding
} A look at the implementation:
} encoder and decoder use exactly the same initialization and update_model routines
} update_model does two things:
  - increment the count
  - update the resulting Huffman tree
    - during the updates the Huffman tree maintains its sibling property, i.e. the nodes are arranged in order of increasing weight
    - when swapping is necessary, the farthest node with weight W is swapped with the node whose weight has just been increased to W+1 (if the node with weight W has a sub-tree beneath it, the sub-tree goes with it)

Adaptive Huffman Coding
} Example: n-th Huffman tree
} resulting code for A: 000

Adaptive Huffman Coding
} (n+2)-th Huffman tree
  - A was incremented twice
  - nodes A and D swapped
} new code for A: 011

Adaptive Huffman Coding
} A was incremented twice
} the 4th node (A) and the 5th node have to swap

Adaptive Huffman Coding
} resulting tree after the 1st swap
} the 8th node (E) and the 7th node have to swap

Adaptive Huffman Coding
} (n+4)-th Huffman tree
} new code for A: 10

Huffman Coding Evaluation
} Advantages
  - the algorithm is simple
  - Huffman coding is optimal according to information theory when the probability of each input symbol is a negative power of two
  - if p_max is the largest probability in the probability model, the upper bound on the average code length is H + p_max + ld[(2 ld e)/e] = H + p_max + 0.086, which is tighter than the general bound H + 1

Huffman Coding Evaluation
} Limitations
  - Huffman coding is designed to code single characters only; therefore at least one bit is required per character, e.g. a word of 8 characters requires a code of at least 8 bits
  - not suitable for strings of different length, or for characters whose probabilities change with context
  - examples for both interpretations of that problem:
    - Huffman coding does not support separate probabilities for "c", "h", "s" and the combination "sch"

Huffman Coding Evaluation
} Limitations (cont.)
  - in typical German text p("e") > p("h") holds, but if the preceding characters are "sc", then p("h") > p("e") holds
  - (unsatisfying) solutions for both scenarios:
    - Solution: use a special coding where "sch" is a single character. But this requires advance knowledge of frequent character combinations.
    - Solution: use different Huffman codes depending on the context. But this leads to large code tables which must be appended to the coded data.

Arithmetic Coding
} arithmetic coding generates codes that are optimal according to information theory
  - supports an average code word length smaller than 1 bit per symbol
  - supports strings of different length
} the probability model is separated from the encoding operation
  - probability model: strings are encoded, not single characters
    - a "codeword" is assigned to each possible string
    - the codeword consists of a half-open subinterval of the interval [0,1]
    - subintervals for different strings do not intersect

Arithmetic Coding
} probability model is separated from the encoding operation (cont.)
  - Encoding: a given subinterval can be uniquely identified by any value within that interval
    - use the value with the shortest unambiguous binary representation to identify the subinterval
} In practice, the subinterval is refined incrementally using the probabilities of the individual events
  - if strings of different length are used, extra information is needed to detect the end of a string

Arithmetic Coding
} Example I
  - four characters a, b, c, d with probabilities p(a)=0.4, p(b)=0.2, p(c)=0.1, p(d)=0.3
  - codeword assignment for the string "abcd":
    => codeword (subinterval): [0.21360, 0.21600)

Arithmetic Coding
} Example I (cont.)
  - Encoding: any value within the subinterval [0.21360, 0.21600) is a well-defined identifier of the string "abcd"
  - "0.00110111" is the shortest binary representation identifying the subinterval
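
A small sketch (not from the slides) that reproduces the interval refinement for "abcd" under the Example I probabilities and then searches for the shortest binary fraction inside the resulting subinterval; the function names are made up for illustration.

```python
# Interval refinement for arithmetic coding, using the Example I model.
PROB = {"a": 0.4, "b": 0.2, "c": 0.1, "d": 0.3}

def encode_interval(s: str):
    # Cumulative probabilities define each symbol's slice of [0, 1).
    cum, acc = {}, 0.0
    for sym, p in PROB.items():
        cum[sym] = (acc, acc + p)
        acc += p
    low, high = 0.0, 1.0
    for sym in s:                      # refine the interval symbol by symbol
        width = high - low
        lo_f, hi_f = cum[sym]
        low, high = low + width * lo_f, low + width * hi_f
    return low, high

def shortest_binary(low: float, high: float) -> str:
    # Grow a binary fraction bit by bit until it falls inside [low, high).
    bits, value, step = "", 0.0, 0.5
    while not (low <= value < high):
        bit = 1 if value + step < high else 0   # stay strictly below high
        if bit:
            value += step
        bits += str(bit)
        step /= 2
    return "0." + bits

low, high = encode_interval("abcd")
print(low, high)                   # -> approx. 0.2136, 0.2160
print(shortest_binary(low, high))  # -> 0.00110111  (= 0.21484375)
```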

Arithmetic Coding
} Example I (cont.)
  - Decoding: "0.00110111", i.e. 0.21484375
    - the receiver can rebuild the subinterval development
    - but: the code value is also a member of every further subinterval of [0.21360, 0.21600) => an extra symbol is necessary to mark the end of the string

Arithmetic Coding
} Example II
  - Probabilities:

Arithmetic Coding
} Example II (cont.)
} Termination rules:
  - "a" and "!" are termination symbols
  - strings starting with "b" have a length of at most 4
} Decode: 10010001010
  - the bit stream is decoded bit by bit by successively narrowing the interval (stepwise decoding table omitted)

Arithmetic Coding
} Remarks
  - the more characters a string has, the better the result with respect to average code word length
  - implementing arithmetic coding is more complex than Huffman coding
  - alphabets with k characters and strings of length m break the idea of Huffman coding for increasing m (the codebook grows to size k^m); arithmetic coding can be adapted

Arithmetic Coding
} Remarks
  - if the alphabet is small and the probabilities are highly unbalanced, arithmetic coding is superior to Huffman coding
  - efficient arithmetic coding implementations rely exclusively on integer arithmetic
  - United States Patent 4,122,440; Method and means for arithmetic string coding; International Business Machines Corporation; October 24, 1978
  - a precise comparison between arithmetic coding and Huffman coding can be found in [Sayood]

Lempel-Ziv (LZ77)
} Algorithm for compression of character sequences
  - assumption: sequences of characters are often repeated
  - idea: replace a character sequence by a reference to an earlier occurrence
  1. Define a search buffer = (a portion of) the recently encoded data, and a look-ahead buffer = the data not yet encoded

Lempel-Ziv (LZ77)
} Algorithm for compression of character sequences (cont.)
  2. Find the longest match between the first characters of the look-ahead buffer and an arbitrary character sequence in the search buffer
  3. Produce the output <offset, length, next_character>
     - offset + length = reference to the earlier occurrence
     - next_character = the first character following the match in the look-ahead buffer

Lempel-Ziv (LZ77)
} Example:

Pos    1  2  3  4  5  6  7  8  9  10
Char   A  A  B  C  B  B  A  B  C  A

Step   Pos   Match   Char   Output
1      1     -       A      <0,0,A>
2      2     A       B      <1,1,B>
3      4     -       C      <0,0,C>
4      5     B       B      <2,1,B>
5      7     ABC     A      <5,3,A>
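
A minimal LZ77 encoder sketch (not from the slides) with an unbounded search buffer and greedy longest-match selection; on the example string it reproduces the triples in the table above.

```python
# Minimal LZ77 encoder sketch: unbounded search buffer, greedy longest match.
def lz77_encode(data: str):
    triples, pos = [], 0
    while pos < len(data):
        best_off, best_len = 0, 0
        # Try every starting point in the search buffer (data before pos).
        for start in range(pos):
            length = 0
            while (pos + length < len(data) - 1      # keep one char as next_character
                   and data[start + length] == data[pos + length]):
                length += 1
            if length > best_len:
                best_off, best_len = pos - start, length
        next_char = data[pos + best_len]
        triples.append((best_off, best_len, next_char))
        pos += best_len + 1
    return triples

print(lz77_encode("AABCBBABCA"))
# -> [(0, 0, 'A'), (1, 1, 'B'), (0, 0, 'C'), (2, 1, 'B'), (5, 3, 'A')]
```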

Lempel-Ziv (LZ77)
} Remarks
  - the search and the look-ahead buffer have a limited size
    - the number of bits needed to encode pointer and length information depends on the buffer sizes
    - worst case: character sequences are longer than one of the buffers
    - typical size is 4-64 KB
  - sometimes representations other than the triple are used
    - next_char only if necessary (i.e. no match found)
    - enabling dynamic change of the buffer sizes

Lempel-Ziv (LZ77)
} Remarks (cont.)
  - LZ77 or variants are often used before entropy coding
    - LZ77 + Huffman coding is used by "gzip" and for "PNG"
    - "PKZip", "Zip", "LHarc" and "ARJ" use LZ77-based algorithms
  - LZW patent information
    - the Unisys U.S. LZW Patent No. 4,558,302 expired on June 20, 2003; the counterpart patents in the UK, France, Germany, and Italy expired on June 18, 2004; the Japanese counterpart patents expired on June 20, 2004; and the counterpart Canadian patent expired on July 7, 2004
    - Unisys Corporation also holds patents on a number of improvements on the inventions claimed in the expired patents

Lempel-Ziv (LZ78)
} LZ78:
  - drops the search buffer and keeps an explicit dictionary (built identically at encoder and decoder)
  - produces output <index, next_character>
  - example: adaptive encoding of "abababa" results in:

Step   Output   Dictionary
1      <0,a>    "a"
2      <0,b>    "b"
3      <1,b>    "ab"
4      <3,a>    "aba"
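
A minimal LZ78 encoder sketch (not from the slides) that reproduces the <index, next_character> pairs of the example above.

```python
# Minimal LZ78 encoder sketch: the dictionary starts empty and grows with
# every emitted <index, next_character> pair.
def lz78_encode(data: str):
    dictionary = {}            # phrase -> index (1-based; 0 = empty phrase)
    output, phrase = [], ""
    for ch in data:
        if phrase + ch in dictionary:
            phrase += ch                       # keep extending the known phrase
        else:
            output.append((dictionary.get(phrase, 0), ch))
            dictionary[phrase + ch] = len(dictionary) + 1
            phrase = ""
    if phrase:                                 # leftover phrase at end of input
        output.append((dictionary.get(phrase, 0), ""))
    return output

print(lz78_encode("abababa"))
# -> [(0, 'a'), (0, 'b'), (1, 'b'), (3, 'a')]
```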

Lempel-Ziv-Welch (LZW)
} LZW:
  - produces only the output <index>
  - the dictionary has to be initialized with all letters of the source alphabet
  - new patterns are added to the dictionary
  - used by Unix "compress", "GIF", "V.42bis", "TIFF"
} in practice: limit the size of the dictionary
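
A minimal LZW encoder sketch (not from the slides), with the dictionary initialized from the source alphabet as described above; the two-letter alphabet and the input string are made-up illustration values.

```python
# Minimal LZW encoder sketch: dictionary pre-filled with the alphabet,
# output is a sequence of dictionary indices only.
def lzw_encode(data: str, alphabet: str):
    dictionary = {ch: i for i, ch in enumerate(alphabet)}
    output, phrase = [], ""
    for ch in data:                            # assumes every symbol is in the alphabet
        if phrase + ch in dictionary:
            phrase += ch                       # extend the current phrase
        else:
            output.append(dictionary[phrase])  # emit index of longest known phrase
            dictionary[phrase + ch] = len(dictionary)   # learn the new pattern
            phrase = ch
    if phrase:
        output.append(dictionary[phrase])
    return output

# Hypothetical example: alphabet {a, b}, input "abababa".
print(lzw_encode("abababa", "ab"))   # -> [0, 1, 2, 4]
```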