
An Efficient Decoding Technique for Huffman Codes

Rezaul Alam Chowdhury and M. Kaykobad
Department of Computer Science and Engineering
Bangladesh University of Engineering and Technology
Dhaka-1000, Bangladesh, email: shaikat,audwit@bdonline.com

Irwin King
Department of Computer Science and Engineering
The Chinese University of Hong Kong, email: king@cse.cuhk.edu.hk

Abstract

We present a new data structure for Huffman coding in which, in addition to sending the symbols in order of their appearance in the Huffman tree, one needs to send the codes of all circular leaf nodes (nodes with two adjacent external nodes), the number of which is always bounded above by half the number of symbols. We decode the text using the memory-efficient data structure proposed by Chen et al. (IPL 69 (1999) 119-122).

1. Introduction

Since the discovery of the Huffman encoding scheme [4] in 1952, Huffman codes have been widely used for the efficient storage of data, images and video [2]. Huffman coding has been the subject of numerous investigations over the past 50 years. Schack [7] considered the distribution of the length of a typical Huffman codeword, while Katona and Nemetz [5] studied the connection between the self-information of a source letter and its codeword length in a Huffman code.

The traditional Huffman scheme collects the frequencies of occurrence of the symbols to be coded in a first pass and constructs a tree from which the symbols receive their codes. In a second pass the symbols are coded and sent to the receiver; however, along with the codes for the symbols, the Huffman tree itself must also be sent. There is a single-pass version of Huffman coding, called dynamic Huffman coding [8], in which symbols are coded according to the frequencies of the symbols sent so far, and the tree is updated by both sender and receiver using the same algorithm. In this paper we consider efficient decoding of the so-called static Huffman code.

Hashemian [3] presented an algorithm to speed up the search for a symbol in a Huffman tree and to reduce the memory size, proposing a tree-clustering algorithm to avoid the sparsity of Huffman trees; finding the optimal solution of this clustering problem is still open. Later, Chung [2] gave a data structure requiring a memory size of only 2n - 3 to represent the Huffman tree, where n is the number of symbols. Chen et al. [1] improved the data structure further, reducing the memory requirement to ⌈3n/2⌉ + ⌈(n/2) log n⌉ + 1.

In our scheme, in addition to sending information on the symbols, we send information on all circular leaf nodes (nodes with two adjacent external nodes), which uniquely determines a Huffman tree. Our memory requirement is better than that of any existing data structure, since the number of circular leaf nodes is bounded above by ⌈n/2⌉; so in the worst case our memory requirement is ⌈3n/2⌉. In fact, the number of circular leaf nodes of a Huffman tree is one more than the number of nodes with two internal son nodes. Such a saving in the representation of Huffman trees is very important in the context of repeated Huffman coding [6], where the ultimate compression ratio depends upon the efficiency of the representation of the Huffman tree. In our case the decoding is based upon Chen's data structure for storing the Huffman tree. While Chen et al. [1] addressed the problem of efficiently decoding a Huffman code, we address the problem of decoding a stream of bits without knowing the lengths of the bit patterns that correspond to the symbols.

For clarity of presentation we first describe the idea of Chen et al. [1] and their data structure. Let T be a Huffman tree containing n symbols. The symbols correspond to the leaves of T and are labeled from left to right as s_0, s_1, …, s_{n-1}. The root is said to be at level 0, and the level of any other node is one more than the level of its father. The largest level is the height h of the Huffman tree. The weight of a symbol is defined to be 2^(h-l), where h is the height of the tree and l is the level of the node corresponding to the symbol. Let w_i be the weight of symbol s_i, i = 0, …, n-1, and let f_i denote the cumulative weight up to w_i. Then f_0 = w_0 and f_i = f_{i-1} + w_i for i = 1, 2, …, n-1.

Fig. 1. An example of a Huffman tree (internal nodes D and E, circular leaf nodes A, B and C, and symbols s_0, …, s_9 at the external nodes)

Table 1. The values of w_i and f_i

  i     0   1   2   3   4   5   6   7   8   9
  w_i   4   4   1   1   2   4   4   2   2   8
  f_i   4   8   9  10  12  16  20  22  24  32
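As a quick illustration (ours, not part of the paper), the following Python sketch computes the two rows of Table 1 from the symbol levels read off Fig. 1, using the definitions w_i = 2^(h-l) and f_i = f_{i-1} + w_i:

# A minimal sketch, assuming the levels of s_0, ..., s_9 are read off Fig. 1.
levels = [3, 3, 5, 5, 4, 3, 3, 4, 4, 2]  # level l of each symbol s_i
h = max(levels)                          # height of the tree (h = 5)

w = [2 ** (h - l) for l in levels]       # weights: w_i = 2^(h - l)
f = []                                   # cumulative weights: f_i = f_{i-1} + w_i
total = 0
for wi in w:
    total += wi
    f.append(total)

print(w)  # [4, 4, 1, 1, 2, 4, 4, 2, 2, 8]        -- row w_i of Table 1
print(f)  # [4, 8, 9, 10, 12, 16, 20, 22, 24, 32] -- row f_i of Table 1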

While Chen et al. [1] give an algorithm that decodes a single codeword, outputting the correct symbol or reporting that the given codeword corresponds to no symbol, we address the problem of decoding the bit stream sent by the sender according to the Huffman tree that is available to the receiver in the data structure described above.

The data structure above requires an overhead of sending n symbols and the n integers corresponding to them. This overhead can be reduced further as follows. If T is a Huffman tree of height h, then T′, the truncated Huffman tree, is obtained by removing all leaves (square nodes) and has height h - 1. T′ uniquely determines the Huffman tree T, since the symbols correspond to the left and right sons of the circular leaf nodes and to the remaining sons of the single-son circular nodes. The Huffman codes representing the circular leaf nodes of T′ in turn uniquely determine T′. Hence if there are m circular leaf nodes, the m codewords corresponding to them, together with the n symbols in order of appearance in the tree, suffice to represent the Huffman tree.

Consider the example in Figure 1, and let the circular leaf nodes be denoted by A, B and C in order. The corresponding Huffman codes are 00, 0100 and 101. Each of these codewords can be represented uniquely by the integer whose binary representation is the codeword with a 1 prefixed: the integers for A, B and C are respectively 4 (for 100), 20 (for 10100) and 13 (for 1101). If, along with s_0, s_1, …, s_9, the integers 4, 20 and 13 are sent as codes for the circular leaf nodes A, B and C, the receiver can recover the Huffman codes 00, 0100 and 101 by deleting the attached prefix. These bit patterns enable the receiver to construct all paths to the circular leaf nodes, and thus to reconstruct the Huffman tree. So, for the reconstruction of the Huffman tree T, we have the following theorem.

Theorem: Let S be the array of symbols in order of appearance in the Huffman tree T, and let D be the set of integer codes, corresponding to the circular leaf nodes, sent to the receiver. Then the receiver can uniquely reconstruct the Huffman tree T.

Proof: Let d_i be the integer corresponding to circular leaf node i. By deleting the prefixed bit from the binary representation of d_i, one can reconstruct the whole path leading to that node uniquely, since a 0-bit leads to the left and a 1-bit to the right, with no ambiguity. Each internal node of the Huffman tree lies on the path to some circular leaf node, so every internal node appears in at least one of these paths. The external nodes appear either as the left or right sons of the circular leaf nodes, or as sons of circular nodes on the paths to the circular leaf nodes. Since the path constructed for each d_i is unique, this results in the construction of a unique Huffman tree T, and the set of symbols S labels the external nodes in left-to-right order with the appropriate symbols. QED
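The proof is constructive, and the reconstruction is easy to mechanize. The sketch below (our illustration; the nested-dictionary tree is a convenience of the sketch, not the authors' data structure) recovers the codewords of the circular leaf nodes from the integers in D and rebuilds the corresponding paths of T′:

# A minimal sketch of the reconstruction described in the proof.
def rebuild(D):
    codes = []
    tree = {}                      # nested dicts: key '0' = left son, '1' = right son
    for d in D:
        code = bin(d)[3:]          # drop '0b' and the prefixed 1-bit
        codes.append(code)
        node = tree
        for bit in code:           # walk the path, creating nodes as needed
            node = node.setdefault(bit, {})
    return codes, tree

codes, t_prime = rebuild([4, 20, 13])
print(codes)                       # ['00', '0100', '101'] -- the codes of A, B and C

Attaching the symbols of S to the external nodes of the rebuilt tree in left-to-right order then yields T.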
Once the tree is constructed, all the values of Table 1 can be calculated, and decoding can be accomplished by binary search in the array of cumulative weights. This representation also reduces the overhead, since the number of circular leaf nodes cannot exceed half the number of symbols: the number m of circular leaf nodes of any truncated Huffman tree is one more than the number of nodes with two internal son nodes. Huffman codes are much more efficient than block codes when the corresponding Huffman trees are sparse, and in a sparse truncated Huffman tree the number of nodes with two internal son nodes can be much smaller than the total number of internal nodes, which in turn is one less than the number of external nodes corresponding to the symbols. In the truncated Huffman tree corresponding to Fig. 1 there are exactly two nodes with two internal son nodes, namely the nodes labeled D and E. So in representing the tree we need O(n log n) bits for the symbols and O(h) bits for each circular leaf node, the number of which is bounded above by ⌈n/2⌉ and hence is O(n).
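On the sender's side, producing this header is equally simple; a short illustration of ours, prefixing the 1-bit so that leading zeros survive the conversion to an integer:

# Our illustration of the sender side of the scheme.
def code_to_int(code):
    return int('1' + code, 2)      # codeword with a 1-bit prefixed, as an integer

print([code_to_int(c) for c in ['00', '0100', '101']])  # [4, 20, 13]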

2. A Fast Decoding Technique

Chen et al. [1] give each leaf v a weight equal to the number of leaves a complete binary tree would have under the node v. Every leaf node is assigned a number equal to the cumulative weight of all the leaves appearing before it, including itself. This gives us the opportunity to apply binary search on these cumulative weights to obtain the symbol whose bit pattern has been extracted from the bit stream. At each stage our algorithm takes h bits from the stream; since some prefix of this string must be a codeword, we can always decode a codeword using the following algorithm. In the algorithm we assume that the input bit stream contains no errors.

Algorithm Decode
Input: The values f_i, i = 0, …, n-1, of a Huffman tree T with height h, and a bit stream b_j, j = 1, …, N.
Output: The text symbols c_k, k = 1, …, M, corresponding to the input bit stream.
// j is the pointer to the last bit already decoded, and k is the pointer to the last
// decoded symbol. index is the location at which d + 1 is found, or the location at
// which the search for d + 1 fails. q_i is a stream of i 1-bits, and # is the
// concatenation operator. //
j ← 0, k ← 0, f_{-1} ← 0
while j < N do
    if h < N - j then
        d ← b[j+1 : j+h]
    else
        d ← b[j+1 : N] # q_{h-N+j}
    endif
    binsearch(f, d + 1, index)
    k ← k + 1
    c_k ← s_index
    j ← j + h - log_2(f_index - f_{index-1})
enddo
End of Algorithm Decode

When decoding the very last symbol of the text there may not be h bits left in the stream, in which case d is obtained by appending a sufficient number of 1-bits to the stream. Decoding one bit pattern requires a binary search among n symbols, which takes O(log n) comparisons; if the text consists of M codewords, our algorithm therefore runs in O(M log n) time. It may be mentioned here that, to improve repeated application of Huffman coding, it is important to keep the overhead of representing the Huffman tree as small as possible, so the efficient representation of the Huffman header proposed in this manuscript opens up the opportunity of applying repeated Huffman coding in cases where further compression is desirable.
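For concreteness, here is a Python rendering of Algorithm Decode (ours; bisect_left plays the role of binsearch, and decoded symbols are returned simply as their indices), run against the cumulative weights of Table 1:

from bisect import bisect_left
from math import log2

f = [4, 8, 9, 10, 12, 16, 20, 22, 24, 32]  # cumulative weights f_i (Table 1)
h = 5                                       # height of the Huffman tree of Fig. 1

def decode(bits, f, h):
    out, j, N = [], 0, len(bits)
    while j < N:
        chunk = bits[j:j + h].ljust(h, '1')  # take h bits; pad the tail with 1-bits
        d = int(chunk, 2)
        index = bisect_left(f, d + 1)        # smallest index with f[index] >= d + 1
        out.append(index)                    # s_index is the decoded symbol
        prev = f[index - 1] if index > 0 else 0
        j += h - int(log2(f[index] - prev))  # advance by the codeword length
    return out

# The stream below encodes s_2 (01000), s_9 (11) and s_5 (011).
print(decode('0100011011', f, h))            # [2, 9, 5]

On this stream the sketch consumes 5, 2 and 3 bits for the three codewords respectively, padding the final chunk with 1-bits exactly as described above.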

Acknowledgement

The authors express their sincere thanks to the anonymous referees for pointing out errors in the manuscript and for suggesting improvements to its presentation.

References

[1] Hong-Chung Chen, Yue-Li Wang and Yu-Feng Lan, A memory-efficient and fast Huffman decoding algorithm, Information Processing Letters, 69:119-122, 1999.
[2] K. L. Chung, Efficient Huffman decoding, Information Processing Letters, 61:97-99, 1997.
[3] R. Hashemian, Memory efficient and high-speed search Huffman coding, IEEE Transactions on Communications, 43(10):2576-2581, 1995.
[4] D. A. Huffman, A method for the construction of minimum-redundancy codes, Proceedings of the IRE, 40:1098-1101, 1952.
[5] G. O. H. Katona and T. O. H. Nemetz, Huffman codes and self-information, IEEE Transactions on Information Theory, 22(3):337-340, 1978.
[6] J. N. Mandal, An approach towards development of efficient data compression algorithms and correction techniques, Ph.D. Thesis, Jadavpur University, India, 2000.
[7] R. Schack, The length of a typical Huffman codeword, IEEE Transactions on Information Theory, 40(4):1246-1247, 1994.
[8] J. S. Vitter, Design and analysis of dynamic Huffman codes, Journal of the ACM, 34(4):825-845, 1987.