Adaptive Huffman Coding (FastHF) Implementations


Amir Said

1 Introduction

This document describes a fast implementation of static and adaptive Huffman codes, called FastHF. The C++ classes and interfaces used by this implementation are very similar to those used by the FastAC implementations of arithmetic coding (available at http://www.cipr.rpi.edu/~said/fastac.html; the reader will find that the documentation is also very similar).

Objectives

During the development of our program we had the following objectives:

1. Its interface must be simple enough that it can be used by programmers without compression expertise.

2. Its performance should represent the state of the art of Huffman coding in terms of speed, compression efficiency, and features.

3. It must be versatile, portable, and reliable enough to be used in more advanced coding projects.

4. The program should be clear and simple enough that students in a first course on coding can learn about Huffman coding and truly understand how it works.

Unlike the arithmetic coding implementation, we did not try to cover many important aspects of Huffman coding, such as the creation and efficient decoding of canonical Huffman codes. Also, we use an adaptive version that adapts periodically; this is much simpler and faster than Huffman codes that adapt after each coded symbol.

Document organization

This document is divided into the following parts. Section 2 presents the rationale for using object-oriented programming in compression applications, and describes our C++ classes

implementing static and dynamic Huffman codes, and the encoder and decoder. Next, in Section 3, we provide more details on how to use our programs by applying them to a simple compression problem. The functions implementing an encoder and a decoder are presented, and we comment on the programming choices, how the coding functions work, and how the C++ classes are used. In Section 4, we explain the purpose of three programs that we use to demonstrate, on real compression applications, how to use and test our implementations.

Program distribution

All the programs are in a single zip file called FastHF.zip, which contains 3 subdirectories. The Huffman code implementation files, called binary_codec.h and binary_codec.cpp, are in the root directory, together with the single documentation file, FastHF_Readme.pdf (this file). The root directory also contains files that define an MS VC++ workspace (FastHF.dsw, FastHF.ncb, and FastHF.opt) with 3 projects, called hffile, hfwav, and test. Each project subdirectory contains a different program as an example of how to use our code (see Section 4).

2 Interface of the Huffman Code C++ Classes

Because there are innumerable types of data that need to be compressed, but only a few coding methods, there are many advantages in separating the process of source modeling from that of entropy coding [1, 2]. While in practice it is not possible to completely separate the two, with current object-oriented programming (OOP) tools it is possible to write programs that eliminate the need to understand the details of how the data-source models are actually implemented and used for coding. For example, for coding purposes the only information needed to model a data source is its number of data symbols and the probability of each symbol. Thus, for OOP purposes, that is the only type of data abstraction needed. It is true that during the actual coding process what is used is data computed from the probabilities, i.e., the optimal codewords or coding functions, but we can use OOP to encapsulate that data and those functions in a manner that is completely transparent to the user.

Huffman code classes

We provide two C++ classes for modeling and coding data sources, with definitions in the file binary_codec.h and implementations in the file binary_codec.cpp. All the class definitions are shown in Figure 1. The first class, Static_Huffman_Code, can be used for sources with a number of symbols between 3 and 2^16. The function set_distribution completely defines the source by setting the number of data symbols and their probabilities. If this function receives a pointer to probabilities equal to zero, it assumes that all symbols are equally probable. If the data source has M symbols, they must be integers in the range [0, M-1].

class Static_Huffman_Code
{
public:
  Static_Huffman_Code(void);
 ~Static_Huffman_Code(void);
  void set_distribution(unsigned number_of_symbols,
                        const double probability[] = 0);  // 0 means uniform
  unsigned code_symbols(void);
};

class Adaptive_Huffman_Code
{
public:
  Adaptive_Huffman_Code(void);
 ~Adaptive_Huffman_Code(void);
  Adaptive_Huffman_Code(unsigned number_of_symbols);
  void set_alphabet(unsigned number_of_symbols);
  void reset(void);                                       // restart estimation process
  unsigned code_symbols(void);
};

Figure 1: Definition of classes for supporting static and adaptive Huffman codes.

Even though these models are called static, their symbol probabilities can be changed at any time while coding, as long as the decoder uses the same sequence of probabilities. Their main limitation is that they only use the probabilities provided by the user. In most situations we do not know the probabilities a priori and have to use estimates. Since the coding process can be quite complex, we want to avoid using two passes, one only for gathering statistics and another for coding. The solution is provided by adaptive models, which simultaneously code and update probability estimates. Figure 1 shows the definition of the class Adaptive_Huffman_Code, which implements such a code. Before it is used for coding it needs to know the number of data symbols, which is defined using the function set_alphabet. In this case we limit the number of symbols to between 3 and 2^12. Larger alphabets require a very large number of coded symbols to adapt (that is, to obtain good estimates of the symbol probabilities), so in those cases it is better to decompose the source alphabet before coding. This class starts coding assuming that all symbols have equal probability, and has a function called reset to restart the estimation process.
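As a quick illustration of this interface, the following is a minimal sketch of how the two model types might be configured. It assumes only the class names and signatures shown in Figure 1; the alphabet sizes and probability values are made up for the example.

#include "binary_codec.h"

void configure_models(void)
{
  // Static model: 4 symbols with explicitly given probabilities.
  const double prob[4] = { 0.5, 0.25, 0.125, 0.125 };
  Static_Huffman_Code static_model;
  static_model.set_distribution(4, prob);  // symbols are 0, 1, 2, 3

  // Static model with uniform probabilities (null pointer = uniform).
  Static_Huffman_Code uniform_model;
  uniform_model.set_distribution(16);

  // Adaptive model: only the alphabet size is needed; probability
  // estimates are updated periodically as symbols are coded.
  Adaptive_Huffman_Code adaptive_model(256);
  adaptive_model.reset();                  // restart estimation (optional here)
}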

Encoder and decoder (codec) class

For simplicity we use a single class, called Binary_Codec, to encapsulate both the encoder and the decoder functions. Its definition is shown in Figure 2. We use a framework in which all compressed data is saved to and fetched from a memory buffer, and only periodically written to or read from a file.

class Binary_Codec
{
public:
  Binary_Codec(void);
 ~Binary_Codec(void);
  Binary_Codec(unsigned max_code_bytes,
               unsigned char * user_buffer = 0);     // 0 = assign new
  unsigned char * buffer(void) { return code_buffer; }
  void set_buffer(unsigned max_code_bytes,
                  unsigned char * user_buffer = 0);  // 0 = assign new
  void start_encoder(void);
  void start_decoder(void);
  void read_from_file(FILE * code_file);     // read code data, start decoder
  unsigned stop_encoder(void);               // returns number of bytes used
  unsigned write_to_file(FILE * code_file);  // stop encoder, write code data
  void stop_decoder(void);
  void put_bit(unsigned bit);
  unsigned get_bit(void);
  void put_bits(unsigned data, unsigned number_of_bits);
  unsigned get_bits(unsigned number_of_bits);
  void encode(unsigned data, Static_Huffman_Code &);
  unsigned decode(Static_Huffman_Code &);
  void encode(unsigned data, Adaptive_Huffman_Code &);
  unsigned decode(Adaptive_Huffman_Code &);
};

Figure 2: Definition of the class for Huffman encoding and decoding.

Thus, before we can start encoding and decoding, it is necessary to define the memory location where we want the compressed data to be stored. We must use a constructor or the function set_buffer to define an amount of memory equal to max_code_bytes in which the compressed data can be written or read. The function buffer returns a pointer to the first byte in this buffer.

Note that we said define, not allocate. This depends on the parameter user_buffer. If it is zero, the class Binary_Codec will allocate the indicated amount of memory, and later its destructor will free it. Otherwise, it will use the user_buffer pointer as the first position of the compressed-data memory, assuming that the user is responsible for allocating and freeing it.

Before encoding data it is necessary to call the function start_encoder to initialize the codec and set it to encoder mode. Next, data is coded using the version of the function encode corresponding to its data model. It is also possible to write an integer number of bits to the compressed stream using the functions put_bit or put_bits. If the data source has M symbols, the data must be unsigned integers in the range [0, M-1]. Similarly, data with b bits should be in the range [0, 2^b - 1]. Before the compressed data can be copied from its memory buffer it is necessary to call the function stop_encoder, which saves the final bytes required for correct decoding and returns the total number of bytes used. Alternatively, the function write_to_file can be used to simultaneously stop the encoder and save the compressed data to a file. This function also writes a small header so that the decoder knows how many bytes it needs to read.

The decoding process is quite similar. Before it starts, all the compressed data must be copied to the memory buffer, and the function start_decoder must be called. If the compressed data had been saved to a file using the function write_to_file, then the function read_from_file must be used to read the data and start the decoder. The compressed data is retrieved using the corresponding decode, get_bit, or get_bits functions. When decoding is done, the function stop_decoder must be called to restore the codec to a default mode, so that it can be restarted to encode or decode.
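Putting these calls together, here is a minimal encode/decode round-trip sketch. It assumes only the interfaces of Figures 1 and 2; the buffer size and data values are arbitrary illustrative choices.

#include <cstdio>
#include "binary_codec.h"

void round_trip(void)
{
  const unsigned data[5] = { 2, 0, 1, 2, 2 };  // symbols in [0, 2]

  // Encode to an internally allocated 1000-byte buffer.
  Binary_Codec encoder(1000);
  Adaptive_Huffman_Code model(3);              // 3-symbol adaptive model
  encoder.start_encoder();
  for (int i = 0; i < 5; i++)
    encoder.encode(data[i], model);
  unsigned bytes = encoder.stop_encoder();     // compressed size in bytes

  // Decode from the same buffer; the decoder must recreate the model
  // in its initial state so that both sides adapt identically.
  Binary_Codec decoder(bytes, encoder.buffer());
  Adaptive_Huffman_Code model2(3);
  decoder.start_decoder();
  for (int i = 0; i < 5; i++)
    printf("%u ", decoder.decode(model2));
  decoder.stop_decoder();
}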

3 Coding Example

Let us consider the problem of compressing the data used for line plots (surely obsolete, but it makes an interesting example). Each plot is defined by an array of structures of the type Plot_Segment, shown at the top of Figure 3. The first component, line_color, indicates the index of the color of the line, which is a number between 0 and 7. The meaning of the second and third components depends on the first. If line_color = 0 then they contain the absolute (x, y) coordinate to which the plotter pen should move (a "move to" instruction). If line_color > 0 then they contain the (x, y) displacement from the current location, drawn using a pen with the indicated color (a "line to" instruction). We assume that x and y are in the interval [-2^15, 2^15 - 1].

When designing a new compression method we should consider that, even though in theory we can use a large number of adaptive models to form complex high-order models for any type of data, in practice it is better to consider whether we can exploit the particularities of typical data. This happens because adaptive methods need time to obtain good estimates, which is not a problem if the number of models is reasonably small, but becomes a problem when we have, for example, many thousands of models. For our plot-segment compression problem we can exploit the following facts:

- Consecutive line segments normally have the same color.
- New colors are expected after a "move to".
- Curves are plotted with many short segments.

One way to make use of the fact that something does not change very frequently is to first code binary information indicating when a change occurs, and only then code the new information. Figure 3 shows an example of a coding function that uses this technique to code an array of structures of the type Plot_Segment.

Following the sequence of instructions in Figure 3, we see that we start by defining one variable of the type Binary_Codec. We assume that it will compress to a memory buffer provided from outside the function Encode_Plot, and use the proper constructor to set this buffer. Next we define two adaptive codes. The first, color_code, is for coding the actual color information, so we use the constructor to set its number of data symbols to 8. The second, step_code, is for coding short lines; its 257 symbols cover displacements in [-128, 128] after the offset of 128 is added. After the call to the function start_encoder we have the loop that codes all the plot segments. The information coded depends on the previous and current colors. If the previous segment had line_color = 0 then we code the color information immediately; otherwise we first code the change information and, only if the color changed, code the new color.

Next, we need to code the position or displacement (x, y). Since their range is quite large, we do not try to entropy-code them all the time. If (x, y) represents an absolute position (line_color = 0), or a large displacement, then we just save x and y as 16-bit nonnegative numbers. If the displacement magnitude is at most 128 then we first code the information that the segment is short, followed by x and y, which are converted to nonnegative numbers and coded using step_code.

Figure 4 shows the function Decode_Plot, which decompresses the data created by Encode_Plot. Note that it is very similar to the encoder, since it must reproduce all the sequences of decisions taken by the encoder. Thus, even though we have four types of information that can be coded, decoding is correct because the sequence of models used by the decoder is identical to the encoder's sequence.

#include <cstdlib>   // abs

struct Plot_Segment
{
  int line_color, x, y;
};

int Encode_Plot(int plot_points, Plot_Segment seg[],
                int buffer_size, unsigned char * compressed_data)
{
  Binary_Codec ace(buffer_size, compressed_data);
  Adaptive_Huffman_Code color_code(8);
  Adaptive_Huffman_Code step_code(257);

  ace.start_encoder();

  int short_line, last_color, current_color = -1;

  for (int p = 0; p < plot_points; p++) {

    last_color = current_color;
    current_color = seg[p].line_color;
    if (last_color != 0)
      ace.put_bit(last_color != current_color);
    if ((last_color == 0) || (last_color != current_color))
      ace.encode(current_color, color_code);

    if (current_color == 0)
      short_line = 0;
    else {
      short_line = (abs(seg[p].x) <= 128) && (abs(seg[p].y) <= 128);
      ace.put_bit(short_line);
    }

    if (short_line) {
      ace.encode(seg[p].x + 128, step_code);
      ace.encode(seg[p].y + 128, step_code);
    }
    else {
      ace.put_bits(seg[p].x + 32768, 16);
      ace.put_bits(seg[p].y + 32768, 16);
    }
  }

  return ace.stop_encoder();   // return number of bytes used for compression
}

Figure 3: Definition of structure Plot_Segment for storing plot graphic data, and implementation of function Encode_Plot.

void Decode_Plot(int plot_points, int data_bytes,
                 unsigned char * compressed_data, Plot_Segment seg[])
{
  Binary_Codec acd(data_bytes, compressed_data);
  Adaptive_Huffman_Code color_code(8);
  Adaptive_Huffman_Code step_code(257);

  acd.start_decoder();

  int short_line, current_color = -1;

  for (int p = 0; p < plot_points; p++) {

    if (current_color == 0)
      current_color = acd.decode(color_code);
    else
      if (acd.get_bit()) current_color = acd.decode(color_code);
    seg[p].line_color = current_color;

    if (current_color == 0)
      short_line = 0;
    else
      short_line = acd.get_bit();

    if (short_line) {
      seg[p].x = int(acd.decode(step_code)) - 128;
      seg[p].y = int(acd.decode(step_code)) - 128;
    }
    else {
      seg[p].x = int(acd.get_bits(16)) - 32768;
      seg[p].y = int(acd.get_bits(16)) - 32768;
    }
  }

  acd.stop_decoder();
}

Figure 4: Implementation of function Decode_Plot.
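To make the round trip concrete, a small test driver for these two functions might look like the sketch below. It assumes Plot_Segment, Encode_Plot, and Decode_Plot from Figures 3 and 4 are in scope; the buffer size and plot data are arbitrary.

int main(void)
{
  // A tiny hand-made plot: one "move to" followed by two short lines.
  Plot_Segment plot[3] = { { 0, 1000, 2000 },   // move to (1000, 2000)
                           { 3,   10,   -5 },   // short "line to", pen color 3
                           { 3,   -2,    7 } }; // another short segment

  unsigned char buffer[1024];                   // assumed large enough
  int bytes = Encode_Plot(3, plot, 1024, buffer);

  Plot_Segment decoded[3];
  Decode_Plot(3, bytes, buffer, decoded);       // decoded[] should equal plot[]

  return 0;
}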

4 Huffman Coding Demo Programs

The three application programs are in the files hffile.cpp, hfwav.cpp, and test.cpp, in the subdirectories with the same names. Here is their description.

hffile.cpp

This is an example of how to use Huffman coding to compress any file (text, executables, etc.) using a relatively small number of adaptive models. Probably the most important lesson to be learned from this program is that our code is very easy to use, and enables writing a reasonably good compression program with little effort. The program's usage for compressing and decompressing a file is, respectively,

hffile -c file_name compressed_data_file
hffile -d compressed_data_file new_file

hfwav.cpp

This is a slightly more complex example, for lossless compression of audio files. The accepted audio file format is wav, which is supported by all CD ripping programs. It uses some signal processing to improve compression (the reversible S+P transform), but the coding process is very similar to the first case, except that here we decompose each transform sample into a bits+data representation (the same as the VLI representation in the JPEG and MPEG standards); a sketch of this decomposition appears at the end of this section. The former is coded with contexts and adaptive models, while the bits of the latter are saved directly. The program's usage is similar. For compression and decompression use

hfwav -c wav_file compressed_wav_file
hfwav -d compressed_wav_file new_wav_file

test.cpp

This program is not really an application. It has functions to test and benchmark the speed of our Huffman coding implementations (it is similar to the program used in [3]). Given a number of data symbols, it defines several values of source entropy, and for each value it generates millions of pseudo-random source samples. The times to encode and decode this data are measured, and finally the decoded data is compared with the original to make sure the code is correct. The usage is

test number_of_symbols
test number_of_symbols number_of_simulation_cycles

The first form uses a default number of cycles equal to 10. In this directory we also have the files test_support.h and test_support.cpp, which contain some functions required for the simulations, such as the pseudo-random number generator.
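Returning to the bits+data decomposition used by hfwav: a value is split into a category (the number of significant bits of its magnitude), which is entropy coded with an adaptive model, and the remaining raw bits, which are stored directly with put_bits. The sketch below shows one standard way to implement the JPEG-style VLI mapping; the helper names vli_split and vli_join are hypothetical, and the actual hfwav code may organize this differently.

// Split v into a category and a category-bit raw value (JPEG-style VLI):
// nonnegative values keep their bits; a negative v is stored as v - 1 + 2^category.
void vli_split(int v, unsigned &category, unsigned &raw_bits)
{
  unsigned m = unsigned(v < 0 ? -v : v);
  category = 0;
  while (m >> category) ++category;  // number of bits needed for |v|
  raw_bits = (v >= 0) ? unsigned(v) : unsigned(v - 1 + (1 << category));
}

// Inverse mapping: recover v from its category and raw bits.
int vli_join(unsigned category, unsigned raw_bits)
{
  if (category == 0) return 0;
  if (raw_bits >> (category - 1))    // top bit set: value was nonnegative
    return int(raw_bits);
  return int(raw_bits) + 1 - (1 << category);
}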

References

[1] K. Sayood (Ed.), Lossless Compression Handbook, Academic Press, San Diego, CA, 2003.

[2] Amir Said, "Introduction to Arithmetic Coding - Theory and Practice," Hewlett-Packard Laboratories Report HPL-2004-76, Palo Alto, CA, April 2004 (http://www.hpl.hp.com/techreports/).

[3] Amir Said, "Comparative Analysis of Arithmetic Coding Computational Complexity," Hewlett-Packard Laboratories Report HPL-2004-75, Palo Alto, CA, April 2004 (http://www.hpl.hp.com/techreports/).