2. Entropy Coding

In the section on Information Theory, an information system is modeled as the generation-transmission-user triplet, as depicted in fig-1.1, to emphasize the information aspect of the system. Let us break the system up into more detail in order to serve our purpose of understanding and designing actual systems and algorithms.

Figure-2.1. Information system with encoder/decoders.

The information transmission channel shown in fig-2.1 can be any data transfer or storage system: a twisted pair cable between computer systems, a fiber optic cable between two cities, a model of an atmosphere through which satellite radio waves propagate, or a magnetic disk used to store data. Their common property is that they are, more or less, susceptible to external disruptions, commonly modeled as additive noise. These disruptions of the signal, light or magnetic field result in erroneous data and consequently incorrect information on the user side of the system. Shannon's second theorem states, in summary, that one can achieve any desired reasonable performance against the noise by using more resources in terms of time and bandwidth. Since this subject is outside the scope of this document, here we only mention that the channel encoder-decoder pair is designed to achieve this goal.

The goal of the source encoder-decoder pair is to minimize the data flow required for the corresponding information transfer. The source coding process tries to find a code which maximizes the coding efficiency we have seen in the Information Theory section, thus reducing the average code length of the output alphabet. The minimum achievable code length is the entropy of the information source itself; hence, source coding is usually called entropy coding. It is interesting to note that the source encoder actually removes redundancy from the data, while the channel encoder inserts some redundancy.

Two classes of common entropy coding techniques can be identified:

- Statistical techniques, where the source probabilities must be available beforehand.
- Dictionary based techniques, in which there is no such requirement.

In the first class, an optimal ensemble (B, v) is created from the original ensemble (A, z). This requires the probability distribution z to be known at the beginning. If the entire data sequence to be encoded is known beforehand, the distribution can be calculated from it; otherwise the statistics can be calculated on a sample of the data at hand. In the latter case there is always a chance that the sample is not a good representative of the entire set, which results in a poorer code.

Recall example 1.6, where four symbols were coded with codes of different lengths. Such a code is generated by the simplest statistical technique: Shannon-Fano.
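As a quick numerical illustration of these quantities, the short Python sketch below computes the source entropy, the average code length and the coding efficiency (taken here simply as the ratio H/L_avg) for the four-symbol source of example 1.6, whose Shannon-Fano code is derived in the next subsection. The function names are illustrative only.

    import math

    def entropy(probs):
        # H = -sum p_i * log2(p_i), in bits/symbol
        return -sum(p * math.log2(p) for p in probs)

    def average_length(probs, lengths):
        # L_avg = sum p_i * l_i, in bits/symbol
        return sum(p * l for p, l in zip(probs, lengths))

    p = [0.49, 0.21, 0.21, 0.09]     # probabilities of example 1.6
    l = [1, 2, 3, 3]                 # code lengths of the Shannon-Fano code below
    print(entropy(p))                # about 1.76 bits/symbol
    print(average_length(p, l))      # 1.81 bits/symbol
    print(entropy(p) / average_length(p, l))   # coding efficiency, about 0.97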

Shannon-Fano Coding

In this technique, the source symbols and their probabilities are first sorted and listed in order of decreasing probability, as shown in fig-2.2a with the values and symbols from example 1.6. The list is divided into two parts such that the sums of the probabilities in each part are as close to each other as possible. At this step these sums are 0.49 and 0.51; ideally they would be 0.5 each, giving a better code. The upper and lower parts are then assigned the bit values 0 and 1, respectively, as shown in fig-2.2a.

    a_i   P(a_i)   assigned bit value   2nd step   3rd step
    00    0.49     0
    01    0.21     1                    10
    10    0.21     1                    11         110
    11    0.09     1                    11         111

Figure-2.2. a) Sorted list and first bit assignments. b) 2nd and 3rd steps.

Continuing this procedure until only one symbol is left in each part, every symbol gets assigned a unique bit sequence, as shown in fig-2.2b. The rightmost binary number in each row is the bit sequence assigned to the corresponding symbol. Notice that the length of the bit sequence decreases monotonically with the probability of the symbol. The Shannon-Fano technique has the advantage of simplicity, and can be performed in place. The average code length is 1.81 [bits/symbol], as calculated in example 1.6, whereas the entropy is 1.76 [bits/symbol]. Although not optimal, the code clearly satisfies H(v) ≤ L_avg < H(v) + 1.

Example 2.1: Find the Shannon-Fano code for the probability set

    v = [0.2  0.14  0.12  0.12  0.1  0.1  0.08  0.07  0.05  0.02]^T

(Notice that the coding process does not require the symbols themselves but only their probabilities. The actual symbols may also be binary blocks of either fixed or variable length.) The steps of the bit assignments are shown in fig-2.3.

    P_i     1st   2nd   3rd   4th     5th     Final code
    0.20    0     00                          00
    0.14    0     01    010                   010
    0.12    0     01    011                   011
    0.12    1     10    100                   100
    0.10    1     10    101                   101
    0.10    1     11    110   1100            1100
    0.08    1     11    110   1101            1101
    0.07    1     11    111   1110            1110
    0.05    1     11    111   1111    11110   11110
    0.02    1     11    111   1111    11111   11111

Figure-2.3. Solution steps of example 2.1.
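The splitting procedure above can be written as a short recursive sketch (plain Python; shannon_fano is just an illustrative name). It assumes the (symbol, probability) pairs are already sorted in non-increasing order of probability and breaks ties towards the earlier split point, which reproduces the codes of fig-2.3.

    def shannon_fano(pairs):
        # pairs: list of (symbol, probability), sorted by non-increasing probability.
        # Returns a dict mapping each symbol to its bit string.
        codes = {s: "" for s, _ in pairs}

        def split(group):
            if len(group) < 2:
                return
            total = sum(p for _, p in group)
            # find the split index making the two partial sums as close as possible
            best_i, best_diff, running = 1, float("inf"), 0.0
            for i in range(1, len(group)):
                running += group[i - 1][1]
                diff = abs(total - 2 * running)      # |upper_sum - lower_sum|
                if diff < best_diff:                 # '<' keeps the earlier split on ties
                    best_diff, best_i = diff, i
            upper, lower = group[:best_i], group[best_i:]
            for s, _ in upper:
                codes[s] += "0"
            for s, _ in lower:
                codes[s] += "1"
            split(upper)
            split(lower)

        split(list(pairs))
        return codes

    v = [0.2, 0.14, 0.12, 0.12, 0.1, 0.1, 0.08, 0.07, 0.05, 0.02]
    print(shannon_fano(list(zip("0123456789", v))))
    # {'0': '00', '1': '010', '2': '011', ..., '9': '11111'}, as in fig-2.3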

Notice that, in fig-2.3, at the second step, choosing the split point between 0.10-0.10 or between 0.10-0.08 for the lower part would create the same distance from the ideal split point 0.27. In the example the 0.10-0.10 split is chosen (marked with a dashed line in the figure). The average code length is calculated to be L_avg = 3.19, whereas the entropy is H(v) ≈ 3.15. Although H(v) ≤ L_avg < H(v) + 1 is satisfied once again, we see that the code does not achieve the optimum, which is the entropy of the source itself.

Obviously, generating the block codes does not by itself compress any data. For that, the input data must be expressed in terms of the new codes.

Example 2.2: Let us encode the stream 10420015639181010203740526310817 using the Shannon-Fano code just generated in the previous example, where each probability value corresponds to a decimal symbol in the alphabet {0,1,2,3,4,5,6,7,8,9}. (Just a note here: if straight binary coding (or BCD) were used, we would need 4 bits per symbol, even though 4 bits can represent 16 distinct symbols, indicating a clear inefficiency.) We simply replace each symbol in the stream with the corresponding binary code from the last column of fig-2.3:

010.00.101.011.00.00.010.1100.1101.100.11111.010.11110.010.00.010.00.011.00.100.1110.101.00.1100.011.1101.100.010.00.11110.010.1110

The sub-streams are separated with a "." for clarity. Such a separation is not actually required in practice, since the code is uniquely decodable; that is, the sub-streams can readily be identified in a continuous binary stream provided that we have the block code alphabet.

The output stream contains 100 bits for 32 input symbols, so its average code length is L_avg = 100/32 ≈ 3.13 [bits/symbol]. This result is better than 4-bit BCD, but it differs from the L_avg = 3.19 calculated from the codes and their probabilities in the previous example, and certainly from the calculated entropy. Why the difference? The statistics. The code in the previous example was designed for the probabilities given there. Had we had a stream which strictly conformed to those statistics, we would obtain exactly the L_avg we expect. In this particular stream the short-coded symbols 0 and 1 happen to occur more often than the assumed probabilities, so the realized average is even slightly lower; in general, the poorer the agreement between the input stream and the assumed statistics, the poorer the compression.

As mentioned above, decoding the binary stream is straightforward: just collect bits from the stream until a block code in the alphabet is seen, and replace that sub-stream with the corresponding symbol/decimal digit.
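The replacement and the bit-by-bit decoding described in example 2.2 amount to only a few lines of Python; the code table below is copied from fig-2.3, and the assertion checks that the stream survives a round trip (a real coder would, of course, pack the bit string into bytes).

    code = {'0': '00',   '1': '010',  '2': '011',  '3': '100',   '4': '101',
            '5': '1100', '6': '1101', '7': '1110', '8': '11110', '9': '11111'}

    def encode(stream):
        return ''.join(code[s] for s in stream)

    def decode(bits):
        inverse = {c: s for s, c in code.items()}
        out, buf = [], ''
        for b in bits:
            buf += b
            if buf in inverse:        # prefix property: the first match is the symbol
                out.append(inverse[buf])
                buf = ''
        return ''.join(out)

    stream = '10420015639181010203740526310817'
    bits = encode(stream)
    assert decode(bits) == stream
    print(len(bits) / len(stream))    # 3.125 bits/symbol realized on this stream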

Huffman Coding

Huffman coding (1952) is the best-known entropy coding technique. In this technique, the symbol probabilities are again listed in order of non-increasing probability, as shown in fig-2.4. The ordering does not really affect the operation or the performance of the technique, but it makes the technique easier to understand, since it allows easy recognition of the symbols with the lowest probabilities. Starting with the lowest, at each step the two symbols with the lowest probabilities in the list are combined, and an imaginary block symbol is created with a probability equal to the sum of theirs.

Continuing this process until only one block symbol remains, representing all symbols and having probability 1, a binary tree called a Huffman tree is created.

Example 2.3: Let us build the binary Huffman tree for the probability set given in the previous example. As usual, the symbols themselves are not used, only their probabilities. Figure-2.4 illustrates the creation of the tree starting from the lowest-probability elements. The re-ordering at each step is omitted; instead, the combined pairs are indicated by lines in the figure.

Figure-2.4. Huffman tree for example 2.3.

The number of bits each symbol should be represented by can be determined from the Huffman tree: the number of internal nodes passed on the way from the symbol's node to the root, including the root, equals the code length of that symbol. With the code lengths fixed by the tree, many different techniques can be employed to assign the actual bit patterns to the symbols. In one well-known technique, starting from the root of the tree, the upper branch receives a 0 and the lower branch receives a 1 at each node. For example, at the node with probability 1.00, the branches 0.42 and 0.58 receive the values 0 and 1, respectively. Since the most significant bits (the prefixes) are carried from the root towards the leaves and the symbols sit only at the leaves, this process generates a minimum-redundancy prefix code. One may choose the opposite assignment convention, or follow no convention at all. Assigning bit values using the first convention for our example yields the assignments shown in fig-2.5. Inverting each bit value (replacing 0s with 1s and vice versa) would generate an inverted code with exactly the same characteristics.

Once again the generated code is uniquely decodable and instantaneous; that is, the decoder can determine the last bit of the current symbol as soon as that bit is received. For this example, the average symbol length of the Huffman code is the same as the one found in the Shannon-Fano case, L_avg = 3.19, since the assigned bit lengths of the individual symbols happen to be the same. Figure-2.6, on the other hand, demonstrates a distribution for which Huffman's technique creates an optimal code but Shannon-Fano does not.
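A minimal heap-based sketch of this construction is given below (plain Python; huffman is an illustrative name). Tie-breaking, and hence the individual bit patterns, may differ from fig-2.4 and fig-2.5, but any valid Huffman code has the same, optimal average length.

    import heapq

    def huffman(probs):
        # Returns a dict mapping symbol index -> bit string.
        # Heap items are (probability, tie_breaker, tree); a tree is either a
        # symbol index or a (left, right) pair of sub-trees.
        heap = [(p, i, i) for i, p in enumerate(probs)]
        heapq.heapify(heap)
        next_id = len(probs)
        while len(heap) > 1:
            p1, _, t1 = heapq.heappop(heap)    # the two lowest-probability nodes
            p2, _, t2 = heapq.heappop(heap)
            heapq.heappush(heap, (p1 + p2, next_id, (t1, t2)))
            next_id += 1
        codes = {}
        def walk(tree, prefix):
            if isinstance(tree, tuple):
                walk(tree[0], prefix + '0')    # one branch gets 0, the other 1
                walk(tree[1], prefix + '1')
            else:
                codes[tree] = prefix
        walk(heap[0][2], '')
        return codes

    v = [0.2, 0.14, 0.12, 0.12, 0.1, 0.1, 0.08, 0.07, 0.05, 0.02]
    c = huffman(v)
    print(sum(p * len(c[i]) for i, p in enumerate(v)))   # about 3.19 bits/symbol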

Figure-2.5. Huffman tree bit value assignments for example 2.3.

    P(s_m)   Shannon-Fano   Huffman
    0.36     00             0
    0.18     01             100
    0.17     10             101
    0.16     110            110
    0.13     111            111

    H(s) = 2.216    L_avg = 2.29 (Shannon-Fano)    L_avg = 2.28 (Huffman)

Figure-2.6. An example where Huffman is slightly better than Shannon-Fano.
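The numbers in fig-2.6 are easy to verify; the snippet below recomputes the entropy and the two average code lengths directly from the code tables in the figure (applying the shannon_fano and huffman sketches above to this distribution should reproduce the same code lengths).

    import math

    p  = [0.36, 0.18, 0.17, 0.16, 0.13]
    sf = ['00', '01', '10', '110', '111']     # Shannon-Fano codes from fig-2.6
    hf = ['0', '100', '101', '110', '111']    # Huffman codes from fig-2.6

    print(-sum(pi * math.log2(pi) for pi in p))        # H(s)  ~ 2.216
    print(sum(pi * len(c) for pi, c in zip(p, sf)))    # L_avg ~ 2.29
    print(sum(pi * len(c) for pi, c in zip(p, hf)))    # L_avg ~ 2.28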

Unlike the Shannon-Fano technique, which is not guaranteed to generate an optimal code, Huffman's technique is guaranteed to generate a minimum-redundancy code. This means that no code assigning an integral number of bits per symbol has an average code length shorter than Huffman's. However, the complexity of the algorithm, especially for large alphabets, raised the need for truncations of the algorithm, with some penalty on the average code length.

Truncated Huffman Code

With truncation, a tradeoff is made between the cost of calculating the optimal code and the cost of transmitting/storing the extra bits of a suboptimal code. Usually the 2^c symbols with the lowest probabilities in the ordered set are selected and replaced by an imaginary block symbol whose probability equals the sum of the probabilities of these 2^c symbols. The Huffman code is then found as usual. The code corresponding to the block symbol is used as a prefix, appended to the left of the 2^c unique sub-codes representing the individual grouped symbols.

Example 2.4: Let us find a truncated Huffman code for the symbols whose probabilities are given in example 2.3. Although the number of symbols contained in the block symbol can be selected arbitrarily, it is intuitive to make it a power of 2 for efficient sub-coding. When the 4 symbols with the lowest probabilities are selected, 2 bits are needed to represent each of them, excluding the prefix bits. Figure-2.7 shows the Huffman tree and the bit assignments; the sum of the probabilities of the 4 lowest values is shown in bold at the bottom of the list. Those probabilities are {0.08, 0.07, 0.05, 0.02}, so their sub-codes are {00, 01, 10, 11}. The final codes for the symbols contained in the block symbol, obtained by appending the prefix 11 in front of each sub-code, are {1100, 1101, 1110, 1111}. The entropy of the modified source is H(u) = 2.74. The average code length, calculated after replacing the block symbol with the individual 4-bit codes, is L_avg = 3.22. Note that this value is very close to that of the full Huffman code. Also, the truncated tree has 6 internal nodes whereas the full tree had 9, indicating roughly 30% savings in calculation complexity (a detailed complexity analysis is outside the scope of this document). The full conversion table is given in fig-2.8, and a small code sketch of this truncation procedure is given at the end of this section.

Figure-2.7. Truncated Huffman tree for example 2.4.

    Symbol   P_i    Code
    0        0.20   000
    1        0.14   001
    2        0.12   010
    3        0.12   011
    4        0.10   100
    5        0.10   101
    6        0.08   1100
    7        0.07   1101
    8        0.05   1110
    9        0.02   1111
    X        0.22   11xx   (block symbol)

Figure-2.8. Full conversion table for the truncated Huffman code (ex. 2.4).

The statistical data compression methods we have seen, Shannon-Fano and Huffman, assign variable-length bit sequences to symbols according to the symbols' probabilities. If the probability of a symbol is high, it is assigned fewer bits; conversely, if the symbol probability is low, it is assigned a bit sequence probably longer than the average, possibly a very long one in the case of a large input alphabet with diverse probabilities. Huffman proved that the Shannon-Fano method is not guaranteed to generate optimal codes, whereas his minimum-redundancy codes are optimal among all codes which assign an integral number of bits to each symbol. A Huffman code will likely assign 1 bit to a symbol with a probability of 0.5; the problem is that it must also assign 1 bit to a symbol with a probability of 0.9, since it is not possible to assign, say, 0.1 bit to a symbol. Although theoretically much better performance is possible (via block codes), Huffman's code can only be the best among the suboptimal techniques in its class. Here we also assume that the data stream to be compressed is as long as the stream over which the statistics were calculated and has similar statistical characteristics.
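As promised in example 2.4, here is a small sketch of the truncation procedure. It reuses the huffman() sketch given after example 2.3, and the variable names (kept, grouped, block, ...) are illustrative only.

    v = [0.2, 0.14, 0.12, 0.12, 0.1, 0.1, 0.08, 0.07, 0.05, 0.02]
    c = 2                                   # group the 2**c = 4 least probable symbols
    kept, grouped = v[:-2**c], v[-2**c:]
    reduced = kept + [sum(grouped)]         # block symbol X with probability 0.22
    codes = huffman(reduced)                # huffman() sketch from example 2.3
    block = codes.pop(len(kept))            # code assigned to X (2 bits here)
    full = dict(codes)                      # codes of the ungrouped symbols
    for j in range(2 ** c):                 # append the fixed c-bit sub-codes 00..11
        full[len(kept) + j] = block + format(j, '0%db' % c)
    print(sum(p * len(full[i]) for i, p in enumerate(v)))   # about 3.22 bits/symbol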

Another negative point that we can chalk up against Huffman's technique is the complexity of the algorithm. Even with additional shortcuts such as the truncation just discussed, the algorithm is still a big resource consumer. Yet, for some probability distributions, it still may not achieve the optimal performance, the optimum being the entropy itself. In the following sections, two techniques addressing these two problems are discussed.

Arithmetic Coding

Data compression algorithms, considering their use in computers, accept one or more data files as input and, after processing, produce another data file, presumably compressed; that is, the output is expected to be shorter than the input. Here, the output means all data needed to reconstruct the original data/files, including any look-up tables used for conversion. In statistical techniques the processing involves, not surprisingly, the calculation of statistics: the symbol probabilities. The symbols are generally bytes or bits, since the compression operations are done in computers and computers use bytes and bits. There is, as yet, no method to determine the symbol definition(s) that give minimum entropy together with a minimum alphabet. Shannon's theorem states that the larger the symbols, the better the compression. This is not a recipe for selecting the symbols, since one might select the entire file as a single symbol and obtain an entropy of 0, while still being left with the task of transmitting the alphabet: the entire file.

Stated more generally: can one create a code which, on average, represents the data with fewer bits than the entropy? The answer is no, at least in the domain of symbols over which the entropy is calculated. This answer inherently suggests that one may change the message domain to obtain more efficient symbols. A typical example is run-length coding for bi-level images, in which the symbols, or messages, are the numbers of same-valued pixels in a run instead of the pixel values themselves. We shall discuss this technique in the following sections.

The techniques discussed previously (Shannon-Fano and Huffman) are fixed-to-variable-length; that is, they assign variable-length codes to given fixed-length symbols (after gathering the statistics, of course). Ziv-Lempel's technique, which we shall discuss in the coming sections, can be considered to do the opposite: it assigns fixed-length codes to variable-length input symbols.

Given the weaknesses of the two statistical compression techniques discussed, it is no surprise that new techniques have continuously been sought and found. Arithmetic coding has been a strong candidate for overcoming the deficiency of assigning an integer number of bits to each symbol. Although the mathematics had been known for decades, it was long considered difficult, if not impossible, to implement, because in theory arithmetic coding uses floating point numbers of almost unlimited precision. When it was implemented successfully (Witten, Neal and Cleary, 1987), it became clear that unlimited-precision floating point arithmetic is not actually needed in practice. The most important property of arithmetic coding is the inherent assignment of a non-integer number of bits per symbol, hence a considerable leap towards the entropy compared to Huffman coding.
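To put a number on this limitation: for a binary source with symbol probabilities 0.9 and 0.1, any code that spends an integral number of bits per source symbol needs at least 1 bit/symbol, while the entropy is less than half a bit. A short check (plain Python, for illustration):

    import math
    p = [0.9, 0.1]
    print(-sum(pi * math.log2(pi) for pi in p))
    # about 0.469 bits/symbol, versus the 1 bit/symbol that any symbol-by-symbol
    # (e.g. Huffman) code must spend on this two-symbol alphabet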

Arithmetic coding initially allocates, from the semi-closed range [0, 1), semi-closed sub-ranges to the input symbols according to their probabilities. The symbols need not be sorted in any particular way, as long as the same order is used in both the encoder and the decoder. The leftmost vertical scale in fig-2.9 shows an example of range assignments for the characters of the string S = ABRACADABRA. The alphabet of the source is {A, B, C, D, R} and the corresponding probability set is z = [5/11  2/11  1/11  1/11  2/11]^T. The first symbol of the alphabet, A, owns the range [0, 5/11), B owns the range [5/11, 7/11), and so on.

The first symbol taken from the string is A, and it corresponds to the probability range [0, 5/11). This range is now scaled to cover the entire [0, 1) range; the scaling operation is shown in fig-2.9 with two lines. The second symbol from the string S is B, shown in bold on the second vertical scale, and the scaling operation is applied again. The remaining symbols are treated similarly, and the operation continues until the last symbol in the string is processed. Obviously the range marked on the last vertical scale represents the probability of the string S = ABRACADABRA being produced by the given source. It is marked as P(S) and is equal to the product of the symbol probabilities in the string:

    P(S) = ∏_{i=1}^{11} P(S_i)                                    (2.1)

As a byproduct we have found a range by scaling the last range back to the original (leftmost) scale. It can be proven that any number within that range represents the string S. (It is also worth noting that the range [0, 1) represents all strings that can be generated from this source.) Instead of back-scaling, the range could also be obtained by updating the scale at each step. The algorithm is illustrated by the pseudo-code given in fig-2.10, and the intermediate values leading to the final range are listed in fig-2.11.

Figure-2.9. Expansion of ranges in arithmetic coding.
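Equation (2.1) is easy to check numerically: for S = ABRACADABRA the product of the symbol probabilities is (5/11)^5 (2/11)^2 (2/11)^2 (1/11)(1/11) ≈ 1.75e-7, which is exactly the width of the final range listed in fig-2.11. For illustration, in Python:

    from math import prod
    z = {'A': 5/11, 'B': 2/11, 'C': 1/11, 'D': 1/11, 'R': 2/11}
    print(prod(z[s] for s in 'ABRACADABRA'))   # about 1.752e-07, the final H - L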

Running the algorithm for our example outputs the final range as L = 0.2787886510028143 and H = 0.2787888262497640 at 16-digit precision. As stated above, any number between these two values represents the input string; 0.2787887 will do. Inspecting the algorithm and the intermediate values of L, H and their difference, it is seen that L and H get closer and closer as each new symbol is processed, as depicted in fig-2.12. It is clear that the precision required to carry these numbers grows without regard for the precision limits imposed by standard computer number formats.

    L = 0.0
    H = 1.0
    Start_Loop
        R = H - L
        H = L + R * H_of_i_th_symbol
        L = L + R * L_of_i_th_symbol
    Loop_Until_the_Last_Symbol_is_Processed
    Output something_between_L_and_H

Figure-2.10. Pseudo-code for arithmetic coding.

    Step   L                     H                     H - L
    -      0.0000000000000000    1.0000000000000000    1.0000000000000000
    A      0.0000000000000000    0.4545454545454545    0.4545454545454545
    B      0.2066115702479339    0.2892561983471074    0.0826446280991736
    R      0.2742299023290759    0.2892561983471074    0.0150262960180316
    A      0.2742299023290759    0.2810600368827266    0.0068301345536507
    C      0.2785763515904899    0.2791972729135491    0.0006209213230592
    A      0.2785763515904899    0.2788585885555168    0.0002822369650269
    D      0.2787816148377822    0.2788072727436938    0.0000256579059116
    A      0.2787816148377822    0.2787932775222874    0.0000116626845053
    B      0.2787869160580119    0.2787890365461037    0.0000021204880918
    R      0.2787886510028143    0.2787890365461037    0.0000003855432894
    A      0.2787886510028143    0.2787888262497640    0.0000001752469497

Figure-2.11. The intermediate values of L, H and H-L.

Figure-2.12. Approach of the H and L values towards the final range (plot of L and H after each processed symbol).
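The pseudo-code of fig-2.10 translates almost line for line into Python. The sketch below uses ordinary double-precision floats, which is enough to reproduce the values of fig-2.11 (possibly differing in the last digit due to rounding) but, as noted at the end of this section, fails for longer inputs.

    # Float-based arithmetic encoder following fig-2.10 (illustrative only).
    ranges = {'A': (0.0, 5/11), 'B': (5/11, 7/11), 'C': (7/11, 8/11),
              'D': (8/11, 9/11), 'R': (9/11, 1.0)}   # per-symbol [low, high)

    def arith_encode(message):
        L, H = 0.0, 1.0
        for s in message:
            R = H - L
            lo, hi = ranges[s]
            H = L + R * hi        # H is updated first so that the old L is used
            L = L + R * lo        # in both lines, exactly as in fig-2.10
        return L, H               # any number in [L, H) represents the message

    print(arith_encode('ABRACADABRA'))
    # approximately (0.2787886510028143, 0.2787888262497640)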

The decoding process is, not surprisingly, the opposite of the encoding, provided that the alphabet, the probability set and the number of encoded symbols are known beforehand. The pseudo-code of the decoder is shown in fig-2.13.

    X = encoded_number
    Start_Loop
        Find_range_enclosing_X
        Output the_symbol_of_the_range_found
        R = H_of_symbol_found - L_of_symbol_found
        X = (X - L_of_symbol_found) / R
    Loop_Until_the_Last_Symbol_is_Output

Figure-2.13. Algorithm of the decoding process.

Figure-2.14 shows the progress of the decoding process for our example when the low value of the range found in encoding is taken as the input to the decoder.

    X                     L                     H                     Output
    0.2787886510028144    0.0000000000000000    0.4545454545454545    A
    0.6133350322061917    0.4545454545454545    0.6363636363636364    B
    0.8733426771340543    0.8181818181818182    1.0000000000000000    R
    0.3033847242372987    0.0000000000000000    0.4545454545454545    A
    0.6674463933220571    0.6363636363636364    0.7272727272727273    C
    0.3419103265426284    0.0000000000000000    0.4545454545454545    A
    0.7522027183937826    0.7272727272727273    0.8181818181818182    D
    0.2742299023316084    0.0000000000000000    0.4545454545454545    A
    0.6033057851295385    0.4545454545454545    0.6363636363636364    B
    0.8181818182124617    0.8181818181818182    1.0000000000000000    R
    0.0000000001685392    0.0000000000000000    0.4545454545454545    A

Figure-2.14. Intermediate values of the decoding process (L and H are the range bounds of the symbol found at each step).

Recalling that any number between the high and low values of the range could be used, one would obtain the same string using the high value as the input to the decoder.

These algorithms require many more floating point operations than the previously discussed techniques, but that is not the only thing that makes arithmetic coding difficult. The coding and decoding processes are straightforward, yet they require impractically high precision floating point numbers. Even for strings as short as about 20 symbols, the width of the range shrinks below what standard double precision numbers can resolve, and the computation breaks down.
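Finally, a matching float-based decoder sketch following fig-2.13. It assumes the alphabet, the probability ranges and the number of encoded symbols are known, exactly as stated above, and it inherits the same precision limitation.

    # Float-based arithmetic decoder following fig-2.13 (illustrative only).
    ranges = {'A': (0.0, 5/11), 'B': (5/11, 7/11), 'C': (7/11, 8/11),
              'D': (8/11, 9/11), 'R': (9/11, 1.0)}

    def arith_decode(X, n_symbols):
        out = []
        for _ in range(n_symbols):
            for s, (lo, hi) in ranges.items():   # find the range enclosing X
                if lo <= X < hi:
                    out.append(s)
                    X = (X - lo) / (hi - lo)     # rescale X for the next symbol
                    break
        return ''.join(out)

    print(arith_decode(0.2787886510028144, 11))  # expected to print ABRACADABRA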