Information Theory and Communication
Shannon-Fano-Elias Code and Arithmetic Codes
Ritwik Banerjee
rbanerjee@cs.stonybrook.edu
Roadmap
- Examples and Types of Codes
- Kraft Inequality
- McMillan Inequality
- Entropy bound on data compression
- Shannon Code
- Huffman Code
- Wrong Code
- Stochastic and Stationary Processes
Recap
We have seen that Huffman coding is optimal, and that its expected length $L$ is within 1 bit of the entropy $H$ of the source. Redundancy, defined as the difference between the two (i.e., $L - H$), is reduced in Huffman codes by the symbol-combining technique we saw in the last lecture. To obtain Huffman codes, however, we need to go through that entire procedure to encode (and hence compress) data.
Next, we look at another type of code, the Shannon-Fano-Elias code, where all we need to know is the distribution to be able to write down the codewords.
Mathematical Setup
We will need an ordering of the source letters, so, without loss of generality, consider a source alphabet of the form $\mathcal{X} = \{1, 2, \ldots, m\}$. We will continue to consider the encoding alphabet to be binary. Also, assume that $p(x) > 0$ for all $x \in \mathcal{X}$; otherwise, we can simply omit the zero-probability symbols from the alphabet and work with a smaller alphabet.
For the Shannon-Fano-Elias code, we need to work with the cumulative distribution function instead of just the probability mass function.
Cumulative Distribution
The cumulative distribution function (CDF) $F(x)$ for a real-valued random variable $X$ is evaluated at a point $x$, and is defined as the probability that $X$ takes a value at most equal to $x$:
$$F(x) = P(X \le x) = \begin{cases} \sum_{x_i \le x} P(X = x_i) & \text{if } X \text{ is discrete} \\ \int_{-\infty}^{x} p(t)\,dt & \text{if } X \text{ is continuous} \end{cases}$$
We will work with a modified CDF defined as follows:
$$\bar{F}(x) = F(x - 1) + \frac{1}{2}\,p(x)$$
For a discrete random variable $X$, this is equivalent to
$$\bar{F}(x) = \sum_{i=1}^{x-1} p(i) + \frac{1}{2}\,p(x)$$
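To make these definitions concrete, here is a minimal Python sketch of both functions; the pmf values and function names are illustrative choices of mine, not from the lecture.

```python
# A minimal sketch of F(x) and the modified CDF Fbar(x) for a discrete
# source with symbols 1..m. The pmf below is a hypothetical example.

def cdf(p, x):
    """F(x) = sum of p(i) over all symbols i <= x."""
    return sum(p[i] for i in p if i <= x)

def modified_cdf(p, x):
    """Fbar(x) = F(x - 1) + p(x) / 2: the midpoint of the jump at x."""
    return cdf(p, x - 1) + 0.5 * p[x]

p = {1: 0.25, 2: 0.5, 3: 0.125, 4: 0.125}   # hypothetical pmf
for x in sorted(p):
    print(x, cdf(p, x), modified_cdf(p, x))
# Fbar: 0.125, 0.5, 0.8125, 0.9375 -- distinct midpoints, one per symbol
```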
Cumulative Distribution
$F(x)$ is defined over all real values, but $\bar{F}(x)$ is defined only over the integer values, because $p(x)$, too, is defined only over these integer values. These integers correspond to the symbols of $\mathcal{X}$.
Explanation/Intuition of the modified CDF
For discrete random variables, the CDF $F(x)$ is a step function, with a vertical jump at each discrete value (because we assumed $p(x) > 0$ for all $x \in \mathcal{X}$). The modified CDF $\bar{F}(x)$ represents the midpoint of each step in the original CDF.
Why do we need this modified function? Since $p(x) > 0$ for all $x$, there is a bijective mapping (i.e., a one-to-one correspondence) between $x$ and $F(x)$. That is, we could use $F(x)$ as a codeword for $x$. The same argument applies to $\bar{F}(x)$. So why modify the CDF? Because to answer the following question, we will need an approximation process that leads to prefix codes, and this process would not work for the original CDF.
How many bits would we need to represent $\bar{F}(x)$?
Shannon-Fano-Elias Code
$\bar{F}(x)$ is a real number, so in general we could need an infinite number of bits to represent its exact value. Instead, we use an approximation. But then, the question is: how many bits do we need? This reflects the precision of our encoding; too low a precision would mean that we may no longer have a uniquely decodable code.
The idea is to truncate the representation of $\bar{F}(x)$ as soon as we have enough bits to ensure that the codewords are unique. If we truncate $\bar{F}(x)$ to $l(x)$ bits, the result is denoted by $\lfloor \bar{F}(x) \rfloor_{l(x)}$, and we use these first $l(x)$ bits of $\bar{F}(x)$ as the codeword for $x$. We then get (by definition of rounding)
$$\bar{F}(x) - \lfloor \bar{F}(x) \rfloor_{l(x)} < \frac{1}{2^{l(x)}}$$
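The truncation step itself is just taking the leading bits of the binary expansion of $\bar{F}(x)$. A minimal sketch (the helper name `truncate_bits` is mine):

```python
def truncate_bits(value, l):
    """Return the first l bits of the binary expansion of value in [0, 1).
    The truncated number floor_l(value) equals int(bits, 2) / 2**l, so
    value - floor_l(value) < 2**-l."""
    bits = []
    for _ in range(l):
        value *= 2
        bit = int(value)          # next binary digit of the expansion
        bits.append(str(bit))
        value -= bit
    return "".join(bits)

print(truncate_bits(0.8125, 4))   # '1101', since 0.8125 = 0.1101 in binary
```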
Shannon-Fano-Elias Code
We are going to show that
$$l(x) = \left\lceil \log \frac{1}{p(x)} \right\rceil + 1$$
is adequate for encoding. Here, adequate means we need to show that
1. the codewords are unique, and
2. the code is a prefix code.
Proof shown in scribe notes.
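Putting the pieces together, here is a sketch of the whole encoder under these definitions, reusing `truncate_bits` from the earlier sketch (the pmf and names are again hypothetical):

```python
import math

def sfe_code(p):
    """Shannon-Fano-Elias codewords: for each symbol x, the first
    l(x) = ceil(log2(1 / p(x))) + 1 bits of Fbar(x)."""
    code = {}
    F = 0.0                           # running value of F(x - 1)
    for x in sorted(p):
        fbar = F + 0.5 * p[x]         # midpoint of the step at x
        l = math.ceil(math.log2(1 / p[x])) + 1
        code[x] = truncate_bits(fbar, l)
        F += p[x]
    return code

p = {1: 0.25, 2: 0.5, 3: 0.125, 4: 0.125}    # hypothetical pmf
print(sfe_code(p))
# {1: '001', 2: '10', 3: '1101', 4: '1111'} -- no codeword is a prefix
# of another, so the code is prefix-free, as the proof guarantees
```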
Suboptimality and Competitive Optimality
The expected length of the Shannon-Fano-Elias code is bounded above by 2 bits more than the source entropy (proof shown in scribe notes).
Huffman codes are optimal on the average, but they are not optimal for all sequences. The Shannon code is nearly optimal, and often enough; both qualifiers are made technically precise in the following theorem. The property is called competitive optimality of a code, and the theorem thus shows that the Shannon code is competitively optimal.
Theorem. Let $l(X)$ and $l'(X)$ be the codeword lengths associated with the Shannon code and any other uniquely decodable code, respectively. Then,
$$\Pr\big(l(X) \ge l'(X) + c\big) \le \frac{1}{2^{c-1}}$$
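For intuition, here is a sketch of the standard chain of inequalities behind this bound, which I am supplying here (it is not quoted from the scribe notes), using the Shannon code length $l(x) = \lceil \log \frac{1}{p(x)} \rceil$:
$$\begin{aligned}
\Pr\big(l(X) \ge l'(X) + c\big)
 &= \Pr\Big(\Big\lceil \log \tfrac{1}{p(X)} \Big\rceil \ge l'(X) + c\Big)
  \le \Pr\Big(\log \tfrac{1}{p(X)} > l'(X) + c - 1\Big) \\
 &= \Pr\big(p(X) < 2^{-l'(X) - c + 1}\big)
  = \sum_{x:\, p(x) < 2^{-l'(x) - c + 1}} p(x)
  \le \sum_{x} 2^{-l'(x) - c + 1}
  = 2^{-(c-1)} \sum_{x} 2^{-l'(x)}
  \le 2^{-(c-1)},
\end{aligned}$$
where the final step is the Kraft (McMillan) inequality $\sum_x 2^{-l'(x)} \le 1$, which holds for any uniquely decodable code.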
Arithmetic Code
Enables us to use a source distribution that is learned on the fly. Suitable for encoding and compressing streaming data.
Result. Let $Y$ be a continuous random variable with distribution $F_Y(y)$, and let $U$ be the random variable defined by $U = F_Y(Y)$. Then $U$ has a uniform distribution on the interval $[0, 1]$. (Proved in scribe notes.)
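A quick simulation sketch of this result, using $Y \sim \text{Exp}(1)$ as a hypothetical example, where $F_Y(y) = 1 - e^{-y}$; applying $F_Y$ to the samples should spread them uniformly over $[0, 1]$:

```python
import math
import random

# Probability integral transform: U = F_Y(Y) is Uniform[0, 1].
# Hypothetical example with Y ~ Exp(1), where F_Y(y) = 1 - exp(-y).
random.seed(0)
ys = [random.expovariate(1.0) for _ in range(100_000)]
us = [1 - math.exp(-y) for y in ys]

# Empirical check: each tenth of [0, 1] should hold roughly 10% of the us.
counts = [0] * 10
for u in us:
    counts[min(int(u * 10), 9)] += 1
print([round(c / len(us), 3) for c in counts])   # each entry close to 0.1
```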
Probability Transformation of Infinite Sequences
Each bit of a numerical representation can be modeled as an independent Bernoulli random variable. Such sequences are incompressible, and the transformation yields an invertible mapping from infinite source sequences to infinite binary sequences. Advantage: easy to calculate! (Example shown in scribe notes.)
Using this transformation, long sequences of symbols can be encoded together, and the expected length per symbol is bounded by
$$L < \frac{1}{n} H(X_1, X_2, \ldots, X_n) + \frac{2}{n}$$
Just as we have seen earlier, this means that we can make the code arbitrarily close to the optimal bound (i.e., $\frac{1}{n} H(X_1, X_2, \ldots, X_n)$) by taking large values of $n$. But for large $n$, Huffman's algorithm is not feasible due to time complexity concerns. Shannon-Fano-Elias encoding, however, remains fast, as the encoding process does not depend on $n$.
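The per-sequence mechanism can be sketched as interval refinement: each symbol narrows the current subinterval of $[0, 1)$ in proportion to its probability, so the final width equals the sequence probability, and roughly $\log \frac{1}{p(x_1, \ldots, x_n)} + 2$ bits suffice to identify the interval. A minimal floating-point sketch (hypothetical pmf and names; production arithmetic coders use integer arithmetic to avoid precision loss):

```python
import math

def arithmetic_interval(symbols, p):
    """Narrow [0, 1) once per symbol. For an i.i.d. source the final
    width is the product of the symbol probabilities, p(x1, ..., xn)."""
    low, width = 0.0, 1.0
    for s in symbols:
        cum = sum(p[t] for t in p if t < s)   # F(s - 1) within the interval
        low += width * cum
        width *= p[s]
    return low, width

p = {1: 0.25, 2: 0.5, 3: 0.125, 4: 0.125}     # hypothetical pmf
low, width = arithmetic_interval([2, 1, 2], p)
bits = math.ceil(math.log2(1 / width)) + 1    # enough bits to pin the interval
print(low, width, bits)                       # 0.28125, 0.0625, 5
```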