Lecture 17. Lower bound for variable-length source codes with error. Coding a sequence of symbols: Rates and scheme (Arithmetic code)


Paulina Shana Gibson
5 years ago

Lecture 17

© Himanshu Tyagi. Feel free to use with acknowledgement.

Agenda for the lecture:
Lower bound for variable-length source codes with error
Coding a sequence of symbols: Rates and scheme (Arithmetic code)
Introduction to universal codes

17.1 Variable-length source code with error

In the previous lecture we saw that variable-length source codes with error can do significantly better than their error-free counterparts. However, it is very easy to convert any error-free source code into one which allows error and gains a factor of roughly $(1 - \epsilon)$ in the average length. Therefore, as far as code design is concerned, it suffices to design error-free variable-length codes. But can we gain even more than a factor of $(1 - \epsilon)$ by a more sophisticated design? In other words, do we have a lower bound matching our upper bound on $L_\epsilon(X)$ derived in the previous lecture? Well, nearly so.

Theorem. For a discrete source $X$ with alphabet $\mathcal{X}$ and $0 < \epsilon < 1$,

$$H(X) - \log(e H(X)) - \epsilon \log|\mathcal{X}| - 1 \le L_\epsilon(X) \le (1 - \epsilon) H(X).$$

Proof. The proof is very similar to that of the earlier lower bound for error-free codes. The only problem is that the equality $H(X) = H(Y_1, \ldots, Y_n)$ does not hold anymore, since we don't have a one-to-one map between the symbol $X$ and the corresponding codeword $Y_1, \ldots, Y_n$. Nevertheless, using the codeword we can recover $X$ with probability of error less than $\epsilon$. Therefore, by Fano's inequality,

$$H(X) \le H(Y_1, \ldots, Y_n) + H(X \mid Y_1, \ldots, Y_n) \le H(Y_1, \ldots, Y_n) + \epsilon \log|\mathcal{X}| + 1.$$

The rest of the proof proceeds exactly as in the error-free case.

Coding a sequence of symbols

Up to this point, we considered coding a single symbol generated by a known source distribution. In practice, however, we don't have just one symbol but a sequence of symbols. Consider a DMS $(X_1, \ldots, X_n)$ with a common distribution $P$. Before we proceed to the schemes for encoding a sequence of symbols, let us derive some benchmark bounds.

Optimal rates

We have already seen that the optimal rate for fixed-length codes equals $H(X)$ for every $0 < \epsilon < 1$, whether the error is required to vanish or is allowed to be as large as $\epsilon$. For variable-length codes, define the optimal rate $R_\epsilon(X)$ analogously to the fixed-length case, with the minimum average length $L_\epsilon(X^n)$ replacing the fixed codeword length. The bounds obtained in the previous section give (show this yourself)

$$H(X) - \epsilon \log|\mathcal{X}| \le R_\epsilon(X) \le (1 - \epsilon) H(X).$$

In fact, an improved lower bound can be obtained and we can show that $R_\epsilon(X) = (1 - \epsilon) H(X)$.
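One way to carry out the "show this yourself" step is sketched below. It assumes that the theorem of Section 17.1 reads $H(X) - \log(eH(X)) - \epsilon\log|\mathcal{X}| - 1 \le L_\epsilon(X) \le (1-\epsilon)H(X)$, that it is applied to the block $X^n$ (alphabet size $|\mathcal{X}|^n$, entropy $nH(X)$ for a DMS), and that the rate is the limit of $L_\epsilon(X^n)/n$.

```latex
\begin{align*}
nH(X) - \log\bigl(e\,nH(X)\bigr) - \epsilon\, n \log|\mathcal{X}| - 1
  &\le L_\epsilon(X^n) \le (1-\epsilon)\, nH(X), \\
H(X) - \frac{\log\bigl(e\,nH(X)\bigr) + 1}{n} - \epsilon \log|\mathcal{X}|
  &\le \frac{L_\epsilon(X^n)}{n} \le (1-\epsilon)\, H(X).
\end{align*}
```

The term $(\log(e\,nH(X)) + 1)/n$ vanishes as $n \to \infty$, which leaves $H(X) - \epsilon\log|\mathcal{X}|$ on the left and $(1-\epsilon)H(X)$ on the right.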

Note that $R(X) := \lim_{\epsilon \to 0} R_\epsilon(X) = H(X)$. Thus, while the optimal asymptotic rate for variable-length codes is smaller than that for fixed-length codes when error is allowed, both are the same for vanishing error. Furthermore, the optimal rate with error can be obtained simply by modifying an error-free scheme using a randomized encoder. In fact, as noted earlier, we can obtain a prefix-free code of average length less than $(1 - \epsilon) H(X^n) + 1$ when error is allowed. The rate of such codes approaches the optimal rate $(1 - \epsilon) H(X)$ as $n \to \infty$.

Arithmetic code for compressing a sequence of symbols

We now explore practical schemes for compressing a sequence of symbols. As we have seen, it suffices to design schemes for the error-free case. The first, perhaps naive, scheme simply applies one of the codes designed above for one symbol to each symbol. Recall that all our schemes only guarantee that we can come within 1 bit of $H(X)$. Therefore, when we apply our schemes to each symbol $X_i$ and denote by $l(x)$ the length of the codeword assigned to $x \in \mathcal{X}$, the average length $E\left[\sum_{i=1}^n l(X_i)\right]$ can be anywhere between $nH(P)$ and $nH(P) + n$. Instead, we can treat the entire sequence $(X_1, \ldots, X_n)$ as a single symbol generated from $P^n$ and apply, say, a Huffman code to this sequence. This will yield a code of average length at most $H(P^n) + 1 = nH(P) + 1$; indeed, it will yield a code of optimal length. In fact, this is one of the main lessons of information theory: DMSs can be compressed better if we process a large number of symbols together. We also encountered this principle in fixed-length codes, where applying the code given by our single-shot results symbol-wise would not achieve the optimal rate of $H(P)$. Attaining $H(P)$ requires directly identifying small-cardinality subsets of $\mathcal{X}^n$ of large $P^n$-probability.

So, are we done here? Well, in theory, yes. But we are not even close from an implementation point of view.
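To make the gain from block coding concrete, the following sketch compares the per-symbol average length of a Huffman code applied to blocks of $n$ symbols. The source pmf `[0.9, 0.1]` and the function names are assumptions chosen for illustration only; the average length is computed as the sum of the merged node weights, so no codewords are actually built.

```python
import heapq
import itertools
import math


def huffman_average_length(probs):
    """Average codeword length (in bits) of a binary Huffman code for `probs`."""
    # The average length equals the sum of the weights of all merged
    # (internal) nodes, so the tree itself never needs to be stored.
    heap = list(probs)
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged
        heapq.heappush(heap, merged)
    return total


def block_probs(pmf, n):
    """Probabilities of all length-n sequences under the i.i.d. extension of `pmf`."""
    return [math.prod(block) for block in itertools.product(pmf, repeat=n)]


def entropy(pmf):
    """Entropy in bits."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)


pmf = [0.9, 0.1]  # hypothetical example source, not taken from the lecture
rates = {n: huffman_average_length(block_probs(pmf, n)) / n for n in (1, 2, 4, 8)}
for n, rate in rates.items():
    print(f"n={n}: table size {len(pmf) ** n}, "
          f"bits/symbol {rate:.4f}, H(P) = {entropy(pmf):.4f}")
```

For this source, symbol-wise coding ($n = 1$) spends 1 bit per symbol and blocks of two symbols already bring this down to 0.645, while the table size $|\mathcal{X}|^n$ grows exponentially in $n$.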
The main bottleneck in implementing either the Shannon-Fano or the Huffman code is sorting large sequences. Even if we somehow design the code, encoding each sequence will require us to navigate a huge look-up table, since we cannot immediately identify any additional structure in the code. A similar obstacle is faced in implementing a Shannon-Fano-Elias code as well.

This problem is solved using a variant of the Shannon-Fano-Elias code called an arithmetic code. The key feature of an arithmetic code is that it can compress a symbol $x_i$ using just the knowledge of the probability $P(x_i)$ and yet ensure that a sequence $x_1, \ldots, x_n$ is compressed to a codeword of length $\lceil -\log \bigl(P(x_1) \cdots P(x_n)\bigr) \rceil$. In fact, the coding scheme can be applied to an arbitrary distribution $P(x_1, \ldots, x_n)$ by using the distribution $P(x_i \mid x_1, \ldots, x_{i-1})$ for the $i$th symbol. Furthermore, it satisfies the First In First Out (FIFO) property, namely the symbols are decoded in the order in which they are encoded.

Arithmetic coding, too, represents codewords as subintervals of $[0, 1]$. Specifically, corresponding to a sequence $(x_1, \ldots, x_n)$, the scheme produces an interval of length $P(x_1, \ldots, x_n)$. Intervals corresponding to different sequences of the same length are disjoint. In each interval we can find a number with a binary representation of length $l(x) = \lceil -\log P(x_1, \ldots, x_n) \rceil$, which we use to represent the interval. Specifically, each interval of length at least $2^{-l(x)}$ must contain one number from the set

$$\{0,\ 2^{-l(x)},\ 2 \cdot 2^{-l(x)},\ 3 \cdot 2^{-l(x)},\ \ldots,\ 1\},$$

that is, a number whose binary expansion has at most $l(x)$ digits after the binary point. We use one of these binary sequences to uniquely represent the interval. It remains to describe the encoding process of obtaining the interval and the decoding process of recovering the sequence from the interval.

The encoder maintains an interval at each step, represented by its starting point $C$ and its width $A$, and successively updates the interval for each new symbol. At every step, a new interval is constructed by partitioning the previous interval into parts of lengths proportional to the probabilities of the symbols and moving to the part corresponding to the encoded symbol. We begin with the interval $[0, 1]$.
As such, every new interval is a subset of the previous interval. Furthermore, intervals corresponding to different sequences of the same length constitute a partition of $[0, 1)$; as the length of the sequence increases, the partition becomes successively finer.

The formal description of the encoder is as follows:

Input: A sequence of symbols $(x_1, \ldots, x_n) \in \mathcal{X}^n$ and associated pmfs $(P_1, \ldots, P_n)$ on $\mathcal{X}$.
Output: An interval $[C, C + A) \subseteq [0, 1]$; $C$ is called the current code and $A$ the augend.
1. Initialize $C = 0$ and $A = 1$.
2. For $i = 1, \ldots, n$:
   (i) update $C \leftarrow C + A \sum_{a < x_i} P_i(a)$, where, for convenience, we have assumed an ordering of the symbols of $\mathcal{X}$;
   (ii) update $A \leftarrow A \cdot P_i(x_i)$.

As an illustration, consider the sequence acbb, where each symbol is generated from the same pmf $P$ such that $P(a) = 1/8$, $P(b) = 1/2$, $P(c) = 1/4$, $P(d) = 1/8$; we assume the ordering $a < b < c < d$. For this sequence, the encoder above updates the interval as follows (all endpoints in binary):

$$[0, 0.001) \to [0.000101, 0.000111) \to [0.00010101, 0.00011001) \to [0.000101011, 0.000101111).$$

Note that the width of the final interval is $P(a)P(c)P(b)P(b) = 2^{-7}$. A number in the final interval which has a binary representation of length 9 is its starting point, $0.000101011$. Thus, the codeword corresponding to acbb is 000101011. (The shorter number $0.0001011$, of length $7 = -\log P(acbb)$, also lies in this interval.)

The decoding process simply inverts the encoding process and is accomplished by magnitude comparisons. At each iteration, we consider the partition of $[0, 1]$ with parts proportional to the probabilities used in encoding that symbol (with the same ordering as in the encoding process). The value of the symbol is obtained by simply checking which part the binary value of the codeword lies in. Once the symbol is identified, the codeword is updated by renormalizing the part of the decoded symbol to $[0, 1]$.
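The encoder and the magnitude-comparison decoder described above can be sketched with exact rational arithmetic as follows. This is a minimal illustration, not a practical implementation: it assumes a single fixed pmf for every position (the one from the acbb illustration), assumes the decoder is told the number of symbols, and all function names are hypothetical.

```python
from fractions import Fraction
from math import ceil, log2

# Pmf of the acbb illustration; the ordering a < b < c < d is assumed.
PMF = {"a": Fraction(1, 8), "b": Fraction(1, 2), "c": Fraction(1, 4), "d": Fraction(1, 8)}
ORDER = sorted(PMF)


def cumulative(symbol):
    """Total probability of the symbols strictly before `symbol` in ORDER."""
    return sum((PMF[a] for a in ORDER if a < symbol), Fraction(0))


def encode(sequence):
    """Return the final interval as (C, A): starting point and width."""
    c, a = Fraction(0), Fraction(1)
    for x in sequence:
        c += a * cumulative(x)  # step (i): move to the part belonging to x
        a *= PMF[x]             # step (ii): shrink the width
    return c, a


def decode(v, n):
    """Recover n symbols from a number v in [0, 1) by magnitude comparisons."""
    out = []
    for _ in range(n):
        x = next(s for s in ORDER if cumulative(s) <= v < cumulative(s) + PMF[s])
        out.append(x)
        v = (v - cumulative(x)) / PMF[x]  # renormalize the part of x to [0, 1)
    return "".join(out)


def to_binary(value, bits):
    """First `bits` binary digits of a fraction in [0, 1)."""
    return format(int(value * 2**bits), f"0{bits}b")


def shortest_grid_point(c, a):
    """Smallest multiple of 2^-l in [c, c + a), with l = ceil(-log2 a)."""
    # log2 goes through floating point; exact here because the widths are dyadic.
    l = ceil(-log2(a))
    return Fraction(ceil(c * 2**l), 2**l), l


C, A = encode("acbb")
print(C, A, to_binary(C, 9))
print(decode(C, 4))
point, l = shortest_grid_point(C, A)
print(to_binary(point, l), decode(point, 4))
```

For acbb this gives $C = 43/512$ (binary 000101011) and $A = 1/128$; both $C$ and the 7-bit grid point 0001011 decode back to acbb. Exact fractions sidestep the finite-precision issue, at the cost of numerators and denominators that grow with the sequence length.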

Formally, each iteration of the decoder proceeds as follows:

Input: A number $v \in [0, 1]$ and a pmf $P$ on $\mathcal{X}$.
Output: A symbol $x \in \mathcal{X}$ and the updated (codeword) number.
1. Return $x$ such that $\sum_{i=1}^{x-1} P_i \le v < \sum_{i=1}^{x} P_i$, where $P_i$ denotes the probability of the $i$th symbol in the assumed ordering;
2. update $v \leftarrow \bigl(v - \sum_{i=1}^{x-1} P_i\bigr) / P_x$.

As an illustration, consider our foregoing example once more. Upon observing the codeword 000101011, the symbols and the updated codewords at each step produced by the decoder above are as follows (all values in binary):

$$(a, 0.101011) \to (c, 0.0011) \to (b, 0.001) \to (b, 0).$$

The final value of zero represents the completion of the decoding process.

Remark.
(i) The elegant algorithm above can be easily implemented if one can multiply and divide real numbers with arbitrary precision. Of course, that is not the case. In practice, we are forced to work with finite precision. Nevertheless, variants of arithmetic coding have been implemented to work with finite precision.
(ii) Note that in our description above, we have assumed the availability of the probability model both at the encoder and the decoder. In a practical implementation, this model, too, has to be constructed recursively from the data. Even if we make the model available at the encoder, at the decoder the model pmf for decoding the next symbol must be obtained using the previously decoded symbols. We shall encounter one such method for obtaining probability models in universal source coding.

Universal source coding

By now we are in a position to implement basic data compression algorithms, provided someone gives us the probability distribution used to generate the data. In practice, of course, this distribution must be ascertained from the data itself. In general, our compression algorithms should include the process of modeling the data. Source codes which are constructed to operate without knowledge of the probability distribution generating the data are called universal source codes. Clearly, any code can be thought of as a universal source code. The key point is that we would like good performance guarantees for every reasonable probability model for the data. In particular, for this course we shall require guarantees for all i.i.d. pmfs. As before, we shall consider both fixed-length and variable-length codes. The first step is to define our measures of performance for each setup. Specifically, we define the benchmark performance that we seek from universal source codes.

Rate optimality for fixed-length universal source codes

A sequence of fixed-length source codes $C_n$ for a source alphabet $\mathcal{X}^n$ is of rate $R$ if their lengths $l_n$ satisfy

$$\lim_{n \to \infty} \frac{l_n}{n} = R.$$

We consider the limit definition for simplicity. Other variants based on $\limsup$ and $\liminf$ can also be considered and lead to the same results for i.i.d. sources. Denote by $\epsilon(P^n, C_n)$ the probability of error for the code $C_n$ under the i.i.d. pmf $P^n$ with common distribution $P$. A sequence of fixed-length source codes of rate $R$ is universally rate optimal if for every pmf $P$ on the common alphabet $\mathcal{X}$ such that $H(P) < R$, the probability of error $\epsilon(P^n, C_n)$ satisfies

$$\lim_{n \to \infty} \epsilon(P^n, C_n) = 0.$$

Minmax regret for variable-length universal source codes

For variable-length universal source codes, we pursue a slightly different benchmark of performance. For each distribution $P$ on the common alphabet $\mathcal{X}$, we define the regret for our universal code for $n$ symbols from $\mathcal{X}$ as the difference between its average length under the i.i.d. pmf $P^n$ and the average length of the optimal code designed with knowledge of $P$. To make the problem more tractable, we simply use $nH(P)$ as a proxy for the latter quantity; our previous analysis tells us that this will only take us off by at most a bit. Specifically, for a code $C$ which assigns codewords of lengths $l(x)$ to $x \in \mathcal{X}^n$, define the regret for $P$ as

$$r_n(C, P) := \sum_{x \in \mathcal{X}^n} P^n(x)\, l(x) - nH(P).$$

The worst-case regret of code $C$ is given by

$$r_n(C) := \max_{P \in \mathcal{P}(\mathcal{X})} r_n(C, P).$$

We seek to characterize the minimum of the worst-case regret over all possible uniquely decodable codes $C$ and develop codes which achieve this minmax regret. The desired notion of minmax regret is given by

$$r_n = \min_{C \in \mathcal{C}_u} r_n(C),$$

where $\mathcal{C}_u$ denotes the set of uniquely decodable codes.
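To make the regret definition concrete, the following sketch evaluates $r_n(C, P)$ by brute force over a binary alphabet. The code $C$ used here is a hypothetical choice made only for illustration: every sequence in $\{0,1\}^n$ receives a codeword of $n$ bits. The maximum over $P$ is approximated on a finite grid rather than computed exactly, and all function names are assumptions.

```python
import itertools
import math


def entropy(pmf):
    """Entropy in bits of a pmf given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)


def regret(length_of, pmf, n):
    """r_n(C, P): average length under P^n minus n H(P)."""
    avg = 0.0
    for x in itertools.product(range(len(pmf)), repeat=n):
        prob = math.prod(pmf[s] for s in x)
        avg += prob * length_of(x)
    return avg - n * entropy(pmf)


def worst_case_regret(length_of, n, grid=101):
    """Approximate max over binary pmfs P = (p, 1 - p) on a uniform grid of p."""
    ps = [k / (grid - 1) for k in range(grid)]
    return max(regret(length_of, [p, 1 - p], n) for p in ps)


n = 6


def uniform_code(x):
    """Hypothetical code: every sequence gets a codeword of len(x) = n bits."""
    return len(x)


print(regret(uniform_code, [0.5, 0.5], n))  # regret under the uniform source
print(regret(uniform_code, [0.9, 0.1], n))  # regret under a skewed source
print(worst_case_regret(uniform_code, n))   # grid approximation of r_n(C)
```

For this particular code the regret is $n(1 - H(P))$: it is zero for the uniform source and grows to $n$ as $P$ becomes degenerate, which shows why a code tuned to one distribution can have large worst-case regret.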
