EE-575 INFORMATION THEORY - SEM 092

Project Report on the Lempel-Ziv Compression Technique

Department of Electrical Engineering
Prepared by: Mohammed Akber Ali
Student ID # g200806120

King Fahd University of Petroleum & Minerals
Dhahran, Saudi Arabia

Contents

1. Introduction
2. Dictionary coding
3. Lempel-Ziv coding
4. The coding process
5. The decoding process
6. Flowchart for the coding process
7. Flowchart for the decoding process
8. Problem eg. 1.5.1 solved theoretically
9. Problem eg. 1.5.2 solved theoretically
10. Problem exc. 1.5.1 solved theoretically
11. Problem exc. 1.5.2 solved theoretically
12. Advantages, disadvantages & applications
13. Results
14. Conclusion
15. References

INTRODUCTION:

Data compression seeks to reduce the number of bits used to store or transmit information, and it encompasses a wide variety of software and hardware compression techniques. Data compression consists of taking a stream of symbols and transforming it into codes; for effective compression, the resulting stream of codes must be smaller than the original stream of symbols. For example, Huffman coding is a type of coding where the actual output of the encoder is determined by a set of probabilities. The problems here are that it uses an integral number of bits per symbol and that the symbol probabilities must be known in advance.

Well-known lossless compression techniques include:

- Run-length coding: Replace strings of repeated symbols with a count and a single copy of the symbol. Example: aaaaabbbbbbccccc -> 5a6b5c (a code sketch follows below).

- Statistical techniques:

  Huffman coding: Replace fixed-length codes (such as ASCII) with variable-length codes, assigning shorter codewords to the more frequently occurring symbols and thus decreasing the overall length of the data. When using variable-length codewords, it is desirable to create a (uniquely decipherable) prefix code, which avoids the need for a separator to determine codeword boundaries. Huffman coding creates such a code.

  Arithmetic coding: Code the message as a whole, using a floating-point number in the interval from zero to one.

  PPM (prediction by partial matching): Analyze the data and predict the probability of a character in a given context. Usually, arithmetic coding is used for encoding the data. PPM techniques yield the best results among the statistical compression techniques.

The Lempel-Ziv algorithms belong to yet another category of lossless compression techniques, known as dictionary coders. The statistical-model problem is solved by using an adaptive dictionary, which is discussed below.
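As a concrete illustration of run-length coding, here is a minimal Python sketch of the example above (illustrative only; the report's own code, described later, was written in MATLAB, and the function name here is our own):

# Minimal run-length encoder sketch for the aaaaabbbbbbccccc -> 5a6b5c
# example above. Illustrative only; not part of the original report.
from itertools import groupby

def run_length_encode(text):
    # Replace each run of a repeated symbol with its count and the symbol.
    return "".join(f"{len(list(run))}{symbol}" for symbol, run in groupby(text))

print(run_length_encode("aaaaabbbbbbccccc"))  # -> 5a6b5c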

DICTIONARY CODING:

Dictionary codes are compression codes that dynamically construct their own coding and decoding tables on the fly by looking at the data stream itself. Because of this capability, it is not necessary to know the symbol probabilities beforehand. These codes take advantage of the fact that, quite often, certain strings of symbols recur, so an entire string can be assigned a single code word. Dictionary coding techniques rely on the observation that there are correlations between parts of the data (recurring patterns). The basic idea is to replace those repetitions with (shorter) references to a "dictionary" containing the original.

(i) Static dictionary

The simplest forms of dictionary coding use a static dictionary. Such a dictionary may contain frequently occurring phrases of arbitrary length, digrams (two-letter combinations), or n-grams. This kind of dictionary can easily be built on top of an existing coding such as ASCII, by using previously unused codewords or by extending the length of the codewords to accommodate the dictionary entries. A static dictionary achieves little compression for most data sources. The dictionary can be completely unsuitable for compressing particular data, thus resulting in an increased message size (caused by the longer codewords needed for the dictionary).

(ii) Semi-adaptive dictionary

The aforementioned problems can be avoided by using a semi-adaptive encoder. This class of encoders creates a dictionary custom-tailored to the message to be compressed. Unfortunately, this makes it necessary to transmit/store the dictionary together with the data. This method also usually requires two passes over the data: one to build the dictionary and another to compress the data. A question arising with this technique is how to create an optimal dictionary for a given message. It has been shown that this problem is NP-complete (via the vertex cover problem). Fortunately, there exist heuristic algorithms for finding near-optimal dictionaries.

(iii) Adaptive dictionary

The Lempel-Ziv algorithms belong to this third category of dictionary coders. The dictionary is built in a single pass while the data is simultaneously being encoded. As we will see, it is not necessary to explicitly transmit/store the dictionary, because the decoder can build up the dictionary in the same way as the encoder while decompressing the data.

LEMPEL-ZIV CODING:

History: In 1983, Sperry filed a patent for an algorithm developed by Terry Welch, an employee at the Sperry Research Center. The algorithm is Welch's variation on a data compression technique first proposed by Jacob Ziv and Abraham Lempel in 1978; Welch's technique is both simpler and faster. He published an article in the June 1984 issue of IEEE Computer magazine describing the technique, which became very popular and was widely adopted.

LZ compression is a form of substitution compression: a specific, unique string of characters is replaced with a reference to that phrase, which is maintained in a dictionary. The data compresses because the reference to the repeated phrase is much smaller than the phrase itself. While LZ compression is very fast, it is best suited to files that contain repetitive data; text files and monochrome graphic images are ideal for LZW compression. Files that do not contain repetitive data will actually grow in size because of the LZW data dictionary. LZ compression today is in the public domain and freely available for use by anyone: the U.S. patent expired in 2003, and the European, Canadian, and Japanese patents expired in 2004.

A linked-list LZ algorithm: Following the textbook by Richard B. Wells [1], we use the algorithm given in the text, which is a mild modification of the actual LZW algorithm. The algorithm begins by defining the structure of the dictionary. Each entry in the dictionary is given an address m, and each entry consists of an ordered pair <n, a_i>, where n is a pointer to another location in the dictionary and a_i is a symbol drawn from the source alphabet. These ordered pairs in the dictionary are said to make up a linked list. The pointer variables n also serve as the transmitted code words. Because the total number of dictionary entries exceeds the number of symbols, M, in the source alphabet A, each transmitted code word actually contains more bits than it would take to represent a single symbol of the alphabet. However, most of the code words represent strings of source symbols, and in a long message it is more economical to encode these strings than to encode the individual symbols.

The Coding Process:

A dictionary is initialized to contain the single-character strings corresponding to all possible input characters (and nothing else, except the clear and stop codes if they are being used). The algorithm works by scanning through the input string for successively longer substrings until it finds one that is not in the dictionary. When such a string is found, the index of the string less its last character (i.e., the longest substring that is in the dictionary) is retrieved from the dictionary and sent to the output, and the new string (including the last character) is added to the dictionary with the next available code. The last input character is then used as the starting point for the next scan. In this way, successively longer strings are registered in the dictionary and made available for subsequent encoding as single output values. The algorithm works best on data with repeated patterns, so the initial part of a message will see little compression; as the message grows, however, the compression ratio tends asymptotically to the maximum.

The linked-list LZ algorithm uses the above principle with a vengeance, with the added twist that the strings can be of variable length. The algorithm is initialized by constructing the first M+1 entries in the dictionary as follows:

Address | Dictionary Entry
0       | <0, null>
1       | <0, a_0>
...     | ...
m       | <0, a_{m-1}>
...     | ...
M       | <0, a_{M-1}>

The 0-address entry in the dictionary is a null symbol, which helps the decoder know where strings end. The pointers n in these first M+1 entries are all zero; they point to the null entry at address 0. The initialization also sets the pointer variable n = 0 and the address pointer m = M+1, so that m points to the next blank location in the dictionary. (A minimal code sketch of this initialization follows.)
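In code, this initialization might look as follows (a minimal Python sketch for a binary alphabet; the report's implementation is in MATLAB, and the variable names here are our own):

# Dictionary initialization as in the table above: address 0 holds the null
# entry <0, null>, and addresses 1..M hold <0, a_0> .. <0, a_{M-1}>.
alphabet = ["0", "1"]                               # source alphabet A (M = 2)
entries = [(0, None)] + [(0, a) for a in alphabet]  # entries[addr] = <n, a>
n = 0                                               # pointer variable
m = len(entries)                                    # next blank address, M + 1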

After the initialization, the encoder iteratively executes the following steps:

1. Fetch the next source symbol a.
2. If the ordered pair <n, a> is already in the dictionary, then
      n = dictionary address of entry <n, a>;
   else
      transmit n,
      create the new dictionary entry <n, a> at dictionary address m,
      m = m + 1,
      n = dictionary address of entry <0, a>.
3. Return to step 1.

If <n, a> is already in the dictionary in step 2, the encoder is processing a string of symbols that has occurred at least once before. Setting the next value of n to this address constructs a linked list that allows the string of symbols to be traced. If <n, a> is not already in the dictionary in step 2, the encoder is encountering a new string that has not been processed before. It transmits the code symbol n, which tells the receiver the dictionary address of the last source symbol in the previous string. Whenever the encoder transmits a code symbol, it also creates a new dictionary entry. The encoder's dictionary building and code-symbol transmission process can be implemented as a MATLAB program; an illustrative sketch is given below.
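The following Python sketch is a minimal re-expression of steps 1-3 above (the report's actual implementation is a MATLAB GUI program; the function name lz_encode and all variable names here are our own). One assumption, consistent with the worked examples later in the report: when the input ends, the pending pointer n is transmitted as a final code word so that the decoder can recover the tail of the message.

def lz_encode(source, alphabet):
    """Linked-list LZ encoder following steps 1-3 above (illustrative sketch)."""
    entries = [(0, None)] + [(0, a) for a in alphabet]  # addresses 0..M
    lookup = {entry: addr for addr, entry in enumerate(entries) if addr > 0}
    n, codewords = 0, []
    for a in source:                     # step 1: fetch the next source symbol
        if (n, a) in lookup:             # step 2: string seen before, extend it
            n = lookup[(n, a)]
        else:                            # new string: transmit n, add <n, a>
            codewords.append(n)
            lookup[(n, a)] = len(entries)
            entries.append((n, a))
            n = lookup[(0, a)]           # restart from the root entry <0, a>
    codewords.append(n)                  # flush the final pending pointer
    return codewords, entries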

The Decoding Process:

The decoding algorithm works by reading a value from the encoded input and outputting the corresponding string from the initialized dictionary. At the same time, it obtains the next value from the input and adds to the dictionary the concatenation of the string just output and the first character of the string obtained by decoding that next input value. The decoder then proceeds to the next input value (which was already read in as the "next value" of the previous pass) and repeats the process until there is no more input, at which point the final input value is decoded without any further additions to the dictionary. In this way the decoder builds up a dictionary identical to that used by the encoder and uses it to decode subsequent input values. Thus the full dictionary does not need to be sent with the encoded data; the initial dictionary containing the single-character strings is sufficient (and is typically defined beforehand within the encoder and decoder rather than being sent explicitly with the encoded data).

In other words, the decoder at the receiver must be able to construct a dictionary identical to the encoder's, based only on the received symbol codes. The decoder performs the following iterations:

1. Reception of any code word means that a new dictionary entry must be constructed.
2. The pointer n for this new dictionary entry is the same as the received code word n.
3. The source symbol a for this entry is not yet known, since it is the root symbol of the next string (which has not yet been transmitted by the encoder).

If the address of this next dictionary entry is m, we see that the decoder can only construct a partial entry <n, ?>, since it must await the next received code word to find the root symbol a for this entry. It can, however, fill in the missing symbol in its previous dictionary entry at address m-1. It can also decode the source-symbol string associated with the received code word n. This decoding process can likewise be realized in MATLAB code; an illustrative sketch follows.
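A matching Python sketch of the decoder (again illustrative rather than the report's MATLAB code): each received code word creates a partial entry <n, ?>, the root symbol of the current string completes the previous partial entry, and each output string is recovered by tracing the linked list back to the null entry. The case where the traceback passes through the still-partial entry (the classic LZW corner case) is handled by filling the missing last symbol with the string's root symbol.

def lz_decode(codewords, alphabet):
    """Linked-list LZ decoder following the iterations above (illustrative)."""
    entries = [(0, None)] + [(0, a) for a in alphabet]  # same initial dictionary
    pending, output = None, []        # pending = address of the partial entry <n, ?>
    for n in codewords:
        # Trace the linked list back from address n to the root, collecting
        # the string's symbols in reverse order.
        syms, p = [], n
        while p != 0:
            ptr, sym = entries[p]
            syms.append(sym)
            p = ptr
        syms.reverse()
        # The root symbol of this string completes the previous partial entry.
        if pending is not None:
            entries[pending] = (entries[pending][0], syms[0])
            if syms[-1] is None:      # traceback passed through the partial entry
                syms[-1] = syms[0]    # its missing symbol is this string's root
        output.extend(syms)
        entries.append((n, None))     # new partial entry <n, ?> at the next address
        pending = len(entries) - 1
    return "".join(output), entries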

Flow chart for the LEMPEL-ZIV encoder (the original flowchart, summarized as steps):

1. Input = sequence to be coded; S = size of the input sequence.
2. Initialize the dictionary, the address pointer (Pm), the pointer variable (Pn = 0), and the remaining variables.
3. While symbols remain, consider each symbol of the input sequence one by one; set the flag ak = 1.
4. For i = 0 to the length of the dictionary, try to match the present input symbol (together with Pn) against the dictionary entries. On a match, update Pn to the address pointer of the matching entry, clear the flag, break out of the for loop, and fetch the next symbol.
5. If ak == 1 (no match), record a new dictionary entry, transmit Pn and record it in the output array, then use a for loop to find the root entry of the present symbol, update Pn to its address pointer, and increment the while-loop variable to fetch the next symbol.
6. Output: display the dictionary and the transmitted sequence.

Flow chart for the LEMPEL-ZIV decoder (the original flowchart, summarized as steps):

1. Input = received sequence to be decoded; S = size of the input sequence.
2. Initialize the dictionary, the address pointer (Pm), the pointer variable (Pn = 0), and the remaining variables.
3. For each received symbol of the input sequence, one by one:
4. For i = 0 to the length of the dictionary, match the present received symbol against the dictionary entries. If the received symbol matches an entry with Pn = 0 (i.e., a root entry), record that dictionary element directly as the decoded symbol and as a new entry, then fetch the next symbol.
5. Otherwise, record the pointer variable and treat it as an address pointer each time (using a while loop) until the root element is reached; keep track of all elements encountered in this process and update the decoded-symbol list in reverse order. Also record the root element, which is used to complete the previous partial dictionary entry.
6. Update the partial dictionary entry for the previous symbol and create a new partial dictionary entry for the current symbol; then fetch the next symbol.
7. If there is no next symbol, display the decoded sequence.

In example 1.5.1, a binary information source emits the sequence of symbols 110 001 011 001 011 100 011 11. The sequential encoding procedure is shown in the following table, along with the encoder's dictionary as it is constructed. Given that A = {0, 1}, we initialize the dictionary as shown with addresses 0 to 2; the initial values for n and m are n = 0 and m = 3. The encoder's operation for this source is as follows:

Source Symbol | Present n | Present m | Transmit | Next n | New Dictionary Entry
1 | 0  | 3  | - | 2  | -
1 | 2  | 3  | 2 | 2  | <2,1>
0 | 2  | 4  | 2 | 1  | <2,0>
0 | 1  | 5  | 1 | 1  | <1,0>
0 | 1  | 6  | - | 5  | -
1 | 5  | 6  | 5 | 2  | <5,1>
0 | 2  | 7  | - | 4  | -
1 | 4  | 7  | 4 | 2  | <4,1>
1 | 2  | 8  | - | 3  | -
0 | 3  | 8  | 3 | 1  | <3,0>
0 | 1  | 9  | - | 5  | -
1 | 5  | 9  | - | 6  | -
0 | 6  | 9  | 6 | 1  | <6,0>
1 | 1  | 10 | 1 | 2  | <1,1>
1 | 2  | 11 | - | 3  | -
1 | 3  | 11 | 3 | 2  | <3,1>
0 | 2  | 12 | - | 4  | -
0 | 4  | 12 | 4 | 1  | <4,0>
0 | 1  | 13 | - | 5  | -
1 | 5  | 13 | - | 6  | -
1 | 6  | 13 | 6 | 2  | <6,1>
1 | 2  | 14 | - | 3  | -
1 | 3  | 14 | - | 11 | -

The resulting dictionary:

Dictionary Address | Dictionary Entry
0  | <0,null>
1  | <0,0>
2  | <0,1>
3  | <2,1>
4  | <2,0>
5  | <1,0>
6  | <5,1>
7  | <4,1>
8  | <3,0>
9  | <6,0>
10 | <1,1>
11 | <3,1>
12 | <4,0>
13 | <6,1>
14 | (no entry yet)
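As a cross-check, running the lz_encode sketch from the coding-process section on this source reproduces the Transmit column of the table above, with the pending pointer n = 11 flushed as the final code word (the assumption noted with that sketch):

codes, table = lz_encode("11000101100101110001111", alphabet=["0", "1"])
print(codes)  # [2, 2, 1, 5, 4, 3, 6, 1, 3, 4, 6, 11]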

The decoding process in example 1.5.2 can be seen explicitly with the help of the table below. The decoder begins by constructing the same first three entries as the encoder; it can do this because the source alphabet is known a priori by the decoder. The address of the next dictionary entry is accordingly initialized to 3.

Received Code | Dictionary Address | Dictionary Entry | Tracing Back | Symbols Decoded
-  | 0  | <0,null>  |                         |
-  | 1  | <0,0>     |                         |
-  | 2  | <0,1>     |                         |
2  | 3  | <2,1>     | <0,1>                   | 1
2  | 4  | <2,0>     | <0,1>                   | 1
1  | 5  | <1,0>     | <0,0>                   | 0
5  | 6  | <5,1>     | <1,0>-<0,0>             | 0,0
4  | 7  | <4,1>     | <2,0>-<0,1>             | 1,0
3  | 8  | <3,0>     | <2,1>-<0,1>             | 1,1
6  | 9  | <6,0>     | <5,1>-<1,0>-<0,0>       | 0,0,1
1  | 10 | <1,1>     | <0,0>                   | 0
3  | 11 | <3,1>     | <2,1>-<0,1>             | 1,1
4  | 12 | <4,0>     | <2,0>-<0,1>             | 1,0
6  | 13 | <6,1>     | <5,1>-<1,0>-<0,0>       | 0,0,1
11 | 14 | (partial) | <3,1>-<2,1>-<0,1>       | 1,1,1

Therefore the sequence decoded is 110 001 011 001 011 100 011 11, and the dictionary constructed from the received code words is as shown above.
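Likewise, feeding these code words into the lz_decode sketch from the decoding-process section reproduces the decoded sequence and the dictionary above (an illustrative check under the same flush assumption):

decoded, table = lz_decode([2, 2, 1, 5, 4, 3, 6, 1, 3, 4, 6, 11], alphabet=["0", "1"])
print(decoded)  # 11000101100101110001111, i.e. 110 001 011 001 011 100 011 11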

In exercise problem 1.5.1, a discrete memoryless source with A = {a, b, c} emits the string bccacbcccccccccccaccca. The sequential encoding procedure is shown in the following table, along with the encoder's dictionary as it is constructed. We initialize the dictionary as shown with addresses 0 to 3; the initial values for n and m are n = 0 and m = 4. The encoder's operation for this source is as follows:

Source Symbol | Present n | Present m | Transmit | Next n | New Dictionary Entry
b | 0  | 4  | -  | 2  | -
c | 2  | 4  | 2  | 3  | <2,c>
c | 3  | 5  | 3  | 3  | <3,c>
a | 3  | 6  | 3  | 1  | <3,a>
c | 1  | 7  | 1  | 3  | <1,c>
b | 3  | 8  | 3  | 2  | <3,b>
c | 2  | 9  | -  | 4  | -
c | 4  | 9  | 4  | 3  | <4,c>
c | 3  | 10 | -  | 5  | -
c | 5  | 10 | 5  | 3  | <5,c>
c | 3  | 11 | -  | 5  | -
c | 5  | 11 | -  | 10 | -
c | 10 | 11 | 10 | 3  | <10,c>
c | 3  | 12 | -  | 5  | -
c | 5  | 12 | -  | 10 | -
c | 10 | 12 | -  | 11 | -
c | 11 | 12 | 11 | 3  | <11,c>
a | 3  | 13 | -  | 6  | -
c | 6  | 13 | 6  | 3  | <6,c>
c | 3  | 14 | -  | 5  | -
c | 5  | 14 | -  | 10 | -
a | 10 | 14 | 10 | 1  | <10,a>

At the end of the input, n = 1 and m = 15. The resulting dictionary:

Dictionary Address | Dictionary Entry
0  | <0,null>
1  | <0,a>
2  | <0,b>
3  | <0,c>
4  | <2,c>
5  | <3,c>
6  | <3,a>
7  | <1,c>
8  | <3,b>
9  | <4,c>
10 | <5,c>
11 | <10,c>
12 | <11,c>
13 | <6,c>
14 | <10,a>
15 | (no entry yet)

The decoding process in problem 1.5.2 can be seen explicitly with the help of the table below. The decoder begins by constructing the same first four entries as the encoder; it can do this because the source alphabet is known a priori by the decoder. The address of the next dictionary entry is accordingly initialized to 4.

Received Code | Dictionary Address | Dictionary Entry | Tracing Back | Symbols Decoded
-  | 0  | <0,null>  |                             |
-  | 1  | <0,a>     |                             |
-  | 2  | <0,b>     |                             |
-  | 3  | <0,c>     |                             |
2  | 4  | <2,c>     | <0,b>                       | b
3  | 5  | <3,c>     | <0,c>                       | c
3  | 6  | <3,a>     | <0,c>                       | c
1  | 7  | <1,c>     | <0,a>                       | a
3  | 8  | <3,b>     | <0,c>                       | c
4  | 9  | <4,c>     | <2,c>-<0,b>                 | b,c
5  | 10 | <5,c>     | <3,c>-<0,c>                 | c,c
10 | 11 | <10,c>    | <5,c>-<3,c>-<0,c>           | c,c,c
11 | 12 | <11,c>    | <10,c>-<5,c>-<3,c>-<0,c>    | c,c,c,c
6  | 13 | <6,c>     | <3,a>-<0,c>                 | c,a
10 | 14 | <10,a>    | <5,c>-<3,c>-<0,c>           | c,c,c
1  | 15 | (partial) | <0,a>                       | a

Therefore the sequence decoded is bccacbcccccccccccaccca, and the dictionary constructed from the received code words is as shown above.
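The same two sketches round-trip this exercise as well; encoding the ternary source and then decoding the resulting code words recovers the original string (an illustrative check, not the report's MATLAB GUI output):

codes, _ = lz_encode("bccacbcccccccccccaccca", alphabet=["a", "b", "c"])
print(codes)    # [2, 3, 3, 1, 3, 4, 5, 10, 11, 6, 10, 1]
decoded, _ = lz_decode(codes, alphabet=["a", "b", "c"])
print(decoded)  # bccacbcccccccccccaccca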

Advantages of the LZ compression technique:

- The LZ algorithm is an adaptive, universal coding scheme: the dictionary is created on the fly in a single pass, and there is no need to transmit or store it, because decompression recreates the codeword dictionary as it goes.
- LZ compression works best for files containing lots of repetitive data, which is often the case with text and monochrome images. (Conversely, files that do not contain any repetitive information at all can even grow bigger.)
- LZ compression is simple and fast, and it gives good compression.

Disadvantages of the LZ compression technique:

- The LZ technique substitutes detected repeated patterns with references to a dictionary, and the larger the dictionary, the greater the number of bits needed for each reference. The optimal size of the dictionary also varies with the type of data: the more variable the data, the smaller the optimal dictionary, so a fixed scheme does not always deliver an optimal compression ratio.
- LZ is a fairly old compression technique; all recent computer systems have the horsepower to use more efficient algorithms.

Applications of the LZ compression technique:

When it was introduced, LZ compression provided the best compression ratio among all well-known methods available at that time, and it became the first widely used universal data compression method on computers. A large English text file can typically be compressed via LZ to about half its original size. LZ was used in the program compress, which became a more or less standard utility in Unix systems circa 1986. It has since disappeared from many distributions, for both legal and technical reasons, but as of 2008 at least FreeBSD still includes both compress and uncompress as part of the distribution. Several other popular compression utilities also used LZ or closely related methods. LZW became very widely used when it became part of the GIF image format in 1987. It may also (optionally) be used in TIFF and PDF files. (Although LZ is available in Adobe Acrobat software, Acrobat by default uses the DEFLATE algorithm for most text and color-table-based image data in PDF files.)

RESULTS:

LZ encoder outputs from the GUI:

[Figure: LZ encoder GUI screenshots, not reproduced here]

LZ decoder outputs from the GUI:

[Figure: LZ decoder GUI screenshots, not reproduced here]

Conclusion:

It is somewhat difficult to characterize the results of any data compression technique, since the level of compression achieved varies considerably depending on several factors. LZ compression excels when confronted with data streams that contain any type of repeated strings; because of this, it does extremely well when compressing English text, where compression of 50% or better should be expected. In the Results section, the code is tested on examples 1.5.1 and 1.5.2 and on exercise problems 1.5.1 and 1.5.2. The code attached with this report was written in MATLAB and was successfully compiled and executed. It consists of coding and decoding routines for binary sources (0, 1) as well as other discrete memoryless sources (e.g., a, b, c). The code can be extended to discrete sources that emit more than three symbols by assigning proper ASCII values to each symbol and extending the dictionary accordingly. The code provides a graphical user interface (GUI) that makes it easy to supply any input and obtain the corresponding output.

References:

[1] Richard B. Wells, Applied Coding and Information Theory for Engineers.
[2] http://en.wikipedia.org/wiki/lzw
[3] http://marknelson.us/1989/10/01/lzw-data-compression/
[4] http://www.answers.com/topic/data-compression
[5] http://www.prepressure.com/library/compression_algorithms/lzw
[6] Christina Zeeh, "The Lempel Ziv Algorithm," Seminar Famous Algorithms, January 16, 2003.