Adaptive Huffman and arithmetic methods are universal in the sense that the encoder can adapt to the statistics of the source. But adaptation is computationally expensive, particularly when a k-th order Markov approximation is needed for some k > 2. As we know, the k-th order approximation approaches the source entropy rate as k → ∞. For example, for English text, a second order Markov approximation requires estimating the probability of all possible triplets (about 35^3 = 42,875, for an alphabet of 35 symbols = {a-z, (, ), ... etc.}), which is impractical. Arithmetic coding is inherently adaptive, but it is slow and works well for binary files. The dictionary-based methods such as the LZ family of encoders do not use any statistical model, nor do they use a variable-size prefix code. Yet they are universal, adaptive, reasonably fast, and use a modest amount of storage and computational resources. Variants of the LZ algorithm form the basis of Unix compress, gzip, pkzip, Stacker, and the compression used in modems operating at more than 14.4 Kbps.

Dictionary Models

The dictionary model allows several consecutive symbols, called phrases and stored in a dictionary, to be encoded as an address in the dictionary. Usually an adaptive model is used, where the dictionary is built from previously encoded text. As the text is compressed, previously encountered substrings are added to the dictionary. Almost all adaptive dictionary models originated from the original papers by Ziv and Lempel, which led to several families of LZ coding techniques. Here we will present a couple of those techniques.
LZ77 algorithms

The prior text constitutes the codebook or the dictionary. Rather than keeping an explicit dictionary, the decoded text up to the current time can be used as a dictionary. The figure below shows the characters just decoded, and the decoder is looking at the triplet (5,3,b): the number 5 denotes how far back to look into the already decoded text stream, the number 3 gives the length of the phrase matched beginning at the first character of the yet un-encoded part of the text, and the character b gives the next character from the input. This yields aabb as the next phrase added.

[Figure: decoded output a b a a b a b a a b b ..., above the encoded output (0,0,a) (0,0,b) (2,1,a) (3,2,b) (5,3,b) (10,1,a).]

LZ77 Algorithm with Finite Buffer

[Figure: a search buffer and a look-ahead buffer, each of size W, with position p and match length L marked in the search buffer and the symbol S following the match in the look-ahead buffer.]

Two buffers of finite size W, called the search (left) and the look-ahead (right) buffers, are connected as a shift register. The text to be encoded is shifted in from right to left, initially placing W symbols in the right buffer and filling the left buffer with the first character of the text. The information transmitted is (p, L, S), and the buffer is then shifted L+1 places left. Actually, rather than transmitting p, the offset backward in the search buffer is transmitted. The process is repeated until the text is fully encoded.

L = maximum length of the substring, starting at position p in the search buffer, that matches the substring in the look-ahead buffer beginning at position 1.
S = the next symbol after the match in the right buffer.
[Figure: four successive states of the two buffers for an example text over the alphabet (a, b, c), with one output triple per step. The (offset, length) pairs of the four outputs are (1,1), (2,1), (3,4) and (9,8); the second and fourth outputs end with the character c.]

The decoding process is quite obvious. Since the first character is not known to the decoder, the text is usually prefixed with a known dummy character agreed upon by the encoder and decoder. Also note that the pattern being matched may spill over into the look-ahead buffer, as in step 3 above.

Read 5.3 and 5.4 from K. Sayood, pp. 118-133.

A formal description of LZ77 with Sliding Window W

The main idea of the algorithm is to use a dictionary to store the strings previously encountered. The encoder maintains a sliding window W in which the inputs are shifted from right to left. The window is split into two parts. The left part is the search buffer, which is the current dictionary, holding the recently encoded characters or symbols. The right part of the window is called the look-ahead buffer and contains the text to be encoded. In a practical implementation, the size of the search buffer could be several thousand bytes (8K or 16K), whereas the look-ahead buffer is very small (less than 100 bytes).

The encoder searches the search buffer for the longest match beginning with the first character in the look-ahead buffer. The encoded output is a triple (B, l, ch), where B is the distance traversed backwards (the offset in the search buffer), l is the length of the match, and ch is the next character in the look-ahead buffer, at which the match fails. In the case l = 0, B = 0, the character ch keeps the encoding process going.
Algorithm to encode text T[1...N] with a sliding window of W characters.

To Encode

  Set p ← 1   /* p points to the next character in T to be coded */
  While there is text remaining to be encoded do
  {
    Search for T[p] in the search buffer;
    If T[p] does not appear then
      { output (0, 0, T[p]); p ← p+1 }
    Else
      { Suppose that matches occur at offsets m_1 < m_2 < ... < m_s with lengths l_1, l_2, ..., l_s.
        Let l = max(l_1, l_2, ..., l_s), attained at offset m_max = m_i for some i, 1 ≤ i ≤ s.
        If more than one l_i has the same value l, take the m_max closest to the end of the search buffer.
        Output triple (B = m_max, l, ch = T[p+l]);
        Set p ← p+l+1 }
  }
  endwhile

/* Assume that the offsets are measured in the left direction, beginning at the last character of the search buffer, while the text is always indexed in the positive direction from left to right. */

To Decode

  Set p ← 1   /* next character of T to be decoded */
  For each triple (B, l, ch) input do
  {
    If B = l = 0 then
      { T[p] ← ch; p ← p+1 }
    else
      { for i = 0 to l-1 do T[p+i] ← T[p-B+i];   /* copy one character at a time so that overlapping matches work */
        T[p+l] ← ch;
        p ← p+l+1 }
    Shift buffer contents left by l+1 places
  }

In step 2, selecting the last match rather than the first or second simplifies the encoder, since the algorithm only has to keep track of the details of the last string match. But selecting the first match (greedy approach) may make the value of the offsets smaller, and hence they can be compressed further using a statistical coder such as Huffman (such a method by Berhard is called LZH).
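The encoder and decoder above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production codec: indices are 0-based, the look-ahead buffer is taken to be unbounded, the search is a plain linear scan, and the names `lz77_encode`, `lz77_decode` and `window` are hypothetical.

```python
def lz77_encode(text, window=16):
    """Return a list of (offset, length, next_char) triples for text."""
    triples = []
    n = len(text)
    p = 0  # index of the next character to encode (0-based in this sketch)
    while p < n:
        best_len, best_off = 0, 0
        # Scan candidates from oldest to newest; ">=" keeps, on ties,
        # the match closest to the end of the search buffer.
        for cand in range(max(0, p - window), p):
            length = 0
            # Stop one character early so that a next_char always exists.
            # The match may run past p, i.e. spill into the look-ahead part.
            while p + length < n - 1 and text[cand + length] == text[p + length]:
                length += 1
            if length > 0 and length >= best_len:
                best_len, best_off = length, p - cand
        triples.append((best_off, best_len, text[p + best_len]))
        p += best_len + 1
    return triples


def lz77_decode(triples):
    """Rebuild the text from (offset, length, next_char) triples."""
    out = []
    for offset, length, ch in triples:
        for _ in range(length):
            # Copy one character at a time so overlapping matches work.
            out.append(out[-offset])
        out.append(ch)
    return "".join(out)
```

For instance, `lz77_decode(lz77_encode(s))` should return `s` for any non-empty string `s`. A triple whose length exceeds its offset, such as (1, 10, 'a'), simply repeats the most recent characters.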
Note that the string matching operation may begin in the search buffer but spill over into the look-ahead buffer, which may even make the length l bigger than the look-ahead buffer.

[Figure: an example window split into Search Buffer and Look-Ahead Buffer, with a match that runs across the boundary.]

The LZ77 method was improved in the 1980's and 1990's in several ways:

- Use a variable-size Huffman code for the length (l) and offset (B) fields. (A fixed format needs log2 B bits for the offset into the search buffer and log2 l bits to denote the length in the look-ahead buffer, where B and l here stand for the sizes of the two buffers.)
- Increase the sizes of the buffers to find longer and longer matches. The search time would increase; a more sophisticated data structure (TRIE) may improve the search time.
- Use a circular queue for the sliding window. In the sliding window, all the text characters have to be moved left after each match. A circular queue avoids this.
- Eliminate the third element of the triple (ch) by adding an extra flag bit.

Example: the different states of a 16-character buffer for the input sid-eastman-easily (example taken from David Salomon, p. 157).

  (a) sid-east
  (b) sid-eastman-easi
  (c) lid-eastman-easi
  (d) lid-eastman-easi
  (e) ly--eastman-easi
  (f) ly-teastman-easi

In each state, S marks the start point and E the end point of the queue. In (a), a 16-byte array is shown with only 8 bytes occupied. In (b), all 16 bytes are occupied. In (c), the character s has been deleted and the character l inserted; now E is located to the left of S. In (d), the two letters id have been effectively deleted, although they are still present in the buffer. In (e), the two characters y- have been appended and the pointer E moved. In (f), the pointers show that the buffer ends at teas and starts at tman.

Inserting new symbols into the circular queue and moving the pointers is thus equivalent to shifting the contents of the queue. No actual shifting or moving is necessary.

LZSS

The improved version is called LZSS. It

- uses a circular queue for the look-ahead buffer,
- holds the search buffer (the dictionary) in a binary search tree, and
- creates tokens with only 2 fields.

Example: "sid-eastman-clumsily-teases-sea-sick-seals"

  Temporary Search Buffer (16): sid-eastman-clum
  Look-Ahead Buffer (5):        sily-

The encoder scans the search buffer, creating 12 five-character strings (five being the size of the look-ahead buffer), which are stored in RAM along with a binary search tree, each node with its offset.
[Figure: binary search tree whose nodes are the 12 strings listed below, each stored with its offset.]

  String   Offset
  sid-e    16
  id-ea    15
  d-eas    14
  -east    13
  eastm    12
  astma    11
  stman    10
  tman-     9
  man-c     8
  an-cl     7
  n-clu     6
  -clum     5

The first symbol in the look-ahead buffer is 's'. Two strings beginning with 's' are found, at offsets 16 and 10, of which offset 16 leads to the longer match 'si' of length 2. The encoder emits (16,2). The window then slides two positions, so the search buffer becomes d-eastman-clumsi and the look-ahead buffer becomes ly-te. The tree is updated by deleting 'sid-e' and 'id-ea' and inserting the two new strings 'clums' and 'lumsi'. Note that the words deleted are always from the top addresses in RAM, and the words added are at the bottom of the RAM. This statement is true in general for a longer k-letter match, where the window has to be shifted k positions.

A simple procedure to update the tree is to take the first 5-letter word in the search buffer, find it in the tree, delete it, slide the buffer by one position to the right, prepare a string consisting of the last 5 letters in the search buffer, and add this to the tree. This has to be repeated k times. If the tree becomes unbalanced after several insertions and deletions, an AVL tree can be used. Note that the number of nodes in the tree remains constant.

The token created has only two elements. If no match is found, the character is transmitted without any change, together with a flag. The flags could be collected in 1 byte so that 8 tokens are transmitted together. A typical size for the search buffer is 2 to 8 Kbytes, and for the look-ahead buffer 32 bytes.
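The token format of LZSS can be illustrated with a short Python sketch. It is only a sketch under assumptions that differ from the description above: the search buffer is scanned linearly instead of being held in a binary search tree, flags are kept inside each token rather than packed eight to a byte, and matches shorter than `min_match` are sent as literals. All function and parameter names are hypothetical.

```python
def lzss_encode(text, search_size=16, lookahead_size=5, min_match=2):
    """Return tokens: (1, offset, length) for a match, (0, char) for a literal."""
    tokens = []
    p = 0
    while p < len(text):
        best_len, best_off = 0, 0
        for cand in range(max(0, p - search_size), p):
            length = 0
            # A match is limited to the size of the look-ahead buffer.
            while (length < lookahead_size and p + length < len(text)
                   and text[cand + length] == text[p + length]):
                length += 1
            if length > best_len:
                best_len, best_off = length, p - cand
        if best_len >= min_match:
            tokens.append((1, best_off, best_len))  # flag 1: two-field token
            p += best_len
        else:
            tokens.append((0, text[p]))  # flag 0: raw character
            p += 1
    return tokens


def lzss_decode(tokens):
    """Rebuild the text from the flagged tokens."""
    out = []
    for tok in tokens:
        if tok[0] == 1:
            _, offset, length = tok
            for _ in range(length):
                out.append(out[-offset])
        else:
            out.append(tok[1])
    return "".join(out)
```

With the defaults, the first 16 characters of `sid-eastman-clumsily-teases-sea-sick-seals` are emitted as literals, and the next token is the match with offset 16 and length 2 for `si`.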
LZ78 (Lempel-Ziv-78)

One of the major drawbacks of LZ77 is the implicit assumption that like patterns occur close together, so that they can be found during the string matching operation. If the like patterns are separated by gaps longer than the buffer size, LZ77 will not compress at all.

[Figure: an extreme example in which the search buffer and the look-ahead buffer hold a text whose repetitions are farther apart than the buffer size.]

In such a case there will be no string match, and each character will be sent with a flag, leading to expansion rather than compression. As another example, say the word "economy" occurs many times in the text, but the occurrences are sufficiently far apart that it will never be compressed. A better strategy is to store the commonly occurring strings in a dictionary rather than letting them slide away. This means there is no window to limit how far back the substrings can be referenced. This is the basic principle of LZ78, which builds up a dictionary of common phrases. The decoder performs the identical operation, creating the same dictionary dynamically and in sync. The output is a sequence of tokens consisting of two items <i, c>, where i is a pointer (address) into the dictionary and c is the next character.

LZ78 Algorithm

The family of LZ algorithms uses an adaptive dictionary-based scheme to compress text strings. The basic idea is to replace a substring of the text with a pointer (initially 0) into a table (codebook or dictionary) where that substring occurred previously.

[Figure: the input divided into the string already parsed, the longest substring already in the table at location j, and the new symbol S.]

Transmit (j, S) and repeat the process beginning at the next symbol after S. Enter the longest substring concatenated with S at location (current pointer + 1). Initialize j = 0.
Example

Message: aa_bbb_cccc_ddddd_e

  Pointer   Longest substring   Transmitted information (j, S)
  1         a                   (0, a)
  2         a_                  (1, _)
  3         b                   (0, b)
  4         bb                  (3, b)
  5         _                   (0, _)
  6         c                   (0, c)
  7         cc                  (6, c)
  8         c_                  (6, _)
  9         d                   (0, d)
  10        dd                  (9, d)
  11        dd_                 (10, _)
  12        e                   (0, e)

The decoder can build an identical table at the receiving end.

LZ78 can be looked upon as a parsing of the input string into phrases, which are entered in the dictionary.

[Table: a second example string parsed into five phrases, listing phrase number, phrase, and output token; the tokens refer to phrase numbers 0, 0, 1, 2 and 4 respectively.]

Here phrase number 0 stands for the null phrase.

Using a table to store the phrases is not very storage efficient. A more efficient method is to use a data structure called a TRIE (or digital search tree), as shown below. The characters of each phrase specify a path from the root of the TRIE to the node that contains the number of the phrase. The characters to be encoded are used to traverse the TRIE until the path is blocked, either because there is no onward path for the indicated character or because a leaf node is reached. The node at which the block occurs gives the phrase number for output. The blocking character is appended to the output, and a new node is created corresponding to the new phrase in the codebook or dictionary.
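The TRIE traversal just described can be sketched in Python as follows. This is a minimal sketch, assuming the TRIE is stored as a hash table keyed by (node number, character); the names `lz78_encode`, `lz78_decode` and `trie` are hypothetical, and the empty-string token at the end is one possible convention for input that stops inside a known phrase.

```python
def lz78_encode(text):
    """Return a list of (phrase_number, char) tokens for text."""
    trie = {}   # (parent phrase number, char) -> phrase number
    tokens = []
    node = 0    # phrase 0 is the null phrase (root of the TRIE)
    for ch in text:
        if (node, ch) in trie:
            node = trie[(node, ch)]           # follow the existing path
        else:
            tokens.append((node, ch))         # path blocked: emit token
            trie[(node, ch)] = len(trie) + 1  # new phrase gets next number
            node = 0
    if node != 0:
        tokens.append((node, ""))  # input ended inside a known phrase
    return tokens


def lz78_decode(tokens):
    """Rebuild the text, growing the same phrase list as the encoder."""
    phrases = [""]  # phrase 0 is the null phrase
    out = []
    for number, ch in tokens:
        phrase = phrases[number] + ch
        phrases.append(phrase)
        out.append(phrase)
    return "".join(out)
```

For a message such as `aa_bbb_cccc_ddddd_e`, the encoder emits twelve tokens, the first four being (0, a), (1, _), (0, b), (3, b), and the decoder recovers the message from them.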
[Figure: TRIE for the phrase dictionary, with root 0 and phrase nodes numbered 1 upward.]

If the input alphabet is large, the TRIE may have several pointers emanating from each node, which gives rise to the problem of allocating enough storage at each node for all possible future pointers. A linked list data structure to represent a sparse pointer array may do a better job. A faster and simpler method is to use a hash table in which the current node number and the next input character are hashed to determine where the next node can be found.

The TRIE data structure continues to grow as coding proceeds, and eventually it may become too large. Several strategies can be used when memory is full:

- Remove the TRIE and initialize the process again.
- Stop any further updates, at the cost of less compression.
- Partially rebuild it using only the last few hundred bytes of coded text, so that some knowledge from prior adaptation is retained.

Encoding for LZ78 is faster than for LZ77, but decoding is slower since the decoder must store the parsed phrases. One variant of the LZ78 scheme, called LZW, has been used widely in compression systems.

LZW (Lempel-Ziv-Welch Algorithm)

The main difference between LZW and LZ78 is that the encoding consists of a string of phrase numbers only; the explicit next characters are not part of the output. This is done by initializing the dictionary or the TRIE with all letters of the alphabet.

[Figure: TRIE for Example 1, with root 0 and nodes 1 to 9.]

Example 1: T = abcabbcabba. The dictionary D is initialized with three nodes 1, 2 and 3 corresponding to the alphabet A = (a, b, c).

Encoding

  a is in D, ab not in D: add 4, output 1
  b is in D, bc not in D: add 5, output 2
  c is in D, ca not in D: add 6, output 3
  ab is in D, abb not in D: add 7, output 4
  bc is in D, bca not in D: add 8, output 5
  abb is in D, abba not in D: add 9, output 7
  a is in D: output 1

Parsing: a b c ab bc abb a
Encoder output: 1234571

The decoder does the reverse operation. It starts with the initial dictionary D and keeps adding new nodes as it receives the node sequence from the encoder.

Decode 1234571

  1: output a; a is in D
  2: output b; ab not in D, add 4
  3: output c; bc not in D, add 5
  4: output ab; ca not in D, add 6
  5: output bc; abb not in D, add 7
  7: output abb; bca not in D, add 8
  1: output a; abba not in D, add 9

Note how the decoder creates a new node. Immediately after putting out the output, it creates a string: the last phrase concatenated with the first character of the current phrase. If this string is not in the dictionary, it creates a new node with the next available number.
[Figure: TRIE for Example 2, with root 0 and nodes 1 to 9.]

Example 2: T = abaababbaabaabaa. Here the dictionary is initialized with the two nodes 1 = a and 2 = b. The parsing is a b a ab ab ba aba abaa, and the phrases added are 3 = ab, 4 = ba, 5 = aa, 6 = aba, 7 = abb, 8 = baa, 9 = abaa. Note that the encoder has used phrase 9 immediately after it was constructed. The final output of the encoder is: 12133469

Decoding

The decoding will proceed smoothly up to the number 6, producing the output a b a ab ab ba aba and creating phrases up to 8 in the dictionary, but the decoder does not know what phrase 9 is! Fortunately, the decoder knows how the new phrase begins: phrase 9 is the last phrase, aba, followed by one character x that is not yet known, i.e. abax. The text must therefore look like ...aba.abax, where the second part is phrase 9 itself. But x is by construction the first character of the phrase that follows aba, and that phrase is phrase 9, which begins with a. Hence x is a, phrase 9 must be abaa, and decoding can proceed. Whenever a phrase is referenced as soon as the encoder has created it, the last character of the phrase must be the same as its first character.

Despite this little problem in decoding, LZW works well, giving good compression and an efficient implementation. The following description of the algorithm is based on the description in WMB [1990]. Note that ++ means concatenation.

Encoding Algorithm

  1    Set p = 1   /* p is an index into the text T[1...N] */
  2    For d = 0 to q-1 do D[d] = d   /* D is the TRIE; assume the alphabet A = (0, 1, 2, ..., q-1) is represented by numbers, which also denote the first q nodes or phrase numbers */
  3    d = q-1   /* d points to the last entry in the dictionary; the next node number starts at q */
  4    While the input stream is not exhausted do
  4.1    Trace the TRIE D to find the longest match beginning at T[p]. Suppose the match terminates at phrase number c and the length of the match is k
  4.2    Output code c
  4.3    d = d+1   /* add a new entry to the TRIE */
  4.4    p = p+k
  4.5    Set D[d] = D[c] ++ T[p]   /* create a new phrase by joining the last phrase with the first character of the next phrase; skipped when p > N */
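The encoding steps can be sketched in Python as below. This is a minimal sketch, not the implementation from the cited description: it assumes every character of the text belongs to the given alphabet, numbers the initial phrases from 1 (as in the worked examples) rather than from 0, and stores the dictionary as a plain hash table of phrases instead of a TRIE. The name `lzw_encode` is hypothetical.

```python
def lzw_encode(text, alphabet):
    """Return the list of phrase numbers for text (phrases numbered from 1)."""
    # Initial dictionary: one phrase per alphabet symbol.
    table = {ch: i + 1 for i, ch in enumerate(alphabet)}
    codes = []
    phrase = ""  # longest match found so far
    for ch in text:
        if phrase + ch in table:
            phrase += ch  # keep extending the match
        else:
            codes.append(table[phrase])          # output the matched phrase
            table[phrase + ch] = len(table) + 1  # new phrase: match + next char
            phrase = ch                          # next match starts here
    if phrase:
        codes.append(table[phrase])  # flush the final match
    return codes
```

As a usage example, `lzw_encode("abcabbcabba", "abc")` yields the codes 1, 2, 3, 4, 5, 7, 1, and `lzw_encode("abaababbaabaabaa", "ab")` yields 1, 2, 1, 3, 3, 4, 6, 9, where the last code refers to a phrase created in the immediately preceding step.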
LZW Algorithm

This algorithm eliminates the need to transmit the next character, as is done in the LZ78 algorithm. The dictionary is initialized to contain all characters in the alphabet. New phrases are added to the dictionary by appending the first character of the next phrase. The algorithm is best described by using a trie data structure to represent all distinct phrases in the dictionary. The algorithm is illustrated below.

  Alphabet = (a, b, c)
  Text = abcabbcabba
  Transmitted message = 1234571

[Figure: the trie after each step of the encoding, growing from nodes 1 to 3 up to nodes 1 to 9, followed by the final trie and its height-balanced binary tree with branches labeled 0 and 1.]

  Transmitted code = 1234571 = 001001100101010011000
Decoding Algorithm

Steps 1, 2 and 3 are the same as in encoding, setting up the initial TRIE or dictionary. Let the code sequence be S = c_1 c_2 ... c_k.

  Step 4: Decode c_1: output D(c_1)
  Step 5: for j = 2 to k do
  begin
    If c_j is in D then
      { output D(c_j);
        create new_phrase by concatenating D(c_{j-1}) with the first character of D(c_j), if this phrase is not in D }
    else
      { new_phrase = D(c_{j-1}) ++ F(c_{j-1});
        output new_phrase }   /* F(c_{j-1}) is the first character of the last phrase decoded */
    d = d+1; D(d) = new_phrase   /* enter the new phrase number in D */
  end

LZW has been fine-tuned and has several variants. The Unix compress is one such variant. Compress uses a variable-length code to represent the phrase number and puts a maximum limit on the size of the phrase number. If the compression performance degrades afterwards, the dictionary is rebuilt from scratch.
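A Python sketch of the decoding steps, including the special case of a code that is not yet in the dictionary, might look as follows. It assumes a non-empty code sequence produced by an LZW encoder that numbers the initial phrases from 1 in alphabet order; the name `lzw_decode` is hypothetical.

```python
def lzw_decode(codes, alphabet):
    """Rebuild the text from LZW phrase numbers (phrases numbered from 1)."""
    table = {i + 1: ch for i, ch in enumerate(alphabet)}
    prev = table[codes[0]]  # Step 4: the first code is always a single symbol
    out = [prev]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        else:
            # The code names the phrase the encoder has just created:
            # it must be the last phrase plus that phrase's first character.
            entry = prev + prev[0]
        # New phrase: last phrase + first character of the current phrase.
        table[len(table) + 1] = prev + entry[0]
        out.append(entry)
        prev = entry
    return "".join(out)
```

For example, `lzw_decode([1, 2, 1, 3, 3, 4, 6, 9], "ab")` reaches code 9 before phrase 9 exists in the table, takes the `else` branch, and still recovers the text.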