Improving Information Retrieval System Security via an Optimal Maximal Coding Scheme

Dongyang Long

Department of Computer Science, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong SAR, PRC. dylong@cs.cityu.edu.hk
Department of Computer Science, Zhongshan University, Guangzhou 510275, PRC. dylong500@yahoo.com

Abstract. Novel maximal coding compression techniques for the text file, the most important file of any full-text retrieval system, are discussed in this paper. As a continuation of our previous work, we show that the optimal maximal coding schemes coincide with the optimal uniquely decodable coding schemes. An efficient algorithm generating an optimal maximal code (or an optimal uniquely decodable code) is also given. As with Huffman codes, the problem of breaking an optimal maximal code is further investigated from both the computational-difficulty and the information-theoretic-impossibility points of view. Because the Huffman codes form a proper subclass of the optimal maximal codes, optimal maximal coding is well suited to large information retrieval systems and consequently improves system security.

1 Introduction

Huffman coding [9] has been widely used in data, image, and video compression [1-6, 11-15]. The idea of using data compression schemes for encryption is very old, dating back at least to Roger Bacon in the 13th century [5]. The field of data compression has grown vigorously since Huffman's algorithm was published in 1952. Rubin [13] and Jones [11] present ways in which data compression algorithms may be used as encryption techniques. Klein et al. [6] have discussed the cryptographic properties of Huffman codes in the context of a large, compressed natural-language database on CD-ROM. Based on the same problem, Fraenkel and Klein [4] have proven that, given a natural-language cleartext and a ciphertext obtained by Huffman coding, the complexity of guessing the Huffman code is NP-complete. Gillman et al.
[5] have also considered the problem of deciphering a file that has been Huffman coded but not otherwise encrypted, from the information-theoretic-impossibility rather than the computational-difficulty point of view. They find

* This work was partially sponsored by the Open Project of the State Key Laboratory of Information Security (SKLOIS) (project No. 0-0), the National Natural Science Foundation of China (project No. 6007056), and the Guangdong Provincial Natural Science Foundation (project No. 0074).
that a Huffman code can be surprisingly difficult to cryptanalyze. The authors of [7,8] have introduced novel optimal uniquely decodable, prefix, maximal prefix, and maximal coding schemes. We have shown that every Huffman code has to be an optimal uniquely decodable, prefix, maximal prefix, and maximal code. Conversely, none of the optimal uniquely decodable, prefix, maximal prefix, and maximal codes is necessarily a Huffman code. To see the difference between the four types of optimal codes above and Huffman codes, we first consider the following example.

Example 1. Let an information source I = (S, P) with S = {s_1, s_2, s_3, s_4, s_5, s_6}, P = {0.26, 0.24, 0.14, 0.13, 0.12, 0.11}, and input alphabet Σ = {0, 1}. Table 1 shows two Huffman codes and two non-Huffman codes.

Table 1. Two Huffman and two non-Huffman codes

Source letter  Probability  Huffman Code C1  Huffman Code C2  Code C3  Code C4
s_1            0.26         00               11               00       00
s_2            0.24         10               01               01       10
s_3            0.14         010              100              100      001
s_4            0.13         011              101              101      101
s_5            0.12         110              000              110      011
s_6            0.11         111              001              111      111

According to Huffman's algorithm, the codewords of the source letters s_1 and s_2 must start with different bits, but in C3 they both start with 0. The code C3 is therefore impossible to generate by any re-labeling of the nodes of a Huffman tree; that is, C3 cannot be generated by the Huffman method! We easily verify that C3 is an optimal uniquely decodable, prefix, maximal prefix, and maximal code. The code C4 is clearly not a prefix code (00 is a proper prefix of 001), and consequently it cannot be an optimal prefix or maximal prefix code. But we easily calculate

2^(-l(00)) + 2^(-l(10)) + 2^(-l(001)) + 2^(-l(101)) + 2^(-l(011)) + 2^(-l(111)) = 1/4 + 1/4 + 1/8 + 1/8 + 1/8 + 1/8 = 1.

Thus, C4 being uniquely decodable (it is the letter-by-letter reversal of the prefix code C3 and hence a suffix code), the Kraft-equality criterion for maximal codes in [14] shows that C4 is a maximal code. Since C4 has the same average codeword length (2.5 bits) as the Huffman code C1, C4 is not only an optimal maximal code but also an optimal uniquely decodable code (by the corresponding theorems in [7] and [8]). Example 1 shows that the class of Huffman codes is a proper subclass of the above four types of optimal codes, and that an optimal uniquely decodable code (maximal code) differs in general from an optimal prefix code (maximal prefix code).
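The arithmetic behind Example 1 can be checked mechanically. The following sketch (Python; the probabilities and bit patterns are the reconstructed values of Table 1) verifies that C1, C3, and C4 all have average codeword length 2.5 bits, that the Kraft sum of C4 equals 1, and that C4 is not a prefix code:

```python
# Example 1, as reconstructed in Table 1.
P  = [0.26, 0.24, 0.14, 0.13, 0.12, 0.11]
C1 = ["00", "10", "010", "011", "110", "111"]   # a Huffman code
C3 = ["00", "01", "100", "101", "110", "111"]   # optimal prefix, non-Huffman
C4 = ["00", "10", "001", "101", "011", "111"]   # reversal of C3: non-prefix

def avg_len(code, probs):
    # expected codeword length: sum_i p_i * l(c_i)
    return sum(p * len(c) for p, c in zip(probs, code))

def kraft_sum(code):
    # McMillan/Kraft sum; equals 1 exactly when a UD binary code is maximal
    return sum(2.0 ** -len(c) for c in code)

def is_prefix_code(code):
    return not any(a != b and b.startswith(a) for a in code for b in code)

for code in (C1, C3, C4):
    print(round(avg_len(code, P), 2), kraft_sum(code), is_prefix_code(code))
# each code averages 2.5 bits; C4 has Kraft sum 1 but is not a prefix code
```

Note that the check confirms only optimality and maximality; unique decodability of C4 follows from its being a suffix code, as argued in the text.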
Motivated by the same problem as breaking a Huffman code [4,5], the problem of breaking an optimal uniquely decodable code (maximal code) will be presented. Because there is quite a difference between the uniquely decodable code and the
prefix code, breaking an optimal prefix code (maximal prefix code) will be investigated in a separate paper. Additionally, although terms and notions such as Huffman coding (encoding), Huffman code, optimal code, and optimal prefix code are easily found in the literature [2,3,14], the relationships between these concepts have remained rather vague and have not been detailed yet.

2 Optimal Uniquely Decodable, Optimal Maximal, and Huffman Codes

In general, the class of maximal codes is much smaller than the class of uniquely decodable codes [1,10]. Optimal uniquely decodable codes and optimal maximal codes, however, are strongly connected. Further relations between them are given below. First, we have the following Theorem 2.1.

Theorem 2.1. Every optimal maximal code has to be an optimal uniquely decodable code.
Proof: The details of the proof of the theorem are omitted here. #

Conversely, the following result holds.

Theorem 2.2. Every optimal uniquely decodable code has to be maximal.
Proof: We first show that Theorem 2.2 is true for the alphabet {0, 1}. Suppose that (S, P) is a finite information source. Let C = {c_1, c_2, ..., c_n} be an optimal uniquely decodable code with l(c_1) = l_1, l(c_2) = l_2, ..., l(c_n) = l_n and l_1 ≤ l_2 ≤ ... ≤ l_n. Without loss of generality, suppose that C is an optimal prefix code. In fact, for the uniquely decodable code C there exists a prefix code D such that D has the same sequence of codeword lengths as C; by the definitions, it is easy to verify that D is optimal. Assume that P = {p_1, p_2, ..., p_n} with p_1 ≥ p_2 ≥ ... ≥ p_n. We will show that C is a maximal prefix code by reduction to absurdity. Suppose that C is not a maximal prefix code. By the definitions, there exists at least one word c ∈ {0, 1}^+ \ C such that C ∪ {c} is still a prefix code. When l(c) < l(c_i) = l_i for some i, we easily construct a prefix code C' = (C \ {c_i}) ∪ {c} = {c_1, ..., c, ..., c_n} whose average codeword length is less than that of the optimal prefix code C.
Therefore, this is impossible. Thus we have l_1 ≤ l_2 ≤ ... ≤ l_n ≤ l(c). According to the choice of the word c, we can take c to satisfy l(c) = l_n; otherwise, we easily get such a word by replacing c with its prefix c' of length l_n.
Now, let c = d0 or c = d1, where d is the proper prefix of c of length l_n - 1. If c = d0 and the word d1 is not in C, then, since C ∪ {c} is a prefix code, C ∪ {d} is a prefix code as well, and consequently C'' = {c_1, c_2, ..., c_{n-1}, d} is also a prefix code. Clearly, the average codeword length of the prefix code C'' is less than that of the optimal prefix code C. This is impossible too. Similarly, when c = d1 and the word d0 is not in C, we get a contradiction too. Therefore, without loss of generality, assume that c = d0 ∉ C and the word d1 ∈ C. Since C is a prefix code, the set C'' = (C \ {d1}) ∪ {d} is also a prefix code. By l_n = l(c) = l(d0) = l(d) + 1, the average codeword length of the prefix code C'' = (C \ {d1}) ∪ {d} is p_1 l_1 + p_2 l_2 + ... + p_{n-1} l_{n-1} + p_n (l_n - 1). Clearly, p_1 l_1 + p_2 l_2 + ... + p_{n-1} l_{n-1} + p_n (l_n - 1) < p_1 l_1 + p_2 l_2 + ... + p_{n-1} l_{n-1} + p_n l_n, which is the average codeword length of the optimal prefix code C. This contradicts C being an optimal prefix code. Combining the above discussion, we have that C is a maximal prefix code; that is, an optimal prefix code has to be maximal. Next, consider alphabets with more than two letters; the details of the proof are omitted here. Combining the above discussion, we have that C is a maximal code. #

Therefore, by Theorems 2.1 and 2.2 we immediately get Corollary 2.1 below.

Corollary 2.1. Optimal uniquely decodable codes coincide with optimal maximal codes.

Remark 2.1. It is very interesting that the word "optimal" concerns the economy of a code. As seen in [10], if C is a maximal code then every codeword occurs as part of a message, hence no part of the words over the alphabet is wasted, and every optimal uniquely decodable code has to be a maximal code. This particular property does not hold for general coding schemes. Note that in all the following sections, "optimal code," "optimal uniquely decodable code," and "optimal maximal code" are only different names for the same thing.
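Corollary 2.1 rests on two checkable properties of a finite code: unique decodability and completeness (Kraft sum equal to 1). A minimal sketch of both checks, using the Sardinas-Patterson test discussed later in Section 4 (the function names here are ours, not from [1]):

```python
def is_uniquely_decodable(code):
    """Sardinas-Patterson test: a finite set of words is a uniquely
    decodable (UD) code iff no 'dangling suffix' derived from it is
    itself a codeword."""
    code = set(code)
    def dangling(xs, ys):
        # suffixes w with x = y + w for some x in xs, y in ys, x != y
        return {x[len(y):] for x in xs for y in ys if x != y and x.startswith(y)}
    suffixes = dangling(code, code)
    seen = set()
    while suffixes:
        if suffixes & code:
            return False
        seen |= suffixes
        suffixes = (dangling(code, suffixes) | dangling(suffixes, code)) - seen
    return True

def is_maximal_binary(code):
    """For a UD binary code, Kraft sum 1 means complete, i.e. maximal."""
    return is_uniquely_decodable(code) and sum(2.0 ** -len(c) for c in code) == 1

print(is_maximal_binary(["00", "10", "001", "101", "011", "111"]))  # C4: True
print(is_uniquely_decodable(["0", "01", "10"]))   # ambiguous ("010"): False
```

The Sardinas-Patterson test runs in polynomial time for a fixed candidate code; the hardness discussed in Section 4 concerns guessing which code was used, not verifying one.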
Although Huffman codes are a proper subclass of maximal codes, Theorem 2.3, which follows, shows the near relationship between Huffman codes and maximal codes. We will omit the details of the proof of Theorem 2.3.

Theorem 2.3. If C is any maximal code over Σ, then there exist a suitable information source I = (Σ, P) and a Huffman code H for I such that C has the
same average codeword length as H, and consequently C is an optimal code for I = (Σ, P).

Remark 2.2. By the corresponding theorem in [7], we have that a Huffman code has to be a maximal code. Conversely, making use of Theorem 2.3, for any maximal code H we are able to construct a suitable probability distribution P (i.e., a suitable information source I = (Σ, P), the alphabet Σ being determined by H) such that H is exactly a Huffman code for I = (Σ, P). Therefore, ranging over all probability distributions P, a maximal code can always be considered as a Huffman code. In other words, the maximal coding schemes are very near to the Huffman coding schemes. In addition, for a special information source with a dyadic [6] probability distribution, we easily construct an optimal maximal code; i.e., we have Theorem 2.4 below. The proof of the theorem is also omitted.

Theorem 2.4. Let I = (Σ, P) be a finite information source with a dyadic probability distribution P = {2^(-l_1), 2^(-l_2), ..., 2^(-l_n)} with l_1 ≤ l_2 ≤ ... ≤ l_n. Then any maximal code C = {c_1, c_2, ..., c_n} satisfying the conditions l(c_1) = l_1, l(c_2) = l_2, ..., and l(c_n) = l_n is an optimal maximal code for I = (Σ, P).

3 Application to Data Compression

As the simplest example, consider a special file A^3 B^4 A^90 B^3 over the alphabet {A, B}. Regardless of the probabilities, Huffman coding will assign a single bit to each of the letters A and B, giving no compression; thus the encoded file 0^3 1^4 0^90 1^3 is 100 bits. But with a dictionary method rather than traditional statistical modeling, we can take a maximal coding such that A^3 B^4 A^89 -> 1, AB -> 01, and BB -> 00, where {1, 01, 00} is clearly a maximal code. The encoded file 10100 is then only 5 bits. Therefore, we get a compression ratio of 100/5 = 20:1. As a larger example, we will encode the file M: STATUS REPORT ON THE FIRST ROUND OF THE DEVELOPMENT OF THE ADVANCED ENCRYPTION STANDARD.
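Before turning to M, the toy dictionary example above can be replayed directly. A sketch, assuming the reconstructed exponents A^3 B^4 A^90 B^3:

```python
# Toy file from Section 3: A^3 B^4 A^90 B^3 (100 letters).
text = "A" * 3 + "B" * 4 + "A" * 90 + "B" * 3

# Letter-level coding cannot beat 1 bit per letter for a binary alphabet:
letter_bits = len(text)  # 100 bits

# Dictionary method: three source words mapped onto the maximal code {1, 01, 00}.
phrase = "A" * 3 + "B" * 4 + "A" * 89
codebook = {phrase: "1", "AB": "01", "BB": "00"}
parse = [phrase, "AB", "BB"]
assert "".join(parse) == text            # the parse reproduces the file
encoded = "".join(codebook[w] for w in parse)
print(encoded, len(encoded), letter_bits // len(encoded))  # 10100 5 20
```

The gain comes entirely from the word model; the code {1, 01, 00} itself is an ordinary complete prefix code.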
By making use of Table 2, in traditional statistical modeling we easily calculate that the average codeword length of the block code C5 is 5 bits/symbol, and that the average codeword length of a Huffman code C6 is 341/87 ≈ 3.92 bits/symbol. Furthermore, the file encoded by the block code C5 and by the Huffman code C6 will take up 87 × 5 = 435 bits and 87 × 341/87 = 341 bits, respectively. Thus the compression ratio is 435/341 ≈ 1.28:1. We now consider the code D1 in Table 2. It is easy to verify that the code D1 is an optimal code without being a Huffman code. In fact, since the word 011 is a proper prefix of the word 0111, D1 is not a prefix code and hence not a Huffman code (every Huffman code is a prefix code). Clearly, the code D1 has the same average codeword length as the Huffman code C6, thus D1 is an optimal
code. By the results of Section 2, D1 achieves the same compression ratio 435/341 ≈ 1.28:1 as the Huffman code C6. From Table 3 it follows directly that the two codes D2 and D3 are not prefix codes, and that D3 is an optimal code without being a Huffman code. By direct calculation, D2 and D3 achieve compression ratios of 435/90 ≈ 4.83:1 and 435/74 ≈ 5.88:1, respectively (compared with the block code C5 in Table 2), and of 27 × 4/90 = 1.2:1 and 27 × 4/74 ≈ 1.46:1, respectively (compared with the block code C8 in Table 3).

Table 2. An optimal coding

Source Letter  Probability  Block Code C5  Optimal Code D1  Huffman Code C6
(Space)        13/87        00000          011              110
T              10/87        00001          110              011
E              9/87         00010          010              010
N              7/87         00011          0111             1110
O              7/87         00100          1111             1111
D              6/87         00101          0001             1000
R              6/87         00110          1001             1001
A              5/87         00111          1100             0011
S              4/87         01000          0100             0010
C              2/87         01001          111101           101111
F              3/87         01010          00101            10100
H              3/87         01011          10101            10101
P              3/87         01100          01101            10110
I              2/87         01101          00000            00000
U              2/87         01110          10000            00001
V              2/87         01111          01000            00010
L              1/87         10000          011000           000110
M              1/87         10001          111000           000111
Y              1/87         10010          011101           101110

Table 3. Coding schemes based on source words

Source Word    Probability  Code D2  Optimal Code D3  Huffman Code C7  Block Code C8
(Space)        13/27        000      0                0                0000
THE            3/27         100      001              100              0001
ON             1/27         010      0101             1010             0010
ENCRYPTION     1/27         0111     01111            11110            1010
STANDARD       1/27         1111     11111            11111            1011
ADVANCED       1/27         1011     10111            11101            1001
STATUS         1/27         110      00011            11000            0100
REPORT         1/27         1001     10011            11001            0101
FIRST          1/27         0101     01011            11010            0110
ROUND          1/27         1101     11011            11011            0111
DEVELOPMENT    1/27         0011     00111            11100            1000
OF             2/27         0001     1101             1011             0011
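The bit counts quoted for Tables 2 and 3 can be reproduced from the message M alone, because the total Huffman-encoded length equals the sum of the internal node weights of the Huffman tree and is therefore independent of how ties are broken. A sketch (treating each space as its own source word, as in Table 3):

```python
import heapq
from collections import Counter

M = ("STATUS REPORT ON THE FIRST ROUND OF THE DEVELOPMENT "
     "OF THE ADVANCED ENCRYPTION STANDARD")

def huffman_bits(freqs):
    """Total encoded length in bits: repeatedly merge the two lightest
    weights; the running sum of merged weights is the Huffman cost."""
    heap = list(freqs)
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        total += a + b
        heapq.heappush(heap, a + b)
    return total

letters = Counter(M)                   # 19 distinct symbols, 87 in all
print(5 * len(M), huffman_bits(letters.values()))               # 435 341

tokens = Counter(M.split())            # 14 words: THE x3, OF x2, 9 others
tokens[" "] = M.count(" ")             # plus 13 space tokens -> 27 in all
print(4 * sum(tokens.values()), huffman_bits(tokens.values()))  # 108 74
```

The two printed Huffman costs, 341 and 74 bits, give exactly the ratios 435/341 ≈ 1.28:1 and 435/74 ≈ 5.88:1 quoted in the text.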
4 Breaking an Optimal Maximal Code

As seen from [4], from the computational-difficulty point of view the following Theorem 4.1 easily follows.

Theorem 4.1. Given an original file and a corresponding file encoded by the optimal coding, the complexity of guessing the optimal code is NP-complete.

We further have the following Theorem 4.2 (see [8]).

Theorem 4.2. Let M be a file encoded by a uniquely decodable coding (maximal coding) and let the length of M be m. Then M is encoded by at most 2^m uniquely decodable codes (maximal codes).

Note that Theorem 4.1 is an immediate corollary of Theorem 4.2. Moreover, in the proof of Theorem 4.2 we assume that it is simple to verify that a given set of words is a uniquely decodable code. But, in fact, it is very difficult to decide whether a finite set of words is a uniquely decodable code [1], even though there is the Sardinas and Patterson algorithm (see [1]). Therefore, from the computational-complexity point of view, breaking an optimal maximal code is much more difficult than the complexity provided by Theorem 4.2 suggests.

Next, from the information-theoretic-impossibility point of view [5], we discuss the problem of breaking an optimal uniquely decodable code. First, an efficient algorithm generating an optimal maximal code is given below.

Theorem 4.3. For a given finite information source there exists an efficient algorithm constructing an optimal uniquely decodable code.
Proof: The details of the proof of the theorem are omitted here. #

Note that the optimal codes constructed as in Theorem 4.3 are not Huffman codes in general. Additionally, as seen in [5], we easily verify that breaking a file encoded by an optimal uniquely decodable code can be surprisingly difficult.

5 Conclusion

As we have seen from [6], Huffman codes are well suited to use in a large information retrieval system.
Important for a large information retrieval system is the issue of the cryptographic security of storing the text in compressed form, as might be required for copyrighted material. In the usual approach to full-text retrieval, the processing of queries does not directly involve the original text files (in which key words may be located using some pattern-matching technique), but rather the auxiliary dictionary and concordance files. An optimal maximal coding scheme based on the words of the original file is suitable for storing these auxiliary dictionary and concordance files.
On the other hand, although adaptive Huffman coding [2,15] and Lempel-Ziv coding [2] are preferred in some real-time applications and for communication, they are not suitable for storing a large body of static text. Finally, we have seen that the Huffman codes form a proper subclass of the uniquely decodable codes. From Theorems 4.1 and 4.2, it easily follows that breaking an optimal uniquely decodable code is much more difficult than breaking a Huffman code. Therefore, the cryptographic security of a large information retrieval system will be further improved by compressing it with an optimal uniquely decodable code.

References

1. Berstel, J., Perrin, D.: Theory of Codes. Academic Press, Orlando (1985)
2. Bell, T.C., Cleary, J.G., Witten, I.H.: Text Compression. Prentice Hall, Englewood Cliffs, NJ (1990)
3. Cover, T., Thomas, J.: Elements of Information Theory. Wiley, New York (1991)
4. Fraenkel, A.S., Klein, S.T.: Complexity Aspects of Guessing Prefix Codes. Algorithmica, Vol. 12 (1994), 409-419
5. Gillman, D.W., Mohtashemi, M., Rivest, R.L.: On Breaking a Huffman Code. IEEE Trans. Inform. Theory, Vol. 42 (1996), 972-976
6. Klein, S.T., Bookstein, A., Deerwester, S.: Storing Text-Retrieval Systems on CD-ROM: Compression and Encryption Considerations. ACM Trans. Inform. Syst., Vol. 7 (1989), 230-245
7. Long, D., Jia, W.: Optimal Maximal Encoding Different From Huffman Encoding. Proc. of the International Conference on Information Technology: Coding and Computing (ITCC 2001), Las Vegas, IEEE Computer Society (2001), 493-497
8. Long, D., Jia, W.: On the Optimal Coding. Advances in Multimedia Information Processing, Lecture Notes in Computer Science 2195, Springer-Verlag, Berlin (2001), 94-101
9. Huffman, D.A.: A Method for the Construction of Minimum-Redundancy Codes. Proc. IRE, Vol. 40 (1952), 1098-1101
10. Jürgensen, H., Konstantinidis, S.: Codes. In: G. Rozenberg, A. Salomaa (eds.), Handbook of Formal Languages, Vol. 1, Springer-Verlag, Berlin Heidelberg (1997), 511-607
11. Jones, D.W.: Application of Splay Trees to Data Compression. Communications of the ACM, Vol. 31 (1988), 996-1007
12. Linder, T., Tarokh, V., Zeger, K.: Existence of Optimal Prefix Codes for Infinite Source Alphabets. IEEE Trans. Inform. Theory, Vol. 43 (1997), 2026-2028
13. Rubin, F.: Cryptographic Aspects of Data Compression Codes. Cryptologia, Vol. 3 (1979), 202-205
14. Roman, S.: Introduction to Coding and Information Theory. Springer-Verlag, New York (1996)
15. Vitter, J.S.: Design and Analysis of Dynamic Huffman Codes. Journal of the ACM, Vol. 34 (1987), 825-845