Clusterng lgorthm for Chnese dectves and ouns Yang Wen, Chunfa Yuan, Changnng Huang 2 State Key aboratory of Intellgent Technology and System Deptartment of Computer Scence & Technology, Tsnghua Unversty, Beng 00084, P.R.C. 2 Mcrosoft Research, Chna Emal: ycf@s000e.cs.tsnghua.edu.cn Key Words: bdrectonal herarchcal clusterng, collocatons, mnmum descrpton length, collocatonal ree, revsonal dstance bstract Ths paper proposes a bdrctonal herarchcal clusterng algorthm for smultaneously clusterng words of dfferent parts of speech based on collocatons. The algorthm s composed of cycles of two knds of alternate clusterng processes. We construct an obectve functon based on Mnmum Descrpton ength. To partly solve the problem caused by sparse data two concepts of collocatonal ree and revsonal dstance are presented. Introducton Recently research on the compostonal frames (classfcaton and collocatonal relatonshp of words) for Chnese words has been descrbed n J et al. (996)[], J (997)[2]. The obectve of ther work s to obtan the clusters of words of dfferent parts of speech and to derve the collocatonal relatonshp between dfferent clusters from the collocatonal relatonshp between words of dfferent categores. There are two ways to construct the clusters: One s to get clusters from thesaurus classfed manually by lngusts. But the fact s that words wth the same meanngs do not always have the same ablty of collocatng wth other words. The method sn t ft for the P problems under our consderaton. nother way s to get clusters automatcally by computng on the dstrbuton envronments of words based on statstcal method. The dstrbuton envronment of a word s the set of words of other parts of speech that can be collocated wth t. We employ the second method n our work. Prevous research usually gets the clusters of words of a certan part of speech based on ther dstrbuton envronments. But Supported by atonal atural Scence Foundaton of Chna (6977303) and 973 Proect (G998030507).
we accept the assumpton that the clusterng processes of words of dfferent parts of speech are nherently related. For example, havng collocatons between Chnese adectves and nouns and f we take on nouns as enttes and adectves as features of nouns' dstrbuton envronments, we can obtan clusters of nouns and vce versa. The key of the relatonshp of the two clusterng processes s that they use the same collocatons. Therefore we consder the queston of clusterng the nouns and adectves smultaneously. s work shows that they optmze the clusterng results based on ths vewpont ( et al., 997)[3]. But they don't explan how to get ntal clusters and ther scale of problem s too small. In ths paper, we propose an algorthm named bdrectonal herarchcal clusterng to attempt answerng the queston. 2 Concepts 2. Problem Descrpton Our problem can be descrbed as follows: gven the set of adectves, the set of nouns and the collocaton nstances, our system wll construct a partton P over and a partton P over that respectvely contan sets of nouns and sets of adectves. nd both parttons meet the condton that words n the same set (called cluster) have smlar semantc dstrbuton envronment. 2.2 Parttons and Clusters et S be a set, S S( =,2,, n). If P S ={ S } satsfes that n S = () = S (2) S S =,, =,2, n, P S s a partton over S. In ths paper, we call adectve cluster and P an P a noun cluster. nd we want to obtan the composton of parttons < P, P > as the clusterng result. 2.3 Dstance between Clusters In order to measure the dstance between clusters of the same part of speech, we use the followng equatons: ds ds (, ) = () Ψ Ψ (, ) = (2) Ψ Ψ where envronment of whch can be collocated wth dstrbuton envronment of s the dstrbuton and s make up of nouns. Ψ s the and s composed of adectves whch can be collocated wth. and Ψ follow smlar defntons. Ths dstance s a knd of Eucldean dstance.
2.4 Collocatonal Degree Snce redundant collocatons mght be created durng clusterng, the concept collocatonal ree s used to measure the collocatonal relatonshp between a cluster and ts dstrbuton envronment. The collocatonal ree s defned as the rato of the exstng collocaton nstances between the cluster and ts dstrbuton envronment to all possble collocatons generated by them. Thus, { aφ a, φ, aφ C} = and (3) { nψ n, ψ Ψ, nψ C} = (4) Ψ where C s the set of all exstng nstances. 2.5 Redundant Rato fter we get the collocatonal ree of a cluster, redundant rato (marked as r) s calculated to measure the whole performance of the clusterng result. We defne the redundant rato as mnus the rato of all exstng nstances to all possble collocatons generated by all clusters (ncludng nouns and adectves) and ther dstrbuton envronments. So r s calculated as 2C r = (5) + Ψ 3 Bdrectonal Herarchcal Clusterng lgorthm Usually a herarchcal clusterng algorthm [7] constructs a clusterng tree by combnng small clusters nto large ones or dvdng large clusters nto small ones. The bdrectonal herarchcal clusterng algorthm proposed by us s composed of two knds of alternate clusterng processes. follows: The algorthm flow s descrbed as ) Intally, regard every noun and adectve each as a cluster. Calculate the dstances between clusters of the same part of speech. 2) Suppose wthout loss of generalty that we choose to cluster nouns frst. Select two noun clusters & of the mnmum dstance and ntegrate them nto a new one. 3) Calculate the collocatonal ree of the new cluster. dust the sequence numbers of the orgnal clusters and the relatonal nformaton of adectve clusters. 4) Calculate the dstances between the new cluster and other clusters. 5) Repeat from step 2) to 4) untl the satsfacton of certan condton. For example, the number of the clusters has decreased to certan amount. 2 6) Smlarly, we can follow the same steps from 2) to 5) for constructng adectve clusters, completng one cycle of clusterng processes of nouns and adectves. 7) Repeat from step 2) to 6) untl the 2 In ths paper, we set the proporton s 20%.
obectve functon 3 reaches the mnmum value. One advantage of ths algorthm s that: when two clusters of nouns have smlar dstrbuton envronments, they mght be classfed nto one cluster. Ths nformaton can be delvered to the clusters of adectves that respectvely collocate wth them by the clusterng process of nouns. Thus these clusters of adectves have great possblty to be combned nto one cluster, whle the ordnary herarchcal clusterng algorthm can not do t. 4 n Obectve Functon Based on MD The obectve functon s desgned to control the processes of clusterng words based on the Mnmum Descrpton ength (MD) prncple. ccordng to MD, the best probablty model for a gven set of data s a model that uses the shortest code length for encodng the model tself and the gven data relatve to t [4] [5]. We regard the clusters as the model for the collocatons of adectves and nouns. The obectve functon s defned as the sum of the code length for the model ( model descrpton length ) and that for the data ( data descrpton length ). When the clusterng result mnmses the obectve functon, the bdrectonal processes should be stopped and the result s the best probable one. The obectve functon based on MD trade-offs between the smplcty of a model and ts accuracy n fttng to the data, whch are respectvely quantfed by the model descrpton length and the data descrpton length. 3 Descrbed later n secton 4. The followng are the formulas to calculate the obectve functon : = mod + dat (6) mod s the model descrpton length calculated as mod k k = log2 k k k = = log ( k k 2 ) + = log 2 + k (7) Where k and k respectvely denote the number of clusters of adectves and nouns. + means that the algorthm needs one bt to ndcate whether the collocatonal relatonshp between the two clusters exsts. dat s composed of the data descrpton length of adectves and that of nouns, namely dat = dat ( ) + dat ( ) (8) nd the two types of data descrpton length are calculated as follows dat dat (0) k ( ) = log (9) φ 2 = kk = k φ, and k ( ) = ψ Ψ k log 2 = kk = k ψ, Ψ and k 5 Our Experment We take the words and collocatons
gathered n 's Thesaurus [6] to test our algorthm. From 's thesaurus, we obtan 2,569 adectves, 4,536 nouns and 37,346 collocatons between adectves and nouns. Table shows results of usng 5 dfferent revsonal dstance formulas dscussed n the next secton. Because the length of ths paper s lmted, we only gve some examples (0 clusters for each part of speech) of clusters n secton 8. We can see that the redundant rato obvously decreases by usng the revsonal dstance, and the result that has the lowest redundant rato corresponds of the mnmum value of the obectve functon. By human evaluaton, most clusters contan the words that have smlar meanngs and dstrbuton envronments. So our algorthm proves to be effectve for word clusterng based on collocatons. 6 Dscussons 6. Rvsonal Dstance Table : Results of dfferent revsonal dstances Revsonal dstance k k r ot used ds = ln ds! " # $ % & % ' $ ' ( & ) ( * d s = ds / + ' +, ), $ % & % % $ - ' & - ' * d s = ds / + - +, ' ( $ % & %. - ' % & + ) * ds = ds ln + ),,, - $ % & % % - ' % & % ' * When we combne clusters nto a new cluster, ther dstrbuton envronments wll be combned as well. The combnaton of clusters and ther dstrbuton envronments mght very lkely generate redundant collocatons that are not lsted n the thesaurus. Wth the word clusterng processes gong on, there mght be more and more redundant collocatons. They wll obvously affect the accuracy of the dstances between clusters. When calculatng the dstances, the redundant collocatons must be consdered. So the queston s how to revse the dstance equaton. otce that the collocatonal ree defned n the above measures the collocatonal relatonshp However, the redundant rato s stll very large. The man cause s that exstng nstances are too sparse, coverng only 0.32% of all possble collocatons. So another advantage of our algorthm s that we can acqure many new reasonable collocatons not gathered n the thesaurus. If we add the new collocatons nto ntal thesaurus and execute the algorthm on new data set, the performance wll have great potental to mprove. It s further work that can be carred out n the future. between a cluster and ts dstrbuton envronment. Obvously under the same dstance, clusters havng hgher collocatonal ree have more hgher smlarty between each other (because they have more actual collocatons) than those havng lower collocatonal ree. So the collocatonal ree can be used to revse the dstance equatons. There are two problems that should be consdered when we desgn the revsonal dstance equatons. The frst one s to convert
the collocatonal rees of two clusters nto one collocatonal ree as the revsonal factor for dstance equatons. It s the average collocatonal ree, marked as, calculated by + = / () ( + ) and Ψ + Ψ = / (2) ( + ) Ψ Ψ In fact t s the collocatonal ree of the new cluster nto whch f we assume combnng the two orgnal clusters. The second problem s that the revsonal dstance equatons should keep coherent of monotoncty wth the orgnal dstance. It means that under the same average collocatonal ree, the revonal dstance should keep the same (or opposte) monotoncty wth the orgnal dstance, and under the same orgnal dstance, the revonal dstance should keep the same (or opposte) monotoncty wth the average collocatonal ree. In ths paper, four smple revsonal dstance equatons are presented based on consderaton of the upper two problems. They are: a) ds = ln ds b) d s = c) d s = ds ds d) ds = dsln Where d s denotes the revonal dstance and ds denotes the orgnal dstance. From the comparson of the upper dfferent results (shown n Table ), we can draw the concluson that usng revsonal dstance equatons can ncrease the clusterng accuracy remarkably. 6.2 Determnant of Obectve Functon's Mnmum Value The clusterng algorthm termnates when the obectve functon s mnmzed. s a result t s very mportant to fnd out the functon s mnmum value. fter analyzng the obectve functon, we fnd that t normally monotoncally declnes wth clusterng processes gong on untl t gets mnmzed. t the begnnng, there are a large number of clusters wth only one element n each of them. So the model descrpton length s qute large whle the data descrpton length s qute small. Because the clusterng process s herarchcal, every tme when the combnaton occurs the number of clusters wll decrease by one wth the model descrpton length s decreasng as well. t the same tme the number of a certan cluster s elements wll ncrease by one wth the data descrpton length s ncrement as well. However, the decrement s larger than the ncrement and t s gettng smaller whle the ncrement s gettng larger. In ths way, the obectve functon declnes untl the obectve functon reach ts
mnmum value. If we contnue to execute the algorthm, we wll see that the value of the obectve functon rses very fast lke as s shown n Fgure. addton, the clusterng algorthm may help to fnd new collocatons that are not n the thesaurus. Ths algorthm can also be extended to other collocaton models, such as verb-noun collocatons. f g h k l m n o p k q r s t u k v w x k y t g z k f { y t g r { q ƒ ~ } ˆ ˆ ˆ ˆ Š ˆ ˆ ˆ ˆ ˆ ˆ Œ Ž š œ ž Ÿ Ÿ Therefore we choose a farly smple way to avod the appearance of the local optmum: When there are two consecutve ncreases n the obectve functon durng one clusterng process, stop the process and start another one. When two consecutve clusterng processes are stopped due to the same reason, we assume that we have got the mnmum value and stop the whole clusterng process. In our future work we wll try to fnd a better way to determne the mnmum value of the obectve functon. 7 Concluson & Future Work In ths paper we have presented a bdrctonal herarchcal clusterng algorthm of smultaneously clusterng Chnese adectves and nouns based on ther collocatons. Our prelmnary experments show that t can dstngush dfferent words by ther dstrbuton envronments. In Our future work ncludes: ) Because the sparsty of collocatons s a man factor of affectng the word clusterng accuracy, we can use the clusterng results to dscover new data and enrch the thesaurus. 2) s there are yet no adustments to the herarchcal clusterng results, we are consderng usng some teratve algorthm, such as K- means algorthm, to optmse the clusterng results. 8 ttachment (Examples) We gve 0 clusters of each part of speech clustered by our algorthm (usng revsonal dstance formula b) as follows: 8. Chnese dectve clusters (0 of 383) 0 2 3 4 5 6 7 8 9 : ; < = >? @ B C D E F G H I J K M O P Q R S T U V W X Y Z [ \ ] ^ _ ` a b c d e
ª «± ² ³ µ ¹ º» ¼ ½ ¾ À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö Ø Ù Ú Û Ü Û Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ø ù ú û ü ý þ ÿ! " # $ % & ' ( ) * +, -. / 0 2 3 4 5 6 7 8 9 : ; < = >? @ B C D E F G H I J K J I M I O M K P J Q K R S T U V W V X Y Z [ \ ] ^ _ ` a b c d e f g h k l m n o p q r s t u v w x y z { } ~ ƒ ˆ Š Œ Ž š œ ž Ÿ ª «± ² ³ µ ¹ º» ¼ ½ ¾ À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö Ö Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ø ù ú û ü ý þ ý ÿ Communcatons of COCIPS 6(): P25- P33, 996 [2] Donghong J, Computatonal Research on Issues of excal Semantcs, Postdoctoral Research Report, Tsnghua Unversty, P4~P26, 997 [3] Juanz et al., Two-dmensonal Clusterng Based on Compostonal Examples, anguage Engneerng, P64-P69, Tsnghua Unversty Press, 997 [4] Hang & aok be Clusterng Words wth the MD Prncple cmplg/960504 v2 996 [5] We Xu, The Study of Syntax- Semantcs Integrated Chnese Parsng, Thess for the Degree of Master n Computer Scence, Tsnghua Unversty, 997 [6] Wene et al., Modern Chnese Thesaurus, Chna People Press, 984 [7] Zhaoq Ban et al., Pattern Recognton, Tsnghua Unversty Press, 997 References [] Donghong J & Changnng Huang, Semantc Composton Model for Chnese oun and dectve,