
A Clustering Algorithm for Chinese Adjectives and Nouns

Yang Wen 1, Chunfa Yuan 1, Changning Huang 2
1 State Key Laboratory of Intelligent Technology and Systems, Department of Computer Science & Technology, Tsinghua University, Beijing 100084, P.R.C.
2 Microsoft Research, China
Email: ycf@s1000e.cs.tsinghua.edu.cn

Key Words: bidirectional hierarchical clustering, collocations, minimum description length, collocational degree, revisional distance

Supported by the National Natural Science Foundation of China (6977303) and the 973 Project (G1998030507).

Abstract

This paper proposes a bidirectional hierarchical clustering algorithm for simultaneously clustering words of different parts of speech based on collocations. The algorithm is composed of cycles of two kinds of alternate clustering processes. We construct an objective function based on Minimum Description Length. To partly solve the problems caused by sparse data, two concepts, the collocational degree and the revisional distance, are presented.

1 Introduction

Recently, research on the compositional frames (classification and collocational relationship of words) for Chinese words has been described in Ji et al. (1996)[1] and Ji (1997)[2]. The objective of their work is to obtain clusters of words of different parts of speech and to derive the collocational relationship between different clusters from the collocational relationship between words of the different categories. There are two ways to construct the clusters. One is to take the clusters from a thesaurus classified manually by linguists; but words with the same meaning do not always have the same ability to collocate with other words, so this method is not fit for the NLP problems under our consideration. The other way is to obtain clusters automatically by computing over the distribution environments of words with statistical methods; the distribution environment of a word is the set of words of other parts of speech that can be collocated with it. We employ the second method in our work. Previous research usually builds the clusters of words of a certain part of speech based on their distribution environments.

But we accept the assumption that the clustering processes of words of different parts of speech are inherently related. For example, given the collocations between Chinese adjectives and nouns, if we take the nouns as entities and the adjectives as features of the nouns' distribution environments, we can obtain clusters of nouns, and vice versa. The key to the relationship between the two clustering processes is that they use the same collocations. Therefore we consider the question of clustering the nouns and the adjectives simultaneously. Li et al. (1997)[3] optimize the clustering results based on this viewpoint, but they do not explain how to obtain the initial clusters, and the scale of their problem is too small. In this paper, we propose an algorithm named bidirectional hierarchical clustering to attempt to answer the question.

2 Concepts

2.1 Problem Description

Our problem can be described as follows: given the set A of adjectives, the set N of nouns and the collocation instances, our system constructs a partition P_N over N and a partition P_A over A that respectively contain sets of nouns and sets of adjectives. Both partitions meet the condition that words in the same set (called a cluster) have similar semantic distribution environments.

2.2 Partitions and Clusters

Let S be a set and S_i ⊆ S (i = 1, 2, …, n). If P_S = {S_i} satisfies

    ∪_{i=1}^{n} S_i = S   and   S_i ∩ S_j = ∅ for i ≠ j (i, j = 1, 2, …, n),

then P_S is a partition over S. In this paper we call A_i ∈ P_A an adjective cluster and N_i ∈ P_N a noun cluster, and we want to obtain the pair of partitions <P_A, P_N> as the clustering result.

2.3 Distance between Clusters

In order to measure the distance between clusters of the same part of speech, we use the following equations:

    ds(A_i, A_j) = |Φ_i ∪ Φ_j| − |Φ_i ∩ Φ_j|    (1)
    ds(N_i, N_j) = |Ψ_i ∪ Ψ_j| − |Ψ_i ∩ Ψ_j|    (2)

where Φ_i is the distribution environment of A_i and is made up of the nouns which can be collocated with A_i, and Ψ_i is the distribution environment of N_i and is composed of the adjectives which can be collocated with N_i; Φ_j and Ψ_j follow similar definitions. This distance is a kind of Euclidean distance: it equals the squared Euclidean distance between the binary indicator vectors of the two distribution environments.
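To make the set-based definitions above concrete, here is a minimal Python sketch (not the authors' implementation) that builds distribution environments from a list of adjective-noun collocation instances and computes the cluster distance in the symmetric-difference form of equations (1) and (2) as reconstructed above; the toy data and all function names are illustrative assumptions.

    from itertools import combinations

    # Toy collocation instances (adjective, noun); illustrative only.
    collocations = [("big", "house"), ("big", "city"), ("small", "house"),
                    ("red", "apple"), ("green", "apple"), ("small", "city")]

    def environment(cluster, instances, side):
        """Distribution environment of a cluster: the words of the other part
        of speech that collocate with at least one member of the cluster."""
        if side == "adj":   # adjective cluster -> environment is a set of nouns
            return {n for a, n in instances if a in cluster}
        else:               # noun cluster -> environment is a set of adjectives
            return {a for a, n in instances if n in cluster}

    def cluster_distance(c1, c2, instances, side):
        """Distance between two clusters of the same part of speech: size of the
        symmetric difference of their distribution environments, i.e. the
        reconstructed form of equations (1)/(2)."""
        e1 = environment(c1, instances, side)
        e2 = environment(c2, instances, side)
        return len(e1 | e2) - len(e1 & e2)

    # Every adjective starts as a singleton cluster; print pairwise distances.
    adjectives = {a for a, _ in collocations}
    for a1, a2 in combinations(sorted(adjectives), 2):
        print(a1, a2, cluster_distance({a1}, {a2}, collocations, "adj"))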

2.4 Collocational Degree

Since redundant collocations might be created during clustering, the concept of the collocational degree is used to measure the collocational relationship between a cluster and its distribution environment. The collocational degree is defined as the ratio of the existing collocation instances between the cluster and its distribution environment to all possible collocations generated by them. Thus

    D_{A_i} = |{ aφ : a ∈ A_i, φ ∈ Φ_i, aφ ∈ C }| / (|A_i| · |Φ_i|)    (3)
    D_{N_i} = |{ nψ : n ∈ N_i, ψ ∈ Ψ_i, nψ ∈ C }| / (|N_i| · |Ψ_i|)    (4)

where C is the set of all existing collocation instances.

2.5 Redundant Ratio

After we obtain the collocational degree of each cluster, the redundant ratio (marked as r) is calculated to measure the overall performance of the clustering result. We define the redundant ratio as 1 minus the ratio of all existing instances to all possible collocations generated by all clusters (both noun and adjective clusters) and their distribution environments. So r is calculated as

    r = 1 − 2|C| / ( Σ_i |A_i|·|Φ_i| + Σ_j |N_j|·|Ψ_j| )    (5)

(the factor 2 appears because every instance is counted once on the adjective side and once on the noun side of the denominator).

3 Bidirectional Hierarchical Clustering Algorithm

Usually a hierarchical clustering algorithm [7] constructs a clustering tree by combining small clusters into large ones or by dividing large clusters into small ones. The bidirectional hierarchical clustering algorithm proposed here is composed of two kinds of alternate clustering processes. The algorithm flow is described as follows:

1) Initially, regard every noun and every adjective as a cluster of its own. Calculate the distances between clusters of the same part of speech.
2) Suppose, without loss of generality, that we choose to cluster nouns first. Select the two noun clusters N_i and N_j with the minimum distance and merge them into a new cluster.
3) Calculate the collocational degree of the new cluster. Adjust the sequence numbers of the original clusters and the relational information of the adjective clusters.
4) Calculate the distances between the new cluster and the other clusters.
5) Repeat steps 2) to 4) until a certain condition is satisfied, for example, until the number of clusters has decreased to a certain amount (in this paper we set the proportion to 20%).
6) Similarly, follow the same steps 2) to 5) to construct adjective clusters, completing one cycle of clustering processes over nouns and adjectives.
7) Repeat steps 2) to 6) until the objective function (described in Section 4) reaches its minimum value.

One advantage of this algorithm is that when two clusters of nouns have similar distribution environments, they might be classified into one cluster, and this information is delivered by the noun clustering process to the adjective clusters that respectively collocate with them. Those adjective clusters thus have a high possibility of being combined into one cluster, which an ordinary hierarchical clustering algorithm cannot achieve.
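The collocational degree (3)-(4), the redundant ratio (5) and one hierarchical pass of steps 2)-5) can be sketched in Python as follows, reusing environment and cluster_distance from the sketch in Section 2.3. This is an illustrative outline under the reconstructed formulas, not the authors' implementation; the 20% stopping proportion follows the remark in step 5).

    def collocational_degree(cluster, env, instances, side):
        """Ratio of existing collocation instances between a cluster and its
        distribution environment to all possible collocations (eqs. (3)/(4))."""
        if side == "adj":
            existing = sum(1 for a, n in instances if a in cluster and n in env)
        else:
            existing = sum(1 for a, n in instances if n in cluster and a in env)
        possible = len(cluster) * len(env)
        return existing / possible if possible else 0.0

    def redundant_ratio(adj_clusters, noun_clusters, instances):
        """1 minus the ratio of existing instances to all collocations generated
        by all clusters and their distribution environments (eq. (5))."""
        possible = (sum(len(c) * len(environment(c, instances, "adj")) for c in adj_clusters)
                    + sum(len(c) * len(environment(c, instances, "noun")) for c in noun_clusters))
        return 1 - 2 * len(instances) / possible if possible else 0.0

    def one_pass(clusters, instances, side, stop_fraction=0.20):
        """One hierarchical pass (steps 2-5): repeatedly merge the two closest
        clusters of one part of speech until only stop_fraction of them remain."""
        target = max(1, int(len(clusters) * stop_fraction))
        while len(clusters) > target:
            i, j = min(((i, j) for i in range(len(clusters))
                        for j in range(i + 1, len(clusters))),
                       key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]],
                                                      instances, side))
            merged = clusters[i] | clusters[j]
            clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        return clusters

Alternating one_pass over the noun clusters and the adjective clusters gives one cycle of step 6); in a fuller implementation the distances would be cached and updated incrementally, as step 4) suggests.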

4 An Objective Function Based on MDL

The objective function is designed to control the word clustering processes based on the Minimum Description Length (MDL) principle. According to MDL, the best probability model for a given set of data is the model that uses the shortest code length for encoding the model itself and the given data relative to it [4][5]. We regard the clusters as the model for the collocations of adjectives and nouns. The objective function is defined as the sum of the code length for the model (the model description length) and that for the data (the data description length). When the clustering result minimises the objective function, the bidirectional processes should be stopped and the result is taken as the best one. The objective function based on MDL trades off between the simplicity of a model and its accuracy in fitting the data, which are respectively quantified by the model description length and the data description length.

The following are the formulas used to calculate the objective function L:

    L = L_mod + L_dat    (6)

L_mod is the model description length, calculated as

    L_mod = Σ_{i=1}^{k_A} Σ_{j=1}^{k_N} ( log2 k_A + log2 k_N + 1 )    (7)

where k_A and k_N respectively denote the number of clusters of adjectives and of nouns; the "+1" means that the algorithm needs one bit to indicate whether the collocational relationship between the two clusters exists.

L_dat is composed of the data description length of the adjectives and that of the nouns, namely

    L_dat = L_dat(A) + L_dat(N)    (8)

and the two kinds of data description length are calculated as follows:

    L_dat(A) = Σ_{k=1}^{k_A} Σ_{φ ∈ Φ_k} log2 |A_k|    (9)
    L_dat(N) = Σ_{k=1}^{k_N} Σ_{ψ ∈ Ψ_k} log2 |N_k|    (10)
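As a sketch of how the two description lengths interact, the following Python function computes L under the description lengths reconstructed in equations (6)-(10); since those equations are reconstructions, the exact form used in the paper may differ, and this is only an assumption-laden illustration (it reuses environment from the earlier sketch).

    import math

    def description_length(adj_clusters, noun_clusters, instances):
        """MDL-style objective L = L_mod + L_dat, following the reconstructed
        equations (6)-(10); the original paper's exact formulas may differ."""
        k_a, k_n = len(adj_clusters), len(noun_clusters)
        # Model description length: for every (adjective cluster, noun cluster)
        # pair, encode the pair plus one bit for whether they collocate (eq. (7)).
        l_mod = k_a * k_n * (math.log2(k_a) + math.log2(k_n) + 1)
        # Data description length: for each element of a cluster's distribution
        # environment, point to a member word inside that cluster (eqs. (8)-(10)).
        l_dat = (sum(len(environment(c, instances, "adj")) * math.log2(len(c))
                     for c in adj_clusters)
                 + sum(len(environment(c, instances, "noun")) * math.log2(len(c))
                       for c in noun_clusters))
        return l_mod + l_dat

At the start every cluster is a singleton, so l_dat is zero and l_mod dominates; as clusters merge, l_mod shrinks while l_dat grows, which is the trade-off discussed in Section 6.2.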

5 Our Experiment

We take the words and collocations gathered in the thesaurus [6] to test our algorithm. From the thesaurus we obtain 2,569 adjectives, 4,536 nouns and 37,346 collocations between adjectives and nouns. Table 1 shows the results of using five different revisional distance settings, discussed in the next section. Because the length of this paper is limited, we only give some examples of the clusters (10 clusters for each part of speech) in Section 8.

We can see that the redundant ratio decreases markedly when the revisional distance is used, and that the result with the lowest redundant ratio corresponds to the minimum value of the objective function. By human evaluation, most clusters contain words that have similar meanings and similar distribution environments. So our algorithm proves to be effective for word clustering based on collocations.

However, the redundant ratio is still very large. The main cause is that the existing instances are too sparse, covering only 0.32% of all possible collocations. So another advantage of our algorithm is that we can acquire many new reasonable collocations that are not gathered in the thesaurus. If we add the new collocations to the initial thesaurus and execute the algorithm on the new data set, the performance has great potential to improve; this is further work that can be carried out in the future.

6 Discussions

6.1 Revisional Distance

Table 1: Results of different revisional distances

    Revisional distance          k_A    k_N    r      L
    not used                     ...    ...    ...    ...
    a) ds′ = D̄ · ln ds           ...    ...    ...    ...
    b) ds′ = ds / (1 + D̄)        ...    ...    ...    ...
    c) ds′ = ds / (1 + 1/D̄)      ...    ...    ...    ...
    d) ds′ = ds · ln(1 + D̄)      ...    ...    ...    ...

When we combine two clusters into a new cluster, their distribution environments are combined as well. This combination of clusters and of their distribution environments is very likely to generate redundant collocations that are not listed in the thesaurus, and as the word clustering processes go on there may be more and more of them. They obviously affect the accuracy of the distances between clusters, so the redundant collocations must be taken into account when the distances are calculated. The question is therefore how to revise the distance equation. Notice that the collocational degree defined above measures the collocational relationship between a cluster and its distribution environment. Obviously, under the same distance, clusters with a higher collocational degree are more similar to each other (because they have more actual collocations) than clusters with a lower collocational degree. So the collocational degree can be used to revise the distance equations. There are two problems that should be considered when we design the revisional distance equations.

The first problem is how to convert the collocational degrees of the two clusters into one collocational degree that serves as the revisional factor in the distance equations. We use the average collocational degree, marked D̄, calculated by

    D̄_A = ( |A_i|·|Φ_i|·D_{A_i} + |A_j|·|Φ_j|·D_{A_j} ) / ( (|A_i| + |A_j|) · |Φ_i ∪ Φ_j| )    (11)
    D̄_N = ( |N_i|·|Ψ_i|·D_{N_i} + |N_j|·|Ψ_j|·D_{N_j} ) / ( (|N_i| + |N_j|) · |Ψ_i ∪ Ψ_j| )    (12)

In fact D̄ is the collocational degree of the new cluster that would be obtained if the two original clusters were combined.

The second problem is that the revisional distance equations should keep coherent monotonicity with the original distance. That is, under the same average collocational degree the revisional distance should keep the same (or opposite) monotonicity as the original distance, and under the same original distance the revisional distance should keep the same (or opposite) monotonicity as the average collocational degree.

In this paper, four simple revisional distance equations are presented based on these two considerations. They are:

    a) ds′ = D̄ · ln ds
    b) ds′ = ds / (1 + D̄)
    c) ds′ = ds / (1 + 1/D̄)
    d) ds′ = ds · ln(1 + D̄)

where ds′ denotes the revisional distance and ds denotes the original distance. From the comparison of the different results (shown in Table 1) we can draw the conclusion that using the revisional distance equations increases the clustering accuracy remarkably.
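The mechanism of Section 6.1 can be illustrated with the following Python sketch, which computes the average collocational degree of a prospective merge and applies revisional formula b). Both formulas are used here as reconstructed above, so this is a sketch of the mechanism rather than the paper's precise equations; it reuses environment, collocational_degree and cluster_distance from the earlier sketches.

    def average_degree(c1, c2, instances, side):
        """Collocational degree of the hypothetical cluster obtained by merging
        c1 and c2 (eqs. (11)/(12) as reconstructed above)."""
        e1 = environment(c1, instances, side)
        e2 = environment(c2, instances, side)
        # Recover the existing-instance counts from the per-cluster degrees.
        existing = (len(c1) * len(e1) * collocational_degree(c1, e1, instances, side)
                    + len(c2) * len(e2) * collocational_degree(c2, e2, instances, side))
        possible = (len(c1) + len(c2)) * len(e1 | e2)
        return existing / possible if possible else 0.0

    def revisional_distance(c1, c2, instances, side):
        """Revisional formula b): shrink the original distance as the average
        collocational degree of the prospective merge grows."""
        d_bar = average_degree(c1, c2, instances, side)
        return cluster_distance(c1, c2, instances, side) / (1 + d_bar)

Swapping revisional_distance for cluster_distance in the one_pass sketch of Section 3 is all that is needed to obtain a revised-distance run of that sketch.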

6.2 Determination of the Objective Function's Minimum Value

The clustering algorithm terminates when the objective function is minimized, so it is very important to find the function's minimum value. After analyzing the objective function, we find that it normally declines monotonically as the clustering processes go on, until it reaches the minimum. At the beginning there are a large number of clusters, each containing only one element, so the model description length is quite large while the data description length is quite small. Because the clustering process is hierarchical, every time a combination occurs the number of clusters decreases by one and the model description length decreases with it; at the same time the number of elements in some cluster increases by one and the data description length increases with it. At first the decrement is larger than the increment, but the decrement keeps getting smaller while the increment keeps getting larger. In this way the objective function declines until it reaches its minimum value. If we continue to execute the algorithm after that point, the value of the objective function rises very fast, as is shown in Figure 1.

Figure 1. The value of the objective function over the course of the clustering process.

In practice, however, the decline is not strictly monotonic, so local optima may appear. Therefore we choose a fairly simple way to avoid being trapped by a local optimum: when there are two consecutive increases in the objective function during one clustering process, we stop that process and start another one; when two consecutive clustering processes are stopped for this same reason, we assume that we have reached the minimum value and stop the whole clustering procedure. In our future work we will try to find a better way to determine the minimum value of the objective function.
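The stopping heuristic just described can be wired around the earlier sketches as follows. This is only an illustration under the reconstructed objective: it merges with revisional_distance, tracks description_length after every merge, and omits the per-pass 20% cap of Section 3 for brevity.

    def bidirectional_clustering(adj_clusters, noun_clusters, instances):
        """Alternate noun and adjective passes (Section 3); a pass stops after two
        consecutive increases of the objective, and the whole procedure stops when
        two consecutive passes end that way (Section 6.2). Illustrative only."""
        consecutive_stopped = 0
        side = "noun"
        while consecutive_stopped < 2:
            if len(adj_clusters) <= 1 and len(noun_clusters) <= 1:
                break  # nothing left to merge on either side
            clusters = noun_clusters if side == "noun" else adj_clusters
            increases = 0
            prev = description_length(adj_clusters, noun_clusters, instances)
            while len(clusters) > 1 and increases < 2:
                # Merge the two closest clusters of the current part of speech.
                i, j = min(((i, j) for i in range(len(clusters))
                            for j in range(i + 1, len(clusters))),
                           key=lambda p: revisional_distance(clusters[p[0]], clusters[p[1]],
                                                             instances, side))
                # In-place update so adj_clusters/noun_clusters stay in sync.
                clusters[:] = ([c for k, c in enumerate(clusters) if k not in (i, j)]
                               + [clusters[i] | clusters[j]])
                value = description_length(adj_clusters, noun_clusters, instances)
                increases = increases + 1 if value > prev else 0
                prev = value
            consecutive_stopped = consecutive_stopped + 1 if increases >= 2 else 0
            side = "adj" if side == "noun" else "noun"
        return adj_clusters, noun_clusters

In a fuller implementation, the per-pass 20% cap and the incremental distance updates of Section 3 would be layered on top of this skeleton.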

7 Conclusion & Future Work

In this paper we have presented a bidirectional hierarchical clustering algorithm that simultaneously clusters Chinese adjectives and nouns based on their collocations. Our preliminary experiments show that it can distinguish different words by their distribution environments. In addition, the clustering algorithm may help to find new collocations that are not in the thesaurus. The algorithm can also be extended to other collocation models, such as verb-noun collocations.

Our future work includes: 1) because the sparsity of collocations is a main factor affecting the word clustering accuracy, we can use the clustering results to discover new data and enrich the thesaurus; 2) as there are as yet no adjustments to the hierarchical clustering results, we are considering using an iterative algorithm, such as the K-means algorithm, to optimise the clustering results.

8 Attachment (Examples)

We give 10 clusters of each part of speech, clustered by our algorithm using revisional distance formula b), as follows.

8.1 Chinese adjective clusters (10 of 383)

8.2 Chinese noun clusters

References

[1] Donghong Ji & Changning Huang. Semantic Composition Model for Chinese Noun and Adjective. Communications of COLIPS 6(1): P25-P33, 1996.
[2] Donghong Ji. Computational Research on Issues of Lexical Semantics. Postdoctoral Research Report, Tsinghua University, P4-P26, 1997.
[3] Juanzi Li et al. Two-dimensional Clustering Based on Compositional Examples. In Language Engineering, P64-P69, Tsinghua University Press, 1997.
[4] Hang Li & Naoki Abe. Clustering Words with the MDL Principle. cmp-lg/9605014 v2, 1996.
[5] Wei Xu. The Study of Syntax-Semantics Integrated Chinese Parsing. Master's Thesis in Computer Science, Tsinghua University, 1997.
[6] Wene et al. Modern Chinese Thesaurus. China People Press, 1984.
[7] Zhaoqi Bian et al. Pattern Recognition. Tsinghua University Press, 1997.