Data & Knowledge Engineering

Size: px

Start display at page:

Download "Data & Knowledge Engineering"

Cameron Welch
5 years ago
Views:

Data & Knowledge Engneerng 9 (4) 77 89 Contents lsts avalable at ScenceDrect Data & Knowledge Engneerng journal homepage: www.elsever.

Eastern Fnland, 8, Fnland artcle nfo abstract Artcle hstory: Receved 5 September Receved n revsed form May 4 Accepted July 4 Avalable onlne 7 July 4 Keywords: Cluster valdty Keywords categorzaton

1 Data & Knowledge Engneerng 9 (4) Contents lsts avalable at ScenceDrect Data & Knowledge Engneerng journal homepage: Edtoral WB-ndex: A sum-of-squares based ndex for cluster valdty Qnpe Zhao a,, Pas Fränt b a School of Software Engneerng, Tongj Unversty, 84, Chna b School of Computng, Unversty of Eastern Fnland, 8, Fnland artcle nfo abstract Artcle hstory: Receved 5 September Receved n revsed form May 4 Accepted July 4 Avalable onlne 7 July 4 Keywords: Cluster valdty Keywords categorzaton Short text mnng Clusterng Classfcaton and assocaton rules Determnng the s an mportant part of cluster valdty that has been wdely studed n cluster analyss. Sum-of-squares based ndces show promsng propertes n terms of determnng the. However, knee pont detecton s often requred because most ndces show monotoncty wth ncreasng. Therefore, ndces wth a clear mnmum or maxmum value are preferred. The am of ths paper s to revst a sum-ofsquares based ndex called the WB-ndex that has a mnmum value as the determned number of clusters. We shed lght on the relaton between the WB-ndex and two popular ndces whch are the Calnsk Harabasz and the Xu-ndex. Accordng to a theoretcal comparson, the Calnsk Harabasz ndex s shown to be affected by the data sze and level of data overlap. The Xu-ndex s close to the WB-ndex theoretcally, however, t does not work well when the dmenson of the data s greater than two. Here, we conduct a more thorough comparson of nternal ndces and provde a summary of the expermental performance of dfferent ndces. Furthermore, we ntroduce the sum-of-squares based ndces nto automatc keyword categorzaton, where the ndces are specally defned for determnng the. 4 Elsever B.V. All rghts reserved.. Introducton Dfferent clusterng algorthms wth varous cost functons produce dfferent solutons, and there s no sngle best clusterng algorthm for all possble data sets. A prmary task s therefore to select the best possble clusterng for a gven data set. Cluster valdty provdes a way of valdatng the qualty of clusterng algorthms and a means of dscoverng the natural structure of data sets. For most clusterng algorthms, the s a man parameter. However, the settng of the parameter s not always known and hence determnng the s essental. Cluster valdty measures can be used for determnng the. In fact, the clusterng procedure and cluster valdty are much lke the chcken-and-the-egg problem where knowng how to defne a good cluster valdty ndex requres understandng the data and the clusterng algorthm, but the clusterng algorthm s one of the prncpal tools used to understand the data [] wthout a pror nformaton. Therefore, the study of the cluster valdty s as mportant as that of the clusterng algorthms. The nternal valdty ndex, as one of the categores of cluster valdty, evaluates the results of a clusterng algorthm n terms of quanttes that nvolve the vectors of the data set themselves. A good clusterng algorthm generates clusters wth ntra-cluster homogenety, nter-cluster separaton and connectedness. One category of nternal valdty ndces s based on these propertes. Another category s based on whether the nternal ndces are appled to hard (crsp) or soft (fuzzy) clusterng. A revew of fuzzy cluster valdty ndces s avalable n []. Correspondng author. E-mal address: qnpezhao@tongj.edu.cn (Q. Zhao) X/ 4 Elsever B.V. All rghts reserved.

2 78 Q. Zhao, P. Fränt / Data & Knowledge Engneerng 9 (4) There are many examples of nternal valdty ndces based on the compactness wthn a cluster and the separaton between clusters. The sum-of-squares based ndces are founded on sum-of-squares wthn cluster (SSW) and/or sum-of-squares between clusters (SSB) values, for example, Ball and Hall [3], Hartgan [4], Calnsk and Harabasz (CH) [5] and Xu [6]. WB-ndex [7] s a sum-of-squares based ndex where a mnmum value can be attaned as the. Examples of other popular ndces are gven n [8 ]. Dunn-type ndces [8] are based on the nter-cluster dstance and dameter of a cluster hypersphere. A Dunn ndex s senstve to outlers, whereas the Daves and Bouldn ndex s defned by the average of cluster evaluaton measures for all the clusters. Xe Ben [] adopts the mnmum dstance between any par of clusters and the global average of dstances between each data object and clusters as nter- and ntra-cluster dstances, respectvely. S_Dbw [] replaces the total separaton wth the densty of data objects n the mddle of two clusters and omts the weghtng factor. A model selecton method called the Bayesan nformaton crteron (BIC) [] has been appled n model-based clusterng, but t can be adapted to partton-based clusterng [3], too. To reduce computaton on dstance calculatons n the slhouette coeffcent (SC), an approach s proposed n [4] to compute the SC quckly by decreasng the number of addton operatons. Mllgan and Cooper [5] have presented a comparson of over 3 nternal valdty ndces as stoppng crtera n herarchcal clusterng algorthms. Dmtradou et al. [6] conducted a comparson study of 5 valdty ndces on bnary data. An organzed study of 6 external valdaton measures for k-means clusterng s gven n [7,8]. A survey of cluster valdaton ndces n the analyss of post-genomc and other data famlarzes researchers wth some of the fundamental concepts behnd cluster valdaton technques [9]. Exstng studes of the comparsons of nternal valdty ndces are prmarly based on one certan clusterng algorthm, for example, herarchcal clusterng algorthm. In ths paper, we study ndces on two clusterng algorthms, the k-means and Random Swap (RS) clusterng algorthms [,]. Conventonal k-means s known to have local maxma problem because of ts ntalzaton. The RS algorthm mproves on the conventonal k-means, whch has been shown to obtan a more stable result. We chose these two clusterng algorthms to study the performance of the ndces dependng on f the clusterng results are stable or not. Besdes a thorough comparson among the ndces, we also revst the WB-ndex. We attempt to analyze the dfference and relaton between the WB-ndex and two smlar ndces (CH and Xu-ndex) theoretcally. For two-dmensonal data, the three ndces seem to work smlarly. However, accordng to the theoretcal analyss, t s shown that the CH ndex s more affected by the data sze N and more senstve to the degree of overlappng of data sets than the WB-ndex. For low dmensonal data (.e., D ), the Xu-ndex s remarkably close to the WB-ndex. However, t rarely detects a clear mnmum value for hgh dmensonal data (.e., D N ). Cluster valdty ndces are dscussed n many applcatons, such as mage segmentaton [], transactonal data [3],post-genomc data analyss [9] and text clusterng [4,5]. Wth more and more short text snppets beng generated, such as search query and results, mcroblogs, tweets and other types of comments n socal network, there s a growng nterest n mnng the short text by clusterng technques [6 8]. Short texts typcally have lmted length, pervasve abbrevatons and coned acronyms. In most cases, there s no contextual nformaton. Therefore, tradtonal clusterng technques used for mnng documents are not sutable for short text clusterng. Automatc keywords categorzaton s to cluster short texts of keywords wth a determned number of clusters. For automatc keywords categorzaton, the prmary ssues are the semantc smlarty measure for keywords and the cluster valdty n the clusterng method. Snce short text snppets lack contextual nformaton, external knowledge s ntroduced to enrch the nformaton contaned wthn short texts (for example, Wkpeda data [9,3], Google search results [3] and lexcal databases [3]). The obtaned through cluster valdty measures s a parameter showng the match between the clusterng results and users' expectaton. There s very lttle research on cluster valdty measures employng n automatc keywords categorzaton. We ntroduce a procedure for automatc keywords categorzaton. Based on the compactness wthn a class and separaton between classes, varants of the sum-of-squares based ndces that are the CH ndex, the Xu-ndex and the WB-ndex are ntroduced and compared n ths paper.. Background.. Prelmnares Determnng the (M) reles on the cluster valdty ndces. Meanwhle, the determned s used to valdate a valdty ndex. Thus, the problem of determnng the s commonly studed as a key problem n cluster valdty. In order to determne the optmal M, other parameters are fxed and the parameter M s optmzed by the valdty crtera. A procedure for determnng the optmal s gven as follows. Gven the data set X, a specfc clusterng algorthm and a fxed range of [M mn, M max ], the basc procedure nvolves:. Repeat a clusterng algorthm successvely for the M from predefned values of M mn to M max.. Obtan the clusterng results (parttons P and centrods C) and calculate the ndex value for each. 3. Select the M as M for whch the partton provdes the best result accordng to some crtera (mnmum, maxmum or knee pont). 4. Compare the detected (M ) wth external nformaton f avalable. A summary of commonly used nternal valdty ndces appears n Table. Symbol X ={x,,x N } represents the data set wth ND-dmensonal ponts, and X ¼ N x =N s the center of the entre data set. The centrods of clusters are C ={c,,c M }, where c s the th cluster and M s the. The log-lkelhood n the BIC s defned as L, and n Xe Ben u k denotes the membershp of the th pont to the kth cluster.

3 Q. Zhao, P. Fränt / Data & Knowledge Engneerng 9 (4) Table Formulas for nternal ndces. Name SSW SSB Calnsk Harabasz [5] Hartgan [4] Formula N SSW ¼ x C p M SSB ¼ n c X SSB= M CH ¼ ð Þ SSW= ðn MÞ H M ¼ SSWM SSW Mþ ðn M Þ or: H M = log(ssb M /SSW M ) Krzanowsk La [33] dff M =(M ) /D SSW M M /D SSW M KL M = dff M / dff M + Ball & Hall [3] BH M = SSW M /M rffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff Xu-ndex [6] Xu ¼ Dlog SSW M = DN þ logm Dunn's ndex [8] dc ; c j ¼ mnx c;y c j kx yk damðc k Þ ¼ max x;y ck kx yk Dunn ¼ mnm mnm j¼þ dc ; c j max M k¼ dam ð c kþ Daves&Bouldn [9] SC [4] BIC [34] Xe Ben [] R j ¼ S þ S j d j n ; j where: d j = c c j, S ¼ n j¼ M DBI ¼ M R nm ax ð Þ ¼ nm x x j x j¼; j ;x j cm bx ð Þ ¼ mn kc t c m k t m x Ct sx ð Þ ¼ bx ð Þ aðx Þ maxðaðx Þ; bx ð ÞÞ N SC ¼ N sx ð Þ M BIC ¼ L N MDþ ð Þ logðn Þ N M u kk x C k k k¼ XB ¼ n o N mn kc t s t C s k x j c and, R ¼ max R j; ¼ ; ; M j¼; ;M.. Knee pont detecton The knee pont potentally ndcates the optmal, but the act of locatng the knee pont n the valdty ndex curve has not been well studed. The maxmum and mnmum values are the most straghtforward knee ponts. Indces wth clear maxmum and mnmum value are preferred. Some ndces, for example SSW and log-lkelhood, are monotonous and as such there s no clear knee pont. Other ndces mght have several local maxmum or mnmum values due to the selecton of M max. The second successve dfference between ndex values (Fg. ) can be used for knee pont detecton, although ths approach only reflects local nformaton. Other methods, such as the L-method [35], have been proposed to fnd the knee pont of the curve by examnng the boundary between the par of straght lnes that most closely ft the curve n herarchcal/segmentaton clusterng. More general methods should be used based on the global trend of the curve. 3. WB-ndex versus the Calnsk Harabasz ndex and the Xu-ndex A SSW cluster s a commonly used measure of compactness, whle a SSB cluster s a measure of separaton. Sum-of-squares-based ndces (see Table ) are manly functons of M, N, D, SSW and SSB, and they usually have a so-called elbow phenomenon (see Fg. ), where knee pont detecton s requred. The second successve dfference s commonly used for knee pont detecton. The WB-ndex [7] s defned as: WBðMÞ ¼ M SSW=SSB ðþ We revst the WB-ndex to determne the dfference among the WB-ndex and the CH ndex and the Xu-ndex based on theoretcal analyses. The CH ndex and Xu-ndex are two commonly used sum-of-squares based ndces, whch work smlarly as the WBndex.

4 8 Q. Zhao, P. Fränt / Data & Knowledge Engneerng 9 (4) x 4 BIC of S S4 5 Successve Dfference BIC value s 7.5 s local maxmas s3 s Dfference Value of BIC s s s3 s Fg.. Number of clusters versus the BIC on S S4 (see Secton 5) and ther second successve dfferences. Let us assume that cluster has n ponts and there s a ground-truth average pont for cluster, whch s x.(seefg. ). The wthncluster varance for cluster (W ) can then be reformulated as: W ¼ n kx c k ; ½; MŠ ¼ n X c ; ½; MŠ SSW SSB ðmþ ¼ X M W X M W ¼ þ W þ þ W M N B þ B þ þ B M ðþ Wth an ncrement of one cluster, the dfference of SSW/SSB can be wrtten as: Δ SSW SSB ðmþ ¼ SSW SSB ðm SSW Þ SSB ðmþ ¼ ¼ W ¼ B M W W þ W M M X þ B M þ B M W W M X M B M þ B M þ X M W þ W M ð3þ Snce M W s monotoncally decreasng and M s monotoncally ncreasng wth respect to ncreasng M, ΔSSW/SSB(M) then monotonously decreases as well. Ths result ndcates that the decrement of SSW/SSB from cluster sze M- to M s larger than that from M to M +..e., the decrement decreases wth ncreasng M (see Fg. 3). When the decrement degree of Δ(SSW/SSB) s larger than the lnear ncrement of M at the begnnng, WB decreases untl M M mn. A specal case s that Ws ncreasng for all M when M M mn. Thus, there exsts a value M such that WB(M) WB(M )form M and WB(M) b WB(M )form N M. The optmal s determned by the mnmum value of the WB-ndex. The result of the WB-ndex for data S S4 (see Secton 5 for the data) s shown n Fg. 4. Although SSW decreases monotoncally wth ncreasng M, WB-ndex has a U-shape wth clear mnma at M =5. Based on the facts presented above, we can establsh the relatonshp of the WB-ndex wth the CH ndex and the Xu-ndex.

5 Q. Zhao, P. Fränt / Data & Knowledge Engneerng 9 (4) Fg.. Calculaton of W and. For the CH ndex and the WB-ndex, we have WB CH ¼ MN M ð Þ M ¼ þ ðn MÞ ð4þ M Snce M N, we obtan an upper bound that WB CH (N ). Therefore, WB ðn Þ CH ð5þ Based on Eq. (5), the WB-ndex shows a smlar trend as the nverse of the CH ndex. However, the CH ndex s affected by the data sze N. When N s large, the factor M N M plays a more mportant role than SSW/SSn the whole ndex. Consderng data wth overlappng clusters, for example S S4, whch have ncreasng degree of overlappng, the CH ndex has dffculty dealng wth hghly overlappng data such as S3 and S4. For hgher overlap, SSW/SSB contans less nformaton about M. The Xu-ndex can be wrtten as: Xu ¼ log MSSW ð ÞD= DN D= ð6þ. S S.5 SSW/SSB..5 Δ SSW/SSB(3) Δ SSW/SSB() (M) Fg. 3. Values of SSW/SSB as a functon of M for clusterng results on data S and S. ΔSSW/SSs decreasng when M s lnearly ncreasng.

6 8 Q. Zhao, P. Fränt / Data & Knowledge Engneerng 9 (4) WB Index S S S3 S Fg. 4. The WB-ndex versus the for S S4. Then, e Xu ¼ MSSW ð ÞD= M SSW SSWD= DN D= ¼ DN D= ð7þ WB SSB SSWD= ¼ DN D= WB SSB SSW D= The relaton between the Xu-ndex and the WB-ndex depends on the dmenson D. e Xu ¼ WB f ðssb; SSWÞ; D e Xu ¼ WB f ðssb; SSWÞ; D N ð8þ where f s a monotoncally ncreasng functon domnated by SSB and f s domnated by SSW, whch s monotoncally decreasng. As shown n Fg. 5, SSB has smaller effect than SSW. The nformaton about M orgnates manly from M SSW. Therefore, the Xu-ndex s close to the WB-ndex when D, but for D N, the Xu-ndex s domnated by SSW, whch rarely fnds a clear mnmum value.. M*SSW SSB WB ndex Fg. 5. Plot of M SSW and SSB and the WB-ndex for data S. SSW, SSB and the WB-ndex are normalzed to one.

7 Q. Zhao, P. Fränt / Data & Knowledge Engneerng 9 (4) Fg. 6. A procedure for automatc keywords categorzaton. 4. Automatc keywords categorzaton Gven a lst of N keywords T ={T,T,,T N }, keywords categorzaton ams at clusterng the keywords nto groups. The keywords can be obtaned from several sources such as moves, servces, and tags. The generated clusters are defned as C ={c,c,,c M }. A procedure for the automatc keywords categorzaton s ntroduced n Fg. 6. There are two man aspects to consder n the procedure. The frst aspect s the smlarty measure between two keywords to obtan the smlarty matrx. Wth the smlarty matrx, clusterng algorthms such as herarchcal or spectral clusterng can be employed to obtan the clusters. The second aspect s how to determne the. Calculatng the semantc smlarty between two keywords drectly s an open queston. We use a lexcal database WordNet n ths paper. WordNet provdes a herarchy for thesaurus and a part of the WordNet herarchy by a web nterface WordVs s shown n Fg. 7. The smlarty score between two keywords can then be derved from the herarchy. For example, the smlarty of eatery and grll from Jang and Conrath [36] s.36. We defned a modfed WB-ndex n [37] for determnng the number of groups n keywords categorzaton. In ths paper, we expand our work on the CH ndex and the Xu-ndex also. The basc elements of the sum-of-squares based ndces are defned as: SSWðMÞ ¼ max t SSBðMÞ ¼ XM X M t¼ s N t max ; j mn ; j JC T ; T j T T j c t JC T ; T j T c t ;T j c s þ X jc t j¼ In SSW(M), T and T j are the th and jth keywords n cluster c. Snce t s not possble to calculate the smlarty wth only one keyword n a cluster, we sum up the wth sngle a keyword accordng to j ct j¼. Smlarly, T and T j n SSB(M) are the th and jth keywords n cluster c t and cluster c s respectvely, M s the. Therefore, the three sum-of-squares based are defned: ð9þ WB ndex ¼ M SSWðMÞ SSBðMÞ SSB= M CH ¼ ð Þ SSW= ðn M qffffffffffffffffffffffffffffffffffff Þ Xu ndex ¼ log SSW=N þ logm ðþ 5. Experments The data n the experment nclude shaped, Gaussan-lke and real data. The propertes of the data, such as name, data sze, dmenson and the for reference are summarzed n Table. The nternal ndces based on k-means and RS [] are mplemented n the C programmng language Comparsons of ndces The valdty ndces are tested wth the k-means and the RS wth repettons. The results are examned n lght p of the performance of the ndces wth dfferent clusterng algorthms usng both artfcal and real data, and M mn = and M max ¼ ffffffff N. The values of ndces n the test are computed for all clusters n [M mn,m max ]. The determned corresponds to the mnmum (Krzanowsk La, Xu, Wb, DBI, Xe Ben) or maxmum value (CH, Dunn, SC, SCI and BIC) of the ndces. For some of the ndces (Ball & Hall, Hartgan), the mnmum or maxmum value of the second successve dfference s used as a knee pont detecton method

8 84 Q. Zhao, P. Fränt / Data & Knowledge Engneerng 9 (4) Fg. 7. A part of the WordNet taxonomy vsualzed through the WordVs web-nterface. We plot the performance of valdty ndces on DBI, Xe Ben and the WB-ndex wth the k-means and the RS. As shown n Fgs. 8 and 9, the valdty ndces wth the k-means rarely acheve the correct. However, there are clear mnma for ndces wth the RS. Furthermore, the ndces have hgher varance wth the k-means than the RS, so t s necessary to choose a stable algorthm n cluster valdty. Indexes usng the mn or max functon such as DBI and Xe Ben have hgh varance among repettons. On the other hand, sum-of-squares ndces such as the WB-ndex are more stable. The numbers n Tables 3, 4 and 5 are the mean values of the determned usng dfferent valdty ndces wth the RS. The average values are obtaned from repettons, except for brch. Restrcted by the runnng tme, we calculate the ndces once for brch. For unbalanced and shaped data (e.g., Aggregaton), as well as for data wth denstes (e.g., Compound), almost all of the ndces fal (see n Table 3 that the ndces obtan % correctly determned ). It s nterestng to vew the performance of the ndces for these data sets through the lens of densty-based clusterng or spectral clusterng nstead of RS, whch can be studed n Table Attrbutes of the data sets that have been used. Name Data sze Dmenson # of Clusters Shaped data sets Touchng 73 pathbased 3 3 Compound Aggregaton Gaussan-lke data sets S S4 5 5 R5 6 5 D brch, Real data sets Irs Wne Control Image Wdbc Yeast 484 8

9 Q. Zhao, P. Fränt / Data & Knowledge Engneerng 9 (4) DBI S S.5.4 S3.5 S Xe Ben WB ndex Fg. 8. Valdty ndces ncludng DBI, Xe Ben and WB-ndex wth the k-means are repeated tmes on data sets S S4. The sold lne s the mean value and the dashed lnes are the mnmum and maxmum values. the future. DBI and SC work very well for both Touchng and pathbased, whch contan connected clusters. For ths type of data sets, the performance of the WB-ndex s close to that of the Xu-ndex. However, the determned by the WB-ndex s larger than those from the Xu-ndex. Among the three ndces, the obtaned from the CH ndex s closest to the ground-truth value. The clusterng algorthms such as k-means and RS are sutable for Gaussan-lke data n general. Therefore, the performance of ndces on Gaussan-lke data s much better than that on shaped data (see Table 4). For large data sets such as brch, most of the valdty ndces produce cluster numbers close to ; however, none of them gves exactly as the. The explanaton may be that the data set brch contans the clusters n a regular grd structure. Among the sum-of-squares ndces, the Xu ndex and the WB-ndex have smlar performances. The Calnsk Harabsz, however, has worse results than the WB-ndex and the.5.4 DBI S S.4 S3.4 S Xe Ben WB ndex Fg. 9. Valdty ndces ncludng DBI, Xe Ben and WB-ndex wth the RS are repeated tmes on data sets S S4. The sold lne s the mean value and the dashed lnes are the mnmum and maxmum values.

10 86 Q. Zhao, P. Fränt / Data & Knowledge Engneerng 9 (4) Table 3 Mean value of the determned by varants of cluster valdty ndces wth RS on shape data. Touchng pathbased Compound Aggregaton M M max Ball & Hall CH Hartgan Krzanowsk La Xu-ndex Wndex Dunn DBI SC SCI Xe Ben BIC Table 4 Mean values of the determned by varants of cluster valdty ndces wth RS on Gaussan-lke data. R5 D3 S S S3 S4 brch M M max Ball & Hall CH Hartgan Krzanowsk La Xu-ndex Wndex Dunn DBI SC SCI Xe Ben BIC Xu-ndex. The CH ndex s more senstve to the degree of overlappng of data sets, such as S3 and S4. The Xu-ndex, WB-ndex and SC have good performance for ths type of data set compared wth other ndces. For real data, the ndces gve more dverse results; ther performance depends strongly on the data sets. For nstance, the WBndex s the only ndex workng for Irs, but t does not work for wne, control, mage and yeast see Table 5. The numbers of clusters determned by the WB-ndex for real data sets are smaller than those by the Xu-ndex n general. Comparng the WB-ndex and the CH ndex, the WB-ndex s smlar to the CH ndex except for the Irs data. From the expermental results, t can be concluded that the sum-of-squares based ndces are mostly ft for Gaussan-lke data sets. For shape-data and real data, the performance of the ndces also depends on the clusterng algorthm, whether the structure of data sets s well detected by the algorthm. The WB-ndex, compared to the CH ndex and the Xu-ndex works qute smlarly as the other Table 5 Mean values of the determned by varants of cluster valdty ndces wth RS on real data. Irs Wne Control Image Wdbc Yeast M M max Ball & Hall Calnsk Harabsz Hartgan Krzanowsk La Xu-ndex Wndex Dunn DBI SC SCI Xe Ben BIC

Q. Zhao, P. Fränt / Data & Knowledge Engneerng 9 (4) 77 89 87 Fg.. The three sum-of-squares based ndces on two manually generated data sets.

11 Q. Zhao, P. Fränt / Data & Knowledge Engneerng 9 (4) Fg.. The three sum-of-squares based ndces on two manually generated data sets. The fgures n the left column are the results for data and the fgures n the rght column are the results for data. two ndces. However, the WB-ndex works slghtly better among the three ndces n general. It detects the correct result for eght out of 7 data sets and provdes the most correctly determned among the sum-of-squares ndces and other ndces (see Tables 3, 4 and 5). The CH ndex s more affected by the data sze N and s more senstve to the degree of overlappng of data sets than the WB-ndex. For two-dmensonal data, the Xu-ndex s remarkably close to the WB-ndex. However, t rarely detects clear mnmum value for hgh dmensonal, real-type data. 5.. Comparson of ndces for automatc keywords categorzaton Three data sets are nvolved n the experment. Two data sets (see Fg. ) are aggregated manually, whle four are humanly judged as the proper for both data. The other data s collected from a project MOPSI, 4 whch ncludes varous locatontagged data such as servces, photos and routes. Each servce ncludes a set of keywords to descrbe what t s. In all, 378 texts were 4

12 88 Q. Zhao, P. Fränt / Data & Knowledge Engneerng 9 (4) CH ndex maxmum.5 Xu ndex WB ndex mnmum 3.. mnmum Fg.. The V.S. the three sum-of-squares based. collected after smple cleanng and 36 of these texts are tested n the experments. The ground truth for the data sets are based on human judgment. Even human judgment often dffers from person to person n terms of the categorzatons and the ; the clusterng algorthm and valdty ndces can suggest a potentally approprate categorzaton and provde a possble as a gudelne. We studed the sum-of-squares based ndces, CH ndex, Xu-ndex and WB-ndex for automatc keywords categorzaton on the data sets. To obtan the semantc smlarty among keywords, we used JAVA API provded by WordNet 3.. Herarchcal clusterng algorthm s used n the test. The determned by three sum-of-squares based ndces on two manually generated data sets are shown n Fg.. Four s determned by the CH ndex and the WB-ndex for data (left column). For the Xu-ndex, the results four and seven are close. However, seven s detected wth a mnmum value. Smlarly, four s determned by the CH ndex and WB-ndex on data (rght column). The Xu-ndex detects fve as the wth a mnmum value. The correctness of the categorzaton on both data sets s judged by human and we beleve that the categorzaton of four as the number of groups mostly matches users' expectaton. The three ndces are also compared wth the mops data n Fg.. There s a clear maxmum value of the CH ndex at. The values of the Xu-ndex on the from 7 to are exactly the same. Therefore, t s dffcult to recognze whch number s the mnmum value for the potental number of groups. For the WB-ndex, the mnmum value s detected when the number of clusters s. In general, the performance of the ndces on the mops data s smlar as that on the manually generated data. Comparng the result from the ndces to the ground truth from human judgment, the ndces cannot provde the exactly correct number of clusters; however, they are able to provde suggestons for users. In some real applcatons, the suggestons mght be helpful to the clusterng algorthms. 6. Conclusons In ths paper, we revst a sum-of-squares based ndex, the WB-ndex. Ths ndex s desgned to reach ts mnmum value when the approprate s acheved. There are two smlar sum-of-squares based ndces, whch are the CH and the Xu-ndex. To study the dfference among the three ndces, we perform an analyss of the relaton between the WB-ndex and the other two ndces. It s shown that the CH ndex s affected by the data sze (N) and hgh levels of cluster overlap. For hgh dmensonal data where D N, the Xu-ndex does not work as well as for data wth D. Furthermore, a systematc experment on nternal wth two clusterng algorthms and varants of data sets was conducted to study the effect of clusterng algorthms and data sets on the ndces. The s used for valdatng the result. Accordng to the expermental result, a good and stable clusterng algorthm should be selected for obtanng correct number of clusters from the ndces. Sum-of-squares based ndces work well for Gaussan-type data, although most of them cannot provde a global mnmum or maxmum pont for the correct. The WB-ndex works slghtly better than the other two ndces. The sum-of-squares ndces (e.g., the Wndex) are also more stable than ndces employng mn max functons (e.g., DBI). In addton to the comparsons of ndces, we ntroduce a procedure for performng automatc keywords categorzaton, where texts wth multple keywords are consdered and extensons of three sum-of-squares based ndces are employed for determnng the number of groups. The ndces are compared n the experments and they are shown to be vald for automatc keywords categorzaton. Acknowledgments One of the authors s sponsored by the Fundamental Research Funds for the Central Unverstes and the Natonal Natural Scence Foundaton Commttee of Chna under contract no. 638.

Q. Zhao, P. Fränt / Data & Knowledge Engneerng 9 (4) 77 89 89 References [] R. Caruana, M. Elhawary, N. Nguyen, C. Smth, Meta clusterng, Sxth Int. Conf. on Data Mnng (ICDM'6), 6, pp. 7 8. [] W.

Hall, ISODATA, a novel method of data analyss and pattern classfcaton, Menlo Park, Calf, Stanford Research Insttute, 965. [4] J. Hartgan, Clusterng algorthms, John Wley & Sons, Inc.

13 Q. Zhao, P. Fränt / Data & Knowledge Engneerng 9 (4) References [] R. Caruana, M. Elhawary, N. Nguyen, C. Smth, Meta clusterng, Sxth Int. Conf. on Data Mnng (ICDM'6), 6, pp [] W. Wang, Y. Zhang, On fuzzy cluster valdty ndces, Fuzzy Sets Syst. 58 (9) (7) [3] G. Ball, D. Hall, ISODATA, a novel method of data analyss and pattern classfcaton, Menlo Park, Calf, Stanford Research Insttute, 965. [4] J. Hartgan, Clusterng algorthms, John Wley & Sons, Inc., New York, NY, USA, 975. [5] T. Calnsk, J. Harabasz, A dendrte method for cluster analyss, Commun. Stat. 3 (974) 7. [6] L. Xu, Bayesan yng-yang machne, clusterng and, Pattern Recogn. Lett. 8 (997) [7] Q. Zhao, M. Xu, P. Fränt, Sum-of-square based cluster valdty ndex and sgnfcance analyss, Proc. of the 7th Int. Conf. on Adaptve and Natural Computng Algorthms, 9, pp [8] J. Dunn, Well separated clusters and optmal fuzzy parttons, J. Cybern. 4 (974) [9] D. Daves, D. Bouldn, Cluster separaton measure, IEEE Trans. Pattern Anal. Mach. Intell. () (979) [] X.L. Xe, G. Ben, A valdty measure for fuzzy clusterng, IEEE Trans. Pattern Anal. Mach. Intell. 3 (8) (99) [] M. Halkd, M. Vazrganns, Clusterng valdty assessment: fndng the optmal parttonng of a data set, Proc. of the IEEE Int. Conf. on Data Mnng (ICDM'),, pp [] S. Stll, W. Balek, How many clusters? An nformaton theoretc perspectve, Neural Comput. 6 () (4) [3] D. Pelleg, A. Moore, X-means: extendng K-means wth effcent estmaton of the, Proc. of the 7th Int. Conf. on Machne Learnng,, pp [4] M. Zoub, M. Raw, An effcent approach for computng slhouette coeffcents, J. Comput. Sc. 4 (3) (8) [5] G. Mllgan, M. Cooper, An examnaton of procedures for determnng the n a data set, Psychometrka 5 (985) [6] E. Dmtradou, S. Dolncar, A. Wengassel, An examnaton of ndexes for determnng the n bnary data sets, Psychometrka 67 () () [7] J. Wu, H. Xong, J. Chen, Adaptng the rght measures for k-means clusterng, 5th ACM SIGKDD Conference on Knowledge Dscovery and Data Mnng (KDD'9), 9, pp [8] J. Wu, J. Chen, H. Xong, M. Xe, External valdaton measures for k-means clusterng: a data dstrbuton perspectve, Expert Syst. Appl. 36 (3) (9) [9] J. Handl, J. Knowles, D. Kell, Computatonal cluster valdaton n post-genomc data analyss, Bonformatcs (5) (5) 3 3. [] P. Fränt, J. Kvjärv, Randomsed local search algorthm for the clusterng problem, Pattern. Anal. Applc. 3 (4) () [] P. Fränt, M. Tuononen, O. Vrmajok, Determnstc and randomzed local search algorthms for clusterng, IEEE Int. Conf. on Multmeda and Expo, [] H.L. Captane, C. Frelcot, On selectng an optmal for color mage segmentaton, Int. Conf. on, Pattern Recognton (ICPR'),, pp [3] H. Yan, K. Chen, L. Lu, J. Bae, Determnng the best k for clusterng transactonal datasets: a coverage densty-based approach, Data Knowl. Eng. 68 (9) [4] G. Mecca, S. Raunch, A. Pappalardo, A new algorthm for clusterng search results, Data Knowl. Eng. 6 (7) [5] D. Ingaramo, D. Pnto, P. Rosso, M. Errecalde, Evaluaton of nternal valdty measures n short-text corpora, CICLng'8, [6] X. N, X. Quan, Z. Lu, W. Lu, B. Hua, Short text clusterng by fndng core terms, Knowl. Inf. Syst. 7 (3) () [7] X. Yan, J. Guo, S. Lu, X. Cheng, Y. Wang, Clusterng short text usng Ncut-weghted non-negatve matrx factorzaton, Proc. of the st ACM Intl. Conf. on Informaton and Knowledge Management,, pp [8] P. Shrestha, C. Jacqun, B. Dalle, Clusterng short text and ts evaluaton, Computatonal Lngustcs and Intellgent Text Processng,, pp [9] S. Banerjee, K. Ramanathan, A. Gupta, Clusterng short texts usng Wkpeda, Proc. of the 3th Annual Intl. ACM SIGIR Conf. on Research and Development n Informaton Retreval, 7, pp [3] X. Hu, X. Zhang, C. Lu, E. Park, X.Zhou, Explotng Wkpeda as external knowledge for document clusterng, Proc. of the 5th ACM SIGKDD Intl. Conf. on Knowledge Dscovery and Data Mnng, 9, pp [3] D. Bollegala, Y. Matsuo, M. Ishzuka, Measurng semantc smlarty between words usng web search engnes, Proc. of the 6th Intl. Conf. on World Wde Web, 7, pp [3] C. Chen, F. Tseng, T. Lang, An ntegraton of WordNet and fuzzy assocaton rule mnng for mult-label document clusterng, Data Knowl. Eng. 69 () 8 6. [33] W. Krzanowsk, Y. La, A crteron for determnng the number of groups n a data set usng sum-of-squares clusterng, Bometrcs 44 () (988) [34] Q. Zhao, M. Xu, P. Fränt, Knee Pont Detecton on Bayesan Informaton, Crteron, ICTAI'8, [35] S. Salvador, P. Chan, Determnng the /segments n herarchcal clusterng/segmentaton algorthms, Proc. of the 6th IEEE Int. Conf. on Tools wth Artfcal Intellgence, 4, pp [36] J.J. Jang, D.W. Conrath, Semantc smlarty based on corpus statstcs and lexcal taxonomy, Proc. of Int. Conf. on Research n Computatonal Lngustcs, 997, pp [37] Q. Zhao, M. Rezae, H. Chen, P. Fränt, Keyword clusterng for automatc categorzaton, IEEE Int. Conf. on Pattern Recognton,, pp Qnpe Zhao receved the B.Sc. degree n Automaton Technology from X'dan Unversty, X'an, Chna n 4. She receved the M.Sc. degree n Pattern Recognton and Image Processng n Shangha Jaotong Unversty, Shangha, Chna n 7. She obtaned the Ph.D. degree n Computer Scence at the Unversty of Eastern Fnland n. Her current research s focused on clusterng algorthm and multmeda processng. Pas Fränt receved hs MSc and Ph.D. degrees from the Unversty of Turku, 99 and 994 n Scence. Snce, he has been a professor of Computer Scence at the Unversty of Eastern Fnland. He has publshed 64 journals and 4 peer revew conference papers, ncludng 3 IEEE transacton papers. Hs current research nterests nclude clusterng algorthms and locaton-aware applcatons. He has supervsed 9 Ph.D. theses and s the head of the East Fnland doctoral program n Computer Scence & Engneerng (ECSE).

Subspace clustering. Clustering. Fundamental to all clustering techniques is the choice of distance measure between data points;

Subspace clustering. Clustering. Fundamental to all clustering techniques is the choice of distance measure between data points; Subspace clusterng Clusterng Fundamental to all clusterng technques s the choce of dstance measure between data ponts; D q ( ) ( ) 2 x x = x x, j k = 1 k jk Squared Eucldean dstance Assumpton: All features